Some suggested rules for assessing the available evidence in women’s health.

Like all health practitioners, those in the area of women’s health are obligated to use an evidence-based approach to clinical care. However, just what is the relevant evidence and how and by whom should it be interpreted? Few practising obstetricians have not experienced frustration when following guidelines that derive from blind worship at the holy shrine of the randomised-controlled trial (RCT). Clinicians in other disciplines have had similar experiences1,2 and such a restrictive interpretation of evidence is increasingly discredited.3-5 With changes in the clinical workforce, a consortium of government, clinical networks and hospital administrators demonstrates an imperative to develop more and more clinical protocols and guidelines so that patient care may become driven by slavish adherence to a ‘recipe’ rather than by the thoughtful application of evidence by individual clinicians. It is therefore timely to ask just what evidence these bodies should be using and what directions should be pursued for the future accumulation of evidence.

Rule 1. The RCT: often not the best evidence

The RCT has pride of place among levels of evidence as defined by the guideline development group from the National Health and Medical Research Council (NHMRC).6 The Royal College of Obstetricians and Gynaecologists (RCOG), in its ‘green-top guidelines’, emphasises the privileged position of the RCT: a ‘grade A recommendation’ can only come from an RCT. Is this position in the hierarchy of evidence justified?

Rule 1.1. The under-powered RCT: a source of damaging false-negative conclusions

Where an RCT reports a negative result, it may well be that no clinically important difference exists. However, it is equally plausible that the study size was too small (‘under-powered’) and a clinically important difference was in reality present, but not demonstrated. Unfortunately, obstetrics is particularly prone to this phenomenon because very small differences in outcomes are clinically important to both patients and their carers7 – meaning that massive sample sizes become necessary to detect clinically important differences.
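To give a sense of scale, the sketch below applies the standard normal-approximation formula for comparing two proportions. The function name and the event rates are hypothetical, chosen only for illustration and not taken from any trial cited here; a two-sided significance level of 0.05 and 80 per cent power are assumed.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate subjects needed per arm to compare two proportions.

    Uses the usual normal approximation for a two-sided test;
    it is a planning estimate, not an exact calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = z.inv_cdf(power)            # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates, for illustration only.
common_outcome = sample_size_per_arm(0.10, 0.05)    # 10% reduced to 5%
rare_outcome = sample_size_per_arm(0.002, 0.001)    # 2 per 1000 reduced to 1 per 1000

print(f"Common outcome: {common_outcome} per arm")
print(f"Rare outcome:   {rare_outcome} per arm")
```

Under these assumptions, halving a common outcome needs a few hundred women per arm, whereas halving a rare one needs more than twenty thousand per arm.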

Using only the RCT evidence, it is possible to (unwisely) conclude that an admission cardiotocograph (CTG) does not impact on fetal wellbeing.8 The extremely low incidence of adverse outcomes in low-risk patients means that the RCTs have been under-powered with respect to admission CTGs and serious adverse neonatal outcomes. Once persuaded that the RCT was perhaps under-powered to answer the question of perinatal outcomes, the RCT ‘enthusiast’ then concludes: ‘we do not know if it is beneficial or not.’ Yet other evidence points overwhelmingly to the fact that early identification of the hypoxic fetus in labour will prevent further hypoxic damage or death as labour advances. It is almost as apparent as the need for a parachute when falling from a great height!4

Rule 1.2. RCTs may lead to recommendations based on clinical protocols that are not in common practice

To achieve even a moderately acceptable sample size, it is frequently necessary to use multiple trial centres, sometimes across international boundaries. The trial protocol must then be able to accommodate the clinical vagaries of each centre, sometimes at the expense of almost invalidating the study with respect to making recommendations for common clinical practice.

In the ‘Term Breech Trial’, key management specifications were not directed in the trial protocol, including obligatory intrapartum CTG, a recent obstetric ultrasound and good progress in labour.9 Each was left largely at the discretion of the managing clinician – despite the fact that a substantial body of opinion would require continuous cardiotocography in labour, a recent ultrasound to exclude head extension or abnormality, and would deliver by caesarean section if there was inadequate progress in labour.

Rule 1.3. RCTs may lead to recommendations that are not valid for particular subgroups

The RCT design must permit sufficient heterogeneity in the patients to have a realistic chance of achieving the targeted number of trial subjects. Inevitably this diversity among the study subjects will mean that any conclusion may not be applicable to specific sub-groups within that population. The Women’s Health Initiative (WHI) trial still very much guides the use of hormone replacement therapy.10 The attentive gynaecologist will be aware that this trial included many women with risk factors for atherosclerosis, including women over 70 years of age with hypertension, diabetes and hyperlipidaemia. Few clinicians would dispute that a pro-thrombotic drug (such as oral oestrogen) is undesirable in the presence of established atherosclerosis, given that thrombosis of the atherosclerotic plaque may lead to ischaemic injury of heart or brain. However, in the absence of established atherosclerosis, oestrogen may actually confer real benefits by improving the lipid profile and thereby reducing the occurrence of atherosclerosis in the first place. Yet, such has been the blind reverence shown to this so-called ‘level I evidence’, health practitioners have largely condemned HRT – perhaps at some considerable health cost to those women at very low risk of established atherosclerosis.

Rule 1.3.1. Subgroup analysis does not necessarily overcome the problems arising from a diverse population under study

Sub-group analysis has serious deficiencies.11 Firstly, each sub-group is smaller than the trial population as a whole, leading to an even greater likelihood of a type 2 error (missing a true difference). Just as importantly, analysis of multiple sub-groups is a recipe for type 1 error (falsely reporting a difference when none really exists, as a result of multiple comparisons). If enough sub-groups are analysed, some difference will eventually appear statistically significant by chance alone.
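The arithmetic behind the type 1 error problem can be sketched in a few lines. This assumes that the sub-group comparisons are independent and that each is tested at the conventional 0.05 level. Both are simplifying assumptions, and the function name is hypothetical.

```python
def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false-positive result among independent tests
    when no true difference exists in any of them."""
    return 1 - (1 - alpha) ** n_comparisons


for k in (1, 5, 10, 20):
    rate = familywise_error_rate(k)
    print(f"{k:>2} sub-groups: {rate:.0%} chance of at least one false positive")
```

Under these assumptions, ten sub-group comparisons carry roughly a 40 per cent chance of at least one spurious ‘significant’ finding.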

This is not to say that sub-group analysis should not be performed. It is evidence that should join the complex coalition of all evidence in decision-making. The weight given to such an analysis will appropriately depend on such considerations as whether the sub-group analysis was defined prospectively, the number of sub-groups assessed, the strength of the trend and the confidence limits of the association. Whether a recommendation follows that trend will depend on all other available evidence, as assessed by those best equipped to assess all evidence, something I address below.

Rule 1.4. RCT results may not be applicable outside the trial situation – the Hawthorne effect

Diligent adherence to trial protocols and specific resources allocated to the clinical trial situation may well lead to outcomes that are valid within the study context, but are not replicated outside the trial situation. The infamous ‘Dublin RCT of electronic fetal heart rate monitoring’ has been used widely to infer that there is ‘no benefit from continuous electronic fetal heart rate monitoring in the absence of risk factors for fetal compromise’.12 Yet few labour wards in Australia or New Zealand come even remotely close to the management in that trial. All women in that study had their membranes ruptured on admission in labour, revealing liquor that was both clear and adequate. They also had auscultation of the fetal heart rate for one minute every 15 minutes in labour and after each contraction in the second stage. A dedicated midwife attended to each woman. Even with such vigilance, there were significantly more neonatal convulsions in the auscultation group and the primary hypothesis of adverse neonatal outcome came very close to statistical significance, with a p-value after adjustment of 0.08.

What, then, is the value of continuous electronic fetal heart rate monitoring in a labour ward where there is not a one-to-one midwife-patient ratio, where the colour of the liquor is unknown for a major proportion of labour and where auscultation of the fetal heart rate occurs half hourly at most and rarely for a full minute? Is it still reasonable to conclude that there is no benefit to continuous electronic fetal monitoring in low-risk labour? Clearly not. All evidence must be considered and that evidence applied to each clinical circumstance.

Rule 2. In making clinical decisions or developing a recommendation for clinical management, all evidence must be considered

Rule 2.1. In the absence of an RCT, it is wrong to conclude there is no evidence

The BMJ paper ‘Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials’4 perfectly illustrates how medicine does not depend on the clinical trial for all clinical guidance. There is other evidence in abundance. Physics can predict the effect of a high-velocity collision between two spherical objects. In the case of a collision between a human head and the planet earth, a clinical trial would seem to be unnecessary.

The clinical equivalents are almost unbounded in number. Who would dare question the wisdom of treating a blood pressure of 300/160, a ruptured appendix or even prompt delivery in the presence of a sustained fetal bradycardia in the second stage of labour? Yet, these clinical decisions are not based on RCT evidence. Happily, the curricula of RANZCOG and other specialist colleges continue to recommend a sound foundation in the scientific basis of their disciplines. It is distressing that a number of undergraduate medical curricula appear to have lost their way in this respect. An understanding of ‘causation’ underlies good clinical practice and in many ways defines ‘medical’ care.

Rule 2.2. ‘Deterministic causation’ – the most powerful evidence of all

The parachute against gravitational challenge, the ruptured appendix and severe fetal compromise in labour are all examples of deterministic causation. Broadly speaking, a strong rationale for causation can be established in a similar way to Robert Koch and his ‘postulates’ of 1890.1 Having determined causation beyond reasonable doubt (for example, a bad outcome ensuing from head crashing into ground or conservative management of a cord prolapse in early labour), the need to avoid the causative insult is immediately evidenced.

Rule 2.3. Thou shalt not be overly dismissive of anecdotes

All dogs have tails. There is a tail, therefore it is a dog. The anecdotal observation of a cat (hopefully in possession of a tail), will disprove the statement that a tail means it is a dog, thereby demonstrating the power of a single observation. How often do we hear colleagues dismissive of a clinical anecdote? ‘Provide some real evidence.’

A clinical tale (or tail) may not just be a useful learning experience, it may actually provide the best available evidence, sometimes more powerful than an under-powered RCT. In deciding on the wisdom of epidural anaesthesia in the presence of severe thrombocytopenia, just a single case report of a large epidural haematoma (with catastrophic sequelae), might be powerful enough evidence to dissuade the conscientious anaesthetist from embarking on neuraxial anaesthesia under such circumstances. The numerator in a single case report may be ‘1’ but the denominator (epidural anaesthesia in severe thrombocytopenia) is not large, so the magnitude of the true risk may be of clinically relevant proportions.

Rule 2.4. Cohort, case-control and population studies are also often the ‘best’ evidence

So often these other sources of ‘evidence’ are dismissed in favour of a loosely relevant RCT. These studies are often based on vast numbers, sometimes population sizes that are unattainable in RCTs but, as stated above, necessary in order to guide management where it is appropriate that rare adversity guides clinical practice.

What is the perinatal mortality after 41.0 weeks gestation? What is the neonatal mortality after elective caesarean section at 39 weeks? At 38 weeks? What is the likelihood of perinatal death or long-term morbidity after caesarean section at term? What is the incidence of placenta accreta in subsequent pregnancies? All critically important questions that guide day-to-day clinical practice, but only ascertainable by population studies.

Rule 2.5. Clinical experience may be compelling evidence in itself

So much of the art of obstetric practice has its origins in the teachings of senior obstetricians handed down, generation after generation. From the application of forceps to the conduct of vaginal breech delivery, the techniques have been learnt from the experience of our predecessors. While it is undoubtedly both proper and essential to question established practice, it is a greater error to discard it for ‘lack of evidence’. Bayes’ theorem recognises the imperative of placing a high value on established practices and the onus of proof applied to new alternatives should be considerable.
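One way to picture the Bayesian point is a simple beta-binomial update, in which accumulated experience is treated as prior ‘pseudo-observations’ and a new study as fresh data. This is only a sketch under that assumption. The function name and all counts are hypothetical, and nothing here is drawn from a real study.

```python
def posterior_mean(prior_successes, prior_failures, new_successes, new_failures):
    """Posterior mean success rate under a Beta prior with binomial data.

    The prior counts stand in for accumulated experience (an assumption
    made for illustration); the new counts stand in for a single study.
    """
    a = prior_successes + new_successes
    b = prior_failures + new_failures
    return a / (a + b)


# Hypothetical: a small new study reports 12 good outcomes in 20 cases.
# Established practice: prior experience equivalent to 950 good outcomes in 1000.
established = posterior_mean(950, 50, 12, 8)
# No prior experience at all: a flat Beta(1, 1) prior.
untested = posterior_mean(1, 1, 12, 8)

print(f"With long experience: {established:.2f}")
print(f"With no experience:   {untested:.2f}")
```

With a strong prior, one small study barely shifts the estimate. The same data would dominate the estimate if nothing were previously known. That is the sense in which a new alternative carries a considerable onus of proof.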

Rule 3. All guidelines and recommendations are the subjective opinion of their authors, based on their interpretation of the evidence they select

Clinical recommendations or guidelines are the result of an attempt to assess the evidence by a person, or more commonly a task group. The evidence is the study. The selection of evidence and application of that evidence to a clinical situation is interpretative and dependent on the knowledge, expertise and biases brought to the problem by those writing the guideline. It is concerning that a small group of ‘clinical trials specialists’ with little background knowledge and almost no clinical experience in the discipline can interpret RCT data and cite that opinion as ‘level I evidence’. In contrast, the vast body of clinical experts in the field, in possession of untold knowledge, skills and experience, assessing all evidence (not just RCTs), have their opinion relegated to ‘Level of Evidence: IV’ and ‘Grade of Recommendation: C’ – with the implication that it is barely worth considering.

Any recommendation must be based on the careful selection and interpretation of the available evidence. The likelihood of incorrectly evaluating the available evidence will be minimised by confining recommendations only to those issues on which a broad consensus can be achieved. The perpetual tragedy of obstetrics is that those clinicians most experienced (and therefore most able to evaluate available evidence) will rarely have time to sit on a ‘guideline development group’. Instead, guideline development groups are populated by epidemiologists and administrators – not necessarily the most worthy group to be determining clinical policy.

Rule 4. The Principle of Uncertainty: even in the presence of apparently overwhelming evidence, there always remains an element of uncertainty

On occasions, the available evidence may be overwhelming and the recommendation very strong. A clinical group may be so profoundly confident of their recommendation that an alternative approach could not be in any way countenanced. Yet there is always an element of doubt and health services should always be prepared to review recommendations and must exhibit a liberal toleration of diversity in clinical management. The door must remain open to allow continued accumulation of new evidence or an alternative interpretation of existing evidence. If evidence interpretation becomes clinical ‘law’, the continued accumulation of evidence is hindered.


The most useful evidence for determining clinical care is most often not an RCT – even when it exists. Recommendations should come from a complex coalition of relevant RCTs, cohort, case-control and population studies combined with a plausible rationale – according to the principles of ‘deterministic causation’. Ultimately, all recommendations are effectively expert opinion: the product of evidence selection and interpretation by the group making the recommendation. Importantly, those involved must possess the insight that comes from extensive clinical experience so they are able to assimilate all the available evidence in the most expert manner available. Only when these stars align can we ensure the highest probability of a valid recommendation.