Results of Observational Studies: Analysis of Findings from the Nurses’ Health Study

Background: The role of observational studies in informing clinical practice is debated, and high profile examples of discrepancies between the results of observational studies and randomised controlled trials (RCTs) have intensified that debate. We systematically reviewed findings from the Nurses’ Health Study (NHS), one of the longest and largest observational studies, to assess the number and strength of the associations reported and to determine if they have been confirmed in RCTs.

Methods: We reviewed NHS publication abstracts from 1978–2012, extracted information on associations tested, and graded the strength of the reported effect sizes. We searched PubMed for RCTs or systematic reviews for 3 health outcomes commonly reported in NHS publications: breast cancer, ischaemic heart disease (IHD) and osteoporosis. NHS results were compared with RCT results and deemed concordant when the difference in effect sizes between studies was ≤0.15.

Findings: 2007 associations between health outcomes and independent variables were reported in 1053 abstracts. 58.0% (1165/2007) were statistically significant, and 22.2% (445/2007) were neutral (no association). Among the statistically significant results that reported a numeric odds ratio (OR) or relative risk (RR), 70.5% (706/1002) reported a weak association (OR/RR 0.5–2.0), 24.5% (246/1002) a moderate association (OR/RR 0.25–0.5 or 2.0–4.0) and 5.0% (50/1002) a strong association (OR/RR ≤0.25 or ≥4.0). 19 associations reported in NHS publications for breast cancer, IHD and osteoporosis have been tested in RCTs, and the concordance between NHS and RCT results was low (≤25%).

Conclusions: NHS publications contain a large number of analyses, the majority of which reported statistically significant but weak associations. Few of these associations have been tested in RCTs, and where they have, the agreement between NHS results and RCTs is poor.


Introduction
Observational research is commonly undertaken, reported and publicised, but the role of observational studies in informing clinical practice is debated. High quality randomised controlled trials (RCTs) are usually considered to be the highest level of evidence (Level 1), with high quality cohort studies ranked immediately below this (Level 2) [1]. Some authors have suggested a broad role for observational studies because the study population may better represent the general population than in RCTs, because RCTs can be difficult or impossible to carry out for some conditions, and because systematic reviews have generally reported that results from observational studies do not differ markedly from RCTs [2][3][4][5]. However, a number of high profile examples of discrepancies between results of observational studies and subsequent RCTs have led others to suggest the role for observational studies should be limited [6][7][8]. Observational studies suggested beneficial effects of oestrogen with progesterone on cardiovascular disease [9], antioxidants on cancer prevention [10], and folic acid/B vitamins for cardiovascular disease [11], but subsequent RCTs reported either harms [12][13][14][15] or no benefits [16][17][18] from these agents. Because observational studies cannot test causality, one view is that their results should be regarded as hypothesis-generating and should not influence clinical practice until these hypotheses are tested in adequately powered RCTs [19]. Others suggest that small effects seen in observational studies should not be considered credible because they are more likely to represent bias and confounding than a causal relationship [8].
One of the largest, longest and most influential observational studies is the Nurses' Health Study (NHS). The NHS began in 1976 and has subsequently followed more than 100,000 women in the original cohort study. Numerous papers in high impact biomedical journals have originated from this study. The size, duration and eminence of the NHS make it a good model to formally explore the scope, veracity and impact of data from observational analyses. In the present work, we have undertaken a systematic review of publications from the NHS. We set out to determine how many hypotheses have been explored in NHS publications, the strength of the associations reported, and how these findings align with those from RCTs on the same topics.

NHS publications
In November 2013, we extracted the citations of all 1235 NHS publications from 1978 to 2012 from the NHS website (http://www.channing.harvard.edu/nhs/?page_id=154). We included publications with an abstract, those in which the NHS cohort was part of the population studied, and those with an observational (case-control or cohort) study design. Figure 1 shows the flow of studies. 28 publications did not have an abstract, 52 studies did not include the NHS cohort, and 102 publications did not report findings from observational analyses, leaving 1053 publications included in our analyses.
One investigator (VT) reviewed the abstracts of all eligible publications. For all analyses reported in the abstracts, we extracted information on the associations analysed by the investigators, the endpoints assessed, the independent variables for each of those endpoints, and the reported effect sizes and 95% confidence intervals (CIs). We classified each result by the strength and direction of the association reported and the level of statistical significance. Statistically significant results with an odds ratio (OR) or relative risk (RR) of ≤0.25 or ≥4 were considered strong associations, those with OR/RR of 0.25–0.5 or 2–4 were considered moderate associations, and those with OR/RR of 0.5–2 were considered weak associations [8]. When the CI for the OR/RR of a reported association included 1 but the text implied a relationship between the outcome and independent variable existed, we classified these associations as statistically non-significant.
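The grading rule above can be sketched as a small helper. This is an illustration of the thresholds described in the text, not code from the study; the function name, signature, and handling of boundary values (assigned to the stronger category) are our assumptions:

```python
def classify_association(rr, ci_lower, ci_upper):
    """Grade an OR/RR by the thresholds described in the text.

    Illustrative sketch only: the signature and boundary handling
    are assumptions, not the study's actual analysis code.
    """
    # CI for the OR/RR including 1 => statistically non-significant
    if ci_lower <= 1.0 <= ci_upper:
        return "non-significant"
    # OR/RR <= 0.25 or >= 4: strong association
    if rr <= 0.25 or rr >= 4.0:
        return "strong"
    # OR/RR 0.25-0.5 or 2-4: moderate association
    if rr <= 0.5 or rr >= 2.0:
        return "moderate"
    # OR/RR 0.5-2: weak association
    return "weak"
```

For example, a significant RR of 1.3 grades as weak, 3.0 as moderate, and 0.2 as strong under this rule.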

Randomised clinical trials
We selected 3 health outcomes (breast cancer, ischaemic heart disease [IHD] and osteoporosis) that are important to women's health and were frequently studied in NHS publications. One investigator (VT) searched PubMed for RCTs or systematic reviews of RCTs with breast cancer, IHD and osteoporosis as the primary endpoint that evaluated similar factors to the independent variables studied in NHS publications. We used the following format in our PubMed search: health outcome AND independent variable AND random*. We included the latest meta-analysis of RCTs identified, and when one was not available or suitable, we included all large relevant RCTs. Two investigators (VT and MB) reviewed the full texts of these meta-analyses or RCTs and extracted information on effect sizes and 95% CIs for the individual result or the pooled analyses of RCTs.

Comparison of results of NHS publications with RCT results
We compared the effect sizes reported in the NHS publications with those from relevant RCTs, and considered them concordant when the absolute difference between the effect sizes was ≤0.15. There is no generally accepted definition of concordance between results from studies of different designs. We chose a threshold of an absolute difference of 0.15 on the basis that this effect size is close to the smallest that is clinically meaningful: smaller effect sizes are generally unlikely to be considered clinically meaningful to patients, because the absolute benefit from taking a treatment is small in this situation.
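Under this definition, the concordance check reduces to a one-line comparison. The sketch below is ours, not the authors' code, and assumes both effect sizes are expressed on the same OR/RR scale:

```python
def concordant(nhs_effect, rct_effect, threshold=0.15):
    # Concordant when the absolute difference between the NHS and RCT
    # effect sizes (OR/RR) is at most the chosen threshold (0.15 here).
    return abs(nhs_effect - rct_effect) <= threshold
```

For example, an NHS RR of 0.80 against an RCT RR of 0.92 would count as concordant (difference 0.12), while 0.70 against 1.00 would not.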

NHS results
Associations between 61 health outcomes and 1383 independent variables were reported in the abstracts of 1053 NHS publications (Table S1). Many of these independent variables were reported slightly differently or were closely related in different publications and so we were able to classify them into 136 broad groups comprising closely related variables. The three most commonly tested outcomes were breast cancer, colorectal cancer and IHD, and associations were reported with these endpoints for 56, 49 and 46 broad groups of independent variables, respectively (Table 1). In total, 2007 associations between health outcomes and independent variables were reported. Of these associations, 1433 (71.4%) were results from the NHS cohort alone, and 574 (28.6%) were from studies where the NHS cohort was pooled with other cohorts. Figure 2 shows that 1165 (58.0%) of the 2007 associations reported were statistically significant (477 beneficial, 688 harmful), 204 (10.2%) were statistically non-significant but the abstract implied an association exists (114 beneficial, 90 harmful), and 445 (22.2%) were neutral (no association). The majority of the 204 statistically non-significant results reported effect sizes for an individual subgroup that was not statistically significant, but a test for a trend across the subgroups was statistically significant. For a further 193 (9.6%) results, an association was reported in the abstract but there was insufficient information to determine whether the association was beneficial or harmful. Among the 1165 statistically significant associations, 1002 had a reported numeric RR or OR. Figure 3 shows that the majority of these associations, 706 (70.5%), were weak, with 246 (24.5%) associations classified as moderate, and only 50 (5.0%) classified as strong associations. Table 2 shows the journals and frequency of NHS publications. 
The impact of NHS publications is apparent: 30% of the publications were published in journals with impact factors >10, and 15% were published in one of the 6 most prestigious internal medicine journals (Annals of Internal Medicine, Archives of Internal Medicine, BMJ, Lancet, JAMA, New England Journal of Medicine).

Comparisons between NHS findings and RCT results
The results from NHS publications and relevant RCTs are summarised in Table 3 for breast cancer, Table 4 for IHD, and Table 5 for osteoporosis. Of the 49 associations in NHS publications for these 3 outcomes, 16 were not statistically significant; of the statistically significant associations, 30 were classified as weak, 3 as moderate, and 0 as strong. For breast cancer, NHS publications reported associations with 56 broad groups of independent variables.

Discussion
NHS publications report a very large number of associations between health outcomes and independent variables. Only 1 in 5 associations was reported as neutral (no association). Of the statistically significant associations, only 5% were strong associations (OR/RR ≤0.25 or ≥4), with 70% of effect sizes being weak (OR/RR between 0.5 and 2.0). Few of the associations have been tested in RCTs and, where relevant RCTs have been reported, only 1 in 5 NHS study results was concordant with the RCT result. Despite this, NHS publications were frequently published in high impact journals.
More than 2000 associations from this single study were reported in the abstracts of publications we reviewed. This is likely to be a substantial underestimate of the actual number of associations examined, because many results will have only been reported in the text or tables of the full article or will not have been reported. The large number of statistical tests raises concerns about false positive results. None of the abstracts highlighted this possibility, reported analyses adjusted for multiple statistical testing, or mentioned the number of analyses previously conducted in the NHS cohort.
1358 results (68%) in NHS publication abstracts were either statistically significant or reported as though an association existed. It is difficult to estimate the likely number of false positives amongst these results. If all of the 2007 associations examined were of unrelated variables and there was no relationship between the health outcomes and these variables, about 100 results (5%) would be statistically significant due to chance. However, many of the variables examined were closely related which would decrease the total number of independent tests. On the other hand, it is likely that the results reported in the abstract are only a small proportion of the total statistical tests conducted (either reported in the full article or not reported) which would substantially increase the total number of independent tests. Furthermore, statistically significant results are more likely to be reported in the abstract than nonsignificant results. Given the likely bias toward significant results and the very large number of statistical tests performed, it seems reasonable to conclude that a substantial proportion of results were false positives. This concern was not raised in any of the abstracts.
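The back-of-envelope expectation described above (assuming, counterfactually, that all tests are independent and all null hypotheses are true) can be computed directly:

```python
n_tests = 2007   # associations reported in NHS abstracts
alpha = 0.05     # conventional significance threshold
expected_false_positives = n_tests * alpha
print(round(expected_false_positives))  # prints 100
```

As the paragraph notes, correlation between variables pushes this figure down, while unreported tests and selective reporting of significant results in abstracts push the true false-positive count up.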
The strength of associations reported in observational studies is often viewed as an indicator of the credibility of the association [8,[69][70][71]. Associations with OR or RR ≥4 or ≤0.25 are considered strong and more likely to be reliable in the absence of significant bias [8,70,71]. However, where the association is weak or moderate, such results should be viewed with scepticism. Effect sizes may be inflated, observational studies are limited by selection bias, confounding, and methodological weaknesses in their study design and analysis, and large observational studies can produce implausibly precise estimates of effect sizes that are highly statistically significant but clinically unimportant [8,[69][70][71]. Only 5% of results reported in NHS publications were strong associations. Despite this, a very large number of NHS papers were published in high-impact general medical and speciality journals. A recent survey reported that only 14% of publications of observational studies in high impact medical journals called for RCTs to support their findings, with the majority making explicit recommendations regarding clinical practice based upon the observational study findings [19]. Taken together, these findings suggest that many journals, including high impact journals, place a low importance on the strength of an association or the non-randomised nature of the study and hence the credibility of the association when evaluating observational studies for publication. In addition, since clinical research findings published in prominent journals influence clinical behaviour, our findings suggest that clinical practice might often be driven by false positive results from observational studies. We compared findings from NHS publications and RCTs for 3 important health outcomes that were studied commonly in NHS publications. Results of 496 associations between breast cancer, IHD, and osteoporosis and independent variables were reported in 326 publications.
However, few RCTs examining the relationship between these outcomes and the independent variables have been undertaken. Thus, we identified RCTs for only 19 of these broad groups of variables for these 3 outcomes, and the concordance between the results of the RCTs and the NHS results was poor. The reasons for the small number of RCTs are not clear. It is possible that investigators do not view the NHS results as credible because of the small effect sizes, and thus have not chosen to examine their findings in RCTs, but this seems unlikely. A possible explanation is that RCTs are more difficult, more expensive, and take longer to conduct than new analyses of the NHS, or comparable analyses of other observational datasets. In addition, the volume of hypotheses generated (about 30 NHS papers eligible for our analysis were published annually) and the small effect sizes reported mean that an impractically large number of very large RCTs would be needed to test all the associations reported. About 60% of associations reported by NHS studies suggested a harmful effect of the independent variable on the outcome. This is another possible explanation for the small number of RCTs, as directly assessing potential harms in an RCT is likely to be unattractive to researchers, ethics committees, funding bodies, and participants. However, potential harms identified in observational studies can usually be indirectly assessed in RCTs, by exploring whether interventions that reduce the potential harmful exposure improve health outcomes. If reduction of a potentially harmful exposure has no impact on health outcomes, this suggests that harm from the exposure is spurious and not clinically relevant. Previous systematic comparisons of the results of observational studies and RCTs have reported that pooled results from observational studies generally do not differ markedly from pooled results from RCTs [2][3][4].
However, within these pooled analyses, there were marked variations in individual results, discrepancies did occur, and differences in estimated magnitude of treatment effect were common [3,4]. There was agreement between the results of NHS publications and relevant RCTs for only 10-25% of analyses for the 3 outcomes we assessed. The low rate of concordance likely reflects the propensity of observational analyses to generate inaccurate estimates of effect, as a result of confounding and bias. Other contributing factors might be that our definition of concordance was quite stringent, or that the factors studied in RCTs were not always identical to those studied in NHS publications (e.g., calcium supplements vs. dietary calcium intake).
In summary, we found that a very large number of associations have been reported in NHS publications, but 95% were weak or moderate in strength, and therefore unlikely to be causal. Few of these associations have been tested in RCTs, and where they have been, agreement between NHS and RCT findings is poor. Clinicians interpreting the findings of observational studies such as the NHS should be aware of the possibility that multiple statistical tests have been undertaken with the resulting likelihood of false positive results, and of the lack of credibility for associations where the effect size is small. The low concordance of NHS findings with RCT findings suggests that clinical practice should not be informed by observational studies, and that findings from observational studies should not necessarily lead to confirmatory RCTs being conducted, especially when the effect size is small. Reporting of observational studies would be improved by including the total number of associations ever tested in the study, the proportions of statistically significant results previously published, and whether previous findings from the observational study are concordant with RCTs.