Concordance of Results from Randomized and Observational Analyses within the Same Study: A Re-Analysis of the Women’s Health Initiative Limited-Access Dataset

Background Observational studies (OS) and randomized controlled trials (RCTs) often report discordant results. In the Women’s Health Initiative Calcium and Vitamin D (WHI CaD) RCT, women were randomly assigned to CaD or placebo, but were permitted to use personal calcium and vitamin D supplements, creating a unique opportunity to compare results from randomized and observational analyses within the same study.
Methods WHI CaD was a 7-year RCT of 1g calcium/400IU vitamin D daily in 36,282 post-menopausal women. We assessed the effects of CaD on cardiovascular events, death, cancer and fracture in a randomized design, comparing CaD with placebo in the 43% of women not using personal calcium or vitamin D supplements, and in an observational design, comparing women in the placebo group (44%) using personal calcium and vitamin D supplements with non-users. Incidence was assessed using Cox proportional hazards models, and results from the two study designs were deemed concordant if the absolute difference in hazard ratios was ≤0.15. We also compared results from WHI CaD to those from the WHI Observational Study (WHI OS), which used similar methodology for analyses and recruited from the same population.
Results In WHI CaD, for myocardial infarction and stroke, results of unadjusted and 6/8 covariate-controlled observational analyses (age-adjusted, multivariate-adjusted, propensity-adjusted, propensity-matched) were not concordant with the randomized design results. For death, hip and total fracture, colorectal and total cancer, unadjusted and covariate-controlled observational results were concordant with randomized results. For breast cancer, unadjusted and age-adjusted observational results were concordant with randomized results, but only 1/3 other covariate-controlled observational results were concordant with randomized results. Multivariate-adjusted results from WHI OS were concordant with randomized WHI CaD results for only 4/8 endpoints.
Conclusions Results of randomized analyses in WHI CaD were concordant with observational analyses for 5/8 endpoints in WHI CaD and 4/8 endpoints in WHI OS.



Introduction
The role that observational studies reporting effects of treatments should play in informing clinical practice is debated. Marked differences in the results of high-profile randomized controlled trials (RCTs) and observational studies have led to questions about the reliability of results of observational studies. The observational Nurses' Health Study reported that use of oestrogen with or without progesterone was associated with a substantial reduction in the risk of cardiovascular disease in post-menopausal women [1,2]. However, in two large RCTs, women randomly allocated to oestrogen and progesterone treatment had increases in risk of cardiovascular disease [3,4]. Similarly, observational studies suggested benefits for antioxidants on cancer prevention [5] and folic acid/B vitamins for cardiovascular disease [6], but later RCTs reported either harms [7,8] or no benefits [9][10][11] from these agents. In contrast, results from systematic reviews show generally good agreement between results from observational studies and those from RCTs [12][13][14]. However, within these systematic reviews, discrepancies did occur and substantial differences in the estimated magnitude of treatment effect between the different study designs were common [14]. For example, 62% of observational and randomized studies on the same topic had a >50% difference in the odds ratio [14].
There are many potential reasons for differences in results between observational studies and RCTs. They might result from differences in study design: for example, study populations may differ; RCTs are usually smaller and may not detect small effects; and RCTs usually involve shorter treatment exposure. Other differences might arise through confounding and bias in observational studies. Users of dietary supplements are generally healthier and of higher socioeconomic status than non-users, and these factors are often difficult to control for in statistical analyses. Thus, some of the benefits observed in the observational studies for such agents may reflect underlying health differences between people who use supplements and those who do not, even though attempts were made to adjust for such differences in statistical models.
The Women's Health Initiative Calcium and Vitamin D trial (WHI CaD) represents a unique opportunity to explore differences in results between observational studies and RCTs. WHI CaD was a very large, long duration RCT that permitted the non-protocol use of study agents: women were randomly assigned to CaD or placebo, but were permitted to use personal calcium and vitamin D supplements. At randomization, 57% of participants were using either personal calcium or vitamin D supplements. Thus, it is possible to compare results from the two different study designs within the same study: a randomized design comparing the effects of CaD with placebo in women not using personal calcium or vitamin D supplements, and an observational design restricted to the placebo group comparing outcomes in women using personal calcium and vitamin D supplements with outcomes in non-users. Whether the results from these two different study designs are concordant or not might provide insights into differences between results from observational studies and RCTs.

WHI CaD trial
The design and results of the WHI CaD trial have been published in full [15][16][17][18][19]. The WHI clinical trials programme consisted of 3 trials. At entry to the programme, women were invited to take part in the WHI dietary modification trial, the WHI hormone therapy trial, or both. At their first or second annual follow-up visit, participants in these trials were invited to take part in WHI CaD. 36,282 post-menopausal women were randomized to daily supplemental calcium (1g) and vitamin D (400 IU) or matching placebos and followed for an average of 7y. Personal calcium supplements of up to 1g daily, and personal vitamin D supplements of up to 600 IU daily (and later 1000 IU daily) were permitted in WHI CaD [15]. Outcomes for cardiovascular events, hip and total fracture, colorectal, breast, endometrial and ovarian cancer, and mortality were adjudicated centrally, while other cancers were adjudicated by local researchers [20]. CaD had no effect on the incidence of hip or total fracture, cardiovascular outcomes, colorectal or breast cancer, or mortality [15][16][17][18][19]. We obtained the WHI limited-access clinical trials dataset from the National Heart Lung and Blood Institute (NHLBI). Data are anonymous in the dataset. A protocol was submitted to the NHLBI before any analyses were carried out. We attempted to replicate the approach of the WHI investigators where possible. Our re-analysis was approved by the Northern X regional ethics committee.

Randomized study design analyses
We assessed the effects of CaD on myocardial infarction, stroke, all-cause mortality, hip and total fracture, and breast, colorectal, and total cancer (total cancer excludes non-melanoma skin cancer). Using an intention-to-treat approach, the effect of CaD on the time since randomization to the first event for each of these endpoints was assessed using Cox proportional hazards models, stratified by age, randomization status in the WHI hormone and dietary modification trials and relevant prevalent disease at baseline (history of breast, colorectal, or any cancer for breast, colorectal and total cancer endpoints respectively; and history of fracture for hip and total fracture; and history of cardiovascular disease for myocardial infarction and stroke). These analyses were performed in the cohort of participants who were not using personal non-protocol calcium or vitamin D supplements at randomization. We also performed these analyses in the entire WHI CaD cohort for comparison with the original publications.
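For readers less familiar with the approach, a stratified Cox proportional hazards model of the kind described above can be written as follows. The notation is ours and is only a sketch: s indexes a stratum (for example, an age group combined with randomization status in the other WHI trials and prevalent disease), h_{0s}(t) is the stratum-specific baseline hazard, and Z is 1 for allocation to CaD and 0 for placebo.

```latex
h_s(t \mid Z) = h_{0s}(t)\,\exp(\beta Z), \qquad \mathrm{HR} = \exp(\beta)
```

Under this formulation the baseline hazard may differ between strata, while a single hazard ratio for CaD versus placebo is estimated across all strata.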

Observational study design analyses
We restricted analyses to the placebo group and compared outcomes in women using personal calcium and vitamin D supplements at randomization with women not using either personal calcium or vitamin D supplements at randomization for each of the above endpoints using Cox proportional hazards models as described for the randomized design. Because there were differences in baseline characteristics between supplement users and non-users, we carried out unadjusted and age-adjusted analyses, and analyses that controlled for other covariates. For multivariate analyses, we included variables that differed between the groups and/or might be potentially related to the outcome, with the final model selection based on plausibility, parsimony, and consideration of similar models used by the WHI investigators [21]. We also used propensity scores to control for baseline differences. We used a stepwise logistic regression model that selected 52 of 478 baseline variables to create a propensity score for baseline personal use of calcium and vitamin D supplements, and this score was included as a covariate in the Cox proportional hazards models. Finally, we performed analyses in which users of personal calcium and vitamin D supplements were matched with non-users based upon their propensity score. We identified 5363 matched pairs with propensity scores that differed by ≤0.07; the mean difference in propensity score within pairs was 0.0041.
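The idea of matching within a propensity-score caliper can be illustrated with a minimal Python sketch. This is a simplified greedy nearest-neighbour procedure, not the SAS macro used in the study; the function name `greedy_caliper_match`, the participant identifiers, and the scores are all hypothetical and chosen only for illustration.

```python
def greedy_caliper_match(users, non_users, caliper=0.07):
    """Greedily pair each user with the closest unmatched non-user.

    `users` and `non_users` map participant id -> propensity score.
    A pair is formed only if the scores differ by at most `caliper`.
    """
    available = dict(non_users)  # copy so the caller's dict is untouched
    pairs = []
    # Process users in order of increasing propensity score.
    for user_id, user_score in sorted(users.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining non-user by absolute score difference.
        best_id = min(available, key=lambda nid: abs(available[nid] - user_score))
        if abs(available[best_id] - user_score) <= caliper:
            pairs.append((user_id, best_id))
            del available[best_id]  # each non-user is matched at most once
    return pairs


# Made-up propensity scores, for illustration only (not study data).
users = {"u1": 0.30, "u2": 0.55, "u3": 0.95}
non_users = {"n1": 0.28, "n2": 0.60, "n3": 0.40}

pairs = greedy_caliper_match(users, non_users)
mean_diff = sum(abs(users[u] - non_users[n]) for u, n in pairs) / len(pairs)
print(pairs, round(mean_diff, 3))
```

In this toy example the third user has no non-user within the caliper and is left unmatched, which is why a matched analysis typically includes fewer participants than the full cohort.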
The WHI investigators reported analyses based on use of personal calcium and vitamin D supplements in the prospective WHI Observational Study (OS) which was recruited from the same catchment population as WHI CaD [21]. They compared outcomes over 7.2y for 15,476 women taking 500mg/d calcium and 400IU/d vitamin D at baseline with 23,561 women not using these supplements for cardiovascular, fracture, mortality and cancer endpoints [21]. We compared the results from our analyses with these previously published results.

Concordance of results
There are no accepted criteria for defining concordance of results between studies. The point estimates of the hazard ratios for the treatment effects of CaD on the major outcomes in WHI CaD ranged from 0.88 to 1.08, with 95% confidence intervals spanning approximately ±0.15 [15][16][17][18][19]. We think a difference of 0.15 between hazard ratios is a reasonable threshold for concordance because smaller differences have little effect on absolute risk, and are therefore of less clinical relevance to individual patients. For these reasons, we considered results from the two study designs concordant when the absolute difference between the point estimates of the treatment effect was ≤0.15.
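The concordance rule is simple enough to state as a short Python sketch. The function name `is_concordant` and the example hazard ratios are hypothetical; the rounding step is our own addition to avoid floating-point noise when a difference falls exactly on the threshold.

```python
def is_concordant(hr_randomized: float, hr_observational: float,
                  threshold: float = 0.15) -> bool:
    """Return True when two hazard-ratio point estimates differ by at most `threshold`."""
    # Round so that a difference of exactly 0.15 is not rejected
    # because of binary floating-point representation.
    return round(abs(hr_randomized - hr_observational), 10) <= threshold


# Hypothetical point estimates, for illustration only (not study results).
print(is_concordant(1.00, 0.90))  # difference of 0.10
print(is_concordant(1.05, 0.85))  # difference of 0.20
```

Note that the rule compares point estimates only and takes no account of the width or overlap of the confidence intervals.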

Data and statistical analyses
We have reported the baseline characteristics at the time of randomization to CaD, whereas the WHI investigators reported these characteristics at the time of entry to the WHI programme. For body mass index, and dietary and supplemental calcium and vitamin D intakes, we used the latest value recorded between screening and one month following CaD randomization. Cox proportional hazards models and logistic regression were undertaken as described above using the SAS software package (SAS Institute, Cary, NC version 9.4). We matched personal users of calcium and vitamin D supplements by propensity score with the %gmatch macro in SAS [22]. The assumption of proportional hazards was explored by performing a test for proportionality of the interaction between variables included in the model and the logarithm of time. All tests were two-tailed and P<0.05 was considered significant.
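One common way to express a test of proportionality based on an interaction with the logarithm of time is sketched below. The notation is ours and is an assumption about the general form rather than the exact specification used: X is a covariate in the model, β its log hazard ratio, and γ the coefficient of its interaction with log t.

```latex
h(t \mid X) = h_0(t)\,\exp\bigl(\beta X + \gamma\, X \log t\bigr), \qquad H_0\colon \gamma = 0
```

If γ differs from zero, the hazard ratio for X changes over follow-up time, and the proportional hazards assumption is in doubt for that covariate.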

Results
At randomization, 43% of participants were not using personal calcium or vitamin D supplements, 54% were using personal calcium, 47% personal vitamin D, and 44% both personal calcium and vitamin D. For our analyses, the randomized design included the 15,646 (43%) participants not using personal calcium or vitamin D supplements. The observational design included the 15,828 (44%) participants from the placebo group who were either using both personal calcium and vitamin D or were not using either of these supplements at randomization. Baseline characteristics for the entire cohort and for the subgroups defined by treatment allocation and personal supplement use are shown in Table 1. The subgroups for the randomized design were well-matched for these baseline characteristics, whereas for the observational design, there were a number of important differences between the subgroups, including for variables such as age, body mass index, race, hormone replacement therapy use and history of medical conditions such as hypertension and fracture.
Personal supplement use tended to increase throughout the study. At their final study visit, 32% of participants in the entire cohort were not using personal calcium or vitamin D, and 60% were using both supplements. For the randomized design, 53% of participants in both groups continued to be non-users of personal calcium at their final visit. For the observational design, 14% of participants using personal calcium and vitamin D at randomization were no longer using these supplements at their final visit, and 53% of participants not using these supplements at randomization continued to be non-users at their final visit.
Tables 2-4 and Fig 1 show the results for the randomized design, the observational design, and for comparison, the multivariate-adjusted results from the WHI OS. For myocardial infarction and stroke (Table 2), the results for the randomized and unadjusted observational designs were not concordant, and only 2/8 of the covariate-controlled observational analyses (age-adjusted, multivariate-adjusted, propensity-adjusted, or propensity-matched) were concordant with the randomized design results. The results of WHI OS were not concordant with the randomized design results.
In contrast, for death (Table 2), all of the unadjusted and covariate-controlled observational design results and the WHI OS result were concordant with the randomized design result. Similarly, for hip and total fracture (Table 3), the unadjusted observational design result, 7/8 of the covariate-controlled observational results, and the WHI OS result were concordant with the randomized design result. For breast cancer (Table 4), the unadjusted, age-adjusted and multivariate-adjusted observational design results were concordant with the randomized design result. However, neither the WHI OS result nor the propensity-adjusted or propensity-matched observational design results were concordant with the randomized design result. For colorectal and any cancer (Table 4), the unadjusted and covariate-controlled observational design results were concordant with the randomized design results. However, only the WHI OS result for colorectal cancer was concordant with the randomized result.
In sensitivity analyses, we explored the effect of selecting different thresholds for defining concordance. If we adopted a threshold of ±0.10 for concordance, 3/8 unadjusted and 15/32 covariate-controlled observational design results, and 4/8 WHI OS results were concordant with the randomized design results. Using a threshold of ±0.20, 7/8 unadjusted and 26/32 covariate-controlled observational design results, and 5/8 WHI OS results were concordant with the randomized design results. (For the primary analyses with a threshold of ±0.15, the frequency of concordance was 6/8, 23/32, and 4/8, respectively.)
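A sensitivity analysis of this kind amounts to recounting concordant endpoints at each threshold, as in the minimal Python sketch below. The function name `concordance_counts`, the endpoint labels, and the hazard ratios are placeholders invented for illustration and are not the values reported in Tables 2-4.

```python
def concordance_counts(hr_randomized, hr_observational,
                       thresholds=(0.10, 0.15, 0.20)):
    """Count endpoints whose hazard ratios agree within each threshold.

    Both arguments map endpoint name -> hazard-ratio point estimate.
    """
    counts = {}
    for t in thresholds:
        counts[t] = sum(
            1
            for endpoint, hr in hr_randomized.items()
            # Rounding guards against floating-point noise at the boundary.
            if round(abs(hr - hr_observational[endpoint]), 10) <= t
        )
    return counts


# Placeholder hazard ratios, for illustration only (not study results).
rct = {"endpoint_a": 1.00, "endpoint_b": 0.95, "endpoint_c": 1.10}
obs = {"endpoint_a": 0.88, "endpoint_b": 0.93, "endpoint_c": 0.92}

counts = concordance_counts(rct, obs)
print(counts)
```

Because the effect estimates in a study like this sit close together, small changes in the threshold can move several endpoints across the concordance boundary, which is the behaviour the sensitivity analysis is designed to expose.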

Discussion
There were different patterns of results from randomized and observational study designs for different outcomes in WHI CaD. For death, colorectal and total cancer, and hip and total fracture, results of unadjusted observational analyses were concordant with randomized design results, and adjustment for other variables in the observational analyses generally had little effect. For myocardial infarction and stroke, results of unadjusted observational analyses were not concordant with the randomized design results, and adjustment for other variables generally did not substantially decrease the differences between the results. For breast cancer, the unadjusted, age-adjusted and multivariate-adjusted observational results were concordant with the randomized results, but propensity adjustment or matching increased the differences between the results. Overall, 6/8 unadjusted, 6/8 age-adjusted, 8/8 multivariate-adjusted, 5/8 propensity-adjusted, and 4/8 propensity-matched observational results were concordant with the randomized results. In comparison, 4/8 results from the WHI OS were concordant with the randomized results.
The results suggest that, within the same study, there are no substantial differences between results from randomized and observational study designs. Other than for myocardial infarction and stroke, all the unadjusted observational results were concordant with the randomized design results, and all multivariate-adjusted results from Cox proportional hazards models incorporating potential confounders were concordant. Results from propensity-adjusted and propensity-matched models were generally similar to the multivariate Cox proportional hazards model results. However, there were small differences between these models for some endpoints (myocardial infarction and breast cancer), and the propensity-adjusted and propensity-matched models did not fall within the defined range for concordance for these two outcomes or for stroke.
An important limitation is that the randomized and observational study designs were not independent because the control group was the same for both designs. This feature may have contributed to the smaller differences in the within-study comparisons of observational and randomized designs than in the between-study comparisons. Although there was fairly high concordance of observational and randomized design results within WHI CaD, concordance between the WHI CaD randomized results and the WHI OS results was only 50%, even though the two studies used similar methodology and recruited participants from the same population. Thus, differences in results between RCTs and observational studies may be due to differences between studies, even when they are small and subtle, rather than due to the specific design of the study (observational versus RCT). One potential difference is the willingness of participants to take part in a clinical trial and be randomized and blinded to a treatment. It is possible that responses to a treatment might be different in people willing to participate in a clinical trial compared to people unwilling to participate.
The results suggest that the influence of potential confounders may vary for different outcome variables and in different statistical models, although any such differences were small. There were substantial differences between users of personal calcium and vitamin D and those not taking either of these supplements for variables such as age, body mass index, and race which are all associated with cardiovascular disease, fractures, and cancer. Age and race were statistically significant predictors of fracture and cancer outcomes in our analyses, but adjustment for these and other variables did not have a substantial impact in any of the observational analyses, with all differences between the unadjusted and covariate-controlled effect estimates being <0.12. There were small differences between effect estimates from Cox proportional hazards models and propensity score-based models, but all differences were <0.17. When effect sizes are large, such differences are likely to have little impact. However, 70% of numeric associations were weak (odds ratio or relative risk between 0.5 and 2.0) in a recent survey of >2000 outcomes assessed in the influential observational Nurses' Health Study [23]. For effect estimates of this magnitude, small effects from adjusting for potential confounders could have substantial impact. It is not certain what accounts for the different impacts of confounders on outcome variables, but it highlights the difficulties in carrying out and interpreting multivariate analyses. It suggests that multivariate analyses of observational studies should be treated as exploratory, with a number of different models and techniques applied. The results should be reported accordingly, rather than simply presenting the results from a single "best" model, as commonly occurs.
An important limitation of our analyses is that the effects of CaD on all the outcomes we measured in both the randomized and observational designs were weak, with all effect estimates ranging between 0.76 and 1.20. Although WHI CaD was a large study and the confidence intervals around the effect estimates were generally narrow, it is possible that results might differ for agents with stronger therapeutic effects. We are not aware of any other completed large studies with a similar study design, that is, one that permitted non-protocol use of the study medication and had a large proportion of non-protocol users at baseline. However, a large study of vitamin D supplements currently underway also permits the use of non-protocol vitamin D supplements [24]. This study may therefore allow a similar analysis to ours to be undertaken once the study is completed. Cross-over between the study groups occurred, with non-users of supplements at baseline starting them during follow-up and also, less commonly, baseline users discontinuing supplements. This cross-over between groups may have obscured true effects of CaD. Finally, an important limitation is that our definition of concordance between study results is necessarily arbitrary, being based on clinical pragmatism [23], although we did explore other definitions in sensitivity analyses.
In summary, these results do not suggest that there are substantial differences between the results of randomized and observational study designs within the same study, although concordance of results did vary between outcomes. The comparison of randomized results from WHI CaD with those from the separate WHI OS observational study again highlights the inconsistency of results between RCTs and observational studies, even, in this case, when the studies used similar methodology in the analyses and recruited participants from the same population. The effect of adjusting for potential confounders in observational analyses differed by only small amounts across a range of outcome variables and across the different methods of adjustment used. However, as the effect estimates were also small, some of these differences did alter the conclusions as to whether results were concordant or not. This suggests that multivariate adjustment in observational studies should explore a variety of different models and techniques, and report the impact of the different approaches as exploratory analyses.