Instrumental Variable Estimation of the Causal Effect of Plasma 25-Hydroxy-Vitamin D on Colorectal Cancer Risk: A Mendelian Randomization Analysis

Vitamin D deficiency has been associated with several common diseases, including cancer, and is being investigated as a possible risk factor for these conditions. We reported the striking prevalence of vitamin D deficiency in Scotland. Previous epidemiological studies have reported an association between low dietary vitamin D and colorectal cancer (CRC). Using a case-control study design, we tested the association between plasma 25-hydroxy-vitamin D (25-OHD) and CRC (2,001 cases, 2,237 controls). To determine whether plasma 25-OHD levels are causally linked to CRC risk, we applied the control function instrumental variable (IV) method of the Mendelian randomization (MR) approach using four single nucleotide polymorphisms (rs2282679, rs12785878, rs10741657, rs6013897) previously shown to be associated with plasma 25-OHD. Low plasma 25-OHD levels were associated with CRC risk in the crude model (odds ratio (OR): 0.76, 95% confidence interval (CI): 0.71, 0.81, p = 1.4×10⁻¹⁴) and after adjusting for age, sex and other confounding factors. Using an allele score that combined all four SNPs as the IV, the estimated causal effect was OR 1.16 (95% CI 0.60, 2.23), whilst it was 0.94 (95% CI 0.46, 1.91) and 0.93 (95% CI 0.53, 1.63) when using an upstream (rs12785878, rs10741657) and a downstream allele score (rs2282679, rs6013897), respectively. 25-OHD levels were inversely associated with CRC risk, in agreement with recent meta-analyses. The fact that this finding was not replicated when the MR approach was employed might be due to weak instruments, giving low power to demonstrate an effect (<0.35). The prevalence and degree of vitamin D deficiency amongst individuals living in northerly latitudes is of considerable importance because of its relationship to disease. 
To elucidate the effect of vitamin D on CRC risk, additional large studies of vitamin D and CRC risk are required and/or the application of alternative methods that are less sensitive to weak instrument restrictions.


Introduction
Vitamin D can be ingested or synthesized in the skin from inactive precursors through the action of UV sunlight. Its active form, 1,25(OH)2D (1,25(OH)2D2 and/or 1,25(OH)2D3), is produced after two hydroxylation steps in the liver and kidneys (Figure 1) [1]. The prevalence of vitamin D deficiency in Scotland is high due to the high northern latitude, often cloudy weather (lack of sunlight impairs vitamin D synthesis during winter months), an indoor-oriented lifestyle and poor diet, and so routine vitamin D and calcium supplementation for the housebound (>65 years old) is recommended [2]. In a recent study of over 2000 healthy individuals living in Scotland, we found that 77.5% of the individuals were vitamin D deficient [3]. Although the Reference Nutrient Intake (RNI) of vitamin D set by the Scientific Advisory Committee on Nutrition in Scotland for people over 65 years old is 10 µg per day [4], there is great variation in the recommended daily allowances (RDA) proposed by different research groups and institutions [5][6][7][8].
Vitamin D has been considered relevant to skeletal disease and calcium metabolism, but there is growing evidence that vitamin D deficiency might be a risk factor for cancer, cardiovascular, metabolic, infectious and autoimmune diseases [3]. In particular, vitamin D may affect colorectal cancer (CRC) risk via its binding to the vitamin D receptor (VDR) [9], influencing cell proliferation, differentiation, apoptosis and angiogenesis [10,11], or by affecting insulin resistance [12]. Results from case-control and cohort studies are inconclusive, but results from cohort studies measuring 25-hydroxy-vitamin D in blood or serum are more consistent, indicating an inverse association with CRC [13][14][15].
Establishing causal relationships between environmental exposures and common diseases using conventional methods of observational studies is problematic due to unresolved confounding, reverse causation and selection bias [16]. The theory underpinning the Mendelian randomization (MR) approach is based on the random assortment of alleles at the time of gamete formation, which is equivalent to a randomized controlled trial in which people are randomly allocated to therapeutic interventions. The main concept of an MR study is based on three relationships: genotype-intermediate phenotype; intermediate phenotype-disease; and genotype-disease [17,18]. It can be used to identify causal environmental risk factors without several of the potential problems of observational epidemiology [19]. The MR approach can also strengthen causal conclusions by limiting reverse causation problems (biological, through exposure assignment, due to reporting bias), selection bias and regression dilution bias [19]. Figure 2 illustrates how this concept is applied to inform causal inference.
The analytic approach employed here for MR is the instrumental variable (IV) model, in which the genetic variant is treated as an instrument which is assumed to be associated with the disease only through its association with the intermediate phenotype [18]. This requires firstly the identification of one or more genetic variants (typically a single nucleotide polymorphism or SNP) as the IV that is known from published data to be associated with the phenotype [18]. The three key assumptions underlying the MR approach are: a) the genotype is associated with the phenotype; b) the genotype is independent of measured and unmeasured confounders; and c) that the effect of genotype on outcome is mediated only through the intermediate phenotype (no pleiotropy) [17,18].
In this study, we set out to evaluate the relationship between CRC, plasma 25-OHD levels and genotype at 4 genetic loci tagging genes involved in vitamin D metabolism (Table S1) and which have previously been shown to be associated with plasma vitamin D levels in a pooled meta-analysis of Genome Wide Association Studies [20]. In order to estimate whether there is a causal relationship between plasma 25-OHD and CRC risk we applied the control function IV estimator.

Ethics Statement
Ethical approval for the SOCCS study was obtained from the MultiCentre Research Ethics committee for Scotland (reference

Study Population
We studied a subset of 2,001 cases and 2,237 controls from a case-control study of CRC (Study Of Colorectal Cancer in Scotland, SOCCS). We aimed to recruit all incident cases (1999-2006) of adenocarcinoma of the colorectum presenting to surgical units in Scotland (18-79 years old). Exclusions were patient death before ascertainment, patient too ill to participate, recurrent cases, or patient unable to give informed consent due to learning difficulties or other medical conditions. We recruited about 40% of all incident cases in Scotland over the study period. During the same period controls were drawn randomly from a population-based register (community health index) and invited to participate. Participation rates among those approached were approximately 58% for cases and an estimated 57% for controls. More than 99% of the study participants were white Caucasian (see [21] for further recruitment details).
The subjects completed one questionnaire with lifestyle and cancer information and were asked to report their status one year prior to diagnosis or recruitment, including their general medical history, physical activity, smoking status and regular intake of aspirin and NSAIDs; height, weight and waist circumference were also recorded. Additionally, a semi-quantitative food frequency questionnaire (Scottish Collaborative Group FFQ, Version 6.41; http://www.foodfrequency.org) was completed by participants. It consisted of 150 foods, and the individuals were asked to describe the amount and frequency of each food on the list that they had eaten a year prior to diagnosis or recruitment. Further information about the questionnaires has been presented in detail previously [21]. In addition, each recruited cancer subject was assigned an American Joint Committee on Cancer (AJCC) stage derived from a synthesis of clinical, pathological and imaging information [22]. Finally, for a subset of the cancer cases (1,423, of whom 1,376 had 25-OHD measured) there was information about the symptoms they had developed before recruitment. We grouped the cases into four categories: (1) no symptoms (190 cases); (2) mild symptoms (290; including change in bowel habit, constipation, intermittent diarrhoea and constipation, more frequent stools, diarrhoea, loose stools, excess wind, mucus in stool and abdominal discomfort); (3) severe symptoms (220; including rectal bleeding, vomiting, weight loss, loss of energy, loss of appetite and nausea); and (4) both mild and severe symptoms (676).
Family history risk was determined according to the Scottish Executive cancer guidelines (http://www.sehd.scot.nhs.uk/). The criteria for high family history risk of colorectal cancer are: 1) at least three family members affected by colorectal cancer, or at least two with colorectal cancer and one with endometrial cancer, in at least two generations; one affected relative must be <50 years old at diagnosis and one of the relatives must be a first degree relative of the other two; or 2) presence of the HNPCC syndrome; or 3) untested first degree relatives of known gene carriers. The criteria for moderate risk are: 1) one first degree relative affected by colorectal cancer when aged <45 years old; or 2) two affected first degree relatives with one aged <55 years old; or 3) three affected relatives with colorectal or endometrial cancer, who are first degree relatives of each other and one a first degree relative of the consultant. Individuals who do not fulfil any of the above criteria are classified as low family history risk (Scottish Executive cancer guidelines). For this analysis family history was coded as low vs. medium/high family history of CRC.

Measurement of Plasma 25-OHD
The liquid chromatography-tandem mass spectrometry (LC-MS/MS) method was used to measure 25-hydroxyvitamin D3 and D2. This paper presents the 25-OHD (the total of 25-OHD2 and 25-OHD3); however, most of our samples contained no D2 (<3 ng/ml). The lower limit of detection with the LC-MS/MS method was 4 ng/ml for D3 [23]. The LC-MS/MS method was performed following standard protocols and appropriate quality control procedures (including multiple measurements of the same sample from our cohort and standardization against standard reference material, SRM 972) and it has been rated as the preferred 25-OHD measurement method for population studies by an international panel of experts [23]. More details about this method can be found elsewhere [23,24]. For the analysis, 25-OHD measurements were standardized to remove the prominent effect of the month when blood was taken on the 25-OHD concentration, as described in detail in Zgaga et al. [3].

Genotyping Data
DNA samples were accurately quantified by PicoGreen™ and quality controlled prior to dispatch. Genotyping was undertaken using TaqMan in the Wellcome Trust Clinical Research Facility (WTCRF) in Edinburgh. For rs2282679, 2,000 subjects were genotyped as part of an array-based candidate gene approach, using the Illumina Infinium I Custom array platform and performed by Illumina (San Diego). Case and control DNA samples were stored, genotyped and analysed in the same way. In addition, to avoid potential systematic batch-to-batch variation or bias, samples were anonymised as to affection status and were randomly distributed within plates. Data were subject to Illumina or WTCRF quality control procedures. Assumptions of Hardy-Weinberg Equilibrium (HWE) were tested using a chi-squared test.
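The HWE check mentioned above can be illustrated with a minimal sketch. It assumes a biallelic SNP summarised by three genotype counts and a one-degree-of-freedom chi-squared test; the function name `hwe_chi_square` is hypothetical and this is not the routine used in the study.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared test (1 df) for Hardy-Weinberg equilibrium at a biallelic SNP.

    Assumes both alleles are observed (a monomorphic SNP gives zero expected counts).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical genotype counts, for illustration only
chi2, p_value = hwe_chi_square(100, 200, 100)
```

A small p-value would suggest departure from HWE, which is often treated as a flag for possible genotyping problems.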

Statistical Analysis
The statistical package used was Stata version 11.0 (Stata Corp, College Station, Texas). Participants were divided into quintiles based on the combined distributions of cases and controls. Logistic regression models were used to estimate the strength of association between CRC risk and vitamin D plasma levels. The associations were tested in three logistic regression models (crude model, model I and model II). Model I was adjusted for age and sex, and Model II was adjusted for age, sex, Carstairs Deprivation Index, energy intake (MJ/day), smoking (non-smoker, former smoker and current smoker), body mass index (BMI, kg/m², continuous), regular NSAID intake (yes vs. no), family history of cancer (low vs. medium/high) and physical activity (hours of cycling and other sports activities, 4 groups). We also tested the association after stratification by sex, stage of cancer at diagnosis (AJCC), presence of symptoms, and time between diagnosis and recruitment. In addition, the association between CRC and rs2282679, rs12785878, rs10741657 and rs6013897 was tested. The dataset for this analysis was larger and comprised all SOCCS study participants for whom genotyping of the selected SNPs was successful (up to 5,449). We also tested the interaction between genotype and vitamin D plasma levels on CRC by comparing models with and without an interaction term between the two variables, using a likelihood ratio test.
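The quintile grouping on the pooled case and control distribution can be sketched as follows. This is an illustration only: the function name `assign_quintiles` is hypothetical, and the cut-point convention of Python's `statistics.quantiles` (default "exclusive" method) is an assumption that may differ slightly from the convention used by Stata.

```python
import bisect
import statistics

def assign_quintiles(cases, controls):
    """Assign each 25-OHD value a quintile (1-5) using cut points from the pooled sample."""
    pooled = list(cases) + list(controls)
    cuts = statistics.quantiles(pooled, n=5)  # four cut points between the five groups

    def to_quintile(value):
        # Number of cut points at or below the value, shifted to the range 1..5
        return bisect.bisect_right(cuts, value) + 1

    return [to_quintile(v) for v in cases], [to_quintile(v) for v in controls]
```

Because the cut points come from the pooled sample, cases and controls need not be evenly spread over the five groups, and that imbalance is what the logistic regression then quantifies.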
To estimate the causal odds ratio we applied the control function IV estimator for a 3-level categorical instrument Z (SNP, coded 0, 1, 2), a continuous intermediate phenotype X (plasma 25-OHD3) and a binary outcome Y (CRC). The first stage of the control function is a linear regression of the intermediate phenotype (X) on the instrument(s) (Z), which generates predicted values for the intermediate phenotype. The second stage is a logistic regression of the outcome (Y) on the predicted values of the intermediate phenotype, with the estimated residuals from the first-stage linear regression included as an additional covariate [25]. The rationale is that the first-stage residuals may be correlated with unmeasured confounding factors. In addition to a crude model, we also fitted a model adjusted for age and sex. The strength of the applied instruments was evaluated using the F statistic from the first-stage linear regression, with values lower than 10 taken as evidence of a weak instrument [26]. Finally, we applied four additional IV estimators, which are presented and described in the supplementary material (Methods S1).
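The two stages described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Stata code used in the study: it handles a single instrument with no covariates, fits the logistic model by Newton-Raphson, and all function names (`solve`, `first_stage`, `logistic_fit`, `control_function_or`, `simulate`) and the simulated data are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def first_stage(z, x):
    """Linear regression of phenotype x on instrument z: fitted values, residuals, F."""
    n = len(z)
    zbar, xbar = sum(z) / n, sum(x) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    szx = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
    slope = szx / szz
    intercept = xbar - slope * zbar
    fitted = [intercept + slope * zi for zi in z]
    resid = [xi - fi for xi, fi in zip(x, fitted)]
    rss = sum(r * r for r in resid)
    f_stat = (slope ** 2 * szz) / (rss / (n - 2))  # single instrument: F equals t squared
    return fitted, resid, f_stat

def logistic_fit(rows, y, iters=25):
    """Maximum-likelihood logistic regression by Newton-Raphson (rows include intercept)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            eta = sum(bj * rj for bj, rj in zip(beta, r))
            # Numerically stable logistic function
            p = 1.0 / (1.0 + math.exp(-eta)) if eta >= 0 else math.exp(eta) / (1.0 + math.exp(eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * r[a]
                for b in range(k):
                    hess[a][b] += w * r[a] * r[b]
        step = solve(hess, grad)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta

def control_function_or(z, x, y):
    """Causal OR per unit of x: logistic regression of y on first-stage fit and residuals."""
    fitted, resid, f_stat = first_stage(z, x)
    rows = [[1.0, f, r] for f, r in zip(fitted, resid)]
    beta = logistic_fit(rows, y)
    return math.exp(beta[1]), f_stat

def simulate(n=2000, seed=1):
    """Hypothetical data: confounder u raises both x and risk; x has no causal effect."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = (rng.random() < 0.3) + (rng.random() < 0.3)  # allele count 0, 1 or 2
        u = rng.gauss(0, 1)
        z.append(zi)
        x.append(0.5 * zi + u + rng.gauss(0, 1))
        risk = 1.0 / (1.0 + math.exp(0.2 - 0.8 * u))
        y.append(1 if rng.random() < risk else 0)
    return z, x, y

or_hat, f_stat = control_function_or(*simulate())
```

The sketch returns only a point estimate and the first-stage F statistic. Second-stage standard errors taken at face value would ignore the uncertainty in the first stage, so a resampling approach such as the bootstrap is one option for interval estimates.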

Genotype and Plasma 25-OHD Levels
There was no evidence for departure from HWE for any of the four SNPs: rs2282679 p-value = 0.25, rs12785878 p-value = 0.78, rs10741657 p-value = 0.07 and rs6013897 p-value = 0.52. The A allele of rs2282679 and the T allele of rs12785878 were associated with higher levels of plasma 25-OHD (Table 2). In particular, we found that rs2282679 and rs12785878 genotypes were associated with a decreased risk of 25-OHD deficiency defined as <10 ng/ml (rs2282679: for each A allele OR = 0.88, 95% CI 0.80, 0.98, p = 0.02; rs12785878: for each T allele OR = 0.89, 95% CI 0.79, 1.00, p = 0.05; rs10741657; Table 2). These associations were not different when we restricted the analysis to the controls (data not shown).

Genotype and CRC
Overall there was no evidence of an association between any of the four SNPs and CRC risk ( Table 3). When we stratified according to plasma 25-OHD levels, the rs10741657 SNP was associated with a decreased CRC risk for those of low plasma 25-OHD levels (per A allele OR = 0.88, 95% CI 0.75, 1.02, p = 0.09) and with an increased CRC risk for those of high plasma 25-OHD levels (OR = 1.12, 95% CI 0.98, 1.27, p = 0.09), with a p-value of interaction (p = 0.05).
Before applying the MR approach we assessed the IV assumptions. The first (that genotype is associated with the phenotype) was fulfilled since we selected four SNPs that were found to be linked to plasma 25-OHD levels in a pooled meta-analysis of Genome Wide Association Studies [20]. The second (genotype is independent of measured and unmeasured confounders) was tested by investigating whether the instruments were associated with any of the measured confounding factors that might influence the relationship between plasma 25-OHD levels and CRC (Table S6) and, as expected [27], there was no evidence for an association between these confounding factors and the genotypes. Finally, the third assumption (effect of genotype on outcome is mediated only through the intermediate phenotype) was tested by interrogating the set of pleiotropic links of genes and SNPs that we recently created [28]. For all SNPs there was no evidence of pleiotropy and they were only found to be linked to plasma vitamin D levels.
Using rs2282679 as the IV, the estimated causal effect of plasma 25-OHD on CRC risk was 0.94 (95% CI 0.49, 1.83), and the F-statistic for rs2282679 from the first stage of the IV analysis was 15.80 in the age and sex adjusted analysis (Table 4). Using rs12785878 as the IV, the causal effect was 1.23 (95% CI 0.60, 2.53), and the F-statistic for rs12785878 from the first stage of the IV analysis was 13.50 (Table 4). Using rs10741657 as the IV, the causal effect was 0.89 (95% CI 0.40, 1.98), and the F-statistic for rs10741657 from the first stage of the IV analysis was 10.89 (Table 4). Finally, using rs6013897 as the IV, the causal effect was 0.99 (95% CI 0.40, 2.45), and the F-statistic for rs6013897 from the first stage of the IV analysis was 0.98 (Table 4). The results of the other IV estimators are presented in Tables S7, S8, S9, and S10.
Furthermore, we combined these four SNPs to form three allele scores: 1) one allele score that combined all four SNPs, 2) an upstream allele score that combined the SNPs rs12785878 and rs10741657 and 3) a downstream allele score that combined the SNPs rs2282679 and rs6013897. We then used these allele scores as the IV. The causal effect for the overall allele score was 1.16 (95% CI 0.60, 2.23; F-statistic 16.52), for the upstream allele score was 0.94 (95% CI 0.46, 1.91; F-statistic 7.87) and for the downstream allele score was 0.93 (95% CI 0.53, 1.63; F-statistic 12.67) (Table 4).
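The three allele scores can be illustrated with a short sketch. It assumes an unweighted score, that is, a simple sum of allele counts (0, 1 or 2) per SNP, each coded in the direction of the 25-OHD-associated allele; the text does not state whether the scores were weighted, so this coding, the function name `allele_score` and the example genotypes are assumptions.

```python
UPSTREAM = ("rs12785878", "rs10741657")
DOWNSTREAM = ("rs2282679", "rs6013897")
ALL_SNPS = UPSTREAM + DOWNSTREAM

def allele_score(genotypes, snps):
    """Unweighted allele score: sum of allele counts over the chosen SNPs.

    Returns None when any of the SNPs is missing for this individual.
    """
    counts = [genotypes.get(snp) for snp in snps]
    if any(c is None for c in counts):
        return None
    return sum(counts)

# Hypothetical individual, for illustration only
person = {"rs2282679": 2, "rs12785878": 1, "rs10741657": 0, "rs6013897": 1}
overall = allele_score(person, ALL_SNPS)       # 4
upstream = allele_score(person, UPSTREAM)      # 1
downstream = allele_score(person, DOWNSTREAM)  # 3
```

Each score can then take the place of the single-SNP instrument Z in the first-stage regression.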

Levels of Plasma 25-OHD and CRC
In this study, low levels of plasma 25-OHD were associated with a higher risk of CRC in the whole sample and after stratification for sex, tumour stage and severity of symptoms at presentation. These results are in accordance with two recent meta-analyses of serum or plasma prospective studies [14,29]. In addition, a systematic review and meta-analysis on colorectal adenoma (CRA) showed a decreased risk of both incident and recurrent CRA for an increase of 25-OHD by 20 ng/ml [30]. However, a randomised clinical trial (RCT) (2,686 average-risk subjects) found no effect of vitamin D supplementation on incidence of CRC [31]. Similarly, a second RCT (Women's Health Initiative, WHI) investigating the effects of daily calcium and vitamin D supplementation for seven years showed no effect on CRC incidence among postmenopausal women [32]. However, it should be noted that neither of these RCTs was designed and powered for cancer as the primary outcome. In addition, reanalysis of the WHI RCT found that concurrent oestrogen therapy was an effect modifier of calcium and vitamin D supplementation, and for women who were not assigned to oestrogen therapy, calcium and vitamin D supplementation decreased CRC risk [33]. Data from case-control and cohort studies examining the associations between dietary vitamin D intake and CRC are inconclusive [13].

Genotype, Plasma 25-OHD Levels and CRC
The A allele of rs2282679 and the T allele of rs12785878 were associated with higher levels of plasma 25-OHD. These results are in accordance with a GWAS investigating genetic determinants of vitamin D deficiency [20]. The rs2282679 SNP is located in the GC gene, which encodes a vitamin D binding protein that binds and transports vitamin D (Figure 1) [20]. The rs12785878 SNP is located in the DHCR7 gene, which encodes the enzyme 7-dehydrocholesterol (7-DHC) reductase, which converts 7-DHC to cholesterol. 7-DHC is a precursor of vitamin D3. Mutations in DHCR7 may lead to decreased activity of the 7-DHC reductase and therefore to high levels of 7-DHC and vitamin D3 (Figure 1) [20]. However, the other two SNPs (rs10741657, rs6013897) that were also found to be strongly associated with vitamin D levels in the vitamin D genome wide association study were not associated with vitamin D status in our cohort. rs10741657 is located in the CYP2R1 gene, which encodes an enzyme thought to be involved in the 25-hydroxylation of vitamin D3 to 25-OHD [20]. rs6013897 is located in the CYP24A1 gene, which encodes an enzyme that initiates the degradation of 1,25(OH)2D [20]. The evidence for the role of the enzymes coded by CYP2R1 and CYP24A1 is limited and not replicated in other candidate studies [20]. None of the four SNPs was associated with CRC risk, although we found an interaction between 25-OHD levels and rs10741657. The results of the MR analysis did not support a causal relationship between plasma 25-OHD and CRC risk. Although not significant, an inverse relationship was noted when rs2282679 or rs10741657 was used as the instrument, but not when rs12785878 or rs6013897 was used. The results remained inconsistent when the three allele scores of the four SNPs were used. The fact that the inverse association that was observed when we applied the conventional epidemiological methods was not replicated when the MR approach was used might be due to several reasons. 
It is possible that unmeasured or latent variables confounded the associations and there is no true effect of vitamin D on CRC. An alternative explanation is that there might be reverse causality between vitamin D and CRC, given that the plasma of the cases was collected after diagnosis: it cannot be excluded that low plasma 25-OHD is the consequence of disease or a result of patients being bedbound and lacking the exposure to sun. However, when we looked at cases with very mild or no symptoms or cases at the very early stages of the disease we still observed an inverse association between 25-OHD levels and CRC. In addition, given the biological potential of vitamin D having a causal link with cancer, factors that affect the performance of the IV estimators might also explain these findings.
Major limitations of conventional instrumental variable approaches result from the strict assumptions that need to be satisfied for the method to be reliable. It is true that genotype is associated with phenotype; however, the SNPs that have been used as instruments are only weakly associated with the phenotype and explain only a small portion of trait variance. While we tested for common confounders, it is possible that hidden confounding from unmeasured variables affects the analysis. While we see no association between genotype and the outcome, weaker pleiotropic links cannot be excluded with certainty. We are currently working on new instrumental variable methods for assessing causality between 25-OHD concentrations and CRC, based on the platform for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo methods. This method improves on the classical Mendelian Randomisation approach as it allows for pleiotropic links between components of the model, enables easy inclusion of other covariates and does not depend as much on the strength of the instruments.
The wide application of genome wide association studies (GWAS) has made MR studies more feasible, as common SNPs linked to various intermediate phenotypes have been identified [34]. In 2011 to date, several applied MR studies have been published investigating causal relationships in a wide range of diseases including cancer [35], coronary heart disease [36,37], diabetes [38,39], mental disorders [40][41][42], lung function [43] and other diseases [44,45]. Some of these studies replicated the results of the conventional epidemiological methods and confirmed causality (or no causality) of the intermediate phenotype on the outcome [40,44,45], whereas others did not. One of the main reasons behind this lack of replication is the fact that the genetic instruments that are employed are generally weak and therefore the power of the MR studies will be inadequate [34].
A recent study on the power and sample size requirements of MR studies, based on the strength of the instruments, found that MR studies will require large (n>1000) and often very large (n>10000) sample sizes to draw causal conclusions that are statistically significant [34]. Based on the findings from those simulated analyses, most of the published MR studies are under-powered (see Table S11 for a review of the 11 MR studies published in 2011). Using simulations presented in the paper of Pierce et al. [34] and accounting for sample size, strength of the employed instruments and observed effect size, we estimate that the sample size in this study gave power of <0.35.

Conclusion
This study shows that higher plasma 25-OHD levels are associated with a lower CRC risk, a finding that is consistent with recent meta-analyses of prospective studies. However, this finding was not replicated when the MR approach was employed. This might be due to a lack of a true effect of vitamin D on CRC or due to reverse causation. It may also be due to weak instruments and limited statistical power. The lack of power is a common characteristic of many MR studies, and therefore a careful selection of instruments plus an adequate sample size are deemed necessary for this method to be able to support causal conclusions. Given the extent of vitamin D deficiency among individuals living in high latitudes, a large consortium of similar vitamin D and CRC studies and/or the application of alternative methods that are less sensitive to weak instrument bias [46] are necessary to refine the effect of vitamin D on CRC and other chronic diseases.

Supporting Information
Methods S1 Information about the following Instrumental Variable estimators: Wald (ratio) estimator, Two stage least squares estimator, Multiplicative structural mean models, Logistic structural mean models. (DOC)