Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

A Multilevel Model to Estimate the Within- and the Between-Center Components of the Exposure/Disease Association in the EPIC Study

  • Francesco Sera ,

    Contributed equally to this work with: Francesco Sera, Pietro Ferrari

    Affiliation Molecular and Nutritional Epidemiology Unit, Cancer Prevention and Research, ISPO, Florence, Italy

  • Pietro Ferrari

    Contributed equally to this work with: Francesco Sera, Pietro Ferrari

    ferrarip@iarc.fr

    Affiliation Nutrition and Metabolism Section, International Agency for Research on Cancer, Lyon, France

A Multilevel Model to Estimate the Within- and the Between-Center Components of the Exposure/Disease Association in the EPIC Study

  • Francesco Sera, 
  • Pietro Ferrari
PLOS
x

Abstract

In a multicenter study, the overall relationship between exposure and the risk of cancer can be broken down into a within-center component, which reflects the individual level association, and a between-center relationship, which captures the association at the aggregate level. A piecewise exponential proportional hazards model with random effects was used to evaluate the association between dietary fiber intake and colorectal cancer (CRC) risk in the EPIC study. During an average follow-up of 11.0 years, 4,517 CRC events occurred among study participants recruited in 28 centers from ten European countries. Models were adjusted by relevant confounding factors. Heterogeneity among centers was modelled with random effects. Linear regression calibration was used to account for errors in dietary questionnaire (DQ) measurements. Risk ratio estimates for a 10 g/day increment in dietary fiber were equal to 0.90 (95%CI: 0.85, 0.96) and 0.85 (0.64, 1.14), at the individual and aggregate levels, respectively, while calibrated estimates were 0.85 (0.76, 0.94), and 0.87 (0.65, 1.15), respectively. In multicenter studies, over a straightforward ecological analysis, random effects models allow information at the individual and ecologic levels to be captured, while controlling for confounding at both levels of evidence.

Introduction

The relationship between aetiological factors and risk of disease in multicenter studies can be evaluated at the individual and at the aggregate (ecologic) levels [13], thus reflecting within- and between-center associations, respectively. In order to estimate simultaneously the two components of the exposure/disease association, multilevel models can be used [4]. In this framework, the role of (a possibly consistent list of) individual and aggregate level confounding variables can be accounted for. In addition, the use of random effects makes it possible to quantify the heterogeneity across centers of the association in multicenter studies, and to investigate the potential sources of such heterogeneity.

The characteristics of individual level and ecologic analyses have been extensively investigated [57]. Ecologic analysis is relatively efficient when the between-population exposure variability is large relative to the within-population variation [3, 8]. However, individual and ecologic analyses are prone to bias due to confounding [3]. Generally, individual level confounders are assessed through specifically designed questionnaires, e.g. dietary or lifestyle. Furthermore, over individual level studies, ecologic analysis may be more robust to classical (random) measurement errors but prone to systematic exposure misclassification [8].

Among the rationale that motivated the European Prospective Investigation into Cancer and Nutrition (EPIC), a large multicenter cohort study on diet and cancer conducted in ten Western European countries [9], was the idea of complementing aetiological evidence at the individual level with information available at the aggregate level through a comparison of dietary mean intakes with mean incidence rates [2]. In EPIC, cohorts from populations with diverse dietary exposures were combined, in order to increase between-subject variability and thereby reduce the impact of measurement errors of individual level dietary assessments [2]. In addition, substantial effort was invested in the collection of dietary measures harmonized across recruitment centers, by means of 24-hour dietary recall (24-HDR) [10]. 24-HDR measurements are used in linear regression calibration models [11] to correct for random and systematic errors in center-specific dietary questionnaire measurements [12, 13].

Multilevel analysis on survival data with random effects has been extensively described, including a gamma frailty model where the likelihood function was maximized by an EM algorithm [14], or a two-step algorithm to maximize a penalized partial likelihood [15], a proportional hazard models with random intercept and/or random slope that entails adaptation of penalized likelihood [16], Gibbs sampling [17, 18], or EM algorithm [19].

In this work a random effects piecewise exponential proportional hazards model was used to estimate the individual and aggregate components of the association between fiber intake and risk of colorectal cancer (CRC). This individual-level relationship was recently evaluated in the EPIC study using a standard approach [20]. A multilevel random effects model was used to estimate aggregate and individual-level components of the dietary fiber and CRC relationship, thus accounting for study center heterogeneity, both in terms of disease occurrence and exposure/disease associations [21, 22]. In order to correct for measurement errors, a linear regression calibration model was also employed [12]. The use of the random effects model is illustrated and discussed.

Materials and Methods

The rationale and methodology used in the EPIC study have been previously described in detail [9]. Briefly, in the EPIC cohort study, participants were recruited from 28 centers in 10 European countries: France (North-East, North-West, South, South-coast), Italy (Ragusa, Naples, Florence, Turin, Varese), Spain (Granada, Murcia, Navarra, San Sebastian, Asturias), United Kingdom (Oxford Health conscious and General population, Norfolk), The Netherlands (Bilthoven and Utrecht), Greece, Germany (Heidelberg and Potsdam), Sweden (Malmo and Umeå), Denmark (Aarhus and Copenhagen) and Norway (South-East and North-West). Between 1992 and 1998, information on dietary, anthropometric and lifestyle factors was collected from a total of 521,457 subjects (70% women), mostly aged from 35 to 70 years. Written informed consent was provided by all study participants. Ethical approval for the EPIC study was provided from the review boards of the International Agency for Research on Cancer (IARC) and local participating centers. Information on habitual individual dietary intake was assessed using different validated dietary questionnaires across participating countries to capture geographical specificity of diet [9]. A single standardized 24-hour recall (24-HDR) of actual food consumption during the previous day for a large (7% of the whole study) representative sub-sample of the entire EPIC cohort was also collected [10]. Lifestyle questionnaires were used to assess additional information from study participants, including physical activity, lifetime alcohol use, smoking history, level of education and medical history.

In this work, after systematic exclusions of participants with missing, incomplete, or implausible information, as detailed elsewhere [20], a total of 478,461 subjects were included in the analysis. Out of 5,262,998 person-years accumulated over an average follow-up time of 11.0 years, 4,517 CRC cases were identified. Cancer incidence data were coded in accordance with the 10th Revision of the International Classification of Diseases (ICD-10) and the second revision of the International Classification of Disease for Oncology (ICDO-2).

Statistical analysis

Variability of exposure and confounders.

The between- and within-center variability of dietary and non-dietary exposures were evaluated. For this purpose, variance components were estimated using a mixed model where the variable ‘center’ was modeled with a random effect, as detailed in S1 Appendix. Variance component terms were used to compute the intraclass correlation coefficient (ICC), which expresses the proportion of exposure heterogeneity at the center-level compared to total variability. The extent of between- and within-center variability was evaluated also for dichotomous exposure variables, and the associated ICC values.

The disease model.

In this work, age was used as main the time scale variable, and it was assumed that the hazard function, λ(t), was piecewise constant in the s = 1,…,S age intervals, as Where h0 is the baseline hazard, and g1 = 1,g2,g3,…,gS, are the relative hazard. We considered S = 7 age intervals, i.e. for the categories <49 years, 50–54, 55–59, 60–64, 65–69, 70–74, >75 years with first and last age intervals being open-ended to avoid a sparse number of CRC cases. The representation of a classical survival likelihood with a piecewise constant hazard by a Poisson likelihood is well known [23]. Thus, the time-to-event data becomes a counting process of 0 events or 1 event occurring with an offset of the time experienced at risk for the event within each time interval. Each study subject i = 1,…,nj within center j = 1,…,J was represented by SiS observations equal to the number of time intervals up to failure or censoring. For each subject, the intervals Si ranged from the interval corresponding to age at entry (Si,min) to the interval corresponding to age at failure/censoring (Si,max). A variable indicating the presence of the event of interest was created, for interval k = Si.min,…,Si,max as A random effects model was fitted as , and Indicated with τk and τk−1, the extreme of age interval k, for k = Si.min−1,…,Si,max−1, the terms tijk are equal to 5 years, whereby for the first interval k = Si,min and for the last interval k = Si,max, the terms tijk express the time unit spent in interval k, that is tijk = τk – age at entry(i), and tijk = τk – age at exit(i)τk−1, respectively. The terms αk = ln(gk) indicate the logarithm of the relative hazard were estimated as coefficients of a list of indicator variables hijk, i.e. one for each age interval except the category 60–64 years that was set as reference. The term δ0j is a random effect that models the center effect on the logarithm of the baseline hazard among men. Sex was included as an indicator variable (0 = men, 1 = women), and modelled with a random effect coefficient δ1j. The variability of the outcome is captured by the term in men, while in women it is equal to . In this way, the disease model allows the variability of the outcome across centres to be sex-specific. A negative value of the covariance term σ01 permits the variability in women to be lower than in men.

The variable Xij is the exposure factor of interest, ie. dietary fiber from questionnaire measurements in this work; the terms are the center-specific means, and Zij and are vectors of confounding variables at the individual and aggregate levels, respectively. The coefficients (βW and βB) model the relationship of dietary fiber and risk of colorectal cancer at the individual and aggregate levels, respectively. The terms βWj (random slope) have a fixed (βW) and a random (uwj) component, which capture heterogeneity of the exposure/disease associations across centers. The terms σW,0 and σW,1 express the covariance between random intercepts and random slope in men, as σW,0 and women, as σW,0 + σW,1. In the case of a positive association, a positive (negative) covariance would suggest that the association between dietary fiber and CRC risk is more (less) pronounced in centers displaying higher CRC incidence. The evaluation of the heterogeneity of the exposure/risk of disease association across centers is of particular relevance in international collaborative studies. Unlike a fixed effects analysis, where this evaluation involves a large number of interaction terms, in a multilevel model, the variance associated to a random slope constitutes a succinct way to quantify the heterogeneity of associations. In addition, potential heterogeneity can be investigated in terms of aggregate level variables, thus exploring the extent of cross-level confounding in the association.The vector of coefficients, γW and γB, model the relationship between confounding factors and cancer risk at the individual and aggregate levels, respectively. For the sake of simplicity, γW is assumed constant across centers, but this assumption could possibly be relaxed.

Adjustment for confounding variables was performed at the individual and at the aggregate level in the disease model. At the individual level the model included: red meat baseline alcohol consumption, energy from fat, energy from non-fat and non-alcohol sources, participant’s height and weight, physical activity (inactive/moderately inactive, moderately active/active, unknown), smoking habits (never, former/current, unknown) and educational level (no degree to secondary school, university degree, not specified/missing), consistently with a recent EPIC study [20].

At the aggregate level, models included center-specific means of energy from sources other than fat and alcohol, baseline alcohol consumption, latitude, the percentage of subjects with university degrees and the percentage of former and current smokers. Moreover, latitude of participating centers was considered to account for potential residual ecologic confounding.

In this work, models with different degrees of complexity have been fitted and compared. First, a model with sex-specific random intercept terms was estimated (model (1)). Then, individual level variables were included (model (2)). Aggregate level variables were added, after centering individual level variables to center-specific means (model (3)). Finally, a model with random slopes for the term was fitted (model (4)). Each model was adjusted for individual- and aggregate- level confounding factors Z, as detailed in S2 Appendix.

For proper comparison of dietary measurements between cohorts, and to partially correct risk estimates for the effect of dietary measurement errors, a fixed-effect multivariate linear regression calibration model was used [11, 24], in which 24-hour dietary recall data from an 8% sample from all participating cohorts [25] were regressed on dietary questionnaire (DQ) measurements with adjustment for the same list of covariates mentioned above, and further controlled for the day of the week day and the season of recall measurements [12]. In a linear regression calibration model, the conditional expectations of true intake given the questionnaire measurements and the set of covariates, E(T|Q,Z), are computed, under the assumption that 24-hour dietary recall measurements are unbiased (11). E(T|Q,Z) values are used in the disease model. The variability of parameters was corrected to account for uncertainty due to the estimation process of regression calibration. Ten bootstrap samples were randomly generated, and the models (1) to (4) were fitted for each of them. The Rubin’s rule was used to obtain a correct estimate of the standard error by combining information across the ten different samples [26]. Linear regression calibration models were used for dietary fiber, red meat, energy from fat and energy from non-fat and non-alcohol sources. Alcohol intake was not corrected for measurement errors as the complexity of dealing with excess zero values in 24-HDRs due to the episodically consumption pattern of this factor, was beyond the scope of the present work.

The proportional hazards assumption was satisfied and evaluated via the inclusion into the disease model of interaction terms between dietary fiber and the time variable, tijk (data not shown). Statistical tests were two-sided. Multivariate linear regression calibration and piecewise exponential proportional hazards models were fitted using MLwiN [27] and Stata [28]. The syntax to carry out the different steps of the analysis is reported in S3 Appendix.

Results

The center-specific frequencies of CRC cases, means and standard deviations of dietary fiber intake (g/day) using DQ measurements and after linear calibration are reported in Tables 1 and 2, for women and men respectively. In women, DQ dietary fiber means ranged from 18.7 g/day in Malmö to 27.0 g/day observed in Naples, and their ICC values were equal to 0.08 and 0.29 before and after linear calibration. In men, dietary fiber means went from 21.1 g/day in Malmo to values exceeding 30 g/day in the center of Murcia. ICC values were 0.12 and 0.30, before and after linear calibration. As previously observed and discussed in the EPIC study [29], linear regression calibration reduces the variability of the within-center component, thus resulting in larger ICC values.

thumbnail
Table 1. Centre-specific Frequency of Colorectal Cancer Cases (CRC), Person-Years (PY), and Age Standardized Incidence Ratio (SIR) and Associated 95%CI, and Centre-Specific Means of Dietary Fiber Intake (g/day) and Standard Deviation (SD), EPIC Original Data and After Linear Regression Calibration; Women.

https://doi.org/10.1371/journal.pone.0117815.t001

thumbnail
Table 2. Centre-specific Frequency of Colorectal Cancer Cases (CRC), Person-Years (PY), and Age Standardized Incidence Ratio (SIR) and Associated 95%CI, and Centre-Specific Means of Dietary Fiber Intake (g/day) and Standard Deviation (SD), EPIC Original Data and After Linear Regression Calibration; Men.

https://doi.org/10.1371/journal.pone.0117815.t002

Means and frequencies of dietary and lifestyle variables at the aggregate level are reported in S1 Table and S2 Table. ICC values ranged from 8% (weight and the percentage of smokers) to 26% (height) in women, and from 3% (weight and the percentage of physically active subjects) to 25% (red meat intake) in men. The large ICC values observed for height (0.25 and 0.24, in women and men respectively) reflects the diversity of the EPIC populations.

The association between dietary fiber intake and CRC risk quantified by rate ratios (RR) estimated in a model using individual-level variables only (model (2) in S2 Appendix) was 0.90 (95% confidence interval (CI): 0.85, 0. 96) for a 10 g/d increment, as reported in Table 3. In a random effects model with dietary fiber and confounding factors included at the individual and aggregate levels (model (3)), RR were 0.90 (0.85, 0.96) and 0.85 (0.64, 1.14), for individual and aggregate levels respectively.

thumbnail
Table 3. Estimates of Rate Ratios (RR), 95%CI and Variance Components (VC) Obtained in Models (2) and (3), as Detailed in S2 Appendix, Using Individual and Aggregate Level Variables in the EPIC Study; Original data.

https://doi.org/10.1371/journal.pone.0117815.t003

In models with a between-center random slope (model (4) in S2 Appendix, S3 Table), the fixed component of the coefficient was very similar to the RR estimate in model (3). The variance associated to the random component was 0.003 (SE = 0.004, p = 0.5). In men, the correlation between the intercept and the slope residuals was estimated as 0.30, suggesting that the protective association between fiber consumption and CRC was stronger in centers with elevated CRC incidence rates. Random slope models can be easily extended to test whether aggregate level variables explain some of the between-centers heterogeneity in the fiber/CRC association (model (4)). This can be achieved by making the random slope coefficient dependent on aggregate level variables in a random slope model [4]. However, in our study, there was limited heterogeneity across centers in the exposure/disease associations.

Multilevel models (1) to (4) were fitted with predicted values data after multivariate linear calibration models (Table 4 and S4 Table). The individual level coefficient of dietary fiber intake showed a more pronounced protective association with CRC risk, 0.85 (95%CI: 0.77, 0.93), (model (2)). The calibrated point estimates of individual and aggregate level fiber intake were equal to 0.85 (95%CI: 0.76, 0.94) and 0.87 (95%CI: 0.65, 1.15) respectively (Table 4), according to model (3). After linear calibration, the variance associated to the intercept random components were 0.020 (SE = 0.011) and 0.020 (SE = 0.007), in men and women respectively. Comparing model (3) with (2), random intercept variance components displayed a decrease equal to 47% and 62%, in men and women respectively.

thumbnail
Table 4. Estimates of Rate Ratios (RR), 95%CI and Variance Components (VC) Obtained in Models (2) and (3), as Detailed in S2 Appendix, Using Individual and Aggregate Level Variables in the EPIC Study; Linear Regression Calibration.

https://doi.org/10.1371/journal.pone.0117815.t004

Discussion

In this work, within- and between-center components of the association between dietary exposure and risk of disease were simultaneously estimated. A multilevel piecewise exponential proportional hazards model was used to complement standard evidence on the relationship between dietary fiber intake and risk of CRC at the individual level [20] with results obtained at the aggregate level. Results showed that dietary fiber intake was consistently negatively associated with the risk of CRC based on both levels of evidence.

It is very natural in the context of a multicenter study like EPIC to exploit and use each level of evidence to evaluate the relationship between dietary exposure and risk of disease. The use of information from the between-center component with a random effect model is based on the assumption that individual and aggregate components of the exposure/disease association are of similar magnitude. In the context of an etiological relationship that involves dietary factors, it is reasonable to assume that individual- and aggregate- level inference should provide consistent evidence, if aggregate-level confounding is adequately controlled. However, the two components can result in rather different estimates, possibly due to uncontrolled (residual) confounding and/or exposure measurement errors [8].

An important source of bias in ecological studies is the lack of information on confounding factors at the aggregate level [6]. In addition, confounding by center, typically arising from an uneven distribution of confounders across centers, was identified as one of the factors responsible of ecological bias or fallacy, that determines a difference between individual and ecologic risk estimates [57, 30, 31]. In the context of a multilevel model in a multicenter study, information is available at the aggregate level, thus potentially addressing both aggregate-level and individual-level confounding.

Inclusion of individual-level covariates explained an estimated 22% and 32% of the cross-center variability in CRC incidence, in women and men respectively (results obtained by comparing the variance components in models (2) and (1) in S2 Appendix), suggesting that some of the observed heterogeneity across center in CRC incidence was explained by individual level variables.

The association between alcohol consumption and CRC risk at the aggregate level was consistent with individuallevel associations observed in EPIC [32]. The percentage of subjects with university degrees and the percentage of former and current smokers were, respectively, negatively and positively associated with CRC risk. These variables are relevant determinants of unhealthy patterns of dietary and lifestyle factors [33, 34]. Overall, the association of confounders at the aggregate level was stronger than individual level counterparts. A variable expressing study center latitude was included in the random effects models, in an effort to explore potential residual confounding not captured by the variable center, for example exposure to vitamin D.

Our strategy to include potential individual-level confounding factors followed the standard practice of including in the model variables both associated to dietary fiber and the risk of colorectal cancer. The choice of aggregate level confounding followed a very similar strategy. We consistently used the same confounders at the individual and at the aggregate levels. In our study, a general principle of parsimony was used to include confounder factors at the aggregate level, mainly to account for the fact that aggregate level analysis does not carry the same amount of information, in terms of sample size, ie. statistical units. Means of energy from non-fat and non-alcohol sources, and center-specific percentages of former and current smokers displayed weak associations with dietary fiber (results not shown), but were related to CRC risk at the aggregate level.

In our working example, individual-level variables were centered on sex- and study center-specific means to make disease parameters readily interpretable as within- and between-center estimates of associations. As a result, individual- and aggregate- level variables were orthogonal, de facto limiting the possibility of evaluating cross-level confounding in our study. In addition, the fact that individual-level effect estimates for fiber intake were very similar in models (2) (exposure and confounders included as ‘overall’ variables) and (3) (exposure and confounders included as ‘centered’ variables) may suggest that the extent of cross-level confounding was limited in our study.

Cross-level interaction can be explored by relating the variability of association across centers in terms of aggregate-level variables. In this way, random slope coefficients that capture the association (between dietary fiber and risk of colorectal cancer in our study) at the individual level are dependent on aggregate-level variables [4]. The lack of heterogeneity across centers of exposure/disease associations in our study suggested there was little such variability to be explained.

In agreement with Sheppard and Prentice [8], the parameter of interest for biological inference, particularly in nutritional epidemiology, is the individual-level association of fiber with colorectal cancer risk (βW). The overall relationship between exposure and risk of disease was also fitted, in a model with no inclusion of the variable ‘center’. Neuhaus and Kalbfleisch pointed out that this model can be biased, if between- and the within-center associations are heterogeneous, with the degree of bias also depending on the degree of clustering of the exposure [35]. In this study, the between- and within-center estimates were of similar magnitude. The overall RR estimate was equal to 0.89 (95%CI: 0.85, 0.94), mostly reflecting the within-center component of the association. After linear calibration, dietary fiber showed a higher degree of clustering (ICC∼0.3), and the overall estimate was 0.87 (0.81, 0.94). It was noteworthy that the overall calibrated estimate displayed a gain in efficiency of 26% when compared to calibrated within-center estimates in model (3). This model did not include the possible increase in the standard errors due to clustering of the outcome, which was nevertheless low in our study.

Non-differential exposure misclassification may cause bias in either direction in multilevel models, depending upon the nature of the exposure (binary, or continuous) and the level of measurement, aggregate or individual [36]. DQ measurements are assumed to be affected by systematic misclassification, while the one single replicate of 24-HDR per subject are hypothesised to be relatively accurate but rather imprecise, as a result of random measurement errors [37, 38], possibly reflecting systematic misclassification and day-to-day within-person variability randomly distributed between subjects [39]. As a result, in multilevel studies in nutritional epidemiology, the aggregate level relationship between a dietary variable and an outcome may be biased by errors mostly originating from the lack of comparability of different dietary questionnaires across EPIC centers [2]. In addition, the accuracy of DQ measurements may have a geographical component in terms of accuracy. To minimize the effect of measurement errors in DQs (individual level) and to correct the lack of comparability of DQ measurements across countries (aggregate level), a multivariate linear regression calibration model was used [12].

In the multilevel model, both the within- and between-center components of the association between dietary fiber intake and CRC risk showed an inverse association, with no statistically significant difference between the two estimates. The slightly stronger relation observed at the aggregate level may suggest that information derived from between-population sources can be more robust to classical measurement errors than the corresponding individual level analysis [8]. After linear calibration, the within- and between-center components of the association were closer in magnitude. RR estimates for individual- and aggregate- level dietary fiber intake were 0.85 (0.76, 0.94) and 0.87 (0.65, 1.15), respectively.

In this study, the multivariable piecewise exponential proportional hazards random effects model showed that individual- and aggregate-level evidence were consistent in magnitude and direction, particularly after measurement error correction. The EPIC study offers an ideal design to explore the use of advanced statistical techniques, information on disease outcomes and dietary exposure(s) being available at different levels of evidence. In a relatively simple multilevel model these different components can ultimately become very informative.

Supporting Information

S1 Appendix. Calculation of Intraclass correlation coefficients.

https://doi.org/10.1371/journal.pone.0117815.s001

(DOCX)

S1 Table. Center-Specific Means of Energy from Fat (En-fat) and Non-Fat and Non-alcohol Sources (E-NFNA), Red Meat Intake, Baseline Alcohol, Height, Weight, Percentage of Study Subjects (moderately) Active (PA), with University Degree (EDU), of (former and current) Smokers.

Women.

https://doi.org/10.1371/journal.pone.0117815.s004

(DOCX)

S2 Table. Center-Specific Means of Energy from Fat (En-fat) and Non-Fat and Non-alcohol Sources (E-NFNA), Red Meat Intake, Baseline Alcohol, Height, Weight, Percentage of Study Subjects (moderately) Active (PA), with University Degree (EDU), of (former and current) Smokers.

Men.

https://doi.org/10.1371/journal.pone.0117815.s005

(DOCX)

S3 Table. Estimates of Rate Ratios (RR), 95%CI and Variance Components (VC) Obtained in Models (1), and (4), as Detailed in Appendix B, Using Individual and Aggregate Level Variables in the EPIC Study.

Original data.

https://doi.org/10.1371/journal.pone.0117815.s006

(DOCX)

S4 Table. Estimates of Rate Ratios (RR), 95%CI and Variance Components (VC) Obtained in Models (1), and (4), as Detailed in Appendix B, Using Individual and Aggregate Level Variables in the EPIC Study.

Linear Regression Calibration.

https://doi.org/10.1371/journal.pone.0117815.s007

(DOCX)

Acknowledgments

A special thanks to Rudolf Kaaks, Elio Riboli and Heiner Boeing for their fruitful discussions around this topic over the past years. This work was made possible by the continuous efforts and dedication of the EPIC PIs, Scientists, Post-doc Fellows, PhD students and Data-managers involved in the study.

Author Contributions

Conceived and designed the experiments: FS PF. Analyzed the data: FS. Contributed reagents/materials/analysis tools: PF. Wrote the paper: PF FS.

References

  1. 1. Ferrari P, Carroll RJ, Gustafson P, Riboli E (2008) A Bayesian multilevel model for estimating the diet/disease relationship in a multicenter study with exposures measured with error: the EPIC study. Stat Med 27: 6037–6054. pmid:18951369
  2. 2. Kaaks R, Plummer M, Riboli E, Esteve J, van Staveren W (1994) Adjustment for bias due to errors in exposure assessments in multicenter cohort studies on diet and cancer: a calibration approach. Am J Clin Nutr 59: 245S–250S. pmid:8279434
  3. 3. Navidi W, Thomas D, Stram D, Peters J (1994) Design and analysis of multilevel analytic studies with applications to a study of air pollution. Environ Health Perspect 102 Suppl 8: 25–32. pmid:7851327
  4. 4. Snjiders TAB, Bosker R. Multilevel analysis: an introduction to basic and advanced multilevel modelling. London: Sage Publications; 1999.
  5. 5. Greenland S (2002) A review of multilevel theory for ecologic analyses. Stat Med 21: 389–395. pmid:11813225
  6. 6. Greenland S, Morgenstern H (1989) Ecological bias, confounding, and effect modification. Int J Epidemiol 18: 269–274. pmid:2656561
  7. 7. Greenland S, Robins J (1994) Invited commentary: ecologic studies—biases, misconceptions, and counterexamples. Am J Epidemiol 139: 747–760. pmid:8178788
  8. 8. Sheppard L, Prentice RL (1995) On the reliability and precision of within- and between- population estimates of relative rate parameters. Biometrics 51: 853–863. pmid:7548704
  9. 9. Riboli E, Hunt KJ, Slimani N, Ferrari P, Norat T, et al. (2002) European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutr 5: 1113–1124. pmid:12639222
  10. 10. Slimani N, Ferrari P, Ocke M, Welch A, Boeing H, et al. (2000) Standardization of the 24-hour diet recall calibration method used in the european prospective investigation into cancer and nutrition (EPIC): general concepts and preliminary results. Eur J Clin Nutr 54: 900–917. pmid:11114689
  11. 11. Rosner B, Willett WC, Spiegelman D (1989) Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med 8: 1051–1069. pmid:2799131
  12. 12. Ferrari P, Day NE, Boshuizen HC, Roddam A, Hoffmann K, et al. (2008) The evaluation of the diet/disease relation in the EPIC study: considerations for the calibration and the disease models. Int J Epidemiol 37: 368–378. pmid:18180242
  13. 13. Kaaks R, Riboli E (1997) Validation and calibration of dietary intake measurements in the EPIC project: methodological considerations. European Prospective Investigation into Cancer and Nutrition. Int J Epidemiol 26 Suppl 1: S15–S25. pmid:9126530
  14. 14. Klein JP (1992) Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48: 795–806. pmid:1420842
  15. 15. Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. New York: Springer; 2000.
  16. 16. Yamaguchi T, Ohashi Y, Matsuyama Y (2002) Proportional hazards models with random effects to examine centre effects in multicentre cancer clinical trials. Stat Methods Med Res 11: 221–236. pmid:12094756
  17. 17. Matsuyama Y, Sakamoto J, Ohashi Y (1998) A Bayesian hierarchical survival model for the institutional effects in a multi-centre cancer clinical trial. Stat Med 17: 1893–1908. pmid:9777685
  18. 18. Wong MC, Lam KF, Lo EC (2006) Multilevel modelling of clustered grouped survival data using Cox regression model: an application to ART dental restorations. Stat Med 25: 447–457. pmid:16143989
  19. 19. Vaida F, Xu R (2000) Proportional hazards model with random effects. Stat Med 19: 3309–3324. pmid:11122497
  20. 20. Murphy N, Norat T, Ferrari P, Jenab M, Bueno-de-Mesquita B, et al. (2012) Dietary fibre intake and risks of cancers of the colon and rectum in the European prospective investigation into cancer and nutrition (EPIC). PLoS One 7: e39361. pmid:22761771
  21. 21. Goldstein H, Browne W, Rasbash J (2002) Multilevel modelling of medical data. Stat Med 21: 3291–3315. pmid:12375305
  22. 22. Skrondal A, Rabe-Hesketh S. Generalized latent variable modeling: Multilevel, longitudinal and structural equation models. Chapman & Hall/CRC; 2004.
  23. 23. Swihart BJ, Caffo BS, Crainiceanu CM, Punjabi NM (2012) Mixed effect Poisson log-linear models for clinical and epidemiological sleep hypnogram data. Stat Med 31: 855–870. pmid:22241689
  24. 24. Kaaks R, Riboli E, van Staveren W (1995) Calibration of dietary intake measurements in prospective cohort studies. Am J Epidemiol 142: 548–556. pmid:7677134
  25. 25. Slimani N, Kaaks R, Ferrari P, Casagrande C, Clavel-Chapelon F, et al. (2002) European Prospective Investigation into Cancer and Nutrition (EPIC) calibration study: rationale, design and population characteristics. Public Health Nutr 5: 1125–1145. pmid:12639223
  26. 26. Rubin DB. Multiple imputation for nonresponse in surveys. Hoboken, NJ: John Wiley & Sons; 2004.
  27. 27. Rasbash J, Charlton C, Browne WJ, Healy M, Cameron B (2009) MLwiN Version 2.1. Centre for multilevel modelling, University of Bristol.
  28. 28. StataCorp (2011) Stata Statistical Software: Release 12. College Station, TX: StataCorp LP.
  29. 29. Ferrari P, Kaaks R, Fahey MT, Slimani N, Day NE, et al. (2004) Within- and between-cohort variation in measured macronutrient intakes, taking account of measurement errors, in the European Prospective Investigation into Cancer and Nutrition study. Am J Epidemiol 160: 814–822. pmid:15466504
  30. 30. Morgenstern H (1995) Ecologic studies in epidemiology: concepts, principles, and methods. Annu Rev Public Health 16: 61–81. pmid:7639884
  31. 31. Morgenstern H. Ecologic studies. In: Rothman K. J., Greenland S., Lash T., editors. Modern Epidemiology. Philadelphia: Lippincott Williams & Wilkins; 2008. pp. 511–531.
  32. 32. Ferrari P, Jenab M, Norat T, Moskal A, Slimani N, et al. (2007) Lifetime and baseline alcohol intake and risk of colon and rectal cancers in the European prospective investigation into cancer and nutrition (EPIC). Int J Cancer 121: 2065–2072. pmid:17640039
  33. 33. Dallongeville J, Marecaux N, Fruchart JC, Amouyel P (1998) Cigarette smoking is associated with unhealthy patterns of nutrient intake: a meta-analysis. J Nutr 128: 1450–1457. pmid:9732304
  34. 34. Darmon N, Drewnowski A (2008) Does social class predict diet quality? Am J Clin Nutr 87: 1107–1117. pmid:18469226
  35. 35. Neuhaus JM, Kalbfleisch JD (1998) Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics 54: 638–645. pmid:9629647
  36. 36. Brenner H, Savitz DA, Jockel KH, Greenland S (1992) Effects of nondifferential exposure misclassification in ecologic studies. Am J Epidemiol 135: 85–95. pmid:1736664
  37. 37. Ferrari P, Roddam A, Fahey MT, Jenab M, Bamia C, et al. (2009) A bivariate measurement error model for nitrogen and potassium intakes to evaluate the performance of regression calibration in the European Prospective Investigation into Cancer and Nutrition study. Eur J Clin Nutr 63 Suppl 4: S179–S187. pmid:19888273
  38. 38. Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, et al. (2003) Structure of dietary measurement error: results of the OPEN biomarker study. Am J Epidemiol 158: 14–21. pmid:12835281
  39. 39. Willett W (1989) An overview of issues related to the correction of non-differential exposure measurement error in epidemiologic studies. Stat Med 8: 1031–1040. pmid:2678348