Bivariate mixture models for the joint distribution of repeated serum ferritin and transferrin saturation measured 12 years apart in a cohort of healthy middle-aged Australians

Homozygosity for the p.C282Y substitution in the HFE protein encoded by the hemochromatosis gene on chromosome 6p (HFE) is a common genetic trait that increases susceptibility to iron overload. McLaren et al. used bivariate mixture modeling to analyze the joint population distribution of transferrin saturation (TS) and serum ferritin concentration (SF) measured for participants in the Hemochromatosis and Iron Overload Screening (HEIRS) Study. They identified four components (C1, C2, C3, and C4) with successively increasing means for TS and SF. They demonstrated that bivariate mixture modeling in TS and SF reflect the genetic locus of HFE and may isolate p.C282Y homozygotes from the general population. In the current study we used data from the another large cohort, the Australian HealthIron study of genetic and environmental modifiers of hereditary hemochromatosis, to validate the component analysis approach, to examine stability of component proportions over time and to determine if TS and SF values from an individual move between components at baseline and follow-up. Because sampling fractions from each p.C282Y / p.H63D genotype stratum are not equal, we used frequency weights based on the inverse of the probability of selection for invitation to participate. In the weighted female analytic cohorts, C4 captured most of C282Y homozygotes, and C2 was the largest component. We identified four components from the weighted male analytic cohort and C4 captured most of p.C282Y homozygotes. The bivariate mixture modeling approach suggested that the model is transferable from one white population to another, although estimated means within components may differ.


Introduction
Hereditary Hemochromatosis (HH) is a genetic predisposition disease in which the body absorbs and accumulates excess iron from the diet. This excess iron is deposited in certain organs and, if left untreated, can cause tissue damage, the most serious example of which is cirrhosis of the liver, which can lead to liver failure and/or hepatocellular carcinoma and death. Approximately 90% of people with HH are homozygous for a polymorphism in the HFE gene (rs1800562) that causes substitution of the amino acid tyrosine for cysteine in the HFE protein at position 282 (p.Cys282Tyr; p.C282Y) [1]. The primary biochemical measures of iron status are transferrin saturation (TS), a measure of iron absorption, and serum ferritin (SF), a measure of iron accumulation.
McLaren et al. used bivariate mixture modeling to analyze the joint population distribution of transferrin saturation (TS) and serum ferritin concentration (SF) measured for participants in the Hemochromatosis and Iron Overload Screening (HEIRS) Study [2]. They identified four components (C1, C2, C3, and C4) with successively age-adjusted increasing means for TS and SF from data contributed by more than 100,000 participants of multiple ethnicities. They demonstrated that the bivariate mixture modeling in TS and SF reflect the genetic locus of HFE and has the potential to isolate p.C282Y homozygotes from the general population [2][3][4].
In the current study we used data from the another large cohort, the Australian HealthIron study of genetic and environmental modifiers of hereditary hemochromatosis (HH), to validate the component analysis approach, to examine stability of component proportions over time and to determine if TS and SF values from an individual move from one component at baseline to another at follow-up. We note that most new findings require replication before the original report can be considered to have been confirmed. Demonstrating replication is our intent.

Sources of data
Between 1990 and 1994, 41,514 people were recruited to participate in the Melbourne Collaborative Cohort Study (MCCS) [5]. Baseline blood samples from 31,192 participants of northern European descent (the remainder were from southern European countries such as Greece, Italy, and Malta, populations in which the p.C282Y polymorphism is rare) were genotyped for the p.C282Y and p.H63D polymorphisms in the HFE gene. An HFE-stratified random sample of these participants (n = 1,438) were selected for invitation to participate in the HealthIron study. Of these, 1,052 participated in interviews and had a clinical examination between 2004 and 2006.
The study originally named "Genetics modifiers of haemochromatosis phenotype and environmental modifiers of clinical expression of haemochromatosis" was approved by the Human Research Ethics Committee (HREC) of the Cancer Council Victoria (CCV), Australia as Proj- The HealthIron participants are an HFE-genotype stratified random sample, not a screening cohort, so the sampling fractions from each p.C282Y/p.H63D genotype stratum are not equal. To reflect the HFE population proportions, we use frequency weights based on the inverse of the probability of selection for invitation to participate. These weights are 1 for p. C282Y homozygotes (all identified p.C282Y homozygotes were invited to participate), 3 for p. C282Y/p.H63D compound heterozygotes, 10 for p.C282Y simple heterozygotes and 50 for those with no p.C282Y and no p.H63D mutation. Participants who were previously treated for hemochromatosis, missing TS or SF values, missing age, and unknown genotype were excluded.
A data dictionary with variable names and a data spreadsheet are provided in a supporting information file that gives HFE-genotypes, demographic variables, and iron indices for HealthIron participants.

Adjusted transferrin saturation and serum ferritin concentration
A restricted cubic spline analysis was used to reflect the non-linear relationships between TS and age [6,7]. TS values were then adjusted for the age terms using separate multiple linear regression analyses by sex [6]. For each individual, the value of the regression residual was calculated and an adjusted transferrin saturation value was computed as the sum of the regression residual and a constant. The constant was calculated as the predicted TS at the median ages, which were 53.8 years at baseline and 65.5 years at follow-up for men, 53.2 years at baseline and 64.1 years at follow-up for women.
The distributions of serum ferritin concentration values were markedly skewed. The natural logarithm transformation was applied to induce the normality of the distributions. Following the same strategy stated above, the adjusted SF values were calculated for each individual at baseline and follow-up separately.

Statistical mixture modeling of the bivariate distribution of TS and SF
A complete description of the use of mixture model methodology used to analze the bivariate distribution of TS and SF is provided in McLaren et al. [2]. In the current study we applied this bivariate mixture modeling approach separately at baseline and follow-up stratified by sex. The EMMIX program, which implements the EM algorithm, was utilized to fit models and to assess the number of normal components [8,9]. The significance of the likelihood ratio test, the AIC and BIC statistics, and the estimate of overall correct allocation rate were used to determine the number of normal components to best fit the data [10][11][12]. The proportions and the means and variances for adjusted TS and adjusted SF within components of the bivariate distribution were computed. Allocation of each individual to a normal component was obtained at baseline and follow-up based on the highest posterior probability of component membership. A contingency table was constructed to display the shifting in the allocation from baseline to follow-up for individuals by HFE genotype, and Bowker's test of symmetry was performed [13]. Scatter plots of adjusted TS and SF with 95% confidence ellipses based on the 4-component mixture models were constructed.

Results
The final unweighted analytic sample consisted of TS and SF concentration values from 926 whites (426 men, 500 women) at baseline and in follow−up data from 771 whites (341 men, 430 women) (See Table 1). The final weighted sample included 23328 whites (10261 men, 13067 women) at baseline and in follow−up data from 21,045 whites (9,215 men, 11,830 women). (See Table 1). Table 2 displays the participants' characteristics in each analytic cohort. Table 2 shows that in the weighted sample, male C282Y homozygotes comprised at follow-up 43% of the initial population (26/60), female C282Y homozygotes comprised at follow-up 68% of the initial population (49/72), and the other genotypes retained 43% of males (9189/10,201) and 68% of females (11,781/12,995). This may indicate that more male C292Y homozygotes were lost for follow-up after initial evaluation and could eventually develop a more severe phenotype.    Table 3 displays the results of an analysis with four components for both the weighted female and weighted male analytic cohorts, respectively. Although there was some statistical evidence to support the five-component model, we chose to work with the four-component model to aid the comparison of our results with those of previously published research [2]. The largest component, C2, had normal mean TS (28% at baseline, 23% at follow-up) and SF (77 μg/L at baseline, 78 μg/L at the follow-up), with component proportion of 0.82 at baseline and 0.63 at follow-up respectively. In the weighted male cohorts, we determined 4 components at baseline and at the follow-up separately. At baseline, the largest component, C2, had normal mean TS value of 31% and a normal mean SF of 243 μg/L. However, at the follow-up, the largest component is C3 with a normal mean TS value of 32% and a normal mean SF of 250 μg/L. Fig 1 displays the scatter plots of adjusted TS and SF values with the indication of p.C282Y homozygotes in red dots from the weighted HealthIron data. The 95% confidence ellipses in black and green color were added based on the results of mixture models from HealthIron and HEIRS data, respectively.
Female p.C282Y homozygotes showed evidence that component transition probabilities shifted significantly over time; 29% had TS and SF values that were in the same bivariate TS/SF component at baseline and follow−up (Bowker's test of symmetry, p = 0.03). For the other genotypes, the percentages of females who had TS and SF values that were in the same bivariate component at baseline and follow−up were 40% for p.C282Y/p.H63D compound heterozygote, 47% for p.C282Y heterozygotes, 49% for p.H63D heterozygotes, 75% for p.H63D homozygotes and 68% for HFE wildtype. In the male cohort, the percentages of participants whose TS and SF values were in the same bivariate component at baseline and follow−up varied from 33% to 77% by genotype (Bowker's test of symmetry, all p-values < 0.0001, except for p.C282Y homozygotes where p = 0.33) (see Tables 4 and 5). adjusted increasing means for TS and SF at baseline and follow-up data by sex, separately. In the weighted female analytic cohorts, C4 captured most of p.C282Y homozygotes, and C2 was the largest component. In our study at baseline and follow-up, the means of adjusted TS and SF within components were higher than the means from HEIRS data. The estimated variances from the HealthIron data were much smaller than the estimated variances from HEIRS. In addition, we identified 4 components from the weighted male analytic cohort and C4 captured most of the p.C282Y homozygotes. Therefore, the bivariate mixture modeling approach suggested that the model is transferable from one white population to another, although estimated means within components may differ. The longitudinal aspect of this study is unique and illustrates that, with the exception of female p.C282Y homozygotes, the components of the mixture distributions are largely stable over time.
Strengths of this study include the prospective nature of recruitment for both the HEIRS and HealthIron studies and the comprehensive phenotyping through multiple sources: selfcompleted questionnaires, standardized biochemical analysis of all blood and tissue samples, and comprehensive clinical examination of participants by study physicians. Weaknesses include the different circumstances, motivations and time periods under which the two cohorts were recruited, and the different sampling fractions for each HFE-genotype group in the HealthIron study. This required the use of frequency weights to implement the mixture model analysis for which only approximate measures of precision of estimation are available.
In conclusion, our study provides another demonstration of the usefulness of mixture modeling in determining the presence and estimating the frequency of subpopulations in an overall population distribution. A unique aspect of our investigation was the access to longitudinal data enabling determination of subpopulation components at baseline and follow-up and assessment of a shift of component probabilities over time. A significant shift between baseline and follow-up was found for female p.C282Y homozygotes but not for male p.C282Y homozygotes.
Supporting information S1 File. A data dictionary with variable names and a data spreadsheet are provided in a supporting information file. The data spreadsheet gives HFE-genotypes, demographic variables, and iron indices for HealthIron participants. (XLS)