A Genome-Wide Association Study Identifies rs2000999 as a Strong Genetic Determinant of Circulating Haptoglobin Levels

Haptoglobin is an acute phase inflammatory marker. Its main function is to bind hemoglobin released from erythrocytes to aid its elimination, and thereby haptoglobin prevents the generation of reactive oxygen species in the blood. Haptoglobin levels have been repeatedly associated with a variety of inflammation-linked infectious and non-infectious diseases, including malaria, tuberculosis, human immunodeficiency virus, hepatitis C, diabetes, carotid atherosclerosis, and acute myocardial infarction. However, a comprehensive genetic assessment of the inter-individual variability of circulating haptoglobin levels has not been conducted so far. We used a genome-wide association study initially conducted in 631 French children, followed by replication in three additional European sample sets, and identified a common single nucleotide polymorphism (SNP), rs2000999, located in the Haptoglobin gene (HP), as a strong genetic predictor of circulating haptoglobin levels (P overall = 8.1×10−59), explaining 45.4% of its genetic variability (11.8% of Hp global variance). The functional relevance of rs2000999 was further demonstrated by its specific association with HP mRNA levels (β = 0.23±0.08, P = 0.007). Finally, SNP rs2000999 was associated with decreased total and low-density lipoprotein cholesterol in 8,789 European children (P total cholesterol = 0.002 and P LDL = 0.0008). Given the central position of haptoglobin in many inflammation-related metabolic pathways, the relevance of rs2000999 genotyping when evaluating haptoglobin concentration should be further investigated in order to improve its diagnostic, therapeutic and/or preventive impact.


Introduction
Human haptoglobin (Hp) is an acute phase inflammatory glycoprotein essentially synthesized by the liver and up-regulated by cytokines [1]. Hp is polymorphic with two co-dominant alleles, Hp1 and Hp2, encoded by the Haptoglobin (HP) gene located on chromosome 16 and resulting in three common isoforms: Hp1-1, Hp2-2 and Hp2-1 (called HP 'common polymorphism') [2]. In normal physiological conditions, Hp protein concentration in blood ranges between 0.3 and 2.0 g/L in adults [3], but its level is significantly lower during the first decade of life [4]. The main property of Hp is to scavenge circulating hemoglobin (Hb) released by hemolysis or normal red blood cell turnover [5]. The resulting circulating Hp-Hb complexes are eliminated by Kupffer cells in the liver, preventing the generation of reactive oxygen species [2,6]. Therefore, Hp plays an important role in preventing the renal damage and iron loss that can follow intravascular hemolysis. Hp is also able to bind apolipoprotein (Apo) A-I [7] to protect the Apo A-I effector domain of lecithin-cholesterol acyltransferase against oxidative stress, and Hp consequently modulates high-density lipoprotein (HDL) function [8]. Furthermore, Hp can bind Apo E and the resulting complexes influence cholesterol esterification [9]. These functional characteristics confer on Hp a major role in the reverse transport of cholesterol between peripheral cells and the liver for degradation.
Hp levels and HP rs72294371 'common polymorphism' (Data S1) have been consistently associated with inflammation-linked infectious [10,11] and non-communicable diseases [11,12]. Malaria caused by Plasmodium falciparum, which is associated with extensive intravascular hemolysis, decreases Hp to undetectable levels as the Hb-scavenging system is saturated [13]. In malaria-endemic areas, hypohaptoglobinemia has been proposed as an indirect biochemical indicator of malaria [14]. HP 'common polymorphism' should be considered at diagnosis of tuberculosis. Eisaev and colleagues [15] described an increased recurrence of pulmonary tuberculosis with worse prognosis in Hp2-2 Caucasians. Furthermore, this HP 'common polymorphism' contributes to mortality and viral load in Human immunodeficiency virus (HIV) infection [16]. Hp2-2 HIV carriers have a more pronounced viral replication rate and a worse prognosis compared to Hp1-1 or Hp2-1 HIV carriers [16,17]. Hepatitis C infection has also been associated with low serum Hp concentrations [18] and an overrepresentation of the Hp1-1 phenotype has been associated with high risk for chronic hepatitis C [19,20]. The HP 'common polymorphism' also has an effect on various other infectious diseases [21,22,23,24].
In Type 2 diabetes patients, the Hp2-2 phenotype has been suggested to confer greater risk of cardiovascular events [12,25,26] and of carotid atherosclerosis [27]. Moreover, high Hp level is a risk factor for acute myocardial infarction, stroke and heart failure [28,29]. Therefore, the routine measurement of Hp level has been suggested to be incorporated in daily medical practice to evaluate cardiovascular risk [29].
Despite these findings, the basis of Hp level inter-individual variability is still unknown. To identify genetic variants modulating physiological levels of Hp, we analyzed genome-wide association study (GWAS) data generated in European children for whom no age-related disease may influence Hp concentrations. We also assessed the association between the identified variants and cardiovascular risk factors (total, HDL and low density lipoprotein-cholesterol, Apolipoproteins A1 and B). Table 1 shows the phenotypic characteristics of the studied populations. Hp levels in children were low, in accordance with its reference distribution [4].

Results
Patterns of family correlations for serum Hp concentrations were assessed using both unadjusted and adjusted values (Table 2). Model 1, which was not adjusted, did not show any family correlation. Model 2, which took into account age and body mass index (BMI) as covariates, showed significant correlations for all pairs of relatives. Model 3, which hypothesized no effect of gender on family correlations, showed significant father-mother, father-son and son-son correlations (Table 2).
We then assessed the components of variance attributable to additive genetic effects, shared household effects and residual environmental factors (including assay variability) in 656 nuclear bi-parental families (2,680 individuals) from the STANISLAS Family Study (SFS) cohort (Table 3). Model 2, which included the three components after adjustment for age and BMI, gave a better description of the variance decomposition than model 1, which was not adjusted. Hp genetic variance represented 26% (P<0.001) of the total variance. Shared (i.e. within families) and random environmental variances were 11.6% and 62.4%, respectively (Table 3).
Our GWAS based on 631 unrelated children of the SFS cohort showed the strongest association signal for Hp levels in a 218-kb linkage disequilibrium (LD) block on chromosome 16 that includes the HP gene (Figure 1). Using the square-root transformed Hp measurement adjusted for gender, age and z-BMI under the additive model, we identified in this region two significant association signals 90 kb apart: rs2000999 (with A as effect allele: β = −0.123, standard error [SE] = 0.017, P = 6.32×10−13; Table 4) and rs10492825 (with C as effect allele: β = −0.0876, SE = 0.016, P = 5.50×10−8; Table 4). SNPs rs2000999 and rs10492825 display moderate LD (r² = 0.48, HapMap CEU release #27). In order to assess the redundancy between these two signals, we ran conditional regression analyses for both SNPs adjusted for each other and found that rs2000999 alone drove the association observed at the HP locus (rs2000999: P adjusted for rs10492825 = 1.95×10−7; rs10492825: P adjusted for rs2000999 = 0.91; Data S2, Table S1).
We confirmed the association between SNP rs2000999 and circulating Hp levels in three additional independent European cohorts: the GENDAI study of Greek children, a subset of obese children from the East of France (N total = 1,434), and a familial subset (N total = 2,957) of the SFS cohort (P replication = 3.49×10−41, P overall = 8.09×10−59; Table 4). Accounting for the rs2000999 allelic frequency, the pattern of familial correlation (Table 2, model 4) decreased from 0.230 to 0.206 and from 0.274 to 0.239 for sibling and child-parent pairs, respectively, whereas the adequacy of the model was significantly improved. Additional adjustment for rs2000999 for the components of variance attributable to additive genetic effects, shared household effects and residual environmental factors (Table 3, model 3) significantly improved the likelihood function, and the proportion of phenotypic variability accounted for by genetic effects decreased (26.0% to 14.2%, in comparison to model 2). Moreover, the component attributable to household factors increased (11.6% to 14.7%). We thus determined that rs2000999 is the major genetic determinant of Hp levels, accounting for 11.8% of Hp global variance and 45.4% of the genetic variance of this trait.
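The two variance figures attributed to rs2000999 follow from simple arithmetic on the heritability estimates of models 2 and 3. A minimal sketch in Python, using only the percentages reported in the text (the variable names are illustrative):

```python
# Heritability of Hp levels (percent of total phenotypic variance)
# before and after adjustment for rs2000999 (Table 3, models 2 and 3).
h2_before = 26.0
h2_after = 14.2

# Share of the total variance attributed to the SNP.
snp_total = h2_before - h2_after
# Share of the genetic variance attributed to the SNP.
snp_genetic = 100.0 * snp_total / h2_before

print(f"{snp_total:.1f}% of total variance")      # 11.8% of total variance
print(f"{snp_genetic:.1f}% of genetic variance")  # 45.4% of genetic variance
```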
In order to assess the degree of independence of rs2000999 from the HP rs72294371 'common polymorphism' (Data S1), we genotyped the latter in the GWAS first stage children (SFS cohort) by using a PCR-based method and a gel reading. Only a subset of 341 out of 631 samples was successfully genotyped after independent readings by two readers. In this sample set, we found no evidence for LD between HP 'common polymorphism' and rs2000999 (r² = 0.135) nor with the other SNPs that were genotyped by the Illumina array within the 218-kb LD block that includes both HP 'common polymorphism' and rs2000999 (0.001<r²<0.137; N = 31 SNPs). Furthermore, the HP 'common polymorphism' (minor allele frequency [MAF] = 0.46) and rs2000999 (MAF = 0.20) were both highly associated with Hp levels, as expected (P = 4×10−7 and P = 1×10−7, respectively). When both variants were included in the same regression model, we found that they significantly and independently contributed to the increased Hp levels (P HP rs72294371 'common polymorphism' = 0.001 and P rs2000999 = 5×10−5), indicating that the association with rs2000999 would be novel and not redundant with the HP 'common polymorphism'. However, it is noteworthy that, despite strong efforts, we were far from succeeding in genotyping all samples. We used two other technologies: a pre-designed TaqMan copy number assay (Applied Biosystems) and a PCR-based method with a different design from the one previously used. Unfortunately, we did not find good concordance (<70%) between the three methods. Given the current state of the art, we therefore cannot definitively exclude that the present association signal is related to the HP 'common polymorphism' genotype.
In order to validate our main results, we secondarily assessed the effect of SNP rs2000999 on HP gene expression in subcutaneous adipose tissue samples from 194 non-obese subjects ascertained from the Swedish SibPair cohort (Data S2). We found a significant contribution of rs2000999 to HP expression (β = 0.23±0.08; P = 0.007; P Bayesian = 0.006).
We finally assessed, under an additive model, the effect of SNP rs2000999 on total, HDL and low-density lipoprotein (LDL) cholesterol, and Apolipoproteins A1 and B in five independent European pediatric cohorts totaling 8,789 children. Total cholesterol was ln-transformed and we normalized the LDL cholesterol by computing the square root. All measurements were adjusted for gender, age (except in the NFBC1986) and z-score BMI. Our data showed that rs2000999, with A as effect allele, was associated with total cholesterol (β = −0.011, SE = 0.003, P = 0.002; Table 5) and LDL-cholesterol (β = −0.017, SE = 0.004, P = 0.0008; Table 5). The associations with HDL-cholesterol and Apolipoproteins A1 and B are displayed in Table S2.

Discussion
We first determined in 656 nuclear families that 26% of the Hp plasma level variance was under genetic control. Then, using a GWAS in 631 children from the same population and replicating in three independent populations, we identified rs2000999 as the major genetic determinant of Hp levels. This genetic variant alone explained 45.4% of the genetic variance of this trait (11.8% of Hp global variance). SNP rs2000999 is located in an intronic region of the HP gene, in a region previously believed to be the HPR gene (encoding the haptoglobin-related protein), which shares more than 90% nucleotide sequence homology with HP [30]. It lies 17 kb from a duplication of 59 α-chain amino acid residues, which corresponds to an intragenic duplication of 1.7 kb and is known as the HP 'common polymorphism' [31].
Our study shows that SNP rs2000999 also modulated expression levels of the Hp mRNA in human adipose tissue suggesting that this SNP (or a SNP in very strong LD with this one) is indeed functional. It is noteworthy that SNP rs2000999 has been previously reported to associate with total cholesterol in 4,200 adults from the EUROSPAN consortium [32] and with both total and LDL-cholesterol in 100,000 adults of European and non-European ancestry [33]. Interestingly, we confirmed the effect of this SNP on these lipid traits in European children.
Increased plasma levels of several inflammatory markers correlate with higher incidence and prognosis of various cardiovascular diseases [34,35,36,37]. Hp level measurement has been recently shown to improve the predictive information for major cardiovascular events [29]. As rs2000999 is also associated with lipid levels, this marker links inflammation and cardiovascular risk. It is noteworthy that the impact of rs2000999 association on lipids occurs early in life and is consistent with previous findings that the precursors of cardiovascular diseases originate in childhood [38,39].
Interestingly, the effect of rs2000999 on Hp levels is stronger in our discovery cohort, which includes healthy children with low Hp concentrations (0.65±0.39 g/L). As shown in the analyses for other diseases [40], the statistical power of GWAS can be increased in healthy homogeneous controls.
In addition, the effects of aging and of the environment are minimized in children. Thus, by using healthy pediatric populations, we were able to assess more accurately the effect of SNP rs2000999 on Hp levels.
We tried to assess the degree of independence between rs2000999 and the HP 'common polymorphism'. Three different methods were evaluated to genotype the HP 'common polymorphism' in our whole GWAS sample set. Unfortunately, we found no concordance between the three methods, which underlines a major difficulty in accurately genotyping this polymorphism. Although, to our knowledge, this difficulty has not been clearly discussed or published, it is acknowledged in the field and is likely to arise in the clinical diagnostic setting as well. In contrast, SNP rs2000999 can be accurately and easily genotyped. Our findings should be further replicated in non-European adults, especially in those affected by infectious diseases. More generally, rs2000999 should be assessed in cohorts of patients affected by the large variety of diseases associated with Hp levels. This is not a trivial task, as Hp is a trait that has been infrequently measured in cohorts used for genetic studies. Given the major effect of rs2000999 on HP gene expression and on Hp levels, a Mendelian randomization approach would be of interest to test the causal effect of this SNP on infectious and non-communicable phenotypes in order to assess its clinical relevance.

Ethics Statement
All the populations involved in the present study were recruited in accordance with the latest version of the Declaration of Helsinki for Ethical Principles for Medical Research Involving Human Subjects. All participants and their parents gave written informed consent. Genetic study protocols were approved by the local ethics committees for the protection of subjects for biomedical research: the Comité Consultatif de Protection des Personnes dans la Recherche Biomédicale (CCPPRB).

Study populations
The STANISLAS Family Study (SFS). The SFS is a 10-year longitudinal survey involving 1,006 volunteer families of European ancestry whose members were free of chronic disease (cardiovascular or cancer) with recruitment taking place from 1993-95 [41]. The SFS samples and data are part of the Biological Resources Centre (BRC) ''Interactions Gène-Environnement en Physiopathologie CardioVasculaire'' (IGE-PCV) in Nancy, France. Genome-wide genotyping was performed on a subset of 631 unrelated children (mean age 11.93 years [11.76-12.11]) constituting the discovery cohort [42] after screening for latent population substructure (Data S2). The 2,957 remaining individuals after quality control were analysed in the replication studies (mean age 29.84 [29.38-30.30]). Hp levels, BMI and the cardiovascular risk traits including total, high density lipoprotein (HDL) and low density lipoprotein (LDL)-cholesterol (calculated by the Friedewald formula [43]), Apolipoprotein A1 and B were available for all participants.
Obese Children. We studied obese children (defined as BMI >97th percentile for age and sex according to a French cohort [44]) ascertained from 449 nuclear families with at least one obese offspring, recruited in the Paediatric Endocrine Unit of Jeanne de Flandres Hospital of Lille, France or through a national media campaign. We analyzed 1,015 children (mean age 11.07 years [10.86-11.27]) for whom Hp, BMI, total, HDL and LDL-cholesterol, Apolipoprotein A1 and B measurements were available.
The GeNe and Diet Attica Investigation (GENDAI). The GENDAI pediatric cohort was recruited from children living in the Attica region of Greece [45]. From November 2005 to June 2006, 1,138 peri-adolescent children were recruited from randomly selected elementary schools of Attica. We analyzed 419 children (mean age 11.16 years [11.10-11.23]) for whom Hp, BMI, total, HDL and LDL-cholesterol, Apolipoprotein A1 and B measurements were available.
The Northern Finland 1986 Birth Cohort (NFBC1986). The NFBC1986 is a prospective birth cohort including all Finnish mothers of European ancestry with children whose expected date of birth fell between July 1, 1985 and June 30, 1986 in the two northernmost provinces in Finland [46]. Clinical examination at 15-16 years follow-up was conducted between August 2001 and June 2002. All cohort members living in Finland with known address (n = 9,215) were invited, and 6,798 participated (74%). We analyzed 5,310 adolescents successfully genotyped in the NFBC1986 cohort for whom BMI, total, HDL and LDL-cholesterol, Apolipoprotein A1 and B measurements were available.
The Verona cohort. The Verona cohort consists of Italian children recruited from the general population of Verona, Italy, whose families were randomly chosen from the registry office database of the town, and contacted by post. We analysed 401 children (mean age 10.90 years [10.75-11.04]) successfully genotyped for whom at least BMI, total, HDL and LDL-cholesterol, Apolipoprotein A1 and B measurements were available.
The SibPair cohort. The SibPair cohort comprises 154 nuclear families (732 subjects) from Sweden, each containing an obesity-discordant sib pair (at least 10 kg/m² difference in BMI). Gene expression and genetic variation were analysed in 194 non-obese subjects from the SibPair cohort.

Genotyping
Genome-wide genotypes were generated for the 631 unrelated SFS children using the Illumina Human CNV370-Duo array [42]. Briefly, 750 ng of genomic DNA was processed using Illumina's protocol for the BeadStation genotyping platform (Illumina), followed by GenCall software analysis (Illumina) to automatically cluster, call genotypes, and assign confidence scores using the GenTrain clustering algorithm (Illumina). We discarded a total of 2,552 SNPs for the following reasons: extreme Hardy-Weinberg disequilibrium (P<0.001), low genotyping call rates (<95%) or low minor-allele frequencies (<1%). We retained 318,237 SNPs for analysis. Genomic control λGC was 1.01.
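The three exclusion criteria above amount to a simple per-SNP filter. The following Python sketch illustrates the logic only; the `SnpStats` record, the `passes_qc` helper and the example values are hypothetical and do not reproduce the actual GenCall or PLINK pipeline used in the study.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary statistics."""
    name: str
    hwe_p: float      # P value of the Hardy-Weinberg equilibrium test
    call_rate: float  # fraction of samples with a called genotype
    maf: float        # minor allele frequency


def passes_qc(snp, hwe_min=0.001, call_min=0.95, maf_min=0.01):
    """Keep a SNP unless it fails one of the three thresholds in the text."""
    return (snp.hwe_p >= hwe_min
            and snp.call_rate >= call_min
            and snp.maf >= maf_min)


# Made-up SNPs for illustration only.
snps = [
    SnpStats("snp_a", hwe_p=0.40, call_rate=0.99, maf=0.20),
    SnpStats("snp_b", hwe_p=0.0002, call_rate=0.99, maf=0.20),  # fails HWE
    SnpStats("snp_c", hwe_p=0.40, call_rate=0.93, maf=0.20),    # low call rate
    SnpStats("snp_d", hwe_p=0.40, call_rate=0.99, maf=0.005),   # rare variant
]
retained = [s.name for s in snps if passes_qc(s)]
print(retained)  # ['snp_a']
```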
We used the Applied Biosystems SNPlex™ technology to replicate the association of genome-wide significant genetic variants in the SFS replication set, obese children and GENDAI, NFBC1986 and Verona cohorts.
SNPlex is based on the Oligonucleotide Ligation Assay (OLA) combined with multiplex PCR target amplification and was carried out as per the manufacturer's instructions (http://www.appliedbiosystems.com). Allelic discrimination was performed by capillary electrophoresis analysis using an Applied Biosystems 3730xl DNA Analyzer and GeneMapper 3.7 software. Genotyping call rate was above 95% in all populations studied and genetic variants were in Hardy-Weinberg equilibrium (P>0.001).
We used a PCR-based method [47] to genotype the HP 'common polymorphism' in the 631 children of the discovery cohort (SFS cohort) in order to determine any linkage disequilibrium with the genome-wide significant variants identified in the analysis. Only genotypes that were concordant following a double-blind genotyping call by two independent readers were retained for statistical analyses (N = 341). Two additional genotyping methods for the HP 'common polymorphism' were used in order to validate the above method: a custom TaqMan copy number assay (Applied Biosystems) following the manufacturer's recommendations and another PCR-based method using the following oligonucleotide primers: 5′-CTCTCCTTTCTCCCTTCCTGTC-3′ and 5′-TTTATCCACTGCTTCTCATTGT-3′. We did not obtain correspondence between the banding patterns and the Hp genotypes.

Haptoglobin measurement
Blood samples were collected between 8:00 and 9:00 am or 11:30 and 12:30 pm by venipuncture after overnight fasting. Hp protein levels were measured in blood plasma samples by high-sensitivity immunophotometric analysis using the BN™ II Siemens analyzer (Siemens, Marburg, Germany) and Siemens reagents, following the manufacturer's instructions.

Lipids measurements
Total cholesterol, HDL-cholesterol and apolipoproteins A1 and B were assayed using enzymatic methods (AU640 [Olympus, Watford, UK]) and LDL-cholesterol was calculated using the Friedewald formula [43].
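For readers unfamiliar with the Friedewald formula [43], it estimates LDL-cholesterol from total cholesterol, HDL-cholesterol and triglycerides. The sketch below shows the standard form of the formula; the function name is illustrative, the unit handling is an assumption (the text does not state which units were used), and the triglyceride cut-off is the conventional validity limit rather than a criterion reported here.

```python
def friedewald_ldl(total_chol, hdl_chol, triglycerides, unit="mmol/L"):
    """Estimate LDL-cholesterol with the Friedewald formula.

    LDL = total cholesterol - HDL - triglycerides / k,
    with k = 2.2 in mmol/L and k = 5 in mg/dL.
    """
    divisor = {"mmol/L": 2.2, "mg/dL": 5.0}[unit]
    # Conventional upper limit above which the estimate is unreliable.
    tg_limit = {"mmol/L": 4.5, "mg/dL": 400.0}[unit]
    if triglycerides >= tg_limit:
        raise ValueError("Friedewald estimate unreliable at high triglycerides")
    return total_chol - hdl_chol - triglycerides / divisor


# Illustrative values in mg/dL: 200 - 50 - 150 / 5 = 120.
print(friedewald_ldl(200.0, 50.0, 150.0, unit="mg/dL"))  # 120.0
```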

Statistical analyses
Heritability estimate of Hp levels in the SFS. Intrafamilial correlations were estimated by using maximum likelihood techniques [48] with and without adjustment for covariates. This statistical approach allowed adjustment for covariates within models, simultaneously and separately for fathers, mothers, sons and daughters. The significance of various familial correlations, or sex and generation differences in correlations, was tested using the log-likelihood ratio test. Correlations were computed under two sets of hypotheses: gender effects on correlations for parents and children and no gender effect for all correlations.
Variance component analysis was applied in order to assess the relative contributions of genetic factors, common household factors and individual-specific environment to the familial aggregation of serum haptoglobin concentrations. The variable used to estimate variance components was adjusted for age and BMI, separately for fathers, mothers, sons and daughters. The analysis was conducted by using a multivariate normal model for pedigree analysis as described by Lange and colleagues [49,50], with the software FISHER, which also performed tests of goodness-of-fit of the underlying multinormal distribution. The general model assumed that the studied trait was the result of the sum of three independent random components: a polygenic component (G) representing additive genetic factors, household factors common to individuals within a family (H) and unmeasured environmental factors particular to an individual (including measurement error) (E). These three components were assumed to be normally distributed with mean equal to 0 and variance equal to σ²G, σ²H and σ²E, respectively.
The hypothesis of no polygenic component or no household effect was checked by comparing a model including σ²G, σ²H and σ²E with a model including only σ²H and σ²E or σ²G and σ²E, respectively. In addition, possible effects of covariates (age and BMI) and of the allelic frequency of genome-wide significant variants on these variance components were tested.
Comparison of nested models was based on the likelihood ratio criteria. Eventually, the best parsimonious model was selected. The percentage contributions of the three components, additive genetic factors (heritability), household factors and residual environmental, to residual phenotypic variance (after adjustment for covariates) were determined.
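The nested-model comparison can be illustrated with a small likelihood ratio test. This is a minimal sketch, assuming the two models differ by a single parameter and that the test statistic is referred to a chi-square distribution with one degree of freedom; the text does not state which reference distribution FISHER applied, and the log-likelihood values below are made up.

```python
import math


def lrt_pvalue_1df(loglik_full, loglik_reduced):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic and its P value under a chi-square
    distribution with 1 degree of freedom (an assumption of this sketch).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods: full model (G + H + E) vs reduced (H + E).
stat, p_value = lrt_pvalue_1df(loglik_full=-1520.3, loglik_reduced=-1527.9)
print(f"LRT statistic = {stat:.2f}, P = {p_value:.2g}")
```

Because a variance component is constrained to be non-negative, a test of a variance equal to zero sits on the boundary of the parameter space, so the plain one-degree-of-freedom reference used here may be conservative.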
Genome-wide association and replication analyses. We carried out genome-wide association and replication analyses on Hp levels using linear mixed regression models under the additive genetic model with one degree of freedom, adjusting for age, gender and BMI and using PLINK [51]. The summary statistics were combined in the meta-analyses (Data S2), using the inverse normal method with equal weight for each population. In this method, the P values of each study are transformed into their inverse normal z scores and the weighted sum, over all studies, is compared to a normal N(0, 1) distribution, provided the sum of squared weights equals 1. The estimates of variant effects on Hp and their standard errors for each separate analysis were combined in the meta-analysis using the weighted inverse normal method, and the overall effect and its confidence interval were estimated using the inverse variance method implemented in the 'meta.summaries' function of the R rmeta package (http://cran.r-project.org/web/packages/rmeta/index.html). No major heterogeneity in effects was observed (P<0.02). The same mixed model and the same software were used to analyse the association of genetic variants with lipid traits.
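The two combination steps described in this paragraph can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, P values are taken to be two-sided, each z score is given the sign of the study's effect estimate (the text does not specify how direction was handled), and the pooled effect is a fixed-effect inverse variance estimate.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def inverse_normal_meta(p_values, betas):
    """Combine two-sided P values with equal weights (inverse normal method).

    Assumption: each z score takes the sign of the study's effect estimate,
    so that effects in opposite directions cancel out.
    """
    weight = 1.0 / math.sqrt(len(p_values))  # squared weights sum to 1
    z_total = 0.0
    for p, beta in zip(p_values, betas):
        z = -STD_NORMAL.inv_cdf(p / 2.0)  # magnitude of the z score
        z_total += weight * math.copysign(z, beta)
    # Two-sided P value; erfc keeps precision in the far tail.
    return z_total, math.erfc(abs(z_total) / math.sqrt(2.0))


def inverse_variance_pool(betas, std_errors):
    """Fixed-effect inverse variance estimate of the overall effect."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


# Made-up per-cohort results for illustration only.
z, p = inverse_normal_meta([1e-6, 3e-4, 0.02], [-0.10, -0.08, -0.05])
beta, se = inverse_variance_pool([-0.10, -0.08, -0.05], [0.020, 0.022, 0.021])
print(f"z = {z:.2f}, P = {p:.2g}, pooled beta = {beta:.3f} (SE {se:.3f})")
```

Working with `inv_cdf(p / 2)` rather than `inv_cdf(1 - p / 2)` avoids a rounding problem for very small P values, where `1 - p / 2` would evaluate to exactly 1.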
Gene-expression investigation. To investigate the effect of genome-wide significant variants on gene expression (Data S2), we used data from 194 non-obese individuals from the SibPair cohort [52]. Gene expression data for HP was measured in subcutaneous adipose tissue [53] from 347 siblings using the Affymetrix Human U133 Plus 2.0 platform (208470_s_at and 208471_at, respectively). DNA was isolated from peripheral blood and genotypes were generated using Illumina 610-Quad arrays.
We used a linear mixed model (Pinheiro and Bates, 2000) to assess the association of SNPs with gene expression. Log-transformed expression level was regressed on the random-effect term, which accommodates the family pedigree structure, and on the fixed-effect terms, i.e. sex, age, BMI level and the SNP of interest (recoded as 0 = AA; 1 = AG; 2 = GG according to an additive model). Analysis was carried out using the R function lmer() (package lme4) with p-values obtained from the t-statistic.
Significance of the fixed effects was further investigated in the Bayesian set-up using the R function mcmcsamp() (package lme4) that generates Markov chain Monte Carlo samples from the posterior distribution of the parameters of a linear mixed model. The prior on the fixed-effects parameters is taken to be locally uniform while the prior on the variance-covariance matrices of the random effects is taken to be the locally non-informative prior. Based on 100,000 samples drawn from the posterior distribution, we calculated the smallest p such that the (1−p) credible interval does not contain the value 0. This parameter was finally used to assess the p-value obtained from the t-statistic: if smaller than p, it was considered anticonservative and its value discarded.
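The quantity p defined above can be computed directly from posterior samples. The sketch below assumes equal-tailed credible intervals (the text does not say whether equal-tailed or highest-density intervals were used); the function name is hypothetical and the samples are simulated rather than drawn with mcmcsamp().

```python
import random


def mcmc_p_value(samples):
    """Smallest p such that the equal-tailed (1 - p) interval excludes 0.

    An equal-tailed (1 - p) interval excludes 0 once p / 2 exceeds the
    posterior mass on the minority side of 0, hence twice that mass.
    """
    n = len(samples)
    at_or_below = sum(1 for s in samples if s <= 0.0) / n
    at_or_above = sum(1 for s in samples if s >= 0.0) / n
    return min(1.0, 2.0 * min(at_or_below, at_or_above))


# Simulated posterior draws for a fixed effect (illustrative values only).
rng = random.Random(1)
draws = [rng.gauss(0.23, 0.08) for _ in range(100_000)]
print(f"Bayesian p = {mcmc_p_value(draws):.4f}")
```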