Multi-Ethnic Analysis of Lipid-Associated Loci: The NHLBI CARe Project

Background
Whereas it is well established that plasma lipid levels have substantial heritability within populations, it remains unclear how many of the genetic determinants reported in previous studies (largely performed in European American cohorts) are relevant in different ethnicities.
Methodology/Principal Findings
We tested a set of ∼50,000 polymorphisms from ∼2,000 candidate genes and genetic loci from genome-wide association studies (GWAS) for association with low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG) in 25,000 European Americans and 9,000 African Americans in the National Heart, Lung, and Blood Institute (NHLBI) Candidate Gene Association Resource (CARe). We replicated associations for a number of genes in one or both ethnicities and identified a novel lipid-associated variant in a locus harboring ICAM1. We compared the architecture of genetic loci associated with lipids in both African Americans and European Americans and found that the same genes were relevant across ethnic groups but the specific associated variants at each gene often differed.
Conclusions/Significance
We identify or provide further evidence for a number of genetic determinants of plasma lipid levels through population association studies. In many loci the determinants appear to differ substantially between African Americans and European Americans.


Introduction
Plasma concentrations of lipids and lipoproteins [low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), triglycerides (TG)] are heritable risk factors for cardiovascular disease [1,2]. Recently, genome-wide association mapping of common variants (>5% minor allele frequency) in individuals of European ancestry has proven useful in identifying genetic loci contributing to plasma lipids [3]. Roughly one third of these loci harbor genes with previously recognized involvement in lipid metabolism, many by virtue of having rare variants that result in Mendelian disorders. The other loci are suspected to harbor novel lipid regulators, the study of which has the potential of providing important new insights into lipid biology.
As promising as these observations may be, several critical questions remain: How many of the previously reported loci are genuine causal determinants of plasma lipid levels? Are there more loci associated with plasma lipids that are discoverable with techniques besides genome-wide association mapping? Do any of these loci confer effects on lipids in an ethnic-specific manner? To address these questions, we utilized the National Heart, Lung, and Blood Institute (NHLBI) Candidate Gene Association Resource (CARe) comprising more than 40,000 individuals from nine prospective cohorts with measured lipid phenotypes and genotype information obtained using the "ITMAT-Broad-CARe" array, or IBC array, with ∼50,000 polymorphisms from ∼2,000 candidate genes/loci [4,5]. Most of the polymorphisms on this array were obtained from a systematic literature search for candidate genes implicated in cardiovascular diseases; these were supplemented with polymorphisms from loci identified in early genome-wide association studies (GWAS) for lipids [5].
We initiated this study with specific hypotheses: (1) Common DNA sequence variants in previously reported genes and loci, as well as additional novel loci, are associated with plasma lipids; (2) At some of these loci, the specific variants related to plasma lipids differ between ethnic groups. To test these hypotheses, we performed association analyses of the polymorphisms on the IBC array in 25,000 European Americans and 9,000 African Americans in CARe.

Ethics Statement
All participants in each of the CARe cohorts gave informed written consent. The Institutional Review Boards (IRBs) of each CARe cohort (i.e., the IRBs for each cohort's field centers, coordinating center, and laboratory center) have reviewed and approved the cohort's interaction with CARe. The study described in this manuscript was approved by the Committee on the Use of Humans as Experimental Subjects (COUHES) of the Massachusetts Institute of Technology.

CARe Study Design and Quality Control
A full description of the CARe Study is found elsewhere [4]. In brief, DNA samples and phenotypic information from nine NHLBI prospective cohorts [the Atherosclerosis Risk In Communities (ARIC) study, the Coronary Artery Risk Development In Young Adults (CARDIA) study, the Cleveland Family Study  (Table S1). All phenotypes, including plasma lipid levels, had been standardized according to the Clinical Data Interchange Standard Consortium Study Data Tabulation Model (CDISC SDTM).
All DNA samples passing initial quality checks were interrogated with the IBC v2 chip for genotyping of 49,320 total single nucleotide polymorphisms (SNPs). Samples with overall genotyping success rate <95%, individual SNPs with genotyping success rate <95% within a cohort, monomorphic SNPs, and SNPs mapping to multiple genomic loci were removed. An inbreeding coefficient was calculated for each sample as a measure of heterozygosity, with those samples exceeding 4 standard deviations from the mean (suggesting poor DNA quality if too low, or sample contamination if too high) being removed. In cases of identical DNA samples, the sample with the lower genotyping success rate was removed. Samples that shared 5% or more of their genome with many other samples were also removed. Additional outlier samples as determined by multidimensional scaling were also removed. SNPs for which genotype missingness could be predicted by surrounding haplotypes were removed, as were any SNPs found to be associated with chemistry plates. Pedigrees and SNPs not conforming to Mendelian expectations were identified and, where appropriate, samples were removed. These analyses were all performed with PLINK [6].
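The heterozygosity filter described above can be illustrated with a minimal sketch. It assumes that per-sample inbreeding coefficients have already been computed (for example, exported from PLINK) and that the cutoff is applied symmetrically around the sample mean using the sample standard deviation; the function name and the list-based input are hypothetical and are not taken from the CARe pipeline.

```python
from statistics import mean, stdev

def flag_heterozygosity_outliers(f_values, n_sd=4.0):
    """Return indices of samples whose inbreeding coefficient lies more than
    n_sd sample standard deviations from the mean (in either direction)."""
    mu = mean(f_values)
    sd = stdev(f_values)  # sample SD; an assumption, the text does not specify
    if sd == 0:
        return []  # no spread, so nothing can be an outlier
    return [i for i, f in enumerate(f_values) if abs(f - mu) > n_sd * sd]

# Illustrative input only: 30 unremarkable samples and one extreme value.
coefficients = [0.0] * 30 + [1.0]
outliers = flag_heterozygosity_outliers(coefficients)
```

Note that in very small batches a single extreme sample inflates the standard deviation enough that a 4 SD cutoff can never be exceeded, so a filter of this kind is only meaningful at cohort scale.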
Because different ethnic groups were represented, with the expectation of differing genotype frequencies and admixture, no filters were applied for minor allele frequency or Hardy-Weinberg P values. The Hardy-Weinberg P value for each index SNP in each cohort is listed in Table S2.
About 2,000 ancestry-informative markers were included in the IBC v2 chip [4]. ANCESTRYMAP [7] was used to generate estimates for global ancestry in African American samples. EIGENSTRAT [8] was used for self-reported African Americans and European Americans separately to generate ten principal components each accounting for population stratification. We found that the first principal component for African Americans was highly correlated with the global ancestry estimates (r² > 0.98).

Phenotype Modeling
We modeled the lipid phenotypes in the following ways. LDL-C was calculated according to Friedewald's formula: LDL-C = total cholesterol - HDL-C - (TG ÷ 5). If a TG value was >400 mg/dL, LDL-C was treated as a missing value. For individuals on lipid-lowering therapy, the LDL-C value was multiplied by 1.42 to model a 30% reduction in LDL-C on therapy. This represents the average expected reduction in LDL-C with a first-generation statin, the most commonly used lipid-lowering medication during the study periods of most of the cohorts [9]. TG values were log10-transformed. Sex-specific phenotype residuals were constructed within strata of cohort and ethnicity after accounting for age and age². Each set of residuals was standardized to a mean of zero and a standard deviation of one. The standardized residual served as the phenotype in genotype-phenotype association analyses. Generation of residuals was performed with the R statistical package (The R Foundation for Statistical Computing, Vienna, Austria).
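As an illustration of the phenotype derivations described above, the following minimal sketch applies Friedewald's formula, the therapy adjustment, the TG transformation, and the final standardization. It is written in Python rather than R, all function names are hypothetical, values are assumed to be in mg/dL, and the regression on age and age² that produces the residuals is omitted; the use of the sample standard deviation is likewise an assumption.

```python
import math
from statistics import mean, stdev

def friedewald_ldl(total_chol, hdl, tg):
    """LDL-C = total cholesterol - HDL-C - TG/5 (mg/dL).
    Returns None (missing) when TG exceeds 400 mg/dL."""
    if tg > 400:
        return None
    return total_chol - hdl - tg / 5.0

def adjust_for_therapy(ldl, on_therapy, factor=1.42):
    """Scale measured LDL-C upward for individuals on lipid-lowering therapy."""
    if ldl is None:
        return None
    return ldl * factor if on_therapy else ldl

def log10_tg(tg):
    """Log10-transform a triglyceride value."""
    return math.log10(tg)

def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical individual: TC 200, HDL-C 50, TG 150, on a statin.
ldl = adjust_for_therapy(friedewald_ldl(200, 50, 150), on_therapy=True)
```

In the study itself, standardization is applied to the sex-specific residuals within each cohort and ethnicity stratum, not to raw lipid values.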

Association Testing
For each of the three traits, we used linear regression to test SNP-phenotype associations in each stratum of cohort and ethnicity assuming an additive genetic model, using the ten calculated principal components as covariates: phenotype ~ genotype + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10. These association analyses were performed in PLINK. Genotype-phenotype associations within each ethnic group were assessed by weighted z-score-based fixed-effects meta-analysis and effect size estimates (β values) generated by inverse-variance weighted meta-analysis, both using METAL (G. Abecasis, University of Michigan). Genomic control correction was applied to each cohort individually prior to meta-analysis, and a final genomic control correction was applied to each ethnicity's meta-analysis dataset. For all cross-ethnic comparisons, only SNPs available in all cohorts (12 total) were considered. We considered P = 1×10⁻⁶ to be the threshold for statistical significance, applying a Bonferroni correction for the number of SNPs tested (0.05 ÷ 50,000).
We performed sensitivity analyses for two cohorts for which there were significant numbers of related individuals (CFS and FHS) to account for family relationships, using a linear mixed effects (LME) model to analyze the lipid residuals, with the SNP genotype treated as a fixed effect, and a random effect according to the degree of relatedness within a family [10]. Upon substitution of these data into the meta-analyses for each lipid trait, there were essentially no differences in the SNPs found to be statistically significant, and these data were not considered further.
To identify SNPs exerting sex-specific effects, we added sex as a covariate in the above linear regression model and tested for a formal SNP×sex interaction; the results were meta-analyzed as described above. Similarly, to test for SNP×SNP interactions among the most strongly associated SNPs at each of the identified loci, we added the minor allele count for each SNP as a covariate, in turn, into the above model (with sex removed), testing for a formal interaction between each SNP pair combination.
For those loci containing IBC-significant SNPs in both ethnicities, we performed SNP conditional analyses. These were conducted by adding SNP allele counts as extra covariates to the initial linear regression model described above. The conditional analyses were performed iteratively, with additional independent SNPs identified at each step in each locus added to the model. Adjusted R² values for each SNP included in the conditional analyses, as well as for the combination of independently significant SNPs at each locus, were calculated by replicating the relevant linear regression model in the largest cohort (ARIC for both ethnicities) using Statistical Package for the Social Sciences (SPSS) software (version 16.0; SPSS Inc., Chicago, IL), with principal components excluded.
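The significance threshold and the two meta-analysis schemes mentioned above can be sketched as follows. This is an illustration only, not METAL's code: it assumes that the z-score scheme weights each stratum by the square root of its sample size (the text does not state the weights used), and the function names and example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def weighted_z_meta(z_scores, sample_sizes):
    """Combine per-stratum z-scores with sqrt(N) weights (assumed weighting).
    Returns the combined z-score and its two-sided P value."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z = numerator / denominator
    p = 2.0 * NormalDist().cdf(-abs(z))
    return z, p

def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects pooled effect size and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se

# Threshold quoted in the text: 0.05 divided by roughly 50,000 SNPs.
threshold = bonferroni_threshold(0.05, 50_000)

# Hypothetical two-cohort example.
z, p = weighted_z_meta([2.0, 2.0], [100, 100])
beta, se = inverse_variance_meta([0.2, 0.4], [0.1, 0.1])
```

The z-score scheme needs only a test statistic and a sample size per stratum, whereas the inverse-variance scheme needs effect sizes on a common scale, which the standardized residuals provide.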
We also found significant associations for a number of IBC SNPs chosen for their proximity to candidate genes selected from the literature. Of these, eight loci, namely FADS1 (fatty acid desaturase 1), PLTP (phospholipid transfer protein), LCAT (lecithin-cholesterol acyltransferase), ANGPTL4 (angiopoietin-like 4), ABCG5/ABCG8 (ATP-binding cassette sub-family G members 5 and 8; sterolin-1 and -2), LPA [lipoprotein(a)], NPC1L1 [NPC1 (Niemann-Pick disease, type C1, gene)-like 1], and HPR (haptoglobin-related protein), were shown to have genome-wide-significant lipid-associated SNP variants in GWAS reported subsequent to the fabrication of the IBC array (Table 1) [3]. All of these genes, with the exception of FADS1 and HPR, have well-described functions in lipoprotein metabolism. The LCAT locus was notable for being IBC-significant for HDL-C in both CARe European Americans and African Americans. An additional significant locus in African Americans harboring CD36 (cluster of differentiation 36; thrombospondin receptor) had previously been reported in a candidate gene study; indeed, the same SNP that was IBC-significant (rs3211938) was shown in that prior study to be a nonsense coding variant that resulted in CD36 deficiency in a homozygous individual and was associated with increased HDL-C levels (P = 0.00018) and decreased TG levels (P = 0.0059) in an African American cohort [15].
Finally, one locus harbors an IBC-significant variant that has not previously been reported to be associated with lipid traits: ICAM1 (intercellular adhesion molecule 1) in CARe African Americans (Table 1). Of note, the P value of association (1.24×10⁻⁸) for the ICAM1 SNP rs5030359 is not only IBC-significant but also genome-wide-significant. This variant is of low frequency in African Americans (just under 1%) and is virtually absent in European Americans. Interestingly, this variant demonstrated the largest effect size for any trait + ethnicity combination (β = -0.52).

Architecture of Lipid Loci in African Americans Compared to European Americans
Comparing each index SNP in each of the lipid loci in European Americans to African Americans, we noted quite varied patterns of association. In some cases, the effect size estimates and minor allele frequencies (MAFs) are quite similar in the two ethnicities. An example is SNP rs6511720 in the LDLR locus, with β = -0.23 (in standard deviation units) and MAF = 0.12 in European Americans, β = -0.20 and MAF = 0.14 in African Americans for LDL-C (Table 1). These findings indicate considerable heterogeneity in the architecture of lipid-related loci in the two ethnicities.
We took advantage of the dense genotyping in each locus on the IBC v2 array to perform more detailed comparisons of the 11 loci that harbored IBC-significant SNPs for both ethnicities (Table 2). In only two of the loci (LIPC for HDL-C, LDLR for LDL-C) was the same SNP the most highly associated for the trait in each ethnicity. At an additional four loci (SORT1 with LDL-C, APOB with LDL-C, LPL with HDL-C and TG, LCAT with HDL-C), the most highly associated SNP in one ethnicity was among the most highly associated SNPs in the other ethnicity. In the remaining five loci (PCSK9 for LDL-C, ABCA1 for HDL-C, APOA1-C3-A4-A5 for TG, CETP for HDL-C, APOE for LDL-C and TG), the most highly associated SNPs in the two ethnicities did not overlap.
For each ethnicity we performed conditional analyses with the most highly associated SNPs in each of the 11 loci to uncover any additional, independently associated SNPs (at IBC significance) in the same gene regions (Table 2). We performed this iteratively until no association signals remained in each locus. By this criterion, we identified up to four independent SNP variants (in the case of CETP with HDL-C in European Americans) per locus. Remarkably, only one locus (SORT1) harbored a single independent SNP in each ethnic group.
For many gene regions, the independent SNPs did not colocalize in the two ethnicities. For example, in the PCSK9 locus, the most highly associated SNPs (rs11591147 in European Americans and rs11806638 in African Americans) are more than 10 kb apart and show much weaker associations (and are much less common) in the other ethnicity (Table 2). Similarly, the second independent SNPs (rs499883 in European Americans and rs505151 in African Americans) are more than 10 kb apart and are each absent in the other ethnicity. Similar patterns are seen with APOB, APOA1-C3-A4-A5, and LCAT. More complex patterns are seen in loci such as LPL, CETP, and APOE, where some independent SNPs colocalize (within a few kb of each other) and others are distant.
Notably, for two loci/trait combinations (LCAT with HDL-C, APOE with TG) the strength of association of SNPs is greater in African Americans than in European Americans (Table 2). For APOE, this is due to the rarity of the index SNP (rs12721054) in European Americans (MAF = 0.00047 vs. MAF = 0.12), since the effect size estimates are similar in the two ethnic groups (β = 0.33 in European Americans vs. β = 0.26 in African Americans). For LCAT, rarity accounts for the second independent SNP in African Americans (rs35673026, absent in European Americans) but not the most highly associated SNP (rs255052, similarly common in European Americans, MAF = 0.15 vs. MAF = 0.22, with modestly different effect size estimates, β = 0.069 vs. β = 0.11). No second independent SNP for LCAT was identified in European Americans.
Table 2 also reports the proportion of the variance (adjusted R²) in the respective traits explained by each SNP and the combination of the independently significant SNPs at each locus. The greatest proportion of variance explained by a single SNP in African Americans was 3.7% (rs17231520 for HDL-C); for European Americans it was 2.5% (rs17231506 for HDL-C). These SNPs are both located in the CETP locus on chromosome 16. Unsurprisingly, the CETP locus explained the greatest proportion of variance, in any trait, for both African Americans and European Americans (explaining 5.0% and 3.6% of the variance in HDL-C, respectively). For African Americans, the combination of all independently significant SNPs at the 11 loci that harbored IBC-significant SNPs for both ethnicities explained 4.5%, 7.7%, and 1.9% of the variance in LDL-C, HDL-C, and TG, respectively. For European Americans, 6.2%, 6.0%, and 4.1% of the variance in LDL-C, HDL-C, and TG were explained, respectively.
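The role of allele frequency in the APOE example can be made concrete with a standard back-of-the-envelope approximation: for a phenotype standardized to unit variance, an additive SNP in Hardy-Weinberg equilibrium explains roughly 2 × MAF × (1 − MAF) × β² of the variance. The sketch below applies this approximation to the effect sizes and frequencies quoted above. It is not the adjusted R² calculation used for Table 2, and its outputs should not be expected to match the tabulated values; the function name is hypothetical.

```python
def approx_variance_explained(beta, maf):
    """Approximate fraction of variance in a unit-variance phenotype explained
    by an additive SNP, assuming Hardy-Weinberg equilibrium."""
    return 2.0 * maf * (1.0 - maf) * beta ** 2

# rs12721054 (APOE, TG), using the values quoted in the text.
european_american = approx_variance_explained(beta=0.33, maf=0.00047)
african_american = approx_variance_explained(beta=0.26, maf=0.12)

print(f"European Americans: {european_american:.5%}")
print(f"African Americans:  {african_american:.5%}")
```

Under this approximation the variant accounts for more than a hundred times as much variance in African Americans as in European Americans despite similar per-allele effects, which is consistent with the much stronger association signal in the former group.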

SNP×SNP Interactions and SNPs Exerting Sex-specific Effects
We investigated whether any SNP×SNP interactions existed between any of the most significant SNPs identified at each of the 19 above-mentioned loci for their respective traits; none were identified (Tables S3, S4, S5, S6, S7, S8). The lowest P value (0.002) was generated for the interaction between SNPs rs10455872 (LPA) and rs934197 (APOB) for LDL-C in African Americans, but this was far from IBC-significant and would withstand a Bonferroni correction for only 25 tests (there were 120 SNP×SNP interactions tested for LDL-C among African Americans).
Similarly, no IBC-significant SNP×sex interactions were observed among the SNPs identified in Table 1 (Tables S9, S10, S11). Of note, the SNP rs4810479 (PLTP) generated a P value of 1.51×10⁻⁴ for HDL-C among European Americans and was highly significant among women (P = 7.11×10⁻¹²) but not men (P = 0.12). The P value of this interaction was two orders of magnitude smaller than for any of the other SNPs for any lipid trait, with the exception of rs439401 (P = 1.30×10⁻³ for TG among European Americans). Expanding the search to all SNPs on the IBC chip failed to identify any further IBC-significant SNP×sex interactions (Tables S12, S13).
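The multiple-testing arithmetic behind the statement about 25 tests can be checked with a short sketch. The helper names are hypothetical, and the observation that 120 equals the number of unordered pairs among 16 SNPs is an inference made here for illustration; the text states only the number of interactions tested.

```python
from math import comb

def pairwise_tests(n_snps):
    """Number of unordered SNP pairs among n_snps SNPs."""
    return comb(n_snps, 2)

def max_tests_survived(p_value, alpha=0.05):
    """Largest number of tests m for which p_value <= alpha / m.
    A tiny epsilon guards against floating-point rounding at the boundary."""
    return int(alpha / p_value + 1e-9)

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold for n_tests comparisons."""
    return alpha / n_tests

lowest_p = 0.002                                  # LPA x APOB interaction, LDL-C
survivable = max_tests_survived(lowest_p)         # tests the P value withstands
required = bonferroni_threshold(120)              # threshold for 120 tests
is_significant = lowest_p <= required
```

With 120 interactions tested, the per-test threshold is about 4.2×10⁻⁴, so a P value of 0.002 falls short even before the stricter IBC-wide threshold is considered.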

Discussion
In this work, we identify or provide further evidence for a number of genetic determinants of plasma lipid levels through population association studies of two ethnicities. Specifically, the results of these studies support each of our hypotheses.

Common DNA Sequence Variants in Previously Reported Genes and Loci, as well as Additional Novel Loci, are Associated with Plasma Lipids
We found that all 19 lipid-associated loci identified in early GWAS replicated in the combined CARe European American cohorts (∼25,000 individuals), and 10 of the loci replicated in the combined CARe African American cohorts (∼9,000 individuals). It is likely that with genotyping in additional African American individuals, increased power will ultimately allow replication of many of the remaining nine loci. We identified an additional 10 loci associated with one or more lipid traits at Bonferroni-corrected statistical significance. Eight of these loci were mapped in GWAS subsequent to the design of the IBC array, and one locus had been identified in a prior non-GWAS study. The remaining locus, ICAM1, is novel. The identification of ICAM1 is of particular interest because its association with LDL-C was exclusive to African Americans.
These findings give a high degree of confidence that most, if not all, of the reported GWAS lipid loci harbor authentic determinants of plasma lipid levels deserving of further functional investigation, and that many of the loci are relevant not just in European American populations but more generally in global populations.

At Some Lipid-associated Loci, the Specific Variants Related to Plasma Lipids Differ between Ethnic Groups
We were able to use the dense SNP genotyping in loci available via the IBC array to analyze lipid-associated loci in an unprecedented level of detail, particularly in African Americans. Our analyses demonstrate that at many loci there are major differences in genetic architecture between European Americans and African Americans. These differences manifest in at least three ways.
First, some of the most highly associated SNPs in one ethnicity are rare or absent in the other ethnicity. This is a well-established phenomenon; for example, truncation mutations in PCSK9 that are of low frequency in African Americans, but absent in European Americans, have been shown to result in a robust reduction in LDL-C levels and coronary heart disease risk [16,17]. Many of the discrepancies in lipid associations of SNPs between the two ethnicities can be attributed primarily to differences in allele frequency. For example, rs3211938 in CD36 is much more highly associated with HDL-C in African Americans (P = 2.60×10⁻¹²) than in European Americans (P = 0.030), with a large discrepancy in MAFs (8.3% vs. 0.01%) (Table 1). Similarly, rs12721054 in APOE is both more highly associated with TG and more frequent in African Americans (P = 5.55×10⁻²⁵, MAF = 12%) than in European Americans (P = 0.60, MAF = 0.047%). These variants may represent causal variants, as is the case with the nonsense coding variant rs3211938 in CD36 [15].
Second, some of the most highly associated SNPs in one ethnicity are poorly associated in the other ethnicity despite similar MAFs in both groups. An example is rs2515629 in the ABCA1 locus (P = 2.04×10⁻⁷, MAF = 0.17, β = 0.11 in African Americans; P = 0.53, MAF = 0.17, β = 0.0079 in European Americans) (Table 2). In cases like this, it seems unlikely that the SNP represents a causal variant that acts exclusively in one ethnic group. Instead, rs2515629 is likely to be in strong linkage disequilibrium with an unrecognized (not genotyped on the IBC array) causal variant in African Americans; this causal variant may not be present in European Americans, or it may be present but in poor linkage disequilibrium with rs2515629 in European Americans due to differences in correlation structure.
Third, in some gene regions there are differences in the distributions of independent lipid-associated SNPs. Presumably these independent SNPs reflect the presence of independent causal variants. For example, in the CETP locus we identified at least 3 independently associated SNPs (with HDL-C) in African Americans and 4 independently associated SNPs in European Americans (Table 2), scattered over a range of 20 kb, with none clearly marking the same causal variants. Similar complexity is seen in the PCSK9, LPL, and APOE loci and, to a lesser degree, in most of the other 11 loci for which we performed conditional analyses (Figures S1 and S2). Thus, significant variability in the full spectrum of causal DNA variants across gene regions in different ethnic groups may well be the rule rather than the exception.
A previous analysis of lipid-associated loci in the Jackson Heart Study identified several loci (LPL, APOB, and GCKR) that appeared to account for some part of the inter-ethnic variation in lipid profiles, and found that SNPs in the LPL locus displayed different effect sizes depending on whether they existed on a background of European vs. African ancestry [18]. Our study found evidence of a similar phenomenon for a larger number of loci. In the LPL locus, for example, the most highly associated SNPs for TG in European Americans and in African Americans both have larger estimated effect sizes in the latter than the former (rs3916027,  (Table 1). Thus, genetic differences in lipid determinants between the two ethnicities appear to be widespread across many loci.

Inter-ethnic Comparisons of SNP-phenotype Relationships may Pinpoint Causal DNA Variants
Finally, while it has been assumed that differing patterns of linkage disequilibrium in European Americans and African Americans could be helpful in localizing shared, causal DNA variants, for the reasons stated above this may not often be the case. Of the 29 loci we studied, only in 4 cases did association signals in the two ethnicities appear to unambiguously converge (Figures S1 and S2).
In the SORT1 locus, the most highly associated SNPs in the two ethnicities were not identical but were in close proximity (within a few kb) and in high linkage disequilibrium, suggesting that they are marking the same causal variant (Figure S1A). In European Americans, we found six SNPs that together bear the strongest association with LDL-C (P values ranging from 1.69×10⁻⁵¹ to 1.33×10⁻⁴⁹) and are effectively in perfect linkage disequilibrium. In contrast, in African Americans the same six SNPs are in varying degrees of linkage disequilibrium, with rs12740374 standing out as having the strongest association (P = 9.33×10⁻²⁰ in African Americans; P = 2.90×10⁻⁵¹ in European Americans). Thus, this analysis nominates rs12740374 as a strong candidate for the causal variant in the locus. Of note, a recent study identified rs12740374 as a causal variant through functional experimentation in cell-based reporter assays [19], confirming the potential value of inter-ethnic comparisons of SNP-phenotype relationships in identifying causal DNA variants.
In the LDLR and LIPC loci, respectively, rs6511720 and rs2070895 were the most highly associated SNPs in both ethnicities, suggesting that each is either a causal SNP or tightly linked to a causal SNP (Figure S1B, S1C). In the APOB locus, the most highly LDL-C-associated SNP in European Americans, rs934197, is not associated with LDL-C in African Americans despite being common in the second ethnic group (Figure S1D). However, the second independent SNP in the locus found in European Americans, rs562338, is the most highly LDL-C-associated SNP in African Americans, suggesting that it may be a causal variant in the APOB locus, albeit not the only one detected in European Americans.
Rs6511720 is located in intron 1 of the LDLR gene, suggesting that it might affect either transcription or splicing of the gene transcript. Rs2070895 is in the promoter region of LIPC, just 300 bp upstream of the start codon, suggesting a role in gene transcription. Finally, rs562338 is located 20 kb upstream of the APOB gene, which would suggest some long-range regulatory role. Further experimentation aimed at evaluating whether rs6511720, rs2070895, and rs562338 are indeed causal DNA variants and by what mechanisms they might act in their respective loci is warranted.