Most genome-wide association and fine-mapping studies to date have been conducted in individuals of European descent, and genetic studies of populations of Hispanic/Latino and African ancestry are limited. In addition, these populations have more complex linkage disequilibrium structure. In order to better define the genetic architecture of these understudied populations, we leveraged >100,000 phased sequences available from deep-coverage whole genome sequencing through the multi-ethnic NHLBI Trans-Omics for Precision Medicine (TOPMed) program to impute genotypes into admixed African and Hispanic/Latino samples with genome-wide genotyping array data. We demonstrated that using TOPMed sequencing data as the imputation reference panel improves genotype imputation quality in these populations, which subsequently enhanced gene-mapping power for complex traits. For rare variants with minor allele frequency (MAF) < 0.5%, we observed a 2.3- to 6.1-fold increase in the number of well-imputed variants, with 11–34% improvement in average imputation quality, compared to the state-of-the-art 1000 Genomes Project Phase 3 and Haplotype Reference Consortium reference panels. Impressively, even for extremely rare variants with minor allele count <10 (including singletons) in the imputation target samples, average information content rescued was >86%. Subsequent association analyses of TOPMed reference panel-imputed genotype data with hematological traits (hemoglobin (HGB), hematocrit (HCT), and white blood cell count (WBC)) in ~21,600 African-ancestry and ~21,700 Hispanic/Latino individuals identified associations with two rare variants in the HBB gene (rs33930165 with higher WBC [p = 8.8×10⁻¹⁵] in African populations, rs11549407 with lower HGB [p = 1.5×10⁻¹²] and HCT [p = 8.8×10⁻¹⁰] in Hispanics/Latinos). By comparison, neither variant would have been genome-wide significant if either 1000 Genomes Project Phase 3 or Haplotype Reference Consortium reference panels had been used for imputation. Our findings highlight the utility of the TOPMed imputation reference panel for identification of novel rare variant associations not previously detected in similarly sized genome-wide studies of under-represented African and Hispanic/Latino populations.
Admixed African and Hispanic/Latino populations remain understudied in genetic studies of complex diseases. These populations have more complex linkage disequilibrium (LD) structure that can impair mapping of variants. Genotype imputation represents an approach to improve genome coverage, especially for rare or ancestry-specific variation; however, these understudied populations also have smaller relevant imputation reference panels. In this study, we leveraged >100,000 phased sequences generated from the multi-ethnic NHLBI TOPMed project for imputation in ~21,600 individuals of African ancestry (AAs) and ~21,700 Hispanics/Latinos. We demonstrated substantially higher imputation quality for low frequency and rare variants in comparison to the 1000 Genomes Project and Haplotype Reference Consortium reference panels. Analysis of quantitative hematological traits led to the discovery of associations with two rare variants in the HBB gene; one of these variants was replicated in an independent sample, and the other is known to cause anemia in the homozygous state. By comparison, the same HBB variants would not have been genome-wide significant using current reference panels due to lower imputation quality. Our findings demonstrate the power of TOPMed whole genome sequencing data for imputation and subsequent association analysis in admixed African and Hispanic/Latino populations.
Citation: Kowalski MH, Qian H, Hou Z, Rosen JD, Tapia AL, Shan Y, et al. (2019) Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. PLoS Genet 15(12): e1008500. https://doi.org/10.1371/journal.pgen.1008500
Editor: Gregory S. Barsh, Stanford University School of Medicine, UNITED STATES
Received: June 6, 2019; Accepted: October 30, 2019; Published: December 23, 2019
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: Laura M. Raffield and Chani J. Hodonsky are funded by T32 HL129982. Madeline H Kowalski, Huijun Qian, Alexander P. Reiner, and Yun Li are funded on R01 HL129132. Paul S. de Vries was supported by American Heart Association grant number 18CDA34110116. Eric Jorgenson and Hélène Choquet are funded by R01 EY027004 and R01 DK116738. Ruth J.F. Loos is funded by U01 HG007417 and R56HG010297. Steve Buyske is funded by U01 HG007419. Rasika A. Mathias, Lewis C. Becker, Nauder Faraday, and Lisa R. Yanek are funded by U01 HL72518, R01 HL087698, and R01 HL112064. Russell P Tracy, Stephen S. Rich, Jerome I. Rotter, and Mary Cushman are funded by 3R01HL-117626-02S1, 3R01HL-120393-02S1, HHSN268201500003I, N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR-001881, and DK063491. Edwin K Silverman and Michael H Cho are funded by U01 HL089856. L Adrienne Cupples is funded by NO1-HC-25195, HHSN268201500001I, and R01 HL092577-06S1. Hemant K. Tiwari and Marguerite R. Irvin are funded by R01 HL055673. Jiang He is funded by U01HL072507, R01HL087263, and R01HL090682. Juan M. Peralta and John Blangero are funded by R01 HL113323. Sharon L.R. Kardia and Jennifer A. Smith are funded by R01 HL119443 and R01 HL085571. Scott T. Weiss and Jessica A. Lasky-Su are funded by P01 HL132825. Kathleen C Barnes and Michelle Daya are funded by R01HL104608. Patrick T Ellinor is funded by NIH 1RO1HL092577, R01HL128914, K24HL105780, AHA 18SFRN34110082, and Fondation Leducq 14CVD01. Donna K Arnett is funded by R01 R01HL091397. Bertha Hidalgo is funded by K01 HL130609 01. Courtney Montgomery is funded by R01 HL113326. Nicholette D Palmer and Donald W Bowden are funded by R01 HL92301, R01 HL67348, R01 NS058700, R01 NS075107, R01 AR48797, R01 DK071891, M01 RR07122, F32 HL085989, P60 AG10484. Steven A. Lubitz is funded by NIH 1R01HL139731 and American Heart Association 18SFRN34250007. Kent D. Taylor is funded by 3R01HL-117626-02S1, 3R01HL-120393-02S1, R01HL071051, R01HL071205, R01HL071250, R01HL071251, R01HL071258, R01HL071259, UL1RR033176, and UL1TR001881. Patricia A. Peyser is funded by R01 HL119443 and R01 HL085571. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: Edwin K Silverman and Michael H Cho have received grant support from GSK, MHC has received consulting fees from Genentech. Scott T. Weiss and Kathleen C. Barnes received royalties from UpToDate. Patrick T. Ellinor is supported by a grant from Bayer AG to the Broad Institute focused on the genetics and therapeutics of cardiovascular diseases, and has also served on advisory boards or consulted for Bayer AG, Quest Diagnostics and Novartis. Steven A Lubitz receives sponsored research support from Bristol Myers Squibb / Pfizer, Bayer HealthCare, and Boehringer Ingelheim, and has consulted for Abbott, Quest Diagnostics, Bristol Myers Squibb / Pfizer. Other authors declared no conflicts of interest.
Genotype imputation, despite being a standard practice in modern genetic association studies, remains challenging in populations of Hispanic/Latino or African ancestry, particularly for rare variants [1–6]. One obstacle lies in the lack of appropriate whole genome sequence reference panels for these admixed populations. For individuals of European descent, the number of relevant haplotypes available has increased more than 500-fold, from 120 phased sequences in HapMap2 to more than 64,000 phased sequences in the Haplotype Reference Consortium (HRC) reference panel. However, HRC is predominantly European (other than the included 1000 Genomes Project Phase 3 (1000G) SNPs) and includes mostly low-coverage sequencing data (4–8x coverage). The state-of-the-art reference panels for African-ancestry (AA) and Hispanic/Latino cohorts, including 1000G and the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), are at least one order of magnitude smaller than HRC. This is especially problematic given the complex LD structure in admixed populations. The NHLBI Trans-Omics for Precision Medicine (TOPMed) Project has recently generated deep-coverage (mean depth 30x) whole genome sequencing (WGS) on more than 50,000 individuals from >26 cohorts and from diverse ancestral backgrounds (notably including ~26% AA and ~10% Hispanic/Latino participants), and now provides an unprecedented opportunity for substantially enhancing imputation quality in under-represented admixed populations and subsequently boosting power for mapping genes and regions underlying complex traits. Here we demonstrate the improvements in rare variant imputation quality in AA and Hispanic/Latino populations using TOPMed as a reference panel versus 1000G and HRC panels, and subsequently identify two low-frequency/rare HBB variant associations with blood cell traits in AA and Hispanic/Latino samples using TOPMed-imputed genotyping array data.
Results and discussion
The cohort and ancestry composition of the TOPMed freeze 5b whole genome sequence reference panel used in our study and the samples with array-based genotyping used for imputation and hematological traits association analyses in self-identified AA and Hispanic/Latino individuals are summarized in S1 and S2 Tables, respectively. We first selected two large U.S. minority cohorts—one AA and one Hispanic/Latino—in order to comprehensively evaluate imputation quality: the Jackson Heart Study (JHS, all AA, n = 3,082) and the Hispanic Community Health Study/Study of Latinos (HCHS/SOL, all Hispanic/Latino, n = 11,887). Both the JHS and HCHS/SOL have external sources of dense genotype data available for comparison. JHS is the largest AA general population cohort sequenced in TOPMed freeze 5b. Therefore, we removed JHS samples from the TOPMed freeze 5b reference panel prior to performing imputation into JHS samples using SNPs genotyped on the Affymetrix 6.0 array, treating the TOPMed freeze 5b calls as true genotypes for evaluation of imputation quality in JHS. HCHS/SOL is the largest and most regionally diverse population-based cohort of Hispanic/Latino individuals living in the US. For HCHS/SOL, we used the entire set of 100,506 phased sequences from TOPMed freeze 5b (including JHS) as reference and performed imputation into 11,887 Hispanic/Latino samples genotyped on the Illumina Omni 2.5 SOL custom array (with high quality genotypes at 2,293,536 markers). As the external source of genotype validation in HCHS/SOL, we used genotypes from the Illumina MEGA array genotyping data (containing >1.7 million multi-ethnic global markers, including low frequency coding variants and ancestry-specific variants) available in the same HCHS/SOL samples to assess imputation quality, evaluating 688,189 imputed markers available on MEGA but not on Omni2.5.
Compared with the 1000G Phase 3 reference panel, we were able to increase the number of well-imputed variants from ~28 and ~35 million to ~51 and ~58 million in JHS and HCHS/SOL, respectively (see S3 Table for genome-wide distribution of well-imputed variants). We defined well-imputed variants based on our previous work [1, 2, 4], using MAF-specific estimated R2 thresholds to ensure an average R2 ≥ 0.8 in each imputed cohort separately. For all rare variants with MAF < 0.5%, we observed ~4.2X (2.3X) and ~6.1X (3.3X) increases in the number of well-imputed variants in JHS (HCHS/SOL), compared with 1000G and HRC, respectively. We also observed 22% (11%) and 34% (20%) increases in imputation information content (as measured by average true R2, which is the squared Pearson correlation between imputed and true genotypes) (Fig 1 and S1 Fig, Table 1). For very rare variants with MAF < 0.05%, we observed ~22.1X (5.8X) and ~11.8X (10.7X) increases in the number of well-imputed variants, with 6% (5%) and 13% (11%) increases in average true R2, in JHS (HCHS/SOL), compared with 1000G and HRC, respectively. Mismatch rates between true and imputed genotypes were low; using the program CalcMatch, the mean concordance for heterozygous individuals (generally the hardest to impute) in JHS was 97.5% for all well-imputed variants in Table 1, 96.6% for MAF < 0.5%, and 97.6% for MAF < 0.05%. For HCHS/SOL, the mean concordance was 98.2% for all well-imputed variants, 92.9% for MAF < 0.5%, and 83.8% for MAF < 0.05%. Most well-imputed variants from 1000G and HRC were also included in TOPMed freeze 5b imputation results (S2 Fig).
Imputation quality (measured by true R2 [Y-axis]) is plotted with progressively more stringent post-imputation filtering from left to right, with filtering according to estimated R2 (X-axis), for variants with MAF < 1%. Top panels are for the JHS cohort and bottom panels for the HCHS/SOL cohort. Three reference panels are shown: TOPMed (TOPMed freeze 5b), 1000G (the 1000 Genomes Phase 3), and HRC (the Haplotype Reference Consortium).
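As a minimal sketch of how the two quantities plotted here relate, the Python below computes true R2 as the squared Pearson correlation between imputed dosages and true genotypes, then averages it over the variants that pass an estimated-R2 filter. The function names, the (estimated R2, true R2) pair layout, and the choice of an inclusive threshold are illustrative assumptions, not the pipeline used in the study.

```python
def true_r2(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_d = sum(imputed_dosages) / n
    mean_g = sum(true_genotypes) / n
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    var_g = sum((g - mean_g) ** 2 for g in true_genotypes)
    cov = sum((d - mean_d) * (g - mean_g)
              for d, g in zip(imputed_dosages, true_genotypes))
    if var_d == 0 or var_g == 0:
        # Correlation is undefined for a monomorphic vector; this sketch
        # treats such a variant as carrying no information (assumption).
        return 0.0
    return cov * cov / (var_d * var_g)


def filter_by_estimated_r2(variants, threshold):
    """variants: iterable of (estimated_r2, true_r2) pairs (hypothetical layout).

    Returns (number of variants kept, mean true R2 among the kept variants).
    """
    kept = [true for est, true in variants if est >= threshold]
    if not kept:
        return 0, float("nan")
    return len(kept), sum(kept) / len(kept)
```

Sweeping `threshold` from 0 to 1 and plotting the returned mean against it gives a curve of the kind the caption describes, with the variant count shrinking as the filter tightens.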
Even for extremely rare variants with sample minor allele count (MAC) <10 (including cohort singleton variants in the target JHS cohort), average information content rescued (again measured by true R2) was >86%. For example, out of the 8.67 million singleton variants discovered in JHS by TOPMed WGS, 72% (6.24 million) can be well-imputed using Affymetrix 6.0 genotypes and using TOPMed freeze 5b (without JHS individuals) as reference, with an average true R2 of 0.92 (Table 2). Singletons within JHS are defined as variants with MAC = 1 among the JHS samples but which are present in multiple copies in the reference panel. Specifically, the average reference MAC is 29.3 before post-imputation quality control (QC) and 31.0 after QC, with all variants having a MAC>5 in the overall reference panel. Imputation quality is similarly high when examining extremely rare MAC variants in the reference panel, and even higher, as expected, with higher MAC variants within the JHS sample (S4 and S5 Tables). Similar observations hold true for HCHS/SOL, with slightly lower imputation quality (S6 and S7 Tables). Compared to JHS African Americans, the lower imputation quality in HCHS/SOL Hispanic/Latino individuals is likely attributable to multiple reasons, including (1) the more complex LD structure among Hispanic/Latino individuals due to the admixture of three ancestral populations; (2) the availability of a much smaller subset of rare variants for quality evaluation through MEGA array genotyping in HCHS/SOL (in contrast to the availability of nearly all segregating variants in JHS through high-coverage sequencing); and (3) the smaller number of relevant haplotypes in the TOPMed freeze 5b reference (~26% self-identified AAs compared to ~10% self-identified Hispanics/Latinos). Imputation quality for rare and low-frequency variants that are estimated to be well imputed in Table 1 is further stratified by regional background in HCHS/SOL and displayed in S8 Table. 
We note that greater numbers of AA and Hispanic/Latino individuals will be included in future releases of sequencing datasets from TOPMed, which we anticipate will further improve imputation quality; inclusion of JHS itself in imputation for other AA cohorts would also improve imputation quality.
Encouraged by these substantial gains in information content for low-frequency and rare variants, we proceeded with imputation in several additional AA and Hispanic/Latino data sets with array-based genotyping (S1 and S9 Tables), followed by association analyses with quantitative blood cell traits to evaluate the power of TOPMed freeze 5b-based imputation in minorities for discovery of genetic variants underlying complex human traits. We specifically chose hematological traits for several reasons. First, these traits are important intermediate clinical phenotypes for a variety of cardiovascular, hematologic, oncologic, immunologic, and infectious diseases. Second, these traits have family-based heritability estimates in the range of 40–65% [12, 13], and have been highly fruitful for gene-mapping with >2,700 common and rare variants identified, though primarily in individuals of European ancestry [14–19]. Third, these traits remain under-studied in admixed AA and Hispanic/Latino populations, despite evidence for the existence of variants with distinct genetic architecture in AAs and Hispanics/Latinos [20–22]. For example, while hundreds of variants identified in genome-wide association studies (GWAS) of WBC in individuals of European descent explain only ~7% of array heritability, the African-specific Duffy null variant DARC rs2814778 alone accounts for 15–20% of population-level WBC variability in AAs. Finally, we have previously successfully leveraged deep-coverage exome sequencing-based imputation using resources from the Exome Sequencing Project for more powerful mapping of genes and regions associated with hematological traits in AAs. Hemoglobin level (HGB), hematocrit (HCT), and WBC were chosen for our primary phenotypic analysis because these traits are available in the largest sample size among the AAs and Hispanics/Latinos included in our discovery cohorts.
Our imputation sample used for discovery blood cell trait association analyses included eight cohorts (21,513 AAs and 21,689 Hispanics/Latinos) (S1 Table). These discovery samples do not overlap with individuals sequenced as part of TOPMed freeze 5b (S2 Table). We used the full set of 100,506 phased sequences from TOPMed freeze 5b (including JHS) as the imputation reference panel. We then carried out AA- and Hispanic/Latino-stratified association analyses with quantitative HGB, HCT, and total WBC separately in each cohort genotyping array data set, accounting for ancestry and relatedness. The genome-wide association results for each imputed cohort data set were then meta-analyzed within each ancestry group. S3–S8 Figs show the Manhattan plots from ethnic-specific meta-analyses for each trait. QQ plots (S9–S14 Figs) show no obvious early departure, with genomic control lambda ranging from 1.008 to 1.044, indicating minimal global inflation of test statistics. For replication of any novel associations identified in the imputation-based discovery analysis, we utilized WGS genotype data and hematological trait data from the non-overlapping set of AA individuals within TOPMed freeze 5b (S10 Table) (see Methods for details).
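The per-cohort results are combined and then checked for inflation. A minimal sketch of both steps follows, assuming a fixed-effect inverse-variance-weighted meta-analysis (the text here does not name the method or software used) and the usual definition of genomic control lambda as the median 1-df chi-square statistic divided by its expected median under the null. Function names and inputs are hypothetical.

```python
from math import erfc, sqrt
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis of one variant.

    betas, ses: per-cohort effect estimates and standard errors.
    Returns (pooled beta, pooled standard error, two-sided p-value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("need matching, non-empty lists of betas and SEs")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided normal p-value; erfc stays accurate for large |z|.
    p_value = erfc(abs(z) / sqrt(2.0))
    return beta, se, p_value


def genomic_control_lambda(z_scores):
    """Median observed chi-square (z squared) over its null expectation."""
    chi2 = [z * z for z in z_scores]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A lambda close to 1, such as the 1.008 to 1.044 range reported above, indicates that the bulk of test statistics follow the null distribution.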
We first evaluated association statistics for variants previously associated with HGB, HCT, or WBC count in AA and Hispanic/Latino populations (summarized in S11 Table). We assembled a list of 24 AA and 13 Hispanic/Latino previously identified autosomal signals from prior published GWAS or exome-based studies [1, 19, 20, 24–30]. Our lists excluded variants reported in multi-ethnic cohorts or meta-analyses including individuals of non-AA or non-Hispanic/Latino ancestry to guard against the scenario that the reported signals were driven predominantly by individuals of European or Asian ancestry. Among the previously reported 24 AA and 13 Hispanic/Latino variants, all but five (four SNPs and a 3.8 kb deletion variant esv2676630) passed variant quality-control filters in TOPMed freeze 5b and were subsequently well-imputed in our target AA and Hispanic/Latino data sets with a stringent post-imputation R2 filter of >0.8 (detailed in S12 Table). Among the 31 known HGB, HCT, or WBC count associations testable with TOPMed freeze 5b, our imputed/discovery cohorts confirmed 84% of these previously reported findings with a consistent direction of effect, using a stringent genome-wide significance threshold of p<5×10⁻⁸. Using more lenient p-value thresholds, we could replicate 94% (p<5×10⁻⁶) and 100% (p<0.05) of the previously reported findings with the same direction of effect. While these results help confirm the overall validity of our hematological trait association results, it is important to note for these comparisons that many of the samples included in the current TOPMed freeze 5b imputed genome-wide association analysis were also used in the publications originally reporting associations in AA and Hispanic/Latino individuals.
Our ancestry-stratified imputation-based discovery meta-analysis revealed two blood cell trait associations that have not been previously reported, at a genome-wide significance threshold of p<5×10⁻⁹ in Hispanics/Latinos and p<1×10⁻⁹ in AA populations, based on appropriate significance thresholds for whole genome sequencing analysis. One signal was revealed in each ancestry group: hemoglobin subunit beta (HBB) missense (p.Glu7Lys) variant rs33930165 (gb38:11:5227003:C:T) associated with increased WBC in AAs (β = 0.35 and p = 8.8×10⁻¹⁵, adjusting for SNP rs2814778 and removing potential minor allele homozygotes) (Table 3), and HBB stop-gain (p.Gln40Ter) variant rs11549407 (gb38:11:5226774:G:A) associated with lower HGB and HCT in Hispanics/Latinos (β = -1.92, p = 1.5×10⁻¹²; β = -1.66, p = 8.8×10⁻¹⁰). Both variants were either low frequency or rare: the HBB missense variant rs33930165 (hemoglobin C variant) has a MAF of 1.14% among the imputed AA discovery samples and is even rarer in non-AA individuals (absent in Europeans in 1000G); the stop-gain variant rs11549407 has a MAF of 0.03% (MAC ~ 15) among the imputed Hispanics/Latinos and is monomorphic among the AAs. Both variants are classified as pathogenic in ClinVar. Both variants were well imputed with R2 ranging from 0.831 to 0.994 and 0.862 to 0.999 in the contributing AA and Hispanic/Latino cohorts, respectively (Table 3). Due to the low allele frequency of these variants in AAs and Hispanics/Latinos and even lower frequency in individuals of European descent, both variants were imputed with lower quality using other reference panels (S13 and S14 Tables): the missense variant HBB rs33930165 had R2 as low as 0.127 and 0.456 using 1000G and HRC, respectively, as references; the HBB stop-gain variant rs11549407 was not available in the 1000G reference panel and had R2 as low as 0.413 using HRC as the reference panel. Carrying the 1000G and HRC imputed genotypes forward to association analyses with hematological traits in the subset of our target imputation cohorts where the variants were well imputed (R2 > 0.8), we observed that none of the p-values reached the genome-wide significance threshold. This explains why these variants were not detectable at a genome-wide significant level using previously available imputation reference panels, with obvious implications for other complex trait association studies in ancestrally diverse study populations.
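The loss of power follows from how the comparison was set up: only cohorts in which the variant passed the R2 > 0.8 filter contributed samples. The sketch below tallies the sample size that survives such a per-cohort filter. It also reports the sum of n × R2, a commonly used approximation to effective sample size under imperfect imputation; that quantity is an assumption of this sketch and is not reported in the text. Cohort sizes in the usage example are hypothetical, while the R2 values 0.831 and 0.127 are the lower bounds quoted above.

```python
def retained_sample_size(cohorts, r2_threshold=0.8):
    """cohorts: iterable of (n_samples, imputation_r2) pairs for one variant.

    Returns (n_retained, n_effective): n_retained counts samples from cohorts
    whose R2 exceeds the threshold, and n_effective sums n * R2 over those
    cohorts (an approximation, assumed here for illustration).
    """
    passing = [(n, r2) for n, r2 in cohorts if r2 > r2_threshold]
    n_retained = sum(n for n, _ in passing)
    n_effective = sum(n * r2 for n, r2 in passing)
    return n_retained, n_effective


# Hypothetical cohort sizes; the second cohort fails the filter and drops out.
example_cohorts = [(8000, 0.831), (5000, 0.127)]
n_retained, n_effective = retained_sample_size(example_cohorts)
```

With a reference panel that imputes the variant poorly in most cohorts, few samples remain in the test, which is consistent with the non-significant results described above.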
Both of our previously unreported genotype-trait associations involve coding variants of HBB, which encodes the beta polypeptide chains in adult hemoglobin. The HBB stop gain (p.Gln40Ter) variant 11:5226774:G:A (rs11549407) is the most common cause of beta zero thalassemia in West Mediterranean countries, particularly among the founder population of Sardinia [32, 33], where the variant has a population allele frequency of ~5%. The Sardinian population is represented in the HRC reference panel (~3500 individuals), which likely contributes to the reasonable imputation quality observed using HRC in most but not all cohorts, in contrast to the absence of this variant in the 1000G reference panel due to very low minor allele count, though imputation quality was clearly improved with the TOPMed freeze 5b reference panel. The p.Gln40Ter mutation is much less prevalent outside of the Western Mediterranean, but has been detected among individuals with beta thalassemia among admixed populations from Central and South America [34, 35], which are geographically and genetically similar to some of the Hispanic/Latino samples included in our imputation-based discovery sample. While the individuals carrying the HBB p.Gln40Ter allele in our unselected population-based Hispanic/Latino sample were all imputed heterozygotes (consistent with “thalassemia minor” and generally considered healthy), there is increasing evidence that silent carriers of beta-thalassemia and sickle cell mutations may be at risk for various health-related conditions [36, 37]. Due to the relatively small number of Hispanic/Latino individuals with blood cell trait data in TOPMed freeze 5b (n~1,080), including only one heterozygote carrier of rs11549407 in those with blood cell traits measured, we were unable to perform a well-powered replication of the association of rs11549407 with HGB and HCT. Moderate anemia is known to occur in some individuals with thalassemia minor, however, concordant with our results .
The association of the HBB missense (p.Glu7Lys) variant 11:5227003:C:T or rs33930165 with higher total WBC (β = 0.35, p = 8.8×10⁻¹⁵) among AAs was unexpected; rs33930165 has been associated with red blood cell indices such as mean corpuscular hemoglobin concentration but not with white blood cell traits. Because of the higher allele frequency of this variant and also the larger number of AA samples (n = 6,743) in TOPMed freeze 5b, we were able to replicate this HBB rs33930165 association with total WBC in an independent sample (β = 0.27 and p = 4.6×10⁻⁴) of AA individuals. By contrast, there was no significant association of the HBB rs33930165 p.Glu7Lys variant with HGB and a modest association with lower HCT in the AA discovery and replication data sets (discovery HCT β = -0.122, p = 0.012; HGB β = 0.110, p = 0.022; replication HCT β = -0.239, p = 0.002; HGB β = -0.009, p = 0.909). The minor allele T of rs33930165 encodes an abnormal form of hemoglobin, Hb C, which in the homozygous state is associated with mild chronic hemolytic anemia and mild to moderate splenomegaly. In our discovery and replication data sets, there were no individuals homozygous for the Hb C variant, nor any compound heterozygotes for Hb S/C (Hb S is the sickling form of hemoglobin, and individuals homozygous for Hb S have sickle cell disease), which excludes the possibility that the apparently higher WBC is driven by an “inflammatory response” confined to a small number of individuals clinically affected by sickle cell disease or hemoglobin C disease. We next evaluated the association of HBB rs33930165 with circulating number of WBC subtypes, including neutrophils, monocytes, lymphocytes, basophils, and eosinophils. S15 Table shows the results in our AA imputation-based discovery data sets (S16 Table), and TOPMed freeze 5b WGS replication samples (S17 Table), which suggest that the apparent association of HBB rs33930165 with total WBC is mainly driven by an association with higher lymphocyte count, with perhaps a more modest association with higher neutrophil count. Further studies are needed to delineate the putative mechanism of this unexpected association.
Our findings showcase the power of the large, ancestrally diverse TOPMed WGS data set as an imputation reference panel for admixed populations, in terms of both imputation quality and accuracy (especially for rare variants) and subsequent association studies for complex traits. Specifically, we identified two rare variants associated with hematological traits in AA and Hispanic/Latino populations and were able to validate our initial HBB association with WBC in an independent replication sample of sequenced individuals. In our study, we used EAGLE and minimac4 for imputation. We anticipate that the advantages of TOPMed as a reference panel also manifest when using alternative imputation methods. However, making TOPMed available as a reference panel compatible with each imputation method (e.g., corresponding recombination rate information) would be essential. In addition, computing time and memory usage should be taken into consideration as not all existing methods can scale to ~100 million markers in populations containing over thousands of individuals. TOPMed freeze 5b imputation is slightly more computationally intensive than use of the HRC reference panel (and takes nearly eight times longer than 1000G based imputation using the Michigan imputation server). However, we feel this increase in computational time is more than justified by the large number of additional well-imputed variants. We would note that the gains in imputation quality for AA and Hispanic/Latino populations using the TOPMed WGS reference panel likely do not apply to populations poorly represented in TOPMed freeze 5b (such as South Asians); future large-scale sequencing, including in later freezes of TOPMed, will improve imputation quality further across global populations.
Future studies should also evaluate potential increases in statistical power for gene- and region- based tests using TOPMed imputed data. To demonstrate the potential gains, we have performed a targeted analysis of genes previously identified for their association with white blood cell count or hemoglobin/hematocrit levels in exome genotyping arrays or exome sequencing studies. We compared gene-based SKAT test results at these known loci using TOPMed freeze 5b based imputation to gene-based tests performed using 1000G and HRC reference panels. These results are presented in S18–S21 Tables and demonstrate that in both African ancestry and Hispanic/Latino populations more previously implicated genes from exome arrays or sequencing based studies were significant using TOPMed freeze 5b as an imputation reference panel versus 1000G phase 3 or HRC imputation. Further exploration of gene- and region-based tests is warranted in future studies, however. We expect the combination of high-quality imputation and higher depth sequencing datasets in larger cohorts of individuals will provide increased power for all rare variant association analyses in diverse populations in the near future.
We here performed secondary data analysis on deidentified data only (exempt research). Access to TOPMed data was approved by the University of North Carolina at Chapel Hill Institutional Review board (study 16–2213). All individual studies included in TOPMed were approved by relevant local ethical review boards.
TOPMed 5b sequencing and phasing
The reference panel used for imputation was obtained from deep-coverage whole genome sequences derived from NHLBI’s TOPMed program (www.nhlbiwgs.org), freeze 5b (September 2017). This release included 54,035 non-duplicated, dbGaP released samples, of whom 50,253 have consent to be part of an imputation reference panel. The parent studies that contributed these 50,253 samples are listed in S2 Table. Specific to our analyses, freeze 5b includes 3,082 individuals from the Jackson Heart Study, who were removed from the reference panel for our analysis of imputation quality in this particular cohort. Overall, freeze 5b included 54% European ancestry, 26% AA, 10% Hispanic/Latino, 7% Asian, and 3% other ancestry samples. Detailed sequencing methods used in TOPMed are available at https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-project-freeze-5b-phases-1-and-2. In brief, WGS with mean genome coverage ≥30x was completed at six sequencing centers (New York Genome Center, the Broad Institute of MIT and Harvard, the University of Washington Northwest Genomics Center, Illumina Genomic Services, Macrogen Corp., and Baylor Human Genome Sequencing Center). Sequence data files were transferred from sequencing centers to the TOPMed Informatics Research Center (IRC), where reads were aligned to human genome build GRCh38, using a common pipeline, and joint genotype calling was undertaken. Variants were filtered using a machine learning based support vector machine (SVM) approach, using variants present on genotyping arrays as positive controls and variants with many Mendelian inconsistencies as negative controls. After filtering potentially problematic variant sites, freeze 5b contained ~438 million single nucleotide polymorphisms and ~33 million short insertion-deletion variants. For our imputation analyses, we excluded from the reference panel variants with an overall allele count of 5 or less (leaving 88,062,238 variants in our reference panel, Table 1). 
Additional sample level quality control (such as detection of sex mismatches, pedigree discrepancies, sample swaps, etc.) was undertaken by the TOPMed Data Coordinating Center (DCC).
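The sample counts above determine the size of the reference panel in haplotypes. The following minimal sketch reproduces that arithmetic and the allele-count filter. It assumes two phased autosomal haplotypes per sample and treats "allele count" as a single panel-wide integer per variant; the function and constant names are illustrative and are not part of the TOPMed pipeline.

```python
# Counts taken from the text; everything else is an illustrative assumption.
CONSENTED_SAMPLES = 50_253  # freeze 5b samples consented for reference-panel use
JHS_SAMPLES = 3_082         # Jackson Heart Study samples within freeze 5b
MAX_EXCLUDED_AC = 5         # variants with allele count of 5 or less are dropped


def haplotype_count(n_samples: int) -> int:
    """Each diploid sample contributes two phased autosomal haplotypes."""
    return 2 * n_samples


def keep_variant(allele_count: int) -> bool:
    """Return True if the panel-wide allele count exceeds the exclusion cutoff."""
    return allele_count > MAX_EXCLUDED_AC


full_panel = haplotype_count(CONSENTED_SAMPLES)
panel_without_jhs = haplotype_count(CONSENTED_SAMPLES - JHS_SAMPLES)
```

Under these assumptions, `full_panel` and `panel_without_jhs` correspond to the full reference panel and to the panel with JHS participants removed.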
Genome-wide genotyping array data sets used for evaluation of imputation quality and/or phenotype association analysis
Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
The HCHS/SOL cohort began in 2006 as a prospective study of Hispanic/Latino populations in the U.S. [40–42]. From 2008 to 2011, 16,415 adults were recruited from a random sample of households in four communities (the Bronx, Chicago, Miami, and San Diego). Each Field Center recruited >4,000 participants from diverse socioeconomic groups. Most participants self-identified as having Cuban, Dominican, Puerto Rican, Mexican, Central American, or South American heritage. The cohort has been genotyped using both an Illumina Omni2.5M array (plus 150,000 custom SNPs, including ancestry-informative markers, Amerindian population-specific variants, previously identified GWAS hits, and other candidate polymorphisms, for a total of 2,293,715 SNPs) and the Illumina Multi-Ethnic Genotyping Array (MEGA; 1,705,969 SNPs in total), the latter as part of efforts by the Population Architecture Using Genomics and Epidemiology (PAGE) consortium to better assess variation in non-European populations. The MEGA array also includes additional exonic, functional, and clinically relevant variants. Illumina Omni2.5M array genotypes were available for 12,802 samples, of whom 11,887 also had MEGA array genotypes. The Illumina Omni2.5M array was used for imputation to the TOPMed reference panel, with the MEGA array genotypes treated as true genotypes for evaluation of imputation quality. For association analysis, imputation was performed on 11,887 samples after merging Omni2.5M array genotypes and MEGA array genotypes (MEGA genotypes were used for variants on both arrays, which resulted in 2,144,214 variants after quality control). Regional background (for evaluation of stratified imputation quality in S8 Table) was defined using both self-identified background and genetic markers, as described previously. For the hematological traits association analysis, 11,588 Hispanic/Latino participants were included.
Women’s Health Initiative.
The Women’s Health Initiative (WHI) is a long-term national health study focused on heart disease, cancer, and osteoporotic fractures in older women. WHI originally enrolled 161,808 women aged 50–79 between 1993 and 1998 at 40 centers across the US, including both a clinical trial arm (comprising three trials, of hormone therapy, dietary modification, and calcium/vitamin D) and an observational study arm. The recruitment goal of WHI was to include a socio-demographically diverse population with racial/ethnic minority groups proportionate to the total minority population of US women aged 50–79 years. This goal was achieved; a diverse population, including 26,045 (17%) women from minority populations, was recruited. Two WHI extension studies conducted additional follow-up on consenting women from 2005–2010 and 2010–2015. Genotyping was available on some WHI participants through the WHI SNP Health Association Resource (SHARe), which used the Affymetrix 6.0 array (~906,600 SNPs, 946,000 copy number variation probes), and on other participants through the MEGA array. Imputation and association analyses were performed separately in individuals with Affymetrix only, MEGA only, and both Affymetrix and MEGA data (S1 Table). For variants with both Affymetrix and MEGA genotypes available, MEGA genotypes were used. In total, 4,318 Hispanic/Latino and 8,494 AA women with blood cell traits were included.
UK Biobank.
UK Biobank recruited 500,000 people aged 40–69 years in 2006–2010, establishing a prospective biobank study to understand the risk factors for common diseases such as cancer, heart disease, stroke, diabetes, and dementia. Participants are being followed up through routine medical and other health-related records from the UK National Health Service. UK Biobank has genotype data on all enrolled participants, as well as extensive baseline questionnaire and physical measures and stored blood and urine samples. Hematological traits were assayed as previously described. Genotyping on custom Axiom arrays and subsequent quality control have been previously described. Samples were included in our analyses if self-reported ancestry was “Black Caribbean”, “Black African”, “Black or Black British”, “White and Black Caribbean”, “White and Black African”, or “Any Other Black Background”. Variants were selected based on call rate exceeding 95%, Hardy-Weinberg equilibrium (HWE) p-value exceeding 10⁻⁸, and MAF exceeding 0.5%. Subsequently, variants in approximate linkage equilibrium were used to generate ten principal components. Samples were excluded if the first principal component exceeded 0.1 and the second principal component exceeded 0.2, to exclude individuals not clustering with most African ancestry individuals. In total, 6,820 AA participants with blood cell traits were included in the analysis.
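The variant and sample filters described for UK Biobank can be written as two simple predicates. This is a minimal sketch that applies the thresholds exactly as worded above (including the "and" between the two principal component conditions); the `VariantStats` container and function names are hypothetical, and the summary statistics are assumed to have been computed upstream.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Hypothetical per-variant summary statistics computed upstream."""
    call_rate: float  # fraction of samples with a non-missing genotype
    hwe_p: float      # Hardy-Weinberg equilibrium test p-value
    maf: float        # minor allele frequency


def variant_passes(v: VariantStats) -> bool:
    """Keep variants with call rate > 95%, HWE p > 1e-8, and MAF > 0.5%."""
    return v.call_rate > 0.95 and v.hwe_p > 1e-8 and v.maf > 0.005


def sample_excluded(pc1: float, pc2: float) -> bool:
    """Exclude a sample when PC1 exceeds 0.1 and PC2 exceeds 0.2."""
    return pc1 > 0.1 and pc2 > 0.2
```

The variants that pass would then be pruned to approximate linkage equilibrium before principal components are computed; that step is not shown here.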
Genetic Epidemiology Research on Aging (GERA).
The GERA cohort includes over 100,000 adults who are members of the Kaiser Permanente Medical Care Plan, Northern California Region (KPNC) and consented to research on the genetic and environmental factors that affect health and disease, linking together clinical data from electronic health records, survey data on demographic and behavioral factors, and environmental data with genetic data. The GERA cohort was formed by including all self-reported racial and ethnic minority participants with saliva samples (19%); the remaining participants were drawn sequentially and randomly from non-Hispanic White participants (81%). Genotyping was completed as previously described using 4 different custom Affymetrix Axiom arrays with ethnic-specific content to increase genomic coverage. Principal components analysis was used to characterize genetic structure in this multi-ethnic sample, as previously described. Blood cell traits were extracted from medical records. In individuals with multiple measurements, the first visit with complete white blood cell differential (if any) was used for each participant. Otherwise, the first visit was used. In total, 5,783 Hispanic/Latino and 2,246 AA participants with blood cell traits were included in the analysis.
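The visit-selection rule for participants with multiple measurements can be sketched as follows. The `Visit` record, the ISO-formatted date strings, and the five subtype names are assumptions made for illustration; the actual structure of the GERA medical record extract is not described in this section.

```python
from typing import Dict, List, NamedTuple, Optional

# Assumed set of subtypes that makes a differential "complete".
SUBTYPES = ("neutrophils", "lymphocytes", "monocytes", "eosinophils", "basophils")


class Visit(NamedTuple):
    date: str  # ISO 8601 date, so string order matches chronological order
    differential: Dict[str, Optional[float]]  # subtype name -> count, if measured


def has_complete_differential(visit: Visit) -> bool:
    """True when every assumed subtype has a recorded value."""
    return all(visit.differential.get(s) is not None for s in SUBTYPES)


def select_visit(visits: List[Visit]) -> Optional[Visit]:
    """First visit with a complete differential; otherwise the first visit."""
    ordered = sorted(visits, key=lambda v: v.date)
    for visit in ordered:
        if has_complete_differential(visit):
            return visit
    return ordered[0] if ordered else None
```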
Jackson Heart Study (JHS).
JHS is a population-based study designed to investigate risk factors for cardiovascular disease in African Americans. From 2000 to 2004, JHS recruited 5,306 AA participants aged 35–84 from urban and rural areas of the three counties (Hinds, Madison, and Rankin) that comprise the Jackson, Mississippi metropolitan area, including a nested family cohort (≥ 21 years old) and some prior participants from the Atherosclerosis Risk in Communities (ARIC) study [50, 51]. Genotyping was performed using an Affymetrix 6.0 array through NHLBI’s Candidate Gene Association Resource (CARe) consortium in 3,029 individuals, with quality control described previously. Because the JHS sample size is greater in TOPMed freeze 5b (n = 3,082) than in the CARe genotyping data (n = 3,029), we extracted, in the non-duplicated JHS samples sequenced in TOPMed and included in the imputation reference panel, the SNPs that are genotyped on the Affymetrix 6.0 array and passed CARe consortium quality control; the 821,172 such variants that also passed TOPMed quality control were used for imputation.
Coronary Artery Risk Development in Young Adults (CARDIA).
The CARDIA study is a longitudinal study of cardiovascular disease risk initiated in 1985–86 in 5,115 AA and European ancestry men and women, then aged 18–30 years. The CARDIA sample was recruited at four sites: Birmingham, AL; Chicago, IL; Minneapolis, MN; and Oakland, CA [54, 55]. Similar to JHS, genotyping was performed through the CARe consortium [52, 53] using an Affymetrix 6.0 array. In total, 1,619 AA participants with blood cell traits were included in the analysis.
Atherosclerosis Risk in Communities (ARIC).
The ARIC study was initiated in 1987, recruiting participants aged 45–64 years from 4 field centers (Forsyth County, NC; Jackson, MS; northwestern suburbs of Minneapolis, MN; Washington County, MD) in order to study cardiovascular disease and its risk factors, including the participants of self-reported AA ancestry included here. Standardized physical examinations and interviewer-administered questionnaires were conducted at baseline (1987–89), three triennial follow-up examinations, a fifth examination in 2011–13, and a sixth examination in 2016–17. Genotyping was performed through the CARe consortium using an Affymetrix 6.0 array [52, 53]. In total, 2,392 AA participants with blood cell traits were included in the analysis.
Imputation and post-imputation quality filtering
We first phased individuals from each cohort separately using Eagle with default settings. We subsequently performed haplotype-based imputation using minimac4, with phased haplotypes from TOPMed freeze 5b as the reference. We used 100,506 phased TOPMed freeze 5b whole genome sequences (two haplotypes for each of the 50,253 consented samples) as the reference for all cohorts except JHS, for which we used the 94,342 TOPMed freeze 5b non-JHS sequences. We additionally imputed HCHS/SOL and JHS using the 1000 Genomes Phase 3 and HRC reference panels. Post-imputation quality filtering was performed using an R2 threshold specific to each MAF category, chosen to ensure that the average R2 for variants passing the threshold was at least 0.8, following our previous work [4, 59]. Restricting to variants passing post-imputation quality control in at least two cohorts resulted in 34.4–35.8 million variants assessed in the AA cohorts and 26.7–27.2 million assessed in the Hispanic/Latino cohorts, depending on the exact sample size of the tested trait. Imputation and association analysis included autosomal variants only. We assessed imputation quality (comparing true and estimated average R2) in three selected 3 Mb regions: the 16–19 Mb region (relative to the start of each chromosome) of chromosomes 3, 12, and 20. Example scripts for imputation quality control are available at https://yunliweb.its.unc.edu/topmed5bimputation/index.php.
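The MAF-specific thresholding described above can be illustrated with a short search routine. This is a sketch under stated assumptions, not the authors' script: it assumes the quantity averaged is the estimated R2 itself, that thresholds are chosen from the observed R2 values, and that the MAF bin edges shown are placeholders (the paper's own categories may differ).

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical MAF categories (lower bound exclusive, upper bound inclusive).
MAF_BINS: List[Tuple[float, float]] = [(0.0, 0.005), (0.005, 0.05), (0.05, 0.5)]


def min_threshold(est_r2: Sequence[float], target: float = 0.8) -> Optional[float]:
    """Lowest threshold t such that variants with R2 >= t average at least
    `target`. Returns None when no threshold achieves the target."""
    values = sorted(est_r2, reverse=True)
    best = None
    running = 0.0
    for i, r2 in enumerate(values):
        running += r2
        # Evaluate only once all variants tied at this R2 have been included.
        last_of_tie = i + 1 == len(values) or values[i + 1] < r2
        if last_of_tie and running / (i + 1) >= target:
            best = r2
    return best


def thresholds_by_maf(
    variants: Sequence[Tuple[float, float]],  # (MAF, estimated R2) per variant
    target: float = 0.8,
) -> Dict[Tuple[float, float], Optional[float]]:
    """Apply the threshold search separately within each MAF category."""
    result = {}
    for lo, hi in MAF_BINS:
        in_bin = [r2 for maf, r2 in variants if lo < maf <= hi]
        result[(lo, hi)] = min_threshold(in_bin, target)
    return result
```

Because the values are scanned in descending order, the running mean can only fall as more variants are admitted, so the last threshold that still meets the target is the most permissive one. Rare-variant categories typically need a stricter threshold than common-variant categories to reach the same average quality.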
HGB, HCT, WBC and differential were measured in both the discovery data sets (S9 and S16 Tables) and a subset of the TOPMed freeze 5b samples (S10 and S17 Tables) using automated clinical hematology analyzers. Prior to association analyses, we excluded extreme outlier values, notably WBC values >200 × 10⁹/L (as well as WBC subtype count values in these individuals), HCT >60%, and HGB >20 g/dL. For longitudinal cohort studies, all values are from the same exam cycle, chosen based on largest available sample size. WBC traits were log transformed due to their skewed distribution. For all traits, we first derived trait residuals adjusting for age, age squared, sex, and principal components/study specific covariates as needed. Trait residuals were then inverse-normalized prior to analysis.
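The outlier exclusion and inverse normalization steps can be sketched with the Python standard library. The residuals are assumed to come from a covariate regression performed upstream (not shown). The rank offset of 3/8 (Blom) and the averaging of tied ranks are assumptions, since the text does not specify which variant of the transformation was used.

```python
from statistics import NormalDist
from typing import List, Sequence

# Upper limits from the text: WBC in 10^9 cells/L, HCT in %, HGB in g/dL.
LIMITS = {"WBC": 200.0, "HCT": 60.0, "HGB": 20.0}


def is_extreme(trait: str, value: float) -> bool:
    """True when a measurement exceeds the exclusion limit for its trait."""
    return value > LIMITS[trait]


def inverse_normal(values: Sequence[float], c: float = 3.0 / 8.0) -> List[float]:
    """Rank-based inverse normal transformation with tied ranks averaged."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]
```

For WBC traits, the log transformation would be applied before residuals are derived, and `inverse_normal` would then be applied to the residuals.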
Association analysis in discovery cohorts
Association analyses were carried out for these variants via EPACTS for all cohorts except HCHS/SOL, using the q.emmax test to account for relatedness within each cohort. Association tests were performed on inverse normalized residuals (adjusted for age, age squared, sex, and principal components/study specific covariates), further adjusting for kinship matrices constructed in EPACTS using variants with MAF > 1%. Individuals with different starting genotyping platform(s) were analyzed separately. Inverse-variance weighted meta-analyses were then carried out using GWAMA, separately for AAs and Hispanics/Latinos.
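Inverse-variance weighting combines per-cohort effect estimates by giving each cohort a weight equal to the reciprocal of its squared standard error. The sketch below shows the generic fixed-effects calculation for a single variant; it is not GWAMA's implementation, and the function name is illustrative.

```python
from math import erfc, sqrt
from typing import Sequence, Tuple


def ivw_meta(betas: Sequence[float], ses: Sequence[float]) -> Tuple[float, float, float, float]:
    """Fixed-effects inverse-variance weighted meta-analysis for one variant.

    Returns the pooled effect, its standard error, the z-score, and the
    two-sided p-value under a standard normal null."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = sqrt(1.0 / total)
    z = beta / se
    p = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p
```

Cohorts with smaller standard errors, usually the larger ones, therefore dominate the pooled estimate, and the pooled standard error is always smaller than that of any single contributing cohort.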
Identification and replication of novel associations
To identify putative novel associations, we then filtered out any variant with LD r2 ≥ 0.2 in any ethnic group with any previously reported variant from GWAS, sequencing, or Exome Chip analyses within ±1 Mb for a given blood cell trait. We calculated LD in self-reported European ancestry, AA, and Hispanic/Latino individuals from TOPMed freeze 5b. For European and African LD reference panels, we further restricted to individuals with global ancestry estimate ≥0.8. The global ancestry estimates were derived from local ancestry estimates from RFMix using data from the Human Genome Diversity Project (HGDP) as the reference panel with seven populations, namely Sub-Saharan Africa, Central and South Asia, East Asia, Europe, Native America, Oceania, and West Asia and North Africa (Middle East). Global ancestry for each TOPMed individual is defined as the mean local ancestry across all HGDP SNPs. For replication of novel signals, similar to the approach we adopted for the discovery cohorts, we performed association analysis using EPACTS in each contributing cohort and then meta-analyzed with GWAMA.
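The global ancestry definition and the novelty filter described above can be sketched as follows. Several details are assumptions made for illustration: local ancestry is encoded as the number of haplotypes (0, 1, or 2) assigned to a population at each SNP, all variants are on the same chromosome, LD values are supplied precomputed per ancestry group, and a known variant without an LD value is treated as not in LD.

```python
from typing import Iterable, Mapping, Sequence


def global_ancestry(local_copies: Sequence[int]) -> float:
    """Mean local ancestry across SNPs, as a proportion between 0 and 1."""
    if not local_copies:
        raise ValueError("at least one SNP is required")
    return sum(local_copies) / (2.0 * len(local_copies))


def in_ld_reference(local_copies: Sequence[int], min_fraction: float = 0.8) -> bool:
    """Keep individuals whose global ancestry estimate is at least 0.8."""
    return global_ancestry(local_copies) >= min_fraction


def is_putatively_novel(
    position: int,
    known_positions: Iterable[int],
    r2_by_known: Mapping[int, Mapping[str, float]],  # known position -> {group: r2}
    window: int = 1_000_000,
    r2_cutoff: float = 0.2,
) -> bool:
    """False if any known variant within the window has r2 >= cutoff with the
    candidate in any ancestry group; True otherwise."""
    for known in known_positions:
        if abs(known - position) > window:
            continue
        if any(r2 >= r2_cutoff for r2 in r2_by_known.get(known, {}).values()):
            return False
    return True
```

Requiring low LD in every ancestry group is the conservative choice: a candidate that tags a known signal in any one group is not counted as novel.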
S1 Fig. Comparison of imputation reference panels, for variants with MAF > 1%.
Imputation quality (measured by true R2 [Y-axis]) is plotted with progressively more stringent post-imputation filtering from left to right, with filtering according to estimated R2 (X-axis), for variants with MAF > 1%. Top panels are for the JHS cohort and bottom panels for the HCHS/SOL cohort. Three reference panels are shown: TOPMed (TOPMed freeze 5b), 1000G (the 1000 Genomes Phase 3), and HRC (the Haplotype Reference Consortium).
S2 Fig. Comparison of well imputed variants included in results from TOPMed (TOPMed freeze 5b), 1000G (the 1000 Genomes Phase 3), and HRC (the Haplotype Reference Consortium).
S3 Fig. African ancestry hematocrit analysis Manhattan plot.
S4 Fig. African ancestry hemoglobin analysis Manhattan plot.
S5 Fig. African ancestry white blood cell count analysis Manhattan plot.
S6 Fig. Hispanic/Latino ancestry hematocrit analysis Manhattan plot.
S7 Fig. Hispanic/Latino ancestry hemoglobin analysis Manhattan plot.
S8 Fig. Hispanic/Latino ancestry white blood cell count analysis Manhattan plot.
S9 Fig. African ancestry hematocrit analysis QQ plot.
S10 Fig. African ancestry hemoglobin analysis QQ plot.
S11 Fig. African ancestry white blood cell count analysis QQ plot.
S12 Fig. Hispanic/Latino ancestry hematocrit analysis QQ plot.
S13 Fig. Hispanic/Latino ancestry hemoglobin analysis QQ plot.
S14 Fig. Hispanic/Latino ancestry white blood cell count analysis QQ plot.
S1 Table. Cohorts used for imputation to TOPMed freeze 5b reference panel and subsequent association analysis with hematological traits, including self-identified African ancestry and Hispanic/Latino individuals.
S2 Table. Cohorts included in the TOPMed freeze 5b imputation reference panels, with self-reported ancestry.
S3 Table. Percentage and number of variants well-imputed with TOPMed freeze5b by chromosome in Jackson Heart Study (JHS) and Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
S4 Table. Imputation quality for variants with a minor allele count between 11 and 20 in Jackson Heart Study (JHS).
S5 Table. Imputation quality for overall reference panel rare variants (20 or less MAC in TOPMed freeze 5b) in Jackson Heart Study (JHS).
S6 Table. Imputation quality for rare variants (20 or less MAC) in Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
S7 Table. Imputation quality for rare variants (20 or less MAC) in TOPMed freeze 5b in Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
S8 Table. Imputation quality for rare and low frequency variants estimated to be well imputed in Table 1 stratified by regional background in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
S9 Table. Demographics, hematological traits, and number of ancestry principal components adjusted for in association analysis models for cohorts imputed to TOPMed freeze 5b reference panel.
S10 Table. Demographics, hematological traits, and number of ancestry principal components adjusted for in association analysis for African American cohorts with sequencing and hematological trait data from TOPMed freeze 5b.
S11 Table. Overall counts for variants replicated in TOPMed freeze 5b imputed cohorts.
S12 Table. Results for previously identified variants in African ancestry and Hispanic/Latino populations in TOPMed freeze 5b imputed samples (included cohorts detailed in S1 and S8 Tables).
S13 Table. Imputation of novel variants identified with TOPMed freeze 5b-based imputation using current widely used reference panels from the Haplotype Reference Consortium (HRC) and 1000 Genomes Phase 3, as well as subsequent association analysis results for cohorts where the variants were well-imputed (R2>0.8).
S14 Table. Estimated imputation quality for rs33930165 and rs11549407 using 1000G phase 3 and Haplotype Reference Consortium (HRC) as references.
S15 Table. Association statistics for the hemoglobin C variant (rs33930165, 11:5227003:C:T) with white blood cell subtypes, adjusting for age, sex, and ancestry principal components.
S16 Table. White blood cell subtypes for cohorts imputed to TOPMed freeze 5b reference panel.
S17 Table. White blood cell subtypes for African American cohorts with sequencing and hematological trait data from TOPMed freeze 5b.
S18 Table. Results for meta-analysis of African ancestry cohorts from sequence kernel association test (SKAT) association results for previously reported genes for hemoglobin (HGB), hematocrit (HCT), or white blood cell count (WBC) using TOPMed freeze 5b, Haplotype Reference Consortium (HRC), and 1000G phase 3 as imputation reference panels.
S19 Table. Overall counts for gene results replicated in African ancestry cohorts using TOPMed freeze 5b, 1000G phase 3, and Haplotype Reference Consortium (HRC) as imputation reference panels.
S20 Table. Results for meta-analysis of Hispanic/Latino cohorts from sequence kernel association test (SKAT) association results for previously reported genes for hemoglobin (HGB), hematocrit (HCT), or white blood cell count (WBC) using TOPMed freeze 5b, Haplotype Reference Consortium (HRC), and 1000G phase 3 as imputation reference panels.
S21 Table. Overall counts for gene results replicated in Hispanic/Latino cohorts using TOPMed freeze 5b, 1000G phase 3, and Haplotype Reference Consortium (HRC) as imputation reference panels.
This research has been conducted using the UK Biobank Resource under Application Number 25953.
We would like to thank Quan Sun and Jia Wen for their advice and help on this manuscript.
Support for title page creation and format was provided by AuthorArranger, a tool developed at the National Cancer Institute.
Whole genome sequencing (WGS) for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). WGS for “NHLBI TOPMed: The Jackson Heart Study” (phs000964) was performed at the University of Washington Northwest Genomics Center (HHSN268201100037C). WGS for “NHLBI TOPMed: Genetic Epidemiology of COPD (COPDGene) in the TOPMed Program” (phs000951) was performed at the Broad Institute and the University of Washington Northwest Genomics Center (HHSN268201500014C (Phase 2 Broad)). WGS for “NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish” (phs000956) was performed at the Broad Institute (3R01HL121007-01S1). WGS for “NHLBI TOPMed: Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study” (phs000974) was performed at the Broad Institute (3R01HL092577-06S1 (AFGen)). WGS for “NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica” (phs000988) was performed at the University of Washington Northwest Genomics Center (3R37HL066289-13S1). WGS for “NHLBI TOPMed: Heart and Vascular Health Study (HVH)” (phs000993) was performed at the Broad Institute and Baylor (3R01HL092577-06S1 (Phase 1:Broad, AFGen), 3U54HG003273-12S2 (Phase 2: Baylor, VTE, TOPMed supplement to NHGRI)). WGS for “NHLBI TOPMed: The Vanderbilt AF Ablation Registry” (phs000997) was performed at the Broad Institute (3R01HL092577-06S1). WGS for “NHLBI TOPMed: Partners HealthCare Biobank” (phs001024) was performed at the Broad Institute (3R01HL092577-06S1). WGS for “NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Registry” (phs001032) was performed at the Broad Institute (3R01HL092577-06S1). WGS for “NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women” (phs001040) was performed at the Broad Institute (3R01HL092577-06S1). WGS for “NHLBI TOPMed: MGH Atrial Fibrillation Study” (phs001062) was performed at the Broad Institute (3R01HL092577-06S1). 
WGS for “NHLBI TOPMed: The Genetics and Epidemiology of Asthma in Barbados” (phs001143) was performed at Illumina (3R01HL104608-04S1). WGS for “NHLBI TOPMed: Cleveland Clinic Atrial Fibrillation Study” (phs001189) was performed at the Broad Institute (3R01HL092577-06S1). WGS for “NHLBI TOPMed: African American Sarcoidosis Genetics Resource” (phs001207) was performed at Baylor (3R01HL113326-04S1). WGS for “NHLBI TOPMed: Trans-Omics for Precision Medicine Whole Genome Sequencing Project: ARIC” (phs001211) was performed at the Broad Institute and Baylor (3R01HL092577-06S1 (Broad, AFGen), HHSN268201500015C (Baylor, VTE), 3U54HG003273-12S2 (Baylor, VTE)). WGS for “NHLBI TOPMed: San Antonio Family Heart Study (WGS)” (phs001215) was performed at Illumina (3R01HL113323-03S1). WGS for “NHLBI TOPMed: Genetic Epidemiology Network of Salt Sensitivity (GenSalt)” (phs001217) was performed at Baylor (HHSN268201500015C). WGS for “NHLBI TOPMed: GeneSTAR (Genetic Study of Atherosclerosis Risk)” (phs001218) was performed at the Broad Institute, Illumina, and Macrogen (HHSN268201500014C (Broad, AA_CAC)). WGS for “NHLBI TOPMed: Women’s Health Initiative (WHI)” (phs001237) was performed at the Broad Institute (HHSN268201500014C). WGS for “NHLBI TOPMed: HyperGEN—Genetics of Left Ventricular (LV) Hypertrophy” (phs001293) was performed at the University of Washington Northwest Genomics Center (3R01HL055673-18S1). WGS for “NHLBI TOPMed: Genetic Epidemiology Network of Arteriopathy (GENOA)” (phs001345) was performed at the Broad Institute and the University of Washington Northwest Genomics Center (HHSN268201500014C (Broad, AA_CAC), 3R01HL055673-18S1 (UW NWGC, HyperGEN_GENOA)). WGS for “NHLBI TOPMed: Genetics of Lipid Lowering Drugs and Diet Network (GOLDN)” (phs001359) was performed at the University of Washington Northwest Genomics Center (3R01HL104135-04S1). WGS for “NHLBI TOPMed: Cardiovascular Health Study” (phs001368) was performed at Baylor (HHSN268201500015C (VTE portion of CHS)). 
WGS for “NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE)” (phs001402) was performed at Baylor (HHSN268201500015C, 3U54HG003273-12S2). WGS for “NHLBI TOPMed: Diabetes Heart Study African American Coronary Artery Calcification (AA CAC)” (phs001412) was performed at the Broad Institute (HHSN268201500014C (Broad, AA_CAC)). WGS for “NHLBI TOPMed: Multi-Ethnic Study of Atherosclerosis (MESA)” (phs001416.v1.p1) was performed at the Broad Institute of MIT and Harvard (3U54HG003067-13S1). WGS for “NHLBI TOPMed: MESA Family AA-CAC” (phs001416) was performed at the Broad Institute (HHSN268201500014C). Centralized read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1). Phenotype harmonization, data management, sample-identity QC, and general study coordination, were provided by the TOPMed Data Coordinating Center (3R01HL-120393-02S1). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed.
The Jackson Heart Study (JHS) is supported and conducted in collaboration with Jackson State University (HHSN268201800013I), Tougaloo College (HHSN268201800014I), the Mississippi State Department of Health (HHSN268201800015I/HHSN26800001) and the University of Mississippi Medical Center (HHSN268201800010I, HHSN268201800011I and HHSN268201800012I) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD). The authors also wish to thank the staffs and participants of the JHS.
MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts HHSN268201500003I, N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR-001881, and DK063491. MESA Family is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support is provided by grants and contracts R01HL071051, R01HL071205, R01HL071250, R01HL071251, R01HL071258, R01HL071259, and by the National Center for Research Resources, Grant UL1RR033176. The provision of genotyping data was supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR001881, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center.
The COPDGene project described was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The COPDGene project is also supported by the COPD Foundation through contributions made to an Industry Advisory Board comprised of AstraZeneca, Boehringer Ingelheim, GlaxoSmithKline, Novartis, Pfizer, Siemens and Sunovion. A full listing of COPDGene investigators can be found at: http://www.copdgene.org/directory
GeneSTAR was supported by the National Institutes of Health/National Heart, Lung, and Blood Institute (U01 HL72518, HL087698, HL112064, HL11006, HL118356) and by a grant from the National Institutes of Health/National Center for Research Resources (M01-RR000052) to the Johns Hopkins General Clinical Research Center.
The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services.
The Population Architecture Using Genomics and Epidemiology (PAGE) program is funded by the National Human Genome Research Institute (NHGRI) with co-funding from the National Institute on Minority Health and Health Disparities (NIMHD). The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. The PAGE consortium thanks the staff and participants of all PAGE studies for their important contributions. We thank Rasheeda Williams and Margaret Ginoza for providing assistance with program coordination. The complete list of PAGE members can be found at http://www.pagestudy.org.
Assistance with data management, data integration, data dissemination, genotype imputation, ancestry deconvolution, population genetics, analysis pipelines, and general study coordination was provided by the PAGE Coordinating Center (NIH U01HG007419). Genotyping services were provided by the Center for Inherited Disease Research (CIDR). CIDR is fully funded through a federal contract from the National Institutes of Health to The Johns Hopkins University, contract number HHSN268201200008I. Genotype data quality control and quality assurance services were provided by the Genetic Analysis Center in the Biostatistics Department of the University of Washington, through support provided by the CIDR contract.
The data and materials included in this report result from collaboration between the following studies and organizations:
HCHS/SOL: Primary funding support to Dr. North and colleagues is provided by U01HG007416. Additional support was provided via R01DK101855 and 15GRNT25880008. The Hispanic Community Health Study/Study of Latinos was carried out as a collaborative study supported by contracts from the National Heart, Lung, and Blood Institute (NHLBI) to the University of North Carolina (N01-HC65233), University of Miami (N01-HC65234), Albert Einstein College of Medicine (N01-HC65235), Northwestern University (N01-HC65236), and San Diego State University (N01- HC65237). The following Institutes/Centers/Offices contribute to the HCHS/SOL through a transfer of funds to the NHLBI: National Institute on Minority Health and Health Disparities, National Institute on Deafness and Other Communication Disorders, National Institute of Dental and Craniofacial Research, National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Neurological Disorders and Stroke, NIH Institution-Office of Dietary Supplements.
WHI: Funding support for the “Exonic variants and their relation to complex traits in minorities of the WHI” study is provided through the NHGRI PAGE program (NIH U01HG007376). The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C, and HHSN271201100004C. The authors thank the WHI investigators and staff for their dedication, and the study participants for making the program possible. A listing of WHI investigators can be found at: https://www.whi.org/researchers.
ARIC: The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services (contract numbers HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I and HHSN268201700005I). The authors thank the staff and participants of the ARIC study for their important contributions.
CARDIA: The Coronary Artery Risk Development in Young Adults Study (CARDIA) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with the University of Alabama at Birmingham (HHSN268201300025C & HHSN268201300026C), Northwestern University (HHSN268201300027C), University of Minnesota (HHSN268201300028C), Kaiser Foundation Research Institute (HHSN268201300029C), and Johns Hopkins University School of Medicine (HHSN268200900041C). CARDIA is also partially supported by the Intramural Research Program of the National Institute on Aging (NIA) and an intra-agency agreement between NIA and NHLBI (AG0005). This manuscript has been reviewed by CARDIA for scientific content.
GERA: Genotyping of the GERA cohort was funded by a grant from the National Institute on Aging, National Institute of Mental Health, and National Institute of Health Common Fund (RC2 AG036607).
- 1. Auer PL, Johnsen JM, Johnson AD, Logsdon BA, Lange LA, Nalls MA, et al. Imputation of exome sequence variants into population-based samples and blood-cell-trait-associated loci in African Americans: NHLBI GO Exome Sequencing Project. Am J Hum Genet. 2012;91(5):794–808. Epub 2012/10/30. pmid:23103231.
- 2. Duan Q, Liu EY, Auer PL, Zhang G, Lange EM, Jun G, et al. Imputation of coding variants in African Americans: better performance using data from the exome sequencing project. Bioinformatics. 2013;29(21):2744–9. Epub 2013/08/21. pmid:23956302.
- 3. Lu F-P, Lin K-P, Kuo H-K. Diabetes and the risk of multi-system aging phenotypes: a systematic review and meta-analysis. PloS one. 2009;4(1):e4144. pmid:19127292
- 4. Liu EY, Buyske S, Aragaki AK, Peters U, Boerwinkle E, Carlson C, et al. Genotype Imputation of Metabochip SNPs Using a Study-Specific Reference Panel of ~4,000 Haplotypes in African Americans From the Women’s Health Initiative. Genet Epidemiol. 2012;36(2):107–17. pmid:22851474
- 5. Liu EY, Li M, Wang W, Li Y. MaCH-admix: genotype imputation for admixed populations. Genetic epidemiology. 2013;37(1):25–37. Epub 2012/10/18. pmid:23074066.
- 6. Vergara C, Parker MM, Franco L, Cho MH, Valencia-Duarte AV, Beaty TH, et al. Genotype imputation performance of three reference panels using African ancestry individuals. Hum Genet. 2018;137(4):281–92. Epub 2018/04/11. pmid:29637265.
- 7. The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–61. pmid:17943122
- 8. McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet. 2016;48(10):1279–83. Epub 2016/08/23. pmid:27548312.
- 9. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. Epub 2015/10/04. pmid:26432245.
- 10. Mathias RA, Taub MA, Gignoux CR, Fu W, Musharoff S, O’Connor TD, et al. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nature communications. 2016;7:12522. Epub 2016/10/12. pmid:27725671
- 11. Crosslin DR, McDavid A, Weston N, Nelson SC, Zheng X, Hart E, et al. Genetic variants associated with the white blood cell count in 13,923 subjects in the eMERGE Network. Hum Genet. 2012;131(4):639–52. Epub 2011/11/01. pmid:22037903.
- 12. Whitfield JB, Martin NG, Rao DC. Genetic and environmental influences on the size and number of cells in the blood. Genetic epidemiology. 1985;2(2):133–44. pmid:4054596
- 13. Garner C, Tatu T, Reittie JE, Littlewood T, Darley J, Cervino S, et al. Genetic influences on F cells and other hematologic variables: a twin heritability study. Blood. 2000;95(1):342–6. Epub 1999/12/23. pmid:10607722.
- 14. Astle WJ, Elding H, Jiang T, Allen D, Ruklisa D, Mann AL, et al. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell. 2016;167(5):1415–29.e19. Epub 2016/11/20. pmid:27863252.
- 15. van der Harst P, Zhang W, Mateo Leach I, Rendon A, Verweij N, Sehmi J, et al. Seventy-five genetic loci influencing the human red blood cell. Nature. 2012;492(7429):369–75. Epub 2012/12/12. pmid:23222517.
- 16. Mousas A, Ntritsos G, Chen MH, Song C, Huffman JE, Tzoulaki I, et al. Rare coding variants pinpoint genes that control human hematological traits. PLoS Genet. 2017;13(8):e1006925. Epub 2017/08/09. pmid:28787443.
- 17. Chami N, Chen MH, Slater AJ, Eicher JD, Evangelou E, Tajuddin SM, et al. Exome Genotyping Identifies Pleiotropic Variants Associated with Red Blood Cell Traits. Am J Hum Genet. 2016. Epub 2016/06/28. pmid:27346685.
- 18. Eicher JD, Chami N, Kacprowski T, Nomura A, Chen MH, Yanek LR, et al. Platelet-Related Variants Identified by Exomechip Meta-analysis in 157,293 Individuals. Am J Hum Genet. 2016. Epub 2016/06/28. pmid:27346686.
- 19. Tajuddin SM, Schick UM, Eicher JD, Chami N, Giri A, Brody JA, et al. Large-Scale Exome-wide Association Analysis Identifies Loci for White Blood Cell Traits and Pleiotropy with Immune-Mediated Diseases. Am J Hum Genet. 2016. Epub 2016/06/28. pmid:27346689.
- 20. Hodonsky CJ, Jain D, Schick UM, Morrison JV, Brown L, McHugh CP, et al. Genome-wide association study of red blood cell traits in Hispanics/Latinos: The Hispanic Community Health Study/Study of Latinos. PLoS genetics. 2017;13(4):e1006760. Epub 2017/04/30. pmid:28453575.
- 21. CHARGE Consortium Hematology Working Group. Meta-analysis of rare and common exome chip variants identifies S1PR4 and other loci influencing blood cell traits. Nat Genet. 2016;48(8):867–76. Epub 2016/07/12. pmid:27399967.
- 22. Lo KS, Wilson JG, Lange LA, Folsom AR, Galarneau G, Ganesh SK, et al. Genetic association analysis highlights new loci that modulate hematological trait variation in Caucasians and African Americans. Hum Genet. 2011;129(3):307–17. Epub 2010/12/15. pmid:21153663.
- 23. Tournamille C, Colin Y, Cartron JP, Le Van Kim C. Disruption of a GATA motif in the Duffy gene promoter abolishes erythroid gene expression in Duffy-negative individuals. Nat Genet. 1995;10(2):224–8. Epub 1995/06/01. pmid:7663520.
- 24. Chen Z, Tang H, Qayyum R, Schick UM, Nalls MA, Handsaker R, et al. Genome-wide association analysis of red blood cell traits in African Americans: the COGENT Network. Hum Mol Genet. 2013;22(12):2529–38. Epub 2013/03/01. pmid:23446634.
- 25. Li J, Glessner JT, Zhang H, Hou C, Wei Z, Bradfield JP, et al. GWAS of blood cell traits identifies novel associated loci and epistatic interactions in Caucasian and African-American children. Hum Mol Genet. 2013;22(7):1457–64. Epub 2012/12/25. pmid:23263863.
- 26. van Rooij FJA, Qayyum R, Smith AV, Zhou Y, Trompet S, Tanaka T, et al. Genome-wide Trans-ethnic Meta-analysis Identifies Seven Genetic Loci Influencing Erythrocyte Traits and a Role for RBPMS in Erythropoiesis. Am J Hum Genet. 2017;100(1):51–63. Epub 2016/12/27. pmid:28017375.
- 27. Jain D, Hodonsky CJ, Schick UM, Morrison JV, Minnerath S, Brown L, et al. Genome-wide association of white blood cell counts in Hispanic/Latino Americans: the Hispanic Community Health Study/Study of Latinos. Hum Mol Genet. 2017;26(6):1193–204. Epub 2017/02/06. pmid:28158719.
- 28. Polfus LM, Khajuria RK, Schick UM, Pankratz N, Pazoki R, Brody JA, et al. Whole-Exome Sequencing Identifies Loci Associated with Blood Cell Traits and Reveals a Role for Alternative GFI1B Splice Variants in Human Hematopoiesis. Am J Hum Genet. 2016;99(2):481–8. Epub 2016/08/04. pmid:27486782.
- 29. Keller MF, Reiner AP, Okada Y, van Rooij FJ, Johnson AD, Chen MH, et al. Trans-ethnic meta-analysis of white blood cell phenotypes. Hum Mol Genet. 2014;23(25):6944–60. Epub 2014/08/07. pmid:25096241.
- 30. Reiner AP, Lettre G, Nalls MA, Ganesh SK, Mathias R, Austin MA, et al. Genome-wide association study of white blood cell count in 16,388 African Americans: the continental origins and genetic epidemiology network (COGENT). PLoS genetics. 2011;7(6):e1002108. Epub 2011/07/09. pmid:21738479.
- 31. Pulit SL, de With SA, de Bakker PI. Resetting the bar: Statistical significance in whole-genome sequencing-based association studies of global populations. Genetic epidemiology. 2017;41(2):145–51. Epub 2016/12/19. pmid:27990689.
- 32. Trecartin RF, Liebhaber SA, Chang JC, Lee KY, Kan YW, Furbetta M, et al. beta zero thalassemia in Sardinia is caused by a nonsense mutation. The Journal of clinical investigation. 1981;68(4):1012–7. Epub 1981/10/01. pmid:6457059.
- 33. Rosatelli MC, Dozy A, Faa V, Meloni A, Sardu R, Saba L, et al. Molecular characterization of beta-thalassemia in the Sardinian population. Am J Hum Genet. 1992;50(2):422–6. Epub 1992/02/01. pmid:1734721.
- 34. Perea FJ, Magana MT, Cobian JG, Sanchez-Lopez JY, Chavez ML, Zamudio G, et al. Molecular spectrum of beta-thalassemia in the Mexican population. Blood cells, molecules & diseases. 2004;33(2):150–2. Epub 2004/08/19. pmid:15315794.
- 35. Silva AN, Cardoso GL, Cunha DA, Diniz IG, Santos SE, Andrade GB, et al. The Spectrum of beta-Thalassemia Mutations in a Population from the Brazilian Amazon. Hemoglobin. 2016;40(1):20–4. Epub 2015/09/16. pmid:26372288.
- 36. Key NS, Connes P, Derebail VK. Negative health implications of sickle cell trait in high income countries: from the football field to the laboratory. British journal of haematology. 2015;170(1):5–14. Epub 2015/03/11. pmid:25754217.
- 37. Graffeo L, Vitrano A, Scondotto S, Dardanoni G, Pollina Addario WS, Giambona A, et al. beta-Thalassemia heterozygote state detrimentally affects health expectation. European journal of internal medicine. 2018;54:76–80. Epub 2018/06/24. pmid:29934240.
- 38. Galanello R, Origa R. Beta-thalassemia. Orphanet journal of rare diseases. 2010;5:11. pmid:20492708.
- 39. Fairhurst RM, Casella JF. Images in clinical medicine. Homozygous hemoglobin C disease. N Engl J Med. 2004;350(26):e24. Epub 2004/06/25. pmid:15215497.
- 40. Sorlie PD, Aviles-Santa LM, Wassertheil-Smoller S, Kaplan RC, Daviglus ML, Giachello AL, et al. Design and implementation of the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol. 2010;20(8):629–41. Epub 2010/07/09. pmid:20609343.
- 41. Daviglus ML, Talavera GA, Aviles-Santa ML, Allison M, Cai J, Criqui MH, et al. Prevalence of major cardiovascular risk factors and cardiovascular diseases among Hispanic/Latino individuals of diverse backgrounds in the United States. JAMA. 2012;308(17):1775–84. Epub 2012/11/03. pmid:23117778.
- 42. Lavange LM, Kalsbeek WD, Sorlie PD, Aviles-Santa LM, Kaplan RC, Barnhart J, et al. Sample design and cohort selection in the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol. 2010;20(8):642–9. Epub 2010/07/09. pmid:20609344.
- 43. Conomos MP, Laurie CA, Stilp AM, Gogarten SM, McHugh CP, Nelson SC, et al. Genetic Diversity and Association Studies in US Hispanic/Latino Populations: Applications in the Hispanic Community Health Study/Study of Latinos. Am J Hum Genet. 2016;98(1):165–84. Epub 2016/01/11. pmid:26748518.
- 44. Wojcik G, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, et al. The PAGE Study: How Genetic Diversity Improves Our Understanding of the Architecture of Complex Traits. bioRxiv. 2018:188094.
- 45. The Women’s Health Initiative Study Group. Design of the Women’s Health Initiative clinical trial and observational study. Control Clin Trials. 1998;19(1):61–109. Epub 1998/03/11. pmid:9492970.
- 46. UK Biobank. UK Biobank: rationale, design and development of a large-scale prospective resource. 2007. http://www.ukbiobank.ac.uk/resources/.
- 47. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
- 48. Kvale MN, Hesselson S, Hoffmann TJ, Cao Y, Chan D, Connell S, et al. Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort. Genetics. 2015;200(4):1051–60. Epub 2015/06/21. pmid:26092718.
- 49. Banda Y, Kvale MN, Hoffmann TJ, Hesselson SE, Ranatunga D, Tang H, et al. Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort. Genetics. 2015;200(4):1285–95. pmid:26092716
- 50. Taylor HA Jr., Wilson JG, Jones DW, Sarpong DF, Srinivasan A, Garrison RJ, et al. Toward resolution of cardiovascular health disparities in African Americans: design and methods of the Jackson Heart Study. Ethn Dis. 2005;15(4 Suppl 6):S6-4-17. Epub 2005/12/02. pmid:16320381.
- 51. Wilson JG, Rotimi CN, Ekunwe L, Royal CD, Crump ME, Wyatt SB, et al. Study design for genetic analysis in the Jackson Heart Study. Ethn Dis. 2005;15(4 Suppl 6):S6-30-7. Epub 2005/12/02. pmid:16317983.
- 52. Musunuru K, Lettre G, Young T, Farlow DN, Pirruccello JP, Ejebe KG, et al. Candidate gene association resource (CARe): design, methods, and proof of concept. Circulation Cardiovascular genetics. 2010;3(3):267–75. Epub 2010/04/20. pmid:20400780.
- 53. Lettre G, Palmer CD, Young T, Ejebe KG, Allayee H, Benjamin EJ, et al. Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project. PLoS genetics. 2011;7(2):e1001300. Epub 2011/02/25. pmid:21347282.
- 54. Friedman GD, Cutter GR, Donahue RP, Hughes GH, Hulley SB, Jacobs DR Jr., et al. CARDIA: study design, recruitment, and some characteristics of the examined subjects. J Clin Epidemiol. 1988;41(11):1105–16. Epub 1988/01/01. pmid:3204420.
- 55. Cutter GR, Burke GL, Dyer AR, Friedman GD, Hilner JE, Hughes GH, et al. Cardiovascular risk factors in young adults. The CARDIA baseline monograph. Control Clin Trials. 1991;12(1 Suppl):1S–77S. Epub 1991/02/11. pmid:1851696.
- 56. The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. The ARIC investigators. Am J Epidemiol. 1989;129(4):687–702. Epub 1989/04/01. pmid:2646917.
- 57. Loh PR, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 2016;48(11):1443–8. Epub 2016/10/28. pmid:27694958.
- 58. Das S, Forer L, Schonherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–7. Epub 2016/08/30. pmid:27571263.
- 59. Duan Q, Liu EY, Croteau-Chonka DC, Mohlke KL, Li Y. A comprehensive SNP and indel imputability database. Bioinformatics. 2013;29(4):528–31. Epub 2013/01/08. pmid:23292738.
- 60. Magi R, Lindgren CM, Morris AP. Meta-analysis of sex-specific genome-wide association studies. Genet Epidemiol. 2010;34(8):846–53. Epub 2010/11/26. pmid:21104887.
- 61. Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. Am J Hum Genet. 2013;93(2):278–88. http://dx.doi.org/10.1016/j.ajhg.2013.06.020 pmid:23910464
- 62. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319(5866):1100–4. Epub 2008/02/23. pmid:18292342.