Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations

Most genome-wide association and fine-mapping studies to date have been conducted in individuals of European descent, and genetic studies of populations of Hispanic/Latino and African ancestry are limited. In addition, these populations have more complex linkage disequilibrium structure. In order to better define the genetic architecture of these understudied populations, we leveraged >100,000 phased sequences available from deep-coverage whole genome sequencing through the multi-ethnic NHLBI Trans-Omics for Precision Medicine (TOPMed) program to impute genotypes into admixed African and Hispanic/Latino samples with genome-wide genotyping array data. We demonstrated that using TOPMed sequencing data as the imputation reference panel improves genotype imputation quality in these populations, which subsequently enhanced gene-mapping power for complex traits. For rare variants with minor allele frequency (MAF) < 0.5%, we observed a 2.3- to 6.1-fold increase in the number of well-imputed variants, with 11–34% improvement in average imputation quality, compared to the state-of-the-art 1000 Genomes Project Phase 3 and Haplotype Reference Consortium reference panels. Impressively, even for extremely rare variants with minor allele count <10 (including singletons) in the imputation target samples, average information content rescued was >86%. Subsequent association analyses of TOPMed reference panel-imputed genotype data with hematological traits (hemoglobin (HGB), hematocrit (HCT), and white blood cell count (WBC)) in ~21,600 African-ancestry and ~21,700 Hispanic/Latino individuals identified associations with two rare variants in the HBB gene (rs33930165 with higher WBC [p = 8.8×10⁻¹⁵] in African populations, rs11549407 with lower HGB [p = 1.5×10⁻¹²] and HCT [p = 8.8×10⁻¹⁰] in Hispanics/Latinos).
By comparison, neither variant would have been genome-wide significant if either 1000 Genomes Project Phase 3 or Haplotype Reference Consortium reference panels had been used for imputation. Our findings highlight the utility of the TOPMed imputation reference panel for identification of novel rare variant associations not previously detected in similarly sized genome-wide studies of under-represented African and Hispanic/Latino populations.

genetic studies, these populations have more complex linkage disequilibrium structure that may reduce the number of variants associated with a phenotype. In order to better define the genetic architecture of these understudied populations, we leveraged >100,000 phased sequences available from deep-coverage whole genome sequencing through the multi-ethnic NHLBI Trans-Omics for Precision Medicine (TOPMed) program to impute genotypes into admixed African and Hispanic/Latino samples with commercial genome-wide genotyping array data. We demonstrate that using TOPMed sequencing data as the imputation reference panel improves genotype imputation quality in these populations, which subsequently enhances gene-mapping power for complex traits. For rare variants with minor allele frequency (MAF) < 0.5%, we observed a 2.3 to 6.1-fold increase in the number of well-imputed variants, with 11-34% improvement in average imputation quality, compared to the state-of-the-art 1000 Genomes Project Phase 3 and Haplotype Reference Consortium reference panels, respectively. Impressively, even for extremely rare variants with sample minor allele count <10 (including singletons) in the imputation target samples, average information content rescued was >86%.
Subsequent association analyses of TOPMed reference panel-imputed genotype data with hematological traits (hemoglobin (HGB), hematocrit (HCT), and white blood cell count (WBC)) in ~20,000 self-identified African descent individuals and ~23,000 self-identified Hispanic/Latino individuals identified associations with two rare variants in the HBB gene (rs33930165 with higher WBC (p = 8.1×10⁻¹²) in African populations, rs11549407 with lower HGB (p = 1.59×10⁻¹²) and HCT (p = 1.13×10⁻⁹) in Hispanics/Latinos). By comparison, neither variant would have been genome-wide significant if either 1000 Genomes Project Phase 3 or Haplotype Reference Consortium reference panels had been used for imputation. Our findings highlight the utility of the TOPMed imputation reference panel for identification of novel associations between rare variants and complex traits not previously detected in similarly sized genome-wide studies of under-represented African and Hispanic/Latino populations.

Author summary
Admixed African and Hispanic/Latino populations remain understudied in genome-wide association and fine-mapping studies of complex diseases. These populations have more complex linkage disequilibrium (LD) structure that can impair mapping of variants associated with complex diseases and their risk factors. Genotype imputation represents an approach to improve genome coverage, especially for rare or ancestry-specific variation; however, these understudied populations also have smaller relevant imputation reference panels that need to be expanded to represent their more complex LD patterns. In this study, we leveraged >100,000 phased sequences generated from the multi-ethnic NHLBI TOPMed project to impute in admixed cohorts encompassing ~20,000 individuals of African ancestry (AAs) and ~23,000 Hispanics/Latinos. We demonstrated substantially higher imputation quality for low frequency and rare variants in comparison to the state-of-the-art reference panels (1000 Genomes Project and Haplotype Reference Consortium). Association analyses of ~35 million (AAs) and ~27 million (Hispanics/Latinos) variants passing stringent post-imputation filtering with quantitative hematological traits led to the discovery of associations with two rare variants in the HBB gene; one of these variants was replicated in an independent sample, and the other is known to cause anemia in the homozygous state. By comparison, the same HBB variants would not have been genome-wide significant using other state-of-the-art reference panels due to lower imputation quality. Our findings demonstrate the power of the TOPMed whole genome sequencing data for imputation and subsequent association analysis in admixed African and Hispanic/Latino populations.

Introduction
Genotype imputation, despite being a standard practice in modern genetic association studies, remains challenging in populations of Hispanic/Latino or African ancestry, particularly for rare variants (1-6). One obstacle lies in the lack of appropriate whole genome sequence reference panels for these admixed populations. For individuals of European descent, the number of relevant haplotypes available has increased more than 500-fold, from 120 phased sequences in HapMap2 (7) to more than 64,000 phased sequences in the Haplotype Reference Consortium (HRC) (8) reference. However, HRC is predominantly European (other than the included 1000 Genomes Project Phase 3 (1000G) SNPs) and includes mostly low-coverage sequencing data (4-8x coverage). The state-of-the-art reference panels for African ancestry (AA) and Hispanic/Latino cohorts, including 1000G (9) and the Consortium on Asthma among African ancestry Populations in the Americas (CAAPA) (10), are at least one order of magnitude smaller than HRC. This is especially problematic given the complex LD structure in admixed populations. The NHLBI Trans-Omics for Precision Medicine (TOPMed) Project has recently generated deep-coverage (mean depth 30x) whole genome sequencing (WGS) on more than 50,000 individuals from >26 cohorts and from diverse ancestral backgrounds (notably including ~26% AA and ~10% Hispanic/Latino participants), and now provides an unprecedented opportunity for substantially enhancing imputation quality in underrepresented admixed populations and subsequently boosting power for mapping genes and regions underlying complex traits. Here we demonstrate the improvements in rare variant imputation quality in AA and Hispanic/Latino populations using TOPMed as a reference panel versus the 1000G and HRC panels, and subsequently identify two low-frequency/rare HBB variant associations with blood cell traits in AA and Hispanic/Latino samples using TOPMed-imputed genotyping array data.

Results and Discussion
The cohort and ancestry composition of the TOPMed freeze 5b whole genome sequence reference panel used in our study and the samples with array-based genotyping used for imputation and hematological traits association analyses in self-identified AA and Hispanic/Latino individuals are summarized in Tables S1 and S2 respectively. We first selected two large U.S. minority cohorts, one AA and one Hispanic/Latino, in order to comprehensively evaluate imputation quality: the Jackson Heart Study (JHS, all AA, n = 3,082) and the Hispanic Community Health Study/Study of Latinos (HCHS/SOL, all Hispanic/Latino, n = 11,887). Both the JHS and HCHS/SOL have external sources of dense genotype data available for comparison.
JHS is the largest AA general population cohort sequenced in TOPMed freeze 5b. Therefore, we removed JHS samples from the TOPMed freeze 5b reference panel prior to performing imputation into JHS samples using SNPs genotyped on the Affymetrix 6.0 array, treating the TOPMed freeze 5b calls as true genotypes for evaluation of imputation quality in JHS.
HCHS/SOL is the largest and most regionally diverse population-based cohort of Hispanic/Latino individuals living in the US. For HCHS/SOL, we used the entire set of 100,506 phased sequences from TOPMed freeze 5b (including JHS) as reference, and performed imputation into 11,887 Hispanic/Latino samples genotyped on the Illumina Omni 2.5 SOL custom array (with high quality genotypes at 2,293,536 markers). As the external source of genotype validation in HCHS/SOL, we used genotypes from the Illumina MEGA array genotyping data (containing >1.7 million multi-ethnic global markers, including low frequency coding variants and ancestry-specific variants) available in the same HCHS/SOL samples to assess imputation quality, evaluating 688,189 imputed markers available on MEGA but not on Omni2.5.
Compared with the 1000G Phase 3 reference panel (9), we were able to increase the number of well-imputed variants from ~28 and ~35 million to ~51 and ~58 million in JHS and HCHS/SOL, respectively (see Table S7 for the genome-wide distribution of well-imputed variants).
We defined well-imputed variants based on our previous work (1, 2, 4), using minor allele frequency (MAF)-specific estimated R² thresholds to ensure an average R² of at least 0.8 in each imputed cohort separately. For all rare variants with MAF < 0.5%, we observed ~4.2X (2.3X) and ~6.1X (3.3X) increases in the number of well-imputed variants in JHS (HCHS/SOL), compared with 1000G and HRC respectively, with 22% (11%) and 34% (20%) increases in imputation information content (as measured by average true R², which is the squared Pearson correlation between imputed and true genotypes) (Fig 1 and S1, Table 1). For very rare variants with MAF < 0.05%, we observed ~22.1X (5.8X) and ~11.8X (10.7X) increases in the number of well-imputed variants, with 6% (5%) and 13% (11%) increases in average true R², in JHS (HCHS/SOL), compared with 1000G and HRC respectively. Even for extremely rare variants with sample minor allele count (MAC) <10 (including cohort singleton variants in the target JHS cohort), average information content rescued (again measured by true R²) was >86%. For example, out of the 8.67 million singleton variants discovered in JHS by TOPMed WGS, 72% (or 6.24 million) can be well-imputed using Affymetrix 6.0 genotypes and using TOPMed freeze 5b (without JHS individuals) as reference, with an average true R² of 0.92 (Table 2). Singletons within JHS are defined as variants with a minor allele count of 1 among the JHS samples but which are present in multiple copies in the reference panel. Specifically, the average reference MAC is 29.3 before post-imputation quality control (QC) and 31.0 after QC, with all variants having a MAC > 5 in the overall reference panel.
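The two quantities used above can be illustrated with a minimal sketch. It assumes per-variant imputed dosages and true genotypes are available as plain Python lists; the function names, the handling of bin boundaries, and the convention of returning None when no cutoff reaches the target are illustrative assumptions, not the pipeline actually used in this study. The MAF bin edges follow the seven categories described for post-imputation quality control.

```python
from bisect import bisect_right
from statistics import mean

# Bin edges for the seven MAF categories: <0.05%, 0.05-0.2%, 0.2-0.5%,
# 0.5-1%, 1-3%, 3-5%, >5%. Boundary handling here is an assumption.
MAF_BIN_EDGES = [0.0005, 0.002, 0.005, 0.01, 0.03, 0.05]


def true_r2(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    mx, my = mean(imputed_dosages), mean(true_genotypes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(imputed_dosages, true_genotypes))
    sxx = sum((x - mx) ** 2 for x in imputed_dosages)
    syy = sum((y - my) ** 2 for y in true_genotypes)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined for a monomorphic vector; treat as uninformative
    return sxy * sxy / (sxx * syy)


def pick_r2_threshold(est_r2_values, target_mean=0.8):
    """Lowest estimated-R2 cutoff such that variants at or above it average >= target_mean.

    Returns None when no cutoff reaches the target.
    """
    for cutoff in sorted(set(est_r2_values)):
        passing = [v for v in est_r2_values if v >= cutoff]
        if mean(passing) >= target_mean:
            return cutoff
    return None


def thresholds_by_maf_bin(variants, target_mean=0.8):
    """variants: iterable of (maf, estimated_r2) pairs -> {bin_index: cutoff or None}."""
    bins = {}
    for maf, est_r2 in variants:
        bins.setdefault(bisect_right(MAF_BIN_EDGES, maf), []).append(est_r2)
    return {index: pick_r2_threshold(values, target_mean)
            for index, values in sorted(bins.items())}
```

Choosing the lowest qualifying cutoff retains as many variants as possible in each MAF category while still meeting the average quality target; the quadratic scan is kept for readability and would need a running-sum implementation at genome scale.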
Imputation quality is similarly high when examining extremely rare MAC variants in the reference panel, and even higher as expected with higher MAC variants within the JHS sample (Tables S3-4). Similar observations hold true for HCHS/SOL, with slightly lower imputation quality (Tables S5-6). Compared to JHS African Americans, the lower imputation quality in HCHS/SOL Hispanic/Latino individuals is likely attributable to multiple reasons including (1) the more complex LD structure among Hispanic/Latino individuals due to the presence of three ancestral populations; (2) the availability of a much smaller subset of rare variants for quality evaluation through MEGA array genotyping in HCHS/SOL (in contrast to the availability of nearly all segregating variants in JHS through high coverage sequencing); and (3) the smaller number of relevant haplotypes in the TOPMed freeze 5b reference (~26% self-identified AAs compared to ~10% self-identified Hispanics/Latinos). We note that greater numbers of AA and Hispanic/Latino individuals will be included in future releases of sequencing datasets from TOPMed, which we anticipate will further improve imputation quality; inclusion of JHS itself in imputation for other AA cohorts would also improve imputation quality.
Encouraged by these substantial gains in information content for low-frequency and rare variants, we proceeded with imputation in several additional AA and Hispanic/Latino data sets with array-based genotyping (Table S1, S8), followed by association analyses with quantitative blood cell traits to evaluate the power of TOPMed freeze 5b based imputation in minorities for discovery of genetic variants underlying complex human traits. We specifically chose hematological traits for several reasons. First, these traits are important intermediate clinical phenotypes for a variety of cardiovascular, hematologic, oncologic, immunologic, and infectious diseases (11). Second, these traits have family-based heritability estimates in the range of 40-65% (12, 13), and have been highly fruitful for gene-mapping with >2700 common and rare variants identified, though primarily in individuals of European ancestry (14-19). Third, these traits remain under-studied in admixed AA and Hispanic/Latino populations, despite evidence for the existence of variants with distinct genetic architecture in AAs and Hispanics/Latinos (20-22). For example, while hundreds of variants identified in genome-wide association studies (GWAS) of WBC in individuals of European descent explain only ~7% of array heritability, the African-specific Duffy null variant DARC rs2814778 alone accounts for 15-20% of population-level WBC variability in AAs (23). Finally, we have previously successfully leveraged deep-coverage exome sequencing-based imputation using resources from the Exome Sequencing Project for more powerful mapping of genes and regions associated with hematological traits in AAs (1).
Hemoglobin level (HGB), hematocrit (HCT), and white blood cell count (WBC) were chosen for our primary phenotypic analysis because these traits are available in the largest sample size among the AA and Hispanic/Latino participants included in our discovery cohorts.
Our imputation sample used for discovery blood cell trait association analyses included eight cohorts (23,869 AAs and 23,059 Hispanics/Latinos). These discovery samples do not overlap with individuals sequenced as part of TOPMed freeze 5b (Table S2). We used the full set of 100,506 phased sequences from TOPMed freeze 5b (including JHS) as the imputation reference panel. We then carried out AA- and Hispanic/Latino-stratified association analyses with quantitative HGB, HCT, and total WBC separately in each cohort genotyping array data set, accounting for ancestry and relatedness. The genome-wide association results for each imputed cohort data set were then meta-analyzed within each ancestry group. Figs S2-7 show the Manhattan plots from ethnic-specific meta-analyses for each trait. QQ plots (Figs S8-13) show no obvious early departure, with genomic control lambda ranging from 0.997 to 1.034, indicating no global inflation of test statistics. For replication of any novel associations identified in the imputation-based discovery analysis, we utilized WGS genotype data and hematological trait data from the non-overlapping set of AA individuals within TOPMed freeze 5b (Table S9) (see Methods for details).
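For readers unfamiliar with the genomic control lambda quoted above, the following is a minimal sketch of the conventional median-based estimate, assuming two-sided p-values from 1-degree-of-freedom tests. The function name is illustrative, and this is not necessarily the exact computation performed by the software used in this study.

```python
from statistics import NormalDist, median

_NORM = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.455), derived from
# the standard normal: P(Z^2 <= m) = 0.5 implies Phi(sqrt(m)) = 0.75.
CHI2_1DF_MEDIAN = _NORM.inv_cdf(0.75) ** 2


def genomic_control_lambda(p_values):
    """Median observed 1-df chi-square statistic divided by its null median.

    p_values must lie in (0, 1]; a p-value of exactly 0 is not supported.
    """
    # Convert each two-sided p-value back to a chi-square statistic using
    # the lower tail, which is numerically safer for very small p-values.
    chi2_stats = [_NORM.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

Values close to 1, such as the 0.997 to 1.034 range reported here, indicate that the bulk of the test statistics follows the null distribution.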
We first evaluated association statistics for variants previously associated with HGB, HCT, or WBC count in AA and Hispanic/Latino populations (summarized in Table S10). We assembled a list of 24 AA and 13 Hispanic/Latino previously identified autosomal signals from prior published GWAS or exome-based studies (1, 19, 20, 24-30). Our lists excluded variants reported in multi-ethnic cohorts or meta-analyses including individuals of non-AA or non-Hispanic/Latino ancestry to guard against the scenario that the reported signals were driven predominantly by individuals of European or Asian ancestry. Among the previously reported 24 AA and 13 Hispanic/Latino variants, all but five (four SNPs and a 3.8 Kb deletion variant esv2676630) passed variant quality control filters in TOPMed freeze 5b and were subsequently well-imputed in our target AA and Hispanic/Latino data sets with a stringent post-imputation R² filter of >0.8 (detailed in Table S11). Among the 32 known HGB, HCT, or WBC count associations testable with TOPMed freeze 5b, our imputed/discovery cohorts confirmed 64.5% of these previously reported findings with a consistent direction of effect, using a stringent genome-wide significance threshold of p < 5×10⁻⁸. Using more lenient p-value thresholds, we could replicate 74.2% (p < 5×10⁻⁶) and 100% (p < 0.05) of the previously reported findings with the same direction of effect. While these results help confirm the overall validity of our hematological trait association results, it is important to note for these comparisons that many of the samples included in the current TOPMed freeze 5b imputed genome-wide association analysis were also used in the publications originally reporting associations in AA and Hispanic/Latino individuals.
Our ancestry-stratified imputation-based discovery meta-analysis revealed two blood cell trait associations that have not been previously reported, at a genome-wide significance threshold of 5×10⁻⁹ in Hispanics/Latinos and 1×10⁻⁹ in AA populations, based on appropriate significance thresholds for whole genome sequencing analysis (31). One signal was revealed in each ancestry group: hemoglobin subunit beta (HBB) missense (p.Glu7Lys) variant rs33930165 (gb38:11:5227003:C:T) associated with increased WBC in AAs (β = 0.31 and p = 8.1×10⁻¹²) ( encodes an abnormal form of hemoglobin, Hb C, which in the homozygous state is associated with mild chronic hemolytic anemia and mild to moderate splenomegaly (39). In our discovery and replication data sets, there were no individuals homozygous for the Hb C variant, nor any compound heterozygotes for Hb S/C (Hb S is the sickling form of hemoglobin, and individuals homozygous for Hb S have sickle cell disease), which excludes the possibility that the apparently higher WBC is driven by an "inflammatory response" confined to a small number of individuals clinically affected by sickle cell disease or hemoglobin C disease. We next evaluated the association of HBB rs33930165 with the circulating number of WBC subtypes, including neutrophils, monocytes, lymphocytes, basophils, and eosinophils. Table S14 shows the results in our AA imputation-based discovery data sets (Table S15) and TOPMed freeze 5b WGS replication samples (Table S16), which suggest that the apparent association of HBB rs33930165 with total WBC is mainly driven by an association with higher lymphocyte count, with perhaps a more modest association with higher neutrophil count. Further studies are needed to delineate the putative mechanism of this unexpected association.
Our findings showcase the power of the large, ancestrally diverse TOPMed WGS data set as an imputation reference panel for admixed populations, in terms of both imputation quality and accuracy (especially for rare variants) and subsequent association studies for complex traits.
Specifically, we identified two rare variants associated with hematological traits in AA and Hispanic/Latino populations and were able to validate our initial HBB association with WBC in an independent replication sample of sequenced individuals. We expect the combination of high-quality imputation and higher depth sequencing datasets in larger cohorts of individuals will provide increased power for rare variant association analyses in diverse populations in the near future.

TOPMed 5b Sequencing and Phasing
baseline questionnaire and physical measures and stored blood and urine samples. Hematological traits were assayed as previously described (14). Genotyping on custom Axiom arrays and subsequent quality control has been previously described (47). Samples were included in our analyses if ancestry self-report was "Black Caribbean", "Black African", "Black or Black British", "White and Black Caribbean", "White and Black African", or "Any Other Black Background". Variants were selected based on call rate exceeding 95%, HWE p-value exceeding

non-Hispanic White participants (81%). Genotyping was completed as previously described (48) using 4 different custom Affymetrix Axiom arrays with ethnic-specific content to increase genomic coverage. Principal components analysis was used to characterize genetic structure in this multi-ethnic sample, as previously described (49). Blood cell traits were extracted from medical records. In individuals with multiple measurements, the first visit with complete white blood cell differential (if any) was used for each participant. Otherwise, the first visit was used.
In total, 5,783 Hispanic/Latino and 2,246 AA participants with blood cell traits were included in the analysis. Genotyping was performed through the CARe consortium Affymetrix 6.0 array (52, 53). In total, 2,392 AA participants with blood cell traits were included in the analysis.

Imputation and post-imputation quality filtering
We first phased individuals from each cohort separately using Eagle (57) with default settings. We subsequently performed haplotype-based imputation using Minimac4 (58), with phased haplotypes from TOPMed freeze 5b as reference. We used 100,506 TOPMed freeze 5b whole genome sequences as reference for all cohorts except JHS, for which we used 94,342 TOPMed freeze 5b non-JHS sequences. We additionally imputed HCHS/SOL and JHS using the 1000 Genomes Phase 3 (9) and HRC (8) reference panels. Post-imputation quality filtering was performed using an R² threshold specific to each MAF category to ensure that the average R² for variants passing the threshold was at least 0.8, following our previous work (4, 59). Restricting to variants

Hematological traits
HGB, HCT, WBC and differential were measured in both the discovery data sets (Tables S7, S13) and a subset of the TOPMed freeze 5b samples (Tables S8, S14) using automated clinical hematology analyzers. Prior to association analyses, we excluded extreme outlier values, notably WBC values >200×10⁹/L (as well as WBC subtype count values in these individuals), HCT >60%, and HGB >20 g/dL. For longitudinal cohort studies, all values are from the same exam cycle, chosen based on largest available sample size. WBC traits were log transformed due to their skewed distribution. For all traits, we first derived trait residuals adjusting for age, age squared, sex, and principal components/study-specific covariates as needed. Trait residuals were then inverse-normalized prior to analysis.
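The trait preparation steps described above (optional log transformation, covariate adjustment, inverse normalization) can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, only age, age squared, and sex are included as covariates, residuals come from ordinary least squares solved via the normal equations, and the rank offset (rank - 0.5)/n is one common convention that the text does not specify.

```python
import math
from statistics import NormalDist


def ols_residuals(y, covariate_rows):
    """Residuals of y regressed on an intercept plus the given covariates."""
    rows = [[1.0] + [float(v) for v in row] for row in covariate_rows]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting; a singular design
    # matrix raises ZeroDivisionError in this sketch.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        for r in range(k):
            if r != col:
                factor = system[r][col] / system[col][col]
                system[r] = [x - factor * p for x, p in zip(system[r], system[col])]
    beta = [system[i][k] / system[i][i] for i in range(k)]
    return [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(rows, y)]


def inverse_normalize(values):
    """Rank-based inverse normal transform; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    norm = NormalDist()
    return [norm.inv_cdf((rank - 0.5) / n) for rank in ranks]


def prepare_trait(values, ages, sexes, log_transform=False):
    """Log-transform if requested (e.g. WBC), residualize, then inverse-normalize."""
    y = [math.log(v) for v in values] if log_transform else list(values)
    covariates = [(age, age * age, sex) for age, sex in zip(ages, sexes)]
    return inverse_normalize(ols_residuals(y, covariates))
```

In practice, principal components and study-specific covariates would be appended to each covariate row, and a numerically more stable solver (for example a QR decomposition) would be preferable to the normal equations.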

Association analysis in discovery cohorts
Association analyses were carried out for these variants via EPACTS for all cohorts except for HCHS/SOL, using the q.emmax test to account for relatedness within each cohort. Association tests were performed on inverse-normalized residuals (adjusted for age, age squared, sex, and principal components/study-specific covariates), further adjusting for kinship matrices constructed in EPACTS using variants with a MAF > 1%. Individuals with different starting genotyping platform(s) were also analyzed separately. Inverse-variance weighted meta-analyses were further carried out using GWAMA (60), separately for AAs and Hispanics/Latinos.
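As an illustration of the inverse-variance weighted approach mentioned above, a minimal fixed-effect sketch for a single variant is shown below. It assumes per-cohort effect estimates and standard errors on the same scale and with the same effect allele; the function name is hypothetical, and GWAMA itself provides additional options (such as genomic control correction and heterogeneity statistics) that are not reproduced here.

```python
import math
from statistics import NormalDist


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted meta-analysis for one variant.

    Returns the pooled effect, its standard error, and a two-sided p-value
    from a normal approximation.
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_beta / pooled_se
    # For extremely large |z| this p-value underflows to 0.0 in floating point.
    p_value = 2.0 * NormalDist().cdf(-abs(z_score))
    return pooled_beta, pooled_se, p_value
```

Cohorts with smaller standard errors, typically the larger or better-imputed ones, contribute proportionally more to the pooled estimate.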

Identification and replication of novel associations
To identify putative novel associations, we then filtered out any variant with LD r² ≥ 0.2 in any ethnic group with any previously reported variant from GWAS, sequencing, or Exome Chip analyses within ±1 Mb for a given blood cell trait. We calculated LD in self-reported European ancestry, AA, and Hispanic/Latino individuals from TOPMed freeze 5b. For European and African LD reference panels, we further restricted to individuals with global ancestry estimate

The total number of well-imputed variants is extrapolated from three selected 3 Mb regions: the 16-19 Mb region from chromosomes 3, 12, and 20. These regions were chosen arbitrarily across a range of chromosome sizes, avoiding centromere, telomere, and low-mappability regions. Imputation was carried out using all typed SNPs +/-1 Mb (i.e., 15-20 Mb) and quality was evaluated in the core 3 Mb region. Post-imputation quality control was carried out in seven MAF categories separately: <.05%, .05-.2%, .2-.5%, .5-1%, 1-3%, 3-5%, and >5%. In each MAF category, an estimated R² threshold was selected to ensure variants above the threshold have an average estimated R² of at least 0.8. These variants constitute the well-imputed variants

TOPMed freeze 5b reference panel; QC+, number of these variants which passed imputation quality control; avgMAC, the average minor allele count in the (TOPMed freeze 5b minus JHS) reference panel of these variants; avgMAC QC+, the average minor allele count in the (TOPMed freeze 5b minus JHS) reference panel of variants which passed imputation quality control; avgEstR², average estimated R² for imputed variants; avgTrueR², average true squared Pearson correlation between imputed genotypes and genotypes from available whole genome sequencing data. Variants that did not have a MAC > 5 in the full TOPMed freeze 5b reference panel were not evaluated.

Table 3. Novel variants detected in TOPMed freeze 5b imputed Hispanic/Latino and African ancestry cohorts, in association analyses with white blood cell count, hemoglobin, and hematocrit.
EAF, effect allele frequency; HCT, hematocrit; HGB, hemoglobin; WBC, white blood cell count. Imputation R² (estimated R²) range reported across all included imputed cohorts. Association results adjusted for nearby known SNPs whenever applicable.
Association models for rs33930165 were adjusted for SNP rs2814778; removing potential minor allele homozygotes
Association models for rs11549407 were adjusted for SNPs rs334, rs33930165, and rs2213169.
rs334 and rs2213169 did not pass variant quality filters in TOPMed freeze 5b and were not included in our main analyses. However, to follow up our novel results in the HBB locus, we phased the failed variants in freeze 5b and performed targeted imputation using TOPMed freeze 5b calls for rs334 and rs2213169.
NA: among TOPMed freeze 5b Hispanic/Latino individuals, MAC = 1, so association statistics are not available.

[Figure panel: HCHS/SOL]