Genome-wide association study of red blood cell traits in Hispanics/Latinos: The Hispanic Community Health Study/Study of Latinos

Prior GWAS have identified loci associated with red blood cell (RBC) traits in populations of European, African, and Asian ancestry. These studies have not included individuals with an Amerindian ancestral background, such as Hispanics/Latinos, nor evaluated the full spectrum of genomic variation beyond single nucleotide variants. Using a custom genotyping array enriched for Amerindian ancestral content and 1000 Genomes imputation, we performed GWAS in 12,502 participants of Hispanic Community Health Study and Study of Latinos (HCHS/SOL) for hematocrit, hemoglobin, RBC count, RBC distribution width (RDW), and RBC indices. Approximately 60% of previously reported RBC trait loci generalized to HCHS/SOL Hispanics/Latinos, including African ancestral alpha- and beta-globin gene variants. In addition to the known 3.8kb alpha-globin copy number variant, we identified an Amerindian ancestral association in an alpha-globin regulatory region on chromosome 16p13.3 for mean corpuscular volume and mean corpuscular hemoglobin. We also discovered and replicated three genome-wide significant variants in previously unreported loci for RDW (SLC12A2 rs17764730, PSMB5 rs941718), and hematocrit (PROX1 rs3754140). Among the proxy variants at the SLC12A2 locus we identified rs3812049, located in a bi-directional promoter between SLC12A2 (which encodes a red cell membrane ion-transport protein) and an upstream anti-sense long-noncoding RNA, LINC01184, as the likely causal variant. We further demonstrate that disruption of the regulatory element harboring rs3812049 affects transcription of SLC12A2 and LINC01184 in human erythroid progenitor cells. Together, these results reinforce the importance of genetic study of diverse ancestral populations, in particular Hispanics/Latinos.

Hispanics/Latinos are ethnically heterogeneous, with admixture of European, West African, and Amerindian ancestral populations.In general, RBC trait values among Hispanics/Latinos have been reported to be similar to those among non-Hispanic whites, though certain types of congenital and acquired anemias are more common among Hispanics/Latinos [15][16][17][18][19].As with most complex traits, GWAS for discovery or generalization of RBC trait loci has yet to be performed in Hispanics/Latinos or other populations with Amerindian ancestry.In the current study, we performed genome-wide association analysis of seven quantitative RBC traits in 12,502 participants ascertained by the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) and replicated any new association findings discovered in HCHS/SOL in three independent samples of Hispanic/Latino Americans.

Results
The demographic characteristics and RBC trait distributions of the 12,502 Hispanic/Latino HCHS/SOL participants are summarized in S2 Table .Genomic inflation factors for the seven RBC traits ranged from 1.015 (MCHC) to 1.054 (RDW), indicating adequate control of population stratification (S1 Table ).Overall, 24 loci were significantly associated with one or more RBC traits in HCHS/SOL (Table 1 and S1 and S2 Figs).The number of distinct genomic regions associated with each trait were 4 loci for HCT, 4 for HGB, 6 for RBC count, 8 for RDW, 9 for MCH, 5 for MCHC, and 9 for MCV.Association results and allele frequencies of lead SNPs for each genome-wide-significant trait-locus association are presented for six genetic subgroups comprising the HCHS/SOL study population in S3 Table .Genomic loci previously known to be associated with RBC traits and generalization to Hispanics/Latinos Of the 24 genomic regions harboring variants that reach genome-wide significance for association with RBC traits in HCHS/SOL, 17 have been previously found to associate with RBC traits either through GWAS and/or Mendelian RBC disorders.Genomic regions and variants previously implicated in Mendelian RBC disorders include the African ancestral alleles for sickle cell trait/anemia or hemoglobin S (HBB rs334); hemoglobin C (HBB rs33930165); the common African form of G6PD A-deficiency (rs1050828); the 3.8kb alpha-globin gene deletion responsible for alpha-thalassemia trait (esv2676630); and a proxy SNP (rs2032451) for the European hereditary hemochromatosis (HFE) p.H63D allele.
At 13 of the 17 previously reported RBC loci, the lead variant for the trait detected in HCHS/SOL Hispanics was the same as the previously reported index SNP in European-, African-, or Asian-descent individuals or a strong linkage disequilibrium (LD) proxy (r 2 >0.8) for the variant, where LD was measured in the relevant ancestral population in 1000 Genomes.There were four cases in which the lead variant in HCHS/SOL was not an LD equivalent to the reported index SNP.The first, rs607203 (MAF = 0.07), is a lead SNP for MCH and MCV association loci located within a DNaseI hypersensitive region on chromosome 6q24 approximately 146kb upstream of CITED2.Rs607203 is not in strong LD (HCHS/SOL r 2 between 0.06 and 0.11) with any of the previously reported CITED2 European or Japanese index SNPs (rs590856, rs643381, rs628751, rs668459, rs632057), and therefore appears to represent an independent signal in the CITED2 locus.Among 1000 Genomes super-populations, the frequency of rs607203 is highest in African (AFR) (MAF = 0.14) populations; uncommon in European (EUR), American admixed (AMR), and South Asian (SAS) (MAF<0.05)populations; and monomorphic in East Asian (EAS) populations.A second exception is rs4714548, an intronic SNP of CCND3 associated with MCV.This HCHS/SOL lead SNP exhibits weak or no LD (HCHS/SOL r 2 <0.1) with any of the CCND3 index SNPs previously reported in Europeans (rs9349204, rs9349205) or Japanese (rs3218097) populations.Additionally, we report novel associations for two of the variants significantly associated with RDW in HCHS/SOL: SLC12A7 rs4565255 and TMPRSS6 rs855791.SLC12A7 rs4565255 is a proxy for rs4580814, which was previously associated with MCHC in Japanese populations [9].TMPRSS6 rs855791 has been previously associated with multiple red cell and iron-related phenotypes, but not with RDW [6,8,9].To formally assess whether variants previously associated with RBC traits in populations of European, Asian, and African ancestry generalized to HCHS/SOL Hispanics/Latinos, we used a directional FDR approach.Of 251 unique published SNP associations with any of the seven RBC traits, 146 (58%) generalized to HCHS/SOL (S4

Discovery and replication of new loci associated with RBC traits
The seven remaining genome-wide significant variants in the HCHS/SOL discovery sample were at previously undetected loci (Table 1), and three of these variants replicated in a metaanalysis of three independent Hispanic/Latino samples (Table 2, S2 Table ).

Functional analysis of new loci associated with RBC traits
At each of the three replicated discovery RBC-associated loci, we evaluated the functional genomic annotation and regulatory potential of the lead variant and any proxy variants (r 2 !0.8 in HCHS/SOL) in erythroid cells to determine the most likely causal variant(s).We identified the following variants as the most likely functional candidates: three intronic SNPs of PROX1 (rs7541039, rs7517701, and rs4282786) located within the same erythroid enhancer; one SNP 3' of PSMB5 (rs11846575); and rs3812049, which is located in a bi-directional promoter between SLC12A2 and an anti-sense long noncoding RNA, LINC01184 (S5 and S6 Tables).
We next performed mutagenesis analysis of the regions containing the PROX1, PSMB5, and SLC12A2 candidate causal variants using CRISPR-Cas9 genome editing to disrupt the respective putative regulatory elements in human umbilical cord-derived erythroid progenitor (HUDEP-2) cells (oligonucleotide sequences described in S7 Table ).At the SLC12A2 locus, a single guide RNA was expressed along with Cas9 to produce indels surrounding the predicted functional SNP rs3812049.These edits resulted in a substantial decrease in expression of both SLC12A2 and LINC01184 (Fig

Additional analysis of the alpha-globin copy number variant
Since the quality of structural variants imputed from 1000 Genomes may be lower than single nucleotide variants, we applied a specialized copy number variant (CNV) calling algorithm to re-type the key 3.8kb alpha-globin structural variant using raw probe intensity data from the Comparison of the CNV genotype calls to those for esv2676630 imputed from 1000 Genomes revealed that genotype calling using imputation appears to result in "under-calling" of the 3.8kb deletion, especially homozygous deletions (S8 Table ).In addition, there are a number of individuals in HCHS/SOL who carry a 3.8kb duplication (3 or 4 copies of the structural variant), which are mis-called by 1000 Genomes imputation as wild-type.Notably, the improvement in genotype accuracy with the CNV calling algorithm resulted in a nearly two-fold increase in effect size for MCH and MCV (Table 3) compared to 1000 Genomes imputation (S9 Table ).Therefore conditional association analyses were performed using alpha-globin deletion/duplication genotypes derived from the CNV calling algorithm.

Conditional analysis and identification of secondary, independent association signals
To identify additional independent association signals at known or novel RBC-associated loci, we performed step-wise conditional regression analyses in which we adjusted for the index variant at each genome-wide significant locus.The analysis was repeated with adjustment for each independently associated single variant or structural variant until no further independent signals were identified within that genomic region.Using a significance threshold of α = 5x10 -8 , we identified additional independent variants associated with one or more RBC traits (Table 3) in two genomic regions.At the beta-globin locus on chromosome 11p15 containing the index SNP rs334 (sickle cell variant), there was an additional intergenic variant (rs113342804) independently associated with MCV.At the terminal region of chromosome 16p13 containing the alpha-globin locus, we identified two additional low-frequency variants-HBM-HBA2 rs145546625 (or its proxy HBM rs148323035 for MCH and MCV) and the 3.8kb alpha-globin duplication (for MCV)-independently of the 3.8kb alpha-globin deletion.Rows in bold indicate variants that are Amerindian specific.1000 Genomes super-populations European (EUR), African (AFR), American (AMR), South Asian (SAS) and East Asian (EAS), were examined to determine global allele frequencies."oevar" is the imputation quality defined as the ratio of the observed variance of imputed dosage to the expected binomial variance.a The re-typed structural variant calls determined using Genvisis software.
b CR: during sequential conditional analysis, the round number in which the variant was conditioned for.https://doi.org/10.1371/journal.pgen.1006760.t003

Admixture mapping analysis
Several variants associated with RBC traits in the HCHS/SOL population are highly differentiated across ancestral populations.The HBB rs334, HBB rs33930165, esv2676630 alpha-globin 3.8kb gene deletion, and G6PD rs1050828 lead variants are derived from an African ancestral background, while the HFE hemochromatosis variant rs2032451 (proxy of rs1799945 p.H63D) is common among Europeans and Amerindian populations and much less common among Asians and West Africans.In addition, we note that the two newly reported independent association signals at the chromosome 16 alpha-globin locus-rs148323035/rs145546625 (Table 3) and the 3.8kb duplication-appear to be more common among populations of Amerindian ancestry [20,21].To assess whether any additional genomic regions might contain ancestrally differentiated SNPs associated with RBC traits, we performed a genome-wide admixture-mapping scan in HCHS/SOL for discovery analysis in each RBC trait.Admixture mapping in HCHS/SOL only detected associations already reported in the initial association testing: the chromosome 11p15 beta-globin region (for MCV); the chromosome 16p13 alpha-globin region (for RBC, HGB, MCV, MCH, MCHC, and RDW); and the RDW association on chromosome 14q11, which corresponds to the PSMB5 association signal discovered in the HCHS/ SOL GWAS (S5 Fig) .The PSMB5 index SNP shows large inter-continental allele frequency differences (rs7147308 T allele frequency is 0.87 in AFR, 0.40 in SAS, 0.30 in EUR, 0.21 in AMR, and 0.06 in EAS 1000 Genomes populations).

Discussion
We performed a GWAS of seven red blood cell traits in a diverse subsample of approximately 12,500 Hispanic/Latino participants of HCHS/SOL from across the continental U.S. We discovered and replicated three genome-wide significant variants (SLC12A2 rs17764730 and PSMB5 rs941718 for RDW, and PROX1 rs3754140 for HCT).We also showed that common African ancestral hemoglobin variants (beta-globin Hb S and Hb C missense variants rs334 and rs33930165, and alpha-globin 3.8kb thalassemia structural variant) and the African G6PD A-variant are associated with variation in RBC traits among the U.S. Hispanic/Latino population.Overall, 58% of previously identified GWAS loci for RBC traits generalized to HCHS/ SOL.We additionally provide a more detailed characterization of allelic heterogeneity at the alpha-and beta-globin loci, including a newly identified Amerindian ancestral variant that overlaps a known regulatory region of the alpha-globin gene cluster.
The HCT index SNP rs3754140 is located within a putative enhancer region positioned in the second intron of PROX1 and is in high LD (r 2 >0.8) with approximately 30 other intronic PROX1 variants (S5 and S6 Tables).Some of these intronic proxy SNPs (rs7541039, rs7517701, and rs4282786) occur within putative regulatory regions in erythroleukemia or proerythroblast cells, have CADD phred score >10, and therefore represent likely functional candidates.All three of these proxy SNPs are located in a putative enhancer element that exhibits DNaseI hypersensitivity in fetal proerythroblasts and K562 cells.Although enhancers can have distal target genes, a potential target is the enhancer-harboring gene PROX1, which has been reported as a negative regulator of hematopoietic stem cell renewal and for which mutations have been found in hematopoietic cell lines and primary blood malignancies [22,23].PROX1 encodes Prospero Homeobox 1, a widely expressed transcription factor involved in the development and differentiation of tissues such as endothelial lymphatic vessels, liver, retina, and pancreas [24].Several PROX1 variants (e.g., rs340874, rs340839) located in the 5' UTR of PROX1 or adjacent antisense noncoding RNA have been associated with metabolic traits such as fasting glucose, insulin resistance, diabetes, and triglyceride levels [25][26][27].The HCT-associated signal we detected in Hispanics/Latino is independent of the previously reported PROX1 metabolic trait association signal.Molecular analysis, including biallelic deletion of a 700bp region surrounding rs7541039 in the second intron of PROX1, showed no effect on transcription of PROX1-which does not appear to be expressed in human erythroid precursors-or neighboring genes SMYD2 and CENPF [28].In light of this information, further investigation of the role of the putative PROX1 intronic regulatory region and associated genetic variants in hematopoiesis-specifically RBC production-is warranted.
The RDW-associated locus on chromosome 14q11 is located in a gene-rich region.The lead SNP rs941718 and several LD proxies are non-coding variants within or near PSMB5, which encodes a 20S core proteasome subunit.From the standpoint of RBC biology, the ubiquitin proteasomal system may be particularly important during erythroid maturation and hemoglobin synthesis to control globin-chain balance and limit potential toxicities of unstable free globin chains [29].The lead SNP rs941718 is also a blood cis-eQTL for nearby genes HAUS4, MRPL52, PRMT5-AS1, and PRMT5 [30,31] and has a CADD phred score of 15.8 (S5 and S6 Tables).PRMT5 encodes an arginine methyltransferase involved in binding to the γ-globin promoter and silencing fetal hemoglobin expression, and therefore represents an additional potential mechanism for influencing RBC phenotype [32,33].The LD proxy rs11846575, located just 3' of PSMB5, is proximal to a highly tissue-specific erythroid enhancer [34][35][36] and therefore merits further functional experimentation in the context of erythroid development and hemoglobin synthesis.
The other newly reported RDW-association signal is located on chromosome 5q23 and spans ~100kb including SLC12A2 and an upstream long non-coding RNA (LINC01184) on the antisense strand.SLC12A2 (which codes for the protein NKCC1) is a sodium-, potassium-, and chloride-ion transporter membrane protein involved in cell-volume regulation and maintenance in kidney, RBC, and other cell types [37].Genetic variation in other RBC membrane ion-transport proteins (e.g., PIEZO1, SLC4A1) has been associated with inter-individual variability in RBC traits [13].The lead SNP at the SLC12A2 locus (rs17764730) lies within an exon of LINC01184.RNA-Seq data indicates that both SLC12A2 and LINC01184 are expressed in erythroblasts [35].The lead SNP is in high LD (r 2 >0.8) with 23 other variants spanning SLC12A2 and LINC01184 (S5 and S6 Tables).The strongest functional candidate SNP (rs3812049, imputation quality score 1.006, r 2 to lead SNP = 0.89) is located within a bi-directional promoter region between the 5' ends of SLC12A2 and LINC01184.Rs3812049 is also positioned within an erythroid DNaseI hypersensitive region and is occupied by multiple transcription factors, including the erythropoietic transcription factors GATA1 and TAL1 in erythroblasts and EGR1 in K562 cells.These observations suggest the possibility that the antisense transcript may be involved in erythrocyte maturation or maintenance by regulating SLC12A2 in erythrocytes.While this paper was under review, additional variants in the region of SLC12A2 and LINC01184 were reported to be associated with RDW in a predominantly European samples [38,39].
In human erythroid progenitor cells, we showed that small deletions in the bi-directional promoter region, including directly overlapping the position of rs3812049, lead to reduced expression of both SLC12A2 and LINC01184.Although formally demonstrating the function of the underlying element, these results could be consistent with a model in which rs3812049 alleles differentially modulate promoter activity.While disruption of the bi-directional promoter element did not reveal any differences in erythroid development, in vitro conditions may incompletely model a complex trait like RDW that appears highly dependent on appropriate RBC maturation and clearance in vivo.Finally, it is interesting to note both the large allele-frequency differences of the SLC12A2 index variant between African and non-African populations (Table 1) and a report of lower erythrocyte NKCC1 protein activity in African Americans compared to whites [40].This is particularly noteworthy given the established role of NKCC1 in blood pressure regulation, kidney function, and RBC-volume maintenance, and ethnic differences among these traits [37].Based on our preliminary molecular results, both SLC12A2 and LINC01184 should be examined further for their potential roles in erythrocyte and non-erythroid traits.
The HCHS/SOL cohort represents a diverse subsample of Hispanics/Latinos across the U.S., with varying admixture proportions of three continental ancestry groups: Amerindians, Africans, and Europeans.The beta-globin hemoglobin S and hemoglobin C variants, alpha-globin 3.8kb deletion, and G6PD A-variant have previously been shown to contribute to RBC phenotypic variance among U.S. African Americans [41,42].Here, we establish that these same common African ancestral hemoglobin and G6PD gene variants are associated with quantitative RBC phenotypes among U.S. Hispanics/Latinos.The heterozygous states of each of these inherited RBC conditions are prevalent in populations in Africa, Asia, southern Europe, and South and Central America, and confer a survival advantage against malaria [43].Even though carriers are generally without clinical sequelae, the heterozygous state of Hb C can induce RBC dehydration, resulting in a higher MCHC [44].Alpha-globin deletion carriers [1] and sickle cell trait carriers [45] may have lower levels of HCT, MCV, and MCH, and higher RBC counts, due to ineffective erythropoiesis.We also show that the HFE p.H63D variant (rs1799945) is associated with RBC phenotypes in Hispanics.Both C282Y and H63D hemochromatosis mutations are prevalent in Northern Europeans, while H63D appears more broadly in North Africa, the Middle East, and less commonly in Asia.Emigration from Europe over the past 500 years likely introduced C282Y and H63D to Americas and Oceania, leading to a frequency of H63D in Amerindians and Hispanics/Latinos exceeds that of East and South Asians [46,47].
At the alpha-globin locus, the 3.8kb deletion and duplication generally arise as a result of misalignment of homologous sequences within HBA1 and HBA2 and unequal crossing over during recombination.In U.S. Hispanics/Latinos, we observed that the 3.8kb alpha-globin duplication was significantly associated with lower MCV independently of the 3.8kb deletion.This may be due to imbalanced alpha/beta globin-chain synthesis, which may be exacerbated by co-inheritance of other globin gene mutations [48].Nonetheless, given the caveats of structural variant calling from genotype data, this finding requires additional validation using other molecular techniques.We observed additional allelic heterogeneity at the alpha-globin locus, a novel association signal for MCV and MCH with two Amerindian ancestral variants in high LD (r 2 >0.99): the HBM splice-site variant rs148323035, and rs145546625, located ~2 kb upstream of HBA2.HBM encodes hemoglobin mu, a globin chain similar to the oxygen highaffinity delta-globin found in reptiles and birds that is transcribed in a tightly regulated fashion in erythroid cells, particularly during the terminal differentiation stage [49].The HBM splice donor variant rs148323035 overlaps with a putative regulatory region that spans the transcription start site and first intron of HBM and is DNase hypersensitive, occupied by GATA1 and TAL1 in pro-erythroblasts [49].
Overall, generalization analysis revealed that 58% of RBC trait associations identified in GWAS of European-, Asian-, or African-descent populations generalized to HCHS/SOL Hispanics/Latinos.Nearly half of the previously reported genomic regions associated with RBC traits also had at least one variant associated one or more RBC traits in the HCHS/SOL, and 79% of individual SNPs previously reported as significant for more than one RBC trait generalized to HCHS/SOL for at least one of the previously reported traits.These results demonstrate that the same loci are likely involved in RBC trait biology across global populations, whether the functional variants are shared with or differ between ancestral groups.Failure to generalize can occur for one of several reasons, including but not limited to: (1) coverage of the relevant locus on the genotyping array is insufficient for the study population; (2) the originally published variant was a false positive and that locus is not associated with the relevant trait; (3) the power for generalization in HCHS/SOL is low due to the HCHS/SOL study population size; or (4) the power for generalization in HCHS/SOL may be low due to allelic frequency differences between populations.
In summary, we report three novel loci associated with RBC traits in Hispanics/Latinos as well as independent signals within two RBC trait-associated regions previously identified in African descent populations.This includes an Amerindian ancestral variant at the alpha-globin gene cluster that overlaps a known alpha-globin regulatory region.This particular variant is monomorphic among European, Asian, and African ancestral populations.Other Amerindian-specific loci for platelet count or diabetes have been identified among Hispanics/Latinos [50,51].These findings emphasize the importance of performing genetic studies in Hispanic/ Latino populations.

Study population
The HCHS/SOL is a cohort of 16,415 self-identified Hispanic/Latino persons aged 18-74 years who were selected from households and census block groups in Chicago, IL, Miami, FL, Bronx, NY, and San Diego, CA, as previously described [52].Study participants self-identified as having Hispanic/Latino background in one of six sub-groups, with the total study population including 6,471 participants identifying as having a Mexican background, 2,728 as Puerto Rican, 2,348 as Cuban, 1,730 as Central American, 1,460 as Dominican, and 1,068 as South American.Individuals were recruited to HCHS/SOL between 2008 and 2011, and underwent a baseline clinical exam that included clinical, lifestyle, and sociodemographic assessment [53].Based on kinship coefficient among the genotyped individuals, the HCHS/SOL sample includes 204 parent-offspring trios, 1,042 parent-offspring duos, 699 full-sibling pairs, and numerous second-and third-degree relatives.The IRB committees for the HCHS Coordinating Center at UNC Chapel Hill, San Diego State University, University of Illinois at Chicago, University of Miami, and Yeshiva University-Albert Einstein College of Medicine have all reviewed and approved the informed consent documents and study protocol.Written and signed informed consents in the language preferred by the participants are administered and archived at each of the participating field centers.All participants in this publication from HCHS/SOL have consented to use of their genetic and non-genetic data.Anyone not providing consent has been excluded from this analysis.Demographic characteristics and RBC trait descriptive statistics for included study populations are presented in S2 Table.

Red blood cell trait measurement
Whole blood (approximately 58 to 76ml) was collected at Visit 1 for all consenting HCHS/ SOL participants by certified technicians trained at their respective field-center institutions.Supplies and procedures were standardized across all field centers; 4ml of whole blood for complete blood count (hemogram) was collected in a tube containing EDTA as an anticoagulant.CBC values were measured from whole blood using an automated hematology analyzer (Sysmex XE-2100, Sysmex America, Inc., Mundelein, IL 60060) at the central laboratory at the University of Minnesota Medical Center, Fairview, in Minneapolis.

Exclusion criteria
Of the 16,415 individuals in the HCHS/SOL cohort study, 12,803 consented to genotyping and passed QC.Several individuals from the genotyped subset were excluded from the analysis, including individuals with predominantly Asian ancestry (n = 19), pregnant women (n = 8), participants with >5% immature granulocytes (n = 2), end-stage kidney disease (n = 46), hematologic cancer (n = 28), or those undergoing cancer chemotherapy (n = 54).After exclusions, a total 12,502 participants were included for HCT, HGB, RBC, MCH, and MCV; 12,501 for RDW; and 12,500 for MCHC.

Genotype data cleaning and QC
HCHS/SOL subjects who consented to genetic studies had DNA extracted from whole blood, which was genotyped on the Illumina SOL HCHS Custom 15041502 B3 array.This array comprised the Illumina Omni 2.5M array (HumanOmni2.5-8v1-1)and additional custom content [51,54].In order to capture more Amerindian variation, the Omni2.5Marray was modified by the addition of custom content comprised of ~150K SNPs selected from the CLM, MXL, and PUR 1000 Genomes Phase I samples for higher informativeness to identify Amerindian continental ancestry and for higher frequency in Amerindian genomic segments.Standard quality assurance/quality control (QA/QC) methods for SNP-and sample-level quality were applied.Quality metrics used to filter SNPs included Illumina/LA Biomed assay-failure indicator, missing call rate (>2%), deviation from Hardy-Weinberg equilibrium (p<10 −5 ), Mendelian errors (>3 in 1343 trios or duos), and duplicate sample discordance (>2 in 291 sample pairs).Following genotyping QA/QC procedures, there were 12,803 unique study participants and 2,232,944 SNPs available for imputation.

Imputation
For imputation, we used 1000 Genomes Project phase 1 reference panel and IMPUTE2 software.Genotypes were initially pre-phased using SHAPEIT2 (v2.r644, www.shapeit.fr),and subsequently imputed using IMPUTE2 software (v2.3.0,https://mathgen.stats.ox.ac.uk/ impute/impute_v2.html,last accessed Dec 2016) [54].Only variants with at least two copies of the minor allele present in any of the four 1000 Genomes continental panels were imputed, yielding a total of 25,568,744 imputed variants (SNPs and indels).Imputed genotype dosages were modeled on a continuous scale from 0 to 2 in order to account for genotype uncertainty.Oevar is an imputation quality metric, defined as the ratio of the observed variance of imputed dosage to the expected binomial variance.Variants with an oevar <0.3 were considered low quality and excluded from analysis.Additional information about imputation and quality metrics is found in Conomos, et al [55].

Copy number variant genotyping and association analysis at the alphaglobin locus
The SOL Illumina Omni 2.5M array contains five variants (rs2362744, rs4021971, rs4021965, rs11639532, rs2858942) within the 3,811bp alpha-globin structural variant that can be used for determining copy number.Raw probe intensity data (normalized X and Y values) were exported from GenomeStudio as FinalReport files and then imported into the Genvisis software package (http://genvisis.org,last accessed Jan 2017) in order to use its specialized CNV calling algorithm.The first step in the process is to re-compute the Log R ratios (LRRs) using centroids derived from only high-quality samples (standard deviation of the autosomal LRRs <0.32 and genotype call rate >98%).LRRs from a set of ~50,000 curated markers were included in a principal components analysis (PCA) to capture DNA quality, DNA quantity, and batch effects.After regressing out 60 PCs from the raw intensity data, we recomputed LRRs and determined the median LRR value for the five markers in the alpha-globin region.Copynumber (0, 1, 2, 3, or 4) calling for the structural variant was then performed after visual inspection of the cluster boundaries with median LRR on the x-axis and median absolute

Replication significance criteria
Association testing was performed in each of the three Hispanic replication data sets (WHI, BioMe, MESA) using linear regression and the same RBC trait transformation as the discovery samples, adjusted for age, sex, and principal components.Meta-analysis of results from the 3 independent Hispanic replication study samples was performed using the inverse-varianceweighted method implemented in METAL (http://csg.sph.umich.edu//abecasis/Metal,last accessed Dec 2016).We defined novel, replicated loci as those which exceeded a Bonferronicorrected significance threshold of p <0.05/7, or 0.0071 (accounting for 7 SNPs carried forward for replication) and are located >1 megabase (Mb) from a previously reported genomewide significant association signal.

Conditional analysis
We performed step-wise conditional analysis for each RBC phenotype to identify secondary, independent association signals within 500kb of known and newly discovered GWAS loci.In the first round of the conditional analysis for each trait, we used the same regression model as in the discovery GWAS, with additional adjustment for previously reported or novel variants identified in this study.The list of variants used in conditional analysis of each trait is provided in S13 Table .The significance threshold for discovering new, independent association signals was the same as the genome-wide discovery threshold (α = 5.0x10 -8 ) as well as MAF!0.01.Subsequent rounds of conditional analysis were repeated for each genomic region, adding the strongest genome-wide significant variant from the previous round as a covariate in the regression model, until no further genome-wide significant variants satisfying the MAF threshold remained in that region after covariate adjustment.The full models for each trait used in the final round of conditional analysis are listed in S14 Table .After obtaining probe intensity-based CNV calls for the 3.8kb alpha-globin CNV, we conducted conditional analysis on chromosome 16 using the calls from the re-typed CNV.The full models for each trait used in these conditional analysis are also listed in S14 Table .Conditional analysis with the retyped 3.8kb alpha-globin CNV was conducted on the subset of 12,390 individuals for whom the re-typed CNV calls were available.

Generalization analysis
Variants used in generalization analyses were identified by one of two inclusion methods: (1) any variant listed as genome-wide significant for any of the seven RBC traits in our study in the European Bioinformatics Institute GWAS catalog (http://www.ebi.ac.uk, last accessed Jan 2017); or (2) any RBC trait genome-wide-significant variants published in the main text or supplement of an English-language GWAS indexed in PubMed prior to December 2016.(Of note, we did not identify any GWAS published in a language other than English, hence we expect our list of variants identified using these methods to be complete prior to 2017.)We tested each published RBC-associated variant to see whether that association generalized to Hispanics/Latinos.The directional generalization null hypothesis is rejected if there is enough evidence that the published variant is directionally consistent and associated with the outcome in both the discovery study and HCHS/SOL.We evaluated for generalization all available signals previously reported in any GWAS published in English, for all seven traits evaluated in this paper (S4 Table ).Most of these SNPs were reported in studies of adults of European ancestry, but we also generalized associations from African-and Japanese-ancestry populations.No variants identified in Danjou, 2015, were included in our genotyped or imputed dataset and hence these variants could not be evaluated for generalization [60].To test the generalization null hypotheses, we computed directional FDR r-values for each of the tested SNPs.
Directional r-values were calculated based on one-sided p-values from both the "discovery" study (reported in the literature) and the HCHS/SOL, and based on the number of tests performed in the discovery study, in order to properly account for multiple testing.A SNP was considered generalized if its r-value was <0.05 [61].In generalizing associations reported by Ganesh et al. (2009), we did not employ directional control since Ganesh, et al., (2009) did not report effect sizes or directions [8].The implication is a slight loss of power.
Generalization analysis was performed by looking up reported SNPs in HCHS/SOL results, in an analysis that mimics the analysis reported in the discovery study.For example, if a trait was reported as an association analysis with the natural-log-transformed trait, we performed the analysis with the same transformation in the HCHS/SOL population.In some cases, as with Kamatani, et al. (2010), we also matched effect-size reporting methods (standard deviations) for ease of comparison.Transformations, when applicable, are described in S1 Table .Since the same SNP-trait association may be reported by multiple studies, we counted only unique SNPtrait associations.In instances where more than one study reported associations for the same SNPs and trait, but used different trait transformations, we selected the results from the generalization analysis in which the trait transformation matched our primary analysis.
Since some SNPs are associated with more than one RBC trait, and some genomic regions contain multiple SNPs associated with multiple traits, we summarize the generalization results as follows.Overall, we summarize the number of generalized unique trait-SNP associations (the same SNP may be counted more than once, if associated with more than one trait).Then, for each trait, we summarized (1) the number of unique SNPs, and (2) the number of unique genomic regions.To define genomic regions, we identified specific SNPs, and a 1Mb genomic region around them.Other SNPs within these regions were clumped together.We say that a genomic region generalized for a specific trait if at least one SNP in the region was associated with the trait.

Functional annotation of novel loci
We assessed any novel, replicated red blood cell associated loci to determine potentially causal variants.At each locus, we determined if the lead or proxy variants (r 2 !0.8) were located within putative erythroid regulatory elements, defined on the basis of enrichment for various histone-modification and ChIP-Seq signals in either erythroblasts or the erythroleukemia cell line K562 [34][35][36].We defined these regulatory regions as follows: enrichment for histone H3K4me1 as an enhancer, enrichment for histone H3K4me3 as a promoter.Variants located within a putative promoter or enhancer, and that overlapped a DNaseI hypersensitive site in proerythroblasts or K562 cells, were prioritized as putatively functional [34,36,62].Regulatory elements often are bound by transcription factors and hence we report ChIP-Seq peak overlaps of key erythroid transcription factors (GATA1, TAL1), and others in proerythroblasts and K562 cells to provide further support for the functional role of putative regulatory elements in erythroid cells [34,36,62].The ENCODE and BLUEPRINT datasets were accessed through the ENCODE analysis Hub and Blueprint Hub respectively via the UCSC genome browser [63,64].Datasets from Xu, et al, were accessed from codex (http://codex.stemcells.cam.ac.uk, last accessed Dec 2016) [62,65].To hypothesize likely mode of action via which the causal variants influence the trait, we report eQTL targets and or motifs disrupted by prioritized variants using HaploReg v4.1 [66].All the datasets used for functional annotation were mapped to Human GRCh37/hg19 assembly.Functional annotation is summarized in S5 Table.We also used in silico prediction algorithms to annotate variants.These included RegulomeDB, the Combined Annotation Dependent Depletion (CADD) phred score, GWAVA, and deltaSVM [67][68][69][70].These annotations are summarized in S6 Table.

In vitro analysis of functional candidates within SLC12A2-LINC01184, PSMB5, and PROX1
The CRISPR/Cas9 system was used to mutagenize individual variants or small regions of interest identified during discovery analysis and subsequent bioinformatics interrogation.All oligonucleotide sequences used in CRISPR-Cas9 genome editing experiments are listed in S7 Table .The human umbilical cord blood derived erythroid progenitor cell line #2 (HUDEP-2) was cultured and used for genome editing as previously described [71].Individual and tandem pairs of single chimeric guide RNAs were cloned to lentiviral expression vectors (lentiGuide-Puro, Addgene plasmid 52963).Cells were transduced and selected for lentiviral integrants by antibiotic selection (10 μg/ml blasticidin for lentiCas9-Blast [Addgene plasmid 52962], 1 μg/ ml puromycin for lentiGuide-Puro).For SLC12A2 individual sgRNA promoter editing, indel frequencies were assessed after 7 days by nested PCR followed by amplicon deep sequencing.For SLC12A2-LINC01184, PSMB5, and PROX1 interstitial deletions, cells were plated at limiting dilution to isolate clones 7 days after transduction with tandem sgRNAs.Clones with biallelic deletions were characterized by presence of gap PCR amplification with primers outside the deleted segment and absence of PCR amplification from inside the deleted segment.Expression of mRNA of genes of interest was compared to GAPDH expression using quantitative reverse transcription PCR (RT-qPCR) in control and edited HUDEP-2 cells.For SLC12A2 individual sgRNA promoter editing, the total population of edited cells was evaluated in bulk by RT-qPCR.For SLC12A2-LINC01184, PSMB5, and PROX1 interstitial deletions, clones were first identified by PCR screening and then evaluated by RT-qPCR.For differentiation experiments, control and edited HUDEP-2 cells were cultured separately for 4 days in Erythroid Differentiation Media (EDM) with Iscove's Modified Dulbecco's Medium (IMDM) (Life Technologies) supplemented with 330 mg/ml holo-transferrin (Sigma), 10 mg/ml recombinant human insulin (Sigma), 2 IU/ml heparin (Sigma), 5% human solvent detergent pooled plasma AB (Rhode Island Blood Center), 3 IU/ml erythropoietin, 100 ng/ml human SCF, (R&D), 1 mg/ml doxycycline, 1% L-glutamine, and 2% penicillin/streptomycin. Subsequently the cells were cultured an additional 4 days in EDM lacking SCF, and then an additional 4 days in EDM lacking both SCF and doxycycline.Erythroid maturation was evaluated by flow cytometry staining with CD71 (eBiosciences), CD235a (eBiosciences), CD49f (Miltenyi), and DRAQ5 (eBiosciences) as well as morphology by May-Grunwald-Giemsa staining, Student's t-tests were used for statistical analysis of results.

1 ).
Differentiation of erythroid cells was not obviously affected by disruption of the bi-directional promoter site.In a separate mutagenesis experiment, deletion of the third exon of LINC01184 resulted in a 3-fold reduction in LINC01184 expression, but did not appear to exhibit substantial cis effects on SLC12A2 expression (S3 Fig).While the candidate regulatory region of PROX1 is located within an erythroid enhancer, PROX1 itself is not expressed in human erythroid cells including HUDEP-2, suggesting that the enhancer element might regulate a distal target.However, a 700 base-pair biallelic deletion of the PROX1 intronic region containing rs7541039, rs7517701, and rs4282786 did not show any effect on HUDEP-2 cell maturation or on expression of neighboring genes SMYD2 and CENPF, both located within 300 kb of the putative enhancer element (S4 Fig).Similarly, deletion of the putative enhancer downstream of PSMB5 did not significantly alter expression of PSMB5 or neighboring genes (PRMT5, HAUS4, C14ORF93, and ACIN1) that are both expressed in erythroid precursors and located within the same topologically associated domain of K562 cells (S4 Fig).

Fig 1 .
Fig 1.Small indels around rs3812049 reduce expression of both SLC12A2 and LINC01184 in HUDEP-2 cells.HUDEP-2 human erythroid precursor cells were transduced with lentivirus expressing Cas9 and a guide RNA, either nontargeting (NT) or targeting cleavage at rs3812049, and selected with antibiotics.Seven days after transduction, expression of SLC12A2 and LINC01184 in the population of edited cells was measured by quantitative reverse transcription PCR.Experiment was performed in biologic triplicate.Bars indicate means and error bars indicate standard deviation.T-tests showed significant differences in expression of both SLC12A and LINC01184 upon introduction of indels around rs3812049 (p < 0.01 for each comparison to unedited controls).https://doi.org/10.1371/journal.pgen.1006760.g001 Manhattan plots = -log10(p-value).X-axis of QQ plots = expected p-value; Y-axis of QQ plots = observed p-value.(PDF) S2 Fig.Locus-Zoom plots of loci significantly associated with RBC traits.All variants with minor allele count >30 were plotted using Locus-Zoom software and genome build 37/hg19 positions on the x-axis.The left y-axis is the negative log 10 p-value for the association between each variant and the relevant RBC trait; the gray line represents genome-wide significance (p<5x10-8 ).The left y-axis (blue lines on the plot) is the recombination rate in percent.The lead SNP at each locus is designated with a triangle if the SNP is imputed, and a diamond if the SNP is genotyped.Each symbol represents one variant, with circles for genotyped and x's for imputed variants.Linkage disequilibrium (correlation, r 2 ) with the lead variant in HCHS/SOL is indicated by color, with the colors for each level of LD shown in the upper-right corner of the plot.The genes at each locus are aligned underneath the plot with the corresponding genomic positions.(PDF) S3 Fig. Impact of deletion of LINC01184 Exon-3 on expression of LINC01184 and SLC12A2.HUDEP-2 human erythroid precursor cells were transduced with lentivirus expressing Cas9 and a pair of guide RNAs targeting cleavages flanking exon-3 of LINC01184.After limiting dilution, clones were screened by PCR for deletion of LINC01184 exon-3.Twelve clones with biallelic deletion of LINC01184 exon-3 were identified and utilized for quantitative reverse transcription PCR to measure expression of LINC01184 and SLC12A2.Primers for LINC01184 measurement annealed to sequences at exons 1 and 2, i.e., non-deleted sequences.Data is shown for each of 12 biallelic deletion clones performed in technical triplicate.Gene expression is normalized to the level of parental cells.Lines indicate means and standard deviations.(DOCX) S4 Fig. Small indels in DNase I hypersensitive sites do not exhibit cis effects on expression of PROX1 and PSMB5 in HUDEP-2 cells.Deletions of DNase I hypersensitive sites (DHSs) at PSMB5 and PROX1 loci were not associated with significant gene expression changes in cis.HUDEP-2 human erythroid precursor cells were transduced with lentivirus expressing Cas9 and a pair of guide RNAs targeting cleavages flanking DHSs at PSMB5 and PROX1.After limiting dilution, clones were screened by PCR for deletion of DHSs.Biallelic deletion clones were identified and utilized for quantitative reverse transcription PCR to measure expression of neighboring genes.As a control, nondeletion clones were isolated in parallel.Data is shown for RT-qPCR for indicated gene in a single clone, normalized to GAPDH, and then to median of the nondeletion clones.Each measurement was performed in technical triplicate.Lines indicate medians of each set of clones.No significant differences were identified between deletion and nondeletion clones (p > 0.05 for all comparisons).(TIF) S5 Fig. Manhattan plots from admixture mapping analysis of RBC traits in HCHS/SOL participants.X-axis of Manhattan plots = ordered autosomal chromosomes; Y-axis of Manhattan plots = -log10(p-value).The X chromosome was not evaluated because established methods for admixture mapping of this chromosome are not available.(DOCX) S1 Table.Red blood cell trait descriptions.Genomic inflation factor refers to the ratio between the median test statistics value and the expected median for variants with MAF !0.01.(DOCX)

Table ) .
The