Genetically regulated gene expression underlies lipid traits in Hispanic cohorts

Plasma lipid levels are risk factors for cardiovascular disease, a leading cause of death worldwide. While many studies have been conducted in genetic variation underlying lipid levels, they mainly comprise individuals of European ancestry and thus their transferability to non-European populations is unclear. We performed genome-wide (GWAS) and imputed transcriptome-wide association studies of four lipid traits in the Hispanic Community Health Study/Study of Latinos cohort (HCHS/SoL, n = 11,103), replicated top hits in the Multi-Ethnic Study of Atherosclerosis (MESA, n = 3,855), and compared the results to the larger, predominantly European ancestry meta-analysis by the Global Lipids Genetics Consortium (GLGC, n = 196,475). In our GWAS, we found significant SNP associations in regions within or near known lipid genes, but in our admixture mapping analysis, we did not find significant associations between local ancestry and lipid phenotypes. In the imputed transcriptome-wide association study in multiple tissues and in different ethnicities, we found 59 significant gene-tissue-phenotype associations (P < 3.61×10−8) with 14 unique significant genes, many of which occurred across multiple phenotypes, tissues, and ethnicities and replicated in MESA (45/59) and in GLGC (44/59). These include well-studied lipid genes such as SORT1, CETP, and PSRC1, as well as genes that have been implicated in cardiovascular phenotypes, such as CCL22 and ICAM1. The majority (40/59) of significant associations colocalized with expression quantitative trait loci (eQTLs), indicating a possible mechanism of gene regulation in lipid level variation. To fully characterize the genetic architecture of lipid traits in diverse populations, larger studies in non-European ancestry populations are needed.


Introduction
Lipid levels are a major risk factor for cardiovascular disease, the leading cause of death in the United States [1]. While lipid levels are known to have a highly heritable component, lipid levels as a complex trait are increasingly concerning due to the growing global rate of obesity caused by rapid urbanization and high-fat foods [2]. Hispanic populations have been especially a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 genotypes. We report GTCA-COJO joint effect sizes (bJ) and p-values (pJ) to emphasize independent loci signals.
In the fine-mapping joint analysis, we found multiple significant, independent SNPs associated with the four phenotypes. These SNPs include 12 with CHOL, 7 with HDL, 0 with TRIG, and 10 with LDL, totalling 29 associations (S1 Table). There were 24 unique independent SNPs across the phenotypes with some associations replicating in CHOL and an additional lipid trait (S1 Table).
Two of the significant SNPs, both associated with CHOL, had MAF < 0.01 in European populations, but both had MAF > 0.05 within HCHS/SoL (Table 1, Fig 2). rs117961479 is in LD (1000G AMR r 2 = 0.632) with rs199768142, which is implicated in total cholesterol levels [22]. SNPs in LD with rs17041688 (1000G AMR r 2 > 0.6) in the GWAS catalog had one association with marginal zone lymphoma, and no linked SNPs had obesity or cardiovascular associations [23].  Table 1. Significant conditional and joint analysis SNPs with rare allele frequency in European populations. In the GWAS of HCHS/SoL, we found 29 independent associations between SNPs and 4 lipid traits in the joint analysis, including 2 SNPs with minor allele frequency < 0.01 in Europeans. bJ indicates the effect size and pJ represents the p-value, both from a joint analysis of all the SNPs at the significant loci. The SNPs included in this table have pJ < 5×10 −8 and EUR MAF < 0.01. Neither of these SNPs were present in our genotype data for MESA. Full results for all conditional and joint analyses are in S1 Percent variance explained (PVE) by all genotypes as calculated in GEMMA for each phenotype were similar to those previously observed [25,26]. These values are CHOL: 0.269 ± 0.029, HDL: 0.180 ± 0.031, TRIG: 0.135 ± 0.030, and LDL: 0.278 ± 0.030.

Admixture mapping does not identify ancestral associations with lipid phenotypes
Admixture mapping has been previously used in HCHS/SoL to uncover ancestral tracts, especially of Native American ancestry, that may affect traits [27,28]. We performed admixture mapping using local ancestry estimates of European, African, and Native American chromosome tracts in each individual. We used RFMix to estimate how many alleles at each SNP came from each ancestral population [29]. We ran a separate linear mixed model for each of the three ancestries testing the number of estimated alleles from the origin ancestry for association with each phenotype in GEMMA [10]. Our reference panels for local ancestry were Iberian in Spain (IBS, n = 107), Native American (NAT, n = 27), and Yoruba in Ibadan, Nigeria (YRI, n = 108) with procedures used for reference population selection detailed in the Methods. We used a significance threshold for admixture mapping tests of P < 5.7×10 −5 , previously determined within HCHS/SoL [27]. We restricted analyses to chromosomes 16 and 19 due to their known importance in lipid traits. None of the significant SNPs within our admixture analyses reached previously determined significance. The most significant ancestry tract was on chromosome 19 from 49945850 bp to 50108983 bp for TRIG in Native American ancestry (P = 3.1×10 −4 ). We include our top 1000 most significant SNPs (P < 5.3×10 −3 ) in admixture analyses in S2 Table. Imputed transcriptome-based association study in HCHS/SoL implicates 14

genes in lipid traits
We performed imputed transcriptome-based association studies to investigate the associations of genetically predicted gene expression with the four lipid traits while also accounting for relatedness and structure in the discovery cohort HCHS/SoL (n = 11,103) [10,11]. We used two main sets of prediction models: GTEx V6 and MESA. GTEx V6 predictor models include 44 individual tissue models built in predominantly European-American individuals (predictor population n > 70, 85% European and 15% African-American), and MESA predictor models include monocyte models from 5 populations comprising multiple ethnicities, including African-American and Hispanic (predictor population n > 233) [12][13][14]30]. All GTEx results were filtered to those with green flags as described on http://predictdb.org/. We defined discovery significance as P < 3.1×10 −8 , which is 0.05/(all associations tested). Primary significance may be stringent due to the amount of eQTL sharing between transcriptomic models [31]. We tested significant associations for replication in the genotyped MESA cohort (n = 3,855) and using GWAS summary statistics from GLGC (n = 196,475) [5,13,32]. There were no Hispanic/Latino populations included in the GLGC analysis [5]. There was little test statistic inflation among the PrediXcan results for HCHS/SoL (S4 Fig), MESA, or GLGC. Full PrediXcan results for all cohorts are available at https://github.com/WheelerLab/px_his_chol. Across 4 phenotypes, 44 GTEx models, and 5 MESA models, we found 59 significant genetissue-phenotype associations, including 14 unique significant genes. These include well-studied lipid genes such as APOB, PSRC1, SORT1, and CETP and genes that have been implicated in cardiovascular traits such as CCL22 and ICAM1 (Table 2, Fig 3) [33][34][35]. 45/59 associations replicated in MESA and 44/59 associations replicated in GLGC (P < 0.05). The only associations that did not replicate in GLGC were those that were not predicted at all due to lack of SNP overlap between the GLGC GWAS summary results and the expression prediction models. All 44 of the GLGC associations also met the more stringent threshold of P < 8.5×10 −4 , the Bonferroni adjustment for 59 tests. Despite being just 2% of the size of GLGC, 40 of the MESA associations also met the Bonferroni adjusted threshold (Table 2).
Though most of the results were from the 44 predominantly European GTEx tissue models, we also had significant PrediXcan results using the five MESA monocyte models built in various ethnicities [14]. 14/59 of the significant associations in HCHS/SoL were in MESA models, with 4 associations in CETP and 10 associations in PSRC1 in the same effect direction as the other tissues (Table 2).
A recent study investigated cardiometabolic traits, including the four lipid phenotypes, within diverse cohorts using PrediXcan GTEx V7 models, which, unlike V6 used in our analyses, include only European ancestry individuals [36]. They found 12 gene-lipid-tissue associations at P < 2.6×10 −5 , one of which replicated in GLGC. We compared their novel GTEx tissue results to our results. None of their gene associations replicated in our analyses (all P > 0.09).

A majority of significant PrediXcan gene associations have colocalized GWAS and eQTL signals
We further investigated whether the PrediXcan gene associations had colocalized signal with known eQTLs. Colocalization provides additional evidence that the SNPs in a given expression Table 2. Significant PrediXcan associations in HCHS/SoL and replication in MESA and GLGC. We performed PrediXcan for four lipid phenotypes in HCHS/SoL using GTEx and MESA monocyte gene expression prediction models [12][13][14]. In total, there are 14 unique genes across 59 significant gene-tissue-phenotype associations. MESA population abbreviations: African-American (AFA), European (CAU), Hispanic (HIS), AFA and HIS (AFHI), and all populations combined (ALL). model are functioning via gene regulation to affect lipid traits. We restricted variants to those within PrediXcan models and we tested them for colocalization of eQTLs and lipid trait GWAS associations. We used the software COLOC, which estimates the colocalization probability for a SNP between an eQTL and a GWAS hit [15]. We have previously applied COLOC to gene-level PrediXcan results and observed clustering of significant genes into non-colocalized, colocalized, or unknown signal [13]. P3 estimations indicate the probability of independent signals from an eQTL association and a GWAS association, with P3 > 0.5 indicating no colocalization. P4 > 0.5 indicates a shared eQTL and GWAS association of variants within the prediction model, and if neither P3 nor P4 > 0.5, whether or not the GWAS and the eQTL signals are colocalized is unknown [13,15]. We used eQTL data from GTEx V6 and MESA monocytes, the same populations as the PrediXcan predictors [14,30].   Table 2), which is consistent with previous studies of CETP [35]. We calculated colocalization probabilities for all significant gene-phenotype-tissue associations. Of these associations, 40/59 had a better than chance probability of colocalization between GWAS and eQTL signals (P4 > 0.5) ( Table 3). This included the genes PSRC1, APOB, CELSR2, SORT1, CETP, DOCK6, ICAM1, CCL22, and LIPC. Of the remaining 19 associations, 12 were likely independent signals (P3 > 0.5) and 7 were inconclusive. Full results are available at https://github.com/WheelerLab/px_his_chol.

Chr. BP
Of the 8 significant associations found in liver, 7/8 had a significant P4 (P4 > 0.5) and 6/8 with P4 > 0.99, highlighting the liver's importance in lipid metabolism and possibly indicating its action through gene regulation. While somewhat expected, these results are notable because the GTEx liver tissue has a smaller sample size (n = 97) than other tissues with multiple associations (whole blood: 2/49, n = 338; MESA monocytes ALL 3/49, n = 1,163), signifying that liver is an especially important tissue in lipid metabolism as extensively studied previously [15,37].
We found differences in colocalization probabilities between the MESA models of different ethnicities. This is most prevalent in the CETP model, a known driver of cholesterol metabolism, where only the models including Hispanic genotypes had P4 > 0.99, while the CAU model displayed independent GWAS and eQTL signals (Table 3). This is likely due to the shared LD patterns in the Hispanic eQTL and GWAS cohorts (Fig 4). SNPs included in the MESA HIS and MESA CAU PrediXcan models for CETP are different because elastic net variable selection, which was used to generate the model in each population [14], is partially dependent on correlation among variables, i.e. LD. Thus, since the LD pattern of HCHS more closely resembles HIS than CAU and because linked SNPs in HCHS and HIS were most significantly associated with HDL and CETP expression, respectively, the colocalization probability is much higher in the HIS model than the CAU model (Fig 4, Table 3).

Discussion
We integrated expression quantitative trait loci (eQTL) data from multiple tissues and multiple ethnicities in transcriptome-wide association and colocalization studies to investigate the biological mechanisms underlying lipid trait variation. GWAS have previously been performed in the Hispanic discovery cohort, HCHS/SoL, with no novel, replicable associations found [9], similar to our SNP-level findings. We acknowledge our GWAS findings were limited due to our restriction to SNPs present in PrediXcan models, which were derived from predominantly European cohorts. However, with PrediXcan, we found 59 transcriptome-wide significant gene associations with lipid traits in HCHS/SoL, with 45/59 gene-tissue-phenotype combinations replicating in MESA and 44/59 combinations replicating in GLGC. MESA and GLGC are of diverse (African-American, European, and Hispanic ancestry) and European-only ancestry, respectively, with a similar replication rate in MESA, which has < 2% of the cohort size of GLGC.
Imputed-transcriptome based association methods like PrediXcan offer benefits over SNPlevel GWAS in producing actionable results [11]. Running PrediXcan across GTEx models can implicate particular tissues in a phenotype. Here, we find the most (8/59) significant associations with the GTEx liver models compared to other tissue models, which is encouraging as the liver is key in cholesterol synthesis and metabolism ( Table 2, Fig 3) [37]. PrediXcan results include a direction of effect, and many of our results had the same direction of effect as in vivo observations in humans and mice [35,40]. For example, we found increased expression of PSRC1 is significantly associated with lower CHOL and LDL in multiple tissues and ethnicities. The same relationship has been previously observed in the cholesterol metabolism of mice and in the measured gene expression in an Indian population [41,42]. Our CETP results Table 3. Gene associations and colocalization with eQTLs. We performed COLOC to test colocalization between HCHS/SoL GWAS and eQTL data in our transcriptomic models (44 GTEx, 5 MESA). We found 40/59 of significant (P < 3.61×10 −8 ) PrediXcan associations had significant probability (P4 > 0.5) for colocalization, indicating that the variants may be acting through gene regulation to affect the phenotypes. P3 > 0. 5  indicated that higher CETP expression is associated with lower HDL levels, which has been previously observed extensively within humans and has become a potential drug target for preventing atherosclerosis [35,[43][44][45][46]. Directions of effect revealed by PrediXcan can help elucidate genetic pathways of metabolism or potential drug targets [11]. Our significance threshold was stringent because we used the Bonferroni correction over all associations, but many of the eQTLs between genes and tissues, especially related tissues, are the same or in linkage disequilibrium [31]. PSRC1 and SORT1 are linked and were both highly significant in our analyses. Knowing liver is the key tissue involved in lipid metabolism may help us prioritize the slightly more significant SORT1 liver association over PSRC1, even though PSRC1 associations were more significant in 9 other tissues (Table 3). However, because liver expression between the genes is highly correlated (Pearson R = 0.75) [47], conditional analyses cannot help us distinguish between one or both genes contributing to lipid metabolism. In one functional study, overexpression and knockdown of Sort1 in mice altered cholesterol levels, while perturbation of Psrc1 did not [48], suggesting SORT1 is the more likely causal gene [47]. However, subsequent functional studies also implicate PSRC1 in cholesterol transport in apoE -/mice [41]. Thus, additional evidence is often necessary to identify the most likely causal gene or genes, due to potential confounding PrediXcan results from linkage and co-regulation of gene expression [47].
Increased CCL22 predicted expression significantly associated with lower HDL levels, showed evidence of colocalization, and the gene is located 375 kb downstream from CETP (Tables 2 and 3). It is involved in immunoregulatory and inflammatory processes, and variants within it have been implicated in multiple sclerosis and lupus [53,54]. Apoe knockout mice with a high cholesterol diet were found to have significantly higher CCL22 serum levels than similar mice on a regular diet, contributing to the progression of atherosclerosis [55]. A separate study in humans from the same group found increased CCL22 abundance associated with ischemic heart disease progression [56]. These observations correlate with our observed association of higher predicted CCL22 expression with lower HDL levels, as higher amounts of HDL might protect against atherosclerosis [57].

Genetically regulated gene expression underlies lipid traits in Hispanic cohorts
Increased ICAM1 predicted expression is significantly associated with lower HDL levels, it showed evidence of colocalization, and it is located 803 kb upstream of LDLR, a prominent lipid gene. It produces ICAM-1, an intercellular adhesion molecule. In male rats, a high cholesterol diet was found to significantly influence ICAM-1 molecule levels in the aorta, signifying a correlation between a cholesterol diet and ICAM-1 expression, and ICAM-1 molecule levels were negatively correlated with baseline HDL levels [58]. Another study found that HDL suppresses expression of ICAM1 through transfer of microRNA-223 to endothelial cells, which correlated with our observed association [59].
Increased GSTM1 predicted expression is significantly associated with lower CHOL and LDL levels and is located 378 kb downstream from SORT1 (Table 2). A study in north India found no significant difference in any lipid phenotype between individuals with GSTM1(-) and GSTM1(+), but did find that the GSTM1(-) genotype had a 2-fold increased risk of developing coronary artery disease in the cohort [60]. This gene has also been studied for association with atherosclerosis, and frequency of atherosclerosis in a GSTM1 polymorphic group were found to be 1.2 times higher than those in the control group in a study from a Brazilian cohort [61].
In many studies, participants are asked to self-identify under a given label, such as the four race/ethnicity classifications used in MESA: "White, Caucasian", "Chinese American", "Black, African-American", and "Hispanic". These self-identifications are often not indicative of genetic ancestry, especially in admixed populations such as African-Americans and Hispanics due to a complex population history [32]. For example, even within self-identified regions in Latin America, populations can be heavily stratified (Fig 1 and S1 Fig). Our group recently created the first multi-ethnic transcriptomic prediction models for PrediXcan and observed that predictive performance was improved in cohorts of similar ancestry [14]. In our current analysis, we found PSRC1 and CETP to be significantly associated with multiple phenotypes in MESA models, and these genes have been previously extensively studied in association with lipids [49]. The majority of HCHS/SoL results replicated in both GLGC and MESA even though there is a 50-fold difference in sample size ( Table 2).
Recent studies have tested the portability of PrediXcan models across populations, notably between African and European populations. When predicting gene expression using GTEx models, the Yoruba population from West Africa had poorer accuracy compared to several European-ancestry populations tested [62]. Another study found similar results as both simulated and real African-American populations had poorer correlations of predicted versus observed gene expression than European populations [63]. Accuracy varied with both model population sample size and ancestry [63]. Both studies emphasized that a shared genetic architecture and ancestry between test and reference populations is imperative for accurate gene expression prediction and that further efforts in collection and creating multi-ethnic models are needed [62,63].
In 2012, Stranger et al. observed significant sharing of eQTL effect sizes between Asians, European-admixed, and African subpopulations and suggested that the driving force behind the discovery of an eQTL in one population but not another is mainly due to allele frequency differences and not due to differences in absolute effect size [64]. Indeed, we recently showed differences in gene expression predictive performance are due to allele frequency differences between populations [14]. In MESA, we also showed that genetic correlation (rG) of eQTLs depends on shared ancestry proportions, ranging from rG = 0.46 between AFA and CAU to rG = 0.62 between HIS and CAU [14]. Thus, each population has shared and unique eQTL effects. Other than population-specific SNPs present in the MESA monocyte prediction models, SNPs that are both rare in European populations and common in Hispanic populations are not present in our analyses here and we continue to seek larger and more diverse gene expression data to improve inclusion of population-specific effects.
Another recent study showed that incorporating the use of local ancestry can improve eQTL mapping and gene expression prediction [65]. Future studies of lipid traits in larger Hispanic populations may include testing local ancestry portions individually in this population to see if less-studied ancestries, such as Native American and West African, are associated with SNPs and genes that have not been previously observed in European-majority cohorts. In our admixture mapping analysis to determine if significant GWAS signals originate from tracts of African or Native American ancestry within HCHS/SoL, we did not find any significant results, possibly due to the small non-admixed Native American reference cohort (n = 27) we used from the 1000 Genomes Project. This cohort is mainly Peruvian, which is likely not the best reference panel for the Native American component of most individuals in HCHS/SoL [66]. To have a higher power in these analyses, more samples, especially from Native American genomes and transcriptomes, are needed to integrate local ancestry into gene expression studies. By combining information from both local ancestry mapping and eQTL studies in non-European populations, we may better characterize and predict gene expression in admixed populations.
There is a dearth of diversity of genetic studies that greatly impacts the ability to apply the results of genetic studies to non-European populations. With many biobank-sized resources only including European data, this gap continues to grow [6]. Without data collection and proper models for non-European populations, there is less potential for accurate implementation of precision medicine. To fully characterize the impact of genetic variation between and within populations and allow all individuals to benefit from biomedical research, we must expand genetic studies to include individuals of diverse ancestries.

Materials and methods
This work was approved by the Loyola University Chicago Institutional Review Board (project number 2014).

Phenotypic and genomic data
The analyses in HCHS/SoL and MESA were performed with whole genome genotypes from the database of Genotypes and Phenotypes (Table 4) [67]. The HCHS/SoL cohort is composed of self-identified Hispanic adults  recruited from four urban centers in the US: Chicago, IL; Miami, FL; the Bronx, NY; and San Diego, CA, using a previously described householdbased sampling technique with written consent of the individual in their preferred language with data subsequently de-identified [7]. Within the collected lipid phenotypes, we rank-normalized CHOL, HDL, TRIG, and LDL to correct the skewing in the raw data. The majority of Within the MESA cohort, individuals were recruited from six urban centers in the US: Baltimore, MD; Chicago, IL; Forsyth County, NC; Los Angeles County, CA; northern Manhattan, NY; and St. Paul, MN. Genotype data were collected from blood samples using an Affymetrix 6.0 SNP array and phenotype data from Exam 1 were used. MESA individuals include those that self-identified as "Black, African-American" (AFA, n = 613), "White, Caucasian" (CAU = 2,243), and "Hispanic" (HIS, n = 999). Cohorts underwent quality control and imputation as previously performed by our group [14].

Quality control
For quality control, we merged the two HCHS/SoL permission groups from dbGaP and ran all processes in PLINK. All of the genotypes are in genome build GRCh37/hg19. We removed individuals with genotyping rate < 99%. Additionally, we removed SNPs with failed heterozygosity (outside ± 3 standard deviations from the mean), with failed Hardy-Weinberg equilibrium (P < 1×10 −6 ), and restricted analyses to SNPs on autosomes only. Unlike the typical quality control process for non-admixed populations, we did not remove principal component outliers nor did we remove individuals with >0.125 identity by descent in HCHS/SoL as this would have included >66% of the sample [68][69][70].

Imputation, relationship inference, and principal component calculation
For imputation, we used the University of Michigan Imputation Server with EAGLE2 phasing [71]. The reference panel was 1000 Genomes Phase 3, using all populations due to the multicontinental nature of the cohort [72]. We then filtered imputation results to SNPs with R 2 > 0.8 and minor allele frequency > 1% within PLINK. Quality control and imputation for MESA was performed by our group previously with the same parameters [14]. In MESA, each subpopulation was imputed separated and the post-imputation quality control procedures involved filtering by R 2 > 0.8 in each subpopulation, combining the imputations, filtering by MAF > 0.01, and keeping all SNPs with genotyping rate > 0.99.
Within this cohort, there are potentially confounding amounts of genetic substructure present [73] (Fig 1). In structured and admixed cohorts, such as Hispanic and African-American populations, software such as KING are robust to complex population structure for relatedness and principal component calculations. [16]. We calculated pedigrees, then inferred relationships across all individuals in the cohort into a relationship matrix, and also calculated the first 20 principal components using KING, and we used the first 5 principal components in the rest of the analyses as performed previously within this cohort and since PCs > 5 explained less than 5% of the variance each (Fig 1 and S2 Fig) [19,74].

Genome-wide association study
We used the software GEMMA for the GWAS portion of the study. We ran separate univariate linear mixed models for each of the four phenotypes and included the relationship matrix as random effects. For the covariates used as fixed effects within the model, we included self-identified region, use of lipid or other cardiovascular medication, and the first 5 of 20 principal components as previously conducted in this cohort [16]. P values presented were calculated using the Wald test [10]. We restricted the analyses to SNPs in PrediXcan models as this was the main focus of the analyses, analyzing in total 1,770,809 SNPs for each phenotype. We performed multi-SNP-based conditional and joint association analysis using GWAS data in GCTA-COJO, and we included the genotype data to use the linkage disequilibrium calculations in the actual genotypes with a collinearity cutoff of 0.9 and a standard genome-wide significance threshold of 5×10 −8 [20,21]. We report these results and not base GWAS results to emphasize independent SNPs.

Local ancestry inference
Since local ancestry inference for a population with three or more origins requires reference populations, we used Iberian from Spain (IBS) and Yoruba from Ibadan, Nigeria (YRI), both from 1000 Genomes Phase 3, as representatives for European and West African ancestry, respectively [72]. The European component of Hispanic populations has been previously identified as most similar to modern-day Iberian populations, and the African component has been identified as most similar to the Yoruba people [75].
In ADMIXTURE, we ran the 1000G American (AMR) populations with the Native American sequences, 1000G IBS, and 1000G YRI as reference populations, and kept 1000G AMR individuals with > 90% native American ancestry to use as a Native American reference panel for uniformity with 1000G (NAT), including 4 Mexicans from Los Angeles, USA and 23 Peruvians from Lima, Peru [76]. In total, our reference panel populations were IBS (n = 107), NAT (n = 27), and YRI (n = 108). We restricted analyses to chromosomes 16 and 19 due to their known importance in lipid traits.
To prepare our genotypes from local ancestry estimation, we used HAPI-UR to infer haplotypes with a maximum HMM window size of 64 [77]. We calculated local ancestry inference using RFMix in PopPhased mode, and due to our unequal reference population sizes, following the recommendation to use a minimum node size of 5 and a window size of 0.025, with all other options run as default [29].

Admixture mapping
We converted the RFMix output to the three continental ancestry allelic estimations (NAT, IBS, and YRI). We tested each ancestry allele count for association with each phenotype to observe how local ancestry may be associated with each lipid trait using a univariate linear mixed model including relatedness as random effects and the first 5 principal components, lipid medications, and region as fixed effects [10]. Genome-wide admixture mapping significance was previously determined in HCHS/SoL at 5.7×10 −5 [27].

Imputed transcriptome-based association study
We used the software PrediXcan to produce predicted expression from genotype for both HCHS/SoL and MESA. The models we used include 44 tissue models from the GTEx V6 tissues with at least 70 individuals (85% European, 15% African-American) each, with an average of 5,179 genes predicted in each tissue [11,13,30]. We did not use GTEx V7 as the models only include individuals of European ancestry. We filtered the GTEx V6 results by removing red-and yellow-flagged false positive gene-tissue associations and our results only include gene-model combinations with green flags as described at http://predictdb.org/. The other set of PrediXcan models we used were made from monocyte gene expression in MESA [14]. Each of these five models had a training set of at least 233 individuals, including models of African-American (AFA), European (CAU), Hispanic (HIS), AFA and HIS (AFHI), and all populations combined (ALL). These models were retrieved from http://predictdb.org/, and all models in the analyses were filtered by protein coding genes, R 2 > 0.01 and predictive performance P < 0.05. Individuals used to train the AFHI, ALL, and HIS models are also included in the MESA lipid trait replication cohort [14].
After the 49 predicted expression files were made using PrediXcan, we ran them as a pseudogenotype in a univariate linear mixed model in GEMMA using the command -notsnp. This was performed in GEMMA rather than base PrediXcan as base PrediXcan does not have an option to include a relationship matrix, which is important in a heavily related and structured cohort like HCHS/SoL [10]. We again included the KING relationship matrix as random effects and the first five principal components, use of lipid medication, and self-identified region as fixed effects. P values presented were calculated using the Wald test.
Since we did not have the GLGC genotypes, we used the software S-PrediXcan, which is an extension of PrediXcan that takes GWAS summary statistics as input [13]. We ran S-PrediXcan on GLGC with all 44 GTEx tissue models and all 5 MESA monocyte models in different ethnicities. We considered significance in the discovery population, HCHS/SoL, as P < 3.1×10 −8 , 0.05/(all gene-model associations), and in the replication populations, MESA and GLGC, as P < 0.05 within the model. This significance threshold is conservative, since many tissues share eQTLs [31].

Colocalization analysis
We performed a colocalization analysis by applying the software COLOC to the lipid GWAS results and eQTL data from GTEx and MESA to determine whether eQTLs within gene prediction models and GWAS hits were shared [15]. We subset the COLOC input to only contain SNPs within the predictor models of genes from the PrediXcan analyses due to computational restraints. A higher P4 probability (P > 0.5) indicates likely colocalized signals between an eQTL and a GWAS hit, especially in well-predicted genes with a high R 2 value, while a high P3 probability indicates independent signals between an eQTL and a GWAS hit and a high P0, P1, or P2 indicates an unknown association [13]. Analyses were run using scripts from the S-PrediXcan manuscript at https://github.com/hakyimlab/summary-gwas-imputation/wiki/ Running-Coloc [13].
Supporting information S1 Table.