Trans-ethnic predicted expression genome-wide association analysis identifies a gene for estrogen receptor-negative breast cancer

Genome-wide association studies (GWAS) have identified more than 90 susceptibility loci for breast cancer, but the underlying biology of those associations needs to be further elucidated. More genetic factors for breast cancer are yet to be identified, but sample size constraints preclude the identification of individual genetic variants with weak effects using traditional GWAS methods. To address this challenge, we utilized a gene-level expression-based method, implemented in the MetaXcan software, to predict gene expression levels for 11,536 genes using expression quantitative trait loci and examine the genetically-predicted expression of specific genes for association with overall breast cancer risk and estrogen receptor (ER)-negative breast cancer risk. Using GWAS datasets from a Challenge launched by the National Cancer Institute, we identified TP53INP2 (tumor protein p53-inducible nuclear protein 2) at 20q11.22 to be significantly associated with ER-negative breast cancer (Z = -5.013, p = 5.35×10−7, Bonferroni threshold = 4.33×10−6). The association was consistent across four GWAS datasets, representing European, African and Asian ancestry populations. There are six single nucleotide polymorphisms (SNPs) included in the prediction of TP53INP2 expression, and five of them were associated with ER-negative breast cancer, although none of the SNP-level associations reached genome-wide significance. We conducted a replication study using a dataset outside of the Challenge, and found the association between TP53INP2 and ER-negative breast cancer was significant (p = 5.07×10−3). Expression of HP (16q22.2) showed a suggestive association with ER-negative breast cancer in the discovery phase (Z = 4.30, p = 1.70×10−5), although the association was not significant after Bonferroni adjustment. Of the 249 genes located within 250 kb of known breast cancer susceptibility loci identified from previous GWAS, 20 genes (8.0%) were significantly associated with ER-negative breast cancer (p<0.05), compared to 582 (5.2%) of 11,287 genes that are not close to previous GWAS loci. This study demonstrated that expression-based gene mapping is a promising approach for identifying cancer susceptibility genes.

Women of African ancestry are more likely to be diagnosed with ER-negative breast cancer compared to women of non-African ancestry [21][22][23]. To date, breast cancer GWAS have been conducted primarily in populations of European ancestry. The difference in linkage disequilibrium (LD) patterns and allele frequencies across ancestry groups may explain the apparent inconsistencies in GWAS findings from studies of women of European ancestry as compared to studies of women of African ancestry [24,25]. The strength and the direction of the association between causal variants and disease are expected to be consistent across populations, and thus cross-population validation provides further evidence of causation. In addition, trans-ancestry analysis could identify novel breast cancer susceptibility variants [26].
The variants discovered by previous GWAS along with previously known high-penetrance genes explain only a modest proportion of the heritability of breast cancer [2]. More genetic factors for breast cancer are yet to be identified, but power for discovery of new loci is limited by the sample size of existing GWASs. Moreover, the biologic significance of the variants identified by GWAS, and the genes on which they act, are often unknown. Single nucleotide polymorphisms (SNPs) associated with disease traits are more likely to be expression quantitative trait loci (eQTLs) [27], and regulatory variants can explain a large proportion of disease heritability [28]. Therefore, genes regulated by eQTLs can be used as an enrichment analysis unit to identify more genetic risk factors for breast cancer. Recently, gene-based approaches using eQTL information, such as PrediXcan, have been proposed, which can reduce the multiple testing burden in genome-wide analyses and have been used to identify novel genes for autoimmune diseases [29]. PrediXcan uses individual-level data to estimate the correlation between genetically predicted levels of gene expression and human traits to prioritize causal genes. MetaXcan computes the same correlation as PrediXcan, but does so using summary statistics from GWAS, which are much more readily accessible than individual-level data [30].
To identify novel genes involved in breast cancer susceptibility, we utilized a gene-level expression-based association method, implemented in the MetaXcan software [30], to infer gene expression levels using summary statistics from five GWASs. We used an additive prediction model of gene-expression levels trained in Depression Genes and Network (DGN) data [31] and examined the predicted expression of specific genes for association with overall breast cancer risk and estrogen receptor-negative breast cancer risk. The GWAS datasets were made available in dbGaP (https://www.ncbi.nlm.nih.gov/gap) through "Up For A Challenge (U4C)-Stimulating Innovation in Breast Cancer Genetic Epidemiology" launched by the National Cancer Institute. The DGN data included RNA sequencing data from whole blood of 922 genotyped individuals (463 cases of major depressive disorder and 459 controls), all of European ancestry. These individuals consisted of 274 males and 648 females, with ages ranging from 21 to 60.
Cancer Genetic Markers of Susceptibility Breast Cancer (CGEMS) data was provided via dbGaP project phs000147.v3.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000147.v3.p1). Genome-Wide Association Study of Breast Cancer in the African Diaspora-the ROOT study was provided via dbGaP project phs000383.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000383.v1.p1). Shanghai Breast Cancer Genetic Study (SBCGS) data was provided via dbGaP project phs000799.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000799.v1.p1).

Results
Using logistic regression, we first conducted SNP-level GWAS analysis for overall breast cancer risk among 8605 breast cancer cases and 8095 controls, and for ER-negative breast cancer risk among 3879 cases and 10213 controls. The analyses were performed for each of the five GWAS datasets separately, and summary statistics including log odds ratios and standard errors were generated. These summary statistics for each dataset were input to the software MetaXcan [30] to perform genome-wide gene-level expression association tests for 11,536 genes. Then, we performed meta-analysis of the results from individual MetaXcan analyses. Quantile-quantile plots of P-values from the meta-analysis showed little inflation (Fig 1). For overall breast cancer risk, there was no gene with a P-value that deviated from the null distribution (Fig 1A), but for ER-negative breast cancer risk analysis, there were several genes with P-values smaller than expected, including TP53INP2, HP, and DHODH (Fig 1B). Table 1 lists the top genes with P-values less than 10−3 in the analyses of association between predicted gene expression and overall breast cancer risk. The sign of the Z score indicates the direction of association between genetically-predicted expression and breast cancer risk. None of the genes reached genome-wide significance when a Bonferroni threshold (α = 4.33×10−6) was used.
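As a quick check on the threshold quoted above, the Bonferroni cutoff for 11,536 gene-level tests and the two-sided P-value implied by a Z-score can be reproduced in a few lines of Python. This is a minimal sketch that assumes a family-wise α of 0.05 and a standard normal null distribution; the function names are illustrative and are not part of MetaXcan. The Z-scores are those reported for TP53INP2 and HP in this study.

```python
import math

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def two_sided_p(z):
    """Two-sided P-value for a Z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

threshold = bonferroni_threshold(11536)   # about 4.33e-6
p_tp53inp2 = two_sided_p(-5.013)          # below the threshold
p_hp = two_sided_p(4.30)                  # above the threshold

print(f"threshold={threshold:.2e} TP53INP2={p_tp53inp2:.2e} HP={p_hp:.2e}")
```

Small differences from the published P-values are expected because the Z-scores are rounded.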
Of the 249 genes located within 250 kb of known susceptibility loci identified from previous breast cancer GWAS [2,3,4,17,32], 12 genes (4.8%) were significantly associated with overall breast cancer risk at the nominal significance level of 0.05, compared to 497 (4.4%) of 11,287 genes that are not close to previous GWAS loci (P for enrichment = 0.75).
Table 2 lists the genes with P-values less than 10−3 in the ER-negative breast cancer analysis. TP53INP2 was the top gene (P = 5.35×10−7), which surpassed the Bonferroni-corrected P-value threshold (α = 4.33×10−6). The false discovery rate for TP53INP2 was 0.0062. Higher genetically-predicted TP53INP2 expression was associated with lower risk of ER-negative breast cancer. The gene with the second smallest P-value was HP, which had a P-value of 1.70×10−5, close to but not significant after Bonferroni correction. The false discovery rate for the HP gene was 0.098. For the HP gene, higher expression was associated with higher risk of ER-negative breast cancer. Both genes are novel: no previous studies have found an association between these two genes and breast cancer risk.
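The false discovery rates quoted above are consistent with the Benjamini and Hochberg step-up adjustment, in which the i-th smallest of m P-values is scaled by m/i. The sketch below is a plain-Python illustration rather than the authors' code; the helper name `bh_adjust` and the toy P-value list are hypothetical. For the two smallest P-values among 11,536 tests, the scaling alone gives values close to the reported 0.0062 and 0.098, assuming the monotonicity step does not lower them further.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy list (hypothetical values) to show the call.
toy = bh_adjust([0.01, 0.04, 0.03, 0.20])

# Rank-based scaling for the two top genes among 11,536 tests.
m_genes = 11536
fdr_tp53inp2 = 5.35e-7 * m_genes / 1   # rank 1
fdr_hp = 1.70e-5 * m_genes / 2         # rank 2

print(toy, round(fdr_tp53inp2, 4), round(fdr_hp, 3))
```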
Of the 249 genes located within 250 kb of known breast cancer susceptibility loci identified from previous GWAS, 20 genes (8.0%) were significantly associated with ER-negative breast cancer (p<0.05), compared to 582 (5.2%) of 11,287 genes that are not close to previous GWAS loci (P for enrichment = 0.044), suggesting a moderate enrichment for genes close to known susceptibility loci.
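The enrichment comparison can be framed as a 2×2 table of genes (near versus far from known loci, nominally significant versus not). The text does not state which test produced the enrichment P-values, so the following sketch assumes a Pearson chi-square test with one degree of freedom and no continuity correction, which gives values close to those reported; the function name is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# ER-negative risk: 20 of 249 genes near known loci vs 582 of 11,287 elsewhere.
chi2_er_neg, p_er_neg = chi_square_2x2(20, 249 - 20, 582, 11287 - 582)

# Overall risk: 12 of 249 genes near known loci vs 497 of 11,287 elsewhere.
chi2_overall, p_overall = chi_square_2x2(12, 249 - 12, 497, 11287 - 497)

print(f"ER-negative p={p_er_neg:.3f}, overall p={p_overall:.2f}")
```

A Fisher exact test would be a reasonable alternative for the same table and may give slightly different P-values.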
There were six SNPs included in the prediction of the expression of the TP53INP2 gene, from 367 kb upstream to 159 kb downstream of the gene (Table 3). Five of the six SNPs (all except rs8116198) were associated with overall breast cancer risk and ER-negative breast cancer risk (at the nominal level of α = 0.05), and the effects were consistent across studies (none of the heterogeneity tests were significant). These associations were more significant for ER-negative breast cancer risk (P-values ranging from 5.0×10−4 to 1.8×10−6) than for overall breast cancer risk (7.0×10−4 to 1.4×10−4). None of the SNP-level associations reached traditional genome-wide significance; thus, they have not been reported in previous GWAS publications. However, our study showed the aggregate effects of these SNPs were significantly associated with ER-negative breast cancer after Bonferroni correction. We noticed that one of the six SNPs, rs8116198, is monomorphic in the SBCGS data. Therefore, when MetaXcan was applied to the SBCGS data, the prediction of TP53INP2 expression was based on only five SNPs. To make our results more robust to missing and low quality genotypes, in the DGN prediction model, we used elastic net with 0.5 as the mixing parameter, which sets the degree of mixing between ridge regression and LASSO. In addition, the SNPs in the prediction were not necessarily causal but could be in LD with the causal SNPs. Interestingly, there are several other genes in this region that were associated with ER-negative breast cancer, including MAP1LC3A, ITCH, and TRPC4AP (Fig 2 and Table 2). The six SNPs are located either in enhancer elements or in promoter regions (Table 4). The promoter/enhancer features of four SNPs were found in human mammary epithelial cells (HMEC) and breast variant human mammary epithelial cells (HMEC.35), and the enrichment was statistically significant for both cell types (both p<0.03).
There were 20 SNPs included in the prediction of the expression of the HP gene (S1 Table). Thirteen of the 20 SNPs were associated with overall breast cancer risk and 17 were associated with the risk of ER-negative breast cancer (at the nominal level of α = 0.05), quite consistently across populations (none of the heterogeneity tests were significant). The associations were stronger for ER-negative breast cancer risk than for overall breast cancer risk. As with TP53INP2, none of the associations for individual SNPs reached genome-wide significance; thus, they have not been reported in previous GWAS publications.
We used summary results from GAME-ON GWAS (http://gameon.dfci.harvard.edu) to replicate our study findings from the U4C. All six eQTLs for the TP53INP2 gene were available in GAME-ON (Table 5). The five SNPs that were associated with ER-negative breast cancer in the discovery phase (using U4C datasets) were all statistically significant in GAME-ON at the nominal 0.05 significance level. The gene-level test of TP53INP2 from MetaXcan gave a Z-score of -2.803 (p = 5.1×10−3) for ER-negative breast cancer in GAME-ON. The gene-level test for overall breast cancer risk was not significant in GAME-ON (Z-score = -1.627, p = 0.10). Because the GAME-ON ER-negative data included the BPC3 dataset, to show independent replication we tested the association in the U4C ER-negative data excluding BPC3, and found the Z-score for the TP53INP2 gene was -4.127 (p = 3.67×10−5).
For the HP gene, the direction of association for 19 SNPs (out of 20) was consistent between U4C and GAME-ON for ER-negative breast cancer risk, but only 2 SNPs were statistically significant at the nominal 0.05 level in GAME-ON (S2 Table). None of the SNPs were significantly associated with overall breast cancer risk in GAME-ON. In the gene-based analysis using GAME-ON data, the Z-score for overall breast cancer risk was 1.769 (p = 0.077) and the Z-score for ER-negative breast cancer risk was 2.02 (p = 0.043). In addition, we tested this association in the U4C ER-negative data excluding BPC3, and found the Z-score for the HP gene was 2.81 (p = 5.1×10−3).

Discussion
In this gene-level expression-based genome-wide association analysis of five breast cancer GWAS datasets composed of individuals of diverse ancestry, we identified TP53INP2 (20q11.22) as a gene with genetically-determined expression that is associated with ER-negative breast cancer. The gene-based analysis of aggregated eQTLs for a particular gene as an analysis unit can reduce the burden of multiple testing and provide a direction of association between expression of a specific gene and disease risk. We found that increased TP53INP2 expression in whole blood was associated with a decrease in ER-negative breast cancer risk. In addition, we identified the HP gene in the 16q22.2 region to have expression levels that are positively associated with ER-negative breast cancer.
The TP53INP2 gene (tumor protein p53-inducible nuclear protein 2) is 9150 base pairs long and codes for a 220 amino acid protein, which is a dual regulator of transcription and autophagy and is required for autophagosome formation and processing. One experimental study showed that overexpression of TP53INP2 severely attenuated the proliferative and invasive capacity of melanoma cells, via p53 signaling and lysosomal pathways [34]. This inverse correlation between TP53INP2 expression and cancer proliferation is consistent with our finding that TP53INP2 expression was inversely correlated with breast cancer risk. P53 is a transcription factor for TP53INP2, and TP53 plays an important role in the development of multiple cancers. Germline TP53 mutations cause Li-Fraumeni syndrome, characterized by a cluster of cancers including breast cancer [35]. Somatic TP53 mutation is a common event in ER-negative breast cancer [36]. As a downstream gene of p53, TP53INP2 may affect breast cancer risk through the p53 signaling pathway. Also known as DOR (diabetes- and obesity-regulated gene), TP53INP2 has been linked to obesity and diabetes [37]. TP53INP2 is also associated with triglyceride and cholesterol levels. One experimental study found that dietary fat content influenced TP53INP2 expression in adipose and muscle tissues of mice [38]. This gene has been proposed to serve as a diagnostic biomarker for papillary thyroid carcinoma [39], but no study has linked its expression to cancer risk. Obesity has been convincingly correlated with breast cancer risk in numerous studies, although the relationship is complex and involves additional modifying factors [40,41]. Obesity has been associated with excess risk for breast cancer among postmenopausal women [42][43][44][45][46], while in premenopausal women, obesity was associated with decreased breast cancer risk [40,43,47,48,49]. However, the underlying mechanisms for this association are still not fully understood. The identification of TP53INP2/DOR as a breast cancer-related gene could provide novel insight into the mechanism for the obesity-breast cancer relationship.
In the 20q11.22 region, several other genes including MAP1LC3A, ITCH, and TRPC4AP were associated with ER-negative breast cancer risk. MAP1LC3A codes for a protein that is important in the autophagy process, and was found to be expressed at a higher level in breast cancer tissues than in normal tissues [50]. E3 ubiquitin ligase ITCH plays a role in erythroid and lymphoid cell differentiation and immune response regulation, and ITCH was found to be important in the cross-talk between the Wnt and Hippo pathways in breast cancer development [51]. TRPC4AP is involved in Ca2+ signaling and is part of the ubiquitin ligase complex [52,53]. It is unclear which of these genes (or their interactions) play a role in breast cancer development, but the 20q11.22 locus is worthy of further investigation. Three of the six SNPs for TP53INP2 (rs6060047, rs11546155, and rs1205339) are also shared by the genes MAP1LC3A and TRPC4AP. It is possible that the associations in these three genes are partly generated by the overlapping SNPs, which contribute to predicted expression levels of the three genes and, possibly, to the enrichment observed at this locus.
The HP gene (16q22.2) is 6,491 base pairs long and codes for a 406 amino acid preprotein, the precursor of haptoglobin. Haptoglobin binds to hemoglobin to prevent iron loss during hemolysis. There are two allelic forms, Hp1 (83 residues) and Hp2 (142 residues), which determine 3 major phenotypes [54]. Haptoglobin genotype has been linked to cardiocerebral outcomes among diabetic patients [55]. A small study found that haptoglobin phenotypic polymorphism was associated with familial breast cancer [56], but no studies have reported on the relationship between SNPs in this gene and breast cancer risk. Larger studies could investigate the relationship between the major HP genotypes/phenotypes (HP1-1, HP1-2, and HP2-2) and breast cancer risk.
The present study has several strengths, including its large sample size, diverse ancestry groups, a cross-replication approach, and a novel gene expression-based analysis method. The gene-level analysis method can combine eQTL SNPs in a biologically informative way to assess relationships between predicted gene expression and disease risk. Compared to SNP-based analysis, the gene-based analysis can gain power by reducing the multiple testing burden by about 100-fold and by using external information on the correlation between gene expression and SNPs from reference samples. In addition, this approach enables the detection of individual SNPs with weak effects on disease risk by leveraging the combined effects of multiple SNPs on gene expression. For example, none of the SNPs for TP53INP2 reached traditional genome-wide significance, but their aggregated effect via TP53INP2 expression was genome-wide significant. The gene-based method (MetaXcan) that we employed is an extension of the gene expression-based method (PrediXcan) [29] and allows the use of SNP-level summary statistics without the need to access individual-level genotype data [30]. The MetaXcan method has been shown to reproduce PrediXcan results accurately, and it is robust to ancestry mismatches between studies and reference/training populations [30]. With this property, we were able to use summary statistics from the GAME-ON consortium for external replication.
Several limitations should be considered when interpreting the study findings. The gene expression-based association method relies on accurate prediction of gene transcript levels from genotypes, i.e. identification of eQTLs, but eQTL identification depends on the sample size of eQTL studies as well as tissue types. In the current study, we used the transcriptome prediction model that was developed using 922 RNA-seq samples from whole blood and genotype data [31]. Although it has been shown that models developed in whole blood were still useful for understanding diseases that affect other primary tissues [29], we expect there to be a loss of power when studying non-blood diseases using whole blood eQTL data. As a sensitivity analysis, we performed the MetaXcan analysis using the prediction model from breast tissues of 183 donors of multiple ethnicities (http://www.gtexportal.org). Only 4,308 genes had breast tissue-specific eQTLs, and no eQTL was available for TP53INP2, perhaps due to the small sample size. We found that DHODH (P = 3.61×10−5), ITCH (P = 1.23×10−4), and TRPC4AP (P = 7.7×10−4) were among the top genes associated with ER-negative breast cancer risk, and TRPC4AP (P = 1.68×10−5) and DHODH (P = 1.12×10−4) were among the top genes associated with overall breast cancer risk using breast tissue eQTLs. In the enrichment analysis, we found that 7 (8.2%) out of 85 genes that are close to known breast cancer susceptibility loci identified in previous GWAS were associated with ER-negative breast cancer and 6 (7.1%) genes were associated with overall breast cancer risk; by contrast, of the 4223 genes away from previous GWAS loci, 199 (4.7%) genes were associated with ER-negative breast cancer and 212 (5.0%) genes were associated with overall breast cancer risk. Here, we have to consider the balance between tissue relevance and sample size in eQTL studies. Further investigations based on large, reliable eQTL datasets are desirable. In future studies, we will seek out larger samples of multi-ethnic breast tissue as training data to construct improved prediction models of gene expression and further investigate trans-ethnic associations for breast cancer.
In conclusion, our study identified TP53INP2 and several other genes in the 20q11.22 region as potential susceptibility genes for ER-negative breast cancer using a novel gene-based analysis method that incorporates genetically determined gene expression. We demonstrated that this gene-based method increases statistical power and may be helpful in searching for causal variants. Future studies need to determine whether the TP53INP2 gene is a true susceptibility gene for breast cancer and what the underlying mechanisms are for its association with ER-negative breast cancer.

Study samples
The study was approved by the Institutional Review Board of the University of Chicago. The Epidemiology and Genomic Research Program within the National Cancer Institute launched a Challenge at the end of 2015 to inspire novel cross-disciplinary approaches to more fully decipher the genomic basis of breast cancer, called "Up For A Challenge (U4C)-Stimulating Innovation in Breast Cancer Genetic Epidemiology". Several data sets were gathered and made available for use in dbGaP (https://www.ncbi.nlm.nih.gov/gap). Our study has two phases; the discovery phase included five U4C GWAS datasets (Table 6). Here, we refer to them collectively as "U4C" data. These data were collected from three distinct ancestry groups. The BPC3 [16,18] and CGEMS [15,20] studies were conducted in women of European ancestry. The ROOT [17] and AABC [57] studies consisted of women of African ancestry. The SBCGS study was conducted in a Chinese population [19]. For the analysis of overall breast cancer risk, we used four GWAS datasets: AABC, CGEMS, ROOT, and SBCGS. For the analysis of ER-negative breast cancer risk, we used datasets from AABC, BPC3, ROOT, and SBCGS. All these dbGaP datasets included imputed genotype data that were inferred based on reference haplotypes from the 1000 Genomes project.
In the replication phase, we used the summary results from the meta-analysis of 11 breast cancer GWASs in the GAME-ON consortium (http://gameon.dfci.harvard.edu). All participants were of European ancestry. The overall breast cancer analysis included 16,003 cases and 41,335 controls from 11 GWAS studies; the ER-negative breast cancer analysis included 4939 cases and 13128 controls from 7 GWAS studies. The dataset from one study (BPC3; all ER-negative cases) in the GAME-ON consortium was also included in the U4C datasets. Because only meta-analysis results were available from GAME-ON, we removed the BPC3 data from the "U4C" dataset when we compared replication performance to avoid duplicate counting.

Statistical analysis
Our gene-level expression-based association analysis consists of three main steps. First, we conducted SNP-level genome-wide association tests and calculated summary statistics such as log odds ratios and their standard errors. We used a logistic regression model adjusting for eigenvectors from the principal component analysis and related covariates such as age. Genotypes were coded under an additive genetic model. Eigenvectors in principal component analysis were calculated using the method smartPCA, which is implemented in the software EIGENSOFT version 6.0.1 [58]. For the ROOT dataset, we adjusted for age, study sites, and the top 4 eigenvectors. For the AABC dataset, we adjusted for age, study sites, and the top 10 eigenvectors. For CGEMS and SBCGS, we adjusted for age and the top three or two eigenvectors, respectively. The number of eigenvectors we adjusted for was chosen according to published papers from these GWASs [17,57], as well as their association with case-control status. The logistic regression models were fit using the software Mach2dat (http://www.unc.edu/~yunmli/software.html) or SNPtest [59], depending on the format of the datasets; Mach2dat was used for CGEMS and SBCGS, and SNPtest was used for ROOT and AABC. For BPC3, the GWAS summary statistics for ER-negative breast cancer had been pre-calculated in the dbGaP release, so we used them directly.
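To make the first step concrete, the sketch below fits a logistic regression of case status on an additively coded genotype and returns the per-allele log odds ratio and its standard error, the two summary statistics passed on to the next step. It is a simplified illustration only: it omits the covariates (age, study site, eigenvectors) and the imputed-dosage handling that Mach2dat and SNPtest provide, the function names are hypothetical, and the toy data are made up.

```python
import math

def _score_and_information(b0, b1, genotypes, status):
    """Score vector and information matrix entries at the current coefficients."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(genotypes, status):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11

def fit_logistic(genotypes, status, n_iter=25):
    """Newton-Raphson fit of logit P(case) = b0 + b1 * genotype.

    Returns (log odds ratio per allele, standard error).
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, genotypes, status)
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = _score_and_information(b0, b1, genotypes, status)
    det = h00 * h11 - h01 * h01
    se_b1 = math.sqrt(h00 / det)  # diagonal entry of the inverse information
    return b1, se_b1

# Hypothetical data: 50 cases (30 carry one risk allele), 50 controls (20 carry one).
genotypes = [1] * 30 + [0] * 20 + [1] * 20 + [0] * 30
status = [1] * 50 + [0] * 50
log_or, se = fit_logistic(genotypes, status)
print(f"log OR={log_or:.3f}, SE={se:.3f}, Z={log_or / se:.2f}")
```

With genotypes limited to 0 and 1, the fitted log odds ratio equals the log of the 2×2 odds ratio, which gives an easy way to check the fit.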
Second, we applied the gene-level association method, MetaXcan [30] (https://github.com/hakyimlab/MetaXcan), to each of the datasets listed in Table 6. MetaXcan is an extension of the method PrediXcan [29], which uses an additive genetic model to estimate the component of gene expression determined by an individual's genetic profile and then identifies likely causal genes by computing the correlations between genetically predicted gene expression levels and disease phenotypes. MetaXcan infers the results of PrediXcan using summary statistics from GWAS, which are much more readily accessible than individual-level data. In our study, as input files for MetaXcan, we used summary statistics from the SNP-based analysis of each dataset obtained in step one. In addition, we used the whole blood genetic prediction model of transcriptome levels trained in the DGN data [31], which can be downloaded from http://predictdb.hakyimlab.org (https://s3.amazonaws.com/predictdb/DGN-HapMap-2015/). The DGN data provide a large reference sample of 922 individuals with both genome-wide genotype data and RNA sequencing data. The model trained in the DGN data can be useful in estimating gene expression levels and has been successfully applied to the Wellcome Trust Case Control Consortium (WTCCC) data in identifying genes associated with five complex diseases [29]. The DGN prediction model includes a) weights for predicting gene expression using genotypes and b) covariance of the SNPs that takes into account linkage disequilibrium. We tested the association between predicted expression levels of 11,536 genes and each of the two phenotypes, overall and ER-negative breast cancer risk, using the MetaXcan software. To construct the prediction model of expression levels using the DGN data, SNPs with minor allele frequencies (MAFs) >0.05 were used. When MetaXcan was applied to the breast cancer GWAS data, only SNPs with MAFs >0.05 were used. We also looked up genes within 250 kb of the 93 breast cancer susceptibility loci identified in previous GWAS [2,3,4,17,32].
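The way the prediction weights and SNP covariance combine with GWAS summary statistics can be sketched as follows. Based on the description in the MetaXcan publication [30], the gene-level Z-score is approximately the sum over the gene's SNPs of weight × (SNP standard deviation / standard deviation of predicted expression) × (GWAS effect / standard error), where the variance of predicted expression is the quadratic form of the weights with the SNP covariance matrix. The function below is a plain-Python illustration under that assumption, not the MetaXcan implementation, and the two-SNP inputs are hypothetical.

```python
import math

def metaxcan_z(weights, snp_cov, betas, ses):
    """Approximate gene-level Z-score from SNP-level summary statistics.

    weights : prediction weight of each SNP for the gene
    snp_cov : SNP covariance matrix from the reference panel (list of lists)
    betas, ses : GWAS effect sizes (log odds ratios) and their standard errors
    """
    n = len(weights)
    # Variance of predicted expression: w' * Gamma * w.
    gene_var = sum(weights[i] * snp_cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    gene_sd = math.sqrt(gene_var)
    z = 0.0
    for i in range(n):
        snp_sd = math.sqrt(snp_cov[i][i])
        z += weights[i] * (snp_sd / gene_sd) * (betas[i] / ses[i])
    return z

# Hypothetical two-SNP gene; every number here is made up for illustration.
weights = [0.30, -0.10]
snp_cov = [[0.42, 0.05],
           [0.05, 0.18]]
z_gene = metaxcan_z(weights, snp_cov, betas=[-0.08, 0.05], ses=[0.02, 0.03])
print(f"gene-level Z = {z_gene:.2f}")
```

Under this formulation, a SNP that is unavailable in a dataset (such as one that is monomorphic in that population) is left out of the sum, so the gene-level statistic rests on the remaining SNPs.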
Third, we conducted meta-analysis to combine results from MetaXcan analyses for different datasets. The method described by Willer et al. with sample size as meta-analysis weight [60] was used. We also conducted SNP-level meta-analysis using a fixed effect model, as implemented in the software METAL (http://genome.sph.umich.edu/wiki/METAL). False discovery rates were calculated using the Benjamini and Hochberg method [61].
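The two meta-analysis schemes mentioned here can be written compactly. The first function combines per-dataset Z-scores with weights equal to the square root of the sample size, following the sample-size-based scheme described by Willer et al. [60]; the second is a standard inverse-variance fixed-effect combination of effect sizes. This is an illustrative sketch with hypothetical function names and made-up inputs; it assumes total sample size as the weight, whereas an effective sample size may be preferred for unbalanced case-control studies.

```python
import math

def sample_size_weighted_z(z_scores, sample_sizes):
    """Combine Z-scores using weights w_i = sqrt(N_i)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

def inverse_variance_meta(betas, ses):
    """Fixed-effect estimate, its standard error, and Z-score."""
    inv_var = [1.0 / (se * se) for se in ses]
    beta = sum(b * w for b, w in zip(betas, inv_var)) / sum(inv_var)
    se = math.sqrt(1.0 / sum(inv_var))
    return beta, se, beta / se

# Hypothetical gene-level Z-scores from four datasets of different sizes.
z_meta = sample_size_weighted_z([-2.1, -2.8, -1.9, -3.0], [3000, 4000, 2500, 5000])

# Hypothetical SNP-level log odds ratios and standard errors from two datasets.
beta_meta, se_meta, z_snp = inverse_variance_meta([0.10, 0.30], [0.10, 0.10])
print(f"gene Z={z_meta:.2f}, SNP beta={beta_meta:.2f} (SE {se_meta:.3f})")
```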
For genes identified in the discovery phase using the U4C datasets, we conducted replication analysis with GAME-ON summary results using the same methods described above. For each top variant and gene identified in this study, we used HaploReg [33] and the UCSC Genome Browser to explore functional annotations of noncoding variants. Chromatin states (promoters and enhancers), variant effects on regulatory motifs, and protein binding sites were assessed from available data from the ENCODE [62] and Roadmap Epigenomics Consortium [63]. Data from normal mammary epithelial cells (HMEC, MYO, vHMEC) were emphasized.
Supporting information S1 Cancer Institute, Division of Cancer Epidemiology and Genetics. The BPC3 data was provided via dbGaP project phs000812.v1.p1.
Genome-Wide Association Study of Breast Cancer in the African Diaspora-the ROOT study was provided via dbGaP project phs000383.v1.p1. The ROOT study was conducted by the University of Chicago and supported by the National Cancer Institute (R01-CA142996). This manuscript was not prepared in collaboration with investigators of the ROOT GWAS and does not necessarily reflect the opinions or views of the University of Chicago, or NCI.
Shanghai Breast Cancer Genetic Study (SBCGS) data was provided via dbGaP project phs000799.v1.p1. The SBCGS includes the Shanghai Breast Cancer Study (R01CA64277) and Shanghai Women's Health Study (R37CA70867) that generated the data.