The authors have declared that no competing interests exist.
Conceived and designed the experiments: CAH BEH LLM DOS. Performed the experiments: LX XS LCP DVDB EC KP. Analyzed the data: YH YF YP CH CAH DOS. Contributed reagents/materials/analysis tools: LNK LLM BEH. Wrote the paper: CAH DOS YH YF.
Rare variation in protein coding sequence is poorly captured by GWAS arrays and has been hypothesized to contribute to disease heritability. Using the Illumina HumanExome SNP array, we successfully genotyped 191,032 common and rare non-synonymous, splice site, or nonsense variants in a multiethnic sample of 2,984 breast cancer cases, 4,376 prostate cancer cases, and 7,545 controls. In breast cancer, the strongest associations included either SNPs in or gene burden scores for genes
For breast and prostate cancer, GWAS have revealed many risk variants (>70 for each cancer as of this report). Taken together, the common variants in these regions explain only a minority of the familial risk of these cancers. Using the Illumina HumanExome SNP array, we explored the hypothesis of rare coding variation contributing to breast and prostate cancer risk in a sample of African American, Latino, Japanese, Native Hawaiian, and European American breast and prostate cancer cases and controls from the Multiethnic Cohort study. While only one association exceeded significance thresholds after correcting for multiple comparisons, a number of suggestive associations involving genes previously reported to be associated with a cancer-related phenotype were noted. Our results do not generally support a major role for protein-coding variants with odds ratios in the range that is probably required for protein-coding variation to play a truly outstanding role in risk heritability. If very rare and/or less penetrant coding variants underlie disease heritability of these cancers, then very large sample sizes (i.e., consortia) will be required for their discovery.
For most common diseases and traits the genetic basis underlying susceptibility has yet to be completely revealed. While genome-wide association studies (GWAS) have been remarkably successful in identifying common genetic variants associated with risk, the effect sizes of the risk alleles have been modest (relative risk, RR of 1.1–1.4) and in most cases, even in sum, they can explain only a fraction of familial risk or disease heritability. GWAS have relied almost exclusively on Illumina and Affymetrix SNP arrays, with SNP content selected primarily from HapMap to capture a large fraction of common variation in coding and non-coding regions in populations of European ancestry. The vast majority of alleles with frequencies <5%, and in particular those with frequencies ≤1%, have not been tested. This low allele frequency spectrum of genetic variation represents a very large fraction of all variation in the human genome. Thus, to date, a large fraction of genetic variation has yet to be explored with respect to disease etiology.
It is possible that the majority of less common (1–5%) and rare variants (<1%) will have weak effects, like the GWAS-identified common variants, and if this is the case then very large studies will be required for their discovery. An alternative hypothesis is that less common and rare variants convey larger relative risks than common variants, and indeed this assumption is required in order that rare variants contribute meaningfully to the understanding of inherited susceptibility. Such enhancement of effect sizes for rarer alleles may be especially relevant to rare coding variants given their dominant role in the etiology of “Mendelian” disorders (e.g. the OMIM database
To date, a lack of technology to survey the genome and accurately enumerate and test the variants in large numbers of samples has limited the exploration of less common and rare alleles. In the past year the Illumina Infinium HumanExome array (or “exome chip”) has been developed in collaboration with investigators who combined whole-exome sequencing conducted in >12,000 individuals of primarily European ancestry as well as in small numbers of other racial/ethnic minorities including African Americans, Hispanics, and Asians; the content on the array includes >200,000 putative functional exonic variants and is aimed to provide comprehensive testing on all non-synonymous variants above 0.1% frequency in Europeans. In the present study, we have utilized this array to test the hypothesis that there are less common and rare functional variants in the coding regions of genes that convey risk for breast and prostate cancer of greater magnitude than the common variants revealed through GWAS. We tested both single markers as well as gene summaries of the burden of rare alleles in multiethnic studies of invasive incident breast cancer and prostate cancer in the Multiethnic Cohort study (MEC: 3,141 breast cancer cases, 4,675 incident prostate cancer cases and 8,021 controls). In addition we conducted exploratory analyses of rare variants in relationship with several breast and prostate cancer-related traits ascertained at baseline in the entire MEC sample (n = 15,837).
The analysis included 217,601 putative functional variants (of 247,870 total markers listed on the array), predicted to alter the protein coding sequence, and which passed quality control procedures (see Methods). Of the 15,837 samples, 14,905 were included in the analysis (3,315 European Americans, 3,854 African Americans, 3,106 Latinos, 3,843 Japanese Americans and 787 Native Hawaiians; see Methods for exclusion criteria). A few mitochondrial SNPs were included on the array (n = 165 SNPs passing quality control) but are not discussed here (no associations with them were seen in the top ranked 1,000 associations for either breast or prostate cancer). The number of breast and prostate cancer cases and controls are shown in
Breast Cancer | n (Cases/Controls) | Age (mean(years)[sd]; Cases/Controls) | n ER+/n ER − (n (%)) |
All Groups | 2984/7545 | 67[8.8]/68[8.6] | 1688(56.6)/441(14.8) |
European Americans | 754/1682 | 66[8.8]/68[8.9] | 450(59.7)/95(12.6) |
African Americans | 591/2146 | 68[9.3]/69[8.4] | 311(52.6)/130(22.0) |
Latinos | 614/1302 | 67[8.2]/67[7.8] | 339(55.2)/112(18.2) |
Japanese Americans | 809/2012 | 66[8.6]/69[8.6] | 467(57.7)/84(10.4) |
Native Hawaiians | 216/403 | 64[8.3]/64[8.6] | 121(56.0)/20(9.3) |
Prostate Cancer | n (Cases/Controls) | Age (mean(years)[sd]; Cases/Controls) | n Advanced/n Non-advanced (n (%)) |
All Groups | 4376/7545 | 70[7.2]/68[8.6] | 499(11)/3666(84) |
European Americans | 879/1682 | 69[7.7]/68[8.9] | 100(11)/749(85) |
African Americans | 1117/2146 | 70[7.3]/69[8.4] | 116(10)/932(83) |
Latinos | 1190/1302 | 69[6.6]/67[7.8] | 145(12)/986(83) |
Japanese Americans | 1022/2012 | 72[7.4]/69[8.6] | 114(11)/863(84) |
Native Hawaiians | 168/403 | 69[6.7]/64[8.6] | 24(14)/136(81) |
In the pooled sample, 190,662 putative functional (NS, SP, or stop) SNPs had a minor allele frequency (MAF) <1% (56,759<0.01%; 85,897 between 0.01% and 0.1%, and 48,006 between 0.1% and 1%) (
Inspection of the distribution of the chi-square (score) tests from models for overall breast or prostate cancer showed evidence of over-dispersion of test statistics (genomic control lambda estimated at approximately 1.15 for breast and 1.20 for prostate cancer). However, when very rare SNPs were removed (MAF<0.1% overall), the Wald statistics appeared to be sampled from a central chi-square distribution (genomic control lambda = 1.00 for breast cancer and lambda = 1.05 for prostate cancer). In the gene burden analyses, the distribution of observed score tests showed mild evidence of over-dispersion (lambda = 1.04 for breast cancer and lambda = 1.06 for prostate cancer). When the single SNP analysis was restricted to estrogen receptor-negative (ER-) breast or advanced prostate cancer, where there were many more controls than cases included in each model, the behavior of the score test for the single SNP associations was problematic for rare SNPs. For such SNPs we followed up any apparently globally significant associations with exact logistic regression analysis, in order to reduce what appeared to be a proliferation of false positive signals.
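As a minimal sketch of how a genomic control lambda of the kind quoted above can be computed, the snippet below divides the median of a set of 1 d.f. chi-square statistics by the median of the central chi-square distribution (about 0.455). The function name and the example statistics are hypothetical; they are not taken from the study data or its analysis code.

```python
from statistics import median

# Median of a central chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Ratio of the observed median test statistic to the null median."""
    if not chi2_stats:
        raise ValueError("need at least one test statistic")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical statistics, for illustration only (not study data).
example_stats = [0.02, 0.10, 0.30, 0.4549364, 0.90, 1.70, 3.20]
print(f"lambda = {genomic_control_lambda(example_stats):.2f}")
```

A value near 1 suggests the statistics behave like draws from the null distribution; values well above 1 indicate over-dispersion.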
The total number of genes having at least one polymorphic functional variant genotyped and passed quality control varied slightly between breast (17,168 genes) and prostate cancer (17,203 genes) due to sampling (i.e. some variants were polymorphic only for breast cancer cases and so were not included in the prostate cancer analyses and vice versa).
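The gene-level significance threshold used later in the text (p<3×10−6) appears consistent with a Bonferroni correction over the number of genes tested. The short sketch below shows that arithmetic, assuming a family-wise alpha of 0.05, which is an assumption here rather than a value stated in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate

# Gene counts as reported for the breast and prostate cancer analyses.
for label, n_genes in [("breast", 17168), ("prostate", 17203)]:
    print(label, f"{bonferroni_threshold(ALPHA, n_genes):.2e}")
```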
In the ethnic-pooled breast cancer analyses (2,984 cases and 7,545 controls), the most significant predicted protein-altering variant was a rare SP variant rs145889899 at the splice donor site in the second intron of the gene
All Cases (n = 2,984) vs Controls (n = 7,545)
SNP ID | Chr | Position | rs# | A1/A2 | Type | Gene | OR | P | AAMAF | NHMAF | JAMAF | LAMAF | EAMAF |
exm61019 | 1 | 54476084 | rs145889899 | T/C | | | 3.74 | 2.50E-07 | 0.0065 | 0 | 0 | 0 | 0 |
exm1579798 | 21 | 46950811 | rs142899279 | A/C | | | 12.67 | 1.30E-06 | 0 | 0.0025 | 0 | 0 | 0.00030 |
exm841657 | 10 | 93668692 | NA | A/G | | | >> | 2.50E-06 | 0 | 0 | 0 | 0 | 0 |
exm952402 | 11 | 104878019 | rs45585331 | C/A | | | 9.69 | 2.50E-06 | 0.00093 | 0 | 0 | 0 | 0 |
exm533277 | 6 | 31918154 | rs149101394 | G/A | | | 2.90 | 3.50E-06 | 0.0075 | 0.0012 | 0.00025 | 0.00077 | 0.00089 |
exm510328 | 5 | 179285781 | NA | A/G | | | 11.61 | 4.20E-06 | 0.00070 | 0 | 0 | 0 | 0 |
exm1043849 | 12 | 121177159 | rs28940872 | T/C | | | 25.69 | 7.10E-06 | 0.00023 | 0 | 0 | 0 | 0 |
exm1013941 | 12 | 56998559 | NA | T/C | | | >> | 1.30E-05 | 0 | 0 | 0 | 0 | 0 |
exm1234521 | 16 | 30795481 | NA | G/C | | | >> | 1.30E-05 | 0 | 0 | 0 | 0 | 0 |
exm132287 | 1 | 186313108 | rs58030082 | T/C | | | 1.59 | 1.30E-05 | 0 | 0.021 | 0.056 | 0.00038 | 0 |
ER+ Cases (n = 1,688) vs Controls (n = 7,545)
SNP ID | Chr | Position | rs# | A1/A2 | Type | Gene | OR | P | AAMAF | NHMAF | JAMAF | LAMAF | EAMAF |
exm1573155 | 21 | 43529776 | NA | A/G | | | 7.28 | 9.80E-07 | 0.0021 | 0 | 0 | 0 | 0 |
exm85453 | 1 | 114380886 | rs138092829 | C/T | | | 12.12 | 1.60E-06 | 0.00093 | 0 | 0 | 0 | 0 |
exm1358833 | 17 | 74625633 | rs115756441 | T/C | | | 3.26 | 2.40E-06 | 0.010 | 0 | 0 | 0.0012 | 0.00030 |
exm1159729 | 15 | 48734008 | rs113577372 | A/C | | | 7.66 | 2.70E-06 | 0.0019 | 0 | 0 | 0 | 0 |
exm1093791 | 14 | 24792132 | rs74387312 | G/A | | | 13.00 | 2.90E-06 | 0.00023 | 0 | 0 | 0 | 0.00059 |
exm1621500 | 22 | 50682785 | rs139896192 | T/C | | | 10.16 | 3.70E-06 | 0.0012 | 0 | 0 | 0 | 0 |
exm900686 | 11 | 36119939 | rs147309219 | G/A | | | 11.10 | 3.80E-06 | 0.00093 | 0 | 0 | 0 | 0 |
exm551694 | 6 | 44274257 | rs139372744 | A/G | | | 4.71 | 4.20E-06 | 0.0044 | 0 | 0 | 0 | 0 |
exm178049 | 2 | 26418053 | rs137852769 | G/C | | | 11.73 | 5.20E-06 | 0.00047 | 0 | 0 | 0.00038 | 0 |
exm412038 | 4 | 88534411 | NA | A/G | | | 9.39 | 9.30E-06 | 0.0012 | 0 | 0 | 0 | 0 |
ER- Cases (n = 441) vs Controls (n = 7,545)
SNP ID | Chr | Position | rs# | A1/A2 | Type | Gene | OR | P | AAMAF | NHMAF | JAMAF | LAMAF | EAMAF |
exm1165463 | 15 | 59009800 | rs144893047 | T/C | | | 8.52 | 1.30E-09 | 0.0040 | 0 | 0 | 0 | 0 |
exm221867 | 2 | 113539232 | rs142134831 | T/G | | | 9.27 | 1.30E-08 | 0.0030 | 0 | 0 | 0.00038 | 0 |
exm220393 | 2 | 110959008 | rs145479679 | G/T | | | 7.40 | 8.40E-08 | 0.0047 | 0 | 0 | 0 | 0.00030 |
exm61019 | 1 | 54476084 | rs145889899 | T/C | SP | | 6.17 | 1.40E-07 | 0.0065 | 0 | 0 | 0 | 0 |
exm645918 | 7 | 101183198 | rs190166648 | T/C | | | 14.51 | 1.50E-07 | 0.00070 | 0 | 0 | 0 | 0.00059 |
exm492809 | 5 | 147505341 | NA | A/G | Arg962Lys | | 9.57 | 1.60E-07 | 0.0028 | 0 | 0 | 0 | 0 |
exm1253047 | 16 | 69727480 | rs145602190 | G/A | | | 5.39 | 1.70E-07 | 0.0021 | 0.0037 | 0 | 0.0035 | 0.0051 |
exm1292126 | 17 | 8273026 | rs78738842 | A/G | | | 6.42 | 2.70E-07 | 0 | 0 | 0.0092 | 0 | 0 |
exm741581 | 9 | 18928963 | rs150639454 | A/G | | | 11.10 | 2.70E-07 | 0.0021 | 0 | 0 | 0 | 0 |
exm2276233 | 12 | 15734702 | rs144347297 | A/C | | | 6.47 | 3.00E-07 | 0.0019 | 0 | 0 | 0.0050 | 0 |
SNP ID from HG19.
Position based on GRCh37.
A1 is minor allele based on the entire multiethnic sample and the tested allele, A2 is the reference allele.
Odds ratio per allele based on the pooled analysis adjusted for age and the first 10 principal components.
MAF is minor allele frequency in controls.
AA, African Americans; NH, Native Hawaiians; JA, Japanese Americans; LA, Latinos; EA, European Americans; SP, splice-site variant.
For ER- breast cancer (n = 441 cases) many associations (358) with very rare SNPs were nominally significant using the score test, but the p-values failed to stand up to further investigation using exact logistic regression (the exact p-values ranged from 3×10−5 to 0.21). The many small p-values apparently reflected overly liberal behavior of the score test when alleles are rare and when there are many more controls than cases. In order to reduce discussion of a large number of likely false positive tests, we consider in the subtype analyses only SNPs with at least 10 minor alleles seen over all cases and controls. With this restriction we found a total of ten globally significant SNPs (using the score test). However, p-values from exact logistic regression for these SNPs were again far less striking (ranging from 3×10−5 to 1.5×10−3).
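To illustrate why an asymptotic test can look far more significant than an exact one when alleles are rare and controls greatly outnumber cases, the sketch below compares a Pearson chi-square p-value with a two-sided Fisher exact p-value on a 2×2 carrier table. Fisher's test is used here as a simple, covariate-free stand-in for exact logistic regression; it is not the adjusted analysis described in the text. The carrier counts are hypothetical; only the case and control totals echo the ER- analysis.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Two-sided Fisher exact p-value for carriers vs. case/control status."""
    k_total = case_carriers + ctrl_carriers
    denom = comb(case_total + ctrl_total, k_total)

    def prob(k):
        # Hypergeometric probability of k carriers among the cases.
        return comb(case_total, k) * comb(ctrl_total, k_total - k) / denom

    p_obs = prob(case_carriers)
    lo = max(0, k_total - ctrl_total)
    hi = min(k_total, case_total)
    # Sum all tables that are no more likely than the observed one.
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def pearson_chi2_p(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Asymptotic 1 d.f. chi-square p-value; needs at least one carrier."""
    a, b = case_carriers, case_total - case_carriers
    c, d = ctrl_carriers, ctrl_total - ctrl_carriers
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.


# Hypothetical counts: 3 carriers among 441 cases, 3 among 7,545 controls.
asymptotic_p = pearson_chi2_p(3, 441, 3, 7545)
exact_p = fisher_exact_two_sided(3, 441, 3, 7545)
print(f"asymptotic p = {asymptotic_p:.2e}, exact p = {exact_p:.2e}")
```

With so few minor alleles the exact p-value is orders of magnitude larger than the asymptotic one, which matches the pattern described above.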
When restricted to estrogen receptor-positive (ER+) cases (n = 1,688) (and screening out SNPs with fewer than 10 minor alleles seen) the most significant coding SNP was a rare NS variant in
In ethnic-specific analyses of overall breast cancer only one additional SNP (in
| Gene | Chr | # of SNPs | OR | P |
| | | | | |
| | 12 | 8 | 1.14 | 0.0000497 |
| | 17 | 6 | 1.10 | 0.0000541 |
| | 11 | 4 | 0.88 | 0.000124 |
| | 1 | 9 | 1.14 | 0.00016 |
| | 6 | 14 | 1.10 | 0.000162 |
| | | | | |
| | 10 | 5 | 26.6 | 0.00000871 |
| | 1 | 10 | 1.63 | 0.0000209 |
| | 11 | 36 | 0.54 | 0.000147 |
| | 4 | 2 | 11.17 | 0.000162 |
| | 22 | 2 | 3.03 | 0.000182 |
| | | | | |
| | 3 | 3 | 1.76 | 0.0000325 |
| | 22 | 2 | 3.74 | 0.0000373 |
| | 3 | 20 | 0.93 | 0.0000505 |
| | 11 | 3 | 3.26 | 0.000257 |
| | 19 | 6 | 0.85 | 0.000266 |
| | | | | |
| | 10 | 5 | 35.35 | 0.000000621 |
| | 19 | 33 | 0.53 | 0.0000199 |
| | 22 | 2 | 3.74 | 0.0000373 |
| | 12 | 4 | 2.44 | 0.0000871 |
| | 15 | 13 | 2.02 | 0.000202 |
| | | | | |
| | 10 | 4 | 32.27 | 0.0000000000124 |
| | 6 | 2 | 36.51 | 0.000000000168 |
| | 12 | 8 | 1.37 | 0.0000204 |
| | 8 | 11 | 3.24 | 0.0000209 |
| | 1 | 3 | 5.65 | 0.0000636 |
| | | | | |
| | 10 | 4 | 32.27 | 0.0000000000124 |
| | 6 | 2 | 36.51 | 0.000000000168 |
| | 6 | 5 | 4.40 | 0.0000000146 |
| | 15 | 6 | 16.18 | 0.000000483 |
| | 15 | 11 | 2.63 | 0.000000533 |
For overall prostate cancer (4,376 cases and 7,545 controls) none of the single SNP associations with prostate cancer met the Bonferroni adjustment for multiple comparison testing (nominal p<3.9×10−7). The top two associations found for prostate cancer were for rare NS variants in
All Cases (n = 4,376) vs Controls (n = 7,545)
SNP ID | Chr | Position | rs# | A1/A2 | Type | Gene | OR | P | AAMAF | NHMAF | JAMAF | LAMAF | EAMAF |
exm514211 | 6 | 6266854 | rs140712764 | T/C | | | 28.007 | 9.1E-07 | 0.000233 | 0 | 0 | 0 | 0 |
exm199465 | 2 | 70052624 | rs146778617 | T/G | | | 4.523 | 6.0E-06 | 0.002563 | 0 | 0 | 0 | 0 |
exm574153 | 6 | 117130704 | rs2274911 | G/A | | | 0.875 | 1.3E-05 | 0.2379 | 0.2717 | 0.4332 | 0.2657 | 0.2542 |
exm68152 | 1 | 70896038 | rs145785987 | C/T | | | 9.011 | 3.1E-05 | 0.000699 | 0 | 0 | 0 | 0 |
exm971959 | 11 | 134128968 | NA | G/A | | | >999.999 | 3.2E-05 | 0 | 0 | 0 | 0 | 0 |
exm1105738 | 14 | 61180657 | rs3742636 | T/G | | | 1.125 | 4.0E-05 | 0.4256 | 0.2742 | 0.4688 | 0.2479 | 0.2912 |
exm1474666 | 19 | 43414890 | rs116433230 | A/C | | | 0.223 | 4.3E-05 | 0.01235 | 0 | 0 | 0.001536 | 0 |
exm506442 | 5 | 176637576 | rs28932178 | C/T | | | 0.871 | 6.1E-05 | 0.1051 | 0.4243 | 0.5249 | 0.217 | 0.1507 |
exm1507288 | 19 | 55527081 | rs2304167 | C/T | | | 0.879 | 7.0E-05 | 0.4653 | 0.2903 | 0.2239 | 0.1889 | 0.1819 |
exm1478994 | 19 | 45296806 | rs3208856 | T/C | | | 0.687 | 7.3E-05 | 0.04613 | 0.008685 | 0.000497 | 0.01882 | 0.04667 |
Advanced Cases (n = 499) vs Controls (n = 7,545)
SNP ID | Chr | Position | rs# | A1/A2 | Type | Gene | OR | P | AAMAF | NHMAF | JAMAF | LAMAF | EAMAF |
exm280349 | 2 | 239049718 | NA | A/G | | | 13.991 | 1.7E-09 | 0 | 0 | 0 | 0 | 0.002081 |
exm1488544 | 19 | 49376683 | rs45533432 | G/A | | | 4.677 | 1.2E-08 | 0.002097 | 0.002481 | 0 | 0.00384 | 0.009512 |
exm643590 | 7 | 100634145 | rs143984295 | A/G | | | 14.425 | 1.5E-08 | 0 | 0 | 0 | 0.000384 | 0.001486 |
exm701486 | 8 | 55540418 | rs114797722 | C/G | | | 13.409 | 2.0E-08 | 0.001631 | 0 | 0 | 0 | 0 |
exm782688 | 9 | 130224593 | rs150292099 | G/A | | | 10.488 | 3.5E-07 | 0.002097 | 0 | 0 | 0 | 0 |
exm2275251 | 17 | 58235051 | rs185658468 | T/A | SP | | 7.137 | 5.1E-07 | 0 | 0.002481 | 0.003976 | 0 | 0 |
exm1321007 | 17 | 39520119 | rs150620728 | T/C | | | 7.485 | 5.8E-07 | 0.002816 | 0 | 0 | 0.001152 | 0 |
exm942022 | 11 | 75439894 | rs141331999 | G/A | | | 8.489 | 9.4E-07 | 0.000932 | 0 | 0 | 0.000384 | 0.001784 |
exm594160 | 6 | 167343185 | rs35716361 | T/G | | | 8.129 | 1.1E-06 | 0.00303 | 0 | 0 | 0 | 0 |
exm1607984 | 22 | 38483189 | NA | G/A | | | 6.084 | 2.4E-06 | 0.001865 | 0 | 0.001244 | 0.001923 | 0.0002973 |
Non-Advanced cases (n = 3,666) vs Controls (n = 7,545)
SNP ID | Chr | Position | rs# | A1/A2 | Type | Gene | OR | P | AAMAF | NHMAF | JAMAF | LAMAF | EAMAF |
exm514211 | 6 | 6266854 | rs140712764 | T/C | | F13A1 | 28.366 | 8.3E-07 | 0.000233 | 0 | 0 | 0 | 0 |
exm1228070 | 16 | 25255366 | rs61746620 | A/G | | ZKSCAN2 | 13.396 | 1.3E-05 | 0.0002331 | 0 | 0.0002485 | 0 | 0 |
exm199465 | 2 | 70052624 | rs146778617 | T/G | | ANXA4 | 4.275 | 3.4E-05 | 0.002563 | 0 | 0 | 0 | 0 |
exm574153 | 6 | 117130704 | rs2274911 | G/A | | GPRC6A | 0.876 | 4.1E-05 | 0.2379 | 0.2717 | 0.4332 | 0.2657 | 0.2542 |
exm1311040 | 17 | 32688826 | rs138527286 | C/T | | CCL1 | 2.343 | 4.1E-05 | 0.01072 | 0 | 0 | 0 | 0 |
exm68152 | 1 | 70896038 | rs145785987 | C/T | | CTH | 8.761 | 6.3E-05 | 0.000699 | 0 | 0 | 0 | 0 |
exm1105738 | 14 | 61180657 | rs3742636 | T/G | | SIX4 | 1.127 | 8.6E-05 | 0.4256 | 0.2742 | 0.4688 | 0.2479 | 0.2912 |
exm392074 | 4 | 22390167 | rs9002 | T/C | | GPR125 | 1.139 | 1.2E-04 | 0.2823 | 0.134 | 0.1886 | 0.2342 | 0.202 |
exm1419304 | 19 | 8564474 | rs4239541 | T/G | | PRAM1 | 1.133 | 1.5E-04 | 0.6842 | 0.1563 | 0.1499 | 0.3263 | 0.2701 |
exm854616 | 10 | 106025864 | rs116993524 | T/C | | GSTO1 | 3.756 | 1.6E-04 | 0.000932 | 0 | 0 | 0 | 0.002378 |
SNP ID from db135.
Position based on GRCh37.
A1 is minor allele based on the entire multiethnic sample and the tested allele, A2 is the reference allele.
Odds ratio per allele based on the pooled analysis adjusted for age and the first 10 principal components.
MAF is minor allele frequency in controls.
AA, African Americans; NH, Native Hawaiians; JA, Japanese Americans; LA, Latinos; EA, European Americans; SP, splice-site variant.
When restricted to advanced cases (n = 499), as for ER- breast cancer, many associations with very rare SNPs were nominally significant using the score test (69 total for SNPs with fewer than 10 minor alleles observed) but the p-values failed to stand up to further investigation using exact logistic regression (with p-values all <3×10−5). In order to reduce discussion of a large number of likely false positive tests, we considered in subtype (advanced/nonadvanced) analyses only SNPs with at least 10 minor alleles seen over all cases and controls used in the analysis. Of the remaining SNPs we found that four NS SNPs with at least 10 minor alleles present were nominally significant using the score test criteria (
For non-advanced disease (n = 3,666 cases), the strongest associations were with the same SNPs as overall prostate cancer (rs140712764 in
No SNPs were significantly associated with overall prostate cancer in ethnic specific analysis (
None of the gene burden analyses were significant for overall prostate cancer after correcting for multiple comparisons (p<3×10−6) either when including common coding variants or when restricting the results to SNPs with frequency ≤1% (
| Gene | Chr | # of SNPs | OR | P |
| | | | | |
| | 6 | 26 | 0.86 | 0.00000573 |
| | 12 | 9 | 0.86 | 0.0000611 |
| | 19 | 21 | 0.96 | 0.0000642 |
| | 11 | 7 | 1.51 | 0.0000963 |
| | 14 | 13 | 1.10 | 0.00012 |
| | | | | |
| | 17 | 24 | 0.44 | 0.0000533 |
| | 2 | 8 | 1.94 | 0.000112 |
| | 7 | 5 | 1.55 | 0.000138 |
| | 10 | 5 | 18.04 | 0.000151 |
| | 11 | 7 | 3.13 | 0.000188 |
| | | | | |
| | 19 | 3 | 26.03 | 0.000000122 |
| | 6 | 2 | 42.26 | 0.000000513 |
| | 18 | 2 | 17.53 | 0.00000213 |
| | 10 | 3 | 13.41 | 0.0000124 |
| | 8 | 5 | >999 | 0.0000207 |
| | | | | |
| | 3 | 19 | 2.26 | 0.0000000697 |
| | 19 | 3 | 26.03 | 0.000000122 |
| | 6 | 2 | 42.26 | 0.000000513 |
| | 18 | 2 | 17.53 | 0.00000213 |
| | 10 | 3 | 13.41 | 0.0000124 |
| | | | | |
| | 8 | 17 | 0.75 | 0.0000749 |
| | 6 | 26 | 0.87 | 0.0000913 |
| | 12 | 9 | 0.86 | 0.000108 |
| | 19 | 21 | 0.96 | 0.000113 |
| | 4 | 37 | 1.11 | 0.000143 |
| | | | | |
| | 17 | 24 | 0.40 | 0.0000433 |
| | 15 | 22 | 1.44 | 0.000172 |
| | 10 | 5 | 18.06 | 0.000181 |
| | 11 | 7 | 3.17 | 0.000266 |
| | 2 | 8 | 1.88 | 0.000454 |
For prostate cancer, the most significant GWAS-related association, as described above, was with rs2274911 (
Given the modest effects noted with the initial GWAS signals as well as observed with these correlated coding SNPs (OR per allele of ∼1.1;
Because of the interest in the possibility that rare coding variants with large effect sizes (OR>1.5 or higher) may underlie GWAS signals and since LD with rare SNPs can extend much further than with common SNPs, we report in
For prostate cancer the strongest such associations were with
We also examined genes implicated in family-based studies of breast or prostate cancer (
For prostate cancer, we analyzed 5 genes and did not observe an over-representation of SNP associations at p<0.05 (observed/tested:
We also examined additional cancer-related traits: body mass index (BMI), alcohol intake, as well as circulating PSA levels (
This paper presents an initial investigation of the role of coding variation in the genetics of breast and prostate cancer. Our initial analysis fails to find strong evidence for the hypothesis that relatively rare coding variation is highly determinative of breast or prostate cancer risk either overall or by subtype. Our sample sizes in each racial/ethnic group were relatively small (roughly 1,000 cases and 2,000 controls in the largest groups); however, these sample sizes are large enough to detect risk alleles with moderate to large effects (odds ratios of 3–13) appearing at quite low frequency (0.1–1%) and to examine whether such coding variation underlies (by so-called synthetic association
Our analyses consisted of both single variant analysis and simple gene burden analyses. The gene burden analyses consisted of summing the minor alleles of coding variants including either all coding variants regardless of their frequency, or only those variants with MAF <1% in our overall sample. While this gene-burden test assumes implicitly that all coding variants have the same direction of effect, this is reasonable given that the power of detecting rare protective alleles in a case-control study such as this one (where controls can be regarded as representative of the population) is much less than the power to detect rare risk alleles. The rare variant sum therefore is not very sensitive to the presence of rare protective alleles in a gene.
One association for breast cancer, a single SNP in
Nevertheless a number of suggestive findings were observed that are worthy of further attempts at replication: The splice site variant rs145889899 in
For prostate cancer (all cases) the third strongest association result was for a common NS coding variant (rs2274911) in
Other suggestive findings for prostate cancer include SNPs in a variety of genes such as
We also evaluated associations in regions surrounding known (GWAS) risk alleles as a partial fine-mapping exercise; we specifically focused upon (1) coding alleles reported to be in high LD (in Europeans using 1000 Genomes data) with the index marker, and (2) other (generally less common) coding alleles within 500 kb of the GWAS alleles that might show associations that could underlie (by synthetic association
Other coding SNPs that could include causal variants producing synthetic associations (associations of rare with common SNPs of high penetrance) include SNPs in genes
We found little evidence that the NS, SP, or nonsense variants captured by the HumanExome SNP array that fall within known or suspected high risk genes for breast or prostate cancer are meaningfully associated with either cancer. The Illumina array does not directly interrogate the rare, high-risk mutations, such as frameshift mutations in
Genotyping cases and controls from our prospective cohort allowed us an opportunity to examine other cancer-related phenotypes and traits for which data and specimens had been collected prior to breast or prostate cancer diagnosis. While two of these endpoints (BMI, alcohol) were based on self-report, we were able to strongly replicate a number of known associations such as rs671 in
In order for rare variants to play an important role in explaining missing heritability
Realistically, our study only begins the assessment of whether a range of effects for “moderately rare” coding variants is possible: the detectable ORs in this study range from approximately 3 to 13 for alleles with frequency 1% to 0.1%, respectively. While these are large ORs, the above argument indicates that such effect sizes are not unreasonable if rarer protein coding variation plays a similar role in the heritability of risk as does common variation genome-wide. Our failure to find such ORs for the rarer alleles may be providing evidence against coding variation having a predominant role in breast and prostate cancer heritability and risk (outside of high risk families).
In summary, the analyses and methods described here do not support NS variants on the current exome chip as conveying moderate to high risk for breast and prostate cancer. While some suggestive findings are noted, it is likely that very large sample sizes, of the order that can only be developed through collaborative efforts such as those now engaged in the NCI GAME-ON post-GWAS meta-analysis of common variants, will be required in order to further the understanding of the role of rare NS and other coding variation in disease genetics. Exome sequencing of high-risk families will continue to be important to reveal biologically relevant coding variants for these cancers, both for insertion/deletion variants that were not covered by the current array, and to capture rarer variation (including private variants) that cannot be captured except by sequencing.
This work has been performed according to relevant national and international guidelines. Written consent was obtained at the time of DNA sample collection. The Institutional Review Boards at the University of Southern California and University of Hawaii approved the study protocol.
The MEC consists of more than 215,000 men and women in California and Hawaii aged 45–75 at recruitment, and comprises mainly five self-reported racial/ethnic populations: African Americans, Japanese, Latinos, Native Hawaiians, and European Americans
Genotyping of the Illumina Human Exome BeadChip (n = 247,895 SNPs) was conducted at the USC Genomics Core Laboratory.
DNA extraction of buffy coat fractions was conducted using the Qiagen protocol. Cases and controls were randomly placed across ethnic-specific plates for each cancer type. All samples had DNA concentrations >10 ng/µl. Initial genotype definitions were based on auto-clustering 6,404 samples across all populations which had call rate >0.99 (African American 1883, Japanese American 1823, Latino 1008, European American 1690) using the GenomeStudio software (V2011.1). Following genotype calling on all samples (>16,000), manual inspection was conducted of the following SNPs: 1) SNPs with call rate <0.98 (n = 3,317), 2) monomorphic SNPs with call rate <1 (n = ∼15,000), 3) SNPs with minor allele frequency between 0 and 0.001 and call rate <1 (n = ∼31,500), 4) SNPs with >1 replicate error based on sample duplicates (∼1,000, discussed below), 5) SNPs with apparent differences in minor allele frequencies >15% across ethnic-specific 96 sample plates (n = 798), or other evidence of batch/plate effects on allele frequency (n = 18,188), 6) all mitochondrial SNPs and all SNPs on the X and Y chromosomes (n = 5,574), and 7) autosomal SNPs out of Hardy-Weinberg Equilibrium in more than one ethnic group with p value<0.001 and at least one ethnic group with p value<0.00001 (n = 827). In total, we inspected cluster plots for approximately 70,000 SNPs (counting overlapping SNPs in the categories above), and genotypes were manually edited for 27,506 SNPs.
Of the 15,837 samples described above genotyping was successful with call rates ≥98% for 15,573 samples; of these we removed 17 samples for which reported sex conflicted with assessment of X chromosome heterozygosity, and 651 samples based on relatedness. Relatedness was determined using the IBD calculation in plink
We relied on documentation files obtained from the University of Michigan posted on
We estimated principal components in the entire sample using EIGENSTRAT
For all analyses except those of the X and Y chromosomes all controls (men and women combined) were utilized in the analysis of each cancer in order to increase statistical power. Only controls of the same sex were used to analyze X or Y chromosome variants. Analyses were performed overall and within each racial/ethnic group. For each genotyped SNP, odds ratios (OR) and 95% confidence intervals (95% CI) were estimated using unconditional logistic regression of case/control status adjusting for age at diagnosis (cases) or blood draw (controls), and reported race/ethnicity in the overall analyses, and the first 10 eigenvectors in both overall and ethnic-specific analyses. For each SNP, we tested for allele dosage effects through a 1 d.f. score chi-square trend test. When exposures are rare but with very strong effects the score test can be more powerful than the usual Wald test for reasons described in Hauck and Donner
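As a rough illustration of a 1 d.f. score trend test on allele dosage, the sketch below computes the score statistic for a logistic model containing only an intercept and the SNP. This is a deliberate simplification: the analyses described above also adjust for age, race/ethnicity and eigenvectors, which this sketch omits. The function name and toy data are hypothetical.

```python
from math import erfc, sqrt


def score_trend_test(dosages, is_case):
    """Score test of no association between allele dosage and case status.

    dosages: minor-allele counts (0, 1 or 2) per individual.
    is_case: 1 for cases, 0 for controls, in the same order.
    Returns (chi-square statistic, p-value) on 1 degree of freedom.
    """
    n = len(dosages)
    if n == 0 or n != len(is_case):
        raise ValueError("inputs must be non-empty and of equal length")
    x_bar = sum(dosages) / n
    y_bar = sum(is_case) / n
    # Score for the SNP coefficient, evaluated under the null model.
    u = sum((y - y_bar) * x for x, y in zip(dosages, is_case))
    # Null variance of the score after accounting for the intercept.
    v = y_bar * (1 - y_bar) * sum((x - x_bar) ** 2 for x in dosages)
    if v == 0:
        raise ValueError("SNP is monomorphic or only one outcome class is present")
    chi2 = u * u / v
    return chi2, erfc(sqrt(chi2 / 2))


# Toy data for illustration only (not study data).
toy_dosages = [0, 0, 0, 0, 1, 1, 2, 2]
toy_status = [0, 0, 0, 0, 1, 1, 1, 1]
chi2_stat, p_value = score_trend_test(toy_dosages, toy_status)
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.4f}")
```

Because the statistic is evaluated under the null model only, no per-SNP model fitting is needed, which is one practical reason score tests are convenient when many SNPs are tested.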
For each gene listed in the annotation files we conducted a simple gene-specific burden test summing the number of minor alleles of each putative functional SNP carried by an individual. These summation variables were then used as the genetic variable in logistic regression models of case-control status after adjusting for age, reported race/ethnicity, and the first 10 eigenvectors above. We performed the gene burden analyses twice, once using all putative functional SNPs and again using only those variants with MAF<1% in the total sample. Statistical significance was again evaluated using the score test and exact logistic regression. A simple gene burden analysis is an over-simplification, since it implicitly assumes that all effects are in the same direction. It is important to remember, however, that the power to detect rare protective alleles is much smaller than the power to detect rare risk alleles, since the former will not be over-enriched in our controls; therefore we expect that the simple sum of minor alleles, especially for rarer alleles, will not be very much diluted by rare protective effects.
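The burden variable described above can be sketched as a per-individual sum of minor-allele counts across a gene's putative functional SNPs, optionally restricted by MAF. The data layout, function name and variant identifiers below are hypothetical, and treating missing genotypes as zero is an assumption made for this sketch rather than a rule stated in the text.

```python
def burden_scores(genotypes, maf, maf_max=None):
    """Sum minor-allele counts per individual across one gene's variants.

    genotypes: dict of variant id -> list of minor-allele counts per
               individual (None marks a missing genotype).
    maf:       dict of variant id -> minor allele frequency in the sample.
    maf_max:   if given, keep only variants with MAF strictly below it.
    """
    if not genotypes:
        return []
    n_individuals = len(next(iter(genotypes.values())))
    scores = [0] * n_individuals
    for variant, counts in genotypes.items():
        if maf_max is not None and maf[variant] >= maf_max:
            continue  # skip variants above the frequency cutoff
        for i, count in enumerate(counts):
            if count is not None:  # assumption: missing contributes 0
                scores[i] += count
    return scores


# Hypothetical gene with three variants genotyped in four individuals.
toy_genotypes = {
    "var_a": [0, 1, 0, 2],
    "var_b": [1, 0, 0, None],
    "var_c": [0, 0, 1, 1],
}
toy_maf = {"var_a": 0.004, "var_b": 0.0005, "var_c": 0.12}

all_variant_burden = burden_scores(toy_genotypes, toy_maf)
rare_variant_burden = burden_scores(toy_genotypes, toy_maf, maf_max=0.01)
print(all_variant_burden, rare_variant_burden)
```

The two calls mirror the two passes described above: one over all putative functional SNPs and one restricted to variants with MAF<1%.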
For breast cancer we ran each of the above single SNP and gene burden tests separately by estrogen receptor status (+/−); for prostate cancer we ran analyses overall and separately by classification into advanced (stage>1) versus non-advanced (stage = 1) disease. For the other traits described above, we analyzed single SNPs using regression (logistic or linear) methods for binary or continuous phenotypes. The 100 most statistically significant results for each phenotype are presented in
We also examined whether known risk alleles (generally intergenic or intronic) from GWAS studies of breast or prostate cancer may be reflecting an underlying signal from a nearby protein-altering variant. In these analyses, for each GWAS SNP (73 for breast, 83 for prostate cancer) we initially interrogated nearby SNPs known to be or likely to be in LD with the index signal. Because LD data are not yet available for the majority of the SNPs on the HumanExome array, we expanded the associations considered to be all those within a 100 kb region on either side of the index signal, since LD between common SNPs can sometimes extend this far. Within this region, we highlighted in the results section and discussion SNPs with modest signals of association (p<0.05) as well as more strongly significant SNPs. Here the common SNPs are likely to be in high LD with the (generally common) GWAS variants, and the rare SNPs could be producing synthetic associations. We then relaxed this 200 kb region to 1 Mb (500 kb on either side of the index signal) in order to expand our examination of possible synthetic associations between rare SNPs and the index GWAS findings, since LD with rare SNPs can extend considerably further than with common SNPs.
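The window selection described above amounts to a position filter around each index SNP. A minimal sketch follows, using hypothetical identifiers and coordinates and a hypothetical record layout (dictionaries with `id`, `chr` and `pos` keys).

```python
def coding_snps_near_index(index_snp, coding_snps, flank_bp):
    """Return coding SNPs on the same chromosome within flank_bp of the index SNP.

    index_snp:   (chromosome, position) tuple for the GWAS index signal.
    coding_snps: list of dicts with 'id', 'chr' and 'pos' keys.
    flank_bp:    window half-width in base pairs (applied on either side).
    """
    chrom, pos = index_snp
    return [
        snp for snp in coding_snps
        if snp["chr"] == chrom and abs(snp["pos"] - pos) <= flank_bp
    ]


# Hypothetical coordinates, for illustration only.
index = ("6", 117_000_000)
candidates = [
    {"id": "snp_a", "chr": "6", "pos": 117_050_000},
    {"id": "snp_b", "chr": "6", "pos": 117_400_000},
    {"id": "snp_c", "chr": "6", "pos": 118_000_000},
    {"id": "snp_d", "chr": "7", "pos": 117_050_000},
]

narrow = coding_snps_near_index(index, candidates, flank_bp=100_000)  # 200 kb region
wide = coding_snps_near_index(index, candidates, flank_bp=500_000)    # 1 Mb region
print([s["id"] for s in narrow], [s["id"] for s in wide])
```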
Recognizing that many variants are only polymorphic in a few racial/ethnic groups, we give power analysis for a study with 1,000 cases and 2,000 controls (roughly the number of cases and controls in each of the four largest populations) by odds ratio (1–200) and allele frequencies ranging from 0.0001 to 0.1 (
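For orientation only, the sketch below gives a crude normal-approximation power calculation for comparing allele frequencies between cases and controls. It is not the method behind the power analysis reported here, which the text does not specify, and the normal approximation is known to be poor for very rare alleles. The function names are hypothetical, and the significance level in the example is simply the single-SNP threshold quoted in the results.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def approx_power(control_freq, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided two-proportion z-test on alleles."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Each individual contributes two alleles.
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


# Design from the text: 1,000 cases and 2,000 controls.
for freq, odds_ratio in [(0.01, 3.0), (0.001, 13.0)]:
    power = approx_power(freq, odds_ratio, 1000, 2000, alpha=3.9e-7)
    print(f"MAF={freq}, OR={odds_ratio}: approximate power {power:.2f}")
```

The sketch is mainly useful for seeing how quickly power falls as allele frequency drops at a fixed odds ratio; exact or simulation-based calculations would be preferable for the rarest variants.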
Allele frequency of putative functional SNPs for a. All ethnicities combined; b. European American; c. African American; d. Latino; e. Japanese American; f. Native Hawaiian
(DOCX)
Statistical power for single SNP analyses.
(DOCX)
One hundred most significant single SNP associations with breast cancer; over all ethnic groups (S1.1) and by ethnic group (S1.2–6).
(XLSX)
One hundred most significant associations between single SNPs and (S2.1) ER-positive Breast cancer and (S2.2) ER-negative breast cancer.
(XLSX)
One hundred most significant single SNP associations with prostate cancer; over all ethnic groups (S3.1) and by ethnic group (S3.2–6).
(XLSX)
One hundred most significant associations between single SNPs and (S4.1) advanced prostate cancer and (S4.2) localized prostate cancer.
(XLSX)
Gene burden analyses. One hundred strongest associations with (S5.1) Overall breast cancer, (S5.2) ER-positive breast cancer, (S5.3) ER-negative breast cancer, (S5.4) Overall prostate cancer, (S5.5) Advanced prostate cancer and (S5.6) Non-advanced prostate cancer.
(XLSX)
Relationship between SNPs or genes known to be associated with breast cancer and coding SNPs on the exome array. Summary of nearest coding SNPs and gene burden analyses for (S6.1) GWAS associations and (S6.2) High risk genes.
(XLSX)
Relationship between SNPs or genes known to be associated with prostate cancer and coding SNPs on the exome array. Summary of nearest coding SNPs and gene burden analyses for (S7.1) GWAS associations and (S7.2) High risk genes.
(XLSX)
Summary statistics for other phenotypes examined: BMI, alcohol intake, and PSA.
(XLSX)
Most significant single SNP association results for other phenotypes examined.
(XLSX)
We thank the men and women who volunteered to participate in the MEC. We also thank Andrea Holbrook for technical support.