The authors have declared that no competing interests exist.
Conceived and designed the experiments: Y-CC RK. Performed the experiments: Y-CC HC MP. Analyzed the data: Y-CC RK HC MP PPZ. Contributed reagents/materials/analysis tools: Y-CC RK HC JP MK FSG MP PPZ WRM JBP. Wrote the paper: RK Y-CC MP PPZ HC.
In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limited power of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (
Inexpensive, high-throughput sequencing has transformed the field of case-control association studies. For the first time, it may be possible to identify the genetic underpinnings of complex diseases, by sequencing the DNA of hundreds (even thousands) of cases and controls and comparing patterns of DNA sequence variation. However, complex diseases are likely to be caused by many variants, some of which are very rare. Taken one at a time, the association between variant and disease phenotype may not be detectable by current statistical methods. One strategy is to identify regions where important variants occur by “collapsing” variants into groups. Here, we present a new collapsing approach, capable of detecting subtle genetic differences between cases and controls. We show, in extensive simulations and using a benchmark set of genes involved in human triglyceride levels, that the approach is potentially more powerful than existing methods. We apply the new method to an ongoing sequencing study of bipolar cases and controls and identify a set of genes found in neuronal synapses, which may be implicated in bipolar disorder.
Research efforts over the past few years have yielded an explosion of exome sequencing studies and exomic variation data (reviewed in
These developments have created demand for a new generation of statistical and informatics approaches. Increasingly powerful analysis methods have been developed to enable detection of association between phenotype and variants with small to moderate effect sizes (reviewed in
Here we describe a new hybrid likelihood test BOMP (Burden Or Mutation Position test), designed for case-control exome sequencing studies, to detect the presence of causal variants in a functional group. The functional group can be defined as a gene, genomic region, or gene set (multiple genes involved in a pathway or biological process). The test can incorporate variant weighting by bioinformatically-predicted functional impact. We combine, into a single statistic, a directional burden test in which low frequency variants have increased weight and a non-directional position distribution test that does not consider allele frequency. Our burden test uses a collapsing strategy and metrics of variant functional importance, which are similar to previously published burden tests (
To assess the utility of BOMP, we compare its power to three leading methods for variant case-control association testing: VT
In these experiments, BOMP is consistently powerful across a spectrum of disease causality models, in simulations of case-control studies drawn from populations of African-American and European-American individuals, and for the ANGPTL variants from the Dallas Heart Study. It appears to be particularly useful for detecting genes containing causal variants when protective variants are present, when a disease phenotype is associated with variants that cluster in key regions on a gene, when a causal variant is common, or when applied to a candidate gene set, rather than a single candidate gene.
Finally, we apply BOMP to identify causal gene sets in an ongoing, whole-exome case-control sequencing study of bipolar disorder. We find that seven gene sets are nominally associated with bipolar disorder and that one, the “MAPK signaling pathway” (KEGG), trends towards significance after correcting for multiple gene sets tested. Notably, this pathway has been implicated previously in bipolar disorder
We evaluated the power of the BOMP hybrid likelihood model with both simulations and empirical data from the Dallas Heart Study
First, we assessed the power of BOMP to detect genes with causal variants in an extreme phenotype case-control study, for a disease with 1% population prevalence, and significance level
In single-gene case-control study simulations, a study size of 2000 (1000 cases, 1000 controls) was required for any of the methods to achieve at least 80% power to detect causal variants. BOMP had
Power estimates for BOMP, VT, SKAT, KBAC (KBAC1P = minor allele frequency defined as
Disease Etiology Name | MAF Deleterious | Selection Coeff Deleterious | Effect size Deleterious | Selection Coeff Protective | Effect size Protective | Variant Functional Role | Demographic model(s) |
Rare variant | | | | NA | NA | NS | AA,EA |
Low frequency variant | | | | NA | NA | NS | AA,EA |
Key region variant | | | | NA | NA | NS | AA,EA |
Common variant | | | | NA | NA | NS | AA |
Rare+Protect | | | | | | NS | AA |
LowFreq+Protect | | | | | | NS | AA |
KeyRegion+Protect | | | | | | NS | AA |
Common+Protect | | | | | | NS | AA |
Minor allele frequency of deleterious causal variants,
Selection coefficients of deleterious causal variants,
Effect size of deleterious causal variants,
Selection coefficient of protective causal variants,
Effect size of protective modifier variants,
Required functional role of causal and protective variants, NS = coding non-synonymous, AA = African-American simple bottleneck demographic model
Next, we explored how the power of the tested methods could be improved by application to a candidate gene set rather than a single candidate gene. We simulated case-control studies, in which each genomic individual had multiple genes, all or some of which contained causal variants. The gene sets in which all genes contained causal variants ranged from 2 to 5 genes. Gene sets with mixtures of causal and non-causal genes ranged from 4 to 15 genes (ratios of causal to non-causal 3∶1, 3∶3, 3∶6, 3∶9, and 3∶12). Causal variants were equally likely to be from any of the disease etiologies dominated by rare variants. The assumption that even 25% of genes in a set contain causal variants is certainly optimistic, but this experiment allowed us to compare the extent to which each method was affected by the fraction of causal genes in a set.
When all genes in a gene set contained causal variants, power increased for all methods as gene set size increased. When the gene sets contained a mixture of genes, both with and without causal variants, the power decreased with the causal to non-causal ratio. For the African-American demographic model, BOMP and SKAT were the most robust to gene sets with low causal to non-causal ratio. As in the single gene experiment, all methods had less power in the European-American demographic than in the African-American. For the European-American, none of the methods had power
A,B. X-axis shows number of candidate genes in 250 simulated case-control studies (approximately one-third each from disease etiologies Rare, LowFreq and KeyRegion). All genes contain causal variants. For each method, average power is shown. Power increases for all methods as the number of candidate genes with causal variants increases. C,D. X-axis shows the number of candidate genes and the ratio of genes containing causal variants to those that do not contain causal variants. As the ratio decreases, the power of the tested methods also decreases. (Tested methods are BOMP, VT, SKAT and KBAC1P = minor allele frequency defined as
Next, we reconsidered the assumption that causal variants in a gene set were equally likely to come from a few disease etiologies. Instead, we sampled disease etiologies from nine multinomial distributions (
Power estimates for BOMP, VT, SKAT, KBAC (KBAC1P = minor allele frequency defined as
We explored the power of BOMP with respect to case-control study size, using a set of 24 candidate genes as the functional group. We varied the ratio of causal to non-causal genes across 1∶3, 1∶1, and 3∶1. Here, causal variants were again equally likely to be from any of the disease etiologies dominated by rare variants. For a case-control study size of 1000, BOMP's power exceeded 0.8, regardless of the causal-to-non-causal gene ratio (African-American only), and for the 1∶1 and 3∶1 causal-to-non-causal gene ratios for European-American. A study size of 200 was sufficient for power
Power estimates for BOMP; each estimate is based on 250 simulated case-control studies (approximately one-third each from disease etiologies Rare, LowFreq and KeyRegion). The genomic individuals each had 24 genes, and the ratio of genes with causal variants to those without causal variants was either 1∶3 (6 causal, 18 non-causal), 1∶1 (12 causal, 12 non-causal), or 3∶1 (18 causal, 6 non-causal). AA = African-American simple bottleneck demographic model. EA = European-American exponential growth demographic model.
We reasoned from these results that, for a population whose allele frequency spectrum is similar to our European-American demographic model simulations, current whole-exome case-control studies are not sufficiently powered. These studies lack power to find causal variants both at the single gene level (as proposed by
To test this hypothesis, we considered a case-control study of 200 individuals, using a set of 100 candidate genes as the functional group. The disease etiologies of causal genes were sampled from three categories with the ratio of 10 (Rare variant or Low frequency variant or Key region variant) :1 (Common variant) : 1 (any etiology involving protective variants). Etiologies were sampled with equal probability within each category. As before, we varied the ratio of causal to non-causal genes in the set from 1∶3, 1∶1, 3∶1. BOMP power was
We computed average power for single candidate gene case-control studies and multiple candidate gene case-control studies (nine genes, 3∶6 causal to non-causal ratio), with respect to both demographic models, all disease etiologies (
Breakdown of contribution of BOMP mutation burden (BOMP_B) and BOMP position distribution (BOMP_P) statistics averaged over single candidate gene power estimates (
In general, mutation burden tests outperform the position distribution statistic when causal variants are rare and are not clustered. The position distribution test outperforms burden tests when the number of rare variants is similar in cases and controls, but where cases and controls differ with respect to the position distribution of the variants. To illustrate this point, we show a case in which burden tests would miss such a difference (
A toy example of a genomic region containing variants (blue squares) in cases and controls. We assume that the region is important for phenotype. Variant counts in cases (red). Variant counts in controls (purple). Cases and controls each have a total of 9 variants in this region, so Burden statistics (
Genotypes of 8 cases and 8 controls at 10 positions. Matrix column colors: controls = light blue, cases = light red. Position distribution bar colors: controls = blue, cases = red. Detailed description is in the section “Toy example with analytical calculations” (
Both collapsing burden and position distribution tests outperform SKAT when causal variants are very rare.
Because we used permutation to compute p-values, type I error should be well controlled.
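The label-permutation idea can be illustrated with a minimal sketch. It is not the BOMP implementation: the function names, the difference-in-means statistic, and the add-one correction in the final P-value are all assumptions made for illustration, and any statistic that takes values and case/control labels could be substituted.

```python
import random

def mean_difference(values, labels):
    """Illustrative statistic: mean value in cases minus mean value in controls."""
    cases = [v for v, l in zip(values, labels) if l == 1]
    controls = [v for v, l in zip(values, labels) if l == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def permutation_p_value(statistic, values, labels, n_perm=1000, seed=0):
    """One-sided empirical P-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling keeps the number of cases and controls fixed.
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            hits += 1
    # Add-one correction (an assumption here) avoids reporting P = 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-individual burdens: 8 cases followed by 8 controls.
burdens = [5, 6, 7, 5, 6, 7, 6, 5, 0, 1, 0, 1, 0, 0, 1, 0]
labels = [1] * 8 + [0] * 8
p_value = permutation_p_value(mean_difference, burdens, labels)
```

Because the null distribution is built from the data themselves, the resulting P-value does not rely on the independence assumptions of the likelihood model, at the cost of one statistic evaluation per permutation.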
We applied the BOMP hybrid likelihood model to the analysis of data from the Dallas Heart Study (DHS)
We stratified the DHS samples by ethnicity (Hispanic, non-Hispanic white, non-Hispanic black) and gender. Because BOMP was designed for dichotomous phenotypes, we selected the lower and upper quartiles from each group, by TG level (totaling 1775 individuals, with 897 cases and 878 controls). Sixty mutations in
We computed a P-value for each of the three ANGPTL genes and for the ANGPTL gene set, using BOMP (with and without bioinformatics scores), the burden statistic VT (with and without bioinformatics variant weighting), the overdispersion statistic SKAT, and the mixture-model KBAC statistic (with four parameter settings) (
Method | ANGPTL | | | |
Hybrid BOMP+VEST | | 0.09 | 2.3E-05 | 0.15 |
Hybrid BOMP | 3.7E-05 | 0.14 | 4.3E-05 | 0.14 |
VT+VEST | 8.3E-05 | | 1.7E-05 | 0.18 |
SKAT | 1.06E-04 | 0.068 | 5.78E-05 | 0.29 |
Positional BOMP | 1.5E-04 | 0.4 | | 0.3 |
KBAC (1D,5P) | 2.9E-04 | 0.031 | 1.5E-04 | 0.17 |
KBAC (2D,5P) | 5.5E-04 | 0.064 | 3.2E-04 | 0.31 |
KBAC (1D,1P) | 2.9E-03 | 0.24 | 0.033 | |
VT | 3.8E-03 | 0.04 | 4.56E-03 | 0.1 |
KBAC (2D,1P) | 5.8E-03 | 0.47 | 0.067 | 0.045 |
Burden BOMP+VEST | 0.006 | 0.04 | 0.008 | 0.09 |
P-values of association between dichotomized triglyceride levels and variation in three ANGPTL family genes sequenced in Dallas Heart Study. ANGPTL - multiple gene set including
The hybrid BOMP test, with bioinformatics scores and allele frequency variant weighting, had the most significant P-value for the ANGPTL gene set
These results confirm previous reports that the performance of current methods to detect causal variants depends on which genes are selected for benchmarking
We then used BOMP to test candidate gene sets in data from an on-going whole exome sequencing study of bipolar disorder. We examined whole-exome sequencing data on the first 191 cases and 107 controls from this study. These samples were sequenced in two rounds, over a two-year period. In the first round, Nimblegen v1.0 arrays were used for exome capture and the Illumina GAII platform for next-generation sequencing. In the second round, Nimblegen v2.0 arrays and the Illumina HiSeq2000 platform were used. Only samples with target sequencing coverage of at least 80% at 20× sequencing depth were included for further analysis. Sequence reads from the samples were aligned to the human reference genome sequence database using BWA
We obtained a collection of pathways for testing from SynaptomeDB
To control for differences in the number of exons targeted by Nimblegen v1.0 and v2.0 (approximately 180,000 vs. 300,000 coding exons) in our analyses with BOMP, we only considered variants in exons present in both Nimblegen kits. We used OverlapSelect from the UCSC Kent source library to identify variants in the shared exons. BOMP P-values and FDR were computed for each of the twenty gene sets selected from SynaptomeDB.
Seven of the gene sets were nominally associated with bipolar disorder (
Gene Set Name (Source) | BOMP P-value | BOMP Burden P-value | BOMP Position P-value | FDR | Synaptic Genes | Gene Set Size | Time [m] |
MAPK SIGNALING PATHWAY (KEGG) | 0.0065 | 1.0000 | 0.0043 | 0.0949 | 68 | 267 | 87 |
AXON GUIDANCE (Reactome) | 0.0162 | 0.5847 | 0.0137 | 0.0949 | 71 | 161 | 99 |
NEUROLOGICAL SYSTEM PROCESS (GO) | 0.0274 | 0.2787 | 0.0273 | 0.0949 | 75 | 377 | 177 |
METABOLISM OF PROTEINS (Reactome) | 0.0299 | 0.3437 | 0.0272 | 0.0949 | 96 | 215 | 46 |
NEUROACTIVE LIGAND RECEPTOR INTERACTION (KEGG) | 0.0309 | 0.8490 | 0.0259 | 0.0949 | 21 | 272 | 104 |
HUNTINGTONS DISEASE (KEGG) | 0.0312 | 0.5408 | 0.0266 | 0.0949 | 77 | 185 | 51 |
CALCIUM SIGNALING PATHWAY (KEGG) | 0.0332 | 0.4796 | 0.0303 | 0.0949 | 51 | 178 | 85 |
POST TRANSLATIONAL PROTEIN MODIFICATION (GO) | 0.0635 | 0.4031 | 0.0620 | 0.1587 | 87 | 462 | 244 |
NERVOUS SYSTEM DEVELOPMENT (GO) | 0.0744 | 0.8912 | 0.0674 | 0.1654 | 88 | 382 | 157 |
SIGNALLING BY NGF (Reactome) | 0.1290 | 0.8393 | 0.1139 | 0.2317 | 71 | 215 | 103 |
GNRH SIGNALING PATHWAY (KEGG) | 0.1313 | 0.7485 | 0.1109 | 0.2317 | 34 | 101 | 43 |
OXIDATIVE PHOSPHORYLATION (KEGG) | 0.1390 | 0.4222 | 0.1263 | 0.2317 | 66 | 135 | 21 |
ALZHEIMERS DISEASE (KEGG) | 0.2164 | 0.7012 | 0.1904 | 0.3151 | 76 | 169 | 46 |
INTRACELLULAR SIGNALING CASCADE (GO) | 0.2267 | 0.7489 | 0.2174 | 0.3151 | 120 | 648 | 256 |
NEUROTROPHIN SIGNALING PATHWAY (KEGG) | 0.2363 | 0.7995 | 0.2031 | 0.3151 | 45 | 126 | 44 |
CHEMOKINE SIGNALING PATHWAY (KEGG) | 0.2667 | 0.6638 | 0.2464 | 0.3272 | 50 | 190 | 62 |
WNT SIGNALING PATHWAY (KEGG) | 0.2781 | 0.7009 | 0.2519 | 0.3272 | 38 | 151 | 56 |
REGULATION OF GENE EXPRESSION IN BETA CELLS (Reactome) | 0.3401 | 0.0877 | 0.4571 | 0.3779 | 60 | 101 | 10 |
MITOCHONDRION (GO) | 0.4315 | 0.9727 | 0.4047 | 0.4542 | 117 | 339 | 104 |
PARKINSONS DISEASE (KEGG) | 0.7893 | 0.3087 | 0.8111 | 0.7893 | 62 | 133 | 22 |
The gene sets were selected for testing because they contained
FDR computed with the Benjamini-Hochberg algorithm
Wall-clock time in minutes.
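The FDR column can be checked with a short sketch of the Benjamini-Hochberg step-up adjustment. The function name is an assumption for illustration; the input P-values are the twenty BOMP P-values copied from the table above, and small discrepancies in the last digit are expected because the tabulated P-values are rounded.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(p_values)
    # Indices of the P-values, sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# BOMP P-values for the twenty gene sets, in table order.
bomp_p = [0.0065, 0.0162, 0.0274, 0.0299, 0.0309, 0.0312, 0.0332,
          0.0635, 0.0744, 0.1290, 0.1313, 0.1390, 0.2164, 0.2267,
          0.2363, 0.2667, 0.2781, 0.3401, 0.4315, 0.7893]
fdr = benjamini_hochberg(bomp_p)
```

The seven nominally associated gene sets share one adjusted value because the seventh-ranked P-value (0.0332 × 20 / 7 ≈ 0.0949) caps the adjusted values of all smaller P-values.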
To check for possible systematic bias in our Bipolar analysis
In our simulations, analysis of an average-sized gene (500 codons), using 100,000 permutations, required wall-clock times of 32 s, 57.8 s, 1 m23 s, and 3 m4.2 s for case-control study sizes of 200, 1000, 2000, or 5000 (African American demographic model) and 22.8 s, 1 m7.6 s, 1 m25.2 s, and 3 m9.6 s for European American demographic model. For our gene set analyses with real data from the bipolar case-control study (298 individuals), using 100,000 permutations to compute P-values, BOMP computation time ranged from 10 m for a gene set with 101 genes to 4 h16 m for a gene set with 648 genes (
In this work, we introduce and explore the power of a new hybrid likelihood model BOMP to detect causal variants underlying dichotomous disease phenotypes. We compared its power with that of several leading methods designed to detect causal variation in whole-exome case-control studies. We performed simulated case-control studies, using a variety of sizes, demographic models, and disease etiologies. The hybrid BOMP model had good power compared to several popular methods (
Because no current variant collapsing methods have been shown to be best for every disease etiology
We considered the possibility that differences between cases and controls might be detected with respect to a gene set, rather than a single gene
Biologically, we don't expect that every gene in a real gene set will contain causal variants. Thus our simulated gene sets were designed to contain a mix of genes with causal variants and those without. The burden tests (VT and BOMP burden) were not able to effectively capture the difference between the two and lost power as the number of non-causal variants in the simulations increased (
We found in our simulations that, with a case-control study size of 1000 individuals (500 cases, 500 controls), BOMP was sufficiently powered to detect causal variants in situations when a good candidate gene set (of approximately 25 genes) was known. However, if the ratio of causal to non-causal genes in the selected gene set was low (1∶3) and/or the individuals in the study carried a high proportion of rare (MAF
When we applied BOMP, VT, SKAT, and KBAC to an empirical dataset, each method displayed both strengths and weaknesses. While the dataset is small, it is interesting to note that P-values of association between variant ANGPTL family genes and dichotomized serum triglyceride levels from the Dallas Heart Study were most significant for the BOMP hybrid model, when the genes were considered together as a gene set. However, the burden statistic VT had the most significant P-value for
BOMP is not designed to be adjusted for additional covariates, which are often available in disease studies. For example, it is not designed to explicitly deal with different ancestries in a structured population. However, if the true population structure is known and the number of subpopulations is not too large, we can run analyses with stratification to get around this problem, as we (and the authors of the VT and SKAT papers) did for ANGPTL family genes in the Dallas Heart Study
Incorporating bioinformatics scoring of variants (by VEST) yielded improved P-values for both BOMP and VT on the Dallas Heart Study data. While it has been suggested that bioinformatics misclassification of variants might be more of a liability than a benefit, our results (albeit on a small gene set) suggest the opposite. Functional classification of variants in both coding and non-coding regions of the genome is an active research area in bioinformatics, and as methods improve, it is likely that they will increasingly contribute to statistical analysis of causal variation.
Finally, we applied BOMP (with VEST scoring) to test candidate gene sets in data from an on-going whole-exome study of bipolar disorder. The top gene set was the “MAPK signaling pathway” defined from the KEGG Pathway Database (map04010). This is a highly conserved pathway that is centrally involved in cell proliferation, differentiation and migration. In the nervous system, it is at the nexus of multiple neuronal signaling cascades thought to mediate certain forms of synaptic plasticity
Complex diseases are expected to have considerable genetic heterogeneity
Bayesian extensions of our work could include prior knowledge about the probability that a functional group of interest is associated with the phenotype. For example, if the functional group is a gene, prior evidence could come from expert knowledge based on previous functional and/or case-control studies. By combining the log likelihood ratio (
In summary, we have developed a new method for identifying causal variants in high-throughput sequencing data from case-control studies. It is shown to have good power relative to other leading methods and can be flexibly used in a variety of realistic scenarios. The genetic architecture of most common human diseases is likely complex, involving variants with a wide spectrum of frequencies from rare to common and contributing to disease through a number of inter-related pathways. The emergence of whole-exome and genome sequencing studies promises to accelerate our ability to interrogate the genetic architecture of these diseases. However, a major challenge remains how to make sense of the enormous amounts of data generated by such studies. Our new method provides another useful tool in a growing toolbox for analyzing the data from such studies.
BOMP (Burden Or Mutation Position statistics), the hybrid likelihood model proposed here, consists of two likelihood ratio tests (mutation burden and mutation position distribution statistics) with the same general form,
The first likelihood ratio test is based on comparing mutation burden in cases and controls. Each individual is represented with a Bernoulli random variable, which is 1 if the individual's burden exceeds a burden threshold, and 0 otherwise. To model the likelihood, we assume that individual burden status is independent and identically distributed (IID). The ratio compares an alternative hypothesis that the probability of exceeding the burden threshold is higher in cases than in controls and the null hypothesis (that probabilities are equal or lower in cases than in controls). Biologically, the IID assumption is not necessarily true. We control for such violations by assessing the statistical significance of the likelihood ratio by permuting case and control labels (
For individual
A binary variable is used to label individuals whose mutation burden in a gene of interest exceeds a critical threshold. If the burden of gene
The
A. Mutation burden statistic. The Mutation burden statistic uses the aggregated burden for cases,
For a gene
Thus genes with higher burdens in cases than controls get a high value and those with higher burdens in controls than cases get a low value.
It follows that under
If a gene set rather than a single gene is used as the functional group, the burden is aggregated across all genes in the set, and the procedure is otherwise identical.
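The burden likelihood ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function names and example burdens are hypothetical, the burden threshold is passed in rather than optimized, and the statistic is set to zero when cases do not exceed controls, which is one way to encode the directional alternative.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_loglik(successes, trials):
    """Maximized log likelihood of IID Bernoulli data (p set to its MLE)."""
    p = successes / trials
    return xlogy(successes, p) + xlogy(trials - successes, 1.0 - p)

def burden_log_lr(case_burdens, control_burdens, threshold):
    """Directional log likelihood ratio for exceeding a burden threshold."""
    k_case = sum(b > threshold for b in case_burdens)
    k_ctrl = sum(b > threshold for b in control_burdens)
    n_case, n_ctrl = len(case_burdens), len(control_burdens)
    # Assumption: no evidence for the alternative unless cases exceed controls.
    if k_case / n_case <= k_ctrl / n_ctrl:
        return 0.0
    # Alternative: separate exceedance probabilities for cases and controls.
    alt = bernoulli_loglik(k_case, n_case) + bernoulli_loglik(k_ctrl, n_ctrl)
    # Null (boundary): one shared exceedance probability.
    null = bernoulli_loglik(k_case + k_ctrl, n_case + n_ctrl)
    return alt - null

# Hypothetical individual burdens for six cases and six controls.
lr_enriched = burden_log_lr([3, 2, 2, 0, 1, 4], [0, 1, 0, 2, 0, 0], threshold=1)
lr_depleted = burden_log_lr([0, 1, 0, 2, 0, 0], [3, 2, 2, 0, 1, 4], threshold=1)
```

In practice the significance of such a ratio would be assessed by permuting case and control labels rather than by an asymptotic distribution, since the IID assumption may not hold.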
The second likelihood ratio test is based on comparing the position distribution of mutations in cases and controls. The codons of a gene are partitioned into windows and mutation count (burden score) is computed for each window in cases only, controls only, and in cases and controls together. To model the likelihood, each window mutation count is considered to be a random variable in a multinomial distribution. If the partition contains
Let the window mutation counts in the multinomial distributions be
The maximum likelihood estimate of the multinomial parameters (including pseudocounts) is then
For a gene
It follows that under
A toy example of aggregated window mutation count calculation is illustrated in
Each gene has many possible window partitions, and we don't know in advance which is the most informative for the position distribution statistic. One way to create candidate window partitions (
For the position distribution statistic, if a gene set rather than a single gene is used as a functional group, the best window segmentation is computed for each gene, and the calculation of the position distribution statistic is otherwise identical.
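A minimal sketch of the position distribution likelihood ratio for a single fixed window partition is given below. The function names, the pseudocount of 1, the 0-based codon positions, and the example data are assumptions for illustration; the search over candidate window partitions is omitted.

```python
import math

def window_counts(codon_positions, gene_length, window_size):
    """Count mutations per window; codon positions are assumed 0-based."""
    n_windows = math.ceil(gene_length / window_size)
    counts = [0] * n_windows
    for pos in codon_positions:
        counts[pos // window_size] += 1
    return counts

def multinomial_mle(counts, pseudocount=1.0):
    """Multinomial parameter estimates with a pseudocount added to each window."""
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]

def position_log_lr(case_counts, control_counts, pseudocount=1.0):
    """Log likelihood ratio: separate vs. shared window distributions."""
    pooled = [c + d for c, d in zip(case_counts, control_counts)]
    p_case = multinomial_mle(case_counts, pseudocount)
    p_ctrl = multinomial_mle(control_counts, pseudocount)
    p_pool = multinomial_mle(pooled, pseudocount)
    alt = sum(c * math.log(p) for c, p in zip(case_counts, p_case))
    alt += sum(d * math.log(p) for d, p in zip(control_counts, p_ctrl))
    null = sum(n * math.log(p) for n, p in zip(pooled, p_pool))
    return alt - null

# Hypothetical 90-codon gene, 30-codon windows, nine mutations per group.
case_positions = [2, 5, 8, 11, 14, 20, 25, 40, 70]
control_positions = [3, 12, 27, 33, 45, 58, 61, 75, 88]
case_w = window_counts(case_positions, 90, 30)
ctrl_w = window_counts(control_positions, 90, 30)
lr_clustered = position_log_lr(case_w, ctrl_w)
lr_identical = position_log_lr(ctrl_w, ctrl_w)
```

Both groups carry the same total number of mutations here, so a burden comparison sees no difference, while the clustered case mutations still yield a positive ratio.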
The mutation burden and mutation position distribution statistics are combined into a single log likelihood ratio,
P-values for each
After
Either
The Individual Burden can be modified by incorporation of score coefficients so that
Each nonsilent variant
The Variant Effect Scoring Tool is a Random Forest classifier, trained with the CHASM software suite's Classifier Pack and SNVBox
For each
Simulated case-control studies are generated using two demographic growth models, eight disease etiologies, and a stochastic model of genotype-phenotype association.
The general Wright-Fisher model/forward population genetic simulation tool SFS_code
The individuals in a population are then associated with a quantitative phenotypic trait, which we assume drives a disease, so that individuals with high values of the trait will have the disease and those with low values of the trait will not. Eight possible disease etiologies are considered (
To generate the phenotypic traits for the genomic populations, we select a disease etiology and a population that contains variants meeting the criteria for causality in that etiology (
Next the quantitative disease trait
To match the expected effect size of significant common and rare variants in GWAS, effect sizes of
At this stage, each population consists of genomic individuals, each with a real-valued quantitative trait. To construct the case-control studies, an extreme phenotype model is used. Disease prevalence is set at 1%,
Case-control studies are generated by sampling without replacement from affected and unaffected groups in a population. Individuals with intermediate phenotype values are not included in case-control studies. The random process used to generate
A null case-control sample is also generated, with no disease etiology, in which the phenotypic trait is drawn from a standard normal distribution for every individual in the sample.
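The extreme phenotype sampling step can be sketched as below. This is only an illustration under assumptions: the 1% prevalence comes from the text, but the function name, the fraction of the population treated as unaffected (`control_fraction`), the population size, and the standard normal trait are hypothetical choices.

```python
import random

def extreme_phenotype_sample(traits, n_cases, n_controls, prevalence=0.01,
                             control_fraction=0.25, seed=0):
    """Sample cases from the upper trait tail and controls from the lower tail."""
    rng = random.Random(seed)
    # Rank individuals from lowest to highest trait value.
    ranked = sorted(range(len(traits)), key=lambda i: traits[i])
    n_affected = max(1, round(prevalence * len(traits)))
    n_unaffected = max(1, round(control_fraction * len(traits)))
    affected = ranked[-n_affected:]
    # Assumption: the unaffected group is the lowest `control_fraction` tail.
    unaffected = ranked[:n_unaffected]
    # Sampling without replacement; intermediate individuals are never drawn.
    cases = rng.sample(affected, n_cases)
    controls = rng.sample(unaffected, n_controls)
    return cases, controls

# Hypothetical population of 20,000 individuals with a standard normal trait.
trait_rng = random.Random(1)
traits = [trait_rng.gauss(0.0, 1.0) for _ in range(20000)]
cases, controls = extreme_phenotype_sample(traits, n_cases=100, n_controls=100)
```

A null study in the spirit of the paragraph above could be drawn the same way after replacing the trait values with independent standard normal draws, so that no variant influences case status.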
For a scenario in which the functional group of interest is a
Power estimates for BOMP, VT, SKAT, KBAC for case-control study with 10,000 individuals. (KBAC1P = minor allele frequency defined as
(PDF)
Distributions of allele frequencies and raw allele counts in simulated European-American and African-American populations. The European-American population consists almost entirely of rare variants, while the African-American population contains a wider range of rare, low-frequency, and common variants. Percentage of variants with allele frequencies and raw allele counts in the designated ranges are shown. Because
(PDF)
Nine multinomial distributions used to construct sets of multiple candidate genes for case-control studies. Each multinomial distribution is named for its dominant disease etiology.
(PDF)
Power of position distribution statistics compared to burden methods and SKAT. Burden tests outperform the position distribution statistic when causal variants are rare and are not clustered, as in our simulations of Rare Variant disease etiology and European-American demographic. The position distribution test outperforms burden tests when the number of rare variants is similar in cases and controls, but where cases and controls differ with respect to the position distribution of the variants, as in simulations of Key Region Variant disease etiology and European-American demographic. Both collapsing burden and position distribution tests outperform SKAT when causal variants are very rare. RareEA = rare variant disease etiology (
(PDF)
Q-Q plot of BOMP P-values for all genes in the Bipolar case-control study. The plot shows no evidence of heavy skew or heavy tails, indicating that there is no systematic bias in our analysis. Empirical P-values are below the line because the BOMP statistic is not continuous for genes with few variants in only a few samples, leading to conservative P-values.
(PNG)
PCA plot showing the overlap of bipolar samples and controls sequenced during rounds 1 and 2. Plot obtained using EIGENSTRAT
(PNG)
Example of how our simulations capture genetic heterogeneity in complex disease. Each horizontal grid line represents a genomic individual. (Cases and controls shown separately.) Each vertical gridline represents a gene. Causal variants (both deleterious and protective) are shown as triangles. Different case individuals have different patterns of causal variants and the allele frequencies of the variants range from rare (1 allele) to common (190 alleles). Causal variants are also observed in the control individuals. Type = Downward pointing triangles are deleterious variants, upward pointing triangles are protective variants. (200 genomic individuals from African-American demographic model are shown).
(PDF)
Flow chart for calculation of mutation burden statistic. The statistic is first calculated from empirical data (by following the blue and black arrows). The null mutation burden statistics are calculated by following the red and black arrows. Key steps in the calculation are: computing the burden for each individual; sorting the individual burdens into a ranked list, where
(PDF)
Window and sequence segmentations. The mutation position distribution statistic requires a segmentation for a sequence of interest (
(PDF)
Flow chart for calculation of position distribution statistic. The statistic is first calculated from empirical data (by following the blue and black arrows). The null position distribution statistics are calculated by following the red and black arrows. Key steps in the calculation are: choice of (user-selected) window sizes
(PDF)
Demographic models of European-American and African-American populations. The models were fit to European-American
(PDF)
BOMP, VT, and SKAT comparison. Approaches to variant collapsing, variant importance, and choice of statistical framework define differences and similarities among BOMP, VT, and SKAT. A. Variant collapsing strategies. VT and BOMP burden both collapse variants across a genomic region. SKAT does not do collapsing and considers variants one at a time. BOMP position distribution collapses variants across local windows over a genomic region. These different collapsing strategies are illustrated in a toy example in
(PDF)
Supplementary Material.
(PDF)
Thanks to Drs. Jonathan Cohen and Helen Hobbs for providing us with data from the Dallas Heart Study on the ANGPTL genes.