The authors have declared that no competing interests exist.

Conceived and designed the experiments: RN XJ. Performed the experiments: XJ. Analyzed the data: RN. Contributed reagents/materials/analysis tools: RN XJ. Wrote the paper: RN XJ.

The interaction between loci to affect phenotype is called

A

We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets.

In Mendelian diseases, a genetic variant at a single locus may give rise to a disease

The ability to identify epistasis is important in understanding the inheritance of many common diseases. For example, studying genetic interactions in cancer is essential to further our understanding of cancer mechanisms at the genetic level. Many cancer-associated mutations and interactions among the mutated loci remain unknown. For example, highly penetrant cancer susceptibility genes, such as BRCA1 and BRCA2, are linked to breast cancer

The most common genetic variation is the

The advent of high-throughput technologies has enabled

However, single-SNP investigations could not detect complex epistatic interactions in which each locus by itself exhibits little or no marginal effect. To fully exploit genomic data and possibly reveal a great deal of the dark matter of genetic risk, it is critical that we analyze such data using multi-locus methods, which investigate

When investigating SNP patterns, we must score the patterns in some way to determine which are most noteworthy. Standard techniques such as linear regression may not work well because both the predictors and the target are discrete. One well-known technique is

A difficulty when learning SNP patterns from high-dimensional GWAS datasets concerns the

Each of these methods has at least one of the following shortcomings: 1) It only investigates two-locus interactions and still requires quadratic time; or 2) It has only been shown to detect interactions in which only one interacting locus has no significant marginal effect. Many of the methods proceed in stages, using the first stage to identify promising SNPs, which are then investigated further in the second stage. Strict epistasis constitutes the worst case in terms of detecting disease associations because such associations are only observable if all interacting SNPs are included in the disease model. None of these two-stage methods made any progress towards detecting strict epistasis. So, Evans et al.

An exhaustive search is not possible when there are millions of SNPs. So some researchers turned their efforts to reducing the search space based on ancillary knowledge. You et al.

However, once the search space is reduced, we can still be left with a large number of SNPs, prohibiting an exhaustive search of even the pruned dataset. Furthermore, in an agnostic study we are searching for possible interactions for which we have no previous knowledge. Therefore, a multi-stage technique that can effectively locate strict epistatic interactions could still be considered the

Initially it might seem that it is not possible to successfully prune our search for strict epistatic

We developed both simulated and semi-synthetic datasets based on models of strict epistasis. We compared the performance of the bound and Bayesian score in their ability to efficiently locate the true pattern. After providing background on Bayesian networks, we discuss specialized Bayesian networks called SNP patterns that represent relationships among SNPs and a disease. Then we show the algorithm used for the comparisons, and we describe the datasets we developed.

Let

It is a theorem

Methods have been developed for learning both the parameters of a BN and its structure (DAG) from data. Pierrier et al.

A

In the constraint-based structure learning approach

A straightforward score, called the

The likelihood in

However, Heckerman et al.

The Bayesian score does not explicitly include a DAG penalty. However, the penalty is implicitly determined by the hyperparameters

The Bayesian score decomposes into the product of local scores, one for each node

This theorem could be used to obtain a bound on the local score that could be obtained by adding parents to a given parent set. However, the bound is very loose (i.e., the bounds are much greater than the scores), and therefore it has not proven useful for pruning the search space.
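To make the local score discussed above concrete, the following is a minimal sketch of the standard log BDeu local score for one node, written for the case where the node is the disease and its parents are SNPs. It does not implement the bound. The function name, the argument names, and the choice to pass the equivalent sample size as `ess` are our own assumptions, and no particular value of `ess` is implied.

```python
from math import lgamma


def log_bdeu_local(counts, num_parent_configs, ess):
    """Log BDeu local score of a node given counts per parent configuration.

    counts: iterable of lists; each list holds the number of records in each
        child state for one observed parent configuration (hypothetical layout).
    num_parent_configs: total number q of parent configurations
        (for k SNPs with three genotypes each, q = 3 ** k).
    ess: equivalent sample size (hyperparameter; value is left to the caller).

    Parent configurations with no records contribute a factor of 1,
    so only observed configurations need to be passed in.
    """
    total = 0.0
    for child_counts in counts:
        r = len(child_counts)              # number of child states
        a_j = ess / num_parent_configs     # prior mass for this configuration
        a_jk = a_j / r                     # prior mass per child state
        total += lgamma(a_j) - lgamma(a_j + sum(child_counts))
        total += sum(lgamma(a_jk + n) - lgamma(a_jk) for n in child_counts)
    return total
```

As a sanity check, a node with no parents, two states, one record in each state, and `ess = 2` has marginal likelihood 1/6, so the function returns the log of 1/6.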

We could develop a DAG model that represents many factors that affect phenotype including inheritable allele variation, somatic mutations in alleles, environmental factors, and epigenetic phenomena such as DNA methylation. A subnetwork of that model contains only variables that represent inheritable allele variation (the variables whose values are obtained in a GWAS). If, for example, we have a 5-way interaction, two 3-way interactions, and two 2-way interactions, there are 15 SNPs in this subnetwork, all of which have edges to the disease node

In the case of a SNP pattern

Using Theorem 1, we can obtain a bound on the local score that could be obtained by adding parents to a given parent set. Given a SNP pattern and the score in

As noted earlier, this is a very loose bound and would not be useful for provably pruning the search space. However, in the case of searching for strict epistatic interactions, perhaps it is asking too much to hope for provable results. We would achieve valuable progress if we could just often heuristically locate such interactions without an exhaustive search. We conjectured that the bound might enable us to do that, and that is what is investigated here.

Suppose we have a dataset concerning SNPs and a disease, and our goal is to find a 2-SNP pattern with a particular BDeu score (

This strategy extends readily to search for

if

if

output

halt;

endif

else

for

endfor

endelse

Algorithm 1 constructs each candidate pattern in the array
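The search strategy can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the names `ordered_patterns`, `search_by_rank`, `per_snp_value`, `score_pattern`, and `threshold` are hypothetical, the per-SNP values (bounds or scores) are assumed to be precomputed, and the enumeration order (every pattern drawn from the top m SNPs is visited before any pattern that uses SNP m+1) is one plausible choice.

```python
from itertools import combinations


def ordered_patterns(ranked_snps, k):
    """Yield k-SNP patterns so that patterns built only from higher-ranked
    SNPs come before any pattern that needs a lower-ranked SNP."""
    for m in range(k - 1, len(ranked_snps)):
        newest = ranked_snps[m]
        # Combine the newly admitted SNP with every (k-1)-subset of those above it.
        for rest in combinations(ranked_snps[:m], k - 1):
            yield rest + (newest,)


def search_by_rank(per_snp_value, score_pattern, k, threshold):
    """Rank SNPs by a per-SNP value (larger is more promising), then score
    k-SNP patterns in rank order and stop at the first one reaching threshold.

    Returns (pattern, number_of_patterns_checked); pattern is None if no
    pattern reaches the threshold.
    """
    ranked = sorted(per_snp_value, key=per_snp_value.get, reverse=True)
    checked = 0
    for pattern in ordered_patterns(ranked, k):
        checked += 1
        if score_pattern(pattern) >= threshold:
            return pattern, checked
    return None, checked
```

Dividing the returned count by the total number of k-SNP patterns gives the fraction of patterns checked. In a toy ordering of five SNPs where the target pair sits at ranks 3 and 4, the sketch reaches it on the sixth pattern.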

Fisher et al.

| # SNPs | MAF | Low Heritability | High Heritability |
| --- | --- | --- | --- |
| 2 | 0.05 | 0.05 | 0.1 |
| 2 | 0.1 | 0.05 | 0.2 |
| 2 | 0.2 | 0.05 | 0.2 |
| 2 | 0.3 | 0.05 | 0.2 |
| 2 | 0.4 | 0.05 | 0.2 |
| 3 | 0.05 | 0.005 | 0.01 |
| 3 | 0.1 | 0.05 | 0.1 |
| 3 | 0.2 | 0.05 | 0.2 |
| 3 | 0.3 | 0.05 | 0.2 |
| 3 | 0.4 | 0.05 | 0.2 |
| 4 | 0.05 | 0.001 | 0.002 |
| 4 | 0.1 | 0.005 | 0.01 |
| 4 | 0.2 | 0.05 | 0.1 |
| 4 | 0.3 | 0.05 | 0.2 |
| 4 | 0.4 | 0.05 | 0.2 |

For each of the 30 combinations of MAF and heritability, we developed datasets in which there were 100 SNPs and 1000 SNPs. Each dataset had 1000 cases and 1000 controls. For each of the 60 variations, 100 datasets were generated, making a total of 6000 datasets. We used the BDeu score with

We did not vary the number of cases and controls because that number does not have a significant effect on the running time. The data need to be preprocessed to obtain the counts needed to compute the bound and the score. The only effect that the number of cases and controls has on the running time is that the preprocessing time increases linearly with the total number of cases and controls.
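The preprocessing step can be illustrated with a small sketch that tallies, for one candidate pattern, how many controls and cases carry each genotype combination. The data layout (one list of genotype codes per individual, phenotype coded 0 for control and 1 for case) and the name `pattern_counts` are assumptions made for illustration. The single pass over individuals is what makes the cost linear in the number of cases and controls.

```python
from collections import defaultdict


def pattern_counts(genotypes, phenotypes, snp_indices):
    """Count controls and cases for each observed genotype combination.

    genotypes: list of per-individual lists of genotype codes (e.g. 0, 1, 2).
    phenotypes: list of 0 (control) or 1 (case), one per individual.
    snp_indices: positions of the SNPs that make up the pattern.

    Returns a dict mapping a genotype tuple to [n_controls, n_cases].
    """
    counts = defaultdict(lambda: [0, 0])
    for row, status in zip(genotypes, phenotypes):
        key = tuple(row[i] for i in snp_indices)
        counts[key][status] += 1
    return dict(counts)
```

The values of the returned dictionary are per-configuration count lists, the form that a local score computation over parent configurations typically consumes.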

Reiman et al.

We developed 2-SNP, 3-SNP, and 4-SNP models of strict epistatic interactions using GAMETES, and used the models to inject interacting SNPs into each of the real GWAS datasets resulting in semi-synthetic datasets. The models generated have the properties shown in

| # SNPs | MAF | Heritability |
| --- | --- | --- |
| 2 | 0.05 | 0.1 |
| 2 | 0.1 | 0.2 |
| 2 | 0.15 | 0.3 |
| 2 | 0.2 | 0.4 |
| 3 | 0.05 | 0.01 |
| 3 | 0.1 | 0.04 |
| 3 | 0.15 | 0.1 |
| 3 | 0.2 | 0.2 |
| 4 | 0.05 | 0.001 |
| 4 | 0.1 | 0.01 |
| 4 | 0.15 | 0.04 |
| 4 | 0.2 | 0.1 |

We used the BDeu score with

For the simulated datasets developed from the 2-SNP models,

| MAF | 100 Data Items / Low Heritability / Bound | 100 Data Items / Low Heritability / Score | 100 Data Items / High Heritability / Bound | 100 Data Items / High Heritability / Score | 1000 Data Items / Low Heritability / Bound | 1000 Data Items / Low Heritability / Score | 1000 Data Items / High Heritability / Bound | 1000 Data Items / High Heritability / Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.05 | 0.007 | 0.042 | 0.007 | 0.042 | 0.008 | 0.044 | 0.008 | 0.048 |
| 0.1 | 0.033 | 0.133 | 0.047 | 0.134 | 0.037 | 0.148 | 0.037 | 0.139 |
| 0.2 | 0.151 | 0.356 | 0.153 | 0.385 | 0.159 | 0.386 | 0.159 | 0.404 |
| 0.3 | 0.348 | 0.594 | 0.349 | 0.554 | 0.359 | 0.571 | 0.362 | 0.555 |
| 0.4 | 0.623 | 0.641 | 0.629 | 0.696 | 0.662 | 0.673 | 0.656 | 0.714 |

| MAF | 100 Data Items / Low Heritability / Bound | 100 Data Items / Low Heritability / Score | 100 Data Items / High Heritability / Bound | 100 Data Items / High Heritability / Score | 1000 Data Items / Low Heritability / Bound | 1000 Data Items / Low Heritability / Score | 1000 Data Items / High Heritability / Bound | 1000 Data Items / High Heritability / Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.05 | 0.001 | 0.015 | 0.001 | 0.017 | 0.0005 | 0.006 | 0.0009 | 0.014 |
| 0.1 | 0.010 | 0.075 | 0.012 | 0.079 | 0.006 | 0.032 | 0.005 | 0.031 |
| 0.2 | 0.072 | 0.312 | 0.073 | 0.309 | 0.064 | 0.316 | 0.063 | 0.307 |
| 0.3 | 0.232 | 0.557 | 0.238 | 0.566 | 0.228 | 0.557 | 0.225 | 0.579 |
| 0.4 | 0.554 | 0.699 | 0.566 | 0.770 | 0.554 | 0.705 | 0.552 | 0.738 |

| MAF | 100 Data Items / Low Heritability / Bound | 100 Data Items / Low Heritability / Score | 100 Data Items / High Heritability / Bound | 100 Data Items / High Heritability / Score | 1000 Data Items / Low Heritability / Bound | 1000 Data Items / Low Heritability / Score | 1000 Data Items / High Heritability / Bound | 1000 Data Items / High Heritability / Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.05 | 0.001 | 0.007 | 0.002 | 0.012 | 0.0004 | 0.004 | 0.0005 | 0.015 |
| 0.1 | 0.005 | 0.052 | 0.005 | 0.053 | 0.001 | 0.036 | 0.001 | 0.037 |
| 0.2 | 0.048 | 0.271 | 0.048 | 0.275 | 0.028 | 0.262 | 0.028 | 0.283 |
| 0.3 | 0.170 | 0.554 | 0.165 | 0.544 | 0.139 | 0.534 | 0.137 | 0.552 |
| 0.4 | 0.486 | 0.674 | 0.514 | 0.751 | 0.471 | 0.751 | 0.466 | 0.713 |

These tables and figures show that the bound usually performs much better than we would expect by chance, and also performs substantially better than the score. When the MAF is small (0.05) the average fraction of patterns checked by the bound is at most around 0.008. By chance alone we would expect that average to be around 0.5. In general, the average fraction increases as the MAF increases. On the other hand, the heritability and the dimension of the dataset have little effect on the average fraction. It is somewhat surprising that the bound performs about as well when the heritability is low as when it is high. Since unknown genetic risk might confer low heritability, this is an encouraging result.
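The chance baseline of about 0.5 can be justified with a short calculation. As an assumption for illustration, let N be the total number of candidate patterns and let R be the position of the true pattern in a random ordering, so that R is equally likely to be any of 1, ..., N (N and R are our notation, not the paper's).

```latex
\mathbb{E}\!\left[\frac{R}{N}\right]
  = \frac{1}{N}\sum_{r=1}^{N} r \cdot \frac{1}{N}
  = \frac{1}{N^{2}} \cdot \frac{N(N+1)}{2}
  = \frac{N+1}{2N}
  \approx \frac{1}{2}
  \qquad (N \gg 1)
```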

The other variable that affects the performance of the bound is the number of SNPs involved in the epistatic interaction. That is, the performance improves as the number of SNPs increases.

In order to make these results more transparent, we did an analysis using actual times. As noted above, the performance does not seem to degrade as the dimension of the datasets increases (i.e., the results are about the same for the 100 SNP datasets and for the 1000 SNP datasets). If we assume that this result holds true for larger dimensions, then

| Model | 1000 SNPs / Bound | 1000 SNPs / Exhaustive | 10,000 SNPs / Bound | 10,000 SNPs / Exhaustive | 100,000 SNPs / Bound | 100,000 SNPs / Exhaustive |
| --- | --- | --- | --- | --- | --- | --- |
| 2-SNP | 10 sec | 12 min | 16 min | 20 hours | 28 hours | 84 days |
| 3-SNP | 7 min | 3 days | 5 days | 8 years | 14 years | 7,662 years |
| 4-SNP | 1 day | 2 years | 14 years | 19,146 years | | |

Notice that the bound always performs about an order of magnitude better than the exhaustive search as far as the dimension that it can handle in an acceptable amount of time. That is, in the case of the 2-SNP models the bound can handle 100,000 SNPs whereas the exhaustive search can only handle 10,000 SNPs; in the case of the 3-SNP models the bound can handle 10,000 SNPs whereas the exhaustive search can only handle 1000 SNPs; and in the case of the 4-SNP models the bound can handle 1000 SNPs whereas the exhaustive search cannot.
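The growth of the exhaustive search with dimension can be checked with a short calculation, since an exhaustive search must score every k-SNP pattern. The sketch below counts patterns and computes the factor by which that count grows between two dimensions. The function names are hypothetical, and this is only a sanity check on scaling, not the procedure used to produce the reported times.

```python
from math import comb


def pattern_count(num_snps, k):
    """Number of distinct k-SNP patterns among num_snps SNPs."""
    return comb(num_snps, k)


def exhaustive_scaling(n_small, n_large, k):
    """Factor by which the number of k-SNP patterns grows from n_small to n_large SNPs."""
    return comb(n_large, k) / comb(n_small, k)
```

For example, `exhaustive_scaling(1000, 10000, 2)` is roughly 100 and `exhaustive_scaling(1000, 10000, 3)` is roughly 1000, in line with a tenfold increase in dimension costing about 10^k times as much exhaustive work.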

These results indicate that in the case of pure epistasis the bound can often locate the true pattern much sooner than would be expected by chance, and offers a substantial improvement over using the score to locate that pattern.

For the LOAD datasets,

| MAF | 2-SNP / Bound | 2-SNP / Score | 3-SNP / Bound | 3-SNP / Score | 4-SNP / Bound | 4-SNP / Score |
| --- | --- | --- | --- | --- | --- | --- |
| 0.05 | 0.008 | 0.054 | 0.0003 | 0.017 | 0.0001 | 0.006 |
| 0.1 | 0.043 | 0.170 | 0.009 | 0.085 | 0.002 | 0.044 |
| 0.15 | 0.102 | 0.274 | 0.035 | 0.206 | 0.012 | 0.124 |
| 0.2 | 0.187 | 0.393 | 0.083 | 0.353 | 0.037 | 0.281 |

| MAF | 2-SNP / Bound | 2-SNP / Score | 3-SNP / Bound | 3-SNP / Score | 4-SNP / Bound | 4-SNP / Score |
| --- | --- | --- | --- | --- | --- | --- |
| 0.05 | 0.008 | 0.045 | 0.0008 | 0.014 | 0.00009 | 0.004 |
| 0.1 | 0.045 | 0.137 | 0.010 | 0.079 | 0.002 | 0.045 |
| 0.15 | 0.109 | 0.272 | 0.038 | 0.173 | 0.013 | 0.120 |
| 0.2 | 0.199 | 0.370 | 0.092 | 0.283 | 0.041 | 0.258 |

The results are similar to those for the simulated datasets. In the case of both the LOAD and the breast cancer datasets, when the MAF is small (0.05 or 0.1) the bound never checks more than 5% of the patterns before finding the pattern representing the interacting SNPs, and it always checks substantially fewer patterns than the score.

Consider the following example. In the case of the LOAD dataset, there are

It is well known that the APOE gene is associated with LOAD

Using this same dataset, Jiang et al.

This discovery of a possible interaction between GAB2 and APOE was achieved because of APOE's high marginal effect.

| Locus | 1-SNP Score | 2-SNP Bound | 2-SNP Score |
| --- | --- | --- | --- |
| APOE | −836.08 | −13.74 | – |
| rs1007837 (GAB2) | −947.88 | −17.50 | −831.33 |
| rs7101429 (GAB2) | −947.16 | −17.53 | −830.58 |
| rs901104 (GAB2) | −947.69 | −17.57 | −830.69 |
| rs4291702 (GAB2) | −947.36 | −17.66 | −830.51 |
| rs4945261 (GAB2) | −947.83 | −17.59 | −831.68 |
| rs7115850 (GAB2) | −945.19 | −17.89 | −827.261 |
| rs10793294 (GAB2) | −947.24 | −18.39 | −830.84 |
| rs2450130 (GAB2) | −948.28 | −17.60 | −830.49 |
| S1 (0.05) | −949.25 | −14.60 | −836.31 |
| S2 (0.05) | −949.12 | −14.42 | −836.31 |
| S1 (0.1) | −964.67 | −16.42 | −761.12 |
| S2 (0.1) | −964.50 | −16.24 | −761.12 |
| S1 (0.15) | −950.29 | −17.56 | −668.48 |
| S2 (0.15) | −950.16 | −17.54 | −668.48 |
| S1 (0.20) | −950.56 | −18.42 | −612.31 |
| S2 (0.20) | −950.36 | −18.60 | −612.31 |

The 2-SNP scores for the real loci are the scores of the patterns in which the other locus is APOE. The 2-SNP scores for the injected loci are the scores of the injected patterns. For the injected SNPs, the MAFs are shown in parentheses.

It might seem odd that the average bounds for the interacting SNPs are around −18 when the MAF is 0.2 (see ^{rd} and 4^{th} SNPs, Algorithm 1 would discover the pattern containing them after investigating only 6 patterns.

Finally, note that the bound is very loose. Therefore, it would not be useful in a best-first search algorithm that prunes SNPs based on their bounds and is able to guarantee that we discovered the highest-scoring pattern. However, as we have seen, the bound can be quite effective for guiding a heuristic search for high-scoring patterns. In general, when we are searching for possible epistatic interactions, our concern is with finding likely patterns that we can then further investigate for biological plausibility. It is not necessary that we know that a discovered interaction has the highest score of all patterns.

We identified a bound on the Bayesian score of any SNP pattern that could be obtained by expanding a given SNP pattern. Using simulated datasets based on models of strict epistasis, we showed that the bound can locate the true pattern, when searching moderate-dimensional datasets, much faster than can be expected by chance. Using semi-synthetic datasets based on models of strict epistasis, we showed that the bound can locate the true pattern, when searching high-dimensional GWAS datasets, much faster than can be expected by chance. The average fraction of patterns checked before finding the true pattern was as little as 0.0004. These results indicate that the bound can be an extremely useful tool in algorithms that search high-dimensional datasets for strict epistatic interactions.

We used an algorithm that sorts the SNPs by their bounds and by their scores to test the effectiveness of the bound. Although this was an effective way to compare the bound to the score, in practice it may not be the most effective way to use the bound in a heuristic search. First, the algorithm assumes we know the number of interacting SNPs up front. Second, if, for example, we were looking for a 4-SNP interaction, we would be able to visit very few different SNPs. We plan to incorporate the bound into other algorithms we have developed such as MBS