Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification. Variable importance results and T-Trees source code are all available at www.montefiore.ulg.ac.be/~botta/ttrees/ and github.com/0asa/TTree-source respectively.

Figure S04: On CD_wtccc: T-Trees prediction performance with 10-SNP blocks and IC = 10. The five upper panels show the influence of T and N_min for each of the five investigated values of K. The last panel displays the evolution of the AUC as the value of K increases. For each value of K, the plain (resp. dotted) line plots the maximum (resp. minimum) AUC over all possible values of T and N_min.

Figure S05: The first 100 variables according to the tree-based importance rankings for CD_qc. The horizontal axis corresponds to the ranks and the vertical axis to the variable importances. In the first column, variables are ordered according to the Random Forest variable importances; in the second column, according to the T-Trees variable importances. In the first row, red highlights the nine reported regions and blue highlights two more regions mostly detected by tree-based methods. In the second row, purple corresponds to rare variants (MAF < 0.05). In the third row, green represents markers with a low Fisher p-value (< 10⁻⁶).

Figure S06: The first 100 variables according to the tree-based importance rankings for CD_wtccc. The horizontal axis corresponds to the ranks and the vertical axis to the variable importances. In the first column, variables are ordered according to the Random Forest variable importances; in the second column, according to the T-Trees variable importances. In the first row, red highlights the nine reported regions and blue highlights two more regions mostly detected by tree-based methods. In the second row, purple corresponds to rare variants (MAF < 0.05). In the third row, orange highlights SNPs deviating from Hardy-Weinberg equilibrium and, in the last row, green represents markers with a low Fisher p-value (< 10⁻⁶).

Supplementary methods

T-Trees Algorithms
Algorithm 3: T-Trees
input: LS, B, T, K, K_int, IC, N_min
output: T

The T-Trees algorithm is quite similar to the Random Forest algorithm. It adds two meta-parameters, K_int and IC, and requires a block map B. The pickGroupSplit function is based on the Extra-Trees algorithm: a single Extra-Tree is built, and its predictions transform a group of attributes into a new numerical value.
Algorithm 4: BuildExtraTTree
input: LS, g, K_int, IC
output: a tree t

// Its root node
if const(attributes) or const(output) or #nodes ≥ IC then
    return a leaf labeled by class frequencies in LS
else
    select K_int random attributes in g: {a_1, ..., a_K_int}
    {s_1, ..., s_K_int}:

Inside the outer nodes, the weak learner that is used is a single Extra-Tree with an IC-limited number of (inner) test nodes.
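To make the recursion concrete, the following is a minimal sketch of an IC-limited Extra-Tree used as a group transformer. The data layout, function names, and the counter mechanism are ours, not the paper's, and for brevity the first valid random split is kept rather than the best-scoring of the K_int candidates:

```python
import random

def build_extra_ttree(samples, labels, attrs, k_int, ic, counter=None):
    """Grow one Extra-Tree on the attribute group `attrs`, stopping once `ic`
    internal test nodes have been created. Samples are dicts mapping an
    attribute name to a genotype value coded 0, 1 or 2."""
    if counter is None:
        counter = [0]  # shared count of test nodes in this tree
    # Leaf if the output is constant or the node budget is spent.
    if counter[0] >= ic or len(set(labels)) <= 1:
        return ("leaf", sum(labels) / len(labels))  # proportion of cases
    # Extra-Trees style: draw k_int random attributes, each with a random cut-point.
    for attr in random.sample(attrs, min(k_int, len(attrs))):
        threshold = random.uniform(0.0, 2.0)
        left = [i for i, s in enumerate(samples) if s[attr] < threshold]
        right = [i for i, s in enumerate(samples) if s[attr] >= threshold]
        if left and right:
            counter[0] += 1  # one more (inner) test node
            return ("test", attr, threshold,
                    build_extra_ttree([samples[i] for i in left],
                                      [labels[i] for i in left],
                                      attrs, k_int, ic, counter),
                    build_extra_ttree([samples[i] for i in right],
                                      [labels[i] for i in right],
                                      attrs, k_int, ic, counter))
    # No candidate split separated the sample: fall back to a leaf.
    return ("leaf", sum(labels) / len(labels))

def predict_group_value(tree, sample):
    """Propagate one sample to a leaf; the leaf's case-proportion is the
    new numerical value summarizing the whole attribute group."""
    while tree[0] == "test":
        _, attr, threshold, left, right = tree
        tree = left if sample[attr] < threshold else right
    return tree[1]
```

The shared counter enforces the IC limit globally across the tree rather than per branch, which is what bounds the weak learner's complexity.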

Implementation details

Evaluation of the splits
The score measures we used in our experiments are based on the well-known logarithmic or Shannon entropy. Let t denote a test outcome at a node of a decision tree and c the class which we are trying to predict; t and c are two discrete random variables of respective distributions (p(t_1), ..., p(t_k)) and (p(c_1), ..., p(c_m)) (in our case, since from a machine learning point of view this is a binary classification problem and the decision trees are binary trees, m = k = 2). Basically, the class entropy measures the impurity at a given node:

H_C(node) = − p_case log2(p_case) − p_control log2(p_control),

where p_case (resp. p_control) corresponds to the proportion of cases (resp. controls) reaching the current node. Similarly, the test entropy is defined as follows:

H_T(node) = − p_left log2(p_left) − p_right log2(p_right),

where p_left (resp. p_right) denotes the proportion of objects propagated to the left (resp. right) at the current test node. We can also define the average conditional entropy of the class given the test:

H_{C|T}(node) = p_left H_C(node_left) + p_right H_C(node_right).

Thus, a score measure can be defined as follows:

I_TC(node) = H_C(node) − p_left H_C(node_left) − p_right H_C(node_right),

which corresponds to the difference between the current node impurity and the weighted impurity of the two resulting child nodes. It reflects the goal of the tree induction, which aims to reduce the impurity at each test node: the split that maximizes this score is the one that reduces the class entropy the most from one node to its descendants. This quantity is also called the mutual information I_TC, and it quantifies the reduction of the uncertainty of c given t. As this information quantity is upper bounded by the prior entropy H_C(node), the measure is sensitive to the number and prior distribution of classes, which makes it difficult to interpret. In the context of decision tree induction, it has also been observed to favor tests with a larger number of outcomes. For these reasons, various normalizations have been introduced and are discussed in detail in [Weh96].
Among these, we do not recommend using the "gain ratio" as it suffers from the "end-cut" preference ([Tor01]), which we observed to be biased towards SNPs with a low minor allele frequency. Because of its symmetry in C and T, we chose the following normalization for our score measure:

C_TC(node) = 2 I_TC(node) / (H_C(node) + H_T(node))
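As a concrete illustration, the entropies and the normalized score can be computed directly from the case/control counts on each side of a candidate split. This is a minimal sketch assuming the symmetric normalization 2 I_TC / (H_C + H_T); function names are ours:

```python
from math import log2

def entropy(p):
    """Shannon entropy of a two-class distribution with case-proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def split_score(case_left, control_left, case_right, control_right):
    """Normalized score 2*I_TC / (H_C + H_T) of a binary split."""
    n_left = case_left + control_left
    n_right = case_right + control_right
    n = n_left + n_right
    h_class = entropy((case_left + case_right) / n)           # H_C(node)
    h_test = entropy(n_left / n)                              # H_T(node)
    h_cond = (n_left / n) * entropy(case_left / n_left) \
        + (n_right / n) * entropy(case_right / n_right)       # H_{C|T}(node)
    i_tc = h_class - h_cond                                   # mutual information
    denom = h_class + h_test
    return 0.0 if denom == 0.0 else 2.0 * i_tc / denom
```

A split that perfectly separates cases from controls scores 1, while a split that leaves the class proportions unchanged on both sides scores 0.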

Complexity control
In some situations, it may be useful to prevent a node from being split further. In our tree-based methods, we use two types of complexity control parameters: • N_min: this (user-defined) number corresponds to the required minimum number of objects (i.e. local sub-sample size) reaching a node for it to continue splitting. For example, setting N_min to n − 1 will produce a one-level decision tree (a.k.a. a decision stump). In practice, a simple condition is added to check whether or not the learning set is big enough to create a new split. The typical default value for N_min is 2, meaning that the tree is fully developed.
• N_node: this limit corresponds to the maximum number of test nodes allowed in a tree. Similarly, setting N_node to 1 will produce a decision stump. In our T-Trees method, this parameter, also referred to as the internal complexity IC, is used to limit the number of inner nodes in the weak learners.
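The two stopping conditions can be combined into a single guard evaluated before each split; a minimal sketch (parameter names are ours):

```python
def may_split(n_objects, n_test_nodes, n_min=2, n_node=None):
    """Return True if a node may be split further. `n_min` is the minimum
    local sub-sample size required to split; `n_node` (the IC limit in
    T-Trees) caps the number of test nodes already created in the tree."""
    if n_objects < n_min:
        return False  # learning set too small to create a new split
    if n_node is not None and n_test_nodes >= n_node:
        return False  # tree complexity budget already spent
    return True
```

With the defaults (n_min=2, no node limit), trees are fully developed, matching the behavior described above.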

Labeling the leaves
We keep in the leaves the proportion of objects of each class that reached that terminal node. That proportion somehow reflects the confidence of the corresponding prediction. For example, an object reaching a terminal node with 98 cases and only 2 controls is more likely to be classified as a case than another object that arrives in a leaf with 32 cases and 25 controls. The T predictions of the T trees in a forest are aggregated following a "soft" voting approach, in which the prediction becomes the average class-probability. An object is propagated into each tree, leading that object to T different leaves. The probability vectors v_i associated with these T leaves are averaged as follows:

v = (1/T) Σ_{i=1..T} v_i,

and the prediction for the propagated object becomes the average class-probability vector v. This aggregation method is also referred to as 'soft' class-probability aggregation. In the binary classification case, the resulting prediction is v = (c_0, c_1), where c_0 (resp. c_1) corresponds to the probability of being classified as an object of class 0 (resp. class 1), and c_1 = 1 − c_0.
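The soft aggregation step amounts to a component-wise mean over the T leaf vectors; a minimal sketch:

```python
def soft_vote(leaf_vectors):
    """'Soft' class-probability aggregation: average the probability vectors
    collected from the T leaves reached by one object, one leaf per tree."""
    t = len(leaf_vectors)
    n_classes = len(leaf_vectors[0])
    return tuple(sum(v[c] for v in leaf_vectors) / t
                 for c in range(n_classes))
```

For instance, averaging two leaves (0.98, 0.02) and (0.56, 0.44) yields (0.77, 0.23), so the object is predicted as class 0 with probability 0.77.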

Variable importances
We use the mutual information for that purpose. For each variable x_i used (possibly more than once) in a decision tree, we compute its importance as follows:

Imp(x_i) = Σ_{n ∈ Nodes(x_i)} p_n I_TC(n),   (15)

where Nodes(x_i) is the set of tree nodes where the variable x_i is used to split, p_n denotes the relative sample size of node n, and I_TC(n) is the local reduction of entropy resulting from the selected split at this node.
In this way, variables appearing in many and "bigger" nodes (i.e. closer to the root) are deemed more important than the others.
Similarly, variable importances are computed for tree ensembles by using the mutual information. In Equation 15, instead of looking at the nodes of a single tree, the set Nodes(x_i) now becomes the set of all nodes in the forest of T trees where variable x_i is used.
As such, however, this measure depends on the number of trees T in the ensemble, on the initial impurity of the dataset, and on the impurity reduction yielded by the trees, and is thus difficult to interpret. We therefore chose the following normalization, so that all variable importances sum to 1:

Imp_norm(x_i) = Imp(x_i) / Σ_j Imp(x_j)
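The aggregation and normalization can be sketched as follows, given one (variable, p_n, I_TC) triple per test node of the forest (the data layout is ours, for illustration):

```python
from collections import defaultdict

def variable_importances(test_nodes):
    """Aggregate per-node impurity reductions into normalized importances.
    `test_nodes` holds one (variable, p_n, i_tc) triple per test node of
    the forest: p_n is the relative sample size of the node and i_tc the
    local mutual information. The returned importances sum to 1."""
    raw = defaultdict(float)
    for var, p_n, i_tc in test_nodes:
        raw[var] += p_n * i_tc  # Equation 15, summed over the whole forest
    total = sum(raw.values())
    return {var: imp / total for var, imp in raw.items()}
```

Because the result is normalized, importances are comparable across forests of different sizes and across datasets of different initial impurity.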