
Conceived and designed the experiments: JC DH. Performed the experiments: JC. Analyzed the data: JC DH. Wrote the paper: JC DH. Other: Provided HLA and HIV data: SM. Implemented the conditional model: CK.

The authors have declared that no competing interests exist.

Population structure can confound the identification of correlations in biological data. Such confounding has been recognized in multiple biological disciplines, resulting in a disparate collection of proposed solutions. We examine several methods that correct for confounding on discrete data with hierarchical population structure and identify two distinct confounding processes, which we call coevolution and conditional influence. We describe these processes in terms of generative models and show that these generative models can be used to correct for the confounding effects. Finally, we apply the models to three applications: identification of escape mutations in HIV-1 in response to specific HLA-mediated immune pressure, prediction of coevolving residues in an HIV-1 peptide, and a search for genotypes that are associated with bacterial resistance traits in Arabidopsis.

There is now clear recognition across several application areas that population structure can confound the statistical identification of associations. An area where this problem was recognized early is the identification of coevolving traits. Felsenstein described the problem and proposed a solution for quantitative traits

Another important application area where population structure can confound the identification of associations is genome-wide association (GWA) studies

In this paper, we examine in detail several methods that correct for confounding on discrete data with hierarchical population structure. To introduce these methods, let us consider how confounding may arise when the population structure is due to phylogenetic relationships. In particular, consider the problem of identifying HLA-mediated immune pressure on HIV-1 intrahost adaptation. Moore

Suppose these sequences have the phylogeny shown in

Simple statistical methods such as Fisher's exact test assume the data to be infinitely exchangeable or independent and identically distributed (IID). Although sequence data and other biological data are IID a priori, they are not IID once we learn their hierarchical structure. Furthermore, as we have just seen, this structure can easily confound the statistical search for associations within such data.
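To make the IID assumption concrete, the following is a minimal sketch of a two-sided Fisher's exact test on a 2×2 contingency table (for example, presence or absence of an HLA allele against presence or absence of a mutation). The function name and the interface are illustrative assumptions, not part of the software described in this paper. Note that the test sees only the four counts: it treats every individual as an independent draw and has no way to account for individuals who share a trait merely because they share an ancestor.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

Because the p-value depends only on the counts, two data sets with identical tables receive identical p-values even if, in one of them, all individuals carrying the mutation descend from a single recent ancestor.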

An important point that has not been emphasized in previous work is that different applications may involve different evolutionary processes leading to different kinds of confounding and requiring different solutions. For example, in one process,

Both processes lead to confounding, as evidenced by the example in

Using synthetic data, we show that the coevolution model better addresses the confounding of the coevolution process, and the conditional model better addresses the confounding of the conditional process. In addition, we apply these two models to real examples, including the identification of escape mutations in HIV-1 in response to specific HLA-mediated immune pressure and the prediction of coevolving residues in an HIV-1 peptide, and find that no one model is best for all applications.

So far, we have considered only phylogeny as a source of hierarchical population structure. Nonetheless, we will also explore the use of these generative models, which incorporate evolutionary processes, to address hierarchical population structure in its more general case. In particular, we apply these models to a genomic search for genotypes that are associated with bacterial resistance traits in Arabidopsis.

We have implemented methods for fitting these models in a package called PhyloD, which is available on the internet as both a web-based application and downloadable source at

The models that we describe are generative models, also known as directed acyclic graphical models

To represent the hierarchical population structure, our generative models use the same machinery as that found in the maximum-likelihood phylogenetic tree

Before we consider models of associations between variables, let us consider the situation where a _{1}, … , _{N}_{i}_{ij}_{j}_{j}

(a) The single-variable model for _{i}_{i}_{i}

As mentioned, given a set of observations for _{1}, … , _{N}_{Y}_{Y}

This generative model is reversible, since π_{i}_{ij}_{j}_{ji}_{1}, … , _{N}
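The following is a minimal sketch of how a single-variable likelihood of this kind could be computed and how reversibility can be checked numerically. It assumes a two-state continuous-time Markov process in which, along a branch of length t, the probability of ending in state j is π_j(1 − e^{−λt}), plus e^{−λt} when j equals the starting state. This parameterization, the nested-list tree encoding, and all function names are assumptions made for illustration and are not necessarily the form implemented in PhyloD.

```python
from math import exp

def transition(i, j, pi1, lam, t):
    """P(child = j | parent = i) along a branch of length t."""
    pi = (1.0 - pi1, pi1)
    stay = exp(-lam * t)
    return pi[j] * (1.0 - stay) + (stay if i == j else 0.0)

def partial(node, pi1, lam):
    """Felsenstein pruning: return (L0, L1), the likelihood of the data
    below `node` given that the node is in state 0 or state 1.

    A leaf is an int (0 or 1); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, int):
        return (1.0, 0.0) if node == 0 else (0.0, 1.0)
    like = [1.0, 1.0]
    for child, t in node:
        sub = partial(child, pi1, lam)
        for i in (0, 1):
            like[i] *= sum(transition(i, j, pi1, lam, t) * sub[j]
                           for j in (0, 1))
    return tuple(like)

def likelihood(root, pi1, lam):
    """Likelihood of the leaf observations, with the root drawn from pi."""
    l0, l1 = partial(root, pi1, lam)
    return (1.0 - pi1) * l0 + pi1 * l1

# The same unrooted tree, rooted at two different internal nodes.
tree_a = [(1, 0.3), ([(0, 0.2), (1, 0.4)], 0.5)]
tree_b = [(0, 0.2), (1, 0.4), ([(1, 0.3)], 0.5)]
```

Under this assumed process, detailed balance holds between the stationary and transition probabilities, so `likelihood(tree_a, pi1, lam)` and `likelihood(tree_b, pi1, lam)` agree for any choice of `pi1` and `lam`; the placement of the root among internal nodes does not change the likelihood.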

Now let us consider two binary variables _{X}_{Y}_{xy}_{xy̅}_{x̅}_{y}, and

This model, first presented by Pollock and colleagues

To determine the significance or strength of the correlation between

In some situations, it may be reasonable to assume that

We present a model that captures this notion in Bhattacharya _{i}_{i}_{i}_{i}_{i}_{i}_{i}

As in the undirected joint case, we can use both frequentist (likelihood ratio test) and Bayesian (BIC) methods to determine the degree to which each of these models better explains the data for

Although it is possible that, for example, both escape and reversion processes are acting at the same time, we have found that allowing for two or more processes at the same time dramatically reduces the power of these models. Consequently, in our experimental evaluations, for any given pair of variables, we choose one model from the set above that best explains the data for those variables. We can use both frequentist and Bayesian methods for making this choice. In the frequentist case, we choose the model with the lowest

The conditional model is reversible in the sense that the choice of root node among non-tip nodes does not affect the likelihood of the data. We also note that this model is a (discrete) mixed-effects model, wherein the predictor variables _{i}_{i}

We evaluate and compare our models on both synthetic and real data sets. On synthetic data, we examine two questions. First, does taking hierarchical population structure into account improve the analysis? For example, when we generate data from an undirected joint or conditional model, do these models perform better than Fisher's exact test (FET), which assumes the data to be IID? Second, when we generate data from an undirected joint model, does that model fit the data better than the conditional model, and vice versa?

We use two criteria to measure the performance of a model. First, we measure the ability of the model to discriminate true from false correlations via plots of true positive rate versus one minus positive predictive value. In particular, we compute a _{0} to be a real association. We then use the known true and false associations in the synthetic data to compute true positive rate and one minus positive predictive value for that threshold. Finally, we allow _{0} to vary, producing a
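The discrimination curve described above can be sketched as follows. This is an illustrative implementation under stated assumptions: each tested association carries a significance score for which smaller means more significant, the ground truth is known (as in synthetic data), and the function and argument names are hypothetical.

```python
def discrimination_curve(scores, is_true):
    """Return (one_minus_ppv, tpr) points, one per distinct threshold.

    scores  : significance scores, smaller meaning more significant
    is_true : parallel booleans marking the known true associations
    """
    total_true = sum(is_true)
    points = []
    for threshold in sorted(set(scores)):
        # Call every association at or below the threshold "real".
        called = [t for s, t in zip(scores, is_true) if s <= threshold]
        true_pos = sum(called)
        tpr = true_pos / total_true if total_true else 0.0
        ppv = true_pos / len(called)
        points.append((1.0 - ppv, tpr))
    return points
```

A method that discriminates well reaches a high true positive rate while one minus positive predictive value is still small, so its curve rises steeply near the left edge of the plot.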

On real data, we examine whether the undirected joint model or conditional model better represents the data. Because we do not know whether a discovered association is real or not, we cannot use the discrimination curve or calibration criterion to evaluate performance. Furthermore, because neither model is nested within the other, we turn to Bayesian methods, in particular the Bayesian Information Criterion (BIC), for comparison. The BIC score for a model with maximum-likelihood parameters θ̂ and
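As a minimal sketch, the BIC comparison can be written as below. It assumes the standard form of the criterion, the maximized log-likelihood minus (d/2) log N for d free parameters and N observations, with the sign convention that a higher score indicates a better model. The function and argument names are hypothetical.

```python
from math import log

def bic_score(log_likelihood, num_params, num_obs):
    """BIC under the 'higher is better' convention.

    log_likelihood : log p(data | maximum-likelihood parameters)
    num_params     : number of free parameters in the model
    num_obs        : number of observations
    """
    return log_likelihood - 0.5 * num_params * log(num_obs)

def preferred_model(scores):
    """Return the name of the model with the highest BIC score.

    scores : dict mapping a model name to its BIC score
    """
    return max(scores, key=scores.get)
```

The penalty term lets models with different numbers of parameters be compared even when neither is nested within the other, which is the situation for the undirected joint and conditional models.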

To compare the conditional model, a model for

In our applications, we sometimes find it useful to test whether a variable _{1}, … , _{N}

We constructed two synthetic data sets, one generated by the process of conditional influence (i.e., by a conditional model) and the other generated by coevolution (i.e., the undirected joint model). The data sets representing conditional influence and coevolution were patterned after the real data sets of Application 1 (effects of immune pressure on HIV mutation) and Application 2 (pairwise amino-acid correlations), respectively. In particular, the sample size (

As expected, when the data were generated according to the undirected joint model, the joint model was most discriminating, followed by the conditional model (

The data closely resemble pairwise amino-acid association data (Application 2).

Computing

The data closely resemble the HLA-amino-acid association data (Application 1).

In both examples, the

To validate our use of BIC to evaluate model performance on real data, we compared the BIC scores of the undirected joint and conditional models on the synthetic data sets representing conditional influence and coevolution. As expected, we found that the conditional model had a significantly higher score than the undirected joint model on the conditional-influence data set (^{−53},

Finally, we note that, over a wide variety of data sets, the conditional model runs an order of magnitude faster than the undirected joint model, which requires optimization over a larger number of parameters and more complex eigendecompositions.

Our approach raises the question of how sensitive the results are to the structure of the tree used by the models. To address this question, we ran the conditional model on the synthetic conditional influence data using four different trees: the tree used to generate the data (_{gen}_{gen}_{rand}_{ML}_{pars}

As expected, the conditional model performed best using _{gen}_{M}_{L} and _{pars}_{rand}_{gen}_{gen}_{rand}_{ML}_{pars}

Although it may seem counter-intuitive that the randomized tree could find any associations, we note that the problem with the conditional model using _{rand}_{rand}_{gen}

To investigate the effects of immune pressure on HIV evolution, Moore

First, we constructed a phylogenetic tree from the full set of sequences (see

Our results using BIC show that the conditional model better explains the notC1701 data, (^{−24},

In this application, we were fortunate that additional information was available to help confirm the HLA-sequence associations that we found. In particular, a known epitope in the vicinity of a found association supports the validity of that association, as immune pressure is focused on epitopes and the immediate surrounding regions that participate in the presentation of those epitopes on the HLA molecules at the cell surface

Ground truth was estimated by identifying known epitopes within three residues of the predicted association.

The associations found by the conditional model with

Position | HLA allele | p-value | q-value |
242 | B*5701 | 4.3E-08 | <0.03 |
28 | A*0301 | 1.5E-07 | <0.03 |
242 | B*5801 | 3.2E-06 | 0.03 |
147 | C*0602 | 5.0E-06 | 0.03 |
26 | C*0303 | 6.9E-06 | 0.05 |
482 | B*4001 | 2.8E-05 | 0.10 |
397 | A*3101 | 3.8E-05 | 0.13 |
495 | B*4701 | 6.9E-05 | 0.17 |

Identification of pairwise correlations between amino acids is important to many areas of biology, as correlations can indicate structural or functional interaction

Continuing our focus on HIV, we applied both the undirected joint and the conditional model to the sequence data from the Western Australia cohort ^{55} polyprotein. This fifty-two-amino-acid protein was chosen because it is the shortest HIV protein, making pairwise amino-acid tests feasible for all models. We fit the conditional model in both directions (making both

Remarkably, the BIC scores of the conditional model are significantly higher than those of the joint model (^{−100},

We have developed a tool for visualizing the network of dependencies (

The fifty-two consensus amino acids of P6 are drawn as a circle, with the N-terminal end shown at the far right and the protein extending counter-clockwise. Each arc represents an association predicted by the conditional model that is significant at

Aranzana

Although

When applying our conditional model to this application, it is not clear whether the target variables should be haplotypes or phenotypes. In general, genetic variations directly influence phenotypes, but phenotypes also indirectly influence haplotypes through selection pressure. As two thirds of both variables followed the tree, we ran the conditional model in both directions, once using the phenotypes as the target and once using them as the predictor, using BIC to determine which direction was best for any given haplotype-phenotype pair.

We found that the BIC scores for the conditional and undirected joint models were not significantly different (

4681 haplotypes were compared against each of the three bacterial response phenotypes, Rpm1 (top), Rpt2 (middle) and Pph3 (bottom). For each haplotype, the four conditional models were run and negative log_{10} of the most significant

Our synthetic tests indicate that the conditional method is well calibrated, implying that roughly 80% of the associations we find with a

Evolutionary biologists have long been interested in studying correlated traits in the midst of population structure due to phylogenetic relationships (for reviews, see

Population structure in biological data has also been addressed in the area of GWA studies, where rather different approaches have been used to compensate for it. A more commonly used approach attempts to recalibrate standard statistics by normalizing results according to the distribution of the statistic across the entire genome

Similar to what happened in the GWA community, those studying amino acid coevolution initially ignored the phylogenetic structure of the protein sequences

We have identified two evolutionary processes that can confound association analyses and have defined two corresponding generative models for discrete data that can correct for and even leverage the existence of these processes. We have found that explicitly modeling evolutionary processes increases discriminatory power and results in well-calibrated estimates of one minus positive predictive value. We have implemented methods for fitting these models to data and a tool for visualizing the results of the analysis. These tools are available on the internet.

The undirected joint model assumes that the two variables coevolve such that a mutation event in either variable can elicit a corresponding change in the other variable. In contrast, the conditional model assumes that a single variable is distributed according to the tree and is influenced by the predictor variable only at the tips of the tree. Of course, other evolutionary processes are possible. In

Neither the undirected joint nor the conditional model outperformed the other on all real data sets, suggesting that both models should be considered when analyzing new data. Nonetheless, the conditional model better fit most of the real data that we analyzed. The conditional model better described the effects of immune pressure on HIV evolution and, perhaps more surprisingly, better described the correlation between HIV-1 p6 amino-acid pairs. This observation may be due to the rapid evolution of HIV and positive selection pressure from the immune response in conjunction with compensatory mutations in the observed sequences.

Our study has focused on the correlation of discrete (specifically binary) variables. Generalization to multistate variables is straightforward, although it may suffer a loss in power. Our methods can also be generalized to continuous and/or discrete predictor and target variables. When the target variable is continuous, the conditional model is a special case of a linear mixed-effects model (K.M. Kang, N. Zaitlen, C.M. Wade, A. Kirby, D.H., M.J. Daly and E. Eskin, submitted). The conditional model can also be generalized to situations with multiple predictor and target variables, thus producing a directed network (acyclic or otherwise) of relationships among multiple variables. Potential applications of multiple predictor variables include learning the combined effects of drug and immune pressure on HIV evolution, identifying chains of compensatory mutations, learning the influence of diploid genes on phenotype, and learning networks of interacting genes and proteins. Finally, one could also use the undirected joint or conditional models to learn the structure of phylogenetic or hierarchical relationships rather than learning the tree structure with standard methods that ignore correlations.

Confounding of association studies by population structure is ubiquitous across biological disciplines. Existing solutions vary across these disciplines, but typically focus on correcting for shared population structure. As we have seen, however, population structure in either variable can lead to loss of discriminatory power and poor statistical calibration. The flexibility and intuitive nature of generative models make them a natural and powerful choice for dealing with a variety of biological processes.

We obtained HIV aligned sequences and HLA data from the Western Australia cohort (HIV sequence accession numbers AY856956–AY857186 and EF116290–EF116445)

The Arabidopsis data set was taken from Aranzana

We generated synthetic data to approximate real data as closely as possible. In the case of synthetic conditional influence, we first ran the conditional model on the real HLA data set to obtain reasonable parameter values. To generate predictor variables, we permuted the real HLA alleles to ensure the data were IID. For each association in the real data set, we generated synthetic target data in one of two ways: (1) if the association was significant (

In the case of coevolution, we first fit the undirected joint model to the HIV p6 data set, and then generated synthetic associations using the learned parameters. For these data, we generated one synthetic association for each real association (

When analyzing the sensitivity of results to tree structure, we needed to infer a phylogenetic tree from synthetic data. To do so, we constructed binary sequences from the synthetic target variables, such that each position in the binary sequence corresponded to a target variable. We then used the PHYML software as described above for real data to infer a maximum likelihood phylogeny. In addition, we used the dnapars program from the PHYLIP package

The independence (null) models are nested inside the undirected joint and conditional models and contain one fewer parameter. Therefore, twice the log-likelihood ratio is asymptotically χ^{2}-distributed with one degree of freedom from which _{0} _{0}, we estimated FDR to be the ratio of the expected number of associations with _{0} under the null distribution to the observed number of associations with _{0} in the real data. In our experiments, we generated ten associations under the null model for each examined _{i}_{0} _{0}. (In general, FDR is expected to be a monotonic function of _{0}, but is rarely monotonic in practice due to, e.g., variance in the statistic.)
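The two computations in this paragraph can be sketched as follows. The first function converts a likelihood ratio between nested models into a p-value using the χ² distribution with one degree of freedom. The second estimates a false discovery rate at each observed threshold from null-generated p-values and then takes the minimum over all larger thresholds, a common convention for q-values that enforces monotonicity. Treating the threshold as a p-value cutoff, the monotonicity step, and all names are assumptions made for illustration.

```python
from math import erfc, sqrt

def lrt_p_value(loglik_null, loglik_alt):
    """p-value of a likelihood ratio test with one degree of freedom.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return erfc(sqrt(stat / 2.0))

def estimate_q_values(real_p, null_p, nulls_per_test):
    """Estimate a q-value for each real p-value from null p-values.

    FDR(p0) = (count of null p <= p0, divided by nulls_per_test)
              / (count of real p <= p0).
    The q-value at p0 is the minimum FDR over all thresholds >= p0.
    """
    thresholds = sorted(set(real_p))
    fdr = []
    for p0 in thresholds:
        expected_false = sum(p <= p0 for p in null_p) / nulls_per_test
        observed = sum(p <= p0 for p in real_p)
        fdr.append(min(1.0, expected_false / observed))
    # Enforce monotonicity, working down from the largest threshold.
    for k in range(len(fdr) - 2, -1, -1):
        fdr[k] = min(fdr[k], fdr[k + 1])
    q_at = dict(zip(thresholds, fdr))
    return [q_at[p] for p in real_p]
```

With ten null associations generated per real test, as in our experiments, `nulls_per_test` would be 10, so that the null counts are rescaled to the number of real tests.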

When testing for an association between variables

We have developed and used a systematic approach for determining which null-generation method to use for a given data set and given model for analysis based on two observations. First, as the computation of

Our approach yielded the following choices: permutation bootstrap of the predictor variable for Applications 1 and 3, and parametric bootstrap of the predictor variable for Application 2. For the analysis of synthetic data, we used null-generation methods that would preserve the known distributions of the predictor and target variables: permutation bootstrap of the predictor variable for the conditional influence data set, and parametric bootstrap of the predictor variable for the coevolution data set.
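A permutation bootstrap of the predictor variable can be sketched as below: the predictor values are shuffled across individuals, which breaks any link to the target variable while leaving the predictor's marginal frequencies exactly as observed. The function name, the seed argument, and the list-based data layout are illustrative assumptions. A parametric bootstrap would instead draw new predictor values from a fitted single-variable model along the tree and is not shown here.

```python
import random

def permutation_null(predictor, num_replicates, seed=0):
    """Generate null predictor vectors by shuffling across individuals.

    predictor      : list of observed predictor values, one per individual
    num_replicates : number of shuffled copies to produce
    seed           : seed for a private random number generator
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(num_replicates):
        shuffled = list(predictor)  # copy so the input is left unchanged
        rng.shuffle(shuffled)
        replicates.append(shuffled)
    return replicates
```

Because shuffling treats individuals as exchangeable, this form of null generation suits a predictor that does not itself follow the tree; when the predictor does follow the tree, the parametric alternative preserves that structure.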

When computing _{0}, then we expect _{A}_{0}) = _{B}_{0}_{A}_{∪B}(_{0}). If _{0} than does _{A}_{0})<_{A}_{∪B}(_{0})<_{B}_{0}). That is, computing _{A}_{∪B}(_{0}) assures that the majority of those false positives are in

In this work, we split our tests along natural boundaries. In our experiments with the conditional model, we computed

To compute the significance of the difference between the discrimination curves of two methods

The permutation test was carried out as follows. Given a set ^{(A)}_{i}^{(A)}<_{j}^{(A)}(_{i}_{j}_{1}, _{2},…,_{n}_{i}_{i}^{(A)} is an association. We computed the area under the curve (AUC) of

When comparing the discrimination curves of two methods _{i}_{i}_{i}_{i}

To determine whether a set of predictions was likely to be enriched for disease response proteins, we downloaded the genomic positions of all genes whose description contained the phrase “disease response” (data taken from

We computed the probability that

The Pollock model for coevolution assumes a symmetric or undirected relationship between two coevolving variables _{x}_{y}_{X}_{Y}_{|X}, and λ_{Y}_{|X̅}. As in the undirected case, we compute the probability that one instance of

HIV-1 p6 amino acid pairs that are correlated at q<0.2.

(0.36 MB XLS)

We thank Bette Korber, Tanmoy Bhattacharya, Eleazar Eskin, Noah Zaitlen, Hyun Min Kang, and Walter Ruzzo for helpful discussions. We also thank participants in the WA HIV Cohort Study as well as past and present clinical and laboratory staff of the Department of Clinical Immunology and Biochemical Genetics, Royal Perth Hospital, Western Australia. Finally, we thank the anonymous reviewer for insightful comments.