An evolution-based high-fidelity method of epistasis measurement: Theory and application to influenza

Linkage effects in a multi-locus population strongly influence its evolution. Models based on the traveling wave approach enable us to predict the average speed of evolution and the statistics of phylogeny. However, predicting statistically the evolution of specific sites and pairs of sites in the multi-locus context remains a mathematical challenge. In particular, the effects of epistasis, the interaction of gene regions contributing to phenotype, are difficult to predict theoretically and to detect experimentally in sequence data. A large number of false-positive interactions arises from stochastic linkage effects and indirect interactions, which mask true epistatic interactions. Here we develop a proof-of-principle method to filter out false-positive interactions. We start by demonstrating that averaging haplotype frequencies over multiple independent populations is necessary but not sufficient for the detection of epistasis, because it still leaves a high number of false-positive interactions. To compensate for the residual stochastic noise, we develop a three-way haplotype method that isolates true interactions. The fidelity of the method is confirmed analytically and on simulated genetic sequences evolved with a known epistatic network. The method is then applied to a large sequence database of the neuraminidase protein of influenza A H1N1, obtained from various geographic locations, to infer the epistatic network responsible for the difference between the pre-pandemic virus and the pandemic strain of 2009. These results present a simple and reliable technique to measure epistatic interactions of any sign from sequence data.


Supporting Information

"An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza" by Gabriele Pedruzzi and Igor M. Rouzine

Sorbonne Université, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), F-75004, Paris, France

S1 Text. Derivation of UFE for the closed square topology

The topology of the network consists of isolated closed squares (S1A Fig). In this case, Eq. 5 takes the form

where $n_1$, $n_2$, $n_{2'}$, $n_3$, $n_4$ are the numbers of the five types of configurations: single, adjacent double, opposing double, triplet, and full square, respectively (S1A Fig). Each configuration has 1, 2, or 4 symmetric transformations. Eq. 6 for the entropy takes the form

= log [ 4 3 4 3 2 4 2 1 4 1 2 ′ 2 2 ′ ]

We introduce the probability of configuration $p_i$

This maximum is conditioned by the fact that the fitness defined by Eq. 1 is fixed, hence

Here all differentials on the right-hand side of Eq. 4 are independent. Substituting Eq. 4 into Eq. 3 and demanding that the coefficients of all these independent differentials vanish, we can express all $p_i$ in terms of $p_1$ as

Without loss of generality, we can consider the interval $0 < E < 1/2$ of the epistatic strength $E$, because below $E = 0$, as we shall see, indirect interactions do not emerge. Above $E = 1/2$, all these values diverge due to over-compensation, i.e., mutations accumulate without limit. We also assume that the system is not too far from the best-fit sequence, $p_1 \ll 1$. Under these assumptions, the following inequalities apply:

$p_{2'} \ll p_2$, $p_1 \gg p_2 \gg p_3$ (6)

Based on Eq. 5 and $p_1 \ll 1$, the probability of a 4-site cluster, $p_4$, is ordered with respect to the other probabilities depending on the subdivision of this interval of $E$

We now calculate the frequencies of haplotypes for two sites located at opposite corners of a square, which corresponds to indirect interaction (S1B Fig).
By adding all possible configurations that can produce the haplotype 11 at these two sites, we obtain $f_{11}$ =

Now we repeat the same procedure for two sites located on one side of a square, which corresponds to direct interaction (S1E Fig and S1F

where we used $f_{10} = f_{01}$ due to the symmetry of the topology. Using the strong inequalities in Eqs. 6 and 7 and substituting Eqs. 10 to 16 into Eq. 17, we obtain the values of all types of UFE in different intervals of $E$, shown in S1 Table. The dependence of each kind of the correlation measure UFE on the epistatic strength $E$ is plotted in S2 Fig.
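The configuration counting used in this derivation can be checked by brute force. The sketch below enumerates all 16 allele states of one closed square, groups them into configuration types, and lists which types contribute to the haplotype 11 for an opposite-corner (indirect) pair and for a same-side (direct) pair. The site numbering 0-1-2-3 around the ring, the label "adjacent double", and the function names are illustrative assumptions, not notation from the text.

```python
from itertools import product
from collections import Counter

# Assumed labelling: the four sites of one square are numbered around the ring.
EDGES = {(0, 1), (1, 2), (2, 3), (3, 0)}


def are_adjacent(a, b):
    """True if sites a and b share a side of the square."""
    return (a, b) in EDGES or (b, a) in EDGES


def classify(state):
    """Name the configuration type of a 4-site state of 0/1 alleles."""
    ones = [i for i, x in enumerate(state) if x == 1]
    k = len(ones)
    if k == 0:
        return "empty"
    if k == 1:
        return "single"
    if k == 2:
        return "adjacent double" if are_adjacent(*ones) else "opposing double"
    if k == 3:
        return "triplet"
    return "full square"


states = list(product((0, 1), repeat=4))

# Number of symmetric copies of each configuration type on one square.
multiplicity = Counter(classify(s) for s in states)


def contributors(site_a, site_b):
    """Configuration types (with counts) that carry allele 1 at both sites."""
    return Counter(
        classify(s) for s in states if s[site_a] == 1 and s[site_b] == 1
    )


indirect = contributors(0, 2)  # opposite corners of the square
direct = contributors(0, 1)    # two sites on one side of the square

print(dict(multiplicity))
print(dict(indirect))
print(dict(direct))
```

The multiplicities 4, 4, 2, 4, 1 reproduce the symmetry factors 1, 2, or 4 quoted above. The two pair types differ only in which double configuration contributes to 11 (opposing for the indirect pair, adjacent for the direct pair).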
We can draw several conclusions, as follows:
i) Indirect interaction is absent at small $E$, $E < 1/4$. By induction, it is also absent at negative $E$, where the large clusters that create indirect interaction are very rare. Direct UFE equals $E$, as if the pair were epistatically isolated [46].
ii) At large $E > 1/3$, direct and indirect pairwise correlations have exactly the same magnitude, and direct UFE exceeds the value of $E$. The intuitive reason for these results is that direct and indirect UFEs are both determined mostly by 4-allele clusters, which numerically dominate over the smaller clusters (Eq. 7).
iii) At large $1/2 > E > 1/3$, the addition of 0 at a third site makes the direct and indirect correlations distinct from each other (and smaller).
iv) In this case, the addition of another 0 at the remaining site kills the indirect correlation.

Therefore, the 3-way correlation method can, in principle, be used to tease the direct and indirect interactions apart. However, the indirect 3-way correlation remains of the same order of magnitude as the direct one, especially if $E$ is close to the full compensation point ($E = 1/2$). Because the magnitude varies broadly between pairs in real biological systems, this difference may not be enough for reliable detection. Hence, the best way is to add another 0 and measure 4-way haplotypes. This trick eliminates indirect interactions completely, $\mathrm{UFE}_{00} \equiv 0$, in the entire interval of $E$.
Intuitively, we interrupt both "detours" along interacting pairs connecting the two loci of interest.
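The idea of adding 0s at extra sites can be sketched as a conditional haplotype count. In this minimal example, sequences are tuples of 0/1 alleles, frequencies are averaged over independent populations, and a haplotype is counted only if the sequence carries 0 at every site listed in `zero_sites`. The section does not state the UFE formula, so the expression used here, UFE = 1 - log(f11/f00) / log(f10 f01 / f00^2), is an assumption taken to be consistent with "direct UFE equals the epistatic strength for an isolated pair". The function names and the toy data are hypothetical.

```python
import math


def haplotype_freqs(populations, i, j, zero_sites=()):
    """Population-averaged frequencies of the haplotypes at sites i and j.

    Only sequences carrying allele 0 at every site in zero_sites are counted,
    which gives 3-way haplotypes for one extra site and 4-way for two.
    """
    if not populations:
        raise ValueError("at least one population is required")
    totals = {"00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}
    for seqs in populations:
        for s in seqs:
            if all(s[k] == 0 for k in zero_sites):
                totals[f"{s[i]}{s[j]}"] += 1.0 / len(seqs)
    return {h: t / len(populations) for h, t in totals.items()}


def ufe(freqs):
    """Assumed UFE definition (not given in this section), see lead-in."""
    if min(freqs.values()) <= 0.0:
        raise ValueError("all four haplotypes must be observed")
    numerator = math.log(freqs["11"] / freqs["00"])
    denominator = math.log(freqs["10"] * freqs["01"] / freqs["00"] ** 2)
    if denominator == 0.0:
        raise ValueError("single-site frequencies give a zero denominator")
    return 1.0 - numerator / denominator


# Toy data: three sites; allele 1 at site 2 drags sites 0 and 1 along with it.
pop = (
    [(0, 0, 0)] * 4
    + [(1, 0, 0), (0, 1, 0), (1, 1, 0)]
    + [(1, 1, 1)] * 3
)
populations = [pop, pop]

pairwise = ufe(haplotype_freqs(populations, 0, 1))
three_way = ufe(haplotype_freqs(populations, 0, 1, zero_sites=(2,)))
print(pairwise, three_way)
```

In this toy set, the plain pairwise estimate is inflated by the sequences that carry 1 at the third site, while conditioning on 0 at that site recovers the value built into the remaining sequences (0.5 under the assumed formula). Passing two sites in `zero_sites` gives the 4-way variant.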
In the case of the most general topology with many loops, this simple example generalizes as follows: the number of additional 0s required to kill an indirect interaction equals the number of directions in which a detour can start from a site of the pair. For example, suppose that a site of a suspected pair has six epistatic partners, and two of them start a detour to the other site of the pair. In this case, we would need to add two 0s. Hence, one needs to add extra 0s iteratively and check whether the result changes. We did not have to use this procedure for the virus protein data in Fig. 2, because the detected network is almost a tree already after the three-way test.
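When a candidate network is already available, the number of detour directions described above can be counted directly. The sketch below is one possible reading of that rule, not the authors' procedure: for a pair (a, b), it counts the partners of a, other than b itself, from which b can be reached without passing through a again. The adjacency-dictionary representation and the function name are assumptions.

```python
from collections import deque


def count_detours(adjacency, a, b):
    """Count partners of site a (other than b) that start a path to site b.

    adjacency maps each site to a list of its epistatic partners and is
    assumed to be symmetric. Paths may not pass through a again.
    """
    detours = 0
    for start in adjacency[a]:
        if start == b:
            continue  # the direct link is not a detour
        seen = {a, start}
        queue = deque([start])
        reached = False
        while queue and not reached:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt == b:
                    reached = True
                    break
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if reached:
            detours += 1
    return detours


# Closed square 0-1-2-3-0: the opposite-corner pair (0, 2) has two detours.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
square_opposite = count_detours(square, 0, 2)
square_side = count_detours(square, 0, 1)

# Site 0 has six partners; only partners 1 and 2 lead on to site 8.
six_partners = {
    0: [1, 2, 3, 4, 5, 6],
    1: [0, 7], 2: [0, 8], 3: [0], 4: [0], 5: [0], 6: [0],
    7: [1, 8], 8: [7, 2],
}
example = count_detours(six_partners, 0, 8)
print(square_opposite, square_side, example)
```

For the square, the count of two for the opposite-corner pair matches the two 0s of the 4-way haplotype, and the six-partner graph reproduces the example in the text. In practice the network is not known in advance, which is why the text recommends adding 0s iteratively instead.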