Epistasis detectably alters correlations between genomic sites in a narrow parameter window

Different genomic sites evolve inter-dependently due to the combined action of epistasis, defined as a non-multiplicative contribution of alleles at different loci to genome fitness, and the physical linkage of different loci in genome. Both epistasis and linkage, partially compensated by recombination, cause correlations between allele frequencies at the loci (linkage disequilibrium, LD). The interaction and competition between epistasis and linkage are not fully understood, nor is their relative sensitivity to recombination. Modeling an adapting population in the presence of random mutation, natural selection, pairwise epistasis, and random genetic drift, we compare the contributions of epistasis and linkage. For this end, we use a panel of haplotype-based measures of LD and their various combinations calculated for epistatic and non-epistatic pairs separately. We compute the optimal percentages of detected and false positive pairs in a one-time sample of a population of moderate size. We demonstrate that true interacting pairs can be told apart in a sufficiently short genome within a narrow window of time and parameters. Outside of this parameter region, unless the population is extremely large, shared ancestry of individual sequences generates pervasive stochastic LD for non-interacting pairs masking true epistatic associations. In the presence of sufficiently strong recombination, linkage effects decrease faster than those of epistasis, and the detection of epistasis improves. We demonstrate that the epistasis component of locus association can be isolated, at a single time point, by averaging haplotype frequencies over multiple independent populations. These results demonstrate the existence of fundamental restrictions on the protocols for detecting true interactions in DNA sequence sets.


Introduction
Epistasis is inter-dependence of the fitness effects of mutations occurring at different loci. The term 'epistasis' in population genetics refers to the fact that mutations occurring at different genomic sites affect the Darwinian fitness of an organism, i.e., its average progeny number, in a non-multiplicative fashion. At the cell biology level, this phenomenon is caused by various biological interactions [1][2][3][4]. In biological systems, amino acids in proteins domains interact a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 with each other. The resulting networks of interactions that include direct protein-protein binding and allosteric effects, shape the gene regulation and metabolic networks. Epistasis is a widespread property of biological networks [2,[5][6][7][8] and a subject of intense studies. The vital role it plays in the genetic evolution of populations and the heritability of complex traits is well established. The existing estimates indicate that the variation of an inherited trait across a population can only partially be explained by the additive contributions from the relevant alleles. On average, 70% of the inheritance may be due to epistasis or epigenetic effects [9]. Epistasis defines the evolutionary paths and creates fitness valleys, i.e., intermediate genetic variants with reduced fitness [10][11][12].
A crucial biological scenario is a viral population adapting to the abrupt changes in external conditions. Examples include the transmission to a new host, the invasion of a new organ, or the process of immune evasion or the development of drug resistance. Typically, virus adaptation consists of primary mutations followed by a cascade of several compensatory (helper) mutations [13][14][15][16][17][18]. These mutations help the adapting virus to pass through a fitness valley [11]. During this process, compensatory mutations rescue the replicative fitness of virus while preserving its resistant phenotype [13,15,19].
However, epistasis is not the only force causing inter-dependence in the evolution of genomic regions. The other dominant factor is the host of linkage effects due to the fact that different loci in the absence of recombination (or under limited recombination) are linked, i.e., inherited together, as a set [20,21]. The consequences of linkage include Fisher-Muller effect (clonal interference), genetic hitchhiking and genetic background effects, and Hill-Robertson interference between genetic drift and selection [21][22][23]. The effects of linkage on evolution in the presence of selection is well understood theoretically [12,[24][25][26][27][28][29][30][31]. The theory shows that linkage significantly slows adaptation, enhances accumulation of deleterious mutations, and changes the shape of the phylogenetic tree [32,33]. The magnitude of linkage effects grows rapidly with the number of loci, L. Recombination partly offsets linkage effects and accelerates evolution [34][35][36][37][38][39][40] and competes with epistasis [41]. Epistasis has been shown to be potentially important for the evolution of recombination in a two-locus model [42,43].
Another consequence of linkage, which represents the focus of the present work, is the strong interaction between the evolutionary trajectories of different sites. LD stemming from linkage is easy to confuse with epistasis effects. Linkage effects are stochastic, due to stochastic sampling of genomes and random nature of mutations. They become small only in populations that are exponentially large in the number of sites L [25]. Working with sequence data from real populations, it is often unclear how to discriminate the effects of shared ancestry from those of epistasis, and which of the two evolutionary forces dominates in each case (for a comprehensive review, see [1,44,45]). Therefore, despite of a considerable theoretical and experimental effort, detecting epistasis from genomic data remains a challenge.
In the present work, we offer an evolutionary explanation for the observed difficulty of the detection of epistasis from one-time data set. The idea is to generate mock data using a Monte-Carlo model of evolution and then try to discriminate between effects of linkage and epistasis. We use a panel of six pairwise LD measures to compare their distributions between epistatic and random pairs in a broad range of model parameters. We also use 3D and 2D maps of all possible combinations of LD measures and employ an optimization algorithm based on a priori knowledge to estimate the best, theoretically possible identification of epistatic pairs. As a result, we delineate the region of time and model parameters where the epistatic pairs can be detected against the linkage background. Finally, we investigate the role of recombination and the effects of averaging over multiple independently-evolving populations.

Computer simulation of evolution
We consider a haploid population of N genomic sequences comprised of L sites, where L >> 1, and either a favorable or deleterious allele is present at each site. Evolution of the population between discrete generations is simulated using a Wright-Fisher model including the evolutionary factors of random mutation with the rate μ per site, random genetic drift, and natural selection, as described in Methods. Natural selection includes positive (antagonistic) epistatic interaction between selected pairs of deleterious alleles. A simple case of genomes with uniform selection coefficient s 0 and uniform epistatic strength, E, is considered. We also assume that epistatic pairs are isolated, i.e., that each genomic site interacts with only one site. The initial population is randomized as it is done in virus passage experiments, with an average allelic frequency f 0 . In most of our work, we initially neglect the factor of recombination and primarily focus on asexual evolution, but lift this restriction in the end and explore broad parameter ranges. We aim to simulate the detection of epistatic pairs and identify the best conditions for detection theoretically.

Measures of linkage disequilibrium (LD)
Various haplotype-based measures based on known haplotype frequencies have been proposed to characterize the allelic association between loci. We will list four measures, as follows.
Lewontin's measure of statistical correlation between alleles at different loci has a form [46] Here f ij is the average frequency of a bi-allelic haplotype of loci i and j, and D max is a normalization coefficient making sure that D 0 2 [0, 1].
An alternative measure is Pearson correlation coefficient between pairs of loci r, expressed as [47] r ¼ D= ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi More recently, Wu and colleagues [48] have proposed another statistical marker of linkage disequilibrium which has the bi-allelic form which represents the logarithm of the Z-measure proposed previously by Kimura [49]. In our recent work [50], we introduced another bi-allelic measure The advantage of this measure with respect to previous three is that it has a direct meaning in terms of fitness. For isolated interacting pairs, when frequencies in Eq 4 are ensemble-averaged, it represents the degree of mutual compensation of two deleterious mutations, UFE = E (see Methods below). Here the value E = 0 corresponds to the absence of compensation, and E = 1 to full mutual compensation of the two mutations. We checked that the singularity in Eq 4 at f 10 f 01 = f 00 2 does not affect our results.
Below we investigate the effect of linkage for interacting and noninteracting pairs of loci using the measures defined in Eqs 1-4. Also, we employ an optimization algorithm that, exploiting a priori knowledge of the correct epistatic pairs, puts the best possible threshold between the two distributions of LD. We consider different combinations of two or three LD measures to obtain the best detection possible.

LD of epistatic and non-epistatic pairs are distinct in a narrow parameter window
We started by plotting the distribution of six LD measures calculated from Eq 1 over individual pairs of sites, at different times (Fig 1). We show separately the distribution for two subsets of pairs: the known epistatic subset (dark shade) and all the pairs (light shade). In the beginning, LD is narrowly distributed around zero, for both epistatic and non-epistatic subsets (Fig 1,  row 1). , the distribution of randomlychosen pairs, which was initially narrow and concentrated near the origin E = 0, gradually expands and overlaps with the small epistatic distribution (Fig 1). This effect implies that nonepistatic pairs of sites, due to the stochastic nature of evolution, produce large LD of random sign. In this case, it is impossible to tell apart epistatic pairs from any of these measures of LD.

Results are robust to the choice of an LD measure or their combination
Next, we checked whether combinations of LDs used together can improve detection. We have calculated all possible combination of six LD measures in Eq 2 and tried to separate interacting and non-interacting pairs using 3D and 2D scatter plots. A representative example is shown in Fig 2, for E = 0, and for E = 0.75 at two time points. Other possible combinations of 2 and 3 measures are summarized in S1 Table. We wrote an optimization algorithm which separates the cloud of interacting pairs from the cloud of non-interacting pairs in the best possible way, using a priori knowledge about the identity of pairs (Fig 2). We adjusted the threshold to optimize the difference between the detection rare and the false positive rate. This method, employing the principle of machine learning, does not give any substantial improvement on the detection window (See S1 Table). For a real data sets, a priori knowledge about interacting pairs is usually unavailable, so that the detection of epistasis in a single population at one time point will be even worse than our prediction.

Clonal exclusion has a minor effect on detection window
We also attempted to improve detection by analyzing the clone structure of population and excluding the largest clones from the simulated sequences set, which comprise a significant fraction of population [28] and could contribute to noise (S1 Appendix). We have reached only a slight expansion in the time window of detection (S2 and S3 Figs).

Parameter sensitivity analysis confirms the narrow window of detection
Selection coefficient. Next, we investigated how the window of detection changes with model parameters. We calculated the detection rate and the false positive rate for the six measures of LD at different values of selection coefficient, s 0 (Fig 3). For each measure, the results show an inverse scaling of the detection time window on s 0 . Note that the window closes at very small s 0 , where evolution is almost selectively neutral, and epistasis is never detectable.
Distributed selection coefficient. Next, we conducted a sensitivity analysis with respect to the other model parameters (S5 Fig). Firstly, we lifted the simplifying assumption of a constant selection coefficient, s = s 0 , and allowed variation of s among sites according to a half-Gaussian distribution. We obtain a similar dependence of the window width on the average selection coefficient (S5 Fig), although with a higher false positive rate within the detection window than for the case with constant s.
Length of the genome. We found out, that sequence length L limits the detectability of epistasis substantially (S5 Fig). An increase of the sequence length or a reduction of the population size leads to narrowing and, eventually, disappearance of the detection window. These results limit the applicability of these methods to short sequences. Indeed, the number of all possible locus pairs increases with genome length L proportionally to L 2 , and the number of epistatic pairs increases only as L, so that the task of finding "the ruby in the rubbish" becomes harder at larger L [1,44,45].
Population size. We observed a very slow (logarithmic) expansion of the detection window with population size N (S5 Fig). This is consistent with the results of asexual evolution models, which predict a very slow logarithmic dependence on N for all the evolutionary observables, including evolution speed, genetic diversity, and the average time to most recent ancestor [25-31, 35-37, 39, 40, 51]. Only in very large populations whose size increases exponentially genome length L, linkage effects become small [25]. In these, astronomically large populations, epistasis would be easily detectable.  11 , plotted for all pairs of sites (blue circles) and for designated epistatic pairs (red circles). Right and middle: two-dimensional projections. The upper row corresponds to zero epistasis (E = 0, top). Second and third row are two time points in the presence of epistasis, within and outside the detection window, respectively. All possible combinations of two and three measures have also been tested and summarized in S1 Table. At intermediate time t = 10, a distinct cloud of epistatic pairs (red dots) cluster separately from the other pairs and, hence, are detectable (middle row). At long times, substantial overlap with noninteracting pairs contaminates detection (bottom row). To optimize detection, we define a detection threshold for each statistics and use an optimization algorithm that minimizes the following quantity "DET + a FPOS", where a is a fitting parameter, DET represent the detection percentage, and FPSO is the percentage of false positive, based on prior knowledge of the identity of true epistatic pairs. Parameters are as in Fig 1. https://doi.org/10.1371/journal.pone.0214036.g002 Initial standing variation. We have observed a detection window in time only at the initial frequencies of deleterious alleles above 10% (S5 Fig). At smaller frequencies, detection lapses. We can conclude that detection of epistasis in a single population studied is possible in a narrow parameter range.

Recombination improves detection
Until now, we have assumed a completely asexual evolution. In our next step, we investigated the role of recombination, parametrised by the average number of crossovers per genome, M, and by the probability of outcrossing per genome, r. We obtained that intermediate recombination rates rescue the detection of epistasis by disrupting linkage and yet preserving the epistasis contribution to LD. At our default parameter set (Fig 1 legend), we observed a significant reduction of linkage fluctuations starting from r = 20% and M = 5 (Fig 4). The results show that LD effects of linkage are much more resistant to recombination than, for example, the evolution speed, which increases substantially already at tiny values of r [34][35][36][37][38][39][40]. We found out also that extremely high levels of recombination decrease LD for epistatic pairs as well, thus rendering epistasis undetectable. Thus, there exists a narrow window of recombination rates where epistasis can be observed outside of the detection window for time and other parameters described above.

Population divergence creates strong linkage effects
In order to understand the reason behind the strong linkage effects masking epistasis, we investigated the time-dependent changes of the phylogenetic tree using a hierarchical clustering algorithm (Fig 5a-5d). The initial, randomized population display a star-shaped phylogeny, characterized by the same mean distance between all sequences and the most common sequence (Fig 5). With time, the phylogenetic tree grows branches of increasingly related sequences (Fig 5c and 5d). As simulation continues (Fig 5d), the tree becomes more lopsided, while recent mutations create short branches at the bottom. At the same time, we observe that the tree has a decreasing number of ancestors. Eventually, the tree evolves into Bolthausen-      Sznitman coalescent (BSC) with a single common ancestor, previously predicted for the stationary regime of traveling wave [25,29,37] (Fig 5).
Emergence of this phylogeny is coincident with the increase in the fluctuations of LD of non interacting pairs (Fig 1). The reason for strong random LD is stochastic divergence of the population from the initial state, as illustrated by clustering of three independently evolved populations (Fig 5, right). The distance between the trees obtained in separate runs increases linearly in time due to fixed beneficial mutations at randomly chosen sites. Haplotype configurations of the common ancestor of the population are inherited by all members of the population, with some small variation determined by the time to the most recent common ancestor. Thus, the stochastic divergence of individual populations creates strong LD with a random sign.

The use of multiple populations defeats LD fluctuations and rescues epistatic signature
Because the linkage fluctuations arise due to stochastic divergence of the founder, the common ancestor, the natural idea is to use multiple populations to average over possible founder sequences. To test this idea, we evolved independently multiple populations at the same initial conditions and averaged the haplotype frequencies used in LD markers (Eqs. 1-4) over populations, for each pair of sites, separately. We found out that including a sufficient number of independent populations results in a substantial reduction of the noise and indefinite expansion of the window of detection (Fig 6). Qualitatively similar results are obtained for all LD markers.

Discussion
In the present work, using a Monte-Carlo simulation of a haploid population, we calculated the distributions of six measures of linkage disequilibrium and their combinations for epistatic and random locus pairs. We demonstrated that, in a single asexual population, the footprints of epistatic pairs are readable only in a narrow time interval between 0.2/s 0 and 1.5/s 0 generations. During later adaptation, the distribution of linkage disequilibrium for non-interacting pairs broadens and engulfs the distribution for epistatic pairs. These results indicate that, long before the onset of the steady state, linkage effects dominate over the effects of epistasis. This phenomenon is predicted in a broad parameter region and for all the LD statistics, suggesting that, in the context of inherited linkage fluctuations, all statistics based on pairwise linkage disequilibrium are equal.
To gain insight into the evolutionary origin of these fluctuations, we investigated phylogenetic trees of the entire population at different time points to observe that the shape of the tree strongly correlates with the magnitude of linkage fluctuations. The shape of the phylogenetic tree changes in time from the initially star-shaped genealogy to a Bolthausen-Sznitman (BS) coalescent [32,33] previously analyzed in great detail for adapting asexual populations [25,36,37]. Once BS genealogy is established, individual sequences share a high degree of interrelatedness due to fixed beneficial mutations at randomly chosen sites. The presence of the BS coalescent is coincident with strong co-inheritance linkage fluctuations. The stochastic nature of their common ancestor sequence, divergent in time from common ancestors in other independent populations (Fig 5) is the cause of the strong fluctuations of LD.
We have also directly quantitated the detection of epistatic pairs against the background of random linkage effects. We evaluated the sensitivity of the width of the detection window with respect to several input parameters, such as the mean selection coefficient, the size of the population, the sequence length, and initial genetic variation, and the role of recombination. We observed that the window is proportional to the inverse average selection coefficient, 1/s 0 , but a very small s 0 abolishes any chance of detection, so that the best detection is attained in the case of moderately weak selection. The detection window exists only for sufficiently small genomes. The presence of recombination has the effect of compensating the linkage component and thus significantly improving the detection of epistasis. Yet, very frequent recombination disrupts epistatic effects.
To isolate the epistatic component from co-inheritance effects, we performed simulations over several independently-evolved populations and averaged the haplotype frequencies over these runs. The results predict the number of independent population required to attain significant expansion of the detection window (Fig 6). Thus, the averaging over multiple independently-evolved populations filters out linkage effects leaving a clear footprint of epistasis in a much broader parameter range. However one should note that the multiple-population sampling was conducted under the ideal conditions, in which every population evolved independently for the same time with the same parameter set, and represented the same fraction of the total sample. Unequal sampling or heterogeneous representation in real data sets may create additional problems.
Our model adopts several simplifying assumptions. (i) Deleterious alleles are assigned selection coefficient constant in time. (ii) We considered constant and fixed epistatic strength for all pairs. (iii) We focused on a simple topology of epistatic network. While these are reasonable assumptions to describe the problem of linkage fluctuations in biological systems, a real scenario with mixed sign epistasis and complex topology might pose additional challenges for the accurate detection of epistasis.
The results obtained from averaging over independent populations give strong evidence for the role of stochastic divergence in linkage statistics. There are some cases, such as virus evolution in independent populations where it is possible to obtain independent replicates. Examples include influenza virus sampled in different countries, virus passage in parallel tissue cultures, cancer cell evolution in different organs. For the study of human genetics, it may be possible to obtain independent isolates from under-mixed subpopulations that split long time ago. In principle, one can try to use data from different countries of common origin. For example, full genome studies show that European nations, despite of interbreeding and outbreeding, remain genetically distinct after a split from a common origin~6000 years ago (see Fig 2 in [52]). Therefore, they can be viewed as quasi-independent populations with weak genetic exchange. Our results imply that comparing genetic data from related but distinct ethnicities allows to study epistasis more reliably than in a single ethnic group.

Conclusions
We identified the evolutionary reason for strong fluctuations of epistatic estimates in the existing sequence sets. Linkage due to stochastic divergence of the common ancestor of a population from the origin is responsible for the high false-positive rates of epistasis detection in a single population. We demonstrated how the use of multiple independently-evolving populations allows to average out strong linkage effects and rescue the detectability of epistasis.

Materials and methods
We consider a haploid population of N binary sequences, where each genome site (nucleotide position) numbered by i = 1, 2, . . ., L is either K i = 0 or K i = 1. We assume that the genome is long, L >> 1. Evolution of the population in discrete time measured in generations is simulated using a standard Wright-Fisher model, which includes the factors of random mutation with rate μL per genome, natural selection, and random genetic drift. Recombination is assumed to be absent. Once per generation, each genome is replaced by a random number of its progeny which obeys multinomial distribution. The total population stays constant with the use of the broken-stick algorithm.
To include natural selection, we calculate fitness (average progeny number) e W of sequence K i as given by [50] W The biological meaning of this expression is, as follows. According to a well-know theorem of population genetics, different loci are predicted to evolve independently in a large population if contributions of mutations occurring at different loci to organism's log fitness are additive, which corresponds to the case of biologically non-interacting sites. Formally, this situation is described by the first term in Eq 5 with additive contribution of single mutations to fitness, with selection coefficient s i for each site i. Interaction between loci creates non-additive effects to the fitness log: the second term in Eq 5 describes pairwise interactions of sites with magnitudes S ij given by Eq 6.
Coefficient E ij introduced in Eq 6 represents the relative strength of epistatic interaction between sites i and j, while the binary elements of matrix T indicate the interacting pairs by T ij = 1 and the other pairs by T ij = 0. An example of positive epistasis is the compensation of two deleterious mutations inside protein segments that bind each other. Note that E ij = 1 corresponds to full mutual compensation of deleterious mutants at sites i and j. We consider the simplest interaction topology of interacting neighbors, as given by T 2i,2i+1 = 1 and 0 for all other pairs.
Here we include only pairwise interactions, neglecting higher-order interactions between protein residues. Even though non-pairwise models are sometimes used in the literature, we are not aware of any evidence that higher-order interactions are significant in viruses or any other organisms.
Supporting information S1 Table. Theoretical limits of detection of epistasis, expressed as the percentages of detection and false positives. We devised an optimization algorithm that, based on prior knowledge of the true epistatic association, identifies threshold within the data that allow to sort epistatic pairs from non-interacting ones. We present data for each induvial estimator of epistasis and for combinations of two, and three measure simultaneously. The analysis was repeated at two time points (T 1 = 10 and T 2 = 30 generations, within and outside of the widows of detection, respectively. The results offer a comparative perspective over the detection performances of different measure and show that all LD-and haplotype-based estimators of epistasis can only detect true association, with reduced error and bias, only at T 1 , while at the later time point the magnitude of CI effects mask the epistatic associations. (PDF) S1 Appendix. Clonal exclusion does not remove the limit to the detection of epistasis in a single-population.