Skip to main content
Advertisement
  • Loading metrics

An extension of the Walsh-Hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity

  • Andre J. Faure ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    aandre.faure@crg.eu

    Current address: ALLOX, C/ Dr. Aiguader, 88, PRBB Building, Barcelona, Spain

    Affiliation Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain

  • Ben Lehner,

    Roles Conceptualization, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain, Universitat Pompeu Fabra (UPF), Barcelona, Spain, ICREA, Pg. Lluis Companys 23, Barcelona 08010, Spain, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom

  • Verónica Miró Pina,

    Roles Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain

  • Claudia Serrano Colome,

    Roles Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain

  • Donate Weghorn

    Roles Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain, Universitat Pompeu Fabra (UPF), Barcelona, Spain

Abstract

Accurate models describing the relationship between genotype and phenotype are necessary in order to understand and predict how mutations to biological sequences affect the fitness and evolution of living organisms. The apparent abundance of epistasis (genetic interactions), both between and within genes, complicates this task and how to build mechanistic models that incorporate epistatic coefficients (genetic interaction terms) is an open question. The Walsh-Hadamard transform represents a rigorous computational framework for calculating and modeling epistatic interactions at the level of individual genotypic values (known as genetical, biological or physiological epistasis), and can therefore be used to address fundamental questions related to sequence-to-function encodings. However, one of its main limitations is that it can only accommodate two alleles (amino acid or nucleotide states) per sequence position. In this paper we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis (also known as ensemble epistasis) in genetic landscapes with an arbitrary number of states per position (20 for amino acids, 4 for nucleotides, etc.). We also provide a recursive formula for the inverse matrix and then derive formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices. Finally, we demonstrate the utility of our theory by using it to model epistasis within both simulated and empirical multiallelic fitness landscapes, revealing that both pairwise and higher-order genetic interactions are enriched between physically interacting positions.

Author summary

An important question in genetics is how the effects of mutations combine to alter phenotypes. Genetic interactions (epistasis) describe non-additive effects of pairs of mutations, but can also involve higher-order (three- and four-way etc.) combinations. Quantifying higher-order interactions is experimentally very challenging requiring a large number of measurements. Techniques based on deep mutational scanning (DMS) represent valuable sources of data to study epistasis. However, the best way to extract the relevant pairwise and higher-order epistatic coefficients (genetic interaction terms) from this data for the task of phenotypic prediction remains an unresolved problem. The Walsh-Hadamard transform represents a rigorous computational framework for calculating and modeling epistatic interactions at the level of individual genotypic values. Critically, this formalism currently only allows for two alleles (amino acid or nucleotide states) per sequence position, hampering applications in more biologically realistic scenarios. Here we present an extension of the Walsh-Hadamard transform that overcomes this limitation and demonstrate the utility of our theory by using it to model epistasis within both simulated and empirical multiallelic genetic landscapes.

Introduction

A fundamental challenge in biology is to understand and predict how changes (or mutations) to biological sequences (DNA, RNA, proteins) affect their molecular function and ultimately the phenotype of living organisms. The phenomenon of ‘epistasis’ (genetic interactions)—broadly defined as the dependence of mutational effects on the genetic context in which they occur [13]—is widespread in biological systems, yet knowledge of the underlying mechanisms remains limited. Defining the extent of epistasis and better understanding of its origins has relevance in fields ranging from genetic prediction, molecular evolution, infectious disease and cancer drug development, to biomolecular structure determination and protein engineering [3].

Evolutionarily related sequences, natural genetic variation within populations, and more recently results of techniques such as deep mutational scanning (DMS) [4]—also known as massively parallel reporter assays (MPRAs) and multiplex assays of variant effect (MAVEs)—represent valuable sources of data to study epistasis [1, 5]. In particular, DMS enables the systematic measurement of mutational effects across entire combinatorially complete genetic landscapes [513]. Importantly, the typical use of engineered genotypes, haploid individuals and near-identical environmental (laboratory) conditions in these experiments allows population genetic considerations—such as dominance, variable allele frequencies and linkage disequilibrium—to be ignored [14]. In other words, measurements obtained from deep mutational scanning and related methods permit the modeling of epistasis in the mechanistic sense (sequence-to-function encoding) rather than in the evolutionary sense, i.e. based on the dynamics of population genotype frequencies. Nevertheless, precisely how to extract the most biologically relevant pairwise and higher-order epistatic coefficients (genetic interaction terms) from this type of data is an unresolved problem.

Quantitative definitions of epistasis vary among fields, but it has been argued that one particular formulation termed ‘background-averaged’ epistasis, also known as ‘ensemble’ epistasis [1, 12], may provide the most useful information on the epistatic structure of biological systems [2]. The underlying rationale is that by averaging the effects of mutations across many different genetic backgrounds (contexts), the method is robust to local idiosyncrasies in the relationship between genotype and phenotype. It has been previously pointed out that the definition of background-averaged epistasis is conceptually similar to that of ‘statistical epistasis’ attributed to Fisher, but instead of measuring the average effect of allele substitutions against the population average genetic background i.e. averaging over all genotypes present in a given population (taking into account their individual frequencies), the approach instead averages over all possible genotypes (assuming equal genotype weights) [1, 2].

The current mathematical formalism of background-averaged epistasis is based on the Walsh-Hadamard transform [2]. Interestingly, although widely used in physics and engineering, the Walsh-Hadamard transform was first applied to non-biological fitness landscapes in the field of genetic algorithms (GA) [15], subsequently being proposed as the basis of a framework for the computation of higher-order epistasis in empirical settings [16]. However, the Walsh-Hadamard transform can only accommodate two alleles (amino acid or nucleotide states) per sequence position, with no extension to multialleleic landscapes (cardinality greater than two) yet made, as confirmed by multiple recent reports [2, 1719]. Alternative implementations for multiallelic landscapes either rely on ‘one-hot encoding’ elements of larger alphabets as biallelic sequences—requiring the manipulation of prohibitively large Walsh-Hadamard matrices—or constructing graph Fourier bases [18, 20], which is mathematically complex and provides no straightforward way to interpret epistatic coefficients. The result is that the application of background-averaged epistasis has been severely limited and its properties remain largely unexplored in more biologically realistic scenarios.

In this work we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis in genetic landscapes with an arbitrary number of states (20 for amino acids, 4 for nucleotides, etc.). We also provide a recursive formula for the inverse matrix, which is required to infer epistatic coefficients using regression. Furthermore, we derive convenient formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices. Lastly, we apply these formulae to the analysis of both simulated and empirical multiallelic DMS datasets, demonstrating that sparse models inferred from the background-averaged representation (embedding) of the underlying genetic landscape more regularly include epistatic terms corresponding to direct physical interactions.

Results

Extension of the Walsh-Hadamard transform to multiallelic landscapes

In this work, a genotype sequence is represented as a one-dimensional ordering of monomers, each of which can take on s possible states (or alleles), for example s = 4 for nucleotide sequences or s = 20 for amino acid sequences. Without loss of generality, the s states can be labelled 0, 1, 2, …, s − 1, where 0 denotes the wild-type allele. We are going to consider genotype sequences of length , i.e. sequences taking values in , where .

Each genotype is associated with its phenotype . Note that here we use the term ‘phenotype’ as shorthand for ‘molecular phenotype score’ from a quantitative laboratory assay (DMS) reporting on a molecular function for each genotype of interest. In quantitative genetics terminology this might be referred to as ‘genotypic value’ because environmental deviation is negligible due to the controlled nature of the experiments, but our subject here is the macromolecule not an individual from a population [14]. In the context of empirical genotype-phenotype landscapes, the phenotypic effect of a genotype is typically measured with respect to the wild-type, i.e. it is given by .

It is important to emphasize that in what follows we implicitly restrict ourselves to the haploid reference base, because our primary goal is the modeling of sequence-to-function encodings for individual genotype sequences—for the ultimate purpose of understanding and engineering macromolecules—not the modeling of sequence evolution or quantification of sources of phenotypic variance in populations.

If the phenotypic effects of individual mutations were independent, they would be additive, meaning that the phenotypic effect of would be the sum of the phenotypic effects of the single mutants (i1, 0, …, 0), …, (0, …, 0, in). The epistatic coefficient quantifies how much the observed phenotypic effect of deviates from this assumption. In the case of background-averaged epistasis, we quantify the interactions between a set of mutations by averaging over all possible genotypes for the remaining positions in the sequence. For example, if n = 3 and s = 2, the pairwise epistatic coefficient involving the mutations at positions 2 and 3 is calculated by averaging over all states (backgrounds) for the remaining positions, in this case given by the two states of the first position (* denotes the positions at which the averaging is performed), i.e. (1)

More generally, in [2] it is shown that for s = 2 and any sequence length n, phenotypic effects can be decomposed into background-averaged epistatic coefficients with (2) where is the vector , is the vector and and are 2n × 2n matrices defined recursively as follows: (3) (4) The matrix is known as the Walsh-Hadamard transform [21, 22] and is a diagonal weighting (or normalisation) matrix to correct the sign and account for averaging over different numbers of backgrounds as a function of epistatic order [2].

In this work, we provide an extension of this theory to describe background-averaged epistasis for sequences with an arbitrary number of states s. Before writing a general formula, we consider the simplest possible multi-state (multiallelic) landscape i.e. a sequence of length n = 1 with s = 3, (5)

Consistent with the definition of background-averaged epistasis for biallelic landscapes [2], the zeroth-order epistatic coefficient ε(*) is the mean phenotypic value across all genotypes and the first-order epistatic coefficients ε(1) and ε(2) are simply the respective individual phenotypic effects of genotypes y(1) and y(2) with respect to the wild-type. However, the key feature of H1 for multiallelic landscapes—and where it departs from the canonical Walsh-Hadamard transform—is the introduction of zero elements to exclude phenotypes that are irrelevant for the calculation of a given epistatic coefficient. In other words, these phenotypes are excluded because they correspond neither to relevant intermediate genotypes nor alternative genetic backgrounds.

If we now consider a sequence of length n = 2 with s = 3, then the H2 and V2 matrices become 9 × 9 (sn × sn) and can be constructed from recurring to the case n = 1 above, giving where the colors highlight the block structure of the matrices. In V2, the red square corresponds to and the light red squares to −V1. In H2, the gray squares correspond to H1 and the blue squares to −H1. In Table 1 we show the results of background-averaged epistatic coefficients calculated by applying the above formula to an empirical multiallelic landscape with n = 2 and s = 3 [6].

thumbnail
Table 1. Interaction terms based on background-averaged epistasis () for an empirical multiallelic genotype-phenotype landscape consisting of all combinations of two mutations each at positions 6 and 66 in the tRNA-Arg(CCU) [6], i.e. n = 2 and s = 3.

The first two columns indicate nucleic acid sequences and their base 3 representations. Here the ‘GC’ reference (wild-type) genotype corresponds to that of S. cerevisiae, denoted by (0, 0). The second two columns show the measured phenotypic effects and corresponding background-averaged epistatic coefficients together with their empirical errors and their propagated errors. See Results for a regression analysis of the entire dataset and Proposition 8 in S1 Text for a derivation of the error propagation.

https://doi.org/10.1371/journal.pcbi.1012132.t001

More generally, for any value of s, when n = 1, (6) where ε(*) corresponds to averaging phenotypes over all possible genotypes and the remaining coefficients simply correspond to the phenotypic effects of each mutation.

For n = 2, we have to consider different combinations of mutations in both positions. In this case, the phenotypes can be written as

A natural ordering of the phenotypes is given by interpreting genotype as the base s representation of an integer (see Table 1). From this, we can see how the first s genotypes correspond to combining the wild-type allele at the first position with a state from the case n = 1, i.e. to genotypes that can be written , with . The next s genotypes correspond to the first mutated allele at the first position combined with all the genotypes of n = 1, i.e. , and so on. Therefore, we can write the matrices H and V following a block structure. In the case n = 2 and any given s, we would then have (7) where the number of H1 blocks corresponds to the number of states of the first position, so s. Moreover, each of these blocks must be normalized to yield the corresponding background-averaged epistatic terms. Therefore V2 can also be expressed as a function of V1 as follows: (8)

Given these two matrices, we can write the background-averaged epistatic coefficients for the case of n = 2 and s different states per position as . More generally, the decomposition of phenotypic effects into background-averaged epistatic coefficients is given by (9) where Hn and Vn can be defined recursively as (10) (11)

Recursive inverse matrix

Eq (9) defines the vector of epistatic coefficients, , as a function of the vector of phenotypes, , which in general is the quantity that is measured experimentally. However, usually phenotypic measurements are only available for a subset of genotypes. An alternative is therefore to estimate the epistatic coefficients by regression, (12) where the product represents a matrix of sequence features. This is analogous to the more widely used one-hot encoding strategy, which implicitly relies on a ‘background-relative’ (or ‘biochemical’) view of epistasis when regressing to full order [2]. We discuss other advantages of estimating background-averaged epistatic coefficients using regression at the end of this manuscript. Since Vn is a diagonal matrix, its inverse is also a diagonal matrix whose elements are the inverse of the elements of Vn.

The inverse of Hn is the matrix An which can be defined recursively as (13) See Proposition 1 in S1 Text for a proof of this result. This is the most efficient method to determine the full matrix An (see Results).

Formulae to obtain elements of the matrices

When regressing phenotypes on genotypes, a common goal is to determine whether epistatic coefficients up to the rth order (where r < n) are sufficient to describe the complexity of the biological system. Furthermore, as mentioned above, some fraction of phenotype values within combinatorially complete genetic landscapes are typically unavailable, representing missing data. Restricting the epistatic order and missing phenotypes respectively correspond to omitting rows and columns from Hn (and vice versa from An). Formulae to directly obtain elements of the matrices in Eqs (9) and (12) would therefore be convenient.

In order to write the matrix element (Hn)ij, we need to compare the genotype sequences , where denotes the ith element in , , and the elements of are ordered by the base s representation of integers. For instance, for any value of n, we will denote the wild-type state with index i = 1 and write . The element denoted with index i = 2 would be and so on.

The elements of Hn can be written as (14) where M and E are sn × sn matrices whose elements are (15) (16) where δij denotes the Kronecker delta of i, j, which is equal to 1 when i = j and 0 if ij. In words, counts the number of positions at which the genotype sequences and carry the same mutated allele and is equal to plus the number of positions where or carry the wild-type allele. See Proposition 2 in S1 Text for a proof of this result.

Furthermore, the elements of An can be written as (17) where En is defined as in Eq (15). See Proposition 3 in S1 Text for a proof of this result.

Finally, the matrices Vn and are diagonal matrices whose diagonal elements can be written as (18) and (19) where and again denotes the ith element in when ordered by the base s representation of integers. In words, wk = 1 if the genotype sequence carries the wild-type allele at position k and counts the number of positions in carrying the wild-type allele. We prove this result in Proposition 4 in S1 Text.

Generalization to different numbers of states per position

We can generalize the formulae described in the previous subsection further by considering that each position can have different numbers of states. In this case, we can denote sk the number of possible states at position k. For n = 1, this corresponds to exactly the same matrix as in the previous case but with s = s1, which is the number of possible states in this position. For n = 2, the matrix changes because now the new position can have a different number of possible states, s2. Following the recursive definition of Hn, we can construct H2 by repeating H1 s2 times, with the structure stated in Eq (10). Therefore, we have

So the structure is exactly the same but the size of the matrix for each n varies according to the number of possible states of the new position. The definition of Hn is the same as in Eq (10) but the dimensions of the matrix are . Similarly, the inverse matrix An+1 can be written recursively as (21) The matrix An defined in Eq (21) is the inverse of the matrix Hn in the general case where each position can have a different number of states.

In this general case, the elements of Hn and An can be written as (22) (23) where En and Mn are defined as in Eqs (15) and (16) and .

The matrices Vn and are diagonal matrices whose diagonal elements can be written as (24) and (25) where

We prove the results in this subsection in Propositions 5, 6 and 7 in S1 Text.

The above formulae permit the calculation and modeling of background-averaged epistasis in arbitrarily-shaped genetic landscapes, i.e. with any number of alleles (states) per position, as well as the direct construction of sub-matrices for regression to any desired epistatic order and/or in the presence of missing data. In the following subsections we report benchmarking results comparing the performance of alternative methods to obtain Hn and An, as well as results from the application of our theory extension to an empirical multiallelic genotype-phenotype landscape.

Benchmarking

Fig 1A–1D provides a visualization of the matrices Hn and An for different values of n and s, clearly showing a self-similar pattern in all cases due to their recursive nature.

thumbnail
Fig 1. Benchmarking results and heat map representations of matrices corresponding to the binary (biallelic) and multi-state (multiallelic) extension of the Walsh-Hadamard transform, and their corresponding inverses.

A. H6 Walsh-Hadamard transform (s = 2). B. H4 multi-state extension of the Walsh-Hadamard transform for s = 3. C. A6 Inverse Walsh-Hadamard transform. D. A4 multi-state extension of the inverse Walsh-Hadamard transform for s = 3. E. Computational time on a MacBook Pro (13-inch, 2017, 2.3GHz dual-core Intel Core i5) for extracting elements of An matrices of various dimensions and numbers of states (alleles) per position (s ∈ [2, 10]). Comparisons are shown between numerically inverting the recursively constructed Hn (using scipy.linalg.inv), i.e. “Recursive Hn inverse”, using the recursive formula for An, using the formula to extract all elements of An and extracting 10 random elements of An (see legend). The mean across 10 replicates is depicted. Linear regression lines were fit to data from matrices with at least 100 elements. F. Similar to E but indicating memory usage. Linear regression lines were fit to data from matrices with at least 10 elements.

https://doi.org/10.1371/journal.pcbi.1012132.g001

In this paper, we provide different methods to construct . First, Hn can be numerically inverted using standard matrix inversion algorithms (here we use the linalg.inv function from the SciPy library in Python), referred to as “Recursive Hn inverse” in Fig 1E and 1F. Alternatively, the recursive definition of the inverse given by Eq (13) can be used, which we refer to as “Recursive An”. As can be seen in Fig 1E, this method is faster than numerically inverting Hn.

Finally, we also provide a convenient formula for extracting specific individual elements of An Eq (17), referred to as “All elements An” in Fig 1E and 1F. This method is more computationally intensive than the previously described methods, due to the formula relying on the computation of (En)ij, which equates to counting the number of sequence positions that are identically mutated in vectors and , each of size n. However, in situations where subsets of elements (or sub-matrices)—rather than full matrices—are desired, Eq (17) provides a method that can be faster and more memory efficient (see “10 elements An” in Fig 1E and 1F).

For example, in the case of a 10-mer DNA sequence, constructing the full inverse transform A10 with s = 4 would require > 1023 bytes (100 million petabytes) of memory in the best-case scenario (“Recursive Hn inverse” in Fig 1F, log-linear extrapolation). Similarly, the full inverse transform for a 4-mer amino acid sequence (A4 with s = 20) would impose a memory footprint > 1020 bytes. On the other hand, calculating the subset of elements from these matrices required for the prediction of a single phenotype using epistatic coefficients up to third order (three-way genetic interaction terms) is feasible in both situations using Eq (17) (3,675 and 29,678 elements; 2.5 GB and 192 GB of memory; 1.8 and 99 seconds, respectively). This memory footprint can easily be diminished further using data chunking, which is a unique benefit of this method.

Application to a simulated multiallelic genotype-phenotype landscape

In order to demonstrate the utility of our extension of the Walsh-Hadamard transform, we used it to model epistasis within a simulated multiallelic genetic landscape. Knowledge of the ‘ground truth’ epistatic coefficients allowed us to (i) determine whether background-averaged coefficients can be accurately inferred using regression and (ii) investigate the impact of data sparseness on the results i.e. in the context of missing phenotypic values. In most practical applications—particularly in the case of large empirical genotype-phenotype landscapes—measuring the phenotypic effects of all mutation combinations is infeasible. The simulated combinatorially complete landscape consisted of all possible combinations of four alleles at six different positions i.e. a total of 46 = 4, 098 genotypes (Fig 2A, left). First-order and a subset of second-order epistatic coefficients were randomly drawn from Gaussian distributions, while all higher-order epistatic coefficients were set to zero (see Methods). We arbitrarily selected three pairs of positions at which mutations genetically interact i.e. ground truth non-zero pairwise epistatic terms (Fig 2A, left). Fitness values for all variants were reconstructed using Eq (12) with random noise added to simulate measurement error (see Methods).

thumbnail
Fig 2. Learning sparse models from simulated and empirical combinatorial fitness landscapes.

A. Schematic description of combinatorial DMS datasets. Left: simulated multiallelic landscape consisting of all combinations of 4 nucleotides in a DNA 6-mer where ground truth interactions ([1, 2], [3, 5] and [5, 6]) are depicted with red arcs (see Methods). Middle: secondary structure of S. cerevisiae tRNA-Arg(CCU) indicating variable positions (closed circles) combinatorially mutated in a DMS experiment as described in [6]. Three Watson–Crick base pairing (WCBP) interactions involving pairs of these positions ([1, 71], [2, 70] and [6, 66]) are indicated. Right: crystal structure of a blue fluorescent variant (TagBFP) of the Entacmaea quadricolor protein eqFP611 (PDB: 3M24) and 13 positions (12 shown) that differ in the red fluorescent variant (mKate2) that were mutated in a DMS experiment as described in [7]. Physical interactions between the fluorophore (L63) and three proximal residues ([63, 143], [63, 174] and [63, 197]) are indicated. B. Performance of sparse models fit using different proportions of the DMS datasets indicated in panel A. The median number of model coefficients is indicated. Colour scale as in panel D. C. Scatter plots comparing all non-zero model-inferred background-averaged epistatic coefficients (training set size = 64%) to ground truth values (left panel) or previously-reported coefficients (remaining panels) [6, 7]. Pearson’s r is shown. Four outlier 4th order coefficients in the tRNA dataset were omitted. D. Enrichment of direct physical interactions (red arcs in panel A) in non-zero epistatic coefficients. *, P < 0.05; **, P < 0.01; ***, P < 0.001. Error bars indicate the interquartile range.

https://doi.org/10.1371/journal.pcbi.1012132.g002

We then trained Lasso regression models of the form in Eq (12) to predict fitness values from sequence, where the inferred model parameters correspond to background-averaged epistatic coefficients up to third order (S1(A) Fig; see Methods). To assess the impact of data sparsity on the results, we sub-sampled the simulated phenotype (fitness) values to obtain training dataset sizes ranging from 64% to 1% of all variants in the combinatorial complete landscape. The sparse models generalize well, explaining more than 80% of fitness variance in the held out test data even when the overwhelming majority of variants (96%) are missing (Fig 2B, left; ‘Background-averaged models’, red). Importantly, the values of non-zero background-averaged epistatic coefficients closely match those of the ground truth simulated coefficients (Pearson’s r = 0.99, Fig 2C, left). Reducing the size of the training set tends to reduce the number of coefficients retained in sparse models, but the ground truth values of non-zero coefficients are well recovered (Pearson’s r = 0.99, S2(A) Fig).

For comparison, we used the same procedure to fit Lasso regression models of the form , where G−1 represents a matrix of one-hot encoded sequence features i.e. the presence or absence of a given mutation—or mutation combination (interaction)—with respect to the reference (wild-type) genotype is denoted by a ‘1’ or ‘0’ respectively (S1(A) Fig, Fig 2B, left; ‘One-hot models’, blue). The definition of G and its relationship to the biochemical (or background-relative) view of epistasis is explained in [2]. Although one-hot models show similar performance to background-averaged models on simulated test data, they tend to be more complex with larger numbers of coefficients (Fig 2B), including spurious third-order epistatic terms although their ground truth values are zero (S1(C) Fig).

In summary, these results demonstrate that our theory extensions allow the accurate inference of background-averaged epistatic coefficients in multiallelic genetic landscapes, even in situations when large amounts of phenotypic data are missing.

Application to empirical combinatorial fitness landscapes

We next applied our theory to model epistasis within two different empirical combinatorial fitness landscapes (Fig 2A, middle, right). In the first DMS assay, a budding yeast strain was used in which the single-copy arginine-CCU tRNA (tRNA-Arg(CCU)) gene is conditionally required for growth [6]. A library of variants of this gene was designed to cover all 5,184 (26 × 34) combinations of the 14 nucleotide substitutions observed in ten positions in post-whole-genome duplication yeast species. The library was transformed into S. cerevisiae, expressed under restrictive conditions and the enrichment of each genotype in the culture was quantified by deep sequencing before and after selection. After reprocessing of the raw data, we retained high quality fitness estimates for 3,847 variants (74.2%, see Methods).

Although the findings in [6] were based on the application of background-averaged epistasis theory, the prior limitation of the Walsh-Hadamard transform to only two alleles per sequence position required the authors to adopt an ad hoc strategy that involved performing separate analyses on combinatorially complete biallelic sub-landscapes. After imputing fitness values for missing variants we decomposed phenotypes into background-averaged epistatic coefficients using Eq (9), which yielded very similar values for significant coefficients of all orders (S2(B) Fig, see Methods). However—as shown above for simulated data—our theory extensions permit the modeling of background-averaged epistasis in the context of multiallelic landscapes with missing data. We followed the same strategy as described in the previous section, using the tRNA DMS data to train Lasso regression models to predict variant fitness from sequence incorporating epistatic coefficients up to eighth order (Fig 2B, middle, see Methods). The resulting models include many higher-order epistatic coefficients (S1(C) Fig, ‘Background-averaged models’) yet exhibit extreme sparsity, with the median number of non-zero coefficients of any order ranging from 19 to 60 i.e. approximately 1% of all possible coefficients of eighth order or less (Fig 2B, middle). Models trained on at least 10% of the training data tend to explain more than 50% of the total explainable variance (Fig 2B, middle) and model-inferred background-averaged epistatic terms correlate well with those reported previously [6] (Pearson’s r = 0.79, Fig 2C, middle).

To evaluate whether the inferred models report on biologically relevant features of the underlying genetic landscapes, we tested whether sparse model coefficients were more likely to comprise genetic interactions (or modulators thereof) involving known physically interacting positions in the wild-type tRNA secondary structure (Fig 2A, middle). Regardless of data sparsity, background-averaged model coefficients tend to be significantly enriched for physical interactions (Fig 2D, middle). On the other hand, in the case of even moderate sub-sampling of training data (16%), one-hot model coefficients show no such enrichment (Fig 2D, middle). The results of this enrichment analysis closely mirror those obtained using simulated data (Fig 2D, left), suggesting that they are likely to generalise to other empirical genotype-phenotype landscapes.

Finally, we repeated similar analyses using a second DMS dataset comprising fluorescence measurements for all combinations of single substitution mutations at 13 positions (213 = 8, 192 genotypes) separating two variants of the Entacmaea quadricolor protein eqFP611 (Fig 2, right column; see Methods) [7]. Lasso models capture the majority of fitness variance even with high levels of data sparsity (Fig 2B, right) and model-inferred background-averaged coefficients correlate well with those obtained using the decomposition in Eq (9) (Pearson’s r = 0.83, Fig 2C, right). We also find that direct pairwise physical interactions involving the fluorophore (L63, Fig 2A, right) are enriched in model-derived background-averaged epistatic terms (Fig 2D, right), a result that is robust to missing data and recapitulates our observations obtained using ground truth genetic interactions in simulated DMS data.

Discussion

We have provided an extension to the most rigorous computational framework available for describing and modeling empirical genotype-phenotype mappings. Our approach derives from the Walsh-Hadamard transform, yet we note that the new set of columns is no longer orthogonal. However, the columns remain independent, as proven by the existence of the inverse (Eq (13)). Beyond the study of background-averaged epistasis with respect to mutations in the primary sequence, this also permits the inclusion of ‘epimutations’ (changes in the epigenetic state of DNA), amino acid post-translational modifications or even particular environmental/experimental conditions.

In the simplest application, background-averaged epistatic coefficients (genetic interaction terms) can be directly computed from phenotypic measurements via the decomposition in Eq (9). However, estimating epistatic coefficients by regression—as in Eq (12)—is a more natural choice in the presence of missing data, when data for multiple related phenotypes is available [23] and/or in the presence of global epistasis [24, 25]. Our mathematical results provide three alternative methods to compute the multi-state (multiallelic) extension of the inverse Walsh-Hadamard transform An, one of which allows the direct extraction of specific elements or sub-matrices. In which situations might this capability be desirable?

First, constructing full An matrices—particularly by numerical inversion—is impractical for large genetic landscapes. Second, the result of the product represents a matrix of sequence features when setting up the inference of epistatic (model) coefficients from phenotypic measurements as a regression task [23, 24, 26, 27]. The ability to construct this feature matrix in batches (of rows) allows computational resource-efficient iteration over large datasets when using frameworks such as TensorFlow or PyTorch.

Third, there are currently no methods to comprehensively map empirical genotype-phenotype landscapes with size greater than the low millions of genotypes. Therefore, assaying landscapes of this size or larger will typically involve experimental measurement of a (random) sub-sample of genotypes, corresponding to distinct rows in An. In other words, it is usually unnecessary to construct full An matrices when modeling real experimental data. Finally, there is evidence of extreme sparsity in the epistatic architecture of biomolecules where only a small fraction of theoretically possible genetic interactions are non-zero [7]. The feasibility of sampling very large background-averaged epistatic coefficient spaces may improve methods to infer accurate genotype-phenotype models.

Using results from the analysis of both simulated and empirical multiallelic fitness landscapes, we have shown that sparse regression models relying on a background-averaged definition of epistasis can efficiently capture salient features of the underlying biological system—namely direct physical interactions or ground truth genetic interactions—even in situations of sparse sampling of phenotypes. This behaviour, which we speculate is due to a richer representation of the sequence feature space compared to one-hot models (i.e. higher level of constraint during model fitting; S1(A) Fig), is particularly desirable in the case of very large genetic landscapes where comprehensive phenotyping is infeasible. However, more work is needed to determine whether this result holds more generally. One difficulty in such comparisons between approaches is the requirement for a set of interactions or landscape features that are known to be critical for biomolecular function. Here we rely on Watson–Crick base pairing interactions and direct physical interactions involved in fluorophore orientation whose respective importance for RNA secondary structure and fluorescence activity are well-established.

More broadly, this work opens the door to investigations of the biological properties of background-averaged epistasis in empirical genetic landscapes of arbitrary shape and complexity. Beyond applications within the field of DMS, we believe our theory extensions have the potential to influence research in evolutionary and synthetic biology including protein engineering. In future it will be important to compare the performance and properties of models relying on this definition of epistasis to those of other recently proposed models that incorporate higher-order genetic interactions for phenotypic prediction [28, 29].

Methods

Simulated multiallelic fitness landscape

The simulated multiallelic genetic landscape (Fig 2A, left) consists of all possible combinations of 4 nucleotides per position in a 6-mer i.e. 46 = 4, 096 possible genotypes. Consistent with observations of empirical fitness landscapes where single mutant phenotypes tend to be biased towards negative (detrimental) effects and larger in size than pairwise and higher-order epistatic terms, ground truth first-order epistatic coefficients (single nucleotide substitution effects) were drawn randomly from a normal distribution with a mean of -1 and standard deviation of 2 i.e. ϵ = N(μ, σ2) = N(−1, 4). Ground truth non-zero second-order terms (pairwise genetic interactions) were arbitrarily selected between the following pairs of positions: [1, 2], [3, 5] and [5, 6], where their coefficient values were randomly sampled from ϵ = N(0, 1). All other second- and higher-order coefficients were set to zero. Fitness scores for all variants were calculated using Eq (12). We simulated measurement error by adding Gaussian noise to all fitness scores of similar magnitude to the variance of first-order epistatic coefficients i.e. . Finally, we also simulated versions of the same multiallelic fitness landscape with both relatively lower and higher measurement noise where and , respectively (S1(B) Fig).

Empirical combinatorial fitness landscapes

Raw sequencing (FASTQ) files obtained from the tRNA-Arg(CCU) DMS experiment in [6] were re-processed with DiMSum v1.3 [30] using default parameters with minor adjustments. We obtained fitness estimates for 5,059 out of a total of 5,184 possible variants (97.6%) in the combinatorially complete genetic landscape. We restricted the data to a high quality subset by requiring fitness estimates in all six biological replicates as well as at least 10 input read counts in all input samples. This resulted in a total of 3,847 retained variants (74.2%) for downstream analysis. For comparisons to previously-reported background-averaged coefficients (Fig 2C, S2(B) Fig), we used the author-processed data and analysis scripts [6]. Likewise, we used the author-processed data for imputation of missing phenotype values (replaced with the mean fitness at every mutation order) and calculation of background-averaged epistatic coefficients using the decomposition in Eq (9) (S2(B) Fig). For the eqFP611 fluorescent protein DMS experiment, we used the author-processed fitness estimates (brightness scores) [7].

Lasso regression models

We trained Lasso regression models to predict variant fitness estimates from nucleotide or amino acid sequences using the ‘scikit-learn’ Python package. Training data comprised random subsets of 1, 2, 4, 8, 16, 32 and 64% of retained variants of all mutation orders. All remaining held out variants comprised the ‘test’ data which was unseen during model training in each case.

To train models inferring background-averaged epistatic coefficients we used feature matrices of the form (see Eq (12)). For comparison, one-hot encoded matrices of sequence features were used. Linear regression was performed using 10-fold cross validation to determine the optimal value of the L1 regularization parameter λ in the range [0.005, 0.25] (‘LassoCV’ and ‘RepeatedKFold’ functions). Final models were fit to all training data. In order to estimate model-related statistics and performance results we fit 100 models to different random subsets of the training data for each model type and training data fraction. In Fig 2, S1(B), S1(C) and S2(A) Figs we plot the median of the indicated measures over all models, where error bars indicate the interquartile range. For model performance estimates in Fig 2B, the maximum explainable variance was calculated by substracting the total estimated technical variance (as reported by DiMSum fitness errors [30]) from the total fitness variance. For the simulated and eqFP611 DMS datasets, the maximum explainable variance was assumed to be 100%. In comparisons between sparse model-derived background-averaged epistatic coefficients and ground truth or previously-reported coefficients (scatter plots in Fig 2C and S2(A) Fig) we required model-derived terms to be non-zero i.e. those whose interquartile range over all models did not include zero. Four outlier 4th order coefficients in the tRNA dataset were omitted from Fig 2C.

To test enrichment of physical interactions in Lasso model coefficients we used the following approach: for each model, all position pairs represented in non-zero epistatic coefficients of at least second order were determined. The number of position pairs corresponding to direct physical interactions was counted and an associated enrichment score (odds ratio) and P-value calculated using Fisher’s Exact Test. The background set consisted of all position pairs in all possible epistatic coefficients. To test the appropriateness of the null hypothesis we also repeated enrichment analyses using random models i.e. randomly chosen sets of epistatic coefficients matching the numbers of non-zero coefficients in Lasso models and their distribution over different epistatic orders. The results of this analysis performed using the simulated dataset are shown in S1(B) Fig, left. Finally, the middle and right panels in S1(B) Fig assess the impact of measurement noise on the results of this enrichment analysis, revealing that noise level is anti-correlated with enrichment score and significance thereof, as expected.

Supporting information

S1 Fig. Supplementary figure related to Fig 2B and 2C.

A. Cartoon depiction of alternative feature matrices for inferring epistatic coefficients by linear regression. G−1 in the right panel indicates the matrix of one-hot encoded sequence features—or embeddings—typically used when fitting models of genotype-phenotype landscapes [2]. The left panel represents the matrix of sequence features used to infer background-averaged epistatic coefficients, as in Eq (12). B. Enrichment of direct physical interactions (see Fig 2A) in non-zero epistatic coefficients from sparse models of a simulated multiallelic fitness landscape. Left: results for random models with matching numbers of randomly selected epistatic coefficients at all coefficient orders (see Fig 2C, left). Middle, Right: results for sparse models fit to low and high measurement noise versions of the simulated fitness dataset, respectively. C. Numbers of non-zero espistatic coefficients of different orders in Lasso regression models inferred using different random fractions of the simulated and empirical combinatorial fitness landscapes indicated.

https://doi.org/10.1371/journal.pcbi.1012132.s001

(EPS)

S2 Fig. Supplementary figure related to Fig 2D.

A. Scatter plots comparing sparse model-inferred background-averaged epistatic coefficients of a simulated multiallelic fitness landscape to ground truth values for varying training dataset sizes. Error bars indicate the interquartile range. B. Scatter plots comparing background-averaged epistatic coefficients obtained using the tRNA-Arg(CCU) DMS dataset and Eq (9) to previously-reported significant coefficients [6] after imputing fitness values for missing variants (see Methods) and shown separately for coefficient orders 1-6. Only significant coefficients (FDR < 0.05) observed in at least 10 backgrounds are shown [6]. Error bars indicate 95% confidence intervals.

https://doi.org/10.1371/journal.pcbi.1012132.s002

(EPS)

Acknowledgments

We thank all members of the Lehner and Weghorn Labs for helpful discussions and suggestions. We acknowledge support of the Spanish Ministry of Science and Innovation through the Centro de Excelencia Severo Ochoa (CEX2020-001049-S, MCIN/AEI /10.13039/501100011033), the Generalitat de Catalunya through the CERCA programme and to the EMBL partnership. We are grateful to the CRG Core Technologies Programme for their support and assistance in this work.

References

  1. 1. Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics. 2008;9(11):855–867. pmid:18852697
  2. 2. Poelwijk FJ, Krishna V, Ranganathan R. The Context-Dependence of Mutations: A Linkage of Formalisms. PLoS Computational Biology. 2016;12(6):e1004771. pmid:27337695
  3. 3. Domingo J, Baeza-Centurion P, Lehner B. The Causes and Consequences of Genetic Interactions (Epistasis). Annu Rev Genomics Hum Genet. 2019;20:433–460. pmid:31082279
  4. 4. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11(8):801–807. pmid:25075907
  5. 5. de Visser JAG, Krug J. Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics. 2014;15(7):480–490. pmid:24913663
  6. 6. Domingo J, Diss G, Lehner B. Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature. 2018;558(7708):117–121. pmid:29849145
  7. 7. Poelwijk FJ, Socolich M, Ranganathan R. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nature communications. 2019;10(1):1–11. pmid:31527666
  8. 8. Baeza-Centurion P, Miñana B, Schmiedel JM, Valcárcel J, Lehner B. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell. 2019;176(3):549–563. pmid:30661752
  9. 9. Pokusaeva VO, Usmanova DR, Putintseva EV, Espinar L, Sarkisyan KS, Mishin AS, et al. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape. PLoS genetics. 2019;15(4):e1008079. pmid:30969963
  10. 10. Bendixsen DP, Collet J, Østman B, Hayden EJ. Genotype network intersections promote evolutionary innovation. PLoS biology. 2019;17(5):e3000300. pmid:31136568
  11. 11. Soo VW, Swadling JB, Faure AJ, Warnecke T. Fitness landscape of a dynamic RNA structure. PLoS genetics. 2021;17(2):e1009353. pmid:33524037
  12. 12. Moulana A, Dupic T, Phillips AM, Chang J, Nieves S, Roffler AA, et al. Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 Omicron BA. 1. Nature Communications. 2022;13(1):1–11. pmid:36384919
  13. 13. Rotrattanadumrong R, Yokobayashi Y. Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning. Nature communications. 2022;13(1):1–14. pmid:35977956
  14. 14. Lynch M, Walsh B, et al. Genetics and analysis of quantitative traits. vol. 1. Sinauer Sunderland, MA; 1998.
  15. 15. Goldberg DE. Genetic Algorithms and Walsh Functions: Part I, A Genetle Introduction. Complex systems. 1989;3:129–152.
  16. 16. Weinreich DM, Lan Y, Wylie CS, Heckendorn RB. Should evolutionary geneticists worry about higher-order epistasis? Current opinion in genetics & development. 2013;23(6):700–707. pmid:24290990
  17. 17. Poelwijk FJ, Ranganathan R. The relation between alignment covariance and background-averaged epistasis. arXiv. 2017;10.48550/ARXIV.1703.10996.
  18. 18. Brookes DH, Aghazadeh A, Listgarten J. On the sparsity of fitness functions and implications for learning. Proc Natl Acad Sci U S A. 2022;119(1). pmid:34937698
  19. 19. Ogbunugafor CB. The mutation effect reaction norm (mu‐rn) highlights environmentally dependent mutation effects and epistatic interactions. Evolution. 2022;76(s1):37–48. pmid:34989399
  20. 20. Weinberger ED. Fourier and Taylor series on fitness landscapes. Biological cybernetics. 1991;65(5):321–330.
  21. 21. Beer T. Walsh transforms. American Journal of Physics. 1981;49(5):466–472.
  22. 22. Stoffer DS. Walsh-Fourier Analysis and its Statistical Applications. Journal of the American Statistical Association. 1991;86(414):461–479.
  23. 23. Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604(7904):175–183. pmid:35388192
  24. 24. Tareen A, Kooshkbaghi M, Posfai A, Ireland WT, McCandlish DM, Kinney JB. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome biology. 2022;23(1):1–27. pmid:35428271
  25. 25. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences. 2018;115(32):E7550–E7558. pmid:30037990
  26. 26. Forcier TL, Ayaz A, Gill MS, Jones D, Phillips R, Kinney JB. Measuring cis-regulatory energetics in living cells using allelic manifolds. Elife. 2018;7:e40618. pmid:30570483
  27. 27. Kinney JB, Murugan A, Callan CG Jr, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences. 2010;107(20):9158–9163. pmid:20439748
  28. 28. Zhou J, Wong MS, Chen WC, Krainer AR, Kinney JB, McCandlish DM. Higher-order epistasis and phenotypic prediction. Proceedings of the National Academy of Sciences. 2022;119(39):e2204233119. pmid:36129941
  29. 29. Zhou J, McCandlish DM. Minimum epistasis interpolation for sequence-function relationships. Nature communications. 2020;11(1):1–14. pmid:32286265
  30. 30. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biology. 2020;21(1):1–23. pmid:32799905