An extension of the Walsh-Hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity

Andre J. Faure; Ben Lehner; Verónica Miró Pina; Claudia Serrano Colome; Donate Weghorn

doi:10.1371/journal.pcbi.1012132

Abstract

Accurate models describing the relationship between genotype and phenotype are necessary in order to understand and predict how mutations to biological sequences affect the fitness and evolution of living organisms. The apparent abundance of epistasis (genetic interactions), both between and within genes, complicates this task and how to build mechanistic models that incorporate epistatic coefficients (genetic interaction terms) is an open question. The Walsh-Hadamard transform represents a rigorous computational framework for calculating and modeling epistatic interactions at the level of individual genotypic values (known as genetical, biological or physiological epistasis), and can therefore be used to address fundamental questions related to sequence-to-function encodings. However, one of its main limitations is that it can only accommodate two alleles (amino acid or nucleotide states) per sequence position. In this paper we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis (also known as ensemble epistasis) in genetic landscapes with an arbitrary number of states per position (20 for amino acids, 4 for nucleotides, etc.). We also provide a recursive formula for the inverse matrix and then derive formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices. Finally, we demonstrate the utility of our theory by using it to model epistasis within both simulated and empirical multiallelic fitness landscapes, revealing that both pairwise and higher-order genetic interactions are enriched between physically interacting positions.

Author summary

An important question in genetics is how the effects of mutations combine to alter phenotypes. Genetic interactions (epistasis) describe non-additive effects of pairs of mutations, but can also involve higher-order (three- and four-way etc.) combinations. Quantifying higher-order interactions is experimentally very challenging requiring a large number of measurements. Techniques based on deep mutational scanning (DMS) represent valuable sources of data to study epistasis. However, the best way to extract the relevant pairwise and higher-order epistatic coefficients (genetic interaction terms) from this data for the task of phenotypic prediction remains an unresolved problem. The Walsh-Hadamard transform represents a rigorous computational framework for calculating and modeling epistatic interactions at the level of individual genotypic values. Critically, this formalism currently only allows for two alleles (amino acid or nucleotide states) per sequence position, hampering applications in more biologically realistic scenarios. Here we present an extension of the Walsh-Hadamard transform that overcomes this limitation and demonstrate the utility of our theory by using it to model epistasis within both simulated and empirical multiallelic genetic landscapes.

Citation: Faure AJ, Lehner B, Miró Pina V, Serrano Colome C, Weghorn D (2024) An extension of the Walsh-Hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity. PLoS Comput Biol 20(5): e1012132. https://doi.org/10.1371/journal.pcbi.1012132

Editor: Zhaolei Zhang, University of Toronto, CANADA

Received: June 19, 2023; Accepted: May 4, 2024; Published: May 28, 2024

Copyright: © 2024 Faure et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The article repository (https://github.com/lehner-lab/whmatrixextms) provides source code for all computational analyses.

Funding: A.J.F. was funded by a Ramón y Cajal fellowship (RYC2021-033375-I funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR). B.L. and this project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (883742), the Spanish Ministry of Science and Innovation (LCF/ PR/HR21/52410004, EMBL Partnership), the Plan Estatal de Investigación Científica y Técnica y de Innovación (PID2020-118723GB-I00 / AEI / 10.13039/501100011033), the Bettencourt Schueller Foundation, the AXA Research Fund (AXA professor of Risk prediction in age-related diseases) and the Secretariat of Universities and Research, Ministry of Enterprise and Knowledge of the Government of Catalonia and the European Social Funds (2017 SGR 1322). V.M.P. was funded by the Spanish Ministry of Science and Innovation (PGC2018-100941-A- I00 AEI/FEDER, UE and PID2021-128976NB-I00 funded by MICIU/ AEI / 10.13039/501100011033 / FEDER, UE). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: A.J.F. and B.L. are founders, employees and shareholders of ALLOX.

Introduction

A fundamental challenge in biology is to understand and predict how changes (or mutations) to biological sequences (DNA, RNA, proteins) affect their molecular function and ultimately the phenotype of living organisms. The phenomenon of ‘epistasis’ (genetic interactions)—broadly defined as the dependence of mutational effects on the genetic context in which they occur [1–3]—is widespread in biological systems, yet knowledge of the underlying mechanisms remains limited. Defining the extent of epistasis and better understanding of its origins has relevance in fields ranging from genetic prediction, molecular evolution, infectious disease and cancer drug development, to biomolecular structure determination and protein engineering [3].

Evolutionarily related sequences, natural genetic variation within populations, and more recently results of techniques such as deep mutational scanning (DMS) [4]—also known as massively parallel reporter assays (MPRAs) and multiplex assays of variant effect (MAVEs)—represent valuable sources of data to study epistasis [1, 5]. In particular, DMS enables the systematic measurement of mutational effects across entire combinatorially complete genetic landscapes [5–13]. Importantly, the typical use of engineered genotypes, haploid individuals and near-identical environmental (laboratory) conditions in these experiments allows population genetic considerations—such as dominance, variable allele frequencies and linkage disequilibrium—to be ignored [14]. In other words, measurements obtained from deep mutational scanning and related methods permit the modeling of epistasis in the mechanistic sense (sequence-to-function encoding) rather than in the evolutionary sense, i.e. based on the dynamics of population genotype frequencies. Nevertheless, precisely how to extract the most biologically relevant pairwise and higher-order epistatic coefficients (genetic interaction terms) from this type of data is an unresolved problem.

Quantitative definitions of epistasis vary among fields, but it has been argued that one particular formulation termed ‘background-averaged’ epistasis, also known as ‘ensemble’ epistasis [1, 12], may provide the most useful information on the epistatic structure of biological systems [2]. The underlying rationale is that by averaging the effects of mutations across many different genetic backgrounds (contexts), the method is robust to local idiosyncrasies in the relationship between genotype and phenotype. It has been previously pointed out that the definition of background-averaged epistasis is conceptually similar to that of ‘statistical epistasis’ attributed to Fisher, but instead of measuring the average effect of allele substitutions against the population average genetic background i.e. averaging over all genotypes present in a given population (taking into account their individual frequencies), the approach instead averages over all possible genotypes (assuming equal genotype weights) [1, 2].

The current mathematical formalism of background-averaged epistasis is based on the Walsh-Hadamard transform [2]. Interestingly, although widely used in physics and engineering, the Walsh-Hadamard transform was first applied to non-biological fitness landscapes in the field of genetic algorithms (GA) [15], subsequently being proposed as the basis of a framework for the computation of higher-order epistasis in empirical settings [16]. However, the Walsh-Hadamard transform can only accommodate two alleles (amino acid or nucleotide states) per sequence position, with no extension to multialleleic landscapes (cardinality greater than two) yet made, as confirmed by multiple recent reports [2, 17–19]. Alternative implementations for multiallelic landscapes either rely on ‘one-hot encoding’ elements of larger alphabets as biallelic sequences—requiring the manipulation of prohibitively large Walsh-Hadamard matrices—or constructing graph Fourier bases [18, 20], which is mathematically complex and provides no straightforward way to interpret epistatic coefficients. The result is that the application of background-averaged epistasis has been severely limited and its properties remain largely unexplored in more biologically realistic scenarios.

In this work we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis in genetic landscapes with an arbitrary number of states (20 for amino acids, 4 for nucleotides, etc.). We also provide a recursive formula for the inverse matrix, which is required to infer epistatic coefficients using regression. Furthermore, we derive convenient formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices. Lastly, we apply these formulae to the analysis of both simulated and empirical multiallelic DMS datasets, demonstrating that sparse models inferred from the background-averaged representation (embedding) of the underlying genetic landscape more regularly include epistatic terms corresponding to direct physical interactions.

Results

Extension of the Walsh-Hadamard transform to multiallelic landscapes

In this work, a genotype sequence is represented as a one-dimensional ordering of monomers, each of which can take on s possible states (or alleles), for example s = 4 for nucleotide sequences or s = 20 for amino acid sequences. Without loss of generality, the s states can be labelled 0, 1, 2, …, s − 1, where 0 denotes the wild-type allele. We are going to consider genotype sequences of length , i.e. sequences taking values in , where .

Each genotype is associated with its phenotype . Note that here we use the term ‘phenotype’ as shorthand for ‘molecular phenotype score’ from a quantitative laboratory assay (DMS) reporting on a molecular function for each genotype of interest. In quantitative genetics terminology this might be referred to as ‘genotypic value’ because environmental deviation is negligible due to the controlled nature of the experiments, but our subject here is the macromolecule not an individual from a population [14]. In the context of empirical genotype-phenotype landscapes, the phenotypic effect of a genotype is typically measured with respect to the wild-type, i.e. it is given by .

It is important to emphasize that in what follows we implicitly restrict ourselves to the haploid reference base, because our primary goal is the modeling of sequence-to-function encodings for individual genotype sequences—for the ultimate purpose of understanding and engineering macromolecules—not the modeling of sequence evolution or quantification of sources of phenotypic variance in populations.

If the phenotypic effects of individual mutations were independent, they would be additive, meaning that the phenotypic effect of would be the sum of the phenotypic effects of the single mutants (i₁, 0, …, 0), …, (0, …, 0, i_n). The epistatic coefficient quantifies how much the observed phenotypic effect of deviates from this assumption. In the case of background-averaged epistasis, we quantify the interactions between a set of mutations by averaging over all possible genotypes for the remaining positions in the sequence. For example, if n = 3 and s = 2, the pairwise epistatic coefficient involving the mutations at positions 2 and 3 is calculated by averaging over all states (backgrounds) for the remaining positions, in this case given by the two states of the first position (* denotes the positions at which the averaging is performed), i.e. (1)

More generally, in [2] it is shown that for s = 2 and any sequence length n, phenotypic effects can be decomposed into background-averaged epistatic coefficients with (2) where is the vector , is the vector and and are 2ⁿ × 2ⁿ matrices defined recursively as follows: (3) (4) The matrix is known as the Walsh-Hadamard transform [21, 22] and is a diagonal weighting (or normalisation) matrix to correct the sign and account for averaging over different numbers of backgrounds as a function of epistatic order [2].

In this work, we provide an extension of this theory to describe background-averaged epistasis for sequences with an arbitrary number of states s. Before writing a general formula, we consider the simplest possible multi-state (multiallelic) landscape i.e. a sequence of length n = 1 with s = 3, (5)

Consistent with the definition of background-averaged epistasis for biallelic landscapes [2], the zeroth-order epistatic coefficient ε_(*) is the mean phenotypic value across all genotypes and the first-order epistatic coefficients ε₍₁₎ and ε₍₂₎ are simply the respective individual phenotypic effects of genotypes y₍₁₎ and y₍₂₎ with respect to the wild-type. However, the key feature of H₁ for multiallelic landscapes—and where it departs from the canonical Walsh-Hadamard transform—is the introduction of zero elements to exclude phenotypes that are irrelevant for the calculation of a given epistatic coefficient. In other words, these phenotypes are excluded because they correspond neither to relevant intermediate genotypes nor alternative genetic backgrounds.

If we now consider a sequence of length n = 2 with s = 3, then the H₂ and V₂ matrices become 9 × 9 (sⁿ × sⁿ) and can be constructed from recurring to the case n = 1 above, giving where the colors highlight the block structure of the matrices. In V₂, the red square corresponds to and the light red squares to −V₁. In H₂, the gray squares correspond to H₁ and the blue squares to −H₁. In Table 1 we show the results of background-averaged epistatic coefficients calculated by applying the above formula to an empirical multiallelic landscape with n = 2 and s = 3 [6].

Download:

Table 1. Interaction terms based on background-averaged epistasis (

) for an empirical multiallelic genotype-phenotype landscape consisting of all combinations of two mutations each at positions 6 and 66 in the tRNA-Arg(CCU) [6], i.e. n = 2 and s = 3.

The first two columns indicate nucleic acid sequences and their base 3 representations. Here the ‘GC’ reference (wild-type) genotype corresponds to that of S. cerevisiae, denoted by (0, 0). The second two columns show the measured phenotypic effects and corresponding background-averaged epistatic coefficients together with their empirical errors and their propagated errors. See Results for a regression analysis of the entire dataset and Proposition 8 in S1 Text for a derivation of the error propagation.

https://doi.org/10.1371/journal.pcbi.1012132.t001

More generally, for any value of s, when n = 1, (6) where ε_(*) corresponds to averaging phenotypes over all possible genotypes and the remaining coefficients simply correspond to the phenotypic effects of each mutation.

For n = 2, we have to consider different combinations of mutations in both positions. In this case, the phenotypes can be written as

A natural ordering of the phenotypes is given by interpreting genotype as the base s representation of an integer (see Table 1). From this, we can see how the first s genotypes correspond to combining the wild-type allele at the first position with a state from the case n = 1, i.e. to genotypes that can be written , with . The next s genotypes correspond to the first mutated allele at the first position combined with all the genotypes of n = 1, i.e. , and so on. Therefore, we can write the matrices H and V following a block structure. In the case n = 2 and any given s, we would then have (7) where the number of H₁ blocks corresponds to the number of states of the first position, so s. Moreover, each of these blocks must be normalized to yield the corresponding background-averaged epistatic terms. Therefore V₂ can also be expressed as a function of V₁ as follows: (8)

Given these two matrices, we can write the background-averaged epistatic coefficients for the case of n = 2 and s different states per position as . More generally, the decomposition of phenotypic effects into background-averaged epistatic coefficients is given by (9) where H_n and V_n can be defined recursively as (10) (11)

Recursive inverse matrix

Eq (9) defines the vector of epistatic coefficients, , as a function of the vector of phenotypes, , which in general is the quantity that is measured experimentally. However, usually phenotypic measurements are only available for a subset of genotypes. An alternative is therefore to estimate the epistatic coefficients by regression, (12) where the product represents a matrix of sequence features. This is analogous to the more widely used one-hot encoding strategy, which implicitly relies on a ‘background-relative’ (or ‘biochemical’) view of epistasis when regressing to full order [2]. We discuss other advantages of estimating background-averaged epistatic coefficients using regression at the end of this manuscript. Since V_n is a diagonal matrix, its inverse is also a diagonal matrix whose elements are the inverse of the elements of V_n.

The inverse of H_n is the matrix A_n which can be defined recursively as (13) See Proposition 1 in S1 Text for a proof of this result. This is the most efficient method to determine the full matrix A_n (see Results).

Formulae to obtain elements of the matrices

When regressing phenotypes on genotypes, a common goal is to determine whether epistatic coefficients up to the r^th order (where r < n) are sufficient to describe the complexity of the biological system. Furthermore, as mentioned above, some fraction of phenotype values within combinatorially complete genetic landscapes are typically unavailable, representing missing data. Restricting the epistatic order and missing phenotypes respectively correspond to omitting rows and columns from H_n (and vice versa from A_n). Formulae to directly obtain elements of the matrices in Eqs (9) and (12) would therefore be convenient.

In order to write the matrix element (H_n)_ij, we need to compare the genotype sequences , where denotes the i^th element in , , and the elements of are ordered by the base s representation of integers. For instance, for any value of n, we will denote the wild-type state with index i = 1 and write . The element denoted with index i = 2 would be and so on.

The elements of H_n can be written as (14) where M and E are sⁿ × sⁿ matrices whose elements are (15) (16) where δ_ij denotes the Kronecker delta of i, j, which is equal to 1 when i = j and 0 if i ≠ j. In words, counts the number of positions at which the genotype sequences and carry the same mutated allele and is equal to plus the number of positions where or carry the wild-type allele. See Proposition 2 in S1 Text for a proof of this result.

Furthermore, the elements of A_n can be written as (17) where E_n is defined as in Eq (15). See Proposition 3 in S1 Text for a proof of this result.

Finally, the matrices V_n and are diagonal matrices whose diagonal elements can be written as (18) and (19) where and again denotes the i^th element in when ordered by the base s representation of integers. In words, w_k = 1 if the genotype sequence carries the wild-type allele at position k and counts the number of positions in carrying the wild-type allele. We prove this result in Proposition 4 in S1 Text.

Generalization to different numbers of states per position

We can generalize the formulae described in the previous subsection further by considering that each position can have different numbers of states. In this case, we can denote s_k the number of possible states at position k. For n = 1, this corresponds to exactly the same matrix as in the previous case but with s = s₁, which is the number of possible states in this position. For n = 2, the matrix changes because now the new position can have a different number of possible states, s₂. Following the recursive definition of H_n, we can construct H₂ by repeating H₁ s₂ times, with the structure stated in Eq (10). Therefore, we have

So the structure is exactly the same but the size of the matrix for each n varies according to the number of possible states of the new position. The definition of H_n is the same as in Eq (10) but the dimensions of the matrix are . Similarly, the inverse matrix A_n+1 can be written recursively as (21) The matrix A_n defined in Eq (21) is the inverse of the matrix H_n in the general case where each position can have a different number of states.

In this general case, the elements of H_n and A_n can be written as (22) (23) where E_n and M_n are defined as in Eqs (15) and (16) and .

The matrices V_n and are diagonal matrices whose diagonal elements can be written as (24) and (25) where

We prove the results in this subsection in Propositions 5, 6 and 7 in S1 Text.

The above formulae permit the calculation and modeling of background-averaged epistasis in arbitrarily-shaped genetic landscapes, i.e. with any number of alleles (states) per position, as well as the direct construction of sub-matrices for regression to any desired epistatic order and/or in the presence of missing data. In the following subsections we report benchmarking results comparing the performance of alternative methods to obtain H_n and A_n, as well as results from the application of our theory extension to an empirical multiallelic genotype-phenotype landscape.

Benchmarking

Fig 1A–1D provides a visualization of the matrices H_n and A_n for different values of n and s, clearly showing a self-similar pattern in all cases due to their recursive nature.

Download:

Fig 1. Benchmarking results and heat map representations of matrices corresponding to the binary (biallelic) and multi-state (multiallelic) extension of the Walsh-Hadamard transform, and their corresponding inverses.

A. H₆ Walsh-Hadamard transform (s = 2). B. H₄ multi-state extension of the Walsh-Hadamard transform for s = 3. C. A₆ Inverse Walsh-Hadamard transform. D. A₄ multi-state extension of the inverse Walsh-Hadamard transform for s = 3. E. Computational time on a MacBook Pro (13-inch, 2017, 2.3GHz dual-core Intel Core i5) for extracting elements of A_n matrices of various dimensions and numbers of states (alleles) per position (s ∈ [2, 10]). Comparisons are shown between numerically inverting the recursively constructed H_n (using scipy.linalg.inv), i.e. “Recursive H_n inverse”, using the recursive formula for A_n, using the formula to extract all elements of A_n and extracting 10 random elements of A_n (see legend). The mean across 10 replicates is depicted. Linear regression lines were fit to data from matrices with at least 100 elements. F. Similar to E but indicating memory usage. Linear regression lines were fit to data from matrices with at least 10 elements.

https://doi.org/10.1371/journal.pcbi.1012132.g001

In this paper, we provide different methods to construct . First, H_n can be numerically inverted using standard matrix inversion algorithms (here we use the linalg.inv function from the SciPy library in Python), referred to as “Recursive H_n inverse” in Fig 1E and 1F. Alternatively, the recursive definition of the inverse given by Eq (13) can be used, which we refer to as “Recursive A_n”. As can be seen in Fig 1E, this method is faster than numerically inverting H_n.

Finally, we also provide a convenient formula for extracting specific individual elements of A_n Eq (17), referred to as “All elements A_n” in Fig 1E and 1F. This method is more computationally intensive than the previously described methods, due to the formula relying on the computation of (E_n)_ij, which equates to counting the number of sequence positions that are identically mutated in vectors and , each of size n. However, in situations where subsets of elements (or sub-matrices)—rather than full matrices—are desired, Eq (17) provides a method that can be faster and more memory efficient (see “10 elements A_n” in Fig 1E and 1F).

For example, in the case of a 10-mer DNA sequence, constructing the full inverse transform A₁₀ with s = 4 would require > 10²³ bytes (100 million petabytes) of memory in the best-case scenario (“Recursive H_n inverse” in Fig 1F, log-linear extrapolation). Similarly, the full inverse transform for a 4-mer amino acid sequence (A₄ with s = 20) would impose a memory footprint > 10²⁰ bytes. On the other hand, calculating the subset of elements from these matrices required for the prediction of a single phenotype using epistatic coefficients up to third order (three-way genetic interaction terms) is feasible in both situations using Eq (17) (3,675 and 29,678 elements; 2.5 GB and 192 GB of memory; 1.8 and 99 seconds, respectively). This memory footprint can easily be diminished further using data chunking, which is a unique benefit of this method.

Application to a simulated multiallelic genotype-phenotype landscape

In order to demonstrate the utility of our extension of the Walsh-Hadamard transform, we used it to model epistasis within a simulated multiallelic genetic landscape. Knowledge of the ‘ground truth’ epistatic coefficients allowed us to (i) determine whether background-averaged coefficients can be accurately inferred using regression and (ii) investigate the impact of data sparseness on the results i.e. in the context of missing phenotypic values. In most practical applications—particularly in the case of large empirical genotype-phenotype landscapes—measuring the phenotypic effects of all mutation combinations is infeasible. The simulated combinatorially complete landscape consisted of all possible combinations of four alleles at six different positions i.e. a total of 4⁶ = 4, 098 genotypes (Fig 2A, left). First-order and a subset of second-order epistatic coefficients were randomly drawn from Gaussian distributions, while all higher-order epistatic coefficients were set to zero (see Methods). We arbitrarily selected three pairs of positions at which mutations genetically interact i.e. ground truth non-zero pairwise epistatic terms (Fig 2A, left). Fitness values for all variants were reconstructed using Eq (12) with random noise added to simulate measurement error (see Methods).

Download:

Fig 2. Learning sparse models from simulated and empirical combinatorial fitness landscapes.

A. Schematic description of combinatorial DMS datasets. Left: simulated multiallelic landscape consisting of all combinations of 4 nucleotides in a DNA 6-mer where ground truth interactions ([1, 2], [3, 5] and [5, 6]) are depicted with red arcs (see Methods). Middle: secondary structure of S. cerevisiae tRNA-Arg(CCU) indicating variable positions (closed circles) combinatorially mutated in a DMS experiment as described in [6]. Three Watson–Crick base pairing (WCBP) interactions involving pairs of these positions ([1, 71], [2, 70] and [6, 66]) are indicated. Right: crystal structure of a blue fluorescent variant (TagBFP) of the Entacmaea quadricolor protein eqFP611 (PDB: 3M24) and 13 positions (12 shown) that differ in the red fluorescent variant (mKate2) that were mutated in a DMS experiment as described in [7]. Physical interactions between the fluorophore (L63) and three proximal residues ([63, 143], [63, 174] and [63, 197]) are indicated. B. Performance of sparse models fit using different proportions of the DMS datasets indicated in panel A. The median number of model coefficients is indicated. Colour scale as in panel D. C. Scatter plots comparing all non-zero model-inferred background-averaged epistatic coefficients (training set size = 64%) to ground truth values (left panel) or previously-reported coefficients (remaining panels) [6, 7]. Pearson’s r is shown. Four outlier 4th order coefficients in the tRNA dataset were omitted. D. Enrichment of direct physical interactions (red arcs in panel A) in non-zero epistatic coefficients. *, P < 0.05; **, P < 0.01; ***, P < 0.001. Error bars indicate the interquartile range.

https://doi.org/10.1371/journal.pcbi.1012132.g002

We then trained Lasso regression models of the form in Eq (12) to predict fitness values from sequence, where the inferred model parameters correspond to background-averaged epistatic coefficients up to third order (S1(A) Fig; see Methods). To assess the impact of data sparsity on the results, we sub-sampled the simulated phenotype (fitness) values to obtain training dataset sizes ranging from 64% to 1% of all variants in the combinatorial complete landscape. The sparse models generalize well, explaining more than 80% of fitness variance in the held out test data even when the overwhelming majority of variants (96%) are missing (Fig 2B, left; ‘Background-averaged models’, red). Importantly, the values of non-zero background-averaged epistatic coefficients closely match those of the ground truth simulated coefficients (Pearson’s r = 0.99, Fig 2C, left). Reducing the size of the training set tends to reduce the number of coefficients retained in sparse models, but the ground truth values of non-zero coefficients are well recovered (Pearson’s r = 0.99, S2(A) Fig).

For comparison, we used the same procedure to fit Lasso regression models of the form , where G⁻¹ represents a matrix of one-hot encoded sequence features i.e. the presence or absence of a given mutation—or mutation combination (interaction)—with respect to the reference (wild-type) genotype is denoted by a ‘1’ or ‘0’ respectively (S1(A) Fig, Fig 2B, left; ‘One-hot models’, blue). The definition of G and its relationship to the biochemical (or background-relative) view of epistasis is explained in [2]. Although one-hot models show similar performance to background-averaged models on simulated test data, they tend to be more complex with larger numbers of coefficients (Fig 2B), including spurious third-order epistatic terms although their ground truth values are zero (S1(C) Fig).

In summary, these results demonstrate that our theory extensions allow the accurate inference of background-averaged epistatic coefficients in multiallelic genetic landscapes, even in situations when large amounts of phenotypic data are missing.

Application to empirical combinatorial fitness landscapes

We next applied our theory to model epistasis within two different empirical combinatorial fitness landscapes (Fig 2A, middle, right). In the first DMS assay, a budding yeast strain was used in which the single-copy arginine-CCU tRNA (tRNA-Arg(CCU)) gene is conditionally required for growth [6]. A library of variants of this gene was designed to cover all 5,184 (2⁶ × 3⁴) combinations of the 14 nucleotide substitutions observed in ten positions in post-whole-genome duplication yeast species. The library was transformed into S. cerevisiae, expressed under restrictive conditions and the enrichment of each genotype in the culture was quantified by deep sequencing before and after selection. After reprocessing of the raw data, we retained high quality fitness estimates for 3,847 variants (74.2%, see Methods).

Although the findings in [6] were based on the application of background-averaged epistasis theory, the prior limitation of the Walsh-Hadamard transform to only two alleles per sequence position required the authors to adopt an ad hoc strategy that involved performing separate analyses on combinatorially complete biallelic sub-landscapes. After imputing fitness values for missing variants we decomposed phenotypes into background-averaged epistatic coefficients using Eq (9), which yielded very similar values for significant coefficients of all orders (S2(B) Fig, see Methods). However—as shown above for simulated data—our theory extensions permit the modeling of background-averaged epistasis in the context of multiallelic landscapes with missing data. We followed the same strategy as described in the previous section, using the tRNA DMS data to train Lasso regression models to predict variant fitness from sequence incorporating epistatic coefficients up to eighth order (Fig 2B, middle, see Methods). The resulting models include many higher-order epistatic coefficients (S1(C) Fig, ‘Background-averaged models’) yet exhibit extreme sparsity, with the median number of non-zero coefficients of any order ranging from 19 to 60 i.e. approximately 1% of all possible coefficients of eighth order or less (Fig 2B, middle). Models trained on at least 10% of the training data tend to explain more than 50% of the total explainable variance (Fig 2B, middle) and model-inferred background-averaged epistatic terms correlate well with those reported previously [6] (Pearson’s r = 0.79, Fig 2C, middle).

To evaluate whether the inferred models report on biologically relevant features of the underlying genetic landscapes, we tested whether sparse model coefficients were more likely to comprise genetic interactions (or modulators thereof) involving known physically interacting positions in the wild-type tRNA secondary structure (Fig 2A, middle). Regardless of data sparsity, background-averaged model coefficients tend to be significantly enriched for physical interactions (Fig 2D, middle). On the other hand, in the case of even moderate sub-sampling of training data (16%), one-hot model coefficients show no such enrichment (Fig 2D, middle). The results of this enrichment analysis closely mirror those obtained using simulated data (Fig 2D, left), suggesting that they are likely to generalise to other empirical genotype-phenotype landscapes.

Finally, we repeated similar analyses using a second DMS dataset comprising fluorescence measurements for all combinations of single substitution mutations at 13 positions (2¹³ = 8, 192 genotypes) separating two variants of the Entacmaea quadricolor protein eqFP611 (Fig 2, right column; see Methods) [7]. Lasso models capture the majority of fitness variance even with high levels of data sparsity (Fig 2B, right) and model-inferred background-averaged coefficients correlate well with those obtained using the decomposition in Eq (9) (Pearson’s r = 0.83, Fig 2C, right). We also find that direct pairwise physical interactions involving the fluorophore (L63, Fig 2A, right) are enriched in model-derived background-averaged epistatic terms (Fig 2D, right), a result that is robust to missing data and recapitulates our observations obtained using ground truth genetic interactions in simulated DMS data.

Discussion

We have provided an extension to the most rigorous computational framework available for describing and modeling empirical genotype-phenotype mappings. Our approach derives from the Walsh-Hadamard transform, yet we note that the new set of columns is no longer orthogonal. However, the columns remain independent, as proven by the existence of the inverse (Eq (13)). Beyond the study of background-averaged epistasis with respect to mutations in the primary sequence, this also permits the inclusion of ‘epimutations’ (changes in the epigenetic state of DNA), amino acid post-translational modifications or even particular environmental/experimental conditions.

In the simplest application, background-averaged epistatic coefficients (genetic interaction terms) can be directly computed from phenotypic measurements via the decomposition in Eq (9). However, estimating epistatic coefficients by regression—as in Eq (12)—is a more natural choice in the presence of missing data, when data for multiple related phenotypes is available [23] and/or in the presence of global epistasis [24, 25]. Our mathematical results provide three alternative methods to compute the multi-state (multiallelic) extension of the inverse Walsh-Hadamard transform A_n, one of which allows the direct extraction of specific elements or sub-matrices. In which situations might this capability be desirable?

First, constructing full A_n matrices—particularly by numerical inversion—is impractical for large genetic landscapes. Second, the result of the product represents a matrix of sequence features when setting up the inference of epistatic (model) coefficients from phenotypic measurements as a regression task [23, 24, 26, 27]. The ability to construct this feature matrix in batches (of rows) allows computational resource-efficient iteration over large datasets when using frameworks such as TensorFlow or PyTorch.

Third, there are currently no methods to comprehensively map empirical genotype-phenotype landscapes with size greater than the low millions of genotypes. Therefore, assaying landscapes of this size or larger will typically involve experimental measurement of a (random) sub-sample of genotypes, corresponding to distinct rows in A_n. In other words, it is usually unnecessary to construct full A_n matrices when modeling real experimental data. Finally, there is evidence of extreme sparsity in the epistatic architecture of biomolecules where only a small fraction of theoretically possible genetic interactions are non-zero [7]. The feasibility of sampling very large background-averaged epistatic coefficient spaces may improve methods to infer accurate genotype-phenotype models.

Using results from the analysis of both simulated and empirical multiallelic fitness landscapes, we have shown that sparse regression models relying on a background-averaged definition of epistasis can efficiently capture salient features of the underlying biological system—namely direct physical interactions or ground truth genetic interactions—even in situations of sparse sampling of phenotypes. This behaviour, which we speculate is due to a richer representation of the sequence feature space compared to one-hot models (i.e. higher level of constraint during model fitting; S1(A) Fig), is particularly desirable in the case of very large genetic landscapes where comprehensive phenotyping is infeasible. However, more work is needed to determine whether this result holds more generally. One difficulty in such comparisons between approaches is the requirement for a set of interactions or landscape features that are known to be critical for biomolecular function. Here we rely on Watson–Crick base pairing interactions and direct physical interactions involved in fluorophore orientation whose respective importance for RNA secondary structure and fluorescence activity are well-established.

More broadly, this work opens the door to investigations of the biological properties of background-averaged epistasis in empirical genetic landscapes of arbitrary shape and complexity. Beyond applications within the field of DMS, we believe our theory extensions have the potential to influence research in evolutionary and synthetic biology including protein engineering. In future it will be important to compare the performance and properties of models relying on this definition of epistasis to those of other recently proposed models that incorporate higher-order genetic interactions for phenotypic prediction [28, 29].

Methods

Simulated multiallelic fitness landscape

The simulated multiallelic genetic landscape (Fig 2A, left) consists of all possible combinations of 4 nucleotides per position in a 6-mer i.e. 4⁶ = 4, 096 possible genotypes. Consistent with observations of empirical fitness landscapes where single mutant phenotypes tend to be biased towards negative (detrimental) effects and larger in size than pairwise and higher-order epistatic terms, ground truth first-order epistatic coefficients (single nucleotide substitution effects) were drawn randomly from a normal distribution with a mean of -1 and standard deviation of 2 i.e. ϵ = N(μ, σ²) = N(−1, 4). Ground truth non-zero second-order terms (pairwise genetic interactions) were arbitrarily selected between the following pairs of positions: [1, 2], [3, 5] and [5, 6], where their coefficient values were randomly sampled from ϵ = N(0, 1). All other second- and higher-order coefficients were set to zero. Fitness scores for all variants were calculated using Eq (12). We simulated measurement error by adding Gaussian noise to all fitness scores of similar magnitude to the variance of first-order epistatic coefficients i.e. . Finally, we also simulated versions of the same multiallelic fitness landscape with both relatively lower and higher measurement noise where and , respectively (S1(B) Fig).

Empirical combinatorial fitness landscapes

Raw sequencing (FASTQ) files obtained from the tRNA-Arg(CCU) DMS experiment in [6] were re-processed with DiMSum v1.3 [30] using default parameters with minor adjustments. We obtained fitness estimates for 5,059 out of a total of 5,184 possible variants (97.6%) in the combinatorially complete genetic landscape. We restricted the data to a high quality subset by requiring fitness estimates in all six biological replicates as well as at least 10 input read counts in all input samples. This resulted in a total of 3,847 retained variants (74.2%) for downstream analysis. For comparisons to previously-reported background-averaged coefficients (Fig 2C, S2(B) Fig), we used the author-processed data and analysis scripts [6]. Likewise, we used the author-processed data for imputation of missing phenotype values (replaced with the mean fitness at every mutation order) and calculation of background-averaged epistatic coefficients using the decomposition in Eq (9) (S2(B) Fig). For the eqFP611 fluorescent protein DMS experiment, we used the author-processed fitness estimates (brightness scores) [7].

Lasso regression models

We trained Lasso regression models to predict variant fitness estimates from nucleotide or amino acid sequences using the ‘scikit-learn’ Python package. Training data comprised random subsets of 1, 2, 4, 8, 16, 32 and 64% of retained variants of all mutation orders. All remaining held out variants comprised the ‘test’ data which was unseen during model training in each case.

To train models inferring background-averaged epistatic coefficients we used feature matrices of the form (see Eq (12)). For comparison, one-hot encoded matrices of sequence features were used. Linear regression was performed using 10-fold cross validation to determine the optimal value of the L1 regularization parameter λ in the range [0.005, 0.25] (‘LassoCV’ and ‘RepeatedKFold’ functions). Final models were fit to all training data. In order to estimate model-related statistics and performance results we fit 100 models to different random subsets of the training data for each model type and training data fraction. In Fig 2, S1(B), S1(C) and S2(A) Figs we plot the median of the indicated measures over all models, where error bars indicate the interquartile range. For model performance estimates in Fig 2B, the maximum explainable variance was calculated by substracting the total estimated technical variance (as reported by DiMSum fitness errors [30]) from the total fitness variance. For the simulated and eqFP611 DMS datasets, the maximum explainable variance was assumed to be 100%. In comparisons between sparse model-derived background-averaged epistatic coefficients and ground truth or previously-reported coefficients (scatter plots in Fig 2C and S2(A) Fig) we required model-derived terms to be non-zero i.e. those whose interquartile range over all models did not include zero. Four outlier 4th order coefficients in the tRNA dataset were omitted from Fig 2C.

To test enrichment of physical interactions in Lasso model coefficients we used the following approach: for each model, all position pairs represented in non-zero epistatic coefficients of at least second order were determined. The number of position pairs corresponding to direct physical interactions was counted and an associated enrichment score (odds ratio) and P-value calculated using Fisher’s Exact Test. The background set consisted of all position pairs in all possible epistatic coefficients. To test the appropriateness of the null hypothesis we also repeated enrichment analyses using random models i.e. randomly chosen sets of epistatic coefficients matching the numbers of non-zero coefficients in Lasso models and their distribution over different epistatic orders. The results of this analysis performed using the simulated dataset are shown in S1(B) Fig, left. Finally, the middle and right panels in S1(B) Fig assess the impact of measurement noise on the results of this enrichment analysis, revealing that noise level is anti-correlated with enrichment score and significance thereof, as expected.

Supporting information

S1 Fig. Supplementary figure related to Fig 2B and 2C.

A. Cartoon depiction of alternative feature matrices for inferring epistatic coefficients by linear regression. G⁻¹ in the right panel indicates the matrix of one-hot encoded sequence features—or embeddings—typically used when fitting models of genotype-phenotype landscapes [2]. The left panel represents the matrix of sequence features used to infer background-averaged epistatic coefficients, as in Eq (12). B. Enrichment of direct physical interactions (see Fig 2A) in non-zero epistatic coefficients from sparse models of a simulated multiallelic fitness landscape. Left: results for random models with matching numbers of randomly selected epistatic coefficients at all coefficient orders (see Fig 2C, left). Middle, Right: results for sparse models fit to low and high measurement noise versions of the simulated fitness dataset, respectively. C. Numbers of non-zero espistatic coefficients of different orders in Lasso regression models inferred using different random fractions of the simulated and empirical combinatorial fitness landscapes indicated.

https://doi.org/10.1371/journal.pcbi.1012132.s001

(EPS)

S2 Fig. Supplementary figure related to Fig 2D.

A. Scatter plots comparing sparse model-inferred background-averaged epistatic coefficients of a simulated multiallelic fitness landscape to ground truth values for varying training dataset sizes. Error bars indicate the interquartile range. B. Scatter plots comparing background-averaged epistatic coefficients obtained using the tRNA-Arg(CCU) DMS dataset and Eq (9) to previously-reported significant coefficients [6] after imputing fitness values for missing variants (see Methods) and shown separately for coefficient orders 1-6. Only significant coefficients (FDR < 0.05) observed in at least 10 backgrounds are shown [6]. Error bars indicate 95% confidence intervals.

https://doi.org/10.1371/journal.pcbi.1012132.s002

(EPS)

S1 Text. Supplementary methods.

https://doi.org/10.1371/journal.pcbi.1012132.s003

(PDF)

Acknowledgments

We thank all members of the Lehner and Weghorn Labs for helpful discussions and suggestions. We acknowledge support of the Spanish Ministry of Science and Innovation through the Centro de Excelencia Severo Ochoa (CEX2020-001049-S, MCIN/AEI /10.13039/501100011033), the Generalitat de Catalunya through the CERCA programme and to the EMBL partnership. We are grateful to the CRG Core Technologies Programme for their support and assistance in this work.

References

1. Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics. 2008;9(11):855–867. pmid:18852697
- View Article
- PubMed/NCBI
- Google Scholar
2. Poelwijk FJ, Krishna V, Ranganathan R. The Context-Dependence of Mutations: A Linkage of Formalisms. PLoS Computational Biology. 2016;12(6):e1004771. pmid:27337695
- View Article
- PubMed/NCBI
- Google Scholar
3. Domingo J, Baeza-Centurion P, Lehner B. The Causes and Consequences of Genetic Interactions (Epistasis). Annu Rev Genomics Hum Genet. 2019;20:433–460. pmid:31082279
- View Article
- PubMed/NCBI
- Google Scholar
4. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11(8):801–807. pmid:25075907
- View Article
- PubMed/NCBI
- Google Scholar
5. de Visser JAG, Krug J. Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics. 2014;15(7):480–490. pmid:24913663
- View Article
- PubMed/NCBI
- Google Scholar
6. Domingo J, Diss G, Lehner B. Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature. 2018;558(7708):117–121. pmid:29849145
- View Article
- PubMed/NCBI
- Google Scholar
7. Poelwijk FJ, Socolich M, Ranganathan R. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nature communications. 2019;10(1):1–11. pmid:31527666
- View Article
- PubMed/NCBI
- Google Scholar
8. Baeza-Centurion P, Miñana B, Schmiedel JM, Valcárcel J, Lehner B. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell. 2019;176(3):549–563. pmid:30661752
- View Article
- PubMed/NCBI
- Google Scholar
9. Pokusaeva VO, Usmanova DR, Putintseva EV, Espinar L, Sarkisyan KS, Mishin AS, et al. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape. PLoS genetics. 2019;15(4):e1008079. pmid:30969963
- View Article
- PubMed/NCBI
- Google Scholar
10. Bendixsen DP, Collet J, Østman B, Hayden EJ. Genotype network intersections promote evolutionary innovation. PLoS biology. 2019;17(5):e3000300. pmid:31136568
- View Article
- PubMed/NCBI
- Google Scholar
11. Soo VW, Swadling JB, Faure AJ, Warnecke T. Fitness landscape of a dynamic RNA structure. PLoS genetics. 2021;17(2):e1009353. pmid:33524037
- View Article
- PubMed/NCBI
- Google Scholar
12. Moulana A, Dupic T, Phillips AM, Chang J, Nieves S, Roffler AA, et al. Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 Omicron BA. 1. Nature Communications. 2022;13(1):1–11. pmid:36384919
- View Article
- PubMed/NCBI
- Google Scholar
13. Rotrattanadumrong R, Yokobayashi Y. Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning. Nature communications. 2022;13(1):1–14. pmid:35977956
- View Article
- PubMed/NCBI
- Google Scholar
14. Lynch M, Walsh B, et al. Genetics and analysis of quantitative traits. vol. 1. Sinauer Sunderland, MA; 1998.
15. Goldberg DE. Genetic Algorithms and Walsh Functions: Part I, A Genetle Introduction. Complex systems. 1989;3:129–152.
- View Article
- Google Scholar
16. Weinreich DM, Lan Y, Wylie CS, Heckendorn RB. Should evolutionary geneticists worry about higher-order epistasis? Current opinion in genetics & development. 2013;23(6):700–707. pmid:24290990
- View Article
- PubMed/NCBI
- Google Scholar
17. Poelwijk FJ, Ranganathan R. The relation between alignment covariance and background-averaged epistasis. arXiv. 2017;10.48550/ARXIV.1703.10996.
18. Brookes DH, Aghazadeh A, Listgarten J. On the sparsity of fitness functions and implications for learning. Proc Natl Acad Sci U S A. 2022;119(1). pmid:34937698
- View Article
- PubMed/NCBI
- Google Scholar
19. Ogbunugafor CB. The mutation effect reaction norm (mu‐rn) highlights environmentally dependent mutation effects and epistatic interactions. Evolution. 2022;76(s1):37–48. pmid:34989399
- View Article
- PubMed/NCBI
- Google Scholar
20. Weinberger ED. Fourier and Taylor series on fitness landscapes. Biological cybernetics. 1991;65(5):321–330.
- View Article
- Google Scholar
21. Beer T. Walsh transforms. American Journal of Physics. 1981;49(5):466–472.
- View Article
- Google Scholar
22. Stoffer DS. Walsh-Fourier Analysis and its Statistical Applications. Journal of the American Statistical Association. 1991;86(414):461–479.
- View Article
- Google Scholar
23. Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604(7904):175–183. pmid:35388192
- View Article
- PubMed/NCBI
- Google Scholar
24. Tareen A, Kooshkbaghi M, Posfai A, Ireland WT, McCandlish DM, Kinney JB. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome biology. 2022;23(1):1–27. pmid:35428271
- View Article
- PubMed/NCBI
- Google Scholar
25. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences. 2018;115(32):E7550–E7558. pmid:30037990
- View Article
- PubMed/NCBI
- Google Scholar
26. Forcier TL, Ayaz A, Gill MS, Jones D, Phillips R, Kinney JB. Measuring cis-regulatory energetics in living cells using allelic manifolds. Elife. 2018;7:e40618. pmid:30570483
- View Article
- PubMed/NCBI
- Google Scholar
27. Kinney JB, Murugan A, Callan CG Jr, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences. 2010;107(20):9158–9163. pmid:20439748
- View Article
- PubMed/NCBI
- Google Scholar
28. Zhou J, Wong MS, Chen WC, Krainer AR, Kinney JB, McCandlish DM. Higher-order epistasis and phenotypic prediction. Proceedings of the National Academy of Sciences. 2022;119(39):e2204233119. pmid:36129941
- View Article
- PubMed/NCBI
- Google Scholar
29. Zhou J, McCandlish DM. Minimum epistasis interpolation for sequence-function relationships. Nature communications. 2020;11(1):1–14. pmid:32286265
- View Article
- PubMed/NCBI
- Google Scholar
30. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biology. 2020;21(1):1–23. pmid:32799905
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics. 2008;9(11):855–867. pmid:18852697
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Poelwijk FJ, Krishna V, Ranganathan R. The Context-Dependence of Mutations: A Linkage of Formalisms. PLoS Computational Biology. 2016;12(6):e1004771. pmid:27337695
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Domingo J, Baeza-Centurion P, Lehner B. The Causes and Consequences of Genetic Interactions (Epistasis). Annu Rev Genomics Hum Genet. 2019;20:433–460. pmid:31082279
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11(8):801–807. pmid:25075907
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. de Visser JAG, Krug J. Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics. 2014;15(7):480–490. pmid:24913663
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Domingo J, Diss G, Lehner B. Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature. 2018;558(7708):117–121. pmid:29849145
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Poelwijk FJ, Socolich M, Ranganathan R. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nature communications. 2019;10(1):1–11. pmid:31527666
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Baeza-Centurion P, Miñana B, Schmiedel JM, Valcárcel J, Lehner B. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell. 2019;176(3):549–563. pmid:30661752
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Pokusaeva VO, Usmanova DR, Putintseva EV, Espinar L, Sarkisyan KS, Mishin AS, et al. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape. PLoS genetics. 2019;15(4):e1008079. pmid:30969963
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref10] 10. Bendixsen DP, Collet J, Østman B, Hayden EJ. Genotype network intersections promote evolutionary innovation. PLoS biology. 2019;17(5):e3000300. pmid:31136568
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref11] 11. Soo VW, Swadling JB, Faure AJ, Warnecke T. Fitness landscape of a dynamic RNA structure. PLoS genetics. 2021;17(2):e1009353. pmid:33524037
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref12] 12. Moulana A, Dupic T, Phillips AM, Chang J, Nieves S, Roffler AA, et al. Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 Omicron BA. 1. Nature Communications. 2022;13(1):1–11. pmid:36384919
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref13] 13. Rotrattanadumrong R, Yokobayashi Y. Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning. Nature communications. 2022;13(1):1–14. pmid:35977956
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref14] 14. Lynch M, Walsh B, et al. Genetics and analysis of quantitative traits. vol. 1. Sinauer Sunderland, MA; 1998.

[ref15] 15. Goldberg DE. Genetic Algorithms and Walsh Functions: Part I, A Genetle Introduction. Complex systems. 1989;3:129–152.
View Article
Google Scholar

[55] View Article

[56] Google Scholar

[ref16] 16. Weinreich DM, Lan Y, Wylie CS, Heckendorn RB. Should evolutionary geneticists worry about higher-order epistasis? Current opinion in genetics & development. 2013;23(6):700–707. pmid:24290990
View Article
PubMed/NCBI
Google Scholar

[58] View Article

[59] PubMed/NCBI

[60] Google Scholar

[ref17] 17. Poelwijk FJ, Ranganathan R. The relation between alignment covariance and background-averaged epistasis. arXiv. 2017;10.48550/ARXIV.1703.10996.

[ref18] 18. Brookes DH, Aghazadeh A, Listgarten J. On the sparsity of fitness functions and implications for learning. Proc Natl Acad Sci U S A. 2022;119(1). pmid:34937698
View Article
PubMed/NCBI
Google Scholar

[63] View Article

[64] PubMed/NCBI

[65] Google Scholar

[ref19] 19. Ogbunugafor CB. The mutation effect reaction norm (mu‐rn) highlights environmentally dependent mutation effects and epistatic interactions. Evolution. 2022;76(s1):37–48. pmid:34989399
View Article
PubMed/NCBI
Google Scholar

[67] View Article

[68] PubMed/NCBI

[69] Google Scholar

[ref20] 20. Weinberger ED. Fourier and Taylor series on fitness landscapes. Biological cybernetics. 1991;65(5):321–330.
View Article
Google Scholar

[71] View Article

[72] Google Scholar

[ref21] 21. Beer T. Walsh transforms. American Journal of Physics. 1981;49(5):466–472.
View Article
Google Scholar

[74] View Article

[75] Google Scholar

[ref22] 22. Stoffer DS. Walsh-Fourier Analysis and its Statistical Applications. Journal of the American Statistical Association. 1991;86(414):461–479.
View Article
Google Scholar

[77] View Article

[78] Google Scholar

[ref23] 23. Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604(7904):175–183. pmid:35388192
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref24] 24. Tareen A, Kooshkbaghi M, Posfai A, Ireland WT, McCandlish DM, Kinney JB. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome biology. 2022;23(1):1–27. pmid:35428271
View Article
PubMed/NCBI
Google Scholar

[84] View Article

[85] PubMed/NCBI

[86] Google Scholar

[ref25] 25. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences. 2018;115(32):E7550–E7558. pmid:30037990
View Article
PubMed/NCBI
Google Scholar

[88] View Article

[89] PubMed/NCBI

[90] Google Scholar

[ref26] 26. Forcier TL, Ayaz A, Gill MS, Jones D, Phillips R, Kinney JB. Measuring cis-regulatory energetics in living cells using allelic manifolds. Elife. 2018;7:e40618. pmid:30570483
View Article
PubMed/NCBI
Google Scholar

[92] View Article

[93] PubMed/NCBI

[94] Google Scholar

[ref27] 27. Kinney JB, Murugan A, Callan CG Jr, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences. 2010;107(20):9158–9163. pmid:20439748
View Article
PubMed/NCBI
Google Scholar

[96] View Article

[97] PubMed/NCBI

[98] Google Scholar

[ref28] 28. Zhou J, Wong MS, Chen WC, Krainer AR, Kinney JB, McCandlish DM. Higher-order epistasis and phenotypic prediction. Proceedings of the National Academy of Sciences. 2022;119(39):e2204233119. pmid:36129941
View Article
PubMed/NCBI
Google Scholar

[100] View Article

[101] PubMed/NCBI

[102] Google Scholar

[ref29] 29. Zhou J, McCandlish DM. Minimum epistasis interpolation for sequence-function relationships. Nature communications. 2020;11(1):1–14. pmid:32286265
View Article
PubMed/NCBI
Google Scholar

[104] View Article

[105] PubMed/NCBI

[106] Google Scholar

[ref30] 30. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biology. 2020;21(1):1–23. pmid:32799905
View Article
PubMed/NCBI
Google Scholar

[108] View Article

[109] PubMed/NCBI

[110] Google Scholar

Figures

Abstract

Author summary

Introduction

Results

Extension of the Walsh-Hadamard transform to multiallelic landscapes

Recursive inverse matrix

Formulae to obtain elements of the matrices

Generalization to different numbers of states per position

Benchmarking

Application to a simulated multiallelic genotype-phenotype landscape

Application to empirical combinatorial fitness landscapes

Discussion

Methods

Simulated multiallelic fitness landscape

Empirical combinatorial fitness landscapes

Lasso regression models

Supporting information

S1 Fig. Supplementary figure related to Fig 2B and 2C.

S2 Fig. Supplementary figure related to Fig 2D.

S1 Text. Supplementary methods.

Acknowledgments

References