Impact of phylogeny on the inference of functional sectors from protein sequence data

doi:10.1371/journal.pcbi.1012091

Fig 1.

Data generation process.

A: Generation of sequences with selection only. Given a vector of mutational effects on the trait of interest, we sample independent equilibrium sequences under the Hamiltonian in Eq 1. For this, we start from random sequences and use a Metropolis Monte Carlo algorithm in which proposed mutations (changes of state at a randomly chosen site) are accepted with probability p, according to the Metropolis criterion associated with this Hamiltonian. The resulting sequences feature pairwise correlations and conservation arising from selection on the trait (via the Hamiltonian in Eq 1). B: To incorporate phylogenetic correlations, we start from one equilibrium sequence, which becomes the ancestor. We evolve it on a perfect binary tree over a fixed number n of “generations” (i.e., tree levels, corresponding to branching events), here n = 2 generations. On each branch of the tree, proposed mutations are accepted with probability p until a fixed number μ of mutations (red, here μ = 2) has been accepted. The earliest generation at which a site mutates with respect to its ancestral state is denoted by G; see the examples highlighted in green.
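As a rough illustration of the two steps in this caption, the sketch below samples one equilibrium sequence with single-site Metropolis moves and then evolves it on a perfect binary tree with μ accepted mutations per branch. It assumes binary states ±1 and a quadratic Hamiltonian of the form (κ/2)(Σ_i Δ_i σ_i − τ*)², which stands in for Eq 1 (not reproduced in this caption). The function names and the toy parameter values are illustrative and are not taken from the paper.

```python
import numpy as np

def energy(seq, delta, kappa, tau_star):
    """Assumed quadratic Hamiltonian: (kappa / 2) * (delta . seq - tau_star)^2."""
    return 0.5 * kappa * (delta @ seq - tau_star) ** 2

def metropolis_step(seq, delta, kappa, tau_star, rng):
    """Propose a state change at one random site; return True if it is accepted.

    The sequence is modified in place when the move is accepted.
    """
    site = rng.integers(seq.size)
    e_old = energy(seq, delta, kappa, tau_star)
    seq[site] *= -1
    e_new = energy(seq, delta, kappa, tau_star)
    # Metropolis criterion: accept with probability min(1, exp(-(e_new - e_old))).
    if rng.random() < np.exp(min(0.0, e_old - e_new)):
        return True
    seq[site] *= -1  # rejected: undo the proposed change
    return False

def sample_equilibrium(delta, kappa, tau_star, n_steps, rng):
    """Start from a random sequence and run n_steps Metropolis proposals."""
    seq = rng.choice(np.array([-1, 1]), size=delta.size)
    for _ in range(n_steps):
        metropolis_step(seq, delta, kappa, tau_star, rng)
    return seq

def evolve_on_tree(ancestor, delta, kappa, tau_star, n_generations, mu, rng):
    """Evolve the ancestor on a perfect binary tree; return the 2**n_generations leaves."""
    level = [ancestor.copy()]
    for _ in range(n_generations):
        next_level = []
        for parent in level:
            for _ in range(2):  # two children per branching event
                child = parent.copy()
                accepted = 0
                while accepted < mu:  # keep proposing until mu mutations are accepted
                    accepted += metropolis_step(child, delta, kappa, tau_star, rng)
                next_level.append(child)
        level = next_level
    return np.array(level)

# Toy usage (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
delta = rng.normal(0.5, 0.5, size=30)
ancestor = sample_equilibrium(delta, kappa=0.1, tau_star=3.0, n_steps=500, rng=rng)
leaves = evolve_on_tree(ancestor, delta, kappa=0.1, tau_star=3.0,
                        n_generations=2, mu=2, rng=rng)
print(leaves.shape)  # (4, 30): 2**2 leaves of length 30
```

Because an accepted mutation can revert an earlier one on the same branch, a child may differ from its parent at fewer than μ sites in this sketch.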


Fig 2.

Spectrum of the ICOD matrix and of its block diagonal approximation.

Left (resp. middle) panel: spectrum of the ICOD matrix computed on 2000 (resp. 14,000) sequences generated independently at equilibrium, and of its block diagonal approximation. The spectrum of the inverse covariance matrix C⁻¹ is also shown as a reference. Right panel: spectrum of the analytical approximation of the ICOD matrix, and of its block diagonal approximation. Sequences of length L = 200 were sampled independently at equilibrium using the Hamiltonian in Eq 1 with a fixed κ and τ* = 90. The vector of mutational effects comprises sector sites (the first 20 sites), with components sampled from a Gaussian distribution with mean 5 and variance 0.25, and non-sector sites (the remaining 180 sites), with components sampled from a Gaussian distribution with mean 0.5 and variance 0.25. The analytical approximation (see S1 Appendix section 1) was computed from the value of κ and the vector of mutational effects used for data generation.
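The following sketch shows one way to build the vector of mutational effects described in this caption and to compute the spectrum of an ICOD matrix. It assumes binary sequences, for which the ICOD matrix is taken to be the inverse covariance matrix with its diagonal set to zero. The sequences used here are random placeholders with no selection, so the resulting spectrum is not the one shown in the figure; the function names are hypothetical.

```python
import numpy as np

def mutational_effects(rng, n_sector=20, n_other=180):
    """Sector sites ~ N(5, 0.25) and non-sector sites ~ N(0.5, 0.25)."""
    # A variance of 0.25 corresponds to a standard deviation of 0.5.
    sector = rng.normal(5.0, 0.5, size=n_sector)
    other = rng.normal(0.5, 0.5, size=n_other)
    return np.concatenate([sector, other])

def icod_matrix(seqs):
    """Inverse covariance matrix with its diagonal set to zero (assumed ICOD definition).

    seqs has shape (M, L); M must be large enough for the covariance to be invertible.
    """
    cov = np.cov(seqs, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    inv_cov = 0.5 * (inv_cov + inv_cov.T)  # remove numerical asymmetry
    np.fill_diagonal(inv_cov, 0.0)
    return inv_cov

def sorted_spectrum(matrix):
    """Eigenvalues of a symmetric matrix, sorted from largest to smallest."""
    return np.linalg.eigvalsh(matrix)[::-1]

rng = np.random.default_rng(0)
delta = mutational_effects(rng)

# Placeholder data: random +/-1 sequences standing in for an equilibrium sample.
seqs = rng.choice(np.array([-1, 1]), size=(2000, 200))
icod = icod_matrix(seqs)
spectrum = sorted_spectrum(icod)
```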


Fig 3.

Impact of phylogeny and selection on ICOD, covariance and SCA spectra.

Eigenvalues of the ICOD, covariance and SCA matrices, sorted from largest to smallest, are shown for sequences generated with only phylogeny (light shades) and with both phylogeny and selection (dark shades). We vary the level of phylogeny by considering different values of μ (shown as different colors). ‘No phylogeny’ corresponds to sequences generated independently at equilibrium, which thus contain only correlations due to selection. This data set comprises M = 2048 sequences of length L = 200 generated exactly as in Fig 2, i.e. using the Hamiltonian in Eq 1 with the same κ, τ* = 90, and the same vector of mutational effects as in Fig 2. Data sets without selection are generated by evolving random sequences of length L = 200 on a perfect binary tree with 11 generations and μ random mutations on each branch, providing M = 2¹¹ = 2048 sequences. Finally, data sets with phylogeny and selection are generated along a perfect binary tree, again with 11 generations, and with μ accepted mutations per branch (with the acceptance criterion in Eq 2, using the same κ and τ* as in the no-phylogeny case and as in Fig 2). The three values of μ shown here were chosen to illustrate different levels of phylogenetic impact. Insets show a zoom over large eigenvalues. A logarithmic y-scale is used in the center panel for readability.


Fig 4.

Impact of phylogeny on mutational effect recovery.

The recovery of the mutational effect vector (see Methods) for specific eigenvectors is shown as a function of the number μ of mutations per branch, using ICOD, covariance, SCA and conservation. For ICOD (resp. SCA), eigenvectors associated with the largest eigenvalue Λ_max (resp. λ_max) are considered. For covariance C, eigenvectors associated with the smallest eigenvalue λ_min are considered. Data sets of M = 2048 sequences of length L = 200 were generated along a perfect binary tree with 11 generations, using various numbers μ of accepted mutations per branch. As in Fig 3, we employed the mutation acceptance criterion in Eq 2 with the same κ and τ* = 90. We used the same vector of mutational effects as in Figs 2 and 3. All results are averaged over 100 realisations of data generation. The null model corresponds to recovery from a random vector (see Methods, Eq 13).
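To make the quantity plotted here concrete, the sketch below extracts the eigenvector associated with the largest (or smallest) eigenvalue of a symmetric matrix and scores it against a vector of mutational effects. The recovery formula used, Σ_i |ν_i Δ_i| / (‖ν‖ ‖Δ‖), is an assumption standing in for the definition given in Methods, which is not reproduced in this caption; the function names and the toy vector are hypothetical.

```python
import numpy as np

def top_eigenvector(matrix):
    """Eigenvector of a symmetric matrix associated with its largest eigenvalue."""
    _, eigenvectors = np.linalg.eigh(matrix)  # eigenvalues come in ascending order
    return eigenvectors[:, -1]

def bottom_eigenvector(matrix):
    """Eigenvector of a symmetric matrix associated with its smallest eigenvalue."""
    _, eigenvectors = np.linalg.eigh(matrix)
    return eigenvectors[:, 0]

def recovery(vector, delta):
    """Assumed recovery score: sum_i |v_i * delta_i| / (||v|| * ||delta||).

    The score lies between 0 and 1 and does not depend on the sign of the vector.
    """
    vector = np.asarray(vector, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return np.sum(np.abs(vector * delta)) / (np.linalg.norm(vector) * np.linalg.norm(delta))

# Toy check: a rank-one matrix built from delta has delta as its top eigenvector,
# so the recovery should be close to 1 up to floating-point error.
delta_toy = np.array([5.0, 5.0, 0.5, 0.5, 0.5])
matrix_toy = np.outer(delta_toy, delta_toy)
score_toy = recovery(top_eigenvector(matrix_toy), delta_toy)
```

The absolute values make the score insensitive to the arbitrary sign of an eigenvector, which is why no sign convention is needed before comparing with the mutational effects.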


Fig 5.

Impact of earliest mutation generation G on eigenvector components.

Violin plots of the absolute value of the components of the key eigenvectors of the ICOD, covariance C and SCA matrices are shown versus the earliest generation G at which the associated site first mutates in the phylogeny. Results are shown for data sets generated with μ = 50 (top panels) and μ = 5 (bottom panels). Data sets of M = 2048 sequences of length L = 200 were generated along a perfect binary tree with 11 generations, using these two numbers μ of accepted mutations per branch. As in Fig 3, we employed the mutation acceptance criterion in Eq 2 with the same κ and τ* = 90. We used the same vector of mutational effects as in Figs 2 and 3. Violin plots are obtained over 100 realisations of data generation.
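One possible way to obtain G for each site is sketched below. It assumes that the sequences at every tree level are available and compares them with the ancestor, so a site that mutates and reverts within a single branch goes undetected. The function name, the input layout and the use of -1 for sites that never mutate are conventions chosen for this example only.

```python
import numpy as np

def earliest_mutation_generation(ancestor, levels):
    """Return, for each site, the first generation at which it differs from the ancestor.

    ancestor: array of shape (L,).
    levels: list of arrays; levels[g - 1] has shape (2**g, L) and holds the
            sequences present at generation g.
    Sites that never differ from the ancestral state are marked with -1.
    """
    ancestor = np.asarray(ancestor)
    g_first = np.full(ancestor.size, -1)
    for g, level in enumerate(levels, start=1):
        mutated = np.any(np.asarray(level) != ancestor, axis=0)
        newly_mutated = mutated & (g_first == -1)
        g_first[newly_mutated] = g
    return g_first

# Hand-built toy tree with n = 2 generations and L = 4 sites.
ancestor_toy = np.array([1, 1, 1, 1])
levels_toy = [
    np.array([[-1, 1, 1, 1],
              [1, 1, 1, 1]]),
    np.array([[-1, -1, 1, 1],
              [-1, 1, 1, 1],
              [1, 1, -1, 1],
              [1, 1, 1, 1]]),
]
g_toy = earliest_mutation_generation(ancestor_toy, levels_toy)
print(g_toy.tolist())  # [1, 2, 2, -1]
```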


Fig 6.

Identifying functionally important sites in natural protein families.

The symmetrized AUC for the prediction of sites with large mutational effects is computed on 30 protein families with four different methods (ICOD, SCA, MI and conservation), taking Deep Mutational Scan (DMS) data as ground truth. For ICOD and MI, the average product correction (APC) [4] is applied to the matrix of interest (it was found to improve the average performance for most families for these methods, but not for SCA). For ICOD, MI and SCA, the components of the eigenvector associated with the largest eigenvalue are employed to predict mutational effects. Protein families are ordered by decreasing symmetrized AUC for ICOD. The mapping between protein family number and name is given in S1 Table. The protein families shaded in grey have DMS data featuring a unimodal shape, while the others feature a bimodal shape.
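For readers unfamiliar with the average product correction, a minimal sketch of its usual form is given below: each entry F_ij is reduced by the product of its row mean and column mean divided by the overall mean. Conventions vary, for instance on whether diagonal entries are excluded from the means, and the exact variant used for this figure is not specified in the caption, so this should be read as an assumption. The function name is hypothetical.

```python
import numpy as np

def average_product_correction(scores):
    """Subtract (row mean * column mean) / overall mean from every entry.

    All entries, including the diagonal, enter the means in this sketch.
    The overall mean must be nonzero.
    """
    scores = np.asarray(scores, dtype=float)
    row_mean = scores.mean(axis=1, keepdims=True)
    col_mean = scores.mean(axis=0, keepdims=True)
    return scores - row_mean * col_mean / scores.mean()

# Toy check: a rank-one matrix is pure "background" and is removed entirely.
a_toy = np.array([1.0, 2.0, 3.0])
corrected_toy = average_product_correction(np.outer(a_toy, a_toy))
```

The corrected matrix can then be diagonalised, and the components of the eigenvector associated with its largest eigenvalue used as site scores, as described in the caption.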
