Detecting Functional Divergence after Gene Duplication through Evolutionary Changes in Posttranslational Regulatory Sequences

Gene duplication is an important evolutionary mechanism that can result in functional divergence in paralogs due to neo-functionalization or sub-functionalization. Consistent with functional divergence after gene duplication, recent studies have shown accelerated evolution in retained paralogs. However, little is known in general about the impact of this accelerated evolution on the molecular functions of retained paralogs. For example, do new functions typically involve changes in enzymatic activities, or changes in protein regulation? Here we study the evolution of posttranslational regulation by examining the evolution of important regulatory sequences (short linear motifs) in retained duplicates created by the whole-genome duplication in budding yeast. To do so, we identified short linear motifs whose evolutionary constraint has relaxed after gene duplication with a likelihood-ratio test that can account for heterogeneity in the evolutionary process by using a non-central chi-squared null distribution. We find that short linear motifs are more likely to show changes in evolutionary constraints in retained duplicates compared to single-copy genes. We examine changes in constraints on known regulatory sequences and show that for the Rck1/Rck2, Fkh1/Fkh2, Ace2/Swi5 paralogs, they are associated with previously characterized differences in posttranslational regulation. Finally, we experimentally confirm our prediction that for the Ace2/Swi5 paralogs, Cbk1 regulated localization was lost along the lineage leading to SWI5 after gene duplication. Our analysis suggests that changes in posttranslational regulation mediated by short regulatory motifs systematically contribute to functional divergence after gene duplication.

squared distributed. If the likelihood-ratio test statistics were chi-squared distributed, then we would expect uniformly distributed p-values. In our simulation of short sequence evolution, we observed that the p-values obtained were not uniform and that there were much fewer rejections of the null hypothesis than expected ( Figure S1A). This indicates that this test has low power for very short sequences and that assuming chi-square distribution for the likelihood-ratio test statistic will be conservative. Increasing the branch lengths to allow for more substitutions improved the uniformity slightly, but we never observed higher rates of rejection of the null hypothesis than expected ( Figure S1B). In phylogenetics studies involving proteome-wide analyses, increased acceptance of the null hypothesis has been suggested as an acceptable compromise to decrease the levels of false positives that may arise due to invalid model assumptions [1,2]. See the main text for the behavior of the test on more 'realistic' simulations.

Simulation of protein evolution under various evolutionary models
To illustrate the use of this non-central correction to eliminate false rejections of the null hypothesis when the background evolutionary process deviates from the model assumed by the test, we performed extensive simulation of sequence evolution where there were truly no changes in constraints after gene duplication, but where the simulation violated the model assumed by the test. In each case, we used a likelihood-ratio test to test whether the data is better explained by two rates of evolution as opposed to a single rate of evolution (using AAML, see reference [38] from the main text, and see Methods). Under the null hypothesis, p-values show a uniform distribution and we use this as a measure for deviations of our test statistic under various simulations. For example, if the p-values are uniform in a simulation with no true positives, then we infer that the null distribution used is correct.
We first simulated proteins evolving under the WAG model (see reference [77] in the main text) according to a single rate of evolution, which is the model assumed by the test, and tested for changes in constraints (see Methods). As expected, the distribution of the likelihood-ratio test statistic followed a chi-squared distribution, and the p-values are uniformly distributed ( Figure   S2A, grey columns). Next, because short linear motifs are found in rapidly evolving disordered regions that contain indels, we simulated proteins according to the same model but also allowed indels (using INDELIBLE [3], see Methods for indel parameters). We then aligned the proteins (MAFFT, see reference [66] in the main text and Methods), and repeated the likelihood-ratio test for changes in constraints. The inclusion of indels in the simulation led to an increased rejection rate of the null hypothesis ( Figure S2A, black circles), presumably because alignment errors lead to analysis of non-homologous residues and the extra rate parameter in the alternative hypothesis can 'fit' some of this heterogeneity. However, after performing the non-central correction to the chi-squared distribution, the distribution of p-values was uniform, indicating that the indel and alignment process was adequately captured by the non-central parameter ( Figure S2A, white circles). The KL divergence for this particular set of parameters was 0.000837.
To test whether the non-central correction could account for heterogeneity in the substitution process, we next performed codon-based simulations where we could vary the stationary codon frequencies. We simulated proteins according to a codon model (with Ka/Ks is equal to 1) using the codon frequency table from Thermus aquaticus (which is GC-biased), and found that the likelihood-ratio test statistic followed a chi-squared distribution ( Figure S2B, grey columns) despite the GC-biased codon model likely leading to amino acid frequencies different than those assumed by the WAG model in the test. Next, we simulated a similar set of proteins, except that one of the tested clades was evolving under the S. cerevisiae codon frequency table (which is AT-biased). Because the substitution model changes on the phylogeny, this set of proteins corresponds to proteins evolving under a non-homogenous and non-stationary process. As expected, we observed increased false rejections of the null hypothesis ( Figure S2B, black circles) because the alternative hypothesis can account for some of this heterogeneity using the additional rate parameter. However, after correction by the non-central parameter, the p-values were now uniformly distributed ( Figure S2B, white circles). The KL divergence for this set of parameters was 0.001031.
To illustrate how all these deviations of the evolutionary models can be compounded in the analyses, we also evolved proteins under the same non-homogenous, non-stationary processes, but now also included indels (see Methods). Doing this, we observed a much higher false rejection rate of the null hypothesis (Figure S1B, black squares); however, even while several factors compounded to deviate from the models assumed in the test, it was still possible to capture the deviation to the null hypothesis using a single non-central parameter ( Figure S1B, white squares). Only a very small increase in the KL divergence for the combined deviations to the model was observed (0.001059 vs 0.001031 for the heterogeneous substitution model with no indels); however, we observed a much higher false rejection rate of the null-hypothesis because the likelihood-ratio test is performed on more columns (more data points in the test) due to the indels.
Taken together, these results indicate that a non-central chi-squared distribution can be used to capture some of the evolutionary complexities that can be encountered when detecting functional divergence after gene duplication.

Simulation of protein evolution under various evolutionary models
To simulate short linear motif evolution of different lengths ( Figure S1), we estimated a phylogenetic tree from Cdc20 (a protein conserved in all eukaryotes) using AAML and under the global clock model. We used this tree with the program INDELIBLE [3] to evolve short sequences of different lengths under the WAG model. These sets of simulated short linear motifs therefore correspond to the null model of the likelihood-ratio test.
To simulate protein sequences under various models of evolution for the purpose of testing the non-central chi-squared correction to the likelihood-ratio test ( Figure S1), we used the program INDELIBLE [3]. The same phylogenetic relationship was always used: (((a,b),(c,d)),((e,f),(gh))) with equal branch lengths of 0.8 and a root length of 300 (codons or amino acids). We arbitrarily set the (e,f) clade to be the post-WGD clade and performed likelihood-ratio tests on proteins simulated by the program. Sequences were evolved with the following parameters: 1) WAG model, 2) WAG model with power law distributed indels, with a=1.7 for inserts, and a=1.8 for deletions, with an indel rate of 0.1, 3) Homogeneous codon models with codon stationary frequency equals to the Thermus aquaticus codon frequency with kappa=2, omega=1, 4) Non-homogeneous codon models, and codon stationary frequency equals to the S. cerevisiae codon frequency for all the tree, except for the (e,f) clade which was set to the Thermus aquaticus codon frequency, 5) same as 4) except with indels similar to test 2) except with indel rate of 0.05.
The likelihood-ratio test was then performed on the resulting proteins (or aligned first with MAFFT if indels were simulated).

Strains and plasmids
BY4741 or isogenic derivatives were used for all of our experiments. Single-copy genes were PCR amplified from purified genomic DNA (Fermentas, #K0512) of L. kluyveri (NRRL Y-12651) and L. waltii (UCD 72-13), and Ace2/Swi5 orthologs were PCR amplified from purified DNA from C. glabrata (CBS 138). Allele replacement for single-copy genes was performed using a modification of the method as in [4] with single-copy genes replacing the SWI5 gene.
Strains were then imaged by growing the cells to log-phase in minimal defined media with appropriate auxotrophic requirements and imaged with a standard 491nm blue laser on a Leica spinning-disc confocal microscope.