Abstract
In folded protein domains, protein function is frequently more conserved than amino acid sequence because highly diverged sequences can fold into equivalent 3D structures with identical function. During evolution, intrinsically disordered protein regions (IDRs) often experience rapid amino acid sequence divergence, but because they do not fold into stable 3D structures, it remains largely unknown when and how function is conserved. As a model system for studying the evolution of IDRs, we examined transcriptional activation domains, the regions of transcription factors that bind to coactivator complexes. We systematically identified activation domains on 502 homologs of the transcriptional activator Gcn4 spanning 600 MY of fungal evolution in the Ascomycota. We found that the central activation domain shows strong conservation of function without conservation of sequence. This conservation of function without conservation of sequence arises from evolutionary turnover (gain and loss) at two length scales. Within the central activation domain, we see turnover of acidic and aromatic residues, but primarily loss of short linear motifs. In the full-length transcription factor, we see turnover of entire activation domains. Stabilizing selection and evolutionary turnover at multiple length scales are likely a general mechanism for conservation of function without conservation of sequence in IDRs.
Author summary
When and where genes are turned on determine what an organism looks like and how it responds to its environment. The turning on and off of genes is controlled by proteins known as transcription factors. Transcription factors have two main jobs: to bind the genome and to turn genes on or off. The protein regions of the transcription factors that bind the genome have been conserved for billions of years. However, throughout this time the rest of the transcription factor protein sequence has changed substantially. In this study, we investigate how changes in the sequences of these other regions impact the ability of transcription factors to turn on genes. We find that these transcription factor regions are still able to turn on genes despite large changes in protein sequence. The function of these transcription factors is much more conserved than their protein sequences. The protein sequence of these activating regions can change dramatically while maintaining function. These findings improve our understanding of how these transcription factors turn on genes and how they are evolving.
Citation: LeBlanc CJ, Stefani J, Soriano M, Lam AWY, Zintel MA, Kotha SR, et al. (2026) Evolutionary turnover of key amino acids explains conservation of function without conservation of sequence in transcriptional activation domains. PLoS Genet 22(3): e1012069. https://doi.org/10.1371/journal.pgen.1012069
Editor: Michael J. Guertin, UConn Health Center: UConn Health, UNITED STATES OF AMERICA
Received: November 14, 2025; Accepted: February 24, 2026; Published: March 16, 2026
Copyright: © 2026 LeBlanc et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All the raw sequencing data have been deposited at NIH SRA, Accession #PRJNA1186961: http://www.ncbi.nlm.nih.gov/bioproject/1186961. All the analysis scripts are deposited on GitHub and Zenodo (10.5281/zenodo.14201918): https://github.com/staller-lab/Gcn4-evolution and https://github.com/staller-lab/labtools/tree/main/src/labtools/adtools. All the processed data are attached in supplemental tables (S6, S11, and S12 Tables). Processed sequencing read counts are in S13 Table.
Funding: CJL was funded by T32HG4725, MAZ was funded by T32GM148378, and AF was funded by T32GM146614. AL was funded by UC Berkeley URAP. MS and SRK were funded by UC Berkeley SEED Scholars Program. SRK was also funded by UC Berkeley SURF. GH was funded by UC Berkeley Rose Hill. GPS was funded by the UC Berkeley BSP scholars, the McNair Scholars, and UC Berkeley SURF programs. This work was supported by the Burroughs Wellcome Fund PDEP, Simons Foundation grant 1018719, NSF grant 2112057, and NIH grant R35GM150813 awarded to MVS. MVS is a Biohub, San Francisco Investigator. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The evolution of eukaryotic transcription factors (TF) contains a paradox: TF protein sequences diverge quickly but maintain function over long evolutionary distances. For example, the master regulator of eye development in mice, Pax6, induces ectopic eyes in flies, and fly Pax6 (eyeless) creates ectopic eye structures in frogs and mice [1–3]. While the DNA-binding domains (DBD) are 96% identical, eye induction requires the intrinsically disordered regions (IDRs), which are only 35.5% identical. Despite substantial sequence divergence, these IDRs must share a conserved function. In folded protein domains, highly diverged sequences can fold into the same 3D structure, providing a molecular mechanism for conservation of function without conservation of sequence [4–6]. Here, we seek an analogous framework for understanding the evolution of IDRs. Small-scale studies have found examples of diverged IDRs that conserve function [7–9], do not conserve function [10, 11], or fall in between [12]. Transcriptional activation domains provide an excellent model system for studying IDR evolution because they are one of the oldest classes of functional IDRs [13], they are required for TF function [14], and their activity can be measured in high throughput [15]. Our goal is to identify molecular mechanisms by which TF IDR function can be conserved in the face of rapid sequence divergence.
Our evolutionary study of acidic activation domains in yeast benefits from high-throughput data that define sequence features controlling their function [15–21]. These data have been used to train neural network models for predicting activation domains from protein sequence [16, 19, 21–24]. These features further motivated a biophysical mechanism, our acidic exposure model: aromatic and leucine residues make key contacts with hydrophobic surfaces of coactivator complexes, but these residues can also interact with each other and drive collapse into an inactive state [15, 25–28]. The acidic residues repel each other, expand the activation domain, and promote exposure of the hydrophobic residues. Large-scale mutagenesis showed the acidic exposure model applies to hundreds of human activation domains [29].
We test the hypothesis that conservation of function without conservation of sequence in TF IDRs results from stabilizing selection to maintain protein function. This stabilizing selection will manifest as evolutionary turnover, or repeated gain and loss, of functional elements, such as short linear motifs, local composition, or key residues [8, 30–35]. Individual functional elements can be lost through mutation so long as other compensatory functional elements have been gained elsewhere in the sequence. This compensatory evolution under stabilizing selection gives the appearance that functional elements move around. A leading model for activation domain function is that they are composed of short linear motifs [33] embedded in a permissive (often acidic) context [36]. In TFs, it is unclear if the functional elements that experience turnover will be entire activation domains, motifs, or individual amino acids. Here, we aim to identify the functional elements and search for evolutionary turnover.
To compare highly-diverged but active sequences, we mapped activation domains across a family of homologous TFs. As a model system, we used 502 diverse homologs of Gcn4, a nutrient stress TF, and screened short protein fragments for activation domain function with a high-throughput assay in Saccharomyces cerevisiae [15]. All homologs contain at least one fragment that functions as an activation domain. Central activation domain function is conserved across 600 million years of evolution despite substantial sequence divergence. The motifs necessary for activity in S. cerevisiae are frequently lost and rarely gained, indicating minimal evolutionary turnover of motifs. Instead, individual acidic and hydrophobic residues are frequently gained and lost, showing extensive evolutionary turnover. We conclude that conservation of function without conservation of sequence in the central activation domain results from turnover of individual residues, not turnover of motifs. We also see evolutionary turnover of entire activation domains upstream of the central activation domain (upstream activation domains). Our functional screening reveals how evolutionary turnover on two length scales maintains conservation of function without conservation of sequence in transcription factors.
Results
Identification of Gcn4 homologs
As a model system to study TF evolution, we used Gcn4, a yeast stress response TF. Gcn4 contains an intrinsically disordered region followed by a C-terminal DBD. We identified diverse homologs, quantified sequence divergence, experimentally mapped activation domains with a tiling strategy, and looked for conservation of activation domain function without conservation of sequence.
Our null hypothesis is that IDR function is conserved and that the observed diversity of sequence is the result of mutations, purifying (negative) selection, and neutral drift. We found evidence for negative selection on the full-length TF using a high-quality set of thirty-six Gcn4 orthologs from the yeast gene order browser (YGOB, dN/dS = 0.160, phylogenetic codon model) [37]. The DBD is under stronger negative selection than the IDR (0.176 dN/dS for the IDR vs. 0.107 dN/dS for the DBD with a phylogenetic codon model, Fig 1A), matching previous results [40]. We found no evidence for positive selection in the TFs (branch-site test, all adjusted p-values > 0.05 under likelihood ratio test, Methods). The IDR is experiencing more neutral drift than the DBD, consistent with stronger purifying selection on a structured domain.
A) The pairwise omega coefficients (dN/dS), calculated with PAML [38] using the LPB93 [39] method for thirty-six YGOB orthologs, are all less than one, indicating negative selection for both the DBD and IDR (non-DBD). A phylogenetic site codon model fit to the same set of sequences finds similar trends (0.176 omega for the IDR vs. 0.107 omega for the DBD). B) The distance between the WxxLF motif and the start of the DBD is conserved. Red arrow, S. cerevisiae. C) The MSA of 124 homologs (longest per species) shows the DNA binding domain is the most conserved region. Percent identity is downweighted by the fraction of gaps in the alignment column. The final positions of the MSA (where the longest sequence has a C-terminal extension after the DBD) are not shown. D) For 502 homologs, we computed pairwise alignments for the full TF, IDR, DBD, a length-matched region immediately upstream of the DBD, and a length-matched region around the WxxLF motif (W-length DBD + 15:W + 15, central activation domain). E) The central activation domain is always less conserved than the DBD (below the line). The cluster of exact matches in the upper right results from overlapping gene models (e.g., alternative start sites).
We computationally identified 502 unique Gcn4 homologs from 124 genomes that span the Ascomycota, the largest phylum of Fungi, representing >600 million years of evolution [41] (S1 Fig). We scanned for the basic leucine zipper DBD (bZIP), and, to enrich for Gcn4 orthologs, we also scanned for the WxxLF motif, which is highly conserved in our starting set [15]. The WxxLF motif is the most important part of the central activation domain and interacts with coactivator Med15 [42–45]. We required all sequences to contain the WxxLF motif to prevent collecting all bZIP TFs in each genome. However, by forcing our sequences to have this motif, many of our analyses will overstate its conservation. An independent survey of central activation domain conservation revealed that the WxxLF motif is more conserved than all other published motifs (S2 Fig). While the Gcn4 homologs vary in length (S1F Fig), 500 have the DBD at the C-terminus, and the distance between the WxxLF motif and the DBD is very consistent (Fig 1B, S1 File).
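As an illustration of the motif filter described above, the following minimal sketch scans a protein sequence for WxxLF. It assumes that "x" means any single residue; the helper name find_wxxlf and the toy sequence are hypothetical and are not taken from the deposited analysis scripts.

```python
import re

# Assumption: "x" in WxxLF stands for any single residue.
# The lookahead lets overlapping matches be reported as well.
WXXLF = re.compile(r"(?=W..LF)")


def find_wxxlf(seq):
    """Return the 0-based position of the W for every WxxLF match in seq."""
    return [m.start() for m in WXXLF.finditer(seq)]


def has_wxxlf(seq):
    """True if the sequence would pass a 'must contain WxxLF' filter."""
    return bool(find_wxxlf(seq))


# Toy sequence (not a real Gcn4 homolog).
toy = "MSDDEFWAELFDDS"
print(find_wxxlf(toy), has_wxxlf(toy))
```

A filter of this kind guarantees the motif in every retained sequence, which is why conservation of WxxLF is expected to be overstated in downstream analyses.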
The sequence of Gcn4 homologs has diverged
To analyze the patterns of sequence conservation in Gcn4, we generated a multiple sequence alignment (MSA). The Gcn4 MSA typifies eukaryotic TF evolution, with a highly conserved DBD and lower conservation in the rest of the protein (Fig 1C and S1G Fig). For the MSA, we selected the longest annotated protein from each species (n = 124). The majority of positions in the MSA are present in one or a few sequences, indicating that the alignment is dominated by insertions. Although we required each sequence to contain a WxxLF motif, we were surprised that the MAFFT algorithm aligned the motif in nearly all instances (123/124). The central activation domain shows intermediate levels of conservation, driven in part by the WxxLF motif (Fig 1C). The alignment is similar with all homologs (S1G Fig). For all 502 homologs, we quantified the pairwise conservation between the DBD, the region immediately upstream of the DBD, and a region centered on the WxxLF motif, the putative central activation domain (Fig 1D and 1E). For each TF, the DBD annotations have different lengths, so we used length-matched regions. The putative central activation domain is less conserved than either region but is more conserved than the full-length protein or the full-length IDR (non-DBD, Fig 1D). We conclude that the putative central activation domain has low-to-intermediate conservation.
Characterization of a tiling-library on the homologs
To study the evolution of TF IDR function, we experimentally mapped activation domains on the homologs. For each of the 502 Gcn4 homologs, we tiled across the full-length protein with 40 amino acid (AA) tiles spaced every 5 AA and measured activities of all tiles in S. cerevisiae [15] (Fig 2A and 2B). We and others have shown that protein fusion libraries that tile across protein sequences with 30–60 amino acid peptides can faithfully measure acidic activation domain activity [15–17, 19, 20, 46]. Furthermore, because acidic activation domain function in yeast is a reliable measure of endogenous function in humans [47], viruses [48], Drosophila [49, 50], plants [21, 51, 52], and other yeast species [53, 54], we reasoned that the activity of fungal homologs in our assay would serve as a faithful measure of activity in their native context. In all subsequent analysis, we assume that tile activity measured in S. cerevisiae is a good proxy for activation domain function in their native species.
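The tiling scheme (40 AA tiles spaced every 5 AA) can be sketched as follows. The function name tile_protein is hypothetical, and the handling of the C-terminal end (adding one extra tile flush with the C-terminus) is an assumption made for illustration; the library design itself is described in the Methods.

```python
def tile_protein(seq, tile_len=40, step=5):
    """Return (start, tile) pairs covering seq with overlapping tiles.

    start is the 0-based position of the first residue of each tile.
    """
    if len(seq) <= tile_len:
        # Short proteins yield a single (possibly shorter) tile.
        return [(0, seq)]
    last = len(seq) - tile_len
    starts = list(range(0, last + 1, step))
    # Assumption: add a final tile flush with the C-terminus so that the
    # end of the protein is always covered.
    if starts[-1] != last:
        starts.append(last)
    return [(s, seq[s:s + tile_len]) for s in starts]


# Toy 52-residue sequence: tiles start at 0, 5, 10 and 12.
tiles = tile_protein("A" * 52)
print([start for start, _ in tiles])
```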
A) The homolog tiling strategy. B) The high-throughput assay for measuring activation domain function uses a synthetic TF with mCherry for quantification of abundance, the Zif268 DNA binding domain (DBD), an estrogen response domain (ERD) for inducible activation, and a C-terminally fused tile. Tile activity was calculated based on barcode abundance in eight equally sized bins of a FACS sorting experiment. Bins were set based on GFP/mCherry ratios. C) The distribution of measured tile activities with the activity threshold (top 20%). The S. cerevisiae Gcn4 central activation domain activity is shown in orange. D) Schematic of S. cerevisiae Gcn4 with the upstream open reading frames (uORFs) that regulate translation, the activation domains, and the DBD. Individual measured tiles are indicated as pink lines with a pink point at the center. We imputed activity at each position with a Lowess smoothing (blue). E) Schematic of the classical and alternative central activation domains (CAD) with key motifs, α-helix, and phosphosites indicated. In control activation domains, mutating motifs and aromatic or leucine residues reduced activity. In these motif mutations (WxxLF, MFxYxxL), only the aromatic and leucine residues were mutated to alanine. F) The number of active tiles found on the 502 full-length TFs (tiles that map to multiple homologs are counted multiple times). G) Combining overlapping active tiles shows that most TFs have two or more activation domains (Methods).
We mapped activation domains by measuring 40 AA tiles in our established high throughput assay [15]. We fused the 40 AA tiles to the C-terminus of a synthetic TF and integrated into the yeast genome (Fig 2B). The synthetic TFs contain mCherry to quantify abundance, an inducible estrogen response domain, and a mouse DBD that binds a GFP reporter. The GFP signal reports on the strength of the activation domain, but we normalized by the mCherry signal (TF abundance) to measure specific activity. We sorted cells based on the GFP/mCherry ratio, sequenced tiles in each pool, computed the relative abundance of each tile across the pools, and calculated a specific activity for each tile [15]. We recovered 18947 of 20731 designed tiles (91.4%), and these data were of high quality (Methods, S3 and S4 Figs). To define active tiles, we tried multiple thresholds that gave similar results and ultimately used the top 20% of sequences (Methods, Fig 2C and S5 Fig).
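The conversion from sorted-bin read counts to a per-tile activity, and the top-20% activity threshold, can be sketched as below. This is a simplified illustration rather than the deposited pipeline: it assumes that activity is the abundance-weighted mean of a representative value for each bin (bin_values is a hypothetical input, for example the median GFP/mCherry ratio of each gate), and all function names are hypothetical.

```python
def specific_activity(counts_by_tile, bin_values):
    """Abundance-weighted mean bin value for each tile.

    counts_by_tile: dict mapping tile id -> list of read counts, one per bin.
    bin_values: one representative value per bin (assumed input).
    """
    n_bins = len(bin_values)
    # Normalize by sequencing depth: total reads observed in each bin.
    bin_totals = [sum(c[b] for c in counts_by_tile.values()) for b in range(n_bins)]
    activity = {}
    for tile, counts in counts_by_tile.items():
        freq = [counts[b] / bin_totals[b] if bin_totals[b] else 0.0
                for b in range(n_bins)]
        total = sum(freq)
        if total == 0:
            continue  # tile was not recovered in any bin
        activity[tile] = sum(f * v for f, v in zip(freq, bin_values)) / total
    return activity


def active_threshold(activities, top_fraction=0.20):
    """Smallest activity that still falls within the top fraction of tiles."""
    ranked = sorted(activities, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[k - 1]


# Toy example with two bins instead of eight.
acts = specific_activity({"a": [10, 0], "b": [0, 10], "c": [0, 0]}, [1.0, 2.0])
print(acts)
```

Tiles with an activity at or above the value returned by active_threshold would be called active under this sketch.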
One benefit of our tiling strategy is that we can identify the location of activation domains. Tiling S. cerevisiae Gcn4 with 40 AA fragments finds the central activation domain (CAD, residues 101–144) but misses the N-terminal, “distributed” activation domain (residues 1–100) [44, 55] (Fig 2D). The strongest tile is the junction of the N-terminal activation domain and CAD (residues 90–129), which we call the altCAD. This result suggests that the original boundaries, which were set by restriction enzymes [44, 55–57], might not be correct. Finding “distributed” activation domains will require tiling with longer fragments, which were not available in large oligo pools when we started this work. The CAD and altCAD have two and three published short linear motifs (SLiMs), respectively (F97 F98 (FF), M107 Y110 L113 (MxxYxxL or MFxYxxL), and W120 L123 F124 (WxxLF)) [44, 55]. These motifs make large contributions to activity (Fig 2E), as previously reported [42, 44, 55]. Our high-throughput screen recovers most of the known features of Gcn4, providing confidence for characterizing the homologs.
All Gcn4 homologs are activators
First, we asked how many Gcn4 homologs were capable of activating transcription and found that all the homologs contained at least one tile that functioned as an activation domain in our assay (Fig 2F and 2G). The only exception is a deprecated gene model, and the new model has an activation domain (S1 File and S6 Fig). This result is robust to the choice of threshold for active tiles (S5 Fig). A priori, it was not a given that all the Gcn4 homologs would be activators because, on long evolutionary timescales, a family of TFs that share a conserved DBD will include both activators and repressors [21, 29, 46, 52]. Our data strongly suggest that Gcn4 homologs have been activators through 600 million years of evolution.
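Counting activation domains per TF requires combining overlapping active tiles into contiguous regions. A minimal interval-merging sketch is shown below; the rule that two tiles are merged only when they share at least one residue is an assumption, and the function name is hypothetical (the actual procedure is given in the Methods).

```python
def merge_active_tiles(starts, tile_len=40):
    """Merge overlapping active tiles into activation domains.

    starts: 0-based start positions of active tiles on one protein.
    Returns a list of (start, end) half-open intervals.
    """
    domains = []
    for s in sorted(starts):
        e = s + tile_len
        if domains and s < domains[-1][1]:
            # Tile overlaps the current domain: extend it.
            domains[-1][1] = max(domains[-1][1], e)
        else:
            # No overlap: start a new domain.
            domains.append([s, e])
    return [tuple(d) for d in domains]


# Two overlapping tiles plus one distant tile give two domains.
print(merge_active_tiles([0, 5, 100]))  # [(0, 45), (100, 140)]
```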
Based on the literature, we suspect that the homolog tiles function by binding to Med15/Gal11 [42, 43, 45]. Regions with and without published motifs crosslink to Med15 [43]. Activity of our reporter is well correlated with Med15 binding affinity to tiles in vitro [16]. Using FINCHES, a computational predictor of IDR-mediated protein interactions, we predicted that active tiles bind to Med15 more strongly than inactive tiles [58–61] (S7 Fig). We found that the four activation domain binding domains of Med15 are well conserved across the 124 species, leading us to hypothesize that this Med15/Gal11 interaction is maintained (S8 Fig).
The sequence features of highly active tiles indicate a flexible grammar
When we searched for the sequence features that control activation domain function, we found all previously observed relationships [15–17], but many signals were stronger (S9 Fig). Active tiles contain many acidic residues and WFYLM residues (Fig 3A and S10 Fig) consistent with the acidic exposure model (Fig 3B). Aromatic and leucine residues make the largest contributions to activity (Fig 3C). Methionine makes larger contributions to activity in this dataset than we have previously observed (Fig 3C). Aspartic acid (D) makes much stronger contributions to activity than glutamic acid (E) (Fig 3C and S10 Fig), which has been seen in mutants [16] or weakly in plants [21]. We suspect this effect occurs because the negative charge is closer to the peptide backbone, leading to a stronger solvation effect and more exposure of nearby hydrophobic residues [64].
A) For each tile, we computed net charge and the number of WFYLM residues. The area of the point indicates the number of tiles with the combination of properties. The color is the median activity. White star, S. cerevisiae Gcn4. B) The acidic exposure model of acidic activation domain function. C) Box and whisker plots for the residues that make the largest contributions to activity. Gray points are outliers beyond the whiskers (1.5x the interquartile range). D) For each published motif, tiles that contain the motif (orange) have higher activity than tiles without the motif (blue). p-values are uncorrected Welch’s t-test. E) Tiles with more published motifs are more active. F) For each tile with the WxxLF motif, activity is plotted against the location of the W. Blue, mean and 95% confidence interval. The location of the motif is correlated with activity. G-J) For tiles with the WxxLF motif, we compared the 500 most active to the 500 least active. The active tiles had more acidic residues (G), more W,F,Y,L residues (H), closer spacing between WFYL and acidic residues (I), and more intermixing as measured by Omega [62, 63] (J).
We searched for sequence grammar, the arrangement of amino acids, that controls activation domain function. Acidic activation domains use a highly flexible grammar [15, 16, 18, 19, 26]. As a baseline, we quantified how amino acid composition contributes to function with ordinary least squares regression on amino acid counts, which explains 49.8% of variance in activity (Table 1, Area under the receiver operator characteristic (AUC) = 0.9346, and area under the precision recall curve (PRC) = 0.7620, S1 Table). Regression on dipeptides [19] led to sixty-nine significant parameters that explain 60.8% of the variance in activity (Table 1, AUC = 0.9472, PRC = 0.8190). More complex sequence motifs did not improve the regression models: published motifs explained 33.4%, and forty de novo motifs explained 50.2% of the variance in activity (Table 1). Combining motifs and composition performed similarly to dipeptides. Composition and dipeptides are the major determinants of activity, consistent with weak grammar.
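The composition features that enter such a regression can be sketched as follows: one row of twenty amino acid counts per tile, together with the net charge and WFYLM count used to summarize tiles. The fitting step itself is omitted. The charge convention (K and R as +1, D and E as -1, histidine neutral) is an assumption, and all names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(tile):
    """One design-matrix row: the count of each of the 20 amino acids."""
    counts = Counter(tile)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def net_charge(tile):
    """Assumed convention: K, R = +1; D, E = -1; histidine treated as neutral."""
    return tile.count("K") + tile.count("R") - tile.count("D") - tile.count("E")


def hydrophobic_count(tile, residues="WFYLM"):
    """Number of aromatic, leucine and methionine residues in the tile."""
    return sum(tile.count(r) for r in residues)


toy_tile = "DDEEWFLK"  # toy example, not a measured tile
print(composition_features(toy_tile), net_charge(toy_tile), hydrophobic_count(toy_tile))
```

Stacking these rows for all measured tiles gives the predictor matrix for an ordinary least squares fit against measured activity.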
All published motifs that are essential for function in S. cerevisiae are enriched in active tiles (Figs 2E and 3D), but only WxxLF is conserved (S2 Fig [45]). Although we forced every sequence in our homologs to contain a WxxLF, two independent analyses indicate that the WxxLF motif is the most conserved published motif (S2 Fig). Tiles with more motifs tend to be more active, consistent with multivalent binding to coactivators [19, 43] (Fig 3E). Motifs alone are not necessary for activity, because hundreds of tiles without motifs are highly active (Fig 3E). Conversely, motifs are not sufficient for activity because hundreds of sequences with motifs are inactive (Fig 3D and 3E). For example, WxxLF is present in 2005 tiles, and 1267 of these are active (63.2%), but they are only 33.7% of all active tiles (n = 3761; Fig 3D and 3F). Comparing tiles with the WxxLF motif that had high or low activity revealed that highly active tiles were more acidic (Mean charge = -7.7 for active vs. -6.1 for inactive, Fig 3G) and had more WFYL residues (Mean number WFYL = 8.9 for active vs. 6.2 for inactive, Fig 3H), consistent with the literature [15–18]. The sequence features controlling activity in tiles with or without the WxxLF motif are indistinguishable (S11 Fig), further suggesting they work in the same way [43].
We used tiles containing the WxxLF motif to identify three signals of weak sequence grammar. First, active tiles have shorter distances between DE and WFL residues (Mean distance = 2.0 for active vs. 4.5 for inactive, Fig 3I). Second, tiles with more evenly intermixed acidic and W,F,Y,L residues are more active, supporting the acidic exposure model [62] (Mean custom omega = 0.12 for active vs. 0.22 for inactive, 0 is perfectly mixed and 1 is completely segregated, Fig 3J, Methods). Third, tiles with the WxxLF motif near the C-terminus are more active (Fig 3F), perhaps because the negative charge of the C-terminus increases exposure. Even this highly conserved WxxLF SLiM requires an acidic context [15] and supporting hydrophobic residues to create an activation domain.
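One way to quantify the spacing between hydrophobic and acidic residues is the mean distance from each hydrophobic residue to its nearest acidic residue. The sketch below is an assumed definition for illustration only; the exact metric and residue sets used for Fig 3I are defined in the Methods, and the function name is hypothetical.

```python
def mean_nearest_acidic_distance(tile, hydrophobic="WFYL", acidic="DE"):
    """Mean distance (in residues) from each hydrophobic residue to the
    nearest acidic residue. Returns None if either class is absent."""
    acid_pos = [i for i, aa in enumerate(tile) if aa in acidic]
    hyd_pos = [i for i, aa in enumerate(tile) if aa in hydrophobic]
    if not acid_pos or not hyd_pos:
        return None
    nearest = [min(abs(h - a) for a in acid_pos) for h in hyd_pos]
    return sum(nearest) / len(nearest)


# Intermixed residues give a shorter mean distance than segregated ones.
print(mean_nearest_acidic_distance("DWDF"))    # 1.0
print(mean_nearest_acidic_distance("WWWDDD"))  # 2.0
```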
In conclusion, these yeast activation domains contain a core of W and F residues, support from Y, L, and M residues, and an acidic context (Fig 3A and S1 Table). We found a handful of highly active tiles without aromatic residues but many L and M residues. Composition makes large contributions to activity, but weak, local grammar also contributes. Motifs contribute to activity but are neither necessary nor sufficient (Fig 3D, 3E, and 3F). These findings parallel the emerging idea that high-quality context (i.e., supporting residues) can compensate for a weak motif [10].
Conservation of function without conservation of sequence in the central activation domain
To determine whether the CAD of Gcn4 is functionally conserved, we used overlapping tiles to infer the activity of each position in each full-length protein (Methods). We visualized activity as a heatmap aligned on the WxxLF motif using one homolog per genome (Fig 4, 120/124) or all homologs (S12 Fig, 497/502). In both cases, the longest sequences distorted the plot and were not shown. The peak of activity is ten residues upstream of the WxxLF motif at the center of the altCAD (Fig 4, inset). The bright vertical band of activity around the WxxLF motif indicates that the CAD is functionally conserved across most homologs.
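Inferring a per-position activity from overlapping tiles can be sketched by averaging the activity of every tile that covers a position. This plain average is a simplified stand-in chosen for illustration (the Fig 2D legend describes Lowess smoothing), and the function name is hypothetical.

```python
def per_position_activity(protein_len, tiles, tile_len=40):
    """Average the activity of all tiles covering each position.

    tiles: iterable of (start, activity) with 0-based tile starts.
    Positions not covered by any measured tile are returned as None.
    """
    sums = [0.0] * protein_len
    hits = [0] * protein_len
    for start, activity in tiles:
        for i in range(start, min(start + tile_len, protein_len)):
            sums[i] += activity
            hits[i] += 1
    return [s / h if h else None for s, h in zip(sums, hits)]


# Toy example with 4-residue tiles on a 6-residue protein.
print(per_position_activity(6, [(0, 1.0), (2, 3.0)], tile_len=4))
# [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```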
We used the tile activity data to impute the activity of each position in all the homologs and visualized these activities as a heatmap. One homolog from each species is sorted by length and aligned on the WxxLF motif (120/124 plotted; long sequences are excluded because they distort the plot). The DBD is boxed. Activity is consistently high in the central activation domain, indicating deep functional conservation. Upstream activity moves around. Inset: vertically averaging the heatmap indicated the peak is ten residues upstream of the WxxLF motif. We averaged over all sequences present at each position. Red arrow, S. cerevisiae. 497 homologs are shown in S12 Fig.
To look at the sequence divergence of the CAD, we focused on a 70 AA region around the WxxLF motif (W-50: W + 19). In this region, from the 502 homologs, there are 138 unique sequences, 137 of which contain WxxLF (For Catenaria anguillulae, the motif aligned to a different region of the MSA, see S13 Fig). Of these 138 CADs, 125 (90%) had high activity. In individual homologs, the CAD drifts side-to-side but generally stays near the WxxLF motif (S12 and S14 Figs). The sequences of the CADs have diverged (Fig 1D and 1E), yet 90% retain activation function, indicating conservation of function without conservation of sequence.
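Extracting the 70 AA window (W-50 to W+19, inclusive) and collapsing identical windows into unique sequences can be sketched as follows. The function names are hypothetical, and clipping the window at the protein termini is an assumption.

```python
def central_region(seq, w_index, upstream=50, downstream=19):
    """Slice W-50 .. W+19 (inclusive) around the W of WxxLF.

    w_index is the 0-based position of the W in seq.
    """
    start = max(0, w_index - upstream)               # clip at the N-terminus
    end = min(len(seq), w_index + downstream + 1)    # slice end is exclusive
    return seq[start:end]


def unique_central_regions(homologs):
    """homologs: iterable of (sequence, w_index). Returns distinct windows."""
    return {central_region(seq, w) for seq, w in homologs}


# Toy protein with the W of WxxLF at position 60.
toy = "A" * 60 + "WAALF" + "A" * 40
region = central_region(toy, 60)
print(len(region), region[50])  # 70 W
```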
Simulations of neutral sequence drift
Should we be surprised that 125/138 of the unique central regions retain activation domain function over 600 MY of divergence? To generate a null hypothesis for neutral drift of protein sequence in the central activation domain, we turned to simulations. We used the evolver program from the PAML package to simulate neutral DNA evolution [65, 66] (Methods). We started from the Gcn4 WT sequence and simulated for branch lengths corresponding to the divergence of our sequences. We predicted activation domain function with TADA, a deep learning model that performs well on our data (see below) [21], and we assessed DBD function as the predicted ability to fold, using FoldX [67].
In our neutral drift simulations, the DBD function deteriorates much more rapidly than the activation domain function (Fig 5A). However, at long evolutionary distances, activation domain function also deteriorates (36/200 still active in the central region). At the equivalent distance in our experimental data, determined using the branch lengths in the gene tree, 125/138 of our Gcn4 homologs have an active central activation domain. There is more conservation of function in the extant sequences than in the simulations. In the simulations, new domains appear outside the central region, suggesting that new activation domains can evolve from neutral drift. These simulations support the idea that the homologs are experiencing negative selection against substitutions that decrease activation domain activity, paralleling the analysis of the YGOB sequences (Fig 1A). This result is also consistent with stabilizing selection to maintain the central activation domain function.
A) In simulations of neutral drift, DBD function deteriorates more quickly than activation domain function. At the evolutionary distances we surveyed, the majority of simulated sequences have lost activation domain activity. B) Presence/absence of published Gcn4 motifs in the central activation domain (FF, MFxYxxL, WxxLF) on the gene tree. Red represents the presence of the WxxLF motif in the central activation domain sequence. Note that the WxxLF motif was required in all of our homologs. Yellow represents the FF motif. Purple represents the MFxYxxL motif. S. cerevisiae is highlighted in grey. C) For the sixty-nine most active unique regions around the WxxLF motif, a bar plot showing the relative amino acid frequencies for the three most common amino acids from the MSA (MSA shown in S16 Fig). The acidic residues, D and E, interchange. D) A sequence logo for the MSA. Arrows indicate the eleven positions where F is the most abundant residue. The two highly-conserved Fs are indicated with black arrows. Black, S and P residues. E) Stronger central activation domains contain more F residues and more L residues than weaker activation domains. The stronger central activation domains also tend to have more intermixing (lower omega) between the hydrophobic and acidic residues. All these regions are acidic.
Conservation of function without conservation of sequence in the central activation domain of Gcn4 results from turnover of individual amino acids
Next, we investigated whether the conservation of function in the central activation domain without conservation of sequence that we observed could be explained by evolutionary turnover of motifs or evolutionary turnover of individual residues. If the key functional unit is a motif, we would expect either strong conservation of the motifs or repeated gain and loss of the motifs. If the key functional units are individual residues, we would see conservation of key residues or repeated gain and loss of individual residues. To investigate the central activation domain, we extracted regions of the MSA that aligned with the S. cerevisiae Gcn4 central activation domain (W-50: W + 19). Note the conservation of the WxxLF motif will be overstated because we forced all these sequences to have the motif. There are 138 central activation domains, 125 of which were highly active in our assay.
We found minimal evidence for turnover of motifs that are important for central activation domain function in S. cerevisiae. The FF, MFxYxxL, and WxxLF motifs contributed to activity (Fig 2E) and were enriched in active tiles (Fig 3D), but only the WxxLF motif was conserved (Fig 5B, 5C, and 5D). Although WxxLF conservation is artificially high due to our homolog search criteria, two independent homolog searches confirm its high conservation (S2 Fig). Only 5/138 sequences have the MFxYxxL motif. Compared to S. cerevisiae, 116 central activation domain sequences lost the FF motif, but only 4/116 have gained new instances in alternative locations, suggesting limited turnover (Fig 5B and S15C Fig). We found one case of convergent evolution (gain) of the WxxLF motif in an N-terminal activation domain of Catenaria anguillulae, which comes from the Blastocladiomycota, early diverging fungi [68] and our only sequence from outside the Ascomycota (S13 Fig). This WxxLF motif is embedded in a histidine-rich context, not an acidic context, but these tiles still have high activity. This example emphasizes the rarity of gaining a WxxLF motif. Based on these data, we do not see evidence for evolutionary turnover of motifs.
We found evolutionary turnover of acidic residues within the central activation domain. In the 50% most active sequences (n = 69), individual acidic positions (D and E) interconvert (Fig 5C, 5D, S16, and S17 Fig). When pooled together, D + E conservation matches or exceeds the conservation level of many aromatic residues (Fig 5C, red plus pink is taller than many orange bars). Net negative charge is conserved in all the sequences (Fig 5E). Acidic residues interconvert and move around the activation domain, hallmarks of turnover.
Finally, we found evolutionary turnover of individual phenylalanine (F) residues that are critical for high activity (Fig 5C, 5D, S16, and S17 Figs). Outside of WxxLF, there is one highly conserved F (F108 in S. cerevisiae, left black arrow in Fig 5D). Besides these two highly conserved Fs, there are nine more positions where F is the most common residue (Fig 5D). The number of F residues is not highly constrained, although the most active sequences had more F residues, more L residues, and more intermixed WFYL and acidic residues (Fig 5E). The F residues experience evolutionary turnover, giving the appearance of moving around the activation domain.
We tested if the rare instances of FF motif turnover arose by chance from turnover of individual F residues. Both published (FF, FxF, FxxF [55]) and unpublished F motifs (FxxxF, FxxxxF) show similar patterns of gain and loss (S15 Fig). The most common motif is the unpublished FxxxxF motif. Moreover, a random sequence with the same length and F frequency as seen in these central activation domains will have a 30.45% chance of having an FF motif (Methods). This probability is higher than the observed frequency of FF (26/138, 19%). Based on these observations, we think the observed FF motif turnover is a consequence of turnover of individual F residues occurring next to each other by chance.
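The chance expectation discussed above can be illustrated with a short recurrence. The sketch below is not the calculation described in the Methods; it assumes residues are drawn independently with a fixed F frequency, and the function name, the length of 70 (roughly the W-50 to W+19 window), and the F frequency of 0.07 in the usage line are illustrative placeholders rather than values taken from the data.

```python
def prob_has_ff(length: int, f_freq: float) -> float:
    """Probability that an i.i.d. random sequence contains at least one 'FF'.

    Tracks the probability of having seen no FF so far, split by whether
    the most recent residue is F or not.
    """
    if length < 2:
        return 0.0
    no_ff_end_other = 1.0 - f_freq  # no FF yet, last residue is not F
    no_ff_end_f = f_freq            # no FF yet, last residue is F
    for _ in range(length - 1):
        no_ff_end_other, no_ff_end_f = (
            (no_ff_end_other + no_ff_end_f) * (1.0 - f_freq),  # append non-F
            no_ff_end_other * f_freq,                          # append F after non-F
        )
    return 1.0 - (no_ff_end_other + no_ff_end_f)


if __name__ == "__main__":
    # Hypothetical inputs: a 70-residue window with 7% phenylalanine.
    print(f"{prob_has_ff(70, 0.07):.4f}")
```

Using the observed F frequency of the central activation domains in place of the placeholder would give the corresponding chance expectation under this independence assumption.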
We see conservation and turnover of SP/TP dipeptides, which regulate protein degradation, not transcriptional activation. At the location of the most conserved SP, S. cerevisiae has T105P, which is phosphorylated by a Cdk kinase, Pho85, that recognizes S or T to create a phosphodegron [69, 70]. Homologs contain up to five SP instances (Fig 5D and S18 Fig). SP/TP dipeptides are gained and lost across the phylogeny (S18 Fig). F108 may also aid recognition of T105 by Pho85 and Pcl5 [69, 70]. Pho85 has a preference for a hydrophobic residue at the +3 position after the phosphosite when in complex with Pho80 [71, 72], and this preference may extend to Pcl5. In simulations of enhancers with neutral drift and weak negative selection, overlapping TF binding sites become very conserved because it is difficult for evolutionary turnover to replace two functions simultaneously [30, 31, 73]. We suspect F108 is resistant to evolutionary turnover because it contributes to both activation and the phosphodegron. By extension, if the WxxLF motif has two functions, this would explain its deep conservation.
Finally, in some homologs (Fig 4, S12, and S14 Fig), the activity peak of the central activation domain appears to move side-to-side relative to the WxxLF motif, which could be a consequence of evolutionary turnover of key residues. In simulations of regulatory DNA [31], evolutionary turnover of TF binding sites causes enhancers to drift side-to-side.
Naturally occurring changes in sequence generally do not change activity
To look for further evidence of neutral drift and stabilizing selection, we looked at whether small changes in homologous sequences had meaningful effects on activity. In a “natural experiment,” we compared pairs of extant sequences that differed by one or two amino acids and found that most substitutions do not change activity. As a baseline for differences in tile activities, we chose 10000 random pairs of tiles and computed the difference between their activities. The distribution of activity differences between tiles that differ at 1–2 amino acids is much smaller (Fig 6A). The rare substitutions that caused large changes in activity often impacted aromatic, leucine, or acidic residues, consistent with the acidic exposure model (S19 Fig). This result is consistent with the homologs diversifying via neutral drift and negative selection against substitutions that change activity.
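The comparison described above can be sketched as follows. This is a minimal illustration, assuming activities are stored in a dictionary keyed by tile sequence, that "difference" means the absolute difference, and that only equal-length tiles are compared by Hamming distance; the function names are hypothetical, and the all-pairs loop is quadratic and would need indexing for a library of this size.

```python
import random
from bisect import bisect_right
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def near_identical_diffs(activity: dict, max_subs: int = 2) -> list:
    """Activity differences for tile pairs differing at 1..max_subs residues."""
    diffs = []
    for (s1, a1), (s2, a2) in combinations(activity.items(), 2):
        if len(s1) == len(s2) and 1 <= hamming(s1, s2) <= max_subs:
            diffs.append(abs(a1 - a2))
    return diffs


def random_pair_diffs(activity: dict, n_pairs: int = 10000, seed: int = 0) -> list:
    """Baseline: activity differences for randomly chosen pairs of tiles."""
    rng = random.Random(seed)
    values = list(activity.values())
    return [abs(a - b) for a, b in (rng.sample(values, 2) for _ in range(n_pairs))]


def ks_statistic(x: list, y: list) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )
```

A large KS statistic between the two lists of differences would indicate that near-identical tiles have more similar activities than random pairs do.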
A) Extant single and double substitutions rarely change activity (KS = Kolmogorov–Smirnov). B) Examples of gain of F residues before loss of ancestral F residues in Gcn4 homologs from the Y1000 sequences. Ancestral sequences at the shown nodes have been reconstructed using IQ-TREE [74]. At right, the nucleotide sequence is shown in grey. C) Local alignment around the gain/loss events in B.
Additional evidence for turnover of individual amino acids
When sequences are experiencing compensatory evolution under stabilizing selection, gains of functional elements should precede losses of functional elements, giving the appearance of turnover. We tested this prediction by searching for examples where gains of F residues preceded losses of F residues. In the homologs we experimentally tested, there is too much evolutionary distance between the species to answer this question. Instead, we looked for homologs in the Y1000+ collection, which contains more closely related species [75]. We found multiple examples of gain of F residues before loss of F residues (Fig 6B, 6C, and S20 Fig). We performed local ancestral sequence reconstruction of these sequences and found that the most probable ancestral sequences show gain before loss (Fig 6B, 6C, and S20 Fig). These examples of gain before loss bolster the evidence for evolutionary turnover of key F residues.
The turnover and conservation patterns we observed in the Gcn4 homologs generalized to other activation domains. For three other TFs, we searched for homologs in the Y1000+ collection and made alignments of their predicted activation domains (S21 Fig). In these MSAs, aromatic residues were conserved and acidic residues interchanged at many positions. Some positions also showed interchange between aromatic residues and leucine residues. When we reanalyzed measurements of Pdr1 orthologs [16], we saw similar patterns. We conclude that evolutionary turnover of aromatic, leucine, and acidic residues enables conservation of function without conservation of sequence in acidic activation domains.
Evolutionary turnover of upstream activation domains
We investigated evolutionary turnover on longer length scales. When we mapped activity onto the full-length sequences (Fig 4 and S14 Fig), we noticed repeated gain and loss of activation domains upstream of the central activation domain driven by insertions, deletions, or amino acid substitutions. For example, in Aspergillus tamarii and A. westerdijkiae (Fig 7A), a deletion and two extra F residues create a stronger upstream activation domain. In Dendryphion nanum and Didymocrea sadasivanii, two M and two L residues create a stronger activation domain at the N-terminus (Fig 7B). In another set of species, the N-terminal and central activation domains appear to trade off strengths (Fig 7C). In 2/3 of these examples, the more active homolog has more helicity predicted by AlphaFold (Fig 7A, 7B). However, across all homologs, the central activation domains show lower predicted helicity than upstream activation domains (Fig 7D, 7E). Some homologs have four prolines around WxxLF, suggesting the S. cerevisiae helix may not be well conserved (S16 Fig).
A-B) Pairs of closely related species where the strength of the upstream activation domain changes dramatically due to substitutions or insertions. AlphaFold2 predicts short alpha helices in the active sequences. C) In three related species, the relative strengths of the central and upstream activation domains appear to trade off. AlphaFold-predicted structures of the upstream activation domains are shown. D) The minority of central activation domains are predicted to form helices by SPARROW [58]. E) The majority of the upstream activation domains are predicted to form helices by SPARROW [58]. F) We counted activation domains in the longest gene per species and projected the number of domains onto the species tree. The inner circle counts activation domains by counting the number of peaks in the smoothed activity above our threshold. The outer circle counts activation domains by merging overlapping tiles. Both patterns are consistent with recurrent gain and loss. S. cerevisiae is highlighted in grey. Species marked with green, red, and orange bars correspond to the species in A-C.
Gain and loss of upstream activation domains was common across the phylogeny (Fig 7F and S6 Fig). We counted activation domains in two ways: 1) by combining overlapping active tiles, which undercounts activation domains because it requires a gap longer than 40 AA; or 2) by counting peaks in the smoothed traces, which is sensitive to the threshold and may overcount activation domains because only 15 AA are required between peaks. For example, in some homologs, the central activation domain narrows, widens, or splits into two active regions (S6 Fig). Under both methods, the majority of homologs have more than one activation domain (Fig 2G and S6 Fig), and there is recurrent gain and loss of upstream activation domains across the phylogeny (Fig 7F). We stress that the activity patterns are sometimes difficult to force into a discrete number of activation domains, but counting domains is a useful approximation for projecting onto the phylogeny. In addition to changes in the number of activation domains, the shape of the activity traces changes across species (S6 Fig). Closely related species can show large changes in activity. Together, these analyses demonstrate evolutionary turnover of entire upstream activation domains.
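The two counting strategies can be sketched as below. This is an illustration under stated assumptions, not the analysis code: active tiles are represented by their start positions, the smoothed trace is a list with one value per residue, and the peak rule (a strict rise followed by a non-rise, accepted left to right with a minimum spacing) is a simplification. The function names and the default tile length and spacing are taken from the description above or chosen for illustration.

```python
def merge_active_tiles(active_starts, tile_len=40):
    """Method 1: merge overlapping active tiles into contiguous domains.

    Returns a list of (start, end) half-open intervals.
    """
    domains = []
    for start in sorted(active_starts):
        end = start + tile_len
        if domains and start < domains[-1][1]:
            # Tile overlaps the current domain: extend it.
            domains[-1][1] = max(domains[-1][1], end)
        else:
            domains.append([start, end])
    return [tuple(d) for d in domains]


def count_peaks(trace, threshold, min_separation=15):
    """Method 2: count local maxima above threshold, at least min_separation apart."""
    peaks = []
    for i in range(1, len(trace) - 1):
        is_local_max = trace[i - 1] < trace[i] >= trace[i + 1]
        if is_local_max and trace[i] > threshold:
            if not peaks or i - peaks[-1] >= min_separation:
                peaks.append(i)
    return len(peaks)
```

Merging tends to fuse nearby domains into one interval, while peak counting can split a single broad domain, which matches the under- and overcounting behavior noted above.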
Machine learning models support evolutionary turnover
The Gcn4 homologs provide a large dataset of diverse sequences to evaluate deep learning models that predict activation domains from amino acid sequence (S22 Fig). We compared two first-generation neural networks, ADpred and PADDLE [16, 19], with a second-generation model that we helped develop, TADA [21, 22]. We also trained a new regression model, ADHunterLite, a one-hot encoded residual neural network, to predict activation domain strength quantitatively [76]. All the models can approximate the locations of activation domains in full-length TFs, but the two new models are substantially more accurate at predicting the activities of individual tiles and identifying activation domain boundaries (S23 Fig).
The machine learning models further support weak sequence grammar. TADA blurs the raw sequence with sliding windows, so it can only learn weak grammar. The one-hot encoding in ADHunterLite allows it to learn any grammar, but when we trained it on the same data as TADA, its performance was indistinguishable (S23D and S23E Fig, purple). This parity of model performance argues against strong grammar. Both models can learn local weak grammar, like dipeptides. Training ADHunterLite on the Gcn4 homolog data increases performance, as expected (S23D and S23E Fig, orange and pink). Neural network models can achieve high accuracy without strong grammar, supporting weak grammar.
To test if there is a correlation between conservation and contribution to activity, we ran TADA and ADHunterLite on the central activation domains. Both models predicted that all the F residues contribute to activity (S24A and S24D Fig). Removing more F residues caused larger predicted decreases in activity (S24B, S24C, S24E, and S24F Fig). The two most conserved F residues (in the WxxLF motif and F108) show large predicted effects on activity (S24G and S24H Fig). For the other F residues, there is no correlation between conservation and predicted change in activity (S24G and S24H Fig). This computational analysis further supports compensatory evolution under stabilizing selection in the central activation domain of the Gcn4 homologs.
Integrating the activity traces
To estimate the strengths of full-length TFs, we integrated the area under the smoothed activity traces (Fig 4). We were surprised to see that these integrals formed a tight, unimodal distribution because individual TFs gained and lost activation domains (S25 Fig). Permutation testing of the activity traces (with or without replacement) indicated the variance of the observed distribution is much smaller than expected by chance (S25 Fig). At face value, this result suggests that activity of the full-length TFs is maintained in a narrow range by stabilizing selection. We are surprised that integrals hint at this kind of constraint because in full-length proteins we expect synergy between activation domains that cannot be captured by our tiling strategy [43]. A future direction will be to test this conservation of activation domain strength in full-length orthologs.
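The integration and variance comparison can be illustrated with a short sketch. The trapezoidal rule is a standard choice for the area under an evenly spaced trace; the resampling scheme shown here (pooling all trace values, redrawing them with or without replacement, and reassigning them to TFs of the original lengths) is an assumption made for illustration and may differ from the permutation procedure actually used. All function names are hypothetical.

```python
import random
from statistics import pvariance


def integrate_trace(trace, spacing=1.0):
    """Trapezoidal area under an evenly spaced activity trace."""
    if len(trace) < 2:
        return 0.0
    return spacing * (sum(trace) - 0.5 * (trace[0] + trace[-1]))


def null_variances(traces, n_perm=1000, replace=False, seed=0):
    """Variance of integrals after redistributing pooled trace values (assumed null)."""
    rng = random.Random(seed)
    pooled = [v for t in traces for v in t]
    variances = []
    for _ in range(n_perm):
        if replace:
            draw = rng.choices(pooled, k=len(pooled))
        else:
            draw = rng.sample(pooled, k=len(pooled))
        integrals, i = [], 0
        for t in traces:
            # Give each TF a resampled trace of its original length.
            integrals.append(integrate_trace(draw[i:i + len(t)]))
            i += len(t)
        variances.append(pvariance(integrals))
    return variances
```

Comparing `pvariance` of the observed integrals with the distribution returned by `null_variances` gives a rough sense of whether the observed spread is narrower than this null would produce.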
Discussion
By functionally screening protein fragments from a family of homologous TFs, we demonstrate how conservation of activation function in TF IDRs without conservation of sequence arises from neutral drift, stabilizing selection, evolutionary turnover of complete activation domains, and evolutionary turnover of key acidic and F residues within activation domains. In some cases, gain and loss of upstream activation domains appears to result from turnover of individual residues. Our results emphasize how understanding the residues that control function in an IDR can reveal conservation and evolutionary turnover patterns that are difficult to see in traditional comparative genomics.
Evolutionary turnover of key residues in activation domains arises from the weak grammar constraints. Weak grammar explains why multiple screens for activation domains have found only one recurrent motif, LxxLL, which can be important for binding the Kix domain [16, 19, 29, 33, 77, 78]. We argue that yeast activation domains are nucleated by clusters of W and F residues surrounded by acidic residues and boosted by Y, L, and M residues. Under weak molecular grammar, individual residues are easily replaced, facilitating evolutionary turnover. Each TF family has a different conserved cluster of hydrophobic residues that represents a good solution to binding the preferred coactivator. As a result, each TF family will appear to have a conserved, essential motif, but convergent evolution of motifs is rare.
We propose that it is time to stop thinking about activation domains as short linear motifs in a permissive context because the key sequence features that control function are much more flexible than motifs. For now, we believe the best heuristic is to use the predictions of the neural networks. These models are still black boxes, so this heuristic is mechanistically unsatisfying. We and others are working to interpret these models and learn the underlying biology [18, 19, 21, 22, 76, 79]. Based on our analysis of the Gcn4 homologs, we believe that the neural networks are good enough to study TF evolution computationally.
Evolutionary turnover of key residues is possible because of the physical flexibility of the Gcn4-Med15 interaction [42, 43, 45]. Folding and binding creates a short helix that presents the WxxLF motif in many orientations to Med15 [42], and simulations suggest these orientations interconvert [80]. The physical flexibility of these interactions allows for evolutionary plasticity. Binding one sequence in multiple orientations is a step towards binding diverse homologs, which in turn is a step towards binding to many activation domains [16, 42, 43, 45, 81, 82]. Coactivators that impose weak structural constraints on activation domains can become engines for evolutionary diversification of activation domains through neutral drift, creating an enormous sequence reservoir for later selection. This diverse sequence reservoir allows for selection on standing variation in new environments. This study adds an evolutionary dimension to the idea that the physical flexibility of IDRs allows for multiple binding modes to the same or different partners [77, 83, 84].
Our deep dive into the evolution of one family complements other studies of IDR evolution. Using small numbers of sequences, conservation of IDR function across homologs has been observed, but often the essential residues are unknown [7, 85]. In other systems, there is functional conservation of diverged IDRs, but the key residues [9, 86] or motifs are conserved [87]. In other cases, functional conservation results from the composition, but not the arrangement, of residues through emergent properties like net charge [8, 88–92]. These cases likely permit even more turnover than we observe in Gcn4. In Afb1, IDR function is not conserved [10], and the Msn2/4 IDRs have two overlapping functions, only one of which is conserved [11]. The closest parallel to our turnover of key residues is de novo evolution of phosphorylation motifs [35]. There remains a need for better IDR-alignment algorithms or alignment-free methods to group functionally related IDRs.
Our results fit well with findings that at long evolutionary distances, transcriptional regulatory networks rewire, substituting individual TFs but maintaining circuit logic [53, 93, 94]. Here, we examined much longer evolutionary distances and found that all the Gcn4 homologs are activators, indicating that the signs of TF connections (i.e., protein function) are more conserved than individual connections (i.e., TF binding sites). Slow changes in TF function reduce pleiotropy and may make it easier to substitute TFs at individual regulatory elements.
The turnover of key hydrophobic residues in activation domain evolution bears strong parallels to the turnover of TF binding sites in enhancer evolution. Metazoan enhancers are regulatory DNA that contain clusters of TF binding sites [95]. The DNA sequence of enhancers diverges rapidly as individual TF binding sites are gained and lost but function is maintained [32, 96, 97]. Orthologous enhancers are often impossible to detect in sequence alignments but can be identified by searching for clusters of TF binding sites [30, 98, 99]. Turnover of entire activation domains in full-length TFs parallels turnover of entire enhancers in a locus [100]. Turnover of key residues within activation domains parallels turnover of TF binding sites within enhancers. Activation domains [24, 25] and enhancers [101] have very flexible grammar. Given that TFs function by binding to enhancers, it is striking that both the protein and the DNA are evolving in the same way. Turnover of TF binding sites and enhancers endows gene expression with robustness to environmental stress and evolutionary plasticity [102–106]. Turnover of key residues and full activation domains may similarly endow TFs with plasticity and robustness.
Limitations of this study
The primary limitation of this work is that we measured the activities of short fragments in one species. Measuring short, uniform fragments makes the experiments possible but can miss emergent activation domains [45, 55]. If, in some species, an activation domain and cognate coactivator together experience many compensatory mutations, the assay may not detect activity. It remains possible that small changes in activity could be locally adaptive but missed in our screen of fragments. Additionally, our homologs may be too sparsely sampled to detect positive selection. In the future, limited screening in additional species or screening tiles with multiple lengths would enrich this work. A further limitation of this work is that we investigated Gcn4 homologs only with the WxxLF motif, and sequences that have lost this motif present an opportunity for future directions.
Materials and methods
Identification of homologous sequences
We computationally screened for homologs of S. cerevisiae Gcn4. We started with a hand-collected set of forty-nine homologs, forty-eight of which contained the WxxLF motif [15, 45]. To find new homologs, we used two criteria: the bZIP DNA binding domain (IPR004827) and the regular expression Wx[SPA]LF for the WxxLF motif. These criteria distinguished Gcn4 homologs from other leucine zipper DNA binding domain TFs. We scanned 207 diverse and representative proteomes from the MycoCosm database (mycocosm.jgi.doe.gov). This initial analysis was performed in 2020 by Sumanth Mutte of MyGen Informatics. Eighty-four of the genomes were from MycoCosm, while the original homolog collection contributed forty-five species. The experimental screen was performed using all 502 unique homologs (S2 Table). For some analyses, we wanted one sequence per genome. We selected the longest sequence per genome, yielding a set of 124 sequences. For analyses of the central activation domain, we used only unique sequences in that region. There were 138 unique central activation domain regions.
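The motif criterion can be illustrated with a short scan. This sketch assumes that "x" in Wx[SPA]LF stands for any single residue, so the pattern becomes `W.[SPA]LF` in Python regular expression syntax; it covers only the motif criterion, not the bZIP domain annotation, and the function name and example sequences are hypothetical.

```python
import re

# Assumed translation of the motif pattern: "x" means any one residue.
WXXLF = re.compile(r"W.[SPA]LF")


def find_wxxlf(protein_seq: str):
    """Return (0-based start, matched text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in WXXLF.finditer(protein_seq)]


if __name__ == "__main__":
    # Toy peptide used only to show the output format.
    print(find_wxxlf("DNSKEWTSLFDNDI"))
```

A candidate protein would pass this filter when `find_wxxlf` returns at least one hit and the sequence also carries the bZIP annotation.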
We confirmed that the WxxLF motif is well conserved in fungal TFs with HMMER. We ran the web server for HMMER with default parameters, using S. cerevisiae Gcn4 as the seed sequence and restricting our search to Fungi. In the second, third, and fourth iterations of this search, the WxxLF motif was the most prominent feature of the profile HMM in the central region of the TF and always much more prominent than all other published motifs [19, 55]. S2A Fig shows the pHMM from the fourth iteration.
We also confirmed that the WxxLF motif was conserved using the Y1000+ genomes [75]. We extracted the pre-computed ortholog set that contained Gcn4. Within these sequences, we counted all instances of the published motifs (S2 Fig).
For the full-length homologs, MSAs were performed with the MAFFT algorithm (S3 and S4 Tables). We removed the two longest homologs that had the DBD near the center.
Short alignments were created with MUSCLE online (https://www.ebi.ac.uk/Tools/msa/muscle/) or with MAFFT v7.526 and visualized with weblogo.berkeley.edu or the LogoMaker [107] Python package.
Design of the Gcn4 oligo library
We took the 502 unique protein sequences and computationally chopped them into 40 AA tiles spaced every 5 AA (e.g., 1–40, 6–45, 11–50, etc.). As a result, if two closely related sequences contain identical regions, insertions, or alternative start sites that change the phasing, a single tile can map to multiple full-length homologs. We removed duplicate tile sequences, yielding 20679 unique tiles. We added fifty-two control sequences (controls were included twice in the oligo pool to increase the probability they were recovered in the plasmid pool during cloning). The controls included hand-designed mutants in control activation domains and a handful of sequences from our previous study [15] (S5 Table, Control sequences). The final design file contained 20783 entries.
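The tiling and de-duplication steps can be sketched as below. This is a minimal illustration, assuming that any C-terminal remainder shorter than a full tile is dropped (the handling of protein ends is not specified here); the function names and the input dictionary of named sequences are hypothetical.

```python
def tile_sequence(seq: str, tile_len: int = 40, step: int = 5):
    """Yield (0-based start, tile) for full-length tiles spaced every `step` residues."""
    for start in range(0, len(seq) - tile_len + 1, step):
        yield start, seq[start:start + tile_len]


def unique_tiles(sequences: dict, tile_len: int = 40, step: int = 5) -> dict:
    """Map each unique tile to the list of (homolog name, start) pairs it came from."""
    tile_to_sources = {}
    for name, seq in sequences.items():
        for start, tile in tile_sequence(seq, tile_len, step):
            tile_to_sources.setdefault(tile, []).append((name, start))
    return tile_to_sources
```

Keeping the list of sources per tile preserves the many-to-one mapping between full-length homologs and tiles that is needed later to project activities back onto each protein.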
We reverse-translated tile sequences using S. cerevisiae-preferred codons. We added primer sequences for PCR amplification and HiFi cloning (‘ArrayDNA’ column in S6 Table). We also added four Stop codons in three reading frames to ensure translational termination, even if there were one or two bp deletions, the most common synthesis errors. We used synonymous mutations to remove instances where the same base occurred four or more times in a row to reduce DNA synthesis errors. The resulting oligo pool was ordered from Agilent Technologies. The final oligos were of the form (see primer sequences in S7 Table):
Plasmid Library construction
The oligos were resuspended in 100 uL of water, yielding a 1 pM solution. The oligos were amplified with eight reactions of Q5 polymerase (NEB) using 1 ul of template, five cycles, Tm = 72C, and the LC3.P1 and LC3.P2 primers. The eight reactions were combined into a single PCR clean-up column (NEB Monarch).
The backbone was prepared by digesting 16 ug of pMVS219 with NheI-HF, PacI, and AscI in eight reactions. We digested for seventeen hours at 37C and heat-inactivated for one hour at 80C. The desired 7025 bp fragment was run on a 0.8% gel, visualized with SYBR Safe (Invitrogen), and gel purified (NEB Monarch Kit). Note pMVS219 and pMVS142 have the same sequence, but the pMVS142 stock developed heteroplasmy, so we repurified it as pMVS219 and submitted the corrected stock to AddGene. Both pMVS219 and pMVS142 correspond to AddGene #99049.
We used NEB HiFi 2x mastermix to perform Gibson Isothermal Assembly to create the plasmid library. The 4x reaction volume had 328 ng of backbone and excess molar insert. We incubated at 50C for 15 min and assembled a backbone-only control in parallel. The assemblies were electroporated three times each into ElectroMax 10b E. coli (Invitrogen 18290–015) following the manufacturer’s protocol. A dilution series was plated and the bulk of the cells grown overnight in 140 mL LB + Amp. These cultures overgrew, so they were spun down and frozen. The cultures were regrown with 105 mL LB + Amp and a MaxiPrep was performed (Zymo). An estimated 4.2 million colonies were collected, covering the library 200-fold.
To assess the quality of the plasmid library, we prepared an amplicon sequencing library (see below). Three independent amplicon libraries were prepared, and sequences present in all three were considered to be present in the plasmid pool with high confidence. We used grep for the flanking NheI and AscI sites to pull out the designed fragments. Only perfect matches were used in this analysis. 20717 of 20731 designed sequences were detected (99.9%). The vast majority of sequence abundances were within fourfold of each other, indicating minimal skew in library member abundance.
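The extraction of designed fragments by their flanking sites can be illustrated in a few lines. This sketch assumes the insert sits directly between the NheI (GCTAGC) and AscI (GGCGCGCC) recognition sequences in the read and that designed sequences are stored as exact DNA strings; the actual construct may carry additional bases between the sites and the tile, and the function name and toy reads are hypothetical.

```python
import re
from collections import Counter

NHEI = "GCTAGC"    # NheI recognition site
ASCI = "GGCGCGCC"  # AscI recognition site
INSERT = re.compile(f"{NHEI}([ACGT]+?){ASCI}")


def count_designed(reads, designed):
    """Count reads whose insert between the two sites perfectly matches a design."""
    counts = Counter()
    for read in reads:
        match = INSERT.search(read)
        if match and match.group(1) in designed:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    designed = {"ATGAAA"}  # toy design, far shorter than a real tile
    reads = ["TTGCTAGCATGAAAGGCGCGCCAA", "TTGCTAGCATGAATGGCGCGCCAA"]
    print(count_designed(reads, designed))
```

Intersecting the keys of the counters from three independent amplicon libraries would give the set of designs detected with high confidence.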
Yeast transformation
The plasmid library was integrated into the DHY213 BY superhost strain, MATa his1∆1 leu2∆0 ura3∆0 met15∆0 MKT1(30G) RMEI(INS-308A) TAO3(1493Q), CAT5(91M), MIP(661T) SAL1+ HAP1+, a generous gift from Angela Chu and Joe Horecka. Requests for the parent strain are best directed to them. We integrated our library into the URA3 locus with a three-piece PCR [108]. The upstream homology between URA3 and the ACT1 promoter was created by PCR amplifying pMVS295 (Strader 6161) with the primers YP18 and CP19.P6. The downstream homology between the TEF terminator of KANMX and URA3 was amplified from pMVS196 (Strader 6768) with the primers YP7 and YP19. These template plasmids were a generous gift from Nick Morffy and Lucia Strader. To avoid PCR, the plasmid library was digested with SalI-HF and EcoRI-HF (NEB) overnight but not cleaned up. The homology arms were in 3:1 molar excess. 1.25 ug of total DNA was used (225 ng of upstream homology 626 bp, 225 ng of downstream homology 665 bp, and 800 ng of digested plasmid 4583 bp). Cells were streaked out from the -80C stock on YP+Glycerol. For transformation, cells were grown overnight in YPD, diluted into YPD, and allowed to grow for at least two doublings. We performed a Lithium Acetate transformation for 30 minutes at 30 C and 60 minutes at 42 C, followed by a two-hour recovery in synthetic dextrose minimal media without a nitrogen source, as recommended by Sasha Levy. We integrated plasmids in seven transformation batches, which were plated overnight on YPD and replica-plated onto YPD + G418 (200 ug/ml). Plates were stored at 4 C and then scraped with water, pooled, frozen into glycerol stocks, and mated. We collected an estimated 100,000 colonies, approximately fivefold coverage of the tiles. For 6/7 pools, we sequenced tiles before and after mating, finding that 67–97% of tiles were detected both before and after mating, indicating that the mating sometimes reduced library complexity.
Yeast mating
We mated each of the seven transformations independently to MY435 (FY5, MATalpha, YBR032w::P3 GFP ClonNat-R (pMVS102)). Downstream sequencing revealed that transformations with modest numbers of colonies (e.g., 4500) experienced no significant loss of complexity during mating, but transformations with more colonies (e.g., > 20,000) experienced loss of complexity, up to 40% in one case. Subsequent matings were performed in larger volumes to avoid creating a bottleneck. Mated diploids were selected in liquid culture with YPD with 200 ug/ml G418 and 100 ug/ml ClonNat. After overnight selection, matings were concentrated and frozen as glycerol stocks.
Cell sorting
The day before sorting, a glycerol stock of mated cells (~100 ul) was thawed into 5 mL SC+Glucose with 200 ug/ml G418 and 100 ug/ml ClonNat and grown overnight, shaking at 30 C. In the morning, the culture was diluted 1:5 into SC+Glucose with G418, ClonNat, and 10 uM β-estradiol (Sigma). The culture was grown for 3.5–4 hours before sorting.
Cells were sorted on a BD Aria Fusion equipped with four lasers (488 blue, 405 violet, 561 yellow-green, and 640 red) and eleven fluorescent detectors. We used two physical characteristics gates, first to enrich for live cells (FSC vs SSC) and second to enrich for single cells (FSC-Height vs FSC-Area). Cells were sorted by the GFP signal, the mCherry signal, or the ratio of GFP:mCherry signal. The ratio is a synthetic parameter that is very easy to saturate on the eighteen-bit scale available in the BD software. Great care was taken to change PMT voltage and the ratio scaling factor (5–10% depending on the day) to make the values of the top and bottom bins as different as possible. The dynamic range of our final estimate for activation domain activity is set by the values of the top and bottom bins. The maximum possible activation domain strength occurs when 100% of a tile's reads fall in the top bin, in which case the tile takes the value of the top bin. The minimum possible activation domain strength occurs when 100% of a tile's reads fall in the bottom bin, in which case the tile takes the value of the bottom bin.
We performed our sorting experiment twice. In the first run, we pooled all of the transformants into one sample and sorted it by GFP/mCherry ratio, GFP-only, and mCherry-only. We sorted one million cells per bin. For the ratio sort, we split the ratio histogram into eight approximately equal bins [15].
In the second round of sorting, we split the transformants into two pools, labeled A and B, which are true biological replicates of independent transformants. We sorted each pool by GFP/mCherry ratio, GFP-only, and mCherry-only. We used the comparison of the A and B pool measurements to assess measurement reproducibility, which we had not previously measured for biological replicates. On this day, we sorted 250000 cells per bin.
Sorted cells were grown overnight in SC-glucose. The next morning, gDNA was extracted with the Zymo YeaSTAR D2002 kit, using Protocol I with chloroform according to the manufacturer instructions. We have previously shown that growing cells overnight makes the gDNA extraction easier but does not change the computed activation domain activity [15].
Amplicon sequencing library preparation
Amplicon sequencing libraries were prepared from genomic DNA in three steps. First, the general vicinity of the tile sequence was amplified with CP21.P14 and CP17.P12 using 100 ng of gDNA as template and yielding a 604 bp product that was cleaned up (Monarch PCR cleanup). In the second PCR, we added 1–4 bp of phasing on each end and the Illumina sequencing primer in 7–10 cycles with SL5.F[1–4] and SL5.R[1–3]. These seven phased primers were pooled and added to all samples. Four nanograms of the first PCR were used as template for the second PCR. Two microliters of the second PCR served as template for the third PCR. The third PCR added unique Index1 and Index2 sequences to each sample with an additional 7–10 cycles. These final products were cleaned up with PCR columns or magnetic beads (MacroLab at UC Berkeley) and submitted for sequencing. We performed 2x150 bp paired-end sequencing in a shared NovaSeq lane at the Washington University School of Medicine Genome Technology Access Center (GTAC). GTAC provided demultiplexed fastq files. We sequenced additional samples in shared NovaSeq lanes with MedGenome.
Sequencing analysis
After demultiplexing samples and pairing reads with PEAR, we kept only the reads where the tile DNA sequence contained a perfect match to a designed tile. For each eight-bin sort, we performed two normalizations. We first normalized the reads by the total number of reads in each bin. Then, we normalized across the eight bins to calculate a relative abundance. We then converted relative abundances to an activity score for each tile by taking the dot product of the relative abundance with the median fluorescence value of each bin (S8 Table). This weighted average is the measured activation domain activity. Tiles with fewer than forty-one reads were not included in the final dataset. These analysis scripts are available at github.com/staller-lab/labtools/tree/main/src/labtools/adtools. This preprocessing computed an activity for each tile in each experiment. Activity is uncorrelated with total reads (S3E Fig). The pooled ratio sort (BSY2) had 115.6 M reads. The Replicate A ratio sort had 934.5 M reads, and the Replicate B ratio sort had 697 M reads. Replicate A GFP had 33.1 M reads, Replicate B GFP had 31.6 M reads, Replicate A mCherry had 32.8 M reads, and Replicate B mCherry had 30.3 M reads.
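The two normalizations and the weighted average can be written out as a short function. This is a sketch rather than the code in the linked repository: it assumes read counts are stored per tile as a list with one entry per bin, and that the minimum-read filter applies to a tile's total reads across bins. The function and argument names are hypothetical, and the example in the tests uses two bins instead of eight for brevity.

```python
def activity_scores(read_counts: dict, bin_values: list, min_reads: int = 41) -> dict:
    """Weighted-average activity per tile from sorted-bin read counts.

    read_counts: tile -> list of reads per bin
    bin_values:  median fluorescence value of each bin, in the same order
    """
    n_bins = len(bin_values)
    # Total reads in each bin, used for the first normalization.
    bin_totals = [sum(counts[b] for counts in read_counts.values()) for b in range(n_bins)]
    scores = {}
    for tile, counts in read_counts.items():
        if sum(counts) < min_reads:
            continue  # too few reads for a reliable estimate
        per_bin = [c / t if t else 0.0 for c, t in zip(counts, bin_totals)]
        total = sum(per_bin)
        if total == 0:
            continue
        # Second normalization: relative abundance across bins sums to 1.
        rel = [x / total for x in per_bin]
        # Dot product with the bin values gives the activity.
        scores[tile] = sum(r * v for r, v in zip(rel, bin_values))
    return scores
```

Because the relative abundances sum to one, every score lies between the lowest and highest bin values, which is why the dynamic range is set by the top and bottom bins.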
Measurement reproducibility
We used the two measurements of independent transformants to assess the reproducibility of our measurements of true biological replicates (R = 0.870; S3A–S3D Fig). Reproducibility is higher (R = 0.919) for highly abundant tiles (>1000 reads).
We combined data from the two biological replicates. For tiles present in both populations (n = 11797), we averaged the two measurements and used the standard deviation as the error bar. For tiles present in only one population, we used that measurement and did not report error bars. These combined data agree very well with the pooled sort (R = 0.919; S3C Fig).
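A minimal sketch of this combination rule is shown below. The function name `combine_replicates` and the dictionary inputs are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics

def combine_replicates(rep_a, rep_b):
    """rep_a, rep_b: dicts mapping tile -> activity in each biological replicate."""
    combined = {}
    for tile in set(rep_a) | set(rep_b):
        values = [rep[tile] for rep in (rep_a, rep_b) if tile in rep]
        mean = statistics.fmean(values)
        # Assumption: sample standard deviation; no error bar for single measurements
        error = statistics.stdev(values) if len(values) == 2 else None
        combined[tile] = (mean, error)
    return combined
```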
We assessed whether the mating introduced biological variability. We remated seven pools of the integrated library to the same reporter line, selected for diploids, pooled them, and resorted cells. This time we sorted 500,000 cells per bin. This measurement agreed with the initial experiments (R = 0.920; S3D Fig).
Inferred activity was not correlated with read count, which, as previously shown, is another indicator of high-quality data (S3E Fig).
We compared activity measurements to our previously published results [15]. Previously, we used 44 AA regions, and here we used 40 AA tiles. We paired each of our 40 AA tiles with any 44 AA region that fully contained it. The extra 4 AA can modify activity, so the correspondence of these measurements will not be perfect. The observed Pearson correlation of 0.786 and Spearman correlation of 0.731 indicate the new data are of high quality and consistent with previous measurements (S3F Fig).
The technical reproducibility of our measurements at UC Berkeley is lower than the published reproducibility from sorting at Washington University in St. Louis [15]. In both cases, we sorted the same cell population twice and created independent sequencing libraries. In 2018, the technical reproducibility was high, Pearson R = 0.988. The 2018 work had a smaller library (<5000 unique sequences) and sorted more cells (1–2 million cells per bin). Sorting more cells per library member increases the technical reproducibility of the measurement. The sorter operator in the 2018 work was more experienced than the sorter operator in this work (MVS), and the machine was maintained to a higher standard of operation, so the sorted populations were purer.
The eight-bin ratio activity measurements are primarily driven by the GFP signal. Activity (ratio) is largely separable from abundance assessed by the mCherry sort (S3G–S3I Fig) and well-correlated with the GFP sort (S3J–S3L Fig).
Determining a threshold for active tiles
The full distribution of tile activities has a peak at low activity, which, based on control sequences, is clearly inactive, with a heavy right shoulder and a heavy right tail (Fig 2C). After trying many thresholds, we ultimately chose the top 20% (94,031) as a threshold for high activity.
Initially, we selected a threshold based on a Gaussian distribution fit to our inactive sequences (S5 Fig). Specifically, we hypothesized that tile density is highest around inactive tiles and thus refer to all tiles to the left of the resulting histogram’s peak as inactive tiles. We fit a one-sided Gaussian to these inactive tiles and call the two-sided extension of this Gaussian the inactive tile distribution. Treating this Gaussian inactive tile distribution as our null hypothesis, we calculated p-values for each tile (not including tiles earlier used as inactive). We then corrected for multiple comparisons using FDR [109] corrections. The 1% FDR threshold was 33821 (60.6% of tiles active). We used 60.6% as our lower bound on the threshold for active tiles. Our upper bound on the active tiles threshold is the activity of the CAD (137983, 7.3% of tiles active), which is known to be a weak activation domain. At both this lower and upper bound, almost every homolog sequence has at least one active tile. The only exception is Canca1_23981 from Tortispora caseinolytica, which does not contain a tile with higher activity than the CAD and which we believe is a misannotation (see S1 File). For our main analyses, we chose a threshold of 20%, as it lies between our lower and upper bounds. We emphasize that the choice of this threshold does not significantly influence any of our results.
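The half-Gaussian null can be illustrated with a short sketch. It assumes the histogram peak is already known and passed in as `peak`, estimates sigma by maximum likelihood from the tiles at or left of the peak, and reports a one-sided upper-tail p-value; the function name, the estimator, and the one-sided test are all assumptions, not a description of the published scripts.

```python
import math

def half_gaussian_pvalues(activities, peak):
    # Tiles at or left of the histogram peak define the inactive population
    left = [a for a in activities if a <= peak]
    # Maximum-likelihood sigma for a Gaussian centered on the peak and fit to
    # the left half only (equivalent to reflecting that half about the peak)
    sigma = math.sqrt(sum((a - peak) ** 2 for a in left) / len(left))
    # One-sided upper-tail p-value for every tile right of the peak
    pvalues = [
        (a, 0.5 * math.erfc((a - peak) / (sigma * math.sqrt(2))))
        for a in activities
        if a > peak
    ]
    return sigma, pvalues
```

The resulting p-values would then be passed to a multiple-testing correction to obtain the FDR threshold.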
Protein sequence parameters
We computed protein sequence parameters (net charge, local net charge, Kyte-Doolittle hydrophobicity, Wimley-White hydrophobicity, Kappa [110]) with localCIDER [111]. The OmegaWFYL_DE mixture parameter computes the mixture statistic between W,F,Y,L residues and D,E residues using the seq.get_kappa_X(['D','E'],['W','F','Y','L']) function in localCIDER [62]. We predicted intrinsic disorder with MetaPredict2 [112]. We counted motifs with regular expressions in Python with the “re” package.
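Motif counting with the “re” package might look like the sketch below. The patterns shown are illustrative translations of the motif names (with "x" written as "."), and counting overlapping matches with a lookahead is an assumption; the exact expressions used in the analysis are not given in the text.

```python
import re

# Illustrative patterns: "x" in a motif name is any residue, "." in a regex
MOTIFS = {
    "WxxLF": re.compile(r"W..LF"),
    "FF": re.compile(r"FF"),
}

def count_motif(sequence, pattern):
    # A zero-width lookahead counts overlapping matches (e.g. "FFF" holds two FF)
    return len(re.findall(f"(?={pattern.pattern})", sequence))
```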
When we used 500 homologs, the MAFFT algorithm aligned the WxxLF motif for all but three homologs. For three homologs, in the Full_length_homolog_dataframe, we corrected the “WxxLF motif location” parameter using the coordinates from the MSA. These species are the only ones outside the Ascomycota that have the motif.
To predict helical propensity of homolog sequences, we used the Sparrow package in Python [58] [https://github.com/idptools/sparrow]. A region was called helical if it contained five adjacent residues with over 50% chance of being helical. A large proportion of sequences have no residues with a > 50% probability of being helical in this region. We consider this predictor to capture the propensity to form a helix in some contexts. To count proline residues in the region homologous to the known helix, we used the 5 AA upstream and 5 AA downstream of the WxxLF motif.
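The rule for calling a region helical can be sketched independently of the predictor. The function below takes a list of per-residue helix probabilities (however they were obtained) and checks for five adjacent residues above 50%; the function and argument names are hypothetical.

```python
def has_helical_run(helix_prob, threshold=0.5, min_run=5):
    """Return True if min_run adjacent residues exceed the threshold."""
    run = 0
    for p in helix_prob:
        # Extend the current run or reset it at the first residue below threshold
        run = run + 1 if p > threshold else 0
        if run >= min_run:
            return True
    return False
```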
Data were analyzed in Python with the matplotlib and seaborn packages.
Imputing activity in the full-length homologs
We used the tile data to impute the activity of each position in each of the full-length homologs. The 19099 recovered tiles mapped to 68577 locations on the homologs (each tile matched to 3.6 homologs on average). We used a second-order Loess smoothing (twenty nearest points with the loess.loess_1d.loess_1d() function) across tiles to impute the activities of all positions in the 502 unique homologs. This quadratic smoothing can cause artifacts on the extreme ends of the protein, such as predicting negative activity. To remove this artifact, we constrained the imputed activity to be no more than the maximum measured and no less than the minimum measured in that homolog.
To validate the Loess smoothing, we averaged together all activities for all tiles that overlapped a position, weighting all tiles equally. These averages were more jagged because of the stepwise nature of the tiles. This simple average also created artifacts at the ends of the protein where only one tile is present. The Loess and average smoothing methods agreed well (97% had Pearson R > 0.80).
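The simple per-position average and the clipping step applied to the imputed values can be sketched as follows. Both function names are hypothetical, tile starts are assumed to be 0-based, and the Loess fit itself is not reproduced here.

```python
def average_overlapping_tiles(protein_length, tiles, tile_length=40):
    """tiles: list of (start, activity); returns the mean activity per position."""
    sums = [0.0] * protein_length
    counts = [0] * protein_length
    for start, activity in tiles:
        for pos in range(start, min(start + tile_length, protein_length)):
            sums[pos] += activity
            counts[pos] += 1
    # Positions covered by no tile are reported as None
    return [s / c if c else None for s, c in zip(sums, counts)]

def clip_to_measured(imputed, measured):
    """Constrain imputed values to the range measured in that homolog."""
    lo, hi = min(measured), max(measured)
    return [min(max(v, lo), hi) for v in imputed]
```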
We used the imputed activities to create the heatmaps to visualize activity across the homologs. We tried many variations of these heatmaps but ultimately found that aligning the sequences on the start of the DBD or on the WxxLF motif was most informative. In the main text, we removed the six longest sequences to ease visualization.
To estimate the activity at the WxxLF motif, we used two methods. First, we used the inferred activity at the W. Second, we used the integral of the imputed activity from -10 to +10 around the W of the WxxLF motif. Both the point measurement and integral give very similar results. In Fig 5, we used the integral. When the integral was below our activity threshold, we called sequences inactive in this region.
For motif enrichment, we performed Welch’s t-test, which does not assume equal variances: stats.ttest_ind(Sequences_WITH_Motif, Sequences_WITHOUT_Motif, equal_var = False).
To count activation domains on each TF, we combined active overlapping tiles, taking the union. With this method, we found 500 activation domains with the WxxLF motif and 415 activation domains without the WxxLF motif. This method requires more than forty residues between activation domains before they are called as two separate domains. Calling activation domains from the imputed activity map gives different results because some very close double peaks are split. Using this smoothed data, we find there are 332 activation domains with the WxxLF motif and 783 activation domains without the WxxLF motif.
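Taking the union of overlapping active tiles amounts to merging intervals. The sketch below is one way to do this; the function name is hypothetical, intervals are half-open, and treating tiles that merely abut (share no residue) as separate domains is an assumption.

```python
def merge_active_tiles(tile_starts, tile_length=40):
    """Return (start, end) half-open intervals for unions of overlapping tiles."""
    domains = []
    for start in sorted(tile_starts):
        end = start + tile_length
        if domains and start < domains[-1][1]:
            # Overlaps the current domain: extend it
            domains[-1][1] = max(domains[-1][1], end)
        else:
            # No overlap: start a new activation domain
            domains.append([start, end])
    return [tuple(d) for d in domains]
```

The number of activation domains on a TF is then the length of the returned list.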
ANOVA
We used ordinary least squares regression (OLS) to create a baseline model for how composition controls activation domain function. We used ANOVA, OLS, and adjusted R-squared to compare models. See the Composition_ANOVA jupyter notebook for the full analysis. Briefly, we used the ols(formula, ANOVA_DF).fit() function from the Python statsmodels package to fit the model, find coefficients, and compute adjusted R-squared values. We used the anova_lm(model, typ = 2) function to find the sum of squares explained by each parameter. We used a Bonferroni multiple hypothesis correction to remove non-significant parameters and refit the model. In most cases, one iteration was sufficient to get a model where all parameters were significant. For the dipeptides, we used two interaction terms. All ANOVA parameters are in S1 Table.
We predicted de novo motifs using the DREAM suite and then repeated the OLS ANOVA analysis using the motifs. We performed de novo motif searching on multiple slices of the data, but highly active (n = 3524) vs. inactive (n = 15575) were the most interpretable and gave the clearest signal in the ANOVA analysis. First, we ran the package STREME from the MEME suite to discover motifs that are enriched in a list of sequences relative to a user-provided control list.
For the OLS on de novo motifs, we used the motif counts provided by the DREAM motif prediction software (S9 Table). For simplicity, in the parameter table, we refer to each motif as a string, but we used the PWM for finding motifs in each sequence with FIMO.
Machine learning
We predicted activities on full-length homologs using publicly available models, TADA, ADpred, and PADDLE [16, 19, 21, 22]. All models were run on the SAVIO high performance computing cluster at UC Berkeley. TADA uses 40 AA windows, ADpred 30 AA windows, and PADDLE 53 AA windows. For each TF, we tiled at 1 AA increments, spanning the full proteins (e.g., 1–40, 2–41, etc.). For full-length TF analysis, we correlated the inferred activity at each position (Loess smoothing) with the predictions at each position. The smoothed data average out some measurement noise, so the performance of all models improves on smoothed data. For individual tile analysis, we used the center-aligned score. We also tried maximum scores, average scores, and other variations, but chose center-aligned. ROC and PRC analyses were performed with the sklearn Python package.
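The 1 AA tiling and one possible reading of a center-aligned score are sketched below. The predictors themselves are not called; `scores` stands in for whatever a model returns per window. Assigning each window score to the residue at index `window // 2` is an assumption about how "center" is defined for even window lengths, and both function names are hypothetical.

```python
def sliding_windows(sequence, window):
    """All windows of the given length at 1 AA increments, with 0-based starts."""
    return [(i, sequence[i:i + window]) for i in range(len(sequence) - window + 1)]

def center_aligned(scores, window, length):
    """Assign each window score to the residue at the window center."""
    per_position = [None] * length  # ends not centered on any window stay None
    for start, score in enumerate(scores):
        per_position[start + window // 2] = score
    return per_position
```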
To predict the impact of mutating F residues in the central activation domains, we tiled the 138 unique 70 AA central regions into 40 AA tiles spaced every 1 AA. For each tile, we computationally mutated each F individually, all pairs, all triplets, and all sets of four or more. For each mutant, we predicted activity. The mutants are predicted to have less activity. For each mutant, we also computed the change in activity. Finally, we grouped the changes in activity based on the conservation of each F residue.
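Enumerating every combination of F residues in a tile can be sketched with itertools. The replacement residue is not stated in the text, so substituting alanine here is purely an assumption, as is the function name.

```python
from itertools import combinations

def f_mutants(tile, replacement="A"):
    """Map each non-empty subset of F positions to the mutated tile sequence."""
    f_positions = [i for i, aa in enumerate(tile) if aa == "F"]
    mutants = {}
    # Singles, pairs, triplets, and so on up to all F residues at once
    for k in range(1, len(f_positions) + 1):
        for subset in combinations(f_positions, k):
            residues = list(tile)
            for pos in subset:
                residues[pos] = replacement  # assumed substitution
            mutants[subset] = "".join(residues)
    return mutants
```

A tile with n F residues yields 2^n - 1 mutants, each of which would then be scored by the predictor.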
Conservation and phylogenetic analyses
To define the DBD of the protein, InterPro [113] was run on the full-length homologs. The SUPERFAMILY coordinates were used to define the DBD. These coordinates were used in both the conservation analyses and the selection analyses.
For the conservation analyses, we defined the central region as 15 AA after the WxxLF motif and (DBD length - 15) AA before the WxxLF motif to generate a length-matched region. We minimized the region after the WxxLF because activity generally peaks slightly before the motif. We defined the upstream DBD region as the region of size (DBD length) immediately upstream of the DBD annotation. The IDR was all the sequence except the DBD. Sequences were aligned using the BLOSUM45 substitution matrix. Percent identity calculation included gaps.
For the selection analyses, we compared the IDR to the DBD (as defined using the regions above) in PAML [65]. We performed the selection analyses separately for each region, using only a subset of the MSA, and for the full alignment. We used yn00 from PAML to calculate pairwise dN/dS. We reported results from the LPB93 [39] method, but all methods found similar trends. We used codeml from PAML to calculate dN/dS (omega) using a sitewise model with the same omega for every site and branch (Model 0). We used pre-computed alignments and trees from the Yeast Genome Order Browser in our calculations. We assumed equal codon frequencies. We estimated kappa and fixed the gamma shape parameter at zero (constant rate). Pairwise dN/dS results are available as S10 Table.
To test for positive selection, we used branch-site models (model = 2 and NSsites = 2 in the codeml control file). For each branch, we tested whether any alignment sites were experiencing positive selection on that branch. We estimated kappa and fixed the gamma shape parameter at zero. We compared a model where omega was estimated and allowed to be greater than one to a model where omega was fixed at one. We performed a likelihood ratio test between the two models. We corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure [109]. dN/dS control files are available on GitHub.
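The likelihood ratio test and the Benjamini-Hochberg adjustment can be sketched in a few lines. The reference distribution is assumed here to be chi-square with one degree of freedom (the text does not state which was used), and the function names and log-likelihood inputs are hypothetical.

```python
import math

def lrt_pvalue(lnl_null, lnl_alt):
    """P-value for 2 * (lnL_alt - lnL_null) against chi-square, 1 df (assumed)."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2.0))

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each branch contributes one p-value from `lrt_pvalue`, and the full list is adjusted together.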
The gene tree was estimated using IQTree [74] using all the genes identified in the computational screen. Ancestral reconstruction was also performed with IQTree on the smaller set of sequences of interest. The species tree was obtained from MycoCosm. The Y1000+ species tree was used to add species not present in MycoCosm to the tree. Because of this, branch lengths are not meaningful. All tree visualizations were created using ETE Toolkit [114].
We simulated codon sequence evolution using the evolver program from the PAML package [65]. We used a simplified neutral version of the Goldman and Yang model [66] where the substitution rate between all codons that differed by one nucleotide was equal (all codons that differed by more than one nucleotide have a substitution rate of zero) and non-synonymous and synonymous mutations were equally likely (omega = 1). Branch lengths between Gcn4 sequences of interest were extracted from the gene tree. The transition/transversion ratio was equal to five.
Calculating probability of FF motif
We wanted to calculate the probability of FF in a sequence. We made the simplifying assumption that all positions were independent (i.e., each amino acid was randomly drawn from some background distribution). We used the average frequency of each amino acid in the central activation domain regions. We used the average length of the central activation domain regions.
We want to calculate P(FF | length, freq F), where length is the length of the sequence and freq F is the frequency of F in the sequence. P(FF | length, freq F) = 1 - P(no FF | length, freq F), which is more tractable to calculate. We use a recursive approach (implemented via dynamic programming) to calculate P(no FF | length, freq F). Specifically, we add the probabilities of every sequence that does not contain an FF. To calculate these probabilities, we work through the sequence recursively, considering two cases: 1) the sequence ends with an F and 2) the sequence does not end in an F. For case 1) we must make sure that an F does not precede the current F and that the rest of the sequence contains no FFs, i.e., (freq F) * (freq not F) * (prob that seq[:l-2] contains no FFs). For case 2) we only need to make sure that the rest of the sequence contains no FFs, i.e., (freq not F) * (prob that seq[:l-1] contains no FFs).
The dynamic programming algorithm is as follows:
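import numpy as np

# Inputs: F_prob is the background frequency of F; l is the sequence length (l >= 2)
not_F_prob = 1 - F_prob
# prob_ls[i] holds P(no FF) for a sequence of length i + 1
prob_ls = np.full(l, -1.0)
# Base case: length 1, an FF cannot occur
prob_ls[0] = 1
# Base case: length 2, either ends in F preceded by a non-F, or ends in a non-F
prob_ls[1] = F_prob * not_F_prob + not_F_prob * prob_ls[0]
# Two cases: either the current aa is an F (preceded by a non-F) or it is not an F
for i in range(2, l):
    prob_ls[i] = F_prob * not_F_prob * prob_ls[i - 2] + not_F_prob * prob_ls[i - 1]
# Answer: P(no FF); P(FF) = 1 - prob_ls[-1]
prob_ls[-1]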
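As a sanity check on the recurrence, the sketch below wraps it in a function and compares it against brute-force enumeration over all sequences of F and non-F (written "X") for short lengths. The function names are hypothetical and the brute-force check is an addition for illustration, not part of the original analysis.

```python
from itertools import product

def prob_no_ff(length, f_prob):
    """P(no FF) for independent residues, using the two-case recurrence."""
    not_f = 1.0 - f_prob
    prev2, prev1 = 1.0, 1.0  # P(no FF) for lengths 0 and 1
    for _ in range(2, length + 1):
        prev2, prev1 = prev1, f_prob * not_f * prev2 + not_f * prev1
    return prev1

def prob_no_ff_brute(length, f_prob):
    """Sum the probability of every F/non-F sequence that lacks FF."""
    total = 0.0
    for seq in product("FX", repeat=length):
        if "FF" not in "".join(seq):
            n_f = seq.count("F")
            total += f_prob ** n_f * (1.0 - f_prob) ** (length - n_f)
    return total
```

The probability of at least one FF is then `1 - prob_no_ff(length, f_prob)`.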
Datafiles
All the raw sequencing data has been deposited at NIH SRA Accession #PRJNA1186961: http://www.ncbi.nlm.nih.gov/bioproject/1186961
All the analysis scripts are deposited on github and Zenodo:
https://github.com/staller-lab/Gcn4-evolution
https://github.com/staller-lab/labtools/tree/main/src/labtools/adtools
All the processed data are attached in supplemental tables (S6, S11, and S12 Tables).
Processed sequencing read counts are in S13 Table.
The ‘masterDF’ dataframe contains each designed tile (S6 Table). Tiles that were not measured have activity recorded as nan or 0. The ‘orthorlogDF’ dataframe contains all tiles associated with each original full-length homolog (S11 Table). As a result, tiles occur multiple times because they map to multiple homologs. The ‘NativeLocation’ is the position of the tile relative to the first amino acid. The ‘NormLocation’ is the position of the tile relative to the WxxLF motif. Finally, the ‘FullOrthoDF’ dataframe contains one entry for each full-length homolog, and each column contains an array with values for each position (S12 Table), such as imputed activity at each position and local charge from localCIDER. The location of the bZIP DNA-binding domain was identified with the InterPro signature (IPR004827, S14 Table).
Supporting information
S1 Fig. Overview of Gcn4 homolog selection.
A) 207 diverse fungal proteomes were selected to represent the diversity of the MycoCosm database. The selected proteomes came from the subdivisions indicated with circles. B) We scanned the proteomes for Gcn4 homologs using the DNA binding domain and the WxxLF motif. For the DNA binding domain, we used the IPR004827 profile HMM from InterPro. The WxxLF motif has been shown to be conserved [26, 45], and we used the regular expression Wx[SPA]LF. This computational screen yielded 1188 gene models from 124 genomes. These 1188 gene models combine to yield 502 unique proteins (S1 Table). Of these, > 99% were reciprocal BLAST best hits with S. cerevisiae Gcn4. C) Each of the 502 unique homologs we identified was tiled into 40 AA overlapping fragments. 19099 designed fragments were detected after yeast transformation, and 18947 passed abundance thresholds to be included in the analysis. D) Histogram of homologs identified in each genome. Genomes contained 1–32 gene models and 1–11 unique protein sequences (S2 Fig). E) Histogram of unique homologs identified in each genome. Multiple gene models or splice forms can yield the same protein sequence. F) The distribution of homolog lengths varies considerably. G) Alignment of 500 orthologs (longest two sequences excluded). DBD and central activation domain regions are labeled.
https://doi.org/10.1371/journal.pgen.1012069.s001
(TIF)
S2 Fig. Among the published motifs, only the WxxLF motif is conserved.
A) The sequence logo from the 4th iteration of a search for Gcn4 homologs in fungal genomes with HMMER. This independent analysis confirmed the WxxLF motif is more conserved than the FF and MFxYxxL motifs. B) The number of motifs present in our experimental set of Gcn4 homologs and the Y1000+ set, which was published after our experiments had been completed. The majority of sequences do not contain the published motifs beyond the WxxLF motif. The WxxLF motif conservation in our set is expected because we forced all homologs to contain this motif, but it is also highly conserved in the Y1000+ set.
https://doi.org/10.1371/journal.pgen.1012069.s002
(TIF)
S3 Fig. Measurement quality and reproducibility.
A) Measurement reproducibility when we integrated the plasmid library into yeast in Pool A and Pool B. These are independent biological replicates. This panel shows all the data. Color indicates the number of points (tiles) in each pixel. B) Filtering out tiles with fewer than 1000 reads improved measurement reproducibility. C) For the main dataset, we combined the two biological replicates. This combined number was well correlated with a separate experiment in which we physically mixed all the yeast together after integration, and then induced and sorted cells. D) Measurements of two independent matings of the same library of integrated synthetic TFs. This captures biological variation from mating and technical variation from two independent sorts performed months apart. E) Activity is uncorrelated with total read count, a proxy for abundance in yeast. F) 40 AA tiles in this work were compared with 44 AA tiles in a previous study [15]. The measurements generally agree. G) The Activity (GFP/mCherry ratio) is largely separable from abundance (mCherry). Pool A data. H) Similar to G with Pool B. I) mCherry measurement reproducibility. Vertical lines arise from tiles that have low abundance in one replicate and are only found in one sorting bin. J) The GFP signal is consistent with the Activity (GFP/mCherry ratio) in Pool A. K) Similar to J with Pool B. L) GFP measurement reproducibility. There are four bins.
https://doi.org/10.1371/journal.pgen.1012069.s003
(TIF)
S4 Fig. Control activation domains and tiles from DBDs.
A) Sequences of the hand-designed mutations. B) Control activation domains from human and human viruses. Aromatic residues make larger contributions to activity than leucine residues. C) Tiles from the C-terminus (End tiles) that overlap the DBD have low activity in the assay. Tiles from the N-terminus (Start tiles) have activity that matches the full distribution. D) A similar analysis using the imputed, smoothed activities at each position. The first 40 and last 40 residues are used for each analysis. End regions that overlap the DBD have little-to-no activity. Start regions from the N-terminus resemble the full distribution.
https://doi.org/10.1371/journal.pgen.1012069.s004
(TIF)
S5 Fig. Setting the activity threshold.
A) To create a threshold for active tiles, we first fit a Gaussian distribution to the left half of the low activity population. B) Reflection of the Gaussian fit over the full activity distribution. C) Cumulative density distribution of activities with the peak of the Gaussian fit in yellow. D) Gray, full distribution of tile activities, red Gaussian fit to inactive population. Yellow, 1% FDR threshold. Green, 1% FWER threshold that we used as a lower bound for active tiles. Purple, the Gcn4 CAD activity for reference. At this threshold (45373), 9210 (48.2%) of tiles were active, a much higher fraction than found in the unbiased tiling of yeast, plant, and human TFs [16, 21, 29], which is expected if many Gcn4 orthologs are activators. When we doubled the threshold (90,746), there were 3978 strong activation domains (20.8%). In the main text, we focused on the 20% most active sequences. Varying the threshold with percentiles did not meaningfully change the fraction of orthologs with an activation domain. Computing the Z score leads to a threshold: Mean + 2 * sigma = 189867, which is just below the 92.125th percentile. Mean + sigma = 127983, which is just below 85.96th percentile.
https://doi.org/10.1371/journal.pgen.1012069.s005
(TIF)
S6 Fig. Activity traces for all Gcn4 homologs.
Although all sequences have activation domains, the shape of the activity traces changes across species, as activation domains move around. Upstream activation domains are gained and lost. Two methods for defining upstream activation domains, combining overlapping tiles or counting peaks in the smoothed traces, lead to different results.
https://doi.org/10.1371/journal.pgen.1012069.s006
(PDF)
S7 Fig. Predicting tile binding to Med15/Gal11 with FINCHES.
FINCHES is a computational method for predicting binding between an IDR and a folded domain. During its initial development, it was benchmarked against activation domain binding to Med15/Gal11 [16,26]. Running FINCHES on the homolog tiles predicts that binding to Med15/Gal11 is a primary molecular mechanism for activation by the Gcn4 homologs. Compared to non-active tiles, active tiles (top 20% threshold) are predicted to have higher attraction to Med15/Gal11. Tiles with or without the WxxLF motif show similar attraction to Gal11. Predictions show similar results when either the Mpipi-GG [58, 60] or CALVADOS2 [115] force field is used as parameters for the FINCHES prediction. P-values are derived from a one-sided Wilcoxon rank-sum test. The most active tiles have high attraction to Med15/Gal11 [26].
https://doi.org/10.1371/journal.pgen.1012069.s007
(TIF)
S8 Fig. A key coactivator of Gcn4, Med15/Gal11 shows high conservation.
A) We collected 653 Med15 sequences from the Y1000+ collection and created an MSA. The AlphaFold structure of S. cerevisiae Med15/Gal11 with the KIX, ABD1, and ABD2 domains colored by their conservation (percent identity) in the MSA. B-E) The KIX, ABD1, ABD2, and ABD3 domains colored by conservation in the MSA. F) The conservation profile of the full protein with domains highlighted (Gaps trimmed). G) The hydrophobic residues of ABD1 that engage with Gcn4 [42]. H-K) Permutation tests for percent identity of each domain compared to the rest of the protein. L) Distributions of conservation in each domain. M) The residues that make contact with Gcn4 in ABD1 are not more significantly conserved than those of the rest of the domain. Using only the YGOB high-quality homologs gave similar results for all panels except for M, where there was more conservation of the contacting residues.
https://doi.org/10.1371/journal.pgen.1012069.s008
(TIF)
S9 Fig. The Gcn4 ortholog dataset efficiently identified key sequence features controlling activity.
We compared the Gcn4 ortholog dataset to other published high-throughput datasets. Many of the signals for sequence features that control activity are more visible in the Gcn4 ortholog dataset than in previous datasets, for example, the difference between D and E or the effect of F. Source data: GCN4: this study (length 40). Morffy [21]: all sequences (length 40). DelRosso [29]: CRTF tiling library and activation domain mutants library (most length 80, some 70). PADDLE [16]: TF tiles, activation domain mutants, and 53 AA mutants (length 53). Erijman [19]: All sequences (length 30). Staller 2018 [15]: “ActivityCompleteMedia Replicate1_Normalized” (length 44).
https://doi.org/10.1371/journal.pgen.1012069.s009
(TIF)
S10 Fig. Slicing through the 2-D landscape.
A) Similar to Fig 4A. Tiles with WFYL = 7 are shown in the green box. Tiles with net charge = -6 are shown in the blue box. B) Box plots for the tiles with WFYL = 7, green box in A. In this set, the difference between D and E is very visible. C) Box plots for all the tiles with net charge = -6. In this set, it is clear that L and M play a supporting role, boosting activity. D) For all tiles with a specific net charge, box plots for how activity is dependent on the number of WFYLM residues. As sequences gain WFYL, activity generally increases, but once there are too many WFYLM residues, activity goes down. We saw this arch-shaped behavior in our rational mutagenesis [26], and Sanborn et al. saw it in synthetic sequences [16]. We believe this is the first example of this arch-shaped behavior in natural sequences. The peak of the arch is very dependent on the net charge of the region: very acidic regions support more aromatic residues before losing activity. These sequences support the acidic exposure model: for a given amount of acidity, adding hydrophobic residues will increase activity until they overwhelm the exposure capacity, drive collapse, and decrease activity. The rarity of tiles with many WFYL residues suggests there is stabilizing selection to maintain an intermediate number of WFYLM residues and not gain too many.
https://doi.org/10.1371/journal.pgen.1012069.s010
(TIF)
S11 Fig. Composition signature of tiles without the WxxLF motif matches the signatures of the full dataset.
Boxplots capture the average relationship between composition and activity for the 17193 tiles without the WxxLF motif. The relationships are similar to those of Figs 3C and S9.
https://doi.org/10.1371/journal.pgen.1012069.s011
(TIF)
S12 Fig. The central acidic activation domain of Gcn4 is functionally conserved.
We used the tile activity data to impute the activity of each position in all the homologs and visualized these activities as a heatmap. 497 homologs are sorted by length and aligned on the WxxLF motif. The five longest homologs are excluded because they distort the plot. Inset: vertically averaging the heatmap indicated the peak is ten residues upstream of the WxxLF motif. Activity is consistently high around the central activation domain, indicating deep functional conservation. Upstream activity is more salt-and-pepper, indicating recurrent gain and loss of upstream activation domains. Red arrow, S. cerevisiae. Black scale bar, 100 AA.
https://doi.org/10.1371/journal.pgen.1012069.s012
(TIF)
S13 Fig. A distant homolog from Catenaria anguillulae has convergently evolved a WxxLF motif.
Activity traces of S. cerevisiae Gcn4 and a homolog from Catenaria anguillulae, from the Blastocladiomycota, aligned on the WxxLF motif (position 0). The region is histidine-rich instead of acidic. The third activation domain aligns to the WxxLF motif in the full MSA with all sequences. Both proteins have the bZIP DBD at the C terminus. We suspect the WxxLF motif convergently evolved in these distant homologs because the context is very different and H-rich. Note that in the MSA in Fig 1C, C. anguillulae is the one sequence where the WxxLF motif does not align with all the others, so we hand-corrected the ‘WxxLF’ location to the position in the MSA. For plotting C. anguillulae in Fig 4, we used this hand-corrected coordinate. Regardless of how we deal with C. anguillulae, this sequence is an outlier.
https://doi.org/10.1371/journal.pgen.1012069.s013
(TIF)
S14 Fig. Heatmap of homolog activity projected onto the species tree.
For our homologs that were part of the Y1000+ project, we visualized heatmaps of the activity on the species tree. Sequences are aligned on the WxxLF motif. For species with multiple homologs in our screen, we visualized the longest sequence. Scale bar from [75]. Most of the species close to S. cerevisiae have one activation domain, but it can move around. Tree visualization was made with the ETE Toolkit [114].
https://doi.org/10.1371/journal.pgen.1012069.s014
(TIF)
S15 Fig. There is minimal turnover of published Gcn4 motifs in the cAD.
A) Turnover of published Gcn4 motifs (FF, FxF, FxxF, MFxYxxL, WxxLF) on the gene tree. Red represents genes that have the WxxLF motifs. Orange is the FF motif. Green is the FxF motif. Blue is the FxxF motif. Purple is the MFxYxxL motif. B) Turnover of all possible F motifs (both published and unpublished) in the central activation domain. Colors are as before, with pink being the FxxxF motif and black being the FxxxxF motif. C) A subset of the gene tree showing the four homologs, highlighted in purple, that have gained the FF motif in alternative locations. S. cerevisiae is highlighted in blue.
https://doi.org/10.1371/journal.pgen.1012069.s015
(TIF)
S16 Fig. MSA of the sixty-nine strongest central activation domains.
We focused on the central activation domain (-50 to +19 around the WxxLF motif), took the 50% most active sequences (69/138), and made an MSA with MAFFT. The WxxLF motif has an acidic context. The upstream F residues come and go, appearing to move around. The region homologous to the alpha helix in S. cerevisiae contains up to four proline residues.
https://doi.org/10.1371/journal.pgen.1012069.s016
(TIF)
S17 Fig. Comparison of amino acid turnover in most active homologs compared to all homologs.
A) Stacked barplot of amino acid conservation in the sixty-nine most active homologs. Similar to Fig 5C, but showing all residues. B) Stacked barplot for all 138 unique homologs. C) Sequence logo of MSA of sequences in A. Reproduction of Fig 5D. D) Sequence logo of MSA of sequences in B.
https://doi.org/10.1371/journal.pgen.1012069.s017
(TIF)
S18 Fig. Counting SP dipeptides in the central activation domain region.
We focused on the 138 unique central activation domain regions (-50 to +19 around WxxLF). We counted SP/TP instances and plotted them on the gene tree. There are paralogs from a few species. This pattern is consistent with gain and loss of SP/TP instances. The ancestral state likely had more than the one instance retained in S. cerevisiae.
https://doi.org/10.1371/journal.pgen.1012069.s018
(TIF)
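The SP/TP counting described for S18 Fig can be illustrated with a minimal Python sketch. The function name `count_sp_tp` and the example sequences are hypothetical and are not taken from the authors' code; the sketch only assumes that an instance is any serine or threonine immediately followed by proline.

```python
import re

def count_sp_tp(seq: str) -> int:
    """Count SP and TP dipeptides (S or T immediately followed by P)."""
    # Matches cannot overlap because P is never the first residue of a match.
    return len(re.findall(r"[ST]P", seq.upper()))

# Hypothetical example regions (not real Gcn4 homolog sequences).
example_regions = {
    "homolog_a": "ASPTPKSPDD",
    "homolog_b": "DDEEFFWDDLF",
}
counts = {name: count_sp_tp(seq) for name, seq in example_regions.items()}
```

The resulting per-sequence counts could then be attached to the leaves of a gene tree for plotting.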
S19 Fig. Point mutations that change tile activity.
https://doi.org/10.1371/journal.pgen.1012069.s019
(TIF)
S20 Fig. Gain before loss of F residues.
A, B) Examples of gain of F residues before loss of ancestral F residues in Gcn4 homologs from the Y1000+ sequences. Ancestral sequences at the shown nodes have been reconstructed using IQ-TREE [116]. Alignment of selected sequences created using MAFFT. C) Larger alignment of the gain/loss events shown in A and B. Gain/loss in A is colored blue and gain/loss in B is colored purple.
https://doi.org/10.1371/journal.pgen.1012069.s020
(TIF)
S21 Fig. Conservation of key residues in other activation domains mirrors Gcn4.
For Pdr1, we took active sequences from Sanborn et al. 2021 [16]. For the other activation domains, we used the Y1000+ collection orthogroups. Five activation domain predictors (TADA [21], ADHunterLite [76], PADDLE [16], ADPred [19] and Kotha Composition Model [24]) were run on the sequences, and regions predicted by three or more predictors were considered activation domains. For homologs with multiple activation domains, only the activation domain in the position most similar to that of S. cerevisiae was considered. All the major patterns for Gcn4 are apparent in these regions. The acidic residues interconvert. Some aromatic residues are highly conserved, but there is also interconversion between F and L in Ino2 and War1. In War1, in the FWxxLF motif (position 54), only the FW is conserved. Pdr1 n = 43, War1 n = 1070, Met4 n = 1148, Ino2 n = 139.
https://doi.org/10.1371/journal.pgen.1012069.s021
(TIF)
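The "three or more of five predictors" rule described for S21 Fig can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' pipeline: it assumes each predictor's output has already been converted to a per-residue True/False call, and the function name `consensus_regions` and the toy calls are hypothetical.

```python
from itertools import groupby

def consensus_regions(calls, min_votes=3):
    """Return half-open (start, end) intervals where at least `min_votes`
    predictors call a residue as part of an activation domain.

    `calls` maps a predictor name to a per-residue list of booleans
    (an assumed input format; real predictors emit scores with their
    own recommended thresholds).
    """
    if len({len(v) for v in calls.values()}) != 1:
        raise ValueError("all predictors must cover the same residues")
    # Number of predictors voting "activation domain" at each residue.
    votes = [sum(column) for column in zip(*calls.values())]
    regions, pos = [], 0
    for is_ad, run in groupby(votes, key=lambda v: v >= min_votes):
        run_length = len(list(run))
        if is_ad:
            regions.append((pos, pos + run_length))
        pos += run_length
    return regions

# Toy calls for a 6-residue sequence (hypothetical values).
toy_calls = {
    "TADA":         [True,  True,  True,  False, False, True],
    "ADHunterLite": [True,  True,  False, False, False, True],
    "PADDLE":       [False, True,  True,  False, False, True],
    "ADPred":       [False, False, True,  False, False, False],
    "Composition":  [False, False, False, False, True,  False],
}
toy_regions = consensus_regions(toy_calls)
```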
S22 Fig. The active tiles and the 138 unique central activation domain sequences are diverse.
A) For all pairs of the 138 unique central activation domain regions (-50 to +19 around the WxxLF motif), we computed the pairwise edit distance (blue). For each sequence, we recorded the distance to its most similar sequence, i.e., the minimum pairwise distance (green). The maximum possible distance is 67 because the sequences are 70 AA long and all share the three fixed residues of the WxxLF motif. B) For all tiles, we calculated the edit distance to all other tiles. We did the same for all active tiles. The maximum edit distance is 40 because the sequences are 40 AA long.
https://doi.org/10.1371/journal.pgen.1012069.s022
(TIF)
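The distance calculation described for S22 Fig can be sketched as follows. The sketch assumes that "edit distance" means Levenshtein distance (substitutions, insertions, and deletions each cost 1); the caption does not state which implementation was used, and the function names and toy sequences here are hypothetical.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def pairwise_and_nearest(seqs):
    """Return all pairwise distances and, per sequence, the distance
    to its most similar other sequence."""
    pairwise = {}
    nearest = [None] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        d = edit_distance(seqs[i], seqs[j])
        pairwise[(i, j)] = d
        for k in (i, j):
            if nearest[k] is None or d < nearest[k]:
                nearest[k] = d
    return pairwise, nearest

# Toy 6-residue "regions" sharing a WxxLF-like motif (hypothetical).
toy_seqs = ["AWDDLF", "AWEELF", "FWDDLF"]
toy_pairwise, toy_nearest = pairwise_and_nearest(toy_seqs)
```

For equal-length sequences that share fixed motif residues, the distance is bounded by the number of non-fixed positions, which is the reasoning behind the stated maximum of 67 for the 70 AA regions.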
S23 Fig. Neural network models for predicting activation domains from amino acid sequence perform well on the Gcn4 homologs.
A-B) Measured activity and predicted activity (4 models and 3 variations of ADHunterLite (ADHunter_v1 [76, 79])) for S. cerevisiae Gcn4. The models approximate the location of the activation domain reasonably well. ADHunterLite variations: ADHunterLite (orange) is trained on a random split of the Gcn4 homolog tiles measured in this work. ADHunterLite_splitTFs (split full-length TFs) (pink) is trained on a different split of the Gcn4 homolog tiles, in which we first split homologs into train, test, and validate sets and then put all the tiles from each homolog into the corresponding set. This split was motivated by an attempt to prevent tiles that overlap by 35 AA from appearing in both the training set and the validation/test sets. If highly similar sequences were in both the training and validation sets, for example, this overlap could inflate apparent performance through overfitting. ADHunterLite_PADI (purple) is trained on the PADI dataset [21] of Arabidopsis TFs screened for activation domains with our yeast activation domain assay. C) For each full-length TF, we correlated smoothed measured activity with the predictions. The models can approximate the general location of activation domains. This comparison underweights errors in estimating activation domain boundaries. TADA and ADHunterLite perform better than the first-generation models. D) Receiver Operating Characteristic (ROC) curves for model performance on individual tiles. For each tile, we used the center-aligned predicted activity because the predictors use windows of different lengths. TADA and ADHunterLite_PADI outperformed the older models. The performance of TADA, which is trained on PADI, and the performance of ADHunterLite_PADI were very similar. The performances of ADHunterLite and ADHunterLite_splitTFs were very similar, suggesting that the different training splits had a minor effect on model performance. E) Precision Recall Curve (PRC) for model performance on individual tiles. TADA and ADHunterLite outperformed the older models. F) Scatter plot for measured activity and predicted activity of each tile using center-aligned data. Vertical red line, 100,000 activity units to guide the eye, slightly higher than the top 20% active threshold. Horizontal red lines, the predictor activity thresholds recommended by the authors of each model. Performance on individual tiles is worse than performance on full-length TFs in C.
https://doi.org/10.1371/journal.pgen.1012069.s023
(TIF)
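The homolog-level split described for ADHunterLite_splitTFs in S23 Fig can be illustrated with a short Python sketch. The split fractions, the seed, and the names `split_by_homolog` and `tile_to_homolog` are assumptions made for illustration; the caption does not report the proportions actually used.

```python
import random

def split_by_homolog(tile_to_homolog, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole homologs to train/validate/test, then place every tile
    in the set of its parent homolog so overlapping tiles never straddle sets.

    `fractions` (train, validate, test) are hypothetical defaults.
    """
    homologs = sorted(set(tile_to_homolog.values()))
    random.Random(seed).shuffle(homologs)
    n_train = round(fractions[0] * len(homologs))
    n_val = round(fractions[1] * len(homologs))
    assignment = {}
    for idx, homolog in enumerate(homologs):
        if idx < n_train:
            assignment[homolog] = "train"
        elif idx < n_train + n_val:
            assignment[homolog] = "validate"
        else:
            assignment[homolog] = "test"
    splits = {"train": [], "validate": [], "test": []}
    for tile, homolog in tile_to_homolog.items():
        splits[assignment[homolog]].append(tile)
    return splits

# Toy data: 10 homologs with 3 tiles each (hypothetical identifiers).
toy_tiles = {f"h{h}_tile{t}": f"h{h}" for h in range(10) for t in range(3)}
toy_splits = split_by_homolog(toy_tiles)
```

By contrast, a random split over tiles would allow two tiles from the same homolog that overlap by 35 AA to land in different sets, which is the leakage the grouped split is meant to avoid.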
S24 Fig. Predicting activity of phenylalanine mutations with neural networks supports evolutionary turnover.
A) Starting with the 138 unique 70 AA regions around the WxxLF motif, we tiled these regions into 40 AA tiles spaced at 1 AA. For each resulting tile, we predicted activity (gray) with TADA. Next, we generated all single phenylalanine to alanine (F > A) mutations and predicted activity (red). We repeated this process for double (orange), triple (yellow), and 4+ (green) F > A mutations. B) We calculated the change in TADA-predicted activity for the mutations in A. Nearly all mutations decreased predicted activity, consistent with earlier analysis. C) For a single representative region, the traces indicate the TADA-predicted activity of the WT (black), all single F mutants (red), double F mutants (orange), triple F mutants (yellow), four F mutants (green) and five F mutants (blue). Mutations decrease predicted activity. D) Same as A but using ADHunterLite to predict activity. E) Same as B but using ADHunterLite. F) Same as C but using ADHunterLite. G) Using only the single F > A mutations, we grouped mutants by the observed conservation of the F residue in the 138 unique central AD sequences. Nearly all mutations decrease TADA-predicted activity (they are below the gray line). The regression line (red) is close to flat, indicating that more conserved F residues generally do not cause larger predicted decreases in activity when mutated. This analysis suggests all the F’s contribute to activity similarly. H) Same as G but using ADHunterLite.
https://doi.org/10.1371/journal.pgen.1012069.s024
(TIF)
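The tiling and F > A mutant enumeration described for S24 Fig can be sketched in Python as follows. The function names are hypothetical, and the activity prediction step is deliberately omitted because the call signatures of TADA and ADHunterLite are not given in this text.

```python
from itertools import combinations

def tile_region(seq, width=40, step=1):
    """Split a region into overlapping tiles of `width` AA spaced by `step` AA."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]

def f_to_a_mutants(tile_seq):
    """Return {number of mutated F residues: list of mutant sequences},
    covering every combination of phenylalanine positions in the tile."""
    f_positions = [i for i, aa in enumerate(tile_seq) if aa == "F"]
    mutants = {}
    for k in range(1, len(f_positions) + 1):
        mutants[k] = []
        for combo in combinations(f_positions, k):
            residues = list(tile_seq)
            for i in combo:
                residues[i] = "A"  # phenylalanine to alanine
            mutants[k].append("".join(residues))
    return mutants

# Toy example (hypothetical sequence with three F residues).
toy_mutants = f_to_a_mutants("FDWFLF")
```

A 70 AA region yields 31 tiles of 40 AA at 1 AA spacing, and a tile with n phenylalanines yields 2^n - 1 mutants in total across all mutation counts.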
S25 Fig. The integral of smoothed activity is conserved.
To estimate the integral of activity for each sequence, we summed the smoothed activities. To construct a null distribution, we randomly sampled activities (Python random package) for each position in each sequence with or without replacement and summed those activities. The shuffled activities have a much higher variance than observed in our sequences.
https://doi.org/10.1371/journal.pgen.1012069.s025
(TIF)
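One way to read the null model described for S25 Fig is sketched below. It assumes that activities are pooled across all positions of all sequences before resampling, since resampling without replacement within a single sequence would leave its sum unchanged; this pooling, the function name `null_integrals`, and the toy values are assumptions and may differ from the authors' procedure.

```python
import random

def observed_integrals(activities):
    """Sum the smoothed per-position activities of each sequence."""
    return [sum(seq) for seq in activities]

def null_integrals(activities, with_replacement=True, seed=0):
    """Draw a random activity for every position from the pooled values
    and return the per-sequence sums of the resampled activities."""
    rng = random.Random(seed)
    pool = [a for seq in activities for a in seq]
    if with_replacement:
        return [sum(rng.choices(pool, k=len(seq))) for seq in activities]
    # Without replacement: shuffle the pool once and deal it back out.
    shuffled = pool[:]
    rng.shuffle(shuffled)
    sums, start = [], 0
    for seq in activities:
        sums.append(sum(shuffled[start:start + len(seq)]))
        start += len(seq)
    return sums

# Toy smoothed activities for three sequences (hypothetical integers).
toy_activities = [[1, 2, 3], [4, 5], [6]]
toy_observed = observed_integrals(toy_activities)
toy_null = null_integrals(toy_activities, with_replacement=False, seed=1)
```

Repeating the resampling many times with different seeds and comparing the variance of the null sums to the variance of the observed sums would give the kind of comparison described in the caption.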
S26 Fig. Analysis of the spacer sequence between the WxxLF motif and the DBD.
Left panels align position on the WxxLF motif. Middle panels align position on the DBD. The spacer is the sequence between these landmarks. Right panels are a stretched metagene plot. Imputed activation domain activity of the spacer is low. The spacer has higher predicted intrinsic disorder than the central activation domain or the DBD (Metapredict2 [112]). Negative charge undulates between the landmarks (localCIDER [111]). The region right after the WxxLF motif is negatively charged, followed by a positively charged region and another net negative region just before the positively charged DBD. Hydrophobicity is consistent throughout the IDR.
https://doi.org/10.1371/journal.pgen.1012069.s026
(TIF)
S27 Fig. Computationally predicted biophysical properties of the spacers.
We used SPARROW to predict the physical dimensions of the spacer regions computationally. A) The radius of gyration summarizes the size of the 3D ensemble. B) The end-to-end distance describes the 3D distance between the N and C termini. C) The scaling exponent is a parameter from polymer physics that describes deviation from a self-avoiding random walk (0.59 in a good solvent). These spacers are close to this theoretical value. D) Asphericity describes the symmetry of the ensemble (0 is a sphere, 1 is rod-like). E) The radius of gyration scales with protein length. F) The end-to-end distance scales with protein length. Overall, the highly consistent length and predicted dimensions support the hypothesis that there is stabilizing selection on the spacer.
https://doi.org/10.1371/journal.pgen.1012069.s027
(TIF)
S3 Table. 500YeastGcn4Alignment MAFFT multiple sequence alignment of 500 homologs.
https://doi.org/10.1371/journal.pgen.1012069.s030
(TXT)
S6 Table. Tile_Activities_Properties_Dataframe (masterDF).
https://doi.org/10.1371/journal.pgen.1012069.s033
(CSV)
S11 Table. Homolog_Tile_dataframe (homolog DF).
https://doi.org/10.1371/journal.pgen.1012069.s038
(CSV)
S12 Table. Full_length_homolog_dataframe (FullLenthOrthoDF).
https://doi.org/10.1371/journal.pgen.1012069.s039
(CSV)
S15 Table. VeryStrongADsWithHighReproducibility.
https://doi.org/10.1371/journal.pgen.1012069.s042
(CSV)
Acknowledgments
We would like to thank Nick Ingolia, Zeba Wunderlich, Rachel Brem, Alex Holehouse, Andrew Murray, Shahar Sukenik, Michael Botchen, and Ashley Wolf for helpful comments on the manuscript. We thank Sumanth Mutte for finding the initial homologs and Alan Moses for pointing out that Cdk kinases are positioned by aromatic residues. We thank Lucia Strader, Nicholas Morffy, Ross Sozzani, Lisa Van den Broeck, Mara Baylis, Hunter Nisonoff, and Jennifer Listgarten for helpful discussions. Nick Morffy and Lucia Strader provided the yeast genome targeting plasmids. Igor Grigoriev identified the deprecated Tortispora caseinolytica gene models. Weijing Tang performed exploratory analyses not included in the final manuscript. The Regents of the University of California filed an invention disclosure based on the findings of this study. The DHY213 BY superhost strain used for library construction was a generous gift from Angela Chu and Joe Horecka, and requests for this strain should be directed to them.
References
- 1. Onuma Y, Takahashi S, Asashima M, Kurata S, Gehring WJ. Conservation of Pax 6 function and upstream activation by Notch signaling in eye development of frogs and flies. Proc Natl Acad Sci U S A. 2002;99(4):2020–5. pmid:11842182
- 2. Lynch VJ, Wagner GP. Revisiting a classic example of transcription factor functional equivalence: Are Eyeless and Pax6 functionally equivalent or divergent? J Exp Zool B Mol Dev Evol. 2011;316B:93–8.
- 3. Halder G, Callaerts P, Gehring WJ. Induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila. Science. 1995;267(5205):1788–92. pmid:7892602
- 4. Chothia C, Finkelstein AV. The classification and origins of protein folding patterns. Annu Rev Biochem. 1990;59:1007–39. pmid:2197975
- 5. Lim WA, Sauer RT. Alternative packing arrangements in the hydrophobic core of lambda repressor. Nature. 1989;339(6219):31–6. pmid:2524006
- 6. Metcalf P, Blum M, Freymann D, Turner M, Wiley DC. Two variant surface glycoproteins of Trypanosoma brucei of different sequence classes have similar 6 A resolution X-ray structures. Nature. 1987;325(6099):84–6. pmid:2432433
- 7. Chin AF, Zheng Y, Hilser VJ. Phylogenetic convergence of phase separation and mitotic function in the disordered protein BuGZ. Protein Sci. 2022;31(4):822–34. pmid:34984754
- 8. Beh LY, Colwell LJ, Francis NJ. A core subunit of Polycomb repressive complex 1 is broadly conserved in function but not primary sequence. Proc Natl Acad Sci U S A. 2012;109(18):E1063-71. pmid:22517748
- 9. Schmidt HB, Barreau A, Rohatgi R. Phase separation-deficient TDP43 remains functional in splicing. Nat Commun. 2019;10(1):4890. pmid:31653829
- 10. Langstein-Skora I, Schmid A, Emenecker RJ, Richardson MOG, Götz MJ, Payer SK, et al. Sequence- and chemical specificity define the functional landscape of intrinsically disordered regions. bioRxiv. 2022;:2022.02.10.480018.
- 11. Mindel V, Brodsky S, Cohen A, Manadre W, Jonas F, Carmi M, et al. Intrinsically disordered regions of the Msn2 transcription factor encode multiple functions using interwoven sequence grammars. Nucleic Acids Res. 2024;52(5):2260–72. pmid:38109289
- 12. Hsu IS, Strome B, Lash E, Robbins N, Cowen LE, Moses AM. A functionally divergent intrinsically disordered region underlying the conservation of stochastic signaling. PLoS Genet. 2021;17(9):e1009629. pmid:34506483
- 13. Sigler PB. Transcriptional activation. Acid blobs and negative noodles. Nature. 1988;333(6170):210–2. pmid:3367995
- 14. Hahn S, Young ET. Transcriptional regulation in Saccharomyces cerevisiae: Transcription factor regulation and function, mechanisms of initiation, and roles of activators and coactivators. Genetics. 2011;189(3):705–36. pmid:22084422
- 15. Staller MV, Holehouse AS, Swain-Lenz D, Das RK, Pappu RV, Cohen BA. A high-throughput mutational scan of an intrinsically disordered acidic transcriptional activation domain. Cell Syst. 2018;6(4):444-455.e6. pmid:29525204
- 16. Sanborn AL, Yeh BT, Feigerle JT, Hao CV, Townshend RJ, Lieberman Aiden E, et al. Simple biochemical features underlie transcriptional activation domain diversity and dynamic, fuzzy binding to Mediator. Elife. 2021;10:e68068. pmid:33904398
- 17. Ravarani CN, Erkina TY, De Baets G, Dudman DC, Erkine AM, Babu MM. High-throughput discovery of functional disordered regions: Investigation of transactivation domains. Mol Syst Biol. 2018;14(5):e8190. pmid:29759983
- 18. Broyles BK, Gutierrez AT, Maris TP, Coil DA, Wagner TM, Wang X, et al. Activation of gene expression by detergent-like protein domains. iScience. 2021;24(9):103017. pmid:34522860
- 19. Erijman A, Kozlowski L, Sohrabi-Jahromi S, Fishburn J, Warfield L, Schreiber J, et al. A high-throughput screen for transcription activation domains reveals their sequence features and permits prediction by deep learning. Mol Cell. 2020;78(5):890-902.e6. pmid:32416068
- 20. Arnold CD, Nemčko F, Woodfin AR, Wienerroither S, Vlasova A, Schleiffer A, et al. A high-throughput method to identify trans-activation domains within transcription factor sequences. EMBO J. 2018;37(16):e98896. pmid:30006452
- 21. Morffy N, Van den Broeck L, Miller C, Emenecker RJ, Bryant JA, Lee TM, et al. Identification of plant transcriptional activation domains. Nature. 2024;632(8023):166–73. pmid:39020176
- 22. Mahatma S, Van den Broeck L, Morffy N, Staller MV, Strader LC, Sozzani R. Prediction and functional characterization of transcriptional activation domains. 2023 57th Annual Conference on Information Sciences and Systems (CISS), 2023. 1–6.
- 23. Erkina TY, Erkine AM. Nucleosome distortion as a possible mechanism of transcription activation domain function. Epigenetics Chromatin. 2016;9:40. pmid:27679670
- 24. Kotha SR, Staller MV. Clusters of acidic and hydrophobic residues can predict acidic transcriptional activation domains from protein sequence. Genetics. 2023;225(2):iyad131. pmid:37462277
- 25. Udupa A, Kotha SR, Staller MV. Commonly asked questions about transcriptional activation domains. Curr Opin Struct Biol. 2024;84:102732. pmid:38056064
- 26. Staller MV, Ramirez E, Kotha SR, Holehouse AS, Pappu RV, Cohen BA. Directed mutational scanning reveals a balance between acidic and hydrophobic residues in strong human activation domains. Cell Syst. 2022;13(4):334-345.e5. pmid:35120642
- 27. Cress WD, Triezenberg SJ. Critical structural elements of the VP16 transcriptional activation domain. Science. 1991;251(4989):87–90. pmid:1846049
- 28. Shen F, Triezenberg SJ, Hensley P, Porter D, Knutson JR. Critical amino acids in the transcriptional activation domain of the herpesvirus protein VP16 are solvent-exposed in highly mobile protein segments. An intrinsic fluorescence study. J Biol Chem. 1996;271(9):4819–26. pmid:8617751
- 29. DelRosso N, Tycko J, Suzuki P, Andrews C, Mukund A, et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature. 2023;616(7956):365–72. pmid:37020022
- 30. Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet. 2008;4(6):e1000106. pmid:18584029
- 31. Lusk RW, Eisen MB. Evolutionary mirages: selection on binding site composition creates the illusion of conserved grammars in Drosophila enhancers. PLoS Genet. 2010;6(1):e1000829. pmid:20107516
- 32. Ludwig MZ, Bergman C, Patel NH, Kreitman M. Evidence for stabilizing selection in a eukaryotic enhancer element. Nature. 2000;403(6769):564–7. pmid:10676967
- 33. Kumar M, Michael S, Alvarado-Valverde J, Zeke A, Lazar T, Glavina J, et al. ELM-the Eukaryotic Linear Motif resource-2024 update. Nucleic Acids Res. 2024;52(D1):D442–55. pmid:37962385
- 34. Moses AM, Pollard DA, Nix DA, Iyer VN, Li X-Y, Biggin MD, et al. Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput Biol. 2006;2(10):e130. pmid:17040121
- 35. Davey NE, Cyert MS, Moses AM. Short linear motifs - ex nihilo evolution of protein regulation. Cell Commun Signal. 2015;13:43. pmid:26589632
- 36. Bugge K, Brakti I, Fernandes CB, Dreier JE, Lundsgaard JE, Olsen JG, et al. Interactions by Disorder - A Matter of Context. Front Mol Biosci. 2020;7:110. pmid:32613009
- 37. Byrne KP, Wolfe KH. The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 2005;15(10):1456–61. pmid:16169922
- 38. Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13(5):555–6. pmid:9367129
- 39. Pamilo P, Bianchi NO. Evolution of the Zfx and Zfy genes: Rates and interdependence between the genes. Mol Biol Evol. 1993;10(2):271–81. pmid:8487630
- 40. Bellay J, Han S, Michaut M, Kim T, Costanzo M, Andrews BJ, et al. Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biol. 2011;12(2):R14. pmid:21324131
- 41. Bennett RJ, Turgeon BG. Fungal Sex: The Ascomycota. Microbiol Spectr. 2016;4(5):10.1128/microbiolspec.FUNK-0005–2016. pmid:27763253
- 42. Brzovic PS, Heikaus CC, Kisselev L, Vernon R, Herbig E, Pacheco D, et al. The acidic transcription activator Gcn4 binds the mediator subunit Gal11/Med15 using a simple protein interface forming a fuzzy complex. Mol Cell. 2011;44(6):942–53. pmid:22195967
- 43. Tuttle LM, Pacheco D, Warfield L, Luo J, Ranish J, Hahn S, et al. Gcn4-Mediator Specificity Is Mediated by a Large and Dynamic Fuzzy Protein-Protein Complex. Cell Rep. 2018;22(12):3251–64. pmid:29562181
- 44. Drysdale CM, Dueñas E, Jackson BM, Reusser U, Braus GH, Hinnebusch AG. The transcriptional activator GCN4 contains multiple activation domains that are critically dependent on hydrophobic amino acids. Mol Cell Biol. 1995;15(3):1220–33. pmid:7862116
- 45. Warfield L, Tuttle LM, Pacheco D, Klevit RE, Hahn S. A sequence-specific transcription activator motif and powerful synthetic variants that bind Mediator using a fuzzy protein interface. Proc Natl Acad Sci U S A. 2014;111(34):E3506-13. pmid:25122681
- 46. Alerasool N, Leng H, Lin Z-Y, Gingras A-C, Taipale M. Identification and functional characterization of transcriptional activators in human cells. Mol Cell. 2022;82(3):677-695.e7. pmid:35016035
- 47. Kato S, Han S-Y, Liu W, Otsuka K, Shibata H, Kanamaru R, et al. Understanding the function-structure and function-mutation relationships of p53 tumor suppressor protein by high-resolution missense mutation analysis. Proc Natl Acad Sci U S A. 2003;100(14):8424–9. pmid:12826609
- 48. Sadowski I, Ma J, Triezenberg S, Ptashne M. GAL4-VP16 is an unusually potent transcriptional activator. Nature. 1988;335(6190):563–4. pmid:3047590
- 49. Burz DS, Hanes SD. Isolation of mutations that disrupt cooperative DNA binding by the Drosophila bicoid protein. J Mol Biol. 2001;305(2):219–30. pmid:11124901
- 50. Lebrecht D, Foehr M, Smith E, Lopes FJP, Vanario-Alonso CE, Reinitz J, et al. Bicoid cooperative DNA binding is critical for embryonic patterning in Drosophila. Proc Natl Acad Sci U S A. 2005;102(37):13176–81. pmid:16150708
- 51. Hummel NFC, Markel K, Stefani J, Staller MV, Shih PM. Systematic identification of transcriptional activation domains from non-transcription factor proteins in plants and yeast. Cell Syst. 2024;15(7):662-672.e4. pmid:38866009
- 52. Hummel NFC, Zhou A, Li B, Markel K, Ornelas IJ, Shih PM. The trans-regulatory landscape of gene networks in plants. Cell Syst. 2023;14(6):501-511.e4. pmid:37348464
- 53. Tsong AE, Tuch BB, Li H, Johnson AD. Evolution of alternative transcriptional circuits with identical logic. Nature. 2006;443(7110):415–20. pmid:17006507
- 54. Snyder LF, O’Brien EM, Zhao J, Liang J, Bruce BJ, Zhang Y, et al. Divergence in a eukaryotic transcription factor’s co-TF dependence involves multiple intrinsically disordered regions. Nat Commun. 2025;16(1):5340. pmid:40533454
- 55. Jackson BM, Drysdale CM, Natarajan K, Hinnebusch AG. Identification of seven hydrophobic clusters in GCN4 making redundant contributions to transcriptional activation. Mol Cell Biol. 1996;16(10):5557–71. pmid:8816468
- 56. Hope IA, Struhl K. Functional dissection of a eukaryotic transcriptional activator protein, GCN4 of yeast. Cell. 1986;46(6):885–94. pmid:3530496
- 57. Hope IA, Mahadevan S, Struhl K. Structural and functional characterization of the short acidic transcriptional activation region of yeast GCN4 protein. Nature. 1988;333(6174):635–40. pmid:3287180
- 58. Lotthammer JM, Ginell GM, Griffith D, Emenecker RJ, Holehouse AS. Direct prediction of intrinsically disordered protein conformational properties from sequence. Nat Methods. 2024;21(3):465–76. pmid:38297184
- 59. Ginell GM, Emenecker RJ, Lotthammer JM, Keeley AT, Plassmeyer SP, Razo N, et al. Sequence-based prediction of intermolecular interactions driven by disordered regions. Science. 2025;388(6749):eadq8381. pmid:40403066
- 60. Joseph JA, Reinhardt A, Aguirre A, Chew PY, Russell KO, Espinosa JR, et al. Physics-driven coarse-grained model for biomolecular phase separation with near-quantitative accuracy. Nat Comput Sci. 2021;1(11):732–43. pmid:35795820
- 61. Tesei G, Lindorff-Larsen K. Improved predictions of phase behaviour of intrinsically disordered proteins by tuning the interaction range. Open Res Eur. 2023;2:94. pmid:37645312
- 62. Martin EW, Holehouse AS, Grace CR, Hughes A, Pappu RV, Mittag T. Sequence determinants of the conformational properties of an intrinsically disordered protein prior to and upon multisite phosphorylation. J Am Chem Soc. 2016;138(47):15323–35. pmid:27807972
- 63. Ginell GM, Holehouse AS. Intrinsically disordered proteins, methods and protocols. Methods Mol Biol. 2020;2141:103–26.
- 64. Roesgaard MA, Lundsgaard JE, Newcombe EA, Jacobsen NL, Pesce F, Tranchant EE, et al. Deciphering the Alphabet of Disorder-Glu and Asp Act Differently on Local but Not Global Properties. Biomolecules. 2022;12(10):1426. pmid:36291634
- 65. Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91. pmid:17483113
- 66. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11(5):725–36. pmid:7968486
- 67. Delgado J, Radusky LG, Cianferoni D, Serrano L. FoldX 5.0: Working with RNA, small molecules and a new graphical interface. Bioinformatics. 2019;35(20):4168–9. pmid:30874800
- 68. Goettel MS, Eilenberg J, Glare T. Entomopathogenic Fungi and their Role in Regulation of Insect Populations. Comprehensive Molecular Insect Science. Elsevier. 2005. 361–405.
- 69. Shemer R, Meimoun A, Holtzman T, Kornitzer D. Regulation of the transcription factor Gcn4 by Pho85 cyclin PCL5. Mol Cell Biol. 2002;22(15):5395–404. pmid:12101234
- 70. Chi Y, Huddleston MJ, Zhang X, Young RA, Annan RS, Carr SA, et al. Negative regulation of Gcn4 and Msn2 transcription factors by Srb10 cyclin-dependent kinase. Genes Dev. 2001;15(9):1078–92. pmid:11331604
- 71. O’Neill EM, Kaffman A, Jolly ER, O’Shea EK. Regulation of PHO4 nuclear localization by the PHO80-PHO85 cyclin-CDK complex. Science. 1996;271(5246):209–12. pmid:8539622
- 72. Huang K, Ferrin-O’Connell I, Zhang W, Leonard GA, O’Shea EK, Quiocho FA. Structure of the Pho85-Pho80 CDK-cyclin complex of the phosphate-responsive signal transduction pathway. Mol Cell. 2007;28(4):614–23. pmid:18042456
- 73. Seto K, Mok W, Stone J. Bridging the gap between theory and practice in elucidating modular gene regulatory sequence organisation within genomes. Genome. 2020;63(6):281–9. pmid:32114793
- 74. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020;37(5):1530–4. pmid:32011700
- 75. Opulente DA, LaBella AL, Harrison M-C, Wolters JF, Liu C, Li Y, et al. Genomic factors shape carbon and nitrogen metabolic niche breadth across Saccharomycotina yeasts. Science. 2024;384(6694):eadj4503. pmid:38662846
- 76. Waldburger L, Nisonoff H, Zintel M, Kirkpatrick LD, Lam A, Lanclos N. Active learning enables discovery of transcriptional activators across fungal evolutionary space. bioRxiv. 2025;:2025.09.12.675635.
- 77. Dyson HJ, Wright PE. Role of intrinsic protein disorder in the function and interactions of the transcriptional coactivators creb-binding protein (CBP) and p300. J Biol Chem. 2016;291(13):6714–22. pmid:26851278
- 78. Ludwig CH, Thurm AR, Morgens DW, Yang KJ, Tycko J, Bassik MC, et al. High-throughput discovery and characterization of viral transcriptional effectors in human cells. Cell Syst. 2023;14(6):482-500.e8. pmid:37348463
- 79. LeBlanc C, Agarwal P, Demaray J, Hu G, Zintel M, Lam A, et al. Interpretable biophysical neural networks of transcriptional activation domains separate roles of protein abundance and coactivator binding. bioRxiv. 2025;:2025.09.19.677413. pmid:41000786
- 80. Scholes NS, Weinzierl ROJ. Molecular Dynamics of “Fuzzy” Transcriptional Activator-Coactivator Interactions. PLoS Comput Biol. 2016;12(5):e1004935. pmid:27175900
- 81. Pacheco D, Warfield L, Brajcich M, Robbins H, Luo J, Ranish J, et al. Transcription Activation Domains of the Yeast Factors Met4 and Ino2: Tandem Activation Domains with Properties Similar to the Yeast Gcn4 Activator. Mol Cell Biol. 2018;38(10):e00038-18. pmid:29507182
- 82. Tuttle LM, Pacheco D, Warfield L, Wilburn DB, Hahn S, Klevit RE. Mediator subunit Med15 dictates the conserved “fuzzy” binding mechanism of yeast transcription activators Gal4 and Gcn4. Nat Commun. 2021;12(1):2220. pmid:33850123
- 83. Dyson HJ, Wright PE. Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol. 2002;12(1):54–60. pmid:11839490
- 84. Schuler B, Borgia A, Borgia MB, Heidarsson PO, Holmstrom ED, Nettels D, et al. Binding without folding - the biomolecular function of disordered polyelectrolyte complexes. Curr Opin Struct Biol. 2020;60:66–76. pmid:31874413
- 85. Gao Y, Tan DS, Girbig M, Hu H, Zhou X, Xie Q, et al. The emergence of Sox and POU transcription factors predates the origins of animal stem cells. Nat Commun. 2024;15(1):9868. pmid:39543096
- 86. Hultqvist G, Åberg E, Camilloni C, Sundell GN, Andersson E, Dogan J, et al. Emergence and evolution of an interaction between intrinsically disordered proteins. Elife. 2017;6:e16059. pmid:28398197
- 87. Liu Y, Huang A, Booth RM, Mendes GG, Merchant Z, Matthews KS, et al. Evolution of the activation domain in a Hox transcription factor. Int J Dev Biol. 2018;62(11–12):745–53. pmid:30604844
- 88. Zarin T, Strome B, Nguyen Ba AN, Alberti S, Forman-Kay JD, Moses AM. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. Elife. 2019;8:e46883. pmid:31264965
- 89. Zarin T, Strome B, Peng G, Pritišanac I, Forman-Kay JD, Moses AM. Identifying molecular features that are associated with biological function of intrinsically disordered protein regions. Elife. 2021;10:e60220. pmid:33616531
- 90. Zarin T, Tsai CN, Nguyen Ba AN, Moses AM. Selection maintains signaling function of a highly diverged intrinsically disordered region. Proc Natl Acad Sci U S A. 2017;114(8):E1450–9. pmid:28167781
- 91. Parker MW, Bell M, Mir M, Kao JA, Darzacq X, Botchan MR, et al. A new class of disordered elements controls DNA replication through initiator self-assembly. Elife. 2019;8:e48562. pmid:31560342
- 92. Parker MW, Kao JA, Huang A, Berger JM, Botchan MR. Molecular determinants of phase separation for Drosophila DNA replication licensing factors. Elife. 2021;10:e70535. pmid:34951585
- 93. Dalal CK, Johnson AD. How transcription circuits explore alternative architectures while maintaining overall circuit output. Genes Dev. 2017;31(14):1397–405. pmid:28860157
- 94. Fowler KR, Leon F, Johnson AD. Ancient transcriptional regulators can easily evolve new pair-wise cooperativity. Proc Natl Acad Sci U S A. 2023;120(28):e2302445120. pmid:37399378
- 95. Furlong EEM, Levine M. Developmental enhancers and chromosome topology. Science. 2018;361(6409):1341–5. pmid:30262496
- 96. Wong ES, Zheng D, Tan SZ, Bower NL, Garside V, Vanwalleghem G, et al. Deep conservation of the enhancer regulatory code in animals. Science. 2020;370(6517):eaax8137. pmid:33154111
- 97. Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M. Functional evolution of a cis-regulatory module. PLoS Biol. 2005;3(4):e93. pmid:15757364
- 98. Peterson BK, Hare EE, Iyer VN, Storage S, Conner L, Papaj DR, et al. Big genomes facilitate the comparative identification of regulatory elements. PLoS One. 2009;4(3):e4688. pmid:19259274
- 99. Kaplow IM, Lawler AJ, Schäffer DE, Srinivasan C, Sestili HH, Wirthlin ME, et al. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science. 2023;380(6643):eabm7993. pmid:37104615
- 100. Villar D, Berthelot C, Aldridge S, Rayner TF, Lukk M, Pignatelli M, et al. Enhancer evolution across 20 mammalian species. Cell. 2015;160:554–66.
- 101. Arnosti DN, Kulkarni MM. Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards? J Cell Biochem. 2005;94(5):890–8. pmid:15696541
- 102. Crocker J, Abe N, Rinaldi L, McGregor AP, Frankel N, Wang S, et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell. 2015;160(1–2):191–203. pmid:25557079
- 103. Frankel N, Davis GK, Vargas D, Wang S, Payre F, Stern DL. Phenotypic robustness conferred by apparently redundant transcriptional enhancers. Nature. 2010;466(7305):490–3. pmid:20512118
- 104. Perry MW, Bothma JP, Luu RD, Levine M. Precision of hunchback expression in the Drosophila embryo. Curr Biol. 2012;22(23):2247–52. pmid:23122844
- 105. Bothma JP, Garcia HG, Ng S, Perry MW, Gregor T, Levine M. Enhancer additivity and non-additivity are determined by enhancer strength in the Drosophila embryo. Elife. 2015;4:e07956. pmid:26267217
- 106. Waymack R, Fletcher A, Enciso G, Wunderlich Z. Shadow enhancers can suppress input transcription factor noise through distinct regulatory logic. Elife. 2020;9:e59351. pmid:32804082
- 107. Tareen A, Kinney JB. Logomaker: Beautiful sequence logos in Python. Bioinformatics. 2020;36(7):2272–4. pmid:31821414
- 108. Amberg DC, Burke D, Strathern JN. Methods in Yeast Genetics: A Cold Spring Harbor Laboratory Course Manual. CSHL Press. 2005.
- 109. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1995;57(1):289–300.
- 110. Das RK, Pappu RV. Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A. 2013;110(33):13392–7. pmid:23901099
- 111. Holehouse AS, Das RK, Ahad JN, Richardson MOG, Pappu RV. CIDER: Resources to Analyze Sequence-Ensemble Relationships of Intrinsically Disordered Proteins. Biophys J. 2017;112(1):16–21. pmid:28076807
- 112. Emenecker RJ, Griffith D, Holehouse AS. Metapredict V2: An update to metapredict, a fast, accurate, and easy-to-use predictor of consensus disorder and structure. bioRxiv. 2022;2022.06.06.494887.
- 113. Blum M, Andreeva A, Florentino LC, Chuguransky SR, Grego T, Hobbs E, et al. InterPro: The protein sequence classification resource in 2025. Nucleic Acids Res. 2025;53(D1):D444–56. pmid:39565202
- 114. Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 2016;33(6):1635–8. pmid:26921390
- 115. Ginell GM, Emenecker RJ, Lotthammer JM, Usher ET, Holehouse AS. Direct prediction of intermolecular interactions driven by disordered regions. bioRxiv. 2024. https://doi.org/10.1101/2024.06.03.597104
- 116. Tesei G, Lindorff-Larsen K. Improved predictions of phase behaviour of intrinsically disordered proteins by tuning the interaction range. Open Res Eur. 2023;2:94. pmid:37645312
- 117. Holehouse AS, Kragelund BB. The molecular basis for cellular function of intrinsically disordered protein regions. Nat Rev Mol Cell Biol. 2024;25(3):187–211. pmid:37957331
- 118. González-Foutel NS, Glavina J, Borcherds WM, Safranchik M, Barrera-Vilarmau S, Sagar A, et al. Conformational buffering underlies functional selection in intrinsically disordered protein regions. Nat Struct Mol Biol. 2022;29(8):781–90. pmid:35948766
- 119. Pries R, Bömeke K, Draht O, Künzler M, Braus GH. Nuclear import of yeast Gcn4p requires karyopherins Srp1p and Kap95p. Mol Genet Genomics. 2004;271(3):257–66. pmid:14648200