Determining the interaction status and evolutionary fate of duplicated homomeric proteins

Oligomeric proteins are central to life. Duplication and divergence of their genes is a key evolutionary driver, not least because duplications can yield very different outcomes. Given a homomeric ancestor, duplication can yield two paralogs that form two distinct homomeric complexes, or a heteromeric complex comprising both paralogs. Alternatively, one paralog remains a homomer while the other acquires a new partner. So far, however, conflicting trends have been noted with respect to which fate dominates, primarily because different methods and criteria have been used to assign the interaction status of paralogs. Here, we systematically analyzed all Saccharomyces cerevisiae and Escherichia coli oligomeric complexes that include paralogous proteins. We found that the proportions of homo-hetero duplication fates strongly depend on a variety of factors, yet rigorous filtering nonetheless gives a consistent picture. In E. coli, about 50% of the paralogous pairs appear to have retained the ancestral homomeric interaction, whereas in S. cerevisiae only ∼10% retained a homomeric state. This difference was also observed when unique complexes were counted instead of paralogous gene pairs. We further show that this difference is accounted for by multiple cases of heteromeric yeast complexes that share common ancestry with homomeric bacterial complexes. Our analysis reconciles the contradicting trends of previous analyses, and provides a systematic and rigorous pipeline for delineating the fate of duplicated oligomers in any organism for which protein-protein interaction data are available.


Introduction
It is estimated that more than half of all proteins form oligomers. Oligomerization is thus ubiquitous and central to protein stability, function and regulation. Duplication is also ubiquitous and serves as the main source of new genes and proteins, as manifested by nearly half of all genes in a given genome being paralogs [1]. The duplication of genes encoding an oligomeric protein is of particular interest: the ancestral function may diverge alongside the oligomeric state, thus providing new opportunities for evolutionary innovation [2][3][4][5].

Our analysis examined the divergence of homomers. By parsimony, the ancestors of both homomers and heteromers are homomers, as homomers are encoded by a single gene. Indeed, proteins have an inherent tendency to self-interact, and initially promiscuous self-interactions can be readily amplified by mutations to generate tightly bound homo-dimers and also larger homo-oligomers [6]. Upon duplication of a gene encoding a homomeric ancestor, and acquisition of the very first mutation(s) in either the original gene or its new copy, a statistical mixture of homo- and heteromeric complexes would form [2] (Fig. 1A, (i)). Over time, further evolutionary divergence may result in three possible scenarios: (ii) loss of the capacity to cross-react and formation of two distinct homomeric complexes, or obligatory homomers; (iii) loss of the homomeric interactions and formation of a heteromeric complex, or obligatory heteromers. Alternatively, the interaction pattern may diverge asymmetrically: while one paralog is kept as a homomer, the other gains a completely new interaction partner (Fig. 1A, (iv): hetero-others).
Other scenarios may occur, e.g., loss of homomeric interactions in both copies, or divergence into monomers that do not interact with any other protein; however, these scenarios are intractable on a genome scale (by parsimony, their ancestors cannot be assumed to be homomers) and are probably relatively rare.

Individual cases following each of these four scenarios are known. What remains unclear, however, is which fate is the most likely. Does protein function, or the source organism, for example, affect which fate dominates? Genome-scale studies [2][3][4] attempted to address the relative frequencies of these scenarios in model organisms, but their conclusions are inconsistent. Analyzing human, Arabidopsis, yeast and E. coli protein-protein interaction (PPI) data, [3] reported that most oligomeric paralogs diverged to form obligatory homomers. However, analysis of yeast, worm and fly, using both PPI data and oligomers of known structure, [2] indicated that heteromeric interactions dominate, a conclusion recently supported by [4], who analyzed yeast PPI data. We compared these studies and observed that these inconsistencies relate to three major factors. First, different evolutionary scenarios were examined in different studies [...] e.g., complex datasets (hereafter, curated complexes) and high-throughput PPI data (Fig. 1B, 2). For the latter, PPI data were pooled from different databases (Data S1, S2) and taken through three different filters to minimize false positives. These databases encompass all reported interactions, including high-resolution data, yet the high-throughput data dominate, especially after the applied filtering.

Finally, in the 3rd step, the criteria for assigning the fates of paralogous pairs also matter. In principle, obligatory-homomers means that both paralogs were individually observed as homomers and that a cross-interaction was not observed (stringent criterion).
Accepting a single homomer-forming paralog as sufficient would inevitably result in obligatory-homomers being the most frequent fate [4]. Further, as shown below, applying this flexible criterion results in assigning paralogs that actually diverged to hetero-others as obligatory-homo (Fig. 1A, iv and ii, respectively). Thus, the divergence modes of the paralogous pairs were assigned applying both stringent and flexible criteria (Fig. 1B, 3). We subsequently examined the relative frequency of the four divergence modes, or fates, as a function of the stringency of analysis in each of the 3 steps.

A few clarifying notes regarding our analysis. We addressed paralogous pairs, i.e., pairs of two genes that diverged from a common ancestor. In many cases, multiple paralogs exist that arose from two or more sequential duplications. Initially, we detected all potential pairs (Fig. 1B, step-1). Then, by assigning the divergence modes, we defined the relevant paralogous pairs (with few exceptions in the mixed category (Fig. 1A, i), where one protein can be part of more than one pair). Thus, unless otherwise stated, the statistics and the discussion below relate to gene pairs. Additionally, given that some complexes comprise multiple pairs, statistics are also provided per complex. Finally, our parsimonious assumption is that the pre-duplicated ancestor can be considered a homomer if at least one descendant paralog is a homomer, and also if both paralogs are present as a heteromer (as in [2,4]). The latter was subsequently confirmed by our analysis ('Yeast heteromeric paralogs diverged from bacterial homomeric ancestors').

The results of our analysis were distilled into Fig. 2, which presents the relative frequency of the four divergence modes given the dataset and stringency of analysis. The tables are arranged such that the darker the color, the higher the stringency.
The results given different stringencies of paralog assignment (Step-1) are presented in columns, going from low-confidence paralogs in pale green to high-confidence paralogs in dark green. Step-2 is also presented in columns, going from white (raw PPI data) to dark grey (Filter-3).
Step-3, the stringency of assigning divergence modes, is presented in rows, with the top set of rows in yellow showing the flexible criterion, and the bottom, dark yellow rows indicating the stringent criterion. Finally, the dominant

Heteromeric interactions dominate yeast paralogs
For yeast, under stringent filtering, the results from curated complexes and from PPI data largely converge, indicating that ~90% of yeast duplicates diverged to various heteromeric states. Specifically, stringent filtering of the PPI interactions (Fig. 2A, Filter-3, dark grey columns), combined with the stringent criterion for assigning the divergence modes (Fig. 2A, dark yellow rows), indicated that only about one-tenth of the paralogous pairs diverged to obligatory-homomers. Given the consistency between the two datasets, and the sources of noise and bias indicated by our analysis (elaborated below), we surmise that obligatory-homomers are indeed a minority in yeast (~10%) and that hetero-dominance is the reality (Fig. 2A, numbers in bold, Data S3). Among the three different hetero fates, the dominant one is obligatory-hetero (about half of the pairs in the curated complexes, and a third in the PPI data where, as expected, a larger fraction of pairs was annotated as mixed).

If we were to count unique complexes instead of gene pairs, would the picture be different? Certain heteromeric complexes, mostly ring-like complexes such as the proteasome (further addressed below), are composed of multiple paralogous proteins, and counting each of them once could shift the balance in favor of obligatory homomers. Nonetheless, analysis of complexes showed that, under the stringent filtering criteria and for high-confidence paralogs, complexes comprising heteromers were nearly three times more frequent than homomeric complexes (Fig. 3A). Overall, we conclude that heteromeric interactions dominate yeast paralogs, regardless of whether we count paralogous pairs or unique complexes.

Data biases and their mitigation
Our analysis also reveals various sources of error and bias, and how these could be mitigated. As expected, the consistency of the two interaction datasets, curated complexes and PPI, fades at lower stringency. Foremost, the 3rd step of the analysis, assigning the divergence modes, had a massive impact on the relative abundances of homo-hetero pairs. Assigning obligatory homomers using the flexible criterion (it suffices that one paralog is a homomer and that there is no cross-reaction) resulted in a ~5-fold increase in obligatory-homomers in the curated complexes, and a ~3-fold increase in the PPI data (Fig. 2A, light yellow rows). The reason is that under the flexible criterion, hetero-others were assigned as obligatory-homo. Thus, cases that are quite abundant in yeast, where one paralog kept the ancestral homomeric interaction and the other diverged to bind a completely new partner, were not only ignored but also mis-assigned.

Our analysis also reflects the homo- or hetero-biases that are inherent to the source of interaction data. The [...] methods failing to detect homomeric interactions. Indeed, for a given stringency with respect to the first two steps of the analysis (assigning paralogs, identifying interactions), homomers are more frequent in the curated complexes while heteromers dominate the PPI data (Fig. 2A). However, these biases seem to be alleviated under the stringent criterion, as both the PPI data and the curated complexes give a similar distribution of fates. Thus, consideration of all four evolutionary fates, namely including both mixed homo-hetero and hetero-others, is critical, as are adequate criteria to assign them (i.e., the stringent criterion).

Two other elements seem to be critical for obtaining consistent results, both relating to the PPI data. Upon manual inspection we noticed five long terminal repeat retrotransposon families, comprising a total of 90 proteins. These paralogous mobile genetic elements of viral origin [8,9] inflated the fraction of obligatory-homomers (~50%, which dropped to ~15% once they were removed). Further, once these retrotransposon proteins were removed (Fig. 2A, filter-1), the homo-hetero fates in the PPI data converged with those in the curated complexes (Fig. 2A, stringent criterion). Filtering of potential false positives in the PPI data had a lesser effect. First, we required that interactions be reported in two different databases, and that they be detected with the protein pairs applied as both bait and prey (Fig. 2A, filter-2). A second source of false positives is in vitro interactions that do not occur in vivo. Obvious cases include interactions between proteins localized in different compartments (Fig. 2A, filter-3). However, compared to the removal of retrotransposons, these two filters had a minor effect.

Overall, we conclude that heteromeric interactions between paralogous pairs are the dominant fate in yeast, regardless of whether we count paralogous pairs or unique complexes.

Homomeric interactions dominate E. coli paralogs
A similar analysis of E. coli indicated that, in contrast to S. cerevisiae, for high-confidence paralogs, about 60% of the descendant pairs are obligatory-homomers in the curated complexes compared to only 30% in the PPI data (Fig. 2B, filter-2, HC, stringent criterion, Data S4). However, this inconsistency arises because the E. coli sample sizes for high-confidence paralogs are too small (Fig. S1A). In yeast, filtering led to a considerable reduction in sample sizes, yet these remained high even for high-confidence paralogs (Fig. 2A). Further, in yeast the distribution is similar for high- and medium-confidence paralogs and, with few exceptions, even for the low-confidence ones (highest sample size, Fig. S1B). This is not the case for the E. coli analysis. When more distantly related paralogs were removed (MC and HC columns), sample sizes decreased by >10-fold, [...]

Overall, considering the stringent criterion for assigning the divergence fates, the filtered PPI data and the curated complexes gave a consistent picture by which ~55% of the pairs comprise obligatory-homomers, for both medium- and low-confidence paralogs (Fig. 2B, MC and LC). Further, as in the yeast analysis, the trend held when complexes were counted instead of pairs: homomers dominated (Fig. 3B). Overall, it appears that retaining the ancestral homomeric interaction is the most likely fate of E. coli gene duplications.

Note that tuning the stringencies in the E. coli analysis had similar effects as in S. cerevisiae. Filtering the PPI data for interactions reported in at least two databases, and as both bait and prey, resulted in a higher fraction of obligatory-homomers. On the other hand, assigning the divergence modes with a flexible criterion resulted in overestimation of obligatory-homomers (and a corresponding drop in obligatory-heteromers).

Yeast heteromeric paralogs diverged from bacterial homomeric ancestors
We observed the dominance of obligatory-homomers in E. coli (~50%), while in S. cerevisiae they comprise only ~10% of the duplicated oligomeric proteins and obligatory-heteromers in turn comprise the majority. However, these two model organisms share common ancestry, as reflected in about one-third of S. cerevisiae proteins, many of which are mitochondrial proteins, harboring sequence signatures of bacterial origin [10]. We thus searched for the E. coli orthologs of the S. cerevisiae heteromeric paralogs, asking which are homomeric.

A systematic reciprocal BLAST was performed between all known E. coli homomers (n = 1033) and all S. cerevisiae obligatory-hetero and mixed paralogous pairs (n = 692; out of a total of 1152 LC pairs in the stringent categories, PPI dataset; Fig. 2A). Following manual curation (see Methods), we found that about a third of the heteromeric yeast paralogous pairs have E. coli homomeric orthologs (n = 235; Data S5). Of these, nearly two-thirds, 153 pairs, relate to E. coli homomers that are singletons (i.e., non-duplicated genes; a total of 52 proteins). By parsimony, these reflect cases of duplication and divergence of an ancestral bacterial homomer into paralogous heteromers in yeast. Remarkably, 42/52 of these E. coli proteins are metabolic enzymes that duplicated and diverged into heteromeric S. cerevisiae enzymes. In many such cases only one copy retained the catalytic activity whereas the other evolved into a regulatory subunit. Examples include the mitochondrial NAD+-dependent isocitrate dehydrogenase complex [11], Trehalose [...] helicases appear to have gone through multiple duplications and contribute to the hetero-dominance in S. cerevisiae.
The remaining third, 82 yeast heteromeric paralogous pairs, are orthologous to 144 obligatory homomeric pairs in E. coli (Data S5). These also relate to divergence of homomers to heteromers. What is unclear, though, is which of these genes duplicated independently in the two clades, and which diverged to heteromers in an earlier bacterial ancestor. What is clear is that the dominance of heteromeric paralogs in yeast is the result of homomers duplicating and preferentially diverging into heteromers.
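The proportions quoted in this section can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are illustrative and are not taken from the authors' code.

```python
# Counts quoted in the text (variable names are illustrative only).
lc_pairs_total = 1152            # LC pairs, stringent categories, PPI dataset
hetero_or_mixed_pairs = 692      # obligatory-hetero plus mixed yeast pairs
pairs_with_ecoli_homomer = 235   # yeast pairs with an E. coli homomeric ortholog
pairs_from_singletons = 153      # ...whose E. coli ortholog is a singleton

# The remainder relates to duplicated (paralogous) E. coli homomers.
pairs_from_duplicated = pairs_with_ecoli_homomer - pairs_from_singletons

frac_with_ortholog = pairs_with_ecoli_homomer / hetero_or_mixed_pairs
frac_singleton = pairs_from_singletons / pairs_with_ecoli_homomer
frac_duplicated = pairs_from_duplicated / pairs_with_ecoli_homomer

print(pairs_from_duplicated)         # 82
print(round(frac_with_ortholog, 2))  # 0.34, "about a third"
print(round(frac_singleton, 2))      # 0.65, "nearly two-thirds"
print(round(frac_duplicated, 2))     # 0.35, "the remaining third"
```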

With the obvious caveat of being based on two model organisms for which extensive protein interaction data are available, our analysis suggests a continuous evolutionary process of bacterial homomeric proteins gradually duplicating and diverging into heteromeric proteins in eukaryotes. This ongoing evolutionary transition also validates our assignment of the fundamentally different divergence modes of paralogous pairs in E. coli and S. cerevisiae (Fig. 2). Assuming E. coli and S. cerevisiae are representatives of bacteria and single-cell eukaryotes, the gene duplications that occurred in the eukaryotic lineage that diverged from bacteria via endosymbiosis [16,17] led to a 5-fold decrease in the abundance of homomers among paralogous proteins. Further, because paralogous proteins comprise nearly half of the proteomes, this phenomenon has led to a complete shift from the prevalence of homomers in prokaryotes to heteromers in eukaryotes [18,19].

The transition of homomeric prokaryotic complexes into eukaryotic heteromeric ones was previously noted for individual protein families, and especially for ring-like complexes such as DNA/RNA helicases [20,21], TCP complex subunits [22], the proteasome [23,24] and the exosome [25]. However, examining our dataset revealed that both ring-like and non-ring-like prokaryotic homomers evolved into heteromeric complexes in eukaryotes, through single or multiple gene duplications (Fig. 4, Data S5). Thus, the dominance of heteromeric paralogs in S. cerevisiae arises not only because the ancestral homomers duplicated and diverged into heteromers, but also because heteromeric paralogs further duplicated and their descendants retained the heteromeric state.

For non-ring-like complexes, a single gene duplication typically results in a single eukaryotic heteromeric complex that may or may not retain the ancestral oligomeric order (total number of complex subunits). For example, E. coli DNA mismatch repair endonuclease MutL is a homodimer, and the yeast orthologue is a [...] namely, the oligomeric order changed from 2 to 8 (Fig. 4, ii). In this case, duplication and divergence into a heteromer offered the opportunity of evolving a new regulatory mode by diversifying one subunit, while the other subunit kept the catalytic activity.

As a prokaryotic non-ring-like homomer evolves into a heteromer in eukaryotes, multiple rounds of duplication may occur, with the descendant paralogs retaining the newly evolved heteromeric interaction (Fig. 4, iii). For example, the bacterial homomeric Hsp70 duplicated and diverged into Hsp110 co-chaperones in eukaryotes [27,28], and the S. cerevisiae genome encodes multiple copies of Hsp70 and Hsp110 that form distinct heteromers in different subcellular compartments [29].

For ring-like prokaryotic homomeric complexes (e.g., helicases, proteases, RNases and chaperonins), the homo-to-hetero transition also occurred either while retaining the ancestral oligomeric order or while modifying it (Fig. 4, iv-v). Complexes that have retained their ancestral oligomeric order (Fig. 4, iv) include the archaeal homo-hexameric MCM complex that became hetero-hexameric in eukaryotes [21], and the core proteasome alpha- and beta-rings that remained heptameric [23,24]. In contrast, the bacterial homo-hexameric Hfq ring complex [30] diverged to the hetero-heptameric Lsm1-7 and Lsm2-8 complexes in yeast (Fig. 4, v).

The phenomena described above that underlie homo-to-hetero transitions raise some interesting questions. This transition needs to overcome the inherent self-interacting tendency of proteins, and eventually lead to incompatibility of the homomeric interactions. It is therefore likely to be adaptive, i.e., to provide a distinct functional advantage [6]. In E. coli, duplications primarily yield obligatory homomers, with each paralog mediating a different enzymatic function (typically a different substrate specificity). In yeast, however, the obligatory heteromers seem to be associated with the acquisition of new regulatory modes. Thus, function may dictate the fate of the oligomeric state. Another factor might be the location of the active site, which in some enzymes resides within the subunits and in others at the interface between subunits [31]. Also of note is that, in principle, divergence to a heteromeric interaction increases the likelihood that both copies would be fixed in the genome, because loss of one copy leads to non-functionalization. Duplication itself is random, yet whether a duplicate is fixed or lost (the far more likely fate) depends on how rapidly it provides a selectable advantage [32]. Gene knockout experiments support this hypothesis: deletion of one copy is highly deleterious in heteromers, while for obligatory homomers deletion of one copy often has little effect (Data S3).

Future work might address the above and other questions, and may also track down other possible [...] cases of heteromers that diverged to homomers? Addressing these questions will demand detailed phylogenies and experimental evaluation of the oligomeric states before and after the duplication. However, a rigorous way of assigning oligomeric states from molecular interaction databases, and of determining the fate of duplicates, is crucial to any such investigation.

Further details are provided in the supplementary items, in relation to each of the analyses described therein.

Detecting S. cerevisiae and E. coli paralogous protein pairs
The 1st step of our analysis identified all S. cerevisiae and E. coli paralogous protein pairs (Fig. 1B). To this end, all-versus-all intra-species protein-protein BLAST [33] was performed across their respective proteomes, obtained from the NCBI Genome Database [34]. BLAST hits associated with at least 25% identity and 40% query coverage were manually inspected and assigned as putative paralogous pairs (3958 pairs in S. cerevisiae and 2090 pairs in E. coli, Fig. S1). These pairs were further classified into three overlapping groups, with increasing stringency of paralog assignment: Low-Confidence (LC, ≥25% identity, ≥40% coverage), Medium-Confidence (MC, ≥30% identity, ≥50% coverage) and High-Confidence (HC, ≥40% identity, ≥60% coverage, and identical domain content). To ensure identical domain content, we compared the Pfam [35]-annotated domain contents of all HC pairs. Pfam uses Hidden Markov Models to identify domains, and every annotated instance is given a probability score (p-value). Any domain assigned with p < 10⁻⁵ significance was considered for further analysis. Following domain assignments, paralogous pairs were compared and those that differed in their domain content were discarded. The list of 455 S. cerevisiae ohnologs (paralogs emerging from the whole genome duplication; Fig. S1 [...]

Complexes that include at least one protein annotated as a paralog were classified into three groups, with increasing stringency of curation accuracy (Data S1, S2). [...] interactions involve paralogous proteins. Note that these PPI databases include both high- and low-throughput data, with the former dominating (see also next section). Predicted interactions, and text-mining-based interactions, reported in STRING were removed. S. cerevisiae raw PPI data were filtered in three successive steps (Data S1).
In the 1st step, 90 transposon element proteins encoded by genes of viral origin that are included in the Saccharomyces Genome Database [47] were removed. In the 2nd step, to minimize false positives in the PPI data, we required that the interaction between two proteins be observed using both proteins as bait and as prey, and that it be reported in at least two databases. The bait-and-prey information is relevant to high-throughput two-hybrid and pull-down experiments, and hence this filtering criterion removed interactions detected by other means, foremost by low-throughput methods such as gel shifts. However, this filtering resulted in a negligible loss of interacting pairs and did not bias the results (see next section). Also note that the databases used here collect their raw data from published literature. Overlaps between databases are therefore common, although none of these databases overlap completely. Thus, the requirement that an interaction be reported in at least two databases does not necessarily mean two independent observations, but as a minimum it eliminates annotation mistakes. In the 3rd filtering step, interactions between two proteins localized in different sub-cellular compartments were [...] (Data S1). The PPI datasets derived after each step of filtering are provided in Data S1. Transposon elements were absent from the E. coli raw PPI data and filtering involved only one step (interactions must be reported for both proteins as bait and as prey, and in at least two databases). This yielded a final PPI dataset of 1996 pairwise interactions (Data S2).
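The confidence tiers and the second filtering step described in this section can be sketched as follows. This is a minimal illustration, not the authors' in-house code: the function names, the record layout `(bait, prey, database)`, and the treatment of self-interactions (counted as satisfying the bait-and-prey requirement, since both roles are played by the same protein) are assumptions made for the example.

```python
from collections import defaultdict


def confidence_tiers(identity, coverage, same_domains=False):
    """Return the inclusive set of tiers a BLAST hit falls into.

    identity and coverage are percentages; same_domains says whether the
    two proteins share identical Pfam domain content (required for HC).
    """
    tiers = set()
    if identity >= 25 and coverage >= 40:
        tiers.add("LC")
    if identity >= 30 and coverage >= 50:
        tiers.add("MC")
    if identity >= 40 and coverage >= 60 and same_domains:
        tiers.add("HC")
    return tiers


def filter_ppi(records, min_databases=2):
    """Keep pairs seen in both bait/prey orientations and in >= 2 databases.

    records is an iterable of (bait, prey, database) tuples (hypothetical
    layout). Assumption: a self-interaction passes the orientation test.
    """
    orientations = defaultdict(set)
    databases = defaultdict(set)
    for bait, prey, db in records:
        pair = frozenset((bait, prey))
        orientations[pair].add((bait, prey))
        databases[pair].add(db)

    kept = set()
    for pair, seen in orientations.items():
        both_ways = len(pair) == 1 or len(seen) == 2
        if both_ways and len(databases[pair]) >= min_databases:
            kept.add(pair)
    return kept


example = [
    ("A", "B", "db1"), ("B", "A", "db2"),  # both orientations, two databases
    ("A", "C", "db1"), ("A", "C", "db2"),  # one orientation only
    ("D", "D", "db1"), ("D", "D", "db2"),  # self-interaction, two databases
]
kept_pairs = filter_ppi(example)
```

Returning a set of tiers rather than a single label mirrors the inclusive categories described in the text, where every HC pair is also an MC and an LC pair.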

Assigning the interaction status of paralogous pairs
[...] interact, but cross-react to form a heteromer), mixed homo/hetero (two paralogs cross-react to form a heteromer, and at least one paralog also self-interacts), or hetero-others (only one paralog self-interacts and the other interacts with another, non-paralogous partner). Obligatory homomers were assigned using a stringent and a flexible criterion. The stringent criterion demanded that the two paralogs do not cross-react, and that both self-interact; the flexible criterion demanded that the two paralogs do not cross-react and that at least one of them self-interacts.

PDB structures and PPI data, by definition, comprise physical interaction data between proteins. For CS and C complexes, inter-subunit interactions were predicted from the PPI data. A homomer was assigned if a protein is present in multiple copies in a complex, and also self-interacts in the PPI data. Heteromers were assigned if both paralogs co-occur in a complex and were found to cross-interact, but not self-interact, in the filtered PPI dataset. For obligatory homomers in the curated complexes, we also ensured that the two paralogs do not cross-interact in the PPI data.

To examine whether the assignment of homo/hetero fates in the PPI dataset was substantially influenced by the bait-prey filtering, we extracted all the PPI data that were detected exclusively by methods other than two-hybrid and pull-downs. As with the other PPI data, these data were filtered by demanding that the interaction is reported in at least two databases, and by excluding interactions between proteins localized in different sub-cellular compartments. For this filtered subset, we assigned the interaction status of paralogous pairs applying the stringent criterion. Only 14 new pairs were detected (Data S6), compared to 1152 pairs in the filtered PPI data (see Data S3), indicating that a negligible amount of PPI data was lost due to the bait-prey filtering. Further, these 14 pairs reflected the overall dominance of heteromers in yeast (5 obligatory hetero, 7 mixed, 2 hetero-others, and no obligatory homo; Data S6).
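The assignment rules above, and the difference between the stringent and flexible criteria, can be sketched as a single decision function. This is an illustrative sketch only: the function name, the boolean inputs and the `other_partner` flag are hypothetical, and returning `None` for pairs that match no category is an assumption rather than something stated in the text.

```python
def assign_fate(self_a, self_b, cross, other_partner=False, stringent=True):
    """Assign a divergence mode to one paralogous pair.

    self_a, self_b : whether each paralog self-interacts
    cross          : whether the two paralogs interact with each other
    other_partner  : whether the paralog that does not self-interact binds
                     a non-paralogous protein (hypothetical flag)
    stringent      : use the stringent (True) or flexible (False) criterion
    """
    if cross:
        # Cross-reacting pairs are heteromeric; any self-interaction makes
        # them mixed homo/hetero.
        return "mixed" if (self_a or self_b) else "obligatory-hetero"
    if self_a and self_b:
        # No cross-reaction and both self-interact: obligatory homomers
        # under either criterion.
        return "obligatory-homo"
    if self_a != self_b:  # exactly one paralog self-interacts
        if not stringent:
            # Flexible criterion: one homomer is enough.
            return "obligatory-homo"
        if other_partner:
            return "hetero-others"
    return None  # assumption: the pair is left unassigned
```

Calling the function on the same hetero-others pair with `stringent=False` returns "obligatory-homo", which is the mis-assignment that the text attributes to the flexible criterion.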

S. cerevisiae and E. coli orthologous proteins
To identify the orthologous S. cerevisiae and E. coli protein pairs, inter-species reciprocal protein-protein BLAST [33] searches were performed. In total, 7325 protein pairs associated with e-value < 10⁻⁵ were extracted. We then identified the subset of these pairs that comprise a homomeric protein in E. coli and an obligatory-hetero or mixed homo-hetero paralogous protein in S. cerevisiae. The domain content of these pairs, as annotated in Pfam [35], was compared, and those sharing at least one domain were extracted. These pairs were then manually checked for having the same function in the two organisms and for the shared domain corresponding to this function. When consolidated, this analysis extracted orthologous

Statistical analysis
All computational and statistical analyses were performed using in-house Python code. Graph plots were generated using OriginLab software and Adobe Photoshop.

[...] to a statistical mixture of homo- and heteromeric complexes (i). Upon further divergence, three outcomes may arise: two distinct homomeric complexes (ii), a heteromeric complex involving both paralogs (iii), or retention of the homomeric interaction in one copy and gain of new interacting partners in the other paralog (iv). (B) Our analysis aimed to identify these four different evolutionary fates. It comprised three steps: (1) The genomes of E. coli and S. cerevisiae were each scanned to identify all possible paralogous protein pairs. These pairs were classified into three categories with increasing confidence of paralog assignment (note that all categories in our analysis are inclusive, i.e., low-confidence paralogs include the medium-confidence ones, and the medium-confidence paralogs include the high-confidence pairs).
(2) Interactions of these paralogs were identified and classified into homo- and heteromeric ones. Macromolecular complexes were collected from the Protein Data Bank (PDB complexes; inter-subunit interactions were obtained from crystal structure data) and the Complex Portal database (CS and C complexes; inter-subunit interactions were predicted from the PPI data). The S. cerevisiae PPI data were extracted from seven databases, and the E. coli data from eight databases. The raw PPI data were filtered using various criteria to exclude potential false positives.

(3) Finally, based on the identified interactions, the paralogous pairs were assigned to one of the four potential fates (i-iv, panel A) with either a flexible or a stringent criterion.

Figure 2. The distribution of divergence modes of S. cerevisiae and E. coli paralogous pairs. The four divergence modes, obligatory-homo, obligatory-hetero, mixed and hetero-others, are described in Fig. 1A.

(A) The distribution of S. cerevisiae paralogous pairs in PPI data (right panel) and in curated complexes (left panel). Presented are the distributions for different stringencies of analysis, along its 3 steps (Fig. 1B). Step-1, paralog assignment, is presented in columns, shaded in green, from low-confidence paralogs in pale green to high-confidence paralogs in dark green.
Step-3, the divergence mode, is presented in rows: the top set of rows represents the flexible criterion (shaded in yellow), and the bottom rows the stringent criterion (dark yellow). The dominant divergence modes, or fates, are highlighted in darker shades of red. (B) The distribution of E. coli paralogous [...]

[...] Gene duplication of an ancestral non-ring homomer may produce a heteromeric complex that may (i) or may not (ii) retain the ancestral oligomeric order (i.e., the total number of subunits in the complex). After the first gene duplication and the subsequent emergence of a heteromeric interaction, multiple rounds of duplication may follow in which the descendant paralogs retain the heteromeric interaction (iii). For ring-like complexes, multiple rounds of intra-ring gene duplications result in heteromeric rings, while keeping (iv) or changing (v) the ancestral oligomeric order. For each mode of transition, an example case is provided.

[...] (the subset of paralogs that arose from the whole genome duplication; n = 455 pairs). Note that these plots include all paralogs, not only the ones for which molecular interaction data are available. The dotted red lines represent the identity thresholds used for defining MC (≥30% identity) and HC (≥40% identity).

Supplementary data file legends

Data S1. The S. cerevisiae molecular interaction dataset used in this study (including the list of the curated complexes and the PPI data).
Data S2. The E. coli molecular interaction dataset used in this study (including the list of the curated complexes and the PPI data).
Data S3. The inferred interaction status of S. cerevisiae paralogous pairs, in curated complexes and in PPI data. For paralogous pairs in curated complexes, deletion phenotypes are also provided.
Data S4. The inferred interaction status of E. coli paralogous pairs, in curated complexes and in PPI data.
Data S5. List of S. cerevisiae heteromeric paralogs that relate to homomeric E. coli proteins.
Data S6. The inferred interaction status of S. cerevisiae paralogous pairs in PPI data detected exclusively by methods other than two-hybrid and pull-downs.