Loss of homomeric interactions and heteromers formation is the long-term fate of duplicated homomers

Oligomeric proteins are central to life. Duplication and divergence of their genes is a key evolutionary driver, in part because duplications can yield very different outcomes. Given a homomeric ancestor, duplication can yield two paralogs that form two distinct homomeric complexes, or a heteromeric complex comprising both paralogs. Alternatively, one paralog remains a homomer while the other acquires a new partner. To delineate the evolutionary fates of duplicated oligomers, we analyzed all S. cerevisiae and E. coli oligomeric complexes that include paralogous proteins. We found that although the proportion of homo-hetero duplication fates strongly depended on a variety of factors, rigorous filtering gave a consistent picture. While in E. coli about half of the paralogous pairs are homomeric, in S. cerevisiae, a eukaryote which diverged later, only ~10% of paralogs kept the ancestral homomeric interaction. Accordingly, we show that homomeric bacterial proteins diverged to heteromeric complexes in yeast. Our analysis reconciles contradicting trends and conflicting previous analyses, and provides a deeper understanding of how the fate of duplicated genes depends on evolutionary time and protein function.


Introduction
It is estimated that more than half of all proteins form oligomers. Oligomerization is thus ubiquitous and central to protein stability, function and regulation. Duplication is also ubiquitous and hence serves as the main source of new genes/proteins, as manifested by nearly half of all genes in a given genome being paralogs (Penel et al, 2009). The duplication of genes encoding an oligomeric protein is of particular interest: the ancestral function may diverge alongside the oligomeric state, thus providing new opportunities for evolutionary innovation (Pereira-Leal et al, 2007; Hochberg et al, 2018; Marchant et al, 2019; Diss et al, 2017).

We focused our analysis on the divergence of homomers. By parsimony, the ancestors of both homomers and heteromers are homomers that are encoded by a single gene. Upon duplication of a gene encoding a homomeric ancestor, and acquisition of the very first mutation(s), in either the original gene or its new copy, a statistical mixture of homo- and hetero-meric complexes would form (Pereira-Leal et al, 2007) (Fig. 1A, (i)). Over time, further evolutionary divergence may result in three possible scenarios: (ii) loss of the capacity to cross-react and formation of two distinct homomeric complexes, or obligatory homomers; (iii) loss of the homomeric interactions and formation of a heteromeric complex, or obligatory heteromers; or (iv) asymmetric divergence of the interaction pattern, in which one paralog is retained as a homomer while the other gains a completely new interaction partner (Fig. 1A, (iv): hetero-others). Other scenarios may occur, e.g., loss of homomeric interactions in both copies, or divergence into monomers that do not interact with any other protein; however, these scenarios are intractable on a genome scale (by parsimony, their ancestors cannot be assumed to be homomers) and are probably relatively rare.

Individual cases following each of these four scenarios are known. What remains unclear, however, is which fate is the most likely. Does protein function, or the source organism, for example, affect which fate dominates? Genome-scale studies (Pereira-Leal et al, 2007; Hochberg et al, 2018; Marchant et al, 2019) attempted to address the relative frequencies of these scenarios in model organisms, but their conclusions are inconsistent. Analyzing human, Arabidopsis, yeast and E. coli protein-protein interaction (PPI) data, Hochberg et al (2018) reported that most oligomeric paralogs diverged to form obligatory homomers. However, an analysis of yeast, worm and fly, using both PPI data and oligomers of known structure (Pereira-Leal et al, 2007), indicated that heteromeric interactions dominate, a conclusion recently supported by Marchant et al (2019), who analyzed yeast PPI data. We compared these studies and observed that these inconsistencies relate to three major factors. First, different evolutionary scenarios were examined in different studies; e.g., Hochberg et al (2018) did not consider the mixed homo/heteromers, and essentially none of these studies (Pereira-Leal et al, 2007; Hochberg et al, 2018; Marchant et al, 2019) considered hetero-others. Second, different interaction datasets were analyzed, ranging from X-ray crystallographic structures (e.g., Pereira-Leal et al, 2007) to high-throughput PPI data (e.g., Hochberg et al, 2018; Marchant et al, 2019). Third, the divergence modes were assigned inconsistently. Following the definition of obligatory homomers, Pereira-Leal et al (2007) required both paralogs to be assigned as homomers, whereas others (e.g., Marchant et al, 2019) accepted a single homomeric paralog as sufficient.

Taking advantage of the extensive characterization of Saccharomyces cerevisiae and Escherichia coli macromolecular complexes, we investigated the potential evolutionary fates of their duplicated homomeric proteins. We systematically varied the stringencies of assigning paralogous pairs, of filtering molecular interaction datasets, and of assigning the divergence modes, and examined how these parameters affect the assigned proportions of homo-hetero divergence events. In S. cerevisiae, when stringent criteria were applied, a consistent picture arose, indicating that 90% of duplications resulted in heteromeric complexes. In E. coli, however, it appears that paralogs are 5 times more likely to retain their ancestral homomeric interactions. We reconciled this difference by tracking down individual complexes and showing that complexes that are homomeric in E. coli have, upon duplication, diverged to heteromeric complexes in S. cerevisiae.

A systematic approach to delineate the evolutionary fates of duplicated homomers
We analyzed the relative abundances of the four potential fates by examining the proteomes of S. cerevisiae and E. coli, for which extensive interaction data exist. As the inconsistencies between previous works illustrate, this analysis is subject to biases at each of its three steps (Fig. 1B). In the 1st step, considering only paralogous pairs with high sequence coverage and identity would enrich closely related pairs that are more likely to be detected as mixed homo-heteromers. Conversely, assigning paralogous pairs with low coverage and identity might include cases where the changes in the divergence modes are due to loss or gain of entire domains rather than divergence of preexisting interfaces. To address this bias, in the 1st step, we classified the putative paralogous pairs into three groups going from low to high confidence of paralog assignment (Fig. 1B, 1).

In the 2nd step, structures of macromolecular complexes allow interactions to be assigned with high accuracy, but crystal structures in particular create a bias in favor of homomeric interactions (Marsh & Teichmann, [...] can be noisy. Given that PPI detection methods such as yeast two-hybrid screening can have false-positive rates of up to 64% and false-negative rates of up to 71% (Edwards et al, 2002), data filtering would also substantially influence the results. Beyond random noise, there are systematic biases; for example, certain PPI methods cannot detect homomeric interactions (e.g., pulldown and MS identification of binding partners). We thus analyzed separately and compared the results from curated complex datasets (curated complexes) and high-throughput PPI data (PPI; Fig. 1B, 2). The latter were pulled together from different databases (Data S1, S2) and taken through three different filters to minimize false-positives.

Finally, in the 3rd step, the criteria for assigning the fates of paralogous pairs also matter. In principle, obligatory-homomers means that both paralogs were individually observed as homomers and that a cross-interaction was not observed (stringent criterion). Accepting a single homomeric paralog as sufficient would inevitably result in obligatory-homomers being the most frequent fate (Marchant et al, 2019). Further, as shown below, applying this flexible criterion results in assigning paralogs that actually diverged to hetero-others as obligatory-homo (Fig. 1A, iv and ii, respectively). Thus, the divergence modes of the paralogous pairs were assigned applying both stringent and flexible criteria (Fig. 1B, 3). We subsequently examined the relative frequency of the four divergence modes, or fates, as a function of the stringency of analysis in each of the 3 steps.

A few clarifying notes regarding our analysis. We addressed paralogous pairs, i.e., pairs of two genes that diverged from a common ancestor. In many cases, multiple paralogs exist that arose from two or more sequential duplications. Initially, we detected all potential pairs (Fig. 1B, step-1). Then, by assigning the divergence modes, we defined the relevant paralogous pairs (with few exceptions in the mixed category (Fig. 1A, i) where one protein can be part of more than one pair). Thus, unless otherwise stated, the statistics and the discussion below relate to gene pairs. Additionally, given that some complexes comprise multiple pairs, statistics are also provided per complex. Finally, our parsimonious assumption is that the pre-duplicated ancestor can be considered a homomer if at least one descendant paralog is a homomer, and also if both paralogs are present as a heteromer (as in (Pereira-Leal et al, 2007; Marchant et al, 2019)). The latter was subsequently confirmed by our analysis ('Yeast heteromeric paralogs diverged from bacterial homomeric ancestors').

The results of our analysis are summarized in Fig. 2, which presents the relative frequency of the four divergence modes given the dataset and stringency of analysis. The tables are arranged such that the darker the color, the higher the stringency. The results given different stringencies of paralog assignment (Step-1) are presented in columns, going from low-confidence in pale green to high-confidence paralogs in dark green.
Step-3, the stringency of assigning divergence modes, is presented in rows, with the top set of rows in yellow showing the flexible criterion, and the bottom, dark yellow rows indicating the stringent criterion. Finally, the dominant divergence modes, or fates, are highlighted in darker shades of red.

Heteromeric interactions dominate yeast paralogs
For yeast, under stringent filtering, the results from curated complexes and PPI largely converge, indicating that ~90% of yeast duplicates diverged to various heteromeric states. Specifically, stringent filtering of the PPI data (Fig. 2A, Filter-3, dark grey columns), and applying the stringent criterion for assigning the divergence modes (Fig. 2A, dark yellow rows), indicated that only about one-tenth of the paralogous pairs diverged to obligatory-homomers. Given the consistency between the two datasets, and the noise origins and biases indicated by our analysis (elaborated below), we surmise that obligatory-homo are indeed a minority in yeast (~10%) and hetero-dominance is the reality (Fig. 2A, numbers in bold; Data S3). Within the three different hetero fates, the dominant fate is obligatory-hetero (about half of the pairs in the curated complexes, and a third in the PPI data where, as expected, a larger fraction of pairs was annotated as mixed).

If we were to count unique complexes instead of gene pairs, would the picture be different? Certain heteromeric complexes are composed of multiple paralogous proteins and these could shift the balance in favor of obligatory homomers (mostly ring-like complexes such as the proteasome; further addressed below). Nonetheless, analysis of complexes showed that, under the stringent filtering criteria, and for high-confidence paralogs, complexes comprising heteromers were nearly three times more frequent than homomeric complexes (Fig. 3A). Overall, we conclude that heteromeric interactions dominate yeast paralogs, regardless of whether we count paralogous pairs or unique complexes.

Data biases and their mitigation
Our analysis also reveals various sources of error and bias, and how these could be mitigated. As expected, consistency of the two interaction datasets, curated complexes and PPI, fades away at lower stringency. Foremost, the 3rd step of the analysis, assigning the divergence modes, had a massive impact on the relative abundances of homo-hetero pairs. Assigning obligatory homomers using the flexible criterion (requiring only that one paralog is a homomer and that there is no cross-reaction) resulted in a ~5-fold increase in obligatory-homomers in the curated complexes, and a ~3-fold increase in the PPI data (Fig. 2A, light yellow rows). The reason is that under the flexible criterion, hetero-others were assigned as obligatory-homo. Thus, cases that are quite abundant in yeast, where one paralog kept the ancestral homomeric interaction and the other diverged to bind a completely new partner, were not only ignored, but also mis-assigned.

Our analysis also reflects the homo- or hetero-biases that are inherent to the source of interaction data. The homo-dominance in the curated complexes dataset primarily stems from the known bias of crystal structures to detect homomers (Marsh & Teichmann, 2015); the hetero-dominance in the PPI dataset stems from certain high-throughput methods failing to detect homomeric interactions. Indeed, for a given stringency with respect to the first two steps of the analysis (assigning paralogs, identifying interactions), homomers are more frequent in the curated complexes while heteromers dominate the PPI data (Fig. 2A). However, these biases seem to be alleviated under the stringent criterion, as both the PPI and the curated complexes give a similar distribution of fates. Thus, consideration of all four evolutionary fates, namely including both mixed homo-hetero and hetero-others, is critical, as are adequate criteria to assign them (i.e., the stringent criterion).

Two other elements seem to be critical for obtaining consistent results, both relating to the PPI data. Upon manual inspection we noticed five long terminal repeat retrotransposon families, comprising a total of 90 proteins. These paralogous mobile genetic elements of viral origin (Carr et al, 2012; Bourque et al, 2018) caused an inflation in the fraction of obligatory-homomers (~50%, which dropped to ~15% once they were removed). Further, once these retrotransposon proteins were removed (Fig. 2A, filter-1), the homo-hetero fates in the PPI data converged with those in the curated complexes (Fig. 2A, stringent criterion). Filtering of potential false-positives in the PPI data had a lesser effect. First, we required that interactions be reported in two different databases, and that interactions be detected with the protein pairs applied as both bait and prey (Fig [...]

Overall, we conclude that heteromeric interactions between paralogous pairs are the dominant fate in yeast, regardless of whether we count paralogous pairs or unique complexes.

Homomeric interactions dominate E. coli paralogs
A similar analysis of E. coli indicated that, in contrast to S. cerevisiae, for high-confidence paralogs, about 60% of the descendant pairs are obligatory-homomers in the curated complexes compared to only 30% in the PPI data (Fig. 2B, filter-2, HC, stringent criterion; Data S4). However, this inconsistency is because the [...] considerable reduction in sample sizes, yet these remained high even for high-confidence paralogs (Fig. 2A). Further, the distribution is similar for high- and medium-confidence paralogs and, with few exceptions, even for the low-confidence paralogs (highest sample size, Fig. S1B). This is not the case for the E. coli analysis. When more distantly related paralogs were removed (MC and HC columns), sample sizes decreased by >10-fold, compared to >3-fold in yeast. Indeed, in yeast, owing to the relatively recent whole genome duplication, high-confidence paralogs comprise ~60% of all detectable paralogs (1806/2907; Fig. S1B, C), while in E. coli they comprise only ~35% (Fig. S1B). Thus, it seems that medium-confidence paralogs better reflect the actual situation in E. coli.

Overall, considering the stringent criterion for assigning the divergence fates, the filtered PPI data and the curated complexes gave a consistent picture in which ~55% of the pairs are obligatory-homomers, for both medium- and low-confidence paralogs (Fig. 2B, MC and LC). Further, as in yeast, homomers also dominated when complexes were counted (Fig. 3B). Taken together, it appears that retaining the ancestral homomeric interaction is the most likely fate of E. coli gene duplications.

Note that tuning the stringencies in the E. coli analysis had similar effects as in S. cerevisiae. Filtering the PPI data for interactions reported in at least two databases, and as both bait and prey, resulted in a higher fraction of obligatory-homomers. On the other hand, assigning the divergence modes with a flexible criterion resulted in overestimation of obligatory-homomers (and a corresponding drop in obligatory-heteromers).

Yeast heteromeric paralogs diverged from bacterial homomeric ancestors
We observe the dominance of obligatory-homomers in E. coli (> 50%) while in S. cerevisiae they comprise only 10% of the duplicated oligomeric proteins, and in turn obligatory-heteromers comprise the majority. Do the differences reflect an evolutionary change over time? [...] on the other hand, diverged from ~500 million years before this endosymbiosis event (Hedges et al, 2015). This temporal order puts E. coli phylogenetically closer to the common ancestor of the two species (Hedges et al, 2015) and therefore allows us to put the differences in their homo-hetero fates in the context of an ongoing evolutionary process. Common ancestry is also reflected in about one-third of S. cerevisiae proteins, many of which are mitochondrial proteins, harboring sequence signatures of bacterial origin (Ku et al, 2015). We thus searched for the E. coli orthologs of the S. cerevisiae heteromeric paralogs, asking which are homomeric.

A systematic reciprocal BLAST was performed between all known E. coli homomers (n = 1033) and all S. cerevisiae obligatory-hetero and mixed paralogous pairs (n = 692; out of a total of 1152 LC pairs in the stringent categories, PPI dataset; Fig. 2A). Following manual curation (see Methods), we found that about a third of the heteromeric yeast paralogous pairs have E. coli homomeric orthologs (n = 235; Data S5). Of these, nearly two-thirds, 153 pairs, relate to E. coli homomers that are singletons (i.e., non-duplicated genes; a total of 52 proteins). By parsimony, these reflect cases of duplication and divergence of an ancestral bacterial homomer into paralogous heteromers in yeast. Remarkably, 42/52 of these E. coli proteins are metabolic enzymes that duplicated and diverged into heteromeric S. cerevisiae enzymes. In many such cases only one copy retained the catalytic activity whereas the other one evolved into a regulatory subunit. Examples include the mitochondrial NAD+-dependent isocitrate dehydrogenase complex (Wang et al, 2015), the trehalose synthase complex (Bell et al, 1998), the 20S proteasome core particle subunits (Bochtler et al, 1999) and the ATP-dependent 6-phosphofructokinase complex (Poorman et al, 1984; Banaszak et al, 2011). Other enzymes, such as chaperonins, HSP70 chaperones, and DNA and RNA helicases appear to have gone through multiple duplications and contribute to the hetero-dominance in S. cerevisiae.

The remaining third, 82 yeast heteromeric paralogous pairs, are orthologous to 144 obligatory homomeric pairs in E. coli (Data S5). These also relate to divergence of homomers to heteromers. What is unclear, though, is which of these genes duplicated independently in these two clades, and which diverged to heteromers in an earlier bacterial ancestor. What is clear, though, is that the dominance of heteromeric paralogs in yeast is the result of homomers duplicating and preferentially diverging into heteromers.

Our systematic analysis suggests a continuous evolutionary process of homomeric proteins gradually duplicating and diverging into heteromeric proteins. This ongoing evolutionary transition also validates our assignment of the fundamentally different divergence modes of paralogous pairs in E. coli and S. cerevisiae (Fig. 2). Assuming E. coli and S. cerevisiae are representative of bacteria and single-cell eukaryotes, the gene duplications that occurred since their divergence led to a 5-fold decrease in the abundance of homomers among paralogous proteins. Because paralogous proteins comprise nearly half of the proteomes, this phenomenon has led to a complete shift from the prevalence of homomers in prokaryotes to heteromers in eukaryotes (Marsh & Teichmann, 2014; Bergendahl & Marsh, 2017).

The transition of homomeric prokaryotic complexes into eukaryotic heteromeric ones was previously noted for individual protein families, and especially for ring-like complexes such as DNA/RNA helicases [...] examining our dataset revealed that both ring-like and non-ring-like prokaryotic homomers evolved into heteromeric complexes in eukaryotes, through either single or multiple gene duplications (Fig. 4, Data S5). Thus, the dominance of heteromeric paralogs in S. cerevisiae is not only because the ancestral homomers duplicated and diverged into heteromers, but also because heteromeric paralogs further duplicated and their descendants retained the heteromeric state.

For non-ring-like complexes, a single gene duplication typically results in a single eukaryotic heteromeric complex that may or may not retain the ancestral oligomeric order (total number of complex subunits). For example, E. coli DNA mismatch repair endonuclease MutL is a homodimer, and the yeast ortholog is a heterodimer (Meldal et al, 2019) (Fig. 4, i). On the other hand, the bacterial homodimeric isocitrate dehydrogenase (Wang et al, 2015) duplicated and diverged into a hetero-octameric mitochondrial isocitrate dehydrogenase in yeast (Meldal et al, 2019); namely, the oligomeric order changed from 2 to 8 (Fig. 4, ii). In this case, duplication and divergence into a heteromer offered the opportunity of evolving a new regulatory mode by diversifying one subunit, while the other subunit kept the catalytic activity.

As a prokaryotic non-ring-like homomer evolves into a heteromer in eukaryotes, multiple rounds of duplication may occur and the descendant paralogs retain the newly evolved heteromeric interaction (Fig. 4, iii). For example, the bacterial homomeric Hsp70 that duplicated and diverged into Hsp110 co-[...]

For ring-like prokaryotic homomeric complexes (e.g., helicase, protease, RNase and chaperonins), homo-to-hetero transitions also occurred, either retaining or modifying the ancestral oligomeric order (Fig. 4, iv, v). Complexes that have retained their ancestral oligomeric order (Fig. 4, iv) include the archaeal homo-hexameric MCM complex that became hetero-hexameric in eukaryotes (Bochman & Schwacha, 2009), and the core proteasome alpha- and beta-rings that remained heptameric (Gille et al, 2003; Wollenberg & Swaffield, 2001). In contrast, the bacterial helicase homo-hexameric Hfq ring-complex (Sauter et al, 2003) diverged to the hetero-heptameric Lsm1-7 and Lsm2-8 complexes in yeast (Fig. 4, v).

The above-described phenomena that underlie homo-to-hetero transitions raise some interesting questions. One is the functional implication: in E. coli, duplications primarily yield obligatory homomers, with each paralog mediating a different enzymatic function (typically different substrate specificity). In yeast, however, the obligatory heteromers seem to be associated with acquisition of new regulatory modes. Thus, function may dictate the fate of the oligomeric state. Another factor might be the location of the active site, which in some enzymes resides within the subunits and in others at the interface between subunits (Abrusán et al, 2019). Also of note is that, in principle, divergence into a heteromeric interaction increases the likelihood that both copies would fix in the genome, because loss of one copy leads to non-functionalization. Duplication itself is random, yet whether a duplicate is fixed or lost (the far more likely fate) depends on how rapidly it provides a selectable advantage. Gene knockout experiments support this hypothesis: deletion of one copy is highly deleterious in heteromers, while for obligatory homomers deletion of one copy often has little effect (Data S3).

Future work might address the above and other questions, and may also track down other possible evolutionary transitions; e.g., the dominating trend is homo-to-hetero transitions, yet can we track down cases of heteromers that diverged to homomers? Addressing these questions will demand detailed phylogenies and experimental evaluation of the oligomeric states before and after the duplication.

The 1st step of our analysis identified all S. cerevisiae and E. coli paralogous protein pairs (Fig. 1B). To this end, all-versus-all intra-species protein-protein BLAST (Altschul et al, 1990) was performed across their respective proteomes, obtained from the NCBI Genome Database (Benson et al, 2013). BLAST hits associated with at least 25% identity and 40% query coverage were manually inspected and assigned as putative paralogous pairs (3958 pairs in S. cerevisiae and 2090 pairs in E. coli, Fig. S1). These pairs were further classified into three overlapping groups, with increasing stringency of paralog assignment, Low-[...] all HC pairs. Pfam uses Hidden Markov Models to identify domains and every annotated instance is given a probability score (p-value). Any domain assigned with p < 10^-5 significance was considered for further analysis. Following domain assignments, paralogous pairs were compared and those that differ in their domain content were discarded. The list of 455 S. cerevisiae ohnologs (paralogs emerging from the whole genome duplication; Fig. S1) was collected from the Yeast Gene Order Browser (Byrne & Wolfe, 2005).
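The identity and coverage cutoffs used to nominate putative paralogous pairs can be illustrated with a minimal Python sketch. This is not the in-house code used in the study: the `Hit` record, its field names (borrowed from BLAST+ tabular output specifiers) and the function name are assumptions made for illustration, and the manual inspection and the confidence-group classification are not reproduced.

```python
from typing import Iterable, NamedTuple, Set, FrozenSet


class Hit(NamedTuple):
    """One intra-species BLAST hit (hypothetical record layout)."""
    query: str
    subject: str
    pident: float  # percent identity over the alignment
    qstart: int    # 1-based alignment start on the query
    qend: int      # 1-based alignment end on the query
    qlen: int      # full length of the query protein


def putative_paralog_pairs(
    hits: Iterable[Hit],
    min_identity: float = 25.0,   # cutoff stated in the text
    min_coverage: float = 40.0,   # cutoff stated in the text
) -> Set[FrozenSet[str]]:
    """Return unordered protein pairs passing both cutoffs."""
    pairs: Set[FrozenSet[str]] = set()
    for h in hits:
        if h.query == h.subject:
            continue  # a protein is not its own paralog
        # Query coverage: aligned span as a percentage of the query length.
        coverage = 100.0 * (abs(h.qend - h.qstart) + 1) / h.qlen
        if h.pident >= min_identity and coverage >= min_coverage:
            # frozenset makes (A, B) and (B, A) the same pair.
            pairs.add(frozenset((h.query, h.subject)))
    return pairs
```

Storing each pair as an unordered set means that a hit found in either direction of the all-versus-all search is counted once.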

Identifying molecular interactions
Curated complexes. Curated homo- and hetero-meric macromolecular complexes of both S. cerevisiae and E. coli were collected from the Protein Data Bank (Berman et al, 2007), the 3D Complex database (Levy et al, 2006) and Complex Portal (Meldal et al, 2019). Complexes that include at least one protein annotated as a paralog were classified into three groups, with increasing stringency of curation accuracy (Data S1, S2). The first group, C complexes, comprises 127 S. cerevisiae and 18 E. coli complexes annotated in Complex Portal, for which only the subunit composition data are available (subunit stoichiometry is either unknown or only partially known). The second group, CS complexes, comprises 83 S. cerevisiae and 33 E. coli complexes annotated in Complex Portal, for which both subunit composition and stoichiometry data are available. The third group, PDB complexes, includes 167 S. cerevisiae and 117 E. coli complexes collected from the Protein Data Bank, for which subunit composition, stoichiometry as well as interaction patterns are known. The subunit stoichiometry of PDB complexes was further cross-validated by 3D Complex database (Levy et al, 2006) annotations.

[...] removed. In the 2nd step, to minimize false-positives in the PPI data, we demanded that the interaction between two proteins be observed using both proteins as bait and as prey, and that the interaction be reported in at least two databases. In the 3rd filtering step, interactions between two proteins localized in different sub-cellular compartments were excluded. For this, yeast protein localization data obtained from [...] datasets derived after each step of filtering are provided in Data S1. Transposon elements were absent in the E. coli raw PPI data and filtering involved only one step (interactions must be reported for both proteins as bait and as prey, and in at least two databases). This yielded a final PPI dataset of 1996 pairwise interactions (Data S2).
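The bait/prey and two-database requirement can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the pipeline used in the study: the `(bait, prey, database)` record layout and the function name are hypothetical, and a self-interaction is assumed to satisfy the bait/prey requirement trivially because both orientations coincide.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (bait, prey, database), hypothetical layout


def filter_ppi(records: Iterable[Record],
               min_databases: int = 2) -> Set[FrozenSet[str]]:
    """Keep interactions seen in both orientations and in enough databases."""
    orientations: Dict[FrozenSet[str], Set[Tuple[str, str]]] = defaultdict(set)
    databases: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for bait, prey, db in records:
        pair = frozenset((bait, prey))  # one element for a self-interaction
        orientations[pair].add((bait, prey))
        databases[pair].add(db)

    kept: Set[FrozenSet[str]] = set()
    for pair, seen in orientations.items():
        # Two distinct proteins must appear as (A, B) and as (B, A);
        # assumption: a homomeric interaction passes this check as is.
        reciprocal = len(pair) == 1 or len(seen) == 2
        if reciprocal and len(databases[pair]) >= min_databases:
            kept.add(pair)
    return kept
```

Here a homomeric interaction is represented by a one-element set, which lets a single data structure hold both self- and cross-interactions.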

Assigning the interaction status of paralogous pairs
Based on the interactions in the above-described molecular interaction datasets, paralogous pairs were assigned to one of the four categories described in Fig. 1A. Three of these are obligatory hetero (the two paralogs do not self-interact, but cross-react to form a heteromer), mixed homo/hetero (two paralogs cross-react to form a heteromer, and at least one paralog also self-interacts), and hetero others (only one paralog self-interacts and the other interacts with another, non-paralogous partner). The fourth, obligatory homomers, was assigned using a stringent and a flexible criterion. The stringent criterion demanded that the two paralogs do not cross-react, and that both self-interact; the flexible criterion demanded that the two paralogs do not cross-react and at least one of them self-interacts.

PDB structures and PPI data, by definition, comprise physical interaction data between proteins. For CS and C complexes, inter-subunit interactions were predicted from the PPI data. A protein was assigned as a homomer if it is present in multiple copies in a complex, and also self-interacts in the PPI data. Heteromers were assigned if both paralogs co-occur in a complex and were found to cross-interact, but not self-interact, in the filtered PPI dataset. For obligatory homomers in the curated complexes, we also ensured that the two paralogs do not cross-interact in the PPI data.
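The assignment rules above can be written as a small decision function. The following Python sketch is one possible reading of those rules, not the authors' implementation: the function names and the representation of interactions as unordered sets are assumptions, and "non-paralogous partner" is simplified to "any partner outside the pair".

```python
from typing import FrozenSet, Optional, Set

Interactions = Set[FrozenSet[str]]  # {A} = self-interaction, {A, B} = cross


def _has_outside_partner(protein: str, pair: Set[str],
                         interactions: Interactions) -> bool:
    """True if the protein interacts with something outside the pair."""
    return any(protein in i and not i <= pair for i in interactions)


def classify_pair(a: str, b: str, interactions: Interactions,
                  criterion: str = "stringent") -> Optional[str]:
    """Assign a paralogous pair to one of the four fates, or None."""
    if criterion not in ("stringent", "flexible"):
        raise ValueError("criterion must be 'stringent' or 'flexible'")
    cross = frozenset((a, b)) in interactions
    self_a = frozenset((a,)) in interactions
    self_b = frozenset((b,)) in interactions

    if cross:
        return "mixed" if (self_a or self_b) else "obligatory-hetero"
    if self_a and self_b:
        return "obligatory-homo"  # satisfies both criteria
    if self_a != self_b:  # exactly one paralog self-interacts
        if criterion == "flexible":
            return "obligatory-homo"  # one homomer is accepted as enough
        other = b if self_a else a
        if _has_outside_partner(other, {a, b}, interactions):
            return "hetero-others"
    return None  # no fate can be assigned from these data
```

Running the same pair under both criteria shows how the flexible criterion relabels hetero-others as obligatory-homo, the mis-assignment discussed in the Results.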

S. cerevisiae and E. coli orthologous proteins
To identify the orthologous S. cerevisiae and E. coli protein pairs, inter-species reciprocal protein-protein BLAST (Altschul et al, 1990) searches were performed. In total, 7325 protein pairs associated with e-value < 10^-5 were extracted. We then identified the subset of these pairs that comprise a homomeric protein in E. coli and an obligatory-hetero, or mixed homo-hetero paralogous protein in S. cerevisiae. The domain content of these pairs, as annotated in Pfam (Finn et al, 2014), was compared and those sharing at least one domain were extracted. These pairs were then manually checked for having the same function in the two organisms and for the shared domain corresponding to this function. When consolidated, this analysis extracted orthologous relationships between 103 E. coli homomeric proteins (52 singletons and 144 paralogous pairs) and 421 paralogous S. cerevisiae proteins (235 pairs; Data S5).
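The reciprocal search can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the `(query, subject, e-value)` tuple layout and the function name are hypothetical, the default cutoff of 1e-5 is an assumed value exposed as a parameter, and any hit below the cutoff in both directions is accepted because the text does not state whether only best hits were retained. The Pfam domain comparison and the manual curation are not covered.

```python
from typing import Iterable, Set, Tuple

BlastHit = Tuple[str, str, float]  # (query, subject, e-value), hypothetical


def reciprocal_hits(yeast_vs_ecoli: Iterable[BlastHit],
                    ecoli_vs_yeast: Iterable[BlastHit],
                    max_evalue: float = 1e-5) -> Set[Tuple[str, str]]:
    """Return (yeast, E. coli) pairs that hit each other in both searches."""
    forward = {(q, s) for q, s, e in yeast_vs_ecoli if e < max_evalue}
    # Flip the second search so both sets are keyed as (yeast, E. coli).
    backward = {(s, q) for q, s, e in ecoli_vs_yeast if e < max_evalue}
    return forward & backward
```

The resulting candidate pairs would then be intersected with the homomer and heteromer assignments before any domain-level comparison.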

Statistical analysis
All computational and statistical analyses were performed using in-house Python code. Graphs were generated using OriginLab software and Adobe Photoshop.

[...] leads to a statistical mixture of homo- and heteromeric complexes (i). Upon further divergence, three outcomes may arise: two distinct homomeric complexes (ii), a heteromeric complex involving both paralogs (iii), or loss of homomeric interaction in one copy, and gain of new interacting partners in the other paralog (iv).

(B) Our analysis aimed to identify these four different evolutionary fates. It comprised three steps: (1) The genomes of E. coli and S. cerevisiae were each scanned to identify all possible paralogous protein pairs. These pairs were classified into three categories with increasing confidence of paralog assignment (note that all categories in our analysis are inclusive, i.e., low-confidence paralogs include the medium-confidence ones, and the medium-confidence paralogs include the high-confidence pairs).
(2) Interactions of these paralogs were identified and classified into homo- and heteromeric ones. Macromolecular complexes were collected from the Protein Data Bank (PDB complexes, inter-subunit interactions were obtained from crystal structure data) and the Complex Portal database (CS and C complexes, inter-subunit interactions were predicted from the PPI data). The S. cerevisiae PPI data were extracted from seven databases, and the E. coli data from eight databases. The raw PPI data were filtered using various criteria to exclude potential false-positives. (3) Finally, based on the identified interactions, the paralogous pairs were assigned to one of the four potential fates (i-iv, panel A) with either a flexible or a stringent criterion.

Figure 2. The distribution of divergence modes of S. cerevisiae and E. coli paralogous pairs. The four divergence modes, obligatory-homo, obligatory-hetero, mixed and hetero-others, are described in Fig. 1A.
(A) The distribution of S. cerevisiae paralogous pairs in PPI data (right panel) and in curated complexes (left panel). Presented are the distributions for different stringencies of analysis, along its 3 steps (Fig. 1B).
Step-1, paralog assignment, is presented in columns, shaded in green, from low-confidence in pale green to high-confidence paralogs in dark green.
Step-3, the divergence mode, is presented in rows: the top set