Humanization of yeast genes with multiple human orthologs reveals principles of functional divergence between paralogs

Despite over a billion years of evolutionary divergence, several thousand human genes possess clearly identifiable orthologs in yeast, and many have undergone lineage-specific duplications in one or both lineages. The ortholog conjecture postulates that orthologous genes between species retain ancestral functions despite divergence over vast timescales, but duplicated genes will be free to diverge in function. However, the retention of ancestral functions among co-orthologs between species and within gene families has been difficult to test experimentally at scale. In order to investigate how ancestral functions are retained or lost post-duplication, we systematically replaced hundreds of essential yeast genes with their human orthologs from gene families that have undergone lineage-specific duplications, including those with single duplications (one yeast gene to two human genes, 1:2) or higher-order expansions (1:>2) in the human lineage. We observe a variable pattern of replaceability across different ortholog classes, with an obvious trend towards differential replaceability inside gene families, rarely observing replaceability by all members of a family. We quantify the ability of various properties of the orthologs to predict replaceability, showing that in the case of 1:2 orthologs, replaceability is predicted largely by the divergence and tissue-specific expression of the human co-orthologs, i.e. the human proteins that are less diverged from their yeast counterpart and more ubiquitously expressed across human tissues more often replace their single yeast ortholog. These trends were consistent with in silico simulations demonstrating that when only one ortholog is replaceable, it tends to be the least diverged of the pair. Replaceability of yeast genes having more than two human co-orthologs was marked by retention of orthologous interactions in functional or protein networks as well as by more ancestral subcellular localization. Overall, we performed >400 human gene replaceability assays revealing 56 new human-yeast complementation pairs, thus opening up avenues to further functionally characterize these human genes in a simplified organismal context.


Introduction
Duplication of existing genes is regarded as a major contributing factor to the production of new 27 genetic material [1]. Because of the immediate functional redundancy and dosage increase created 28 following duplication, the most common fate of duplicated genes is loss of one functional copy, 29 thus, returning to the ancestral state [2]. Rarely, however, duplicate gene copies either retain their 30 ancestral roles or eventually diverge toward new functions. In the case of retained duplicates, 31 several theories have been proposed to explain how multiple gene copies are maintained and their 32 evolution post-duplication, though no clear consensus has been reached [3,4]. Sub-and neo-33 functionalization are two of the most commonly discussed theories regarding the evolutionary 34 mechanisms leading to retention of duplicated genes. Generally, sub-functionalization posits that 35 the function of the ancestral singleton is subdivided between its duplicates, leading to their mutual 36 retention. Sub-functionalization can occur as a purely neutral process where degenerative 37 mutations affect different subfunctions of both copies, resulting in their mutual retention in order 38 to perform all functions of the ancestral singleton [5,6]. Sub-functionalization can also occur when 39 duplicates acquire mutations that optimize subfunctions [7]. Neo-functionalization holds that the 40 duplicated copy of a gene is freed from any strong selection, and becomes subject to drift until it 41 is either lost through pseudogenization (a result sometimes termed 'non-functionalization') or 42 acquires a new function that is fixed via positive selection [1]. 43 Lineage-specific duplications also have a major impact on how homologous genes between species 44 are related and identified. Two homologous genes related by gene duplication are termed paralogs, 45 previous observations that centrally placed proteins in interaction networks tend to carry out 246 crucial cellular functions [31] that have been retained over vast evolutionary timescales. 247 We further sought to identify features that distinguished replaceable co-orthologs from non-248 replaceable ones within the same orthogroup for the 1:>2 set. We again considered median features 249 of both replaceable versus non-replaceable co-orthologs within those 1:>2 orthogroups that 250 showed differential replaceability. No features showed an AUC more than two standard deviations 251 above the mean of permutation tests, likely due to the small size of this specific orthogroup set. 252 Nonetheless, the strongest trend observed was for replaceable co-orthologs to be localized in more 253 similar subcellular compartments [32] in human cells as the yeast ortholog(s) ( Figure 5B). Our 254 observations are consistent with a model that at least one co-ortholog in an expanded human gene 255 family will tend to retain essential ancestral functionality by maintaining ancestral cellular 256 localization and ancestral centrality in the functional interaction network of the cell ( Figure 5C). 257

Simulations suggest more diverged duplicates are less likely to bind their ancestral 258 interaction partners 259
To assess the generality of our observations, we performed in silico simulations of functional 260 divergence in a duplicated gene family. We analyzed a small heterodimeric protein complex 261 consisting of A and B subunits (the SMT3-UBC9 protein complex [18,33]), where the subunit B 262 was duplicated in silico to yield A, B, and B'. Using binding of A to B and/or B' as a proxy for 263 functionality, we carried out evolutionary simulations of molecular structural divergence 264 (considering all atom models using the Rosetta molecular modeling suite [34,35]. As described in 265 [36], we examined functional replaceability across five different selection scenarios. All of the 266 simulations assumed that selection acts on the stability of each subunit but differed in how they 267 imposed selective pressure on binding. We ran 100 replicates for each of the five selection 268 schemes, and quantified replaceability of each of the subunits (measured as continued ability to 269 bind to the ancestral partner) over sequence divergence ( Figure 6A, B). Notably, selection for A 270 to bind both B and B' results in a continued ability for either B or B' to bind the ancestral variant 271 A, whereas application of diversifying selection to prevent binding to B' results in a rapid decay 272 in the ability of B' to bind A ( Figure 6B). 273 We then looked across our simulated lineages at those cases where one of the duplicates 274 functionally replaces (i.e. binds to the ancestor) and the other does not. We found a systematic 275 pattern, across all five selection schemes, that the non-binding duplicate tends to be the more 276 diverged one ( Figure 6C; all pairwise comparisons within selection schemes are significant, p < 277 0.001 paired t-test). These results mirror our experimental humanization findings for the relative 278 divergence of 1:2 human co-orthologs that the replaceable duplicate tends to be less diverged than 279 the non-replaceable duplicate (Figure 4). 280

Conclusions 281
By extending the scope of our systematic yeast humanization assays to include those yeast genes 282 that have more than one human ortholog, we successfully added 309 human genes to our tested set 283 (a 73% increase). We have therefore greatly expanded the set of human genes that can now be 284 functionally studied in the simplified unicellular eukaryotic context of budding yeast by adding 80 285 human genes (56 novel to this dataset) to those that can successfully replace their yeast ortholog. 286 Overall, we found that yeast genes with duplicated human orthologs can be replaced by at least 287 one co-ortholog at a slightly lower rate (40%) than 1:1 orthologs [18]. Of those that could be 288 replaced, the clear majority were replaceable only by one or two co-orthologs, rather than being 289 broadly replaceable by many human genes in the same family. This pattern of replaceability 290 between and within orthogroups supports the long-held hypothesis that orthologs between species 291 will retain ancestral functions at a higher rate than paralogs within a species [10,11]. 292 After testing many properties of the gene families for their ability to explain replaceability, our 293 analysis revealed divergent patterns across groups that have two (1:2) or more than two human 294 (1:>2) human co-ortholog members. In the case of the 1:2 class, the top predictors were dominated 295 by features that capture divergence from the yeast ortholog, in particular we observed that the less-296 diverged variant tended to strongly be the replaceable one. This observation was supported by 297 computational simulations of protein divergence after duplication, where the least diverged of the 298 duplicates retained ancestral binding ability in most cases. In addition to sequence divergence, we 299 also observed a significant trend of replaceable 1:2 human co-orthologs to have broader, less 300 tissue-specific expression. Somewhat rarely, the more-diverged human co-ortholog could replace, 301 13 and these tended to be more similar to their yeast counterpart and expressed at a higher level 302 perhaps due to retaining important ancestral functions. In the case of the 1:>2 set, replaceability 303 was marked largely by network-based properties of the genes. Replaceable co-orthologs in this set 304 seem to have retained more ancestral interaction partners as well as higher centrality, even across 305 a billion years of divergent evolution, likely indicative of their functional importance. While not 306 significant by our criteria, we also observed replaceable 1:>2 co-orthologs being expressed in 307 similar subcellular compartments to their corresponding yeast gene, highlighting a trend to 308 maintain their ancestral localization. 309 Overall, our extended set of humanization assays and their analysis reveal principles of functional 310 divergence in paralogs and orthologs. We observed a strong tendency for orthogroups to exhibit 311 only one or a few swappable human genes rather than many. We also extended our observation 312 that network centrality and interaction properties aid in determining how ancestral gene function 313 across orthologs is retained over deep evolutionary timescales. 314 Such assays help to advance our understanding of duplicate gene evolution in the billion years 315 since yeast and humans diverged [37] and add powerful reagents to study myriad human processes 316 and develop therapies in a simpler eukaryotic surrogate. 317

Identifying orthologs 319
Orthologs were calculated with a local installation of InParanoid [25], using UniProt proteomes 320 for the two species (downloaded November 2014). The InParanoid script was modified to use a 321 more recent version (2.2.25+) of the BLAST+ suite of programs [38]. 322 InParanoid identifies orthogroups between two species by first performing an all-vs-all BLAST 323 search between the two species to identify bidirectional best-hits (BBH). Each proteome is then 324 subjected to an all-vs-all BLAST against itself to identify within-species homologs. The BBH pairs 325 are used as seed matches and any within-species genes from the self-BLAST that are at least as 326 close to the gene of interest as its BBH in the other species is added to the ortholog group and 327 termed an 'in-paralog' or co-ortholog. We used the 'table' output of InParanoid to identify 328 14 orthogroup classes as follows: Those groups with one listed yeast gene and two human genes were 329 dubbed one to two (1:2) orthologs, while those with one yeast gene and more than two human 330 genes were identified as one to more-than-two (1:>2) orthologs. Together, these two sets make up 331 the one to many (1:M) ortholog class. 332

ORFeome cloning 333
Human genes were obtained from the human ORFeome collection [26]. The ORFeome comprises 334 a collection of E. coli strains, each harboring a plasmid encoding a single human gene in a Gateway 335 'entry' vector. Sequences cloned in the entry vectors are flanked by attL sites. To create expression 336 vectors, each human entry vector was isolated from E. coli, added to a Gateway LR reaction with 337 a Gateway 'destination' vector and transformed into competent E. coli to obtain expression clones 338 [39]. As ORFeome clones lack stop codons, we modified the Advanced Yeast Gateway kit [28] 339 pAG416-GPD-ccdB destination vector which does not encode a stop codon immediately outside 340 of the cloning region resulting in a ~60 amino acid tail being added to any protein expressed from 341 it. We thus mutagenized the vector downstream of the cloning region to introduce a stop codon, 342 shortening the tail to six amino acids (this plasmid is termed pAG416-GPD-ccdB+6Stop) [18]. 343 Entry and expression clones were verified by sequencing into the gene sequence from the upstream 344 and downstream region of the plasmid. 345

MGC cloning 346
For human genes not available in the ORFeome, we obtained clones from the Mammalian Gene 347 Collection [27], a collection of sequence-verified human cDNA sequences. To obtain entry vectors 348 for these genes, we designed primers for each gene that would amplify the coding sequence while 349 adding attB sites to either end of the human gene and performed PCR using the MGC plasmid as 350 template. PCR products were combined with plasmid pDONR221 in a Gateway BP reaction [39]   harboring one allele of a yeast gene knocked out by replacement with the KanMX kanamycin-374 resistance cassette, allowing for selection on G418 [29]. We transformed human expression clones 375 or an empty control vector into appropriate strains and selected on −Ura G418 medium in a 96-376 well format. Transformants were then re-plated on GNA-rich pre-sporulation medium containing 377 G418 and 50 mg/L histidine. Individual colonies were then inoculated in liquid sporulation 378 medium containing 0.1% potassium acetate, 0.005% Zinc acetate, and incubated with vigorous 379 shaking at 26°C for 3-5 days, after which sporulation efficiency was estimated by microscopy, the 380 mixture then re-suspended in water and equally plated on two assay conditions: 381

Tetrad dissection and plasmid loss assays 397
For human gene replaceability assays performed in the yeast heterozygous knockout Magic 398 Marker collection that were ambiguous in our large-scale screen, we performed tetrad dissections 399 to more clearly test for complementation ( Figure S2). In total, 33 human genes were assayed and 400 analyzed. We transformed each human expression clone or empty vector control into the 401 appropriate yeast strains and selected on SC-Ura + G418 (200µg/ml) to select for the human gene 402 expression vector (CEN, Ura+) and yeast gene knockout (KanMX marker) simultaneously. 403 Transformants were then plated on GNA-rich pre-sporulation media containing G418 (200μg/ml). 404 Individually isolated colonies were inoculated into liquid sporulation medium containing 0.1% 405 potassium acetate, 0.005% Zinc acetate and were incubated with vigorous shaking at 25˚C for 3-5 406 days. Following this, sporulation efficiency was estimated by microscopy, and successful 407 sporulations were subjected to tetrad analysis. 15-20 μL of each sporulation was digested with an 408 equal volume of Zymolyase (5 mg/ml stock) for 30-45 min to remove the ascus coats. The 409 digestions were diluted 1:1 with sterile water after which 20-30 μL of the zymolyase-treated spores 410 were carefully applied to a tilted YPD plate using a pipette, allowing the droplet of cell suspension 411 plates to obtain single colonies. Each colony was tested for plasmid loss in the presence of 5FOA. 428 Colonies that resisted plasmid loss were isolated and subjected to further analysis to quantitatively 429 measure their growth rates. Yeast strains were either pre-cultured in liquid YPD + G418 430 (200µg/ml) or -Ura Dextrose selective medium + G418 (200µg/ml) for 2 hrs or overnight 431 respectively. Each culture was diluted in YPD or -Ura Dextrose medium to an OD of ~0.1 in 100 432 or 150 μL total volume in a 96-well plate. Plates were incubated in a Synergy H1 shaking 433 incubating spectrophotometer (BioTek), measuring the optical density every 15 min over 48 hr. 434 Growth curves were performed in triplicate for each strain by splitting the pre-culture into three 435 independent cultures for each 48-60 hr time course (Figures S3A, S4A and S4C). 436 In the case of temperature-sensitive humanized yeast strains, the growth assays were performed 437 first at permissive temperatures 25-26°C for 48-60 hr time course. These growth assays were 438 largely identical to the empty vector transformed yeast strains. The strains where then shifted to a 439 18 restrictive temperature of 37°C and similarly repeated the growth assay for a 48-60 hr time course 440 ( Figures S3B and S4B). Growth curves were performed in triplicate for each strain by inoculating 441 cells from the same pre-culture into three independent cultures. 442 For computational analyses of the trends underlying replaceability, we computed a diverse set of 443 features (Supplementary File 3), based in part on those from Kachroo et al. [18], as follows: 444

Sequence properties 445
Sequence features for human genes obtained from the human ORFeome [26]  Orthologous pairs were first identified by InParanoid [25]. Global alignments for all pre-identified 469 ortholog pairs were then calculated using NWalign (http://zhanglab.ccmb.med.umich.edu/NW- OrthoRank refer to the scores assigned by InParanoid. For each orthogroup, InParanoid calculates 476 a confidence score between 0 and 1 for each in-paralog that represents how similar it is to the seed 477 ortholog. Rank is simply a ranked ordering of the inparalogs of a group based on their orthoscore, 478 with 1 being the least diverged ortholog and higher values being further away from the seed.  were defined as follows: Degree represents the count of interaction partners for a node in a given 495 network. Betweenness represents network centrality, a measure of how central in a network a given 496 node is, calculated as the number of shortest paths between all node pairs in a network that pass 497 through a given node. Clustering represents the node clustering coefficient, calculated as the 498 fraction of edges that could possibly be present in a node's neighborhood that are actually present. 499 FractionComplementing is the fraction of interaction partners observed to complement (including 500 results obtained in our 1:1 humanization assays [18]). FractionOrthologPartners is the fraction of 501 a genes interaction partners that have orthologs in the other species, and interact with the gene of 502 interests' ortholog in the corresponding network (i.e. if gene Sc-A interacts with Sc-X, Sc-Y, and 503 Sc-Z, and Hs-A interacts with Hs-X and Hs-Y, but not Hs-Z (which is a legitimate gene), the 504  [45]. Translation efficiency was calculated as the 526 ratio of RPF reads to mRNA reads. SubCellHamming is a measure of the difference in subcellular 527 localization between a yeast-human ortholog pair, calculated using data from the 528 COMPARTMENTS database ( [32], downloaded in February 2017). We utilized the 'benchmark' 529 sets to create, for each protein in their respective species, a binary vector of subcellular localization 530 across 11 compartments (Cytoskeleton, Cytosol, Endoplasmic Reticulum, Extracellular space, 531 Golgi apparatus, Lysosome, Mitochondrion, Nucleus, Peroxisome, Plasma membrane) that were 532 common to both human and yeast in the database. Then, for each ortholog pair, we calculated the 533 hamming distance between their respective subcellular localization vectors as the 534 SubCellHamming value. 535

Calculating predictive strength of features 536
The predictive power of each feature was calculated as the area under the receiver-operator 537 characteristic curve (AUC) when treating that feature as an individual classifier. Each feature was 538 sorted in both ascending and descending directions, retaining the direction providing an AUC > 539 0.5. To assess significance, a permutation procedure was performed as follows: For each feature, 540 the replaceable/non-replaceable status of each ortholog pair was shuffled (retaining the original 541 ratio of replaceable to non-replaceable assignments), and the AUC was calculated. The shuffling 542 procedure was carried out 1,000 times for each feature, and the mean AUC values and their 543 standard deviations reported. 544 Features of expanded orthogroups 545

22
In order to avoid overweighting expanded gene families and compensate for uneven sampling, we 546 considered median properties of paralogs in each family as follows: 547 For each orthogroup, we collapsed all yeast-human ortholog pairs with the same status (replaceable 548 or not) into a single case, with the value of each feature was calculated as the median value of the 549 collapsed ortholog pairs for that feature from the original table. Thus, each 1:M orthogroup is 550 collapsed to be represented by either two pairs (Complement and Non-complement) or one pair 551 (either Complement or Non-complement). 552

Simulations 553
We constructed a simulation of protein evolution where one member of an interacting pair was 554 duplicated. The simulation was initialized with the yeast SMT3-UBC9 complex, a small 555 heterodimeric protein complex, as the resident genotype (PDB: 2EKE) [33]. This complex initially 556 had two subunits, which we refer to as A and B. We duplicated the B subunit and refer to it as B'. 557 Our simulation protocol and setup [36] is based on an accelerated origin-fixation model [18,46]. 558 Here, we further analyzed five of the previously published sets of 100 simulation trajectories [36] 559 (summarized in the Figure 6A Selection Schemes panel). Briefly, under the first selection scheme 560 we enforced selection for A to bind B and for A to bind B' (bind both). In the second scheme 561 selection acts on the ability of A to bind B or A to bind B' and the maximum stability of those 562 interactions was considered (bind max). In the third scheme the ability of A to bind B was selected 563 for but the ability of A to bind B' was not (bind B). In the fourth scheme the ability of A to bind B 564 was also selected for but the ability of A to bind B' was selected against (bind B and not B'). We 565 also implemented a control selection scheme where selection does not act on the ability of A to 566 bind either B or B' (no bind) We performed 100 replicates of each of these selection schemes. 567 The percent of simulations where an evolved duplicate has the ability to bind the ancestral partner 568 ( Figure 6A, B) correspond to Figure S7A, B in [36]. We further analyzed these data by examining 569 instances where only one of the duplicates is able to bind the ancestral partner at the end of the 570 simulation run. We recorded each duplicate's divergence, defined as the fraction of amino acid 571 positions that were non-identical between the initial and final sequences of a simulation run 572 ( Figure 6C). For each selection scheme, the distribution of the divergence of duplicates that can 573 bind the ancestral partner was compared to the distribution of divergence of duplicates that cannot 574 bind the ancestral partner with a paired t-test.  results for 424 pairs with no duplications in either yeast or human lineage (i.e. with 1:1 orthology) 593 [18]. In this study, we tested the remaining set of essential yeast genes that have acquired lineage 594 specific duplications, classifying them as 1:M (1 yeast to 2 or more human co-orthologs) or M:M 595 (>=2 yeast to >=2 human co-orthologs) or M:1 (>=2 yeast to 1 human ortholog). There are 140 596 essential yeast genes with more than one human ortholog, representing 378 ortholog pairs to be 597 replacing their yeast gene counterpart equally well is shown. The yeast growth assays for these 613 replaceable human genes is shown on the right for each. Haploid yeast gene deletion strains 614 carrying plasmids expressing functionally replacing human genes (colored solid-lines) generally 615 exhibit comparable growth rates to the wild type parental yeast strain BY4741 (black dotted-lines). 616 Plotted growth curves display the mean of triplicate growth experiments. 617 only three (of 25 tested) human genes replacing three separate yeast orthologs. * indicates data 637 from [18]. 638 HsOrthoRank) for this class indicate that replaceability is driven largely by the nearness of the 642 replaceable human co-ortholog to the yeast gene relative to a non-replaceable co-ortholog (i.e. 643 most 1:2 orthogroups have only one replaceable human co-ortholog, and it is almost always the 644 least-diverged one). (B) The observed trend is even more strongly demonstrated when analysis is 645 restricted to the specific set of 1:2 orthogroups that display differential replaceability (i.e. one 646 26 human co-ortholog replaces and the other does not), with AUCs nearing 0.9. The extent of tissue-647 specific expression also becomes significantly predictive in this set, indicating that human co-648 orthologs that are more broadly expressed are more likely to replace than their more tissue-649 specifically expressed co-ortholog. (C) More-diverged human co-orthologs in a 1:2 pair do replace 650 in several cases (mostly those where both co-orthologs replace). When restricting analysis to this 651 set, it is apparent that the most predictive property is sequence similarity to the yeast ortholog,  the co-ortholog more closely related to the yeast gene is the replaceable one. In (E), the 'none-666 replace' co-orthologs are on average more diverged from the yeast gene. 667 to replace, as well as those that maintain higher centrality in functional interaction networks 674 27 (Betweenness). (B) Similar to the 1:2 case, we further restricted our analysis to a subset of median-675 collapsed 1:>2 orthogroups that had both replaceable and non-replaceable human co-orthologs. In 676 this set, the subcellular localization of the human proteins appears to be predictive, in that more 677 broadly localized co-orthologs are more likely to replace than their more organellar-specific co-678 orthologs. While this AUC is not significant (indicated by being just less than two standard 679 deviations off the mean), it is an obvious trend and does overtake fraction of orthologous partners 680 as the highest performing AUC for this set.  The vast majority of 1 to multiple orthologs are of the 1:2 variety (1 yeast gene to 2 human co-700 orthologs) while the rest were classified as 1:>2 (1 yeast gene to >2 human co-orthologs). These 701 two groups were considered separately during feature analysis. human gene -present condition). Temperature-sensitive haploid yeast strains carrying plasmids 730 expressing functionally replacing human genes (colored red, purple or blue solid-lines) generally 731 exhibit comparable growth rates to the wild type parental yeast strain BY4741 (black dotted-lines). 732 The yeast strains harboring empty vector without a corresponding human gene, showing no or 733 poor growth restrictive temperatures of 37°C (grey solid-lines), serve as controls. In total, 208 734 individual assays were performed (as two independent biological replicates). 45 human genes 735 showed functionally replaced in either assay whereas 127 did not. 36 human gene 736 complementation assays were non-informative (cases where the control experiments did not 737 behave appropriately). 738