Cohesive versus Flexible Evolution of Functional Modules in Eukaryotes

Although functionally related proteins can be reliably predicted from phylogenetic profiles, case studies and systematic analyses in prokaryotes indicate that many functional modules do not evolve cohesively. In this study we quantify the evolutionary cohesiveness of functional modules in eukaryotes and probe the biological and methodological factors that influence our estimates. We collected various datasets of protein complexes and pathways in Saccharomyces cerevisiae. We define orthologous groups across 34 eukaryotic genomes and measure the extent of cohesive evolution for sets of orthologous groups whose members constitute a known complex or pathway. Within this framework, most functional modules appear to evolve flexibly rather than cohesively. Even after correcting for uncertain module definitions and potentially problematic orthologous groups, only 46% of pathways and complexes evolve more cohesively than random modules. This flexibility seems partly coupled to the nature of the functional module, because biochemical pathways generally evolve more cohesively than complexes.

Average co-occurrence: for each pair of module subunits we calculate the fraction of species in which the two subunits are either both present or both absent. We average over all subunit pairs to obtain a score per module.
Average deviation from modular: the sum, over all genomes, of the deviation of the number of module components present in each genome from the average number of module components per genome. Adopted from Snel et al. (2004) [1].
Homogeneous columns: the number of species in which a module is either completely present or completely absent. Adopted from Gavin et al. (2006) [2].
Species absent, species present: the number of species in which a module is completely absent and the number of species in which the module is completely present. The vector containing these two scores is the raw score used throughout the article.

Table 2. Correlation of cohesiveness score with module size (number of components).
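The score definitions above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function name `cohesiveness_scores`, the dictionary layout of the presence/absence profiles, and the use of the absolute value for the "deviation" are all assumptions made for the example.

```python
from itertools import combinations


def cohesiveness_scores(profiles):
    """Compute the module scores described above.

    profiles: dict mapping a subunit name to a list of 0/1 values,
    one per species, in the same species order for every subunit
    (hypothetical input layout).
    """
    rows = list(profiles.values())
    n_species = len(rows[0])
    n_subunits = len(rows)

    # Average co-occurrence: fraction of species in which a pair is
    # both present or both absent, averaged over all subunit pairs.
    pair_scores = [
        sum(a == b for a, b in zip(r1, r2)) / n_species
        for r1, r2 in combinations(rows, 2)
    ]
    avg_cooccurrence = sum(pair_scores) / len(pair_scores)

    # Number of module components present in each species.
    counts = [sum(column) for column in zip(*rows)]

    # Deviation from modular: summed deviation of the per-species count
    # from the mean count (absolute deviation is an assumption here).
    mean_count = sum(counts) / n_species
    deviation = sum(abs(c - mean_count) for c in counts)

    species_absent = sum(c == 0 for c in counts)
    species_present = sum(c == n_subunits for c in counts)

    return {
        "avg_cooccurrence": avg_cooccurrence,
        "deviation_from_modular": deviation,
        "homogeneous_columns": species_absent + species_present,
        "raw_score": (species_absent, species_present),
    }


# Toy module with three subunits profiled in four species.
example = cohesiveness_scores({
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 1],
    "C": [1, 0, 0, 1],
})
```

In the toy module, one species contains all three subunits and one contains none, so the raw score is the vector (1, 1) and the number of homogeneous columns is 2.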

2: Cohesiveness scores and number of module components
The score used in this article is the only one that does not correlate with the number of subunits in a module, because it consists of both the number of species in which a module is completely absent and the number of species in which a module is completely present. All one-dimensional scores, except the number of species in which a module is completely absent, correlate positively with size: modules with many subunits tend to evolve more cohesively according to these scores. The same trend is reported by Campillos et al. (2006) [3], who use a two-dimensional score consisting of the number of evolutionary events (gain or loss) and the number of shared events. We use Spearman rank correlation because neither variable is normally distributed.

Table 3a. Being a pathway or a complex as a predictor for evolutionary cohesiveness. Table 2 in the main text and Table 1 in the Supplementary text suggest that pathways evolve more cohesively than complexes. We tested this using a Mann-Whitney-Wilcoxon rank-sum test, comparing pathways to curated complexes. We find that pathways indeed tend to have a higher cohesiveness score than complexes and that this difference is significant (average score: pathways 0.9, complexes 0.8; P value 0.00012).
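Both tests mentioned above operate on ranks rather than raw values, which is why they suit variables that are not normally distributed. The sketch below computes the Spearman correlation coefficient and the Mann-Whitney U statistic with the standard library only. The function names are hypothetical, and P values are deliberately left out: in practice they would come from a statistics package.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: rank sum of sample_a minus its minimum."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a = len(sample_a)
    rank_sum_a = sum(ranks[:n_a])
    return rank_sum_a - n_a * (n_a + 1) / 2
```

For example, `spearman_rho(sizes, scores)` would relate module size to a cohesiveness score, and `mann_whitney_u(pathway_scores, complex_scores)` approaches `len(pathway_scores) * len(complex_scores)` when pathways consistently score higher than complexes; these variable names are placeholders.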

3: Pathways compared to complexes
If we used the categories 'pathway' and 'complex' to predict whether a module is cohesively evolving or not, we would get contingency tables like this:

This figure shows that, regardless of the specific cohesiveness score cutoff used to classify modules as cohesive or incohesive, the proportion of pathways that evolve cohesively is higher than the proportion of complexes.

We perform a Wilcoxon rank-sum test to compare the distribution of scores of confirmed modules to that of unconfirmed modules. P values are shown for a one-tailed test: we test whether confirmed modules have higher scores than unconfirmed modules. Confirmed modules evolve significantly more cohesively than unconfirmed modules in the SGD, PE and MIPS datasets, which are the datasets containing on average the smallest modules. Only a small fraction of the modules in the KEGG pathways and Socioaffinity clusters have been confirmed by other datasets, which may explain why the difference between confirmed and unconfirmed modules is not significant for these datasets.

Subunits that have not been confirmed by other datasets are potentially false additions to a module, and their removal could increase the evolutionary cohesiveness of the module. We compare the cohesiveness score of each completely confirmed submodule with the score of its original module and find that in general this score improves (Wilcoxon matched-pairs test, one-tailed: testing whether submodules score higher than the original modules). This difference is more significant for datasets that contain large modules (KEGG, Socioaffinity clusters), as there are more submodules to compare in these datasets (see also modules before and after the cross-comparison filter. This filter has less effect on the curated datasets than on the module definitions derived from high-throughput data.
The increase in the fraction of cohesive modules is the combined effect of removing subunits that do not co-occur with the rest of the module in any other dataset and of removing entire unconfirmed modules. All numbers are based on non-redundant module sets: no set of KOGs occurs more than once, except as a sub- or superset. The high-throughput datasets improve because of cross-comparison with the curated complex sets. The pathway datasets also show a substantial increase in cohesiveness after the filter. This is probably because pathways are often defined as a set of reactions starting from or ending with a common substrate. Cross-comparison with other datasets may prune a pathway such that only one path between substrates is left.

Table 5c. Fraction of modules which evolves cohesively, average score, average size and number of (sub)modules (1) before any filter (no filter), (2) for modules for which all subunits have at least one interaction (not necessarily within the module) with a PE score with confidence > 0.2 [4] (PE data for all subunits) and (3) after the filter (subclusters). First we remove all components that have a zero PE score with all other module subunits. Subsequently we cluster the module subunits with single-linkage clustering, using PE scores as a similarity metric. We obtain two clusters and remove the smallest cluster from the module. The pathway datasets have very few modules for which all components interact with at least one other protein.
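The PE-score filter described in the Table 5c caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pe_filter`, the representation of PE scores as a dictionary keyed by unordered subunit pairs, the treatment of missing pairs as a score of zero, and the tie-breaking between equally sized clusters are all hypothetical choices.

```python
from itertools import combinations


def pe_filter(subunits, pe_scores):
    """Return the subunits kept after the filter.

    subunits: list of subunit names in the module.
    pe_scores: dict mapping frozenset({a, b}) to a PE score
    (hypothetical layout; missing pairs are treated as score 0).
    """
    def similarity(a, b):
        return pe_scores.get(frozenset((a, b)), 0.0)

    # Step 1: drop subunits with a zero PE score to all other subunits.
    kept = [
        s for s in subunits
        if any(similarity(s, t) > 0 for t in subunits if t != s)
    ]

    # Step 2: single-linkage clustering. Repeatedly merge the two
    # clusters joined by the highest-scoring pair until two remain.
    clusters = [{s} for s in kept]
    while len(clusters) > 2:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: max(
                similarity(a, b)
                for a in clusters[ij[0]]
                for b in clusters[ij[1]]
            ),
        )
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j

    # Step 3: remove the smallest cluster, i.e. keep the largest one.
    # With equal sizes the first cluster is kept (an arbitrary choice).
    if len(clusters) < 2:
        return set(kept)
    return max(clusters, key=len)


toy_scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "D")): 0.1,
}
toy_result = pe_filter(["A", "B", "C", "D", "E"], toy_scores)
```

In the toy module, E is dropped in the first step because it has no non-zero PE score, and D ends up alone in the smaller of the two clusters, so the filtered module consists of A, B and C.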

4: Cross-comparison with other module datasets
Metabolic proteins typically interact via their substrates or not at all. Hence any interactions will be transient at best and are less likely to be picked up by TAP experiments. All numbers are for non-redundant sets: no set of KOGs occurs twice in a dataset, but may occur as a subset. The number of submodules reported in

Figure 5. Bar chart of the fraction of cohesively evolving modules before and after the filter.
If we compare the average, median and variance of the PE scores within each module before and after applying the filter, we find that the average and median PE scores increase significantly after the filter (P values 0.0003 and 0.003, respectively; one-tailed Wilcoxon rank-sum test) and that the variance decreases, but not significantly (P value 0.068). Cohesiveness in terms of interaction is thus significantly increased by this filter; the evolutionary cohesiveness, however, is not.

Table 6a. Fraction of modules which evolve cohesively and average score for modules composed of orthologous groups based on KOG and modules composed of orthologous groups obtained by running orthoMCL [5]. Datasets containing large modules (KEGG and Socioaffinity) score slightly lower when subunits are assigned to orthoMCL orthologous groups than when subunits are assigned to KOG groups. The average module size per dataset remains qualitatively the same. Large modules evolve more cohesively than the random background because the module is present entirely in many species. Apparently, the random background of orthoMCL groups contains more groups which are conserved in many species.

Table 7c. Fraction of modules which evolves cohesively, average score, average size and number of (sub)modules before and after the filter. For this filter, of all KOGs constituting a functional module we remove the 50% containing the most inparalogs, which amounts to removing all KOGs with more than 7 inparalogs. Although the improvement of submodules over the original modules was not significant in the pathway datasets (Table 7b), the SGD pathway dataset contains a larger fraction of cohesive modules than before the filter. However, this increase comes at a cost: more than two thirds of the modules are removed completely. All numbers are for non-redundant sets: no set of KOGs occurs twice in a dataset, but may occur as a subset.

This plot is generated with the BiNGO plugin in Cytoscape.
It represents the GO Slim Yeast categories that are overrepresented among proteins constituting cohesively evolving modules relative to proteins from flexibly evolving modules. The color of the nodes represents the P value of the hypergeometric test (corrected with the Benjamini-Hochberg correction), ranging from < 0.01 (yellow) to < 1E-07 (dark orange).
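For readers unfamiliar with the statistics behind such an enrichment plot, the sketch below shows a one-sided hypergeometric test for overrepresentation followed by the Benjamini-Hochberg adjustment. It illustrates the general technique only and makes no claim about how BiNGO implements it internally; the function and parameter names are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n proteins from N, of which K are annotated.

    k: annotated proteins in the test set, n: size of the test set,
    K: annotated proteins in the reference set, N: size of the reference set.
    """
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each GO category would contribute one P value from `hypergeom_upper_tail`, and the list of P values for all tested categories would then be passed to `benjamini_hochberg` before applying a significance cutoff.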