Joint representation of molecular networks from multiple species improves gene classification

Network-based machine learning (ML) has the potential for predicting novel genes associated with nearly any health and disease context. However, this approach often uses network information from only the single species under consideration even though networks for most species are noisy and incomplete. While some recent methods have begun addressing this shortcoming by using networks from more than one species, they lack one or more key desirable properties: handling networks from more than two species simultaneously, incorporating many-to-many orthology information, or generating a network representation that is reusable across different types of and newly-defined prediction tasks. Here, we present GenePlexusZoo, a framework that casts molecular networks from multiple species into a single reusable feature space for network-based ML. We demonstrate that this multi-species network representation improves both gene classification within a single species and knowledge-transfer across species, even in cases where the inter-species correspondence is undetectable based on shared orthologous genes. Thus, GenePlexusZoo enables effectively leveraging the high evolutionary molecular, functional, and phenotypic conservation across species to discover novel genes associated with diverse biological contexts.


Fig A: Number and ratio of cross-species network connections.
A row in each heat map in the left column shows the total number of genes in one species (row) present in a network of species combined with the network of another species (column).The corresponding entries in each heat map in the right column show the fraction of genes with a cross-species edge (when combined with a network from one other species).The top and bottom rows correspond to networks from BioGRID and IMP, respectively.These cross-species gene links were weighted in two different ways (Fig B in S1 File; which were then compared).In the 'uniform edge weighting strategy', the edge weight was set to the same value for all cross-species edges.We devised a second 'degree weighting strategy' that weights edges differently based on the degrees of the incident gene nodes in their respective within-species networks.Specifically, for a pair of genes u and v in two different species, the cross-species edge weight, , is given by:  Here, is the degree of the source node u when only considering within-species edges,  ℎ is the total number of ortholog connections for u, and C is a scaling parameter.The  ℎ parameter C influences the node2vec in how likely it is to generate walks that, after landing on node u, will next traverse to a node in the same species or to a node in a different species.

Fig B:
Edge weighting strategies for connecting genes across species.The uniform edge weighting strategy sets all cross species edges to the same value (left).The degree weighting strategy (shown with C = 1) creates directed edges by considering both the within-species degree of the source node and the total number of cross-species edges the source node is incident on (right).

Creating multi-species network representations
After connecting networks from multiple species, we derived representations (to serve as features for ML) using four different methods: 1. AdjMat: Adjacency matrix representation of the networks where the features directly use the edge weights for all the connections.That is, each gene's feature vector is its corresponding row in the adjacency matrix.2. RWR: The influence matrix representation generated with a random-walk-with-restart kernel [4].
○ The influence matrix , where, is the restart parameter, is β  the identity matrix, and is the degree-weighted adjacency matrix given by   , where is a diagonal matrix of node degrees.We used either the   =  −1

Properties of gene set collections
This section reports the properties of the various gene set collections.The following information is contained in these tables (* denotes wildcard expression): • Task: Name of the gene set collection; Gene Ontology (GO) [6], Monarch (MO) [7], and DisGeNet (DGN) [8,9].• Species: Species for which the annotations of for; hs: human, mm: mouse, dr: zebrafish

Optimal choices for gene classification using multi-species networks
To determine the optimal choice of network combination, representation, and hyperparameter setting that could improve the GenePlexus algorithm, we performed extensive evaluations on the tasks of predicting human and mouse annotations.
Which species to include?
The feature spaces were grouped into three main categories: • Single: Features were only derived from network of the species the prediction task is in • Hs-mm: Features were generated from jointly modeling human and mouse networks • All-species: Features were generated from jointly modeling all species considered in this work (human, mouse, zebrafish, fly, worm, and yeast) How to connect genes across species?
For Hs-mm and All-species, we weighted the edges for the cross-species connections based on two strategies (described in detail above): • Uniform: All edge weights were set to the same value.○ We evaluated weights equal to 1, 2, and 5. • Degree Weighted: Directed edges were generated based on the within-species weighted-degree of the source node, the number of cross-species connections of the source node, and a scaling factor (see eqn.S1).○ The scaling parameter was set to 1.0 or 0.50.
How to represent the network as features in the ML classifier?
Within each of the three main groups listed above, the features are generated using four different methods: • AdjMat: There are no additional hyperparameters.
• RWR: We fixed the restart parameter to 0.8 and tested using either the influence matrix directly or its transpose.• SVD: We tested two settings, one with and the other without standardizing the adjacency matrix column-wise before performing SVD.• N2V: Involved parameters include p and q, which control how random walks are generated.
We also performed some minimal tuning on the number of nodes per walk (i.e., walk length), the number of walks per gene, and the number of epochs used when training the classifer.
For both SVD and N2V, we considered embedding dimension sizes of 500, 2000, and 10000.We note that, for the All-Species category, only the node2vec method could handle jointly modeling information from more than two species.
First, we observed that 'degree-weighted' connections between cross-species gene pairs (compared to 'uniform') resulted in more cohesion between genes in different species networks

Statistical tests for main results
To determine if a method was greater than the "baseline" method we performed a Wilcoxon signed-rank test as implemented in scipy.stats.wilcoxonbetween the log2(auPRC/prior) values for the two feature spaces.This test returns a p-value and a z-statistic, where the latter is a normalized version of the T-statistic.The p-value in the set of p-values generated for each feature set compared to the baseline is considered significant if the Bonferoni corrected p-value has a family-wise error rate less than 0.01.The baseline method each method in a given prediction task is compared to is the adjacency matrix derived from only human network information (AdjMat-hs).Statistical significance is a FWER < 0.01 for a given prediction task.Higher values (red) of the z-statistic denote a representation outperforming the baseline ("+" denotes the difference being significant) and lower values (blue) of the z-statistic denote the baseline method having better performance ("-" denotes the difference being significant).

Changing the ortholog database
We compared connecting genes across species together using two ortholog databases, eggNOG and WORMHOLE.eggNOG is a graph-based orthology method that more liberally assigns genes across species as ortholog pairs leading to a high number of genes with multiple orthologs, albeit at the expense of the percentage of gene pairs that are least diverged orthologs.WORMHOLE is a machine learning orthology method that integrates data from 17 different orthology databases.The ground truth labels for the classifier are least divergent ortholog pairs from the PANTHER database, thus WORMHOLE has far fewer ortholog pairs and percentage of genes with multiple orthologs compared to eggNOG, albeit with a much higher percentage of gene pairs that are least diverged orthologs.
The results show that models using eggNOG have better or equivalent performance than those using WORMHOLE, although not in all cases.Note that these evaluations were only carried out for the embedding methods as 1) PCA plots show that ortholog connections are much more important in embedding methods than adjacency matrix models, 2) hyperparameter tuning results show the adjacency matrix representations are not sensitive to the weighting scheme in which orthologs are connected and 3) the computational time to train models that use the adjacency matrix representation of species pairs is very intensive.

Cross-species knowledge transfer Fig L:
Transferring knowledge from model species to human using human+model and all-species representations of networks from IMP.The boxplots (top row) show the log 2 (auPRC/prior) prediction performance of logistic regression classifiers that were trained using only model species genes annotated to a given biological process and then used to predict human genes annotated to the same biological process.The barplots (bottom row) show the counts of where the log 2 (auPRC/prior) for the matched biological process ranks in comparison to the other processes in the collection.Features for the classifier were created pairwise between human and a model species with IMP networks using the AdjMat (light blue) and N2V (dark blue) representations, as well as features from all species using the N2V (orange) representation.
The following two figures are the corresponding figures to  prediction performance of logistic regression classifiers that were trained using only human genes annotated to a given biological process and then used to predict model species genes annotated to the same biological process.The barplots (bottom row) show the counts of where the log 2 (auPRC/prior) for the matched biological process ranks in comparison to the other processes in the collection.Features for the classifier were created pairwise between human and a model species with BioGRID networks using the AdjMat (light blue) and N2V (dark blue) representations, as well as features from all species using the N2V (orange) representation.

Fig N:
Transferring knowledge from model species to human using human+model and all-species representations of networks from BioGRID.The boxplots (top row) show the log 2 (auPRC/prior) prediction performance of logistic regression classifiers that were trained using only model species genes annotated to a given biological process and then used to predict human genes annotated to the same biological process.The barplots (bottom row) show the counts of where the log 2 (auPRC/prior) for the matched biological process ranks in comparison to the other processes in the collection.Features for the classifier were created pairwise between human and a model species with BioGRID networks using the AdjMat (light blue) and N2V (dark blue) representations, as well as features from all species using the N2V (orange) representation.Within each panel, the boxplots show the performance on the task of predicting human genes annotated to biological processes using classifiers that were trained either on human genes only (left pair) or human+model genes (right pair), based on features from the human (red), human+model (blue), or all six species (orange) networks from IMP.

Predicting model species counterparts of human diseases
For a given species, the classifier trained on human disease gene annotations can be used to predict a probability of how associated every gene in all six species is to the human disease.ten predicted genes in each species.Top ten genes in each species (columns) based on the z-score of the prediction probabilities from the classifier.Genes in each species are ranked independently, i.e., genes from different species in the same row are not intended to correspond to each other.
Table F: Bardet-Biedl Syndrome 1 top ten associated biological processes in each species.Top ten associated GO biological processes in each species (columns) based on the z-score of the predicted probability of the genes in each species.Processes in each species are ranked independently, i.e., processes from different species in the same row are not intended to correspond to each other.Real relationships between these processes are presented in Fig 5 .Table G: Bardet-Biedl Syndrome 1 top ten associated phenotypes in each species.Top ten associated Monarch phenotypes in each species (columns) based on the z-score of the predicted probability of the genes in each species.Phenotypes in each species are ranked independently, i.e., phenotypes from different species in the same row are not intended to correspond to each other.

Fig W:
The ten most enriched biological processes associated with the top genes in each species predicted to be related to organic acidemia (OA).A classifier trained using human OA genes and all-species N2V features was used to predict OA-related genes in the five model species.The figure shows the ten most enriched biological processes associated with the top-ranked genes in each species.Nodes represent biological processes and are colored by which species they were identified in.Edges represent pairs of semantically similar processes (scaled Resnik similarity based on the Gene Ontology).Isolated nodes are not shown.Nodes with thick borders represent biological processes where, in at least one species, none of the annotated genes are orthologous to any human OA gene.

Additional PCA plots
, dm: fly, ce: worm, sc: yeast • Network: Network from which features are derived.Bio : BioGRID • NumSets: Total number of sets that passed the threshold • All*: Considering all genes annotated to the set • Trn*: Considering only annotated genes in the training set • Tst*: Considering only genes in the testing set • Min: The number of genes annotated to the gene set with the least amount of annotations • Max: The number of genes annotated to the gene set with the greatest amount of annotations • Mean: Average number of genes annotated to sets in the collection • Median: Median value of the number of genes annotated to sets in the collection

(Fig 6 ;
Fig S-U in S1 File).So, fixing those connections, we ran all combinations of network representation methods (and their parameter sets) and species choices for function and phenotype prediction tasks.Fig C in S1 File shows the results for the best hyperparameter from this analysis.This analysis required substantial computational resources and generated over 3.5TB worth of files just for human and mouse.For plots of all hyperparameter tuning results see the hyperparameter_tuning folder of the GenePlexusZoo-Manuscript github repository.Therefore, to compare across species other than just human and mouse, we picked one non-embedding method -AdjMat (over RWR) -and one embedding method -N2V with a p = 0.1, q = 0.001, dimension size of 2000, walk length = 120, number of walk = 8 and window size = 10 (over SVD) [Fig D in S1 File] -and used these two in all subsequent analyses (N2V Selected).

Fig C:
Fig C: Evaluation to determine optimal choices of settings and hyperparameters for improving GenePlexus with multi-species networks.Each group of boxplots shows the function and phenotype prediction performance for human and mouse using different species choice (for the network) and network representation methods (each using the best hyperparameter set, N2VSelected) based on networks from BioGRID and IMP.

Fig D :
Fig D: Performance of the selected hyperparameter set for node embeddings.The boxplots show the log2(auPRC/prior) performance metric of the hyperparameter set we selected for N2V (orange) to use in the whole study compared to the performance of the best hyperparameter set (blue) across function and phenotype prediction tasks in human and mouse using networks from BioGRID and IMP.

Fig E :
Fig E: Statistical tests for human gene annotation prediction.These results perform statistical tests for the results presented in Fig 2.The baseline method each method in a given prediction task is compared to is the adjacency matrix derived from only human network information (AdjMat-hs).Statistical significance is a FWER < 0.01 for a given prediction task.Higher values (red) of the z-statistic denote a representation outperforming the baseline ("+" denotes the difference being significant) and lower values (blue) of the z-statistic denote the baseline method having better performance ("-" denotes the difference being significant).

Fig F :
Fig F: Statistical tests for model species gene annotation prediction.These results perform statistical tests for the results presented in Fig 3.The baseline method each method in a given prediction task is compared to is the adjacency matrix derived from only model species network information for which the prediction task is for.Statistical significance is a FWER < 0.01 for a given prediction task.Higher values (red) of the z-statistic denotes a representation outperforming the baseline method (+ denotes the difference being significant) and lower values (blue) of the z-statistic denotes the baseline method having better performance (-denotes the difference being significant)

Fig I :
Fig I: Adding orthologs for predicting function-and phenotype-associated model species genes.The same analysis as Fig 3, but now including orthologs from the model species as additional positive examples during training.

Fig J :
Fig J: Results for predicting function-and phenotype-associated human genes when changing the orthology database used to connect genes across species.The same analysis as Fig 2, but now looking at the difference in using the eggNOG orthology database versus WORMHOLE.

Fig K :
Fig K: Results for predicting function-and phenotype-associated model species genes when changing the orthology database used to connect genes across species.The same analysis as Fig 3, but now looking at the difference in using the eggNOG orthology database versus WORMHOLE.

Fig 4 and
Fig G in S1 File, respectively, except generated using the BioGRID network.

Fig M :
Fig M:Transferring knowledge from human to model species using human+model and all-species representations of networks from BioGRID.The boxplots (top row) show the log 2 (auPRC/prior) prediction performance of logistic regression classifiers that were trained using only human genes annotated to a given biological process and then used to predict model species genes annotated to the same biological process.The barplots (bottom row) show the counts of where the log 2 (auPRC/prior) for the matched biological process ranks in comparison to the other processes in the collection.Features for the classifier were created pairwise between human and a model species with BioGRID networks using the AdjMat (light blue) and N2V (dark blue) representations, as well as features from all species using the N2V (orange) representation.

Fig O :
Fig O: Predicting human gene annotations by training classifiers with genes only from human or only from another species using IMP networks.Each panel corresponds to the analysis of human and one other model species.Within each panel, the boxplots show the performance on the task of predicting human genes annotated to biological processes using classifiers that were trained either on human genes only (left pair) or model species genes only (right pair), based on features from the human-model (red) or all six species (blue) networks from IMP.

Fig P :
Fig P: Predicting model species gene annotations by training classifiers with genes only from that species or only from human using IMP networks.Each panel corresponds to the analysis of one model species with human.Within each panel, the boxplots show the performance on the task of predicting model species genes annotated to biological processes using classifiers that were trained either on model species genes only (left pair) or human genes only (right pair), based on features from the human-model (red) or all six species (blue) networks from IMP.

Fig Q :
Fig Q: Predicting human gene annotations by training classifiers with genes only from human or only from another species using BioGRID networks.Each panel corresponds to the analysis of human and one other model species.Within each panel, the boxplots show the performance on the task of predicting human genes annotated to biological processes using classifiers that were trained either on human genes only (left pair) or model species genes only (right pair), based on features from the human-model (red) or all six species (blue) networks from BioGRID.

Fig R :
Fig R: Predicting model species gene annotations by training classifiers with genes only from that species or only from human using BioGRID networks.Each panel corresponds to the analysis of one model species with human.Within each panel, the boxplots show the performance on the task of predicting model species genes annotated to biological processes using classifiers that were trained either on model species genes only (left pair) or human genes only (right pair), based on features from the human-model (red) or all six species (blue) networks from BioGRID.

Fig S :
Fig S: Predicting human gene annotations by including genes from multiple species during training with IMP networks.Each panel corresponds to the analysis of human and one model species.Within each panel, the boxplots show the performance on the task of predicting human genes annotated to biological processes using classifiers that were trained either on human genes only (left pair) or human+model genes (right pair), based on features from the human (red), human+model (blue), or all six species (orange) networks from IMP.

Fig T :
Fig T: Predicting model species gene annotations by including genes from multiple species during training with IMP networks.Each panel corresponds to the analysis of one model species with human.Within each panel, the boxplots show the performance on the task of predicting model species genes annotated to biological processes using classifiers that were trained either on model species genes only (left pair) or model+human genes (right pair), based on features from the model (red), model+human (blue), or all six species (orange) networks from IMP.

Fig U :
Fig U: Predicting human gene annotations by including genes from multiple species during training with BioGRID networks.Each panel corresponds to the analysis of human and one model species.Within each panel, the boxplots show the performance on the task of predicting human genes annotated to biological processes using classifiers that were trained either on human genes only (left pair) or human+model genes (right pair), based on features from the human (red), human+model (blue), or all six species (orange) networks from BioGRID.

Fig V:
Fig V: Predicting model species gene annotations by including genes from multiple species during training with BioGRID networks.Each panel corresponds to the analysis of one model species with human.Within each panel, the boxplots show the performance on the task of predicting model species genes annotated to biological processes using classifiers that were trained either on model species genes only (left pair) or model+human genes (right pair), based on features from the model (red), model+human (blue), or all six species (orange) networks from BioGRID.

Fig X :
Fig X: PCA of adjacency matrix and node embedding representations of human-mouse and all-species networks.Each PCA plot shows the top two PCA components from the human-mouse adjacency matrix (first column), the human-mouse node embedding (second column) and the all species node embedding (third column) representations.The cross-species edge weights were created by always using a uniform edge weight with a value of one (top row) or by using degree based edge weights (bottom row) for the BioGRID network.

Fig Y :
Fig Y: PCA of adjacency matrix and node embedding representations of human-model networks from IMP.Each PCA plot shows the top two PCA components from the adjacency matrix (using the uniform weighting strategy; top row) and node embeddings (using the degree based weighting strategy; bottom row) representations for the human network combined pairwise with fish, fly, worm, yeast in IMP.

Fig Z :
Fig Z: PCA of adjacency matrix and node embedding representations of human-model networks from BioGRID.Each PCA plot shows the top two PCA components from the adjacency matrix (using the uniform weighting strategy; top row) and node embeddings (using the degree based weighting strategy; bottom row) representations for the human network combined pairwise with fish, fly, worm, yeast in BioGRID.