Benchmarking network propagation methods for disease gene identification

In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, using gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes genetically associated with disease decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and of how the seed disease genes are defined.

We choose Net4 as our main network, as it provides a good balance between mapping and coverage. The edge weights were obtained by rescaling the STRING combined score to lie in [0, 1]. For the MashUp network-based feature generation, the experimental and the database STRING-based networks were combined through their algorithm, instead of using the combined weight provided by STRING.
The (unweighted) shortest path distribution of the final STRING network is depicted in figure B.

When several genes were candidates for a mapping, ties were resolved by the following priority:
1. The gene with a known drug target
2. In case of tie(s), the gene with a known genetic association
3. In case of tie(s), the gene with the highest overall association score
4. In case of tie(s), pick a random gene

From the original 187,246 rows in the disease table, 125,007 mapped to the OmniPath network, encompassing a total of 5,084 drugs-related genes and 3,442 genetics-related genes.
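As a minimal sketch, the tie-breaking priority above can be expressed as a sort key over candidate gene records (a Python illustration with hypothetical fields, not the pipeline's actual mapping code):

```python
import random

def pick_gene(candidates, rng=random):
    """Pick one gene from tied candidates following the priority:
    (1) known drug target, (2) known genetic association,
    (3) highest overall association score, (4) random choice."""
    key = lambda g: (g["is_drug_target"], g["has_genetic_assoc"], g["score"])
    best = max(candidates, key=key)
    # Candidates tied with the best on every criterion...
    ties = [g for g in candidates if key(g) == key(best)]
    # ...are broken at random, as in step 4
    return rng.choice(ties)

genes = [
    {"name": "A", "is_drug_target": False, "has_genetic_assoc": True, "score": 0.9},
    {"name": "B", "is_drug_target": True, "has_genetic_assoc": False, "score": 0.5},
    {"name": "C", "is_drug_target": True, "has_genetic_assoc": True, "score": 0.5},
]
print(pick_gene(genes)["name"])  # → C
```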

Descriptive disease statistics in the STRING network
After mapping the (drugs-related) disease genes to the main STRING network, we observe that every pair of diseases shows overlap. This can range from a modest amount (less than 10 genes) to more than 100 genes. Examples of the latter include type II diabetes, coronary heart disease, obesity, hypertension and bipolar disorder, all of which share a notable background.  In turn, this suggests that some genes might participate in multiple diseases, which is confirmed in figure D. Several genes participate in 10 or more diseases (out of 22), thus creating an overlap between any pair of them.
The fact that all diseases share at least one gene implies that their distance within the network is always 0. However, we can examine the mean distance between two diseases, defined as the mean of the distance of every pair of genes (g_i, g_j) with g_i belonging to the first disease and g_j to the second. The rows and columns were grouped using the UPGMA algorithm [GM07], taking into account that, at the starting point, every disease is in practice equivalent to a cluster of genes.
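The mean inter-disease distance defined above can be sketched as follows (a Python/networkx toy example rather than the study's R code; the five-node path and the gene sets are hypothetical, with a shared gene contributing a zero distance):

```python
import itertools
import networkx as nx

def mean_disease_distance(G, genes_a, genes_b):
    """Mean over all pairs (g_i, g_j), with g_i in disease A and g_j in
    disease B, of the unweighted shortest-path distance d(g_i, g_j)."""
    dists = [nx.shortest_path_length(G, gi, gj)
             for gi, gj in itertools.product(genes_a, genes_b)]
    return sum(dists) / len(dists)

# Toy network: a path 0-1-2-3-4; the two diseases overlap at gene 2
G = nx.path_graph(5)
print(mean_disease_distance(G, {0, 2}, {2, 4}))  # → 2.0
```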
Each disease, in turn, tends to form a module within the network. To show this, we have computed the modularity of each disease, as implemented in the modularity function from the igraph R package [37]. A modularity greater than zero indicates that the number of connections within the disease genes is greater than that of a randomly rewired network.

Figure D: Histogram of the number of diseases genes participate in. The majority of genes belong to a single disease, but a small fraction are found in 10 or more diseases, unveiling a common core in drug targets.

Figure F shows how all diseases deviate from their randomised gene sets, as expected. Diseases with higher modularity can be easier to predict: an example is cardiac arrhythmia, which is very modular and well predicted (see the additive models on drugs data). Another way to examine how close disease genes lie is to represent the mean distance to the disease genes, starting either (1) from the disease genes themselves or (2) from the rest of the genes (i.e. non-disease genes for this particular disease). Figure G shows how drugs-related genes from a given disease have a shorter mean distance to themselves than the rest of the genes in the network.
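As a sketch of the modularity check (using networkx's modularity function instead of the igraph R implementation used in the study; the two-clique graph and the "disease genes" are toy examples):

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy network: two 4-cliques joined by a single bridge edge
G = nx.disjoint_union(nx.complete_graph(4), nx.complete_graph(4))
G.add_edge(3, 4)

# Treat the "disease genes" {0, 1, 2, 3} as one module, the rest as background
disease_genes = {0, 1, 2, 3}
background = set(G.nodes()) - disease_genes

# Q > 0: more intra-module edges than expected under random rewiring
q = modularity(G, [disease_genes, background])
print(round(q, 4))  # → 0.4231
```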
Finally, we observe that drugs-related disease genes have larger centrality measures than the rest of the nodes in the network (figure H). This supports the hypothesis that centrality itself has predictive power, examined here by including the PageRank centrality measure as a baseline method, and the node betweenness, as implemented in the betweenness function in igraph. Note that centrality measures are a topological property and do not use disease data as input.

Figure H: For each disease, all the genes have been separated into drugs-related disease genes (blue) and non-disease genes (red). Consistently across the three metrics, drugs-related genes tend to have higher centralities.
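As a toy illustration of the centrality comparison (a Python/networkx sketch; the hub-and-spokes graph stands in for the PPI network, with the hub playing the role of a central drug target):

```python
import networkx as nx

# Toy network: a star (node 0 is the hub) with an appended path
G = nx.star_graph(4)
G.add_edges_from([(4, 5), (5, 6)])

# Purely topological measures: no disease data is used as input
pagerank = nx.pagerank(G, alpha=0.85)
betweenness = nx.betweenness_centrality(G)

# A central (hub-like) node scores higher than a peripheral leaf on both
assert pagerank[0] > pagerank[6]
assert betweenness[0] > betweenness[6]
```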

Complex data
Chembl complex data was retrieved from https://www.ebi.ac.uk/chembl/downloads, specifically release 23 (doi 10.6019/CHEMBL.database.23). The original data comprises 214 complexes, with a mean of 9.29 proteins each and a standard error of 14.08. Having mapped the complexes to the STRING network, 207 non-empty complexes remain, with a mean of 3.251 ENSEMBL ids each and a standard error of 4.728.

Cross validation splits
By their definition, the classic and representative strategies keep the dataset balanced: one third of the drugs-related genes are used for validation and two thirds are used as seed genes. Inevitably, small deviations arise if the total number of disease genes is not a multiple of 3. Note, however, that the block scheme sometimes keeps the balance (in diseases such as asthma and Parkinson's disease), but can lead to data imbalance if large complexes are involved, as in COPD and type I diabetes.

Overall performance by disease
(Figures: performance using drugs-related data as input with the STRING network and with the OmniPath network.) The three magnitudes correlate positively, implying that, in general, diseases with (i) more drugs-related genes and (ii) higher modularity in the network used by the prioritisers will exhibit better performance. Likewise, diseases with larger gene lists tend to be more modular.
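A complex-aware split in the spirit of the block scheme can be sketched with scikit-learn's GroupKFold, treating each complex as a group so that its members never straddle the seed and validation parts (all gene and complex names are illustrative; this is not the study's implementation):

```python
from sklearn.model_selection import GroupKFold

genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9"]
# g1-g3 form one complex, g4-g5 another; the remaining genes are singletons
complex_of = {"g1": "cpx1", "g2": "cpx1", "g3": "cpx1",
              "g4": "cpx2", "g5": "cpx2"}
groups = [complex_of.get(g, g) for g in genes]  # singletons group themselves

gkf = GroupKFold(n_splits=3)
for seed_idx, val_idx in gkf.split(genes, groups=groups):
    val_groups = {groups[i] for i in val_idx}
    seed_groups = {groups[i] for i in seed_idx}
    # No complex appears on both sides of the split
    assert not val_groups & seed_groups
```

As noted in the text, folds stay roughly balanced when complexes are small, but a large complex forces all of its genes into one fold and can unbalance the split.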

Method details
All tests and batch runs were set up and conducted using the R statistical programming language [36]. When no R package was available, the methodology was re-implemented, building upon existing R packages whenever possible. Standard R machine learning libraries were used to train the support vector machine and random forest classifiers. Only the MashUp algorithm [35] required feature generation outside of the R environment, using the Matlab code from its publication. The versions of the R packages can be found in table T. EGAD (Extending "Guilt by Association" by Degree [27]) was used here as a baseline comparator. EGAD performs a naïve, diffusion-like approach via near-neighbour voting: its neighbor voting function uses the adjacency matrix of the network and no additional parameters.
PageRank is a standard web ranking technology based upon the original work of Page et al. [26]. The igraph R package implementation of PageRank (here ppr) was used with the default damping factor, d = 0.85. This implementation is commonly referred to as personalised PageRank because of its custom prior distribution: the prior gives a probability of 1/n to each of the n input genes and 0 to the rest, forcing random walks to restart from the input genes. PageRank has been employed to diffuse disease seeding information across a two-layered network comprising PPI and GO hierarchy information [28]. Two approaches were developed: BirgRank (applying traditional PageRank with fixed decay parameters) and AptRank (with an adaptive diffusion mechanism). Here we considered only fixed decay parameter PageRank diffusion on the regular, weighted PPI network.
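The personalised prior can be sketched with networkx's pagerank (a toy path network stands in for the PPI; the igraph R implementation was the one actually used):

```python
import networkx as nx

# Toy PPI network: a simple path of five genes
G = nx.path_graph(5)

# Personalised prior: probability 1/n on each of the n seed genes, 0 elsewhere
seeds = {0, 1}
prior = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in G}

# Damping factor d = 0.85, as in the igraph default
scores = nx.pagerank(G, alpha=0.85, personalization=prior)

# Genes close to the seeds rank above distant ones
assert scores[2] > scores[4]
```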
Kernelised score functions in semi-supervised learning (here knn and wsld) have been assessed for disease gene identification [42]. Default package settings were used in all cases (number of neighbours, k = 3, for knn and coefficient of linear decay, d = 2, for wsld). We used the kernel computed with diffuStats.
The bagging SVM method (here bagsvm) is an implementation of ProDiGe1 [34]. It approximates a form of PU-learning [40,41] by iteratively choosing random subsets from the unlabelled genes (i.e. those genes that are not known to be associated with the disease) when training classifiers. This method was directly applied to the regularised Laplacian kernel computed with diffuStats.
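A minimal sketch of the bagging idea, assuming generic feature vectors instead of the regularised Laplacian kernel actually used (toy clusters; not the ProDiGe1 implementation):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy features: known positives cluster at (2, 2); the unlabelled pool mixes
# hidden positives at (2, 2) with likely negatives at (-2, -2)
pos = rng.normal(loc=2.0, scale=0.3, size=(20, 2))
unl = np.vstack([rng.normal(2.0, 0.3, (10, 2)),    # hidden positives
                 rng.normal(-2.0, 0.3, (30, 2))])  # likely negatives

n_bags, scores = 30, np.zeros(len(unl))
for _ in range(n_bags):
    # Each bag: a random unlabelled subset the size of the positive set
    idx = rng.choice(len(unl), size=len(pos), replace=False)
    X = np.vstack([pos, unl[idx]])
    y = np.r_[np.ones(len(pos)), np.zeros(len(idx))]
    clf = SVC(kernel="linear").fit(X, y)
    # Average the decision values over bags
    scores += clf.decision_function(unl) / n_bags

# Hidden positives should, on average, outrank the likely negatives
assert scores[:10].mean() > scores[10:].mean()
```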
The svm (Support Vector Machine; kernlab R package) and rf (randomForest R package) methods apply classical machine learning approaches on network-based features. Network-based features were generated using MashUp with default parameters for the human network (800-dimensional, as recommended by the authors) [35]. We used the caret [43] and mlr [44] R packages to define the classification tasks, grid-search the parameters and make predictions for these two methods.
The SVM method used here is a nu-SVM with an RBF kernel. In training, the negative class examples were randomly under-sampled to match the number of positive class examples. Parameters were determined via inner cross-validation, with ranges of (0.1, 0.9) for nu and (10^-6, 10^2) for sigma; the search space was linear in nu and logarithmic in sigma. A grid of resolution 5 in each direction was explored to choose the best parameters, with an internal loop of 3 repetitions of 3-fold CV.
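The tuning scheme can be sketched with scikit-learn standing in for the caret/mlr R setup (the data and grid are illustrative, with gamma playing the role of sigma):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import NuSVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)

# Grid of resolution 5: nu linear on (0.1, 0.9), gamma logarithmic on (1e-6, 1e2)
param_grid = {
    "nu": np.linspace(0.1, 0.9, 5),
    "gamma": np.logspace(-6, 2, 5),
}
# Inner loop: 3 repetitions of 3-fold CV
inner_cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)
search = GridSearchCV(NuSVC(kernel="rbf"), param_grid, cv=inner_cv)
search.fit(X, y)
print(search.best_params_)
```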
Random forest parameters were set to default values (see mlr documentation on classif.randomForest) apart from those tuned via inner cross validation. These were ranges of (10, 500) for ntree and (1, 5) for nodesize, with linear search space in both. A grid of resolution 3 in each direction was explored to choose optimal parameters with an internal loop of 3 repetitions of standard 3-fold CV.
COSNet (COst Sensitive neural Network [33,47]) consists of a parametric Hopfield recurrent neural network classifier, employed within a semi-supervised, cost-sensitive learning context to deal with networks seeded with highly unbalanced labellings. The cost (regularisation) parameter in the COSNet R package was set to 0.0001 following the documentation guidelines.
Finally, we included the following three naïve baseline methods for comparison purposes: (1) pr, a classic 'non-personalised' PageRank implementation where input scores on the genes are ignored; (2) randomraw, which applies the raw diffusion approach from diffuStats [39] to randomly permuted input scores on the genes; and (3) random, a uniform re-ranking of input genes without any network propagation (using sample(n) in R, with n = number of genes in the test fold).

Comparing methods
Comparing methods using their predictions on seed and novel genes can give insights into similarities and differences among the method families. The main body contains a comparison using the drugs genes as seeds and excluding the seeds from the comparison, while here we also include the seeds (figure O) and add an analogous analysis on the genetically associated genes (figure P). Classical MDS plots for specific diseases can be found in the supplementary file mds plots.zip and are qualitatively consistent with their multiview counterparts.
Figure O: Multi-view MDS plot displaying the preserved Spearman's footrule distances representing the differential ranking behaviours of methods across all 22 diseases when individual sets of drugs seeds were input. Each plot is for a different combination of input network (columns) and the predicted gene set that was ranked (rows). Note how COSNet is excluded from seeds prediction, as by its definition it does not order the seeds.
In figure O, two groups of diffusion-based methods consistently clustered together: (i) raw, gm and bagsvm, and (ii) knn and wsld. In other words, the supervised, bagged SVM based on the regularised Laplacian kernel behaved like the usual diffusion scores (raw) that use the same kernel. Despite their common background, groups (i) and (ii) appeared together in OmniPath but not in STRING, implying that even small methodological differences can have a noticeable overall impact. A third group (iii) was formed by ppr, z and mc, although the latter did not cluster as clearly in some cases. These methods are also diffusion-based, but mc and z have statistical differences with raw [39], and this is reflected in the MDS plot. The supervised methods (iv) rf and svm also tended to agree, since they were trained on the same network-based features. Finally, (v) EGAD and COSNet close the method grouping, suggesting that the artificial neural network from COSNet resembled neighbour voting approaches.
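The distances underlying these plots can be sketched as follows (hypothetical rankings; Spearman's footrule feeds a classical MDS with a precomputed dissimilarity matrix):

```python
import numpy as np
from sklearn.manifold import MDS

def footrule(rank_a, rank_b):
    """Spearman's footrule: total absolute displacement between two rankings."""
    return int(np.abs(np.asarray(rank_a) - np.asarray(rank_b)).sum())

# Hypothetical ranks assigned to five genes by three methods
ranks = np.array([
    [1, 2, 3, 4, 5],   # method A
    [1, 2, 3, 5, 4],   # method B, close to A
    [5, 4, 3, 2, 1],   # method C, reversed
])
D = np.array([[footrule(a, b) for b in ranks] for a in ranks])
assert D[0, 1] < D[0, 2]   # similar methods lie closer

# 2-D embedding of the precomputed footrule distances, as in an MDS plot
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
```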
In figure P, groups (i) and (ii) stick together in all the scenarios and become one single family. Group (iii) is only obvious in STRING and non-seed genes, becoming diluted in the rest.
The tight five-method group (raw, gm, bagsvm, knn and wsld), seen only for OmniPath with input drug seeds in the main body, is apparent with both networks. Groups (iv) and (v) still behave as such, further justifying this classification. Despite some differences, these groupings agree with those seen for the corresponding networks under drug seed input in figure O.

Methods ranking using all the metrics
In the main text we show the method prioritisations using the main metrics. Figure Q contains the same data for all the metrics analysed here. The metrics have been arranged from farthest (top 20 hits) to closest to AUROC. Two conclusions can be drawn from figure Q. First, AUROC behaves differently from the other five metrics, which in turn behave alike. This is expected, as AUPRC, pAUROC and top k hits emphasise the performance at the top ranked entities. Second, as the parameter of pAUROC and top k hits grows, both metrics rank closer to AUROC, which is also natural.
The fact that top 20 hits, top 100 hits and AUPRC behave so similarly suggests that the ranking under top k hits is robust for small values of k (k ≤ 100) and that AUPRC is indeed a meaningful performance metric for real scenarios in drug development.

Figure Q: Ranking of all the methods by their predicted performance, using the predictions of the main and the reduced models on the drugs input, STRING network and block cross validation, averaged over diseases. A column-wise z-score on the predicted mean is depicted, in order to illustrate the magnitude of the difference.
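The metrics themselves can be sketched as follows (toy labels and scores; scikit-learn's max_fpr argument stands in for pAUROC, in its standardised form):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def top_k_hits(y_true, scores, k):
    """Number of true genes among the k top-scoring predictions."""
    order = np.argsort(scores)[::-1]
    return int(np.sum(np.asarray(y_true)[order[:k]]))

# Hypothetical fold: 1 marks a held-out drug target, scores come from a method
y = np.array([1, 1, 0, 0, 1, 0])
s = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

auroc = roc_auc_score(y, s)
pauroc = roc_auc_score(y, s, max_fpr=0.1)    # standardised partial AUROC
auprc = average_precision_score(y, s)
hits = top_k_hits(y, s, k=2)
print(hits)  # → 2
```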

Model summaries and confidence intervals on predictions
Model description

Drugs input
Additive models
The DA3 model was used in the main body to statistically compare method performances. We explored its diagnostic plots (figure R) to ensure we drew sound conclusions from it. The first panel in figure R contains the deviance residuals against the predicted values. The lack of tendencies, reflected by the flat red line, supports that the residuals are healthy and that the Poisson is a suitable distribution to describe the data [Zuu09]. The fourth panel of figure R shows, using Cook's distance statistic, that there are no influential observations.

Note: * p<0.1; ** p<0.05; *** p<0.01

Table P: Predictions of the models GA1, GA2, GA3 (95% confidence intervals after averaging over disease).
Table R: Predictions of the models GA4, GA5 and GA6 (95% confidence intervals after averaging over disease).

Reference streams
To evaluate the extent to which using networks for predicting disease genes adds value over not using networks at all, we also checked how well the gene scores from the other data streams in Open Targets could recover known drug targets. To that end, we computed the metrics between each of the remaining streams and the drug targets stream, reusing the partitions from the cross validation folds (see main text). These metrics are therefore not directly comparable to those presented above for the network-based approaches, as in this case the concept of cross-validation does not apply; it is rather a data subsetting strategy to compute the estimates, to which the additive models also apply (see table S).
The gene scores from the Open Targets literature data stream show the best alignment with the scores from known drug targets.
There may be some circularity explaining this, as the literature data stream uses publications mentioning known drug targets and their relation to diseases, and a gene with abundant literature describing its relationship to disease may also be more likely to be picked as a potential drug target. The genetic association data stream is second best in terms of correlation with the known drug target scores, justifying its usage for finding potential targets a posteriori.

Interaction effects between CV scheme, network, diseases and methods
As mentioned in the main text, and illustrated by the tables above, in addition to a main effects model we also considered how model performance could vary with other parameters, such as the choice of network and disease. In particular, we formally asked whether the differences, or interactions, we observed were in line with or greater than those that would be expected by chance.
Interaction effects were omitted from our main analysis to avoid overfitting the data, with the corresponding underestimation of the residual error and inflation of statistical significance. Given the large number of possible combinations, this was a risk even where the majority of interactions were not significant. This is in contrast to the current exploration, where the sizes of any such effects are of interest per se. On the other hand, this exploratory analysis shows that the interaction terms that would improve the model involve poorly performing methods. Given their lack of relevance to the final recommendations, including such terms would make the formal comparisons cumbersome without providing any added value.
To simplify our analysis, motivated by standard statistical theory for multi-factorial statistical screening designs [Mon17], high order interactions, such as those between, say, CV scheme, network and disease, were omitted from our calculations and assumed to be little different from statistical noise. In contrast to standard screening methods, however, which typically address all binary or two-level factors, all factor levels were considered here, adding to the complexity of the plots. In a further deviation, two-way interactions were considered independently from all lower order terms (here, one-way or main effects), which were removed from the signal prior to analysis. This was done at the cost of over-counting the total degrees of freedom in our data by one, but served to improve interpretability, since otherwise we would have one fewer interaction term than combinations in the looked-for effects. See figure S for an example, and the html viewer in the supplementary file interaction html viewer.zip for other models and interaction terms. Deviations upwards from a straight-line trend suggest interactions larger than would be expected by chance. Because absolute values are used to 'fold over' the two distributions (signs of interactions are difficult to interpret and can confuse comparisons), this is typically referred to as a half normal plot. To maintain a one-to-one correspondence between observed deviations and the set of two-way combinations of disease and method, which would otherwise be lost by accounting for degrees of freedom, main effect contributions for disease and method were removed prior to this analysis. As a guide to the underlying variability, the plot also includes 95% confidence intervals for the distribution of each absolute normal reference value under re-sampling.
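A minimal sketch of the half normal construction (simulated interaction estimates, not the study's model output):

```python
import numpy as np
from scipy.stats import halfnorm

rng = np.random.default_rng(0)

# Hypothetical two-way interaction estimates after removing main effects:
# mostly noise, plus a couple of genuinely large terms
effects = np.r_[rng.normal(0.0, 1.0, 28), 5.0, -6.0]

# Fold over by taking absolute values and sort, as in a half normal plot
abs_eff = np.sort(np.abs(effects))
n = len(abs_eff)

# Half-normal reference quantiles for the ordered absolute effects
ref = halfnorm.ppf((np.arange(1, n + 1) - 0.5) / n)

# Points rising above the straight-line trend flag real interactions
assert abs_eff[-1] / ref[-1] > abs_eff[n // 2] / ref[n // 2]
```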