Optimizing network propagation for multi-omics data integration

doi:10.1371/journal.pcbi.1009161

Table 1.

Different graph normalization approaches and their impact on propagated scores.

Network propagation leads to topology bias when the normalized Laplacian of the graph is used, whereas the degree row-normalized adjacency matrix does not lead to bias on the hub nodes. The Laplacian of the graph cannot be used for RWR because the iterative process is not guaranteed to converge for all values of α. Yes: presence of topology bias; No: absence of topology bias for the respective combination of propagation algorithm and graph normalization approach. The symbol “-” indicates that convergence is not guaranteed for all values of the smoothing parameter.
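The contrast between the two normalizations can be illustrated on a toy star graph. The following is a minimal sketch, not the authors' implementation. It assumes the RWR update f ← α·W·f + (1 − α)·f₀, with α acting as the spreading parameter, and it uses the symmetric matrix D^(−1/2)·A·D^(−1/2) as a stand-in for the normalized-Laplacian variant. The graph and the names `rwr`, `row_norm` and `sym_norm` are hypothetical.

```python
import math

def rwr(W, f0, alpha, n_iter=200):
    """Iterate f <- alpha * W f + (1 - alpha) * f0 (assumed convention)."""
    f = list(f0)
    for _ in range(n_iter):
        f = [alpha * sum(w * x for w, x in zip(row, f)) + (1 - alpha) * s
             for row, s in zip(W, f0)]
    return f

# Toy star graph (hypothetical): node 0 is the hub, nodes 1..4 are leaves.
n = 5
A = [[0.0] * n for _ in range(n)]
for leaf in range(1, n):
    A[0][leaf] = A[leaf][0] = 1.0
deg = [sum(row) for row in A]

# Degree row-normalized adjacency matrix: every row sums to 1.
row_norm = [[A[i][j] / deg[i] for j in range(n)] for i in range(n)]
# Symmetric normalization, assumed here to represent the normalized-Laplacian case.
sym_norm = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

f0 = [1.0] * n  # identical initial scores on all nodes
scores_row = rwr(row_norm, f0, alpha=0.5)
scores_sym = rwr(sym_norm, f0, alpha=0.5)

print("row-normalized:", [round(s, 3) for s in scores_row])
print("symmetric:     ", [round(s, 3) for s in scores_sym])
```

Under these assumptions, the row-normalized matrix leaves a uniform input unchanged, because each row sums to one. The symmetric normalization raises the hub score above the leaf scores, which mirrors the hub bias described in the caption.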


Fig 1.

Normalized Laplacian induces a topology bias.

Distributions of log-transformed node scores after network propagation using the normalized Laplacian (t = 0.7 for Heat Diffusion and α = 0.5 for Random Walk with Restart). The input vector was a unit vector, i.e. all nodes had identical initial scores. Hub nodes (the 10% of nodes with the highest degree) gain higher average scores, whereas non-hub nodes (the 10% of nodes with the lowest degree) get lower average scores (two-sided Wilcoxon rank sum test, p-value < 2.2∙10⁻¹⁶ (***)).


Fig 2.

Bias-variance decomposition of the Mean Squared Error (MSE).

Mean squared error curve (blue, averaged across samples) across the values of the spreading parameter for the brain (A, B) and liver (C, D) and for the liver data with added noise (E, F) in the Ori et al. [13] dataset, and for the mRNA (G, H) and protein layer (I, J) in the PCa dataset using HD (A, C, E, G, I) and RWR (B, D, F, H, J). Minimum MSEs are circled (blue). Additionally, the decomposition is depicted: the bias² curve (green), the variance curve (orange) (as defined in Material and Methods) and their sum (red) are shown. The sum (red curve) approximates the observed MSE (blue curve) very well. For the quantities with replicate values (i.e. observed MSE, variance and sum), error bars have been added to represent the actual distribution at the respective spreading parameters. The replicate data are first log-transformed. Subsequently, the average and standard deviation (SD) across the six replicates are computed for each value of the spreading parameter in the transformed space. Finally, the average, the lower error bound (average-SD) and the upper error bound (average+SD) in the log-transformed space are transformed back to the original space (i.e. the points correspond to the geometric mean). In (E, F), the dashed curve (blue) corresponds to the MSE curve of the original liver data (i.e. the (blue) curve depicted in (C) and (D) respectively).
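The error-bar construction described above (log-transform, average ± SD, back-transform) can be sketched as follows. This is an illustration only: the caption does not state the logarithm base or whether the sample or population SD was used, so the natural logarithm and the sample SD are assumptions here. The function name `log_space_summary` and the replicate values are hypothetical.

```python
import math
import statistics

def log_space_summary(values):
    """Return (center, lower, upper) computed in log space and back-transformed.

    Assumptions: natural logarithm and sample standard deviation.
    The center equals the geometric mean of the input values.
    """
    logs = [math.log(v) for v in values]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    return math.exp(mean), math.exp(mean - sd), math.exp(mean + sd)

# Hypothetical MSE values of six replicates at one spreading parameter.
replicate_mse = [0.8, 1.0, 1.25, 0.9, 1.1, 1.0]
center, lower, upper = log_space_summary(replicate_mse)
print(center, lower, upper)
```

Because the bounds are symmetric in log space, they are asymmetric around the center in the original space, and the product of the lower and upper bounds equals the squared center.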


Fig 3.

Factors affecting network propagation gain of individual genes.

A, B: MSE curves of individual genes for the mRNA (A) and protein (B) layer of the PCa study using RWR. C, D, E: Impact of network propagation on gene-specific MSEs. The panels show the change in the MSE (difference between the values at α = 0 and α = 0.4) versus the corresponding (squared) mean log₂ fold changes (C), versus the inter-replicate variance (D), and versus the corresponding ratios (absolute mean/SD) (E) for the mRNA layer. The correlation coefficient is given in all three panels. F: Average log₂ fold change of the genes’ neighbors versus their own fold changes for the mRNA layer. The correlation coefficient and corresponding p-value are given. The two vertical red lines lie at abscissae -1.9 and -1.3. Colored points within this area were selected to generate the MSE curves in (G). G: MSE curves of the colored points in (F). The genes were selected to have similar mean log₂ fold changes to eliminate the effect of the mean value. Red curves correspond to genes whose neighbors have similar mean values, blue curves to genes in a random neighborhood and green curves to genes whose neighbors have mean values of opposite sign. Genes with informative neighbors (red curves) achieved lower MSEs after network propagation than the other genes (blue and green curves).


Fig 4.

Inter-replicate consistency and within-patient similarity across network smoothing.

Correlation of replicate-wise propagated log₂ fold changes with propagated average log₂ fold changes of the transcriptome for the brain (A, B) and liver (C, D) using HD (A, C) and RWR (B, D). Gray lines are correlations of individual replicates with the average fold changes obtained by combining all replicates. Blue lines are averaged correlations across replicates at the respective spreading parameters for HD and RWR. Maximum correlations are circled (gray and blue). The vertical dashed red lines denote the optimal spreading parameters from the between-dataset analysis (i.e. comparing mRNA and protein propagated scores; see Fig 5). E, F: Correlations between propagated mRNA log₂ fold changes of TA1 and corresponding propagated mRNA log₂ fold changes of paired TA2 across the values of the spreading parameter for the 25 PCa patients with HD (E) and RWR (F). Each gray curve corresponds to a patient while the average curve is depicted in blue. Average maximal correlations are circled. For computing the correlation, only measured network nodes within each dataset were used.


Fig 5.

Network propagation improves the correlation between mRNA and protein levels of ageing tissues (brain and liver) and PCa samples.

A, B: Correlations between propagated scores from mRNA log₂ fold changes and protein weighted mean log₂ fold changes from old brain samples with varying spreading coefficients for HD (A) and RWR (B). C, D: Correlations of propagated scores from mRNA and protein log₂ fold changes of liver samples with HD (C) and RWR (D). Dashed red lines: local maximal correlations from the between-dataset consistency analysis. Blue circles: average maximal correlations from the within-dataset consistency analysis. E, F: Correlations between propagated mRNA log₂ fold changes and corresponding propagated protein log₂ fold changes across the values of the spreading parameter for the 63 PCa tumor samples with HD (E) and RWR (F). Each gray curve corresponds to a tumor sample, while the average curve is depicted in blue. Average maximal correlations are circled. Correlations were calculated using genes that were expressed and quantified with both RNA-Sequencing and MS proteomics and were present in the functional interaction network (n = 1,772 genes for brain, n = 1,670 for liver, n = 1,828 for PCa).


Fig 6.

Network propagation identifies ageing-associated genes as well as genes distinguishing prostate tumors of different grades.

RNA (A and C) and proteome (B and D) log₂ fold changes were smoothed on the network using RWR (yellow/green colored) and HD (purple/blue colored). The spreading parameters were set to α = 0.5 and t = 0.7 for brain (A and B) and α = 0.4 and t = 0.3 for liver (C and D). These are the optimal parameters found by the between-dataset consistency analysis. Afterwards, we selected the unmeasured genes that were recovered by network propagation and identified the ageing-associated genes among these imputed genes. The median absolute propagated log₂ fold change of the ageing-associated genes among the imputed genes is higher than the median absolute propagated log₂ fold change of all imputed genes. For protein log₂ fold changes in brain and liver smoothed with HD, the difference is significant (one-sided Wilcoxon rank sum test; * p ≤ 0.05). E: Number of differentially expressed genes between the more and less aggressive PCa tumors with HD at the mRNA layer before (upper left) and after multiple hypothesis testing correction (upper right) across the values of the spreading parameter. Below, the distributions of -log₁₀ of the t-test p-values for the different values of the spreading parameter before (lower left) and after multiple hypothesis testing correction (lower right) are depicted. The red lines have ordinates -log₁₀(0.05) and -log₁₀(0.1), respectively, and correspond to the significance thresholds. F: Same as in (E) but for RWR.
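Counting significant genes before and after multiple testing correction, as in panels E and F, can be sketched as follows. The caption does not name the correction method; the Benjamini-Hochberg procedure is used here purely as an assumption. The function names and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def count_significant(pvalues, threshold):
    """Number of p-values at or below the given threshold."""
    return sum(p <= threshold for p in pvalues)

# Hypothetical t-test p-values for a handful of genes at one spreading parameter.
raw_p = [0.001, 0.02, 0.03, 0.2, 0.5, 0.04]
adj_p = benjamini_hochberg(raw_p)

print("raw, p <= 0.05:     ", count_significant(raw_p, 0.05))
print("adjusted, p <= 0.1: ", count_significant(adj_p, 0.1))
```

Repeating this count for each value of the spreading parameter would give curves of the kind shown in the upper panels, under the stated assumption about the correction method.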
