Null diffusion-based enrichment for metabolomics data

doi:10.1371/journal.pone.0189012

Fig 1.

Workflow summary.

Contextual knowledge is extracted from KEGG as a graph object while experimental data is introduced as a list of affected metabolites. A null diffusive model assesses, and reports in a subgraph, which part of the KEGG graph is relevant for the input metabolites.

Fig 2.

Node arrangement for (a) heat diffusion and (b) PageRank.

The affected metabolites are highlighted with a black ring. For heat diffusion (a), affected metabolites are forced to generate a unit flow of heat. Every pathway is highlighted with a blue ring, representing its connection to a cool boundary node. At equilibrium, the highest-temperature pathways (and nodes) will have the greatest heat flow, suggesting a relevant role in the experiment. For PageRank (b), affected metabolites are the starting points of random walks. PageRank scores, represented by the intensity of the blue colour, will attain higher values in the nodes that the random walks reach most frequently.
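
The two scoring schemes in this caption can be sketched on a toy graph. The snippet below is a minimal illustration, not the authors' implementation: the six-node graph, the node names and both helper functions are hypothetical, the graph is treated as undirected, and the damping factor 0.85 is simply the conventional PageRank default. Heat diffusion is assumed to solve (L + B) T = g, where L is the graph Laplacian, B marks the pathway nodes tied to the cool boundary, and g places one unit of heat generation on each affected metabolite.

```python
import numpy as np

# Hypothetical toy graph: four compounds (c0..c3) and two pathways (P0, P1).
nodes = ["c0", "c1", "c2", "c3", "P0", "P1"]
edges = [("c0", "P0"), ("c1", "P0"), ("c2", "P0"), ("c2", "P1"), ("c3", "P1")]
idx = {name: i for i, name in enumerate(nodes)}

n = len(nodes)
A = np.zeros((n, n))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0  # undirected adjacency

affected = [idx["c0"], idx["c1"]]   # input metabolites (black rings)
pathways = [idx["P0"], idx["P1"]]   # nodes linked to the cool boundary


def heat_diffusion(A, sources, boundary):
    """Stationary temperatures: solve (L + B) T = g."""
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
    B = np.zeros_like(A)
    B[boundary, boundary] = 1.0             # one link per pathway to the boundary
    g = np.zeros(len(A))
    g[sources] = 1.0                        # unit heat generation per source
    return np.linalg.solve(L + B, g)


def personalised_pagerank(A, sources, damping=0.85):
    """Random walks restarted at the sources: r = (1 - d) (I - d M)^-1 p."""
    M = A / A.sum(axis=0)                   # column-stochastic transition matrix
    p = np.zeros(len(A))
    p[sources] = 1.0 / len(sources)         # restart distribution on the sources
    return (1 - damping) * np.linalg.solve(np.eye(len(A)) - damping * M, p)


temperatures = heat_diffusion(A, affected, pathways)
scores = personalised_pagerank(A, affected)

for name in nodes:
    print(f"{name}: T = {temperatures[idx[name]]:.3f}, PR = {scores[idx[name]]:.3f}")
```

In this toy case the temperatures are 2.5 for c0 and c1, 1.0 for c2, 0.5 for c3, 1.5 for P0 and 0.5 for P1, so the pathway adjacent to both affected metabolites ends up warmer, and the heat leaving through the boundary (1.5 + 0.5) equals the two units generated.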

Fig 3.

Toy example of an over-representation analysis of a hypothetical “pathway A” containing 3 metabolites out of a total of 10.

The list to be enriched contains 4 metabolites, 2 of which are hits in the pathway. The corresponding over-representation analysis (Fisher’s exact test) can be understood as a diffusion process on the depicted network followed by a null model. The temperature of pathway A always coincides with the number of hits in the pathway, implying that its null distribution is the hypergeometric distribution, against which a one-tailed temperature comparison is made.
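
The numbers in this toy example can be checked directly. The sketch below computes the one-tailed hypergeometric probability of seeing at least 2 hits when 4 metabolites are drawn from 10, of which 3 belong to pathway A; the function name and its parameter names are illustrative and only the Python standard library is used.

```python
from math import comb


def hypergeom_upper_tail(total, in_pathway, list_size, hits):
    """P(X >= hits) when list_size items are drawn without replacement
    from `total` items, `in_pathway` of which belong to the pathway."""
    max_hits = min(in_pathway, list_size)
    favourable = sum(
        comb(in_pathway, k) * comb(total - in_pathway, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return favourable / comb(total, list_size)


# Toy example from the caption: 10 metabolites, 3 in pathway A,
# a list of 4 metabolites, 2 of them hits.
p_value = hypergeom_upper_tail(total=10, in_pathway=3, list_size=4, hits=2)
print(f"one-tailed p-value = {p_value:.4f}")  # (63 + 7) / 210 = 0.3333
```

With only 10 metabolites, 2 hits out of 4 are unremarkable: a third of all possible lists of size 4 contain at least 2 metabolites from pathway A.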

Fig 4.

Expected value (a) and standard deviation (b) of the null temperatures, stratified by level. Jitter is applied for visual purposes, and the 0.95 confidence intervals are computed by the default GAM models in the ggplot2 R library [41]. Clear biases arise from the node degree, a topological property of the nodes: the larger the pathway, the higher its mean value, and the more connected a compound is, the smaller its variance. If pathways are ranked by raw temperatures, a large pathway will have an undesired, consistent advantage over small ones and will be reported too often. Using z-scores (d) instead of raw temperatures (c) to select the top 250 nodes addresses these biases and highlights pathway and module nodes that were eclipsed by other compounds and reactions with higher mean null temperatures.
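
The size bias and its correction can be reproduced on a deliberately small example. The sketch below is an illustration under stated assumptions rather than the published procedure: the graph (one pathway with six compounds, one with two), the node names and the `temperatures` helper are hypothetical, and the null distribution is obtained by enumerating every possible input of the same size, which is only feasible because the graph is tiny. On a graph the size of KEGG the null moments would have to be estimated some other way, for example by sampling.

```python
from itertools import combinations

import numpy as np

# Hypothetical toy graph: a large pathway with six compounds and a small one with two.
big = [f"b{i}" for i in range(6)]
small = ["s0", "s1"]
nodes = big + small + ["P_big", "P_small"]
idx = {name: i for i, name in enumerate(nodes)}
edges = [(c, "P_big") for c in big] + [(c, "P_small") for c in small]

n = len(nodes)
A = np.zeros((n, n))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0

L = np.diag(A.sum(axis=1)) - A                  # graph Laplacian
B = np.zeros_like(A)
for p in ("P_big", "P_small"):
    B[idx[p], idx[p]] = 1.0                     # pathways touch the cool boundary
K = np.linalg.inv(L + B)                        # explicit inverse is fine at toy scale


def temperatures(sources):
    """Stationary temperatures when each source generates unit heat."""
    g = np.zeros(n)
    g[list(sources)] = 1.0
    return K @ g


compounds = [idx[c] for c in big + small]
observed = temperatures([idx["b0"], idx["b1"], idx["s0"]])

# Exact null: every possible input with the same number of metabolites.
null = np.array([temperatures(s) for s in combinations(compounds, 3)])
null_mean = null.mean(axis=0)
null_sd = null.std(axis=0)                      # population SD over the full null
z = (observed - null_mean) / null_sd

for p in ("P_big", "P_small"):
    i = idx[p]
    print(f"{p}: raw = {observed[i]:.2f}, null mean = {null_mean[i]:.2f}, z = {z[i]:+.2f}")
```

Here the large pathway has the higher raw temperature (2 against 1) yet sits below its null mean of 2.25, while the small pathway exceeds its null mean of 0.75, so ranking by z-score reverses the order given by raw temperatures.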

Fig 5.

Ranking the 288 KEGG pathways (lower is better) using raw temperatures (a) biases the ranks towards pathways with a higher mean null temperature, which in turn tend to be large pathways. Using the z-scores instead (b) breaks this clear dependence and avoids reporting pathways just because of their size. The top 20 pathways by raw temperature (c), depicted as black dots, include pathways that are even below their mean value, while the top 20 by z-score (d) suggest smaller pathways that were penalised by the aforementioned bias.

Table 1.

Summary of the outputs.

Table 2.

Solutions overlap.

Table 3.

Reported pathways.

Fig 6.

Subgraph reported through HD norm. The names of reactions and enzymes have been omitted for clarity.

Compounds are green, reactions are blue, enzymes are orange, modules are purple and pathways are red. The compounds in the input are highlighted as green squares to ease the tracing of the biological perturbation up to the pathways. The presence of reactions and enzymes that link pathways in this subgraph might suggest relevant entities through which the affected pathways crosstalk. All the reported pathways and modules lie in a large connected component (CC), as does a newly proposed metabolite (L-Glutamate).

Fig 7.

Synthetic signals evaluation using the pathway rank as a metric to assess orderings.

Lower ranks correspond to better-ranked pathways. The proposed methodology is compared to ORA, represented by Fisher’s exact test. (a) 288 noisy signals have been generated, and every pathway has been ranked in each of the 288 runs. The data points for a given methodology are the mean ranks of each pathway, giving 288 data points per box. (b) 288 signals with a target pathway have been generated in three scenarios: pure noise, proportion-based sampling and network-based sampling. Each box contains the ranks of the target pathways, leading to 288 data points per box.

Fig 8.

KEGG representation of the Glutathione metabolism (hsa00480).

KEGG compounds found to be affected through MS (orange) and NMR (blue) are pinpointed in the figure. Additionally, enzymes and compounds reported by HD norm are depicted in red. Our approach provides a criterion for highlighting a pathway together with the entities it contains, for example its reported enzymes, to build a sub-pathway representation richer than that of classical methods, which rely solely on pathways and compounds. Reprinted from www.genome.jp under a CC BY license, with permission from Kanehisa Laboratories, original copyright 2014.

Table 4.

Distance to NMR metabolites.
