Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions

doi:10.1371/journal.pcbi.1008920

Fig 1.

Diagram showing the relationship between the various metabolomic and genomic objects.

On the genomics side, BGCs are detected from microbial genomes, colour-coded by strain. These are clustered into GCFs, where each GCF contains BGCs from one or more strains. GCFs can thus also be considered as sets of strains, where each strain contributes at least one BGC to the GCF. On the metabolomics side, MS2 spectra measured in microbial cultures are grouped across strains, so that identical spectra are assigned one or more strains in which they appear. These are further grouped into MFs in a process called Molecular Networking, where each MF consists of one or more related spectra. Both spectra and MFs can likewise be considered as sets of strains where the spectrum, or a spectrum in the MF, is present in the sample for the strain. Feature-based approaches can be used to link BGCs to individual spectra, while correlation-based approaches can be used to link GCFs to either MFs or spectra, based on the pattern of strain contents.

Fig 2.

The effect of size on strain correlation scoring.

(A) Size discrepancy in the strain correlation score for GCFs of varying sizes. Each box represents a strain, with filled boxes denoting that the strain is a member of the GCF or MF, and blank boxes that it is not. The top GCF-MF pair outscores the bottom pair by 30 to 26, despite the bottom pair having arguably stronger correspondence. (B) Expected value and variance of the strain correlation score for a population of 100 strains, as a function of GCF and MF sizes. Both the expected value and the variance have a considerable range, rendering comparison between links involving different sizes of GCFs and MFs difficult. For instance, a GCF and MF of size 80 could easily get a score of 500 or higher by chance, while for a GCF and MF of size 20, a score this high would be highly significant.
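
The size effect in panel (B) can be reproduced with a short calculation. The following is a minimal sketch, not the paper's implementation: it assumes the raw score is a weighted count of strains in four categories (in both, GCF only, MF only, neither), that the weights are +10, -10, 0 and +1 (an assumption, since the caption does not state them), and that under the null the overlap between a GCF and an MF follows a hypergeometric distribution. The function names are hypothetical.

```python
from math import sqrt

# ASSUMED weights (not given in the caption): reward shared strains, penalise
# strains found only in the GCF, ignore MF-only strains, and give a small
# reward for strains absent from both.
WEIGHTS = {"both": 10, "gcf_only": -10, "mf_only": 0, "neither": 1}


def raw_score(gcf, mf, all_strains, w=WEIGHTS):
    """Weighted count of strains by membership category (sets of strain ids)."""
    both = len(gcf & mf)
    gcf_only = len(gcf - mf)
    mf_only = len(mf - gcf)
    neither = len(all_strains) - both - gcf_only - mf_only
    return (w["both"] * both + w["gcf_only"] * gcf_only
            + w["mf_only"] * mf_only + w["neither"] * neither)


def null_moments(n, g, m, w=WEIGHTS):
    """Mean and variance of the raw score for n strains, GCF size g, MF size m.

    The score is linear in the overlap k, and k is assumed hypergeometric.
    """
    slope = w["both"] - w["gcf_only"] - w["mf_only"] + w["neither"]
    intercept = w["gcf_only"] * g + w["mf_only"] * m + w["neither"] * (n - g - m)
    mean_k = g * m / n
    var_k = g * m * (n - g) * (n - m) / (n * n * (n - 1))
    return slope * mean_k + intercept, slope ** 2 * var_k


def standardised_score(gcf, mf, all_strains, w=WEIGHTS):
    """Raw score minus its null mean, divided by its null standard deviation."""
    mean, var = null_moments(len(all_strains), len(gcf), len(mf), w)
    if var == 0:
        return 0.0  # degenerate sizes: every arrangement gives the same score
    return (raw_score(gcf, mf, all_strains, w) - mean) / sqrt(var)


# Toy example: 10 strains, a GCF and an MF that both contain strains 0, 1, 2.
strains = set(range(10))
gcf, mf = {0, 1, 2}, {0, 1, 2}
toy_raw = raw_score(gcf, mf, strains)
toy_std = standardised_score(gcf, mf, strains)

# Null mean and variance for 100 strains with a GCF and an MF of size 80 each.
mean_80, var_80 = null_moments(100, 80, 80)
```

Under these assumed weights the null mean for a GCF and MF of size 80 among 100 strains is 484, in line with the caption's remark that such a pair could reach a score around 500 by chance. Dividing by the null standard deviation is what makes links between objects of different sizes comparable.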

Fig 3.

Arrow diagram of the Input-Output Kernel Regression (IOKR) framework.

X denotes the space of MS2 spectra; the two other spaces in the diagram are the space of metabolites and the shared space of molecular fingerprints. One arrow is the (learned) mapping from MS2 spectra to fingerprints, while ϕ is the (exact) mapping from metabolites to molecular fingerprints.
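
The arrow diagram can be read as a two-step procedure: learn a mapping from spectra into fingerprint space, then compare the predicted fingerprint with the exact fingerprints of candidate metabolites. The sketch below illustrates the idea with kernel ridge regression on toy data. It assumes an RBF kernel on made-up 2-D "spectrum" vectors, unit-normalised binary fingerprints, and hypothetical function names (`fit_mapping`, `predict_fingerprint`, `rank_candidates`); it is not the implementation used in the paper.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)


def fit_mapping(X_train, F_train, lam=1e-6, gamma=1.0):
    """Kernel ridge regression from inputs to fingerprint vectors.

    Returns the coefficient matrix C = (K + lam * I)^{-1} F.
    """
    K = rbf_kernel(X_train, X_train, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), F_train)


def predict_fingerprint(x, X_train, C, gamma=1.0):
    """Predicted point in fingerprint space for a single input vector x."""
    k = rbf_kernel(x[None, :], X_train, gamma)  # shape (1, n_train)
    return (k @ C)[0]


def rank_candidates(pred, F_candidates):
    """Order candidates by inner product with the predicted fingerprint."""
    scores = F_candidates @ pred
    return np.argsort(-scores), scores


# Toy training data (illustrative only): four "spectra" and their fingerprints.
X_train = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
F_bits = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
], dtype=float)
# Normalise so the inner product behaves like a cosine similarity.
F_train = F_bits / np.linalg.norm(F_bits, axis=1, keepdims=True)

C = fit_mapping(X_train, F_train)

# A new "spectrum" close to the first training spectrum.
query = np.array([0.2, 0.1])
order, scores = rank_candidates(predict_fingerprint(query, X_train, C), F_train)
```

Here the candidate set is simply the training fingerprints; `order[0]` is the index of the best-matching candidate for the query.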

Fig 4.

Diagram of the NPLinker module.

The NPLinker module helps with automatically linking GCFs and MFs. It integrates metabolomic and genomic data sets, using either external sources, user-provided data, or a mixture of both, and ranks potential links between metabolomic and genomic objects by given scoring functions, either built-in or user-defined.
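
As an illustration of ranking links with a user-defined scoring function, here is a minimal sketch in plain Python. It does not use or reproduce the NPLinker API: `rank_links`, `jaccard`, and the toy identifiers are all hypothetical, and GCFs and MFs are represented simply as sets of strain names.

```python
from itertools import product


def jaccard(gcf_strains, mf_strains):
    """Example user-defined score: overlap of the two strain sets."""
    union = gcf_strains | mf_strains
    return len(gcf_strains & mf_strains) / len(union) if union else 0.0


def rank_links(gcfs, mfs, scoring_fn, top_n=None):
    """Score every GCF-MF pair and return (score, gcf_id, mf_id), best first."""
    links = [
        (scoring_fn(g_strains, m_strains), g_id, m_id)
        for (g_id, g_strains), (m_id, m_strains) in product(gcfs.items(), mfs.items())
    ]
    links.sort(key=lambda link: link[0], reverse=True)
    return links[:top_n]  # top_n=None keeps every link


# Toy data (illustrative only).
gcfs = {"GCF_1": {"s1", "s2"}, "GCF_2": {"s3"}}
mfs = {"MF_a": {"s1", "s2"}, "MF_b": {"s2", "s3", "s4"}}

ranked = rank_links(gcfs, mfs, jaccard)
```

Any callable that takes two strain sets and returns a number can be passed as `scoring_fn`, which is the sense in which scoring functions are interchangeable in this sketch.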

Fig 5.

Distribution of validated links among scores.

Distribution of the raw and standardised strain correlation scores, as well as the distribution of the scores for validated links (in black) relative to the distribution of scores for all links, in the Crüsemann data set. The standardised score has a more pronounced tail at the top end, which includes 13 out of 15 validated links, whereas many of the validated links score relatively low on the distribution of the raw scores. Figures for other data sets can be found in S1 Fig.

Table 1.

Mean scores for all links and the subset of validated links in the Crüsemann data set.

Table 2.

Proportion of validated links among all possible GCF-MF links in the three data sets.

Table 3.

Top-n accuracy, and AUC, of IOKR on MIBiG data.

Fig 6.

Correlation of IOKR- and strain correlation scores.

IOKR- and strain correlation scores for all potential links in the Crüsemann data set, with histograms of the scores. Validated links are indicated in red on the joint plot, and with black lines on the distribution histograms. Validated links are concentrated in the upper-right quadrant, i.e. score relatively high on both axes. Figures for the two further data sets can be found in S3 Fig.

Table 4.

Scoring function performance.

Fig 7.

Scores starting from a particular GCF.

Position of the score for the validated GCF-MF pair (red) within the distribution of the scores of the links between that particular GCF and all MFs, for a selection of validated links in the Crüsemann data set (rows). The first three columns show histograms of the raw and standardised versions of the strain correlation score, as well as the IOKR score, for all links including a given GCF, with the score of the correct link indicated. The last column shows the standardised correlation score (x-axis) and IOKR score (y-axis) for the same links, again with the correct link indicated. Both the IOKR and the standardised correlation scores tend to place validated links higher in the distribution of scores for the GCF under consideration than the raw correlation score does. Furthermore, some validated links score relatively higher on IOKR than on the standardised strain correlation score, and vice versa, suggesting that the two scores complement one another. For full results, as well as for other data sets, please refer to S4 Fig.
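
The position of a validated link within the score distribution for its GCF can be summarised as a rank and a percentile. The sketch below assumes the scores from one GCF to every MF are held in a dictionary keyed by MF identifier; the function name and the toy values are hypothetical and not taken from the paper.

```python
def position_of_validated(scores_by_mf, validated_mf):
    """Rank (1 = best) and percentile of the validated MF among all MFs."""
    target = scores_by_mf[validated_mf]
    all_scores = list(scores_by_mf.values())
    # Rank: one plus the number of links scoring strictly higher.
    rank = 1 + sum(1 for s in all_scores if s > target)
    # Percentile: share of links scoring no higher than the validated one.
    percentile = 100.0 * sum(1 for s in all_scores if s <= target) / len(all_scores)
    return rank, percentile


# Toy scores from a single GCF to four MFs (illustrative only).
toy_scores = {"MF_a": 0.2, "MF_b": 1.7, "MF_c": -0.4, "MF_d": 0.9}
rank, percentile = position_of_validated(toy_scores, "MF_d")
```

Computing this separately for each scoring function gives a like-for-like comparison, because ranks and percentiles do not depend on the scale of the underlying score.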

Fig 8.

Combining scores.

The set of points (x, y) such that ℓ_p(x, y) = 1, for three different values of p. These curves are the iso-lines of the combined score when the ℓ_p function is used to combine the two scores.
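
To make the shape of these iso-lines concrete, the sketch below assumes that ℓ_p is the usual p-norm, (|x|^p + |y|^p)^(1/p), applied to a pair of scores; the paper's exact definition and any rescaling of the scores beforehand are not given in this caption. The function names and the values of p used in the example are illustrative only and are not claimed to be the values plotted in the figure.

```python
def lp_combine(x, y, p):
    """Combine two scores with the p-norm; p = float('inf') gives the maximum."""
    if p == float("inf"):
        return max(abs(x), abs(y))
    return (abs(x) ** p + abs(y) ** p) ** (1.0 / p)


def iso_line_y(x, p, level=1.0):
    """Non-negative y with lp_combine(x, y, p) == level, for finite p and 0 <= x <= level."""
    return (level ** p - x ** p) ** (1.0 / p)


# Points on the level-1 iso-line for two illustrative values of p.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
iso_p1 = [(x, iso_line_y(x, 1)) for x in xs]  # straight line x + y = 1
iso_p2 = [(x, iso_line_y(x, 2)) for x in xs]  # quarter circle
```

With p = 1 a high value on one score can fully compensate for a low value on the other; as p grows the combined score is increasingly dominated by the larger of the two.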
