MetaNovo: An open-source pipeline for probabilistic peptide discovery in complex metaproteomic datasets

doi:10.1371/journal.pcbi.1011163

Fig 1.

Visualisation of the MetaNovo workflow used to analyse the mass spectrometry data of 8 human mucosal-luminal interface samples.

Raw mass spectrometry data in MGF format were analysed using the MetaNovo pipeline, which uses de novo sequence tags to create a targeted FASTA file for target-decoy search.


Fig 2.

A graphical representation of the MetaNovo algorithm applied to sequence database filtration.

Normalized spectral abundance factor (NSAF) calculations include non-unique spectra. The magnitudes of the probabilities are represented by +’s. Proteins are ranked by the joint probability of organism and protein probabilities, represented by the arrow, in order of increasing probability. The number of unique spectra for each protein is determined by its position in the ranked list and includes only spectra that do not appear in the set of proteins above it in the list (but may include spectra that appear below). For example, the spectra for Peptide B are counted towards the first protein in the list, but not the second. Ties between adjacent, nearly identical isoforms that share the same set of spectra are broken by the shortest (most probable) sequence having a higher NSAF (and thus a higher protein probability) or by a higher organism probability. Proteins in green are selected for inclusion in the filtered sequence database, and proteins in red are excluded (having no unique spectra). The colors shared by proteins, peptides and spectra illustrate the assignment of unique spectra and peptides to the most probable protein in the ranked list.
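
The filtration step described in this caption can be sketched roughly as follows. This is a minimal illustration and not the MetaNovo implementation: it uses the standard NSAF definition (spectral count divided by sequence length, normalized over all proteins) and, as a simplifying assumption, ranks proteins by NSAF alone, whereas the caption describes ranking by the joint probability of organism and protein probabilities. The function names, protein identifiers, lengths and spectrum labels are all hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Standard NSAF: (count / length) for each protein, normalized to sum to 1.

    Counts here include non-unique (shared) spectra, as in the caption.
    """
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: value / total for p, value in saf.items()}


def filter_by_unique_spectra(ranked_proteins, spectra_by_protein):
    """Walk the ranked list (most probable first) and keep proteins that
    still have at least one spectrum not seen in any protein above them."""
    seen = set()
    selected = []
    for protein in ranked_proteins:
        unique = spectra_by_protein[protein] - seen
        if unique:
            selected.append((protein, len(unique)))
        # Spectra of every protein above, kept or not, are no longer unique.
        seen |= spectra_by_protein[protein]
    return selected


# Hypothetical toy data: P1 and P1_iso are isoforms sharing the same spectra.
spectra = {
    "P1": {"s1", "s2", "s3"},
    "P1_iso": {"s1", "s2", "s3"},
    "P2": {"s3", "s4"},
}
lengths = {"P1": 100, "P1_iso": 150, "P2": 200}

scores = nsaf({p: len(s) for p, s in spectra.items()}, lengths)
# Assumption: rank by NSAF only; the shorter isoform P1 outranks P1_iso.
ranked = sorted(spectra, key=lambda p: scores[p], reverse=True)
result = filter_by_unique_spectra(ranked, spectra)
print(result)
```

In this toy case the shorter isoform ranks first and claims the shared spectra, the longer isoform is left with no unique spectra and is dropped, and P2 is retained on the strength of its one remaining spectrum.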


Fig 3.

MLI dataset results.

A. Bar chart of peptide identifications. The identification rates of MetaNovo are comparable to the previously published results of MetaPro-IQ using matched metagenome and integrated gene catalog sequence databases. B. Venn diagram showing large overlap in identified sequences using different approaches, with the highest number of sequences identified using MetaNovo. C. Peptide counts by UniPept lowest common ancestor showed similar taxonomic distributions obtained from different approaches. D. Peptides uniquely identified by MaxQuant using the MetaNovo sequence database had a significantly different distribution compared to reverse hits (p-value 6.33e-26). The boxes extend from the lower to the upper quartile, and the whiskers represent 1.5 times the interquartile range (IQR) below and above the first and third quartiles, respectively.


Fig 4.

9MM dataset identification results.

A. Number of peptides identified in each run. B. Number of protein groups identified in each run. C. Peptide identification overlap between the different approaches. D. Peptide PEP score distribution box plot for shared, exclusive and reverse hit peptides for each run.


Fig 5.

Percentages of misassigned peptides for all three 9MM runs.

A. MetaNovo originally yielded a very high percentage of misassigned peptides in the species-level UniPept pept2lca analysis. B. Taxonomic breakdown of misassigned peptides. C. Re-analysis after inclusion of plausible taxa yielded a species-level misassignment rate of only 1.04%, with 0% error for all approaches at genus and family level using the 0.5% taxon-specific peptide stringency cutoff. D. The 9 Acidobacteria bacterium peptides making up the final misassignment percentage of MetaNovo.


Table 1.

MetaNovo yields the highest accuracy for species-level annotations compared to matched genomic databases.

Mean squared error (MSE) scores compare each taxon's share of the MS/MS total of each run, expressed as a percentage, with the expected proportion from CFU counting. Scores closer to 0 indicate higher accuracy.
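
As a rough illustration of how such a score can be computed, the sketch below converts per-taxon MS/MS counts to percentages of the run total and takes the mean of the squared differences from the expected percentages. The taxon names and numbers are invented for the example, and averaging over the expected taxa (with absent taxa counted as 0%) is an assumption rather than a detail stated in the caption.

```python
def to_percent(counts):
    """Convert per-taxon counts to percentages of the run total."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def mean_squared_error(observed_pct, expected_pct):
    """Mean of squared differences over the expected taxa.

    Assumption: a taxon missing from the observed data counts as 0%.
    """
    squared = [
        (observed_pct.get(taxon, 0.0) - expected)** 2
        for taxon, expected in expected_pct.items()
    ]
    return sum(squared) / len(squared)


# Hypothetical run: MS/MS counts per taxon and expected CFU-based percentages.
msms_counts = {"taxon_A": 500, "taxon_B": 300, "taxon_C": 200}
expected_pct = {"taxon_A": 40.0, "taxon_B": 40.0, "taxon_C": 20.0}

observed_pct = to_percent(msms_counts)
mse = mean_squared_error(observed_pct, expected_pct)
print(round(mse, 2))
```

Here two taxa are each off by 10 percentage points and one matches exactly, so the score is (100 + 100 + 0) / 3; a run that reproduced the expected proportions exactly would score 0.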
