Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline

doi:10.1371/journal.pcbi.1011912

Fig 2

Design and computational performance of the pcpfm pipeline.

A) The pipeline has five major sections: assembly, data processing, quality control, annotation, and reporting. Assembly creates the on-disk data structures needed for pcpfm analysis and optionally performs conversion to mzML. Data processing encapsulates everything from the start of a processing job to the creation of a feature table using Asari. Quality control consists of multiple chainable commands that allow a raw feature table to be curated into a table suitable for downstream analysis. Annotation concerns the mapping of empirical compounds to metabolites using formula or MS² similarity to databases, m/z and retention time mapping to authentic standards, and, optionally, MS² similarity. Finally, reporting handles the creation of the three-table format for downstream analysis, PDF report generation, and JSON outputs for advanced users. Squares represent inputs and outputs, arrows represent dependencies between steps, and bolded sections collectively represent a minimal workflow. Created with BioRender.com. B) The two largest datasets (N is the number of MS¹-only acquisitions) demonstrate the high computational performance of the pipeline. Most of the wall time is spent during reporting. All steps are single-threaded by default except Asari, which uses 4 processes. In the HILIC+ and RP- datasets, 40008 and 32086 features are detected respectively (full Asari table including non-study samples), corresponding to 27851 and 23400 empirical compounds, of which 16431 and 11962 received a level 4 annotation and 614 and 267 received a level 2 annotation. C) A comparison of the wall time required for a minimal pcpfm workflow (Asari+Khipu) with that of its MetaboAnalystR v4.0.0 equivalent on subsets of three studies, where N is the number of MS¹-only acquisitions included in each subset. For the CheckMate subset, the MetaboAnalystR and pcpfm minimal workflows detected 3902 and 8907 features respectively. For the HZV029 HILIC+ and RP- subsets, the MetaboAnalystR workflow detected 2835 and 5966 features, while the pcpfm workflow detected 12142 and 9939 features respectively. All pcpfm counts are for the preferred feature table.
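The counts in panels B and C can be easier to compare as proportions. The following is a minimal sketch that derives them from the numbers quoted in this caption only; the dictionary layout and the names `PANEL_B`, `PANEL_C`, `annotation_fractions`, and `feature_ratio` are illustrative assumptions and are not part of pcpfm.

```python
# Counts copied from the caption; names and layout are hypothetical.
PANEL_B = {
    "HILIC+": {"features": 40008, "empcpds": 27851, "level4": 16431, "level2": 614},
    "RP-": {"features": 32086, "empcpds": 23400, "level4": 11962, "level2": 267},
}

# Features detected by each minimal workflow (panel C).
PANEL_C = {
    "CheckMate": {"MetaboAnalystR": 3902, "pcpfm": 8907},
    "HZV029 HILIC+": {"MetaboAnalystR": 2835, "pcpfm": 12142},
    "HZV029 RP-": {"MetaboAnalystR": 5966, "pcpfm": 9939},
}


def annotation_fractions(counts):
    """Fraction of empirical compounds annotated at level 4 and level 2."""
    n = counts["empcpds"]
    return {"level4": counts["level4"] / n, "level2": counts["level2"] / n}


def feature_ratio(counts):
    """Ratio of pcpfm feature count to MetaboAnalystR feature count."""
    return counts["pcpfm"] / counts["MetaboAnalystR"]


if __name__ == "__main__":
    for name, counts in PANEL_B.items():
        frac = annotation_fractions(counts)
        print(f"{name}: level 4 = {frac['level4']:.1%}, level 2 = {frac['level2']:.1%}")
    for name, counts in PANEL_C.items():
        print(f"{name}: pcpfm/MetaboAnalystR = {feature_ratio(counts):.2f}")
```

These ratios describe feature and annotation counts only; they say nothing about the correctness of the detected features or of the annotations.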

doi: https://doi.org/10.1371/journal.pcbi.1011912.g002