Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline

doi:10.1371/journal.pcbi.1011912

Table 1.

Core concepts implemented in the MetDataModel package.


Fig 1.

Design of core concepts and data models in computational metabolomics.

A) The core concepts in MetDataModel, with metabolomics data processing in salmon and metabolic modeling in grey. We introduce the "empirical compound" as a key bridge between them. The dashed lines indicate alternative workflows. Created with BioRender.com. B) Abridged empirical compound example, including the listing of MS¹ features and annotations from MS² and other sources. This JSON format enables chaining of multiple annotation tools.
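As a rough illustration of the structure described in panel B, an empirical compound can be pictured as a JSON-serializable record that lists its MS¹ features and accumulates annotations from successive tools. The sketch below is not the MetDataModel schema: every key name, the `add_annotation` helper, and all values are assumptions chosen for illustration.

```python
import json

# Hypothetical empirical compound record; all keys and values are illustrative
# and are not taken from the MetDataModel package.
empirical_compound = {
    "interim_id": "empCpd_0001",
    "neutral_formula_mass": 180.0634,
    "ms1_features": [
        {"feature_id": "F1", "mz": 181.0707, "rtime": 35.2, "ion_relation": "M+H+"},
        {"feature_id": "F2", "mz": 182.0740, "rtime": 35.3, "ion_relation": "13C/12C"},
    ],
    "annotations": [],
}


def add_annotation(record, source, level, name):
    """Return a copy of the record with one more annotation appended."""
    updated = json.loads(json.dumps(record))  # deep copy via a JSON round trip
    updated["annotations"].append({"source": source, "level": level, "name": name})
    return updated


# Chain two hypothetical annotation tools: each reads and writes the same format.
step1 = add_annotation(empirical_compound, "formula_match", "4", "C6H12O6")
step2 = add_annotation(step1, "ms2_similarity", "2", "glucose")
serialized = json.dumps(step2, indent=2)
```

Because each step consumes and emits the same record shape, tools can be applied in any order, and the serialized output can be passed on without a format conversion.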


Fig 2.

Design and computational performance of the pcpfm pipeline.

A) The pipeline has five major sections: assembly, data processing, quality control, annotation, and reporting. Assembly creates the on-disk data structures needed for pcpfm analysis and optionally performs conversion to mzML. Data processing encapsulates everything from the start of a processing job to the creation of a feature table using Asari. Quality control consists of multiple chainable commands that allow a raw feature table to be curated into a table suitable for downstream analysis. Annotation concerns the mapping of empirical compounds to metabolites using formula or MS² similarity to databases, and m/z and retention time mapping to authentic standards, optionally with MS² similarity. Finally, reporting handles the creation of the three-table format for downstream analysis, PDF report generation, and JSON outputs for advanced users. Squares represent inputs and outputs, arrows represent dependencies between steps, and bolded sections collectively represent a minimal workflow. Created with BioRender.com. B) Computational performance of the pipeline on the two largest datasets (N is the number of MS¹-only acquisitions). Most of the wall time is spent during reporting. All steps are single-threaded by default except Asari, which uses 4 processes. In the HILIC+ and RP- datasets, 40008 and 32086 features are detected (full Asari table including non-study samples), corresponding to 27851 and 23400 empirical compounds, of which 16431 and 11962 received a level 4 annotation and 614 and 267 received a level 2 annotation. C) Wall time required for a minimal pcpfm workflow (Asari+Khipu) compared with its MetaboAnalystR v4.0.0 equivalent on subsets of three studies, where N is the number of MS¹-only acquisitions included in each subset. For the CheckMate subset, 3902 and 8907 features were detected by the MetaboAnalystR and pcpfm minimal workflows, respectively, while in HZV029 HILIC+ and RP- the MetaboAnalystR workflow detects 2835 and 5966 features and the pcpfm workflow detects 12142 and 9939, respectively. All pcpfm counts are for the preferred feature table.


Fig 3.

Annotation methods in pcpfm.

A) Empirical compounds are constructed from Asari feature tables using khipu, which groups degenerate features such as isotopologues and adducts. The inferred neutral mass of an empirical compound is compared to known metabolites to generate level 4 annotations (via JMS, https://github.com/shuzhao-li-lab/JMS). Panels A, B, and C created with BioRender.com. B) Level 2 and 1a annotations are generated using MS² similarity. Experimental MS² spectra are mapped to empirical compounds and then compared to reference spectra to annotate metabolite structures. C) Level 1b annotations are generated based on m/z and retention time matches to authentic chemical standards. The use of empirical compounds improves search efficiency and reduces false positives, while annotations at all levels can also be mapped to the feature level. D) Overlap of MS² annotations by pcpfm and CD in the two HZV029 plasma datasets. Detailed dissection of the differences is difficult since CD is closed-source.
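The level 4 step in panel A, comparing an inferred neutral mass to known metabolites, can be sketched as a tolerance search in parts per million. This is not the JMS implementation: the function names, the 5 ppm default tolerance, and the three-entry reference list are assumptions made for illustration.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_neutral_mass(neutral_mass, candidates, tolerance_ppm=5.0):
    """Return names of candidates whose mass lies within the ppm tolerance.

    The 5 ppm default is an assumed value, not one taken from the pipeline.
    """
    return [
        name
        for name, mass in candidates
        if abs(ppm_error(neutral_mass, mass)) <= tolerance_ppm
    ]


# Toy reference list of (name, monoisotopic neutral mass); not a real database.
reference = [
    ("glucose", 180.06339),
    ("caffeine", 194.08038),
    ("inosine", 268.08077),
]

hits = match_neutral_mass(180.0635, reference)
```

Searching once per empirical compound, rather than once per feature, is what keeps the number of lookups low: isotopologues and adducts of the same compound share a single neutral mass.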


Fig 4.

Examples of quality control in the pcpfm pipeline.

A) A collection of QA/QC metrics generated by Asari on an example dataset (“HZV029 Plasma RP-”). B) The correlation clustermap of all study samples and pooled samples from the HZV029 Plasma RP- dataset (preferred feature table), illustrating the batch effect induced by instrument calibration. C) Log10 TICs of a random subset of samples before normalization, after normalization, and after batch correction. D) PCA demonstrating the presence of a batch effect (top) and its removal (bottom). E) Detection of failed acquisitions by Z-scores of the number of features. The failed injection is highlighted in red and a representative “good” injection in blue for both the plasma HZV029 Two-Phase HILIC- and HZV029 QC datasets (left and right, top). The two-phase failed injection is simulated by replacing a missing sample with an empty vial, while the other was identified post-hoc. The TICs of the failed and good injections are shown in red and black, respectively (bottom).
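The idea behind panel E can be sketched as follows, assuming the Z-score is computed on the number of features detected per acquisition. The function names, the threshold of -2, and the counts are invented for illustration and are not pcpfm defaults or study data.

```python
from statistics import mean, stdev


def feature_count_z_scores(counts):
    """Z-score of each sample's feature count relative to all samples."""
    values = list(counts.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return {sample: (n - mu) / sigma for sample, n in counts.items()}


def flag_failed(counts, threshold=-2.0):
    """Samples whose feature-count Z-score falls below the assumed threshold."""
    z = feature_count_z_scores(counts)
    return [sample for sample, score in z.items() if score < threshold]


# Made-up counts of detected features per acquisition.
counts = {
    "s01": 9800, "s02": 10100, "s03": 9950, "s04": 10050,
    "s05": 9900, "s06": 10000, "s07": 10200, "s08": 1200,
}
failed = flag_failed(counts)
```

A failed or empty injection yields far fewer features than its neighbors, so it stands out as a strongly negative Z-score even before the TICs are inspected.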


Fig 5.

Applications of pcpfm to analyzing biological datasets.

A) In the Bowen 2023 cardiomyocyte dataset, the pcpfm identifies most of the reported sunitinib-related features in both cell pellets and media using a standard workflow. Asari and pcpfm output both a preferred feature table and a full feature table, the former of higher feature quality and the latter more inclusive. B) The mass track for the sole feature undetected in the Bowen 2023 cell dataset is shown, and the suspected undetected peak is in the red box (M2_2), which fails to pass Asari’s quality requirement. C) Significant differential metabolite features between sunitinib exposure groups in cell pellets. ANOVA p-values are corrected for multiple testing by the Benjamini-Hochberg method. D) Both the pcpfm and MetaboAnalystR were used to extract features from a subset of the CheckMate study. Of 202 compounds in their authentic standard library, MetaboAnalystR identified 167, while the full table from the pcpfm identified 198 of the confirmed features. E) Clustering pattern of the Ansone 2021 cohort using features differentially abundant between treatment groups. F) Example boxplots of differentially abundant features in the Ansone 2021 cohort. F201235 and F201855 (top) were mapped to the same empirical compound, which was tentatively annotated as 1,2-DPPC, a pulmonary surfactant, by its sole level 4 annotation. Significance was evaluated using ANOVA and post-hoc Tukey’s HSD test in E and F.
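For readers unfamiliar with the multiple-testing correction named in panel C, a minimal standard-library sketch of the Benjamini-Hochberg adjustment is given below. It illustrates the general procedure only; it is not the code used in the study, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, e.g. one per feature from an ANOVA.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a feature with a smaller raw p-value never receives a larger adjusted value than one ranked after it.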
