MT-MAG: Accurate and interpretable machine learning for complete or partial taxonomic assignments of metagenome-assembled genomes

doi:10.1371/journal.pone.0283536

Table 1.

Summary of the total number of base pairs analyzed, number of FASTA files, number of contigs/reads, and the range of contig/read lengths for the training and test sets in Task 1 (sparse) and Task 2 (dense) for MT-MAG and DeepMicrobes.

Note that contigs come directly from the samples, while reads are simulated from the samples by the ART simulator.

Fig 1.

A sample hierarchy (taxonomy) with three parent-to-child relationships.

A parent node together with all its child nodes forms a parent-to-child relationship. A node without any child node is called a leaf node. The level of a node is the length of the path from that node to the root node. The part highlighted in red is a multi-child classification, while the part highlighted in cyan is a single-child classification.
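
The definitions in this caption can be illustrated with a minimal sketch. The tree below is hypothetical (the node names and shape are assumptions and are not taken from the figure); it only shows how leaf nodes, node levels, and the multi-child versus single-child distinction follow from a parent-to-children mapping.

```python
# Hypothetical taxonomy: each parent maps to its list of children.
# Node names are illustrative only and do not come from Fig 1.
children = {
    "root": ["A", "B"],   # multi-child relationship
    "A": ["A1", "A2"],    # multi-child relationship
    "B": ["B1"],          # single-child relationship
}

# Invert the mapping so each child can look up its parent.
parent = {c: p for p, cs in children.items() for c in cs}


def is_leaf(node):
    """A leaf node is a node without any child node."""
    return not children.get(node)


def level(node):
    """Length of the path from the node up to the root node."""
    steps = 0
    while node in parent:
        node = parent[node]
        steps += 1
    return steps


def relationship_kind(node):
    """Classify the parent-to-child relationship rooted at `node`."""
    n = len(children.get(node, []))
    if n == 0:
        return None  # leaf nodes have no parent-to-child relationship
    return "multi-child" if n > 1 else "single-child"
```

In this sketch the three keys of `children` correspond to three parent-to-child relationships, two of which are multi-child and one single-child.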

Fig 2.

Overview of eMLDSP, including the main steps that comprise eMLDSP (Preprocessing) (pink box), eMLDSP (Classify-Training) (yellow box), and eMLDSP (Classify-Classification) (lavender box).

Ellipses represent computation steps. Rectangles represent inputs to, and outputs from, computation steps. The diamond represents a condition check. Note that the training dataset consists of (a) DNA sequences, together with (b) their taxonomic labels, and that the inputs to eMLDSP (Preprocessing) and eMLDSP (Classify-Training) consist of the same training set. The output from eMLDSP (Preprocessing), consisting of predictions and classification confidences for the training set, is used by the STP algorithm in the MT-MAG Training phase (dotted arrow from the pink box) to calculate stopping thresholds. These stopping thresholds are then used, together with the output from eMLDSP (Classify-Classification), in the MT-MAG Classifying phase (dotted arrow from the lavender box) to determine the final classification results.

Fig 3.

MT-MAG pipeline for classifying two genomes, genome a and genome b, from the parent taxon Genus 1 into its two child taxa, Species 1 and Species 2 (multi-child classification).

Blue ellipses represent computation steps. Gray rectangles represent inputs to, and outputs from, computation steps. In the MT-MAG training phase (yellow box), the training set is prepared and given as the input to eMLDSP (Preprocessing).

Fig 4.

Example of the classification path for a genome x.

The pre-calculated stopping thresholds are listed under the corresponding taxon labels. The classification confidences are listed inside blue-bordered rectangles. MT-MAG classifies x from the root into “rank 1 group 1” with confidence 0.99, which is greater than the stopping threshold for “rank 1 group 1” (0.94), so MT-MAG continues classifying x. In the next iteration, MT-MAG classifies x from “rank 1 group 1” into “rank 2 group 2” with confidence 0.90. Because this is below the stopping threshold for “rank 2 group 2” (0.92), this classification is deemed “uncertain” and MT-MAG does not attempt further classifications. The path in cyan indicates complete classification(s), the path in yellow indicates uncertain classification(s), and the part in red indicates unattempted classifications.
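
The walk described in this caption can be sketched as a short loop. This is an illustration under stated assumptions, not MT-MAG's implementation: the per-parent predictions are precomputed in a dictionary rather than produced by a classifier, a confidence exactly equal to the threshold is treated as passing (the caption does not specify this case), and the "rank 3 group 1" entry with its values is hypothetical, added only to show an unattempted classification.

```python
def classify_path(predictions, thresholds, root="root"):
    """Follow predictions from the root until a confidence falls below
    the stopping threshold of the predicted child taxon.

    predictions: parent taxon -> (predicted child taxon, confidence)
    thresholds:  child taxon  -> stopping threshold
    Returns a list of (taxon, status) pairs, where status is
    "complete" or "uncertain". Taxa never reached are unattempted.
    """
    path = []
    node = root
    while node in predictions:
        child, confidence = predictions[node]
        if confidence < thresholds[child]:
            # Below the stopping threshold: mark as uncertain and stop.
            path.append((child, "uncertain"))
            break
        path.append((child, "complete"))
        node = child
    return path


# Confidences and thresholds for the first two steps are from the caption.
predictions = {
    "root": ("rank 1 group 1", 0.99),
    "rank 1 group 1": ("rank 2 group 2", 0.90),
    "rank 2 group 2": ("rank 3 group 1", 0.97),  # hypothetical, never reached
}
thresholds = {
    "rank 1 group 1": 0.94,
    "rank 2 group 2": 0.92,
    "rank 3 group 1": 0.95,  # hypothetical
}

result = classify_path(predictions, thresholds)
```

Here `result` holds a complete classification into "rank 1 group 1" followed by an uncertain classification into "rank 2 group 2"; the hypothetical "rank 3 group 1" step does not appear because the loop stops at the first uncertain classification.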

Table 2.

Summary of MT-MAG performance metrics at all taxonomic ranks, for Task 1 (sparse) and Task 2 (dense): Constrained accuracy CA_g(tr), absolute accuracy AA_g(tr), weighted accuracy WA_g(tr), and complete classification rate CR_g(tr) (higher is better).

Table 3.

Summary of the proportion of test sequences completely classified by MT-MAG (quantified as CR_g(tr)) vs. classified by DeepMicrobes (quantified as CR_r), at all taxonomic ranks.

A higher CR_g(tr) (respectively CR_r) is better, as it signifies that a higher proportion of genomes (resp. reads) have been completely classified (resp. classified).
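
As a rough illustration of a complete classification rate at a given rank, the sketch below computes the fraction of test genomes whose complete classification reaches at least that rank. The rank list, the function name, and the input encoding (the deepest completely classified rank per genome, or `None` if there is none) are assumptions made for this example; the formal definition of CR_g(tr) is not given in this caption.

```python
# Hypothetical rank order, from shallowest to deepest.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def complete_classification_rate(deepest_complete_rank, rank):
    """Fraction of genomes completely classified down to `rank` or deeper.

    deepest_complete_rank: one entry per test genome, giving the deepest
    rank at which that genome was completely classified (None if none).
    Assumes a genome completely classified at a rank is also completely
    classified at every shallower rank.
    """
    cutoff = RANKS.index(rank)
    reached = sum(
        1
        for r in deepest_complete_rank
        if r is not None and RANKS.index(r) >= cutoff
    )
    return reached / len(deepest_complete_rank)


# Made-up example data for four genomes.
example = ["species", "genus", None, "species"]
cr_genus = complete_classification_rate(example, "genus")
cr_species = complete_classification_rate(example, "species")
```

With this made-up input, three of the four genomes reach the genus rank and two reach the species rank, so the rate can only stay the same or decrease at deeper ranks.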

Table 4.

Summary of MT-MAG and DeepMicrobes accuracy statistics, as well as the complete classification rates of MT-MAG and the classified rates of DeepMicrobes.

The inputs are genomes in the case of MT-MAG, and reads in the case of DeepMicrobes.
