Table 1.
Summary of the total number of basepairs analyzed, number of FASTA files, number of contigs/reads, and the range of contig/read lengths for the training and test sets in Task 1 (sparse) and Task 2 (dense), for MT-MAG and DeepMicrobes.
Note that contigs come directly from the samples, while reads are simulated from the samples by the ART simulator.
Fig 1.
A sample hierarchy (taxonomy) with three parent-to-child relationships.
A parent node together with all of its child nodes forms a parent-to-child relationship. A node without any child nodes is called a leaf node. The level of a node is the length of the path from that node to the root node. The part highlighted in red is a multi-child classification, while the part highlighted in cyan is a single-child classification.
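The definitions in this caption can be illustrated with a minimal sketch. The toy taxonomy, its node names, and the helper functions below are hypothetical and are not taken from Fig 1 or from the MT-MAG code. The sketch assumes a hierarchy stored as a mapping from each parent to its list of children.

```python
from collections import deque

# Hypothetical toy taxonomy (parent -> children); not the hierarchy drawn in Fig 1.
# It contains three parent-to-child relationships: root, "Phylum 1", "Phylum 2".
taxonomy = {
    "root": ["Phylum 1", "Phylum 2"],
    "Phylum 1": ["Class 1", "Class 2"],
    "Phylum 2": ["Class 3"],
}

def node_levels(tree, root="root"):
    """Return the level of every node: the path length from the node to the root."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree.get(node, []):
            levels[child] = levels[node] + 1
            queue.append(child)
    return levels

def leaf_nodes(tree, root="root"):
    """Return all nodes that have no children, in sorted order."""
    return sorted(n for n in node_levels(tree, root) if not tree.get(n))

def relationship_kind(tree, parent):
    """Label a parent-to-child relationship as single-child or multi-child."""
    n_children = len(tree.get(parent, []))
    if n_children == 0:
        return "leaf"
    return "single-child" if n_children == 1 else "multi-child"

if __name__ == "__main__":
    print(node_levels(taxonomy))
    print(leaf_nodes(taxonomy))
    for parent in taxonomy:
        print(parent, relationship_kind(taxonomy, parent))
```

In this toy example, the root and "Phylum 1" give multi-child classifications, while "Phylum 2" gives a single-child classification.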
Fig 2.
Overview of eMLDSP, including the main steps that comprise eMLDSP (Preprocessing) (pink box), eMLDSP (Classify-Training) (yellow box), and eMLDSP (Classify-Classification) (lavender box).
Ellipses represent computation steps. Rectangles represent inputs to, and outputs from, computation steps. The diamond represents a condition check. Note that the training dataset consists of (a) DNA sequences, together with (b) their taxonomic labels; and the inputs to eMLDSP (Preprocessing) and eMLDSP (Classify-Training) consist of the same training set. The output from eMLDSP (Preprocessing), consisting of predictions and classification confidences for the training set, is used by the STP algorithm in the MT-MAG Training phase (dotted arrow from the pink box) to calculate stopping thresholds. These stopping thresholds are then used, together with the output from eMLDSP (Classify-Classification), in the MT-MAG Classifying phase (dotted arrow from the lavender box) to determine the final classification results.
Fig 3.
MT-MAG pipeline for classifying two genomes, genome a and genome b, from the parent taxon Genus 1 into its two child taxa, Species 1 and Species 2 (multi-child classification).
Blue ellipses represent computation steps. Gray rectangles represent inputs to, and outputs from, computation steps. In the MT-MAG training phase (yellow box), the training set is prepared and given as the input to eMLDSP (Preprocessing).
Fig 4.
Example of the classification path for a genome x.
The pre-calculated stopping thresholds are listed under the corresponding taxon labels. The classification confidences are listed inside blue-bordered rectangles. MT-MAG classifies x from the root into “rank 1 group 1” with confidence 0.99, which is greater than the stopping threshold for “rank 1 group 1” (0.94), so MT-MAG continues its classification of x. In the next iteration, MT-MAG classifies x from “rank 1 group 1” into “rank 2 group 2” with confidence 0.90. Since this is below the stopping threshold for “rank 2 group 2” (0.92), this classification is deemed “uncertain” and MT-MAG does not attempt further classifications. The path in cyan indicates complete classification(s), the path in yellow indicates uncertain classification(s), and the part in red indicates unattempted classifications.
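The stopping rule described in this caption can be sketched as follows. This is a simplified illustration, not MT-MAG's implementation: the function name `classify_path` is hypothetical, the per-level predictions are supplied up front rather than produced by a classifier at each parent taxon, and treating a confidence exactly equal to the threshold as passing is an assumption (the caption only covers the "greater than" and "below" cases).

```python
def classify_path(predictions, thresholds):
    """Walk down the hierarchy and stop at the first uncertain classification.

    predictions: list of (child_taxon, confidence) pairs, ordered from the
                 root downward (assumed to be precomputed for this sketch).
    thresholds:  mapping from child taxon to its stopping threshold.

    Returns (complete, uncertain): the list of taxa that were completely
    classified, and the taxon deemed uncertain (None if there is none).
    """
    complete = []
    for taxon, confidence in predictions:
        # Assumption: equality counts as meeting the threshold.
        if confidence >= thresholds[taxon]:
            complete.append(taxon)
        else:
            # Uncertain classification: deeper levels are not attempted.
            return complete, taxon
    return complete, None

if __name__ == "__main__":
    # Values taken from the example in this caption.
    thresholds = {"rank 1 group 1": 0.94, "rank 2 group 2": 0.92}
    predictions = [("rank 1 group 1", 0.99), ("rank 2 group 2", 0.90)]
    complete, uncertain = classify_path(predictions, thresholds)
    print("complete:", complete)
    print("uncertain:", uncertain)
```

With the values from the caption, the first step is a complete classification and the second is uncertain, so nothing below “rank 2 group 2” is attempted.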
Table 2.
Summary of MT-MAG performance metrics at all taxonomic ranks, for Task 1 (sparse) and Task 2 (dense): Constrained accuracy CAg(tr), absolute accuracy AAg(tr), weighted accuracy WAg(tr), and complete classification rate CRg(tr) (higher is better).
Table 3.
Summary of proportion of test sequences completely classified by MT-MAG (quantified as CRg(tr)) vs. classified by DeepMicrobes (quantified as CRr), at all taxonomic ranks.
A higher CRg(tr) (respectively CRr) is better, as it signifies that a higher proportion of genomes (resp. reads) have been completely classified (resp. classified).
Table 4.
Summary of MT-MAG and DeepMicrobes accuracy statistics, as well as the complete classification rates of MT-MAG and the classified rates of DeepMicrobes.
The inputs are genomes in the case of MT-MAG, and reads in the case of DeepMicrobes.