Fig 1.
Overview of the machine learning method.
(a) During training, the method first learns species embeddings from pairwise phylogenetic distances (Sect 2.2.1). We then train an artificial neural network (ANN) to map DNA sequences into the same embedding space, leveraging the previously learned structure (Sect 2.2.2). Finally, we train a second ANN that modulates species predictions based on observed co-occurrence patterns (Sect 2.2.3). (b) Once trained, the model assigns species probabilities to eDNA sequences. Each sequence is first embedded by the trained network; a kernel then converts its position in the embedding space into species probabilities, which are modulated by the learned matrices U and V.
Fig 2.
Visualization of the learned embedding space after dimensionality reduction with the t-SNE algorithm.
Each point corresponds to a species in the tree, and its color represents its taxonomic order. Points are further distinguished with two different symbols based on whether the species is in the DNA reference database (dot) or not (cross). Axes are omitted because they do not carry an intrinsic meaning.
Table 1.
Top-1 accuracy for predictions of species whose DNA sequences were seen during training, as well as for their corresponding genus, family, and order.
Table 2.
Top-1 accuracy for predictions of species whose DNA sequences were not seen during training (zero-shot), as well as for their corresponding genus, family, and order.
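As an illustration of how the accuracies in Tables 1 and 2 could be computed, the following minimal sketch scores top-1 species predictions at each taxonomic rank. It is not the authors' evaluation code: the function name `top1_accuracy_by_rank`, the `taxonomy` lookup, and the placeholder labels (`sp_a`, `g1`, and so on) are assumptions made for the example. A prediction counts as correct at a given rank when the predicted and true species share the same taxon at that rank.

```python
def top1_accuracy_by_rank(predicted, actual, taxonomy,
                          ranks=("species", "genus", "family", "order")):
    """Return the fraction of top-1 predictions that are correct at each rank.

    predicted, actual: equal-length sequences of species labels.
    taxonomy: hypothetical lookup mapping a species label to a dict
              with one entry per rank (including "species" itself).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one prediction is required")

    accuracy = {}
    for rank in ranks:
        # A hit at this rank means both species map to the same taxon.
        hits = sum(
            taxonomy[p][rank] == taxonomy[a][rank]
            for p, a in zip(predicted, actual)
        )
        accuracy[rank] = hits / len(predicted)
    return accuracy


# Placeholder taxonomy, purely illustrative.
taxonomy = {
    "sp_a": {"species": "sp_a", "genus": "g1", "family": "f1", "order": "o1"},
    "sp_b": {"species": "sp_b", "genus": "g1", "family": "f1", "order": "o1"},
    "sp_c": {"species": "sp_c", "genus": "g2", "family": "f1", "order": "o1"},
    "sp_d": {"species": "sp_d", "genus": "g3", "family": "f2", "order": "o2"},
}

scores = top1_accuracy_by_rank(
    predicted=["sp_a", "sp_b", "sp_c", "sp_d"],
    actual=["sp_a", "sp_a", "sp_a", "sp_a"],
    taxonomy=taxonomy,
)
```

Under this definition, accuracy can only stay equal or increase from species to order, because a correct species assignment is also correct at every higher rank.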
Fig 3.
Calibration of our model's output probabilities.
The predicted species probability is shown on the x-axis, and the corresponding proportion of correct assignments at the species, genus, family, and order level is shown on the y-axis. The dashed black line represents perfect calibration, which serves as the reference for the solid blue (species) line. The bar plot at the bottom depicts the proportion of predictions that fall into each 5% probability bin.
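The quantities plotted in such a calibration figure can be sketched as follows. This is a minimal example under stated assumptions, not the authors' plotting code: the function name `reliability_bins` and its arguments are hypothetical, and the only detail taken from the caption is the 5% bin width. The `correct` array may hold correctness at any taxonomic level, while `confidence` is always the predicted species probability.

```python
import numpy as np


def reliability_bins(confidence, correct, bin_width=0.05):
    """Bin predictions by confidence and summarize each bin.

    confidence: predicted species probabilities in [0, 1].
    correct: booleans, True where the assignment was correct at the
             taxonomic level of interest.
    Returns (bin centers, accuracy per bin, share of predictions per bin).
    Empty bins get an accuracy of NaN.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    n_bins = int(round(1.0 / bin_width))

    # Bin index per prediction; clip so confidence == 1.0 lands in the last bin.
    idx = np.minimum((confidence * n_bins).astype(int), n_bins - 1)

    counts = np.bincount(idx, minlength=n_bins)
    hits = np.bincount(idx, weights=correct.astype(float), minlength=n_bins)

    # Observed accuracy per bin (the y-axis of the calibration curve).
    with np.errstate(invalid="ignore", divide="ignore"):
        accuracy = np.where(counts > 0, hits / counts, np.nan)

    # Share of all predictions in each bin (the bar plot).
    share = counts / counts.sum()
    centers = (np.arange(n_bins) + 0.5) / n_bins
    return centers, accuracy, share


centers, accuracy, share = reliability_bins(
    confidence=[0.92, 0.93, 0.97, 0.12],
    correct=[True, False, True, False],
)
```

For a perfectly calibrated model, the accuracy in each non-empty bin would match the bin's confidence, which is what the diagonal reference line expresses.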
Fig 4.
Comparison of predictions for eDNA samples using our method and the traditional pipeline.
Predictions are compared at different taxonomic levels and grouped by region. Each vertical line represents the results for one eDNA sample.