Fig 1.
Overall data analysis workflow in block diagram form.
(Step 1): The collection of raw input data samples, as well as a corresponding set of labelled “ground-truth” targets. (Step 2): The pre-processing of raw input data into suitable structures for modelling, guided by any available domain or expert knowledge. (Step 3): The training of several types of classification models (including Deep Learning), which map inputs to their corresponding discrete class labels. (Step 4): The design of a special objective function within a Deep Learning classifier, which identifies a latent space with improved class separation. The most dominant latent features are then distinguished by the magnitude (e.g., ℓ2 norm) of the neural network weights.
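The weight-magnitude ranking in Step 4 can be illustrated with a minimal sketch. It assumes a weight matrix whose row i holds the weights attached to feature i; that layout, the function name `rank_features_by_weight_norm`, and the toy values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rank_features_by_weight_norm(W):
    """Rank features by the l2 norm of their weight vectors.

    W has shape (n_features, n_units); row i is assumed to hold the
    weights attached to feature i (a hypothetical layout).
    Returns the feature indices in descending order of norm, and the norms.
    """
    norms = np.linalg.norm(W, axis=1)          # one l2 norm per row
    order = np.argsort(-norms, kind="stable")  # largest norm first
    return order, norms

# Toy weights for three features (made-up values).
W = np.array([[0.1, -0.2],
              [3.0,  4.0],
              [1.0,  0.0]])
order, norms = rank_features_by_weight_norm(W)
print(order, norms)
```

Features whose weight vectors have the largest norm appear first in `order`; whether rows or columns correspond to features depends on how the network stores its weights.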
Fig 2.
Printout of pandas dataframes containing raw data collected directly from ecological sites.
(Left) Dataframe containing raw input variables. (Right) Dataframe containing output class labels.
Fig 3.
Discovery of an optimally-separating latent feature space.
(Top Left) The high-dimensional and confounded raw inputs X make class separation a challenging task. (Bottom Left) A case example including six microorganism species (A through F) which are entangled in the raw input space. (Top Right) A DL model which learns a latent space Z that optimally separates the classes. (Bottom Right) The disentanglement of the six species into distinct classes, which can be further aggregated into two major classes: Class 0 (A and B) and Class 1 (C through F).
Fig 4.
Species abundance counts after each pre-processing transformation.
(Top) Distribution of counts from all 21721 species; the density is extremely skewed towards the low end. (Middle) Distribution of only the top 50 species, ranked by summed counts. (Bottom) Distribution of the top 50 species after a log10 transformation. Rarer but higher-abundance species are now recognizable.
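The two transformations in this caption (keeping the top species by summed count, then taking log10) can be sketched as follows. The caption does not say how zero counts are handled, so the pseudocount of 1 added before the logarithm is an assumption, as are the function name `top_k_log10` and the toy count matrix.

```python
import numpy as np

def top_k_log10(counts, k=50, pseudocount=1.0):
    """Keep the k species with the largest summed counts, then log10-transform.

    counts: array of shape (n_samples, n_species).
    pseudocount: added before the log so zero counts stay finite
                 (an assumption; not stated in the source).
    Returns the selected column indices and the transformed sub-matrix.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)                   # summed count per species
    top = np.argsort(-totals, kind="stable")[:k]  # largest totals first
    return top, np.log10(counts[:, top] + pseudocount)

# Toy matrix: 2 samples x 3 species (made-up values).
counts = np.array([[0, 9, 99],
                   [0, 0, 999]])
top, logged = top_k_log10(counts, k=2)
print(top)
print(logged)
```

The log scale compresses the few very large counts, which is why less frequent but abundant species become visible in the bottom panel.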
Fig 5.
Histogram of maximum abundance counts.
Fig 6.
Visualization of a linear separating hyperplane and separating margins in a 2-class SVM model.
Fig 7.
Visualization of a classification problem with a non-linear separation boundary between the two classes.
(Left) The raw feature space spanned by raw variables x1 and x2 renders linear separation impossible. (Right) A desired latent feature space which optimally separates the two classes. The goal of the DL model is to learn its coordinates, z1 and z2.
Fig 8.
A two-layer neural network with a hinge-loss-like objective function.
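As a rough illustration of this kind of model, the sketch below runs a forward pass through a two-layer network and scores it with the standard hinge loss. The paper's exact "hinge-loss-like" objective is not given in this caption, so the standard hinge loss, the tanh activation, the labels in {-1, +1}, and all names and values are assumptions.

```python
import numpy as np

def forward(X, W1, b1, w2, b2):
    """Two-layer network: a tanh hidden layer followed by a linear score."""
    hidden = np.tanh(X @ W1 + b1)  # latent representation
    return hidden @ w2 + b2        # one real-valued score per sample

def hinge_loss(scores, y):
    """Standard hinge loss, mean(max(0, 1 - y * score)), with y in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

# Toy check of the loss on hand-picked scores (made-up values).
scores = np.array([2.0, -0.5, 0.3])
y = np.array([1.0, -1.0, -1.0])
loss = hinge_loss(scores, y)

# Forward pass on random inputs with small random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(3)
w2, b2 = rng.normal(size=3) * 0.1, 0.0
out = forward(X, W1, b1, w2, b2)
print(loss, out.shape)
```

A sample contributes zero loss only when it lies on the correct side of the boundary with a score margin of at least 1, which is what pushes the classes apart.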
Fig 9.
Feature selection based on direction of optimal separation.
Fig 10.
Selection of relevant input variables, by reverse-engineering matrix multiplication.
Fig 11.
The overall data analysis workflow applied to the Mount Polley case study.
Fig 12.
The ANN neuron architecture used in the Mount Polley case study.
Fig 13.
The 4-layer ANN architecture used to classify the Fisher Iris dataset.
Table 1.
Iris ANN classifier details.
Table 2.
ANN training and testing accuracies on the Iris dataset.
Table 3.
Comparison of separating margins between traditional and optimally-separating ANNs, for the Iris classification problem.
Fig 14.
Visualization of data separation within the first ANN hidden layer.
(Left) Inter-class separation distance in the traditional ANN, between Class 2 (red) samples and Class 1 (black) samples. (Right) Inter-class separation distance in the optimally-separating ANN.
Table 4.
Feature-ranking of the Iris dataset, according to our proposed optimally-separating ANN.
Table 5.
Testing-set accuracy comparison between traditional and optimally-separating ANNs.
Fig 15.
Inter-class separations in the traditional ANN.
Black samples belong to the undisturbed class. Red samples belong to the disturbed class. Only the first 4 out of the total 5000 runs are shown.
Fig 16.
Inter-class separations in the optimally-separating ANN.
The separations are noticeably larger than those in Fig 15. Black samples belong to the undisturbed class. Red samples belong to the disturbed class. Only the first 4 out of the total 5000 runs are shown.
Table 6.
Comparison of separating margins between traditional and optimally-separating ANNs.
Table 7.
Testing-set accuracy comparison between classifiers, using either all species or only indicator species as inputs.
Each plus-minus value represents one standard deviation.
Table 8.
Indicator species identified by our proposed feature extractor, compared to those identified by [34].
The Frequency column denotes the number of times each species was identified as a top weight, divided by the total of 5000 experiments. The Garris column indicates whether each species was identified in [34], and if so, which habitat it belongs to. The maximum and mean abundance counts of each species are also shown, along with the Class taxonomy level.
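The Frequency column described above is a selection rate across repeated experiments, which can be sketched as below. The species names, the run lists, and the function name `selection_frequency` are hypothetical, and the sketch assumes each run yields a list of its top-weighted species.

```python
from collections import Counter

def selection_frequency(top_species_per_run):
    """Fraction of runs in which each species appears among the top weights.

    top_species_per_run: one list of species identifiers per run
    (a hypothetical input format).
    """
    n_runs = len(top_species_per_run)
    # set(run) ensures a species is counted at most once per run.
    counts = Counter(s for run in top_species_per_run for s in set(run))
    return {species: c / n_runs for species, c in counts.items()}

# Four toy runs with made-up species identifiers.
runs = [["sp_a", "sp_b"], ["sp_a"], ["sp_c", "sp_a"], ["sp_b"]]
freq = selection_frequency(runs)
print(freq)
```

With 5000 runs, a frequency close to 1 would mean the species is selected in almost every experiment.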
Fig 17.
Taxonomic comparison of indicator species at the domain level.
Blue bars represent the indicators identified by Garris et al. [34]. Purple bars represent the indicators identified by our proposed feature extractor. The horizontal axis represents the percentage of indicators belonging to each domain.
Fig 18.
Taxonomic comparison of indicator species at the phylum level.
Blue bars represent the indicators identified by Garris et al. [34]. Purple bars represent the indicators identified by our proposed feature extractor. The horizontal axis represents the percentage of indicators belonging to each phylum.
Fig 19.
Taxonomic comparison of indicator species at the class level.
Blue bars represent the indicators identified by Garris et al. [34]. Purple bars represent the indicators identified by our proposed feature extractor. The horizontal axis represents the percentage of indicators belonging to each class.