Fig 1.
Flowchart describing the machine learning-supported workflow.
White boxes indicate data inputs and products; black boxes represent processing steps applied to the data. Gray circles indicate steps involving analyst review.
Fig 2.
Modeled delphinid echolocation click received level distributions (solid black line) predict an exponential increase in the number of clicks detected (approximately linear in the log space shown here) as received levels decline, assuming animals are, on average, uniformly distributed around a stationary sensor [38,39]. This shape is driven by the inverse relationship between range and received level (although signal directionality and other factors can introduce additional variation) and by the monitored area, which increases with the square of the monitoring radius, placing greater numbers of animals at large ranges. Circles illustrate a typical “real-world” received level distribution from a click detector, in which detections approaching the intended threshold (115 dBpp re 1 μPa in this case) begin to be systematically missed. Enforcing a higher minimum amplitude threshold at which detection counts are still increasing (e.g. dashed line at 120 dBpp re 1 μPa) greatly simplifies subsequent analyses such as estimation of species-specific missed-detection rates and densities. More information on the model used in this illustration is available in [38].
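The geometry behind this caption can be sketched with a short Monte Carlo simulation. The script below is illustrative only: it assumes a uniform animal distribution, simple spherical-spreading transmission loss, and a hypothetical 210 dB source level, whereas the model in [38] accounts for additional factors such as signal directionality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sources spread uniformly over a disk around the sensor: with uniform
# area density, range r is drawn with probability density proportional to r.
n = 200_000
max_range_m = 100_000.0
ranges = max_range_m * np.sqrt(rng.uniform(size=n))

# Received level under simple spherical-spreading transmission loss.
# The 210 dB source level is hypothetical; the actual model in [38]
# also handles beam pattern and absorption, among other factors.
source_level_db = 210.0
rl = source_level_db - 20.0 * np.log10(np.maximum(ranges, 1.0))

# Count detections per 1-dB received-level bin above the 115 dB threshold.
edges = np.arange(115, 181)
counts, _ = np.histogram(rl, bins=edges)

# Detection counts grow rapidly as received level declines toward the threshold.
assert counts[0] > counts[-1]
```

Because monitored area grows with the square of range while received level falls with its logarithm, the low-amplitude bins dominate the histogram, reproducing the approximately linear trend in log space described above.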
Fig 3.
A similarity score based on correlation in time and frequency is used to associate similar signals and distinguish dissimilar signals.
In this illustration, similarity between 100 detections is displayed as a symmetric similarity matrix (A). The similarity between detection X and detection Y is given by the color of grid cell (X,Y) on a scale between 0 (low similarity) and 1 (high similarity). Black squares along the diagonal represent comparisons of each detection to itself, and are ignored. The 90th detection (a delphinid echolocation click), indicated by the black arrow in (A), is compared to two other detections: the blue triangle denotes a highly similar detection, while the red square denotes a dissimilar detection. In (B), the same dataset is visualized as a network in which similar detections are attracted to each other and dissimilar detections are repelled. The black node represents the 90th detection. Waveforms (C), waveform envelopes (D) and spectra (E) are shown for the three detections, with the original detection in black, the similar detection in blue, and the dissimilar detection in red. Waveforms and waveform envelopes have been offset by a constant value for readability. Plots C-E indicate that the detections with high similarity scores are alike in the time and frequency domains, while the detection with a low similarity score is quite different from the other two.
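A pairwise similarity matrix like panel (A) can be sketched as follows. This is a minimal illustration, not the paper's exact formula: it assumes similarity is the mean of the Pearson correlation of spectra (frequency domain) and of waveform envelopes (time domain), clipped to [0, 1].

```python
import numpy as np

def similarity_matrix(spectra, envelopes):
    """Pairwise similarity on a 0-1 scale, combining correlation in
    frequency (spectra) and time (waveform envelopes).

    spectra, envelopes: arrays of shape (n_detections, n_features).
    The combination rule (mean of the two correlation matrices,
    clipped to [0, 1]) is an assumption for illustration.
    """
    def corr(x):
        x = x - x.mean(axis=1, keepdims=True)
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T  # Pearson correlation between all row pairs

    sim = 0.5 * (corr(spectra) + corr(envelopes))
    return np.clip(sim, 0.0, 1.0)

# 100 synthetic detections with hypothetical feature lengths.
rng = np.random.default_rng(1)
s = similarity_matrix(rng.random((100, 64)), rng.random((100, 128)))
assert s.shape == (100, 100)
assert np.allclose(s, s.T)           # symmetric, as in panel (A)
assert np.allclose(np.diag(s), 1.0)  # self-comparisons (ignored in practice)
```

The diagonal of ones corresponds to the ignored self-comparisons (black squares) in panel (A).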
Fig 4.
Input vectors for detection-level neural network training, consisting of concatenated spectra and waveforms of 140,000 detections (20,000 per class).
Inputs are normalized as shown prior to being fed into the neural network.
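Assembling one such input vector can be sketched as below. The normalization choices here (min-max scaling of the spectrum, peak normalization of the waveform) and the feature lengths are assumptions for illustration; the normalization actually applied is the one shown in the figure.

```python
import numpy as np

def make_input_vector(spectrum, waveform):
    """Concatenate a detection's spectrum and waveform into a single
    network input vector. Min-max scaling of the spectrum and peak
    normalization of the waveform are illustrative assumptions."""
    spec = (spectrum - spectrum.min()) / (spectrum.max() - spectrum.min())
    wave = waveform / np.max(np.abs(waveform))
    return np.concatenate([spec, wave])

# Hypothetical feature lengths: 188-point spectrum, 200-sample waveform.
rng = np.random.default_rng(2)
v = make_input_vector(rng.random(188), rng.standard_normal(200))
assert v.shape == (388,)
```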
Fig 5.
Input vectors for bin-level neural network training, consisting of concatenated mean spectra, IDI distributions, and mean waveforms of 14,000 bins (2,000 per class).
Inputs are normalized as shown prior to being fed into the neural network.
Fig 6.
Map of HARP monitoring sites E and H, located in the Southern California Bight.
Site depths are 1,300 m and 1,000 m respectively. Base map provided by [49] (https://www.bodc.ac.uk/data/open_download/gebco/gebco_2021_sub_ice_topo/zip/).
Table 1.
Deployment and detection information.
Fig 7.
Received level distributions of detected signals above the minimum received level threshold of 120 dBpp re 1 μPa.
Blue: Site E; red: Site H.
Fig 8.
Signal classes formed from the training dataset using unsupervised clustering on spectra and waveform envelope.
Seven signal classes were identified, including five odontocete echolocation click types, ship noise, and sonar. Color map represents normalized amplitudes on a scale from 0 (dark blue) to 1 (dark red).
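The clustering step can be sketched in miniature. The paper's clustering operates on spectra and waveform envelopes; plain k-means on concatenated features is used below purely as an illustrative stand-in, and all data are synthetic.

```python
import numpy as np

def kmeans(features, k, n_iter=25):
    """Toy unsupervised clustering of detections into signal classes.
    K-means is an illustrative stand-in for the paper's clustering
    algorithm, not the method actually used."""
    # Deterministic init: spread initial centers across the dataset.
    centers = features[np.linspace(0, len(features) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Two well-separated synthetic "signal types" are recovered as two clusters.
rng = np.random.default_rng(4)
feats = np.vstack([rng.normal(0.0, 0.1, (30, 8)), rng.normal(5.0, 0.1, (30, 8))])
labels = kmeans(feats, k=2)
assert len(set(labels[:30])) == 1 and len(set(labels[30:])) == 1
assert labels[0] != labels[30]
```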
Table 2.
Number of encounters from site H used for training (60% of available encounters), evaluation (30%) and validation (10%) of the neural network.
Table 3.
Confusion matrix for click-level classifier on the balanced evaluation dataset from Site H, consisting of 25,000 examples per category.
Values indicate percentages of the total number of detections classified.
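Converting raw confusion counts to the percentages reported here is straightforward; the sketch below uses numpy, with hypothetical labels and only two classes for brevity.

```python
import numpy as np

def confusion_percent(true_labels, pred_labels, n_classes):
    """Confusion matrix as percentages of all classified detections
    (rows: true class; columns: class assigned by the network)."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return 100.0 * cm / cm.sum()

# Tiny hypothetical example: 4 detections, 2 classes.
cm = confusion_percent([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
assert np.isclose(cm.sum(), 100.0)
assert cm[1, 1] == 50.0  # both class-1 detections correctly labeled
```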
Fig 9.
Spectra of detections counted as network misclassifications in the balanced Site H evaluation dataset.
Spectra are sorted by the classification probability scores shown along the upper edge of each subplot. A positive relationship between signal amplitude and probability score is apparent, with higher-amplitude signals being assigned higher probability scores by the network. Many of the signals counted as misclassifications appear to have been correctly classified by the network, but were likely incorrectly labeled in the unsupervised step used to create the training and evaluation datasets. For instance, the majority of spectra misclassified as Risso’s dolphin or PWS type A appear to have been correctly assigned to those classes, respectively.
Table 4.
Detection-level confusion matrix for the independent, manually labeled, unbalanced evaluation dataset from Site E.
A total of 38,099,453 detections were classified. Detections are given in counts rather than percentages due to the large disparities between class sizes in this unbalanced dataset.
Fig 10.
Distributions of detection-level label probability scores for each class for the Site E evaluation dataset.
In this case, classes are taken to be those assigned by the network.
Fig 11.
Detection level precision and recall curves for each class in the Site E evaluation dataset.
Numbers within each plot represent thresholds applied to the classification probability scores assigned by the network; each point shows the precision and recall achieved by retaining only those labels with probability scores greater than or equal to the associated threshold.
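The thresholding behind these curves can be sketched as follows. This is a minimal, single-class illustration with toy labels and scores; the variable names are not from the paper.

```python
import numpy as np

def pr_curve(y_true, y_pred, scores, cls, thresholds):
    """Precision and recall for one class as the minimum probability
    threshold is raised: labels scoring below the threshold are
    discarded (left unlabeled), as in the figure."""
    y_true, y_pred, scores = map(np.asarray, (y_true, y_pred, scores))
    points = []
    for t in thresholds:
        kept = (y_pred == cls) & (scores >= t)      # labels surviving the threshold
        tp = np.sum(kept & (y_true == cls))         # of those, the correct ones
        precision = tp / kept.sum() if kept.any() else 1.0
        recall = tp / np.sum(y_true == cls)         # relative to all true members
        points.append((precision, recall))
    return points

# Toy example: 4 detections, class 1, two thresholds.
pts = pr_curve([0, 0, 1, 1], [0, 1, 1, 1], [0.95, 0.6, 0.9, 0.8],
               cls=1, thresholds=[0.0, 0.7])
# In this toy case, raising the threshold removes only the one false
# positive (score 0.6), so precision improves while recall is unchanged.
assert pts[1] == (1.0, 1.0)
```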
Fig 12.
Spectra of bins counted as network misclassifications in the balanced Site H evaluation dataset.
All bins shown had classification probability scores greater than 0.9. Only classes containing misclassifications are shown. A majority of the misclassified UD bins appear to be Risso’s dolphin bins. Some of the PWS Type A events are consistent with that class and may represent cases that were mislabeled in the cluster-derived Site H evaluation dataset.
Table 5.
Confusion matrix for bin-level classifier on the evaluation dataset, consisting of 1,000 examples per category.
Values indicate percentages.
Table 6.
Bin-level confusion matrix for the independent, manually labeled, unbalanced evaluation dataset from Site E.
A total of 11,867 bin-level averages were classified. Bins are given in counts rather than percentages due to the large disparities between class sizes in this unbalanced dataset.
Fig 13.
Distributions of bin-level label probability scores for each class for the Site E evaluation dataset.
In this case, classes are taken to be those assigned by the network.
Fig 14.
Bin level precision and recall curves for each class in the Site E evaluation dataset.
Numbers within each plot represent thresholds applied to the classification probability scores assigned by the network; each point shows the precision and recall achieved by retaining only those labels with probability scores greater than or equal to the associated threshold. “None” labels are included in these metrics.
Fig 15.
Comparison of detection- and bin-level classification results on a six-hour data segment from Site E, displayed as a long-term spectral average (LTSA) (A). Classes assigned by the bin-level network are shown in (B) as a time series of detection received levels, and in (C) as a time series of inter-detection intervals. Each point represents one detection, and color indicates the class assigned by the bin-level classifier. Blue points represent detections which were not assigned a class. Classes assigned by the click-level network are shown in (D) and (E). Note that every click is labeled in the click-level case.