Fig 1.
Overview of the experimental workflow.
Three evaluation modes were applied: (1) label agreement with FADAR classes, (2) unsupervised cluster quality, and (3) acoustic pattern discovery.
Table 1.
Site locations and species for the passive acoustic training data. The number of recordings is the count of 20-second recordings received from the recorders at each site before pre-processing and balancing.
Fig 2.
Example spectrograms for each of the six FADAR-defined classes.
Call examples are drawn from the labeled dataset used to train FADAR [26] and illustrate the characteristic spectro-temporal structure of each sound type.
Fig 3.
Architecture of the proposed PAM-SimCLR framework.
Multiple augmented views (two global crops plus local crops) are processed by a shared ResNet-18 encoder and projection head, with an exponential-moving-average (EMA) teacher providing soft multi-positive/negative targets.
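A minimal PyTorch sketch of the student/teacher setup named in this caption, assuming a standard ResNet-18 backbone adapted to single-channel spectrograms; the projection width, crop sizes, momentum value, and all variable names are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """ResNet-18 backbone with an MLP projection head (dimensions are assumptions)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Spectrograms are single-channel, so replace the RGB stem.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.proj = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True),
                                  nn.Linear(512, proj_dim))

    def forward(self, x):
        return nn.functional.normalize(self.proj(self.backbone(x)), dim=-1)

student, teacher = Encoder(), Encoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(student, teacher, momentum=0.99):
    """Exponential-moving-average update of the teacher (buffers omitted for brevity)."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Two global and two local crops, all through the shared student encoder;
# the teacher embeds the global views to supply soft multi-positive/negative targets.
views = ([torch.randn(8, 1, 128, 128) for _ in range(2)]
         + [torch.randn(8, 1, 64, 64) for _ in range(2)])
student_z = [student(v) for v in views]
with torch.no_grad():
    teacher_z = [teacher(v) for v in views[:2]]
```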
Fig 4.
Augmentations used to create positive pairs.
Each spectrogram is cropped globally or locally and then transformed by time/frequency masking, spectral notching, temporal shift/truncation, or Gaussian noise.
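A NumPy sketch of the four augmentation families named above; mask widths, notch width, shift range, and noise scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def time_freq_mask(spec, max_t=20, max_f=10):
    """SpecAugment-style masking: zero one random time span and one frequency band."""
    s = spec.copy()
    t0 = rng.integers(0, s.shape[1] - max_t)
    f0 = rng.integers(0, s.shape[0] - max_f)
    s[:, t0:t0 + rng.integers(1, max_t)] = 0.0
    s[f0:f0 + rng.integers(1, max_f), :] = 0.0
    return s

def spectral_notch(spec, half_width=3):
    """Zero a narrow frequency band around a random center bin."""
    s = spec.copy()
    c = rng.integers(half_width, s.shape[0] - half_width)
    s[c - half_width:c + half_width, :] = 0.0
    return s

def temporal_shift(spec, max_shift=30):
    """Circularly shift the spectrogram along the time axis."""
    return np.roll(spec, rng.integers(-max_shift, max_shift), axis=1)

def add_noise(spec, sigma=0.05):
    """Add Gaussian noise scaled to the spectrogram's standard deviation."""
    return spec + rng.normal(0.0, sigma * spec.std(), size=spec.shape)

spec = rng.random((128, 256))  # frequency bins x time frames
view = add_noise(temporal_shift(spectral_notch(time_freq_mask(spec))))
```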
Table 2.
Summary of feature extraction methods. Full implementation details are provided in S2 Appendix.
Table 3.
Evaluation metrics used in Experiments 1 and 2. Arrows indicate the desired direction of each score. Formal definitions are provided in S2 Appendix.
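For reference, all six metrics in the table can be computed with scikit-learn and SciPy; the helper `hungarian_accuracy` is our illustrative name for accuracy under the best one-to-one cluster-to-label assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def hungarian_accuracy(y_true, y_pred):
    """Accuracy after optimally matching cluster IDs to labels (Hungarian algorithm)."""
    n = max(y_true.max(), y_pred.max()) + 1
    contingency = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1
    rows, cols = linear_sum_assignment(-contingency)  # negate to maximize matches
    return contingency[rows, cols].sum() / len(y_true)

def evaluate(X, y_true, y_pred):
    """External metrics need labels; internal metrics need only features X."""
    return {
        "ARI (up)": adjusted_rand_score(y_true, y_pred),
        "AMI (up)": adjusted_mutual_info_score(y_true, y_pred),
        "HungAcc (up)": hungarian_accuracy(y_true, y_pred),
        "Silhouette (up)": silhouette_score(X, y_pred),
        "DBI (down)": davies_bouldin_score(X, y_pred),
        "CH (up)": calinski_harabasz_score(X, y_pred),
    }

# Tiny synthetic check: identical labelings score perfectly on external metrics.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = rng.integers(0, 6, size=120)
print(evaluate(X, y, y))
```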
Table 4.
Internal clustering metrics on classical acoustic datasets. Higher Silhouette and CH, and lower DBI indicate better clustering quality. This table provides reference clustering performance on clean, label-rich audio datasets; comparisons against learnable baselines for the reef PAM experiments are reported separately in Table 5.
Table 5.
Evaluation of baseline methods. External metrics (ARI, AMI, Hungarian Accuracy) assess agreement with FADAR labels. Internal metrics (Silhouette, DBI, CH) assess cohesion and separability. Higher is better except for DBI.
Table 6.
Evaluation of contrastive learning methods. External metrics (ARI, AMI, Hungarian Accuracy) assess agreement with FADAR labels. Internal metrics (Silhouette, DBI, CH) assess cohesion and separability. Higher is better except for DBI.
Table 7.
Silhouette and Calinski–Harabasz (CH) scores for different clustering algorithms applied to PAM-SimCLR embeddings at k = 60. Higher values indicate better clustering performance.
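A sketch of how such a comparison can be run on an (n × d) embedding matrix; the three clusterers below are common choices and may not be the exact set reported in Table 7:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def compare_clusterers(Z, k=60, seed=0):
    """Silhouette and CH scores for several clusterers on the same embeddings Z."""
    clusterers = {
        "k-means": KMeans(n_clusters=k, n_init=10, random_state=seed),
        "agglomerative": AgglomerativeClustering(n_clusters=k),
        "gmm": GaussianMixture(n_components=k, random_state=seed),
    }
    scores = {}
    for name, model in clusterers.items():
        labels = model.fit_predict(Z)
        scores[name] = {"silhouette": silhouette_score(Z, labels),
                        "CH": calinski_harabasz_score(Z, labels)}
    return scores
```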
Table 8.
Number of acoustic signatures discovered across 10 sites. Cohesion is reported as mean intra-cluster cosine similarity.
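The cohesion statistic can be computed directly from L2-normalized embeddings; `cluster_cohesion` and the threshold variable `tau` are illustrative names:

```python
import numpy as np

def cluster_cohesion(Z, labels):
    """Mean pairwise cosine similarity within each cluster."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cohesion = {}
    for c in np.unique(labels):
        V = Zn[labels == c]
        if len(V) < 2:  # a singleton cluster is trivially cohesive
            cohesion[c] = 1.0
            continue
        sims = V @ V.T                     # pairwise cosine similarities
        iu = np.triu_indices(len(V), k=1)  # unique pairs only
        cohesion[c] = sims[iu].mean()
    return cohesion

# Keep only clusters above a cohesion threshold tau (value is dataset-specific):
# kept = [c for c, v in cluster_cohesion(Z, labels).items() if v >= tau]
```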
Table 9.
Example entries from the acoustic signature dictionary.
Fig 5.
3D UMAP visualization of the test-set embeddings for (A) the supervised SupCon model and (B) the PAM-SimCLR model, colored by the six FADAR species-level labels.
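The projections in this figure can be reproduced in outline with the umap-learn package; the placeholder embeddings, neighbor count, and minimum distance below are assumptions, not the paper's settings:

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 128))        # placeholder for model embeddings
labels = rng.integers(0, 6, size=500)  # placeholder for the six FADAR labels

coords = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1,
                   random_state=0).fit_transform(Z)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels, s=2, cmap="tab10")
plt.show()
```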
Fig 6.
Silhouette score vs. cluster number (k) on PAM-SimCLR embeddings, showing decreasing cohesion at higher k.
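A sketch of the k sweep behind such a curve, assuming k-means assignments (the actual clusterer and grid of k values may differ):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_sweep(Z, ks, seed=0):
    """Silhouette score of k-means clusterings as a function of k."""
    return {k: silhouette_score(
                   Z, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z))
            for k in ks}

# e.g. silhouette_sweep(Z, ks=range(10, 101, 10))
```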
Fig 7.
UMAP projection of Puerto Rico BDS latent space (PAM-SimCLR embeddings), showing the 11 acoustic signatures present after applying the cohesion threshold.
Points are colored by cluster ID; labels give each retained cluster's index among the original 60 clusters.
Fig 8.
Spectrogram gallery of representative acoustic signatures identified through clustering across all sites.
The examples include recurrent sounds associated with spawning species observed at multiple locations, as well as sound signatures restricted to particular sites. Distinct anthropogenic sounds, such as vessel noise, also appear as site-specific signatures. The full spectrogram gallery is provided in S1 Appendix. Sounds with similar spectro-temporal structure may be grouped into the same cluster, even when visual differences are subtle (e.g., Clusters 3 and 5).