Fig 1.
Overview of the experimental workflow.
Three evaluation modes were applied: (1) label agreement with FADAR classes, (2) unsupervised cluster quality, and (3) acoustic pattern discovery.
Table 1.
Site locations and species for the passive acoustic training data. The number of recordings is the count of 20-second recordings received from the recorders at each site before pre-processing and balancing.
Fig 2.
Example spectrograms for each of the six FADAR-defined classes.
Call examples are drawn from the labeled dataset used to train FADAR [26] and illustrate the characteristic spectro-temporal structure of each sound type.
Fig 3.
Architecture of the proposed PAM-SimCLR framework.
Multiple augmented views (two global crops plus local crops) are processed by a shared ResNet-18 encoder and projection head, with an exponential-moving-average (EMA) teacher providing soft multi-positive/negative targets.
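A minimal PyTorch sketch of the student/teacher setup named in this caption, assuming a standard ResNet-18 backbone adapted to single-channel spectrograms; the projection width, crop sizes, momentum value, and all variable names are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """ResNet-18 backbone with an MLP projection head (dimensions are assumptions)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Spectrograms are single-channel, so replace the RGB stem.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.proj = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True),
                                  nn.Linear(512, proj_dim))

    def forward(self, x):
        return nn.functional.normalize(self.proj(self.backbone(x)), dim=-1)

student, teacher = Encoder(), Encoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(student, teacher, momentum=0.99):
    """Exponential-moving-average update of the teacher (buffers omitted for brevity)."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Two global and two local crops, all through the shared student encoder;
# the teacher embeds the global views to supply soft multi-positive/negative targets.
views = ([torch.randn(8, 1, 128, 128) for _ in range(2)]
         + [torch.randn(8, 1, 64, 64) for _ in range(2)])
student_z = [student(v) for v in views]
with torch.no_grad():
    teacher_z = [teacher(v) for v in views[:2]]
```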
Fig 4.
Augmentations used to create positive pairs.
Each spectrogram is cropped globally or locally and then transformed by time/frequency masking, spectral notching, temporal shift/truncation, or Gaussian noise.
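A NumPy sketch of the four augmentation families named above; mask widths, notch width, shift range, and noise scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def time_freq_mask(spec, max_t=20, max_f=10):
    """SpecAugment-style masking: zero one random time span and one frequency band."""
    s = spec.copy()
    t0 = rng.integers(0, s.shape[1] - max_t)
    f0 = rng.integers(0, s.shape[0] - max_f)
    s[:, t0:t0 + rng.integers(1, max_t)] = 0.0
    s[f0:f0 + rng.integers(1, max_f), :] = 0.0
    return s

def spectral_notch(spec, half_width=3):
    """Zero a narrow frequency band around a random center bin."""
    s = spec.copy()
    c = rng.integers(half_width, s.shape[0] - half_width)
    s[c - half_width:c + half_width, :] = 0.0
    return s

def temporal_shift(spec, max_shift=30):
    """Circularly shift the spectrogram along the time axis."""
    return np.roll(spec, rng.integers(-max_shift, max_shift), axis=1)

def add_noise(spec, sigma=0.05):
    """Add Gaussian noise scaled to the spectrogram's standard deviation."""
    return spec + rng.normal(0.0, sigma * spec.std(), size=spec.shape)

spec = rng.random((128, 256))  # frequency bins x time frames
view = add_noise(temporal_shift(spectral_notch(time_freq_mask(spec))))
```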
Table 2.
Summary of feature extraction methods. Full implementation details are provided in S2 Appendix.
Table 3.
Evaluation metrics used in Experiments 1 and 2. Arrows indicate the desired direction of each score. Formal definitions are provided in S2 Appendix.
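For reference, all six metrics in the table can be computed with scikit-learn and SciPy; the helper `hungarian_accuracy` is our illustrative name for accuracy under the best one-to-one cluster-to-label assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def hungarian_accuracy(y_true, y_pred):
    """Accuracy after optimally matching cluster IDs to labels (Hungarian algorithm)."""
    n = max(y_true.max(), y_pred.max()) + 1
    contingency = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1
    rows, cols = linear_sum_assignment(-contingency)  # negate to maximize matches
    return contingency[rows, cols].sum() / len(y_true)

def evaluate(X, y_true, y_pred):
    """External metrics need labels; internal metrics need only features X."""
    return {
        "ARI (up)": adjusted_rand_score(y_true, y_pred),
        "AMI (up)": adjusted_mutual_info_score(y_true, y_pred),
        "HungAcc (up)": hungarian_accuracy(y_true, y_pred),
        "Silhouette (up)": silhouette_score(X, y_pred),
        "DBI (down)": davies_bouldin_score(X, y_pred),
        "CH (up)": calinski_harabasz_score(X, y_pred),
    }

# Tiny synthetic check: identical labelings score perfectly on external metrics.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = rng.integers(0, 6, size=120)
print(evaluate(X, y, y))
```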
Table 4.
Internal clustering metrics on classical acoustic datasets. Higher Silhouette and CH, and lower DBI indicate better clustering quality. This table provides reference clustering performance on clean, label-rich audio datasets; comparisons against learnable baselines for the reef PAM experiments are reported separately in Table 5.
Table 5.
Evaluation of baseline methods. External metrics (ARI, AMI, Hungarian Accuracy) assess agreement with FADAR labels. Internal metrics (Silhouette, DBI, CH) assess cohesion and separability. Higher is better except for DBI.
Table 6.
Evaluation of contrastive learning methods. External metrics (ARI, AMI, Hungarian Accuracy) assess agreement with FADAR labels. Internal metrics (Silhouette, DBI, CH) assess cohesion and separability. Higher is better except for DBI.
Table 7.
Silhouette and Calinski–Harabasz (CH) scores for different clustering algorithms applied to PAM-SimCLR embeddings at k = 60. Higher values indicate better clustering performance.
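A sketch of how such a comparison can be run on an (n × d) embedding matrix; the three clusterers below are common choices and may not be the exact set reported in Table 7:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def compare_clusterers(Z, k=60, seed=0):
    """Silhouette and CH scores for several clusterers on the same embeddings Z."""
    clusterers = {
        "k-means": KMeans(n_clusters=k, n_init=10, random_state=seed),
        "agglomerative": AgglomerativeClustering(n_clusters=k),
        "gmm": GaussianMixture(n_components=k, random_state=seed),
    }
    scores = {}
    for name, model in clusterers.items():
        labels = model.fit_predict(Z)
        scores[name] = {"silhouette": silhouette_score(Z, labels),
                        "CH": calinski_harabasz_score(Z, labels)}
    return scores
```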
Table 8.
Number of acoustic signatures discovered across 10 sites. Cohesion is reported as mean intra-cluster cosine similarity.
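The cohesion statistic can be computed directly from L2-normalized embeddings; `cluster_cohesion` and the threshold variable `tau` are illustrative names:

```python
import numpy as np

def cluster_cohesion(Z, labels):
    """Mean pairwise cosine similarity within each cluster."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cohesion = {}
    for c in np.unique(labels):
        V = Zn[labels == c]
        if len(V) < 2:  # a singleton cluster is trivially cohesive
            cohesion[c] = 1.0
            continue
        sims = V @ V.T                     # pairwise cosine similarities
        iu = np.triu_indices(len(V), k=1)  # unique pairs only
        cohesion[c] = sims[iu].mean()
    return cohesion

# Keep only clusters above a cohesion threshold tau (value is dataset-specific):
# kept = [c for c, v in cluster_cohesion(Z, labels).items() if v >= tau]
```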
Table 9.
Example entries from the acoustic signature dictionary.
Fig 5.
3D UMAP visualization of the test-set embeddings for (A) the supervised SupCon model and (B) the PAM-SimCLR model, colored by the six FADAR species-level labels.
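The projections in this figure can be reproduced in outline with the umap-learn package; the placeholder embeddings, neighbor count, and minimum distance below are assumptions, not the paper's settings:

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 128))        # placeholder for model embeddings
labels = rng.integers(0, 6, size=500)  # placeholder for the six FADAR labels

coords = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1,
                   random_state=0).fit_transform(Z)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels, s=2, cmap="tab10")
plt.show()
```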
Fig 6.
Silhouette score vs. cluster number (k) on PAM-SimCLR embeddings, showing decreasing cohesion at higher k.
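A sketch of the k sweep behind such a curve, assuming k-means assignments (the actual clusterer and grid of k values may differ):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_sweep(Z, ks, seed=0):
    """Silhouette score of k-means clusterings as a function of k."""
    return {k: silhouette_score(
                   Z, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z))
            for k in ks}

# e.g. silhouette_sweep(Z, ks=range(10, 101, 10))
```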
Fig 7.
UMAP projection of Puerto Rico BDS latent space (PAM-SimCLR embeddings), showing the 11 acoustic signatures present after applying the cohesion threshold.
Points are colored by cluster ID; labels give each retained cluster's index among the original 60 clusters.
Fig 8.
Spectrogram gallery of representative acoustic signatures identified through clustering across all sites.
The examples include recurrent sounds associated with spawning species observed at multiple locations, as well as sound signatures restricted to particular sites. Distinct anthropogenic sounds, such as vessel noise, also appear as site-specific signatures. The full spectrogram gallery is provided in S1 Appendix. Sounds with similar spectro-temporal structure may be grouped into the same cluster, even when visual differences are subtle (e.g., Clusters 3 and 5).