Fig 1.
For each batch comprising pairs of a music track x and its corresponding multi-tag y, the music tracks undergo transformations (indicated by arrows) to compute the self-supervised learning loss and the metric learning loss. The losses are used to define the overall loss function (Eq (20)) to train our proposed model. After training the model, given a music track x, the embedding vector zexc and the estimated probabilities of the multi-tag are used for similarity-based retrieval and auto-tagging, respectively.
Table 1.
Results for supervised scenario of MagnaTagATune dataset.
Table 2.
Results for supervised scenario of MTG-Jamendo dataset.
Fig 2.
Similarity-based retrieval R@K results for semi-supervised scenario of MagnaTagATune dataset.
Fig 3.
Similarity-based retrieval M@K results for semi-supervised scenario of MagnaTagATune dataset.
Fig 4.
Auto-tagging results for semi-supervised scenario of MagnaTagATune dataset.
Fig 5.
Similarity-based retrieval R@K results for semi-supervised scenario of MTG-Jamendo dataset.
Fig 6.
Similarity-based retrieval M@K results for semi-supervised scenario of MTG-Jamendo dataset.
Fig 7.
Auto-tagging results for semi-supervised scenario of MTG-Jamendo dataset.
Fig 8.
t-SNE visualization of similarity latent space for MagnaTagATune dataset.
Green, blue, and yellow dots correspond to music tracks with ‘female vocal’ tags, ‘no vocal’ tags, and other tags, respectively. The percentage indicates the reduction in labels used for training.
Fig 9.
t-SNE visualization of similarity latent space for MTG-Jamendo dataset.
Green, blue, and yellow dots correspond to music tracks with ‘instrument---voice’ tags, ‘genre---instrumentalpop’ tags, and other tags, respectively. The percentage indicates the reduction in labels used for training.