Auxiliary self-supervision to metric learning for music similarity-based retrieval and auto-tagging

doi:10.1371/journal.pone.0294643

Fig 1.

Model overview.

For each batch comprising pairs of a music track x and its corresponding multi-tag y, the music tracks undergo transformations (indicated by arrows) to compute the self-supervised learning loss and the metric learning loss. These losses define the overall loss function (Eq (20)) used to train our proposed model. After training, given a music track x, the embedding vector z^exc is used for similarity-based retrieval and the estimated multi-tag probabilities are used for auto-tagging.
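
As a rough illustration of the inference step described in this caption, the sketch below ranks catalog tracks by the similarity of their embedding vectors to a query embedding and converts tag probabilities into a tag list. The function names, the use of cosine similarity, and the 0.5 threshold are all assumptions made for illustration; the caption does not specify the similarity measure or the decision rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_similar(query_embedding, catalog_embeddings, k=3):
    """Return the ids of the k catalog tracks most similar to the query.

    catalog_embeddings maps a hypothetical track id to its embedding vector.
    """
    scored = [
        (track_id, cosine_similarity(query_embedding, embedding))
        for track_id, embedding in catalog_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [track_id for track_id, _ in scored[:k]]


def predict_tags(tag_probabilities, threshold=0.5):
    """Keep tags whose estimated probability meets the threshold (assumed 0.5)."""
    return sorted(tag for tag, p in tag_probabilities.items() if p >= threshold)


if __name__ == "__main__":
    # Toy embeddings and probabilities, not taken from the paper.
    catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
    print(retrieve_similar([1.0, 0.0], catalog, k=2))
    print(predict_tags({"guitar": 0.8, "vocal": 0.3, "rock": 0.5}))
```

In practice the embeddings and probabilities would come from the trained model; here they are hand-written toy values.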

Table 1.

Results for supervised scenario of MagnaTagATune dataset.

Table 2.

Results for supervised scenario of MTG-Jamendo dataset.

Fig 2.

Similarity-based retrieval R@K results for semi-supervised scenario of MagnaTagATune dataset.

Fig 3.

Similarity-based retrieval M@K results for semi-supervised scenario of MagnaTagATune dataset.

Fig 4.

Auto-tagging results for semi-supervised scenario of MagnaTagATune dataset.

Fig 5.

Similarity-based retrieval R@K results for semi-supervised scenario of MTG-Jamendo dataset.

Fig 6.

Similarity-based retrieval M@K results for semi-supervised scenario of MTG-Jamendo dataset.

Fig 7.

Auto-tagging results for semi-supervised scenario of MTG-Jamendo dataset.

Fig 8.

T-SNE visualization of similarity latent space for MagnaTagATune dataset.

Green, blue, and yellow dots correspond to music tracks with ‘female vocal’ tags, ‘no vocal’ tags, and other tags, respectively. The percentage indicates the reduction in labels used for training.

Fig 9.

T-SNE visualization of similarity latent space for MTG-Jamendo dataset.

Green, blue, and yellow dots correspond to music tracks with ‘instrument—voice’ tags, ‘genre—instrumentalpop’ tags, and other tags, respectively. The percentage indicates the reduction in labels used for training.