Auxiliary self-supervision to metric learning for music similarity-based retrieval and auto-tagging

Abstract

In the realm of music information retrieval, similarity-based retrieval and auto-tagging serve as essential components. Similarity-based retrieval involves automatically analyzing a music track and fetching analogous tracks from a database. Auto-tagging, on the other hand, assesses a music track to deduce associated tags, such as genre and mood. Given the limitations and non-scalability of human supervision signals, it becomes crucial for models to learn from alternative sources to enhance their performance. Contrastive learning-based self-supervised learning, which exclusively relies on learning signals derived from music audio data, has demonstrated its efficacy in the context of auto-tagging. In this work, we propose a model that builds on the self-supervised learning approach to address the similarity-based retrieval challenge by introducing our method of metric learning with a self-supervised auxiliary loss. Furthermore, diverging from conventional self-supervised learning methodologies, we discovered the advantages of concurrently training the model with both self-supervision and supervision signals, without freezing pre-trained models. We also found that refraining from employing augmentation during the fine-tuning phase yields better results. Our experimental results confirm that the proposed methodology enhances retrieval and tagging performance metrics in two distinct scenarios: one where human-annotated tags are consistently available for all music tracks, and another where such tags are accessible only for a subset of music tracks.

1 Introduction

Just as web search engines, article curation, and recommendation have revolutionized the way we gather information, search engines, curation, and recommendation are becoming increasingly important in the field of music as well, shaping how we listen to music and how creators produce content.

With the advent of music streaming services, we have entered an era where, depending on how we search, we can listen to a wide variety of music tailored to our contexts. We have begun to consume and produce large amounts of video content on social media and video streaming services. With the widespread use of smartphones, we casually capture daily memories in videos and edit them, leading to an explosive increase in video consumption and production in recent years. Music or background music (BGM) is effectively and skillfully used in these videos [1], deeply influencing our emotions, often without our conscious awareness. There is a demand to add music to such videos, irrespective of whether the creator is professional or amateur. Furthermore, with AI music generation, we are entering an era where music is semi-automatically produced [2, 3], indicating a forthcoming deluge of music. Now, more than ever, there is a growing need for information organization techniques to deliver the desired music to consumers and creators.

At the core of this information organization technology lie auto-tagging and similarity-based retrieval. Auto-tagging is a task where, upon inputting a music track into the system, it automatically analyzes the track and outputs tag information related to genre, mood, instruments, etc. This serves as the foundation for various music delivery applications such as recommendation, curation, playlist generation, and user behavior analysis [4]. Similarity-based retrieval, on the other hand, is a task where, upon inputting a music track into the system, it automatically analyzes the track, retrieves similar music tracks from the database, and ranks them in order of similarity. Besides forming the basis for music delivery applications like recommendation, query-by-example, and playlist generation [5], similarity-based retrieval itself also becomes a significant application.

To effectively handle the immense volume of available music information, enhancing foundational technologies such as auto-tagging and similarity-based retrieval is essential. However, the frequent absence of consistent and informative tag data for music tracks complicates the training of models for these tasks. Manual tagging has its limitations, from inconsistencies among annotators to challenges in adapting to new genres and variations in tag notation. Further, for non-mainstream genres and music catalogs geared more towards business rather than direct consumers, relying on public tagging is not only challenging but often impractical or impossible [6]. Data from user activity on popular music streaming services offers insights into user preferences, but it has several issues. Obtaining objective data about a music track's genre, mood, and other attributes is difficult. Additionally, this data is inaccessible unless developers have access to a popular service, new tracks lack feedback, and feedback primarily focuses on popular tracks [7]. Given this context, there is a demand for technologies that can fill in the gaps of objective music content information. This paper introduces a technology capable of automatically supplementing such music content information, enhancing similarity-based retrieval and auto-tagging performance.

Conventional methods for similarity-based music retrieval largely depend on supervised learning, utilizing learning signals derived from human-annotated tags [8]. In contrast, self-supervised learning gleans its learning signals from inherent properties of the music tracks themselves, thus autonomously augmenting music content information without the need for attached annotations or metadata. Among these self-supervised approaches, contrastive learning has shown promise and has been applied to auto-tagging [9]. In this work, we present a model that integrates metric learning and contrastive self-supervised learning. We demonstrate that contrastive self-supervised learning is advantageous not only for auto-tagging but also for the similarity-based retrieval task. Furthermore, we introduce refined techniques to improve conventional self-supervised learning methods.

What is an intuitive explanation for our self-supervised signals? The similarity between music tracks is typically defined by their global similarity, which considers how closely related their global attributes are [8]. Auto-tagging performance is assessed based on the ability to infer global tags from each music track [8, 10, 11]. Our neural network model aims to extract such global attribute features without relying solely on manually annotated tags. We formulate learning signals under the assumption that excerpts from the same music track are more likely to possess similar global attribute features compared to excerpts from different music tracks. Additionally, we assume that the global attribute features of a track remain relatively unchanged even after applying augmentation transformations, such as reverberation, band-pass filtering, and pitch shifting. Given that the learning signal is derived from annotations inherent to the music audio (i.e., self-supervised) rather than from human-provided annotations (i.e., human-supervised), this approach is termed self-supervised learning.

To effectively integrate self-supervision signals into our model, a deliberate design consideration is essential. This includes determining where in the architecture to situate embeddings for similarity-based retrieval, given that global attribute features are more directly relevant to these embeddings than classification probabilities. To this end, we have strategically placed embeddings for similarity-based retrieval immediately after the layer where output features are influenced by self-supervised signals. Additionally, we have carefully considered the placement of normalization operations, ensuring that they do not impact the head of the network on the self-supervised loss function side. We have placed them after the branch leading to the head of the network on the supervised loss function side.

Our self-supervised loss diverges from conventional self-supervised losses in several aspects. Self-supervised learning is frequently introduced in the context of representation learning, wherein the acquired representation, or feature, is fixed (the learned neural network is frozen), and the representation is employed for other tasks during the so-called fine-tuning phase [9, 12]. In this paper, we utilize self-supervised learning to enhance task performance and propose adapted learning techniques. Specifically, 1) during the fine-tuning phase, the neural network is not frozen, allowing the entire network to be trained to capitalize on its expressivity. 2) Self-supervised learning signals are employed even in the fine-tuning phase. 3) Augmentation is omitted for self-supervised learning during the fine-tuning phase, enabling our neural network model to be trained with higher quality data. Overall, we consider the self-supervised signal as an auxiliary loss in relation to the primary metric learning loss, which improves performance compared to employing the standard self-supervised approach, where the learned neural network is frozen during the fine-tuning phase.

To further leverage the self-supervised signals, especially to address situations where real-world data doesn’t always have clean and informative tags, we empirically demonstrate that our method is also effective in addressing semi-supervised scenarios where obtaining human-annotated tags for music tracks is expensive and tags may not always be available for all music tracks used in training models. Notably, the improvement over existing methods was even more significant in situations where only 1% of the songs in the database were tagged.

Our primary contributions can be summarized as follows:

  • We propose a model architecture and a training algorithm that employ self-supervised learning to boost the performance of similarity-based music retrieval in both supervised and semi-supervised contexts.
  • We introduce a self-supervised auxiliary loss for similarity-based music retrieval and music auto-tagging, which serves to augment the outcomes in comparison to the conventional self-supervised approach within the supervised scenario.

The remainder of this paper is organized as follows: Section 2 reviews the related literature, offering insights into prior research and identifying gaps in current knowledge, complementing this introduction. Section 3 describes preliminary technical terms that serve as background for understanding the methodology of the paper. Section 4 presents the methodology, introducing our problem setting and detailing the architecture and objective functions of our proposed model. Section 5 describes the experimental setup, providing information on the datasets used in the experiments, detailed model configurations, evaluation metrics, and baseline methods. Section 6 describes the experimental results of our proposed model, comparing it with baseline methods and variations of our model. Finally, Section 7 concludes the paper, summarizing the main points.

2 Related work

Spijkervet and Burgoyne demonstrated the effectiveness of SimCLR-based self-supervised learning for music auto-tagging [9]. We show that self-supervised learning is effective not only for auto-tagging but also for similarity-based music retrieval. Furthermore, our aim is to improve practical performance rather than merely evaluating representation quality. To this end, we propose a self-supervised auxiliary loss accompanied by a simple modified procedure that outperforms their self-supervised approach.

Thomé et al. introduced four triplet learning terms for learning music similarity, which include transformed excerpts, excerpts from the same track, and genre and mood membership [13]. In contrast, our model employs SimCLR-based contrastive learning for self-supervision, handles general multi-tag settings through classification-based metric learning, and addresses the auto-tagging task. Our focus is to show the effectiveness of the loss without using tag information and to demonstrate effectiveness in semi-supervised settings, which distinguishes our work from theirs.

Manocha et al. proposed a differentiable speech similarity model with applications to improving speech synthesis and enhancement models. They utilized SimCLR for pre-training the body of the model, trained the head of the model on JND data (a speech similarity dataset), and employed triplet comparisons for fine-tuning the model [14]. Their model is designed mainly as a loss for speech synthesis and enhancement models, whereas ours is designed for auto-tagging and similarity-based retrieval. Their method focuses on speech similarity using carefully designed speech-domain datasets, differing from our approach, which targets global audio similarity in the music domain by leveraging widely available tag annotations.

To improve image retrieval using unlabeled image datasets, Duan et al. introduced a self-training framework for metric learning [15]. They used self-supervised learning to train a teacher network and then used the teacher network to generate pseudo-labels, which were utilized for metric learning with a ranking loss. Our method applies self-supervision directly to the "student" network, eliminating the need for a teacher network. Additionally, their method is designed for the image domain rather than music.

Fu et al. proposed deep metric learning with self-supervised ranking to improve image retrieval and ranking [16]. They introduced an intra-class ranking loss in a self-supervised manner, in addition to metric learning for handling inter-class variance. However, their self-supervision employs an intra-class ranking loss, which is distinct from our contrastive self-supervised loss, and their method is tailored to the image domain rather than music.

In summary, our work is distinct in that we investigate how to design architectures and losses when combining supervised metric learning and classification with cutting-edge contrastive self-supervised learning.

3 Preliminary

In this section, we review some basic mathematical operations used in the next section.

3.1 Layer normalization

One way to stabilize training and reduce the training time of deep neural networks is to normalize the activities of the neurons. Layer normalization (LayerNorm) is one of the most well-known normalization techniques [17]. Formally, LayerNorm without affine parameters is defined for a vector $x = [x_1, x_2, \ldots, x_n]$ by
$$\mathrm{LN}(x) = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}, \tag{1}$$
where $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ denote the mean and variance of $x$ over its dimension. LayerNorm without affine parameters was shown to be effective in classification-based metric learning by helping the network better initialize new parameters and reach better optima [18]. In this paper, LayerNorm is used in Eqs (7) and (16).
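
A minimal sketch of LayerNorm without affine parameters in PyTorch, computing Eq (1) directly (the helper name and the small epsilon for numerical stability are our additions, not part of the paper):

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """LayerNorm without affine parameters (Eq (1)): subtract the mean and
    divide by the standard deviation, both computed over the last dimension."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

# Matches torch.nn.LayerNorm with elementwise_affine=False (same eps handling).
x = torch.randn(4, 512)
assert torch.allclose(layer_norm(x),
                      torch.nn.LayerNorm(512, elementwise_affine=False)(x), atol=1e-5)
```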

3.2 2-norm

The 2-norm is a vector norm defined for a vector $x = [x_1, x_2, \ldots, x_n]$ by
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}. \tag{2}$$
In this paper, the 2-norm is used to define the distance in the embedding space of similarity in Eqs (7) and (16), following the distance definition in Eq (4), and to define the cosine similarity in Eq (14).

3.3 Sigmoid activation

The sigmoid activation is an activation function defined element-wise for a vector $x = [x_1, x_2, \ldots, x_n]$ by
$$\sigma(x) = \left[\frac{1}{1 + e^{-x_1}}, \frac{1}{1 + e^{-x_2}}, \ldots, \frac{1}{1 + e^{-x_n}}\right]. \tag{3}$$
Since the range of the sigmoid activation is [0, 1], this activation is used for outputting the probability of binary classes. In Eq (3), when n > 1, the activation yields multiple probabilities of binary classes, which are used for multi-tag classification problems. In this paper, the sigmoid activation is used in Eqs (6) and (17).

4 Methodology

In this section, we introduce our problem setting and our proposed model, detailing the architecture, objective functions, and algorithms.

4.1 Problem setting

Let us consider a dataset $\mathcal{D} = \mathcal{D}_\mathrm{label} \cup \mathcal{D}_\mathrm{unlabel}$ consisting of a set $\mathcal{D}_\mathrm{label} = \{(x_k, y_k)\}$ of $N_\mathrm{label}$ pairs of a music track $x_k$ and its multi-tag $y_k \in \{0, 1\}^T$, and a set $\mathcal{D}_\mathrm{unlabel} = \{x_k\}$ of $N_\mathrm{unlabel}$ music tracks. Our goal is to learn a similarity function $F_\mathrm{sim}(x_k) = z_k$ given $\mathcal{D}$, where $z_k \in \mathbb{R}^D$ is an embedding vector with dimensionality $D$, and some distance in the latent space captures the similarity of data points. Here, $\mathbb{R}$ is the set of all real numbers. $F_\mathrm{sim}$ maps a music track to an embedding vector for the similarity-based retrieval task. Our goal is also to learn a tag function $F_\mathrm{tag}(x_k) = \hat{y}_k$ given $\mathcal{D}$, where $\hat{y}_k \in [0, 1]^T$ is a probability vector of $T$ tags whose $t$-th element is the probability that the $t$-th tag is assigned to $x_k$. $F_\mathrm{tag}$ maps a music track to a probability vector for the auto-tagging task.

4.2 Outline of our model

Instead of learning $F_\mathrm{sim}$ and $F_\mathrm{tag}$ directly, our model learns functions $f_\mathrm{sim}$ and $f_\mathrm{tag}$ whose input is an excerpt cropped from music tracks, following previous work [8]. $f_\mathrm{sim}$ and $f_\mathrm{tag}$ are the same as $F_\mathrm{sim}$ and $F_\mathrm{tag}$ in that they output a similarity embedding and tag probabilities. However, $f_\mathrm{sim}$ and $f_\mathrm{tag}$ differ from $F_\mathrm{sim}$ and $F_\mathrm{tag}$ in that they take as input an excerpt cropped from a music track, rather than the entire music track. We consider a music track $x_k$ as an ensemble of short excerpts derived from it. By feeding each of these excerpts into $f_\mathrm{sim}$ and $f_\mathrm{tag}$ and subsequently aggregating their outputs, we formulate $F_\mathrm{sim}$ and $F_\mathrm{tag}$. Formally, let $\{x_k^{(e)}\}_{e=1}^{E}$ be a sequence of excerpts cropped from a music track $x_k$, where $E$ is the total number of excerpts cropped from the track. Then, our model learns an excerpt similarity function $f_\mathrm{sim}$, and we define
$$F_\mathrm{sim}(x_k) = \frac{\sum_{e=1}^{E} f_\mathrm{sim}\!\left(x_k^{(e)}\right)}{\left\| \sum_{e=1}^{E} f_\mathrm{sim}\!\left(x_k^{(e)}\right) \right\|_2}, \tag{4}$$
where $\|\cdot\|_2$ is the 2-norm. Similarly, our model also learns an excerpt tag function $f_\mathrm{tag}$, and we define
$$F_\mathrm{tag}(x_k) = \frac{1}{E} \sum_{e=1}^{E} f_\mathrm{tag}\!\left(x_k^{(e)}\right). \tag{5}$$
In experiments, excerpts are non-overlapping sliding windows in each track to avoid higher computational cost and to follow the convention of previous works [8, 9].
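
A minimal sketch of the track-level aggregation in Eqs (4) and (5), under the assumptions that excerpts are non-overlapping windows of a fixed length and that Eq (4) l2-normalizes the aggregated excerpt embeddings while Eq (5) averages tag probabilities (`f_sim` and `f_tag` stand for the learned excerpt-level functions; names are illustrative):

```python
import torch
import torch.nn.functional as F

def track_embedding(track: torch.Tensor, f_sim, excerpt_len: int) -> torch.Tensor:
    """F_sim (Eq (4)): sum excerpt embeddings, then l2-normalize."""
    excerpts = track[: len(track) // excerpt_len * excerpt_len].reshape(-1, excerpt_len)
    z = f_sim(excerpts)                        # (E, D) excerpt embeddings
    return F.normalize(z.sum(dim=0), dim=0)    # (D,) track embedding

def track_tags(track: torch.Tensor, f_tag, excerpt_len: int) -> torch.Tensor:
    """F_tag (Eq (5)): average excerpt-level tag probabilities."""
    excerpts = track[: len(track) // excerpt_len * excerpt_len].reshape(-1, excerpt_len)
    return f_tag(excerpts).mean(dim=0)         # (T,) tag probabilities
```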

Next, we explain the outline of how to model and learn the similarity and tag functions $f_\mathrm{sim}$ and $f_\mathrm{tag}$, which are also visualized in Fig 1. Similarity learning (metric learning) is achieved by a tagging (classification) based methodology, as revealed in prior studies [18, 19], where we use the output from the layer just before the final layer of the classification model as an embedding for similarity. Formally, our model learns $f_\mathrm{tag}$ such that
$$f_\mathrm{tag}(x) = \sigma\!\left(W z^\mathrm{exc}\right), \quad z^\mathrm{exc} = f_\mathrm{sim}(x), \tag{6}$$
where $W \in \mathbb{R}^{T \times D}$ is a parameter for mapping the output of $f_\mathrm{sim}$ to the output of $f_\mathrm{tag}$, $\sigma$ denotes the sigmoid activation, and $z^\mathrm{exc} \in \mathbb{R}^D$ is an embedding vector for similarity-based retrieval. Model architectures for similarity-based retrieval and auto-tagging are mostly shared in this formulation, which is advantageous in practice in terms of time, memory, and storage during training and inference, particularly when using both similarity-based retrieval and auto-tagging. In Sections 4.3 and 4.4, we explain how to train $f_\mathrm{sim}$ and $W$ (and thus $f_\mathrm{tag}$) in detail, where $f_\mathrm{sim}$ is defined as a function $f$, followed by layer normalization [17], followed by normalization with the 2-norm. Formally,
$$f_\mathrm{sim}(x) = \frac{\mathrm{LN}(f(x))}{\left\| \mathrm{LN}(f(x)) \right\|_2}, \tag{7}$$
where $\mathrm{LN}$ denotes layer normalization. Here, both $f$ and $f(\cdot)$ refer to the same function, and similarly, both $f_\mathrm{sim}$ and $f_\mathrm{sim}(\cdot)$ refer to the same function. Then our goal in Sections 4.3 and 4.4 boils down to learning $f$ and $W$, where we choose the SampleCNN architecture for $f$ [20]. $f$ is trained using a self-supervised learning loss and a metric learning loss (a loss function based on the metric learning approach), whereas $W$ is trained using only a metric learning loss. Since the inner product is the distance metric between each row of $W$ and $z^\mathrm{exc}$ in Eq (6), we use the inner product as the distance metric in the similarity space when conducting similarity-based retrieval.
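
A minimal sketch of how the shared body $f$, the normalization of Eq (7), and the tag head of Eq (6) fit together. The backbone below is a generic 1-D convolutional stand-in; the paper actually uses SampleCNN [20], and all class and layer choices here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Backbone(nn.Module):
    """Stand-in for SampleCNN: maps a raw-audio excerpt to a D-dim feature h."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=9), nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=9, stride=9), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, excerpt_len)
        return self.conv(x.unsqueeze(1)).squeeze(-1)        # h: (B, D)

class SimilarityTagModel(nn.Module):
    def __init__(self, dim: int = 512, num_tags: int = 50):
        super().__init__()
        self.f = Backbone(dim)
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.W = nn.Linear(dim, num_tags, bias=False)       # parameter W in Eq (6)

    def forward(self, x: torch.Tensor):
        h = self.f(x)                                       # backbone feature
        z_exc = F.normalize(self.ln(h), dim=-1)             # Eq (7)/(16): LN then l2-norm
        y_hat = torch.sigmoid(self.W(z_exc))                # Eq (6)/(17): tag probabilities
        return h, z_exc, y_hat
```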

Fig 1. Model overview.

For each batch comprising pairs of a music track $x$ and its corresponding multi-tag $y$, the music tracks undergo transformations (indicated by arrows) to compute the self-supervised learning loss $\mathcal{L}_\mathrm{SSL}$ and the metric learning loss $\mathcal{L}_\mathrm{ML}$. The losses are used to define the overall loss function (Eq (20)) to train our proposed model. After training the model, given a music track $x$, the embedding vector $z^\mathrm{exc}$ and the estimated multi-tag probabilities $\hat{y}$ are used for similarity-based retrieval and auto-tagging, respectively.

https://doi.org/10.1371/journal.pone.0294643.g001

4.3 Self-supervised learning

Consider a mini-batch $\{x_k\}_{k=1}^{B}$ from the dataset $\mathcal{D}$, where $B$ is the batch size, and a set of augmentation operations $\mathcal{A}$ (see Section 5.2 for the choice of $\mathcal{A}$ in experiments). We follow Contrastive Learning of Musical Representations (CLMR) [9], which uses the simple framework for contrastive learning of visual representations (SimCLR) for self-supervised learning [12]. For each sample $x_k$ in a mini-batch, we randomly crop two excerpts from $x_k$ (where the random crop refers to cropping an excerpt from a music track at a position drawn uniformly from all possible positions), apply an augmentation operation to each of the excerpts (where the augmentation operations $a$ and $a'$ are sampled uniformly from $\mathcal{A}$), and then feed each into the function $f$ followed by another function $g$. Formally, we compute the following transformations:
$$\tilde{x}_i = a(\mathrm{RandCrop}(x_k)), \tag{8}$$
$$\tilde{x}_j = a'(\mathrm{RandCrop}(x_k)), \tag{9}$$
$$h_i = f(\tilde{x}_i), \tag{10}$$
$$h_j = f(\tilde{x}_j), \tag{11}$$
$$z_i = g(h_i), \tag{12}$$
$$z_j = g(h_j), \tag{13}$$
where a pair $(z_i, z_j)$ is referred to as a positive pair. The random crop (denoted as $\mathrm{RandCrop}(\cdot)$) and augmentation operations are assumed to preserve the global attributes. For the architecture of $g$, we use a linear layer followed by a ReLU layer followed by a linear layer, where no bias term is used in the linear layers.
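
A minimal sketch of the two-view construction in Eqs (8)–(13), assuming `augment` applies a randomly sampled chain from the augmentation set (a hedged stand-in; see the sketch after the augmentation list in Section 5.2) and `model.f` follows the architecture sketch above. The projection dimension and helper names are illustrative:

```python
import random
import torch
import torch.nn as nn

def rand_crop(track: torch.Tensor, excerpt_len: int) -> torch.Tensor:
    """RandCrop: crop an excerpt at a uniformly random position."""
    start = random.randint(0, track.shape[-1] - excerpt_len)
    return track[..., start:start + excerpt_len]

class ProjectionHead(nn.Module):
    """g: linear -> ReLU -> linear, with no bias terms."""
    def __init__(self, dim: int = 512, proj_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim, bias=False), nn.ReLU(), nn.Linear(dim, proj_dim, bias=False)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def two_views(batch_tracks, model, g, augment, excerpt_len: int):
    """Eqs (8)-(13): two augmented excerpts per track, mapped through f then g."""
    xi = torch.stack([augment(rand_crop(x, excerpt_len)) for x in batch_tracks])
    xj = torch.stack([augment(rand_crop(x, excerpt_len)) for x in batch_tracks])
    hi, hj = model.f(xi), model.f(xj)   # Eqs (10)-(11)
    zi, zj = g(hi), g(hj)               # Eqs (12)-(13); (zi[k], zj[k]) is a positive pair
    return hi, hj, zi, zj
```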

Given a set $\{z_l\}$ including a positive pair of examples $z_i$ and $z_j$, the contrastive prediction task aims to identify $z_j$ in $\{z_l\}_{l \neq i}$ for a given $z_i$. Formally, letting $\mathrm{sim}(u, v) = u^{\top} v / (\|u\|_2 \|v\|_2)$, a contrastive loss function $\mathcal{L}_\mathrm{SSL}(i, j)$ can be defined for a contrastive prediction task as
$$\mathcal{L}_\mathrm{SSL}(i, j) = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{l=1}^{2B} \mathbb{1}_{[l \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_l)/\tau\right)}, \tag{14}$$
where $\tau$ is a temperature parameter set to the default value proposed in SimCLR [12] and $\mathbb{1}_{[l \neq i]}$ is an indicator that equals 1 if $l \neq i$ and 0 otherwise. $\mathcal{L}_\mathrm{SSL}(i, j)$ is computed for all positive pairs, i.e., both $(i, j)$ and $(j, i)$, and averaged, yielding the overall loss function
$$\mathcal{L}_\mathrm{SSL} = \frac{1}{2B} \sum_{k=1}^{B} \left[ \mathcal{L}_\mathrm{SSL}(2k-1, 2k) + \mathcal{L}_\mathrm{SSL}(2k, 2k-1) \right]. \tag{15}$$
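
A minimal sketch of the NT-Xent loss of Eqs (14)–(15), assuming the projections of the two views are stacked so that row k and row k+B form a positive pair; the temperature default below follows the common SimCLR choice and is left as a parameter:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(zi: torch.Tensor, zj: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive loss of Eqs (14)-(15), averaged over all 2B positive pairs."""
    z = F.normalize(torch.cat([zi, zj], dim=0), dim=-1)          # (2B, proj_dim)
    sim = z @ z.t() / tau                                        # cosine similarities / tau
    n = z.shape[0]
    # Exclude l == i from the denominator by masking the diagonal.
    sim = sim.masked_fill(torch.eye(n, dtype=torch.bool, device=z.device), float("-inf"))
    # Index of the positive example for each row: i <-> i + B (mod 2B).
    pos = torch.arange(n, device=z.device).roll(n // 2)
    return F.cross_entropy(sim, pos)                             # mean over all rows
```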

4.4 Metric learning with self-supervised auxiliary loss

We propose to combine classification-based metric learning with self-supervised learning. Layer normalization (denoted by $\mathrm{LN}(\cdot)$) is applied to $h_i$, followed by normalization with the 2-norm, to yield an embedding vector $z_i^\mathrm{exc}$ for similarity-based retrieval. Formally,
$$z_i^\mathrm{exc} = \frac{\mathrm{LN}(h_i)}{\left\| \mathrm{LN}(h_i) \right\|_2}. \tag{16}$$
$z_i^\mathrm{exc}$ is then multiplied by $W$, followed by an element-wise sigmoid activation, to produce the classification output $\hat{y}_i$, i.e.,
$$\hat{y}_i = \sigma\!\left(W z_i^\mathrm{exc}\right). \tag{17}$$
We use the binary cross entropy loss for each tag and average them to compute $\mathcal{L}_\mathrm{ML}(i)$:
$$\mathcal{L}_\mathrm{ML}(i) = -\frac{1}{T} \sum_{t=1}^{T} \left[ y_{i,t} \log \hat{y}_{i,t} + (1 - y_{i,t}) \log \left(1 - \hat{y}_{i,t}\right) \right]. \tag{18}$$
Let $\mathcal{I}_\mathrm{label}$ be an index set such that $\{x_i\}_{i \in \mathcal{I}_\mathrm{label}}$ is the set of all labeled samples in the mini-batch. $\mathcal{L}_\mathrm{ML}(i)$ is computed for the samples in the labeled subset and averaged, yielding the loss function
$$\mathcal{L}_\mathrm{ML} = \frac{1}{\left|\mathcal{I}_\mathrm{label}\right|} \sum_{i \in \mathcal{I}_\mathrm{label}} \mathcal{L}_\mathrm{ML}(i). \tag{19}$$
Finally, the loss function for our proposed model is a combination of the self-supervised loss $\mathcal{L}_\mathrm{SSL}$ and the metric learning loss $\mathcal{L}_\mathrm{ML}$, defined as
$$\mathcal{L} = \mathcal{L}_\mathrm{ML} + \lambda \mathcal{L}_\mathrm{SSL}. \tag{20}$$
Here, $\lambda$ is a balancing factor between the two losses $\mathcal{L}_\mathrm{ML}$ and $\mathcal{L}_\mathrm{SSL}$.
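
A minimal sketch of the metric learning loss (Eqs (18)–(19)) and the combined objective (Eq (20)), assuming `y_hat` comes from the model sketch above and `labeled_mask` marks the labeled samples in the mini-batch; the placement of lambda on the self-supervised term reflects our reading of Eq (20), and the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def metric_learning_loss(y_hat: torch.Tensor, y: torch.Tensor,
                         labeled_mask: torch.Tensor) -> torch.Tensor:
    """Eqs (18)-(19): per-tag binary cross entropy, averaged over tags and
    over the labeled samples in the mini-batch."""
    if labeled_mask.sum() == 0:
        return y_hat.new_zeros(())          # no labeled samples in this batch
    return F.binary_cross_entropy(y_hat[labeled_mask], y[labeled_mask], reduction="mean")

def total_loss(loss_ml: torch.Tensor, loss_ssl: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq (20): metric learning loss plus lambda-weighted self-supervised auxiliary loss."""
    return loss_ml + lam * loss_ssl
```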

In practice, the self-supervised learning requires a longer training time, so we first train our model with $\mathcal{L}_\mathrm{SSL}$ only; this phase is referred to as the pre-training phase. We then train with the overall loss $\mathcal{L}$ in Eq (20); this phase is referred to as the fine-tuning phase.
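
A compact sketch of the two training phases, reusing the helper sketches above (all names, loaders, and optimizers are assumptions for illustration): pre-training with the self-supervised loss only, then fine-tuning the unfrozen network with the combined loss, which in our best settings keeps the contrastive term and drops augmentation:

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, g, tracks, augment, excerpt_len, optimizer):
    """One pre-training step: L_SSL only (Eq (15)) on two augmented views."""
    _, _, zi, zj = two_views(tracks, model, g, augment, excerpt_len)
    loss = nt_xent_loss(zi, zj)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def finetune_step(model, g, tracks, y, labeled_mask, lam, excerpt_len, optimizer):
    """One fine-tuning step: the whole network stays trainable (no freezing),
    no augmentation is applied, and the contrastive term remains as an auxiliary loss."""
    hi, _, zi, zj = two_views(tracks, model, g, lambda x: x, excerpt_len)
    z_exc = F.normalize(model.ln(hi), dim=-1)                   # Eq (16)
    y_hat = torch.sigmoid(model.W(z_exc))                       # Eq (17)
    loss = total_loss(metric_learning_loss(y_hat, y, labeled_mask),
                      nt_xent_loss(zi, zj), lam)                # Eq (20)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```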

5 Experimental setup

In this section, we describe our experimental setup: the datasets used in the experiments, detailed model configurations, evaluation metrics, and baseline methods.

5.1 Dataset

In experiments, we employ two commonly used datasets for music retrieval: the MagnaTagATune dataset [6] and the MTG-Jamendo dataset [21].

5.1.1 MagnaTagATune dataset.

The MagnaTagATune dataset consists of 25,000 music tracks from 6,622 unique songs [6]. We use the top 50 tags and the same train/validation/test split as in previous work [9]; the splits are used for both similarity-based retrieval and auto-tagging. Using the conventional train/validation/test splits is essential to maintain fair comparisons with prior studies. To explore the composition of these splits, we examined the metadata of the datasets to identify common artists. It appears that there are 48 common artists, with the train and validation sets containing 203 unique artists and the test set including 75 unique artists. We obtained the MagnaTagATune dataset using the code in the CLMR repository https://github.com/Spijkervet/CLMR/blob/master/clmr/datasets/magnatagatune.py, where the dataset itself is downloaded from the sota-music-tagging-models repository https://github.com/minzwon/sota-music-tagging-models/tree/master/split/mtat.

5.1.2 MTG-Jamendo dataset.

MTG-Jamendo contains 55,000 full music tracks (320 kbps MP3) with 195 tags covering genre, instrument, and mood/theme [21]. We use the pre-defined train/validation/test splits and the top 50 tags; the splits are used for both similarity-based retrieval and auto-tagging. Employing the conventional train/validation/test splits is essential to ensure fair comparisons with prior works. To examine the characteristics of these splits, we looked into the metadata of the datasets to identify common artists. It appears that there are no common artists, with the train and validation sets containing 2,815 unique artists and the test set including 702 unique artists. We obtained the MTG-Jamendo dataset from the mtg-jamendo-dataset repository https://github.com/MTG/mtg-jamendo-dataset.

5.2 Model configurations

The set of augmentation operations $\mathcal{A}$ follows CLMR [9] for fair comparison. Specifically, the following operations are applied sequentially, each with probability $p$, to create an element of $\mathcal{A}$ (a minimal illustrative sketch follows the list):

  • polarity inversion (p = 0.8)
  • additive Gaussian noise with a decibel value sampled uniformly from [40, 80] (p = 0.01)
  • gain with decibel sampled uniformly from [−6, 0] (p = 0.3)
  • low pass filtering or high pass filtering chosen with the same probability, where their cut-off frequency is sampled uniformly from [2200, 4000] Hz and [200, 1200] Hz, respectively (p = 0.8)
  • delayed signal added to the original signal with a volume factor of 0.5 in which the delay time is randomly sampled from {200, 250, 300, …, 500} ms (p = 0.3)
  • pitch shifting with shifting semitones sampled uniformly from [−7, 7] (p = 0.6)
  • reverb with the impulse response’s room size, reverberation, and damping factor sampled uniformly from [0, 100] (p = 0.6)
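
As referenced above, a minimal sketch of such a probabilistic augmentation chain on raw-audio tensors, implementing only the two simplest operations (polarity inversion and gain); the remaining operations would be composed analogously, for example via an audio augmentation library. All helper names are illustrative:

```python
import random
import torch

def random_apply(op, p: float):
    """Apply op with probability p, otherwise return the input unchanged."""
    return lambda x: op(x) if random.random() < p else x

def polarity_inversion(x: torch.Tensor) -> torch.Tensor:
    return -x

def random_gain(x: torch.Tensor, low_db: float = -6.0, high_db: float = 0.0) -> torch.Tensor:
    gain_db = random.uniform(low_db, high_db)
    return x * (10.0 ** (gain_db / 20.0))

# A chain in the spirit of the list above (only two of the operations shown).
augment = lambda x: random_apply(random_gain, 0.3)(random_apply(polarity_inversion, 0.8)(x))
```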

We set the excerpt length to 59,049 samples, the audio to monaural, and the audio sampling rate to 22.05 kHz, following CLMR [9] for fair comparison. We set the dimensionality D of the embedding vector for similarity-based retrieval to 512 and the number of tags T to 50.

To determine the value of $\lambda$ in Eq (20), we first introduce the base balancing factor $r$ of the two terms $\mathcal{L}_\mathrm{ML}$ and $\mathcal{L}_\mathrm{SSL}$. $r$ is defined as the ratio between the converged loss values obtained when the model is trained using only $\mathcal{L}_\mathrm{SSL}$ or only $\mathcal{L}_\mathrm{ML}$, respectively, where all available labels are used when training with $\mathcal{L}_\mathrm{ML}$. The values of $r$ were 22.00 for the MagnaTagATune dataset and 18.95 for the MTG-Jamendo dataset. The candidates for $\lambda$ in Eq (20) were then set to $\{\alpha/r : \alpha \in \{0.05, 0.1, 1, 10\}\}$. For conciseness, $\{\alpha/r : \alpha \in \{0.1, 1, 10\}\}$ for the MagnaTagATune dataset and $\{\alpha/r : \alpha \in \{0.05, 0.1, 1\}\}$ for the MTG-Jamendo dataset are shown in Tables 1 and 2, respectively.
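
As a concrete worked example of the candidate grid: with r = 22.00 for MagnaTagATune, α ∈ {0.05, 0.1, 1, 10} yields λ ≈ {0.0023, 0.0045, 0.045, 0.45}.

```python
r = 22.00  # base balancing factor for MagnaTagATune (18.95 for MTG-Jamendo)
lambda_candidates = [alpha / r for alpha in (0.05, 0.1, 1, 10)]
print(lambda_candidates)  # approximately [0.0023, 0.0045, 0.0455, 0.4545]
```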

Table 1. Results for supervised scenario of MagnaTagATune dataset.

https://doi.org/10.1371/journal.pone.0294643.t001

Table 2. Results for supervised scenario of MTG-Jamendo dataset.

https://doi.org/10.1371/journal.pone.0294643.t002

In our model's pre-training phase, where only $\mathcal{L}_\mathrm{SSL}$ is used, the batch size is set to 48, and we employ the Adam optimizer with a learning rate of 0.0003 and $(\beta_1, \beta_2) = (0.9, 0.999)$. The model is trained for 10,000 and 1,000 epochs for MagnaTagATune and MTG-Jamendo, respectively.

For our model's fine-tuning phase, where the overall loss $\mathcal{L}$ in Eq (20) is used, the batch size is set to 48. We use the Adam optimizer with a learning rate of 0.001 and $(\beta_1, \beta_2) = (0.9, 0.999)$, in which the learning rate is multiplied by 0.1 when the validation loss does not improve for 5 epochs. We use weight decay with a weight of $1.0 \times 10^{-6}$, and the model is trained for at most 200 epochs. Training is stopped when the validation loss does not improve for 10 epochs, which is referred to as early stopping.
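
A minimal sketch of the fine-tuning optimization setup described above, using standard PyTorch components; `model`, `train_one_epoch`, and `validate` are assumed placeholders, and early stopping is shown as a simple patience counter:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999),
                             weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

best_val, patience = float("inf"), 0
for epoch in range(200):                       # at most 200 epochs
    train_one_epoch(model, optimizer)          # assumed training routine
    val_loss = validate(model)                 # assumed validation routine
    scheduler.step(val_loss)                   # lr x 0.1 after 5 stagnant epochs
    if val_loss < best_val:
        best_val, patience = val_loss, 0
    else:
        patience += 1
        if patience >= 10:                     # early stopping after 10 stagnant epochs
            break
```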

5.3 Evaluation metrics

In this section, we explain our evaluation metrics for the two tasks: similarity-based retrieval and auto-tagging.

5.3.1 Similarity-based retrieval.

To evaluate the similarity-based retrieval, we use the recall@K (R@K) metric to measure retrieval quality following the standard evaluation setting in image retrieval [18, 19] and a similarity-based music retrieval model [8]. This metric is useful for evaluating search methods because it measures the quality of the top K retrieved results, which are more important and more likely to be seen by users than lower ranked retrieved results.

To further assess retrieval quality, we propose using a variant of the MAP@K (Mean Average Precision at K; M@K) metric adapted for similarity-based retrieval with a multi-tag annotated music track dataset. The MAP@K metric has been widely used to evaluate recommender systems [22], and its variant, MAP@R, has been applied to image retrieval [15, 23]. da Silva et al. proposed using this metric for tag-based music retrieval [5]. The calculation of our MAP@K (M@K) is roughly as follows: we compute the tag match rate between the query music track and the retrieved music tracks. We calculate the match rate at rank 1, the cumulative match rate from rank 1 to 2, the cumulative match rate from rank 1 to 3, and so on, up to the cumulative match rate from rank 1 to K. By averaging these match rates, tracks that match tags at higher ranks receive higher scores. Formally, let $N$ be the number of music tracks in the test split and $\mathcal{T}_i$ be the set of tags of the $i$-th music track; our MAP@K (M@K) is defined as
$$\mathrm{M@K} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \frac{1}{K} \sum_{k=1}^{K} P_{i,t}(k), \tag{21}$$
where $P_{i,t}(k)$ equals the precision at $k$ for the $t$-th tag of the $i$-th music track query if the $k$-th ranked retrieved result is correct and is 0 otherwise. Here, the precision at $k$ for the $t$-th tag of the $i$-th music track query is defined as $c_k / k$, where $c_k$ is the number of music tracks that have the $t$-th tag among the top $k$ retrieved results based on the $i$-th query of a music track with the $t$-th tag.
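
A minimal sketch of this metric under our reading of Eq (21); the array layout and function name are illustrative assumptions:

```python
import numpy as np

def map_at_k(query_tags: np.ndarray, retrieved_tags: list, K: int) -> float:
    """M@K: for each query and each of its tags, average P_{i,t}(k) over k = 1..K,
    where P_{i,t}(k) is precision-at-k (c_k / k) if the k-th retrieved track has
    tag t and 0 otherwise; then average over the query's tags and over queries.

    query_tags: (N, T) binary matrix of query-track tags.
    retrieved_tags: list of N arrays, each (K, T), tags of the top-K results.
    """
    per_query = []
    for i in range(query_tags.shape[0]):
        rel = retrieved_tags[i][:K]
        tag_scores = []
        for t in np.flatnonzero(query_tags[i]):               # tags of the i-th query
            hits = rel[:, t].astype(float)                    # rel(k) for tag t
            prec_at_k = np.cumsum(hits) / np.arange(1, K + 1) # c_k / k
            tag_scores.append(np.mean(prec_at_k * hits))      # mean_k P_{i,t}(k)
        if tag_scores:
            per_query.append(np.mean(tag_scores))
    return float(np.mean(per_query))
```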

Compared to recall@K, our MAP@K possesses different properties such as: i) weighting higher ranks of the retrieved results more, and ii) the score is based on tags for individual music tracks rather than the union of tags for multiple tracks. The first property may be preferable as users of similarity-based retrieval systems tend to listen to higher-ranked music tracks. The second property might also be beneficial since the purpose of similarity-based music retrieval is often to find a music track with similar attributes to those of the query music track, rather than finding a set of tracks whose intersection of attributes aligns with those of the query music track.

5.3.2 Auto-tagging.

Music auto-tagging has been extensively studied, and diverse model architectures have been developed [8, 10, 11]. We follow the standard benchmarking and evaluation criteria and report average tag-wise Area Under the Receiver Operating Characteristic Curve (ROC-AUC) and Precision-Recall Area Under the Curve (PR-AUC) scores to measure tag-based retrieval performance.
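
A minimal sketch of the tag-wise evaluation using scikit-learn, assuming `y_true` holds (N, 50) binary labels and `y_score` the (N, 50) predicted probabilities (function name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def tagwise_auc(y_true: np.ndarray, y_score: np.ndarray):
    """Average tag-wise ROC-AUC and PR-AUC (average precision) over all tags."""
    roc = roc_auc_score(y_true, y_score, average="macro")
    pr = average_precision_score(y_true, y_score, average="macro")
    return roc, pr
```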

5.4 Baseline methods

We compare our model with a state-of-the-art model for similarity-based retrieval and auto-tagging [8], referred to as inception in the following. We also compare our model with CLMR [9], a model for auto-tagging that uses SimCLR-based self-supervised learning for pre-training [12].

5.5 Variations of learning techniques

In this section, we discuss three learning techniques “Fine-tune Augment”, “Fine-tune Contrastive”, and “Load Pre-train” that define the variations of our proposed methods and the baseline approaches.

5.5.1 Fine-tune augment.

“Fine-tune augment” involves applying augmentation operations (as detailed in Section 5.2) during the fine-tuning phase. Note that the inception model and CLMR do not utilize this technique.

5.5.2 Fine-tune contrastive.

“Fine-tune contrastive” entails conducting contrastive self-supervised learning, where the loss is given by Eq (15), during the fine-tuning phase. It is noteworthy that neither the inception model nor CLMR employ this technique.

5.5.3 Load pre-train.

“Load pre-train” refers to loading the pre-trained model’s weights at the beginning of the fine-tuning phase. The pre-training is executed using the contrastive self-supervised loss specified by Eq (15). It is pertinent to mention that while CLMR uses this technique, the inception model does not. Moreover, in our proposed methods, we do not freeze the models, even when the pre-trained weights are loaded.

6 Results

In this section, we describe and visualize the experimental results of our proposed model, comparing with baseline methods and variations of our models.

6.1 Supervised: Scenario where tags are always available for music tracks

We begin with the supervised scenario, where tags are always available for music tracks. Table 1 shows the results for the supervised scenario of the MagnaTagATune dataset, where techniques a, c, and p indicate "Fine-tune Augment", "Fine-tune Contrastive", and "Load Pre-train", respectively; these learning techniques characterize the variations, especially of our proposed methods (see Section 5.5). In Table 1, our A, B, …, and I represent variations of our model under different settings. Specifically, they differ in the learning techniques a, c, and p employed and in the value of α explained in Section 5.2.

Our G outperformed the previous methods, inception and CLMR, on both the similarity-based retrieval and auto-tagging tasks. Our A uses the same learning algorithm as inception except for the input representation and network architecture; its results suggest that these changes do not always lead to higher performance. Our B, which adds "technique a: Fine-tune Augment" to our A, slightly improved some metrics and slightly degraded others, although augmentation is usually an effective strategy. Our C, which adds "technique p: Load Pre-train" to our A, improves performance appreciably. "Technique p: Load Pre-train" is the same strategy as CLMR, but our C outperforms it, presumably because ours does not freeze the pre-trained network and thus takes advantage of its expressivity.

Comparing the α values across our D, E, …, and I, the middle value α = 1 (represented by our F and G) exhibited the best performance. We found that conducting self-supervised learning while fine-tuning ("technique c: Fine-tune Contrastive") boosts performance, as in our F and G, especially when no augmentation is performed while fine-tuning (no "technique a: Fine-tune Augment"), as in our G. The trend of enhanced performance in the absence of "technique a: Fine-tune Augment" remained consistent across the other values of α. In similarity-based retrieval, models that perform well on the R@K metric tend to also yield good results on the M@K metric. Our proposed method, G, demonstrates robust performance not only on the benchmark metric R@K but also on the application-oriented metric M@K.

Table 2 shows the results for the supervised scenario of MTG-Jamendo dataset. In Table 2, our J, K, …, and O represent variations of our model under different settings. Specifically, they differ in the learning techniques a, c, and p employed and in the value of α explained in Section 5.2.

Our M was the most effective for similarity-based retrieval and performed comparably to inception in auto-tagging. Comparing the α values across our J, K, …, and O, the middle value α = 0.1 (represented by our L and M) exhibited the best performance. We found that performing no augmentation while fine-tuning (no "technique a: Fine-tune Augment") boosts performance, as in our M. The trend of enhanced performance in the absence of "technique a: Fine-tune Augment" remained consistent across the other values of α. In similarity-based retrieval, models that perform well on the R@K metric also yield good results on the M@K metric. Our proposed method, M, demonstrates robust performance not only on the benchmark metric R@K but also on the application-oriented metric M@K.

Note that our G (in Table 1) and M (in Table 2) use exactly the same methodology (ours with “technique c: Fine-tune Contrastive” and “technique p: Load Pre-train”) except the value of hyper-parameter α and they tend to achieve the highest scores for each dataset. The result shows that, even with different datasets, there is no need to tune anything other than the hyper-parameter α, providing a glimpse of our method’s versatility.

6.2 Semi-supervised: Scenario where tags are not always available for music tracks

We simulate the semi-supervised setting by reducing the rate of tags used. In this section, we use the model that performed best in the previous section: our G for the MagnaTagATune dataset and our M for the MTG-Jamendo dataset. Figs 2–4 show the results for the semi-supervised scenario of the MagnaTagATune dataset. As the amount of labeled data decreases, the performance gap between our model and the baseline tends to widen, suggesting that our method has a larger effect when less labeled data is available. For similarity-based retrieval, the performance of our model degraded only slightly even with a 99% reduction in labeled data (i.e., with only 1% of labeled data).

Fig 2. Similarity-based retrieval R@K results for semi-supervised scenario of MagnaTagATune dataset.

https://doi.org/10.1371/journal.pone.0294643.g002

Fig 3. Similarity-based retrieval M@K results for semi-supervised scenario of MagnaTagATune dataset.

https://doi.org/10.1371/journal.pone.0294643.g003

Fig 4. Auto-tagging results for semi-supervised scenario of MagnaTagATune dataset.

https://doi.org/10.1371/journal.pone.0294643.g004

Figs 5–7 show the results for the semi-supervised scenario of the MTG-Jamendo dataset. As with the MagnaTagATune dataset, the performance gap between our model and the baseline tends to widen as the amount of labeled data decreases, again suggesting that our method has a larger effect when less labeled data is available.

Fig 5. Similarity-based retrieval R@K results for semi-supervised scenario of MTG-Jamendo dataset.

https://doi.org/10.1371/journal.pone.0294643.g005

Fig 6. Similarity-based retrieval M@K results for semi-supervised scenario of MTG-Jamendo dataset.

https://doi.org/10.1371/journal.pone.0294643.g006

Fig 7. Auto-tagging results for semi-supervised scenario of MTG-Jamendo dataset.

https://doi.org/10.1371/journal.pone.0294643.g007

In Figs 8 and 9, we visualize the latent space for similarity-based retrieval in the MagnaTagATune and MTG-Jamendo datasets, where each point in the space is determined by Eq (4). For visualization, we employ t-SNE, with each dot representing a music track. In the MagnaTagATune dataset (Fig 8), green, blue, and yellow dots correspond to music tracks with ‘female vocal’ tags, ‘no vocal’ tags, and other tags, respectively. In the MTG-Jamendo dataset (Fig 9), green, blue, and yellow dots represent music tracks with ‘instrument—voice’ tags, ‘genre—instrumentalpop’ tags, and other tags, respectively. We selected contrasting tags such as ‘female vocal’ versus ‘no vocal’ and ‘instrument—voice’ versus ‘genre—instrumentalpop’ for visualization because these distinctive tags are expected to be separated in the similarity latent space, providing a valuable test for evaluating the quality of the visualized latent space.

Fig 8. T-SNE visualization of similarity latent space for MagnaTagATune dataset.

Green, blue, and yellow dots correspond to music tracks with 'female vocal' tags, 'no vocal' tags, and other tags, respectively. The percentage indicates the reduction in labels used for training.

https://doi.org/10.1371/journal.pone.0294643.g008

Fig 9. T-SNE visualization of similarity latent space for MTG-Jamendo dataset.

Green, blue, and yellow dots correspond to music tracks with 'instrument—voice' tags, 'genre—instrumentalpop' tags, and other tags, respectively. The percentage indicates the reduction in labels used for training.

https://doi.org/10.1371/journal.pone.0294643.g009

The visualization of the latent space demonstrates that when the label reduction reaches 99%, the appearance of the baseline method, inception, changes significantly, while our method (G or M) remains relatively unchanged. Specifically, for the inception baseline with a 99% reduction in labels (Figs 8(g) and 9(g)), music tracks with distinctive tags such as 'female vocal' versus 'no vocal' or 'instrument—voice' versus 'genre—instrumentalpop' are mapped to less separable points, and the overall distribution of latent points no longer appears to be tightly gathered into a single cluster.

7 Conclusion

In this paper, we presented a model that enhances the quality of similarity-based music retrieval and music auto-tagging. We explored the role of self-supervision in metric learning and proposed utilizing self-supervision as auxiliary loss for metric learning. Our model outperforms baseline methods and proves effective when human-provided music tags are limited. The music industry often deals with heterogeneous and extensive music databases characterized by long-tailed attributes. Human-annotated tags may be unavailable, unclean, or inconsistent across different database segments. We expect our approach, which generates learning signals without human annotation, to be effective in such real-world situations.

References

  1. Dídac Surís, Carl Vondrick, Bryan Russell, and Justin Salamon. It's time for artistic correspondence in music and video. In Proceedings of Conference on Computer Vision and Pattern Recognition, CVPR, pages 10554–10564, 2022.
  2. Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse H. Engel, Mauro Verzetti, Antoine Caillon, et al. MusicLM: Generating music from text. ArXiv, abs/2301.11325, 2023.
  3. Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. ArXiv, abs/2005.00341, 2020.
  4. Minz Won, Janne Spijkervet, and Keunwoo Choi. Music Classification: Beyond Supervised Learning, Towards Real-world Applications. https://music-classification.github.io/tutorial, 2021.
  5. Angelo Cesar Mendes da Silva, Diego Furtado Silva, and Ricardo Marcondes Marcacini. Multimodal representation learning over heterogeneous networks for tag-based music retrieval. Expert Systems with Applications, 207, 2022.
  6. Edith Law, Kris West, Michael I. Mandel, Mert Bay, and J. Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In Proceedings of International Society for Music Information Retrieval Conference, ISMIR, pages 387–392, 2009.
  7. Òscar Celma and Pedro Cano. From hits to niches? Or how popular artists can bias music recommendation and discovery. In Proceedings of KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition. Association for Computing Machinery, 2008.
  8. Jongpil Lee, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, and Juhan Nam. Metric learning vs classification for disentangled music representation learning. In Proceedings of International Society for Music Information Retrieval Conference, ISMIR, pages 439–445, 2020.
  9. Janne Spijkervet and John Ashley Burgoyne. Contrastive learning of musical representations. In Proceedings of International Society for Music Information Retrieval Conference, ISMIR, pages 673–681, 2021.
  10. Minz Won, Keunwoo Choi, and Xavier Serra. Semi-supervised music tagging transformer. In Proceedings of International Society for Music Information Retrieval Conference, ISMIR, pages 769–776, 2021.
  11. Minz Won, Andres Ferraro, Dmitry Bogdanov, and Xavier Serra. Evaluation of CNN-based automatic music tagging models. In Proceedings of Sound and Music Computing Conference, SMC, 2020.
  12. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of International Conference on Machine Learning, ICML, pages 1597–1607, 2020.
  13. Carl Thomé, Sebastian Piwell, and Oscar Utterbäck. Musical audio similarity with self-supervised convolutional neural networks. In Proceedings of International Society for Music Information Retrieval Conference, ISMIR—Late Breaking & Demo Papers, 2021.
  14. Pranay Manocha, Zeyu Jin, Richard Zhang, and Adam Finkelstein. CDPAM: Contrastive learning for perceptual audio similarity. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2021.
  15. Jiali Duan, Yen-Liang Lin, Son Dinh Tran, Larry S. Davis, and C.-C. Jay Kuo. SLADE: A self-training framework for distance metric learning. In Proceedings of Conference on Computer Vision and Pattern Recognition, CVPR, pages 9644–9653, 2021.
  16. Zheren Fu, Yan Li, Zhendong Mao, Quan Wang, and Yongdong Zhang. Deep metric learning with self-supervised ranking. In Proceedings of Conference on Artificial Intelligence, AAAI, pages 1370–1378, 2021.
  17. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. ArXiv, abs/1607.06450, 2016.
  18. Andrew Zhai and Hao-Yu Wu. Classification is a strong baseline for deep metric learning. In Proceedings of the British Machine Vision Conference, BMVC, 2019.
  19. Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, et al. SoftTriple loss: Deep metric learning without triplet sampling. In Proceedings of International Conference on Computer Vision, ICCV, pages 6449–6457, 2019.
  20. Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. SampleCNN: End-to-end deep convolutional neural networks using very small filters for music classification. Applied Sciences, 8(1), 2018.
  21. Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. The MTG-Jamendo dataset for automatic music tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning, ICML, 2019.
  22. Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. Quality metrics in recommender systems: Do we calculate metrics consistently? In Proceedings of ACM Conference on Recommender Systems, RecSys, pages 708–713, 2021.
  23. Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In Proceedings of European Conference on Computer Vision, ECCV, pages 681–699, 2020.