Abstract
Multi-Label Text Classification (MLTC) is a crucial task in natural language processing. Compared to single-label text classification, MLTC is more challenging because of its vast label space, which raises the problems of extracting local semantic information, learning label correlations, and handling label data imbalance. This paper proposes a model of Label Attention and Correlation Networks (LACN) to address these challenges and enhance classification performance. The proposed model employs the label attention mechanism to obtain a more discriminative text representation and uses a correlation network based on label distribution to enhance the classification results. In addition, a weight factor based on the number of samples and a modulation function based on prediction probability are combined to alleviate label data imbalance. Extensive experiments are conducted on the widely used conventional datasets AAPD and RCV1-v2 and on the extreme datasets EUR-LEX and AmazonCat-13K. The results indicate that the proposed model can handle extreme multi-label data and achieves optimal or suboptimal results compared with state-of-the-art methods. On the AAPD dataset, it outperforms the second-best method by 2.05% ∼ 5.07% in precision@k and by 2.10% ∼ 3.24% in NDCG@k for k = 1, 3, 5. These results demonstrate the effectiveness of LACN and its competitiveness on MLTC tasks.
Citation: Yuan L, Xu X, Sun P, Yu Hp, Wei YZ, Zhou Jj (2024) Research of multi-label text classification based on label attention and correlation networks. PLoS ONE 19(9): e0311305. https://doi.org/10.1371/journal.pone.0311305
Editor: Tianlin Zhang, University of the Chinese Academy of Sciences, CHINA
Received: July 7, 2024; Accepted: September 12, 2024; Published: September 30, 2024
Copyright: © 2024 Yuan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All the datasets utilized in this paper are open source, which are available at: AAPD: https://github.com/lancopku/SGM RCV1-v2: https://trec.nist.gov/data/reuters/reuters.html EUR-LEX: http://manikvarma.org/downloads/XC/XMLRepository.html.
Funding: This paper is funded by the National Natural Science Foundation of China under Grant No.62272180, Hubei Provincial Teaching and Research Project for Higher Education Institutions (No.2022570), Wuhan Education Science Planning Project (No.2022C151), Wuhan Vocational College of Software and Engineering Research Startup funding project (Grant No.KYQDJF2023004), Wuhan Vocational College of Software and Engineering 2023 Doctor Team Science and Technology Innovation Platform Project (No.BSPT2023001).
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Recently, as the volume of data continues to grow exponentially, it has become crucial to extract meaningful information from vast amounts of data. Text classification technology has emerged as a solution to this challenge, and it is mainly categorized into single-label and multi-label text classifications. Among them, Multi-Label Text Classification (MLTC) is more aligned with real-world requirements as it offers finer granularity and multiple label levels. This enables a comprehensive description of a document from various perspectives. As a vital natural language processing task, MLTC is widely used in topic recognition [1], question-answer systems [2], sentiment analysis [3], text classification [4], search [5] and text summarization [6]. However, MLTC poses challenges with larger text volumes and extensive label sets in the big data era. Therefore, it has become essential to develop effective multi-label classifiers for various applications. To solve the MLTC problem, we investigate a large number of previous studies and summarize them from three primary aspects as follows:
Firstly, we analyze the perspective of MLTC based on known documents and labels. Currently, existing MLTC algorithms utilize traditional machine learning and deep learning. Traditional machine learning methods such as BR [7] and CC [8] simplify MLTC to single-label tasks but encounter difficulties with large label spaces. Deep learning methods, on the other hand, suffer from continuous data and gradient vanishing problems. However, Bi-LSTM [9] and GRU [10] overcome these challenges with gating mechanisms. Recent models have combined CNN and RNN and added attention mechanisms [11] for better text feature extraction. A recent study proposed the hybrid CNN-LSTM [12] model for feature detection and spatial generalization of CNN, which increases efficiency.
Secondly, there are identical subsets between labels, and some researchers have worked to find the correlation between labels accurately. The current algorithms used to determine label correlation in the pair relation can be classified into four categories: one-to-many, tree, sequence generation, and label embedding. One-to-many methods are effective for small label spaces, such as BR [7] and CC [8]. Tree methods, like Attention-XML [13] and Hierarchy-Aware Global [14], can handle complex label relationships but result in error accumulation. Sequence generation methods, such as Seq2Seq [15], capture label correlations but are sensitive to order. Seq2Set [16] addresses this issue with reinforcement learning. Label embedding methods, such as LEAM [17], utilize attention to establish compatibility between text and labels, resulting in joint embeddings.
Thirdly, the imbalance problem of label data is also an excellent research direction. There are several approaches including resampling, classifier adaptation, ensemble, and cost-sensitive methods. Resampling involves both under-sampling, which removes head label samples, and over-sampling, which increases tail label samples. Examples of this method include LP-RUS [18], LP-ROS [18] based on LP [19], and ML-SMOTE [20]. Besides, BBN [21] uses a bilateral branch network for classifier learning. Classifier adaptive methods directly use imbalanced data to train models, and then use machine learning methods such as basing on a min-max modular support vector machine network [22] or increasing the complexity of neural networks to adapt to imbalanced distributions [23]. Ensemble methods integrate multiple models for optimal prediction [24]. Cost-sensitive methods assign varied costs to labels, like SOSHF [25] and DB loss [26], balancing labels through clustering and reweighting.
As research deepens, MLTC encounters several complex challenges. Existing MLTC methods still have problems in extracting local semantic information, learning label correlation, and solving label data imbalance. Firstly, current methods overlook label text information in local semantic extraction. Deep neural networks such as CNN and RNN can obtain complex semantic representations from documents and perform well in single-label text classification. However, with the large label spaces and longer documents of MLTC, methods that consider only the document content fail to capture the differences in focus between labels. Secondly, labels in MLTC tasks are often correlated with each other, and the huge label space makes it difficult to mine label associations. Early MLTC methods ignore label correlation or are limited to small label sets. To capture label correlation, existing algorithms such as classifier chains and the Seq2Seq [15] sequence model rely heavily on the input label order and suffer from overfitting and error accumulation. Thirdly, label imbalance is a pressing issue. Although a large amount of research is devoted to this problem, label co-occurrence and the huge label space in MLTC make traditional resampling and reweighting techniques ineffective and can even reduce the robustness of the model. Table 1 compares the methods proposed in some typical literature of recent years.
To address these problems, this paper proposes a model based on Label Attention and Correlation Networks (LACN) to effectively address the problems of label correlations and data imbalance in MLTC. Firstly, LACN captures the important local features in the document that are relevant to the label using the document attention mechanism. At the same time, the label attention mechanism is utilized to calculate the semantic connection between the document words and the label. After obtaining these two text representations, they are combined via an adaptive fusion mechanism to create a label-specific text representation that enables more accurate classification. Subsequently, the text representation based on specific labels is mapped into an initial vector to represent its classification result using a fully connected layer and an output layer. Then, the initial classification result is enhanced through label distribution-based correlation networks which are implemented by stacking multiple correlation residual blocks. The above steps can alleviate the problem of overfitting head labels caused by label imbalance to a certain extent, but the issue of tail labels being suppressed by head labels persists. To address this problem, we propose a label imbalance loss function that assigns different weights to samples of different labels to make the model pay attention to minority label samples. Additionally, the modulation function is studied based on the difficulty of label classification and selective suppression of negative samples.
In summary, the main contributions of this paper are as follows:
- While feature extraction is performed on the original document, the attention mechanism is used to identify the most relevant text information for a particular label. Thus, the most relevant classification discriminative information is obtained for different labels.
- This paper proposes a correlation residual network that is based on label distribution. In this approach, input text sequences are labeled with the most appropriate subset of labels from the label set. The idea is to leverage relevant knowledge to learn label correlation and thus reduce the probability of label misclassification.
- To address the problem of data and sample imbalance, this paper proposes a weighting factor based on the number of labels and a modulation function based on the classification result to change the cost-sensitive loss function.
The rest of the paper is organized as follows. Section 2 details the framework of the model, including text representation, correlation networks, and label prediction optimization. Section 3 evaluates the model on benchmark datasets, comparing it with existing models and conducting an ablation experiment on its crucial components. Section 4 summarizes the paper, discusses limitations, and outlines future research directions.
2 Methods
This section discusses the proposed LACN method, as shown in the flowchart of Fig 1, which mainly involves three parts: text representation based on document and label attention, correlation networks based on label distribution, and optimization of classification results based on label imbalance. The overall framework of LACN is illustrated in Fig 2.
2.1 Text representation based on document and label attention
The document and label information-based text representation network model mainly includes document and label word embeddings, Bi-LSTM for long-distance feature extraction, document attention mechanism, label attention mechanism, and adaptive text representation fusion mechanism.
2.1.1 Word embedding of document and label.
In a document S = {x1, x2, …, xT} with T words and a label set Y = {y1, y2, …, yl} with l labels, each word or label is initially represented as a binary (one-hot) vector vi = {0, 0, 0, …, 1, …, 0}. Word2vec [27] is used to map the sparse vectors vi to dense vectors in a low-dimensional space, which facilitates word similarity analysis. The learned weight matrix $W^{wrd} \in \mathbb{R}^{d_w \times |V|}$ converts vi into word embedding vectors:
$$e_i = W^{wrd} v_i \tag{1}$$
where $W^{wrd}$ is the learned parameter, V denotes the fixed-size vocabulary, and dw represents the word embedding size. Through training, the document word embedding matrix embs = {e1, e2, …, eT} is obtained from the input document. For the label word embedding matrix label_embl = {c1, c2, …, cl}, the label embedding can directly use the corresponding word vector if the label appears in the dataset text. Otherwise, samples related to the label are randomly selected, and the word vectors in those samples are used as the initial embedding representation.
2.1.2 Long-distance feature extraction based on Bi-LSTM network.
In this paper, Bi-LSTM [9] is used to extract important information from input documents and to convert the input word embedding matrix embs = {e1, e2, …, eT} into the feature matrix H. For the input word embedding vector et at time t, an LSTM with gate signals updates itself to produce the hidden state ht. The gate signal ft of the forget gate, trained with weight Wf, selectively forgets the internal state ct−1 of the previous time step. The gate signal it of the input gate, trained with weight Wi, selectively memorizes the candidate internal state c̃t at the current time step. The gate signal ot of the output gate, trained with weight Wo, decides the feature value ht that is output from the current state ct. Here, • represents matrix multiplication, ⊕ represents matrix addition, ⊗ represents the Hadamard product, and σ represents the sigmoid activation function.
(2)
(3)
(4)
(5)
(6)
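For reference, the gate updates described in this subsection match the standard LSTM cell, which can be written with the operators •, ⊕, ⊗ and σ defined above as in the sketch below. The bias terms $b_f, b_i, b_c, b_o$, the candidate weight $W_c$, and the concatenation $[h_{t-1}, e_t]$ are assumptions of this sketch, and the grouping of lines is not meant to reproduce the numbering of Eqs (2) to (6).

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \bullet [h_{t-1}, e_t] \oplus b_f\right) \\
i_t &= \sigma\left(W_i \bullet [h_{t-1}, e_t] \oplus b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c \bullet [h_{t-1}, e_t] \oplus b_c\right) \\
c_t &= f_t \otimes c_{t-1} \oplus i_t \otimes \tilde{c}_t \\
o_t &= \sigma\left(W_o \bullet [h_{t-1}, e_t] \oplus b_o\right) \\
h_t &= o_t \otimes \tanh(c_t)
\end{aligned}
```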
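The feature ht of each word embedding vector combines the hidden state $\overrightarrow{h_t}$ computed from the left (preceding) sequence context and the hidden state $\overleftarrow{h_t}$ computed from the right (following) sequence context:
$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \tag{7}$$
Concatenating the two parallel, opposite-direction hidden states transforms the input word embeddings into the feature matrix H, which represents document S through the Bi-LSTM [9] network:
$$H = \{h_1, h_2, \ldots, h_T\} \tag{8}$$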
2.1.3 Document attention mechanism.
Feature extraction from a document takes the LSTM hidden state H = {h1, h2, …, hT} as input and computes a weight vector a:
$$a = \mathrm{softmax}\left(w_{s2} \tanh\left(W_{s1} H\right)\right) \tag{9}$$
where $W_{s1} \in \mathbb{R}^{d_a \times 2u}$ and $w_{s2} \in \mathbb{R}^{d_a}$ are the trainable weight matrix and parameter vector, and da is a selectable hyperparameter. Given $H \in \mathbb{R}^{2u \times n}$, this yields the attention vector $a \in \mathbb{R}^{n}$. The softmax normalization ensures that the calculated weights sum to one and further highlights the weight of important information. According to the weight vector a, the LSTM hidden states H = {h1, h2, …, hT} are weighted and averaged to obtain the vector representation m of the document S:
$$m = a H^{T} \tag{10}$$
The vector representation m emphasizes specific parts of the document through the trained weights and parameters. To capture the overall semantics of a document, multiple such vectors are needed, each focusing on a different part, which requires one attention hop per label. Specifically, the trained parameter vector is expanded into a parameter matrix $W_{s2} \in \mathbb{R}^{l \times d_a}$, and the weight vector $a \in \mathbb{R}^{n}$ is expanded into a weight matrix $A_s \in \mathbb{R}^{l \times n}$. The calculation method is as follows:
$$A_s = \mathrm{softmax}\left(W_{s2} \tanh\left(W_{s1} H\right)\right) \tag{11}$$
The j-th row of the weight matrix As, written $A_s^{(j)}$, represents the contribution of the document words to label j. The semantic information in the document that is most relevant to label j can be obtained with $A_s^{(j)}$:
$$M_s^{(j)} = A_s^{(j)} H^{T} \tag{12}$$
The embedding vector $m \in \mathbb{R}^{2u}$ thus expands to the matrix $M_s \in \mathbb{R}^{l \times 2u}$, where $M_s^{(j)}$ is the j-th row of the embedding matrix and represents the optimal text representation for label j. In summary, starting from the LSTM hidden state matrix H = {h1, h2, …, hT}, the weight matrices $W_{s1}$ and $W_{s2}$ yield the weight matrix $A_s \in \mathbb{R}^{l \times n}$ over all labels, and the final document-based representation $M_s \in \mathbb{R}^{l \times 2u}$ is obtained by weighted averaging:
$$M_s = A_s H^{T} \tag{13}$$
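The document attention mechanism of this subsection can be illustrated with a minimal pure-Python sketch. It assumes the form given in lines 6 and 7 of Algorithm 1 (a softmax over tanh-projected hidden states, followed by a weighted average). The function names, the row-per-word layout of `H`, and the toy weights are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply a (p x q) by b (q x r), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def document_attention(H, W_s1, W_s2):
    """Label-wise self-attention over hidden states.

    H    : n x 2u, one row per word (the transpose of the paper's 2u x n layout)
    W_s1 : d_a x 2u projection (assumed shape)
    W_s2 : l x d_a, one row of attention parameters per label (assumed shape)
    Returns (A_s, M_s) with A_s of shape l x n and M_s of shape l x 2u.
    """
    Ht = [list(col) for col in zip(*H)]                                # 2u x n
    hidden = [[math.tanh(v) for v in row] for row in matmul(W_s1, Ht)]  # d_a x n
    scores = matmul(W_s2, hidden)                                      # l x n
    A_s = [softmax(row) for row in scores]                             # each row sums to 1
    M_s = matmul(A_s, H)                                               # l x 2u
    return A_s, M_s
```

With all-zero attention parameters the weights are uniform, so each label representation reduces to the plain average of the hidden states. This is a quick sanity check for the weighted-average step.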
2.1.4 Label attention mechanism.
The input is the label word embedding matrix label_embl = {c1, c2, …, cl}, denoted $C \in \mathbb{R}^{l \times u}$, where l is the number of labels and u is the dimension of the trained embedding matrix. The compatibility of label-word pairs is measured by the dot product:
(14)
where
and
respectively represent the context semantic relationship between words and label texts, where l is the label number and n is the words number. For the text with length 2r + 1 and centered on i, the local matrix block
is used to measure the correlation between label phrase pairs. To improve the effectiveness of the sparse regularization, the maximum pool and ReLU function are used to expand the matrix
to generate the weight matrix Al:
(15)
(16)
(17)
Similar to the document attention mechanism, Al is used as the weight matrix to extract the document-weighted features and obtain the label-based text representation Ml ∈ Rl×2u:
(18)
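The label attention steps can be sketched as follows, loosely following lines 8 to 12 of Algorithm 1: dot-product compatibility, ReLU, max-pooling over a phrase window of radius r, softmax, and a weighted average. This is a simplified reading under stated assumptions. The learnable window weights W and bias b are omitted, label and word vectors are assumed to share one dimension, and all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(C, H, r=1):
    """Label-guided attention over the words of one document.

    C : l x d label embeddings
    H : n x d word features (same dimension d as the labels, an assumption)
    r : radius of the phrase window centered on each word
    Returns (A_l, M_l) with A_l of shape l x n and M_l of shape l x d.
    """
    n = len(H)
    dim = len(H[0])
    # Dot-product compatibility between every label and every word (l x n).
    G = [[sum(c * h for c, h in zip(label, word)) for word in H] for label in C]
    A_l = []
    for row in G:
        relu = [max(0.0, g) for g in row]
        # Max-pool over the window [i - r, i + r], clipped at the sequence edges.
        pooled = [max(relu[max(0, i - r): i + r + 1]) for i in range(n)]
        A_l.append(softmax(pooled))
    # Weighted average of the word features for each label.
    M_l = [[sum(a * word[k] for a, word in zip(weights, H)) for k in range(dim)]
           for weights in A_l]
    return A_l, M_l
```

In this sketch a word that matches a label strongly also raises the attention of its immediate neighbors, which is the intended effect of measuring compatibility at the phrase level.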
2.1.5 Adaptive text representation fusion mechanism.
For the input LSTM hidden state H and the label embedding matrix C, Ms and Ml are derived from the document and label attention mechanisms, respectively. Both are label-related text representations. However, Ms focuses on document content, whereas Ml focuses on label-document semantic links. An adaptive fusion mechanism weights the two representations to obtain a label-specific text representation M that contains the semantics crucial for classification.
Ms and Ml have identical dimensions and the same form as the document feature sequences, so they are fused through nonlinear addition, with each label's row scaled by its own weight:
$$M = \lambda_s \odot M_s + \lambda_l \odot M_l \tag{19}$$
where $\lambda_s \in \mathbb{R}^{l}$ and $\lambda_l \in \mathbb{R}^{l}$ respectively represent the importance of the document-based text representation Ms and the label-based text representation Ml to the text representation finally used for classification, and they are restricted by:
$$\lambda_s^{i} + \lambda_l^{i} = 1 \tag{20}$$
where $\lambda_s^{i}$ and $\lambda_l^{i}$ respectively represent the proportion of the document-based and the label-based text representation in the text representation of label i. The text sequence vector $m_i$ extracted along label i for classification can be expressed as:
$$m_i = \lambda_s^{i} M_s^{(i)} + \lambda_l^{i} M_l^{(i)} \tag{21}$$
The values of the proportions λs and λl are obtained by applying the activation function φ to Ms and Ml, where Ws3 and Wl are the trained weight matrices:
$$\lambda_s = \varphi\left(M_s W_{s3}\right), \quad \lambda_l = \varphi\left(M_l W_l\right) \tag{22}$$
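A minimal sketch of the adaptive fusion step is given below, assuming sigmoid gates that are normalized so the two proportions sum to one for every label, as in lines 13 to 17 of Algorithm 1. The gate parameters `w_s` and `w_l` are hypothetical length-d vectors standing in for the trained weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fusion(M_s, M_l, w_s, w_l):
    """Fuse two l x d label-specific representations row by row.

    For each label, a sigmoid gate scores each representation; the scores
    are normalized so that lam_s + lam_l = 1 before the weighted sum.
    """
    fused = []
    for row_s, row_l in zip(M_s, M_l):
        g_s = sigmoid(sum(a * b for a, b in zip(row_s, w_s)))
        g_l = sigmoid(sum(a * b for a, b in zip(row_l, w_l)))
        lam_s = g_s / (g_s + g_l)
        lam_l = 1.0 - lam_s
        fused.append([lam_s * a + lam_l * b for a, b in zip(row_s, row_l)])
    return fused
```

With zero gate weights both gates equal 0.5, so the fused row is the plain average of the two inputs. Trained gates shift this balance independently for each label.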
The text representation algorithm based on document and label information is presented in Algorithm 1. Firstly, the number of training rounds and counts per processing is determined based on the number of training samples, as described in lines 1 to 2, followed by setting the number of cycles. Word embedding is applied to the input document space X and label space Y to obtain their respective word embedding matrices, as described in lines 3 to 4. Subsequently, long-distance features are extracted from documents using a Bi-LSTM [9] network, as described in line 5. The document attention mechanism is employed to learn the document-based text representation from the content of documents, as described in lines 6 to 7. Furthermore, the label attention mechanism establishes semantic associations between labels and documents for learning the label-based text representation, as described in lines 8 to 12. Finally, an adaptive fusion mechanism allows the independent selection of proportions between these two representations to obtain a final text representation for classification prediction, as described in lines 13 to 17.
Algorithm 1 Text Representation Algorithm based on Document and Label Information
Input: MLTC problem document space X and label space Y, where for multi-label instance {Xi, Yi}, Xi represents document instance {X1, X2, …, XT}, and Yi represents label set {y1, y2, …, yL}.
Output: A label-specific representation M of text with important semantics for classification
1: for _ in [1, …, epoch] do
2: for _ in [1, …, batch] do
3: E = Embed_word(X)
4: C = Embed_Label(Y)
5: H = Bi-LSTM(E)
6: As = softmax(Ws2 tanh(Ws1 HT))
7: S = As HT
8: A = CH
9: Wl = ReLU(Ai−r,i+r W + b)
10: ml = max-pooling(Wl)
11: Al = softmax(ml)
12: L = Al HT
13: M1 = sigmoid(W1 S)
14: M2 = sigmoid(W2 L)
15: M1 = M1/(M1 + M2)
16: M2 = 1 − M1
17: M = M1 S + M2 L
17: M = M1S + M2L
18: end for
19: end for
2.2 Correlation networks based on label distribution
Deep neural networks and the attention mechanism are utilized for document feature extraction. The label distribution-based correlation residual network is introduced to mitigate training costs and network degradation. It annotates text sequences with a relevant label subset and uses related knowledge to discern label correlations, enhancing the classification probability of relevant labels and reducing the irrelevant ones.
The initial classification result vector x is obtained from specific label-based text representation M using a fully connected layer and an output layer. Then, the classification results are enhanced through correlation to obtain the enhanced classification result vector y with label distribution. W ∈ Rl×2u is the fully connected layer’s parameters, w is a l-length vector:
(23)
Each block includes a residual map with a convolution and exponential linear unit, corresponding to the left part as f = F(x), and a direct map, corresponding to the right as f = x. The initial classification result vector x and the enhanced classification result vector y share the same dimensions without extra parameters or complexity. The overall function map can be expressed as:
(24)
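In a residual network [28], the input of layer l + 1 can be expressed through layer l as xl+1 = F(xl, wl) + xl, and the input of layer l can be expressed through layer l − 1 as xl = F(xl−1, wl−1) + xl−1. Therefore, the input of layer l + 1 can be expressed through layers l and l − 1 together, that is, xl+1 = F(xl−1, wl−1) + F(xl, wl) + xl−1. By induction, the input of any deeper layer L can be expressed through a shallower layer l as $x_L = x_l + \sum_{i=l}^{L-1} F(x_i, w_i)$. Since the base map is F(x) + x, the chain rule gives the following derivative, where ε denotes the loss:
$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, w_i)\right) \tag{25}$$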
When a deeper network converges gradually, the presence of 1 in the equation ensures that accuracy doesn’t degrade rapidly with increasing network depth. This allows for continuous parameter updates of nodes, preventing gradient vanishing or explosion. The output classification result vector x in models lacking residual blocks directly trains the loss function. Residual blocks’ skipped connections can address the optimization degradation issue, and the label correlations can be captured by the function F(x).
Both x and y represent classification result vectors of the same meaning and dimension. The most straightforward choice for the function F is a fully connected layer: F(x) = Wσ(x) + b. However, this approach is not practical for large-scale label sets: many labels exhibit no correlation, which leads to correlation parameters of 0 and wasted resources. One layer of residual blocks can be considered for dense labels with a limited number. In theory, increasing the number of residual blocks improves model accuracy. To balance training cost and classification accuracy, we set the function F as two layers, with W1 and W2 as weight matrices, b1 and b2 as bias terms, σ as the sigmoid activation function, and δ as the exponential linear unit (ELU) function. The specific calculation formula is as follows:
$$F(x) = W_2\,\delta\left(W_1\,\sigma(x) + b_1\right) + b_2 \tag{26}$$
The correlation residual block input x can be the initial classification result vector or the enhanced one. Therefore, multiple blocks can be stacked to capture intricate label correlations.
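A correlation residual block of this kind can be sketched in a few lines. The sketch assumes the two-layer form F(x) = W2 δ(W1 σ(x) + b1) + b2 with a skip connection y = x + F(x). The bottleneck shapes (W1 of size k × l, W2 of size l × k) and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def correlation_block(x, W1, b1, W2, b2):
    """One residual block: y = x + W2 * elu(W1 * sigmoid(x) + b1) + b2.

    x  : length-l vector of raw label scores
    W1 : k x l, b1 : length k (bottleneck layer, assumed shape)
    W2 : l x k, b2 : length l (maps back to label space)
    """
    s = [sigmoid(v) for v in x]
    hidden = [elu(v + b) for v, b in zip(matvec(W1, s), b1)]
    fx = [v + b for v, b in zip(matvec(W2, hidden), b2)]
    return [xi + fi for xi, fi in zip(x, fx)]
```

Because the input and output have the same dimension, blocks can be chained by feeding the output of one block into the next. If W2 and b2 are zero the block reduces to the identity, which shows how the skip connection keeps the initial predictions intact until F learns a useful correction.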
2.3 Optimization of classification results based on label imbalance
In large-scale label space of MLTC, label imbalance significantly affects accuracy, necessitating remediation. Consequently, the classification result optimization module addresses this via cost-sensitive strategies, employing multi-label classification loss to mitigate disparities in label and intra-label sample distributions.
2.3.1 Multi-label classification loss function format.
The most direct adaptation from multi-class to multi-label classification is to extend the traditional cross-entropy (CE) loss function. Consider the space D = {(xi, yi) ∣ 1 ≤ i ≤ n} with n samples, where Yi = {yi1, yi2, …, yiL}, yij ∈ {0, 1}, denotes the subset of the label space Y associated with sample i, and zi = (zi1, zi2, …, ziL) is the classifier output. zij ∈ (0, 1) represents the probability that sample i is predicted as label j, and L represents the number of labels. The extended Categorical Cross-Entropy (CCE) loss function, applied directly to multi-label classification, is:
(27)
where θ is the parameter,
represents the label set related to the sample i and pij represents the probability that the label set of sample i contains label j, which is predicted by the model obtained through the softmax activation function:
(28)
The MLTC task aims to convert the prediction h(·) into a scoring function f : x × y → R that ranks relevant labels first. The softmax function splits the training objective into correct and incorrect label probabilities, whereas the sigmoid function focuses solely on the target label, which is more suitable for the ranking problem in MLTC. Thus, the binary cross-entropy (BCE) loss, derived from the sigmoid activation function and the CE loss function, is obtained:
(29)
where σ represents the sigmoid activation function $\sigma(x) = 1/(1 + e^{-x})$, and fij represents the predicted output of the deep neural network. BCE loss, shown to be superior to CCE in MLTC by XML-CNN [29], effectively prioritizes relevant labels. Let Pt(x, y) denote the marginal distribution of label t and L the BCE loss:
(30)
This formulation assumes that the label distributions coincide, Pt(x, y) = Ps(x, y), but label data are usually imbalanced. Large label spaces exacerbate this imbalance, with a disproportionate number of head label samples and a significant disparity between positive and negative samples within labels. Using BCE loss for classification treats head and tail labels, as well as positive and negative samples, identically, which leads to over-suppression by abundant head labels and negative samples. This results in skewed classification boundaries and poor model generalization, particularly affecting the recognition of low-frequency labels. We therefore consider changing the error so that it is associated with the head and tail labels:
(31)
where
,
. Existing label balance loss functions usually focus on how to deal with w(y) for sample number balance, often overlooking the variance of ε(x, y) due to intra-label positive and negative sample discrepancies.
For the dominance of head label samples in imbalanced issues, the weight factor α is introduced to give reasonable attention parameter to rare labels, obtaining the re-weighted Binary Cross-Entropy Loss (R-BCE Loss):
(32)
For the imbalance problem of positive-negative sample disparity, we introduce a modulation function φ(γ) based on the R-BCE loss to down weight “easy-to-classify” negative instances, thereby deriving enhanced loss function for training multi-label text classifiers:
(33)
where n is the sample number, L is the label number, and σ is the sigmoid function. αij is the weight factor used to deal with the imbalance of the sample number between labels. φ is the modulation function, and γ is the parameter used to deal with the imbalance of positive/negative sample numbers within labels.
The optimization module proposes a weight factor αij based on label quantity. To account for label co-occurrence, it takes the ratio of the reciprocal of the sample's sampling probability to the reciprocal of the valid sample number as the weight, which protects tail classifiers from excessive suppression by dominant head samples. The modulation function φ, based on prediction probability, addresses the dominance of negative labels and retains negative gradients for confusable labels. It promotes discriminative learning by focusing on classification difficulty and selectively suppressing negative samples.
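As a point of reference for the loss family discussed in this subsection, the sketch below computes a sigmoid-based BCE loss over n samples and L labels with optional per-entry weights. Averaging over the number of samples and the `alpha` argument are assumptions made for illustration. The sketch does not include the modulation function φ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_bce(logits, targets, alpha=None):
    """Binary cross-entropy summed over labels and averaged over samples.

    logits  : n x L raw network outputs f_ij
    targets : n x L ground truth y_ij in {0, 1}
    alpha   : optional n x L weights; None gives plain BCE
    """
    n = len(logits)
    total = 0.0
    for i, (f_row, y_row) in enumerate(zip(logits, targets)):
        for j, (f, y) in enumerate(zip(f_row, y_row)):
            p = sigmoid(f)
            w = 1.0 if alpha is None else alpha[i][j]
            total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / n
```

Passing a larger weight for a rare label increases its share of the loss, which is the basic effect a re-weighting factor is meant to have.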
2.3.2 Weighting factor based on the number of labels.
This part proposes a label-based weighting factor, considering the label sample’s sampling probability and valid sample number.
For sample sampling probability, the label data set is estimated using probability statistics. It is stipulated that the sample k containing label i is expressed as . A label is randomly selected during sample sampling, and a sample is selected from the label set. According to the calculation model, the expected value of
sampling frequency is:
(34)
However, in the MLTC problem, due to the existence of label co-occurrence, the selection of any sample does not only mean the sampling of label i but means the sampling of the label set I ⊆ Y related to the sample x. For a sample xk, the sampling frequency of the sample xk is the sampling frequency of the relevant label set, that is, the sum of the sampling frequencies of all labels within the label:
(35)
Define a rebalanced weight factor to minimize the difference between the expected and actual sampling times:
(36)
However, when sample k contains both a head label i and a tail label j, the ratio of the expected sampling frequency of the head label to that of the tail label is the reciprocal of the ratio of their sample numbers. Because label frequencies differ vastly in MLTC, if the sample number of label i is a hundred times that of label j, the weight factor of label i will be less than 0.01. In particular, when sample k contains multiple labels, the weight factor can even approach 0, which makes optimization more difficult and is not conducive to model training. To ensure the stability of the optimization process, the weight factor is mapped to a reasonable range by a smoothing function, where α is the overall increase of weight, and β and μ control the range of the mapping function:
(37)
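The rebalancing idea can be illustrated with the sketch below. It assumes that the raw weight of a label on a sample is the ratio between the label-level sampling probability and the instance-level sampling probability, and that the smoothing function is a shifted sigmoid, α + σ(β(r − μ)). This form is used in distribution-balanced losses and is an assumption here, not a transcription of Eq (37). The default values of `alpha`, `beta` and `mu` are hypothetical.

```python
import math


def rebalance_weights(label_counts, sample_labels, alpha=0.1, beta=10.0, mu=0.2):
    """Raw and smoothed rebalancing weights for the labels of one sample.

    label_counts  : label_counts[i] is the number of samples n_i with label i
    sample_labels : indices of the labels attached to this sample
    Returns {label: (r, r_hat)} with raw ratio r and smoothed weight r_hat.
    """
    num_labels = len(label_counts)
    # Instance-level sampling probability: sum over all labels of the sample.
    p_instance = sum(1.0 / label_counts[i] for i in sample_labels) / num_labels
    weights = {}
    for i in sample_labels:
        # Label-level sampling probability: pick label i, then one of its samples.
        p_label = 1.0 / (num_labels * label_counts[i])
        r = p_label / p_instance
        r_hat = alpha + 1.0 / (1.0 + math.exp(-beta * (r - mu)))
        weights[i] = (r, r_hat)
    return weights
```

For a sample carrying a head label with 1,000 samples and a tail label with 10 samples, the raw ratio of the head label falls below 0.01, while its smoothed weight stays above α, which keeps its gradient from vanishing.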
From the perspective of the number of valid samples per label, CB loss [30] associates each sample with a neighborhood to measure whether the data overlap. The valid sample number is calculated in a model-agnostic and loss-agnostic manner, and the balancing weight factor is set inversely proportional to the valid sample number.
The valid number of samples is expressed as En, where n ∈ Z>0 represents the number of samples. It is given by $E_n = (1 - \beta^{n})/(1 - \beta)$, where $\beta = (N - 1)/N$. When β = 0 (N = 1), En = 1, and when β → 1 (N → ∞), En → n. The reciprocal of the label-related valid sample number is used as a weighting factor:
$$\frac{1}{E_{n_i}} = \frac{1 - \beta}{1 - \beta^{n_i}} \tag{38}$$
where ni represents the actual sample number of label i, and $E_{n_i}$ represents the valid sample number of the i-th label.
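The behavior of the valid (effective) sample number at both limits is easy to check numerically. The sketch below assumes the class-balanced form E_n = (1 − β^n)/(1 − β) with 0 ≤ β < 1. The function names are hypothetical.

```python
def effective_number(n, beta):
    """Valid sample number E_n = (1 - beta**n) / (1 - beta), for 0 <= beta < 1."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weights(label_counts, beta):
    """Weight of each label: the reciprocal of its valid sample number."""
    return [1.0 / effective_number(n, beta) for n in label_counts]
```

With β = 0 every label has a valid sample number of 1 (no re-weighting), while β close to 1 approaches inverse-frequency weighting. Intermediate values give tail labels a larger weight without letting it grow in proportion to the raw frequency gap.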
2.3.3 Modulation function based on prediction probability.
This section calculates the modulation function using prediction probability from the perspective of label classification difficulty and selective suppression of negative samples. It also performs discriminative learning by retaining negative gradients for easily confused labels.
Starting from the difficulty of distinguishing classification, the model pays more attention to the hard-to-classify samples in the label space by reducing the weight of easy-to-classify samples within the label during training. Focal loss [31] proposes a modulation function $(1 - p_t)^{\gamma}$ based on prediction probability, where the focusing parameter γ is customized.
For γ ∈ [0, 5], the modulation factor decreases the loss impact for well-classified samples, especially as pt approaches 1. This reduction scales with γ: at γ = 0, it equates to standard cross-entropy loss, while higher values increase the focus on hard-to-classify samples. For instance, with γ = 2, easy-to-classify samples (pt = 0.9) have a loss 100 times lower than standard cross-entropy. In contrast, more challenging samples maintain a controlled loss increase, prioritizing them during training.
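The scaling described above can be verified with a short sketch of the focal modulation factor. The function names are hypothetical, and only the positive-class term is shown.

```python
import math


def modulating_factor(p_t, gamma):
    """Focal-loss modulating factor (1 - p_t) ** gamma."""
    return (1.0 - p_t) ** gamma


def cross_entropy_term(p_t):
    """Standard cross-entropy for a sample whose true-class probability is p_t."""
    return -math.log(p_t)


def focal_term(p_t, gamma):
    """Cross-entropy scaled by the modulating factor."""
    return -modulating_factor(p_t, gamma) * math.log(p_t)
```

For p_t = 0.9 and γ = 2 the factor is 0.01, so the focal term is 100 times smaller than the cross-entropy term. For a hard sample with p_t = 0.1 the factor is 0.81, so its loss is barely reduced.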
For negative samples, BCE loss yields a negative suppression gradient enforcing classifier i to exhibit low confidence, which is beneficial to some extent. However, excessive suppression from the head label notably impedes tail label activation. This leads to tail label classifiers distancing from numerous negative samples to minimize loss, thereby overfitting to a scant count of positive samples in the feature space. The gradient of BCE loss concerning network prediction output is:
(39)
To enhance tail label classifier training, a selection mechanism based on prediction probability is proposed. This employs wij to determine label suppression application. For confusable labels i and j, if pij > ξ, suppression is activated (wij = 1). Otherwise, it’s disregarded (wij = 0) to prevent excessive negative suppression:
(40)
The gradient of the loss relative to the network predicted output can be rewritten as:
(41)
The selection term wij keeps the gradient for positive samples (yij = 1) of label i unchanged. For negative samples (yij = 0) of label i, the suppression gradient is retained only for labels j whose prediction probability pij exceeds the threshold ξ. Easily confused labels thus continue to receive discriminative negative gradients, while labels with a low prediction probability are disregarded, which prevents excessive negative suppression.
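The selection rule can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: outputs are assumed to pass through a sigmoid, the BCE gradient is taken with respect to the logit (p − y), and the names `sigmoid` and `selective_gradients` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def selective_gradients(logits, targets, xi):
    """Per-label gradient of BCE w.r.t. the logit, gated by the selection term.

    Assumption: d(BCE)/d(logit) = p - y for a sigmoid output p.
    w = 1 for positives and for negatives with p > xi, otherwise w = 0.
    """
    grads = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        w = 1.0 if (y == 1 or p > xi) else 0.0
        grads.append(w * (p - y))
    return grads


# One positive label, one confusable negative, one low-scoring negative.
grads = selective_gradients([2.0, 0.5, -3.0], [1, 0, 0], xi=0.5)
print(grads)
```

In this example the positive label keeps its gradient, the confusable negative (p ≈ 0.62 > ξ) is still suppressed, and the low-scoring negative receives zero gradient.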
3 Results
This section analyzes the experimental results on the benchmark datasets, including comparative experiments with existing models and ablation experiments on each component of LACN.
3.1 Dataset
This paper validates LACN on four MLTC datasets with different numbers and sizes of labels:
AAPD (https://github.com/lancopku/SGM) [32] contains 55,840 computer science academic paper abstracts, categorized into 54 topics. AAPD is split into 54,800 training and 1,000 test samples, with a 2,000-sample validation subset. Labels are partitioned into head labels including 19 topics with a sample size greater than 2,200, middle labels including 17 topics with a sample size between 900 and 2,200, and tail labels including 18 topics with a sample size less than 900.
RCV1-v2 (https://trec.nist.gov/data/reuters/reuters.html) [33] comprises 804,414 news articles, classified into 103 topics. They are divided into 23,149 training and 781,265 test texts, with a 1,000-text validation subset. Labels are segmented into head labels including 18 topics with a sample size greater than 348, middle labels including 20 topics with a sample size between 79 and 348, and tail labels including 16 topics with a sample size less than 79.
EUR-LEX (http://manikvarma.org/downloads/XC/XMLRepository.html) [34] includes 3,956 EU legal topics, with 11,585 training and 3,865 test texts.
AmazonCat-13K (http://manikvarma.org/downloads/XC/XMLRepository.html) [35] features 1,186,239 training and 306,782 test samples from Amazon.com, categorized into 13,330 labels.
AAPD and RCV1-v2 are conventional datasets with fewer labels, while EUR-LEX and AmazonCat-13K are extreme datasets with more labels. Table 2 details the number of training samples (NtrS), the number of testing samples (NteS), the total number of words (TNW), the total number of labels (TNL), the average number of labels per sample (ANLS), the average number of samples per label (ANSL), the average number of words per training set (ANWTrS) and the average number of words per testing set (ANWTeS).
3.2 Testing environment
The model presented in this paper is implemented in Python and executed on Ubuntu 18.04.4 LTS. Refer to Table 3 for the detailed experimental environment:
3.3 Parameter configuration
For each dataset, documents are truncated after 500 words, and shorter documents are padded with zeros at the end to reach this length. Using the word2vec [27], document and label word embedding matrices are trained with an embedding space size of 300 and fine-tuned during training. The training batch size is 64 for AAPD and EUR-LEX and 256 for RCV1-v2 and AmazonCat-13K. Training uses the Adam optimizer with a learning rate of 0.001 and a maximum of 100 epochs. Micro-F1 on the validation set is monitored, and training is stopped when it fails to increase by more than 0.5% for 5 consecutive epochs.
To mitigate overfitting, the dropout rates after the embedding and Bi-LSTM layers are set to 0.2 and 0.5, respectively. In the document-label attention mechanism, the weight matrices Ws1 and Ws2 use da = 200. For AmazonCat-13K, a 512-dimensional bottleneck layer precedes the output layer to enhance efficiency and reduce the computing load. For the loss functions that address label imbalance, γ = 2 is used for FL and β = 0.9 is used for CB, the selection parameter w is chosen according to the F1 value obtained in experiments, and the other parameters are computed from existing data. During training, document and label data are input. After training, the model is evaluated on the validation set, and the performance of the current model is saved. The test phase uses the model that performed best in the validation phase to run on the test set.
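The stopping rule can be sketched as follows. This is a minimal illustration rather than the authors' training loop: it assumes the 0.5% threshold is an absolute gain of 0.005 on a Micro-F1 value in [0, 1], and the function name `early_stopping_epoch` and the score list are hypothetical.

```python
def early_stopping_epoch(val_scores, min_delta=0.005, patience=5, max_epochs=100):
    """Return (stop_epoch, best_score) for a sequence of validation Micro-F1 values.

    Assumption: an epoch counts as an improvement only if it beats the best
    score so far by more than min_delta (0.5% read as an absolute 0.005).
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best
    return min(len(val_scores), max_epochs), best


# Hypothetical validation curve: improvement stalls after the third epoch.
scores = [0.60, 0.65, 0.68, 0.681, 0.682, 0.679, 0.683, 0.684, 0.70]
print(early_stopping_epoch(scores))
```

Here training halts at epoch 8, after five consecutive epochs without a gain above the threshold, so the later jump at epoch 9 is never reached.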
3.4 Evaluation metrics
The precision and recall of class i label can be expressed as:
\[ \mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{42} \]
Here, TPi is the number of true positives, i.e., samples with label i that are predicted to have label i; FPi is the number of false positives, i.e., samples without label i that are predicted to have label i; and FNi is the number of false negatives, i.e., samples with label i that are predicted not to have label i.
Micro-F1 and Macro-F1 are usually used to evaluate the experimental results. Micro-F1 first calculates Precision_micro and Recall_micro over all labels and then applies the F1 formula, reflecting the overall precision and recall across all labels:
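\[ \mathrm{Precision}_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad \mathrm{Recall}_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)} \tag{43} \]
\[ \text{Micro-F1} = \frac{2 \cdot \mathrm{Precision}_{micro} \cdot \mathrm{Recall}_{micro}}{\mathrm{Precision}_{micro} + \mathrm{Recall}_{micro}} \tag{44} \]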
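Macro-F1 first averages the per-label precision and recall over the q labels to obtain Precision_macro and Recall_macro, and then applies the F1 formula to these averages:
\[ \mathrm{Precision}_{macro} = \frac{1}{q} \sum_{i=1}^{q} \mathrm{Precision}_i, \qquad \mathrm{Recall}_{macro} = \frac{1}{q} \sum_{i=1}^{q} \mathrm{Recall}_i \tag{45} \]
\[ \text{Macro-F1} = \frac{2 \cdot \mathrm{Precision}_{macro} \cdot \mathrm{Recall}_{macro}}{\mathrm{Precision}_{macro} + \mathrm{Recall}_{macro}} \tag{46} \]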
As the equations above show, Micro aggregates the contributions of all labels before averaging and therefore gives more weight to frequent labels, while Macro evaluates each label independently and gives all labels the same weight. Macro therefore pays more attention to labels with fewer samples than Micro does. For the MLTC problem with imbalanced labels, the Micro-F1 value is selected as the evaluation index in this paper.
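The difference between the two averages is easy to see on a toy example. The sketch below follows the description in this section (Macro-F1 computed from the averaged per-label precision and recall); note that some libraries instead average per-label F1 values. The function names and the toy data are illustrative, and undefined ratios (0/0) are treated as 0 by assumption.

```python
def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the ratio is undefined (assumption)."""
    return num / den if den else 0.0


def f1(precision: float, recall: float) -> float:
    return safe_div(2 * precision * recall, precision + recall)


def micro_macro_f1(y_true, y_pred):
    """y_true and y_pred are lists of binary label vectors of equal length."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for i, (t, p) in enumerate(zip(t_row, p_row)):
            if t == 1 and p == 1:
                tp[i] += 1
            elif t == 0 and p == 1:
                fp[i] += 1
            elif t == 1 and p == 0:
                fn[i] += 1
    # Micro: pool the counts of all labels, then compute precision and recall.
    micro_p = safe_div(sum(tp), sum(tp) + sum(fp))
    micro_r = safe_div(sum(tp), sum(tp) + sum(fn))
    # Macro: compute precision and recall per label, then average them.
    macro_p = sum(safe_div(tp[i], tp[i] + fp[i]) for i in range(n_labels)) / n_labels
    macro_r = sum(safe_div(tp[i], tp[i] + fn[i]) for i in range(n_labels)) / n_labels
    return f1(micro_p, micro_r), f1(macro_p, macro_r)


# Label 0 is frequent and always predicted correctly; label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"Micro-F1={micro:.4f}, Macro-F1={macro:.4f}")
```

In this toy case the single missed rare label lowers Macro-F1 to 0.5 while Micro-F1 stays near 0.89, because the pooled counts are dominated by the frequent label.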
In addition to measuring the accuracy and recall of labels, in particular, for large-scale labeled datasets, considering the sparsity of labels, a short list of potentially relevant labels for each test sample is used to represent the classification quality. Two sample-based ranking criteria precision@k and nDCG@k are used to evaluate the model, where is the correct prediction value of the top K labels, and ‖z‖ is the number of correctly predicted labels, with k = 3 and 5 being the most representative.
precision@k represents the fraction of correct labels among the k top-ranked labels:
(47)
nDCG@k represents the normalized discounted cumulative gain over the k top-ranked labels, averaged over the test samples:
(48)
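For readers who want to reproduce the ranking metrics, the following is a minimal per-sample sketch using the definitions commonly adopted in extreme multi-label classification; it is an assumption that the paper's equations take this form. The function names are illustrative, and the logarithm base cancels in the nDCG ratio.

```python
import math


def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def precision_at_k(y_true, scores, k):
    """Fraction of the top-k predicted labels that are relevant."""
    return sum(y_true[i] for i in top_k(scores, k)) / k


def ndcg_at_k(y_true, scores, k):
    """DCG of the top-k predictions divided by the best achievable DCG."""
    dcg = sum(y_true[i] / math.log2(rank + 2) for rank, i in enumerate(top_k(scores, k)))
    n_relevant = min(k, sum(y_true))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(n_relevant))
    return dcg / ideal if ideal else 0.0


# One test sample with five labels, two of which are relevant.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
for k in (1, 3, 5):
    print(k, precision_at_k(y_true, scores, k), round(ndcg_at_k(y_true, scores, k), 4))
```

Dataset-level values are obtained by averaging these per-sample scores over all test samples.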
3.5 Analysis of model comparison
This paper uses the following methods as baselines:
- SGM [32]: Utilizes a sequence generation model for the MLTC task, predicting labels while considering label correlation and vital feature selection.
- Seq2Set [16]: Employs reinforcement learning with rewards to handle label sequences’ high-order correlations and commutative invariance.
- LEAM [17]: Approaches text classification via label-word joint embedding and attention-based compatibility measurement.
- ML Reasoner [36]: Implements binary classification with iterative reasoning to manage label correlations without order sensitivity.
- XML-CNN [29]: Adapts deep learning with new CNNs and dynamic pooling for advanced feature extraction in extreme multi-label text classification.
- Attention-XML [13]: Uses multi-label attention and a probability label tree to handle text relevance and scalability in a large label space.
- BBN [21]: Introduces a bilateral branching network for integrated representation and classifier learning in long-tail tasks.
- HTTN [37]: Enhances tail label recognition by transferring meta-knowledge from data-rich to data-poor labels.
Table 4 compares the Precision, Recall, and F1 values of SGM, Seq2Set, LEAM, ML-R, and the LACN model proposed in this paper on the AAPD and RCV1-v2 datasets. The best results are highlighted in bold. The LEAM model, which focuses solely on document content, underperforms because it disregards label correlations. In contrast, SGM and Seq2Set are sequence generation models based on Seq2Seq and capture label correlations better. Seq2Set's reward feedback mechanism further mitigates label order sensitivity and yields better results. The ML-R model, which addresses multi-label text classification by reasoning over label correlations, achieves higher recall. The LACN model proposed in this paper considers document content, label correlation, and label imbalance, and surpasses the others in Micro-F1.
Table 5 contrasts the P@k and NDCG@k values of XML-CNN, Attention-XML, BBN, HTTN, and the LACN model proposed in this paper on the AAPD, RCV1-v2, EUR-LEX, and AmazonCat-13K datasets. The best results are highlighted in bold. XML-CNN excels at handling large-scale data but overlooks label correlation. Attention-XML, which integrates Bi-LSTM and attention mechanisms, excels in text representation and correlation mining and particularly benefits from predefined label hierarchies in datasets like EUR-LEX. Conversely, in extensive label datasets like AmazonCat-13K, its performance is overshadowed by models with ample training samples per label. BBN and HTTN, which focus on the long-tail distribution in MLTC, perform better with larger k values but neglect label text and inter-label correlations. In contrast, LACN uses the label attention mechanism to obtain a label-specific text representation. Even in the extreme label dataset EUR-LEX, where each label has few samples and there are not enough relevant samples to support training, the existing samples can be mined for relevant semantics to obtain a representation better suited to text classification. LACN also uses the distribution-based correlation network to introduce label correlation and optimizes for label imbalance. Even in the large-scale label dataset AmazonCat-13K, where there are many label categories and complex correlations among labels, it achieves better classification results.
3.6 Analysis of ablation experiment comparison
This section performs ablation experiments to validate the effect and optimizations of the text representation, label distribution, and classification results in the proposed LACN model.
3.6.1 Comparison of text representation.
This paper employs the Bi-LSTM network, a fully connected layer, and BCE loss as the base model to compare the roles of the document-based and label-based text representations. On top of this model, four variants are tested on the AAPD, RCV1-v2, EUR-LEX, and AmazonCat-13K datasets: the document attention mechanism (W), the label attention mechanism (B), the direct average of the text-based and label-based representations (W+B), and the representation obtained by fusing the text-based and label-based representations through the adaptive fusion mechanism (R+W+B). The classification results are then evaluated with P@1, P@3, and P@5.
Fig 3 details the performance of W, B, W+B, R+W+B, and the LACN model proposed in this paper in terms of P@1, P@3, and P@5 on the AAPD, RCV1-v2, EUR-LEX, and AmazonCat-13K datasets. In the feature representation of the text content, W tends to remove redundant text information and find the important content related to the label but does not consider the association between the document and the label text. B focuses on the semantic relationship between the document and the label content, learning the label text to explicitly mine the document semantics most related to the label. Still, some differences between labels cannot be distinguished from the label text alone. By combining the two representations, the discriminative information most relevant to the corresponding label can be extracted from each document, and the adaptive fusion mechanism enables the model to select, during training, the text representation that is most beneficial to the final prediction. The classification results show that labels in AAPD, RCV1-v2, and AmazonCat-13K are relatively dense, and each label has enough samples for training, so more can be learned from the text content. In contrast, in the extreme label dataset EUR-LEX, the number of labels is large and the number of samples is small. For tail labels, there are not enough relevant samples to support training, so mining the relevant semantics of the existing samples from the perspective of text content is of great help.
3.6.2 Comparison of label co-occurrence.
To compare the role of correlation networks based on label distributions, on the base model, single text representation (A), single relevance network (C), the combination of text representation and relevance network (A+C), and the LACN model proposed in this paper are used and trained on the AAPD, RCV1-v2, EUR-LEX, AmazonCat-13K datasets.
The experimental classification results of P@1, P@3, and P@5 are compared in Fig 4. The results indicate that A neglects label associations, leading to performance degradation due to the superposition of the network. C enhances the original classification results using relevant knowledge, but the complex information in documents and labels is not fully utilized. A+C effectively extracts document semantics related to the label, alleviates network degradation, and introduces label distribution, significantly improving model performance. When the number of samples is sufficient, model performance can be significantly improved by learning from document and label contents, while introducing label correlation through the correlation network alone is slightly inadequate. However, performance can be further improved by using the correlation network mapping to enhance the original label prediction after text representation. In the large-scale label datasets EUR-LEX and AmazonCat-13K, the large number of label categories produces complex correlations between labels, so using the correlation network for label correlation mining achieves a large improvement. In particular, in the EUR-LEX dataset, with numerous label categories and few samples, A lacks sufficient document content for effective learning, resulting in poor performance. In contrast, C can extract information from the labels and achieve better results.
Considering training cost and classification accuracy, this paper uses two residual blocks in the correlation networks. In theory, more blocks improve accuracy, but in practice the training cost grows with the number of blocks and the effect is not necessarily better. Fig 5 compares the experimental classification results of P@1, P@3, P@5, and F1 values using different numbers of residual blocks for training on the AAPD and EUR-LEX datasets. On the AAPD dataset, the optimal results occur with one block and deteriorate as blocks are added. Conversely, the EUR-LEX dataset shows improved performance with more blocks. Label feature extraction based on text representation benefits training on dense-label datasets like AAPD, where relying too heavily on label distribution hampers model performance. Sparse-label datasets like EUR-LEX lack sufficient samples per label for effective training, so more information can be obtained from the label distribution.
3.6.3 Comparison of classification results.
The label distribution of the four datasets analyzed in this paper is unbalanced and shows the long-tail phenomenon: a small number of labels cover a large number of documents, while most labels have only a small number of documents associated with them. Overall, the long-tail phenomenon is most obvious in the RCV1-v2 dataset, where most labels contain less than 1/3 of the maximum sample number and most samples appear in the top 2/5 of the label category groups. Training the last 3/5 of the label category groups is therefore more difficult than training the other categories because of the lack of training samples. The same long-tail distribution also exists in the EUR-LEX and AmazonCat-13K datasets. This section mainly analyzes the numerical imbalance in the AAPD, RCV1-v2, EUR-LEX, and AmazonCat-13K datasets.
Label imbalance is assessed through IRLbl(λ) and MeanIR. IRLbl(λ) denotes the ratio of the number of samples of the most frequent label to the number of samples of label λ, indicating label sparsity. A higher IRLbl(λ) value signifies a rarer label and a larger discrepancy with the most frequently occurring label. The function h(λ, Yi) indicates whether label λ belongs to the label set Yi of sample i, and the minimum of IRLbl(λ) is 1:
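\[ IRLbl(\lambda) = \frac{\max_{\lambda'} \sum_i h(\lambda', Y_i)}{\sum_i h(\lambda, Y_i)}, \qquad h(\lambda, Y_i) = \begin{cases} 1, & \lambda \in Y_i \\ 0, & \lambda \notin Y_i \end{cases} \tag{49} \]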
The average value MeanIR of all the label imbalance ratios in the label space measures the average imbalance of the whole label set. The larger MeanIR is, the greater the difference in occurrence counts between the labels in the dataset and the higher the degree of imbalance between the labels. Here, q is the label space size:
\[ MeanIR = \frac{1}{q} \sum_{\lambda} IRLbl(\lambda) \tag{50} \]
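The two imbalance measures can be computed directly from the label sets of a dataset. The sketch below follows the verbal definitions given above and assumes every label occurs at least once; the function names and the toy label sets are illustrative.

```python
def label_counts(label_sets, labels):
    """Number of samples whose label set contains each label (sum of h)."""
    return {lab: sum(1 for ys in label_sets if lab in ys) for lab in labels}


def irlbl(label_sets, labels):
    """Imbalance ratio per label: count of the most frequent label / own count.

    Assumption: every label in `labels` appears in at least one sample.
    """
    counts = label_counts(label_sets, labels)
    max_count = max(counts.values())
    return {lab: max_count / counts[lab] for lab in labels}


def mean_ir(label_sets, labels):
    """Average of IRLbl over the label space."""
    ratios = irlbl(label_sets, labels)
    return sum(ratios.values()) / len(ratios)


# Toy dataset: "a" occurs 4 times, "b" twice, "c" once.
label_sets = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
labels = ["a", "b", "c"]
print(irlbl(label_sets, labels))
print(mean_ir(label_sets, labels))
```

The most frequent label has a ratio of exactly 1, and rarer labels receive proportionally larger values.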
Fig 6 is the label imbalance ratio graph of the AAPD and RCV1-v2 datasets. The label imbalance ratio graph sorts the label categories by the number of label-related samples and calculates the label imbalance ratio of each category respectively. As can be seen from the graph, the label with more label-related samples has a lower label imbalance ratio. In particular, as the number of labels increases, the corresponding number of tail labels increases, while the label imbalance ratio of tail labels is very low or even tends to 0, resulting in a decrease in the average imbalance ratio of the overall dataset. The same phenomenon also exists on the EUR-LEX and AmazonCat-13K datasets, but the number of label categories is too large to be displayed as a bar chart.
Table 6 numerically presents average imbalance ratios for the AAPD, RCV1-v2, EUR-LEX, and AmazonCat-13K datasets. The AAPD dataset is relatively balanced. In contrast, the EUR-LEX and RCV1-v2 datasets are imbalanced at larger and smaller label spaces, respectively. The AmazonCat-13K dataset is imbalanced, with a huge label space. Larger label spaces tend to increase imbalance ratios due to limited sample size.
Label imbalance in the four datasets is addressed using multi-label classification loss functions. To compare the role of each imbalance parameter, four separate terms are added to the base model: two label number-based weight factors, computed from the sampling probability (R) and from the label's valid sample number (C), and two modulation functions based on the prediction probability, computed from the label classification difficulty (F) and from the selective suppression of negative samples (Y). Combined loss functions (R+F, R+Y, C+F, C+Y) pair a weight factor with a modulation function in the format of Eq 33. All variants are trained on the AAPD, RCV1-v2, EUR-LEX, and AmazonCat-13K datasets. Performance is assessed by comparing P@1, P@3, P@5, F1 values, and the number of training epochs. R is obtained from the number of samples related to each label, C uses β = 0.9, F uses γ = 2, and Y uses the best value found by testing different settings.
Fig 7 shows the experimental results on AAPD and RCV1-v2 in terms of P@1, P@3, P@5, F1 value, and number of epochs. The results demonstrate that loss functions targeting label imbalance can improve model performance despite increasing training time. In the AAPD dataset, due to the dense data and relatively low degree of imbalance, R, which is based simply on sampling probability, cannot optimize the model well. The other loss functions yield varying improvements. Experimental results show that the optimal combination of loss functions for AAPD is C+F, which balances a high F1 value against training time. In the RCV1-v2 dataset, the various improved loss functions can all optimize model performance to a certain extent due to the high label imbalance. For RCV1-v2, R+Y is optimal when training time is not considered, and C+F is optimal when both training time and classification performance are considered. The optimal loss function may differ between datasets and is determined experimentally. However, R+F is less recommended because it is too time-consuming. Y excels at negative sample suppression. In summary, using C, F, or Y alone, or the combinations R+Y and C+F, achieves better classification results in terms of both performance and time cost.
Fig 8 illustrates the F1 value curve for different values of the selection parameter w on the AAPD and RCV1-v2 datasets. It is observed that for w ∈ {0.2, 0.3, 0.4}, the model minimally suppresses negative samples, resulting in a lack of selective inhibition, information loss, and decreased model performance. As w transitions to {0.5, 0.6, 0.7}, it effectively selects inhibitory labels, enhancing the training of the tail label classifier. However, for w = 0.8, the model disregards all negative samples, leading to overfitting on positive samples and a significant decline in training efficacy. Consequently, the selection parameter for model training is chosen from w ∈ {0.5, 0.6, 0.7}, and the value that gives the best F1 in the experiments is used.
4 Limitations
In practical applications, because label attention and correlation networks usually need to compute and update a large number of labels, the training time of the model is long, which limits the efficiency of the model. The assumption of the correlation between labels is simple. However, in the real world, the correlation between labels may be more complex, and the correlation network based solely on the distribution of labels may not capture the complex relationship between labels well.
5 Conclusions and future work
This paper presents a new framework, LACN, to solve the multi-label text classification problem. The main contributions are as follows. Firstly, LACN utilizes the label attention mechanism to obtain the most relevant classification information for the texts. In addition, a correlation residual network is introduced to learn the label co-occurrence and improve the accuracy of label prediction. Finally, we propose using a weighting factor and a modulation function to adjust the cost-sensitive loss function, thereby tackling the problem of the imbalanced number of samples for each label in MLTC.
The experiments use four multi-label text datasets. The comparative experimental results show that LACN can achieve optimal or suboptimal results versus state-of-the-art methods. On the AAPD dataset, it outperforms the second-best method by 2.05% ∼ 5.07% in precision@k and by 2.10% ∼ 3.24% in NDCG@k for k = 1, 3, 5. The ablation experiment verifies that all components of LACN are essential for its success.
Future work can be carried out from the following three aspects. Firstly, pre-trained language models like BERT can be used to process text to enhance text representation capabilities. Secondly, the correlation network based on label distribution can be extended by incorporating a graph neural network so that it not only makes use of label distribution but also handles increasingly complex structural relationships in the real world, such as inclusion and overlap. Thirdly, developing flexible loss functions can help to adapt to different training sets without requiring additional parameter statistics.
References
- 1. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical Attention Networks for Document Classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016; p. 1480–1489.
- 2. Kumar A, Irsoy O, Ondruska P, Iyyer M, Bradbury J, Gulrajani I, et al. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In: Balcan MF, Weinberger KQ, editors. Proceedings of The 33rd International Conference on Machine Learning. vol. 48 of Proceedings of Machine Learning Research. New York, New York, USA: PMLR; 2016. p. 1378–1387. Available from: https://proceedings.mlr.press/v48/kumar16.html.
- 3. Xiang Y, Zheng J. Multi-Label Emotion Classification for Imbalanced Chinese Corpus Based on CNN. 2018 11th International Conference on Intelligent Computation Technology and Automation (ICICTA). 2018; p. 38–43.
- 4. Wang H, Zhao J. Capsule Network Based on Multi-granularity Attention Model for Text Classification. 2022 IEEE Smartworld, Ubiquitous Intelligence Computing, Scalable Computing Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous Trusted Vehicles. 2022; p. 1523–1529.
- 5. Prabhu Y, Kag A, Harsola S, Agrawal R, Varma M. Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising. Proceedings of the 2018 World Wide Web Conference. 2018;(10):993–1002.
- 6. Mahalleh ER, Gharehchopogh FS. An automatic text summarization based on valuable sentences selection. International Journal of Information Technology. 2022;14:2963–2969.
- 7. Boutell MR, Luo J, Shen X, Brown CM. Learning multi-label scene classification. Pattern Recognition. 2004;37(9):1757–1771.
- 8. Read J, Pfahringer B, Holmes G, Frank E. Classifier Chains for Multi-label Classification. Machine Learning and Knowledge Discovery in Databases,Lecture Notes in Computer Science. 2009; p. 254–269.
- 9. Zhou P, Shi W, Tian J, Qi Z, Li B, Hao H, et al. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016; p. 207–212.
- 10. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014; p. 1724–1734.
- 11. Xiao L, Huang X, Chen B, Jing L. Label-Specific Document Representation for Multi-Label Text Classification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019; p. 466–475.
- 12. Maragheh HK, Gharehchopogh FS, Majidzadeh K, Sangar AB. A Hybrid Model Based on Convolutional Neural Network and Long Short-Term Memory for Multi-label Text Classification. Neural Processing Letters. 2024;56:42.
- 13. You R, Zhang Z, Wang Z, Dai S, Mamitsuka H, Zhu S. AttentionXML: label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In: Proc. NeurIPS; 2019. p. 5812–5822. Available from: https://github.com/yourh/AttentionXML.
- 14. Zhou J, Ma C, Long D, Xu G, Ding N, Zhang H, et al. Hierarchy-Aware Global Model for Hierarchical Text Classification. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020; p. 1106–1117.
- 15. Sutskever I, Vinyals O, Le Q. Sequence to Sequence Learning with Neural Networks. 2014.
- 16. Yang P, Luo F, Ma S, Lin J, Sun X. A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019; p. 5252–5258.
- 17. Sheng Y, Takashi I. Joint Embedding of Words and Labels for Sentiment Classification. 2020 International Conference on Asian Language Processing (IALP). 2020; p. 264–269.
- 18. Charte F, Rivera A, del Jesus MJ, Herrera F. A First Approach to Deal with Imbalance in Multi-label Datasets. Hybrid Artificial Intelligent Systems. 2013; p. 150–160.
- 19. Lo HY, Lin SD, Wang HM. Generalized k-labelset ensemble for multi-label classification. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2012; p. 2061–2064.
- 20. Charte F, Rivera AJ, del Jesus MJ, Herrera F. MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems. 2015;89:385–397.
- 21. Zhou B, Cui Q, Wei XS, Chen ZM. BBN: Bilateral-Branch Network With Cumulative Learning for Long-Tailed Visual Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020; p. 9716–9725.
- 22. Chen K, Lu BL, Kwok JT. Efficient Classification of Multi-label and Imbalanced Data using Min-Max Modular Classifiers. The 2006 IEEE International Joint Conference on Neural Network Proceedings. 2006; p. 1770–1775.
- 23. Tepvorachai G, Papachristou C. Multi-label imbalanced data enrichment process in neural net classifier training. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008; p. 1301–1307.
- 24. Tahir MA, Kittler J, Bouridane A. Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recognition Letters. 2012;33(5):513–523.
- 25. Daniels Z, Metaxas D. Addressing Imbalance in Multi-Label Classification Using Structured Hellinger Forests. Proceedings of the AAAI Conference on Artificial Intelligence. 2022;31(1).
- 26. Wu T, Huang Q, Liu Z, Wang Y, Lin D. Distribution-Balanced Loss for Multi-label Classification in Long-Tailed Datasets. Computer Vision—ECCV 2020. 2020; p. 162–178.
- 27. Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014; p. 1532–1543.
- 28. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016; p. 770–778.
- 29. Liu J, Chang WC, Wu Y, Yang Y. Deep Learning for Extreme Multi-label Text Classification. 2017;(10):115–124.
- 30. Cui Y, Jia M, Lin TY, Song Y, Belongie S. Class-Balanced Loss Based on Effective Number of Samples. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019; p. 9260–9269.
- 31. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object Detection. 2017 IEEE International Conference on Computer Vision (ICCV). 2017; p. 2999–3007.
- 32. Yang P, Sun X, Li W, Ma S, Wu W, Wang H. SGM: Sequence Generation Model for Multi-label Classification. 2018.
- 33. Lewis DD, Yang Y, Rose TG, Li F. RCV1: A New Benchmark Collection for Text Categorization Research. J Mach Learn Res. 2004;5(37):361–397.
- 34. Loza Mencía E, Fürnkranz J. Efficient Pairwise Multilabel Classification for Large-Scale Problems in the Legal Domain. Machine Learning and Knowledge Discovery in Databases. 2008; p. 50–65.
- 35. McAuley J, Leskovec J. Hidden factors and hidden topics: understanding rating dimensions with review text. Proceedings of the 7th ACM Conference on Recommender Systems. 2013;(8):165–172.
- 36. Wang R, Ridley R, Su X, Qu W, Dai X. A novel reasoning mechanism for multi-label text classification. Information Processing Management. 2021;58(2):102441.
- 37. Xiao L, Zhang X, Jing L, Huang C, Song M. Does Head Label Help for Long-Tailed Multi-Label Text Classification. Proceedings of the AAAI Conference on Artificial Intelligence. 2022; p. 14103–14111.