Abstract
Large datasets with noisy labels and high dimensions have become increasingly prevalent in industry. These datasets often contain errors or inconsistencies in the assigned labels and introduce a vast number of predictive variables. Such issues frequently arise in real-world scenarios due to uncertainties or human errors during data collection and annotation. The presence of noisy labels and high dimensions can significantly impair the generalization ability and accuracy of trained models. To address these issues, we introduce a simple-structured penalized γ-divergence model and a novel meta gradient correction algorithm, and we ground both modules in rigorous theoretical proofs. Finally, comprehensive experiments validate their effectiveness in detecting noisy labels and mitigating the curse of dimensionality and suggest that our proposed model and algorithm achieve promising outcomes. Moreover, we open-source our code and distinctive datasets on GitHub (refer to https://github.com/DebtVC2022/Robust_Learning_with_MGC).
Citation: Liu B, Lin Y (2023) Robust meta gradient learning for high-dimensional data with noisy-label ignorance. PLoS ONE 18(12): e0295678. https://doi.org/10.1371/journal.pone.0295678
Editor: Don Sasikumar, Amrita Vishwa Vidyapeetham, INDIA
Received: September 19, 2023; Accepted: November 28, 2023; Published: December 11, 2023
Copyright: © 2023 Liu, Lin. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The Breast Cancer data are available from the UCI Machine Learning Repository and can be found at the following URL: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. And the Simulation data are within the manuscript and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Large datasets in industry, characterized by noisy labels and high dimensions, are increasingly common [1–9]. The labels are often generated through unknown processes or manual annotation, leading to noisy labels [10, 11]. This, in turn, reduces the robustness of models and increases learning costs [12], negatively impacting model generalization and accuracy [13]. Recent research [1, 3, 5, 10–12, 14–16] has focused on developing strategies to train robust models using such datasets or effectively eliminate noisy labels.
Data-driven methods [1, 5, 11] address noisy labels by assuming the Classification Noise Process (CNP) [17] and using data distribution and information to pre-filter the labels. Model-driven studies [7, 10, 14–16], on the other hand, focus on fitting robust models to noisy datasets to learn the true labels. Some studies [7, 11, 18] have improved these approaches by incorporating concepts from curriculum learning, resistance learning, and peer networks to correct noisy labels.
However, most of these studies have primarily focused on optimizing network structures, resulting in longer training times and higher costs. The γ-divergence, initially introduced as the "density power divergence of type-zero" [19], is a loss function that ensures model robustness by comparing two probability density functions [16]. Unfortunately, modeling large industrial datasets with γ-divergence often yields unsatisfactory results, because numerous predictive variables are introduced to mitigate modeling bias (the curse of dimensionality) [20]. To address this issue, we propose a simple-structured penalized γ-divergence model and a meta gradient correction optimization algorithm to robustly model noisy-label, high-dimensional datasets. We instantiate our approach by deriving the model in the specific scenario of logistic regression and then extend it to the framework of generalized linear models.
Concretely, we begin by defining the γ-divergence and introducing a class of penalty functions from Liu et al. (2023) [21], which yields the objective function Eq (9). Based on this, we derive the consistency and efficiency of our model parameter estimation (refer to Theorems 1 and 2), as well as the empirical risk upper bound of our model (refer to Theorem 3). Furthermore, since the γ-divergence weight function ωi of (Xi, Yi) indicates how likely Yi is to be a noisy label, we propose a meta gradient correction algorithm based on ωi and establish its theoretical basis (refer to Theorem 4). This algorithm is a variant of stochastic gradient descent [22] that effectively discriminates between noisy-label and non-noisy-label samples via a pre-defined threshold. Finally, we extend our theoretical results to demonstrate the generalizability of our approach to a wider range of cases.
This paper makes three key contributions to prior work on finding, understanding, and learning with noisy labels:
- We propose a simple-structured penalized γ-divergence model, grounded in rigorous theoretical proofs, that effectively reduces manual feature engineering and improves modeling efficiency.
- We propose a novel meta gradient correction algorithm and demonstrate, through solid theoretical proofs and rich experiments, that it can detect noisy labels and reduce the cost of manual labeling.
- We open-source our experimental code and data, showcasing the promising outcomes of our model on two tasks: detecting label errors, and learning from noisy labels.
Related works
Noisy label
Many studies focus on how to detect, discover, and recognize noisy labels in industrial datasets. They are mainly divided into methods for reweighting examples [23–27], estimating the noise transition matrix [28–32], optimizing gradient descent and training procedures [10, 33, 34], selecting confident examples [1, 5, 35, 36], introducing regularization [4, 7, 37–39], designing robust loss functions [8, 9, 14, 16, 40–44], and generating pseudo labels [45–49].
In addition, some advanced state-of-the-art methods combine several techniques, e.g., MentorNet [11], DivideMix [50], Iterative Learning [51], and ELR+ [52].
γ-divergence
The γ-divergence, first introduced in [19] under the name density power divergence of type-zero, is defined for two probability density functions. It is closely related to other divergence measures, such as the Kullback–Leibler (KL) divergence, but includes a tunable parameter γ that allows for more flexibility in the metric. Fujisawa and Eguchi (2008) [53] later studied it under the name γ-divergence. Hung et al. (2018) [14] proposed a robust mislabeled logistic regression model, called γ-logistic regression, based on the original form of γ-divergence defined by Jones et al. (2001) [19].
Penalized empirical likelihood
Penalized Empirical Likelihood has evolved as an essential statistical inference method, addressing challenges in small sample settings with nonparametric and semiparametric modeling. The origins of the method can be traced back to Owen (1988) [54], who introduced the concept of empirical likelihood, a nonparametric likelihood-based method for estimating unknown parameters in statistical models. In 1996, Qin and Lawless [55] further developed the theory and proved its efficiency and adaptability in various scenarios.
The incorporation of penalization in empirical likelihood was first introduced by Fan and Li (2001) [56]. They demonstrated how to use penalized empirical likelihood for variable selection and model estimation in high-dimensional and nonparametric settings. Later, Chen and Pouzo (2009) [57] proposed the use of an L1 penalty in empirical likelihood estimation, offering a more robust framework for handling high-dimensional data. Furthermore, the minimax concave penalty (MCP) was proposed by Zhang (2010) [58] to overcome drawbacks of the LASSO penalty, such as estimation bias and inconsistent variable selection.
In recent years, the application of penalized empirical likelihood has expanded into various fields. Two exciting applications of this method are found in Fan et al. (2016) [59] and Shi et al. (2016) [60], where they employed a penalized empirical likelihood method for variable selection in high-dimensional settings.
Methodology
This section proposes a novel noisy-label-ignore γ-divergence model for high-dimensional data with solid theoretical foundations. We substitute logistic regression as a specific case into this framework for a more explicit description. We first give some notations and mathematical derivation of the optimization objective for γ-divergence. Then, we propose the γ-divergence model by introducing a penalty function and proving the parameter consistency, asymptotic normality, and optimal risk upper bound for this model. Finally, we extend this model to the general case and introduce our meta gradient correction algorithm for implementing robust learning for noisy labels.
γ-divergence and its optimization objective
In the binary classification problem, let $\mathcal{X}$ be the sample space, $Y_0$ be the true binary response taking values in {0, 1}, and X be the d-dimensional vector of covariates. Let $P_{(X, Y_0)}$ and $P_X$ denote the joint distribution of (X, Y0) and the distribution of X, respectively. Let the conditional success probability of Y0 be $\eta(x) = P(Y_0 = 1 \mid X = x)$. In many real applications, we can only observe a response Y contaminated with noise instead of the true response Y0. Define the mislabeled probabilities

$\tau_{01}(x) = P(Y = 1 \mid Y_0 = 0, X = x)$  (1)

and

$\tau_{10}(x) = P(Y = 0 \mid Y_0 = 1, X = x).$  (2)

Therefore, the conditional success probability of Y, which is denoted by

$\tilde{\eta}(x) = P(Y = 1 \mid X = x),$  (3)

can be expressed as

$\tilde{\eta}(x) = \tau_{01}(x)\{1 - \eta(x)\} + \{1 - \tau_{10}(x)\}\eta(x).$  (4)
Suppose that we observe independent and identical data pairs (Xi, Yi), i = 1, …, n, with joint distribution P(X, Y). The goal is to predict the true label Y0 of a new observation with covariates X using the model fitted by the data (X1, Y1), …, (Xn, Yn).
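The contamination relation in Eq (4) can be checked numerically; a minimal sketch with hypothetical flip rates:

```python
def eta_tilde(eta, tau01, tau10):
    """P(Y=1|x) for a label contaminated with flip rates tau01, tau10.
    Eq (4): eta_tilde = tau01*(1 - eta) + (1 - tau10)*eta."""
    return tau01 * (1.0 - eta) + (1.0 - tau10) * eta

# With no mislabeling the contaminated and true probabilities coincide.
assert eta_tilde(0.7, 0.0, 0.0) == 0.7
# With flip rates 0.1 each: 0.1*0.3 + 0.9*0.7 ≈ 0.66
print(eta_tilde(0.7, 0.1, 0.1))
```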
Let g(y|x) be the underlying conditional probability density function of Y. By the definition of $\tilde{\eta}(x)$, we have $g(1|x) = \tilde{\eta}(x)$ and $g(0|x) = 1 - \tilde{\eta}(x)$. Let f(y|x; β) be the parametric conditional probability density function with parameter $\beta \in \mathbb{R}^d$, where d is the dimension of β. The γ-divergence between g(y|x) and f(y|x; β) is defined to be

$D_\gamma\{g(\cdot|x), f(\cdot|x;\beta)\} = \frac{1}{\gamma(1+\gamma)} \log \sum_{y} g(y|x)^{1+\gamma} - \frac{1}{\gamma} \log \sum_{y} g(y|x) f(y|x;\beta)^{\gamma} + \frac{1}{1+\gamma} \log \sum_{y} f(y|x;\beta)^{1+\gamma},$  (5)

where γ > 0 is a tuning parameter. Suppose that there exists a true parameter value $\beta_0 = (\beta_{10}^{\mathrm{T}}, \beta_{20}^{\mathrm{T}})^{\mathrm{T}}$ satisfying β20 = 0, where $\beta_{10} \in \mathbb{R}^{s}$, $\beta_{20} \in \mathbb{R}^{d-s}$, and s is a finite constant, such that g(y|x) = f(y|x; β0). By Theorem 3.1 in Fujisawa and Eguchi (2008) [53], Dγ{g(⋅|x), f(⋅|x; β0)} = 0. Therefore, β can be estimated by minimizing Dγ{g(⋅|x), f(⋅|x; β)}. To eliminate the randomness of X, we take expectations with respect to X in (5) and have $E_X[D_\gamma\{g(\cdot|x), f(\cdot|x;\beta)\}]$.

The γ-logistic regression estimates β by minimizing $E_X[D_\gamma\{g(\cdot|x), f(\cdot|x;\beta)\}]$. Since the first term of (5) does not involve β and is a constant, minimizing the γ-divergence reduces, up to a monotone transformation, to maximizing

$Q(\beta) = E_{(X,Y)}\left[\frac{f(Y|X;\beta)^{\gamma}}{\{\sum_{y} f(y|X;\beta)^{1+\gamma}\}^{\gamma/(1+\gamma)}}\right],$  (6)

where $E_X$ and $E_{(X,Y)}$ denote the expectation with respect to X and (X, Y), respectively.

In particular, logistic regression assumes that

$f(1|x;\beta) = \frac{\exp(x^{\mathrm{T}}\beta)}{1 + \exp(x^{\mathrm{T}}\beta)},$  (7)

where the intercept is absorbed into x and $f(0|x;\beta) = 1 - f(1|x;\beta)$. Given the observations (X1, Y1), …, (Xn, Yn), and substituting (7) into (6), the estimator $\hat{\beta}_\gamma$ is defined by

$\hat{\beta}_\gamma = \arg\max_{\beta} \frac{1}{n} \sum_{i=1}^{n} \frac{f(Y_i|X_i;\beta)^{\gamma}}{\{f(0|X_i;\beta)^{1+\gamma} + f(1|X_i;\beta)^{1+\gamma}\}^{\gamma/(1+\gamma)}}.$  (8)
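The sample criterion in Eq (8) can be sketched in code; the objective below follows the γ-logistic form of Hung et al. (2018) [14], and the data-generating values are hypothetical:

```python
import numpy as np

def gamma_logistic_objective(beta, X, y, gamma=0.5):
    """Sample version of the gamma-logistic criterion (cf. Eq (8));
    to be maximized over beta."""
    p1 = 1.0 / (1.0 + np.exp(-X @ beta))      # f(1|x; beta)
    f_y = np.where(y == 1, p1, 1.0 - p1)      # f(Y_i|x_i; beta)
    norm = (p1 ** (1 + gamma) + (1 - p1) ** (1 + gamma)) ** (gamma / (1 + gamma))
    return np.mean(f_y ** gamma / norm)

# Hypothetical simulated data: 200 samples, 5 covariates, 2 of them inactive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ beta_true))).astype(int)
q_true = gamma_logistic_objective(beta_true, X, y)
q_zero = gamma_logistic_objective(np.zeros(5), X, y)
print(q_true, q_zero)  # the criterion is typically higher near the true parameter
```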
Penalized γ-divergence model and its oracle properties
As claimed by Fujisawa and Eguchi (2008) [53] and Kawashima and Fujisawa (2017) [16], γ-divergence is very robust to noisy data. However, when the dimension d of the covariate X diverges, it can be shown that the estimator in (8) is no longer consistent for the true parameter β0 and the estimating equation is no longer robust [16], which leads to serious error accumulation in model prediction and reduces prediction accuracy [16, 61]. Hence, to solve these problems, we introduce a class of penalty functions $\mathcal{P}$ from Liu et al. (2023) [21]. Then the objective function of the penalized γ-logistic regression is

$L(\beta) = Q_n(\beta) - \sum_{j=1}^{d} P_{\lambda_n}(|\beta_j|),$  (9)

where $P_{\lambda_n}(\cdot)$ is a penalty function in class $\mathcal{P}$ with tuning parameter λn, $Q_n(\beta)$ is the sample criterion maximized in (8), and βj, j = 1, …, d, is an element of the vector β. Hence, the estimator $\hat{\beta}$ is defined by $\hat{\beta} = \arg\max_{\beta} L(\beta)$.
Since we focus on a parametric model, we still need consistency results. We first state two regularity conditions in Lemmas 1 and 2 and then give the consistency and asymptotic normality of the parameter estimates in Theorems 1 and 2.
Lemma 1 (Regularity Condition). Let Vi = (Xi, Yi) and let $\ell(\beta; V_i)$ denote the per-sample γ-divergence objective defined in (10), with the weight term given in (11), where βj, j = 1, …, d, is an element of the vector β. Let P(Yi = 1) be the probability that the sample label is 1 and define L(β) as in (12); then the first derivatives of L(β) satisfy the equations

$E\left[\frac{\partial \ell(\beta; V_i)}{\partial \beta_j}\right] = 0, \quad j = 1, \dots, d,$  (13)

at β = β0.

Lemma 2 (Regularity Condition). The Fisher information matrix,

$I(\beta) = E\left[\left\{\frac{\partial \ell(\beta; V)}{\partial \beta}\right\}\left\{\frac{\partial \ell(\beta; V)}{\partial \beta}\right\}^{\mathrm{T}}\right],$  (14)

is finite and positive definite at β = β0.
Theorem 1 (Consistency and Convergence Rate). Let Vi = (Xi, Yi) be independent and identically distributed and $P_{\lambda_n}(\cdot)$ be a penalty function in class $\mathcal{P}$ with tuning parameter λn. Let

$a_n = \max_{j}\{|P'_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \neq 0\}$  (15)

and

$b_n = \max_{j}\{|P''_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \neq 0\}.$  (16)

Assume that $a_n = O(n^{-1/2})$ and $b_n \to 0$. Then there is a local maximizer $\hat{\beta}$ of $L(\beta)$ such that $\|\hat{\beta} - \beta_0\| = O_p(n^{-1/2} + a_n)$ under some regularity conditions.

It is clear from Theorem 1 that by choosing a proper λn such that $a_n = O(n^{-1/2})$, there exists a root-n consistent penalized likelihood estimator $\hat{\beta}$. Let I(β0) be the Fisher information matrix and let I1(β10) be the Fisher information knowing β20 = 0. We now show that the estimator $\hat{\beta}$ possesses the sparsity property $\hat{\beta}_2 = 0$ and asymptotic normality, which together are known as the oracle property [56].
Theorem 2. Let $\hat{\beta} = (\hat{\beta}_1^{\mathrm{T}}, \hat{\beta}_2^{\mathrm{T}})^{\mathrm{T}}$ be a local maximizer of $L(\beta)$ in Theorem 1, Vi = (Xi, Yi) be independent and identically distributed, and $P_{\lambda_n}(\cdot)$ be a penalty function in class $\mathcal{P}$ with tuning parameter λn, where we assume $\liminf_{n\to\infty} \liminf_{\theta\to 0^+} P'_{\lambda_n}(\theta)/\lambda_n > 0$. Assume that $\sqrt{n}\,\lambda_n \to \infty$ and λn → 0. Then, with probability tending to one, for any constant C, under some regularity conditions, $\hat{\beta} = (\hat{\beta}_1^{\mathrm{T}}, \hat{\beta}_2^{\mathrm{T}})^{\mathrm{T}}$ satisfies:

(a) Sparsity: $\hat{\beta}_2 = 0$,

(b) Asymptotic Normality:

$\sqrt{n}\,\{I_1(\beta_{10}) + \Sigma\}\left[\hat{\beta}_1 - \beta_{10} + \{I_1(\beta_{10}) + \Sigma\}^{-1} b\right] \to N\{0, I_1(\beta_{10})\},$  (17)

where

$\Sigma = \mathrm{diag}\{P''_{\lambda_n}(|\beta_{10,1}|), \dots, P''_{\lambda_n}(|\beta_{10,s}|)\}$  (18)

and

$b = \left(P'_{\lambda_n}(|\beta_{10,1}|)\,\mathrm{sgn}(\beta_{10,1}), \dots, P'_{\lambda_n}(|\beta_{10,s}|)\,\mathrm{sgn}(\beta_{10,s})\right)^{\mathrm{T}}.$  (19)
Theorems 1 and 2 reveal that $\hat{\beta}$ possesses consistency and asymptotic normality. Moreover, as claimed by Cannings et al. (2020) [62], a model trained on noisy labels should be close to the optimal Bayes model under certain conditions. Hence, we aim to bound the excess risk between our model and the optimal Bayes model. First, we need some notation. Let C(x) be the predicted label taking values in {0, 1}; given x, define

$C(x) = \mathbf{1}\{f(x) \geq 1/2\}$  (20)

and $\epsilon = |f(x) - \tilde{\eta}(x)|$, which is the estimation error between the probability f(x) predicted by the model and the noisy conditional probability $\tilde{\eta}(x)$. We give the following Definitions 1 and 2 for the Bayes classifier.

Definition 1. Given x, the optimal classifier under clean labels is the Bayes classifier, denoted C*(x), where

$C^*(x) = \mathbf{1}\{\eta(x) \geq 1/2\}.$  (21)

Definition 2. Given x, the optimal classifier under noisy labels is the Bayes classifier, denoted $\tilde{C}^*(x)$, where

$\tilde{C}^*(x) = \mathbf{1}\{\tilde{\eta}(x) \geq 1/2\}.$  (22)
Definitions 1 and 2 tell us that the Bayes classifier minimizes the classification risk under clean and noisy labels, respectively. Furthermore, we introduce the Tsybakov condition in Assumption 1, which is the basis of Lemma 3 and Theorem 3.

Assumption 1 (Tsybakov Condition). There exist constants M, λ ≥ 0 and t0 > 0 such that for all 0 ≤ t ≤ t0 and all x, the following inequality holds:

$P\left(\left|\eta(X) - \tfrac{1}{2}\right| \leq t\right) \leq M t^{\lambda}.$  (23)

This condition, also called the margin assumption, stipulates that the uncertainty of η(x) is bounded. In other words, the margin region close to the decision boundary, $\{x : |\eta(x) - 1/2| \leq t\}$, has a bounded volume. Moreover, we define the subspace on which the Tsybakov condition holds as the set of x with $|\eta(x) - 1/2| \leq t$, which has probability at most $M t^{\lambda}$ for sufficiently small t ≤ t0. Hence, let Rour(x) denote the classification risk of our fitted model, defined in (24); based on Assumption 1, we can obtain Lemma 3 and Theorem 3.
Based on the above definitions and assumptions, Lemma 3 bounds the gap between our model Rour(x) and the Bayes classifier under noisy labels. Theorem 3 is the main result of this paper, revealing that the excess risk between our model Rour(x) and the optimal Bayes model R*(x) remains bounded even in the presence of noisy labels.

Lemma 3 (Upper Bound of the Excess Risk between Rour(x) and $\tilde{R}^*(x)$). Assume η(x) satisfies the Tsybakov condition with constants M, λ > 0 and t0 > 0. Assume further that τ01(x) + τ10(x) < 1. Given x, for all 0 ≤ t ≤ t0, the excess risk satisfies the bound in (25) as t ≤ t0 → 0, with the constants given in (26) and (27).

Theorem 3 (Upper Bound of the Excess Risk between Rour(x) and R*(x)). Assume η(x) satisfies the Tsybakov condition with constants M, λ > 0 and t0 > 0. Assume further that τ01(x) + τ10(x) < 1. Given x, for all 0 ≤ t ≤ t0, the excess risk satisfies the bound in (28) as t ≤ t0 → 0, with the constant given in (29), where the remainder term is a small value depending on λ.
Theorem 3 indicates that the risk bound between our model and the best Bayes classifier is dominated by the model's estimation error ϵ: the smaller the estimation error ϵ, the closer our model is to the Bayes classifier.

Furthermore, the estimation error reflects the difference between the estimated parameters of the model and the true parameters. From the consistency of the parameter estimates, ϵ tends to 0 as γ → ∞ and n → ∞, so our model converges to the Bayes model in the large-sample case. Specifically, from Theorems 1 and 2, $\hat{\beta}$ is root-n consistent, so the estimation error satisfies $\epsilon = O_p(n^{-1/2})$ up to a constant. Hence, from Theorem 3, the excess risk between our model and the optimal Bayes model vanishes as n → ∞.
Finally, we can easily generalize the conclusions of Theorems 1, 2, and 3 from γ-logistic regression to the generalized linear model (GLM) and obtain Corollary 1, as logistic regression is itself a GLM. The proof of Corollary 1 is straightforward and is omitted.
Corollary 1. Under the regularity conditions 1 and 2 and Assumption 1, for each generalized linear model (such as the Probit model), we can obtain the same results as described in Theorems 1, 2, and 3: (i) consistency, (ii) asymptotic normality, and (iii) convergence of the model.
Meta gradient correction algorithm
Although we have proved in the preceding subsection that our proposed model possesses oracle properties even in the noisy-label case, a recent study [10] has shown that machine learning models gradually memorize individual data points while adapting to the data distribution. Therefore, when facing noisy labels, all statistical or machine learning methods inevitably suffer reduced generalization ability, and it is necessary to eliminate the impact of individual data points as much as possible through methods such as early stopping or dropout. However, these methods passively eradicate the effect of individual data points on model training, and their excessive use often results in underfitting. Therefore, this paper proposes a regularized meta-learning algorithm, Meta Gradient Correction (MGC), for noisy-label data.
From Hung et al. (2018) [14], the γ-divergence weight function of (Xi, Yi),

$\omega_i = f(Y_i \mid X_i; \beta)^{\gamma},$  (30)

indicates how likely Yi is to be a noisy label. In other words, the larger ωi is, the more significant the effect of (Xi, Yi) on model training and the less likely Yi is to be a noisy label; the smaller ωi is, the more likely Yi is to be a noisy label. On the other hand, the minimum γ-divergence estimation criterion maximizes the objective function Q(β), which leads to using a stochastic gradient ascent algorithm.

Hence, in this part, we propose an adaptive, meta-approach-based algorithm that modifies the optimization step. If the γ-divergence weight ωi of (Xi, Yi) exceeds a pre-defined threshold, the usual gradient step on β is performed; otherwise, the gradient step is reversed to eliminate the effect of suspected noisy labels on parameter convergence. To illustrate the validity of this algorithm, we further present Theorem 4.
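The update rule just described can be sketched as follows. This is a minimal reading of the text, not the authors' implementation: the weight form f(Yi|Xi; β)^γ, the log-likelihood surrogate gradient, and all names are our assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgc_step(beta, X, y, gamma=0.5, eta=0.01, threshold=0.5):
    """One meta-gradient-correction step (a sketch of the idea above):
    low-weight samples are suspected noisy labels, so their step is reversed."""
    p1 = sigmoid(X @ beta)
    f_y = np.where(y == 1, p1, 1.0 - p1)
    w = f_y ** gamma                       # gamma-divergence weights omega_i
    grads = (y - p1)[:, None] * X          # per-sample ascent direction
    sign = np.where(w >= threshold, 1.0, -1.0)
    return beta + eta * np.mean(sign[:, None] * grads, axis=0)

# Hypothetical data: 100 samples, 3 covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < sigmoid(X @ np.array([1.0, -1.0, 0.0]))).astype(int)
beta = np.zeros(3)
for _ in range(200):
    beta = mgc_step(beta, X, y)
print(beta)
```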
Theorem 4 (Convergence of the MGC Algorithm). Let the number of samples in each batch be N, where the number of noisy-labeled samples is Nnoisy and the number of non-noisy-labeled samples is Nnon−noisy, i.e., N = Nnoisy + Nnon−noisy. Without loss of generality, assume that Nnon−noisy > Nnoisy. Moreover, the function we want to minimize, Q(β), is continuously differentiable, and we assume that Q(β) has Lipschitz continuous gradients with constant L, i.e., there exists a constant L > 0 such that

$\|\nabla Q(\beta) - \nabla Q(\beta')\| \leq L \|\beta - \beta'\|$

for all $\beta, \beta' \in \mathbb{R}^d$. The algorithm updates by the gradient descent iteration βt+1 = βt − η∇Q(βt), where η > 0 is the learning rate. We then obtain that

$Q(\beta_{t+1}) \leq Q(\beta_t) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla Q(\beta_t)\|^2$  (31)

for all t ≥ 0. Hence, for η ≤ 1/L, the function values Q(βt) form a non-increasing sequence, which implies that Q(βt) converges whenever Q is bounded below.
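The descent inequality in Theorem 4 can be checked numerically on a simple quadratic objective (an illustrative stand-in for Q(β), not the paper's objective):

```python
import numpy as np

# Q(b) = 0.5 * b^T A b has gradient A b and Lipschitz constant
# L = largest eigenvalue of A; the quadratic here is hypothetical.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.linalg.eigvalsh(A)[-1]          # eigvalsh returns ascending order
eta = 1.0 / L                          # step size satisfying eta <= 1/L

Q = lambda b: 0.5 * b @ A @ b
grad = lambda b: A @ b

b = np.array([5.0, -3.0])
values = [Q(b)]
for _ in range(50):
    b = b - eta * grad(b)              # beta_{t+1} = beta_t - eta * grad Q(beta_t)
    values.append(Q(b))

# Q(beta_t) is non-increasing, as Eq (31) implies.
assert all(v1 >= v2 for v1, v2 in zip(values, values[1:]))
print(values[0], values[-1])
```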
Theorem 4 implicitly states that our proposed meta gradient correction algorithm makes the objective function Q(β) converge when the proportion of noisy labels is small (refer to Nnon−noisy > Nnoisy). This is also consistent with the results shown in the Experiments section and demonstrates the validity of our algorithm. Furthermore, following [21], we give three specific forms of the penalty function as particular implementations of our algorithm. One is the SCAD penalty [56],

$P_{\mathrm{SCAD}}(\beta_j) = \begin{cases} \lambda|\beta_j|, & |\beta_j| \leq \lambda, \\ \dfrac{2a\lambda|\beta_j| - \beta_j^2 - \lambda^2}{2(a-1)}, & \lambda < |\beta_j| \leq a\lambda, \\ \dfrac{\lambda^2(a+1)}{2}, & |\beta_j| > a\lambda, \end{cases}$  (32)

where a > 2 and βj, j = 1, …, d, is an element of the vector β. Another is the MCP [58], $P_{\mathrm{MCP}}(\beta_j) = \lambda|\beta_j| - \beta_j^2/(2a)$ for $|\beta_j| \leq a\lambda$ and $a\lambda^2/2$ otherwise, j = 1, …, d, which addresses variable selection and estimation in the high-dimensional case. The last is the Lasso [63], where PLasso(βj) = |βj|, j = 1, …, d. For ease of description and understanding, we substitute the logistic and probit models into f(y|x; β) to illustrate our algorithm.
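These three penalties can be sketched as follows. This is a minimal sketch: the defaults a = 3.7 for SCAD (Fan and Li, 2001) and a = 3 for MCP are conventional choices, not values taken from this paper, and the Lasso is written with its tuning parameter explicit.

```python
import numpy as np

def p_lasso(b, lam=1.0):
    return lam * np.abs(b)

def p_scad(b, lam=1.0, a=3.7):
    """SCAD penalty; a=3.7 is Fan and Li's suggested default."""
    b = np.abs(b)
    small = lam * b
    mid = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    large = lam ** 2 * (a + 1) / 2
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, large))

def p_mcp(b, lam=1.0, a=3.0):
    """Minimax concave penalty (MCP) of Zhang (2010)."""
    b = np.abs(b)
    return np.where(b <= a * lam, lam * b - b ** 2 / (2 * a), a * lam ** 2 / 2)

beta = np.array([-2.0, 0.5, 4.0])
print(p_lasso(beta), p_scad(beta), p_mcp(beta))
```

Note that SCAD and MCP are constant beyond aλ, so large coefficients are not shrunk, which is what mitigates the Lasso's estimation bias.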
Experiments
In this section, we evaluate the abilities of the proposed model and algorithm, namely, the capability of the penalized γ-divergence model to learn high-dimensional data and the robustness of the adaptive classification algorithm in modeling noisy data.
Our dataset
We conducted numerous experiments on multiple simulation data and real data. These datasets have the following three characteristics.
- Simulation data: We generate noisy-label datasets based on random seeds with three different sample dimensions and eight noisy-label ratios. The training sample dimensions include 200*500, 500*1000, and 1000*1500, and the testing sample dimensions include 100*500, 200*1000, and 200*1500. The noisy-label ratios include 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8.
- Real data: We obtain breast cancer data from the UCI Machine Learning Repository as real data (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), which contains 571 samples and 31 feature variables. We randomly divide it into the training set and test set in a ratio of 0.8 and 0.2. The training set includes 456 samples, and the test set contains 115. We set the noisy label ratio in experiments to 0.1, 0.3, and 0.5. Following [18, 47, 52], we randomly select samples from the 0-labeled and 1-labeled samples in the training set (the number of samples chosen is the product of the noisy label ratio and the number of samples in this category). Finally, we flip the labels of the selected samples. Similarly, for the test set, we perform the same operation.
- Each noisy label dataset is used to train and test the effectiveness and capability of the eight models in learning and detecting noisy labels. The final test results for each model are based on the average of tests after 50 different training sessions.
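The per-class label-flipping procedure described above can be sketched as follows (a minimal sketch; function and variable names are ours):

```python
import numpy as np

def flip_labels(y, noise_ratio, rng=None):
    """Flip a fraction of labels within each class, as described above.
    The number flipped per class is noise_ratio * (class size), rounded down."""
    rng = np.random.default_rng(rng)
    y_noisy = y.copy()
    for label in (0, 1):
        idx = np.flatnonzero(y == label)
        n_flip = int(noise_ratio * idx.size)
        chosen = rng.choice(idx, size=n_flip, replace=False)
        y_noisy[chosen] = 1 - label
    return y_noisy

# Hypothetical labels: 60 zeros and 40 ones; ratio 0.1 flips 6 + 4 = 10 labels.
y = np.array([0] * 60 + [1] * 40)
y_noisy = flip_labels(y, 0.1, rng=0)
print((y != y_noisy).sum())  # 10
```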
Settings of models
This subsection provides brief information about the experimental settings, and more details can be found in our open-sourced codes (https://github.com/DebtVC2022/Robust_Learning_with_MGC).
- Baseline Models: We select the logistic and probit models as baseline models following the research [14], with their optimal settings based on pre-experiments (described in the following paragraph).
- Relevant Settings: Our models contain multiple versions, obtained by combining two backbone models (refer to the logistic and probit models), three penalty functions (refer to the SCAD, MCP, and Lasso), and the proposed meta gradient correction algorithm. For all models, we select hyperparameters based on preliminary experiments and the prevalent findings [14, 18, 56, 61], and the final hyperparameters are set as follows: γ value of 0.5, λ value of 1, threshold value Tv of 0.5, learning rate η of 0.01.
Evaluation metrics
Following the studies [14, 16], we assess our model with five metrics: two accuracy measures and the recall, precision, and F1-score of noisy-label detection.
- Accuracy based on contaminated label Y: Following the studies [14, 16], we choose the accuracy to assess the ability of our model to learn noisy labels and then denote this evaluation metric as ACC_noisy, which records the number of correctly predicted data points among all data points (X, Y).
- Accuracy based on true label Y0: Following the studies [14, 16], we choose the accuracy to assess the ability of our model to learn noisy labels and then denote this evaluation metric as ACC_true, which records the number of correctly predicted data points among all data points (X, Y0).
- Recall: We denote this evaluation metric by Recall, which records the proportion of truly noisy-labeled samples that the model correctly identifies as noisy, out of all noisy-labeled samples.
- Precision: We denote this evaluation metric by Precision, which records the proportion of samples the model identifies as noisy-labeled that are indeed noisy-labeled.
- F1-score: We denote this evaluation metric by F1, which is calculated by $F1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
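The noise-detection metrics above can be computed as in this minimal sketch (the example arrays are hypothetical):

```python
import numpy as np

def noise_detection_metrics(is_noisy_true, is_noisy_pred):
    """Precision, Recall, and F1 for noisy-label detection, as defined above."""
    tp = int(np.sum(is_noisy_true & is_noisy_pred))
    precision = tp / max(int(is_noisy_pred.sum()), 1)
    recall = tp / max(int(is_noisy_true.sum()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

truth = np.array([1, 1, 0, 0, 1, 0], dtype=bool)  # which labels were flipped
pred  = np.array([1, 0, 0, 1, 1, 0], dtype=bool)  # which the model flags
print(noise_detection_metrics(truth, pred))       # (2/3, 2/3, 2/3)
```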
These five metrics reflect the robustness of the model and its ability to detect noisy labels; larger values denote a stronger ability to model noisy labels [14].
Ablation study & analysis
In this subsection, we conduct comprehensive evaluations to demonstrate the effectiveness of our proposed model in learning with noisy labels. The evaluations are performed on simulated and real-world data by combining different backbone models (refer to two generalized linear models for binary classification problems: logistic regression and probit models), penalty functions (refer to SCAD [56], MCP [61], and Lasso [63]), and the proposed meta gradient correction (MGC) algorithm.
Analysis based on simulated data.
Overall analysis based on simulated data. We first evaluate the model performance on simulated data under different settings. As illustrated in Tables 1–18, our proposed penalized γ-divergence model consistently achieves higher accuracy than the baseline model across different noisy ratios (refer to 0.1, 0.2, 0.3), backbone models, and penalty functions. This indicates the robustness of our model against noisy labels. More importantly, our model substantially improves Recall, Precision, and F1, the critical metrics for noisy label identification [62]. The results strongly verify the capability of our model in detecting and rectifying noisy labeled instances. In addition, in practical noisy label scenarios, we are more concerned with how similar the predicted labels are to the correct labels Y0 and what percentage of all noisy labels in the dataset are identified [62]. Hence, as depicted in Figs 1–6, we also display the trends of ACC_true and Precision under the different noisy ratios (refer to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8).
Fig 1. When the proportion of noisy labels is small, introducing the meta gradient correction algorithm improves the classification accuracy of the model, except when it is combined with the SCAD penalty function.
Fig 2. All four subfigures show that introducing the meta gradient correction algorithm increases the precision of model predictions, i.e., the reliability of model predictions improves.
Fig 3. The four subfigures of this figure lead to conclusions similar to those of Fig 1.
Fig 4. The four subfigures of this figure lead to conclusions similar to those of Fig 2.
Fig 5. When the proportion of noisy labels is small, introducing the meta gradient correction algorithm improves the classification accuracy of the model.
Fig 6. All four subfigures show that introducing the meta gradient correction algorithm increases the precision of model predictions, i.e., the reliability of model predictions improves.
In addition, by incorporating the proposed meta gradient correction algorithm, the model accuracy on true labels (ACC_true) and Precision are further improved under various simulated scenarios. This demonstrates the effectiveness of the meta gradient correction algorithm in enhancing model robustness and resilience to noisy labels. According to our theoretical analyses (refer to Theorem 4), our model is effective when the noisy ratio is small, which is also validated by the experimental results.
The analysis of the impact of sample size based on simulated data. We also evaluated the impact of sample size on model performance. Based on Tables 1–18, we found that when the training-set sample size increases from 200 to 1000, the predictive ability of the model first increases and then decreases. The initial improvement occurs because a larger training set contains more useful information, so a model equipped with the meta gradient correction algorithm or a penalty function can capture more information about the correctly labeled samples. As the training set grows further, however, it becomes challenging to fit large-scale datasets with simple logistic and probit backbones, which usually require additional machinery to fit such data effectively. At this point, combining the meta gradient correction algorithm with a penalized model provides no additional benefit. One potential reason is an inherent conflict between the goals and working mechanisms of the two techniques: when the sample size is small, their strengths offset the problems caused by this conflict, but as the sample size continues to increase, the problem gradually surfaces. A similar phenomenon occurs when the proportion of noisy labels increases.
The analysis of the impact of noisy label ratio based on simulated data. As illustrated in Tables 1–18 and Figs 1–6, ACC_true and Precision fluctuate severely as the noise ratio increases, especially when the noise level is high (e.g., ≥0.4). In Fig 1, the evaluation metrics show a smooth downward trend when the model has no penalty function. In Figs 2–4, the evaluation metrics show a fluctuating downward trend when the model includes a penalty function. These fluctuations directly result from the abundance of noisy labels interfering with the features selected by the penalty function, which poses significant challenges for training penalized models. In this case, the model may therefore struggle to achieve optimal predictive performance or even to converge. In future research, we are actively considering strategies to mitigate these challenges, including introducing smoothing factors or improving penalty functions to enhance model training under large-scale label noise. Further, Figs 1, 2, 5 and 6 show the evaluation metrics for the same model settings and noise-ratio settings on the training set and test set, respectively. We find that ACC_true and Precision behave roughly the same in matching scenarios, with the training-set curves being relatively smoother.
In summary, the comprehensive experiments on simulated data validate the effectiveness of our proposed approach in modeling noisy labels and identifying noisy instances. The results are well aligned with our theoretical analyses.
Analysis based on real data.
Overall analysis based on real data. We further benchmark our model on real-world datasets with synthetic noisy labels. As shown in Tables 19–24, the meta gradient correction algorithm consistently improves the results across different noisy ratios and evaluation metrics compared to baseline models. However, the incorporation of penalized models leads to inferior performance. A potential reason is that the real datasets have lower dimensionality and complexity compared to simulated data and, thus, are less prone to overfitting. Therefore, the benefits of regularization models are diminished. Encouragingly, the results further verify the robustness and effectiveness of the proposed meta gradient correction algorithm on real-world noisy label learning tasks.
In conclusion, extensive experiments on both simulated and real-world datasets demonstrate the capability of our model in handling noisy labels. The meta gradient correction algorithm is verified to deliver consistent performance improvements. In future works, we will focus on alleviating the model fluctuation issue when the noise level is high and evaluating the approach on larger-scale and more complex real-world data. Advanced deep learning techniques can also be explored as backbone models to enhance model capacity and scalability.
Conclusion and future work
In this paper, we have proposed a novel penalized model and a meta gradient correction algorithm, both with grounded theoretical foundations, to detect noisy labels. To illustrate the effectiveness of the proposed model and algorithm, we present both theoretical proofs and extensive experiments.
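The penalized γ-divergence model itself is defined earlier in the paper; as an illustration only of why γ-divergence-type losses resist label noise, the sketch below implements a γ-cross-entropy of the form used in the robust-divergence literature. Unlike the ordinary cross-entropy, it stays bounded on confidently mislabeled points, and it recovers the ordinary cross-entropy in the limit γ → 0. This is a sketch under our assumptions, not the authors' implementation:

```python
import numpy as np

def gamma_cross_entropy(probs, y, gamma=0.5):
    """Per-sample gamma-cross-entropy for a softmax classifier.

    loss_i = -(1/gamma) * ( p_{y_i}^gamma / (sum_j p_j^{1+gamma})^{gamma/(1+gamma)} - 1 )

    As gamma -> 0 this converges to the usual -log p_{y_i}; for gamma > 0
    the loss is bounded even when p_{y_i} -> 0, so confidently mislabeled
    samples cannot dominate the objective."""
    p_y = probs[np.arange(len(y)), y]
    norm = np.sum(probs ** (1.0 + gamma), axis=1) ** (gamma / (1.0 + gamma))
    return -(1.0 / gamma) * (p_y ** gamma / norm - 1.0)

probs = np.array([[0.9, 0.1],   # confident and consistent with its label
                  [0.1, 0.9]])  # confident, but the label says class 0
y = np.array([0, 0])
loss = gamma_cross_entropy(probs, y, gamma=0.5)
```

On the likely mislabeled second sample, the γ loss (about 1.34) is well below the ordinary cross-entropy −log 0.1 ≈ 2.30, which is the boundedness that underlies the robustness of γ-divergence estimation.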
First, we derive Theorems 1 and 2, which establish that the estimator in the proposed model possesses consistency and asymptotic normality. Besides, we obtain Theorem 3 based on an empirical-risk upper-bound derivation, proving that the gap between our model and the optimal Bayesian model is sufficiently small. Furthermore, we propose a meta-learning algorithm for gradient correction and show its convergence based on rigorous theoretical foundations (refer to Theorem 4).
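The exact meta gradient correction algorithm and its convergence analysis are given in the paper and the supporting information. As an illustration of the general idea behind such corrections, the sketch below uses the one-step reweighting approximation of Ren et al. (2018) for logistic regression: per-sample training gradients are retained in proportion to their alignment with the gradient on a small trusted meta set. All function names are ours, and this is not the authors' Theorem-4 algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reweighted_gradient(theta, X, y, X_meta, y_meta):
    """Meta-corrected gradient for logistic regression (sketch).

    Samples whose gradients point in the same direction as the clean
    meta-set gradient keep their weight; samples that conflict with the
    meta set (e.g., mislabeled ones) are down-weighted to zero."""
    # per-sample gradient of the logistic loss: (p_i - y_i) * x_i
    p_train = sigmoid(X @ theta)
    g_train = (p_train - y)[:, None] * X                          # (n, d)
    # average gradient on the small trusted meta set
    p_meta = sigmoid(X_meta @ theta)
    g_meta = ((p_meta - y_meta)[:, None] * X_meta).mean(axis=0)   # (d,)
    # weight = positive part of the alignment with the meta gradient
    w = np.maximum(0.0, g_train @ g_meta)
    if w.sum() > 0:
        w = w / w.sum()
    return g_train.T @ w                                          # (d,)

theta = np.zeros(2)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
X_meta = np.array([[1.0, 0.0], [0.0, 1.0]])
y_meta = np.array([1.0, 0.0])
g = reweighted_gradient(theta, X, y, X_meta, y_meta)
```

In this toy example, the third sample's gradient is orthogonal to the meta-set gradient and receives zero weight, so the corrected update is driven entirely by the two samples consistent with the trusted data.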
Next, we conducted experiments on multiple simulated and real datasets. These experiments demonstrate the following: (i) the proposed penalized model exhibits surprising robustness in modeling noisy labels; (ii) the proposed meta gradient correction algorithm shows a promising ability to detect noisy labels; (iii) our model can be easily applied in practical machine learning scenarios.
Finally, we plan to improve our algorithm to make it suitable for large-scale high-dimensional data. Possible improvements include introducing more expressive models and smoother optimization algorithms. In addition, our algorithm can also be applied in other fields, such as causal inference.
Supporting information
S1 File. The title of this file is ‘Support information for “Robust meta gradient learning for high-dimensional data with noisy-label ignorance”’.
The supporting information file contains all the proofs covered by the manuscript. Specifically, it contains the derivation of Eq (5) and the proofs of Theorems 1, 2, 3, and 4.
https://doi.org/10.1371/journal.pone.0295678.s001
(PDF)