
Robust meta gradient learning for high-dimensional data with noisy-label ignorance

Abstract

Large datasets with noisy labels and high dimensions have become increasingly prevalent in industry. These datasets often contain errors or inconsistencies in the assigned labels and introduce a vast number of predictive variables. Such issues frequently arise in real-world scenarios due to uncertainty or human error during data collection and annotation. The presence of noisy labels and high dimensionality can significantly impair the generalization ability and accuracy of trained models. To address these issues, we introduce a simple-structured penalized γ-divergence model and a novel meta-gradient correction algorithm, and we establish the foundations of both modules with rigorous theoretical proofs. Finally, comprehensive experiments validate their effectiveness in detecting noisy labels and mitigating the curse of dimensionality, and suggest that the proposed model and algorithm achieve promising outcomes. Moreover, we open-source our code and distinctive datasets on GitHub (refer to https://github.com/DebtVC2022/Robust_Learning_with_MGC).

Introduction

Large datasets in industry, characterized by noisy labels and high dimensions, are increasingly common [1–9]. The labels are often generated through unknown processes or manual annotation, leading to noisy labels [10, 11]. This, in turn, reduces the robustness of models and increases learning costs [12], negatively impacting model generalization and accuracy [13]. Recent research [1, 3, 5, 10–12, 14–16] has focused on developing strategies to train robust models on such datasets or to effectively eliminate noisy labels.

Data-driven methods [1, 5, 11] address noisy labels by assuming the Classification Noise Process (CNP) [17] and using data distribution and information to pre-filter the labels. Model-driven studies [7, 10, 14–16], on the other hand, focus on fitting robust models to noisy datasets to learn the true labels. Some studies [7, 11, 18] have improved these approaches by incorporating concepts from curriculum learning, resistance learning, and peer networks to correct noisy labels.

However, most of these studies have focused primarily on optimizing network structures, resulting in longer training times and higher costs. The γ-divergence, initially introduced as the “density power divergence of type-zero” [19], is a loss function that ensures model robustness by comparing two probability density functions [16]. Unfortunately, when modeling large industrial datasets with γ-divergence, results are often unsatisfactory because numerous predictive variables are introduced to mitigate modeling biases (the curse of dimensionality) [20]. To address this issue, we propose a simple-structured penalized γ-divergence model and a meta gradient correction optimization algorithm to robustly model noisy-label, high-dimensional datasets. We instantiate our approach by deriving the model in the specific scenario of logistic regression and then extend it to the framework of generalized linear models.

Concretely, we begin by defining the γ-divergence and introducing a class of penalty functions from Liu et al. (2023) [21], which yields the objective function Eq (9). Based on this, we derive the consistency and efficiency of our model parameter estimation (refer to Theorems 1 and 2), as well as the empirical risk upper bound of our model (refer to Theorem 3). Furthermore, since the γ-divergence weight function of (Xi, Yi) indicates how likely Yi is to be a noisy label, we propose a meta gradient correction algorithm based on ωi and establish its theoretical basis (refer to Theorem 4). This algorithm is a variant of stochastic gradient descent [22] that effectively discriminates between noisy-label and non-noisy-label samples via a pre-defined threshold. Finally, we extend our theoretical results to demonstrate the generalizability of our approach to a wider range of cases.

This paper makes three key contributions to prior work on finding, understanding, and learning with noisy labels:

  • We propose a simple-structured penalized γ-divergence model and establish its theoretical foundations through rigorous proofs; the model effectively reduces manual feature engineering and improves modeling efficiency.
  • We propose a novel meta gradient correction algorithm and demonstrate, through solid theoretical proofs and extensive experiments, that it can detect noisy labels and reduce the cost of manual labeling.
  • We open-source our experimental code and data, showcasing the promising outcomes of our model on two tasks: detecting label errors, and learning from noisy labels.

Related works

Noisy label

Many studies focus on how to detect, discover, and recognize noisy labels in industrial datasets. They can be mainly divided into methods for reweighting examples [23–27], methods for estimating the noise transition matrix [28–32], methods for optimizing gradient descent and training procedures [10, 33, 34], methods for selecting confident examples [1, 5, 35, 36], methods for introducing regularization [4, 7, 37–39], methods for designing robust loss functions [8, 9, 14, 16, 40–44], and methods for generating pseudo labels [45–49].

In addition, some state-of-the-art methods combine several techniques, e.g., MentorNet [11], DivideMix [50], Iterative Learning [51], and ELR+ [52].

γ-divergence

The γ-divergence, first introduced in [19] under the name density power divergence of type-zero, is defined for two probability density functions. It is closely related to other divergence measures, such as the Kullback-Leibler (KL) divergence, but includes a tunable parameter γ that allows for more flexibility in the metric. Fujisawa and Eguchi (2008) [53] later coined the name γ-divergence. Hung et al. (2018) [14] proposed a robust mislabeled logistic regression model based on the original form of γ-divergence defined by Jones et al. (2001) [19], called the γ-logistic regression.

Penalized empirical likelihood

Penalized Empirical Likelihood has evolved as an essential statistical inference method, addressing challenges in small sample settings with nonparametric and semiparametric modeling. The origins of the method can be traced back to Owen (1988) [54], who introduced the concept of empirical likelihood, a nonparametric likelihood-based method for estimating unknown parameters in statistical models. In 1996, Qin and Lawless [55] further developed the theory and proved its efficiency and adaptability in various scenarios.

The incorporation of penalization into empirical likelihood was first introduced by Fan and Li (2001) [56], who demonstrated how to use penalized empirical likelihood for variable selection and model estimation in high-dimensional and nonparametric settings. Later, Chen and Pouzo (2009) [57] proposed the use of an L1 penalty in empirical likelihood estimation, offering a more robust framework for handling high-dimensional data. Furthermore, the minimax concave penalty (MCP) was proposed by Zhang (2010) [58] to overcome drawbacks of the LASSO penalty, such as bias and lack of sparsity.

In recent years, the application of penalized empirical likelihood has expanded into various fields. Two exciting applications of this method are found in Fan et al. (2016) [59] and Shi et al. (2016) [60], where they employed a penalized empirical likelihood method for variable selection in high-dimensional settings.

Methodology

This section proposes a novel noisy-label-ignoring γ-divergence model for high-dimensional data with solid theoretical foundations. For concreteness, we instantiate the framework with logistic regression as a specific case. We first give some notation and the mathematical derivation of the optimization objective for γ-divergence. Then, we propose the penalized γ-divergence model by introducing a penalty function and proving parameter consistency, asymptotic normality, and an optimal risk upper bound for this model. Finally, we extend the model to the general case and introduce our meta gradient correction algorithm for robust learning with noisy labels.

γ-divergence and its optimization objective

In the binary classification problem, let be the sample space, Y0 be the true binary response taking values in {0, 1}, and X be the d-dimensional vector of covariates. Let P(X, Y0) and PX denote the joint distribution of (X, Y0) and the distribution of X, respectively. Let the conditional success probability of Y0 be . In many real applications, we can only observe response Y contaminated with noise instead of the true response Y0. Define the mislabeled probability (1) and (2)

Therefore, the conditional success probability of Y, which is denoted by (3) can be expressed by (4)

Suppose that we observe independent and identical data pairs (Xi, Yi), i = 1, …, n, with joint distribution P(X, Y). The goal is to predict the true label Y0 of a new observation with covariates X using the model fitted by the data (X1, Y1), …, (Xn, Yn).

Let g(y|x) be the underlying conditional probability density function of Y. By the definition of , we have . Let f(y|x; β) be the parametric conditional probability density function with the parameter , where d is the dimension of β. The γ-divergence between g(y|x) and f(y|x; β) is defined to be (5) where . Suppose that there exists a true parameter value satisfying β20 = 0, where , , and s is a finite constant, such that g(y|x) = f(y|x; β0). By Theorem 3.1 in Fujisawa and Eguchi (2008) [53], Dγ{g(⋅|x), f(⋅|x; β0)} = 0. Therefore, β can be estimated by minimizing Dγ{g(⋅|x), f(⋅|x; β)}. To eliminate the randomness of X, we take expectations with respect to X in (5) and have

The γ-logistic regression estimates β by minimizing . Since is a constant, we have (6) where and denote the expectation with respect to X and (X, Y), respectively.

In particular, logistic regression assumes that (7) where . Given the observations (X1, Y1), …, (Xn, Yn), and substituting (7) into (6), the estimator is defined by (8)
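The display equations above do not survive extraction, but the γ-logistic estimator of Eq (8) can be sketched in code. The following is a minimal sketch using the standard form of the γ-cross-entropy for binary logistic regression as in Hung et al. (2018); the function names and the exact normalization are our assumptions, not the paper's released implementation.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def gamma_logistic_objective(beta, X, y, gamma=0.5):
    """Empirical gamma-divergence objective for logistic regression,
    to be maximized over beta (a sketch of the estimator in Eq (8))."""
    p = sigmoid(X @ beta)                      # P(Y = 1 | X; beta)
    f = np.where(y == 1, p, 1.0 - p)           # f(Y_i | X_i; beta)
    # Normalizer: mean over X of sum_y f(y | X)^{1+gamma},
    # raised to the power gamma / (1 + gamma).
    norm = np.mean(p**(1 + gamma) + (1 - p)**(1 + gamma)) ** (gamma / (1 + gamma))
    return np.mean(f**gamma) / norm
```

On clean, well-separated data, a parameter vector aligned with the true one should score higher than the zero vector, which is what makes maximizing this objective meaningful.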

Penalized γ-divergence model and its oracle properties

As claimed by Fujisawa and Eguchi (2008) [53] and Kawashima and Fujisawa (2017) [16], γ-divergence remains very robust to noisy data. However, when the dimension d of the covariate X diverges, it can be shown that the estimator in (8) is no longer consistent for the true parameter β0 and the estimating equation is no longer robust [16], which leads to serious error accumulation in model predictions and reduces prediction accuracy [16, 61]. Hence, to solve these problems, we introduce a class of penalty functions from Liu et al. (2023) [21] as follows,

Then the objective function of the penalized γ-logistic regression is (9) where is a penalty function in class with tuning parameter λn, and βj, j = 1, …, d, is an element in the vector β. Hence, the estimator is defined by .

Since we focus on a parametric model, we still need consistency results here. We first state two regularity conditions in Lemmas 1 and 2 and then give the consistency and asymptotic normality of the parameter estimates in Theorems 1 and 2.

Lemma 1 (Regularity Condition). Let Vi = (Xi, Yi) and (10) where (11) and βj, j = 1, …, d, is an element in the vector β.

Let P(Yi = 1) be the probability that the sample label is 1 and (12) then the first derivatives of L(β) satisfy the equations (13) at β = β0, where .

Lemma 2 (Regularity Condition). The Fisher information matrix, (14) is finite and positive definite at β = β0.

Theorem 1 (Consistency and Convergence Rate). Let Vi = (Xi, Yi) be independent and identically distributed and be a penalty function in class with tuning parameter λn. Let (15) and (16)

Assume that and . Then, under some regularity conditions, there is a local maximizer of such that .

It is clear from Theorem 1 that, by choosing a proper , there exists a root-n consistent penalized likelihood estimator . Let I(β0) be the Fisher information matrix and let I1(β10) be the Fisher information knowing β20 = 0. We now show that the estimator possesses the sparsity property and asymptotic normality, together known as the oracle property [56].

Theorem 2. Let be a local maximizer of in Theorem 1, Vi = (Xi, Yi) be independent and identically distributed, and be a penalty function in class with tuning parameter λn. Let where assuming . Assume that and λn → 0. Then, for any constant C, under some regularity conditions, satisfies:

(a)Sparsity: ,

(b)Asymptotic Normality: (17) where (18) and (19)

Theorems 1 and 2 reveal that possesses consistency and asymptotic normality. Moreover, as claimed by Cannings et al. (2020) [62], a model trained on noisy labels should be close to the optimal Bayes model under certain conditions. Hence, we aim to bound the excess risk between our model and the optimal Bayes model. First, we need some notation. Let C(x) be the predicted label taking values in {0, 1} and, given x, define (20) and , which is the estimation error between the label predicted by f(x) and the noisy conditional probability . We give the following Definitions 1 and 2 for the Bayes classifier.

Definition 1. Given x, the optimal classifier under clean labels is the Bayes classifier, denoted C*(x), where (21)

Definition 2. Given x, the optimal classifier under noisy labels is the Bayes classifier, denoted , where (22)

Definitions 1 and 2 tell us that the Bayes classifier minimizes the classification risk under noisy labels. We then introduce the Tsybakov condition in Assumption 1, which is the basis of Lemma 3 and Theorem 3.

Assumption 1 (Tsybakov Condition). There exist constants M, λ ≥ 0 and , such that for all 0 ≤ t ≤ t0 and x, the following inequality holds, (23)

This condition, also called the margin assumption, stipulates that the uncertainty of η(x) is bounded. In other words, the margin region close to the decision boundary, , has bounded volume. Moreover, we define the subspace on which the Tsybakov condition holds as , meaning that for sufficiently small t ≤ t0, and . Hence, let (24) and, based on Assumption 1, we can obtain Lemma 3 and Theorem 3.
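The display in (23) is not rendered above; for reference, the textbook form of the Tsybakov margin condition, which Assumption 1 most likely instantiates (this is the standard statement, not necessarily the paper's exact display), is:

```latex
% Standard Tsybakov (margin) condition with constants M, \lambda \ge 0 and t_0:
\mathbb{P}_X\!\left( \left| \eta(X) - \tfrac{1}{2} \right| \le t \right) \le M\, t^{\lambda},
\qquad \text{for all } 0 \le t \le t_0 .
```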

Based on the above definitions and assumptions, Lemma 3 bounds the gap between our model Rour(x) and the Bayes classifier under noisy labels. Theorem 3 is the main result of this paper, revealing that the excess risk between our model Rour(x) and the optimal Bayes model R*(x) remains upper-bounded in the noisy-label case.

Lemma 3 (Upper Bound of the Excess Risk between Rour(x) and ). Assume η(x) satisfies the Tsybakov condition with constants M, λ > 0, and . Assume further that τ01(x) + τ10(x) < 1. Given x, for all 0 ≤ t ≤ t0, the excess risk is (25) as t0 → 0, where (26) and (27)

Theorem 3 (Upper Bound of the Excess Risk between Rour(x) and R*(x)). Assume η(x) satisfies the Tsybakov condition with constants M, λ > 0, and . Assume further that τ01(x) + τ10(x) < 1. Given x, for all 0 ≤ t ≤ t0, the excess risk is (28) as t0 → 0, where (29) and is a small value depending on λ.

Theorem 3 indicates that the risk gap between our model and the best Bayes classifier is dominated by the model’s estimation error ϵ: the smaller the estimation error ϵ, the closer our model is to the Bayes classifier.

Furthermore, the estimation error refers to the difference between the estimated and true model parameters. From the consistency of the parameter estimates, the estimation error ϵ tends to 0 as γ → ∞ and n → ∞, so our model converges to the Bayes model in the large-sample case.

Specifically, from Theorems 1 and 2, we can get with constant , which leads to the fact that . Hence, from Theorem 3, we have

Finally, since logistic regression is essentially a GLM, we can easily generalize the conclusions of Theorems 1, 2, and 3 from γ-logistic regression to the generalized linear model (GLM), obtaining Corollary 1. The proof of Corollary 1 is straightforward and is omitted.

Corollary 1. Under the regularity conditions 1 and 2 and Assumption 1, for each generalized linear model (such as the Probit model), we can obtain the same results as described in Theorems 1, 2, and 3: (i) consistency, (ii) asymptotic normality, and (iii) convergence of the model.

Meta gradient correction algorithm

Although we proved in the preceding subsection that our proposed model possesses oracle properties even in the noisy-label case, a recent study [10] has shown that machine learning models gradually memorize individual data points while adapting to the data distribution. Therefore, when facing noisy labels, all statistical or machine learning methods inevitably suffer reduced generalization ability, and the impact of individual data points must be limited through methods such as early stopping or dropout. However, these methods only passively suppress the effect of individual data points on training, and excessive use often results in underfitting. Therefore, this paper proposes a regularized meta-learning algorithm, Meta Gradient Correction (MGC), for noisy-label data.

Following Hung et al. (2018) [14], the γ-divergence weight function of (Xi, Yi), (30) indicates how likely Yi is to be a noisy label. In other words, a larger ωi means that (Xi, Yi) has a more significant effect on model training and is less likely to carry a noisy label, while a smaller ωi means that the label Yi is more likely to be noisy. On the other hand, the minimum γ-divergence estimation criterion maximizes the objective function Q(β), which leads to the stochastic gradient ascent algorithm.

Hence, we propose an adaptive meta approach that modifies the optimization algorithm: if the γ-divergence weight ωi of (Xi, Yi) is large, a standard gradient step is taken on β; otherwise, the step direction is reversed so as to eliminate the effect of suspected noisy labels on the convergence of the model parameters. To illustrate the validity of this algorithm, we further show Theorem 4.
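A minimal sketch of one such corrected update follows. The helper names (`weight_fn`, `grad_fn`, `threshold`) are illustrative assumptions rather than the released code's API; we adopt the maximization (gradient-ascent) convention for Q(β) and flip the step direction for samples whose γ-divergence weight falls below the threshold.

```python
import numpy as np

def mgc_step(beta, X_batch, y_batch, grad_fn, weight_fn, eta=0.01, threshold=0.5):
    """One meta gradient correction step (illustrative sketch).

    Samples with a large gamma-divergence weight contribute a normal
    ascent step on the objective; suspected noisy samples (small weight)
    contribute with the step direction reversed."""
    w = weight_fn(beta, X_batch, y_batch)        # omega_i for each sample
    signs = np.where(w >= threshold, 1.0, -1.0)  # reverse suspected noise
    grad = np.zeros_like(beta)
    for s, x, yi in zip(signs, X_batch, y_batch):
        grad += s * grad_fn(beta, x, yi)
    return beta + eta * grad / len(y_batch)
```

Any per-sample gradient and weight function can be plugged in; for the γ-logistic case, ωi would be the weight in Eq (30) evaluated at the current β.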

Theorem 4 (Convergence of the MGC Algorithm). Let the number of samples in each batch be N, where the number of noisy-labeled samples is Nnoisy and the number of non-noisy-labeled samples is Nnonnoisy, i.e., N = Nnoisy + Nnonnoisy. Without loss of generality, assume that Nnonnoisy > Nnoisy. Moreover, the function we want to minimize, , is continuously differentiable, and we assume that Q(β) has Lipschitz continuous gradients with constant L, i.e., there exists a constant L > 0 such that ‖∇Q(β) − ∇Q(β′)‖ ≤ L‖β − β′‖ for all β, β′. The algorithm updates the iterates by gradient descent, βt+1 = βt − η∇Q(βt), where η > 0 is the learning rate. We then obtain that Q(βt+1) ≤ Q(βt) − η(1 − Lη/2)‖∇Q(βt)‖2 (31) for all t ≥ 0. Provided η ≤ 2/L, the function values Q(βt) form a non-increasing sequence, which implies that Q(βt) converges.

Theorem 4 implicitly states that our proposed meta gradient correction algorithm makes the objective function Q(β) converge when the proportion of noisy labels is small (refer to Nnonnoisy > Nnoisy). This is also consistent with the results in the Experiments section and demonstrates the validity of our algorithm. Furthermore, following [21], we give three specific forms of the penalty function as particular implementations of our algorithm. One is the SCAD penalty [56], (32) where βj, j = 1, …, d, is an element of the vector β. Another is the MCP [58], where , j = 1, …, d, which addresses variable selection and estimation in the high-dimensional case. The last is the Lasso [63], where PLasso(βj) = |βj|, j = 1, …, d. For ease of description and understanding, we substitute in logistic and probit models to illustrate our algorithm.
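The three penalties named above have well-known closed forms in the cited papers; the sketch below implements those textbook definitions (with the tuning parameter λ folded into each penalty and a = 3.7 for SCAD, as is customary), which may differ superficially from the paper's own parameterization.

```python
def scad_penalty(beta_j, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001) for a single coefficient."""
    b = abs(beta_j)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2            # constant beyond a * lam

def mcp_penalty(beta_j, lam, a=3.0):
    """Minimax concave penalty (MCP) of Zhang (2010)."""
    b = abs(beta_j)
    if b <= a * lam:
        return lam * b - b**2 / (2 * a)
    return a * lam**2 / 2                  # constant beyond a * lam

def lasso_penalty(beta_j, lam):
    """Lasso penalty of Tibshirani (1996)."""
    return lam * abs(beta_j)
```

All three agree with the Lasso near zero, while SCAD and MCP level off for large coefficients, which is what reduces the estimation bias of the Lasso.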

Experiments

In this section, we evaluate the proposed model and algorithm: namely, the capability of the penalized γ-divergence model to learn high-dimensional data and the robustness of the adaptive classification algorithm in modeling noisy data.

Our dataset

We conducted numerous experiments on multiple simulated and real datasets. These datasets have the following three characteristics.

  • Simulation data: We generate noisy-label datasets from random seeds with three different sample dimensions and eight noisy-label ratios. The training sample dimensions are 200×500, 500×1000, and 1000×1500, and the testing sample dimensions are 100×500, 200×1000, and 200×1500. The noisy-label ratios are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8.
  • Real data: We use the breast cancer data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), which contains 571 samples and 31 feature variables. We randomly split it into training and test sets in a 0.8/0.2 ratio: the training set includes 456 samples and the test set 115. We set the noisy-label ratio in the experiments to 0.1, 0.3, and 0.5. Following [18, 47, 52], we randomly select samples from the 0-labeled and 1-labeled samples in the training set (the number of samples chosen is the product of the noisy-label ratio and the number of samples in that category) and flip the labels of the selected samples. We perform the same operation on the test set.
  • Each noisy-label dataset is used to train and test the effectiveness of the eight models in learning and detecting noisy labels. The final results for each model are averaged over 50 different training runs.
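The class-wise label-flipping procedure described above can be sketched as follows; `flip_labels` and its seeding are illustrative choices, not necessarily those of the released code.

```python
import numpy as np

def flip_labels(y, noise_ratio, seed=0):
    """Inject symmetric label noise: within each class, flip
    `noise_ratio` of the samples (flipped count = ratio * class size)."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    for cls in (0, 1):
        idx = np.flatnonzero(y == cls)
        n_flip = int(noise_ratio * len(idx))
        chosen = rng.choice(idx, size=n_flip, replace=False)
        y_noisy[chosen] = 1 - cls
    return y_noisy
```

Applying the same function to the test labels reproduces the paper's evaluation setup of contaminated test sets.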

Settings of models

This subsection provides brief information about the experimental settings, and more details can be found in our open-sourced codes (https://github.com/DebtVC2022/Robust_Learning_with_MGC).

  • Baseline Models: We select the logistic and probit models as baseline models following the research [14], with their optimal settings based on pre-experiments (described in the following paragraph).
  • Relevant Settings: Our models come in multiple versions, obtained by combining two backbone models (the logistic and probit models), three penalty functions (SCAD, MCP, and Lasso), and the proposed meta gradient correction algorithm. For all models, we select hyperparameters based on preliminary experiments and prevailing findings [14, 18, 56, 61]; the final hyperparameters are: γ = 0.5, λ = 1, threshold Tv = 0.5, and learning rate η = 0.01.
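For reference, the final hyperparameter settings listed above can be collected in one place; the key names below are illustrative, not identifiers from the released code.

```python
# Hyperparameters reported in the paper (key names are illustrative).
CONFIG = {
    "gamma": 0.5,         # robustness parameter of the gamma-divergence
    "lambda": 1.0,        # penalty tuning parameter
    "threshold_Tv": 0.5,  # weight threshold for the MGC algorithm
    "lr": 0.01,           # learning rate eta
    "backbones": ["logistic", "probit"],
    "penalties": ["SCAD", "MCP", "Lasso"],
}
```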

Evaluation metrics

Following the studies [14, 16], we choose the accuracy and F1-score to assess the ability of our model to model noisy data.

  • Accuracy based on contaminated label Y: Following the studies [14, 16], we choose the accuracy to assess the ability of our model to learn noisy labels and then denote this evaluation metric as ACC_noisy, which records the number of correctly predicted data points among all data points (X, Y).
  • Accuracy based on true label Y0: Following the studies [14, 16], we also use accuracy to assess the ability of our model to recover the true labels despite noise, denoting this evaluation metric ACC_true; it records the number of correctly predicted data points among all data points (X, Y0).
  • Recall: We denote this evaluation metric by Recall, which records the percentage of samples that a model correctly identifies as belonging to noisy label samples out of the total samples for that class.
  • Precision: We denote this evaluation metric by Precision, which records the ratio of correctly identified noisy-label samples to the total number of samples identified as noisy.
  • F1-score: We denote this evaluation metric by F1, which is calculated by F1 = 2 × Precision × Recall / (Precision + Recall).
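Treating "noisy" as the positive class, the detection metrics above reduce to standard confusion-matrix ratios; a minimal sketch (the function name is ours):

```python
def noisy_label_metrics(is_noisy_true, is_noisy_pred):
    """Recall, Precision, and F1 for noisy-label detection,
    with 'noisy' as the positive class."""
    tp = sum(1 for t, p in zip(is_noisy_true, is_noisy_pred) if t and p)
    fp = sum(1 for t, p in zip(is_noisy_true, is_noisy_pred) if not t and p)
    fn = sum(1 for t, p in zip(is_noisy_true, is_noisy_pred) if t and not p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```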

These five metrics reflect the robustness of the model and its ability to detect noisy labels; larger values indicate a stronger ability to model noisy labels [14].

Ablation study & analysis

In this subsection, we conduct comprehensive evaluations to demonstrate the effectiveness of our proposed model in learning with noisy labels. The evaluations are performed on simulated and real-world data by combining different backbone models (two generalized linear models for binary classification: logistic regression and probit models), penalty functions (SCAD [56], MCP [58], and Lasso [63]), and the proposed meta gradient correction (MGC) algorithm.

Analysis based on simulated data.

Overall analysis based on simulated data. We first evaluate the model performance on simulated data under different settings. As illustrated in Tables 1–18, our proposed penalized γ-divergence model consistently achieves higher accuracy than the baseline model across different noisy ratios (0.1, 0.2, 0.3), backbone models, and penalty functions. This indicates the robustness of our model against noisy labels. More importantly, our model substantially improves Recall, Precision, and F1, the critical metrics for noisy-label identification [62]. These results strongly verify the capability of our model in detecting and rectifying noisily labeled instances. In addition, in practical noisy-label scenarios, we are more concerned with how similar the predicted labels are to the correct labels Y0 and with what percentage of all noisy labels in the dataset is identified [62]. Hence, as depicted in Figs 1–6, we also display the trends of ACC_true and Precision under the different noisy ratios (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8).

Fig 1. ACC_true on test data with γ-logistic model and 500*1000 training data size.

When the proportion of noisy labels is small, introducing the meta gradient correction algorithm can improve the classification accuracy of the model, except when the meta gradient correction algorithm is combined with the SCAD penalty function.

https://doi.org/10.1371/journal.pone.0295678.g001

Fig 2. Precision on test data with γ-logistic model and 500*1000 training data size.

All four subfigures show that the introduction of the meta gradient correction algorithm can increase the precision of model predictions, which means that the reliability of model predictions is improved.

https://doi.org/10.1371/journal.pone.0295678.g002

Fig 3. ACC_true on test data with γ-probit model and 500*1000 training data size.

The four subfigures of this figure can lead to similar conclusions as in Fig 1.

https://doi.org/10.1371/journal.pone.0295678.g003

Fig 4. Precision on test data with γ-probit model and 500*1000 training data size.

The four subfigures of this figure can lead to similar conclusions as in Fig 2.

https://doi.org/10.1371/journal.pone.0295678.g004

Fig 5. ACC_true on training data with γ-logistic model and 500*1000 training data size.

When the proportion of noisy labels is small, introducing the meta gradient correction algorithm can improve the classification accuracy of the model.

https://doi.org/10.1371/journal.pone.0295678.g005

Fig 6. Precision on training data with γ-logistic model and 500*1000 training data size.

All four subfigures show that the introduction of the meta gradient correction algorithm can increase the precision of model predictions, which means that the reliability of model predictions is improved.

https://doi.org/10.1371/journal.pone.0295678.g006

Table 1. Results of evaluation metrics on the test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t001

Table 2. Results of evaluation metrics on the test data and 0.2 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t002

Table 3. Results of evaluation metrics on the test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t003

Table 4. Results of evaluation metrics on the test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t004

Table 5. Results of evaluation metrics on the test data and 0.2 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t005

Table 6. Results of evaluation metrics on the test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t006

Table 7. Results of evaluation metrics on the test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t007

Table 8. Results of evaluation metrics on the test data and 0.2 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t008

Table 9. Results of evaluation metrics on the test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t009

Table 10. Results of evaluation metrics on the test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t010

Table 11. Results of evaluation metrics on the test data and 0.2 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t011

Table 12. Results of evaluation metrics on the test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t012

Table 13. Results of evaluation metrics on the test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t013

Table 14. Results of evaluation metrics on the test data and 0.2 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t014

Table 15. Results of evaluation metrics on the test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t015

Table 16. Results of evaluation metrics on the test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t016

Table 17. Results of evaluation metrics on the test data and 0.2 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t017

Table 18. Results of evaluation metrics on the test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t018

In addition, by incorporating the proposed meta gradient correction algorithm, the model accuracy on true labels (ACC_true) and Precision are further improved under various simulated scenarios. This demonstrates the effectiveness of the meta gradient correction algorithm in enhancing model robustness and resilience to noisy labels. According to our theoretical analyses (refer to Theorem 4), our model is effective when the noisy ratio is small, which is also validated by the experimental results.

The analysis of the impact of sample size based on simulated data. We also evaluated the impact of sample size on model performance. From Tables 1–18, we found that as the training sample size increases from 200 to 1000, the predictive ability of the model first increases and then decreases. Prediction ability improves at first because a larger training set contains more practical information, and a model equipped with the meta gradient correction algorithm or a penalty function can capture more information about the correctly labeled samples. As the training set grows further, it becomes challenging to fit large-scale datasets with simple logistic regression or probit models as the backbone, which usually require additional machinery to fit such data effectively. At this point, combining the meta gradient correction algorithm with a penalized model provides no additional benefit. One potential reason is an inherent conflict between the goals and working mechanisms of the two techniques: when the sample size is small, their combined strengths can offset the problems caused by this conflict, but as the sample size continues to increase, the problem gradually surfaces. A similar phenomenon occurs when the proportion of noisy labels increases.

The analysis of the impact of noisy label ratio based on simulated data. As illustrated in Tables 1–18 and Figs 1–6, ACC_true and Precision fluctuate severely as the noise ratio increases, especially at high noise levels (e.g., ≥0.4). In Fig 1, the evaluation metrics show a smooth downward trend when the model has no penalty function. In Figs 2–4, the evaluation metrics show a fluctuating downward trend when the model includes the penalty function. These fluctuations arise because too many noisy labels in the dataset interfere with the features selected by the penalty function, which poses significant challenges during the training of penalized models. In this case, the model may therefore fail to achieve optimal predictive performance or even to converge. In future work, we will consider strategies to mitigate these challenges, such as introducing smoothing factors or improving the penalty functions to stabilize training under large-scale label noise. Further, Figs 1, 2, 5 and 6 show the evaluation metrics for the same model and noise-ratio settings on the training set and the test set, respectively. We find that ACC_true and Precision behave roughly the same in the same scenarios, with the training-set curves being relatively smoother.
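For reference, the two evaluation metrics can be computed as follows, assuming ACC_true denotes accuracy measured against the underlying clean labels, a common convention in noisy-label experiments (the exact definitions used in our experiments are in the released code):

```python
import numpy as np

def acc_true(y_pred, y_clean):
    """Accuracy of predictions against the underlying clean labels."""
    return float(np.mean(y_pred == y_clean))

def precision(y_pred, y_clean, positive=1):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = y_pred == positive
    if predicted_pos.sum() == 0:
        return 0.0
    return float(np.mean(y_clean[predicted_pos] == positive))
```

Because both metrics are computed against the clean labels rather than the observed noisy ones, they directly measure robustness to label noise.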

In summary, the comprehensive experiments on simulated data validate the effectiveness of our proposed approach in modeling noisy labels and identifying noisy instances. The results are well aligned with our theoretical analyses.

Analysis based on real data.

Overall analysis based on real data. We further benchmark our model on real-world datasets with synthetic noisy labels. As shown in Tables 19–24, the meta gradient correction algorithm consistently improves the results across different noisy ratios and evaluation metrics compared to baseline models. However, incorporating the penalized models leads to inferior performance. A potential reason is that the real datasets have lower dimensionality and complexity than the simulated data and are thus less prone to overfitting, so the benefits of regularization are diminished. Encouragingly, the results further verify the robustness and effectiveness of the proposed meta gradient correction algorithm on real-world noisy-label learning tasks.
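The synthetic noisy labels for the real datasets can be produced by uniformly flipping a fixed fraction of labels. The following is a minimal sketch of such symmetric noise injection; the function name and flipping scheme are illustrative assumptions, and the released code contains the exact procedure used in our experiments.

```python
import numpy as np

def inject_symmetric_noise(y, noise_ratio, num_classes=2, seed=0):
    """Flip a `noise_ratio` fraction of labels uniformly to a different class."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(round(noise_ratio * len(y)))
    # Choose distinct indices to corrupt
    idx = rng.choice(len(y), size=n_flip, replace=False)
    for i in idx:
        # Replace the label with a uniformly chosen *different* class
        choices = [c for c in range(num_classes) if c != y[i]]
        y_noisy[i] = rng.choice(choices)
    return y_noisy
```

Keeping the clean labels alongside the corrupted copy allows metrics such as ACC_true to be evaluated after training.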

Table 19. Results of evaluation metrics on the breast cancer test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t019

Table 20. Results of evaluation metrics on the breast cancer test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t020

Table 21. Results of evaluation metrics on the breast cancer test data and 0.5 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t021

Table 22. Results of evaluation metrics on the breast cancer test data and 0.1 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t022

Table 23. Results of evaluation metrics on the breast cancer test data and 0.3 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t023

Table 24. Results of evaluation metrics on the breast cancer test data and 0.5 noisy ratio.

https://doi.org/10.1371/journal.pone.0295678.t024

In conclusion, extensive experiments on both simulated and real-world datasets demonstrate the capability of our model in handling noisy labels. The meta gradient correction algorithm is verified to deliver consistent performance improvements. In future works, we will focus on alleviating the model fluctuation issue when the noise level is high and evaluating the approach on larger-scale and more complex real-world data. Advanced deep learning techniques can also be explored as backbone models to enhance model capacity and scalability.

Conclusion and future work

In this paper, we have proposed a novel penalized model and a meta gradient correction algorithm, both with grounded theoretical foundations, to detect noisy labels. We demonstrate the effectiveness of the proposed model and algorithm through both theoretical proofs and extensive experiments.

First, we derive Theorems 1 and 2, which show that the estimator of the proposed model possesses consistency and asymptotic normality. Moreover, we obtain Theorem 3 from an empirical risk upper bound derivation, proving that the difference between our model and the optimal Bayesian model is sufficiently small. Furthermore, we propose a meta-learning algorithm for gradient correction and show its convergence based on rigorous theoretical foundations (refer to Theorem 4).

Next, we conducted experiments on multiple simulated and real datasets. These experiments demonstrated the following: (i) the proposed penalized model exhibits remarkable robustness in modeling noisy labels; (ii) the proposed meta gradient correction algorithm shows a promising ability to detect noisy labels; (iii) our model can be easily applied in practical machine learning scenarios.

Finally, we plan to improve our algorithm to make it suitable for large-scale, high-dimensional data. Possible improvements include introducing more expressive models and smoother optimization algorithms. In addition, our algorithm can also be applied in other fields, such as causal inference.

Supporting information

S1 File. The title of this file is ‘Support information for “Robust meta gradient learning for high-dimensional data with noisy-label ignorance”’.

The supporting information file contains all the proofs covered by the manuscript. Specifically, it contains the derivation of Eq (5) and the proofs of Theorems 1, 2, 3, and 4.

https://doi.org/10.1371/journal.pone.0295678.s001

(PDF)

References

  1. Cordeiro F, Sachdeva R, Belagiannis V, Reid I, Carneiro G. LongReMix: Robust learning with high confidence samples in a noisy label environment. Pattern Recognition. 2023;133:109013.
  2. Henriques R, Madeira S. FleBiC: Learning classifiers from high-dimensional biomedical data using discriminative biclusters with non-constant patterns. Pattern Recognition. 2021;115:107900.
  3. Ma W, Zhou X, Zhu H, Li L, Jiao L. A two-stage hybrid ant colony optimization for high-dimensional feature selection. Pattern Recognition. 2021;116:107933.
  4. Ma X, Huang H, Wang Y, Romano S, Erfani S, Bailey J. Normalized loss functions for deep learning with noisy labels. In: International Conference on Machine Learning. PMLR; 2020. p. 6543–6553.
  5. Northcutt C, Jiang L, Chuang I. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research. 2021;70:1373–1411.
  6. Qin Z, Zhang Z, Li Y, Guo J. Making deep neural networks robust to label noise: Cross-training with a novel loss function. IEEE Access. 2019;7:130893–130902.
  7. Shi X, Guo Z, Li K, Liang Y, Zhu X. Self-paced resistance learning against overfitting on noisy labels. Pattern Recognition. 2023;134:109080.
  8. Wang Y, Ma X, Chen Z, Luo Y, Yi J, Bailey J. Symmetric cross entropy for robust learning with noisy labels. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 322–330.
  9. Xu Y, Cao P, Kong Y, Wang Y. L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. Advances in Neural Information Processing Systems. 2019;32.
  10. Han B, Niu G, Yao J, Yu X, Xu M, Tsang I, et al. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv preprint arXiv:1809.11008. 2018.
  11. Jiang L, Zhou Z, Leung T, Li LJ, Fei-Fei L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: International Conference on Machine Learning. PMLR; 2018. p. 2304–2313.
  12. Xiao T, Xia T, Yang Y, Huang C, Wang X. Learning from massive noisy labeled data for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 2691–2699.
  13. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM. 2021;64(3):107–115.
  14. Hung H, Jou ZY, Huang SY. Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics. 2018;74(1):145–154. pmid:28493315
  15. Jiang L, Huang D, Liu M, Yang W. Beyond synthetic noise: Deep learning on controlled noisy labels. In: International Conference on Machine Learning. PMLR; 2020. p. 4804–4815.
  16. Kawashima T, Fujisawa H. Robust and sparse regression via γ-divergence. Entropy. 2017;19(11):608.
  17. Angluin D, Laird P. Learning from noisy examples. Machine Learning. 1988;2:343–370.
  18. Yi L, Liu S, She Q, McLeod A, Wang B. On learning contrastive representations for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 16682–16691.
  19. Jones M, Hjort N, Harris I, Basu A. A comparison of related density-based minimum divergence estimators. Biometrika. 2001;88(3):865–873.
  20. Donoho D, et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture. 2000;1(2000):32.
  21. Liu P, Zhao Y. A review of recent advances in empirical likelihood. Wiley Interdisciplinary Reviews: Computational Statistics. 2023;15(3):e1599.
  22. Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics. 1951; p. 400–407.
  23. Liu T, Tao D. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;38(3):447–461.
  24. Ren M, Zeng W, Yang B, Urtasun R. Learning to reweight examples for robust deep learning. In: International Conference on Machine Learning. PMLR; 2018. p. 4334–4343.
  25. Shu J, Xie Q, Yi L, Zhao Q, Zhou S, Xu Z, et al. Meta-Weight-Net: Learning an explicit mapping for sample weighting. Advances in Neural Information Processing Systems. 2019;32.
  26. Tu B, Zhang X, Kang X, Zhang G, Li S. Density peak-based noisy label detection for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2018;57(3):1573–1584.
  27. Wang R, Mou S, Wang X, Xiao W, Ju Q, Shi C, et al. Graph structure estimation neural networks. In: Proceedings of the Web Conference 2021; 2021. p. 342–353.
  28. Cheng D, Liu T, Ning Y, Wang N, Han B, Niu G, et al. Instance-dependent label-noise learning with manifold-regularized transition matrix estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 16630–16639.
  29. Han B, Yao J, Niu G, Zhou M, Tsang I, Zhang Y, et al. Masking: A new perspective of noisy supervision. Advances in Neural Information Processing Systems. 2018;31.
  30. Hendrycks D, Mazeika M, Wilson D, Gimpel K. Using trusted data to train deep networks on labels corrupted by severe noise. Advances in Neural Information Processing Systems. 2018;31.
  31. Xia X, Liu T, Han B, Wang N, Gong M, Liu H, et al. Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems. 2020;33:7597–7610.
  32. Xia X, Liu T, Wang N, Han B, Gong C, Niu G, et al. Are anchor points really indispensable in label-noise learning? Advances in Neural Information Processing Systems. 2019;32.
  33. Zhang Y, Wang C, Ling X, Deng W. Learn from all: Erasing attention consistency for noisy label facial expression recognition. In: European Conference on Computer Vision. Springer; 2022. p. 418–434.
  34. Zheng G, Awadallah A, Dumais S. Meta label correction for noisy label learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. p. 11053–11061.
  35. Huang J, Qu L, Jia R, Zhao B. O2U-Net: A simple noisy label detection approach for deep neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 3326–3334.
  36. Li S, Ge S, Hua Y, Zhang C, Wen H, Liu T, et al. Coupled-view deep classifier learning from multiple noisy annotators. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; 2020. p. 4667–4674.
  37. Chen P, Chen G, Ye J, Heng PA, et al. Noise against noise: Stochastic label noise helps combat inherent label noise. In: International Conference on Learning Representations; 2020.
  38. Hu W, Li Z, Yu D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In: International Conference on Learning Representations; 2019.
  39. Zhang H, Cisse M, Dauphin Y, Lopez-Paz D. mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations; 2018.
  40. Cheng H, Zhu Z, Li X, Gong Y, Sun X, Liu Y. Learning with instance-dependent label noise: A sample sieve approach. In: International Conference on Learning Representations; 2021.
  41. Ghosh A, Kumar H, Sastry P. Robust loss functions under label noise for deep neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 31; 2017.
  42. Patrini G, Rozza A, Krishna A, Nock R, Qu L. Making deep neural networks robust to label noise: A loss correction approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 1944–1952.
  43. Zhang Z, Sabuncu M. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems. 2018;31.
  44. Zhou T, Wang S, Bilmes J. Robust curriculum learning: From clean label detection to noisy label self-correction. In: International Conference on Learning Representations; 2020.
  45. Han J, Luo P, Wang X. Deep self-learning from noisy labels. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 5138–5147.
  46. Li S, Liu T, Tan J, Zeng D, Ge S. Trustable co-label learning from multiple noisy annotators. IEEE Transactions on Multimedia. 2021.
  47. Tanaka D, Ikami D, Yamasaki T, Aizawa K. Joint optimization framework for learning with noisy labels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 5552–5560.
  48. Zhang Y, Zheng S, Wu P, Goswami M, Chen C. Learning with feature-dependent label noise: A progressive approach. In: International Conference on Learning Representations; 2020.
  49. Zheng S, Wu P, Goswami A, Goswami M, Metaxas D, Chen C. Error-bounded correction of noisy labels. In: International Conference on Machine Learning. PMLR; 2020. p. 11447–11457.
  50. Li J, Socher R, Hoi S. DivideMix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394. 2020.
  51. Wang Y, Liu W, Ma X, Bailey J, Zha H, Song L, et al. Iterative learning with open-set noisy labels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 8688–8696.
  52. Liu S, Niles-Weed J, Razavian N, Fernandez-Granda C. Early-learning regularization prevents memorization of noisy labels. Advances in Neural Information Processing Systems. 2020;33:20331–20342.
  53. Fujisawa H, Eguchi S. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis. 2008;99(9):2053–2081.
  54. Owen A. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75(2):237–249.
  55. Qin J, Lawless J. Empirical likelihood and general estimating equations. The Annals of Statistics. 1994;22(1):300–325.
  56. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
  57. Chen X, Pouzo D. Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals. Journal of Econometrics. 2009;152(1):46–60.
  58. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38(2):894–942.
  59. Fan GL, Liang HY, Shen Y. Penalized empirical likelihood for high-dimensional partially linear varying coefficient model with measurement errors. Journal of Multivariate Analysis. 2016;147:183–201.
  60. Shi Z. Econometric estimation with high-dimensional moment equalities. Journal of Econometrics. 2016;195(1):104–119.
  61. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010; p. 101–148. pmid:21572976
  62. Cannings T, Fan Y, Samworth R. Classification with imperfect training labels. Biometrika. 2020;107(2):311–330.
  63. Greener J, Kandathil S, Moffat L, Jones D. A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology. 2022;23(1):40–55. pmid:34518686