
Residual metric learning with class-specific consistency for multiclass classification

Abstract

Least squares regression (LSR) has been widely used in pattern recognition due to its concise form and ease of solution. However, inadequate exploration of the inter-class margin and intra-class similarity limits its discriminative ability. To this end, we present a novel method called residual metric learning with class-specific consistency for multiclass classification (RMLCC). Specifically, RMLCC jointly learns a projection matrix and a metric matrix for the regression residuals in a compact framework. This joint learning mechanism makes the inter-class margin of the projected instances as large as possible in the learned metric space, prompting the instances of different classes to be separated. To further improve the generalization, a class-specific consistency constraint that stimulates intra-class similarity is embedded into the joint learning framework. To solve the proposed model, we propose an alternating optimization algorithm which guarantees weak convergence. With the interactive optimization of the projection matrix and metric matrix, RMLCC can fully exploit the structure and supervised information of the data and thus has the potential to outperform other methods. Extensive experiments on several benchmark datasets demonstrate the validity of the proposed method.

Introduction

Least squares regression (LSR) [1] has been widely used in many applications, such as image recognition [2] and discriminative learning [3], due to its concise formulation and efficient solution. By embedding some given prior information into the objective function, LSR can also be tailored for different tasks such as regression [1] or classification [4]. Here, we focus on the task of multiclass classification, where one instance is assigned to one of a number of discrete classes. Broadly speaking, a good classifier should perform well in terms of discrimination and generalization. The conventional LSR-based classification model aims to learn a mapping function that projects the input instances into the binary label space by solving a mean square error minimization problem. This operation not only degrades the discrimination performance but also easily leads to overfitting [1,5], as it encourages a constant Euclidean distance between the regression responses of any two instances from different classes and does not consider the intra-class similarity.

To enhance the discrimination, label relaxation techniques pursuing large inter-class margin have been successively explored. Xiang et al. [2] proposed a discriminative LSR (DLSR) model, which enlarges the distance between regression targets of different classes using the ε-dragging technique. Wang et al. [6] presented a margin scalable discriminative LSR (MSDLSR) method by introducing a sparsity constraint on the dragging values. Zhang et al. [7] constructed a retargeted least squares regression (ReLSR) model, which directly learns the regression targets with large inter-class margin. To improve the generalization, various sparse or low rank regularization terms are introduced into the objective function to maintain the intra-class similarity or structural consistency. In [8], a discriminative LSR method based on inter-class sparsity (ICS_DLSR) was proposed. ICS_DLSR introduces an inter-class sparsity constraint to reduce the intra-class margin while increasing the inter-class margin, and introduces the error factors to improve the discriminability of the model. A group low-rank representation-based discriminative linear regression (GLRRDLR) model was presented by Zhan et al. [9], which imposed a class-wise low-rank constraint on the latent features.

In addition, several subspace learning methods have also been developed to extract the structural information and improve the generalization. Fang et al. [10] presented a robust latent subspace learning (RLSL) model through the combination of latent representation with linear regression. Zhang et al. [11] proposed a pairwise relations oriented discriminative regression (PRDR) model, in which the pairwise relations of the label and the instances are transferred to the latent subspace by means of cosine similarity and manifold regularization. These methods explore the manifold structure of the instance in the latent subspace under the guidance of a pre-computed similarity graph. However, those pre-computed graphs often employ features from original data and cannot be adaptively learned during the training to better express the structures of instances. To make matters worse, constructing a graph Laplacian with a simple weight function is not discriminative enough, but using the complex weight function increases the complexity exponentially. Zadeh et al. [12] described the distance structure between instances from the perspective of metric learning, and proposed a geometric mean metric learning (GMML) model. Unlike the manifold learning based methods, GMML minimizes an unconstrained strictly (geodesically) convex optimization problem, allowing for a closed-form solution that yields smaller distances for similar instances and larger distances for dissimilar instances.

Although the above methods try to pursue a large inter-class margin and intra-class consistency, they still cannot accommodate both inter-class separability and intra-class similarity, because transformation-matrix learning and structure exploration are carried out independently. How to effectively exploit the structural information and label relations to learn a discriminative representation remains to be explored. In this paper, we propose a novel model, referred to as residual metric learning with class-specific consistency for multiclass classification (RMLCC). In particular, we jointly learn a projection matrix and a metric matrix for the regression residuals, which allows the inter-class margin of the projected instances to be as large as possible in the learned metric space. Furthermore, a class-specific consistency constraint is embedded in the joint learning framework to ensure intra-class similarity. In this way, the discriminant projection matrix and the residual metric matrix are mutually reinforcing. This allows the model to account for both inter-class separability and intra-class similarity. The key contributions of RMLCC are outlined as follows.

  • A joint learning framework is proposed to learn a projection matrix and a metric matrix for its residuals. This operation actually constructs an adaptive loss function based on the learnt metric matrix, aiming to separate instances of different classes and gather together instances of the same class as much as possible.
  • A class-specific consistency regularization is introduced into the joint learning framework to ensure that instances of the same class remain structurally consistent after projection, fully exploiting the similarity between instances of the same class and thus alleviating the overfitting problem.
  • The solution to the objective function is studied and the corresponding algorithm is developed. Experiments are performed on six benchmark datasets against seven state-of-the-art methods to evaluate the performance of the proposed method.

The rest of this paper is organized as follows. Section Related work provides a brief overview of the relevant methods. Section The proposed method presents our proposed method in detail. The experimental results and analysis are reported in Section Experiment. The conclusion is finally given in Section Conclusion.

Related work

We first introduce the notation used in this paper, and then provide a review of some related works. For a matrix A, A_{i,j} denotes the (i,j)-th element of A, and A_{i,:} and A_{:,j} denote the i-th row vector and j-th column vector, respectively. \|A\|_{1,2} denotes the l_{1,2}-norm, i.e., the sum of the l_2-norms of the columns of A. The squared Frobenius norm is defined as \|A\|_F^2 = \mathrm{tr}(A^{\top}A), where tr(A) is the trace of A. For a symmetric positive definite (SPD) matrix M, M^{-1} denotes the inverse of M.

Denote a set of n instances with d features by X \in \mathbb{R}^{n \times d}. The binary label indicator matrix is defined as Y \in \{0,1\}^{n \times c}, where each row contains a single 1 whose position indicates the class of the corresponding instance, and c is the number of classes. Record the instances from the jth class and their labels as X_j and Y_j, and the instances excluding class j and their labels as \bar{X}_j and \bar{Y}_j. n_j denotes the number of instances from the jth class, with \sum_{j=1}^{c} n_j = n.

LSR and DLSR

For the multiclass classification task, LSR seeks an optimal projection matrix by solving the following problem:

\min_{W} \; \|XW - Y\|_F^2 + \lambda \|W\|_F^2 \qquad (1)

where the bias is absorbed into X as an additional dimension with all elements equal to 1, W is the projection matrix to be learned, and λ is a regularization parameter. Eq.(1) is strictly convex and admits a closed-form solution. Intuitively, it is usually hoped that instances of different classes will be as far apart from each other as possible after projection, while instances of the same class will be closer together after projection, which is beneficial for subsequent classification. However, Eq.(1) encourages a constant distance of \sqrt{2} between the regression responses of any two instances in different classes, and does not take into account the similarity between instances belonging to the same class. Therefore, the discrimination of LSR in multiclass classification still needs to be improved.
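As a minimal illustration, the closed-form solution of (1) and the resulting classifier can be sketched in a few lines of Python (the toy data and variable names here are ours, not from the paper):

```python
import numpy as np

def lsr_fit(X, Y, lam=0.1):
    # Closed-form ridge solution of Eq. (1): W = (X^T X + lam*I)^{-1} X^T Y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
# Two well-separated toy classes; the bias is absorbed as an all-ones column.
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
X = np.hstack([X, np.ones((10, 1))])
Y = np.zeros((10, 2)); Y[:5, 0] = 1; Y[5:, 1] = 1
W = lsr_fit(X, Y)
pred = np.argmax(X @ W, axis=1)   # class with the largest regression response
```

With well-separated data the argmax of the regression response recovers the correct labels, but, as noted above, the fixed binary targets limit the discrimination of this baseline.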

To further enhance the discriminative capability, DLSR relaxes the binary label indicator matrix and drags the regression targets of different classes along the opposite directions using a “ε-dragging” technique. The particular model is

\min_{W, E} \; \|XW - (Y + B \odot E)\|_F^2 + \lambda \|W\|_F^2, \quad \text{s.t. } E \geq 0 \qquad (2)

where ⊙ denotes the Hadamard product operator for matrices, and E \in \mathbb{R}^{n \times c} is a nonnegative dragging matrix whose optimal value, with W fixed, is E = \max(B \odot (XW - Y), 0). The matrix B is defined as follows:

B_{i,j} = \begin{cases} +1, & \text{if } Y_{i,j} = 1 \\ -1, & \text{otherwise} \end{cases} \qquad (3)

The objective Eq.(2) is jointly convex with respect to W and E, and can be solved through alternating optimization. Although the “ε-dragging” relaxation can enhance the inter-class separability, it can also cause instances from the same class to be uncorrelated after projection, as it does not consider the intra-class compactness.
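The alternating optimization of (2) can be sketched as follows; this is a minimal reading of DLSR under the standard updates (the data and default parameters are ours):

```python
import numpy as np

def dlsr(X, Y, lam=0.1, iters=20):
    # B from Eq. (3): +1 where the binary label is 1, -1 elsewhere.
    B = np.where(Y == 1, 1.0, -1.0)
    E = np.zeros_like(Y)
    d = X.shape[1]
    for _ in range(iters):
        T = Y + B * E                                    # relaxed targets
        # W-step: ridge regression against the relaxed targets.
        W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)
        # E-step: optimal nonnegative dragging with W fixed.
        E = np.maximum(B * (X @ W - Y), 0.0)
    return W, E

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (6, 3)), rng.normal(2, 0.2, (6, 3))])
Y = np.zeros((12, 2)); Y[:6, 0] = 1; Y[6:, 1] = 1
W, E = dlsr(X, Y)
```

The dragging matrix E stays elementwise nonnegative by construction, which is exactly what allows the targets of different classes to move apart along opposite directions.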

GMML

Unlike the above methods that use the binary labels directly as supervised information, Zadeh et al. [12] proposed a geometric mean metric learning method to measure the structural relationships between instances using the side information of the paired instances. The formula is

\min_{M \succ 0} \; \sum_{(x_i, x_j) \in \mathcal{S}} d_M^2(x_i, x_j) + \sum_{(x_i, x_j) \in \mathcal{D}} d_{M^{-1}}^2(x_i, x_j) + \mu\, D_{sld}(M, M_0) \qquad (4)

Here, M is a symmetric positive definite (SPD) matrix to be learned, and M_0 is a prior SPD matrix for M. \mathcal{S} and \mathcal{D} record the similar and dissimilar instance pairs, respectively. d_M^2(x_i, x_j) = (x_i - x_j)^{\top} M (x_i - x_j) denotes the squared Mahalanobis distance between x_i and x_j, and μ is a regularization parameter. D_{sld}(M, M_0) is the symmetrized LogDet divergence. Eq.(4) is a strictly (geodesically) convex optimization problem with the closed-form solution

M = \left(S + \mu M_0^{-1}\right)^{-1} \,\sharp_{\alpha}\, \left(D + \mu M_0\right) \qquad (5)

where S = \sum_{(x_i, x_j) \in \mathcal{S}} (x_i - x_j)(x_i - x_j)^{\top}, D = \sum_{(x_i, x_j) \in \mathcal{D}} (x_i - x_j)(x_i - x_j)^{\top}, A \,\sharp_{\alpha}\, B is the geometric mean operator [12], and \alpha \in [0, 1] is a weighting parameter.
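The geometric mean in (5) has a well-known explicit form via matrix square roots. The sketch below implements the unregularized case (μ = 0, α = 1/2); the pair sets and their scatter matrices follow the definitions above:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def geometric_mean(A, B):
    # Geodesic midpoint of SPD matrices:
    # A #_{1/2} B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}
    Ah = sqrtm(A)
    Ahi = inv(Ah)
    return Ah @ sqrtm(Ahi @ B @ Ahi) @ Ah

def gmml(sim_pairs, dis_pairs):
    # Unregularized GMML solution M = S^{-1} #_{1/2} D, with S and D the
    # scatter matrices of the similar and dissimilar pairs, respectively.
    S = sum(np.outer(u - v, u - v) for u, v in sim_pairs)
    D = sum(np.outer(u - v, u - v) for u, v in dis_pairs)
    return geometric_mean(inv(S), D)

e1, e2, o = np.array([1., 0.]), np.array([0., 1.]), np.zeros(2)
# Similar pairs at unit distance, dissimilar pairs at twice that distance.
M = gmml([(e1, o), (e2, o)], [(2 * e1, o), (2 * e2, o)])
```

For this symmetric toy configuration S = I and D = 4I, so the solution is simply M = 2I, i.e., the metric interpolates geometrically between the two scatter structures.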

The proposed method

Fig 1 illustrates the framework of our RMLCC. Different from DLSR and GMML, RMLCC jointly learns a projection matrix and a metric matrix for the regression residuals in a unified framework, such that the learned metric yields smaller distances for the regression residuals of the same class and larger distances for those of different classes. To further improve the discriminative performance, a class-specific consistency constraint is imposed on the transformed instances of each class.

Fig 1. Overview of the structure of RMLCC.

We learn the projection matrix W and metric matrix M simultaneously, so that the transformed instances of different classes can be easily separated in the learned metric space.

https://doi.org/10.1371/journal.pone.0345369.g001

Target-margin dragging

To enhance discrimination, we enlarge the inter-class targets’ margins while reducing those within the same class. Considering that GMML [12] has achieved great success in metric learning and fits our goal well, we adopt a similar way to drag the regression margins in our model. Certainly, other metric learning techniques could replace GMML, but that is beyond our focus.

For convenience, we first reformulate (1) in an equivalent form as

(6)

To conduct target-margin dragging, we learn a metric matrix M such that the regression errors for the same class are as small as possible under its metric, while the errors for the remaining classes are as large as possible under its metric. Then we have

(7)

where d_M^2(\cdot) and d_{M^{-1}}^2(\cdot) denote the squared Mahalanobis distances under the metrics M and M^{-1}, respectively. To alleviate the problem of class imbalance, proportional coefficients are incorporated into the first and second terms of (7), respectively.

It is worth noting that we perform metric learning for the residuals rather than the instances themselves, which enables the discriminant projection and metric matrix for the instances to be synchronized and mutually reinforcing. By embedding metric M between errors of the same class and its inverse between errors of different classes, the regressor’s discriminability can be enhanced.
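The interplay of the two metrics on the residuals can be sketched numerically. The helper below is our simplified reading of the two terms in (7), with the class-proportion coefficients omitted:

```python
import numpy as np

def margin_terms(X, W, Y, M):
    # Same-class residuals measured under M (to be made small), residuals
    # to the other classes' one-hot targets measured under M^{-1} (to be
    # made large). Coefficients balancing class sizes are omitted here.
    Minv = np.linalg.inv(M)
    R = X @ W - Y                                # residual to own target
    same = np.einsum('ij,jk,ik->', R, M, R)      # sum_i r_i^T M r_i
    c = Y.shape[1]
    other = 0.0
    for i in range(X.shape[0]):
        for k in range(c):
            if Y[i, k] != 1:                     # targets of other classes
                r = X[i] @ W - np.eye(c)[k]
                other += r @ Minv @ r
    return same, other

rng = np.random.default_rng(2)
X, W = rng.normal(size=(8, 4)), rng.normal(size=(4, 3))
Y = np.eye(3)[rng.integers(0, 3, 8)]
same, other = margin_terms(X, W, Y, np.eye(3))
```

When M is the identity, the first term reduces exactly to the Frobenius loss of LSR; learning M away from the identity is what makes the loss adaptive.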

Class-specific consistency

When binary labels are used as the regression targets, the desired output of the jth class instances X_j is

Y_j = \begin{bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} \in \{0,1\}^{n_j \times c} \qquad (8)

where the 1s lie in the jth column. The index of the nonzero column indicates the class identity of the instance collection X_j.

It is clear from (8) that the output matrix of the instances from class j has a consistently sparse structure, i.e., only the jth column is nonzero. The regression outputs X_jW of the instances X_j should share the same nonzero-column index, which promotes a consistent structure within the same class. To this end, we introduce the class-specific consistency regularization via the following proposition.

Proposition 1. For any non-zero matrix A, \|A\|_{1,2} = \|A\|_F if and only if A has a single nonzero column.

Proof. Please refer to the Appendix S1 Appendix for the detailed proof of Proposition 1. □
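A quick numerical check of this reading of the proposition (the l_{1,2}-norm taken as the sum of column l_2-norms, which is our assumption about the elided definition):

```python
import numpy as np

def l12_norm(A):
    # Assumed definition: sum of the l2 norms of the columns of A.
    return np.linalg.norm(A, axis=0).sum()

# A matrix with a single nonzero column vs. one with mass spread out.
A_single = np.zeros((4, 3)); A_single[:, 1] = [1., 2., 3., 4.]
A_spread = np.ones((4, 3))
```

For the single-column matrix the l_{1,2}-norm equals the Frobenius norm, while spreading the same kind of mass over several columns makes it strictly larger, which is why minimizing this norm drives X_jW toward a single active column.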

As stated in Proposition 1, the class-specific consistency regularization satisfies our search for intra-class similarity, so we use it for the transformed instances, i.e.,

\sum_{j=1}^{c} \|X_j W\|_{1,2} \qquad (9)

Objective function

Combining the target-margin dragging in (7) with the class-specific consistency regularization in (9), the residual metric learning with class-specific consistency for multiclass classification is formulated as

(10)

Model (10) enjoys the following three valuable properties:

  • The first term is the squared regression residual of the jth class under the metric M, which increases monotonically with respect to M. Thus, minimizing this term yields a small distance for the same class.
  • The second term is the regression residual of the classes other than the jth class, which decreases monotonically with respect to M. Thus, minimizing this term yields a large distance for the different classes.
  • The last term is the class-specific consistency regularization, which ensures the structural consistency of the transformed instances of the same class.

Since M is an SPD matrix, (10) actually finds a more discriminative transformation for X in the metric space induced by M. Our RMLCC is not a simple combination of LSR and GMML; it employs the adaptive Mahalanobis distance as the loss function rather than the Frobenius norm used in LSR.

Optimization

Using matrix trace operation, (10) can be simplified and reformulated as

(11)

where L = diag(·) and D = diag(·) are diagonal weighting matrices. Here, c denotes the number of classes, y_i is the label of x_i, the first diagonal records the number of instances associated with each label, and the second records the number of instances not associated with label j.

The optimization problem (11) is nonconvex and involves two variables. Here, an iterative algorithm based on the ADMM [13–15] framework is employed to solve problem (11). First, two auxiliary variables U and F are introduced to make the optimization problem (11) separable as follows:

(12)

Then, we obtain the following augmented Lagrangian function of Eq.(12)

(13)

where P and Z are Lagrange multipliers, and μ and σ are penalty factors. Next, the variables will be updated alternately.

Step 1. Fixing the other variables, W is updated by minimizing the following problem:
(14)

By setting the derivative with respect to W to zero, we obtain

(15)

where the coefficient matrices follow from (14); W is then obtained by solving a Sylvester equation [16].
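A Sylvester equation AW + WB = C of this kind has a standard direct solver (the Bartels–Stewart algorithm, which itself relies on the Schur decompositions mentioned later). The concrete coefficient matrices of update (15) are built from the data and multipliers; the sketch below solves a generic well-posed instance with assumed SPD coefficients:

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Generic Sylvester equation A W + W B = C.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); A = A @ A.T + 4 * np.eye(4)   # SPD coefficient
B = rng.normal(size=(3, 3)); B = B @ B.T + np.eye(3)       # SPD coefficient
C = rng.normal(size=(4, 3))
W = solve_sylvester(A, B, C)                               # Bartels–Stewart
```

SPD coefficients guarantee that A and −B share no eigenvalues, so the equation has a unique solution.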

Step 2. Fixing the other variables, U is updated by solving the following problem:
(16)

By setting the derivative with respect to U to zero, we obtain

(17)

where the coefficient matrices follow from (16); U is then obtained by solving a Sylvester equation.

Step 3. Fixing the other variables, F is updated by solving the following problem:
(18)

Obviously, the problem can be solved for each F_j independently, where F_j is the jth subset of F. If we define the auxiliary matrix H accordingly, then the optimization problem (18) is equivalent to the following problems:

(19)

where H_j is the jth subset of H, corresponding to the instances from the jth class. Denoting the kth columns of F_j and H_j by f_k and h_k, respectively, the objective of problem (19) can be reformulated as

(20)

To simplify the problem (20), we linearize it as follows:

(21)

where the superscript t denotes the value of the variable at the t-th iteration. By setting the derivative to zero, we obtain

(22)

Then,

(23)

Defining the corresponding shorthand terms, we finally obtain

(24)

that is .

Step 4. Fixing the other variables, M is updated by solving the following problem:
(25)

Eq. (25) enjoys a closed-form solution

(26)

where the operands are defined as in (5), and α is a weighting parameter.

Step 5. The Lagrange multipliers P and Z and the penalty factors μ and σ are updated as:
(27)

In Steps 1–2, the coefficient matrices are fixed across iterations, so we can perform their Schur decompositions once and store the results in advance to speed up the computation. The pseudocode of RMLCC is shown in Algorithm 1.

Algorithm 1 Algorithm for Solving RMLCC.

Input Training instance matrix X, label matrix Y, parameters , , .

Output Transform matrix W and metric matrix M.

1: Initialization: W with random values, M with the identity matrix, U = W, F = XW, Z = 0, P = 0, and the penalty factors set to their initial values.

2: Perform the Schur decomposition of the fixed coefficient matrices in (15) and (17), and store the results.

3: while not converged do

4:  Update W by solving (15).

5:  Update U by solving (17).

6:  Update F by solving (24).

7:  Update M by solving (26).

8:  Update P, Z, μ and σ according to (27).

end while

return W and M.

Prediction

Once the optimal W and M are obtained, the label of a given test instance is predicted by

(28)

where the regression target encoding specified for the kth class is used, k = 1, …, c.
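The details of rule (28) are not fully spelled out above; one natural reading, consistent with the learned residual metric, is to assign the class whose target encoding is nearest to the projected instance under M. The helper below is a hypothetical sketch of that reading with one-hot encodings:

```python
import numpy as np

def predict(x, W, M, c):
    # Hypothetical reading of rule (28): pick the class whose one-hot
    # target is nearest to the projected instance xW under the metric M.
    z = x @ W
    dists = [(z - np.eye(c)[k]) @ M @ (z - np.eye(c)[k]) for k in range(c)]
    return int(np.argmin(dists))
```

For example, with W and M both set to the identity and c = 3, a one-hot input is simply assigned to its own class.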

Complexity and convergence analysis

As seen in Algorithm 1, the main time cost lies in solving the Sylvester equations and computing M. The Schur decompositions of the fixed coefficient matrices are performed only once outside the loop, with computational complexity O(d^3). The computational complexity of M and its inverse in each iteration is O(r^3). For F, the computational complexity is O(nc). The computations of the Lagrange multipliers P and Z and the penalty factors μ and σ are very simple, so their costs can be ignored. Assuming t is the number of iterations, the total computational complexity of RMLCC is about O(d^3 + t(r^3 + nc)).

The optimization problem (13) is non-convex with respect to all unknown variables, so it is difficult to prove the strong convergence property [13] of the algorithm. It is worth noting that the Karush–Kuhn–Tucker (KKT) conditions are necessary conditions for a constrained local optimal solution, and any converging point must be a KKT point. The following theorem guarantees a weak convergence property of the proposed optimization algorithm.

Theorem 1. Let the variables be the solution of (13) at the kth iteration. Assume the generated sequence is bounded; then every limit point of the sequence is a Karush–Kuhn–Tucker (KKT) point of problem (13). Whenever the sequence converges, it converges to a KKT point.

Proof. Please refer to the Appendix S2 Appendix for the detailed proof of Theorem 1. □

Experiment

In this section, we conduct several experiments to demonstrate the effectiveness of the proposed method on six benchmark datasets. The main information of the six datasets is listed in Table 1. In particular, the proposed method is compared to several related state-of-the-art methods: DLSR [2], ReLSR [7], MSDLSR [6], RLSL [10], ICS_DLSR [8], GLRRDLR [9], and PRDR [11]. For each group of experiments, all methods are repeated 10 times with random combinations of training and test instances. All experiments are implemented in MATLAB R2017a on a Windows 7 system with an Intel Core i7-8550 CPU and 8 GB RAM.

Table 1. Brief description of the benchmark datasets used in experiments.

https://doi.org/10.1371/journal.pone.0345369.t001

Experiments on face databases

In this part, four challenging face databases are chosen to evaluate these methods. All of the facial images have been cropped and resized to 32 × 32 pixels.

  1. (1) Extended Yale B [17]: This dataset contains face images of 38 individuals, totaling 2414 facial images. We randomly select 15, 20, 25 and 30 images from each class as the training set and use the rest for testing. The results are listed in Table 2. From the experimental results, ICS_DLSR and GLRRDLR perform better than DLSR and ReLSR. PRDR achieves better performance than ICS_DLSR and GLRRDLR. RMLCC achieves the best performance by taking into account both inter-class margin and intra-class consistency.
  2. (2) AR [18]: The database contains over 4000 face images of 126 individuals. We select a subset containing 3120 grayscale images from 120 subjects. For each individual, we randomly select 4, 6, 8, and 12 images for training, and use the rest for testing. The experimental results are reported in Table 3. It should be noted that the classification accuracies of ICS_DLSR and GLRRDLR are about 5% higher than those of DLSR, ReLSR, MSDLSR and RLSL when the number of training instances is small, because they consider the inter-class margin and intra-class similarity. PRDR exploits the local relationship information from both the instance space and the label space, achieving performance comparable to RMLCC. RMLCC achieves the best performance by performing metric learning and group consistency to effectively capture the structural information in the data.
  3. (3) LFW [19]: This is a challenging large-scale wild image dataset. Here, we use a subset consisting of 1251 images from 86 individuals with only 10–20 images per subject. We randomly select 5, 6, 7, and 8 images of each subject as training instances, and use the remaining face images as test instances. Table 4 reports the experimental results on this dataset. It is obvious that the classification accuracies of all methods are relatively low, confirming that LFW is a very challenging database for face recognition. The accuracies of DLSR, ReLSR, MSDLSR and RLSL are grouped together at the lowest level. Encouragingly, the proposed RMLCC still comes out on top. GLRRDLR and PRDR are comparable to RMLCC, and more robust than DLSR and ReLSR. These experimental results prove the superiority of the proposed method on challenging datasets.
  4. (4) CMU PIE [20]: This dataset contains 68 subjects with 41,368 face images in total. Here, we compare all methods on a subset of PIE, in which each person has 170 images gathered under five different poses (C05, C07, C09, C27 and C29). As can be seen from Table 5, the proposed method is superior to the other label-relaxation-based regression methods. For example, the classification accuracy of the proposed method is up to about 4% higher than that of DLSR. RMLCC uses metric learning and the class-specific consistency constraint to explore the inter-class margin and intra-class similarity. As a result, its classification accuracy is higher than that of PRDR.
Table 2. Mean classification accuracies (%) and standard deviations of different methods on the Extended Yale B face database.

https://doi.org/10.1371/journal.pone.0345369.t002

Table 3. Mean classification accuracies (%) and standard deviations of different methods on the AR face database.

https://doi.org/10.1371/journal.pone.0345369.t003

Table 4. Mean classification accuracies (%) and standard deviations of different methods on the LFW face database.

https://doi.org/10.1371/journal.pone.0345369.t004

Table 5. Mean classification accuracies (%) and standard deviations of different methods on the PIE face database.

https://doi.org/10.1371/journal.pone.0345369.t005

The experimental results presented in Tables 2–5 show that the proposed method achieves the best classification accuracy in comparison with the related methods, which indicates its effectiveness in face recognition.

Experiments on the object database

In this subsection, the Columbia Object Image Library (COIL100) [21] is chosen for the evaluation of the effectiveness of the proposed method. This database consists of 7200 images of 100 objects; each object has 72 images under different lighting conditions. For each class, we randomly select 15, 20, 25, and 30 images for training and use the rest for testing. As can be seen from the results in Table 6, our method outperforms all of the comparison methods.

Table 6. Mean classification accuracies (%) and standard deviations of different methods on the COIL100 database.

https://doi.org/10.1371/journal.pone.0345369.t006

As discussed in the Class-specific consistency subsection, the class-specific consistency (i.e., only a single nonzero column of X_jW) is essential to ensure the intra-class similarity of the instances. To demonstrate the preservation of intra-class similarity by our method, Fig 2 shows the transformed instances X_5W obtained by LSR, PRDR and RMLCC on instances from the first 20 classes of the COIL100 dataset. One can see that the largest elements of the transformed instances X_5W obtained by our method are all located in the fifth column, while the other columns are almost 0. This ensures the intra-class consistency and can effectively avoid overfitting. To further visually demonstrate the discriminative ability of RMLCC, Fig 3 shows the t-SNE [22] visualization of the original instances and the transformed features learned by RMLCC. For clarity, we selected 1440 instances from the first 20 classes for visualization. It is clear from Fig 3A that the instance distributions of some classes are highly scattered and even overlap. Fig 3B shows better clustering by class than Fig 3A. However, in the right half of Fig 3B, the data distributions of different classes still overlap. As shown in Fig 3C, our RMLCC ensures that transformed features of the same class lie close together, while features of different classes lie further apart. This confirms the validity of the proposed method.
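A Fig 3-style visualization can be reproduced with an off-the-shelf t-SNE implementation; the sketch below uses synthetic stand-in features (the real inputs would be the transformed instances XW):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the transformed features of 20 classes, 10 instances each:
# each class is offset so that classes form separated clusters.
feats = rng.normal(size=(200, 32)) + 3.0 * np.repeat(np.arange(20), 10)[:, None]
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(feats)
```

The resulting 2-D embedding can then be scattered with one color per class, as in Fig 3.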

Fig 2. Class-specific consistency heatmaps of the transformation matrix obtained by LSR, PRDR and RMLCC.

https://doi.org/10.1371/journal.pone.0345369.g002

Fig 3. t-SNE visualization of original data, PRDR and RMLCC features on COIL100 dataset.

The 1440 instances from the first 20 classes are visualized when 30 images per subject are used for training. Both training instances and testing instances are visualized.

https://doi.org/10.1371/journal.pone.0345369.g003

Experiments on the scene database

Here, we use the spatial pyramidal features of the Fifteen Scene Categories database (Scene15_SPM) [23] to evaluate the proposed method. This data set contains 4485 natural images from 15 different classes, such as bedroom, industry, coast, street and building. We randomly select 10, 20, 30 and 40 instances from each class as the training set, and the remaining as the test set. The experimental results on the Scene15_SPM dataset are reported in Table 7. From Table 7, we can see that our method achieves the best performance among all the methods. This demonstrates the effectiveness of our method in dealing with the scene classification task. Additionally, Fig 4 shows the confusion matrix of our RMLCC on the Scene15_SPM dataset. Specifically, the classification accuracy (%) for the corresponding class is given by the diagonal elements of the confusion matrix. Notably, all classes achieved high classification accuracies, and the worst performance is still acceptable at 95.64%, also reflecting the superiority of our RMLCC.

Table 7. Mean classification accuracies (%) and standard deviations of different methods on the Scene15_SPM database.

https://doi.org/10.1371/journal.pone.0345369.t007

Fig 4. Confusion matrices of RMLCC on the Scene15_SPM dataset.

https://doi.org/10.1371/journal.pone.0345369.g004

Statistical significance

To systematically compare the different methods, the Friedman test and post hoc Nemenyi test [24] are used to compare the classification accuracy of the eight methods on the six benchmark datasets with different numbers of training instances. In the experiments, two algorithms are regarded as significantly different if their average ranks differ by at least the critical difference (CD). Fig 5 shows the CD diagrams for the eight comparison methods on the six benchmark datasets with different numbers of training instances. The average rank of each comparison method is marked along the axis. The axis is oriented so that the lowest ranks (best performance) are on the right. The methods in the groups linked by a red line are not significantly different. As shown in Fig 5, ICS_DLSR, RLSL, MSDLSR, ReLSR and DLSR receive higher (worse) average ranks and are significantly different from RMLCC. This result shows that RMLCC is well ahead of most other methods.
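The Friedman statistic and the Nemenyi critical difference used above can be computed as follows; the accuracy numbers here are toy values for illustration only, not from the paper's tables:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Toy accuracies (%) of k = 3 methods (columns) over N = 4 splits (rows).
acc = np.array([[90., 91., 95.],
                [88., 90., 94.],
                [85., 89., 93.],
                [91., 92., 96.]])
stat, p = friedmanchisquare(*acc.T)      # ranks methods within each split

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)).
k, N = acc.shape[1], acc.shape[0]
q_alpha = 2.343                          # q_{0.05} for k = 3 methods
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
```

Two methods whose average ranks differ by more than `cd` are declared significantly different, which is exactly what the grouping lines in a CD diagram encode.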

Fig 5. CD diagram of different methods with significance level .

https://doi.org/10.1371/journal.pone.0345369.g005

Convergence and computational performance

Here we experimentally demonstrate the good convergence of RMLCC. Fig 6 shows the variation of the classification accuracy with respect to the number of iterations together with the convergence curve, where the blue line is the classification accuracy curve and the red line is the convergence curve. The classification accuracy increases rapidly in the early iterations and remains stable after 15 iterations. In addition, the objective value decreases monotonically with the number of iterations. The above results demonstrate the effectiveness of the proposed optimization method.

Fig 6. Convergence curve and classification accuracy versus iterations.

https://doi.org/10.1371/journal.pone.0345369.g006

To assess the computational cost of the proposed method, Table 8 reports the training time of the compared methods. PRDR has the shortest training time, while RLSL has the longest on most datasets. Our RMLCC is relatively slow compared to PRDR, DLSR, and MSDLSR, ranking third in terms of training speed. The main reason is that RMLCC needs to solve Sylvester equations in each iteration. Although the execution time of RMLCC is somewhat longer, off-line training can be used for tasks that require higher prediction accuracy.

Table 8. Run time (second) comparisons of different methods.

https://doi.org/10.1371/journal.pone.0345369.t008

Parameter sensitivity and setting

In RMLCC, there are three parameters, which serve to balance the importance of each term in the objective. To analyze the parameter sensitivity of RMLCC, a candidate set was first defined for each parameter. The study revealed that the model’s performance is not sensitive to one of them, and the performance remains stable over a wide range. Therefore, we fix it at its optimal value and analyze the effect of the other two on the performance of the model. Fig 7 shows the variation of the classification performance of our method with respect to different values of these two parameters. Obviously, the proposed method achieves satisfactory performance when they are located in suitable intervals. In addition, a geometric mean weight α needs to be tuned in the model, although it is actually close to 0 in most cases; we select it from a candidate set in the experiments. The prior matrix M0 was set to the identity matrix throughout the experiments.

Fig 7. Classification accuracy with respect to the two trade-off parameters on the datasets.

https://doi.org/10.1371/journal.pone.0345369.g007
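The parameter sweep behind Fig 7 amounts to a two-dimensional grid search: fix the insensitive third parameter, sweep the other two over a candidate set, and keep the pair with the best validation accuracy. A minimal sketch follows; `evaluate` is a hypothetical stand-in for training RMLCC and scoring it on a validation split, and the grid values are illustrative rather than the paper's actual candidate set.

```python
import numpy as np

def grid_search(evaluate, grid):
    """Exhaustive 2-D sweep: returns (best score, best parameter pair)."""
    best = (-np.inf, None)
    for lam1 in grid:
        for lam2 in grid:
            acc = evaluate(lam1, lam2)
            if acc > best[0]:
                best = (acc, (lam1, lam2))
    return best

grid = [10.0 ** k for k in range(-4, 3)]  # illustrative logarithmic candidates
# Toy surrogate for validation accuracy, peaked at (1e-2, 1e0).
toy = lambda a, b: -((np.log10(a) + 2.0) ** 2 + np.log10(b) ** 2)
best_acc, best_pair = grid_search(toy, grid)
```

A logarithmic grid is standard for regularization-style trade-off parameters, since performance typically varies with their order of magnitude rather than their exact value.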

Conclusion

In this paper, a novel method called residual metric learning with class-specific consistency (RMLCC) is proposed for multiclass classification. Unlike existing methods, which directly fit the original features to binary or relaxed labels, the proposed method jointly learns the instance transformation matrix and the residual metric matrix in a unified framework, and imposes class-specific consistency constraints on the transformed instances. As a result, in the learned metric space, the distances between transformed instances of the same class are as small as possible while the distances between instances of different classes are as large as possible, thus improving the discriminative power. Extensive experiments on face, object, and scene databases validate the effectiveness of the proposed method.

In future work, we intend to integrate higher-order structure of instances to further improve the discriminative power of the model. In addition, replacing GMML with a non-linear metric learning method to improve discriminability on complex structured data is also a focus of our future research.

Supporting information

References

  1. Wang J, Xie F, Nie F, Li X. Robust supervised and semisupervised least squares regression using ℓ2,p-norm minimization. IEEE Trans Neural Netw Learn Syst. 2023;34(11):8389–403. pmid:35196246
  2. Xiang S, Nie F, Meng G, Pan C, Zhang C. Discriminative least squares regression for multiclass classification and feature selection. IEEE Trans Neural Netw Learn Syst. 2012;23(11):1738–54. pmid:24808069
  3. Mika S, Schäfer C, Laskov P, Tax D, Müller KR. Support vector machines. IEEE Intelligent Systems. 2004;8(6):1–28.
  4. Ma J, Zhou S, Li D. Robust multiclass least squares support vector classifier with optimal error distribution. Knowledge-Based Systems. 2021;215:106652.
  5. Ma J, Zhou S. Discriminative least squares regression for multiclass classification based on within-class scatter minimization. Appl Intell. 2021;52(1):622–35.
  6. Wang L, Zhang X-Y, Pan C. MSDLSR: margin scalable discriminative least squares regression for multicategory classification. IEEE Trans Neural Netw Learn Syst. 2016;27(12):2711–7. pmid:26441456
  7. Zhang X-Y, Wang L, Xiang S, Liu C-L. Retargeted least squares regression algorithm. IEEE Trans Neural Netw Learn Syst. 2015;26(9):2206–13. pmid:25474813
  8. Wen J, Xu Y, Li Z, Ma Z, Xu Y. Inter-class sparsity based discriminative least square regression. Neural Netw. 2018;102:36–47. pmid:29524766
  9. Zhan S, Wu J, Han N, Wen J, Fang X. Group low-rank representation-based discriminant linear regression. IEEE Trans Circuits Syst Video Technol. 2020;30(3):760–70.
  10. Fang X, Teng S, Lai Z, He Z, Xie S, Wong WK, et al. Robust latent subspace learning for image classification. IEEE Trans Neural Netw Learn Syst. 2018;29(6):2502–15. pmid:28500010
  11. Zhang C, Li H, Qian Y, Chen C, Gao Y. Pairwise relations oriented discriminative regression. IEEE Trans Circuits Syst Video Technol. 2021:2646–60.
  12. Zadeh P, Hosseini R, Sra S. Geometric mean metric learning. In: Proceedings of the 33rd International Conference on Machine Learning (ICML). 2016. p. 2464–71.
  13. Chen C, He B, Ye Y, Yuan X. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Math Program. 2014;155(1–2):57–79.
  14. Bai J, Zhang M, Zhang H. An inexact ADMM for separable nonconvex and nonsmooth optimization. Comput Optim Appl. 2025;90(2):445–79.
  15. Bai J, Hager WW, Zhang H. An inexact accelerated stochastic ADMM for separable convex optimization. Comput Optim Appl. 2022;81(2):479–518.
  16. Chiang CY, Chu EKW, Lin WW. On the ⋆-Sylvester equation AX±XB=C. Applied Mathematics and Computation. 2012;218(17):8393–407.
  17. Georghiades AS, Belhumeur PN, Kriegman DJ. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans Pattern Anal Machine Intell. 2001;23(6):643–60.
  18. Martinez AM, Benavente R. The AR face database. 1998.
  19. Learned-Miller E, Huang GB, RoyChowdhury A, Li H, Hua G. Labeled faces in the wild: a survey. Springer International Publishing. 2016.
  20. Sim T, Baker S, Bsat M. The CMU pose, illumination, and expression database. IEEE Trans Pattern Anal Machine Intell. 2003;25(12):1615–8.
  21. Nene SA, Nayar SK, Murase H. Columbia Object Image Library (COIL-100). Columbia University. 1996.
  22. van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9:2579–605.
  23. Zhang Z, Lai Z, Xu Y, Shao L, Wu J, Xie G-S. Discriminative elastic-net regularized linear regression. IEEE Trans Image Process. 2017;26(3):1466–81. pmid:28092552
  24. Demšar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 2006;7:1–30.