Abstract
Feature extraction has been extensively studied in the machine learning field, as it plays a critical role in the success of various practical applications. To uncover compact low-dimensional feature representations with strong generalization and discrimination capabilities for recognition tasks, we present in this paper a novel discriminative graph regularized representation learning (DGRL) model that elegantly incorporates both the global and local geometric structures as well as the label structure of data into a joint framework. Specifically, DGRL integrates dimension reduction into ridge regression rather than treating them as two unrelated steps, which enables us to capture the underlying subspace structure and the correlation patterns among classes. Additionally, a graph regularizer that fully utilizes the local class information is developed and introduced into the new framework so as to enhance the classification accuracy and prevent overfitting. A kernel version of DGRL, called KDGRL, is also established for dealing with complex nonlinear data by using the kernel trick. The proposed framework naturally unifies several well-known approaches and elucidates their intrinsic relationships. We provide detailed theoretical derivations of the resulting optimization problems of DGRL and KDGRL. Meanwhile, we design two simple and tractable parameter estimation procedures based on the cross-validation technique to speed up the model selection processes for DGRL and KDGRL. Finally, we conduct comprehensive experiments on diverse benchmark databases drawn from different areas to evaluate the proposed theories and algorithms. The results demonstrate the effectiveness and superiority of our methods.
Citation: Qi J, Xu R (2025) Discriminative graph regularized representation learning for recognition. PLoS One 20(7): e0326950. https://doi.org/10.1371/journal.pone.0326950
Editor: Alberto Fernández-Hilario, Universidad de Granada, SPAIN
Received: April 4, 2024; Accepted: June 6, 2025; Published: July 17, 2025
Copyright: © 2025 Qi, Xu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: This work is supported by the Natural Science Foundation for colleges and universities in Jiangsu Province of China (Grant No. 19KJB520024), the practice innovation training program projects for the Jiangsu College students (Grant No. 202410323125Y) and 2025 College Students' Innovation and Entrepreneurship Training Program Project (Project Title: Intelligent Elderly Health Monitoring System Based on Multimodal Perception and Edge Computing). All funding was awarded to JQ. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Learning a compact and powerful discriminative representation of high-dimensional data has shown remarkable importance to perform accurate and efficient regression or classification because of the so-called curse of dimensionality [1]. Over the past decades, subspace learning based dimension reduction has been extensively studied and widely used in extracting discriminative feature representations by removing redundant, irrelevant, and noisy information. Two of the most well-known techniques on this topic are principal component analysis (PCA) [1] and linear discriminant analysis (LDA) [2].
LDA pursues a linear discriminative subspace with minimum within-class dispersion and maximum between-class separation. Classical LDA always suffers from the small sample size (SSS) problem, which is known to cause serious stability issues. Consequently, many research works [3–5] have been developed to address this issue in recent decades. Moreover, since the learned projective functions of PCA and LDA are linear combinations of all the original features, they may fail to discover essential data structure that is nonlinear. Kernel-based techniques [6] provide powerful and tractable extensions of linear models to nonlinear cases, and they have been successfully used to extend PCA and LDA to the kernel-induced feature space [7–9].
Traditional PCA and LDA only consider the global structure of data, while the local structure, which is often important for many real applications [10–12], is ignored. To this end, a variety of manifold-based techniques [13–16] have been developed to uncover the geometrical structure of the underlying manifold. The core idea of these methods is to construct an affinity graph via different similarity metrics and then maintain the desired local neighborhood relations characterized by the graph in a new space. A typical application of manifold learning is graph embedding, which has recently been widely used to address the issue of overfitting in recognition tasks by enhancing intra-class compactness [17–20].
Many of the popular subspace learning techniques such as LDA, locality preserving projection (LPP), and their kernel extensions can also be formulated as a least squares regression (LSR) problem [2]. The LSR-type formulation is not only flexible in incorporating various regularization techniques to improve the interpretability and generalization ability of the resulting model, but also admits efficient and scalable implementations via many existing iterative algorithms. Most of the above methods are commonly applied as a separate data preprocessing procedure. As a consequence, feature learning and classification are often treated as two unrelated steps, and hence the overall optimality of the algorithms cannot be guaranteed. Motivated by the recent development of low-rank representation (LRR) [21], low-rank constraints have been incorporated into LSR [22–24], which can globally preserve the membership of data and capture the underlying correlations behind samples. Fang et al. [25] introduced a robust latent subspace learning (RLSL) method by simultaneously minimizing the regression loss and the reconstruction error so as to strengthen the connection between the learned data representation and the classification performance. Supervised approximate low-rank projection learning (SALPL) [26] shares a similar idea with RLSL, in which the projection learning process of latent low-rank representation is integrated into ridge regression. Although these methods are very effective in learning an informative feature representation, they only exploit the global structure of data while ignoring the local structure.
Recent studies [10,12] show that local and global structures are indispensable and complementary to each other in learning discriminative representations for image classification. In view of this, various subspace learning techniques integrating the theory of graph embedding [27], sparse representation (SR) [28], or LRR [18] have been developed to expose the inherent structure of data. For example, low-rank sparse preserving projections (LSPP) [27] combines manifold learning with low-rank sparse representation so that the self-expressiveness property and the intrinsic geometric structure of data points in the embedding space can be better revealed. It is noted that the low-rank criterion governs the global relation of the entire data set, whereas the sparsity criterion dominates the local structure around each individual data point [10].
The above methods have greatly promoted the development of feature learning theory and practice. Nonetheless, they face several challenges that have not yet been exhaustively explored or satisfactorily solved. First, since these methods are modified with different motivations by introducing extra regularization terms or constraints, the corresponding objective functions usually become non-convex and even non-smooth. Such optimization problems are generally solved with a rather complicated alternating procedure, e.g., by employing augmented Lagrange multiplier (ALM) algorithms [29], which inevitably involve iterative processes and easily converge to local minima. What is more, the convergence of ALM-style optimization cannot be theoretically guaranteed in cases with more than two blocks of primary variables; thus, most of these methods have to validate their convergence properties via experiments. On the other hand, the choice of regularization parameters, which is key to the success of these methods, has traditionally been left entirely to the user. In practice, cross-validation is a widely used evaluation technique for model selection to obtain satisfactory performance. Unfortunately, it is often computationally prohibitive to optimally tune many regularization parameters over a large set of candidates, especially in high-dimensional data analysis. Finally, although these methods work well for linear problems, they may be less effective when severe nonlinearity is involved.
In light of these deficiencies, a novel method termed as DGRL is proposed in this work to learn an appropriate data representation by discovering a latent subspace for multicategory classification. DGRL smoothly integrates feature extraction and recognition into a joint optimization framework rather than performing them in two independent steps, which guarantees an overall optimum. Moreover, DGRL directly introduces the underlying discriminative and geometrical information into a general graph regularization to guide the projection learning and the training of classifier, which encourages the method to obtain a better performance. The kernelized counterpart of DGRL (KDGRL) is also developed to boost the nonlinear representation of samples. We make an insightful and systematic investigation of the problem of regularization parameter estimation in both DGRL and its kernel version.
Our key contributions are summarized as follows.
- 1). We combine feature learning and classification with some constraints to form a unified framework. Under this framework, we present two models, i.e., DGRL and KDGRL, in which both the global Euclidean and local manifold structures of data as well as the underlying class information are exploited simultaneously and naturally evolved together to learn a more compact and discriminative subspace for dimension reduction and classification.
- 2). We design a supervised graph regularizer that elegantly incorporates several informative constraints, such as local consistency and discriminative properties, to make the uncovered data representation in the transformed category space more amenable to subsequent classification.
- 3). We show that the matrix computations involved in DGRL and KDGRL can be simplified, and we develop two simple yet effective model selection algorithms for them which can greatly accelerate the cross validation process, especially for high-dimensional, small sample size data.
The remainder of this paper is organized as follows. In the Methods section, we first present the motivation and derivation of DGRL in detail, and then provide the model selection algorithm for DGRL with an algorithmic description. We then extend DGRL to a kernel version for nonlinear feature extraction and recognition. Extensive experiments are conducted and analyzed in the experimental section. The Conclusion section draws conclusions and discusses future work.
Proposed methods
Notations
Throughout the paper, scalars, vectors, and matrices are written as italic letters, bold lowercase letters, and bold uppercase letters, respectively. Let $\mathbf{D} \in \mathbb{R}^{d \times n}$ be the original data matrix consisting of $n$ samples in the $d$-dimensional space. Let $\mathbf{G} \in \{0,1\}^{n \times k}$ represent the binary class label matrix corresponding to $\mathbf{D}$, where $k$ is the number of classes, and $\sum_{j=1}^{k} g_{ij} = 1$ for each $i$. Specifically, if the $i$th sample is from the $j$th class, $g_{ij} = 1$; otherwise, $g_{ij} = 0$, where $g_{ij}$ denotes the $(i,j)$th element of $\mathbf{G}$. To be clear, we first define $\mathbf{X} = \mathbf{D}\big(\mathbf{I}_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big)$ as the centered data matrix such that the global mean is zero, i.e., $\sum_{i=1}^{n}\mathbf{x}_i = \mathbf{0}$, where $\mathbf{I}_n$ is an identity matrix of size $n$, $\mathbf{1} \in \mathbb{R}^n$ is a vector of all ones, and $\mathbf{x}_i$ represents the $i$th column of $\mathbf{X}$. We also define the normalized class label matrix $\mathbf{Y} \in \mathbb{R}^{n \times k}$ with $y_{ij} = 1/\sqrt{n_j}$ if the $i$th sample belongs to the $j$th class and $y_{ij} = 0$ otherwise, where $n_j$ is the sample size of the $j$th class and $\sum_{j=1}^{k} n_j = n$.
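As a quick sanity check, the notation above can be instantiated in a few lines of NumPy; the toy sizes below are arbitrary illustrative choices, and the $1/\sqrt{n_j}$ scaling follows the normalized label matrix defined here:

```python
import numpy as np

d, n, k = 5, 12, 3                       # feature dim, sample count, classes
rng = np.random.default_rng(0)
D = rng.normal(size=(d, n))              # raw data, one sample per column
labels = np.arange(n) % k                # class index of each sample

# Binary label matrix G: g_ij = 1 iff the ith sample is from the jth class
G = np.zeros((n, k))
G[np.arange(n), labels] = 1

# Centered data matrix X = D(I_n - (1/n) 1 1^T): the global mean is zero
X = D - D.mean(axis=1, keepdims=True)

# Normalized label matrix Y: y_ij = 1/sqrt(n_j) for samples of class j
n_j = G.sum(axis=0)                      # per-class sample sizes
Y = G / np.sqrt(n_j)
```

A useful consequence of this normalization is that $\mathbf{Y}^\top\mathbf{Y} = \mathbf{I}_k$, which is what lets the product $\mathbf{X}\mathbf{Y}\mathbf{Y}^\top\mathbf{X}^\top$ act as a between-class scatter later in the derivation.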
Motivation and formulation
We present a regularized joint learning framework for simultaneous feature extraction and multicategory classification in a supervised manner, which integrates both global and local structures to uncover the most compact and discriminative subspace so as to facilitate the follow-up recognition tasks. In the proposed framework, the original high-dimensional feature space and its underlying semantic space are linked by a linear transformation $\mathbf{Q} \in \mathbb{R}^{d \times s}$ that projects each data point $\mathbf{x}_i$, for $i = 1, \ldots, n$, in the $d$-dimensional space onto an $s$-dimensional space as $\mathbf{z}_i = \mathbf{Q}^\top\mathbf{x}_i$, where, typically, $s \ll d$. To improve the overall generalization and discriminative abilities of the algorithm, we aim to employ the extracted features $\mathbf{Q}^\top\mathbf{X}$ as the new discriminative representations to build a bridge between the original features and the objective outputs so that they are seamlessly connected. In addition, the high-level semantic label information together with the geometric property of data are incorporated into our framework in the form of a graph-based regularization constraint by virtue of manifold learning. This formulates the following optimization problem:

$$\min_{\mathbf{Q},\mathbf{W}} \; \mathcal{L}\big(f(\mathbf{Q}^\top\mathbf{X}; \mathbf{W}), \mathbf{Y}\big) + \lambda\,\Omega(\mathbf{W}) + \gamma\,\mathcal{R}(\mathbf{Q},\mathbf{W}), \tag{1}$$

where the loss function $\mathcal{L}(\cdot,\cdot)$ measures the approximation error between the desired and actual outputs, the regularization term $\Omega(\mathbf{W})$ controls the model complexity, and the supervised graph regularization term $\mathcal{R}(\mathbf{Q},\mathbf{W})$ is specially designed for classification purposes with a strong discrimination capability; it effectively encodes the locality and similarity among data points by making the best of the class information so that the separability of samples from different classes in the category space can be further enhanced. Additionally, $\lambda$ and $\gamma$ are two nonnegative regularization parameters balancing the respective terms, and $f(\mathbf{Q}^\top\mathbf{X}; \mathbf{W})$ indicates that the extracted discriminative features $\mathbf{Q}^\top\mathbf{X}$ are fed into the specified classifier $f$ to build the mapping relation $\mathbf{W}$ between the latent subspace $\mathbf{Q}$ and the label space $\mathbf{Y}$.
By combining appropriate regularization or relaxed labels [30–32], the least squares loss function has been shown to be comparable to other loss functions such as the logistic loss and hinge loss. In this work, we simply adopt the least squares loss together with Frobenius-norm ($F$-norm) regularization on $\mathbf{W}$ to learn a set of $k$ linear models $\mathbf{w}_j$, $j = 1, \ldots, k$, by minimizing the classification error. In this way, the embedding representation will be tightly coupled with classification. Accordingly, the objective function (1) becomes

$$\min_{\mathbf{Q},\mathbf{W}} \; \|\mathbf{Y} - \mathbf{X}^\top\mathbf{Q}\mathbf{W}\|_F^2 + \lambda\|\mathbf{W}\|_F^2 + \gamma\,\mathcal{R}(\mathbf{Q},\mathbf{W}), \tag{2}$$

where $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_k) \in \mathbb{R}^{s \times k}$ and $\mathbf{Q}$ are to be estimated, $\mathbf{w}_j$ is the weight vector of the $j$th linear model, and $\|\cdot\|_F$ refers to the Frobenius norm of a matrix.
Generally, the $F$-norm constraint on $\mathbf{W}$ merely emphasizes the smoothness of the function, which may not be sufficient for discrimination among classes, since it has little to do with the label information. More importantly, similar inputs near the decision boundaries are more likely to come from different classes, meaning that the classifier may not always be smooth everywhere, especially for real data sets with complex distributions. Intuitively, a relatively ideal classifier should be able to map the data into the target category space where points within each class are close to each other while those belonging to different classes are kept away from each other. Hence, both the underlying discriminative information and the intrinsic geometric structure of the samples are crucial to classification problems. The graph-based supervised dimensionality reduction technique [13–16] provides a practical and feasible approach to fulfil such desired properties, and it has already been applied to develop various kinds of algorithms [17,20]. Based on these observations, we explicitly leverage the class information in constructing the graph regularization $\mathcal{R}(\mathbf{Q},\mathbf{W})$ to simultaneously guide the learning of the latent subspace $\mathbf{Q}$ and the linear classifier $\mathbf{W}$, thereby endowing them with stronger discriminant power. The geometric structure consistency between target semantics and features can be achieved by minimizing the following objective:

$$\mathcal{R}(\mathbf{Q},\mathbf{W}) = \frac{1}{2}\sum_{i,j=1}^{n} \|\hat{\mathbf{y}}_i - \hat{\mathbf{y}}_j\|_2^2\, S_{ij}, \tag{3}$$

where $\hat{\mathbf{y}}_i = \mathbf{W}^\top\mathbf{Q}^\top\mathbf{x}_i$ is the predicted label vector of $\mathbf{x}_i$ under the learned projection $\mathbf{Q}\mathbf{W}$, and the discriminating similarity matrix $\mathbf{S}$ models the local structure of the data manifold as defined in SOLPP [16], that is,

$$S_{ij} = \begin{cases} w_{ij}\,w_{ij}^{w}, & \text{if } \big(\mathbf{x}_j \in N_K(\mathbf{x}_i) \text{ or } \mathbf{x}_i \in N_K(\mathbf{x}_j)\big) \text{ and } \ell_i = \ell_j,\\ w_{ij}\,w_{ij}^{b}, & \text{if } \big(\mathbf{x}_j \in N_K(\mathbf{x}_i) \text{ or } \mathbf{x}_i \in N_K(\mathbf{x}_j)\big) \text{ and } \ell_i \neq \ell_j,\\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where $\ell_i$ is the target label with respect to sample $\mathbf{x}_i$, $P(\ell_i)$ denotes the prior probability of class $\ell_i$ (on which the weights $w_{ij}^{w}$ and $w_{ij}^{b}$ depend), $N_K(\mathbf{x}_i)$ refers to the set of $K$ nearest neighbors of $\mathbf{x}_i$, and $\sigma$ is a positive parameter scaling the Euclidean distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ in the heat-kernel weight $w_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2/\sigma)$. Furthermore, $w_{ij}$, $w_{ij}^{w}$, and $w_{ij}^{b}$ stand for the local weight, the intra-class compactness weight, and the inter-class separability weight, respectively; their exact definitions follow SOLPP [16].
According to the celebrated manifold assumption [11], from (3) one can see that the original data geometric structure characterized by $\mathbf{S}$ is intended to be reflected in the transformed label space. Specifically, minimizing (3) not only separates the nearby data points with different labels far from each other in the corresponding target space but also constrains the nearby data points sharing the same label to remain close to each other after being projected. In this manner, the margins between the samples of different classes at each local neighborhood will be greatly enlarged, and meanwhile the intra-class compactness of the outputs will be strongly boosted, which is beneficial to classification. Besides, the use of $\mathbf{S}$ equips the graph regularizer with additional appealing properties such as good robustness, margin augmentation, and noise suppression. Please see [16] for more details. Further, we can rewrite (3) as

$$\mathcal{R}(\mathbf{Q},\mathbf{W}) = \operatorname{tr}\big(\mathbf{W}^\top\mathbf{Q}^\top\mathbf{X}\mathbf{L}\mathbf{X}^\top\mathbf{Q}\mathbf{W}\big), \tag{5}$$

where $\mathbf{L} = \mathbf{D}_S - \mathbf{S}$ is the graph Laplacian matrix, $\mathbf{D}_S$ is a diagonal degree matrix with the $i$th diagonal entry $(\mathbf{D}_S)_{ii} = \sum_j S_{ij}$, and $\operatorname{tr}(\cdot)$ is the trace of a square matrix. Note that $\mathbf{L}$ is symmetric and positive semidefinite.
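For intuition, a minimal sketch of a graph Laplacian built from a $K$-nearest-neighbor heat-kernel similarity is given below. It uses a plain unsupervised weight only; the discriminating matrix $\mathbf{S}$ of (4) would additionally reweight neighbors by class, so this is a simplified stand-in rather than the exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, K_nn = 20, 4, 1.0, 3
X = rng.normal(size=(d, n))                 # columns are samples

# Pairwise squared Euclidean distances between columns of X
sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)

# Heat-kernel local weights w_ij = exp(-||x_i - x_j||^2 / sigma)
W = np.exp(-sq / sigma)

# Symmetric K-nearest-neighbor mask (class-dependent reweighting omitted)
np.fill_diagonal(sq, np.inf)
nn = np.argsort(sq, axis=1)[:, :K_nn]
mask = np.zeros((n, n), dtype=bool)
mask[np.arange(n)[:, None], nn] = True
S = np.where(mask | mask.T, W, 0.0)
np.fill_diagonal(S, 0.0)

# Graph Laplacian L = D_S - S with the diagonal degree matrix D_S
D_S = np.diag(S.sum(axis=1))
L = D_S - S
```

Because the weights are symmetric and nonnegative, this $\mathbf{L}$ has zero row sums and is positive semidefinite, the two properties the regularizer (5) relies on.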
By incorporating (5) into (2), we obtain the final objective function of our DGRL model as follows:

$$\min_{\mathbf{Q},\mathbf{W}} \; \|\mathbf{Y} - \mathbf{X}^\top\mathbf{Q}\mathbf{W}\|_F^2 + \lambda\|\mathbf{W}\|_F^2 + \gamma\operatorname{tr}\big(\mathbf{W}^\top\mathbf{Q}^\top\mathbf{X}\mathbf{L}\mathbf{X}^\top\mathbf{Q}\mathbf{W}\big), \quad \text{s.t. } \mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_s. \tag{6}$$

Here the orthogonality constraint $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_s$ is imposed to make the problem tractable.

Next, we will show that $s < \min(n, k)$ is indeed the dimensionality of the low-rank subspace derived from a generalized discriminant analysis. In this case, the learned mapping $\mathbf{Q}\mathbf{W}$ can be regarded as a low-rank projection matrix, since the size of $\mathbf{Q}\mathbf{W}$ is $d \times k$ while $\operatorname{rank}(\mathbf{Q}\mathbf{W}) \le s < \min(n, k)$. This indicates that the first term in model (6) essentially performs low-rank linear regression, such that the underlying global correlation structures between classes can be well explored [22]. The second term of (6) deals with the singularity problem in computing the optimal solution and consequently strengthens the stability of the whole model. The last term of (6) urges the predicted label matrix $\hat{\mathbf{Y}} = \mathbf{W}^\top\mathbf{Q}^\top\mathbf{X}$ to reproduce the discriminating similarity structure encoded in $\mathbf{L}$. During this process, both the within-class compactness and the between-class separability of labels in each local area are enhanced, which helps to alleviate the risk of overfitting [17] while improving the model's generalization. Taken together, these factors enable our method to learn a more compact and discriminative subspace $\mathbf{Q}$ for dimension reduction and classification, and naturally differentiate it from previous works [22,24,33,34]. We will establish the corresponding relationships between those algorithms and ours and analyze them in greater detail in the subsequent section.
The optimal solution
By taking the partial derivative of (6) with respect to $\mathbf{W}$ and setting it to zero, we have

$$\mathbf{W} = \big(\mathbf{Q}^\top\tilde{\mathbf{S}}_t\mathbf{Q}\big)^{-1}\mathbf{Q}^\top\mathbf{X}\mathbf{Y}, \tag{7}$$

where $\tilde{\mathbf{S}}_t = \mathbf{X}(\mathbf{I}_n + \gamma\mathbf{L})\mathbf{X}^\top + \lambda\mathbf{I}_d$. Substituting (7) back into (6), it is not difficult to verify that the optimal transformation $\mathbf{Q}$ can be obtained by solving

$$\max_{\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_s} \; \operatorname{tr}\Big[\big(\mathbf{Q}^\top\tilde{\mathbf{S}}_t\mathbf{Q}\big)^{-1}\big(\mathbf{Q}^\top\mathbf{S}_b\mathbf{Q}\big)\Big], \tag{8}$$

where $\tilde{\mathbf{S}}_t$ and $\mathbf{S}_b = \mathbf{X}\mathbf{Y}\mathbf{Y}^\top\mathbf{X}^\top$ generalize the total and between-class scatter matrices defined in classical LDA, respectively (they reduce to the classical definitions when $\gamma = \lambda = 0$). One limitation of (8) is that it cannot be directly applied when $\mathbf{Q}^\top\tilde{\mathbf{S}}_t\mathbf{Q}$ is singular. To this end, we employ the pseudoinverse in place of the inverse in the optimization in (8), leading to a more general formulation:

$$\max_{\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_s} \; \operatorname{tr}\Big[\big(\mathbf{Q}^\top\tilde{\mathbf{S}}_t\mathbf{Q}\big)^{+}\big(\mathbf{Q}^\top\mathbf{S}_b\mathbf{Q}\big)\Big]. \tag{9}$$

Note that if a matrix is invertible, its pseudoinverse equals its inverse. A similar criterion has been studied in [4] for a family of generalized LDA algorithms to cope with the SSS problem.
To solve (9), we provide the following theorem.
Theorem 1: Let $\mathbf{A}$ be a matrix whose columns are formed by the eigenvectors of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$ corresponding to the largest $q$ nonzero eigenvalues, where $q = \operatorname{rank}(\mathbf{S}_b)$. Let $\mathbf{A} = \hat{\mathbf{Q}}\hat{\mathbf{R}}$ be a QR factorization of $\mathbf{A}$, where $\hat{\mathbf{Q}}$ has orthonormal columns and $\hat{\mathbf{R}}$ is upper triangular. Then $\mathbf{Q} = \hat{\mathbf{Q}}$ solves the optimization problem (9).
Proof: The detailed proof of Theorem 1 is moved to the supplementary file for better flow of the paper.
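To make Theorem 1 concrete, the following sketch solves the direct (unoptimized) eigenproblem on synthetic data. The centering-matrix Laplacian stand-in and the toy sizes are our own illustrative assumptions, and $\mathbf{W}$ is recovered from the stationarity condition (7):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k, gam, lam = 6, 15, 3, 0.1, 0.01
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)              # centered data matrix
labels = np.arange(n) % k
G = np.zeros((n, k)); G[np.arange(n), labels] = 1
Y = G / np.sqrt(G.sum(axis=0))                  # normalized label matrix
L_g = np.eye(n) - np.full((n, n), 1.0 / n)      # PSD stand-in for the graph Laplacian

# Regularized total scatter and between-class scatter
St = X @ (np.eye(n) + gam * L_g) @ X.T + lam * np.eye(d)
Sb = X @ Y @ Y.T @ X.T

# Eigenvectors of St^+ Sb via the equivalent symmetric problem
# St^{-1/2} Sb St^{-1/2}  (St is positive definite here because lam > 0)
w, U = np.linalg.eigh(St)
St_ih = U @ np.diag(w ** -0.5) @ U.T            # St^{-1/2}
ev, V = np.linalg.eigh(St_ih @ Sb @ St_ih)
q = np.linalg.matrix_rank(Sb)
A = St_ih @ V[:, ::-1][:, :q]                   # top-q eigenvectors

Q, R = np.linalg.qr(A)                          # Theorem 1: Q solves (9)
W = np.linalg.solve(Q.T @ St @ Q, Q.T @ X @ Y)  # stationarity condition (7)
```

Since $\mathbf{X}$ is centered, $\operatorname{rank}(\mathbf{S}_b) \le k-1$, so the recovered subspace has at most $k-1$ dimensions, in line with the discussion that follows.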
Theorem 1 implies that the dimension, $s = q$, of the subspace transformed by $\mathbf{Q}$ is at most $k-1$, since the rank of $\mathbf{S}_b$ is bounded from above by $k-1$. Additionally, we can find that the major computation involved in DGRL comes from the eigen-decomposition of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$ in solving (9). For high-dimensional data, $\tilde{\mathbf{S}}_t$ and $\mathbf{S}_b$ are very large dense matrices of size $d \times d$, which incur a high computational cost when directly solving such a generalized eigenproblem. Moreover, estimating the best regularization parameters $\gamma$ and $\lambda$ via cross-validation from a set of candidates will further increase the time complexity sharply. This procedure is often computationally too intensive or even infeasible when the candidate set is large, since it requires expensive matrix computations for each candidate value. We next present an efficient implementation of DGRL, which has the potential to perform model selection over a large search space with low computational effort.
Model selection for DGRL
Let $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ be the singular value decomposition (SVD) of $\mathbf{X}$, where $\mathbf{U} \in \mathbb{R}^{d \times d}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal, $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times n}$, and its leading block $\boldsymbol{\Sigma}_t \in \mathbb{R}^{t \times t}$ is diagonal and nonsingular with $t = \operatorname{rank}(\mathbf{X})$. Let $\mathbf{U}$ be partitioned as $\mathbf{U} = (\mathbf{U}_1, \mathbf{U}_2)$ such that $\mathbf{X} = \mathbf{U}_1\boldsymbol{\Sigma}_t\mathbf{V}_1^\top$, where $\mathbf{U}_1 \in \mathbb{R}^{d \times t}$, $\mathbf{U}_2 \in \mathbb{R}^{d \times (d-t)}$, and $\mathbf{V}_1 \in \mathbb{R}^{n \times t}$ contains the first $t$ columns of $\mathbf{V}$. We start by factoring $\tilde{\mathbf{S}}_t$ in the form

$$\tilde{\mathbf{S}}_t = \mathbf{X}(\mathbf{I}_n + \gamma\mathbf{L})\mathbf{X}^\top + \lambda\mathbf{I}_d = \mathbf{U}_1\big(\mathbf{C} + \lambda\mathbf{I}_t\big)\mathbf{U}_1^\top + \lambda\mathbf{U}_2\mathbf{U}_2^\top. \tag{10}$$

Denote $\mathbf{C} = \boldsymbol{\Sigma}_t\mathbf{V}_1^\top(\mathbf{I}_n + \gamma\mathbf{L})\mathbf{V}_1\boldsymbol{\Sigma}_t$. It is clear that $\mathbf{C}$ is symmetric and positive definite for any $\gamma \ge 0$. Let $\mathbf{C} = \mathbf{P}\boldsymbol{\Delta}\mathbf{P}^\top$ be the eigen-decomposition of $\mathbf{C}$, where $\mathbf{P} \in \mathbb{R}^{t \times t}$ is orthogonal and $\boldsymbol{\Delta}$ is diagonal with positive diagonal entries. Then, $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$ can be expressed as

$$\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b = \mathbf{U}_1(\mathbf{C} + \lambda\mathbf{I}_t)^{-1}\mathbf{U}_1^\top\mathbf{S}_b = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b, \tag{11}$$

where the first equality follows since $\mathbf{U}_2$, whose columns span the null space of $\mathbf{X}^\top$, also lies in the null space of $\mathbf{S}_b$, i.e., $\mathbf{U}_2^\top\mathbf{S}_b = \mathbf{0}$, and $(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)$ is diagonal and always invertible for any $\lambda \ge 0$.
To simply and efficiently solve the eigen-problem of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$, we need the following proposition.

Proposition 1: Let $\mathbf{a}$ be an eigenvector of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$ corresponding to the nonzero eigenvalue $\delta$. Then $\mathbf{a} = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\mathbf{m}$ for some $\mathbf{m}$, where $\mathbf{m}$ is an eigenvector of $\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b\mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}$.

Proof: Since $\delta$ and $\mathbf{a}$ are an eigenvalue-eigenvector pair of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$, it follows from (11) that $\mathbf{a} = \delta^{-1}\mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b\,\mathbf{a} = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\mathbf{m}$ for some $\mathbf{m}$. We show in the following that $\mathbf{m}$ is an eigenvector of $\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b\mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}$. Left-multiplying both sides of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b\,\mathbf{a} = \delta\mathbf{a}$ by $\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b$ yields

$$\big[\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b\mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\big]\mathbf{m} = \delta\,\mathbf{m}, \quad \text{with } \mathbf{m} = \delta^{-1}\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b\,\mathbf{a}.$$

This completes the proof of the proposition.
Proposition 1 tells us that, with the above calculated $\mathbf{U}_1$, $\mathbf{P}$, and $\boldsymbol{\Delta}$, the original problem of computing $\mathbf{A}$ involved in Theorem 1 is equivalent to finding the eigenvector matrix $\mathbf{M}$ that satisfies $\mathbf{A} = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\mathbf{M}$, by solving a simpler $t \times t$ sized eigenvalue problem on $\hat{\mathbf{H}} = \mathbf{P}^\top\mathbf{U}_1^\top\mathbf{S}_b\mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}$. Recall that $\mathbf{S}_b = \mathbf{H}_b\mathbf{H}_b^\top$ with $\mathbf{H}_b = \mathbf{X}\mathbf{Y}$ and $q = \operatorname{rank}(\mathbf{S}_b) = \operatorname{rank}(\mathbf{H}_b)$. Let $\mathbf{B} = (\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1/2}\mathbf{P}^\top\boldsymbol{\Sigma}_t\mathbf{V}_1^\top\mathbf{Y}$, and let $\mathbf{B} = \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^\top$ be the SVD of $\mathbf{B}$, where $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$ are orthogonal and $\hat{\boldsymbol{\Sigma}}$ is diagonal with $q = \operatorname{rank}(\mathbf{B})$ nonzero entries. Let $\hat{\mathbf{U}}_q$ and $\hat{\mathbf{V}}_q$ denote the first $q$ columns of $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$, and let $\hat{\boldsymbol{\Sigma}}_q$ be the leading $q \times q$ block of $\hat{\boldsymbol{\Sigma}}$, such that $\mathbf{B} = \hat{\mathbf{U}}_q\hat{\boldsymbol{\Sigma}}_q\hat{\mathbf{V}}_q^\top$. We are now ready to introduce an efficient way to compute the top $q$ eigenvectors of $\hat{\mathbf{H}}$ associated with the nonzero eigenvalues, as follows:

$$\hat{\mathbf{H}} = (\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{1/2}\mathbf{B}\mathbf{B}^\top(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1/2} = (\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{1/2}\hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\boldsymbol{\Sigma}}^\top\hat{\mathbf{U}}^\top(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1/2}, \tag{12}$$

where $\mathbf{P}^\top\mathbf{U}_1^\top\mathbf{H}_b = \mathbf{P}^\top\boldsymbol{\Sigma}_t\mathbf{V}_1^\top\mathbf{Y} = (\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{1/2}\mathbf{B}$. This suggests that $\mathbf{M} = (\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{1/2}\hat{\mathbf{U}}$ diagonalizes $\hat{\mathbf{H}}$, which is exactly what we need. From (12), we see that only the largest $q$ diagonal entries of $\hat{\boldsymbol{\Sigma}}\hat{\boldsymbol{\Sigma}}^\top$ are nonzero; thus we have $\mathbf{A} = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1}\mathbf{M}_q = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1/2}\hat{\mathbf{U}}_q$.
Based on our analysis, there are essentially three main steps to get the eigenvectors of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$.
- 1). Compute the SVD of $\mathbf{X}$ as $\mathbf{X} = \mathbf{U}_1\boldsymbol{\Sigma}_t\mathbf{V}_1^\top$.
- 2). Eigen-decompose $\mathbf{C} = \boldsymbol{\Sigma}_t\mathbf{V}_1^\top(\mathbf{I}_n + \gamma\mathbf{L})\mathbf{V}_1\boldsymbol{\Sigma}_t$ as $\mathbf{C} = \mathbf{P}\boldsymbol{\Delta}\mathbf{P}^\top$, for some $\gamma$.
- 3). Compute the SVD of $\mathbf{B} = (\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1/2}\mathbf{P}^\top\boldsymbol{\Sigma}_t\mathbf{V}_1^\top\mathbf{Y}$ as $\mathbf{B} = \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^\top$, for some $\lambda$.

Then, it is easy to check that $\mathbf{A} = \mathbf{U}_1\mathbf{P}(\boldsymbol{\Delta} + \lambda\mathbf{I}_t)^{-1/2}\hat{\mathbf{U}}_q$. By doing so, we can observe that the nonsingularity of $\tilde{\mathbf{S}}_t$ is not required at all. Note that $\tilde{\mathbf{S}}_t$ may remain singular when $\lambda = 0$, but our DGRL still works in such a case, without suffering from the SSS problem, thereby extending its applicability. In addition, our three-step strategy enjoys several more important advantages. Specifically, the first step needs to be executed only once regardless of the size of the candidate set, as it is independent of the regularization parameters $\gamma$ and $\lambda$, thus saving a huge amount of computation. The second step runs the eigen-analysis on $\mathbf{C}$, a tractable matrix of size $t \times t$, where $t$, the rank of $\mathbf{X}$, is significantly smaller than $d$ for the SSS problem. The third step enables us to find $\mathbf{M}$ quickly by applying the reduced SVD to $\mathbf{B}$ of much smaller size (i.e., $t \times k$), rather than directly calculating the eigenvectors of the $d \times d$ matrix $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$. Considering that $t = \operatorname{rank}(\mathbf{X}) \le n-1$ is typically much greater than the number of classes $k$, this trick leads to a big reduction in terms of the computational cost. When choosing the optimal values for $\gamma$ and $\lambda$ from given candidate sets, we repeat the computations involved in the second and third steps only. So the cross-validation procedure can be performed efficiently, since we are dealing with matrices $\mathbf{C}$ and $\mathbf{B}$ whose sizes are much smaller than that of $\tilde{\mathbf{S}}_t^{+}\mathbf{S}_b$.
From Theorem 1, the optimal solution is given by the reduced QR factorization $\mathbf{A} = \hat{\mathbf{Q}}\hat{\mathbf{R}}$, i.e., $\mathbf{Q} = \hat{\mathbf{Q}}$. Now, let us go back and address the issue of computing $\mathbf{W}$ in (7), which can be calculated more reliably through the following proposition.

Proposition 2: Let $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{A}$, and $\hat{\mathbf{R}}$ be defined as in Theorem 1. Then $\mathbf{W}$ can be finally computed by $\mathbf{W} = \hat{\mathbf{R}}\mathbf{A}^\top\mathbf{X}\mathbf{Y}$.
Proof: Please see the detailed proof of Proposition 2 in the supplementary material.
Linear feature extraction and classification
For any centered input pattern $\mathbf{x} \in \mathbb{R}^d$, one can conduct feature extraction using $\mathbf{Q}$ to produce a compact and discriminant data representation $\mathbf{z} = \mathbf{Q}^\top\mathbf{x}$, which is subsequently delivered to the learned linear classifier $\mathbf{W}$ for multicategory classification. More precisely, the target output for $\mathbf{x}$ is calculated as

$$\hat{\mathbf{y}} = \mathbf{W}^\top\mathbf{Q}^\top\mathbf{x} = \mathbf{Y}^\top\mathbf{X}^\top\mathbf{A}\mathbf{A}^\top\mathbf{x}. \tag{13}$$

It is interesting and important to see that the above computation ultimately does not rely on the matrices $\hat{\mathbf{Q}}$ and $\hat{\mathbf{R}}$. This implies that the QR factorization of $\mathbf{A}$ can be avoided, which further decreases the cost. At last, the predicted class label of $\mathbf{x}$ is determined by $\arg\max_j \hat{y}_j$, where $\hat{y}_j$ refers to the $j$th element of $\hat{\mathbf{y}}$. In empirical studies, we adopt the 1-nearest-neighbor (1NN) classifier to evaluate the final recognition accuracy for fair comparison.
Let $\{\gamma_1, \ldots, \gamma_{c_\gamma}\}$ and $\{\lambda_1, \ldots, \lambda_{c_\lambda}\}$ be the candidate sets for the regularization parameters $\gamma$ and $\lambda$, respectively. In $v$-fold cross-validation, we randomly split the input data into $v$ subsets (folds) of roughly equal size. Then, we reserve one subset to assess our model trained on the remaining $v-1$ subsets. This procedure is repeated $v$ times such that each subset is used exactly once for validation. In the $l$th fold, the accuracy for each pair $(\gamma_i, \lambda_j)$ is defined as $\mathrm{Acc}(l, i, j)$. Moreover, the average accuracy across all $v$ partitions is reported as $\mathrm{Acc}(i, j)$. The optimal values $\gamma_{i^*}$ and $\lambda_{j^*}$ are the ones with $(i^*, j^*) = \arg\max_{i,j} \mathrm{Acc}(i, j)$. Details of our DGRL model selection algorithm are outlined in Algorithm 1.
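The bookkeeping of this selection loop can be sketched as follows; the scoring function here is a hypothetical placeholder standing in for the actual DGRL training plus 1NN evaluation on the held-out fold:

```python
import numpy as np

rng = np.random.default_rng(4)
gammas  = [0.01, 0.1, 1.0]        # candidate set for gamma
lambdas = [0.001, 0.01, 0.1]      # candidate set for lambda
v, n = 5, 40                      # number of folds and sample count
folds = np.array_split(rng.permutation(n), v)

def train_and_score(train_idx, val_idx, gam, lam):
    """Hypothetical placeholder: fit DGRL on the training folds and
    return the 1NN accuracy on the held-out fold."""
    return rng.uniform(0.5, 1.0)

Acc = np.empty((v, len(gammas), len(lambdas)))     # Acc(l, i, j)
for l in range(v):
    val_idx = folds[l]
    train_idx = np.concatenate([folds[m] for m in range(v) if m != l])
    for i, gam in enumerate(gammas):
        for j, lam in enumerate(lambdas):
            Acc[l, i, j] = train_and_score(train_idx, val_idx, gam, lam)

mean_acc = Acc.mean(axis=0)                        # Acc(i, j)
i_best, j_best = np.unravel_index(mean_acc.argmax(), mean_acc.shape)
best_gamma, best_lambda = gammas[i_best], lambdas[j_best]
```

In the efficient implementation, the per-pair scoring reuses the fold-level SVD of the training matrix, so only the inexpensive inner computations are repeated across the grid.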
Next, we analyze the computational complexity of Algorithm 1. Lines 8 and 9 take $O(n^2d)$ time for constructing the graph Laplacian and computing the skinny SVD. Lines 10 and 11 take $O(tn)$, $O(n^2t)$, and $O(ntk)$ time for the involved matrix multiplications, respectively. For each choice $\gamma_i$, Line 13 takes $O(t^2)$ time, Line 14 takes $O(t^3)$ time to perform the eigen-decomposition of a $t \times t$ matrix, and Line 15 takes $O(t^2k + dt^2)$ time. For each choice $\lambda_j$, Line 17 takes $O(t)$ time to invert the diagonal matrix $(\boldsymbol{\Delta} + \lambda_j\mathbf{I}_t)^{1/2}$. The SVD computation in Line 18 takes $O(tk^2)$ time. Lines 19 and 20 take $O(dtq + (n+m)dq)$ time for the matrix multiplications, where $q = \operatorname{rank}(\mathbf{H}_b)$ is less than or equal to $k-1$. Line 21 takes $O(nmq)$ time to perform 1NN. Thus, the total cost for estimating the best parameters is about $O\big(v(n^2d + c_\gamma(t^3 + dt^2) + c_\gamma c_\lambda(tk^2 + (n+m)dq + nmq))\big)$. Since $t \le n-1$ and $m < n$, and considering that $d > n$ and $q \le k-1 \ll n$ in small sample size problems, the whole cost simplifies to $O\big(vc_\gamma(n^2d + c_\lambda ndk)\big)$.
Algorithm 1 Model Selection for DGRL
Input: Data matrix $\mathbf{D}$ and zero-one label matrix $\mathbf{G}$; parameters $\sigma$ and $K$ in (4); candidate sets $\{\gamma_1, \ldots, \gamma_{c_\gamma}\}$ and $\{\lambda_1, \ldots, \lambda_{c_\lambda}\}$.
1: for $l = 1$ to $v$ do // $v$-fold cross validation
2: Form training set $\mathbf{D}_l$ with label matrix $\mathbf{G}_l$;
3: Form validation set $\bar{\mathbf{D}}_l$ with label matrix $\bar{\mathbf{G}}_l$;
4: $n = \mathrm{size}(\mathbf{D}_l, 2)$; $m = \mathrm{size}(\bar{\mathbf{D}}_l, 2)$;
5: Center training matrix $\mathbf{X}_l$;
6: Center validation matrix $\bar{\mathbf{X}}_l$;
7: Normalize training labels $\mathbf{Y}_l$;
8: Form Laplacian matrix $\mathbf{L}$ from $\mathbf{S}$ using $\mathbf{X}_l$ and $\mathbf{G}_l$;
9: Compute the skinny SVD of $\mathbf{X}_l = \mathbf{U}_1\boldsymbol{\Sigma}_t\mathbf{V}_1^\top$;
10: $t = \operatorname{rank}(\mathbf{X}_l)$; $\mathbf{E} = \boldsymbol{\Sigma}_t\mathbf{V}_1^\top$;
11: $\mathbf{C}_L = \mathbf{E}\mathbf{L}\mathbf{E}^\top$; $\mathbf{N} = \mathbf{E}\mathbf{Y}_l$;
12: for $i = 1$ to $c_\gamma$ do // $c_\gamma$ choices for $\gamma$
13: $\mathbf{C} = \boldsymbol{\Sigma}_t^2 + \gamma_i\mathbf{C}_L$;
14: Eigen-decompose $\mathbf{C} = \mathbf{P}\boldsymbol{\Delta}\mathbf{P}^\top$;
15: $\mathbf{N}_P = \mathbf{P}^\top\mathbf{N}$;
$\mathbf{E}_P = \mathbf{U}_1\mathbf{P}$;
16: for $j = 1$ to $c_\lambda$ do // $c_\lambda$ choices for $\lambda$
17: $\boldsymbol{\Theta} = (\boldsymbol{\Delta} + \lambda_j\mathbf{I}_t)^{-1/2}$;
$\mathbf{B} = \boldsymbol{\Theta}\mathbf{N}_P$;
18: Compute the skinny SVD of $\mathbf{B} = \hat{\mathbf{U}}_q\hat{\boldsymbol{\Sigma}}_q\hat{\mathbf{V}}_q^\top$;
19: $\mathbf{A} = \mathbf{E}_P\boldsymbol{\Theta}\hat{\mathbf{U}}_q$;
20: $\mathbf{Z}_l = \mathbf{A}^\top\mathbf{X}_l$;
$\bar{\mathbf{Z}}_l = \mathbf{A}^\top\bar{\mathbf{X}}_l$;
21: Run 1NN on ($\mathbf{Z}_l$, $\bar{\mathbf{Z}}_l$) and compute the validation accuracy,
denoted as $\mathrm{Acc}(l, i, j)$;
22: end for
23: end for
24: end for
25: $\mathrm{Acc}(i, j) = \frac{1}{v}\sum_{l=1}^{v}\mathrm{Acc}(l, i, j)$;
26: $(i^*, j^*) = \arg\max_{i,j}\mathrm{Acc}(i, j)$;
Output: The best parameter pair $(\gamma_{i^*}, \lambda_{j^*})$.
Kernel DGRL
In this section, we further generalize DGRL to its nonlinear counterpart by virtue of the kernel trick [6], which enables us to uncover the intrinsic structure of nonlinearly distributed data. Let $\phi: \mathbb{R}^d \to \mathcal{F}$ be a nonlinear mapping from the input space to a high-dimensional feature space. For simplicity, we assume $\mathcal{F} = \mathbb{R}^D$, where $D$ is possibly infinite. The core idea behind kernel DGRL (KDGRL) is to jointly learn the discriminant subspace $\mathbf{Q}_\phi$ and the linear classifier $\mathbf{W}$ in this new feature space.

Let us first assume that the data matrix in the feature space is column centered, i.e., $\sum_{i=1}^{n}\phi(\mathbf{x}_i) = \mathbf{0}$, and denote it by $\mathbf{X}_\phi = (\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n))$. The key observation is that the optimal discriminant vectors in $\mathbf{Q}_\phi$ can be expressed as a linear combination of the images of the training points in $\mathcal{F}$ based on the representer theorem [6]. That is, $\mathbf{Q}_\phi = \mathbf{X}_\phi\boldsymbol{\Phi}$ for some matrix $\boldsymbol{\Phi} \in \mathbb{R}^{n \times s}$.
Let $\mathbf{K} \in \mathbb{R}^{n \times n}$ be the kernel matrix with entries $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$, where $\kappa(\cdot,\cdot)$ is a suitable kernel function satisfying Mercer's condition [6]. Then, the kernelized version of (6) can be written as

$$\min_{\boldsymbol{\Phi},\mathbf{W}} \; \|\mathbf{Y} - \mathbf{K}\boldsymbol{\Phi}\mathbf{W}\|_F^2 + \lambda\|\mathbf{W}\|_F^2 + \gamma\operatorname{tr}\big(\mathbf{W}^\top\boldsymbol{\Phi}^\top\mathbf{K}\mathbf{L}_\phi\mathbf{K}\boldsymbol{\Phi}\mathbf{W}\big), \quad \text{s.t. } \boldsymbol{\Phi}^\top\mathbf{K}\boldsymbol{\Phi} = \mathbf{I}_s, \tag{14}$$

where $\mathbf{L}_\phi = \mathbf{D}_\phi - \mathbf{S}_\phi$ is the graph Laplacian matrix, $\mathbf{D}_\phi$ being the diagonal degree matrix of $\mathbf{S}_\phi$ with elements $(\mathbf{D}_\phi)_{ii} = \sum_j (S_\phi)_{ij}$, and the discriminating similarity matrix $\mathbf{S}_\phi$ is analogous to the definition of $\mathbf{S}$ in (4) but over the mapped patterns $\phi(\mathbf{x}_i)$, $i = 1, \ldots, n$. Note that constructing $\mathbf{S}_\phi$ in the feature space $\mathcal{F}$, rather than in the input space $\mathbb{R}^d$, has the advantage that nonlinear relationships between the input data can be better expressed.
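The column-centering assumption on $\mathbf{X}_\phi$ never requires forming $\phi$ explicitly: it can be realized directly on the Gram matrix by double centering, sketched below with an RBF kernel as an example choice:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 15, 4, 1.0
D = rng.normal(size=(d, n))                 # raw (uncentered) samples

# Gram matrix of the RBF kernel kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
sq = ((D[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)
K_raw = np.exp(-sq / (2.0 * sigma ** 2))

# Double centering K = H K_raw H with H = I_n - (1/n) 1 1^T realizes the
# zero-mean assumption on phi(x_i) without ever forming phi explicitly
H = np.eye(n) - np.full((n, n), 1.0 / n)
K = H @ K_raw @ H
```

The centered $\mathbf{K}$ remains symmetric and positive semidefinite, and each of its rows and columns sums to zero, matching $\sum_i \phi(\mathbf{x}_i) = \mathbf{0}$.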
Setting the partial derivative of the objective function in (14) with respect to $\mathbf{W}$ equal to zero, we obtain

$$\mathbf{W} = \big(\boldsymbol{\Phi}^\top\tilde{\mathbf{K}}_t\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^\top\mathbf{K}\mathbf{Y}, \tag{15}$$

where $\tilde{\mathbf{K}}_t = \mathbf{K}(\mathbf{I}_n + \gamma\mathbf{L}_\phi)\mathbf{K} + \lambda\mathbf{K}$. Substituting $\mathbf{W}$ derived from (15) into (14), we get the following equivalent problem:

$$\max_{\boldsymbol{\Phi}^\top\mathbf{K}\boldsymbol{\Phi} = \mathbf{I}_s} \; \operatorname{tr}\Big[\big(\boldsymbol{\Phi}^\top\tilde{\mathbf{K}}_t\boldsymbol{\Phi}\big)^{-1}\big(\boldsymbol{\Phi}^\top\mathbf{K}_b\boldsymbol{\Phi}\big)\Big], \tag{16}$$

where $\tilde{\mathbf{K}}_t$ and $\mathbf{K}_b = \mathbf{K}\mathbf{Y}\mathbf{Y}^\top\mathbf{K}$ can be viewed as the (regularized) total and between-class scatter matrices of $\mathbf{K}$ when each column of $\mathbf{K}$ is considered as a data point in $\mathbb{R}^n$. Furthermore, we again adopt the pseudoinverse to replace the inverse of a matrix in (16) as a new criterion. This allows our model to work well even if $\boldsymbol{\Phi}^\top\tilde{\mathbf{K}}_t\boldsymbol{\Phi}$ is singular.
To get the solution of (16), we need the following theorem.
Theorem 2: Let $\mathbf{T}$ be an $n \times p$ matrix consisting of the first $p$ eigenvectors of the matrix $\tilde{\mathbf{K}}_t^{+}\mathbf{K}_b$ associated with the nonzero eigenvalues, where $p = \operatorname{rank}(\mathbf{K}_b)$ and $p \le k-1$. Suppose the eigen-decomposition of $\mathbf{T}^\top\mathbf{K}\mathbf{T}$ is $\mathbf{T}^\top\mathbf{K}\mathbf{T} = \mathbf{P}_T\boldsymbol{\Lambda}_T\mathbf{P}_T^\top$, where $\mathbf{P}_T$ is orthogonal and $\boldsymbol{\Lambda}_T$ is diagonal and positive definite. Then $\boldsymbol{\Phi} = \mathbf{T}\mathbf{P}_T\boldsymbol{\Lambda}_T^{-1/2}$ solves the maximum problem in (16) with $s = p$.
Proof: The proof of Theorem 2 is very similar to that of Theorem 1 and can be found in the supplementary file.
Model selection for KDGRL
Directly applying the eigen-decomposition to the $n \times n$ dense matrix $\tilde{\mathbf{K}}_t^{+}\mathbf{K}_b$ can incur heavy computational overheads, especially for large-scale problems. In what follows, we show how to simplify the matrix computations involved in this procedure, thus speeding up the model selection process for KDGRL. Let $r$ be the rank of $\mathbf{K}$ and $\mathbf{K} = \mathbf{U}_r\boldsymbol{\Sigma}_r\mathbf{U}_r^\top$ be the skinny SVD of $\mathbf{K}$ (which coincides with its eigen-decomposition, since $\mathbf{K}$ is symmetric and positive semidefinite), where $\mathbf{U}_r \in \mathbb{R}^{n \times r}$ contains the first $r$ columns of the orthogonal factor $\mathbf{U}$, $\boldsymbol{\Sigma}_r \in \mathbb{R}^{r \times r}$ is diagonal and invertible, and the remaining columns of $\mathbf{U}$ form the orthogonal complement of $\mathbf{U}_r$. Let us define

$$\mathbf{C}_\phi = \boldsymbol{\Sigma}_r\mathbf{U}_r^\top(\mathbf{I}_n + \gamma\mathbf{L}_\phi)\mathbf{U}_r\boldsymbol{\Sigma}_r + \lambda\boldsymbol{\Sigma}_r.$$

It is easy to verify that $\mathbf{C}_\phi$ must be positive definite for arbitrary $\gamma \ge 0$ and $\lambda \ge 0$. Let $\mathbf{C}_\phi = \mathbf{P}_\phi\boldsymbol{\Delta}_\phi\mathbf{P}_\phi^\top$ be the eigendecomposition of $\mathbf{C}_\phi$, where $\mathbf{P}_\phi$ is orthogonal and $\boldsymbol{\Delta}_\phi$ is diagonal with positive diagonal entries. Then we have

$$\tilde{\mathbf{K}}_t^{+}\mathbf{K}_b = \mathbf{U}_r\mathbf{C}_\phi^{-1}\mathbf{U}_r^\top\mathbf{K}_b = \mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b. \tag{17}$$
To decrease the computing cost of the eigenvector matrix T stated in Theorem 2, we establish the following proposition.
Proposition 3: Suppose $\mathbf{t}$ is an eigenvector of $\tilde{\mathbf{K}}_t^{+}\mathbf{K}_b$ with the nonzero eigenvalue $\eta$. Then $\mathbf{t} = \mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}\mathbf{z}$, where $\mathbf{z}$ is an eigenvector of $\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b\mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}$.

Proof: Since $\tilde{\mathbf{K}}_t^{+}\mathbf{K}_b\,\mathbf{t} = \eta\mathbf{t}$ and $\eta \neq 0$, according to (17), we obtain $\mathbf{t} = \eta^{-1}\mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b\,\mathbf{t} = \mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}\mathbf{z}$ for some $\mathbf{z}$. We now prove that $\mathbf{z}$ is an eigenvector of $\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b\mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}$. Left-multiplying both sides of the equation $\tilde{\mathbf{K}}_t^{+}\mathbf{K}_b\,\mathbf{t} = \eta\mathbf{t}$ by $\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b$ gives

$$\big[\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b\mathbf{U}_r\mathbf{P}_\phi\boldsymbol{\Delta}_\phi^{-1}\big]\mathbf{z} = \eta\,\mathbf{z}, \quad \text{with } \mathbf{z} = \eta^{-1}\mathbf{P}_\phi^\top\mathbf{U}_r^\top\mathbf{K}_b\,\mathbf{t}.$$

This completes the proof of the proposition.
We proceed by diagonalizing , a tractable matrix with size
. Recalling
, let us form an
matrix
. Let the SVD of F be
, where
and
are orthogonal,
)
,
is diagonal and nonsingular with
,
and
consist of the first p columns of
and
, respectively. It follows that
where . It clearly shows that the leftmost p columns of
are all we need, as they form the eigenvectors corresponding to the nonzero eigenvalues of
. The above computation is more efficient than applying an eigen-decomposition to
directly, since we work on a much smaller matrix F of size r × k, where, usually,
. By Proposition 3, with the calculated
and
, we can get
. It should be emphasized that computing
is independent of the regularization parameters, i.e., regardless of the sizes of the candidate sets for
and
.
The above discussion leads to a three-step procedure for solving the eigenvalue problem on
- 1). Compute the skinny SVD of K as
.
- 2). Eigen decompose
, for given
.
- 3). Compute the skinny SVD of F as
.
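To make the pattern of this three-step procedure concrete, the sketch below works through it in NumPy on synthetic data. Since the precise definitions of the intermediate matrices appear only in the displayed equations, the construction of F here is a hypothetical stand-in; what the sketch illustrates is only the overall pattern: a skinny SVD of K, a small decomposition of an r × k matrix, and a back-mapping of eigenvectors as in Proposition 3.

```python
import numpy as np

def skinny_svd(K, tol=1e-10):
    """Rank-revealing (skinny) SVD: keep only components with nonzero singular values."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    return U[:, :r], s[:r], Vt[:r, :].T

rng = np.random.default_rng(0)
n, k = 50, 5
X = rng.standard_normal((n, 8))
K = X @ X.T                              # a Gram (kernel) matrix of rank <= 8
Y = np.eye(k)[rng.integers(0, k, n)]     # zero-one label matrix

# 1) skinny SVD of K
U_r, s_r, V_r = skinny_svd(K)
# 2) map the label-side matrix into the r-dimensional range of K
#    (a hypothetical stand-in for the paper's r x k matrix F)
F = np.diag(s_r) @ (U_r.T @ Y)
# 3) skinny SVD of the small r x k matrix F
P, sigma, Q = skinny_svd(F)
# Eigenvectors of the large n x n problem are recovered by lifting
# the columns of P back through U_r (cf. the back-mapping of Proposition 3).
T = U_r @ P
```

Because U_r and P both have orthonormal columns, the lifted matrix T does as well, mirroring the property needed of the eigenvector matrix in Theorem 2, while all heavy computation happens on matrices no larger than r × k.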
After determining T, the optimal solution of (16) is obtained from by solving the eigenvalue problem of
using Theorem 2. Now, let us look back at (15) and present an efficient way to compute
without matrix inversion. The proposition below illustrates this.
Proposition 4: Let K, Y, , and
be defined as in Theorem 2. Then,
can be given by
.
Proof: The detailed proof for Proposition 4 can be found in the supplementary material.
Nonlinear feature extraction and classification
Once and
are computed, for any centered feature mapping
, its projection by
can then be carried out as
, where
is a
kernel vector. Meanwhile, we perform final classification on discriminant subspace
via linear classifier
. To be more specific, this is done by
An interesting observation from (19) is that the calculation of the target output is entirely independent of the matrices and
. It means that we do not need to compute the eigendecomposition of
at all, thus further reducing the computational cost. Finally, the index of the maximum entry of
determines the label of x. In our experiments, the 1NN classifier is used for classification as done in DGRL.
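As a concrete illustration, the 1NN classification step on the projected features can be sketched as follows; this is a minimal NumPy version, and the projected feature matrices and label vector are placeholders for whatever the preceding extraction stage produces.

```python
import numpy as np

def nn1_classify(Z_train, y_train, Z_test):
    """1-nearest-neighbor classification in the learned discriminant subspace.
    Each row of Z_train / Z_test is one projected (extracted) feature vector."""
    # pairwise squared Euclidean distances, shape (m_test, n_train)
    d2 = (np.sum(Z_test ** 2, axis=1, keepdims=True)
          - 2.0 * Z_test @ Z_train.T
          + np.sum(Z_train ** 2, axis=1))
    return y_train[np.argmin(d2, axis=1)]
```

For example, test points near the training cluster of a given class receive that class's label, which is exactly how the validation accuracy in Line 21 of Algorithm 2 is obtained.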
In summary, the complete KDGRL model selection algorithm is described in Algorithm 2. The time complexity of Algorithm 2 is analyzed as follows: it takes time for the formation of the kernel matrix and the graph Laplacian. Line 9 takes
time for the SVD computation. Lines 10 and 11 take
,
, and
time, respectively, for matrix multiplications. For each choice
and
, it takes
time in Line 16 for the eigen-decomposition. Since
is a diagonal matrix whose inversion costs
, Line 17 takes
time. The SVD computation in Line 18 takes
time. In Lines 19 and 20, it takes
time for the matrix multiplications, where p = rank(
) is less than or equal to
. Line 21 takes
time to perform 1NN. Thus, the total cost of our model selection procedure is about
. Since
,
, and
, assuming that
, then the overall computational complexity can be compacted into
.
Algorithm 2 Model Selection for KDGRL
Input: Data matrix D and zero-one label matrix G; Kernel function ;
Parameters and K of
; Candidate sets
and
.
1: for l = 1 to v do // v-fold cross-validation
2: Form training set Dl with label matrix Gl;
3: Form validation set with label matrix
;
4: n = size(Dl, 2); m = size(, 2);
// Center the training and validation kernel matrices
5: Xl;
;
6: ;
;
7: Normalize training labels ;
8: Form Laplacian matrix from
using Xl and Gl;
9: Compute the skinny SVD of Kl =;
10: ; t=rank(Kl);
11:
;
12: for i = 1 to do //
choices for
13: ;
14: for j= 1 to do //
choices for
15: ;
16: Eigen decompose ;
17: ;
18: Compute the skinny SVD of =
;
19: ;
;
20: ;
;
21: Run 1NN on (
) and compute the
validation accuracy, denoted as Acc (l, i, j);
22: end for
23: end for
24: end for
25: ;
26: ;
Output: The best parameter pair .
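The kernel centering in Lines 5–6 of Algorithm 2 (whose formulas are displayed above) corresponds to centering in the implicit feature space; the NumPy sketch below shows the standard construction, offered here as an assumed reading of those lines.

```python
import numpy as np

def center_kernels(K_train, K_test):
    """Center a training Gram matrix and a test-vs-train kernel matrix in the
    implicit feature space: subtract the feature-space mean of the training
    samples from every mapped point."""
    n = K_train.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # centered training kernel: (I - 1/n) K (I - 1/n) expanded
    Kc = K_train - one_n @ K_train - K_train @ one_n + one_n @ K_train @ one_n
    m = K_test.shape[0]
    one_mn = np.full((m, n), 1.0 / n)
    # centered test-vs-train kernel against the same training mean
    Kt_c = K_test - one_mn @ K_train - K_test @ one_n + one_mn @ K_train @ one_n
    return Kc, Kt_c
```

A quick sanity check: for a linear kernel, the centered kernels must coincide with the Gram matrices of the explicitly mean-subtracted data.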
Experiments
In this section, to thoroughly evaluate the performance of DGRL and KDGRL for feature extraction and classification, we conduct extensive experiments on various types of real-world applications, involving face, texture, object, and handwritten digit recognition. Several representative related works are compared to confirm the superiority of our methods.
Experimental setup
We have taken six standard and publicly available image databases, namely ORL [27], Extended YaleB [25], CUReT [35], KTH-TIPS [35], COIL-100 [20], and MNIST [36], to support our theoretical results and the effectiveness of the proposed algorithms. Moreover, we carry out comprehensive comparisons of DGRL and KDGRL with some of the state-of-the-art methods or related approaches. Among them, LRLR [22], LRRR [22], LRKR and LRKRR [2,24] are the most representative low-rank regression models. RLRLR [17], DLSR [30], ReLSR [31], CLSR [32] and WCSDLSR [37] are currently the leading LSR-based soft label learning methods for multicategory classification problems. DRLPP [38] and LRPER [39] are recently proposed subspace learning methods. SVM [40] and regularized least squares (RLS) [41] are classical and powerful classifiers. Specifically, DLSR, RLRLR and WCSDLSR introduce a technique called -dragging to enlarge the distances between the regression targets of different classes. In addition, RLRLR and WCSDLSR construct a within-class scatter matrix for the relaxed labels to bring the projected samples from the same class closer to each other. ReLSR seeks to learn the target matrix directly from data by constraining the margin between the targets of the true and false classes of each sample. CLSR is a variant of ReLSR that further reduces the distances between the learned targets of samples within the same class. RLSL [25] and SALPL [26] are recent prominent representation-based methods that also integrate the least squares loss function into the dimension reduction process to ensure that the extracted features are optimal for recognition. DRLPP uses two transformation matrices to project high-dimensional data into a low-dimensional space, which endows it with the capability to preserve the local structure of data. LRPER develops a unified framework that integrates LRR, linear regression, and projection learning.
For SVM, classification is performed by employing the LIBSVM Toolbox [40], in which the cost parameter is chosen from {0.001, 0.01, 0.1, 1, 10, 100, 1000}. As special cases of the proposed linear and kernel formulations, LRLR, LRRR, LRKR and LRKRR are implemented by setting the regularization parameters and
to particular values as discussed in the following sections. The codes used for DLSR, ReLSR, CLSR, RLSL, RLRLR, SALPL, DRLPP, LRPER and WCSDLSR are released by the corresponding authors, and the suggested parameter settings are adopted from their original articles. The Gaussian RBF kernel is tested in all the kernel-based methods and the kernel width parameter is tuned on the set {0.05, 0.1, 0.5, 1, 1.5, 2, 2.5, 3}. For simplicity, we empirically set K = 5 and
for the construction of the similarity matrices defined in our objective functions across all problems except for the ORL case, where the neighborhood size is fixed to 2 owing to the limited training instances per individual during the cross validation procedure. In fact, they could also be further adjusted to refine the results. The major parameters
and
of our models are both selected from the wide range {0, 2^-10, 2^-9.5, …, 2^3.5, 2^4} when performing Algorithms 1 and 2, while the reduced dimension s is kept at k-1, where k denotes the number of classes. For each database, different numbers of samples per subject were randomly selected as the training set and the remainder were retained for testing. We search for the best parameters of each method by 10-fold cross-validation on the training set. Finally, we independently run all the algorithms ten times with the obtained optimal parameters under each experiment and report their mean accuracies as well as standard deviations for fairness of comparison.
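The candidate grid and the v-fold search over (λ, α) described above can be sketched as follows; the `train_eval` callback is a hypothetical stand-in for one training-and-validation pass of Algorithm 1 or 2.

```python
import numpy as np

# Candidate grid used for both lambda and alpha: {0} ∪ {2^t : t = -10, -9.5, ..., 4}
grid = np.concatenate(([0.0], 2.0 ** np.arange(-10, 4.5, 0.5)))

def cross_validate(train_eval, n_folds=10):
    """Generic v-fold grid search over (lambda, alpha).
    `train_eval(fold, lam, alpha)` must return the validation accuracy of
    one fold -- a stand-in for one inner iteration of Algorithms 1 and 2."""
    acc = np.zeros((n_folds, len(grid), len(grid)))
    for fold in range(n_folds):
        for i, lam in enumerate(grid):
            for j, alpha in enumerate(grid):
                acc[fold, i, j] = train_eval(fold, lam, alpha)
    mean_acc = acc.mean(axis=0)  # average accuracy over folds, per (lam, alpha)
    i_best, j_best = np.unravel_index(np.argmax(mean_acc), mean_acc.shape)
    return grid[i_best], grid[j_best]
```

Note that because the accuracy surface is averaged over folds before the argmax, the selected pair is the one that generalizes best across all validation splits, matching Lines 25–26 of Algorithm 2.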
Experiments for face recognition
The ORL and the Extended YaleB databases are widely employed for examining algorithmic performance on the face recognition task.
- 1). ORL Database: The ORL face database contains 40 subjects and each subject comprises 10 face images, which were taken with a tolerance for some tilting and rotation of the face up to 20 degrees. Moreover, some images were captured at different times, with varying lighting, facial expressions, and facial details (e.g., glasses/no glasses).
- 2). Extended YaleB Database: The Extended YaleB database is composed of 2414 face images from 38 individuals with small variations in facial expressions and head poses. Each individual provides around 64 near frontal images acquired under various controlled illumination conditions.
All original images in the two face databases were converted to grayscale and resized to 32 × 32 pixels in the experiments. Then, each image was reshaped into a 1024-dimensional vector for constructing the feature matrix. For ORL, we randomly select 5, 6, 7, and 8 images per subject to form the training set, while the test set contains the rest of the images. As to Extended YaleB, we randomly select 15, 20, 25, and 30 images of each individual as training samples, and the remaining images are regarded as test samples. Tables 1 and 2 display the results of all models on ORL and Extended YaleB, respectively, where the best classification accuracies in each column are marked in boldface. One can see that, with training sets of varying size, both of our methods almost always provide better recognition performance than the others. For example, on the ORL database, with 8 images sampled from each subject, DGRL achieves an average accuracy more than 1.09% higher than the state-of-the-art LRPER and the most related RLSL, reflecting its outstanding ability to discover the intrinsic discriminative structure of facial data. It is also noted that DGRL delivers comparable or even better results than RLRLR and WCSDLSR in all cases. This illustrates that DGRL has better adaptability to different test scenarios.
Experiments for texture recognition
To validate the performance of our proposed methods on the material recognition task, the CUReT and the KTH-TIPS databases were used to conduct experiments.
- 1). CUReT Database: The CUReT is a classical database for testing and evaluating state-of-the-art texture recognition algorithms. It contains textures from 61 materials, each of which has been imaged under 205 different viewing and illumination conditions. We utilize the same subset of images as [35]. In this commonly used version, each material has 92 images for which a sufficiently large region of texture is visible across all materials.
- 2). KTH-TIPS Database: The KTH-TIPS database was created to extend CUReT by providing variations in scale in addition to pose and illumination. This makes it particularly popular in the machine learning community. It consists of 10 texture classes, each of which comprises 81 images generated under three viewing angles, three different illumination directions, and nine scales. Compared to CUReT, KTH-TIPS is a more challenging database for texture image classification and is better suited for testing new algorithms.
Following the implementation in [35], 1180-dimensional PRICoLBP features are adopted as the image representation for both texture databases to make the results reproducible and comparable. For CUReT, we randomly select 25, 30, 35, and 40 samples per material for training, and the remaining samples are used for testing. As to KTH-TIPS, we randomly select 20, 25, 30, and 35 images from each class as training samples and treat the rest of the images as test samples. The recognition accuracies of the different methods on these two databases are presented in Tables 3 and 4, respectively. It is obvious that KDGRL achieves the highest classification rates among all the compared methods under different training and testing conditions, while DGRL is generally better than or comparable to the other competitors in almost all cases. This illustrates that our methods can address texture categorization well.
Experiments for object recognition
In this part, we utilize the famous and frequently used Columbia Object Image Library (COIL-100) database to test our methods on the object recognition task. It includes a total of 7200 images of various views of 100 objects under different lighting conditions. Each object contributes exactly 72 images, taken at pose intervals of 5 degrees. All images in our experiments on this database have been cropped and converted to gray-scale images of size 32 × 32 pixels. By doing so, a 1024-dimensional feature representation of each image is obtained. Furthermore, we randomly select 15, 20, 25, and 30 images of each object as training samples, and the remaining images are taken as test samples.
Table 5 shows the performance comparisons between the proposed methods and the competing ones under different random splits of training and test data. As observed from these experimental results, our two models once again outperform the other methods with higher accuracies in all cases, especially when the number of training samples is very small (e.g., 15 images per object), which is more difficult. This clearly shows that DGRL and KDGRL also have great potential in solving the multiclass object recognition problem.
Experiments for handwritten digit recognition
In the end, we ran experiments on the MNIST database to verify the performance of the proposed methods in handwritten digit recognition. The entire MNIST database covers 60000 training images and 10000 test images falling into 10 categories from “0” to “9”. Each digit image is of size 28 × 28 pixels, with 256 gray levels per pixel. This results in a 784-dimensional feature vector, which we use to describe each image. As in [36], we select the first 2000 instances from the 60000 training images to make up our training set and take the first 2000 instances from the 10000 test images as our test set. Thus, there are about 200 images of each digit in both the training and test sets. Random splits of 30, 60, 90, and 120 images per category are then selected from this training set for training.
The evaluation results are listed in Table 6, from which we can see that KDGRL still consistently produces the best testing accuracy among the comparison counterparts under different kinds of data splits. At the same time, DGRL yields better recognition performance than the linear algorithms, such as RLS, RLRLR, LRLR, LRRR, DLSR, ReLSR, CLSR, RLSL, SALPL, WCSDLSR, DRLPP and LRPER, but falls short of all the kernel-based methods, including LRKR, LRKRR, SVM, and KDGRL. The reason may be that the distribution of handwritten digit patterns is highly sparse and complex, and DGRL cannot adequately handle such an inherently nonlinear problem due to its linear nature, particularly in the case of insufficient training data. In fact, all the other linear methods compared on this subset encounter the same issue as DGRL. Comparatively, after exploiting the RBF kernel to map the original handwritten digit patterns into an implicit high-dimensional feature space, the discriminability of KDGRL is dramatically enhanced, resulting in a significant improvement on this task.
Experimental analysis
From Tables 1–6, we can see that the proposed DGRL and KDGRL exhibit favorable recognition performance on four different applications. According to these experimental results, we draw the following observations and conclusions.
First, for all six databases, DGRL and KDGRL consistently improve the classification accuracy over LRRR and LRKRR, respectively. Note that, in our experiments, we implement LRRR by setting in (6) and LRKRR by setting
in (14). This conforms to our motivation of incorporating the local class information to enhance feature extraction and classification, and it empirically proves that the enforced graph regularizer contributes positively to boosting the overall learning performance. On the other hand, the low-rank regression methods (LRLR, LRRR, LRKR, and LRKRR) act as special cases of our framework obtained by using different
and
. Hence, the proposed formulations are more flexible and expected to outperform, or at least match, those approaches when the two regularization parameters are tuned properly with Algorithms 1 and 2. This has been confirmed by our experiments, which, from another viewpoint, also testify that the two developed model selection algorithms are quite successful.
Second, CLSR performs slightly better than DLSR and ReLSR, but worse than our methods in all cases. This is likely because, in DLSR and ReLSR, the –dragging technique or margin constraint used to relax the label matrix also enlarges the distances between the regression targets of samples from the same class [20], which often degrades the recognition accuracy. What's more, unduly pursuing the largest margins between classes with the minimum regression loss magnifies the possibility of overfitting. Although CLSR can ensure that samples in the same class have similar soft target labels, it is performed directly on the original inputs without a dimension reduction step. This may not always be a good choice if the data has very high dimensionality and contains many redundant features or even noise that is harmful to recognition tasks. Intuitively, it is necessary and beneficial to perform feature learning and classification simultaneously so that the two phases can boost each other during training and the overall optimality of algorithmic performance is guaranteed. We can see that our methods not only exploit the latent low-rank subspace to find a compact and discriminative representation of data by modeling the correlations between different samples, but also tend to keep the learned representations sharing the same label close to each other while pushing those with different labels far apart in each local neighborhood of the target space through the graph structure used in SOLPP, and are thereby more in line with the goal of classification.
Third, as the size of the training set increases, the classification performance of our methods improves steadily and tends to be more stable. We attribute this to the fact that a larger set of training data samples the underlying distribution more accurately than a smaller one. We also notice that our methods provide superior results even under a small training sample setting. For example, on the ORL and KTH-TIPS databases, which have a small number of instances in each class, both DGRL and KDGRL consistently beat all the other methods with desirable recognition rates. These experimental results agree with the previous theoretical analysis and further demonstrate that our algorithms can address small sample size problems very well. Besides, in situations where the number of data points exceeds the dimension of the data space, our approaches still perform comparably well and achieve remarkable testing accuracy. Our experiments on the CUReT and COIL-100 databases nicely bear this out. That is, the effectiveness of the proposed models is not confined to undersampled problems, and they can have wider application scope in real-world tasks.
Fourth, nonlinear extensions based on kernel learning usually lead to better performance than their linear counterparts. For example, LRKR consistently delivers higher classification accuracy than LRLR, and the differences are very significant on the Extended YaleB, CUReT, KTH-TIPS, COIL-100 and MNIST databases. A similar observation holds for LRKRR and KDGRL as well. For Extended YaleB, KTH-TIPS, COIL-100 and MNIST, LRKRR outperforms LRRR by a large margin. On the whole, across the different databases and experimental settings, KDGRL holds the best recognition rates among all the competitors except for the ORL case, where the average accuracy of KDGRL is comparable with that of DGRL. Especially on the MNIST subset, KDGRL yields a considerable performance gain over DGRL in every setting across the various training set sizes. From these results, we can conclude that using the kernel trick is indeed helpful for improving the generalization ability of LRKR, LRKRR, and KDGRL, as has also been confirmed by other researchers. The essential reason is that highly nonlinear data structures are linearized and simplified in the kernel-induced feature space, such that linear techniques can be readily applied to the subsequent data representation and classification.
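For reference, the Gaussian RBF kernel used throughout the kernel-based experiments can be computed as below. This is a minimal sketch: whether the tuned "width" in our toolbox runs is σ itself or another parameterization is not restated here, so the σ convention is an assumption.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
    X has shape (m, d), Z has shape (n, d); the result has shape (m, n)."""
    d2 = (np.sum(X ** 2, axis=1, keepdims=True)
          - 2.0 * X @ Z.T
          + np.sum(Z ** 2, axis=1))
    # clip tiny negative values caused by floating-point cancellation
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
```

The implicit feature map behind this kernel is infinite-dimensional, which is exactly why LRKR, LRKRR, and KDGRL can linearize the highly nonlinear digit and texture manifolds discussed above.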
Finally, our methods always enjoy more promising performance than RLS, RLSL and SALPL on all six databases. Moreover, our methods achieve better recognition performance than RLRLR, WCSDLSR, DRLPP and LRPER in most cases. We conjecture that for RLSL and SALPL, preserving the main energy of the original data by exploring the best reconstruction of the new representation is a good idea to weaken the disturbance of noise or errors from corrupted samples, but this is not sufficient to guarantee that the extracted features have the most discriminative power for classification, not to mention that they lose sight of the locality and neighborhood properties of data. RLS, WCSDLSR and LRPER overlook the local structure of data, which generally also contains much discriminative information. Besides, RLS, RLRLR, DLSR, ReLSR, CLSR and WCSDLSR only use a single matrix to transform samples into the target space; such a single transformation has less freedom to obtain better margins. DRLPP is an unsupervised algorithm in which label information is not taken into consideration, yet the underlying class information is crucial for classification. By contrast, in our proposed models, both the global and local information within data as well as the underlying discriminative knowledge are naturally encoded into a unified framework by means of low-rank and graph embedding, which enables the learned projection to capture the intrinsic and discriminant structures residing in data so as to further improve the recognition accuracy. In particular, in most cases, the results of DGRL are even better than those of some kernel-based nonlinear approaches (e.g., LRKR and LRKRR), which only disclose the global correlation relationships of classes. This indicates the advantage of imposing the inherent local manifold geometric structure in a supervised manner to complement the single global Euclidean structure in the representation.
Aside from that, RLSL, SALPL, RLRLR, WCSDLSR, DRLPP, and LRPER adopt alternating iterative optimization strategies to solve their non-convex objective functions and can easily converge to a local minimum, inducing inferior results. On the contrary, problems (6) and (14) are both convex and have analytic solutions, which makes DGRL and KDGRL simpler and more stable, encouraging satisfactory classification performance.
Parameter sensitivity analysis
In this study, we conduct experiments on the Extended YaleB, CUReT, COIL-100, and MNIST databases to examine how the proposed model selection algorithms work in practice. Figs 1 and 2 respectively depict the parameter estimation results of DGRL and KDGRL on these four databases, where #Tr denotes the number of training samples per class selected for evaluation. This intuitively shows the influence of the parameters λ and α on classification performance. As can be seen, both of our methods perform well over a fairly wide range of parameter settings for different recognition applications. More specifically, both of them are very robust to changes in the value of α. As for λ, it should be neither too large nor too small, and the classification accuracy curve is relatively smooth in some local areas when λ is located in a feasible interval. This is mainly because a very large or very small λ may incur either underfitting or overfitting in the learned model, since it controls the complexity of the classifier. Overall, our proposed Algorithms 1 and 2 are effective in identifying the most suitable combinations of λ and α for various tasks as long as they are restricted to a reasonable search space.
Running time comparison
In this section, we empirically compare the computational cost of the different algorithms as the size of the training set increases on the Extended YaleB database. The experiments are carried out in a MATLAB R2017b environment running on an ordinary PC with a 3.0-GHz CPU and 16-GB RAM. Table 7 records the average training time (in seconds) across 10 trials of each method under its best-performing parameters. It is clear from the table that our methods have much lower time costs than RLSL and SALPL. The main reason is that both RLSL and SALPL need to iteratively update the subspace and classification model variables until convergence in the training phase, while ours have closed-form solutions that can be obtained directly by solving the generalized eigenvalue problems (9) and (16), respectively, only once. As expected, RLS, LRRR and LRKRR run fast compared with our methods because the graph regularizer requires extra computation. In addition, DLSR, ReLSR, and CLSR achieve faster learning speeds than KDGRL, since they merely serve as plain linear classifiers without nonlinear mapping and feature extraction processes. Interestingly, DGRL is computationally more efficient than SVM, RLRLR and CLSR in this comparison. It is also noted that DGRL and KDGRL deliver comparable or even better results than all the compared methods (see Table 2). In this sense, our methods achieve quite promising recognition results with acceptable running time.
Conclusion
In this article, we propose a novel DGRL model that explicitly exploits both global and local class information from data so as to learn a more compact and discriminative representation for recognition. DGRL seamlessly integrates feature learning and ridge regression into a unified framework, such that the obtained projection subspace has more discriminating power and is thus competent for classification tasks. Moreover, a supervised graph constraint is tailored to make the best of the underlying discriminative information characterized by the manifold structure of data, which further enlarges the distances between different classes while simultaneously heightening the coherence of the learned intraclass representations in each local neighborhood of the target space. We also extend DGRL to its kernelized version, named KDGRL, with an implicit kernel mapping to address highly nonlinear problems. We derive two effective model selection algorithms for DGRL and KDGRL to tune the regularization parameters. Empirical studies have been carried out on six widely used databases, and the experimental results clearly illustrate that our methods are superior to other relevant state-of-the-art approaches in terms of classification performance.
In our future work, we plan to extend our models to the semi-supervised discriminative data representation learning scenario by reformulating the regularization term Ψ(f) as in [18], where information from the unlabeled samples is encoded into the graph structure. Furthermore, although our methods show remarkable performance when applied to uncontaminated databases, they are somewhat sensitive to data corrupted with severe noise or outliers, due to the fact that they use the Euclidean distance as the metric. Therefore, imposing robust norms [11], such as the or
sparse constraints, on both the loss function and the regularization term to boost the robustness of DGRL and KDGRL needs to be studied further. On the other hand, since KDGRL involves the whole kernel matrix, it is not scalable to large data; it is thus worth investigating how to overcome this limitation. Finally, it is interesting to point out that formula (1) is actually a general framework for feature extraction and classification. In addition to the similarity matrix used in (4), several popular criteria, such as MFA, LSDA, and LFDA, can also be readily incorporated into our graph regularizer Ψ(f) as alternatives to L. We will try to systematically compare all possible combinations of these discriminative regularization terms with other types of classifiers or loss functions in the near future.
Supporting information
S1 Appendix. Appendix A: Proof of Theorem 1; Appendix B: Proof of Proposition 2; Appendix C: Proof of Theorem 2; Appendix D: Proof of Proposition 4.
https://doi.org/10.1371/journal.pone.0326950.s001
(DOCX)
References
- 1. Zhao X, Guo J, Nie F, Chen L, Li Z, Zhang H. Joint principal component and discriminant analysis for dimensionality reduction. IEEE Trans Neural Netw Learn Syst. 2020;31(2):433–44. pmid:31107663
- 2. De la Torre F. A least-squares framework for component analysis. IEEE Trans Pattern Anal Mach Intell. 2012;34(6):1041–55. pmid:21911913
- 3. Howland P, Park H. Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Trans Pattern Anal Mach Intell. 2004;26(8):995–1006. pmid:15641730
- 4. Ye J, Yu B. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. J Mach Learn Res. 2005;6(1):483–502.
- 5. Ye J, Xiong T, Madigan D. Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. J Mach Learn Res. 2006;7(7):1183–204.
- 6. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B. An introduction to kernel-based learning algorithms. IEEE Trans Neural Netw. 2001;12(2):181–201. pmid:18244377
- 7. Yang J, Frangi AF, Yang J-Y, Zhang D, Jin Z. KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans Pattern Anal Mach Intell. 2005;27(2):230–44. pmid:15688560
- 8. Ji S, Ye J. Kernel uncorrelated and regularized discriminant analysis: a theoretical and computational study. IEEE Trans Knowl Data Eng. 2008;20(10):1311–21.
- 9. Chakraborty R, Yang L, Hauberg S, Vemuri BC. Intrinsic Grassmann averages for online linear, robust and nonlinear subspace learning. IEEE Trans Pattern Anal Mach Intell. 2021;43(11):3904–17. pmid:32386140
- 10. Yin M, Gao J, Lin Z. Laplacian regularized low-rank representation and its applications. IEEE Trans Pattern Anal Mach Intell. 2016;38(3):504–17. pmid:27046494
- 11. Wong WK, Lai Z, Wen J, Fang X, Lu Y. Low-rank embedding for robust image feature extraction. IEEE Trans Image Process. 2017;26(6):2905–17. pmid:28410104
- 12. Wen J, Fang X, Xu Y, Tian C, Fei L. Low-rank representation with adaptive graph regularization. Neural Netw. 2018;108:83–96. pmid:30173056
- 13. Yan S, Xu D, Zhang B, Zhang H-J, Yang Q, Lin S. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Trans Pattern Anal Mach Intell. 2007;29(1):40–51. pmid:17108382
- 14. Cai D, He X, Zhou K, Han J, Bao H. Locality sensitive discriminant analysis. In: Twentieth International Joint Conference on Artificial Intelligence. 2007. pp. 1713–26.
- 15. Sugiyama M. Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis. J Mach Learn Res. 2007;8(5):1027–61.
- 16. Wong WK, Zhao HT. Supervised optimal locality preserving projection. Pattern Recognit. 2012;45(1):186–97.
- 17. Fang X, Xu Y, Li X, Lai Z, Wong WK, Fang B. Regularized Label Relaxation Linear Regression. IEEE Trans Neural Netw Learn Syst. 2018;29(4):1006–18. pmid:28166507
- 18. Zheng Z, Ling S, Yong X, Li L, Jian Y. Marginal representation learning with graph structure self-adaptation. IEEE Trans Neural Netw Learn Syst. 2018;29(10):4645–59. pmid:29990209
- 19. Jing P, Su Y, Nie L, Gu H, Liu J, Wang M. A framework of joint low-rank and sparse regression for image memorability prediction. IEEE Trans Circuits Syst Video Technol. 2019;29(5):1296–309.
- 20. Han N, Wu J, Fang X, Wong WK, Xu Y, Yang J, et al. Double relaxed regression for image classification. IEEE Trans Circuits Syst Video Technol. 2020;30(2):307–19.
- 21. Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans Pattern Anal Mach Intell. 2013;35(1):171–84. pmid:22487984
- 22. Cai X, Ding C, Nie F, Huang H. On the equivalent of low-rank linear regressions and linear discriminant analysis based regressions. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013. pp. 1124–32.
- 23. Zhang Z, Lai Z, Xu Y, Shao L, Wu J, Xie G-S. Discriminative elastic-net regularized linear regression. IEEE Trans Image Process. 2017;26(3):1466–81. pmid:28092552
- 24. Iosifidis A, Gabbouj M. Class-specific kernel discriminant analysis revisited: further analysis and extensions. IEEE Trans Cybern. 2017;47(12):4485–96. pmid:28113416
- 25. Fang X, Teng S, Lai Z, He Z, Xie S, Wong WK, et al. Robust latent subspace learning for image classification. IEEE Trans Neural Netw Learn Syst. 2018;29(6):2502–15. pmid:28500010
- 26. Fang X, Han N, Wu J, Xu Y, Yang J, Wong WK, et al. Approximate low-rank projection learning for feature extraction. IEEE Trans Neural Netw Learn Syst. 2018;29(11):5228–41. pmid:29994377
- 27. Xie L, Yin M, Yin X, Liu Y, Yin G. Low-rank sparse preserving projections for dimensionality reduction. IEEE Trans Image Process. 2018;27(11):5261–74. pmid:30010570
- 28. Zhang Z, Xu Y, Shao L, Yang J. Discriminative block-diagonal representation learning for image recognition. IEEE Trans Neural Netw Learn Syst. 2018;29(7):3111–25. pmid:28692990
- 29. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3(1):1–122.
- 30. Xiang S, Nie F, Meng G, Pan C, Zhang C. Discriminative least squares regression for multiclass classification and feature selection. IEEE Trans Neural Netw Learn Syst. 2012;23(11):1738–54. pmid:24808069
- 31. Zhang X-Y, Wang L, Xiang S, Liu C-L. Retargeted least squares regression algorithm. IEEE Trans Neural Netw Learn Syst. 2015;26(9):2206–13. pmid:25474813
- 32. Yuan H, Zheng J, Lai LL, Tang YY. A constrained least squares regression model. Inf Sci. 2018;429:247–59.
- 33. Arenas-García J, Petersen K, Hansen L. Sparse kernel orthonormalized PLS for feature extraction in large data sets. In: International Conference on Neural Information Processing Systems. 2007. pp. 33–40.
- 34. Min C, Yu S, Jia G, Liu D, Wang K. Comprehensive defect detection of bamboo strips with new feature extraction machine vision methods. J Adv Manuf Sci Technol. 2024;4(1):2023018.
- 35. Qi X, Xiao R, Li C-G, Qiao Y, Guo J, Tang X. Pairwise rotation invariant co-occurrence local binary pattern. IEEE Trans Pattern Anal Mach Intell. 2014;36(11):2199–213. pmid:26353061
- 36. Cai D, He X, Han J. SRDA: an efficient algorithm for large-scale discriminant analysis. IEEE Trans Knowl Data Eng. 2008;20(1):1–12.
- 37. Ma J, Zhou S. Discriminative least squares regression for multiclass classification based on within-class scatter minimization. Appl Intell. 2021;52(1):622–35.
- 38. Jiang L, Fang X, Sun W, Han N, Teng S. Low-rank constraint based dual projections learning for dimensionality reduction. Signal Process. 2023;204:108817.
- 39. Zhang T, Long C, Deng Y, Wang W, Tan S, Li H. Low-rank preserving embedding regression for robust image feature extraction. IET Comput Vis. 2023;18(1):124–40.
- 40. Chang C, Lin C. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):1–27.
- 41. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 2006;7:2399–434.