Abstract
The inferential results regarding estimates of Support Vector Regression (SVR) are highly influenced by anomalies and ill-conditioned predictors. Excessive dimensions of data also make the model complex. To improve estimation accuracy, this paper introduces two modelling frameworks, Principal Component Robust Support Vector Regression (PCRSVR) and Principal Fitted Component Robust Support Vector Regression (PFCRSVR). These techniques are developed by incorporating PCs and PFCs with Exponential Quantile SVR (EQSVR), which is capable of dealing with ill-conditioned regressors, extreme observations, and high-dimensional data settings simultaneously. An extensive simulation study has been conducted to evaluate the performance of the proposed methods. Different evaluation criteria are chosen in this regard. Additionally, real-life data applications illustrate the efficacy of the proposed techniques as compared to competing ones.
Citation: Tahir A, Ilyas M (2025) Principal fitted component framework for robust support vector regression based on bounded loss: A simulation study with potential applications. PLoS One 20(6): e0321102. https://doi.org/10.1371/journal.pone.0321102
Editor: Mohamed R. Abonazel, Cairo University, EGYPT
Received: January 5, 2025; Accepted: March 1, 2025; Published: June 4, 2025
Copyright: © 2025 Tahir, Ilyas. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data is made available at https://github.com/aiman-4/PCRSVR_PFCRSVR.git.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Principal Component Regression (PCR) [1,2] is a widely used technique to address the problem of multicollinearity within the framework of multiple linear regression. PCR is conducted in two main steps. First, Principal Component Analysis (PCA) [3,4] is performed to transform the original predictors into a new set of orthogonal components, or Principal Components (PCs). Then, a subset of these PCs is selected as new explanatory variables in the regression model.
While various adaptations of PCR are discussed in the literature (e.g., [5–7]), this study focuses on the classical PCR approach. In classical PCR, PCs with large eigenvalues are prioritized to capture the maximum variation in the data. However, this approach may not always be ideal for predictive accuracy, as PCs with smaller eigenvalues can have a stronger correlation with the response variable (e.g., [8,9]). To address this limitation, various strategies have been developed to incorporate information from the response variable during the construction of PCs (e.g., [10–12]). The focus of this paper is Principal Fitted Component Regression (PFCR), proposed by [10]. It modifies PCR by regressing the response variable on a subset of Principal Fitted Components (PFCs) rather than on traditional PCs. These PFCs are designed to retain the predictive information about the response variable that is embedded within the predictors, which is typically achieved through inverse regression. Moreover, PFCR addresses the effects of ill-conditioned predictors [9].
Despite these advancements, PCR remains sensitive to outliers, which can distort both the PCA and the regression model. To make PCR more robust, researchers have developed estimators that combine outlier-resistant techniques with PCR. For instance, [13] proposed a robust approach to PCR that substitutes classical PCA with robust PCA. In this method, the covariance matrix is estimated using the least median of squares [14], which reduces the influence of outliers. Additional robust PCR methods have been developed to address different data complexities. [15], for example, introduced an outlier detection method for the response matrix. This method, called “resampling by halfmeans” [16], identifies and removes outlier-contaminated samples before conducting PCA. [17] proposed a robust PCR approach based on projection pursuit [18]. This approach identifies robust PCs and uses them in least-trimmed square regression [19] to reduce the influence of extreme values. [20] developed two variations of robust PCR, each tailored to different data dimensions. For low-dimensional data (p < n), they used the minimum covariance determinant estimator [14] to estimate the covariance matrix. For high-dimensional data (p > n), they recommended the ROBPCA method [21]. This method computes robust PCs specifically for high-dimensional scenarios and then applies robust regression. In addition to these methods, [22] proposed an empirical technique for robust PCR that depends upon “principal sensitive vectors” [23]. It detects outliers before performing classical PCR. [24] conducted a comparative study between robust PCR and robust partial least squares regression. Their study evaluates these methods based on efficiency, robustness, predictive competency, and model fitness. More recent techniques have incorporated advanced statistical frameworks to increase the robustness of PCR. [25] proposed an estimator for parameter function in functional logistic regression to handle functional outliers. 
[26] introduced a Bayesian approach to improve outlier resistance for both independent and dependent factors. This method penalizes unusual data points to ensure that predictions align with the core data distribution. [27] further advanced robust PCR techniques by proposing a correlation-scaled robust estimator for PCR. This method addresses the challenges of multicollinearity, outliers, and high-dimensional data. It incorporates response variable information directly into the computation of PCs. This approach enhances the predictive stability of PCR while controlling for data irregularities and dimensionality issues in multiple linear regression.
Support Vector Regression (SVR) was introduced by [28] as a method for tackling regression problems in machine learning. It builds on the principles of Support Vector Machines (SVM) [29]. Unlike conventional regression models, SVR has garnered widespread attention across numerous disciplines [30]. The primary aim of SVR is to minimize the deviation between the predicted outcome and the actual value, and several loss functions are used to quantify this deviation. Although classical SVR has achieved notable success in various fields, it is sensitive to outliers because it relies on unbounded loss functions. With an unbounded loss, the loss term grows without limit as the error increases. Consequently, outliers can shift the regression line substantially, which reduces model accuracy. To counter this issue, researchers have integrated bounded loss functions into the SVR framework. For instance, [31] introduced a truncated є-insensitive loss to develop a truncated SVR model, motivated by the Ramp loss. [32] introduced the RLS-SVR model by truncating the least squares loss. Similarly, [33] proposed the RLNPSVR model by applying a Ramp-type loss in nonparallel SVR. [34] proposed the NQSVR model based on a non-convex quadratic є-insensitive loss. Nevertheless, the truncation of loss functions introduces non-differentiable points, which increases the complexity of the optimization process. [35] addressed this issue by applying the Rhinge loss in Twin Support Vector Regression (TSVR), resulting in a more robust TSVR model. More recently, a novel bounded framework was proposed by [36]. It transforms unbounded loss functions into bounded ones, which establishes the foundation for the development of BLSSVR. Inspired by these advancements, EQSVM and EQSVR were proposed by [37], based on the bounded exponential quantile loss. This framework offers an alternative approach to scaling unbounded convex loss functions, providing greater resistance to outliers while preserving model efficiency.
The literature validates that blended estimators can outperform single estimators by combining the strengths of each [38]. Examples of such blended approaches include the combined r-k estimator, which integrates the PCA and ridge estimator [38], the robust‑stein estimator [39], the combined PC-KL estimator [40], and the hybrid PC-SVR [41].
Modern data analysis is increasingly characterized by complex challenges, including multicollinearity, excessive dimensions of data, and the pervasive presence of anomalies. Traditional regression techniques often fail to deliver reliable results in such scenarios, leaving a critical gap in the ability to model real-world data effectively. This study addresses these pressing issues by introducing two approaches, i.e., PCRSVR and PFCRSVR. These methods integrate PCs and PFCs with Exponential Quantile Support Vector Regression (EQSVR) within a machine learning framework. The proposed techniques are designed to handle ill-conditioned regressors, anomalies, and large dimensions of data simultaneously. Their computational algorithms are also developed. Notably, PFCRSVR addresses the predictive limitations identified by [8] and [9] by incorporating response variable information directly into the computation of PCs. This approach aims to improve predictive accuracy by retaining components that are more relevant to the response variable. A comparative analysis is conducted among the proposed robust approaches and their non-robust counterparts to evaluate the effectiveness of the proposed techniques. Among the proposed methods and baseline counterparts, PFCRSVR consistently performs best, achieving the lowest MSE and MAE across all techniques. This establishes PFCRSVR as the most effective framework for complex data environments.
The organization of the paper is as follows: The subsections of section 1 describe principal component regression, principal fitted component regression and robust support vector regression, respectively. The proposed methodology and its computational algorithms are discussed in section 2. Section 3 describes the simulation study used to investigate the performance of the proposed methods, and section 4 reports its results. Section 5 discusses the findings. Real-life data applications illustrate the developed techniques in section 6. Section 7 gives concluding remarks on the paper.
1.1. Principal component regression
[42] introduced Principal Component Analysis (PCA) as a method to transform correlated predictors into uncorrelated variables called principal components (PCs). Each PC is a linear combination of the original predictors, constructed using specific weights. Consider X, an n×p matrix where n is the number of observations and p is the number of predictors. The PCs are computed as $M_j = X v_j$, $j = 1, \dots, p$. Here, $v_1, \dots, v_p$ represent the eigenvectors of the covariance matrix (Σ = cov (X)), and their corresponding eigenvalues are $\ell_1 \ge \ell_2 \ge \dots \ge \ell_p$. The eigenvectors $v_1, \dots, v_p$ are arranged into the p×p matrix (V) and the PCs $M_1, \dots, M_p$ are composed into the n×p matrix (M).
In PCR, a subset of the first q PCs (Mq) is used to model the response variable (y), with q ≤ p. This relationship is modelled by Eq 1, where αq is the q×1 vector of regression coefficients for the q PCs and є is the (n×1) error term. The regression coefficients (αq) are estimated using the least squares method (Eq 2). Once these coefficients are estimated, they are transformed back to the original predictor space, as defined in Eq 3. Here, $\hat{\beta}$ represents the (p×1) vector of estimated regression coefficients for the original predictors. By selecting only the leading PCs that account for most of the variability in the data, PCR simplifies the model and addresses the problem of ill-conditioned regressors.
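The two-step procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' R code: it assumes Eqs 1–3 take their standard forms (regress y on Mq by least squares, then map back with the first q eigenvectors), and the function name `pcr_fit` and the toy data are hypothetical.

```python
import numpy as np


def pcr_fit(X, y, q):
    """Principal component regression on the first q components.

    Returns coefficients on the centred original predictors.
    """
    Xc = X - X.mean(axis=0)                 # centre predictors
    yc = y - y.mean()                       # centre response
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to decreasing eigenvalue
    Vq = eigvecs[:, order[:q]]              # p x q leading eigenvectors
    Mq = Xc @ Vq                            # n x q principal components
    alpha_q, *_ = np.linalg.lstsq(Mq, yc, rcond=None)  # least squares on PCs
    return Vq @ alpha_q                     # back to the predictor space


# Toy data (assumed): the last two predictors are nearly collinear.
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 2] + 0.01 * rng.normal(size=n)
y = X @ np.array([1.0, -1.0, 0.5, 0.5]) + 0.1 * rng.normal(size=n)

beta_pcr = pcr_fit(X, y, q=3)    # drops the smallest-variance component
beta_full = pcr_fit(X, y, q=p)   # keeping all components reproduces OLS
```

Keeping all p components gives the ordinary least squares solution, so any stabilising effect of PCR comes entirely from discarding the trailing components.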
1.2. Principal fitted component regression
Using principal components as regressors in regression models raises certain concerns. First, PCs are derived solely from the predictor variables without incorporating the response variable. This approach assumes that the response depends primarily on the first few PCs, but in reality, it might also rely on components associated with smaller variations. Second, PCs lack the properties of invariance and equivariance when the predictor variables undergo full-rank linear transformations.
To address these limitations, PFCs were introduced for dimension reduction in regression modeling [10]. Compared to PCs, PFCs provide two key advantages. They retain equivariance under full-rank linear transformations of predictors and can be tailored to incorporate information from the response variable.
PFCs are constructed by extracting sufficient information about the response variable (y) from the predictors (X). This is often achieved through inverse regression, which involves estimating E[X | y]. Unlike forward regression, which models E[y | X = x], inverse regression reduces the problem to p one-dimensional regressions.
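Eq 4 is an inverse regression model, $X_y = \mu + \Gamma \nu_y + \varepsilon$, such that $\varepsilon \sim N(0, \Delta)$. Here, $\mu$ represents the mean of the predictors, and $\Gamma \in \mathbb{R}^{p \times q}$ is a semi-orthogonal matrix whose columns form a basis for the q-dimensional subspace $S_\Gamma = \mathrm{span}\{\mu_y - \mu \mid y \in S_y\}$, where $\mu_y = E[X \mid y]$ and Sy is the sample space of y. The term $\nu_y = \xi f_y$ includes $\xi \in \mathbb{R}^{q \times r}$ and $f_y \in \mathbb{R}^{r}$ with q ≤ min (r, p), a mean-centred vector-valued function of y, satisfying $\sum_y f_y = 0$. Instead of indexing predictors conventionally by i, here y serves as the index. The predictors (Xy) are regressed on a response-dependent function (fy), which is constructed using a specific basis function g. This basis is mean-centred as $f_y = g(y) - \bar{g}$, with $g(y)$ typically chosen as a polynomial basis with degree r, i.e., $g(y) = (y, y^2, \dots, y^r)^T$ and $\bar{g} = n^{-1} \sum_y g(y)$. Here, $\Delta$ is assumed to be independent of y and its simplest form is isotropic with $\Delta = \sigma^2 I_p$.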
To compute PFCs, the sample covariance matrix of the fitted predictors, $\hat{\Sigma}_{fit} = \hat{X}^T \hat{X} / n$, is estimated. Here, $\hat{X}$ represents the predictors fitted from the regression of Xy on fy. PCA is then applied to $\hat{\Sigma}_{fit}$, yielding eigenvectors $\hat{v}_1, \dots, \hat{v}_p$ corresponding to eigenvalues $\hat{\ell}_1 \ge \dots \ge \hat{\ell}_p$. These eigenvectors are used to construct PFCs, expressed as $Z_j = X \hat{v}_j$. Instead of using all PFCs, a subset of q PFCs is employed in the regression model. Since PFCs incorporate information from the response variable during their construction, they often outperform PCs in regression tasks under various scenarios [10].
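The construction of PFCs under the isotropic structure can be sketched as follows. This is a simplified illustration under stated assumptions, not the `ldr` implementation used in the paper: the polynomial basis, the division by n in the fitted covariance, and the name `pfc_components` are choices made here for the example.

```python
import numpy as np


def pfc_components(X, y, r=2):
    """Return (Z, eigenvalues): PFCs and eigenvalues of the fitted covariance."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    # Polynomial basis of y up to degree r, mean-centred column by column.
    F = np.column_stack([y ** k for k in range(1, r + 1)])
    F = F - F.mean(axis=0)
    # Inverse regression: each predictor is regressed on the basis f_y.
    B, *_ = np.linalg.lstsq(F, Xc, rcond=None)   # r x p coefficient matrix
    X_fit = F @ B                                # n x p fitted predictors
    sigma_fit = X_fit.T @ X_fit / n              # fitted covariance (rank <= r)
    eigvals, eigvecs = np.linalg.eigh(sigma_fit)
    order = np.argsort(eigvals)[::-1]            # decreasing eigenvalue
    V_hat = eigvecs[:, order]
    return Xc @ V_hat, eigvals[order]            # PFCs are Z = X V_hat


# Toy data (assumed): y depends on one direction of the predictor space.
rng = np.random.default_rng(1)
n, p = 300, 5
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(n)

Z, lam = pfc_components(X, y, r=2)
corr_first = abs(np.corrcoef(Z[:, 0], y)[0, 1])  # first PFC tracks the response
```

Because the fitted predictors lie in the span of the r basis columns, at most r eigenvalues of the fitted covariance are non-zero, which is why only a few PFCs carry information.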
1.3. Exponential Quantile Support Vector Regression (EQSVR)
Consider n training instances and p features. The ith training instance is denoted as $x_i \in \mathbb{R}^p$ and its associated outcome as yi, i = 1, 2, …, n. The data matrix $X \in \mathbb{R}^{n \times p}$ is composed by arranging samples in rows and features in columns, and y is the (n×1) vector of responses. [37] introduced the two-parameter exponential quantile loss (Leq-loss) into standard SVR. Here, λ > 0 and τ ≥ 0 are two tuning parameters: λ controls the steepness of the Leq-loss and τ acts as a hedging factor. The Leq-loss is built on the Pinball loss and additionally involves a location constant and a normalizing constant; their exact definitions and the conditions they satisfy are given in [37]. Thus, the objective function of EQSVR is formulated in Eq 5. Here, w is the (p×1) vector of weights, b denotes the bias and C represents the non-negative penalty parameter. After estimating w and b, a new sample xnew is predicted using the relation $f(x_{new}) = w^T x_{new} + b$.
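To make the idea of a bounded loss concrete, the sketch below contrasts the unbounded Pinball loss with a bounded exponential transform of it. The transform `1 - exp(-lam * pinball)` is an assumed, simplified form chosen only for illustration; it omits the location and normalizing constants and should not be read as the exact Leq-loss of [37].

```python
import math


def pinball_loss(u, tau):
    """Pinball (quantile) loss of a residual u at level tau; unbounded in |u|."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def bounded_exp_loss(u, tau, lam):
    """Illustrative bounded transform of the pinball loss (assumed form).

    The value never exceeds 1 and flattens out as |u| grows,
    so a single extreme residual cannot dominate the objective.
    """
    return 1.0 - math.exp(-lam * pinball_loss(u, tau))


# Hyperparameter values taken from the simulation settings (tau = 0.7, lam = 0.5).
for u in (0.5, 2.0, 10.0, 100.0):
    print(u, pinball_loss(u, 0.7), round(bounded_exp_loss(u, 0.7, 0.5), 4))
```

Moving a residual from 10 to 100 multiplies the Pinball loss by ten, while the bounded version barely changes, which is the mechanism that limits the pull of outliers on the fitted line.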
In this paper, EQSVR is formulated for a linear regression problem. Let $\tilde{X} = [X, e]$ and $\tilde{w} = [w^T, b]^T$. Here, e is the (n×1) vector of ones. With this notation, the objective function (Eq 5) is transformed to Eq 6.
EQSVR utilizes the ConCave-Convex Procedure (CCCP) to transform the non-convex Leq-loss into a sequence of convex optimization problems. These convex optimization problems are then solved by the ClipDCD algorithm [43]. To solve Eq 6, the Leq-loss is decomposed into g (u) and h (u), defined in Eq 7 and Eq 8, respectively. Subsequently, the model of EQSVR is formulated in Eq 9.
The first two terms of Eq 9 form the convex part of the objective, and the third term forms the concave part. The CCCP method is employed to optimize the problem defined in Eq 9. The sub-problems in Eq 10 are solved iteratively to obtain the optimal solution; each sub-problem uses the derivative of the concave part evaluated at the current solution. An auxiliary variable, defined in Eq 11, is introduced for ease of notation. Eq 9 is then reformulated as Eq 12 and further simplified to Eq 13. Eq 14 is the matrix form of Eq 13.
The Lagrange function is defined in Eq 15 by incorporating two multipliers, γ and θ. The Karush-Kuhn-Tucker (KKT) conditions, derived in Eqs 16–19, must be satisfied. Eq 20 is obtained by plugging the KKT conditions into the Lagrangian function (Eq 15). Solving Eq 16 gives the weight vector defined in Eq 25. Hence, Eq 23 can be redefined using the results in Eq 21 and Eq 22. Here, I denotes the identity matrix and 0 is the vector of zeroes.
The problem in Eq 23 is a quadratic optimization problem that can be solved by the ClipDCD algorithm [43]. The solution and the auxiliary variable are updated at each CCCP iteration. After obtaining the optimal solution, a new instance is predicted using Eq 24. The CCCP algorithm for EQSVR, based on ClipDCD, is described in Fig 1.
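The mechanics of CCCP can be seen on a one-dimensional toy problem. The example below is not the EQSVR objective: it minimises the assumed function f(x) = x^4 - 2x^2, split into a convex part x^4 and a concave part -2x^2, and its convex sub-problem happens to have a closed-form solution, whereas EQSVR solves each sub-problem with ClipDCD.

```python
import math


def f(x):
    """Toy non-convex objective: convex part x**4 plus concave part -2*x**2."""
    return x ** 4 - 2.0 * x ** 2


def cccp_toy(x0, n_iter=50):
    """Run CCCP on f starting from x0 and return the final iterate."""
    x = x0
    for _ in range(n_iter):
        slope = -4.0 * x          # derivative of the concave part at the current x
        # Convex sub-problem: minimise x**4 + slope * x.
        # Setting its derivative 4*x**3 + slope to zero gives x**3 = -slope / 4.
        target = -slope / 4.0
        x = math.copysign(abs(target) ** (1.0 / 3.0), target)
    return x


print(cccp_toy(0.1), cccp_toy(-0.3))
```

Each iteration replaces the concave part by its tangent at the current point, so the objective value cannot increase from one iterate to the next; the start point decides which of the two local minima (x = 1 or x = -1) is reached.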
2. Proposed methods
2.1. Principal Component Robust Support Vector Regression (PCRSVR)
The proposed PCRSVR is a hybrid technique that combines PCs and EQSVR in a machine learning framework. This approach can handle data irregularities, ill-conditioned predictors and excessive data dimensions simultaneously. It first performs PCA on the predictors and constructs new transformed variables, known as principal components, which eliminates the problem of ill-conditioned predictors. It then chooses the first q PCs that explain the maximum variation of the predictors. These PCs are used as regressors in the EQSVR framework to model an outcome variable that is contaminated by anomalies. The anomalies are handled by the Leq-loss that is plugged into EQSVR.
The PCRSVR performs the following steps to calculate the MSE of the estimated regression parameters.
- Step 1: Generate predictors (X) by Eq 27 and standardize them.
- Step 2: Simulate the response variable (y) using Eq 28. Define the vector of regression coefficients (β) as the eigenvector corresponding to the largest eigenvalue of the information matrix (XT X).
- Step 3: Introduce outliers in y using Eq 29 according to the outlier fraction specified in section 3.
- Step 4: Obtain the eigenvalues ($\ell_1 \ge \dots \ge \ell_p$) and eigenvectors ($v_1, \dots, v_p$) of (XT X) by applying PCA and construct the new transformed variables (M = XV).
- Step 5: Retain q PCs (Mq), where q represents the number of components explaining at least 80% of the variation of X.
- Step 6: Model the contaminated y based on Mq using a linear kernel and the Leq-loss in SVR, as described in subsection 1.3.
- Step 7: Estimate the regression parameters for the q PCs using Eq 25, based on the modelling framework implemented in step 6.
- Step 8: Convert these estimated parameters back to the original predictor space using the transformation explained in Eq 3.
- Step 9: Calculate the MSE of $\hat{\beta}$ according to Eq 26. Here, $\hat{\beta}$ denotes the value estimated through the proposed modelling framework (PCRSVR) and β represents its true value.
- Step 10: Replicate steps 1–9 for 100 Monte Carlo runs and obtain the mean over the 100 runs.
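The component-retention rule in the algorithm (keep the smallest q that explains at least 80% of the variation) can be written as a small helper. The function name and the example eigenvalues are hypothetical; only the 80% threshold comes from the text.

```python
def n_components_for_variance(eigenvalues, threshold=0.80):
    """Smallest q whose leading eigenvalues explain at least `threshold` of the total."""
    ordered = sorted(eigenvalues, reverse=True)   # largest eigenvalue first
    total = sum(ordered)
    running = 0.0
    for q, value in enumerate(ordered, start=1):
        running += value                          # cumulative explained variation
        if running / total >= threshold:
            return q
    return len(ordered)


# Example with assumed eigenvalues: the first two explain 90% of the total.
q = n_components_for_variance([3.0, 1.5, 0.3, 0.2])
print(q)
```

With strongly collinear predictors the leading eigenvalue dominates, so q is typically much smaller than p; with nearly uncorrelated predictors almost all components are retained.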
2.2. Principal Fitted Component Robust Support Vector Regression (PFCRSVR)
The PFCRSVR method combines PFCs with EQSVR to provide a robust solution for data irregularities, ill-conditioned predictors, and high-dimensional settings simultaneously. It also addresses the challenge noted by [8] and [9] by using fitted predictors, instead of the original ones, in the computation of the components. The computation of the fitted predictors is described in subsection 1.2. In PFCRSVR, PCA is applied to the fitted predictors to construct PFCs, as detailed in subsection 1.2. From these PFCs, the top q PFCs are selected for further modeling. These components are then used as inputs in the EQSVR framework to predict an outcome variable that contains anomalies. The Leq-loss function within EQSVR provides resilience to these anomalies, enabling accurate and reliable regression modeling.
The following steps are used to obtain the MSE of the estimated regression parameters of PFCRSVR.
- Step 1: Perform steps 1–3 of the PCRSVR algorithm.
- Step 2: Compute the fitted predictors ($\hat{X}$) by regressing X on the polynomial basis of y, using the inverse regression model described in Eq 4. Here, the PFC model assumes an isotropic structure and a second-degree polynomial (r = 2).
- Step 3: Obtain the fitted sample covariance matrix ($\hat{\Sigma}_{fit}$) of the fitted predictors ($\hat{X}$).
- Step 4: Perform PCA on the fitted predictors ($\hat{X}$) to obtain the eigenvalues ($\hat{\ell}_1 \ge \dots \ge \hat{\ell}_p$) and their corresponding eigenvectors ($\hat{v}_1, \dots, \hat{v}_p$).
- Step 5: Multiply the eigenvectors ($\hat{v}_j$) with X to get the PFCs ($Z_j = X \hat{v}_j$). Compose these PFCs into the n×p matrix (Z).
- Step 6: Select the q PFCs (Zq) that account for at least 80% of the variation of X and use them in the further modelling process.
- Step 7: Use the Leq-loss function in SVR to model the contaminated response variable (y) based on Zq using a linear kernel, as detailed in subsection 1.3.
- Step 8: Estimate the regression coefficients for the retained q PFCs (Zq) using Eq 25.
- Step 9: Transform the estimated coefficients back to the original predictor space using the mapping described in Eq 3.
- Step 10: Compute the MSE of the proposed estimator PFCRSVR ($\hat{\beta}$) using Eq 26.
- Step 11: Repeat steps 1–10 for 100 Monte Carlo runs and calculate the mean over the 100 replications.
3. Simulation study
In this section, we evaluate the performance of the proposed methods (i.e., PCRSVR and PFCRSVR) by conducting a Monte Carlo simulation study using the R programming language. The relevant code and data files are deposited at https://github.com/aiman-4/PCRSVR_PFCRSVR.git. For the implementation of PFCR and classical SVR, the R packages ldr [44] and e1071 [45] are used, respectively. The competing techniques are SVR, EQSVR, PCSVR and PFCSVR. PCSVR and PFCSVR are two hybrid approaches that use q PCs and q PFCs, respectively, as regressors in classical SVR. The simulation settings are outlined in subsection 3.1.
3.1. Simulation design
In this subsection, the data generation process for the synthetic data sets is described. The explanatory variables are generated according to Eq 27, following the approach of [40] and [39]. In this setup, ρ indicates the correlation between any two explanatory variables and Dij represents independent pseudo-random numbers drawn from the standard normal distribution. The response variable is simulated based on Eq 28. Here, ei ~ N(0,1) and the regression coefficients βj are chosen to satisfy $\beta^T \beta = 1$, following [46].
The proposed techniques are evaluated by varying several key factors, including sample size, degree of correlation, level of contamination, and number of predictors. We consider collinearity levels of ρ = 0.8, 0.9 and 0.99. The number of explanatory variables is set to p = 5, 15 and 25. Additionally, we test sample sizes of n = 50, 100, 300 and 500. To evaluate the robustness of the proposed techniques, we introduce different proportions of outliers, i.e., 0%, 5%, 15%, and 30%. Different combinations of these factors are considered in this study and the relevant results are reported in section 4. The hyperparameters of EQSVR are set to τ = 0.7 and λ = 0.5. The penalty parameter (C) of EQSVR is set to 0.2 for all scenarios except those with p = 25, where C = 0.04.
This study focuses on vertical outliers, which affect only the response variable. We contaminate the response variable (y) randomly, following Eq 29 as suggested by [40]. Here, b is the magnitude of outliers, set at a constant value of 10.
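The simulation design can be sketched as below. Eqs 27 and 29 are not reproduced in the text, so their forms here are assumptions: predictors follow the commonly used scheme x_ij = sqrt(1 - ρ²) D_ij + ρ D_i,p+1, and outliers are created by adding b to a random subset of responses. The function name `simulate` and its signature are hypothetical.

```python
import numpy as np


def simulate(n, p, rho, outlier_frac, b=10.0, seed=0):
    """Generate standardized collinear predictors and a contaminated response."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, p + 1))
    # Assumed form of Eq 27: the shared column D[:, p] induces collinearity
    # (under this form the pairwise correlation is rho**2).
    X = np.sqrt(1.0 - rho ** 2) * D[:, :p] + rho * D[:, [p]]
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize predictors
    # beta: unit-norm eigenvector of X'X for the largest eigenvalue
    # (eigh returns eigenvalues in ascending order, so take the last column).
    _, eigvecs = np.linalg.eigh(X.T @ X)
    beta = eigvecs[:, -1]
    y = X @ beta + rng.standard_normal(n)          # Eq 28 with N(0, 1) errors
    # Assumed form of Eq 29: shift a random subset of responses by b.
    m = int(round(outlier_frac * n))
    idx = rng.choice(n, size=m, replace=False)
    y_cont = y.copy()
    y_cont[idx] += b                               # vertical outliers only
    return X, y_cont, beta, idx


X, y, beta, idx = simulate(n=100, p=5, rho=0.9, outlier_frac=0.15)
```

Only the response is shifted, so the predictor matrix stays free of leverage points, matching the focus on vertical outliers.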
3.2. Performance evaluation criteria
The proposed techniques are compared with the competing ones based on Mean Square Error (MSE) and Mean Absolute Error (MAE). These evaluation measures have been considered by various researchers (see, e.g., [37,39,40]). These metrics are computed using Eq 30 and Eq 31. Here, $\hat{\beta}_l$ is the lth estimated regression coefficient of any studied modelling framework and $\beta_l$ is its corresponding true value. The technique that produces the lowest values of MSE and MAE is considered the most effective.
Additionally, the advantage of the developed techniques over their counterparts is quantified by the percentage reduction in MSE achieved by the proposed techniques. This indicator is termed PMSE and is computed using Eq 32. Here, PMSE denotes the percentage by which the MSE of a proposed technique is lower (or higher) than that of its competitor. MSE* and MSE** denote the mean square error of the proposed technique and its competitor, respectively. A positive PMSE indicates that the proposed technique improves on its competitor, whereas a negative PMSE indicates that it is inferior to the baseline technique.
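The evaluation criteria can be illustrated with the following sketch. Eqs 30–32 are not reproduced in the text, so the forms used here are assumptions: MSE and MAE are averaged over the p coefficients, and PMSE is the reduction in MSE relative to the competitor, expressed as a percentage. The function names are hypothetical.

```python
def mse(beta_hat, beta):
    """Mean squared error between estimated and true coefficients (assumed form)."""
    return sum((bh - b) ** 2 for bh, b in zip(beta_hat, beta)) / len(beta)


def mae(beta_hat, beta):
    """Mean absolute error between estimated and true coefficients (assumed form)."""
    return sum(abs(bh - b) for bh, b in zip(beta_hat, beta)) / len(beta)


def pmse(mse_proposed, mse_competitor):
    """Percentage reduction in MSE of the proposed method over a competitor.

    Positive values favour the proposed method; negative values favour the competitor.
    """
    return 100.0 * (mse_competitor - mse_proposed) / mse_competitor


# Illustrative numbers only, not results from the paper.
print(pmse(0.02, 2.0))   # proposed MSE is 1% of the competitor's
print(pmse(3.0, 2.0))    # proposed method is worse, so the value is negative
```

Under this definition PMSE cannot exceed 100%, and a value close to 100% means the proposed method's MSE is a small fraction of its competitor's.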
4. Results
An extensive simulation study has been conducted across the scenarios described above. The simulation experiments are replicated 100 times. For each replication, the MSE and MAE of $\hat{\beta}$ are computed for the proposed methods (i.e., PCRSVR and PFCRSVR) and their competitors (i.e., SVR, EQSVR, PCSVR and PFCSVR). The summary statistics (i.e., mean and Standard Error (SE)) of the performance measures over 100 replications are reported in Tables 1–12. For brevity, a few tables are placed in the supporting information (see S1–S6 Tables). Tables 1–12 show that the proposed techniques (PCRSVR and PFCRSVR) produce lower MSE and MAE than their baseline counterparts (SVR, EQSVR, PCSVR and PFCSVR) in almost all studied simulation settings. Also, the MSE and MAE of all studied estimators decrease as the sample size increases (see Tables 1–12). All the studied techniques perform well across the various degrees of correlation and percentages of outliers. However, the proposed techniques PCRSVR and PFCRSVR outperform EQSVR and their respective non-robust estimators PCSVR and PFCSVR. As the contamination fraction increases, the MSE and MAE increase for all the techniques. However, these metrics increase more slowly for the proposed techniques than for the competing ones, especially when the sample size is large (see Tables 1–12). It is also noticed that the MSE and MAE of SVR, PCSVR and PFCSVR tend to increase with the level of collinearity (see Tables 1–6), whereas an inverse relationship is exhibited between the degree of correlation and the performance measures of EQSVR, PCRSVR and PFCRSVR. It can also be noticed that an increase in the number of predictors generally increases the MSE and MAE of SVR and decreases the performance metrics of all other techniques, with a few exceptions. For instance, a direct relationship is observed between the number of predictors and the performance metrics of PCSVR when ρ = 0.99 (see Tables 5, 6, 11 and 12).
Moreover, Figs 2–7 and S1–S3 Figs provide a clearer view by displaying the percentage reduction in MSE of PCRSVR and PFCRSVR over their competitors EQSVR, PCSVR and PFCSVR. The proposed techniques are the most efficient, as they produce lower MSE than the competing ones. It is also evident from Figs 4 and 7 and S3 Fig that the efficiency of PCRSVR and PFCRSVR improves substantially relative to PCSVR and PFCSVR for ρ = 0.99. For example, the percentage reductions in MSE of PFCRSVR against PFCSVR are 90%, 99% and 98% for n = 100, 300 and 500, respectively, when p = 25 and the contamination level is 30% (see Fig 4d). Similarly, with the same level of collinearity and contamination, the pattern remains consistent when p = 15 and n = 50, 100 and 300. Consequently, the reduction in MSE reaches up to 99% for the proposed techniques over their competitors, even with a high concentration of outliers and collinearity. Further, the proposed techniques are also compared with the baseline EQSVR and prove competitive, because they reduce the MSE across the considered simulation settings (see Figs 2–7 and S1–S3 Figs). Therefore, the results indicate that the proposed approaches outperform the competing techniques by overcoming the effects of anomalies and multicollinearity simultaneously.
5. Discussion
The results of this study demonstrate the robustness and effectiveness of the proposed regression frameworks (i.e., PCRSVR and PFCRSVR). Extensive simulations reveal that these methods outperform their baseline counterparts (i.e., PCSVR, PFCSVR, and EQSVR) in almost all studied settings. The proposed frameworks handle challenges such as high multicollinearity, anomaly severity, and varying sample and predictor sizes. Both PCRSVR and PFCRSVR achieve lower MSE and MAE values. These results showcase their ability to mitigate the adverse effects of extreme observations as well as ill-conditioned predictors. Moreover, the validation using real-life datasets highlights the practical relevance of these approaches. The proposed techniques outperform the baseline methods on real-life datasets characterized by high multicollinearity and the presence of outliers. These findings underline the generalizability and effectiveness of PCRSVR and PFCRSVR in tackling real-world regression challenges.
Despite these promising results, certain limitations deserve attention. The study focuses solely on a normally distributed response variable and vertical outliers. This excludes scenarios involving non-normal responses, such as binary or count data, as well as leverage points in the predictor space. These restrictions may limit the applicability of the methods in domains with more diverse data characteristics. Future research could focus on extending the frameworks to handle response variables from the exponential family. Modifications could also address leverage points to improve the methods’ robustness. Expanding the scope in these directions would enhance the utility and adaptability of the proposed techniques.
Another limitation stems from the nature of the datasets analyzed. The study predominantly focuses on cases where the number of observations exceeds the number of predictors. While this condition is common in many regression applications, it does not account for high-dimensional settings where predictors outnumber observations. In such scenarios, standard dimensionality reduction techniques, like principal components, may not perform optimally. Future work could adapt the frameworks to high-dimensional datasets. This could involve advanced strategies such as sparsity-inducing penalties or tailored regularization techniques. Addressing these gaps would extend the applicability of these methods to fields like genomics, text mining, and image analysis.
6. Real-life data application
This section demonstrates the performance of the proposed techniques using the pollution and Longley datasets. These datasets have been widely analyzed in previous research (e.g., [39,47–49]). According to the literature, these datasets are known for having ill-conditioned predictors and extreme observations. Therefore, these real-life datasets are suitable for evaluating methods that address these issues efficiently.
In the pollution dataset, the outcome variable is the age-adjusted mortality rate per 100,000, which depends on fifteen explanatory variables. A detailed description of these covariates is available in prior studies (e.g., [47,49]). Application of the least squares method reveals a high degree of multicollinearity, with variance inflation factors of 98.6 for x12 and 104.9 for x13. The strength of correlation among the predictors is illustrated in Fig 8a. Residual analysis is also conducted to identify extreme observations. Normal QQ plots of residuals and Cook’s distances indicate that observations 2, 29, 32, 37, 48, 57, and 59 are outliers (see Fig 9a and 9b). These findings confirm that the dataset exhibits both multicollinearity and outliers, making it an appropriate example for testing the proposed techniques.
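Variance inflation factors like those quoted above can be computed from the inverse of the predictor correlation matrix, since the jth diagonal entry equals 1 / (1 - R_j²). The sketch below uses assumed toy data rather than the pollution dataset, and the function name `vif` is hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)   # p x p correlation matrix of predictors
    return np.diag(np.linalg.inv(R))


# Toy data (assumed): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x2 = x1 + 0.1 * rng.standard_normal(500)
x3 = rng.standard_normal(500)

values = vif(np.column_stack([x1, x2, x3]))
print(values)   # large for x1 and x2, close to 1 for x3
```

A common rule of thumb treats values above 10 as a sign of serious multicollinearity, so factors near 100 indicate that the affected coefficients are very poorly determined by least squares.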
The regression coefficients of the predictors are estimated using all examined modeling frameworks, with results presented in Table 13. For performance evaluation, the standard errors of the bootstrap regression estimates are calculated for each technique, and the mean standard errors are reported (see, Table 13). The results indicate that the proposed techniques yield lower Mean Standard Errors of Bootstrap Estimates (MSEBE) compared to competing methods. Notably, the PFCRSVR method outperforms all other approaches, achieving the lowest mean standard error of the bootstrap regression estimates.
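The MSEBE criterion can be sketched as follows. The text does not state the resampling scheme or the number of bootstrap samples, so case resampling and 200 replicates are assumptions, and ordinary least squares stands in for the estimators compared in the paper. The names `mean_bootstrap_se` and `ols` are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Stand-in estimator: ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mean_bootstrap_se(X, y, fit, n_boot=200, seed=0):
    """Mean over coefficients of the bootstrap standard errors of fit(X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows with replacement
        estimates[b] = fit(X[idx], y[idx])      # refit on the bootstrap sample
    se = estimates.std(axis=0, ddof=1)          # bootstrap SE per coefficient
    return se.mean()


# Toy data (assumed), not the pollution or Longley datasets.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(200)

msebe = mean_bootstrap_se(X, y, ols)
print(msebe)
```

Any estimator that maps (X, y) to a coefficient vector can be passed as `fit`, so the same helper would rank competing methods by the stability of their estimates.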
Further, the Longley dataset [50] comprises five predictors, with the objective of modeling total derived employment (y). These predictors include the Gross National Product (GNP) implicit price deflator (x1), GNP (x2), unemployment rate (x3), size of the armed forces (x4), and non-institutional population aged fourteen years and older (x5). Prior research has highlighted the significant effect of multicollinearity and the presence of outliers within this dataset [40].
The multicollinearity is indicated by a high condition index of 43,275 and variance inflation factors of 5,209.50, 306.50, 2,825.30, 37.74, and 39.90 for the five predictors. Fig 8b illustrates the correlations among the predictors, highlighting the extent of multicollinearity. Additionally, Fig 10 includes a QQ-plot of residuals and Cook’s distances, which reveal data points 6, 10, 12, 14, and 16 as notable outliers.
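A condition index of the kind reported above can be computed from the singular values of the design matrix. The sketch below is a hedged Python/NumPy illustration: it assumes one common convention (columns scaled to unit length, intercept included, index defined as the largest singular value divided by each singular value), which may differ from the convention behind the figure quoted in the text. The function name `condition_indices` and the synthetic data are hypothetical.

```python
import numpy as np

def condition_indices(X, add_intercept=True):
    """Condition indices d_max / d_j from the singular values of the
    column-scaled design matrix (unit-length columns; an assumed convention)."""
    Z = np.asarray(X, dtype=float)
    if add_intercept:
        Z = np.column_stack([np.ones(Z.shape[0]), Z])
    Z = Z / np.linalg.norm(Z, axis=0)          # scale each column to unit length
    d = np.linalg.svd(Z, compute_uv=False)     # singular values, descending order
    return d[0] / d

# Synthetic illustration (NOT the Longley data)
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)           # almost an exact copy of x1
X_demo = np.column_stack([x1, x2])

ci = condition_indices(X_demo)
largest = ci.max()                             # the overall condition number
```

A frequently quoted rule of thumb treats a largest index above roughly 30 as a sign of serious collinearity; the near-duplicate columns in this synthetic example exceed that by a wide margin, whereas an orthonormal design gives indices of exactly 1.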
Regression parameters are estimated for both the proposed methods and the existing approaches, and the results are presented in Table 14. To further assess estimation accuracy, bootstrap coefficients are estimated, and the MSEBE for each method is also reported in Table 14. The proposed methods demonstrate favourable performance compared to competing techniques, achieving the lowest MSEBE values.
7. Conclusion
This research advances a robust regression framework by addressing core challenges, including multicollinearity, outliers, and high-dimensional data, which constrain the effectiveness of classical SVR. By proposing frameworks that incorporate PCs and PFCs, the study not only tackles these issues but also broadens the scope of SVR’s applicability to more intricate and irregular data environments.
Moreover, the findings highlight a broader paradigm shift in developing robust regression approaches to tackle real-world challenges. Fields such as finance, healthcare, and environmental science frequently face complex data structures, and these complexities often compromise the accuracy of predictive models. The proposed frameworks are well suited to such settings. The ability to address ill-conditioned predictors and neutralize the effects of anomalies positions these frameworks as transformative tools for practitioners.
Supporting information
S1 File.
S1 Table. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.9.
S2 Table. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.9.
S3 Table. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.8.
S4 Table. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.8.
S5 Table. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.99.
S6 Table. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.99.
S7 Table. A list of abbreviations used in the paper.
https://doi.org/10.1371/journal.pone.0321102.s001
(PDF)
S1 Fig. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 25 and ρ = 0.9.
https://doi.org/10.1371/journal.pone.0321102.s002
(TIFF)
S2 Fig. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 25 and ρ = 0.9.
https://doi.org/10.1371/journal.pone.0321102.s003
(TIFF)
S3 Fig. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 25 and ρ = 0.99.
https://doi.org/10.1371/journal.pone.0321102.s004
(TIFF)
References
- 1. Massy WF. Principal components regression in exploratory statistical research. J Am Stat Assoc. 1965;60(309):234–56.
- 2. Jolliffe IT. Principal components in regression analysis. In: Principal Component Analysis. New York: Springer; 1986. https://doi.org/10.1007/0-387-22440-8_8
- 3. Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dubl Phil Mag J Sci. 1901;2(11):559–72.
- 4. Jolliffe I. Principal component analysis. Encyclopedia of statistics in behavioural science. 2005. https://doi.org/10.1002/0470013192.bsa501
- 5. Thomas EV. Incorporating auxiliary predictor variation in principal component regression models. J Chemom. 1995;9(6):471–81.
- 6. Wang K, Abbott D. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol. 2008;32(2):108–18. pmid:17849491
- 7. Agarwal A, Harris K, Whitehouse J, Wu SZ. Adaptive principal component regression with applications to panel data. Adv Neural Inf Process Syst Sci. 2023;36:77104–18.
- 8. Jolliffe IT. A note on the use of principal components in regression. Appl Stat. 1982;31(3):300–3.
- 9. Cook RD. Principal components, sufficient dimension reduction, and envelopes. Annu Rev Stat Appl. 2018;5(1):533–59.
- 10. Cook RD. Fisher lecture: Dimension reduction in regression. 2007. doi:
- 11. Kawano S, Fujisawa H, Takada T, Shiroishi T. Sparse principal component regression with adaptive loading. Comput Stat Data Anal. 2015;89:192–203.
- 12. Singh KK, Patel A, Sadu C. Correlation scaled principal component regression. In Intelligent Systems Design and Applications: 17th International Conference on Intelligent Systems Design and Applications (ISDA 2017) held in Delhi, India, December 14-16. Springer International Publishing. 2018. https://doi.org/10.1007/978-3-319-76348-4_34
- 13. Walczak B, Massart DL. Robust principal components regression as a detection tool for outliers. Chemometr Intell Lab Syst. 1995;27(1):41–54.
- 14. Rousseeuw PJ. Least median of squares regression. J Am Stat Assoc. 1984;79(388):871–80.
- 15. Pell RJ. Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometr Intell Lab Syst. 2000;52(1):87–104.
- 16. Egan WJ, Morgan SL. Outlier detection in multivariate analytical chemical data. Anal Chem. 1998;70(11):2372–9. pmid:21644644
- 17. Filzmoser P. Robust principal component regression. In: Aivazian S, Kharin Y, Rider L (eds). Proceedings of the Sixth International Conference on Computer Data Analysis and Modeling. Minsk: Belarusia. 2001;1:132–137.
- 18. Li G, Chen Z. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. J Am Stat Assoc. 1985;80(391):759–66.
- 19. Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. John Wiley & Sons; 2003.
- 20. Hubert M, Verboven S. A robust PCR method for high‐dimensional regressors. J Chemom. 2003;17(8-9):438–52.
- 21. Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust principal component analysis. Technometrics. 2005;47(1):64–79.
- 22. Zhang MH, Xu QS, Massart DL. Robust principal components regression based on principal sensitivity vectors. Chemometr Intell Lab Syst. 2003;67(2):175–85.
- 23. Peña D, Yohai V. A fast procedure for outlier diagnostics in large regression problems. J Am Stat Assoc. 1999;94(446):434–45.
- 24. Engelen S, Hubert M, Vanden Branden K, Verboven S. Robust PCR and Robust PLSR: a comparative study. In Theory and applications of recent robust methods. Birkhäuser Basel. 2004. p. 105–117. https://doi.org/10.1007/978-3-0348-7958-3_10
- 25. Denhere M, Billor N. Robust principal component functional logistic regression. Commun Stat - Simul Comput. 2016;45(1):264–81.
- 26. Gagnon P, Bédard M, Desgagné A. An automatic robust Bayesian approach to principal component regression. J Appl Statist. 2021;48(1):84–104. pmid:35707235
- 27. Tahir A, Ilyas M. Robust correlation scaled principal component regression. Hacet J Math Stat. 2023;52(2):459–86.
- 28. Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V. Support vector regression machines. In Proceedings of the 9th international conference on neural information processing systems. 1996;15:5–161.
- 29. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97. https://www.scopus.com/inward/record.uri?eid=2-s2.0-34249753618&doi=10.1023%2fA%3a1022627411411&partnerID=40&md5=97a8591c7d55575e8c48344379ee2796
- 30. Liang X, Zhang Z, Song Y, Jian L. Kernel-based online regression with canal loss. Eur J Oper Res. 2022;297(1):268–79.
- 31. Zhao YP, Sun JG. Robust truncated support vector regression. Expert Syst Appl. 2010;37(7):5126–33.
- 32. Wang K, Zhong P. Robust non-convex least squares loss function for regression with outliers. Knowl-Based Syst. 2014;71:290–302. https://doi.org/10.1016/j.knosys.2014.08.003
- 33. Tang L, Tian Y, Yang C, Pardalos PM. Ramp-loss nonparallel support vector regression: robust, sparse and scalable approximation. Knowl-Based Syst. 2018;147:55–67.
- 34. Ye Y, Gao J, Shao Y, Li C, Jin Y, Hua X. Robust support vector regression with generic quadratic nonconvex ε-insensitive loss. Appl Math Model. 2020;82:235–51.
- 35. Singla M, Ghosh D, Shukla KK, Pedrycz W. Robust twin support vector regression based on rescaled hinge loss. Pattern Recognit. 2020;105:107395.
- 36. Fu S, Tian Y, Tang L. Robust regression under the general framework of bounded loss functions. Eur J Oper Res. 2023;310(3):1325–39.
- 37. Li F, Yang H. A novel bounded loss framework for support vector machines. Neural Netw. 2024;178:106476. https://doi.org/10.1016/j.neunet.2024.106476
- 38. Baye MR, Parker DF. Combining ridge and principal component regression: a money demand illustration. Commun Stat - Theory Methods. 1984;13(2):197–205.
- 39. Lukman AF, Farghali RA, Kibria BG, Oluyemi OA. Robust-stein estimator for overcoming outliers and multicollinearity. Sci Rep. 2023;13(1)
- 40. Arum KC, Ugwuowo FI, Oranye HE, Alakija TO, Ugah TE, Asogwa OC. Combating outliers and multicollinearity in linear regression model using robust Kibria-Lukman mixed with principal component estimator, simulation and computation. Sci Afr. 2023;19:e01566.
- 41. Hua XG, Ni YQ, Ko JM, Wong KY. Modeling of temperature–frequency correlation using combined principal component analysis and support vector regression technique. J Comput Civ Eng. 2007;21(2):122–35.
- 42. Anderson TW. An Introduction to Multivariate Statistical Analysis. Wiley; 2003.
- 43. Peng X, Chen D, Kong L. A clipping dual coordinate descent algorithm for solving support vector machines. Knowl-Based Syst. 2014;71:266–78.
- 44. Adragni KP, Raim A. ldr: Methods for likelihood-based dimension reduction in regression. R package version 1.3. 2014. Available from: https://CRAN.R-project.org/package=ldr
- 45. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-14. 2023. Available from: https://CRAN.R-project.org/package=e1071
- 46. Kibria BG. Performance of some new ridge regression estimators. Commun Stat - Simul Comput. 2003;32(2):419–35.
- 47. McDonald GC, Schwing RC. Instabilities of regression estimates relating air pollution to mortality. Technometrics. 1973;15(3):463–81.
- 48. Walker E, Birch JB. Influence measures in ridge regression. Technometrics. 1988;30(2):221–7.
- 49. Yüzbaşı B, Arashi M, Ejaz Ahmed S. Shrinkage estimation strategies in generalised ridge regression models: low/high‐dimension regime. Int Stat Rev. 2020;88(1):229–51.
- 50. Longley JW. An appraisal of least squares programs for the electronic computer from the point of view of the user. J Am Stat Assoc. 1967;62(319):819–41.