Principal fitted component framework for robust support vector regression based on bounded loss: A simulation study with potential applications

Aiman Tahir; Maryam Ilyas

doi:10.1371/journal.pone.0321102

Abstract

The inferential results regarding estimates of Support Vector Regression (SVR) are highly influenced by anomalies and ill-conditioned predictors. Excessive dimensions of data also make the model complex. To improve estimation accuracy, this paper introduces two modelling frameworks, Principal Component Robust Support Vector Regression (PCRSVR) and Principal Fitted Component Robust Support Vector Regression (PFCRSVR). These techniques are developed by incorporating PCs and PFCs with Exponential Quantile SVR (EQSVR), which is capable of dealing with ill-conditioned regressors, extreme observations, and high-dimensional data settings simultaneously. An extensive simulation study has been conducted to evaluate the performance of the proposed methods. Different evaluation criteria are chosen in this regard. Additionally, real-life data applications illustrate the efficacy of the proposed techniques as compared to competing ones.

Citation: Tahir A, Ilyas M (2025) Principal fitted component framework for robust support vector regression based on bounded loss: A simulation study with potential applications. PLoS One 20(6): e0321102. https://doi.org/10.1371/journal.pone.0321102

Editor: Mohamed R. Abonazel, Cairo University, EGYPT

Received: January 5, 2025; Accepted: March 1, 2025; Published: June 4, 2025

Copyright: © 2025 Tahir, Ilyas. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data is made available at https://github.com/aiman-4/PCRSVR_PFCRSVR.git.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

Principal Component Regression (PCR) [1,2] is a widely used technique to address the problem of multicollinearity within the framework of multiple linear regression. PCR is conducted in two main steps. First, Principal Component Analysis (PCA) [3,4] is performed to transform the original predictors into a new set of orthogonal components, or Principal Components (PCs). Then, a subset of these PCs is selected as new explanatory variables in the regression model.

While various adaptations of PCR are discussed in the literature (e.g., [5–7]), this study focuses on the classical PCR approach. In classical PCR, PCs with large eigenvalues are prioritized to capture the maximum variation in the data. However, this approach may not always be ideal for predictive accuracy, as PCs with smaller eigenvalues can be more strongly correlated with the response variable (e.g., [8,9]). To address this limitation, various strategies have been developed to incorporate information from the response variable during the construction of PCs (e.g., [10–12]). The focus of this paper is Principal Fitted Component Regression (PFCR), proposed by [10]. It modifies PCR by regressing the response variable on a subset of Principal Fitted Components (PFCs) rather than on traditional PCs. These PFCs are designed to retain the predictive information about the response variable that is embedded within the predictors, typically by means of inverse regression. Moreover, PFCR addresses the effects of ill-conditioned predictors [9].

Despite these advancements, PCR remains sensitive to outliers, which can distort both the PCA and the regression model. To make PCR more robust, researchers have developed estimators that combine outlier-resistant techniques with PCR. For instance, [13] proposed a robust approach to PCR that substitutes classical PCA with robust PCA. In this method, the covariance matrix is estimated using the least median of squares [14], which reduces the influence of outliers. Additional robust PCR methods have been developed to address different data complexities. [15], for example, introduced an outlier detection method for the response matrix. This method, called “resampling by halfmeans” [16], identifies and removes outlier-contaminated samples before conducting PCA. [17] proposed a robust PCR approach based on projection pursuit [18]. This approach identifies robust PCs and uses them in least trimmed squares regression [19] to reduce the influence of extreme values. [20] developed two variations of robust PCR, each tailored to different data dimensions. For low-dimensional data (p < n), they used the minimum covariance determinant estimator [14] to estimate the covariance matrix. For high-dimensional data (p > n), they recommended the ROBPCA method [21]. This method computes robust PCs specifically for high-dimensional scenarios and then applies robust regression. In addition to these methods, [22] proposed an empirical technique for robust PCR that depends upon “principal sensitive vectors” [23]. It detects outliers before performing classical PCR. [24] conducted a comparative study between robust PCR and robust partial least squares regression. Their study evaluates these methods based on efficiency, robustness, predictive ability, and model fit. More recent techniques have incorporated advanced statistical frameworks to increase the robustness of PCR. [25] proposed an estimator for the parameter function in functional logistic regression to handle functional outliers. [26] introduced a Bayesian approach to improve outlier resistance in both the independent and dependent variables. This method penalizes unusual data points to ensure that predictions align with the core data distribution. [27] further advanced robust PCR techniques by proposing a correlation scaled robust estimator for PCR. This method addresses the challenges of multicollinearity, outliers, and high-dimensional data. It incorporates response variable information directly into the computation of PCs. This approach enhances the predictive stability of PCR while controlling for data irregularities and dimensionality issues in multiple linear regression.

Support Vector Regression (SVR) was introduced by [28] as a method for tackling regression problems in machine learning. It builds on the principles of Support Vector Machines (SVM) [29] and has garnered widespread attention across numerous disciplines [30]. The primary aim of SVR is to minimize the deviation between the predicted and actual values, and several loss functions are used to quantify this deviation. Although classical SVR has been notably successful in various fields, it is not robust to outliers because it relies on unbounded loss functions. With an unbounded loss, the loss term grows without limit as the error increases. Consequently, outliers can shift the fitted regression line substantially and reduce model accuracy. To counter this issue, researchers have turned to integrating bounded loss functions into the SVR framework. For instance, [31] introduced a truncated є-insensitive loss, motivated by the Ramp loss, to develop a truncated SVR model. [32] introduced the RLS-SVR model by truncating the least squares loss. Similarly, [33] proposed the RLNPSVR model by applying a Ramp-type loss in nonparallel SVR. [34] proposed the NQSVR model based on a non-convex quadratic є-insensitive loss. Nevertheless, truncating a loss function introduces non-differentiable points, which increases the complexity of the optimization process. [35] addressed this issue by applying the Rhinge loss in Twin Support Vector Regression (TSVR), resulting in a more robust TSVR model. More recently, [36] proposed a novel bounded framework that transforms unbounded loss functions into bounded ones and establishes the foundation for BLSSVR. Inspired by these advancements, [37] proposed EQSVM and EQSVR, which are based on a bounded exponential quantile loss. This framework offers an alternative approach to scaling unbounded convex loss functions, providing greater resistance to outliers while preserving model efficiency.

The literature validates that blended estimators can outperform single estimators by combining the strengths of each [38]. Examples of such blended approaches include the combined r-k estimator, which integrates the PCA and ridge estimator [38], the robust‑stein estimator [39], the combined PC-KL estimator [40], and the hybrid PC-SVR [41].

Modern data analysis is increasingly characterized by complex challenges, including multicollinearity, excessive dimensions of data, and the pervasive presence of anomalies. Traditional regression techniques often fail to deliver reliable results in such scenarios, leaving a critical gap in the ability to model real-world data effectively. This study addresses these pressing issues by introducing two approaches, i.e., PCRSVR and PFCRSVR. These methods integrate PCs and PFCs with Exponential Quantile Support Vector Regression (EQSVR) within a machine learning framework. The proposed techniques are designed to handle ill-conditioned regressors, anomalies, and large dimensions of data simultaneously. Their computational algorithms are also developed. Notably, PFCRSVR addresses the predictive limitations identified by [8] and [9] by incorporating response variable information directly into the computation of PCs. This approach aims to improve predictive accuracy by retaining components that are more relevant to the response variable. A comparative analysis is conducted among the proposed robust approaches and their non-robust counterparts to evaluate the effectiveness of the proposed techniques. Among the proposed methods and baseline counterparts, PFCRSVR consistently performs best, achieving the lowest MSE and MAE across all techniques. This establishes PFCRSVR as the most effective framework for complex data environments.

The organization of the paper is as follows: Subsections of section 1 describe the principal component regression, principal fitted component regression and robust support vector regression, respectively. The proposed methodology and its computational algorithms are discussed in section 2. Section 3 conducts the simulation study to investigate the performance of the proposed methods. Real-life data applications illustrate the developed techniques in section 4. Section 5 gives concluding remarks on the paper.

1.1. Principal component regression

[42] introduced Principal Component Analysis (PCA) as a method to transform correlated predictors into uncorrelated variables called principal components (PCs). Each PC is a linear combination of the original predictors, constructed using specific weights. Consider X, an n×p matrix where n is the number of observations and p is the number of predictors. The PCs are computed as m_j = X v_j, j = 1, …, p. Here, v_1, …, v_p represent the eigenvectors of the covariance matrix (Σ = cov (X)), and their corresponding eigenvalues are ordered from largest to smallest. The eigenvectors are arranged into a p×p matrix (V) and the PCs are collected in an n×p matrix (M = XV).

In PCR, a subset of the first q PCs (M_q) is used to model the response variable (y), with q ≤ p. This relationship is modelled by Eq 1, where α_q is the q×1 vector of regression coefficients for the q PCs and ε is the (n×1) error term. The regression coefficients (α_q) are estimated using the least squares method (Eq 2). Once these coefficients are estimated, they are transformed back to the original predictor space, as defined in Eq 3. Here, V_q denotes the p×q matrix holding the first q columns of V, and β̂_PCR represents the (p×1) vector of estimated regression coefficients for the original predictors. By selecting only the leading PCs that account for most of the variability in the data, PCR simplifies the model and addresses the problem of ill-conditioned regressors.

$$y = M_q \alpha_q + \varepsilon \tag{1}$$

$$\hat{\alpha}_q = (M_q^{T} M_q)^{-1} M_q^{T} y \tag{2}$$

$$\hat{\beta}_{PCR} = V_q \hat{\alpha}_q \tag{3}$$
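As a rough illustration of the two PCR steps behind Eqs 1–3, the following Python sketch computes the PCs, regresses y on the first q of them, and maps the coefficients back to the predictor space. It assumes NumPy is available; the function name `pcr_coefficients` and the toy data are hypothetical and are not taken from the authors' R code.

```python
import numpy as np

def pcr_coefficients(X, y, q):
    """Classical PCR: regress y on the first q PCs, then map back to predictor space."""
    Xc = X - X.mean(axis=0)                  # centre the predictors
    yc = y - y.mean()                        # centre the response
    cov = np.cov(Xc, rowvar=False)           # p x p sample covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]         # reorder from largest to smallest
    V_q = eigvec[:, order[:q]]               # p x q leading eigenvectors
    M_q = Xc @ V_q                           # n x q principal components
    alpha_q, *_ = np.linalg.lstsq(M_q, yc, rcond=None)   # least squares on the PCs
    return V_q @ alpha_q                     # coefficients for the original predictors

# Toy data (hypothetical): four predictors, known coefficients, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
beta = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta + 0.1 * rng.normal(size=200)
beta_full = pcr_coefficients(X, y, q=4)      # with q = p, PCR coincides with least squares
```

Choosing q < p discards the directions with the smallest eigenvalues, which is what stabilizes the estimate when the predictors are ill-conditioned.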

1.2. Principal fitted component regression

Using principal components as regressors in regression models raises certain concerns. First, PCs are derived solely from the predictor variables without incorporating the response variable. This approach assumes that the response depends primarily on the first few PCs, but in reality, it might also rely on components associated with smaller variations. Second, PCs lack the properties of invariance and equivariance when the predictor variables undergo full-rank linear transformations.

To address these limitations, PFCs were introduced for dimension reduction in regression modeling [10]. Compared to PCs, PFCs provide two key advantages. They retain equivariance under full-rank linear transformations of predictors and can be tailored to incorporate information from the response variable.

PFCs are constructed by extracting sufficient information about the response variable (y) from the predictors (X). This is often achieved through inverse regression, which involves estimating E[X | Y = y]. Unlike forward regression, which models E[Y | X = x], inverse regression reduces the problem to p one-dimensional regressions, one for each predictor.

Eq 4 is the inverse regression model, in which the error term satisfies ε ~ N(0, Δ). Here, μ represents the mean of the predictors, and Γ is a p×q semi-orthogonal matrix whose columns form a basis for the q-dimensional subspace S_Γ = span{E(X_y) − E(X) | y ∈ S_y}, where S_y is the sample space of y. The term βf_y consists of a q×r coefficient matrix β, with q ≤ min (r, p), and an r×1 mean-centred vector-valued function f_y of y, satisfying Σ_y f_y = 0. Instead of indexing predictors conventionally by i, here y serves as the index. The predictors (X_y) are regressed on a response-dependent function (f_y), which is constructed using a specific basis function g. This basis is mean-centred as f_y = g(y) − ḡ, where ḡ is the sample mean of g(y), with g typically chosen as a polynomial basis of degree r, i.e., g(y) = (y, y², …, y^r)^T. Here, Δ is assumed to be independent of y, and its simplest form is isotropic with Δ = σ²I_p.

$$X_y = \mu + \Gamma \beta f_y + \varepsilon \tag{4}$$

To compute PFCs, the sample covariance matrix of the fitted predictors, Σ̂_fit, is estimated. Here, X̂ represents the predictors fitted from the regression of X_y on f_y.

PCA is then applied to Σ̂_fit, yielding eigenvectors ordered by their corresponding eigenvalues, from largest to smallest. These eigenvectors are used to construct PFCs by projecting X onto them. Instead of using all PFCs, a subset of q PFCs is employed in the regression model. Since PFCs incorporate information from the response variable during their construction, they often outperform PCs in regression tasks under various scenarios [10].
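The construction described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the ldr implementation: it assumes NumPy, a polynomial basis of degree r, and the isotropic error structure, under which the PFC directions are taken as the leading eigenvectors of the covariance matrix of the fitted predictors. The function name `principal_fitted_components` and the toy data are hypothetical.

```python
import numpy as np

def principal_fitted_components(X, y, r=2, q=1):
    """Sketch of PFCs: inverse regression of X on a polynomial basis of y, then PCA."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                                  # centred predictors
    F = np.column_stack([y ** k for k in range(1, r + 1)])   # basis (y, y^2, ..., y^r)
    F = F - F.mean(axis=0)                                   # mean-centred f_y
    coef, *_ = np.linalg.lstsq(F, Xc, rcond=None)            # regress each predictor on f_y
    X_fit = F @ coef                                         # fitted predictors
    sigma_fit = X_fit.T @ X_fit / n                          # covariance of fitted predictors
    eigval, eigvec = np.linalg.eigh(sigma_fit)               # ascending eigenvalues
    order = np.argsort(eigval)[::-1]                         # largest first
    Z_q = Xc @ eigvec[:, order[:q]]                          # n x q matrix of PFCs
    return Z_q, eigval[order]

# Toy data (hypothetical): the predictors depend on y along a single direction.
rng = np.random.default_rng(0)
y = rng.normal(size=300)
gamma = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
X = 2.0 * np.outer(y, gamma) + rng.normal(size=(300, 5))
Z, eigenvalues = principal_fitted_components(X, y, r=2, q=1)
```

Because the fitted predictors lie in the span of the r basis functions, at most r eigenvalues are non-zero, so the response information is concentrated in very few components.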

1.3. Exponential Quantile Support Vector Regression (EQSVR)

Suppose we have n training instances and p features. The i^th training instance is denoted as x_i, a (p×1) vector, and its associated outcome as y_i, i = 1, 2, …, n. The data matrix X (n×p) is composed by arranging samples in rows and features in columns, and y is the (n×1) vector of responses. [37] introduced the two-parameter exponential quantile loss (L_eq-loss) into standard SVR. Here, λ > 0 and τ ≥ 0 are two tuning parameters: λ controls the steepness of the L_eq-loss and τ acts as a hedging factor. The loss is built on the Pinball loss, together with a location constant and a normalizing constant whose defining conditions are given in [37]. The objective function of EQSVR is formulated in Eq 5. Here, w is the (p×1) vector of weights, b denotes the bias and C represents the non-negative penalty parameter. After estimating w and b, the response for a new sample x_new is predicted as w^T x_new + b.

(5)
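To make the idea of a bounded quantile-type loss concrete, the Python sketch below implements the standard Pinball loss and one illustrative way of bounding it with an exponential transform. The function `bounded_exp_loss` is an assumption introduced here for illustration only; it omits the location and normalizing constants and should not be read as the exact L_eq-loss of [37].

```python
import math

def pinball_loss(u, tau):
    """Pinball (quantile) loss: tau * u for u >= 0 and (tau - 1) * u otherwise."""
    return tau * u if u >= 0 else (tau - 1.0) * u

def bounded_exp_loss(u, tau, lam):
    """Illustrative bounded transform 1 - exp(-lam * pinball); not the exact loss of [37]."""
    return 1.0 - math.exp(-lam * pinball_loss(u, tau))

# The pinball loss keeps growing with the residual, while the bounded version levels off.
for u in (0.5, 5.0, 50.0):
    print(u, pinball_loss(u, 0.7), round(bounded_exp_loss(u, 0.7, 0.5), 4))
```

The point of the comparison is that a very large residual contributes at most a fixed amount to a bounded loss, which limits the pull of an outlier on the fit.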

In this paper, EQSVR is formulated for a linear regression problem. For compactness, the bias b is absorbed into the weight vector by appending e, the (n×1) vector of ones, to the data matrix. With this notation, the objective function (Eq 5) is transformed to Eq 6.

(6)

EQSVR utilizes the ConCave-Convex Procedure (CCCP) to transform the non-convex L_eq-loss problem into a sequence of convex optimization problems. These convex problems are then solved by the ClipDCD algorithm [43]. To solve Eq 6, the L_eq-loss is decomposed into g (u) and h (u), defined in Eq 7 and Eq 8, respectively. Subsequently, the model of EQSVR is formulated in Eq 9.

(7)

(8)

(9)

The first two terms of Eq 9 together form the convex part of the objective, and the third term is the concave part. The CCCP method is employed to optimize the problem defined in Eq 9. The subsequent sub-problems (Eq 10) are solved iteratively to obtain the optimal solution; in each sub-problem, the concave part is replaced by its linearization, using its derivative at the current solution. An auxiliary variable defined in Eq 11 is introduced for ease of notation. Then, Eq 9 is reformulated as Eq 12 and further simplified to Eq 13. Eq 14 is the matrix form of Eq 13.

(10)

(11)

(12)

(13)

(14)
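For orientation, the generic CCCP iteration can be written as follows. The symbols J_vex and J_cav (the convex and concave parts of the objective) and the iteration index k are placeholder notation assumed here; they are not necessarily the symbols used in [37].

```latex
\min_{w}\; J(w) = J_{\mathrm{vex}}(w) + J_{\mathrm{cav}}(w),
\qquad
w^{(k+1)} = \arg\min_{w}\; \Big\{ J_{\mathrm{vex}}(w)
  + \nabla J_{\mathrm{cav}}\big(w^{(k)}\big)^{\top} w \Big\}
```

Each sub-problem is convex because the concave part enters only through a linear term, and the objective J does not increase from one iteration to the next.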

The Lagrange function is defined in Eq 15 by incorporating two multiplier variables, γ and θ. The Karush-Kuhn-Tucker (KKT) conditions, which must be satisfied, are derived in Eqs 16–19. Eq 20 is obtained by plugging the KKT conditions into the Lagrangian function (Eq 15). Solving Eq 16 gives the weight vector, which is defined in Eq 25. Hence, Eq 23 is obtained after utilizing the results in Eq 21 and Eq 22. Here, I denotes the identity matrix and 0 is the vector of zeros.

(15)

(16)

(17)

(18)

(19)

(20)

(21)

(22)

(23)

The problem in Eq 23 is a quadratic optimization problem that can be solved by the ClipDCD algorithm [43]. The solution is updated iteratively over the CCCP iterations. After obtaining the optimal solution, the prediction for a new instance follows Eq 24. The CCCP algorithm of EQSVR, based on ClipDCD, is described in Fig 1.

Fig 1. The CCCP algorithm of EQSVR based on ClipDCD.

https://doi.org/10.1371/journal.pone.0321102.g001

(24)

(25)

2. Proposed methods

2.1. Principal Component Robust Support Vector Regression (PCRSVR)

The proposed PCRSVR is a hybrid technique that combines PCs and EQSVR in a machine learning framework. This approach can handle data irregularities, ill-conditioned predictors and excessive data dimensions simultaneously. It first performs PCA on the predictors and constructs new transformed variables, the principal components, which removes the problem of ill-conditioned predictors. It then chooses the first q PCs, which explain the maximum variation of the predictors. These PCs are used as regressors in the EQSVR framework to model an outcome variable that contains anomalies. The anomalies are handled by the L_eq-loss built into EQSVR.

The PCRSVR performs the following steps to calculate the MSE of estimated regression parameters.

Step 1. Generate predictors (X) by Eq 27 and standardize them.
Step 2. Simulate the response variable (y) using Eq 28. Define the vector of regression coefficients (β) as the eigenvector corresponding to the largest eigenvalue of the information matrix (X^T X).
Step 3. Introduce outliers in y using Eq 29 according to the outlier fraction specified in section 3.
Step 4. Obtain the eigenvalues and eigenvectors of X^T X by applying PCA and construct new transformed variables (M = XV).
Step 5. Retain q PCs (M_q), where q is the number of components explaining at least 80% of the variation of X.
Step 6. Model the contaminated y on M_q using a linear kernel and the L_eq-loss in SVR, as described in subsection 1.3.
Step 7. Estimate the regression parameters for the q PCs using Eq 25, based on the modelling framework implemented in step 6.
Step 8. Convert these estimated parameters back to the original predictor space using the transformation in Eq 3.
Step 9. Calculate the MSE of β̂ according to Eq 26. Here, β̂ denotes the value estimated through the proposed modelling framework (PCRSVR) and β represents the respective true value.

(26)

Step 10. Replicate steps 1–9 for 100 Monte Carlo runs and obtain the mean over the 100 runs.
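The rule used in step 5 for choosing q can be sketched as below. This is a minimal Python illustration; the function name `n_components_for_variance` and the example eigenvalues are hypothetical, and the 80% threshold is the only value taken from the text.

```python
def n_components_for_variance(eigenvalues, threshold=0.80):
    """Smallest q whose leading eigenvalues explain at least `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)    # largest eigenvalue first
    total = sum(ordered)
    cumulative = 0.0
    for q, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:        # cumulative share of variance reached
            return q
    return len(ordered)

# Example with one dominant component, as is typical under strong collinearity.
q = n_components_for_variance([4.2, 0.5, 0.2, 0.1])
```

Under high collinearity the first eigenvalue dominates, so the rule tends to retain very few components.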

2.2. Principal Fitted Component Robust Support Vector Regression (PFCRSVR)

The PFCRSVR method combines PFCs with EQSVR to provide a robust solution for data irregularities, ill-conditioned predictors, and high-dimensional settings simultaneously. It also addresses the challenge noted by [8] and [9] by using fitted predictors, instead of the original ones, during the computation of PCs. The computation of the fitted predictors is described in subsection 1.2. In PFCRSVR, PCA is applied to the fitted predictors to construct PFCs, as detailed in subsection 1.2. From these PFCs, the top q PFCs are selected for further modeling. These components are then used as inputs in the EQSVR framework to predict an outcome variable that contains anomalies. The L_eq-loss function within EQSVR provides resilience to these anomalies, supporting accurate and reliable regression modeling.

The following steps are involved to obtain the MSE of estimated regression parameters of PFCRSVR.

Step 1. Perform steps 1–3 of the PCRSVR algorithm.
Step 2. Compute the fitted predictors (X̂) by regressing X on the polynomial basis of y, following the inverse regression model in Eq 4. Here, the PFC model assumes an isotropic error structure and a second-degree polynomial (r = 2).
Step 3. Obtain the sample covariance matrix of the fitted predictors (Σ̂_fit).
Step 4. Perform PCA on the fitted predictors (X̂) to obtain the eigenvalues and their corresponding eigenvectors.
Step 5. Multiply X by the eigenvectors to get the PFCs. Arrange these PFCs in an n×p matrix (Z).
Step 6. Select the q PFCs (Z_q) that account for at least 80% of the variation of X and use them in the subsequent modelling.
Step 7. Use the L_eq-loss function in SVR to model the contaminated response variable (y) on Z_q with a linear kernel, as detailed in subsection 1.3.
Step 8. Estimate the regression coefficients for the retained q PFCs (Z_q) using Eq 25.
Step 9. Transform the estimated coefficients back to the original predictor space using the mapping in Eq 3.
Step 10. Compute the MSE of the proposed PFCRSVR estimator (β̂) using Eq 26.
Step 11. Iterate steps 1–10 for 100 Monte Carlo runs and calculate the mean over the 100 replications.
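The final replication step, shared by both algorithms, amounts to summarizing a per-run metric over the Monte Carlo runs. The Python sketch below assumes the standard error is the sample standard deviation divided by the square root of the number of runs; the function name `summarize` and the example values are hypothetical.

```python
import math
import statistics

def summarize(values):
    """Mean and standard error (sample SD / sqrt(R)) of a metric over R replications."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se

# Example: per-run MSE values from four hypothetical replications.
mean_mse, se_mse = summarize([1.0, 2.0, 3.0, 4.0])
```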

3. Simulation study

In this section, we evaluate the performance of the proposed methods (i.e., PCRSVR and PFCRSVR) by conducting a Monte Carlo simulation study using the R programming language. The relevant code and data files are deposited at https://github.com/aiman-4/PCRSVR_PFCRSVR.git. For the implementation of PFCR and classical SVR, the R packages ldr [44] and e1071 [45] are utilized, respectively. The competing techniques are SVR, EQSVR, PCSVR and PFCSVR. PCSVR and PFCSVR are two hybrid approaches that use q PCs and q PFCs, respectively, as regressors in classical SVR. The simulation settings are outlined in subsection 3.1.

3.1. Simulation design

In this subsection, the data generation process for the synthetic data sets is described. The explanatory variables are generated according to Eq 27, following the approach of [40] and [39]. In this setup, ρ governs the degree of correlation between any two explanatory variables and D_ij represents independent pseudo-random numbers drawn from the standard normal distribution. The response variable is simulated based on Eq 28. Here, e_i ~ N(0,1) and the regression coefficients β_j are chosen to satisfy β^T β = 1, following [46].

(27)

(28)
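The data generation step can be sketched in Python as follows. Two assumptions are made here and should be checked against Eqs 27 and 28: the predictors follow the form x_ij = (1 − ρ²)^{1/2} D_ij + ρ D_{i,p+1}, which is common in this literature, and the response is a linear model without intercept. Under this assumed form the correlation between any two predictors is ρ². The function names and the equal-weight unit vector used for β are hypothetical.

```python
import math
import random

def generate_predictors(n, p, rho, rng):
    """Assumed form of Eq 27: x_ij = sqrt(1 - rho^2) * D_ij + rho * D_i,p+1."""
    scale = math.sqrt(1.0 - rho ** 2)
    rows = []
    for _ in range(n):
        d = [rng.gauss(0.0, 1.0) for _ in range(p + 1)]     # D_i1, ..., D_i,p+1
        rows.append([scale * d[j] + rho * d[p] for j in range(p)])
    return rows

def generate_response(rows, beta, rng):
    """Assumed form of Eq 28 (no intercept): y_i = x_i' beta + e_i, e_i ~ N(0, 1)."""
    return [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in rows]

def sample_correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(1)
X = generate_predictors(n=2000, p=3, rho=0.9, rng=rng)
beta = [1.0 / math.sqrt(3.0)] * 3                 # satisfies beta' beta = 1
y = generate_response(X, beta, rng)
r01 = sample_correlation([row[0] for row in X], [row[1] for row in X])
print(round(r01, 3))                              # close to rho^2 under this assumed form
```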

The proposed techniques are evaluated by varying several key factors, including sample size, degree of correlation, level of contamination, and number of predictors. We consider collinearity levels of ρ = 0.8, 0.9 and 0.99. The number of explanatory variables is set to p = 5, 15 and 25. Additionally, we test sample sizes of n = 50, 100, 300 and 500. To evaluate the robustness of the proposed techniques, we introduce different proportions of outliers, i.e., 0%, 5%, 15%, and 30%. Different combinations of these factors are considered in this study and the relevant results are reported in section 4. The hyperparameters of EQSVR are set to τ = 0.7 and λ = 0.5. The penalty parameter (C) of EQSVR is set to 0.2 in all scenarios except those with p = 25, where C = 0.04.

This study focuses on vertical outliers, which affect only the response variable. We contaminate the response variable (y) randomly, following Eq 29 as suggested by [40]. Here, b is the magnitude of outliers, set at a constant value of 10.

(29)
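A vertical-outlier contamination step of this kind might look like the Python sketch below. The exact form of Eq 29 is not reproduced here: the sketch assumes a simple additive shift of size b applied to a randomly chosen fraction of the responses, and the function name `contaminate_response` is hypothetical.

```python
import random

def contaminate_response(y, fraction, b=10.0, rng=None):
    """Shift a random `fraction` of responses by b (assumed additive form; Eq 29 may differ)."""
    rng = rng or random.Random()
    n_out = round(fraction * len(y))                  # number of vertical outliers
    idx = sorted(rng.sample(range(len(y)), n_out))    # observations chosen at random
    y_cont = list(y)                                  # leave the clean responses untouched
    for i in idx:
        y_cont[i] += b                                # predictors are not modified
    return y_cont, idx

clean = [0.0] * 20
dirty, where = contaminate_response(clean, fraction=0.15, rng=random.Random(7))
```

Only the response is altered, which matches the definition of vertical outliers; leverage points would require changing the predictors as well.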

3.2. Performance evaluation criteria

The proposed techniques are compared with the competing ones based on Mean Square Error (MSE) and Mean Absolute Error (MAE). These evaluation measures have been considered by various researchers (see, e.g., [37,39,40]). These metrics are computed using Eq 30 and Eq 31. Here, β̂_l is the l^th estimated regression coefficient of a given modelling framework and β_l is its corresponding true value. The technique that produces the lowest values of MSE and MAE is considered the most effective.

(30)

(31)
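The two criteria can be sketched in Python as below. The sketch assumes that both metrics average over the p coefficients; the normalization in Eqs 30 and 31 may differ, and the function names are hypothetical.

```python
def coefficient_mse(beta_hat, beta):
    """Mean squared difference between estimated and true coefficients (assumed 1/p scaling)."""
    return sum((bh - b) ** 2 for bh, b in zip(beta_hat, beta)) / len(beta)

def coefficient_mae(beta_hat, beta):
    """Mean absolute difference between estimated and true coefficients (assumed 1/p scaling)."""
    return sum(abs(bh - b) for bh, b in zip(beta_hat, beta)) / len(beta)

mse = coefficient_mse([1.0, 2.0], [0.0, 0.0])
mae = coefficient_mae([1.0, 2.0], [0.0, 0.0])
```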

Additionally, the gain of the developed techniques over their counterparts is quantified by the percentage reduction in MSE achieved by the proposed techniques. This indicator is termed PMSE and is computed using Eq 32. Here, MSE* and MSE** denote the mean square error of the proposed technique and its competitor, respectively. A positive PMSE indicates that the proposed technique improves on its competitor, whereas a negative PMSE indicates that it is inferior to the baseline technique.

$$PMSE = \frac{MSE^{**} - MSE^{*}}{MSE^{**}} \times 100 \tag{32}$$
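As a small worked example, the PMSE can be computed as below, assuming it is the reduction in MSE expressed as a percentage of the competitor's MSE. The function name `pmse` and the numbers are hypothetical.

```python
def pmse(mse_proposed, mse_competitor):
    """Percentage reduction in MSE of the proposed method relative to its competitor."""
    return (mse_competitor - mse_proposed) / mse_competitor * 100.0

gain = pmse(mse_proposed=1.0, mse_competitor=100.0)    # proposed method is better
loss = pmse(mse_proposed=2.0, mse_competitor=1.0)      # proposed method is worse
```

With this definition the value cannot exceed 100, and it becomes negative whenever the proposed method has the larger MSE.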

4. Results

An extensive simulation study has been conducted under the scenarios described above. The simulation experiments are replicated 100 times. For each replication, the MSE and MAE of β̂ are computed for the proposed methods (i.e., PCRSVR and PFCRSVR) and their competitors (i.e., SVR, EQSVR, PCSVR and PFCSVR). The summary statistics (i.e., mean and Standard Error (SE)) of the performance measures over 100 replications are reported in Tables 1–12. For brevity, a few tables are placed in the supporting information (see S1-S6 Tables). Tables 1–12 show that the proposed techniques (PCRSVR and PFCRSVR) produce lower MSE and MAE than their baseline counterparts (SVR, EQSVR, PCSVR and PFCSVR) in almost all studied simulation settings. Also, MSE and MAE decrease as the sample size increases for all the studied estimators (see Tables 1–12). All the studied techniques perform well with various degrees of correlation and different percentages of outliers. However, the proposed techniques PCRSVR and PFCRSVR outperform EQSVR and their respective non-robust estimators PCSVR and PFCSVR. As the contamination fraction increases, the MSE and MAE increase for all the techniques. However, these metrics increase more slowly for the proposed techniques than for the competing ones, especially when the sample size is large (see Tables 1–12). It is also noticed that the MSE and MAE of SVR, PCSVR and PFCSVR tend to increase with the level of collinearity (see Tables 1–6), whereas an inverse relationship is exhibited between the degree of correlation and the performance measures of EQSVR, PCRSVR and PFCRSVR. It can also be noticed that an increase in the number of predictors generally increases the MSE and MAE of SVR and decreases the performance metrics for all other techniques, with a few exceptions. For instance, a direct relationship is observed between the number of predictors and the performance metrics of PCSVR when ρ = 0.99 (see Tables 5, 6, 11 and 12).

Table 1. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 5 and ρ = 0.8.

https://doi.org/10.1371/journal.pone.0321102.t001

Table 2. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 5 and ρ = 0.8.

https://doi.org/10.1371/journal.pone.0321102.t002

Table 3. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 5 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.t003

Table 4. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 5 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.t004

Table 5. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 5 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.t005

Table 6. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 5 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.t006

Table 7. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 15 and ρ = 0.8.

https://doi.org/10.1371/journal.pone.0321102.t007

Table 8. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 15 and ρ = 0.8.

https://doi.org/10.1371/journal.pone.0321102.t008

Table 9. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 15 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.t009

Table 10. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 15 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.t010

Table 11. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 15 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.t011

Table 12. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 15 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.t012

Moreover, Figs 2–7 and S1–S3 Figs give a clearer view by displaying the percentage reduction in MSE of PCRSVR and PFCRSVR over their competitors EQSVR, PCSVR and PFCSVR. The proposed techniques are the most efficient because they produce a lower MSE than the competing ones. It is also evident from Figs 4 and 7 and S3 Fig that the efficiency of PCRSVR and PFCRSVR improves substantially relative to PCSVR and PFCSVR for ρ = 0.99. For example, the percentage reductions in MSE of PFCRSVR against PFCSVR are 90%, 99% and 98% for n = 100, 300 and 500, respectively, when p = 25 and the contamination level is 30% (see Fig 4d). Similarly, with the same level of collinearity and contamination, the pattern remains consistent when p = 15 and n = 50, 100 and 300.

Fig 2. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 5 and ρ = 0.8.

https://doi.org/10.1371/journal.pone.0321102.g002

Fig 3. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 5 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.g003

Fig 4. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 5 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.g004

Fig 5. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 15 and ρ = 0.8.

https://doi.org/10.1371/journal.pone.0321102.g005

Fig 6. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 15 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.g006

Fig 7. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 15 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.g007

Consequently, the reduction in MSE reaches up to 99% for the proposed techniques over their competitors, even with a high concentration of outliers and strong collinearity. Further, the proposed techniques are also compared with the baseline EQSVR and prove competitive, because they exhibit the largest reduction in MSE across all considered simulation settings (see Figs 2–7 and S1–S3 Figs). Therefore, the results indicate that the proposed approaches outperform the competing techniques by overcoming the effects of anomalies and multicollinearity simultaneously.
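The percentage reduction in MSE discussed above can be illustrated with a short sketch. It assumes the usual definition, 100 × (MSE of competitor − MSE of proposed method) / MSE of competitor; this section does not restate the formula, so treat the definition, the function name `pct_reduction_mse` and the numbers below as illustrative assumptions rather than the authors' exact computation.

```python
def pct_reduction_mse(mse_competitor: float, mse_proposed: float) -> float:
    """Percentage by which the proposed method lowers MSE relative to a competitor.

    Assumed definition: 100 * (MSE_competitor - MSE_proposed) / MSE_competitor.
    """
    if mse_competitor <= 0:
        raise ValueError("competitor MSE must be positive")
    return 100.0 * (mse_competitor - mse_proposed) / mse_competitor


# Hypothetical MSE values, for illustration only (not taken from the paper's tables).
example = pct_reduction_mse(mse_competitor=2.0, mse_proposed=0.5)
print(f"{example:.1f}% reduction")  # prints "75.0% reduction"
```

Under this definition, a value near 99% means the proposed estimator's MSE is roughly one hundredth of the competitor's, and a negative value would mean the competitor performed better.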

5. Discussion

The results of this study demonstrate the robustness and effectiveness of the proposed regression frameworks, PCRSVR and PFCRSVR. Extensive simulations reveal that these methods consistently outperform their baseline counterparts (PCSVR, PFCSVR, and EQSVR). The proposed frameworks cope well with high multicollinearity, severe anomalies, and varying sample and predictor sizes. Both PCRSVR and PFCRSVR achieve substantially lower MSE and MAE values, which shows their ability to mitigate the adverse effects of extreme observations and ill-conditioned predictors. Moreover, the validation using real-life datasets highlights the practical relevance of these approaches. The proposed techniques consistently outperform the baseline methods on real-life datasets characterized by high multicollinearity and the presence of outliers. These findings underline the generalizability and effectiveness of PCRSVR and PFCRSVR in tackling real-world regression challenges.

Despite these promising results, certain limitations deserve attention. The study focuses solely on a normally distributed response variable and vertical outliers. This excludes scenarios involving non-normal responses, such as binary or count data, as well as leverage points in the predictor space. These restrictions may limit the applicability of the methods in domains with more diverse data characteristics. Future research could focus on extending the frameworks to handle response variables from the exponential family. Modifications could also address leverage points to improve the methods’ robustness. Expanding the scope in these directions would enhance the utility and adaptability of the proposed techniques.

Another limitation stems from the nature of the datasets analyzed. The study predominantly focuses on cases where the number of observations exceeds the number of predictors. While this condition is common in many regression applications, it does not account for high-dimensional settings where predictors outnumber observations. In such scenarios, standard dimensionality reduction techniques, such as principal components, may not perform optimally. Future work could adapt the frameworks to high-dimensional datasets, for example through sparsity-inducing penalties or tailored regularization techniques. Addressing these gaps would extend the applicability of these methods to fields like genomics, text mining, and image analysis.

6. Real-life data application

This section demonstrates the performance of the proposed techniques using the pollution and Longley datasets. These datasets have been widely analyzed in previous research (e.g., [39,47–49]). According to the literature, these datasets are known for having ill-conditioned predictors and extreme observations. Therefore, these real-life datasets are suitable for evaluating methods that address these issues efficiently.

In the pollution dataset, the outcome variable is the age-adjusted mortality rate per 100,000, which depends on fifteen explanatory variables. A detailed description of these covariates is available in prior studies (e.g., [47,49]). Application of the least squares method reveals a high degree of multicollinearity, with variance inflation factors of 98.6 for x₁₂ and 104.9 for x₁₃. The strength of correlation among predictors is illustrated in Fig 8a. Residual analysis is also conducted to identify extreme observations. The normal QQ-plot of residuals and the Cook’s distances indicate that observations 2, 29, 32, 37, 48, 57, and 59 are outliers (see Fig 9a and 9b). These findings confirm that the dataset exhibits both multicollinearity and outliers, making it an appropriate example for testing the proposed techniques.
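As a rough illustration of the multicollinearity diagnostic mentioned above, the following sketch computes variance inflation factors as VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The data are synthetic and the function name `variance_inflation_factors` is hypothetical; the pollution dataset itself is not reproduced here.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic example: the first two columns are nearly collinear, the third is independent.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([
    z + 0.05 * rng.normal(size=200),
    z + 0.05 * rng.normal(size=200),
    rng.normal(size=200),
])
vif = variance_inflation_factors(X)
print(np.round(vif, 1))
```

A common rule of thumb treats VIF values above 10 as a sign of problematic collinearity, so values near 100, as reported for x₁₂ and x₁₃, point to severe ill-conditioning.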

Fig 8. The graphical presentation of correlations among predictors of the pollution data (a) and the Longley data (b).

https://doi.org/10.1371/journal.pone.0321102.g008

Fig 9. The normal QQ-plot of residuals (a) and Cook’s distances (b) of pollution data.

https://doi.org/10.1371/journal.pone.0321102.g009

The regression coefficients of the predictors are estimated using all examined modeling frameworks, with results presented in Table 13. For performance evaluation, the standard errors of the bootstrap regression estimates are calculated for each technique, and their means, the Mean Standard Errors of Bootstrap Estimates (MSEBE), are also reported in Table 13. The results indicate that the proposed techniques yield lower MSEBE values than the competing methods. Notably, the PFCRSVR method outperforms all other approaches, achieving the lowest MSEBE.
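The MSEBE criterion described above can be sketched as follows: resample the observations with replacement, refit the model on each resample, take the standard deviation of each coefficient across resamples, and average those standard errors. In this sketch ordinary least squares stands in for the estimator purely for illustration; it is not PCRSVR or PFCRSVR. The resampling scheme (pairs bootstrap), the inclusion of the intercept in the mean, and the names `ols_fit` and `mean_bootstrap_se` are all assumptions.

```python
import numpy as np


def ols_fit(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Stand-in estimator (ordinary least squares with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mean_bootstrap_se(X, y, fit=ols_fit, n_boot=500, seed=0):
    """Mean of the bootstrap standard errors of the coefficients returned by `fit`."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1] + 1))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        draws[b] = fit(X[idx], y[idx])
    se = draws.std(axis=0, ddof=1)  # one standard error per coefficient
    return se.mean(), se


# Synthetic data with two strongly correlated predictors (illustration only).
rng = np.random.default_rng(1)
z = rng.normal(size=60)
X = np.column_stack([z + 0.1 * rng.normal(size=60), z + 0.1 * rng.normal(size=60)])
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=60)

msebe, se = mean_bootstrap_se(X, y)
print("bootstrap SEs:", np.round(se, 3), "mean:", round(float(msebe), 3))
```

Any estimator with the same `fit(X, y)` signature could be passed in place of `ols_fit`, which is how several methods could be compared on this criterion.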

Table 13. The estimated regression coefficients and MSEBE for proposed and competing techniques using the Pollution dataset.

https://doi.org/10.1371/journal.pone.0321102.t013

Further, the Longley dataset [50] comprises five predictors, with the objective of modeling total derived employment (y). These predictors include the Gross National Product (GNP) implicit price deflator (x₁), GNP (x₂), unemployment rate (x₃), size of the armed forces (x₄), and non-institutional population aged fourteen years and older (x₅). Prior research has highlighted the significant effect of multicollinearity and the presence of outliers within this dataset [40]. The multicollinearity is indicated by a high condition index of 43,275 and variance inflation factors of 5,209.50, 306.50, 2,825.30, 37.74, and 39.90 for the predictors. Fig 8b illustrates the correlations among the predictors, highlighting the extent of multicollinearity. Additionally, Fig 10 includes a QQ-plot of residuals and Cook’s distances, which reveal data points 6, 10, 12, 14, and 16 as notable outliers.
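The Cook’s distance diagnostic used for both datasets can be sketched as below, using the standard least squares formula D_i = e_i² h_ii / (k s² (1 − h_ii)²), where e_i is the residual, h_ii the leverage, k the number of fitted parameters and s² the residual variance estimate. The data are synthetic, with one response value deliberately shifted; the function name `cooks_distance` and the 4/n cut-off are illustrative assumptions, not the authors' settings.

```python
import numpy as np


def cooks_distance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cook's distance for each observation of a least squares fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]
    q, _ = np.linalg.qr(design)           # orthonormal basis of the column space
    leverage = np.sum(q ** 2, axis=1)     # diagonal of the hat matrix
    resid = y - q @ (q.T @ y)             # least squares residuals
    s2 = resid @ resid / (n - k)          # residual variance estimate
    return resid ** 2 * leverage / (k * s2 * (1.0 - leverage) ** 2)


# Synthetic example: quadratic trend with one shifted response at index 0.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 40)
X = np.column_stack([x, x ** 2])
y = 1.0 + 2.0 * x - x ** 2 + 0.5 * rng.normal(size=40)
y[0] += 30.0                              # inject a vertical outlier

d = cooks_distance(X, y)
flagged = np.flatnonzero(d > 4.0 / len(y))  # 4/n is one common rule of thumb
print("most influential observation:", int(np.argmax(d)))
print("flagged indices:", flagged)
```

Because a single large residual inflates s², Cook’s distances are usually read alongside a residual QQ-plot, as in Figs 9 and 10, rather than on their own.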

Fig 10. The normal QQ-plot of residuals (a) and Cook’s distances (b) of Longley data.

https://doi.org/10.1371/journal.pone.0321102.g010

Regression parameters are estimated for both the proposed methods and the existing approaches, and the results are presented in Table 14. To further assess estimation accuracy, bootstrap coefficients are estimated. The MSEBE for each method is also reported in Table 14. The proposed methods demonstrate favourable performance compared to competing techniques, achieving the lowest MSEBE values.

Table 14. The estimated regression coefficients and MSEBE for proposed and competing techniques using the Longley dataset.

https://doi.org/10.1371/journal.pone.0321102.t014

7. Conclusion

This research advances a robust regression framework by addressing core challenges, including multicollinearity, outliers, and high-dimensional data, which constrain the effectiveness of classical SVR. By proposing frameworks that incorporate PCs and PFCs, the study not only tackles these issues but also broadens the scope of SVR’s applicability to more intricate and irregular data environments.

Moreover, the findings reflect a broader shift toward robust regression approaches for real-world problems. Fields such as finance, healthcare, and environmental science frequently face complex data structures, and these complexities often compromise the accuracy of predictive models. The proposed frameworks are well suited to such settings. Their ability to handle ill-conditioned predictors and to neutralize the effects of anomalies makes them valuable tools for practitioners.

Supporting information

S1 File.

S1 Table. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.9.

S2 Table. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.9.

S3 Table. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.8.

S4 Table. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.8.

S5 Table. The summary statistics (mean ± S.E) of MSE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.99.

S6 Table. The summary statistics (mean ± S.E) of MAE of regression coefficients regarding proposed and other studied estimators for p = 25 and ρ = 0.99.

S7 Table. A list of abbreviations used in the paper.

https://doi.org/10.1371/journal.pone.0321102.s001

(PDF)

S1 Fig. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 25 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.s002

(TIFF)

S2 Fig. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 25 and ρ = 0.9.

https://doi.org/10.1371/journal.pone.0321102.s003

(TIFF)

S3 Fig. Improved percentage reduction in MSE of proposed techniques against their competitors with different levels of contamination for p = 25 and ρ = 0.99.

https://doi.org/10.1371/journal.pone.0321102.s004

(TIFF)

References

1. Massy WF. Principal components regression in exploratory statistical research. J Am Stat Assoc. 1965;60(309):234–56.
2. Jolliffe IT. Principal components in regression analysis. In: Principal Component Analysis. New York: Springer; 1986. https://doi.org/10.1007/0-387-22440-8_8
3. Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dubl Phil Mag J Sci. 1901;2(11):559–72.
4. Jolliffe I. Principal component analysis. Encyclopedia of statistics in behavioural science. 2005. https://doi.org/10.1002/0470013192.bsa501
5. Thomas EV. Incorporating auxiliary predictor variation in principal component regression models. J Chemom. 1995;9(6):471–81.
6. Wang K, Abbott D. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol. 2008;32(2):108–18. pmid:17849491
7. Agarwal A, Harris K, Whitehouse J, Wu SZ. Adaptive principal component regression with applications to panel data. Adv Neural Inf Process Syst Sci. 2023;36:77104–18.
8. Jolliffe IT. A note on the use of principal components in regression. Appl Stat. 1982;31(3):300–3.
9. Cook RD. Principal components, sufficient dimension reduction, and envelopes. Annu Rev Stat Appl. 2018;5(1):533–59.
10. Cook RD. Fisher lecture: Dimension reduction in regression. 2007.
11. Kawano S, Fujisawa H, Takada T, Shiroishi T. Sparse principal component regression with adaptive loading. Comput Stat Data Anal. 2015;89:192–203.
12. Singh KK, Patel A, Sadu C. Correlation scaled principal component regression. In: Intelligent Systems Design and Applications: 17th International Conference on Intelligent Systems Design and Applications (ISDA 2017) held in Delhi, India, December 14-16. Springer International Publishing; 2018. https://doi.org/10.1007/978-3-319-76348-4_34
13. Walczak B, Massart DL. Robust principal components regression as a detection tool for outliers. Chemometr Intell Lab Syst. 1995;27(1):41–54.
14. Rousseeuw PJ. Least median of squares regression. J Am Stat Assoc. 1984;79(388):871–80.
15. Pell RJ. Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometr Intell Lab Syst. 2000;52(1):87–104.
16. Egan WJ, Morgan SL. Outlier detection in multivariate analytical chemical data. Anal Chem. 1998;70(11):2372–9. pmid:21644644
17. Filzmoser P. Robust principal component regression. In: Aivazian S, Kharin Y, Rider L (eds). Proceedings of the Sixth International Conference on Computer Data Analysis and Modeling. Minsk: Belarusia. 2001;1:132–137.
18. Li G, Chen Z. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. J Am Stat Assoc. 1985;80(391):759–66.
19. Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. John Wiley & Sons; 2003.
20. Hubert M, Verboven S. A robust PCR method for high‐dimensional regressors. J Chemom. 2003;17(8-9):438–52.
21. Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust principal component analysis. Technometrics. 2005;47(1):64–79.
22. Zhang MH, Xu QS, Massart DL. Robust principal components regression based on principal sensitivity vectors. Chemometr Intell Lab Syst. 2003;67(2):175–85.
23. Peña D, Yohai V. A fast procedure for outlier diagnostics in large regression problems. J Am Stat Assoc. 1999;94(446):434–45.
24. Engelen S, Hubert M, Vanden Branden K, Verboven S. Robust PCR and Robust PLSR: a comparative study. In: Theory and applications of recent robust methods. Birkhäuser Basel; 2004. p. 105–117. https://doi.org/10.1007/978-3-0348-7958-3_10
25. Denhere M, Billor N. Robust principal component functional logistic regression. Commun Stat - Simul Comput. 2016;45(1):264–81.
26. Gagnon P, Bédard M, Desgagné A. An automatic robust Bayesian approach to principal component regression. J Appl Statist. 2021;48(1):84–104. pmid:35707235
27. Tahir A, Ilyas M. Robust correlation scaled principal component regression. Hacet J Math Stat. 2023;52(2):459–86.
28. Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V. Support vector regression machines. In: Proceedings of the 9th international conference on neural information processing systems. 1996;15:5–161.
29. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97. https://www.scopus.com/inward/record.uri?eid=2-s2.0-34249753618&doi=10.1023%2fA%3a1022627411411&partnerID=40&md5=97a8591c7d55575e8c48344379ee2796
30. Liang X, Zhang Z, Song Y, Jian L. Kernel-based online regression with canal loss. Eur J Oper Res. 2022;297(1):268–79.
31. Zhao YP, Sun JG. Robust truncated support vector regression. Expert Syst Appl. 2010;37(7):5126–33.
32. Wang K, Zhong P. Robust non-convex least squares loss function for regression with outliers. Knowl-Based Syst. 2014;71:290–302. https://doi.org/10.1016/j.knosys.2014.08.003
33. Tang L, Tian Y, Yang C, Pardalos PM. Ramp-loss nonparallel support vector regression: robust, sparse and scalable approximation. Knowl-Based Syst. 2018;147:55–67.
34. Ye Y, Gao J, Shao Y, Li C, Jin Y, Hua X. Robust support vector regression with generic quadratic nonconvex ε-insensitive loss. Appl Math Model. 2020;82:235–51.
35. Singla M, Ghosh D, Shukla KK, Pedrycz W. Robust twin support vector regression based on rescaled hinge loss. Pattern Recognit. 2020;105:107395.
36. Fu S, Tian Y, Tang L. Robust regression under the general framework of bounded loss functions. Eur J Oper Res. 2023;310(3):1325–39.
37. Li F, Yang H. A novel bounded loss framework for support vector machines. Neural Netw. 2024;178:106476. https://doi.org/10.1016/j.neunet.2024.106476
38. Baye MR, Parker DF. Combining ridge and principal component regression: a money demand illustration. Commun Stat - Theory Methods. 1984;13(2):197–205.
39. Lukman AF, Farghali RA, Kibria BG, Oluyemi OA. Robust-stein estimator for overcoming outliers and multicollinearity. Sci Rep. 2023;13(1).
40. Arum KC, Ugwuowo FI, Oranye HE, Alakija TO, Ugah TE, Asogwa OC. Combating outliers and multicollinearity in linear regression model using robust Kibria-Lukman mixed with principal component estimator, simulation and computation. Sci Afr. 2023;19:e01566.
41. Hua XG, Ni YQ, Ko JM, Wong KY. Modeling of temperature–frequency correlation using combined principal component analysis and support vector regression technique. J Comput Civ Eng. 2007;21(2):122–35.
42. Anderson TW. An Introduction to Multivariate Statistical Analysis. Wiley; 2003.
43. Peng X, Chen D, Kong L. A clipping dual coordinate descent algorithm for solving support vector machines. Knowl-Based Syst. 2014;71:266–78.
44. Adragni KP, Raim A. ldr: Methods for likelihood-based dimension reduction in regression. R package version 1.3. 2014. Available from: https://CRAN.R-project.org/package=ldr
45. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-14. 2023. Available from: https://CRAN.R-project.org/package=e1071
46. Kibria BG. Performance of some new ridge regression estimators. Commun Stat - Simul Comput. 2003;32(2):419–35.
47. McDonald GC, Schwing RC. Instabilities of regression estimates relating air pollution to mortality. Technometrics. 1973;15(3):463–81.
48. Walker E, Birch JB. Influence measures in ridge regression. Technometrics. 1988;30(2):221–7.
49. Yüzbaşı B, Arashi M, Ejaz Ahmed S. Shrinkage estimation strategies in generalised ridge regression models: low/high‐dimension regime. Int Stat Rev. 2020;88(1):229–51.
50. Longley JW. An appraisal of least squares programs for the electronic computer from the point of view of the user. J Am Stat Assoc. 1967;62(319):819–41.
