A non-linear data mining parameter selection algorithm for continuous variables

In this article, we propose a new data mining algorithm, by which one can both capture the non-linearity in data and also find the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. This algorithm introduces interpretable parameters by transforming the original inputs and also a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least square regression framework. This new automatic variable transformation and model selection method could offer an optimal and stable model that minimizes the mean square error and variability, while combining all possible subset selection methodology with the inclusion variable transformations and interactions. Moreover, this method controls multicollinearity, leading to an optimal set of explanatory variables.


Introduction
It happens often that the physical or mathematical model behind an experiment or dataset is not known. Hence, model selection (sometimes called subset selection) becomes an important feature during the data analysis endeavor. The methodology of selecting the best model from a set of inputs has constantly been examined by many authors [1]. Identifying the best subset among many variables is the most difficult part of this effort. The latter is exacerbated as the number of possible subsets grows exponentially, if the number of variables (parameters) grows linearly. Furthermore, there is also a chance that the original input parameters to a model do not convey enough information. Some transformations of the original parameters, and specifically interactions between them, are needed to make the data more available for information extraction.
In other words, in a supervised learning terminology, there is a long and unpaved journey between the inputs (also called predictors, features or independent variables) and the outputs (also called responses or dependent variables). Thus, the difficulty is not only embedded in PLOS  picking the right machine learning algorithm for the problem at hand, but also in picking proper transformations and interactions of the inputs or their subsets. There are different methods capable of addressing transformations and subset selection. However, to the best of our knowledge, none of these methods solves both issues simultaneously. In our discussions in this paper, we denote the vectorial form of an input variable x by an N × 1 vector x as a collection of N observations. The assembly of p such inputs and an intercept is denoted by an N × (p + 1) matrix X = (1, x 1 , x 2 , . . ., x p ). The vectorial form of the output y is denoted by an N × 1 vector Y. For example, based on this description, a linear model is defined as where ε is the N × 1 noise vector, and β = (β 0 , β 1 , . . ., β p ) T is a (p + 1) × 1 vector of coefficients with the first element β 0 as the intercept (or bias) of the model. In what follows next, we review a series of methods and algorithms that are used to find some subset(s) of the inputs that could possibly relate the inputs to outputs in an efficient way.

Subset selection
There are currently various methods for selecting predictors, such as the traditional best subset selection, forward selection, backward selection and stepwise selection methods [1,2]. In general, the best subset procedure finds for each k 2 {1, 2, . . ., p}, the subset of inputs of size k that minimizes the Residual Sum of Squares (RSS) [3][4][5][6]. There are fast algorithms optimizing the search [7]. However, searching through all possible subsets could become laborious as p increases.
A number of automatic subset selection methods seek a subset of all inputs, that is as close as possible to the best subset method [1]. These methods select a subset of predictors by an automated algorithm that meets a predefined criterion, such as the level of significance (set by the analyst). For example, the forward selection method [1] starts with no predictors in the model. It then adds predictors one at a time until no available predictors can contribute significantly to the response variable. Once a predictor is included in the model, it remains there. On the other hand, the backward elimination technique [1] works in the opposite direction and begins with all the existing predictors in the model, then discards them one after another until all remaining predictors contribute significantly to the response variable. Stepwise subset selection [8] is a mixture of the forward and backward selection methods. It modifies the forward selection approach in that variables already in the model do not always remain in the model. Indeed, after each step in which a variable is added, all variables in the model are reevaluated via their partial F or t statistics and any non-significant variable is removed from the model. The stepwise regression requires two cutoff values for significance: one for adding variables and one for discarding variables. In general, the probability threshold for adding variables should be smaller than the probability threshold for eliminating variables [1].
Subset selection methods are usually based on targeting models with the largest R 2 adj , or in other words smallest Root Mean Square Error (RMSE). However, there are other methods in which the selection model is based on Mallow's C p [9][10][11][12]. These criteria highlight different aspects of the regression model. As a results, they can lead to models that are completely different from each other and yet not optimal.
Unfortunately, none of these subset selection methods address the issue of multicollinearity.

Ridge regression
There are also other issues regarding the traditional subset selection regression methods. They could lead to models that are unreliable for prediction because of over-fitting issues. More specifically, they could generate models that have variables displaying a high degree of multicollinearity. Such methods can lead to R 2 values that are biased and yield to confidence limits that are far too narrow or wide. Moreover, the selection criterion primarily relies on the correlation between the predictor(s) and the dependent variable. Thus, these methods (e.g. Stepwise method [13]) do not take into consideration the correlation within the predictors themselves. The latter is a source of multicollinearity that is not addressed automatically by these mentioned methods [13].
Indeed, when collinearity among the predictors exists, the variance of the coefficients is inflated, rendering the overall regression equation unstable. To address this issue, a number of penalized regression or shrinkage approaches are available. For example, the Ridge method tries to eliminate the multicollinearity by imposing a penalty on the size of the regression coefficients [2]. Indeed, a model is fitted with all the predictors, however, the estimated coefficients are shrunken towards zero relative to the least squared estimates. Therefore, biased estimators of regression coefficients are obtained, reducing the variance and thus leading to a more stable equation.
Solving for β in Eq (1) using the Least Squares (LS) method would be equivalent to solvinĝ Here, k Ridge regression, on the other hand, places a constraint on the estimator β in order to minimize a penalized sum of squares [14,15] The complexity parameter λ > = 0 controls the amount of shrinkage. Large values of this parameter would result in a large shrinkage. The value of the constant λ is predefined by the analyst and is usually selected in order to stabilize the ridge estimators, producing an improved equation with a smaller RMSE compared to the least-squares estimates. One weakness of the Ridge method is that it does not select variables. Indeed, unlike the subset selection method, it includes all of the predictors in the final model with shrunken coefficients. The other weakness is that multicollinearity is not fully addressed. In fact, the Ridge estimate of variables in (3) only shrinks the coefficients even for the inputs with multicollinearity. However, the Ridge Method does not fix multicollinearity, it only alleviates it. This issue has been shown and addressed in [16].

Lasso
To obtain variable selection procedures, there are shrinkage methods available such as Least Absolute Shrinkage and Selection Operator (Lasso), where the penalty involves the sum of the absolute values of the coefficients β excluding the intercept [17]. Lasso is closely related to sparse optimization found in works by Candes and Tao [18]. Taking β − = (β 1 , . . ., β p ) T , the Lasso method can be presented as the following optimization problem where k b À k 1 ¼ P p 1 jb j j is the L 1 norm of β − and λ > 0. The advantage of Lasso is that much like the best subset selection method, it performs variable selection.
The parameter λ is usually selected by cross validation. For a small λ, the result is equal to the least squares estimates. As the value of λ augments, shrinkage happens in such a way that only a sparse number of variables having an active role in the final model would show up. Thus, Lasso is a combination of both shrinkage and variables selection.

LAR
Least Angle Regression (LAR) is a new model of automatic subset selection based on a modified version of forward procedure [19]. The LAR method follows an algorithmic procedure: First, the independent variables are standardized in order to obtain a mean zero. At this stage, the β coefficients are all equal to zero. Then the predictor that most correlates to the response variable is selected; its coefficient is then shifted from zero towards its least squares value. Now, once a second predictor becomes as correlated with the existing residual as the first predictor, the procedure is paused. The second predictor is then added to the model. This procedure then continues until all desired predictors are included in the model, leading to a full least-squares fit.
The method of Least Angle Regression with Lasso modification is very similar to the above procedure, however it includes an extra step: if a coefficient approaches zero, LAR excludes its predictor from the model and recalculates the joint least squares path [2]. LAR methods and its variations are better subset selector algorithms compared to most of the subset selection methods.

Dantzig
Another selection approach is the Dantzig selector [20], which can be formulated as subject to kβk 1 t. Here, k.k 1 is the L 1 norm, that is the maximum of its argument. The objective of this method is to minimize the maximum inner product of the existing residual with all the independent variables. This approach has the capacity of recovering an underlying sparse coefficient vector.

Knockoff filter
This method is recently introduced as a new variable selection method to control the false discovery rate (FDR) [21], for linear models. For a selected subset of variable indicesŜ, the FDR is formally defined as It is also well-suited for high-dimensional linear models in which the number of features are more than the number of data points. This method is capable of being combined with different methods, such as Lasso explained above, to perform a more reliable variable selection in the context of controlling the FDR.

PCR
Lastly, Principal Component Regression (PCR) is a method that involves an orthogonal transformation to address multicollinearity [2,22,23]. This approach is closely related to the Singular Value Decomposition (SVD) method [24]. PCR applies dimensionality reduction and decreases multicollinearity by using a subset of the principal components in the model [2]. PCR is one of very few methods that tries to eliminate multicollinearity with linear transformations and, at the same time, perform a regression. The various approaches described so far aim to select the best set of relevant variables from an original set. With the exception of the PCR method, in which there are linear transformations, variables transformations are not incorporated among predictors in any of the methods mentioned above. These traditional methods do not offer the option of automatic variable transformation to address polynomial curvilinear relationships. No non-linear interpretable interaction of the predictors is available in them. An analyst usually needs to manually apply polynomial, logarithmic, square-root and interaction-between-variables transformations in order to address non-linearity of the data.
Non-Linear transformation. There are a number of non-linear transformation procedures currently available such as Box-Cox or Box-Tidwell [25,26]. These methods are relatively efficient in finding the dependent and independent variables transformations. In Box-Tidwell method [26], independent variables are transformed using a recursive Newton algorithm. As a result, it becomes susceptible to round-off errors which would in turn result in unstable and improper transformations [1]. Despite the relative success of these methods, there is no automatic variable selection embodiment with them.
Artificial Neural Networks (ANN) are the current state of the art method in transformations and capturing non-linearity [2,27]. ANN is a machine learning method that finds some non-linear transformations of the inputs using layers of nodes. One recent exemplary example is the Deep Neural Networks (DNN) used in speech recognition [28]. Despite the efficient performance in capturing the non-linearity of the data, the model itself is not comprehensible particularly if there is a physical component to the data that one needs to interpret or understand. In other words, ANN is a perfect black box model, but not a good interpretable medium for understanding physical and mathematical mechanism(s) behind the observed data.
Subset selection and transformation. As mentioned earlier, only the PCR method performs linear transformations automatically, and also picks variables. However, PCR is not enough when non-linearity is present. On the other hand, ANN has the best capability in capturing non-linearities, but acts like a black box and does not lend insight into the physical and mathematical mechanism(s) behind the observed data.
To produce an enhanced subset of the original variables, an effective selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables in an interpretable fashion. The method that we present here has the potential to produce an optimal subset of variables, which is even interpretable in the presence of non-linear interaction between the inputs, resulting in a more efficient overall process of model selection.
The core objective of this paper is to introduce a new estimation technique for the classical least square regression framework. This new automatic variable transformation and model selection method could offer an efficient and stable model that minimizes the mean square error and variability, while combining all possible subset selection methodologies and including variable transformations and interaction. Moreover, this novel method controls multicollinearity, leading to an optimal set of explanatory variables. The final model is also easy to interpret. In other words, we will depict a method that tries to address variable selection, interpretation, non-linear interaction and transformation at the same time.

Problem definition
We assume T to be the set of all transformations on a given set of inputs {x i }, for i 2 {1, . . ., p} and x i 2 R. One possible formulation, to find the best subset and transformation estimating a dependent variable y 2 R, can be expressed as Here, one desirable candidate for the norm k.k could be the L 2 norm, since the purpose is regression. Also, Pð:Þ is the power set. This is an NP hard problem. As a result, we need to find approximations of this problem to make it traceable.
In the first step, we confine ourselves to a set of certain functions in T that are easy to interpret from a casual physical perspective. We call this set F . For example we could pick only the polynomial transformations. Consequently, the set of all transformed variables would be This step would reduce the search space for (7). However, there are sources of redundancy which we could minimize or eliminate. Knowing this, the next step could be to pick transformed variables that have a significant absolute value correlation ρ zy with the output y. This set can be expressed as Also, there is a chance that many of the elements in Z δ are strongly correlated with each other. Later, this could be a serious source of multi-collinearity. So, we could further trim Z δ by only picking the most correlated variables to the output among two correlated variables. This would reduce the set Z δ to Here, ρ αβ is the absolute value correlation between α and β. At this stage, using (8)- (10), and considering that we are looking for a linear estimator among these reduced transformations, the optimization problem (7) would become Here, Z r ¼ fz r i g i2f1;...;jZ r jg and |Z r | is the cardinality of Z r . The optimization problem (11) is nothing but a subset selection model and could be approximated by any methods of subset selection [1,2]. Hence, we now have a model (11) that not only takes care of some desirable interpretable transformations, but also extracts the most meaningful set of parameters.
Note. As we intend to provide a data mining method rather than a pure statistical one, the easy interpretation would act as a constraint on the types of transformations in (7). For example, in a medical investigation, the investigator is mainly looking for basic algebraic interactions between the inputs which can provide physiological view of the system under scrutiny. Hence, the non-linear transformations and interaction between terms must be as basic as possible, such as exponents, logarithms, multiplications and etc. On the other hand, a linear model, like (11), should be used to keep the interpretability of the model intact providing a robust and accurate model. By this formulation, we are trying to deploy an interpretable and accurate data mining model, instead of a black-box pure statistical learning method. Our effort is not to compete with statistical learning methods, but to provide an easy and a faithful-fit data mining method. In the next section, we are going to discuss our methodology in more practical detail.

Methodology
As mentioned before, we are looking for transformations that are easy to interpret. There are four main transformation categories of this type capturing the non-linearity in a data set [2]. These transformations are as follow but not limited to candidates would be 1 We are going to use this set of transformations, namely F ðfx j g j2O ; a; MÞ, for the rest of this paper. After the construction of these interpretable interactions transformations, one can start to look for the best model, for Y, among the set of all transformations 1-4. Here, Y 2 R N is the vector form of the output y.
Denoting the set of variables created by transformations 1-4 as Z, which is the matrix form of Z in (8), we are looking for the best model where some elements of β z are zero. We note that we could further equip our algorithm with Standardized Regression (similar to the first step of the LAR method) to diminish the possibility of a numerically ill-conditioned variable matrix Z. In fact, some elements of β z are zero since there is a chance that some columns of Z are linearly dependent or that they do not contribute to any correlation with Y. We can address these two issues, by a modified dictionary search [17] algorithm as follows. This part stands out for (9) and (10).
• Any column of Z that has a non-significant correlation (less than δ) with Y can be discarded; see (9).
• Any two columns of Z that have a high correlation to each other (greater than B) are redundant columns. Between these two columns, the one that has a higher correlation with Y is picked and the other is discarded; see (10).
As a result of this methodology, we can now solve model (12) for only a reduced matrix. We denote this reduced matrix as Z r and its corresponding vector of coefficients as b r z . The final task is to find the best subset of the columns in Z r to model the data in Y. The latter can be done by any method of subset selection including the best subset selection method [1]. The subset selection method that we have used in our implementation is based on targeting models with the largest R 2 adj , or in other words smallest RMSE. As a reference point, we call our methodology the Parameter Selection Algorithm.

Parameter selection algorithm
The goal of the Parameter Selection Algorithm is to find the best interpretable model on the original observed variables X, from a set of basic transformations, estimating Y. Our method is summarized in Algorithm 1.
Step 1 of this algorithm is input specification.
Step 2 is where the dictionary of transformations and interactions is made. Steps 3 and 4 correspond to the elimination of columns of the dictionary which involve either a non-significant correlation to the output or multicollinearity between its elements.
Step 5 is where the best model is finally found, subject to the constraint that the final set of variables has a Variation Inflation Factor (VIF) less than 10. VIF elements are the main diagonal values of the inverse of the multiplication of the input matrix transposed with the input matrix. For example if X is the input, then C = (X T X) −1 and VIF j = C jj [1]. Although we eliminate similar-looking variables in step 4, checking for the VIF [29] is a necessary condition to make sure that no multicollinearity is introduced into the final model. In practice, step 5 can be solved by maximizing the R 2 adj among all possible subsets of the variables in Z r [1].
Steps 2, 3 and 5 in this algorithm can be made parallel to decrease the computational time of the method. To our best knowledge, Algorithm 1 is the first linear data mining method that performs both variable transformation and model selection while adding interaction terms and also preventing multi-collinearity, in one package.
The hyper-parameters δ and B are important factors in controlling the speed of convergence of the Parameter Selection Algorithm. In Algorithm 1, the smaller the value of δ (similarly, the larger the value of B), the bigger the space of search in step 5. As a result, the speed of convergence would depend greatly on these two parameters.

Solve minimize
Candidates for B and δ. The hyper-parameter δ is straightforward to settle. Most of the contribution of a model comes from variables having a high univariate correlation coefficient with the output. As a result, we could discard variables with smaller contributions. Here, small is measured with respect to the highest absolute value univariate correlation coefficient with the output. Usually, an absolute value univariate correlation coefficient of 0.5, and above, is considered to be high [30]. This is 50% of the maximum allowed absolute correlation of 1. Hence, from a conservative perspective, we could set δ to be half of the maximum highest absolute value univariate correlation coefficient among all variables. We call this the default value of δ.
On the other hand, the hyper-parameter B can be characterized with the VIF concept. Each element of the VIF vector can be expressed as Parameter selection algorithm Here, R 2 j is the multiple R 2 for the regression of x j against other inputs. Hence, if we want two inputs to have a small correlation with each other, we must have a possible VIF between them to be less than 10. This would impose an R 2 = 0.9 between those variables. Hence, a correlation of *0.95 would say if two inputs are highly correlated or not. On the other hand, we know that if we set the independence limit B ¼ 0:95, we would construct a huge dictionary of inputs when transformations are available. To have a balance between the two, our recommendation is

Synthetic examples
In this section, we provide a few synthetic examples using Parameter Selection Algorithm. In the following examples, we try to show that the algorithm that we have proposed is capable of finding the non-linear transformations in a model.
Example. Taking x 1 , x 2 , x 3 to be independent uniformly distributed random variables between 0 and 100, we sampled 1000 data points and then created the non-linear functional y = 120 + 80x 1 x 3 . We take the original input matrix X to be composed of all x 1 , x 2 , and x 3 . Using the traditional best subset selection [7], accompanied with a control over VIF not to get above 10, we get the results shown in Fig 1. From this figure, it is clear that the best subset selection model is not capable of capturing the correct non-linearity in the model. The heteroscedasticity of the residual plot can be seen in Fig 2. The found best subset of parameters is {x 1 , x 2 , x 3 }. On the other hand, if Algorithm 1 is used, with a strict choice of B ¼ 0:5 and the default value of δ, the non-linearity is captured completely by our method (See Figs 3 and 4). The subset of parameters found by our method is the model non-linear parameter {x 1 x 3 }.
Example. If χ is a uniform random random variable between 0 and 1, we set We sampled 1000 data points of x 1 , x 2 , and x 3 and then created the non-linear functional y ¼ 120 þ 1000 x 2 . We take the original input matrix X to be composed of all x 1 , x 2 , and x 3 . Using the traditional best subset selection [7], accompanied with a control over VIF not to get above 10, we get the results shown in Fig 5. Again, from this figure, it is clear that the best subset selection model is not capable of capturing the correct non-linearity in the model. The heteroscedasticity of the residual plot can be seen in Fig 6. The found subset of parameters is {x 1 , x 2 }. On the other hand, if Algorithm 1 is used, the non-linearity is captured completely (See Figs 7 and 8). The subset of parameters found by our proposed method is the non-linear parameter f 1 x 2 g. Here, B ¼ 0:8 and the default value of δ was used.

Real data example
The synthetic examples in the previous section showed the capability of our method in capturing the true non-linearity of a dataset. In this section, we show a real data case study. Cardiovascular Diseases (CVDs) are the major cause of deaths in the United States, killing more than 350,000 people every year [31]. One of the major contributors to CVDs is arterial stiffness [32,33]. Arterial stiffness can be approximated by Carotid-femoral Pulse Wave Velocity (PWV) [34]. In fact, PWV is one of the most important quantitative index for arterial stiffness [33]. PWV measures the speed of the arterial pressure waves traveling along the blood vessels and higher PWV usually highlights stiffer arteries. Increased aortic stiffness is related to many clinically adverse cardiovascular outcomes [32]. PWV constitutes an independent and valuable marker for cardiovascular diseases (CVDs) and its use is crucial as a routine tool for clinical patient assessment.
In this section, our aim is not to present the most accurate PWV model. However, our goal is to show that if our technique of model construction is used (see Algorithm 1), we are able to find a more interpretable model.
The data we present is collected from 5444 Framingham Heart Study (FHS) participants [35]. Each participant had undergone an arterial tonometry data collection. The participants Parameter selection algorithm were part of FHS Cohorts Gen 3 Exam 1 [36], Offspring Exam 7 [37], and Original Exam 26 [38]. The California Institute of Technology and Boston University Medical Center Institutional Review Boards approved the protocol and all participants gave written informed consent. Here, we try to find models for PWV based on the following inputs: Age (A), Pulse Duration (D), Weight (W), Height (H), and Body Mass Index (BMI).
One model is based on the traditional best subset selection method monitored for VIF < 10, and the other based on the Parameter Selection Algorithm method (Algorithm 1). The participant characteristics are shown in Table 1.
The p-value of these parameters are 6 × 10 −4 , 0, 2 × 10 −21 , and 8 × 10 −46 , receptively. From (15) one can interpret that Age (A) is a dominant factor in PWV. Furthermore, The Age adjustments with Heart Rate HR ¼ 60 D is of great importance. The other interpretable factors are the adjusted values of slenderness (body mass index BMI, and the weight W) with

Comparison and results discussion
Comparing Figs 9 and 12, it is clear that the Parameter Selection Algorithm method is superior to the best subset selection method. The R 2 adj of the Parameter Selection Algorithm model is almost %11 better than the best subset selection method. Both methods suffer in capturing all the variation and non-linearity in data (compare Fig 10 to Fig 13 and Fig 11 to Fig 14). Parameter selection algorithm However, Parameter Selection Algorithm is better in this respect. The heteroscedasticity of the best subset selection method is worse than that of the Parameter Selection Algorithm method (compare Fig 11 to Fig 14). The Bland-Altman limits of agreement of the Parameter Selection Algorithm method is also better than those of the best subset selection method (compare Fig 10 to Fig 13). The latter shows that the Parameter Selection Algorithm method is a more precise method than the best subset selection method.  output of the Parameter Selection Algorithm is interpretable and also behaves as a dimensionality reduction algorithm.
Finally, we again mention that our goal is not to show the best possible model for PWV, but rather to show the capabilities of our presented method.

Other applications
The interpretability of Algorithm 1 would be an advantage in analyzing physiological data. In other words, complex biomedical and bioengineering databases could be appropriate fits for this method. In previous section, we expressed the application of our algorithm to PWV data. This suggests that any continuous physiological variable can be treated the same way. For example, important biomedical continuous variables such as Cardiac Output (CO) [39], Ejection Fraction (EF) [40], Stroke Volume (SV) [39], Blood Pressure (BP), and Homeostatic Model Assessment (HOMA) [41] can all be estimated and interpreted using Algorithm 1. This list variables can be extended beyond the mentioned cases.

Conclusion and future works
In this paper, we have introduced the Parameter Selection Algorithm (Algorithm 1) by which one can simultaneously capture some of the non-linearities of the data into the model, introduce automatic interpretable interaction and transformation among predictions, and also pick the best model. This approach minimizes the efforts done by an analyst and is virtually automatic. So far, up to the best of our knowledge, no other algorithm or method is able to perform these tasks at the same time automatically. Here, our purpose has not been to introduce a competing statistical learning method, but to furnish a data mining tool. Despite this, we have shown that our model is almost as good as the state of the art statistical learning algorithms.
This data mining approach provides an interpretable dimensionality reduction model that faithfully models the data. We believe, the Parameter Selection Algorithm could have versatile applications in biostatistics as shown by one of the examples in this manuscript.
The hyper-parameters B and δ, presented in this article, are analyzed and set heuristically. In a future work, we intend to perform a more detailed analysis to possibly quantify optimum values for them. Furthermore, instead of just solving step 5 in Algorithm 1, we could also add the constraint that parameters with high p-values, in a model, should be discarded to provide an even sparser result.
All in all, we see this article as a proof of concept which needs further investigation to analyze the involved hyper-parameters and also tweaks to its optimization core.