Selection of Genetic and Phenotypic Features Associated with Inflammatory Status of Patients on Dialysis Using Relaxed Linear Separability Method

Risk factors in patients with a particular disease can be identified in clinical data sets by using feature selection procedures from pattern recognition and data mining. The applicability of the relaxed linear separability (RLS) method of feature subset selection was examined on high-dimensional, mixed-type (genetic and phenotypic) clinical data of patients with end-stage renal disease. The RLS method allowed a substantial reduction of dimensionality, omitting redundant features while preserving the linear separability of the data sets of patients with high and low levels of an inflammatory biomarker. A synergy between genetic and phenotypic features in the differentiation between these two subgroups was demonstrated.


Appendix S1
Mathematical foundations of the RLS method of feature selection

We consider a data set that describes m patients O_j, where j is the order number (index) of the patient O_j in the cohort. Each patient O_j is represented by the n-dimensional feature vector x_j = [x_j1, ..., x_jn]^T, where n is the total number of features (i.e., clinical characteristics and results of diagnostic tests, including genetic polymorphisms). The component x_ji of the vector x_j is the numerical result of the i-th measurement or diagnostic test taken on patient O_j. Both the clinical characteristics and the results of laboratory tests performed on patient O_j can be represented by the components x_ji (x_ji ∈ {0, 1} or x_ji ∈ R^1).

Linear separability of two learning sets
The feature vectors x_j can be considered as points in the feature space F. We examine the possibility of linear separation of two groups of patients O_j (for example, m^+ healthy patients vs. m^- ill patients) in F. For that purpose, two learning sets G^+ = {x_j : j ∈ J^+} and G^- = {x_j : j ∈ J^-} are defined, where J^+ and J^- are disjoint sets of indices j (J^+ ∩ J^- = ∅). The set J^+ contains m^+ indices and the set J^- contains m^- indices.

Definition 1
The sets G^+ and G^- are linearly separable in the feature space F if and only if there exist a weight vector w = [w_1, ..., w_n]^T ∈ R^n and a threshold θ ∈ R^1 such that

(∀x_j ∈ G^+): w^T x_j > θ and (∀x_j ∈ G^-): w^T x_j < θ.
The parameters w and θ define the hyperplane H(w, θ) in the feature space F:

H(w, θ) = {x : w^T x = θ}.   (1)

If the sets G^+ and G^- are linearly separable, then all the elements x_j of the set G^+ are situated on the positive side of the hyperplane H(w, θ) (w^T x_j > θ) and all the elements of the set G^- are situated on the negative side of this hyperplane (w^T x_j < θ); see Figure 1.
The concept of linear separability is used in the theory of neural networks (Perceptron theory) and in pattern recognition methods [1]. Optimal linear classifiers can be designed through exploration of the linear separability of the data subsets [2].
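As an illustration of Definition 1, a separating pair (w, θ) can be searched for with the classical perceptron update rule. The toy two-dimensional data and the update scheme below are illustrative assumptions and are not part of the RLS method itself:

```python
# Find (w, theta) with w^T x > theta on G_plus and w^T x < theta on
# G_minus using perceptron updates; returns None if no separating
# hyperplane is found within max_epochs.

def find_separating_hyperplane(G_plus, G_minus, max_epochs=1000):
    n = len(G_plus[0])
    w = [0.0] * n
    theta = 0.0
    # labelled samples: +1 for G_plus, -1 for G_minus
    samples = [(x, +1) for x in G_plus] + [(x, -1) for x in G_minus]
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            s = sum(wi * xi for wi, xi in zip(w, x)) - theta
            # strict separation is required, so s * y <= 0 is a mistake
            if s * y <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                theta -= y  # the threshold plays the role of a bias
                errors += 1
        if errors == 0:
            return w, theta
    return None

# two linearly separable toy groups in the plane (hypothetical data)
G_plus = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0]]
G_minus = [[0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]]
```

For linearly separable data the perceptron convergence theorem guarantees that the loop terminates with a strictly separating hyperplane.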

Convex and piecewise-linear criterion functions
The positive penalty function φ^+(w, θ; x_j) is defined for x_j ∈ G^+ [1,2]:

φ^+(w, θ; x_j) = 1 + θ − w^T x_j, if w^T x_j ≤ θ + 1; φ^+(w, θ; x_j) = 0, if w^T x_j > θ + 1.   (2)

Similarly, the negative penalty function φ^-(w, θ; x_j) is defined for x_j ∈ G^- [1,2]:

φ^-(w, θ; x_j) = 1 − θ + w^T x_j, if w^T x_j ≥ θ − 1; φ^-(w, θ; x_j) = 0, if w^T x_j < θ − 1.   (3)

The penalty functions φ^+(w, θ; x_j) and φ^-(w, θ; x_j) are convex and piecewise-linear (CPL) (Figure 2). The function φ^+(w, θ; x_j) is equal to zero if the feature vector x_j ∈ G^+ is situated on the positive side of the hyperplane H(w, θ) and is not too close to it. Similarly, φ^-(w, θ; x_j) is equal to zero if the vector x_j ∈ G^- is situated on the negative side of the hyperplane H(w, θ) and is not too close to it. The perceptron criterion function Φ(w, θ) is defined on the learning sets G^+ and G^- as the weighted sum of the penalty functions φ^+(w, θ; x_j) and φ^-(w, θ; x_j) [1,2]:

Φ(w, θ) = Σ_{x_j ∈ G^+} α_j φ^+(w, θ; x_j) + Σ_{x_j ∈ G^-} α_j φ^-(w, θ; x_j),   (4)

where the nonnegative parameters α_j can represent prices of particular feature vectors x_j. The default values of the parameters α_j are α_j = 1/(2m^+) for x_j ∈ G^+ and α_j = 1/(2m^-) for x_j ∈ G^-. The optimal parameters w* and θ* are defined as the values of w and θ at the minimum Φ(w*, θ*) = Φ* ≥ 0 of the criterion function Φ(w, θ). It was proved that there is a unique set of optimal parameters w* and θ* and that Φ* is equal to zero if and only if the sets G^+ and G^- are linearly separable [2].
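The penalty functions and the perceptron criterion function Φ(w, θ), equation (4), can be sketched as follows; the unit margin around the hyperplane is an assumption made for illustration:

```python
# CPL penalty functions and the perceptron criterion, assuming a unit
# margin ("not too close" = farther than 1 from the hyperplane).

def phi_plus(w, theta, x):
    """Penalty for x in G+: zero when w^T x exceeds theta by at least 1."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return max(0.0, 1.0 - s)

def phi_minus(w, theta, x):
    """Penalty for x in G-: zero when w^T x is below theta by at least 1."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return max(0.0, 1.0 + s)

def perceptron_criterion(w, theta, G_plus, G_minus):
    """Weighted sum of penalties with the default prices
    alpha_j = 1/(2 m+) for G+ and 1/(2 m-) for G-."""
    a_p = 1.0 / (2 * len(G_plus))
    a_m = 1.0 / (2 * len(G_minus))
    total = sum(a_p * phi_plus(w, theta, x) for x in G_plus)
    total += sum(a_m * phi_minus(w, theta, x) for x in G_minus)
    return total
```

A zero value of the criterion then corresponds to linear separation of the two sets with the assumed margin.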
The perceptron criterion function Φ(w, θ) can also be defined on feature vectors x_j with reduced dimensionality, i.e., on vectors x_j built from k features instead of the whole set of n features (x_j ∈ F_k, k ≤ n). The minimal value Φ*_k of the function Φ_k(w, θ) on F_k can be used as a measure of the degree of linear separability of the learning sets G^+ and G^- in the reduced feature subspace F_k. The zero value of Φ*_k indicates that the learning sets G^+ and G^- are linearly separable in the reduced feature subspace F_k. The reduction of features may increase the value of Φ*_k [2]. A modified criterion function Ψ_λ(w, θ), which includes additional CPL penalty functions based on the absolute values |w_i| of the components of the vector w and the costs γ_i (γ_i > 0) of particular features x_i, was found to be useful for feature selection [2]:

Ψ_λ(w, θ) = Φ(w, θ) + λ Σ_i γ_i |w_i|,   (5)

where λ ≥ 0 is the cost level. The standard values of the parameters γ_i are equal to one. The modified criterion function Ψ_λ(w, θ) is used in the relaxed linear separability (RLS) method of feature subset selection [3,4]. The regularization component λ Σ_i γ_i |w_i| used in the modified criterion function Ψ_λ(w, θ) is similar to that used in the Lasso method, developed in the framework of regression analysis for model selection [5]. The main difference between the Lasso and the RLS methods lies in the types of the basic criterion functions. The basic criterion function typically used in the Lasso method is the residual sum of squares, whereas the perceptron criterion function Φ(w, θ), equation (4), is used in the RLS method. This difference affects the computational techniques applied to minimize the criterion functions. The criterion function Ψ_λ(w, θ), similarly to the function Φ(w, θ), is convex and piecewise-linear (CPL).
The basis exchange algorithms, which are similar to linear programming, allow the identification of the minimum of the function Ψ_λ(w, θ) or Φ(w, θ) even in the case of large, high-dimensional learning sets G^+ and G^- [2].
The optimal parameters w*_λ = [w*_λ1, ..., w*_λn]^T are used in the feature reduction rule [3]:

(w*_λi = 0) ⇒ (the feature x_i is omitted).   (6)

The reduction of a feature x_i related to a weight w*_λi equal to zero (w*_λi = 0) does not change the value of the inner product (w*_λ)^T x_j for any feature vector x_j.
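A minimal sketch of the modified criterion function Ψ_λ(w, θ) and of the feature reduction rule (6), assuming a unit margin in the CPL penalties and the standard costs γ_i = 1; the feature names used below are hypothetical:

```python
# Psi_lambda = Phi + lambda * sum(gamma_i * |w_i|), plus the rule that
# drops every feature whose optimal weight is exactly zero.

def hinge(s):
    return max(0.0, 1.0 - s)  # unit-margin CPL penalty (an assumption)

def psi(w, theta, G_plus, G_minus, lam, gamma=None):
    gamma = gamma or [1.0] * len(w)
    a_p, a_m = 1.0 / (2 * len(G_plus)), 1.0 / (2 * len(G_minus))
    phi = sum(a_p * hinge(sum(wi * xi for wi, xi in zip(w, x)) - theta)
              for x in G_plus)
    phi += sum(a_m * hinge(theta - sum(wi * xi for wi, xi in zip(w, x)))
               for x in G_minus)
    return phi + lam * sum(g * abs(wi) for g, wi in zip(gamma, w))

def reduce_features(w_opt, feature_names):
    """Rule (6): keep only features with a nonzero optimal weight."""
    return [name for name, wi in zip(feature_names, w_opt) if wi != 0.0]
```

With λ = 0 the value of Ψ_λ coincides with the perceptron criterion Φ; increasing λ penalizes nonzero weights.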

Relaxed linear separability (RLS) method of feature subset selection
The n-dimensional feature space F is composed of all n features x_i from the set {x_1, ..., x_n}. Feature reduction based on rule (6) results in the appearance of reduced feature subspaces F_k. The symbol F_k denotes a feature subspace composed of k features x_i from the set {x_1, ..., x_n}. The reduced learning sets G^+_k and G^-_k are composed of the reduced feature vectors x_j with the same k features. A successive increase of the value of the cost level λ in the criterion function Ψ_λ(w, θ) causes the reduction of some features x_i and, as a result, generates a descending sequence of feature subspaces F_k of decreasing dimensionality (F_k ⊃ F_{k-1}) [2]:

F ⊃ ... ⊃ F_k ⊃ F_{k-1} ⊃ ...   (7)

The reduced feature subspaces F_k are formed from the feature space F by the omission of certain features according to rule (6). The reduced feature vectors x_j ∈ F_k are obtained from the feature vectors x_j in the same manner. The reduction of features may result in the loss of linear separability of the sets G^+ and G^- for some k.
In accordance with the relaxed linear separability (RLS) method, the generation of the sequence (7) of feature subspaces F_k is deterministic [3,4]. Each step F_k → F_{k-1} is realized by an increase of the cost level λ in the criterion function Ψ_λ(w, θ): λ_k → λ_{k-1} = λ_k + ∆_k (∆_k > 0). The further increase of the parameter λ allows the next feature to be reduced from the reduced vectors in F_k.
The feature reduction process should be stopped by fixing the cost level λ at a specific value. The choice of the stop criterion for the sequence (7) is a crucial issue in the RLS method.
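One way to picture how an increasing cost level λ drives weights to exact zero is subgradient descent on the perceptron criterion combined with a soft-thresholding (proximal) step for the λ Σ γ_i |w_i| term. This is only a hypothetical stand-in for the basis exchange algorithms used in the RLS method; the data, step size, and iteration count below are assumptions:

```python
# Approximate minimization of Psi_lambda: subgradient step on the
# unit-margin CPL penalties, then soft-thresholding of the weights,
# which produces exact zeros once lambda is large enough.

def minimize_psi(G_plus, G_minus, lam, eta=0.05, iters=2000):
    n = len(G_plus[0])
    w, theta = [0.0] * n, 0.0
    a_p, a_m = 1.0 / (2 * len(G_plus)), 1.0 / (2 * len(G_minus))
    for _ in range(iters):
        gw, gt = [0.0] * n, 0.0
        for x in G_plus:              # active positive penalties
            if sum(wi * xi for wi, xi in zip(w, x)) - theta < 1.0:
                gw = [gi - a_p * xi for gi, xi in zip(gw, x)]
                gt += a_p
        for x in G_minus:             # active negative penalties
            if sum(wi * xi for wi, xi in zip(w, x)) - theta > -1.0:
                gw = [gi + a_m * xi for gi, xi in zip(gw, x)]
                gt -= a_m
        w = [wi - eta * gi for wi, gi in zip(w, gw)]
        theta -= eta * gt
        t = eta * lam                 # soft-thresholding (prox of L1)
        w = [(abs(wi) - t) * (1 if wi > 0 else -1) if abs(wi) > t else 0.0
             for wi in w]
    return w, theta

# feature 1 separates the groups; feature 2 is uninformative noise
G_plus = [[2.0, 0.3], [2.0, -0.2], [2.0, 0.1]]
G_minus = [[-2.0, 0.2], [-2.0, -0.3], [-2.0, 0.1]]
```

Running this sketch with a larger λ eliminates the uninformative weight while the informative one survives, mimicking one step F_k → F_{k-1} of the sequence (7).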

RLS stop criterion based on minimal error rate of linear classifier
Two stop criteria for the feature reduction were examined for the RLS method. The first of these criteria was based on the error rate of the optimal linear classifiers constructed in the successive feature subspaces F_k [3]. The optimal linear classifier was constructed in each F_k by minimizing the perceptron criterion function Φ_k(w, θ) on the reduced feature vectors x_j ∈ F_k. The parameters w*_k and θ*_k at the minimum of the criterion function Φ_k(w, θ) determine the following decision rule of the optimal linear classifier LC_k(w*_k, θ*_k) in the feature subspace F_k:

if (w*_k)^T x > θ*_k, then x is allocated to the category ω^+; otherwise, x is allocated to the category ω^-,   (8)

where x ∈ F_k, the category (class) ω^+ is represented by the m^+ elements of the learning set of reduced vectors G^+_k, and the category ω^- is represented by the m^- elements of the learning set of reduced vectors G^-_k.
The quality of the optimal linear classifier LC_k(w*_k, θ*_k), rule (8), can be evaluated using the error estimator (apparent error rate) e_a(w*_k, θ*_k), defined as the fraction of wrongly classified elements x_j of the sets G^+_k and G^-_k in the feature subspace F_k [1]:

e_a(w*_k, θ*_k) = m_a(w*_k, θ*_k) / m,   (9)

where m is the number of all elements in the sets G^+_k and G^-_k, and m_a(w*_k, θ*_k) is the number of elements from these sets wrongly allocated by the rule (8).
If the same data are used for the design of the classifier, rule (8), and for its evaluation, equation (9), the evaluation is biased (too optimistic) [6]. For example, when the learning sets G^+_k and G^-_k are linearly separable, all the vectors are correctly classified by the rule (8) and the apparent error rate (9) is equal to zero. However, as is often found in practical applications, the error rate evaluated on vectors x that do not belong to the linearly separable learning sets G^+_k and G^-_k may be positive. To reduce the classifier bias, cross-validation procedures can be applied [1,7]. In accordance with p-fold cross-validation, the learning sets G^+_k and G^-_k are divided into p subsets P_i, where i = 1, ..., p. The vectors x_j contained in p − 1 subsets P_i are used for the definition of the criterion function Φ_k(i)(w, θ), equation (4), and for the computation of the optimal parameters w*_k(i) and θ*_k(i). The remaining vectors x_j from the left-out subset P_i (the test set) are used for the evaluation of the error rate e_i(w*_k(i), θ*_k(i)), equation (9). This evaluation is repeated p times, and in each turn a different part P_i is used as the test set. Finally, the mean value e_CVE of the error rates e_i(w*_k(i), θ*_k(i)), equation (9), on the elements of the test sets P_i is computed:

e_CVE = (1/p) Σ_{i=1,...,p} e_i(w*_k(i), θ*_k(i)).   (10)
The cross-validation procedure allows different vectors to be used for designing the classifier and for its evaluation and, as a result, reduces the bias of the error rate estimation. The error rate e_CVE estimated during the cross-validation procedure is called the cross-validation error (CVE). A special case of the p-fold cross-validation procedure, in which the number p of the parts P_i is equal to the number m of elements in the sets G^+_k and G^-_k, is called the leave-one-out procedure [8]. Another type of classifier evaluation is based on the so-called confusion matrix T_k(w*_k, θ*_k) [1]:

T_k(w*_k, θ*_k) = [m^{++}, m^{+-}; m^{-+}, m^{--}],   (11)

where m^{++} is the number of elements of the set G^+_k in the feature subspace F_k correctly allocated by the optimal classifier to the category ω^+, m^{+-} is the number of elements of G^+_k wrongly allocated to the category ω^-, m^{--} is the number of elements of the set G^-_k correctly allocated to the category ω^-, and m^{-+} is the number of elements of G^-_k wrongly allocated to the category ω^+. The confusion matrix T_k(w*_k, θ*_k) represents the different types of errors made by the optimal classifier in the feature subspace F_k. Both the error rate e_CVE, equation (10), and the confusion matrix T_k(w*_k, θ*_k), equation (11), can be estimated using cross-validation procedures in the different feature subspaces F_k of the sequence (7).
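For a fixed linear decision rule, the apparent error rate (9) and the confusion matrix (11) can be computed directly; the toy data in the test are hypothetical:

```python
# Apparent error rate and confusion matrix for a fixed linear
# decision rule of the form: positive side of H(w, theta) -> class '+'.

def classify(w, theta, x):
    """Decision rule (8) for a given (w, theta)."""
    return '+' if sum(wi * xi for wi, xi in zip(w, x)) > theta else '-'

def confusion_matrix(w, theta, G_plus, G_minus):
    """Returns [[m++, m+-], [m-+, m--]] as in equation (11)."""
    mpp = sum(1 for x in G_plus if classify(w, theta, x) == '+')
    mpm = len(G_plus) - mpp
    mmm = sum(1 for x in G_minus if classify(w, theta, x) == '-')
    mmp = len(G_minus) - mmm
    return [[mpp, mpm], [mmp, mmm]]

def apparent_error(w, theta, G_plus, G_minus):
    """Equation (9): fraction of wrongly allocated elements."""
    T = confusion_matrix(w, theta, G_plus, G_minus)
    m = len(G_plus) + len(G_minus)
    return (T[0][1] + T[1][0]) / m
```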
The cross-validation error rate CVE, equation (10), of the linear classifier LC_k(w*_k, θ*_k), rule (8), is used in the stop criterion in the present study. The error rate CVE is evaluated for each feature subspace F_k in the sequence (7). The feature subspace F_k linked to the linear classifier LC_k(w*_k, θ*_k) with the lowest error rate e_CVE is selected. It should be stressed that the estimation of the error rates e_CVE is used only to indicate the feature subspace F_k that is considered to have the optimal number of features, beyond which the features should not be further reduced. The estimation of the error rates e_CVE(w*_k, θ*_k) does not affect in any way the sequence (7) of feature subspaces F_k. In this approach to feature subset selection, the optimal feature subspace F_k is characterized by the lowest value of e_CVE.
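The p-fold cross-validation error e_CVE can be sketched as follows. The perceptron trainer stands in for minimization of the perceptron criterion function, and the deterministic interleaved fold split and the toy data are assumptions made for illustration:

```python
# p-fold cross-validation of a linear classifier: train on p-1 folds,
# measure the test error on the left-out fold, average over the folds.

def train(samples, epochs=200):
    """samples: list of (x, y) with y in {+1, -1}; perceptron updates."""
    n = len(samples[0][0])
    w, theta = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            if (sum(wi * xi for wi, xi in zip(w, x)) - theta) * y <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                theta -= y
    return w, theta

def cross_validation_error(samples, p):
    """Mean test error over p folds (deterministic interleaved split)."""
    folds = [samples[i::p] for i in range(p)]
    errs = []
    for i in range(p):
        train_part = [s for j, f in enumerate(folds) if j != i for s in f]
        w, theta = train(train_part)
        wrong = sum(
            1 for x, y in folds[i]
            if ((sum(wi * xi for wi, xi in zip(w, x)) - theta) > 0) != (y > 0))
        errs.append(wrong / len(folds[i]))
    return sum(errs) / p

# well-separated hypothetical data: 6 positive and 6 negative samples
samples = ([([2.0 + 0.1 * i, 0.5], +1) for i in range(6)]
           + [([-2.0 - 0.1 * i, -0.5], -1) for i in range(6)])
```

Setting p equal to the number of samples turns this sketch into the leave-one-out procedure.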

RLS stop criterion based on full linear separability
The second type of stop criterion used in the RLS method is based directly on checking the linear separability of the reduced learning sets G^+_k and G^-_k [4]. This approach can be useful when the learning sets G^+ and G^- are linearly separable in the full feature space F. Let us remark that the learning sets G^+ and G^- are usually linearly separable when the number of objects, m, is less than the number of features, n [1]. Such a situation appears, among others, in genetic data sets, where the number n of features x_i can be a thousand times larger than the number m of objects (for example, n = 50000, m = 50). Typically, such learning sets G^+ and G^- are linearly separable.
In accordance with this stop criterion, the feature reduction process is carried out for as long as the linear separability of the reduced learning sets is preserved. The results of numerical experiments with this stop criterion are described in [4]. The experiments were carried out both with a large synthetic data set and with two benchmark genetic data sets (Breast cancer [9] and Leukemia [10]).
The synthetic data set contained m = 1000 feature vectors with n = 100 features. The vectors x_j were generated randomly in accordance with the multivariate normal distribution N(0, Σ) with the covariance matrix Σ equal to the unit matrix I (Σ = I). The objects x_j were divided into the learning sets G^+ (m^+ = 630 vectors) and G^- (m^- = 370 vectors) on the basis of a pre-selected linear combination of 10 features x_i (the linear key). If the value of this linear key was greater than an assumed threshold, the vector x_j was attributed to the set G^+; if it was lower, the vector x_j was attributed to the set G^-. The learning sets G^+ and G^- defined in this way were thus linearly separable. The RLS method allowed 90 features x_i to be omitted while preserving the linear separability of the reduced learning sets G^+_k and G^-_k, k ≥ 10, and the linear key of 10 features was correctly rediscovered by the RLS method [4].
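The construction of such a synthetic data set can be sketched as follows. The random seed, the threshold of zero, and the ±1 key coefficients are assumptions (the original 630/370 split implies a different threshold, which is not stated):

```python
import random

# Generate N(0, I) vectors and split them into G+ / G- with a linear
# key over the first key_size features, as in the synthetic experiment.

def make_synthetic(m=1000, n=100, key_size=10, threshold=0.0, seed=7):
    rng = random.Random(seed)
    key = [rng.choice([-1.0, 1.0]) for _ in range(key_size)]  # linear key
    G_plus, G_minus = [], []
    for _ in range(m):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        value = sum(k * xi for k, xi in zip(key, x[:key_size]))
        (G_plus if value > threshold else G_minus).append(x)
    return G_plus, G_minus, key
```

By construction, the two sets are linearly separable by the key itself, so a feature selection method that preserves linear separability can in principle recover the key features.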
The Breast cancer data set describes 97 patients tested for the presence of breast cancer [9]. The positive learning set G^+ contains m^+ = 46 patients and the negative set G^- contains m^- = 51 patients. Each patient is characterized by the expression levels of n = 24481 genes (features). The RLS method allowed the dimensionality to be reduced to 12 genes while preserving the linear separability of the learning sets G^+_k and G^-_k. This means that the discovered linear combination of the selected 12 features allowed a full separation of the two subgroups of patients. The computational time on a standard personal computer was about 25 min.
The Leukemia data set contains the expression levels of n = 7129 genes in 72 patients [10]. The positive learning set G^+ contains m^+ = 25 patients and the negative set G^- contains m^- = 47 patients. The RLS method allowed the dimensionality to be reduced to three genes while preserving the linear separability of the learning sets G^+_k and G^-_k. This means that the discovered linear combination of the three selected genes allowed a 100% correct separation of the two groups of 72 patients.