Logistic regression with image covariates via the combination of L1 and Sobolev regularizations

The use of image covariates to build classification models has substantial impact in various fields, such as computer science and medicine. The aim of this paper is to develop an estimation method for the logistic regression model with image covariates. We propose a novel regularized estimation approach, where the regularization combines L1 regularization and Sobolev norm regularization. The L1 penalty performs variable selection, while the Sobolev norm penalty captures the shape and edge information of image data. We develop an efficient algorithm for the optimization problem, and we establish a nonasymptotic error bound on parameter estimation. Simulation studies and a real data application demonstrate that the proposed method performs very well.


Introduction
As one of the most important issues in machine learning, classification plays a prominent role throughout various disciplines. A large number of classification methods have been developed, such as KNN, linear (quadratic) discriminant analysis, logistic regression, naive Bayes, decision trees, SVM, neural networks, deep learning, and many others [1,2]. Among these methods, logistic regression has a long history [3] and is one of the most popular approaches. The logistic regression model is a typical representative of generalized linear models and linear classification methods; therefore, this article takes logistic regression as its research object. Traditionally, the maximum likelihood method is used to obtain an estimator of the parameters in the logistic regression model [4-6].
However, the big data era brings massive, complex data, one of whose most prominent characteristics is high dimensionality. The maximum likelihood estimation method in the logistic regression model faces serious problems, such as non-existence and non-uniqueness of the estimator [7], in high dimensional settings. Regularization is a popular strategy for handling high dimensional problems [8]. Many regularized methods have been proposed over the past decades, including the LASSO [9], the smoothly clipped absolute deviation (SCAD) penalty [10], the minimax concave penalty (MCP) [11], and so on. For high dimensional logistic regression, [12] considered an L1-regularization path algorithm, [13] proposed an interior-point method for large-scale L1-regularized logistic regression, and [14] proposed the group lasso for logistic regression. The L1/2-regularized logistic regression was considered by [15] for gene selection in cancer classification. Image data are a very popular form of data, generated in many fields such as computer science and medicine. In addition to high dimensionality, image data usually contain spatially smooth regions with relatively sharp edges, which leads to characteristics including local smoothness [16], jump discontinuity [17], and many others. Local smoothness leads to highly correlated features, which makes the image classification problem more challenging [18]. Jump discontinuity makes conventional smoothing techniques inefficient [17]. On the other hand, using these characteristics in the modeling process often enhances model efficiency and has received much attention recently. For example, [19] introduced a locally adaptive smoothing method for image restoration. [16] proposed the Propagation-Separation (PS) approach for local likelihood estimation, which can handle the local smoothness of image data. [20] developed an adaptive regression model for the analysis of neuroimaging data, which is a generalization of the PS approach.
[21] studied the theoretical performance of nonlocal means for noise removal in image data. [17] considered a spatially varying coefficient model for neuroimaging data with jump discontinuities. [18] proposed a spatially weighted principal component analysis (SWPCA) for image classification. [22] developed generalized scalar-on-image regression models via total variation regularization, which can keep the piecewise smooth nature of imaging data. [23] proposed an efficient nuclear norm penalized estimation method for matrix linear discriminant analysis.
In this paper, we consider a logistic regression model with image covariates, and develop a regularized estimation approach that combines L1 regularization and Sobolev norm regularization. The L1 penalty performs variable selection and removes covariates unrelated to the response [9]. The Sobolev norm penalty preserves characteristics of image data (e.g. local smoothness) in model fitting. In fact, Sobolev regularization is a popular technique in image data analysis, for example in image denoising [24] and edge detection [25]. The proposed regularization method differs from the aforementioned regularized logistic regression models. It also differs from the elastic net [26], which combines the lasso and ridge regression. The elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together; however, it cannot exploit the structural information of image covariates and is not suitable for models with image covariates. There are also differences between our proposed method and the fused lasso [27]. In many real data analyses, such as gene expression data, the covariates have an order: adjacent covariates are often highly correlated and have similar effects on the response variable, and the fused lasso makes adjacent covariates tend to share a common effect on the response. The proposed method can be viewed as an extension of the fused lasso from one dimension to multiple dimensions, with the fusion term replaced by the Sobolev norm penalty. Furthermore, we develop a novel algorithm to solve the optimization problem. The theoretical properties of our estimator are also studied, and a nonasymptotic estimation error bound is given. Numerical studies, including simulations and a real data analysis, verify the performance of our method.
The rest of the article is organized as follows. Section 2 presents the methodology, including the model setup, the algorithm, and theoretical properties. Section 3 contains numerical studies, where simulation studies and a real data application are presented. Lastly, we give a short conclusion in Section 4. The proofs of the theoretical results are given in the Appendix.

Model setup
Suppose that we have observations (X_i, Y_i), 1 ≤ i ≤ n, where Y_i ∈ {−1, +1} is the class label and X_i = (x^{(i)}_{jk} : j = 1, ..., p; k = 1, ..., q) ∈ R^{p×q} is the corresponding image covariate. We further assume that the pairs (X_i, Y_i), 1 ≤ i ≤ n, are independent and identically distributed. To predict Y_i from X_i, the following logistic regression model is assumed:

P(Y_i = 1 | X_i) = exp(⟨B, X_i⟩) / (1 + exp(⟨B, X_i⟩)),   (1)

where B ∈ R^{p×q} is the coefficient image and ⟨B, X_i⟩ = Σ_{j,k} b_{jk} x^{(i)}_{jk}. Writing β ∈ R^{pq} and x_i ∈ R^{pq} for the vectorizations of B and X_i, we have ⟨B, X_i⟩ = β^T x_i. Traditionally, the maximum likelihood method is used to estimate the coefficient image B. The likelihood function is

L(β) = Π_{i=1}^n 1 / (1 + exp(−Y_i β^T x_i)),

and the corresponding log-likelihood function is ln(L(β)) = −Σ_{i=1}^n log(1 + e^{−Y_i β^T x_i}). Denote the logistic loss function by ℓ(β) = log(1 + e^{−Y β^T x}), and the associated risk by Pℓ(β) = Eℓ(β). We assume that β* = argmin_β Pℓ(β). The empirical risk is

P_n ℓ(β) = (1/n) Σ_{i=1}^n log(1 + e^{−Y_i β^T x_i}).

Hence, maximizing the likelihood function is equivalent to minimizing the empirical risk, min_β P_n ℓ(β). Many optimization methods, such as the Newton-Raphson method [6], can be used to solve this problem.
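As a concrete illustration, the empirical risk P_n ℓ(β) above can be computed as follows. This is a minimal sketch (function and variable names are ours, not the paper's), using the numerically stable identity log(1 + e^{−m}) = logaddexp(0, −m).

```python
import numpy as np

def empirical_logistic_risk(beta, X, y):
    """Empirical risk P_n l(beta) = (1/n) * sum_i log(1 + exp(-Y_i beta^T x_i)).

    X    : (n, pq) array of vectorized image covariates x_i
    y    : (n,) array of labels in {-1, +1}
    beta : (pq,) vectorized coefficient image
    """
    margins = y * (X @ beta)                 # Y_i * beta^T x_i for each subject
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -margins))

# sanity check: beta = 0 gives risk log(2) regardless of the data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 9))
y = np.array([1, -1, 1, 1, -1])
risk0 = empirical_logistic_risk(np.zeros(9), X, y)
```

At β = 0 every margin is zero, so each summand is log 2, matching the log-likelihood formula above up to the 1/n factor.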
However, in the image covariate case, the corresponding coefficient image B is usually assumed to be a piecewise smooth image with unknown edges. This assumption is widely used in the imaging literature and is critical for addressing various scientific questions [22]. The maximum likelihood method does not take advantage of these characteristics. Moreover, image covariates are usually high dimensional, and not every element of X_i is useful for predicting Y_i, yet the maximum likelihood method cannot perform variable selection. Consequently, in the next subsection we propose a novel estimation method for B, which keeps characteristics of image covariates such as local smoothness and performs variable selection simultaneously.

Estimation
For the coefficient image B, we define its discrete gradient ∇B ∈ R^{p×q×2}, whose (j, k)-th entry

(∇B)_{jk} = (b_{j+1,k} − b_{j,k}, b_{j,k+1} − b_{j,k})

is the discrete gradient at position (j, k), with differences crossing the image boundary omitted. Here b_{j+1,k} − b_{j,k} is the discrete gradient in the vertical direction, and b_{j,k+1} − b_{j,k} is the discrete gradient in the horizontal direction. The Sobolev norm of B is the L2 norm of ∇B, written as

‖B‖²_Sob = Σ_{j=1}^{p−1} Σ_{k=1}^{q} (b_{j+1,k} − b_{j,k})² + Σ_{j=1}^{p} Σ_{k=1}^{q−1} (b_{j,k+1} − b_{j,k})².

In fact, we can rewrite ‖B‖²_Sob as a quadratic form of β. Specifically, we define a matrix D = (d_ij) ∈ R^{(2pq−p−q)×pq}, with d_ij defined in formula (2), so that Dβ stacks all vertical and horizontal differences of B; each row of D contains exactly one entry equal to 1 and one entry equal to −1. Then one can easily verify that ‖B‖²_Sob = β^T D^T Dβ. For the purpose of understanding, we also display the matrix D graphically in the case p = q = 3; see Fig 1.
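To make the construction concrete, the following sketch builds D for general p and q and numerically verifies ‖B‖²_Sob = β^T D^T Dβ. The row-by-row vectorization order is our own assumption for illustration, and the row ordering of D need not match formula (2) exactly.

```python
import numpy as np

def gradient_matrix(p, q):
    """Build D in R^{(2pq-p-q) x pq} so that D @ beta stacks the discrete
    gradients b_{j+1,k} - b_{j,k} (vertical) and b_{j,k+1} - b_{j,k}
    (horizontal), with B vectorized row by row: beta[j*q + k] = b_{j,k}."""
    rows = []
    for j in range(p - 1):            # vertical differences: (p-1)*q rows
        for k in range(q):
            r = np.zeros(p * q)
            r[(j + 1) * q + k] = 1.0
            r[j * q + k] = -1.0
            rows.append(r)
    for j in range(p):                # horizontal differences: p*(q-1) rows
        for k in range(q - 1):
            r = np.zeros(p * q)
            r[j * q + k + 1] = 1.0
            r[j * q + k] = -1.0
            rows.append(r)
    return np.vstack(rows)

def sobolev_sq(B):
    """||B||_Sob^2: sum of squared vertical and horizontal differences."""
    return np.sum(np.diff(B, axis=0) ** 2) + np.sum(np.diff(B, axis=1) ** 2)

p = q = 3
D = gradient_matrix(p, q)
B = np.arange(9.0).reshape(3, 3)
beta = B.ravel()
# the quadratic form beta^T D^T D beta equals the squared Sobolev norm
assert np.isclose(beta @ D.T @ D @ beta, sobolev_sq(B))
assert D.shape == (2 * p * q - p - q, p * q)   # 12 x 9 for p = q = 3
```

For p = q = 3 the matrix has 12 rows (6 vertical and 6 horizontal differences), consistent with the graph in Fig 1.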
We then consider the following optimization problem:

min_β Q(β),   where Q(β) = P_n ℓ(β) + λ_1 β^T D^T Dβ + λ_2 ‖β‖_1,   (3)

and ‖·‖_1 is the L1 norm. The term λ_1 β^T D^T Dβ shrinks adjacent elements of B toward each other, so it captures the local smoothness of B. The term λ_2 ‖β‖_1 shrinks elements of B to 0 and performs variable selection. We next propose an algorithm to solve the optimization problem (3).

Algorithm
For the optimization problem (3), we define K = D^T D = (k_jl) and H(β) = P_n ℓ(β) + λ_1 β^T Kβ, so that Q(β) = H(β) + λ_2 ‖β‖_1. This indicates that Q(β) is a convex function with a separable structure [28]. [29] shows that the coordinate descent algorithm is guaranteed to converge to the global minimizer for any convex objective with such a separable structure. Hence we propose a coordinate descent algorithm to obtain the solution of (3). For j = 1, ..., pq, we successively minimize Q(β) along the β_j direction with the other parameters fixed. Specifically, denote the current value of β by β^c. For each j, we use the second-order Taylor expansion to approximate the function H along the j-th coordinate:

H(β) ≈ H(β^c) + g_j (β_j − β_j^c) + (1/2) h_j (β_j − β_j^c)² + C,

where g_j and h_j are the first and second partial derivatives of H with respect to β_j evaluated at β^c, and C is a constant containing no information about β_j. Minimizing this quadratic approximation plus λ_2 |β_j| yields the soft-thresholding update

β_j ← S(β_j^c − g_j / h_j, λ_2 / h_j),   where S(z, t) = sign(z) max(|z| − t, 0).

We summarize the algorithm as follows.

Coordinate descent algorithm:
• Step 1. Initialization. Given an initial value β.
• Step 2. For j = 1, ..., pq: update β_j by the soft-thresholding step above, keeping the other coordinates fixed. End for.
• Step 3. Repeat Step 2 until convergence.
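The coordinate-wise update can be sketched as below. We do not reproduce the paper's exact constants; instead, this sketch minimizes a quadratic upper bound on H along β_j (bounding the logistic curvature σ(1 − σ) by 1/4, so that no update can increase Q) followed by soft-thresholding. All names and step details are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_lg_sob(X, y, K, lam1, lam2, n_sweeps=200, tol=1e-7):
    """Coordinate descent sketch for
       Q(beta) = P_n l(beta) + lam1 * beta^T K beta + lam2 * ||beta||_1.

    Each coordinate minimizes a quadratic upper bound on
    H(beta) = P_n l(beta) + lam1 beta^T K beta along beta_j, using the
    curvature bound sigma(1 - sigma) <= 1/4, then soft-thresholds."""
    n, d = X.shape
    beta = np.zeros(d)
    margins = y * (X @ beta)                    # Y_i * beta^T x_i
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(d):
            # sigma(-Y_i beta^T x_i), clipped to avoid overflow in exp
            sig = 1.0 / (1.0 + np.exp(np.clip(margins, -500, 500)))
            # first derivative of H along beta_j
            g = np.mean(-y * X[:, j] * sig) + 2.0 * lam1 * (K[j] @ beta)
            # uniform curvature bound along beta_j (majorization step)
            h = max(0.25 * np.mean(X[:, j] ** 2) + 2.0 * lam1 * K[j, j], 1e-10)
            new_bj = soft_threshold(beta[j] - g / h, lam2 / h)
            if new_bj != beta[j]:
                margins = margins + y * X[:, j] * (new_bj - beta[j])
                beta[j] = new_bj
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

Because each coordinate step minimizes a valid majorizer of Q restricted to β_j, the objective is non-increasing across sweeps, which mirrors the convergence guarantee cited from [29].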
By the proposed algorithm, we can obtain the solution of (3), denoted by β̂. The theoretical properties of β̂, as the estimator of β, are studied in the next subsection.

Theoretical properties
In this subsection, we consider the properties of β̂ and give a nonasymptotic error bound. We assume that the true value β* is sparse. Let I* = {1 ≤ j ≤ pq : β*_j ≠ 0}, and let k* = Σ_{j=1}^{pq} I(β*_j ≠ 0) be the cardinality of I*. For the purpose of theoretical studies, we make the following assumptions.

Assumption 1. There exists a constant L such that |x_ij| ≤ L for every 1 ≤ i ≤ n, 1 ≤ j ≤ pq.

Assumption 2. There exists a constant C such that ‖β*‖_1 ≤ C.

Assumption 3. For the matrix K, there exists a constant C_0 such that λ_max(K) ≤ C_0, where λ_max(K) is the largest eigenvalue of K.

Assumption 4 (Condition Stabil). For some constants s > 0, b > 0, and ε ≥ 0, the correlation structure of the covariates satisfies the Condition Stabil of [30] with these constants.

Assumption 1 gives a common bound L for all x_ij with i = 1, ..., n, j = 1, ..., pq. Assumption 2 bounds ‖β*‖_1. Combining Assumptions 1 and 2, one can ensure that the conditional probabilities P_i = P(Y_i = 1 | X_i), 1 ≤ i ≤ n, are bounded away from zero and one. If P_i equals zero or one, the i-th subject is either ignorable or dominant in the likelihood function, which is not expected in statistical analysis; this case is ruled out by Assumptions 1 and 2. In Assumption 3, we assume that the largest eigenvalue of K is bounded. Assumption 4, the Condition Stabil, can be regarded as a stability requirement on the correlation structure [30]. Under these assumptions, we have the following theorem.
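As a side check of Assumption 3, note that K = D^T D is the graph Laplacian of the p × q grid whose edges are the vertical and horizontal neighbor pairs; its largest eigenvalue is at most twice the maximum degree, i.e. at most 8 for any p and q, so the bound C_0 exists. The following sketch (our own, not from the paper) verifies this numerically.

```python
import numpy as np

def grid_laplacian(p, q):
    """K = D^T D for the p x q grid: the graph Laplacian whose edges are
    the vertical and horizontal neighbor pairs used in the Sobolev norm."""
    d = p * q
    K = np.zeros((d, d))
    def add_edge(a, b):
        # each edge contributes (e_a - e_b)(e_a - e_b)^T to D^T D
        K[a, a] += 1.0; K[b, b] += 1.0
        K[a, b] -= 1.0; K[b, a] -= 1.0
    for j in range(p):
        for k in range(q):
            if j + 1 < p:
                add_edge(j * q + k, (j + 1) * q + k)   # vertical edge
            if k + 1 < q:
                add_edge(j * q + k, j * q + k + 1)     # horizontal edge
    return K

# largest eigenvalue stays below 2 * (max degree) = 8 for any grid size
for (p, q) in [(3, 3), (5, 8), (10, 10)]:
    lam_max = np.linalg.eigvalsh(grid_laplacian(p, q)).max()
    assert lam_max <= 8.0 + 1e-9
```

The bound is independent of p and q, so Assumption 3 imposes no real restriction for the particular K arising from the Sobolev penalty.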

Theorem 1. Assume that Assumptions 1-3 are true and that Assumption 4 holds with constants s, b, and ε. Let d = pq, and suppose that

λ_2 ≥ √(2 ln(2d) / n).   (4)

Then, with high probability,

‖β̂ − β*‖_1 ≤ 3k*λ_2/(sb) + (1 + 3s/λ_2)ε.

The proof of Theorem 1 is given in the appendix. The theorem shows that, with high probability, the L1 norm of the estimation error is bounded by 3k*λ_2/(sb) + (1 + 3s/λ_2)ε. One can see that the term (1 + 3s/λ_2)ε = O(d/2^d), which is negligible for large d. Hence, the term 3k*λ_2/(sb) dominates the upper bound, and it becomes larger as b becomes smaller. If we further assume that ln(pq) = o(n), then by condition (4) λ_2 can tend to 0, and thus 3k*λ_2/(sb) → 0; that is, the upper bound can tend to zero. Consequently, the consistency of β̂ is guaranteed.

The selection of tuning parameters
The objective function (3) contains two tuning parameters, λ_1 and λ_2, which should be determined by some criterion, such as BIC or cross validation. In our simulation studies we select the tuning parameters with a validation set, and in the real data analysis the cross validation method is used. Before applying these methods, one should first determine the value range of the tuning parameters. Specifically, we reparametrize λ_1 and λ_2: let λ = λ_1 + λ_2 and α = λ_2/λ. Then the penalty terms in (3) can be rewritten as λ(α‖β‖_1 + (1 − α)β^T Kβ). Because α ∈ [0, 1], the candidate values of α are set as 0.02κ for κ = 1, ..., 50. Given α, we denote by λ_0 the threshold value such that for λ ≥ λ_0 the solution of (3) is exactly zero. By the first-order optimality condition at β = 0, one can see that β = 0 is the solution exactly when every element of 1/(2λαn) Σ_{i=1}^n (−Y_i x_i) belongs to [−1, 1]. This means that λ_0 = 1/(2αn) ‖Σ_{i=1}^n Y_i x_i‖_∞. Following the idea of [14], the candidate values of λ are set as 0.001 and 0.96^ν λ_0 for ν = 0, 1, ..., 160. For the validation set method, the prediction error on the validation set of our approach with tuning parameters (α, λ) is denoted by PE(α, λ), and the final (α, λ) are selected as the minimizer of PE(α, λ).
For the cross validation method, (α, λ) are selected as the minimizer of the cross validation prediction error [6].
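The threshold λ_0 and the candidate grids described above can be sketched as follows (function and variable names are ours).

```python
import numpy as np

def lambda_path(X, y, alpha, n_exp=161):
    """Candidate lambda values for a given alpha = lambda2 / lambda.

    lambda0 = ||sum_i Y_i x_i||_inf / (2 * alpha * n) is the threshold:
    for lambda >= lambda0, every element of (1/(2*lambda*alpha*n)) *
    sum_i (-Y_i x_i) lies in [-1, 1], so beta = 0 solves (3).
    Candidates follow the paper: 0.96^nu * lambda0 for nu = 0, ..., 160,
    plus the small value 0.001.
    """
    n = X.shape[0]
    lam0 = np.max(np.abs(y @ X)) / (2.0 * alpha * n)
    lams = lam0 * 0.96 ** np.arange(n_exp)
    return lam0, np.append(lams, 0.001)

# candidate alpha values: 0.02 * kappa for kappa = 1, ..., 50
alphas = 0.02 * np.arange(1, 51)
```

A validation-set or cross-validation search then loops over `alphas` and, for each α, over the λ values returned by `lambda_path`.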

Numerical studies
In this section, we evaluate the performance of our proposed method through two simulated examples and a real data analysis. For the purpose of comparison, we also consider the logistic regression model with the L1 penalty [12,13], the logistic regression model with the fused lasso penalty, and the linear support vector machine, denoted by LG-L1, LG-fused, and Linear SVM, respectively. Our proposed method is abbreviated as LG-sob.
Simulation studies

Example 1. We generate data from the model

P(Y_i = 1 | X_i) = exp(⟨B_0, X_i⟩) / (1 + exp(⟨B_0, X_i⟩)),

where X_i and B_0 both belong to R^{32×32}. One consequence of image covariates is that the corresponding regression coefficient can itself be treated as an image. Hence we treat B_0 as an image, while X_i is generated from a multivariate normal distribution. Specifically, we denote the vectorization of X_i by x_i, and generate x_i from a multivariate normal distribution with mean 0 and covariance cov(x_{ij_1}, x_{ij_2}) = 0.5^{|j_1 − j_2|} for any 1 ≤ j_1, j_2 ≤ 1024. The parameter image B_0 is considered in two cases, shown in Fig 2. The first case, denoted B_01, is a bird picture, in which the blue region takes value 0 and the yellow region takes value 1. The other case, denoted B_02, is a butterfly picture, which is more complicated and takes values in the interval [−0.0197, 0.0628]. Given X_i and B_0, the response Y_i is generated from the two-point distribution above.

Example 2. The mechanism of data generation is similar to that of Example 1; the only difference is that we generate X_i in a more complex way. In particular, we follow the simulation scheme of [22] and generate X_i from a 32 × 32 phantom map with 1024 pixels according to the spatially correlated random process X_i = Σ_l l^{−1} η_{il} φ_l, where the η_{il} are standard normal random variables and the φ_l are bivariate Haar wavelet basis functions.
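The Example 1 data-generating mechanism can be sketched as follows; the function name is ours, and a small 6 × 6 coefficient image is used instead of the paper's 32 × 32 for illustration.

```python
import numpy as np

def generate_example1(n, B0, rho=0.5, rng=None):
    """Generate (X_i, Y_i) as in Example 1 (a sketch).

    The vectorized covariate x_i is N(0, Sigma) with
    Sigma[j1, j2] = rho^{|j1 - j2|}, and Y_i follows the two-point law
    P(Y_i = 1 | X_i) = exp(<B0, X_i>) / (1 + exp(<B0, X_i>)).
    Returns X as an (n, p*q) array of vectorized images and Y in {-1, +1}.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = B0.size
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    p1 = 1.0 / (1.0 + np.exp(-(X @ B0.ravel())))         # P(Y_i = 1 | X_i)
    Y = np.where(rng.random(n) < p1, 1, -1)
    return X, Y

# small illustration: a 6 x 6 "block" coefficient image instead of 32 x 32
B0 = np.zeros((6, 6))
B0[1:4, 1:4] = 1.0
X, Y = generate_example1(100, B0, rng=np.random.default_rng(2))
```

The same routine applied with a 32 × 32 bird or butterfly image for B0 reproduces the paper's Example 1 setting.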
For these two simulated examples, along with the training set with sample size n, we also generate a validation set and a test set with sample sizes both being 500. We train the model on the training set, select tuning parameters through the validation set, and calculate the classification accuracy on the test set to evaluate the performance of the model.
For every specification of the parameter B_0 and sample size n, we replicate the simulation 100 times for each example, and the average prediction errors are computed and summarized in Table 1 for Example 1 and Table 2 for Example 2, respectively. Besides the prediction errors, we also calculate the average estimation errors Σ_{i=1}^{100} ‖B̂_i − B_0‖_2 / 100 for LG-sob, LG-L1, and LG-fused, where B̂_i is the parameter image estimate in the i-th replication. From the results, one can see that our proposed method performs better than the other three methods in all cases from the prediction perspective. As the sample size n becomes larger, the prediction errors become smaller, but the estimation errors do not decrease accordingly. The reason may be that the tuning parameters are selected by minimizing the prediction error. Moreover, we randomly select one simulated result from the 100 replications of Example 1 and show the parameter image estimates in Fig 3. One can see that our proposed LG-sob method can capture the shapes of the images, while LG-L1 and LG-fused do not have this property.

A real data analysis
The classification of the ZIP Code Dataset is a benchmark problem in the machine learning community [6]. The ZIP Code Dataset can be obtained from the following website: https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html [28]. The dataset contains normalized handwritten digits, scanned automatically from envelopes by the U.S. Postal Service. Every observation is a handwritten digit, represented as a size-normalized 16 × 16 grayscale image [31]. The purpose is to use the 256 pixel values to predict the corresponding digit. The dataset contains a training set with 7291 observations and a test set. More specifically, we denote the i-th observation by X_i ∈ R^{16×16}, and define the corresponding class label Y_i = −1 if X_i represents a handwritten 3 and Y_i = +1 if X_i represents a handwritten 8. Our proposed method is applied to construct a classifier for predicting Y_i (i.e. the handwritten numeral) from the grayscale image X_i. We train the model on the training set and evaluate the performance of the proposed method on the test set by classification accuracy. For the purpose of comparison, we also consider the logistic regression model with only the L1 penalty, along with LG-fused and Linear SVM.
The tuning parameters are selected by the 10-fold cross validation (CV) method. The CV prediction errors under various parameter settings are calculated and plotted in Fig 5. Finally, our proposed method selects the tuning parameters α = 0.04 and λ = 0.0118, while the method with the L1 penalty selects λ = 0.0014. The parameter image estimates are shown in Fig 6. One can see that our proposed method tends to make adjacent pixels have similar effects in the model, while LG-L1 produces a sparser parameter estimate, and LG-fused makes only pixels adjacent in the vertical direction have similar effects. The top-left region of the parameter image has positive effects for handwritten numeral 8, and the bottom-right region has positive effects for handwritten numeral 3. The classification accuracy on the test set of our proposed method is 96.99%, while the accuracies of LG-L1, LG-fused, and Linear SVM are 96.39%, 96.08%, and 96.39%, respectively. The proposed method performs better.

Conclusion
We have developed a novel estimation method for logistic regression with image covariates. The method not only performs variable selection, but also captures the shape features of the parameter image. Both theoretical results and numerical studies show that the method performs well. We have proposed a coordinate descent algorithm to solve the optimization problem, and the global convergence of the algorithm is guaranteed. However, as pointed out by one referee, the coordinate descent algorithm is very time consuming, especially in the case of high dimensional image covariates. More efficient optimization approaches, such as Nesterov accelerated gradient methods [32] and interior-point methods [13], may be more suitable; we will study this issue in future work. Furthermore, our method is based on Sobolev norm regularization; compared with it, total variation regularization is more sensitive to sharp edges and jumps of the parameter image, but the corresponding estimation algorithm is more complex. We also leave this for future work.

Appendix: Proof of Theorem 1
Before giving the proof of Theorem 1, we first state the bounded differences inequality as the following lemma, without proof.

Lemma 1 (Bounded Differences Inequality). Suppose that X_1, ..., X_n ∈ H are independent, and that the function f : H^n → R satisfies the bounded difference condition

sup_{x_1, ..., x_n, x_i'} |f(x_1, ..., x_i, ..., x_n) − f(x_1, ..., x_i', ..., x_n)| ≤ c_i

for i = 1, ..., n. Then for all t > 0,

P(|f(X_1, ..., X_n) − Ef(X_1, ..., X_n)| ≥ t) ≤ 2 exp(−2t² / Σ_{i=1}^n c_i²).

For more details about Lemma 1 and its proof, one can refer to [33]. The following is the proof of Theorem 1.