Sparse Poisson regression via mixed-integer optimization

We present a mixed-integer optimization (MIO) approach to sparse Poisson regression. The MIO approach to sparse linear regression was first proposed in the 1970s, but has recently received renewed attention due to advances in optimization algorithms and computer hardware. In contrast to many sparse estimation algorithms, the MIO approach has the advantage of finding the best subset of explanatory variables with respect to various criterion functions. In this paper, we focus on a sparse Poisson regression that maximizes the weighted sum of the log-likelihood function and the L2-regularization term. For this problem, we derive a mixed-integer quadratic optimization (MIQO) formulation by applying a piecewise-linear approximation to the log-likelihood function. Optimization software can solve this MIQO problem to optimality. Moreover, we propose two methods for selecting a limited number of tangent lines effective for piecewise-linear approximations. We assess the efficacy of our method through computational experiments using synthetic and real-world datasets. Our methods provide better log-likelihood values than do conventional greedy algorithms in selecting tangent lines. In addition, our MIQO formulation delivers better out-of-sample prediction performance than do forward stepwise selection and L1-regularized estimation, especially in low-noise situations.


Introduction
A count variable, which takes on only nonnegative integer values, reflects the number of occurrences of an event during a fixed time period. Count regression models such as Poisson, overdispersed Poisson, and negative binomial regression are standard methods for predicting such count variables [1][2][3]. In particular, Poisson regression is the most commonly used model for count regression. There are numerous applications of Poisson regression models for predicting count variables, including manufacturing defects [4], disease incidence [5], crowd counting [6], length of hospital stay [7], and vehicle crashes [8].
The aim of sparse estimation is to decrease the number of nonzero estimates of regression coefficients. This method is often used for selecting a significant subset of explanatory variables [9][10][11][12]. Subset selection provides the following benefits:
• data collection and storage costs can be reduced,
• the computational load of estimating regression coefficients can be reduced,
• the interpretability of regression analysis can be increased, and
• the generalization performance of a regression model can be improved.
A direct way to obtain the best sparse estimate is to evaluate all possible subset regression models. However, such an exhaustive search [13][14][15] is often computationally infeasible because the number of possible subsets grows exponentially with the number of candidate variables. In contrast, stepwise selection [15,16], which repeatedly adds and eliminates one explanatory variable at a time, is a practical method for sparse estimation. Several metaheuristic algorithms have been applied to subset selection for Poisson regression [17,18], and various regularization methods have recently been proposed for sparse Poisson regression [19][20][21][22]. Note, however, that these (non-exhaustive) sparse estimation methods are heuristic algorithms, which cannot verify the optimality of an obtained subset of explanatory variables (e.g., in the maximum likelihood sense).
The log-likelihood to be maximized is a concave but nonlinear function, making it hard to apply an MIO approach to sparse Poisson regression. To remedy such nonlinearity, prior studies made effective use of piecewise-linear approximations of the log-likelihood functions, thereby yielding mixed-integer linear optimization (MILO) formulations for binary or ordinal classification [38][39][40]. Optimization software can solve the resultant MILO problems to optimality. Greedy algorithms for selecting a limited number of linear functions for piecewise-linear approximations have also been developed [38,40].
This paper aims to establish an effective MIO approach to sparse Poisson regression based on piecewise-linear approximations. Specifically, we consider a sparse Poisson regression that maximizes the weighted sum of the log-likelihood function and the L2-regularization term. To that end, we derive a mixed-integer quadratic optimization (MIQO) formulation by applying a piecewise-linear approximation to the log-likelihood function. We also propose two methods for selecting a limited number of tangent lines to improve the quality of piecewise-linear approximations.
We assess the efficacy of our method through computational experiments using synthetic and real-world datasets. Our methods for selecting tangent lines produce better log-likelihood values than do conventional greedy algorithms. For synthetic datasets, our MIQO formulation realizes better out-of-sample prediction performance than do forward stepwise selection and L1-regularized estimation, especially in low-noise situations. For real-world datasets, our MIQO formulation compares favorably with the other methods in out-of-sample prediction performance.

Methods
This section starts with a brief review of Poisson regression, and then presents our MIO formulations for sparse Poisson regression based on piecewise-linear approximations. We then describe our methods for selecting tangent lines suitable for piecewise-linear approximations.

Poisson regression model
Suppose we are given a sample of n data instances (x_i, y_i) for i ∈ [n] ≔ {1, 2, . . ., n}, where x_i ≔ (x_{i1}, x_{i2}, . . ., x_{ip})^⊤ is a vector of p explanatory variables, and y_i is a nonnegative integer count to be predicted. The random count variable Y is assumed to follow the Poisson distribution

  P(Y = k) = λ^k exp(−λ) / k!  (k = 0, 1, 2, . . .),

where λ ∈ R_+ is a parameter representing both the mean and variance of the Poisson distribution. The distribution parameter λ_i ∈ R_+ is explained by the log-linear regression model

  λ_i = exp(w^⊤ x_i + b)  (i ∈ [n]),  (3)

where w ≔ (w_1, w_2, . . ., w_p)^⊤ is a vector of regression coefficients, and b is an intercept term. Then, the occurrence probability of the given sample is expressed as

  ∏_{i ∈ [n]} λ_i^{y_i} exp(−λ_i) / y_i!.

The regression parameters (b, w) are estimated by maximizing the log-likelihood function

  ∑_{i ∈ [n]} ( y_i (w^⊤ x_i + b) − exp(w^⊤ x_i + b) − log(y_i!) ) = ∑_{i ∈ [n]} f_{y_i}(w^⊤ x_i + b),  (4)

where f_k(u) is a nonlinear function defined as

  f_k(u) ≔ k u − exp(u) − log(k!)  (k = 0, 1, 2, . . .).  (5)

Since its second derivative f''_k(u) = −exp(u) is always negative, f_k(u) is a nonlinear concave function.
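As a concrete illustration, the log-likelihood above can be evaluated directly. The following is a minimal sketch of ours (the function names are not from the paper):

```python
import math

def f_k(u, k):
    # f_k(u) = k*u - exp(u) - log(k!): log-probability of observing
    # count k under a Poisson model with log-mean u (so lambda = exp(u))
    return k * u - math.exp(u) - math.lgamma(k + 1)

def log_likelihood(b, w, X, y):
    # Sum of f_{y_i}(w^T x_i + b) over all data instances
    total = 0.0
    for x_i, y_i in zip(X, y):
        u = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        total += f_k(u, y_i)
    return total
```

Since f''_k(u) = −exp(u) < 0, each summand is concave in (b, w), so the full log-likelihood is concave as well.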
The following theorem gives an asymptote of f_k(u).

Theorem 1. When u goes to −∞, f_k(u) has the asymptote

  y = k u − log(k!).

Proof. We have

  lim_{u → −∞} ( f_k(u) − (k u − log(k!)) ) = lim_{u → −∞} ( −exp(u) ) = 0,

which completes the proof.

Mixed-integer nonlinear optimization formulation
Before deriving our desired formulation, we introduce a mixed-integer nonlinear optimization (MINLO) formulation for sparse Poisson regression. Let z ≔ (z_1, z_2, . . ., z_p)^⊤ be a vector composed of binary decision variables for subset selection, where z_j = 1 if the jth explanatory variable is selected, and z_j = 0 otherwise. To improve the generalization performance of the resultant regression model, we also introduce the L2-regularization term α w^⊤ w to be minimized, where α ∈ R_+ is a user-defined regularization parameter [45]. We therefore address maximizing the weighted sum of the log-likelihood function of Eq (4) and the L2-regularization term. This sparse Poisson regression can be formulated as the MINLO problem

  maximize_{b, w, z}  ∑_{i ∈ [n]} f_{y_i}(w^⊤ x_i + b) − α w^⊤ w  (7)
  subject to  z_j = 0 ⇒ w_j = 0  (j ∈ [p]),  (8)
              ∑_{j ∈ [p]} z_j = θ,  (9)
              b ∈ R, w ∈ R^p, z ∈ {0, 1}^p,  (10)

where θ ∈ [p] is a user-defined parameter of the subset size. If z_j = 0, then the jth coefficient must be zero by the logical implication of Eq (8). Eq (9) specifies the number of nonzero regression coefficients, and Eq (10) lists all decision variables. The logical implication of Eq (8) can be imposed by using indicator constraints implemented in modern optimization software. Eq (8) can also be represented as the big-M constraints

  −M z_j ≤ w_j ≤ M z_j  (j ∈ [p]),

where M ∈ R_+ is a sufficiently large positive constant.

Piecewise-linear approximation
It is very difficult to handle the MINLO problem by Eqs (7)-(10) using MIO software, because Eq (7) to be maximized is a concave but nonlinear function. Following prior studies [38][39][40], we apply piecewise-linear approximation techniques to the nonlinear function of Eq (5).
Letting {(u_{kℓ}, f_k(u_{kℓ})) | ℓ ∈ [h]} be a set of h tangent points for the function f_k(u), the corresponding tangent lines are

  g_k(u | u_{kℓ}) ≔ f_k(u_{kℓ}) + f'_k(u_{kℓ}) (u − u_{kℓ})  (ℓ ∈ [h]),  (11)

where f'_k(u) = k − exp(u) is the derivative of f_k(u). As Fig 2 shows, the graph of a concave function lies below its tangent lines, so f_k(u) can be approximated by the pointwise minimum of the set of h tangent lines. For each u, we approximate f_k(u) by

  maximize t  subject to  t ≤ g_k(u | u_{kℓ})  (ℓ ∈ [h]),  (12)

where t ∈ R is an auxiliary decision variable.
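To see numerically that the pointwise minimum of tangent lines over-approximates the concave function f_k, consider this small sketch of ours (the grid and tangent points are arbitrary choices, not the paper's):

```python
import math

def f_k(u, k):
    # f_k(u) = k*u - exp(u) - log(k!)
    return k * u - math.exp(u) - math.lgamma(k + 1)

def tangent(u, u_l, k):
    # g_k(u | u_l) = f_k(u_l) + f'_k(u_l) * (u - u_l), with f'_k(u) = k - exp(u)
    return f_k(u_l, k) + (k - math.exp(u_l)) * (u - u_l)

def pw_linear_approx(u, points, k):
    # Pointwise minimum of the tangent lines: an upper bound on f_k(u)
    return min(tangent(u, u_l, k) for u_l in points)

points = [-4.0, -2.0, 0.0, 2.0]  # example tangent points on [-5, 5]
```

The approximation is exact at each tangent point and lies above f_k everywhere else.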
We next focus on the approximation gap g_k(u | ū) − f_k(u) arising from a tangent point (ū, f_k(ū)). By the following theorem, this gap does not depend on k; therefore, we can employ the same set {u_ℓ | ℓ ∈ [h]} of tangent points for all k ∈ {0} ∪ [m] when making piecewise-linear approximations.

Theorem 2. For any tangent point (ū, f_k(ū)), the approximation gap g_k(u | ū) − f_k(u) does not depend on k.

Proof. It follows from Eqs (5) and (11) that

  g_k(u | ū) − f_k(u) = ( k ū − exp(ū) − log(k!) ) + ( k − exp(ū) )( u − ū ) − ( k u − exp(u) − log(k!) )
                      = exp(u) − exp(ū) ( 1 + u − ū ),

which completes the proof.
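The k-independence of the gap can also be checked numerically, since the gap reduces to exp(u) − exp(ū)(1 + u − ū), which contains no k. A quick sketch:

```python
import math

def f_k(u, k):
    return k * u - math.exp(u) - math.lgamma(k + 1)

def gap(u, u_bar, k):
    # Approximation gap g_k(u | u_bar) - f_k(u) at the tangent point u_bar
    g = f_k(u_bar, k) + (k - math.exp(u_bar)) * (u - u_bar)
    return g - f_k(u, k)
```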

Mixed-integer quadratic optimization formulation
We are now ready to present our desired formulation for sparse Poisson regression.
Let t ≔ (t_1, t_2, . . ., t_n)^⊤ be a vector composed of auxiliary decision variables for piecewise-linear approximations. We substitute Eq (11) and u = w^⊤ x_i + b into Eq (12) to make a piecewise-linear approximation of the objective function of Eq (7). By Theorem 2, we use {(u_ℓ, f_k(u_ℓ)) | ℓ ∈ [h]} as the set of h tangent points for every function f_k(u). Consequently, the MINLO problem by Eqs (7)-(10) can be reduced to the MIQO problem

  maximize_{b, t, w, z}  ∑_{i ∈ [n]} t_i − α w^⊤ w  (13)
  subject to  t_i ≤ f_{y_i}(u_ℓ) + ( y_i − exp(u_ℓ) )( w^⊤ x_i + b − u_ℓ )  (i ∈ [n], ℓ ∈ [h]),  (14)
              z_j = 0 ⇒ w_j = 0  (j ∈ [p]),  (15)
              ∑_{j ∈ [p]} z_j = θ,  (16)
              b ∈ R, t ∈ R^n, w ∈ R^p, z ∈ {0, 1}^p,  (17)

where Eq (17) lists all of the decision variables. Note that optimization software can solve this MIQO problem to optimality.
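The MIQO problem itself is intended for off-the-shelf MIO solvers (e.g., via indicator constraints). As a solver-free reference, the best-subset semantics can be mimicked on tiny data by enumerating every subset of size θ and fitting each by gradient ascent on the concave objective. This is only an illustrative sketch of ours, not the paper's algorithm, and the step size and iteration count are ad hoc choices:

```python
import itertools, math

def fit_subset(X, y, subset, alpha=0.0, lr=0.05, iters=3000):
    # Gradient ascent on sum_i f_{y_i}(w^T x_i + b) - alpha * w^T w,
    # with coefficients outside the chosen subset fixed at zero
    p = len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(iters):
        gw = [0.0] * p
        gb = 0.0
        for x_i, y_i in zip(X, y):
            u = sum(w[j] * x_i[j] for j in subset) + b
            r = y_i - math.exp(u)          # d f_k / du = k - exp(u)
            for j in subset:
                gw[j] += r * x_i[j]
            gb += r
        for j in subset:
            w[j] += lr * (gw[j] - 2.0 * alpha * w[j])
        b += lr * gb
    ll = sum(y_i * (sum(w[j] * x_i[j] for j in subset) + b)
             - math.exp(sum(w[j] * x_i[j] for j in subset) + b)
             - math.lgamma(y_i + 1)
             for x_i, y_i in zip(X, y))
    return ll, w, b

def best_subset(X, y, theta, alpha=0.0):
    # Enumerate all subsets with exactly theta nonzero coefficients
    p = len(X[0])
    return max((fit_subset(X, y, s, alpha) + (s,)
                for s in itertools.combinations(range(p), theta)),
               key=lambda t: t[0])
```

The point of the MIQO formulation is precisely to avoid this exponential enumeration by letting branch-and-bound prune subsets with a certificate of optimality.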

Previous algorithms for selecting tangent lines
The accuracy of piecewise-linear approximations depends on the associated set of tangent lines. With increasingly many well-placed tangent lines, the MIQO problem by Eqs (13)-(17) approaches the original MINLO problem by Eqs (7)-(10). In this case, however, solving the MIQO problem becomes computationally expensive because the problem size grows larger. It is therefore crucial to limit the number of tangent lines while keeping the approximation effective. Sato et al. [40] developed a greedy algorithm for selecting tangent lines to approximate the logistic loss function. This algorithm adds tangent lines one by one so that the total approximation gap (the area of the shaded portion in Fig 2) is minimized. Naganuma et al. [38] employed a greedy algorithm that selects tangent planes to approximate the bivariate nonlinear function for ordinal classification. This algorithm iteratively selects tangent points where the approximation gap is largest.
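For intuition, the gap-based greedy strategy of [38] can be sketched as follows: repeatedly add a tangent line at the point where the gap under the current tangent set is largest. The grid, interval, and starting point below are our arbitrary choices, not those of the cited work:

```python
import math

def gap(u, u_bar):
    # By Theorem 2 the gap is independent of k: exp(u) - exp(u_bar)*(1 + u - u_bar)
    return math.exp(u) - math.exp(u_bar) * (1.0 + u - u_bar)

def greedy_max_gap(L, U, h, grid_step=0.01):
    # Start from one tangent point, then repeatedly add the grid point
    # where the gap under the current tangent set is largest
    grid = [L + grid_step * i for i in range(int(round((U - L) / grid_step)) + 1)]
    points = [(L + U) / 2.0]
    while len(points) < h:
        points.append(max(grid, key=lambda u: min(gap(u, p) for p in points)))
    return sorted(points)
```

Because the procedure is deterministic and only ever appends points, the worst-case gap is nonincreasing in h.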
These previous algorithms have two limitations addressed in this paper. First, they totally ignore the properties of the sample distribution. Second, tangent lines are determined one at a time, so the resultant set of tangent lines is not necessarily optimal. In the following sections, we propose two methods, namely the adaptive greedy algorithm and the simultaneous optimization method, to resolve the first and second limitations, respectively.

Adaptive greedy algorithm
Our first method, the adaptive greedy algorithm, selects tangent lines depending on the sample distribution.
Suppose we are given (b̄, w̄) as regression parameter values. These values can be obtained, for example, through maximum likelihood estimation of the full model of Eq (3). We then have an empirical distribution of input values for the nonlinear function of Eq (5):

  { w̄^⊤ x_i + b̄ | i ∈ [n] }.

Our algorithm aims to minimize the sum of squared approximation gaps over this empirical distribution. Whereas the previous algorithms compute a set of tangent lines independently of the dataset, our algorithm adapts the set of tangent lines to each dataset.
We select h tangent points u*_1, u*_2, . . ., u*_h sequentially, where the sth tangent point u*_s is determined on the condition that the previous tangent points u*_1, u*_2, . . ., u*_{s−1} are fixed. This stepwise greedy procedure is formulated as

  u*_s ∈ argmin_{u_s ∈ [L, U]}  ∑_{i ∈ [n]} ( G_{y_i s}(w̄^⊤ x_i + b̄) − f_{y_i}(w̄^⊤ x_i + b̄) )²  (s ∈ [h]),  (18)

where G_{ks}(u) = min{g_k(u | u_ℓ) | ℓ ∈ [s]}, and [L, U] is an input interval of the nonlinear function of Eq (5). Notably, by Theorem 2 this algorithm yields the same set of tangent lines for all k ∈ {0} ∪ [m].
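Under Theorem 2, each step of the procedure above reduces to a one-dimensional search. A grid-search sketch of ours (the sample inputs, interval, and grid resolution are placeholders, not values from the paper):

```python
import math

def gap(u, u_bar):
    # k-independent approximation gap from Theorem 2
    return math.exp(u) - math.exp(u_bar) * (1.0 + u - u_bar)

def adaptive_greedy(samples, h, L=-5.0, U=5.0, step=0.01):
    # Sequentially pick tangent points minimizing the sum of squared gaps
    # over the empirical inputs u_i = w_bar^T x_i + b_bar ("samples")
    grid = [L + step * i for i in range(int(round((U - L) / step)) + 1)]
    chosen = []
    for _ in range(h):
        def sq_gap_sum(c):
            pts = chosen + [c]
            return sum(min(gap(u, p) for p in pts) ** 2 for u in samples)
        chosen.append(min(grid, key=sq_gap_sum))
    return chosen
```

Each added point can only shrink the pointwise minimum gap, so the objective is nonincreasing in the number of tangent lines.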

Simultaneous optimization method
Our second method, the simultaneous optimization method, selects a set of h tangent lines simultaneously rather than sequentially. Suppose the intersection between the ℓth and (ℓ + 1)th tangent lines is specified by c_k(u_ℓ, u_{ℓ+1}), meaning that g_k(u | u_ℓ) = g_k(u | u_{ℓ+1}) holds when u = c_k(u_ℓ, u_{ℓ+1}). It follows from Eq (11) that

  c_k(u_ℓ, u_{ℓ+1}) = ( exp(u_{ℓ+1}) (u_{ℓ+1} − 1) − exp(u_ℓ) (u_ℓ − 1) ) / ( exp(u_{ℓ+1}) − exp(u_ℓ) ).  (19)

We then simultaneously determine a set of h tangent points minimizing the total approximation gap (the area of the shaded portion in Fig 2). This procedure can be posed as the optimization problem

  minimize_{u_1, u_2, . . ., u_h}  ∑_{ℓ ∈ [h]} ∫_{c_k(u_{ℓ−1}, u_ℓ)}^{c_k(u_ℓ, u_{ℓ+1})} ( g_k(u | u_ℓ) − f_k(u) ) du,

where c_k(u_0, u_1) = L and c_k(u_h, u_{h+1}) = U are fixed, and c_k(u_ℓ, u_{ℓ+1}) is defined by Eq (19).
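The intersection formula and the total-gap objective can be checked numerically. The sketch below evaluates the total approximation gap of a given tangent set by a Riemann sum; the candidate tangent sets in the usage are our arbitrary examples:

```python
import math

def gap(u, u_bar):
    # k-independent approximation gap (Theorem 2)
    return math.exp(u) - math.exp(u_bar) * (1.0 + u - u_bar)

def intersection(u1, u2):
    # Crossing point of the tangent lines at u1 and u2 (cf. Eq (19))
    e1, e2 = math.exp(u1), math.exp(u2)
    return (e2 * (u2 - 1.0) - e1 * (u1 - 1.0)) / (e2 - e1)

def total_gap(points, L, U, n=2000):
    # Integral over [L, U] of min_l gap(u, u_l), by a midpoint Riemann sum
    width = (U - L) / n
    return sum(min(gap(L + width * (i + 0.5), p) for p in points)
               for i in range(n)) * width
```

Minimizing `total_gap` over the h tangent points corresponds to the simultaneous optimization objective; since the gap is k-independent, the resulting tangent set serves every f_k at once.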

Experimental results and discussion
This section describes computational experiments for evaluating the effectiveness of our method for sparse Poisson regression.

Methods for comparison
We investigate the performance of our MIQO formulation by Eqs (13)-(17), using the indicator constraint to impose the logical implication of Eq (15). We fix the L2-regularization parameter to α = 0 in Tables 1, 2 and 6, whereas we tune it through hold-out validation using the training instances in Tables 3, 4 and 7.
We compare the performance of our method with the following sparse estimation algorithms:
FwdStep: forward stepwise Poisson regression [15,16]
L1-Rgl: L1-regularized Poisson regression [46]
We implemented these algorithms using the step function and the glmnet package [46] in the R programming language. For L1-Rgl, we tune the L1-regularization parameter such that the number of nonzero regression coefficients equals θ, and then select the corresponding subset of explanatory variables. All computations were performed on a Windows computer with an Intel Core i3-8100 CPU (3.50 GHz) and 8 GB of memory.
We use the following evaluation metrics to compare the performance of the sparse estimation methods. Let λ̂_i be a predicted value based on Eq (3); out-of-sample prediction performance is evaluated in terms of RMSE, accuracy, and recall computed from the predicted and actual values of the count variable.

Experimental design for synthetic datasets

We sampled explanatory variables from a normal distribution as x_i ∼ N(0, Σ), where Σ ∈ R^{30×30} is the covariance matrix. The (i, j)th entry of Σ is ρ^{|i−j|}, where ρ represents the correlation strength between explanatory variables. We also sampled the error term from a normal distribution as ε_i ∼ N(0, σ²), where σ is the standard deviation. We then generated the count variable y_i from the Poisson distribution whose parameter λ_i is determined by the linear predictor with the error term ε_i added.

Tables 1 and 2 show the results of our MIQO formulation for the synthetic training instances with subset sizes θ = 5 and 10, respectively. The column labeled "LogLkl" shows the log-likelihood value of Eq (4), which was maximized using a selected subset of explanatory variables. The largest log-likelihood values for each problem instance (σ², ρ) are shown in bold. The columns labeled "Time (s)" show the computation times in seconds required for solving the MIQO problem (MIQO) and for selecting tangent lines (TngLine).
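The synthetic design described above can be reproduced with standard-library tools: AR(1)-correlated normals are generated by the recursion x_j = ρ x_{j−1} + sqrt(1 − ρ²) z_j, which yields Corr(x_i, x_j) = ρ^{|i−j|}. The true coefficients below are placeholders of ours, not the paper's:

```python
import math, random

def sample_x(p, rho, rng):
    # AR(1) process: stationary, with Corr(x_i, x_j) = rho^|i-j|
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        x.append(rho * x[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return x

def sample_poisson(lam, rng):
    # Knuth's multiplication algorithm for Poisson sampling
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def make_dataset(n, p, rho, sigma, w, b, rng):
    X, y = [], []
    for _ in range(n):
        x = sample_x(p, rho, rng)
        eps = rng.gauss(0.0, sigma)    # noise added to the linear predictor
        lam = math.exp(sum(wj * xj for wj, xj in zip(w, x)) + b + eps)
        X.append(x)
        y.append(sample_poisson(lam, rng))
    return X, y
```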

Results for synthetic datasets
Our adaptive greedy algorithm (AdpGrd) attained the largest log-likelihood values for most problem instances but required long computation times to select tangent lines. This result implies that effective sets of tangent lines differ across datasets, so the adaptive greedy algorithm, which computes a set of tangent lines suited to each dataset, can perform well. Our simultaneous optimization method (SmlOpt), on the other hand, selected tangent lines very quickly and also provided the second-best log-likelihood values for a majority of problem instances. These results clearly show that our AdpGrd and SmlOpt methods can find sparse regression models of better quality than do the conventional area-based and gap-based greedy methods (AreaGrd and GapGrd).

Tables 3 and 4 show the prediction performance of sparse Poisson regression models for synthetic test instances with subset sizes θ = 5 and 10, respectively. The best RMSE, accuracy, and recall values for each problem instance (σ², ρ) are shown in bold.
When σ² ∈ {0.01, 0.10}, our AdpGrd and SmlOpt methods delivered better prediction performance than did the FwdStep and L1-Rgl algorithms for all problem instances. In contrast, the L1-Rgl algorithm performed very well when (σ², ρ) = (1.00, 0.70) in Table 4. These results suggest that, especially in low-noise situations, our MIO-based sparse estimation methods can deliver superior prediction performance compared with heuristic algorithms such as stepwise selection and L1-regularized estimation. This observation is consistent with the simulation results reported by Hastie et al. [26].

Experimental design for real-world datasets

Table 5 lists real-world datasets downloaded from the UCI Machine Learning Repository [47], where n and p are the numbers of data instances and candidate explanatory variables, respectively. In a preprocessing step, we divided the total number of rental bikes by d, rounding down to the nearest integer, to obtain an appropriate scale for the count variable to be predicted. We transformed each categorical variable into a set of dummy variables. Note that the variables "dteday," "casual," and "registered" are not suitable for prediction purposes and thus were removed. Data instances having outliers or missing values were eliminated. Training instances were randomly sampled, with 500 training instances for the Bike-H dataset and 365 for the Bike-D dataset. We used the remaining instances as test instances. The tables show values averaged over 10 trials, with standard errors in parentheses.

Results for real-world datasets

Table 6 gives the results of our MIQO formulation for the real-world training instances with subset size θ ∈ {5, 10}. As with the synthetic training instances (Tables 1 and 2), our adaptive greedy algorithm AdpGrd achieved the largest log-likelihood values, but with long computation times. Our simultaneous optimization method SmlOpt was much faster than AdpGrd and provided good log-likelihood values for both the Bike-H and Bike-D datasets.
Table 7 shows the prediction performance of sparse Poisson regression models for the real-world test instances with subset size θ ∈ {5, 10}. Our AdpGrd and SmlOpt methods compared favorably with the other methods in out-of-sample prediction performance.

Conclusion
This paper presented an MIO approach to sparse Poisson regression, which we formulated as an MIQO problem by applying a piecewise-linear approximation to the nonlinear objective function. We also developed the adaptive greedy algorithm and the simultaneous optimization method to select a limited number of tangent lines that work well for piecewise-linear approximations. We conducted computational experiments using synthetic and real-world datasets. Our methods for selecting tangent lines clearly outperformed conventional methods in terms of the quality of piecewise-linear approximations. For the synthetic datasets, our MIQO formulation delivered better prediction performance than did stepwise selection and L1-regularized estimation, especially in low-noise situations. Our MIQO formulation also compared favorably in prediction performance with the other algorithms for the real-world datasets.

Although our method can potentially find good-quality sparse regression models, applying it to large datasets is computationally expensive. It is therefore more practical to choose between our method and heuristic algorithms according to the task at hand. We also expect our framework for piecewise-linear approximations to work well for various decision-making problems involving univariate nonlinear functions. A future direction of study is to develop an efficient algorithm specialized for solving our MIQO problem; we are now working on extending several MIO-based high-performance algorithms [24,48,49] to sparse Poisson regression. Another direction of future research is to improve the performance of our methods for selecting tangent lines. For example, although we selected tangent points of Eq (18) by evaluating each point u_s ∈ {−5.00, −4.99, −4.98, . . ., 4.99, 5.00} for s ∈ [h], tuning tangent points more finely will probably yield marginal improvements in prediction performance. In addition, to improve the prediction performance in high-noise situations, we could adopt the Lp-regularization term α‖w‖_p with finely tuned parameters α and p in our MIQO formulation [50].