Efficient sparse estimation on interval-censored data with approximated L0 norm: Application to child mortality

A novel penalty for the proportional hazards model under interval-censored failure time data is discussed; variable selection is rarely studied under this data structure. The penalty arises from the idea of approximating an information criterion, e.g., the BIC or AIC, and the core step is to smooth the ℓ0 norm. Compared with usual regularization methods, the proposed approach is free of time-consuming hyperparameter tuning. Efficiency is further improved by fitting the model and selecting variables in one step. To achieve this, a sieve likelihood is introduced that simultaneously estimates the coefficients and the baseline cumulative hazard function. Furthermore, it is shown that the three desired properties for penalties, i.e., continuity, sparsity, and unbiasedness, are all guaranteed. Numerical results show that the proposed sparse estimation method is accurate and efficient. Finally, the method is applied to data on Nigerian children, and the key factors that affect child mortality are identified.


Introduction
Interval-censored failure time data, in which the failure time of interest is only known to lie within an interval rather than being observed directly, arise commonly in many fields, such as demography, medicine, and ecology. The statistical analysis of this special data structure has attracted much attention since it was first addressed by Finkelstein (1986 [1]), and many researchers have developed significant work on model estimation (Huang 1996 [2]; Zhang et al. 2005 [3]; Zeng, Cai, and Shen 2006 [4]; Lin and Wang 2010 [5]). Sun (2006 [6]) gave a thorough review of research on interval-censored failure time data. Compared with right-censored data, interval-censored data are more challenging to model in two ways. First, the data themselves can be more complicated; for example, they are sometimes a mixture of interval censoring and right censoring, since right censoring can be viewed as a special form of interval censoring with the right bound extending to infinity. Second, when fitting the proportional hazards model (Cox 1972 [7]) to right-censored data, one can use the partial likelihood and need not estimate the baseline hazard function simultaneously with the parameters of interest; no such device is available under interval censoring. Several methods exist that deal with interval-censored failure time data. Tong, Chen, and Sun (2008 [8]) and Goggins and Finkelstein (2000 [9]) developed approaches for interval censoring, but under strict independence assumptions. These approaches are restricted to particular models; for example, the former applies only to the additive hazards model (Lin and Ying 1994 [10]). In this paper, a sieve maximum likelihood method with Bernstein polynomials is proposed as a general approach applicable to many semi-parametric survival models. More details are presented below.
Although basic theories on interval-censored data are well established, studies on variable selection under this data structure are very limited. To the best of our knowledge, penalized partial likelihood has been used effectively, since it is intuitive to add a penalization term to the likelihood function. Tibshirani (1997 [11]) and Fan and Li (2002 [12]) successfully applied the LASSO and SCAD penalties to the proportional hazards model shortly after proposing them. Zou (2006 [13]) developed the adaptive LASSO (ALASSO), and Zhang and Lu (2007 [14]) used it with the partial likelihood. However, penalized partial likelihood is limited to the analysis of right-censored data. For variable selection on interval-censored failure time data, piecewise constant functions are occasionally used to represent the baseline hazard (Wu and Cook 2015 [15]), combined with various penalties and optimized via the EM algorithm. Wang et al. (2019 [16]) introduced a Bayesian adaptive lasso penalty for the additive hazards model with Case I interval-censored data, also known as current status data, in which each subject is visited only once and one only knows whether it has failed by that observation time. Zhao (2020 [17]) developed a broken adaptive ridge (BAR) penalized procedure in which, through iteration, some parameters eventually shrink to zero. The simulation studies show satisfying results, but the procedure is still computationally costly due to the heavy optimization involved and the two parameters that must be tuned.
The object of interest in the present paper is another type of covariate selection technique, i.e., best subset selection (BSS). A typical BSS method lists all variable subsets, fits a model with each one, and uses some information criterion, such as AIC (Akaike 1974 [18]) or BIC (Schwarz 1978 [19]), to judge every subset. For a dataset with n0 uncensored samples and p-dimensional parameters, the criterion has the form

−2 l(β) + φ0 k,

where k (k ≤ p) represents the number of selected parameters and l(·) represents the log-likelihood function. φ0 is fixed at 2 or log(n0) when AIC or BIC is applied, respectively, which makes the criterion free of tuning; tuning can be a heavily time-consuming process when a common penalty such as LASSO, SCAD, or MCP (Zhang 2010 [20]) is used. The most significant problem with this method is that the criterion involves an ℓ0 norm, whose discrete nature makes the optimization NP-hard. Although stepwise regressions can help with the optimization, BSS remains infeasible for moderately large p. Su et al. successfully developed an approximated form of the information criterion as a penalty term [the minimum information criterion (MIC)] for generalized linear models (GLM; 2018 [21]) and the Cox model with right-censored data (2016 [22]), which extends the BSS method to large variable dimensions.
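As a concrete illustration of why exhaustive BSS with a fixed information criterion needs no tuning yet scales badly, consider the following sketch (the function names are ours, not the paper's):

```python
import math

def n_subsets(p):
    """Number of candidate models an exhaustive best-subset search must score."""
    return 2 ** p

def ic_score(loglik, k, n0, criterion="BIC"):
    """Information criterion -2*l + phi0*k, with phi0 = 2 (AIC) or log(n0) (BIC)."""
    phi0 = 2.0 if criterion == "AIC" else math.log(n0)
    return -2.0 * loglik + phi0 * k

# With p = 10 an exhaustive search is feasible, but the model count doubles
# with every added covariate, which is why BSS breaks down for moderate p:
print(n_subsets(10))  # 1024
print(n_subsets(30))  # 1073741824
```

The criterion itself is a one-line formula; the combinatorial search over its ℓ0 term is where the NP-hardness lives.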
In this study, a form of approximated information criterion is proposed under the interval-censored data structure. The result provides several major contributions to the current literature. First, an approximated BSS method is introduced into the analysis of interval-censored data, for which variable selection approaches are rarely studied. Second, great efficiency is achieved by estimating both the coefficients and the baseline hazard function simultaneously with tuning-free covariate selection.
The rest of this paper is organized as follows. In the first section, the notation and detailed estimation procedure for interval-censored data are given. In the next section, the approximation of BSS is presented. In the third section, simulation results for the proposed method are shown alongside several other commonly used penalties. The application section contains a survey example, and the last section concludes this research and offers a short discussion.

Interval censoring
Consider a failure time study that involves n independent subjects, and for the i-th subject let Z_i (1 ≤ i ≤ n) be a p-dimensional covariate vector that may affect its failure time T_i. Under the proportional hazards model, the cumulative hazard function is

Λ(t | Z_i) = Λ0(t) exp(β′Z_i),

where Λ0(t) denotes the baseline cumulative hazard function and β = (β1, β2, . . ., βp)′ the regression coefficients. The corresponding survival function is

S(t | Z_i) = exp{−Λ0(t) exp(β′Z_i)}.

Under the interval-censored data structure, observations are recorded as intervals (L_i, R_i], denoting the interval in which the failure of the i-th subject falls. The likelihood function is then

L_n(β, Λ0) = ∏_{i=1}^{n} {S(L_i | Z_i) − S(R_i | Z_i)}.

In practice, if the failure of a sample is not observed during the entire experiment, that sample is considered right-censored. For a right-censored subject, R_i = ∞ and thus S(R_i | Z_i) = 0.
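As a minimal numerical sketch of the likelihood above (assuming, for illustration, a known baseline cumulative hazard; the function names are ours):

```python
import numpy as np

def ph_survival(t, z, beta, Lam0):
    # S(t|Z) = exp{-Lam0(t) * exp(beta'Z)} under the proportional hazards model
    return np.exp(-Lam0(t) * np.exp(np.dot(z, beta)))

def interval_loglik(L, R, Z, beta, Lam0):
    # l_n = sum_i log[ S(L_i|Z_i) - S(R_i|Z_i) ]; a right-censored subject has
    # R_i = inf, so S(R_i|Z_i) = 0 and its term reduces to log S(L_i|Z_i).
    return sum(
        np.log(ph_survival(Li, zi, beta, Lam0) - ph_survival(Ri, zi, beta, Lam0))
        for Li, Ri, zi in zip(L, R, Z)
    )

# Toy check: one right-censored subject with Lam0(t) = t and covariate 0,
# so its contribution is log S(1|z) = -Lam0(1) = -1.
ll = interval_loglik([1.0], [np.inf], [[0.0]], np.array([0.5]), lambda t: t)
```

Representing right censoring as R_i = ∞ lets the same expression cover both interval- and right-censored subjects without a special case.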
To estimate ξ = (β, Λ0), the traditional approach is to maximize the log-likelihood function l_n(β, Λ0), usually by finding the zeros of its derivatives. The main difficulty is to estimate finite- and infinite-dimensional parameters at the same time. This problem is discussed in the next section.

Regularized sieve maximum likelihood estimation
To deal with the infinite-dimensional estimation problem mentioned above, a sieve method (Huang and Rossini 1997 [23]; Zhou, Hu, and Sun [24]) is adopted for our study. Consider the parameter space

Θ = {ξ = (β, Λ0) : ‖β‖ ≤ M, Λ0 ∈ F},

with M a positive constant and F representing the set of all bounded, continuous, non-negative, and non-decreasing functions. One common way to model Λ0(·) is using splines (Wood, Pya, and Säfken 2017 [25]; Wang et al. 2019 [26]), but here, given the restricted shape of the baseline cumulative hazard function, the Bernstein basis polynomials (Wang and Ghosh 2012 [27]) are preferred.
Hence, define a sieve parameter space on the domain [u, v] of the observed data,

Θ_n = {ξ = (β, Λ0n) : ‖β‖ ≤ M_n, Λ0n(t) = Σ_{j=0}^{m} ω̃_j b_j(t, u, v, m)}.

Here, M_n is a positive constant and the Bernstein basis polynomial b_j is defined as

b_j(x, u, v, m) = C(m, j) ((x − u)/(v − u))^j (1 − (x − u)/(v − u))^{m−j}, j = 0, . . ., m,

where m determines the number of terms in the Bernstein basis polynomials and ω̃ = (ω̃0, ω̃1, . . ., ω̃m) denotes the coefficient vector of the terms. Note that for Λ0n to be non-decreasing and non-negative, ω̃ must satisfy the inequality 0 ≤ ω̃0 ≤ ω̃1 ≤ . . . ≤ ω̃m. This constraint can be ensured by reparameterization: introduce a new vector ω = (ω0, ω1, . . ., ωm) and let ω̃_l = Σ_{k=0}^{l} exp(ω_k). By focusing on the sieve space Θ_n, the complex estimation of both infinite- and finite-dimensional parameters is converted into a much simpler problem involving only the finite-dimensional parameters (β, ω). Thus, given the covariate matrix Z = (Z1, Z2, . . ., Zn), the likelihood function has the same form as before with Λ0 replaced by Λ0n. To estimate parameters and select covariates simultaneously, minimizing the penalized negative sieve log-likelihood is considered:

min_{β, ω} {−2 l_n(β, ω) + pen(β)}.    (3)

It is intuitive to substitute various developed penalties for pen(β); for LASSO, pen(β) = φ Σ_{j=1}^{p} |β_j|. However, the aforementioned penalties all require a time-consuming tuning process for the hyperparameter φ. The approximated information criterion is introduced as a penalty term in the following section, which frees us from tuning φ and greatly reduces computing time.
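The Bernstein sieve and its monotonicity-preserving reparameterization can be sketched as follows (an illustrative implementation under our reading of the construction; names are ours):

```python
import numpy as np
from math import comb

def bernstein_basis(x, u, v, m):
    # b_j(x, u, v, m) = C(m, j) * s^j * (1 - s)^(m - j), with s = (x - u)/(v - u)
    s = (x - u) / (v - u)
    return np.array([comb(m, j) * s**j * (1.0 - s)**(m - j) for j in range(m + 1)])

def cum_hazard(x, omega, u, v):
    # Lam0n(x) = sum_j wtilde_j * b_j(x), with wtilde_l = sum_{k<=l} exp(omega_k),
    # which enforces 0 <= wtilde_0 <= ... <= wtilde_m and hence yields a
    # non-negative, non-decreasing Lam0n on [u, v] for any unconstrained omega.
    wtilde = np.cumsum(np.exp(omega))
    return float(wtilde @ bernstein_basis(x, u, v, len(omega) - 1))

# Monotonicity check on [0, 3] with m = 3 (the paper's simulation choice):
omega = np.array([-1.0, 0.2, -0.5, 0.3])
vals = [cum_hazard(x, omega, 0.0, 3.0) for x in np.linspace(0.0, 3.0, 50)]
```

The cumulative-sum-of-exponentials trick is what turns a shape-constrained estimation problem into an unconstrained one over (β, ω).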

Approximation of information criterion
A novel sparse estimation method for interval-censored data is sought from the idea of approximation. The BSS method with certain information criteria does not need the parameter-tuning process, but it is infeasible for a large p. In this part, a smooth approximation of information criteria is developed that can be further used in the way of regularization methods.
The core task is to approximate the ℓ0 norm in the information criterion. The ℓ0 norm can be written in terms of indicator functions,

‖β‖0 = Σ_{j=1}^{p} I(β_j ≠ 0).

The essential job is therefore to approximate the indicator functions, for which one introduces a function η(x) that satisfies (1) η(0) = 0 and η(x) → 1 as |x| → ∞; and (2) η is a smooth function and non-decreasing on ℝ+. Clearly, such an η(·) captures the key features of the indicator I(x ≠ 0).
One natural thought is to adopt sigmoid functions, which are commonly used as smooth output units in binary classification. The classic choice is the logistic sigmoid s(x) = 1/(1 + e^{−x}), and by rescaling the independent variable and shifting the intercept one obtains

η(x) = 2 s(θ|x|^γ) − 1 = 2/(1 + e^{−θ|x|^γ}) − 1,

and it can be seen that, with a larger θ, η(·) more closely resembles the target indicator function. Nevertheless, comparing Fig 1a and 1b, we avoid directly setting γ = 1 because it produces a cusp at x = 0, although it appears to give better performance on sparsity. This concern restricts us to γ = 2. To balance smoothness and sparsity together in one penalty, our approach is a reparameterization procedure.
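Under our reading, the modified sigmoid is η(x) = 2/(1 + e^{−θ|x|^γ}) − 1 (the exact constants are our reconstruction of the garbled formula); a short sketch shows it capturing the indicator's behavior:

```python
import math

def eta(x, theta=50.0, gamma=2):
    # eta(x) = 2/(1 + exp(-theta * |x|^gamma)) - 1: a rescaled logistic sigmoid
    # with eta(0) = 0 and eta(x) -> 1 as |x| grows. gamma = 2 keeps it smooth
    # at the origin, whereas gamma = 1 would leave a cusp there.
    return 2.0 / (1.0 + math.exp(-theta * abs(x) ** gamma)) - 1.0

# Larger theta pushes eta closer to the indicator I(x != 0):
print(eta(0.0))                                   # exactly 0 at the origin
print(eta(0.3, theta=50) < eta(0.3, theta=200))   # sharper with larger theta
```

The symmetry η(x) = η(−x) and the limit toward 1 are exactly the two features of I(x ≠ 0) that the text requires.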

Reparameterization
By introducing η(·), the smoothness problem of the ℓ0 norm is preliminarily solved by setting γ = 2. Fan and Li (2001 [28]) proposed three properties that a good penalty should possess: unbiasedness, sparsity, and continuity. Unbiasedness and continuity are apparently ensured by the definition. Sparsity, however, needs to be enforced, since we have chosen γ = 2. For this purpose, the following reparameterization procedure is considered. Set a vector ϕ = (ϕ1, ϕ2, . . ., ϕp)′ ∈ ℝ^p and relate ϕ to β through

H = diag{η(ϕ1), η(ϕ2), . . ., η(ϕp)},

so that the reparameterization can be written as β = Hϕ. In this way, (3) is rewritten as

min_{ϕ, ω} {−2 l_n(Hϕ, ω) + φ tr(H)},    (4)

where tr(H) denotes the trace of H. By reparameterizing β with ϕ, two goals are achieved simultaneously: one is to keep the smoothness of the regularization problem, and the other is to obtain good performance on sparsity. These two aspects are explained in the next section with figures.
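The reparameterized criterion can be sketched numerically as follows (a hypothetical stand-in log-likelihood is used; the names and the exact η form are our assumptions, not the authors' code):

```python
import math
import numpy as np

def eta_vec(phi, theta=50.0, gamma=2):
    # Elementwise sigmoid approximation of I(x != 0): 2/(1 + e^{-theta|x|^gamma}) - 1
    return 2.0 / (1.0 + np.exp(-theta * np.abs(phi) ** gamma)) - 1.0

def penalized_objective(phi, omega, loglik, phi0, theta=50.0):
    # Criterion (4): -2 * l_n(H*phi, omega) + phi0 * tr(H), where
    # H = diag(eta(phi_1), ..., eta(phi_p)) and beta = H @ phi.
    h = eta_vec(phi, theta)
    beta = h * phi                 # beta_i = phi_i * eta(phi_i)
    return -2.0 * loglik(beta, omega) + phi0 * h.sum()  # tr(H) = sum_i eta(phi_i)

# Toy check with a stand-in log-likelihood l(beta) = -sum(beta^2):
toy_loglik = lambda beta, omega: -np.sum(beta ** 2)
val = penalized_objective(np.array([1.0, 0.0]), None, toy_loglik, math.log(100))
# eta(1) ~ 1 and eta(0) = 0, so val ~ 2 + log(100): the zero coordinate of phi
# contributes nothing to tr(H), which is how the penalty counts selected terms.
```

Because tr(H) ≈ ‖β‖0, minimizing (4) mimics BIC-scored best subset selection while remaining a smooth optimization problem.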

1. Smoothness.
In (4), the second term on the right-hand side, φ tr(H), is smooth by the definition of η(·) when γ = 2, so what remains is to check the smoothness of the first term, which is essentially determined by Hϕ. The map β = Hϕ is composed of the components β_i = ϕ_i η(ϕ_i), i = 1, . . ., p, and the product of two smooth functions is itself smooth. The relationship between β = Hϕ and ϕ is illustrated in Fig 2, in which the desired one-to-one mapping can be seen.
2. Sparsity.
In the preceding section, the reason for choosing γ = 2 was explained, although γ = 1 is favorable for sparsity, as shown in Fig 1a and 1b. Here, after reparameterization, the relationship between H, which determines the penalty, and β, the true coefficients, is explored. η(ϕ) is plotted against β in Fig 3, which demonstrates that the scale of the penalty closely resembles the γ = 1 case, in which the regularization penalizes the likelihood function even when a considerably small coefficient is allocated to an irrelevant covariate.
Returning to the three properties that Fan and Li recommended, i.e., continuity, sparsity, and unbiasedness: after reparameterizing β with ϕ, unbiasedness is ensured by the definition of η(·), continuity follows naturally from the maintained smoothness, and sparsity is attained by the reparameterization, as explained in this section. Accordingly, the penalty developed herein fulfills all three requirements.
Note that after reparameterization, the penalty term becomes non-convex in β. This implies that the penalized log-likelihood function (4) can have multiple local optima, and the initial value of the optimization process will affect the result. Thus, we use simulated annealing (Belisle 1992 [29]) and the BFGS quasi-Newton algorithm (see, e.g., Nocedal 1980 [30]) to obtain the final result. Simulated annealing is robust and seeks the global optimum, while BFGS is very fast and ensures convergence to a critical point. Both methods are built into Matlab via the functions simulannealbnd and fminunc. To further scrutinize our method, numerical results are presented in the next section.
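In Python, the same two-stage strategy, a global stochastic search followed by quasi-Newton polishing, might be mimicked with SciPy's dual_annealing and BFGS, analogues of Matlab's simulannealbnd and fminunc (a toy non-convex objective stands in for (4)):

```python
import numpy as np
from scipy.optimize import dual_annealing, minimize

# A simple non-convex stand-in for the penalized log-likelihood: two local
# minima near x = -1 and x = +1, so the starting point matters.
def objective(x):
    return (x[0] ** 2 - 1.0) ** 2 + 0.1 * x[0]

# Stage 1: simulated annealing searches globally within box bounds.
coarse = dual_annealing(objective, bounds=[(-5.0, 5.0)])

# Stage 2: BFGS polishes the annealing solution to a nearby critical point.
fine = minimize(objective, coarse.x, method="BFGS")
```

The annealing stage guards against being trapped by a poor initial value; the BFGS stage supplies the fast local convergence the annealing stage lacks.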

Simulation study
An extensive simulation study is conducted in this section to assess the performance of the proposed sparse estimation method with the approximated information criterion (appIC) on finite interval-censored samples. A p-dimensional covariate vector Z_i = (Z_{i1}, Z_{i2}, . . ., Z_{ip})′ is considered, generated from a multivariate normal distribution with mean zero and covariance matrix S. Three scenarios are set: n = 100, p = 10; n = 300, p = 10; and n = 300, p = 30. For the true coefficient vector, the first and last two components are set at b0 (b0 = 0.5 or 1), with all other components zero. The baseline cumulative hazard function takes the forms Λ0(t) = t and Λ0(t) = log(t + 1), respectively.
When constructing the interval-censored data structure, M visiting points are set in the interval [u, v] with a uniform gap (v − u)/M. To mimic the real situation in which some subjects are difficult to reach, each point is given a 50% chance of actually observing a given sample. In this simulation, u is fixed at 0 and v at 3; thus, when b0 = 0.5, there are approximately 25% and 35% right-censored portions for Λ0(t) = t and Λ0(t) = log(t + 1), respectively, and when b0 = 1, the portions are approximately 30% and 40%. Meanwhile, the number of observation points M is set at 10 and 20, the latter supposedly bringing more information.
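The visiting scheme described above might be sketched like this (our illustrative code, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_interval(t, u=0.0, v=3.0, M=10, p_visit=0.5):
    # M scheduled visits with uniform gap (v - u)/M; each visit is attended
    # with probability p_visit, mimicking subjects that are hard to reach.
    visits = u + (v - u) / M * np.arange(1, M + 1)
    attended = visits[rng.random(M) < p_visit]
    earlier = [s for s in attended if s < t]   # attended visits before failure
    later = [s for s in attended if s >= t]    # attended visits after failure
    L = max(earlier) if earlier else u
    R = min(later) if later else np.inf        # no later visit -> right censored
    return L, R

# Each simulated observation brackets the true failure time: L < t <= R.
pairs = [make_interval(1.7) for _ in range(5)]
```

Dropping each visit with probability 1/2 is what widens the observed intervals and produces the right-censored fraction reported in the text.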
The hyperparameters of the proposed method are assigned as follows: (1) the Bernstein polynomial parameter is set at m = 3, which we find sufficient to characterize the baseline cumulative hazard function; (2) γ = 2 for smoothness, as described in the previous sections; (3) φ is fixed at log(n0) (n0 denotes the number of samples that are not right-censored), as in BIC; (4) θ is assigned the value n0, and the robustness of the estimate for different values of θ is shown in S1 File. The results are presented in Tables 1-5. In Tables 1-4, performance is measured in three ways: mean, bias, and standard deviation (SD), which indicate the accuracy and stability of the estimates. In Table 5, performance under all scenarios is assessed with four measurements: the average selected model size (Size), the average true positive size (TP), the average false positive size (FP), and the median and standard deviation of (β̂ − β)′E(ZZ′)(β̂ − β). It can be seen that the bias and standard deviation both decrease greatly when the sample size increases from 100 to 300. When p jumps to 30, the appIC method remains feasible and generates good results. The estimation accuracy and stability generally improve with a more frequent visiting pattern, indicated by a larger M. Besides, the selection correctness of the appIC method increases significantly with stronger signals (b0 = 1) compared with b0 = 0.5. Meanwhile, the baseline cumulative hazard function is well modeled by the Bernstein polynomials, and some of the results are displayed in Fig 4; when n increases, the polynomials fit the hazard function better. To comprehensively assess the performance of appIC, its estimation results were compared with those of other commonly used variable selection approaches, LASSO, SCAD, MCP, and BAR, under the most difficult conditions above, that is, M = 10 and b0 = 0.5.
The hyperparameters of these penalties are tuned by 5-fold cross-validation, choosing the value with the largest log-likelihood on the validation sets. Estimates with the true covariates (oracle) and the full model (without covariate selection) are presented alongside in Tables 6 and 7; the measurements are as described in the previous paragraph. The appIC sparse estimation clearly performs well in both selection correctness and estimation accuracy. The variable size it chooses is especially close to the true size, 4, with relatively low FP; that is, the proposed method is very unlikely to include an irrelevant variable. Meanwhile, the TP of the appIC method rises to a satisfactory level when n increases from 100 to 300, performing close to SCAD and MCP. Besides, the squared error of the proposed method is low for both n = 100 and n = 300. Furthermore, it is worth mentioning that our method is far faster than the common penalties owing to being free of hyperparameter tuning. The CPU time of the different methods under various conditions can be found in S1 File.
Table 4. Simulation results of appIC sparse estimation with b0 = 1 and baseline cumulative hazard function Λ0(t) = log(t + 1).
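The selection measurements used above can be computed as follows (an illustrative helper; the tolerance and names are ours):

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, tol=1e-8):
    # Size = number of selected (nonzero) coefficients; TP = selected and truly
    # nonzero; FP = selected but truly zero.
    sel = np.abs(beta_hat) > tol
    truth = np.abs(beta_true) > tol
    return int(sel.sum()), int((sel & truth).sum()), int((sel & ~truth).sum())

# A hypothetical fit against the simulation's true coefficient pattern
# (first and last two components at b0 = 0.5, the rest zero):
beta_true = np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0, 0.5, 0.5])
beta_hat  = np.array([0.48, 0.55, 0, 0.02, 0, 0, 0, 0, 0.46, 0])
print(selection_metrics(beta_hat, beta_true))  # (4, 3, 1)
```

A perfect selector would report Size = 4, TP = 4, FP = 0 for this design; the gap between Size and TP is exactly the FP count.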

Application
In this section, the focus is on Nigerian child mortality data from the Demographic and Health Surveys (DHS) Program (https://www.dhsprogram.com). The dataset records information about women from various aspects, including their children. The survey was very detailed; nevertheless, due to practical restrictions, the survival times of the children are recorded only in months or years, which makes this an interval-censored data structure. Meanwhile, the sample child mortality rate is over 20%, significantly higher than the global average, and our goal is to identify the key factors that affect children's survival status. A total of 24 potential factors are listed in Table 8. After excluding samples with a null value in any of the variables, 8,671 valid child samples remain, of which only 1,841 are not right censored. Variables 1-14 in Table 8 are dummy variables, assigned 1 when the corresponding statements are true. Variables 15-24 are standardized so that the significance of all factors can be compared. Most variables concern the mother; the four variables marked with asterisks in Table 8 are child specific.
To apply the appIC regression procedure here, m = 3, γ = 2, θ = n0 = 1,841, and φ = log(n0) (BIC) are set. Meanwhile, the observation interval for the children is set as [0, 144]; that is, if a child lives to 144 months, or 12 years old, he or she is recorded as right censored. The results are shown in Table 8. According to the present research, eight variables have effects on child mortality. Having a telephone, a longer preceding birth interval for the mother, the use of modern contraceptive methods, having electricity, and more household members reduce the child mortality hazard, in that order of effectiveness. Meanwhile, the hazard increases when the mother has had more children and when the child is a twin. If the family cannot decide whether to seek medical aid when the child is seriously ill, the mortality hazard also increases. The baseline cumulative hazard function and baseline survival function are presented in Fig 5.

Conclusion
In this paper, an approximated information criterion for the proportional hazards model under the interval-censored failure time data structure is discussed. Common penalties usually need a time-consuming hyperparameter tuning process, which is unnecessary if one uses BSS with a well-known information criterion, such as BIC or AIC. A modified logistic sigmoid function is used herein to emulate the ℓ0 norm and accordingly convert the BIC into a penalized likelihood that can be implemented in the manner of regularization methods. This method builds a bridge between BSS and regularization, with a special and novel strength in efficiency, since it models the baseline hazard function, estimates the coefficients of the covariates, and selects variables simultaneously, without tuning hyperparameters for the penalty term. The numerical results, including an application to child mortality, show that this method has great potential to facilitate mainstream sparse estimation for interval-censored data, for which the subject of variable selection is rarely studied. There exist some interesting directions for planned future work. First, only the situation in which the censoring time is independent of the failure time is considered in this paper, which sometimes may not conform with practice. Many studies discuss informative censoring, and the proposed method can be explored under that circumstance. The second direction is to change the survival model. Herein, only the proportional hazards model is applied, but several other strong semi-parametric survival models exist, e.g., the additive hazards model, and one could compare estimation accuracy and efficiency across them. Third, datasets with very large covariate dimensions are seen in practice, a typical example being genetic data. Study of this problem is certainly meaningful, and clearly more research is needed in this direction.

PLOS ONE