Abstract
In many research fields, measurement data containing a large proportion of zeros are often called semicontinuous data. For such data, the most common approach is the two-part model, which establishes a regression model for the zero-valued part and another for the nonzero-valued part. Because each part of the two-part regression model often involves a large number of candidate variables, variable selection becomes an important problem in semicontinuous data analysis; however, there is little research literature on this topic. To bridge this gap, we propose a new type of variable selection method for the two-part regression model. In this paper, the Bernoulli-Normal two-part (BNT) regression model is presented, and a variable selection method based on the Lasso penalty function is proposed. To address the fact that the Lasso estimator does not possess the oracle property, we further propose a variable selection method based on the adaptive Lasso penalty function. Simulation results show that both methods can select variables for the BNT regression model and are easy to implement, and that the adaptive Lasso method outperforms the Lasso method. We demonstrate the effectiveness of the proposed tools using dietary intake data to further analyze the important factors affecting the dietary intake of patients.
Citation: Lu Y, Liu A, Jiang T (2025) The variable selection of two-part regression model for semicontinuous data. PLoS One 20(6): e0322937. https://doi.org/10.1371/journal.pone.0322937
Editor: Flavio A. Ziegelmann, Universidade Federal do Rio Grande do Sul, BRAZIL
Received: August 28, 2024; Accepted: March 31, 2025; Published: June 3, 2025
Copyright: © 2025 Lu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data cannot be shared publicly because of confidentiality of patients. Data are available from the Eunice Kennedy Shriver National Institutional Data Access/Ethics Committee (contact via nanselt@mail.nih.gov) for researchers who meet the criteria for access to confidential data.
Funding: This work is supported by the Zhejiang University of Science and Technology Special Fund for Basic Scientific Research [No. 2025QN084] awarded to Y.L., the Research Project of Zhejiang Federation of Humanities and Social Sciences [No. 2025N072] awarded to Y.L., and in part by the Hangzhou Philosophy and Social Science Planning Project [No. Z23JC042]. The funders all played an important role in the study design, data collection and analysis.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
In many research fields, measurement data containing too many zeros are often called semicontinuous data. Aitchison (1955) [1] pointed out that semicontinuous data can be regarded as generated by a mixed distribution consisting of a certain proportion of the zero-valued data (degenerate distribution) and the nonzero continuous data (continuous distribution). For instance, in the study of dietary intake of patients, some food components are consumed almost daily by patients, while others are consumed occasionally, resulting in many zeros in the intake data (Lu et al., 2020 [2]). In nonlife insurance, claim outcomes generally contain a probability mass at zero, indicating no claim occurrence (Frees et al., 2013 [3]; Yang, 2022 [4]). In addition, examples of semicontinuous outcomes include health care expenditures with zero representing no utilization (Smith et al., 2017 [5]; Huling et al., 2021 [6]), rainfall amounts with zero representing no rain (Hyndman and Grunwald, 2000 [7]), and alcohol consumption (Liu et al., 2008 [8]), among many others.
For semicontinuous data, too many zero values lead to a severely right-skewed distribution, making it difficult to fit the data with a traditional distribution model. In this case, the Tobit model, the sample selection model and the two-part model are available for semicontinuous data. Among these, the two-part model is the most commonly used; it regards the data as generated by two different random processes. The first process is usually regarded as the binary part of the data: it determines whether a zero value occurs, indicating whether a certain behavior has occurred, and can be assumed to follow a Bernoulli distribution. The second process is often referred to as the continuous part of the data: it generates the nonzero values and can be assumed to follow some continuous distribution, such as the normal distribution, gamma distribution, etc. Specifically, let $X$ follow a semicontinuous distribution with probability $\pi$ of being nonzero and density function $f(x)$ for the positive part; then the two-part model is constructed as

$$g(x) = (1-\pi)\, I(x = 0) + \pi\, f(x)\, I(x > 0), \qquad (1)$$

where $0 \le \pi \le 1$ and $I(\cdot)$ denotes the indicator function. In Eq 1, when $\pi = 0$ all data are zero, and when $\pi = 1$ all data are nonzero and can be assumed to follow a continuous distribution. In general, we assume that $\pi$ is strictly between 0 and 1 to ensure the data set contains a certain number of nonzero values.
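To make the mixture representation in Eq 1 concrete: a draw from a two-part distribution is obtained by first sampling the Bernoulli indicator and then, only if it is nonzero, sampling from the continuous component. A minimal illustrative sketch (the paper's own code is in R; this Python version assumes a lognormal positive part, matching the BNT model of Sect 2):

```python
import numpy as np

def simulate_semicontinuous(n, pi, mu, sigma, seed=0):
    """Draw n values from a two-part distribution:
    zero with probability 1 - pi, lognormal(mu, sigma) otherwise."""
    rng = np.random.default_rng(seed)
    nonzero = rng.random(n) < pi          # Bernoulli indicator Y = I(X > 0)
    x = np.zeros(n)
    x[nonzero] = rng.lognormal(mu, sigma, nonzero.sum())
    return x

x = simulate_semicontinuous(10_000, pi=0.7, mu=1.0, sigma=0.5)
print((x == 0).mean())   # close to 1 - pi = 0.3
```

The spike of exact zeros mixed with a continuous positive part is exactly the pattern seen in dietary intake or insurance claim data.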
To understand the relationship between a semicontinuous outcome and a set of predictors, two-part regression models are typically used, which model the zero part and the positive part separately. In general, the probability mass at zero can be characterized by a logistic regression, and the positive part can be modeled using distributions of positive random variables such as the lognormal, gamma, and generalized beta of the second kind (GB2) distributions. Since each part may involve a different, and potentially large, set of candidate variables, the problem of variable selection arises naturally in two-part regression models for semicontinuous data. To date, researchers have proposed a number of methods to select variables for zero-inflated count data, which are similar to semicontinuous data (Zeng et al., 2014 [9]; Wang et al., 2015 [10]; Cantoni and Auda, 2018 [11]; Lee et al., 2020 [12]), but there is little literature on variable selection for semicontinuous data. Han et al. (2018) [13] sought a feasible way of conducting variable selection for random-effects two-part models, but those models raise computational challenges in fitting due to numerical integration over the random effects. Feng and Boyle (2021) [14] proposed a sparse group lasso regularization method for selecting groups of variables in two-part models. Because they focused on whether a group of variables contributes to the whole model, the sparse group lasso was combined with the two-part model, with the groups of variables carefully constructed to capture the underlying relationships among the group coefficients. However, this method only allows a group of variables to be selected or excluded as a whole and cannot select the important variables within a group, so it is not ideal in some applications.
For example, when studying factors that influence the occurrence of a disease, a gene may be described by a group of variables, but clearly not every variable has a significant effect. By selecting the variables (genes) that directly affect classification accuracy and discarding those with no impact, both computational performance and classification accuracy can be improved substantially. Moreover, because the sub-models of a two-part model may contain different variables, or there may be no obvious group structure among the variables, the group-based method may not select variables well. In the two sub-models of the two-part model, we focus instead on whether each single variable has an effect, and we propose two penalty-based methods to select variables for semicontinuous data. The proposed methods are easy to implement and effectively select the important variables.
The rest of the article is arranged as follows. In Sect 2, we introduce the Bernoulli-Normal regression model for semicontinuous data, and a Gauss-Newton iteration method is given to estimate the parameters. In Sect 3, we propose variable selection methods that penalize the Bernoulli-Normal regression model with the Lasso and adaptive Lasso penalties respectively, and a coordinate descent algorithm is proposed to estimate the parameters. In Sect 4, we conduct simulation studies to evaluate the performance of the proposed methods. Real data from a dietary intervention trial are used to illustrate the methods in Sect 5, and some concluding remarks are given in Sect 6. All simulations and real data analyses are conducted in R, and all code is provided in the Appendix (see S1 Appendix).
2 Bernoulli-Normal regression model
Consider a set of independent and identically distributed samples $\{x_1, x_2, \ldots, x_n\}$ from a semicontinuous population $X$, where $n$ is the sample size. Denote $Y = I(X > 0)$ and $y_i = I(x_i > 0)$, $i = 1, \ldots, n$, where $I(\cdot)$ is the indicator function. Based on the basic idea of constructing a two-part model, we divide the model into two parts. For the first part, whether $X$ is zero is treated as coming from a Bernoulli distribution; that is, $Y$ is assumed to follow a Bernoulli distribution. For the second part, the nonzero values are assumed to follow a normal distribution. In practice, since the nonzero part typically exhibits some skewness, a logarithmic transformation is generally applied for $X > 0$. According to this construction, the Bernoulli-Normal two-part (BNT) model is established as

$$f(x) = (1-\pi)\, I(x = 0) + \pi\, \phi(\log x;\, \mu, \sigma^2)\, I(x > 0), \qquad (2)$$

where $\pi$ represents the mixing proportion, that is, the proportion of nonzero continuous data, and $\phi(\cdot\,;\mu,\sigma^2)$ is a normal density function with mean $\mu$ and variance $\sigma^2$. Depending on the skewness of the data at hand, other distributions can be considered in place of the normal distribution in Eq 2, such as the gamma distribution or the skew-normal distribution. In this paper, we mainly focus on the BNT model.
For the BNT model (see Eq 2), in order to understand the relationship between a semicontinuous outcome and a set of predictors, the BNT regression model is established as

$$x_i \sim (1-\pi_i)\, I(x_i = 0) + \pi_i\, \phi(\log x_i;\, \mu_i, \sigma^2)\, I(x_i > 0), \qquad (3)$$

$$\mathrm{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i} = \boldsymbol{z}_i^{\top}\boldsymbol{\alpha}, \qquad (4)$$

$$\mu_i = \boldsymbol{w}_i^{\top}\boldsymbol{\beta}, \qquad (5)$$

where $\boldsymbol{z}_i = (z_{i0}, z_{i1}, \ldots, z_{iq_1})^{\top}$ is a $q_1+1$ dimensional covariable vector for the mixing proportion $\pi_i$, and $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \ldots, \alpha_{q_1})^{\top}$ is the corresponding $q_1+1$ dimensional coefficient vector. Similarly, $\boldsymbol{w}_i = (w_{i0}, w_{i1}, \ldots, w_{iq_2})^{\top}$ is a $q_2+1$ dimensional covariable vector for the mean parameter $\mu_i$, and $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_{q_2})^{\top}$ is the corresponding $q_2+1$ dimensional coefficient vector. Setting $z_{i0} = w_{i0} = 1$ in Eqs 4 and 5, $\alpha_0$ and $\beta_0$ represent the intercept terms of the two sub-regression parts, respectively. In addition, the covariable vectors $\boldsymbol{z}_i$ and $\boldsymbol{w}_i$ may be the same or different in a given data set.
2.1 Likelihood function of the BNT regression model
Based on Eqs 4 and 5, we get the likelihood function of the BNT regression model

$$L(\boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} (1-\pi_i)^{1-y_i} \left[\pi_i\, \phi(\log x_i;\, \boldsymbol{w}_i^{\top}\boldsymbol{\beta}, \sigma^2)\right]^{y_i}, \qquad (6)$$

where $y_i = I(x_i > 0)$ and $I(\cdot)$ is the indicator function. Note that

$$\pi_i = \frac{\exp(\boldsymbol{z}_i^{\top}\boldsymbol{\alpha})}{1+\exp(\boldsymbol{z}_i^{\top}\boldsymbol{\alpha})}, \qquad (7)$$

it is easy to derive

$$1-\pi_i = \frac{1}{1+\exp(\boldsymbol{z}_i^{\top}\boldsymbol{\alpha})}, \qquad (8)$$

$$\log \pi_i - \log(1-\pi_i) = \boldsymbol{z}_i^{\top}\boldsymbol{\alpha}. \qquad (9)$$

By substituting Eqs 7–9 into Eq 6, the log-likelihood function of the BNT regression model is obtained as

$$\ell(\boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2) = \ell_1(\boldsymbol{\alpha}) + \ell_2(\boldsymbol{\beta}, \sigma^2), \qquad (10)$$

where

$$\ell_1(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\left[y_i\, \boldsymbol{z}_i^{\top}\boldsymbol{\alpha} - \log\left(1+\exp(\boldsymbol{z}_i^{\top}\boldsymbol{\alpha})\right)\right],$$

$$\ell_2(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^{n} y_i\left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(\log x_i - \boldsymbol{w}_i^{\top}\boldsymbol{\beta})^2}{2\sigma^2}\right].$$

In Eq 10, it is obvious that $\ell(\boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2)$ splits into two independent parts. The first part, $\ell_1(\boldsymbol{\alpha})$, is the binomial part, corresponding to the log-likelihood function of the logistic regression, and can be used to estimate the parameter $\boldsymbol{\alpha}$. The second part, $\ell_2(\boldsymbol{\beta}, \sigma^2)$, is the continuous part, corresponding to the log-likelihood function of a general linear regression, and can be used to estimate the parameters $\boldsymbol{\beta}$ and $\sigma^2$. In this case, maximizing the log-likelihood function is equivalent to maximizing $\ell_1$ and $\ell_2$ respectively, that is,

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}}\ \ell_1(\boldsymbol{\alpha}), \qquad (\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \arg\max_{\boldsymbol{\beta},\, \sigma^2}\ \ell_2(\boldsymbol{\beta}, \sigma^2).$$
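The additive split in Eq 10 means the BNT log-likelihood is just the logistic log-likelihood of the zero/nonzero indicators plus a normal log-likelihood of the logged positive values. An illustrative numpy sketch (a hypothetical helper, not the paper's R code):

```python
import numpy as np

def bnt_loglik(x, Z, W, alpha, beta, sigma2):
    """Log-likelihood of the BNT regression model (Eq 10):
    ell = ell1(alpha) + ell2(beta, sigma2)."""
    y = (x > 0).astype(float)                       # binary indicators
    eta = Z @ alpha                                 # logit of pi_i
    ell1 = np.sum(y * eta - np.log1p(np.exp(eta)))  # logistic part
    pos = x > 0
    resid = np.log(x[pos]) - W[pos] @ beta          # continuous-part residuals
    ell2 = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))
    return ell1 + ell2
```

Because the two terms share no parameters, each can be maximized on its own, which is what the next subsection exploits.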
2.2 Gauss-Newton iterative parameter estimation method
At present, there are many parameter estimation methods for the two-part regression model. In practice, the specific method is determined by the purpose of the investigation and the form selected for each part; among them, the maximum likelihood method is one of the most commonly used tools, and its basic algorithm is the Gauss-Newton iteration method. To this end, the Gauss-Newton iterative estimation procedure for the BNT regression model is given below. Since $\ell$ in Eq 10 splits into two independent parts, we use the Gauss-Newton iteration method to estimate the parameters in $\ell_1(\boldsymbol{\alpha})$ and $\ell_2(\boldsymbol{\beta}, \sigma^2)$ respectively.
Define the score function of the parameter $\boldsymbol{\alpha}$ as

$$U(\boldsymbol{\alpha}) = \frac{\partial \ell_1(\boldsymbol{\alpha})}{\partial \boldsymbol{\alpha}}. \qquad (11)$$

With $\partial \pi_i / \partial \boldsymbol{\alpha} = \pi_i(1-\pi_i)\,\boldsymbol{z}_i$, we get

$$U(\boldsymbol{\alpha}) = \sum_{i=1}^{n} (y_i - \pi_i)\,\boldsymbol{z}_i.$$

Define the observed information matrix of the parameter $\boldsymbol{\alpha}$ as

$$J(\boldsymbol{\alpha}) = -\frac{\partial^2 \ell_1(\boldsymbol{\alpha})}{\partial \boldsymbol{\alpha}\, \partial \boldsymbol{\alpha}^{\top}}. \qquad (12)$$

It follows that

$$J(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \pi_i(1-\pi_i)\,\boldsymbol{z}_i \boldsymbol{z}_i^{\top}.$$

Therefore, based on Eqs 11 and 12, the maximum likelihood estimate of the parameter $\boldsymbol{\alpha}$ can be obtained through the following iterative equation

$$\boldsymbol{\alpha}^{(t+1)} = \boldsymbol{\alpha}^{(t)} + J\!\left(\boldsymbol{\alpha}^{(t)}\right)^{-1} U\!\left(\boldsymbol{\alpha}^{(t)}\right), \qquad (13)$$

where $\boldsymbol{\alpha}^{(t)}$ represents the parameter iterate obtained at the $t$th step.
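The iteration in Eq 13 is ordinary Newton scoring for logistic regression. A compact illustrative sketch (synthetic data; not the paper's R implementation):

```python
import numpy as np

def fit_logistic_newton(Z, y, n_iter=25):
    """Newton iteration of Eq 13 for the binary part:
    alpha <- alpha + J(alpha)^{-1} U(alpha)."""
    alpha = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-Z @ alpha))           # Eq 7
        U = Z.T @ (y - pi)                          # score, Eq 11
        J = (Z * (pi * (1 - pi))[:, None]).T @ Z    # observed information, Eq 12
        alpha = alpha + np.linalg.solve(J, U)       # Eq 13
    return alpha

# synthetic check: intercept-only data with P(Y = 1) = 0.7
rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.7).astype(float)
Z = np.ones((5000, 1))
print(fit_logistic_newton(Z, y))  # close to logit(0.7) ≈ 0.847
```

For the intercept-only case the maximizer is logit of the sample proportion, which the iteration recovers in a handful of steps.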
Define the score function of the parameter $\boldsymbol{\beta}$ as

$$U(\boldsymbol{\beta}) = \frac{\partial \ell_2(\boldsymbol{\beta}, \sigma^2)}{\partial \boldsymbol{\beta}}. \qquad (14)$$

Since $\partial \mu_i / \partial \boldsymbol{\beta} = \boldsymbol{w}_i$, we get

$$U(\boldsymbol{\beta}) = \frac{1}{\sigma^2}\sum_{i=1}^{n} y_i\,(\log x_i - \boldsymbol{w}_i^{\top}\boldsymbol{\beta})\,\boldsymbol{w}_i.$$

Define the observed information matrix of the parameter $\boldsymbol{\beta}$ as

$$J(\boldsymbol{\beta}) = -\frac{\partial^2 \ell_2(\boldsymbol{\beta}, \sigma^2)}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^{\top}}. \qquad (15)$$

We have

$$J(\boldsymbol{\beta}) = \frac{1}{\sigma^2}\sum_{i=1}^{n} y_i\,\boldsymbol{w}_i \boldsymbol{w}_i^{\top},$$

where $\sigma^2$ is replaced by its maximum likelihood estimate

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} y_i\,(\log x_i - \boldsymbol{w}_i^{\top}\boldsymbol{\beta})^2}{\sum_{i=1}^{n} y_i}. \qquad (16)$$

Similarly, based on Eqs 14–16, the maximum likelihood estimate of the parameter $\boldsymbol{\beta}$ can be obtained through the following iterative equation

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + J\!\left(\boldsymbol{\beta}^{(t)}\right)^{-1} U\!\left(\boldsymbol{\beta}^{(t)}\right), \qquad (17)$$

where $\boldsymbol{\beta}^{(t)}$ represents the parameter iterate obtained at the $t$th step. In addition, it should be noted that the observed information matrix can also be replaced by the Fisher information matrix in Eqs 13 and 17.
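Because the continuous-part log-likelihood is quadratic in $\boldsymbol{\beta}$, the iteration for $\boldsymbol{\beta}$ converges in a single step: it reduces to ordinary least squares of the logged positive observations, with $\sigma^2$ estimated by the mean squared residual over those observations. A minimal illustrative sketch (not the paper's R code):

```python
import numpy as np

def fit_continuous_part(x, W):
    """ML estimates for the continuous part of the BNT model:
    least squares of log(x_i) on w_i over the positive observations,
    with sigma^2 estimated as the mean squared residual."""
    pos = x > 0
    Wp, v = W[pos], np.log(x[pos])
    beta = np.linalg.lstsq(Wp, v, rcond=None)[0]
    sigma2 = np.mean((v - Wp @ beta) ** 2)
    return beta, sigma2

# intercept-only example: log-values 1, 2, 3 -> beta = 2, sigma2 = 2/3
x = np.array([0.0, np.e, np.e**2, 0.0, np.e**3])
beta, s2 = fit_continuous_part(x, np.ones((5, 1)))
```

Note how the zero observations are simply dropped from the continuous fit; they contribute only to the binary part.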
3 Variable selection of BNT regression model
At present, penalty-function-based methods are widely used for variable selection problems. Considering the advantages of the Lasso and adaptive Lasso penalty functions, the following variable selection methods for the BNT regression model are proposed.
3.1 Lasso and adaptive Lasso penalized likelihood function
Fan and Li (2001) [15] adopted the penalized likelihood function method for variable selection and showed that a log-likelihood loss plus a penalty function is an effective approach to variable selection, known as the penalized likelihood method. For an appropriate model, the penalized likelihood function generally has the form

$$Q(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}) - n\sum_{j} p_{\lambda}(|\theta_j|), \qquad (18)$$

where $\ell(\boldsymbol{\theta})$ is the likelihood function, generally taken as a log-likelihood function, as in this paper; $p_{\lambda}(\cdot)$ is the penalty function for $\boldsymbol{\theta}$; and $\lambda$ is the tuning parameter representing the magnitude of the penalty.
In regression analysis, the estimator obtained by ordinary least squares usually has small bias but large variance, so its prediction accuracy can be unsatisfactory. To improve prediction accuracy, some regression coefficients can be shrunk or set to 0; the basic idea is to trade a small amount of bias for a large reduction in variance. To this end, Tibshirani (1996) [16] proposed the Lasso penalty function

$$p_{\lambda}(|\theta_j|) = \lambda\,|\theta_j|,$$

where $\lambda$ is the tuning parameter.
According to the unpenalized log-likelihood function (Eq 10), the Lasso penalized log-likelihood function of the BNT regression model is obtained as

$$Q(\boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2) = \ell_1(\boldsymbol{\alpha}) - n\lambda_1\sum_{j=1}^{q_1}|\alpha_j| + \ell_2(\boldsymbol{\beta}, \sigma^2) - n\lambda_2\sum_{j=1}^{q_2}|\beta_j|, \qquad (19)$$

where $\lambda_1$ and $\lambda_2$ are tuning parameters. Besides, Eq 19 does not penalize the intercept parameters $\alpha_0$, $\beta_0$ or the variance parameter $\sigma^2$.

In Eq 19, the penalized log-likelihood function of the BNT regression model again splits into two independent parts, $Q_1(\boldsymbol{\alpha})$ and $Q_2(\boldsymbol{\beta}, \sigma^2)$, where

$$Q_1(\boldsymbol{\alpha}) = \ell_1(\boldsymbol{\alpha}) - n\lambda_1\sum_{j=1}^{q_1}|\alpha_j|$$

is the Lasso penalized log-likelihood function of the binary-part logistic regression, and

$$Q_2(\boldsymbol{\beta}, \sigma^2) = \ell_2(\boldsymbol{\beta}, \sigma^2) - n\lambda_2\sum_{j=1}^{q_2}|\beta_j|$$

is the Lasso penalized log-likelihood function of the continuous-part normal regression. Since $Q_1$ and $Q_2$ respectively contain the parameters $\boldsymbol{\alpha}$ and $(\boldsymbol{\beta}, \sigma^2)$, the optimal estimator of Eq 19 can be obtained by

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}}\ Q_1(\boldsymbol{\alpha}), \qquad (\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \arg\max_{\boldsymbol{\beta},\, \sigma^2}\ Q_2(\boldsymbol{\beta}, \sigma^2).$$
Although the method based on the Lasso penalty function can efficiently select variables, its estimators are biased and do not satisfy the oracle properties. Therefore, Zou (2006) [17] proposed the adaptive Lasso penalty function

$$p_{\lambda}(|\theta_j|) = \lambda\, \hat{w}_j\, |\theta_j|,$$

where $\hat{w}_j$ is the penalty weight, $\tilde{\theta}_j$ is a consistent estimate of the parameter $\theta_j$ obtained without penalty, and generally $\hat{w}_j = 1/|\tilde{\theta}_j|^{\gamma}$ for some $\gamma > 0$. The adaptive Lasso penalty function applies different penalty weights to the coefficients of different covariables. In addition, Zou (2006) [17] proved that the adaptive Lasso estimators possess the oracle properties.
According to the unpenalized log-likelihood function (Eq 10), the adaptive Lasso penalized log-likelihood function of the BNT regression model is obtained as

$$Q^{a}(\boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2) = \ell_1(\boldsymbol{\alpha}) - n\lambda_1\sum_{j=1}^{q_1}\hat{w}_{1j}|\alpha_j| + \ell_2(\boldsymbol{\beta}, \sigma^2) - n\lambda_2\sum_{j=1}^{q_2}\hat{w}_{2j}|\beta_j|, \qquad (20)$$

where $\hat{w}_{1j}$, $\hat{w}_{2j}$ are the weight coefficients and $\lambda_1$, $\lambda_2$ are tuning parameters. Similarly, Eq 20 does not penalize the intercept parameters $\alpha_0$, $\beta_0$ or the variance parameter $\sigma^2$. In addition, we set the weight coefficients as $\hat{w}_{1j} = 1/|\hat{\alpha}_j|$ and $\hat{w}_{2j} = 1/|\hat{\beta}_j|$, where $\hat{\alpha}_j$ and $\hat{\beta}_j$ are the maximum likelihood estimates of the corresponding parameters. It should be noted that Eq 20 reduces to the Lasso penalized log-likelihood function of the BNT regression model (19) when all weight coefficients equal 1. Therefore, the Lasso method can be regarded as a special case of the adaptive Lasso method.
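The adaptive weights are computed once, from the unpenalized maximum likelihood fit, before the penalized problem is solved. A sketch of the weight construction ($\gamma = 1$ as in the paper; the `eps` guard against division by zero is an implementation detail added here, not part of the paper's formulation):

```python
import numpy as np

def adaptive_weights(theta_mle, eps=1e-8, gamma=1.0):
    """Adaptive Lasso weights w_j = 1 / |theta_j|^gamma from an
    unpenalized (e.g. maximum likelihood) estimate; eps guards
    against division by zero for coefficients estimated near 0."""
    return 1.0 / (np.abs(theta_mle) + eps) ** gamma

w = adaptive_weights(np.array([2.0, -0.5, 0.1]))
# large coefficients get small penalties, small coefficients get large penalties
```

This data-driven reweighting is what restores the oracle properties that the plain Lasso lacks.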
Similar to the previous discussion, the adaptive Lasso penalized log-likelihood function of the BNT regression model (20) splits into two independent parts, $Q^{a}_1(\boldsymbol{\alpha})$ and $Q^{a}_2(\boldsymbol{\beta}, \sigma^2)$, where $Q^{a}_1(\boldsymbol{\alpha})$ is the adaptive Lasso penalized log-likelihood function of the binary-part logistic regression, and $Q^{a}_2(\boldsymbol{\beta}, \sigma^2)$ is the adaptive Lasso penalized log-likelihood function of the continuous-part normal regression. Since $Q^{a}_1$ and $Q^{a}_2$ respectively contain the parameters $\boldsymbol{\alpha}$ and $(\boldsymbol{\beta}, \sigma^2)$, the optimal estimator of Eq 20 can be obtained by

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}}\ Q^{a}_1(\boldsymbol{\alpha}), \qquad (\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \arg\max_{\boldsymbol{\beta},\, \sigma^2}\ Q^{a}_2(\boldsymbol{\beta}, \sigma^2).$$

As mentioned above, the Lasso penalized log-likelihood function can be regarded as a special case of the adaptive Lasso penalized log-likelihood function. Therefore, below we only present the parameter estimation procedure for the adaptive Lasso penalized likelihood function.
3.2 The coordinate descent method for parameter estimation
At present, many scholars have proposed efficient estimation algorithms for the Lasso and adaptive Lasso methods, such as the least angle regression algorithm (Efron et al., 2004 [18]; Friedman et al., 2007 [19]) and the coordinate descent algorithm (Wu and Lange, 2008 [20]; Friedman et al., 2010 [21]). In this paper, we use the coordinate descent method to optimize the adaptive Lasso penalized log-likelihood function (20). Since $Q^{a}_2(\boldsymbol{\beta}, \sigma^2)$ corresponds to the simplest linear regression while $Q^{a}_1(\boldsymbol{\alpha})$ corresponds to a generalized linear regression, the optimal solution process of $Q^{a}_2$ is given first, followed by that of $Q^{a}_1$.

Based on the log-likelihood function of the BNT model (10), the normal log-likelihood function of the continuous part without the penalty term is

$$\ell_2(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^{n} y_i\left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(\log x_i - \boldsymbol{w}_i^{\top}\boldsymbol{\beta})^2}{2\sigma^2}\right]. \qquad (21)$$

After removing the terms in $\ell_2$ that do not involve $\boldsymbol{\beta}$, the maximum likelihood estimate of the parameter $\boldsymbol{\beta}$ is equivalent to the least squares estimate of $\log x_i$ on $\boldsymbol{w}_i$ over the positive observations. Therefore, with the addition of the adaptive Lasso penalty function, maximizing $Q^{a}_2$ is equivalent to

$$\min_{\boldsymbol{\beta}}\ \frac{1}{2n}\sum_{i=1}^{n} y_i\left(\log x_i - \boldsymbol{w}_i^{\top}\boldsymbol{\beta}\right)^2 + \lambda_2\sum_{j=1}^{q_2}\hat{w}_{2j}|\beta_j|. \qquad (22)$$

In addition, since the coordinate descent method updates one coordinate at a time, we write the linear predictor coordinate-wise as $\boldsymbol{w}_i^{\top}\boldsymbol{\beta} = \sum_{h=0}^{q_2} w_{ih}\beta_h$.

Next, the coordinate descent method is used to solve the optimum of Eq 22. In the $t$th iteration, suppose the $k$th coordinate is being updated and the estimates of the other coordinates have been obtained, where $\beta_h^{(t)}$ represents the iterate of the $h$th coordinate in the $t$th iteration.

First, taking the partial derivative of Eq 22 with respect to the parameter $\beta_k$, we get

$$\frac{\partial}{\partial \beta_k} = -\frac{1}{n}\sum_{i=1}^{n} y_i\, w_{ik}\left(\log x_i - \sum_{h\neq k} w_{ih}\beta_h - w_{ik}\beta_k\right) + \lambda_2\hat{w}_{2k}\,\mathrm{sgn}(\beta_k), \qquad (23)$$

where $\mathrm{sgn}(\cdot)$ is the sign function.

Then, setting Eq 23 equal to 0 and after some algebra, we get the updated value of the $k$th coordinate in the $t$th iteration

$$\beta_k^{(t)} = \frac{S\!\left(\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} y_i\, w_{ik}\, r_{ik},\ \lambda_2\hat{w}_{2k}\right)}{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} y_i\, w_{ik}^2}, \qquad r_{ik} = \log x_i - \sum_{h\neq k} w_{ih}\beta_h, \qquad (24)$$

where $S(a, b)$ is the soft-thresholding operator defined by Donoho and Johnstone (1994) [22], with concrete form

$$S(a, b) = \mathrm{sgn}(a)\,(|a| - b)_{+} =
\begin{cases}
a - b, & a > b,\\
0, & |a| \le b,\\
a + b, & a < -b.
\end{cases}$$

Within the $t$th iteration, the remaining coordinates are updated successively according to the update formula (24). We then inspect the change in each coordinate between the $t$th iterate $\boldsymbol{\beta}^{(t)}$ and the $(t-1)$th iterate $\boldsymbol{\beta}^{(t-1)}$. If the changes in all coordinates are small enough, or some convergence criterion is met, then $\boldsymbol{\beta}^{(t)}$ is taken as the final parameter estimator; otherwise the next iteration proceeds according to the update formula (24).
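The update loop above amounts to plain coordinate descent with soft-thresholding, applied to the rows with $y_i = 1$. An illustrative numpy sketch (the normalization follows a $1/(2n)$ least-squares form, an assumption of this sketch; setting all weights to 1 gives the plain Lasso):

```python
import numpy as np

def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def cd_lasso(V, v, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)||v - V beta||^2 + lam * sum w_j |beta_j|.
    For the continuous part, V and v hold w_i and log(x_i) for the rows
    with x_i > 0; `weights` are the adaptive Lasso weights."""
    n, p = V.shape
    beta = np.zeros(p)
    r = v - V @ beta                       # current residual
    for _ in range(n_iter):
        for k in range(p):
            r_k = r + V[:, k] * beta[k]    # partial residual excluding k
            z = V[:, k] @ r_k / n
            beta_k = soft_threshold(z, lam * weights[k]) / (V[:, k] @ V[:, k] / n)
            r = r_k - V[:, k] * beta_k
            beta[k] = beta_k
    return beta
```

With a single intercept column and a response identically 2, the penalty $\lambda = 0.5$ shrinks the solution from 2 to 1.5, which is the one-dimensional soft-thresholding effect made visible.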
In the following, we use the coordinate descent method to optimize $Q^{a}_1(\boldsymbol{\alpha})$. Based on the log-likelihood function of the BNT model (10), the logistic log-likelihood function of the binary part without the penalty term is

$$\ell_1(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\left[y_i\,\boldsymbol{z}_i^{\top}\boldsymbol{\alpha} - \log\left(1+\exp(\boldsymbol{z}_i^{\top}\boldsymbol{\alpha})\right)\right].$$

Adding the adaptive Lasso penalty term, the adaptive Lasso penalized logistic log-likelihood function of the binary part is obtained as

$$Q^{a}_1(\boldsymbol{\alpha}) = \ell_1(\boldsymbol{\alpha}) - n\lambda_1\sum_{j=1}^{q_1}\hat{w}_{1j}|\alpha_j|. \qquad (25)$$

In Eq 25, the first term is not a quadratic function, so the coordinate descent method cannot be applied to it directly. For the unpenalized logistic log-likelihood function $\ell_1(\boldsymbol{\alpha})$, the Gauss-Newton procedure was given in Sect 2, and the iterative formula (17) for the optimal parameter is also called the iteratively reweighted least squares (IRLS) method. Assuming that the current estimate of the parameter $\boldsymbol{\alpha}$ is $\tilde{\boldsymbol{\alpha}}$, the second-order approximation of $\ell_1(\boldsymbol{\alpha})$, namely its second-order Taylor expansion at $\tilde{\boldsymbol{\alpha}}$, is obtained as

$$\ell_Q(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{n} v_i\left(z_i^{*} - \boldsymbol{z}_i^{\top}\boldsymbol{\alpha}\right)^2 + C, \qquad (26)$$

where

$$z_i^{*} = \boldsymbol{z}_i^{\top}\tilde{\boldsymbol{\alpha}} + \frac{y_i - \tilde{\pi}_i}{\tilde{\pi}_i(1-\tilde{\pi}_i)}, \qquad v_i = \tilde{\pi}_i(1-\tilde{\pi}_i), \qquad \tilde{\pi}_i = \frac{\exp(\boldsymbol{z}_i^{\top}\tilde{\boldsymbol{\alpha}})}{1+\exp(\boldsymbol{z}_i^{\top}\tilde{\boldsymbol{\alpha}})},$$

and $C$ is a constant term. Here $z_i^{*}$ is called the working response, which changes with each iteration and is a temporary response during the iteration; $v_i$ is called the weight, which also changes with each iteration.

Since $\ell_Q(\boldsymbol{\alpha})$ is a quadratic function that approximates $\ell_1(\boldsymbol{\alpha})$ to second order, optimizing the adaptive Lasso penalized logistic log-likelihood function (25) is equivalent to minimizing the adaptive Lasso penalized weighted least squares problem

$$\min_{\boldsymbol{\alpha}}\ \frac{1}{2n}\sum_{i=1}^{n} v_i\left(z_i^{*} - \boldsymbol{z}_i^{\top}\boldsymbol{\alpha}\right)^2 + \lambda_1\sum_{j=1}^{q_1}\hat{w}_{1j}|\alpha_j|. \qquad (27)$$

The optimal solution of Eq 27 can be obtained directly by the coordinate descent method, in the same way as for $Q^{a}_2$. In the $t$th iteration, suppose the $k$th coordinate is being updated and the estimates of the other coordinates have been obtained, where $\alpha_h^{(t)}$ represents the iterate of the $h$th coordinate in the $t$th iteration.

First, taking the partial derivative of Eq 27 with respect to the parameter $\alpha_k$, we get

$$\frac{\partial}{\partial \alpha_k} = -\frac{1}{n}\sum_{i=1}^{n} v_i\, z_{ik}\left(z_i^{*} - \sum_{h\neq k} z_{ih}\alpha_h - z_{ik}\alpha_k\right) + \lambda_1\hat{w}_{1k}\,\mathrm{sgn}(\alpha_k), \qquad (28)$$

where $\mathrm{sgn}(\cdot)$ is the sign function.

Then, setting Eq 28 equal to 0 and after some algebra, we get the updated value of the $k$th coordinate in the $t$th iteration

$$\alpha_k^{(t)} = \frac{S\!\left(\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} v_i\, z_{ik}\left(z_i^{*} - \sum_{h\neq k} z_{ih}\alpha_h\right),\ \lambda_1\hat{w}_{1k}\right)}{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} v_i\, z_{ik}^2}, \qquad (29)$$

where $S(a, b)$ is the soft-thresholding operator.

Within the $t$th iteration, the remaining coordinates are updated successively according to the update formula (29). We then inspect the change in each coordinate between the $t$th iterate $\boldsymbol{\alpha}^{(t)}$ and the $(t-1)$th iterate $\boldsymbol{\alpha}^{(t-1)}$. If the changes in all coordinates are small enough, or some convergence criterion is met, then $\boldsymbol{\alpha}^{(t)}$ is taken as the final parameter estimator; otherwise the next iteration proceeds according to the update formula (29).

Combining the second-order Taylor approximation and the coordinate descent method, the optimal solution process of Eq 25 is equivalent to a nested loop. ① Outer loop: for the current parameter estimate $\tilde{\boldsymbol{\alpha}}$, the second-order Taylor approximation $\ell_Q(\boldsymbol{\alpha})$ is obtained from Eq 26. ② Inner loop: the coordinate descent method is used to estimate the parameters of the adaptive Lasso penalized weighted least squares problem (27), with updates carried out according to Eq 29.
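The nested loop can be sketched by recomputing the working response and weights of Eq 26 in an outer loop and running coordinate descent on Eq 27 in an inner loop. A simplified illustrative version (unpenalized coordinates, such as the intercept, can be handled by giving them weight 0):

```python
import numpy as np

def soft_threshold(a, b):
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def penalized_logistic(Z, y, lam, weights, outer=20, inner=50):
    """Adaptive-Lasso penalized logistic regression via the nested loop:
    outer IRLS quadratic approximation (Eq 26), inner coordinate descent (Eq 29)."""
    n, p = Z.shape
    alpha = np.zeros(p)
    for _ in range(outer):
        eta = Z @ alpha
        pi = 1 / (1 + np.exp(-eta))
        v = np.clip(pi * (1 - pi), 1e-5, None)   # IRLS weights (floor is a guard)
        zstar = eta + (y - pi) / v               # working response
        for _ in range(inner):                   # coordinate descent on Eq 27
            for k in range(p):
                r_k = zstar - Z @ alpha + Z[:, k] * alpha[k]
                num = np.sum(v * Z[:, k] * r_k) / n
                denom = np.sum(v * Z[:, k] ** 2) / n
                alpha[k] = soft_threshold(num, lam * weights[k]) / denom
    return alpha
```

With `lam=0` this reduces to plain IRLS, and a large `lam` with unit weight drives a penalized coefficient exactly to zero, illustrating how the penalty performs selection in the binary part.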
3.3 Selection of tuning parameter
The Lasso or adaptive Lasso penalized BNT regression model involves another problem, namely, the selection of the tuning parameter $\lambda$. The three commonly used methods are K-fold cross-validation, generalized cross-validation (GCV) and the BIC information criterion. K-fold cross-validation effectively avoids overfitting when evaluating model prediction performance, its results are convincing, and its calculation process is relatively simple. Therefore, we mainly adopt K-fold cross-validation to choose the tuning parameter $\lambda$, and set K = 10 in the simulation studies. The basic idea of this method is to divide the data into two parts: ① a training set used to train the model; ② a validation set used to verify the model. This method is very popular in statistical data analysis, and the specific steps are as follows.

(1) Divide the whole sample equally into K parts, denoted $T_1, T_2, \ldots, T_K$;

(2) Keep $T_1$ as the validation set and the remaining $K-1$ parts as the training set. First, train the model on the training set; then fit the trained model on the validation set, denoting the fitted value of the response variable by $\hat{x}_i$. The estimation error of the model can be expressed as

$$\mathrm{CV}_1(\lambda) = \frac{1}{|T_1|}\sum_{i \in T_1}\left(x_i - \hat{x}_i\right)^2,$$

where $|T_1|$ is the sample size of $T_1$;

(3) Repeat Step (2), each time retaining one set $T_k$ as the validation set and the rest as the training set, so that a set of errors $\mathrm{CV}_1(\lambda), \ldots, \mathrm{CV}_K(\lambda)$ is obtained;

(4) Calculate the total mean error of cross-validation:

$$\mathrm{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^{K}\mathrm{CV}_k(\lambda);$$

(5) The estimate of the tuning parameter is obtained as

$$\hat{\lambda} = \arg\min_{\lambda}\ \mathrm{CV}(\lambda).$$
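Steps (1)–(5) translate directly into a grid search over candidate $\lambda$ values. A minimal sketch with hypothetical `fit`/`predict` interfaces (the paper's implementation is in R; squared-error loss as in step (2)):

```python
import numpy as np

def kfold_cv_lambda(X, y, lambdas, fit, predict, K=10, seed=0):
    """Steps (1)-(5): pick the lambda minimizing mean K-fold validation error.
    `fit(X, y, lam)` and `predict(model, X)` are user-supplied callables."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K          # step (1): K roughly equal parts
    cv = []
    for lam in lambdas:
        errs = []
        for k in range(K):                       # steps (2)-(3)
            test, train = folds == k, folds != k
            model = fit(X[train], y[train], lam)
            errs.append(np.mean((y[test] - predict(model, X[test])) ** 2))
        cv.append(np.mean(errs))                 # step (4)
    return lambdas[int(np.argmin(cv))]           # step (5)
```

In the BNT setting the same routine is run twice, once per sub-model, since $\lambda_1$ and $\lambda_2$ are tuned separately.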
4 Simulations
This section compares the variable selection performance of the two proposed methods through simulation studies.
4.1 Simulation data
Firstly, the covariables are generated from a multivariate normal distribution $N(\boldsymbol{0}, \Sigma)$, where the covariance matrix $\Sigma$ has elements $\Sigma_{ij} = \rho^{|i-j|}$ ($i, j = 1, \ldots, q$). In the settings below, several values of the correlation coefficient $\rho$ and of the sample size $n$ are considered. Secondly, the model coefficients are set in the simulation studies. The coefficients of some covariables are set to nonzero values, meaning that the corresponding covariables are used to generate the response variables; the coefficients of the remaining variables are set to zero, indicating that these variables have no effect on the response variables. Finally, according to the BNT regression model, the true model is assumed as

$$x_i \sim (1-\pi_i)\, I(x_i = 0) + \pi_i\, \phi(\log x_i;\, \mu_i, \sigma^2)\, I(x_i > 0), \qquad \mathrm{logit}(\pi_i) = \boldsymbol{z}_i^{\top}\boldsymbol{\alpha}, \qquad \mu_i = \boldsymbol{w}_i^{\top}\boldsymbol{\beta}, \qquad (30)$$

with the values of $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$ and $\sigma^2$ set in the following simulation scenarios.
In this section, the following three scenarios are simulated, and each scenario is repeated 500 times (More simulations see S1 Appendix).
(1) When q = 10, the regression coefficients in model (30) are set as:
(2) When q = 15, the regression coefficients in model (30) are set as:
(3) When q = 25, the regression coefficients in model (30) are set as:
4.2 Simulation results
The proposed variable selection methods for the BNT regression model are based on the Lasso and adaptive Lasso penalty functions, respectively. The performance of the two methods is evaluated by comparing the following five statistics, computed separately for the logistic regression of the binary part, the normal regression of the continuous part, and the two parts as a whole.

(1) Mean squared prediction error (MPSE):

$$\mathrm{MPSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{x}_i - x_i\right)^2,$$

where $\hat{x}_i$ is the predicted value of the response variable $x_i$.

(2) Mean squared error of the parameters (MSE):

$$\mathrm{MSE} = \frac{1}{q}\sum_{j=1}^{q}\left(\hat{\theta}_j - \theta_j\right)^2,$$

where $\hat{\theta}_j$ is the estimate of the covariable coefficient $\theta_j$.

(3) Define Sensitivity and Specificity as:

$$\mathrm{Sensitivity} = \frac{\#\{j: \hat{\theta}_j \neq 0 \text{ and } \theta_j \neq 0\}}{\#\{j: \theta_j \neq 0\}}, \qquad \mathrm{Specificity} = \frac{\#\{j: \hat{\theta}_j = 0 \text{ and } \theta_j = 0\}}{\#\{j: \theta_j = 0\}},$$

where $\#\{j: A\}$ is the number of indices $j$ that satisfy the condition $A$.

(4) Combining Sensitivity and Specificity, define Accuracy as:

$$\mathrm{Accuracy} = \frac{\#\{j: \hat{\theta}_j \neq 0 \text{ and } \theta_j \neq 0\} + \#\{j: \hat{\theta}_j = 0 \text{ and } \theta_j = 0\}}{q},$$

where $q$ is the number of covariables.
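Definitions (3)–(4) can be computed directly from the estimated and true coefficient vectors. An illustrative numpy sketch (coefficient vectors exclude the unpenalized intercept):

```python
import numpy as np

def selection_metrics(theta_hat, theta_true):
    """Sensitivity, Specificity, Accuracy for variable selection,
    following definitions (3)-(4)."""
    sel, imp = theta_hat != 0, theta_true != 0
    sensitivity = np.sum(sel & imp) / np.sum(imp)
    specificity = np.sum(~sel & ~imp) / np.sum(~imp)
    accuracy = (np.sum(sel & imp) + np.sum(~sel & ~imp)) / len(theta_true)
    return sensitivity, specificity, accuracy

m = selection_metrics(np.array([1.2, 0.0, 0.4, 0.0]),
                      np.array([1.0, 0.0, 0.5, 0.3]))
# truly important: 3 variables (indices 0, 2, 3); 2 selected -> sensitivity 2/3
```

In the simulations these quantities are averaged over the 500 replications of each scenario.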
For the five evaluation statistics, the MPSE value measures the prediction error of the model: the smaller the value, the better the model fits. The MSE value measures the difference between the estimated and true parameters: the smaller the value, the closer the estimated coefficients are to the true values. The values of Sensitivity, Specificity and Accuracy lie in the interval [0,1] and evaluate the effectiveness of variable selection. Sensitivity is the proportion of the truly important variables that are correctly selected. Specificity is the proportion of the truly unimportant variables that are correctly excluded. Accuracy is the proportion of variables, among all variables, that are either correctly selected or correctly eliminated.
For the three scenario settings, we consider several combinations of the sample size $n$ and the correlation coefficient $\rho$. The simulation results are shown in Tables 1, 2, 3, 4, 5, and 6. According to the results, it can be concluded that:
- In terms of MPSE and MSE, the adaptive Lasso method is slightly better than the Lasso method in the binary part, the continuous part, and the two parts as a whole.
- In terms of Sensitivity, both methods perform well, and the Sensitivity of both improves as n increases. Overall, the Lasso method shows slightly better Sensitivity, but when the correlation coefficient ρ is larger, the adaptive Lasso method performs better in the binary part.
- In terms of Specificity, both methods attain better values in the binary part than in the continuous part. As the correlation coefficient ρ increases from 0 to 0.6, the Specificity of the Lasso method improves, while that of the adaptive Lasso method shows no significant change. Overall, the adaptive Lasso method yields better Specificity across the different settings.
- In terms of overall Accuracy, variable selection based on the adaptive Lasso method is better than that based on the Lasso method, and the Accuracy of both methods increases with n.
5 Case application
In medicine, investigators often encounter semicontinuous data. For example, in studies of dietary intake, the dietary intake data often contain a large number of zeros, because some food components are only occasionally consumed by patients. In this section, the proposed methods are applied to the CHEF data to analyze the factors influencing the dietary intake of patients.
CHEF (Cultivating Healthy Environments in Families with Type 1 Diabetes) is an 18-month randomized trial evaluating the efficacy of a family-based behavioral intervention that integrated motivational interviewing, active learning, and applied problem-solving to increase the intake of whole plant foods among youth with type 1 diabetes. A total of 136 children with type 1 diabetes participated in the CHEF study, and the CHEF data were obtained by collecting dietary data over six periods, including the baseline period (before the intervention) and the intervention period, based on the patients' diet records. A total of 12 food components are recorded in the CHEF data, of which 8 are continuous variables because they are consumed by patients every day. The other 4 foods, Total Fruit (TF), Whole Fruit (WF), Dark Green/Orange Vegetables and Legumes (DOL) and Whole Grain (WG), have excessive zeros owing to patients' occasional intake of these food components. Therefore, the CHEF data contain four semicontinuous variables and constitute a typical semicontinuous data set. In addition, 28 candidate variables that may affect patient intake were collected from the CHEF trial. In pursuit of precision intervention, investigators are interested in the factors that influence the dietary intake of children with type 1 diabetes before behavioral interventions. Details of the study design, randomization procedures and treatment conditions can be found in Nansel et al. (2015) [23].
The proposed variable selection methods are mainly aimed at the BNT model. To avoid erroneous conclusions caused by poor model fit, we first carry out the Shapiro-Wilk normality test for the continuous part of the four semicontinuous variables. The specific test results are shown in Table 7, and it appears that only the continuous part of the DOL response variable follows a normal distribution. Therefore, we build a BNT regression model for this variable and use the proposed methods, based on the Lasso and adaptive Lasso penalty functions, to select the key factors. The estimated coefficients and selected factors for the two methods are shown in Tables 8 and 9, respectively. From the results, we can conclude as follows. For the binary part and the continuous part, different subsets of variables are selected by the different methods, and some variables are selected by both methods. Although the estimated coefficient values from the two methods differ, the signs of the estimated coefficients agree, for example for the binary-part variables C_QOLEMO (Child Generic quality of life, Emotional subscale) and C_QOLSCH (Child Generic quality of life, School subscale). Both methods can perform variable selection while avoiding overfitting.
In addition, we use the -2Loglik, AIC and BIC criteria to compare the variable selection performance of the two methods, and to further verify the effectiveness of the proposed method on real data, the K-fold cross-validation method is adopted. The specific results are shown in Table 10. The adaptive Lasso method has smaller -2Loglik, AIC, BIC and ME values for the whole model (Two Parts), and also for the binary part and the continuous part separately. Therefore, the results show that the proposed method based on the adaptive Lasso penalty function fits better. The ME results show that the adaptive Lasso method has a small error on real data; that is, the variables selected by the adaptive Lasso method form a more efficient subset of important variables.
Based on the modeling results, we discuss the factors influencing DOL intake from three aspects.
- Factors in the binary part, that is, explanatory variables that affect whether patients consume DOL foods, such as PQOLG (Parent Generic quality of life, total score), CQOLEMO, and CQOLSCH. Factors with a positive coefficient, such as PQOLG, increase the likelihood of DOL intake; factors with a negative coefficient, such as CQOLEMO and CQOLSCH, reduce it.
- Factors in the continuous part, that is, explanatory variables that affect the amount of DOL ingested by patients, such as WPF, PQOLG, CQOLPSY (Child Generic quality of life, Psychosocial subscale), and CHEB (Child Healthy Eating Barriers). Factors with a positive coefficient, such as WPF and PQOLG, increase the patient's DOL intake; factors with a negative coefficient, such as CQOLPSY and CHEB, reduce it.
- Factors with a coefficient of 0, indicating no effect on DOL food intake, such as PHEB (Parent Healthy Eating Barriers) and PHES (Parent Self-Efficacy).

Therefore, the adaptive Lasso penalized BNT regression model can be used to further analyze the important factors affecting the dietary intake of patients.
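This three-way reading of a fitted penalized model reduces to inspecting which coefficients are exactly zero and the signs of the nonzero ones. A minimal sketch, where the coefficient values are purely hypothetical and are not the estimates reported in Tables 8 and 9:

```python
import numpy as np

# Hypothetical coefficient vector from a fitted penalized model.
names = ["WPF", "PQOLG", "CQOLPSY", "CHEB", "PHEB", "PHES"]
coefs = np.array([0.42, 0.18, -0.25, -0.10, 0.0, 0.0])

# Variables with nonzero coefficients are selected; the sign gives the
# direction of the effect, and exact zeros are dropped from the model.
for name, b in zip(names, coefs):
    if b == 0:
        print(f"{name}: dropped (no effect)")
    elif b > 0:
        print(f"{name}: selected, positive effect")
    else:
        print(f"{name}: selected, negative effect")
```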
6 Concluding remarks
In this paper, we discuss the variable selection problem for the BNT regression model. Within the penalized likelihood framework, we propose variable selection methods based on the Lasso and adaptive Lasso penalty functions, respectively. Simulation results show that both methods can select variables for the BNT regression model, and that the adaptive Lasso method outperforms the Lasso method. In addition, we illustrated the proposed methods on real dietary data to analyze factors influencing the dietary intake of patients, and the results demonstrate the advantages of the proposed method.
With the advances of information technology, semicontinuous data arise in more and more fields, and the proposed methods can be applied in other research areas. For example, in insurance, cumulative loss amount data often contain a large number of zeros. In this case, the semicontinuous two-part regression model can be used to analyze the factors influencing cumulative losses, and the proposed methods can select the more important risk factors. Although this paper achieves promising results, some research problems remain to be addressed. For instance, as more effective penalty functions are proposed, variable selection for semicontinuous two-part regression models based on those penalty functions deserves further investigation. Depending on the distribution of the continuous part of the semicontinuous data, different two-part models can be constructed, such as the Bernoulli-Gamma model and the Bernoulli-Skewed-Normal model; variable selection for these regression models is a subject of future work. In addition, our research mainly addresses the case where the number of candidate variables is smaller than the sample size; variable selection for the semicontinuous two-part regression model when the number of variables exceeds the sample size is also part of our current research.
Supporting information
S1 Appendix. The Appendix provides additional simulation studies that further demonstrate the performance of the proposed methods, and shares the code in a form that follows best practices and promotes repeatability and reuse.
https://doi.org/10.1371/journal.pone.0322937.s001
(DOCX)
Acknowledgments
The authors are grateful to the referees, the associate editor, and the editor for their valuable comments and suggestions. The authors thank Dr. Tonja Nansel for helpful discussions on the CHEF study.