Conceived and designed the experiments: SM AL JC WP RAK. Performed the experiments: SM. Analyzed the data: AL. Contributed reagents/materials/analysis tools: SM AL. Wrote the paper: SM JC WP RAK.
The authors have declared that no competing interests exist.
The MESA (Multi-Ethnic Study of Atherosclerosis) is an ongoing study of the prevalence, risk factors, and progression of subclinical cardiovascular disease in a multi-ethnic cohort. It provides a valuable opportunity to examine the development and progression of CAC (coronary artery calcium), which is an important risk factor for the development of coronary heart disease. In MESA, about half of the CAC scores are zero and the rest are continuously distributed. Such data have been referred to as “zero-inflated data” and may be described using two-part models. Existing two-part model studies have limitations in that they usually consider parametric models only, assume known forms of the covariate effects, and focus only on the estimation property of the models. In this article, we investigate statistical modeling of CAC in MESA. Building on existing studies, we focus on two-part models. We investigate both parametric and semiparametric, and both proportional and nonproportional models. For various models, we study their estimation as well as prediction properties. We show that, to fully describe the relationship between covariates and CAC development, the semiparametric model with nonproportional covariate effects is needed. In contrast, for the purpose of prediction, the parametric model with proportional covariate effects is sufficient. This study provides a statistical basis for describing the behaviors of CAC and insights into its biological mechanisms.
The MESA (Multi-Ethnic Study of Atherosclerosis) is an ongoing study of the prevalence, risk factors, and progression of subclinical cardiovascular disease in a multi-ethnic cohort (
The CAC has a “point mass at zero+continuous” distribution and is a special case of zero-inflated data. Simple regression models are not capable of describing such data. It is not our intention to comprehensively review analytic methodologies for zero-inflated data. Instead, we focus on the statistical models for CAC. To describe nonzero CAC values, existing methods include generalized estimating equations
In MESA, after extensive comparisons and evaluations, Kronmal
Building on existing studies
The MESA is a study of the characteristics of subclinical cardiovascular disease (disease detected non-invasively before it has produced clinical signs and symptoms) and the risk factors that predict progression to clinically overt cardiovascular disease or progression of the subclinical disease
CAC was measured with electron-beam computed tomography (EBT) at three field centers or multidetector computed tomography (MDCT) at the other three field centers. Each participant was scanned twice consecutively, and the results from the two scans were averaged to provide a more accurate estimate. The amount of calcium was quantified with the Agatston scoring method
The MESA study has been approved by the Human Subjects Research Review Committee at University of Washington and all six sites. Detailed information is available at the MESA website
The distribution of CAC is highly skewed. We make the logarithm transformation and study
Consider two-part models, where in the first part, we model the occurrence of a nonzero CAC value. More specifically, consider
We determine the link function
Parametric, proportional covariate effects:
Parametric, nonproportional covariate effects:
Semiparametric, proportional covariate effects:
Semiparametric, nonproportional covariate effects:
Models (i) and (ii) are parametric, whereas models (iii) and (iv) are semiparametric. There is a rich literature on the advantages and disadvantages of parametric and semiparametric models
Most existing two-part models share a similar spirit with models (ii) and (iv) in that there is no constraint on the covariate effects
With a statistical model, we are interested in its two closely related but distinct properties. The first is the estimation property, where the goal is to fully describe the relationship between covariates and response variable. The second is the prediction property, where the goal is to accurately predict values of the response variable for subjects that are not used in model building. Theoretically speaking, there exists a true data generating model. This model not only provides the best description of the relationship between covariates and response but also has the best prediction performance. However, in practice with finite sample data, the true model is not known, and the models most suitable for estimation and prediction may differ.
With a simple linear regression model (M1):
We also consider the alternative model (M2):
With a normally distributed random error, up to a constant, the log-likelihood function for a single observation is
With models (i) and (ii), we consider the maximum likelihood estimates (MLE), which are defined as the maximizers of
With model (iii), we further assume that
With model (iv), we adopt a similar estimation strategy and consider the PMLE defined as the maximizer of
With parametric models (i) and (ii), inference can be based on the asymptotic normality result and the Fisher information matrix. However, with semiparametric models (iii) and (iv), such an approach involves smoothed estimation and is very difficult to employ. We propose the following bootstrap approach for inference of all parameters in all models: (a) Fit the model and compute the MLEs (PMLEs); (b) With the observed covariate values, generate random errors from the normal distribution with mean zero and variance
We collect measurements on the following covariates, which have been suggested as possibly associated with CAC in various publications: gender (female is used as the reference group), race/ethnicity (Caucasian, Chinese, African-American, and Hispanic; Caucasian is used as the reference group), former smoker (binary indicator), current smoker (binary indicator), diabetes (binary indicator), SBP (systolic blood pressure), DBP (diastolic blood pressure), age, BMI (body mass index), LDL cholesterol, and HDL cholesterol. We consider the following parametric models: (i.1) model (i) with linear effects for all covariates; (i.2) model (i) with linear effects for all covariates plus quadratic effects for LDL and HDL; (ii.1) model (ii) with linear effects for all covariates; and (ii.2) model (ii) with linear effects for all covariates plus quadratic effects for LDL and HDL. Models (i.1) and (ii.1) are more commonly adopted in practice, whereas models (i.2) and (ii.2) have been motivated by the nonproportional semiparametric model, i.e., the “biggest model”, and suggested by a reviewer. In semiparametric models (iii) and (iv), among the 13 covariates, 7 are binary, which naturally have parametric effects. Our preliminary analysis also suggests parametric covariate effects for SBP and DBP. Thus, there are 9 parametric covariate effects and 4 nonparametric ones. A total of six models are considered.
For
The solid line is the estimate. The dash-dotted lines are the point-wise 95% confidence intervals. The y-axis is the value of the function.
Covariate | Model (i.1), Logistic | Model (i.1), Linear | Model (i.2), Logistic | Model (i.2), Linear | Model (ii.1), Logistic | Model (ii.1), Linear | Model (ii.2), Logistic | Model (ii.2), Linear | Model (iii), Logistic | Model (iii), Linear | Model (iv), Logistic | Model (iv), Linear
Gender: Male | 0.988 (0.063) | 0.659 (0.042) | 0.971 (0.063) | 0.646 (0.047) | 0.966 (0.072) | 0.688 (0.085) | 0.951 (0.073) | 0.675 (0.075) | 0.969 (0.062) | 0.647 (0.047) | 0.945 (0.074) | 0.681 (0.073)
Race: Chinese | −0.227 (0.078) | −0.151 (0.047) | −0.217 (0.078) | −0.144 (0.052) | −0.140 (0.092) | −0.291 (0.074) | −0.134 (0.092) | −0.273 (0.099) | −0.211 (0.079) | −0.141 (0.053) | −0.119 (0.092) | −0.285 (0.099)
Race: African-American | −0.736 (0.065) | −0.491 (0.052) | −0.756 (0.065) | −0.503 (0.046) | −0.794 (0.070) | −0.384 (0.100) | −0.810 (0.071) | −0.406 (0.080) | −0.731 (0.066) | −0.488 (0.047) | −0.787 (0.070) | −0.398 (0.081)
Race: Hispanic | −0.593 (0.065) | −0.396 (0.046) | −0.604 (0.064) | −0.402 (0.042) | −0.626 (0.073) | −0.348 (0.081) | −0.636 (0.074) | −0.354 (0.084) | −0.596 (0.067) | −0.398 (0.044) | −0.628 (0.071) | −0.358 (0.085)
Former smoker | 0.352 (0.059) | 0.234 (0.038) | 0.354 (0.059) | 0.235 (0.038) | 0.368 (0.069) | 0.211 (0.069) | 0.366 (0.069) | 0.218 (0.069) | 0.354 (0.060) | 0.236 (0.039) | 0.370 (0.072) | 0.213 (0.071)
Current smoker | 0.580 (0.080) | 0.387 (0.054) | 0.579 (0.080) | 0.385 (0.054) | 0.620 (0.091) | 0.329 (0.101) | 0.617 (0.091) | 0.328 (0.101) | 0.573 (0.083) | 0.382 (0.056) | 0.609 (0.094) | 0.328 (0.096)
Diabetes | 0.309 (0.058) | 0.206 (0.041) | 0.308 (0.057) | 0.205 (0.041) | 0.255 (0.071) | 0.281 (0.066) | 0.253 (0.070) | 0.281 (0.066) | 0.300 (0.057) | 0.201 (0.041) | 0.243 (0.070) | 0.275 (0.068)
SBP | 0.008 (0.002) | 0.005 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.004 (0.002) | 0.009 (0.002) | 0.004 (0.002) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.004 (0.002)
DBP | −0.0009 (0.004) | −0.0006 (0.002) | −0.001 (0.004) | −0.0006 (0.002) | −0.003 (0.004) | 0.003 (0.004) | −0.003 (0.004) | 0.002 (0.004) | −0.001 (0.004) | −0.0007 (0.002) | −0.0034 (0.004) | 0.0032 (0.004)
Among the six models, models (i.1)-(iii) are special cases of model (iv).
We consider the likelihood ratio test statistic
Based on the above results, we conclude that,
To evaluate the prediction performance, ideally, two independent datasets (one training set and one testing set) from studies with comparable designs are needed. We are not able to find a study fully comparable to MESA. As an alternative, we consider the following Monte Carlo-based approach. (a) Randomly split the data into a training set and a testing set of equal size; (b) Estimate the unknown parameters using the training set only; (c) Make predictions for subjects in the testing set. Specifically, for a subject, first predict the probability of a nonzero CAC using the logistic regression model. Dichotomize the predicted probability at 0.5 and create the binary CAC status (zero or nonzero). If a nonzero CAC status is obtained, predict its actual value using the linear regression model; (d) To avoid bias caused by an extreme split, repeat Steps (a)–(c) 500 times and compute summary statistics.
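Steps (a)–(d) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used in the analysis: it uses a plain parametric two-part model with unconstrained (nonproportional) linear effects, the function names `fit_logistic`, `split_sample_eval`, and `simulate` are hypothetical, the response `y` is assumed to equal 0 for a zero CAC and to be positive otherwise, the overall MSE is assumed to be taken over all testing subjects with a predicted value of 0 whenever the predicted status is zero, and the data are simulated rather than taken from MESA.

```python
import numpy as np


def fit_logistic(X, z, n_iter=25):
    """Logistic regression by Newton-Raphson; X already contains an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Newton step; the tiny ridge term only guards against a singular Hessian.
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (z - p))
    return beta


def split_sample_eval(X, y, n_splits=500, seed=0):
    """Return (mean error rate for zero vs nonzero status, mean overall MSE)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])  # design matrix with intercept
    err_rates, mses = [], []
    for _ in range(n_splits):
        # (a) random split into training and testing sets of equal size
        perm = rng.permutation(n)
        tr, te = perm[: n // 2], perm[n // 2:]
        # (b) estimate both parts of the model on the training set only
        alpha = fit_logistic(Xd[tr], (y[tr] > 0).astype(float))
        pos = tr[y[tr] > 0]
        beta, *_ = np.linalg.lstsq(Xd[pos], y[pos], rcond=None)
        # (c) predict status by dichotomizing at 0.5, then the value if nonzero
        prob = 1.0 / (1.0 + np.exp(-Xd[te] @ alpha))
        status = prob > 0.5
        pred = np.where(status, Xd[te] @ beta, 0.0)  # assumed: 0 if predicted zero
        err_rates.append(np.mean(status != (y[te] > 0)))
        mses.append(np.mean((pred - y[te]) ** 2))
    # (d) summarize over the repeated splits
    return float(np.mean(err_rates)), float(np.mean(mses))


def simulate(n=400, seed=1):
    """Simulated zero-inflated data (hypothetical coefficients, not MESA)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    p = 1.0 / (1.0 + np.exp(-(0.2 + X @ np.array([1.0, -0.5]))))
    nonzero = rng.random(n) < p
    value = 3.0 + X @ np.array([0.7, -0.3]) + 0.5 * rng.normal(size=n)
    y = np.where(nonzero, np.maximum(value, 0.1), 0.0)  # keep nonzero values positive
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    err, mse = split_sample_eval(X, y, n_splits=50)
    print(f"mean error rate: {err:.3f}, mean MSE: {mse:.3f}")
```

Averaging over many random splits, rather than relying on a single one, is what protects the summary against an unusually favorable or unfavorable partition. The same loop can be reused for other candidate models by swapping the two fitting calls in step (b).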
In Step (a), we use random partitioning to generate independent training and testing sets. To avoid an extreme partition, multiple partitions are carried out. In the prediction evaluation, we are interested in the probability of correctly predicting the binary CAC status (zero or nonzero) as well as the overall mean squared error (MSE), which measures the ability to predict the actual CAC values. Under the six models, the mean error rates for predicting zero versus nonzero CAC are
The above results suggest that, despite their significantly different estimation results, all models have similar prediction performance. It is interesting to note that model (i.2), which has parametric proportional covariate effects, has prediction performance better than all of the other models (although the differences are small). As model (i.2) is a submodel of model (iv), this finding may seem counterintuitive. However, as discussed above, it can be explained by the bias-variance tradeoff. Although the prediction performance of model (i.2) is only slightly better than that of the other models, it has fewer unknown parameters than four of the alternative models and is easy to estimate. Thus, we conclude that model (i.2) is the most suitable for the purpose of prediction.
In the above sections, interesting findings include the nonlinear relationships found in model (iv) and the discrepancy between the models most suitable for estimation and prediction. Examination of
As discussed in
Models (i) and (iii) assume that the two covariate effects are perfectly proportional, whereas models (ii) and (iv) assume no proportionality. There are intermediate, partially proportional models in which some covariate effects are proportional and the others are not. Such models are as difficult to interpret and estimate as nonproportional models and hence are not pursued.
Our conclusions on the CAC models are based on the analysis of MESA. There are a few other studies examining similar cardiovascular problems, including the CHICAGO study
Coronary artery calcification is an important predictor of cardiovascular disease events. Our literature review suggests certain limitations of existing CAC modeling studies. In this article, we analyze the MESA data and systematically investigate various CAC models. Building on existing studies including
We thank the editor and referee for careful review and insightful comments, which have led to significant improvement of this article. We thank the investigators, the staff, and the participants of MESA for their valuable contributions. A full list of participating MESA investigators and institutions can be found at