Bayesian Classification and Regression Trees for Predicting Incidence of Cryptosporidiosis

Background Classification and regression tree (CART) models are tree-based exploratory data analysis methods which have been shown to be very useful in identifying and estimating complex hierarchical relationships in ecological and medical contexts. In this paper, a Bayesian CART model is described and applied to the problem of modelling the cryptosporidiosis infection in Queensland, Australia. Methodology/Principal Findings We compared the results of a Bayesian CART model with those obtained using a Bayesian spatial conditional autoregressive (CAR) model. Overall, the analyses indicated that the nature and magnitude of the effect estimates were similar for the two methods in this study, but the CART model more easily accommodated higher order interaction effects. Conclusions/Significance A Bayesian CART model for identification and estimation of the spatial distribution of disease risk is useful in monitoring and assessment of infectious diseases prevention and control.


Introduction
Cryptosporidium causes gastrointestinal infection in humans and animals and is now the most common protozoan parasite associated with gastroenteritis [1]. Cryptosporidiosis diseases are sensitive to weather variability as temperature and/or rainfall can influence the development and transmissibility of cryptosporidium and may also affect people's health-related behaviour. However, there are complex spatio-temporal interactions between the potential explanatory variables of these diseases that motivate further investigation.
Spatial dependence and heterogeneity are well known as major features of in spatial analysis of disease risk [2,3]. Spatial dependence can arise from the delineation of spatial units of observation (such as suburbs, statistical local areas and counties), spatial aggregation, and the presence of spatial exploratory factors. Spatial heterogeneity is related to the lack of stability over space of the spatial relationships between the observations [4,5].
Bayesian methods have been shown to account more sensibly and comprehensively for uncertainty in inference than frequentist methods, particularly with regard to the handling of parameter and model uncertainty [6,7,8]. Bayesian algorithms such as Markov Chain Monte Carlo (MCMC) have allowed for more widespread application of Bayesian methods to many fields of scientific investigation, including public health [9].
Bayesian spatial conditional autoregressive (CAR) models are increasingly being used to estimate spatial variation in disease risk between spatially aggregated units [2,10,11]. These models are typically represented as a linear regression between the response and explanatory variables with additional terms to explain spatial correlation. These models thus incorporate and estimate spatial correlation while simultaneously estimating covariate effects. Recently, Bayesian spatial and spatiotemporal models have been used to study the geographical distribution of tropical diseases including Ross River virus, malaria and schistosomiasis [2,12,13,14].
Classification and regression tree (CART) models provide an alternative representation of the relationship between a response variable and potential explanatory variables. These models have been shown to be very useful in identifying and estimating complex hierarchical (high order nonlinear interaction effect) relationships in ecological and medical contexts [15,16,17,18]. CART models are accepted in many fields of research because they are easy to interpret, more flexible than conventional parametric regression models and have a good predictive power [16]. Bayesian CART models have also been developed [19,20] but have yet to be widely applied [21,22,23].
In a previous study we used a frequentist CART model to assess the relationship between social-ecological factors and cryptosporidiosis [24]. In this study we apply the Bayesian CART algorithm developed by O'Leary [22] to predict the spatial distribution of the cryptosporidiosis infection using selected social-ecological factors and climate variables. We also compare the outcomes of the spatial CART model with those of the Bayesian spatial CAR model.

Data collection
The dataset considered here has been described elsewhere [24]. Briefly, we obtained the computerised dataset on notified cryptosporidiosis cases by local government areas (LGAs) in Queensland for the period of 1 st January-31 st December 2001 from the Queensland Department of Health. The dataset includes the onset date and place of onset of the notified cases of cryptosporidiosis infection, age and sex of the patients and laboratory test date. Weather (daily temperature and daily rainfall) and socio-economic index for areas (SEIFA) data were obtained for the same period from the Australian Bureau of Meteorology and the Australian Bureau of Statistics, respectively.

Bayesian CART model
CART models are binary decision trees that are built by dividing the predictor space repeatedly into partitions, or nodes, based on splitting rules of the predictor variables [15]. The aim of partitioning the space in this manner is to progressively increase the homogeneity of the response variable y within each node. The response variable determines the type of tree and the homogeneity of the terminal nodes. If the response variable is categorical then a classification tree is used to predict the classes of the response, and assessment of homogeneity is based on (correct) allocation of observations within a node to a single class; alternatively if the response is continuous then a regression tree predicts the average response within a node, and assessment of homogeneity is based on the corresponding variance, deviance, residual sums of squares or similar measure.
This modelling approach facilitates the fitting of complex nonlinear interactions, such as combination of environmental and sociological variables to help explain spatial patterns of a disease (e.g. [8]), combinations of habitat variables describing ecological niches [11], or gene-gene interactions that explain diseases [25].
Consider a response variable y i and predictor variables x il , i = 1,…, n; l = 1,…, L. The partition of the response variable starts at the root node and divides the predictor space (observations i) at each internal or split node S k , k = 1,…, K21, where K is the size of the tree (defined as the number of terminal nodes). At each splitting node S k , the partition is based on a splitting rule R k , of a variable V k and divides the observations {y i ; y i [ S k } into the left and right child node. Terminal nodes T 1 , …, T K , also called leaves, are the final nodes in which the predictor space is not split any further. At each splitting node S k , the lth predictor is selected as the splitting variable V k from the list of possible predictor variables x l . If this predictor is continuous, e.g. S 1 in Figure 1, then the splitting rule R k is based on a value a so R k = a, where min(V k )#a#max(V k ). For example, at S 1 in Figure 1, V 1 is Temperature and R 1 is Temperature #32.5, so that observations with temperature less than or equal to 32.5 are partitioned to the left of the tree and the remainder are partitioned to the right. Alternatively, for a categorical response, R k is based on a class subset c so R k = c, where c5 {possible levels of V k }. Letting y k represent the parameters corresponding to the assumed distribution of the data in the kth terminal node, the parameter vector h k = (R k , S k , V k , y k ) defines the parameter set or tree structure in this node; thus h K = {h k , k = 1, … K}.
Following O'Leary [22], in a Bayesian framework, the joint distribution of the model parameters (size of tree K, tree structure h k and response variable y) is modelled by Þp y K,h k j ð Þ : Here p(K) is the prior probability distribution for each model (where the model is defined by the number of terminal nodes K), p(h k |K) is the prior probability distribution of the parameter set h k given model K, and p(y|K, h k ) is the likelihood of the data y given the model K and the corresponding parameter set h k . Bayesian analysis about the tree size K and tree structure h k is calculated from the joint posterior distribution p(K, h k |y). For regression trees, if the (continuous) response variable y is assumed to have a normal distribution, then y k = (m k , s 2 k ) and the likelihood is For classification trees, the (categorical) response variable y is typically assumed to have a multinominal distribution, so that if there are N categories, y k = (p k1 ,..,p kN ) and the likelihood is where m kj is the number of data points y at the kth terminal node k which are classified into the jth category.
The prior for the model is p(h k |K) p(K), so that For a regression tree with a normal likelihood, a noninformative prior for p(y k |V, S, K) can be represented by a normal prior with a large variance for m k and a uniform prior with a large range for s k . For a classification tree with a multinomial likelihood, a noninformative prior for p(y k |V, S, K) can be represented by a Dirichlet prior for p k with hyperparameters equal to 1. Dirichlet priors may also be used in both regression and classification trees for the splitting node p(S k | K), variables p(V k | S k , K), and splitting rules p(R k | V k , S k , K): j~Dir(S k a s 1 , . . . ,a s k ), When no prior information is available about these variables, noninformative uniform distributions can be defined by setting all hyperparameters to 1, so that a S 1 , . . . ,a S k~1 ; The prior on the size of the tree p(K) is assumed to be a truncated Poisson distribution with parameter l (expected number of nodes in the tree), This prior imposes a left limit of k.0 because the minimum model contains one terminal node. The value of l represents the expected number of splitting nodes is restricted to an interpretable size K * . In the case study considered here, this was taken to be l = 10 [20].
In the present case study there was no information available about the model variables, so, noninformative priors were adopted. In other situations, if such information is available, then informed priors may be used instead. For example, in an analysis of habitat suitability of a threatened species, O'Leary et al. [23] discuss how to elicit from an expert the size of the tree, the relative importance of the variables, and the splitting rules for the most important variables. They also show how to translate this information into priors and combine with the data for Bayesian classification trees.
The sensitivity of the Bayesian CART model to the choice of priors has been investigated by O'Leary [22] for classification trees. The sensitivity analysis involved the investigation of the hyperparameters of the priors for tree size (number of terminal nodes), splitting nodes, splitting variables and splitting rules. The results indicated that the posterior distribution is relatively robust to these priors except for extreme choices of the hyperparameters.
The Bayesian CART models were fitted using the approaches suggested by Chipman et al. [19] and Denison et al. [20]. A reversible jump MCMC algorithm was used [20,26], with single long chain [20]. The final stopping rule was based on the stability of the posterior distribution [20].
A fully Bayesian simulation from the posterior distribution could have been implemented via a greedy search algorithm. However, currently this is computationally infeasible because the parameter space is large and has an inflexible hierarchical structure. Instead we chose to follow the overall approach of Denison et al. (1998) and Chipman et al. (1998), by constraining the search algorithm to examine only the more optimal portions of the model space [19,20]. This stochastic search algorithm is based on careful choice of model performance criterion to ensure that a range of good models are selected [22]. Therefore, Bayesian CART search algorithm produces a large number of trees, whilst traditional CART only produces one tree. The selection of the best classification tree, in Bayesian CART algorithm, is based on the research aim, in this case study the tree with the highest sensitivity and specificity.
Following O'Leary [22], the goodness of fit of a classification tree is assessed by several accuracy measures, calculated from the confusion or loss matrix ( Table 1). The ''best'' tree can be defined as the one that minimizes/maximises one or more accuracy measures, depending on the aims of the study. In this paper the following accuracy measures were chosen: the misclassification rate (MCR) = (number of false positives (b)+number of false negatives (c))/total number (N), sensitivity = number of true positives (a)/(number of true positives (a)+number of false negatives (c)) and specificity = number of true negatives (d)/ (number of true negatives (d)+number of false positives (b)). A set S G of G ''good'' trees is identified based on preset criteria, in this case study trees with highest sensitivity and specificity, and lowest MCR. The variables and splitting rules at each splitting node of the trees in S G are examined, and convergence is declared when the membership of S G and structure of the component trees has stabilised, i.e. the same trees are in the set S G .
For each tree in the set of good classification trees SC G the following summary statistics can be examined: tree structure (variables, splitting rules and number of terminal nodes), sensitivity, specificity, deviance (226log likelihood p(y|K, h k )), log likelihood and log posterior probability. From this set of good classification trees, depending on the aims of the analysis, a small number of trees may be chosen as the ''best'' trees, based on the modal tree structure (same size tree with the same variables and splitting rules), highest sensitivity and specificity, lowest deviance, and the highest likelihood and posterior probability.
For regression trees, the stopping criterion is based on posterior probabilities, deviance and residual sums of squares (RSS) Therefore a set of SR G G good R regression trees, for a certain number of iterations after burn-in, is identified to have the smallest RSS, minimum deviance and maximum likelihood and posterior probabilities of p(y k |V, S, K) (i.e. distribution of the data given the tree structure). Similar to classification trees, tree structure (variables, splitting rules and number of terminal nodes) in SR G is investigated. Once the membership of SR G and structure of the component trees has stabilised, this set of regression trees is declared ''good''. Bayesian models focus on the estimation of the model parameters (and model) conditional on all of the observed data.
Overfitting of the Bayesian CART model can be assessed in the following manner. Following the practice adopted in crossvalidation, the data can be split into a training and test dataset, using a stratified random sample to ensure equivalent allocation of presences and absences (for a classification tree) or subgroups (for a regression tree) [27,28]. The model is then fit to the training dataset and the set of best trees is identified. For each tree, the posterior predictive distribution [27] is computed for both the training dataset and the test dataset and a confusion matrix based on the posterior predictive distribution and the observed data is computed. This is performed for each iteration of the MCMC algorithm, thus incorporating the uncertainty of the model parameters and the data in the evaluation. Finally, overfitting is assessed by comparing the accuracy measures (classification trees) or RSS (regression trees) between the training and validation datasets for the best trees. This approach is an adaptation of the typical use of predictive posterior distributions [27], in that instead of comparing the distribution of the observed data with that of future observations ỹ under a proposed model, here we compare these distributions of observations in the training and validation datasets.
The cryptosporidiosis dataset contains a large number of zero incidence rates (n = 1131 out of 1332 observations). To accommodate this, two Bayesian CART models were applied to incidence of cryptosporidiosis in LGAs: 1) a Bayesian classification tree in which the response is binary: presence/absence of cryptosporidosis; 2) a Bayesian regression tree in which the response is continuous: positive incidences rates, i.e ignoring zeros. This two stage approach is similar to hurdle and zero-inflated models [29].

Bayesian CAR model
An initial descriptive analysis of cryptosporidiosis was performed. Crude standardised morbidity ratios (SMRs) for each LGA for the whole study period were calculated using standard methods [9], where SMR = (the observed number of cryptosporidiosis cases)/(the expected number of cryptosporidiosis cases). This model assumed that the observed counts of cases (O kt ) for the kth LGA (k = 1…125) in the tth month in 2001 follow a Poisson distribution with mean (m kt ), that is, where a is the intercept, b 1 is the coefficient for temperature, b 2 is the coefficient for rainfall, b 3 is the coefficient for SEIFA, b 4 is the interaction coefficient of temperature and SEFIA, c is a LGA-level temporal trend coefficients, u is LGA-level variation that is spatially structured (ie. spatially-structured factors not explained by the model covariates), v is spatially unstructured LGA-level variation, and d is the amplitude of seasonal oscillation in the month-specific random effects, which was modelled by a sinusoidal term cosine(2p6t/12). Spatial correlation between LGAs was modelled using a CAR prior for u, using a simple adjacency weights matrix [9].
Parameter estimation was obtained via MCMC simulation using an initial burn-in of 5000 iterations and subsequent set 100,000 interactions for estimation. Convergence was assessed by examining posterior density plots, history plots and autocorrelation of selected parameters. Model selection was performed using the deviance information criterion (DIC), where a lower DIC suggests a better trade-off between model fit and parsimony. Poisson regression models were developed in a Bayesian framework, using the WinBUGS software version 1.4 [30]. Figure 2 shows the spatial patterns of cryptosporidiosis, rainfall, temperature and SEIFA in Queensland by LGA. The figure confirms that all these variables varied with geographical location.

Bayesian classification tree
A set of five good Bayesian classification trees, with the highest sensitivity, specificity and lowest deviance, are displayed in Table 2. The first tree has the highest sensitivity and specificity, and lowest deviance. Since the focus of this case study was on correct prediction of presence (highest sensitivity) the first tree was selected as the best. This tree, depicted in Figure 3, indicates that presence of cryptosporidiosis was predominantly explained by a high-order nonlinear interaction between temperature, SEIFA and rainfall. The probability of cryptosporidiosis was largest when temperature was high and rainfall was low, temperature was low and SEIFA was very low, and temperature was low and SEIFA was mid-range but rainfall was low. Table 3 shows the quantiles of sensitivity, specificity and log posterior (distribution of data given the tree structure) for training and validation datasets over all accepted classification trees. This shows that the Bayesian CART algorithm search space includes trees with very low (close to zero) to very high (close to one) sensitivity and specificity.
Overfitting of Bayesian classification trees was explored by investigating the quantiles of sensitivity and specificity for training and validation dataset, over all accepted trees. Table 3 reveals similar 95% CIs for sensitivity and specificity between the training and validation datasets, indicating no over-fitting. However, for the validation dataset, the fourth and fifth trees have slightly higher sensitivity than the first tree.

Bayesian regression tree
The Bayesian CART algorithm was applied to positive incidence rates of cryptosporidium. The set of five best regression trees (with lowest RSS and deviance) have the same log RSS (258.96 and 258.47), log posterior (216.18 and 213.56) and deviance (22.58 and 17.35) for both training and validation dataset respectively. The only difference between these trees is the splitting rules, which have all resulted in the same y observations being classified into the same terminal nodes. Over the 300,000 iterations, the iteration number for each of these five trees are very different, indicating that the Bayesian CART did not get  trapped in local maxima. The first and second trees were designated as the 'best trees' since they were most consistently accepted in the set of good trees. The best regression tree modeling positive incidence rates of cryptosporidium is displayed in Figure 1. There are three groups of positive incidence rates of cryptosporidium, ranging from low to high incidence. A monthly mean incidence rate of cryptosporidium of 78.22/100,000 (n = 105; far left terminal node) occurs in areas with temperatures less than or equal to 28.5u and SEIFA less than or equal to 1033.8. The monthly mean incidence rate is reduced to 4.73/100,000 when temperatures are the same but SEIFA is greater than 1033.8. The highest monthly mean incidence rate (134.76/100,000) occurs when the temperature is greater than 28.5u.
The quantiles of log RSS, deviance and log posterior (distribution of data given the tree structure) over all accepted regression tees are displayed in Table 4. The Bayesian regression tree algorithm search space includes trees with low to high RSS, deviance and log posterior. There was no evidence over-fitting with Bayesian regression trees since there was little difference in log RSS and deviance between training and validation datasets.
Spatial CAR model Table 5 shows that under the spatial regression (CAR) model, the average increase in monthly cryptosporidiosis incidence rates was 9% (95% credible interval (CrI): 0-18%) for a 1uC increase in monthly average maximum temperature. However, there was no substantive association between SEIFA, rainfall and cryptosporidiosis incidence. No interactions effects were found between temperature and SEIFA.

Comparison with frequentist CART models
We also compared the outcomes of the Bayesian CART model with those of the traditional CART model [8]. Both the Bayesian CART and traditional CART models show that SEIFA and temperature were associated with the cryptosporidiosis disease. However, the analyses indicate that Bayesian CART gave slightly better prediction accuracy (ie. high sensitivity) (sensitivity Bayeisan: 79%; specificity Bayesian: 50%) than the CART accuracy (sensitivity frequentist: 10%; specificity frequentist: 99%) established using the more traditional frequentist approach. An important difference between the two models was that the frequentist tree gave equal weighting to correct classification of all observations, whereas the Bayesian tree differentially weighted the groups of presences and absences based on the respective sample size.

Discussion
Both the Bayesian CART and Bayesian CAR models show that temperature was significantly associated with the cryptosporidiosis disease. The analyses indicate that the nature and magnitude of the effect estimates were similar for the two methods used in this study. However, the Bayesian CART allowed more flexible identification and description of nonlinear interactions between explanatory or predictor variables, while still allowing for local smoothing.
The Bayesian CART model revealed a strong nonlinear interaction between SEIFA and temperature, and a weaker interaction with rainfall, in predicting incidence rate of cryptosporidiosis. In contrast, because only main effect term and one interaction term (ie. temperature and SEIFA) were included in the spatial CAR model, other interactions were not identified.   Although other interactions (ie. temperature, rainfall and SEIFA) could of course be included in the CAR model, it is difficult to identify a priori which interactions to include and evaluation of all possible interactions would require a much larger dataset than was available here. We also considered including these interactions in a spatial CAR hurdle model, which allows for zero-inflation by having a probability mass at zero, but found this to be difficult to fit in terms of stability and interpretability of the estimates and corresponding predictions. This is possibly not surprising given that the discretisation of the data into two components (zero and nonzero) may impact on the representation of the spatial component in the model, especially when taking into interactions into account. This requires further future investigation. In the meantime, a posteriori inclusion of interactions, based on the CART, into the CAR model analyses is a potentially useful alternative.
A strong advantage of a Bayesian framework for the CAR and CART models is that all the parameters of the model are treated as variables, so that probabilistic inferences are made on the basis of the corresponding posterior distributions [30]. Moreover, by virtue of the MCMC computation, the distributions used to describe these variables are no longer constrained to analytically tractable (e.g., normal) formulations. Furthermore, under a Bayesian CART framework, a diverse range of tree structures can be readily explored. The typical frequentist approach of fitting the CART model uses single recursive partitioning algorithms [31,32] in which the choices of the splitting rules at nodes further down the tree are constrained by the choices made at nodes above it, and only get one optimal tree. In contrast, the Bayesian CART approach investigates a wide variety of tree structures with different variables, splitting rules and number of terminal nodes. At any splitting node, the variable and splitting rules are randomly selected from the prior and trees that perform well in terms of high likelihood (low deviance) and posterior probabilities are chosen. Accounting for model uncertainty in this manner can improve predictive performance [8].
A Bayesian CART model for identification and estimation of the spatial distribution of disease risk can be useful in monitoring and assessment of infectious diseases and in decision-making about prevention and control. The methodology developed through this study may be directly applicable to research on other infectious diseases, with further potential for application to a wider range of public health problems.