Bayesian Multimodel Inference for Geostatistical Regression Models

The problem of simultaneous covariate selection and parameter inference for spatial regression models is considered. Previous research has shown that failure to take spatial correlation into account can influence the outcome of standard model selection methods. A Markov chain Monte Carlo (MCMC) method is investigated for the calculation of parameter estimates and posterior model probabilities for spatial regression models. The method can accommodate normal and non-normal response data and a large number of covariates. Thus the method is very flexible and can be used to fit spatial linear models, spatial linear mixed models, and spatial generalized linear mixed models (GLMMs). The Bayesian MCMC method also allows a priori unequal weighting of covariates, which is not possible with many model selection methods such as Akaike's information criterion (AIC). The proposed method is demonstrated on two data sets. The first is the whiptail lizard data set which has been previously analyzed by other researchers investigating model selection methods. Our results confirmed the previous analysis suggesting that sandy soil and ant abundance were strongly associated with lizard abundance. The second data set concerned pollution tolerant fish abundance in relation to several environmental factors. Results indicate that abundance is positively related to Strahler stream order and a habitat quality index. Abundance is negatively related to percent watershed disturbance.


Introduction
Ecologists and other environmental scientists often consider a large number of plausible regression models in an effort to explain ecological relationships between several covariates and a response variable. Model selection procedures, such as Akaike's information criterion (AIC), are routinely employed to help researchers decide upon an appropriate model to describe the environmental system [1].
In addition to the increase in the use of model selection methods, advancing technology has led to the routine use of global positioning systems (GPS) to collect spatially referenced data. The increase in spatial data collection has led environmental scientists to recognize that there may be substantial spatial correlation present in their data. As a result spatial correlation models have become more popular in recent years. Here we investigate a model selection method for geostatistical regression models. In addition to estimating regression coefficients, a geostatistical regression model involves fitting a spatial correlation function to the regression errors. This function allows correlation between observations to decrease as separation in space increases. These models are traditionally termed universal kriging models [2]. The kriging terminology, however, refers to spatial prediction and ecologists are often more interested in inference concerning the covariate portion of the model. Therefore, the term geostatistical regression is used for a spatially correlated regression analysis.
In most regression model selection methods, spatial correlation is ignored. This can lead to erroneous inference of the importance of some covariates in explaining variation in the response variable [3]. Hoeting et al. [4] explore use of AIC for geostatistical regression models. Thompson [5] considers a Bayesian approach to geostatistical regression selection and model averaging predictions using integral approximations to obtain the necessary model weights.
In this paper a Bayesian selection procedure for geostatistical regression models was investigated using a Markov chain Monte Carlo (MCMC) approach. Bayesian and MCMC methods are becoming increasingly popular in the ecological literature [6][7][8][9]. Chapter 7 in Givens and Hoeting [10] provides an introduction to MCMC procedures. Green [11] proposed reversible-jump MCMC (RJMCMC) as a method for traversing model space as well as parameter space. This allows one to make Bayesian inference on the model set in an MCMC setting.
The RJMCMC method presented in this paper has several advantages. Many covariates can be examined through a stochastic model search. The prior weighting of the coefficients is another benefit over methods such as AIC. Certain covariates can be given more or less weight a priori. Methods such as AIC selection weight all covariates equally. The RJMCMC approach has one other major advantage over AIC and Bayesian closed form approximations, it is directly extendable to spatial generalized linear mixed models (GLMMs). This implies the RJMCMC approach can be an all purpose tool for geostatistical regression inference for Gaussian and non-Gaussian data. In order to demonstrate the method in either scenario, we present two example analyses. The first examines the previously analyzed whiptail lizard data set. We chose this data set as a way to assess the accuracy of the method. The whiptail data set has been analyzed several times [3][4][5] with spatial regression models each time the same general model has been chosen. Thus, we analyzed it with our method to confirm the results. The second analysis examined fish abundance data using a Poisson spatial GLMM to illustrate the extension to non-normal data.

Geostatistical regression models
The geostatistical model [2] is a commonly used model for spatially referenced data in a continuous spatial domain. A general model that allows both normal and non-normal response data is presented so that the method can be applied in a variety of ecological studies. In order to account for spatial correlation in normal and non-normal data [12] propose using a spatial GLMM. The spatial GLMM approach uses a traditional GLM with a spatially correlated random effect in order to obtain spatial correlation in the observed data.
Let y~(y(s 1 ), . . . ,y(s n ))' be a set of spatially referenced measurable observations, where s denotes a location in a spatial domain D (usually a two dimensional region). In a spatial GLMM the response variables, y(s) are assumed to be independent given an underlying Gaussian spatial process z(s) and the vector of covariates x(s). That is to say ½yjz is distributed according to p(yjz)~P n i~1 pfy(s i )jm(s i )g, where m(s i )~' {1 fz(s i )g is the conditional mean of y(s i ) and ' is a link function.
The geostatistical process z~(z(s 1 ), . . . ,z(s n )) is assumed to have the normal distribution N(zjXb,S), where X is a matrix of covariates with each row associated with site s i , i~1, . . . ,n. The spatial covariance matrix is defined by the spatial covariance function of the form where r is an isotropic correlation function, W is a 2|2 positive definite matrix that allows for geometric anisotropy [2], s 2 is the sill parameter, and t 2 is the nugget parameter. The correlation function r( : ) may take many forms. Typical choices are the exponential, Matern, or spherical correlation functions [13,14]. The joint density of the observations y and the spatial process z is given by p(y,zjb,j,X)~p(yjz)N(zjXb,S), where j is a vector of the five spatial covariance parameters s, t and the unique elements of W. Hence, the likelihood for the geostatistical regression model is given by L(b,jjy,X)~Ð z p(yjz)N(zjXb,S)dz. There is only one common observation model for which this likelihood can be obtained in closed form, the normal distribution. In this case, the resulting likelihood is the multivariate normal N(yjxb,Szt 2 I) distribution.

Bayesian inference
As the model is presented in the previous section there are significant difficulties in making Bayesian inference. Previous MCMC analysis of spatial data has noted a high posterior correlation between s and t making MCMC samplers slow to converge [15]. Therefore, we use the alternate parameterization h 1~l og (s 2 ) and h 2~l og (t 2 ). This significantly reduced correlation in the MCMC samples the examples considered herein.
In addition, because W must remain positive definite (i.e., c 0 Wcw0 for all vectors c) there are several awkward constraints on the entries of W. To remove the constraints, we reparameterized by factoring the anisotropy parameters as W~AyA, where A is a diagonal matrix with positive elements and y is a positive definite correlation matrix. Let a~(a 1 ,a 2 )'~(2 log (A 11 ), 2 log (A 22 ))'. Any a[R 2 is valid allowing a large range of possible priors. Due to the fact that y is a 2|2 correlation matrix, it has but one parameter y that represents the angle of anisotropy. The valid range of y in that case is (21,1). The elements of the original anisotropy matrix y can be rewritten as functions of a and y, so W ii~e xpfa i g for i~1,2, and W ij~y expf(a i za j )=2g for i=j. The resulting collection of spatial correlation and variance parameters is given by j~(h 1 ,h 2 ,y,a 1 ,a 2 ).

Model uncertainty
Bayesian inference for the spatial regression model is based on the posterior distribution p(b,jjy)!L(b,jjy,X)p(b,j), where p(b,j) is the parameter prior distribution. Desired quantities for summarization of the density are usually in the form of expected values, for example posterior means, variances, and percentiles or credible intervals. The posterior density is intractable; therefore, these quantities can be approximated from an MCMC sample.
The Bayesian approach to model uncertainty assumes that the model itself, like the parameter values, is an unknown entity. Therefore, the joint posterior distribution of the parameters and the model is of interest. The joint posterior for model k is given by where p(m k ) is the prior probability of the kth model and b k is a vector of coefficients for the covariates in model m k . A classic model prior for regression analysis is derived by treating inclusion of the p coefficients as a series of independent Bernoulli trials with probability p j , j~1, . . . ,p [16]. The result is the following prior where I j is an indicator that covariate j is included in the regression model. This prior includes the uniform prior p(m k )~1=2 p ; obtained by setting p j~1 =2, j~1, . . . ,p.
In most model selection problems the object of inference is not the joint model-parameter posterior, it is the marginal posterior distribution of the model set. This marginal distribution is the posterior model probability (PMP) for model k, given by The PMP is almost always unobtainable in closed form. Typically, the model with the largest PMP is selected. Alternatively, one may not want to select a specific model, but use all of the models, appropriately weighted by their PMPs, in an ensemble fashion.
Hoeting et al. [17] provide a detailed description of this type of inference termed Bayesian Model Averaging (BMA). This paper will use both BMA and maximum PMP to make inference concerning the importance of each covariate in explaining an ecological or environmental response. The maximum PMP model provides information on important covariates. Another quantity, the posterior inclusion probability (PIP), is also useful in regression settings. The PIP for the jth covariate is defined This is the model averaged posterior probability of inclusion of the jth covariate. The PIP for each covariate provides a measure of importance of each covariate to the response.

Reversible-Jump MCMC
Unlike ordinary regression, closed forms of the PMPs and PIPs cannot be obtained for spatial regression models. Green [11] proposed the RJMCMC method for sampling from the joint space of the parameters and model. The key difference between MCMC and RJMCMC is that the parameter space changes as the chain moves between models. This makes the algorithm construction more complex. Sample averages can then be used to approximate expected values of model and parameter functions, such as PMPs and PIPs.
A major challenge of the general RJMCMC method is that a double proposal is necessary to move to a different model. First, an appropriate model must be proposed, followed by an acceptable proposal for the parameters of the model. If either of these two proposals is inefficient then the chain will fail to mix well and a large number of iterations will be necessary to do posterior model inference. This can make computations very slow in the spatial regression case because the n by n covariance matrix S that must be inverted at every MCMC iteration.
In order to avoid long RJMCMC runs with spatial regression models an efficient proposal scheme is necessary. Godsill [18] suggested a general proposal method for model classes where some of the parameters are shared among each model. In the spatial regression case, the spatial parameters j~(h 1 ,h 2 ,a,y)' are common to all of the models, whereas b k differs for each model. If the conditional posterior distribution of the model given the shared parameters is available than a more efficient Partial Analytic RJMCMC (PARJ) algorithm can be constructed. Using the basic idea of Godsill, a PARJ chain was constructed for spatial regression model moves in the following manner. For a current state q~(b k ,j,m k ,z), 1. Update b k , j using the full conditional distributions as in a standard Gibbs sampler (details in File S1). 2. Update m k with PARJ as follows 3. Update the entire vector z using a Langevin-Hastings proposal [19] (See File S1 for details).
Upon examination of (6), one can see there is no need to draw b k Ã proposal values due to the fact that the conditional distribution is available in closed form when p(b k )~N(b k jm k ,V k ). Details of the calculation of p(m k jj,z) are given in File S1. In the present examples we use a discrete random walk for J(m k ). First, a covariate is chosen with Table 1 summarizes the steps in the PARJ algorithm. In the following sections we demonstrate the method with the analysis of two separate data sets.

Example 1: Analysis of Whiptail Lizard Abundance
The proposed model selection methodology was applied to the previously analyzed whiptail lizard data set in order the compare the RJMCMC method to other more traditional methods. The data were initially analyzed using a stepwise procedure with a spatial correlation correction [3]. The data were subsequently analyzed using a BIC approximation to the PMP [5] and finally using AIC [4]. Each of these analyses demonstrate the danger of ignoring spatial correlation when selecting covariates. A larger model is often selected to account for the ignored correlation. We used this data set to determine whether the method and proposed RJMCMC algorithm performed as expected.
The data set is composed of abundance data of the Orangethroated whiptail lizard in Southern California. At n = 149 locations where lizards were observed the average number of lizards trapped during a week long trapping period was recorded. The response variable analyzed is the log transformed value y(s) = ln(average no. trapped at location s).
Several covariates were collected to investigate which environmental conditions explain lizard abundance. The original set of environmental covariates contained 37 variables. After initial screening [5] six covariates remained which held potential to explain lizard abundance: Crematogaster ant abundance (3 levelslow, medium, high), log percent sandy soils, elevation, a bare rock indicator, percent cover, and log percent chaparral plants. Using Table 1. Steps in PARJ algorithm for geostatistical regression models. Step Update type indicators for ant level 1 and 2, there are 128 possible models. All 6 covariates were normalized to have mean zero and variance 1.
The PARJ approach was used to select covariates with spatial correlation present. Here, the full spatial model is used with nugget and anisotropy. The exponential function r(d ij )~expf{d ij g, where d ij~hij 0 AyAh ij was used to model spatial correlation. None of the previous analyses included anisotropy effects. In addition, a model without spatial correlation was analyzed as well to determine any effects that might occur when correlation is ignored in the selection process.
Following [5] and [20], the mean and variance of the b k normal prior was chosen to be m~ ( y y,0, . . . ,0), where y y the sample mean of y. The prior covariance was chosen to be V k~2 :85 2 (s 2 zt 2 )(X k' X k ) {1 . Priors for the variance parameters h 1 and h 2 where chosen to be N(0,10). The sample variance var (z i )~0:17, therefore, a prior variance of 10 on the log variance parameters provided adequate coverage over the set of realistic values of the partial sill and nugget. For each range parameter a i , a N(0,1) prior distribution was used. Automatically choosing extremely large variances for the range priors can put an unrealistic amount of mass on ranges well beyond maximum observed distances, which can adversely affect results. A variance of 1 distributed 90% of the prior mass of the correlation range (expfag=3) less than the maximum observed distance. A uniform distribution can be used for a noninformative prior on y, however, a triangular distribution over (21, 1) and centered on 0 produced a more stable MCMC analysis. This was due to the fact that y values near +1 tended to cause inconsistencies in the covariance structure.
The PARJ chain was run for 100,000 iterations following a sufficient burn-in period. See Table 2 for a list of tuning parameters used in the RJMCMC within model moves. We did not thin the chain as that is primarily a tool to relieve storage issues for high dimensional parameters. The number of parameters in this analysis (and the following example) is not that great as to cause computer storage problems. However, if one would like to monitor model averaged predicted values of the response variable at many sites, thinning may become necessary. In order to judge whether the run length was acceptable, the Heidleberger-Welch diagnostic test [21] was performed on the parameters common to all models. Visual inspection also confirmed that this was an acceptable chain length. See File S2 for more details and figures. The analysis was performed using the R statistical package. Code is available in the File S3.

Example 2: Abundance of Pollution Tolerant Fish
In this section the PARJ approach is demonstrated on a data set of fish counts for several geographically referenced sites. A Poisson model with spatially correlated random effects model is adopted for the observed counts. Model selection analysis is performed to determine which of several environmental factors contribute to overall abundance of pollution-intolerant fish.
In 1994 and 1995 numerous stream sites in the Mid-Atlantic region of the Unites States (Maryland, Pennsylvania, Virginia, West Virginia) were visited as part of the U.S. Environmental Protection Agency's EMAP water quality monitoring program. Several stream characteristics were measured to assess water quality. At n~119 sites fish were sampled and classified according to their pollution tolerance. When assessing water quality the abundance of pollution intolerant fish at a site is often a good index.
The emphasis of the analysis is the effects of pollution and stream quality variables, however, there are several natural factors which might also influence abundance. The natural covariates include Strahler stream order (ORDER), elevation (ELEV), and watershed area (WSA). The remaining covariates can be modified by human use and, therefore, were considered potential stream quality variables. Stream quality variables included: road density in the watershed (RD) (No./area), percentage of watershed classified as disturbed by human activity (DISTOT), an index of fish habitat quality at the stream site (HAB), concentration of dissolved oxygen in the stream at the sampling site (DO), % areal fish cover at the sampling site (XFC), and percent sand in stream bed substrate (PCT). Strahler order is a measure of how far a particular section of stream is from its headwater sources. For example, headwater streams have order 1. Two order 1 streams merge to form an order 2 stream, etc. Thus ORDER is an index of stream size and relative location in the watershed. The covariates in this analysis have vastly different scales, therefore, to increase MCMC mixing, the covariates were standardized to have mean 0 and variance 1.
It is believed a priori that the natural variables should be included in the final model, but, this is not known with certainty. Therefore, the natural variables will be more heavily weighted in the prior model probabilities. For the natural variables prior inclusion probabilities were set to p j~0 :75, while for the remaining disturbance variables p j~0 :5. This a priori weighting illustrates one of the advantages of the Bayesian approach to selection. In addition, the data were also analyzed with a flat model prior (p j~0 :5 for all j) to examine sensitivity of this prior weighting.
A Poisson distribution with the canonical log link function is chosen for the abundance model. So, the likelihood is given as where logfm(s i )g~z(s i ) and z*N zjXb k ,S ð Þ . Priors for b k and log s 2~h 1 were chosen to be N(0,100s 2 (X' k X k ) {1 ) and N(0,10) respectively. Here we chose a flatter prior for b k than in the previous example in order to have minimal prior influence on the coefficients. This was the first examination of this data in a model selection context, whereas we chose a more informative prior in the whiptail analysis so the results were comparable to previous analyses. As in the previous example, a variance of 10 for the h 1 prior placed significant mass beyond the scale of the measured data, therefore, it was sufficiently vague, but still proper. An isotropic exponential spatial correlation model was used for covariate selection after initial analysis of the full model suggested no significant anisotropy or nugget effect. The Poisson observation model essentially takes the place of the nugget measurement error variability in the Gaussian process. We assumed that the range parameter a followed a N(0,1) distribution. This prior placed more than 90% of the prior mass below the maximum observed range to maintain stable inference. The PARJ chain was run for 300,000 iterations after a sufficient initial burn-in. Table 2 presents the proposal distributions used for within model moves. In the Langevin-Hastings updates of z, a value of h~0:013 was used providing an acceptance rate of approximately 50%. New models were proposed via the random walk approach of randomly selecting a covariate for addition or deletion from the current model. See File S2 for the convergence diagnostic analysis. Code for analysis within the R statistical language is available in File S3.

Example 1: Analysis of Whiptail Lizard Abundance
The top five models in PMP are given in Table 3. The PARJ algorithm visited 114 out of 128 possible models. The top model included Ant 1 and log percent sandy soil. With only 17.8% posterior model mass, however, there is considerable uncertainty in the best model. Under the independence assumption, the top PMP model includes: Ant 1 , log percent sandy soil, and log percent chaparral (PMP = 30.8%). The top spatial regression model ranks 5th in order with a PMP of only 5.5% under the independence assumption. Table 4 shows the PIPs for each of the covariates. In addition, the table also shows the PIPs for the analysis without spatial correlation. When spatial correlation is included, log percent sandy soil is the only covariate with a PIP greater than 50%. When spatial correlation is ignored during the selection process Table 4 illustrates that a different model inference is obtained. Without a spatial model Ant 1 becomes a virtually certain inclusion and log percent chaparral also enters as an important variable along with log percent sandy soil.
The fact that log percent chaparral enters the model in the independence case but is absent in the spatial model could be due to a spurious regional effect. At nearby sites both the response and covariate are likely to have similar values due the spatial correlation of each variable. In the sample, high response values seem to be associated with high covariate values, but it is really the proximity of the sites to one another that is driving the relationship. The same is also no doubt true for the PIP increase in the Ant 1 covariate.
The results obtained herein are similar to the previously mentioned analyses of this data. Using an isotropic Matern model with nugget [5], noted that log percent sandy soil seemed to be the most important covariate. Hoeting et al. [4] noted that the highest AIC model contained Ant 1 and log % sandy soil. While the MCMC analysis did not pick the full model under the independence assumption as AIC does [4], certainly more weight was placed on the other covariates. Using a spatial stepwise selection [3] also noted that the best model was one that contained ant abundance and log % sandy soil.
In addition to the model inference obtained through the PARJ method, Figure 1 shows the marginal posterior density estimates for the top four PIP coefficients. Using these distributions direction and magnitude of the covariate effects can be examined after accounting for model uncertainty. Relative to high ant abundance, the presence of low ant abundance negatively influences lizard abundance. This should be expected as the ants are the main prey source. Lizard abundance is positively related to the percent of sandy soil in the substrate. The remaining coefficients, elevation and percent cover, have smaller PIP as can be seen in Figure 1 by the size of the bar relative to the density curve. Elevation seems to have neither strong positive or negative influence as the density curve is centered at zero. There seems to be a positive influence of percent cover, but there is substantial probability (&65%) that it is equal to zero.

Example 2: Abundance of Pollution Tolerant Fish
During the analysis, the PARJ chain visited 419 out of 512 possible models. The top five models, as determined by PMP, are given in Table 5. The results indicate that there is considerable uncertainty; the maximum a posteriori model accounts for only 12.4% of the total probability mass in the informative model and 17.8% of the mass in the flat prior analysis. The top PMP model coincides in each analysis. The models containing the predictors WSA and ELEV in the informative prior analysis ranked lower and had a smaller PMP than in the flat prior analysis. This suggests that WSA and ELEV are not as important to intolerant fish abundance as was described by the informative prior. This also explains the increase in maximum PMP. The data contradict the informative prior which leads to an increase in model uncertainty. Three covariates stand out as having significant probability (w0:5) of inclusion in the regression model, stream order (ORDER), percentage of watershed disturbed by human use (DISTOT), and habitat quality (HAB) ( Table 6). The same is true of the flat prior analysis. The PIPs are much smaller for ELEV and WSA than the informative model prior, leading to the conclusion that they are not as important to intolerant fish abundance a  posteriori. All other covariates remain relatively unchanged with respect to PIP. Figure 2 illustrates the estimated marginal posterior densities for the four parameters: ORDER, WSA, DISTOT, and HAB. ORDER is positively related to intolerant fish abundance. WSA is also positively related to intolerant fish abundance, however, there is not strong evidence of a significant effect. Abundance is negatively related to DISTOT and positively related to HAB. Investigation of the scale of coefficient values for DISTOT and HAB shows that DISTOT seems to have a larger magnitude effect than HAB, suggesting a larger effect of watershed scale disturbance over site level effects. In fact, the PARJ output can be used to determine the model averaged posterior distribution of the ratio of the coefficients for DISTOT to HAB. The posterior probability that DISTOT has a larger magnitude coefficient is 60.4%, providing some evidence of a larger watershed level effect. When both variables are included in the model, however, the 95% highest posterior probability interval for the ratio of absolute coefficient values is 0.00-4.93, indicating the evidence is not strong.

Discussion
To date, Bayesian methods of model selection and multimodel inference (i.e. model averaging) have not been utilized in great numbers of published research endeavors. No doubt this is due to the fact that PMPs are usually never available in closed form for many of the models typically used in ecology. This is also true, however, for information criterion such as AIC when generalized linear mixed models are utilized.
The Partial Analytic RJMCMC presented here should be considered a worthy competitor for information criterion methods such as AIC for spatial regression models. First, there is a practical implementation argument. The PARJ algorithm provides a consistent inference methodology for spatial regression modeling. This same method can be used for normal distribution models as well as spatial GLMMs for non-normal data such as counts. While an ''off-the-shelf'' RJMCMC procedure is often challenging to implement, the PARJ version is not any more difficult to program than a standard Gibbs MCMC sampler for a spatial regression model. One needs only add a model-jumping step. The analyses presented here were programmed in R where run times were on the order of a couple hours for the fish data. The second benefit of a PARJ is that it is Bayesian in nature. The PARJ approach allows information that the researcher possesses about the covariates to enter into the inference. The covariates can be weighted differentially a priori.
The PARJ method presented here is not limited to geostatistical regression models. Any covariance model can be used for the the latent process variable z. The PARJ method can be easily modified to make model inference with time series correlation or correlation due to factor random effects. This fact makes this method even more appealing as a general procedure for correlated data regression modeling.

Supporting Information
File S1 Reversible-jump MCMC details. Additional material for implementing the RJMCMC methods presented in the paper. (PDF) File S2 Convergence assessment. Additional material on assessing convergence of the RJMCMC chains to their stationary distributions. (PDF) File S3 Analysis code R. R code for performing the analysis presented in the paper. (R)