A Generalized Approach to the Modeling of the Species-Area Relationship

This paper proposes a statistical generalized species-area model (GSAM) to represent various patterns of species-area relationship (SAR), which is one of the fundamental patterns in ecology. The approach enables the generalization of many preliminary models, as power-curve model, which is commonly used to mathematically describe the SAR. The GSAM is applied to simulated data set of species diversity in areas of different sizes and a real-world data of insects of Hymenoptera order has been modeled. We show that the GSAM enables the identification of the best statistical model and estimates the number of species according to the area.


Introduction
The variation in the number of species with area, known as species-area relationship (SAR), is one of the most important ecological patterns [1]. The models of SAR enable the prediction of the number of species that coexist and share resources, as well as the impact of the extinction of species caused by habitat loss. Sampled data for a single species, or all species of a specific trophic level within a particular site have shown that the SAR has a welldefined shape, most often described by power and exponential curves [2]. The number of species in an area increases with increasing island area, but the rate of increase slows for larger islands. Many hypotheses have been proposed to explain the SAR [3,4,6]. For instance, some are based on the immigration and extinction of species [4], random sampling processes [6] or the Second Law of Thermodynamics [5].
These different hypotheses have generated many mathematical models for the description of the SAR [1][2][3][7][8][9][10]. The early models were based on deterministic modeling, which assumes that every set of variable states is uniquely determined by the parameters in the model. For instance, Arrhenius considered that the number of species (S) is related to area (A) through a power law form [11] (called the power-function), i.e. S~S 0 A z (or log S~log czz log A), where S 0 represents the number of species in a unit area (A~1) and 0vzv1. Due to the random nature of the sampled data, statistical modeling is more suitable for SAR description than the deterministic approach [6], therefore, many statistical models have been developed (e.g. [2]). Moreover, statistical models can be thought of as general cases of deterministic models, because the mean value of the random variable of interest yields the results of the deterministic model.
Because there exist many models to address the SAR (e.g. [3,9,12]), a natural question is how to select the best model for a given data set. To address this issue, here we integrate different models within a common framework and consider the problem of curve fitting by the transformed generalized linear model (TGLM) [13]. We propose the use of the generalized species-area model (GSAM) to describe the SAR. GSAM includes many models, such as those described in [3,9] as special cases. We also consider a model that simulates the colonization process of a region by different species and show that the GSMA has best fitted the data in comparison with traditional power-curve models. Finally, we use the data on the cumulative species richness of parasitic Hymenoptera from 25 nested plots in a beech forest on limestone [14]. Our results show that the GSAM can determine the best model for the data and estimate the number of species accurately.

Generalized species-area models
In species-area curves, the number of species (S) is the dependent variable and the area (A) is the explanatory variable. Some mathematical models of SAR propose that the number of species is related to the area as where b is a p-parameter vector [1]. Function m can be derived from laws governing the physical system that gave rise to the data. As such models are deterministic and the properties related to the random nature of variable S are neglected, the deterministic models are often inadequate due to the stochastic nature of the data [6]. The statistical modeling usually assumes that Eq. 1 can be written as where fe i g are independent and identically distributed random noises (i.i.d) -usually, e i *N(0,s 2 ). Note that the mode in Eq. 2 is a generalization of the determinic model, i.e. E(S i )~m(A i ,b). The model given by Eq. 1 can be understood as a particular case of the model EfL(S i ,l)g~m(A i ,b), where L( Á Á Á ,l) is a monotonic transformation and l is a scalar parameter defining such a transformation. For instance, in cases whose data suggested by Eq. 2 are unsatisfactory, the experimenter can assume a model with logarithmic transformation, i.e.
This paper proposes a new model called generalized species-area model (GSAM), which is based on the TGLM approach proposed in [13]. The GSAM works with a general parametric family of transformations from the dependent variable S to S (l)~L (S; l) and postulates that the transformed random variable S (l) follows a continuous probability distribution belonging to the exponential family. Furthermore, the GSAM assumes that there exists some l value such that S (l) satisfies the usual assumptions of the generalized linear models (GLM) [15]. A suitable choice of the family of transformations enables the representation of power-curves, their recent extensions (see [11,[16][17][18][19]), the models presented in [2,3,9] and the logarithmic model described in [16] as special cases of the GSAM. We have considered the Box-Cox power transformation [20], which is effective at turning skewed unimodal distributions into nearly symmetric normal-like distributions.
Let S~(s 1 , . . . s n ) T be the vector of observations. By using we can obtain the transformed observations S (l)~( s (l) 1 , . . . s (l) n ) T . The GSAM assumes that there exists some l value such that the transformed random variables fS (l) 1 , . . . S (l) n g can be considered independently distributed. Each S (l) i follows an exponential family distribution with a probability density function of the form where b(x) and c(x,w) are appropriate known functions. The dispersion parameter w is assumed to be the same for all observations. The mean and variance of S (l) i are, respectively, The GSAM also considers a systematic component given by where the link function g( : ) is a known one-to-one continuously differentiable function and x i is a specified vector (p|1) of the explained variables, which include the area, known functions of the area, and other environmental variables. Matrix X whose rows are vectors x T i , i~1, . . . ,n, is a specified n|p model matrix of full rank pvn and b~(b 1 , . . . ,b p ) T is a set of unknown linear parameters to be estimated. The link function is assumed to be monotonic and differentiable.
The GSAM proposed here considers three components of structural importance: (i) the Box-Cox family of transformations (Eq. 4) in association with a more general form for the distribution of the transformed variable S (l) (Eq. 5); (ii) a linear predictor function and (iii) a possible nonlinear link function for the regression parameters (Eq. 6). Moreover, when the variance function V (m i ) is not constant, i.e. when the variance is correlated with mean m, some distributions of the exponential family enable the handling of data presenting heteroscedasticity. In this context, GSAM is a generalization of the previous mathematical models that describe the SAR.
Species-area relationship models. Many models have been proposed for the description of the SAR and some can be linearized by a logarithmic transformation of the response variable (i.e. diversity of species). Models that are special cases of the GSAM have the following properties: (i) the transformation parameter in Eq. 4 is l~0, i.e. a log-transformation is adopted, S (0)~l og (S) and m~EfS (0) g~Eflog (S)g or no transformation is considered for variable S, i.e. we assume l~1 and m~EfSg; (ii) the distribution in Eq. 5 is the normal distribution; and (iii) the link function is the identity function, g(m)~m and the systematic component in Eq. 6 is given by m~Xb. The elements of matrix X may be area A, log (A) or additional variables, as in [12]. Very simple forms of the systematic component are given by m~b 0 zb 1 log A or m~b 0 zb 1 A. This special case of the GSAM can be understood as the particular cases proposed in [11,16]. A list of some models of SAR is provided.
1. Considering the stochastic nature of variable S, which represents the number of species, the power model proposed by Arrhenius [11] can be written as, The logarithm of variable S yields where the parameters of the power-curve in log-log space are b 0~l n (b 0 ) and b 1~b1 .
Some species-area relationships may also be represented by linear functions, specifically:

Linear model proposed by MacArthur and Wilson [4]
EfSg~b 0 zb 1 A: 6. Logarithmic function proposed by Gleason [16] EfSg~b 0 zb 1 ln (A): 7. Quadratic logarithmic function proposed by Gitay et al. [22] EfSg~b 0 zb 1 ln (A) f g 2 , ð21Þ or 8. General power-logarithmic function proposed by Gitay et al. [22] EfSg~b If b 2 is any real number and Db 0 DwDb 1 ln (A)D, from Newton's generalized binomial expansion we obtain where the binomial coefficients with an arbitrary upper index can be defined as Therefore, the logarithmic function model can be written as Full-scale generalized species-area relationship model. The right side of all equations presented in the previous section always involves polynomial terms, such as ln (A), A and/or 1=A. Here, we propose a generalization of these models by considering the right side of the full-scale model consists of three polynomials, i.e. and The left sides of those equations have the number of species (S) or ln (S). In order to generalize them, we have assumed that S is a random variable and considered the Box-Cox power transformation (Eq. 4). The curves defined by the GSAM assume a linear predictor function and a nonlinear link function g(m) for the systematic component (Eq. 6). The systematic component of the GSAM is given by The GSAM has other models as special cases. For instance, the persistence function, P2-full model, is a special case of GSMA if we consider the identity link function g(m)~m, where m~E log (S) f g , i.e.
Although the model defined by Eq. 30 has a large number of parameters (theoretically, it can have an infinite number of parameters), in practice the fitted models have no more than six parameters. The advantage of such a model is that it enables the formulation of hypothesis testing for the choice of the parameters to be removed from those that are significant for better describing the SAR.
Model fitting. The parameters to be estimated in the GSAM are l, b and w (Eqs. 4 and 6). In order to obtain maximum likelihood estimates for the vector of parameters b and dispersion parameter w, we have defined a profiled likelihood function for l and used the same algorithm proposed in [13]. By assuming the model given by Eq. 5, the log-likelihood function for the vector of the transformed observations S (l)~( s (l) 1 , . . . ,s (l) n ) T can be written as where h i~Ð V {1 (m i )dm i~q (m i ) and J(l,s i )~Ds i D (l{1) , i~1, . . . ,n is the Jacobian of the transformation from S to S (l) .
The procedure described in [13] is used for making inferences about parameters (b,w) first assuming that l is fixed and obtains the log-likelihood equations for estimating b (l) and w (l) . The maximum likelihood estimates (MLE) of b, g, m and w for a given l are denoted byb b (l) ,ĝ g (l)~Xb b (l) ,m m (l)~g{1 (ĝ g (l) ) andŵ w (l) , respectively.b b (l) can be calculated, without knowledge on w (l) , adjusting the GSAM (Eqs. 5-6) to S (l) by iteration.
The iteration starts with an initial set of valuesb b (l)(k) , k~1, used to evaluate W (l)(k) and  Table 4. Normal GSAM model fitted by the systematic component shown in Table 2. and z (l)(k)~( z (l)(k) 1 , . . . ,z (l)(k) n ) T is a working vector whose components are given by The next estimateb b (l)(kz1) can be obtained bŷ This new value is used to update W (l)(kz1) and z (l)(kz1) and the procedures are repeated until convergence has been achieved.
Estimating parameter w (l) is more difficult than estimating b (l) . In principle, w (l) could also be estimated by maximum likelihood, although there may be practical difficulties associated with this task for some members of Eq. 5. Details about the technique used for finding the MLEŵ w (l) for a fixed l can be found in [13].
In order to obtain the MLEl l, we replace MLEb b (l) andŵ w (l) in (33), which results in the profile log-likelihood function l P (l)~l(b b (l) ,ŵ w (l) ,l). The plot of the profile likelihood function l P (l) against l for a sequence of values of l numerically determines the MLE for l. Once the MLE forl l has been obtained, it can be used to produce the unrestricted estimateŝ b b~b b (l l) andŵ w~ŵ w (l l) .
Assuming that the estimatedl l is known, the confidence intervals for parameters b (l) and w (l) can be calculated in the usual context of the GLM and using the adjusted valuesb b (l) and w w (l) . We consider the approximate covariance matrix ofb b (l l) and the variance ofŵ w (l l) given in [13] to make inferences about these parameters. Here, we have considered the gamma, Gaussian and inverse Gaussian distributions for the probability density function (Eq. 5) We also performed likelihood ratio (LR) tests [23] using a statistic w~2fl P (l l){l P (l (0) )g, which has an asymptotic x 2 1 distribution for testing l~l (0) and constructed a large sample confidence interval for l by inverting the LR test.

Simulation of the colonization process of a region
The parameters of SAR curves are determined from the survey data. As a proof of concept we first used simulated data for 80 species placed in a 50|50 cell lattice according to a neutral model [24] without dispersal limitation, as applied in [25]. This lattice was then resampled so as to establish the shape of the SAR (Figure 1).
The adjusted models are variations of the full-scale model with six parameters, given by g(m)~b 0 zb 1 ln (A)zb 2 ln A ð Þ 2 The following canonical link functions were considered: (i) g(m)~m for the normal GSAM, (ii) g(m)~1=m for the gamma GSAM and (iii) g(m)~1=m 2 for the inverse Gaussian GSAM. Table 5. Moreover, the traditional models presented in the previous section were considered by assuming that the random variable S is normally distributed. We could estimate the mean of the transformed data E(S (l) )~m, but to predict the expected value of the untransformed dependent variable S, when the GSAM is adjusted to the data, E(S) must be estimated. The dependent variable S can be explained by subtracting m on both sides of Eq. 4 and solving this equation for S. When l=0 we can write The expected value of the species number S, on the original scale, can be evaluated by a first-order approximation of the binomial expansion (Eq. 38), as given in detail in [13]: The best model can be chosen by using the AIC and BIC criteria [26], which are measurements of the relative goodness of fit of a statistical model for a given set of data. The mean square error (MSE) and mean absolute percent error (MAPE) are given, respectively, by whereŝ s 2 s is the sample variance of S andŝ s i is the estimate of s i given by Eq. 39. Table 1 shows some of the traditional models adjusted with normal error. The selected model was the logarithmic quadratic function proposed by [21], with minimum AIC~221:09, BIC~244:39, MSE~1:02% and MAPE~2:50%. Table 2 shows the selected models adjusted by the GSAM with normal, gamma and inverse Gaussian (I.G.) distributions. Note that the adjusted models are variations of the full-scale model. Table 3 shows the GSAM models fitted withl l adjusted by the profile likelihood. The selected normal GSAM has minimum AIC~79:90 and BIC~109:02, which are the lowest values among the adjusted models. MSE and MAPE of the model are also smaller than those of the adjusted model with gamma and inverse Gaussian distributions. The adjusted value of parameter l isl l~0:822 with confidence interval (0:729; 0:922). Because l is different from zero or one, there is a significant difference between the results achieved with this model or by using the traditional models given in Table 1. Therefore, for this data set, the normal GSAM is the model that has best fitted the analyzed data. The MLE estimates of the systematic component and standarddeviation (SD) of the systematic component are shown in Table 4. Figure 1 shows the systematic observation of SAR on the original scale and the fitted curve with the adjusted GSMA models. E(S) was calculated by Eq. 39 and for the adjusted GSAM we obtained MSE~0:08% and MAPE~0:61%, respectively.

Application to real data
The GSAM model was applied to a data set that consisted of 25 observations of parasitic insects of the Hymenoptera order in a beech forest on limestone. Hymenoptera is one of the largest orders of insects that comprise sawflies, wasps, bees and ants. The total number of Hymenopteran species in Europe exceeds 20,000. The data considered here contain the summary of a long-term study of the ecology of parasitic Hymenoptera in a German beech forest, i.e. the Göttingen forest, which is approximately 120 years old and has grown over a ground limestone. The climate of the forest is typical of Central Europe and the work area covered approximately four acres. The study was conducted for 8 years (starting in 1980) in 144 square meters of forest soil.
The analysis of the SAR for Hymenoptera is essential, because the insects that belong to this order are the most important environmental agents fundamental for nutrient recycling and control of harmful species. The group is ubiquitous and it is common sense to assume that there is at least one species of parasitic insects for each species of herbivore insects [14]. Many of such species can be considered for the biological control of plague in agriculture. For instance, wasps, from Symphyta suborder, are plague conifers in the Northern hemisphere and several species of ants cause losses of millions of dollars for agriculture. Such insects act as special indicators and enable the inference of the diversity of arthropods of a broad spectrum of niches. Hymenoptera parasitoids are sensitive to environmental pollution, therefore fluctuations in their population are observed earlier than in their hosts [27]. This sensitivity makes this group an ideal candidate for studies on conservation. Therefore, the knowledge on how the number of species scales with area is fundamental for the prediction of the impact of such insect parasitic on both ecosystems and agriculture. Table 5 shows the species richness in different sample areas (see also [14]). We modeled the data by taking into account all the models presented in previous sections. The results show that the normal GSAM withl l~1 was the best fitted model. No transformation of the original data was necessary: E(S)~b 0 zb 1 ( ln A) 2 zb 2 =A: The parameters of the fitted model are shown in Table 6. The fitted mean together with the data provided in Table 5 are shown in Figure 2. The adjusted model resulted in AIC~220, BIC~238, MSE~1:86% and MAPE~6:69%, therefore, the GSMA model has proved very accurate. Our fit has provided a very good description of the increase observed in species richness and differs from the simple power-function presented in [14]. Interestingly our best fitting model includes features of the modified persistence model [18], but it has not been predicted by any recent macroecological theory, which calls for a fresh look on the patterns and constraints of spatial species distribution.

Conclusions
The generalized species-area model (GSAM) proposed here has provided a generalized model to mathematically describe the SAR. The GSMA can reduce the efforts devoted to finding the best model and can more accurately represent the effect of the area over the diversity of species than the power-curve models commonly used. This fact has been verified in simulated and realworld data.