Estimation of Clustering Parameters Using Gaussian Process Regression

We propose a method for estimating the clustering parameters in a Neyman-Scott Poisson process using Gaussian process regression. It is assumed that the underlying process has been observed within a number of quadrats, and from this sparse information the distribution is modelled as a Gaussian process. The clustering parameters are then estimated numerically by fitting to the covariance structure of the model. It is shown that the proposed method is resilient to any sampling regime. The method is applied to simulated two-dimensional clustered populations and the results are compared to a related method from the literature.


Introduction
In ecological studies, consideration of spatial structure can lead to useful insights regarding the process of interest (see, for example [1][2][3]). If the process exhibits clustering, then parameterised models which describe the spatial structure can provide a greater understanding of the behaviour of the species [4]. However, accurately determining parameters that describe a population can be a difficult problem within the natural environment. Measurements are often expensive and difficult to make, and so usually only a sparse sample of a population is available. In this paper we propose a new method for estimating the parameters of a cluster process from a sparse set of quadrat samples with arbitrary design, i.e. any sampling design such as transects, random etc. can be used without damaging the estimator.
The key theoretical framework upon which this work is built is the Gaussian process (GP), which can be defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [5]. The Gaussian process framework can provide a useful tool for modelling stochastic processes and has seen much attention recently within the machine learning community, where it is used to solve regression and classification problems. GP regression has been widely studied within the field of geostatistics where it is known as kriging [6]. Under this guise it has also been given limited attention within the field of ecology (see, for example [7][8][9]). The kriging equations represent a special case of a Gaussian process.
In this paper we prove that the GP framework provides a useful route to estimating the parameters of a stochastic process, and has several advantages over previously published methods of achieving the same. The Neyman-Scott Poisson process is used as an example in this derivation, as this general cluster model has been widely studied and applied to naturally occurring populations (see, for example, [10][11][12][13]). The estimation is performed by fitting a theoretical covariance function to the empirical GP counterpart by numerical optimisation.
The standard approach to estimating the parameters of a cluster process is based upon the K-function developed by [14]. This estimation procedure assumes that the spatial process has been mapped over the whole survey area. This is not always practical and so alternative methods have been developed based upon line transect surveys. Most of these have only been developed to estimate the mean intensity of a population based upon a partial mapping, and cannot separate all of the parameters. However, Cowling [4] developed a method for estimating all the parameters in a two-dimensional Neyman-Scott process based upon a onedimensional K-function along the transect line. An error in Cowling's derivation of the K-function was corrected in [15], where it was concluded that the corrected method clearly outperformed competing methods.
In this paper an alternative method for Neyman-Scott parameter estimation is developed, based upon GP regression. An experimental arrangement similar to Cowling is used, with the same transect sampling design. Experiments with alternative sampling designs are then tested to demonstrate the resilience of the technique.

Related Work
Within the ecological literature the standard approach to estimating the parameters of a cluster process based upon Ripley's K-function [14]. For a stationary isotropic process with intensity t, it is defined as K(h)~t {1 E½number of events within distance h of an arbitrary event ð 1Þ The K-function for a Neyman-Scott process of dimension d is given by Cressie [16] where l is the intensity of the parent process, m is the number of events per cluster and F (h) is the distribution function for the distance between two events in the same cluster. If K(h,m m,l l) is the K-function evaluated at estimates of m and l, andK K(h) is a nonparametric estimator obtained from the data, then a least-squares estimate of m and l is obtained by minimising the ad-hoc criterion: where c and h 0 are tuning constants. The above estimation procedure assumes that the spatial process has been mapped over the whole survey area. This is not always practical and so Cowling and Aldrin [4,15] developed a method for estimating all the parameters in a two-dimensional Neyman-Scott process based upon a one-dimensional K-function along a transect line. The key steps in this method are reproduced below.
Cowling introduces a normal detection function g(x), which is the probability of detecting an offspring at a distance x from the transect line.
The two parameters g 0 external data, and are assumed known. The K-function for the detected points projected onto the transect line is then derived as where W is the distribution function of the standard normal distribution. The empirical K-function is given bŷ where L is the length of the transect, n is the number of detected points, and y i and y j are the positions along the transect line. The parameters l and r are then estimated by fitting the theoretical Kfunction to its empirical counterpart. Furthermore, given that m can be estimated by substituting E(n) by the observed n, and l and s by their estimates.

Gaussian Process Regression
In this section the GP regression methodology is outlined. For a fuller explanation the reader is directed to [5]. Consider a dataset D D~f(x 1 ,y 1 ),(x 2 ,y 2 ), . . . ,(x n ,y n )g which contains n observations of some scalar variable y, taken at locations x. The dataset in equation 8 can be more compactly represented as D~fX ,yg where X is an n by d array of measurement locations, which will be referred to as training points, and y is a vector of the observations at those locations. Similarly, if predictions are to be made at more than one location, X Ã refers to an m by d array of test points, and y Ã is the predicted output at these locations. The distribution of the training outputs and the test outputs is jointly Gaussian with dimension nzm, mean m and covariance S y :N m, K(X ,X )zs 2 n I K(X ,X Ã ) where m is a matrix containing the mean function evaluated at each of the n training points and m test points.
The covariance matrix in equation 9 has been partitioned to give the covariance matrices between the test points K(X Ã ,X Ã ), training points K(X ,X ), the cross covariance between both sets K(X ,X Ã ) and its transpose K(X Ã ,X ). The values of y in the data set are not the actual function values, only noisy realisations of them. To account for this s 2 n is added to the leading diagonal of the training covariance matrix.
Both the conditional and marginal distributions of a joint Gaussian distribution are Gaussian. It is this property which makes the Gaussian distribution appropriate for stochastic modelling, as closed form expressions for these distributions can be derived. Because y is known, it is possible to condition the joint Gaussian prior distribution on the observations [17] to give expressions for the mean and variance of the posterior GP: where y y Ã~K (X Ã ,X ) K(X ,X )zs 2 As in the case of kriging, the appropriateness of the GP model is entirely dependent on the choice of covariance function which has form arbitrarily selected by the user. Within GP literature [5] a popular form for the covariance function is the squared exponential where l is a length scale that determines the strength of correlation between points. As two points are separated by a large difference, the covariance will tend to zero and the GP variance at test points far from measurements will tend to the underlying variance of the function s 2 f . The parameters s n ,s f and l can all be varied, and doing so will affect the resulting GP model. The free parameters associated with any form of covariance function are referred to as hyperparameters. Also of interest is the marginal likelihood P(yDX ) which can be obtained directly by considering that y*N (0,K(X ,X )zs 2 n I).
The marginal likelihood, or its logarithm, gives a measure of how well the covariance function explains the training data. The absolute value of the log marginal likelihood (LML) is dependent on the dataset but for any given dataset the LML can be used to compare different forms of covariance function and tune the hyperparameters.
Essentially, instead of trying to fit a parametrised model to the underlying function f , the GP (or the geostatistical) approach uses Figure 1. Two quadrats (shaded grey boxes) sampling a clustered distribution. Consider a realisation of a Neyman-Scott process where invisible parent events independently produces of children. The survey field is of size W |W and is to be sampled using square quadrats of side w, thus the measured quantity is the number of children observed in each quadrat. doi:10.1371/journal.pone.0111522.g001 a parametrised model to describe the covariance. The key assumption is that the covariance of the process can be described using a simple parametric model.
As one would expect, in equation 12 the estimate of y Ã is a weighted average of the observations y. The form of equation 13 is also intuitive. The term K(X Ã ,X Ã ) represents the prior covariance between the test points before any observations are made. When observations are made at locations X then the covariance in the prior will be decreased. The extent of this decrease depends upon the correlation between the observation locations and the test points, this is captured in K(X ,X Ã ). As a covariance function is by definition positive semi-definite, observations always result in information gain. However, these observations are also correlated with each other, and not as informative as would be expected if they were independent. Hence the inverse term K(X ,X )z ½ s 2 n I {1 decreases the information gain accordingly. It is also interesting to note that y does not appear in equation 13, thus the variance only depends on the location of the observations, not on the value of the observations themselves.
The main computational burden in computing the mean function and the variance from equations 12 and 13 comes from the inversion of the training point covariance matrix, K(X ,X ).

Estimation of Clustering Parameters
Consider a realisation of a Neyman-Scott process where invisible parent events are Poisson distributed with intensity l per unit area; each parent independently produces a Poisson    The survey field is of size W |W and is to be sampled using square quadrats of side w, thus the measured quantity is the number of children observed in each quadrat. N is the unknown number of clusters within the region W 2 , centred on parents f(px 1 ,py 1 ),(px 2 ,py 2 ), . . . ,(px N ,py N )g: The covariance of the counts c 1 and c 2 measured in two quadrats covering regions Q 1 and Q 2 , separated by distance h (see Figure 1) is defined as: Let W i be a bivariate Gaussian distribution centred at a parent location (px i ,py i ) then the expected number of children from cluster i that will fall within Q 1 is given by and the covariance between two quadrats separated by distance h can be obtained by substituting into equation 16: cov(c 1 , Expanding this expression gives The placement of cluster i is arbitrary, and could be anywhere within the region W 2 with equal probability. Evaluating the expectation operator, and noting that N~lW 2 , gives cov(c 1 ,c 2 )~m 2 l ð ð Figure 6. Covariance structure of GP models. The solid black line shows the mean covariance between quadrats on the true intensity map, plotted against separation (binned into unit intervals In this implementation Equation 22 was evaluated by solving the inner integrals dx and dy to give a closed form solution in terms of the error function, and then the outer integrals dpx i and dpy i were evaluated numerically.

Estimation of simulated process parameters
Events were simulated from nine clustered populations using a two-dimensional Neyman-Scott model within a 100|100 square survey area. In each case the product ml was equal to 0:25, hence the expected number of events is the same for each population, however the extent of clustering varies. One realisation of each population is shown in Figure 2 and the simulation parameters are given in Table 1. Note that s3 exhibits the most clustering, and s7 the least. Edge effects were minimised by simulating the process for a 200|200 region, and then only considering events lying within a central 100|100 square.
The point process was then converted into an intensity map by dividing the survey area into unit quadrats and counting the number of events lying within each quadrat. One realisation of the intensity maps for each population is shown in Figure 3. These maps served as a 'ground truth', and represent the function that the GP attempted to model using sparse data.
In order to be consistent with the experiment carried out by Cowling, nine vertical, equally spaced transects of length 100 were used as the training data for the GP. All quadrats along each transect were used as training data, the effective area surveyed was thus 9% of the total study area.
The GP models were generated using a squared exponential covariance function with additive Gaussian noise. The hyperparameters for the covariance function were selected by maximising the LML. Figures 4 and 5 show the resulting GP mean and variance for the realisation of the populations depicted in Figures 2 and 3. The variance plots show how the uncertainty increases with distance from the vertical transects. For populations s1, s2 and s3 where r~2 the variance increases relatively rapidly from the sample locations, as the offspring are clustered closely to the parents and inference is weak beyond the extents of a cluster. For populations s7, s8 and s9 where r~8, the pattern is much smoother. The variance is almost constant across the survey field, only rising at the edges where extrapolation is occurring rather than interpolation.
For each population, Figure 6 shows the average covariance between samples plotted against their separation. The GP squared Estimation of Clustering Parameters Using Gaussian Process Regression exponential covariance function with the learnt hyperparameters is superimposed in each case.
The parameters of the Neyman-Scott process were then estimated by fitting the theoretical covariance function given in Equation 22 to the optimised GP covariance function. In practice this was performed by minimising where k(h,s f ,l) is the squared exponential covariance function given in Equation 14 evaluated for two points distance h apart with hyperparameters s f and l; cov(h,l,r,m) is the theoretical covariance function;l lm is an estimate of the mean intensity obtained by taking the mean of the training points; H is the maximum range for the minimisation such that k(h §H,s f ,l)&0.
The simulation was repeated 100 times and the means and standard deviations of the parameter estimates are given in  Estimation of Clustering Parameters Using Gaussian Process Regression Table 2. Corresponding results from Cowling are also reproduced as a benchmark, however a direct comparison is not entirely appropriate for two reasons. Firstly, because to our knowledge these results were not recomputed and republished once an error in Cowlings method was corrected in [15]. Secondly, Cowling's method assumes that an observer is travelling along the onedimensional transect line, and events become exponentially more difficult to detect as perpendicular distance from the transect increases. Hence although the effective area surveyed was 9% of the total study area (as in this experiment) the detection function ensures that only some fraction of events are detected. This contrasts with the GP estimator, where it is assumed that all events within a quadrat are detected. Table 2 shows that in the strongly clustered populations s2, s3 and s6 the clustering parameters can be estimated much more reliably than in the weakly clustered populations s4, s7 and s8. In the weakly clustered populations Cowling's method fails to detect any noticeable clustering, and so parameter estimation was not attempted. The GP estimator typically detects the weak clustering and tends to fit a smooth function with a very large length scale (as shown in Figures 4(d),(g) and (h)). However, occasionally (i.e. 9 runs out of 100 for population s7) the GP fails to detect the clustering, and instead overfits a spiky function with a very short length scale. Despite these outliers, on average the GP gives reasonable, if highly variable estimates for the weakly clustered populations.
In all cases the GP estimator tends to overestimate r and m. All the parameter estimates compare favourably with the K-function method of Cowling, however it must be remembered that the two experiments involve a different form of information loss: Cowling's method assumes that not all events are detected, and the GP method aggregates all event locations into a single quadrat count and so some high resolution position data is disregarded.

Alternative Sampling Designs
The previous section demonstrated that a GP can be used to estimate clustering parameters with an accuracy that compares favourably with existing methods. However the real advantage of this method is its resilience to arbitrary sampling designs. The experiment was repeated for the most clustered population s3 using three different sampling strategies: samples at uniformly distributed random points; 'block' sampling (nine 10610 grids of samples with equal spacing between each grid, see figure 7); and a random walk starting at a random location. Figure 7 shows the GP mean and variance, along with the sample locations for the first run of this experiment. The transect and block patterns were identical for each run, but the random point and walk patterns were different for each run.
All of the new surveys were designed to give the same 9% coverage of the survey field. The results are shown in Table 3, with the corresponding results from Table 2 reproduced for  comparison. For this population, on balance the transect sample seems to be the most consistent estimator of the parameters. We postulate that this is because transect sampling gives reasonably even coverage of the survey field, plus contiguous samples. The former is required for the model to detect the intensity of the clusters (l), while the latter helps the model fit to the distribution of the clusters themselves (m and r). Random point sampling provides the best coverage of the survey field and so the GP mean is likely to be the most representative of the true pattern. This is evident from Figure 7(d) where almost all of the clusters are detected and modelled. However, the random point sample provides the worst estimate of r. The other sampling designs all contain contiguous samples, which would be preferred in this case because the covariance is only detectable with separation between quadrats of up to approximately 5 units. Figure 7(f) provides an indication of why the block sample estimates have a very high variance; the GP mean will be unrepresentative of the underlying pattern on many of the runs if little or no clusters fall within the blocks. However when a cluster does fall within one of the blocks it will be modelled precisely, hence the good estimate of r. These results show that while some survey designs are better than others, the GP estimator is applicable to any survey design and provides a similar estimate of the parameters in all cases.
To investigate how sensitive the method is to sample coverage, the transect sample experiment was repeated with a varying number of transect lines. As before, the experiment was run with 100 realisations of population s3 with equally spaced transect lines covering 5% to 30% of the survey area in steps of 5%. The resulting parameter estimates are shown in Figure 8. From this figure, the diminishing returns on increasing sample size are apparent. The variability in the estimate decreases significantly from 5% to 20% coverage, but increasing coverage further has little noticeable effect.
In these experiments unit quadrat size is arbitrarily used, but in practice choice of quadrat size relative to the survey area would be another variable to consider when designing the sample. Too small a quadrat size will result in a very sparse intensity map, which may result in a GP model that is 'overfitted' to a small number of detected events. If the quadrat size is too large, multiple clusters of events will be aggregated within a single quadrat, and this loss of resolution may result in a GP model that is 'underfitted', with individual clusters smoothed out. In both these extremes one would expect the estimator to perform poorly. In these experiments, the unit quadrat size is less than the standard deviation of the clusters (r~2,4,8) but of a similar order of magnitude; in practice we would recommend a similar approach if some prior knowledge of the underlying process is available.

Conclusions
We have shown that a GP can provide a good model for a stationary, isotropic cluster process such as the Neyman-Scott model. As a GP is completely defined by its mean and covariance function, these provide a good proxy for the parameters of a cluster process. The GP model can be used to estimate these parameters, with a precision and accuracy which compares well with other methods within the literature. The main advantage to the GP method is its resilience to arbitrary sampling design.