Simplified procedure for efficient and unbiased population size estimation

Population size estimation is relevant to social and ecological sciences. Exhaustive manual counting, the density method and automated computer vision are some of the estimation methods that are currently used. Some of these methods may work in concrete cases, but they do not provide a fast, efficient and unbiased estimation in general. Recently, the CountEm method, based on systematic sampling with a grid of quadrats, was proposed. It offers an unbiased estimation that can be applied to any population. However, choosing suitable grid parameters is sometimes cumbersome. Here we define a more intuitive grid parametrization, using the initial number of quadrats and the sampling fraction. A crowd counting dataset with 51 images and their corresponding manually annotated position point patterns is used to analyze the variation of the coefficient of error with respect to different parameter choices. Our Monte Carlo resampling results show that the error depends on the sample size and the number of nonempty quadrats, but not on the size of the target population. A procedure to choose suitable parameter values is described, and the expected coefficients of error are given. Counting about 100 particles in 30 nonempty quadrats usually yields coefficients of error below 10%.


Introduction
Population sizing is a longstanding problem with a wide range of applications such as security, social sciences and ecology. A population is a finite set of N separate items or "particles" of interest, e.g. humans, birds, etc. Several approaches have been taken to address the problem. The traditional density method [1,2] is widely used by media, police and convention organizers for crowd size estimation, but the estimation usually ignores sampling and relies on imprecise visual estimation. Frequently, bird censuses also lean on visual estimation [3][4][5][6] or exhaustive manual counting [7,8] that is slow, tedious and difficult to verify. Automated computer vision can work in some particular cases with regular patterns on homogeneous backgrounds and non-overlapping particles [9][10][11][12][13][14][15]. However, automatic algorithms are generally biased and may show a poor performance [16].
An unbiased population size estimation method (hereafter the CountEm method) was recently proposed [17]. CountEm can be applied to any kind of particle irrespective of population size and pattern (see for instance Figs 1 and 5 in [17]). The only practical limitation is the basic requirement that all the particles in the population should be unambiguously identifiable for manual counting in the considered image. The method is based on well-known principles of geometric sampling for stereology, which have been previously applied to quantitative microscopy [18,19]. The main idea is to properly sample and count between 50 and 200 particles in order to estimate populations of any size and spatial distribution. Systematic sampling is performed with a uniform random (UR) grid of quadrats, see Fig 1. The forbidden line rule [20] is used to avoid bias due to edge effects: a particle is counted only if it touches the quadrat and does not hit the extended forbidden line of the quadrat (see Fig 1C). The population size estimator, $\hat{N}$, is the total number of sampled particles times the sampling period. The precision of the method was tested [17] on two images with manually annotated particle positions, yielding planar point patterns, e.g. Fig 2A. As the method is unbiased, the only source of error is the sampling variance, which can be estimated empirically via Monte Carlo resampling under identical conditions (see Section Simulation procedure). The empirical variance of the population size estimator was computed among $32^2 = 1024$ Monte Carlo replications of the estimator for each given grid of quadrats. The empirical coefficient of error was in the 5% to 10% range, counting about 50 to 100 sampled particles, for both point patterns, of sizes 1120 and 4633 respectively.
Fig 1C. Forbidden line rule [20]. Only heads marked with yellow arrowheads are counted in the quadrat; the rest are not because they hit the extended forbidden edge (in red).
Some practical criteria were given [17] to choose the grid parameters. However, in practice, the choice of starting values for these parameters may not be obvious to every user. Moreover, note that these practical criteria were only checked to be valid on two pictures.
Here we propose to simplify these practical guidelines, by using a more convenient parametrization of the grid. The new estimation protocol is presented and justified in section Materials and Methods. The protocol and the parameter values are tested on the 51 images of the crowd counting dataset, which is described in The crowd counting dataset. The Monte Carlo resampling procedure used to compute the empirical coefficients of error is described in Section Simulation procedure. The results and conclusions are presented in Sections Results and Conclusions respectively.

Definitions and notation
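We recall the necessary notation [17]:
• N: Population size, i.e. number of particles in the target population.
• t: Side length of a quadrat.
• T: Side length of the fundamental tile of the grid, i.e. the grid period.
• Q: Total number of particles sampled by the quadrats.
• n: Number of nonempty quadrats.
• $\hat{N}$: Population size estimator (Eq 1).
• $CE_e(\hat{N})$: Empirical coefficient of error of $\hat{N}$, calculated via Monte Carlo replications (Eq 7).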
Here we propose two alternative grid parameters, f and $n_0$, which are related to t and T as shown in Fig 2B:
• f: Sampling fraction, $f = t^2/T^2$.
• $n_0$: Initial number of quadrats, $n_0 = B_x B_y / T^2$, where $B_x$, $B_y$ represent image width and height in pixels, respectively.
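The two definitions above can be inverted to move between the parametrizations. The following Python sketch illustrates this; the function names and the example image size (1000 by 800 pixels) are assumptions made here for illustration and are not part of the CountEm software.

```python
import math


def grid_from_fraction(f, n0, width, height):
    """Return (t, T) in pixels for sampling fraction f and initial number
    of quadrats n0 on an image of the given width and height (pixels)."""
    if not 0 < f <= 1:
        raise ValueError("f must lie in (0, 1]")
    if n0 <= 0:
        raise ValueError("n0 must be positive")
    T = math.sqrt(width * height / n0)  # from n0 = Bx * By / T^2
    t = T * math.sqrt(f)                # from f = t^2 / T^2
    return t, T


def fraction_from_grid(t, T, width, height):
    """Inverse mapping: return (f, n0) for quadrat side t and period T."""
    return (t / T) ** 2, width * height / T ** 2


# Suggested starting values on a hypothetical 1000 x 800 pixel image.
t, T = grid_from_fraction(f=0.04, n0=100, width=1000, height=800)
print(f"t = {t:.1f} px, T = {T:.1f} px")
```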
Next we present an outline of the "standard" CountEm method [17] and of the simplified protocol which we propose here.

Outline of the CountEm method
The main steps of the standard CountEm method [17] are:
1. Crop the image, excluding empty regions as in Fig 2A. This step is optional but highly recommended to increase efficiency.
2. Choose suitable values of t and T in pixels.
3. Superimpose the grid uniformly at random on the image, e.g. Fig 1A. Optionally, the grid may be tilted by a given fixed angle in order to avoid alignments of quadrat and particle rows, which would increase the variance [21].
4. Manually count the total number, Q, of particles captured by the quadrats. Use the forbidden line rule to ensure unbiasedness: only particles intersecting the quadrat, and not touching the extended forbidden line (in red in Fig 1C), are counted.
5. Use Eq 1 to obtain the estimated population size, $\hat{N}$.
6. Use Eq 3 of our previous paper [17] to predict $CE_e(\hat{N})$.
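For an annotated point pattern, steps 3 to 5 can be imitated automatically. The Python sketch below is a minimal illustration under stated assumptions, not the CountEm software. Particles are treated as points, so the forbidden line rule is represented by half-open quadrats. The grid is not tilted. The estimator is written as Q times the sampling period, taken here as T²/t². The function name countem_estimate is hypothetical.

```python
import numpy as np


def countem_estimate(points, t, T, rng):
    """One estimate of N from a UR grid of quadrats (untilted grid).

    points: (N, 2) array of particle positions in pixels.
    Returns (N_hat, Q, n): estimate, sample size, nonempty quadrats.
    """
    points = np.asarray(points, dtype=float)
    z = rng.uniform(0.0, T, size=2)      # UR position of the grid within one tile
    local = np.mod(points - z, T)        # position of each point inside its tile
    inside = np.all(local < t, axis=1)   # half-open quadrat [0, t) x [0, t)
    Q = int(inside.sum())
    # Tile index of every sampled point; distinct tiles = nonempty quadrats.
    tiles = np.floor((points[inside] - z) / T).astype(int)
    n = len({tuple(row) for row in tiles})
    return Q * (T / t) ** 2, Q, n


# Usage on a synthetic pattern of 2000 uniformly scattered points (toy data).
rng = np.random.default_rng(0)
pattern = rng.uniform(0.0, 1000.0, size=(2000, 2))
N_hat, Q, n = countem_estimate(pattern, t=20.0, T=100.0, rng=rng)
print(N_hat, Q, n)
```

With real images the counting in step 4 is manual; the automatic count above is only possible because the particle positions are already annotated.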

Outline of the simplified CountEm protocol
The simplified CountEm protocol proposed here consists of the following steps:
1. Apply standard CountEm step 1).
2. Choose suitable values of f and $n_0$. Tentatively one may start with $n_0 = 100$ and f = 0.04.
3. Apply standard CountEm step 3).
4. By cursory inspection, check that Q and the number of nonempty quadrats, n, are approximately in the ranges appropriate for the desired coefficient of error (usually, Q of about 100 and n of about 30 yield coefficients of error below 10%). If Q looks too low, go back to step 2) and increase f. If, on the other hand, n is too low, go back to step 2) and increase $n_0$.
5. Apply standard CountEm steps 4), 5) and 6).
The corresponding software is freely available at http://countem.unican.es. The whole estimation process can be completed in a few minutes.
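The adjustment loop of the protocol can be sketched in Python as follows. This is only an illustration under assumptions. The callable count stands for the user's cursory visual inspection. The thresholds follow the guideline of about 100 particles in 30 nonempty quadrats. The multiplicative step of 1.5, the cap on the number of rounds and the toy counting model are arbitrary choices made here.

```python
def tune_parameters(count, f=0.04, n0=100, q_min=100, n_min=30,
                    step=1.5, max_rounds=20):
    """Adjust (f, n0) until the inspected counts reach the target ranges.

    count(f, n0) must return (Q, n): sample size and nonempty quadrats.
    """
    Q, n = count(f, n0)
    for _ in range(max_rounds):
        if Q >= q_min and n >= n_min:
            break
        if Q < q_min:
            f = min(1.0, f * step)      # too few particles: sample a larger fraction
        if n < n_min:
            n0 = int(round(n0 * step))  # too few nonempty quadrats: use more quadrats
        Q, n = count(f, n0)
    return f, n0, Q, n


def toy_count(f, n0, N=1000):
    """Crude stand-in for visual inspection: Q equals f * N and n is
    capped by the number of quadrats (hypothetical model)."""
    Q = round(f * N)
    return Q, min(n0, Q)


f, n0, Q, n = tune_parameters(toy_count)
print(f, n0, Q, n)
```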

Justification of the protocol
Choosing suitable parameters t and T in pixels, as suggested in our previous paper [17], can sometimes be laborious. Two practical criteria were given [17], namely aim at (i) having Q in the 50 − 150 range, and (ii) counting no more than 4 or 5 particles per quadrat. These two recommendations imply that the number of nonempty quadrats, n, may lie between 20 and 50. The resulting coefficient of error should be in the 5% − 10% range.
Consider an example in which the preceding criteria are not fulfilled. The goal would be to estimate the number of particles N in the image, with coefficient of error below 10%, using the old parametrization t, T. After choosing some initial parameters t and T, suppose that we count Q = 40 in n = 18 nonempty quadrats, with an estimated coefficient of error $ce(\hat{N}) = 15\%$. In order to reduce the error, we should increase Q and n, since both are below the suggested ranges. But how should we proceed? Increasing t to get larger quadrats? Decreasing T to obtain more quadrats? Or both? To what extent?
We propose to replace the parameters t and T with the sampling fraction, f, and the initial number of quadrats, $n_0$, as shown in the preceding subsection. This parametrization is more intuitive and even inexperienced users should find it easy to implement. Reducing the error is straightforward following the simplified procedure with the new parameters as described above.
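A rough way to see how much f should be increased in the example above uses only the unbiasedness of the estimator, written here as $\hat{N} = Q/f$. The symbols $f'$ and $Q_{\mathrm{target}}$ are introduced here for illustration, and the result holds on average only, not for a single grid position:

```latex
\mathbb{E}(\hat{N}) = N, \quad \hat{N} = \frac{Q}{f}
\;\Longrightarrow\; \mathbb{E}(Q) = f N,
\qquad
f' \approx f \cdot \frac{Q_{\mathrm{target}}}{Q}
   = f \cdot \frac{100}{40} = 2.5\, f .
```

Under this heuristic, multiplying f by about 2.5 would be expected to bring Q from 40 to roughly 100 without changing the number of quadrats, whereas the number of nonempty quadrats is addressed separately through $n_0$.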
The validity of the protocol has been checked on 51 images, studying in detail the error ranges corresponding to different sets of parameters.
The empirical squared coefficient of error, $CE_e^2(\hat{N})$, was computed by Monte Carlo resampling for each of the 51 point patterns in the crowd counting dataset. The dataset is described in Section The crowd counting dataset, whereas the details on the calculation of $CE_e^2(\hat{N})$ are given in Section Simulation procedure.

The crowd counting dataset
A total of 51 images were used for two purposes, namely checking the validity of the practical criteria discussed above, and analyzing the optimal f and $n_0$ values to ensure efficient estimation. The crowd sizes, N, vary from 96 to 4633. Fifty of the images and their corresponding point patterns (see Fig 3) were borrowed from the UCF dataset [12]. The additional image is the spectators image (countem.unican.es) shown in Figs 1 and 2, which was already analyzed in our previous paper [17] together with the image corresponding to the largest crowd of the dataset (N = 4633).

Simulation procedure
The empirical squared coefficient of error $CE_e^2(\hat{N})$ was computed by Monte Carlo resampling on the 51 point sets for different choices of parameters {f, $n_0$}. The corresponding parameters {t, T} were calculated as follows (Fig 2B):

$$T = \sqrt{\frac{B_x B_y}{n_0}}, \qquad t = T\sqrt{f}.$$

The resulting grid was tilted by an arbitrary fixed angle of 30° with respect to the x axis, before applying the resampling procedure described in our previous paper [17]. Ideally the angle should be suitably selected for each image in order to avoid alignments of quadrat rows with particle patterns. However, visual inspection (Fig 3) reveals that most of the images have either horizontal spectator rows (0°) or no concrete particle alignments. Therefore choosing 30° for the whole dataset was judged to be reasonable. This problem was addressed in [21], Fig 11. Next we recall the necessary notation to describe the resampling procedure:
• $Y = \{y_1, y_2, \ldots, y_N\}$: finite set of N point particles in a bounded area. We studied 51 such sets (Fig 3).
• $y_i \in Y$: ith point particle of the set.
• $J_0$: fundamental square tile or box of side length T.
• $z \in J_0$: UR point in the fundamental tile.
• $\Lambda_z$: UR systematic grid of quadrats, generated by shifting the lower left corner of a quadrat from an arbitrary initial position in $J_0$ into the UR point z, thus dragging the whole quadrat grid together.
• $Q = Q(Y \cap \Lambda_z)$: random sample size, namely the total number of particles captured by the quadrats.
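In terms of this notation, the resampling can be sketched in Python as shown below. This is not the authors' spatstat code, and several choices are assumptions made here. The K by K shifts of z form a regular subgrid of the tile with a single UR offset. The variance is normalised by the number of replications. The squared coefficient of error is taken relative to the true N. The tilt is applied by rotating the points instead of the grid. The function name empirical_ce is hypothetical.

```python
import numpy as np


def empirical_ce(points, t, T, K=32, angle_deg=30.0, seed=0):
    """Empirical mean, variance and squared CE of N_hat over K*K grid shifts."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    N = len(pts)
    # Rotating the points by -angle is equivalent to tilting the grid by +angle.
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), np.sin(a)],
                    [-np.sin(a), np.cos(a)]])
    pts = pts @ rot.T
    # Systematic K x K subgrid of shifts z_k within the tile, with one UR offset
    # (assumed reading of "random subgrid").
    offset = rng.uniform(0.0, T / K, size=2)
    steps = np.arange(K) * T / K
    estimates = []
    for dx in steps:
        for dy in steps:
            z = offset + np.array([dx, dy])
            local = np.mod(pts - z, T)                       # position inside the tile
            Q = np.count_nonzero(np.all(local < t, axis=1))  # half-open quadrats
            estimates.append(Q * (T / t) ** 2)               # Q times sampling period
    estimates = np.asarray(estimates)
    mean = estimates.mean()
    var = estimates.var()            # divides by K*K (assumption)
    return mean, var, var / N ** 2   # CE^2 relative to the true N (assumption)


# Usage on a synthetic pattern of 1000 uniformly scattered points (toy data).
pattern = np.random.default_rng(1).uniform(0.0, 1000.0, size=(1000, 2))
mean, var, ce2 = empirical_ce(pattern, t=20.0, T=100.0)
print(mean, np.sqrt(ce2))
```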
For each pair {t, T}, a total of $K^2 = 32^2 = 1024$ replicated superimpositions of the grid $\Lambda_z$ onto Y were generated, corresponding to $K^2$ systematic replications $\{z_k,\ k = 1, 2, \ldots, K^2\}$ of the point z within $J_0$. These $K^2$ positions were arranged in a random subgrid within $J_0$, which should be expected to be more efficient than independent random replications [17]. For each k, the estimate corresponding to the sample total $Q(Y \cap \Lambda_{z_k})$ was computed automatically from Eq 1 using the spatstat package [22]. The empirical mean, variance and squared coefficient of error of $\hat{N}$ (Eq 7) were then computed from the $K^2$ replicated estimates.

Results: Justification of the recommended parameter values

Conclusions
CountEm is an unbiased and efficient population size estimation method. It can be used irrespective of population size and pattern, and can be applied to humans, animals, or indeed to any kind of distinguishable particles. We have proposed new parameters to characterize the grid of quadrats, namely the sampling fraction, f, and the initial number of quadrats, $n_0$. A crowd counting dataset containing 51 images and the corresponding position point patterns has been used to analyze the suitable parameter values and the resulting coefficients of error. Population size has been shown to have no impact on the coefficient of error of the estimation; only the sample size, Q, and the number of nonempty quadrats, n, are relevant. Usually Q ≳ 100 and n ≳ 30 yield coefficients of error below 10%. The suitable parameter values depend on the order of magnitude of the population size and on the spatial distribution. For the sizes and spatial distributions of our crowd counting dataset, $n_0 = 100$ and f = 0.04 are reasonable initial values. We believe that the reparametrization defined here allows a more intuitive and faster choice and/or adjustment of the working parameters.