Assessing Rotation-Invariant Feature Classification for Automated Wildebeest Population Counts

Accurate and on-demand animal population counts are the holy grail for wildlife conservation organizations throughout the world because they enable fast and responsive adaptive management policies. While the collection of image data from camera traps, satellites, and manned or unmanned aircraft has advanced significantly, the detection and identification of animals within images remains a major bottleneck, since counting is primarily conducted by dedicated enumerators or citizen scientists. Recent developments in the field of computer vision suggest a potential resolution to this issue through the use of rotation-invariant object descriptors combined with machine learning algorithms. Here we implement an algorithm to detect and count wildebeest from aerial images collected in the Serengeti National Park in 2009 as part of the biennial wildebeest count. We find that the per-image error rates are greater than, but comparable to, those of two separate human counts. For the total count, the algorithm is more accurate than both manual counts, suggesting that human counters tend to systematically over- or under-count images. While the accuracy of the algorithm is not yet at an acceptable level for fully automatic counts, our results show that this method is a promising avenue for further research, and we highlight specific areas where future work should focus in order to develop fast and accurate enumeration of aerial count data. If combined with a bespoke image collection protocol, this approach may yield a fully automated wildebeest count in the near future.


Introduction
Aerial surveys, in which the abundance of a population is estimated by flying transects over its habitat and counting the number of animals within a given sampling strip, are an essential tool for assessing wildlife population numbers [1,2]. Many species are monitored in this way, including birds [3][4][5], land mammals [6][7][8][9], and aquatic fauna [10,11]. While in-air counts are still used (i.e. animals are enumerated as they are encountered by observers), a common approach, especially for species that aggregate at high densities, is to employ aerial photography and then later count animals within the images. This second stage of the process is frequently a labour-intensive procedure [12] that requires highly skilled counters.
Automating the process of counting animals in images would therefore relieve a significant burden on governmental and non-governmental conservation organizations. Repeated measures of population size over time allow managers not only to develop accurate estimates of the true population size, but also to estimate critical parameters of the population, such as rates of recruitment, mortality, immigration, and emigration. These diagnostic parameters provide an early-warning indicator of a population's health and are core metrics of any adaptive management system. Therefore, increasing the accuracy and processing speed of a population count enables managers to access critical data and implement preemptive management strategies at an early stage, rather than waiting months for the results to be counted. Furthermore, an automated counting system could reduce the interval between consecutive population counts and thereby increase the temporal resolution of population trends.
Achieving automated animal counts has been the subject of extensive research [13][14][15][16][17]. This research forms part of the rapidly evolving field of machine learning and computer vision [18]. Applications of these techniques are diverse and recent advances include the accurate detection of faces [19], facial expressions [20], pedestrians [21], and handwritten text [22]. In the context of ecology and conservation, machine learning has been deployed to classify species based on vocalisations [23], to identify behavioural states [24], and to track and identify moving animals [25]. However, the most significant application has been in the automation of animal census methods, either through direct enumeration of animals [13,14], or through computer-aided mark-recapture methods based on automatic identification of individuals [26][27][28].
In this work, we evaluate the performance of a recently proposed method for the classification of objects [29]. The method is based on the popular histogram of oriented gradients technique [21] but has the distinct advantage of extracting only rotationally invariant features, making it suitable for aerial survey images in which animals may be oriented in any direction. We apply the method to the complete set of survey images taken during the 2009 Serengeti National Park wildebeest count. The wildebeest count is performed every 2 to 3 years and involves flying transects at an altitude of 350-400 ft above the ground. The aircraft travels at a speed of 120-180 kph (subject to wind speed and direction), with images taken every 10 seconds from a camera mounted through the floor of the aircraft [30]. The result is approximately 2,000 images that take 3 weeks for a single individual to count. To test whether the method proposed by [29] is able to automate the counting of wildebeest, we implemented the algorithm, automatically counted the 2009 images, and then compared the performance of the method to the manual totals. By testing the method on this dataset we are able to comprehensively evaluate its performance in an applied setting on a task of genuine ecological importance.

For completeness we include here a brief description of the method employed to extract invariant features from images. This is based on [29] and we refer interested readers to that work for a more complete description of the method. The histogram of oriented gradients (HOG) technique [21] is a popular method that uses the distribution of gradients within regions of images to classify objects. Liu et al. [29] modified this approach so that instead of using a discrete grid, HOG cells are treated as continuous functions that may be approximated using Fourier series. The advantage of this approach is that the extracted features of the image remain constant even if the underlying object within the image rotates.
As in [29], to process an image we first construct a matrix of gradients in complex form from the grayscale image $I$ using a finite-difference scheme,

$$G = \frac{\partial I}{\partial x} + i\,\frac{\partial I}{\partial y}. \quad (1)$$

Hence, each element of $G$ denotes the gradient at the corresponding image pixel in the form $\Delta x + i\Delta y$, which in polar coordinates may be written as $re^{i\theta}$. If we consider each element of $G$ as an individual cell [21] then the distribution of gradients is effectively a Dirac delta function of magnitude $r$, centred at $\theta$,

$$d(\phi) = r\,\delta(\phi - \theta). \quad (2)$$

Performing a Fourier series expansion of the Dirac delta function leads to

$$d(\phi) = \frac{r}{2\pi}\sum_{m=-\infty}^{\infty} e^{-im\theta} e^{im\phi}. \quad (3)$$

Truncating this series at some maximum mode means we are left with a sequence of complex-valued coefficients which represent the Fourier transform of the image gradient. The gradient at each pixel is therefore encoded by a sequence of Fourier coefficients, and the full transformed image is stored in a 3-dimensional complex array representing the $x$ and $y$ coordinates and the modes of the Fourier transform. We denote the 2-d array of mode-$m$ coefficients as $F_m$.

Next we introduce the Fourier basis functions $U_{j,k}$ shown in Fig 1. By performing a convolution between a basis function $U_{j,k}$ and a Fourier gradient field $F_m$ we obtain a Fourier HOG feature,

$$X_{k,m} = F_m \ast U_{j,k}, \quad (4)$$

(with the radial index $j$ suppressed for clarity) which encodes information about the image gradients in the region covered by the basis function. These radially symmetric basis functions act in a manner equivalent to the cells of the original HOG method. If the original image $I$ is rotated then each of the complex-valued features $X_{k,m}$ will also be altered, i.e. in this form they are not invariant to rotations. However, due to the shift property of Fourier analysis, rotations of the original image can be mapped to multiplications of the Fourier coefficients. A rotation of the original image by an angle $\alpha$ will result in the movement of pixels to another location and a rotation in the orientation of the gradients. These two effects can be mapped to the HOG feature by firstly rotating the Fourier transform of the gradient field $F_m$ by $\alpha$, and secondly by rotating the basis functions $U_{j,k}$ by $-\alpha$.
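As a concrete illustration of these steps, the following minimal Python sketch computes the truncated Fourier gradient fields $F_m$ and one Fourier HOG feature map. It is a sketch under stated assumptions, not the study's implementation: the Gaussian-ring radial profile, ring spacing, window size, and truncation order are illustrative choices.

```python
# A minimal sketch of Eqs 1-4, assuming NumPy and SciPy are available.
# The radial profile and truncation order M are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve

def fourier_gradient_fields(img, M=4):
    """Return [F_0, ..., F_M]: per-pixel Fourier coefficients of the
    gradient distribution r*delta(phi - theta) (Eqs 1-3)."""
    gy, gx = np.gradient(img.astype(float))
    G = gx + 1j * gy                          # complex gradient, r*exp(i*theta)
    r, theta = np.abs(G), np.angle(G)
    return [r * np.exp(-1j * m * theta) / (2 * np.pi) for m in range(M + 1)]

def basis_function(size, j, k):
    """Basis U_{j,k}(r, phi) = R_j(r) * exp(i*k*phi) on a square window,
    here with a Gaussian ring of radius 2j as the radial profile R_j."""
    y, x = np.mgrid[-size:size + 1, -size:size + 1].astype(float)
    r, phi = np.hypot(x, y), np.arctan2(y, x)
    R = np.exp(-0.5 * ((r - 2.0 * j) / 1.5) ** 2)
    return R * np.exp(1j * k * phi)

def hog_feature_map(F_m, j, k, size=8):
    """Fourier HOG feature map X_{k,m} = F_m * U_{j,k} (Eq 4)."""
    return fftconvolve(F_m, basis_function(size, j, k), mode='same')
```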
If $X_{k,m}$ is the original HOG feature, and $X'_{k,m}$ is the corresponding feature calculated after the image has been rotated by $\alpha$, then

$$X'_{k,m} = e^{i(m-k)\alpha} X_{k,m}. \quad (5)$$

From this equation we can see that if $m = k$ then image rotations have no impact on the descriptor and it is rotation invariant. Also, by taking the product of two descriptors, $X_{k_1,m_1}$ and $X_{k_2,m_2}$, a composite descriptor is formed,

$$X'_{k_1,m_1} X'_{k_2,m_2} = e^{i(m_1-k_1+m_2-k_2)\alpha}\, X_{k_1,m_1} X_{k_2,m_2}. \quad (6)$$

Again we note that the composite descriptor remains constant for all rotation angles $\alpha$ if $(m_1 - k_1 + m_2 - k_2) = 0$. We may therefore construct rotation-invariant features of the image from the features defined by Eq 4, firstly by using features for which $m - k = 0$, and secondly by taking the product of any two features for which $(m_1 - k_1 + m_2 - k_2) = 0$.
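A short sketch of this selection rule follows, assuming the raw features at one candidate pixel are held in a dict `X` keyed by `(k, m)` tuples; this data layout is an illustrative assumption, not the study's code.

```python
# A minimal sketch of assembling rotation-invariant descriptors from the
# raw complex features of Eq 4 at one candidate pixel.
import itertools

def invariant_descriptors(X):
    # Rule 1: features with m - k = 0 are unchanged by any rotation
    inv = [X[(k, m)] for (k, m) in X if m == k]
    # Rule 2: products of feature pairs whose rotation factors cancel,
    # i.e. (m1 - k1) + (m2 - k2) = 0 (see Eq 6)
    for (k1, m1), (k2, m2) in itertools.combinations(X, 2):
        if (m1 - k1) + (m2 - k2) == 0:
            inv.append(X[(k1, m1)] * X[(k2, m2)])
    return inv
```

In practice the real and imaginary parts (or magnitudes) of these complex invariants would be concatenated into a real-valued feature vector for the classifier.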

Implementation
The 2009 wildebeest count resulted in 2,018 images taken with a Nikon D2X 35mm camera shooting 4288x2848 pixel JPG images. Three separate counts of the aerial images were performed. Firstly, two independent counts were performed simultaneously by two different individuals. A third count was then performed by three individuals for images where there was a discrepancy between the initial counts. This final count is taken to be the correct count for our comparison metrics.

To evaluate the Fourier HOG method, the full 2009 image set was counted using machine learning software. The AdaBoost algorithm [31] was employed with a decision tree as the underlying classifier. Training images were drawn from 100 images taken from the 2012 survey. The code was written in Python 2.7 (www.python.org) using OpenCV [32] for image operations and the scikit-learn package [33] for classification. The classification code was parallelized using PyCUDA [34] and was run on an NVIDIA GeForce GT 630 graphics card. All code is based on open source libraries and is available at https://github.com/ctorney/wildCount.

An iterative process was employed to train the classifier on the 2012 images. First, a set of sample images was generated from the 2012 image set by manually locating wildebeest. Next, the classifier was trained on this small training data set and several further images were automatically counted. The results from this count were manually checked and corrected, then used to create a larger training data set of 3,000 positive samples and 3,000 negative samples.
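The training step can be sketched as follows, assuming a scikit-learn version of that era (where the base learner is passed as `base_estimator`); the `extract_features` helper, tree depth, and ensemble size are illustrative guesses rather than the study's exact settings.

```python
# A hedged sketch of the training step: AdaBoost over decision trees.
# `extract_features` (returning the rotation-invariant descriptor vector
# for one sample window) is an assumed helper.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_classifier(pos_windows, neg_windows, extract_features):
    X = np.array([extract_features(w) for w in pos_windows + neg_windows])
    y = np.array([1] * len(pos_windows) + [0] * len(neg_windows))
    clf = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=2),
        n_estimators=100)                 # illustrative ensemble size
    return clf.fit(X, y)
```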
The trained classifier was then applied to the 2009 image set. Images were converted to grayscale and scanned for regions above a threshold level of local contrast; regions that were uniform were discarded. Each pixel within a non-uniform region was then taken to be the centre of an object to be classified, and rotation-invariant Fourier HOG features were extracted. Each pixel was classified as either wildebeest or background, and contiguous blocks of positive pixels were grouped and counted as a single individual.
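A minimal sketch of this detection pipeline is given below; the window size, contrast threshold, and `extract_features` helper are illustrative assumptions, and `cv2.connectedComponents` requires OpenCV 3 or later.

```python
# A minimal sketch of the counting pipeline described above.
import cv2
import numpy as np

def count_image(img, clf, extract_features, contrast_thresh=10.0, win=15):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # local contrast: standard deviation over a sliding window
    mean = cv2.boxFilter(gray, -1, (win, win))
    sq_mean = cv2.boxFilter(gray * gray, -1, (win, win))
    contrast = np.sqrt(np.maximum(sq_mean - mean * mean, 0))
    detections = np.zeros(gray.shape, np.uint8)
    ys, xs = np.nonzero(contrast > contrast_thresh)  # non-uniform pixels only
    for y, x in zip(ys, xs):
        # each candidate pixel is treated as the centre of a possible object
        feat = extract_features(gray, y, x)
        if clf.predict([feat])[0] == 1:
            detections[y, x] = 1
    # contiguous blocks of positive pixels count as a single individual
    n_labels, _ = cv2.connectedComponents(detections)
    return n_labels - 1                              # label 0 is background
```

In the study the per-pixel classification was parallelized on the GPU with PyCUDA; the serial loop here is for clarity only.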

Results
To assess the accuracy of the method, total wildebeest counts are compared to the multiple counts performed by human counters. When using 3,000 training examples for each class (positive or negative) we find good agreement between the automated totals and the manual counts, as shown in Table 1. In Fig 2 the performance of the algorithm is assessed against the final manual count, and the two prior counts are shown for comparison. We note that while the automated total is more accurate than either initial count, the root mean square error per image is greater. This metric, calculated as

$$\sqrt{\frac{1}{N}\sum_{i=1}^{N} D_i^2},$$

where $D_i$ is the difference in the count for image $i$ and the summation is taken over all $N$ images, reveals that the greater overall accuracy is due to the lack of any systematic bias in the machine learning algorithm. The mean error (defined as $\frac{1}{N}\sum_{i=1}^{N} D_i$) in Table 1 shows that each first-pass manual count had either a positive or negative bias, whereas the algorithm displayed little systematic bias and was therefore able to obtain a more accurate total count, despite displaying a greater RMS error.
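For clarity, both metrics can be computed directly from the per-image differences $D_i$:

```python
# RMS error and mean (signed) error per image, as defined above.
import numpy as np

def error_metrics(counts, reference):
    D = np.asarray(counts, float) - np.asarray(reference, float)
    return np.sqrt(np.mean(D ** 2)), np.mean(D)   # (RMS error, mean error)
```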
To measure the precision and recall of the method, 100 images were randomly selected and the numbers of true and false positives, and true and false negatives, were recorded. These results are shown in Table 2.
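For reference, precision and recall follow directly from these tallies:

```python
# Precision and recall from the detection tallies of Table 2.
def precision_recall(tp, fp, fn):
    precision = tp / float(tp + fp)   # fraction of detections that are real
    recall = tp / float(tp + fn)      # fraction of wildebeest detected
    return precision, recall
```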

Discussion
Advances in new technologies such as earth-orbiting satellites [14] and unmanned aerial vehicles [9] have led to a rapid increase in high-resolution, easily accessible image data. To keep pace with this progress, computational tools are required to automate image processing and ensure that these vast amounts of data are transformed into useful information. One area where modern computer vision techniques have the potential to significantly improve current practices is the automated detection of animals within aerial count images. We have implemented a recent object classification method [29] which uses rotation-invariant features and is therefore suitable for use with these types of images. By testing the method against multiple manual counts we find that its performance is comparable to a first-pass human count. The algorithm has a greater per-image error rate, but is overall more accurate than two individual human counters. This is due to a lack of any systematic bias in its errors: landscape features lead to high rates of false positives, while low light conditions lead to false negatives (see Fig 3 for example images). Currently the algorithm is unlikely to outperform multiple human counts, whether by trained professionals or through a citizen science approach that averages many counts by non-specialist individuals (such as the Snapshot Serengeti project operated through the Zooniverse platform). However, a combination of automated and manual counting would represent an ideal application of the method in its current form, either as a first-pass count or as a method to assess the performance of citizen scientists.

A particularly promising aspect of the method is that it appears able to identify and differentiate between animal species. Although we were unable to quantify the performance of the algorithm in this regard with the current data set, preliminary results show that common species such as zebra may be distinguished from wildebeest by the algorithm. In future work we intend to test this capability further with training and testing data sets covering multiple species.
A further avenue for future research involves acquiring 3-dimensional information about the scene. As the features used by the classification method are based on the shape of the object, the error rate could be greatly reduced by obtaining the 3-dimensional structure of the object, for example through range imaging techniques such as structure from motion [35] or LIDAR [36]. An alternative approach to increasing accuracy would be to include a near-infrared thermal band that could differentiate between endothermic animals and the background. Both thermal and 3-d information could be used in combination with image gradients to enhance the accuracy of the method.