Identifying individual polar bears at safe distances: A test with captive animals

The need to recognise individuals in population and behavioural studies has stimulated the development of various identification methods. A commonly used method is to employ natural markers to distinguish individuals. In particular, the automated processing of photographs of study animals has gained interest due to the speed of processing and the ability to handle a high volume of records. However, automated processing requires high-quality photographs, which means that they need to be taken from a specific angle or at close distances. Polar bears Ursus maritimus, for example, may be identified by automated analysis of whisker spot patterns. However, to obtain photographs of adequate quality, the animals need to be closer than is usually possible without risk to animal or observer. In this study we tested the accuracy of an alternative method to identify polar bears at greater distances. This method is based on distinguishing a set of physiognomic characteristics, which can be recognised from photographs taken in the field at distances of up to 400 m. During five trials, sets of photographs of 15 polar bears from six zoos, with each individual bear portrayed on different dates, were presented for identification to ten test observers. Among observers, the repeatability of the assessments was 0.68 (SE 0.011). Observers with previous training in photogrammetric techniques performed better than observers without training. Experience with observing polar bears in the wild did not improve the ability to identify individuals in photographs. Among the observers with photogrammetric experience, the rate of erroneous assessment was on average 0.13 (SE 0.020). For the inexperienced group this was 0.72 (SE 0.018). Error rates obtained with automated whisker spot analysis were intermediate (0.26–0.58). We suggest that wildlife studies will benefit from applying several identification techniques to collect data under different conditions.


Introduction
When investigating life history processes in wildlife, the ability to identify individual animals carries great weight. Therefore, methods that facilitate the recognition of individuals are numerous, and continuously improved and extended to integrate rapidly emerging technologies [1] and address new research questions [2]. One line of approach is to catch animals and [...]

PLOS ONE | https://doi.org/10.1371/journal.pone.0228991 | February 13, 2020

[...] The features include shape of the body and head, the occurrence of scars, patterns of hair strands on the head, and the pattern of dark patches on the muzzle. To experimentally test the reliability of this method, we collected sets of photographs of individual polar bears, with each individual being portrayed at three different dates, from various zoos in Europe and North America. A similar level of detail in photographs can be achieved when photographing polar bears at distances ranging in the hundreds of meters in the field by commonly available optics, including high-magnification telescopes. The identity of the polar bears was known to the experimenter but not to the test observers. The task of the test observers was to distinguish the bears in the photographs in a series of successive trials by visual matching. Test observers were recruited from among those with extensive experience with polar bears in the wild, those trained in photogrammetric techniques, and those without previous relevant experience. The bears were also identified following the method developed by Anderson et al. [34], which is based on automated analysis of individual-specific patterns of whisker spots.
Specifically, we aimed to (1) test the reliability of identifying polar bears individually based on physiognomic characteristics, and assess the importance of analytical training or acquaintance with polar bears, and (2) compare the reliability of this method with results obtained by analysis based on whisker spot pattern.

Methods
Photographs of 15 different polar bears were obtained from six zoos in Europe and North America (Table 1). Zoo managers and polar bear care takers were requested to take photographs of each polar bear on three different occasions. The goal was to have photographs collected at a time scale of weeks or months, corresponding to the length of a typical field season in the Arctic. In practice, the occasions were separated by an average of 25 days (range 4-205 days), with 75% of successive occasions occurring within 15 days. One of the individuals was photographed on only one occasion, which means that, in total, 43 collections of photographs taken at a single occasion (called "sets" hereafter) were available. Photographers were further asked to take pictures of active bears (not sleeping or lying on the ground) and to use a camera with a suitable zoom lens, such that the bear covered 50-90% of the image. This resulted in images in which the body of the polar bears covered 0.3-10 megapixels (on average 4.0 megapixels), which corresponded to the resolution that can be obtained when photographing polar bears in the wild at a distance of 200-400 m with a suitable camera system. A bear covering 0.5 megapixels is sufficient to distinguish the details used in this study, as long as the photographs are in focus and sharp. Most but not all photographs met the minimum quality requirements. During some occasions, the light conditions at the time of shooting were poor and resulted in blurry or grainy photographs, impairing the visibility of details. Nevertheless, poor photographs were included in the identification process. To further simulate field conditions, photographers in the zoos were encouraged to take pictures at different times of the day and, if possible, to select days with various weather conditions. Photographers were further asked to portray the bears from a front view of the top of the head, front view right, front view left, left view and right view (S1 Fig). 
Before initiating the trials, any cues that could help reveal the identity of the polar bears were removed from the photographs. This was done by erasing the background and by removing all embedded information on time of picture, camera type, and GPS coordinates.

Experimental setup
For each of the five trials, the test observers were presented with 15 sets of photos. Sets were composed of an average of 13 photographs (range 7-20), each featuring the same individual.
The task of the test observers was to judge correspondence in identity among sets. The sets were chosen randomly without replacement, resulting in 10 or 11 (average 10.4) different individuals per trial. Trials were treated independently of each other, such that the same set of photographs could appear in multiple trials. Nevertheless, all comparisons concerned unique pairs of sets. The test observers knew that all photographs within a set were of the same individual and that there were at most three sets of the same individual, but the total number of different bears in the tests remained undisclosed. Information on the (relative) date of collection was also not provided. To facilitate comparisons of photographs within and among sets, photographs were ordered depending on the angle of the bears' heads: bear facing far left (or left backward), facing towards the camera, and facing far right. If needed, the quality of photographs was improved by adjusting contrast. To test the effect of experience on the ability to identify individuals, four types of test observers were recruited, defined by the combinations of (a) previous extensive experience in observing polar bears in the wild and (b) training in identifying objects (not necessarily polar bears) in photographs. Sample sizes of the types (experience with polar bears and with or without photogrammetric techniques, no experience with polar bears and with or without photogrammetric techniques) were 2, 2, 2 and 4 test observers, respectively. The net time to process a single trial was 1-2 days, and processing all trials was spread over a period of 2-5 weeks.

Processing photos
For each set, bears were classified according to a list of features (Table 2), which included information on posture, head shape, body condition, pattern of hair over the body and on the head. What these generic features have in common is that they are distinguishable in wild polar bears at distances of up to 400 m (Fig 1). Furthermore, test observers were asked to identify any ad hoc features that an observer might find useful as a natural marker. Details of the head were mapped on a standard sheet [4] with head profiles (left side, right side, and top of the head), including the pattern of dark spots between nose and lips, and the pattern of grey tones on the muzzle, scars or wounds, and hair patterns (see Fig 2 for an example). By pairwise comparisons, sets were screened for corresponding features. In the case that a match was suspected, photographs of the two sets were compared side by side and closely inspected to confirm (or reject) that the same individual was involved. These assessments resulted in the preliminary ratings, composed of a matrix of all sets against each other in which observers coded their ratings as "D" or "S" (when sets were thought to be from different or the same individuals, respectively). Trials were processed consecutively, and after completing the five trials the test observers were given the opportunity to reconsider assessments. This was done to allow for the effects of any experience built up during the trials, resulting in the final ratings used in the analyses.

Statistical analyses
The experiment was structured as a hierarchical design, in which 15 sets of photographs were presented in five consecutive trials. All possible pairs among the sets amounted to 5 × 15 × (15 − 1) / 2 = 525 comparisons, each with a unique comparison ID. All comparisons were rated by the ten observers. Thus, comparisons were nested within trials, and observers were crossed by comparisons (data in S1 Dataset). Each pairwise comparison resulted in one of four possible outcomes [36], depending on the observer's rating and the similarity class (whether bears were the same or different individuals). In those cases that bears were the same, they were either correctly identified (true positive, TP) or they were misidentified as being different (false negative, FN). The false negative rate is the proportion of misidentifications, or FNR = (∑FN) / (∑TP + ∑FN). Likewise, when bears were different, they were either correctly identified (true negative, TN) or they were misidentified as being the same (false positive, FP). The false positive rate is FPR = (∑FP) / (∑TN + ∑FP). The error rate ER is the Euclidean distance of the point (FPR, FNR) from the origin (FPR and FNR both zero), ER = √(FPR² + FNR²). For calculation purposes, the observers' ratings were recoded into a variable "outcome", coded as 0 when the ratings were correct and 1 when the ratings were incorrect. In this way, "outcome" averaged by similarity class estimates FPR and FNR.
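The rates defined above can be illustrated with a minimal Python sketch. It is not the authors' analysis script; the function name `error_rates` and the example counts are hypothetical and serve only to show how FNR, FPR and ER relate to the four outcome counts.

```python
from math import comb, sqrt


def error_rates(tp, fn, tn, fp):
    """Return (FNR, FPR, ER) from counts of the four possible outcomes."""
    fnr = fn / (tp + fn)            # same bears rated as different
    fpr = fp / (tn + fp)            # different bears rated as the same
    er = sqrt(fpr ** 2 + fnr ** 2)  # distance of (FPR, FNR) from (0, 0)
    return fnr, fpr, er


# Number of pairwise comparisons: 5 trials, each with C(15, 2) pairs of sets.
n_comparisons = 5 * comb(15, 2)

# Hypothetical counts for a single observer (not data from this study).
fnr, fpr, er = error_rates(tp=30, fn=10, tn=470, fp=15)
```

Because ER combines both rates, an observer can only reach a low ER by keeping false negatives and false positives low at the same time.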
The statistical analyses were performed with the software R [37]. The observations were analysed by generalized linear mixed-effects models, glmer in the R-package lme4 [38] adopting a binomial distribution and logit link function. To account for the structure of the experiment, observer and comparison ID nested within trial were treated as random factors. To explore the repeatability among observers [39], the ratings were modelled in a random effects model. Repeatability, defined as the variance due to random effects as proportion of total variance [39], was obtained by the function rptBinary in the R-package rptR [40], with the built-in functionality to estimate the standard error by parametric bootstrapping (1000 times in our case).
Outcome was subjected to mixed-effects modelling with two fixed factors: similarity class and a variable representing the four levels of observers' experience. It was not possible to analyse effects of the four levels of experience, similarity class and their interaction simultaneously as parameters could not be estimated due to convergence problems. Therefore, we followed a two-step approach by first exploring main effects of experience and similarity class, and subsequently testing any interaction effects. Differences among the experience types were tested by post-hoc pairwise comparisons in the R-package emmeans [41] with Tukey adjustment. Significance of fixed factors was assessed under the assumption that the coefficient estimates are normally distributed (z-test). As models generated by glmer do not provide a way to calculate standard errors of predictions, standard errors and 95% confidence intervals were estimated by bootstrapping using the R-package boot [42], based on 1000 replicates. To evaluate the performance of the observers against a random process, the ratings were compared with the outcome of 1000 simulations for each of the five trials (see S1 File for an example script). Starting with the sets in the original order of the experiment, a matrix comparing all sets to each other (rated as "different" or "same") was generated. Subsequently, the order of the sets was randomized, resulting in a new matrix with ratings in random order. The two matrices were compared cell-by-cell, which resulted in corresponding measures of agreement (TP, TN, FP or FN). From the simulated FPR and FNR the means were calculated, and the 95% confidence intervals were obtained from the 0.025 and 0.975 percentiles.
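The random-process simulation described above can be sketched as follows. This is a hedged illustration in Python and not the script provided in S1 File; the function names, the nearest-rank percentile helper and the composition of the example trial (15 sets from 10 made-up individuals) are assumptions.

```python
import random
from itertools import combinations


def simulate_random_ratings(identities, n_sim=1000, seed=1):
    """Simulate FPR and FNR when the order of the sets is randomized."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(identities)), 2))
    # Reference matrix: is each pair of sets truly the same individual?
    truth = [identities[i] == identities[j] for i, j in pairs]
    n_same = sum(truth)
    n_diff = len(truth) - n_same
    fprs, fnrs = [], []
    for _ in range(n_sim):
        shuffled = identities[:]
        rng.shuffle(shuffled)
        # Matrix of "ratings" produced by the randomized order.
        rated = [shuffled[i] == shuffled[j] for i, j in pairs]
        fn = sum(t and not r for t, r in zip(truth, rated))
        fp = sum(r and not t for t, r in zip(truth, rated))
        fnrs.append(fn / n_same)
        fprs.append(fp / n_diff)
    return fprs, fnrs


def percentile(values, q):
    """Nearest-rank percentile for q between 0 and 1 (simplified)."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]


# Hypothetical trial: 15 sets from 10 individuals (labels are made up).
trial = list("AAABBCCDDEFGHIJ")
fprs, fnrs = simulate_random_ratings(trial)
mean_fnr = sum(fnrs) / len(fnrs)
ci_fnr = (percentile(fnrs, 0.025), percentile(fnrs, 0.975))
```

In this sketch the shuffled matrix always contains the same number of "same" ratings as the reference matrix, so each simulation yields equal counts of false positives and false negatives; the two rates differ only because their denominators differ.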

Assessment by whisker spot pattern
Anderson et al. [34] developed a method to identify polar bears based on the pattern of whisker spots on the anterior part of the muzzle. Briefly, processing the images as described by Anderson et al. includes the following steps. (1) Photographs are warped into a standard pixel grid by affine transformation using three spots (corner of the eye, notch of the nose, trailing edge of the mouth) as reference locations. (2) By a series of image adjustments, photographs are enhanced and cropped to arrive at a black-and-white representation of the whisker spot region. (3) The resulting images are compared pairwise on a pixel-by-pixel basis. For all black pixels on photograph 1, the corresponding pixel on photograph 2 is used to calculate the distance to the nearest black pixel on photograph 2. The distances are averaged to arrive at an index of dissimilarity (the Chamfer distance [34]). Similarly, a second index of dissimilarity is calculated from comparing the photographs the other way. Finally, the two estimates are averaged for a measure of dissimilarity between the pair of photographs. We followed the methods described by Anderson et al. [34] with the following modifications. (1) The program was run in a Python environment with ImageMagick (https://www.imagemagick.org) to process the images. (2) After generating a black-and-white representation of the whisker spot area, we filtered unwanted noise from the images by removing isolated black spots that were less than 2 pixels in size. The index of dissimilarity was taken as a starting point for further calculations [34]. A threshold was set such that pairs with dissimilarity below the threshold were rated as being the same, and above the threshold as different. Increasing the threshold caused a drop in the probability that two sets of photographs of the same individual were erroneously rated as different (=FN); however, the probability that sets of different individuals were rated as similar (=FP) increased. 
The optimal threshold would minimise these two types of errors [43]. For graphic representation, FNR was first plotted against FPR at increasing threshold values in a modified ROC plot [44]. The optimal threshold was found as the minimal distance from any of the points on the curve to the bottom-left corner of the graph (FPR and FNR both zero) using the package pROC in R [44]. Confidence intervals of FNR at any FPR were obtained by bootstrapping with 10,000 replicates [44].
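The dissimilarity index and the threshold search can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the program of Anderson et al. [34] or the pROC routine: black pixels are given as (x, y) coordinates, ties at the threshold are rated as "different", the search is a brute-force scan over observed scores, and all names and example values are hypothetical.

```python
from math import hypot, sqrt


def directed_distance(pixels_a, pixels_b):
    """Mean distance from each black pixel in A to the nearest one in B."""
    total = sum(
        min(hypot(ax - bx, ay - by) for bx, by in pixels_b)
        for ax, ay in pixels_a
    )
    return total / len(pixels_a)


def dissimilarity(pixels_a, pixels_b):
    """Average of the two directed distances between a pair of images."""
    return (directed_distance(pixels_a, pixels_b)
            + directed_distance(pixels_b, pixels_a)) / 2


def optimal_threshold(scores, same):
    """Return (ER, threshold, FPR, FNR) for the point closest to (0, 0).

    scores: dissimilarity per pair; same: True if both sets are one bear.
    Pairs with a score below the threshold are rated as the same bear.
    """
    n_same = sum(same)
    n_diff = len(same) - n_same
    best = None
    for threshold in sorted(set(scores)) + [float("inf")]:
        fn = sum(s and score >= threshold for score, s in zip(scores, same))
        fp = sum((not s) and score < threshold
                 for score, s in zip(scores, same))
        fnr, fpr = fn / n_same, fp / n_diff
        er = sqrt(fpr ** 2 + fnr ** 2)
        if best is None or er < best[0]:
            best = (er, threshold, fpr, fnr)
    return best


# Hypothetical whisker spot coordinates for two photographs.
index = dissimilarity([(0, 0), (3, 0)], [(0, 1), (3, 0)])

# Hypothetical dissimilarity scores and true similarity classes.
scores = [1.0, 1.5, 2.5, 3.5, 4.0, 4.5, 5.0, 5.5]
same = [True, True, False, True, False, False, False, False]
er, threshold, fpr, fnr = optimal_threshold(scores, same)
```

Raising the threshold moves the point along the curve from high FNR and low FPR towards low FNR and high FPR, so the scan simply keeps the candidate with the smallest distance to the bottom-left corner.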
For the automated whisker spot analysis to provide useful results, the photographs should be of sufficient quality [34]. For precise mapping of the spots, the head of the polar bear must be perpendicular to the viewing axis of the camera. Sets in this study in which none of the photographs met these criteria were not used in the analysis. Following Anderson et al. [34] we subjectively qualified photographs as high quality, low quality, or unsuitable (head not in correct position or whisker spots not distinguishable). From each set, two photographs were selected for analysis, one for the left side of the head and one for the right. Subsequently, the pairwise comparisons were separated into a high-quality group (photographs of both sets were of high quality) and a low-quality group (photographs of only one or none of the sets were of high quality). When photographs were available for both sides of the head for both sets, the pair with the lowest dissimilarity index was selected for further analyses.

Results
The ten observers had equal ratings (i.e. all were "different" or "same") in 88.6% of the comparisons (n = 525), whereas in the remaining 11.4%, ratings differed to varying degrees. This apparently high degree of consistency in the ratings was confirmed by a repeatability of 0.678 (SE 0.011). Nevertheless, error rates differed widely among observers. A multi-comparison test of the error rates in relation to experience of the test observers revealed a dichotomy, in that observers with experience in photogrammetric techniques performed better than those experienced in observing polar bears or without relevant experience at all (P < 0.001; S1 Table). There was only weak evidence that among the photogrammetry-trained observers, additional experience with polar bears improved the quality of ratings (P = 0.025; S1 Table).
The final model explored in which way FPR and FNR varied with experience (i.e. experience in photogrammetric techniques) by the inclusion of an interaction between experience and the factor describing whether the same or different bears were compared (similarity class in Table 3). The model results showed that the errors were smaller when different bears were compared than in a comparison between the same individuals (FPR < FNR). Moreover, the significant interaction term indicates that on top of larger errors within the group of inexperienced observers, the errors were particularly large when inexperienced observers compared the same bears (Fig 3, Table 4).
Under a random process, the average expected FNR was 0.901 (95% confidence interval 0.600-1.000), and the average expected FPR was 0.099 (0.051-0.170). All observers had a lower FPR than expected based on a random process, as indicated by the gap between the 95% confidence intervals (Fig 3). Concerning FNR, the four observers experienced in photogrammetric techniques performed better than expected from a random process, whereas the performance of the six inexperienced observers exhibited an overlap with a random process.
In the automated whisker spot analysis, photographs were rated as high-quality in 19 out of 75 sets (a proportion of 0.253) and low-quality in 31 sets (0.413). In 25 sets (0.333), none of the photographs were adequate to distinguish whisker spots. The numbers of individual polar bears in the two categories were 9 (high-quality photographs) and 11 (low-quality), respectively. In the comparisons among the high-quality sets, an optimal threshold of 2.8, which minimises the probability of misclassification, was found for the dissimilarity index. At this optimum, FNR was 0.061 and FPR 0.200, resulting in an error rate of 0.256 (Fig 4). Similarly, when comparing lower-quality photographs, the optimal threshold of the dissimilarity index was 4.0 with an associated FNR of 0.493, FPR of 0.303, and error rate of 0.579 (Fig 4).

Table 3 (notes). Data were analysed by a generalized linear mixed-effects model with a binomial distribution and logit link function. The dependent variable is "outcome", indicating whether a rating is correct (0) or not (1). Experience is a factor representing the experience in photogrammetric techniques (0 = experienced, 1 = inexperienced). Similarity class is a factor indicating whether photo sets concerned the same individual (0) or different individuals (1), resulting in estimates of FNR and FPR, respectively. Observer (n = 10) and comparison ID (n = 525) within trial (n = 5) were random factors. Results are on a logit scale.
https://doi.org/10.1371/journal.pone.0228991.t003

Fig 3 (caption). Observers are ranked by error rates. The horizontal lines represent the mean error rates resulting from simulating a random process. The shaded areas are the lower parts of the 95% confidence intervals. Observers are separated by experience in photogrammetric techniques. Observers A, B, H and I were experienced in observing polar bears.
https://doi.org/10.1371/journal.pone.0228991.g003

For comparison with the automated whisker spot analysis, the FPR and FNR based on physiognomic features are shown in Fig 4. Observers experienced in photogrammetric techniques performed better than the whisker spot analysis of high-quality photographs, though there was an overlap in the 95% confidence intervals. When the whisker spot analysis was based on low-quality photographs, both groups of test observers performed better than the whisker spot analysis.

Discussion
This study assessed whether polar bears can be individually identified beyond the working distances required by an established method based on whisker spot patterns [24,34]. In contrast to whisker spots (with an ultimate range of 50 m), the physiognomic features used in this study can be distinguished on photographs of wild polar bears at distances of up to 400 m, obtained with standard optical equipment. This means that this technique can be used to study polar bears that are not habituated to the presence of humans and therefore can only be observed from distances that are too far to take adequate photographs for whisker spot analysis. This study underlines the importance of training for correct assessments [45]. It was revealing that within our sample of observers, experience with photogrammetric techniques, rather than experience with polar bears by previous intensive observations in the wild, was associated with skills to identify individuals from photographs. Interestingly, observers without photogrammetric experience did almost as well as experienced observers in establishing that two bears were different (the FPRs were only slightly different between the two groups). However, observers without photogrammetric experience fell short in detecting smaller details needed to establish that two bears were the same. Consequently, the FNR of observers without photogrammetric experience exhibited an overlap with results from random simulations.
Any technique to identify individuals in a non-automated way improves when observers acquire the needed skills and get used to individual variation in appearance. Evans [8] describes how observers had to become acquainted with the specific colour pattern on the bill of Bewick's swans Cygnus columbianus bewickii to identify individuals properly. Similarly, processing photographs in an automated way requires trained observers as well, and shifts the labour and required competence to the collection of high-quality photographs [22,34]. Regardless of what stage of the identification process requires the most effort, the stored photographs give an opportunity to check assessments and, if necessary, to apply corrections with new insights or improved technologies.
With the probability of incorrect assessment being on average 0.13 for observers experienced in photogrammetric techniques, this test supports the supposition that individual polar bears can be recognised reasonably well using physiognomic features. The results indicate that if any error is made, it is most likely that the identity of the same individual is overlooked. Missing a true match is the more common source of errors in other species as well [6,46], which may lead to biased abundance estimates when the observations are used in mark-resight analyses [47]. It is therefore important to reduce identification errors in photographic matching. This reduction can be achieved by first splitting the observations by well-defined grouping variables [48]. A single grouping variable with two levels reduces the number of required comparisons by approximately 50% when groups are similarly sized, and with two grouping variables (yielding 4 combinations) the reduction is 75%. In polar bear field studies, for example, observations can be grouped by gender, age, tagging status (e.g. ear marks or collars) and female breeding status (accompanied by cubs or not), thus potentially reducing the number of pairwise comparisons and identification errors considerably.
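The reduction in the number of pairwise comparisons obtained by grouping can be checked with a short calculation. The sketch below is illustrative only; the total of 200 photographed sets is a hypothetical figure, and the groups are assumed to be equally sized.

```python
from math import comb


def comparisons(group_sizes):
    """Number of pairwise comparisons when matching only within groups."""
    return sum(comb(n, 2) for n in group_sizes)


n_sets = 200  # hypothetical number of photographed sets

ungrouped = comparisons([n_sets])
two_groups = comparisons([n_sets // 2] * 2)   # one variable, two levels
four_groups = comparisons([n_sets // 4] * 4)  # two variables, four combinations

reduction_two = 1 - two_groups / ungrouped
reduction_four = 1 - four_groups / ungrouped
```

With unequal group sizes the reduction is smaller, because the largest group dominates the number of within-group pairs.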
Testing for how long distinctive identifying features remain unaltered was beyond the scope of this experimental study. In general, the pattern of natural markers may change over animals' lifetimes (but see Bauwens et al. [48]), and with longer periods between successive observations, extra care is needed to properly identify an animal to avoid false negatives (i.e. overlooking the return of a familiar animal). This may particularly hold for polar bears, in which structural body growth continues up until the age of four (females) or five (males) years [49].
For several reasons, the method to identify polar bears by whisker spot patterns as developed by Anderson et al. [34] is attractive. First, an objective metric of dissimilarity is obtained. Second, processing photographs in an automated workflow avoids laborious scrutiny of photographs. However, our results indicate that identification by physiognomic features may be more accurate than analysis based on whisker spot patterns. In observers with photogrammetric experience, the probability of an error was on average two times smaller than when analysing by whisker spot patterns, and the difference was even larger when low-quality photographs were used. Despite the considerable difference also when using high-quality photographs, the difference in error rates between the two methods was not statistically significant. We attribute this to a lack of power as the sample size of high-quality photographs was low. In addition, the better performance of the non-automated method may be partly attributed to the fact that our study photographs were taken to distinguish a wide array of physiognomic features and not specifically for whisker spot analysis. Therefore, the quality of the photographs may have fallen short. In their work on whisker spot patterns in polar bears, Anderson et al. [34] achieved more accurate results with their probability of error being tenfold smaller than ours, but this was for excellent photographs, which formed only 10% of the material collected in that study. Error rates associated with the more abundant photographs of sub-excellent quality were in line with our results: 0.16-0.53, as derived from Fig 8 in Anderson et al. [34], versus 0.209-0.579 in this study. Finally, identification by physiognomic characteristics may be more accurate than analysis by patterns of whisker spots because the first uses several keys in the identification process, rather than focusing on a single aspect of an animal's physiognomy. 
When photographs are rated as excellent, the method using whisker spot patterns may be the preferred approach. When quality is lower or when bears are photographed at distances larger than 50 m, the method based on a set of physiognomic features gives more accurate results. We suggest, therefore, that wildlife studies may benefit from applying several identification techniques to collect data under different circumstances.

Supporting information
[...] (XLSX)
S1 File. Script to obtain estimates of FPR and FNR when assuming a random process. (PDF)
S1 Table. Model results of exploring effects of experience on error rates. (PDF)