Using virtual reality and thermal imagery to improve statistical modelling of vulnerable and protected species

Biodiversity loss and sparse observational data mean that critical conservation decisions may be based on little to no information. Emerging technologies, such as airborne thermal imaging and virtual reality, may facilitate species monitoring and improve predictions of species distribution. Here we combined these two technologies to predict the distribution of koalas, specialized arboreal foliovores facing population declines in many parts of eastern Australia. For a study area in southeast Australia, we complemented ground-survey records with presence and absence observations from thermal-imagery obtained using Remotely-Piloted Aircraft Systems. These field observations were further complemented with information elicited from koala experts, who were immersed in 360-degree images of the study area. The experts were asked to state the probability of habitat suitability and koala presence at the sites they viewed and to assign each probability a confidence rating. We fit logistic regression models to the ground survey data and the ground plus thermal-imagery survey data and a Beta regression model to the expert elicitation data. We then combined parameter estimates from the expert-elicitation model with those from each of the survey models to predict koala presence and absence in the study area. The model that combined the ground, thermal-imagery and expert-elicitation data substantially reduced the uncertainty around parameter estimates and increased the accuracy of classifications (koala presence vs absence), relative to the model based on ground-survey data alone. Our findings suggest that data elicited from experts using virtual reality technology can be combined with data from other emerging technologies, such as airborne thermal-imagery, using traditional statistical models, to increase the information available for species distribution modelling and the conservation of vulnerable and protected species.


Introduction
In the face of unprecedented biodiversity loss, critical decisions are needed on the conservation of vulnerable and protected species [1,2]. Unfortunately, information is seldom available or dense enough in space and time to effectively inform those decisions [3,4]. Monitoring programs often rely on observational records of species collected during ground surveys. However, ground-based detection of vulnerable and protected species is difficult, particularly when species are rare or elusive, and information may be biased towards single or few individuals (e.g. radio-collared animals) or sites with high abundance [5,6]. Furthermore, monitoring large areas using traditional ground-survey methods is logistically and financially infeasible; professional monitoring can be time consuming and expensive, and while volunteer data are a valuable source of lower-cost information [7], the data may be biased and range widely in quality [8,9,10]. These issues all contribute towards the sparse data problem.
Emerging technologies provide an opportunity to increase the spatial and temporal coverage of data, increase the quality of information gained, potentially lower the cost of sampling, and thereby benefit conservation efforts for species and their habitats [11]. For instance, implementation of top-down thermal imaging captured by Remotely-Piloted Aircraft Systems (RPAS), commonly known as drones or unpiloted aerial vehicles, can provide a cost-effective alternative to species counting [12,13]. Alternatively, virtual reality (VR) can create immersive experiences of field conditions used to gather expert information both cost-effectively and with relative ease [14]. These immersive experiences are expected to improve elicitation responses due to the priming of visual memories from similar environments [15,16,17]. The resulting information can then be incorporated into quantitative analyses, for example as informative priors (e.g. [18]).
In this study, we aimed to demonstrate how thermal imagery and expert opinions can be harnessed to add value to models based on ground-survey data alone. Few studies have considered combining thermal imagery [19] or VR-elicited expert information within statistical models [14,17]. To our knowledge, this is the first study to use both methods for species distribution modelling in a conservation context. Here, we focused on the koala (Phascolarctos cinereus), a native Australian marsupial that can be difficult to detect even by trained observers [20] and which is listed as vulnerable in Queensland, New South Wales and the Australian Capital Territory under the Australian Commonwealth Environmental Protection and Biodiversity Conservation Act 1999 [21,22]. More specifically, we modelled koala presence and absence data collected from ground-surveys, aerial thermal imagery and experts immersed in 360-degree imagery, using a suite of habitat covariates. This allowed us to examine whether combining information gained from emerging technologies with ground-survey data would improve (i) understanding of the drivers influencing species presence or absence and, most importantly, (ii) the accuracy and/or precision of predictions at unsampled locations.

Study area
The study area, Alexander Clarke Park in Loganholme, southeast Queensland, Australia is approximately 0.20 km 2 , open to the public and contains a fenced-off area for off-leash dogs, multiple playgrounds and walking tracks. The roughly triangular-shaped park is bordered on two sides by the Logan River, and on its third, northwestern side by residential housing. Vegetation in the parkland ranges from dense forest to open scrubland and grass fields, with the denser vegetation tending to occur closer to the river. We chose this study area for several reasons. The park is known to provide habitat for a small population of koalas and aerial thermal-imagery surveys of the area had previously been conducted in 2016 [23]. In addition, koala sightings recorded by citizen scientists were available, and characteristics of the park were representative of areas where ground-based surveying for koalas might be difficult (e.g. due to obstacles such as dense forest and water features).

Koala observation data
The complete dataset of 82 koala observations contained 41 presences and 41 absences, each recorded at a unique location in the study area (S1 Table). Fifteen of the presences were identified from sightings made during opportunistic ground surveys conducted by citizen scientists between 2012 and 2017 [24]. Two more sightings were made by the project team in December 2017 while thermal imagery was captured (see below). The remaining 24 presences were identified from aerial thermal-imagery surveys conducted in October 2016 [23] and December 2017. On both occasions, thermal imagery was collected during RPAS (DJI M600) flights conducted over the study area for later identification of thermal hotspots as 'potential' koalas ( Fig  1). A Tau-2 640 captured forward-looking infrared radiometer footage while a Mobius Action Camera and Sony NEX5 captured red-green-blue footage. The RPAS was flown at 60-70 m above ground to accommodate the size of the site within time constraints in the morning (between 5:50 am and 9:20 am; 2016 and 2017) and evening (between 3:53 pm and 5:56 pm; 2016 only), following line transects 8.37 m apart and orientated west-northwest to east-southeast. Volunteers (in October 2016) or members of our project team (in December 2017) conducted ground surveys while the footage was captured (walking the same line transects as the RPAS), noting the geographical coordinates of any koala sightings. Resultant data were then input into a koala detection algorithm to identify koalas and assign confidence ratings between 0 and 1 [12] (Table 1). Eleven koala presences were identified in 2016, and 13 in 2017 (S1 Table). Using all of the survey data collected in December 2017, the 41 koala absences were then randomly generated from sites in the study area where there was no evidence of koalas based on the thermal imagery and ground surveys. This allowed us to model koala presence/ absence rather than presence-only or presence/pseudo-absence (see Statistical modelling).

Habitat data
Habitat data were collected during field surveys by the project team or derived from freely available geographic information system (GIS) datasets. In December 2017, we estimated the height and density of the tallest vegetation layer in the field, following [27], at each of the 82 koala observation (presence and absence) sites. Height was estimated by measuring the angle of elevation to the top of the tallest tree and the distance between the tree and observer, and then recorded on an ordinal scale of 1 to 5, where 1 > 30 m, 2 = 10-30 m, 3 < 10 m, 4 = 2-8 m, and 5 = 0-2 m. Density was based on visual estimates of percentage canopy cover and recorded on an ordinal scale from 1 to 4, where 1 = 70-100%, 2 = 30-70%, 3 = 10-30%, and 4 < 10%. Height and density data were used solely for the purpose of selecting subsets of images shown to experts (see Expert elicitation data), rather than for statistical modelling of koala presence/absence and prediction at unobserved locations.
Koala populations are threatened by multiple and likely interacting factors, including habitat loss and fragmentation, fire, drought, disease, dog attack and vehicle collision [22,28,29]. Furthermore, these highly specialized, arboreal foliovores are restricted to regions dominated by their food-tree species. Consequently, the presence and quality of these food-tree species have been identified as additional factors affecting koala presence [22,28,29]. We therefore generated GIS-based covariates that represented such factors for each of the 82 koala observation sites (to use in statistical modelling) and for all unobserved prediction sites (to use for model prediction visualization). In addition to the spatial covariates of longitude and latitude, . We also generated a binary covariate (REV) that represented whether a site contained remnant vegetation dominated by Eucalyptus food-tree species favoured by koalas in southeast Queensland, such as E. tereticornus (1), or otherwise (0) [31,32,33]. Finally, we measured the Euclidean distances (m) from each site to the nearest path ( [34]; Path), which has been used previously as a proxy for distance to potential sources of human disturbance [7], and to the nearest fresh water (i.e. the Logan River [35]; Water), which has been used previously as a proxy for the moisture content of leaves (from which koalas gain much of their water requirements [7]). These six GIS-based covariates (longitude, latitude, FPC, REV, Path, Water) were created in R statistical software [36] using the sp [37], raster [38] and geosphere [39] packages.

Expert elicitation data
Subsets of images that captured the range of koala habitat at the 82 koala observation sites were shown to six koala experts. To generate the subsets, we used cluster analysis (completelinkage unweighted pair grouping method with arithmetic mean and the Gower distance measure; [40]) to group sites based on vegetation height and density, distance to water, FPC, and koala presence or absence. The resultant clusters contained between 7 and 11 sites each. Images of sites were then captured in December 2017 using a Samsung Gear camera (Fig 2). Finally, we converted these images into 360-degree views using the game engine Unity (Unity Technologies, San Francisco, www.unity3d.com).  We created subsets of ten 360-degree images each by randomly selecting a site from each cluster. To elicit information from the experts on koala presence and absence in the study area, we then showed each expert the ten images from a randomly selected subset. The experts included government employees, academic researchers and citizen scientists representing a mix of genders and ranging in age and type of koala expertise (e.g. management, research).
There is an extensive literature on elicitation processes and the use of expert knowledge in ecological contexts [41,42]; in this study we followed a structured two-part elicitation procedure (cf. [4]) that allowed us to combine the information elicited from experts with data from field surveys within a statistical model, following [41] (Table 2, S1 File). First, we briefed the experts about the study and provided them with definitions of terms, explanations of probabilities [41] and practice questions while they wore VR-headsets and were immersed in example images. Secondly, we conducted the expert-elicitation interview. For each of the ten images in which the experts were immersed, we asked questions designed to elicit estimates of the probability of the presence of koalas at the site and its habitat suitability, and an assessment of their confidence in those estimates (not very sure, quite sure or very sure; Table 2, S1 File) [17]. The first set of questions specifically on koala presence followed an 'outside in' method to elicit probabilities, whereby experts are first asked about the extrema (absolute lower and upper limits) and then the mode (most plausible, i.e. the expected value) [17,43]. The next set of questions on habitat suitability provided the experts with an alternative way of expressing their assessment of koala presence. The same elicitor interviewed each expert, each expert was interviewed separately, and images were shown to each expert in random order.
We expected that the experts would have greater confidence in their elicited probabilities about habitat suitability than koala presence specifically, given that the experts had no information about the habitat surrounding the location they were viewing. For instance, an expert may deem a densely forested habitat suitable for koalas with high confidence, but their confidence in the probability of a koala being present at that site may be lower given the surrounding habitat is not visible; an image showing a densely forested habitat in the study region could be surrounded by human residences, the river, walking tracks, forest or playing fields, all of which may differentially influence koala presence. The expectation was confirmed, with 98.3% of the elicited probabilities on habitat suitability having confidences of 'quite sure' and 'very sure', compared with only 76.7% of the elicited probabilities on koala presence (S3 Table). Therefore, we chose to use the elicited probabilities on habitat suitability, and their corresponding confidence ratings, in the statistical models that incorporated expert data (see Statistical modelling). Information elicited from koala experts using a two-part structured elicitation procedure (see S1 File for the full protocol).

Statistical modelling
Survey-based models. The koala observation data from ground and thermal-imagery surveys were binary (i.e. presence/absence). Therefore, we fit a logistic regression model [44] to these data, using R statistical software [36]. We fit two models: one to the ground-survey data only (the 'base' G model, n = 34), and one to the combined ground and thermal-imagery survey data (the GT model, n = 82). We included the six habitat covariates in each model: FPC, REV, Path, Water, longitude and latitude. In each case, the covariates were centered and scaled using the means and standard deviations of the respective covariates in the base (G) model. We also included weights in each model based on the confidence in each presence or absence observation: 1.00 for the 17 presences from the ground surveys; 0.50, 0.90 or 1.00 for the 24 presences from the thermal-imagery surveys (as per the confidence ratings from the koala detection algorithm); and 0.90 for the 41 absences ( Table 1). The 0.90 weighting for absences was appropriate because the absences had been generated randomly from sites where koalas had not previously been observed in ground or thermal imagery surveys; as such, there was a chance that koalas may actually have been present at those sites (i.e. false negatives), making a 1.00 weighting inappropriate. The degrees of freedom in the G model were limited because only 34 observations were used to estimate seven parameters (i.e. six habitat covariates and the intercept). Nevertheless, the model served as a base from which we could compare each of the subsequent models, which included additional data from the thermal-imagery surveys (the GT model), the VR-elicited expert information or both sources (see below).
Expert elicitation model. The expert elicitation data from the six experts were represented as probabilities ranging between 0 and 1, so we fit a Beta regression model to these data (the E model, n = 60) [45,46] using the betar gam function of the R package mgcv [47]. We used weights of 0.50, 0.75 and 1.00 in this E model. The weights were based on the experts' confidence in their stated probabilities of not very sure, quite sure, and very sure, respectively (Table 1; S1 File), and spanned the range of weights used for the survey-based models (i.e. 0.50 to 1.00). We centered and scaled each covariate in the E model (i.e. FPC, REV, Path, Water, longitude, latitude) using the means and standard deviations of the respective covariates in the base (G) model. Finally, we included a random effect for expert (nominal: one to six) in the E model to account for expert-to-expert variability and generalize inference beyond the six experts specifically.
Combined models. We combined the estimated parameters and their standard errors from the fitted Beta E model with those from the fitted logistic (i) G model and (ii) GT model, following the standard method for combining estimates from different models [48]. This created a combined G_E and a combined GT_E model, respectively, from which to model the distribution of koalas in the study area. This also allowed us to determine whether combining information gained from the VR-elicited data with the survey data improved model performance. For each of these models, we combined the inverse variance-weighted estimates of the parameter effects [49] from the logistic regression on the observation (o) data from the G or GT model, β o , and the Beta regression on the elicited (e) data from the E model, β e , with the variances s 2 o , and s 2 e , respectively. The combined (c) parameter estimates and their corresponding variances are given by: We calculated confidence intervals for the parameters from each of the two combined models (G_E and GT_E) using the respective combined standard error, sc, and the number of degrees of freedom for the t statistic being equal to the sum of the residual degrees of freedom from the respective G or GT model and the E model, v c = v o + v e , where α is the level of statistical significance (0.05), i.e.,b c � t a=2;v c s c .
Predictive performance evaluation. We used a dataset comprising koala observations from the thermal-imagery surveys that had high confidence (weights � 0.90) as a validation dataset to assess the predictive performance of each model (G, GT, G_E and GT_E). We focused on high confidence data to ensure that the values in the validation dataset were representative of the true values on the ground; the higher the confidence in the validation set, the greater the ability to accurately assess the predictive ability of the models. As with other datasets, we centered and scaled each of the covariates in this dataset using the means and standard deviations of the respective covariates in the base (G) model. For the G and G_E models, we used the fitted models to make predictions at the validation sites. For the GT and GT_E models, we performed leave-one-out cross validation (LOOCV) whereby a validation site was removed, the GT or GT_E model was fit to the remaining data, and a prediction was then made at the validation site. This process continued until a LOOCV prediction was made at each validation site.
In the conservation context of this modelling scenario, the priority was to identify as many koala presences as possible, rather than to find areas where they were absent. We therefore used a cut-off value of 0.4 (rather than 0.5) to classify predicted probabilities as either absences (0; for predicted probabilities < 0.4) or presences (1; for predicted probabilities � 0.4) [50]. We then compared the observations and predictions to evaluate the predictive performance of each model based on classification accuracy (correct predictions divided by the total number of predictions), along with sensitivity (true positive rate) and specificity (true negative rate) and root mean-square prediction error (RMSPE). The larger the classification accuracy, sensitivity and specificity, and smaller the RMSPE, the greater the predictive ability of the model.

Model prediction visualization.
Each of the fitted models (G, GT, G_E and GT_E) was used to make predictions at 636 unobserved sites across the study area. This allowed us to visualize and compare koala distribution across the study area as predicted by each model (G, GT, G_E and GT_E). The covariates in this dataset (i.e. FPC, REV, Path, Water, longitude, latitude) were each centered and scaled using the means and standard deviations of the respective covariates in the base (G) model.

Results
Accuracy of the base G model increased by 75% and RMSPE decreased by 26% when groundsurvey observations were combined with data from the emerging technologies (GT_E model; Table 3). The GT_E model had the greatest accuracy and sensitivity (true presence rate) and smallest RMSPE of all models (Table 3). Although the G model had the greatest specificity (true absence rate), it had exceptionally low sensitivity (0.25), suggesting the model predicted koalas to be absent in most locations. The specificity of the GT_E model was relatively high (at 0.375) given there was no negative impact on the sensitivity of this model (0.937).
Models that combined VR-elicited expert information with the survey data improved predictive accuracy and produced the most precise parameter estimates. The precision for the parameter estimates in the G model was low relative to the other models (Fig 3), and consistent with its sample size. Adding thermal-imagery data (the GT model) increased the precision and adding information elicited from experts (the G_E model) narrowed the confidence intervals further still (Fig 3). Furthermore, the combined GT_E model generated the most precise estimates of any model for each of the regression parameters.
Latitude and distance to fresh water (Water) were the only covariates with significant relationships with koala presence and absence. In the G model, latitude had a significant and positive effect on koala presence/absence, whereas distance to fresh water had significant and negative effects in the G_E and GT_E models (Fig 3). For this latter covariate, there was also a change in the mean direction of its effect among the models, having positive (albeit non-significant) effects in the G and GT models and negative (significant) effects in the G_E and GT_E models (Fig 3).
The predicted presence/absence of koalas in the study area also differed among models (Fig  4). Observer bias was apparent in the G-model predictions, for which presence of koalas was strongly predicted in the northern, more open parts of the study area near residential housing and away from the river, where access would be easy during ground surveys (Fig 4). In contrast, the GT model predicted that koala presence would be less likely in the open areas, and more likely closer to the bend of the Logan River where most thermal hot-spots had been identified (Fig 4). Predictions of koala presence were even less likely in the northern and open areas of the park when the expert information was included in the models (i.e. G_E and GT_E), with the greatest probability of presence occurring in the southwest corner of the park close to the river (Fig 4).

Discussion
Results from our case study show how data from emerging technologies can be harnessed to improve observation-based distribution models for vulnerable and protected species, particularly for cryptic ones like the koala. Like many cryptic species, koalas are notoriously difficult to detect in the field [20,22]. That they are equally likely to be found in habitat of presumed low or high quality exacerbates their imperfect detection, and observation data are thus highly prone to false negatives [25,26]. Species distribution models typically perform poorly when there is imperfect detection; even models using covariates selected on the basis of the best available knowledge will still perform poorly when misclassifications are present [51]. A future extension to the modelling approach presented here could be to estimate and explicitly incorporate detectability within an occupancy model. Although such models are not free from potential biases [52], they may provide a solution to imperfect detection when data are available to inform the detection process, as may the integration of different types of presence or absence data [53].
In our study, the emerging technologies helped to improve models by reducing survey bias [54], specifically by (i) increasing sample size, (ii) sampling in areas that were hard to access, and (iii) generating absences with a high level of confidence such that logistic regression could be used rather than models for presence-only or pseudo-absence data [55]. Survey bias was strongly apparent in the ground-survey only model, with koala presences predicted in open areas near residential housing and places that were easily accessible by observers; false negatives that were corrected upon inclusion of the RPAS and/or expert data (Fig 4). Incorporating data from emerging technologies resulted in an increase in sample size and coverage of the study area, which substantially increased model sensitivity (true presence rate; from 0.250 to 0.937). While specificity (true absence rate) of the ground-survey base model was the highest among models (at 0.500), that of the model combining ground-survey, thermal-imagery survey, and expert-elicited data was comparably high (at 0.375), particularly given model performance evaluation was prioritized towards correct identification of koala presence. By reducing survey bias, classification accuracy of the model consequently improved, as did the ability to predict koala presences at unobserved locations. This is important because accurate species distribution maps provide information critical for conservation [56], especially given that anthropogenic impacts related to habitat fragmentation and climate change continue to alter species abundances and ranges [2,57]. Improvements in accuracy and, in particular, reductions in false negatives, will provide the information needed to ensure surveillance, habitat restoration and protection measures are implemented in areas most likely to yield positive conservation outcomes [58]. Models combining ground-based observations with data from thermal-imagery and VRelicited expert information had the most precise regression parameter estimates (Fig 3). However, few covariates had significant relationships with the presence and absence of koalas in the study region. This was the case even when the data from experts and thermal imagery were included in models and despite canopy cover (related to FPC), proximity to roads and dogs (for which distance to paths may act as proxy [7]) and the presence of food-tree species (Eucalyptus spp.; REV) all having been identified previously as important [28,59,60,61]. The relatively coarse scale of the covariates generated from GIS data and their limited range of values within the small study area (S1 Table) may have contributed to these 'negative' findings. However, when additional information from experts was included in the models fit to groundbased and/or thermal data, there was a significant relationship between koala presence/absence and distance to fresh water, with greater probability of koalas being present in areas closer to the river. The leaves koalas eat provide most of their water needs, so this covariate may act as a surrogate for leaf moisture content, given trees near the river would have constant fresh water supply [7,62,63]. Areas surrounding prominent sources of fresh water in other parts of southeast Queensland, for example North Stradbroke Island, also support relatively high numbers of koalas [64]. Such findings support research that suggests extreme events which affect water availability, such as droughts and heatwaves, may affect species like koalas that use evaporative cooling for thermoregulation [65,66]. Incorporation of finer-scale data on directly measured covariates such as air temperature, precipitation, leaf-chemistry and moisture content, and soil quality, together with continued surveying of koala presence and absence through time, may help to tease apart these potential drivers of koala presence in the study area (e.g. [29]). Elevation, fire frequency, and distance to both sealed and unsealed roads may also be useful as coarser-scale covariates to consider when developing models to understand drivers of koala presence in larger study areas, for example those beyond urban parklands, including rural areas and forest reserves [7,29]. However, if the primary goal of modelling is prediction, then fine-scale covariates must also be available at all unobserved locations. This may rule out many of the finer-scale covariates collected in the field (e.g. direct measures of leaf-chemistry and moisture content) that are measured at observed locations only.
While our study is relatively small in size, both in terms of its spatial extent and the number of koalas observed from the ground, it provides a proof-of-concept that the certainty of models can be increased by including additional information from new technologies and expert elicitation in combination with traditional statistical modelling. Specifically, we demonstrated how ground-and/or RPAS-based observation data can be combined with VR expert-elicitation data to provide a more comprehensive, precise and accurate species distribution model. In the current era of accelerated, human-induced biodiversity loss and high extinction risk [2,67], we need to be innovative and creative about how we capture data and generate information to characterize biodiversity variables for timely and effective conservation. Otherwise, limited knowledge about rare, cryptic and of-concern species, such as differences in habitat needs at different life stages [68], when, where and why organisms move [69], and/or biotic interactions as species ranges shift due to climate change and other anthropogenic impacts [1,67] will continue to constrain conservation efforts. Thus, we do not advocate for an end to ground surveys; ground-collected observational data and existing ecological records will remain vital to ground-truth and combine with other forms of presence/absence data to generate much needed information for conservation [67,70]. Furthermore, the methods we present can be expanded to other wildlife species or places where little to no data are available, particularly where thermal imaging via RPAS is an appropriate solution to any detection obstacles such as site inaccessibility and habitat complexity (e.g. [71]). Such efforts will help to close data gaps and provide the scientific information needed for enhanced conservation of vulnerable and protected species.

Data accessibility
The R script and associated data to run the models and produce the results and figures herein are available in the Supplementary Information (S2 File).
Supporting information S1 Table. Koala and habitat data. Koala presence/absence and habitat data collected during field surveys of the study area or derived from freely available geographic information system (GIS) datasets.