Abstract
Computer vision has increasingly shown potential to improve data processing efficiency in ecological research. However, training computer vision models requires large amounts of high-quality, annotated training data. This poses a significant challenge for researchers looking to create bespoke computer vision models, as substantial human resources and biological replicates are often needed to adequately train these models. Synthetic images have been proposed as a potential solution for generating large training datasets, but models trained with synthetic images often have poor generalization to real photographs. Here we present a modular pipeline for training generalizable classification models using synthetic images. Our pipeline includes 3D asset creation with the use of 3D scanners, synthetic image generation with open-source computer graphic software, and domain adaptive classification model training. We demonstrate our pipeline by applying it to skulls of 16 mammal species in the order Carnivora. We explore several domain adaptation techniques, including maximum mean discrepancy (MMD) loss, fine-tuning, and data supplementation. Using our pipeline, we were able to improve classification accuracy on real photographs from 55.4% to a maximum of 95.1%. We also conducted qualitative analysis with t-distributed stochastic neighbor embedding (t-SNE) and gradient-weighted class activation mapping (Grad-CAM) to compare different domain adaptation techniques. Our results demonstrate the feasibility of using synthetic images for ecological computer vision and highlight the potential of museum specimens and 3D assets for scalable, generalizable model training.
Citation: Blair JD, Khidas K, Marshall KE (2025) Leveraging synthetic data produced from museum specimens to train adaptable species classification models. PLoS One 20(9): e0329482. https://doi.org/10.1371/journal.pone.0329482
Editor: Gianniantonio Domina, University of Palermo, ITALY
Received: March 21, 2025; Accepted: July 17, 2025; Published: September 3, 2025
Copyright: © 2025 Blair et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data files are available on Zenodo (https://doi.org/10.5281/zenodo.15038538). All code is available in a public GitHub repository (https://doi.org/10.5281/zenodo.15536489).
Funding: JDB received a Mitacs Accelerate grant to fund this work. The grant does not have a specific grant number. Mitacs’ website can be found at https://www.mitacs.ca/. The funders did not play any role in the study design, data collection, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The field of ecology is transitioning into a ‘big data’ science [1–3]. This means that the volume and velocity of data required for modern ecological analytics are surpassing the capacity of individual researchers to collect and process them [1, 4]. As such, manual data collection methods such as visual taxonomic classification have become a bottleneck in the ecological data acquisition pipeline [5]. To alleviate this bottleneck, significant effort has gone into automating the classification process through tools such as computer vision [6–8]. However, building effective computer vision tools for ecological research presents its own unique challenges. Among the most pervasive of these challenges is collecting high-quality, annotated data to train the classification algorithms used by these tools [9, 10]. This challenge stems from two main underlying issues: a lack of human resources to collect and process data, and, in some cases, few biological replicates. Applications such as iNaturalist and Wildlife Insights have addressed the human resource issue through massive crowd-sourcing and volunteer campaigns to collect their images and annotations, but such campaigns present their own logistical challenges [11, 12]. Additionally, crowdsourced data often have a taxonomic bias towards charismatic and conspicuous fauna, and thus the issue of a lack of biological replicates persists for the majority of species [13].
Oversampling via image augmentation is a potential solution to both the human resource and biological replicate challenges of collecting high-quality image datasets. Generally, oversampling refers to the process of generating additional training images from existing data, and image augmentation refers to the application of transformations (cropping, rotations, colour adjustments, etc.) to images [9, 14]. This can greatly increase the size of a training dataset without any additional specimen collection or annotation effort. While many forms of image augmentation are applied by post-processing raw images, other forms work by changing how the images are originally captured. One example is multiview augmentation, which works by capturing several images of a single specimen from multiple angles or postures [15–17]. While 2D images are limited to a single perspective, multiview augmentation works in three dimensions, which improves the model’s feature learning and generalization beyond what standard augmentations of raw 2D images can achieve. This is because the same specimen, when viewed from multiple perspectives (dorsal, ventral, side, anterior, etc.), can appear very different on a two-dimensional plane. If a model never sees a given perspective during training, it might not be able to generalize its knowledge to that perspective when the model is deployed. However, multiview augmentation often has the disadvantage of requiring greater manual sampling effort than other oversampling techniques, as it requires multiple views of the same specimen [18, 19].
Another oversampling technique is the use of “synthetic images” such as rendered images of 3D assets. 3D assets can be generated from real specimens by using methods such as photogrammetry, or completely synthetically through artistic creation [20, 21]. Similar to multiview augmentation, 3D assets can be rendered from different perspectives, in different postures, or under different imaging conditions [19, 22]. However, unlike with manual multiview augmentation, this can all be done automatically and repeatedly once the 3D asset is created. Additionally, 3D assets allow for otherwise-destructive augmentations to be made (e.g., removing legs from a beetle mesh) without damaging the original specimen. Given that all these techniques (destructive, multiview, and other image augmentations) can all be applied in combination automatically, 3D assets have considerable potential to generate large amounts of synthetic images from a relatively small number of specimens.
The motivation for creating 3D assets from real specimens goes beyond their applicability in computer vision. To improve the accessibility of their specimens, museums have gone to great effort to digitise their natural history collections, primarily through the use of 2D imaging [23–26]. However, by definition, 2D images contain far less information than 3D assets [27–29]. Even when they are used for 1-dimensional morphometric measurements, 2D images are less robust than 3D assets due to parallax errors (i.e., errors due to changes in viewing angles) [21]. Therefore, by capturing anatomically-correct 3D representations of museum specimens, we can simultaneously improve museum collection digitization while also creating a valuable resource for computer vision model training.
Although synthetic images can be used to build large image datasets, most computer vision models in ecology are used to classify real photographs. This presents a significant challenge to using this type of workflow as a source of training data because synthetic images belong to a different image domain (i.e., image type) than images taken using standard photography [30–32]. While synthetic images produced from renders of 3D assets often appear photorealistic, computer vision models may still pick up on subtle differences, thus making model generalization across domains more difficult [33–36]. To date, most research on synthetic animal images in computer vision has focussed on pose estimation [22, 31, 37–40]. When used for classification, challenges in domain generalization have led to synthetic images being used as supplementation in training datasets rather than as a primary source [20]. Given that much of the purpose of exploring synthetic images for computer vision is to use them for model training, problems with domain generalization appear to be limiting their potential.
Domain adaptation techniques are one way to address the challenge of domain generalization [35, 41]. The goal of domain adaptation techniques is to enable a model trained on a source domain (where abundant labelled data is available) to perform well on a target domain (where labelled data is scarce or absent), despite differences in the data distribution between these domains. Domain adaptation techniques fall into two broad categories: data solutions, which aim to minimise the distribution gap between the source and target domains at the data level (e.g., make the domains appear more alike), and algorithmic solutions, which modify the learning algorithm to learn features that are generalizable across the domains. A relatively simple example of an algorithmic solution is maximum mean discrepancy loss (MMD loss). MMD is a statistical measure of the difference between two data distributions [42, 43]. When used in a loss function, the model learns the primary task of classification while also trying to minimise the MMD distance between the source and target domain distributions [44, 45]. When applied to models trained on synthetic images, domain adaptation techniques could help these models generalize to real photographs, thus improving their practicality [41].
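As a concrete illustration of the statistical measure described above, the (biased) squared-MMD estimator with a Gaussian kernel can be computed in a few lines. This is a minimal sketch, not code from this study, and the kernel bandwidth `sigma` is an illustrative choice:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel values between rows of a and b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd(source, target, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy between two
    samples; 0 when the distributions coincide, larger as they diverge."""
    return (gaussian_kernel(source, source, sigma).mean()
            + gaussian_kernel(target, target, sigma).mean()
            - 2 * gaussian_kernel(source, target, sigma).mean())
```

When used as a loss term, a differentiable version of this quantity is added to the classification loss, so that minimising it pulls the source and target feature distributions towards each other.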
Here we demonstrate a classification model training pipeline that predominantly uses synthetic images rendered from 3D assets to achieve predictive performance on photographs similar to a model trained entirely with real photographs, thus providing a potential solution to domain adaptation problems in synthetic biological data. Our pipeline includes three modules: 1) 3D asset creation using white-light 3D scanners, 2) rendering synthetic images in the computer graphics software “Blender”, and 3) classification model training using domain adaptation techniques. The pipeline is modular and can be used in combination with other synthetic image pipelines. As a case study for our pipeline, we used the skulls from 16 terrestrial Canadian carnivore species (Order: Carnivora). As a scalable, generalizable approach that addresses the challenge of domain adaptation when using synthetic images, the pipeline we describe here demonstrates the feasibility and advantages of employing 3D assets for computer vision model training.
Methods
In this study, we used skulls from 16 terrestrial/semi-aquatic Canadian carnivore species (Order: Carnivora), currently preserved at the Canadian Museum of Nature (Gatineau, Quebec; S1 Table, S1 File). Skulls served as an ideal case study for three reasons: (1) all species included in our study can be distinguished by external skull morphology, (2) the smooth surfaces of the skulls make them suitable for white-light 3D scanning, and (3) skulls are underrepresented among computer vision classification models [46]. Individual skulls were selected based on their development stage and completeness. Only adult skulls with >50% of their teeth intact and no excessive damage were included. Mandibles were not included, primarily because they were frequently damaged or separated from the remainder of the specimen, and because they would occlude other diagnostic features on the ventral side of the skull.
All data and code are available on GitHub [47]. No permits were required for the described study, which complied with all relevant regulations.
Data collection
Creating 3D assets.
We scanned the skulls in batches with a Creaform GoSCAN 20 handheld scanner at an original resolution of 0.7 mm. Batches were species-specific, and batch sizes were determined by the number of skulls that could fit on a 40 cm turntable with a minimum separation of ∼½ skull width (S1 Fig, S1 File). We scanned each batch twice (once dorsally and once ventrally), with the relative position of each skull remaining the same between scans. To keep the skulls in position for the ventral scans, we affixed them to the turntable using adhesive putty. For the dorsal scans, we simply rested the skulls in their natural position. In total, we scanned 30 skulls from each species. However, one scan of Vulpes vulpes (red fox, Linnaeus 1758) became corrupted and was excluded from the study. Across all the scanned specimens, there was a slight, unintentional male bias (193 males, 162 females, 124 no sex recorded). This is consistent with sex biases in mammalian natural history collections broadly [48].
After each scan was completed, we processed the 3D images using VXelements 9 [49]. First, we refined the resolution of each completed scan to 0.5 mm with the “Smart Resolution” feature of VXelements. Then we made a first pass cleaning of the image, which included removing objects such as the turntable and adhesive putty from the scan. After a pair of corresponding images (i.e., dorsal and ventral scan) had been cleaned, we used the VXelements “Merge scans” feature to merge them. We then completed a final round of cleaning in the VXmodel application of VXelements. This included filling holes in the skull (e.g., the foramen magnum) and removing artifacts of the scanning and merging process. Finally, we saved each skull image individually as a Wavefront OBJ mesh, a .mtl material file, and a bitmap texture file.
Image collection.
To create a synthetic image dataset, we rendered 2D images of our skull assets using the open-source 3D modelling software Blender [50]. For specific Blender environment settings, refer to [47]. In Blender, we created a standard environment in which each skull was rendered one at a time (S2 Fig, S1 File). This ensured the rendering conditions (such as lighting and camera perspective) were consistent across specimens and species, and more closely matched the real photograph imaging conditions. After a skull asset was loaded into the Blender environment, we scaled it to a standardised length (measured anterior to posterior) and decimated it to a standardised polygon count. We did this to increase rendering efficiency and ensure greater uniformity across specimens and species. Images of each skull were rendered with the ‘Cycles’ rendering engine from 92 angles: 18 angles (one every 20° around the yaw axis) at camera pitches of 0°, ±30°, and ±60°, plus one render at each of ±90° (S3 Fig, S1 File). We automated skull loading and rendering with a Blender Python API script. Each render had a resolution of 720×720 pixels. In total, 479 3D skull assets were rendered to produce 44,068 synthetic images (Fig 1).
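The 92-angle scheme above can be enumerated in plain Python before handing camera positions to the Blender Python API. The helper below is an illustrative sketch (the study’s actual script and environment settings are in the cited repository), with `radius` as an assumed camera distance:

```python
import math

def render_angles():
    """Enumerate the 92 camera angles: one every 20 degrees of yaw at
    pitches of 0 and +/-30, +/-60 degrees, plus single top (+90) and
    bottom (-90) views."""
    angles = [(yaw, pitch)
              for pitch in (-60, -30, 0, 30, 60)
              for yaw in range(0, 360, 20)]
    return angles + [(0, 90), (0, -90)]

def camera_position(yaw_deg, pitch_deg, radius=1.0):
    """Spherical-to-Cartesian conversion for placing a camera that
    orbits the specimen at a fixed distance."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (radius * math.cos(pitch) * math.cos(yaw),
            radius * math.cos(pitch) * math.sin(yaw),
            radius * math.sin(pitch))
```

In a Blender script, each `(yaw, pitch)` pair would set the camera location via `camera_position` before triggering a Cycles render of that view.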
(a) A rendered 3D skull image of a Canis lupus specimen. (b) A photograph of a different C. lupus specimen.
To create an image dataset of real photographs, we photographed an additional 10 specimens from each of the 16 species. As in the synthetic image dataset, there was an unintentional male bias among the specimens photographed (65 males, 48 females, 47 no sex recorded). None of these specimens had been included in the 3D asset dataset. As in the Blender environment, we created a standardised photography setup to minimise variation in the imaging conditions between specimens and species. However, due to the large size variation of the skulls (∼8 cm to ∼44 cm in length), the distance from the camera to the specimen varied depending on the species being photographed. We photographed the specimens one at a time on a black turntable in front of a black backdrop with a Laowa FFii 90mm f2.8 CA-Dreamer Macro 2X lens attached to a Nikon Z6 mirrorless camera. Each specimen was photographed from the same 92 angles as the synthetic images. Markers placed around the turntable and a protractor attached to the camera tripod were used to assist with angle precision. However, due to human error in the photography process, only an average of 90.4 photos were taken per specimen. Most notably, Neogale vison (American mink, Schreber 1777) had no photographs taken from a camera pitch of 0°.
Using the image processing software Fiji [51], we cropped the photos to a square that would allow the skull to make a full rotation while remaining in the frame. We also removed any duplicate photos. This resulted in a total of 14,467 images for 160 specimens.
Model training
Data split and processing.
To create training and testing datasets, we split the sets of synthetic and photograph images separately, each at a ratio of 80:20. Splits were made at the specimen level so that all images of a single individual were in only one dataset. Although photographs were the target domain, we created a photograph training dataset for two reasons. The first was that the MMD model (see "Model architecture and training procedure") required unlabelled target domain images during training. The second was to create a baseline for comparison when a model was trained only with photographs. Other than in the supplemented model, the synthetic images and photographs were kept separate. This resulted in a synthetic image dataset split of 24:6 individuals and 35,328:8,740 images, and a photograph dataset split of 8:2 individuals and 11,561:2,906 images. The supplemented model, fine-tuned model, and subset photograph model used a 25% subset of the photograph training data, resulting in a specimen split of 2:2:6 (training:testing:unused).
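A specimen-level split of this kind can be sketched as follows. The function and naming are illustrative, not taken from the study’s code; the key property is that every image of a given individual lands in exactly one partition, which prevents leakage between training and testing:

```python
import random

def specimen_split(image_paths, specimen_of, train_frac=0.8, seed=0):
    """Split images at the specimen level: shuffle specimen IDs, take
    train_frac of specimens for training, and assign every image to the
    partition of its specimen."""
    specimens = sorted({specimen_of(p) for p in image_paths})
    rng = random.Random(seed)
    rng.shuffle(specimens)
    n_train = round(train_frac * len(specimens))
    train_ids = set(specimens[:n_train])
    train = [p for p in image_paths if specimen_of(p) in train_ids]
    test = [p for p in image_paths if specimen_of(p) not in train_ids]
    return train, test
```

Here `specimen_of` is any callable that maps an image path to its specimen identifier (e.g., parsing a catalogue number out of the filename).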
We scaled all images to 224 × 224 pixels for training and testing, as this is the resolution required by the VGG19 architecture [52]. We applied the following augmentations to datasets that were used for supervised training: vertical and horizontal translation, horizontal flips, and colour jitter to brightness, contrast, saturation, and hue. Testing dataset images had no augmentations.
Model architecture and training procedure.
We built all classification models on the VGG19 feature extractor architecture with preloaded ImageNet weights [52]. To this, we added a max pooling layer, a flattened layer, two dense layers, and a softmax classification layer. All models used the Adam optimizer and were trained for 100 epochs with early stopping. For all models except the supplemented and fine-tuned models, training stopped if there were 15 consecutive epochs with no improvement in the training domain testing loss. For example, if the model was trained with photographs, the early stopping criterion was based on the photograph testing dataset loss. The supplemented and fine-tuned models had access to labelled synthetic images and photographs, so the final epoch for these models was selected manually to optimise performance across both domains. All models were trained with the PyTorch and Pytorch-Adapt libraries in Python [47, 53–55].
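The classification head described above can be sketched in PyTorch as a block appended to the VGG19 feature extractor. This is a minimal sketch, not the study’s code: the hidden-layer width (`n_hidden`) is an assumption, and `n_features=512` reflects the channel count of VGG19’s last convolutional block:

```python
import torch
import torch.nn as nn

def classification_head(n_features=512, n_hidden=256, n_classes=16):
    """Head appended to a convolutional feature extractor: max pooling,
    flatten, two dense layers, and a softmax output over the 16 species.
    Layer widths here are illustrative assumptions."""
    return nn.Sequential(
        nn.AdaptiveMaxPool2d(1),   # max pool each feature map to 1x1
        nn.Flatten(),              # (N, n_features, 1, 1) -> (N, n_features)
        nn.Linear(n_features, n_hidden),
        nn.ReLU(),
        nn.Linear(n_hidden, n_classes),
        nn.Softmax(dim=1),         # class probabilities per image
    )
```

In the full model, this head would follow `torchvision.models.vgg19(...).features` initialised with ImageNet weights, as described in the text.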
To set baseline comparisons for how classification models performed in the absence of domain adaptation techniques, we trained three models. The first baseline model was trained with only the synthetic image training dataset (hereafter referred to simply as ‘the baseline model’). This model set a lower limit on classification performance when models were tested on photographs. The ‘photograph baseline model’ was trained exclusively with the photograph training dataset to set an upper-limit comparison for model performance when tested on photographs. Finally, the ‘subset photograph model’ served as a point of comparison to see how a model would perform when trained with the same number of labelled photographs as the fine-tuned and supplemented models. All three baseline models were optimised by cross-entropy loss.
To improve generalization from the synthetic images to the photographs, we tested three models that each used a different domain adaptation technique. The first domain adaptation model used MMD loss to align the feature spaces of the synthetic images and photographs [43]. During training, the model was fed labelled synthetic images and unlabelled photographs from each domain’s respective training dataset. This implementation of MMD loss was based on [44]. The second domain adaptation model used a technique called "fine-tuning". This model was trained on the subset photograph dataset, but rather than beginning training with the ImageNet weights as in the other models, it began from the final weights of the MMD model. Finally, the ‘supplemented model’ was trained with an image dataset that combined the synthetic image training dataset with the subset photograph training dataset (35,328 synthetic images and 2,907 photographs). Both the supplemented and fine-tuned models were optimised using cross-entropy loss.
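An MMD-adapted training step combines a supervised loss on the labelled synthetic batch with an MMD penalty between the two domains’ features. The sketch below is illustrative rather than the study’s implementation (which used Pytorch-Adapt); the MMD weight `lam` and the kernel bandwidth `sigma` are assumed hyperparameters:

```python
import torch
import torch.nn as nn

def mmd(x, y, sigma=1.0):
    """Gaussian-kernel (biased) squared MMD between two feature batches."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def train_step(features, classifier, opt, syn_x, syn_y, photo_x, lam=1.0):
    """One update: cross-entropy on labelled synthetic images plus an MMD
    term aligning synthetic and (unlabelled) photograph features."""
    opt.zero_grad()
    f_syn, f_photo = features(syn_x), features(photo_x)
    loss = nn.functional.cross_entropy(classifier(f_syn), syn_y) \
         + lam * mmd(f_syn, f_photo)
    loss.backward()
    opt.step()
    return loss.item()
```

Note that the photographs contribute only through the MMD term, so no photograph labels are needed during this stage of training.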
Feature visualization.
To visualise each model’s representation of the testing data’s feature space, we used t-SNE on the activations of the model’s post-convolution flattened layer [56]. To quantify the feature extractor’s clustering ability, we measured silhouette scores from each set of t-SNE embeddings. Silhouette scores measure how well-separated clusters are in a dataset, considering both the distance within clusters (i.e., cluster tightness) and the distance between clusters (i.e., cluster separation). Clusters were defined by the ground truth labels of the images. From these clusters, we calculated three silhouette scores: with only synthetic images, with only photographs, and with both datasets combined. Domain confusion for each model was assessed visually from the t-SNE embeddings.
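With scikit-learn, the embedding-plus-scoring step can be sketched in a few lines; this is illustrative and not the study’s code, and the perplexity setting is an assumption:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def embed_and_score(features, labels, seed=0):
    """Project flattened-layer activations to 2D with t-SNE, then score
    how well the ground-truth clusters separate in the embedding."""
    emb = TSNE(n_components=2, random_state=seed,
               perplexity=min(30, len(features) - 1)).fit_transform(features)
    return emb, silhouette_score(emb, labels)
```

Silhouette scores range from −1 to 1, with values near 1 indicating tight, well-separated clusters.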
To visualise which features of the images each model used to make classifications, we created gradient-weighted class activation maps (Grad-CAMs) for 100 randomly-selected images from the photograph testing dataset [57]. The Grad-CAMs were generated from the activation of the predicted class in the final convolutional layer of each model. To quantify the types of features used by the model, we assigned each Grad-CAM a score based on the criteria in S2 Table (S1 File). Grad-CAM scores were assigned manually and averaged for each model.
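The Grad-CAM computation itself can be sketched with PyTorch hooks: capture the target convolutional layer’s activations and gradients, pool the gradients spatially into channel weights, and take the ReLU of the weighted activation sum. This is a minimal, hypothetical implementation for illustration, not the study’s code:

```python
import torch
import torch.nn as nn

def grad_cam(model, conv_layer, image, class_idx=None):
    """Grad-CAM sketch: weight the chosen conv layer's activations by the
    spatially pooled gradients of the target class score, then ReLU and
    normalise the resulting heat map to [0, 1]."""
    store = {}
    h1 = conv_layer.register_forward_hook(
        lambda m, inp, out: store.update(acts=out))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gin, gout: store.update(grads=gout[0]))
    scores = model(image.unsqueeze(0))                 # add batch dimension
    idx = int(scores.argmax(1)) if class_idx is None else class_idx
    scores[0, idx].backward()                          # gradient of target class
    h1.remove(); h2.remove()
    weights = store['grads'].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = torch.relu((weights * store['acts']).sum(dim=1)).squeeze(0)
    return cam / (cam.max() + 1e-8)
```

The returned map has the spatial resolution of the chosen convolutional layer and is typically upsampled to the input image size for visualisation.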
Results
Accuracy
When trained exclusively on synthetic images and tested on photographs, the baseline model recorded an accuracy of 55.4% (Table 1). All domain adaptation methods produced improvements in photograph classification accuracy over the baseline model, with the supplemented model achieving the highest classification accuracy at 95.1% (Fig 2). This was the only model to achieve >90% classification accuracy on both the synthetic and photograph testing datasets (95.6% and 95.1%, respectively).
The rows are the species’ true classifications, while the columns represent the model’s predicted classifications. Cells are shaded according to the proportion of the true labels classified as each species (i.e., shaded by row).
To measure how well a model would perform if it were trained exclusively on images from its own domain, we also measured accuracy on two photograph-trained models. Across all models, the photograph baseline model recorded the highest photograph classification accuracy at 99.2% (Table 1). When the photograph model was trained with the same number of labelled images as the domain-adapted models, its accuracy dropped to 85.3%, lower than that of both domain-adapted models that also used labelled photographs during training. When tested for domain generalization on the synthetic image test dataset, both the photograph baseline model and the photograph subset model achieved <30% classification accuracy.
Qualitative analysis
t-SNE visualisation.
The t-SNE clustering and silhouette scores showed that the feature extractors of models trained with synthetic images were better at clustering species into single, distinguishable clusters than the feature extractors of models trained only with photographs (Fig 3, Table 2). The supplemented model recorded the highest species cluster silhouette scores, regardless of whether the synthetic and photograph clusters were measured separately or together. The photograph baseline and subset models both produced poor silhouette scores for the photograph species clusters, despite their relatively high accuracy measurements when classifying photographs.
Because the absolute axis values of t-SNE plots do not contain meaningful information, they are not shown. Each t-SNE plot was generated using the activations of the model’s post-convolution flattened layer. Blue ‘x’ points represent synthetic images, and red dots represent photographs. All images were from the test dataset.
Qualitatively, overlap between domains in the t-SNE plots appeared highest in the fine-tuned model (Fig 3c). In the fine-tuned model, all individuals from a given species occupied the same general feature space, regardless of the image’s domain. This indicated that the model was extracting features generalizable across domains, as opposed to domain-specific features. The only exceptions to this were Lontra canadensis (North American river otter, Schreber 1777) and Canis lupus (grey wolf, Linnaeus 1758), which had slight shifts in feature space between domains. The MMD model had similarly high domain confusion (Fig 3b), with the key exception being the Lo. canadensis photographs, which clustered with Gulo gulo (wolverine, Linnaeus 1758) images rather than with Lo. canadensis synthetic images (S4 Fig, S1 File). In the supplemented model, images of the same species, but from different image domains, often formed adjacent clusters (S4 Fig, S1 File). The photograph baseline and photograph subset models showed relatively poor domain confusion (Fig 3e,f).
Grad-CAMs.
The fine-tuned model achieved the highest average Grad-CAM score, with 96 images scoring a value of 3 and four images scoring a value of 2 (Fig 4). This indicated that, compared to the other models, it more frequently used features on the skull rather than the background to make classifications. The photograph baseline and photograph subset models had the lowest average Grad-CAM scores, as both had 50 or more images with a score of 0.
The average scores for each model are printed at the top of the model’s bar. High Grad-CAM scores (i.e., closer to 3) indicated that the model frequently focussed on skull morphology, as opposed to background features, to make classifications. Exact Grad-CAM scoring criteria can be found in S2 Table (S1 File).
Discussion
The pipeline
Domain generalization is a challenge for using 3D assets as a source of training images in classification models for biodiversity monitoring [20, 41]. In this study, we present a simple, effective, and modular pipeline to train domain-generalizable classification models primarily using synthetic images generated from 3D assets. Through its three components (3D scanning, image rendering, and model training), the pipeline explicitly addresses classification generalizability on two fronts. First, for synthetic image generation we created a Blender environment that automatically produced 92 multiview images of each skull. The Blender environment also allowed a high degree of precision in image angles, ensuring maximum diversity in the angles from which each skull was imaged. The 92-image limit per skull was also entirely self-imposed; the Blender environment can be modified to produce an effectively unlimited number of unique images per skull. Second, by using domain adaptation training procedures such as MMD loss, we were able to produce models that were highly accurate when tested on both source domain and target domain images (synthetic images and photographs, respectively).
The simplicity and modularity of this pipeline allow it to potentially be applied to a wide range of taxa and objectives. The pipeline’s three primary components all function independently of each other, and thus can be substituted with alternative methods without affecting the other components. For example, the 3D assets produced via scanning could be subjected to more complicated rendering methods that allow the assets to be posed and rendered more realistically, such as in replicAnt [22], which uses a video game graphics engine to render 3D assets in realistic environments for use in computer vision models. Scanning can also be swapped out with other 3D asset creation methods such as photogrammetry, which have the advantage of potentially being more affordable than 3D scanners [21, 28, 29]. While the subjects of the classification models can obviously be substituted with other taxa, the models themselves are not rigid either and can be customized to fit various classification problems (changing the model architecture, using alternative domain adaptation techniques, etc.).
Domain adaptation
Here we show that domain adaptation methods such as MMD loss and fine-tuning can significantly improve classification performance on target domain images (Table 1). On its own, MMD loss improved classification performance over the baseline model, but still underperformed both models trained exclusively with real photographs. Accuracy of the supplemented and fine-tuned models surpassed the photograph subset model (+9.8% and +4.1% respectively), even though all three models were exposed to the same number of labelled photographs during training. This shows that in the absence of a large, labelled target domain dataset, training datasets primarily composed of synthetic images can yield high classification accuracy.
Even in cases where large, labelled target domain datasets are available, the inclusion of synthetic images during training might still warrant consideration. A known challenge in ecological applications of computer vision is that models can learn contextual clues in images (e.g., background features) to inform classification decisions [58]. In moderation this might be acceptable, as the inclusion of relevant contextual metadata (e.g., spatiotemporal data) has been shown to improve model performance [59, 60]. However, if the contextual clues are not generalizable to new situations, such as the markings on the turntable in our photograph dataset, or if the contextual data becomes more important to the model than the morphological features of the specimen, this becomes problematic. As we have shown by testing our photograph models on synthetic images (Table 1), models that rely heavily on contextual features generalize poorly to new situations. An advantage of synthetic images is that all features of the images can be tightly controlled by the dataset’s creator. In the synthetic images used for this study, we ensured that all contextual features of the images were uniform across species, thus forcing the model to learn exclusively from the morphology of the specimens. This yielded a more generalizable baseline model (55.4% cross-domain accuracy vs. 28.3%), which was further enhanced using domain adaptation techniques such as MMD loss and fine-tuning. Here we have shown that by first training a model on a dataset with generalizable features (such as our synthetic image dataset), and then adapting that model to the real-life domain in which it will be deployed, a model can be encouraged to learn relevant, generalizable features while still maintaining high classification accuracy (Fig 4).
Conclusion
Collecting labelled images is an obstacle to building computer vision models for ecological applications due to the high quantity of images required and the human resources needed to collect and label those images. Leveraging the wealth of readily available biological samples in natural history collections to create 3D models for synthetic image generation is a promising solution to this problem, but so far it has faced its own challenge of domain adaptation. In this study, we report a simple approach to producing generalizable classification models trained primarily with 3D assets generated from museum specimens. Our work is a step towards unlocking the full potential of synthetic images and museum collections in the context of computer vision for ecology.
Acknowledgments
We thank the Canadian Museum of Nature for allowing access to their specimens, equipment, and facilities. We thank Greg Rand, Marie-Helene Hubert, and Alan McDonald for their assistance with the specimen collections and scanning equipment. We also thank Drs Leonid Sigal, Michelle Tseng, and Rachel Germain for their insightful discussions.
References
- 1. Farley SS, Dawson A, Goring SJ, Williams JW. Situating ecology as a big-data science: Current advances, challenges, and solutions. BioScience. 2018;68(8):563–76.
- 2. Wüest RO, Zimmermann NE, Zurell D, Alexander JM, Fritz SA, Hof C, et al. Macroecology in the age of Big Data – Where to go from here? J Biogeogr. 2019;47(1):1–12.
- 3. Nathan R, Monk CT, Arlinghaus R, Adam T, Alós J, Assaf M, et al. Big-data approaches lead to an increased understanding of the ecology of animal movement. Science. 2022;375(6582):eabg1780. pmid:35175823
- 4. Yang C, Huang Q. Spatial cloud computing: A practical approach. CRC Press.
- 5. Karlsson D, Hartop E, Forshage M, Jaschhof M, Ronquist F. The Swedish Malaise Trap Project: A 15 year retrospective on a countrywide insect inventory. Biodivers Data J. 2020;8:e47255. pmid:32015667
- 6. Weinstein BG. A computer vision for animal ecology. J Anim Ecol. 2018;87(3):533–45. pmid:29111567
- 7. Høye TT, Ärje J, Bjerge K, Hansen OLP, Iosifidis A, Leese F, et al. Deep learning and computer vision will transform entomology. Proc Natl Acad Sci U S A. 2021;118(2):e2002545117. pmid:33431561
- 8. Tuia D, Kellenberger B, Beery S, Costelloe BR, Zuffi S, Risse B, et al. Perspectives in machine learning for wildlife conservation. Nat Commun. 2022;13(1):792. pmid:35140206
- 9. Schneider S, Greenberg S, Taylor GW, Kremer SC. Three critical factors affecting automated image species recognition performance for camera traps. Ecol Evol. 2020;10(7):3503–17. pmid:32274005
- 10. Blair JD, Gaynor KM, Palmer MS, Marshall KE. A gentle introduction to computer vision-based specimen classification in ecological datasets. J Anim Ecol. 2024;93(2):147–58. pmid:38230868
- 11. Van Horn G, Mac Aodha O, Song Y, Cui Y, Sun C, Shepard A. The iNaturalist species classification and detection dataset. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. p. 8769–78.
- 12. Ahumada JA, Fegraus E, Birch T, Flores N, Kays R, O’Brien TG, et al. Wildlife insights: A platform to maximize the potential of camera trap and other passive sensor wildlife data for the planet. Environ Conserv. 2019;47(1):1–6.
- 13. Blair J, Weiser MD, Kaspari M, Miller M, Siler C, Marshall KE. Robust and simplified machine learning identification of pitfall trap-collected ground beetles at the continental scale. Ecol Evol. 2020;10(23):13143–53. pmid:33304524
- 14. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1).
- 15. Raitoharju J, Meissner K. On confidences and their use in (semi-)automatic multi-image taxa identification. In: 2019 IEEE symposium series on computational intelligence (SSCI); 2019. p. 1338–43. https://doi.org/10.1109/ssci44817.2019.9002975
- 16. Ärje J, Melvad C, Jeppesen MR, Madsen SA, Raitoharju J, Rasmussen MS, et al. Automatic image-based identification and biomass estimation of invertebrates. Methods Ecol Evol. 2020;11(8):922–31.
- 17. Niri R, Gutierrez E, Douzi H, Lucas Y, Treuillet S, Castaneda B, et al. Multi-view data augmentation to improve wound segmentation on 3D surface model by deep learning. IEEE Access. 2021;9:157628–38.
- 18. 3D/VR in the academic library: Emerging practices and trends.
- 19. Irschick DJ, Christiansen F, Hammerschlag N, Martin J, Madsen PT, Wyneken J, et al. 3D visualization processes for recreating and studying organismal form. iScience. 2022;25(9):104867. pmid:36060053
- 20. Beery S, Liu Y, Morris D, Piavis J, Kapoor A, Meister M, et al. Synthetic examples improve generalization for rare classes. In: Proceedings – 2020 IEEE winter conference on applications of computer vision, WACV; 2020.
- 21. Plum F, Labonte D. scAnt—An open-source platform for the creation of 3D models of arthropods (and other small objects). PeerJ. 2021;9:e11155. pmid:33954036
- 22. Plum F, Bulla R, Beck HK, Imirzian N, Labonte D. replicAnt: A pipeline for generating annotated images of animals in complex environments using Unreal Engine. Nat Commun. 2023;14(1):7195. pmid:37938222
- 23. Soltis PS. Digitization of herbaria enables novel research. Am J Bot. 2017;104(9):1281–4. pmid:29885238
- 24. Tegelberg R, Kahanpaa J, Karppinen J, Mononen T, Wu Z, Saarenmaa H. Mass digitization of individual pinned insects using conveyor-driven imaging. In: Proceedings – 13th IEEE international conference on eScience; 2017.
- 25. Nelson G, Ellis S. The history and impact of digitization and digital data mobilization on biodiversity research. Philos Trans R Soc Lond B Biol Sci. 2018;374(1763):20170391. pmid:30455209
- 26. Nelson G, Paul DL. DiSSCo, iDigBio and the future of global collaboration. BISS. 2019;3.
- 27. Wheeler Q, Bourgoin T, Coddington J, Gostony T, Hamilton A, Larimer R, et al. Nomenclatural benchmarking: The roles of digital typification and telemicroscopy. Zookeys. 2012;(209):193–202. pmid:22859888
- 28. Nguyen CV, Lovell DR, Adcock M, La Salle J. Capturing natural-colour 3D models of insects for species discovery and diagnostics. PLoS One. 2014;9(4):e94346. pmid:24759838
- 29. Ströbel B, Schmelzle S, Blüthgen N, Heethoff M. An automated device for the digitization and 3D modelling of insects, combining extended-depth-of-field and all-side multi-view imaging. Zookeys. 2018;(759):1–27. pmid:29853774
- 30. Prakash A, Boochoon S, Brophy M, Acuna D, Cameracci E, State G, et al. Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In: 2019 international conference on robotics and automation (ICRA); 2019. https://doi.org/10.1109/icra.2019.8794443
- 31. Jiang L, Liu S, Bai X, Ostadabbas S. Prior-aware synthetic data to the rescue: Animal pose estimation with very limited real data.
- 32. Jiang L, Ostadabbas S. SPAC-Net: Synthetic pose-aware animal controlnet for enhanced pose estimation.
- 33. Wang M, Deng W. Deep visual domain adaptation: A survey. Neurocomputing. 2018;312:135–53.
- 34. Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci U S A. 2020;117(23):12592–4. pmid:32457147
- 35. Farahani A, Voghoei S, Rasheed K, Arabnia HR. A brief review of domain adaptation. In: Transactions on Computational Science and Computational Intelligence. Springer International Publishing; 2021. p. 877–94. https://doi.org/10.1007/978-3-030-71704-9_65
- 36. Roberts M, Driggs D, Thorpe M, Gilbey J, Yeung M, Ursprung S, et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell. 2021;3(3):199–217.
- 37. Cao J, Tang H, Fang HS, Shen X, Lu C, Tai YW. Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV); 2019. p. 9498–507. https://openaccess.thecvf.com/content_ICCV_2019/html/Cao_Cross-Domain_Adaptation_for_Animal_Pose_Estimation_ICCV_2019_paper.html
- 38. Zuffi S, Kanazawa A, Berger-Wolf T, Black MJ. Three-D safari: Learning to estimate zebra pose, shape, and texture from images “In the Wild”. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV); 2019. p. 5359–68. https://openaccess.thecvf.com/content_ICCV_2019/html/Zuffi_Three-D_Safari_Learning_to_Estimate_Zebra_Pose_Shape_and_Texture_ICCV_2019_paper.html
- 39. Mu J, Qiu W, Hager GD, Yuille AL. Learning from synthetic animals; 2020. p. 12386–95.
- 40. Li C, Lee GH. From synthetic to real: Unsupervised domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2021. p. 1482–91. https://openaccess.thecvf.com/content/CVPR2021/html/Li_From_Synthetic_to_Real_Unsupervised_Domain_Adaptation_for_Animal_Pose_CVPR_2021_paper.html
- 41. Peng X, Usman B, Saito K, Kaushik N, Hoffman J, Saenko K. Syn2Real: A new benchmark for synthetic-to-real visual domain adaptation.
- 42. Borgwardt KM, Gretton A, Rasch MJ, Kriegel H-P, Schölkopf B, Smola AJ. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics. 2006;22(14):e49-57. pmid:16873512
- 43. Gretton A, Sejdinovic D, Strathmann H, Balakrishnan S, Pontil M, Fukumizu K. Optimal kernel choice for large-scale two-sample tests. In: Advances in neural information processing systems. vol. 25; 2012.
- 44. Long M, Cao Y, Wang J, Jordan MI. Learning transferable features with deep adaptation networks.
- 45. Yan H, Ding Y, Li P, Wang Q, Xu Y, Zuo W. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2272–81.
- 46. Henzler P, Mitra NJ, Ritschel T. Escaping Plato’s Cave: 3D shape from adversarial rendering. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV); 2019. p. 9984–93. https://openaccess.thecvf.com/content_ICCV_2019/html/Henzler_Escaping_Platos_Cave_3D_Shape_From_Adversarial_Rendering_ICCV_2019_paper.html
- 47. Blair J. Skull-adapt. Zenodo. https://zenodo.org/doi/10.5281/zenodo.15536489
- 48. Cooper N, Bond AL, Davis JL, Portela Miguez R, Tomsett L, Helgen KM. Sex biases in bird and mammal natural history collections. Proc Biol Sci. 2019;286(1913):20192025. pmid:31640514
- 49. Creaform. VXelements.
- 50. Blender Foundation. Blender. www.blender.org
- 51. Schindelin J, Arganda-Carreras I, Frise E, Kaynig V, Longair M, Pietzsch T, et al. Fiji: An open-source platform for biological-image analysis. Nat Methods. 2012;9(7):676–82. pmid:22743772
- 52. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition.
- 53. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, editors. Advances in neural information processing systems. vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
- 54. Musgrave K, Belongie S, Lim SN. PyTorch adapt.
- 55. Python Software Foundation. Python language reference. www.python.org
- 56. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11):2579–605.
- 57. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2019;128(2):336–59.
- 58. Beery S, Van Horn G, Perona P. Recognition in Terra incognita. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). Berlin, Heidelberg: Springer; 2018.
- 59. Terry JCD, Roy HE, August TA. Thinking like a naturalist: Enhancing computer vision of citizen science images by harnessing contextual data. Methods Ecol Evol. 2020;11(2):303–15.
- 60. Blair J, Weiser MD, de Beurs K, Kaspari M, Siler C, Marshall KE. Embracing imperfection: Machine-assisted invertebrate classification in real-world datasets. Ecol Inform. 2022;72:101896.