Fungi Identify the Geographic Origin of Dust Samples

There is a long history of archaeologists and forensic scientists using pollen found in a dust sample to identify its geographic origin or history. Such palynological approaches have important limitations as they require time-consuming identification of pollen grains, a priori knowledge of plant species distributions, and a sufficient diversity of pollen types to permit spatial or temporal identification. We demonstrate an alternative approach based on DNA sequencing analyses of the fungal diversity found in dust samples. Using nearly 1,000 dust samples collected from across the continental U.S., our analyses identify up to 40,000 fungal taxa from these samples, many of which exhibit a high degree of geographic endemism. We develop a statistical learning algorithm via discriminant analysis that exploits this geographic endemicity in the fungal diversity to correctly identify samples to within a few hundred kilometers of their geographic origin with high probability. In addition, our statistical approach provides a measure of certainty for each prediction, in contrast with current palynology methods that are almost always based on expert opinion and devoid of statistical inference. Fungal taxa found in dust samples can therefore be used to identify the origin of that dust and, more importantly, we can quantify our degree of certainty that a sample originated in a particular place. This work opens up a new approach to forensic biology that could be used by scientists to identify the origin of dust or soil samples found on objects, clothing, or archaeological artifacts.


Introduction
In the Sherlock Holmes mysteries the author Arthur Conan Doyle repeatedly has Holmes solve crimes by identifying the geographic origin of mud on a shoe, a pair of pants or some other material [1]. Since the publication of the Sherlock Holmes mysteries, there has been a large body of research devoted to determining when and where a dust sample might have originated [2][3][4]. Such work has been valuable in both archaeological and criminal investigations and has been used to ascertain the origin of dust or soil found on artifacts [1], on skin, in lungs [2], on clothes, on a document [5] or contraband in a shipment [6] or even on the grill of a car.
Such investigations can be based on the abiotic characteristics of the soil or dust [7], as Sherlock described, but they are frequently based on analyses of the types of plant pollen present in such samples. Because the plant communities found in different regions and habitats are often distinct, the idea is that plant pollen found in dust samples can be used to identify where that dust sample originated. Yet, while the idea of "biogeoprinting" samples on the basis of pollen samples is beguiling it has suffered from several problems that have hindered its utility for forensic dust analysis [8][9][10]. First, it requires that samples be sufficiently rich in pollen to permit identification. Second, it requires that the collected pollen be diagnostic not just of a particular biome (e.g., temperate forest) but of a more specific region, the odds of which improve with the number of pollen taxa encountered [11]. Third, it requires that we can accurately identify the pollen from all over the world or at least all over the region from which an object or body has potentially originated, which requires comprehensive collections of pollen [12] and the expertise necessary to match pollen to those collections [13]. In practice, this has meant that, while the use of palynological approaches in forensics and archaeology has become more common [4], it has depended heavily on expert identification of the pollen of individual plant species and knowledge about the distribution of those species. As a result, palynology remains, as a recent review put it, "a rarely used technique," [14] relative to many of the other tools available to crime scene investigators or archaeologists. Moreover, when palynology is used to discern the geographic origin of a sample, the result is almost always based on expert opinion and devoid of statistical inference.
In addition to pollen, dust typically harbors a myriad of other taxa, including large numbers of microbial taxa [15] and there is some evidence that the analysis of these taxa can also be used to identify the geographic origin of dust or soil samples [16,17]. Fungi represent a particularly promising taxon for biogeoprinting and forensic analyses in general [18] for a variety of reasons. First, the global diversity of fungi is enormous with nearly 100,000 described fungal species and far more that remain undescribed [19] with individual samples of soil or dust harboring large numbers of fungal taxa [20,21]. Second, many fungal taxa are restricted in their geographic distribution and are only found in particular locations or ecosystem types [22]. Third, since many fungal taxa produce spores that are tolerant of dessication and other environmental stresses they can persist in samples for prolonged periods of time. Despite the potential advantages of fungal analyses for biogeoprinting, their utility remains more potential than realized and there are very few cases of fungi being used for forensic investigations [18]. We do not know if fungi can effectively be used to determine the geographic origin of dust samples and the utility of such an approach will ultimately require combining a broad-scale analysis of dust-associated fungi with rigorous statistical analyses to assess the probability that a sample has come from a particular habitat or region.
Working with the public, we obtained outdoor dust samples from 928 sites located across the United States. We identified the fungal taxa present in each of the collected samples via amplicon sequencing on the Illumina HiSeq platform of an internal transcribed spacer (ITS) region of the rRNA operon. We then took a random subset of these samples and developed a statistical learning model via discriminant analysis [23] based on fungal taxa occurrences across the country. Using the remaining samples (a set independent from the first group), we tested whether the model could predict the geographic area from which a given dust sample was collected. An error was assigned to each sample on the basis of the great-circle distance in kilometers between its actual location and its predicted location. This prediction procedure was repeated across disjoint subsets of the 928 samples via five-fold cross-validation. The resulting model successfully predicts the origin of samples to within a great-circle distance of 230 km, on average, with high probability.

Data collection
Outdoor dust samples were collected by volunteers participating in the Wild Life of Our Homes project [24] (WLOH, homes.yourwildlife.org), a continental-scale citizen science project mapping microbial diversity (fungi, bacteria, archaea) inside and outside the home. We recruited participants, representing all 50 states and the District of Columbia, through our website, social media and email campaigns over the period January 2012 to March 2013. The 1,430 enrolled participants were provided a written Informed Consent form approved by the North Carolina State University's Human Research Committee (Approval No. 2177) as well as instructions for sampling their home and a home microbe sampling kit. Each home sampling kit contained dual-tipped sterile BBL CultureSwabs. Participants were instructed to sample the upper door trim on the outside surface of an exterior door, a sampling location that is unlikely to be cleaned frequently and serves as a passive collector of outdoor aerosols and dust with little to no direct contact from the home occupants. Here we focus on n = 928 of the 1430 samples that were successfully sequenced in our lab.

Molecular analysis
Participants returned swabs by first-class mail over the period March 2012 to May 2013, and these swabs were stored in a -20°C freezer until processed. Latitude and longitude coordinates were derived from location information (address) and were used to obtain geo-referenced environmental variables for each household. Daily temperature (°C) and daily precipitation (mm) were calculated from the Climate Research Unit Time Series v3.21 Dataset [25] (monthly coverage from 1901 to 2009) and land type was classified by the National Land Cover Database 2006 [26] as belonging to one of twenty classes respresenting varying types of urbanization, forestation, wetlands, etc. As fungal communities likely differ between varied climates [22], accounting for these data may identify climates for which our forthcoming model excels or struggles at making accurate spatial origin predictions. This analysis is useful because it would allow one to make predictions about the types of fungi likely to be found in samples from a particular region without necessarily collecting lots of samples from that region.
Swabs were prepared for sequencing using the direct PCR approach [27]. Swab tips were placed directly into wells in 2 mL 96-well plates (Axygen Inc.) along with the appropriate negative control samples. Plates were processed using the Extract-N-Amp PCR kit (Sigma-Aldrich, Inc.) following a modified version of the manufacturers instructions. After each well received 250 μL of the Extract-N-Amp Extraction solution, the plate was sealed securely with a 96 round well Impermamat Silicon Sealing Mat (Axygen, Inc.) and heated at 90°C for 10 minutes in a dry bath. Extract-N-Amp Dilution solution was then added to the wells at a 1:1 ratio to the Extraction solution and mixed gently by pipetting. The plate was resealed with the mat and stored at 4°C. PCR was conducted in 20 μL triplicate reactions per sample using 10 μL of Extract-N-Amp Ready Mix, 1 μL of the forward and reverse primers, 5 μL of PCR-grade water, and 4 μL of the Extract-N-Amp sample solutions from the 96-well plate.
Fungal diversity in each sample was assessed using a high-throughput sequencing method to characterize the variation in a marker gene sequence. We sequenced the first internal transcribed spacer (ITS1) region of the rRNA operon, the most widely used 'barcode' for fungal community analyses [28], using the ITS1-F (CTTGGTCATTTAGAGGAAGTAA) and ITS2 (GCTGCGTTCTTCATCGATGC) primer pair [21]. The primers included the appropriate Illumina adapters with the reverse primers also having an error-correcting 12-bp barcode unique to each sample to permit multiplexing of samples. PCR products from all samples were quantified using the PicoGreen dsDNA assay, and pooled together in equimolar concentrations. Samples were sequenced on an Illumina HiSeq instrument. All sequencing runs were completed at the University of Colorado Next Generation Sequencing Facility.
The 100-bp sequences were demultiplexed using a custom Python script with quality filtering and phylotype (i.e., operational taxonomic unit) clustering conducted using the UPARSE pipeline [29]. During quality filtering, a maxee value of 0.5 was used (indicating that on average 0.5 nucleotides were incorrectly assigned in every sequence). Sequences were also dereplicated and singleton sequences were removed prior to phylotype determination. Representative sequences from the phylotypes were checked for ! 75% similar to ITS1 sequences contained in the UNITE November, 2012 database [30], the most comprehensive reference database for sequence-based fungal analyses. We discarded sequences that were < 75% similar to those in the UNITE database because, while they could be fungal, they could also be from other non-fungal, eukaryotic groups and we wanted to be careful to restrict our analyses just to those taxa that we were very confident were fungal. The representative sequences were then used to categorize the raw sequences into phylotypes at the 97% similarity threshold. Phylotypes were classified to taxonomic groups using the RDP classifier with a confidence threshold of 0.5 [31] against the UNITE database. Samples were rarefied to 20,000 randomly-selected sequences per sample in order to compare all samples at an equivalent sequencing depth.

Statistical analysis
Spatial analysis and classification of the sequenced fungal taxa data was achieved via discriminant analysis [23]. Discriminant analysis is a two-stage classification method built on Bayes' Theorem that classifies a new observation as arising from one of many possible populations based on its measured characteristics. The analysis proceeded in two stages: we first estimated the spatial distribution of each species' occurrence probability using available samples, and then inverted these probabilities to predict the spatial origin of a new sample. Define Y ij as the binary indicator that species j = 1,. . .,m is present in the sample taken at spatial location s i , i = 1,. . .,n, and let p j (s i ) = Prob(Y ij = 1).
We estimated the probability of the presence of species j in samples taken across a fine grid of points T covering the continental United States using kernel smoothing [23]. Kernel smoothing locally weights noisy observations via a Gaussian kernel and thereby produces a smooth portrait of estimated occurrence probabilities over T, allowing for our method to make predictions at locations for which we may not have dust sample data. We let T = {t 1 , . . ., t N } with larger choices of N achieving finer granularity at the expense of increased computational costs. The estimated occurrence probability of species j at location t 2 T iŝ jjt − s i jj is the great-circle distance (km) between t and s i , k j ðhÞ ¼ exp½Àh 2 ð2r 2 j Þ À1 is the Gaussian kernel function, and ρ j is the kernel bandwidth. The estimated probabilityp j ðtÞ is a locally-weighted average of the observations, with the weights w ij (t) decaying as a function of the distance from t. We select kernel bandwidths separately for each species via generalized cross-validation [23]. For species j = 1,. . .,m, define y j = (Y 1j ,. . .,Y nj ) T and W ρ = [W(s 1 ),. . .,W(s n )] T with W(s i ) = [w 1j (s i ),. . .,w nj (s i )] T , i = 1,. . .,n, so thatŷ j ¼ W r y j . Then the best kernel bandwidth ρ j for species j minimizes Using ρ j and (Eq 1) we findp j ¼ ½p j ðt 1 Þ; . . . ;p j ðt N Þ T , the estimated spatial distribution of occurrence probabilities of species j over T. Given these occurrence probabilities for every species, we then classified the spatial origin of a new sample with binary features (Y 01 ,. . .,Y 0m ) taken at an unknown location s 0 . Assume a flat prior is placed on T so that a sample is equally likely to have originated from any location t. Then the Bayes' rule (under 0/1 loss) iŝ log That is, the Bayes' rule selects the spatial location that maximizes the log-likelihood of the new sample. We performed five-fold cross-validation to illustrate the method. The data were randomly split into five groups. For each group, its data comprised the testing data, and the data in the remaining four groups formed the training data. The training data were split further into subtraining (80% of the training data) and subtesting (20% of the training data). With the subtraining data, we obtained estimated occurrence probabilities via (Eq 1) with per-species bandwidths minimizing (Eq 2). Then, for each sample in the testing data, we inverted these probabilities with (Eq 3) to predict their spatial origins. The great-circle distance in kilometers between a sample's predicted originŝ and its true origin s 0 defines prediction error.
Suppose, however, that rather than predict a single spatial locationŝ as the true origin s 0 , it is of interest to form a neighborhood aroundŝ that contains s 0 with some degree of certainty. Consider a sample collected from an unknown location s 0 with features (Y 01 ,. . .,Y 0m ) from the withheld subtesting data, and assume a flat prior on T. Using the per-species bandwidths found with the subtraining data, we calculate (Eq 4) at every t 2 T. Standardizing these log-likelihood values so that they sum to one yields a predictive probability mass function, say, f over T such that f(t) expresses the probability the sample originates from t. We threshold these probabilities to form a neighborhood For some significance level α, q is chosen so that across all R q formed from samples in the subtesting data, 100(1 − α)% of these R q cover their sample's true origin s 0 . This double bootstrapping approach of selecting a threshold q to achieve nominal coverage has been studied extensively [32][33][34]. We call (Eq 5) a 100(1 − α)% prediction region, the spatial extension of a prediction interval, constructed to contain the true origin with probability 1 − α. As the subtesting data is independent of the subtraining data used to train the model, and these combined training data are independent of the testing data, we ensure the q acquired in this manner will produce prediction regions in the testing data with approximately 100(1 − α)% coverage.

Computation
To implement our method in reasonable computing time, we obtained a fine grid T by overlaying the U.S. with a 100 × 100 grid of points and retaining only those points that fell over U.S. soil. Treating these points as grid cell centers yielded a finely granular grid T of N = 6,041 cells, each of dimension 28.0 km north-south by 58.6 km east-west. Performing spatial prediction over T required a collective computing time of just under 3 hours using R statistical software and parallelized code distributed across five cores of a 3.6 GHz machine with 60 GB RAM running 64Bit CentOS Linux 5.0. The code is available online at https://github.com/nsgrantham/ fungi-identify.

Results
The sequence-based methods yielded a database of 38,473 fungal taxa with 72.4% of taxa found in < 10 samples, 96.1% found in < 100 samples, and an average of 727 fungal taxa per individual dust sample. Our statistical analyses of these data revealed that many fungal taxa exhibit a high degree of geographic endemism. This endemism is at the root of our ability to make predictions about the geographic origins of samples. For example, consider the geographic distribution of Eutypa lata (Fig 1) which was not found in a single sample east of the Sierra Nevada mountain range but had reasonably high occurrence probabilities in samples collected from northern California. A sample with Eutypa lata, a common pathogen of grapevines [35], is therefore more likely to have originated from grape-growing regions in the western U.S. By comparison, Teratosphaeria microspora is a far more ubiquitous fungus (Fig 2), but it occurs most frequently around the Great Lakes and along the West Coast where there is a > 90% chance its presence is identified in a randomly selected sample. Therefore, the absence of Teratosphaeria microspora suggests a sample is less likely to have originated from these regions. Of course, these are just selected examples and ultimately, our ability to statistically assess the origins of samples results from considering the geographic distributions of a much larger portion of the 38,473 identified fungal taxa. Fungal occurrence probability distributions combined with the fungal taxa identified in a dust sample taken in the U.S. allow for prediction of the sample's most likely geographic origin. Moreover, 50%, 75%, and 90% prediction regions help to capture the shape and spread of the sample's unique collection of fungal communities by marking regions where the sample is likely to have originated with respective probabilities 0.5, 0.75, and 0.9. Consider a sample taken from a home in central Michigan, USA (Fig 3). Our methods identified a location just 229 km to the southwest in northern Indiana as the sample's most likely origin. The narrow prediction regions at varying confidence levels show that the fungal community in this sample is characteristic of the region in question, suggesting that samples with similar communities are not expected to originate far from the Great Lakes.
Across all 928 samples, median prediction error was 230 km (similar to that in Fig 3) with 5% of samples achieving better than 58 km and 5% achieving worse than 1,039 km (Fig 4,  Table 1). The 50%, 75%, and 90% predictions regions formed by the subtesting data retained their respective coverage rates when applied to the testing data. Table 1 summarizes prediction errors and prediction region coverage across several key covariates including the density of samples (sampling intensity), number of different fungal taxa present (fungal richness), daily temperature, daily precipitation, and land type.
As might be expected, the accuracy of predictions was lower for samples taken at locations without many neighboring locations (low sampling intensity) as evidenced by higher median prediction error (278 km) and poor prediction region coverage (Table 1). Predictions also suffered for samples exhibiting low fungal richness. Differences in temperature and precipitation likely contribute to differences in fungal distributions across the U.S. [22], and we detected Teratosphaeria microspora distribution. The biogeography of Teratosphaeria microspora is much different than that of Eutypa lata (Fig 1). The darker and more prevalent shading suggests Teratosphaeria microspora is a fairly common and widespread fungal taxa. However, it occurs with highest frequency among samples collected from the West Coast and throughout Midwestern regions bordering the Great Lakes.   slight contributions of these climatic variables to our error distributions. Prediction errors tended to be higher and prediction region coverage lower for locations with relatively stable precipitation throughout the course of a year (below-median precipitation standard deviation) and, to a lesser degree, more variable temperature (above-median temperature standard deviation). Urban and suburban (i.e., developed) land types had low prediction errors and high prediction region coverage probabilities relative to less developed areas. The accuracy of our predictions also varied depending on the geographic region in question (S1 Fig). Predictions made in the western half of the U.S. (west of -100°longitude) tended to land in close proximity to their sample's true origin (median prediction error of 190 km). Prediction errors were relatively high for samples taken in the northern states of Idaho, Montana, Wyoming, and the Dakotas, likely due to the dearth of sampling in this region. Conversely, even though the East Coast was very well sampled, predictions were least accurate in a narrow band from New England down to Mississippi.

Discussion
In short, the accuracy of our prediction model depends on sampling intensity, climatic conditions, and the region being considered. The accuracy of our approach could likely be improved by more thorough sampling in regions of low sampling intensity. The approach may also be improved by deeper sequencing of samples or by sequencing multiple samples from a given source (where such samples exist).
Our results represent an important proof of concept of a novel approach in forensic biology. Based on the dust collected outside homes with a sterile swab we can identify the geographic origin of a sample taken in the United States with a median error of 230 kilometers. More importantly, we can place a measure of certainty on each prediction that a sample originated in a particular place.
Moreover, our method is easily amenable to dust analysis by forensic teams. As shown in Fig 5, a new dust sample need only have its fungal communities sequenced before being fed to our spatial source prediction method which compares the new sample against a database of samples from known locations to arrive at a most likely place of origin. Of course, successful applications of this method to new regions will require a priori information on the distributions of fungal taxa across the region of interest-information that can only be obtained by collecting and analyzing reference dust samples to be incorporated into the database.
We were able to predict the origin of samples because many fungal taxa are more likely to be found in some regions of the United States than in others. Many biotic and abiotic factors contribute to the observed biogeographical structure in dust-associated fungal taxa, including differences in host plant communities, climatic conditions, soil edaphic factors, land-use type, and agricultural practices. In addition, biogeographic regions also differ in the composition of their fungi on the basis of historical isolation (and dispersal limited taxa [36]). Better understanding the factors that limit individual fungal taxa may allow us to predict not only the geographic origin of samples, but also other attributes of the area of origin. For example, where the distribution of fungal pathogens of plants coincides with that of host plants, the presence of those fungi can be used as an indication not just of geography but also of farming and land use practices (Fig 1).
The approach described here could be extended in several ways. First comparison of our results to models on the basis of the distribution of plants (pollen), bacteria, or even animal parts will be both possible and informative [17]. Dust has long been known to contain biomass (and hence DNA) from a broad range of taxa, including bacteria and arthropods [1]. Inclusion of data from these other taxa is likely to increase the accuracy of our models. Second, we would also like to be able to identify the origin of many kinds of samples, not just dust samples from houses and we assume that our approach could also be applied to other sample types. The accumulation of dust on other surfaces is an obvious next step, so too is soil. Microscopy-based or geochemical analyses of soil could be useful for identifying the geographic origin of soil samples [37], but these methods are not trivial and their utility for geolocating samples from across the U.S. remains undetermined. Third, to understand the global utility of our approach we will need to consider samples from other regions. The precision of global biogeoprinting will be contingent on the number of species restricted to particular climates and hosts (as in the U.S.), but also to particular biogeographic regions. Finally, here we explore samples from a single region on which biological material has settled. The holy grail of biogeoprinting will be understanding whether not just the origin but also the trajectory of samples can be ascertained. By the same token, it may be possible that useful differences exist among seasons that would allow one to discern not only the geographic origin, but also the timing of dust exposure. In this preliminary study, we could not assess the temporal variability in these fungal communities given that the settled dust accumulated over an indeterminate period of time. However, our data suggest that the temporal variability in fungal community composition is less than the geographic variability. If this were not the case, our approach would likely not perform as well as it has.
Supporting Information S1 Fig. All 928 predictions produced by the model over five-fold cross-validation. Each line connects a sample's true (blue) and predicted (red) origin. (TIFF) Fig 5. Summary of the spatial source prediction method. Given a database of sequenced dust samples from known origins s 1 ,. . .,s n , kernel smoothing produces taxon-specific "hot spot" maps like Figs 1 and 2. Using Bayes' rule, our method combines these estimated occurrence probabilities with the sequenced taxa data observed in a new dust sample of unknown origin s 0 to identify the sample's most likely originŝ^(red point). The enveloping prediction regions suggest broader areas where the sample is likely to have originated beyond a single "pin-in-a-map" point estimate.

Author Contributions
Conceived and designed the experiments: HLM JBH AB JWL NF RRD. Performed the experiments: HLM JBH AB JWL NF RRD. Analyzed the data: NSG BJR KP EBL JBH AB JWL NF RRD. Contributed reagents/materials/analysis tools: NF RRD. Wrote the paper: NSG BJR KP EBL HLM JBH AB JWL NF RRD. Coordinated public science components of project: HLM.