Microbial community dissimilarity for source tracking with application in forensic studies

Microbial source-tracking is a useful tool for trace evidence analysis in Forensics. Community-wide massively parallel sequencing profiles can bypass the need for satellite microbes or marker sets, which are unreliable when handling unstable samples. We propose a novel method utilizing Aitchison distance to select important suspects/sources, and then integrate it with existing algorithms in source tracking to estimate the proportions of microbial sample coming from important suspects/sources. A series of comprehensive simulation studies show that the proposed method is capable of accurate selection and therefore improves the performance of current methods such as Bayesian SourceTracker and FEAST in the presence of noise microbial sources.


Introduction
Trace evidence, as defined by the Federal Bureau of Investigation, refers to evidence transferred by a suspect to a crime scene [1]. More commonly this is used to refer to evidence that is minute or evidence where traditional fingerprinting methods are difficult or cannot be performed, but residual information may still be collected. This includes items such as hair, soil, fiber, glass, and other environmental objects commonly found at a crime scene. Modern approaches to DNA fingerprinting involve analysis of Short Tandem Repeat (STR) and SNP gene markers that utilize PCR amplification to reduce the effect of contamination and degradation [2]. However, in trace fingerprint samples, contaminants account for a vast majority of observed DNA. Microbial alternatives to DNA fingerprinting show promise in trace evidence analysis. Direct contact can transfer millions of microbes nearly instantaneously [3]. In particular, microbial populations on items touched by hands have been found to be composed of approximately 60% to 70% of human skin-associated microbes [4]. Trace microbial profiles provide a rich avenue of exploration in tracking which suspects have had contact with trace evidence.
Microbial source-tracking refers to statistical methods which aim to identify sources of (contaminant) microbes, i.e., source, in an observed microbial population, i.e., sink [4,5]. With a variety of massively parallel sequencing technologies, source-tracking methods have been employed by constructing microbial profiles using nucleotides, k-mers, or Operational Taxonomic Unit (OTU) counts as features. By treating suspect microbiomes as potential contaminant sources and the trace evidence as a sink, microbial source-tracking methods can be applied to forensic science. Marker gene sets and REP-PCR strain-specific analyses rely on the presence of microsatellites, species that are not shared between the sources, and location of specific markers [6][7][8] to identify sources. However, these methods are not appropriate in forensics study. First, microsatellites are chosen based on the measured sources and are not universal; there may be sources that were not collected that contain the chosen microsatellites. If an unmeasured true source contains microsatellite features chosen from measured sources that are not the true source, these measured sources will be incorrectly selected. Second, the time difference between sample collections could be large. The temporal stability of the human microbiome may depend on the location that samples are taken. Microbiota in human gut, nose, and throat environments have been demonstrated temporally consistent [9,10], yet the skin microbiome may not be consistent over time and is perturbed by common activity such as hand washing or contact with others [6,11,12]. A study in 2014 that attempted to identify individuals based on unique microbial features saw only 13% of species-based identifiers remained after 30+ days of initial selection [11]. Community-based source tracking utilizing the full microbial profile can help dilute the effect of fluctuations [5].
In 2011, a Bayesian method SourceTracker was proposed for community-based source tracking, which estimates contamination proportions using a mixture model of OTU profiles [4]. The use of Gibbs sampling allows a probabilistic approach to calculate the mixture proportions and simultaneously accumulates dissimilar OTUs into unspecified source components. In recent years, SourceTracker has been widely used in exploratory studies involving microbial fingerprinting via surface contact with varying degrees of accuracy [4,13,14]. While Bayesian SourceTracker is capable of sensitivity adjustment through parameter tuning, it is not intended to test source selection and drastically increases computation times. Additionally, parameter tuning and Gibbs sampling are computationally expensive when working with large data sets.
In 2019, another source tracking method, FEAST, was proposed using an expectationmaximization with two parameter sets including mixture proportions and underlying relative abundance [15]. This approach functions similarly to the Bayesian SourceTracker except contains a much faster run time, reducing run times by a factor of 30 or more. However, similar to the SourceTracker, it requires parameter tuning to achieve optimum performance and is not designed to select sources.
To improve the precision, the pool of sources can be parsed to include only the sources that contribute to the sink. In this study, we focus on developing source selection and proportion estimation techniques using OTU profiles. We propose two new methods for estimating the source proportions based on the ratio of ecological similarity between sources and the sink sample. The proposed methods are evaluated and compared with Bayesian SourceTracker using comprehensive simulation studies.

Methodology
Usually a sink sample is a mixture of two or more sources, clustering analysis may show the correlation between them. However, traditional distance-based clustering methods will associate sink sample with sources that are similar, which may not capture true sources. Consider the relationship of a sink sample, , and four source samples as shown in Fig 1. A majority of the sink comes from source 1, with a minor proportion from source 3 (shown in the composition of colors). Using distance-based clustering, source 2 is incorrectly clustered due to its high similarity to a true source, source 1. Meanwhile, source 3 is failed to be captured since it is much different overall from the sink. That is, the source may be wrongly attributed when the decision is only based on the distance to the sink (Ev). Instead, the relationship with the sink should be compared to the relationship with other sources.

Fig 1. Illustration of cluster based source tracking.
A visual representation of distance based clustering for sources and sink. The sink sample (Ev) is a mixture of S1 (majority) and S3 (minority). The clustering results show that it has a high relationship (solid line) with source S1 and some relationship (dashed line) with S2 and a small relationship (dotted line) with S3.
Each suspect is treated as a source and labeled as ! . The evidence sample is considered as sink and labeled as . We measure dissimilarity between two sources, represented by !,! for each pair of sources i and j.

Relative Aitchison Difference Source Selection
If we assume that a true source will have a higher community similarity (lower dissimilarity) to the sink than to other unrelated sources, the difference can be used to measure the association. We construct a series of hypothesis tests to represent the possible outcomes for each source i given below: for all ( ≠ ). This test identifies whether the dissimilarity between the source and the evidence is smaller than the dissimilarity between unrelated sources. For the purposes of this study, the microbial profile for sources are formed from rarefied OTU counts data such that the total number of observed microbes is equivalent amongst all sources. Aitchison Distance is used to measure the dissimilarity. Aitchison distance preserves the unit-sum property of compositional data, making it an ideal metric to use for comparison and clustering methods for counts or proportion data [16]. Aitchison distance can be calculated using OTU counts via the following equations: where !" represents the abundance of k th OTU in source i. Function ( ! ) refers to the geometric mean of the sample for source i. The difference in dissimilarity is represented as the difference between two squared Aitchison distances.
Sources that are vastly different from the others may draw a disproportionate Aitchison distance due to the presence of microsatellite features. We propose a Relative Aitchison Difference to account for unusually large Aitchison distances and ensure the measure is on the same scale, [−1, ∞), for all samples for further comparison: where the average is taken across all other unrelated sources, .With this metric, the hypothesis test can be rewritten as follows: A distribution of RAD can be generated using bootstrap resampling, which resamples data with replacement. The significance of the test statistic indicates that the given source is much dissimilar to the sink than other sources and may be an important contributor to the sink sample. From here on, we will call this selection process RAD.
RAD requires the other sources for comparison to be unrelated, which may not be true in real world studies. Some examples of related sources include samples at different locations on the same subject or subjects who come from the same household or otherwise have uncommonly similar profiles. To compensate for this possibility, we recommend performing hierarchical clustering to bin microbial samples. Consider once more our four sources scenario from Fig. 1. Sources 3 and 4 are more similar to one-another than with the sink, but collectively they are more similar to sink than any other source. Hierarchical clustering with a single link captures these groups which may be collectively similar to the sink while avoiding drifting centroids other average linkage methods may produce, e.g., the centroid moves towards sources 1 and 2. Since our comparison will be only towards the sink sample , bins are determined by the first nodal connection to the sink sample and all sources in the connected branch are binned together. This binning process is demonstrated in Fig. 2. Additionally, the single link structure allows an easy visual representation of which branches may be collectively similar to the sink. When the sink is closer than other sources, the link distances to other sources will be shorter than link distances when the sink is removed, as represented by the red arrows in Fig. 2.   Fig 2. Illustration of mechanism of hierarchical cluster binning. Binning for the four sources scenario. The left dendrogram represents the connection between sources without considering the sink (Ev). The right dendrogram represents the changed relationships after the inclusion of the sink. Impacted branches are identified with an arrow, and resulting bins are circled.
When multiple sources exist within a bin, the contributor to the sink may be any or all sources within that bin. Relative Aitchison Difference is evaluated independently for each source within a bin, though all observations that are in the same bin as the current source are excluded when determining the average Aitchison difference. This means in the four sources scenario, we would not consider source 4 when evaluating the RAD for source 3 and vice-versa. In this example, the RAD for source 3 is calculated as follows:

RAD naïve approach
Here we introduce a naïve approach for estimating mixture proportions based on the ratio of measured Aitchison difference introduced above. We use ! ! * to indicate whether the source i was significant (1) or not significant (0). Proportions of microbes coming from a given source will be estimated by the ratio of mean relative Aitchison differences amongst all significant sources, as shown in the equation below: This method does not account for the possibility of missing sources, and will overestimate the proportions if a true source is missing from the source pool. From here on, this method is referred to as RAD-Naïve.

RAD SourceTracker approach
RAD may also be used as a preprocessing technique for Bayesian source tracking via Gibbs sampling. By eliminating non-important sources prior to sampling, we improve the speed and predictive accuracy of this method. For comparison purposes, we have chosen to use the conditional distribution used by Bayesian SourceTracker [4]: , where ! is a variable representing the assigned sink. Each individual sink count is represented by ! . ¬! represents the set of all other sink counts except for the current count. Here !" represents the number of sink counts from an OTU in a source ∈ {1 … }, while ! is the number of sink counts assigned to the source across all OTUs. Parameters and are imaginary prior counts to smooth the distribution, and may be tuned to change sampling sensitivity. The application of Bayesian source tracking with RAD source reduction is referred to as RAD-ST.

RAD FEAST approach
Similarly, RAD may also be used to limit the source pool of the FEAST expectationmaximization algorithm [15]. The underlying model functions similarly to the Bayesian model used by Bayesian SourceTracker, where the results are optimized by the product of relative abundance and mixture proportion. In this approach, the relative abundance for the sink is modeled: where ! is the mixture proportion for source ∈ {1 … } (K is the number of known sources) and !" is the relative abundance of each source for each OTU ∈ {1 … }. The parameters are updated using the EM algorithm proposed by Shenhav et al. [5]. The application of expectationmaximization with RAD source reduction is referred to as RAD-FEAST.

Simulation Studies
Here all three methods are applied to OTU count profiles generated by 16S metagenomics sequencing. OTU data from a longitudinal study examining behavior of resident and household surface microbial communities by Lax et al. [13] was utilized. This OTU data was previously rarefied to 2500 counts per sample. Ten palm samples from the first time point were selected. Each palm sample will represent a source for the sink. Our study will focus on scenarios where the sink is generated from a mixture of two to three sources. The remaining non-significant sources will be included in the source set. The sink was generated using a number of 'true' mixture models using mixture proportions provided in Table 1 using multinomial parametric resampling. Though other combinations of sources have been examined, the two mixtures mentioned in Table 1 will be used for demonstration. The RAD-naive, RAD-ST, and RAD-FEAST will be compared to Bayesian SourceTracker using 100 resamples as well as FEAST with 1000 expectation-maximization iterations. Parameter tuning for Gibbs sampling was run prior to proportion estimation. Furthermore, RAD methods are performed at a 5% family wise error rate adjustment (i.e. Bonferroni).

Results
The results of three-source mixtures are presented below and the results for two-source mixtures are in Appendix.

Three-source mixtures without missing sources
Our first analysis focuses on mixtures of three sources without any missing sources. RAD source selections showed a distinct divide in p-value between sources found significant and those not. All significant p-values were at machine 0 while non-significant p-values were approximately 1. Mean and standard deviation for estimated mixture proportions were calculated by compiling the results of twenty experimental replicates. Four error measurements were calculated and collected for each trial and replication to compare performance between the threesource-tracking approaches: (1) Root Mean Square Error (RMSE): where ! represents the true mixture proportion and ! represents the estimated mixture proportion for source i. RMSE and MD are absolute measures of error, and do not depend on the size of the true proportion. RRMSE and AVGRE are relative measures of error, where the errors are scaled based on the true proportion. For sources with a true proportion 0, the denominator is set to be the estimated proportion measures. The results of proportion estimation for three-source mixtures (i.e., without any missing source) are provided in Fig 3. It can be seen that SourceTracker gives some proportion to the non-included sources (i.e., short green bars) as well as an unknown category (although there is no any missing source), which is reflected in the error metrics plots in Figure 4. The RAD-ST result is better than the SourceTracker since the un-important sources are already filtered out in the RAD selection step.
When all sources are available (i.e., without any missing source), the addition of RAD source selection does not create noticeable improvement to FEAST model in terms of absolute error metrics (i.e. RMSE and MD). However, since FEAST always assign some proportion to an unknown category, the relative error metrics (i.e., AVGRE and RRMSE) result in high values for FEAST while RAD-FEAST reduce the relative errors. RAD-Naïve performs robustly across all error metrics. Consistent conclusions are obtained from the results of two-source mixtures (see Fig S1, Fig S2, and Fig S3). As community similarity increases, source-tracking methods may struggle to pick up differences without satellite microbes. To compare how the methods perform as similarity between important sources increases, eleven source combinations were selected to span the range of Jensen-Shannon group wise divergences for all triplets (Fig. 5). These divergence rates spanned between 0.37 and 0.87. Three replicates were used for each triplet. Generally, RAD-FEAST and FEAST perform comparably best in terms of absolute error while RAD-Naïve is recommended if relative error is considered.

Three-source mixtures with a missing source
Our second analysis focuses on mixtures of three sources, but with the second minor source removed from the source pool. RAD source selection again showed a distinct divide between sources found significant and those not. All significant p-values were at machine 0 while nonsignificant p-values were approximately 1. Mean and standard deviation for estimated mixture proportions were calculated by compiling the results of twenty experimental replications. As before, RMSE, MD, RRMSE, and AVGRE were calculated and collected for each trial and repetition to act as a measure of performance to compare the three-source-tracking approaches. Since our pool does not contain all sources, the denominator in the RAD-Naive method does not represent all associated sources and will overestimate proportions (Fig. 6). As expected, RAD-Naive has high error rates for all mixtures (Fig. 7).
When a source is missing, the Bayesian SourceTracker and FEAST both accumulate counts incorrectly to other sources, as seen with source of I and H (Fig. 6). The RAD-ST and RAD-FEAST show marked improvements over the original estimators. Particularly, RAD-FEAST performs best consistently across all settings and under various error measures (Fig. 7).

Discussion
RAD testing was able to accurately detect which sources had non-zero mixture components for each sink/evidence. In this study, the true significant sources were selected in every replication with uncommon instances of selecting non-significant sources (Fig S1, S2). Integrating RAD source selection in parameter estimation has shown a sizable improvement in predicted source proportions for Bayesian SourceTracker across all settings, with both the absolute and relative error rates decreased. RAD-FEAST showed strong improvements under the missing source scenario, where unusual OTUs were attributed to another source instead of unknown.
The RAD-Naïve method performs effectively for no-missing-source cases. It tends to over-estimate larger/major proportions and under-estimate smaller/minor proportions. Due to the lack of proportion assigned to an unknown category, the relative error rates were smallest when all true sources are available. Alternatively, Bayesian SourceTracker, FEAST, RAD-ST, and RAD-FEAST methods tend to under-estimate the true mixture proportions due to count accumulation into unspecified sources (i.e., the "unknown" category). In forensic practice, this feature is beneficial for its ability to take into account the possibility of missing the true culprit from our source pool.
Exploration across varying Jensen-Shannon divergence values revealed some trends among error rates as divergence increased. A negative relationship between performance and community divergence may be due to the limited divergence ranges covered by the ten palm samples from the original literature. Difficulties may arise when the divergence is small and within-subject random count variances overshadow between-subject variance. When applying community-based microbial source-tracking in forensics, RAD will allow investigators to filter sources and receive estimates for proportions of microbes coming from likely sources. This technique, when combined with other forensic tools, will help investigate the likelihood that any given source is involved with the trace evidence, and by extension the crime.