Metagenomic Profiling of Known and Unknown Microbes with MicrobeGPS

Microbial community profiling identifies and quantifies organisms in metagenomic sequencing data using either reference based or unsupervised approaches. However, current reference based profiling methods only report the presence and abundance of single reference genomes that are available in databases. Since only a small fraction of environmental genomes is represented in genomic databases, these approaches entail the risk of false identifications and often suggest a higher precision than justified by the data. Therefore, we developed MicrobeGPS, a novel metagenomic profiling approach that overcomes these limitations. MicrobeGPS is the first method that identifies microbiota in the sample and estimates their genomic distances to known reference genomes. With this strategy, MicrobeGPS identifies organisms down to the strain level and highlights possibly inaccurate identifications when the correct reference genome is missing. We demonstrate on three metagenomic datasets with different origin that our approach successfully avoids misleading interpretation of results and additionally provides more accurate results than current profiling methods. Our results indicate that MicrobeGPS can enable reference based taxonomic profiling of complex and less characterized microbial communities. MicrobeGPS is open source and available from https://sourceforge.net/projects/microbegps/ as source code and binary distribution for Windows and Linux operating systems.

In the following sections we develop a model similar to the GCP case, that allows to estimate the sequencing depth and the genome-dataset validity score [1] as used in the main text. This model is designed for extremely low coverage depths and should therefore complement the existing model. The mathematical procedures, i.e. the modified EM algorithm, are identical; we only use different data to construct the histogram and fit different distributions to that histogram.

A new model for low coverage depths
Estimating the genome-dataset validity (GDV) with the previously described approach can be problematic for very low coverage depths: here, the approach with calculating the GCP and fitting, e.g., a zero-inflated negative binomial distribution with tail (znt) model, is not appropriate since we do not expect to have sufficient datapoints for robust parameter estimation. To increase the number of datapoints, we can look at the starting positions of the reads mapped to the genome, as shown in Figure 1 (a).
Under the assumption that the sequencing depth is homogeneous over the entire genome, the probability that a mapped read starts at a certain position on the genome is equal for all positions. We denote this read start probability as p. Let us consider a read starting at position X = 0. The probability that the next read starts at position X = k follows a geometric distribution with parameter p. This becomes clear when considering that if the next read starts at position X = k, no read starts at positions X = 0..k − 1. Therefore, the distances between the starting positions of mapped reads are geometrically distributed.
In analogy to the GCP, we call the histogram of distances between read starting positions the Read Distance Profile (RDP). When reads of an organism are mapped to its reference genome, we can simply estimate the parameter p of the geometric distribution by making use of the formula for the expected value of the geometric distribution:D = E(X) = 1−p p . However, if we expect more than one contribution to the RDP (e.g., reads from two different organisms map to the same genome), we have to set up a mixed model similar to Equation 1 in [1], consisting of two or more geometric distributions. In fact, our framework presented there can be adapted to fit RDPs simply by using models consisting solely of geometric distributions. Parameter estimation is identical to the GCP case.
Since the number of reads mapped to a genome is typically in the order of hundreds or thousands even in cases of ultra-low coverage depths and the distances are very large, we have increased the number of datapoints for parameter estimation significantly with this approach. We will show in the experiments that this approach allows can be applied at much lower coverage depths than the GCP approach.

Genome-dataset validity from RDPs
The GDV score introduced in [1] is a useful tool for estimating the similarity between a set of reads and a reference genome. However, since the estimation involves fitting complex models to GCPs, we expect decreasing accuracy for lower coverage depths. Here with low and homogeneous sequencing depth and its reference genome is available. Then, the distances between neighboring read start positions in a shotgun sequencing experiment can be described by a geometric distribution. (b) When the reference genome differs from the sequenced organism, there will be islands of sequence agreement (green) divided by gaps of sequence disagreement (red). The distances between neighboring reads on the islands (green arrows) still follow a geometric distribution, disturbed by the distances spanning the gaps (red arrow).
we show that it is also possible to calculate the GDV from RDPs and should therefore be accessible for much lower coverage depths.
Similarly to the role of the zero distribution in GCPs, we have to estimate the fraction of the genome that is not covered by reads. Therefore, we have to quantify the gaps between the parts of the genome that could potentially be covered by reads (see Figure 1 b). Here, we introduce a heuristic and assume that the gap lengths also follow a geometric distribution. This is motivated by the assumption that shorter gaps should be more frequent than longer gaps.
To calculate the GDV, it is sufficient to fit two geometric distributions to the RDP: the first distribution fits the distances between reads lying on a contiguous sequence fragment of the reference (visualized by green arrows in Figure 1 b). The second distribution fits the distances between reads lying on neighboring sequence fragments divided by a gap of foreign sequence (red arrow in Figure 1 b). Let α 1 and p 1 denote the estimated parameters of the geometric distribution fitting the distances between reads on the same fragment and let α 2 and p 2 denote the estimated parameters of the geometric distribution fitting the distances between reads on neighboring fragments. Here, we also make the assumption that the distances between reads on the same fragment are typically lower than the distances between reads on neighboring fragments and therefore p 1 > p 2 . One way to estimate the GDV is to calculate the expected number of reads mapping to the genome under the assumption that the reference genome contained no foreign sequences and to relate this number to the observed number of reads R that were actually mapped to the genome. The expected number of reads is the genome length L divided by the expected distance between reads:D = 1−p 1 p 1 . Therefore, we write the GDV for low coverage depths as follows: The coverage depth on the covered sequence fragments can be calculated using the (average) read length RL: In practice, we may add a third geometric distribution to the model, which accounts for higher coverage depths, e.g., on highly conserved genes. The third geometric distribution does not influence the formulas to calculate the GDV and the sequencing depth. However, it will fit the short distances between reads and therefore influence the parameter estimation. We test this approach on low sequencing depth data and present the experiments and results in the following sections.

Experiment: GDV at low coverage depths
In this experiment, we evaluate how well the GDV score can be calculated for genomes with very low coverage depth using the distance-based approach presented above. In [1], we showed that the GDV score could only be calculated robustly for coverage depths as low as 0.2×. In principle, the distance-based approach should be able to estimate the GDV for even smaller coverage depths, therefore, we assessed its robustness systematically.
We simulated two random genome sequences from the alphabet {A,C,G,T}: a genome A with fixed length 4,600,000 bp (approximate length of the E. coli genome), which serves as reference genome, and a smaller genome B with variable length. The sequence of genome B was cut into 20 fragments that were integrated into genome A. The length of genome B was chosen such that A had a defined GDV for reads sequenced from B. For each combination of ground truth and coverage depth, we calculated the GDV and the estimated coverage depth using the distance-based approach above. The experiment was repeated 100 times. The estimated GDV scores are shown in Figure 2. Here, we see that the quality of the GDV estimate depends both on the coverage depth and on the GDV itself. Intermediate GDV scores (0.2 and 0.4) can be correctly estimated at much lower coverage depths than more extreme validities. For a GDV of 0.2, a minimum of 0.01× coverage depth is required to estimate the GDV with below 10% error in 50% of all cases. In general, a minimum sequencing depth of 0.02× was sufficient to estimate the GDV with less than 10% error. This result is remarkable when we consider that the coverage-depth-based approach applied in [1] only provided stable estimates down to 0.2× coverage depth. This means that the new approach lowered this limit by up to a factor of 20 compared to calculating the GDV from the coverage depth profile. For a genome with 0.2 GDV and 72 bp reads, this means that only one read every 7,200 bp is required to estimate the GDV correctly.
There are two effects that influence the required minimum coverage depth: For low GDV scores, only small fractions of the genome are covered with reads at all and the covered regions are smaller than for high GDV. At very low depths, it becomes less likely that two or more reads are mapped to one contiguous fragment; at this point, we start to observe mostly distances between reads on different fragments, which makes it impossible to infer the local coverage depth on the covered fragments. On the other hand, for high GDV scores, the gaps between the covered fragments become very small. If the coverage depth is too low, the distances between neighboring reads may become larger than the gaps between covered fragments, our assumption fails and makes it impossible to mathematically distinguish the distribution of gaps between reads on the same fragment from the gaps between reads on different fragments. We presume that our approach of calculating the GDV is close to the technical limit that can be reached with any approach based on read mapping.