CLUSTOM: A Novel Method for Clustering 16S rRNA Next Generation Sequences by Overlap Minimization
We simulated three variable regions of the 454–HMP dataset (V1–3, V3–5, and V6–9) to determine the optimal k-mer size and random sample size. (A) From each region, 10,000 reads were randomly sampled 10 times and the NW distances between read pairs were calculated. k-mer distances between the same read pairs were calculated for each k-mer size from 3 to 15. The strength of the linear relationship between the two distance variables was plotted as the mean (upper x axis) and standard deviation (lower x axis) of the square of the Pearson product-moment correlation coefficient (R-Square) at each k-mer size. (B) For each 16S region, we randomly sampled sets of 100 to 2,000 reads in increments of 100 reads. The k-mer and NW distances were calculated for all possible sequence pairs at each sample size. The random sampling and distance calculation were repeated 10 times. For the ten random samples at each sample size, the mean (upper x axis) and standard deviation (lower x axis) of R-Square were calculated and displayed.
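The comparison described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the CLUSTOM implementation: the caption does not define either distance, so the k-mer distance used here (one minus the fraction of shared k-mers in the shorter read) and the NW distance (unit-cost global alignment divided by the longer read length) are assumptions, and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import mean, stdev


def kmer_distance(a, b, k):
    """Assumed definition: 1 - shared k-mers / k-mers in the shorter read."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())  # multiset intersection of k-mer counts
    return 1.0 - shared / (min(len(a), len(b)) - k + 1)


def nw_distance(a, b):
    """Assumed definition: global alignment cost (mismatch = gap = 1)
    divided by the length of the longer read."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1] / max(len(a), len(b))


def r_squared(xs, ys):
    """Square of the Pearson product-moment correlation coefficient.
    Raises ZeroDivisionError if either variable is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def r2_for_sample(reads, k, sample_size, rng):
    """Draw one random sample and return R-Square between the k-mer and
    NW distances over all pairs in that sample."""
    sample = rng.sample(reads, sample_size)
    pairs = list(combinations(sample, 2))
    kd = [kmer_distance(a, b, k) for a, b in pairs]
    nd = [nw_distance(a, b) for a, b in pairs]
    return r_squared(kd, nd)


def r2_mean_sd(reads, k, sample_size, repeats=10, seed=0):
    """Repeat the sampling and return (mean, standard deviation) of R-Square."""
    rng = random.Random(seed)
    values = [r2_for_sample(reads, k, sample_size, rng) for _ in range(repeats)]
    return mean(values), stdev(values)
```

Calling `r2_mean_sd(reads, k, sample_size)` for each k from 3 to 15 at a fixed sample size would correspond to panel (A), and for each sample size from 100 to 2,000 at a fixed k to panel (B). The pure-Python alignment is quadratic in read length, so a real analysis at these sample sizes would need an optimized aligner.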