CLUSTOM: A Novel Method for Clustering 16S rRNA Next Generation Sequences by Overlap Minimization
The diagram summarizes the flow of the CLUSTOM algorithm, which consists of five main steps processed sequentially. (A) Random sample: The input sequences (individual letters) are assumed to belong to three OTUs, shown in green, red, and black. A subset of these sequences is extracted at random. (B) k-mer threshold determination: Each dot represents a pair of randomly sampled sequences. The k-mer distance and the Needleman-Wunsch (NW) distance between sequences i and j are denoted by k_{i,j} and d_{i,j}, respectively. The user-defined distance threshold and its corresponding k-mer threshold are denoted by α and β, respectively. (C) Initial clustering: Any two input sequences whose k-mer distance is smaller than the k-mer threshold (β) are connected in a network. The larger bold letters indicate the seed sequences of the initial clusters, which are enclosed by circles. (D) Refinement: Seed sequences whose NW distances are smaller than the user-defined threshold (α) are used to construct a refined network following the procedure in (C). The larger bold letters indicate the refined seed sequences. (E) Recovery: Each final cluster (circle) consists of the refined seed, its neighbors, and the sequences that were directly connected to the refined seed or its neighbors in the initial clustering step.
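Steps (B) and (C) can be illustrated with a minimal Python sketch. It is not the CLUSTOM implementation, and several parts are assumptions that the caption does not specify: the k-mer distance formula (one minus the fraction of shared k-mers in the shorter sequence), the rule in the hypothetical helper `choose_beta` for deriving β from the sampled pairs, and the default k = 3. The function names are illustrative only. Seed selection is left out because the caption does not describe how seeds are chosen.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Assumed definition: 1 - (shared k-mers / k-mers in the shorter sequence).

    Both sequences must be at least k characters long.
    """
    # Counter & Counter keeps the minimum count of each k-mer present in both.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    shortest = min(len(a), len(b)) - k + 1
    return 1.0 - shared / shortest


def choose_beta(sampled_pairs, alpha):
    """Hypothetical rule for step (B), not taken from the source.

    sampled_pairs holds (k_ij, d_ij) tuples for the randomly sampled pairs.
    Returns the largest k-mer distance among pairs whose NW distance is
    smaller than alpha, or 0.0 when no pair qualifies.
    """
    close = [k_ij for k_ij, d_ij in sampled_pairs if d_ij < alpha]
    return max(close) if close else 0.0


def build_network(seqs, beta, k=3):
    """Step (C): connect sequences i and j when their k-mer distance < beta."""
    edges = {i: set() for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        if kmer_distance(seqs[i], seqs[j], k) < beta:
            edges[i].add(j)
            edges[j].add(i)
    return edges
```

For example, `build_network(["ACGTACGT", "ACGTACGA", "TTTTGGGG"], beta=0.5)` links the first two sequences and leaves the third unconnected. The all-pairs loop is quadratic in the number of sequences, which is acceptable for a sketch but would need a faster neighbor search for real sequencing data.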