Fig 1.
Comparative analysis of TPMA, M-Coffee, and ReformAlign on 16S-like and 23S-like rRNA datasets.
Each dataset consists of 14 sub-datasets, with sequence similarity ranging from 99% to 70%. Each similarity sub-dataset includes three replicates. For each replicate, nine initial alignments are generated and then merged using TPMA and M-Coffee. A, D The aSP, Q, and TC scores of TPMA, M-Coffee, and nine MSA tools across the 16S-like and 23S-like rRNA datasets. For each grid cell, the average of the three replicates in one sub-dataset was calculated and min-max normalized using the alignment scores from all tools. B, E Running time and memory usage of TPMA and M-Coffee across different levels of similarity for the 16S-like and 23S-like rRNA datasets. Each point represents the average time and memory consumed by the combining step of TPMA or M-Coffee across the three replicates, excluding the time and memory required to obtain the initial alignments. C, F Comparison of the improvements in aSP, Q, and TC scores achieved by TPMA, M-Coffee, and ReformAlign over the initial alignments on the 16S-like and 23S-like rRNA datasets. Each dataset consists of 14 sub-datasets, with three replicates per sub-dataset, so 9 MSA tools generated a total of 378 (14×3×9) initial alignments. ReformAlign optimized these 378 initial alignments, and the differences in aSP, Q, and TC scores between the optimized and unoptimized initial alignments were calculated. The differences in aSP, Q, and TC scores between the merged alignments (from TPMA and M-Coffee) and the unmerged initial alignments were also computed. For TPMA and M-Coffee, a total of 378 differences were obtained from 14×3 replicates, with each replicate yielding 9 differences. The proportions of "improved," "constant," and "reduced" are summarized in donut charts.
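The two summary computations described in this caption, min-max normalization of one grid cell across tools and the "improved" / "constant" / "reduced" proportions shown in the donut charts, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the tie convention when all tools score equally, and the tolerance `tol` are assumptions.

```python
from collections import Counter


def min_max_normalize(scores):
    """Scale a {tool: score} mapping to [0, 1] using the scores of all tools."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when every tool ties, map all of them to 0.0.
        return {tool: 0.0 for tool in scores}
    return {tool: (s - lo) / (hi - lo) for tool, s in scores.items()}


def classify_changes(before, after, tol=1e-12):
    """Proportions of score pairs that improved, stayed constant, or reduced.

    `before` holds scores of initial alignments, `after` the scores of the
    corresponding merged or optimized alignments (same order, same length).
    """
    labels = Counter()
    for b, a in zip(before, after):
        diff = a - b
        if diff > tol:
            labels["improved"] += 1
        elif diff < -tol:
            labels["reduced"] += 1
        else:
            labels["constant"] += 1
    n = len(before)
    return {k: labels[k] / n for k in ("improved", "constant", "reduced")}


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(min_max_normalize({"toolA": 0.5, "toolB": 1.0, "toolC": 0.75}))
    print(classify_changes([0.5, 0.5, 0.5, 0.5], [0.6, 0.5, 0.4, 0.7]))
```

In the setting of panels C and F, `before` and `after` would each hold 378 values per method and per score.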
Fig 2.
Comparative analysis of TPMA, M-Coffee, and ReformAlign across the four real datasets.
The 16S rRNA datasets include 8 subsets, each comprising 100 sequences, and nine MSA tools were used to generate the initial alignments for these datasets. The mt genomes, SARS-CoV-2_20200301, and SARS-CoV-2_20200417 datasets each consist of 4 subsets containing 30, 39, and 100 sequences, respectively. A, D, G, J The aSP scores of TPMA, M-Coffee, and various MSA tools on the 16S rRNA, mt genomes, SARS-CoV-2_20200301, and SARS-CoV-2_20200417 datasets. Some MSA tools failed to align these datasets; only the tools shown in the figure completed the alignment. The resulting initial alignments were then merged using TPMA and M-Coffee. B, E, H, K The running time and memory usage of TPMA and M-Coffee for the 16S rRNA, mt genomes, SARS-CoV-2_20200301, and SARS-CoV-2_20200417 datasets. These values exclude the time and memory needed to obtain the initial alignments. C, F, I, L The improvements in the aSP score of the initial alignments achieved by TPMA, M-Coffee, and ReformAlign on the 16S rRNA (C), mt genomes (F), SARS-CoV-2_20200301 (I), and SARS-CoV-2_20200417 (L) datasets. The 16S rRNA datasets comprise 72 (8×9) initial alignments, the mt genomes datasets comprise 28 (4×7) initial alignments, and the SARS-CoV-2_20200301 and SARS-CoV-2_20200417 datasets each comprise 20 (4×5) initial alignments. The proportions of "improved," "constant," and "reduced" are calculated based on the number of initial alignments for each dataset. For the two SARS-CoV-2 datasets, ReformAlign’s results are not shown because it exceeded the device’s memory limit.
Fig 3.
The results of screening the accurate strategy on the 23S-like rRNA datasets.
A MSA tool rankings based on aSP, Q, and TC scores. Each point corresponds to the average value of three replicates within that sub-dataset. "Mean" denotes the ranking derived from the average values across all sub-datasets with varying degrees of similarity. B Sequential integration of initial alignments from MSA tools following ranking by aSP, Q, and TC scores. The ranking is derived from the "Mean" results of A. For the aSP score ranking (the orange line), the "Top 2" position indicates the integration of T-Coffee and MUSCLE3, the "Top 3" position corresponds to the integration of T-Coffee, MUSCLE3, and MAFFT, and so forth. For the Q score ranking (the blue line), the "Top 2" position indicates the integration of T-Coffee and MUSCLE3, the "Top 3" position corresponds to the integration of T-Coffee, MUSCLE3, and ClustalW2, and so forth. Similarly, for the TC score ranking (the red line), the "Top 2" position indicates the integration of ClustalW2 and PCMA, the "Top 3" position corresponds to the integration of ClustalW2, PCMA, and T-Coffee, and so forth. The outcomes are the average aSP, Q, and TC scores of the combined alignments across all sub-datasets of different similarities. C Ablation experiment on the "Top 5" MSA tools. The differences between the aSP, Q, and TC scores of the combined alignment obtained from the remaining four initial alignments (excluding the indicated MSA tool) and those of the combined alignment obtained from all five initial alignments were computed. If the score from combining the remaining four initial alignments is higher, the case is classified as "improved"; if it is the same, "constant"; if it is lower, "reduced." Each histogram displays the proportions of "improved," "constant," and "reduced" cases, calculated from a total of 42 datasets (14×3) across all sub-datasets. D Ablation experiment on the "Top 4" MSA tools (similar to C). E The running time and memory consumption of 11 MSA tools on the 42 23S-like rRNA datasets. Logging was conducted using Python’s psutil library.
Fig 4.
Validation of the accurate strategy on additional datasets.
TPMA_C9 and M-Coffee_C9 represent the combined alignments of 9 initial alignments from ClustalW2, Dialign-TX, Kalign3, MAFFT, MUSCLE3, MUSCLE5, PCMA, POA, and T-Coffee. TPMA_C5 combines initial alignments from 5 MSA tools: ClustalW2, MAFFT, MUSCLE3, PCMA, and T-Coffee. TPMA_C4 and M-Coffee_C4 are the combined alignments obtained by merging the initial alignments from the accurate strategy (ClustalW2, MAFFT, MUSCLE3, and T-Coffee). A, B Trends in the aSP, Q, and TC scores of TPMA_C9, TPMA_C5, TPMA_C4, M-Coffee_C9, and M-Coffee_C4 on the 16S-like and 23S-like rRNA sub-datasets. The results are the average values across the three replicates of each sub-dataset. C, D Comparison of the aSP, Q, and TC scores (10 replicates) of TPMA_C9, TPMA_C5, TPMA_C4, M-Coffee_C9, and M-Coffee_C4 on the simulated CIPRES-128 and CIPRES-256 rRNA datasets, alongside a sequence similarity distribution histogram. Sequence similarity was comparable across the replicates within each dataset, so the sequence similarity distribution histogram of only one replicate is plotted. "NULL" indicates that TPMA_C9 and M-Coffee_C9 produced no results because Dialign-TX did not finish aligning. E-G The aSP scores of TPMA_C9, TPMA_C5, TPMA_C4, M-Coffee_C9, and M-Coffee_C4 on the 16S rRNA, HVS-II, and 23S rRNA datasets, with the corresponding sequence similarity distribution histograms (only one representative histogram is displayed). The 16S rRNA datasets have 8 replicates, while the HVS-II and 23S rRNA datasets have 10 replicates.
Fig 5.
Comparative analysis of accurate and fast strategies.
TPMA_C4 is the merged alignment obtained by combining the initial alignments from the accurate strategy (ClustalW2, MAFFT, MUSCLE3, and T-Coffee), while TPMA_F4 is the merged alignment obtained from the fast strategy (HAlign3, Kalign3, MAFFT, and WMSA2). The time is the sum of the running times of the four MSA tools in the combined strategy, and the memory is the highest memory consumption observed while those four MSA tools were aligning. A-D aSP, Q, and TC scores of TPMA_C4 and TPMA_F4 on the 16S-like, 23S-like, simulated CIPRES-128, and CIPRES-256 rRNA datasets, along with the overall time and peak memory consumed in obtaining all initial alignments. Each point in A and B represents the average value from three replicates within a sub-dataset of the 16S-like and 23S-like rRNA datasets. Both CIPRES rRNA datasets consist of 10 replicates. E-H aSP scores of TPMA_C4 and TPMA_F4 on the 16S rRNA, HVS-II, 23S rRNA, and mt genomes datasets, with the time and memory required to obtain all initial alignments. The 16S rRNA, HVS-II, 23S rRNA, and mt genomes datasets comprise 8, 10, 10, and 4 replicates, respectively.
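The cost of a strategy as defined in this caption (summed running time, maximum memory across the four tools) amounts to a small aggregation. The sketch below assumes each tool's run is recorded as a hypothetical `(seconds, peak_memory)` tuple; the numbers in the example are made up for illustration.

```python
def strategy_cost(runs):
    """Aggregate per-tool measurements into one cost for the whole strategy.

    `runs` is a list of (seconds, peak_memory) tuples, one per MSA tool.
    Time is summed over the tools; memory is the maximum peak over the tools.
    """
    total_time = sum(seconds for seconds, _ in runs)
    peak_memory = max(memory for _, memory in runs)
    return total_time, peak_memory


if __name__ == "__main__":
    # Hypothetical measurements for four tools: (seconds, peak memory in MB).
    print(strategy_cost([(10.0, 200), (5.0, 350), (2.5, 120), (1.5, 90)]))
```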
Fig 6.
Operational mechanism and illustrative cases of TPMA.
A Flowchart of TPMA. Unaligned sequences (R) are aligned using n MSA tools to generate n initial alignments, and these initial alignments, along with R, are fed into TPMA. TPMA checks the n initial alignments by sorting sequences, calculating SP scores, and verifying consistency, resulting in m valid initial alignments. The m valid initial alignments are combined in descending order of SP scores. A total of m-1 combining steps are conducted to obtain the final alignment, denoted as Afinal. B Example of a detailed merging process for two initial alignments. (i) Recode the two initial alignments as binary strings consisting exclusively of 0s and 1s. (ii) Both alignments are divided into blocks according to the bold columns Ci (C1, C4, C10, C11, C13, and C16 for the first alignment; C1, C4, C10, C11, C12, and C15 for the second), which yields six pairs of blocks. Each of these block pairs consists of identical sequence fragments. (iii) Compute the SP scores for these blocks, and then merge the blocks with higher SP values into Atmp1. Atmp1 has 15 columns: columns C1 and C11, for which the two alignments score identically, plus columns C2−C4, C5−C10, C12, and C13−C15, each taken from the alignment whose block has the higher SP score.
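The block-wise merge in panel B can be illustrated with a short sketch. It is not TPMA's implementation: it skips the binary recoding, assumes both alignments list the same sequences in the same order, treats gap-free columns shared by both alignments as the dividing columns, and uses a hypothetical SP scheme (match +1, mismatch -1, residue-gap -2, gap-gap 0). The left fold in `combine_all` is likewise an assumption about how the m-1 combining steps are chained.

```python
from itertools import combinations


def sp_score(block, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a block (list of equal-length gapped rows)."""
    total = 0
    for col in zip(*block):
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue  # gap-gap pairs contribute nothing (assumed scheme)
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def column_keys(aln):
    """Describe each column by the residue index it holds per row (None = gap)."""
    counters = [0] * len(aln)
    keys = []
    for col in zip(*aln):
        key = []
        for row, ch in enumerate(col):
            if ch == "-":
                key.append(None)
            else:
                key.append(counters[row])
                counters[row] += 1
        keys.append(tuple(key))
    return keys


def split_blocks(aln, anchors):
    """Cut an alignment into blocks, each ending at an anchor column."""
    keys = column_keys(aln)
    cuts = [i + 1 for i, key in enumerate(keys) if key in anchors]
    if not cuts or cuts[-1] != len(keys):
        cuts.append(len(keys))  # trailing columns after the last anchor
    blocks, start = [], 0
    for end in cuts:
        blocks.append([row[start:end] for row in aln])
        start = end
    return blocks


def merge_two(aln_a, aln_b):
    """Keep, for every block pair, the block with the higher SP score."""
    shared = set(column_keys(aln_a)) & set(column_keys(aln_b))
    anchors = {key for key in shared if None not in key}
    merged = [""] * len(aln_a)
    pairs = zip(split_blocks(aln_a, anchors), split_blocks(aln_b, anchors))
    for blk_a, blk_b in pairs:
        best = blk_a if sp_score(blk_a) >= sp_score(blk_b) else blk_b
        merged = [m + row for m, row in zip(merged, best)]
    return merged


def combine_all(alignments):
    """Merge m alignments in descending SP order using m-1 pairwise merges."""
    ordered = sorted(alignments, key=sp_score, reverse=True)
    result = ordered[0]
    for nxt in ordered[1:]:
        result = merge_two(result, nxt)
    return result


if __name__ == "__main__":
    a = ["ACTGCTA", "ACTGCTA", "A-TGT-A"]
    b = ["ACTGCTA", "ACTGCTA", "AT-G-TA"]
    merged = merge_two(a, b)
    print(merged, sp_score(a), sp_score(b), sp_score(merged))
```

In this toy case each input alignment handles one region better than the other, so the merged alignment takes one block from each and scores higher than either input under the assumed scheme.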