Constructing benchmark test sets for biological sequence analysis using independent set algorithms

doi:10.1371/journal.pcbi.1009492

Fig 1.

Performance of splitting algorithms on Pfam families.

(A) Fraction of the 12340 Pfam seed families with at least 12 sequences that were split into a training set of size at least 10 and a test set of size at least 2. The numbers on the Blue and Cobalt bars indicate the fraction of families successfully split at least once out of 1, 5, 10, 20, or 40 independent runs. (B) Fraction of the 9827 Pfam families with at least 420 sequences in their full alignment that were split into a training set of size at least 400 and a test set of size at least 20.
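The success criterion and the "at least once out of n runs" tally described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `runs_by_family` layout, and the family labels in the usage lines are hypothetical.

```python
def is_successful_split(train_size, test_size, min_train=10, min_test=2):
    """A split counts as successful when both sets meet their minimum size."""
    return train_size >= min_train and test_size >= min_test


def fraction_split(runs_by_family, n_runs, min_train=10, min_test=2):
    """Fraction of families split successfully at least once in the first n_runs runs.

    runs_by_family is an assumed layout: it maps a family name to a list of
    (train_size, test_size) pairs, one pair per independent run.
    """
    if not runs_by_family:
        return 0.0
    successes = 0
    for runs in runs_by_family.values():
        if any(is_successful_split(tr, te, min_train, min_test)
               for tr, te in runs[:n_runs]):
            successes += 1
    return successes / len(runs_by_family)


# Hypothetical run outcomes for three made-up families.
example_runs = {
    "family_a": [(8, 4), (11, 2)],
    "family_b": [(12, 1), (9, 3)],
    "family_c": [(10, 2)],
}
one_run = fraction_split(example_runs, n_runs=1)
two_runs = fraction_split(example_runs, n_runs=2)
```

For panel (B), the same functions would be called with `min_train=400` and `min_test=20`.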

Fig 2.

Characteristics of Pfam full families successfully split.

Each marker represents a family in Pfam. The connectivity of a sequence is the fraction of other sequences in the full family with at least 25% pairwise identity. Families successfully split into a training set of size at least 400 and a test set of size at least 20 are marked by a cyan circle, whereas families that were not split are marked by a red diamond. In (B) and (D) the cyan circle represents at least one successful split among 40 independent runs. The 34 families that Blue did not finish splitting within 6 days are not included in the Blue plots.
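The connectivity measure defined in this caption can be sketched as below. The caption does not state how pairwise identity is computed, so the definition used here (matching residues divided by the number of columns where neither sequence has a gap) is an assumption, as are the function names and the toy alignment.

```python
GAP_CHARS = frozenset("-.")  # assumed gap symbols in the aligned sequences


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap.

    This is one common convention; the source may use a different denominator.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def connectivity(aligned_seqs, threshold=0.25):
    """For each sequence, the fraction of the other sequences at >= threshold identity."""
    n = len(aligned_seqs)
    if n < 2:
        return [0.0] * n
    result = []
    for i in range(n):
        linked = sum(
            1 for j in range(n)
            if j != i
            and pairwise_identity(aligned_seqs[i], aligned_seqs[j]) >= threshold
        )
        result.append(linked / (n - 1))
    return result


# Toy alignment: the first two rows are 75% identical, the third matches neither.
toy = ["ACDE", "ACDF", "WWWW"]
toy_connectivity = connectivity(toy)
```

This all-pairs loop is quadratic in the number of sequences, so it is meant only to make the definition concrete, not to scale to large full alignments.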

Fig 3.

Runtime of algorithms.

Each algorithm was run once on each Pfam seed and full alignment for at most 6 days. The runtimes are reported as a function of the product of the number of sequences and the number of columns in the alignment, as box plots including outliers (translucent grey circles). The boxes extend from the first to the third quartile, and the median is marked by a horizontal line. The results for families with at most 10,000 sequences were obtained on 2 cores and 8 GB of RAM, and the remainder were obtained on 3 cores and 12 GB of RAM. The results do not include the 34 families that Blue did not finish running within 6 days. Blue finished 939 of 944 families in the [10⁶, 10⁷) range, 58 of 85 families in the [10⁷, 10⁸) range, and 1 of 3 families in the [10⁸, 10⁹) range (we omitted the Blue box plot for [10⁸, 10⁹)).
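The grouping of runtimes by alignment size into ranges such as [10⁶, 10⁷) can be illustrated with a short sketch. The record layout, function names, and sample numbers below are hypothetical and are not taken from the paper's data.

```python
from collections import defaultdict
from statistics import median


def size_decade(n_seqs, n_cols):
    """Return k such that 10**k <= n_seqs * n_cols < 10**(k + 1).

    Counting decimal digits of the integer product avoids floating-point
    rounding at exact powers of ten.
    """
    size = n_seqs * n_cols
    if size <= 0:
        raise ValueError("alignment must have at least one sequence and one column")
    return len(str(size)) - 1


def median_runtime_by_decade(records):
    """Group (n_seqs, n_cols, seconds) records by size decade and take medians."""
    bins = defaultdict(list)
    for n_seqs, n_cols, seconds in records:
        bins[size_decade(n_seqs, n_cols)].append(seconds)
    return {k: median(v) for k, v in sorted(bins.items())}


# Made-up records: (number of sequences, number of columns, runtime in seconds).
sample = [(1000, 1000, 12.0), (9999, 1000, 20.0), (10000, 1000, 95.0)]
medians = median_runtime_by_decade(sample)
```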

Table 1.

Runtime of implementations on Pfam seed and full.

The runtime benchmarks were obtained by running each algorithm on the seed and full multi-MSAs Pfam-A.seed and Pfam-A.full on 2 cores with 8 GB RAM for the seed alignments and on 3 cores with 12 GB RAM for the full alignments. We did not compute the maximum runtime of the Blue algorithm; the algorithm failed to terminate within 6 days for 34 families.

Fig 4.

Benchmarks of HMMSEARCH.

(A) Each benchmark includes data from all families that were split into training and test sets of size at least 10 and 2, respectively, by one run of the algorithm. The number of families included in the benchmark for each algorithm is stated in the labels. For each family, HMMER produces a single profile from the alignment of the training sequences. We constructed 200,000 decoy sequences from shuffled subsequences chosen randomly from UniProt. At most 10 positive test sequences per family are constructed, each by embedding a single homologous domain sequence from the test set into a synthetic decoy sequence (see Methods). The x-axis represents the number of false positives per profile search, and the y-axis represents the fraction of true positives detected with the corresponding E-value, over all profile searches. The error bars at each point represent a 95% confidence interval obtained by a Bayesian bootstrap. (B) The faded lines are copies of the plot in (A). The dark lines are the analogous curves constructed by restricting the benchmarks to the 708 families successfully split by all four algorithms. (C) The distribution of the distances between each test sequence and the closest training sequence (measured in percent identity) for families split by Blue, Cobalt, and Cluster.
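The Bayesian bootstrap mentioned for the error bars can be sketched in general form. Rather than resampling observations with replacement, each replicate draws Dirichlet(1, ..., 1) weights over the observations and recomputes a weighted statistic. The caption does not say what the resampling unit or statistic is, so the per-search sensitivity values, the weighted mean, and all names below are assumptions for illustration.

```python
import random


def bayesian_bootstrap_ci(values, n_draws=2000, level=0.95, seed=0):
    """Credible interval for the mean of `values` via the Bayesian bootstrap.

    Each draw samples Dirichlet(1, ..., 1) weights (normalized Exp(1)
    variates) and records the weighted mean; the interval is read off the
    empirical quantiles of those means.
    """
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    means = []
    for _ in range(n_draws):
        gammas = [rng.gammavariate(1.0, 1.0) for _ in values]
        total = sum(gammas)
        means.append(sum(g * v for g, v in zip(gammas, values)) / total)
    means.sort()
    lo_idx = int((1.0 - level) / 2.0 * (n_draws - 1))
    hi_idx = int((1.0 + level) / 2.0 * (n_draws - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-search fractions of true positives at some fixed threshold.
sensitivities = [0.2, 0.4, 0.6, 0.8]
lower, upper = bayesian_bootstrap_ci(sensitivities)
```

Fixing `seed` makes the interval reproducible; increasing `n_draws` smooths the quantile estimates at the cost of runtime.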

Fig 5.

Homology search benchmarks on data produced by splitting algorithms.

The benchmarks are constructed as in Fig 4. Blue 40 and Cobalt 40 refer to the algorithms run with the “best-of-40” feature. BLASTP and DIAMOND are benchmarked using family pairwise search.
