BBMerge – Accurate paired shotgun read merging via overlap

Merging paired-end shotgun reads generated on high-throughput sequencing platforms can substantially improve various subsequent bioinformatics processes, including genome assembly, binning, mapping, annotation, and clustering for taxonomic analysis. With the inexorable growth of sequence data volume and CPU core counts, the speed and scalability of read-processing tools becomes ever-more important. The accuracy of shotgun read merging is crucial as well, as errors introduced by incorrect merging percolate through to reduce the quality of downstream analysis. Thus, we designed a new tool to maximize accuracy and minimize processing time, allowing the use of read merging on larger datasets, and in analyses highly sensitive to errors. We present BBMerge, a new merging tool for paired-end shotgun sequence data. We benchmark BBMerge by comparison with eight other widely used merging tools, assessing speed, accuracy and scalability. Evaluations of both synthetic and real-world datasets demonstrate that BBMerge produces merged shotgun reads with greater accuracy and at higher speed than any existing merging tool examined. BBMerge also provides the ability to merge non-overlapping shotgun read pairs by using k-mer frequency information to assemble the unsequenced gap between reads, achieving a significantly higher merge rate while maintaining or increasing accuracy.


Introduction
Many sequencing platforms-including Illumina and Ion Torrent, which comprise the majority of sequencing capacity at many institutions-produce relatively short reads with tens to low hundreds of bases. Short read lengths result from the decline of signal intensity and integrity with each subsequent base during the sequencing process. To compensate for this, paired-end reads are generated by sequencing two end regions of a nucleic acid fragment [1].
Although many advances have been achieved using paired-end sequencing, there remain situations in which single, longer reads are preferable to paired shorter reads, such as de novo assembly contig-building, read binning or clustering, gene annotation, and small-variant calling. To address this need, several programs have been designed to merge paired short reads into single longer reads; however, most of these are designed to primarily merge 16S rRNA gene amplicon sequences rather than shotgun sequence data. PLOS  In this study, we describe BBMerge, a new overlap-based tool for merging short highthroughput shotgun sequencing reads. BBMerge allows simple adjustment of merging sensitivity to accurately and efficiently process large datasets from a variety of sequence types. We designed BBMerge to address common difficulties associated with paired-end shotgun read merging, i.e. reducing incorrect merge rates, increasing scalability, and handling non-overlapping pairs from longer fragments, which most tools cannot merge. BBMerge's performance is compared to existing read merging tools that allow shotgun read input using both synthetic and real-world data from Chlamydomonas reinhardtii and a defined microbial community with bacterial and archaeal members (MBARC-26) [2], respectively.

Materials and methods
Synthetic and real-world sequence data In order to evaluate merging performance, we used synthetically generated data from a eukaryotic genome to allow precise evaluation of merging accuracy as well as real-world shotgun metagenome data from a prokaryotic community. These two datasets include eukaryotic, bacterial and archaeal organisms with complete reference genomes spanning a large spectrum of %GC.
Reference genomes for MBARC-26 were retrieved from JGI's IMG [11] and used for mapping as described in the following: Reference genomes were first indexed (Table 1E). Second, shotgun metagenome reads were mapped to reference sequences to a) determine insert sizes, and b) to remove reads that mapped with indels or that did not map in a properly paired orientation (Table 1F) using BBMap's default settings. This filtering step ensured the correct determination of the insert size for each read pair for subsequent grading; insert sizes of unpaired reads cannot be determined, and reads mapped with indels yield a different insert size as calculated by mapping versus merging. Mapping was not necessary for the synthetic data as the true insert size was known a priori. The remaining shotgun metagenome reads were subsampled to 20 million read pairs (Table 1G).
Grading was performed using GradeMerge (Table 1H) to obtain the number of correctly and incorrectly merged reads. A merged read was considered correct if its length exactly matched the insert size indicated by its header. The reported percentage values and signal-tonoise ratio (SNR) are defined as: , where: a. C is the number of correctly merged reads.
b. I is the number of incorrectly merged reads.
c. C% is the percent of correctly merged reads.
d. I% is the percent of incorrectly merged reads.
e. P is the number of input read pairs.
Assembly quality was evaluated using raw shotgun metagenomic reads from MBARC-26 subsampled to 20 million read pairs (Table 1I). To eliminate potential impact originating from pre-processing, reads were not filtered or trimmed. Reads were merged with each tool, then both the merged and unmerged output was passed to SPAdes v. 3.8.2 [12] for assembly in metagenome mode (Table 1J). Assembled contigs were compared to the metagenome reference using QUAST v. 4.2 [13] for evaluation (Table 1K). Global and local misassemblies as defined in [13] were combined and are reported as "total misassemblies". Paired-end read merging tools. All algorithms for read merging compared here (Table 2) are based on overlap detection [14][15][16][17][18][19], with the exception of leeHom [20] and BBMerge, which additionally use adapter-sequence detection; and COPE [18] and BBMerge, which additionally use kmer counts in non-default modes. All tools were executed as described in S1 Table. Although an effort was made to compare all available overlap-based read merging tools for a comprehensive evaluation in this study, the testing methodology precluded the use of PAN-DAseq [22], which cannot process reads with renamed headers. Eloper [23] was tested, but not included, as it was unable to produce fastq files or retain the original read headers.
Parameters and testing. Each program was tested for speed, accuracy, and scalability. All testing was executed on the NERSC Genepool cluster (http://www.nersc.gov/users/ computational-systems/genepool/), using a 1 TB, 32-core node based Intel Xeon E5-4650L CPUs @ 2.60GHz. Reads and writes were all performed using a ramdisk to eliminate any impact of contention for the cluster's shared file system.
Execution of merging tools was performed according to each program's defaults, except as noted (S1 Table). For accuracy testing, each program was run multiple times; the single parameter that was identified to impact the respective tool's sensitivity most was varied between runs (if available) (S2 Table). After each run, the resulting output was graded, i.e. each merged read's length was compared to the true insert size noted in that read's header.
Speed and scalability testing was executed using the Linux "time" command, e.g. "time bbmerge.sh <other options>", with default parameters and varying numbers of threads. For BBMerge, three modes were included in this study: default, REM, and RSEM, as described in 2.2.3 and 2.2.4. For COPE, two modes were included: default (M0), using simple overlap only, and M3, using k-mers to join non-overlapping pairs. COPE's M1 mode was not found to differ substantially from M0, and M2 did not produce output, so neither are included. Speed tests were performed on both synthetic and real-world shotgun metagenome reads. Since no significant difference was found, we only report test results for the real-world metagenome data.

BBMerge overlap-detection
Overlap-detection involves multiple heuristics, controlled by constants denoted C i . These have already been optimized through extensive empirical testing and do not need to be adjusted by the user; they are only presented to describe the algorithm. For each read pair: 1. Read 2 is reverse-complemented, because read 1 and read 2 are produced from opposite strands of the initial DNA fragment.
2. Read 1 and read 2 are aligned in every possible offset.
a. An "offset" is defined by the relative start position of the reads. For offset O = 0, each base number X i of read 1 aligns to base number X i of read 2. In general, each base X i in read 1 aligns to base X i+O in read 2.
b. This alignment only counts matches and mismatches; indels are not allowed.
3. The standard mode for determining the offset is called "ratio mode". For each offset, a ratio R is calculated: , where: B is the number of mismatches, G is the number of matches, C 0 is a constant. An optional flag, "ouq", allows B and G to be calculated using quality scores, but this is only helpful if the quality scores are accurate.
4. The two best (lowest) ratios, R 1 and R 2 , are tracked throughout the process.
5. Once the alignments finish, R 1 and R 2 are examined to decide whether an alignment will be accepted ( Fig 1A) or discarded (Fig 1B), using heuristics with different constants.
a. If R 1 >C 1 , the alignment will be rejected as invalid.
b. If R 1 Ã C 2 >R 2 , the alignment will be rejected as ambiguous.
c. If R 2 <C 3 , the alignment will be rejected as ambiguous.
d. If G<max(C 4 , V) the alignment will be rejected as having too short of an overlap. V is derived from the sequence complexity of a given pair, decreasing as complexity increases. e. If S<C 6 , the alignment will be rejected as too short. S is the insert size implied by the alignment.
f. Otherwise, the best alignment will be reported for further consideration.
6. At extreme sensitivity settings, an additional algorithm-"flat mode"-is used. This mode determines the best overlap by minimizing the number of mismatching bases.
a. At the "xstrict" and "ustrict" settings, the alignment is only accepted if the best offset from flat mode matches the best offset from ratio mode.
b. At the "xloose" setting, an alignment produced by flat mode will be accepted if no alignment was produced by ratio mode.
c. Otherwise, flat mode is not used.
7. If the pair has an alignment reported in 5) or 6), it is subjected to further scrutiny.
a. If the implied insert size is shorter than the read length, and adapter sequences have been specified, non-overlapping portions of the reads are aligned to respective expected adapter sequence. If they do not match, the alignment is rejected.
b. The number of expected mismatches (E) in the overlap is calculated using quality scores.
If B>E Ã C 5 , the alignment is rejected.
c. The probability (P) of the specific pattern of matches and mismatches is calculated. If P<C 6 , the alignment is rejected.
8. If, at this point, the alignment has not been rejected, the read pair is merged to create a new read of size equal to the insert size implied by the overlap.
a. The overlapping portions of the reads are represented in the resulting read as a consensus of the two parent sequences. Matching bases are assigned an increased quality score; for non-matching bases, the base with the higher quality score is used, and is assigned a quality score equal to the difference between the two parent qualities. Where both quality scores are equal and the bases mismatch, the resulting base is N.
b. If only the tail ends of the reads overlap, the insert size (and thus resulting read) is longer than the original read length. The merged read will be composed of the non-overlapping portion of read 1; the consensus of the overlapping sequence; and the non-overlapping portion of read 2, respectively.
c. If the tail ends of the reads do not overlap, the insert size is shorter than the initial read length, and the non-overlapping portion is non-genomic sequencing adapter readthrough. In this case the resulting read is trimmed to the insert size, and will be 100% consensus sequence.

BBMerge k-mer-based modes
BBMerge has the ability to improve merging accuracy or merge non-overlapping reads using k-mer frequency information, if the sequencing depth is sufficient (Fig 2) and the library is randomly sheared. There are two k-mer-using modes described in this paper, REM and RSEM, which stand for "Require Extension Match" and "Require Strict Extension Match". In each case, the default BBMerge algorithm is used with an additional k-mer-based extension step. To summarize: The input read file is processed once, to build a table of k-mer counts.
The file is then processed a second time to perform merging. Steps performed during the merging phase for each read pair include: 1. The standard BBMerge algorithm is used to determine the insert size S 0 based purely on overlap (Fig 1A and 1B).
2. Each read is extended by a fixed length on the tail end only, using the Tadpole assembler (https://sourceforge.net/projects/bbmap/). When not specified, as in this study, extension defaults to 50 bp. Extension will stop prematurely if a branch k-mer is encountered, or kmer depth drops below a set threshold, so extension may not reach the full length specified by the user. a. A "branch k-mer" is a k-mer with more than one possible next k-mer. They are identified based on BBMerge's optional Tadpole-specific parameters.
b. If extension completely fails such that neither read is extended by at least one base, insert size S 0 is used regardless of mode and subsequent steps are skipped.
3. If extension was successful, the BBMerge algorithm is applied to the extended reads to obtain a new insert size S 1 .
4. In REM mode, the alignment is accepted if S 0 = S 1 (Fig 1C). If there is no S 0 because overlap failed in step 1, S 1 will be used ( Fig 1D). If S 0 and S 1 exist and S 0 ! = S 1 , the alignment is rejected (Fig 1E). 5. In RSEM mode, the alignment is exclusively accepted if S 0 = S 1 (Fig 1F). If S 0 6 ¼S 1 (Fig 1G), or if there was no initial overlap detected (Fig 1H), the alignment is rejected.
In practice, REM mode can produce merged reads from initially non-overlapping pairs, with insert size > sum of the read lengths. RSEM will only produce merged reads < sum of the read lengths-a strict subset of the merged reads produced by BBMerge run in pure overlap mode. Requiring that the overlap after extension matches the initial overlap reduces false-positive merges caused by short repeats.
Although k-mer-based modes can increase accuracy and merge rates, read processing requires more time and memory in these modes. This memory constraint may hence render k-mer modes impractical on very large datasets. Though not evaluated in this study, BBMerge also has additional k-mer-related options, "ecct" and "kfilter". "ecct" enables k-mer-based error correction of reads that initially fail to merge; if the reads still fail to merge after correction, the changes are rolled back. This can increase the merge rate in data with many sequencing errors. "kfilter" is a setting applied after a potential overlap is found; if the merged read contains any k-mers that were not already present at a specified depth in the original file, the overlap is assumed to be wrong and will be rejected. All k-mer-using modes use the same kmer count table, so they can be enabled concurrently without using additional memory, and with little speed impact.

BBMerge threading
BBMerge uses both pipelined and parallel threads to achieve a high degree of scalability. Data is streamed from and to disk during execution, so that BBMerge's memory requirements (in default overlap mode) are unrelated to the amount of input data. Data is read by one thread per file and packaged into lists of P read pairs each (P = 200 by default). These lists are added to an ArrayBlockingQueue, a data structure that allows safe concurrent read/write access. A number of parallel worker threads is spawned (controlled by the "t" flag). Each worker fetches a list of reads from the queue; if the queue is empty, it will block until a new list is added. The worker thread will then iterate through the list and attempt to merge each of the read pairs, tracking statistics in thread-local variables, and adding merged reads to a new list. The finished list of merged reads is added to an output ArrayBlockingQueue, which is being fed by all of the worker threads. An output thread pulls lists from this output queue, and writes the reads to disk. The worker threads finish when all reads have been processed. Finally, the master thread summarizes and prints the statistics from the worker threads. As a result, the worker threads do not interfere with memory used by any other thread except when pulling lists from the input queue, or sending lists to the output queue; this means shared memory is only mutated twice per P read pairs. Furthermore, P can be set to an arbitrarily high value on the command line (with the "readbufferlength" flag), so that distributing and gathering work has minimal negative impact on scalability. Most tools in the BBMap package share this threading design.

Deployment and use
BBMerge is written in Java, with no other dependencies. It is distributed with both the source and precompiled class files, allowing simple deployment and use on any computer supporting Java, from Windows laptops to HPC Linux-based clusters. BBMerge is designed for production use, so to simplify pipeline integration, it supports a wide variety of input and output formats-fasta or fastq; interleaved or dual-file; raw or compressed; encoded in ASCII-33 or ASCII-64, with input format autodetection. It also provides alternative processing modes such as insert-size histogram generation, adapter-sequence detection, and overlap-based error-correction (without merging), allowing its use in situations when paired reads are preferred over merged reads.

Discussion
We tested BBMerge in three modes (default, REM, RSEM) and compared its merging performance with eight other read merging programs ( Table 2) using synthetically generated reads from an algae genome, and real-world shotgun metagenomic reads from a prokaryotic mock community (MBARC-26). Merging performance was evaluated based on accuracy, speed and computing efficiency.

Accuracy of paired-end read merging
BBMerge outperformed all other tools in merging accuracy across the sensitivity curve, with the lowest rate of incorrectly merged reads for any given rate of correctly merged reads, though this difference was more pronounced in the synthetic (Fig 3A) compared to the real-world data (Fig 3B). Similarly, BBMerge resulted in the highest correct merge rate (Fig 3) of all nonk-mer-using tools.
Results from the three discussed k-mer-utilizing modes are clearly distinguishable from those of the purely overlap-based tools and modes (Fig 3). BBMerge's RSEM mode substantially reduced the rate of incorrectly merged reads, while slightly reducing the rate of correctly merged reads. BBMerge's REM mode, and COPE's M3 mode, substantially increased correct merge rates compared to the programs' default modes by merging initially non-overlapping reads (Fig 3). BBMerge-REM achieved the highest rate of correctly merged reads in the real- world data (77.5%) followed by COPE-M3 (62.1%), and COPE-M3 achieved the highest merge rate in the synthetic data (94.4%) followed by BBMerge-REM (93.8%). Stitch yielded 69.2% incorrectly and 0.8% correctly merged reads in the synthetic data, and 49.1% incorrectly and 0.64% correctly merged reads in the real-world data (S3 Table).

Speed and scalability of paired-end read merging
Merging speeds were evaluated using the real-world metagenome reads and programs set to default sensitivity. Multi-threaded programs were allowed to use all 32 available threads. Compared to the other merging tools, BBMerge and FLASH were substantially faster, although we found that USEARCH, PEAR, BBMerge REM/RSEM, and fastq-join can all merge large datasets within reasonable timescales (Fig 4). Based on the performance on our shotgun sequence datasets, XORRO, COPE, leeHom and Stitch were projected to require >1 day to process a 500 Gbp dataset.
BBMerge variants, PEAR, and Stitch exhibited near-perfect scaling in these tests, and are expected to continue scaling past 32 threads if run on a system with more CPU cores (Fig 5). FLASH scaled linearly to 6 threads, at which point speed plateaued. leeHom scaled to a peak at 4 threads, after which speed slightly declined. USEARCH also reached a peak at~4 threads, but did not scale as well; 4-threaded speed was only 150% of single-threaded speed, rather than an ideal 400%. Subsequently, USEARCH's performance declined, ending at 85% of its peak speed at the maximum of 32 threads. Single-threaded programs (fastq-join, XORRO, and COPE) are each represented by a single point.

Assembly quality following read merging
Assembly quality was evaluated with QUAST; we report here assembly continuity (NA50), genome completeness, misassemblies, and indels as defined in [13] (Table 3, S4 Table). Gurevich et al. [13] defined NA50 as the length at which the collection of all reference-aligned contigs, of that length or longer, contain at least half of the assembled bases. Merged reads were generally characterized by substantially improved assembly continuity compared to the raw data (Table 3, Fig 6A), with BBMerge-REM reaching a nearly two-fold increase in NA50 (119 kbp compared to 60 kbp). BBMerge-RSEM, BBMerge, USEARCH, and leeHom resulted   (Table 3, Fig 6B). The studied merge tools fell into 3 misassembly-count clusters: BBMerge variants and USEARCH ranged from 115 to 131; XORRO, fastq-join, COPE-M3, FLASH, leeHom, and COPE ranged from 158 to 294; and PEAR and Stitch resulted in 660 and 20,986 misassemblies, respectively. Indel rates are noted because they can induce frameshifts, which disrupt gene annotation. BBMerge variants and USEARCH clustered together closely, with rates ranging from 0.81 (BBMerge-REM) to 0.88 (USEARCH) indels per 100 kbp ( Table 3). The other tools yielded rates ranging from 1.08 (XORRO) to 1.52 (COPE), except for Stitch (47.78 per 100 kbp). The raw data yielded 1.13 indels per 100 kbp. The fraction of reference bases covered by assemblies exhibited a narrow range from 83.9% (COPE-M3) to 85.2% (FLASH), aside from Stitch at 68.4% (Table 3). All tools except PEAR, COPE-M3, and Stitch exceeded the 84.5% genome coverage of the raw read assembly. BBMerge-REM outperformed BBMerge in every assembly metric, but COPE-M3's performance relative to COPE was more nuanced: COPE-M3 had a greater NA50 and fewer misassemblies and indels, but a 1.2% lower genome recovery than COPE.

Conclusion
Correctly merged shotgun reads can improve the performance of applications that benefit from longer reads, yet erroneously merged reads can create serious issues due to the introduction of new errors, a concern that is not present for other common preprocessing steps such as quality-trimming. Even at a low rate, the addition of incorrectly merged reads can cause misassemblies and reduced assembly contiguity compared to unmerged or correctly merged data (Fig 6). It is this possibility of introducing new errors that renders merging especially sensitive to accuracy.
Since BBMerge has been developed primarily as a tool to aid in clustering and de-novo assembly of shotgun metagenome sequence data, minimizing the false-positive merge rate has been considered paramount. Our data indicates that BBMerge successfully minimized the false-positive rate when merging shotgun reads from synthetic and real-world datasets, and was able to improve assembly quality by increasing continuity while reducing the number of misassemblies. Its ability to achieve maximal accuracy while scaling near-linearly to reach the highest speed of the compared software makes BBMerge a promising tool for improving the assembly of large datasets such as shotgun metagenomes.
Supporting information S1  Table. Number of correctly and incorrectly merged read pairs, and Signal-Noise Ratio (SNR), from the synthetic (A) and real-world (B) shotgun datasets by program and sensitivity. All numbers are out of 20,000,000 input read pairs. Defaults are in bold. (DOCX) S4 Table. Assembly report by program. (DOC)