Figure 1.
The processing flow chart for FastUniq.
Step 1: import all read pairs into memory; Step 2: sort read pairs based on nucleotide sequences; Step 3: identify duplicates in sorted read pairs and output the unique sequences.
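The three steps in this caption can be illustrated with a minimal Python sketch. It is not FastUniq's actual implementation. The function name `remove_duplicate_pairs`, the representation of a read pair as a tuple of two sequence strings, and the use of exact string equality to define a duplicate are all assumptions made for illustration.

```python
def remove_duplicate_pairs(pairs):
    """Return the unique read pairs from an iterable of (read1, read2) tuples.

    Hypothetical helper: a pair counts as a duplicate only when both
    sequences match an earlier pair exactly (an assumption of this sketch).
    """
    # Step 1: import all read pairs into memory.
    records = list(pairs)

    # Step 2: sort read pairs by their nucleotide sequences, so that
    # identical pairs end up next to each other.
    records.sort()

    # Step 3: scan the sorted list once and keep only the first pair
    # of each run of identical neighbours.
    unique = []
    for record in records:
        if not unique or record != unique[-1]:
            unique.append(record)
    return unique


example = [("ACGT", "TTGA"), ("AAAA", "CCCC"), ("ACGT", "TTGA")]
unique_pairs = remove_duplicate_pairs(example)
```

Because the pairs are sorted first, duplicates can be found in a single linear pass over adjacent entries. In this sketch the output follows sorted order rather than input order.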
Figure 2.
FastUniq three-tier architecture for storage of read pairs.
The high tier is designed to store hundreds of millions of read pairs or more. Each read pair, which consists of two reads, is stored in a middle-tier ‘fastq_pair’ object, and each individual read is stored in a basic-tier ‘fastq’ object.
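The nesting of the three tiers can be sketched with Python dataclasses. Only the names ‘fastq’ and ‘fastq_pair’ come from the caption. The class `PairStore`, all field names, and the `sort_key` method are hypothetical and do not reflect FastUniq's actual data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Fastq:
    """Basic tier: one read (field names are assumptions)."""
    description: str
    sequence: str
    quality: str


@dataclass
class FastqPair:
    """Middle tier: one read pair made of two Fastq objects."""
    left: Fastq
    right: Fastq

    def sort_key(self) -> Tuple[str, str]:
        # Hypothetical helper: order pairs by their nucleotide sequences.
        return (self.left.sequence, self.right.sequence)


@dataclass
class PairStore:
    """High tier: container for all read pairs (hypothetical name)."""
    pairs: List[FastqPair] = field(default_factory=list)

    def add(self, left: Fastq, right: Fastq) -> None:
        self.pairs.append(FastqPair(left, right))


store = PairStore()
store.add(Fastq("@r2/1", "GGCA", "IIII"), Fastq("@r2/2", "TTAC", "IIII"))
store.add(Fastq("@r1/1", "ACGT", "IIII"), Fastq("@r1/2", "CCGA", "IIII"))
store.pairs.sort(key=FastqPair.sort_key)
```

Keeping each read in its own object and grouping the two reads in a pair object lets the top-level container sort and compare whole pairs without handling individual fields.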
Figure 3.
Results of duplicate removal for Illumina sequencing libraries from Acropora digitifera with multiple insert sizes.
(A) The number of read pairs in each library before and after duplicate removal using FastUniq or the mapping-based pipeline. (B) The percentage of duplicates remaining in the output of the mapping-based pipeline for each library, as identified by FastUniq or fastx_collapser.
Figure 4.
Running time performance of FastUniq.
The running time was measured with the ‘time’ command on a Linux operating system.