Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data

Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing has encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms, however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage which are known to impact genome assembly are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E.coli (4.6 MB) S.kudriavzevii (11.18 MB) and C.elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing which will enable optimum utilization of sequencing as well as analysis resources.


Introduction
Traditionally, Sanger sequencing was used to sequence the genomes of organisms of interest. Due to the cost and technology limitations in generating the sequence data, genomes were sequenced at approximately 6-10X coverage in order to generate draft genome assemblies [1;2]. Using Sanger sequencing technology, the human genome was sequenced at 6-8X average coverage and cost of about $ 2.7 billion and required efforts from over 3000 scientists from 6 different countries [3]. The complexity, cost and time involved in the human genome project, highlighted the dire need for the development of sequencers with higher throughput and lower cost of sequencing. This need culminated in the development of multiple high throughput or massively parallel sequencing technologies collectively referred to as the Next Generation Sequencing (NGS) technologies. Current NGS systems generate data in the form of millions of short (75-300 bp) reads [4;5] and produce data ranging from 1 GB to 600 GB depending on the sequencing platform used. Additionally, the cost of sequencing on NGS systems is much lower as compared to the automated Sanger sequencing method. According to the data released by the National Human Genome Research Institute, the cost of sequencing a human sized genome using the NGS technology is a little less than $10000 and this includes library preparation, sequencing and data analysis (http://www.genome. gov/sequencingcosts/).
As a result, NGS systems have drastically accelerated the research involving large scale sequencing and globally researchers have now undertaken large sequencing projects involving de novo genome assembly [6;7;8] and metagenomic studies [9;10] for new species, resequencing [11], exome sequencing [12;13;14], transcriptome profiling [15;16] and methylation profiling [17;18;19] for known genomes. Since NGS technologies produce sequence as short reads and have a higher error rate (0.1-1% depending on the sequencing platform) [20;21], higher depth of sequencing is recommended. Moreover using current NGS platforms, 60X to 100X coverage for large genomes and a few 100X coverage for small genomes can be easily achieved. However, sequence generation is only one aspect of sequencing projects and the analysis of large quantum of data generated by NGS platforms, presents a significant bioinformatics and computational challenge, particularly for de novo genome assembly (which involves generation of a novel previously unknown sequence entirely from the available sequence read data). The outcome of performing de novo genome assembly is a draft ''reference sequence'' of the genome of the organism of interest and availability of reference sequence for a variety of organisms by de novo assembly is important for the advancement of genomic research.
Higher read depth for NGS (as compared to Sanger sequencing) seems to be a prerequisite, but there have been no systematic efforts to identify the optimum read depth that would be sufficient to assemble a genome. This is an important question to answer as de novo assembly of millions of short reads is a computationally and bioinformatically challenging task. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly [22] as well as compared the currently available assembly tools [23;24], however, these evaluations were performed using simulated datasets and not using real data sets. One obvious limitation of using simulated datasets is that, it presents a best case scenario for analysis where variables such as coverage, error rates and read length that are known to impact genome assembly are carefully controlled. On the other hand, assembly using real datasets is complicated by factors such as non-uniformity of coverage [25], variable error rates (with higher errors towards the end of the read) [20] and variable read length [26;21].
In the present study, we attempt to identify the optimum average depth required to generate a ''good'' whole genome assembly using five De bruijn graph based short reads assemblers; SOAPdenovo [27], Velvet [28], Meraculous [29] IDBA-UD [30] and ABySS [31]. Even though we have selected these five assemblers for this study, there are several other excellent assembly algorithms such as ALLPaths LG [32], Celera [33], RAY [34], SSAKE [35] and VCAKE [36] that are available for assembling genomes. The five algorithms we have used in this study were selected because Velvet, SOAPdenovo and ABySS are widely used, whereas Meraculous and IDBA-UD are recently published and claim to outperform the previously published algorithms. The sequence data used in this study was from Illumina GA at an average depth of 990X for E.coli, 275X for S.kudriavzevii [37] and 200X for C.elegans [38]. We focused on Illumina data as it is the most widely used platform for generating sequence data and its high throughput enables deep sequencing of genomes, particularly small genomes. We chose the De bruijn graph based algorithms as they are best suited for assembling short reads generated by majority of the current NGS systems [39]. Commonly, De bruijn graph based assembly algorithms break reads into k-mers of specified length as originally proposed by Pevzner [40] and overlapping k-mers are identified as the nodes of the graph and a directed edge between nodes indicates that the k-mers of these nodes occur sequentially in one or more reads. Stretches of DNA sequence with non-ambiguous bases form non-branching paths in the De bruijn graph and these can be easily converted into contigs by reading along the path.
Our results suggest that, considering metrics such as genome coverage, N50 (the length of the smallest contig which when added to a set of larger contigs yields at least 50% of the genome) [41], maximum contig size, number of contigs and errors in assembly, read depth of 50X is enough to get a ''good'' genome assembly and sequencing at a depth greater than 100X does not provide any additional benefits. Additionally, we observed that computational resources required for assembling read data of 20 X-100 X depth is between 6-40 GB, for small genomes of E.coli and S.kudriavzevii.

Assembly algorithms
Five graph-based short read assemblers; SOAPdenovo (Release 1.05, 14-02-2011) [27], Velvet [28], Meraculous [29] IDBA-UD [30] and ABySS [31] were selected for this study. All these assemblers are based on De bruijn graph approach. They are publically available and are widely used for de novo assembly of short reads generated by NGS platforms such as Illumina Genome Analyzer, HiSeq and SOLiD. All of them support assembly using paired end data and were run with default parameters.
All the genome assembly work was carried out using a dual Quad-core (2.4 GHz) Linux server with 128 GB RAM.

Data Sets
The short read data for E.coli strain MG1655 (ERR022075; Read length: 2X100 bp and Insert size: 311 bp), S.kudriavzevii (SRR173086; Read length: 2 X 114 bp and Insert size: 226623 bp) and C.elegans (SRR065388; Insert size: 206 bp and SRR065390; Insert size: 356 bp. Read length for both data set: 2X100 bp) was obtained from Sequence Read Archive (SRA), NCBI. The data was generated on Illumina Genome Analyzer from paired end libraries and have read depth of 990X, 275X and 200X respectively.

Sub Sizing the Data
The data downloaded from SRA had depth of coverage of 990X for ERR022075 (E.coli), 275X for SRR173086 (S.kudriavzevii) and 200X for SRR065388 and SRR065390 (C.elegans). Since, our aim was to identify optimum read depth to produce a good quality genome assembly, we randomly sub-sized the datasets. Perl script was written for randomly sub-sizing the data to generate a range of depth of coverage; 20X, 35X, 50X, 100X, 150X and 200X. We performed assembly for each of the sub-sized blocks within each of the datasets and did not find significant difference between the assembled genomes (data not shown). This indicates that subsizing the data had no impact on the final outcome.

Metrics for accessing the quality of de novo assembled genome
For each of the depth and each of the assemblers, we used genome coverage (after comparing with the reference genome) and N50 values as the deciding criteria for identifying a ''good'' assembly. The metrics were obtained on scaffolds generated by the respective assemblers. The depth of coverage that provided the best genome coverage and N50 was identified as the optimum depth of sequencing. Genome coverage is an especially useful metric if the reference sequence of the organism itself or a relative is known. Additional metrics such as number of contigs and average contig size were also tracked.
Genome coverage. The assembly validation script designed for the GAGE (Genome Assembly Gold-Standard Evaluations) [42] assembly comparison project was used to evaluate the validity of the assemblies produced in this study. The assembly validation scripts make use of MUMmer [43] alignment tool to compare the assembled contigs to the reference genome. MUMmer is a modular system for fast whole genome alignment of finished or draft genome sequences. It can easily handle hundreds to thousands of contigs; align them to another set of contigs or a genome using the NUCmer program [43]. For alignment with NUCmer, the default parameters used were; Minimum match -20, % identify -95 and % coverage -95.
For E.coli, the reference genome used was NC_000913 and for S.kudriavzevii the reference genome used was ZP_591. The C.elegans   reference genome was NC_003279.7, NC_003280.0, NC_003281.9, NC_003282.7, NC_003283.10 and NC_003284.8 whereas the mitochondrial genome used was NC_001328.1 N50, Number of contigs and Average contig size. The N50 (the length of the smallest contig which when added to a set of larger contigs yields at least 50% of the genome) value is widely used for assembly algorithm comparison [23;24] and higher the N50 value better the assembly, provided high genome coverage is achieved. Number of contigs and average contig size give an estimate of the size of the pieces that make up the assembly. Therefore small number of contigs and high average contig size are indicators of a good genome assembly.
Accuracy of the assembly. The GAGE script that was used to identify genome coverage also generates metrics such as number of SNPs and InDels that can be used to evaluate the accuracy of the assembled genome. Presence of high number of SNPs in the assembled genome would indicate an incorrect consensus base incorporated in the assembled genome.

Results and Discussion
There has been an increase in the number of de novo genome assemblies generated using NGS data [8;27] as well as the number of assemblers available to assemble this data [27;28;32]. All of the short read assemblers are based on De bruijn graph approach and a number of studies evaluating the performance of these assembly algorithms for genomes of different sizes have been published in recent years [22;23;24]. However, majority of these studies were conducted using simulated data sets with defined error rates and uniform depth of coverage. Moreover, these studies were focused on metrics such as time required for genome assembly, computational resources utilized and effect of read length, genome size and error rate on genome assembly. With increasing throughput of NGS systems, it is important to understand how much sequencing is needed for ''good'' genome assembly, so as to minimize the cost of sequencing per sample while fully utilizing the potential of the sequencers through multiplexing. Additionally, the sequencing chemistry of NGS systems is continuously improving thus reducing the error rate; hence it may not be necessary to sequence DNA at a very high depth.
The objective of this study was to identify the optimal sequencing depth required to assemble small to mid-sized genomes (up to 100 MB) using some of the most commonly used as well as recently published assemblers. We obtained data generated on Illumina Genome Analyzer for three genomes of varying size; E.coli MG1655 (4.63 MB), S.kudriavzevii (11.18 MB) and C.elegans (100 MB) at 990X, 275X and 200X respectively, randomly subsized the data and assembled them using five graph based assembly algorithms; Velvet, SOAPdenovo, Meraculous, ABySS and IDBA-UD. We used the GAGE scripts to compare the assembled genomes with the reference genome to identify the genome coverage. In some cases, the genome coverage was greater than 100%, which we believe could be because of the presence of repeat region in the genome. Importantly, we were able to generate good genome assembly (95% or greater genome covered after comparative analysis) with Velvet, SOAPdenovo, ABySS and IDBA-UD (not in case of C.elegans), but not Meraculous, with as low as 20X depth of coverage (Tables 1, 2 and 3; See below for detailed discussion). This is the first time that it has been shown, using experimental data that such low coverage can yield an acceptable genome assembly for small organisms such as bacteria, yeast and worm. At the same time, we have demonstrated that increasing the depth of sequencing does not provide significant advantage when using Velvet, ABySS,  SOAPdenovo and IDBA-UD but increasing the depth of sequencing significantly improves the assembly generated by Meraculous. However, even in the case of Meraculous, sequencing depth .100X does not provide a significant benefit for small genomes such as E.coli and S.kudriavzevii. Additionally, computational resources required for assembling read data of 20 X-100 X depth ranges from 6-40 GB, for small genomes depending on the assembler used.
E.coli MG1655 de novo assembly E.coli Illumina reads at 20X, 35X, 50X, 100X, 150X and 200X were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. The De bruijn graph based algorithm use a subset of read called k-mer in order to identify overlaps between reads. The k-mer value has a significant impact on the final assembly, hence reads at each read depth were assembled using a range of k-mer (55-91 for 50X-100X and 31-97 for 35X with steps of 2) to identify the optimal k-mer for a given read depth (Data not shown). K-mer corresponding to the maximum genome coverage was taken as the optimum k-mer. The % genome covered (after comparing with the reference genome) and N50 at the best performing k-mer at each read depth was used to identify the optimum read depth for assembling E.coli genome. Our analysis reveals that in the case of this particular dataset, with as low as 20X, we could achieve .97% coverage with all assemblers, except for Meraculous ( Table 1). The N50 value also did not change significantly from 35X to 200X for all assemblers except for Meraculous where there is an approximately 2 fold increase in the N50 value from 35X to 100X and approximately 3 fold increase from 35X to 150X and 200X (Fig. 1A). Moreover, when additional metrics such as average contig size, number of contigs, maximum contig length and number of SNPs and InDels (Table 1) in the assembled genome are considered, 50X read depth appears to be optimum read depth for assembling E.coli genome using all assemblers except for Meraculous. In case of Meraculous, there is a 2 fold decrease in the number of contigs and a 2 fold increase in the average contig length as well as a 1.5 fold increase in the maximum contig length when the read depth is increased from 50X to 100X with no significant increase in these metric thereafter (Table 1). Moreover, in case of Meraculous there is a 2 fold decrease in the number of InDels present in the assembled genome from 50X to 100X (Table 1). On comparing genome assembled using Velvet, SOAPdenovo, ABySS and IDBA-UD with the known reference (SNPs and InDels generated by the GAGE script) there was a decrease in SNPs when the sequencing depth increased to 150X, with the number of SNPs actually increasing at 200X read depth ( Table 1). Assembly generated by Meraculous has only 1-2 SNPs at all of the sequencing depth ( Table 1).
We also measured the memory requirements for assembling read data at different depth and found that the read data of up to 50X depth can be assembled with 6-16 GB RAM depending on the assembly algorithm ( Fig. 2A). Velvet and ABySS were the most memory efficient in assembling the E. coli genome followed by SOAPdenovo. Meraculous was the least memory efficient assembly algorithm. Thus, for assembling E.coli genome using Illumina sequence data, 50X coverage was sufficient to provide .98.5% coverage of the genome using Velvet, SOAPdenovo, ABySS and IDBA-UD, whereas to achieve same coverage a sequencing depth of 100X was required when using Meraculous.

S.kudriavzevii de novo assembly
For S.kudriavzevii, Illumina reads at 20X, 35X, 50X, 100X, 150X and 200X were assembled using SOAPdenovo, Velvet, Table 2. Cont. The size of the assembled genome we used ZP_591 as reference is 11201698 bases. doi:10.1371/journal.pone.0060204.t002 Table 3. Assembly metrics for C.elegans genome assembled from Illumina paired end data with Velvet, SOAPdenovo, ABySS, Meraculous and IDBA-UD.   ABySS, Meraculous and IDBA-UD. Reads at each read depth were assembled using a range of k-mer (55-91 for 50X-100X and 31-97 for 35X with steps of 2) to identify the optimal k-mer for a given read depth. Our analysis revealed that for this dataset, with as low as 20X, we could achieve .95% coverage with all assemblers, except for Meraculous ( Table 2). Genome coverage of greater than 99% was obtained for Velvet, SOAPdenovo, ABySS and IDBA for as low as 35X coverage and there was a negligible increase in genome coverage with increasing sequencing depth ( Table 2). The N50 value however painted a slightly different picture. The N50 value increased gradually for all assemblers (except IDBA-UD) from 35X to 100X with non-significant increase thereafter (Fig. 1B). The N50 value for assemblies generated by IDBA-UD seems to plateau at 35X depth of coverage. The maximum contig length and average contig size did not change substantially from 35X to 200X depth of coverage for Velvet, SOAPdenovo, ABySS and IDBA-UD, but the number of contigs decreased by 2 fold only for Velvet. Additionally, for Meraculous, there was a 3 fold decrease in the number of contigs and a 3 fold increase in both, average contig length and maximum contig length when depth was increased from 50X to 100X ( Table 2). We also evaluated the accuracy of the assembled genomes by analyzing the number of SNPs and InDels found in the genomes. There was no consistent pattern in the number of SNPs and InDels found in the assembled genomes across all the different depth and the assemblers ( Table 2). Measurement of computational requirements showed that read data of up to 100X depth can be assembled with 12-40 GB RAM depending on the assembly algorithm (Fig. 2B). ABySS was the most memory efficient in assembling the yeast genome followed by Velvet. Meraculous was again the least memory efficient assembly algorithm requiring as much as 96 GB RAM to assemble the S.kudriavzevii genome from the 200X dataset.

C.elegans de novo genome assembly
For C.elegans, Illumina reads at 20X, 35X, 50X, 100X, 150X and 200X were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Reads at each read depth were assembled using a range of k-mer (55-91 for 50X-100X and 35-95 for 35X with steps of 2) to identify the optimal k-mer for a given read depth. Our analysis revealed that for the dataset used in this study, we could achieve approximately 95% coverage with Velvet, SOAPdenovo and ABySS, but only 24% with Meraculous and approximately 90% with IDBA-UD ( Table 3). Meraculous seems to be particularly poor at assembling moderate sized genome as the genome coverage achieved at even at 150X depth was 93% ( Table 3). Due to this inconsistent performance, we did not generate C.elegans genome assembly using data with 35X read depth. Velvet, SOAPdenovo, ABySS and IDBA-UD were able to generate genome with approximately 95% coverage with as low as 35X read depth ( Table 3). The N50 value increased by approximately 3 fold and 1.8 fold for Velvet and ABySS respectively, from 35X to 50X with gradual increase thereafter. On analyzing the SNP and InDel data, we observed that Velvet generated assembly had the highest number of SNPs and InDels, whereas ABySS produced assemblies with least number of errors ( from 200X read data (Fig. 2C). Thus, for assembling C.elegans genome using Illumina sequence data, 50X depth was sufficient to provide approximately 99% coverage of genome using ABySS, 100X depth was sufficient to provide approximately coverage of the genome using SOAPdenovo, whereas with the data set used in this study, Meraculous was able to achieve only 93% genome coverage with 150X read depth.

Conclusions
Unlike large mammalian genomes where the depth of coverage achieved is typically low (between 20X and 60X) [8;27] due to the small size, genomes of small organisms get sequenced at a much higher depth. Sequencing at higher depth without any benefit to the outcome is wasteful not only from computational resource perspective, but also of sequencing resources perspective.
Our results show that 35X-50X data obtained from NGS platforms such as Illumina is sufficient to get good coverage of small genomes such as bacteria and yeast. For moderate sized genome such as C.elegans, read depth greater than 50X provides good coverage of the genome and large contigs. As all of the current NGS technologies allow multiplexing, a 50X or lower depth of coverage provides opportunity for sequencing multiple samples per run thereby further reducing the cost of whole genome sequencing. for assembled E.coli genome: N50 is the length of the smallest contig which when added to a set of larger contigs yields at least 50% of the genome. The N50 values for IDBA-UD, Velvet and SOAPdenovo seemed to reach a plateau at 35X, ABySS at 50X depth of coverage. On the other hand, the N50 value of Meraculous generated assembly increased till 150X depth of coverage. B) N50 for assembled S.kudriavzevii genome: IDBA-UD and SOAPdenovo attained peak N50 value at 35X and 100X depth of coverage respectively, whereas the N50 value of Velvet, ABySS and Meraculous generated assembly increased till 150X depth of coverage. C) N50 for assembled C.elegans genome: SOAPdenovo, ABySS and IDBA-UD reached peak N50 value at 100X depth of coverage, whereas the N50 value of Velvet generated assembly increased approximately 1.5 fold until 150X with no change thereafter. Velvet generated assembly had the best N50 values of all the 4 assemblers. doi:10.1371/journal.pone.0060204.g001