Fine De Novo Sequencing of a Fungal Genome Using only SOLiD Short Read Data: Verification on Aspergillus oryzae RIB40

The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated from NGS platforms, such as SOLiD and Illumina, are quite useful for mapping analysis. However, SOLiD read data, with lengths of <60 bp, have been considered too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read sequence data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only 50-bp SOLiD read data generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.7 Mb, 22-fold larger than those obtained using only SOLiD short reads in other published reports. In addition, almost 99% of the reference genome was accurately aligned by long fragments of the assembled scaffolds. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also well reconstructed in the assembled scaffolds. Based on these findings, we conclude that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effects of filtering low-quality data, library insert size, and k-mer size on assembly performance, and recommend using mildly filtered read data (provided that the N50 is not substantially degraded), a library with an insert size of ∼2.0 kb, and a k-mer size of 33.


Introduction
Whole-genome sequencing is an invaluable tool in evolutionary and functional studies of biological systems. The development of next-generation sequencing (NGS) technologies, such as the SOLiD (Life Technologies), Solexa and Genome Analyzer (Illumina), and 454 GS FLX (Roche) systems, has increased the throughput and reduced the cost of sequencing by several orders of magnitude [1]. To date, the whole genomes of several viral [2,3], bacterial [4–8], and fungal species [2,9–12] have been newly sequenced (de novo sequencing) by combining two or more NGS platforms, such as 454 and Solexa, which generate sequence reads of 250–800 bp and <100 bp, respectively. De novo sequencing using only a single NGS platform, such as Illumina or SOLiD, can further reduce costs and time. Among several reports of such de novo sequencing of genomes ranging from bacterial to mammalian [4–8,10,12–15], the most successful result was obtained for the human and mouse genomes with the Illumina platform using various libraries, in which the scaffold sizes (N50 = 11.5 and 7.2 Mb for the human and mouse genomes, respectively) approach those obtained with traditional capillary-based sequencing [14]. Compared with Illumina reads, few successful reports of de novo sequencing, especially of eukaryotes, exist for SOLiD reads (<60 bp). Fine de novo sequencing using only SOLiD reads was reported for the genome of the bacterium Corynebacterium pseudotuberculosis [13], but the N50 length was only 76.9 kb even when two assembly methods were used in combination. SOLiD short reads are generally suspected to be too short for de novo sequencing, especially when a genome includes introns and long repetitive sequences. In addition, only a few assemblers can currently handle the "color-space" format of SOLiD read data.
Filamentous fungi produce a wide range of secondary metabolites, such as penicillin, cyclosporin, and lovastatin, with useful medicinal, agricultural, and cosmetic properties [16–18]. As the genome sequencing of fungi can provide information related to secondary metabolite biosynthesis (SMB) genes, which often contain characteristic sequence motifs [19,20], the de novo sequencing of fungal isolates is anticipated to facilitate the identification of novel SMB genes. Fungal genome sequences have characteristics that make them more difficult to sequence than bacterial ones: abundant repeat sequences, especially in SMB genes, and AT-rich centromere sequences. Since SMB genes are generally clustered, sequence information longer than 50 kb is required for their identification. For example, the SMB gene cluster of aflatoxin, which is one of the most carcinogenic fungal secondary metabolites identified to date, occupies a region of ∼77 kb in the Aspergillus flavus genome [21]. Recently, we successfully identified the SMB gene cluster of kojic acid (KA) [22], which is used in cosmetics as a skin-whitening agent, using the genome sequence data of Aspergillus oryzae [23], including the genomic location of two genes involved in KA synthesis. De novo sequencing of fungal genomes using only the SOLiD platform is expected to facilitate the rapid and efficient identification of novel SMB gene clusters, provided that the assembled sequences are sufficiently long and accurate for the identification.
To investigate whether SOLiD short reads are useful for de novo sequencing of fungal genomes, we conducted a series of de novo assemblies of the genome of the filamentous fungus A. oryzae RIB40 using only 50-bp SOLiD reads in color-space obtained from a single sequencing run. The 37-Mb genome of RIB40 has been sequenced by the Sanger method and was found to contain several SMB gene clusters, even though this strain rarely produces secondary metabolites [23]. Thus, we validated the results of our de novo assembly approach in terms of the size distribution of the assembled scaffolds, the reconstruction of the genome sequence including SMB gene clusters, and the sequence accuracy. In addition, we specifically evaluated three factors affecting the performance of the de novo genome assembly: (1) the quality of the read data, (2) the insert size of the mate-paired library, and (3) the k-mer size used in the assembly program. We did so by changing the degree of data filtering, the library insert size, and the k-mer size.

Strain and medium
The fungal strain used in this study, A. oryzae RIB40, was obtained from the National Research Institute of Brewing, Japan (http://www.nrib.go.jp/ken/asp/strain.html). For DNA isolation, strain RIB40 was grown in liquid YPD (Yeast extract, Peptone, Dextrose) medium (Difco) at 30°C for 2 days.

Whole genome sequencing
The genomic DNA was isolated from RIB40 as described previously [24]. We constructed two mate-paired libraries, lib2.8 and lib1.9, which were derived from sheared genomic DNA fragments of sizes 2.8 and 1.9 kb, respectively, using the SOLiD Long Mate-Paired Library Construction kit (Life Technologies, USA) and 50 μg RIB40 genomic DNA. Whole-genome sequencing using the lib2.8 or lib1.9 libraries was performed using the SOLiD 5500xl system (ABI). The mate-paired sequencing of lib2.8
Table 1. Libraries, data filtering, and k-mer size used for the series of de novo assemblies.

Genome assembly
Prior to genome assembly, an XSQ file for the F3 and R3 reads generated by the SOLiD platform was converted into a color read sequence file (csfasta) and quality file (.qual) using a shell script downloaded from the ABI website (convertFromXSQ.sh; http://www.lifetechnologies.com/us/en/home/technical-resources/softwaredownloads/xsq-software.html). To investigate the effect of data filtering on the assembly performance, we prepared three and two read sets from the lib2.8 and lib1.9 libraries, respectively, by changing the degree of data filtering according to quality values (QV): unfiltered data (lib2.8.nofilter.k31), data excluding reads containing undetermined base(s) (designated by "."; lib2.8.nodot.k31 and lib1.9.nodot.k31), and data in which all bases had QVs of >10, corresponding to 90% base-level accuracy (lib2.8.qv10.k31 and lib1.9.qv10.k31, Table 1). The depth of coverage of each read set on the reference genomic sequence of RIB40 (∼37 Mb, DDBJ: AP007150-AP007177; GenBank: AP007150-AP007177) is summarized in Table 2. After filtration, mate-paired reads were subjected to de novo assembly using SOLiD De Novo Accessory Tools 2.0 (Life Technologies), which includes the rsampling, SAET, Velvet [25] (upgraded to version 1.0.15), and ASiD programs (Fig. 1). We modified ASiD to accelerate the gap-filling process by eliminating the waiting time during parallel processing. We manually set the parameter for k-mer size (-hsize) to 31 or to other values, as described below, and used default values for the other parameters (Table S1). It should be noted that scaffolds yielded by the SOLiD De Novo Accessory Tools consist of high-quality nucleotide sequences without continuous undefined nucleotides (N), because scaffolds are split at runs of continuous N by the Analysis program. The assembly results from lib1.9.nodot.k31 and lib1.9.qv10.k31 were also used to examine the effect of mate-paired library insert sizes on the assembly performance.
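The three filtering levels described above (nofilter, nodot, and qv10) can be illustrated with a minimal Python sketch. It is not the script used in this study: the record layout (read name, color-space string, list of QVs) and all function names are assumptions made for illustration, and the sketch additionally assumes that the qv10 level also discards reads containing "." calls. In practice the F3 and R3 reads of a mate pair would have to be filtered together, which is omitted here.

```python
def phred_accuracy(qv):
    """Base-call accuracy implied by a Phred-scaled quality value (QV 10 -> 0.90)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def has_undetermined_call(color_read):
    """True if a color-space read such as 'T0120.3' contains an undetermined '.' call.

    The first character is the primer base, so it is skipped.
    """
    return "." in color_read[1:]


def all_qvs_above(qvs, threshold=10):
    """True if every quality value in the read is strictly greater than the threshold."""
    return all(q > threshold for q in qvs)


def filter_reads(records, mode):
    """Filter (name, color_read, qvs) records.

    mode is one of 'nofilter', 'nodot', or 'qv10' (hypothetical labels that
    mirror the assembly names used in the text).
    """
    if mode not in ("nofilter", "nodot", "qv10"):
        raise ValueError("unknown filtering mode: " + mode)
    kept = []
    for name, color_read, qvs in records:
        if mode != "nofilter" and has_undetermined_call(color_read):
            continue
        # Assumption: qv10 is applied on top of the nodot filter.
        if mode == "qv10" and not all_qvs_above(qvs, 10):
            continue
        kept.append((name, color_read, qvs))
    return kept


if __name__ == "__main__":
    reads = [
        ("r1", "T01230", [30, 28, 25, 31, 29]),
        ("r2", "T01.30", [30, 28, 0, 31, 29]),
        ("r3", "T31220", [30, 8, 25, 31, 29]),
    ]
    for mode in ("nofilter", "nodot", "qv10"):
        print(mode, [name for name, _, _ in filter_reads(reads, mode)])
```

Each stricter level keeps a subset of the reads kept by the milder one, which is why the depth of coverage can only decrease as filtering becomes more stringent.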
For both the lib2.8.qv10 and lib1.9.qv10 read sets, we performed the assemblies with different k-mer sizes from 25 to 35 (restricted to odd integers), as k-mer size is a crucial parameter of the Velvet program. All of the assemblies are listed in Table 1. Assemblies were executed on a DELL Precision T7500 desktop computer (CPU, Xeon E5620 × 2; Memory, 96 GB; Hard disk, 2 TB × 5; OS, Ubuntu Linux 10.04). An overview of our de novo assembly process is illustrated in Fig. 1. The pipeline built by YK will be provided upon request.

Assessment of assembled scaffolds
We assessed the genome assembly performance using three criteria: size distribution of assembled scaffolds, degree of genome reconstruction, and sequence accuracy. The size distribution of assembled scaffolds was mainly estimated from the number of scaffolds and the N50, which is defined as the length N for which 50% of all bases in the scaffolds are in a scaffold of length L < N. The maximum size and cumulative length of the assembled scaffolds were also considered. These results were generated by SOLiD De Novo Accessory Tools. To estimate the degree of genome reconstruction, we aligned the assembled scaffolds with the reference genome sequence using the LAST program [26–28], and used all pairs of nucleotide sequences having alignment scores of >40, which is a criterion for significant homology, for analysis of the genome coverage and misarrangement. Misjoins, deletions, insertions, and inversions of >500 bp in the reference nucleotide sequence were counted as misarrangements. We introduced a new statistic, R50, which is the N50 for sequence fragments of the reference genome covered by highly accurate sequences of assembled scaffolds (having alignment scores of >40 by LAST). In R50, the total number of bases of the fragments is taken to be the size of the reference genome, as in the NG50 statistic defined by Earl et al. [29]. To estimate sequence accuracy, nucleotide gaps (insertions and deletions) in the aligned pairs of assembled scaffolds and genome sequence were counted.
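As a rough illustration of how N50 and R50 relate, the following Python sketch computes both from a list of lengths. It is a sketch under stated assumptions, not the in-house code used in this study: the function names are hypothetical, the commonly used "cumulative length reaches at least half of the total" convention is adopted for N50, and R50 is computed in the same way except that the total is fixed to the reference genome size, as in NG50.

```python
def nx_length(lengths, total, fraction=0.5):
    """Length of the sequence at which the cumulative length, taken from the
    longest sequence downward, first reaches fraction * total.

    Returns 0 if the sequences never reach that amount.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0


def n50(scaffold_lengths):
    """N50: the total is the cumulative length of the scaffolds themselves."""
    return nx_length(scaffold_lengths, sum(scaffold_lengths))


def r50(aligned_fragment_lengths, reference_size):
    """R50-like statistic: the total is the reference genome size, and the
    lengths are those of reference fragments covered by accurately aligned
    scaffold sequences.
    """
    return nx_length(aligned_fragment_lengths, reference_size)


if __name__ == "__main__":
    scaffolds = [6, 5, 4, 3, 2]          # toy lengths, arbitrary units
    print(n50(scaffolds))                # 5
    print(r50([6, 5, 4], 20))            # 5
    print(r50([6, 5, 4], 40))            # 0: fragments cover less than half
```

Because a misjoined scaffold is split into separate aligned fragments, such a statistic penalizes misarrangements that N50 alone does not reflect.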
To assess reconstruction of gene-coding regions, the nucleotide sequences of all RIB40 genes (including introns) were subjected to Blastn searches [30,31] against the nucleotide sequences of assembled scaffolds (>95 bp), and the percentages of segment pairs having e-values of <1E-100 (high-scoring segment pairs, HSPs) and of identical bases in the gene region were evaluated. For each gene, only the highest-scoring sequence was used for the evaluation. We also summarized the analysis results for polyketide synthase (PKS) and nonribosomal peptide synthetase (NRPS) genes in A. oryzae RIB40, selected as genes whose annotation included "polyketide" and "non-ribosomal", respectively (Table S2). To examine gene continuity in the assembled scaffolds, two gene clusters of A. oryzae, AO090026000008–AO090026000036 (29 genes, ∼73 kb) and AO090001000018–AO090001000055 (38 genes, ∼75 kb), which were assigned as SMB gene clusters based on homology to the AFLA_139100-AFLA_139440 (aflatoxin biosynthesis) and AFLA_064330-AFLA_064650 (hypothetical gliotoxin biosynthesis) gene clusters, respectively, of A. flavus, were used.
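The selection of the best HSP per gene can be sketched as follows. This is an illustrative Python example, not the Perl code used in this study. It assumes that the Blastn results are available in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the function name and the returned dictionary layout are hypothetical.

```python
def best_hsp_per_gene(lines, max_evalue=1e-100):
    """Keep, for each query gene, the highest-scoring HSP with e-value < max_evalue.

    lines: iterable of tab-separated rows in the 12-column tabular layout
    (an assumption about the input format).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        gene = fields[0]
        hsp = {
            "scaffold": fields[1],
            "identity": float(fields[2]),
            "length": int(fields[3]),
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        }
        if hsp["evalue"] >= max_evalue:
            continue  # not significant under the <1E-100 criterion
        if gene not in best or hsp["bitscore"] > best[gene]["bitscore"]:
            best[gene] = hsp
    return best


if __name__ == "__main__":
    rows = [
        "geneA\tscaf1\t100.0\t1500\t0\t0\t1\t1500\t10\t1509\t0.0\t2771",
        "geneA\tscaf2\t99.0\t900\t9\t0\t1\t900\t5\t904\t1e-150\t1600",
        "geneB\tscaf3\t90.0\t120\t12\t0\t1\t120\t1\t120\t1e-20\t150",
    ]
    for gene, hsp in best_hsp_per_gene(rows).items():
        print(gene, hsp["scaffold"], hsp["identity"], hsp["length"])
```

Genes with no row passing the e-value threshold are absent from the result, which corresponds to genes counted as not reconstructed.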
As an additional analysis, the total coverage of the reference sequence by the short reads was evaluated from mapping results obtained using SHRiMP version 1.4 [32]. The open source program MUMmer version 3.0 [33] was used to draw the dot plot of the assembled scaffolds in assembly lib2.8.nofilter.k31 against the reference genome sequence. All analyses were performed using in-house code written in Perl on the same desktop computer used for the genome assembly.

Performance of our de novo assembly process
To determine the feasibility of de novo sequencing the RIB40 genome using only SOLiD short reads, performance of the genome assembly was evaluated based on the N50 values, maximum size, and total coverage of the reference sequence by the assembled sequences (Table 2). The assemblies were named by combining the library name (lib2.8 or lib1.9), data filtering (nofilter, nodot, or qv10), and k-mer size (k25-k35, restricted to odd integers). The assembly lib2.8.nofilter.k31, which used unfiltered read data, was treated as the standard assembly. The standard assembly had an N50 of 1.7 Mb and a maximum scaffold size of 3.4 Mb without continuous undefined nucleotides. The coverage of the reference sequence by the assembled scaffolds reached 98.87% (Table 2). As shown in Figure 2, most regions of the reference sequence were covered by assembled scaffolds that consisted of >10-kb fragments.

Reconstruction of gene regions and SMB gene clusters
We investigated the degree of reconstruction of gene regions and clusters, which provide valuable information on fungal genomes. As shown in Fig. 3, our de novo assembly process was able to reconstruct most reference gene sequences with good accuracy. In the assembled scaffolds of lib2.8.nofilter.k31, only 4 (<0.3%) of 12,064 genes in the reference genome were completely lost (Fig. 3a). Based on the Blastn analysis, the assembled sequences covered 98% of the reference gene nucleotide sequences by HSPs, with almost all of these sequences being completely identical (Fig. 3b).
The most representative SMB proteins are PKS and NRPS, whose encoding genes contain conserved sequence motifs [34]. Because PKS and NRPS genes consist of similar repeating units, it is difficult to correctly assemble these gene sequences using only short reads. Here, in lib2.8.nofilter.k31, 96.78% of the PKS and NRPS gene sequences formed HSPs with the assembled scaffolds (Table 3). Although this value is ∼1.2% smaller than that for all genes, we demonstrated that a set of SOLiD short sequence reads (50-bp lengths) could reconstruct most of the PKS and NRPS gene sequences despite their many similar repeating units.
Genes must be assembled contiguously and in the correct order for SMB gene clusters to be identified. As shown in Figure 4, the assembled scaffolds composed of >50- and >10-kb fragments covered ∼27% and ∼85% of the reference genome, respectively, in lib2.8.nofilter.k31. We also examined gene continuity in the assembled scaffolds using three gene clusters, AO090026000008–AO090026000036 (29 genes, ∼73 kb), AO090001000018–AO090001000055 (38 genes, ∼75 kb), and AO090113000136–AO090113000138 (3 genes, ∼6 kb), which correspond to SMB gene clusters of aflatoxin (A. flavus), gliotoxin (A. flavus, hypothetical), and KA (A. oryzae) [22]. A. oryzae does not produce aflatoxin due to several gene deficiencies and mutations [35–38], but does have a genomic region corresponding to the aflatoxin biosynthetic gene cluster of A. flavus [21,39–41]. Similarly, A. oryzae is not reported to produce gliotoxin, but a region with homology to a gliotoxin biosynthetic gene cluster from A. flavus was identified in a homology search. As summarized in Table 3, complete gene continuity was preserved in lib2.8.nofilter.k31 and other assemblies. These results suggest that our de novo assembly approach can reconstruct SMB gene clusters in a fungal genome sequence.

Effect of data filtering on assembly performance
SOLiD short reads have a high sequence accuracy of >99.99%; however, reads containing undetermined or low-quality bases are also included. Therefore, excluding low-quality read data is expected to improve assembly accuracy. On the other hand, loss of reads by filtering may lead to inefficient connection of sequence nodes in the de Bruijn graph assembly and in the scaffolding of contigs. To examine the effect of filtering low-quality reads on the assembly performance, we prepared three and two types of read sets for lib2.8 and lib1.9, respectively, in which reads were either unfiltered (nofilter), lacked undetermined bases (nodot), or had only bases of QVs >10 (qv10), and executed the assemblies.
First, we compared the assembly results between lib2.8.nofilter.k31 and lib2.8.nodot.k31. The two assemblies displayed nearly the same pattern of assembled scaffold cumulative lengths, reaching a plateau at ∼37.5 Mb with ∼1000 scaffolds (Fig. 5a). The number and maximum size of assembled scaffolds (Table 2), in addition to the reconstruction percentage of the total genes and genome sequence (Fig. 3 and 4), were also similar for the two assemblies. Therefore, although the N50 value (1.36 Mb) of lib2.8.nodot.k31 was lower than that (1.68 Mb) of lib2.8.nofilter.k31, the overall assembly performance was not significantly affected by retaining reads that include undetermined bases. The SOLiD De Novo Accessory Tools can handle reads including undetermined bases, but it is considered better to exclude such reads because undetermined 'N' bases are automatically converted to 'A' by the Velvet program.
We next tested a higher degree of data filtering by comparing the assembly results between lib2.8.nodot.k31 and lib2.8.qv10.k31, and between lib1.9.nodot.k31 and lib1.9.qv10.k31, and found that filtering low-quality sequence reads reduced misarrangement of the genome sequence. As summarized in Table 4, the numbers of misjoins, inversions, deletions, and insertions (>500 bp), in addition to the total size of the deletions and insertions, were decreased by the data filtering in both lib2.8 and lib1.9. On the other hand, more fragmented scaffolds were generated by the data filtering, as indicated by smaller N50 values and larger numbers of scaffolds (Table 2). Redundant scaffolds were also more frequently generated by the data filtering; relative to the total size of the reference sequence (37.2 Mb, Fig. 5a), the cumulative lengths in lib2.8.qv10.k31 (∼38.0 Mb) and lib1.9.qv10.k31 (∼37.8 Mb) were more redundant than those in lib2.8.nodot.k31 (∼37.5 Mb) and lib1.9.nodot.k31 (∼37.7 Mb), respectively. In addition, small insertions and deletions in the aligned scaffold fragments were increased by the data filtering in both lib2.8 and lib1.9 (Table 5), indicating that the sequence accuracy was decreased by the data filtering. The decreased sequence accuracy arises because scaffolding became more difficult after filtering out sequence reads that hold information for the connection of sequence nodes. In fact, as summarized in Table 6, the scaffolding ratio, that is, the ratio of contig number to scaffold number, was lowered by the data filtering (lib2.8.qv10.k31, 2.82; lib2.8.nodot.k31, 4.59; lib1.9.qv10.k31, 2.71; and lib1.9.nodot.k31, 2.91).
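The two ratios used in this comparison reduce to simple arithmetic, sketched below in Python. The function names are hypothetical. The cumulative lengths and the reference size are the approximate values quoted in the text, whereas the contig and scaffold counts in the usage example are invented placeholders chosen only to reproduce a ratio of 4.59, since the actual counts are given in Table 6.

```python
def scaffolding_ratio(n_contigs, n_scaffolds):
    """Ratio of contig number to scaffold number; larger values mean that more
    contigs were joined into each scaffold on average."""
    if n_scaffolds <= 0:
        raise ValueError("number of scaffolds must be positive")
    return n_contigs / n_scaffolds


def redundancy_percent(cumulative_length_mb, reference_size_mb):
    """Excess of the cumulative scaffold length over the reference size, in percent."""
    return 100.0 * (cumulative_length_mb - reference_size_mb) / reference_size_mb


if __name__ == "__main__":
    reference_mb = 37.2
    cumulative_mb = {
        "lib2.8.nodot.k31": 37.5,
        "lib2.8.qv10.k31": 38.0,
        "lib1.9.nodot.k31": 37.7,
        "lib1.9.qv10.k31": 37.8,
    }
    for name, length in cumulative_mb.items():
        print(name, round(redundancy_percent(length, reference_mb), 2), "% excess")
    # Placeholder counts, not taken from Table 6.
    print(scaffolding_ratio(459, 100))
```

With these approximate figures, the qv10 assemblies exceed the reference size by roughly 2.2% (lib2.8) and 1.6% (lib1.9), compared with roughly 0.8% and 1.3% for the corresponding nodot assemblies.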

Effect of library insert size on assembly performance
When a mate-paired library is used, its sequence information is expected to facilitate the assembly of genomic regions containing repeats that are shorter than the insert size of the library. On the other hand, if the insert size of a mate-paired library is too large, it may become difficult to scaffold contigs that are smaller than the insert size and do not sufficiently overlap with each other. To investigate the dependence of de    Table 2), while the read data of lib2.8 and lib1.9 had similar QVs before and after the data filtering (Table S3). Although the N50 value was smaller (Table 2), the numbers of misjoins and insertions (>500 bp) were decreased in lib1.9.nodot/qv10.k31 compared to lib2.8.nodot/qv10.k31 (Table 4). As a result, the R50 value was comparable between lib1.9 and lib2.8, and was even larger in lib1.9.qv10.k31 than in lib2.8.qv10.k31 (Table 2), even though shorter and more numerous scaffolds were generated in the lib1.9 assemblies. The HSP percentage in gene regions is in accordance with this result (Fig. 3). Thus, although a mate-paired library of shorter insert size may reduce the sizes of assembled scaffolds, it appears superior for the reconstruction of genome sequences. With respect to sequence accuracy, the number of deletion errors in scaffold fragments aligned to the reference genome was decreased in lib1.9 compared to lib2.8, although the number of insertions was increased (Table 5).

Effect of k-mer size on assembly performance
The k-mer size, the principal parameter of the de Bruijn graph algorithm in Velvet, is known to affect assembly performance. For example, Haridas et al. [12] reported that Velvet yielded the largest N50 value when using a k-mer of 43 for Illumina paired-end reads with a length of 75 bp. In the algorithm, sequence nodes having bases of k-mer size are created from read data and are then connected to yield as many nodes as possible in a path or sequence [25]. Therefore, k-mers of adequately large size are expected to increase assembly performance. To estimate the most adequate k-mer size, we performed de novo genome assemblies with k-mer sizes from 25 to 35 (restricted to odd integers) for the read sets of lib2.8.qv10 and lib1.9.qv10, and compared the results.
Table 4. Numbers of misjoin, inversion, deletion, and insertion, and total sizes of deletion and insertion (>500 bp).
The profile of cumulative lengths shows that less redundant scaffolds were generated when using a larger k-mer size in both lib2.8.qv10 and lib1.9.qv10 (Fig. 5b, c). The scaffolding ratio, the ratio of contig number to scaffold number, also increased with the larger k-mer size (Table 6). As a result, the reference gene and genome sequences were well reconstructed with longer scaffold fragments when using a longer k-mer size up to 33 (Fig. 3b, 4). The R50 value increased with larger k-mer sizes, and reached a plateau at a k-mer size of 33 in lib2.8 (Fig. 6a). The k-mer size improved the assembly results more in lib1.9 than in lib2.8; the R50 values were slightly smaller in lib1.9 than in lib2.8 when the k-mer size was under 29, but became larger in lib1.9 when the k-mer size was 29 or larger. It should be noted that the profile of N50 differs from that of R50 because N50 does not include the effect of misarrangements. If N50 were used as the criterion, the most adequate k-mer size would be 29 in both lib2.8.qv10 and lib1.9.qv10 (Fig. 6b), but N50 alone is not sufficient for correctly estimating assembly performance.
A positive effect of using a smaller k-mer size is that it reduced the number of insertions in the scaffold fragments aligned to the reference sequence, whereas the number of deletions depended on the library rather than on the k-mer size (Table 5).
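To make the role of the k-mer size concrete, the following Python sketch decomposes reads into k-mers and links overlapping k-mers in a toy de Bruijn graph. It is a deliberately simplified illustration and not Velvet's implementation: Velvet additionally merges reverse complements, compacts linear paths, and corrects errors, and the SOLiD pipeline works in color-space rather than on nucleotide strings. The function names are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read (len(read) - k + 1 of them)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Toy de Bruijn graph: each k-mer adds an edge from its (k-1)-base prefix
    to its (k-1)-base suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    read_length = 50
    for k in range(25, 36, 2):  # odd k-mer sizes from 25 to 35
        # A longer k-mer leaves fewer k-mers per read to connect neighbouring reads.
        print(k, read_length - k + 1)
    graph = de_bruijn_edges(["ACGTA", "CGTAC"], 3)
    for node in sorted(graph):
        print(node, sorted(graph[node]))
```

For a 50-bp read, a k-mer size of 25 yields 26 k-mers per read whereas a k-mer size of 33 yields 18, which illustrates the trade-off between the specificity of longer k-mers in repetitive regions and the number of overlaps available to connect reads.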

Discussion
To our knowledge, this is the first successful report of de novo sequencing of a fungal genome using only SOLiD short sequence reads. Our assembled RIB40 genome had an N50 of 1.7 Mb, which is more than 22 times longer than the longest previous result of 76.9 kb, obtained for a bacterium using only the SOLiD platform [13]. More than 98% of the bases in the gene regions were reconstructed, and more than 85% and 25% of the reference sequence was covered by assembled scaffold fragments with lengths of >10 and >50 kb, respectively. The assemblies were also able to reconstruct ∼97% of the sequences of the representative secondary metabolite biosynthetic genes, PKS and NRPS, despite the fact that both genes contain numerous repeating units. Based on these findings in A. oryzae RIB40, we conclude that the de novo sequencing of fungal genomes using only SOLiD short reads is practical and feasible for the detection of fungal genes and SMB gene clusters when a mate-paired library with an insert size of ∼2 kb and a depth of coverage of ∼150× are used. Considering that successful de novo assembly with Illumina data was achieved with various mate-paired libraries of different insert sizes [14,15], our de novo assembly still has a great advantage because it requires only a single SOLiD mate-paired library having an appropriate insert size.
Table 6. Numbers and ratios of nucleotide contigs and scaffolds generated in each assembly.
We also investigated the effect of data filtering, library insert size, and k-mer size on assembly performance. The summary of the results and our recommendations drawn from this study are presented in Table 7. A tendency for a trade-off between scaffold size distribution and genome reconstruction was observed except when changing k-mer sizes. Although both data filtering and the use of a mate-paired library with a shorter insert size decreased the N50 value, that is, generated shorter and more numerous scaffolds, they yielded fewer misarrangements in the assembled scaffolds. As a result, the "real" N50 for scaffold fragments aligned to the reference genome (R50) was comparable between lib1.9.qv10.k31 and lib2.8.nodot.k31. Thus, we conclude that data filtering can improve genome reconstruction if the N50 value is not significantly degraded. We recommend assembling two or more data sets with different degrees of filtering and choosing the assembly result from the most filtered data unless the N50 is significantly degraded. Longer k-mer sizes improved the N50 (up to 29) and R50 (up to 33); thus, a k-mer size of 33 is recommended when the short read length is 50 bp. Sequence accuracy of the scaffold fragments aligned to the reference sequence was decreased by either data filtering or the use of a longer k-mer.
The data filtering generated shorter and more numerous scaffolds. As Lin et al. [42] reported, a depth of coverage over 40 does not affect the N50 length of scaffolds assembled by Velvet. Therefore, our findings may not be due to the lowered depth of coverage throughout the reference genome sequence, but were likely due to a local shortage of reads in specific regions of the RIB40 genome as a result of the data filtering. This is supported by the observation that the coverage of the reference sequence by the raw short reads decreased by 0.01% (∼5 kb) after the qv10 data filtering for both lib2.8 and lib1.9 (Table S3). In regard to the insert size of a mate-paired library, the lib1.9 read sets with a 1.9-kb insert size showed smaller N50 values than the lib2.8 ones with a 2.8-kb insert size, but showed better reconstruction of the reference genome sequence and gene regions. A mate-paired library having an insert size of ∼3.0 kb can yield long scaffolds, but carries an associated risk of decreasing the degree of genome reconstruction.
The major NGS platform used for de novo genome sequencing is currently that of Illumina. To date, more than 1000-fold more data sets generated using the HiSeq 2000 system, which is the most recent Illumina platform, have been deposited in public databases, including those of the National Center for Biotechnology Information, European Bioinformatics Institute, and DNA Data Bank of Japan (as of 7/5/2012), than those from the SOLiD 5500 system, the latest SOLiD platform. Presently, SOLiD short reads are not generally recognized as being suitable for de novo genome sequencing; however, our present results provide evidence to the contrary. Although we used the SOLiD 5500xl system in this study, we confirmed that read data from the SOLiD 3 Plus platform also exhibited assembly performance comparable to that of the 5500xl (data not shown).
Our assembly result, with an N50 of 1.7 Mb, is considered to be sufficiently long for fungal genomes, which consist of several chromosomes and have AT-rich centromere sequences that are difficult to read. In the A. oryzae genome, the longest chromosome has a size of ∼6.5 Mb, but the sequence of the chromosome would be divided into two or more scaffolds by the AT-rich centromere; thus the longest scaffold would have a size of ∼3 Mb at most. Even though our assembly yielded sufficiently long scaffolds, there is still room for improvement. Gnerre et al. reported the algorithm ALLPATHS-LG, which can yield an N50 of 11.5 Mb for the human genome using only the Illumina platform, through several improvements over the previous version of their algorithm, ALLPATHS 2, including the handling of repetitive sequences and low-coverage regions [14,43]. Since the longest N50 for fungi by ALLPATHS 2 was 222 kb, our pipeline may also be able to generate longer and more accurate sequences through similar improvements.
The assembly pipeline can be executed on an average desktop computer, with the following recommended specifications: Intel Core i7-3930K CPU (3.2 GHz, six cores), 64 GB memory (DDR3 PC3-10600 DIMM, 8 GB × 8), and 2 TB storage. Among these recommended specifications, memory size is the most critical for the smooth running of the assembly; 64 GB or more is required for the de novo assembly of a fungal genome of 40 Mb. In addition, the assemblies can be performed on Linux operating systems, such as CentOS and Ubuntu, which are available for free. Therefore, de novo genome assembly can be performed in any laboratory once short read sequence data are obtained, which may expand the boundaries of fungal studies.

Conclusions
De novo genome assembly using only SOLiD short reads is practical and feasible for fungal genomes if mate-paired libraries with ∼2-kb insert sizes, read data with a depth of coverage of ∼100-fold, and a k-mer size of ∼33 (when the read length is 50 bp) are used. Using this approach, we reconstructed >98% of the gene regions in the A. oryzae RIB40 genome with an N50 of 1.7 Mb. We also demonstrated that mild data filtering, such as excluding reads with undetermined bases, yielded long and accurate scaffolds. Taken together, our findings suggest that accurate and high-throughput SOLiD platforms that generate short reads of <60 bp can be utilized for the de novo sequencing of fungal genomes. This approach may be further improved by combining it with data from developing NGS technologies that yield long sequences (>1 kb) with low accuracy.
Table 7 footnotes. The open circle and cross denote improvement and degradation, respectively, for each entry of the assembly performance. Degraded in lib2.8 but improved in lib1.9. Maximum at k-mer size of 29 in both lib2.8 and lib1.9. doi:10.1371/journal.pone.0063673.t007

Supporting Information
Table S1 Parameters used in the de novo genome assemblies. (DOC)