Combining independent de novo assemblies to optimize leaf transcriptome of Persian walnut

Transcriptome resources can facilitate to increase yield and quality of walnuts. Finding the best transcriptome assembly has not been the subject of walnuts research as yet. This research generated 240,179,782 reads from 11 walnut leaves according to cDNA libraries. The reads provided a complete de novo transcriptome assembly. Fifteen different transcriptome assemblies were constructed from five different well-known assemblers used in scientific literature with different k-mer lengths (Bridger, BinPacker, SOAPdenovo-Trans, Trinity and SPAdes) as well as two merging approaches (EvidentialGene and Transfuse). Based on the four quality metrics of assembly, the results indicated an efficiency in the process of merging the assemblies after being generated by de novo assemblers. Finally, EvidentialGene was recognized as the best assembler for the de novo assembly of the leaf transcriptome in walnut. Among a total number of 183,191 transcripts which were generated by EvidentialGene, there were 109,413 transcripts capable of protein potential (59.72%) and 104,926 were recognized as ORFs (57.27%). In addition, 79,185 transcripts were predicted to exist with at least one hit to the Pfam database. A number of 3,931 transcription factors were identified by BLAST searching against PlnTFDB. Furthermore, 6,591 of the predicted peptide sequences contained signaling peptides, while 92,704 contained transmembrane domains. Comparison of the assembled transcripts with transcripts of the walnut and published genome assembly for the ‘Chandler’ cultivar using the BLAST algorithm led to identify a total number of 27,304 and 19,178 homologue transcripts, respectively. De novo transcriptomes in walnut leaves can be developed for the future studies in functional genomics and genetic studies of walnuts.

Introduction Persian walnut (Juglans regia L.) is an important, profitable species of the genus Juglans [1,2]. The world's walnut production and harvest area are 3,747,549 tons and 1,186,399 ha, respectively. China, the United States and Iran are among the largest producers of walnut in the world [3]. Walnut cultivation has a long, archaic history, but opting for breeding programs have only started in the twentieth century. Designing breeding programs using molecular tools have mostly been done in China, the USA, France, Turkey and Iran [4]. The main objectives of breeding programs for the Persian walnut is to achieve specific traits such as late leafing, lateral bearing, good kernel color, high yield, ease of shell cracking, late flowering, early harvest dates and resistance to major diseases. In particular, the prevalence of walnut blight and anthracnose among cultivars has provided good potential for research, which may or may not be coupled with the induction of tolerance to abiotic stresses (e.g. drought and salinity) for rootstocks [2,4].
Juglans regia has a diploid genetic structure, with 2n = 32 chromosomes. It has approximately 606 Mb per 1C genome [5]. Nonetheless, the current information on the genetic modalities of walnut is not as comprehensive as it should be, and the provision of more genetic information can assist researchers in developing efficient breeding strategies for further improvements.
RNA sequencing (RNA-Seq) is a promising application of next generation sequencing (NGS) that has been successfully used for analyzing the entire transcriptome, even for nonmodel plants such as walnut that lack a complete reference genome [6]. This method enables researchers to identify transcriptome structures, novel transcripts, genes that are differentially expressed, alternative splicing and genetic variants (such as single nucleotide polymorphism) [7]. In addition, the RNA-Seq method can be applied in molecular breeding, especially in relation to research that promote the development of molecular markers [8].
In certain occasions, reference genomes are not available to researchers. A reference transcriptome can be built from RNA-Seq data through de novo transcriptome assembly. However, there is a great challenge in terms of accuracy in assembling an RNA-Seq dataset for reliable downstream genetic analysis, which is partly due to the lack of a unique approach to the discipline [9]. Accordingly, there is a clear absence of knowledge about the most suitable assembling algorithms and the best k-mer length for assembling vast amounts of RNA-Seq reads. In this context, there have been few studies about the walnut transcriptome. For instance, more than 49.9 million sequencing reads have been made possible in four organs of walnut (i.e. leaf, bud, female flowers and male flowers). These sequencing reads have been carried out by the Illumina sequencing technology [10], through the process of which one assembly of software is used (Trinity, version 2.0.6). In a previous research, 117,229 transcripts were discovered (N50 of 1955 bp) and, in another study, the Trinity assembly was used for assembling 13,2041,772 reads into 111944 transcripts with a mean length of 1,180 bp and an N50 value of 1,833 [11]. Recently, the biodiversity and plant-microbe interactions in the walnut ecosystem have been a subject of studies using RNA-Seq in California. These were carried out in the context of de novo transcriptome studies on walnut [12,13]. Trinity is the only assembler that has been used for this purpose so far. To the best of our knowledge, no other assembler has been employed to construct an appropriate transcriptome profile for walnut.
This study was designed in an effort to create new RNA-Seq data from Persian walnut. The ultimate goal was to construct a comprehensive annotated transcriptome assembly for the Persian walnut. De novo assemblies were constructed with seven independently developed de novo transcriptome assemblers having various k-mer sizes. The assemblers were, namely, Bridger (version r2014-12-01) [14], BinPacker (version 1.0.0) [15], SOAPdenovo-Trans (version 1.03) [16], Trinity (version 2.0.6) [17] and rnaSPAdes (version 3.11.1) [18]. Here, there were two widely-used merging approaches which included EvidentialGene (version 2013.07.27; http://arthropods.eugenes.org/EvidentialGene/) and the Transfuse approach (version 0.5.0; https://github.com/cboursnell/transfuse). Considering the applications of these altogether, this research aimed at a combination of all independent assemblies and led to the production of a final assembly. Based on different assembly metrics, 15 transcriptome assemblies were evaluated comparatively so as to find the best de novo transcriptome assembly.

Plant material
Self rooted clonally propagated walnuts cv. 'Chandler' were used for transcriptome sequencing. Eleven two-year old plants were transplanted into 2 liter pots containing peat and perlite (1/1, v/v). Before starting the experiment, the plants were grown in a research greenhouse under controlled environmental conditions (with 16 h light/8 h dark photoperiod for 30 days).

RNA extraction and cDNA library preparation
The total RNA was isolated from approximately 100 mg of frozen leaf tissue using the RNeasy plant mini kit (Qiagen) according to the manufacturer's instructions. The concentration of RNA was measured with NanoDrop (Thermo Scientific™ NanoDrop 2000) and the purity of samples were checked on 1% agarose gels for evaluating the 28S and 18S ribosomal RNA bands (28S/18S ratio). If samples had a ratio (28s/18s) of over 1.8 and an OD 260/280 ratio greater than 1.9, they were sent to BGI Co. in China for sequencing. In BGI Co., the RNA integrity number (RIN) was determined using the RIN algorithm of the Agilent Bio analyzer 2100 system (Agilent RNA 6000 Nano Kit, Agilent, Cat No.5067-1511). Only RNA samples that showed a RIN higher than 7 passed the quality test and were used for cDNA library construction. All cDNA libraries were sequenced using a paired-end strategy (read length, 150 bp) on an Illumina HiSeq 2000 platform.

De novo transcriptome assembly
All clean and error-corrected reads were assembled into unique putative transcripts by five different assemblers, namely, Bridger (version r2014-12-01) [14], BinPacker (version 1.0.0) [15], SOAPdenovo-Trans (version 1.03) [16], Trinity (version 2.0.6) [17] and rnaSPAdes (version 3.11.1) [18]. It is well known that transcriptome assembly algorithms are strongly affected by k-mer length (i.e. the size of overlapping sequences used in aligning the reads) [20]. Therefore, two k-mer sizes of 25 and 32 were applied for Bridger, BinPacker and Trinity assemblers. The SOAPdenovo-Trans was run with k-mer values of k = 25, 31, 41, 51, 61 and 71. Also, the rnaS-PAdes was run with an automatic route of determining the k-mer lengths. Furthermore, CAP3 was used for producing longer consensus transcripts and for reducing the redundancy of contigs obtained through all assemblers. The mentioned production and reduction were performed by setting the minimal identity (%) of the overlap as 95% [21]. For all methods, the minimum transcript length was set at 200 bp. In addition, all transcriptome assemblies that had been assembled using single k-mer assemblers were then merged into a single assembly using two different pipelines, namely, EvidentialGene version 2013.07.27; (http://arthropods. eugenes.org/EvidentialGene/) and the Transfuse v0.5.0 (https://github.com/cboursnell/ transfuse). EvidentialGene pipeline (tr2aacds) is usually known to select an 'optimal' set of de novo assembled transcripts based on the coding potential. The selection is specifically made from a pool of transcript sequences. This pipeline enables researchers to reduce the complexity of the de novo transcriptome assemblies by discarding highly similar transcripts, by sequencing fragments and transcripts with low coding potential [22,23]. On the other hand, the Transfuse can intelligently reconstruct and merge multiple de novo transcriptome assemblies [24]. In total, 15 different transcriptome assemblies were compared with each other in the current study.

Assembly quality assessment
All assemblies (i.e. 13 individual (single k-mer) and two concatenated transcriptome assemblies) were evaluated in terms of accuracy and completeness using typical methods: 1) N50 length (as the shortest contig length representing 50% of the total assembled length) in the transcriptome assemblies, 2) the total number of contigs, 3) the total number of bps in the assembly, 4) the percentage of reads that were mapped back to transcriptomes (or reads being mapped back to the transcriptome RMBT) and 5) the BUSCO analysis (Benchmarking Universal Single-Copy Orthologs, version 2.0.1) for assessing the degree of annotation [25]. The assembly statistics (including N50 length and total number of contigs) were calculated using the TrinityStats.pl from the Trinity package. Bowtie2 (version 2.3.0) was applied in a very-sensitive mode for computing the proportion of clean reads that could be mapped to each assembly [26]. The BUSCO (which is a gene-based quality assessment software) was employed to assess the completeness of the assemblies in terms of gene content. This was performed using the OrthoDB v9.1 'embryophyta' base as a reference, which contains 1,440 BUSCO groups. All 15 transcriptome assemblies were evaluated via the mentioned criteria. The assembly with the best characteristics was selected as the most optimum assembly in terms of performance.

Functional annotation
The best performing assembly was annotated using different databases as well as bioinformatics tools. The assembled transcripts were translated into putative coding sequences by Trans-Decoder (version v2.0.1, available at http://transdecoder.github.io) using default parameters. Also, the transcripts were processed through a similarity search against numerous databases at the transcript and peptide levels. To do this, BLASTX (version 2.3.0) was used for comparing the transcripts against the NCBI non-redundant (nr) protein database and against the Uni-ProtKB database (E-value cutoff of 1E-5). Moreover, all predicted proteins (as obtained by TransDecoder) were processed through a BLASTP (version 2.3.0) search against the three databases with an E-value cutoff of 1E-5. The results of BLAST were then processed using 'ana-lyze_blastPlus_topHit_coverage' -which is a script of the Trinity package -to assess the fraction of nearly full-length assembled transcripts. In addition, the transcripts were searched for conserved functional domains and transcription factors against the Pfam database and against the plant transcription factor database (PlnTFDB) (version 4.0) using both the HMMER tool (version 3.1b2) [27] and the BLASTn, respectively. The signal peptide and a prediction of the transmembrane domain were estimated by the signalP (version 4.1) and by the tmTMHMM server (version 2.0) programs, respectively. Ultimately, these transcripts were searched against all transcripts of walnut (55,846) which are constantly available in the "Nucleotide" database at NCBI and also available in the genome assembly describing the 'Chandler' cultivar of walnut [11].

Sequencing the transcriptome in walnut leaf
A high-quality reference-transcriptome-assembly was constructed by extracting the total RNA from 11 leaf samples of Persian walnut seedlings. The transcriptome was sequenced by Illumina HiSeq 2000 platform. In total, 240 million paired-end reads (each having a read length of 150-bp) were generated with a GC content that ranged from approximately 44 to 46%. On average, individual samples yielded 21.8 million (±0.3) reads. After quality trimming, which also included the correction of errors, a total of 231 million high-quality reads were made for de novo transcriptome assembly, followed by the relevant analysis. In other words, screening out poor quality reads (Q<20) led to the removal of 3.77% of the raw reads, implying a very high quality of the RNA-Seq data. The results pertaining to the RNA sequencing of the 11 samples were recorded in detail and then summarized (Table 1).

Quality evaluation of the walnut leaf transcriptome assemblies
Five assemblers (i.e. Bridger, BinPacker, SOAPdenovo-Trans, Trinity and SPAdes) were used along with different k-mer sizes and two merging pipelines (i.e. transfuse and EvidentialGene). Then, 15 de novo transcriptome assemblies were created for Persian walnut. Fig 1 shows an overview of the steps concerning the assembly process of pipelines de novo. Part of the objective was to find the best transcriptome assembly which differs characteristically from the standard assembly. A summary of the assemblies and their characteristics which are produced by each assembler (including the different k-mer) are shown in Table 2. The number of assembled transcripts varied greatly among the assemblies. Additionally, the number of transcripts decreased when the k-mer lengths of the seven different assemblers increased. In this context, the highest and the lowest number of assembled transcripts were produced by Transfuse (379,406 transcripts) and SOAPdenovo-Trans (k-mer = 71; 88,787 transcripts) methods, respectively. In general, the Transfuse produced about 2 to 5 times more transcripts than the other assemblers. Like the number of transcripts, the output of assemblers varied considerably in terms of the contigs length. Among all assemblies, the Transfuse consistently generated the longest contigs length than any other assemblies, followed by Trinity (k-mer = 32, 1,194.58 bp). The average contig length was 1,282.08 bp. Furthermore, the shortest contig length was measured in SOAPdenovo-Trans, having a k-mer of 25 (247.39 bp). The results of preliminary analysis varied in terms of the contig size in different assemblies (Fig 2). Variations in the contig lengths of the assemblies showed that the contig size frequently occurred within the range of 200-300 bp in all assemblies except when using EvidentialGene (whereby the contig lengths frequently occurred within a range of 300-400 bp). In order to further evaluate the performance of the assemblies, the overall transcriptome sizes were compared. Assemblies from the largest to the shortest transcriptome size were ranked according to the following order:

Functional annotation
After evaluating the different de novo assemblies, EvidentialGene was selected as the final assembly and was utilized for the functional annotation analysis. Out of 183,191 transcripts, a total of 111,451 transcripts were annotated using one or more databases. Based on the initial transcript sequences and the TransDecoder tool, 104,926 ORFs were predicted. In total, 84,384 transcripts and 71,314 protein sequences shared similarities when compared against the Uni-ProtKB database using the BLASTx and BLASTp searches, respectively. When aligned against the NCBI nr database, the number of significant homologous transcripts and protein sequences increased to 109,413 and 85,523, respectively. Protein sequences with little similarity at the sequence level can share conserved domains, and thus a domain-based annotation was performed. Accordingly, 79,185 transcripts were found with at least one hit to the Pfam database. The final assembly included 79,185 transcripts with at least one hit to the Pfam database. Moreover, 3,931 transcription factors were identified by BLAST searching against PlnTFDB.
In comparing the transcripts with the Rfam database, 882 significant matches were found. On the other hand, 6,591 and 92,704 of the predicted peptide sequences contained signaling peptides and transmembrane domains, respectively. Finally, a total number of 27,304 and 19,178 of transcripts revealed similarities based on BLAST search against all transcripts of the walnut and of the published genome assembly for the 'Chandler' cultivar, respectively (Table 4, S1 and S2 Tables).

Discussion
In plant species without a published genome or with an incomplete sequenced genome, such as walnut, the occurrence of de novo transcriptome assembly facilitates the study of transcriptomes and that of relevant genes of expression. However, lacking an optimum standard for de novo transcriptome assembly can entail several challenges and limitations. The current study was an attempt to alleviate those challenges and reduce the limitations. Specifically, it involved identifying a transcriptome reference in walnut leaves, one that is likely to be complete and in detail when compared to the 15 transcriptome assemblies that were generated with five de

PLOS ONE
Combining independent de novo assemblies to optimize leaf transcriptome of Persian walnut novo assemblers and two merging approaches. Additionally, the best assembly was selected based on the quality of assembly in terms of N50 length, the total number of contigs, the total number of bps in the assembly, RMBT and BUSCOs. Our analyses took advantage of the 11 RNA-Seq libraries with increasing numbers of Illumina paired-end reads. In total, a comprehensive strategy for walnut transcriptome assembly was applied. This can be applied to settings that require the reconstruction of high quality transcriptome in non-model species other than the walnut. After trimming the raw reads, 96.23% of the reads (i.e. 231 million) were used for de novo transcriptome assembly. Previous studies have reported that the range of GC content in different samples has important effects on transcriptome assemblies and on the variation of the GC content among the libraries which can lead to shorter assembled transcripts [28,29]. Here, the GC content in all samples ranged approximately from 44 to 46%. This finding is in agreement with a previous study that showed a GC content of 46.6% in Juglans hopeiensis [30].
In line with previous studies, the results showed that there is a variation among the metrics of the assemblies. Previous research indicated that different assembly methods generate significantly different assemblies [20,31]. However, apart from SOAPdenovo-Trans, most of the assemblers that were used in the current research were equally good and, in many aspects, proved to be competitive and comparable to each other. Since SOAPdenovo-Trans is mentioned as an exception, the assemblies being generated thereof were excluded from further evaluation. Generally, the quality metrics were fairly even among the top performing assemblies (i.e. all assemblies except the SOAPdenovo-Trans) in terms of N50 and contig lengths. All of the top performing assemblies showed an N50 length of more than 1,800 bp with similar contig length distributions. Previous research demonstrated that N50 in the walnut transcriptome is 1,833 bp when processed through Trinity [11]. In another study on Juglans mandshurica, N50 length was reportedly 1,863 bp [13]. The average contig length was higher than 900 bp in all of the top performing assemblies, which is comparable to previous studies on plants such as J. hopeiensis (731 bp) [30], Camelina sativa L. (1198 bp) [32], Pinus contorta (715 bp) [33] and Lens colinaris (770 bp) [34]. The current results are in line with earlier studies, even though N50 and contig lengths cannot be used for selecting the best assembly. Basic parameters such as N50 and contig lengths can be used in genome assembly, but it is well known that these parameters are inappropriate for the evaluation of transcriptome assembly, mainly because the expected transcript lengths remain unknown in some species. Furthermore, longer transcripts or larger total assemblies do not necessarily indicate a better transcriptome assembly [35]. In addition, the available scientific literature suggests that N50 values can be artificially changed based on the defined k-mer size or based on the minimum contig length [36]. On the other hand, the average of RMBT in this study appeared to be more than 94% in all assemblies that showed high levels of performance. It is worth noting that the RMBT percentage in this study turned out to be higher than those reported in previous studies, thereby indicating a high quality of the assemblies. For example, the RMBT is reportedly 89.93% in pistachio (pistacia vera L) [37] and 83.68% in almond (Prunus dulcis Mill.) [38]. Despite the fact that the three mentioned metrics (i.e. N50, contig length and RMBT) can evaluate the assemblies in different aspects, understanding the biological differences between the assemblies becomes quite difficult when evaluations are based on these metrics. Accordingly, BUSCO can be considered as an important measure of performance in studies of this kind. BUSCO is capable of searching against a database of highly conserved single-copy ortholog genes. It checks whether each BUSCO group is either complete, duplicated, fragmented or missing in the transcriptome assembly. Among the assemblies, the merging pipelines performed better as Transfuse (94.8%, with 1364 complete BUSCOs) and as EvidentialGene (94.3%, with 1358 complete BUSCOs), thereby yielding the highest BUSCO scores. These results accord closely with previous studies which reported a better performance of merged transcriptome assemblies generated from multiple assemblies [39]. In addition, the number of duplicates appeared to be high in the Trinity, but had a lower score of 88.2% (with 1,270 complete BUSCOs in kmer = 25) and of 90% (with 1,296 complete BUSCOs in kmer = 32) when compared to EvidentialGene and Transfuse. Taken together, each of the assemblers has different algorithms and can potentially identify a small number of unique transcripts relative to each other [20,40]. Thus, it might be worthwhile to combine assemblies that are generated by different assemblers, which might reconstruct a more complete transcriptome reference. However, these methods are more complicated to perform than the straightforward de novo assemblers like Trinity in terms of time and resources.
Of the merged assemblies, Transfuse showed a slightly better performance based on the final score of BUSCO, but it produced more than double the number of contigs generated by the other merging assembler, indicating that this tool produced more redundancy relative to EvidentialGene. Also, this finding was not in line with a few previous studies on walnut and on other close plants [11,37,41]. In this regard, the number of duplicated copies (based on BUSCO analysis) was higher in Transfuse (85.8%) than in EvidentialGene (44.1%), which can be considered as a weakness of Transfuse. In contrast, the percentage of complete and single copy scores of BUSCOs by EvidentialGene was higher than by Transfuse, which can be more handy in downstream analysis. On the other hand, Transfuse (92.49%) had 27.98% more reads than EvidentialGene (64.50%). These reads were aligned concordantly more than once (S3 Table). It reiterates that the assembly generated by the Transfuse method contained more redundancy than when generated by EvidentialGene. This can lead to a bias in downstream gene expression and enrichment analysis. Therefore, a higher percentage of reads being uniquely mapped back and the lower percentage of duplicated BUSCO scores can make Evi-dentialGene be selected as the final transcriptome, which might be preferable to other downstream analyses. Of the 193,422,472 transcripts produced by EvidentialGene, 61% were annotated using one or more databases, which is better than those reported previously on walnut and on the other close plants. In another report on walnut, only 22.26% of the transcripts were annotated [11]. In the current study, 104,926 ORFs were predicted, while another study on walnut predicted a total of 35,836 ORFs out of a total of 111,944 transcripts [11]. In mango (Mangifera indica L.), 80,969 transcripts were reported, out of which 33,142 ORFs were predicted [42]. The current study revealed 3,931 transcription factors when the transcripts were searched by BLAST against PlnTFDB (S4 Table). Some of these transcription factors were identified as MYB, WRKY, bHLH, NAC and bZIP. To the best of our knowledge, only seven studies have so far investigated walnut transcriptome assemblies. Those studies are publicly available and can be compared with our study. The past seven studies had used Trinity alone and by default parameters so as to construct de novo assemblies [10,11,12,13,30,43,44]. In addition, none of the studies made use of BUSCOs to evaluate the quality of assemblies. Here, however, we evaluated the walnut transcript according to the relevant database along with BUSCOs. Previous results had shown that 1,440 single-copy ortholog genes are common in plants by 89.9% (indicating 1,294 BUSCOs). This is while the current study led to the identification of 94.8% (with 1364 BUSCOs) and 94.3% (with 1358 BUSCOs) by Transfuse and EvidentialGene.

Conclusion
The purpose of the current study was to construct a complete transcriptome assembly in walnut leaves. This involved using a range of tools and k-mer values. To the best of our knowledge, this is the first study in which different assemblers were evaluated with a range of k-mers to create a de novo transcriptome assembly in walnut. Combining assemblies that are generated by straightforward de novo assemblers can lead to the construction of a more comprehensive transcriptome assembly. In addition, based on various quality metrics of assembly, the transcriptome assembly which is produced by EvidentialGene can be prioritized over other assemblies in walnut leaves. In summary, the reference transcriptome assembly which was generated by EvidentialGene included 183,191 transcripts with an N50 length of 1,831 bp, among which 64,702 transcripts were longer than 1 kb.