In Silico identification and annotation of non-coding RNAs by RNA-seq and De Novo assembly of the transcriptome of Tomato Fruits

The complexity of the tomato (Solanum lycopersicum) transcriptome has not yet been fully elucidated. To gain insights into the diversity and features of coding and non-coding RNA molecules of tomato fruits, we generated strand-specific libraries from berries of two tomato cultivars grown in two open-field conditions with different soil type. Following high-throughput Illumina RNA-sequencing (RNA-seq), more than 90% of the reads (over one billion, derived from twelve dataset) were aligned to the tomato reference genome. We report a comprehensive analysis of the transcriptome, improved with 39,095 transcripts, which reveals previously unannotated novel transcripts, natural antisense transcripts, long non-coding RNAs and alternative splicing variants. In addition, we investigated the sequence variants between the cultivars under investigation to highlight their genetic difference. Our strand-specific analysis allowed us to expand the current tomato transcriptome annotation and it is the first to reveal the complexity of the poly-adenylated RNA world in tomato. Moreover, our work demonstrates the usefulness of strand specific RNA-seq approach for the transcriptome-based genome annotation and provides a resource valuable for further functional studies.


Introduction
Next Generation Sequencing (NGS) technologies are having an important impact on genomic research because they can be employed to address questions unapproachable with earlier tools. The power and speed of NGS allow to generate high-quality genome sequences [1,2], compare genomes across multiple samples [3,4], map structural variations [5][6][7] and identify polymorphisms [8,9].
Compared to other existing methods, Whole Transcriptome Shotgun Sequencing, better known as RNA-seq, has provided a very powerful alternative to gene expression analysis [10,11]. High-throughput sequencing-based approaches of the RNA world are also used to assemble, improve or extend the transcriptome of an organism, either by reference-based or de novo strategies [12][13][14][15][16]. They also allow a comprehensive discovery of novel genes and transcripts at a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 a fraction of the cost and time of the conventional sequencing methods [17,18]. Moreover, NGS technologies can reveal previously hidden dimensions in transcriptomes, such as isoforms, novel splice junctions, long non-coding RNAs and expression of antisense RNAs [14,15,[19][20][21][22]. Therefore, with the recent increased availability of plant genome sequences, RNAseq is being also employed to improve and expand existing genome annotations [18,[23][24][25]. Finally, considering the size of plant genomes, RNA-seq is also a cost-effective strategy to reveal sequence variations in coding sequences [25][26][27][28].
Tomato is arguably the most important vegetable in the world [29]. Cultivation is largely based on hybrid varieties, although in certain areas, local accessions fetch a premium price. The importance of tomato cultivation is increasing worldwide also due to growing attention of the potential health benefits of the compounds of berries. Moreover, in recent years, tomato is becoming a widely used research model organism for fleshy fruit biology, as well as in relation to fruit size and shape evolution, plant stress response, domestication and adaptation [29].
The complete tomato genome sequence was released in 2012 [30]. The variety chosen was 'Heinz 1706', a processing inbred cultivar. The genome size is approximately 930 Mb and of the 34,725 protein-coding genes, 30,855 are supported by RNA-seq data. The official functional annotation for the tomato genome, provided by the International Tomato Annotation Group (ITAG), reports 26,026 genes that are associated to Gene Ontology (GO) terms describing their functions, for a total of 2,108 unique GO terms. All this information is a precious resource especially for functional studies [31].
Considering the recent development of plant genomics, transcriptomic studies relies on reference annotation that not necessarily capture the full catalogue of transcripts and their variations [32]. With the aim to provide information useful to increase our understanding of the tomato fruit, the focus of the present study was to present an extended transcript catalogue of this organ. To this goal, we obtained 12 sets of strand-specific RNA-seq data from tomato fruits of two varieties in two environments. Starting from a genome-guided assembly, we report a comprehensive analysis of the tomato transcriptome, improved with a collection of 39,095 transcripts that include splicing variants, transcripts overlapping annotated loci, natural antisense transcripts and transcripts completely absent from the official tomato annotation. Furthermore, we investigated the genetic diversity by analyzing the polymorphisms in the varieties under investigation.

Plant material
Transcriptomic analysis was carried out on berries of two tomato (Solanum lycopersicum L.) cultivars, 'Kiros' (K) and 'Docet' (D). 'Kiros' is an inbred variety of local origin, while 'Docet' is a hybrid cultivar (Seminis). Both varieties are used for tomato processing. Plants were grown in two different locations near Naples (Italy), Acerra (AC) and Brusciano (BR), with different soil types, volcanic, defined as Typic Haplustoll, and alluvial, defined as Typic Calciustepts, respectively [33], to have a richer catalogue of expressed molecules. Plants were grown in the traditional spring to summer crop cycle at the experimental fields of the European Environmental Company (EURECO), Acerra (Italy), which also provided the seeds. The same day, fully ripe berries from three different plants, representing biological replicates, were collected. For each sample, three fruits from the same plants were pooled in order to reduce biological variation. Fruits were immediately frozen in liquid nitrogen and stored at -80˚C until RNA isolation.
RNA isolation and evaluation. Samples were ground to a fine powder with liquid nitrogen using a sterile mortar and pestle. RNA isolation, from around 300 mg of powdered frozen tomato fruits, was performed by a phenol/chloroform procedure followed by a lithium chloride precipitation, as described [34]. To reduce technical variation, for each replicate, total RNA of each sample was pooled from 8-12 isolations. To get rid of possible DNA contamination, 10 μg of total RNA were treated with 6 U of DNAse I Amplification Grade (Invitrogen) in 1X DNAse I Reaction Buffer (Invitrogen). DNase I was removed by phenol/chloroform purification followed by RNA precipitation through the addition of 0.1 volume of 3M Sodium Acetate (pH 7.0) and three volumes of ethanol. The integrity of RNA molecules was evaluated by electrophoretic RNA measurements, recorded with an Agilent 2100 Bioanalyzer. Only samples with high (>8) and similar (overall s.d. across all samples: ± 0.21) RNA Integrity Number were further processed.

Library construction
We converted the mRNA of total RNA into a library of template molecules of known strand origin using the reagents provided in the TruSeq Stranded mRNA Sample Preparation Kit (Illumina). Briefly, the poly-A mRNAs were purified from 8 μg of total RNA using poly-T oligos attached to magnetic beads. Following purification, the mRNA was fragmented into small pieces using divalent cations under elevated temperature. The cleaved RNA fragments were copied into first strand cDNA using reverse transcriptase and random primers. Strand specificity was achieved by using dUTP (instead of dTTP) in the second strand cDNA synthesis, using DNA Polymerase I and RNase H. After 3' end adenylation and adapters ligation, the products were purified and selectively enriched by PCR to create the final ds cDNA library. Libraries were validated and quantified using an Agilent 2100 Bioanalyzer (Agilent Technologies), and pooled such that each index-tagged sample was present in equimolar amounts, with a 2 nM final concentration of the pooled samples.

RNA-sequencing and data analysis
The pooled samples were subject to cluster generation and sequencing using a HiSeq 1500 System (Illumina) in a 2x100 paired-end format at a final concentration of 8 pmol. In total, 12 libraries were sequenced at the Laboratory of Molecular Medicine and Genomics, Department of Medicine and Surgery (University of Salerno), according to manufacturer's instructions. Quality control on raw sequences was carried out with FastQC (www.bioinformatics.babraham.ac.uk/ projects/fastqc/). Prior to further analysis, we removed low quality portions while preserving the longest high quality part of NGS reads, using Trimmomatic [35]. To increase quality and reliability, the minimum length allowed was 25 bp and the minimum quality score was set to 35. Briefly, the RNA-seq analysis was a two-stages process that involved reads processing and assembly of new transcripts, and annotation of lncRNAs and sORFs. The first step included the alignment of the reads against the reference genome, the reference-guided assembly of new transcripts, the annotation and classification of new loci. The second step consisted in the assembly and annotation of lncRNAs and sORFs by using specific pipelines [36]. Transcripts encoding a protein shorter than 200 amino acids were BLASTed against the NCBI NR database. Sequences having a match with a known protein (e-value lower than 0.001) were annotated as sORFs.
The TopHat script v.2.0.11 [37] was used to align high-quality RNA-seq reads to the tomato genome (version SL2.50). The alignment files were filtered to remove duplicated reads and those with a low mapping quality (<30). Data were then used as input for Cufflinks (v. 2.2.1) [15] together with the ITAG2.40 annotation to integrate the official assembled transcripts with the new ones. In-house pipelines and manual curation were performed to remove possible annotation errors. Specifically, erroneous gene fusions and loci completely overlapping with other loci in the same strand orientation were removed.
Functional annotation of sequences and data mining on the resulting annotations, primarily based on the GO vocabulary was performed with Blast2GO with manual curation [38]. This also allowed to obtained information expressed as a controlled vocabulary of functional attributes via the GO. Gene-annotation enrichment analysis was performed employing the Fisher's exact text corrected for multiple testing (p<0.05). Summarization and visualization of enriched GO terms was carried out using REVIGO [39].

Variant calling and annotation
The Simply Unified Pair-End Read (S.U.P.E.R.) workflow was used to identify sequence variations [40]. Briefly, the pipeline removes adaptors, performs trimming, checks data quality, pairs existent reads in both files (forward and reverse reads), aligns sequences on reference genomes and call sequence variations. S.U.P.E.R output were filtered to extract significant variations (high confidence calls) affecting the samples (quality score higher than 30 and total depth for each variant higher than 6). The assignment of functional information to DNA variants was performed with SNPeff (v 3.6) [41]. We categorized each variant based on its relationship to coding sequences in the genome and on how the variant may change the coding sequence and affect its gene product.

Mapping of RNA-seq reads to the tomato reference genome
The current version of the tomato reference genome (SL2.50) was released in 2014. The estimated genome size is 930 Mbp and 84% is annotated (782 Mbp). The structural annotation comprises 34,725 genes models, with a total of 160,007 exons and 125,280 introns.
The 12 RNA-seq dataset from tomato fruits (2 varieties, 2 environments, 3 biological replicates) were mapped against the reference genome. The mean input and its standard deviation was 104,487,055.8 ± 5,795,556.7 reads and 102,660,154.5± 5,854,766.6 reads per sample, before and after the data quality control respectively, for a total of 1,253,844,670 and 1,231,921,854 reads ( Table 1). The mean mapping rate was 90.7 ± 2.9%, which indicated that a small portion of reads were not counted in the RNA-seq read mapping process ( Table 1).

Improvement of the tomato annotation
To extend the annotation of the tomato genome, the 12 RNA datasets were used to uncover novel transcripts. The high quality reads were aligned to the Solanum lycopersicum reference genome sequence (SL2.50). The resulting alignments were filtered to remove duplicated reads and those with a low mapping quality. The data were then used, together with the ITAG2.40 annotation file, to assemble transcripts. We found in total 73,820 transcripts and 34,725 matched those in the available reference. The new transcripts (39,095) were grouped into four classes.
Class J includes transcripts that are new splicing variants of already annotated loci; class O consists of transcripts that overlap partially with already annotated loci and class X comprises new transcripts that have an antisense orientation to already annotated loci. Finally, we indicate as TCONS new transcripts that are absent from the available annotation.
The Class J included 31,349 splicing variants corresponding to 10,788 unique tomato annotated genes. We found from a minimum of 1 to a maximum of 24 splicing variants per gene, for an average of 2.91 variants per gene. The distribution of splicing variants transcripts and corresponding genes per chromosome is reported in Fig 1. The average number of splicing variant transcripts per chromosome is 2,41. The highest number of genes with splicing variants (1,37) was relative to the SL2.50ch01 chromosome.
The analysis of RNA-seq reads revealed 2,951 transcripts that partially overlap with already annotated loci (class O), corresponding to 1,548 tomato genes distributed in all tomato chromosomes. On average, we found 227 class O transcripts per chromosome (Fig 1). Moreover, the analysis indicated the presence of 356 new transcripts that are antisense to already annotated loci (Class X). These transcripts, which can be also defined Natural Antisense Transcripts (NATs), refer to 278 genes, with an average of 27.38 transcripts per chromosome (Fig 1). The classification of NAT pairs based on their overlapping pattern is presented in Table 2.
Lastly, the TCONS class includes 4439 new tomato transcripts, for an average of 341.46 transcripts per chromosome (Fig 1). The majority of the new transcripts (3,514) were non-coding. Overall, with the exception of the unplaced scaffolds (SL2.50ch00), we did not observe a chromosome specific enrichment for all classes of new transcripts, suggesting that our data provides a good coverage of the whole tomato genome.

Long non-coding RNA (lncRNA)
Following the identification of new transcripts, we developed a new annotation by merging novel and known isoforms, and maximizing overall assembly quality. Subsequently, we performed the identification of lncRNA. These are non-protein coding transcripts longer than 200 nucleotides that lack or have small Open Reading Frames (ORFs). We found 10,774 lncRNAs, in which we further distinguished putative precursors of miRNA (88 in number), rRNA (176) and tRNA (12). The 10,774 lncRNAs included also new transcripts, belonging to the above mentioned four classes. Specifically, while 3,600 transcripts were already present in the ITAG2.40, 3,514 lncRNAs were new (i.e. TCONS), 323 belongs to the class X, 879 class to the class O and 2,458 to the class J.
In addition, we identified a total of 3,800 small ORFs (sORFs) containing RNAs. These are potentially translatable sequences that consist of a string of in-frame sense codons shorter than Table 1. RNA-seq main metrics. The summary statistics, calculated against the current version of tomato reference genome, report the number of reads before and after the quality control, the number of high quality mapped reads and the percentage of mapped reads.

Sample Name
Reads before data quality control

Reads after data quality control
Mapped reads (quality >30)  Class J (splicing variants). The assembled transcripts mainly included variants of already described transcripts, due to alternative splicing in fruits. A functional enrichment analysis was carried out considering genes, in order to remove redundancy. The identification of functional classes that statistically differ between the annotated tomato genome (version SL2.50, annotation ITAG2.40) and genes that possess previously undescribed splicing variants (10,788) revealed a total of 120 GO enriched terms, under the cellular component (11), molecular function (98) and biological process (11) domains. The most represented terms were "organelle membrane" (C), "binding" (F) and "cellular response to stimulus" (P) with 37, 3,331 and 72 annotated genes, respectively (Table A in S1 File). The analysis indicated that alternative splicing in fruits is a phenomenon that involve several functions. Among them, important categories of enriched biological processes are potentially related to fruit metabolism (Table A in S1 File).

Percentage of mapped reads (quality
Class O (transcripts overlapping annotated loci). The assessment of differences in GO term abundance between 1,548 tomato genes of the Class O transcripts and the tomato reference genome yielded a list of 17 GO terms under the "molecular function" GO domain ( Table B in S1 File).

TCONS (novel transcripts).
The new 4,439 TCONS sequences were firstly annotated by comparing their putative translation product against the Arabidopsis proteome (TAIR10). In total, 95 TCONS found a match with 100 experimentally validated Arabidopsis proteins. The majority of TCONS sequences were classified as lncRNA (3,514 transcripts, of which 21 were similar to Arabidopsis transcripts) and sORFs (12). The remaining 839 transcripts did not yield a functional annotation when automatically compared to Arabidopsis. The functional annotation of these sequences was then performed by transferring existing functional annotation from similar sequences identified by the Blastp algorithm. The analysis indicated that 599 (71.7%) TCONS had significant returns (i.e.: an average protein similarity higher than 50% using as threshold an e-value < 10e-6) and 6 (0.7%) had an average similarity with the retrieved proteins higher than 49% (S1 Fig). Among them, 462 could be categorized in at least one of the three principal GO domains, while 143 were did not retrieve GO information. For the 462 TCONS similar to other proteins, the ontological domain cellular component (C) was less represented, with 155 GO terms, the biological process (P) was the richest domain (with 785 terms), while 293 GO terms were present in the molecular function category (Table C in S1 File). The remaining 231 (27.6%) TCONS did not retrieve a significant similarity in the non-redundant protein NCBI database. Considering the level-2 distribution of the sequences in the molecular function domain, the GO term "binding" was the most abundant. "Metabolic process" and "cellular process" were the most representative GO terms analysing the biological process domain and the cellular component ontology domain is represented by "organelle" and "cell" terms (Fig 2 and Table D in S1 File). The presence of novel transcripts that are similar to known proteins implies a high value of the novel assembled transcripts.
Class X (NATs). New Natural Antisense Transcripts of already annotated tomato loci (356; class X) were aligned to the tomato reference database for cDNA sequences (ITAG 2.40) to find complementary target RNAs. Almost all NATs (351) gave a match, for a total of 759 tomato cDNAs relative to 537 tomato genes (Table E in S1 File). A functional enrichment analysis was performed on these target genes in order to verify which are the most represented ontology domains and GO terms. The analysis revealed 43 GO terms belonging to the molecular function domain, while six were included in the biological process domain ("aromatic amino acid family biosynthetic process", "aromatic amino acid family metabolic process", "protein folding", "cell wall macromolecule catabolic process", "cell wall macromolecule metabolic process" and "cell wall organization or biogenesis") (Table F in S1 File). We found a representative subset of the over-represented Molecular Function GO terms using semantic similarity measures. A cluster related to the directed movement of substances (organic, inorganic and ions) into, out of or within a cell, or between cells was evident in terms of semantic similarities as well as statistical significance (Fig 3). Overall, the functional analysis of the NAT pairs we identified is consistent with a significant enrichment of functions related to fruit biology.

Variant calling
RNA-seq data were employed to discover single nucleotide variants in comparison with the tomato reference genome. The number of variants and their classification is summarized in Table 3. SNPs accounted for around 90% of total sequence variation. Around 40% of the variants were common between the two varieties. The cultivar Docet had a higher number of polymorphic sites than the inbred line Kiros. Moreover, the vast majority of variants in the Docet were in heterozygosis. As expected, the majority of variants were classified as having a limited impact, that is they were related usually to non-coding regions or affecting non-coding genes. The type of effects of the variants between the two varieties did not display notable differences in percentage. The ratio between synonymous and non-synonymous variants was higher for the heterozygous variants of the Docet cultivar (1.14) and lower for the common polymorphism (0.79).
The Kiros variety showed 1269 genes with at least one variant, while Docet had the vast majority of polymorphic genes (2404) in heterozygosis. Moreover, 3350 genes with a least one single nucleotide variation were present in both genotypes under investigation. After merging the lists of genes affected by nucleotide variation (and removing duplicates) a total of 6859 unique tomato genes had at least one variant, with an average of 527.62 polymorphic transcribed sequences per chromosome. The chromosome 4 showed the highest number of polymorphic genes (S2 Fig). Fig 4 reports the distribution and density of polymorphic sites in the two varieties. Differences in variants distribution between the two varieties were evident particularly for chromosome 11 and some regions of chromosome 3 and 12. For each gene, the variants were classified in six categories according to their positions (Fig 5A), namely downstream (if within the 1 kb region downstream from the stop codon), exon, gene, upstream (within the 1 kb region upstream from the start codon), 3' UTR and 5' UTR. Polymorphisms were also grouped into nine categories according to their effect (Fig 5B), that is frame shift, non synonymous, non synonymous start, start gained, start lost, stop gained, stop lost, synonymous coding and synonymous stop.  Table E in S1 File after the redundancy reduction) in a two dimensional space derived by applying multidimensional scaling to a matrix of the GO terms' semantic similarities (more semantically similar GO terms are closer). Bubble color indicates the p-value according to the legend in upper left-hand corner. Bubble size is scaled according to the frequency of the GO terms.
doi:10.1371/journal.pone.0171504.g003 Table 3. Distribution and classification of sequence polymorphisms. Unique (i.e. present only in one variety) and common (i.e. present in both varieties) polymorphisms. For the hybrid cultivar Docet, polymorphisms were distinguished in heterozygous (H) and homozygous (O) variants.  Complexity of coding and non-coding RNAs in Tomato

Identification of novel transcripts
RNA-seq has provided the opportunity of generating a global view of the transcriptome of plant organs or developmental stages of scientific and economic importance. This is necessary to produce a genome-scale map that comprises both transcript structure, complexity and expression level in specific conditions [15,42]. Accurate genome annotation, for instance, is important for a reliable gene expression quantification [43], an essential element to understand how plants grow, differentiate and respond to the environment.
In this study, we present an improved tomato transcriptome extracted from the analysis of tomato fruits using strand specific RNA-seq libraries. To limit varietal and environmental bias, we analysed two varieties grown in two experimental fields, which mainly differ in the soil type.
Through direct mapping of more than one billion of reads to the tomato reference genome, we identified 39,095 previously undescribed transcripts. In Sorghum bicolor, RNA-seq based gene annotations identified 34,276 new transcribed genomic regions [44]. The apple reference genome was improved by the description of 71,178 genes or transcripts, which includes 17,524 novel transcripts [23]. Moreover, in the present study the chromosomal locality of the 39,095 new transcripts was determined, denoting that they were of tomato origin.
The new transcripts were also identified from the reference-guided assembly of reads that could not be mapped initially, and were divided in four classes. Class J (new splicing variants of already annotated loci) was, as expected, the group with the higher number of transcripts. This group refers to 10,788 unique tomato annotated genes. In Arabidopsis, approximately 42% of intron-containing genes are alternatively spliced [45] and in rice, about 48% [25]. Alternative splicing variants are recognised as one of the most significant contributors to transcriptome plasticity and proteome diversity [46]. The abundance of new transcripts in the class J and their GO-annotation should reflect the fact that during ripening, fruits undergo significant biochemical, physiological and morphological changes. Considering that incomplete annotations can bias gene expression estimates [15,42,43], the high number of detected isoforms suggests that organ specific annotations are not only useful but also needed in tomato.
Compared with the available version of S. lycopersicum genome (SL2.50), we found 4439 transcripts that are new and completely absent from the official annotation (class TCONS). Around one hundred had a match with Arabidopsis proteins, while the majority (3,514) were classified as lncRNAs. Even if the majority of lncRNAs have been identified in animals through high-throughput transcriptome sequencing, other targeted approaches may increase this number. Although lncRNAs have been gradually recognized as crucial gene regulators also in plants, unfortunately, only few functionally characterized examples are available [47]. Twenty-one lncRNAs had a significant similarity with Arabidopsis. In this model species, the examination of different organs and conditions indicated that more than 6,000 RNAs were classified as long transcripts from intergenic regions, but analysis of fruits is not available. The limited similarities with Arabidopsis is reasonable considering the generally low conservation of lncRNA sequences among species [48]. While 12 TCONS were classified as sORFs, of the remaining 839 sequences, 72.4% putatively codes for proteins having high similarity (>49%) with other plant proteins. The majority of these putative translational products could be functionally annotated. We retrieved a variety of GO terms that covered a wide range of molecular functions and biological processes, and we did not observe a significant enrichment. The translational products that could not be functional annotated (143) were associated to uncharacterized proteins annotated as 'unknown' or 'hypothetical'.
Natural Antisense Transcripts represented the less abundant class of the newly identified transcripts. It has been estimated that in plants, less than 10% of all transcripts overlap as cis-NAT, a percentage that is expected to be lower than in mammals. In the present study, by applying a strand-specific RNA-seq approach, we identified both cis and trans-acting NATs in fruits. Almost all NATs (98.5%) were complementary to annotated tomato loci. Since antisense transcripts are frequently functional and play various biological roles using different transcriptional and post-transcriptional gene regulatory mechanisms in plants [49] [50][51], we performed an enrichment analysis of their putative target loci. The annotation of the NAT pairs was consistent with a significant enrichment of functions related to fruit biology, such as "transporter activity" of different types of molecules. This may reflect the fact that a number of organic and inorganic molecules are transported through the plant to sink organs during fruit growth, a process that should be inhibited following fruit ripening. To our knowledge, our study is the first to reveal the occurrence and complexity of cis-and trans-NATs in tomato, also suggesting that regulation through NATs should play a role in fruits. We believe that these data will be a valuable resource for gaining insight into the complex function of the tomato transcriptome during fruit development.

Nucleotide diversity
We also performed a comparative study of the nucleotide diversity. The RNA polarity information allowed to call variants and compare the Docet, a hybrid cultivar from a breeding company, and the Kiros, which is an inbred variety of local origin. We identified more than 17.000 single nucleotide variants compared to the reference genome, which affected around 7.000 annotated genes. Considering the intrinsic features of RNA-seq [25], the data suggested an interesting level of polymorphisms for cultivated tomato, a species which suffered severe genetic bottleneck [52]. In alfalfa, RNA-seq detected 10,826 SNPs between the two genotypes under investigation [26] while in rice, higher numbers of polymorphism were detected [25]. The Docet variety had a higher number of SNPs and polymorphic genes, and interestingly, many were heterozygous, suggesting that they are directly or indirectly related to breeding. On the other hand, the vast majority of the detected variants is not expected to have a high impact on gene function. Although limited to a reduced number of genes, the characterization of the functional consequences of variants that are predicted to have a high effect may be useful to get new insights on specific genes or functions. Taking into account the number of polymorphic site, their level of heterozygosity and the missense/nonsense ratio, this study implies that a good proportion of the polymorphisms present in the Docet cultivar should be due to breeding efforts. The data also indicated that differences in variant distribution between the two varieties were evident particularly for chromosomes 11 and 3, which also contain introgressed regions and R genes. The possibility that modern tomato breeding has increased polymorphisms, also because of the introduction of new alleles from wild species, has been already discussed and it should be tested by analyzing a larger panel of varieties [52][53][54][55].

Conclusions
Our study provided information that may facilitate genomic analyses, allowing a more detailed description about gene function in tomato fruits. Strand specific data enabled us to identify multiple forms of non-coding RNAs that are polyadenylated, providing new insights into antisense transcripts and their potential role in gene regulation in fruits. They also allowed a more robust identification of long non-coding RNAs. Moreover, our data are useful for accurately quantifying overlapping transcripts and alternative splicing variants, although strand-specific data add time and complexity to the computational analysis and may be not always required. Finally, the annotation of the new transcripts, especially those not included in the reference genome, represents a contribution towards the complete tomato transcriptome.
The potential demonstrated by our study for the applicability of strand specific RNA-seq in gene prediction also implies that libraries covering different organs, tissues, developmental stages and a range of stress conditions are necessary to get annotations that are more comprehensive and to identify annotation-specific genes that may be important for functional studies. Given the increasing availability of genomics tools and the affordability of the NGS technologies, we hope that the here presented analysis of the tomato fruit transcriptome will be a useful resource for gene expression profile in this important organ.