Comparative transcriptome analysis of the different tissues between the cultivated and wild tomato

Although domesticated tomato is cultivated by wild tomato, there are a lot of differences between cultivated tomato and wild tomato, such as shape, physiological function and life history. Many studies show that wild tomato has better salt resistance and drought resistance. In addition to, domesticated tomato’s fruit is bigger and has more nutritious than wild tomato. The different features are closely related to differentially expressed genes. We identified 126 up-regulated differentially expressed genes and 87 down-regulated differentially expressed genes in cultivated tomato and wild tomato by RNA-Seq. These differentially expressed genes may be associated with salt resistance, drought resistance and fruit nutrition. These differentially expressed genes also further highlight the large-scale reconstruction between wild and cultivated species. In this paper, we mainly study GO enrichment analysis and pathway analysis of the differentially expressed genes. After GO and pathway enrichment analysis, a set of significantly enriched GO annotations and pathways were identified for the differentially expressed genes. What’s more, we also identified long non-coding RNAs and mRNAs in the two species and analyzed its essential features. In addition to, we construct a co-expression network of long non-coding RNAs and mRNAs, and annotate mRNAs associated with long non-coding RNAs as target genes, and speculate the regulation function of long non-coding RNAs. In total, our results reveal the effects of artificial and natural selection on tomato’s transcript, providing scientific basis for tomato’s research in the future.


Introduction
Tomato [1,2] is one of the most scientifically investigated vegetables because of its high commercial importance [3]. In addition, the tomato is highly perishable, and post-harvesting losses reach 25 to 50%. In tropical countries, there is a loss of 20-50% from harvesting to consumption [4][5][6]. What is more, the water content of tomato fruit is very high, and the water content is as high as 93-95% [7]. In addition, it is low in calories and rich in vitamins A, C and E and minerals, such as calcium, potassium, phosphorus. Tomato is the first in terms of contribution a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 in the diet in a rank of 10 kinds of vitamins and minerals [8]. Brazil is the largest tomato producer in South America, followed by Chile and Argentina. Northeastern region (Pernambuco and Bahia states) accounted for 46% of the production [9]. In recent years, the consumer's demand for tomato products has increased, and it is increasing rapidly in domestic and international markets with major portion of it being used for preparation of convenience food. Because tomato is rich in antioxidants, so it can be considered important source of carotenoids (lycopene), ascorbic acid and phenolic compounds. In addition, the heat increases the bioavailability of lycopene, which is better absorbed by the body when the tomato is cooked, thus, ideal for the consumption of tomato sauces and soups. The industrialization process of the tomato shows that the preparation of sauces, ketchup and others does not destroy lycopene.
In some cases, the genetic basis of phenotype associated and domestication have been examined, most notably in maize, rice, and tomato [9][10][11]. It is clear that the difference of domesticated and wild phenotype is from a small number of genetic loci. Some studies show that the transcriptional level is changing all the time during the domestic process. For example, the recent studies have indicated that the transcriptional network of maize has changed a lot during the domestic process. Although changes in gene expression or network topology are important, there is a lack of expression between domesticated and wild tomato during the domestic process.
In recent years, researchers study tomato from various aspects. Tomato is a kind of important domesticated specie. Meanwhile, Tomato is a member of a complex of 13 interfertile species that occupy a wide range of habitats in South America. In the growth process of tomato, there are 998 predicted transcription factors which belong to 62 families and account for about 2.87% of 34727 genes in tomato genome. In 2012, the tomato has been sequenced successfully. The decoding genome of tomato identifies about 34727 genes, while 97.4% genes have located on the chromosomes accurately [12]. The difference of salt tolerance is compared in wild tomato and cultivated tomato. Tal and Smith's study [13] shows that there is positive correlation between the growth of callus and the whole plant. So, plant and callus are used to study the salt resistance under the condition of salt stress in some sweet soil crops, such as tomato. Dry weight of wild plant was higher than cultivated plant and the water content of wild tomato is also higher than cultivated tomato. There is difference between wild tomato and cultivated tomato. Compared with wheat, corn, potato, sugar and sunflower, tomato is a moderate salt sensitive plant. Salt resistance is regulated by the specific stage, and every stage is regulated by multiple genes and environment.
Tomato originated in the tropics is a typical thermophilic crop. Temperature is an important factor in process of tomato's growth and plays an important role in northern China. Low temperature is one of interferential factors for tomato's production [14]. Jiang et al [15] mainly study wild and cultivated tomato's tolerance of low temperature and illustrate low temperature response mechanism in different varieties. The phenotypic diversity of wild tomato is a good artificial and natural selection system for domesticated tomato. However, the transcriptional information of tomato is still unclear. Our study aims at comparing wild tomato and cultivated tomato based on their transcriptional information. In consequence, wild tomato has better salt resistance and drought resistance, and cultivated tomato's fruit is bigger, and has rich nutrition [16,17].
Our research data is mainly from Koenig D's experimental data [18]. In Koenig D's study, he mainly focused on several aspects: the large-scale alteration of the light response co-expression network between wild and cultivated accessions; characterization of sequence diversity in wild and cultivated tomato; evidence for positive selection in wild and cultivated tomato; divergence in gene expression in wild and cultivated tomato; analysis of selective pressures on gene expression, evolution of the tissue-specific expression in Solanum pennellii and Solanum lycopersicum, effect of introgression on the transcriptome of domesticated tomato; expression divergence correlates with phenotypic differences among wild and cultivated accessions.
Based on Koenig D's research, we carried out the much more extensive and intensive research for tomato. We analyzed domesticated and wild tomato's transcriptome to identify the diversity of gene expression sequences based on natural and artificial selection. Our data contains cultivated tomato (S. lycopersicum) and wild tomato (S. pennellii). The reason of selecting wild tomato is that it is widely used for genetic donation in the process of improving cultivated tomato. Our analysis provides sufficient evidence for tomato's evolution and presents significantly differences between artificial and natural selection.

Data resource
Transcribed sequences from cultivated and wild tomato were mainly collected from the Sequence Read Archive (SRA) database [19], as follow (Table 1). Data available in the Sequence Read Archive from tomato included 14 samples from 7 distinct tissues (root, stem, leaf, flower, fruit, seedling and vegetative) encompassing a total of 62676 transcripts.

Data analysis
For the RNA-seq data, the quality of data was evaluated by the FastQC software [20], and we retained reads that contained more than 95% bases and the bases' quality score is 20.
This program, Tophat-Cufflinks [21], could process a large number of read fragments based on RNA-Seq [22][23]. Transcripts selected could be processed as follows: (1) Aligning RNA-Seq reads to the reference genome. It was a core step in the analysis workflows, and we used Tophat [24] to align RNA-Seq reads to the genome. (2) Assembling transcripts. We used Cufflinks [25] packages to assemble transcripts. Frist, cufflinks assembled transcripts. Then, cuffmerge merged two or more transcript assemblies. Third, cuffdiff identified differentially expressed genes, transcripts and detected differential splicing and promoter. Besides, the analysis of differentially expressed genes was also conducted by FPKM.

Identification of lncRNAs and DGEs
Differentially expressed genes (DEGs) were selected according to the threshold: |log2fold change| ! 1.00. False discovery rate (FDR) was used to correct the P values and genes with FDR < 0.05 were considered as significantly DGEs. We used in-house Python script to select applicable genes as DGEs. The false discovery rate (FDR) is one way of conceptualizing the rate of type 1 errors in null hypothesis testing when conducting multiple comparisons. FDR- controlling procedures are designed to control the expected proportion of "discoveries" that are false. The problem of multiple testing concerns a group of related null hypotheses h 1 , . . .h n that are tested simultaneously. In its simplest form, each test yields a summary statistic z i , and the goal is to decide which of the z i is signals (hi = 1) and which are null (hi = 0). Solutions to this problem, such as Bonferroni correction, aim to control the family-wise error rate (FWER). An alternative, which has become the dominant approach in many domains of application, is to control the false discovery rate (FDR). Regardless of which error rate they aim to control, however, most existing approaches obey a monotonicity property: if test statistic z i is declared significant, and z j is more extreme than z i , then z j is also declared significant. Extensive simulation evidence shows that, by relaxing the monotonicity property in a data dependent way, FDR regression can improve power while still controlling the global false discovery rate.
The selection of long non-coding RNA was more complex, which was divided into three steps: Size selection. Normally, we defined putative long noncoding RNAs as transcripts that are length ! 200bp, and have no or weak protein coding ability [26,27]. Therefore, we used in house Python scripts to first exclude transcripts smaller than 200bp.
Open reading frame filter. Frith et al [28] show that more than 95% of protein-coding genes have ORFs of more than 100 amino acids. Since, we need select transcripts which contained the length of ORF was less than 100bp. To select the transcripts, which were more likely to encode proteins, a Python script was developed to ensure that transcripts that encoded ORFs of 100 or less amino acids were considered as long non-coding RNA candidates.
CPC prediction. The Coding Potential Calculator (CPC) [29], which is based on the detection of quality, completeness, and sequence similarity of the ORF to proteins in current protein databases, was utilized to detect putative protein encoding transcripts with default parameters. Only transcripts that did not pass the protein-coding-score test were classified as long noncoding RNAs.

GO and pathway analysis
GO enrichment analysis [30] was completed by agriGO [31], which was used to compare the biological functions of differentially expressed genes. GO function enrichment was analyzed by Hypergeometric exact test. FDR < 0.05 set as the threshold.
Pathway was analyzed by KOBAS [32][33][34][35], which divided into two steps: first, annotating the pathway of differentially expressed genes. E-value of BLAST was 1e-8; second, identifying the significant enrichment pathways. Our differentially expressed genes were clustered by their functions, and identified the pathway processes which affected lots of genes.

RNA sequencing and assembly of transcriptome
We obtained a total of 388666 transcripts. There were 16% transcripts in 300nt-600nt and 79% transcripts in 300nt, so we can know that the transcript length mainly concentrated on 300nt-600nt. What's more, 42% transcripts had less than five exons, and 45% transcripts have 6-10 exons (Fig 1).

Analysis of gene expression
We classified and counted the expressed genes of cultivated tomato and wild tomato. In addition, we found that there were 18719 expressed genes in cultivated tomato and there were 18609 expressed genes in wild tomato. However, there were 17202 genes in the co-expression of cultivated and wild tomato, which accounted for 91% in cultivated tomato and 92% in wild tomato, respectively. Therefore, it showed that most of genes in cultivated tomato and wild tomato were co-expressed. And genetic similarity is very high (Fig 2).
For these genes, we selected differentially expressed genes (Fig 3). We chose p − value and log2fold change as the threshold value. Green was differentially expressed genes and red was normal genes. The differentially expressed genes were mainly distributed in the area of |log2fold change| > 1.  Differentially expressed genes in cultivated tomato and wild tomato Through the above methods, we selected some differentially expressed genes. There were 126 up-regulated differentially expressed genes and 87 down-regulated differentially expressed genes, such as ( Table 2) and (Table 3). (Table 2) showed a part of up regulated differentially expressed genes, and (Table 3) showed a part of down regulated differentially expressed genes.

Differentially expressed genes in different tissues
The distribution of differentially expressed genes was different in every tissue. Our data are from seven tissues, such as floral, fruit, vegetative, root, stem, leaf and seeding. Nevertheless, differentially expressed genes only distributed in four tissues, which were floral, fruit, vegetative and root. There were 76 differentially expressed genes in the four groups together; there were 23 differentially expressed genes in vegetative, root and fruit; there were 618 differentially expressed genes in root, vegetative and floral; there were 30 differentially expressed genes in  fruit, root and floral; there were 33 differentially expressed genes in vegetative, floral and fruit. There were 658 differentially expressed genes in vegetative and floral; there were 34 differentially expressed genes in floral and fruit; there were 63 differentially expressed genes in root and fruit; there were 465 differentially expressed genes in vegetative and root. In addition to, there were 875 differentially expressed genes in floral; there were 924 differentially expressed genes in vegetative; there were 2046 differentially expressed genes in root and there were 189 differentially expressed genes in fruit. From (Fig 4), we can clearly see the differentially expressed genes in different tissues. We also analyzed the expression level of differentially expressed genes, as follow in ( Fig  5). From (Fig 5), we can know the expression level of floral was highest in the four tissues. The second one was root, and the expression level of vegetative and fruit was lower than  others. Meanwhile, we can clearly see the expression levels of differentially expressed genes in different tissues. The clustering of differentially expressed genes in different tissues also reflects the difference between domesticated and wild tomato. The difference in clustering may be closely related to drought resistance, salt tolerance and fruit nutrient composition of tomato.

Function analysis and annotation of differentially expressed genes
GO analysis showed that the gene annotations were mainly in three aspects: biological processes (BP), molecular function (MF), and cell fractions (CC). In the biological process, the differentially expressed genes were mainly enriched in response to biotic stimulus, lipid transport, response to wounding, response to extremal stimulus, response to stimulus and so on, as shown in (Fig 6) in detail. In the cell components, the differentially expressed genes were mainly enriched in light-harvesting complex, cytoplasmic vesicle part, vesicle, plasma membrane light-harvesting complex, plasma membrane-derived chromophore, cytoplasmic vesicle and so on, as shown in (Fig 7) in detail. In the molecular function, the differentially expressed genes were mainly enriched in oxidoreductase activity, carboxylesterase activity, peptidase inhibitor activity, lipase activity, water transmembrane transporter activity, endopeptidase inhibitor activity, enzyme inhibitor activity, water channel activity, catalytic activity, nutrient reservoir activity, serine-type endopeptidase inhibitor activity, oxidoreductase activity and acting on the CH-OH group of donors, as shown in (Fig 8) in detail. These data showed that the three class of differentially expressed genes exhibited different GO functions, implying difference between the two different samples.

Pathway analysis of differentially expressed genes
Pathway analysis showed the differences of gene expression from another perspective. This analysis was similar to GO enrichment analysis. KEGG pathway analysis which was represented in the (Fig 9) illustrated the disparities of the three DEG categories from another perspective. The terms enriched in DEGs were Carbon metabolism, protein processing in endoplasmic reticulum, starch and sucrose metabolism, alpha-linolenic add metabolism, biosynthesis of amino acids, glycolysis / gluconeogenesis, valine, leucine and isoleucine degradation, glutathione metabolism and so on. The difference of domesticated tomato and wild tomato mainly was showed in the above biological pathway. These pathways may be important factors that affected the heat resistance, salt tolerance and so on.

Functional annotation of differentially expressed genes in different tissues
In different tissues, the distribution of differentially expressed genes was not identical. DGEs mainly distributed in floral, fruit, vegetative and root. The others have few differentially expressed genes. Therefore, the significant GO enriched analysis of differentially expressed genes showed as follow.
Differentially expressed genes belonging to fruit group. GO enrichment terms in the fruit group related to BP ontologies. In the biology process, DEGs were mainly enriched in small molecule metabolic process (22%), organic acid metabolic process (15%), carboxylic acid metabolic process (14%), oxoacid metabolic process (14%) and cellular ketone metabolic process (13%), as shown in (Fig 10B).
Differentially expressed genes belonging to vegetative group. GO enrichment terms in the vegetative group related to MF ontologies. In the molecular function, DEGs were mainly enriched in catalytic activity (84%) and oxidoreductase activity (16%), as shown in (Fig 10C).
Differential expressed genes belonging to root group. GO enrichment terms in the root group related to MF ontologies. In the molecular function, DEGs were mainly enriched in catalytic activity (49%), protein binding (33%) and oxidoreductase activity (10%) as shown ( Fig  10D) in detail.

The analysis of differentially expressed long non-coding RNAs
We predicted 554 differentially expressed long non-coding RNAs and 426 coding RNAs, as shown in the following (Table 4).
From (Table 4), we can see the number of long noncoding RNAs' exons was fewer than coding RNAs. The number of long noncoding RNAs' exons was always 1 or 2. However, the number of mRNAs' exons was !2, and it was 5 to 6 currently. Therefore, the average number of exons in long noncoding RNAs was only 1/3 of the average number of exons in the mRNAs.   Comparative transcriptome analysis between the cultivated and wild tomato  In addition, the percentage of long non-coding RNAs which contained one exon was 50%. And, the percentage of mRNAs which contained single exon was very small. In addition, the length of long non-coding RNAs was much shorter than that of mRNAs. The length of long non-coding RNAs was almost 2000bp. The most of was shorter than 2000bp and a few was longer than 2000bp. The length of mRNAs was !3000bp, and a few was <3000bp. So the length of long non-coding RNAs was much shorter than that of mRNAs. Comparing with mRNAs, the number of long noncoding RNAs' exons was fewer and the length of long noncoding RNAs was much shorter. We can also see that long noncoding RNAs were mainly distributed on chromosome 1, 6, 9 and 10, while mRNAs were mainly distributed on chromosome 1, 6, 8 and 9 (Fig 11). The distribution of length was different between long non-coding RNAs and mRNAs (Fig 12). The majority of long non-coding RNAs' length (66%) was shorter than 3000bp. The remaining 18% of long noncoding RNAs' length was shorter than 5000bp. The others were longer than 5000bp. While only 4% mRNAs' length was shorter than 3000bp. The remaining 43% of mRNAs' length was longer than 10000bp. The majority of mRNAs' length (49%) was longer 15000bp (Fig 13). In addition to, we also made a comparison of lncRNAs, mRNAs and DGEs in transcripts length (Fig 13). In the range of 0-1000bp, long non-coding RNAs' percentage was 30%, mRNAs' percentage was 11% and differentially expressed genes' percentage was 5%. In the range of 1000-3000bp, long non-coding RNAs' percentage was 25%, mRNAs' percentage was 10% and differentially expressed genes' percentage was 4%. In the range of 9000-10000bp, long non-coding RNAs' percentage was 12%, mRNAs' percentage was 5%, differentially expressed genes' percentage was 10%. Overall, as the growth of the transcript's length, the percentage of long non-coding RNAs is smaller and smaller. Comparative transcriptome analysis between the cultivated and wild tomato In addition, we constructed a co-expression network of lncRNA and mRNA to predict the concrete functions of lncRNAs (Fig 14). In (Fig 14), red represents mRNA, and orange represents lncRNA. Based on the co-expression network of lncRNA and mRNA, we predicted the lncRNA target gene (Table 5). From (Table 5), we can know that GO annotation of these target genes. The function of the target gene is mainly distributed in response to biotic stimulus, peptidase inhibitor activity and so on. LncRNA plays an important role in regulating the expression of the genes.

Discussion
In this study, the transcriptome of cultivated tomato was sequenced by Illumina high throughput sequencing and compared with the transcriptome of wild tomato, including differentially expressed genes and long non-coding RNAs. Our data reflects the differences between cultivated tomato and wild tomato and offers tomato's basic transcriptional information in future tomato research. As DEGs results showed, there are 87 down-regulated genes and 126 up-regulated genes in tomato. It could be deduced that the expression level of DEGs in cultivated tomato and wild tomato is quite high. DEGs of up regulated or down regulated in both cultivated tomato and wild tomato were verified. We performed functional analysis and distribution analysis of differentially expressed genes. Through the enriched analysis of GO and Pathway, we mainly analyzed the functions and biological pathways of these differentially expressed genes, including the bioactivity response, lipid transport, light absorption complexity, cytoplasmic vesicle, plasma membrane light-gathering complexity, carbon metabolism, protein processing, starch and sucrose metabolism, amino acid biosynthesis. DGEs mainly distributed in floral, fruit, vegetative and root and the others have few differentially expressed genes. Differentially expressed genes are distributed in different tissues, and GO enrichment is also different in every tissue. These features reflect the phenotypic difference between domesticated and wild tomato. There is difference in different tissues for differentially expressed genes, such as distribution, function and so on. In the root and vegetative tissues, differentially expressed genes are most of all. In the root group, differentially expressed genes were mainly enriched in catalytic activity (49%), protein binding (33%) and oxidoreductase activity (10%). In the vegetative group, differentially expressed genes were mainly enriched in catalytic activity (84%) and oxidoreductase activity (16%). This finding plays an important role in improving the resistance of domesticated tomatoes. In consequence, wild tomato has better salt resistance and drought resistance, and cultivated tomato's fruit is bigger and has more nutrition [36]. These features of domesticated tomato and wild tomato may be related to differentially expressed genes. The results above provide a scientific basis for tomato's cultivation and study in future.
In addition, long non-coding RNAs are critical in regulating post-transcriptional regulation of RNA silencing and gene expression. We analyzed basic characteristic of long non-coding RNA, and made a series of comparison with mRNA. These long noncoding RNAs also tend to be shorter and to have fewer introns than protein-coding transcripts. The length of long noncoding RNA is shorter than coding RNA and the number of long noncoding RNA's exon is less than coding RNA. Meanwhile, we also drawn a co-expression network between long noncoding RNA and mRNA. There is a correlation between long non-coding RNA and mRNA. Long non-coding RNA regulated mRNA's expression. Long non-coding RNA may also be a key factor in the phenotypic variation of domestic and wild tomatoes. As discussed above, cultivated tomato could be different from wild tomato based on DEGs and differentially expressed long noncoding RNAs. Disparities which are also found between cultivated tomato and wild tomato are used in many studies. To sum up, the study has compared the transcriptome of cultivated tomato and wild tomato based on data generated from RNA-sequencing. Analysis of DEGs and differentially expressed long noncoding RNAs has revealed important features of cultivated tomato and wild tomato. To better utilize tomato's transcriptional information and to verify the functional mechanisms of DEGs and long noncoding RNAs, it is important to further research.