Transcriptome Characterization of Cymbidium sinense 'Dharma' Using 454 Pyrosequencing and Its Application in the Identification of Genes Associated with Leaf Color Variation

The highly variable leaf color of Cymbidium sinense significantly improves its horticultural and economic value, and makes it highly desirable in the flower markets in China and Southeast Asia. However, little is understood about the molecular mechanism underlying leaf-color variations. In this study, we found the content of photosynthetic pigments, especially chlorophyll degradation metabolite in the leaf-color mutants is distinguished significantly from that in the wild type of Cymbidium sinense 'Dharma'. To further determine the candidate genes controlling leaf-color variations, we first sequenced the global transcriptome using 454 pyrosequencing. More than 0.7 million expressed sequence tags (ESTs) with an average read length of 445.9 bp were generated and assembled into 103,295 isotigs representing 68,460 genes. Of these isotigs, 43,433 were significantly aligned to known proteins in the public database, of which 29,299 could be categorized into 42 functional groups in the gene ontology system, 10,079 classified into 23 functional classifications in the clusters of orthologous groups system, and 23,092 assigned to 139 clusters of specific metabolic pathways in the Kyoto Encyclopedia of Genes and Genomes. Among these annotations, 95 isotigs were designated as involved in chlorophyll metabolism. On this basis, we identified 16 key enzyme-encoding genes in the chlorophyll metabolism pathway, the full length cDNAs and expressions of which were further confirmed. Expression pattern indicated that the key enzyme-encoding genes for chlorophyll degradation were more highly expressed in the leaf color mutants, as was consistent with their lower chlorophyll contents. This study is the first to supply an informative 454 EST dataset for Cymbidium sinense 'Dharma' and to identify original leaf color-associated genes, which provide important resources to facilitate gene discovery for molecular breeding, marketable trait discovery, and investigating various biological process in this species.


Introduction
We further determine the suitability of our de novo assembly and annotation of the expressed genes, and identified 16 key enzyme-encoding genes in the chlorophyll metabolism pathway. Our real-time PCR results validated that the expression levels of two key enzymeencoding genes for chlorophyll degradation were higher in yellow-colored leaves, whereas four of the key enzyme-encoding genes for chlorophyll biosynthesis showed almost the same expression than that of the wild type. This result was consistent with the higher level of chlorophyll degradation metabolite and lower chlorophyll contents measured in the mutant leaves, and further revealed that the leaf color variation is probably owing to over degradation of chlorophyll rather than biosynthesis deficiency.
To our knowledge, this study represents the first characterization of the Cymbidium sinense 'Dharma' transcriptome using massively parallel 454 pyrosequencing. This informative EST dataset provides an important resource for marketable trait discovery and investigating various biological processes in Cymbidium. Additionally, our study on leaf variation will provide valuable data and practical guidance for improving horticultural traits and facilitating molecular breeding in Cymbidium.

Plant materials and growth conditions
Wild-type plants and leaf-color mutants of Cymbidium sinense 'Dharma' used in this study were collected from the cultivation base of floricultural research institute, Guangdong Academy of Agricultural Sciences, China. All of the plants were grown and maintained in pots in a greenhouse at day/night temperatures of 26/23°C under a 16-h light /8-h dark photoperiod.

Biochemical measurements
For Chlorophyll content measurement, six-month old leaves of wild type and the mutant grown under identical conditions were collected and Chl (a,b) were extracted in 90% acetone and determined spectrophotometrically by the method of Lorenzen [26]. For the contents of chlorophyll intermediate products, leaves were homogenized in nine volume of 0.01 M Phosphate Buffered Saline (PBS), and mixed on the ice and centrifuged (30 min at 2,500g). The supernatants was assayed by a ELISA kit (HengYuan Biological Technology Co.,Ltd, Shanghai, China) separately. Data were analyzed by SPSS software.

cDNA preparation and 454 sequencing
The cDNA libraries were prepared from different tissues of Cymbidium sinense 'Dharma' as shown in S1 Fig, including roots, leaves, pseudobulbs and flowers. The leaf samples were collected from six-month old (young leaf, YL) and three-year old (old leaf, OL) leaves of wild-type plants, as well as the six-month old leaves of various leaf-color mutants (Fig 1,Var1 -5). Total RNA was extracted from approximately 0.5 g of each tissue using the Trizol reagent (Life Technologies). Individual mRNAs were purified with an Oligotex mRNA Midi Kit (QIAGEN). First-strand cDNA was synthesized using SuperScript II (Life Technologies) with oligo(dT) 15-VNN as the template-primer, according to the manufacturer's protocol. Double-stranded cDNA was synthesized using a SuperScript double-stranded cDNA synthesis kit (Life Technologies) with the provided primers in a 100 mμl reaction. The double-stranded cDNA was sequenced using a 454 GS-FLX instrument (Roche Applied Science) and the sequencing was performed at Shanghai Majorbio Bio-pharm Biotechnology Co., Ltd. (Shanghai, China).

EST assembly and annotation
Prior to assembly, the resulting 454 raw reads that failed to pass the GS FLX pyrosequencing filter (eg. reads with weak signal or low quality) were discarded to yield a high-quality (HQ) data set (>99.5% accuracy of single-base reads). Primer and adapter sequences were trimmed from the HQ dataset, non-coding RNA (such as rRNA,tRNA, and miRNA) and sequences shorter than 50 bp were also removed. The remaining data were assembled into unique sequences using Trinity (http://trinityrnaseq.sourceforge.net/analysis/extract_proteins_from_ trinity_transcripts.html) with an optimized k-mer length of 25.
Open Reading Frame (ORF) survey were carried out using TransDecoder software to find protein-coding sequences [27], which further were searched against the NCBI non-redundant protein (Nr) database (http://www.ncbi.nlm.nih.gov) the Swiss Protprotein database (http:// www.expasy.ch/sprot), and the Uniprot database (http://www.uniprot.org/) using the BLASTP algorithm with an E-value cut-off of 10 -5 . The remaining sequences were searched against these public databases using the BLASTX algorithm with an E-value cut-off of 10 -5 .
The BLAST2GO V. 2.4.4 was used to obtain the Gene Ontology (GO) classification (http:// wego.genomics.org.cn/cgi-bin/wego/index.pl) of unique transcripts [28]. The assembled sequences were also aligned to the COG database to predict and classify functions [29]. Metabolic pathway assignments was performed based on the pathways of Arabidopsis thaliana and Oryza sativa in the Kyoto Encyclopedia of Genes and Genomes (KEGG) mapping project (http:// www.genome.ad.jp/kegg/kegg2.html) [30]. Enzyme commission (EC) numbers were assigned to unique sequences that had BLASTX scores with an E-value cut-off of 10 -5 as determined by searching the KEGG database.

Sequence Alignment and Phylogenetic Analysis
Full-length amino acid sequences were aligned by the Clustal X 1.83 program. The sequence alignment was further adjusted manually using BioEdit software (http://www.mbio.ncsu.edu/ bioedit/bioedit.html). The amino acid substitution model was calculated by the Model Generator v0.84 and the optimal model of "JTT+G" was selected. Phylogenetic relationships were reconstructed using a maximum-likelihood (ML) method in PHYML software with JTT amino acid substitution model.

Real-time quantitative RT-PCR (qRT-PCR) analysis
For expression levels of the genes related to chlorophyll metabolism pathway, six-month old leaves of wild-type and five leaf color-mutants were harvested for RNA extraction by the use of RNAiso Plus kit (Takara). 1μg of total RNA was quantified using a NanoDrop 2000 spectrophotometer (Thermo Scientific) for reverse-transcription (Vilnius, Lithuania). Relative transcripts levels were determined using the iCycler IQ Real-time PCR Detection System (Bio-Rad, USA) according to the manual QuantiTect SYBR Green PCR kit and analyzed by icycler realtime detection system software (version 5.0). Expression levels were normalized using the expression level of the constitutive actin gene. A relative quantitative method (DDCt) was used to evaluate the quantitative variation. All quantifications were made in duplicate on RNA samples obtained from three independent experiments. The primers used for real-time PCR amplifications are listed in Supplemental S1 Table.

Results and Discussion
The morphology and Phylogenetic analysis of Cymbidium sinense 'Dharma' In the process of cultivation and domestication of orchid species, color mutation in the leaves occurred at very high frequencies,resulting in different-colored mutant. The most representative cultivar, Cymbidium sinense 'Dharma', has generated more than 30 types of leaf variation by far, which makes it an optimal model to study leaf variation in Cymbidium family.
To illustrate the mechanism underlying leaf color variation, we collected ten different-colored mutants of Cymbidium sinense 'Dharma', and examined the associations of the genetic polymorphism with their morphological traits by amplified fragment length polymorphism analysis (AFLP). Consequently, 841 bands from 9 primer combinations were obtained, 644 of which (76.6%) were polymorphic ones. AFLP analysis showed genetic similarity coefficients among them were ranged from 0.6164 to 0.8245, with an average value at 0.7256 (S2 Table).
Using UPGMA cluster analysis, these ten cultivars can be divided into two clusters, which basically agreed with those of morphological taxonomic study and also are related well with the origin of them (Fig 1A and S3 Table). On this basis, five mutants that cluster together were selected for further analysis (Fig 1B).

Decreased content of photosynthetic pigments in the mutant leaves
The mutant showed white-yellow band in the leaf from the seedling stage to the maturity that cannot turn to green with environmental stimulations. Meanwhile, chlorophyll levels in the mutant leaves differed significantly from that in the wild-type leaves, which showed approximately 60% to 17% decrease of the chlorophyll a content, and 73% to 32% decrease of the chlorophyll b content, respectively, compared with the wild-type (Fig 2A). Moreover, detailed analysis showed that the content of chlorophyll synthesis precursors porphobilinogen (PBG), protoporphyrin IX (Proto IX), magnesium protoporphyrin (Mg-Proto), and protochlorophyllide (Pchlide) decreased slightly in most of the mutant leaves. On the contrary, the level of chlorophyll degradation metabolite Pheophorbide acid a (Pheide a) increased apparently, and the changing was consistent with the chlorotic level of the mutant leaves. As shown in Fig 2B, the content of photosynthetic pigments PBG, Proto IX, and Mg-Proto in the mutant leaves decreased from 0-13%, 0-11%, and 0-14%, respectively. Contratrarirly, the level of Pheide a increased from 13%-38% in severe chlorotic leaf of the mutant compared with that of wild type. This result implied that the yellow and green striped leaf spots in Cymbidium 'Dharma' were possibly due to over degradation of Chlorophyll rather than biosynthesis deficiency. Generation and assembly of 454-pyrosequenced expressed sequence tags (ESTs) To illustrate the potential mechanism of leaf color variation and find candidate genes involved in this process, we firstly generated a cDNA library for 454 GS-FLX sequencing from an equal mixture of RNAs isolated from different organs, and obtained an overview of the expressed genes in Cymbidium sinense 'Dharma' (S1 Fig). A total of 705,630 raw reads with an average length of 456.7 bp were generated from one pyrosequencing run. After removing the adaptor reads, repeat sequences, and low-quality reads, 702,015 high-quality reads, totaling 313 Mb with an average length of 445.9 bp, were obtained for further analysis. De novo assembly of these clean reads using the Trinity program generated 103,295 isotigs with 104,336,567 total residues, representing 68,460 genes (Table 1). Isotig length averaged 1,010 bp and ranged from 306 to 16,856 bp. The size distribution of the assembled isotigs is shown in Fig 3. Although most of the isotigs were between 400 and 600 bp in length, a substantial number of large isotigs were obtained. A total of 8,573 isotigs were longer than 2,000 bp, and 29,206 isotigs were longer than 1,000 bp. These data provided comprehensive gene expression information to facilitate the investigation of Cymbidium genetics.

Annotation of the novel transcripts
To validate the effectiveness and suitability of our de novo assembly, all of the assembled sequences were first surveyed by open reading frame (ORF) using TransDecoder software (as described in "Materials and Methods"). This analysis produced 46,210 protein-coding isotigs containing at least an ORF, and the remaining parts were considered as noncoding sequences. Both the putative protein-coding and noncoding transcripts were further searched against the NCBI nonredundant (NR) protein database, as well as the UniProt and SwissProt databases, using the BLASTp and BLASTX algorithms, respectively. A total of 43,433 significant matches were obtained in the NR database (Table 2), and the similarity distribution showed that 31.7% of these alignments had a similarity higher than 80%, 43.7% between 60% and 80%, and 24.6% lower than 60% (Fig 4A). UniProt alignment indicated 43,542 matches and showed a similarity distribution pattern consistent with that observed in the NR database, as shown in Fig 4B. Additionally, nearly 30% (30,629) of the predicted isotigs matched to sequences in the SwissProt database, and the similarity distribution pattern was comparable to that of the other databases, with 26.4% of the sequences having a similarity higher than 80% and over 70% of the hits having a similarity from 32% to 80% ( Fig 4C).
As "non-BLASTable" sequences have been reported in all studied plant transcriptomes, with proportions varying from 13% to 80% depending on species and sequencing depth [31,32] (e.g. primarily 5'and 3'UTR fractions, lineage-specific genes and fast-evolving genes existing in the dataset), it is reasonable that nearly half of the assembled sequences did not receive putative functional identifications in our annotation.

Gene ontology (GO) and clusters of orthologous groups (COG) classification
We used the GO system to classify the functions of the predicted isotigs of Cymbidium 'Dharma'. A total of 29,299 isotigs were categorized into 42 functional groups. "Metabolic process" and "cellular process" were the most highly represented groups in the biological process category, suggesting that our study may allow for the identification of novel genes involved in secondary metabolite synthesis pathways. "Cell " and "cell part" terms were dominant among the cellular component categories. Among the molecular function categories, "catalytic activity" had the most number of isotigs, followed by "binding". However, only a few genes were identified for "nitrogen utilization" (2), "pigmentation" (5), "viral reproduction" (11), "cell killing" (5), "locomotion" (11), and" biological adhesion" (15) in the molecular function category, and a limited number were found for "metallochaperone activity" (4), "translation regulator activity" (5), "translation regulator activity" (5), and "protein tags" (1) in the cellular component category (Fig 5).
To generate a focused view of the GO categories, further functional classification of all isotigs was performed using a set of plant-specific GO slims. "Plastid", "cell", "biosynthetic process ", and "metabolic process" were the most highly represented groups, followed closely by "cellular process", "mitochondrion", "protein modification process", "transport", and "cytoplasmic membrane-bounded". "Stress response", "signal transduction", "cell cycle", and "photosynthesis" were also represented (S4 Table). These trends were consistent with overall biological activity during the plant developmental periods. We further classified the isotigs using the KOG system. Based on sequence homologies, 10,079 isotigs were classified into 25 KOG functional classifications. Among these classifications, the "general function prediction" cluster represented the largest group, followed by "posttranslational modification", "signal transduction" "Transcription", and "Cell motility" was the smallest group, with only eight identified candidates (Fig 6).

Gene pathway assignments by the Kyoto Encyclopedia of Genes and Genomes (KEGG)
We mapped the assembled isotigs to the reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG), including metabolism, genetic information processing, environmental information processing, cellular processes, and organism systems (http://www. genome.jp/kegg/pathway.html). In total, 21,633 sequences were mapped to 293 KEGG pathways, representing 31.6% of the assembled isotigs (S5 Table). Among these isotigs, the most (12,335) were related to metabolism, while 3,687 corresponded to genetic information processing, 1,069 mapped to environmental information processing, 1,885 were associated with cellular processes, and 2,657 belonged to organism systems (Table 3). These pathways provide valuable resources for investigating specific metabolic processes during Cymbidium 'Dharma' research. The presence of genes for these essential cellular processes suggests that these sequences cover most of the Cymbidium 'Dharma' transcriptome, which could further facilitate the identification of gene sets involved in a broad range of biological processes.

Identifying putative genes related to leaf color variations
Our result showed that the chlorophyll contents decreased significantly in severe chlorotic leaf of the mutant compared with that of wild type. We therefor concentrated on the orthologs of genes that are known to be involved in chlorophyll metabolism, and identified 95 isotigs as putative chlorophyll-related genes (S6 Table). According to the previous studies in Arabidopsis, chlorophyll synthesis begins with glutamyl-tRNA (Glu-tRNA), is catalyzed by 16 enzymes, and finally forms chlorophyll b through 16 key regulatory steps [33]. Changes at any step will affect the abundance of chlorophyll pigment, causing leaf color variation [33][34][35]. Homologous sequences for these key enzyme-coding genes were found in our EST dataset (highlighted in red in Fig 7 and listed in Table 4).
Besides, candidate genes associated with chlorophyll degradation were also evaluated in our sequencing data. According to the proposed pathway of chlorophyll catabolism in Arabidopsis [35,36], we identified homologous sequences of genes encoding key enzymes for chlorophyll degradation, including chlorophyll-chlorophyllido hydrolases (CLHs), pheophorbide an oxygenase (PaO), red Chl catabolite reductase (RCCR), and chlorophyll b reductase (NOL) (highlighted in red in Fig 8 and listed in Table 4). To validate the discovery methods, we cloned the full lengths of the ORFs by rapid amplification of the 5 0 cDNA end (5 0 RACE). Full-length cDNAs were obtained for eight identified genes, CHLG, PORA, HEME2, HEMD, HEMC, CAO, CLH2, and RCCR, and all were identical to the RNA-seq data (S7 Table). We randomly chose the homologous sequences of chlorophyll synthase, CHLG (comp4803_c0), and chlorophyllase, CLH2 (comp3012_c0), for further analysis. Our result showed that their ORF were of parallel sizes to their homologs in Arabidopsis (shown in S2 and S3 Figs, respectively). Homologous alignment also indicated comparative high similarity with other species (shown in S4 and S5 Figs, respectively), which confirmed that our 454 EST dataset could effectively facilitate gene identification. This systems bioinformatics survey, combined with molecular biology analysis, is also a feasible approach to elucidate the scent of various biological processes and identify the relevant genes.  Validation and expression analysis of key enzyme genes The expressions of the eight identified homologous genes encoding key enzymes in the chlorophyll biosynthetic and catabolic pathways were further confirmed by RNA blot hybridization.
Our results showed that all six of the candidate genes involved in chlorophyll biosynthesis were highly expressed in the leaves and pseudobulbs but rarely expressed in the roots, as was consistent with the sole deposition of chlorophyll in the leaves (Fig 9). Likewise, the homologs of chlorophyll catabolism genes CHL2 and RCCR were also expressed to a greater extent in the leaves, in agreement with the pattern of chlorophyll distribution. Quantitative PCR analysis revealed that the expression level of RCCR in the leaves of yellow-leaf mutants was one-to nine-fold higher, and those of CHL2 approximately one-to fivefold higher, than in the leaves of wild-type 'Dharma' under normal growth conditions and the changing was consistent with the chlorotic level of the mutant leaves (Fig 10A). By contrarily, the transcript abundances of the key genes for chlorophyll biosynthesis were nearly the same for both yellow-leaf mutants and wild-type plants. As shown in Fig 10B, the differences in expression are no more than 1.5 folds and did not correlate well with their chlorotic levels. This result indicates that the chlorophyll degradation genes were expressed in the yellow-leaf mutants at a higher level, as was consistent with their lower chlorophyll contents and higher chlorophyll degradation.

Conclusion
In this study, we examined the differences between the normal and color-mutant leaves of Cymbidium sinense cultivar 'Dharma', and found a notable decrease of the photosynthetic   pigment (especially chlorophyll b) abundance in the mutant leaves, which is probably resulted from chlorophyll over degradation. To reveal the candidate genes contributing to leaf-color variation, we performed 454 transcriptome sequencing and de novo assembly for Cymbidium 'Dharma'. Roughly 0.7 million ESTs were obtained and assembled into 103,295 isotigs and 68,460 genes, representing orthologs of known plant genes as well as potential new genes. This systems bioinformatics survey, combined with molecular biology analysis, is a feasible approach to elucidate the scent of various biological processes and identify the relevant genes. Regarding the chlorophyll metabolism pathway which directly controls leaf color variation, 95 candidate genes were identified as involved. From which, we cloned 16 key enzyme-encoding genes associated with chlorophyll metabolism. After comparing the expression levels of these candidate genes, we confirmed that the expression levels of two key enzyme-encoding genes for chlorophyll degradation were greater in yellow-color mutant leaves, as was consistent with their lower chlorophyll content. Our result suggested that the mutant leaves arose probably from chlorophyll degradation and not biosynthesis. To the best of our knowledge, this study represents the first transcriptome sequencing this work represents the first study focusing on leaf varation of Cymbidium sinense at both physiological and molecular levels, and it supplies a further molecular basis for leaf color-associated genes. The informative EST dataset we obtained can be used as an important resource to facilitate gene discovery for molecular breeding, the identification of marketable traits, and the investigation of various biological process in Cymbidium.