Transcriptome Sequencing and De Novo Analysis of a Cytoplasmic Male Sterile Line and Its Near-Isogenic Restorer Line in Chili Pepper (Capsicum annuum L.)

Background The use of cytoplasmic male sterility (CMS) in F1 hybrid seed production of chili pepper is increasingly popular. However, the molecular mechanisms of cytoplasmic male sterility and fertility restoration remain poorly understood due to limited transcriptomic and genomic data. Therefore, we analyzed the difference between a CMS line 121A and its near-isogenic restorer line 121C in transcriptome level using next generation sequencing technology (NGS), aiming to find out critical genes and pathways associated with the male sterility. Results We generated approximately 53 million sequencing reads and assembled de novo, yielding 85,144 high quality unigenes with an average length of 643 bp. Among these unigenes, 27,191 were identified as putative homologs of annotated sequences in the public protein databases, 4,326 and 7,061 unigenes were found to be highly abundant in lines 121A and 121C, respectively. Many of the differentially expressed unigenes represent a set of potential candidate genes associated with the formation or abortion of pollen. Conclusions Our study profiled anther transcriptomes of a chili pepper CMS line and its restorer line. The results shed the lights on the occurrence and recovery of the disturbances in nuclear-mitochondrial interaction and provide clues for further investigations.


Introduction
Chili pepper (Capsicum annuum L.) is a member of the Solanaceae family. Originated from South America, it has become an economically significant vegetable and an agriculturally important plant worldwide, particularly in China and Korea [1][2][3][4][5]. The heterosis of pepper is very obvious: the average yield of hybrids is 30% more than that of common cultivars [6][7][8]. At present, hybrid seed production mainly relies on manual pollination, which is not only costly but also difficult to ensure seed purity. Therefore, more and more researchers and breeders tend to the male sterile line and study its application in hybrid seed production.
Cytoplasmic male sterility (CMS), resulted from disturbed mitochondrial-nuclear interaction, was a failure to produce functional pollen that can be suppressed or counteracted by nuclear genes known as restorer-of-fertility (Rf) genes [9][10][11]. It is widely accepted that CMS is closely related to mitochondrial genome rearrangement, and many trait-determining mitochondrial genes could be suppressed or activated by Rf genes [9,12]. In addition to naturally occurring, CMS could be created by either sexual crossing or protoplast fusion [9]. In chili pepper, CMS was first documented in the PI 164835 line from India [13], whose cytoplasm has been used as the only source for CMS. To date, it has been reported two determinants, atp6-2 and orf456 [14,15], and two markers of CMS-specific sequence-characterized amplified region (SCAR) [16]. For Rf genes, previous study has shown that they mainly scatter in chili pepper, but seldom in sweet pepper [17]. One major QTL for fertility restoration was mapped to chromosome P6 [18], and several markers flanking the major restorer gene have been identified [6,[19][20][21][22]. However these markers have limited applications in pepper lines due to low reproducibility and the failure of PCR amplification [23]. Besides, a CAPS marker linked to the partial restoration (pr) locus has been developed [24].
Near-isogenic lines (NILs) are a pair of lines with identical genetic backgrounds, except for a region near the target gene [25,26]. Particularly, NILs are generated by crossing a donor line carried the gene of interest to a recurrent parent and then backcrossing to the recurrent parent for six to eight generations [27]. Pairs of NILs are useful and valuable materials for screening molecular markers linked with target genes and isolating critical genes associated with interested traits. We have used NILs to analyze differently expressed genes between CMS lines and its near-isogenic lines in pepper [28][29][30].
The throughput of sequencing has been improved greatly and cycle time of sequencing has been significantly shortened due to the emergence and development of next generation sequencing (NGS) technology. The NGS technology, namely RNA-Seq, is efficient and inexpensive to analyze transcriptome in a comprehensive and in-depth way [31,32], and has provided new insights into the whole transcriptome. It offers an opportunity to identify critical genes related to a certain character from the numerous molecular markers or the differentially expressed genes discovered. Information of the developmentally and environmentally induced differentially expressed genes can also be used to predict interactions of individual genes, as well as to elucidate more complicated signaling pathways as well as potential cross-talks between these pathways [33].
Chili pepper is a diplont plant (2n = 24) with a reported genome size about 3,753-4,763 million base pairs [34]. Since the chili pepper genome is not currently available, transcriptomic data can help to identify the genes and gene families involved in important biological processes. To date, several studies of RNA sequencing on peppers have been reported [35][36][37][38][39][40]. However, most of these studies were mainly focus on the developing molecular markers.
To characterize the anther transcriptomes of chili pepper and seek genes involved in fertility determination, transcripts from the CMS line 121A and its restorer line 121C were isolated, quantified and sequenced. These transcriptomic sequences were then assembled by Trinity and annotated by BLASTing against public databases. Subsequently, the annotated sequences were clustered into putative functional categories using the Gene Ontology (GO) framework and grouped into pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG). Finally, differentially abundant unigenes were analyzed, and part of the results was validated by relative RT-PCR and real-time RT-PCR. This study presents the first broadly survey of CMS line and the restorer line in chili pepper with RNA-Seq analysis.

Results and Discussion
Illumina Sequencing and de novo Assembly Though criteria for gametophyte development was available in model plant Arabidopsis [41,42], the pivotal time-point for pepper stamen development is still unclear. To obtain as many of the genes expressed during anther development as possible, RNA was isolated from five different development phases of microspore and mixed equally for the generation of cDNA library. The two cDNA libraries separately constructed from 121A and 121C were sequenced using the Illumina platform. After cleaning and quality check, 53 million 100bp paired-end reads were assembled into 85,144 unigenes with an average length of 643 bp (Table 1).

Gene Prediction
Gene prediction was conducted using the 'GetORF' software for the 85,144 unigenes, 84,793 of which contain protein coding sequences. The average length of the remaining 351 unigenes is 233 bp. This may be due to the fact that they are too short that cover only Untranslated Regions (UTR).

Annotation of Predicted Proteins
In total, 27,191 (32%) unigenes were significantly matched to known genes in the public databases (Table S1). In fact, ''non-BLASTable'' sequences have been reported in all studies regarding plant transcriptomes. However, due to the differences in species, the sequencing depth and the parameters of the BLAST search, the proportion of ''non-BLASTable'' sequences range from 13 to 80% [43][44][45][46].
Consistent with other reports [47,48], assembled sequences in the present study also showed that the longer the sequences had the higher match proportion in database. Match efficiency was 95.10% for sequences longer than 2,000 bp, but was 39.58% and 15.65% for sequences 500-1,000 bp and 100-500 bp in length, respectively ( Figure 1).
In addition, we compared the pepper transcriptome with all the ESTs of Capsicum annuum in NCBI and tomato coding sequences (CDS) from ITAG2.3 annotation release in SGN. 3,385 (3.98%) and 4,993 (5.86%) of the unigenes match with the Capsicum annuum ESTs and tomato coding sequences, respectively, and when do the comparison conversely we got a similar result with 3,383 (3.97%) for the Capsicum annuum ESTs and 5,006 (5.88%) for tomato coding sequences.

Gene Ontology (GO) Annotation
A total of 9,896 unigenes were assigned to 58 functional groups using GO assignment ( Figure 3). In each of the three main categories (cellular component, molecular function and biological progress) of the GO classification, the dominant terms were ''cell'', ''binding'' and ''cellular process'', respectively. ''Intracellular'', ''catalytic activity'' and ''metabolic process'' were also well represented. However, few genes were assigned to the terms ''proteinaceous extracellular matrix & cell surface'', ''translation regulator activity'' and ''extracellular structure organization''.

Gene Expression Analysis
After calculation, expression of each unigene was obtained. Using the restorer line as a reference, 4,326 up-regulated unigenes (with higher expressions in the sterile line) and 7,061 downregulated unigenes (with higher expressions in the restorer line) were identified (Table S3). Results showed that the number of down-regulated unigenes was obviously larger than that of upregulated unigenes ( Figure 4). In addition, 9,224 and 13,568 specific unigenes were found in the sterile line and the restorer line, respectively ( Figure 5).
To evaluate the validity of Illumina analysis and to further assess the patterns of differential gene expression, several unigenes from  our sequencing results were selected and detected by RT-PCR and qRT-PCR (Table 2) with unigene-specific primers (Table S4). For real-time RT-PCR ( Figure 6), this study selected unigenes with known function like ATP binding protein, pentatricopeptide (PPR) repeat-containing protein, MADS-box transcription factor and the others. In contrast to the Illumina data, the highest up-regulation of comp66553_c0_seq1 was observed with almost 263-fold in 121C, while the transcript abundance of comp62432_c0_seq1 was induced by approximately 188-fold, lower than that of comp66553_c0_seq1.
Most selected unigenes (e.g. comp215802_c0_seq1, comp198237_c0_seq1, comp54012_c0_-seq1, comp56601_c0_seq1, comp71609_c0_seq1, comp66553_c0_seq1, comp62048_c0_seq1 and comp62432_c0_-seq1) showed lower changes when compared with Illumina sequencing. Furthermore, 8 ''non-BLASTable'' unigenes were selected randomly for relative RT-PCR (Figure 7). The log2ratios for comp54630_c1_seq1 and comp60513_c2_seq1 were 22.6 and 27.9, respectively. In contrast to the Illumina data, expression difference of comp54630_c1_seq1 between the two materials was more obvious than that of comp60513_c2_seq1 when using relative RT-PCR. However, the trend of higher abundance in the restorer line was consistent between sequencing and relative RT-PCR. Taken together, the expression patterns of these genes in 121A and 121C are consistent with the Illumina data. RT-PCR results basically confirmed the reliability of our transcriptome analysis. However, the two techniques essentially use different algorithms, which may explain the above-mentioned some inconsistent results [49][50][51].
All unigenes showing significant differences in transcript abundance between the two materials were mapped to the GO and KEGG pathway database. Differently expressed unigenes obviously enriched in four GO terms, i.e. ''extracellular'', ''ion transporter activity'', ''lyase activity'' and ''transporter activity'' (Table 3). Twenty KEGG pathways with enrichment of differently expressed unigenes were also found (Table 4), among which the top three pathways that cover the most differentially expressed unigenes were starch and sucrose metabolism (41), oxidative phosphorylation (40) and plant-pathogen interaction (28).
Some genes among the differentially expressed unigenes are associated with molecular pathology, though we collected samples from asymptomatic plants. In addition, it has reported that restorer genotype for male sterile cytoplasm of genetic resources was moderately resistant to phytophthora capsici, which implied some differentially expressed unigenes may be associated with both molecular pathology and restoration of fertility [52].

Analysis of Candidate Male Sterile Genes
Previous studies proved the sharply decreased content of ATP in the male sterile line [53] and its level was much lower than that in the maintainer line [54]. Many mitochondrial genes related to male sterility turned out to be involved in the mutation of ATP synthase subunits [55], such as the rape atp6 gene of Polilma CMS [56], the sunflower atp4 gene [57], the radish atp8 gene of Ogura CMS [58] and the petunia atp9 gene [59], etc. Also a gene named atp6-2 [14] was found to be related to male sterility in pepper. Our study identified 39 unigenes associated with ATP synthase including ATP synthase subunit, among which 4 unigenes showed high abundance in 121A while 6 in 121C.
Cytochrome oxidase is the marker enzyme of mitochondrial inner membrane with strong activity. It plays important role in the mitochondrial respiratory chain electron transfer system and affects plant cell respiration. For the 18 cytochrome oxidaserelated unigenes found in the present study, 2 and 7 unigenes were highly abundant in 121A and 121C, respectively. Many studies have shown that cytochrome oxidase is relevant to CMS in plants [60][61][62], and the open reading frame orf456 found in pepper coxII gene is closely related to CMS [15].

Analysis of Candidate Fertility Restorer Gene
The PPR (pentatricopeptide repeat) gene family is a large gene family characterized by the presence of tandem arrays of a degenerate 35-amino-acid repeat [63]. Due to its essential roles in mitochondria and chloroplasts, the PPR has received the enormous attention. Many restorer genes, including the radish Rfo gene [64], the rice Rf-1 gene [65] and the petunia Rf-PPR592  gene [66], etc., encode proteins of PPRs. Among 463 PPR unigenes found in the present study, 17 and 9 unigenes were highly abundant in 121A and 121C, respectively.

Analysis of Other Candidate Male Fertility-related Genes
Since the abnormal development of anther or pollen is the immediate cause of male sterility, proteins related to anther and pollen may have close relationships with fertility. We found 2 anther specific proteins were more abundant in 121C than in 121A. Proteins related to pollen, including major pollen allergen, pollen coat-like protein and pollen-specific protein, were notably represented among 121C transcripts, with 13 highly abundant unigenes compared with 1 abundant unigene in 121A.
The abnormity of activated oxygen metabolism in the development of anther or young panicle may be associated with male sterility [67][68][69][70][71][72]. The differences of activated oxygen  Table 2. The primers used for each gene are listed in Table S4. doi:10.1371/journal.pone.0065209.g006  Table 2 and Table S4, respectively. doi:10.1371/journal.pone.0065209.g007 metabolism were compared between the three-lines of CMS [73,74], but the influence of oxygen metabolism on fertility remains unknown. As active oxygen scavengers of plant, 10 and 7 peroxidase-associated unigenes were found up-regulated in 121A and 121C, respectively. Among the unigenes annotated as catalase gene, 1 was highly abundant in 121A and 2 in 121C. For superoxide dismutase-related unigenes, only 1 showed high abundance in 121A. All the 3 polyphenol oxidase unigenes in our results were differentially abundant unigenes with 2 highly abundant in121A and 1 in 121C.
It's lack of soluble sugar in the male sterile lines, like pepper [75], cabbage [76] and rape [77]. Gluconeogenesis is an important pathway to generate sugar in plant, in which malate dehydrogenase and aspartate aminotransferase play important roles. Our results showed that 1 malate dehydrogenase-associated unigene was highly abundant in 121A but 4 in 121C, and 1 aspartate aminotransferase unigene was highly abundant in 121C.
Insufficiency or deviation of RNA editing may form improper editing products and therefore hinder the proper functions that generate CMS [78,79]. We found 6 highly abundant splicing factors in 121A and 1 in 121C.
The MADS-box gene family is a regulatory gene family with specific sequence in plants. Proteins coded by this gene family play important roles in regulation of growth and development of plants. It regulates the development of roots, leaves, fruits and flowers [80] with different spatial and temporal expression profiles during development of floral meristems and floral organs [81]. We found 4 and 5 highly abundant unigenes with MADS-box domain in 121A and 121C, respectively.
The F-box proteins with F-box domain can identify substrates in the ubiquitin-mediated proteolysis pathways, and play important roles in cell cycle, signal transduction, transcription, programmed cell death and male sterility [82]. We found 21 highly abundant Fbox unigenes in 121A and 38 in 121C.

Conclusion
In the present study, we not only profiled the transcriptome of pepper anther, but also analyzed differentially abundant unigenes between a pepper CMS line 121A and its near-isogenic restorer line 121C. The total 5.3 Gb data were assembled into 85,144 unigenes. We assembled 71,576 and 75,920 unigenes from 121A and 121C, respectively. 4,326 and 7,061 unigenes were found highly abundant in lines 121A and 121C, respectively. After further enrichment analysis, we identified three enriched pathways that cover the most differentially abundant unigenes. Our results provide a global look at the differences between the pepper CMS line and its near-isogenic restorer line, and laid the foundation for identifying new fertility-associated genes and elucidating the mechanisms of cytoplasmic male sterility and fertility recovery.

Plant Materials and RNA Extraction
The pepper CMS line 121A and its near-isogenic restorer line 121C were selected as materials for RNA sequencing. They were drawn from a backcross program [83]. Line 121A was from a backcross between chili pepper advanced inbred lines 121 and 8907A. In order to get the corresponding restorer line 121C, 8907A was crossed to ''big gold bullion'' with single dominant restorer gene. It was followed by backcrossing repeatedly with advanced inbred line 121. The cytoplasmic source of the CMS line and restorer line was derived from 8907A and a minimum of seven backcrosses was made. In consequence, the CMS line and restorer line had the similar cytoplasmic genetic background. In addition, it's believed that they had the identical nuclear genetic background except for the restorer gene locus. Both lines were grown in the Shangzhuang experimental station of China Agricultural University under standard greenhouse condition during spring in 2011.
Flower buds were collected from each of the 20 individuals of 121A and that of 121C, respectively. And collected flower buds were divided into five phases according to the relevance between development phases of microspore and morphological characteristics of floral organs (see Figure 8 for details) [84]. Anthers of the five phases were taken out and frozen in liquid nitrogen. All samples were stored at 280uC until RNA extraction.
Each frozen sample was grinded in a mortar with liquid nitrogen, and then total RNA was isolated using Trizol reagent (Invitrogen, USA) following the standard protocol. The concentration of total RNA was determined by NanoDrop (Thermo Scientific, USA), and the RNA integrity value (RIN) was checked using RNA 6000 Pico LabChip of Agilent 2100 Bioanalyzer (Agilent, USA). Then RNA of the five phases was equally mixed.
The mixed RNA was incubated with DNase I (Ambion, USA), and messenger RNA was further purified with MicroPoly(A) Purist Kit (Ambion, USA) as per the protocol and the final concentration was determined using NonoDrop.

Library Preparation
RNA was fragmented and annealed with Biotinylated Random Primers which have the Illumina adapter sequence. Then the RNA fragments were captured by Strapavidin through Biotinylated Random Primers. Another Illumina adapter was ligased to 59RNA by RNA ligase. Reverse transcriptase was used for reverse transcription. Finally two double strand Illumina libraries were obtained by PCR amplification. Sequencing and Data Processing The two libraries were sequenced by Illumina paired-end sequencing technology. Raw data was scanned using Casava with default parameters, and reads with more 10% Q,20 bases were removed. All sequences smaller than 70 bases were eliminated based on the assumption that small reads might represent sequencing artifacts [85]. Then the high quality reads were assembled by Trinity with default parameters to construct unique consensus sequences [86].

Bioinformatics Analysis
After de novo assembled with Trinity, open reading frames were identified by using an in-house developed program based on 'GetORF' from EMBOSS [87]. Gene annotation was performed through BLASTp search against Swiss-Prot and GenBank database with E value of 10 25 , and then the best one was chosen as the result of gene annotation. Comparisons with the ESTs of Capsicum annuum in NCBI and tomato CDS in SGN were performed through Perl scripts, with E value less than 10 210 and 10 220 , respectively, and the proportion of the similar part larger than 80%. Gene ontology analysis was performed using GoPipe [88], BLASTP was firstly used to search against Swiss-Prot and TrEMBL database with E value of 10 25 , and then the GO information was obtained according to gene2go. The metabolic pathways were constructed based on KEGG database by BBH (bidirectional best hit) method [89]. KO number of each protein was identified firstly and metabolic pathways were constructed based on the KO number then.

Gene Expression Difference Analysis
Reads number of each unigene was firstly transformed into RPKM (Reads per Kilo bases per Million reads) [90] and then differently expressed unigenes were identified by DEGseq package using the method MARS (MA-plot-based method with Random Sampling model) [91]. ''FDR #0.001 and the absolute value of log 2 Ratio $1'' was used as the threshold to judge the significance of unigene expression difference. The data analyzed have been deposited on the NCBI Gene Expression Omnibus under accession no. GSE45431.
Candidate male fertility-related genes were selected according to their annotation (Table S1), and then their abundance differentiations of the two materials were obtained from the result of gene expression analysis (Table S3).

Enrichment Analysis
All unigenes showing significant transcript abundance differences between the two materials were firstly mapped to the GO and KEGG pathway databases, and then the numbers of unigenes for every GO term and KO term were calculated, respectively. Significantly enriched GO and KO terms from the set of differentially abundant unigenes were found using the hypergeometric test, for the sake of comparing these unigenes to the achieved chili pepper transcriptome background. The formula for the gene enrichment test was  differentially abundant unigenes in N; M represents the number of unigenes that were annotated to certain GO or KO terms; and m represents the number of differentially abundant unigenes in M. The initially obtained p-values were then adjusted using a Bonferroni Correction and a corrected p-value of 0.05 was adopted as a threshold.

Reverse Transcriptase PCR (RT-PCR) Analysis
Total RNA isolated above was treated with DNase I to remove genomic DNA contamination. The first-strand cDNA synthesis and the qRT-PCR were carried out using the PrimeScript 1st Strand cDNA Synthesis Kit (Takara) and SYBR Premix Ex Taq TM (Takara), respectively. The qRT-PCR was performed on an ABI PRISM 7500 Real-Time PCR System(Applied Biosystems, USA) with the following cycling parameters: 95uC for 30 s, followed by 40 cycles of: 95uC for 5 s, 60uC 34 s. The actin (GenBank: GQ339766.1) gene was used as the internal control. Expression levels of the unigenes were calculated from the threshold cycle using the delta-delta Ct method [92]. The cycling parameters of relative RT-PCR were: 94uC for 2 min followed by 30 cycles of 94uC for 30 s, 54uC for 30 s (annealing temperatures were set according to the primers' Tm.), 72uC for 40 s, and final elongation at 72uC for 3 min. All reactions were performed with at least three replicates.