The Transcriptome of the Reference Potato Genome Solanum tuberosum Group Phureja Clone DM1-3 516R44

Advances in molecular breeding in potato have been limited by its complex biological system, which includes vegetative propagation, autotetraploidy, and extreme heterozygosity. The availability of the potato genome and accompanying gene complement with corresponding gene structure, location, and functional annotation are powerful resources for understanding this complex plant and advancing molecular breeding efforts. Here, we report a reference for the potato transcriptome using 32 tissues and growth conditions from the doubled monoploid Solanum tuberosum Group Phureja clone DM1-3 516R44 for which a genome sequence is available. Analysis of greater than 550 million RNA-Seq reads permitted the detection and quantification of expression levels of over 22,000 genes. Hierarchical clustering and principal component analyses captured the biological variability that accounts for gene expression differences among tissues suggesting tissue-specific gene expression, and genes with tissue or condition restricted expression. Using gene co-expression network analysis, we identified 18 gene modules that represent tissue-specific transcriptional networks of major potato organs and developmental stages. This information provides a powerful resource for potato research as well as studies on other members of the Solanaceae family.


Introduction
Although potato is the third most important food crop after rice and wheat (http://faostat.fao.org), the average yield of potatoes around the world is far below its physiological potential of 120 tons/ha [1]. Advances in potato molecular breeding have been constrained by its complex biological system including vegetative propagation, autotetraploidy, and high levels of heterozygosity [2]. The potato genome [3] and accompanying gene complement are powerful resources for understanding this complex system and advancing molecular breeding efforts in this crop.
The potato gene complement, the corresponding gene structure, chromosome location, and biological function are informative to biologists, breeders, and geneticists. One form of gene annotation is expression profiles, which although correlative, can be used to infer function. Traditional gene expression analyses for potato include Expressed Sequence Tags (ESTs) and microarraybased expression profiles. To date, there are 249,457 potato ESTs in the National Center for Biotechnology Information EST database (dbEST, release 080111; http://www.ncbi.nlm.nih. gov/dbEST/dbEST_summary.html), which have been a valuable resource for gene discovery and expression in several potato genotypes, tissues, and environmental stress responses [4,5,6,7,8]. Approaches to quantitative gene expression profiling include the development of cDNA and oligonucleotide-based microarrays, for which 26 experiments and 506 assays exist in the National Center for Biotechnology Information Gene Expression Omnibus and the European Bioinformatics Institute ArrayExpress [9,10]. The Institute for Genomic Research developed potato cDNA microarrays based on ,12,000 potato clones [11], on which more than 50 studies have been completed including potato development and abiotic/biotic stress responses [11,12,13,14,15]. An oligonucleotide microarray based on the Agilent microarray platform was used in a series of studies examining tuber growth and metabolism [16]. Although these studies have generated significant amount of data for gene expression analysis, comprehensive characterization of the potato transcriptome has been constrained by limitations in Sanger-based sequencing and array-based methodologies. While Sanger-based EST sequencing is quantitative, cost limitations prevent deep and exhaustive sampling of the transcriptome. Platforms such as the existing potato cDNA and oligonucleotidebased arrays are limited by lack of the full gene complement being interrogated on the platform. Recent advances in high-throughput sequencing technologies have overcome these limitations and whole transcriptome shotgun sequencing, known as RNA-Seq, enables simultaneous analysis of thousands of transcripts for gene discovery and transcript abundance [17]. Moreover, this method provides a comprehensive view of the transcriptome without prior knowledge [18]. To complement the potato genome sequence for the purposes of improving genome annotation and to generate gene expression profiles, members of the Potato Genome Sequencing Consortium (PGSC) generated a large set of next generation transcript sequencing data. Here, we report a reference for the potato transcriptome using the reference accession, the doubled monoploid Solanum tuberosum Group Phureja clone DM1-3 516R44 (hereafter referred to as DM).

Tissues sampled and sequencing metrics
Here, we analyzed gene expression patterns in a set of 32 tissues from DM plants that represent major organs, developmental stages, and stress-related conditions (Tables 1 and 2). We have grouped these tissues into five major classes: Floral (petals, sepals, carpels, stamens, whole flowers), Fruit (mature, immature, inside fruit), Stolon/Tuber (stolons, tuber1, tuber2), Leaf (leaves, petioles), and Other tissues (shoots, callus, roots). Stress conditions included leaves challenged with Phytophthora infestans, leaves wounded to mimic herbivory, and the elicitors acibenzolar-smethyl (BTH) and DL-ß-amino-n-butyric acid (BABA) for biotic stress. For abiotic stress, plants were exposed to drought, salinity, heat, and a panel of four hormones: abscisic acid (ABA), 6benzylaminopurine (BAP), gibberellic acid (GA 3 ), and indole-3acetic acid (IAA). Overall, this study generated .550 million RNA-Seq reads (35 to 40 base pairs in length). The number of reads per library ranged from 5.4 million in the petal library to 30 million in the mature whole fruit library, while the number of genes that were expressed ranged from 11,394 in tubers to 16,276 in plants treated with NaCl (Tables 1 and 2). We found a weak correlation (20.14) between the 'number of transcripts identified' and the 'number of RNA-Seq reads' per library. The minimum and the maximum number of reads both detected a highly similar number of transcripts (Figure 1), suggesting that there was no bias against transcript detection by the depth of sequence coverage in this data set.

The DM transcriptome
Transcript abundance is expressed in fragments per kilobase of exon model per million mapped reads (FPKM) as implemented in Cufflinks [19]. This normalized unit allows the comparison both within and between samples. We used two other criteria to filter the expression data sets. First, a transcript was considered expressed if the FPKM 95% confidence interval lower boundary was greater than zero, and second, if the FPKM value was $0.001. Based on these criteria, 22,704 high-confidence transcripts were detected in total in these 32 RNA-Seq data sets with 21,630 in the developmental tissue series and 19,704 in the abiotic/biotic stress series (Tables 1 and 2; Table S1). The genome of DM contains 39,031 protein-coding genes [3] and a single transcript was selected to represent each gene model (see Materials and Methods). Thus, the 22,704 transcripts detected here represent nearly 60% of the predicted genes in potato. Eightythree percent of these transcripts encode proteins with known function. Of the remaining 17%, eight percent had either no match in the UniRef database or lack a Pfam domain with a known function, while nine percent align to an unknown or a hypothetical protein from another species (Table S2). These results   indicate that more than half of the transcripts with no known function have sequence homology with other plant proteins, indicating evolutionary conservation and functional significance. The DM transcriptome data provides a valuable reference for gene expression under normal as well as stress conditions. We identified as many as 20,549 genes expressed in normal tissues of major potato organs. Twenty percent of these (4,184 genes) were exclusive either to floral, fruit, leaf, or stolons/tuber tissues. Similarly, an overall number of 20,390 genes were expressed either in tissue culture, abiotic stress, or biotic stress conditions. Of those, eight percent (1,680 genes) were exclusive to abiotic and/or biotic stress treatments relative to their respective controls. While variation in transcriptome responses are to be expected in other potato species and accessions, the DM abiotic and biotic stress transcriptome profiles provide a baseline assessment of the potato transcriptome that can facilitate further studies in the physiological and biochemical mechanisms of stress responses and adaptation.
Of particular interest are two classes of lineage-specific genes. Comparative analysis of the reference potato DM genome with all available plant genome and transcriptome sequence datasets revealed 2,642 high confidence asterid and 3,372 potato lineagespecific genes [3]. The Asterid-specific set of potato genes encode proteins that lack similarity to any other plant genome or transcriptome except that of another Asterid (see Supplementary Figure 5 in ref [3]). The potato-specific set lack sequence similarity to other plant genome or transcriptome sequence including other Asterids (see Supplementary Figure 5 in ref [3]). Table 3 summarizes the expression of these lineage-specific genes. A total of 779 of the 2,642 Asterid-specific genes (29.5%; Table S6) and 820 of the 3,372 potato-specific genes (24.3%, Table S7) are expressed in at least one tissue. However, only 110 Asterid-specific (14.1%) and 15 potato-specific (1.8%) expressed genes have meaningful functional annotation based on alignments to the UniRef100 database and/or the presence of a Pfam domain. Resistance genes (LRR, late blight resistance, tospovirus resistance) were represented in both classes along with genes encoding systemic acquired resistance protein (Asterid-specific only).

Gene Co-expression Pattern Analyses
To examine the variability in expression levels of constitutively expressed genes, i.e. transcribed in all tissues, we calculated the coefficient of variation (CV = standard deviation/mean) of their FPKM normalized expression counts. Genes with small variation across tissues are thought to perform housekeeping functions and consequently used as reference genes to normalize expression values. When calculated across all 32 samples, the CV ranged from 0.14 to 5.6 (Table S3). In addition to common housekeeping genes such as glyceraldehyde-3-phosphate dehydrogenase (PGSC0003DMG400011246, 00015253, 00017433, 00017434), actin (PGSC0003DMG400003985, 00018449, 00020244, 02007428), ubiquitin (PGSC0003DMG400009125, 00021791, 00023184, 00023462), tubulin (PGSC0003DMG400004272, 00014296, 00017954, 00028193, 00029926), and elongation factor 1-a (PGSC0003DMG400005728, 00008117, 00019677, 00020772, 00020775, 00023270, 00023272) that have been reported to be stably expressed during biotic and abiotic stress in potato [20], there was a number of genes with high, stable expression levels that could be potentially useful in cross-tissue expression analyses (Table S3).
To better understand the variation of gene expression across all tissue types and stress-related treatments, we performed hierarchical clustering and principal component analyses. Two different RNA-Seq data sets were analyzed: one included 16 different tissue types with 21,630 transcripts; and the other consisted of 16 stressrelated treatments with 19,704 transcripts (Figures 2 and 3). The resulting cluster heat maps of log2-transformed FPKM values using the Spearman correlation coefficients clearly differentiated major tissue types as well as biotic and abiotic stresses ( Figure 2). Clustering of Floral (sepals, petals, carpels, and stamens, mature whole flowers), Fruit (immature and mature whole fruit), Leaf (leaves, petioles), and Stolon/Tuber tissues (Figure 2A), as well as tissues under abiotic (salt, mannitol, heat, ABA, IAA, GA3) and biotic (late blight, BABA, BAP, BTH, leaf wounding) stresses ( Figure 2B) was supported by high bootstrap scores (.90%, 1000 replicates). Similar gene expression patterns were evident when variation among samples was visualized in a reduced-dimension space via the first two principal components (Figure 3). These two principal components together explained only 38% and 43% of the total variation in tissue types and abiotic/biotic stresses, respectively, which may account for overlap between some tissues/ treatments. Collectively, these analyses captured the biological variability that accounts for gene expression differences among tissues, and suggest tissue-specific expression of differentially expressed genes as well as genes that are expressed only in a specific tissue type or stress response.
A comprehensive identification of highly correlated groups of genes was performed using the Weighted Gene Correlation Network Analysis (WGCNA) [21]. Using 15 tissues from major potato organs and developmental stages, we identified 18 gene coexpression modules containing a total of 5,400 genes (Table S4). Each module represents genes with highly correlated expression profiles, either in a single tissue or in a few developmentally related tissues (Figures 4 and S1). For example, module A1 contains 290 genes that are co-expressed in fruit tissues (''immature fruit'', mesocarp/endocarp'', ''mature whole fruit'') ( Figure 5A). It included genes involved in fruit development and ripening such as pectin esterase, lipoxygenase, and malate synthase (Table S4). Similarly, module A15 contained 90 genes that are co-expressed in tubers (''tuber1'', ''tuber2''), and included starch biosynthesis genes such as glucose 6-phosphate/phosphate translocator and storage proteins such as patatin ( Figure 5B, Table S4).
Our WGCNA analyses identified genes encoding transcription factor-related Pfam domains in all 18 co-expression modules (Table S5). Network modules containing transcription factor genes are of particular importance because these transcription factors may have a role in the regulation of expression of other member genes. Two modules, A2 and A14, were significantly enriched for   Overall, analyses of functional assignments of all of the genes within the modules indicate that 30% of the genes in modules have no known function. Examination of the lineage-specific genes revealed that nearly 12% (632 genes) of our module genes are lineage-specific, 289 asterid-and 343 potato-specific (Tables S8,  S9). Only a few asterid-specific genes were associated with Pfam domains of known function (e.g., PF07333, PF05938, PF05498, PF04043, Table S8) and these were included in floral (''carpels'', ''whole flowers'', ''stamens''), tuber, or stolon related co-expression modules (Table S4, Figure 4, modules A8, A13-15). Based on their interaction with known genes, these genes with no meaningful annotation can be used to place these non-annotated genes in a functional context and infer their role in potato development.
In summary, this large dataset of .550 million RNA-Seq reads permitted detection and quantification of expression levels of more than 22,000 genes in the sequenced accession of potato, and provides an overview of the transcriptome of a diverse collection of tissues and growth conditions. Coupled with identification of coexpression modules, these data provide a basis and a powerful

Transcriptome profiling
Transcriptome analyses were performed using RNA-Seq data generated by the PGSC described previously [3]. In this data set, transcriptome sequences were generated from 32 DM libraries using RNA-Seq with the Illumina Genome Analyzer II platform (Tables 1 and 2). The 32 DM libraries represent a wide range of developmental tissues/organs as well as abiotic and biotic stress treatments and are described in detail in reference [3] (see Supplementary Material and Table S4). The developmental tissues represent vegetative (leaves, petioles, stolons, tubers sampled twice) and reproductive organs (Floral: carpels, petals, sepals, stamens, whole flowers; Fruit: mesocarp/endocarp, whole immature berries, whole mature berries) from greenhouse-grown plants. Shoots and roots from in vitro-grown plants were also included in the developmental series. Callus (10)(11) week old) derived from leaves and stems were used to assess transcription in an undifferentiated tissue. The biotic stress conditions (pooled samples at 24 hr, 36 hr, 72 hr) were induced with Phytophthora infestans inoculum (Pi isolate US8: Pi02-007) and two chemical inducers, acibenzolar-S-methyl (BTH, 100 mg/ml) and DL-bamino-n-butyric acid (BABA, 2 mg/ml) using detached leaves. Wounded leaves, primary and secondary, were included to mimic herbivory. The abiotic stress conditions (24 hr treatment of in vitro grown whole plants) include heat (35uC), salt (150 mM NaCl) and mannitol (260 mM) treatment. Abscisic acid (ABA, 50 mM), indole-3-acetic acid (IAA, 10 mM), giberellic acid (GA3, 50 mM), and 6 benzylaminopurine (BAP, 10 mM) were used to induce hormone stress responses. Expression levels as previously described in [3] were determined by mapping the RNA-Seq reads to the DM potato reference genome using Tophat [23] and expression levels were determined using Cufflinks [19]. Only representative transcripts, which were chosen by selecting the longest Coding Sequence (CDS) from each gene, were used for the analyses [3]. RNA-Seq reads are available in the NCBI Sequence Read Archive under study number SRA029323.

Bioinformatic analyses
Functional annotation was performed using a combination of BLASTX searches [24] against the Uniref100 (E-value cutoff of 1e-5) and identification of Pfam domains using InterProScan searches against InterPro [25]. R-statistics (http://www.r-project. org/) were used for hierarchical cluster analysis, cluster dendrograms, and principal component analysis. Domain-enrichment analyses were performed using Fisher's Exact Test as implemented in R (http://cran.r-project.org). Transcription factor genes were identified based on PFAM domains (Table S5).

Co-expression pattern analyses
Co-expression analysis was performed using WGCNA in order to identify modules of highly correlated genes [21]. CV values were calculated for all genes, and those with a CV less than 0.8 across samples were not included in the WGCNA analyses. Expression values for the remaining genes were then log2 transformed before being processed through the WGCNA Rpackage [26]. Genes with untransformed FPKM values less than 1 were transformed to zero. For module identification, the WGCNA parameters b and treecut were set to 9 and 0.7, respectively. All other parameters were used with the default values. Eigengenes were calculated for each gene co-expression module in order to visualize the gene expression patterns for each module. Eigengenes are the first principal component of principal component analysis of the normalized expression values of all genes in a module, and they represent the average normalized gene expression for a module [27]. Figure S1 Trend plots of the normalized gene expression values for each gene from eighteen identified gene coexpression modules. Modules consisting of genes with specific in various tissues: A1. mesocarp/pericarp tissue and mature fruit, A2. immature fruit and mesocarp/pericarp tissue, A3. immature fruit, A4. mesocarp/pericarp tissue, A5. mature fruit, A6. roots, A7. sepals, A8. carpels, A9. petals, A10. shoots, A11. petioles, A12. leaves, A13. whole flowers and stamens, A14. tubers and stolons, A15. Tubers (sample 1 and 2), A16. tubers (sample 1), A17. tubers (sample 2) and A18. stolons. (PDF)