Association of gene expression with biomass content and composition in sugarcane

About 64% of the total aboveground biomass in sugarcane production is from the culm, of which ~90% is present in fiber and sugars. Understanding the transcriptome in the sugarcane culm, and the transcripts that are associated with the accumulation of the sugar and fiber components would facilitate the modification of biomass composition for enhanced biofuel and biomaterial production. The Sugarcane Iso-Seq Transcriptome (SUGIT) database was used as a reference for RNA-Seq analysis of variation in gene expression between young and mature tissues, and between 10 genotypes with varying fiber content. Global expression analysis suggests that each genotype displayed a unique expression pattern, possibly due to different chromosome combinations and maturation amongst these genotypes. Apart from direct sugar- and fiber-related transcripts, the differentially expressed (DE) transcripts in this study belonged to various supporting pathways that are not obviously involved in the accumulation of these major biomass components. The analysis revealed 1,649 DE transcripts between the young and mature tissues, while 555 DE transcripts were found between the low and high fiber genotypes. Of these, 151 and 23 transcripts respectively, were directly involved in sugar and fiber accumulation. Most of the transcripts identified were up-regulated in the young tissues (2 to 22-fold, FDR adjusted p-value <0.05), which could be explained by the more active metabolism in the young tissues compared to the mature tissues in the sugarcane culm. The results of analysis of the contrasting genotypes suggests that due to the large number of genes contributing to these traits, some of the critical DE transcripts could display less than 2-fold differences in expression and might not be easily identified. However, this transcript profiling analysis identified full-length candidate transcripts and pathways that were likely to determine the differences in sugar and fiber accumulation between tissue types and contrasting genotypes.


Introduction
Sugarcane biomass could play a very important role in supporting second generation biofuel production. On average, about 64% of the total aboveground dry biomass in sugarcane a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 transcripts involved in the accumulation of the major biomass components in corresponding samples. The transcriptome database was sequenced by PacBio isoform sequencing (Iso-Seq), as described in Hoang et al. (2017) [24], can be accessed under the NCBI TSA accession number GFHJ00000000.1; and hereafter is referred to as the Sugarcane Iso-Seq Transcriptome (SUGIT) database. Identifying the differential expressed transcripts between immature and mature internodes could suggest potential transcripts that may be involved in sugar and fiber accumulation along the sugarcane culm [25], while those differing between the low and high fiber (or sugar) content would highlight the differences in expression patterns between these two groups. The study set out to increase understanding of the regulation of carbon flux into the major components in the biomass and the genetic basis of these traits at the transcriptional level and support genetic improvement of sugarcane for fiber and/or sugar production.

Sample selection and preparation
Analysis was performed on 20 internodal samples, belonging to 10 sugarcane genotypes, which were classified into low and high fiber groups (Table 1). These samples were derived from a population previously described in [4]. The RNA extracted from these samples was also used in the construction of the SUGIT database (mentioned earlier), which was employed in the transcript expression analysis in this study. In brief, 5 genotypes were chosen for each of the low and high fiber groups, and for each of the 10 genotypes, 1 top (young) and 1 bottom (mature) internodal tissue samples were collected. The young tissues were defined as the fourth internodes from the top, while the mature tissues as the third internodes from the bottom. For each internodal sample, 4 representative culms from the same genotype were pooled to form 1 biological replicate. Internodes from the pooled culms were harvested, immediately cut into 0.5 cm-thick slices, followed by the removal of the rind and diagonal separation of the remaining pith into small 0.5 cm cubes, using a pair of secateurs. The whole excision process took about 1 min before snap-freezing in liquid nitrogen, and then samples were stored at -80˚C until RNA extraction. To avoid changes in the transcriptome due to the different collection times, the excision was conducted between 10 am to 2 pm on the same day. Prior to RNA extraction, frozen samples were pulverized, while kept frozen, into a fine powder in cryo-jars using a Retsch TissueLyser (Retsch, Haan, Germany), as described in [26]. The frequency of 25/S was used and the time was 1 min 30 s. Samples and the cryo-jars were kept in liquid nitrogen, except for when being ground in the TissueLyser. About 1 g of pulverized sample powder was used for RNA extraction.

RNA extraction
RNA extraction was conducted using a two-step protocol as described in [26] employing a Trizol kit (Invitrogen), followed by a Qiagen RNeasy Plant minikit (#74134, Qiagen, Valencia, CA, United States). The RNA quality, integrity and quantity were determined by a Nano-Drop8000 spectrophotometer (ThermoFisher Scientific, Wilmington, DE, USA), and on a 2100 Agilent Bioanalyser with a plant RNA NanoChip assay (Agilent Technologies, Santa Clara, CA, USA).

RNA-Seq and read data processing
About 3 μg of each of the 20 internodal RNA samples was used for indexed-library preparation (average insert size of 200 bp), employing the TruSeq stranded with Ribo-Zero Plant Library Prep Kit for total RNA library (Illumina Inc.). Each sample was sequenced in two lanes using an Illumina HiSeq4000 instrument to obtain 150 bp paired-end read data, at the Translational Research Institute, The University of Queensland, Australia. This led to more than 80% of the paired-end reads found to be overlapping. The read data was assessed by FastQC [27] for quality and adapter sequences. Read adapters and quality trimming was done in CLC Genomics Workbench v9.0 (CLC-GWB, CLC Bio-Qiagen, Aarhus, Denmark) with a quality score limit <0.01 (Phred Q score !20, equivalent to the accuracy of the base calling of 99%), allowing a maximum two ambiguous nucleotides, and removing reads below 75 bp. Only paired-end reads from each lane were kept for each sample and then concatenated into one data file prior to analysis. Since the sequencing resulted in un-balanced read data between the top and bottom internodal samples of each genotype, to reduce the sample size bias in the downstream analysis, the larger samples were down-sized using the seqtk toolkit [28], to obtain approximately uniform sample sizes across each genotype. Further quality control was conducted on the samples in each group and for all samples in the experiment. This was done by first performing counts-per-million (CPM) and then log2 data transformation of the of raw count matrix for each sample, using the script Perl_to_R from the Trinity package. This was used to assess the samples before analysis, and to remove any outliers or potential confounders within the replicates, which could cause batch effects [29], based on transcript expression level, the transcript pairwise comparisons, Pearson correlations and principal component analysis.

Transcript profiling and differential expression analysis
The pipeline was adapted from the Trinity v2.2.0 package [30,31], designed for transcript profiling and differential expression analysis without reference genome sequences at the transcript isoform level, employing the RNA-Seq by Expectation Maximization (RSEM) software v1.2.31 [32]. To estimate the abundance of each transcript isoform, the RNA-Seq data of each of the samples was aligned against the SUGIT database using the Perl script align_and_estimate_abundance.pl. The Bowtie v2.2.7 program [33] was used with options "-no-mixed-no-discordant-gbar 1000-end-to-end -k 200". The sam alignment output file was converted to a bam file and sorted by samtools-1.3.1 [34]. The sorted bam file was subjected to the program RSEM for quantification of transcript abundance at the isoform level by fractional correcting of read alignment based on the probabilities of the transcript isoforms the reads originally came from, using its iterative process [32]. The transcript abundance estimation result was used to build the matrix of raw read counts and normalized expression values, by the perl script abundan-ce_estimates_to_matrix.pl. The normalized expression measures included fragments per kilobase of feature sequence per million fragments mapped (FPKM) [35] and transcripts per million transcripts (TPM) [36]. The read count matrix for each sample combination was parsed using the script run_DE_analysis.pl for transcript differential expression analysis. To identify the differentially expressed (DE) transcripts, a negative binomial model was used to determine the relationship of the mean and the variance for the dispersion estimation in the DESeq2 package [37]. This was run in a pipeline using script analyze_diff_expr.pl. This DESeq2 package was suitable for general, data driven parameter estimation [38], allowing the selection of differentially expressed transcripts through a dynamic range of data and considering the unbalanced sequencing depth of the different samples. This differential expression analysis pipeline employed R program v3.2.0 [39], with the Bioconductor v3.4 [40] and the following packages: limma [41], ctc [42], and Biobase [43]. Finally, transcripts with a false discovery rate (FDR) adjusted p-value 0.05 and mean fold change !2 were marked as being significantly differentially expressed between the two groups compared. The DE transcripts were clustered by the package cluster 2.0.4 [44] and ape [45], then graphed using the function heatmap.2 in gplots [46]. Lists of up-regulated and down-regulated transcripts were generated for each of comparison.

Functional annotation of identified differentially expressed transcripts
For general function comparison, the DE transcripts were annotated against the Gene Ontology (GO) database, using Blast2GO v4.0.2 [47] with default settings. This used a separate BLASTX homology search result (BLAST+ v2.3.0) with maximum blast hits of 100 against the NCBI non-redundant protein database and an e-value of 1e-10. The GO terms were assigned to each of the DE transcripts, and then the GO terms for each up-regulated and down-regulated transcript sets were extracted, enriched and compared by WEGO [48]. Only enriched GO terms with a p-value cutoff of 0.05 (considered being significant from the Pearson Chisquare test) were used in assessing the over-represented GO terms between the up-and downregulated transcript sets. For further functional analysis, all DE transcripts were subjected to the program Mercator for automated sequence annotation [49,50] and functional classifications (bins) were assigned to the DE transcripts. This functional bin annotations were based on: (1) the BLASTX homology searches with a cutoff of 80% against the Arabidopsis TAIR Release 10, PPAP SwissProt/UniProt Plant Proteins, TIGR5 rice proteins and Uniref90; (2) the reverse PSI-BLAST (RPS-blast) searches against the Clusters of orthologous eukaryotic genes database (KOG), conserved domain database (CDD); and (3) InterProScan search against the protein domain databases. The functional bins were visualized and analysed by the ImageAnnotator module, MapMan v3.5.1R2 [51]. This annotation pipeline using Mercator and Map-Man (termed as MapMan annotation) assigned the transcripts into the most appropriate bins and reduced the multiple times the transcripts were represented in many bins, which differed from the GO term annotations [51]. Arabidopsis and rice were used as the main sources of information, since these two are still amongst the best annotated plant genomes.
Transcripts specifically involved in accumulation of sugar and fiber DE transcripts that were potentially involved in the accumulation of the major sugar and fiber components, including cell-wall metabolism, carbohydrate metabolism, photosynthesis, and phenylpropanoid pathway were investigated by using the MapMan annotation bins.

Validation of DE genes using quantitative real-time PCR (qPCR)
To validate the RNA-Seq differential expression, a reverse transcription followed by qPCR was conducted on 4 μg total RNA from each of 8 selected samples including 4 young and 4 mature tissues from 2 low fiber (QC02-401, Q200) and 2 high fiber (QBYN04-26041, QN05-803) genotypes. Three reference genes including glyceraldehyde-3-phosphate dehydrogenase (GAPDH), ubiquitin (UBQ) and clathrin adaptor complex (CAC) were selected for internal normalization, based on published results in literature [52][53][54][55]. These genes were elucidated, from the RNA-Seq data, by their stable ubiquitous expression across all the samples. Multiple reference gene normalisation method [56,57] was applied to the qPCR quantification cycle (Cq) values, and analysed with qbase+ software (VIB, Flanders, Belgium). A total of 8 putative DE genes were chosen for qPCR experiments. Alignment to the putative gene with least symmetry to the multitude of sequence isoforms was ascertained with Clone Manager 9 (Sci-Ed Software, Denver, US). Primer design was accomplished using Primer3 software in NCBI/ Primer-BLAST [58]. List of genes and primer sequences are disclosed in S1

Data analysis
All Venn diagrams were generated by the online tool [60].

RNA-Seq summary
To investigate the differential expression of transcripts between the young and mature tissues of the sugarcane culm, and between the low and high fiber genotypes, an RNA-Seq experiment was conducted. Two groups of 5 low fiber and 5 high fiber genotypes were used, in which, for each genotype, 1 top internodal tissue sample and 1 bottom internodal tissue sample were collected by pooling respective tissues from 4 different sugarcane culms. The total number of trimmed reads obtained for each sample ranged from 6 to 53 million ( Table 2). The total number of RNA-Seq reads for all the samples from the low fiber group was 293 million, while that of the high fiber group was 148 million. This made the total read data used in this analysis 441 million reads. The percentage of reads mapped to the SUGIT database ranged from 70 to 82%. In most cases (except for genotype 6), the bottom internodal samples had a higher percentage of reads mapped to the references compared to the top internodal samples (mean of 79.5% and 74.2%, respectively). Amongst 107,598 transcriptome reference sequences, the proportion of the transcripts that had reads mapped to (FPKM>0) ranged from 57% to 76%. This result indicates the proportion of total active transcripts originating from the culm, as the SUGIT reference database was derived from leaf, internode and root tissues.

Global expression analysis of sugarcane RNA-Seq data
In this analysis, all transcript isoforms were considered, but only transcripts with FPKM greater than 0.3 were counted as significantly expressed as suggested in [62] for RNA-Seq data.
Overall, amongst the 20 samples studied, the number of transcript isoforms having FPKM were uniquely expressed in the top and bottom tissues, respectively. When the total unique expressed transcripts were separated according to low and high fiber genotypes (Fig 2b), the majority of transcripts (83,421) were found to be common between the two groups, only 4,659 and 5,601 transcripts were unique to low fiber and high fiber groups, respectively. In general, when all the samples were considered, there was a similar number of transcripts expressed in the two groups of low and high fiber genotypes in this study.   that each individual sample had a fraction of unique transcripts that was not detected in the others. Since these samples were derived from the same tissue type, at the global expression level, this could mean that, if one transcript was expressed in sample 1 but not detected in sample 2, it could be that it was not expressed at all in sample 2, or it could mean that it was downregulated in sample 2 to a low expression level that was not detected (this could also be affected by a low sequencing depth).

Transcript differential expression analysis
The differentially expressed (DE) transcripts identified by the package DEseq2 with a fold change !2 and FDR adjusted p-value 0.05 were summarized in Fig 3a. In total, 1,249 DE transcripts were found between all 10 top internodal samples and all 10 bottom internodal samples (referred to as all-genotypes T-B); 572 DE transcripts between the 5 top internodal  between the bottom internodal samples of the low and high fiber genotypes (bottom-tissues L-H), and between the top internodal samples of the low and high fiber genotypes (top-tissues L-H), respectively (Fig 3c). Of all identified DE transcripts among the 3 comparisons of low and high fiber genotypes, 16 were found to be common among the three, 16 were common between only all-tissues L-H and bottom-tissues L-H, while 61 were common between only all-tissues L-H and top-tissues L-H.
Transcript differential expression between the young and mature tissues in the sugarcane culm  (Fig 4). In this analysis, since biomass accumulation is a highly regulated process that involves many quantitative trait loci (QTLs) [7,14,19,22], and some of the transcripts were not well annotated, all GO terms that were associated with the up-regulation and down-regulation in the 3 comparisons were considered, enriched and highlighted the over representative functions in each comparison. This analysis revealed that the GO terms that were involved in the up-/ down-regulation between the top and bottom internodal tissues included those in various cellular components, molecular functions, and biological processes. The most abundant GO terms in the 3 comparisons were the cell and organelle part (in the cellular component category-CC), binding and catalytic (in molecular function category-MF), and cellular process and metabolic process (in biological process category-BP), shown in the right panel in Fig 4a-  There were 151 transcripts (within the total 1,649 unique DE transcripts from 3 comparisons) that were directly involved in the accumulation of sugars and fiber, as summarized in Table 3, details in S3 Table. Amongst the identified DE transcripts, there were transcripts associated with carbohydrate metabolism (9 transcripts), photosynthesis (25), cell-wall proteins (8), cellulose synthesis (29), hemicellulose synthesis (6), cell-wall modification (3), cellwall precursors (6), lignin biosynthesis (58) and dirigent proteins (7). Notably, for celluloserelated transcripts, 24 transcript isoforms were annotated as CesA and CesA-like proteins and 4 as COBRA-like protein precursor. Hemicellulose-related transcripts were those associated with endo-1,4-beta glucanase, plant glycogenin-like starch initiation protein 1 (PGSIP1), GT43 family glycosyltransferases, IRX14 and IRX9 genes. The transcripts for cellwall precursors included those of UDP-glucose 6-dehydrogenase (EC: 1.  Of the 151 identified transcripts related to sugar and fiber accumulation, all except 5 transcripts, were down-regulated in the bottom internodal tissue. There were 9 DE transcripts that were common among the 3 comparisons, including CesA, IRX9 gene, 4CL-1, two CCR isoforms, and four PAL isoforms. There were more DE transcripts that were common between all-genotypes T-B and only the low-fiber T-B comparisons (40 transcripts), than between allgenotypes T-B and only the high-fiber T-B comparisons (6 transcripts). The GO analysis also indicated that identified DE transcripts were involved many cellular components, molecular functions and biological processes. This also suggests that the up/down-regulation also involved the same molecular functions and biological processes (Fig 6a-6c).
The significant GO term that had a p-value <0.05 in all-tissues L-H comparison was catabolic process GO:0009056, which belonged to the metabolic process in the BP category (up in low fiber genotypes). In the bottom-tissues L-H comparison, the significant GO terms were membrane-enclosed lumen GO:0031974 (down in the bottom tissues of the low fiber genotypes), organelle lumen GO:0043233 (down) (in the CC); metabolic process GO:0008152 (down), macromolecule metabolic process GO:0043170 (down), cellular metabolic process GO:0044237 (down), and primary metabolic process GO:0044238 (down) (in the BP). In the top-tissues L-H comparison, none were significant at p-value <0.05, due to a low number of DE transcripts. The Mapman annotation (S2 Table and

Differentially expressed gene validation by qPCR
To confirm the gene expression measured by RNA-Seq analysis, 8 putative genes differentially expressed between the young and mature tissues and between the low and high fiber genotypes were validated by qPCR. These include genes encoding COBRA-like 5 protein precursor, CAD, CCR, CESA, CESA-like C5, dirigent protein, a putative family 43 glycosyl transferase (IRX9 gene) and sucrose synthase 1 (selected from Table 4 and S3 Table). The reliability of the RNA-Seq data was confirmed by qPCR of 8 selected DE genes, in which each selected gene was shown as a single band at their respective size (data not shown). A significant correlation (r = 0.54, p<0.001, n = 64, df = 62) was found between the two data sets (S4 Table).

Discussion
This study represents an effort to identify the transcripts involved in sugar and fiber accumulation by conducting differential expression analysis between the young and mature internodal tissues in the sugarcane plant, and between two groups of low and high fiber genotypes. The comparison between the young and mature internodes could potentially pinpoint the transcripts that are associated with carbon partitioning to the major biomass components in the sugarcane culm over time. The comparison between the low and high fiber genotypes could reveal the transcripts associated with sugar and fiber accumulation between the two groups. It is important to note that the two groups of low and high fiber genotypes used in this study were respectively equivalent to two groups of high and low sugar genotypes, since on a dry biomass basis, these two components were negatively correlated [4]. Overall, 4 out of 5 genotypes in the low fiber group were commercial sugarcane genotypes, while 4 out of 5 genotypes in the high fiber group were high fiber genotypes derived by introgression of genes from the wild Saccharum spontaneum relatives. The fiber content in these genotypes ranged from 31% to adjusted p-value. (1) Between all pooled samples from low fiber genotypes and high fiber genotypes.
(2) Between top tissues of low and high fiber genotypes. 50% of total dry mass (total solids which include sugars, fiber and others in the sugarcane culm biomass).

Transcript differential expression between the young and mature tissues
According to the literature, sugar and fiber (cellulose, hemicellulose and lignin) in sugarcane are largely developmentally regulated, and consequently, it is expected that genes/transcripts involved in their biosynthesis are also developmentally regulated. In the sugarcane culm, the cell elongation and primary cell-wall deposition commences in internode 1, followed by the deposition of secondary cell-wall in internode 2, and then suberisation and sucrose accumulation [20]. Lignification has also been shown to start early in internode 1 [63], and continues to increase until the fifth or sixth internode. After that, the lignin content appears to be similar between tissues [20,22]. At the early developmental stage, the internode acts as strong sink for sucrose, supporting the cell-wall synthesis and cell expansion, without an increase in sucrose concentration [14]. The accumulation of sucrose happens later, once the elongation ceases in maturing internodes and reaches a maximum in mature internodes [63]. After this, a major increase in cell-wall thickening (internode elongation) and lignification in the maturing tissues has been reported [6]. It is, therefore, the transcript expression associated with these changes in the culm development that is expected to be regulated, accordingly. Early studies on Arabidopsis thaliana showed that CesA1, CesA2, CesA3, CesA5, CesA6 and CesA9 function in the biosynthesis of the primary cell-wall [64,65], while CesA4, CesA 7 and CesA8 take part in the biosynthesis of the secondary cell-wall [66,67]. Most CesA transcripts are down-regulated in the mature internodes compared to the young internodes, which reflects their roles in this type of tissues of the plant [68], however, CesA3 and CesA5 were found to be up-regulated, which is thought to be necessary for cell-wall maintenance [14,20] or could be for radial growth in the mature tissue. The pattern of expression of the CesA-like transcripts, on the other hand, being highly abundant in the immature tissues, have been reported to not follow the pattern observed for the CesA in the culm [20]. A recent study [14] suggests that primary cell-wall synthesis in the sugarcane culm could occur throughout the sugarcane culm but is particularly important in the storage parenchyma of the maturing culm, where the internode is fully expanded and actively storing sucrose.
In this study, most of the 151 identified DE transcripts directly involved in the sugar and fiber metabolism, were up-regulated in the top internodal tissues compared to the bottom internodal tissues. These transcripts included those associated with the major carbohydrates, photosynthesis, several CesA and CesA-like transcripts, hemicellulose, cell-wall proteins, cellwall precursors, major enzymes in the lignin pathway and dirigent proteins. The up-regulation in this tissue-based comparison represents the active growth and metabolism in the young internodal tissues where all the transcripts involved were highly expressed, compared to the less active mature internodal tissues of sugarcane. The top internodal samples were derived from the fourth internodes from the top, while the bottom internodal samples were from the third internode from the bottom of 12-month old sugarcane plants. Considering that each sample was pooled from four different plants and that there was plant-to-plant transcript expression variation, this could mean that the top internodal samples represented both the immature and maturing tissues, while the bottom internodal samples could represent mature tissues. Therefore, there would be highly expressed transcripts that regulate the cell-wall synthesis, lignification and sugar accumulation in the top internodal samples, whereas, these would be less active in the bottom internodal samples where most processes would be stopped or slowed down. That would also explain the up-regulation of the dirigent proteins and downregulation of only 5 transcripts in the top internodal tissues compared to the bottom tissues. The dirigent proteins are hypothesized to play roles in scaffolding of lignin and biosynthesis of lignan [19,69,70] and dirigent domain-containing proteins are involved in the patterning of lignin-based Casparian strip in the root of A. thaliana [71]. However, some studies argued that lignin biosynthesis may not be handled by dirigent proteins, since lignin is not optically active and lignin biosynthesis is chemically controlled [72][73][74]. These proteins and a group of ligninrelated enzymes (PAL, COMT and CCoAOMT) were found to be up-regulated in the maturing tissues when compared to the immature [19].

Differential expression between the low and high fiber genotypes
This study identified 23 transcripts (including 6 dirigent proteins), out of 555 unique DE transcripts between low and high fiber genotypes, which were directly involved in sugar-and fiber-related pathways. The fewer DE transcripts in the fiber content-based analysis compared to the tissue-based analysis could reflect that there were many DE transcripts between these 2 groups whose fold change was at a relatively low level and which was not detected as differential expression. The cutoff for differential expression in this study was set at a minimum of two-fold, while the increase/decrease in fiber content between the two groups of low and high fiber genotypes could be from a combination of many up-or down-regulations involving transcript isoforms at a fold change less two-fold cutoff, as in Fig 6. In the previous tissue-based analysis, the difference between the samples was attributed to the developmental stages in which some of the pathways were significantly active/inactive between the stages, while in this fiber content-based comparison, it could be that all pathways were comparably active between the two groups being compared, except those were identified as DE. These identified DE transcripts could reflect the most important enzymes and pathways that played key roles in the processes making the difference in the fiber content. These included a number of transcripts that involved in carbohydrate metabolism, photosynthesis, cell-wall metabolism, monolignol metabolism and dirigent proteins, as presented in Table 4. The other possibility could be that the selected genotypes were not very contrasting in terms of fiber content, resulting in a narrow difference between the expression of the low and high fiber genotypes. These factors, together with the multiple genes/transcripts controlling the accumulation of sugars and fiber as discussed earlier, could explain the low number of DE detected in this analysis.
In relation to carbohydrate metabolism, callose synthase and alpha-amylase were up-regulated in the top tissues of the low fiber compared to that of the high fiber genotypes, while granule-bound starch synthase 1b and sucrose synthase 1 (Susy) were down-regulated in the bottom tissues of low fiber genotypes. Callose synthases are known to regulate the biosynthesis of callose, a cell-wall polysaccharide found in many higher plant species. Callose is a β-1,3-glucan, and has important roles in many developmental processes, including cell division and growth, tissue differentiation, cell plate formation, pollen development, plasmodesmata and response to stress [75][76][77]. Alpha-amylase on the other hand, plays roles in the starch degradation in the plant, breaking down the starch for other enzymes to act [78]. It is active when the stored carbohydrates are diverted back to the metabolism when it is required for plant development. Granule-bound starch synthase 1b is responsible for synthesis of starch (amylose) and the final structure of amylopectin [79], while Susy is the major enzyme of sucrose metabolism in sugarcane. The results indicate that, while all metabolic processes were probably happening in tissues of both groups of low and high fiber genotypes, there could be more active processes related to hemicellulose synthesis and starch degradation in the top tissues of low fiber, and more of starch synthesis and sucrose metabolism-related processes in the bottom tissues of high fiber genotypes.
With respect to photosynthesis, there was one up-regulated transcript identified in the top internode of low fiber which associated with one helix protein homologous to cyanobacterial high-light inducible protein. Between the bottom tissues of low and high fiber genotypes, ALDP fructose-bisphosphate aldolase was up-regulated while phosphoglycerate kinase (PGK) was down-regulated in the bottom tissues of low fiber genotypes. ALDP fructose-bisphosphate aldolase is one of the enzymes of the Calvin cycle and is predicted to have the potential to control photosynthetic carbon flux and biomass yields [80], while PGK catalyzes 1,3-biphosphogrycerate and ADP to form 3-phosphoglycerate and ATP [81]. Taking this together with 25 transcripts that were annotated as being associated with photosynthesis in tissue-based analysis (see S1 Fig), it was surprising to detect transcripts belonging to photosynthesis pathway in these samples, since they were derived from the sugarcane culms, a mostly non-photosynthetic tissue, and the rind of the culms was removed during sample excision. None of the data in the literature implies that significant photosynthesis occurs in this type of tissue. Photosynthesis requires a functional photosystem (PS) II and PSI system, a carbon fixation and the reductive pentose phosphate (RPP) pathway [82,83]. It is noteworthy that the RPP pathway is present in all plastids, which are found in sugarcane culms, as it is crucial for many vital reactions. Some of the enzymes detected are part of the RPP pathway and this could be explained in terms of their roles in converting imported triose-P to starch, lipids and other cellular constituents in the sugarcane culm [9]. However, there were some up-regulated photosynthetic enzymes detected in the DE analysis which belonged to the PSI and PSII, which cannot be explained without further research. It could also be that the photosynthesis was located in the green rind of the sugarcane culms which were not completely removed from the samples.
In the cell-wall metabolism, there were two transcripts (UDP-arabinose 4-epimerase and 2-phosphoglycerate dehydratase) up-regulated in the bottom tissues of low fiber genotypes. UDP-xylose 4-epimerase is the enzyme that catalyzes a reversible reaction between UDP-arabinose and UDP-xylose [84], while 2-phosphoglycerate dehydratase catalyzes the conversion of 2-phosphoglycerate to phosphoenolpyruvate [85]. CesA6 was found down-regulated in the top tissues of low fiber genotypes, and glyceraldehyde-3-phosphate dehydrogenase (GAP-DH) was down-regulated in bottom tissues of low fiber genotypes. CesA6 is critical for cell elongation [86]; and GAP-DH, which can be induced or repressed in sugarcane under abiotic stress, i.e. salt stress tolerance [87], was previously reported to be highly abundant in maturing culms [18]. This result indicates that, apart from the similarly active processes between the two groups, processes related to internode elongation and cellulose synthesis were more active in the top tissues of the high fiber genotypes compared to the top tissues in the low fiber genotypes, while those related to hemicellulose synthesis were more active in the bottom tissues of the low fiber genotypes compared to that of the high fiber genotypes.
The down-regulation of the dirigent proteins and two enzymes in the lignin pathway (CAD and COMT), together with the up-regulation of the CCoAOMT-5 in the low fiber genotypes, reflect their functions in lignin biosynthesis may not the same at a given developmental stage. CAD and COMT are involved in the final steps in the monolignol biosynthesis, while, COMT and CCoAOMT also take part in the production of phenolic compounds [21]. Therefore, while they were present in the lignin pathway, the up-or down-regulation of these enzymes could also be interlinked to other pathways. In comparison of the low and high lignin sugarcane genotypes, PAL-3 and C4H-1 were up-regulated in the low lignin genotype while COMT-1 was up-regulated in high lignin genotype [21]. Only two transcripts related to lignin, C4H and CCoAOMT, were found up-regulated in a study on 11 high-lignin guineagrass genotypes [88]. As discussed in [21], after gathering the results of differential expression of several lignin genes, it was suggested that lignin deposition might be regulated at the transcriptional level, but also the post-transcriptional levels and by enzyme catalytic activities. RNA-Seq data might not reflect all events in the lignin pathway, and therefore the studies at post-transcriptional levels (i.e. proteome and metabolome) would be required to understand further lignin metabolism in the sugarcane plant.

Notes on variation amongst diverse genotype collection and validation of RNA-Seq results
Initially, a diverse collection of 40 samples (from 20 genotypes) were included in the design of this experiment, and it was hypothesized that even though there were dynamic expression patterns and noises within the sample groups of low and high fiber genotypes, the transcripts that play key roles in this difference would always be detected as up-or down-regulated consistently and distinct from the random noises. However, this hypothesis might only work with the traits of high heritability that are controlled by single genes or a few key genes, and with a detectable differential expression level, which can be distinguishable from sample noises. The DE transcript detection soft-wares or models, which are based on the FDR or p-values, normally favor the high fold change due to the un-certainty in reporting the DE transcripts of a low fold change, which could also be sample noises. Biomass traits such as those of sugar and fiber, as mentioned earlier, are controlled by a number of QTLs and the regulation is shown not to be straightforward. This could be reflected in the principal component analysis (PCA) of the transcript expression profiles of the samples. In this PCA, it is suggested that separating samples according to their tissues types (top-bottom tissues) or fiber content (low-high fiber) did not explain the major variation between the respective groups, and there were other factors/dimensions that contribute to the separation of the sample groups used in this study. It was shown in the scatter-plots (Fig 8), that the samples were not tightly clustered according to the tissue types nor fiber groups they belonged to. Overall, the separation according to their tissue types represented more the total variation in the data in comparison to that between the fiber groups, which agreed with the results where more DE transcripts were observed for the tissue-based comparisons than that of the fiber content-based comparisons. The first component (PC1) explained 20-29% of the total variation between the top and bottom internodal tissue groups, while the second component (PC2) accounted for 9-20% of the total variation. Between the low and high fiber groups, the PC1 explained 17-19% of total variation and the PC2 explained 13-15%. Notably, two groups of top and bottom tissue samples were separated into two clusters, except for samples from genotype 6 whose bottom internodal sample was situated in the same cluster as its top internodal sample (top panels Fig 8a and 8b). This was consistent with the global expression analysis which showed these two samples had similar expression patterns, which differed from the other genotypes whose bottom internodal samples had lower expression levels compared to the counterpart top internodal samples. There were less clear clusters between the two groups of low and high fiber genotypes. Overall, the two components explained about 30-48% total variation between the top and bottom internodal tissue samples, and~32% of total variation between the low and high fiber genotypes. This agrees not only with the assumption that there were multiple factors responsible for the variation between the groups studied in this analysis, but also with the patterns in the dendrograms of Pearson's pairwise divergence between the groups. The PCA and Pearson analysis could also suggest that there was variation amongst the samples in each group of tissue types or fiber content (bottom panels Fig 8a-8c and 8d-8f). This could be due to the fact that the genotypes were derived from different parental backgrounds, hence each of the genotypes had a unique expression pattern, possibly reflecting their chromosome number and the allelic contribution. The expression of each allele in the homologous group could vary in different environment, genotype or genotype-specific ripening time, which even when the samples were harvested at the same point of time since planting, might not have corresponded to developmental stage, transcriptome-wise. It would be also that the same fiber content in sugarcane results from the up-or down-regulation of different sets of transcripts, which could account for the same contribution towards the increasing/decreasing in fiber content. Using the correlation analysis, the samples whose expression patterns appeared to be very different from the rest in each group were removed, retaining the final collection of 10 genotypes for which the results in this study was based on.
Differentially expressed genes identified by microarrays or RNA-Seq experiments have often been confirmed by qPCR analysis [89,90]. However, this has been shown to be difficult and un-reliable, especially in polyploid genomes, such as the allohexaploid wheat genome [91]. This is due to the fact that polyploids contain homo(eo)logous and paralogous gene copies, resulting in expression of different but related transcripts (i.e., different isoforms sharing the same exons/introns), which are not discriminated by the technique. It is also likely to be the case for the complex polyploid sugarcane genome, in which multiple homo(eo)logous gene copies (predicted up to 12 [92]), multiple promoter variants and transcript isoforms are likely to exist [14]. Our results showed a significant correlation (r = 0.54, p <0.001) between the RNA-Seq and qPCR results. This is the expected result as a very close correlation would not be expected due to the large number of different closely related genes and transcript isoforms in the highly polyploid sugarcane genome. This could result in amplification of multiple unknown copies of similar genes to the target in the PCR. Poor confirmation rates have been also reported in various studies on polyploid species with inter-platform discrepancies [91]. Designing a homo(eo)logous-and paralogous-specific validation array is complicated [14], RNA-Seq provides a more reliable way to analyse gene expression in a complex polyploid genome like sugarcane, due to the ability to distinguish the many closely related gene transcripts. The RNA-Seq analysis in this study involves a high degree of replication and sample pooling within each treatment providing strong statistical support for the differentially expressed genes identified.

Conclusion
In this study, the newly constructed reference transcriptome, the SUGIT database, which was generated by the PacBio full-length isoform sequencing [24] was used for differential expression analysis. The results showed that 70 to 82% of the total RNA-Seq data to be aligned against this database, and 57-76% of the transcriptome had reads mapped to it, suggesting that the SUGIT database could be used as the reference database for transcript profiling and differential expression analysis. The proportion of RNA-Seq reads mapped is consistent with figures found in the literature, while the percentage of transcripts that had reads mapped indicated the proportion of transcripts that were expressed in the culm tissues out of the total transcriptome database derived from leaf, culm and root tissues.
The accumulation of sugar and fiber in sugarcane is a highly regulated process, and likely involves many genetic factors and QTLs. The differential expression analysis showed that there were many DE transcripts that were directly related to sugar and fiber metabolism, while others were not directly linked but could belong to supporting pathways. In this study, a total of 1,649 differentially expressed transcripts were identified, within which only 151 transcripts were directly related to sugar and fiber metabolism, when comparing between the young and mature internodal samples of the sugarcane culm. Most of these transcripts were up-regulated in the young internodes compared to the mature internodes, highlighting the growth phase of the two tissues being compared. When comparing the two groups of low and high fiber genotypes, there were a total of 555 differentially expressed transcripts identified, of which only 23 transcripts were related to the accumulation of sugars and fiber. This could indicate that the differential expression level between the two groups might be less than two-fold, and due to multi-factors contributing to fiber content and biomass accumulation.  Table. qPCR results and correlation analysis with the RNA-Seq data. (XLSX)