Generation, Annotation and Analysis of First Large-Scale Expressed Sequence Tags from Developing Fiber of Gossypium barbadense L

Background Cotton fiber is the world's leading natural fiber used in the manufacture of textiles. Gossypium is also the model plant in the study of polyploidization, evolution, cell elongation, cell wall development, and cellulose biosynthesis. G. barbadense L. is an ideal candidate for providing new genetic variations useful to improve fiber quality for its superior properties. However, little is known about fiber development mechanisms of G. barbadense and only a few molecular resources are available in GenBank. Methodology and Principal Findings In total, 10,979 high-quality expressed sequence tags (ESTs) were generated from a normalized fiber cDNA library of G. barbadense. The ESTs were clustered and assembled into 5852 unigenes, consisting of 1492 contigs and 4360 singletons. The blastx result showed 2165 unigenes with significant similarity to known genes and 2687 unigenes with significant similarity to genes of predicted proteins. Functional classification revealed that unigenes were abundant in the functions of binding, catalytic activity, and metabolic pathways of carbohydrate, amino acid, energy, and lipids. The function motif/domain-related cytoskeleton and redox homeostasis were enriched. Among the 5852 unigenes, 282 and 736 unigenes were identified as potential cell wall biosynthesis and transcription factors, respectively. Furthermore, the relationships among cotton species or between cotton and other model plant systems were analyzed. Some putative species-specific unigenes of G. barbadense were highlighted. Conclusions/Significance The ESTs generated in this study are from the first large-scale EST project for G. barbadense and significantly enhance the number of G. barbadense ESTs in public databases. This knowledge will contribute to cotton improvements by studying fiber development mechanisms of G. barbadense, establishing a breeding program using marker-assisted selection, and discovering candidate genes related to important agronomic traits of cotton through oligonucleotide array. Our work will also provide important resources for comparative genomics, polyploidization, and genome evolution among Gossypium species.


Introduction
Cotton is one of the most important economic crops that provides prevalent natural fiber for the textile industry. The crop is widely cultivated in more than 80 countries, with China, India, the United States of American, and Pakistan being the top four cotton producers (http://www.cotton.org/econ/cropinfo/cropdata/ rankings.cfm). Cotton fiber is a single-celled seed trichome, developed from epidermal cells of the ovule. Fiber development consists of four distinct but overlapping stages: fiber cell initiation, expansion/primary cell wall (PCW) synthesis, thickening/secondary cell wall (SCW) synthesis, and maturation/dehydration [1,2]. Epidermal cells differentiate into fiber cells at approximately the time of anthesis (from 23 to 5 days post-anthesis [dpa]); as a result, only about 30% of the epidermal cells will successfully differentiate into fibers [1,3]. The initiation stage persists for about 8 days; however, lint fibers usually initiate on the day of anthesis, and fuzz fibers develop at a later stage. During fiber elongation (3-20 dpa), cells elongate rapidly without branching depending on the turgor of the center large vacuole, until the fiber reaches its final length (30-40 mm) [4]. At the same time, PCW begins to synthesize at the surface area of the fiber cell. This is then followed by the synthesis of SCW through the massive deposition of cellulose at 15-40 dpa. The final stage of fiber development, maturation/dehydration (40-60 dpa), is associated with the accumulation of mineral content and decrease in water potential [5]. At maturity, cotton fiber contains about 89% cellulose as is approximately 15 mm in thickness. So, cotton fiber is an excellent model system for studying plant cell elongation, cell wall development, and cellulose biosynthesis [2].
Cotton belongs to the genus Gossypium of the family Malvaceae, which consists of 40-45 diploid (2n = 2x = 26) and 5 allotetraploid species (2n = 4x = 52) [6,7]. The diploid species are grouped into eight genome groups, designed A through G and K. Most modern cotton varieties are allotetraploid species, Gossypium hirsutum L. or upland cotton (Gh, AD 1 genome) and Gossypium barbadense L. (Gb, AD 2 genome). G. hirsutum, especially, produces over 90% of the world's fibers because of its higher yield and wider environmental adaptation. However, G. barbadense contributes 8% of the world's fibers with its superior properties of silkiness, luster, long staples, and high strength. Thus, G. barbadense is a good gene pool for improving the upland cotton fiber quality.
Genomics approaches have been applied to explore the key or predominant expression genes and the mechanism of fiber development in cotton. Several analyses using expressed sequence tags (ESTs) and microarray methods have been performed [8][9][10][11][12][13].
Arpat et al. [9] were the first to take a genomic approach to studying the fiber transcriptome of G. arboreum L., a diploid species, at the elongation stage (7-10 dpa). Through in silico expression analysis of 46,603 ESTs, they found that the rapidly elongating fiber cells exhibited significant metabolic activity, cell wall structure, and biogenesis, with the cytoskeleton and energy/ carbohydrate metabolism the major functional groups. In a microarray study, they also identified 2553 ''expansin-associated'' genes down-regulated and 81 ''cell wall biogenesis and energy/ carbohydrate metabolism-related'' genes up-regulated during the developmental switch from PCW to SCW syntheses.
In another excellent and comprehensive work by Udall et al. [10], approximately 185,000 Gossypium EST sequences were amassed from 30 cDNA libraries. By sequence comparisons, they found that many cotton exemplars appeared to be involved with transcription, including the high-level categories of transcription factor activity, RNA binding, DNA binding, and nucleotide binding. The most abundant types of Pfam transcriptional domains were MYB DNA-binding, APETALA2, auxin-induced, WRKY DNA-binding, and RING zinc finger domains.
A full-length G. hirsutum L. immature ovules (23 dpa to 3 dpa) cDNA library was constructed and 32,789 high-quality ESTs were generated. By comparing with the existing ESTs and expression analysis, the results determined that transcription factors and phytohormonal regulators were accumulated during early stages of fiber cell development in allotetraploid cotton [11].
By mining the data of cDNA library and generation of ESTs, many actin and tubulin interrelated genes were cloned and shown to have critical roles in the process of fiber development, such as GhTUB1, GhACT1, GhPFN1, GhTUA9, and GhADF1 [8,[14][15][16][17][18]. Shi et al. [19] found that 102 metabolic pathways were upregulated during the fast fiber-elongation period, especially ethylene biosynthesis. Ovule culture in vitro indicated that ethylene played a major role in promoting cotton fiber elongation by increasing the expression of sucrose synthase, tubulin, and expansin genes. Recently, many ethylene-responsive species or relevant genes were reported to regulate fiber growth [20][21][22][23].
A global gene expression profiling study at different stages of fiber development was undertaken on two cotton species for fiber, G. hirsutum L. and G. barbadense L.. The result showed that secondary metabolism, pectin synthesis, and pectin modification genes were the most statistically significant and differentially expressed categories between the two species and the final fiber property differences between Pima and Upland cotton may largely be determined during early fiber development [24].
Wendel et al. launched important work to study the evolution of spinnable cotton fiber by comparing the expression profile between the cultivar and wild G. barbadense [13], G. longicalyx and G. herbaceum [25], using the microarray method. The result showed that domestication appeared to enhance modulation of cellular redox levels and avoided or delayed the stress-like processes. The cultivar prolonged fiber growth with up-regulation of signal transduction and hormone-signaling genes and down-regulation of cell wall maturation genes. Recent studies also indicated that hydrogen peroxide was important for fiber initiation and elongation [20,26,27].
EST large-scale sequencing projects for cotton have been done in several laboratories [9][10][11]19] We applied genomics approaches to investigate the transcriptional regulation of G. barbadense fiber development. A normalized fiber cDNA library (from 22 to 25 dpa) of G. barbadense cv. 3-79 (the genetic standard line) was constructed by saturation hybridization with genomic DNA [12]. Random sequencing of clones from the cDNA library generated 10,979 high-quality ESTs, which were assembled into 5852 unique sequences, consisting of 1492 contigs and 4360 singletons. The unique sequences were assigned putative functions based on sequence similarity and Gene Ontology (GO) annotations. The genes governing binding and catalytic activity were more abundant and fiber development was active in the metabolic process, especially in carbohydrate, amino acid, energy, and lipid metabolisms. Moreover, the functions of motif/domain-related cytoskeleton and redox homeostasis were enriched. Furthermore, putative genes involving cell wall biosynthesis and transcription factors were also identified in the fiber development of G. barbadense. Finally, the relationships among cotton species with other model plant systems were analyzed. The datasets will benefit the studying of fiber development mechanisms of G. barbadense and also be important resources for comparative genomic studies among Gossypium species.

ESTs sequencing and assembly
Approximately 12,000 clones were successfully single-pass sequenced from the 59-end. After removal of vector, poly-A, and contaminating microbial sequences as well as those less than 100 bp in length, 10,979 ESTs passed the quality control for high confidence base call (Q20), and were deposited in GenBank under the accession no. GR706801-GR716890, EE592400-EE593286, EH122780, and EH122781. The average length of all 10,979 ESTs was 643 bp; the longest was 1184 bp. The ESTs were clustered and assembled into 5852 unique sequences (putative unigenes), consisting of 1492 (25.5%) contigs and 4360 (74.5%) singletons, and the EST redundancy of this library was 53.3%. The average length was 706 bp for unique sequences, 915 bp for contigs, and 633 bp for singletons ( Table 1). The lengths of 488 (32.7%) contigs were longer than 1000 bp and the largest percentage of unigenes were 800-899 bp (1327, 22.7%). The detailed length distributions of ESTs, unigenes, contigs, and singletons were shown in Table S1 and Figure S1. The mean G/C content of unigenes was 43.1%, which was approximately equivalent to Arabidopsis (43.2%) and much lower than rice (51.4%) [28,29].
To determine if the putatively enriched genes were highly expressed truly, we performed reverse transcription polymerase chain reaction (RT-PCR) (File S1 and Table S3). Results shown in Figure S2 confirmed the high level expression for the genes represented by the ten contigs, which suggested they were all truly abundant and expressed in a tissue or a developmental stage at least in G. barbadense or G. hirsutum.

Functional annotation and classification
The 5852 unique sequences were used in a blastx search against the non-redundant protein sequence (nr) database in GenBank. A total of 4862 (83.1%) unigenes had significant hits (E-value#10 25 ) (Table S4). However, of the 4862 significant hits unique sequences, only 2165 (37.0%) showed similarities to proteins of known function, 2697 (46.1%) showed similarities to predicted proteins of unknown function, and 990 (16.9%) showed no significant similarity to any sequences contained in the nr database (  (Table S6). However, of the 2697 predicted proteins of unknown function (Table S7), 1094 (40.6%) were from V. vinifera, 1014 (37.6%) were from P. trichocarpa, and 336 (12.5%) were from R. communis (Table S8). Then, 990 sequences with no blastx hits were searched for similarities at the nucleotide level (non-redundant nucleotide sequence nt database in GenBank); only 205 sequences shared homology with the genes registered in the NCBI nt database (Table S9), including 106 sequences similar to cotton (Table S10). However, 785 unigenes (13.4%) remained unidentified, which could be considered as novel or specific genes in G. barbadense. This result might have occurred because some sequences were too short (the average length of 785 sequences was only 367 bp and 506 sequences are shorter than 400 bp) (Table S11). In order to get more known information, unique sequences were also used in a blastx search against swissprot database. A total of 67.2% (1813 sequences) of 2697 sequences which showed similarities to predicted proteins of unknown function in nr database, were similar to known protein sequences in swissprot database (Table  S7). However, 990 sequences without similarity to any sequences in the nr database were also not found any similarity sequences in swissprot database.
Gene ontology annotation. Gene Ontology (GO) annotation was performed with BLAST2GO [30,31] based on comprehensive information with sequence similarity against NCBI non-redundant (nr) protein database, InterProScan result and plant-related GO terms slimmed. A total of 4461 (76.2%) unigenes were functionally classified in one or more ontologies, and 3492 (59.7%) of the 5852 unigenes with assigned GO terms had molecular functions, 3137 (53.6%) were involved in a biological process, and 3012 (51.5%) were cellular components; 1830 (31.3%) unique sequences were classified in three ontologies. Among the 1391 unigenes without assigned GO terms, 990 did not have sequences with blastx results, 167 without annotation results, and 234 without GO terms.
Comparisons to the other cotton species and model plant species As of December 2010, 375,745 ESTs from all Gossypium species had been deposited in GenBank. To identify the G. barbadense species-specific sequence relative to other cotton species, the unigenes were used as queries in a blastn search against three databases including 268,797 ESTs of G. hirsutum, 63,577 ESTs of G. raimondii, and 41,768 ESTs of G. arboretum downloaded from GenBank, respectively. The two-dimensional display of relative similarity relationships between G. barbadense with three other cotton species was showed by the program SimiTri [34] (Fig. 3a). A total of 5183 (88.6%) unigenes had similarity with one or more species, with 2991 (51.1%) unigenes shared by four cotton species. However, 570, 41, and 17 unigenes shared only one species by G. hirsutum (AD-genome), G. raimondii (D-genome), G. arboretum (Agenome), respectively. A total of 3723 (63.6%) of the cluster sequences had homologues in G. arboreum, 5116 (87.4%) in G. hirsutum, and 3890 (66.5%) in G. raimondii, and 669 (11.4%) had no significant match to any sequence in the current EST databases of cotton. In order to validate the G. barbadense specific unigenes, five of them were selected from Table S13 to analyze their expression patterns by RT-PCR method. The details and results are shown in File S1, Table S3 and Figure S2. The results showed they were specifically or predominantly expressed in the tissues of G. barbadense, especially the sequences CO001089, 02-D20 and 44-O06, which were very lowly or hardly expressed at the fiber of G. hirsutum ( Figure S2).

Discussion
The enlargement of the G. barbadense EST database is a good supplement for fiber development work in cotton Although there are two cultivated tetraploid species, G. hirsutum and G. barbadense, few researchers investigated G. barbadense because it is less cultivated. EST sequencing is an efficient and relatively low-cost approach for gene discovery and annotation, detection of gene expression, genome and physical mapping, and molecular marker development [41], especially important for the organisms whose whole genome sequencing is currently uncompleted. In Gossypium, the highest number of ESTs in the NCBI GenBank was from G. hirsutum. Although G. barbadense has superior properties, its genomic resources are relatively undiscovered. Only 124 ESTs appeared in the NCBI GenBank (dbESTs), excluding 899 and 333 ESTs that were submitted by our laboratory in December 2006 and September 2009. In this study, we produced more than 10,000 high-quality ESTs from a normalized fiber cDNA library (from 22 to 25 dpa) of G. barbadense cv. 3-79 (the genetic standard line). Most of the ESTs were from fiber initiation and elongation developmental stages, and only a few were involved with the SCW synthesis period (http://www.ncbi.nlm.nih.gov/UniGene/ lbrowse2.cgi?TAXID = 3635&log$ = breadcrumbs). The ESTs were assembled into 5852 unigenes and annotated through a similarity search. Annotation results showed that many previously reported cotton fiber active or key genes were included in these libraries, such as AGPs and FLAs [42][43][44], GhTUB, GhTUA [15,17], actin-related genes (including actin-depolymerizing factor and profilin ) [14,16,18,[45][46][47], and GhEF1As [48]. A normalized cDNA library was an efficient tool for gene identification because it reduced the frequencies of prevalent mRNAs while enriching the rare ones. E6 and aquaporin PIP2-2, the most redundant transcriptions in cotton fiber [9], were only 13 and 19 clones in our ESTs, respectively. So, the first large-scale and publicly available ESTs from G. barbadense will be an important genomic resource to identify novel genes, especially for low-redundant ones through all the fiber development stages.
G. barbadense has a specific expression profile compared with other plant model systems In our work, 5067 (86.6%) unigenes had significant blastx or blastn hits after BLAST annotation although 785 unigenes (13.4%) remained unidentified and could be considered as novel or specific genes in G. barbadense. The most organism distribution of the unigenes best blastx hits is in R. communis (25.5%), followed by V. vinifera (22.7%), P. trichocarpa (22.4%), and Gossypium (7.7%) (Table  S6). A. thaliana and O. sativa were the best model systems for plant biology; however, the species at the top the blastx result were only 2.8% and 0.8% from A. thaliana and O. sativa in these datasets (Table S6). In addition, the unigenes were also compared with the protein sequences of A. thaliana, O. sativa, P. trichocarpa, V. vinifera, and R. communis using blastx. As shown using SimiTri [34] software in Fig. 3 (b, c, d, e), 39.1%, 55.6%, 57.9%, 73.5%, and 76.2% of unigenes have similarity with O. sativa, A. thaliana, P. trichocarpa, V. vinifera and R. communis, respectively. When compared with A. thaliana, O. sativa, and P. trichocarpa (Fig. 3b), the closest was P. trichocarpa and the most distant was O. sativa. Moreover, the mean G/C content of unigenes was approximately equivalent to A. thaliana and much lower than rice. Through sequencing bacterial artificial chromosomes (BACs) and analyzing the phylogenetic tree, Yu [49] also found that cotton was a nearer relative to poplar than the others. O. sativa is a monocot and A. thaliana, P. trichocarpa, and cotton are dicots, which may account for the differences in similarity between them in sequence. Although both G. barbadense and A. thaliana are Eurosid II clade (P. trichocarpa is Eurosid I) (http://www.mobot.org/MOBOT/research/APweb/), some G. barbadense genes appeared to be more similar to P. trichocarpa than A. thaliana as shown in Fig. 3b. P. trichocarpa is a perennial tree and the bulk of the biomass of trees is cellulose [50]. In addition, the ESTs are generated from cotton fiber tissues, which contain about 89% cellulose at maturity. Furthermore, through the analysis of genome sequence of woodland strawberry, Vladimir et al. proposed that poplar has to place into Malvidae clade and not Fabidae [51]. Compared with Ricinus, which is the same order with poplar (Malpighiales), cotton has more similar genes to R. communis (76.2%) than P. trichocarpa (57.9%) (Fig. 3d). The reason may be that fiber is developed from seed trichome and cotton seed also has high-quality oil materials, and R. communis is also an important oil crop. To our surprise, cotton also has high similarity with V. vinifera (73.5%) (Fig. 3c), which is the most distant relative in the plant phylogenetic tree. Although compared with P. trichocarpa, V. vinifera, and R. communis (Fig. 3e), the total number of similar genes between cotton with the three species is different, respectively; however, 3254 of common sequences were the same distance to three organisms. Based on the above results, it is difficult to decide on a best model plant to study cotton or cotton fibers, so cotton genome sequencing is vital and urgent.
Transcription factors (TFs), expansins, cell cytoskeletons, and reactive oxygen species (ROS)-related genes are highly enriched during fiber development Despite the fact that cotton fiber is single-celled trichome, the development of fiber is an exceptionally genetic complex and more dynamic gene network [27,52,53]. The development of fiber was enriched by transcription factors and phytohormonal regulators [11,19,54,55]. Yang et al. found that the frequency of putative TFs was approximately 10% in the G. hirsutum fiber initiation library (GH_TMO), which was significantly higher than that in CGI6 (Cotton Gene Index version 6, approximately 4.7%), CGI7 (approximately 5.0%), and the Arabidopsis proteome (approximately 6.3%) [11]. However, the frequency of putative TFs in this cDNA library is about 12.5%, which is much higher than that in GH_TMO. The result showed that the development through stages in G. barbadense may be enriched in TFs especially. TFs play critical roles in the regulation of cellular pathways in the response to biotic and abiotic stimuli and intrinsic developmental processes. The enrichment of TFs perhaps enhanced modulation of cellular redox levels and the avoidance or delay of stress-like processes, which prolonged the elongation period of G. barbadense characterized as longer fibers [13]. The difference of high-frequency TFs might be caused by using the different databases and organisms. TFs of 49 plant species were used in this study, however, only Arabidopsis TFs were used in the study by Yang et al. [11]. Only 55.6% of the ESTs in G. barbadense were similar to Arabidopsis, which may be not the best model system for cotton.
Rapid elongation of fiber cells is associated with cell turgor pressure [4]. Besides maintaining the high cell turgor, plasmodesmatal regulation and cell wall reassembly are also important for fiber elongation [56,57]. Plant expansins are a group of extracellular proteins that directly modify the mechanical properties of cell walls, enable turgor-driven cell extension, and likely affect length and quality of cotton fibers [58,59]. The expansin-related genes were at a very high level in our library. As shown in Table S2, two contigs including 67 ESTs are expansin genes. In Table S14, of 128 EST-related expansins, 53 ESTs (21 unigenes) are yieldins (glycosyl hydrolase family 18 protein, GH18). Yieldins lower the yield threshold of the minimum of tensile force needed to extend the cell wall [60] and were expressed in hypocotyl tissue prior to elongation activated at low pH [61]. Yieldins were first discovered or highlighted in fiber development and may be a key family protein for fiber elongation. Cell cytoskeletons also play important roles in plant cell expansion and tubulin, actin, or actin-depolymerizing factors (ADF) were all highly abundant in our dataset. The GO analysis of ESTs and unigenes showed that the microtubule was the most abundant in cellular components and microtubule-based movement was the third richest in biological process (Table S17). Study results indicated that the accumulation of various cell cytoskeleton-related transcripts will undoubtedly contribute to the rapid elongation of fiber cells [17,45,46,59].
Through comparative analysis of expression profiling, Chaudhary et al. [13,62] and Hovav et al. [25] found polyploidy, human selection, and domestication-enhanced modulation of cellular redox levels and the avoidance or delay of stress-like processes, which prolonged the elongation period and growth of longer fiber. ROS were continuously produced by oxidases or by electron transport components, such as FeS centers, semiquinones, or ferredoxin. However, large pools of glutathione and ascorbate were maintained in a highly reduced state in nonquiescent cells under optimal conditions. Other key redox signaling components were thioredoxins (TRX) and glutaredoxins (GRX), which were reduced by ferredoxin, NADPH, or glutathione [63]. Given the previous evidence that cotton class III peroxidases (GhPOX1) and ascorbate peroxidase (GhAPX1) play important roles during fiber cell elongation possibly by mediating ROS homeostasis [20,64]. The classification and assignment protein motif/domain for ESTs and unigenes showed that NAD(P) + , thioredoxin, ferredoxin, and glutathione-related domains were the richest protein domain (Table 3 and Table S12). There are about 580 ESTs with the GO annotation of ''biology process:oxidation reduction'' ( Figure S4), which was more abundant than other process. The recent study indicated that ROS also affect fiber initiation [26].
Expression profile analysis might help find speciesspecific unigenes from G. barbadense Through a similar search against other three cotton species genomic resources, 87.4% of unigenes have similarity against G. hirsutum and 2991 (51.1%) were shared by all four species; however, these sequences are more nearly similar with G. hirsutum (Fig. 3a). Finally, 669 (11.4%) had no match to any sequences in current EST databases of cotton. Blastx against protein datasets of Arabidopsis and GO annotation showed that the most enrichment functions were transport and oxidoreductase activity (Table S13  and Table S17). Moreover, several G. barbadense-specific genes deserve to be highlighted. The contig sequence CO000339 (similar with AT3G14130) encoded (S)-2-hydroxy-acid oxidase, peroxisomal, which is mitochondrial type II peroxiredoxin F and essential for redox homeostasis and root growth of Arabidopsis thaliana under stress [65]. Some genes involved in the stress response process, such as contigs CO000432 encoded pyruvate orthophosphate dikinase (PPDK), which is involved in a metabolic response to water deficit and low-oxygen stress in rice, an anoxiatolerant species [66]. Soluble epoxidehydrolase (AtSEH) gene (AT2G26740, similar with contigs CO000715) concerned with epoxide hydrolase lipid metabolism, which showed a relatively high expression in response to malondialdehyde treatment [67]. PDX1 (protein heterodimerization, AT3G16050) was essential for vitamin B 6 biosynthesis, development, and stress tolerance in Arabidopsis [68]. Two other important genes were vital for trichomes, contig CO001421 (similar with AT3G03980) encoded short-chain dehydrogenase/reductase (SDR) family protein, which is up-regulated both in Arabidopsis trichomes and root atrichoblasts [69]. Arabidopsis WAVE complex subunit (AT2G34150.2) activates the Arp2/3 complex and was required for epidermal morphogenesis [70]. In addition, another two genes were related to auxin, IBA-RESPONSE 3 (IBR3, AT3G06810, similar with contig CO001432) and IAA-ALANINE RESISTANT 3(IAR3). IBR3 may act directly in the oxidation of IBA to IAA [71]. The mutant deficiency of IBR3 failed to expand root hairs and exogenous active auxin restored its root hair elongation [72]. IAR3 encoded an auxin conjugate hydrolase [73]. Moreover, contigs CO000682 (similar with AT1G67710) encoded RESPONSE REGULATOR 11 (ARR11) TF, which belonged to the cytokinin-associated type-B ARR subfamily and had an essential role in cytokinin signal transduction [74]. These G. barbadense-specific sequences perhaps contribute to the differences between G. barbadense and other cotton species.

EST generation
The construction of the normalized fiber cDNA library (from 22 to 25 dpa) of G. barbadense cv. 3-79 (the genetic standard line) by saturation hybridization with genomic DNA was been described by Tu et al. [12]. The clones were randomly picked and transferred into 384-well plates. Single-pass sequencing from the 59 end was carried out with ABI 3730 automatic DNA sequencer (Auke Biotech Co., Ltd., Beijing, China) by using T3 universal primer and BigDye Terminator.

EST pre-process and assembly
The trace files were base-called using Phred program [75,76] and all low-quality bases (,Q20, ,99% accuracy) were removed from the sequence ends, followed by SeqClean [77] to shorten Poly-A/T (only hold 5 continual bases A/T). Then the vector and contaminating microbial sequences were eliminated using VecScreen program (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html). EST sequences longer than 100 bp after trimming were deposited into the dbESTs division of GenBank, then clustered and assembled into contigs and singlets (unisequences) using ESTClustering [78,79], which was designed based on MegaBlast [79] and CAP3 [80].

Annotation and functional classification
After clustering and assembly, BLAST search was done to identify similarities between the ESTs and sequences deposited in public databases. All of the unisequences were then compared to SwissProt and GenBank non-redundant protein and nucleotide databases using either blastx (E-value#10 25 ) or blastn (E-value#10 25 ) program [81]. The ESTs of G. hirsutum, G. raimondii, and G. arboretum species from dbEST at NCBI were downloaded and blastn analysis was performed to compare G. barbadense unisequences obtained in this study. In addition, the unigenes were compared with the protein sequences of A. thaliana (http://www. Arabidopsis.org/), O. sativa (http://rice.plantbiology.msu.edu/), P. trichocarpa (http://genome.jgi-psf.org/poplar/), V. vinifera (http:// www.genoscope.cns.fr/externe/), and R. communis (http://castor bean.jcvi.org/) using blastx. The relative similarity relationships between G. barbadense with other species were displayed using the program SimiTri [34]. To identify putative cell wall-related genes and transcription factors, the blastx against Cell Wall Navigator (CWN) database [35] and an comprehensive plant transcription factor database (PlantTFDB) [36,82] were used.
Gene Ontology (GO) [83] annotation was performed with BLAST2GO [30,31] based on sequence similarity. For the annotation, the default configuration settings were used (blastx against NCBI non-redundant (nr) protein database, E-value filter #10 23 , HSP length cutoff of 33, maximum 20 BLAST hits per sequence to sequence description tool and annotation cutoff of 55). Furthermore, to improve annotability, InterProScan was performed and InterProScan results were merged to GO annotation. Then, the GOslim ''goslim_plant.obo'' was used to achieve plantrelated GO terms. Finally, the analysis of biological processes/ pathways was also carried out using the KEGG (Kyoto Encyclopedia of Genes and Genomes) [84] Automatic Annotation Server [33] with the SBH option checked and plant gene datasets selected.
BLAST2GO includes the Gossip package [85] for statistical assessment of annotation differences between two sets of sequences, using Fisher's exact test for each GO term. FDR controlled P values (FDR,0.05) were used for the assessment of differentially significant metabolic pathways.

Identification of EST-SSRs
The EST-SSRs were indentified using Serafer, a visualization powerful pipeline to assemble sequences, detect SSRs, and design primers, developed by Shaoguang Liang in our laboratory (ftp:// ensembl.genomics.org.cn/other/Serafer_1.9.5.zip or http://www. tgir.org/download/software/Serafer_1.9.5.zip) (unpublished). The length criteria for SSR detection were a minimum of seven repeats for dinucleotide motifs, five repeats for trinucleotide motifs, and four repeats for tetra-, penta-, and hexanucleotide motifs. Figure S1 The length distribution of ESTs, contigs, singletons, and unigenes. (PDF) Figure S2 Expression analysis of 15 representative unigenes by RT-PCR method. The first ten are selected from the list of enriched unigenes, the latter five are from the list of G. barbadense putative specific sequences. R, L, S, 0, 5, 10, 15 and 20 represents the tissue of root, leaf, stem and 0, 5, 10, 15, 20-days post anthesis (DPA) fibers. (PDF) Figure S3 Functional classifications for the 5852 unigenes that were assigned with GO terms (third level GO terms). The three GO categories, biological process (a), molecular function (b), and cellular component (c) are presented. (PDF) Figure S4 The GO distribution of 10,979 ESTs. (PDF)        Table S17 669 G. barbadense-specific unigenes statistical distinct GO terms against the other 5183 unigenes. Gene Ontology (GO) annotation was performed with BLAS-T2GO. Then, the Gossip package was used to get statistical assessment of annotation differences between two sets of sequences, using Fisher's exact test for each GO term. FDR controlled P values (FDR,0.05) were used.

(XLS)
File S1 The details of expression analysis by RT-PCR method. Plant materials, RNA extraction and RT-PCR details were described in this file. (PDF) Author Contributions