Characterization of Liaoning Cashmere Goat Transcriptome: Sequencing, De Novo Assembly, Functional Annotation and Comparative Analysis

Background Liaoning cashmere goat is a famous goat breed for cashmere wool. In order to increase the transcriptome data and accelerate genetic improvement for this breed, we performed de novo transcriptome sequencing to generate the first expressed sequence tag dataset for the Liaoning cashmere goat, using next-generation sequencing technology. Results Transcriptome sequencing of Liaoning cashmere goat on a Roche 454 platform yielded 804,601 high-quality reads. Clustering and assembly of these reads produced a non-redundant set of 117,854 unigenes, comprising 13,194 isotigs and 104,660 singletons. Based on similarity searches with known proteins, 17,356 unigenes were assigned to 6,700 GO categories, and the terms were summarized into three main GO categories and 59 sub-categories. 3,548 and 46,778 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Comparative analysis revealed that 42,254 unigenes were aligned to 17,532 different sequences in NCBI non-redundant nucleotide databases. 97,236 (82.51%) unigenes were mapped to the 30 goat chromosomes. 35,551 (30.17%) unigenes were matched to 11,438 reported goat protein-coding genes. The remaining non-matched unigenes were further compared with cattle and human reference genes, 67 putative new goat genes were discovered. Additionally, 2,781 potential simple sequence repeats were initially identified from all unigenes. Conclusion The transcriptome of Liaoning cashmere goat was deep sequenced, de novo assembled, and annotated, providing abundant data to better understand the Liaoning cashmere goat transcriptome. The potential simple sequence repeats provide a material basis for future genetic linkage and quantitative trait loci analyses.


Introduction
Cashmere is an important commodity in China and Liaoning cashmere goat (Capra hircus) is famous for its fine cashmere quality, high cashmere yield, and stable genetic character. Although this population was bred in the 1980s, a little is known regarding the molecular mechanism and regulation of cashmere growth in this breed. Genetic studies of Liaoning cashmere goat had been performed in the past decade, which mainly focused on characterization and expression of functional genes [1,2] and development of genetic markers [3,4] for breeding and genetic evaluation. This is partly due to unavailable transcriptome information and the sequencing of limited number of randomly selected cDNA clones for Liaoning cashmere goat.
Expressed sequence tags (ESTs) or transcriptome sequencing is an efficient way to generate functional genomiclevel data for an organism, especially non-model organism.
In this article, we performed de novo transcriptome sequencing of the female Liaoning cashmere goat using Roche 454 GS FLX platform. 804,601 high-quality reads were assembled into a non-redundant set of 117,854 unigenes, comprising 13,194 isotigs and 104,660 singletons. Functional annotation and comparative analysis were performed on those 117,854 unigenes, providing an invaluable new resource for functional genomics and biological research in Liaoning cashmere goat. Finally, potential simple sequence repeats we identified could be useful for genetic linkage mapping and other genetic studies.

Transcriptome sequencing and assembly
Although the Roche 454 GS FLX platform incurs a high cost for sequencing, the read length of the output sequences are more than adequate for the de novo assembly of Liaoning cashmere goat genes [28,29]. To achieve a whole-body transcriptome, total RNA was extracted from a variety of adult organs and tissues including heart, liver, spleen, lung, kidney, pancreas, skin, muscle, ovary, stomach, intestines, brain and arteries. Equal quantities of total RNA were mixed together to construct a cDNA library. The library was sequenced by a Roche 454 GS FLX. Table 1 shows the summary of Roche 454 GS FLX assembly and analysis of transcriptomic sequences in Liaoning cashmere goats. A total of 1,044,032 raw reads were obtained with an average length of 416 bps. After processing the raw reads and filtering out short and low quality reads, 804,601 (77.07%) highquality transcriptomic reads with an average size of 403 bp remained ( Figure 1A). De novo assembly using Newbler Software v. 2.5 generated 13,194 isotigs (comprising 654,169 reads (81.30%)). The isotig length ranged from 91 to 7,402 bp ( Figure 1B), with an average length of 1,011 bp. Most isotigs (93.24%) were > 500 bp in length, and 40.64% of them were > 1000 bp. The average length of the Liaoning cashmere goat isotigs (1,011 bp) was longer than most of those assembled in other non-model organisms, such as Melitaea cinxia (197 bp) [30], Acropora millepora (440 bp) [31], P. contorta (500 bp) [32], Pristina leidyi (707 bp) [33], and Megalobrama amblycephala (730 bp) [19], but they were shorter than the average length in Junco hyemalis (1,248 bp) [34]. About 112,148 (18.70%) reads did not assembled into isotigs. After processing these, 104,660 singletons with an average length of 422 bp were obtained. The singletons ranged from 90-1,205 bp, and 98.62% of the singletons were between 101 and 700 bp in length ( Figure 1C). Finally, clustering and assembly of 804,601 high-quality reads produced a non-redundant set of 117,854 unigenes, comprising 13,194 isotigs and 104,660 singletons.

Functional annotation and classification
All unigenes (isotigs and singletons) were analyzed by EMBOSS software [35] for generation of putative protein sequences. Then we searched all the putative protein sequences against several protein databases (NCBI nonredundant (NR), Swiss-Prot and TrEMBL, Clusters of Orthologous Groups (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) ), and predicted the protein functions from the annotations of the most similar proteins. Here, 53,515 (45.41%) unigenes were annotated ( Table 2).
Among the annotated 24,614 unigenes, the majority of matched sequences exhibited high similarity to Bos taurus (60.37%) sequences. The top 20 species distribution of BLAST matches is shown in Figure 2. The annotated unigenes of the Liaoning cashmere goat transcriptome have a low percentage of matches to Capra hircus (goat) or Ovis aries (sheep) sequences. In principle, the unigenes of the Liaoning cashmere goat should be more similar to these two species, because they all belong to the Caprinae. We speculated that this might have resulted from an underrepresentation of goat and sheep protein sequences in the NCBI Nr database. To confirm this   description of the gene properties and their products in any organism. All putative protein sequences (obtained from 117,854 unigenes) of Liaoning cashmere goat were assigned for GO terms based on sequence similarity, using BLASTp (with E value<1e -5 ) against the Swiss-Prot and TrEMBL database, and then using the GoPipe software [36] to perform GO functional classifications. In total of 17,356 unigenes were assigned to 6,700 GO categories, and the terms were summarized as three main GO categories and 59 subcategories ( Figure 3A). Of these, 15,288 (88.08%) comprised the largest category, molecular function, followed by cellular component (14,617,84.22%) and biological process (14,119, 81.35%). Among the molecular functions category, the top three were involved in binding (GO: 0005488), protein binding (GO: 0005515) and catalytic activity (GO: 0003824). Regarding cellular components, cell (GO: 0005623) and intracellular (GO: 0005622) were the dominant groups, cytoplasm (GO: 0005737) and membrane (GO: 0016020). Within biological processes category, cellular process (GO: 0009987) was the most dominant group, metabolic process (GO: 0008152), macromolecule metabolic process (GO: 0043170) and regulation of biological process (GO: 0050789) ( Table S1).
To better understand the biological functions and interactions of the putative proteins (obtained from 117,854 unigenes), a bidirectional BLAST search against Kyoto Encyclopedia of Genes and Genomes (KEGG) protein database with a cutoff E value of <1e -5 was performed. 3,548 unigenes, comprising 1,728 isotigs and 1,820 singletons, had KEGG Orthology (KO) numbers (Table S2). According to the KO numbers, we assigned the 3,548 unigenes to 300 KEGG pathways ( Table  S2). The pathways represented by the KEGG annotated unigenes were in cancer (ko05200), HTLV-I infection (ko05166), RNA transport (ko03013), Epstein-Barr virus infection (ko5169), protein processing in endoplasmic reticulum (ko04141) and MAPK signaling pathway (ko04010).
In addition, the 3,548 KEGG annotated unigenes could be categorized into six different functional groups ( Table 3). The largest group is metabolism, which contained 1,117 KEGG annotated unigenes. This KEGG category consisted of 11 major sub-groups, such as carbohydrate metabolism, amino acid metabolism, lipid metabolism. Other KEGG categories were human diseases, genetic information processing, organismal systems, cellular processes and environmental information processing. A well-categorized and annotated transcriptome would be useful for further investigations of gene function in future studies.

Comparative analysis
Gene Coverage Analysis. After clustering and assembling 804,601 high-quality reads, we obtained 117,854 nonredundant unigenes, comprising 13,194 isotigs and 104,660 singletons. The assembled unigenes were compared with the sequences in the NCBI non-redundant nucleotide databases using the BLASTn algorithm with a cutoff E value of <1e -20 . Of the 117,854 assembled unigenes, 66,351 (56.30%) showed one alignment to an existing sequence. After processing and filtering, 42,254 unigenes were aligned to 17,532 different specific genes. The coverage of every gene was also calculated (Table S3). Among the 17,532 specific genes, 31 genes had full-length coverage, and 939 (5.36%) genes had coverage between 90.00% and 99.99%. Moreover, 3,053 (17.41%) genes had a coverage > 70%, and 5,655 (32.26%) genes had a coverage > 50% (Figure 4).
According to the result of the gene coverage analysis, the full-length sequences of some Liaoning cashmere goat genes could be obtained by using PCR or reverse PCR. The 42,254 unigenes aligned to specific genes could serve as a library, facilitating the isolation of the full-length versions of Liaoning cashmere goat genes. Using the genes that participate in hair follicle formation and hair growth regulation as example, we identified FGF7, KAP8.1, Hoxc13, TGFβ, PDGFA, IGFBP5 and BMP4 as having a coverage rate >50% in Table S3. In addition, our data contain a number of other genes related to hair follicle formation and hair growth regulation with a coverage rate <50%, such as TGFβ2, FGF5, Notch1, IGF-1, PDGFRA, BDNF, IL1A/B, NTRK3, and MTNR1A. Without doubt, there were also many genes that were not represented in this RNA-seq dataset, because of the depth of 454 sequencing, the sources of the sequencing samples and low transcript abundance. To date, although 2.66-Gb genome sequences of a female Yunnan black goat have been reported [37], the comprehensive annotation of goat genome remains unavailable. Thus, we believe that our sequence library (comprised 42,254 unigenes) would serve as an important public information platform to accelerate functional gene studies in Liaoning cashmere goat. For example, the sequence library could facilitate studies of genes, gene transcription and transcriptome of other goat species by providing an abundant source of homologous genes.
Comparison with Capra hircus genome sequence and protein-coding genes. Capra hircus genomes information is publicly available through a genome browser interface and database (http://goat.kiz.ac.cn/GGD/). To orient the Liaoning cashmere goat unigenes derived from RNA-seq, we performed BLASTn alignment (E value <1e -20 ) against the sequences of Capra hircus genomes. Among a total 117,854 unigenes, 97,236 (82.51%) were mapped to 30 chromosomes of Capra hircus (Table S4). Furthermore, all matched unigenes (97,236) were compared with the 66,351 annotated unigenes, which showed at least one significant alignment to an existing sequence in the NCBI non-redundant nucleotide databases. 53,140 (54.65%) matched unigenes got annotated. This analysis indicated that our assembled unigenes obtained from 13 Liaoning cashmere goat tissues will be useful for the annotation of Capra hircus chromosomes.
In addition, all Liaoning cashmere goat unigenes were searched against 22,175 Capra hircus protein-coding genes, which have been reported [37] and were publicly available (http://goat.kiz.ac.cn/GGD/). 35  Discovery of putative novel goat genes. The homologybased annotation is a serviceable method for gene prediction and annotation. Thus, we further analyzed the result of above comparative analysis, especially the 18,602 unigenes which were not mapped to the reported Capra hircus protein-coding genes but matched to 8,940 cattle and human reference genes. After statistical investigation, among the 8,940 cattle and human reference genes, 272 genes whose full-length were covered more than 50% by our Liaoning cashmere goat unigenes. Moreover, we collected coding sequences (CDS) of the 272 genes, and through comparative analysis, 67 cattle genes whose coding sequence were completely covered by 67 unigenes (Table S7). We predicted that the 67 Liaoning cashmere goat unigenes were homologous genes of the 67 cattle genes. Thus, we infer that the 67 Liaoning cashmere goat unigenes, which were not aligned to the 22,175 reported goat genes, were putative novel goat genes (Table S8).

Putative SSR markers in unigenes
Simple sequence repeats (SSRs) are polymorphic loci present in genomic DNA, and are distributed in both coding and non-coding regions [38]. They consist of repeated core sequences of 2-6 base pairs in length. We used MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/) to identify potential SSRs in our unigenes. Of the 117,854 unigenes obtained from 454 sequencing, 2,621 unigenes were identified to contain 2,781 SSRs, 122 of the unigenes contained two or more SSRs. The frequency of SSR occurrence in the 117,854 unigenes was 2.36%. Table 4 shows that the dinucleotide repeats were the most common SSRs in our dataset. The second major class was pentanucleotide, and the remaining repeat motifs were mononucleotide, trinucleotide, tetranucleotide and hexanucleotide. This distribution was different from SSRs of Capra hircus skin [39], whose trinucleotide repeats were the most common type (38.6%) and the proportion of pentanucleotide repeats were small. In addition, the frequencies of SSRs with different number of tandem repeats are also shown in Table 4. SSRs with > 10 tandem repeats (49.08%) were the most common, followed by four tandem repeats, five tandem repeats and ten tandem repeats. The dominant repeat motif in all SSRs was AC/GT (908, 37.49%), followed by AACTG/AGTTC (326, 13.46%), AT/AT (211, 8.71%), AGC/CTG (137, 5.66%) and AAAT/ATTT (92, 3.80%) ( Table S9). The AC/GT repeat type, most abundant in Liaoning cashmere goat was also very abundant in other vertebrates [40], but different from plants [41].
SSRs can be used for subsequent marker development, genetic linkage and quantitative trait loci analysis. To make these SSR markers be useful for Liaoning cashmere goat breeding or other studies, we employed primer 3 to design primer pairs for each SSR. After removing the microsatellites without sufficient flanking sequence for primer design, 958 primer pairs from 2,781 (34.45%) SSRs were designed (Table  S10). However, these primers require further experimental validation.

Conclusions
In this study, de novo transcriptome sequencing of the female Liaoning cashmere goat using the 454 GS FLX system was performed for the first time. 1,044,032 raw reads were de novo assembled into 117,854 unigenes. All unigenes were then evaluated and functionally annotated by comparing with the existing protein databases, such as NCBI NR database, Swiss-Prot and TrEMBL database, COG database, and KEGG protein database. Further comparative analysis revealed that 54,153 (45.95%) unigenes got annotated and 67 putative new goat genes were discovered. Moreover, 2,781 SSRs were initially identified among the unigenes, which can be used for subsequent marker development, genetic linkage and quantitative trait loci analysis. Overall, our study on the Liaoning cashmere goat transcriptome provides a valuable resource that will lead to a better understanding of the Liaoning cashmere goat transcriptome.

Ethics statement
All animal procedures and study design were conducted in accordance with the Guide for the Care and Use of Laboratory Animals (Ministry of Science and Technology of China, 2006), and were approved by the animal ethics committee of Northwest A & F University.

Sample preparation, total RNA isolation and quality controls
The most important issue for transcriptome sequencing was the preparation of fresh and healthy tissue samples. Therefore, three adult female Liaoning cashmere goats were selected from Liaoning Province of China in 2012. The 13 tissue samples including heart, liver, spleen, lung, kidney, pancreas, skin, muscle, ovary, stomach, intestines, brain and arteries were freshly collected. All samples were immediately frozen in liquid nitrogen and stored at -80°C until use.
Total RNA was extracted from the 13 tissues using the TRIzol Reagent (Invitrogen, USA) according to the

cDNA synthesis and 454 sequencing
We pooled equivalent amounts of total RNA from each sample (total 0.30 mg total RNA) and delivered it to Shanghai Hanyu Bio-Tech Co., Ltd. (Shanghai, China) for cDNA synthesis and 454 sequencing. Pure mRNA was subtracted using the Micropoly(A) Purist TM mRNA purification kit (Ambion). RNA quality and quantity was determined again at the end of this process. First-strand cDNA was synthesized using GsuIoligo dT as reverse transcription primers, 10 mg of the mRNA as template, and 1000 units of Superscript II reverse transcriptase (Invitrogen), at 42°C for 1 h. The mRNA 5' cap structure was then oxidized using NaIO 4 (Sigma, USA) and connected to biotin. The biotin-coupled mRNA/cDNA was screened by Dynal M280 beads (Invitrogen), and the firststrand cDNA was released by alkaline dissociation. Subsequently, adapters were added to the first-strand cDNA using DNA ligase (TaKaRa, Japan), and the second-strand cDNA was synthesized using Ex Taq polymerase (TaKaRa). Finally, the polyA and 5' end adapters were removed by Gsu I digestion.
Fragmentation of synthesized cDNA was performed by nebulization, which shears double-stranded cDNA into fragments ranging from approximately 300 to 800 base pairs. A single-strand template DNA (sstDNA) library was constructed using a GS DNA Library Preparation kit (Roche Applied Science, USA) following the manufacturer's protocol. The sstDNA was attached to beads using a GS emPCR kit (Roche Applied Science), and an emulsion-based clonal amplification was performed. After amplification, Roche GS-FLX 454 pyrosequencing was conducted, according to the manufacturer's recommendations.

Sequence assembly
1,044,032 raw sequences were obtained from Roche 454 pyrosequencing, with an average length of 416 bp. Newbler Software v. 2.5 (provided with the Roche GS FLX sequencer) was used for sequence assembly, using default parameters. Prior to assembly, we removed the nonsense sequences, including adapters added for reverse transcription and 454 sequencing, primers and low quality sequences, using Newbler. Raw data were deposited in NCBI Sequence Read Archive (SRA) under the accession number SRA064946. All unigene sequences (>200 bp) were deposited to the Transcriptome Shotgun Assembly (TSA) database. This Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GAFC00000000. The version described in this paper is the first version, GAFC01000000.

Functional annotation
All unigenes were analyzed by the EMBOSS software package [35] for generation of putative protein sequences. All putative protein sequences were subjected to a BLASTp similarity search against the NR protein database (NCBI), with a cutoff E value of <1e -5 . Blast hits annotated with uninformative terms such as "unknown", "uncharacterized" and "hypothetical" were omitted from analyses. Ultimately, 24,614 unigenes had significant BLASTp matches.
To annotate unigenes with gene ontology (GO) terms, sequences were compared against the Swiss-Prot and TrEMBL database using the BLASTp algorithm, with a cutoff E value of <1e -5 . The results were then sorted by GO categories using gene2go in the GoPipe software [36]. BLASTp was also used to match unigenes to the Clusters of Orthologous Groups (COG) (http://www.ncbi.nlm.nih.gov/COG/) to predict possible functional classifications. KEGG Orthology (KO) numbers were assigned to unigenes using a bidirectional BLAST search against the KEGG protein database with a cutoff E value of <1e -5 . The sequences were mapped to the KEGG pathways according to the KO numbers in the pathway database.

Calculation of gene coverage
The assembled sequences were compared with the sequences in the NCBI non-redundant nucleotide databases using the BLASTn algorithm with a cutoff E value of <1e -20 . 42,254 unigenes aligned to 17,532 different specific genes, because multiple unigenes sometimes matched to one specific gene. We collected the full-length versions of the 17,532 different specific genes, and then calculated the length of every specific gene covered by the 42,254 unigenes. The results of the coverage analysis are shown in Table S3.

Comparative analysis
To further annotate and identify the assembled sequences, all the sequences (13,194 isotigs and 104,660 singletons) were mapped to Capra hircus genome and compared with the reported 22,175 Capra hircus protein-coding genes, using BLASTn with a cutoff E value of <1e -20 . The remaining nonmatched unigenes were also aligned to Bos taurus (cattle) and Homo sapiens (human) reference genes, using BLASTn with a cutoff E value of <1e -20 . 18,602 unigenes were matched to 8,940 cattle and human reference genes. Among the 8,940 cattle and human reference genes, 272 genes whose fulllength were covered more than 50% by unigenes. To discover the putative new goat genes, the coding sequences (CDS) of 272 cattle and human reference genes collected from NCBI. Through analysis, 67 genes whose coding sequence were completely covered by 67 Liaoning cashmere goat unigenes.

SSR detection and primer designing
SSRs were detected among the 117,854 unigenes using MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/), which accepts FASTA-formatted sequence files. The parameters were adjusted to identify perfect mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs with a minimum of 20, 10, 7, 5, 4 and 4 repeats, respectively. The maximum number of bases interrupting two SSRs in a compound microsatellite was set at 100. Primer 3 software [42] was used to design primers. The file "FASTAfile.misa", which was created by MISA, contained the information about the type and the localization of each individual microsatellite and was submitted to Primer3 for primer designing. The major parameters for designing the primers were set as follows: primer length ranging from 18 to 22 bases, with 20 as the optimum; PCR product size ranging from 100 to 500 bp; optimum annealing temperature 59°C; and GC content from 40% to 70%, with 50% as the optimum. Table S1.

Supporting Information
Categories of gene ontology of Liaoning cashmere goat unigenes. (XLSX)