Comparative analysis of the transcriptome of the Amazonian fish species Colossoma macropomum (tambaqui) and hybrid tambacu by next generation sequencing

Background The C. macropomum is a characiform fish from the Amazon basin that has been hybridized with other pacu species to produce commercial hybrids, such as the tambacu. However, little is known of the functional genomics of the parental species or these hybrid forms. The transcriptome of C. macropomum and tambacu were sequenced using 454 Roche platform (pyrosequencing) techniques to characterize the domains of Gene Ontology (GO) and to evaluate the levels of gene expression in the two organisms. Results The 8,188,945 reads were assembled into 400,845 contigs. A total of 58,322 contigs were annotated with a predominance of biological processes for both organisms, as determined by Gene Ontology (GO). Similar numbers of metabolic pathways were identified in both the C. macropomum and the tambacu, with the metabolism category presenting the largest number of transcripts. The BUSCO analysis indicated that our assembly was more than 40% complete. We identified 21,986 genes for the two fishes. The P and Log2FC values indicated significant differences in the levels of gene expression, with a total of 600 up-regulated genes. Conclusion In spite of the lack of a reference genome, the functional annotation was successful, and confirmed a considerable difference in the specificity and levels of gene expression between the two organisms. This report provides a comprehensive baseline for the genetic management of these commercially important fishes, in particular for the identification of specific genes that may represent markers involved in the immunity, growth, and fertility of these organisms, with potential practical applications in aquaculture management.


Results
The 8,188,945 reads were assembled into 400,845 contigs. A total of 58,322 contigs were annotated with a predominance of biological processes for both organisms, as determined by Gene Ontology (GO). Similar numbers of metabolic pathways were identified in both the C. macropomum and the tambacu, with the metabolism category presenting the largest number of transcripts. The BUSCO analysis indicated that our assembly was more than 40% complete. We identified 21,986 genes for the two fishes. The P and Log2FC values indicated significant differences in the levels of gene expression, with a total of 600 up-regulated genes.

Conclusion
In spite of the lack of a reference genome, the functional annotation was successful, and confirmed a considerable difference in the specificity and levels of gene expression between the two organisms. This report provides a comprehensive baseline for the genetic management of these commercially important fishes, in particular for the identification of specific PLOS  3500). The sequences were then edited and aligned in BIOEDIT 7.2.5 [25]. The identification of the mitochondrial maternal lineage was based on comparisons with the sequences of the native species provided by Gomes et al. [13] and the sequences available in GenBank (Colossoma macropomum DQ480074 and Piaractus mesopotamicus AF283959).

Extraction of the total RNA
The total RNA was isolated using the PureLink RNA mini kit (Ambion, Life Technologies, USA) according to the manufacturer's instructions. Equal concentrations and volumes of the tissue sampled from different individuals were combined to produce four RNA pools, separated by organism and type of tissue. The samples were treated with RNAse-free DNAse (Invitrogen, CA, USA) to remove any DNA contaminants. The amount of RNA extracted was determined using a PicoDrop spectrophotometer (Picodrop, United Kingdom), and the quality, with a Bioanalyzer 2100 (Agilent Technologies). The integrity of the samples was confirmed by electrophoresis in 1.5% agarose gel. All the RNA samples were treated with RiboMinus (Invitrogen, CA, USA), according to the manufacturer's instructions, for the selective depletion of the ribosomal RNA (rRNA) transcripts from the total RNA. The RNA samples were then stored at -80ºC prior to the pyrosequencing reaction.

Construction of the cDNA library
Four C. macropomum and tambacu libraries were established. Complementary DNA (cDNA) libraries were constricted in two steps, firstly, by the fragmentation of the total RNA and then the synthesis of the double-stranded cDNA. The initial quantity of RNA used in the fragmentation was approximately 600 ng/μ, with zinc chloride (ZnCl2) and Tris-HCL being added to initiate the fragmentation process. The integrity of the fragments was verified in the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA), using the RNA 6000 Pico kit. The double strands of the cDNA were then obtained using the cDNA Synthesis System kit (Roche), with the following steps: denaturation of the RNA by the addition of a randomic primer, synthesis of the first cDNA strand, synthesis of the second cDNA strand, and the purification of the double cDNA strands. The irregular ends of the cDNA fragments generated by this process were then repaired. The cDNA fragments were quantified using a TBS 380 QuantiFluor fluorometer (Promega, USA), according to the manufacturer's instructions. The size of the cDNA fragments was verified in a bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) using the High Sensitivity DNA chip.

Amplification of the cDNA library and sequencing
The cDNA fragments of all the samples (C. macropomum and tambacu) were pyrosequenced in the 454 GS FLX Titanium platform (Roche, Branford, CT, USA). The emulsion PCR (emPCR) was based on the enrichment, purification, and preparation of the Pico Titer Plate (PTP), conducted according to the manufacturer's instructions. All the libraries were sequenced five times, with each run using a PTP with two regions. The raw sequence reads have been submitted in the National Center for Biotechnology Information (NCBI) Sequence Therefore, genetic markers are required to ensure the reliable identification of specimens. The markers used in the present study were the mitochondrial D-loop sequences and the multiplex PCR of the nuclear α-Tropomiosin gene. The multiplex PCR is widely used as a highly effective, low-cost tool for the rapid identification of interspecific hybrids and parent species [13,45].
As the mitochondrial DNA is inherited maternally in most eukaryotes [46,47], the sequencing of the D-loop confirmed that the female parents of all the specimens analyzed were C. macropomum (S1 Table). The hybrids were discriminated from the C. macropomum by multiplex PCR, using the parental species (C. macropomum and P. mesopotamicus) as control. The results of the PCR yielded distinct band profiles for P. mesopotamicus, with fragments of 300 base pairs (bps), and C. macropomum, which presented bands of only 200 bps. The multiplex PCR produced a single band of 200 bps in eight of the 16 specimens analyzed, confirming that they belong to C. macropomum (Fig 1A). The other eight specimens had two distinct bands (one of 300 bps and the other of 200 bps), confirming their identification as tambacu hybrids. The 300-bp fragment is inherited from the P. mesopotamicus father and the 200-bp fragment from the C. macropomum mother ( Fig 1B).

Results of the sequencing and assembly
A total of 8,188,945 raw reads were obtained from the four libraries derived from the samples of the skin and muscle tissue of C. macropomum and the tambacu, using the Roche 454   (Table 1). After filtering, the remaining reads obtained from each type of tissue were assembled separately using MIRA and Newbler, providing a total of 572,760 contigs. The dataset was analyzed in CD-Hit-Est to remove redundant contigs and putative assembly errors, resulting in a final number of 400,845 contigs ( Table 1). The use of two assemblers enriched both the number and the size of contigs, and is a successful, widely-used strategy [48,49]. The contigs ranged in size from 200 to 49,961 bps. The largest contig was assembled from the C. macropomum muscle samples (S1 Fig). The N50 values for each type of tissue and other assembly metrics are shown in Table 1.

Assessment of the completeness of the transcriptome
The completeness of the transcriptome assembly of the C. macropomum and tambacu samples was assessed through comparison with the Benchmarking Universal Single-Copy Orthologs (BUSCO) for the metazoan and eukaryote gene sets. The genes sets were classified according to BUSCO parameters ( Table 2) as either Complete (C), Single (S), Duplicated (D), Fragmented (F) or Missing (M). In this analysis, including the total set of transcripts of C. macropomum and the tambacu, 41.1% of the complete genes were observed in comparison with the metazoan gene set, and 42.3% in comparison with the eukaryote set. If the fragmented genes are included here, the percentage of genes recovered increases to 75.5% in comparison with the metazoans and 77.9% for the eukaryotes. On the other hand, 24.55% of the genes were classified as missing in comparison with the metazoan set, and 22.1% in comparison with the eukaryote set, which may reflect biological innovations in the C. macropomum and tambacu [33]. Although this cannot be evaluated due to the lack of data on the complete genome of these organisms, or even this result may be a consequence of problems in the data assembled. Comparative analysis of the transcriptome of the fish Colossoma macropomum and hybrid tambacu.

Functional annotation
When the contigs from the present study were compared to the Swiss-Prot database using BlastX, 100,298 (25%) of the transcripts having significant hits. Of this total, 58,322 contigs were annotated (S2 Fig). Overall, 19.4% of the transcripts annotated in C. macropomum referred to the muscle, and 31.2% to the skin. In the tambacu 35.8% referred to the muscle and 13.8% to the skin. The top hits in BlastX revealed that C. macropomum and its hybrid shared transcripts with a number of other species of fish, amphibian, and mammal (S3 and S4 Figs).

Gene Ontology (GO)
The Gene Oncology (GO) project provides a useful bioinformatics tool for the representation of genes or their products in different species and are especially informative for large datasets [50,51]. In the present study, the GO analysis of the three principal domains, Biological Processes ( Similar results were obtained for the C. macropomum and tambacu in relation to the main functions observed in the BP, MF, and CC domains (Fig 2). The GO term "cellular process" (GO:0009987) was predominant in the BP domain (>12% in the four libraries), whereas in the MF domain, "binding" (GO:0005488) was the most dominant, with nearly half of the transcripts, followed by "catalytic activity" (GO:0003824), responsible for another quarter. In the case of the CC domain, the "cell" (GO:0005623) and "organelle" (GO:0043226) were the most common (45% of the transcripts in the four libraries). The distribution of the most expressed GO terms from each domain were similar to those reported in the transcriptomes of other vertebrate species [53][54][55].

Kyoto Encyclopedia of Genes and Genomes (KEGG)
The KEGG analysis provides plots of the biochemical pathways, and is useful for the systematic analysis of gene function, focusing in particular on the gene and the molecular network [56]. In the present study, we combined the tissues from each organism, resulting in 132 signaling pathways for the C. macropomum and 133 for the tambacu, corresponding to 8276 (1337 enzymes) and 16,366 (1334 enzymes) sequences, respectively (S2 and S3 Tables). The pathways were classified in four categories: Metabolism, Genetic information processing, Environmental information processing, and Organismal system (Fig 3). The largest numbers of transcripts (7,620 in C. macropomum and 15,281 of those in the tambacu) were allocated to the Metabolism category. Within this category, a number of differences were found between the two organisms in the distribution of the subcategories. With 22.7% of the C. macropomum transcripts being involved with carbohydrate metabolism and 12.8% with lipid metabolism, whereas in the tambacu, 19.0% of the transcripts were involved with nucleotide metabolism and 17.7% with the metabolism of cofactors and vitamins.
We identified some differences in the most common pathways in the C. macropomum and tambacu, although in both organisms, the purine metabolism pathways were predominant, with 689 transcripts in the C. macropomum, and 2850 in the tambacu (S6 and S7 Figs). Previous reports on the fish transcriptome produced similar results, such as those recorded for the hybrid pufferfish (Takifugu rubripes x Takifugu flavidus) [57], for example, and the closelyrelated species, Piaractus mesopotamicus [6] and Piaractus brachypomus [58].

Differential expression
The 454 FLX Genome sequencer (GS FLX) is suitable for the de novo sequencing of the transcriptome of non-model organisms [59][60][61], as confirmed in a number of previous studies [5,6,62]. As the organisms analyzed did not have a reference genome, we prepared a reference transcriptome, based on the entire set of assembled contigs. These sequences were mapped in the Star software, and counted in Htseq. Using rigorous parameters, we obtained a mapping index of 64.6% for C. macropomum, and 65.4% for the tambacu.
The present study provides the first comparative analysis of the transcriptomes of C. macropomum and the tambacu. Thus, based on the 400,845 contigs compiled in the four libraries, 21,986 genes were identified, including isophorms (S4 Table), with 16,632 being identified for C. macropomum and 14,835 for tambacu. The dataset from the muscle from both organisms comprised 13,951 genes, with 2,352 expressed only in the C. macropomum and 6,630 only in Comparative analysis of the transcriptome of the fish Colossoma macropomum and hybrid tambacu. the tambacu. The dataset for the skin was composed of 16,857 genes, of which, 9,371 were expressed only in the C. macropomum and 2,772 only in the tambacu. The differences in the number of exclusive and shared genes support the hypothesis that the expressed transcripts depend on the development stage or genetic background of the sample specimens, given that genes may be expressed differentially in the same tissue [63,64].
The results of the heatmap analysis (Fig 4) indicate the clustering of tissues, with one clade formed by the muscle transcriptome of the C. macropomum and tambacu, and the other by the skin samples of the two organisms. The similar size of branches in both clades indicates equivalent patterns of divergence in both tissues in each organism. Each library was also characterized by a number of different groups of distinct and overexpressed genes, considering only the DEGs with RPKM <100 (all the genes included in the analysis are shown in S5 Table). The IDEG6 analysis (p < 0,001) together with the Log2FC values (� 1 and � -1) revealed a number of differentially-expressed genes (see the DEGs in S6 Table). A total of 168 up-regulated genes were found in the library of the C. macropomum muscle, 222 in the tambacu muscle, 95 in the C. macropomum skin and 115 in the tambacu skin (Figs 5 and 6). These results reinforce the differences found in gene expression among organs, tissues or even different types of cell from the same organ [65].

Genes expressed differentially in the muscle of the C. macropomum and tambacu
The largest contig with functional annotation (54,960 bps) was observed in the muscle of the C. macropomum, while the same transcript in the tambacu was slightly shorter, at 54,465 bps, but was nevertheless the most abundant protein (in reads) in this organism. This transcript refers to the gene involved in the synthesis of Titin (ttn), which is the largest and the most abundant protein in the skeletal and cardiac muscle of humans [66,67]. In C. macropomum, Comparative analysis of the transcriptome of the fish Colossoma macropomum and hybrid tambacu. the most expressed gene was the Myosin heavy chain fast skeletal muscle (myss), which is related to muscle contraction, and was also up-regulated in this species (Table 3).
Of the 168 up-regulated genes in the muscle of C. macropomum, those of the myosin gene family are the most prominent, including Myosin-1 (myh1), Myosin-2 (myh2), Myosin-3 (myh3), Myosin-4 (myh4), Myosin-7 (myh7), Myosin-7B (my7b), Myosin-8 (myh8), and Myosin-13, myh13 (Table 3, S6 Table). This gene family encodes motor proteins, which are responsible for the contraction mechanism, as well as being involved in several mechanochemical processes in eukaryotes [68,69]. In addition to this gene family, Actin (acts/acta1) was also expressed prominently in the muscle, and is particularly active in the eukaryote cell, where it has a synergistic effect with myosin in the control of mobility and muscle contraction [70], as well as the genes that encode Myomesin 1 (myom1) and Myomesin 2 (myom2), the main components of M-line [71] (Table 3, S6 Table). It is interesting to note that all these up-regulated genes in C. macropomum are related to muscular contraction. This may be related to the fact that this species is rheophilic, and needs to make an enormous effort to swim upstream, against the current, to mature sexually. The individuals raised in captivity (as those used in the present study) can to mature sexually, although the final stage of the spawning process does not occur, given the absence of the external factors necessary to trigger the reproductive process [7].
The genes responsible for the expression of catalytic proteins were also abundant in the DEGs of the muscle of the C. macropomum, including the enzymes Beta Enolase (enob/ eno3), Triosephosphate Isomerase B (tpisb/tip1b), NAD(P)H dehydrogenase (quinone) 1 (nqo1), Enolase (eno), and Phosphoglucomutase-1 (pgm1), most of which are related to the synthesis or degradation of glucose and the metabolism of carbohydrates [72][73][74] (Table 3, S6 Table). This finding may account for the predominance of the carbohydrate metabolism subcategory identified in the KEGG analysis for C. macropomum.
The results obtained for the transcriptome of the tambacu muscle included several up-regulated genes related to processes such as aminoacid metabolism, stress, and catalysis. In the KEGG analysis of the tambacu muscle, nucleotide metabolism was the most common subcategory (involving 19% of the transcripts) in the "metabolism" category. This finding can be accounted for by the presence of a number of differentially-expressed genes related to the metabolism of aminoacids, such as Elongation factor 1-alpha (eef1a), 4F2 cell-surface antigen heavy chain (4f2/ slc3a2), and Large neutral amino acids transporter small subunit 4 (lat4/ slc43a2) [75][76][77] (Table 3, S6 Table). The DEGs observed in the muscle of the tambacu also included genes of the ubiquitin family and similar genes, such as Ubiquitin-40S ribosomal protein S27a (rps27a), F-box only protein 32 (fbxo32), AN1-type zinc finger protein 5 (zfand5) ( Table 3, S6 Table). These genes are involved primarily in the regulation of proteins [78].
Genes of the Heat shock protein (HSP) family were also prominent in the up-regulated genes of the muscle of the tambacu: Heat shock protein beta-1 (hspb1), Heat shock protein beta-11 (hspb11/hspbb), Heat shock 70 kDa protein (hsp70), Heat shock 70 kDa protein 1 (hsp71), Heat shock protein beta-6 (hspb6), Heat shock protein beta-8 (hspb8) (S6 Table). The expression of this protein family is related mainly to environmental and physiopathological stress [79,80]. The Alpha-crystallin B (cryab) proteins are also prominent in this group (Table 3), being characterized by their presence in a number of different types of tissue, including the skeletal muscle [81,82]. Like other HSPs, these proteins act as molecular chaperones to prevent the incorrect association and aggregation of the polypeptide chain [83,84]. The presence of a number of HSPs with prominent over-expression in the muscle of tambacu is noteworthy, given that the expression of these genes is usually a response to stress factors, including environmental conditions and disease [79,80,85]. The marked expression of the genes related to these proteins potentially indicates environmental effects and adaptive processes in the hybrid tambacu. In fact, the HSPs are helpful for the repair of protein damages, given that over-expression is usually sufficient to protect the cell from exposure to stressful conditions, such as high temperatures [80,86]. Some transcripts related to processes such as catalytic action, defense, and oxidative stress were also relatively common in the DEGs of the tambacu muscle, including AMP desaminase 3 (ampd3), AMP desaminase 1 (ampd1) Phosphoglycerate mutase 2 (pgam2), Sestrin-1 (Sesn-1), and Thioredoxin-interacting protein (txnip) [87][88][89] (S6 Table).

Genes expressed differentially in the skin of the C. macropomum and tambacu
The skin is a complex type of tissue that plays a key role in sensorial activities, thermoregulation, hormone metabolism, and defense [20,90,91]. In addition to these functions, the morphological similarities between the C. macropomum and the tambacu also justifies the analysis of this tissue. Overall, 210 DEGs were identified in the skin of C. macropomum and tambacu.
In the skin of the C. macropomum, the transcripts with the largest number of mapped reads corresponded to the transposable elements: Transposable element Tcb1 transposase (tcb1), Transposable element Tc1 transposase (tc1a), and Transposable element Tcb2 transposase (tcb2). These elements are DNA segments capable of moving through the genome of a cell ("jumping genes"), thus affecting the evolution and genome composition of plants and animals, inducing processes ranging from pathologies to genetic differences in these organisms [92][93][94]. As in the muscle, the genes that encode collagen proteins were also found to be overexpressed in the skin of C. macropomum, including: Collagen alpha-1(I) chain (co1a1), Collagen alpha-3(VI) chain (co6a3) and Collagen alpha-2(I) chain (co1a2). These proteins are abundant in animals and are important for the molecular architecture and the shape and mechanical properties of tissues [95,96].
Another important gene family, that should be mentioned, is the Cation channel spermassociated protein (catsper1/ctsr1 and catsper3/ctsr3). Despite presenting RPKM values lower than 100 (Table 3, S6 Table), this gene group was up-regulated in the skin of C. macropomum. However, the contigs equivalent to these transcripts were not assembled in the tambacu. The proteins encoded by this gene form a complex ion channel that permits the Ca 2+ to enter the flagellum of the spermatozoa, which ensures their hyperactivation and plays a key role in male fertility [97]. The lack of this transcript in the tambacu may be related to the infertility or subfertility of these hybrids. Although some authors have observed that tambacu males produce small amounts of spermatozoa following abdominal pressure, which indicates that these fish can be crossed with their parent species (unpublished data), it remains unclear whether the tambacu is effectively fertile. In addition, the presence of up-regulated transcripts of catsper1 and catsper3 in the C. macropomum skin may represent a potential genetic marker for sex differentiation of the specie, which would be important, given the lack of sexual dimorphism in these fish during non-breeding periods, as occurred in the present study.
With regard to the differential expression of genes in the skin of the tambacu, the transcript with the largest number of mapped reads was related to the keratin, type II cytoskeletal 8 (krt8) protein, which was also present in the DEGs. In fact, other transcripts that encode proteins of this large keratin family (classes I and II) were also found in the DEGs of the tambacu skin (Table 3, S6 Table). The products of this gene family are structural proteins of the epithelial cells, which are expressed differentially according to the cell layer and tissue. The functions of these proteins have not yet been determined fully, but they are known to be involved in a number of processes, including structural support, cytoarchitecture, the response to stress, the regulation of signaling pathways related to apoptosis and protein synthesis, the distribution of organelles, and tissue repair and proliferation [98][99][100].
In contrast with the pattern observed in the C. macropomum, only a single gene of the collagen family-the collagen alpha-1(XVII) chain (Col17a1)-was observed in the DEGs of the tambacu skin. This transmembrane protein is a structural component of the hemidesmosomes in the epithelial cells and is associated with the keratinocytes [101]. One other important gene is the dsp, which encodes desmoplakin (Table 3). This is a key desmosome protein, which plays a major role in the attachment, assembly and stabilization of intermediate filaments, in addition to interacting with keratin proteins [102,103].
Some authors have suggested that hybrids are more resistant and grow faster than their parent species [11,104]. However, few studies have demonstrated any actual advantages in terms of increased productivity, and in fact, some data indicate that hybrids may be more susceptible to parasites [105,106]. In the present study, major differences were found in the specificity and levels of gene expression in the muscle and skin of the C. macropomum and tambacu, which may contribute to the identification of the specific genes involved in the immunity, growth and fertility of these organisms. These findings raise a number of important questions that must be settled in order to confirm whether hybrids have any real advantages over their parent species, and whether they constitute a threat to the stocks of native species. We would further emphasize here the various up-regulated genes linked to stress found in the tambacu, which may be relevant to the farming of this hybrid, given that the stress of the captive environment may have a negative effect on the development of the fish. Given this, it will be very important to identify the factors that provoke this condition, not only for the organism itself, but also fish farming (the region in which this hybrid is farmed should also be evaluated).

Conclusion
This is the first comparative analysis of the C. macropomum and its hybrid form, the tambacu. The results indicated that the NGS platform used in the study is a powerful tool for the identification of genes and molecular markers in non-model species, in addition to confirming that the combination of distinct assembly methods can enrich the dataset considerably. Despite of the lack of a reference genome, we obtained a satisfactory number of transcripts with functional annotation, associated typically with biological processes, with the skin presenting the greatest number of GO terms. The KEGG analysis also identified similarities in the metabolic pathways of the two organisms, with several transcripts in the metabolism category. There was a degree divergence in the number of exclusive genes, and the levels of gene expression and function between the C. macropomum and tambacu tissues. However, a significant number of genes remained uncharacterized, and are also of particular interest for future studies on the physiology, conservation, and genetic improvement of the native species, as well as the management of hybrid stocks. This research is thus an important contribution to the investigation of genes that contribute to potential advantages or disadvantages in the productivity of the hybrids, their implications for the conservation and management of C. macropomum, and provide a valuable tool for future research in functional genomics.