Stallion Sperm Transcriptome Comprises Functionally Coherent Coding and Regulatory RNAs as Revealed by Microarray Analysis and RNA-seq

Mature mammalian sperm contain a complex population of RNAs some of which might regulate spermatogenesis while others probably play a role in fertilization and early development. Due to this limited knowledge, the biological functions of sperm RNAs remain enigmatic. Here we report the first characterization of the global transcriptome of the sperm of fertile stallions. The findings improved understanding of the biological significance of sperm RNAs which in turn will allow the discovery of sperm-based biomarkers for stallion fertility. The stallion sperm transcriptome was interrogated by analyzing sperm and testes RNA on a 21,000-element equine whole-genome oligoarray and by RNA-seq. Microarray analysis revealed 6,761 transcripts in the sperm, of which 165 were sperm-enriched, and 155 were differentially expressed between the sperm and testes. Next, 70 million raw reads were generated by RNA-seq of which 50% could be aligned with the horse reference genome. A total of 19,257 sequence tags were mapped to all horse chromosomes and the mitochondrial genome. The highest density of mapped transcripts was in gene-rich ECA11, 12 and 13, and the lowest in gene-poor ECA9 and X; 7 gene transcripts originated from ECAY. Structural annotation aligned sperm transcripts with 4,504 known horse and/or human genes, rRNAs and 82 miRNAs, whereas 13,354 sequence tags remained anonymous. The data were aligned with selected equine gene models to identify additional exons and splice variants. Gene Ontology annotations showed that sperm transcripts were associated with molecular processes (chemoattractant-activated signal transduction, ion transport) and cellular components (membranes and vesicles) related to known sperm functions at fertilization, while some messenger and micro RNAs might be critical for early development. The findings suggest that the rich repertoire of coding and non-coding RNAs in stallion sperm is not a random remnant from spermatogenesis in testes but a selectively retained and functionally coherent collection of RNAs.


Introduction
Mammalian sperm are considered terminally differentiated and functionally dormant cells with the sole purpose of delivering the paternal genome into the zygote [1,2]. Therefore, early claims about the presence of RNA in mouse [3], bull [4], rat and human sperm [5] were met with skepticism. However, research over the past decade has provided compelling evidence that mature mammalian sperm contain complex populations of RNAs [1,2,6,7,8,9]. These include over 3,000 mRNAs [6,8], and a heterogeneous population of small and long non-coding RNAs [8,10,11,12,13,14], though typically sperm are depleted of intact ribosomal RNAs [6].
The functions of sperm RNAs remain a subject of debate. The initial opinion was that sperm RNAs have no functions of their own and are simply residues of spermatogenesis, reflecting the events that occurred during their formation in the testes [1]. This may be partially valid, although recent discoveries have essentially expanded these views showing that sperm mRNAs constitute a population of stable full-length transcripts, many of which are selectively retained during spermatogenesis [6,11]. Some mRNAs are thought to have a role in sperm chromatin reorganization by setting up boundaries between protamine-and histone-packaged DNA [11,15]. Some mRNAs/cDNAs can be sperm-borne via transcription and reverse-transcription [10,16]. It has been reported that sperm mRNAs can be de novo translated using mitochondrial-type ribosomes during capacitation [17,18,19]. Both sperm mRNAs and micro RNAs (miRNAs) are involved in non-Mendelian inheritance, serving as transgenerational epigenetic signals for zygotic gene regulation [20,21,22]. Furthermore, a few RNAs have been found only in the sperm and the zygote, but not in the oocyte, providing evidence for a unique paternal contribution [23,24].
Even though the functions of the majority of the sperm RNAs remain enigmatic, it has been proposed that sperm transcriptional profiles might provide clinical markers for male fertility [1,11]. Moreover, the non-invasive sample procurement through semen collection makes the approach particularly attractive. Indeed, an increasing number of studies in humans demonstrate that sperm mRNA profile can serve as a molecular diagnostic platform for evaluating male fertility [1,9,25,26]. Consistent and biologically relevant qualitative and quantitative differences are present between the sperm RNAs of fertile men and men with abnormal reproductive phenotypes, such as skewed protamine ratios [27], teratozoospermia [26], cryptorchidism [28], reduced sperm motility [29], and idiopathic infertility [30,31]. Similarly, sperm transcriptome studies have been initiated in bulls [29,32,33,34] and boars [23,35,36] showing differences between the mRNA profiles of high-and low fertility bulls [34]. Analysis of porcine sperm, oocytes and two-cell embryos reveal that mRNAs of some genes, viz., CLU, PRM1 and PRM2 are delivered to the zygote exclusively by the sperm [23].
Despite the promising diagnostic potential of sperm RNAs for male fertility, the approach has found only limited attention in stallions [37,38]. At the same time, poor fertility of breeding stallions is a recognized concern in the equine industry. While foal crop and stud fees form a principal component of the economy of the industry, stallions are typically selected on the basis of their ancestry and performance, and not for their reproductive potential [39]. As a result, about 36-43% of prospective breeding stallions do not pass the breeding soundness tests [40,41].
The goal of this study was to obtain detailed information about the RNAs present in the sperm of normal fertile stallions to improve understanding of the biological significance of sperm RNAs and to establish a foundation for the discovery of spermbased biomarkers for stallion fertility.

Expression microarray analysis
Gene expression microarray analysis revealed 6,761 gene/EST transcripts in stallion sperm and 11,112 in the testes. The majority (97%) of the sperm transcripts were shared with the testes, while surprisingly, 165 transcripts were detected (at signal-to-noise ratio, SNR $2) only in the sperm and not in the testes, and are referred to as sperm-enriched transcripts.
Gene Ontology (GO) annotations were found for 3,319 (49%) sperm transcripts and grouped according to biological process (2,136; 78.9%), molecular function (1,503; 55.5%) and cellular component (2,270; 83.8%) ( Table S1). The sperm transcripts were most significantly (p,0.001) involved in chemoattractant-activated signal transduction pathways, viz., sensory perception and Gprotein coupled signaling, and ion transport related biological processes. The most prevalent molecular functions were related to ion-, nucleotide-, and chromatin binding and the associated cellular components were membranes and vesicles (Table S2). These functional categories were also represented among the 165 sperm-enriched transcripts, though with lower significance values because of fewer genes analyzed (Table S3). In contrast, testes transcripts were localized in all cellular compartments and involved in diverse molecular functions and biological processes (data not shown).
Comparison of the expression of the 6,596 transcripts common for the sperm and the testes (Fig. 1) identified 155 genes/ transcripts that were differentially expressed (DE) between them.
Of these, 60 were up-regulated (fold change; FC.2; p,0.05) and 95 were down-regulated (FC,22; p,0.05) in the sperm (Table  S4). Gene ontology terms could be determined for 37 up-regulated and 47 down-regulated transcripts showing that the former were involved in cell motility and cytoskeleton functions, while the latter were associated with functions in translation and non-membraneenclosed organelles, e.g., ribosomes (Fig. 2). Microarray results for the most significant (p,0.005) DE genes were confirmed by quantitative RT-PCR (qRT-PCR) ( Fig. 3; Fig. S1; Table 1).

RNA sequence analysis
Mapping RNA sequence reads in the equine genome. Next generation sequencing (NGS) of total RNA from the sperm of two reproductively normal stallions generated about 70 million raw reads and more than 3 Gb of sequence per sample; over half of these aligned with the EcuCab2 [42] reference genome ( Table 2). Average coverage (AC; normalized number of transcripts) values could be calculated for over 30 million reads that mapped to all equine chromosomes, including ChrUn and the mitochondrial genome ( Table 2, Table S9), whereas 19,257 sequence tags with AC $1 were uniquely mapped to specific locations in the horse genome (Table 2). Of these, 14,982 map locations were shared between the two samples, while 2,188 and 2,087 were unique to sample 1 and sample 2, respectively (Fig. 4a). These differences could be due to a combination of individual and technical variations, and justified the use of two biological replicates in this study. Genomic locations of all mapped tags together with their absolute and relative AC values are presented in Table S5. The data are deposited in NCBI Gene Expression Omnibus [43,44] and are accessible through GEO series accession number GSE38725 (http://www.ncbi.nlm.nih.gov/geo/query/ acc.cgi?acc = GSE38725).
The 19,257 sequence tags mapped to all horse (Equus caballus, ECA) autosomes, the X chromosome, chromosome Un, and the mitochondrial (Mt) genome (Table 3). Because the horse Mt genome is only 16,660 bp [45], it showed the highest number of mapped tags per megabase (Mb) though only 3 tags mapped to this part of the genome. Among the autosomes, the number of tags in relation to chromosome size correlated well with the known gene densities and was the highest in ECA11, ECA12 and ECA13, and the lowest in ECA9 and ECAX (Table 3).
According to the AC value which was used as the measure of expression level, the 19,257 tags fell into three categories: i) 1,028 (5%) tags with very high expression and AC$100; ii) 8,759 (45%) tags with high expression and AC between 10 and 100, and iii) 9,470 (50%) tags with medium expression and AC between 1 and 10 ( Fig. 4b). The distribution of very highly expressed tags in the horse genome was not uniform and tags with ACmax.100,000 were predominantly found in ECAUn, ECA1, ECA3, and ECAX, possibly indicating the locations of functionally important genes for the sperm. However, accumulation of very highly expressed tags to ECAUn is more likely because it contains multicopy sequences encoding for 18S and 28S ribosomal RNA which form about 80% of raw reads (see below). Compared to this, the ACmax in ECA12, ECA16 and ECA30 was less than 1,000 sequence reads per locus (Table 3).
Overall, there was a good agreement between the two sperm samples regarding the number and AC values of about 80% of mapped tags across the genome, including the tags with very high expression (Table S5). The most pronounced differences were cases where the same tag scored high or very high (AC.10) in one sample and low (AC,1) in another. Some differences in alignment of data in biological replicates were likely due to sequencing errors and chance alignments which is a significant problem for short reads and low alignment scores [46]. Among the 19,257 tags, 22% fell into this category and were uniquely mapped in sperm 1 or sperm 2 ( Fig. 4a; Table S5).
Structural and functional annotation of RNA sequence data. Structural and GO annotations of the 19,257 mapped RNA-seq tags with AC$1 were conducted by alignment to the equine reference sequence (EcuCab2; UCSC Genome Browser; http://genome.ucsc.edu/) using Enhanced Read Analysis of Gene Expression (ERANGE) software packages [47], as well as by homology-based approach against the human genome in GOanna (AgBase; http://www.agbase.msstate. edu/cgi-bin/tools/GOanna.cgi) pipeline. A total of 5,903 (,30%) of all mapped tags, aligned with annotated genes in the horse genome and were classified by ERANGE as expressed sequence tags (5,268), mRNAs (495) and micro RNAs (140) (Fig. 5a). Since the structural annotation of the equine genome is as yet incomplete, we used a permissive 620 kb parameter to identify additional untranscribed regions (UTRs), new external exons, and to discriminate best candidates for novel genes. Among the 5,903 annotated transcripts, ,17% entirely fell within the boundaries of annotated genes, 83% partially aligned with known genes, and 0.03% localized within the extended gene boundaries (see Materials and Methods). Only 1,378 annotations uniquely corresponded to individual equine genes. Similarly, 34% (6,606) of all mapped RNA-seq tags aligned with annotated sequences in the human genome identifying 3,262 unique genes. Because the horse   (ERANGE) and human (GOanna) annotations shared only 136 genes in common (Table S6), stallion sperm transcripts as observed by RNA-seq analysis corresponded to a total of 4,504 annotated genes (Fig. 5b).
The majority of mapped RNA sequence tags (13,354 tags, 70%) had no match in the current horse genome draft assembly by ERANGE. From the tags that could not be annotated, we selected 12 tags with extremely high average coverage values (AC. 50,000) and showed by manual BLAST analysis that 67% of these tags were highly similar to the rRNA in the 60S (5S, 28S) and 40S (18S) subunits of the eukaryotic ribosome (Table 4). High AC values of these tags indicated abundant representation of rRNA in stallion sperm.
Comparison of RNA-seq data with current gene models. Structural annotations of RNA-seq data by ERANGE for pyruvate kinase (PKM2), cysteine-rich secretory protein 3 (CRISP3), protamine 1 (PRM1), and transition protein 2 (TNP2) were compared with the current NCBI equine gene models (UCSC Genome Browser; http://genome.ucsc.edu/). The genes were selected due to their known functions in sperm motility, packaging, structure and fertilization (Table 5), and because all four genes were represented by high number of transcripts (AC.100) in the sperm.
Of the 9 tags that mapped to PKM2, each corresponded to two different NCBI accessions (Table S8) suggesting the presence of two splice variants in stallion sperm. Based on the AC values, the variant comprising of exons 1, 3, 4, 5, and 6 was more abundant than a variant where exons 2 and 9 were included; no tags aligned with exons 7, 8, 10, and 11. However, a relatively abundant (AC = 94.16) sequence tag aligned with a 59 upstream region of the gene indicating likely presence of an additional exon (Fig. 6a).
The three CRISP3 sequence tags mapped 59 upstream from the current gene model and did not align with any of the eight known exons (Fig. 6b). This could indicate inaccurate annotation of the gene in the equine draft assembly (EcuCab2) [42] or the presence of additional 59exons.
All RNA-seq mapped tags that aligned with PRM1, aligned also with TNP2, thus having two distinct accession locations (Table S8, Fig. 6c) in this tightly regulated protamine gene cluster [48,49]. Transcripts with the highest AC values aligned with the two PRM1 exons, while a number of sequence tags with high to medium AC values mapped 59 upstream of PRM1, and between PRM1 and TNP2 (Fig. 6c)-the latter corresponding to parts of the initial joint transcript of the protamine gene cluster [48,49].
Discovery of Y chromosome transcripts in stallion sperm. Horse Y chromosome sequences are not present in the draft assembly (EcuCab2) [42] or in the whole genome (WG) expression oligoarray [50]. Therefore, we used the recently published catalogue for 29 ECAY genes and ESTs that have cDNA evidence [51], and found seven transcripts in the sperm (Fig. 7). These included one X-degenerate gene (DDX3Y), three horse specific novel transcripts (ETSTY4, ETSTY6 and ETY1), and three Y-acquired retrotransposed genes (EIF3CY, MTND1 and RPS3AY).
Comparison of microarray and RNA-seq data. The 21,000-element equine gene expression oligoarray is designed to specifically target genes, so that each probe on the array corresponds to a specific gene or EST [50]. The RNA-seq data, on the other hand, comprises the entire transcriptome and multiple sequence tags can be mapped to a genomic region corresponding to one gene. This explains the substantial difference between the number of sperm RNAs detected by microarray analysis (6,761 transcripts) and by RNA-seq (19,257 mapped tags). Nevertheless, the two datasets were similar regarding the number of annotated genes: 3,319 by microarray and 4,504 by RNA-seq. While 65-70% of the microarray transcripts were present among RNA sequences (data not shown), RNA-seq additionally identified miRNAs and over 13,000 anonymous mapped tags. The latter potentially correspond to splice variants of known genes, to genes yet to be annotated, to various non-coding RNAs (ncRNAs), and maybe even to a few novel genes. Importantly, RNA-seq data allowed refined quantitation of RNA transcripts as expressed by the AC values (Table S5), to determine the level of their expression in stallion sperm.

Discussion
The discovery of haploid transcripts in mammalian sperm dates back to almost two decades when c-MYC mRNA [52] and MHC Class I transcripts [53] were detected by RT-PCR in human sperm. Since then, a number of studies have characterized individual transcripts, as well as the global transcriptome of the sperm in normal and subfertile men [6,8,28,31,54]. In animals, sperm transcripts have been studied in bulls [29,32,33,34], boars [23,35], and recently in the water buffalo [55] using human microarrays, species-specific small custom-made microarrays, or quantitative PCR. To our best knowledge, the present study is the first global sperm RNA analysis in stallions, though massively parallel sequencing has been recently used to study RNAs in the sperm of men [56] and mice [57]. Our findings that thousands of coding and non-coding RNAs are present in mature stallion sperm are in good agreement with previous research in the field [6,34,56,57].

Microarray analysis versus RNA-seq
Analysis of stallion sperm transcriptome by microarray and RNA-seq in the present study, allowed comparison of the two approaches for the efficiency to detect sperm mRNAs. The information obtained by gene expression microarrays is typically influenced by array design and annotation, with a possible advantage that previously known annotations of array probes will reduce the bioinformatics load of analysis. The 21,351-element equine WG oligoarray [50] used in this study contains 14,531 GO annotated gene products (AgBase: http://www.agbase.msstate. edu/) of which 3,319 were identified in the sperm. In contrast, transcriptional profiling by RNA-seq is unbiased, targets all classes of RNAs, and substantially outperforms microarray in the dynamic range of the expression levels [47,58]. Indeed, the heterogeneity and expression range of the 19,257 mapped RNAseq tags in stallion sperm essentially exceeded the microarray data. The downside, however, was limited power of structural annotation of the RNA sequences due to which 70% of mapped tags remained anonymous and will be targets for bioinformatics pipelines in the future. Partial incompatibility between the accession identities of microarray and RNA-seq annotations set additional limitations to efficiently compare the two datasets. We conclude that RNA-seq is certainly the method of choice for global transcriptome analysis and for the discovery of biomarkers for stallion fertility.

Sperm versus testis: selective retention of mRNAs in sperm
The sperm of reproductively normal stallions contained a rich repertoire of about 6,000 mRNAs/ESTs (Figs. 1, 5a) which, according to microarray analysis, represent approximately 50% of the mRNAs found in the testes (Fig. 1), a ratio similar to that reported for men [6]. The ,11,000 testes transcripts, as determined here by microarray, are close in number to the 12,013 expressed genes recently found in stallion testes by RNAseq [59]. The majority of mRNA/EST transcripts in stallion sperm were concordant with those in testes (Fig. 1), supporting the prevailing idea that sperm transcripts are solely historical records of spermatogenesis in testes [1,6,7,8,60,61]. Therefore, the detection of 60 sperm up-regulated and 165 sperm-enriched transcripts by microarray analysis was a surprise. Because GO analysis of these transcripts showed their direct relevance to sperm functions (Figs. 2, 3; Tables S2, S3), it is tempting to speculate that certain transcriptional products of spermatogenesis are selectively retained in the sperm but not in the testes. This was further supported by GO annotations for the sperm RNAs that corresponded to known genes, mRNAs and ESTs (Table S1, Table 5) showing that the majority of sperm transcripts relate to a few defined functional categories. These included cytoskeleton and G-protein coupled receptor activities, transmembrane transport, ion channels, and mitochondrial ribosomal proteins-functions involved in sperm chemotaxis, capacitation, sperm-egg interactions, and the acrosome reaction [62,63,64]. For example, ion channels play an important role in fertilization by facilitating interactions of the sperm with its environment and the egg during capacitation, sperm-egg recognition, and the acrosome reaction [65,66]. Transmembrane transport of glycoproteins on the surface of sperm tails, on the other hand, is required for primary binding of the sperm to the zona pellucida during capacitation and sperm-cumulus interaction [67]. Even though the high abundance of olfactory receptors (OR) and the predominance of sensory perception related biological processes in sperm transcriptome (Table S1) seems at first sight bizarre, ORs too are directly involved in sperm functions. There are between 20 and 66 testicular ORs in mammals which play pivotal roles in progesterone activated signal transduction pathways in guiding sperm chemotaxis, capacitation, Ca 2+ -channels and acrosome reaction [64,68]. The findings also set an important foundation for future research to examine whether the regulation of ORs in individual sperm cells is as tightly controlled as their expression in the central nervous system, where each neuron expresses monoallelically only one particular OR [69]. This might be of value for assisted reproduction and for the improvement of the therapy of subfertility.

Y chromosome transcripts in stallion sperm
One of the original findings was the detection of seven Y chromosome mRNAs in stallion sperm (Fig. 7). Sequences of the Y chromosome are typically missing from EST and cDNA libraries, from genome sequence draft assemblies, and thus, from expression arrays and gene annotation pipelines. Therefore, Y transcripts in the sperm have been identified only in species with advanced Y  chromosome gene catalogues-DBY, SRY, and RPS4Y in men [70] and Dby in mice [24]. Given the known role of Y chromosome genes in spermatogenesis and male fertility and their high expression levels in testes [51,71], the presence of Y transcripts in sperm is not surprising. Although, it is noteworthy that among tens of Y genes expressed in testes, only a few transcripts are retained in the sperm. Among these, DBY (alias DDX3Y) is of particular interest because it is present in the sperm of all three species-humans, mice and horses. In mice, Dby transcripts are retained in the sperm after capacitation and transferred into the oocyte during fertilization. These transcripts are thought to be necessary for the early development because blocking Dby with antisense RNA results in inhibition of zygotic development in mice [24]. Given that Y transcripts are delivered to the zygote exclusively by the sperm, the functions of ECAY transcripts in stallion sperm need further investigation.
In summary, functional coherence of the GO categories of stallion sperm coding RNAs is in agreement with the observations in humans that sperm mRNAs are not random untranslated remnants of spermatogenesis but constitute a population of stable full-length transcripts that are selectively retained for functions in fertilization and early development [6,11].

Non-coding RNAs: rRNAs, miRNAs and lncRNAs
Direct sequencing of stallion sperm total RNA allowed the discovery and identification of RNA species other than mRNAs. Among these, ribosomal RNAs (rRNAs) comprised a substantial portion of mapped tags with very high AC values (Table 4). This was a surprise because it has been a common knowledge that the sperm are depleted of rRNA [6,72]. Absence of intact 18S and 28S rRNA peaks has been shown in most microarray-based sperm transcriptome studies [6,32,35,38] and is an established standard for sperm RNA quality evaluation [73]. Recent RNA-seq of human sperm transcriptome [56] reveals that the truth lies in the middle: 18S and 28S are the most abundant (80%) sperm transcripts but they are not intact. Sperm rRNAs undergo selective cleavage which specifically destroys full-length rRNAs but does not affect mRNAs or small non-coding RNAs. Cleavage of sperm rRNAs is needed to ensure translational cessation and prevent spurious protein synthesis in the sperm. These findings explain the presence of rRNAs in the stallion sperm in this study but also clarify why 18S and 28S peaks are absent from sperm RNA quality control electropherograms [73].
One of the most exciting results was the discovery of 82 sperm miRNAs (Table S7) which comprised 0.73% of all RNA-seq mapped tags (Fig. 4) and were annotated according to the in silico detection of miRNAs in the horse genome [74]. The number of miRNAs in stallion sperm was comparable with the 68 miRNAs found in human spermatozoa [13], and several stallion miRNAs were the same as identified in the sperm of men [75], boars [76] and mice [57,77]. Among the latter, the most noteworthy were the sperm-borne miRNAs which are required for the first cleavage division and are found in mouse sperm and one-cell embryos but not in the oocytes or embryos past the one-cell stage. Three such miRNAs [77], MIR34B, MIR34C and MIR449A, were highly abundant (AC$100) in stallion sperm (Table S7). While the functions of sperm microRNAs in equine biology are yet to be determined, recent discoveries in mouse and humans suggest that sperm miRNAs, as well as novel piRNA-and tRNA-derived small RNAs [57,78], regulate gene expression in the early zygote either by direct interaction with mRNA or via epigenetic mechanisms [13,20,21,77,79]. For example, miR-124, also found in stallions, is critical for the establishment of a distinct, heritable chromatin structure in the promoter region of Sox9 and is responsible for RNA-mediated epigenetic control of embryonic and adult growth in mice [79]. Further, recent comparative study on birth and expression evolution of mammalian miRNA genes [80] indicates the particular importance of X-linked miRNAs in testes where they are potentially involved in diverse functions during spermatogenesis. These X derived miRNAs tend to be duplicated and have higher expression levels than autosomal miRNAs. Though the functions of miRNAs in testes and sperm are likely different, it is worth mentioning that among the 82 sperm miRNAs identified in this study, six were derived from the X chromosome of which MIR223 has high expression level (AC = 295; Table S7). The discovery of over 100 sperm miRNA sequence tags of which 82 could be aligned with unique miRNAs evidenced that to some extent the small RNA fraction can be successfully targeted by global transcriptome sequencing, without special small RNA library construction. However, given that mammalian species on average have about 300 miRNAs [80] and that sperm are enriched with mse-tsRNAs (mature sperm-enriched tRNA-derived small RNAs) [78] and other small non-coding RNAs [57] with likely functions in development, additional studies are needed for in depth analysis of the small non-coding RNA fraction of stallion sperm.Male germ cells also contain transcripts of long non-coding (lnc) regulatory RNAs [21] which are longer than 200 nucleotides, have little or no protein-coding capacity, and regulate gene expression through a diversity of mechanisms [81]. Because only three lncRNA genes are available for the horse in the Long Non-Coding RNA Database (http://lncrnadb.com/) and lncRNA sequences are not conserved across species [81], the RNA-seq annotation pipelines (ERANGE, AgBase) [47] did not identify any lncRNAs. However, we anticipate that among the over 13,000 RNA-seq tags that could not be annotated in this study, many represent small and long regulatory RNA species.

Functions of sperm transcripts
Recent studies have essentially challenged the prevalent concept that the sperm are transcriptionally and translationally dormant cells [82] and that sperm transcripts have no functions of their own [1,6,7,8,11,61,83]. For example, there is evidence that the mature sperm possess an efficient RNA polymerase machinery for transcription, mRNA splicing and for reverse transcription of the primary RNA into stable cDNAs [84], majority of which are delivered during fertilization to the zygote [16]. Sperm mRNAs can be de novo translated using mitochondrial-type ribosomes and at least 26 such sperm-translated proteins are known to be required during capacitation, sperm-egg interactions and fertilization [17,18,19,85]. Also, sperm coding and non-coding RNAs are thought to have a role in stabilizing sperm chromatin and facilitating the selective escape of sequences necessary for early development from repackaging by protamines [15]. This is in line with our findings that stallion sperm mRNAs are not retained randomly but form a distinct population with functions directly relevant to sperm-egg interactions, fertilization and embryonic development. Furthermore, the presence of non-coding regulatory RNAs suggests that like in mice, RNAs can serve as epigenetic modifiers of gene regulation in early equine development [10,20,21,86]. Despite these recent advances, functions of the majority of RNAs found in mammalian sperm remain to be identified [10] and need further investigation.
The primary practical goal of sperm transcriptome analysis in humans and animals is the detection of transcripts that could serve as biomarkers or diagnostic tools for fertility evaluation. For example, elevated protamine mRNA retention in human sperm is an indication of abnormal protamine translation and infertility [27]. Also, consistent and biologically relevant differences in sperm mRNA expression profiles have been found between fertile men and men with teratozoospermia [26], cryptorchidism [28] and idiopathic infertility [30,31]. In bulls, DE sperm transcripts have been associated with high or low sperm motility [29], as well as with overall high-and low-fertility [34]. In boars, statistically significant differences in sperm mRNA profiles have been associated with seasonal changes in the reproductive status [36]. Overall, the current knowledge about sperm transcriptome in men and animals suggests that sperm RNA profiles could be used as a genetic fingerprint of normal fertile males and as a molecular diagnostic platform for male infertility. In this respect, the results of the present study, particularly the expression data for sperm miRNAs and the mRNAs relevant to sperm functions, set a foundation for the development of sperm-based markers for fertility evaluation in stallions in the future.

RNA-seq data and structural annotation of the horse genome
While the primary goal of this study was to characterize in detail the transcriptome of stallion sperm, the generated RNA-seq data is a valuable resource for the improvement of horse genome structural annotation [54]. This was illustrated by suggesting additional exons, splice variants or another genomic location for four important sperm genes-PKM2, CRISP3, TNP2 and PRM1 (Fig. 6). Thus, the RNA-seq data is a valuable resource to improve the structural annotation of the horse genome, and for the discovery of novel genes and regulatory RNAs.

Ethics statement
Procurement of stallion semen and testes was performed according to the United States Government Principles for the Utilization and Care of Vertebrate Animals Used in Testing, Research and Training and were approved by the Clinical Research Review Committee (CRRCs #08-19; #08-33; #09-32; #09-47) and Animal Use Protocol #2009-115 at Texas A&M University, supplemented with Informed Owner Consent From stating that owners of the stallions gave permission for their animals to be used in this study.

Samples
Fresh ejaculates from five reproductively normal stallions were collected using an artificial vagina (Missouri model). The ejaculates were first evaluated for sperm concentration, motility characteristics and morphological features [87,88], followed by purification from somatic cells and immature sperm by EquiPure TM (Nidacon International, Sweden) discontinuous gradient centrifugation [73]. Testes samples were obtained from four normal stallions by castration. Purified sperm and testes were stored in RNAlater (Ambion) at 280uC until use.

RNA isolation and evaluation
Total RNA was isolated from sperm with TRIzol reagent (Invitrogen) as described by Das and colleagues [73], and from testes using RNeasy mini elute kit (Qiagen) and manufacturer's protocol. The RNA samples were cleaned from genomic DNA (gDNA) with Turbo DNase kit (Applied Biosystems/Ambion) and purified with RNeasy MinElute Cleanup kit (Qiagen). The quantity and quality of isolated RNA were evaluated with spectrophotometer (NanoDrop 1000, Thermo Fisher Scientific), Bioanalyzer (Agilent Technologies), and reverse transcriptase PCR (RT-PCR) using primers for sperm-and testes-specific PRM2 (protamine 2), and somatic cell-specific PTPRC (protein tyrosine phosphatase, receptor type, C) (Fig. S2) [73]. The spectrophotometer OD values for all total RNA samples must to be 1.70-1.75 for absorbance ratios A260/A280, indicating that the RNA is free from proteins and organic compounds [89]. The Bioanalyzer profiles distinguish between testes and sperm total RNA: RNA Integrity Number (RIN) above 8 and two peaks corresponding to 18S and 28S rRNAs are indicators for the good quality of testes RNA; in contrast, sperm is depleted of intact rRNA (Fig. S2) [73]. RT-PCR with intron-spanning primers for PRM2 validate that all RNA samples are free of genomic DNA, while no amplification of PTPRC in sperm indicates that the sperm RNA is not contaminated with RNA from somatic cells (Fig. S2) [73].

RNA linear amplification
For microarray hybridizations, sperm and testes total RNA was subjected for two rounds of linear amplification by T3/T7 promoter synthesis with RNA Amplification RampUp kit (Genisphere) following manufacturer's instructions. Starting with 20-30 ng of total RNA, about 20-60 mg of sense-strand mRNA was obtained and stored at 280uC until use.

Expression microarray hybridizations
Four different testes and five sperm RNA samples were used for microarray hybridizations. Testes samples were pooled to generate a reference RNA for normal stallion testes, while sperm samples were used individually. Individual sperm and pooled testes RNA was converted into cDNA and labeled with Cy3 or Cy5 using 3DNA Array 900MPX Detection kit (Genisphere). Transcriptional profiles of stallion sperm and testes were studied by hybridization to the Texas A&M 21,351-element equine WG expression oligoarray [50]. Each hybridization experiment comprised a pair of differently labeled (Cy3 or Cy5) RNAs: the testes reference and one of the five sperm samples. Including a dye swap, a total of ten microarray hybridizations were conducted in a Sure Hyb hybridization chamber (Agilent Technologies) overnight, followed by post-hybridizaton washes in pre-warmed (42uC) 26 SSC with 0.2% SDS and 0.26 SSC at room temperature for 15 min each.

Microarray data analysis
The slides were scanned with a Gene Pix 4100B scanner at 5 micron resolution (Molecular Devices). Spot-finding and quantification of array images was carried out using Gene Pix Pro 6.1 software and the data were stored as GenePix Results (.gpr) files. The raw intensity data were normalized within individual arrays using print-tip LOWESS method [90]. To be considered significant, the signal for a candidate had to be above a threshold value (SNR $2) determined according to the fluorescence output of the negative controls printed on the microarray. Bayesian t-test was performed to consider DE genes between the sperm and testes: signal FC .2 and p value 0.05 were considered significant. The normalized data were analyzed with Bioconductor LIMMA package in the R computing environment, followed by GO analysis using DAVID Bioinformatics Resources (http://david. abcc.ncifcrf.gov/) to describe those molecular functions and biological processes that appeared to be influential.

Validation by quantitative real-time PCR (qRT-PCR)
The cDNA was synthesized from 2 mg of linearly amplified testes and sperm RNA with SuperScript VILO cDNA synthesis kit (Invitrogen), purified with MinElute PCR purification kit (Qiagen), and evaluated for quantity and quality with a NanoDrop spectrophotometer (Thermo Fisher Scientific). Aliquots of cDNA were stored at 220uC until use. Exon spanning primers for qRT-PCR were designed for selected genes (Table 1) using Primer3 ver 0.4.0 software [91], and the efficiency of all primers was evaluated by making a standard curve in the sperm and testes samples. Duplicate qRT-PCR reactions in triplicate experiments were carried out on a Light CyclerH 480 (Roche Diagnostics) along with two housekeeping genes (ACTB, b-actin and PPIA, peptidylprolyl isomerase a) as controls. Each qRT-PCR assay used ,100 ng of mRNA in a 20 mL reaction with 16 Universal SYBRH Green Master Mix (Applied Biosytems, CA) and 300 nM primers. The results were analyzed with LightCycler 480 Software v1.5 by calculating log 2 2DDCt ; the P-value was calculated by performing student's t-test and p,0.05 was considered significant. Scatter plots for qRT-PCR statistics were generated in Microsoft Excel (Fig. S1).

Detection of Y chromosome transcripts in stallion sperm
Reverse transcriptase PCR experiments on stallion sperm and testes were carried out according to standard protocol [73] using primers for 29 known horse Y chromosome genes and transcripts with cDNA evidence [51], along with primers for PRM2 and PTPRC as positive and negative controls, respectively.

RNA-seq library construction and sperm RNA sequencing
Total RNA from the sperm of two reproductively normal stallions was used for next generation sequencing (NGS) on the ABI SOLiD 4 platform at Cofactor Genomics (ST. Louis, MO, USA).-Total RNA (500 ng) was directly used for SOLiD singleend RNA sequencing fragment library construction according to the ABI protocol (http://www.cofactorgenomics.com/faq) [92]. First strand cDNA was directly generated from total RNA using 4 mL of random hexamers (ABI) and SuperScript II Reverse Transcription Kit (Invitrogen) in a 30 mL final volume, following the manufacturer's instructions. The second strand cDNA was generated using 10 mL of 56 second strand buffer (500 mM Tris-HCl, pH 7.8; 50 mM MgCl2; 10 mM DTT), 30 nmol dNTPs; 2 U of RNase H, and 50 U of DNA Pol I (Invitrogen), and incubated at 16 uC for 2.5 h. The double-stranded DNA (dsDNA) was purified with QIAquick PCR purification kit (Qiagen) and the concentration was quantified. From each sample, ,100-200 ng of cDNA was fragmented using Covaris S2 System (Covaris, Inc.). Sequencing libraries were generated with SOLiD Fragment Library Construction Kit (ABI) as described elsewhere [92]. Briefly, fragmented cDNA was end-repaired with Polishing Enzyme 1 and End Polishing Enzyme 2 (ABI); adapter ligated with SOLiD P1 and P2 adaptors, size selected for 200 to 230 bp on a SOLiD Library Size Selection gel, followed by nick translation and PCR amplification using Library PCR Primers 1 and 2 and Platinum PCR Amplification Mix. Amplified libraries were column purified, quantified using the SOLiD Library TaqMan Quantitation Kit, and applied on ABI SOLiD sequencer at a concentration of 10 ng per lane.

RNA sequence analysis and annotation
The 50 bp single-end SOLiD raw reads were directly aligned with the horse reference sequence EcuCab2 [42] using ABI aligner software (NovoalignCS version 1.00.09, http://www.novocraft. com/) which uses multiple indexes in the reference genome, identifies candidate alignment locations for each primary read, and scores alignment locations using the Needleman-Wunsch algorithm [93]. The alignment parameters allowed the minimum number of 30 good quality bases for a read (l = 30); the highest alignment score acceptable for the best alignment was 140 (t = 140), whereas a default threshold was calculated from read length and genome size such that an alignment to a non-repeat should have a quality higher than 30; the number of alignments recorded for a read during the iterative search process, i.e., the number of alignments with score equal to the best alignment was 10 (e = 10). If a read was unaligned, it was shortened by 1 base and tried again. Alignments in repetitive sequences were discarded by removing reads with multiple similarly scoring alignments. The single highest-scoring alignment for each raw read was mapped.
Sequence alignment and alignment clustering to define expressed loci and perform linear normalization across the two sperm RNA samples was carried out with a software package EXpression analysis Pipeline, EXP (Cofactor Genomics).
Gene expression level or average coverage (AC) was calculated by normalizing each sample to the fewest reads and directly comparing different loci. Expression level of a transcript was estimated from the number of reads that mapped to that transcript. The variability present in sequencing depths in different samples was taken care of by the use of two biological replicates. Sequencing depth at each locus and differences in gene expression (AC) between the two sperm samples were calculated using log (base 2 ) ratio, thus showing the association between the two samples. Some differences in alignment of data in biological replicates were likely due to sequencing errors and chance alignments which is a significant problem for short reads and low alignment scores [46]. To combat the high false-positive rate, we focused on a high-quality subset of the data consisting of sequence variants supported by different independent reads. Sequencing reads were computationally categorized according to their AC and chromosomal location. This categorization was conducted comparatively with respect to a present horse genome draft sequence assembly, and normalized count of the number of mapped position was calculated. This count served as a proxy for the transcripts with true abundance in the sample. Expression directories were divided into four categories according to the sum of AC values: very high-AC$100; high 210,AC,100; medium 21#AC#10, and low AC,1. Loci with low expression in both sperm samples were removed from further analysis because they represented the least compelling evidence of expression. Genomic locations of all mapped transcripts were retrieved using Python VS 2.66 script.
Structural annotation of genes for all sequence tags with AC$1 was conducted in two categories: i) a homology-based approach with the human genome (AgBase GOanna; http://agbase.msstate. edu/cgi-bin/tools/GOanna.cgi) and ii) direct annotation with the horse genome using the Enhanced Read Analysis of Gene Expression (ERANGE) with a 620 kb window for recognized chromosomal locations. Matches were categorized as: F-falling within gene boundaries; P-partially falling within gene boundaries, and A-adjacent, falling into extended gene boundaries within the expanded ERANGE window. Annotated genes were functionally analyzed and clustered for GO terms in DAVID Bioinformatics Resources (http://david.abcc.ncifcrf.gov/) with medium classification stringency for all parameters. Table S1 Gene Ontology classifications and terms for 3,319 sperm transcripts by microarray analysis. This table contains GO analysis statistics for all annotated genes that were expressed in sperm by microarray analysis. The GO categories i) Biological process, ii) Molecular function, and iii) Cellular component are shown on separate spreadsheets; Countnumber of genes associated with this gene set; Percentage-genes associated with this gene set/total number of query genes; Pvalue-modified Fisher Exact P-value; Genes-the list of genes from query set that are annotated to this gene set. (XLS)  The GO categories i) Biological process, ii) Molecular function, and iii) Cellular component are shown on separate spreadsheets; Count-number of genes associated with this gene set; Percentage-genes associated with this gene set/total number of query genes; P-value-modified Fisher Exact P-value; Genes-the list of genes from query set that are annotated to this gene set. (XLSX) Table S4 Differentially expressed genes (n = 155) between the sperm and the testes. A list of the 60 sperm upregulated and 95 sperm down-regulated genes, their NCBI and RefSeq accession numbers, logFC-log2 fold change in expression between sperm and testes; AveExpr-average log2-expression level of that gene across red-green channels, and P-value; NULL-no annotation; #NA = unknown. (XLS) Table S5 Mapped RNA sequence tags (n = 19,257) from the sperm of the two stallions. The table presents the following information for each mapped sequence tag: i) genomic location, ii) average coverage in sperm 1 (AC1) and sperm 2 (AC2), and iii) log2 ratios between AC1 and AC2. Columns at the left are sorted by AC1 and columns at the right by AC2. Mapped tags with AC$1 are shaded grey. (XLS)