Investigation of De Novo Unique Differentially Expressed Genes Related to Evolution in Exercise Response during Domestication in Thoroughbred Race Horses

Previous studies of horse RNA-seq were performed by mapping sequence reads to the reference genome during transcriptome analysis. However in this study, we focused on two main ideas. First, differentially expressed genes (DEGs) were identified by de novo–based analysis (DBA) in RNA-seq data from six Thoroughbreds before and after exercise, here-after referred to as “de novo unique differentially expressed genes” (DUDEG). Second, by integrating both conventional DEGs and genes identified as being selected for during domestication of Thoroughbred and Jeju pony from whole genome re-sequencing (WGS) data, we give a new concept to the definition of DEG. We identified 1,034 and 567 DUDEGs in skeletal muscle and blood, respectively. DUDEGs in skeletal muscle were significantly related to exercise-induced stress biological process gene ontology (BP-GO) terms: ‘immune system process’; ‘response to stimulus’; and, ‘death’ and a KEGG pathways: ‘JAK-STAT signaling pathway’; ‘MAPK signaling pathway’; ‘regulation of actin cytoskeleton’; and, ‘p53 signaling pathway’. In addition, we found TIMELESS, EIF4A3 and ZNF592 in blood and CHMP4C and FOXO3 in skeletal muscle, to be in common between DUDEGs and selected genes identified by evolutionary statistics such as FST and Cross Population Extended Haplotype Homozygosity (XP-EHH). Moreover, in Thoroughbreds, three out of five genes (CHMP4C, EIF4A3 and FOXO3) related to exercise response showed relatively low nucleotide diversity compared to the Jeju pony. DUDEGs are not only conceptually new DEGs that cannot be attained from reference-based analysis (RBA) but also supports previous RBA results related to exercise in Thoroughbred. In summary, three exercise related genes which were selected for during domestication in the evolutionary history of Thoroughbred were identified as conceptually new DEGs in this study.


Introduction
Since domestication, at around 3500 B.C.E, horses have mainly been used for riding and racing [1]. One domesticated breed of horses, the Thoroughbred, has been specifically bred for speed, endurance, and strength since the 18 th century. The extreme selection for these traits has resulted in a highly adapted athlete [2] with very high aerobic capacity [3], and high skeletal muscle mass [4], which comprises over 55% of total body mass [5]. The Thoroughbred is an excellent breed for competitive horse racing and by extension a valuable model for studying exercise response. A previous study has shown that exercise training in Thorough-breds resulted in coordinated changes in the expression of genes related to metabolism, oxidative phosphorylation and muscle structure [6].
Domestication leads to gradual changes at the genetic level by a process of selection in a population of animals or plants. Most domestic animals were selectively bred for the goal of benefitting human beings. Due to the combined effect of natural selection and human-controlled selective breeding, phenotypic changes, which are related to genetic mutation, accompany the domestication process. Some genetic mutations with beneficial phenotypic effects have been either highly enriched or vanished by selective sweeps [7]. A selective sweep is the reduction of genetic diversity in the neighboring DNA of a fixed mutation. Selective sweep regions in the genome can potentially be identified by a genome scan, and the low variation interval surrounding the selected gene can be found by fine-scale mapping. Using such genome scans, selective sweeps have been identified in domestic and natural (wild progenitor) populations [8,9].
Previous horse transcriptome studies using RNA-seq were carried out by mapping sequence reads to a reference genome. However, reference genome assembly has been known to have flaws including missing expressed genes [10], hundreds to thousands of miss-assemblies and large genomic deletions [11], and problems in trans-spliced genes [12]. Therefore, the results and success of reference transcriptome assembly depends on both the availability and quality of the reference genome. On the contrary, de novo transcript assembly has several advantages. First, it does not depend on a reference genome [13]. This is a key advantage as many organisms do not have a high-quality finished reference genome. For these organisms, de novo assembly becomes the first analysis step. Also, it does not depend on the correct alignment of reads to known splice sites [14] or the prediction of novel splicing sites, both of which are required by reference-based assemblers. Trans-spliced transcripts and similar transcripts originating from chromosomal rearrangements can be assembled using the de novo approach. In addition, de novo transcriptome assembly can help researchers investigate genes that are absent in the reference genome due to the incompleteness of reference sequences [10]. Lastly, it can identify new transcripts and new transcript structures [15,16]. However, reconstruction of fulllength transcripts from short reads with considerable sequencing error rates poses substantial computational challenges [17]. Still, de novo assembly in RNA-seq is an important approach for carrying out transcriptomic studies.
Recently, many de novo assembly software tools have been developed, most of which take the de Bruijn graph approach. This approach usually has two important parameters: k-mer length and coverage cutoff value [16]. Tools such as Trans-ABySS [15], Trinity [17], ABySS [18], Oases [19], Rnnotator [20], Multiple-k [21], SOAPdenovo [22] and Velvet [23] follow this approach. Considering these de novo assembly software tools, Manfred G Grabherr, et.al concluded that Trinity de novo assembly software tool is superior to others for a number of reasons. Specifically, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. In addition, Trinity resolves ,99% of the initial sequencing errors, determines splice isoforms, and distinguishes transcripts from recently duplicated and identified allelic variants [17].
In this study, we used de novo-based analysis (DBA) to identify differentially expressed genes (DEGs) that could not be identified with reference-based analysis (RBA). Furthermore, we integrate DEGs with genes identified as being selected for during domestication to reveal genes that are related to the evolution of exercise response during the domestication process of the Thoroughbred.

Ethics statement
This study was carried out in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of Pusan National University. All experimental procedures used in this study were approved by the Institutional Animal Care and Use Committee of the Pusan National University (PNU-2013-0417). The owners of the Thoroughbred horses gave permission for their animals to be used in this study.
Analysis of horses RNA-seq data 1. RNA-seq data between before and after exercise. We generated RNA-seq data from six Thoroughbred horses before and after exercise as described in a previous study [24]. Samples of skeletal muscle and blood were taken from six Thoroughbreds before and after exercise. 'Before exercise' samples were collected from the triceps brachii of the right leg and from the jugular vein and carotid artery of each horse. After an adequate resting period of several hours, the horses were subjected to a 30-min trot. Then, immediately after this trot, the 'after exercise' samples were collected from the same tissues of each individual. Thoroughbreds usually canter for 17-18 min per day. For the purposes of this study, a 30-min trot was taken to be the equivalent to 17-18 min of cantering. Total RNA from the skeletal muscle and blood samples were isolated using TRIzol (Invitrogen) and the RNeasy RNA purification kit with DNase treatment (Qiagen). mRNA was isolated from the total RNA using oligo-dT beads, then reverse transcribed into double-stranded cDNA fragments. Construction and sequencing of an RNA sequence library for each sample was carried out based on Illumina HiSeq2000 protocols in order to generate 90 bp paired-end reads. Twenty-four sets of transcriptome data were generated from muscle and blood samples of six horses obtained before and after exercise.
The RNA-sequencing data from this study have been submitted to the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi. nlm.nih.gov/geo/) under the accession number GSE37870.
2. De novo-based analysis (DBA). We used the Trinity de novo assembly software tool [17,25] following the default settings, except for the following options: number of CPU and alignment method: bowtie. First, Trinity tools generated a reference for each obtained sample (a total of 24 samples) to detect DEGs. Second, Trinity tools generated a reference for each individual (a total of 4 samples) to compare with the SNPs from the whole genome sequence, RNA-seq (using reference transcriptome assembly) and RNA-seq (using de novo assembly).
The component ID was converted to known transcript ID (ftp:// ftp.ensembl.org/pub/release-73/fasta/equus_caballus/cdna/) using Blastall [26], a user-friendly, free open source tool, which is suited for short read alignment. After conversion, we filtered the transcript ID by alignment length of higher than 80%, more detail about the de novo assembly method is given in the Supplementary Methods (File S1).
Reference-based analysis (RBA): Most of the RBA used in this study is described in Kim et al [27]. TopHat [28] (ver.1.4.1) was used to map the sequences to a horse reference genome and annotated using the EquCab2 database (http://hgdownload.cse. ucsc.edu/downloads.html#horse).
3. DEG selection (de novo vs reference). We examined the differential expression of replicated count data by applying a method based on negative binomial model as implemented in the R package EdgeR [29]. This package was used because RNA sequence data may exhibit more variability than expected in a Poisson distribution due to wide dispersal in the genome. The method implemented in the EdgeR package automatically takes all known sources of variation into account. Significant DEGs were detected with a cut-off value of FDR,0.01, based on a paired design between'before exercise' and 'after exercise'.
4. Genotype calling and SNP calling (de novo vs reference). Three open source packages were used for downstream processing and variant calling: Picard Tools (http://picard. sourceforge.net), SAMtools [30] and the genome Analysis Toolkit (GATK [31]). Substitution calls were made with the GATK Unified Genotyper [32]. All calls with a Phred-scaled quality of less than 30 were filtered out and VCFtools [33] was used for handling the vcf file format.
5. DAVID analysis (de novo vs reference). One simple but extremely widely used systems biology technique for highlighting biological processes is gene category over-representation analysis. In order to perform this analysis, genes are grouped into categories by a common biological property and then tested to find categories that are over represented among the differentially expressed genes. Gene Ontology (GO) categories are commonly used in this technique and there are many tools available for performing GO and KEGG pathway analysis. We used DAVID [34] web tool to convert the equine Ensembl gene IDs to official gene symbols. This was carried out by cross-matching the equine Ensembl gene IDs to the human Ensembl gene IDs and the official gene symbols. The representation of functional groups in blood and skeletal muscle relative to the whole genome was investigated using the Expression Analysis Systematic Explorer (EASE) tool [35] within DAVID. The EASE tool is a modified Fisher's exact test used to measure enrichment of gene ontology (GO) terms [36]. To identify enriched GO terms, functionally clustered genes were filtered by an EASE value of ,0.01. In addition, A KEGG pathway enrichment test was performed using EASE, with a cut-off value ,0.1 6. Quantitative real-time reverse transcript-PCR (qRT-PCR) validation. A blood sample was obtained from a Thoroughbred horse maintained at Ham-an Racing Horse Resort & Training Center before and after exercise. Exercise was performed as a 30-min trotting on a treadmill. Trizol reagent (Invitrogen) was used to extract total RNA from leukocytes after exercise, according to the Invitrogen manual. In order to prevent contamination of genomic DNA, RNase-free DNase kit (Qiagen) was used according to the manufacturer's operating manual. Total RNA quantification was performed by using NanoDropH ND-1000 Spectrophotometer. cDNAs were synthesized in a reaction with oligo-dT primers, moloney-murine leukemia virus (MMLV) reverse transcriptase (Promega), RNase inhibitor (Promega) and RNase-free ddH 2 O, which was incubated at 37uC for 4 h.
To confirm the de novo unique differentially expressed genes revealed by RNA-Seq, seven DUDEGs were analyzed by RT-PCR amplification. The primers were designed using the PRIMER3 software (http://frodo.wi.mit.edu/primer3/) ( Table  S9 in File S1).
The RT-PCR conditions were as follows: an initial step of 94uC for 10 min, 35 cycles of 94uC for 30 sec, 60uC for 30 sec, 72uC for 30 sec, and final step of 72uC for 10 min. PCR bands were normalized with glyceraldehyde-3-phosphate dehydrogenase (GAPDH) band. RT-PCR products were visualized by gel electrophoresis on a 2.0% SeaKem LE agarose gel.
cDNA was analyzed by BioRad CFX-96. All samples were measured in triplicate to ensure reproducibility, and C t value was calculated using 2 2DDCt method [37].
Analysis of horse whole genome re-sequencing data 1. Whole genome re-sequencing data of Thoroughbred and Jeju domestic ponies. Whole-blood samples were collected from 18 Thoroughbred racing stallions of the Korean Racing Authority, and from four male and two female Jeju domestic ponies (Equus caballus) of the Jeju Provincial Livestock Institute, Korea. A 10 mL sample of blood was drawn from the carotid artery of each horse and was treated with heparin to prevent clotting. Genomic DNA was extracted and a quality check was carried out using fluorescence-based quantification on an agarose gel, a standard electrophoresis on a 0.6% agarose gel and, via a pulsed-field gel, using 200 ng of DNA. Manufacturers' instructions were followed to create a paired library of 500-bp fragments. This consisted of the following: purified genomic DNA fragments of less than 800 bp, fragments with blunt ends, fragments with 59 phosphorylated ends, fragments with a 39-dA overhang, some with adaptor-modified ends, purified ligation product, and a genomic DNA library. Following this, we generated sequence data using HiSeq 2000 (Illumina, Inc).
2. Reference genome assembly. Using the Burrows-Wheeler Aligner [38] with the default setting, pair-end sequence reads were mapped to the reference horse genome (ftp://ftp. ensembl.org/pub/release-73/fasta/equus_caballus/dna/) ( Table  S8 in File S1). The DNA re-sequencing data from this study have been submitted to the NCBI Sequence Read Archive (SRA) database under the accession numbers SRA053569, SRA054885 and SRP017702.
3. Genotype calling and SNP calling. We used the following open-source software packages; Picard Tools, SAMtools, and the Genome analysis toolkit, for downstream processing and variant calling. Substitution calls were made with GATK UnifiedGenotyper20 and all calls with a Phred-scaled quality of less than 30 were filtered out. For each chromosome, we simultaneously inferred the phased haplotype and inputed the missing alleles for the entire set of Thoroughbred populations using BEAGLE [39].

Estimation of Nucleotide diversity, F ST and Cross Population Extended Haplotype Homozygosity (XP-EHH)
value. Nucleotide diversity and long run of homozygosity (LROH) of Thoroughbred and Jeju domestic ponies for each chromosome were calculated by VCFtools. Conventional F ST [40] and Reynolds F ST [41] values were calculated for genes using Arlequin 3.5 [42] based on pairwise differences between the haplotypes of Thoroughbred and Jeju domestic ponies. In order to calculate F ST , we used the horse genome to phase the haplotypes of the two populations. Also, to calculate F ST by each gene region, we used the genomic information (Ensembl Genes71, EquCab2), namely the Ensembl reference annotated gene information. We selected the genes of the top 1% of the empirical distribution (empirical p-value,0.01) [43]. The method Cross Population Extended Haplotype Homozygosity (XP-EHH) was used to detect selective sweeps using the software xpehh [44] (http://hgdp. uchicago.edu/Software/). For XP-EHH analysis, we used haplotype information for all SNPs of the entire autosome, and we calculated Extended Haplotype Homozygosity (EHH) and the logratio integrated EHH (iHH) for the pairwise test of the Thoroughbred and Jeju domestic pony populations. The log ratios were standardized to have a mean of 0 and a variance of 1, and p-values were assigned assuming a normal distribution. We selected SNPs with p-values,0.01, which are considered to have strong selection signals. Then we apply a cutoff value of XP-EHH values,0 for finding adaptation in the Thoroughbred. We chose genes related with these SNPs by identifying genes located within a 10 kb [45] boundary of these SNPs. Since XP-EHH is not sensitive to allele frequencies, there is no need to stratify the data into frequency bins before determining significance. The p-values are empirical p-values; that is, a low p-value indicates that a locus is an outlier with respect to the rest of the genome. However, we note that loci detected as being under selection using this approach may be an under-representative sample of all truly selected loci; in particular, selection on standing variation and recessive loci are likely to be underrepresented [43].

Differences in the results of reference-based and de novobased assembly and analysis
Transcriptome analysis results of reference-based analysis (RBA) and de novo-based analysis (DBA) showed a substantial difference in the number of transcript and differentially expressed genes (DEGs) identified. In RBA, for blood and skeletal muscle, 15,900 and 17,927 transcripts were found, respectively, among which 2,244 and 1,405 were unique transcripts. In DBA, the numbers of transcripts in skeletal muscle and blood were 18,057 and 19,413, respectively with 4,401 and 2,892 unique transcripts. The numbers transcripts in common between RBA and DBA were 13,656 for skeletal muscle tissue and 16,521 for blood ( Figure 1a and Figure S1a in File S1). When the sample variance of RBA and DBA in skeletal muscle and blood were compared using multidimensional scaling (MDS) plot, the results for the two analyses were almost identical. The skeletal muscle samples were clustered into two subgroups: before and after exercise, but the blood samples did not show any clustering (Figure 1b, Figure S1b and Figure S2 in File S1). In RBA, the number of DEG in skeletal muscle and blood were 2,818 and 455, respectively with 2,200 and 427 DEGs being unique to RBA. In DBA, the number of DEG in skeletal muscle and blood were 1,652 and 595, respectively with 1,034 and 567 unique DEGs. The number of DEGs identified by both RBA and DBA were 618 and 28 in skeletal muscle and blood, respectively (Figure 1c and Figure S1c in File S1). These DEGs were compared using Heatmap visualization to examine their expression pattern in each analysis. The expression pattern was similar, however, the intensity of the expression was higher with DBA ( Figure 1d and Figure S1d in File S1). Overall, in comparison to RBA, DBA identified a higher number of transcripts but a lower number of DEGs.

Identification of de novo unique differentially expressed genes (DUDEGs) before and after exercise
We identified DUDEGs from RNA-seq data using the expression profiles of genes in skeletal muscle and blood samples taken from six Thoroughbreds before and after exercise. There were a total of 1,034 significant DUDEGs (519 up-regulated, 515 down-regulated) in skeletal muscle and 567 (314 up-regulated, 253 down-regulated) in blood (FDR,0.01). Among them, 456 (61 upregulated, 395 down-regulated) in skeletal muscle and 205 (93 upregulated, 112 down-regulated) in blood were annotated (Table S1 in File S1).

Validation of DUDEGs in horse RNA-seq data using RT-PCR
We performed RT-PCR to validate the DUDEGs detected in horse blood. The seven genes (TIMELESS, EIF4A3, PGIW, ANK3, MSH3, SYNRG, ASGR2: 2 up-regulated and 5 down-regulated) were randomly selected with conceptually new DEGs and logFC.2 in blood (Table S1, S4 and S5 in File S1). The expression levels of DUDEGs between RNAseq and RT-PCR were highly similar ( Figure S8 in File S1). The results confirmed that DUDEGs identified in this study were reliable.
Enriched KEGG pathways analysis using DUDEGs revealed that up-regulated genes in skeletal muscle and blood are associated with exercise-induced stress. The most significantly enriched terms in skeletal muscle were 'p53 signaling pathway', 'regulation of actin cytoskeleton', 'JAK-STAT signaling pathway', 'MAPK signaling pathway', 'cell adhesion molecules', 'cytokine-cytokine receptor interaction', 'bladder cancer', and 'pathways in cancer'. In addition, the two terms 'colorectal cancer' and 'biosynthesis of unsaturated fatty acids' were significantly enriched in blood ( Table 1).

Integration of DUDEG and selected gene associated with nucleotide diversity, F ST and Cross Population Extended Haplotype Homozygosity (XP-EHH)
The genes under selection were further investigated using nucleotide diversity, F ST and XP-EHH from WGS of Thoroughbred and Jeju pony (skeletal muscle = 1,033 and blood = 567). We found 12 genes (ZNF592, CD58, C1orf162, USP37, FOXM1, TIMELESS, TRMT1, CALR, ASNA1, EIF4A3, SYNRG and FADS1) in blood and 14 genes (HERC2, CHD9, DDX28, CAPZA1, TSEN15, CHMP4C, FOXO3, PLD2, ANKRD13D, UNKL, CBFA2T2, NECAB3, SLC25A29 and FBLN1) in skeletal muscle that were both identified as DUDEGs and implicated in F ST analysis as a selected gene (Table S4 in File S1). The F ST distribution histogram for the pair of horse breeds is shown in Figure S7 in File S1.
Between DUDEGs and XP-EHH (selected genes in the Thoroughbred), we found 11 (ANK3, ZNF592, TOR1AIP1,  TIMELESS, INSR (Table S5 in File S1). Among them, three genes, TIMELESS, EIF4A3 and ZNF592, in blood, and two genes, CHMP4C and FOXO3, in skeletal muscle were shown to be significant in all three analyses (DUDEG, F ST , and XP-EHH). In comparison to Jeju pony, the Thoroughbred showed relatively low levels of nucleotide diversity at three out of the five identified genes (Table 2 and Figure 3).

Differences in result between reference-based and de novo-based assembly and analysis
Up to now, many studies of RNA-seq analyses have used reference-based analysis (RBA) when a reference genome for the species is available [27,46,47]. When a species does not have a reference genome, RBA using the reference genome of a closely related species or de novo-based analysis (DBA) is used. Several studies of RNA-seq analyses have used the align-then-assemble combined method (align-then-assemble and assemble-then-align) [24]. DBA used in align-then-assemble method assembles the unmapped sequence reads after RBA, which supplements the inherent weakness of RBA. However, we suggest that even if a Figure 1. Summary of comparative analysis between de novo assembly and reference genome assembly from RNA-seq data of skeletal muscle from six Thoroughbreds before and after exercise (Total 12 samples). a) The number of transcripts in common between de novo assembly and reference genome assembly b) MDS plot of six Thoroughbreds before and after exercise using de novo assembly. c) The number of DEGs between de novo assembly and reference genome assembly. d) Heat-map visualization of common DEGs between de novo assembly and reference genome assembly: rows represent DEGs from skeletal muscle and columns represent assemble method from 6 horse samples (*First 'B' is for blood and 'M' is for muscle. 'F1', 'F2', 'F3' and 'S3' are horse samples. Last 'B' is for 'before exercise' and 'P' is for 'after exercise'). doi:10.1371/journal.pone.0091418.g001 species has a reference genome, DBA used in assemble-then-align is necessary for assembly of the total sequence reads including the unmapped sequence reads (Table S6 and S7 in File S1). In this study, we found a significant difference in the number of transcript and differentially expressed genes (DEGs) identified by RBA and DBA. Also, a greater number of unique transcripts were identified by DBA than in RBA (Figure 1a and Figure S1a in File S1). This implies that the horse RNA-seq data in this study includes new transcripts and new transcript structures that are not included in the horse reference genome. Additionally, we found de novo unique differentially expressed genes (DUDEGs) which cannot be attained from RBA (Figure 1c and Figure S1c in File S1). Credibility of DBA in RNA-seq has been proven in numerous methods and protocol papers [17,25,48] and in this study. Both multidimensional scaling (MDS) plot (Figure 1b, Figure S1b and Figure S2 in File S1) and expression patterns of common DEGs did not show differences between the results of RBA and DBA. However, the intensity of the expression was different because the newly assembled transcriptome reference represents the individual transcriptome made using de Bruijn graph assemblers during DBA [48] (Figure 1d and Figure S1d in File S1).
RNA-Seq can reveal sequence variations such as SNP in genes [49] as is possible with whole genome sequencing (WGS).
Transcribed SNPs in RNA-seq are needed for accurate measurement of allele-specific expression [50,51] and detection of novel SNPs. [52]. Hence, we compared the number and rate of SNPs identified from the two NGS methods and found differences between the type of references and NGS methods. In summary, we detected more SNPs in de novo assembly of RNA-seq than in the reference genome assembly of cDNA (Table S2 in File S1).

Identification and Functional annotation of unique DEGs identified by de novo base assembly
We identified DUDEGs to ascertain the important function of DEGs, which cannot be attained from RBA. In the highest biological process gene ontology (BP-GO) of DUDEG result, immune system process had the most significant P-value (Pvalue = 2.23E-13) in up-regulation of skeletal muscle (Figure 2). Response to stimulus had the second most significant P-value (Pvalue = 5.16E-08), which is related with immune response caused by exercise-induced stress [53]. Exercise-induced stress is closely related with the regulation of immune response [6,27]. Overexercise in horses has shown an increase in the expression of alpha-1-antitrypsin protein, which plays an important role in protection of cells from inflammatory enzymes released from neutrophils [54]. Exercise-induced reactive oxygen species was Figure 2. Biological process GO terms of tissues specific DEGs between before and after exercise in Thoroughbred. Up-regulated genes indicate higher activation after exercise than before and down-regulation genes indicate lower activation after exercise than before exercise. doi:10.1371/journal.pone.0091418.g002 Table 1. Enriched KEGG pathways associated with DEGs in two tissue such as skeletal muscle and blood.  also related with the regulation of immune responses, and caused the inflammatory responses from muscle damage [55,56]. In the KEGG pathways result (Table 1), the JAK-STAT signaling pathway and MAPK signaling pathway were also up-regulated in the skeletal muscle. The JAK-STAT signaling is a key pathway in myoblast proliferation [57] and plays a major role in inflammatory and immune responses [58]. The MAPK signaling pathway is implicated in inflammation and carbohydrate metabolism [59]. Death, Related with apoptosis of skeletal muscle caused by overexercise also had a significant P-value (P-value = 1.75E-06). This was supported by the KEGG pathways results, which showed that the p53 signaling pathway and regulation of actin cytoskeleton were up-regulated. P53 protein has an important role in apoptosis of skeletal muscle and actin cytoskeleton, and is also a key pathway in regulation of apoptosis pathways [60,61]. However, we did not find immune responses and apoptosis related with BP-GO and KEGG pathway in blood.

Integration of conceptually new DEGs: DUDEGs and selected gene associated with nucleotide diversity, F ST and Cross Population Extended Haplotype Homozygosity (XP-EHH)
In order to investigate the evolutionary history of domestication in relation to different experimental conditions, we approached the identification of DEGs with a new concept. The conceptually new DEGs were attained by screening for genes in common between DUDEGs from DBA in RNA-seq and selected genes identified by evolutionary statistics, such as nucleotide diversity, F ST and XP-EHH from RBA in WGS.
This comparison highlighted three genes (EIF4A3, ZNF592 and TIMELESS) in blood and two genes (CHMP4C, and FOXO3) in skeletal muscle as being in common between DUDEGs, F ST and XP-EHH. These five genes are not only DUDEGs in six Thoroughbreds, before and after exercise, but also selected genes (F ST (empirical p-value,0.01) and XP-EHH (value,0 and pvalue,0.01)). A pairwise test, XP-EHH, of the Thoroughbred and Jeju domestic pony populations was used to identify selective sweep regions between the two populations. As we are interested in locating selective sweep regions representing adaptation in Thoroughbreds, a cutoff of XP-EHH value,0 was used. If the XP-EHH value.0 is used, then the identified selective sweep region would correspond to the adaptations in Jeju domestic pony.
The five genes were conceptually new DEGs, and were related to the evolution of exercise response during the domestication process of Thoroughbred. Among them, three genes, CHMP4C, EIF4A3 and FOXO3, showed relatively low levels of nucleotide diversity compared to that of the Jeju pony (Table 2 and Figure 3). This suggests that these three genes have been more strongly selected for in Thoroughbred than in Jeju pony. EIF4A3 was mostly expressed in megakaryocytes, platelets and red blood cell. EIF4A3, an mRNA-localization protein in mammals, controls the synaptic strength, neuronal protein expression, and in megakaryocytes and platelets act as mRNA sorting machinery [62][63][64]. In a previous study, it was shown that over-exercise activates and  increases platelets [65]. Although, it has an important role in blood post-exercise in Thoroughbreds, EIF4A3 expression was upregulated in our results. CHMP4C is a p53-regulated gene and plays an important role in exosome production [66]. The importance of p53 in apoptosis of skeletal muscle was implicated in a previous study, in which p53-null animals showed greater fatigability and less locomotory endurance than wild-type animals [60]. This suggests that p53 is closely related with exercise-induced stress in skeletal muscle. CHMP4C expression was down-regulated in our results implying the activation of p53 regulation in Thoroughbred skeletal muscle. FOXO3, also known as Forkhead box O3, has a role in triggering apoptosis by down-regulating the FOXO3 gene. In addition, FOXO3 causes a loss of muscle mass, and is closely related to PGC1a, ATG4b, ATG12, Beclin1, Gabarapl1, and LC3b. PGC1a, the transcription of atrophyspecific genes, inhibits the activity of the transcription factor FOXO3, with protects skeletal muscle from atrophy [67]. In human muscle after ultra-endurance exercise, the expression of several autophagy genes, ATG4b, ATG12, Beclin1, Gabarapl1 and LC3b, were increased [68,69]. For this reason, FOXO3 was also closely related to exercise in Thoroughbred skeletal muscle. Based on these results, EIF4A3, CHMP4C and FOXO3 are conceptually new DEGs involved in exercise response that have been selected for during the domestication history of the Thoroughbred that cannot be acquired by RBA.

Supporting Information
File S1 Contains the following. Table S1, List of unique DEGs in skeletal muscle and blood in six Thoroughbred horses before and after exercise RNA-seq data by de novo assembly (FDR,0.01). Table S2, The number and rate of SNPs from different next-generation sequencing method (DNA and RNA sequencing) and different reference genome assembly in each Thoroughbred horse sample (F1, F2 and F3 = male, S3 = female). Table S3, GO terms of cellular components and molecular function of two tissues specific DEGs between before the exercise and after exercise in horses. Table S4, Common genes between DEGs and selected genes associated with F ST (F ST cut-off value top 5% with empirical p-value,0.05). Table S5, Common genes between DEGs and selected genes associated with XP-EHH: XP-EHH cut-off value empirical p-value,0.01 and XP-EHH value ,23.51551 significant SNPs in Thoroughbred were selected and .1.73481 significant SNPs in Jeju domestic pony were selected. Table S6, List of basic stats such as the number of transcripts, components, and contig N50 value in RNA-seq whole reads and unmapped reads by trinity de novo assembly. Table S7, Number of annotated transcripts from RNA-seq unmapped reads by trinity de novo assembly. The number in the parentheses is the number of transcripts that were not included in the results of the referencebased analysis. Table S8, Basic information of 4 horses resequencing data. Table S9, RT-PCR primer information such as the gene symbol, direction and sequence. Figure S1, Summary of comparative analysis between de novo assemble and reference genome assemble from blood in six Thoroughbred horses before and after exercise RNA-seq data (Total 12 samples). a) The number of common transcripts of 12 samples between de novo assemble and reference genome assemble b) MDS plot of six Thoroughbred horses before and after exercise using de novo assemble. c) The number of DEGs between de novo assemble and reference genome assemble. d) Heat-map visualization of common DEGs between de novo assemble and reference genome assemble: rows represent DEGs from blood and columns represent assemble method from 6 horse samples (*First 'B' is for Blood and 'M' is for muscle. 'F1', 'F2', 'F3' and 'S3' are horse samples. Last 'B' is for 'before exercise' and 'P' is for 'after exercise'). Figure S2, MDS plot of six Thoroughbred horses before and after exercise using reference genome assemble in RNA-seq. a) MDS plot of blood tissue in six Thoroughbred horse before and after exercise. b) MDS plot of skeletal muscle tissue in six Thoroughbred horse before and after exercise. (*First 'B' is for Blood and 'M' is for muscle. 'F1', 'F2', 'F3' and 'S3' are horse samples. Last 'B' is for 'before exercise' and 'P' is for 'after exercise'). Figure S3, Hierarchical clustering of biological process GO terms associated with up-regulated DEGs in blood. The gene list of each GO term clustered using DAVID was compared to calculate the distance between the GO terms. For a distance value .0.5, GO terms were re-clustered, and GO term groups are shown as light-blue graduated blocks. The number of genes associated with the reclustered GO term group is shown on the left side of the block. Figure S4, Hierarchical clustering of biological process GO terms associated with down-regulated DEGs in blood. The gene list of each GO term clustered using DAVID was compared to calculate the distance between the GO terms. For a distance value .0.5, GO terms were re-clustered, and GO term groups are shown as light-blue graduated blocks. The number of genes associated with the re-clustered GO term group is shown on the left side of the block. Figure S5, Hierarchical clustering of biological process GO terms associated with up-regulated DEGs in muscle. The gene list of each GO term clustered using DAVID was compared to calculate the distance between the GO terms. For a distance value .0.5, GO terms were re-clustered, and GO term groups are shown as light-blue graduated blocks. The number of genes associated with the re-clustered GO term group is shown on the left side of the block. Figure S6, Hierarchical clustering of biological process GO terms associated with down-regulated DEGs in muscle. The gene list of each GO term clustered using DAVID was compared to calculate the distance between the GO terms. For a distance value .0.5, GO terms were re-clustered, and GO term groups are shown as light-blue graduated blocks. The number of genes associated with the re-clustered GO term group is shown on the left side of the block. Figure S7, Histogram of conventional F ST frequency between Thoroughbred and jeju pony. x-axis is conventional F ST value, y-axis is gene frequency. Figure S8, qRT-PCR validation of de novo unique differentially expressed genes (DUDEGs) identified from the RNA-seq data set of Thoroughbred horses before and after exercise: a) RT-PCR of six DUDEG in horses before exercise and after exercise. b) qRT-PCR results depicted as Ct value was calculated using 2-DDCt method. *: p-value,0.05. **: p-value,0.01. #: The expression patters of genes supported the result of our analysis. (ZIP)