Genome-Wide Analysis of Positively Selected Genes in Seasonal and Non-Seasonal Breeding Species

Some mammals breed throughout the year, while others breed only at certain times of year. These differences in reproductive behavior can be explained by evolution. We identified positively-selected genes in two sets of species with different degrees of relatedness including seasonal and non-seasonal breeding species, using branch-site models. After stringent filtering by sum of pairs scoring, we revealed that more genes underwent positive selection in seasonal compared with non-seasonal breeding species. Positively-selected genes were verified by cDNA mapping of the positive sites with the corresponding cDNA sequences. The design of the evolutionary analysis can effectively lower the false-positive rate and thus identify valid positive genes. Validated, positively-selected genes, including CGA, DNAH1, INVS, and CD151, were related to reproductive behaviors such as spermatogenesis and cell proliferation in non-seasonal breeding species. Genes in seasonal breeding species, including THRAP3, TH1L, and CMTM6, may be related to the evolution of sperm and the circadian rhythm system. Identification of these positively-selected genes might help to identify the molecular mechanisms underlying seasonal and non-seasonal reproductive behaviors.


Introduction
The environment can influence gene evolution and thus animal behaviors, including reproduction-related behaviors. Some mammals can breed throughout the year, while others only breed successfully at certain times of year. Such animals are defined as non-seasonal and seasonal breeding species, respectively. Day length, temperature, and food supply can all influence the reproductive behavior of seasonal breeding species and subsequent survival of offspring [1]; if they breed too early, the growing offspring may be exposed to low temperatures and scarce resources, whereas late breeding limits the time available for reproductive behaviors and preparation for the following winter. Accurate timing is therefore an essential component of lifehistory strategies for organisms living in seasonal environments [2]. The different reproductive behaviors of seasonal and non-seasonal breeding species may result from natural selection pressures [3]. Both strategies benefit the respective species to survive by adaption of their breeding behaviors to the environment through their long evolutionary histories. Whole genome-wide analysis of genes that are positively selected in mammal lineages using the respective breeding strategies may help us to understand the mechanisms responsible for the divergent reproductive behaviors as a result of adaptive evolution.
Positive Darwinian selection of protein-coding genes is a major driving force for detecting adaptive evolution and species diversification. The modified version of the branch-site test (Model A) [4,5] was designed to detect localized episodic bouts of positive selection that affect only a few amino acid residues in particular lineages. This test has been shown to be a reasonably powerful tool, and has been widely used to investigate the adaptive evolution of genes in many species [6][7][8].
However, alignment errors may influence the results of branch-site gene analysis in mammalian and vertebrate species. It is therefore necessary to use reliable alignment methods to reduce the incidence of false-positive results [9]. Although the aligner software PRANK [10,11] cannot eliminate false-positive results, it is nonetheless more powerful than other aligners [9,12] such as MUSCLE [13] and ClustalW [14]. In addition to misalignments in multiple sequences, other factors such as sequence errors, misassembly, and annotation mistakes also increase the incidence of falsely-identified positive selection [15,16]. More stringent filters are needed to ensure that branch-site analysis has a low and acceptable false positive rate.
In this genome-wide study, we investigated the evolution of seasonal breeding strategies by identifying positively-selected genes in non-seasonal and seasonal breeding species using modified branch-site models. We established Distant-Species and Close-Species sets, each of which included seasonal and non-seasonal groups. We then identified positively-selected genes in these groups. PRANK (codon) software was used to align all the gene orthologs in the two gene sets. However, because PRANK generates a relatively high false-positive rate with the branchsite model, stringent filtering using sum of pairs (SP) [17,18] scoring was used to remove potentially unreliable alignments generated by multiple sequence alignments. Sequence errors, misassembly, and annotation mistakes were also detected by cDNA mapping. Functional analysis of genes identified as positively-selected after this stringent filtering process might help us to understand the molecular mechanisms that determine non-seasonal and seasonal breeding.
The protein-coding sequences for human, gorilla, chimpanzee, orangutan, Indian rhesus monkey, marmoset, mouse, rat, dog, horse, and rabbit were downloaded from the Ensembl database (version 64, Sep. 2011; http://www.ensembl.org/info/data/ftp/index.html) [32]. The sequences for cynomolgus monkey (http://climb.genomics.cn/10.5524/100003) and Chinese rhesus macaque (http://climb.genomics.cn/10.5524/100002) were provided by BGI [33]. The corresponding cDNA sequences used in the accuracy assessment were downloaded from NCBI. Detailed information on the cDNA sequences used in this study are listed in S1 Table. Calculating positively-selected sites To identify 1:1 gene orthologs, human protein sequences were used to conduct BLAST [34] searches against other species sequences (blastp-F T-e 1e-5-m 8). It is difficult to select a set of transcripts to minimize alignment gaps and potential errors and thus false-positive branch-site test results [35]. In simple analyses in previous studies [6-8, 33, 36-41], the longest transcript for a given gene was chosen. Reciprocal searches were then performed for each species protein sequences relative to human protein sequences. In each search, pairwise sequences with identities <60% were excluded, and the highest hit for each query was retained to determine the pairwise orthologs between humans and other species.
Modified branch-site models [5] for adaptive evolution analysis used each species in one breeding series as the foreground species, and all the other species in that breeding series as background species. For example, to test for positive selection in humans in the Distant-Species set, the human branch was designated as the foreground branch, and the other five species in the seasonal breeding group were designated as background branches. Positive selection signals for all species were tested similarly.
Protein-coding sequences associated with the corresponding 1:1 gene orthologs were aligned using PRANK (codon). The corresponding gene-based phylogenetic trees were constructed using the maximum likelihood method in the PHYLIP 3.69 [42] software package, according to the tested aligned protein-coding sequences. The aligned protein-coding sequences and the corresponding phylogenetic trees were then used to analyze the adaptive evolution using the branch-site model in PAML's codeML program [4]. Branch-site modified model A (model = 2, NSsits = 2) and the corresponding null model (model = 2, NSsits = 2, fix_omega = 1 and omega = 1) [5] were used to identify sequences under positive selection in both test sets of animals. Significance was calculated using the χ 2 statistic, with one degree of freedom. Genes with p 0.01 were considered to be positively selected [5]. The p values were adjusted according to the FDR method (multiple testing correction with the method of Benjamini and Hochberg) [43] to allow for multiple testing, with a strict criterion of FDR <0.05. Positively-selected sites were obtained based on the Bayes Empirical Bayes (BEB) analysis [5], with a posterior probability >95%.

Screening for valid positive sites by SP penalty scoring
To ensure the accuracy of the positive sites, extended sequences were extracted including 15 amino acids (45 base pairs) upstream and downstream from the positive sites. SP [17,18] measurements were then performed for penalty scoring of the sequences in both streams. (1) Some of the positive sites were at the edge of the beginning or end of the gene and were not reached by the upstream or downstream sequences, and the penalty base score was set separately for both streams (regarded as S, S = 15/n, where n is equal to the number of amino acids in the upstream or downstream sequence). (2) Penalty scores added 0 point for each position in perfect alignment, while mismatched sites or gaps in the alignment were awarded penalty scores of minus S or 2S, respectively. (3) Penalty scores for the upstream and downstream sequences were calculated separately, and the total penalty scores were the sum of the upstream and downstream scores. (4) Average penalty scores were calculated as the final scores (average penalty score = total penalty score/N, where N is the number of sequences used in each alignment).
General and individual penalty scores were used. General penalty scores were equal to the sum of the penalty scores from each of the two compared species. For individual penalty scores, sequences with positive sites were compared with each of the other sequences used in the alignment in turn, and the total penalty scores were regarded as the individual penalty score.
Threshold values were set for general and individual penalty scores to filter sequences with valid positive sites. In this study, the threshold values for the general and individual penalty scores were −50 and −15, respectively. If both the general and individual penalty scores were greater than the threshold value, the sequences were filtered and the sites regarded as positive.

Accuracy of positive sites according to cDNA sequences
Mistakes can occur during genome sequencing, sequence assembly, or gene annotation, and cDNA sequences can be used as references to assess the accuracy of the positive sites.
Corresponding cDNA sequences were first matched to the gene sequences using the function BLAST [34] (blastn-e 1e-10-a 4-m 8). cDNA sequences that included the positions corresponding to the positive sites were then filtered. Further analysis was conducted using MEGA5 [44]. The gene sequences and their corresponding cDNA sequences were then subjected to alignment analysis using the MUSCLE [13] function. If the nucleotide sequences of the positive sites were identical to those of the corresponding positions in the cDNA sequences, the positive sites were regarded as valid.

Results
Preliminary filtering of positively-selected genes using PRANK and branch-site model Totals of 11,031 and 13,171 1:1 gene orthologs with >60% identities were filtered from the Distant-and Close-Species sets, respectively, by BLAST [34]. The corresponding protein sequences were used for subsequent alignments. The numbers of pairwise gene orthologs between humans and other species are listed in S2 Table. After alignment using PRANK (codon), 10,918 gene orthologs in the Distant-Species set and 12,485 in the Close-Species set were tested for positive selection signals using the codeML program in the PAML package [4], with the modified branch-site model [5]. Positively-selected genes in each species with a p value <0.01 (comparing LRT, the likelihood ratio test, with the χ2 distribution) and with a false-discovery rate (FDR) <5% are shown in Table 1.
In the Distant-Species set, the mean number of positively-selected genes in the seasonal species was four fold greater than in the non-seasonal species (fdr <0.05) (Fig 1A, Table 1). The equivalent increase in the Close-Species set was about 2.63-fold ( Fig 1B, Table 1). These results demonstrate that there were more positively-selected genes in seasonal compared with nonseasonal breeders in both species sets. However, there were more positively-selected genes in the Close-Species than in the Distant-Species set (mean numbers with FDR <0.05 208.75 and 111.9, respectively). In addition to the different numbers of orthologs (12,485 vs. 10,918), it is also possible that more gaps were generated by alignment in the Distant-Species gene ortholog set compared with in the Close-Species set (mean gap length 244 in the Close-Species set and 322 in the Distant-Species set) (S3 Table), because the sequence divergence was smaller in the Close-Species set. The number of gaps may influence the results of branch-site analysis, because the branch-site would remove columns with gaps in the alignment sequences and would thus exclude more potential positive sites in the Distant-Species set compared with the Close-Species set.

Identification of false-positive sites through sequence misalignment
Putative positively-selected sites in the genome (FDR<0.05) were obtained by Bayes Empirical Bayes (BEB) analysis (posterior probability >95%) [5]. The numbers of putative positivelyselected sites in each species are listed in Table 2. The details of all the positive sites with BEB >0.95 are listed in S4 Table. Alignment problems may influence the performance of the branch-site test, with poor alignment increasing the incidence of false-positive sites. We therefore filtered out sites with obvious signs of unreliable alignment. We also calculated the SP [17,18] score for each of the positive sites' extended sequences (± 15 amino acids/45 base pairs). Most unreliable alignments are represented by numerous gaps and sequence divergences (S1 Fig and S5 Table). After filtering, a total of 2009/3810 (52.73%) positive sites remained. Sites with extended alignments with low divergence are listed in S6 Table. The results after filtering revealed more sites with positive selection in the seasonal compared with the non-seasonal breeding species ( Table 2). The falsepositive rate due to misalignment was 33.33%-61.28% (Table 2), which was similar to that of 50%-55% in a previous report [12]. After alignment filtering, differences in gene numbers between species in the Distant-and Close-Species sets were consistent with those after FDRadjusted filtering. However, the false positive rate(FPR) statistics only considered misalignment and did not take account of other factors such as sequence errors, misassembly, or annotation problems.
According to extended-sequence alignments of the positive sites, SP scores <-50 were generally caused by excessive gaps or deficient matches, of which gaps contributed more to the low SP penalty scores (S1 Fig and S6 Table). Gaps and deficient matches may arise as a result of diversity between species or different transcript lengths, because we used the longest human transcripts to BLAST other species' protein-coding sequences [35]. Columns with gaps in the alignments would be deleted in branch-site models, even though positive sites may be located within deficient sequence alignments surrounded by gaps or mismatched sequences. A threshold SP score of −50 can filter out most false-positive sites caused by divergent sequence alignments. SP scoring thus improves the reliability of the results by reducing the false-positive rate caused by unreliable alignments. Details of the positive genes filtered by SP scoring are shown in S7 Table. cDNA mapping as a novel method of filtering positive sites The quality of the genome may limit the accuracy of evolutionary analysis. It can result in false-positive results associated with sequencing errors, alternative splicing, amino acid repeats, and frameshift mutations, causing mistakes in gene annotation [8,15]. However, cDNA sequences are much shorter than genome sequences and are thus more reliable. The reliability of positive sites will therefore be increased if sequences with positive sites are mapped to the corresponding cDNA sequences and aligned with most of the bases. We therefore used cDNA mapping as a novel means of testing sequence errors.
cDNA sequences corresponding to the positive sites were analyzed. In this study, we aligned a total of 193 positive sites in perfect alignment with at least one cDNA sequence of the corresponding species using the MUSCLE function [13] in MEGA5 [44]. The coverage between positive sites and corresponding cDNA sequences was low (<10%, 193/2009), and the false positive rate was 61.66% (120/193). Most inconsistent sites were in cynomolgus monkey, horse, and orangutan, which had genome sequences of low quality or with annotation mistakes. In contrast, the human, mouse and rat genome sequences showed high accuracy. The details of the positive sites mapped with the corresponding cDNA sequences are shown in S1 Table. A total of 74 corresponding cDNA sites were finally identified that were consistent with the positive sites (S1 Table). No corresponding cDNA sequences mapped to the positive sites in gorillas, Chinese rhesus monkeys, and marmosets. After verification by cDNA filtering, 39 genes remained, including 15 genes that were positively-selected in non-seasonal species (Table 3), and 24 in seasonal species (Table 4). Although the limited availability of cDNA sequences meant that only a few positive sites remained after mapping, these sites were likely to be more accurate.

Influence of alignment and annotation
The results of evolutionary analysis are influenced the quality of the genome sequence; falsepositive sites may be detected and important information may be missed as a result of lowquality sequences [15,16]. Unfortunately, recent genome-sequencing techniques are still unable to provide sequences reliable enough for evolutionary analysis. Stringent filtering functions and parameters are therefore needed to obtain reliable positive sites, and careful analytical design can achieve reliable results, even from low-quality genome sequences.
Evolutionary analysis usually starts with sequence alignment using software such as Clus-talW, MUSCLE or PRANK. In this study, we used PRANK (codon), because this software takes evolutionary information into consideration before placing the gaps [11], resulting in fewer mismatches but larger gaps compared with the other programs (S3 Table). Valid positive sites are likely to be located in alignments with low divergence and few gaps or mismatches, and sequence misalignments can thus generate false-positive sites in branch-site models. The branch-site model usually deletes columns with gaps in the alignments when calculating positive sites, so some sites located in deficient alignments may be regarded as positive, whereas some true-positive sites may be missed. SP-score filtering, which focuses on filtering out such false-positive sites, can be used to reduce the false-positive rate and ensure the quality of the filtered positive sites. On the other hand, cDNA mapping can exclude false-positive sites that originate from mistakes in genome sequence assembly and gene annotation. The combination of these processes can thus filter out many false-positive sites and identify low-quality genome sequences, such as those for cynomolgus monkey, horse, and orangutan in this study. cDNA sequences in previous genome-wide studies have generally been used as references for gene annotation [45][46][47]. In contrast, we used cDNA mapping as a novel method to identify positive sites with high quality. Because cDNA sequences are usually relatively short, current sequencing techniques can provide reliable sequences. Moreover, some sites can be mapped to more than one corresponding cDNA sequence. cDNA mapping can thus ensure the quality of the remaining positive sites. However, there are some limitations. More than 90% of sites cannot be matched with corresponding cDNA sequences, and the validity of these sites therefore cannot be checked using this method. Because cDNA sequences are usually sequenced for a specific purpose, corresponding cDNA sequences may not be available for some putative positive sites, and genes with important evolutionary implications may be missed.

Positively-selected genes in seasonal and non-seasonal breeders
Evolutionary analysis of genome sequences can be used to identify specific, positively-selected genes in various species. The genetic mechanisms and potential environmental adaptations associated with seasonal and non-seasonal breeding can then be inferred by functional analysis of positively-selected genes in the respective species.
The functions of positively-selected genes in non-seasonal breeding species reflect reproductive tendencies such as sperm generation and cell proliferation. Two key genes perform these functions in humans: CGA (glycoprotein hormones, alpha polypeptide) is a gonadotropin subunit [48,49], while CD151 functions in promoting metastasis, and increases the expression of phospho-extracellular signal-regulated kinase (ERK) [50,51]. Given that ERK is a component of the mitogen-activated protein kinase pathway, positive selection pressure on this gene may influence cell proliferation and differentiation [52,53]. Mutation of Dnah1 in mice has been reported to cause male infertility [54,55], suggesting that it may play an important role in influencing mating behavior. Another crucial gene in rats, Invs, is involved in controlling cytoskeletal organization and cell division, which are essential for reproduction [56,57]. Moreover, this gene can interact with NPHP1 and NPHP3 that influence the Wnt signaling pathway, which may in turn influence kidney function and renal cell formation linked to spermatocyte and spermatid generation in the testis [58][59][60]. These positively-selected genes may reflect modulation of the reproductive system under environmental pressure in non-seasonal breeding species, enabling them to breed throughout the year. The identification of positive sites focused on sperm generation and cell proliferation suggests that mutations in these genes may influence sperm quantity or reproductive capacity. Genes that were positively selected in seasonal breeding species differed from those in nonseasonal species in having less focused functions. However, the orangutan provided the most valid positive genes among these species, and their functional analysis may help to explain some predominant characteristics of seasonal breeding species. The key gene, THRAP3 (thyroid hormone receptor associated protein 3, also known as Thrap150), is a selective coactivator for CLOCK-BMAL1 and promotes CLOCK-BMAL1 binding to target genes [61]. Moreover, THRAP3 can also interact with HELZ2, which regulates adipocyte differentiation [62]. Clock and Bmal1 have previously been reported to be closely related to seasonal breeding behaviors [63], the THRAP3 mutation may thus influence the circadian rhythm of the reproductive system. This is supported by a previous study showing that thyroid hormone catabolism within the mediobasal hypothalamus regulated seasonal gonadotropin-releasing secretion [64]. However, because orangutans live in Indonesia, which has high temperature throughout the year [30,65], they may not need to adjust their physical condition, such as lipid storage, to cope with cold weather. THRAP3 may thus influence adipocyte differentiation, while other functionally-related genes such as MTMR12 [66] and ZFR [67] would be positively selected because of such environmental conditions. In addition to THRAP3, the positively-selected genes TH1L and CMTM6 may also help to explain the seasonal breeding behavior. As TH1L may have a similar function to TH1, which attenuates androgen signaling [68], while CMTM6 functions in spermatogenesis [69][70][71]. Evidence from previous studies suggests that orangutans produce 14 times less sperm than chimpanzees, which is a closely-related, but non-seasonal breeder [72]. Seasonal breeding in orangutans may thus be a consequence of circadian rhythm and limited sperm production, which restrict their breeding to the period from December to May, the most productive months in terms of food (fruit) supply, to ensure adequate food and energy for effective reproduction [73]. Diversity in breeding behaviors can generally be attributed to mutations affecting endocrine mechanisms. Such mutations may be related to specific environmental conditions, such as temperature and food supply. In this study, positively-selected genes related to sperm generation were identified in both types of breeding species. Indeed, previous reports have indicated rapid evolution of sperm proteins in mammals [74,75]. Evolutionary mutations in these genes may not lead to the unique consequences associated with different breeding strategies. However, previous studies have indicated that the reproduction behavior in seasonal breeding species is largely under the regulation of the circadian rhythm system [64]. This is consistent with our results, which showed that THRAP3, which is functionally-related to the CLOCK-BMAL1 system, was under positive selection pressure. The mechanisms determining breeding behaviors can be complicated, but evolution leads to adaptation to the environment, enabling welladapted lineages to persist for many generations.

Conclusions
In this study, we conducted a precise, genome-wide scan to detect genes that were positively selected between seasonal and non-seasonal breeding species. The evolutionary analysis was designed to reduce the incidence of false-positive sites by SP filtering and cDNA mapping. Although the lack of cDNA sequences means that some positive genes may have been missed, the identification of valid, positively-selected genes with functions relating to spermatogenesis, cell proliferation, and circadian rhythm might indicate possible molecular mechanisms underlying the seasonal and non-seasonal reproductive behaviors. Further developments in genomesequencing technologies will allow the sequencing and assembly of higher-quality genomes, and more accurate gene annotation, while the availability of more cDNA sequences will increase the value of cDNA mapping for improving the accuracy of evolutionary analysis.