Control of human testis-specific gene expression

Background
As a result of decades of effort by many investigators we now have an advanced level of understanding about several molecular systems involved in the control of gene expression. Examples include CpG islands, promoters, mRNA splicing and epigenetic signals. It is less clear, however, how such systems work together to integrate the functions of a living organism. Here I describe the results of a study to test the idea that a contribution might be made by focusing on genes specifically expressed in a particular tissue, the human testis.

Experimental design
A database of 239 testis-specific genes was accumulated and each was examined for the presence of features relevant to control of gene expression. These include: (1) the presence of a promoter, (2) the presence of a CpG island (CGI) within the promoter, (3) the presence in the promoter of a transcription factor binding site near the transcription start site, (4) the level of gene expression, and (5) the above features in genes of testis-specific cell types such as spermatocyte and spermatid that differ in their extent of differentiation.

Results
Of the 107 database genes with an annotated promoter, 56 were found to have one or more transcription factor binding sites near the transcription start site. Three of the binding sites observed, Pax-5, AP-2αA and GRα, stand out in abundance, suggesting they may be involved in testis-specific gene expression. Compared to less differentiated testis-specific cells, genes of more differentiated cells were found to be (1) more likely to lack a CGI, (2) more likely to lack introns and (3) higher in expression level. The results suggest genes of more differentiated cells have a reduced need for CGI-based regulatory repression, reduced usage of gene splicing and a smaller set of expressed proteins.


Introduction
The regulatory control of gene expression is a central feature of all living organisms. Beginning with the same genome sequence, features of differential gene expression collaborate to create the entire landscape of tissue and cell function including a life-long developmental program, pathways to maintain homeostasis and functions able to respond to environmental change. The crucial importance of gene regulatory control has made it a thoroughly-studied and familiar area of investigation. As a result we now know about central features of regulation including the role of promoters, CpG islands, epigenetic signaling, transcription factors, enhancers, structured chromosome domains, mRNA splicing and many others [1][2][3][4][5][6][7]. Lacking, however, is an appreciation of how the individual systems work together to produce smoothly functioning developmental and other programs. Are there features that are more fundamental in that they are expressed earlier in development or affect a greater number of tissues and cells? To what extent is the pathway of gene regulatory systems the same in different tissues? Are there pathways of gene expression that use some but not all of the gene regulatory features used in others? Are regulatory features deployed differently in developmental pathways compared to those involved in response to environmental change? The above questions and many related ones currently occupy investigators studying gene regulatory control.
I have adopted the view that progress might be made by focusing on the genes specifically expressed in a single tissue. Limiting the analysis in this way significantly reduces the number of genes to be examined and also may reduce the number of regulatory systems that need to be considered. It is anticipated that information generated about regulation of genes expressed specifically in a single tissue may be able to be generalized to a larger and more diverse gene population.
Here I describe the results of studies carried out to examine genes expressed specifically in the human testis [8]. Testis is attractive for study because it consists predominantly of a highly restricted number (four) of distinct cell types that are all on the same pathway leading to production of a single cellular product, sperm [9]. Also, the testis stands out, compared to other tissues, for the high number of tissue-specific genes [10], a property that offers a similarly high number of regulatory features that might be relevant. Together the two features of testis, a small number of cell types and a large number of specific genes, offer the possibility of relating control of specific gene expression to defined cellular developmental events.
The study began with creation of a database containing 239 genes expressed specifically in human testis. Database genes were chosen to be representative of the larger population of all testis-specific genes. The database includes genes encoded on all but one of the 24 human chromosomes; both protein-coding genes and genes that specify non-coding RNAs are represented. Database genes were examined for the presence and functioning of properties relevant to control of gene expression including the presence of a CpG island, the presence of a promoter, transcription factor binding sites within the promoter and the level of gene expression. The results are interpreted to clarify the role of the above features in control of testis-specific gene usage and their significance for sperm development.

Database of human testis-specific genes
The database of human testis-specific genes employed here (S1 Table) contains 239 genes each annotated to be highly specific for testis in both the UCSC Genome Browser (version hg38, 2013 [https://genome.ucsc.edu/]) and the NCBI gene reference [https://www.ncbi.nlm.nih.gov/]. The database was curated from among genes contained in slightly larger databases of testis-specific genes [8,11]. The studies of Djureinovic et al. [8] and Liu et al. [11] include 364 and 317 testis-specific human genes, respectively. In creating the current database, the goal was to include genes that are highly specific for expression in the human testis. For instance, both strongly- and weakly-expressed genes were included as were both LINC and intronless genes. Excluded were miRNA genes and pseudogenes. The goal was to create a gene set representative of all testis-specific genes while keeping the overall number to a manageable level. All database information can be downloaded from https://doi.org/10.6084/m9.figshare.7952123

Gene properties examined
Genes with a CpG island (CGI) were identified from the UCSC Genome Browser (version hg38, 2013). All database testis-specific genes with an annotated CGI near the transcription start site (TSS) were included without regard to the length of the CGI or its percent GC content. Genes containing a promoter were identified by the FirstEF algorithm [12] as found in the 2003 (hg36) version of the UCSC Genome Browser. For all genes examined, the level of testis-specific expression was retrieved from the UCSC Genome Browser (version hg38, 2013). A gene was considered to be broadly expressed if it was annotated to have a comparable level of expression in half or more of the tissues reported in the UCSC or NCBI databases. The list of Djureinovic et al. [8] was used to identify gene-encoded proteins highly enriched in spermatogonia, spermatocyte, spermatid or sperm. Genes lacking introns were identified using the Intronless Gene Database (http://www.bioinfo-cbs.org/igd/).
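The broad-expression rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to annotate the database: the function name `is_broadly_expressed` and the `tolerance` parameter are hypothetical, and because the text does not define "comparable" numerically, the sketch assumes a tissue is comparable when its level is within a factor of `tolerance` of the highest level observed.

```python
def is_broadly_expressed(levels_by_tissue, tolerance=2.0):
    """Return True if half or more of the tissues show a comparable expression level.

    levels_by_tissue: dict mapping tissue name -> expression level (e.g. RPKM).
    tolerance: ASSUMPTION, not from the source. A tissue counts as "comparable"
               if its level is at least (highest level / tolerance).
    """
    if not levels_by_tissue:
        return False
    highest = max(levels_by_tissue.values())
    if highest <= 0:
        return False  # gene not expressed in any tissue
    threshold = highest / tolerance
    comparable = sum(1 for level in levels_by_tissue.values() if level >= threshold)
    # "half or more of the tissues reported"
    return comparable >= len(levels_by_tissue) / 2


# Hypothetical example values, for illustration only
testis_specific = {"testis": 100.0, "liver": 1.0, "brain": 0.5, "lung": 2.0}
broad = {"testis": 40.0, "liver": 35.0, "brain": 50.0, "lung": 5.0}
print(is_broadly_expressed(testis_specific), is_broadly_expressed(broad))
```

A different definition of "comparable" (for example, relative to the median rather than the maximum) would change which genes are classified as broadly expressed, so the threshold should be treated as a tunable assumption.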

Transcription factor binding sites
Transcription factor binding sites (TFBS) near transcription start sites were identified beginning with promoters downloaded from the UCSC Genome Browser [13]. Promoters were identified by the FirstEF algorithm as described above. Each was 1000bp in length beginning 570bp upstream from the TSS and ending 430bp downstream. The entire 1000bp promoter sequence was scanned for the presence of TFBS with the ALGGEN-PROMO website running TRANSFAC version 8.3 (maximum matrix dissimilarity rate = 2; http://alggen.lsi.upc.es/cgi-bin/promo_v3/promo/promoinit.cgi?dirDB=TF_8.3). TFBS or combinations of contiguous TFBS were included in S2 Table if they were found to begin between -10bp and +10bp of the annotated TSS and were 6bp or more in length. The small interval (i.e. -10bp to +10bp) was chosen to emphasize sequences right at the TSS where transcription factor binding is expected to have a strong effect on transcription.
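The selection criteria described above (a site beginning within 10bp of the TSS and spanning 6bp or more) can be expressed as a simple filter. The sketch below is illustrative only: the `Hit` record, the function names and the example hits are hypothetical, it assumes that site positions are given as 0-based offsets within the 1000bp promoter with offset 570 corresponding to the TSS, and it does not attempt to merge contiguous sites.

```python
from dataclasses import dataclass

UPSTREAM_BP = 570     # promoter begins 570 bp upstream of the TSS (from the text)
WINDOW_BP = 10        # sites must begin between -10 bp and +10 bp of the TSS
MIN_SITE_LENGTH = 6   # sites must be 6 bp or more in length


@dataclass
class Hit:
    factor: str   # transcription factor name as reported by the scan
    start: int    # ASSUMED convention: 0-based offset within the promoter sequence
    length: int   # length of the predicted binding site in bp


def position_relative_to_tss(offset):
    """Convert a promoter offset to a TSS-relative coordinate (assumed convention)."""
    return offset - UPSTREAM_BP


def sites_near_tss(hits):
    """Keep hits that begin within the window around the TSS and are long enough."""
    return [
        h for h in hits
        if abs(position_relative_to_tss(h.start)) <= WINDOW_BP
        and h.length >= MIN_SITE_LENGTH
    ]


# Hypothetical hits, for illustration only
hits = [
    Hit("Pax-5", 565, 7),    # begins at -5, 7 bp: kept
    Hit("AP-2αA", 300, 8),   # begins at -270: too far from the TSS
    Hit("GRα", 578, 5),      # begins at +8 but only 5 bp: too short
    Hit("Pax-5", 580, 6),    # begins at +10, 6 bp: kept (boundary case)
]
print([(h.factor, position_relative_to_tss(h.start)) for h in sites_near_tss(hits)])
```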

Testis-specific gene database
Database testis-specific genes were found to be widely distributed among the 24 human chromosomes. All but the Y chromosome encode at least one testis-specific database gene. Chromosome 1 has the most (28 of 239 database genes) and chromosome 21 the least (1 gene; Fig 1A). When expressed as the number of database genes per 100Mb of chromosome sequence, the highest number was found in chromosome 19 (19.0) and the lowest in chromosome 21 (2.2; Fig 1B).
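The normalization to genes per 100Mb is a simple rescaling of the raw count by chromosome length. A minimal sketch follows; the function name `genes_per_100mb` is hypothetical and the chromosome length in the example is a round illustrative number, not a value taken from the genome assembly.

```python
def genes_per_100mb(gene_count, chromosome_length_bp):
    """Number of genes per 100 Mb of chromosome sequence."""
    if chromosome_length_bp <= 0:
        raise ValueError("chromosome length must be positive")
    return gene_count * 100_000_000 / chromosome_length_bp


# Illustrative only: one gene on a hypothetical 50 Mb chromosome
print(genes_per_100mb(1, 50_000_000))
```

Normalizing in this way separates the effect of chromosome size from gene density, which is why the chromosome with the most database genes need not be the one with the highest density.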
The distribution of expression levels among database genes was found to be skewed toward low expression. For instance, 194 of the 239 genes (81%) have expression levels in the lowest 1/3 of the distribution (Fig 2). Among the highly expressed genes, the distribution shows preferred values of ~60, 170 and 215 RPKM, suggesting there may be a mechanism to favor particular expression levels (Fig 2).
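The proportion of genes falling in the lowest third of the expression distribution can be computed as sketched below. This assumes that "lowest 1/3 of the distribution" refers to the lowest third of the observed range of expression values (an interpretation, since the text does not specify it); the function name and the example values are hypothetical.

```python
def fraction_in_lowest_third(levels):
    """Fraction of genes whose expression lies in the lowest third of the observed range.

    ASSUMPTION: "lowest third" is defined on the range (min to max) of the values,
    not on rank (tercile) position.
    """
    if not levels:
        raise ValueError("at least one expression level is required")
    lowest, highest = min(levels), max(levels)
    cutoff = lowest + (highest - lowest) / 3
    in_lowest_third = sum(1 for level in levels if level <= cutoff)
    return in_lowest_third / len(levels)


# Hypothetical RPKM values, for illustration only
example_levels = [1, 2, 3, 4, 5, 90, 300]
print(fraction_in_lowest_third(example_levels))
```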

CpG islands in testis-specific human genes
As tissue-specific genes have been reported to be depleted in CpG islands compared to broadly expressed genes [2,14,15], it was expected that database testis-specific genes would be depleted in CGI, and this was found to be the case (Table 1). Of the 239 database genes, 127 (53.1%) were found to lack a CGI. In contrast, absence of a CGI was observed in only 8.0% of an unselected human gene population and 9.4% of a population of broadly expressed genes (Table 1). Testis-specific LINC genes were almost all lacking a CGI (14 of 15 LINC genes) while among testis-specific intronless genes the proportion was about the same as the testis-specific population as a whole (50.0% for intronless genes compared to 53.1% for all database genes; Table 1). Testis-specific database genes lacking a CGI did not differ greatly in expression level from CGI-containing testis-specific genes or from all testis-specific genes; mean expression levels were 48.4, 50.7 and 49.5 RPKM, respectively (Table 2). This result suggests CGI are not directly involved in determining gene expression level. The observation is compatible with the accepted view that CGI function in large-scale gene repression by way of methylation, a modification that suppresses expression of affected genes [2,16].
Table 1. Testis-specific genes lacking a CpG island.
b Sequential genes on chromosome 9 beginning with ACO1. c Sequential broadly-expressed genes on human chromosome 12 beginning with ZNF641.

Expression levels of testis-specific LINC and intronless genes
Testis-specific long, intergenic non-coding (LINC) genes were found to have a lower mean expression level compared to all testis-specific database genes. The difference was ~2.3 fold (49.5 RPKM compared to 21.8; Table 2). This observation is in qualitative agreement with results showing decreased expression of LINC genes in databases of all human LINC genes [17,18]. In contrast, database testis-specific intronless genes were found to have a mean expression level higher than that of all testis-specific genes (89.4 RPKM compared to 49.5; Table 2). This observation indicates that testis-specific intronless genes must possess strong nuclear export and other translation-enabling features that do not depend on the presence of introns and mRNA splicing pathways [19,20].

Testis-specific genes with a promoter
As promoters can play an important role in control of gene expression, they were examined carefully in the testis-specific population considered here. Special attention was devoted to transcription factor binding sites (TFBS) near the annotated transcription start site because such TFBS can have a direct effect on initiation of new gene transcription [21,22]. Less than half of the database testis-specific genes were found to have an annotated promoter (107/239 genes; 44.8%; see Table 3). This compares to greater than 90% in a population of unselected human genes. Both LINC and intronless testis-specific gene populations were also found to be depleted in promoter-containing genes. Percentages were 6.7% (1/15) of LINC genes with a promoter and 29.2% (7/24) for intronless genes (Table 3). The lower number of promoter-containing genes in the testis-specific population suggests that in many testis-specific genes the functions of the promoter must be accomplished by unannotated promoters or by other gene features.
Table 2. Expression level of testis-specific gene populations.
Gene Population Mean expression (RPKM) a Range
Testis-specific (all database) b 49.5 n = 237 0.
The ALGGEN-PROMO web site was used to retrieve transcription factor binding sites near the TSS in database gene promoters as described in Materials and Methods. A total of 25 different transcription factor binding sites were observed among the 56 promoter-containing genes that had a TFBS near the TSS (see Table 3 and S2 Table). Highest in abundance were Pax-5, AP-2αA and GRα, which were present in 12, 10 and 8 gene promoters, respectively (Table 4). Together the three account for 30 of the 56 transcription factor binding sites (53.6%) present in relevant database genes, suggesting they may have a role in regulation of testis-specific gene expression. Eleven of the 25 different transcription factor binding sites were each present near the TSS in only one database gene promoter (S2 Table).
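The combined share of the three most abundant binding sites follows directly from the counts given above. The short sketch below reproduces that arithmetic; the counts (12, 10 and 8) and the total (56) are taken from the text, while the variable and function names are hypothetical.

```python
# Counts of gene promoters with each site near the TSS (from the text)
TOP_SITE_COUNTS = {"Pax-5": 12, "AP-2αA": 10, "GRα": 8}
TOTAL_SITES = 56  # total TSS-proximal sites among relevant database genes (from the text)


def combined_share(counts, total):
    """Fraction of all sites accounted for by the listed factors."""
    if total <= 0:
        raise ValueError("total must be positive")
    return sum(counts.values()) / total


share = combined_share(TOP_SITE_COUNTS, TOTAL_SITES)
print(f"{sum(TOP_SITE_COUNTS.values())} of {TOTAL_SITES} sites ({share:.1%})")
```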
The nucleotide sequences of transcription factor binding sites were also retrieved in case they might suggest the identity of other elements that recognize the same DNA sites (S2 Table; https://doi.org/10.6084/m9.figshare.7952123). The sequences were scanned visually to identify similarities, and the results are summarized in Table 4. One recurring site was found to correspond to the Pax-5 binding site, one for AP-2αA and two for GRα (Table 4). The four sequences suggest themselves as candidates for a role in control of testis-specific gene expression. In each sequence the relevant genes were found to vary significantly in level of expression, indicating that the sequences and the transcription factors that bind them may act to activate or repress gene expression depending on the context of other regulatory features present (Table 4 and S2 Table).

Sperm progenitor cells in the testis
Seminiferous tubules are the major structural feature of the testis accounting for more than 80% of the testis mass. They consist of six distinct cell types. Four are direct precursors of sperm (the spermatogonia, spermatocytes, spermatids and sperm themselves), while two others support spermatogenesis but do not themselves develop into sperm (Leydig and Sertoli cells). Sperm progenitor cells are arranged radially in the seminiferous tubule with the spermatogonia located furthest from the tubule lumen and spermatocytes, spermatids and sperm progressively nearer [23,24].
Spermatogonial cells divide to produce: (1) primary spermatocytes capable of further differentiation to create sperm; and (2) cells capable of replenishing the spermatogonial population. Both spermatogonium progeny cell types are diploid. Primary spermatocytes undergo a meiotic division to produce secondary spermatocytes. These are haploid cells that divide to produce spermatids, cells that further differentiate to become sperm. The well-characterized pathway leading to sperm production described above creates an opportunity to ask how features controlling gene expression may correlate with and underlie the molecular events involved. Below I describe studies designed to clarify how aspects of gene regulatory control may be involved. The studies were enabled by the existence of a database of 122 testis-specific genes whose expression has been defined in individual sperm pathway cell types ([8]; see Supplementary Tables V and VI). Assignments were made by noting the binding of protein-specific antibodies to sections of seminiferous tubule tissue. If the cell type(s) was defined for a testis-specific gene examined here, it is noted in S1 Table. The results showed that the cell type(s) was defined for 40 of the 239 database genes. As shown in Table 5, 4 database genes were found in spermatogonia, 7 in spermatocytes, 17 in spermatids and 17 in sperm. Features of gene regulatory control noted were: (1) the presence of a CGI in the promoter, (2) the presence of introns in the gene and (3) the gene expression level (Table 5).
Table 3. Testis-specific genes with a promoter and transcription factor binding site near the transcription start site.
Gene Population Genes with Promoter a Promoters with TFBS at TSS b
The results in the case of CG islands show that the population of genes expressed at early stages of sperm formation (spermatogonia and spermatocytes) has a lower proportion of CGI-negative genes compared to the more differentiated cells (i.e. spermatid and sperm). The proportion in less differentiated cells more closely resembles that seen in unselected human gene populations (8.0%; see Table 1) than in all testis-specific genes (53.1%). In the more differentiated cells, however, the proportion is more similar to the population of all testis-specific genes (i.e. 53% and 65% compared to 53.1%). The result suggests that more differentiated cells are better able to function without a CGI or do not have a need for a CGI.
The proportion of intronless genes in sperm precursor cell types was found to be lower than the proportion in all testis-specific genes (i.e. 12%-33% compared to 50%; see Tables 5 and 1). If this result is not affected by the small number of pathway-specific genes available for analysis, then it indicates that compared to all testis-specific genes, sperm precursor cell genes may be more dependent on gene splicing and nuclear export events found in splicing pathways. Finally, the expression level of genes in less differentiated cell types was found to be lower than that of genes in more highly differentiated cells (Table 5). Levels in less differentiated cells were lower than the average for all testis-specific genes (i.e. 24.1 and 42.3 compared to 49.5 RPKM) while higher levels were observed in the more differentiated populations. This observation is consistent with the idea that overall as cells differentiate they express a smaller number of distinct genes, but genes in the group are expressed at a higher level.

Control of gene expression
Current ideas about vertebrate gene regulation emphasize the involvement of structured chromosomal domains [25][26][27]. Actively expressed genes are thought to be contained on regions of chromatin that project outward from a core region of heterochromatin, an area where gene expression is repressed. Projecting or looped chromatin regions contain a small number of active genes located between insulator regions composed of CTCF/cohesin or YY1 binding sites [26,28]. Active genes present in loops contain RNA polymerase II (RNAPII), promoters and transcription factors involved in gene regulatory control. Also present may be enhancer/promoter regions of DNA located remotely on the chromosome, but containing bound transcription factors able to affect gene expression.

CG islands
CGI-containing genes suggest themselves as components of the heterochromatic region where gene expression is suppressed. Methylation of CpG sequences is known to repress gene expression or to make temporary repression more permanent [2]. The absence of CGI from a substantial portion (~50%; Table 1) of the testis-specific genes examined here suggests that a CGI could be a threat to testis-specific gene expression by exposing a gene to suppression through CpG methylation. In contrast to the testis-specific genes that lack a CGI, the results here show that a significant proportion has a CGI (also ~50%; Table 1). This would be the case with genes whose expression needs to be suppressed in non-testis tissues.
LINC genes constitute a second population where many genes lack CGIs (Table 1). LINC are weakly expressed genes that specify non-coding RNA molecules thought to function as sponges for unneeded proteins or perhaps as components of protein-RNA complexes [17,29]. The lack of CGIs in most LINC gene promoters suggests it is rarely necessary for their expression to be repressed permanently.

Level of gene expression
Current ideas about the role of structured chromosome domains provide few clues regarding factors that affect the level of gene expression. Proximity of a gene to a CTCF/cohesin insulator may potentiate expression, but otherwise little guidance is provided [26]. The results reported here indicate that a gene's expression level is not strongly affected by a CGI in the promoter region. The mean expression level of genes with a CGI in the promoter is about the same as that of genes lacking a CGI (Table 2). Also, LINC genes were found to be more weakly expressed compared to the average of testis-specific genes, and intronless genes are more strongly expressed. The latter observation is in conflict with results indicating that the level of gene expression is potentiated by the presence of introns and mRNA splicing pathways [30,31].

Transcription factor binding
As it is well established that transcription factors bound to the promoter can have important effects on gene expression, transcription factor binding sites were examined thoroughly in the testis-specific gene population considered here. To simplify the analysis somewhat, I focused only on TFBS near the transcription start site. This simplification can be justified by the fact that the TSS is the site where transcription by RNAPII is initiated and where binding of a transcription factor might have its maximum effect.
The results led to the identification of three transcription factors (Pax-5, AP-2αA and GRα) whose abundance makes them candidates for a role in testis-specific gene expression (Table 4). Although Pax-5 is best known for its effects on B cell development, it has been noted to be prominently expressed in testis [32,33]. A similar situation applies in the case of AP-2αA. While AP-2α is best known for effects in the nervous system [34,35], a related transcription factor, AP-2γ, recognizes a DNA sequence similar to that of AP-2αA and has effects on testis development [36,37]. I suggest that AP-2γ could be the factor that recognizes AP-2 sites in the testis-specific genes identified here. GRα, a member of the glucocorticoid receptor family, is widely expressed in human tissues where it is known to have multiple effects on gene expression [38]. It would have specific effects in the testis only if another feature such as a specific isoform or association with another protein were involved [39].
As shown in Table 4, a wide range of expression level was observed among the genes having a TSS-proximal TFBS. For instance, in the case of genes having a Pax-5 site, the range was 12.5-174.2 RPKM. This observation suggests the effect of individual TFs can be either activating or suppressive.

Differentiation of testis-specific cells
The present study benefitted from the results of immuno-histochemical analyses in which testis-specific genes could be associated with cells at progressively more mature states of differentiation [8]. All four recognized pathway-specific cell types were found to be populated by at least a few database testis-specific genes (Table 5). This permitted features of gene regulatory control to be compared among genes of the four cell types (i.e. spermatogonium, spermatocyte, spermatid and sperm). The results showed that an increase in differentiated state correlated with an increase in the proportion of genes: (1) lacking a CGI, (2) lacking introns, and (3) with an increased level of gene transcription.
The observed increase in the proportion of genes lacking a CGI may be interpreted in the same way as the similar increase observed when tissue-specific genes are compared with broadly expressed genes [2,15]. Genes of more highly specialized cells (i.e. tissue-specific and more differentiated cells) may have a reduced need for permanent repression by the CpG methylation pathway. A similar interpretation is suggested to apply to the observed increase in intronless genes among more highly differentiated cells. Such genes may be reduced in their need for mRNA splicing and splicing-related pathways of mRNA transport out of the nucleus. The observed increase in tissue-specific gene expression level with cell differentiation state (Table 5) may be simply a consequence of the overall differentiation process. As a more highly specialized cell is created, the need for more abundant, highly specialized gene products is increased while the need for products of less specialized cells is decreased. The observed increase in expression level in more differentiated cells could have a useful consequence for investigators studying gene regulation. The correlation of expression level with increased differentiation state could be used to identify the extent of differentiation in an unknown cell type.
Finally, focus on a population of tissue specific genes as described here is interpreted to support the view that this is an attractive way to further our understanding of development and cell differentiation processes. For instance, testis-specific genes have been identified in both mouse and Drosophila with most mouse genes similar to those found in humans [40,41]. It would be of interest to know whether the regulatory control elements are also similar in mouse. It might also be of interest to know whether the observed increase in CGI-less genes and intronless genes observed here with more differentiated testis-specific genes is also found in specific genes of other tissues. Additional features of gene regulatory control such as the role of insulators and structural domains might also be productively evaluated with tissue-specific genes.
Supporting information S1