Large Genomic Region Free of GWAS-Based Common Variants Contains Fertility-Related Genes

DNA variants, such as single nucleotide polymorphisms (SNPs) and copy number variants (CNVs), are unevenly distributed across the human genome. Currently, dbSNP contains more than 6 million human SNPs, and whole-genome genotyping arrays can assay more than 4 million of them simultaneously. In our study, we first questioned whether published genome-wide association studies (GWASs) assays cover all regions well in the genome. Using dbSNP build 135 data, we identified 50 genomic regions longer than 100 Kb that do not contain any common SNPs, i.e., those with minor allele frequency (MAF)≥1%. Secondly, because conserved regions are generally of functional importance, we tested genes in those large genomic regions without common SNPs. We found 97 genes and were enriched for reproduction function. In addition, we further filtered out regions with CNVs listed in the Database of Genomic Variants (DGV), segmental duplications from Human Genome Project and common variants identified by personal genome sequencing (UCSC). No region survived after those filtering. Our analysis suggests that, while there may not be many large genomic regions free of common variants, there are still some “holes” in the current human genomic map for common SNPs. Because GWAS only focused on common SNPs, interpretation of GWAS results should take this limitation into account. Particularly, two recent GWAS of fertility may be incomplete due to the map deficit. Additional SNP discovery efforts should pay close attention to these regions.


Introduction
The human genome contains millions of common SNPs, which are being deposited into public databases. These data have been used to design genome-wide association studies (GWASs) [1,2,3]. Common SNPs are better powered in association tests [4]. However, genomic regions not covered by common variants are neglected. Those neglected regions may contain variants with low frequencies, and should be paid more attention to because rare variants are even more likely to be functional than common ones [5].
In our study, we were interested in two questions: 1) whether the human genome is sufficiently covered by common SNPs and is sufficiently captured by common SNPs of standard GWAS platforms, and 2) whether any genes were included in those regions and their enriched biological functions.
To answer these two questions, we started with searching regions without common SNPs, called common SNP-free regions (CSFRs), regions free of both common SNPs and CNVs, called common variant-free regions (CVFRs). Next, we explored the functional enrichment of genes identified in CSFRs and CVFRs. With available personal genome sequencing data, whether these CSFRs and CVFRs contain common and rare variants were also examined.

Identification of CSFRs and CVFRs
Common SNPs (MAF$1%) in dbSNP build 135, Genome Assembly Gaps and Genome Database refGene data were downloaded from the UCSC Genome Browser (http://genome. ucsc.edu/) ( Table 1). The CNV data were downloaded from the DGV (Table 1). Using the common SNP table, we calculated distances between adjacent common SNPs and subtracted regions containing the genome assembly gaps. If the remaining SNP intervals were longer than 100 kb, those intervals were defined as CSFRs. The CSFRs were further searched for CNVs. If after subtracting regions containing CNVs, the intervals were still longer than 100 kb, those intervals were defined as CVFRs. The reason we used 100 kb as bin to detect SNP free region is the SNP Linkage disequilibrium distance: several groups reported blocks of up to 100 kb in length exhibiting very strong linkage disequilibrium [6,7].
To verify our result for its impacts on GWAS, we first determined whether the CSFRs are truly missed by Affymetrix Genome-Wide Human SNP Arrays. Next, we asked whether these regions included rare variations or were devoid of genetic variation. We analyzed common SNP data obtained from Genomes Unzipped (genomesunzipped.org) and Personal Genome Variation tracks from the UCSC Genome Browser. These two datasets are collections of variants that have been identified in the sequencing of personal genomes ( Table 1).

Identification of genes in CSFRs and CVFRs
Gene annotation data from the Human Genome assembly hg19 UCSC refGene was used to map coding genes in the CSFRs and CVFRs (Table 2). Genes were included if their transcription regions overlapped with the CSFRs/CVFRs by at least one base pair. When a gene had multiple splicing forms, we chose the longest splicing form to define the gene region.

Pathway and functional analyses
The genes identified in the CSFRs/CVFRs were used to analyze their enrichment of biological functions through the Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf.gov/tools.jsp).

Isochore characterization
Isochore is a large region of DNA sequence which has a relatively uniform degree in its GC content [8]. We use 100 kb as the length of flank region and 2% GC difference as indicator to identify isochore, isochore border and unknown region among SNP free regions. All SNP free regions in this study are longer than 100 kb. CSFRs are identified as isochore if its GC content is 2% greater or lower than both right and left regions. CSFRs are identified as isochore border if the difference of GC content between two flank regions is greater than 2%, and GC-content difference between left flank and right flank region is greater than GC-content difference between CSFR and its flank regions. Unknown region means CSFR is neither isochore nor isochore border.
We checked our results in the Affymetrix SNP Array 6.0 by its annotation data. Among the CSFRs, we found 25 SNPs' information in the annotation file, and only four of them had non-zero minor allele frequency: rs11681529, rs2571764, rs2874557, and rs35516764. The other 20 are monomorphic for HapMap four populations (Caucasian, African, Chinese and Japanese). Therefore, we concluded that most of these 50 large genomic regions has not been covered properly by the Affymetrix 6.0 Array at least in those major populations investigated.

Genes in CSFRs and CVFRs and their functional enrichment
Ninety-seven genes overlapped with 28 of the 50 CSFRs (56%) ( Table 2). DAVID was used to test whether the annotations of this set of genes were over presented with particular GO terms [9]. They were highly enriched with biological pathways involved with sexual reproduction, spermatogenesis, male gamete generation, gamete generation, multicellular organism reproduction, and reproductive processes in a multicellular organism (p,0.05 and FDR q,0.05, Table 4). The gene set included a number of gene previously reported to be related to reproduction, including DAZ1 [10,11], BPY2 [12], TSPY2 [11], CDY1 [13], CDY2A [13] and RBMY1 [11]. A gr/gr deletion polymorphism on Y chromosome of those CSFRs has also been suggested to be a risk factor of spermatogenic impairment in some populations [14,15].

SNP-free regions from personal genome sequencing and segmental duplications
We further explored those SNP-free regions in personal genome variant data. Rare variants were detected in most of the CSFRs or CVFRs. Only one region on X chromosome (chrX: 52,267,361-52,395,914) left. We also examined this region in updated dbSNP database (dbSNP137, http://www.ncbi.nlm.nih.gov/). Two more common SNPs were deteceted (rs201652812 and rs199865557). After subtract them, the left region was 105 kb (chrX: 52,290,698-52,395,914), which was the finally region not containing any  chrY  9524503  9640365  115862  TTTY8, TTTY8B, TTTY7B, TTTY7, TTTY21,   TTTY21B, TTTY2B, TTTY2, TTTY1, TTTY1B   TTTY22   chrY  20228333  20599266   known variant in all of the genome-wide sequencing data that we were able to collect. XAGE2 and its splicing isoforms were harbored in this region. We next tested this final region in segmental duplication database from Eichler's lab (http://eichlerlab.gs.washington.edu/ database.html) [7], and found it was overlapped with one of the segmental duplication regions.
We found that 49 CSFRs did carry SNPs in the Genomes Unzipped and Personal Genome Variation tracks. And the left X chromosome region did not contain any SNPs but overlapped with segmental duplication region.

Twenty-four CSFRs are isochore borders
To dig out the sequence properties of 50 CSFRs, we characterized those regions by GC content. Different GC contents can separated DNA sequences into compositionally fairly homogeneous regions [8]. By comparing GC contents between CSFRs and their flanking regions, we found that twenty-four CSFRs belong to isochore border regions, seven belong to isochore regions, and eighteen are unknown regions ( Table 2, Table S1).

Discussion
We performed a thorough search for large genomic regions that are free of common variants in dbSNP and we found 50 CSFRs and 20 CVFRs. Most of these variations free regions located on Y chromosome. Genes in the CSFRs were highly enriched for activities related to reproduction. Further investigation in the sequencing of personal genomes found most of the CSFRs (49 out of 50) did contain rare SNPs, suggesting those regions have not been covered well in the existing common variants sequencing projects, like the 1000 Genomes Project.
GWAS is one the most infusive common variants sequencing projects, but important finding might be missed because of its poor coverage of rare variants. Recently, two fertility GWAS studies were conducted but failed to find SNPs on sex chromosomes [16,17]. Both studies used Affymetrix GWAS platforms that we evaluated in this study. However, both sex chromosomes have long been implicated in infertility, specifically in spermatogenic damage in mouse models and in human candidate gene/region studies [18]. Our study found that those genomic regions free of common variants regions carrying many genes important to reproduction. With those important candidate genes missing, we must be cautious of analyzing fertility-related GWASs, which may produce false negatives.
The most reliable CVFR call contains the XAGE2 and its isoforms, which belong to XAGE subfamily. XAGE2 is strongly expressed in normal testes, and in some tumor [19]. Because genotyping platforms cannot fully cover structural variations such as segmental duplication, we further applied structural variations filtering analysis, and observed XAGE region was overlapped with segmental duplication. Based on these observations, we concluded that the observation of variant free regions is more a coverage problem with the current versions of dbSNP and existing GWAS assay platforms than a lack of assayable variation. When more genomes are sequenced, we may end up with proper coverage of complete human genome by common SNPs.
We mapped our SNPs on dbSNP build 135 and regions on GRCh37.p10 (hg19) assembly reference, which is the most accurate alignment version and with all current genome knowledge available. Comparing to old versions, hg19 changed many genomic coordinates and included alternate haplotype assemblies for chr6 (7 haplotypes), chr4 (1 haplotype), and chr17 (1 haplotype). Different versions can be converted by liftOver software (http://genome.ucsc.edu/cgi-bin/hgLiftOver). More details of differences in each version are provided in NCBI (http:// www.ncbi.nlm.nih.gov/genome/guide/human/release_notes. html).
Further study can focus on the sequence properties of those regions, and their conservative across species. Isochores are spatially heterogeneous in mammalian genome and varies in replication timing, gene richness, recombination rate, etc [20,21,22]. Natural selection is the most plausible explanation for formation and maintenance of isochores [20]. We observed nearly half of CSFRs are isochores and isochore border regions, which is a hint that these CSFRs may be under different selection pressure from its neighboring regions. To further test selection pressure, we mapped those regions to chimpanzee and mouse by Synteny analysis from Ensembl (http://useast.ensembl.org/ Homo_sapiens/Location/Synteny?r = 6:133017695-133161157), and found only 6 genes (RGPD5, RGPD6, GATSL2, FAM25G, HSFY1, HSFY2) can map to unique regions in the other two species. Next we applied dN/dS ratio test, the ratio of substitution rates at non-synonymous and synonymous sites, and found that human genes under more purify selection than chimpanzee genes (paired T test, p = 0.01, Table S2). Those results suggest that natural selection seems to be the major evolutionary force behind these variant-free regions.
In summary, by searching large genomic regions free of common variants for the first time, we identified tens of common variations free regions, and most of them were located on the X and Y chromosomes. The genes located in CSFRs are enriched for fertility. Incorporating personal genome data, only one region was still free of variants and harbored gene XAGE2, indicating most of the detections due to low coverage of rare variations. Future deep sequencing from more individuals and redesigning GWAS arrays should improve our understanding of the variability of these regions and their functional importance.