SBM and SJMJ conceived and designed the experiments, performed the experiments, and wrote the paper. SBM, JMS, ABW, and SJMJ analyzed the data. SBM, OLG, ABW, and SJMJ contributed reagents/materials/analysis tools.
¤ Current address: Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom
The authors have declared that no competing interests exist.
Advances in the computational identification of functional noncoding polymorphisms will aid in cataloging novel determinants of health and identifying genetic variants that explain human evolution. To date, however, the development and evaluation of such techniques have been limited by the availability of known regulatory polymorphisms. We have attempted to address this by assembling, from the literature, a computationally tractable set of regulatory polymorphisms within the ORegAnno database (
Computational techniques are used in biology to prioritize DNA sequence variants (or polymorphisms) that may be responsible for population diversity and the manifestation of species-specific traits. Predominantly, they have been used to predict the class of polymorphisms that alter protein function through allele-specific changes to amino acid composition. However, polymorphisms that alter gene expression have been increasingly implicated in manifestation of similar traits. Prioritization of these polymorphisms is challenged, though, by the lack of knowledge regarding the mechanisms of gene regulation and the paucity of characterized regulatory polymorphisms. Our work attempts to address this issue by assembling a collection of regulatory polymorphisms from the existing literature. Furthermore, we use this collection to investigate and prioritize various properties that may be important for identifying novel regulatory polymorphisms.
Our ability to identify the molecular mechanisms responsible for specific genetic traits within our population will be enhanced by our imminent ability to decipher each individual's genome. This is evident from recent advances in sequencing and genotyping technologies, which allow an increasing number of variants to be sampled for association and linkage (reviewed in [
Conventional computational approaches to rSNP classification have predominantly relied on allele-specific differences in the scoring of transcription factor weight matrices as supplied from databases such as TRANSFAC and Jaspar [
A substantial challenge with developing strategies for identifying functional noncoding variants has been the shortage of characterized regulatory variants. Few studies have successfully identified the causative variant(s) after a susceptibility haplotype is identified. To address this problem, we have assembled the largest openly available collection of functional regulatory polymorphisms within the ORegAnno database (
We have used this dataset of rSNPs and their properties to train a support vector machine (SVM) classifier. Two approaches were used to train the classifier: one in which the properties of all rSNPs were compared with those of all the ufSNPs, and one in which each property value of the positive SNPs and ufSNPs within an associated gene was compared with the average value for that property within that gene (referred to here as the “All” and “Group” approaches, respectively). The All approach is designed to determine if there are any properties that are important across the test set, while the Group approach is designed to determine if there are important directional shifts in values within a promoter that may discriminate functional SNPs from ufSNPs. In a 10-fold cross-validated test, the SVM achieves a receiver operating characteristic (ROC) value of 0.83 ± 0.05 for the All analysis (sensitivity, 0.82 ± 0.08; specificity, 0.71 ± 0.13) and 0.78 ± 0.04 for the Group analysis (sensitivity, 0.72 ± 0.19; specificity, 0.68 ± 0.07).
Literature describing noncoding polymorphisms responsible for allele-specific differences in gene expression was surveyed from PubMed [
Investigated Properties
For each of the 78 transcripts, SNPs within 2 kb of the TSS were extracted from version 37 of EnsEMBL (dbSNP version 125), producing 951 ufSNPs. The ufSNP and rSNP genomic locations have been mapped (see
A total of 23 different properties of relevance to assessing regulatory function were calculated for each SNP in both the 104-rSNP and ufSNP sets (
Two types of analyses were conducted using the investigated properties. One was an all-versus-all approach, where the 104-rSNP and ufSNP sets were compared en masse. The other was a group analysis, where the average value of each property within each upstream noncoding region was first calculated, and then the individual SNP properties within that region were recalculated as the difference from this average. The All test data were designed to identify global characteristics of rSNPs, while the Group test data were designed to look for directional trends within the sampled region that might be indicative of SNP importance. For example, the All test is able to ask whether rSNPs have generic features that would distinguish them from any other promoter SNP; the Group test is designed to identify whether there are any features that distinguish rSNPs from other SNPs within the same upstream noncoding region.
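The Group transformation described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline (which was written in Perl); the record layout, the field names `region` and `distance_to_tss`, and the function name `group_center` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def group_center(records, prop):
    """Re-express one property as the difference from its region average.

    `records` is a list of dicts, each with a "region" key (the upstream
    noncoding region the SNP belongs to) and a numeric value under `prop`.
    """
    # Collect the property values observed in each upstream region.
    by_region = defaultdict(list)
    for rec in records:
        by_region[rec["region"]].append(rec[prop])

    # Average value of the property within each region.
    region_mean = {region: mean(vals) for region, vals in by_region.items()}

    # Replace each SNP's value with its offset from the region average.
    return [
        {**rec, prop: rec[prop] - region_mean[rec["region"]]}
        for rec in records
    ]


# Hypothetical input: two SNPs upstream of geneA, one upstream of geneB.
snps = [
    {"id": "snp1", "region": "geneA", "distance_to_tss": 100.0},
    {"id": "snp2", "region": "geneA", "distance_to_tss": 300.0},
    {"id": "snp3", "region": "geneB", "distance_to_tss": 50.0},
]
centered = group_center(snps, "distance_to_tss")
```

In the All analysis the raw values would be used directly; in the Group analysis each value is interpreted relative to the other SNPs in the same region, so a region containing a single SNP always yields a difference of zero.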
The All and Group test data were input to the Gist SVM implementation [
The individual importance of each property in discriminating regulatory polymorphisms was assessed in the All and Group test sets using a Wilcoxon rank sum test. Each value was corrected for multiple testing using the BioConductor MTP package (
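As an illustration of the per-property test, the sketch below implements a two-sided Wilcoxon rank sum test with the normal approximation (tie-corrected, no continuity correction). The multiple-testing step shown is a plain Bonferroni adjustment, used here only as a simple stand-in; it is not the procedure of the BioConductor MTP package used in the study. The property names and values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, p): rank sum of x and a two-sided normal-approximation p."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    w = sum(average_ranks(combined)[:n1])
    mu = n1 * (n + 1) / 2
    # Tie correction: sum of t^3 - t over groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # all values identical: no evidence of a shift
        return w, 1.0
    z = (w - mu) / math.sqrt(var)
    return w, math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p_values):
    """Stand-in correction (assumption): multiply by the number of tests."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Hypothetical property values for rSNPs and ufSNPs.
properties = {
    "distance_to_tss": ([20, 35, 60, 80, 110], [400, 650, 900, 1200, 1500, 1800]),
    "gc_content": ([0.51, 0.48, 0.55, 0.47, 0.50], [0.49, 0.52, 0.46, 0.53, 0.50, 0.48]),
}
raw = {name: rank_sum_test(r, u)[1] for name, (r, u) in properties.items()}
adjusted = bonferroni(raw)
```

For small samples an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.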
The performance of the Gist SVM classifier was measured using a ROC curve. ROC scores of 1 indicate perfect discrimination, while those at 0.5 indicate random classification of the input SNPs. ROC performance measurements have been described in detail elsewhere [
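One way to read the ROC score is as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative example, with ties counted as one half. The sketch below computes the area under the ROC curve on that basis; it is a generic illustration, not the Gist implementation, and the scores are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve from classifier scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    example scores higher; tied pairs contribute one half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical discriminant scores for rSNPs (positives) and ufSNPs (negatives).
perfect = roc_auc([0.9, 0.8], [0.2, 0.1])   # every positive outranks every negative
mixed = roc_auc([0.9, 0.3], [0.5, 0.1])     # one of four pairs is misordered
```

With this definition, a classifier that assigns the same score to every SNP obtains 0.5, matching the "random classification" baseline described above.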
A 10-fold cross-validation was performed to assess the overall performance of the SVM. The input data were randomly partitioned by transcript into ten sets. For each fold, data from one set were held out and the SVM was trained on the remaining nine sets. This analysis was repeated for each set to cover the entire training set and to calculate an average ROC value for the SVM.
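Partitioning by transcript, rather than by individual SNP, keeps all SNPs from one upstream region in the same fold. A minimal sketch of such a partition is given below; the record layout, the round-robin assignment of shuffled transcripts to folds, and the function names are assumptions for illustration rather than details taken from the study.

```python
import random


def transcript_folds(transcripts, k=10, seed=0):
    """Assign each distinct transcript to one of k folds at random."""
    unique = sorted(set(transcripts))
    random.Random(seed).shuffle(unique)
    # Round-robin assignment after shuffling keeps fold sizes balanced.
    return {t: i % k for i, t in enumerate(unique)}


def cross_validation_splits(records, k=10, seed=0):
    """Yield (train, test) lists; SNPs of a transcript never straddle both."""
    fold_of = transcript_folds([r["transcript"] for r in records], k, seed)
    for fold in range(k):
        train = [r for r in records if fold_of[r["transcript"]] != fold]
        test = [r for r in records if fold_of[r["transcript"]] == fold]
        yield train, test


# Hypothetical data: 78 transcripts with three SNPs each.
records = [
    {"transcript": f"T{i}", "snp": f"snp{i}_{j}"}
    for i in range(78)
    for j in range(3)
]
splits = list(cross_validation_splits(records, k=10, seed=1))
```

Each fold's ROC value would then be computed on its held-out set and the ten values averaged. Because transcripts carry different numbers of SNPs, the held-out sets are only approximately one tenth of the data.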
We were concerned that several properties may be indirect measurements of distance from the TSS, and that any discrimination strategy would be limited to characterizing this property alone. This concern is a particular challenge since distance ascertainment bias exists; most SNPs surveyed were within a few hundred base pairs of the TSS, which is much smaller than our sampling distance of 2 kb. Furthermore, a previous study has shown that distance to the TSS is correlated with detection of rSNPs (it is unknown if this is because they are more likely to affect essential transcription factor–binding sites, or because there is a higher density of transcription factor–binding sites in these regions) [
CpG island expectancy is plotted for each chromosome as a function of the distance from the TSS. This type of data was used to normalize many of the features in this study for distance from the TSS. In this figure, the expectancy of being in a CpG island at position −1 for any promoter region is ~0.5.
A total of 104 rSNPs and 951 ufSNPs in the upstream noncoding regions of 78 genes were compiled to test properties that discriminate polymorphisms with effects on gene expression. A multiple testing–corrected Wilcoxon rank sum test was used to analyze the All test data (
Analysis of rSNP and ufSNP Properties in the 2-kb and 152-bp Upstream Noncoding Regions
However, a concern with the All analysis was that calculated property values for SNPs in individual upstream noncoding regions would not be comparable with those in other upstream noncoding regions due to differences in background property values. To address this, a multiple testing–corrected Wilcoxon rank sum test was also used to analyze the Group test data (
Both lists are highly concordant and demonstrate several properties that may be of utility when prioritizing SNPs for functional analysis either across the genome or within an individual upstream noncoding region. In both tests, distance to the TSS was found to be the most significant discriminant. While it is possible that ascertainment bias in the 104-rSNP set contributes to the strength of this discriminant in our study, this property has also been independently identified as an important discriminant in a previous study where, in 500-bp assayed regions, 50% of rSNPs identified through transfection experiments were within 100 bp of the TSS (
Furthermore, several other properties are consistently identified as being significant after normalization against distance to TSS. One property, ClustalW alignment distance, was identified in both the All and Group tests as being significant. The mean value of ClustalW alignment distance was slightly higher for the tested rSNPs compared with the ufSNPs, indicating that 1-kb multiple alignments centered on the tested rSNPs were more divergent than those centered on ufSNPs. This result is concordant with previous analyses of conservation around rSNPs (
Another property of significance was repetitive element content. Our results indicate that the tested rSNPs were less likely to be in or around repetitive elements. This suggests that regions that are likely to contain a transcription factor–binding site are less likely to accrue repetitive elements and be subject to dysregulation. We note, however, that the extent of any ascertainment bias with respect to repetitive elements in the 104-rSNP set is not known, and future collections of discovered rSNPs should address this issue.
Both MAF and derived allele frequency are also identified as significant discriminants. Unexpectedly, for genotyped SNPs, the MAF was higher in the 104-rSNP set than in the ufSNP set. Previous analyses of MAF have suggested that most functional SNPs are positioned around 6% [
Another interesting result was that SNPs in the 104-rSNP set were less likely to be in CpG islands than were ufSNPs. Since CpG expectancy was normalized from average values at specific distances from the TSS of associated genes across individual chromosomes, an admixture of CpG and CpG-less promoters would drive the 104-rSNP set values lower than the ufSNP set values (
Many tested properties fell below our significance threshold in these tests. Of interest, neither weight matrix–based approach discriminated well. In addition, our definition of coexpression was sufficiently broad as to allow multiple coexpressed partners for any given gene; this may have reduced the overall effectiveness of narrowing transcription factor–binding profiles using this information. However, the performance of the coexpression-filtered approach was moderately better than the TRANSFAC approach alone. This suggests that targeted analysis of specific, biologically relevant transcription factors may further increase the discriminating ability of this approach. This should also act as a warning against indiscriminate application of the TRANSFAC approach to this problem. Furthermore, none of the DNA structural or stability analyses used were successfully discriminatory. This analysis could indicate that not only do these features have nongeneralizable effects using the data in this study, but since these analyses also measure local sequence composition, no particularly important effect is caused by specific base changes.
To evaluate whether the combination of the tested properties would enhance discrimination of rSNPs from ufSNPs, we trained an SVM for each of the All and Group test data sets. We tested the classification performance of the SVMs by 10-fold cross-validation. The mean area under the ROC curve was 0.83 ± 0.05 for the All SVM and 0.78 ± 0.04 for the Group SVM. Both suggest good performance. It is notable, however, that when distance is removed from the classification, the performance of each test drops to 0.52 ± 0.09 and 0.48 ± 0.07, respectively (
Representative ROC curves were calculated by training an SVM on a 90% subset of the 104-rSNP and ufSNP datasets. Here, 93 rSNPs and 882 ufSNPs were used for training, followed by testing on the held-out 10%. The All SVM approach was used for training. Furthermore, each curve had one tested property held out to demonstrate the impact of various properties on training. Notably, many curves are the same except for a marked reduction in performance when the “Distance to TSS” property is held out. The area under the “all” curve is 0.830. The dot on the “all” curve marks the location of the decision boundary selected by the SVM. At this boundary, the SVM identifies nine of 11 true positives and 56 of 69 true negatives. (Plots for each tested partition are available at
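The counts reported at the decision boundary translate directly into sensitivity and specificity for this representative partition. The short calculation below uses only the counts stated in the caption (nine of 11 rSNPs and 56 of 69 ufSNPs correctly classified); the function names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives (rSNPs) that are classified as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives (ufSNPs) that are classified as negative."""
    return true_neg / (true_neg + false_pos)


# Held-out set of the representative partition: 11 rSNPs and 69 ufSNPs.
sens = sensitivity(true_pos=9, false_neg=11 - 9)     # 9 / 11
spec = specificity(true_neg=56, false_pos=69 - 56)   # 56 / 69
```

These single-partition values (about 0.82 and 0.81) are not expected to equal the cross-validated means, which average over all ten held-out sets.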
To address the issue of distance bias further, we fortuitously identified that, across our dataset, in the 152 bp immediately upstream of the TSS, the average distance to the TSS for the ufSNPs was identical to that of the rSNPs. This 152-bp window therefore represented a region with no observable distance biases, albeit a greatly reduced subset in size; at this window size, only 16 rSNPs and 21 ufSNPs were available for analysis. When analyzed using a multiple testing–corrected Wilcoxon rank sum test for both All and Group test sets, only two properties were significant (
We also examined the position of identified rSNPs to characterize possible bias. Our expectation was that well-established transcription factor–binding sites such as the TATA and CCAAT boxes may be overrepresented and contribute to lower distance values. A histogram of rSNPs for the first 300 bp of sequence from the TSS shows an expected increase around the 21–31 position where seven rSNPs are located, twice as many as average. However, it is apparent that these types of binding sites are only overrepresented slightly when compared with the distribution of rSNPs at other positions (
The positions of rSNPs are plotted in a histogram for bin sizes of 10 bp for the first 300 bp of sequence from the TSS. A blip is seen at position 21–31, where it is likely that TATA and CCAAT box–binding sites are located. These types of rSNPs, however, are only slightly overrepresented in this study and from this graph are not expected to significantly bias the outcome.
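A histogram of this kind can be produced by counting rSNP positions in fixed-width bins. The sketch below assumes bins of 1–10 bp, 11–20 bp, and so on out to 300 bp; the exact bin edges used for the figure are not stated, and the example distances are hypothetical.

```python
from collections import Counter


def bin_positions(distances, bin_size=10, max_distance=300):
    """Count SNPs per bin of `bin_size` bp, for distances 1..max_distance.

    Returns a list whose element i is the count for distances
    i * bin_size + 1 through (i + 1) * bin_size.
    """
    counts = Counter()
    for d in distances:
        if 1 <= d <= max_distance:
            counts[(d - 1) // bin_size] += 1
    return [counts.get(i, 0) for i in range(max_distance // bin_size)]


# Hypothetical distances (bp upstream of the TSS) for a handful of rSNPs.
example = [1, 10, 11, 25, 26, 300, 301]
histogram = bin_positions(example)
```

Comparing the count in any one bin with the mean count across bins gives a rough sense of whether a position, such as the region near the TATA box, is overrepresented.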
All pipeline software has been programmed in Perl and is available under the GNU Lesser General Public License at
This study introduces the largest publicly available collection of rSNPs—160 known rSNPs from literature. Using this collection, we investigate 104 rSNPs and 951 ufSNPs in human 2-kb upstream regions to identify properties that may discriminate functional from nonfunctional polymorphisms. We identify several properties that may be useful to researchers attempting to determine the functional status of upstream noncoding SNPs. The most important properties detected suggest that rSNPs are close to the TSS, are not within CpG islands, are isolated from repetitive elements, possess higher MAF and higher derived allele frequency, and are within comparatively more divergent regions. However, within a 152-bp window, where an equal distribution of rSNPs and ufSNPs from the TSS is obtained, the significant results suggest that only repetitive element content and local divergence remain important (we have included in
Through this work, several challenges are apparent with current predictive approaches to prioritize candidate rSNPs. Necessary to future analyses is a dataset of core promoter polymorphisms that are nonfunctional across a broad range of cell types; since our negative control set was a neutral set, it is likely that more accurate performance metrics can come from the addition of a reliable negative control set. Furthermore, recent analysis of allelic expression difference has demonstrated that the effects of rSNPs may be highly context-specific such that function in one cell line may not imply function in others; to address this complication, future analysis may require expanded collections of cell line–specific positive and negative rSNPs [
In summary, this study introduces a new dataset for the investigation of rSNPs. We have also introduced one of the first gene regulation and population genetics–based approaches to classifying rSNPs in the core promoter regions of human genes. We identify the utility of different gene regulation and population genetics properties in discriminating literature-curated rSNPs. Such results are increasingly essential to researchers seeking criteria for prioritizing SNPs to test in association, binding, or expression assays. Furthermore, we provided evidence that popular methodological practices based on identification of allele-specific differences in position weight matrices through unrestricted application of the TRANSFAC database are poor criteria for SNP selection. However, we highlight the fact that because of the lack of extensive unbiased collections of rSNPs, it still remains challenging to dissect the existing effects of investigator or methodological biases in evaluating the importance of these properties. We hope that this work will stimulate active discussion and both the development of expanded collections of rSNPs and an improved class of bioinformatics tools for rSNP analysis that address these challenges.
The locations of the tested rSNPs and ufSNPs are plotted upstream of their respective genes.
(6.4 MB PNG)
The tested rSNP data are listed with information describing experimental evidence, associated gene, and dbSNP number, if available.
(59 KB PDF)
Different upstream window sizes were selected for All and Group analyses. The results of the Wilcoxon rank sum test for these windows are summarized and displayed as figures.
(44 KB XLS)
Promoters annotated in ORegAnno were assembled into this background file for MotifSampler analysis.
(5 KB RTF)
We would like to acknowledge Manolis Dermitzakis and Wyeth W. Wasserman for support and feedback during the development of this work.
MAF, minor allele frequency
ROC, receiver operating characteristic
rSNP, regulatory single-nucleotide polymorphism
SVM, support vector machine
TSS, transcription start site
ufSNP, SNP of unknown function