The authors have declared that no competing interests exist.

Conceived and designed the experiments: MG DL MAB. Performed the experiments: MG DL. Analyzed the data: MG DL. Contributed reagents/materials/analysis tools: MG DL MMN. Wrote the paper: MG DL MAB.

Current address: Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America

Oligomers of length

Genomic regulatory elements (enhancers, promoters, and insulators) control the expression of their target genes and are widely believed to play a key role in human development and disease by altering protein concentrations. A fundamental step in understanding enhancers is the development of DNA sequence-based models to predict the tissue-specific activity of regulatory elements. Such models facilitate both the identification of the molecular pathways that impinge on enhancer activity through direct transcription factor binding, and the direct evaluation of the impact of specific common or rare genetic variants on enhancer function. We have previously developed a successful sequence-based model for enhancer prediction using a

Predicting the function of regulatory elements from primary DNA sequence remains a major problem in computational biology. These elements typically contain combinations of several binding sites for regulatory factors whose combined activity specifies the developmental times, cell types, or environmental signals in which the element will be active. Genetic variation in regulatory elements is increasingly thought to play a significant role in the etiology and heritability of common diseases, and surveys of genome-wide association studies have highlighted the preponderance of significant variants in regulatory DNA

We have recently introduced a successful method for regulatory DNA sequence prediction, kmer-SVM, which uses combinations of short (6–8 bp)

Naively one could address this issue by using longer

We recently introduced gapped

To overcome the limitations associated with using

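We first define a feature vector f^{S} for a given sequence S, and then define the kernel function between two sequences, S_{1} and S_{2}, as the normalized inner product of the corresponding feature vectors as follows:

$$K(S_{1}, S_{2}) = \frac{\langle f^{S_{1}}, f^{S_{2}} \rangle}{\lVert f^{S_{1}} \rVert \, \lVert f^{S_{2}} \rVert}$$

The similarity score, K(S_{1}, S_{2}), is always between 0 and 1, and equals 1 when the two sequences are identical.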
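The definition above can be illustrated with a direct, brute-force computation. The following is a minimal sketch, not the authors' implementation: it assumes that a gapped k-mer is an l-mer in which only k of the l positions are informative, and the function names `gapped_kmer_counts` and `gkm_kernel` are hypothetical. It enumerates every gapped k-mer explicitly, so it is only practical for small l and short sequences.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers in seq (assumed form: l-mers with l - k positions masked).

    The sequence must contain at least one l-mer, i.e. len(seq) >= l.
    """
    counts = Counter()
    for start in range(len(seq) - l + 1):
        lmer = seq[start:start + l]
        # Choose which k of the l positions remain informative.
        for kept in combinations(range(l), k):
            kept_set = set(kept)
            pattern = "".join(
                base if i in kept_set else "-" for i, base in enumerate(lmer)
            )
            counts[pattern] += 1
    return counts


def gkm_kernel(seq1, seq2, l, k):
    """Normalized inner product of the two gapped k-mer count vectors."""
    f1 = gapped_kmer_counts(seq1, l, k)
    f2 = gapped_kmer_counts(seq2, l, k)
    dot = sum(count * f2[pattern] for pattern, count in f1.items())
    norm1 = sqrt(sum(count * count for count in f1.values()))
    norm2 = sqrt(sum(count * count for count in f2.values()))
    return dot / (norm1 * norm2)
```

Because all counts are non-negative, the returned value lies between 0 and 1, and a sequence compared with itself scores 1.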

Since the number of all possible gapped _{1}, _{2}) that does not involve the computation of all possible gapped _{1}, _{2}, and _{1} and _{2} are the numbers of full _{1} and _{2} respectively, i.e. _{1} = length(_{1})−_{2} = length(_{2})−_{lk}_{1}, _{2}) only depends on the number of mismatches, _{1} and _{2}, i.e. _{lk}_{1}, _{2}) = _{lk}_{m}_{1}, _{2}) is the number of pairs of _{lk}_{m}_{1}, _{2}) as the _{1} _{2}. Since each _{lk}_{m}
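The idea of replacing explicit gapped k-mer enumeration with a mismatch profile can be sketched as follows. This is an illustration under stated assumptions rather than the authors' algorithm: it relies on the counting argument that two l-mers differing in m positions share exactly C(l − m, k) gapped k-mers (the k informative positions must all fall among the l − m matching positions), and it counts mismatches by naive pairwise comparison. The names `mismatch_profile` and `gkm_inner_product` are hypothetical.

```python
from math import comb


def mismatch_profile(seq1, seq2, l):
    """profile[m] = number of l-mer pairs (one from each sequence) with m mismatches."""
    lmers1 = [seq1[i:i + l] for i in range(len(seq1) - l + 1)]
    lmers2 = [seq2[i:i + l] for i in range(len(seq2) - l + 1)]
    profile = [0] * (l + 1)
    for a in lmers1:
        for b in lmers2:
            m = sum(x != y for x, y in zip(a, b))
            profile[m] += 1
    return profile


def gkm_inner_product(seq1, seq2, l, k):
    """Unnormalized gapped k-mer inner product computed from the mismatch profile."""
    profile = mismatch_profile(seq1, seq2, l)
    # comb(l - m, k) is 0 when fewer than k positions match, so pairs with
    # more than l - k mismatches contribute nothing.
    return sum(n_m * comb(l - m, k) for m, n_m in enumerate(profile))
```

The normalized kernel then follows by dividing `gkm_inner_product(s1, s2, l, k)` by the square root of the product of `gkm_inner_product(s1, s1, l, k)` and `gkm_inner_product(s2, s2, l, k)`. The pairwise loop here is quadratic in the number of l-mers; it shows the identity only and makes no attempt at the faster profile computation discussed in the text.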

Determining a mismatch profile in Equation (3) is still computationally challenging since the numbers of mismatches between all possible

The direct and sequential evaluation of the kernel function between all training sequences becomes less practical as the number of training sequences gets larger, since it requires O(^{2}^{2}) operations of mismatch counting between _{max}

Because of the difficulty of reliably estimating long

We compared the performance of gkm-SVM and kmer-SVM on the CTCF data set for a range of oligomer lengths by varying either

Both gkm-SVM and kmer-SVM were trained on (A) CTCF bound and (B) EP300 bound genomic regions using different word lengths (_{max} = 3 are shown as dashed lines and AUCs of these faster approximations are comparable when the difference between m_{max} and

Interestingly, gkm-SVM shows consistently better performance than kmer-SVM even if

One further modification can substantially reduce the computational cost of using gapped _{max}_{m}_{max}_{max}

Encouraged by the analyses of CTCF and EP300 data sets above, we systematically compared gkm-SVM to kmer-SVM using a very broad range of human data sets generated by the ENCODE project

(A) We trained gkm-SVM and kmer-SVM on the complete set of 467 ENCODE ChIP-seq data sets (with

The predictive sequence features that allow gkm-SVM to outperform the single best PWM imply that cooperative binding is the underlying molecular mechanism that targets TFs to these regulatory regions. Previously we have typically focused on a handful of the highest SVM weight

To directly address this issue, we developed a new method to combine multiple similar

Since the early development of

To further evaluate our proposed method, we directly compared the gkm-kernel with the aforementioned three alternative methods, Mismatch kernel

(A) For each method, averages of 5-CV AUCs are shown as a function of the word length with the optimal number of mismatches, _{max}. (B) Running time for each of the kernel computations shown in (A). Gkm-kernels show better classification performance and significantly more efficient computation at peak AUC.

More significantly, when we compare running times at parameters which maximize AUC for each method, our gkm-SVM implementation (_{max}

Interestingly, we also note that both the Mismatch kernel and the Wildcard kernel are special cases of the more general class of kernels defined by Equation (3). This unification allows direct application of the methods we developed for mismatch profile computation and therefore yields more efficient ways to compute these existing kernels (see

As an alternative to the gapped

The weight

Direct calculation of Equation (5), however, requires actual counting of all of the _{lk}_{lk}

To systematically compare the classification performance of these new methods with the original gapped

So far, we have focused on using gapped

Here, we used the log-likelihood ratio of the estimated _{0}_{1}…_{n}_{–1}, is then given by:_{P}_{N}_{i}s_{i}_{+1}…_{i+l}_{−1}, in the positive and negative training set, and are given by Equation (11) below. We used the _{P}_{N}
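The scoring rule described above, a sum of log-likelihood ratios over the overlapping l-mers of a sequence, can be sketched as follows. This is a simplified illustration and not the estimator used in the paper: it swaps in plain add-one (pseudocount) smoothing of raw l-mer counts where the text uses frequency estimates from the positive and negative training sets. The names `lmer_frequency_estimator`, `log_likelihood_ratio`, and the `pseudocount` parameter are hypothetical.

```python
from collections import Counter
from math import log


def lmer_frequency_estimator(seqs, l, pseudocount=1.0, alphabet_size=4):
    """Return a function mapping an l-mer to its smoothed frequency in seqs.

    Assumption: add-one smoothing stands in for the paper's estimates.
    """
    counts = Counter(s[i:i + l] for s in seqs for i in range(len(s) - l + 1))
    total = sum(counts.values()) + pseudocount * alphabet_size ** l

    def frequency(lmer):
        return (counts[lmer] + pseudocount) / total

    return frequency


def log_likelihood_ratio(seq, freq_pos, freq_neg, l):
    """Sum of log(f_P / f_N) over every overlapping l-mer of seq."""
    return sum(
        log(freq_pos(seq[i:i + l]) / freq_neg(seq[i:i + l]))
        for i in range(len(seq) - l + 1)
    )
```

A positive score indicates that the l-mers of the sequence are, on balance, more frequent in the positive training set than in the negative one.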

Naïve-Bayes classifiers were trained on (A) CTCF bound and (B) EP300 bound genomic regions using different word lengths,

In this paper, we presented a significantly improved method for sequence prediction using gapped

The main biological relevance of the computational method we present in this paper is that gkm-SVM can accurately predict a wide range of specific classes of functional regulatory elements from the DNA sequence features in those elements alone. This in itself is interesting and implies that the epigenomic state of a DNA regulatory element is primarily specified by its sequence. In addition, our predictions facilitate direct investigation of how these elements function, either by targeted mutation of the predictive elements within the larger regulatory region, or by modulating the activity of the TFs which bind the predictive sequence elements. We are currently using changes in the gkm-SVM score to systematically evaluate the predicted impact of human regulatory variation (single nucleotide polymorphisms (SNPs) or indels) in order to interpret significant SNPs identified in genome-wide association studies. We demonstrated that gkm-SVM is better at predicting all ENCODE ChIP-seq data than the best single PWM found from the ChIP-seq regions, or previously known PWMs. The gkm-SVM is able to do so by integrating cofactor sequences which may not be directly bound by the ChIP-ed TF but facilitate its occupancy. Predicting this ChIP-seq set accurately required the improved accuracy of the gkm-SVM and its ability to describe longer binding sites such as CTCF, which were very difficult for our earlier kmer-SVM approach. We recovered most of the cofactors found by traditional PWM discovery methods, and we further show that these combinations of cofactors are predictive in the sense that they are sufficient to define the experimentally bound regions.

There are some further issues that need to be considered in the application of these methods. First, one will typically be interested in finding an optimal set of the parameters (

Our approach also avoids an issue that would arise if one chose instead to directly use Equation (5) for computing count estimates. This would involve a large number of floating point operations, and accumulated round-off error could become significant in the large summations. There are some algorithms, such as Kahan compensated summation
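For readers unfamiliar with the round-off issue mentioned above, the following is a minimal sketch of Kahan compensated summation, shown next to a plain running sum for comparison. It is a generic illustration and is not taken from the authors' code; the names `kahan_sum` and `naive_sum` are hypothetical. The plain sum is written as an explicit loop because some Python versions already apply compensation inside the built-in `sum` for floats.

```python
def naive_sum(values):
    """Plain left-to-right floating point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan compensated summation: carries the rounding error forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation          # apply the correction to the next term
        t = total + y                 # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total
```

As a usage example, summing `[1.0] + [1e-16] * 100000` with `naive_sum` never moves away from 1.0, because each small term is below the rounding threshold of the running total, whereas `kahan_sum` accumulates the small terms and returns a value close to 1.00000000001.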

Two issues which are left for future investigation are different treatment of end vs. internal gaps, and allowing imperfect mismatches. We currently do not make special consideration for gaps which occur at the end of a

Throughout this paper, we have focused on using DNA sequences as features for classifying the molecular or biological function of a genomic region. However, in principle, our method can be applied to any classification or prediction problem involving a large feature set. In general, when the number of features used by a classifier increases, the number of samples in the training set for each point in the feature space becomes smaller, and small sample count issues occur (which we have resolved using gapped

The Support Vector Machine (SVM)

For direct computation of the gkm-SVM kernel matrix, we represent each training sequence with a list of _{m}^{2}L^{2}) comparisons. In addition, a naive algorithm for counting the number of mismatches between two ^{2t}. The optimal value of

As depicted in _{i}_{i}_{i}_{1}_{2}_{3}_{i}_{j}_{i}_{j}_{i}_{j}_{m}_{i}_{j}_{i}_{j}

As an example, we use _{1}_{2}_{3}_{i}_{i}_{i}_{2}_{4}_{0}_{i}_{j}_{i}_{j}_{max}_{i}_{j}_{m}_{i}_{j}_{i}_{j}

To increase the speed further, we also introduce an optional parameter _{max}_{max}_{max}_{i}_{j}_{i}_{j}_{i}_{j}_{i}_{j}

We developed a new method for building _{i}^{st} PWM were then removed from the list, and the process was repeated on the remaining predictive

In the gkm-kernel, we define the feature vector to consist of the frequency of all the ^{l - k}_{m}

Using the above form allows us to directly use the fast algorithms we have developed for calculation of the mismatch profiles to calculate the wildcard kernels. Although there are similarities between our tree algorithm and the tree algorithm described in Ref.

In the mismatch string kernel described in Ref. _{lk}_{1}+_{2}−_{1} and _{2} where _{1} and _{2} differ in exactly _{1} in _{1} places and _{2} in _{2} places _{1} and _{2}. (See _{1} and _{2} in at most

To compute the _{i}_{i}_{ij}_{i}_{j}_{ij}_{ij}_{ij}_{ij}_{i}_{j}_{ij}_{ij}_{i}_{j}_{lk}

To obtain the element _{lk}_{i}_{j}_{i}_{τ}_{i}_{ij}_{i}_{j}_{i}_{j}_{j}_{lk}

In other words, there are _{i}_{j}_{lk}_{tr}(

In summary, we defined a generalized _{lk}_{lk}_{lk}_{lk}_{lk}_{lk}_{0}, where _{0} is the smallest number of mismatches for which _{lk}_{0})<0. This will give an approximation to the value of

Given a sequence ^{l}^{th} _{1} and _{2} are the number of _{1} and _{2}, and _{1} and _{2}. If _{1} and _{2} have exactly _{1}, _{2}) = _{m}_{m}_{1}, _{2}) is the mismatch profile of _{1} and _{2} as previously defined in Equation (3). We show that the weight _{lk}_{m}_{1}+_{2}−2_{m}_{1} and _{2}, with _{1} mismatches with _{1} and _{2} mismatches with _{2}. For this, we assume that _{1} mismatches are among the _{1} – _{1} positions and (^{t}_{1}−_{2}. Then, for the remaining _{2}−(_{1}−_{2} there are ^{r}_{1} mismatches with _{1} and _{2} mismatches with _{2}, where _{1} and _{1} and _{2} is given by

Using matrix notation, we can further show that _{m}_{m}_{lk}_{1} and _{2}. Given _{m}_{m}

To compare the performance of different classification methods, we calculated the area under the receiver operating characteristic (ROC) curve for each classifier. To plot the ROC curves and calculate the areas under the curves (AUCs), we used the ROCR package
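For orientation, the AUC can be computed directly from classifier scores without plotting the curve. The sketch below is not the ROCR package; it is a small Python illustration that uses the equivalent rank interpretation of the AUC, namely the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(len(pos_scores) * len(neg_scores)) time.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect classifier gives 1.0 and a classifier that assigns the same score to everything gives 0.5.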

Following standard five-fold cross-validation procedures, we divided the positive and negative sets into five segments, left one segment out as the test set, and used the other four segments for training. We repeated this for each of the five segments and calculated the mean and standard error of the prediction accuracy on the test set elements.
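The procedure above can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the text does not say how sequences were assigned to segments, so the random shuffle and the `seed` parameter are assumptions, and the names `five_fold_splits` and `mean_and_standard_error` are hypothetical. In practice the split would be applied to the positive set and the negative set separately, combining the matching folds.

```python
import random
from statistics import mean, stdev


def five_fold_splits(items, n_folds=5, seed=0):
    """Yield (train, test) pairs; each item is in the test set exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumption: random assignment to folds
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def mean_and_standard_error(values):
    """Mean and standard error of per-fold scores (needs at least two values)."""
    return mean(values), stdev(values) / len(values) ** 0.5
```

A typical use is to train on each `train` list, score the held-out `test` list, collect one accuracy measure such as the AUC per fold, and pass the five values to `mean_and_standard_error`.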

The ENCODE ChIP-seq datasets were downloaded from

We have implemented these algorithms in C++, and the source code and executable files are available on our website at

x_{1} and x_{2} differ in m places. u differs from x_{1} in m_{1} places and from x_{2} in m_{2} places. t of the mismatch places of u are among the l - m places at which x_{1} and x_{2} agree. The quantities satisfy m_{1}, m_{2}, t ≤ M.

We gratefully acknowledge the contribution of source code for the wildcard and di-mismatch approaches from Rui Kuang and Christina Leslie, which made the direct comparisons of these methods possible.