SVM2Motif—Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor

Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performances in genomic discrimination tasks, but—due to its black-box character—motifs underlying its decision function are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although being a major step towards the explanation of trained SVM models, they suffer from the fact that their size grows exponentially in the length of the motif, which renders their manual inspection feasible only for comparably small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices, by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs—regardless of their length and complexity—underlying the predictions of a trained SVM model. Our framework thereby considers the motifs as free parameters in a probabilistic model, a task which can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address by an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.


Introduction
Major technological advances in sequencing techniques within the past decade have facilitated a deeper understanding of the mechanisms underlying the functionality and evolution of organisms. Considering the pure size of a genome, it comes, however, at the expense of an 2. hampers manual inspection (in order to determine candidate motifs) already for rather small motif sizes such as k % 5 and is prohibitive for k ! 10. For example, a POIM of order k = 5 contains, at each position, already 4 5 % 1,000 oligomers that a domain expert has to manually inspect. Slightly increasing the motif length to k = 10 leads to an unfeasible amount of 4 10 % 1,000,000 subsequences per position in the POIM.
In this paper, we tackle the problem of obtaining motifs from the output of an SVM via the use of POIMs from a different perspective. In a nutshell, our approach is the other way round: we propose a probabilistic framework to reconstruct, from a given motif, the POIM that is the most likely to be generated by the motif. By subsequently minimizing the reconstruction error with respect to the truly given POIM, we can in fact optimize over the motif in order to find the one that is the most likely to have generated the POIM at hand. The latter poses a substantial numerical challenge due to the extremely high dimensionality of the feature space. The main contributions of this work can be summarized as follows: 1. Advancing the work of [15] on positional oligomer importance matrices (POIMs), we propose a novel probabilistic framework to finally go the full way from the output of a state-of-the-art WD-kernel SVM via POIMs to the relevant motifs truly underlying the SVM predictions.
2. To deal with the sheer exponentially large size of the feature space associated with the WD kernel, we propose a very efficient numerical framework based on numerous speed-ups such as bit-shift operations, highly efficient scalar multiplications as well as advanced sequence decomposition techniques, and provide a free open-source implementation thereof, which is available at https://github.com/mcvidomi/poim2motif.git.
3. Our approach is able to even find overlapping motifs consisting of up to hundreds of nucleotides, while previous approaches are limited to either comparably short or contiguous motifs. 4. We demonstrate the efficiency and efficacy of our approach on both synthetic data sets as well as a human splice data set, evaluated by means of the JASPAR database [21].
The paper is structured as follows: after reviewing the traditional approach of obtaining a POIM from a trained SVM model, we introduce the proposed probabilistic methodology-motifPOIM-for approximately determining the motif underlying the observed POIM at hand. Following this, we propose a numerical framework for solving the corresponding nonconvex optimization problem by the use of efficient sequence computation techniques such as bit shifts. We evaluate the proposed methodology empirically both on controlled synthetic data as well as real-world human splice data. Finally we conclude the paper and discusses starting points for future work.

Methods
After, firstly defining the weighted degree kernel, we briefly review the positional oligomer importance matrices (POIMs) and then describe our novel approach for extracting motifs from POIMs. Memory footprint for POIMs of oligomer length k. Note that the plot is in semi-logarithmic scale and thus showing an exponential growth for increasing oligomer length rendering a direct approach incomputable for even small k ! 12.
breaks two DNA sequences x and x 0 of length L into all possible subsequences of length l L starting at position j, denoted by x[j] l and x 0 [j] l , respectively. The kernel value κ(x, x 0 ) is then obtained by counting the number of matching subsequences, the so-called positional oligomers (POs), when traversing the positions j = 1, . . ., L − l + 1. Equivalently, we may represent the WD kernel by the corresponding binary feature embedding F, with κ(x, where each entry of F(x) represents a valid positional oligomer y of length l 2 {1, . . ., k} starting at position j 2 {1, . . ., L−l + 1}. A WD-kernel SVM then simply fits the parameter w of the linear model s(x): = hw, F(x)i, which can, more concisely, be expressed as since F(x) is inherently sparse (only the entries in F(x) corresponding to the oligomers y = x[i] l with l 2 {1, . . ., k} and i 2 {1, . . ., L−l + 1} are non-zero). Let S = {A, C, G, T} be the DNA alphabet and X $ UðS L Þ be a uniformly distributed random variable with values in S L and let x 2 S L be a realization thereof. For any positional k-mer The POIM of order k is then defined as the tuple Q Q k : = (Q k, y, j ) (y, j)2S k × {1, . . ., L−k + 1} . See Fig 2 for an illustration of a POIM of degree k = 2. We may interpret Eq (3) as a measure for the contribution of the positional oligomer (y, j) to the SVM prediction function s because a high value of w (y, j) , by Eq (2), implies a strong contribution to the prediction score s(x) if and only if y = x[j] k . We can very well visualize POIMs in terms of heatmaps as illustrated in Fig 2, from which we may obtain the most discriminative features by manual inspection.
As a first step towards a more automatic analysis of POIMs, [22] propose an extension of the POIM method, the so-called differential POIM, which aims to identify the most relevant motif lengths as well as the according starting positions. Formally, the differential POIM O is defined as a k × L matrix O: = (O l, j ) with entries O l;j :¼ where q l;j max :¼ max y2S l jQ l;y;j j: We can interpret O l, j as an overall score for the general importance of the oligomers of length l starting at position j.

Extracting Motifs by Mimicking POIMs
In this section, we introduce the proposed motifPOIM methodology for extraction of motifs from POIMs. In a nutshell, it is based on associating each candidate motif by a probability of occurrence at a certain location-which we call probabilistic positional motif (PPM)-and then (re-)construct from each PPM the POIM that is the most likely to be generated from the candidate PPM, which we call motifPOIM. The final motif is obtained by optimizing over the candidate motifs such that the reconstruction error of the motifPOIM with respect to the truly given POIM is minimized. See Fig 4 for an illustration.
To this end, let us formally define the PPM as a tuple m k : = (r, μ, σ), where r 2 R 4Âk and m; s 2 R. We think of m k as a candidate motif with PWM r and estimated starting position μ. The variable σ encodes the uncertainty in the location of the motif and can be thought of a standard deviation of the location of the motif. Under this probabilistic model, we define, in analogy to the SVM weight vector w occurring in Eq (2), a motif weight vector v v(m k ) with for any positional k-mer (z, i) 2 S k × {1, . . ., L−k + 1}. Consequently, we define in analogy to Eq (2) a function By means of the above function, we can construct, from a PPM as defined above, a POIM R R(m k ) with entries R y;j ðm k Þ :¼ E½ sðX jm k ÞjX ½j k ¼ y À E½ sðX jm k Þ: ð6Þ Our overall aim is, by optimizing over the motifPOIM R, to approximate the original POIM (cf. also the illustration in the introduction, given by Fig 4). An interesting fact here is that, since computing motifPOIMs for longer PPMs (m k , k > 5) is computationally expensive, we may use motifPOIMs of small ordersk 2 f2; 3g, although, this is no restriction of the motif length, as we model a PPM of length k !k as a number of D overlapping SubPPMs, D :¼ k Àk þ 1 with lengthk k. We define the SubPPMs analogous to PPMs as tuples m d ðm k ;kÞ :¼ ðr;m; sÞ; 8 d ¼ 0; . . . ; D À 1 withm :¼ m þ d and the sub-matrixr 2 R 4Âk of r starting with column d.
The basic idea is illustrated in Fig 5, where we divide a PPM into a set of SubPPM. Instead of computing an motifPOIM for the PPM, we now compute a set of D motifPOIMs for the smaller overlapping SubPPMs.

Numerical Methods
In this section, we introduce an efficient numerical framework for the extraction of motifs from POIMs by mathematical optimization. The core idea is to determine a motif m k with an according motifPOIM R(m k ) that approximates the original POIM Q k . To this end, let us introduce some notation. Let K & N be the set of all motif lengths to be considered and k max ¼ max k2K k the maximum length. The vector T 2 N k max where T k ; k 2 K is the given number of PPMs of length k. For example, when K ¼ f2; 4; 10g and T = (0, 6, 0, 2, 0, 0, 0, 0, 0, 2), then the goal is to find 6 PPMs of length 2, 4 PPMs of length 4, and 2 PPMs of length 10. Our optimization method is as follows: given the set K and the vector T, we randomly initialize the PPMs m k;t t ¼ 1; . . . ; T k ; k 2 K and generate a set of motifPOIMs for the SubPPMsm d ðm k ;kÞ; d ¼ 0; . . . ; D À 1. The optimization variables are all T k ; k 2 K PPMs. For obtaining the priorities of the PPMs we weight the PPMs by l k;t ; t ¼ 1; . . . ; T k ; k 2 K and additionally optimize over the weights. Hence, the optimization variables are: • PPM m k;t ¼ ðr k;t ; m k;t ; s k;t Þ; t ¼ 1; . . . ; T k ; k 2 K, where m k;t 2 f1; . . . ; L À k þ 1g; A PPM generates a motifPOIM, which is given by the sum of D motifPOIMs generated by its SubPPMs. The sum of the weighted motifPOIMs, λ k, t R(m k, t ), t = 1, . . ., T k , should estimate the POIM Qk for each k 2 K. The optimization problem is now that of minimizing the distance between the sum of the motifPOIMs and the original POIM, which leads to a non-convex optimization problem with the following objective function: where Z ¼ ðm k;t ; l k;t ;kÞ t¼1;...;T k ;k2K : The associated constrained non-linear optimization problem is thus as follows: min ðm k;t ;l k;t Þ t¼1;...;T k ;k2K f ðZÞ subject to where W 2 R þ . Note that for the sake of optimization efficiency we relax the integer constraint on motifs start positions in the sense that we optimize over positive real numbers. The objective function f(η) is defined on the compact set U, since all parameters are defined in a closed and bounded, convex space. Consequently, if U is not empty, f(η) is a continuously differentiable function, since its conforming parts, that is, the Gaussian function and the product of the PWM entries, all are continuously differentiable. Thus the global minimum of the optimization problem Eq (8) is guaranteed to exist. Due to the non-convex nature of Eq (8), however, there may exist multiple local minima.

Efficient Computation
To allow an efficient numerical optimization of Eq (8), we first translate the motifPOIM formula Eq (6) in another, equivalent form, similar as in [11]. To this end, note that the expected value of sðX jm k Þ for the given weight vector v(m k ) and a random variable X 2 S L is given by: Hence the conditioned expectation is almost equivalent to Eq (9), except the probability term that is given by the conditioned probability conditioned that y is at position j: We now consider this probability term and its affect on the summation in Eq (6)). To this end, we introduce the following notation as in [11]. Definition 1 Two positional oligomers (z, i) and (y, j) of length l and k are independent if and only if they do not share any position; in this case we write ðy; jÞ⊀ðz; iÞ and (y, j)0(z, i) otherwise (i.e., when they are dependent). If they are dependent and also agree on all shared positions we say they are compatible and we write ðy; jÞ≾ðz; iÞ (and ðy; jÞ≴ðz; iÞ if they are not compatible).
According to the cases discussed in the above definition, the conditioned probability term can take the following values: where c is the number of shared and compatible positions of two positional oligomers: Taken the case ðy; jÞ⋨ðz; iÞ, the probability terms in the motifPOIM formula (6) subtract to zero, so that the positional oligomer (z, i) is not considered in the sum R y, j (m k ). Hence, in order to compute R y, j (m k ), it is sufficient to sum over two positional oligomer sets, where one contains all (z, i) with ðy; jÞ≾ðz; iÞ, I ≾ ðy;jÞ , and the others contains all (z, i) with ðy; jÞ⋨ðz; iÞ, I ⋨ ðy;jÞ : where I ðy;jÞ :¼ fðz; iÞ 2 S jyj Â f1; . . . ; L À jyj þ 1gjðy; jÞ ðz; iÞg and 2 f≾; ⋨g: Numerical Speed-ups. In addition to the speed-up achieved by the above re-formulation of the problem, we can additionally save time in the motifPOIM computation by exploiting bit shift operations as follows. With the help of the dependence sets I ≾ ðy;jÞ and I ⋨ ðy;jÞ we know all the dependent and compatible positional oligomers of a single positional oligomer (y, j). The core idea leading to the numerical speed-up is as follows: In each (y, j) we consider the two dependence sets. However, the fact is that an oligomer y has completely the same dependent and compatible oligomers z at each position in the sequence. Thus, a dependent set containing all dependent and compatible z of y is the same for all positions i = 1, . . ., L. The trick is to generate a dependency matrix A (see Eq 13) for a single y once, which can then be use at every sequence position without the need of recalculation. This matrix contains the probability terms of the motifPOIM formula since they do not change for y over the positions, saving at least |S k |(2(k−1) + 1) complex computations per position. For each position j we now create a weight matrix C j of same size, which contains all the weights v (z, i) (m k ) for the entries in A for a specific position j. Finally, the dot product of A and C j replaces the long motifPOIM formula (12) and we achieve a faster computation speed.
Due to the fact that dependent positional oligomers overlap each other, a dependent k-mer z of the k-mer y could have a maximal distance of k−1 from y. Hence, we have to consider the oligomers z with a maximum distance of k−1 position next to both sites of y and the position of y itself. That yields to the dependence set: The dependent matrix AðyÞ is defined on I y ðkÞ as a matrix of size 4 k × (2(k−1) + 1) and contains the positional oligomer probability terms of the motifPOIM formula as entries: Furthermore, we create a weight matrix C j of same size as A, which contains all weights v (z, i) (m k ) of the entries in A for a specific position j, so that the dot product of C j and A replaces the sums of the motifPOIM formula (12), which speeds up computations considerably. This fact is stated in the following theorem. Theorem 2 Let y be a k-mer, m k the PPM, v(m k ) the motif weight vector, and A the dependent matrix of y. Introducing a matrix C j ðyjm k Þ, which is defined on I y ðkÞ as a matrix of same size S k × (2(k−1) + 1) as AðyÞ and contains all weights of the positional oligomers in AðyÞ for the motifPOIM position j as ( then R y;j ðm k;t Þ ¼ hAðyÞ; C j ðyjm k;t Þi: ð15Þ Proof Substituting the last equation into Eq (15) gives us Eq (12). The case distinction in Theorem 2 is made since some dependent positional oligomers are placed outside the possible sequence positions. Suppose we compute the weight matrix C j z;i ðyjm k Þ for y = ACT at the sequence position j = 1. Then there are overlapping 3-mers such as, for example, (AAA, −1) and (TAC, 0), that not exist in the sequence at all. Thus, they are weighted by zero.
Together with the fact that we implement the algorithm in the Python programming language and use the numpy library for computations, calculations are very fast by using the algorithm shown in Table 1.
Another step towards an efficient computation is as follows: The probability distribution over the PPM with starting position μ in the sequence is a Gaussian function. One characteristic of this function is that 99, 7% of the starting positions are within the confidence interval [μ −3σ, μ + 3σ]. Hence, it suffices to compute the motifPOIM entries for the integer values in the confidence interval and set the other motifPOIM entries to zero. Let I CO be the set containing all positional oligomers of the confidence interval. A summary is given in Table 1. For each k 2 K a motifPOIM R is constructed (see Theorem 2) and the residual between the aforementioned motifPOIM and the SVM POIM Q k of matching degree k is added to the variable iteratively computing the function value.

Empirical Analysis
In this section, we analyze our proposed mathematical model Eq (8) empirically. After introducing the experimental setup, we evaluate our approach on a synthetic data set where we fully control the underlying ground truth. Finally, we investigate our model on a real human splice data set and compare our results to motifs contained in the JASPAR database [21].

Overall Experimental Setup
For SVM training, we use the shogun machine-learning toolbox [12] (available from http:// www.shogun-toolbox.org/), which contains a C++ implementation of a WD-kernel SVM that is specially designed for large-scale sequence learning problems and provides interfaces to ⪼matlab, Python, R, and java. The regularization constant C of the SVM and the degree d of the weighted-degree kernel are set to C = 1 and d = 20, which are proven default values. After SVM training, the POIM Q is generated through the Python script compute_poims.py included in the shogun toolbox. The Python framework obtains the trained SVM and a (maximal) POIM degree k poim = 12 as parameters and returns all POIMs, i.e., the differential POIM, the maximum POIM, and the regular POIMs Q l , l = 1, . . ., k poim . We set k poim = 7 in synthetic experiments and k poim = 6 in real experiments because of memory requirements (storing all POIMs up to a degree of 10 requires about 4 gigabytes of space). Note that this is no restriction as our modified optimization problem Eq (8) requires POIMs of degree two or three only. Nevertheless, POIMS of higher degree than three can be provide additional useful information since they contain prior information about the optimization variables, which we use for a proper initialization: For efficient optimization of our highly non-convex optimization problem Eq (8), an appropriate initialization of the optimization variables is mandatory. Thus, we use the differential POIM (defined in Eq (4)) as indicator for extracting the area of interest: we search for points of accumulation of high scoring entries, from which we manually estimate the number of motifs as well as their length and starting position. Thereby we take the whole interval of all highly scoring positions as motif length, where the start position is the first position where all k-mers show a substantial increase in their scores. Once the motif interval is estimated, we select the leading nucleotide from the highest scoring column entry within the interval from the corresponding POIM and initialize the respective PWM entry with a value of 0.7 and 0.1 for non-matches. Indeed, we found this approach to be more stable and reliable than using random initialization. These parameters serve as initialization for our non-convex optimization problem Eq (8). To compute a motif from the computed POIMs, we employ the L-BFGS-B Algorithm [23], where the parameters λ and σ both are initialized as 1. An illustration of the so-obtained experimental pipeline is shown in Fig 7. As a measure of the motif reconstruction quality (MRQ), we employ the same score as in JASPAR SPLICE [24]. When comparing equally sized sequences, this scoring reduces to the simple formula

Synthetic Data Experiments
We first evaluate the proposed methodology on synthetically generated data, where we have full access to the underlying ground truth. This experiment aims successive at demonstrating the ability of our method in finding

Results
Results on the unmutated data set S 1 . The results of the realization of this synthetic experiment using training subsets of size n from the base sample S 1 are shown in Fig 8, for various values of n. We can observe from the figure that the reconstruction error decreases as a function of the sample size n already for n = 100. The corresponding motif/PWM computed by our approach correctly identifies the true underlying motif sequence as the most likely path in the PWM. Results on the mutated data set S 2 . Furthermore, we realize the very same experiment using the sample S 2 S p 2 for various levels of mutations. The results are shown in Fig 9. We can observe that, up to a mutation level of 60%, we correctly identify the true underlying motif as being the sequence with the highest probability in the PWM. For more than 70% of mutations in the training data, the performance drops severely. This effect however, is due to a drop of classification performance of the corresponding SVM as can be seen in Table 2. Table 2 highlights results for an exemplary sample for each level of mutation, to relate SVM classification error to mutation level, and also random PWM initialization strategy (30 runs) to greedy initialization.
Results with overlapping motifs, i.e., data set S 3 . To validate our method for overlapping motifs, we also experiment on the sample S 3 . The differential POIM and the POIM of degree two resulting from our experimental pipeline are shown in Fig 10(a) and 10(b). Interestingly, the two accumulations of entries with high scores indicate that the POIM includes two overlapping motifs. The investigation of these accumulations is slightly more involved than in the The results of the synthetic experiment for varying SVM training sample size n using non-mutated sequences of length 30. As expected, the motif is better reconstructed the more training sequences are used for SVM training. However, as can be seen in the figure, the true motif is picked up early, a tendency that we claim to the robustness of our approach. doi:10.1371/journal.pone.0144782.g008 SVM2Motif experiment above: we observe, for each motif length l > 1, 11−l + 1 subsequent cell entries having an extraordinary high score as indicated by light blue, green, orange, or red colors (e.g., length l = 7, we observe a block of 5 subsequent entries). Thus, the first discriminative motif starts at position 5 and consists of 11 nucleotides. We can observe a drop at position 10 (notice the dark blue color) indicating the starting position of the second motif. Altogether, the figure indicates that the optimal model parameters are: K ¼ f11; 15g, T 11 = 1, T 15 = 1, where μ 11, 1 = 5 and μ 15, 1 = 10. Furthermore, Fig 10 (c) and 10(d) show the PWMs resulting from our optimization approach. We can observe that, although the two motifs are overlapping, both motifs are identified correctly. As for the previous experiment, we also report on the optimal parameters and execution time, shown in Table 3, from which we observe an increase in computation time by a factor of about 5, when contrasted to the runtimes measured on the samples S 1 and S 2 . This can be attributed to the presence of multiple motifs in S 3 , each having an increased length of 11 and 16 nucleotides, respectively, instead of just 6 nucleotides as in the sample S 1 , leading to an increase in computational complexity. We illustrate the robustness of our approach by plotting the reconstruction errors vs. the mutation level for a fixed amount of training samples. We observe that even for high mutation levels (e.g. 50%) the motif reconstruction quality (MRQ) is sufficiently good to reconstruct the true underlying motif correctly. Results for a very long motif, i.e., data set S 4 . At last, we investigate whether our approach is able to find a very long motif, as contained in the sample S 4 . Due to the huge number of variables and the immense size of the POIM, we divide the POIM into 10 smaller conforming parts, in each searching for a motif of length 20. Fig 11 shows the results. We can observe that combination of the 10 computed PWMs reconstructed the real motif adequately.
We can summarize that the experiments on synthetic data demonstrate the ability of our approach to robustly extract the true underlying-possibly overlapping-motifs from noisy data sets even for large motif sizes.

Application to Human Splice Data
In this section, we evaluate our methodology on a human splice data set, which we downloaded from http://www.fml.tuebingen.mpg.de/raetsch/projects/lsmkl. The available human splice Table 2. Experimental results for a fixed sample S 1 with no mutation (p = 0) and S 2 with various levels of mutation (p = 0.1, . . ., 1). The proposed greedy initialization of the PWMs is more reliable and stable than randomly initialized PWMs For verifying our results we use the JASPAR database [21] (available at http://jaspar.genereg. net), which provides us with a collection of important DNA motifs and also contains a splice site database. As a measure of the motif reconstruction quality (MRQ), we use the JASPAR SPLICE score [24]. Note that real DNA sequences may contain non-polymorphic loci, which is why such a motif is not discriminative and we may thus not expect the SVM to identify this locus. We thus catch this special case and place this positional oligomer in the solution sequence. We apply the full experimental pipeline described in the previous section to the splice data, i.e., we first train an SVM, then generate the POIM and the differential POIM, from which we reconstruct a motif by our motifPOIM optimization approach.
We compare our approach against the publicly available motif finder MEME (Multiple EM for Motif Elicitation, [25]), a well known motif discovering tool for DNA sequences, included in the MEME suite, which is a collection of tools for motif discovering and sequence analyzing. The user can specify the number of motifs as well as the length by either the exact length or a range specification. MEME expects the input sequences in FASTA file format. For comparison, we conducted three experiments with varying numbers of positive samples. For support vector machine training, we double the number of samples by filling in negative ones. We chose 400 positive samples (computation time *1min), which is the maximum amount of sequences when using the MEME online tool, 700 positive samples (*10min), which is the maximum recommended amount when using the MEME locally, and 2000 positive samples (*12h). We compare the found motifs against the true splice site motif, taken from the JASPAR database with the JASPAR consensus score. Fig 12 shows the preliminary results for 400 samples in terms of the differential POIM and corresponding POIM of degree 2, shown for the entire sequence (see Fig 12 (a) and 12(c), respectively) as well as zoomed in for the "interesting" positions 36-76 of the sequence (see Fig  12 (b) and 12(d)). According to Fig 12 (b), the largest entries correspond to a 3-and 2-mer that Table 3. Execution times and optimal parameters for the synthetic data set S 3 with overlapping motifs. can be found at position 56 and 57, respectively. A significant increase of the score is recognizable for all k-mers at position 45, which is enhanced at position 46. The last largest entry for a 6-mer is found at position 58, which corresponds to the last largest entries of 4-mers at position 60 and 2-mers at position 62, from which we conclude that the discriminative motif starts at position 45 and ends at position 63. Thus, the motif we are searching is expected to have a length of 19 nucleotides, which we use as an initialization for our motifPOIM approach. We also account for non-polymorphic loci and find that the nucleotides A and G appear in all DNA sequences of the data set, always at the positions 60 and 61, respectively. We thus place them in the final PWM with a probability of 100 percent. The final results for 400 positive samples, are shown in Fig 13, where the true underlying motif taken from the JASPAR splice database is shown in Fig 13 (a), while the motif computed by our approach is shown in Fig 13 (b) and the motif found by MEME is shown in Fig 13 (c). The execution times and the optimal parameters found by the L-BFGS-B solver are shown in Table 4. For all experiments, the start position is around the initialization value of 45, with a small variance of up to σ = 0.44. The great difference in the optimal function value is caused by the experiment dependent POIM scorings, for example in the POIM of degree 2 of the first experiment we observe a maximal score value of 4 (see Fig 12), where the maximal value in the third experiment was 5. Furthermore, from Table 4, we observe moderate execution times of up to 22.8 seconds. From the resulting motif, shown in Fig 13 (b), we observe a striking accordance with the true motif as evidenced by a high consensus score of 98.6. However, the motif found by MEME, shown in Fig 13 (c), which has a length of 21 nucleotides, has a lower consensus score of 94, 5 although there exists a high similarity to the true motif. The reason is that the motif found by MEME starts 2 positions and ends 1 position before the true motif. The results for 700 and 2000 positive training samples, are shown in Figs 14 and 15, respectively. Here, the results for our approach show similar high consensus scores. MEME, found in both experiments a 21 nucleotides long motif starting 4 positions before the true motif. To get more insights, we fixed the motif length for both methods to 20 nucleotides, which corresponds to the underlying ground truth taken from the JASPAR database. The results are shown in Table 5. Again we observe high consensus scores for the motif computed with our method. Interestingly, the MEME motif finder suffers a severe loss of performance for the first two experiments, achieving consensus scores of around 90 for the last experiment, while the performance of our approach remains comparable. The results show, that our approach is in principle able to infer motifs of high quality and more robust than MEME. Moreover, our approach easily handles sample-sizes beyond MEME.

Conclusion and Discussion
We have developed a new methodology to extract long, overlapping and mutated motifs from trained support vector machines. Putting forward the work of [15] on positional oligomer importance matrices (POIMs), the proposed novel probabilistic framework extracts from the output of a WD-kernel SVM the relevant motifs. To deal with the exponentially large size of the feature space associated with the SVM weight vector and the corresponding POIM (". . . we realize that the list of POs can be prohibitively large for manual inspection." [15], page 8), we proposed a very efficient numerical framework. The results clearly illustrate the power of our approach in discovering discriminative motifs. In all synthetic data tasks, the hidden motifs could be found and almost perfectly reconstructed. For the human splice site experiments, we recovered known motifs up to a very high precision of 98.39% as compared to the JASPAR Splice data base. A thorough investigation of the association between the found motif and its biological function can be subject to further research.
For practical purposes, a Python framework is available at https://github.com/mcvidomi/ poim2motif.git. We have implemented the core algorithms as an add-on to the Python interface of the Shogun Machine Learning Toolbox. It is not only an established machine-learning framework within the bioinformatics community, moreover, it already incorporates the possibility to extract positional-oligomer importance matrices of trained support vector machines with a WD-kernel. Future work will extend our approach to an automatic extraction of the initialization variables, that is, the number of motifs, their length and starting positions. Ultimately, the usage by experimentalists will determine the utility of this approach and govern the direction of further extensions. A core issue might be the extension to other interesting kernels,  such as, e.g., spectrum kernels [26], multiple kernels [27][28][29][30][31][32][33], other learning methods [34,35], or learning settings [36][37][38].