^{1}

^{2}

^{3}

^{¤}

^{4}

^{3}

^{2}

^{*}

Current address: Department of Computer Science, ETH, Zürich, Switzerland.

The authors have declared that no competing interests exist.

The increasing wealth of biological data coming from a large variety of platforms and the continued development of new high-throughput methods for probing biological systems require increasingly more sophisticated computational approaches. Putting all these data in simple-to-use databases is a first step; but realizing the full potential of the data requires algorithms that automatically extract regularities from the data, which can then lead to biological insight.

Many of the problems in computational biology are in the form of prediction: starting from prediction of a gene's structure, prediction of its function, interactions, and role in disease. Support vector machines (SVMs) and related kernel methods are extremely good at solving such problems

The simplest form of a prediction problem is binary classification: trying to discriminate between objects that belong to one of two categories—positive (+1) or negative (−1). SVMs use two key concepts to solve this problem: large margin separation and kernel functions. The idea of large margin separation can be motivated by classification of points in two dimensions (see

The decision boundary divides the space into two sets depending on the sign of

The region between the two thin lines defines the

We modified the toy dataset by moving the point shaded in gray to a new position indicated by an arrow, which significantly reduces the margin with which a hard-margin SVM can separate the data. (A) We show the margin and decision boundary for an SVM with a very high value of _{i}

Instead of the abstract idea of points in space, one can think of our data points as representing objects using a set of

In the post-processing step, the pre-mRNA is transformed into mRNA. One necessary step in the process of obtaining mature mRNA is called

The sequence logo

The polynomial kernel of degree 1 leads to a linear separation (A). Higher-degree polynomial kernels allow a more flexible decision boundary (B,C). The style follows that of

For large values of

The domain knowledge inherent in any classification task is captured by defining a suitable kernel (i.e., similarity) between objects. As we shall see later, this has two advantages: the ability to generate nonlinear decision boundaries using methods designed for linear classifiers; and the possibility of applying a classifier to data that have no obvious vector space representation; for example, DNA/RNA, or protein sequences, or protein structures.

Throughout this tutorial we are going to use an example problem for illustration. It is a problem arising in computational gene finding and concerns the recognition of splice sites that mark the boundaries between exons and introns in eukaryotes. Introns are excised from premature mRNAs in a processing step after transcription (see

The vast majority of all splice sites are characterized by the presence of specific dimers on the intronic side of the splice site: GT for donor and AG for acceptor sites (see

In the first part of the tutorial we are going to use real-valued features describing the sequence surrounding the splice site. For illustration purposes, we use only two features: the GC content in the exon and intron flanking potential

To evaluate the classifier performance, we will use so-called

All computational results in this tutorial were generated using the

In this section, we introduce the idea of linear classifiers. Support vector machines are an example of a linear two-class classifier. The data for a two-class learning problem consists of objects labeled with one of two labels; for convenience we assume the labels are +1 (positive examples) and −1 (negative examples). Let _{j}_{i}^{th} vector in a dataset _{i}_{i}_{i}

A key concept required for defining a linear classifier is the

The discriminant function

The hyperplane divides the space into two half spaces according to the sign of

Whenever a dataset such as is shown in

The so-called ^{2} is equivalent to maximizing the margin. (The set of formulas above describes a _{i}_{i}

In practice, data are often not linearly separable; and even if they are, a greater margin can be achieved by allowing the classifier to misclassify some points—see _{i}_{i}_{i}

The constant

The effect of the choice of

Using the method of Lagrange multipliers (see, e.g., _{i}

One can prove that the weight vector _{i}_{i}

The _{i}_{i}

Note that the dual formulation of the SVM optimization problem depends on the inputs _{i}

In many applications, a nonlinear classifier provides better accuracy. And yet linear classifiers have advantages, one of them being that they often have simple training algorithms that scale well with the number of examples

There is a straightforward way of turning a linear classifier nonlinear, or making it applicable to nonvectorial data. It consists of mapping our data to some vector space, which we will refer to as the feature space, using a function

Note that

We have seen above (Equation 5) that the weight vector of a large margin separating hyperplane can be expressed as a linear combination of the training points, i.e.,

The representation in terms of the variables _{i}_{i}_{i}_{j}

If the

Real-valued data, i.e., data where the examples are vectors of a given dimensionality, are common in bioinformatics and other areas. A few examples of applying SVM to real-valued data include prediction of disease state from microarray data (see, e.g.,

The two most commonly used kernel functions for real-valued data are the polynomial and the Gaussian kernel. The ^{linear}, is the

The degree of the polynomial kernel controls the flexibility of the resulting classifier (

The second very widely used kernel is the ^{2} is much larger than

As seen from the examples in

Results on a much larger sample of the two dimensional splice site recognition dataset are shown in

Kernel | auROC |

Linear | 88.2% |

Polynomial |
91.4% |

Polynomial |
90.4% |

Gaussian |
87.9% |

Gaussian |
88.6% |

Gaussian |
77.3% |

Accuracy is measured using the area under the ROC curve (auROC) and is computed using 5-fold cross-validation (cf. the section Running Example: Splice Site Recognition for details).

So far we have shown how SVMs perform on our splice site example if we use kernels based only on the two GC content features derived from the exonic and intronic parts of the sequence. The small subset of the dataset shown in ^{ℓ} features).

The above idea is realized in the so-called ^{ℓ} dimensional feature space. Each dimension corresponds to one of the |Σ|^{ℓ} possible strings ^{ℓ} entries of the mapping Φ, which would be unfeasible for nucleotide sequences with ℓ≥10 or protein sequences with ℓ≥5. Faster computation is possible by exploiting the fact that the only ℓ-mers that contribute to the dot product (in Equation 11) are those that actually appear in the sequences. This leads to algorithms that are linear in the length of the sequences instead of the exponential |Σ|^{ℓ} computation time (see, e.g.,

If we use the spectrum kernel for the splice site recognition task, we obtain considerable improvement over the simple GC content features (see _{d}

Kernel | auROC |

Spectrum ℓ = 1 | 94.0% |

Spectrum ℓ = 3 | 96.4% |

Spectrum ℓ = 5 | 94.5% |

Mixed spectrum ℓ = 1 | 94.0% |

Mixed spectrum ℓ = 3 | 96.9% |

Mixed spectrum ℓ = 5 | 97.2% |

WD ℓ = 1 | 98.2% |

WD ℓ = 3 | 98.7% |

WD ℓ = 5 | 98.9% |

The kernels mentioned above ignore the position of substrings within the input sequence. However, in our example of splice site prediction, it is known that there exist sequence motifs near the splice site that allow the spliceosome to accurately recognize the splice sites. While the spectrum kernel is in principle able to recognize such motifs, it cannot distinguish where exactly the motif appears in the sequence. However, this is crucial in deciding where exactly the splice site is located. And indeed, Position Weight Matrices (PWMs) are able to predict splice sites with high accuracy. The kernel introduced next is analogous to PWMs in the way it uses positional information, and its use in conjunction with a large margin classifier leads to improved performance _{[l:l+d]} is the substring of length _{d}

The

Note that since the polynomial and Gaussian kernels are functions of the linear kernel, the above-described sequence kernels can be used in conjunction with the polynomial or Gaussian kernel to model more complex decision boundaries. For instance, the polynomial kernel of degree

Because of the importance of sequence data and the many ways of modeling it, there are many alternatives to the spectrum and weighted degree kernels. Most closely related to the spectrum kernel are extensions allowing for gaps or mismatches

Sequence similarity has been studied extensively in the bioinformatics community, and local alignment algorithms like BLAST and Smith-Waterman are good at revealing regions of similarity between proteins and DNA sequences. The statistics produced by these algorithms do not satisfy the mathematical condition required of a kernel function. But they can still be used as a basis for highly effective kernels. The simplest way is to represent a sequence in terms of its BLAST/Smith-Waterman scores against a database of sequences

Probabilistic models, and Hidden Markov Models in particular, are in wide use for sequence analysis. The dependence of the log-likelihood of a sequence on the parameters of the model can be used to represent a variable-length sequence in a fixed dimensional vector space. The so-called Fisher-kernel uses the sensitivity of the log-likelihood of a sequence with respect to the model parameters as the feature space

This tutorial introduced the concepts of large margin classification as implemented by SVMs, an idea that is both intuitive and also supported by theoretical results in statistical learning theory. The SVM algorithm allows the use of kernels, which are efficient ways of computing scalar products in nonlinear feature spaces. The “kernel trick” is also applicable to other types of data, e.g., sequence data, which we illustrated in the problem of predicting splice sites in

In the rest of this section, we outline issues that we have not covered in this tutorial and provide pointers for further reading. For a comprehensive discussion of SVMs and kernel methods, we refer the reader to recent books on the subject

Large margin classifiers are known to be sensitive to the way features are scaled (see, for example

Many datasets encountered in bioinformatics and other areas of application are unbalanced, i.e., one class contains a lot more examples than the other. For instance, in the case of splice site detection, there are 100 times fewer positive examples than negative ones. Unbalanced datasets can present a challenge when training a classifier, and SVMs are no exception. The standard approach to addressing this issue is to assign a different misclassification cost to each class. For SVMs, this is achieved by associating a different soft-margin constant to each class according to the number of examples in the class (see, e.g.,

A question frequently posed by practitioners is “which kernel with which parameters should I use for my data?” There are several answers to this question. The first is that it is, like most practical questions in machine learning, data-dependent, so several kernels should be tried. That being said, one typically follows the following procedure: try a linear kernel first, and then see if we can improve on its performance using a nonlinear kernel. The linear kernel provides a useful baseline, and in many bioinformatics applications it is hard to beat, in particular if the dimensionality of the inputs is large and the number of examples small. The flexibility of the Gaussian and polynomial kernels can lead to overfitting in high-dimensional datasets with a small number of examples, such as in micro-array datasets. If the examples are (biological) sequences, then the spectrum or the WD kernel of relatively low order (say ℓ = 3) are good starting points if the sequences have varying or fixed length. Depending on the problem, one may then try the spectrum kernel with mismatches, the oligo kernel, the WD kernel with shifts, or the local alignment kernel.

In problems such as prediction of protein function or protein interactions, there are several sources of genomic data that are relevant, each of which may require a different kernel to model. Rather than choosing a single kernel, several papers have established that using a combination of multiple kernels can significantly boost classifier performance

When selecting the kernel, its parameters, and the soft-margin parameter

We have focused on kernels for real-valued and sequence data; and while this covers many bioinformatics applications, often data is better modeled by more complex data types. Many types of bioinformatics data can be modeled as graphs, and the inputs can be either nodes in the graph, e.g., proteins in an interaction network, or the inputs can be represented by graphs, e.g., proteins modeled by phylogenetic trees. Kernels have been developed for both scenarios. Researchers have developed kernels to compare phylogenetic profiles modeled as trees

The popularity of SVMs has led to the development of a large number of special-purpose solvers for the SVM optimization problem ^{light}

Another class of software includes machine learning libraries that provide a variety of classification methods and other facilities such as methods for feature selection, preprocessing, etc. The user has a large number of choices, and the following is an incomplete list of environments that provide an SVM classifier: ^{light}

A repository of machine learning open source software is available at

We would like to thank Alexander Zien for discussions, and Nora Toussaint, Sebastian Henschel, and Petra Philips for comments on the manuscript.