
The authors have declared that no competing interests exist.

Conceived and designed the experiments: AYK AMD. Performed the experiments: AYK. Analyzed the data: AYK AMD. Wrote the paper: AYK AMD. Designed and implemented the PairHMM model and related algorithms: VO. Designed and implemented ProfileHMM and related algorithms: FA RM. Implemented the specification parser with error reporting and some algorithms of the GHMM probabilistic model: IB.

Discrete Markovian models can be used to characterize patterns in sequences of values and have many applications in biological sequence analysis, including gene prediction, CpG island detection, alignment, and protein profiling. We present ToPS, a computational framework that can be used to implement different applications in bioinformatics analysis by combining eight kinds of models: (i) independent and identically distributed process; (ii) variable-length Markov chain; (iii) inhomogeneous Markov chain; (iv) hidden Markov model; (v) profile hidden Markov model; (vi) pair hidden Markov model; (vii) generalized hidden Markov model; and (viii) similarity-based sequence weighting. The framework includes functionality for training, simulation and decoding of the models. Additionally, it provides two methods to help with parameter setting: the Akaike and Bayesian information criteria (AIC and BIC). The models can be used stand-alone, combined in Bayesian classifiers, or included in more complex, multi-model, probabilistic architectures using GHMMs. In particular, the framework provides a novel, flexible implementation of decoding in GHMMs that detects when the architecture can be traversed efficiently.

Markov models of nucleic acids and proteins are widely used in bioinformatics. Examples of applications include

One approach to avoid rewriting code is to use a general-purpose system such as R

HTK and gHMM have the distinctive capability of working with continuous emission distributions; in other words, they can accept sequences of arbitrary floating-point numbers. HTK was designed for the speech recognition problem, but it can also be used to model biological sequences. However, it implements only HMMs and does not provide simulation of the models. The gHMM package is a C library providing implementations of HMMs, pair-HMMs, inhomogeneous Markov chains, and mixtures of probability density functions. The system includes a graphical user interface and provides Python wrappers for each probabilistic model, but it does not implement GHMMs.

HMMConverter and HMMoC are systems that contain skeleton implementations of HMMs, pair-HMMs, and a generalization of the HMM in which states may emit more than one symbol at a time. As a distinctive characteristic, both implement memory-efficient versions of the forward, backward, and Viterbi algorithms. However, they do not implement the general GHMMs traditionally applied in gene-finding systems

Finally, N-SCAN and Tigrscan are examples of systems which implement general, configurable GHMMs that can combine different probabilistic sub-models in states with a given duration probability distribution. However, they are targeted specifically for gene prediction, offering only a restricted set of probabilistic models in a fixed architecture designed for the gene-finding problem.

In this paper we present ToPS (Toolkit for Probabilistic models of Sequences), a framework for the implementation of discrete probabilistic models for sequence data. ToPS currently implements eight kinds of models: (i) independent and identically distributed process (i.i.d.); (ii) variable-length Markov chain

| Program | Input Format | Probabilistic Models | Simulation | Distinguishing Characteristics |
|---|---|---|---|---|
| HMMConverter | XML | HMM, pair-HMM, generalized HMM | NO | memory-efficient Viterbi, forward, backward |
| HMMoC | XML, C language | HMM, pair-HMM, triple-HMM, quad-HMM, generalized HMM | YES | memory-efficient Viterbi, forward, backward |
| gHMM | XML | HMM, inhomogeneous Markov chain, pair-HMM, mixture of probability density functions | YES | continuous emission, graphical user interface |
| HTK | XML | HMM | NO | continuous emission |
| Tigrscan | own language | GHMM | NO | does not provide Baum-Welch training |
| N-SCAN | XML | GHMM | NO | does not provide Baum-Welch training |
| ToPS | own language | HMM, pair-HMM, GHMM, variable-length Markov chain, inhomogeneous Markov chains, discrete i.i.d. models, SBSW | YES | model selection criteria (AIC and BIC), builds profile-HMM from alignment, efficient and general GHMMs |

The generalized version of HMMs in HMMoC and HMMConverter is different from the GHMMs as defined by Kulp

Tigrscan and N-SCAN implement GHMMs containing as sub-models weight arrays, maximum dependence decomposition, smoothed histograms, three-periodic Markov chains, and interpolated Markov models.

In this paper we describe the basic characteristics of ToPS and two examples of how to use it in practical problems: (i) a CpG island detector; (ii) a simple eukaryotic gene predictor.

The ToPS framework has been in intensive use by our research group in a wide variety of problems, including experimentation with null models

ToPS was developed with an object-oriented architecture, which is important for the integration of the models in a single framework. The ToPS architecture includes three main class hierarchies:

Many training algorithms contain parameters that control the dimensionality of the trained model. A typical example is a Markov chain, for which the user has to choose the value of the order parameter. Another example is the variable-length Markov chain, for which the user has to set a parameter that controls the pruning of the probabilistic suffix tree. Finding the best parameters can be a long and tedious task when performed by manually testing candidate values. To help the user find a good set of parameters, ToPS contains two model selection criteria that the user can specify with the training procedure:

Bayesian Information Criterion (BIC)

Akaike Information Criterion (AIC)
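The idea behind these criteria can be sketched outside ToPS as follows: fit maximum-likelihood Markov chains of increasing order and keep the order with the lowest criterion value. The sequence, candidate orders, and helper functions below are invented for illustration and are not part of the ToPS API.

```python
import math
from collections import defaultdict

def log_likelihood(seq, order):
    """Log-likelihood of `seq` under a maximum-likelihood Markov chain
    of the given order, with counts taken from the sequence itself."""
    ctx_counts, trans_counts = defaultdict(int), defaultdict(int)
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]
        ctx_counts[ctx] += 1
        trans_counts[(ctx, seq[i])] += 1
    return sum(c * math.log(c / ctx_counts[ctx])
               for (ctx, _), c in trans_counts.items())

def bic(seq, order, alphabet_size=4):
    """BIC = -2 log L + k log n, where k is the number of free
    parameters of an order-`order` chain over the alphabet."""
    n = len(seq) - order
    k = (alphabet_size ** order) * (alphabet_size - 1)
    return -2.0 * log_likelihood(seq, order) + k * math.log(n)

# An invented training sequence; the criterion balances fit against
# the exponential growth of parameters with the order.
seq = "ACGT" * 50 + "CG" * 100 + "AT" * 100
best_order = min(range(4), key=lambda order: bic(seq, order))
```

AIC differs only in the penalty term (2k instead of k log n), so the same loop applies with a one-line change.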

GHMMs are very flexible probabilistic models that can be integrated with other models to describe a complex architecture. The large majority of successful gene predictors use GHMMs as a basis to recognize particular gene structures

To implement an efficient decoding algorithm, many gene-finding systems use fixed GHMM architectures hard-coded in the program and embed restrictions of the model in order to allow efficient processing. This enables efficient decoding, but limits the architectures that can be described using GHMMs and, therefore, potentially limits their applicability.

ToPS was designed for general applicability, accepting arbitrary GHMM configurations. To do so, we introduced a methodology that automatically applies efficient decoding when the architecture allows it. This is achieved by using an adjacency graph to represent the transitions with probability greater than zero, and by taking advantage of the object-oriented architecture of the system:

ToPS uses a sparse graph implementation to benefit from the limited connectivity.

The automatic detection of the constant

The constant time lookup of the emission probabilities is achieved by the use of the object-oriented architecture: any probabilistic model implemented as a subclass of
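A Viterbi recursion driven by such sparse predecessor lists can be sketched as follows. This is a simplified stand-alone illustration, not the ToPS implementation; the two states, their emission tables, and all probabilities are invented.

```python
import math

# Transitions stored as sparse predecessor lists: for each state, only
# the incoming edges with probability greater than zero are kept, so
# the recursion never iterates over the full state set.
predecessors = {
    "CPG":    [("CPG", 0.999), ("NONCPG", 0.0001)],
    "NONCPG": [("NONCPG", 0.9999), ("CPG", 0.001)],
}
emission = {
    "CPG":    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "NONCPG": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
initial = {"CPG": 0.5, "NONCPG": 0.5}

def viterbi(seq):
    states = list(predecessors)
    column = {s: math.log(initial[s]) + math.log(emission[s][seq[0]])
              for s in states}
    backtrack = []
    for symbol in seq[1:]:
        new_column, pointers = {}, {}
        for s in states:
            # Examine only the non-zero incoming transitions of s.
            prev, score = max(
                ((p, column[p] + math.log(t)) for p, t in predecessors[s]),
                key=lambda item: item[1])
            new_column[s] = score + math.log(emission[s][symbol])
            pointers[s] = prev
        column = new_column
        backtrack.append(pointers)
    # Recover the most probable state path.
    best = max(column, key=column.get)
    path = [best]
    for pointers in reversed(backtrack):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With dense transitions the inner step costs O(|states|) per state; with sparse predecessor lists it costs only the in-degree of each state.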

In addition, we have developed another optimization technique for the case when some observation sub-model has probability zero of emitting specific words, a situation that is very common in gene-finding systems. In this case ToPS maintains an auxiliary linked list for each line of the Viterbi matrix (corresponding to the values of a given state for each position of the sequence), indicating the positions that have non-zero probability. When the models are factorable, the entries of the Viterbi matrix that generate a path with probability zero do not need to be examined. Typically, most positions have zero probability, so using the lists substantially reduces the running time.
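The effect of these auxiliary lists can be sketched as follows (an illustration with invented numbers, not the ToPS code): the duration scan over a GHMM state visits only the positions recorded as non-zero instead of the whole Viterbi row.

```python
def best_transition(row, nonzero, end, duration_prob):
    """Best score for a state segment ending at `end`, scanning only
    the recorded non-zero positions of the previous Viterbi row.
    `duration_prob(d)` is a hypothetical duration callback."""
    best = 0.0
    for start in nonzero:
        if start >= end:
            break
        best = max(best, row[start] * duration_prob(end - start))
    return best

# A typical gene-finding row: almost every entry is zero.
row = [0.0] * 1000
row[3], row[500] = 0.2, 0.05
nonzero = [i for i, p in enumerate(row) if p > 0.0]  # just [3, 500]

duration = lambda d: (0.99 ** d) * 0.01  # geometric duration
best = best_transition(row, nonzero, 800, duration)  # scans 2 cells, not 1000
```

Here the scan touches two entries instead of one thousand, which is the source of the speed-up when most cells hold probability zero.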

These techniques achieve similar performance to the

ToPS is a framework that helps describe and use discrete probabilistic models.

Square boxes represent data files, rounded boxes represent programs or manual processes. Each model may be described manually by editing a text file (1), or the train program can be used to estimate the parameters and automatically generate such a file from a training set (2). The files that contain the model parameters (in our example model1.txt, model2.txt and model3.txt) are used by the programs evaluate (3), simulate (4), bayes_classifier (5) and viterbi_decoding (6). The evaluate program calculates the likelihood of a set of input sequences given a model, the simulate program samples new sequences, the viterbi_decoding program decodes input sequences using the Viterbi algorithm, and the bayes_classifier classifies input sequences given a set of probabilistic models.
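The classification step performed by a program like bayes_classifier can be sketched as follows. The i.i.d. models and priors below are invented for illustration; the real program reads trained ToPS model files instead.

```python
import math

# Two invented i.i.d. models over the DNA alphabet and uniform priors.
models = {
    "cpg":     {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "non_cpg": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
priors = {"cpg": 0.5, "non_cpg": 0.5}

def log_likelihood(seq, model):
    """Log-likelihood of the sequence under an i.i.d. model."""
    return sum(math.log(model[symbol]) for symbol in seq)

def classify(seq):
    """Return the class with the highest posterior: the model whose
    log-likelihood plus log prior is largest."""
    return max(models, key=lambda m: log_likelihood(seq, models[m])
                                     + math.log(priors[m]))
```

The same argmax-of-posteriors rule generalizes to any number of models, which is how a set of trained models forms a Bayesian classifier.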

All models, scripts, configuration files and sequence data to reproduce the experiments are available through the ToPS homepage.

CpG islands (CGIs) are genomic regions of great interest due to their relation with gene regulation. These regions are commonly present in the promoter regions of genes. CGI sequences typically have high G+C content, with a significantly high frequency of Cs followed by Gs. CGIs are also related to DNA methylation, which typically occurs at C nucleotides. The presence of methylated DNA regions can inhibit the binding of transcription factors and therefore inhibit gene expression. Large-scale experiments to detect differentially methylated regions use a CGI list as a reference, which underscores the importance of producing high-quality CGI lists

The use of hidden Markov models to define CGIs was described in

Our GHMM has only two states, shown in

In this GHMM we used IMMs as emission sub-models and we tested different values for the exit probability of the NONCPG state,

To implement this system in ToPS we initially trained the two IMMs that constitute the states of the GHMM. We stored the description of these two models in the files

Once we had all the trained models, we specified a GHMM with the configuration file described in the

In this experiment the points in the curve correspond to different values for the exit probability of the NONCPG state of the GHMM. For comparison, the results with the CGI list from UCSC Genome Browser and with the CGI list obtained using HMM

In a different experiment we evaluated another GHMM with a non-geometric duration for the CPG state (data not shown). Because the Viterbi decoding must verify the best length for the CPG state, the decoding was significantly slower than with the geometric duration GHMM (12 hours vs. 1 minute). Furthermore, we did not observe an improvement in the quality of the prediction when modeling the duration of the CPG state explicitly.
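The geometric alternative is attractive because its single parameter has a direct interpretation: in a state with self-transition probability 1 − p_exit, the stay length is geometric with mean 1/p_exit, so the exit probability directly encodes the expected length of the regions the state models. A small sketch (all values invented):

```python
import random

def expected_duration(p_exit):
    """Mean of a geometric duration with exit probability p_exit."""
    return 1.0 / p_exit

def sample_duration(p_exit, rng):
    """Sample a stay length: keep self-transitioning until the exit."""
    d = 1
    while rng.random() >= p_exit:
        d += 1
    return d

# Empirically, the sample mean approaches 1 / p_exit.
rng = random.Random(42)
mean = sum(sample_duration(0.01, rng) for _ in range(20000)) / 20000.0
# `mean` should be close to expected_duration(0.01) == 100
```

This is why sweeping the exit probability of the NONCPG state traces out the sensitivity/specificity curve: each value implies a different expected spacing between predicted islands.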

To characterize CpG islands and their lengths we used

We compared our results with two independent CGI lists: (i) the CGI list computed by an HMM developed by Wu and colleagues

As can be seen from

| CGI List | Total number of CGI regions | Percentage of confirmed TSSs contained in the CGI predictions (“sensitivity”) | Total nucleotides in CGI list (“specificity”) |
|---|---|---|---|
| UCSC Genome Browser | | | |
| HMM | | | |

The results obtained with different values of

Predicting the location and the structure of protein-coding genes in eukaryotic genomes is a difficult but very important task

Next we illustrate the implementation in ToPS of a gene-finding system using a GHMM with 56 states.

The GHMM we built is shown in

To define a GHMM, we have to specify an emission sub-model for each state. Below is a list of the forward strand models we used:

A summarized description of each state can be found in

| State Name | Description | Emission Model | Duration Model |
|---|---|---|---|
| start codon | | start codon initial motif (20 nt), start codon model (3 nt), initial pattern model (4 nt) | fixed-length (27 nt) |
| stop codon | | stop codon model (3 nt) | fixed-length (3 nt) |
| single exon | | protein-coding model | smoothed histogram |
| initial exons | | protein-coding model | smoothed histogram |
| terminal exons | | protein-coding model | smoothed histogram |
| internal exon | | protein-coding model | smoothed histogram |
| intron | | non-coding model | geometric distribution |
| donor splice site | | donor initial pattern (4 nt), donor splice site model (9 nt) | fixed-length (13 nt) |
| acceptor splice site | | branch point model (32 nt), acceptor splice site model (6 nt), acceptor initial pattern model (4 nt) | fixed-length (42 nt) |
| intergenic state | | non-coding model | geometric distribution |
| final state | | non-coding model | self-transition probability is one |

The run-length distribution of the states representing exons was trained using the same methodology described in
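A kernel-smoothed histogram over observed lengths can be sketched as follows. This is a generic illustration: the Gaussian kernel, the bandwidth, and the example lengths are invented and not necessarily what ToPS uses.

```python
import math

def smoothed_histogram(lengths, max_len, bandwidth=10.0):
    """Return P(duration = d) for d in 1..max_len: a Gaussian kernel is
    placed on each observed length and the result is normalized."""
    density = [sum(math.exp(-0.5 * ((d - l) / bandwidth) ** 2)
                   for l in lengths)
               for d in range(1, max_len + 1)]
    total = sum(density)
    return [k / total for k in density]

# Invented exon lengths from a hypothetical training set:
dist = smoothed_histogram([120, 135, 150, 145, 300], max_len=600)
```

Smoothing avoids assigning zero probability to lengths absent from the training set while still concentrating mass around the observed values.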

To compare our results with a well established program, we applied GENSCAN

| Predictor | Gene PPV | Gene S_{n} | Gene F-score | Exon PPV | Exon S_{n} | Exon F-score | Nucleotide PPV | Nucleotide S_{n} | Nucleotide F-score |
|---|---|---|---|---|---|---|---|---|---|
| GENSCAN | 9.7±1.1 | 19.6±0.7 | 12.9±1.1 | 54.3±2.2 | 55.0±4.7 | 69.9±3.7 | | | |
| ToPS | 55.9±1.7 | 57.4±1.6 | 87.1±2.4 | | | | | | |

We presented ToPS, an open-source, object-oriented framework for analyzing probabilistic models of sequence data. It implements eight well-established probabilistic models that have applications in many distinct disciplines. ToPS includes programs for simulating, decoding, classifying and evaluating discrete sequences. The implemented models can be used individually, combined in heterogeneous models using GHMMs, or integrated in Bayesian classifiers. In contrast to systems with similar goals, end users do not need any previous knowledge of programming languages, since the probabilistic models are specified using a notation close to the mathematical one. There are specific auxiliary programs for training, simulating and decoding. In addition, ToPS includes two model selection criteria, BIC and AIC, that can be used to find the best classification parameters for given training and validation sets. Also, in contrast to other systems, ToPS includes a GHMM implementation that is at the same time general enough to describe any GHMM architecture and efficient when the model characteristics allow for a faster version of the Viterbi algorithm. This is important to enable the use of ToPS in gene finding.

The two examples presented above, a CpG island classifier and a gene predictor, illustrate that ToPS can be used to build complex model architectures for real-world problems. In both cases we achieved competitive performance against well-established results with minimal implementation work. Both results could be improved further through experimentation with the models.

ToPS was tested under GNU/Linux and Mac OS X and can be obtained from

We are currently using ToPS to develop different probabilistic models for biological sequence analysis. In particular ToPS was useful to produce results described in

Source code for ToPS. A compressed file containing the source code for ToPS.

(GZ)

The authors are indebted to Elias de Moraes Fernandes for the design of the ToPS logo.