Evaluating High-Throughput Ab Initio Gene Finders to Discover Proteins Encoded in Eukaryotic Pathogen Genomes Missed by Laboratory Techniques

Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers.

Input: a DNA sequence in FASTA format or multiple sequences in multiple FASTA format and model parameters for target species Output: A text output in the 'General Feature Format' (GFF) (See http://www.sanger.ac.uk/resources/software/gff/) To extract the protein sequences from the gff file a Perl script called getAnnoFasta.pl is provided.
Advantages: AUGUSTUS can be run on the German MediGRID. This enables you to submit larger sequence files and allows you to use protein homology information in the prediction.

Disadvantages:
Training: AUGUSTUS has currently been trained on species specific training sets to predict genes in organisms (see http://augustus.gobics.de/) for the list of organisms. A training program called etraining is available and requires training genes and their exon locations to be in a single file in a genBank format. The main keys required are LOCUS heading (LOCUS entries are optional), FEATURES, source, gene, mRNA, CDS, and ORIGIN (followed by the DNA sequence containing the training genes). How to train AUGUSTUS can be found at: http://molecularevolution.org/molevolfiles/exercises/augustus/training.html

Comments:
AUGUSTUS is based on a Hidden Markov Model and integrates a number of known methods and sub models e.g. Markov chain, a higher order windowed weight array model (WWAM), interpolated Markov Models (IMM) and a novel method for similarity-based weighting of sequence patterns. The default version of the model consists of 47 states (23 states for forward strand and 23 symmetric states for reverse strand). Each state emits a random DNA string of possibly random length. The distribution of these and the transition probabilities between them are determined using established models and a training set of annotated sequences for the target species. The following are probabilistically modelled separately: 1) the sequence around the splice sites, 2) the sequence of the branch point region, 3) the bases before the translation start, 4) the coding regions, 5) the non-coding regions, 6) the first coding bases of a gene, the length distribution of single exons, initial exons, 7) internal exons, 8) terminal exons, 9) intergenic regions,10) the distribution of the number of exons per gene, and 11) the length of distribution of introns. AUGUSTUS employs a new way of modelling intron lengths by combining explicit length modelling (estimated from observed frequencies) for short introns, with a geometric distribution for long introns. Short introns typically have a length distribution clustering around a certain length.
AUGUSTUS predicts the gene structure with the largest a-posteriori probability using the Viterbialgorithm.

GeneMark.hmm eukaryotic version
Program Name: gmhmme3 Usage: A eukaryotic gene finding algorithm using hidden Markov models (HMM) Description: GeneMark-hmm employ inhomogeneous (three-periodic) Markov chain models describing protein-coding DNA and homogeneous Markov chain models describing non-coding DNA. It utilises an extended hidden Markov model (HMM) architecture and the generalized Viterbi algorithm to determine the most likely sequence of hidden states (labels designating the coding or non-coding function) based on the whole observed DNA sequence. The hidden states are: initial, internal and terminal exons, introns, intergenic regions and single exon genes located on both DNA strands; and initiation site, termination site, donor and acceptor splice sites.
Platforms: Web server, Linux and Sun Solaris Output: an option to output in the 'General Feature Format' (GFF) and an option to translate predicted DNA sequences to protein sequences.

Advantages:
Disadvantages: There is no training program. However, there is a self-training version of GeneMark (see below) Trained model files: Model files for the following organisms are provided and have the extension *.mod: Homo sapiens, Arabidopsis thaliana, Caenorhabditis elegans, Toxoplasma gondii, Chlamydomonas reinhardtii, Drosophila melanogaster, Gallus gallus, Hordeum vulgare, Mus musculus, Oryza sativa, Triticum aestivum and Zea mays Comments: There are 2 versions of GeneMark.hmm -Supervised and un-supervised training versions. The version that uses supervised training has a web interface: http://exon.biology.gatech.edu/eukhmm.cgi

GeneMark.hmm ES
Program Name: gm_es.pl Usage: A eukaryotic gene finding algorithm using hidden Markov models (HMM) and employing the Viterbi unsupervised training procedure. ("E" stands for "Eukaryotic" and "S" stands for "Selftraining") Description: GeneMark.hmm-ES program predicts genes and intergenic regions in a sequence as a whole. They use the Hidden Markov models reflecting the "grammar" of gene organization. The selftraining procedure determines parameters for the gene models Reference: Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M., "Gene identification in novel eukaryotic genomes by self-training algorithm", Nucleic Acids Research, 2005, Vol. 33, No. 20, 6494-6506 Input: a DNA sequence in FASTA format Output: a file in a Gene Transfer Format (gtf).
Advantages: Can be used on novel genomes where there are an inadequate number of experimentally validated genes. The training set used is classified as "unsupervised training". GeneMark.hmm ES has been tested on Toxoplasma gondii.
Disadvantages: Contains no option to convert predicted genes to protein sequences Comments: GeneMark-ES tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster.
There is an optional parameter (--BP OFF) that switches off the branch point sub model and runs original ES algorithm (GeneMark.hmm ES version 1.0). This option is recommended for genomes with weak branch points and was used with Toxoplasma gondii Generation of gene predictions for a novel eukaryotic genome occurs in parallel with the unsupervised (automatic) iterative estimation of gene model parameters by the Viterbi training. At each iteration, the algorithm takes genomic sequence labelled by the Viterbi algorithm at the previous iteration into coding and non-coding regions, re-estimates model parameters, and computes a new sequence parse and labelling. This general path of the iterative Viterbi training process is modified by addition of restrictions on possible changes of parameters to ensure convergence of the iteration process to the biologically relevant point. At the point of convergence the set of sequence labels is transformed into the list of gene predictions, the program output.

GlimmerHMM
Program Name: glimmerhmm_linux Although the gene finder conforms to the overall mathematical framework of a GHMM, additionally it incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It also utilizes Interpolated Markov Models for the coding and non-coding models.
Platform: Linux RedHat 6.x+, Sun Solaris, and Alpha OSF1 Input: Two inputs -a DNA sequence file in FASTA format and a directory containing the training parameters for the program.
Output: A text output in the 'General Feature Format' (GFF) (See http://www.sanger.ac.uk/resources/software/gff/) Advantages: Code reusable due to their modular and extensible architectures. The programs are retrainable by the end user. They are also re-configurable and include several types of probabilistic sub models which can be independently combined, such as Maximal Dependence Decomposition trees and interpolated Markov models.
Disadvantages: No option to convert the predicted DNA sequences to protein sequences.
Training: Training program is available

Comments:
GlimmerHMM was used for the annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes.
Currently, the GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single). Input: a DNA sequence in FASTA format and model parameters for target species Output: an option to output in the 'General Feature Format' (GFF) and an option to translate predicted DNA sequences to protein sequences.
Advantage: Attempts to be more adaptable to different organisms, addressing problems related to using a gene finder on a genome sequence that it was not trained against.

Comments:
The default HMMs employed by SNAP have a minimal genome model. There are no models for promoter, poly-A site, UTRs or trans-splicing. These features are often not annotated and the author states that only the features which can be unambiguously defined in the training data should be used.
Some key SNAP features: 1) Uses six intron states to prevent stop codons at splice junctions, 2) models each strand independently. Decoding the strands independently allows genes on opposite strands to overlap. The advantage is that it allows genes within introns of other genes. The disadvantage is that it also allows overlapping exons. 3) The state diagram is read from a parameter file. This allows one to change the HMM to describe a variety of genomic features. 4) The sequence feature model architecture allows one to employ any length weight matrix and any order Markov model and to embed these models within an array, decision tree, or 3-periodic (coding sequence) framework. 6) Introns may have explicit length distributions over a fixed distance followed by a geometric tail (Similar to AUGUSTUS).
It may be possible to increase the accuracy of SNAP by including more states in the HMM to model additional genomic features or by using more sophisticated statistical techniques such as interpolated Markov models, maximum dependence decomposition trees or isochore segmentation. SNAP employs weight matrices (WM) to model compositional features such as the translation start site, splice sites and codon bias. The "default" models are as follows: the acceptor WM is30 bp long with 3 exonic nucleotides, the donor WM is 9 bp with 3 exonic nucleotides, the start WM is 12 bp with 6 coding nucleotides, and the stop WM is 9 bp with 3 bp on either side of the stop codons. Markov models are used for exons, introns, and intergenic sequence.

Background Information
The following is a brief introduction to in silico gene finding. In particular, the focus is on ab initio gene finding that predicts genes within a DNA sequence with no additional evidence. Figure 1 shows the molecular steps from a DNA sequence containing a single gene to a polypeptide sequence: Step 1) DNA sequence is transcribed to produce a messenger RNA (mRNA). In effect the introns (the non-coding sequences) are removed due to splicing and the exons (the coding sequences) are concatenated; and step 2) mRNA is translated into a chain of amino acids (a polypeptide) that folds to form a protein. These steps are commonly referred to as the central dogma of molecular biology.

Figure 1. Central dogma of molecular biology
The general aim in gene finding is to identify the coding segment (CDS) within a gene region; but more specifically, identify exons within the coding segment. Once the internal exon-intron structure of a gene is known, the encoded protein(s) can be deduced. At the most basic understanding there are three types of exons: an initial exon that is defined by a start codon (typically ATG), an internal exon, and a terminal exon that is defined by a stop codon (typically TAG, TGA, or TAA). That is, some codons in protein coding regions are significantly more common than other codons in non-coding regions. Ab initio gene prediction searches for signals and content. Figure  shown and GT is the most probable sequence emitted from this state. An input sequence can be scored by assessing the emission probabilities for a particular valid parse (or path through the GHMM model).
The emission probabilities collectively comprise signal scores (produced from the circle states) and Each state has fixed transition probabilities from one state to another state (represented in diagram by arrows). The transition probabilities describe the linear order in which the states are expected to occur. In this example, the red arrows are the low transition probabilities and the black arrows the high transition probabilities. In effect the transitions define the syntax rules for parsing the DNA sequence.A general rule is that the states representing the content regions (the oval states) cannot transition directly to each other. Each state has its own separate sub-model or sensor (see Figure 3).
The signal sensor uses the emission distribution to detect the appropriate motif signal and the content sensor assigns a probability score between two signals to determine the subsequence to be emitted by the state. Figure 5 shows how a GHMM model is fitted to an input DNA sequence to predict exons. For example, let us assume we are at the intergenic state. Each state emits a subsequence based on its emission probability distribution. The probabilities are derived from the training sequences for which the correct gene structure is known. We choose which state to visit next based on the transition probability (in the figure the low transition probabilities have been removed for clarity). The path through the states is referred to as a Markov chain -meaning that the state we go to next depends on the state that we are in. Since only the observed input sequence is given, the actual state path is hidden and needs to be inferred. Each nucleotide in the input sequence can only be emitted from only one state. The model determines the probability that a nucleotide was emitted by a particular state.
Only three putative paths (represented by a series of coloured rectangles) are shown below the state-transition diagram. The colours of the rectangles correspond to the colours of the states e.g. exons are green.
There are potentially many state paths (putative gene parses) that could emit the same sequence. To evaluate and score all possible valid parses is too time consuming. In practice, a shortest path algorithm (incorporating dynamic-programming) is used to find the highest-scoring parse for the input sequence (i.e. the path with the highest probability). The term decoding is used to describe the problem of identifying the highest-scoring parse. Decoding is typically achieved by an algorithm called Viterbi. Figure 6 shows a summary of the steps involved in predicting protein sequences.
Prediction accuracy depends on both the number of genes and the variety of genes in the training set. A training set in effect represents an "average gene" so it can be expected that some genes in the genome will not be predicted or will be incorrectly predicted because they greatly differ in structure or compositional biases to the so-called average gene. In mammalian genomes, there can be regions in the genome that are rich or poor in GC content in comparison to regions in the rest of the genome. These regions are called isochores and tend to have more functional genes. It is proposed that gene prediction accuracy in mammalian genomes will increase if the compositional biases are modelled for each isochore.
Most gene finders for efficiency make the following assumptions, which may be invalid at times: 1) The motif signals (start and stop codons, and splice sites) are the same throughout the genome; 2) The lengths for the introns and intergenic regions are geometrically distributed (AUGUSTUS is an exception); This schematic shows a summary of the steps involved in predicting protein sequences when given a DNA sequence as input. The first major step is the creation of model parameters (the determination of the probability distributions) for the respective gene finder. This is achieved by using validated genes from the target pathogen as training genes. The more genes used the greater the prediction accuracy but also the greater the potential for overtraining. The training procedure requires two input files: one containing the validated gene sequences and the other the exon locations of these gene sequences.
The model parameters along with the DNA sequence, in which we want to predict genes, are the only input requirements to the gene finder. One of the primary outputs is a file containing all the predicted exon locations per predicted gene. Using the exon start and end coordinates (e.g. start = 6 and end = 16) the exon nucleotide sequences (shown in red) are extracted from the input DNA sequence. These exon sequences are concatenated to form one sequence per gene prediction and in effect are equivalent to an mRNA sequence without the 5´cap or polyA tail. Commencing at the start of the mRNA sequence three letter codons (e.g. ATG,TCG, ATC) are translated to amino acid letters (e.g. M,S, I) to construct the protein sequence

Gene Finder Commands
For the gene prediction programs we need to know the exon start and end locations for the training data genes. The main place to find this information is in GenBank under the CDS join section as shown below: CDS join (22502..26539,27124..27259,28110..30498,31038..31104, 31285..31575,31925..32903,33391..35489,36324..36458, 36903..37039,37723..37822,38124..38225,38690..38821, 39431..39811,40299..41876) Each exon is represented by two numbers separated by "..". The first number e.g. 25502 is the start coordinates of the exon and the second number e.g. 26539 is the end coordinates of the exon. The exons are list in the order that they occur in the gene.