Developing and Applying Heterogeneous Phylogenetic Models with XRate

Modeling sequence evolution on phylogenetic trees is a useful technique in computational biology. Especially powerful are models which take account of the heterogeneous nature of sequence evolution according to the “grammar” of the encoded gene features. However, beyond a modest level of model complexity, manual coding of models becomes prohibitively labor-intensive. We demonstrate, via a set of case studies, the new built-in model-prototyping capabilities of XRate (macros and Scheme extensions). These features allow rapid implementation of phylogenetic models which would have previously been far more labor-intensive. XRate's new capabilities for lineage-specific models, ancestral sequence reconstruction, and improved annotation output are also discussed. XRate's flexible model-specification capabilities and computational efficiency make it well-suited to developing and prototyping phylogenetic grammar models. XRate is available as part of the DART software package: http://biowiki.org/DART.


Introduction
Phylogenetics, the modeling of evolution on trees, is an extremely powerful tool in computational biology.
The better we can model a system, the more we can learn from it, and vice-versa. Especially attractive, given the plethora of available sequence data, is modeling sequence evolution at the molecular level. Models describing the evolution of a single nucleotide began simply (e.g. JC69 [1]), later evolving to capture such biological features as transition/transversion bias (e.g. K80 [2]) and unequal base frequencies (e.g. HKY85 [3]). Felsenstein's "pruning" algorithm allows combining these models with phylogenetic trees to compute the likelihood of multiple sequences [4]. The phylo-grammars implemented by XRate describe the parametric structure of substitution rate matrices, along with grammatical rules governing which rate matrices can account for which alignment columns. This essentially amounts to partitioning the alignment (e.g. marking up exon boundaries and reading frames) and factoring in the transitions between the different types of region.
Parameter estimation and decoding (alignment annotation) algorithms are built in, allowing fast model prototyping and fitting. Model training (estimating the rate and probability parameters of the grammar) is done via a form of the Expectation Maximization (EM) algorithm, described in more detail in the original XRate paper [9]. Most recently, XRate allows programmatic model construction via its macros and Scheme extensions. XRate's built-in macro language allows large, repetitive grammars to be compactly represented, and also enables the model structure to depend on aspects of the data, such as the tree or alignment. Scheme extensions take this even further, interfacing XRate to a full-featured functional scripting language, allowing complex XRate-oriented workflows to be written as Scheme programs.
In this paper we demonstrate XRate's new model-specification tools via a set of progressively more complex examples, concluding with XDecoder, a phylo-grammar modeling RNA secondary structure overlapping protein-coding regions. We also describe additional improvements to XRate since its initial publication, namely ancestral sequence reconstruction, GFF/WIG output, and hybrid substitution models. Finally, we show how XRate's features are exposed as function extensions in a dialect of the Scheme programming language, typifying a Functional Programming (FP) style of model development and inference for phylogenetic sequence analysis. Terminology relevant to modeling with XRate is defined in detail in Appendix Section A. We also provide an online tutorial for making nontrivial modifications to existing grammars, going step-by-step from a Jukes-Cantor model to an autocorrelated Gamma-distributed rates phylo-HMM: http://biowiki.org/XrateTutorial.

Results and Discussion
The XRate generative model
A phylo-grammar generates an alignment in two steps: nonterminal transformations and token evolution.
The sequence of nonterminal transformations comprises the "grammar" portion of a phylo-grammar, and the "phylo" portion refers to the evolution of tokens along a phylogeny. First, transformation rules are repeatedly applied, beginning with the START nonterminal, until only a series of pseudoterminals remains.
From each group of pseudoterminals (a group may be a single column, two "paired" columns in an RNA structure, or a codon triplet of columns), a tuple of tokens is sampled from the initial distribution of the chain corresponding to the pseudoterminal. These tokens then evolve down the phylogenetic tree according to the mutation rules of the chain, resulting in the observed alignment columns.
If the nonterminal transformations contain no bifurcations and all emissions occur on the same side of the nonterminal, the grammar is a phylogenetic hidden Markov model (phylo-HMM), a special subclass of phylo-grammars. Otherwise, it is a phylogenetic stochastic context-free grammar (phylo-SCFG), the most general class of models implemented by XRate. This distinction, along with other related technical terms, is described in greater detail in Appendix Section A, the Glossary of XRate model terminology.
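The "phylo" half of the generative process described above can be illustrated outside XRate. The following is a minimal Python sketch, not XRate code: it samples one single-column token from a uniform initial distribution and evolves it down a rooted tree under a JC69 chain. The nested-tuple tree representation, the function names, and the taxon names are all hypothetical choices made for this illustration.

```python
import math
import random

ALPHABET = "acgt"

def jc69_transition_prob(same, t, u=1.0 / 3.0):
    """P(child token | parent token) after time t under JC69 with off-diagonal rate u."""
    decay = math.exp(-4.0 * u * t)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def evolve_token(parent, t, rng, u=1.0 / 3.0):
    """Sample the token at the bottom of a branch of length t."""
    weights = [jc69_transition_prob(tok == parent, t, u) for tok in ALPHABET]
    return rng.choices(ALPHABET, weights=weights, k=1)[0]

def simulate_column(tree, rng, parent=None):
    """Return {leaf_name: token} for one alignment column.

    `tree` is a hypothetical nested representation: a leaf is
    (name, branch_length); an internal node is ([child, ...], branch_length).
    The branch length of the root is ignored.
    """
    label_or_children, t = tree
    if parent is None:
        token = rng.choice(ALPHABET)  # JC69 initial distribution is uniform
    else:
        token = evolve_token(parent, t, rng)
    if isinstance(label_or_children, str):
        return {label_or_children: token}
    column = {}
    for child in label_or_children:
        column.update(simulate_column(child, rng, parent=token))
    return column

# Hypothetical three-taxon tree; five independent columns.
tree = ([("human", 0.1), ([("mouse", 0.3), ("rat", 0.3)], 0.2)], 0.0)
rng = random.Random(0)
columns = [simulate_column(tree, rng) for _ in range(5)]
for name in ("human", "mouse", "rat"):
    print(name, "".join(col[name] for col in columns))
```

In a full phylo-grammar the choice of chain for each column would be made by the nonterminal transformations; here every column uses the same chain, which corresponds to the trivial grammar case.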
The generality of XRate requires a slight tradeoff against speed. Since the low-level code implementing core operations is shared among the set of possible models, XRate will generally be slower than programs with source code optimized for a narrower range of models. Computing the Felsenstein likelihood under the HKY85 [3] model of a 5-taxon, 1Mb alignment, XRate required 1.25 minutes of CPU time and 116MB RAM, while PAML required 9 seconds of CPU time and 19MB RAM for the same operation. Running PFOLD [15] on a 5-taxon, 1KB alignment required 11 seconds and 164MB RAM, and running XRate on the same alignment with a comparable grammar required 25 seconds and 62MB RAM. All programs were run with default settings on a 3.4 GHz Intel i7 processor. Model-fitting also takes longer with XRate: a previous work found that XRate's parameter estimation routines were approximately 130 times slower than those in PAML [16].
In an attempt to improve XRate's performance, we tried using Beagle, a library that provides CPU and accelerated parallel GPU implementations of Felsenstein's algorithm along with related matrix operations [17]. So far, however, we have been unable to obtain significant performance gains by this method.
Despite these caveats, XRate has proved to be fast enough for genome-scale applications, such as a screen of Drosophila whole-genome alignments [18]. Furthermore, it implements a significantly broader range of models than the above-cited tools.

XRate inputs, outputs and operations
The formulation of the XRate model presented in the previous section is generative: that is, it describes the generation of data on a tree. In practice, the main reason for doing this is to generate simulation data for benchmarking purposes. This is possible using the tool simgram [19], which is provided with XRate as part of the DART package.
Most common use cases for generative models involve not simulation, but inference: that is, reconstructing aspects of the generative process (sequence of nonterminal transformations, token mutations, or grammar parameters) given observed sequence data (in the form of a multiple sequence alignment).
Using a phylo-grammar, a set of aligned sequences, and a phylogeny relating these sequences (optionally inferred by XRate), XRate implements the relevant parameterization and inference algorithms, allowing researchers to analyze sequence data without having to implement their own models.
Sequences are read and written in Stockholm format [20] (converters to and from common formats are included with DART). This format allows for the option of embedding a tree in Newick format [21] (via the #=GF NH tag) and annotations in GFF format [22]. By construction, Newick format necessarily specifies a rooted tree, rather than an unrooted one. However, the root placement is only relevant for time-irreversible models; when using time-reversible models, the placement of the root is arbitrary and can safely be ignored. Given these input ingredients, a call to XRate proceeds in the following order (more detail is provided at http://biowiki.org/XRATE and http://biowiki.org/XrateFormat):
1. The Stockholm file and grammar alphabet are parsed (as macros may depend on these).
2. Any grammar macros are expanded, followed by Scheme functions.
3. If requested, or a tree was not provided in the input data, one is estimated using neighbor-joining [23]. As noted above, this is a rooted tree, but the root placement is arbitrary if a time-reversible model is used.
4. Grammar parameters are estimated (if requested).
5. Alignment is annotated (if requested).
6. Ancestral sequences are reconstructed (if requested).
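To make the input format concrete, the following Python sketch (not part of DART) reads a small Stockholm alignment and extracts the Newick tree embedded via the #=GF NH tag. The function name and the toy alignment are hypothetical, and the parser deliberately handles only sequence rows and #=GF NH lines, skipping all other markup.

```python
def parse_stockholm(text):
    """Parse a single-alignment Stockholm string into (sequences, newick).

    Only sequence rows and '#=GF NH' tree lines are interpreted;
    all other markup lines (#=GC, #=GR, #=GS, ...) are skipped.
    """
    sequences = {}
    newick_parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GF"):
            fields = line.split(None, 2)
            if len(fields) == 3 and fields[1] == "NH":
                newick_parts.append(fields[2])  # a tree may span several NH lines
            continue
        if line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, row = parts
        sequences[name] = sequences.get(name, "") + row.replace(" ", "")
    return sequences, "".join(newick_parts)

# Toy alignment with hypothetical sequences and branch lengths.
example = """# STOCKHOLM 1.0
#=GF NH ((human:0.1,chimp:0.1):0.2,mouse:0.5);
human  ACGTACGT
chimp  ACGTACGA
mouse  ACGAACGA
//
"""
seqs, tree = parse_stockholm(example)
print(tree)
print(seqs)
```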
After the analysis is complete, the alignment (along with an embedded tree) is printed to the output stream along with ancestral sequences (if requested) as well as any #=GC and #=GR column annotations.
GFF and WIG annotations are sent to standard output by default, but these can be directed to separate files by way of the -gff and -wig options, respectively.
The XRate format macro language for phylo-grammar specification: case studies
The following sections describe case studies of repetitively-structured models which motivate the need for grammar-generating code. Historically, we have attempted several solutions to the case studies described.
We first briefly review the factors that influenced our eventual choice of Scheme as a macro language.
XRate was preceded by Searls' Prolog-based automata [24] and Birney's Dynamite parser-generator [11], and roughly contemporaneous with Slater's Exonerate [10] and Lunter's HMMoC [12]. In early versions of XRate (circa 2004), and in Exonerate, the only way for users to specify their own phylo-grammar models was to write C/C++ code that would compile directly against the program's internal libraries. This kind of compilation step significantly slows model prototyping, and impedes re-use of model parameters.
Current versions of XRate, along with Dynamite and HMMoC, understand a machine-readable grammar format. In the case of XRate, this format is based on Lisp S-expressions. In such formats (as the case studies illustrate) the need arises for code that generates repetitively-structured grammar files. It is often convenient, and sometimes sufficient, to write such grammar-generating code in an external language: for example, we have written Perl, Python and C++ libraries to generate XRate grammar files [9,16].
However, this approach still has the disadvantage (from a programmer's or model developer's perspective) that (a) code to generate real grammars tends to require an ungainly mix of grammar-related S-expression constants embedded in Perl/Python/C++ code, and (b) the requirement for an explicit model-generation step can delay prototyping and evaluation of new phylo-grammar models.
XRate's macro language provides an alternate way to generate repetitive models within XRate, without having to resort to external code-generating scripts. This allows the model-specifying code to remain compact, readable, and easy to edit. As we report in this manuscript, the XRate grammar format now also natively includes a Scheme-based scripting language that can be embedded directly within grammar files, whose syntax blends seamlessly with the S-expression format used by XRate and whose functional nature fits XRate's problem domain. We provide here examples of common phylogenetic models which make use of various macro features, and refer the reader to the online documentation for a complete introduction to XRate's macro features: http://biowiki.org/XrateMacros. All of the code snippets presented here are available as minimal complete grammars in Text S1. The full, trained grammars corresponding to those presented here are available as part of DART. This correspondence is described here: http://biowiki.org/XratePaper2011
A repetitively-structured HMM specified using simple macros
Probabilistic models for the evolution of biological sequences tend to contain repetitive structure. Sometimes, this structure arises as a reflection of symmetries in the phylo-grammar; other times, it arises due to structure in the data, such as the tree or the alignment. While small repetitive models can be written manually, developing richer evolutionary models and grammars often demands writing code to model the underlying structure.
Markov chain symmetry
The most familiar source of repetition derives from the substitution model's structure: different substitutions share parameters based on prior knowledge or biological intuition.
Perhaps most repetitive is the Jukes-Cantor model for DNA. The matrix entries $Q_{ij}$ denote the rate of substitution from $i$ to $j$:
$$Q = \begin{pmatrix} * & u & u & u \\ u & * & u & u \\ u & u & * & u \\ u & u & u & * \end{pmatrix}$$
Here $u$ is an arbitrary positive rate parameter. The * character denotes the negative sum of the remaining row entries (here equal to $-3u$ in every case). The parameter $u$ is typically set to 1/3 in order that the stochastic process performs, on average, one substitution event per unit of time.
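The normalization claim can be checked numerically. The sketch below is plain Python rather than XRate grammar code: it builds the Jukes-Cantor rate matrix, fills in the diagonal as the negative sum of the other row entries, and computes the expected number of substitutions per unit time under the uniform equilibrium distribution. The function and variable names are hypothetical.

```python
TOKENS = "acgt"

def jc69_rate_matrix(u=1.0 / 3.0):
    """Return Q as a dict of dicts; Q[i][j] is the rate from token i to token j."""
    q = {i: {j: (u if i != j else 0.0) for j in TOKENS} for i in TOKENS}
    for i in TOKENS:
        # Diagonal entry (the '*' in the matrix): negative sum of the rest of the row.
        q[i][i] = -sum(q[i][j] for j in TOKENS if j != i)
    return q

def expected_rate(q, equilibrium):
    """Expected substitutions per unit time: sum over i of pi_i * (-Q_ii)."""
    return sum(equilibrium[i] * -q[i][i] for i in q)

Q = jc69_rate_matrix()
uniform = {tok: 0.25 for tok in TOKENS}
print(expected_rate(Q, uniform))  # approximately 1 when u = 1/3
```

With $u = 1/3$ each diagonal entry is $-1$, so the expected rate is one substitution per unit time, as stated above.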
This matrix can be specified in XRate with two nested loops over alphabet tokens. Each loop over alphabet tokens has the form (&foreach-token X expression...) where expression... is a construct to be expanded for each alphabet token X. Here, expression sets the substitution rate between each pair of source and destination tokens (except for the case when the source and destination tokens are identical, for which case we simply generate an empty list, (), which will be ignored by the XRate grammar parser).
We do not explicitly need to write the negative values of the on-diagonal matrix elements (labeled * in the above description of the matrix); XRate will figure these out for itself. To check whether source and destination tokens are equal in the loop, we use a conditional &if statement, which has the form (&if (condition) (expansion-if-true) (expansion-if-false)). The condition is implemented using the &eq macro, which tests if its two arguments are equal. Putting all these together, the nested loops look like this:
  (&foreach-token tok1
   (&foreach-token tok2
    (&if (&eq tok1 tok2)
     () ;; If tok1==tok2, expand to an empty list (ignored by parser)
     (mutate (from (tok1)) (to (tok2)) (rate u)))))
While this illustrates XRate's looping and conditional capabilities, such a simple model would almost be easier to code by hand. For a slightly more complex application, we turn to the model of Pupko et al in their 2008 work. In their RASER program the authors used a chain augmented with a latent variable indicating "slow" or "fast" substitution. Reconstructing ancestral sequences on an HIV phylogeny allowed them to infer locations of transitions between slow and fast modes, indicating a possible gain or loss of selective pressure [25]. The chain shown below, $Q_{RASER}$, is a simplified version of their model: substitutions within rate classes occur according to a JC69 model scaled by rate parameters $s$ and $f$ (slow and fast, respectively), and transitions between rate classes occur with rates $r_{sf}$ and $r_{fs}$ (slow → fast and fast → slow, respectively).
$$Q_{RASER} = \begin{array}{c|cccccccc} & A_s & C_s & G_s & T_s & A_f & C_f & G_f & T_f \\ \hline A_s & * & us & us & us & r_{sf} & 0 & 0 & 0 \\ C_s & us & * & us & us & 0 & r_{sf} & 0 & 0 \\ G_s & us & us & * & us & 0 & 0 & r_{sf} & 0 \\ T_s & us & us & us & * & 0 & 0 & 0 & r_{sf} \\ A_f & r_{fs} & 0 & 0 & 0 & * & uf & uf & uf \\ C_f & 0 & r_{fs} & 0 & 0 & uf & * & uf & uf \\ G_f & 0 & 0 & r_{fs} & 0 & uf & uf & * & uf \\ T_f & 0 & 0 & 0 & r_{fs} & uf & uf & uf & * \end{array}$$
While this chain contains four times as many rates as the basic JC69 model, there are only five parameters ($u$, $s$, $f$, $r_{sf}$, $r_{fs}$), since the model contains repetition via its symmetry. While manual implementation is possible, the model can be expressed in just a few lines of XRate macro code. Further, additional "modes" of substitution (corresponding to additional quadrants in the matrix above) can be added by editing the first two lines of the following code.
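As an illustration of how few parameters determine this larger chain, here is a Python sketch (not the XRate macro code, and not the RASER implementation) that builds a rate matrix over (token, mode) states from $u$, per-mode multipliers, and mode-switching rates. It assumes, beyond what the text states explicitly, that a mode switch leaves the nucleotide unchanged and that simultaneous token-and-mode changes have rate zero. All names and numeric values are hypothetical.

```python
from itertools import product

TOKENS = "acgt"

def raser_style_chain(u, multipliers, switch_rates):
    """Build a rate matrix over (token, mode) states.

    multipliers:  {mode: within-mode scaling of the JC69 rate u}
    switch_rates: {(mode_from, mode_to): rate of changing mode}
    Assumption: a mode switch leaves the token unchanged, and simultaneous
    token+mode changes have rate zero (those entries are simply omitted).
    """
    states = list(product(TOKENS, multipliers))
    q = {}
    for src, dst in product(states, repeat=2):
        if src == dst:
            continue
        (tok1, mode1), (tok2, mode2) = src, dst
        if mode1 == mode2:
            q[src, dst] = u * multipliers[mode1]      # within-mode substitution
        elif tok1 == tok2:
            q[src, dst] = switch_rates.get((mode1, mode2), 0.0)  # mode switch
    for state in states:
        # Diagonal entries: negative sum of the other entries in the row.
        q[state, state] = -sum(rate for (src, dst), rate in q.items()
                               if src == state and dst != state)
    return states, q

# Hypothetical parameter values; adding a third mode only changes these two dicts.
states, q = raser_style_chain(
    u=1.0,
    multipliers={"s": 0.1, "f": 2.0},
    switch_rates={("s", "f"): 0.01, ("f", "s"): 0.02},
)
print(len(states), "states;", q[("a", "s"), ("c", "s")], q[("a", "s"), ("a", "f")])
```

Adding a further substitution mode only requires extending the two dictionaries, which mirrors the point made above about adding quadrants to the matrix.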

Phylo-HMM-induced repetition
The previous examples both involved specifying the Markov chain component of a phylo-grammar. Coupled with a trivial top-level grammar (a START state and an EMIT state which emits the chain via the EMIT* pseudoterminal), these models describe an alignment where each column's characters evolve according to the same substitution model. A common extension to this is using sequences of hidden states which generate alignment columns according to different substitution models. These "phylo-grammars" (which can include phylo-SCFGs and the more restricted phylo-HMMs) allow modelers to describe and/or detect alignment regions exhibiting different evolutionary patterns.
Phylo-HMMs model left-to-right correlations between alignment columns, and phylo-SCFGs are capable of modeling nested correlations (such as "paired" columns in an RNA secondary structure). Readers unfamiliar with phylo-grammars may benefit from relevant descriptions and links available here: http://biowiki.org/PhyloGrammars, animations available here: http://biowiki.org/PhyloFilm, and the original paper describing XRate [9].
We outline here a phylo-HMM that is simple to describe, but would take a substantial amount of code to implement without XRate's macro language. The model is based on PhastCons, a program by Siepel et al which uses an HMM whose three states (or, in XRate terminology, nonterminals) use substitution models differing only by rate multipliers [26]. This model, depicted schematically in Figure 1, can be used to detect alignment regions evolving at different rates. If the rates of each hidden state correspond to quantiles of the Gamma distribution, then summing over hidden states of this model is equivalent to the commonly-used Gamma model of rate heterogeneity. We provide this grammar in Text S1, which is essentially identical to the PhastCons grammar with n states except for its invocation of a Scheme function returning the n Gamma-derived rates for a given shape parameter.
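The Gamma-derived rates mentioned above can be sketched in Python. This is not the Scheme function shipped with XRate, whose exact scheme is not described here; the sketch assumes a mean-one Gamma distribution split into equal-probability classes, takes the median of each class as its rate, and rescales so the rates average to one. The incomplete-gamma series and bisection are standard numerical techniques, and all function names are hypothetical.

```python
import math

def gamma_cdf(x, shape):
    """Regularized lower incomplete gamma P(shape, x), by series expansion."""
    if x <= 0.0:
        return 0.0
    term = 1.0
    total = 1.0
    n = 1
    while term > 1e-15 * total:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape + 1.0))

def gamma_quantile(p, shape):
    """Quantile of Gamma(shape, scale=1), found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(shape, num_classes):
    """Median rate of each equal-probability class, rescaled to mean 1.

    Dividing by `shape` converts Gamma(shape, scale=1) quantiles into those of
    the mean-one distribution Gamma(shape, scale=1/shape).
    """
    medians = [gamma_quantile((2 * k + 1) / (2.0 * num_classes), shape) / shape
               for k in range(num_classes)]
    mean = sum(medians) / num_classes
    return [m / mean for m in medians]

print(discrete_gamma_rates(0.5, 3))
```

Each returned value would play the role of one nonterminal's rate multiplier in a PhastCons-style grammar with that many states.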
We can define such a model in XRate easily due to the symmetric structure: all three nonterminals have similar underlying substitution models (varying only by a multiplier) and also similar probabilities of making transitions to other nonterminals via grammar transformation rules.
The grammar will have nonterminals named "1", "2", ... up to numNonterms, each one associated with a rate parameter (r_1, r_2, ...) and substitution chain (chain_1, chain_2, ...). To express this grammar in XRate macro code, we'll need to declare each of these nonterminals, the production rules which govern transitions between them, rate parameters, and the nonterminal-associated substitution chains. (For a fully-functional grammar, an alphabet is also needed; these are omitted in code snippets included in the main text, but the corresponding grammars in Text S1 contain alphabets.) First, define how many nonterminals the model will have: adding more nonterminals to the model later on can be done simply by adjusting this variable. We define a SEED value to initialize the rate parameters (this is not a random number seed, but rather an initial guess at the parameter value necessary for the EM algorithm to begin), which is done inside a &foreach-integer loop using the numNonterms variable.
The (&foreach-integer X (1 K) expression) macro expands expression for all values of X from 1 to K. In this case, we define a rate parameter for each of our nonterminals 1..K.
  (&define numNonterms 3)
Next, define a Markov chain for each nonterminal: all make use of the same underlying substitution model (e.g. JC69 [1], HKY85 [3]) whose entries are stored as Q_a_b for the transition rate between characters a and b. This "underlying" chain must be defined elsewhere, either in an included file (using the (&include) directive), or directly in the grammar file. For instance, we could re-use the JC69 chain, declaring rate parameters for later use:
  (&foreach-token tok1
   (&foreach-token tok2
    (&if (&eq tok1 tok2)
     () ;; If tok1==tok2, expand to an empty list (ignored by parser)
     (rate (&cat Q_ tok1 _ tok2) u ))))
Each nonterminal has an associated substitution model which is Q_a_b scaled by a different rate multiplier r_nonterminal. Using an integer loop, we create a chain for each nonterminal using the rate parameters we defined in the two previous code snippets:
  [...] (tok1)) (to (tok2)) (rate (&cat Q_ tok1 _ tok2) (&cat r_ nonterminal))))))))
Next, define the production rules which govern the nonterminal transitions. For simplicity of presentation (but not required), we assume here that transitions between nonterminals all occur with probability proportional to leaveProb, and all self-transitions have probability stayProb.
The pgroup declaration defines a probability distribution over a finite outcome space, with the parameters declared therein normalized to unity during parameter estimation. In this grammar we declare stayProb and leaveProb within a pgroup since they describe the two outcomes at each step of creating the alignment: staying at the current nonterminal or moving to a different one.
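The transition structure implied by stayProb and leaveProb can be written out explicitly. The Python sketch below is not XRate code; it assumes that the probability of leaving is split evenly among the other nonterminals and ignores any transitions from a START state or to an end state. The function name is hypothetical.

```python
def phylo_hmm_transitions(num_nonterms, stay_prob):
    """Transition probabilities between nonterminals named "1".."K".

    Assumption: leaveProb = 1 - stayProb is divided evenly among the
    K - 1 other nonterminals, so each row sums to one.
    """
    if not 0.0 <= stay_prob <= 1.0:
        raise ValueError("stay_prob must lie in [0, 1]")
    names = [str(k) for k in range(1, num_nonterms + 1)]
    if num_nonterms == 1:
        return {"1": {"1": 1.0}}
    leave_each = (1.0 - stay_prob) / (num_nonterms - 1)
    return {src: {dst: (stay_prob if src == dst else leave_each) for dst in names}
            for src in names}

transitions = phylo_hmm_transitions(3, 0.9)
for src, row in transitions.items():
    print(src, row)
```

Because only two numbers parameterize every row, the number of free transition parameters stays constant as numNonterms grows, which is the point of declaring them in a single pgroup.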
This is accomplished by making use of the tree iterators (e.g. &BRANCHES, &NODES, and &LEAVES) and alignment data (e.g. &COLUMNS) to create nonterminals and/or terminal chains associated with these parts of the input data.
In their program DLESS, Haussler and colleagues used such an approach in a tree-dependent model to detect lineage-specific selection. Their model used a phylo-HMM with different nonterminals for each tree node, with the substitution rate below this node scaled to reflect gain or loss of functional elements [26]. We show a simplified form of their model as a schematic in Figure 2, with blue colored branches representing a slowed evolutionary rate.
Using XRate's macros we can express this model in a compact way just as was done with the PhastCons model. Since both models use a set of nonterminals with their own scaled substitution models, we need simply to replace the integer-based loop (&foreach-integer nonterminal (1 numNonterms) expression) with the tree-based loop (&foreach-node state expression) to create a nonterminal for each node in the tree. Then, define each node-specific chain as a hybrid chain, such that the chain associated with tree node n has all the branches below node n scaled to reflect heightened selective pressure. Hybrid chains, substitution processes which vary across the tree, are discussed briefly in the section on "Recent enhancements to XRate", and the details of their specification are thoroughly covered in the XRate format documentation, available here: http://biowiki.org/XrateFormat. A minimal working form of the DLESS-style grammar is included in Text S1.
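A repetitively-structured codon model specified using Scheme functions
While XRate's macro language is very flexible, there are some relatively common models that are difficult to express within the language's constraints. For example, a Nielsen-Yang codon matrix incorporating transition bias and selection has nearly 4,000 entries whose rates are determined by the following criteria:
$$q_{ij} = \begin{cases} 0 & \text{if } i \text{ and } j \text{ differ at more than one position} \\ \pi_j & \text{synonymous transversion} \\ \kappa\pi_j & \text{synonymous transition} \\ \omega\pi_j & \text{nonsynonymous transversion} \\ \omega\kappa\pi_j & \text{nonsynonymous transition} \end{cases}$$
where $\pi_j$ is the equilibrium frequency of codon $j$, $\kappa$ is the transition/transversion rate ratio, and $\omega$ is the nonsynonymous/synonymous rate ratio. Such a matrix can be generated fairly easily by way of a Perl or Python script tailored to generate XRate grammar code.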
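To show why such a matrix is tedious to write by hand but easy to generate, here is a short Python sketch that computes codon substitution rates in the standard Nielsen-Yang style: zero for codons differing at more than one position, otherwise the target codon frequency multiplied by kappa for transitions and by omega for nonsynonymous changes. It is not one of the grammar-generating scripts referred to in the text and does not emit XRate grammar syntax. It assumes the standard genetic code and, by default, uniform codon frequencies; all names are hypothetical.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, codons enumerated in TCAG order (first base slowest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SENSE_CODONS = [c for c, aa in GENETIC_CODE.items() if aa != "*"]
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}

def codon_rate(src, dst, kappa, omega, freq):
    """Rate from codon src to codon dst (both sense codons, src != dst)."""
    diffs = [(x, y) for x, y in zip(src, dst) if x != y]
    if len(diffs) != 1:
        return 0.0                      # more than one nucleotide change
    rate = freq[dst]
    if diffs[0] in TRANSITIONS:
        rate *= kappa                   # transition bias
    if GENETIC_CODE[src] != GENETIC_CODE[dst]:
        rate *= omega                   # selection on amino-acid changes
    return rate

def codon_rate_matrix(kappa, omega, freq=None):
    """Off-diagonal rates for all ordered pairs of sense codons."""
    if freq is None:
        freq = {c: 1.0 / len(SENSE_CODONS) for c in SENSE_CODONS}  # assumed uniform
    return {(src, dst): codon_rate(src, dst, kappa, omega, freq)
            for src in SENSE_CODONS for dst in SENSE_CODONS if src != dst}

Q = codon_rate_matrix(kappa=2.0, omega=0.5)
nonzero = sum(1 for rate in Q.values() if rate > 0.0)
print(len(SENSE_CODONS), "sense codons;", len(Q), "off-diagonal entries;", nonzero, "nonzero")
```

Although only a minority of entries are nonzero, each depends on the genetic code lookup, which is the kind of logic that is awkward to express with looping and conditional macros alone.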
While this is a convenient scripting mechanism for many users (and is perfectly possible with XRate), it tends to lead to an awkward mix of code and embedded data (i.e. snippets of grammar-formatting text). This obscures both the generating script and the final generated grammar file (the former due to the code/data mix, and the latter due to sheer size).
Another choice of programming language for implementing XRate extensions, which suffers slightly less from these limitations, is Scheme. As XRate's macro language is based on Lisp (the parent language to Scheme), the syntaxes are very similar, so the "extension" blends naturally with the surrounding XRate grammar file. Scheme is inherently functional and is also "safe" (in that it has garbage collection). Lastly, data and code have equivalent formats in Scheme, enabling the sort of code/data mingling outlined above. Note that xrate-dna-alphabet is a simple variable, but xrate-NY-grammar is a function and is therefore wrapped in parentheses (as per the syntax of calling a function in Scheme). The reason that xrate-NY-grammar is a function is so that the user can optionally redefine the genetic code, which (as noted above) is stored as a Scheme association list, in the variable codon-translation-table (the standard library code can be examined for details).
A macro-heavy grammar for RNA structures in protein-coding exons
As a final example of the possibilities that XRate's new model-specification features enable, we present a new grammar for predicting RNA structures which overlap protein-coding regions. XDecoder is based closely on the RNADecoder grammar first developed by Pedersen and colleagues [27]. This grammar is designed to detect phylogenetic evidence of conserved RNA structures, while also incorporating the evolutionary signals brought on by selection at the amino-acid level. In eukaryotes, RNA structure overlapping protein coding sequence is not yet well-known, but in viral genomes this is a common phenomenon due to constraints on genome size acting on many virus families. XDecoder is available as an XRate grammar, linked here: http://biowiki.org/XratePaper2011
Motivation for implementation
Our endeavor to re-implement the RNADecoder grammar was motivated by both practical and methodological reasons. The original RNADecoder code is no longer maintained, but performs well on published viral datasets [28]. Running RNADecoder on an alignment of full viral genomes is quite involved: the alignment must first be split up into appropriately-sized chunks (~300 columns), converted to COL format [29], and linked to a tree in a special XML file which directs the analysis. The grammar and its parameters, also stored in an XML format, are difficult to read and interpret. RNADecoder attains remarkably higher specificity in genome-wide scans as compared to protein-naive prediction programs like PFOLD [15] or MFOLD [30]. Each analysis with RNADecoder requires an XML file to coordinate the alignment and tree as well as direct parts of the analysis (training and annotation). XRate reads Stockholm format alignments which natively allows for alignment-tree association, enabling simple batch processing of many alignments.
The grammar can be run on arbitrarily long alignments, provided a suitable maximum pair length is specified via the -l N argument. This prevents XRate from considering any pairing whose columns are more than N positions apart, effectively limiting both the memory usage and runtime.
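The effect of a maximum pair length can be seen by counting candidate column pairs. The Python sketch below is only a back-of-envelope illustration, not a description of XRate's internal data structures; it assumes that a pair (i, j) is allowed when j - i is at most N, and the function name is hypothetical.

```python
def count_candidate_pairs(num_columns, max_pair_length=None):
    """Count column pairs (i, j), i < j, with j - i <= max_pair_length.

    With no limit, every pair is a candidate: L * (L - 1) / 2 pairs.
    """
    if max_pair_length is None or max_pair_length >= num_columns:
        max_pair_length = num_columns - 1
    return sum(min(max_pair_length, num_columns - 1 - i) for i in range(num_columns))

for length in (1000, 2000, 4000):
    print(length, count_candidate_pairs(length), count_candidate_pairs(length, 300))
```

Without a limit the count grows quadratically in the alignment length, whereas with a fixed N it grows roughly linearly, which is why a bound makes long alignments tractable.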

Using XDecoder
Training the grammar's parameters, which may be necessary for running the grammar on significantly different datasets, is also accomplished with a single command:
  xrate -g XDecoder.eg -l 300 -t XDecoder.trained.eg polio.stk
The results of an analysis using XDecoder are shown in Figure 3, together with gene and RNA structure annotations. Also shown are three related analyses (all done using XRate grammars): PhastCons conservation, coding potential, and pairing probabilities computed using PFOLD. These three separate analyses reflect the signals that XDecoder must tease apart in order to reliably predict RNA structures.
DNA-level conservation could be due to protein-coding constraints, regional rate variation, pressure to maintain a particular RNA structure, or a combination of all three. Using codon-position rate multipliers, multiple rate classes, and a secondary structure model, XDecoder unifies all of these signals in a single phylogenetic model, resulting in the highly-specific predictions shown at the top of Figure 3.

Recent enhancements to XRate
Lineage-specific models
This allows a direct link between XRate and visualization tools such as JBrowse [32], GBrowse [33], the UCSC Genome Browser [34], and Galaxy [35], allowing the results of different analyses to be displayed next to one another and/or processed in a unified framework.
GFF: Discrete genomic features
GFF is a format oriented towards storing genomic features using 9 tab-delimited fields: each line represents a separate feature, with each field storing a particular aspect of the feature (e.g. identifier, start, end, etc). With XRate, a common application is using GFF to annotate an alignment with features corresponding to grammar nonterminals. For instance, using a gene-prediction grammar one could store the predicted start and end points of genes together with a confidence measure.
Similarly, predicted RNA base pairs could be represented in GFF as one feature per pair, with start and end positions indicating the paired positions.
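A base-pair feature of this kind might be written as follows. The Python sketch formats one GFF line with the nine standard fields (sequence identifier, source, type, start, end, score, strand, phase, attributes), using GFF3-style key=value attributes. The source name, feature type, coordinates, and attribute key are hypothetical placeholders and do not necessarily match what XRate emits.

```python
def gff_line(seqid, source, feature_type, start, end,
             score=".", strand=".", phase=".", attributes=None):
    """Format one feature as a 9-field, tab-delimited GFF line."""
    attr_text = ";".join(f"{key}={value}" for key, value in (attributes or {}).items()) or "."
    fields = [seqid, source, feature_type, str(start), str(end),
              str(score), strand, phase, attr_text]
    return "\t".join(fields)

# Hypothetical base pair between alignment columns 120 and 148,
# with a hypothetical posterior probability as the score.
line = gff_line("polio", "XDecoder", "base_pair", 120, 148,
                score=0.93, attributes={"ID": "pair1"})
print(line)
```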

The Dart Scheme (Darts) interpreter
Another way to use XRate, instead of running it from the command line, is to call it from the Scheme interpreter (included in DART). The compiled interpreter executable is named "darts" (for "DART Scheme"). This offers a simple yet powerful way to create parameter-fitting and genome annotation workflows. For example, a user could train a grammar on a set of alignments, then use the resulting grammar to annotate a set of test alignments.
Darts, in common with the Scheme interpreter used in XRate grammars, is implemented using Guile (GNU's Ubiquitous Intelligent Language for Extension: http://www.gnu.org/software/guile/guile.html).
Certain commonly-encountered bioinformatics objects, serializable via standard file formats and implemented as C++ classes within XRate, are exposed using Guile's "small object" (smob) mechanism.
Currently, these types include Newick-format trees and Stockholm-format alignments. API calls are provided to construct these "smobs" by parsing strings (or files) in the appropriate format. The smobs may then be passed directly as parameters to XRate API calls, or may be "unpacked" into Scheme data structures for individual element access. Guile encourages sparing use of smobs; consequently, smobs are used within Darts exclusively to implement bioinformatic objects that already have a broadly-used file format (Stockholm alignments and Newick trees). In contrast, formats that are newly-introduced by XRate (grammars, alphabets and so forth) are all based on S-expressions, and so may be represented directly as native Scheme data structures.
The functions listed in Section B provide an interface between Scheme and XRate. Together with the functions in the XRate-scheme standard library and Scheme's native functional scripting abilities, these make a broad array of models and workflows possible. For instance, one could estimate several sets of parameters for Nielsen-Yang models using groups of alignments, and then embed each one in a PhastCons-style phylo-HMM, finally using this model to annotate a set of alignments. While this and other workflows could be accomplished in an external framework (e.g. Make, Galaxy [35]), Darts provides an alternate way to script XRate tasks using the same language that is used to construct the grammars.

Materials and Methods
Text S1 contains example grammars referred to in the text, as well as small and large test Stockholm alignments. The alignment of poliovirus genomes along with the grammars used to produce Figure 3 are also included along with a Makefile indicating how the data was analyzed. Typing make help in the directory containing the Makefile will display the demonstrations available to users.
Figure 2. A schematic of a DLESS-style phylo-HMM (key: substitution rate, normal or slow): each node of the tree has its own nonterminal, such that the node-rooted subtree evolves at a slower rate than the rest of the tree. Inferring the pattern of hidden nonterminals generating an alignment allows for detecting regions of lineage-specific selection. Expressing this model compactly in XRate's macro language allows it to be used with any input tree without having to write data-specific code or use external model-generating scripts.
Ancestral reconstruction: The use of XRate to reconstruct the sequences at ancestral nodes of a phylogenetic tree, given a grammar, a multiple sequence alignment and a parse tree. This occurs after tree estimation, training and annotation.

Figure Legends
Annotation: The use of XRate to apply a grammar to a multiple sequence alignment and phylogenetic tree, so as to impute the optimal parse tree and mark up the alignment with the co-ordinates of selected features (associated with particular nonterminals in the parse tree), or generate other annotation including GFF and WIGgle files. This occurs after tree estimation and training, but prior to ancestral reconstruction. Can also refer to a specific part of a transformation rule that generates annotations.
(The alphabet is specified in a separate part of the file from the rest of the grammar, and so is sometimes omitted from this definition.)
Grammar symbol: A symbol that is either a nonterminal or a pseudoterminal.

HMM: Hidden Markov Model. An SCFG that is also a regular grammar. See also phylo-HMM.
Hybrid chain: A mapping from tree branches to substitution rate matrices (chains) where the instantaneous rate matrix may vary from one branch to another. This may be used to implement lineage-dependent selection, or other models which are heterogeneous with respect to the tree.
Initial distribution: The initial probability distribution over states in a substitution chain.

Left-emission: See emission.
Left-regular: A grammar is left-regular if it contains no bifurcations and its emissions are all left-emissions.
Macro: A construct that is expanded by the XRate grammar preprocessor and may be used to implement redundant or repetitive grammar models; e.g. grammars with a large number of similar transformation rules sharing the same probability parameter, or substitution chains whose mutation rules all share the same rate parameter.
Multiple sequence alignment: The raw data on which XRate operates, and which constitutes its input and output. XRate cannot align sequences, but assumes that they have been pre-aligned using an external alignment program. Alignments must be converted to Stockholm format [36] before supplying them to XRate. The alignment may include a phylogenetic tree (using the Stockholm syntax for specifying this); if no tree is provided, XRate's tree estimation routines can be used to find one.
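As a rough illustration of the input format, the sketch below reads a small Stockholm alignment in Python and extracts the aligned rows together with an embedded tree. It is not XRate's parser: the function name `parse_stockholm` is hypothetical, only `#=GF` file annotations are retained, and the tree is assumed to be stored under the `#=GF NH` tag.

```python
def parse_stockholm(text):
    """Return (sequences, tree) from a single Stockholm alignment.

    sequences maps each row name to its aligned string; tree is the Newick
    string from "#=GF NH" lines, or None if no tree is present.
    """
    sequences = {}
    file_annotations = {}  # "#=GF" tag -> list of values
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":  # end of alignment
            break
        if line.startswith("#=GF"):
            parts = line.split(None, 2)
            if len(parts) == 3:
                file_annotations.setdefault(parts[1], []).append(parts[2])
            continue
        if line.startswith("#"):
            continue  # other annotation lines are skipped in this sketch
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"cannot parse alignment row: {line!r}")
        name, row = parts
        # Rows may be split over several blocks; concatenate them.
        sequences[name] = sequences.get(name, "") + row
    if len({len(row) for row in sequences.values()}) > 1:
        raise ValueError("alignment rows have unequal lengths")
    tree = "".join(file_annotations.get("NH", [])) or None
    return sequences, tree


EXAMPLE = """# STOCKHOLM 1.0
#=GF NH (human:0.1,mouse:0.3);
human ACGT-A
mouse ACGTTA
//
"""
```

Calling `parse_stockholm(EXAMPLE)` yields the two aligned rows and the Newick string; when the tree comes back as `None`, the alignment corresponds to the case where a tree would first have to be estimated.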
Mutation rule: A single element in the rate matrix of a substitution chain.
Nonterminal: A grammar symbol that may be transformed, by application of transformation rules, into other nonterminals or pseudoterminals. In XRate, a nonterminal must be exclusively associated with (that is, appear on the left-hand side of) either emission rules, transition rules or bifurcation rules.
Parameter: A named parameter in a grammar. May be a probability parameter or a rate parameter.
Parametric model: A grammar whose transformation rules or mutation rules (or both) are specified as functions of the grammar's parameters, rather than as direct numerical values.
Parse tree: A tree structure corresponding to the derivation of a multiple sequence alignment from a grammar. Each tree node is labeled with a grammar symbol: the root node is labeled with the start nonterminal, internal nodes are labeled with nonterminals, and the leaves are labeled with pseudoterminals. Not to be confused with a phylogenetic tree.
PGroup: A set of probability parameters collectively representing a probability distribution over a finite set of events. Following training, probability parameters constituting a PGroup will be normalized to sum to 1.
Phylogenetic tree: The evolutionary tree describing the relationship between sequences in a multiple alignment. XRate uses the Stockholm format for alignments, which allows the tree to be included as an annotation of the alignment. If no tree is provided, XRate's tree estimation routines can be used to find one.

Phylo-HMM: A phylo-SCFG that uses a regular grammar. A phylo-HMM is an HMM whose emissions generate alignment columns by evolving substitution chains on a phylogenetic tree.
Phylo-SCFG: A phylogenetic SCFG: a member of the general class of grammars implemented by XRate. A phylo-SCFG is an SCFG whose emissions generate alignment columns by evolving substitution chains on a phylogenetic tree.
Production rule: See transformation rule.
Probability parameter: A dimensionless parameter that generally takes a value between 0 and 1, and so can occur in the probability part of a transformation rule (or as a multiplying factor in the rate part of a mutation rule). Probability parameters are declared in PGroups.
Pseudocounts: A set of nonnegative counts that specifies a Dirichlet prior distribution over a PGroup.
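The interaction between a PGroup and its pseudocounts can be sketched as follows. This is one common convention (adding the pseudocounts to the observed counts and normalizing, which gives the posterior mean under a Dirichlet prior with those parameters); XRate's own estimator may differ in detail, and the function name `estimate_pgroup` is hypothetical.

```python
def estimate_pgroup(counts, pseudocounts):
    """Normalize (counts + pseudocounts) into a probability distribution.

    counts and pseudocounts map event names to nonnegative numbers.
    """
    events = sorted(set(counts) | set(pseudocounts))
    totals = {
        event: counts.get(event, 0.0) + pseudocounts.get(event, 0.0)
        for event in events
    }
    norm = sum(totals.values())
    if norm <= 0.0:
        raise ValueError("at least one count or pseudocount must be positive")
    return {event: totals[event] / norm for event in events}
```

For example, observed counts of 3, 1, 0 and 0 for A, C, G and T, combined with a pseudocount of 1 for each base, give probabilities of 0.5, 0.25, 0.125 and 0.125, which sum to 1 as required of a PGroup.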
Pseudoterminal: A grammar symbol that is generated via an emission and cannot be further modified by subsequent transformation rules. In a parse tree, a pseudoterminal serves as a placeholder for an alignment column. Pseudoterminals occur in groups associated with a particular substitution chain. In the generative interpretation of the model, alignment columns are generated using the initial distribution and mutation rules of the chain, applied on the phylogenetic tree associated with the alignment.
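The likelihood of a single alignment column under a substitution chain can be computed with Felsenstein's pruning algorithm. The sketch below is a simplified Python illustration, not XRate's implementation: it hard-codes the JC69 chain (uniform initial distribution, equal rates), treats gaps as missing data, and uses a hypothetical tree representation in which each node is a `(name, children)` pair and each child is paired with its branch length.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability of base x becoming y over branch length t
    (t in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial_likelihoods(node, column):
    """Return {base: P(observed leaves below node | node has this base)}."""
    name, children = node
    if not children:
        observed = column[name]
        if observed == "-":  # gap: treated as missing data
            return {b: 1.0 for b in BASES}
        return {b: 1.0 if b == observed else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in children:
        below = partial_likelihoods(child, column)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, length) * below[c] for c in BASES)
    return result


def column_likelihood(tree, column):
    """Sum over root states, weighted by the uniform initial distribution."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[b] for b in BASES)
```

A two-leaf tree would be written as `("root", [(("human", []), 0.1), (("mouse", []), 0.3)])`, with a column given as a dictionary from leaf name to observed character. In a phylo-grammar, one such likelihood is computed for each column and each pseudoterminal group that could have emitted it.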
Rate parameter: A nonnegative parameter that has units of "inverse time" (i.e. rate), and so can occur in the rate part of a mutation rule. Rate parameters can be declared individually.

Regular grammar: A grammar is regular if it is either left-regular or right-regular; that is, it contains no bifurcations and its emissions are all either left-emissions or right-emissions. A regular grammar is equivalent to an HMM.
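Because a regular grammar is equivalent to an HMM, the total likelihood of an alignment under such a grammar can be computed with the forward algorithm. The following Python sketch assumes that the per-column emission likelihoods (one per hidden state) have already been computed on the phylogenetic tree; the function name `forward_likelihood` and the dictionary-based representation are hypothetical, and no explicit end state is modeled.

```python
def forward_likelihood(initial, transition, emission_columns):
    """Total likelihood of a sequence of columns under an HMM.

    initial: {state: probability of starting in state}
    transition: {state: {next_state: probability}}
    emission_columns: list of {state: P(column | state)}, one per column
    """
    if not emission_columns:
        return 1.0
    states = list(initial)
    # Initialize with the first column.
    forward = {s: initial[s] * emission_columns[0][s] for s in states}
    # Recurse over the remaining columns, summing over the previous state.
    for emit in emission_columns[1:]:
        forward = {
            t: emit[t] * sum(forward[s] * transition[s][t] for s in states)
            for t in states
        }
    return sum(forward.values())
```

For a two-state model in the style of a conservation phylo-HMM, the states might be labeled "fast" and "slow", with each entry of `emission_columns` holding the column likelihood under the corresponding substitution chain. Replacing the sum over previous states with a maximum gives the Viterbi recursion that underlies annotation with the single best parse.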

Right-emission: See emission.
Right-regular: A grammar is right-regular if it contains no bifurcations and its emissions are all right-emissions.
Start nonterminal: The first nonterminal declared or used in a grammar. In the generative interpretation of the model, this is the initial grammar symbol to which transformation rules are applied. It is also the label of the root node in the parse tree.
State: In the context of a phylo-grammar, this term is ambiguous: it can refer either to a state-tuple in a chain, or (for phylo-HMMs) to a nonterminal in a grammar. For the most part in this paper, and exclusively in this glossary, we use it in the former sense.
Terminal: See token.
Token: An alphabet symbol. (Also called a terminal.)
Training: The use of XRate to estimate a grammar's parameters, mutation rule rates and transformation rule probabilities, given one or more multiple alignments. This occurs after tree estimation and prior to annotation or ancestral reconstruction.
Transformation rule: A probabilistic rule that describes the transformation of a nonterminal symbol into a sequence of zero or more grammar symbols. (Also called a production rule.) A transformation rule may be an emission, a transition or a bifurcation.
Transition: A transformation rule that generates exactly one nonterminal (and no pseudoterminals).
Tree: In the context of a phylo-grammar, this term is ambiguous: it can mean a parse tree (which explains the "horizontal", i.e. spatial, structure of an alignment) or a phylogenetic tree (which explains the "vertical", i.e. temporal, structure).
Tree estimation: The use of XRate to estimate a phylogenetic tree for a multiple sequence alignment, given a grammar. This occurs prior to training, annotation or ancestral reconstruction.

B Tables of Scheme functions in Darts
The following list of Scheme functions, natively implemented within selected DART programs (including

(ln-gamma k): Calculates the gamma function
Calculates the gamma probability density, $\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$
(incomplete-gamma x alpha beta): Calculates the incomplete gamma function, i.e. the integral of the gamma density up to x
(incomplete-gamma-inverse p alpha beta): Calculates the inverse of the incomplete gamma function
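For readers who wish to check values returned by the gamma-related functions, the following Python sketch computes comparable quantities. It is not the DART implementation: the Python names merely mirror the Scheme ones, `ln_gamma` is assumed (from its name) to return the natural logarithm of the gamma function, the name `gamma_density` is hypothetical, the incomplete gamma function is evaluated by a simple power series that is adequate only for moderate values of βx, and the inverse is found by bisection.

```python
import math


def ln_gamma(k):
    """Natural logarithm of the gamma function (assumed meaning of ln-gamma)."""
    return math.lgamma(k)


def gamma_density(x, alpha, beta):
    """Gamma density beta^alpha / Gamma(alpha) * x^(alpha-1) * exp(-beta*x)."""
    if x <= 0.0:
        return 0.0  # this sketch returns 0 outside x > 0
    log_density = (alpha * math.log(beta) - math.lgamma(alpha)
                   + (alpha - 1.0) * math.log(x) - beta * x)
    return math.exp(log_density)


def incomplete_gamma(x, alpha, beta, tolerance=1e-14, max_terms=10000):
    """Integral of the gamma density from 0 to x, by power series."""
    if x <= 0.0:
        return 0.0
    z = beta * x
    term = 1.0 / alpha
    total = term
    for n in range(1, max_terms):
        term *= z / (alpha + n)
        total += term
        if term < tolerance * total:
            break
    return total * math.exp(alpha * math.log(z) - z - math.lgamma(alpha))


def incomplete_gamma_inverse(p, alpha, beta):
    """Find x such that incomplete_gamma(x, alpha, beta) = p, by bisection."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    low, high = 0.0, 1.0
    while incomplete_gamma(high, alpha, beta) < p:
        high *= 2.0  # expand the bracket until it contains the solution
    for _ in range(200):
        mid = 0.5 * (low + high)
        if incomplete_gamma(mid, alpha, beta) < p:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)
```

With alpha = 1 the gamma density reduces to the exponential density β e^{−βx}, so the incomplete gamma function is 1 − e^{−βx}; this provides a convenient closed-form check on all three functions.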