MixtureTree Annotator: A Program for Automatic Colorization and Visual Annotation of MixtureTree

The MixtureTree Annotator, written in Java, allows the user to automatically color any phylogenetic tree in Newick format generated by any phylogeny reconstruction program and to output the result as a Nexus file. By providing the ability to automatically color the tree by sequence name, the MixtureTree Annotator provides a unique advantage over other programs that perform a similar function. In addition, the MixtureTree Annotator is the only package that can efficiently annotate the output produced by MixtureTree with mutation information and coalescent time information. A modified version of FigTree is used to visualize the resulting output file. Certain popular programs that lack good built-in visualization tools, for example MEGA, Mesquite, PHY-FI, TreeView, TreeGraph and Geneious, may introduce human error because colors must be added manually to each node, or have other limitations, such as coloring only by a number (for example, branch length) or by taxonomy. The MixtureTree Annotator is fast and easy to use, while still allowing the user full control over the coloring and annotating process.


Introduction
The Newick tree format [1] is used in many scientific disciplines, with a major role in reconstructive phylogeny. The format is relatively simple and provides the ability to show the relative distance and relationship between leaves (i.e., operational taxonomic units, OTUs); however, it lacks the ability for the user to add color and annotations to each branch. In reconstructive phylogeny, it is important to be able to show clusters of leaves and to provide annotations such as mutation information, especially when the sample size is large. The MixtureTree Annotator allows the user to automatically color any given Newick tree generated by many popular software packages, including but not limited to MixtureTree [2], MEGA [3], MrBayes [4], and SeaView [5]. By providing the ability to automatically color the tree by sequence name, the MixtureTree Annotator provides an advantage over other current programs. For example, MEGA [3], Mesquite [6], PHY-FI [7], TreeView [8], and Geneious [9], the most popular programs that allow the user to add color to a Newick tree, require the user to manually add color to each node; this can easily result in repetitive clicking with a high potential for human error. TreeGraph [10] allows the user to automatically color a tree to some extent, but it can only color based on a number, such as branch length. PhyloView [11] also allows the user to color the tree automatically by taxonomy, but requires the dataset to be named in a more specific manner. In contrast to the above programs, the MixtureTree Annotator allows the user to easily assign user-defined colors to different groups of sequences that have commonalities such as source population or phenotypic character state. In addition, MixtureTree Annotator is the only program available that can properly annotate the output produced by the MixtureTree [2,14,15,16] with mutation information and coalescent time information. 
However, trees that are not generated by the MixtureTree package cannot be annotated at this time. A program that provides similar coloring abilities to MixtureTree Annotator is ColorTree [12], but it does not provide annotation abilities.

Material and Methods
The MixtureTree Annotator can accept either a single Newick tree file as input, or the entire output dataset of MixtureTree [2]. As illustrated by the main screen (Fig 1), there are five different types of files which may be entered. All features of this program require a Newick tree file as input, and require the user to specify an output file. In order to provide annotation ability, this program requires two additional input files: the sequence file and the log file. Colorization ability can be enhanced by providing a file for group definitions. The sequence file contains a list of sequence names, nucleotide data and frequencies. The log file, generated by the MixtureTree algorithm, contains the debugging output. The group file contains a list of group names and their sequence members. These files are described in further detail in the User Guide. The user may specify whether to add color, annotation information, or both to the resulting tree. The output file generated by the MixtureTree Annotator is a modified form of the Nexus format. In order to visualize the output file from this program, a modified version of the program FigTree [17] must be used.

Results
In order to demonstrate the utility of this program in a real-life application, a sample dataset from the International HapMap Project is used. It includes data from two human population groups: the Yoruba people from Nigeria, Africa (YRI) and the U.S. European American group (Eu_Am) [13]. There are 52 sequences used in this dataset, with 34 from Eu_Am and 18 from YRI. One hundred sites were taken from Chromosome 1 between 3290990 and 3498109 of HapMap Phase 3 Release 2, NCBI Build 36. The resulting tree was generated by MixtureTree v3.0 [2,15] using the modalEM algorithm with a sliding scale (p) value of 0.001 [2,14,15,16]. The discussion is divided into three parts: Newick Tree Colorization, Newick Tree Annotation, and entering external files.

Newick Tree Colorization
Colorization in Newick trees is important because it enables the user to quickly and accurately visualize clusters of DNA sequences, especially when the sample size is large. A typical coloring screen is shown in Fig 2, in which the different lineages are listed on the left and the color picker is on the right. Based on user preference, the YRI population group could be colored red and the Eu_Am population group colored blue. A Newick tree is normally drawn in a single, solid black color. As shown in Fig 3, the benefits of tree colorization are clear: one immediately notices two large distinct clusters, namely the Eu_Am and YRI groups. The advantage of this package is that by clicking on the color picker, the user can easily assign a color to each group or sequence.
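The idea of coloring by sequence name can be illustrated with a minimal Java sketch. It is not taken from the MixtureTree Annotator source: the class `NameColorSketch`, its methods, the hex color strings and the short example tree are all hypothetical, and the sketch assumes unquoted leaf labels whose group is identified by a name prefix such as `YRI` or `Eu_Am`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the MixtureTree Annotator.
class NameColorSketch {
    // A leaf label directly follows "(" or "," and ends at ":", ",", ")" or ";".
    private static final Pattern LEAF =
            Pattern.compile("(?<=[(,])\\s*([^():,;\\s]+)");

    /** Collects unquoted leaf labels in the order they appear in the Newick string. */
    static List<String> leafNames(String newick) {
        List<String> names = new ArrayList<>();
        Matcher m = LEAF.matcher(newick);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Maps each name to the color of the first matching prefix, or to a fallback. */
    static Map<String, String> colorByPrefix(List<String> names,
                                             Map<String, String> prefixColors,
                                             String fallback) {
        Map<String, String> colors = new LinkedHashMap<>();
        for (String name : names) {
            String chosen = fallback;
            for (Map.Entry<String, String> e : prefixColors.entrySet()) {
                if (name.startsWith(e.getKey())) {
                    chosen = e.getValue();
                    break;
                }
            }
            colors.put(name, chosen);
        }
        return colors;
    }

    public static void main(String[] args) {
        Map<String, String> prefixColors = new LinkedHashMap<>();
        prefixColors.put("YRI", "#ff0000");   // red (assumed color encoding)
        prefixColors.put("Eu_Am", "#0000ff"); // blue (assumed color encoding)
        // Made-up three-leaf tree, not the HapMap tree from the text.
        String newick = "((Eu_Am5:0.1,Eu_Am6:0.1):0.2,YRI1:0.3);";
        System.out.println(colorByPrefix(leafNames(newick), prefixColors, "#000000"));
    }
}
```

Because the assignment is driven by the names alone, adding further sequences to a group requires no extra manual steps, which is the property the text contrasts with per-node coloring.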

Newick Tree Annotation
Annotation of phylogenetic trees is important for estimating the actual ancestral sequences and the evolution time. The log file generated by MixtureTree records every change of nucleotide over time. The MixtureTree Annotator displays when mutations happen in xy### format, where x is the ancestral nucleotide at time t+ε, y is the mutated ancestor type at time t, and ### is the site at which the mutation occurred. Fig 4 shows an example where the currently observed nucleotide at site 54 of Eu_Am5 is a G, and it is an A in the ancestral sequence at site 54 when time t = 2.0099. The MixtureTree algorithm [2,14,15,16] constructs the tree in a reverse time manner. The currently observed sequences are given at time t = 0. The most recent common ancestor is located at the far left. The case at a given time point t is shown in Table 1. The mutation information for this case is AT1. That is, at time t+ε, where ε>0, the nucleotide of site 1 mutates from A to T. The merge time information is in a self-explanatory format. The time scale used is described in Chen and Lindsay, 2006 [16]. For further clarification, AG54 in Fig 4 shows that at time t = 2.0099, the common ancestral sequence of Eu_Am5, Eu_Am6, and Eu_Am22 at site 54 has nucleotide A. At time t = 0, the sequence of Eu_Am5 at site 54 has nucleotide G. This is because the mutation (at time t = 2.0099, site 54, from A to G) makes Eu_Am5 a distinct lineage. In this specific example, Eu_Am6 and Eu_Am22 actually contain the same genetic sequences.
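As a small illustration of the xy### label format, the following Java sketch splits a label such as AG54 into its three parts. The class `MutationLabel` and its field names are hypothetical and are not taken from the MixtureTree Annotator; the sketch also assumes that only the four nucleotides A, C, G and T appear in a label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for illustration; not part of the MixtureTree Annotator.
class MutationLabel {
    // Two nucleotide letters followed by the site number, e.g. "AG54".
    private static final Pattern FORMAT = Pattern.compile("([ACGT])([ACGT])(\\d+)");

    final char ancestral; // x: nucleotide before the mutation
    final char derived;   // y: nucleotide after the mutation
    final int site;       // ###: site at which the mutation occurred

    private MutationLabel(char ancestral, char derived, int site) {
        this.ancestral = ancestral;
        this.derived = derived;
        this.site = site;
    }

    /** Parses a label in xy### format, rejecting anything that does not match. */
    static MutationLabel parse(String label) {
        Matcher m = FORMAT.matcher(label);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not in xy### format: " + label);
        }
        return new MutationLabel(m.group(1).charAt(0),
                                 m.group(2).charAt(0),
                                 Integer.parseInt(m.group(3)));
    }

    String describe() {
        return "site " + site + ": " + ancestral + " -> " + derived;
    }

    public static void main(String[] args) {
        System.out.println(MutationLabel.parse("AG54").describe());
    }
}
```

Read this way, AG54 says that site 54 changed from the ancestral A to the G observed in Eu_Am5, and AT1 says that site 1 changed from A to T.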

Entering external files: An example using Newick tree files generated from MixtureTree
In Fig 5, the sequence file (filename.y) and group file (filename.g) can be generated from the table converter. The table converter, a supplementary package included in MixtureTree, converts the sequence format into MixtureTree input format. MixtureTree generates a tree file in Newick format (filename.tre) and a log file (filename.log) with mutation information at each given time back to the ancient state. The tree file and log file can be placed directly into the 'Newick File' and 'Log File' selection bars, respectively.
The output of MixtureTree Annotator is in Nexus format and contains all the color and annotation information. The name and directory of the output file can be assigned by the user.

Entering external files: An example using simple Newick tree files generated from other packages
Users can also input a Newick file from any phylogeny reconstruction program and output the .nxs file. We use MEGA 6 [27] to reconstruct the phylogeny of one example data set. Next, we save and export the phylogenetic tree in Newick format and load the Newick file generated by MEGA into the 'Newick File' selection bar. As shown in Fig 6, even though there is no sequence file, group file or log file, MixtureTree Annotator can still generate the colorized tree when users input a Newick file generated by other programs. The resulting tree is shown in Fig 7.

Distances Supports
The branch lengths (both internal and external) represent the distance information, namely the time at which two sequences merged. The distance information is written in the Newick file.
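To show where this distance information sits in a Newick string, the Java sketch below reads the branch length that follows each leaf label after a colon. The class `LeafBranchLengths` and the example tree are hypothetical; the sketch assumes unquoted labels and plain decimal or scientific-notation lengths, and it deliberately skips the lengths of internal branches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the MixtureTree Annotator.
class LeafBranchLengths {
    // A leaf entry looks like "name:length" and directly follows "(" or ",".
    private static final Pattern LEAF_WITH_LENGTH =
            Pattern.compile("(?<=[(,])\\s*([^():,;\\s]+):([0-9.eE+-]+)");

    /** Returns leaf name -> external branch length, in order of appearance. */
    static Map<String, Double> read(String newick) {
        Map<String, Double> lengths = new LinkedHashMap<>();
        Matcher m = LEAF_WITH_LENGTH.matcher(newick);
        while (m.find()) {
            lengths.put(m.group(1), Double.parseDouble(m.group(2)));
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Made-up tree; the internal branch ":0.2" follows ")" and is not reported.
        String newick = "((Eu_Am5:0.1,Eu_Am6:0.1):0.2,YRI1:0.3);";
        for (Map.Entry<String, Double> e : read(newick).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```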

Discussion and Conclusions
The unique advantage of MixtureTree Annotator is that it is the only package that can easily and efficiently annotate the output produced by MixtureTree with mutation information and coalescent time information. Table 2 compares MixtureTree Annotator with other active tree visualization tools, including Dendroscope [23], HyperTree [19], NJPlot [26], HyperGeny [20], CTree [18], and BAOBAB [21]. These useful tree drawing, editing and manipulation tools can generate the topology with subtree collapsing, re-rooting, rotating, adding and removing taxa, and coloring taxa individually in different layouts (rectangular, slanted and circular views). Some packages also annotate information such as branch lengths, confidence values and subtree summaries on the internal nodes. Phylowidget [24] and Archaeopteryx [22] are relatively comprehensive tools that include both annotation and tree handling functionality. The tree viewer iTOL [25], TreeIllustrator [28], and Archaeopteryx [22] also integrate the taxonomy of organisms into the package, which allows users to compare and identify new data against the known classification of organisms. Although MixtureTree Annotator does not incorporate taxonomic information, it does integrate the functions of FigTree (http://tree.bio.ed.ac.uk/software/figtree/), which offers both annotation and tree manipulation, as other packages do. Its colorization can easily define the color of a group from information given in the sequence name, using the graphical interface. Its functionality is more intuitive than that of other published packages. Another distinctive and essential feature of MixtureTree Annotator is the annotation of mutation events. The mutation estimates come from MixtureTree's phylogeny reconstruction package, and they tell users the specific time and site at which mutations occur in a sequence. MixtureTree Annotator is currently the only package available that illustrates this information on the tree topology.
These features help researchers better interpret phylogeny and make hypotheses from the relationships and clusters of taxa, and convey their ideas to readers more efficiently.
From the sample dataset above, it is clear that the MixtureTree Annotator is useful both to users of MixtureTree and to users who want an enhanced visualization of general phylogenetic trees. The MixtureTree Annotator is a colorization and annotation program that is designed to assist the user when visualizing phylogenetic trees. It gives the user fine-grained control over the different settings while remaining easy to use. By using this program, a much clearer picture can be formed of the ancestral lines represented by different trees.
One of the more useful features of this package is easy colorization of groups based on names of sequences, and the presentation of the ancient states of nucleotides. MixtureTree

Availability and Requirements
The MixtureTree Annotator binary, source code, and S1 User Guide are available at http://www.mixturetree.net. It is a platform-independent, Java-based program that requires Java 1.6 or higher. It implements a method of Newick tree colorization and provides visual annotation for MixtureTree. Anyone who uses this program is requested to cite the MixtureTree website and this paper.