PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences

doi:10.1371/journal.pone.0034261

Table 1.

Performance Comparison of Phylogenetic Inference Methods.

Figure 1.

PHYRN concept and work flow.

PHYRN begins by (i–ii) defining and extracting the domain-specific region among the query sequences. (iii) Domain-specific regions are then used to create a PSSM library using PSI-BLAST. (iv–v) Positive alignments are then calculated between the queries and the PSSM library using rpsBLAST, and encoded as a PHYRN product score (percentage identity × percentage coverage) matrix. (vi) The product score matrix is converted to a Euclidean distance matrix by calculating the Euclidean distance between each query pair. (vii) Phylogenetic trees are then inferred using Neighbor Joining, WEIGHBOR, Minimum Evolution, NINJA, or FastME.
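
Steps (iv) to (vi) can be illustrated with a minimal Python sketch. It is not the PHYRN implementation: the function names, the toy `hits` values, and the choice to express identity and coverage as fractions between 0 and 1 are all assumptions made for illustration. Each row of the score matrix corresponds to one query and each column to one PSSM; the distance between two queries is the Euclidean distance between their rows.

```python
import math

def product_score(identity, coverage):
    """Product score for one query/PSSM alignment (identity x coverage).

    Both inputs are assumed to be fractions in [0, 1]; the caption does not
    state the scaling, so this is an illustrative choice.
    """
    return identity * coverage

def score_matrix(hits):
    """Turn per-query lists of (identity, coverage) pairs into a score matrix."""
    return [[product_score(i, c) for (i, c) in row] for row in hits]

def euclidean_distance_matrix(scores):
    """Pairwise Euclidean distances between the rows (queries) of a score matrix."""
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d = math.dist(scores[a], scores[b])
            dist[a][b] = dist[b][a] = d
    return dist

# Hypothetical alignment results: 3 queries against 3 PSSMs.
# A query with no positive alignment to a PSSM gets (0.0, 0.0).
hits = [
    [(0.60, 0.90), (0.00, 0.00), (0.25, 0.80)],  # query 1
    [(0.55, 0.95), (0.10, 0.50), (0.00, 0.00)],  # query 2
    [(0.00, 0.00), (0.40, 1.00), (0.30, 0.70)],  # query 3
]

scores = score_matrix(hits)
distances = euclidean_distance_matrix(scores)
```

The resulting symmetric matrix `distances` is the kind of input a distance-based tree builder such as Neighbor Joining expects.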

Figure 2.

Characteristics of ROSE Data Sets.

Multiple data sets (n = 25) were generated using ROSE at each divergence range (PAM distance = 100–700). The true alignment provided by the ROSE simulation was used to calculate the percentage identity and gap statistics. A) Average percent identity calculated from each data set decreases with increasing PAM distance (n = 25, Error Bars: +/− S.E.M.). B) Distribution of average indel events per position at different divergence ranges (PAM100–700). Average indel events are calculated by dividing the total number of gaps by the total number of amino acid positions in all sequences represented in the 25 replicates. C–F) Distribution of gap lengths in all replicates generated at PAM100–PAM700 (number of replicates = 25, number of sequences in each replicate = 100, average length of each sequence = 450 aa). AGL: Average Gap Length, calculated as the mean of all gap lengths in all 25 replicates. ISR: Indel event Rate/Substitution Rate.
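
The gap statistics described in panels B to F can be sketched as follows. This is an illustrative reading of the caption, not the authors' script. It assumes that a "gap" is a maximal run of `-` characters in an aligned sequence, that "amino acid positions" means non-gap characters, and that percent identity is taken over columns where both sequences carry a residue. The function names and toy sequences are hypothetical.

```python
import re

GAP_RUN = re.compile(r"-+")  # one maximal run of gap characters = one gap

def gap_statistics(aligned_seqs):
    """Return (indel events per residue position, average gap length).

    Assumption: each maximal run of '-' counts as one gap, and the
    denominator is the total number of non-gap characters.
    """
    gap_lengths = [len(m.group()) for s in aligned_seqs
                   for m in GAP_RUN.finditer(s)]
    residues = sum(len(s) - s.count("-") for s in aligned_seqs)
    events_per_position = len(gap_lengths) / residues
    average_gap_length = (sum(gap_lengths) / len(gap_lengths)
                          if gap_lengths else 0.0)
    return events_per_position, average_gap_length

def percent_identity(seq_a, seq_b):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Toy alignment (hypothetical sequences, far shorter than the 450 aa used in the study).
alignment = ["AC--GT", "ACDEGT", "A-DE-T"]
rate, agl = gap_statistics(alignment)
```

For a full data set, the same functions would be applied to all sequences pooled across the 25 replicates at a given PAM distance.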

Figure 3.

Accuracy Comparison of Different MSA methods.

Graphical representation of the average Robinson-Foulds Distance from true ROSE trees (PAM700) for trees generated using different MSA methods. All trees were inferred using Neighbor-Joining (n = 21, Error Bars: +/− S.E.M.). The number of sequences in each data set = 100. Maximum possible RF distance = 194.
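
The maximum of 194 follows from the fact that a fully resolved unrooted tree on n leaves has n − 3 internal branches, so two such trees that share no branch differ by 2(n − 3) = 194 when n = 100. The sketch below, which assumes each tree is supplied as a set of non-trivial bipartitions, shows the calculation; the helper names and the five-taxon example are hypothetical.

```python
def max_rf_distance(n_taxa):
    """Largest RF distance between two fully resolved unrooted trees."""
    return 2 * (n_taxa - 3)

def canonical_split(side, all_taxa):
    """Represent a bipartition by the side that excludes the smallest taxon label,
    so that the same split always maps to the same frozenset."""
    side = frozenset(side)
    anchor = min(all_taxa)
    return frozenset(all_taxa) - side if anchor in side else side

def rf_distance(splits_a, splits_b):
    """Symmetric (Robinson-Foulds) distance: splits found in exactly one tree."""
    return len(set(splits_a) ^ set(splits_b))

taxa = {"A", "B", "C", "D", "E"}

# Tree 1: ((A,B),C,(D,E))   Tree 2: ((A,C),B,(D,E))
tree1 = {canonical_split({"A", "B"}, taxa), canonical_split({"D", "E"}, taxa)}
tree2 = {canonical_split({"A", "C"}, taxa), canonical_split({"D", "E"}, taxa)}

rf = rf_distance(tree1, tree2)      # the two trees share only the (D,E) split
limit_100 = max_rf_distance(100)    # data sets in the figure have 100 sequences
```

In practice the bipartitions would be extracted from tree files by a phylogenetics library; that step is omitted here.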

Figure 4.

Performance Comparison of PHYRN with other Phylogenetic Inference methods.

A–C) Graphical representation of average symmetric distance (Robinson-Foulds Distance) between the true ROSE tree and trees estimated using PHYRN, ACS-NJ (Average Common Substring, Alignment-free method), MUSCLE-FastME (corrected distance method), MSA-PAUP (Maximum Parsimony), MSA-RAxML (Maximum Likelihood), and MUSCLE-GARLI (Maximum Likelihood) based methods. Number of replicates tested at each divergence range = 25, Error bars = +/− S.E.M. The number of sequences in each data set = 100, Avg. Length of sequences = 450. Maximum possible RF Distance = 194.
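
The plotted values are means over replicates with S.E.M. error bars. A minimal sketch of that summary is given below, assuming the S.E.M. is the sample standard deviation (n − 1 denominator) divided by the square root of the number of replicates; the caption does not state which estimator was used. The function name and the RF values are hypothetical and are not data from the study.

```python
import math
import statistics

def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of replicate values.

    Assumption: SEM = sample standard deviation / sqrt(n).
    """
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem

# Hypothetical RF distances for five replicates of one method at one PAM distance.
rf_values = [120, 124, 118, 130, 128]
mean_rf, sem_rf = mean_and_sem(rf_values)

# Normalising by the maximum possible RF distance (194 for 100 sequences)
# puts results on a 0 to 1 scale.
normalised_mean = mean_rf / 194
```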

Figure 5.

PHYRN outperforms MrBayes in ‘midnight-zone’ synthetic data sets.

A) Graphical representation of symmetric distance (from true ROSE trees) for trees inferred using PHYRN and MrBayes. Data sets were generated using ROSE at a PAM distance of 700. Number of data sets tested: n = 5. The number of sequences in each data set = 100. Maximum possible RF distance = 194. Error Bars: +/− S.E.M. B) Consensus tree between the true ROSE tree and the PHYRN tree (PAM 700 data set 1). Red circles mark nodes that are incorrectly inferred by PHYRN. C) Consensus tree between the true ROSE tree and the MrBayes tree (PAM 700 data set 1). Red circles mark nodes that are incorrectly inferred by MrBayes.

Figure 6.

Effect of ‘True Alignment’ on Phylogenetic Inference.

Graphical representation of average symmetric distance (Robinson-Foulds Distance) between the true ROSE tree and trees estimated using PHYRN, corrected distance (FastME) and ML methods (GARLI). Corrected Distance and ML trees were generated with both MUSCLE alignment, and True Alignment (TA) provided by ROSE. (Number of replicates tested at each divergence range = 25, Error bars = +/− S.E.M. Number of sequences in each data set = 100, Avg. Length of sequences = 450). Maximum possible RF distance = 194.

Figure 7.

Comparison of Topology and Resampling Statistics for Various Tree Construction Methods.

Collapsed unrooted phylogenetic trees for the DANGER superfamily generated using (A) PHYRN-NJ, (B) MUSCLE-MrBayes, (C) MUSCLE-PhyML, (D) MUSCLE-NJ, (E) CLUSTAL-NJ and (F) TCOFFEE-NJ. For PHYRN trees the statistics are represented by two numbers, with Bootstrap listed first followed by Jackknife statistics. Statistics for panel A were calculated from resampling results from 3000 replicates. Bootstrap statistics for panels B–F were calculated from resampling results from 1000 replicates.

Figure 8.

Model for the Evolution of the DANGER Superfamily.

Graphical representation of the Neighbor-Joining (NJ) tree for 108 DANGER sequences generated using PHYRN. In this model, DANGER appeared first in cnidarian organisms (Nematostella) and then evolved into 6 different clades. The chordate-specific group, D1, attains the furthest position from the putative root (D6).
