PACo: A Novel Procrustes Application to Cophylogenetic Analysis

We present Procrustean Approach to Cophylogeny (PACo), a novel statistical tool to test for congruence between phylogenetic trees, or between phylogenetic distance matrices of associated taxa. Unlike previous tests, PACo evaluates the dependence of one phylogeny upon the other. This makes it especially appropriate to test the classical coevolutionary model that assumes that parasites that spend part of their life in or on their hosts track the phylogeny of their hosts. The new method does not require fully resolved phylogenies and allows for multiple host-parasite associations. PACo produces a Procrustes superimposition plot enabling a graphical assessment of the fit of the parasite phylogeny onto the host phylogeny and a goodness-of-fit statistic, whose significance is established by randomization of the host-parasite association data. The contribution of each individual host-parasite association to the global fit is measured by means of jackknife estimation of their respective squared residuals and confidence intervals associated with each host-parasite link. We carried out simulations to evaluate the performance of PACo in terms of Type I and Type II errors with respect to two similar published tests. In most instances, PACo performed at least as well as the other tests and showed higher overall statistical power. In addition, the jackknife estimation of squared residuals enabled more elaborate evaluations of the nature of individual links than the ParaFitLink1 test of the program ParaFit. In order to demonstrate how it can be used in real biological situations, we applied PACo to two published studies using a script written in the public-domain statistical software R.


Introduction
We present an R [1] script to carry out PACo (Procrustean Approach to Cophylogeny), an application of Procrustes analysis for comparison of phylogenetic trees of associated organisms, such as hosts and parasites.
PACo provides a residual sum of squares of the Procrustean fit that measures the congruence between two given phylogenies and uses a permutation approach to test its significance. The analysis allows for multiple host-parasite associations and different numbers of hosts and parasites. Because in the Procrustean superimposition the host matrix is kept fixed, whereas the parasite matrix is rotated and scaled to fit the former, PACo tests the classical view of whether the parasite phylogeny is constrained by the host phylogeny [2]. This implies that the null hypothesis tested is slightly different from that of previous tests of phylogenetic congruence [3][4][5]. In addition to hypothesis testing, PACo provides a superimposition plot enabling a graphical comparison of the fit of the host-parasite associations, and a residual bar chart for evaluation of the contribution of the individual host-parasite associations to the global fit.

The R script
All the computations described in the accompanying paper can be carried out with the R script below. R runs on a wide variety of Linux/Unix platforms, Windows and MacOS and can be downloaded at http://www.r-project.org/. In addition to the basic R install, two dedicated packages need to be installed to implement PACo: ape [6], required for handling of phylogenetic data and Principal Coordinates Ordination, and vegan [7], required for Procrustes fitting (see http://cran.r-project.org/doc/manuals/R-admin.html#Installingpackages for details on how to install R packages). In order to assist users with little or no experience with R, we provide annotations to the script. The analyses can be implemented by copying and pasting the code below into an open R console. The text in red identifies parameters that can be customized to adapt the analysis to the user's needs.
The script is demonstrated with the phylogenies of pocket gophers and their chewing lice based on the mitochondrial cytochrome oxidase I sequences of Hafner et al. [8], which represent a classical example of host-parasite cospeciation [9]. The input files required are shown in the Appendix below and can be downloaded, together with a fully annotated R script, at http://www.uv.es/cophylpaco/index.html.
First, load the two packages required. At the R prompt, write

library(ape)
library(vegan)

Data input
Input files should be in plain text format, either space- or tab-delimited (see Appendix for examples). Three files are required. Two of them encapsulate the host and parasite phylogenies, respectively, and will eventually be transformed into distance matrices between host or parasite taxa. The third one consists of a binary matrix coding the host-parasite associations, where host species are arranged in rows and parasites in columns; 1's indicate occurrence of a given parasite in a given host and 0's denote parasite absences in the host. The input files should include taxa labels that have to match exactly in the three files. Any mismatch will cause execution errors and/or incorrect results. (For clarity of the graphical output, short labels are recommended.) The following syntax reads the matrix of host-parasite associations and computes the number of associations (NLinks) required for further computations:

HP <- as.matrix(read.table("PACo/example/gophers/g-l_links.txt", header=TRUE))
NLinks = sum(HP)

As illustrated by the foregoing code, the path pointing to the file system location should be indicated.

The syntax to load the phylogenetic input depends on the type of data used. There are three possibilities: (a) phylogenetic trees, (b) aligned sequences or (c) distance matrices.
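The reading step above can be tried without any files. The following is a minimal sketch, assuming a hypothetical toy table of three hosts (H1 to H3) and four parasites (P1 to P4) that is not part of the gopher-louse data; the label vectors and the check on them are likewise illustrative additions, not part of the original script.

```r
# Hypothetical toy association table: hosts in rows, parasites in columns.
# The header line has one field fewer than the data lines, so read.table()
# uses the first column as row names.
links_txt <- c("P1 P2 P3 P4",
               "H1 1 0 0 0",
               "H2 0 1 1 0",
               "H3 0 0 0 1")
HP <- as.matrix(read.table(text = links_txt, header = TRUE))
NLinks <- sum(HP)  # number of host-parasite associations

# Hypothetical label vectors standing in for the tip labels of the two phylogenies.
host_labels <- c("H1", "H2", "H3")
para_labels <- c("P1", "P2", "P3", "P4")

# Labels must match exactly across the three inputs.
labels_ok <- setequal(rownames(HP), host_labels) &&
             setequal(colnames(HP), para_labels)
```

Running a check of this kind before the analysis is one way to catch label mismatches early, rather than discovering them through execution errors later on.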

a) Phylogenetic trees
Use the read.tree or read.nexus functions to open tree files in Newick or Nexus formats, respectively.
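As an illustration, the sketch below reads two trees and converts them into distance matrices. It assumes that patristic (cophenetic) distances between tips are the intended transformation; the Newick strings are hypothetical toy trees supplied through the text argument of read.tree, whereas in practice a file path would be given instead.

```r
library(ape)

# Hypothetical toy trees in Newick format (not the gopher-louse phylogenies).
TreeH <- read.tree(text = "((H1:1,H2:1):1,H3:2);")
TreeP <- read.tree(text = "(((P1:1,P2:1):0.5,P3:1.5):0.5,P4:2);")

# Patristic distances between tips; row and column names are the tip labels.
host.D <- cophenetic(TreeH)
para.D <- cophenetic(TreeP)
```

The resulting matrices carry the tip labels as row and column names, which is what allows them to be matched against the labels of the association matrix.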

Procrustean Superimposition
The host and parasite distance matrices are first sorted according, respectively, to the order of rows (hosts) and columns (parasites) of the host-parasite association matrix. The Procrustes fit may then produce the following output:

Warning message:
In procrustes(PACo.fit$H.PCo, PACo.fit$P.PCo) :
  X has fewer axes than Y: X adjusted to comform Y.
This indicates that the host input matrix has fewer columns than the parasite counterpart. No action by the user is required since the narrower matrix is completed with zero columns [13].
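The sorting and superimposition steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the original script: the trees and association matrix are hypothetical toy data, the helper paco_coords is a name introduced here, Principal Coordinates are computed with pcoa without any correction for negative eigenvalues, and procrustes is called with its default settings. Only the names PACo.fit, H.PCo and P.PCo are taken from the warning message above.

```r
library(ape)
library(vegan)

# Hypothetical toy data: six hosts, six parasites, seven links.
TreeH <- read.tree(text = "(((H1:1,H2:1):1,H3:2):1,((H4:1,H5:1):1,H6:2):1);")
TreeP <- read.tree(text = "(((P1:1.1,P2:0.9):1,P3:2.2):1,((P4:1,P5:1.3):0.8,P6:2):1.1);")
HP <- diag(6)
dimnames(HP) <- list(paste0("H", 1:6), paste0("P", 1:6))
HP["H3", "P2"] <- 1          # one host with two parasites
NLinks <- sum(HP)

# Sort the distance matrices to follow the rows and columns of HP.
host.D <- cophenetic(TreeH)[rownames(HP), rownames(HP)]
para.D <- cophenetic(TreeP)[colnames(HP), colnames(HP)]

# Hypothetical helper: principal coordinates of each matrix, with one row
# per host-parasite link so that both configurations have NLinks rows.
paco_coords <- function(H.dist, P.dist, HP.bin) {
  links <- which(HP.bin > 0, arr.ind = TRUE)
  H.PCo <- pcoa(H.dist)$vectors[links[, 1], , drop = FALSE]
  P.PCo <- pcoa(P.dist)$vectors[links[, 2], , drop = FALSE]
  list(H.PCo = H.PCo, P.PCo = P.PCo)
}

PACo.fit <- paco_coords(host.D, para.D, HP)
# Host configuration is the fixed target; the parasite one is rotated and scaled.
HP.proc <- procrustes(PACo.fit$H.PCo, PACo.fit$P.PCo)
m2.obs <- HP.proc$ss         # residual sum of squares of the fit
```

Duplicating rows of the coordinate matrices per link is what lets a host with several parasites, or a parasite on several hosts, enter the fit more than once.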
The host-parasite superimposition plot (as shown in Figure S1) can then be drawn from the fitted configurations.
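One possible way to draw such a plot with base graphics is sketched below. The coordinates are hypothetical random values standing in for the host and parasite principal coordinates, and the plotting choices (symbols, arrows from parasite to host positions) are assumptions made for illustration rather than the layout of the original figure.

```r
library(vegan)

set.seed(1)
# Hypothetical configurations: eight links in two dimensions.
H.PCo <- matrix(rnorm(16), ncol = 2,
                dimnames = list(paste0("H", 1:8), NULL))
P.PCo <- H.PCo + matrix(rnorm(16, sd = 0.2), ncol = 2)

HP.proc <- procrustes(H.PCo, P.PCo)
HostX <- HP.proc$X      # fixed (centred) host configuration
ParY  <- HP.proc$Yrot   # parasite configuration after rotation and scaling

plot(HostX, asp = 1, pch = 16, xlab = "Axis 1", ylab = "Axis 2")
points(ParY, pch = 1)
# Each arrow joins a parasite to its host; short arrows indicate a good fit.
arrows(ParY[, 1], ParY[, 2], HostX[, 1], HostX[, 2], length = 0.1)
text(HostX, labels = rownames(H.PCo), pos = 3, cex = 0.8)
```

vegan also provides a plot method for procrustes objects, which may be a convenient alternative when custom labelling is not needed.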

Goodness-of-fit test
The residual sum of squares is then computed and the host-parasite association matrix is randomized to establish the probability P under H0. N.perm sets the number of random permutations of the host-parasite association matrix. For high precision of the P estimate, 100,000 permutations were used in the accompanying paper. Although computing time was not prohibitive (some 24 min on a PC equipped with an Intel Core 2 CPU 6600 @ 2.40GHz processor), in most situations ≤ 10,000 permutations would be sufficient for hypothesis testing.
Although the write function in the script is not essential for the analysis (and can be omitted), it is useful if one wishes to save the set of residual sums of squares generated at each permutation for further reference.
Note the path pointing to the location where the file will be saved. Given that append=TRUE, the file created (m2_perm.txt) should be deleted or renamed prior to a new analysis. Otherwise the values generated in the new run will be appended to those produced in the previous one.
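The permutation test can be sketched along the following lines. This is an illustration under stated assumptions rather than the original code: the toy trees and association matrix are hypothetical, the randomization scheme (shuffling each host's row of associations independently, and redrawing if all links fall on a single parasite) is an assumption, the helper names m2_of and permute_links are introduced here, and a small N.perm is used to keep the run short. The values are written in one call to a temporary directory, so nothing is appended to an earlier file.

```r
library(ape)
library(vegan)

set.seed(42)
# Hypothetical toy data (not the gopher-louse data).
TreeH <- read.tree(text = "(((H1:1,H2:1):1,H3:2):1,((H4:1,H5:1):1,H6:2):1);")
TreeP <- read.tree(text = "(((P1:1.1,P2:0.9):1,P3:2.2):1,((P4:1,P5:1.3):0.8,P6:2):1.1);")
HP <- diag(6)
dimnames(HP) <- list(paste0("H", 1:6), paste0("P", 1:6))

host.D <- cophenetic(TreeH)[rownames(HP), rownames(HP)]
para.D <- cophenetic(TreeP)[colnames(HP), colnames(HP)]
# Principal coordinates do not depend on the links, so compute them once.
H.all <- pcoa(host.D)$vectors
P.all <- pcoa(para.D)$vectors

# Hypothetical helper: residual sum of squares for a given association matrix.
m2_of <- function(HP.bin) {
  links <- which(HP.bin > 0, arr.ind = TRUE)
  suppressWarnings(procrustes(H.all[links[, 1], , drop = FALSE],
                              P.all[links[, 2], , drop = FALSE]))$ss
}

# Hypothetical helper: shuffle each host's row; redraw degenerate outcomes
# in which every link points to the same parasite.
permute_links <- function(HP.bin) {
  repeat {
    HP.perm <- t(apply(HP.bin, 1, sample))
    if (sum(colSums(HP.perm) > 0) > 1) return(HP.perm)
  }
}

m2.obs <- m2_of(HP)
N.perm <- 999
m2.perm <- replicate(N.perm, m2_of(permute_links(HP)))
# Proportion of permutations fitting at least as well as the observed data.
P.value <- sum(m2.perm <= m2.obs) / N.perm

# Optional: save the permuted values (written to a temporary directory here).
out_file <- file.path(tempdir(), "m2_perm.txt")
write(m2.perm, file = out_file, ncolumns = 1)
```

Collecting the permuted values in a vector and writing them once avoids the need to delete or rename the output file between runs.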
(The warnings are generated by each of the Procrustes analyses with the permuted host-parasite association matrix and result from the different number of columns in the host and parasite matrices. As noted above, this has no effect on the analysis.) The observed residual sum of squares was 0.1159. In only one of the 100,000 random permutations was the residual sum of squares smaller than this value (i.e., P = 10^-5), so congruence between the host and parasite phylogenies is statistically significant at the conventional significance level of 0.05.

Evaluation of host-parasite links
As justified in the accompanying article, the contribution of each host-parasite link to the global fit can be assessed with a jackknife procedure that estimates the squared residual of each link and its 95% confidence interval. The script then produces a bar chart of squared residuals (Fig. S1.2). Most links related to gopher species of Orthogeomys, Geomys and Pappogeomys contributed relatively little to the residual sum of squares and thus likely represent coevolutionary links. The links related to Thomomys spp. showed the highest residuals but their confidence intervals were quite broad (Fig. S1.2). Thus, it is difficult to evaluate their contribution to the cophylogenetic pattern observed.
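The jackknife step can be sketched as follows. The sketch assumes a standard leave-one-out formulation with pseudo-values (N times the full-data squared residual minus N - 1 times the squared residual with one link removed) and an upper 95% limit based on the t distribution; the toy data and the helper fit_resid2 are hypothetical, and the details may differ from the original script.

```r
library(ape)
library(vegan)

# Hypothetical toy data (not the gopher-louse data).
TreeH <- read.tree(text = "(((H1:1,H2:1):1,H3:2):1,((H4:1,H5:1):1,H6:2):1);")
TreeP <- read.tree(text = "(((P1:1.1,P2:0.9):1,P3:2.2):1,((P4:1,P5:1.3):0.8,P6:2):1.1);")
HP <- diag(6)
dimnames(HP) <- list(paste0("H", 1:6), paste0("P", 1:6))
H.all <- pcoa(cophenetic(TreeH)[rownames(HP), rownames(HP)])$vectors
P.all <- pcoa(cophenetic(TreeP)[colnames(HP), colnames(HP)])$vectors

links <- which(HP > 0, arr.ind = TRUE)
NLinks <- nrow(links)

# Hypothetical helper: squared Procrustes residuals for a subset of links.
fit_resid2 <- function(idx) {
  fit <- suppressWarnings(procrustes(H.all[links[idx, 1], , drop = FALSE],
                                     P.all[links[idx, 2], , drop = FALSE]))
  residuals(fit)^2
}

SQres <- fit_resid2(seq_len(NLinks))          # all links included
pseudo <- matrix(NA_real_, NLinks, NLinks)    # row i: link i left out
for (i in seq_len(NLinks)) {
  keep <- setdiff(seq_len(NLinks), i)
  pseudo[i, keep] <- NLinks * SQres[keep] - (NLinks - 1) * fit_resid2(keep)
}

phi.mean <- colMeans(pseudo, na.rm = TRUE)
phi.se   <- apply(pseudo, 2, sd, na.rm = TRUE) / sqrt(NLinks)
phi.UCI  <- phi.mean + qt(0.975, NLinks - 1) * phi.se
names(phi.mean) <- paste(rownames(HP)[links[, 1]],
                         colnames(HP)[links[, 2]], sep = "-")

# Bar chart of jackknifed squared residuals with upper confidence limits.
mids <- barplot(phi.mean, las = 2, ylim = range(0, phi.mean, phi.UCI),
                ylab = "Squared residual")
arrows(mids, phi.mean, mids, phi.UCI, angle = 90, length = 0.05)
```

Links with low bars and narrow intervals contribute little to the residual sum of squares; links with high bars but wide intervals are harder to interpret, as noted above for Thomomys.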

A1. Host-parasite association matrix (g-l_links.txt)
Binary matrix with host and parasite species in rows and columns, respectively; 1's represent presence of a given parasite in a given host in nature, whereas 0's denote otherwise.