ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes

Recombination is an important evolutionary force in bacteria, but it remains challenging to reconstruct the imports that occurred in the ancestry of a genomic sample. Here we present ClonalFrameML, which uses maximum likelihood inference to simultaneously detect recombination in bacterial genomes and account for it in phylogenetic reconstruction. ClonalFrameML can analyse hundreds of genomes in a matter of hours, and we demonstrate its usefulness on simulated and real datasets. We find evidence for recombination hotspots associated with mobile elements in Clostridium difficile ST6 and a previously undescribed 310kb chromosomal replacement in Staphylococcus aureus ST582. ClonalFrameML is freely available at http://clonalframeml.googlecode.com/.


Input
There are two input files needed to run ClonalFrameML. The first one is an alignment of the sequences which can be either in fasta format or extended multi fasta (XMFA) format. The second one is a starting tree, which must be in Newick format. There must be as many leaves in this tree as there are sequences in the alignment file, and the names of the leaves must match the names in the headers of the alignment file.
This starting tree can be generated for example using RAxML or PhyML.

Running
The basic command for running ClonalFrameML is as follows: ClonalFrameML newick_file seq_file kappa output_prefix [OPTIONS] The newick_file and seq_file are the two input files described above. The parameter kappa specifies the transition/transversion bias, and is an output of the phylogenetic reconstruction software. The output_prefix is the prefix used for all output files generated by ClonalFrameML.

Output
Running ClonalFrameML produces several output files, each of which starts with the output_prefix specified in the command line and ending with the following extensions:

ML_sequence.fasta
This file contains the sequence reconstructed by maximum likelihood for all internal nodes of the phylogeny, as well as for all missing data in the input sequences.

position_cross_reference.txt
A vector of comma-separated values indicating the location in the input sequence file of the sites reconstructed in the output ML_sequence.fasta file.

em.txt
This file contains the point estimates for R/theta, nu, delta and the branch lengths.

emsim.txt
This file contains the bootstrapped values for the three parameters R/theta, nu and delta.

importation_status.txt
This file contains the list of reconstructed recombination events. There is one line for each event, the first column indicates the branch on which the event was found, and the second and third columns indicate the first and last genomic positions affected by the recombination event.

labelled_tree.newick
This file contains the starting tree with all nodes labelled so that they can be referred to in other files.

Graphical output
To produce a graphical representation of the ClonalFrameML output, use the following command:

Rscript cfml_results.R output_prefix
The result is a PDF file named output_prefix.pdf Example: standard model Start by downloading the example files from here. Unzip on Mac/Linux as follows: tar -xzvf cfml.tgz The file contains the S. aureus FASTA file (Saureus.fasta), maximum likelihood tree (Saureus.phyML.newick), lists of core and non-core sites (Saureus.core-sites.txt and Saureus.non-core-sites.txt) and some R code (cfml_results_2.R).
Run the standard ClonalFrameML analysis as follows. Note! This is a real-world example, and requires around 12 hours of run time.
./ClonalFrameML Saureus.phyML.newick Saureus.fasta 4.967695 example.output -em true -emsim 100 -guess_initial_m true -ignore_user_sites Saureus.non-core-sites.txt -use_incompatible_sites true -driving_prior_mean "0.3793988249 0.0037939882 0.3793988249 0.0001199764" -driving_prior_precision "3.33101e+00 3.33101e+04 3.33101e+00 3.33101e+07" > example.log.txt Note that the transition:transversion ratio of 4.967695 was estimated by PhyML. The option -em true specifies the standard analysis in which recombination parameters are shared by all branches, and the -emsim 100 option requests 100 pseudobootstrap replicates. The -guess_initial_m true aims to start the estimation of branch lengths at a reasonable set of values. The positions listed in the file specified by the -ignore_user_sites option lists any site that was not callable or present in all the genomes. These sites are treated as missing data, which is important since uncalled sites can cause artefactual clustering of substitutions. The -use_incompatible_sites option specifies the analysis of homoplasious as well as non-homoplasious sites.
The pseudocounts prior is specified by -driving_prior_mean and -driving_prior_precision corresponding to the parameters of a gamma distribution. They specify the parameters in the following order: R/theta (relative rate of recombination to mutation), 1/delta (inverse mean DNA import length), nu (mean divergence of imported DNA) and mean branch length.
The suggested values specify distributions that, on the log scale, are approximately centred on R/theta = 0.1, delta = 1000, nu = 0.1 and the mean branch length in the PhyML tree, with standard deviations of one order of magnitude.
Once the analysis has finished, a figure can be produced with an R script. The R script takes a list of the core sites (the complement of the non-core sites fed into ClonalFrameML): Rscript cfml_results_2.R example.output core-sites.txt This will generate a pdf similar to the figures in the paper.

Example: per-branch model
Run the ClonalFrameML analysis in which recombination parameters are estimated per branch as follows . Note! This is a real-world example, and requires around 12 hours of run time.