Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies

Stephanie J. Spielman; Claus O. Wilke

doi:10.1371/journal.pone.0139047

Abstract

We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny using continuous-time Markov models of sequence evolution. Easily incorporated into Python bioinformatics pipelines, Pyvolve can simulate sequences according to most standard models of nucleotide, amino-acid, and codon sequence evolution. All model parameters are fully customizable. Users can additionally specify custom evolutionary models, with custom rate matrices and/or states to evolve. This flexibility makes Pyvolve a convenient framework not only for simulating sequences under a wide variety of conditions, but also for developing and testing new evolutionary models. Pyvolve is an open-source project under a FreeBSD license, and it is available for download, along with a detailed user-manual and example scripts, from http://github.com/sjspielman/pyvolve.

Citation: Spielman SJ, Wilke CO (2015) Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies. PLoS ONE 10(9): e0139047. https://doi.org/10.1371/journal.pone.0139047

Editor: Marc Robinson-Rechavi, University of Lausanne, SWITZERLAND

Received: June 15, 2015; Accepted: September 7, 2015; Published: September 23, 2015

Copyright: © 2015 Spielman, Wilke. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: All relevant data are within the paper and its Supporting Information files. Source code for the software described in the paper is available from http://github.com/sjspielman/pyvolve.

Funding: This work was supported by National Institutes of Health (www.nih.gov/), Grant No. F31GM113622 to SJS; Army Research Office (www.aro.army.mil/), Grant No. W911NF-12-1-0390 to COW; Defense Threat Reduction Agency (www.dtra.mil/), Grant No. HDTRA1-12-C-0007 to COW; and National Science Foundation (www.nsf.gov/), Cooperative Agreement No. DBI-0939454, BEACON Center, to COW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The Python programming language has become a staple in biological computing. In particular, the molecular evolution community has widely embraced Python as standard tool, in part due to the development of powerful bioinformatics modules such as Biopython [1] and DendroPy [2]. However, Python lacks a robust platform for evolutionary sequence simulation.

In computational molecular evolution and phylogenetics, sequence simulation represents a fundamental aspect of model development and testing. Through simulating genetic data according to a particular evolutionary model, one can rigorously test hypotheses about the model, examine the utility of analytical methods or tools in a controlled setting, and assess the interactions of different biological processes [3].

To this end, we introduce Pyvolve, a sequence simulation Python module (with dependencies Biopython [1], SciPy, and NumPy [4]). Pyvolve simulates sequences along a phylogeny using continuous-time Markov models of sequence evolution for nucleotides, amino acids, and codons, according to standard approaches [5]. The primary purpose of Pyvolve is to provide a user-friendly and flexible sequence simulation platform that can easily be integrated into Python bioinformatics pipelines without necessitating the use of third-party software. Furthermore, Pyvolve allows users to specify and evolve custom evolutionary models and/or states, making Pyvolve an ideal engine for novel model development and testing.

Substitution models and frameworks in Pyvolve

Pyvolve is specifically intended to simulate gene sequences along phylogenies according to Markov models of sequence evolution. Therefore, Pyvolve requires users to provide a fixed phylogeny along which sequences will evolve. Modeling frameworks which are included in Pyvolve are detailed in Table 1.

Download:

Table 1. Substitution models included in Pyvolve.

https://doi.org/10.1371/journal.pone.0139047.t001

Pyvolve supports both site-wise and branch (temporal) heterogeneity. Site-wise heterogeneity can be modeled with Γ or Γ+I rates, or users can specify a custom rate-distribution. Further, users can specify a custom rate matrix for a given simulation, and thus they can evolve sequences according to substitution models other than those shown in Table 1. Similarly, users have the option to specify a custom set of states to evolve, rather than being limited to nucleotide, amino-acid, or codon data. Therefore, it is possible to specify arbitrary models with corresponding custom states, e.g. states 0, 1, and 2. This general framework will enable users to evolve, for instance, states according to models of character evolution, such as the Mk model [6].

Similar to other simulation platforms (e.g. Seq-Gen [7], indel-Seq-Gen [8], and INDELible [9]), Pyvolve simulates sequences in groups of partitions, such that different partitions can have unique evolutionary models and/or parameters. Although Pyvolve enforces that all partitions within a single simulation evolve according to the same model family (e.g. nucleotide, amino-acid, or codon), Python’s flexible scripting environment allows for straight-forward alignment concatenation. Therefore, it is readily possible to embed a series of Pyvolve simulations within a Python script to produce highly-heterogeneous alignments, for instance where coding sequences are interspersed with non-coding DNA sequences. Moreover, Pyvolve allows users to specify, for a given partition, the ancestral sequence (MRCA) to evolve.

In addition, we highlight that Pyvolve is among the first open-source simulation tools to include the mutation-selection modeling framework introduced by Halpern and Bruno in ref. [10] (we note that the simulation software SGWE [11] also includes this model). Importantly, although these models were originally developed for codon evolution [10, 12], Pyvolve implements mutation-selection models for both codons and nucleotides. We expect that this implementation will foster the continued development and use of this modeling framework, which has seen a surge of popularity in recent years [13–19].

Basic Usage of Pyvolve

The basic framework for a simple simulation with the Pyvolve module is shown in Fig 1. To simulate sequences, users should input the phylogeny along which sequences will evolve, define evolutionary model(s), and assign model(s) to partition(s). Pyvolve implements all evolutionary models in their most general forms, such that any parameter in the model may be customized. This behavior stands in contrast to several other simulation platforms of comparable scope to Pyvolve. For example, some of the most commonly used simulation tools that implement codon models, including INDELible [9], EVOLVER [34], and PhyloSim [35], do not allow users to specify dS rate variation in codon models. Pyvolve provides this option, among many others.

Download:

Fig 1. Example code for a simple codon simulation in Pyvolve.

This example will simulate an alignment of 100 codons with a dN/dS of 0.75 and a κ (transition-tranversion mutational bias) of 3.25. By default, sequences will be output to a file called “simulated_alignment.fasta”, although this file name can be changed, as described in Pyvolve’s user manual.

https://doi.org/10.1371/journal.pone.0139047.g001

In the example shown in Fig 1, stationary frequencies are not explicitly specified. Under this circumstance, Pyvolve will assume equal frequencies, although they would normally be provided using the key “state_freqs” in the dictionary of parameters. Furthermore, Pyvolve contains a convenient module to help specify state frequencies. This module can read in frequencies from an existing sequence and/or alignment file (either globally or from specified alignment columns), generate random frequencies, or constrain frequencies to a given list of allowed states. In addition, this module will convert frequencies between alphabets, which is useful, for example, if one wishes to simulate amino-acid data using the state frequencies as read from a file of codon sequence data.

Validating Pyvolve

We carefully assessed that Pyvolve accurately simulates sequences. To this end, we simulated several data sets under a variety of evolutionary models and conditions and compared the observed substitution rates with the simulated parameters.

To evaluate Pyvolve under the most basic of conditions, site-homogeneity, we simulated both nucleotide and codon data sets. We evolved nucleotide sequences under the JC69 model [36] across several phylogenies with varying branch lengths (representing the substitution rate), and we evolved codon sequences under a MG94-style model [32] with varying values of dN/dS. All alignments were simulated along a two-taxon tree and contained 100,000 positions. We simulated 50 replicates for each branch length and/or dN/dS parameterization. As shown in Fig 2A and 2B, the observed number of changes agreed precisely with the specified parameters.

Download:

Fig 2. Pyvolve accurately evolves sequences under homogenous, site-wise rate heterogeneity, and branch-specific rate heterogeneity.

A) Nucleotide alignments simulated under the JC69 [36] model along two-taxon trees with varying branch lengths, which represent the substitution rate. Points represent the mean observed substitution rate for the 50 alignment replicates simulated under the given value, and error bars represent standard deviations. The red line indicates the x = y line. B) Codon alignments simulated under an MG94-style [32] model with varying values for the dN/dS parameter. Points represent the mean dN/dS inferred from the 50 alignment replicates simulated under the given dN/dS value, and error bars represent standard deviations. The red line indicates the x = y line. C) Site-wise heterogeneity simulated with an MG94-style [32] model with varying dN/dS values across sites. Horizontal lines indicate the simulated dN/dS value for each dN/dS category. D) Branch-wise heterogeneity simulated with an MG94-style [32] model with each branch evolving according to a distinct dN/dS value. Horizontal lines indicate the simulated dN/dS value for each branch, as shown in the inset phylogeny. The lowest dN/dS category (dN/dS = 0.1) was applied to the internal branch (shown in gray). All code and data used to validate Pyvolve’s performance and generate this figure are available in S1 File.

https://doi.org/10.1371/journal.pone.0139047.g002

We additionally validated Pyvolve’s implementation of site-wise rate heterogeneity. We simulated an alignment of 400 codon positions, again under an MG94-style model [32], along a balanced tree of 2¹⁴ taxa with all branch lengths set to 0.01. This large number of taxa was necessary to achieve accurate estimates for site-specific measurements. To incorporate site-specific rate heterogeneity, we specified four dN/dS values of 0.2, 0.4, 0.6, and 0.8, to be assigned in equal proportions to sites across this alignment. We counted the observed dN/dS values for each resulting alignment column using a version of the Suzuki-Gojobori algorithm [37] adapted for phylogenetic data [38]. Fig 2C demonstrates that Pyvolve accurately implements site-specific rate heterogeneity. The high variance seen in Fig 2C is an expected result of enumerating substitutions on a site-specific basis, which, as a relatively small data set, produces substantial noise.

Finally, we confirmed that Pyvolve accurately simulates branch heterogeneity. Using a four-taxon tree, we evolved codon sequences under an MG94-style model [32] and specified a distinct dN/dS ratio for each branch. We simulated 50 replicate alignments of 100,000 positions, and we computed the observed dN/dS value along each branch. Fig 2D shows that observed branch dN/dS values agreed with the simulated values.

Conclusions

Because Pyvolve focuses on simulating the substitution processes using continuous-time Markov models along a fixed phylogeny, it is most suitable for simulating gene sequences, benchmarking inference frameworks, and for developing and testing novel Markov models of sequence evolution. For example, we see a primary application of Pyvolve as a convenient simulation platform to benchmark dN/dS and mutation-selection model inference frameworks such as the ones provided by PAML [34], HyPhy [39], Phylobayes [17], or swMutSel [16]. Indeed, the Pyvolve engine has already successfully been applied to investigate the relationship between mutation-selection and dN/dS modeling frameworks and to identify estimation biases in certain dN/dS models [18]. Moreover, we believe that Pyvolve provides a convenient tool for easy incorporation of complex simulations, for instance those used in approximate Bayesian computation (ABC) or MCMC methods [40], into Python pipelines.

Importantly, Pyvolve is meant primarily as a convenient Python library for simulating simple Markov models of sequence evolution. For more complex evolutionary scenarios, including the simulation of entire genomes, population processes, or protein folding and energetics, we refer the reader to several more suitable platforms. For example, genomic process such as recombination, coalescent-based models, gene duplication, and migration, may be best simulated with softwares such as ALF [41], CoalEvol and SGWE [11], or EvolSimulator [42]. Simulators which consider the influence of structural and/or biophysical constraints in protein sequence evolution include CASS [43] or ProteinEvolver [44]. Similarly, the software REvolver [45] simulates protein sequences with structural domain constraints by recruiting profile hidden Markov models (pHMMs) to model site-specific substitution processes.

We additionally note that Pyvolve does not currently include the simulation of insertions and deletions (indels), although this functionality is planned for a future release. We refer readers to the softwares indel-Seq-Gen [8] and Dawg [46] for simulating nucleotide sequences, and we suggest platforms such as INDELible [9], phyloSim [35], or πBuss [47] for simulating coding sequences with indels.

In sum, we believe that Pyvolve’s flexible platform and user-friendly interface will provide a helpful and convenient tool for the biocomputing Python community. Pyvolve is freely available from http://github.com/sjspielman/pyvolve, conveniently packaged with a comprehensive user manual and several example scripts demonstrating various simulation conditions. In addition, Pyvolve is distributed with two helpful Python scripts that complement Pyvolve simulations: one which implements the Suzuki-Gojobori [37] dN/dS counting algorithm adapted for phylogenetic data [38], and one which calculates dN/dS from a given set of mutation-selection model parameters as described in ref. [18]. Pyvolve is additionally available for download from the Python Package Index (e.g. via pip).

Supporting Information

S1 File. This file contains all scripts and data used to validate Pyvolve.

https://doi.org/10.1371/journal.pone.0139047.s001

(ZIP)

Acknowledgments

We thank Suyang Wan and Dariya Sydykova for helpful feedback during Pyvolve development and Rebecca Tarvin for her help designing the Pyvolve logo.

Author Contributions

Conceived and designed the experiments: SJS COW. Performed the experiments: SJS. Analyzed the data: SJS. Contributed reagents/materials/analysis tools: SJS. Wrote the paper: SJS COW.

References

1. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. pmid:19304878
- View Article
- PubMed/NCBI
- Google Scholar
2. Sukumaran J, Holder MT. DendroPy: A Python library for phylogenetic computing. Bioinformatics. 2010;26:1569–1571. pmid:20421198
- View Article
- PubMed/NCBI
- Google Scholar
3. Arenas M. Simulation of Molecular Data under Diverse Evolutionary Scenarios. PLoS Comp Biol. 2012;8:e1002495.
- View Article
- Google Scholar
4. Oliphant T. Python for Scientific Computing. IEEE Comput Sci Eng. 2007;9:10–20.
- View Article
- Google Scholar
5. Yang Z. Computational Molecular Evolution. Oxford University Press; 2006.
6. Lewis PO. A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Syst Biol. 2001;50:913–925. pmid:12116640
- View Article
- PubMed/NCBI
- Google Scholar
7. Rambaut A, Grassly N. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997;13:235–238. pmid:9183526
- View Article
- PubMed/NCBI
- Google Scholar
8. Strope C, Scott S, Moriyama E. indel-Seq-Gen: a new protein family simulator incorporating domains, motifs, and indels. Mol Biol Evol. 2007;24(3):640–649. pmid:17158778
- View Article
- PubMed/NCBI
- Google Scholar
9. Fletcher W, Yang Z. INDELible: A Flexible Simulator of Biological Sequence Evolution. Mol Biol Evol. 2009;26(8):1879–1888. pmid:19423664
- View Article
- PubMed/NCBI
- Google Scholar
10. Halpern A, Bruno W. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998;15:910–917. pmid:9656490
- View Article
- PubMed/NCBI
- Google Scholar
11. Arenas M, Posada D. Simulation of Genome-Wide Evolution under Heterogeneous Substitution Models and Complex Multispecies Coalescent Histories. Mol Biol Evol. 2014;31:1295–1301. pmid:24557445
- View Article
- PubMed/NCBI
- Google Scholar
12. Yang Z, Nielsen R. Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage. Mol Biol Evol. 2008 Jan;25(3):568–579. pmid:18178545
- View Article
- PubMed/NCBI
- Google Scholar
13. Holder M, Zwickl D, Dessimoz C. Evaluating the robustness of phylogenetic methods to among-site variability in substitution processes. Phil Trans R Soc B. 2008;363:4013–4021. pmid:18852108
- View Article
- PubMed/NCBI
- Google Scholar
14. Rodrigue N, Philippe H, Lartillot N. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci USA. 2010;107(10):4629–4634. pmid:20176949
- View Article
- PubMed/NCBI
- Google Scholar
15. Tamuri AU, dos Reis M, Goldstein RA. Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics. 2012;190:1101–1115. pmid:22209901
- View Article
- PubMed/NCBI
- Google Scholar
16. Tamuri AU, Goldman N, dos Reis M. A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics. 2014;197:257–271. pmid:24532780
- View Article
- PubMed/NCBI
- Google Scholar
17. Rodrigue N, Lartillot N. Site-heterogeneous mutation-selection models within the PhyloBayes-MPI Package. Bioinformatics. 2014;30:1020–1021. pmid:24351710
- View Article
- PubMed/NCBI
- Google Scholar
18. Spielman S, Wilke C. The relationship between dN/dS and scaled selection coefficients. Mol Biol Evol. 2015;32:1097–1108. pmid:25576365
- View Article
- PubMed/NCBI
- Google Scholar
19. dos Reis M. How to calculate the non-synonymous to synonymous rate ratio of protein-coding genes under the Fisher–Wright mutation–selection framework. Biol Lett. 2015;11:20141031. pmid:25854546
- View Article
- PubMed/NCBI
- Google Scholar
20. Tavare S. Lines of descent and genealogical processes, and their applications in population genetics models. Theor Popul Biol. 1984;26:119–164. pmid:6505980
- View Article
- PubMed/NCBI
- Google Scholar
21. Hasegawa M, Kishino H, Yano T. Dating of human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–174. pmid:3934395
- View Article
- PubMed/NCBI
- Google Scholar
22. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10:512–526. pmid:8336541
- View Article
- PubMed/NCBI
- Google Scholar
23. Jones D, Taylor W, Thornton J. The rapid generation of mutation data matrices from protein sequences. CABIOS. 1992;8:275–282. pmid:1633570
- View Article
- PubMed/NCBI
- Google Scholar
24. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol. 2001;18:691–699. pmid:11319253
- View Article
- PubMed/NCBI
- Google Scholar
25. Le S, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol. 2008;25:1307–1320. pmid:18367465
- View Article
- PubMed/NCBI
- Google Scholar
26. Yang N, Nielsen R, Hasegawa M. Models of Amino Acid Substitution and Applications to Mitochondrial Protein Evolution. Mol Biol Evol. 1998;15:1600–1611. pmid:9866196
- View Article
- PubMed/NCBI
- Google Scholar
27. Adachi J, Hasegawa M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput Sci Monogr. 19896;28:1–150.
- View Article
- Google Scholar
28. Dayhoff M, Schwartz R, Orcutt B. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978;5(3):345–352.
- View Article
- Google Scholar
29. Mirsky A, Kazandjian L, Anisimova M. Antibody-Specific Model of Amino Acid Substitution for Immunological Inferences from Alignments of Antibody Sequences. Mol Biol Evol. 2015;32:806–819. pmid:25534034
- View Article
- PubMed/NCBI
- Google Scholar
30. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11:725–736. pmid:7968486
- View Article
- PubMed/NCBI
- Google Scholar
31. Nielsen R, Yang Z. Likelihood models for detecting positive selected amino acid sites and applications to the HIV-1 envelope gene. Genetics. 1998;148:929–936. pmid:9539414
- View Article
- PubMed/NCBI
- Google Scholar
32. Muse S, Gaut B. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11:715–724. pmid:7968485
- View Article
- PubMed/NCBI
- Google Scholar
33. Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Mol Biol Evol. 2007;24:1464–1479. pmid:17400572
- View Article
- PubMed/NCBI
- Google Scholar
34. Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution. 2007;24:1586–1591. pmid:17483113
- View Article
- PubMed/NCBI
- Google Scholar
35. Sipos B, Massingham T, Jordan GE, Goldman N. PhyloSim—Monte Carlo simulation of sequence evolution in the R statistical computing environment. BMC Bioinform. 2011;12(104).
- View Article
- Google Scholar
36. Jukes T, Cantor C. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press; 1969.
37. Suzuki Y, Gojobori T. A method for detecting positive selection at single amino acid sites. Mol Biol Evol. 1999;16:1315–1328. pmid:10563013
- View Article
- PubMed/NCBI
- Google Scholar
38. Kosakovsky Pond S, Frost S. Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol. 2005;22:1208–1222. pmid:15703242
- View Article
- PubMed/NCBI
- Google Scholar
39. Kosakovsky Pond S, Frost S, Muse S. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;12:676–679.
- View Article
- Google Scholar
40. Arenas M. Advances in Computer Simulation of Genome Evolution: Toward More Realistic Evolutionary Genomics Analysis by Approximate Bayesian Computation. J Mol Evol. 2015;8:189–192.
- View Article
- Google Scholar
41. Dalquen D, Anisimova M, Gonnet G, Dessimoz C. ALF–a simulation framework for genome evolution. Mol Biol Evol. 2012;29:1115–1123. pmid:22160766
- View Article
- PubMed/NCBI
- Google Scholar
42. Beiko R, Charlebois R. A simulation test bed for hypotheses of genome evolution. Bioinformatics. 2007;23:825–831. pmid:17267425
- View Article
- PubMed/NCBI
- Google Scholar
43. Grahnen J, Liberles D. CASS: Protein sequence simulation with explicit genotype-phenotype mapping. Trends in Evolutionary Biology;4:e9.
- View Article
- Google Scholar
44. Arenas M, Dos Santos H, Posada D, Bastolla U. Protein evolution along phylogenetic histories under structurally constrained substitution models. Bioinformatics. 2013;29:3020–3028. pmid:24037213
- View Article
- PubMed/NCBI
- Google Scholar
45. Koestler T, von Haeseler A, Ebersberger I. REvolver: modeling sequence evolution under domain constraints. Mol Biol Evol. 2012;29:2133–2145. pmid:22383532
- View Article
- PubMed/NCBI
- Google Scholar
46. Cartwright R. DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics. 2005;21:iii31–iii38. pmid:16306390
- View Article
- PubMed/NCBI
- Google Scholar
47. Bielejec F, Lemey P, Carvalho L, Baele G, Rambaut A, Suchard M. piBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios. BMC Bioinformatics. 2014;15:133. pmid:24885610
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. pmid:19304878
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Sukumaran J, Holder MT. DendroPy: A Python library for phylogenetic computing. Bioinformatics. 2010;26:1569–1571. pmid:20421198
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Arenas M. Simulation of Molecular Data under Diverse Evolutionary Scenarios. PLoS Comp Biol. 2012;8:e1002495.
View Article
Google Scholar

[10] View Article

[11] Google Scholar

[ref4] 4. Oliphant T. Python for Scientific Computing. IEEE Comput Sci Eng. 2007;9:10–20.
View Article
Google Scholar

[13] View Article

[14] Google Scholar

[ref5] 5. Yang Z. Computational Molecular Evolution. Oxford University Press; 2006.

[ref6] 6. Lewis PO. A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Syst Biol. 2001;50:913–925. pmid:12116640
View Article
PubMed/NCBI
Google Scholar

[17] View Article

[18] PubMed/NCBI

[19] Google Scholar

[ref7] 7. Rambaut A, Grassly N. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997;13:235–238. pmid:9183526
View Article
PubMed/NCBI
Google Scholar

[21] View Article

[22] PubMed/NCBI

[23] Google Scholar

[ref8] 8. Strope C, Scott S, Moriyama E. indel-Seq-Gen: a new protein family simulator incorporating domains, motifs, and indels. Mol Biol Evol. 2007;24(3):640–649. pmid:17158778
View Article
PubMed/NCBI
Google Scholar

[25] View Article

[26] PubMed/NCBI

[27] Google Scholar

[ref9] 9. Fletcher W, Yang Z. INDELible: A Flexible Simulator of Biological Sequence Evolution. Mol Biol Evol. 2009;26(8):1879–1888. pmid:19423664
View Article
PubMed/NCBI
Google Scholar

[29] View Article

[30] PubMed/NCBI

[31] Google Scholar

[ref10] 10. Halpern A, Bruno W. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998;15:910–917. pmid:9656490
View Article
PubMed/NCBI
Google Scholar

[33] View Article

[34] PubMed/NCBI

[35] Google Scholar

[ref11] 11. Arenas M, Posada D. Simulation of Genome-Wide Evolution under Heterogeneous Substitution Models and Complex Multispecies Coalescent Histories. Mol Biol Evol. 2014;31:1295–1301. pmid:24557445
View Article
PubMed/NCBI
Google Scholar

[37] View Article

[38] PubMed/NCBI

[39] Google Scholar

[ref12] 12. Yang Z, Nielsen R. Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage. Mol Biol Evol. 2008 Jan;25(3):568–579. pmid:18178545
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref13] 13. Holder M, Zwickl D, Dessimoz C. Evaluating the robustness of phylogenetic methods to among-site variability in substitution processes. Phil Trans R Soc B. 2008;363:4013–4021. pmid:18852108
View Article
PubMed/NCBI
Google Scholar

[45] View Article

[46] PubMed/NCBI

[47] Google Scholar

[ref14] 14. Rodrigue N, Philippe H, Lartillot N. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci USA. 2010;107(10):4629–4634. pmid:20176949
View Article
PubMed/NCBI
Google Scholar

[49] View Article

[50] PubMed/NCBI

[51] Google Scholar

[ref15] 15. Tamuri AU, dos Reis M, Goldstein RA. Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics. 2012;190:1101–1115. pmid:22209901
View Article
PubMed/NCBI
Google Scholar

[53] View Article

[54] PubMed/NCBI

[55] Google Scholar

[ref16] 16. Tamuri AU, Goldman N, dos Reis M. A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics. 2014;197:257–271. pmid:24532780
View Article
PubMed/NCBI
Google Scholar

[57] View Article

[58] PubMed/NCBI

[59] Google Scholar

[ref17] 17. Rodrigue N, Lartillot N. Site-heterogeneous mutation-selection models within the PhyloBayes-MPI Package. Bioinformatics. 2014;30:1020–1021. pmid:24351710
View Article
PubMed/NCBI
Google Scholar

[61] View Article

[62] PubMed/NCBI

[63] Google Scholar

[ref18] 18. Spielman S, Wilke C. The relationship between dN/dS and scaled selection coefficients. Mol Biol Evol. 2015;32:1097–1108. pmid:25576365
View Article
PubMed/NCBI
Google Scholar

[65] View Article

[66] PubMed/NCBI

[67] Google Scholar

[ref19] 19. dos Reis M. How to calculate the non-synonymous to synonymous rate ratio of protein-coding genes under the Fisher–Wright mutation–selection framework. Biol Lett. 2015;11:20141031. pmid:25854546
View Article
PubMed/NCBI
Google Scholar

[69] View Article

[70] PubMed/NCBI

[71] Google Scholar

[ref20] 20. Tavare S. Lines of descent and genealogical processes, and their applications in population genetics models. Theor Popul Biol. 1984;26:119–164. pmid:6505980
View Article
PubMed/NCBI
Google Scholar

[73] View Article

[74] PubMed/NCBI

[75] Google Scholar

[ref21] 21. Hasegawa M, Kishino H, Yano T. Dating of human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–174. pmid:3934395
View Article
PubMed/NCBI
Google Scholar

[77] View Article

[78] PubMed/NCBI

[79] Google Scholar

[ref22] 22. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10:512–526. pmid:8336541
View Article
PubMed/NCBI
Google Scholar

[81] View Article

[82] PubMed/NCBI

[83] Google Scholar

[ref23] 23. Jones D, Taylor W, Thornton J. The rapid generation of mutation data matrices from protein sequences. CABIOS. 1992;8:275–282. pmid:1633570
View Article
PubMed/NCBI
Google Scholar

[85] View Article

[86] PubMed/NCBI

[87] Google Scholar

[ref24] 24. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol. 2001;18:691–699. pmid:11319253
View Article
PubMed/NCBI
Google Scholar

[89] View Article

[90] PubMed/NCBI

[91] Google Scholar

[ref25] 25. Le S, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol. 2008;25:1307–1320. pmid:18367465
View Article
PubMed/NCBI
Google Scholar

[93] View Article

[94] PubMed/NCBI

[95] Google Scholar

[ref26] 26. Yang N, Nielsen R, Hasegawa M. Models of Amino Acid Substitution and Applications to Mitochondrial Protein Evolution. Mol Biol Evol. 1998;15:1600–1611. pmid:9866196
View Article
PubMed/NCBI
Google Scholar

[97] View Article

[98] PubMed/NCBI

[99] Google Scholar

[ref27] 27. Adachi J, Hasegawa M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput Sci Monogr. 19896;28:1–150.
View Article
Google Scholar

[101] View Article

[102] Google Scholar

[ref28] 28. Dayhoff M, Schwartz R, Orcutt B. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978;5(3):345–352.
View Article
Google Scholar

[104] View Article

[105] Google Scholar

[ref29] 29. Mirsky A, Kazandjian L, Anisimova M. Antibody-Specific Model of Amino Acid Substitution for Immunological Inferences from Alignments of Antibody Sequences. Mol Biol Evol. 2015;32:806–819. pmid:25534034
View Article
PubMed/NCBI
Google Scholar

[107] View Article

[108] PubMed/NCBI

[109] Google Scholar

[ref30] 30. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11:725–736. pmid:7968486
View Article
PubMed/NCBI
Google Scholar

[111] View Article

[112] PubMed/NCBI

[113] Google Scholar

[ref31] 31. Nielsen R, Yang Z. Likelihood models for detecting positive selected amino acid sites and applications to the HIV-1 envelope gene. Genetics. 1998;148:929–936. pmid:9539414
View Article
PubMed/NCBI
Google Scholar

[115] View Article

[116] PubMed/NCBI

[117] Google Scholar

[ref32] 32. Muse S, Gaut B. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11:715–724. pmid:7968485
View Article
PubMed/NCBI
Google Scholar

[119] View Article

[120] PubMed/NCBI

[121] Google Scholar

[ref33] 33. Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Mol Biol Evol. 2007;24:1464–1479. pmid:17400572
View Article
PubMed/NCBI
Google Scholar

[123] View Article

[124] PubMed/NCBI

[125] Google Scholar

[ref34] 34. Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution. 2007;24:1586–1591. pmid:17483113
View Article
PubMed/NCBI
Google Scholar

[127] View Article

[128] PubMed/NCBI

[129] Google Scholar

[ref35] 35. Sipos B, Massingham T, Jordan GE, Goldman N. PhyloSim—Monte Carlo simulation of sequence evolution in the R statistical computing environment. BMC Bioinform. 2011;12(104).
View Article
Google Scholar

[131] View Article

[132] Google Scholar

[ref36] 36. Jukes T, Cantor C. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press; 1969.

[ref37] 37. Suzuki Y, Gojobori T. A method for detecting positive selection at single amino acid sites. Mol Biol Evol. 1999;16:1315–1328. pmid:10563013
View Article
PubMed/NCBI
Google Scholar

[135] View Article

[136] PubMed/NCBI

[137] Google Scholar

[ref38] 38. Kosakovsky Pond S, Frost S. Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol. 2005;22:1208–1222. pmid:15703242
View Article
PubMed/NCBI
Google Scholar

[139] View Article

[140] PubMed/NCBI

[141] Google Scholar

[ref39] 39. Kosakovsky Pond S, Frost S, Muse S. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;12:676–679.
View Article
Google Scholar

[143] View Article

[144] Google Scholar

[ref40] 40. Arenas M. Advances in Computer Simulation of Genome Evolution: Toward More Realistic Evolutionary Genomics Analysis by Approximate Bayesian Computation. J Mol Evol. 2015;8:189–192.
View Article
Google Scholar

[146] View Article

[147] Google Scholar

[ref41] 41. Dalquen D, Anisimova M, Gonnet G, Dessimoz C. ALF–a simulation framework for genome evolution. Mol Biol Evol. 2012;29:1115–1123. pmid:22160766
View Article
PubMed/NCBI
Google Scholar

[149] View Article

[150] PubMed/NCBI

[151] Google Scholar

[ref42] 42. Beiko R, Charlebois R. A simulation test bed for hypotheses of genome evolution. Bioinformatics. 2007;23:825–831. pmid:17267425
View Article
PubMed/NCBI
Google Scholar

[153] View Article

[154] PubMed/NCBI

[155] Google Scholar

[ref43] 43. Grahnen J, Liberles D. CASS: Protein sequence simulation with explicit genotype-phenotype mapping. Trends in Evolutionary Biology;4:e9.
View Article
Google Scholar

[157] View Article

[158] Google Scholar

[ref44] 44. Arenas M, Dos Santos H, Posada D, Bastolla U. Protein evolution along phylogenetic histories under structurally constrained substitution models. Bioinformatics. 2013;29:3020–3028. pmid:24037213
View Article
PubMed/NCBI
Google Scholar

[160] View Article

[161] PubMed/NCBI

[162] Google Scholar

[ref45] 45. Koestler T, von Haeseler A, Ebersberger I. REvolver: modeling sequence evolution under domain constraints. Mol Biol Evol. 2012;29:2133–2145. pmid:22383532
View Article
PubMed/NCBI
Google Scholar

[164] View Article

[165] PubMed/NCBI

[166] Google Scholar

[ref46] 46. Cartwright R. DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics. 2005;21:iii31–iii38. pmid:16306390
View Article
PubMed/NCBI
Google Scholar

[168] View Article

[169] PubMed/NCBI

[170] Google Scholar

[ref47] 47. Bielejec F, Lemey P, Carvalho L, Baele G, Rambaut A, Suchard M. piBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios. BMC Bioinformatics. 2014;15:133. pmid:24885610
View Article
PubMed/NCBI
Google Scholar

[172] View Article

[173] PubMed/NCBI

[174] Google Scholar

Figures

Abstract

Introduction

Substitution models and frameworks in Pyvolve

Basic Usage of Pyvolve

Validating Pyvolve

Conclusions

Supporting Information

S1 File. This file contains all scripts and data used to validate Pyvolve.

Acknowledgments

Author Contributions

References