Roadmap to the study of gene and protein phylogeny and evolution—A practical guide

  • Florian Jacques,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Lund University Cancer Centre, Department of Laboratory Medicine, Lund University, Lund, Sweden, Lund Stem Cell Center, Department of Laboratory Medicine, Lund University, Lund, Sweden

  • Paulina Bolivar,

    Roles Investigation, Methodology, Software, Writing – review & editing

    Affiliation Lund University Cancer Centre, Department of Laboratory Medicine, Lund University, Lund, Sweden

  • Kristian Pietras,

    Roles Project administration, Supervision, Writing – review & editing

    Affiliation Lund University Cancer Centre, Department of Laboratory Medicine, Lund University, Lund, Sweden

  • Emma U. Hammarlund

    Roles Conceptualization, Funding acquisition, Project administration, Supervision, Writing – original draft, Writing – review & editing

    emma.hammarlund@med.lu.se

    Affiliations Lund University Cancer Centre, Department of Laboratory Medicine, Lund University, Lund, Sweden, Lund Stem Cell Center, Department of Laboratory Medicine, Lund University, Lund, Sweden

Abstract

Developments in sequencing technologies and the sequencing of an ever-increasing number of genomes have revolutionised studies of biodiversity and organismal evolution. This accumulation of data has been paralleled by the creation of numerous public biological databases through which the scientific community can mine the sequences and annotations of genomes, transcriptomes, and proteomes of multiple species. However, finding the appropriate databases and bioinformatic tools for a given inquiry can be challenging. Here, we present a compilation of DNA and protein databases, as well as bioinformatic tools for phylogenetic reconstruction and a wide range of studies on molecular evolution. We provide a protocol for information extraction from biological databases and simple phylogenetic reconstruction using probabilistic and distance methods, facilitating the study of biodiversity and evolution at the molecular level for the broad scientific community.

Introduction

Living organisms are characterized by an astonishing array of phenotypes. This diversity is the result of billions of years of evolution, from the first primitive cells to modern cells and multicellular organisms, including bacteria, archaea, protists, plants, and animals. How organismal functions, adaptations, and diversifications are related can be studied through molecular evolution, a field that studies variation in the information content in the genetic material through time, namely the evolution of genes, proteins, or other markers such as ribosomal RNA, transposable elements, or other parts of the genome, that have common ancestry and are, therefore, homologous. Homology can result from speciation events, that create homologous genes in different species, or gene or genome duplication, that generate homologous genes in the same genome. Homologous genes in different species (e.g., human and murine hemoglobin) are called orthologues, and homologous genes present in the same genome (e.g., human hemoglobin and human myoglobin) are called paralogues. Gene homology resulting from horizontal transfer is called xenology. Phylogenetics and evolutionary biology study how homologues are related to each other to retrace their evolutionary history.

Evolutionary studies can be carried out in the context of comparative genomics between different species or, alternatively, in the context of population genetics, comparing molecular variation between the populations or individuals of a single species. In the first case, one considers long-term evolution and compares mutations (including substitutions) that have been fixed between different species. The second case considers short-term evolution and concerns the study of variation that is still segregating in a population. In both cases, the evolution of species is usually presented as a phylogenetic tree, a diagram displaying the evolutionary relationships between the sequences or taxa. The tools and methods for phylogenetic inference have become more complex over the past four decades and their use can be challenging.

Molecular evolutionary studies aim at reconstructing the evolutionary histories and relationships of different taxa, genes or genomic components (e.g., transposable elements), as well as understanding the diverse mechanisms and factors underlying evolutionary change, such as mutation, selection, recombination, genetic drift, demographic processes, or biased gene conversion. For these purposes, the integration of novel genomic technologies with evolutionary studies is invaluable. For example, in systematics, the description of new species necessitates knowing how they are related to other species. In epidemiology, the emergence of new infectious diseases and antibiotic resistance requires studying genetic variation of infectious agents and identifying adaptive mutations leading to pathological conditions. A focus on the process of adaptation is also valuable in biological, agricultural and environmental sciences to, for example, protect endangered species or limit the spread of invasive species.

In recent years, technological advances in molecular biology, in particular the sequencing of DNA and RNA, have allowed for an exponential increase of available sequences of nucleotides and amino acids. In addition, these data are coupled with annotations regarding biological functions. The genomic, transcriptomic, and proteomic data are curated in specialized public databases, assets that are paralleled by the development of new statistical methods and computational technology to study gene and protein functions and evolution. While these resources are generally accessible to biologists for studying molecular diversity and evolution, sorting and navigating through them can be challenging. Here, we outline a selection of molecular databases as well as bioinformatic tools and methods for retrieving sequences and reconstructing evolutionary history and processes. In doing so, we follow a recently published phylogenetic protocol [1]. We focus on databases that are maintained and popular. The aim is to provide a practical guide for beginners and more advanced explorers into protein and gene evolution. This is followed by a tutorial on reconstructing the evolution of two families of cell cycle-related proteins, P53 and cyclins/cyclin-dependent kinases (CDKs), over organismal history.

Materials and methods

Data included in this study (sequences and accession numbers) are available in S1–S4 Files. There are no ethical or legal restrictions on sharing these data sets. The protocol described in this peer-reviewed article is published on protocols.io, https://protocols.io/view/roadmap-to-the-study-of-gene-and-protein-phylogeny-cknkuvcw and included as S5 File.

Collecting genomic and proteomic information

Dozens of databases store sequences and other biological information about genes and proteins; for a complete list, see [2, 3]. These databases offer query tools to retrieve DNA or amino-acid sequences and other information such as gene architecture or protein structure. They also provide annotations with information about gene or protein properties such as function, polymorphism, activity and pathways, subcellular localization, and tissue expression (Fig 1).

Fig 1. Protocol for reconstructing the phylogeny and evolutionary history of genes and proteins using molecular databases and bioinformatic tools.

Solid arrows indicate the order of actions for the phylogenetic analysis and evolutionary studies. Dashed arrows indicate feedback loops that are needed during the process. A subset of available databases and bioinformatic programs are depicted in the figure. This roadmap is mostly based on a recently published phylogenetic protocol [1].

https://doi.org/10.1371/journal.pone.0279597.g001

DNA databases

GenBank [4] and Entrez [5], both maintained at the National Center for Biotechnology Information (NCBI) [6], store nucleotide sequences of all living organisms and, when applicable, their translation into protein, with biological annotation and supporting bibliography (Table 1). They include integrated search tools to retrieve sequences, structures, genetic cartography and bibliography about genes [6]. Ensembl [7] is a genome browser that focuses on chordates and contains information about gene sequence and structure, expression, location on the chromosome, transcript variants, homologues, and gene ontologies. The browser is further expanded into specific databases for invertebrates, plants, fungi, protists, and bacteria in EnsemblMetazoa, EnsemblPlants, EnsemblFungi, EnsemblProtists and EnsemblBacteria. Ensembl is relevant for evolutionary analyses, comparative genomics, and population genetics studies. Data on gene expression patterns across animal species, including anatomical and embryonic information, is stored in the database Bgee [8]. GeneCards [9] stores information on human genes, including biological function, genomics, transcription factor binding sites and protein products, as well as assay products (e.g., siRNA, inhibitors or CRISPR products) and crosslinks to many other databases.

Particular model organisms that have provided extensive biological data are stored in organism-specific databases such as XenBase [10], FlyBase [11], and WormBase [12]. They include data concerning genomics, development, gene expression and variants of the amphibian Xenopus laevis, the fruit fly Drosophila melanogaster, and the nematode Caenorhabditis elegans, respectively. Data from the fission yeast Schizosaccharomyces pombe is stored in PomBase [13], which includes the complete genome, gene and protein sequences, and annotations. The Arabidopsis Information Resource (TAIR) [14] provides the complete genome sequence of the model plant Arabidopsis thaliana and information on gene sequence and structure, gene expression, protein sequence and literature. The Bio-Analytic Resource (BAR) [15] for plant biology also provides access to several plant-specific databases, including gene expression and protein tools such as the eFP Browser (https://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi) [16], that displays gene expression patterns in Arabidopsis, molecular markers and mapping and genomic tools.

Nucleotide sequences can be downloaded from these databases in the FASTA format (a text-based format for representing either nucleotide or amino-acid sequences, where nucleotides or amino acids are represented by single-letter codes). It is also possible to batch-download a large number of sequences from NCBI, by entering their identifiers (accession numbers, GI numbers or GeneIDs) in Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez).
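
As an illustration of the FASTA format described above, the following minimal Python sketch reads FASTA-formatted text into a dictionary. The function name `parse_fasta` and the two example records are hypothetical and are not taken from any of the databases mentioned; established libraries offer more robust parsers.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; drop the leading ">".
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; concatenate them.
            records[header] += line
    return records


# Hypothetical example records, for illustration only.
example = """>seq1 hypothetical record
ATGGCC
TTAA
>seq2 hypothetical record
ATGGCA
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

The same function works for amino-acid sequences, since the format only distinguishes header lines (starting with ">") from sequence lines.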

Protein databases

General information on proteins.

The Universal Protein Resource (UniProt) [17] is currently the main source of information for proteins (Table 2). UniProt contains published amino-acid sequences and open-reading-frame translations, with various annotations including structure, classification, biological function, and subcellular localization of the protein. Protein sequences can be downloaded from UniProt in the FASTA format. The resource also provides links to many other public databases. The “retrieve/ID mapping” tool of UniProt (https://www.uniprot.org/id-mapping) facilitates batch downloads of information on a set of proteins using UniProt identifiers. This tool can also be used to convert UniProt identifiers to the identifiers of external databases such as NCBI, GenBank, Ensembl or the Protein Data Bank (PDB). Gene Ontology [18] provides a unified annotation system of the molecular function, biological processes, and cellular components of proteins across all species. Information about genes, proteins, and genomes that is acquired from several omics technologies is further gathered in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [19]. The KEGG database focuses on metabolism, biological pathways, and human diseases. Summaries of the entire human proteome using antibody-based proteomics, transcriptomics, and integration of other omics technologies are gathered in the Human Protein Atlas [20]. This atlas displays expression profiles, subcellular localization, tissue and organ distribution, and protein function in human metabolism, as well as information about diseases such as cancer. The PHAROS database [20, 21] provides an overview of the literature on human proteins, including their classification, pathways, expression data, and related diseases.

Protein structure databases.

Protein structures are described by databases such as the PDB [26]. The PDB provides three-dimensional (3D) structures of proteins and their interacting ligands established by X-ray crystallography, electron microscopy, or NMR spectroscopy, which can be retrieved as pdb files. The PDB also displays a 3D visualization tool, programs for 3D analyses such as pairwise structure alignment and pairwise symmetry, and cross links to other protein databases. Annotation for protein families based on fingerprints (i.e., conserved 3D motifs specific for a protein family), are gathered in the database PRINTS [28]. PRINTS includes a 3D visualization software and search tools for protein sequence homology and pairwise or multiple sequence alignments.

Protein classification

Proteins are classified into different categories based on structural similarity, functionality and evolutionary relationship (Table 2). The 3D structural classification of proteins (SCOP) [30] classifies protein domains according to their class, fold, superfamily, and family. Large proteins can have several domains belonging to different categories. The class and fold levels are based on protein structures. Most proteins belong to one of the five structural classes (α, β, α/β, α+β, multi-domain), defined respectively by the presence of α-helices, β-strands, both α-helices and β-strands, segregated α-helices and β-strands, or none of these characteristics. Below this primary level, a protein’s secondary structure is reflected in folds. The other levels of protein classification (superfamily and family) are based on evolutionary relationships. Proteins with shared ancestry are classified in the same superfamily, and proteins sharing 30% or more sequence identity are classified in the same family [32]. Similarly, the Class, Architecture, Topology, Homology (CATH) database proposes a five-level classification of protein domains. The first three levels: class, architecture and topology, are based on structural homology. The last two, homologous superfamily and family, are based on sequence, structure and functional specificities, and sequence identity, respectively [23]. The Families of Structurally Similar Proteins (FSSP) provides a classification of proteins in the PDB based on a structure comparison algorithm, that calculates a structural similarity score between protein chains. These similarity scores are used to create a classification of protein structures [24, 25].

SUPERFAMILY [31] is a database of structurally and functionally annotated proteins. The database Protein families (Pfam) [27] classifies protein domains based on multiple sequence alignment. Pfam contains information on protein domain structures and their occurrence in living organisms. The homologues of a protein are listed in Pfam and their sequences can be downloaded in the FASTA format. For any family, Pfam displays a graphical view of all species possessing the protein domain. The domain families and functional sites of proteins are the focus of the InterPro database [17]. It combines structure-based and phylogenetic classifications. InterPro uses predictive models, called signatures, to infer functions for a sequence in association with the database Gene Ontology [18]. Biological information about the protein family, domains, and functional sites is gathered in the PROSITE database [29]. PROSITE also provides tools for identifying distant homology between sequences.

Homologue research

Studying the evolution of a family of genes or proteins requires the identification of homologues (i.e., genes or proteins with shared ancestry). Homologues of a gene or protein can be identified using appropriate tools (Table 3). The Basic Local Alignment Search Tool (BLAST) [33], maintained at the NCBI, is the most widely used heuristic algorithm for searching for sequence homology. Homologues of a gene or protein can be retrieved from the genomes or proteomes of specified taxa with a significance score called E-value (Expect value), which is defined as the number of expected hits of similar quality that can be obtained by chance [34]. A lower E-value corresponds to a higher statistical significance of the match. A BLAST search also provides the identity and similarity of the hits. BLAST can be used for nucleotide (BLASTn) or amino-acid sequences (BLASTp), translated nucleotide to protein (BLASTx) and protein to translated nucleotide (tBLASTn). UniProt [17] also provides a tool to identify proteins in the database sharing 50% or 90% identity with any protein, including paralogues and orthologues. Pfam [27] and Ensembl [7] also include tools to identify homologues of any given gene and retrieve their sequences. Other search tools include HMMER [35] of the HH-suite software, FASTA [36], SSAHA [37] and BLAT [38]. Homologues of a given sequence should be compiled into a single FASTA file. FASTA files can be made by retrieving sequences from the databases and manually adding the sequences one by one. For large datasets, for example the results of an exhaustive search for homologues, it is possible to directly export a large number of sequences in the FASTA format from NCBI or UniProt [17].
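
When a BLAST search is saved in the 12-column tabular format of the BLAST+ command-line tools (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score), candidate homologues can be filtered by E-value with a few lines of Python. The sketch below is illustrative only: the function name `filter_hits`, the threshold of 1e-5, and the three example rows are assumptions, not values recommended by the protocol.

```python
from typing import Iterable, List, Tuple


def filter_hits(tabular_lines: Iterable[str],
                max_evalue: float = 1e-5) -> List[Tuple[str, float, float]]:
    """Keep hits whose E-value is at or below max_evalue (an arbitrary cut-off).

    Returns (subject id, percent identity, E-value) tuples, best E-value first.
    """
    hits = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        subject_id = fields[1]        # column 2: subject identifier
        identity = float(fields[2])   # column 3: percent identity
        evalue = float(fields[10])    # column 11: E-value
        if evalue <= max_evalue:
            hits.append((subject_id, identity, evalue))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical rows, for illustration only.
example_rows = [
    ["query1", "subjA", "98.5", "200", "3", "0", "1", "200", "1", "200", "1e-100", "380"],
    ["query1", "subjB", "45.0", "180", "90", "5", "10", "190", "12", "191", "2e-12", "70"],
    ["query1", "subjC", "30.0", "60", "40", "2", "5", "64", "80", "139", "0.5", "25"],
]
example_lines = ["\t".join(row) for row in example_rows]

hits = filter_hits(example_lines)
for subject_id, identity, evalue in hits:
    print(subject_id, identity, evalue)
```

The appropriate cut-off depends on the size of the database searched and on how divergent the expected homologues are, so the default used here should be treated as a placeholder.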

Table 3. List of bioinformatic tools for identification of gene and protein homologues.

https://doi.org/10.1371/journal.pone.0279597.t003

Phylogenetic analysis

Exploring molecular evolution necessitates studying how species, genes, and proteins are related to each other in an evolutionary sense. Phylogenetic relationships can apply to species, genes, or proteins even within the same genome. Reconstructing evolution from molecular data (amino-acid or nucleotide sequences) includes the steps of sequence alignment and trimming, phylogenetic analysis, and study of molecular evolution using a phylogenetic tree. Below, we describe the tools used in these different steps.

Multiple sequence alignment

Aligning gene or protein sequences consists of inferring homology between bases or amino acids. Each sequence is placed in a row of a matrix so that homologous bases or amino acids fall in the same column. Alignment of the homologous residues necessitates adding gaps, indicated by the symbol “-” and corresponding to insertions or deletions (indels), into the sequences. Sequence alignment methods include the progressive approach and the consistency-based method. The progressive approach aligns progressively from the two closest to the most distant sequences. It is used by CLUSTALW [39], CLUSTAL Omega [40], MUSCLE [41], PRANK [37, 55], Kalign [44] and MAFFT [45]. Consistency-based methods calculate the best multiple sequence alignment (MSA) after different pairwise alignments using information from a third sequence as intermediate [46–48]. They are used by T-COFFEE [49], PROBCONS [50] and its successor CONTRAlign [51], the latter for amino acid sequences only (Table 4). Other approaches include the iterative refinement method, which is also included in MAFFT and MUSCLE [52], the genetic algorithms [53], and methods that use hidden Markov models [54].
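
Progressive methods build on pairwise alignments. As a minimal sketch of that pairwise step, the Python code below implements a standard global alignment by dynamic programming (the Needleman-Wunsch algorithm) with a linear gap penalty. It is not the implementation used by any of the programs cited above, and the scoring values (match = 1, mismatch = -1, gap = -2) are arbitrary assumptions chosen for illustration.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2) -> Tuple[str, str, int]:
    """Globally align two sequences; return both aligned strings and the score."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, best_score = needleman_wunsch("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(best_score)
```

Several alignments can share the same optimal score, so the position of the gap in the output depends on the order in which the traceback tests the three moves.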

Table 4. List of programs for multiple sequence alignment.

https://doi.org/10.1371/journal.pone.0279597.t004

Several MSA tools, including CLUSTALW [39], MUSCLE [41], MAFFT [45], Kalign [44] and PRANK [37, 55] display inferred MSAs through user interfaces. CLUSTALW and MUSCLE are also included in MEGA [55]. PROBCONS [50], T-COFFEE [49] and MAFFT [45] are described to have particularly high accuracy but also high execution times [56]. Their use should be restricted to small and intermediate datasets. CLUSTAL Omega [40] and Kalign [44] are particularly fast, but less accurate [57]. They can be used to analyse datasets of up to 4,000 and 2,000 sequences, respectively [56, 57]. The performance of MUSCLE is intermediate [57]. PRANK is meant for closely-related sequences [58]. Bali-Phy [59] performs a Bayesian co-estimation of alignment, phylogeny, and other parameters and is also argued to be very reliable. PASTA [60] and UPP [61], which uses a machine-learning technique, are designed for very large datasets. MAFFT offers a wide range of methods, which can be accuracy-oriented, such as L-INS-i, G-INS-i and E-INS-i; or speed-oriented, such as FFT-NS-2. The latter can be used for up to 30,000 sequences. Simultaneous Alignment and Tree Estimation (SATé) [62] is a software package providing several tools for sequence alignment and phylogenetic analysis.

In practice, finding an accurate MSA can be challenging for several reasons. First, one should keep in mind that the alignment with the best score is not necessarily biologically correct. Computer programs are not based explicitly on the hypothesis of homology between aligned residues [63]. Furthermore, it is difficult to get a good alignment for sequences that have diverged significantly and share low identity. In this case, for protein-coding sequences, amino-acid data should be preferred over nucleotide data, since it is possible to consider the biochemical similarity of amino acids [64]. Alignment programs require defining a gap-opening penalty and a gap-extension penalty, but these values are arbitrary. It is common that different sequences in the alignment do not have the same length, for biological or experimental reasons. It is recommended to keep end-gaps unpenalized [64]. Furthermore, indels are reported to affect the accuracy of MSA programs. It is recommended to use several MSA programs for sequences that contain indels [65]. MAFFT is reported to be the most accurate program in the cases of sequences with non-overlapping deletions and alternatively spliced gene products [65]. Finally, single nucleotides, small sequences (e.g., microsatellites) or entire protein domains can be repeated in a gene or protein sequence. If the number of repeats differs between sequences, one domain of a sequence can be homologous to several domains of another sequence. It is recommended to excise the repeated domains [64].

Alignment trimming

Once the alignment is completed, it is necessary to select the positions and regions that will be used for the phylogenetic inference. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or might be subject to saturation. They should be excluded prior to the phylogenetic analysis to maximize the phylogenetic signal of the alignment [66]. A minimum reporting standard has been developed to quantify the alignment completeness, and implemented in AliStat [67]. Phylogenetically informative regions of the alignment can be selected using appropriate tools, such as Guidance 2 [68], GBlocks [69], trimAl [70], BMGE [71] and Noisy [72] (Table 5).
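
The simplest form of trimming removes alignment columns that contain too many gaps. The sketch below illustrates this idea only; it is not the algorithm of any of the programs named above, which apply more elaborate criteria. The function name `trim_gappy_columns`, the 50% threshold, and the toy alignment are assumptions made for the example.

```python
from typing import Dict


def trim_gappy_columns(alignment: Dict[str, str],
                       max_gap_fraction: float = 0.5) -> Dict[str, str]:
    """Drop every column whose fraction of gaps exceeds max_gap_fraction."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that pass the gap filter.
    keep = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in zip(names, seqs)}


# Toy alignment, for illustration only; the third column is mostly gaps.
toy_alignment = {"a": "AC-GT", "b": "A--GT", "c": "ACTG-"}
trimmed = trim_gappy_columns(toy_alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

Reporting how many columns were removed, alongside the threshold used, makes the trimming step reproducible.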

Table 5. List of programs for sequence alignment trimming.

https://doi.org/10.1371/journal.pone.0279597.t005

Assessing phylogenetic assumptions

Phylogenetic methods rely on simple assumptions about the evolutionary processes, stating, for example, that all sites in the alignment evolved under the same tree (treelikeness), that mutation rates have remained constant over time (time-homogeneity), and that substitutions are reversible and, therefore, also stationary (for details on these assumptions, see [73]). If the phylogenetic data violate these assumptions, the phylogeny and evolutionary analyses may become biased [74–76]. Once the alignment is performed and the sites have been selected for phylogenetic inference, it is recommended to assess those phylogenetic assumptions when possible [1]. Statistical methods allowing users to test stationarity and homogeneity of the evolutionary processes (along diverging lineages), and treelikeness, have been developed and included in IQ-TREE and IQ-TREE2 [77, 78]. Homo2.1 (https://github.com/lsjermiin/Homo.v2.1) [79] is designed for the analysis of compositional heterogeneity in sequence alignments. It is also possible to use the R package MOTMOT [80].

Phylogenetic protocol selection

The choice of a phylogenetic method can be challenging. The appropriate phylogenetic method depends on the phylogenetic assumptions of each method. Inferring the correct phylogenetic tree requires that the data do not violate the assumptions of the method. Most phylogenetic methods assume that the sequences have evolved under time-reversible Markovian conditions (i.e., the nucleotides or amino-acids have evolved independently of time and their past history). Most model selectors consider only time-reversible Markovian models. However, if the data have evolved under more complex, non-time reversible Markovian conditions, identifying the sequence evolution model that fits the data and reconstructing the phylogenetic tree may be complex, since phylogenetic methods for such data are lacking [1]. A model of sequence evolution selected as the best fit does not necessarily imply that it adequately describes the data. Poorly fitting models are inadequate approximations of the evolutionary processes and can lead to errors. In this case, it is recommended to test the goodness of fit between the phylogenetic tree, the substitution model, and the data (see paragraph Test of goodness of Fit).

Selection of the optimal model of sequence evolution

Probabilistic and distance methods require selection of the model of molecular evolution that best describes the data. Several models of nucleotide or amino-acid substitution exist [81]. The nucleotide substitution models differ in the number of parameters considered, like mutation rates and base frequencies [82] (for a review see [83]). The main nucleotide substitution models are, from the simplest to the most complex: JC69 [84], K80 [85], F81 [86], HKY85 [87], TN93 [88] and GTR [89]. The main amino acid substitution models include JTT [90], WAG [91], LG [92] and Dayhoff [93]. Models of codon evolution also exist [94].
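
To make the difference between substitution models concrete, the sketch below computes the evolutionary distance between two aligned nucleotide sequences under the two simplest models listed above, using their standard closed-form expressions: JC69, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites; and K80, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names and the two example sequences are hypothetical.

```python
import math
from typing import Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(s1: str, s2: str) -> Tuple[int, int, int]:
    """Return (compared sites, transitions, transversions); gaps are skipped."""
    sites = transitions = transversions = 0
    for x, y in zip(s1.upper(), s2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # ignore gaps and ambiguous characters
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    return sites, transitions, transversions


def jc69_distance(s1: str, s2: str) -> float:
    """Jukes-Cantor distance: all substitutions are equally likely."""
    sites, ts, tv = count_differences(s1, s2)
    p = (ts + tv) / sites
    if p >= 0.75:
        raise ValueError("sequences are saturated; JC69 distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(s1: str, s2: str) -> float:
    """Kimura two-parameter distance: transitions and transversions differ."""
    sites, ts, tv = count_differences(s1, s2)
    p_ts, q_tv = ts / sites, tv / sites
    return -0.5 * math.log((1.0 - 2.0 * p_ts - q_tv) * math.sqrt(1.0 - 2.0 * q_tv))


# Hypothetical sequences differing by a single C/T transition.
seq_x = "ACGTACGTAC"
seq_y = "ACGTACGTAT"
d_jc69 = jc69_distance(seq_x, seq_y)
d_k80 = k80_distance(seq_x, seq_y)
print(round(d_jc69, 4), round(d_k80, 4))
```

Both corrected distances exceed the raw proportion of differences (0.1 here) because the models account for multiple substitutions at the same site.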

These models can be associated with models of substitution rate heterogeneity across sites. Mutation rates and selective pressure may vary among sites, due to different roles in the structure and function of the gene or protein. The most common rate heterogeneity across sites models are the Gamma distribution (G) and the proportion of invariant nucleotide or amino acid sites (I). Every substitution model can be associated with G, I or both. The FreeRate model (R), a more complex model of rate heterogeneity [95], is included in ModelFinder, PhyML and IQ-TREE. More recently, the GHOST model for alignments with variation in mutation rate was introduced and implemented in IQ-TREE [96].

The likelihood of the different models should be computed by appropriate software (Table 6). For every model of sequence evolution (i.e., a combination of a substitution model and a rate-heterogeneity across sites model), these tools calculate the Bayesian information criterion (BIC) [97] and the Akaike information criterion (AIC) [98] from the log-likelihood scores. A model with lower BIC or AIC is considered more accurate. The model minimizing BIC or AIC (i.e., with the lowest score) should be selected. ModelTest and jModelTest [99, 100] estimate the likelihood for phylogenetic trees based on nucleotide sequences and ProtTest [100] for amino acid sequences. ModelFinder [101] is a model selection method, for alignments of nucleotides, codons or amino acids, implemented in IQ-TREE [77, 78]. PartitionFinder 2 [102] can be used with nucleotide and amino acid data. Model selectors are also included in programs such as MEGA [55] and PhyML (SMS) [103].
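
The two criteria are simple functions of the maximized log-likelihood ln L, the number of free parameters k, and the sample size n: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below ranks three candidate models with these formulas. The log-likelihood values are invented for illustration, the parameter counts exclude branch lengths, and taking n as the number of alignment sites is an assumption of this example; model-selection programs handle these details internally.

```python
import math
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion, with n taken as the number of sites."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# model name -> (hypothetical log-likelihood, free substitution parameters)
candidates: Dict[str, Tuple[float, int]] = {
    "JC69": (-2050.0, 0),
    "HKY85": (-2010.0, 4),
    "GTR": (-2008.5, 8),
}
n_sites = 500  # hypothetical alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(best_by_aic, best_by_bic)
```

In this invented example GTR has the highest likelihood, but its four additional parameters are not justified by the small gain, so both criteria prefer HKY85. BIC penalizes parameters more heavily than AIC as soon as ln(n) exceeds 2.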

Table 6. List of programs for molecular evolution model selection.

https://doi.org/10.1371/journal.pone.0279597.t006

Phylogenetic analysis

A phylogenetic tree is a graphical illustration of the evolutionary relationships between taxa, genes, or proteins. For comprehensive reviews, see [104, 105]. Phylogenetic trees may consider the topology and the branch lengths (phylograms) or just the topology (cladograms). Several tree-building methods exist. Distance methods create a matrix of molecular distances, defined by the numbers and types of differences between the sequences, and they use this matrix to reconstruct the phylogenetic tree. Character-based methods compare all sequences at the same time, site by site. They include Maximum Parsimony (MP) and the probabilistic methods: Maximum Likelihood (ML) [106] and Bayesian Inference (BI) [107–109]. Maximum parsimony is a classical and simple method, now rarely used with molecular data, that calculates the minimum number of evolutionary steps, including nucleotide insertions, deletions or substitutions, between species. The main weakness of this method is that it ignores hidden mutations and does not consider branch lengths. This can lead to incorrect clustering of unrelated taxa, a phenomenon known as long branch attraction (for a review, see [110]). Probabilistic methods are the most recent and today the most widely used methods for phylogenetic inference. They are more relevant for molecular phylogenetics because they use specified models of molecular evolution and rely on likelihood calculations, but their execution time is longer.

Phylogenetic analysis using distance methods.

Distance methods include the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [111], Neighbor Joining (NJ) [112], and Minimum Evolution (ME) [113] tree-inference methods. FastME [114] is designed for phylogenetic inference using diverse distance methods with nucleotide or amino-acid data or distance matrices (Table 7). PyCogent [115] is a software library for genomic biology, allowing for phylogenetic analysis and a large number of evolutionary, statistical and genomic analyses, including partition models, as well as graphical display and annotation of phylogenetic trees. SplitsTree [116, 117] is used for inference of unrooted phylogenetic trees and phylogenetic networks from sequence alignments or distance matrices. APE (Analysis of Phylogeny and Evolution) [118] is a package written in the R language that provides a wide range of evolutionary analyses, including calculating genetic distances and computing phylogenetic trees using distance-based methods. PAUP [119], MEGA [55], FastTree [120] or PHYLIP [121], can also be used for distance-based methods, as well as ML, and/or MP [122, 123].
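
The logic of distance methods can be sketched in a few lines: compute a pairwise distance matrix, then repeatedly merge the two closest clusters. The Python code below implements UPGMA on uncorrected p-distances and returns the topology in Newick notation without branch lengths. It is a didactic sketch rather than a substitute for the programs listed above; the function names and the four toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet


def p_distance(s1: str, s2: str) -> float:
    """Proportion of differing sites, ignoring positions with a gap."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(sequences: Dict[str, str]) -> str:
    """Cluster sequences by UPGMA; return a Newick string (topology only)."""
    dist: Dict[FrozenSet[str], float] = {
        frozenset((a, b)): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }
    clusters = {name: 1 for name in sequences}  # cluster label -> number of taxa
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"


# Toy aligned sequences, for illustration only.
toy_sequences = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGTTCGAAT",
    "D": "TCGTTCGAAG",
}
tree = upgma(toy_sequences)
print(tree)
```

UPGMA assumes a constant rate of evolution across lineages, which is why Neighbor Joining or Minimum Evolution is often preferred when rates vary.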

Table 7. List of programs for phylogenetic analysis using distance methods, maximum parsimony, maximum likelihood and Bayesian inference.

https://doi.org/10.1371/journal.pone.0279597.t007

Phylogenetic analysis using maximum likelihood.

Maximum-likelihood methods calculate the likelihood of observing the data under different explicit models of molecular evolution. Maximum likelihood aims to identify the best fit model by exploring multiple combinations of trees and model parameters. Programs for ML phylogenetic analysis include MEGA [55], SeaView [124], PhyML [125], RAxML [126], FastTree [120], PAML [127], PAUP [119], IQ-TREE [77, 78], HYPHY [128], PHYLIP [121] and GARLI [129] (Table 7). All of them can be used with nucleotide or amino-acid data. MEGA [55] and SeaView [124] are known to be very user-friendly. They include sequence alignment tools and tree manipulators. PhyML [125] is reported as being accurate, easy to use and, like PAUP and MEGA [55], includes many common models of substitution. RAxML [126] and particularly FastTree [120] are fast and well suited for large datasets (up to 1 million sequences with FastTree). In addition to assuming Gamma-distributed rate-heterogeneity across sites and the proportion of invariant sites, they include CAT, a specific model of rate heterogeneity [130]. IQ-TREE [77, 78], which includes ModelFinder [101] and the very fast bootstrapping method UFBoot2 [131], is reported to be both fast and accurate [132].
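
The principle of maximum likelihood can be illustrated on the smallest possible problem: estimating the distance between two sequences under the JC69 model. Under this model, the probability that a site is identical after a distance d is 1/4 + (3/4) exp(-4d/3), and the probability of observing one specific different nucleotide is 1/4 - (1/4) exp(-4d/3). The sketch below evaluates the log-likelihood over a grid of distances and keeps the maximum. Real ML programs optimize tree topology, branch lengths, and model parameters jointly with far more efficient algorithms; the grid search, the function names, and the site counts are assumptions made for the example.

```python
import math


def jc69_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of two sequences separated by distance d under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # site unchanged
    p_diff = 0.25 - 0.25 * decay   # site changed to one specific nucleotide
    # Each site contributes (equilibrium frequency 1/4) x (transition probability).
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_distance(n_sites: int, n_diff: int,
                step: float = 0.0001, upper: float = 3.0) -> float:
    """Grid search for the distance that maximizes the log-likelihood."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Hypothetical data: 100 aligned sites, 10 of which differ.
estimate = ml_distance(n_sites=100, n_diff=10)
analytic = -0.75 * math.log(1.0 - 4.0 * 0.1 / 3.0)
print(round(estimate, 4), round(analytic, 4))
```

For this two-sequence case the numerical maximum coincides, up to the grid resolution, with the closed-form JC69 distance, which is the analytical maximum-likelihood estimate.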

Phylogenetic analysis using Bayesian Inference

The most recently developed method for phylogenetic reconstruction uses Bayesian Inference (BI). This method calculates the posterior probability of the tree and model of sequence evolution, given the data. The main software used for BI-based phylogenetics is MrBayes [133]. It uses the Markov Chain Monte Carlo (MCMC) algorithm (Table 7). PhyloBayes [134, 135] is a Bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model. It is well adapted for large datasets and phylogenomics. BAli-Phy [59] can also be used for phylogenetic analysis using BI.

Test of reliability of the inferred tree

It is recommended to estimate the reliability of the clades of the inferred phylogenetic tree. Most programs of phylogenetic analysis use the non-parametric bootstrapping method [138]. Bootstrapping is a resampling technique used to assess the repeatability of the clade, and estimate how consistently it is supported by the data [139]. The sites in the alignment (nucleotides or amino acids) are randomly resampled with replacement and a new phylogeny is calculated for each replicate [138]. A bootstrap value, corresponding to the proportion of replicate phylogenies that recovered the clade, is calculated for every internal branch. A bootstrap value of 100% means that the branch is supported by all resampled datasets, while low values mean that only few of these datasets support the branch. Bootstrap values depend on both data and the method used. Users should keep in mind that bootstrapping gives a measure of the consistency of the estimate, but it is not a measure of the accuracy of the tree [140]. The number of replicates that are necessary to obtain a good accuracy of the bootstrap depends on the bootstrap value. For example, for a 1% confidence interval on a bootstrap value of 95, 2,000 replicates are necessary [139].
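The resampling step and the dependence of the required number of replicates on the bootstrap value can be sketched as follows. The first function draws alignment columns with replacement, as described above. The second estimates the number of replicates from a normal approximation to the binomial distribution, which is an assumption of this sketch and not necessarily the calculation used in [139]; the function names are hypothetical.

```python
import math
import random

def resample_columns(alignment, rng):
    """Return one bootstrap replicate: alignment columns drawn with replacement.
    `alignment` maps sequence names to aligned sequences of equal length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def replicates_needed(support, half_width, z=1.96):
    """Replicates needed so that support +/- half_width is an approximate
    95% interval (normal approximation to the binomial; an assumption)."""
    return math.ceil(z ** 2 * support * (1 - support) / half_width ** 2)

# Toy alignment (hypothetical sequences).
toy = {"seq1": "ACGTAC", "seq2": "ACGAAC", "seq3": "TCGAAT"}
replicate = resample_columns(toy, random.Random(42))
print(replicate)

# Replicates for +/- 1% around a bootstrap value of 95%.
print(replicates_needed(0.95, 0.01))
```

Under this approximation the estimate is somewhat below 2,000 replicates, which agrees in order of magnitude with the figure quoted above. Values closer to 50% require more replicates for the same interval width.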

Since bootstrapping can be time consuming, fast approximation methods for phylogenetic bootstrap, UFBoot and UFBoot2, have been developed and implemented in IQ-TREE [131, 141, 142]. They are also less biased than other non-parametric bootstrapping methods and robust against moderate model violations. While other methods tend to underestimate the probabilities of the clade of being correct, the values from UFBoot and UFBoot2 truly reflect this probability, simplifying the interpretation of bootstrap values [141].

The approximate likelihood ratio test (aLRT), implemented in PhyML [143], is an alternative to the non-parametric bootstrap. Bayesian inference methods use posterior probabilities (PP) to measure branch support. It is also possible to compare the topology of different trees. In ML-based phylogenetic analysis, the Shimodaira-Hasegawa (SH) test and its improved version, the approximately unbiased (AU) test [144], have been designed to evaluate alternative phylogenetic hypotheses, and test if a tree is better supported than another one. These tests can be used with PAUP, PhyML, FastTree and IQ-TREE.

Tree rooting

The root of a phylogenetic tree is the hypothetical last common ancestor of all the sequences present in the tree. Depending on the question asked, phylogenetic trees can be unrooted or rooted. The latter corresponds to the identification of ancestral and derived states, aiming at studying the direction of the evolution of the sequences [69]. Diverse methods have been developed to root phylogenetic trees. The most common consists in including outgroups (i.e., sequences that are related to, but fall outside, the ingroup of interest) in the analysis. Typically, two outgroups are selected, one being more closely related to the ingroup than the other, allowing for a proper identification of the states of characters. Correctly rooting a phylogeny can be challenging, for example in the case of rapid evolutionary radiations. The outgroups can be subject to long-branch artifacts and tend to cluster with the longest branches of the tree [145]. A study suggests reconstructing the trees with and without outgroups. When the outgroup affects the topology, the tree with no outgroups should be preferred [146]. When outgroups are not included, alternative methods can be used. For example, midpoint rooting places the root at the mid-point between the most dissimilar sequences in the tree, and molecular clock rooting assumes that the rate of evolution is constant across the sequences [69].
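Midpoint rooting can be sketched on a small unrooted tree stored as a weighted adjacency dictionary. The representation, the function names and the toy branch lengths are assumptions of this example; tree-handling programs implement the same idea on their own data structures.

```python
def farthest(tree, start):
    """Return (distance, path) to the node farthest from `start`
    in a tree given as {node: {neighbour: branch_length}}."""
    best = (0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[0]:
            best = (dist, path)
        for neighbour, length in tree[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, dist + length, path + [neighbour]))
    return best

def midpoint_root(tree):
    """Locate the midpoint of the longest tip-to-tip path.
    Returns (a, b, offset): the root lies on branch a-b, `offset` away from a."""
    _, path = farthest(tree, next(iter(tree)))   # first pass: find one end
    total, path = farthest(tree, path[-1])       # second pass: find the other end
    walked = 0.0
    for a, b in zip(path, path[1:]):
        if walked + tree[a][b] >= total / 2:
            return a, b, total / 2 - walked
        walked += tree[a][b]

# Toy unrooted tree: tips A, B, C, D and internal nodes X, Y (hypothetical lengths).
toy_tree = {
    "A": {"X": 1.0}, "B": {"X": 2.0},
    "C": {"Y": 4.0}, "D": {"Y": 1.0},
    "X": {"A": 1.0, "B": 2.0, "Y": 1.0},
    "Y": {"C": 4.0, "D": 1.0, "X": 1.0},
}
print(midpoint_root(toy_tree))
```

In this toy tree the most distant tips are B and C, so the root falls on the long branch leading to C. This also shows why midpoint rooting can be misleading when one lineage evolves much faster than the others.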

Test of goodness-of-fit

The inferred optimal model of sequence evolution used for the phylogenetic analysis can be inadequate. Once a phylogenetic tree has been inferred, it is recommended to test the goodness-of-fit (i.e., the adequacy) between the tree, the model, and the data [1]. A good fit means that the tree and the model of sequence evolution provide a good explanation of the data but does not indicate if the tree is correct or not. The goodness of fit can be tested using a parametric bootstrap [147], a method that consists in simulating sequence evolution to generate pseudo-data, using the optimal tree and the optimal model as an input. Sequence generating programs, such as SeqGen (https://github.com/rambaut/Seq-Gen/releases/tag/1.3.4, [148]) can be used. The goodness-of-fit is calculated from the difference between the unconstrained and constrained (i.e., assuming the optimal tree and model) log-likelihoods of the real data and the pseudo-data. If the fit is poor, it is recommended to check the alignment and the selected set of sites, and the sequence evolution model (feedback loops on Fig 1). The adequacy of the data can be tested using the frequentist Goldman-Cox (GC) test, which can be performed with PAUP [149]. Most Bayesian phylogenetic programs employ the posterior predictive (PP) test [150].
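The comparison between the real data and the pseudo-data can be sketched as follows, assuming the log-likelihoods have already been computed by a phylogenetic program. The function names, the toy values and the add-one correction in the p-value are assumptions of this example.

```python
def fit_statistic(lnl_unconstrained, lnl_constrained):
    """Difference between the unconstrained and the constrained log-likelihood."""
    return lnl_unconstrained - lnl_constrained

def parametric_bootstrap_p(observed, simulated):
    """Proportion of simulated statistics at least as large as the observed one.
    The +1 terms count the observed dataset itself (a common convention,
    used here as an assumption)."""
    exceed = sum(value >= observed for value in simulated)
    return (exceed + 1) / (len(simulated) + 1)

# Hypothetical log-likelihoods for the real alignment.
observed = fit_statistic(-1000.0, -1040.0)

# Hypothetical statistics obtained from nine simulated alignments.
simulated = [22.0, 31.5, 35.0, 41.0, 28.0, 30.0, 25.5, 33.0, 29.0]

print(parametric_bootstrap_p(observed, simulated))
```

A small p-value indicates that the real data fit the tree and model worse than data simulated under them, which suggests revisiting the alignment or the substitution model.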

Visualization tools

Once the phylogenetic tree has been computed, it can be visualized using graphical software such as FigTree [151], ETE Toolkit [152] or iTOL [153] (Table 8). MEGA [55] and SeaView [124] also include visualization tools. Using different sets of options, several types of phylogenetic trees can be drawn (rooted or not, cladogram or phylogram), and branch support values (bootstrap values or posterior probabilities) can be displayed.

Table 8. List of tools for graphical visualization and annotation of phylogenetic trees.

https://doi.org/10.1371/journal.pone.0279597.t008

Integrative services for phylogenetic workflows

Packages for phylogenetic analysis can facilitate phylogenetic inference, analysis and other evolutionary studies (Table 9). A complete series of libraries for bioinformatics including tools for sequence alignment, phylogenetic analysis, study of molecular evolution and population genetics is available through the Bio++ suite [154] and HyPhy [128]. The Cyberinfrastructure for phylogenetic research (CIPRES science gateway) [155] is a public resource for phylogenetic analysis that includes many tools and software for sequence alignment, model selection, and phylogenetic inference. Other packages include NGPhylogeny [156] and Phylemon [157]. Geneious is a platform for DNA, RNA and protein studies that provides tools for NGS assembly, sequence alignment, phylogenetic analysis using NJ, UPGMA, ML and BI, 3-D structures study, and SNP analysis [158].

Table 9. List of packages for phylogenetic analysis and evolutionary studies.

https://doi.org/10.1371/journal.pone.0279597.t009

Study of molecular evolution

Inferring phylogenetic relationships allows users to study many aspects of molecular evolution. Here, we propose a non-exhaustive list of studies that can be carried out using phylogenetic trees and the above-mentioned bioinformatic tools.

Reconstitution of ancestral states

Retracing the functional evolution of genes, proteins, or biological traits often requires the reconstitution of ancestral states. Ancestral states can be inferred from a phylogenetic tree using MP, ML, or BI. Inferring ancestral states also requires the aligned sequences and, when using probabilistic and distance methods, the model of sequence evolution that was used for the phylogenetic analysis. For ML-based reconstructions, MEGA [55], PAML [127], IQ-TREE and IQ-TREE 2 [77, 78], HyPhy [128], Bio++ [154], and Mesquite [159] can be used (Table 10). BEAST [160], MrBayes [133], and BayesTraits [136] use BI. RASP (Reconstruct Ancestral State in Phylogenies) [161] can be used for both ML- and BI-based ancestral state reconstruction. PyCogent [115] provides a large number of evolutionary analyses, including ancestral state reconstruction.

Table 10. List of programs and databases for diverse evolutionary analyses in complement to phylogenetic analysis.

https://doi.org/10.1371/journal.pone.0279597.t010

Measure of selection strength

The type and strength of selection on protein-coding genes may be of interest. It is evaluated using the ratio of the number of non-synonymous substitutions (substitutions changing the protein sequence) per non-synonymous site (dN) to the number of synonymous substitutions (substitutions with no effect on the protein sequence due to the redundancy of the genetic code) per synonymous site (dS). If dN/dS > 1, the rate of non-synonymous substitutions is higher than expected under neutrality and the gene is under positive selection. If dN/dS < 1, the gene is under purifying selection, and if dN/dS = 1, the gene is evolving neutrally. It is recommended not to use the dN/dS ratio for closely related species [175]. The ratio can be calculated using PAML [127], MEGA [55], Bio++ [154] and HyPhy [128] (Table 10).
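The interpretation rule above can be written as a small helper. This sketch only classifies a ratio from dN and dS values that have already been estimated by one of the programs listed; the function name and the tolerance around 1 are assumptions, and the classification is not a substitute for the statistical tests those programs provide.

```python
def selection_regime(dn, ds, tolerance=0.05):
    """Classify selection from dN and dS.
    `tolerance` is a hypothetical margin around dN/dS = 1 treated as neutral."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    omega = dn / ds
    if omega > 1 + tolerance:
        return omega, "positive selection"
    if omega < 1 - tolerance:
        return omega, "purifying selection"
    return omega, "neutral evolution"

# Hypothetical estimates for three genes.
for dn, ds in [(0.3, 0.6), (0.9, 0.3), (0.5, 0.5)]:
    print(dn, ds, selection_regime(dn, ds))
```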

Time-calibration of phylogenetic trees

Time calibration of phylogenetic trees consists in estimating divergence times, using events with a known age, such as fossil and other geological data (that can only give minimal ages) as calibration points. Alternatively, mutation rates can be used to calculate the divergence time between two sequences. However, since the mutation rate can vary substantially during long-term evolution and differ between taxonomic groups, using mutation rates should be avoided for distantly related species [176]. The estimated divergence times between species are summarized in TimeTree [173] (Table 10). It is noteworthy that divergence time estimates from the literature, based on calibration points from fossil data and molecular clocks, are prone to error and illusory precision [177]. TimeTree can be used with MEGA [55] to calibrate a phylogeny. The BEAST package [160] uses BI to estimate mutation rates and calibrate phylogenies. LSD [178], recent versions of PhyloBayes [134] and APE [118] can also be used for molecular dating of evolutionary events.
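The calculation of a divergence time from a mutation rate can be illustrated under the simplest assumption, a strict molecular clock. Since both lineages accumulate substitutions independently after their split, the pairwise distance d equals 2rT, where r is the rate per site per year and T the divergence time. The function name and the numerical values below are hypothetical.

```python
def divergence_time(distance, rate):
    """Divergence time in years under a strict molecular clock.
    distance: substitutions per site between the two sequences.
    rate: substitutions per site per year along one lineage."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    # The two lineages diverge at a combined rate of 2 * rate.
    return distance / (2.0 * rate)

# Hypothetical values: 0.02 substitutions per site, 1e-9 per site per year.
print(divergence_time(0.02, 1e-9))
```

The programs cited above relax the strict-clock assumption and combine several calibration points, which is why they are preferred for real analyses.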

Study of host/parasite co-evolution

Co-evolution refers to the genetic or morphological changes (or both) between different species in interaction. It is widely used in evolutionary ecology and parasitology to study the evolution of hosts and parasites. Co-evolutionary events include co-speciation, host change, duplication, and loss of interaction. The evolution of the parasite is partly driven by the evolution of the host, which is considered independent from the evolution of the parasite [179]. The co-evolutionary history can be presented as a co-phylogeny with the two entities. Some programs for studying co-evolution, including Jane [170], CoRe-PA [166] and TreeMap [174] (Table 10), are based on the hypothesis that the evolution of the parasite is driven by the evolution of the host. Others, such as Copycat [165], reconcile the two phylogenies under the hypothesis that the situation is symmetric and evaluate the significance of co-evolution under a statistical framework. Co-evolution of genes or proteins can also be studied using these tools.

Phylogenetic comparative analysis

Evolutionary biology often employs the so-called phylogenetic comparative methods to study the adaptive significance of biological traits. These methods aim at identifying biological characters, in terms of morphology, physiology or ecology, that result from a shared ancestry. Comparative analysis uses a correlative approach between traits, taking into account the phylogenetic constraints [180]. Comparative analyses can be performed for quantitative or qualitative variables. Suitable programs include Mesquite [159] and BayesTraits [136] (Table 10).

Genome evolution

Phylogenetic trees, in complement with genomics tools and databases, can be used to study genome evolution, and identify evolutionary events such as mutations, insertions, deletions, gene or genome duplications, genome re-organization, chromosomal rearrangements, polyploidization events or genetic exchanges. Molecular databases, such as Ensembl [7] and GenBank [4] (Fig 1, Table 1), can be used to study genome evolution. Ohnologs [171] summarizes the whole genome duplication events during the evolution of vertebrates. This database can be used to interpret the duplication events and identify paralogues resulting from a whole genome duplication. Horizontal gene transfer, i.e., the gene exchanges between different organisms, can be estimated using HGT-Finder [169] (Table 10). CoGe [164] provides many tools for comparative genomic research, including BLAST [33] and tools for studying synteny, genomic inversions or horizontal gene transfers. Computational analysis of gene family evolution (CAFE) [163] is a program for studying the evolution of gene family sizes. It can be used to calculate the birth and death rates of gene families over phylogenies.

Population genetics

Genetic diversity can also be explored at the population level by analyzing polymorphism between members of the same species. Population geneticists often study allele diversity within a population, including single nucleotide polymorphisms (SNP), indels, microsatellites or transposable elements. Mathematical models have been developed to describe polymorphism. For instance, nucleotide diversity (π) measures the degree of polymorphism in a population, based on the average number of nucleotide differences per site between pairs of sequences [181]. The fixation index (FST) is a statistic of genetic distance between populations based on their allelic composition using multiple alleles [182]. Linkage disequilibrium measures the association between alleles at different loci in a population. Several programs are suitable for population genetics studies; for a full review, see [183]. Arlequin [162], SNiPlay [172], DnaSP [184] and GENEPOP [168] can be used to compute statistics describing genetic diversity in populations, as well as the R-written package APE [118] (Table 10). Arlequin and GENEPOP are also relevant for inferring the strength of genetic drift and selection. The Bio++ suite [154] and HyPhy [128] also include tools for population genetics analyses.
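As a worked example of one of these statistics, nucleotide diversity can be computed from aligned sequences of a population sample as the mean number of differences per site over all pairs of sequences. This is a minimal sketch with a hypothetical function name and toy sequences; it ignores gaps and missing data, which dedicated programs handle.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average number of nucleotide differences per site between all pairs
    of aligned sequences of equal length (no gap handling)."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(x != y for x, y in zip(a, b)) for a, b in pairs
    )
    return differences / (len(pairs) * length)

# Toy sample of three aligned sequences (hypothetical).
sample = ["ACGT", "ACGA", "ACTA"]
print(nucleotide_diversity(sample))
```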

Protein structure study

The study of protein functional evolution can require bioinformatic tools for protein structural analyses (Table 11). 3D structure comparisons can be performed using PyMOL [185]. Structure alignments can be performed and the mean distance in ångström between homologous residues can be calculated with this program. I-TASSER [186], HHpred of the HH suite [187] and AlphaFold [188] can be used to predict the 3D structure of proteins from their amino-acid sequences. FoRSA [189] is able to identify a protein fold from its amino-acid sequence or a protein sequence in the proteome of a species from a crystal structure.

Table 11. List of programs for protein structure analyses.

https://doi.org/10.1371/journal.pone.0279597.t011

Two test-cases of evolutionary analyses of proteins

Following the roadmap for evolutionary analyses of proteins that is presented above (Fig 1), we now demonstrate how to track the evolution of the p53 family and human cyclins and CDKs.

Reconstructing the evolutionary history of the p53 family

TP53 is a transcription factor regulating genes involved in DNA repair and cell cycle control, inducing growth arrest or apoptosis depending on the physiological conditions and cell type. TP53 has been extensively studied for its role in development and cancer. Two paralogues of p53 are identified in vertebrate genomes: p63 and p73. Here, we propose a simple bioinformatic study to reconstruct the evolutionary history of the proteins p53, p63 and p73, to illustrate our roadmap. We investigate how the paralogues of different animal species are related to each other, when they appeared and diverged, and when they evolved new protein domains. We describe step-by-step the methods and tools used, from the selection of sequences to the phylogenetic inference and reconstruction of the evolutionary history of the TP53 family, using data from reference [190].

1. State of the art and protein classification of p53.

According to UniProt [17], the human p53 (cellular tumor antigen p53, hereafter named HsTP53) is located in the nucleus, the endoplasmic reticulum, the cytoskeleton and the mitochondrion. HsTP53 is labelled as P04637 in UniProt [17] (https://www.uniprot.org/uniprot/P04637, accessed September 06, 2021) where the full sequence can be downloaded in the FASTA format. HsTP53 contains 393 amino acids. According to Pfam 35.0 [27], HsTP53 contains four main protein domains: P53 TAD (transactivating domain), TAD2, P53 DNA binding domain, and P53 tetramer. p63 and p73 also contain the P53 DNA binding domain and the P53 tetramer domain. The P53 TAD and TAD2 domains are absent in p63 and p73, which instead both include a single SAM_2 domain.

P53 (PF00870 in Pfam) is the main domain of the p53 protein, covering the amino acids 99 to 289. Pfam contains 1,765 P53-domain-containing sequences from 382 species, all in choano-organisms (metazoans and choanoflagellates), including 5 sequences in choanoflagellates and 13 sequences in the genome of Homo sapiens (Fig 2A, the figure can also be accessed here: https://pfam.xfam.org/family/PF00870#tabview=tab7). P53 TAD and TAD2 are two transcription scaffold domains. The Pfam database [27] includes 253 sequences containing the P53 TAD domain, in bilaterians only. The domain TAD2 is present in 81 sequences, from primates only. P53 tetramer serves for the oligomerization of the protein. The database includes 1,392 sequences, in animals only, containing the domain P53 tetramer. The SAM_2 (sterile alpha motif) domain is a putative protein interaction domain. More than 20,000 sequences containing this domain are present in Pfam [27], in more than 1,400 species.

Fig 2. Sunburst plot of the distribution of the p53 protein domain (PF00870) in living organisms according to Pfam (accessed September 06, 2021).

The plot shows the distribution of the 1,765 sequences containing the P53 binding domain across 382 species. Every bar on the periphery represents one single species, containing one or several p53 paralogues in their genome (A), and percent identity matrix created by CLUSTALW 2.1 of the 7 human p53 domain-containing proteins and two of their most distant homologues from the choanoflagellate Monosiga brevicollis (B).

https://doi.org/10.1371/journal.pone.0279597.g002

The SCOP classification of p53 is as follows (accessed September 06, 2021):

2. Identification of homologues.

To reconstruct the evolutionary history of the p53 domain in animals, the selection of TP53 homologues covering the diversity of the family is necessary.

Using a protein BLAST (Blastp) (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins), paste the sequence of HsTP53 in the FASTA format, and select the genomes of the species of interest. Launch a BLAST search and download the amino-acid sequences in the FASTA format. Then, paste all the sequences in a single file using ’.fasta’ as filename extension.
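Gathering the downloaded sequences in a single FASTA file can also be scripted. The sketch below parses FASTA text, merges several downloads while checking for duplicated identifiers, and formats the result. The function names and the toy records are hypothetical; the sequences shown are placeholders and not real p53 sequences.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}; the identifier is the
    first word after '>'."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def merge_fasta(texts):
    """Merge several FASTA texts, refusing duplicated identifiers."""
    merged = {}
    for text in texts:
        for name, seq in parse_fasta(text).items():
            if name in merged:
                raise ValueError(f"duplicate identifier: {name}")
            merged[name] = seq
    return merged

def to_fasta(records, width=60):
    """Format records as FASTA text with lines of at most `width` residues."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Placeholder downloads (hypothetical identifiers and sequences).
download_1 = ">species_A_p53 toy record\nMKLVV\nTTAG\n"
download_2 = ">species_B_p53\nMRLVVTSAG\n"
combined = to_fasta(merge_fasta([download_1, download_2]))
print(combined)
# The text in `combined` can then be written to a file such as 'p53_family.fasta'.
```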

In this example, the p53 homologues of diverse animals (the cnidarian Hydra vulgaris, four insect species: Drosophila melanogaster, Apis mellifera, Bombus terrestris and Aedes aegypti, and the tunicate Ciona intestinalis) and the p53, p63, and p73 of diverse vertebrates (the teleost fish Danio rerio, the coelacanth Latimeria chalumnae, the amphibian Xenopus tropicalis, the lizard Anolis carolinensis, the bird Gallus gallus, and the mammals Bos taurus and H. sapiens) were chosen. The p53 of the choanoflagellate Monosiga brevicollis (a protist related to animals), also retrieved from a BLAST search, was chosen as outgroup (S1 File).

3. Multiple sequence alignment and alignment trimming.

Use an alignment tool (e.g. MAFFT, https://www.ebi.ac.uk/Tools/msa/mafft/ [45]) to align the sequences. Paste the sequences in the FASTA format and submit. Save the alignment in a new FASTA file. You can also upload the sequences directly to the Guidance 2 server (http://guidance.tau.ac.il/) [68] and proceed to the alignment using MAFFT. Open the color-coded MSA to identify poorly aligned and highly variable regions. You can delete them manually from the alignment or remove unreliable columns below a certain confidence cutoff. The new MSA, hereafter named sub-MSA, will be used for the phylogenetic analysis.
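The idea of removing unreliable columns below a cutoff can be illustrated with a simpler criterion than the confidence scores computed by Guidance 2. The sketch below drops columns in which the fraction of gaps exceeds a threshold. The gap-fraction criterion, the function name and the default cutoff are assumptions of this example and do not reproduce the Guidance 2 scoring.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds `max_gap_fraction`.
    `alignment` maps sequence names to aligned sequences of equal length."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns
            if col.count("-") / len(col) <= max_gap_fraction]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}

# Toy alignment (hypothetical sequences).
msa = {"seq1": "MK-LV", "seq2": "MR-LV", "seq3": "M-TLV"}
sub_msa = trim_gappy_columns(msa)
print(sub_msa)
```

Here the third column, which is a gap in two of the three sequences, is removed, while the second column with a single gap is retained.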

Optional: calculate the identity matrix of the sequences using alignment tools (e.g., CLUSTALW 2.1).

The 13 human p53 paralogues share 36% to 100% identity, and the two paralogues of Monosiga brevicollis share 21.4% identity (Fig 2B). Human and Monosiga orthologues share 17% to 25% identity. Hence, all human paralogues are more similar to each other than to any of the Monosiga orthologues.
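A percent identity matrix of this kind can be approximated directly from the alignment. In this sketch, identity is the percentage of identical residues among positions where neither sequence has a gap; this denominator is an assumption and may differ from the convention used by CLUSTALW, so the values are not expected to match Fig 2B exactly. Function names and toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical residues over positions without gaps
    in either aligned sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def identity_matrix(alignment):
    """Return {(name_a, name_b): percent identity} for all ordered pairs."""
    names = list(alignment)
    return {(a, b): percent_identity(alignment[a], alignment[b])
            for a in names for b in names}

# Toy alignment (hypothetical sequences).
msa = {"seq1": "MKLV", "seq2": "MRLV", "seq3": "M-LV"}
for pair, value in identity_matrix(msa).items():
    print(pair, round(value, 1))
```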

4. Sequence evolution model selection.

We propose to perform a phylogenetic analysis of the p53 family using a distance-based method (NJ) and a probabilistic method (ML) and compare the results. First, it is necessary to identify the optimal model of sequence evolution.

Here, we are using protein sequences. Use ProtTest 3.4.2 [100] to calculate the log-likelihoods of a panel of 56 amino-acid substitution models, and select the most relevant one based on the BIC or AIC score. Select the model with the lowest score.
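Both criteria penalise the log-likelihood by the number of free parameters, and the model with the lowest score is retained. The sketch below applies the standard definitions, AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, to three candidate models. The model labels, log-likelihoods and parameter counts are invented for illustration, and taking n as the number of alignment sites is an assumption of this example.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion, with sample size = alignment length."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

# Hypothetical candidates: label -> (log-likelihood, number of free parameters).
candidates = {"M1": (-5230.0, 45), "M2": (-5105.0, 46), "M3": (-5120.0, 46)}
n_sites = 200  # hypothetical alignment length

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```

Because the BIC penalty grows with the alignment length, it tends to select simpler models than the AIC on long alignments.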

Alternatively, use the substitution model selectors included in IQ-TREE or MEGA. For example, with the IQ-TREE web server (http://iqtree.cibiv.univie.ac.at/), open the Model Selection panel, upload the sub-MSA, select “protein sequences”, choose a selection criterion (AIC or BIC) and proceed to the analysis. With MEGA 11, load the sub-MSA, and select “Find best DNA/protein models” in the Model panel.

The model JTT+G [90] (JTT with Gamma-distributed rate heterogeneity across sites), which minimizes the BIC score, was selected.

5. Phylogenetic inference.

For beginners, we recommend using programs that include a user interface or an online version, such as MEGA, SeaView, or the IQ-TREE server. The phylogenetic trees were inferred using MEGA 11 [55] for the NJ-based analysis, and IQ-TREE 2 [77, 78] for the ML-based analysis, using the sub-MSA and the appropriate model (JTT+G).

With MEGA 11, in the Phylogeny panel, perform a phylogenetic analysis using NJ with the sub-MSA. Select the appropriate substitution model (e.g., JTT+G) and the bootstrap method with e.g., 1000 replicates.

With IQ-TREE 2, upload the alignment file, select the appropriate sequence type (DNA or protein) and the appropriate substitution model (e.g., JTT+G). In the panel “branch support analysis”, select the Ultrafast Bootstrap analysis with e.g., 1000 replicates. For single branch tests, you can also select the SH-aLRT test.

Save the phylogenetic tree including the Bootstrap/SH-aLRT values and branch lengths in the Newick format and open it with FigTree or iTOL for a graphical display of the tree. You can also paste the tree in the Newick format directly into the graphical program.
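The same ML analysis can be run from the command line instead of the web server. The sketch below only assembles the command; it assumes a local installation of IQ-TREE 2 available as `iqtree2`, and the file name `sub_msa.fasta` and output prefix are hypothetical. The options shown (`-s`, `--seqtype`, `-m`, `-B`, `--alrt`, `--prefix`) should be checked against the documentation of the installed version.

```python
import shlex

def iqtree_command(alignment, model="JTT+G", ufboot=1000, alrt=1000,
                   prefix="p53_ml"):
    """Build an IQ-TREE 2 command for an ML tree with ultrafast bootstrap
    and SH-aLRT branch support on a protein alignment."""
    return [
        "iqtree2",
        "-s", alignment,        # input alignment (sub-MSA)
        "--seqtype", "AA",      # protein data
        "-m", model,            # substitution model chosen in step 4
        "-B", str(ufboot),      # ultrafast bootstrap replicates
        "--alrt", str(alrt),    # SH-aLRT replicates
        "--prefix", prefix,     # prefix of the output files
    ]

cmd = iqtree_command("sub_msa.fasta")
print(shlex.join(cmd))
# To launch the analysis (requires IQ-TREE 2 to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps the analysis settings in one place, so that the same call can be repeated for other alignments or models.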

Both methods reveal four major clades containing respectively the p53 of insects and the p53, p63, and p73 of all vertebrates (Fig 3). The p53, p63, and p73 of vertebrates are more closely related to each other than to any other p53. Furthermore, the p63 and p73 of vertebrates are more closely related to each other than to vertebrate p53. This indicates that two duplication events in the p53 family preceded the origin of vertebrates. First, the p53 family and the p63/p73 cluster diverged. The second one caused the p63 and p73 families to diverge (Fig 3). The p53 of insects are clustered together. This indicates that insects diverged from the other bilaterians before these two duplications. These results are in accordance with the existing literature on the evolutionary history of the p53 family [190].

Fig 3.

Phylogenetic trees of the P53 domain-containing proteins of metazoans using Neighbor Joining (A) and Maximum Likelihood (B). The trees were inferred under the model JTT+G [90], as selected by ModelFinder [101] using the AIC [98]. The numbers on the internal edges/branches indicate the bootstrap values as calculated by the standard bootstrapping method [138] and UFBoot2 [131], respectively. The phylogenetic trees were inferred using MEGA 11 [55] and IQ-TREE 2 [77, 78], respectively, and the figures were generated using iTOL [153]. Green branches represent the p63 family, orange branches represent the p73 family and blue branches represent the p53 family.

https://doi.org/10.1371/journal.pone.0279597.g003

6. Molecular dating of speciation events.

We propose to estimate the age of speciation and duplication events of our phylogenetic tree. TimeTree has been used to retrieve the estimates of age of speciation events, and these events were used as calibration points. Molecular clocks can also be used to calibrate the phylogeny in MEGA.

Load the alignment file in the FASTA format and the phylogenetic tree in the Newick format into MEGA 11. In the Compute panel, select “Compute TimeTree” and “internal nodes constraints”. In TimeTree (http://www.timetree.org/), enter the names of two species of interest. For example, Homo and Drosophila diverged between 630 and 830 million years ago, with 694 million years as median time. In MEGA, click “add new calibration point” and select the node in the phylogenetic tree, or enter the names of the two taxa, and define the speciation age with a minimum, maximum or fixed time (for example, 694 million years between Homo and Drosophila). Use TimeTree to define several calibration points, before and after the duplication events, and save the calibrated tree.

The time-calibrated phylogeny of the TP53 family suggests that the duplication event between p53 and the p63/p73 cluster occurred around 502 million years ago, and that p63 and p73 diverged 452 million years ago (Fig 4). One should keep in mind that these evolutionary ages are only estimates based on a few calibration points. According to the database Ohnologs [171], p53, p63 and p73 result from the two-round whole genome duplication event that preceded the origin of chordates.

Fig 4. Time-calibrated phylogenetic tree of the P53 domain-containing proteins of metazoans.

The tree was inferred under the model JTT+G [90], as selected by ModelFinder [101] using the AIC [98]. The phylogenetic tree and the figure were produced using MEGA 11 [55]. Time calibration was performed using TimeTree [173]. The values at the nodes and the scale indicate the divergence time in million years. Green branches represent the p63 family, orange branches represent the p73 family and blue branches represent the p53 family. Gray spots on the branches indicate the appearance of the different protein domains during the evolution of the TP53 family.

https://doi.org/10.1371/journal.pone.0279597.g004

7. Reconstruction of the evolutionary history of the P53 family.

By combining this phylogenetic tree and the database Pfam [27], the evolutionary history of the protein family can be traced. The P53 DNA binding domain, shared by all proteins in this analysis, appeared before the divergence between choanoflagellates and metazoans (Fig 4). The SAM_2 domain, present in p63 and p73 sequences of vertebrates only, appeared after the p53-p63/p73 duplication and before the p63-p73 duplication. The P53 TAD domain is restricted to vertebrate p53. It appeared after the first whole genome duplication and before the speciation of vertebrates. Finally, the TAD2 domain evolved recently and is restricted to primate p53.

Evolutionary history of human cyclins and CDKs

Cyclin-dependent kinases (CDKs) are protein kinases involved in the control of the cell cycle. They are responsible for the activation of specific target proteins. CDKs are activated by regulatory proteins called cyclins, whose concentration varies cyclically throughout the cell cycle. Cyclin binding to the CDK activates specific kinases and phosphatases that in turn activate the CDK. Subsequent ubiquitination and proteolysis of cyclins by the anaphase promoting complex then inactivate the CDK. All steps of the cell cycle (mitosis, G1, G2, S) depend on the activation of specific CDKs by specific cyclins. Cyclins and CDKs represent large families of proteins. Twenty-one CDKs and twenty-one cyclins are present in the human genome. Here, we present a protocol for studying the evolutionary history of human cyclin and CDK paralogues and their coevolution, following the roadmap presented above.

1. Phylogenetic analyses.

To study the coevolution between human cyclins and CDKs, we first need to reconstruct their phylogenies separately. In this example, two human proteins related to CDKs, GSK3 and MAK, were chosen as outgroups for CDKs [191] (S2 File). Cables1 and Cables2, related to cyclins, were used as outgroups for the phylogenetic analysis of cyclins [191] (S3 File). Sequences were aligned using CLUSTALW 2.1 [39] and ModelFinder [101] was used to determine the most relevant evolutionary model based on the AIC. The amino acid substitution model LG [192] was selected for both families. Then, phylogenetic analysis was performed using maximum likelihood with IQ-TREE 2 [77, 78] and consistency of the phylogenetic estimate was assessed using the bootstrapping method UFBoot2 [131]. The figures were generated using iTOL [153].

2. Identification of homologues resulting from whole genome duplications.

Using the phylogeny (Fig 5) and the database Ohnologs 2.0 [171], ohnologues, i.e., paralogues resulting from a whole genome duplication, can be identified (Fig 5). For example, in the cyclin family, Cyclins T1 and T2, Cyclins B1 and B2, Cyclins A1 and A2, and Cyclins E1 and E2 are ohnologues. In the CDK family, CDK12 and CDK13, CDK4 and CDK6, CDK14 and CDK15, and CDK19 and CDK8 are ohnologues. These ohnologues likely resulted from two-round whole genome duplication that occurred before the origin of chordates [193, 194].

Fig 5.

Phylogenetic trees of human CDKs (A) and Cyclins (B) using Maximum Likelihood. The trees were inferred under the substitution model LG [192], as selected by ModelFinder [101] using the AIC [98]. The phylogenetic analyses were performed using IQ-TREE 2 [77, 78] and the figure was produced using iTOL [153]. The numbers indicate the bootstrap values as calculated by UFBoot2 [131]. Red squares indicate whole genome duplications according to the database Ohnologs 2.0 [171].

https://doi.org/10.1371/journal.pone.0279597.g005

3. Study of the coevolution between cyclins and CDKs.

To study the coevolution between the two gene families, the phylogenetic trees of cyclins and CDKs and their associations are needed. Jane 4 [170] and Treemap 3 [174], two programs designed for studying coevolution between hosts and parasites, were used to reconstruct the co-phylogeny of the two gene families. This example uses the cyclin/CDK associations from a publication on the evolution of the Cyclin and CDK families [195].

With Jane and TreeMap, a single nexus file containing the phylogenies of cyclins and CDKs, and their associations is needed. Create a nexus file (starting with #NEXUS). This file should contain the two trees in the Newick format, in the sections BEGIN HOST and BEGIN PARASITE, and the associations in the section BEGIN DISTRIBUTION. This section should mention every association between cyclins and CDKs following the pattern “Host: Parasite,”. All three sections should end with “ENDBLOCK;”. The names of the taxa in the three sections should be identical. Cyclins interacting with several CDKs and vice-versa should be repeated (S4 File).

Import this file into Jane and launch the analysis in Solve Mode. The costs of the coevolutionary events can be set before the analysis. Stats Mode can be used to compute the cost range of the solutions. With TreeMap, import the nexus file and launch the analysis with “Solve the tanglegram”. The result is a coevolutionary scenario that represents the best way to associate the two trees. The significance of the reconstruction can be tested with “estimate significance” or with a heuristic test.
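
To make the role of the event costs concrete, the short Python sketch below computes the total cost of one reconstruction as the sum, over event types, of the number of events multiplied by the cost of that event. The cost values are assumptions (they correspond to what we believe are the default settings in Jane and should be verified in the program), and the event counts are hypothetical numbers chosen only for the arithmetic.

```python
# Assumed cost scheme; verify against the values shown in the program.
ASSUMED_COSTS = {
    "cospeciation": 0,
    "duplication": 1,
    "duplication_host_switch": 2,
    "loss": 1,
    "failure_to_diverge": 1,
}


def total_cost(event_counts, costs=ASSUMED_COSTS):
    """Sum count * cost over all event types of one reconstruction."""
    unknown = set(event_counts) - set(costs)
    if unknown:
        raise KeyError(f"no cost defined for: {sorted(unknown)}")
    return sum(costs[event] * n for event, n in event_counts.items())


# Hypothetical event counts for a single solution (not taken from Fig 6).
example_counts = {
    "cospeciation": 5,
    "duplication": 3,
    "duplication_host_switch": 2,
    "loss": 4,
    "failure_to_diverge": 1,
}
# 5*0 + 3*1 + 2*2 + 4*1 + 1*1 = 12
print(total_cost(example_counts))
```

Under such a scheme, raising the cost of one event type makes solutions that rely on it less favourable, which is why the reconstruction can change when the costs are modified.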

The co-phylogenies of cyclins and CDKs are presented in Fig 6. These figures retrace the evolutionary history of cyclins and CDKs. Several types of coevolution events were identified, including co-speciation, duplication, duplication with interaction switch, loss of interaction, and numerous failures to diverge, for example the duplication of Cyclins L1 and L2 without a duplication in CDK9 (Fig 6A). Significant co-evolution events are identified by both programs, such as between the three Cyclin D paralogues and the CDK4 and CDK6 cluster, and between Cyclins A, B, D and E and CDKs 1, 2, 3, 4, 6, 14 and 16 (Fig 6A and 6B).

Fig 6. Two co-evolutionary scenarios between human cyclins and human CDKs.

The co-phylogenies were constructed using Jane [170] (A) and TreeMap [174] (B). (A) Cyclins (black lines) and CDKs (blue lines) that cluster together depict an interaction (the cyclin can bind the CDK to activate it). Hollow red circles indicate co-speciation events, solid red circles indicate a duplication, and yellow circles indicate a duplication with host switch. Dashed lines illustrate a loss of interaction, and jagged lines indicate a failure to diverge. (B) Significant co-speciation events between cyclins and CDKs are indicated by filled red circles of graded intensity. A more intense red color indicates a more significant congruence.

https://doi.org/10.1371/journal.pone.0279597.g006

Supporting information

S1 File. Sequences and accession numbers of p53 proteins.

These sequences and accession numbers were used for phylogenetic analysis.

https://doi.org/10.1371/journal.pone.0279597.s001

(PDF)

S2 File. Sequences and accession numbers of human CDKs.

These sequences and accession numbers were used for phylogenetic analysis.

https://doi.org/10.1371/journal.pone.0279597.s002

(PDF)

S3 File. Sequences and accession numbers of human cyclins.

These sequences and accession numbers were used for phylogenetic analysis.

https://doi.org/10.1371/journal.pone.0279597.s003

(PDF)

S4 File. Nexus code.

This code was used to reconstruct the co-phylogeny of human CDKs and cyclins.

https://doi.org/10.1371/journal.pone.0279597.s004

(PDF)

S5 File. Protocol.

The protocol is also available on protocols.io.

https://doi.org/10.1371/journal.pone.0279597.s005

(PDF)

Acknowledgments

We are grateful to Sarah Amend, Kenneth Pienta, and Laurie Kostecka at the Brady Urological Institute, Johns Hopkins School of Medicine, and to Stina Andersson, Chris Carroll, and Sinan Karakaya at the Tissue Development and Evolution (TiDE) group, Lund University, for carefully reading the manuscript and providing useful comments that improved the paper.

References

  1. 1. Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genomics Bioinforma. 2020 Jun 1;2(2):lqaa041. pmid:33575594
  2. 2. Chen C, Huang H, Wu CH. Protein Bioinformatics Databases and Resources. In: Wu CH, Arighi CN, Ross KE, editors. Protein Bioinformatics [Internet]. New York, NY: Springer New York; 2017 [cited 2021 Aug 5]. p. 3–39. (Methods in Molecular Biology; vol. 1558). Available from: http://link.springer.com/10.1007/978-1-4939-6783-4_1
  3. 3. Rigden DJ, Fernández XM. The 2021 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 2021 Jan 8;49(D1):D1–9.
  4. 4. Benson DA. GenBank. Nucleic Acids Res. 2002 Jan 1;30(1):17–20. pmid:11752243
  5. 5. Schuler GD, Epstein JA, Ohkawa H, Kans JA. Entrez: Molecular biology database and retrieval system. In: Methods in Enzymology [Internet]. Elsevier; 1996 [cited 2021 Dec 20]. p. 141–62. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0076687996660121
  6. 6. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012 Nov 26;41(D1):D8–20.
  7. 7. Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. Ensembl 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D884–91. pmid:33137190
  8. 8. Bastian F, Parmentier G, Roux J, Moretti S, Laudet V, Robinson-Rechavi M. Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species. In: Bairoch A, Cohen-Boulakia S, Froidevaux C, editors. Data Integration in the Life Sciences [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008 [cited 2021 Aug 5]. p. 124–31. (Lecture Notes in Computer Science; vol. 5109). Available from: http://link.springer.com/10.1007/978-3-540-69828-9_12
  9. 9. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, et al. GeneCards Version 3: the human gene integrator. Database. 2010 Aug 5;2010:baq020. pmid:20689021
  10. 10. Bowes JB, Snyder KA, Segerdell E, Gibb R, Jarabek C, Noumen E, et al. Xenbase: a Xenopus biology and genomics resource. Nucleic Acids Res. 2007 Dec 23;36(Database):D761–7. pmid:17984085
  11. 11. Drysdale RA. FlyBase: genes and gene models. Nucleic Acids Res. 2004 Dec 17;33(Database issue):D390–5.
  12. 12. Stein L. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 2001 Jan 1;29(1):82–6. pmid:11125056
  13. 13. Wood V, Harris MA, McDowall MD, Rutherford K, Vaughan BW, Staines DM, et al. PomBase: a comprehensive online resource for fission yeast. Nucleic Acids Res. 2012 Jan 1;40(D1):D695–9. pmid:22039153
  14. 14. Rhee SY. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003 Jan 1;31(1):224–8. pmid:12519987
  15. 15. Waese J, Provart NJ. The Bio-Analytic Resource: Data visualization and analytic tools for multiple levels of plant biology. Curr Plant Biol. 2016 Nov;7–8:2–5.
  16. 16. Winter D, Vinegar B, Nahal H, Ammar R, Wilson GV, Provart NJ. An “Electronic Fluorescent Pictograph” Browser for Exploring and Analyzing Large-Scale Biological Data Sets. Baxter I, editor. PLoS ONE. 2007 Aug 8;2(8):e718. pmid:17684564
  17. 17. Bairoch A. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2004 Dec 17;33(Database issue):D154–9.
  18. 18. Consortium TGO. Creating the Gene Ontology Resource: Design and Implementation. Genome Res. 2001 Aug 1;11(8):1425–33. pmid:11483584
  19. 19. Aoki KF, Kanehisa M. Using the KEGG Database Resource. Curr Protoc Bioinforma [Internet]. 2005 Sep [cited 2021 Aug 5];11(1). Available from: https://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0112s11 pmid:18428742
  20. 20. Digre A, Lindskog C. The Human Protein Atlas—Spatial localization of the human proteome in health and disease. Protein Sci. 2021 Jan;30(1):218–33. pmid:33146890
  21. 21. Bouthors V, Dedieu O. Pharos, a Collaborative Infrastructure for Web Knowledge Sharing. In: Abiteboul S, Vercoustre AM, editors. Research and Advanced Technology for Digital Libraries [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 1999 [cited 2021 Aug 5]. p. 215–33. (Goos G, Hartmanis J, van Leeuwen J, editors. Lecture Notes in Computer Science; vol. 1696). Available from: http://link.springer.com/10.1007/3-540-48155-9_15
  22. 22. Sheils TK, Mathias SL, Kelleher KJ, Siramshetty VB, Nguyen DT, Bologa CG, et al. TCRD and Pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res. 2021 Jan 8;49(D1):D1334–46. pmid:33156327
  23. 23. Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, et al. The CATH classification revisited—architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 2009 Jan 1;37(Database):D310–4. pmid:18996897
  24. 24. Holm L, Sander C. The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res. 1996 Jan 1;24(1):206–9. pmid:8594580
  25. 25. Holm L, Sander C. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res. 1997 Jan 1;25(1):231–4. pmid:9016542
  26. 26. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007 Jan 3;35(Database):D301–3. pmid:17142228
  27. 27. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019 Jan 8;47(D1):D427–32. pmid:30357350
  28. 28. Attwood TK, Beck ME, Bleasby AJ, Parry-Smith DJ. PRINTS: a database of protein motif fingerprints.
  29. 29. Hulo N. The PROSITE database. Nucleic Acids Res. 2006 Jan 1;34(90001):D227–30. pmid:16381852
  30. 30. Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C. SCOP: a Structural Classification of Proteins database.
  31. 31. Madera M. The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res. 2004 Jan 1;32(90001):D235–9. pmid:14681402
  32. 32. Hubbard TJP, Murzin AG, Brenner SE, Chothia C. SCOP: a Structural Classification of Proteins database. Nucleic Acids Research. 1996;25(1):4.
  33. 33. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. J Mol Biol. 1990 Oct;215(3):403–10.
  34. 34. Kerfeld CA, Scott KM. Using BLAST to Teach “E-value-tionary” Concepts. Kerfeld CA, editor. PLoS Biol. 2011 Feb 1;9(2):e1001014. pmid:21304918
  35. 35. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul 1;39(suppl):W29–37.
  36. 36. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci. 1988 Apr 1;85(8):2444–8. pmid:3162770
  37. 37. Ning Z. SSAHA: A Fast Search Method for Large DNA Databases. Genome Res. 2001 Oct 1;11(10):1725–9. pmid:11591649
  38. 38. Kent WJ. BLAT—The BLAST-Like Alignment Tool. Genome Res. 2002;(12):656–64. pmid:11932250
  39. 39. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007 Nov 1;23(21):2947–8. pmid:17846036
  40. 40. Sievers F, Higgins DG. Clustal Omega. Curr Protoc Bioinforma [Internet]. 2014 Dec [cited 2022 Jun 4];48(1). Available from: https://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0313s48
  41. 41. Edgar RC. MUSCLE: multiple sequence alignment with improved accuracy and speed. In: Proceedings 2004 IEEE Computational Systems Bioinformatics Conference, 2004 CSB 2004 [Internet]. Stanford, CA, USA: IEEE; 2004 [cited 2021 Aug 10]. p. 689–90. Available from: http://ieeexplore.ieee.org/document/1332560/
  42. 42. Löytynoja A. Phylogeny-aware alignment with PRANK. In: Russell DJ, editor. Multiple Sequence Alignment Methods [Internet]. Totowa, NJ: Humana Press; 2014 [cited 2021 Sep 9]. p. 155–70. (Methods in Molecular Biology; vol. 1079). Available from: http://link.springer.com/10.1007/978-1-62703-646-7_10
  43. 43. Löytynoja A, Goldman N. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics. 2010 Dec;11(1):579. pmid:21110866
  44. 44. Lassmann T. Kalign 3: multiple sequence alignment of large datasets. Mathelier A, editor. Bioinformatics. 2019 Oct 26;btz795. pmid:31665271
  45. 45. Katoh K. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002 Jul 15;30(14):3059–66. pmid:12136088
  46. 46. Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 2009 Oct 1;25(19):2455–65. pmid:19648142
  47. 47. Do CB, Katoh K. Protein Multiple Sequence Alignment. In: Thompson JD, Ueffing M, Schaeffer-Reiss C, editors. Functional Proteomics [Internet]. Totowa, NJ: Humana Press; 2008 [cited 2022 Oct 22]. p. 379–413. (Walker JM, editor. Methods in Molecular Biology; vol. 484). Available from: http://link.springer.com/10.1007/978-1-59745-398-1_25
  48. 48. Pei J. Multiple protein sequence alignment. Curr Opin Struct Biol. 2008 Jun;18(3):382–6. pmid:18485694
  49. 49. Notredame C, Higgins DG, Heringa J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep;302(1):205–17.
  50. 50. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005 Feb;15(2):330–40. pmid:15687296
  51. 51. Do CB, Gross SS, Batzoglou S. CONTRAlign: Discriminative Training for Protein Sequence Alignment. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in Computational Molecular Biology [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006 [cited 2022 Jun 16]. p. 160–74. (Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, et al., editors. Lecture Notes in Computer Science; vol. 3909). Available from: http://link.springer.com/10.1007/11732990_15
  52. 52. Gotoh O. Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments. J Mol Biol. 1996 Dec;264(4):823–38. pmid:8980688
  53. 53. Notredame C. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 1996 Apr 15;24(8):1515–24. pmid:8628686
  54. 54. Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Bioinformatics. 1996;12(2):95–107. pmid:8744772
  55. 55. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Battistuzzi FU, editor. Mol Biol Evol. 2018 Jun 1;35(6):1547–9. pmid:29722887
  56. 56. Pais FSM, Ruy P de C, Oliveira G, Coimbra RS. Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol. 2014 Dec;9(1):4. pmid:24602402
  57. 57. Mohamed EM, Mousa HM, Keshk AE. Comparative Analysis of Multiple Sequence Alignment Tools. Int J Inf Technol Comput Sci. 2018 Aug 8;10(8):24–30.
  58. 58. Anderson C, Strope C, Moriyama E. Assessing multiple sequence alignments using visual tools. In: Bioinformatic—trends and methodologies. InTech Publications. 2011.
  59. 59. Redelings BD. BAli-Phy version 3: model-based co-estimation of alignment and phylogeny. Ponty Y, editor. Bioinformatics. 2021 Sep 29;37(18):3032–4.
  60. 60. Mirarab S, Nguyen N, Guo S, Wang LS, Kim J, Warnow T. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. J Comput Biol. 2015 May;22(5):377–86. pmid:25549288
  61. 61. Nguyen N phuong D, Mirarab S, Kumar K, Warnow T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 2015 Dec;16(1):124. pmid:26076734
  62. 62. Liu K, Warnow TJ, Holder MT, Nelesen SM, Yu J, Stamatakis AP, et al. SATé-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Syst Biol. 2012 Jan 1;61(1):90.
  63. 63. Morrison DA. Is Sequence Alignment an Art or a Science? Syst Bot. 2015 Feb 1;40(1):14–26.
  64. 64. Lemey P, Salemi M, Van Damme AM. The Phylogenetic Handbook, A Practical Approach to Phylogenetic Analysis and Hypothesis Testing. Cambridge University Press; 2009.
  65. 65. Golubchik T, Wise MJ, Easteal S, Jermiin LS. Mind the Gaps: Evidence of Bias in Estimates of Multiple Sequence Alignments. Mol Biol Evol. 2007 Aug 16;24(11):2433–42. pmid:17709332
  66. 66. Talavera G, Castresana J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Kjer K, Page R, Sullivan J, editors. Syst Biol. 2007 Aug 1;56(4):564–77.
  67. 67. Wong TKF, Kalyaanamoorthy S, Meusemann K, Yeates DK, Misof B, Jermiin LS. A minimum reporting standard for multiple sequence alignments. NAR Genomics Bioinforma. 2020 Jun 1;2(2):lqaa024. pmid:33575581
  68. 68. Sela I, Ashkenazy H, Katoh K, Pupko T. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 2015 Jul 1;43(W1):W7–14. pmid:25883146
  69. 69. Kinene T, Wainaina J, Maina S, Boykin LM. Rooting Trees, Methods for. In: Encyclopedia of Evolutionary Biology [Internet]. Elsevier; 2016 [cited 2021 Dec 17]. p. 489–93. Available from: https://linkinghub.elsevier.com/retrieve/pii/B9780128000496002158
  70. 70. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009 Aug 1;25(15):1972–3. pmid:19505945
  71. 71. Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol. 2010;10(1):210. pmid:20626897
  72. 72. Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, et al. Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol. 2008 Dec;3(1):7. pmid:18577231
  73. 73. Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. Mixture Models of Nucleotide Sequence Evolution that Account for Heterogeneity in the Substitution Process Across Sites and Across Lineages. Syst Biol. 2014 Sep 1;63(5):726–42. pmid:24927722
  74. 74. Naser-Khdour S, Minh BQ, Zhang W, Stone EA, Lanfear R. The Prevalence and Impact of Model Violations in Phylogenetic Analysis. Bryant D, editor. Genome Biol Evol. 2019 Dec 1;11(12):3341–52. pmid:31536115
  75. 75. Ho SYW, Jermiin LS. Tracing the Decay of the Historical Signal in Biological Sequence Data. Lockhart P, editor. Syst Biol. 2004 Aug 1;53(4):623–37. pmid:15371250
  76. 76. Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD. The Biasing Effect of Compositional Heterogeneity on Phylogenetic Estimates May be Underestimated. Lockhart P, editor. Syst Biol. 2004 Aug 1;53(4):638–43. pmid:15371251
  77. 77. Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Mol Biol Evol. 2015 Jan;32(1):268–74. pmid:25371430
  78. 78. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Teeling E, editor. Mol Biol Evol. 2020 May 1;37(5):1530–4. pmid:32011700
  79. 79. Jermiin LS, Lovell DR, Misof B, Foster PG, Robinson J. Detecting and visualising the impact of heterogeneous evolutionary processes on phylogenetic estimates [Internet]. Evolutionary Biology; 2019 Nov [cited 2022 Oct 22]. Available from: http://biorxiv.org/lookup/doi/10.1101/828996
  80. 80. Thomas GH, Freckleton RP. MOTMOT: models of trait macroevolution on trees. Methods Ecol Evol. 2012 Feb;3(1):145–51.
  81. 81. Arenas M. Trends in substitution models of molecular evolution. Front Genet [Internet]. 2015 Oct 26 [cited 2022 Oct 22];6. Available from: http://journal.frontiersin.org/Article/10.3389/fgene.2015.00319/abstract pmid:26579193
  82. 82. Posada D, Crandall KA. Selecting the Best-Fit Model of Nucleotide Substitution. Syst Biol. 2001;50:22.
  83. 83. Yang Z. Molecular Evolution: A Statistical Approach. Oxford University Press; 2014. 512 p.
  84. 84. Jukes TH, Cantor CR. Evolution of Protein Molecules. In: Mammalian Protein Metabolism [Internet]. Elsevier; 1969 [cited 2021 Dec 20]. p. 21–132. Available from: https://linkinghub.elsevier.com/retrieve/pii/B9781483232119500097
  85. 85. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980 Jun;16(2):111–20. pmid:7463489
  86. 86. Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981 Nov;17(6):368–76. pmid:7288891
  87. 87. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985 Oct;22(2):160–74. pmid:3934395
  88. 88. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993 May;10(3):512–26. pmid:8336541
  89. 89. Miura RM, American Association for the Advancement of Science, editors. Some mathematical questions in biology: DNA sequence analysis. Providence, R.I: American Mathematical Society; 1986. 124 p. (Lectures on mathematics in the life sciences).
  90. 90. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Bioinformatics. 1992;8(3):275–82. pmid:1633570
  91. 91. Whelan S, Goldman N. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol Biol Evol. 2001 May;18(5):691–9. pmid:11319253
  92. 92. Le SQ, Gascuel O. An Improved General Amino Acid Replacement Matrix. Mol Biol Evol. 2008 Apr 3;25(7):1307–20. pmid:18367465
  93. 93. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure. 1978. p. 345–52.
  94. 94. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994 Sep;11(5):715–24. pmid:7968485
  95. 95. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb 1;139(2):993–1005. pmid:7713447
  96. 96. Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, et al. GHOST: Recovering Historical Signal from Heterotachously Evolved Sequence Alignments. Smith S, editor. Syst Biol. 2019 Jul 31;syz051.
  97. 97. Neath AA, Cavanaugh JE. The Bayesian information criterion: background, derivation, and applications. WIREs Comput Stat. 2012 Mar;4(2):199–203.
  98. 98. Bozdogan H. Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika. 1987 Sep;52(3):345–70.
  99. 99. Posada D, Crandall KA. MODELTEST: testing the model of DNA substitution. Bioinformatics. 1998 Oct 1;14(9):817–8. pmid:9918953
  100. 100. Abascal F, Zardoya R, Posada D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 2005 May 1;21(9):2104–5. pmid:15647292
  101. 101. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017 Jun;14(6):587–9. pmid:28481363
  102. 102. Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B. PartitionFinder 2: New Methods for Selecting Partitioned Models of Evolution for Molecular and Morphological Phylogenetic Analyses. Mol Biol Evol. 2016 Dec 23;msw260.
  103. 103. Lefort V, Longueville JE, Gascuel O. SMS: Smart Model Selection in PhyML. Mol Biol Evol. 2017 Sep 1;34(9):2422–4. pmid:28472384
  104. 104. Roy SS, Dasgupta R, Bagchi A. A Review on Phylogenetic Analysis: A Journey through Modern Era. Comput Mol Biosci. 2014;04(03):39–45.
  105. 105. Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020 Jul;21(7):428–44. pmid:32424311
  106. 106. Goldman N. Maximum Likelihood Inference of Phylogenetic Trees, with Special Reference to a Poisson Process Model of DNA Substitution and to Parsimony Analyses. Syst Zool. 1990 Dec;39(4):345.
  107. 107. Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J Mol Evol. 1996 Sep;43(3):304–11.
  108. 108. Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method. Mol Biol Evol. 1997 Jul 1;14(7):717–24. pmid:9214744
  109. 109. Mau B, Newton MA, Larget B. Bayesian Phylogenetic Inference via Markov Chain Monte Carlo Methods. Biometrics. 1999 Mar;55(1):1–12. pmid:11318142
  110. 110. Bergsten J. A review of long-branch attraction. Cladistics. 2005 Apr;21(2):163–93. pmid:34892859
  111. 111. Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 1958;(38):1409–38.
  112. 112. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25. pmid:3447015
  113. 113. Rzhetsky A, Nei M. A Simple Method for Estimating and Testing Minimum-Evolution Trees. Mol Biol Evol. 1992 Sep 1;9(5):945–67.
  114. 114. Lefort V, Desper R, Gascuel O. FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Mol Biol Evol. 2015 Oct;32(10):2798–800.
  115. 115. Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, et al. PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007;8(8):R171. pmid:17708774
  116. 116. Huson DH, Bryant D. Estimating phylogenetic trees and networks using SplitsTree 4. Center for Bioinformatics: Tübingen University; 2004.
  117. 117. Huson DH, Kloepper T, Bryant D. SplitsTree 4.0—Computation of phylogenetic trees and networks. 2004.
  118. 118. Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004 Jan 22;20(2):289–90. pmid:14734327
  119. 119. Swofford DL. Phylogenetic analysis using parsimony. Laboratory of Molecular Systematics Smithsonian Institution; 1998.
  120. 120. Price MN, Dehal PS, Arkin AP. FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix. Mol Biol Evol. 2009 Jul 1;26(7):1641–50. pmid:19377059
  121. 121. Felsenstein J. phylogenetic inference program Version 3.6. Seattle: University of Washington; 2005.
  122. 122. Misener S, Krawetz SA, editors. Bioinformatics methods and protocols. Totowa, N.J: Humana Press; 2000. 500 p. (Methods in molecular biology).
  123. 123. Swofford DL, Sullivan J. Phylogeny inference based on parsimony and other methods using PAUP*. 2003;160.
  124. 124. Gouy M, Guindon S, Gascuel O. SeaView Version 4: A Multiplatform Graphical User Interface for Sequence Alignment and Phylogenetic Tree Building. Mol Biol Evol. 2010 Feb 1;27(2):221–4. pmid:19854763
  125. 125. Guindon S, Delsuc F, Dufayard JF, Gascuel O. Estimating Maximum Likelihood Phylogenies with PhyML. In: Posada D, editor. Bioinformatics for DNA Sequence Analysis [Internet]. Totowa, NJ: Humana Press; 2009 [cited 2021 Sep 9]. p. 113–37. (Methods in Molecular Biology; vol. 537). Available from: http://link.springer.com/10.1007/978-1-59745-251-9_6
  126. 126. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014 May 1;30(9):1312–3. pmid:24451623
  127. 127. Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol. 2007 Apr 18;24(8):1586–91. pmid:17483113
  128. 128. Kosakovsky Pond SL, Poon AFY, Velazquez R, Weaver S, Hepler NL, Murrell B, et al. HyPhy 2.5—A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Crandall K, editor. Mol Biol Evol. 2020 Jan 1;37(1):295–9. pmid:31504749
  129. 129. Lewis PO. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol Biol Evol. 1998 Mar 1;15(3):277–83. pmid:9501494
  130. 130. Stamatakis A. Phylogenetic models of rate heterogeneity: a high performance computing perspective. In: Proceedings 20th IEEE International Parallel & Distributed Processing Symposium [Internet]. Rhodes Island, Greece: IEEE; 2006 [cited 2022 Oct 22]. p. 8 pp. Available from: http://ieeexplore.ieee.org/document/1639535/
  131. 131. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018 Feb 1;35(2):518–22. pmid:29077904
  132. 132. Zhou X, Shen XX, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Mol Biol Evol. 2018 Feb 1;35(2):486–503. pmid:29177474
  133. 133. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001 Aug 1;17(8):754–5. pmid:11524383
  134. 134. Lartillot N, Lepage T, Blanquart S. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics. 2009 Sep 1;25(17):2286–8. pmid:19535536
  135. 135. Lartillot N, Philippe H. A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process. Mol Biol Evol. 2004 Jun;21(6):1095–109. pmid:15014145
  136. 136. Pagel M. BayesTraits: Computer program and documentation. Meade A, editor. PLoS Comput Biol [Internet]. 2007 [cited 2022 Jun 4]; Available from: https://dx.plos.org/10.1371/journal.pcbi.0010003
  137. 137. Bazinet AL, Zwickl DJ, Cummings MP. A Gateway for Phylogenetic Analysis Powered by Grid Computing Featuring GARLI 2.0. Syst Biol. 2014 Sep 1;63(5):812–8. pmid:24789072
  138. 138. Felsenstein J. CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING THE BOOTSTRAP. Evolution. 1985 Jul;39(4):783–91. pmid:28561359
  139. 139. Hedges BS. The number of replications needed for accurate estimation of the bootstrap P value in phylogenetic studies. Mol Biol Evol. 1992;9(2):366–9. pmid:1560769
  140. 140. Jermiin LS, Poladian L, Charleston MA. Is the ‘Big Bang’ in Animal Evolution Real? Science. 2005 Dec 23;310(5756):1910–1.
  141. 141. Minh BQ, Nguyen MAT, von Haeseler A. Ultrafast Approximation for Phylogenetic Bootstrap. Mol Biol Evol. 2013 May 1;30(5):1188–95. pmid:23418397
  142. 142. Stamatakis A, Hoover P, Rougemont J. A Rapid Bootstrap Algorithm for the RAxML Web Servers. Renner S, editor. Syst Biol. 2008 Oct 1;57(5):758–71. pmid:18853362
  143. 143. Anisimova M, Gascuel O. Approximate Likelihood-Ratio Test for Branches: A Fast, Accurate, and Powerful Alternative. Sullivan J, editor. Syst Biol. 2006 Aug 1;55(4):539–52. pmid:16785212
  144. 144. Shimodaira H, Hasegawa M. Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference. Mol Biol Evol. 1999 Aug 1;16(8):1114–6.
  145. 145. Shavit L, Penny D, Hendy MD, Holland BR. The Problem of Rooting Rapid Radiations. Mol Biol Evol. 2007 Nov;24(11):2400–11. pmid:17720690
  146. 146. Holland BR, Penny D, Hendy MD. Outgroup Misplacement and Phylogenetic Inaccuracy Under a Molecular Clock—A Simulation Study. Sullivan J, editor. Syst Biol. 2003 Apr 1;52(2):229–38. pmid:12746148
  147. 147. Goldman N. Statistical tests of models of DNA substitution. J Mol Evol. 1993 Feb;36(2):182–98. pmid:7679448
  148. 148. Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 1997;13(3):235–8. pmid:9183526
  149. 149. Shepherd DA, Klaere S. How Well Does Your Phylogenetic Model Fit Your Data? Foster P, editor. Syst Biol. 2019 Jan 1;68(1):157–67. pmid:30329125
  150. Lewis PO, Xie W, Chen MH, Fan Y, Kuo L. Posterior Predictive Bayesian Phylogenetic Model Selection. Syst Biol. 2014 May;63(3):309–21. pmid:24193892
  151. Rambaut A. FigTree v1.3.1. Edinburgh: Institute of Evolutionary Biology, University of Edinburgh (2010).
  152. Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol. 2016 Jun;33(6):1635–8. pmid:26921390
  153. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021 Jul 2;49(W1):W293–6. pmid:33885785
  154. Dutheil J, Gaillard S, Bazin E, Glémin S, Ranwez V, Galtier N, et al. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics. 2006 Dec;7(1):188.
  155. Miller MA, Pfeiffer W, Schwartz T. Creating the CIPRES Science Gateway for inference of large phylogenetic trees. In: 2010 Gateway Computing Environments Workshop (GCE) [Internet]. New Orleans, LA, USA: IEEE; 2010 [cited 2021 Sep 9]. p. 1–8. Available from: http://ieeexplore.ieee.org/document/5676129/
  156. Lemoine F, Correia D, Lefort V, Doppelt-Azeroual O, Mareuil F, Cohen-Boulakia S, et al. NGPhylogeny.fr: new generation phylogenetic services for non-specialists. Nucleic Acids Res. 2019 Jul 2;47(W1):W260–5. pmid:31028399
  157. Sanchez R, Serra F, Tarraga J, Medina I, Carbonell J, Pulido L, et al. Phylemon 2.0: a suite of web-tools for molecular evolution, phylogenetics, phylogenomics and hypotheses testing. Nucleic Acids Res. 2011 Jul 1;39(suppl):W470–4. pmid:21646336
  158. Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, et al. Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012 Jun 15;28(12):1647–9. pmid:22543367
  159. Maddison WP, Maddison DR. Mesquite: a modular system for evolutionary analysis. [Internet]. 2021. Available from: http://www.mesquiteproject.org
  160. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, et al. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. Pertea M, editor. PLOS Comput Biol. 2019 Apr 8;15(4):e1006650. pmid:30958812
  161. Yu Y, Harris AJ, Blair C, He X. RASP (Reconstruct Ancestral State in Phylogenies): A tool for historical biogeography. Mol Phylogenet Evol. 2015 Jun;87:46–9. pmid:25819445
  162. Excoffier L, Laval G, Schneider S. Arlequin (version 3.0): An integrated software package for population genetics data analysis. Evol Bioinforma. 2005 Jan;1:117693430500100.
  163. De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006 May 15;22(10):1269–71. pmid:16543274
  164. Lyons EH. CoGe, a new kind of comparative genomics platform. University of California, Berkeley; 2008.
  165. Meier-Kolthoff JP, Auch AF, Huson DH, Göker M. COPYCAT: cophylogenetic analysis tool. Bioinformatics. 2007 Apr 1;23(7):898–900. pmid:17267434
  166. Merkle D, Middendorf M, Wieseke N. A parameter-adaptive dynamic programming approach for inferring cophylogenies. BMC Bioinformatics. 2010 Jan;11(S1):S60. pmid:20122236
  167. Rozas J, Rozas R. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics. 1999 Feb 1;15(2):174–5. pmid:10089204
  168. Rousset F. genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol Ecol Resour. 2008 Jan;8(1):103–6. pmid:21585727
  169. Nguyen M, Ekstrom A, Li X, Yin Y. HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus genomes. Toxins. 2015 Oct 9;7(10):4035–53. pmid:26473921
  170. Conow C, Fielder D, Ovadia Y, Libeskind-Hadas R. Jane: a new tool for the cophylogeny reconstruction problem. Algorithms Mol Biol. 2010 Dec;5(1):16.
  171. Singh PP, Isambert H. OHNOLOGS v2: a comprehensive resource for the genes retained from whole genome duplication in vertebrates. Nucleic Acids Res. 2019 Oct 15;gkz909.
  172. Dereeper A, Nicolas S, Le Cunff L, Bacilieri R, Doligez A, Peros JP, et al. SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects. BMC Bioinformatics. 2011 Dec;12(1):134. pmid:21545712
  173. Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics. 2006 Dec 1;22(23):2971–2. pmid:17021158
  174. Charleston MA, Robertson DL. Preferential Host Switching by Primate Lentiviruses Can Account for Phylogenetic Similarity with the Primate Phylogeny. Sanderson M, editor. Syst Biol. 2002 May 1;51(3):528–35. pmid:12079649
  175. Kryazhimskiy S, Plotkin JB. The Population Genetics of dN/dS. Gojobori T, editor. PLoS Genet. 2008 Dec 12;4(12):e1000304. pmid:19081788
  176. Britten RJ. Rates of DNA Sequence Evolution Differ Between Taxonomic Groups. Science. 1986 Mar 21;231(4744):1393–8. pmid:3082006
  177. Graur D, Martin W. Reading the entrails of chickens: molecular timescales of evolution and the illusion of precision. Trends Genet. 2004 Feb;20(2):80–6. pmid:14746989
  178. To TH, Jung M, Lycett S, Gascuel O. Fast Dating Using Least-Squares Criteria and Algorithms. Syst Biol. 2016 Jan;65(1):82–97. pmid:26424727
  179. Stevens J. Computational aspects of host-parasite phylogenies. Brief Bioinform. 2004 Jan 1;5(4):339–49. pmid:15606970
  180. Felsenstein J. Phylogenies and the Comparative Method. Am Nat. 1985 Jan;125(1):1–15.
  181. Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci. 1979 Oct 1;76(10):5269–73. pmid:291943
  182. Nei M. Analysis of Gene Diversity in Subdivided Populations. Proc Natl Acad Sci USA. 1973;70(12):3. pmid:4519626
  183. Excoffier L, Heckel G. Computer programs for population genetics data analysis: a survival guide. Nat Rev Genet. 2006 Oct;7(10):745–58. pmid:16924258
  184. Rozas J, Ferrer-Mata A, Sánchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, et al. DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets. Mol Biol Evol. 2017 Dec 1;34(12):3299–302. pmid:29029172
  185. DeLano WL. Pymol: An open-source molecular graphics tool. CCP4 Newsl Protein Crystallogr. 2002;40(1):82–92.
  186. Zhang Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics. 2008 Dec;9(1):40. pmid:18215316
  187. Hildebrand A, Remmert M, Biegert A, Söding J. Fast and accurate automatic structure prediction with HHpred: Structure Prediction with HHpred. Proteins Struct Funct Bioinforma. 2009;77(S9):128–32.
  188. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug 26;596(7873):583–9. pmid:34265844
  189. Mahajan S, de Brevern AG, Sanejouand YH, Srinivasan N, Offmann B. Use of a structural alphabet to find compatible folds for amino acid sequences: Fold Recognition Using a Structural Alphabet. Protein Sci. 2015 Jan;24(1):145–53.
  190. dos Santos HG, Nunez-Castilla J, Siltberg-Liberles J. Functional Diversification after Gene Duplication: Paralog Specific Regions of Structural Disorder and Phosphorylation in p53, p63, and p73. Roemer K, editor. PLOS ONE. 2016 Mar 22;11(3):e0151961. pmid:27003913
  191. Cao L, Chen F, Yang X, Xu W, Xie J, Yu L. Phylogenetic analysis of CDK and cyclin proteins in premetazoan lineages. BMC Evol Biol. 2014;14(1):10. pmid:24433236
  192. Le SQ, Gascuel O. An Improved General Amino Acid Replacement Matrix. Mol Biol Evol. 2008 Apr 3;25(7):1307–20. pmid:18367465
  193. Dehal P, Boore JL. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. Holland P, editor. PLoS Biol. 2005 Sep 6;3(10):e314. pmid:16128622
  194. Holland LZ, Ocampo Daza D. A new look at an old question: when did the second whole genome duplication occur in vertebrate evolution? Genome Biol. 2018 Dec;19(1):209. pmid:30486862
  195. Peyressatre M, Prével C, Pellerano M, Morris M. Targeting Cyclin-Dependent Kinases in Human Cancers: From Small Molecules to Peptide Inhibitors. Cancers. 2015 Jan 23;7(1):179–237. pmid:25625291