Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Mean Protein Evolutionary Distance: A Method for Comparative Protein Evolution and Its Application

Mean Protein Evolutionary Distance: A Method for Comparative Protein Evolution and Its Application

  • Michael J. Wise


Proteins are under tight evolutionary constraints, so if a protein changes it can only do so in ways that do not compromise its function. In addition, the proteins in an organism evolve at different rates. Leveraging the history of patristic distance methods, a new method for analysing comparative protein evolution, called Mean Protein Evolutionary Distance (MeaPED), measures differential resistance to evolutionary pressure across viral proteomes and is thereby able to point to the proteins’ roles. Different species’ proteomes can also be compared because the results, consistent across virus subtypes, concisely reflect the very different lifestyles of the viruses. The MeaPED method is here applied to influenza A virus, hepatitis C virus, human immunodeficiency virus (HIV), dengue virus, rotavirus A, polyomavirus BK and measles, which span the positive and negative single-stranded, doubled-stranded and reverse transcribing RNA viruses, and double-stranded DNA viruses. From this analysis, host interaction proteins including hemagglutinin (influenza), and viroporins agnoprotein (polyomavirus), p7 (hepatitis C) and VPU (HIV) emerge as evolutionary hot-spots. By contrast, RNA-directed RNA polymerase proteins including L (measles), PB1/PB2 (influenza) and VP1 (rotavirus), and internal serine proteases such as NS3 (dengue and hepatitis C virus) emerge as evolutionary cold-spots. The hot spot influenza hemagglutinin protein is contrasted with the related cold spot H protein from measles. It is proposed that evolutionary cold-spot proteins can become significant targets for second-line anti-viral therapeutics, in cases where front-line vaccines are not available or have become ineffective due to mutations in the hot-spot, generally more antigenically exposed proteins. The MeaPED package is available from


Proteins are under tight evolutionary constraints because they are the main agents of the processes required by all organisms. Therefore, if a protein changes it can only do so in ways that do not compromise its function. This is particularly true for the small viruses that lack the levels of redundancy available to their eukaryote (or even bacterial) hosts. Viruses make up for this by having high levels of adaptability, driven by high mutation rates. For example, while E. coli has as mutation rate of muations/bp/replication [1], the dsDNA virus Bacteriophage T2’s rate is [1] and RNA viruses have mutation rates in the range to [2]. However, not all the genes in a virus – or any other organism – vary at the same rate. In phylogenetic analyses one, or perhaps several, known genes or proteins from a range of species are typically used to make inferences about the evolutionary histories of the source species, or to calculate the genetic distances between species. Indeed, these efforts predate the use of DNA or protein sequence data, e.g. Faith (1992) [3]. Differential rates of evolution bedevils the choice of which genes to use for these analyses; see, for example, the discussion in D’Erchia et al. (1996) [4]. However, I propose to turn this sort of analysis on its head and instead use comparisons of the rates of evolution evident across the proteins from a single species to yield significant insights about those proteins. To motivate the description of the method, Figure 1 shows phylogenetic trees that have been created based on an initial set of 43 avian influenza M1 matrix proteins and a set of 43 neuraminidase proteins from the same isolates. From the initial sets, 100% identical proteins were deleted and multiple sequence alignments were created from the remaining 17 (M1) and 43 (neuraminidase) sequences using Muscle [5]. Phyml version 3.0 [6], bootstrapped 100 times, was then used to create phylogenetic trees. The first thing to note from Figure 1b (neuraminidase) is that the clades formed during tree construction correspond to the designated neuraminidase types. The more important thing to note in each of the graphs is the scale bar; For the M1 set the scale bar is 0.007, while for the neuraminidase set the scale bar is 0.2. In other words, if the neuraminidase tree were drawn to the same scale as the M1 tree it would be 29 times larger. The inference is that despite strong purifying selection operating on both proteins, neuraminidase functional requirements necessitate a much higher evolutionary rate and hence greater protein diversity (43 unique neuraminidase proteins versus 17 for M1 protein, in this small example). In other words, we see in neuraminidase a greater resistance to evolutionary pressure. If evolutionary pressure is seen as the centripetal force that restricts genomic change through purifying selection – here measured at the protein level – then resistance to evolutionary change is the countervailing (centrifugal) force for change.

Figure 1. Phylogenetic trees based on avian influenza M1 matrix and neuraminidase proteins.

Phylogenetic trees have been created based on small sets of M1 matrix proteins (a) and corresponding neuraminidase proteins (b) taken from complete influenza proteomes. The trees were created using Phyml and the figures drawn using FigTree. Notice that the neuraminidase tree forms clades largely corresponding to influenza type. Notice also that the scale bar is much larger for the neuraminidase tree.

Leveraging the large numbers of isolates from single viral species that are increasingly becoming available, using a phylogenetic-tree based method described below we can now examine the differential evolutionary rates of proteins within a single species. The underlying idea is not new; minimising patristic distance – the distance between taxa in phylogenetic trees – underlies the Neighbour-Joining algorithm [7], and Farris (1972) [8] proposes the use of patristic distances (which he calls patristic differences) from taxa to a putative root taxon as a measure of evolution in a single protein. Here the idea will be extended to facilitate the comparison of proteins from isolates of a single species, and ultimately comparisons between species. The new method, called Mean Protein Evolutionary Distance (MeaPED), fits into the considerable literature on protein evolution; see, for example the review Pàl et al. (2006) [9]. However, the preeminent current method for measuring evolutionary pressure is the ratio of nonsynonymous to synonymous substitutions ()– also known as Ka/Ks – which is discussed in Yang (2006) [10]. The MeaPED method will be compared with . Despite misgivings expressed in Kryazhimskiy and Plotkin (2008) [11] about the applicability of to single populations, at least for the viral species discussed below the results suggest that comparisons with this method are reasonable.

The New Approach

Mean Protein Evolutionary Distance (MeaPED) is a way of measuring differential resistance to evolutionary pressure across sets of proteins from a single viral species. The MeaPED process begins with collecting complete proteomes from isolates of the species of interest, such as influenza A virus, where the aim is to sample as much of the sequence diversity within the species as possible. The protein sequences are grouped, e.g. haemagglutinin (HA), neuraminidase (NA), etc., An important technical point is that care must be taken to ensure that each set contains the same gene from the different isolates, i.e. true orthologues, and that paralogues are only found in their own sets, e.g. only E1 in the E1 set, only E2 in the E2 set (from hepatitis C virus). (Paralogues arise due to gene duplication; alpha and beta haemoglobin are a well known example of paralogous genes in vertebrates. Gene duplications are common in the large dsDNA viruses, e.g. mimivirus [12].) In practice, this means the one-to-one mapping of positional (i.e. syntenic) orthologues [13].

For each protein, an unrooted phylogenetic tree is constructed from the set of unique sequences. The choice of phylogenetic tree building engine and amino acid substitution matrix will be discussed below. For each leaf node (i.e. taxon) in the resulting phylogenetic tree the mean patristic distance is computed between it and every other leaf node. Then the mean of these means is calculated – a single value representing the mean evolutionary distance for that protein computed across the set of unique sequences. An adjusted mean-of-means (AMM) is also computed, where the denominator is the original count of sequences rather than the final count of sequences once duplicates have been deleted. This reflects the fact that duplicate sequences add no new information, i.e. no additional variability. While a reduced number of sequences due to deletion of duplicates can imply over-sampling of that strain and gene during sample collection, it can also imply that there is a limited number of amino acid encodings of the corresponding protein which are viable, i.e. still are able to perform that protein’s function. As a final step, the adjusted mean of means for a set of sequences is divided by the median input sequence length, times 100, giving the adjusted mean distance per 100 aa (AMM100), to facilitate comparisons between proteins with different average sequence lengths. Sorting the list of proteins by decreasing AMM100 value reveals both evolutionary hot-spot proteins (higher average evolutionary rates) and evolutionary cold-spot proteins.

One issue with the use of patristic distances to measure protein evolution, e.g. as proposed in Farris (1972) [8], is that branches close to the designated root are counted multiple times. By averaging patristic distances using each taxon in turn as the root, there is no difference in the number of times leaf branches are counted. However, internal branches will be counted more times than leaf branches, particularly branches linking “clades” (i.e. subtrees of highly similar proteins that are substantially different to proteins in other subtrees). The effect – a weighted mean, with the impact blunted by averaging – is desirable because it is the internal branch lengths, particularly those between “clades”, that reflect a requirement in certain proteins for higher evolutionary rates despite strong purifying selection. By contrast, using a simple mean of patristic distances, for example, would result in evidence from the small number of significant internal branches (inter-clade) being lost among the much larger number of intra-clade distances.


Table 1 below shows the results of a MeaPED analysis of human influenza A virus, hepatitis C virus (type 1), human immunodeficiency virus type 1 (subtype b), dengue virus (type 1), measles, polyomavirus BK and rotavirus A. (The analyses of these plus swine and avian influenza A virus, HIV1 subtypes c and d, dengue virus types 2,3,4 and hepatitis C virus types 2,3,4,6 comprise Table S1.) The viral species, spanning the positive and negative single-stranded, doubled-stranded and reverse transcribing RNA viruses, and double-stranded DNA viruses, are summarised in Table 2.

Table 1. Protein Evolutionary Distances - Initial and Final counts of sequences following deletion of duplicates, Mean PED, Adjusted Mean PED and Adjusted Mean PED per 100 aa, and mean .


In Table 1, the proteins for the different viruses are sorted by decreasing AMM100 value (column 7), with values for the corresponding genes being found in column 8. Looking in Table 1 specifically at influenza virus, the fact that the antigenically exposed haemagglutinin (HA) and neuraminidase (NA) show the most variation comes as no surprise. The appearance of PB1-F2 as an evolutionary hot-spot in human (and also avian and swine) influenza is more controversial. First described in Chen et al. (2001) [14], PB1-F2 is understood to induce apoptosis through interactions with mitochondrial membrane adenine nucleotide translocator 3 (ANT3) and outer mitochondrial membrane voltage-dependent anion channel 1 (VDAC1) [15]. However, there are no single nucleotide polymorphisms (SNPs) in the exons of VDAC1 and at most one in the exons of ANT3 (see Methods), so variability in these interacting proteins cannot be driving the evolution of PB1-F2. PB1-F2 has also been found to interact with PB1, a subunit of the RNA polymerase complex [16]. Here again, Table 1 shows that there is two orders of magnitude less variability in PB1 than in PB1-F2. PB1-F2 has recently been shown to be natively unfolded, form amyloids and be able to perforate cell membranes [17]. On the other hand, it is argued in Trifonov et al. (2009) [18] that because PB1-F2 accumulates stop-codons and is therefore often truncated – and generally shows little evidence of selection – it does not play a significant evolutionary role. Clearly further research is needed to determine the role of PB1-F2 in vivo. Looking further down the AMM100 list for influenza virus, the NS1 protein AMM100 score is an order of magnitude less than haemagglutinin, neuraminidase and PB1-F2, and is known to play a role in immune evasion [19]. At the bottom of the list are polymerase proteins PB1 and PB2, whose AMM100 scores are an order of magnitude less again. It is also of interest to note that AMM100 values are higher in avian and swine influenza than in human influenza, indicating greater resistance to evolutionary pressure.

Hepatitis C Virus

Scanning the AMM100 values for hepatitis C virus it is clear that the resistance to evolutionary pressure on its genes is generally higher than for influenza virus, with the top three values being for p7, NS2 and envelope protein E1. The p7 is a short (63 aa) non-structural protein that oligomerises and becomes resident in the endoplasmic reticulum. It has been found to have ion channel activity and to be involved in the release of infectious hepatitis C virus [20]. (Indeed, VPU – another viroporin found in HIV – has the highest AM100 score for all three of the HIV1 subtypes examined in this study.) Envelope glycoproteins E1 and E2 dimerise to gain entry into host cells [21]. The E1, E2 and p7 proteins have also been shown to be immunogenic [22], [21]. In view of that it is interesting to note the lower AM100 value for the longer E2 protein (363 aa) versus the shorter E1 protein (192 aa), though their adjusted mean scores are similar, suggesting that while some portions of the E2 protein provide evidence of higher evolutionary rate, other parts are relatively conserved. The evidence for NS2 as an evolutionary hot-spot for hepatitis C virus is mixed, for although it has a high AMM100 score for hepatitis C virus type 1 and type 4, it has a more middling score for the other hepatitis C virus types. While it has been known for some time that NS2 works cooperatively with NS3 to cleave the peptide bond linking NS2 and NS3, the function of the mature NS2 protein is still uncertain. However, NS2 has been shown in vitro to inhibit host cell gene expression [23] and to interfere with CIDE-B induced, caspase mediated apoptosis [24]. At the other end of the scale, NS5b (RNA-dependent RNA polymerase), NS3 (serine protease/helicase) and the C (core) protein – involved in capsid formation – [25] are cold spot proteins across the hepatitis C virus types.

Fast and Slow: HIV versus Dengue Virus and Polyomavirus

Analysis of HIV1 reveals a very similar picture to hepatitis C virus – even the most slowly varying protein, the POL reverse transcriptase/integrase polyprotein, is evolving at a rate similar to the mid-range NS1 of influenza, while most of the remaining genes are evolving at a rate similar to, or faster than, influenza’s haemagglutinin or neuraminidase. By contrast, dengue virus and polyomavirus present the opposite picture. Being a mosquito-borne pathogen, dengue virus has to survive in both the mosquito and human hosts, implying a double constraint on the evolution of its proteins, which is reflected in AMM100 values that are comparable with M1 at the highest and then drop an order of magnitude. The polyomavirus proteins have a similar AMM100 score profile to dengue virus, though for a somewhat different reason. After primary acute infection, the dsDNA polyomavirus can persist for a along period in its host, only reappearing in immunocompromised hosts [26]. Indeed, long term persistence of dsDNA viruses in their hosts, to which they are generally closely adapted, is a hallmark of these viruses and it has been argued that this persistence has contributed to virus-host coevolution [27].

Analysis of the MeaPED Method

An obvious question is whether MeaPED analyses are robust or dependent on the choice of phylogenetic tree building application, and then the choice of amino acid substitution matrix. One practical constraint is that whichever phylogenetic-tree building method is chosen, it must be able to deal with large data-sets; the largest used here (human influenza) has more than 3,300 sequences for each protein. One such application is the Maximum Likelihood method Phyml version 3.0 [6], which was used for the analyses described above, in combination with the default LG amino acid substitution matrix [28]. To test the robustness of MeaPED, the Neighbour-Joining applications Protdist (default JTT amino acid substitution matrix) and Neighbor (from the Phylip suite [29]) were substituted for Phyml and the experiments were repeated. Spearman Rank Correlations where computed comparing the Phyml-based and Neighbor-based calculations for each species/subtype across the sets of proteins. The coefficients of determination () and the associated p-values are shown in Table 3. The results indicate that the MeaPED method is largely independent of the choice of phylogenetic-tree building method (and amino acid substitution matrix).

Table 3. Spearman Rank Correlation of MeaPED Analyses Undertaken via Phyml and Neighbor.

A second question is how Mean Protein Evolutionary Distance compares with the standard approach: the ratio of nonsynonymous to synonymous substitutions, . To test this, the rankings of proteins based on AMM100 values were compared using Spearman Rank Correlations across the different virus subtypes. A similar analysis was undertaken with the orderings based on the mean values computed across the corresponding genes. Such comparisons are valid, for although the evolutionary rates of orthologous genes can in general vary between species, in this case the analysis is based on the identical genes coming from different isolates of the same species, so therefore subject to the same functional constraints. This effect is strengthened because positional orthologues are subject to tighter evolutionary constraints [13]. The approach appears to be borne out by the results, shown in Table 4, where the ranking of proteins by AMM100 score across, for example, HIV subtypes has of 0.83 while the ranking of the corresponding genes by value has of 0.86. On the other hand, Penn et al (2008) [30] would seem to question this assumption, with the paper providing evidence that different HIV1 subclades can have different evolutionary rates – what they call “rate shifts”. However, looking at the extended results in Table S1 you can see that protein-for-protein, HIV1 subtype c, for example, has a higher AMM100 value than the corresponding rank in HIV1 subtype d. In other words, for this hyper-mutator virus, the evolutionary rate for a protein does change between subtypes, but so too do the other proteins in the corresponding subtypes, preserving the ordering. This is implicit in Table 1 from [30], where the ranking of Proportion of Shifting Sites approximately follows that seen due to AMM100 values, with Pol having the lowest proportion (i.e. being the most constrained) and Vpu having the highest proportion. In this light, a statistical test based on ranking, rather than absolute values, is appropriate. Turning to the AMM100 versus dN/dS based rankings, shown in Table 4, for the dengue virus subtypes, the MeaPED approach was somewhat better, while for the different influenza host species subtypes, the approach was somewhat better. For the HIV subtypes the orderings of the genes due to MeaPED and were equally consistent. However, for the Hepatitis C subtypes the MeaPED approach produced significantly more consistent results.

Table 4. Comparing the consistency of AMM100 and values across virus subtypes.

A third question is whether there is a correlation between AMM100 (or ) values and the number of sequences in the input set. This can be investigated by observing that in this study data is provided for the subtypes of certain viruses: influenza, hepatitis C virus, dengue virus and HIV, and the different subtypes are represented by different numbers of sequences. For each gene in a given virus, the AMM100 (or ) values can be correlated across virus subtypes with the final counts of sequences after the deletion of duplicate sequences, . On that basis, AMM100 scores for 8 out of 11 dengue virus genes were positively correlated with across 4 virus subtypes, while scores for 8 out of 9 HIV genes were positively correlated across 3 virus subtypes and scores for 7 out of 11 hepatitis C virus genes were positively correlated across 5 virus subtypes. On the other hand, only 3 out of 11 influenza AMM100 scores were positively correlated with across 3 virus subtypes. Given that the number of virus subtypes is small – N = 3 (influenza and HIV), N = 4 (dengue virus) and N = 5 (hepatitis C virus) – and therefore the possibility that the lists of values can be correlated by chance, together with the fact that each species had some genes yielding opposite correlations, it can be assumed that there is no systematic correlation between AMM100 values and the counts of sequences being examined. (Analysis based instead on yields a similar conclusion.)

A final question is whether there is a relationship between MeaPED scores and the lengths of the input sequences, represented by the median input sequence length for each species subtype. It is plausible that an inverse relationship may exist – longer sequences attracting lower MeaPED scores – because, assuming globular structures, larger proteins will have more of their residues buried and buried residues are known to mutate more slowly than surface residues, particularly residues occurring in loops. To investigate this, for each virus subtype a linear regression was done between the set of Mean or AMM100 scores and the corresponding median input sequence lengths. Of the 18 virus subtypes, 12 returned a negative correlation between median input sequence length and mean MeaPED score. However, none of these was significant, the most significant being measles () and polyomavirus (). Linear regression models involving AMM100 produced more significant results – dengue virus types 1 and 2, HIV2 subtypes b and c and hepatitis C virus (all types) were significant at the level, but that is to be expected because computation of AMM100 involves the median input sequence length. (Comparisons involving yielded similar results to those involving the mean MeaPED score.) In summary, there is a small affect due to input sequence length, but it is not significant.


Likely Roles of Hot Spot and Cold Spot Proteins

One observation evident from the Results is that relative hot spot proteins are likely to interact with the host. Examples include: hemagglutinin (influenza) and viroporins agnoprotein (polyomavirus), p7 (hepatitis C) and VPU (HIV). For measles, the protein with the highest AMM100 score is the V protein, which is known to inhibit alpha interferon signalling through a number of interactions, including acting as a decoy substrate for IB kinase , preventing phosphorylation of IFN regulatory factor 7 [31]. In this context it is also interesting to contrast human influenza hemagglutinin – AMM100 value of 0.1701, and at the top of the list of influenza virus AMM100 values – with measles virus hemagglutinin, which has an AMM100 value of 0.0027 and near the bottom of the measles virus AMM100 values. Unlike influenza virus, which has separate neuraminidase (NA) and hemagglutinin (HA) proteins, paramyxoviruses, including measles, have the two functions performed by the same, HN protein. However, in the measles virus, the protein lacks neuraminidase activity (and is therefore designated “H”). It also does not bind sialic acid and the binding pocket appears enlarged [32]. Instead, the H protein binds signalling lymphocyte activation molecule (SLAM, also called CD150) and CD46 [33]; CD46 regulates complement activation while SLAM triggers a cascade that results in cytokine release, including IL4 and IL13. In other words, rather binding sialic acid as does influenza hemagglutinin, and inducing a strong immune response [34], measles virus H protein binds to SLAM or CD46, thereby gaining entry to cells and also performing immunosuppression by acting as a inhibitor of those key proteins [33].

MeaPED analysis can also highlight cold spot proteins. These include the RNA directed RNA polymerase proteins PB1 and PB2 (influenza), NS5 (dengue virus), L (measles virus), NS5b (hepatitis C virus), and VP1 (rotavirus), and internal serine protease NS3 (dengue and hepatitis C). The POL protein from HIV combines an internal protease with a reverse transcriptase/integrase. The theme that emerges is that, in contrast to hot spot proteins, cold spot proteins are internal to the working of the virus. While the identification of hot-spot proteins can yield useful biological insights, the identification of cold-spot proteins, on the other hand, can tell us which proteins are less likely to evolve should they be targeted by anti-viral therapeutics. For this analysis, the adjusted mean of means (AMM) score is sometimes more useful than the AMM100 score. For example, near the bottom of the Table 1 list for influenza A virus is the matrix protein M1, whose AMM score is two orders of magnitude less than the top three in the list. Multiple sequence alignment of the M1 protein reveals quite long peptides (20–40 aa) which are identical (or very nearly so) across the large human, swine and avian data-sets, but which may not be “drugable” in the conventional sense. While any anti-viral therapy will act as a selective agent and induce increased viral evolutionary rates, if M1 can be targeted, e.g. through peptide ligands such as Phylomers [35], the implication is that M1 will evolve far less rapidly than haemagglutinin or neuraminidase, which are the targets of vaccines and drugs such as Relenaza (Zanamivir) or Tamiflu (Oseltamivir).

Comparison with

While the primary intention of using has been to rank evolutionary rates, a general observation from the data presented here is that, with the exception of HIV, the vast majority of the genes have average values of 0.1 or less, placing them in the range – according to Kryazhimskiy and Plotkin (2008) [11] – where strong purifying selection can be assumed, justifying the assumption that the evolution of proteins, at least for these viruses, is tightly constrained. The hint for why this might be the case comes from two concepts developed in that paper: evolution over long versus short time-scales and silent (or non-silent) mutations versus fixations. Given short viral doubling times and high mutation rates, one can assume that a wide range of possible encodings for a protein will have been tried over what is a relatively short chronological time, but representing many generations. However, functional constraints on the proteins will mean that only the small number of encodings leading to viable proteins will be fixed in a population. This is very evident, for example, in neuraminidase from influenza, where the different clades are very distinct. It is therefore plausible that the different clades represent quasispecies [2]. Both the average computations and the MeaPED analyses involve all-against-all comparisons, which emphasise the distances between the quasispecies, as most of the comparisons will be between sequences from different clades/quasispecies rather than sequences from the same clade/quasispecies.

In Summary

These examples illustrate that the Mean Protein Evolutionary Distance approach is a robust method that concisely and consistently captures an important facet of viral protein function through their differing responses to evolutionary pressure. In doing so it is at least as effective, if not better than, the equivalent computation using average values, does not suffer from issues of interpretation highlighted by Kryazhimskiy and Plotkin (2008) [11] and, unlike the method, the MeaPED method is able to make use of data from gaps in the underlying multiple-sequence alignments, neglect of which can produce less accurate trees [36]. The MeaPED method is also considerably quicker, particularly for large data-sets. Finally, although MeaPED analysis has to date only been used on viral data-sets, with the increasing number of isolates from different microbial genomes being sequenced the data is becoming available for the method to also be applied to proteomes of species from other kingdoms.


Sources of Sequences Used

Seven species were examined for this study: human, swine and avian influenza A virus, hepatitis C virus (types 1,2,3, 4 and 6), human immunodeficiency virus type 1 (subtypes b, c and d) and dengue virus (types 1,2,3 and 4), measles, polyomavirus BK and rotavirus A. There were too few hepatitis C virus type 5 sequences for analysis to be carried out. In each case, complete coding-sequence (CDS) sets were obtained for each isolate of the given species and type.

The influenza data-sets came from the Influenza Virus Resource at NCBI [37] Coding sequences for different influenza types, different host species and different geographic locations can be obtained via the web site. Because a very large number of sequences have been recorded in the database, the human data-set was restricted to isolates from China and USA, while the avian set was restricted to isolates from duck species (wild and domesticated) from China and USA. For the sake of consistency, the swine influenza data-sets also used isolates from China and USA. The hepatitis C virus, measles and dengue viruses data-sets were obtained from [38] As with the influenza data-set, the sets of coding sequences can be obtained from the web site. The HIV data-sets were obtained from the Los Alamos HIV databases In this case the sets of genomes for each HIV1 subtype (b, c and d) had to first be downloaded. Then the Gene Cutter tool on the LANL site was used to produce a set of aligned sequences which then had to be post-processed to remove the gap character ‘−’. The rotavirus A data-set was obtained from NCBI Viral Genomes as sets of segments which where first collected as complete genomes. Complete polyomavirus genomes were also obtained from NCBI Viral Genomes. The data-sets were first split into subtypes (if appropriate) and then into the different genes.

Sequence Processing

For each of the gene data-sets, after automatically removing a small number of faulty sequences (either clearly too long or too short, or with predicted coding sequences whose lengths are not a multiple of 3) the number of sequences in each set was recorded (shown in Table 1 as Init N). Because duplicate sequences lengthen processing times but add no additional information and can produce artefacts in phylogenetic trees, duplicate sequences were deleted to produce the final set of sequences for each gene. (The final counts are listed in Table 1 under Final N). The existence of paralogues is not a problem for the small viruses analysed here. For each gene across all the species and subtypes, e.g. matrix M1 protein sequences from avian influenza A virus, a codon-based multiple sequence alignment was created across using a Python application which calls Muscle [5]. The codon-based multiple sequence alignment then became input for both the pairwise calculations and, in protein form, the MeaPED analyses.

Computing MeaPED Scores and

MeaPED analyses were undertaken by a computer program, written in the programming language Python MeaPED first calls Muscle to create a multiple sequence alignment, if one is not already provided, and then calls a phylogenetic tree building application to create a phylogenetic tree based on the multiple sequence alignment. The phylogenetic tree building application Phyml version 3.0 [6] was used as it is a Maximum Likelihood method but still able to process the large numbers of sequences found in some of the data-sets. Branch-length optimisation was specified. For comparison, phylogenetic-tree computations were also carried out using the Neighbour-Joining application Neighbor (from the Phylip suite [29]). Once the phylogenetic tree has been created, then traverses the tree to create a matrix which records the evolutionary distance between each leaf node in the tree (i.e. input sequence) and every other leaf node. Using this information, mean distances between each node/sequence and all the others can be computed, and then the mean of these means across all the (unique) sequences in the data-set for that protein. Finally, the adjusted mean of means (AMM) and adjusted mean of means per 100 aa (AMM100) were computed, as described above.

The computations were undertaking using the codeml application from the PAML suite [10], with input from the codon-based multiple sequence alignments and the corresponding Phyml trees. A single value was returned for each pairwise computation spanning both input sequences. The mean pairwise value was then computed across all the pairwise comparisons.

To estimate the evolutionary pressure on the human proteins ANT3 (gene name SLC25A6) and VDAC1, records of single nucleotide polymorphisms (SNPs) were obtained from the International HapMap Project [39]. The HapMap project has undertaken a comprehensive SNP survey across a limited number of individuals (270 in Phase II) from diverse geographic locations. HapMap’s BioMart tool returned no SNPs in the the exons of the genes encoding these two proteins. Because the HapMap methodology has involved a small number of genomes, an alternative approach was to examine the Ensembl records for the two genes [40]. In this case, use of Ensembl’s Population Comparison tool across the two protein coding transcripts for the gene SLC25A6 revealed a single non-synonymous mutation. However, use of the Population Use of the Comparison tool across the five protein coding transcripts for the VDAC1 did not yield a single non synonymous mutation.

Statistical Methods

The Python scipy mathematical/statistical functions suite was used for the statistical computations. The linear regression function linregress was used for the omparisons of MeaPED scores (and ) with median input sequence. Statistical comparison of the two MeaPED analyses – one using Phyml as the phylogenetic tree building application and the other using Neighbor – was carried out using the Spearman Rank Correlation function function (spearmanr). In the comparison of MeaPED versus consistency (Table 4), an all against all set of comparisons of gene rankings was done for all subtypes of dengue virus, HIV, hepatitis C virus and the avian, human and swine host influenza virus. To avoid double counting, a maximum spanning tree was computed from the pairwise comparisons, such that each virus subtype appears once and there are no cycles. From the reduced set of pairwise values taken from the maximum spanning tree mean correlations of determination were computed, together with combined p-values based on both Stouffer’s and Fisher’s methods (see discussion in Mosteller and Bush (1954) [41]). A final note on estimating p-values from Spearman Rank Correlations. For , the estimated p-value returned by the spearmanr scipy function was used. However, when – a perfect match – the p-value is 0, even when a small number of items are being compared. This clearly overstates the significance of the match, and prevents both Stouffer and Fisher combined p-values being computed. Instead, the p-value for a perfect match involving lists of length was estimated by computing the Spearman Rank Correlation of two sorted lists of unique integers of length , where the lists were identical except that in one list two adjacent integers had the same value (in which case the rank difference is averaged).

Supporting Information

Table S1.

Protein Evolutionary Distances Based on Phyml - Initial and Final counts of sequences, Mean PED, Adjusted Mean PED and Adjusted Mean PED per 100 aa, and mean .


Author Contributions

Conceived and designed the experiments: MJW. Performed the experiments: MJW. Analyzed the data: MJW. Contributed reagents/materials/analysis tools: MJW. Wrote the paper: MJW.


  1. 1. Drake JW (1991) A constant rate of spontaneous mutation in dna-based microbes. Proc Natl Acad Sci USA 88: 7160–7164.
  2. 2. Domingo E, Holland JJ (1997) RNA virus mutations and fitness for survival. Annu Rev Microbiol 51: 151–178.
  3. 3. Faith DP (1992) Conservation evaluation and phylogenetic diversity. Biol Conserv 61: 1–10.
  4. 4. D’Erchia AM, Gissi C, Pesole G, Saccone C, Arnason U (1996) The guinea-pig is not a rodent. Nature 381: 597–600.
  5. 5. Edgar RC (2004) Muscle: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5.
  6. 6. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704.
  7. 7. Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogene- tic trees. Mol Biol Evol 4: 406–425.
  8. 8. Farris JS (1972) Estimating phylogenetic trees from distance matrices. Am Nat 106: 645–668.
  9. 9. Pàl C, Papp B, Lercher MJ (2006) An integrated view of protein evolution. Nat Rev Genet 7: 337–348.
  10. 10. Yang Z (2006) Computational Molecular Evolution. Oxford University Press.
  11. 11. Kryazhimskiy S, Plotkin JB (2008) The population genetics of dn/ds. PLoS Genet 4: e1000304.
  12. 12. Suhre K (2005) Gene and genome duplication in Acanthamoeba polyphaga Mimivirus. J Virol 79: 14095–14101.
  13. 13. Lemoine F, Lespinet O, Labedan B (2007) Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol Biol 7: 237.
  14. 14. Chen W, Calvo PA, Malide D, Gibbs J, Schubert U, et al. (2001) A novel influenza A virus mitochondrial protein that induces cell death. Nature Med 7: 1306–1312.
  15. 15. Zamarin D, Garcia-Sastre A, Xiao X, Wang R, Palese P (2005) Influenza virus PB1-F2 protein induces cell death through mitochondrial ANT3 and VDAC1. PLoS Pathog 1: e4.
  16. 16. Mazur I, Anhlan D, Mitzner D, Wixler L, Schubert U, et al. (2008) The proapoptotic influenza A virus protein PB1-F2 regulates viral polymerase activity by interaction with the PB1 protein. Cell Microbiol 10: 1140–1152.
  17. 17. Chevalier C, Bazzal AA, Vidic J, Fevrier V, Bourdieu C, et al. (2010) PB1-F2 influenza A virus protein adopts a beta-sheet conformation and forms amyloid fibers in membrane environments. J Biol Chem 285: 13233–13243.
  18. 18. Trifonov V, Racaniello V, Rabadan R (August 28 2009) The contribution of the PB1-F2 protein to the fitness of influenza A viruses and its recent evolution in the 2009 influenza A (H1N1) pandemic virus. PLoS Curr 1.
  19. 19. Donelan NR, Basler CF, Garca-Sastre A (2003) A recombinant influenza A virus expressing an RNA-binding-defective NS1 protein induces high levels of beta interferon and is attenuated in mice. J Virol 77: 13257–13266.
  20. 20. Montserret R, Saint N, Vanbelle C, Salvay AG, Simorre JP, et al. (2010) NMR structure and ion channel activity of the p7 protein from hepatitis C virus. J Biol Chem 41: 31446–31461.
  21. 21. Vieyres G, Thomas X, Descamps V, Duverlie G, Patel AH, et al. (2010) Characterization of the envelope glycoproteins associated with infectious hepatitis C virus. J Virol 84: 10159–10168.
  22. 22. Sominskaya I, Alekseeva E, Skrastina D, Mokhonov V, Starodubova E, et al. (2006) Signal sequences modulate the immunogenic performance of human hepatitis C virus E2 gene. Mol Immunol 43: 1941–1952.
  23. 23. Dumoulin FL, von dem Bussche A, Li J, Khamzina L, Wands JR, et al. (2003) Hepatitis C virus NS2 protein inhibits gene expression from different cellular and viral promoters in hepatic and nonhepatic cell lines. Virology 305: 260–266.
  24. 24. Erdtmann L, Franck N, Lerat H, Seyec JL, Gilot D, et al. (2003) The hepatitis C virus NS2 protein is an inhibitor of CIDE-B-induced apoptosis. J Biol Chem 278: 18256–18264.
  25. 25. Penin F (2003) Structural biology of hepatitis C virus. Clin Liver Dis 7: 1–21.
  26. 26. Chesters PM, Heritage J, McCance DJ (1983) Persistence of DNA sequences of BK virus and JC virus in normal human tissues and in diseased tissues. J Infect Dis 147: 676–684.
  27. 27. Villareal LP (1999) DNA virus contribution to host evolution. In: Esteban Domingo RGW, Holland JF, editors, Origin and Evolution of Viruses, Academic Press. pp.391–420.
  28. 28. Le SQ, Gascuel O (2008) An improved general amino acid replacement matrix. Mol Biol Evol 25: 1307–1320.
  29. 29. Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the au- thor, Department of Genome Sciences, University of Washington, Seattle. http://evolution.
  30. 30. Penn O, Stern A, Rubinstein ND, Dutheil J, Bacharach E, et al. (2008) Evolutionary modeling of rate shifts reveals specificity determinants in hiv-1 subtypes. PLoS Comput Biol 4: e1000214.
  31. 31. Pfaller CK, Conzelmann KK (2008) Measles virus V protein is a decoy substrate for iκb kinase α and prevents toll-like receptor 7/9-mediated interferon induction. J Virol 82: 12365–12373.
  32. 32. Hashiguchi T, Kajikawa M, Maita N, Takeda M, Kuroki K, et al. (2007) Crystal structure of measles virus hemagglutinin provides insight into effective vaccines. Proc Natl Acad Sci USA 104: 19535–19540.
  33. 33. Yanagi Y, Takeda M, Ohno S (2006) Measles virus: Cellular receptors, tropism and pathogenesis. J Gen Virol 87: 2767–2779.
  34. 34. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y (1992) Evolution and ecology of influenza A viruses. Microbiol Rev 56: 152–179.
  35. 35. Watt P (2006) Screening for peptide drugs from the natural repertoire of biodiverse protein folds. Nature Biotechnol 24: 177–183.
  36. 36. Dessimoz C, Gil M (2010) Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol 11: R37.
  37. 37. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al. (2008) The influenza virus resource at the national center for biotechnology information. J Virol 82: 596–601.
  38. 38. Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, et al. (2007) National institute of allergy and infectious diseases bioinformatics resource centers: New assets for pathogen informatics. Infect Immun 75: 3212–3219.
  39. 39. The-International-HapMap-Consortium (2007) A second generation human haplotype map of over 3.1 million snps. Nature 449: 851–861.
  40. 40. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, et al. (2007) Ensembl 2007. Nucleic Acids Res 35: D610–D617.
  41. 41. Mosteller F, Bush RR (1954) Selected quantitative techniques. In: Lindzey G, editor, Handbook of Social Psychology, Addison-Wesley. pp.289–334.