Reader Comments

A note on evolutionary rate estimation in Bayesian evolutionary analysis

Posted by sana_eybpoosh on 06 Nov 2014 at 08:38 GMT

Authors:
Kayhan Azadmanesh1; Sana Eybpoosh2*
1 Department of Virology, Pasteur Institute of Iran, Tehran, Iran
2 Regional Knowledge Hub and WHO Collaborating Centre for HIV Surveillance, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran

*Corresponding author: Sana Eybpoosh
Kerman University of Medical Sciences, Institute for Futures Studies in Health, Regional Knowledge Hub and WHO Collaborating Centre for HIV Surveillance, Haft-Baagh Avenue, Kerman, Iran
Tel: +98 (0) 3431325092
E-mail address:
seybpoosh@hivhub.ir
sana.eybpoosh@gmail.com


We enjoyed reading the article by Tien Ng et al., published on December 2, 2013 in PLOS ONE. The article provides information about the epidemic history of three HIV-1 subtypes circulating among Men Who Have Sex with Men (MSM) in Singapore (i.e., subtype B, CRF01_AE and CRF51_01B). The study relied upon 105 HIV-1 samples collected between February 2008 and August 2009. Using these data, the authors estimated the time to the Most Recent Common Ancestor (tMRCA) based on evolutionary rates (μ) calculated for three genetic regions sequenced in this time interval (Pol, and gp120 and gp41 of env). The estimated tMRCA of subtype B and CRF01_AE was concordant across the three genetic regions (late 1990s-early 2000s and early 2000s, respectively), but it could not be estimated precisely for CRF51_01B, ranging from 1996 (1992-2001) for gp41 of the env gene to 2004 (2002-2006) for gp120 of the same gene [1].
We would like to note that the short time-span of sequence sampling in this study (i.e., about 18 months) might be the reason for the inconsistent tMRCA estimates for CRF51_01B. It might also have reduced the accuracy of the tMRCA estimates for all three subtypes. Serially sampled sequences of fast-evolving organisms (such as RNA viruses) can be used to estimate the rate of evolution (μ) directly from study samples. However, the time-span of sampling should be wide enough to ensure accurate estimation of μ. The length of the time interval required varies among populations and species, and depends on a combination of factors such as the underlying mutation rate, tree topology, and sequence length [2]. For a fixed sampling interval, samples with higher mutation rates, lower heterogeneity, and longer sequences are more suitable for evolutionary rate estimation than those with lower mutation rates, higher heterogeneity, and shorter sequences.
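As a rough illustration of why these factors matter, the expected number of substitutions that a single lineage accumulates over the sampling interval is approximately the product of the rate, the sequence length, and the time-span. The Python sketch below computes this quantity. The function name and all numerical values are hypothetical, chosen only for illustration, and are not taken from the study under discussion.

```python
def expected_substitutions(rate_per_site_per_year, sequence_length, span_years):
    """Approximate number of substitutions expected along one lineage.

    rate_per_site_per_year -- evolutionary rate (substitutions/site/year)
    sequence_length        -- alignment length in sites
    span_years             -- time-span of sampling in years
    """
    if rate_per_site_per_year < 0 or sequence_length < 0 or span_years < 0:
        raise ValueError("all arguments must be non-negative")
    return rate_per_site_per_year * sequence_length * span_years


if __name__ == "__main__":
    # Hypothetical values: a 1000-site alignment sampled over 1.5 years.
    for rate in (1e-3, 2e-3, 5e-3):
        n = expected_substitutions(rate, sequence_length=1000, span_years=1.5)
        print(f"rate={rate:g}: about {n:.1f} expected substitutions per lineage")
```

When this expectation is only a handful of substitutions, stochastic variation among lineages can easily swamp the temporal signal, which is one way to see why a short sampling window may be inadequate.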
One way to assess the adequacy of the sampling time-span is to make an initial estimate of μ and its confidence interval with methods such as root-to-tip linear regression or maximum likelihood estimation (see Drummond et al. (2003) [2] for more details). The resulting estimate can be used to assess whether the sequences have accumulated a statistically significant amount of genetic difference over the sampling period (i.e., whether or not the confidence interval of μ includes zero). When the time-span of sampling turns out to be insufficient, tMRCA is better estimated using a μ obtained from external sources (e.g., other datasets or previous research). In Bayesian evolutionary analysis, this external information is encoded as a prior and used for estimation of the coalescent time [3, 4].
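The root-to-tip check mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any particular software package: it assumes that root-to-tip distances have already been obtained from a rooted tree, it uses ordinary least squares, and it uses a normal critical value (1.96) for the interval. Because tips share ancestry, the points are not independent, so the interval should be read only as an exploratory guide. The function names and the example values are hypothetical.

```python
import math


def root_to_tip_rate(dates, distances, z=1.96):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    dates     -- sampling times in decimal years
    distances -- root-to-tip genetic distances (substitutions/site)
    z         -- critical value for the interval; 1.96 is a large-sample
                 normal approximation (an assumption of this sketch)
    Returns (slope, lower, upper); the slope is the crude estimate of the rate.
    """
    n = len(dates)
    if n < 3 or n != len(distances):
        raise ValueError("need at least 3 paired observations")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    # Residual sum of squares and standard error of the slope.
    rss = sum((d - (intercept + slope * t)) ** 2
              for t, d in zip(dates, distances))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope - z * se, slope + z * se


def has_temporal_signal(dates, distances, z=1.96):
    """True if the approximate interval for the slope lies above zero."""
    _, lower, _ = root_to_tip_rate(dates, distances, z)
    return lower > 0


if __name__ == "__main__":
    # Hypothetical tips sampled within a window of about 18 months.
    dates = [2008.1, 2008.5, 2009.0, 2009.6]
    distances = [0.050, 0.040, 0.060, 0.045]
    print(root_to_tip_rate(dates, distances))
    print(has_temporal_signal(dates, distances))
```

If the lower bound is at or below zero, the data alone do not support a positive rate, and an externally informed prior on μ would be the more defensible choice.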
There are numerous datasets and estimates of μ in the literature. The question, therefore, is: ‘which of the many available datasets should one choose for estimation of μ?’ Below, we provide guidance on how to make this decision, using HIV as an example.
Generally, it is recommended that the study sample and the data used for estimation of μ of a species (referred to here as the ‘external dataset’) resemble each other closely with respect to the factors that significantly affect the evolutionary rate.
Specifically, we suggest choosing sequences that have been sampled from the same species and strain (e.g., HIV subtype). In addition, we suggest selecting sequences that are presumably under selective pressures similar to those acting on the sequences at hand.
To minimize the difference in selective pressures between the two datasets, our recommendation is to choose sequences from similar host populations, genes, and genomic regions. Usually, individuals within a population are under similar selective pressures, while different populations can experience different evolutionary forces. For example, individuals within a country or specific geographic region have relatively similar culture and ethnicity, which results in similar selective pressures on the human host and the infecting pathogen [5]. The epidemic growth rate, which itself has a direct effect on the evolutionary rate of the pathogen, is also generally uniform within a population [6]. However, sub-populations with different epidemic growth rates may also exist within a population. For example, individuals within one HIV high-risk group (e.g., MSM) usually share a similar epidemic growth rate, which might differ from that of other risk groups (e.g., injecting drug users). If this is the case, the study sample and the external dataset need to be selected from similar risk groups in order to maximize the similarity between the two datasets. Moreover, factors such as care and treatment interventions and the stage of disease progression impose particular selective pressures on the pathogen. In the case of HIV infection, for example, antiretroviral treatment (ART) can exert particular selective pressures on the virus genome [7]. Therefore, if patients in the study sample are ART-free, or if the genomic region(s) under study are free of drug-resistance mutations, the μ estimated from an external dataset with similar characteristics would better represent the μ of the study sample. In addition, a more vigorous host immune system (usually seen in the early stages of HIV infection) is more likely to recognize and target specific viral antigens, and to impose pressure on the virus to evolve immune escape mutations [8, 9].
Therefore, selecting an external dataset comprising patients at a relatively similar stage of disease progression would also be beneficial. Finally, different genes and genomic regions are believed to be subject to different selective pressures and evolutionary constraints (e.g., higher pressure on coding vs. non-coding regions, and on 1st and 2nd vs. 3rd codon positions) [6]. Therefore, it is strongly recommended to pick similar genes and genomic regions (i.e., similar start and stop codons) for estimation of μ from external datasets.
This note provides a brief discussion of factors affecting the evolutionary rate of living species, a topic that has not been discussed comprehensively elsewhere. It is noteworthy that these factors may vary depending on the context of the research study and the species considered. Therefore, selection of external datasets should be founded on a thorough understanding of the factors affecting the evolutionary rate of the species at the biological and social levels.

Acknowledgement
The authors wish to thank Professor Oliver G Pybus for critical reading of this note.

No competing interests declared.