Assessing the role of transmission chains in the spread of HIV-1 among men who have sex with men in Quebec, Canada

Background Phylogenetics has been used to investigate HIV transmission among men who have sex with men. This study compares several methodologies to elucidate the role of transmission chains in the dynamics of HIV spread in Quebec, Canada. Methods The Quebec Human Immunodeficiency Virus (HIV) genotyping program database now includes viral sequences from close to 4,000 HIV-positive individuals classified as Men who have Sex with Men (MSMs), collected between 1996 and early 2016. Assessment of chain expansion may depend on the partitioning scheme used, and so, we produce estimates from several methods: the conventional Bayesian and maximum likelihood-bootstrap methods, in combination with a variety of schemes for applying a maximum distance criterion, and two other algorithms, DM-PhyClus, a Bayesian algorithm that produces a measure of uncertainty for proposed partitions, and the Gap Procedure, a fast non-phylogenetic approach. Sequences obtained from individuals in the Primary HIV Infection (PHI) stage serve to identify incident cases. We focus on the period ranging from January 1st 2012 to February 1st 2016. Results and conclusion The analyses reveal considerable overlap between chain estimates obtained from conventional methods, thus leading to similar estimates of recent temporal expansion. The Gap Procedure and DM-PhyClus suggest however moderately different chains. Nevertheless, all estimates stress that longer older chains are responsible for a sizeable proportion of the sampled incident cases among MSMs. Curbing the HIV epidemic will require strategies aimed specifically at preventing such growth.


Introduction
In Canada, Men who have Sex with Men (MSMs) are especially at risk of contracting HIV [1].In Montreal, Que ´bec, for instance, HIV prevalence in that risk group could be as high as 13% [2].Analyses of HIV genetic sequences provided by MSMs who participated in the Quebec HIV genotyping program have revealed the existence of many separate transmission chains, and suggested an association between chain length and incidence: 42% of MSMs infected between 2012 and 2015 belonged to a transmission chain comprising 20 or more known cases, compared to 13% [3] between 2004 and 2007.
Highly Active Antiretroviral Therapy (HAART) has been successful in substantially suppressing viremia within the diagnosed population, making late transmission of the virus a lot less common, and consequently, early transmission has been increasingly driving the epidemic [4].Recently-infected individuals are much more likely to transmit because of high viral load, viral homogeneity, and immature immune response, potentially leading to transmission cascades, consecutive transmission events happening in a short time span [5].Those cascades result in the expansion of existing chains and may point to a higher proportion of early transmissions.However, their contribution to incidence is hard to measure accurately due to their estimation being very sensitive to the sampling time of infected individuals [6].
Montreal has become a UNAIDS Fast Track City in May 2017, and the only way to reach the 90-90-90 targets [7] is to drastically reduce incidence produced by long transmission chains.In other words, the aim is for 90% of all people living with HIV to be diagnosed, 90% of diagnosed individuals to receive antiretroviral therapy, and 90% of treated individuals to have viral suppression.Also, quantifying the role of early transmission in the epidemic is important from a public health standpoint, as it can help assess the extent to which programs are able to reach infected individuals early enough.This is the motivation behind the current study, in which we analyse a large sample of HIV sequences collected via the Quebec HIV genotyping program with a variety of methods, comparing estimates of growth in transmission chains to shed light on their sensitivity to modelling assumptions, thus better framing the recent transmission dynamics of the epidemic.
There is a lack of consensus as to how to best partition HIV-1 sequence data into distinct transmission chains, and different methods may produce equally valid, but conflicting results [8].This paper therefore compares several up-to-date estimates of transmission chains in the HIV epidemic among MSMs in Quebec and looks into their temporal expansion, improving as a result previous assessments of the contribution of early transmission to the epidemic.More specifically, we consider the following methods: 1. Maximum likelihood phylogenetic reconstruction coupled with several combinations of bootstrap support and genetic distance requirements for clades, termed the ML-bootstrap approach, 2. Bayesian phylogenetic inference, coupled with several combinations of posterior probability support and genetic distance requirement for clades,

Sequence database
The Quebec HIV genotyping program database, operational since 2002, includes HIV pol sequences from all genotyped persons at the Jewish General Hospital (site supervisor: Bluma Brenner) and Ho ˆpital Notre-Dame (site supervisor: Michel Roger) genotyping sites.All blood samples, sequences, and medical records at the two testing sites were fully anonymized and encrypted before analysis.A unique phylogenetic identifier number was assigned for each person.It was created based on putative cluster group association, using a birthdate cross-linker to rule out repeat patient sampling.First genotypes were used in those analyses.The database can still include multiple sequences for certain patients, in which case letters are appended to the identifier to indicate collection order, e.g.C001-001a, C001-001b.The putative clusters were subjected to BLAST, to ensure that transmissions were local.

Provincial cohort of newly-infected MSMs (2002-2016)
Among the entries in the Quebec HIV genotyping program database [11], we first retain 3,916 subtype B sequences, each obtained from a different individual classified as MSM.Sequences include 918 sites located in the pol gene: positions 10-297 of the protease region (PR), and 112-741 of the reverse transcriptase (RT) region.All sequences were collected between May 3rd, 1996 and February 1st, 2016.When multiple sequences were available for a single individual, we selected the one with the earliest sampling date.Sequences come with a time stamp, indicating when the blood sample was collected, and an indicator of infection status, either chronic treated, chronic untreated, or Primary HIV Infection (PHI).A case is considered a PHI if the sequence was obtained less than six months after seroconversion [11].A physician was responsible for the assessment of PHI status, based on serologic testing algorithms for recent HIV infection [12].Genotypic drug resistance testing is standard of care for all persons diagnosed with HIV, and test requisition forms identify all newly-diagnosed or treatment-naïve persons.All newlydiagnosed persons are genotyped prior to treatment onset and so, do not have acquired resistance.The database included 212 sequences from chronic treated persons or persons with missing infection status, which we exclude to prevent artefacts in clustering due to selective pressure induced by treatment.Transmitted resistance on the other hand is limited to only a few single point mutations, K103N and G190A for example.Past analyses of a previous version of the database [11,13] have revealed that chain membership estimation was not affected by their inclusion.We are left with 3,704 sequences, of which 1,401 were sampled from individuals in the PHI stage.Finally, for rooting purposes, we add to those 3,704 sequences three subtype A outgroup sequences from Zambia [14] (NCBI accession numbers AB254141, AB254142, AB254143), and it follows that the dataset we consider includes 3,707 sequences in total.Subtype A sequences are sufficiently different from subtype B sequences to make appropriate outgroups.As well as allowing root placement, they serve as a crude check of inference quality.Indeed, any phylogeny in which subtype A sequences are placed inside subtype B transmission chains would be discarded outright.

Chain estimation
We present a brief review of phylogenetic concepts in Supplementary Material S1.We obtain the maximum likelihood (ML) phylogenetic estimate [15] with RAxML 8.2.10 [16], under the assumption that nucleotide evolution follows the GTR + I + Γ(5) model.We produce 1,000 bootstrap trees, and use them to evaluate confidence in the clades found in the estimated ML phylogeny.To obtain estimates of transmission chains, we separately consider 15 combinations of bootstrap support and genetic distance requirements.More specifically, we first apply a bootstrap support requirement of either 70%, 90%, or 95%.We then consider a distance requirement of either 1.5%, 3.0%, 4.5%, or one corresponding to the 15th or 30th percentile of the maximum likelihood tree's patristic distance distribution [17].We apply the genetic distance criterion in one of three ways: we use, in turns, the maximum within-chain p-distance requirement of ClusterPicker [18], the maximum median within-chain patristic distance requirement of PhyloPart [17], and the maximum within-chain patristic distance requirement found in [11].For the PhyloPart analysis, we consider a combination of the bootstrap requirements listed before, but, per guidelines in [17], only the two distance cutpoints based on percentiles of patristic distances.
We also perform phylogenetic inference with MrBayes 3.2.6 [19] using default parameters, under the assumption that nucleotide evolution follows the GTR + I + Γ(4) model.MrBayes uses the Markov Chain Monte Carlo (MCMC) algorithm [20], more specifically the so-called Metropolis-coupled MCMC, or (MC) 3 [21], algorithm, to generate estimates for the posterior distribution of phylogenetic parameters.The MCMC algorithm lets us obtain samples from the posterior distributions of interest.It starts off by setting all parameters at an arbitrary value.Then, in each iteration, updates to parameter values are proposed, conditional on their current values.Each proposal is randomly accepted with probability equal to the Metropolis-Hastings (MH) ratio, producing a move in the parameter space; else, no move is recorded.After a large number of iterations, parameter values generated throughout the chain are used to empirically estimate the posteriors.We run three million iterations, burning in the first 50% and sampling one iteration out of 500.We derive the majority rule consensus tree from the remaining 3,000 trees, and produce chain estimates by identifying clades with posterior probability support of 1.0.We picked this requirement because it is, by far, the most commonly encountered in the HIV clustering literature, e.g.[22][23][24][25].Studies that use posterior probability requirements under 1.0 are fairly rare [26,27].Since the majority rule consensus tree does not have branch lengths, we use the ClusterPicker algorithm to obtain chain estimates, under all distance requirements stated earlier.We therefore produce five additional partitions.
DM-PhyClus is a Bayesian phylogenetic algorithm that estimates transmission chains by identifying disjoint sets of sequences supported by subtrees with distinctive branch length distributions.It therefore avoids patristic distance and clade support requirements [9].Conditional on an input phylogeny-the maximum likelihood estimate in this study-it uses MCMC to produce an estimate of the posterior distribution of chain membership indices.As a result, it also provides a straightforward measure of uncertainty for chain membership indices, in the form of co-clustering frequencies across the MCMC run.It requires specification of a number of other priors and evolutionary parameters, which we list in Supplementary Material S4.We perform 220,000 iterations, discarding the first 20,000 as a burn-in and applying a thinning ratio of 1 over 200, leaving us with a sample of size 1,000.We identify the partition that maximises the joint posterior probability score, which we refer to as the Maximum Posterior probability (MAP) estimate.We also derived the so-called linkage estimate from the MCMC run results [9].To obtain the linkage estimate, we first derive an adjacency matrix from each partition visited in the mcmc run.The adjacency matrix is a square matrix with entry (i, j) taking value 1 if sequences i and j are in the same transmission chain, and 0 otherwise.We then compute a mean of all the obtained matrices, and round up to 1 all elements above a prespecified threshold, e.g.0.7, and down to 0 all the others.We consider the resulting matrix a representation of an undirected unweighted network.The distinct components, called modules, it contains form the linkage estimate.We present a more detailed description in Supplementary Material S5.
The Gap Procedure is a pure distance-based partitioning algorithm that avoids reliance on ad hoc cutpoints by separating sets of sequences into distinctive components without requiring phylogenetic estimation [10].When the true chains are compact and separable enough, the Gap Procedure can propose partitions that largely agree with conventional phylogenetic estimates, but in a fraction of the time normally required for such analyses, thus making the method ideal for handling large datasets.For example, in an analysis presented in [10], partitioning a dataset comprising 627 sequences of length 810 took 126 hours with MrBayes and less than a second with the Gap Procedure.The method takes as sole input a matrix of p-distances, which we obtain under the Kimura 1980 (K80) model.We leave tuning parameters at their default values.

Cutpoint selection for the conventional maximum likelihood and Bayesian methods
We wish to retain only one chain estimate per method used, and must therefore select a "best" partition among those proposed.Prior to the analyses, clinicians involved in the Quebec HIV genotyping program had performed a preliminary partitioning of the dataset.The partition they proposed results from successive updates, performed at different points in time, of an initial set of transmission chains.They inferred the initial set by imposing very stringent bootstrap support (� 98%) and maximum within-chain patristic distance requirements (< 0.015 nt/bp) [11].Such conservative thresholds are meant to ensure that any two sequences found to co-cluster indeed belong to the same transmission chain.In the following updates, once a sequence is attributed to a chain, it cannot be re-attributed to another chain.They identified from the results seven noteworthy chains comprising 372 sequences in total, which we use as a reference [17].One of those chains, for instance, has 68 sequences, among which more than half harbour the Non-Nucleoside Reverse Transcriptase Inhibitors (NNRTI) mutation K103N.We compare all partitions we obtain with that reference using the Adjusted Rand Index (ARI), a measure of similarity between two partitions, with the aim of maximising overlap.Greater ARI values are better, and the measure is bounded above by 1, indicating perfect correspondence.We describe the comparison scheme in more details in Supplementary Material S2.

Growth of transmission chains
To evaluate the growth of transmission chains, ideally, we would need to know seroconversion dates for all cases whose sequences were sampled.Indeed, it is not enough to infer that a sequence belongs to a known chain: it might be that the infection event happened a long time ago, but was only diagnosed recently.In that situation, the chain did not actually expand: we are merely observing the effects of an increase in the sampling rate.This is a common problem in transmission chain inference [6].The dataset contains an infection stage indicator, equivalent to a censored estimate of infection time, i.e. smaller (greater or equal) than six months prior to the sampling date for PHIs (chronic cases).Since HIV genetic diversity is low in newly-infected individuals [28], we can more accurately associate PHIs to the different chains.Most HIV-positive individuals are diagnosed while already in the chronic stage, at which point seroconversion date estimates are very imprecise [28].As a result, we decide to use PHIs only to obtain a lower bound estimate for the growth of inferred chains, since infection dates for PHIs can be placed reliably within a short time window prior to sampling.We focus on a period ranging from January 1, 2012 to February 1, 2016.For example, one of the methods may propose a chain of size 20, with eight of its sequences having been obtained from cases diagnosed while in the PHI stage at some point in 2014.We can therefore be certain that those cases were infected after January 1, 2012 and so, we conclude that the chain has accrued at least eight new cases in the selected period.

Software
We import phylogenetic estimates from RAxML or MrBayes into R v3.2.3 and produce partitions of the data with custom functions built with the help of the phangorn and ape libraries [29].We use functions in the GapProcedure and DMphyClus R libraries to obtain the other estimates.

Cutpoint selection
The 15th and 30th percentiles of the maximum likelihood tree's patristic distance distribution are equal to 6.8% and 7.7%, respectively.In all maximum likelihood analyses, the bootstrap support requirement of 70% resulted in greater overlap with the reference.Under the maximum patristic distance scheme of [11], we found that a distance requirement of 7.7% maximised the correspondence (ARI = 0.91).With ClusterPicker, requirements of either 6.8% or 7.7% were preferable (ARI = 0.91).In PhyloPart, a median within-chain patristic distance requirement of 0.03 resulted in the largest overlap with the reference (ARI = 0.98).Finally, in the Bayesian analysis, in addition to a posterior probability requirement of 1, we determined that a 6.8% or 7.7% requirement for maximum within-chain p-distances were equivalent (ARI = 0.91).In cases where several distance requirements were equivalent, we picked the smallest one.

Estimates comparison
We first compare optimal partitions from all methods with the ARI, cf.Table 1.As expected, the DM-PhyClus linkage and MAP estimates are very similar (ARI = 0.99).Besides those, we observed the largest overlap between the partitions resulting from the ML + ClusterPicker and MrBayes + ClusterPicker methods (ARI = 0.94).DM-PhyClus produced the most distinctive partition, producing overlap with other methods ranging from 0.64 and 0.72.The larger correspondence with the Gap Procedure estimate is not surprising, since both methods look for chains that are well separated.We depict in Fig 1 the correspondence between the seven partitions we computed.Each square in the heat map is made up of a large number of pixels, each pixel indicating the number of methods that produced partitions attributing to the same chain the sequences labelled by the x and y axes.In other words, for any two distinct sequences, we counted how many of the seven estimates had them in the same chain: this is what the graph represents.We only show the 2,938 sequences found to co-cluster with at least one other sequence by at least one of the methods.The graph reveals several moderately-sized chains.The largest rectangle, marked We excluded all sequences that were found to be singletons by the seven methods we used, leaving us with 2,938 sequences.We gave each remaining sequence a numerical label between 1 and 2,938.The colour of each pixel indicates the number of methods for which the two sequences identified by the x and y axes co-clustered.https://doi.org/10.1371/journal.pone.0213366.g001"1" in the figure, includes 126 sequences, 124 of which belonged to the same reference chain.The earliest sequence in the block, a PHI, was collected on August 13th, 2002, and the latest, also a PHI, on December 23rd, 2015, which suggests the chain might still be expanding.The MAP and linkage estimates of DM-PhyClus, on the other hand, split this block into 14 components, including three chains of size 37, 36, and 14, respectively, and 7 singletons.
The second largest block, marked "2" in the figure, comprises 88 sequences, 87 of which coclustered in the reference set.Its sequences were sampled between May 11st, 2004 (chronic untreated) and December 14, 2015 (PHI), which highlights its durability.Of its 3,828 pairs of sequences, 2,932 were found to be in the same chain by all methods, and 395, by five of the seven methods.The block marked "3" also stands out.Among its 71 sequences, 68 were found in the same reference chain.Its first sequence (PHI) was collected on August 13th, 2002 and its last (chronic untreated), on March 28th, 2015.Of its 2,415 unique pairs of sequences, 1,066 were found to co-cluster by all methods, and 1,280, by five of the seven methods.
Table 2 provides summary statistics for the inferred chains.The four conventional methods produced chain length distributions with similar characteristics.We found that ML + Phylo-Part and DM-PhyClus + Linkage had the highest proportions of singletons, with approximately 36% of sequences not co-clustering with any other sequence.Further, DM-PhyClus did not identify the size 126 transmission chain found by the other methods, as we had noted when we commented on block 1 in Fig 1 .On the other hand, the Gap Procedure and the conventional Bayesian estimate had the fewest singletons.The Gap Procedure estimate, however, contained many more transmission pairs.

Transmission chain growth assessment
Among the 957 cases in the MSM risk group added to the database in the period ranging from January 1st, 2012 to February 1st, 2016, 304 were PHIs.Of those PHIs, 254 were sampled after June 30th 2012, guaranteeing that the corresponding transmission events took place in 2012.According to the ML + PhyloPart estimate, 50 (20%) of those 254 PHIs are singletons, 23 (9%) are found in transmission pairs, and 153 (60%) belong to chains of length five or more.In comparison, in the period ranging from July 1st 2008 to January 1st 2012, 319 MSM cases diagnosed in the PHI stage were added to the database.Out of those, 83 (26%) are singletons, 34 (11%) belong to transmission pairs, and 159 (50%) are part of chains of length five or more.
We represent the 30 longest chains, according to the ML + PhyloPart estimate, in Fig 2 .The labels on the y axis are based on transmission chains described in [3,30].When we found a chain with members from more than one such chain, we created a new label by concatenating the numbers of the two largest chains, under the condition that the second largest was not a singleton.Those 30 chains include 126 recent PHIs, split between 22 chains.Among the ten  When there is only one such PHI, we display the corresponding collection date instead.We assume that all chronic cases recorded before July 1st, 2012 were infected prior to 2012.The dark red bar represents the "minimum chain size before 2012" because several chronic cases diagnosed after July 1st 2012 were probably infected prior to 2012.Also, it is likely that several PHIs sampled in the first half of 2012 match with transmission events that occurred late in 2011.
https://doi.org/10.1371/journal.pone.0213366.g002longest chains, nine include at least one recent PHI.The longest chain includes 12 recent PHIs, while the second and third include 26 and two, respectively.Chain C185 is noteworthy: despite its smaller length prior to 2012, it has grown quickly, with the addition of 22 recent PHI.Chain C067, on the other hand, is still short, but had not been recorded before 2012.Each of those two chains has a PHI recorded as late as the second half of 2015, indicating that they may still be expanding.The partition produced by DM-PhyClus + MAP is different, but leads to similar conclusions, as shown in Fig 3 .Of the 30 longest chains, 20 include at least one recent PHI.We note that the longest chain includes sequences from transmission chain C163; other methods found instead that the longest chain comprised sequences from chain C050, which DM-PhyClus split up.Other conventional estimates, the Gap Procedure, and DM-Phy-Clus + Linkage lead to similar conclusions, as can be seen in Supplementary Material S6.The split of C099 in the DM-PhyClus estimate is also meaningful, as it reflects successive waves of infections.All sequences in C099(01) are from a wild-type virus.In C099(2), 36 out of 43 sequences harbour transmitted drug-resistance mutation K103N, four out of 43 harbour mutation V106I, and finally, only three out of 43 are from a wild-type virus.

Discussion
Phylogenetic analyses can help in the prioritization of public health interventions by highlighting high-risk transmission groups.Previous findings had suggested that long chains play a substantial role in sustaining HIV epidemics in the era of Treatment-as-Prevention [3].In Quebec, people belonging to long chains are mostly MSMs, younger, and concentrated in the greater Montreal area [3].Viral species found in long chains may be characterised by improved replicative fitness and the ability to overcome transmission barriers [31].Moreover, poor testing habits may exacerbate the persistence of long chains [3].It remains important to identify and target growing transmission chains in near real-time [32,33].This study compared several phylogenetic approaches applied to a subset of a large HIV sequence database, representative of the molecular diversity in the HIV epidemic among MSMs in Quebec.More specifically, we looked at HIV-1 transmission chains among 3,704 HIV-1 cases belonging to the MSM risk category.We compared estimates from seven methods, four conventional approaches relying on a variety of cutpoints applied to phylogenetic estimates, and three additional recent approaches seeking to reduce reliance on cutpoints, the Gap Procedure and two variations on DM-PhyClus.We found that estimates obtained from conventional methods were overall fairly similar.The estimates from DM-PhyClus involved a noticeably different chain length distribution.All estimates however produced a similar assessment of chain growth in the period ranging from January 1st, 2012 to February 1st, 2016: nine of the ten longest chains had grown in the selected period, three of those having accrued at least ten new cases.Further, we observed emerging chains, as well as an increase in the proportion of recent PHIs belonging to chains of size 5 or more.
The study has several limitations.Cutpoint selection remains inherently subjective.In the current work, the observed similarities in the proposed choices of cutpoints were not surprising.Chain estimation based on the consensus tree obtained after running the MCMC algorithm relied on ClusterPicker, just like one of the ML-bootstrap approaches.Often, chains found through a Bayesian analysis agree substantially with those obtained through a sensibly-tuned ML-bootstrap approach [22,27].Also, ClusterPicker uses maximum withinchain p-distances, which provide a rough approximation of patristic distances.It follows that tuning ClusterPicker to produce estimates in line with those from the method of [11] is possible.The labels at the end of each bar indicate the sequence collection dates for the first and last "recent" PHIs in the chain, that is, recorded on or after July 1st, 2012.When there is only one such PHI, we display the corresponding collection date instead.We assume that all chronic cases recorded before July 1st, 2012 were infected prior to 2012.The dark red bar represents the "minimum chain size before 2012" because several chronic cases diagnosed after July 1st 2012 were probably infected prior to 2012.Also, it is likely that several PHIs sampled in the first half of 2012 match with transmission events that occurred late in 2011.https://doi.org/10.1371/journal.pone.0213366.g003 Choosing cutpoints as to maximise overlap with a reference set does not guarantee that other chains will be estimated well.Moreover, identifying a suitable reference set can be difficult.In our study, researchers involved in the Quebec HIV genotyping program proposed the set.A different reference set might have led to different cutpoints.Several of the approaches we used did manage to recover the reference set very closely though, which suggests that it is not unrealistic.
DM-PhyClus, being a Bayesian method, rests on a number of prior assumptions, which are all more or less informative, and it follows that prior calibration is key.Although estimates are reasonably robust to some prior assumptions [9], it remains possible that a combination of very poorly chosen priors may result in misleading chain estimates.Further, DM-PhyClus looks for branch length patterns hinting at the existence of transmission cascades.This study focused instead on transmission chains, which do not necessarily involve such patterns.Therefore, the assumptions underlying DM-PhyClus might not be entirely appropriate.
Reliable infection date estimates for cases diagnosed while in the chronic stage are unavailable and so, we could only obtain a lower bound for chain growth between January 1st, 2012 and February 1st, 2016.Time between seroconversion and diagnosis can go from months to years, depending on an individual's testing habits, with a mean that could be above two years [34][35][36], and it follows that several chronic cases diagnosed after June 30th, 2012 might have been infected during the selected period.Estimating infection time from the fraction of ambiguous nucleotides in each sequence would have been possible [28], but the high standard deviation for such predictions would have limited their usefulness.
Because of the non-random sampling of cases, we cannot readily deduce from our estimates the population-level chain length distribution.In the absence of covariate information, we cannot model the sampling process.If, for example, the probability for a case to be sampled correlates positively with chain length, we might end up underestimating the length of shorter chains and the number of singletons, and consequently, overestimating the contribution of the chains we found to the epidemic.Nevertheless, the results we presented provide good evidence of chain growth, and that phenomenon alone warrants attention.
Determining which partition among the six proposed provides the most accurate representation of transmission chains in the sample is difficult.The choice depends ultimately on our confidence in the assumptions of each approach, and on substantive knowledge.The agreement between estimates from the conventional approaches, although explained in great part by shared assumptions, is still a good sign.The moderately different partitions proposed by the Gap Procedure and DM-PhyClus are not erroneous: they result from the way the two methods define a partition of interest.The two approaches also have additional aims and benefits.[10] designed the Gap Procedure with scalability in mind, and [9] formulated DM-PhyClus in such a way that it could offer a straightforward measure of uncertainty for its estimates.
The existence of long transmission chains is not only a feature of transmission of HIV-1 among MSMs in Montreal: it has been observed across Europe and other regions of North America as well [22,[37][38][39].The increasing size of sequence databases represents a considerable computational challenge, especially in the Bayesian framework, and so, scalability should be an essential feature of future partitioning algorithms.We contend that methods that avoid cutpoint selection altogether are promising, and would benefit from further improvements.In addition to lightening their computational burden, adapting them to use time-stamp and covariate data, for example, would be a welcome extension.Further, methods designed to provide a clear measure of uncertainty for estimated partitions, like DM-PhyClus, would warrant more attention.Indeed, the strength of co-clustering between sequences within an inferred chain may vary sizeably, and the separation between some chains may not be very clear-cut.

Fig 1 .
Fig 1.Heat map indicating overlap between estimates produced by the seven methods considered.We excluded all sequences that were found to be singletons by the seven methods we used, leaving us with 2,938 sequences.We gave each remaining sequence a numerical label between 1 and 2,938.The colour of each pixel indicates the number of methods for which the two sequences identified by the x and y axes co-clustered.

Fig 2 .
Fig 2. Bar plot showing the breakdown in membership for the 30 longest chains in the ML + PhyloPart estimate.The labels at the end of each bar indicate the sequence collection dates for the first and last "recent" PHIs in the chain, that is, recorded on or after July 1st, 2012.When there is only one such PHI, we display the corresponding collection date instead.We assume that all chronic cases recorded before July 1st, 2012 were infected prior to 2012.The dark red bar represents the "minimum chain size before 2012" because several chronic cases diagnosed after July 1st 2012 were probably infected prior to 2012.Also, it is likely that several PHIs sampled in the first half of 2012 match with transmission events that occurred late in 2011.

Fig 3 .
Fig 3. Bar plot showing the breakdown in membership for the 30 largest chains in the DM-PhyClus + MAP estimate.The labels at the end of each bar indicate the sequence collection dates for the first and last "recent" PHIs in the chain, that is, recorded on or after July 1st, 2012.When there is only one such PHI, we display the corresponding collection date instead.We assume that all chronic cases recorded before July 1st, 2012 were infected prior to 2012.The dark red bar represents the "minimum chain size before 2012" because several chronic cases diagnosed after July 1st 2012 were probably infected prior to 2012.Also, it is likely that several PHIs sampled in the first half of 2012 match with transmission events that occurred late in 2011.

Table 1 . Adjusted Rand index for the overlap between the chain estimates obtained from the different methods
. CP stands for ClusterPicker. https://doi.org/10.1371/journal.pone.0213366.t001