Quantifying prevalence and risk factors of HIV multiple infection in Uganda from population-based deep-sequence data

  • Michael A. Martin,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Pathology, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America

  • Andrea Brizzi,

    Roles Data curation, Formal analysis, Methodology, Writing – review & editing

    Affiliation Department of Mathematics, Imperial College London, London, United Kingdom

  • Xiaoyue Xi,

    Roles Data curation, Formal analysis, Methodology, Writing – review & editing

    Affiliations Department of Mathematics, Imperial College London, London, United Kingdom, Medical Research Council Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom

  • Ronald Moses Galiwango,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliation Rakai Health Sciences Program, Kalisizo, Uganda

  • Sikhulile Moyo,

    Roles Data curation, Funding acquisition, Project administration, Writing – review & editing

    Affiliations Botswana Harvard AIDS Institute Partnership, Botswana Harvard HIV Reference Laboratory, Gaborone, Botswana, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America

  • Deogratius Ssemwanga,

    Roles Investigation, Writing – review & editing

    Affiliations Medical Research Council/Uganda Virus Research Institute and London School of Hygiene and Tropical Medicine Uganda Research Unit, Entebbe, Uganda, Uganda Virus Research Institute, Entebbe, Uganda

  • Alexandra Blenkinsop,

    Roles Conceptualization, Formal analysis, Methodology, Writing – review & editing

    Affiliation Department of Mathematics, Imperial College London, London, United Kingdom

  • Andrew D. Redd,

    Roles Funding acquisition, Investigation, Writing – review & editing

    Affiliations Department of Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, United States of America, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa

  • Lucie Abeler-Dörner,

    Roles Funding acquisition, Project administration, Writing – review & editing

    Affiliation Pandemic Sciences Institute, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom

  • Christophe Fraser,

    Roles Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing

    Affiliation Pandemic Sciences Institute, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom

  • Steven J. Reynolds,

    Roles Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing

    Affiliations Rakai Health Sciences Program, Kalisizo, Uganda, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, United States of America

  • Thomas C. Quinn,

    Roles Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing – review & editing

    Affiliations Rakai Health Sciences Program, Kalisizo, Uganda, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, United States of America

  • Joseph Kagaayi,

    Roles Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing – review & editing

    Affiliations Rakai Health Sciences Program, Kalisizo, Uganda, Makerere University School of Public Health, Kampala, Uganda

  • David Bonsall,

    Roles Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Writing – review & editing

    Affiliation Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom

  • David Serwadda,

    Roles Funding acquisition, Project administration, Writing – review & editing

    Affiliation Rakai Health Sciences Program, Kalisizo, Uganda

  • Gertrude Nakigozi,

    Roles Data curation, Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing

    Affiliation Rakai Health Sciences Program, Kalisizo, Uganda

  • Godfrey Kigozi,

    Roles Data curation, Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing

    Affiliation Rakai Health Sciences Program, Kalisizo, Uganda

  • M. Kate Grabowski,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    mmart108@jhmi.edu (MAM); mgrabow2@jhu.edu (MKG); oliver.ratmann@imperial.ac.uk (OR)

    Affiliations Department of Pathology, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America, Rakai Health Sciences Program, Kalisizo, Uganda, Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America

  • Oliver Ratmann,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Mathematics, Imperial College London, London, United Kingdom

  • with the PANGEA-HIV Consortium and the Rakai Health Sciences Program

Abstract

People living with HIV can acquire secondary infections through a process called superinfection, giving rise to simultaneous infection with genetically distinct variants (multiple infection). Multiple infection provides the necessary conditions for the generation of novel recombinant forms of HIV and may worsen clinical outcomes and increase the rate of transmission to HIV seronegative sexual partners. To date, studies of HIV multiple infection have relied on insensitive bulk-sequencing, labor-intensive single genome amplification protocols, or deep-sequencing of short genome regions. Here, we identified multiple infections in whole-genome or near whole-genome HIV RNA deep-sequence data generated from plasma samples of 2,029 people living with viremic HIV who participated in the population-based Rakai Community Cohort Study (RCCS). We estimated individual- and population-level probabilities of being multiply infected and assessed epidemiological risk factors using the novel Bayesian deep-phylogenetic multiple infection model (deep-phyloMI), which accounts for bias due to partial sequencing success and false-negative and false-positive detection rates. We estimated that between 2010 and 2020, 4.09% (95% highest posterior density interval (HPD) 2.95%–5.45%) of RCCS participants with viremic HIV harbored a multiple infection at time of sampling. Participants living in high-HIV prevalence communities along Lake Victoria were 2.33-fold (95% HPD 1.3–3.7) more likely to harbor a multiple infection compared to individuals in lower prevalence neighboring communities. This work introduces a high-throughput surveillance framework for identifying people with multiple HIV infections and quantifying population-level prevalence and risk factors of multiple infection for clinical and epidemiological investigations.

Author summary

HIV exists as a population of genetically distinct viral variants among people living with HIV. People living with HIV can be infected with multiple genetically distinct variants. Identification of these mixed infections requires generating viral genomic data from people living with HIV. In the past, the approaches used to identify multiple infections from viral genomic data have had poor sensitivity or required labor-intensive protocols that are prohibitive in application to large data sets. Prior work has also only utilized data generated from small portions of the viral genome, and the statistical procedures used to generate population-level estimates from sequencing data generated from individual infections have not accounted for incomplete sampling of the within-host viral population or sources of sequencing error, which may confound multiple infection estimates. Here, we develop a statistical model that addresses these limitations and allows for the identification of multiple infections and the estimation of population-level risk of multiple infection from deep-sequence data. We fit this model to population-based HIV genomic data from people living with HIV in southern Uganda and estimate that approximately 4% of viremic participants harbor a multiple infection at a given point in time. We show that the prevalence of multiple infections is higher in key populations with high HIV prevalence. These findings inform our understanding of the sexual risk networks that give rise to multiple infections and aid in efforts to model HIV epidemiological dynamics and evolution during a period of incidence declines and shifting transmission dynamics across Eastern and Southern Africa.

1. Introduction

Simultaneous infection with multiple distinct variants of human immunodeficiency virus (HIV) can occur through a process called superinfection following secondary exposure to infected bodily fluids [1]. Following acquisition, infecting variants are shaped by within-individual evolutionary processes and can either stably coexist or undergo competitive exclusion [2,3]. Superinfection of people living with HIV (PLHIV) has important implications for the evolution, pathogenesis, and spread of HIV. Specifically, it provides the necessary conditions for the generation of novel recombinant viruses [4,5], which fuels diversification of the circulating viral population [6,7], complicates vaccine development efforts through the generation of novel epitopes [8,9], and potentially leads to the evolution of more transmissible viral genotypes [10]. Acquisition of superinfections may also increase the breadth and strength of the antibody response to HIV infection [11–13], potentially aiding in the identification of broadly neutralizing antibodies [14]. Finally, multiple infections may themselves lead to faster disease progression [15–17] and higher viral load [16,17], thereby potentially also increasing the risk of onward transmission [18,19]. While the availability of viral genome sequence data has allowed for the identification of HIV multiple infections across a range of epidemiological contexts [20], prevalence estimates have generally been based on relatively small sample sizes with only partial genome data. Here, we identify HIV multiple infections using within-host deep-sequence phylogenetic trees inferred across the genome from a population-based surveillance cohort.

To date, viral sequence-based methods to identify HIV multiple infections have generally relied on one of three approaches. First, bulk sequencing (e.g. Sanger sequencing or consensus sequence estimation from deep-sequence data) can reveal instances where the majority viral variant changes between baseline and follow-up visits under longitudinal sampling or cases where the within-person viral population at a specific visit harbors abnormal levels of diversity [16,21–23]. While this approach proved useful prior to the availability of deep-sequencing technologies, it has a sensitivity of only ∼5% to detect variants present in ≤20% of the viral population within a sample [24]. Alternatively, single genome amplification (SGA) relies on serial dilutions to isolate a single molecule of transcribed viral cDNA prior to amplification and sequencing [25–27]. This approach is more sensitive in detecting minor variants than bulk sequencing and was considered the “gold standard” [28], but is labor intensive and difficult to apply at scale. Amplicon deep-sequencing of discrete regions of the HIV genome is able to achieve high sensitivity while being highly scalable to large sets of samples and has therefore been broadly applied to study multiple infections in larger studies [2,28–31].

Despite advancements in viral sequence-based identification of HIV multiple infections, existing approaches share shortcomings that hinder the interpretation of the results they generate. Critically, all of these methods rely on sequence data generated from only a subset of the genome, due in part to historical challenges in generating whole-genome HIV sequence data. For example, general population-based studies in Rakai, Uganda have previously utilized sequence data from 390 base pairs (bp) and 324 bp of the p24 (gag) and gp41 (env) regions, representing only 7.3% of the HIV genome. This inherently limits sensitivity to identify multiple infection with viral variants that are highly related within these short regions. Analysis of gag sequence data sampled from high-risk Kenyan women revealed cases of superinfection that were unidentified when querying only the env region [32]. Further, limited consideration has been given to the fact that factors that affect sequencing success of biological samples [33] may also affect the detection probability of multiple viral variants and may therefore confound prevalence estimates and assessment of multiple infection risk factors. Finally, existing methods generally use binary categorization of samples as either multiply or singly infected. They do not quantify uncertainty in individual-level assignments and do not account for this uncertainty when estimating population-level prevalence. With the advent of approaches that can generate near whole-genome HIV deep-sequence data [33,34], there is a need for statistical approaches that can integrate data from across the genome to robustly identify multiple infections while accounting for the various sources of bias that can obscure the underlying biological signal.

Here, we identify individuals that are likely to have an HIV multiple infection at the time of sampling, provide minimum estimates of the prevalence of HIV multiple infections in Rakai, Uganda between January 2010 and November 2020, and characterize risk factors for harboring a multiple infection based on HIV RNA deep-sequence data obtained from plasma samples of 2,029 people living with viremic HIV aged 15–49 who participated in the longitudinal, population-based Rakai Community Cohort Study (RCCS) [35,36]. These estimates reflect multiple infections present at time of sampling in plasma and, because infecting variants may be lost over time due to within-host evolutionary processes [2,3], should be interpreted as the minimum prevalence of people who have ever been multiply infected. Rakai District is located in south-central Uganda, East Africa, bordering Lake Victoria, and is one of the areas with highest HIV-prevalence globally [37]. To support these inferences, we developed a novel Bayesian statistical model to identify multiple infections using within-host phylogenetic trees inferred from deep-sequence data generated from across the HIV genome, which we call the deep-phylo multiple infection model (deep-phyloMI). Phyloscanner [38], which analyzes within-host pathogen diversity from deep-sequencing reads, was used to infer within-host phylogenetic trees across the HIV genome, remove contaminant sequences, and identify regions of the genome with evidence of multiple infecting variants. Our model simultaneously estimates individual- and population-level risks of harboring a multiple infection from processed phyloscanner output after accounting for incomplete sequencing of the viral population within a sample and false-negative and false-positive rates of multiple variant identification. We validated model performance on simulated data and used it to identify multiple infections in RCCS participants over a period of declining incidence and rapidly shifting transmission dynamics [35,39].

2. Materials and methods

2.1. Ethics statement

All participants provided written informed consent for the study. Written assent and written parental consent were obtained for participants less than 18 years of age. The RCCS is administered by the Rakai Health Sciences Program (RHSP) and has received ethical approval from the Uganda Virus Research Institute’s Research and Ethics Committee (GC/127/08/12/137), the Uganda National Council for Science and Technology (HS450), and the Johns Hopkins School of Medicine (IRB00217467).

2.2. Study design and participants

The RCCS conducts population-based surveys every 18–24 months in agrarian, semi-urban trading, and Lake Victoria fishing communities in southern Uganda. Data in this study were collected over six RCCS survey rounds conducted between January 2010 and November 2020. As survey rounds occurred over more than a year, we herein refer to them by the median interview date. Communities that participated in the RCCS were categorized based on their geographic setting and primary economic activity (inland communities: agrarian/trading, Lake Victoria communities: fishing). These communities differ considerably in their HIV burden (HIV prevalence of  ∼ 14% [agrarian],  ∼ 17% [trading], and  ∼ 42% [fishing]) [36]. At each survey round, households were censused and all residents aged 15–49 who were able to provide consent (assent for those under 18) were invited to participate in a survey. Survey participants were eligible to participate exactly once in each survey round (“participant-visits”). As part of the survey, participants completed a detailed structured sociodemographic, behavioral, and health questionnaire. Specifically, participants were asked to self-report their sex, age, residency status (e.g. recent migration into a community), circumcision status (among males), occupation, occupation of sex partners in the year prior to the survey, and number of lifetime sex partners. As HIV is more prevalent among female sex and bar/restaurant workers [40,41], we generated a composite variable indicating reported sex or bar/restaurant work among women and sex with a sex or bar/restaurant worker among men to determine if these individuals were at higher risk of being multiply infected.

To account for the fact that the number of lifetime sex partners increases over the lifespan, we calculated the mean number of lifetime sex partners within population strata (s) defined by HIV serostatus, sex, age category in five-year bins, and community type (inland/fishing) to allow for standardization of the observed responses. Responses of no lifetime sex partners were treated as missing data as HIV transmission in this setting is predominantly heterosexual [42] and we therefore expected these individuals to have had at least one sexual encounter in order to acquire HIV, although we cannot rule out perinatal transmission with available data. When calculating the stratum means, missing data were imputed to the mean value of a lognormal distribution fit to all numeric responses of ≥1 lifetime sex partner within strata defined by HIV serostatus, sex, age category, and community type. Additionally, some RCCS participants provided categorical responses (“1–2” or “3+” lifetime sex partners). To calculate the stratum means, we first imputed these values to a numeric response. Responses of “1–2” were imputed to the mean response among PLHIV reporting either one or two lifetime sex partners within strata. Similarly, responses of “3+” were imputed to the mean value of a lognormal distribution fit to all numeric responses of ≥3 lifetime partners within strata as above.
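As a rough illustration of the imputation step, the sketch below fits a lognormal distribution to the numeric responses in one stratum and returns its mean, exp(μ + σ²/2). The helper name, the choice of maximum-likelihood fitting on the log scale, and the example responses are assumptions made for illustration; the fitting procedure is not specified in the text.

```python
import math

def lognormal_mean(responses, minimum=1):
    """Mean of a lognormal distribution fit to responses >= minimum.

    Hypothetical helper: the fit is by maximum likelihood on the log
    scale, which is an assumption rather than the documented procedure.
    """
    logs = [math.log(x) for x in responses if x >= minimum]
    if not logs:
        raise ValueError("no responses at or above the minimum")
    mu = sum(logs) / len(logs)
    # The maximum-likelihood variance uses the 1/n estimator.
    sigma2 = sum((v - mu) ** 2 for v in logs) / len(logs)
    return math.exp(mu + sigma2 / 2)

# Made-up numeric responses from a single stratum.
stratum = [1, 2, 2, 3, 4, 5, 8, 10]
impute_missing = lognormal_mean(stratum, minimum=1)  # value for missing responses
impute_3_plus = lognormal_mean(stratum, minimum=3)   # value for "3+" responses
```

Restricting the fit to responses of at least three partners shifts the imputed value upward, which is the intended behavior for the “3+” category.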

In addition to completing the survey questionnaire, participants provided venous blood samples for HIV testing, viral load quantification, and viral deep sequencing. HIV serostatus was evaluated using a validated rapid test algorithm [43]. HIV viral load quantification was conducted using the Abbott real-time m2000 assay (Abbott Laboratories).

2.3. HIV deep sequencing and bioinformatic processing

HIV RNA deep-sequence data from plasma samples contributed by RCCS participants were generated through the Phylogenetics and Networks for Generalized HIV Epidemics in Africa consortium (PANGEA-HIV) [44–46]. The study sample included RCCS participants with HIV who were viremic (≥1,000 copies/mL) at one of their study visits between January 2010 and November 2020. To avoid biasing our inferences, for individuals who participated in multiple survey rounds we used in our analyses of multiple infections only the data from the sample with the highest genome coverage or, in the case of ties, the highest viral load. The study sample was further restricted to individuals in putative transmission networks and excluded individuals for whom another phylogenetically close individual could not be identified over the entire study period [39]. All available sequence data for individuals in putative transmission networks were included in phylogenetic analyses.
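The one-sample-per-participant rule can be sketched as follows. The record layout (keys `participant`, `visit`, `coverage`, `viral_load`) and the function name are hypothetical and chosen only for illustration.

```python
def select_samples(samples):
    """Keep one sample per participant.

    The sample with the highest genome coverage is kept; ties are
    broken by the highest viral load. Keys are hypothetical.
    """
    best = {}
    for sample in samples:
        rank = (sample["coverage"], sample["viral_load"])
        current = best.get(sample["participant"])
        if current is None or rank > (current["coverage"], current["viral_load"]):
            best[sample["participant"]] = sample
    return best

# Made-up records: participant B has tied coverage across visits.
samples = [
    {"participant": "A", "visit": 1, "coverage": 0.80, "viral_load": 5_000},
    {"participant": "A", "visit": 2, "coverage": 0.95, "viral_load": 2_000},
    {"participant": "B", "visit": 1, "coverage": 0.90, "viral_load": 1_500},
    {"participant": "B", "visit": 2, "coverage": 0.90, "viral_load": 40_000},
]
selected = select_samples(samples)
```

Comparing the tuple (coverage, viral load) lexicographically encodes the tie-break in a single comparison.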

Deep-sequencing was performed with two protocols (S1 Table), as previously described [39]. Briefly, for sequence data generated through the amplicon protocol, viral RNA was extracted from plasma samples on the QIAsymphony SP workstation with the QIAsymphony DSP Virus/Pathogen Kit. cDNA was generated through a one-step reverse transcription PCR protocol using universal HIV-1 primers designed to generate four overlapping amplicons across the HIV-1 genome [34]. Deep-sequencing was conducted at the Wellcome Trust Sanger Institute core facility using the Illumina MiSeq and HiSeq platforms. To generate sequence data using the bait-capture protocol, viral RNA was similarly extracted using the QIAsymphony DSP Virus/Pathogen Kit followed by library preparation according to the veSEQ-HIV protocol [33]. Library preparation was performed using the SMARTer Stranded Total RNA-Seq v2-PicoInputMammalian (Clontech, TakaRaBio) kit and double-stranded dual-indexed cDNA was generated using in-house indexed primers. Libraries were pooled and cleaned with Agencourt AMPure XP. Pooled libraries were hybridized to HIV-specific biotinylated 120-mer oligonucleotides (xGen Lockdown Probes, Integrated DNA Technologies) and isolated with streptavidin-conjugated beads. Captured libraries were PCR amplified prior to generation of 350–600 base pair (bp) paired-end reads with the Illumina NovaSeq 6000 at the Oxford Genomic Centre.

Kraken v.0.10.5-beta [47] with a custom database of human, bacterial, archaeal, viral, and fungal genomes was used to isolate reads of viral and unknown origin, which were trimmed of adaptors and low-quality bases using trimmomatic [48] v.0.36/0.39. Trimmed reads were de novo assembled into contigs using SPAdes [49] and metaSPAdes [50] v.3.10. Shiver v.1.5.7 [51] was used to align reads to a reference sequence constructed for each sample using these contigs.

2.4. Inference of within-host deep-sequence phylogenetic trees

To improve the computational efficiency of our within-host deep-sequence phylogenetic analyses we first clustered participants with HIV into putative transmission networks as previously described (S1 File) [39,52], and then grouped putative networks into batches for deep-sequence phylogenetic analyses.

Deep-sequence data belonging to participants in each batch were further processed with phyloscanner [38] v.1.8.1 to infer within-host phylogenetic trees in 287 sliding windows of length 250 bp with a step size of 25 across the HIV genome as in [39]. As suggested in [38], this window-size was chosen to be long enough to capture sufficient within-host diversity to provide phylogenetic signal but no longer than the target read length and short enough to minimize within-window recombination. Windows spanning env gp120 were excluded as genetic diversity in the variable loop regions [53] led to poor sequence alignment and unreliable within-host phylogenetic trees. In addition to deep-sequence data from RCCS participants, we included as phylogenetic background 113 consensus sequences from representative subtypes and circulating forms and 200 near full-length consensus sequences from Kenya, Uganda, and Tanzania (Los Alamos National Laboratory HIV Sequence Database, http://www.hiv.lanl.gov, S2 File). Within phyloscanner, MAFFT v.7.475 [54] with iterative refinement and iterative re-alignment using consistency scores was used to align sequencing reads and IQ-TREE v.2.0.3 with the GTR+F+R(Free-Rate)6 substitution model was used for phylogenetic inference [55,56]. Phylogenetic branch lengths within phyloscanner were adjusted to account for varying substitution rates across the HIV genome as described in [57] (S3 File). Adjusted distances can be interpreted as average distances expected in the pol gene. The genomic coordinates of input sequence data were standardized to the coordinates of the HIV-1 HXB2 reference genome (GenBank: K03455.1).

For each participant, phyloscanner was used to estimate the number of genetically distinct phylogenetic lineages (subgraphs) in each genome window using a modified parsimony algorithm. In each window, for each participant, the given phylogenetic tree was pruned to include only tips from the given participant and the specified outgroup (here, the subtype H consensus sequence). Ancestral nodes in the pruned tree were assigned to one of two states: either that of the participant or an unsampled “unassigned” state (to which the outgroup and root of the phylogeny were assigned), representing the lineages that are evolutionarily ancestral to the lineages that initiated a given host’s infection. To accurately assign nodes without relying on patterns of phylogenetic clustering with reference sequences, we employed a modified Sankoff minimum parsimony algorithm for ancestral state reconstruction as described in [38,58] (in particular, see Supplementary Information 1.2 and Supplementary Fig 1 in [38]). This algorithm assigns a cost (c(n, h)) to a state change along a lineage ending at ancestral node n that is proportional to the sum of the branch lengths descendant from that node that give rise to tips from host h (l(n, h)). As tips from all other subjects with the exception of the outgroup were pruned from the tree prior to this procedure (“single-host tree”), this is equivalent to the sum of the total branch length of the subtree with node n as its root. Specifically, this cost was calculated as:

(1)

where k is a tuneable constant that controls the penalty associated with fewer host h subgraphs. Traditional parsimony is recovered when k = 0, which will always assign all tips in a single-host tree to a single subgraph, regardless of the phylogenetic branch length captured within that subgraph. As k → ∞, each tip belonging to host h will be assigned to a unique subgraph. Here, we parameterized k with the goal of distinguishing evolution that occurred within a given host from evolution that occurred prior to HIV acquisition, in the case of multiple infection. In the case of single infection, all tips in a single-host tree will be closely related (e.g. Fig 1A) and therefore we want the ancestral reconstruction that minimizes c(n, h) to assign all tips to a single subgraph. In the case of a multiple infection the tips will be expected to fall into ≥2 clades with relatively small within-subgraph distances but large between-subgraph distances, and we seek to parameterize k such that the ancestral reconstruction minimizing c(n, h) differentiates these clades into distinct subgraphs. We conservatively used a k value of 15, which corresponds to a distance threshold greater than the 99th percentile of the pairwise genetic distances between epidemiologically confirmed HIV transmission pairs [57] and comparable to within-subtype HIV genetic diversity within Rakai [7].

Quality filtering of inferred within-participant phylogenetic trees was performed with phyloscanner. Specifically, within each window, subgraphs with fewer than three reads or less than 1% of reads from a particular participant were marked as putative contaminants and removed from the analysis. To mask regions with insufficient data for reliable phylogenetic inference, any window with fewer than 30 reads from a given participant after the aforementioned filter was also removed from the analysis. After filtering, we identified the subgraphs with data from the deep-sequenced reads from each sequenced sample for a given participant.
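The filtering thresholds can be illustrated with a short sketch that operates on per-subgraph read counts for one participant in one window. This is not phyloscanner code: the function name and input format are hypothetical, and computing the 1% threshold against the participant's pre-filter read total in the window is an assumption.

```python
MIN_SUBGRAPH_READS = 3        # subgraphs with fewer reads are dropped
MIN_SUBGRAPH_FRACTION = 0.01  # as are subgraphs below 1% of the participant's reads
MIN_WINDOW_READS = 30         # windows with fewer remaining reads are masked

def filter_window(subgraph_reads):
    """Return read counts of retained subgraphs, or [] if the window is masked.

    `subgraph_reads` lists the read count of each subgraph for one
    participant in one genome window (hypothetical input format).
    """
    total = sum(subgraph_reads)  # assumed denominator: pre-filter total
    kept = [
        reads
        for reads in subgraph_reads
        if reads >= MIN_SUBGRAPH_READS and reads >= MIN_SUBGRAPH_FRACTION * total
    ]
    if sum(kept) < MIN_WINDOW_READS:
        return []
    return kept

# Made-up windows: two subgraphs survive in the first, one in the second,
# and the third is masked for insufficient reads.
window_a = filter_window([500, 40, 2])
window_b = filter_window([1000, 5])
window_c = filter_window([20, 2])
has_multiple_subgraphs = len(window_a) > 1
```

Under this sketch, a window contributes evidence of multiple subgraphs only when more than one subgraph survives both filters.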

2.5. Bayesian model to identify multiple infections

We developed a Bayesian statistical model to identify samples harboring multiple infections and estimate the prevalence of multiple infections in a set of deep-sequencing reads that were processed with phyloscanner. We refer to this model as the deep-phylo multiple infection model (deep-phyloMI). We first summarized the phyloscanner output for each sample i = 1, ..., n, where n is the number of sequenced samples, and each genomic window w in terms of two binary variables: the presence/absence of sequencing reads from sample i in window w following phyloscanner contamination filtering, and the presence/absence of multiple subgraphs for sample i in window w. To simplify notation below, when no reads were present for a sample in a window we set the multiple-subgraph indicator for that window to zero. We further summarized the data for sample i into two quantities obtained by summing each binary variable over all genome windows: the number of windows with sequencing reads and the number of windows with multiple subgraphs.

2.5.1. Base model accounting for partial sequencing success of infecting variants.

We first developed a base model that accounts for partial sequencing success across the HIV genome in giving rise to the two observed summary quantities. Working from first principles, we first derived a likelihood model for observing this pair of counts for the unobserved groups of samples with true multiple infection and true single infection, and subsequently marginalised out the unknown true multiple infection status. Among samples from multiply infected individuals, we assumed that each of the infecting variants was sequenced in window w with a probability specific to each sample i. The probability of sequencing at least one variant in each window, and the probability of sequencing both variants given that at least one was sequenced, follow from this per-variant sequencing probability. Assuming sequencing success was independently and identically distributed for each sample, we obtained

(2a)(2b)

where the 0-truncated Binomial distribution is used because we only consider data from individuals with phyloscanner output in at least one genomic window. This model implicitly accounts for the presence of windows in which only a single variant was present in the phyloscanner output due to incomplete sequencing success. For samples from individuals infected with only a single variant, we obtained analogously

(3a)(3b)
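For orientation, the relationships described in words for Eqs. 2 and 3 can be written out as follows. The notation is assumed for illustration and may not match the original equations: N_i is the number of windows with reads for sample i, M_i the number of those windows with multiple subgraphs, Z_i the latent multiple infection status, θ_i the per-variant sequencing probability, and W the total number of genomic windows. The sketch further assumes exactly two infecting variants that are sequenced independently of one another.

```latex
\begin{align*}
\Pr(\text{at least one variant sequenced in a window}) &= 1 - (1-\theta_i)^2, \\
\Pr(\text{both sequenced} \mid \text{at least one sequenced}) &= \frac{\theta_i^2}{1 - (1-\theta_i)^2}, \\
N_i \mid Z_i = 1 &\sim \text{Binomial}_{>0}\!\left(W,\; 1-(1-\theta_i)^2\right), \\
M_i \mid N_i,\, Z_i = 1 &\sim \text{Binomial}\!\left(N_i,\; \frac{\theta_i^2}{1-(1-\theta_i)^2}\right), \\
N_i \mid Z_i = 0 &\sim \text{Binomial}_{>0}\!\left(W,\; \theta_i\right), \\
M_i \mid N_i,\, Z_i = 0 &= 0 .
\end{align*}
```

Here Binomial with subscript >0 denotes the 0-truncated Binomial distribution, and the last line reflects that the base model allows no windows with multiple subgraphs in a singly infected sample.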

Taken together, the joint likelihood of observing the count pair (, ) conditional on latent multiple infection status is given by

(4)

Thus, aggregating over the two unknown possible multiple infection states for each sample in a finite mixture model framework, we have

(5)
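A minimal numerical sketch of such a two-component mixture is given below, under the same assumptions as the base model described above (two independently sequenced variants, no false positives or false negatives). The function and argument names are hypothetical, and `delta` stands in for the probability that a sample is multiply infected; none of this is the authors' implementation.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def zero_truncated_binom_pmf(k, n, p):
    """Binomial pmf conditional on at least one success."""
    if k < 1:
        return 0.0
    return binom_pmf(k, n, p) / (1 - (1 - p) ** n)

def component_likelihoods(n_obs, m_obs, n_windows, theta):
    """Likelihood of the count pair under multiple and single infection."""
    p_any = 1 - (1 - theta) ** 2   # at least one of two variants sequenced
    p_both = theta**2 / p_any      # both sequenced, given at least one
    multi = zero_truncated_binom_pmf(n_obs, n_windows, p_any) * binom_pmf(m_obs, n_obs, p_both)
    # A single infection cannot yield windows with multiple subgraphs here.
    single = zero_truncated_binom_pmf(n_obs, n_windows, theta) * (1.0 if m_obs == 0 else 0.0)
    return multi, single

def mixture_likelihood(n_obs, m_obs, n_windows, theta, delta):
    """Marginal likelihood after summing over the latent infection status."""
    multi, single = component_likelihoods(n_obs, m_obs, n_windows, theta)
    return delta * multi + (1 - delta) * single

def posterior_multiple(n_obs, m_obs, n_windows, theta, delta):
    """Probability of multiple infection given the observed count pair."""
    multi, single = component_likelihoods(n_obs, m_obs, n_windows, theta)
    return delta * multi / (delta * multi + (1 - delta) * single)
```

In this base-model sketch, any sample with at least one multiple-subgraph window is assigned to the multiple infection component with certainty, which illustrates why false-positive observations need separate treatment.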

One of our primary inferential targets was the individual-level probability of harboring multiple infection, not conditional on the observed count pair. Making this target explicit in the joint likelihood, we have

(6a)(6b)(6c)

and so the log posterior distribution of the model parameters for all the data under our model is

(7)

where we use f to denote posterior and prior densities.

2.5.2. Base model prior densities.

In the base model, prior to observing data, we modelled the individual-level probability of multiple infection as identical for all i with the prior density,

(8)

with diffuse variance [59]. Given the known log-linear dependency of sequencing success on log viral load [33], known differences in sequencing success rates by sampling protocol [39], and other factors, we specified the prior on the individual-level sequencing probability through a logistic mixed effects model. Specifically, we modeled with

(9a)(9b)(9c)(9d)(9e)(9f)(9g)(9h)

where and are indicator variables for whether sample i was sequenced using the amplicon or bait capture approach respectively and , , and are the sample log10 copies/mL values standardized to have mean zero and standard deviation 1 among all samples and among only the amplicon () and bait capture () samples, respectively. To maintain identifiability we constrain  +   +  by specifying their joint prior distributions with a zero-mean multivariate normal with a particular variance-covariance matrix described in [59], such that all marginal distributions are standard normal, e.g. and , which we represent with the notation stz-MVN. To maintain marginal priors with standard deviation , we adopt a non-centered parameterisation and post-multiply the sum-to-zero random variables with . Finally, denotes an individual-level random effect.
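
One simple way to obtain effects that sum to zero with standard normal marginals, scaled afterwards in a non-centered fashion, is sketched below in Python. This is an assumption-laden illustration: it centres independent standard normal draws and rescales them, which yields the stated properties, but the variance-covariance matrix described in [59] may be parameterised differently. The function name `sum_to_zero_normal` and the values `k = 3` and `sigma = 0.5` are hypothetical.

```python
import random
from math import sqrt


def sum_to_zero_normal(k, sigma, rng):
    """Draw k effects that sum to zero, each with marginal N(0, sigma^2).

    Centring k iid N(0, 1) draws gives components with variance 1 - 1/k,
    so multiplying by sqrt(k / (k - 1)) restores unit marginal variance.
    The final multiplication by sigma mirrors a non-centered parameterisation.
    Requires k >= 2.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    mean_z = sum(z) / k
    scale = sqrt(k / (k - 1))
    return [sigma * scale * (zi - mean_z) for zi in z]


rng = random.Random(1)
effects = sum_to_zero_normal(3, 0.5, rng)
```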

2.5.3. Modelling false-negative and false-positive phylogenetic observations.

We extended the base model to account for possible false-negative and false-positive phylogenetic observations, accounting for incomplete removal of false-positive observations through phyloscanner, and/or incomplete phylogenetic identification of multiple infections due to insufficient phylogenetic background. First, among samples from individuals in which we accounted for the scenario in which both variants are successfully sequenced in a given window but were identified as a single phylogenetic clade by phyloscanner, i.e. false-negative observations, by modifying our data-generating model to

(10)

where λ represents the false-negative rate. We analogously accounted for the scenario in which only a single variant was sequenced but phyloscanner spuriously assigned multiple subgraphs in a given window, i.e. false-positive observations, through a false-positive rate ε in the model. We modeled false-positives among samples lacking multiple infection and among windows in multiply infected samples in which only a single variant was sequenced, which occurs with probability when , but was spuriously assigned to two subgraphs. Note that because we did not differentiate between windows with exactly 2 and >2 subgraphs, we do not consider the scenario where both variants are sequenced in a true multiple infection and the two sequenced variants are spuriously assigned to 3 or 4 subgraphs. Our data generating model was updated to account for false-positives and false-negatives as:

(11a)(11b)

with additional prior densities

(12a)(12b)

where [,2.2] indicates that logit(λ) was constrained to be < 2.2, with all other components of the model remaining as above.
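
The effect of the two error rates on a single sequenced window can be sketched in Python as follows. This reflects one reading of the description above and is not the authors' code: `theta`, `lam`, and `eps` are placeholder names for the sequencing probability, false-negative rate, and false-positive rate. The last line converts the upper bound on logit(λ) to the probability scale.

```python
from math import exp


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + exp(-x))


def p_multi_subgraph(theta, lam, eps, multiply_infected):
    """Probability that a sequenced window shows >= 2 subgraphs.

    Single infection: only a false positive (rate eps) can produce this.
    Multiple infection: either both variants are sequenced and not missed
    (probability 1 - lam), or only one variant is sequenced and the window
    is spuriously split into two subgraphs (rate eps).
    """
    if not multiply_infected:
        return eps
    p_any = 1.0 - (1.0 - theta) ** 2
    p_both = theta ** 2 / p_any
    return p_both * (1.0 - lam) + (1.0 - p_both) * eps


# Upper bound on the false-negative rate implied by logit(lambda) < 2.2.
lambda_upper = inv_logit(2.2)
```

On this reading, the constraint logit(λ) < 2.2 corresponds to a false-negative rate below roughly 0.90.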

2.5.4. Estimating risk factors of multiple infection.

We further extended the model described above to model the probability of multiple infection as dependent on potential clinical, behavioral, and/or epidemiological risk factors through a logistic regression approach. Specifically, we modeled the logit of the individual-level multiple infection prior probabilities as a linear predictor of fixed effects,

(13a)(13b)(13c)

where are dimensional row vectors for each of putative multiple infection predictive covariates and are dimensional column vectors of fixed effect coefficients. For all categorical j in with levels, we model the corresponding fixed effects with the sum-to-zero joint multivariate normal prior defined above to maintain identifiability.

We also considered a fixed effects model with Horseshoe-type shrinkage priors [60,61] on the effect sizes to handle correlated individual-level covariates. To maintain desirable sum-to-zero properties, we define a global non-negative shrinkage parameter τ ∈ [0, ∞), and for each categorical j with levels non-negative local shrinkage parameters , and the diagonal matrix . We then specify sum-to-zero shrinkage effects through a joint zero-mean multivariate normal distribution with variance-covariance matrix  −  , such that and the induced marginal distributions of each are , which we refer to . We incorporated the global shrinkage parameter in non-centered parameterisation through post-multiplication as in Eq 9. Therefore, we have:

(14a)(14b)(14c)(14d)(14e)

where we modelled the with t-distributions with 2 degrees of freedom instead of Cauchy distributions to ease numerical sampling.

As above, the number of lifetime sex partners included missing and ambiguous responses (e.g. “3+”), and these values were estimated as additional random variables in the Bayesian inference, assuming they were missing at random within sex, age, and community type, using lognormal prior distributions specific to these strata defined by the non-missing responses as above. Imputed values for missing responses were limited to the range [1,60] and responses of “3+” were limited to the range [3,60].

2.5.5. Parameter estimation.

We estimated joint posterior distributions numerically using Hamiltonian Monte Carlo [62] with the No-U-Turn Sampler [63] implemented in Stan [59] and accessed through cmdStanR v.2.36.0 [64] in R. For all analyses, four independent chains with 2,000 iterations of warm up and 2,000 iterations of sampling were run. A target acceptance rate of 0.8 was used for all analyses with the exception of those that employed shrinkage priors, where a target acceptance rate of 0.95 was used to avoid divergent transitions. Convergence was assessed using the R-hat statistic, bulk and tail effective sample sizes (ESS) for each parameter [65], and visual inspection of trace and pairs plots.
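
To make the idea behind between-chain convergence diagnostics concrete, the Python sketch below computes a basic split R-hat from raw chains. It is a simplified illustration only: it omits the rank normalisation and folding of the improved diagnostic in [65], and the simulated chains `good` and `bad` are hypothetical test inputs, not output from the models in this study.

```python
import random
from math import sqrt


def split_rhat(chains):
    """Basic split R-hat for equal-length chains (no rank normalisation)."""
    half = len(chains[0]) // 2
    halves = []
    for chain in chains:
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    m, n = len(halves), half
    means = [sum(h) / n for h in halves]
    grand_mean = sum(means) / m
    # Between-chain variance of the half-chain means, scaled by n.
    between = n * sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    # Average within-chain sample variance.
    within = sum(
        sum((x - mu) ** 2 for x in h) / (n - 1)
        for h, mu in zip(halves, means)
    ) / m
    var_plus = (n - 1) / n * within + between / n
    return sqrt(var_plus / within)


rng = random.Random(0)
# Four chains sampling the same distribution.
good = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# Four chains where two are stuck around a different mean.
bad = [[rng.gauss(3.0 * (i % 2), 1.0) for _ in range(2000)] for i in range(4)]
```

Chains that explore the same distribution give values close to 1, whereas chains centred on different values inflate the between-chain term and push the statistic well above 1.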

2.5.6. Generated quantities.

Based on the estimated parameter distributions of the models described above, we generated a number of quantities to aid in interpretation of our results.

2.5.6.1. Posterior probabilities of individual-level multiple infection.

We computed the posterior probabilities of individual-level multiple infection directly from Monte Carlo samples of the joint posterior density via

(15)

by taking for each individual i all Monte Carlo samples of the posterior density of , evaluating according to:

(16)

and calculating the expectation across these.
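
The averaging over Monte Carlo samples described above can be sketched in Python. This is a minimal illustration under stated assumptions: each posterior draw is represented by a hypothetical tuple `(delta, lik_multi, lik_single)` holding the prior probability of multiple infection and the likelihood of the individual's data under each infection status, and Bayes' rule is applied per draw before averaging.

```python
def membership_probability(delta, lik_multi, lik_single):
    """P(multiple infection | data, parameters) for one posterior draw."""
    numerator = delta * lik_multi
    return numerator / (numerator + (1.0 - delta) * lik_single)


def posterior_mi_probability(draws):
    """Average the per-draw membership probabilities over all draws."""
    probs = [membership_probability(d, lm, ls) for d, lm, ls in draws]
    return sum(probs) / len(probs)


# Hypothetical draws for one individual: (delta, lik_multi, lik_single).
draws = [(0.05, 0.02, 0.001), (0.06, 0.03, 0.001), (0.04, 0.02, 0.002)]
posterior_probability = posterior_mi_probability(draws)
```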

2.5.6.2. Prevalence of multiple infection in the study sample.

Following from prior work on Bayesian latent class models with covariates [66–71], under the base model the posterior estimate of the prevalence of multiple infection in the study sample is given by:

(17)

where is from the joint posterior density of the model defined by Eqs 8, 9, 11, and 12. In the presence of modeled risk factors, the prevalence of multiple infections in the study sample will vary based on sub-groups s defined by Xrisk. In the case where contains only the covariates used to define s:

(18)

Finally, we estimated the prevalence in a target population (e.g. the entire sample of sequenced viremic RCCS participants) through post-stratification:

(19)

where are the number of sampled individuals in each of the S sub-populations s and are the sub-group specific prevalence estimates from Eq 18.

2.5.6.3. Prevalence and risk ratios of harboring multiple infection associated with epidemiological covariates.

We calculated a posterior estimate for the prevalence risk ratio (PRR) of multiple infections in epidemiological strata s* as compared to strata s as

(20)

In the case where contained additional covariates beyond those used to define s* from s, we estimated a multivariate risk ratio (RR) associated with the covariate(s) that distinguish s* from s by calculating the ratio of the estimated risk of multiple infection for person i as if they belonged to strata s* divided by the risk of multiple infection of the same person i as if they belonged to strata s, while holding all other covariates at their observed values (based on the design matrices and , respectively):

(21)
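
A counterfactual risk ratio of this kind can be sketched in Python for a single person. The sketch is illustrative rather than a reproduction of Eq 21: `beta_draws`, `x_star`, and `x_ref` are hypothetical names and values for posterior draws of the logistic coefficients and for the person's covariate row under strata s* and s. Any averaging over individuals is not shown.

```python
from math import exp
from statistics import median


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + exp(-x))


def risk_ratio_draws(beta_draws, x_star, x_ref):
    """One risk ratio per posterior draw of the coefficient vector.

    x_star and x_ref are the same person's covariate rows, differing only
    in the covariate(s) that distinguish the two strata.
    """
    ratios = []
    for beta in beta_draws:
        eta_star = sum(b * x for b, x in zip(beta, x_star))
        eta_ref = sum(b * x for b, x in zip(beta, x_ref))
        ratios.append(inv_logit(eta_star) / inv_logit(eta_ref))
    return ratios


# Hypothetical draws of (intercept, stratum effect) on the logit scale.
beta_draws = [[-3.0, 0.7], [-2.8, 0.9], [-3.2, 0.5]]
x_ref = [1.0, 0.0]    # person treated as belonging to strata s
x_star = [1.0, 1.0]   # same person treated as belonging to strata s*
rr = risk_ratio_draws(beta_draws, x_star, x_ref)
rr_median = median(rr)
```

Computing the ratio within each draw and summarising afterwards propagates posterior uncertainty in the coefficients into the reported interval.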

2.5.6.4. Post-stratification adjustments.

Finally, because sequence data was not available for all viremic participants with HIV in our study population, we employed post-stratification based on prevalence estimates in epidemiological sub-groups s to estimate the prevalence of multiple infections in the population under study (viremic study participants) [72]. Specifically, we calculated

(22)

where is the estimated population size or estimated relative population size of sub-group s. The population prevalence ratio between two non-overlapping composite sub-groups can therefore be calculated as in Eq 20. We performed post-stratification based on the total number of participant-visits from viremic PLHIV stratified by age ((14, 24], (24, 34], and (34, 49] years), sex, and community type. Because viral load measurements were not routinely conducted for all PLHIV in the 2010 and 2012 survey rounds we calculated population-sizes using only participant-visits in the 2014-2019 survey rounds.
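
Post-stratification amounts to a population-weighted average of sub-group prevalence estimates, which the following Python sketch illustrates. The sub-group labels, prevalence values, and population sizes are hypothetical, and in practice the calculation would be repeated for each posterior draw.

```python
def post_stratify(prevalence_by_group, size_by_group):
    """Population-weighted average of sub-group prevalence estimates."""
    total = sum(size_by_group.values())
    weighted = sum(
        prevalence_by_group[g] * size_by_group[g] for g in size_by_group
    )
    return weighted / total


# Hypothetical sub-group prevalences (one posterior draw) and sizes.
prevalence = {"fishing": 0.08, "inland": 0.03}
population = {"fishing": 100, "inland": 300}
adjusted = post_stratify(prevalence, population)
```

Because only relative sizes matter, the weights can be either estimated population counts or proportions.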

2.6. Simulation study

We used simulations to validate our inference model. For all simulations, we simulated data for genome windows in n = 2,000 samples, which were assigned a normalized log10 viral load () with random draws from a N(0,1) distribution. For all samples, was drawn from a N(0,1) distribution and calculated as  +   +  with and . Under these parameters, we generated three simulated data sets as described below.

2.6.1. Base simulation.

(23a)(23b)(23c)(23d)(23e)

where represents a vector of x repeated n times and  ⊕  represents concatenation of two vectors.
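
A data-generating process along these lines can be sketched in Python. All parameter values below (`n_windows = 29`, `prevalence = 0.05`, `alpha`, `beta`, `sigma`) are hypothetical stand-ins, since the values used in the study are not recoverable from the text here. Discarding samples with no sequenced windows is an assumption made to mirror the zero truncation of the likelihood.

```python
import random
from math import exp


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + exp(-x))


def simulate_dataset(n=2000, n_windows=29, prevalence=0.05,
                     alpha=0.0, beta=1.0, sigma=0.5, seed=1):
    """Simulate (multiply_infected, n_obs, mi_obs) tuples without FP/FN errors."""
    rng = random.Random(seed)
    n_multi = round(prevalence * n)  # first n_multi samples are multiply infected
    data = []
    for i in range(n):
        multiply_infected = i < n_multi
        viral_load = rng.gauss(0.0, 1.0)         # standardized log10 viral load
        individual_effect = rng.gauss(0.0, 1.0)  # per-sample random effect
        theta = inv_logit(alpha + beta * viral_load + sigma * individual_effect)
        n_variants = 2 if multiply_infected else 1
        n_obs = 0   # windows with at least one variant sequenced
        mi_obs = 0  # windows with both variants sequenced
        for _ in range(n_windows):
            n_sequenced = sum(rng.random() < theta for _ in range(n_variants))
            n_obs += int(n_sequenced >= 1)
            mi_obs += int(n_sequenced == 2)
        if n_obs >= 1:  # keep only samples with data in >= 1 window
            data.append((multiply_infected, n_obs, mi_obs))
    return data


data = simulate_dataset()
```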

2.6.2. Full simulation.

(24a)(24b)(24c)(24d)(24e)(24f)(24g)

Additional simulations from this simulation model were generated with all other parameters held constant except (A): , (B): λ = 0.10, 0.20, 0.40, and (C): ε = 0, 0.005, 0.05.

2.6.3. Extended simulation.

(25a)(25b)(25c)(25d)(25e)(25f)(25g)

where represents the entry in the ith row and jth column of the design matrix and shuffle(v) denotes shuffling the elements of v.

2.7. Data analysis and visualization

All data analysis was conducted in R v.4.4.1 [73] using the tidyverse [74] with dplyr v.1.1.4 [75], tibble v.3.2.1 [76], and tidyr v.1.3.1 [77]. Haven v.2.5.4 [78] was used to parse a subset of input data files. Visualization of data and results was done using ggplot2 v.3.5.1 [79] with bayesplot v.1.11.1 [80,81], cowplot v.1.1.3 [82], and patchwork v.1.2.0 [83]. Phylogenetic trees were manipulated and visualized using ape v.5.8 [84], ggtree v.3.12.0 [85–89], phytools v.2.1-1 [90], and tidytree v.0.4.6 [85]. Highest posterior density intervals were calculated with HDInterval v.0.2.4 [91] and convergence statistics were assessed with posterior v.1.6.0 [92]. Preliminary analyses and model fitting were performed using fitdistrplus v.1.1-11 [93].

3. Results

3.1. Phylogenetic signatures of multiple infection in population-based pathogen surveillance

Between 2010 and 2020, 50,967 participants contributed to the RCCS in 109,608 visits over six survey rounds. Overall, 8,841 participants were HIV seropositive and 3,586 were viremic (plasma viral load ≥ 1,000 copies/mL) at one of their visits (S2 and S3 Tables). Of these, 2,029 individuals were sampled between January 2010 and November 2020, had HIV RNA deep-sequence data available at minimum quality criteria for deep-sequence phylogenetic analysis, and were identified as a member of a putative transmission network (Tables 1 and S4 and S1 File). Availability of sequence data among viremic participants was generally higher among men, residents of fishing communities, and participants aged 25-34 years.

We next inferred within-host phylogenies from deep-sequencing reads in twenty-nine 250 bp non-overlapping genomic windows using phyloscanner (S4 File), which captured evolutionary relationships of HIV variants within individual participants. Sequencing coverage varied significantly between samples (median [interquartile range (IQR)]: 5000 [4250] bp, S1A Fig) but was generally higher among bait capture sequenced samples and samples with higher viral load. Across the genome, sequencing success was highest in gag (Figs 1F and S1B), likely due to differential amplification efficiency of the primers used in the amplicon sequencing approach [94].

Table 1. Characteristics of the study sample obtained from population-level HIV deep-sequence surveillance in the Rakai Community Cohort Study, 2010-2020, stratified by availability of deep-sequence data.

https://doi.org/10.1371/journal.ppat.1013065.t001

Fig 1. Empiric phylogenetic multiple infection signatures from 2,029 samples from people with viremic HIV in the Rakai Community Cohort Study, 2010-2020.

(A) Representative within-host phylogenetic tree lacking evidence of multiple phylogenetic subgraphs. (B) Representative within-host phylogenetic tree with two subgraphs as indicated by the green and blue shading of the tips. (C) Distribution of branch length distance between the MRCAs of the two subgraphs with the most sequencing reads in all genome windows with ≥ 2 subgraphs from all samples. Bins are shaded according to the 95th and 50th percentile. Vertical dotted line indicates median value. Binwidth is calculated such that there are approximately 50 bins across the range of observed values. (D) Per-sample number of non-overlapping genome windows with sequence data versus the number of non-overlapping genome windows with multiple subgraphs. Samples with at least one window with multiple subgraphs are shown in purple. Points have been jittered along both the X and Y axes for visual clarity. Dotted line shows modeled prediction in the absence of false-positive or false-negative multiple subgraph windows. Marginal densities are shown at right and above the scatter-plot. (E) Schematic of the HIV genome based on the coordinates from HXB2 (Genbank: K03455.1). (F) Number of samples with sequence data in each of the 29 non-overlapping genome windows. (G) Number of samples with evidence of multiple subgraphs in each of the 29 non-overlapping genome windows.

https://doi.org/10.1371/journal.ppat.1013065.g001

To characterise phylogenetic signatures of multiple infection, we used phyloscanner to identify distinct co-circulating variants among participants with viremic HIV (Materials and methods and Fig 1A and 1B). We tabulated the number () of genome windows in which distinct phylogenetic lineages (phylogenetic subgraphs) were observed. The median genetic distance between the most recent common ancestors of subgraphs in genome windows with multiple subgraphs was 0.19 [IQR: 0.17] substitutions/site (Figs 1C and S2), which is consistent with contemporary circulating genetic diversity within Rakai [7,57]. Empirically, 181 (8.92%) samples had multiple subgraphs in at least one of the 29 non-overlapping windows (Fig 1D). Among these, the proportion of sequenced windows in which multiple subgraphs were observed varied considerably, but was generally low (median [IQR] 11.11 [19.14]% of sequenced windows for each sample, 2 [3] windows total). We observed a clear dependence of the ability to identify multiple subgraphs on sequencing success as quantified by genome coverage in the phyloscanner output. Of those samples with sequence data in all genome windows, 12.26% (52/424) had at least one window with multiple subgraphs compared to 8.04% (129/1,605) among the remaining samples. Multiple subgraph windows were more common in the genome windows corresponding to gag, env, and nef, likely reflecting circulating genetic diversity in these regions with higher substitution rates [95]. Previous studies of HIV multiple infection in this setting have used amplicon-based deep-sequencing of two regions, in p24 (1427–1816) and gp41 (7941–8264) [2,29,30]. Of 1,742 sequenced participants with data in windows spanning these regions (S4 File), 75 (4.31%) had multiple subgraphs in one of the regions.

3.2. Bayesian model to identify multiple infections from pathogen deep-sequence data

The observed dependence between phylogenetically identified samples with multiple infection and successful genome sequencing implies it is difficult to deduce the underlying prevalence of multiple infections from the empirical data without a statistical model that accounts for partial sequencing success, false-positive multiple subgraphs, and false-negative unique subgraphs within hosts. Specifically, because identification of multiple infection requires successful sequencing of both variants and genetic divergence between those variants, there is inherently more uncertainty in multiple infection status when sequencing success is poor or when infecting variants are genetically related in the sequenced region of the genome. Further, contamination or sequencing errors may give rise to spurious within-host genetic diversity and thereby inflate the estimated prevalence of multiple infection.

Therefore, we constructed a Bayesian model accounting for partial sequencing success to estimate the probabilities that each individual harbors a multiple infection, prevalence of multiple infection among deep-sequenced viremic participants, and risk factors for multiple infection (Materials and methods). We first verified that we were able to accurately estimate model parameters on simulated test data in the presence of incomplete sequencing success (S5 Table). Next, we investigated the impact of false-positive and false-negative observations, as empiric analyses of RCCS deep-sequence data indicated that false-negative rates were likely substantial in that among samples with , the observed values for a given number of sequenced windows () were less than expected based on our model (S1 File and Fig 1D). We found that failing to account for these errors led to an overestimation of the prevalence of multiple infections on simulated data (Fig 2A and S6 Table). This prompted us to explicitly include false-positive and false-negative detection rates in our model as free parameters. With this, we found that model parameters could be accurately estimated on simulated data (Fig 2B–2H and S7 Table). Model performance was robust across simulations covering a range of reasonable values of the prevalence of multiple infections as well as false-positive and false-negative rates of multiple subgraph observation (S3, S4, and S5 Figs).

Fig 2. Verification of model accuracy for estimating multiple infection prevalence on simulated data with incomplete sequencing success and false-negative and false-positive observations.

(A) Number of windows with sequence data (x-axis) v. number of windows with multiple subgraphs (y-axis) for each simulated sample. Data from multiply infected samples are highlighted in red. Marginal distributions are shown at right and above. (B) Estimated posterior probability of multiple infection for each sample. Confidence bounds represent the 95% highest posterior density. Data for each sample are shaded as in (A). (C-H) Posterior distributions of the baseline sequencing success (, C), dependence of sequencing success on viral load (log copies/mL) standardized to mean = 0 and standard deviation = 1 (, D), standard deviation of per-individual sequencing success random effect (, E), the multiple subgraph false-negative rate (λ, F), the multiple subgraph false-positive rate (ε, G), and the population prevalence of multiple infections (, H). Bins of the posterior distributions in (C-H) are shaded according to the 95% and 50% HPD. Histogram bin width is calculated such that there are approximately 50 bins over the range of the plotted values. True values are shown as vertical dotted lines.

https://doi.org/10.1371/journal.ppat.1013065.g002

To identify risk factors for multiple infection among people living with HIV, we formulated an extended model in which individual-level prior multiple infection probabilities are described with a logit linear predictor of putative risk factors. On simulated data, this model accurately estimated the true risk ratio associated with a covariate leading to a two-fold higher probability of harboring a multiple infection (risk ratio (RR) median [95% HPD] 1.74 [1.08–2.48]) in the context of four additional background null covariates (S8 Table).

3.3. Prevalence of HIV multiple infections among sequenced participants

We next estimated the prevalence of multiple infection in the sequenced sample of 2,029 participants living with viremic HIV. In a model accounting for partial sequencing success and false-positive and false-negative observations of multiple subgraphs, we estimated that 92 (4.53%) of the sequenced viremic PLHIV had a median posterior probability of multiple infection greater than 50% when allowing the probability of multiple infection () to vary by age, sex, and community type (Figs 3A, 3B, and S6). Our empirical analyses above demonstrated that the number of genome windows with multiple subgraphs is less than would be expected in the absence of false-negatives (Fig 1D). In line with this observation, the model estimated a high false-negative rate (median [95% HPD] 57.63% [53.27%–61.99%], S9 Table), implying that empirical phylogenetic signatures of multiple infection under-estimate the true infection status of individuals in any single HIV genomic window. It was therefore essential to have whole-genome data from a subset of participants (Fig 1) to estimate false-negative detection rates. Further, informed by the 91.08% of samples with no multiple subgraph windows, we estimated the false-positive rate to be low (0.32% [0.26%–0.4%]). However, we note that even a low absolute rate will likely give rise to spurious multiple subgraph observations in a large sample size, which warrants consideration in our statistical framework.

In this model, the estimated prevalence of multiple infections in the study sample was 5.86% [4.65%–7.21%] (S9 Table). Relaxing our minor subgraph frequency-based filtering step resulted in only a slightly higher prevalence of multiple infections in the study sample (6.1% [4.86%–7.39%], S10 Table). When considering only genome windows spanning the p24 and gp41 regions as in previous studies (e.g. [2,29,30]), we were unable to estimate with suitably high effective sample size (ESS) values as there were at most two regions of data for each sample. We therefore fixed based on the whole-genome analysis (S9 Table) and found that the sample prevalence of multiple infections based on p24 and gp41 was considerably lower as compared to the whole-genome analysis (2.31% [0.71%–4.94%], S11 Table), highlighting the utility of incorporating whole-genome data into our inference. Finally, after adjusting for slight biases in the availability of sequence data among viremic participants (Table 1) using post-stratification based on age, sex, and community type (S4 Table), the prevalence of multiple infections among viremic PLHIV in the RCCS was estimated to be slightly lower than the prevalence in the sequenced sample (4.09% [2.95%–5.45%], Fig 3C).

We next used our model to identify individuals with likely multiple infection based on their within-host phylogenetic trees and our modeling framework. Classification was based on the inferred posterior multiple infection probabilities, and therefore our model-based approach accounted for individual-level factors associated with sequencing success and population-level false-positive and false-negative rates. We determined a binary classification cut-off above which individuals were classified as having a likely multiple infection such that the total number of identified individuals was consistent with the estimated prevalence in the sample, which resulted in a cut-off of 3.5%. Using this threshold, we estimated there were 118 individuals with a likely multiple infection (Fig 3B).
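
The idea of choosing a cut-off so that the number of classified individuals matches the estimated prevalence can be sketched in Python. This is a hypothetical helper, not the authors' procedure: the function name, the toy probabilities, and the choice to round the target count down are all assumptions. For orientation, 5.86% of 2,029 is approximately 118.9, which agrees with the 118 individuals reported if the target is rounded down.

```python
def classification_cutoff(probabilities, prevalence):
    """Return (cutoff, n_classified) matching a target prevalence.

    The target count is floor(prevalence * n); the cutoff is the posterior
    probability of the last individual included when ranking from highest
    to lowest, and ties at the cutoff are all classified as positive.
    """
    target = int(prevalence * len(probabilities))
    if target == 0:
        return 1.0, 0
    ranked = sorted(probabilities, reverse=True)
    cutoff = ranked[target - 1]
    n_classified = sum(p >= cutoff for p in probabilities)
    return cutoff, n_classified


# Toy posterior probabilities for ten individuals, target prevalence 20%.
toy_probabilities = [0.9, 0.8, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.0, 0.0]
cutoff, n_classified = classification_cutoff(toy_probabilities, 0.2)
```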

Fig 3. Individual-level estimates and population-level characteristics of HIV multiple infection in people with viremic HIV in the Rakai Community Cohort Study, 2010-2020.

(A) Estimated posterior probability of multiple infection for each participant. Confidence bounds represent the 95% highest posterior density. Participants with at least one multiple subgraph window are shown in purple. (B) Number of participants with multiple infection as a function of the threshold used to dichotomize the probability of multiple infection. Central estimate uses the median estimated prevalence of multiple infections and shading uses 95% and 50% HPD. Horizontal dotted line plotted at the number of participants needed to match the estimated population prevalence of multiple infection. (C) Posterior distribution of the prevalence of multiple infections among viremic participants in the RCCS after accounting for sampling biases. Bins are shaded according to the 95% and 50% HPD. Histogram width is calculated such that there are approximately 50 bins over the range of the plotted values.

https://doi.org/10.1371/journal.ppat.1013065.g003

3.4. Risk factors of HIV multiple infection

In African contexts, HIV infection risk varies at the individual level, such as by age, gender, sexual behaviour and circumcision status, and at the community level [35,36,41]. We therefore next aimed to characterize individual and population-level risk factors for multiple infection with HIV. First, given the significantly higher prevalence of HIV and viremic HIV in Lake Victoria fishing communities [36,96], we investigated whether participants with viremic HIV in these communities had increased risk of multiple infection as compared to participants with viremic HIV in inland communities. Using the model described above with age, sex, and community type as predictors of the probability of multiple infection and accounting for sequencing biases through post-stratification, we calculated the prevalence of multiple infections among viremic PLHIV in fishing and inland communities and found that multiple infections in fishing communities were 2.33 (95% HPD 1.3–3.7) times more frequent than in inland communities (posterior median [95% HPD] prevalence of multiple infection of 7.42% [5.62%–9.31%] and 3.14% [1.8%–4.74%], respectively; Fig 4A and S9 Table). The estimated prevalence ratio for HIV multiple infection was therefore broadly comparable to the risk ratio of HIV prevalence and viremia in fishing as compared to inland communities (2.5-3) [36,96], consistent with the expectation that the risk of superinfection acquisition scales with the population prevalence of viremic HIV. Because participants from fishing communities are oversampled in our sequence data (Tables 1 and S4), this also explains the lower estimated prevalence of multiple infections in the population as compared to the sample.

Fig 4. Risk factors of HIV multiple infection among people with viremic HIV in the Rakai Community Cohort Study, 2010-2020.

(A) Posterior distribution of the prevalence of multiple infections stratified by community type, accounting for sampling biases, estimated in a multivariate model (age, sex, and community type) with diffuse priors (n = 2,029). Bins are shaded according to the 95% and 50% highest posterior density (HPD). Histogram width is calculated such that there are approximately 50 bins over the range of plotted values. (B) Predicted risk of multiple infection among men aged 25 to 29 years old as a function of lifetime sex partners and community type estimated in a bivariate model with diffuse priors (n = 997). Median of the posterior distribution is plotted as the central estimate and shading represents the 95% and 50% HPD. Colors are as in (A). (C) Logistic coefficients for the association between putative risk factors and the probability of harboring a multiple infection estimated with Bayesian shrinkage priors (n = 1,970). The sex and bar/rest. work variable includes female sex and bar/restaurant workers and men who report having sex with female sex and bar/restaurant workers. Median of the posterior distribution is plotted as the central estimate, horizontal bars extend to the 95% and 50% HPD. Colors are as in (A).

https://doi.org/10.1371/journal.ppat.1013065.g004

We additionally incorporated a binary feature describing the sequencing technology used to generate the deep-sequence data from each participant to assess the extent of technical bias in our inferences. In a univariate analysis, we estimated that multiple infections were less common among participants sequenced using the bait-capture protocol (RR median [95% HPD]: 0.64 [0.4–0.94], S12 Table). However, 50.45% of bait-capture sequenced participants were residents of fishing communities compared to 76.02% of amplicon sequenced participants. Consequently, in a bivariate model with community type, the estimated magnitude of the dependence of multiple infection status on sequencing technology was considerably reduced and no longer considered to be significant at the 95% level (multivariate RR median [95% HPD] 0.77 [0.48–1.12], S13 Table).

Participants with HIV in fishing communities also reported having more lifetime sex partners (S7 Fig), so we next assessed whether the risk of harboring a multiple infection differed by the number of self-reported lifetime sex partners within each of the two community locations. As women tend to under-report their number of sex partners relative to men [97], we restricted this analysis to male participants. The number of lifetime sexual partners generally increases with age, and so we standardized responses relative to the age-specific mean number of lifetime sexual partners among participants separately for the inland and fishing communities (S8 Fig). Among 997 male participants included in this analysis, 516 reported an exact number of lifetime sex partners, 477 responded they had three or more lifetime partners, and 4 did not provide a response. We imputed ambiguous responses and missing data within our inference framework by assuming responses were missing at random between people with and without multiple infection (Materials and methods).

In a bivariate model with community type and number of lifetime sexual partners we did not find a statistically significantly higher risk of multiple infection in male participants with more lifetime sexual partners in the context of substantial missing data and sampling over potential missing values using age-specific prior distributions. However, we note that the posterior effect size translated into an estimated more than two-fold higher risk of multiple infection between men living with viremic HIV in fishing communities associated with having 30 lifetime sexual partners compared to one lifetime sexual partner (e.g. RR median [95% HPD] among 25-29 year olds 2.47 [0.7–5.61], Figs 4B and S9 for all age groups and S14 Table). Very similar results were observed using a complete case analysis of the 516 men who provided an exact number of lifetime sex partners (S15 Table).

We also performed a comprehensive discovery-based risk factor variable selection analysis over eight additive biological, behavioral and epidemic features, stratifying epidemiological and behavioral variables by community type to account for demographic differences between the populations and excluding additional variable interactions. This analysis confirmed residency in fishing communities as a risk factor of multiple infection among sequenced participants, albeit with a wide credible interval (multivariate RR median [95% HPD] 1.59 [0.92–2.85]), but did not identify any other variables that were associated with significantly higher or lower risk of multiple infection in our sample (Fig 4C and S16 Table). Specifically, despite the fact that female bar/restaurant workers face a three-fold higher risk of incident HIV [41], we did not identify an increased risk of multiple infection among female bar/restaurant workers or men who have sex with bar/restaurant workers in either inland or fishing communities.

4. Discussion

In this large-scale study, we assessed the prevalence and risk factors of HIV multiple infection in an East African setting with high HIV burden using population-based pathogen deep-sequence surveillance data. To do this, we developed a Bayesian statistical model to identify multiple infections in deep-sequence phylogenies such as those generated by phyloscanner [38]. Our model incorporates false-negative and false-positive rates for the presence of genetically distinct viral variants and simultaneously estimates individual and population-level probabilities of harboring multiple infection. This framework also allows for the identification of biological and epidemiological risk factors for harboring a multiple infection. In simulation analyses, we demonstrated the ability of the model to generate accurate inferences across a range of parameter values, and fitted the model to phyloscanner within-host phylogenies inferred from HIV whole-genome RNA deep-sequence data collected between January 2010 and November 2020 from 2,029 viremic participants in the Rakai Community Cohort Study, a population-based open-cohort in southern Uganda. Among viremic participants in this study over the study period, the estimated prevalence of multiple infections was approximately 4%, reflecting the prevalence of co-circulating multiple infections present at time of sampling. Further, we showed that viremic participants with HIV living in high HIV prevalence fishing communities along Lake Victoria were more than twice as likely to harbor a multiple infection as compared to those living in inland agrarian or trading communities. Among male residents in fishing communities, we estimated that those with more lifetime sex partners can be expected to be more likely to have a multiple infection, although this finding did not reach statistical significance at the 95% level.
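To make the logic of combining window-level evidence with error rates concrete, the following is a deliberately simplified sketch and not the model fitted in this study. It assumes that each genome window with sequence data independently shows multiple subgraphs with probability `detect_rate` in a person with a multiple infection and with probability `fp_rate` otherwise, and it applies Bayes' rule with a fixed population prevalence. All parameter names and values are hypothetical; the full model additionally estimates these quantities jointly and accounts for covariates and sequencing success.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_multiple_infection(n_windows, n_multi, prevalence, detect_rate, fp_rate):
    """Posterior probability of multiple infection for one participant.

    n_windows: windows with sequence data; n_multi: windows showing multiple
    subgraphs. A two-component binomial mixture is assumed for illustration.
    """
    like_mi = binom_pmf(n_multi, n_windows, detect_rate)
    like_single = binom_pmf(n_multi, n_windows, fp_rate)
    numerator = prevalence * like_mi
    return numerator / (numerator + (1 - prevalence) * like_single)


# Hypothetical values: 10 windows with data, 4% prevalence.
p_none = prob_multiple_infection(10, 0, prevalence=0.04, detect_rate=0.3, fp_rate=0.01)
p_five = prob_multiple_infection(10, 5, prevalence=0.04, detect_rate=0.3, fp_rate=0.01)
```

Under these assumed values, a participant with no multiple-subgraph windows has a posterior probability below the prior prevalence, whereas one with multiple subgraphs in half of the windows has a posterior probability close to one, which mirrors how individual-level and population-level quantities inform each other in a hierarchical framework.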

This study represents the largest analysis of HIV multiple infections by more than an order of magnitude [20] and rigorously accounts for partial sequencing success and uncertainty in individual-level estimates when estimating population-level risk of multiple infection. Our model indicated that in the context of incomplete genome coverage, as is common in HIV whole-genome sequencing [33], evidence for multiple infections is expected to be observed in only a subset of genome windows. However, we observed a high rate of false-negatives beyond what is expected due to incomplete sequencing, which may be due to insufficient diversity of infecting variants in some regions of the genome [95] to phylogenetically distinguish them. This could potentially be due to recombination between infecting variants prior to sampling [4,5] such that infecting variants are only genetically distinct in some portions of the genome when sampled. The population-based multiple infection prevalence estimates from the data reported here are substantially more precise than previous estimates from this setting, as expected given the larger sample size, and slightly higher than previous estimates (n = 7/149 [2]), likely primarily reflecting greater sensitivity of whole-genome sequencing data. Multiple infection among inland community study participants in this study (3.14%) was slightly less prevalent than in this earlier work (pre-2009, 4.7% [2]), consistent with reductions in HIV incidence over the same time frame [35]. Previous studies of female sex workers in urban Uganda and Kenya have estimated the prevalence of multiple infections to be as high as 14–16% in this high-risk demographic based on amplicon deep-sequencing [30,31]. Here, we do not replicate this finding using self-reported data on sex work or bar/restaurant work in our population-based sampling framework. We expect this is likely due to hesitation to self-report sex work among study participants and study participation bias among sex workers. However, our results are generally consistent with previous findings suggesting multiple infections are less common in African populations as compared to the United States (10–15% in studies conducted between 1996 and 2010 [98–102]), which may reflect the fact that the HIV epidemic in the United States is concentrated among men who have sex with men (MSM) and people who inject drugs (PWID) as opposed to the generalized nature of the epidemic in Africa. Further, as the risk of HIV transmission given exposure is 8–16× and 3–17× greater for needle-sharing and anal intercourse, respectively, as compared to vaginal intercourse [103], the risk of multiple infection acquisition given exposure may also be significantly greater in concentrated epidemics. To date, however, we note that the sample sizes of HIV multiple infection studies in the United States are relatively small (<150 individuals) and there is therefore significant uncertainty in the true underlying prevalence in these settings.

Our results add to considerable previous research on increased risk of HIV infection among Lake Victoria fishing communities. Previous studies have shown that overall HIV prevalence and prevalence of viremic HIV in these communities is 2.5–3× higher than in inland communities [36,96], in part due to migration of PLHIV to these communities [104,105]. Further, despite a rapid increase in antiretroviral therapy (ART) uptake among residents of fishing communities over the study period [106], there remains a higher prevalence of people living with viremic, ART-resistant HIV as compared to inland communities [107]. We here show that viremic PLHIV in fishing communities also face a significantly higher burden of HIV multiple infections. Our estimates also suggest that among men in fishing communities, multiple infection risk increases with the number of lifetime sex partners, although the precision of this estimate is hindered by a large proportion of qualitative responses to this component of the RCCS survey. These results imply that PLHIV in fishing communities continue to be exposed to viremic partners following initial infection. Public health interventions directed at viremic PLHIV in these communities may therefore not only provide life-saving treatment to these individuals but also reduce opportunities for the generation of novel recombinant forms of HIV, which could pose challenges to control efforts through potential generation of more transmissible variants and broadening of the antigenic space that potential vaccines need to cover [8–10,108–110].

We expect that our inferential framework may be adaptable to whole-genome deep-sequence phylogenies from other pathogens in which infection is chronic (thereby allowing sufficient time for superinfection to occur). Hepatitis C virus (HCV), which is a chronic viral infection transmitted either sexually or by injection drug use, is a natural extension [111]. Among people who inject drugs, the prevalence of HCV mixed infections is estimated to be as high as 39% [112]. Our framework has the advantage that it uses data from across the genome and does not require haplotyping of sequencing reads, which has proven to be exceedingly difficult with short-read sequence data [113]. Recent work has also attempted to identify multiple infections of Mycobacterium tuberculosis (MTB), a chronic bacterial infection canonically of the lungs [114]. These methods work by either clustering allele frequencies to distinguish within- and between-variant differences [115–117] or by comparing sampled sequence data to a database of reference strains [118]. They therefore require defining circulating genetic diversity a priori (which may be challenging in a poorly sampled epidemic) or assume independence between alleles, failing to account for linkage between adjacent genome positions and the evolutionary history giving rise to the observed genetic variation. Multiple infections may also be of interest in acute, high-prevalence infectious diseases. For example, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) superinfections have been observed by identifying mixed alleles at known lineage-defining sites [119,120].

While deep-phyloMI builds upon previous investigations into HIV multiple infections to provide more rigorous estimates of individual and population level parameters, we do rely on some simplifying assumptions in our framework. First, we only identify multiple infections among viremic participants with available deep-sequence data who were identified as part of a putative transmission network. While we adjust for known sampling biases based on demographic characteristics, there may be residual bias such that our sample is non-representative of the underlying population of viremic PLHIV. Further, because the RCCS did not perform viral load testing on all participants prior to the 2014 survey round, we adjust only to the demographic characteristics of viremic PLHIV in the four most recent surveys. Second, we focus on identifying multiple infections only in cross-sectional sequence data. As multiple infections can be transient [31], we are unable to identify participants who have been but are not currently multiply infected. It is likely that longitudinal sampling or sequencing of the viral reservoir would identify additional individuals who have been multiply infected. Further, with only a single sample per individual we were unable to reliably identify factors causally associated with incident multiple infection [121] and therefore report factors that are associated with prevalent multiple infections. Similarly, in the absence of longitudinal data or data sampled soon after initial infection, we are unable to reliably distinguish multiple infections acquired through coinfection and superinfection. However, based on our parametrization of the k parameter within phyloscanner and the genetic distance between observed multiple subgraphs, we suspect that the vast majority of identified multiple infections are due to superinfection with a genetically distinct viral genotype. More liberal values of k would increase the sensitivity of our approach to identify closely related viral genotypes (such as those acquired during coinfection) at the expense of an increased rate of false-positives. Further, more liberal values of k would be appropriate in settings with less circulating HIV genetic diversity as compared to our study site [7].

HIV multiple infections complicate global control efforts by fueling the generation of genetic diversity [6], worsening clinical outcomes [15,16], and increasing viral load [16,31,122]. Here we developed a robust inference framework to identify multiple infections in deep-sequence data and assess the role of epidemiological risk factors, such as living in high burden communities, in harboring multiple infections. This work will inform interventions aimed at preventing the acquisition of HIV superinfections and efforts to model the role of multiple infections in the dynamics and evolution of HIV.

Supporting information

S1 Fig. Sequencing coverage among samples from 2,029 Rakai Community Cohort Study participant-visits contributed by viremic people living with HIV with , stratified by viral load category and sequencing technology.

(A) Distribution of values for all samples. (B) Number of samples with coverage in each of the 29 genome windows.

https://doi.org/10.1371/journal.ppat.1013065.s001

(TIF)

S2 Fig. Pairwise genetic distance between unique tips in within-host phylogenetic trees among people with viremic HIV in the Rakai Community Cohort Study, 2010–2020.

Bins are shaded based on whether tips were assigned to the same subgraph (grey) or different subgraphs (purple), in the case where multiple subgraphs were observed.

https://doi.org/10.1371/journal.ppat.1013065.s002

(TIF)

S3 Fig. Posterior distribution of parameters in full model fit to simulated data across a range of δ values.

Rows represent the model fit to simulated data with δ = 0 (top row), δ = 5% (second row), δ = 10% (third row), and δ = 20% (bottom row). Posterior distribution bins are shaded according to the 95% and 50% highest posterior density. Histogram bin width is calculated such that there are approximately 50 bins over the range of the plotted values. True values are shown as vertical dotted lines. VL = viral load (log10 copies/mL) normalized to mean = 0 and std. dev. = 1. Std. dev. = standard deviation.

https://doi.org/10.1371/journal.ppat.1013065.s003

(TIF)

S4 Fig. Posterior distribution of parameters in full model fit to simulated data across a range of λ values.

Rows represent the model fit to simulated data with λ = 0.1 (top row), λ = 0.2% (second row), λ = 0.3% (third row), and λ = 0.4% (bottom row). Posterior distribution bins are shaded according to the 95% and 50% highest posterior density. Histogram bin width is calculated such that there are approximately 50 bins over the range of the plotted values. True values are shown as vertical dotted lines. VL = viral load (log10 copies/mL) normalized to mean = 0 and std. dev. = 1. Std. dev. = standard deviation.

https://doi.org/10.1371/journal.ppat.1013065.s004

(TIF)

S5 Fig. Posterior distribution of parameters in full model fit to simulated data across a range of ε values.

Rows represent the model fit to simulated data with ε = 0 (top row), ε = 0.5% (second row), ε = 1% (third row), and ε = 5% (bottom row). Posterior distribution bins are shaded according to the 95% and 50% highest posterior density. Histogram bin width is calculated such that there are approximately 50 bins over the range of the plotted values. True values are shown as vertical dotted lines. VL = viral load (log10 copies/mL) normalized to mean = 0 and std. dev. = 1. Std. dev. = standard deviation.

https://doi.org/10.1371/journal.ppat.1013065.s005

(TIF)

S6 Fig. Individual-level estimate of HIV multiple infection in people living with viremic HIV in the Rakai Community Cohort Study, 2010-2020.

Estimated posterior log probability of multiple infection for each participant. Confidence bounds represent the 95% highest posterior density. Participants with at least one multiple subgraph window are shown in purple.

https://doi.org/10.1371/journal.ppat.1013065.s006

(TIF)

S7 Fig. Mean number of lifetime sex partners stratified by HIV serostatus, sex, community type, and age among 109,608 RCCS participant-visits.

Excludes participant visits in which respondents provided a categorical response (N = 5,436 (10.67%)).

https://doi.org/10.1371/journal.ppat.1013065.s007

(TIF)

S8 Fig. Standardization curve used to adjust observed number of lifetime sex partners among men for age-cohort effects.

Includes simple imputation of categorical responses (e.g. “1-2” and “3+”) to 1) the mean value of observed responses of 1 or 2 (“1-2”) within age category and community type and 2) the mean of a lognormal distribution fit to observed responses of  ≥ 3 lifetime sex partners within age category and community type.

https://doi.org/10.1371/journal.ppat.1013065.s008

(TIF)

S9 Fig. Posterior estimates of the prevalence of multiple infections, stratified by age category and community type.

Median estimate is plotted as a line and shading represents the 50% and 95% highest posterior densities. All age categories share the same coefficient estimates but differ because lifetime sex partner values are standardized to the mean of the observed values within groups defined by sex, age category, and community type.

https://doi.org/10.1371/journal.ppat.1013065.s009

(TIF)

S2 File. Reference genomes included in the phyloscanner analysis.

https://doi.org/10.1371/journal.ppat.1013065.s011

(TXT)

S3 File. Normalization constants used to adjust branch lengths in within-host phylogenetic trees.

https://doi.org/10.1371/journal.ppat.1013065.s012

(CSV)

S4 File. Sensitivity of results to choice of genome windows.

https://doi.org/10.1371/journal.ppat.1013065.s013

(PDF)

S1 Table. Count of participants sequenced using each sequencing protocol.

https://doi.org/10.1371/journal.ppat.1013065.s014

(PDF)

S2 Table. Characteristics of Rakai Community Cohort Study participants, 2010–2020.

For each participant, includes data from the participant-visit processed with PHSC if applicable or the participant-visit with the highest viral load, using the first visit in the case of ties or for people not living with HIV. Percentages represent the row percentages within each category. Binomial confidence intervals were calculated using the Agresti–Coull method. PHSC = phyloscanner.

https://doi.org/10.1371/journal.ppat.1013065.s015

(PDF)

S3 Table. Count of missing values among 50,967 RCCS participants.

For each participant, includes data from the participant-visit processed with PHSC if applicable or the participant-visit with the highest viral load, using the first visit in the case of ties or for people not living with HIV. In each category the percentage represents the percentage of all participants or all participants that were viremic and processed with PHSC.

https://doi.org/10.1371/journal.ppat.1013065.s016

(PDF)

S4 Table. Viremic participant-visits (2014–2019) and participants with available phyloscanner output belonging to epidemiological strata in the Rakai Community Cohort Study.

Epidemiological strata are defined by community type, age category, and sex. As viral load testing was not routinely conducted in earlier study rounds, the viremic participants belonging to each stratum were tabulated using only data from the 2014 through 2019 surveys.

https://doi.org/10.1371/journal.ppat.1013065.s017

(PDF)

S5 Table. Parameter estimates for base model fit to base simulated data.

ESS = effective sample size. HPD = highest posterior density.

https://doi.org/10.1371/journal.ppat.1013065.s018

(PDF)

S6 Table. Parameter estimates for base model fit to full simulated data.

ESS = effective sample size. HPD = highest posterior density.

https://doi.org/10.1371/journal.ppat.1013065.s019

(PDF)

S7 Table. Parameter estimates for full model fit to full simulated data.

ESS = effective sample size. HPD = highest posterior density.

https://doi.org/10.1371/journal.ppat.1013065.s020

(PDF)

S8 Table. Parameter estimates for extended model fit to extended simulated data with epidemiological risk factor of multiple infection.

ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s021

(PDF)

S9 Table. Parameter estimates for full model fit to deep-sequence data from 2,029 RCCS participants living with viremic HIV with age, sex, and community type as putative risk factors for harboring multiple infections.

ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s022

(PDF)

S10 Table. Parameter estimates for full model fit to deep-sequence data from 2,029 RCCS participants living with viremic HIV with age, sex, and community type as putative risk factors for harboring multiple infections.

Includes minor subgraphs supported in < 1% of reads in a given window so long as they are supported by at least three reads. ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s023

(PDF)

S11 Table. Parameter estimates for full model fit to deep-sequence data from 1,742 RCCS participants living with viremic HIV with age, sex, and community type as putative risk factors for harboring multiple infections.

Includes data from genome windows spanning the p24 (1427–1816) and gp41 (7941–8264) regions. ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s024

(PDF)

S12 Table. Parameter estimates for full model fit to deep-sequence data from 2,029 RCCS participants living with viremic HIV with deep-sequencing protocol as a putative risk factor for harboring multiple infections.

ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s025

(PDF)

S13 Table. Parameter estimates for full model fit to deep-sequence data from 2,029 RCCS participants living with viremic HIV with community type and deep-sequencing protocol as putative risk factors for harboring multiple infections.

ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s026

(PDF)

S14 Table. Parameter estimates for full model fit to deep-sequence data from 997 men who participated in the RCCS living with viremic HIV with community type and number of lifetime sex partners as putative risk factors for harboring multiple infections adjusted for deep-sequencing protocol.

ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s027

(PDF)

S15 Table. Parameter estimates for full model fit to deep-sequence data from 516 men who participated in the RCCS living with viremic HIV with community type and number of lifetime sex partners as putative risk factors for harboring multiple infections.

Excludes participants with ambiguous or missing data on the number of lifetime sex partners. ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s028

(PDF)

S16 Table. Parameter estimates for full model fit to deep-sequence data from 1,970 RCCS participants living with viremic HIV with putative risk factors for harboring multiple infection and Bayesian shrinkage priors.

ESS = effective sample size. HPD = highest posterior density. stz-MVN = sum-to-zero multivariate Normal distribution.

https://doi.org/10.1371/journal.ppat.1013065.s029

(PDF)

Acknowledgments

We thank the participants of the Rakai Community Cohort Study for making this research possible. Further, we thank all Rakai Health Sciences Program staff and all members of the PANGEA-HIV consortium. We thank Dr. Chris Wymant (Pandemic Sciences Institute, University of Oxford) for insightful discussions about Bayesian modeling of HIV multiple infections and helpful comments on this manuscript. We thank Zhi Ling (Saw Swee Hock School of Public Health, National University of Singapore) for advice on enforcing sum-to-zero constraints in the context of horseshoe-type shrinkage priors. Computational resources were provided through the Imperial College Research Computing Service and the Biomedical Research Computing Cluster at the University of Oxford.

References

  1. Redd AD, Quinn TC, Tobian AAR. Frequency and implications of HIV superinfection. Lancet Infect Dis 2013;13(7):622–8. pmid:23726798
  2. Redd AD, Mullis CE, Serwadda D, Kong X, Martens C, Ricklefs SM, et al. The rates of HIV superinfection and primary HIV incidence in a general population in Rakai, Uganda. J Infect Dis 2012;206(2):267–74. pmid:22675216
  3. Wertheim JO, Oster AM, Murrell B, Saduvala N, Heneine W, Switzer WM, et al. Maintenance and reappearance of extremely divergent intra-host HIV-1 variants. Virus Evol. 2018;4(2):vey030. pmid:30538823
  4. Fang G, Weiser B, Kuiken C, Philpott SM, Rowland-Jones S, Plummer F, et al. Recombination following superinfection by HIV-1. AIDS 2004;18(2):153–9. pmid:15075531
  5. Streeck H, Li B, Poon AFY, Schneidewind A, Gladden AD, Power KA, et al. Immune-driven recombination and loss of control after HIV superinfection. J Exp Med 2008;205(8):1789–96. pmid:18625749
  6. Ramirez BC, Simon-Loriere E, Galetto R, Negroni M. Implications of recombination for HIV diversity. Virus Res. 2008;134(1–2):64–73. pmid:18308413
  7. Kim S, Kigozi G, Martin MA, Galiwango RM, Quinn TC, Redd AD, et al. Intra- and inter-subtype HIV diversity between 1994 and 2018 in southern Uganda: a longitudinal population-based study. Virus Evol. 2024. https://doi.org/10.1093/ve/veae065 pmid:39399152
  8. Ritchie AJ, Cai F, Smith NMG, Chen S, Song H, Brackenridge S, et al. Recombination-mediated escape from primary CD8+ T cells in acute HIV-1 infection. Retrovirology. 2014;11:69. pmid:25212771
  9. Corey L, McElrath MJ. HIV vaccines: mosaic approach to virus diversity. Nat Med 2010;16(3):268–70. pmid:20208511
  10. Kiwanuka N, Laeyendecker O, Quinn TC, Wawer MJ, Shepherd J, Robb M, et al. HIV-1 subtypes and differences in heterosexual HIV transmission among HIV-discordant couples in Rakai, Uganda. AIDS 2009;23(18):2479–84. pmid:19841572
  11. Powell RLR, Kinge T, Nyambi PN. Infection by discordant strains of HIV-1 markedly enhances the neutralizing antibody response against heterologous virus. J Virol 2010;84(18):9415–26. pmid:20631143
  12. Cortez V, Odem-Davis K, McClelland RS, Jaoko W, Overbaugh J. HIV-1 superinfection in women broadens and strengthens the neutralizing antibody response. PLoS Pathog 2012;8(3):e1002611. pmid:22479183
  13. Krebs SJ, Kwon YD, Schramm CA, Law WH, Donofrio G, Zhou KH, et al. Longitudinal analysis reveals early development of three MPER-directed neutralizing antibody lineages from an HIV-1-infected individual. Immunity. 2019;50(3):677-691.e13. pmid:30876875
  14. Sok D, Burton DR. Recent progress in broadly neutralizing antibodies to HIV. Nat Immunol 2018;19(11):1179–88. pmid:30333615
  15. Gottlieb GS, Nickle DC, Jensen MA, Wong KG, Grobler J, Li F, et al. Dual HIV-1 infection associated with rapid disease progression. Lancet 2004;363(9409):619–22. pmid:14987889
  16. Smith DM, Wong JK, Hightower GK, Ignacio CC, Koelsch KK, Daar ES, et al. Incidence of HIV superinfection following primary infection. JAMA 2004;292(10):1177–8. pmid:15353529
  17. Ronen K, Richardson BA, Graham SM, Jaoko W, Mandaliya K, McClelland RS, et al. HIV-1 superinfection is associated with an accelerated viral load increase but has a limited impact on disease progression. AIDS 2014;28(15):2281–6. pmid:25102090
  18. Quinn TC, Wawer MJ, Sewankambo N, Serwadda D, Li C, Wabwire-Mangen F, et al. Viral load and heterosexual transmission of human immunodeficiency virus type 1. Rakai Project Study Group. N Engl J Med 2000;342(13):921–9. pmid:10738050
  19. Fraser C, Hollingsworth TD, Chapman R, de Wolf F, Hanage WP. Variation in HIV-1 set-point viral load: epidemiological analysis and an evolutionary hypothesis. Proc Natl Acad Sci U S A 2007;104(44):17441–6. pmid:17954909
  20. Yuan D, Zhao F, Liu S, Liu Y, Yan H, Liu L, et al. Dual infection of different clusters of HIV in people living with HIV worldwide: a meta-analysis based on next-generation sequencing studies. AIDS Patient Care STDS 2024;38(8):348–57. pmid:38957963
  21. Cornelissen M, Jurriaans S, Kozaczynska K, Prins JM, Hamidjaja RA, Zorgdrager F, et al. Routine HIV-1 genotyping as a tool to identify dual infections. AIDS 2007;21(7):807–11. pmid:17415035
  22. van der Kuyl AC, Zorgdrager F, Jurriaans S, Back NKT, Prins JM, Brinkman K, et al. Incidence of human immunodeficiency virus type 1 dual infections in Amsterdam, The Netherlands, during 2003-2007. Clin Infect Dis 2009;48(7):973–8. pmid:19231977
  23. Chaudron SE, Leemann C, Kusejko K, Nguyen H, Tschumi N, Marzel A, et al. A systematic molecular epidemiology screen reveals numerous human immunodeficiency virus (HIV) type 1 superinfections in the Swiss HIV cohort study. J Infect Dis 2022;226(7):1256–66. pmid:35485458
  24. Rachinger A, van de Ven TD, Burger JA, Schuitemaker H, van ’t Wout AB. Evaluation of pre-screening methods for the identification of HIV-1 superinfection. J Virol Methods 2010;165(2):311–7. pmid:20178816
  25. Sheward DJ, Ntale R, Garrett NJ, Woodman ZL, Abdool Karim SS, Williamson C. HIV-1 superinfection resembles primary infection. J Infect Dis 2015;212(6):904–8. pmid:25754982
  26. Ssemwanga D, Doria-Rose NA, Redd AD, Shiakolas AR, Longosz AF, Nsubuga RN, et al. Characterization of the neutralizing antibody response in a case of genetically linked HIV superinfection. J Infect Dis 2018;217(10):1530–4. pmid:29579256
  27. Woodson E, Basu D, Olszewski H, Gilmour J, Brill I, Kilembe W, et al. Reduced frequency of HIV superinfection in a high-risk cohort in Zambia. Virology. 2019;535:11–9. pmid:31254743
  28. Pacold M, Smith D, Little S, Cheng PM, Jordan P, Ignacio C, et al. Comparison of methods to detect HIV dual infection. AIDS Res Hum Retroviruses 2010;26(12):1291–8. pmid:20954840
  29. Redd AD, Collinson-Streng A, Martens C, Ricklefs S, Mullis CE, Manucci J, et al. Identification of HIV superinfection in seroconcordant couples in Rakai, Uganda, by use of next-generation deep sequencing. J Clin Microbiol 2011;49(8):2859–67. pmid:21697329
  30. Redd AD, Ssemwanga D, Vandepitte J, Wendel SK, Ndembi N, Bukenya J, et al. Rates of HIV-1 superinfection and primary HIV-1 infection are similar in female sex workers in Uganda. AIDS 2014;28(14):2147–52. pmid:25265078
  31. Ronen K, McCoy CO, Matsen FA, Boyd DF, Emery S, Odem-Davis K, et al. HIV-1 superinfection occurs less frequently than initial infection in a cohort of high-risk Kenyan women. PLoS Pathog 2013;9(8):e1003593. pmid:24009513
  32. Piantadosi A, Ngayo MO, Chohan B, Overbaugh J. Examination of a second region of the HIV type 1 genome reveals additional cases of superinfection. AIDS Res Hum Retroviruses 2008;24(9):1221. pmid:18729772
  33. Bonsall D, Golubchik T, de Cesare M, Limbada M, Kosloff B, MacIntyre-Cockett G, et al. A comprehensive genomics solution for HIV surveillance and clinical monitoring in low-income settings. J Clin Microbiol 2020;58(10):e00382–20. pmid:32669382
  34. Gall A, Ferns B, Morris C, Watson S, Cotten M, Robinson M, et al. Universal amplification, next-generation sequencing, and assembly of HIV-1 genomes. J Clin Microbiol 2012;50(12):3838–44. pmid:22993180
  35. Grabowski MK, Serwadda DM, Gray RH, Nakigozi G, Kigozi G, Kagaayi J, et al. HIV prevention efforts and incidence of HIV in Uganda. N Engl J Med 2017;377(22):2154–66. pmid:29171817
  36. Chang LW, Grabowski MK, Ssekubugu R, Nalugoda F, Kigozi G, Nantume B, et al. Heterogeneity of the HIV epidemic in agrarian, trading, and fishing communities in Rakai, Uganda: an observational epidemiological study. Lancet HIV 2016;3(8):e388–96. pmid:27470029
  37. Dwyer-Lindgren L, Cork MA, Sligar A, Steuben KM, Wilson KF, Provost NR, et al. Mapping HIV prevalence in sub-Saharan Africa between 2000 and 2017. Nature 2019;570(7760):189–93. pmid:31092927
  38. Wymant C, Hall M, Ratmann O, Bonsall D, Golubchik T, de Cesare M, et al. PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Mol Biol Evol 2018;35(3):719–33. pmid:29186559
  39. Monod M, Brizzi A, Galiwango RM, Ssekubugu R, Chen Y, Xi X, et al. Longitudinal population-level HIV epidemiologic and genomic surveillance highlights growing gender disparity of HIV transmission in Uganda. Nat Microbiol 2024;9(1):35–54. pmid:38052974
  40. Dambach P, Mahenge B, Mashasi I, Muya A, Barnhart DA, Bärnighausen TW, et al. Socio-demographic characteristics and risk factors for HIV transmission in female bar workers in sub-Saharan Africa: a systematic literature review. BMC Public Health 2020;20(1):697. pmid:32414352
  41. Popoola VO, Kagaayi J, Ssekasanvu J, Ssekubugu R, Kigozi G, Ndyanabo A, et al. HIV epidemiologic trends among occupational groups in Rakai, Uganda: a population-based longitudinal study, 1999–2016. PLOS Glob Public Health 2024;4(2):e0002891. pmid:38377078
  42. Global Aids response progress report: Uganda January 2010–December 2012. 2013.
  43. Kagulire SC, Opendi P, Stamper PD, Nakavuma JL, Mills LA, Makumbi F, et al. Field evaluation of five rapid diagnostic tests for screening of HIV-1 infections in rural Rakai, Uganda. Int J STD AIDS 2011;22(6):308–9. pmid:21680664
  44. Pillay D, Herbeck J, Cohen MS, de Oliveira T, Fraser C, Ratmann O, et al. PANGEA-HIV: phylogenetics for generalised epidemics in Africa. Lancet Infect Dis 2015;15(3):259–61. pmid:25749217
  45. Abeler-Dörner L, Grabowski MK, Rambaut A, Pillay D, Fraser C, PANGEA consortium. PANGEA-HIV 2: phylogenetics and networks for generalised epidemics in Africa. Curr Opin HIV AIDS 2019;14(3):173–80. pmid:30946141
  46. PANGEA-HIV/PANGEA-Sequences: Latest version release. 2024. Available from: https://doi.org/10.5281/zenodo.10793873
  47. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):R46. pmid:24580807
  48. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20. pmid:24695404
  49. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012;19(5):455–77. pmid:22506599
  50. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res 2017;27(5):824–34. pmid:28298430
  51. Wymant C, Blanquart F, Golubchik T, Gall A, Bakker M, Bezemer D, et al. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver. Virus Evol. 2018;4(1):vey007. https://doi.org/10.1093/ve/vey007 pmid:29876136
  52. Xi X. Bayesian methods for source attribution using HIV deep sequence data. 2021.
  53. Lynch RM, Shen T, Gnanakaran S, Derdeyn CA. Appreciating HIV type 1 diversity: subtype differences in Env. AIDS Res Hum Retroviruses 2009;25(3):237–48. pmid:19327047
  54. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002;30(14):3059–66. pmid:12136088
  55. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 2015;32(1):268–74. pmid:25371430
  56. 56. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol 2020;37(5):1530–4. pmid:32011700
  57. 57. Ratmann O, Grabowski MK, Hall M, Golubchik T, Wymant C, Abeler-Dörner L, et al. Inferring HIV-1 transmission networks and sources of epidemic spread in Africa with deep-sequence phylogenetic analysis. Nat Commun 2019;10(1):1411. pmid:30926780
  58. 58. Sankoff D. Minimal mutation trees of sequences. SIAM J Appl Math 1975;28(1):35–42.
  59. 59. Stan Modeling Language Users Guide and Reference Manual, Version 2.36. Available from: https://mc-stan.org
  60. 60. Carvalho CM, Polson NG, Scott JG. Handling sparsity via the horseshoe. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. vol. 5 of Proceedings of Machine Learning Research. Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR. 2009. p. 73–80. Available from: https://proceedings.mlr.press/v5/carvalho09a.html
  61. 61. Piironen J, Vehtari A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electron J Statist. 2017;11(2).
  62. 62. Betancourt MJ, Girolami M. Hamiltonian Monte Carlo for hierarchical models. arXiv. 2013.
  63. 63. Hoffman M, Gelman A. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J Mach Learn Res. 2014;15(47):1593–623.
  64. 64. Gabry J, Cesnovar R, Johnson A. cmdstanr: R interface to “CmdStan”. 2023.
  65. 65. Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner P-C. Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Anal. 2021;16(2).
  66. 66. Thompson TJ, Smith PJ, Boyle JP. Finite mixture models with concomitant information: assessing diagnostic criteria for diabetes. J Roy Statist Soc Ser C: Appl Statist 1998;47(3):393–404.
  67. 67. Shi JQ, Murray-Smith R, Titterington DM. Hierarchical Gaussian process mixtures for regression. Stat Comput 2005;15(1):31–41.
  68. 68. Proust-Lima C, Letenneur L, Jacqmin-Gadda H. A nonlinear latent class model for joint analysis of multivariate longitudinal data and a binary outcome. Stat Med 2007;26(10):2229–45. pmid:16900568
  69. 69. Williford E, Haley V, McNutt L-A, Lazariu V. Dealing with highly skewed hospital length of stay distributions: the use of Gamma mixture models to study delivery hospitalizations. PLoS One 2020;15(4):e0231825. pmid:32310963
  70. 70. Samerei SA, Aghabayk K, Shiwakoti N, Mohammadi A. Using latent class clustering and binary logistic regression to model Australian cyclist injury severity in motor vehicle-bicycle crashes. J Safety Res. 2021;79:246–56. pmid:34848005
  71. 71. Sotres-Alvarez D, Herring AH, Siega-Riz AM. Latent class analysis is useful to classify pregnant women into dietary patterns. J Nutr 2010;140(12):2253–9. pmid:20962151
  72. 72. Little RJA. Post-stratification: a modeler’s perspective. J Am Statist Assoc 1993;88(423):1001–12.
  73. 73. R Core Team. R: A language and environment for statistical computing. 2023. Available from: https://www.R-project.org/
  74. 74. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the tidyverse. JOSS 2019;4(43):1686.
  75. 75. Wickham H, François R, Henry L, Müller K, Vaughan D. dplyr: a grammar of data manipulation. 2023. Available from: https://CRAN.R-project.org/package=dplyr
  76. 76. Müller K, Wickham H. tibble: Simple data frames. 2023. Available from: https://CRAN.R-project.org/package=tibble
  77. 77. Wickham H, Vaughan D, Girlich M. tidyr: Tidy messy data. 2024. Available from: https://CRAN.R-project.org/package=tidyr
  78. 78. Wickham H, Miller E, Smith D. haven: Import and export ‘SPSS’, ‘Stata’ and ‘SAS’ files. 2023. Available from: https://haven.tidyverse.org
  79. 79. Wickham H. ggplot2: Elegant graphics for data analysis. New York: Springer. 2016. Available from: https://ggplot2.tidyverse.org
  80. 80. Gabry J, Mahr T. bayesplot: Plotting for Bayesian models. 2024. Available from: https://mc-stan.org/bayesplot/
  81. 81. Gabry J, Simpson D, Vehtari A, Betancourt M, Gelman A. Visualization in Bayesian workflow. J Roy Statist Soc Ser A: Statist Soc 2019;182(2):389–402.
  82. 82. Wilke CO. cowplot: streamlined plot theme and plot annotations for ‘ggplot2’. 2024. Available from: https://wilkelab.org/cowplot/
  83. 83. Pedersen TL. patchwork: The composer of plots. 2024. Available from: https://patchwork.data-imaginist.com
  84. 84. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35(3):526–8. pmid:30016406
  85. 85. Yu G. Data integration, manipulation and visualization of phylogenetic trees. 1st edn. Chapman and Hall/CRC. 2022. Available from: https://www.amazon.com/Integration-Manipulation-Visualization-Phylogenetic-Computational-ebook/dp/B0B5NLZR1Z/
  86. 86. Xu S, Li L, Luo X, Chen M, Tang W, Zhan L, et al. Ggtree: a serialized data object for visualization of a phylogenetic tree and annotation data. Imeta 2022;1(4):e56. pmid:38867905
  87. 87. Yu G. Using ggtree to visualize data on tree-like structures. Curr Protoc Bioinformatics 2020;69(1):e96. pmid:32162851
  88. 88. Yu G, Lam TT-Y, Zhu H, Guan Y. Two methods for mapping and visualizing associated data on phylogeny using Ggtree. Mol Biol Evol 2018;35(12):3041–3. pmid:30351396
  89. 89. Yu G, Smith DK, Zhu H, Guan Y, Lam TT. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol 2016;8(1):28–36.
  90. 90. Revell LJ. phytools 2.0: an updated R ecosystem for phylogenetic comparative methods (and other things). PeerJ. 2024;12:e16505. pmid:38192598
  91. 91. Meredith M, Kruschke J. HDInterval: highest (posterior) density intervals; 2022. Available from: https://CRAN.R-project.org/package=HDInterval
  92. 92. Bürkner P, Gabry J, Kay M, Vehtari A. posterior: tools for working with posterior distributions. 2023. Available from: https://mc-stan.org/posterior/
  93. 93. Delignette-Muller ML, Dutang C. fitdistrplus: An R package for fitting distributions. J Stat Soft. 2015;64(4):1–34.
  94. 94. Ratmann O, Wymant C, Colijn C, Danaviah S, Essex M, Frost SDW, et al. HIV-1 full-genome phylogenetics of generalized epidemics in sub-Saharan Africa: impact of missing nucleotide characters in next-generation sequences. AIDS Res Hum Retroviruses 2017;33(11):1083–98. pmid:28540766
  95. 95. Patiño-Galindo JÁ, González-Candelas F. The substitution rate of HIV-1 subtypes: a genomic approach. Virus Evol. 2017;3(2):vex029. https://doi.org/10.1093/ve/vex029 pmid:29942652
  96. 96. Brizzi A, Kagaayi J, Ssekubugu R, Abeler-Dörner L, Blenkinsop A, Bonsall D, et al. Age and gender profiles of HIV infection burden and viraemia: novel metrics for HIV epidemic control in African populations with high antiretroviral therapy coverage. medRxiv. 2024. pmid:38712115
  97. 97. Todd J, Cremin I, McGrath N, Bwanika J-B, Wringe A, Marston M, et al. Reported number of sexual partners: comparison of data from four African longitudinal studies. Sex Transm Infect. 2009;85(Suppl 1):i72–80. pmid:19307344
  98. 98. Pacold ME, Pond SLK, Wagner GA, Delport W, Bourque DL, Richman DD, et al. Clinical, virologic, and immunologic correlates of HIV-1 intraclade B dual infection among men who have sex with men. AIDS 2012;26(2):157–65. pmid:22045341
  99. 99. Wagner GA, Pacold ME, Vigil E, Caballero G, Morris SR, Kosakovsky Pond SL, et al. Using ultradeep pyrosequencing to study HIV-1 coreceptor usage in primary and dual infection. J Infect Dis 2013;208(2):271–4. pmid:23599311
  100. 100. Wagner GA, Pacold ME, Kosakovsky Pond SL, Caballero G, Chaillon A, Rudolph AE, et al. Incidence and prevalence of intrasubtype HIV-1 dual infection in at-risk men in the United States. J Infect Dis 2014;209(7):1032–8. pmid:24273040
  101. 101. Wagner GA, Chaillon A, Liu S, Franklin DR Jr, Caballero G, Kosakovsky Pond SL, et al. HIV-associated neurocognitive disorder is associated with HIV-1 dual infection. AIDS 2016;30(17):2591–7. pmid:27536983
  102. 102. Vesa J, Chaillon A, Wagner GA, Anderson CM, Richman DD, Smith DM, et al. Increased HIV-1 superinfection risk in carriers of specific human leukocyte antigen alleles. AIDS 2017;31(8):1149–58. pmid:28244954
  103. 103. Patel P, Borkowf CB, Brooks JT, Lasry A, Lansky A, Mermin J. Estimating per-act HIV transmission risk: a systematic review. AIDS 2014;28(10):1509–19. pmid:24809629
  104. 104. Grabowski MK, Lessler J, Bazaale J, Nabukalu D, Nankinga J, Nantume B, et al. Migration, hotspots, and dispersal of HIV infection in Rakai, Uganda. Nat Commun 2020;11(1):976. pmid:32080169
  105. 105. Ratmann O, Kagaayi J, Hall M, Golubchik T, Kigozi G, Xi X, et al. Quantifying HIV transmission flow between high-prevalence hotspots and surrounding communities: a population-based study in Rakai, Uganda. Lancet HIV 2020;7(3):e173–83. pmid:31953184
  106. 106. Kagaayi J, Chang LW, Ssempijja V, Grabowski MK, Ssekubugu R, Nakigozi G, et al. Impact of combination HIV interventions on HIV incidence in hyperendemic fishing communities in Uganda: a prospective cohort study. Lancet HIV 2019;6(10):e680–7. pmid:31533894
  107. 107. Martin MA, Reynolds SJ, Foley BT, Nalugoda F, Quinn TC, Kemp SA, et al. Population dynamics of HIV drug resistance during treatment scale-up in Uganda: a population-based longitudinal study. medRxiv. 2024. https://doi.org/10.1101/2023.10.14.23297021 pmid:39417110
  108. 108. Rambaut A, Posada D, Crandall KA, Holmes EC. The causes and consequences of HIV evolution. Nat Rev Genet 2004;5(1):52–61. pmid:14708016
  109. 109. Shriner D, Rodrigo AG, Nickle DC, Mullins JI. Pervasive genomic recombination of HIV-1 in vivo. Genetics 2004;167(4):1573–83. pmid:15342499
  110. 110. Song H, Giorgi EE, Ganusov VV, Cai F, Athreya G, Yoon H, et al. Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection. Nat Commun 2018;9(1):1928. pmid:29765018
  111. 111. Cunningham EB, Applegate TL, Lloyd AR, Dore GJ, Grebely J. Mixed HCV infection and reinfection in people who inject drugs–impact on therapy. Nat Rev Gastroenterol Hepatol 2015;12(4):218–30. pmid:25782091
  112. 112. van de Laar TJW, Molenkamp R, van den Berg C, Schinkel J, Beld MGHM, Prins M, et al. Frequent HCV reinfection and superinfection in a cohort of injecting drug users in Amsterdam. J Hepatol 2009;51(4):667–74. pmid:19646773
  113. 113. Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, et al. Evaluation of haplotype callers for next-generation sequencing of viruses. Infect Genet Evol. 2020;82:104277. pmid:32151775
  114. 114. Cohen T, van Helden PD, Wilson D, Colijn C, McLaughlin MM, Abubakar I, et al. Mixed-strain Mycobacterium tuberculosis infections and the implications for tuberculosis treatment and control. Clin Microbiol Rev 2012;25(4):708–19. pmid:23034327
  115. 115. Sobkowiak B, Glynn JR, Houben RMGJ, Mallard K, Phelan JE, Guerra-Assunção JA, et al. Identifying mixed Mycobacterium tuberculosis infections from whole genome sequence data. BMC Genomics 2018;19(1):613. pmid:30107785
  116. 116. Gabbassov E, Moreno-Molina M, Comas I, Libbrecht M, Chindelevitch L. SplitStrains, a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGS data. Microb Genom 2021;7(6):000607. pmid:34165419
  117. 117. Sobkowiak B, Cudahy P, Chitwood MH, Clark TG, Colijn C, Grandjean L, et al. A new method for detecting mixed Mycobacterium tuberculosis infection and reconstructing constituent strains provides insights into transmission. 2024.
  118. 118. Anyansi C, Keo A, Walker BJ, Straub TJ, Manson AL, Earl AM, et al. QuantTB–a method to classify mixed Mycobacterium tuberculosis infections within whole genome sequencing data. BMC Genomics 2020;21(1):80. pmid:31992201
  119. 119. Dezordi FZ, Resende PC, Naveca FG, do Nascimento VA, de Souza VC, Dias Paixão AC, et al. Unusual SARS-CoV-2 intrahost diversity reveals lineage superinfection. Microb Genom 2022;8(3):000751. pmid:35297757
  120. 120. Wertheim JO, Wang JC, Leelawong M, Martin DP, Havens JL, Chowdhury MA, et al. Detection of SARS-CoV-2 intra-host recombination during superinfection with Alpha and Epsilon variants in New York City. Nat Commun 2022;13(1):3645. pmid:35752633
  121. 121. Savitz DA, Wellenius GA. Can cross-sectional studies contribute to causal inference? It depends. Am J Epidemiol 2023;192(4):514–6. pmid:35231933
  122. 122. Janes H, Herbeck JT, Tovanabutra S, Thomas R, Frahm N, Duerr A, et al. HIV-1 infections with multiple founders are associated with higher viral loads than infections with single founders. Nat Med 2015;21(10):1139–41. pmid:26322580