Population-level HIV incidence estimates using a combination of synthetic cohort and recency biomarker approaches in KwaZulu-Natal, South Africa

Introduction There is a notable absence of consensus on how to generate estimates of population-level incidence. Incidence is a considerably more sensitive indicator of epidemiological trends than prevalence, but is harder to estimate. We used a novel hybrid method to estimate HIV incidence by age and sex in a rural district of KwaZulu-Natal, South Africa. Methods Our novel method uses an ‘optimal weighting’ of estimates based on an implementation of a particular ‘synthetic cohort’ approach (interpreting the age/time structure of prevalence, in conjunction with estimates of excess mortality) and biomarkers of ‘recent infection’ (combining Lag-Avidity, Bio-Rad Avidity and viral load results to define recent infection, and adapting the method for age-specific incidence estimation). Data were obtained from a population-based cross-sectional HIV survey conducted in Mbongolwane and Eshowe health service areas in 2013. Results Using the combined method, we find that age-specific HIV incidence in females rose rapidly during adolescence, from 1.33 cases/100 person-years (95% CI: 0.98,1.67) at age 15 to a peak of 5.01/100PY (4.14,5.87) at age 23. In males, incidence was lower, 0.34/100PY (0.00-0.74) at age 15, and rose later, peaking at 3.86/100PY (2.52-5.20) at age 30. Susceptible population-weighted average incidence in females aged 15-29 was estimated at 3.84/100PY (3.36-4.40), in males aged 15-29 at 1.28/100PY (0.68-1.50) and in all individuals aged 15-29 at 2.55/100PY (2.09-2.76). Using the conventional recency biomarker approach, we estimated HIV incidence among females aged 15-29 at 2.99/100PY (1.79-4.36), among males aged 15-29 at 0.87/100PY (0.22-1.60) and among all individuals aged 15-59 at 1.66/100PY (1.13-2.27). Discussion HIV incidence was very high in women aged 15-30, peaking in the early 20s. Men had lower incidence, which peaked at age 30. The estimates obtained from the hybrid method are more informative than those produced by conventional analysis of biomarker data, and represents a more optimal use of available data than either the age-continuous biomarker or synthetic cohort methods alone. The method is mainly useful at younger ages, where excess mortality is low and uncertainty in the synthetic cohort estimates is reasonably small. Conclusion Application of this method to large-scale population-based HIV prevalence surveys is likely to result in improved incidence surveillance over methods currently in wide use. Reasonably accurate and precise age-specific estimates of incidence are important to target better prevention, diagnosis and care strategies.


Introduction
There is a notable absence of consensus on how to generate estimates of population-level incidence. Incidence is a considerably more sensitive indicator of epidemiological trends than prevalence, but is harder to estimate. We used a novel hybrid method to estimate HIV incidence by age and sex in a rural district of KwaZulu-Natal, South Africa.

Methods
Our novel method uses an 'optimal weighting' of estimates based on an implementation of a particular 'synthetic cohort' approach (interpreting the age/time structure of prevalence, in conjunction with estimates of excess mortality) and biomarkers of 'recent infection' (combining Lag-Avidity, Bio-Rad Avidity and viral load results to define recent infection, and adapting the method for age-specific incidence estimation). Data were obtained from a populationbased cross-sectional HIV survey conducted in Mbongolwane and Eshowe health service areas in 2013.

Introduction
HIV epidemic surveillance largely relies on cross-sectional measurements of prevalence, often by means of representative household surveys. However, for a non-remissible condition with extended survival time like HIV, instantaneous prevalence reflects the epidemic trajectory (incidence, mortality and migration) over a significant period prior to the survey. Estimating HIV incidence-the most sensitive and informative indicator of current epidemiological trends-therefore poses significant methodological challenges. The 'gold standard' method of directly observing new infections in cohorts of HIV-negative individuals followed up over time are costly and logistically challenging, and it is difficult to ensure sufficient population representivity to ensure results can be generalised. Several alternative approaches have been proposed for estimating HIV incidence, including a 'synthetic cohort' approach-i.e. inferring incidence from the age and/or time structure of prevalence [1][2][3][4][5][6], from biomarkers for 'recent infection' measured in cross-sectional surveys [7][8][9][10][11], or using dynamical population models that have been calibrated to survey data [12][13][14][15]. No single method by itself achieves the desired levels of accuracy and precision [16].
In this work we develop a novel hybrid method which uses an 'optimal' weighting of, (a) an implementation of the 'synthetic cohort' approach of Mahiane et al. [6]-i.e. interpreting the age and time structure of prevalence, in conjunction with excess mortality-and (b) an adaptation of the Kassanjee et al. estimator for incidence from biomarkers of recent infection [8] that takes account of the age structure of recent infection (amongst HIV-positive individuals). The method of Mahiane et al. relies on the instantaneously exact, fully age-and time-structured, representation of the dynamical relation of prevalence, excess mortality and incidence. In the case of a relatively stable epidemic (i.e. relatively slow change in age-specific prevalence over time), the age structure of prevalence provides fairly precise age-specific incidence estimates.

Estimating incidence using biomarkers for 'recent' infection
We used calibration data from the Consortium for the Evaluation and Performance of HIV Incidence Assays (CEPHIA) to explore a range of recent infection case definitions based on combinations of LAg normalised optical density (ODn), Bio-Rad Avidity index (AI) and viral load thresholds, and selected an 'optimal' recent infection testing algorithm (RITA) based on the variance of the incidence estimates produced. The procedures and results are detailed in S1 Appendix. These show that a RITA that defines recent infection as NAT+/Ab− OR Ab+/ODn < 2.5/AI < 30/VL > 75 achieves a mean duration of recent infection (MDRI, adjusted for the sensitivity of the screening algorithm) of 217 days (95% CI: 192,244) and a context-specific false-recent rate (FRR) of 0.17% (95% CI: 0.05%,0.35%). We analyse sensitivity to imperfectlyestimated FRR in S2 Appendix.
This definition of recency was then employed to estimate incidence in the study population, by age group and sex, using the method of Kassanjee et al. [8]. The well-known Kassanjee et al. estimator, adapted for use in complex surveys by allowing the use of proportions and their standard errors, rather than survey counts, is given in Eq 1.
where P H is the prevalence of HIV, P R|+ is the proportion of recency tests performed on HIVpositive participants that produced a 'recent' result, O T is the MDRI and T is the FRR, and T is the chosen time cutoff beyond which a 'recent' result is considered 'falsely recent' by definition. Note that the product of P H and P R|+ is the overall prevalence of recency in the sample. This estimator is implemented in the inctools R package [23]. The documentation of inctools provides details on estimating the variance of incidence estimates using both delta method and bootstrapping approaches.
Owing to the small 'recent infection' case counts, statistical uncertainty reaches unacceptable levels when age groups are small, and we therefore estimated incidence using the conventional approach in 15 to 29 year-olds and 30 to 59 year-olds.
For the purpose of the combined method described below, we further adapted the estimator for age-dependent prevalence of recent infection, allowing us to estimate highly granular agespecific incidence using the recent infection biomarker data in the survey. Details are provided in the section on the combined method.

Estimating age-specific incidence using the Mahiane et al. 'synthetic cohort' method
We employ the incidence estimator of Mahiane et al. [6] to estimate incidence from the age structure of prevalence. The estimator was derived from the fundamental relationship between incidence, prevalence and mortality in a non-transient condition-with prevalence viewed as the accumulated incidence over time, accounting for the removal of prevalent cases from the population through condition-induced 'excess' mortality. This is shown using a simple dynamical SI-type model, where it is demonstrated that simply rearranging the differential equations describing change in the state variables for the susceptible and infected groups yields an estimator for incidence that relies only on prevalence and excess mortality (but critically, not total mortality). In an age-structured population, this approach yields the incidence estimator in Eq 2.
where p(a,t) is age and time-specific prevalence and δ(a,t) is age and time-specific excess mortality.
In a stable epidemic, where the age structure of prevalence is not changing at a significant rate in secular time (see Discussion section), the age-structure of prevalence from a single cross-sectional prevalence survey is informative, and the estimator can be simplified to: We obtained age-specific incidence estimates by fitting a regression model for prevalence as a function of age to finely-grained data (i.e., not using integer ages, but the difference in days between the birth date and interview date of each participant), using a generalised linear model with a cubic polynomial in age as predictors and a logit link: with g() the logit link function, so that and We fit the model, separately for males and females, to data from participants aged 15 to 34 years. This provided us with a continuous function, p(a), for 15 a < 35. We derived, for each sex, the function for excess mortality, δ(a), from age-specific AIDS mortality estimates for KwaZulu-Natal province produced by the Thembisa demographic model [24], allowing us to estimate age-specific incidence, λ(a).
Reproducibility of the incidence estimate at any given age was investigated by bootstrapping the dataset (reproducing the complex sampling frame employed in the survey), refitting the models and obtaining an incidence estimate for each of the 10,000 resampled datasets. The standard deviation of the obtained estimates was computed to approximate the standard error, and the 2.5th and 97.5th percentiles to approximate the 95% confidence interval.

Estimating age-specific incidence using the combined method
In order to estimate age-specific incidence by combining HIV prevalence data and biomarkers for recent infection, available in the same dataset, we estimated age-specific incidence (and its variance) using (1) the synthetic cohort method described above, and (2) an adaptation of the Kassanjee et al. estimator to age-structured recency biomarker data. The adapted estimator is shown in Eq 7.
where P H (a) is the age-specific prevalence of HIV (estimated as in the previous section), and P R|+ (a) is the age-specific prevalence of recency amongst HIV-positives (described below). In order to obtain the prevalence of recency as a function of age we fit a generalised linear regression model with log of age as linear predictor and a complementary log-log link. This functional form implies an exponential decline in the prevalence of recency with age, which captures the epidemiologically sensible assumption that at young ages larger proportions of infections were acquired in the recent past. The model has the functional form: with g() the link function, so that Owing to the use of prevalence in both estimates, the incidence estimates are necessarily correlated. We therefore resample the data (replicating the complex sampling frame), fit the models of P H (a) and P R (a), and at each age of interest, obtain the two incidence estimates, λ P (incidence from age-structured prevalence) and λ R (incidence from age-structured recency amongst positives). We then evaluate, at each age of interest, from 10,000 bootstrap iterations, the variances and covariance of the two incidence estimates, in the case of λ R further incorporating uncertainty in MDRI and FRR. We then compute a combined incidence estimate using a weighted average of the two estimates. The implied weighting function, W(a), derived from the 'optimal' weighting factors, W a (i.e. the weighting factor that minimises variance of the combined estimate) obtained at each evaluated age, is then convolved with the combined incidence function. At a particular age a, incidence from the combined method is given by Eq 10.
with 0 W a 1, and consequently no normalisation to total weight is required. The variance of the incidence estimate at a given age is obtained from Eq 11.
with ρ the Pearson's correlation coefficient between λ P and λ R at that age. The value of W that minimises total variance at the age of interest is obtained from the following formula, derived by setting Eq 11 to 0 and solving for W: The continuous incidence function λ(a) is then obtained by fitting a cubic interpolating spline (using the method of Forsythe, Malcolm and Moler) to estimated incidence, λ a , for all ages in the range 15 to 35 years, evaluated at steps of 0.25 years.
For comparability with conventional age-group estimates, 'average incidence' was estimated in age bins. The integral of the λ(a) function was evaluated over the age range for which average incidence was sought, and weighted using a weighting function reflecting (a) the sampling density, or (b) the susceptible population density, to obtain average incidence. For population weighting, the population by age and sex was obtained from the 2011 Census for Eshowe and Mbongolwane, and the susceptible population size estimated using prevalence estimates from the survey data. Susceptible population-weigthted estimates are presented as primary results.
The unweighted incidence spline function, λ(a) was weighted by a weighting function f(a), derived from either the sampling density or the susceptible population density, and the integral evaluated over the age range of interest (a 0 to a 1 ) in order to obtain weighted average incidence over that range, as shown in Eq 13.
The procedure was performed separately for males and females, and in order to obtain overall average incidence, these estimates were then further weighted using the weighting functions for males and females.
Defining total weights for the two sexes as W M ¼ R a 1 a 0 f M ðaÞ da and W F ¼ R a 1 a 0 f F ðaÞ da, for any age interval and weighting function, the total incidence is then given by Eq 14.
Confidence intervals were obtained by bootstrapping the data (10,000 iterations), and in each iteration estimating average incidence.
The source code utilised in this estimation procedure is made available in S1 Code.

Sensitivity analyses
In order to investigate the sensitivity of our analyses to uncertainty in the False-Recent Rate, we repeated the incidence estimation procedure using a range of FRRs between 0% and 1%. We further investigated sensitivity of average incidence to the weighting scheme. The implementation of the method developed in this paper does not take into account change in prevalence (and incidence) in the time dimension. This is valid when the epidemic is relatively stable and most of the information is captured in the age structure of prevalence. In order to investigate sensitivity to possible change over time in age-specific prevalence, we investigated a number of hypothetical scenarios in which age-specific prevalence is increasing or decreasing exponentially.
Sensitivity analyses are reported in S2 Appendix.  Tables 2  and 3. Age-specific estimates using the combined method are shown in the figures. Incidence estimates are presented as continuous functions of age for individuals aged 15-29, with the contributions of the age-continuous biomarker and synthetic cohort methods. Incidence in females rose steeply during the teenage years, from 1.31 cases/100PY (0.97,1.66) at age 15 to a peak of 4.95/100PY (4.09,5.81) at age 23. Incidence was lower-but still very high-in women in their late twenties and early 30s, with estimated incidence of 4.50/ 100PY (3.07,5.92) at age 30. Uncertainty in the estimates increased with age (with a standard error of approximately 0.7 at age 30, compared to 0.4 at age 23). Estimates were very imprecise for ages over 30: at age 35, incidence was estimated at 2.78/100PY, with a standard error of 1.67, resulting in a 95% CI of 0.00,6.06. Age-specific incidence in teenaged males was substantially lower than in females, estimated at 0.32 cases/100PY (0.00,0.65) at age 15, and rising sharply from the early twenties, peaking at 4.10/100PY (2.75,5.46) at age 30. Incidence in males aged 23 was estimated at 1.39/100PY (0.95,1.82). Overall incidence estimates reflect the estimates for males and females so that estimated incidence at age 15 was 0.82 cases/100PY (0.64,1.00), peaked at 4.47/100PY (3.52,5.41) at age 29, and was 4.42/100PY (3.37,5.47) at age 30.

Discussion
This study describes a novel hybrid method that allows for reasonably precise estimation of age-specific incidence up to about age 30 years. It constitutes a significant improvement over conventional cross-sectional incidence estimation using biomarkers of recent infection, where small case counts limit informative estimates to large age bins. We confirm previously-described very high incidence among young women and also among slightly older young men. A compartmental mathematical model developed by Blaizot et al. [25] produced similar incidence estimates by sex and age group when calibrated to the same data [26]. In females, incidence peaked at age 23, and in males at age 30. We have previously described that young people were more likely to transmit HIV. In the same survey, among individuals aged 15-19 years and 20-34 years 34% and 35% respectively were unaware of their HIV status and 66% and 53% were virally unsuppressed; both factors were associated with higher-risk sexual behaviour [20]. Precise age-specific incidence estimates are important to identify the age and gender groups most at risk. These findings highlight the need for targeted prevention and HIV testing strategies for girls and young women, as well as men aged 20 to 40 years.
The conventional biomarker-based approach does not allow finely-grained age-specific incidence estimation, since small case counts (or sample proportions) result in very wide confidence intervals. Even analysis of the data in five-year age bins produce estimates that cannot be clearly distinguished from zero. Our adapted age-continuous biomarker estimates provide reasonably reproducible estimates in younger individuals, where the parameterisation of the prevalence of recent infection (amongst HIV-positive individuals) is likely to be sound. However, this method would be more challenging to implement in older individuals, where the distribution of recent infections is more complicated. The synthetic cohort method provides additional information on incidence, and in certain age ranges is in fact more informative than the biomarker method. As can be seen in the figures, at younger ages the two estimates are very similar, but diverge at older ages. At younger ages the synthetic cohort method has greater precision (estimates have lower variance). In females, the combined estimate is weighted in favour of the synthetic cohort method throughout the age range 15 to 29, but with more heavily skewed at younger ages (weighting factor of 0.84 at age 15 and 0.54 at age 29), whereas in males the weighting tips towards the biomarker method at age 25.
The idea of using demographic structure of prevalence data to infer incidence is certainly not new. Williams et al. [1] developed something very close to the approach we are takingthe main difference being their proposal (in light of data available at that time) to use ageaveraged rather than age-specific estimates of time dependence of prevalence. We follow the instantaneously exact, fully age and time-structured, representation of the relation of prevalence, mortality and incidence that was introduced in Mahiane et al. [6]. That paper also considered the previously-published methods of Brunet and Struchiner [2,3], Hallett et al. [4], and Brookmeyer and Konikoff [5], all of which were found to have substantial biases, noted to be the result of their various particular forms of dynamical approximation-essentially using assumptions of constant prevalence in age and time ranges.
An advantage of the hybrid approach is that it combines both age-specific HIV prevalence data and biomarker data, thus reducing the risk of bias in HIV incidence estimation. By combining the estimates from the two methods, age-specific incidence can be estimated with significantly greater precision than with the biomarker method alone. For example, at age 15 in females, the standard error on the age-continuous biomarker estimate of 0.91 cases/100PY is 0.42 (i.e. a coefficient of variation of 46%) and the standard error on the weighted average of 1.31/100PY is 0.18 (CoV = 13%). The narrower confidence bounds around the combined method estimates can be clearly seen in the figures. At certain ages, there is a very substantial improvement, for example in males aged 22, the CoV on the biomarker estimate is 47%, while on the combined estimate it is 18%. The precision of the combined method is not greatly enhanced over that of the synthetic cohort method, but estimates are likely to be more accurate, especially where information on change in the age structure of prevalence over time is not available, which may bias estimates.
For the recent infection case definition adopted for this analysis we estimated, using CEPHIA calibration data, a very small context-specific false-recent rate. Unfortunately, the FRR is also the test property that is hardest to estimate, and where the transferability from calibration data to the surveyed population is most problematic. While we adopted a sophisticated approach to context adaptation of the test property estimates, these challenges remain. For present purposes we assumed that the test properties (MDRI and FRR) do not vary with age, although it is likely that the longer (on average) time-since-infection in older individuals would impact the FRR and that biological changes in the immune system would impact the MDRI. In estimating context-specific FRR we make assumptions about the population-level distribution of times-since-infection, but a lack of data on past incidence precludes a more nuanced age-specific FRR estimate. This limitation is addressed by means of a sensitivity analysis with respect to FRR, as reported in S2 Appendix. Given the very low population-level FRR estimate, it is unlikely that this assumption introduces substantial bias. The sensitivity analyses indicate that our results are not highly sensitive to the false-recent rate, although it becomes more so at older ages, where the combined method relies more on the biomarker-based estimate.
While the use of pooled nucleic acid amplification testing increases the sensitivity of the screening algorithm, this strategy adds considerably to the cost. Defining NAAT yield (acutely infected) cases as recently infected also added approximately 15 days to the MDRI of the RITA [27]. However, we identified only two acute infections, and it is not clear that this strategy is feasible in most large population-based surveys.
The impact of ART on recency biomarkers is well-established-treated individuals tend to undergo partial seroreversion resulting in "false" recent classifications. The inclusion of a viral load threshold in the RITA ameliorates this problem, resulting in a very low FRR, although calibration data are lacking on treated but virally unsuppressed individuals (see S1 Appendix). The increasing adoption of early treatment ("Universal Test-and-Treat") has the potential to impact the MDRI of RITAs that classify treated (or virally suppressed) individuals as nonrecent, although at the time of this survey, very few (if any) individuals in the study population would have received ART within two years of infection. The impact of ART on the synthetic cohort method is likely to be largely innocuous, as long as accurate age-specific excess mortality estimates are employed. Increasing uptake of early ART would further reduce the already low excess mortality in young HIV-infected individuals.
A major limitation of this study is that we are analysing data from a single cross-section survey, providing no information on change in the (age-structured) prevalence over time. A second survey is planned in the study population, which would allow future analyses to be conducted that explicitly incorporate change over time. We investigated the sensitivity of our estimates to prevalence changes in time, and found that estimates from the combined method are not very sensitive to plausible rates of change in prevalence at the time of the survey. However, if the assumption of a stable epidemic were violated and rapid increases or declines in prevalence were taking place at the time of the survey, our method would exhibit significant bias. It would therefore be preferable to explicitly incorporate the time dimension in the analysis (by using data from serial prevalence surveys) and it is essential that the version of the method that ignores time is only applied in settings where the assumption of stability is sound.
Further, estimates become very uncertain at ages over 30 (see Fig 4), resulting in synthetic cohort, biomarker and combined method estimates with confidence intervals that stretch from close to zero to very large values. The failure of the method to provide informative estimates at higher ages requires further investigation. This limitation may, in part, reflect the particular parameterisation of regression models for HIV prevalence and for the prevalence of 'recent infection' used in the present analysis.

Conclusion
This analysis demonstrates the value of age-structured prevalence data, when reasonable estimates of excess mortality are available, and that when additional biomarkers of recent infection are available these can be sensibly incorporated into age-specific incidence estimates. The novel hybrid method used in this analysis can be extended to allow the analysis of serial prevalence (and, when available, recent infection) data, without significant further conceptual development, for maximally informative incidence estimation. Application of this method to large-scale population-based HIV prevalence surveys is likely to result in improved incidence surveillance over methods currently in wide use. Reasonably accurate and precise age-specific estimates of incidence are important to target better prevention, diagnosis and care strategies.
Supporting information S1 Appendix. Optimal RITA identification and calibration. (PDF) S2 Appendix. Sensitivity analyses. (PDF) S1 Dataset. Anonymised dataset. The variables id, cluster and ward are randomised participant, cluster (primary sampling unit) and electoral ward (stratum) identifiers, respectively. To replicate the multistage sampling frame during boostrapping, wards were resampled with replacement, and within each sampled ward clusters were resampled with replacemnent. The age_years variable captures age in years, rounded to whole numbers as a safeguard against deanonymisation. Final HIV status is captured in hiv_status and participants identified as HIVinfected using nucleic acid amplification testing have "True" in naat_yield. Final LAg normalised optical density and Bio-Rad Avidity avidity index are captured in LAgODn and BioRadAI, respectively, and viral load in viral_load. For convenience, recency status according to the RITA utilised in this analysis is captured in recent. (CSV) S1 Code. Source code used in this analysis. https://github.com/SACEMA/incidencecombined-method. (ZIP) Investigation: Gilles van Cutsem, Adrian Puren, Tom Ellman, Jean-François Etard, Helena Huerga.