CD4 Enumeration Technologies: A Systematic Review of Test Performance for Determining Eligibility for Antiretroviral Therapy

Background
Measurement of CD4+ T-lymphocytes (CD4) is a crucial parameter in the management of HIV patients, particularly in determining eligibility to initiate antiretroviral treatment (ART). A number of technologies exist for CD4 enumeration, with considerable variation in cost, complexity, and operational requirements. We conducted a systematic review of the performance of technologies for CD4 enumeration.

Methods and Findings
Studies were identified by searching electronic databases MEDLINE and EMBASE using a pre-defined search strategy. Data on test accuracy and precision included bias and limits of agreement with a reference standard, and misclassification probabilities around CD4 thresholds of 200 and 350 cells/μl over a clinically relevant range. The secondary outcome measure was test imprecision, expressed as % coefficient of variation. Thirty-two studies evaluating 15 CD4 technologies were included, of which fewer than half presented data on bias and misclassification compared to the same reference technology. At CD4 counts <350 cells/μl, bias ranged from -35.2 to +13.1 cells/μl while at counts >350 cells/μl, bias ranged from -70.7 to +47 cells/μl, compared to the BD FACSCount as a reference technology. Misclassification around the threshold of 350 cells/μl ranged from 1% to 29% for upward classification, resulting in under-treatment, and from 7% to 68% for downward classification, resulting in over-treatment. Fewer than half of these studies reported within-laboratory precision or reproducibility of the CD4 values obtained.

Conclusions
A wide range of bias and percent misclassification around treatment thresholds were reported on the CD4 enumeration technologies included in this review, with few studies reporting assay precision. The lack of standardised methodology on test evaluation, including the use of different reference standards, is a barrier to assessing relative assay performance and could hinder the introduction of new point-of-care assays in countries where they are most needed.


Introduction
The increased availability of antiretroviral therapy (ART) has resulted in major reductions in morbidity and mortality in high HIV burden settings. Through significant global scale-up, access to ART is increasing, with around 10 million people in low- and middle-income settings receiving treatment as of the end of 2013, an estimated 65% of the global target of 15 million people set for 2015 [1].
CD4+ T-lymphocytes, also known as helper T-cells, are the coordinators of the immune response that protects the body against microbial disease, a variety of autoimmune diseases and some forms of cancer. The destruction of CD4+ T-lymphocytes by HIV is the main cause of the progressive weakening of the immune system in HIV infection, and leads ultimately to acquired immune deficiency syndrome (AIDS). The CD4 count is a strong predictor of HIV progression to AIDS and death, and is considered the best laboratory marker for deciding when to initiate ART [2,3]. The use of clinical staging alone to determine the timing of ART initiation is limited by the unreliable correlation between asymptomatic or mild disease and short-term prognosis and may result in dangerous delays in treatment initiation in those without symptoms but with severe immune suppression [4].
Prior to 2013, the World Health Organization (WHO) recommended ART initiation in all HIV-infected individuals whose CD4 count had dropped to 350 cells/μl or below, irrespective of clinical symptoms. The WHO 2013 guidelines raised the threshold for ART initiation to 500 cells/μl, with priority given to those with a CD4 count of 350 cells/μl or below. This change is consistent with emerging data indicating a clinical and public health benefit of earlier treatment, and forms part of a global effort to get 15 million HIV patients on ART by the end of 2015 [5].
A number of technologies are available for CD4 enumeration, with considerable variation in cost, complexity, and operating requirements (Tables 1-5). The traditional approach to calculating absolute CD4+ T lymphocyte counts is to use the total leukocyte count (or lymphocyte count) obtained from the hematology analyzer and then use the percentage CD4+ T lymphocytes from the flow cytometric analysis to calculate the absolute values, the so-called "dual platform" (DP) approach. Quite often, however, two separate samples are used in the procedure, one to obtain the total leukocyte count using a hematology analyzer and one to undertake the flow cytometry, each having its own in-built variation. When the results from each are combined to determine the absolute CD4+ T lymphocyte count, the variation is compounded such that inter-laboratory variation between centers can be as high as 40%. The need to derive accurate and precise absolute CD4+ T lymphocyte counts has therefore led to the development of instruments that can produce both percentage and absolute values, termed the "single platform" (SP) approach.
Two SP approaches are in widespread use today: volumetric and bead based. The principle of the volumetric approach is that a known volume of sample is passed through the flow cell (and interrogated by the laser beam) in a known amount of time. The alternative approach is to use bead-based technologies, where a known number of beads is added to a known volume of sample, thus allowing calculation of the bead-to-cell ratio and the subsequent calculation of the absolute cell count, in this instance, of CD4+ T lymphocytes. Important features of any absolute counting system are pipetting accuracy and minimal sample manipulation. The introduction of SP technologies has had a beneficial effect and lowered inter-laboratory variation in CD4 enumeration. It is critical that country programmes consider whether these tests can give accurate and reproducible results as well as being appropriate for the setting [6]. In particular, we focussed on how bias and misclassification probabilities of different CD4 assays may affect eligibility for ART initiation. Misclassification probabilities give clinically useful measures of test performance and should be reported in evaluations of CD4 technologies. An upward misclassification around a treatment threshold means that a patient who should be eligible for treatment would be denied treatment, while a downward misclassification means that a patient would become eligible for treatment earlier than indicated. To date, there have been no systematic reviews of the performance of CD4 technologies. Here we provide an evaluation of the performance characteristics of CD4 technologies through a systematic review of published literature.
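The dual-platform and single-platform calculations described in the Introduction can be illustrated with a minimal Python sketch. The function names, argument names and the optional dilution factor are hypothetical and chosen for illustration only; they are not taken from any instrument's software, and real assays apply manufacturer-specific gating and correction steps that are not modelled here.

```python
def dual_platform_cd4(lymphocytes_per_ul: float, cd4_percent: float) -> float:
    """DP approach: absolute lymphocyte count (hematology analyzer)
    multiplied by the CD4 percentage (flow cytometer)."""
    return lymphocytes_per_ul * cd4_percent / 100.0


def bead_based_cd4(cd4_events: int, bead_events: int,
                   beads_added: float, sample_volume_ul: float) -> float:
    """SP bead-based approach: the cell-to-bead event ratio scaled by the
    known bead concentration (beads added per microlitre of sample)."""
    if bead_events <= 0 or sample_volume_ul <= 0:
        raise ValueError("bead_events and sample_volume_ul must be positive")
    return (cd4_events / bead_events) * (beads_added / sample_volume_ul)


def volumetric_cd4(cd4_events: int, volume_analysed_ul: float,
                   dilution_factor: float = 1.0) -> float:
    """SP volumetric approach: events counted in a known analysed volume.
    dilution_factor is an assumed correction for sample dilution."""
    if volume_analysed_ul <= 0:
        raise ValueError("volume_analysed_ul must be positive")
    return cd4_events / volume_analysed_ul * dilution_factor
```

In the DP function two independent measurements are multiplied, which is why the variation of each platform carries through to the final count; the SP functions derive the count from a single acquisition.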

Methods
We performed a systematic review of studies evaluating the performance of CD4 enumeration technologies. A search of the Cochrane Library and the Centre for Reviews and Dissemination databases, including the Database of Abstracts of Reviews of Effects (DARE), the National Health Service Economic Evaluation Database (NHS EED) and the National Institute for Health Research Health Technology Assessment (NIHR HTA) database found no existing reviews addressing the review objective. We followed standard guidance in performing the review [7]. Objectives and methods of the review were documented in a review protocol, which is included as S1 Methods.

Eligibility criteria
Eligibility criteria were defined using the PICOS (Population, Interventions, Comparisons, Outcomes, Study Design) format. Studies evaluating the accuracy and/or precision of any CD4 technology commercially available at the time of the review were considered eligible for inclusion. Currently, no "gold" standard technology or internationally recognised reference preparation exists for CD4 enumeration, and a wide range of flow cytometric technologies have been used as comparators [6].
For the purposes of this review, we included studies that used as reference technologies any flow cytometric method considered to be acceptable by the WHO HIV diagnostics working group named in the review protocol (S1 Methods).

Information Sources
Studies were identified by searching two electronic databases (MEDLINE and EMBASE), scanning the reference lists of the Nature supplement Evaluating diagnostics: the CD4 guide, and inviting the WHO working group, whose members are authors of the Nature supplement, to identify relevant studies for the review [6,8].

Study selection
Articles were exported from the search database to EndNote and screened for relevance (Fig. 1). Data were extracted by two independent reviewers (SG and KS) and disagreements resolved through consensus.

Data extraction
The following data were extracted: study location, index test, reference test, and population (HIV positive or HIV positive and negative). Data on accuracy and precision included bias or mean difference and limits of agreement, misclassification probabilities (when sensitivity or specificity values were given, misclassification probabilities were calculated), and coefficient of variation. Where possible, HIV positive data alone were extracted. Where this was not possible, data from the combined HIV positive and negative population were extracted. Studies should report not only percent misclassification around clinically important CD4 cell thresholds (e.g., 200, 350 or 500 cells/μl), but also the magnitude of these misclassifications.
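As an illustration of the accuracy measures extracted, the following Python sketch computes bias and limits of agreement from paired counts and converts sensitivity and specificity into misclassification probabilities. It rests on two assumptions that the review itself does not spell out: limits of agreement are taken in the conventional Bland-Altman sense (mean difference ± 1.96 standard deviations of the differences), and a "positive" result is taken to mean a count below the treatment threshold. All function names are hypothetical.

```python
from statistics import mean, stdev


def bias_and_limits(index_counts, reference_counts):
    """Return (bias, lower limit, upper limit) for paired CD4 counts.
    Bias is the mean of (index - reference); the limits assume the
    Bland-Altman convention of bias +/- 1.96 SD of the differences."""
    if len(index_counts) != len(reference_counts):
        raise ValueError("paired measurements must have equal length")
    diffs = [i - r for i, r in zip(index_counts, reference_counts)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD; needs at least two pairs
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def misclassification_from_accuracy(sensitivity, specificity):
    """Assumed mapping: sensitivity is the proportion of reference counts
    below the threshold that the index test also places below it, so
    upward misclassification = 1 - sensitivity, and
    downward misclassification = 1 - specificity."""
    return 1.0 - sensitivity, 1.0 - specificity
```

For example, a study reporting 84% sensitivity and 80% specificity at 350 cells/μl would, under these assumptions, correspond to 16% upward and 20% downward misclassification.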
The secondary outcome measure addressed precision or reproducibility. Precision is particularly important when following a patient's serial measurements using the same technology. Precision can be measured within-laboratory or between-laboratories and is expressed as percent coefficient of variation (%CV).
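The %CV for a set of replicate measurements can be sketched as below. The use of the sample standard deviation (n - 1 denominator) is an assumption, since individual studies may have used other estimators, and the function name is hypothetical.

```python
from statistics import mean, stdev


def percent_cv(replicates):
    """Percent coefficient of variation: 100 * SD / mean of replicate
    CD4 counts. Uses the sample SD, so at least two replicates are needed."""
    m = mean(replicates)
    if m == 0:
        raise ValueError("mean of replicates is zero; %CV is undefined")
    return 100.0 * stdev(replicates) / m
```

Because %CV scales the standard deviation by the mean, the same absolute spread gives a larger %CV at low CD4 counts than at high counts.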
Studies meeting inclusion criteria were also assessed for bias and quality on ten points drawn from the STARD guidelines (Fig. 2) [9]. This review has been reported following the PRISMA statement guidance for reporting of systematic reviews [10,11].

Results
A summary of different commercially available CD4 technologies, including their assay principles, operational characteristics and compatibility with international external quality assurance programme reagents, is shown in Tables 1-5.

Study selection
This systematic review was first performed in July 2009. Of the 433 studies in the search, 345 were excluded as they were not performance evaluations. After further triage, 20 studies that measured bias, misclassification and/or %CV were accepted for inclusion in this review (PRISMA flow diagram, Fig. 1). A second search was conducted in April 2013 using the same search strategy and review protocol with the goal of capturing more point-of-care (POC) CD4 enumeration technologies in the review. An additional 12 studies were included. A summary of data extracted from all eligible studies with data on bias, misclassifications and/or %CV is shown in S1 Dataset.

Study characteristics
A summary of study characteristics is shown in Table 6.

Methodological quality of included studies
The findings of the quality assessment of included studies are summarised in Fig. 2. Most studies reported the index test (test under evaluation) and the reference standard in sufficient detail to be reproduced, but few studies reported whether staff at the evaluation sites were proficient at performing the reference standard and/or sufficiently trained on performing the index test. Few studies reported internal quality controls being performed during the evaluation period. Without these quality measures, it would be difficult to differentiate whether the bias or misclassification between the index and reference tests was due to differences in inherent test characteristics or to operator error.
Manufacturer involvement was evident in a number of studies. Seven studies declared one or more authors to be affiliated with the manufacturer of the index test [12,17,39,42,46,54]. Four studies were partially sponsored by the manufacturer [25,26,45,53]. One study stated that the manufacturer's site was one of the study sites [42], and four studies declared donation of reagents or equipment by the manufacturer [15,29,38,40]. A further four studies could be considered to be calibration or test developers' papers [30,49,51,55]. In the absence of definitions for a sponsored study versus an independent evaluation, it is not clear to what extent the inclusion of manufacturers as co-authors of papers influenced the study results.

Accuracy
As there is no international standard for CD4 enumeration, a variety of reference standard technologies were used for evaluating the performance of new CD4 technologies, making it difficult to pool data on bias and misclassification across all studies.

Bias
Bias (mean difference) data were collated and represented graphically, but only from studies that compared the index test's mean difference to the same reference technology (Table 6 and Fig. 3a, b). FACSCount and FACSCalibur were the most common reference technologies.
When the overall bias for all CD4 technologies was calculated, it ranged from -70.7 to +47 cells/μl for CD4 counts >350 cells/μl and from -35.2 to +13.1 cells/μl for CD4 counts <350 cells/μl, compared to the FACSCount as a reference method.
The four studies that reported data at <200 cells/μl and compared the Guava Easy CD4 to the FACSCount all had a positive bias, ranging from +10 to +45.5 cells/μl [23,25-27]. One study had data for CD4 counts >200 cells/μl and found a bias of +44.9 cells/μl (limits of agreement -112.6 to +212.3) [25]. All studies reporting the performance of the Guava Easy CD4 showed that this assay overestimated CD4 counts compared to the FACSCount [23,25-27].

Misclassification
Two studies [63,64] evaluated the Auto40 Flow Cytometer compared to the FACSCalibur in Cameroon. One study [63] reported upward and downward misclassification probabilities of 8% and 2%, respectively, while another [64] found the likelihood of under-treatment (upward misclassification) to be 3% and the probability of over-treatment (downward misclassification) to be 2%.
Karcher et al. conducted a large trial in field conditions in Uganda comparing BC Cytospheres and the Partec CyFlow Counter to DP flow cytometry [16]. HIV positive patients were recruited, the majority of whom had CD4 counts within a range from 0 to 1200 cells/μl (median 332 cells/μl). Of samples with CD4 counts <350 cells/μl measured by DP flow cytometry, BC Cytospheres misclassified 16% of the patients as having CD4 counts of >350 cells/μl, thereby denying them eligibility for treatment. Similarly, of those with CD4 counts >350 cells/μl measured by DP flow cytometry, BC Cytospheres misclassified 20% downward, indicating that 20% of patients not qualifying for treatment using the reference test would have done so if BC Cytospheres were used. Karcher et al. also compared the Partec CyFlow Counter to the DP flow cytometer and found that of samples with counts <350 cells/μl, 29% were misclassified as having >350 cells/μl by the Partec CyFlow Counter, and 7% of samples with CD4 counts >350 cells/μl were misclassified downward as being <350 cells/μl [16].
Lutwama et al. conducted a large study of manual technologies in Uganda; they recruited only HIV positive patients, the majority of whom had CD4 counts within the clinically important range (range in study: 0-900 cells/μl) using the reference standard technology [18]. Of samples with counts of <350 cells/μl using the reference standard technology, BC Cytospheres misclassified only 1% as >350 cells/μl. However, of those with counts >350 cells/μl, 68% were misclassified as <350 cells/μl (indicating that 68% of patients not qualifying for treatment using the reference test would have done so if BC Cytospheres were used). Dynal Dynabeads had upward and downward misclassification probabilities of 6% and 30% respectively. Renault et al. (2010) conducted a comparison study between the Guava Easy CD4 and FACSCount. Across a range of CD4 (0-1100 cells/μl), the upward and downward misclassification was calculated and found to be 6.1% and 9.4%, respectively [27].
One study reported on evaluations of the PointCare NOW assay (this instrument has since been re-marketed/rebranded as HumaCount CD4now; Human Diagnostics Worldwide mbH, Wiesbaden, Germany) in five countries (Mozambique, Belgium, Canada, USA and South Africa) [60]. Mozambique, Belgium, Canada and USA used the FACSCalibur as a reference standard while South Africa compared the PointCare NOW to the Epics XL. Upward and downward misclassification were reported by country: Mozambique, +51%, -20%; Belgium, +62%, -4%; Canada, +50%, -0%; USA, +0%, -3%; and South Africa, +64%, -6%. Overall misclassification was also calculated: testing with PointCare NOW would have led to 47% of patients with CD4 counts less than 350 cells/μl being classed as ineligible for treatment (upward misclassification) and 6% of patients with CD4 counts greater than 350 cells/μl being classed as eligible for treatment (downward misclassification).
Of the two studies evaluating the Pima Analyzer, Herbert et al. (2013) used the BC Cytomics FC 500 as a reference standard while Jani et al. compared the Pima Analyzer to the BD FACSCalibur [56,61]. Across the clinically relevant range (60-1200 cells/μl), Herbert et al. found the upward and downward misclassification to be 6.1% and 9.4% respectively. MBio Snap Count was evaluated by Logan et al. (2013) and compared to the FACSCalibur [62]. Of the 94 samples, 2.1% were misclassified upward and 3.2% were misclassified downward at a threshold of 350 cells/μl.
Even though misclassification probabilities can be influenced by the number of patients with CD4 counts close to the threshold in each study, PointCare NOW showed an overall tendency towards upward misclassification at both thresholds of 200 and 350 cells/μl. Most other technologies showed misclassification probabilities of <10%.
Five studies reported between-laboratory precision using whole blood. Two were studies evaluating BD FACSCount [39,42], one evaluated Panleucogating [50], and two were studies evaluating bead-based SP technology (BD Trucount tubes and BC Flow-Count fluorospheres) [46,47]. Gernow et al. studied the reproducibility of BC Cytospheres and found poor precision, with a coefficient of variation of 58% [15]. The study by Landay et al., however, found precision levels more in keeping with the other technologies [17].
Overall, in studies addressing between-laboratory precision, SP flow cytometry using BD Trucount tubes or BC Flow-Count fluorospheres showed less inter-laboratory variability than the DP comparators [46,47]. In addition, Denny et al. demonstrated improved inter-laboratory precision using DP Panleucogating compared with technologies which included DP or SP conventional flow cytometry [50]. External quality assurance data showed that the BD FACSCount, which is most commonly used as a reference standard for CD4 assay evaluations, has within-laboratory and between-laboratory precision of 15% or less [38,39,42,46].

Discussion
This review highlights the difficulties of answering clinically relevant questions about CD4 test performance from the published literature. A minority of studies reported clinically useful measures of accuracy, and few POC tests were carried out under field conditions.
Whatever technology is chosen, there is variability associated with CD4 measurement. There is also significant physiological variability in CD4 count, which may account for as much as, if not more than, the technical variability of CD4 measurement [65-67]. Performance characteristics vary between technologies and for the same technology depending on the reference technology used as a comparator. These characteristics have important implications both for individual patient management and for HIV treatment programmes. It is essential to consider assay performance as well as operating characteristics when choosing a technology. However, these data are not always available in the literature, and currently, evaluations are not sufficiently robust or comprehensive to give a clear idea of the comparative merit of different technologies.
Misclassification probabilities describe the likelihood that a test will incorrectly categorise a result as higher or lower than a given cut-off value as measured by a reference standard. They are clinically relevant measures of accuracy, as they can be used to assess the likelihood that a patient will be incorrectly classified above or below a defined CD4 threshold used in clinical decision making. Misclassification probabilities for the same assay can vary not only because the test is compared to different reference standards, but also because the probabilities are affected by the number of samples clustered around the thresholds of 200 or 350 cells/μl. Two types of misclassification can be defined: upward misclassification and downward misclassification. Upward misclassification around a treatment threshold may be the most clinically important, leading to a delay in starting ART in some patients, with potentially harmful consequences. Downward misclassification, on the other hand, would be expected to lead to ART use earlier than indicated, with potential implications for cost and drug exposure. Given the trend towards earlier initiation in global and national guidelines, a degree of over-treatment is likely to be preferred over significant under-treatment [68]. Furthermore, the use of CD4 counts alone to assess ART immunological failure in the absence of viral load monitoring will, because of the biases observed, lead to some patients not receiving the appropriate clinical intervention.
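The two misclassification probabilities defined above can be computed from paired index and reference counts, as in the following Python sketch. The function name is hypothetical, and the treatment of a count exactly equal to the threshold (counted here as at or below the threshold, i.e. eligible) is an assumption, as the included studies did not necessarily handle this boundary in the same way.

```python
def misclassification_probabilities(index_counts, reference_counts,
                                    threshold=350.0):
    """Return (upward, downward) misclassification probabilities.

    upward:   proportion of samples at or below the threshold by the
              reference test that the index test places above it
              (potential under-treatment).
    downward: proportion of samples above the threshold by the reference
              test that the index test places at or below it
              (potential over-treatment).
    A probability is None when its denominator is empty.
    """
    if len(index_counts) != len(reference_counts):
        raise ValueError("paired measurements must have equal length")
    pairs = list(zip(index_counts, reference_counts))
    eligible = [i for i, r in pairs if r <= threshold]
    ineligible = [i for i, r in pairs if r > threshold]
    upward = (sum(i > threshold for i in eligible) / len(eligible)
              if eligible else None)
    downward = (sum(i <= threshold for i in ineligible) / len(ineligible)
                if ineligible else None)
    return upward, downward
```

Each probability is conditioned on the reference classification, so a study whose samples cluster near the threshold will tend to show higher values than one dominated by very low or very high counts, even for the same assay.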
Misclassification data showed that manual technologies [18], particularly the method using BC Cytospheres, were associated with substantial downward misclassification. It would therefore be expected that the implementation of these tests would lead to the decision to treat potentially large numbers of additional patients who have CD4 counts above the guideline threshold when using the reference test. Less upward misclassification was seen, suggesting that under-treatment might be less of a problem with these technologies. Upward misclassification by either manual technology is however likely to be an underestimate as the majority of counts in this study were very low (less than 25% of samples had counts >200 cells/μl); if the tests were to be used in a population with counts closer to the treatment threshold (as might be the case if used primarily in asymptomatic patients), upwards misclassification would be expected to be higher.
Limited misclassification data were available for the Partec CyFlow instruments. Of concern, one study evaluating the Partec CyFlow Counter under field conditions found 29% upward misclassification; that is, 29% of patients potentially eligible for treatment may have been denied treatment if the Partec CyFlow Counter-determined CD4 counts were the only criterion for assessing eligibility [16]. No other studies of the Partec CyFlow Counter or other CyFlow instruments reported misclassification probabilities at 350 cells/μl. Therefore, we do not know if this finding was replicated elsewhere. The Guava PCA (using EasyCD4 reagents) and the Pima Analyzer showed acceptable upward and downward misclassification rates.
There is some disparity in precision reported for the BC Cytospheres, and the reason for this disparity is not clear. The CD4 counts of the 19 samples used for replicate analysis in the study conducted by Gernow et al. were not stated; however, they included 12 HIV negative samples that might be assumed to have high counts [15]. Given that several papers found BC Cytospheres to have poorer performance at higher counts, this may be in keeping with poor reproducibility in these replicates. Another study, conducted by Landay et al., found better precision on a sample with a CD4 count of 1200 (%CV 3.5%) than on a sample with a CD4 count of 200 (%CV 10.8%) [17]. Manual methods, although employing simple technology, are labor intensive and require significant user skill. Inadequate training, lack of supervision, or user fatigue may lead to poor performance of these techniques. Neither study described the training received by technicians performing the manual tests, nor reported blinding. Between-laboratory precision is likely to be superior with SP technologies (using BD Trucount tubes or BC Flow-Count fluorospheres) than with conventional DP technologies. As more point-of-care devices are introduced to lower levels of the health system, where training and supervision can be challenging, the lack of adequate training and supervision may introduce additional sources of error, contributing to decreased assay precision. This should be addressed through the development of a comprehensive training and supervision policy and implementation plan for the introduction of POC devices. In addition to the studies presented in this review, evaluations have been performed by government agencies and other national bodies that have not been published in peer-reviewed journals. An evaluation of BC Cytospheres published by the Medical Devices Agency of the UK included samples from 17 HIV positive subjects, and compared BC Cytospheres against DP flow cytometry [69].
Accuracy was reported using assessment of bias; misclassification probabilities were not reported. Imprecision was addressed using 6-7 replicates of 6 samples, and found a CV range of 3.2-17.6% (mean 8.5%). Unpublished evaluations have not been included in this review.
A recent review of external quality assurance (EQA) programmes involving 58,626 CD4 data sets from over 3,000 laboratories over a 12-year period shows that SP technologies consistently give lower relative errors and confidence limits than DP technologies at clinically significant absolute CD4 counts [70].

Limitations of this review
Limitations include the fact that we only included papers published in the English language, and we may have overlooked data because of this limitation. Limiting the search to the peer-reviewed literature may have overlooked robust evaluations conducted by national reference facilities or similar institutions.
It is important to consider that the reference standard technologies themselves are not perfect. Misclassification assumes that the reference result is accurate, i.e., the closest approximation to the truth. Thus, a result considered as a misclassification may in fact be correct. The reference technology, if performed once, may give a result of 340 cells/μl, but if performed in duplicate using the same specimen may give results of 340 and 360 cells/μl. It may be important for the reference technology to be performed in duplicate, and for the result to be used as the reference standard for the new test only when the duplicates are concordant around a threshold of 350 cells/μl.
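The duplicate-testing suggestion above can be expressed as a simple concordance check, sketched here in Python. The function name, the returned labels, and the decision to treat a count exactly at the threshold as "at_or_below" are all assumptions made for illustration; the review does not prescribe a specific rule.

```python
from typing import Optional


def reference_classification(replicate_1: float, replicate_2: float,
                             threshold: float = 350.0) -> Optional[str]:
    """Classify a specimen using duplicate reference measurements.

    Returns "at_or_below" or "above" when both replicates fall on the
    same side of the threshold, and None when they are discordant, in
    which case the specimen would not be used to judge the index test.
    """
    side_1 = replicate_1 <= threshold
    side_2 = replicate_2 <= threshold
    if side_1 != side_2:
        return None  # discordant duplicates: no usable reference result
    return "at_or_below" if side_1 else "above"
```

With the example from the text, duplicates of 340 and 360 cells/μl are discordant at a threshold of 350 cells/μl, so that specimen would be excluded from the misclassification calculation.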
What constitutes an "acceptable" margin of error and misclassification probability around a threshold remains undefined and may vary among sites, depending on local factors such as the distribution of CD4 counts among asymptomatic patients, how often patients undergo repeat CD4 testing, and the implications of potential over-treatment (e.g., cost, long-term risk of drug toxicity). However, given the move towards earlier treatment and the use of better-tolerated, less toxic drugs, misclassification that results in over-treatment may be more acceptable than would have previously been the case. It is relatively straightforward for national programmes to decide which technology best fits their needs based purely on cost and operating characteristics; it may be harder to decide what performance characteristics are acceptable, and harder still to obtain data on test performance to inform choice.
Given the potential for testing error, laboratory participation in EQA programmes and access to quality control (QC) reagents are essential. EQA information is not mentioned in the publications included in the review. Without this information, it is not possible to rule out that the proficiency of the laboratory staff performing the testing, in addition to the assays themselves, contributed to the errors and variation observed.

Conclusions
A wide range of bias and percent misclassification around treatment thresholds over the clinically relevant range were reported on the CD4 enumeration technologies included in this review. Fewer than half of the studies reported assay precision or reproducibility of the CD4 values obtained. This is a rapidly evolving field with new tests under development, and with existing instruments and reagents being regularly replaced by updated versions. A systematic review of POC tests compared to laboratory-based technologies showed that POC CD4 testing can increase retention in care prior to treatment initiation and can also reduce time to eligibility assessment, resulting in more eligible patients being initiated on life-saving treatment [71]. The lack of standardised methodology on test evaluation, including consensus on reference standards, is a barrier to assessing relative assay performance and could hinder the introduction of new POC assays in countries where they are most needed.