A Systematic Review of the Screening Accuracy of the HIV Dementia Scale and International HIV Dementia Scale

Background The HIV Dementia Scale (HDS) and International HIV Dementia Scale (IHDS) are brief tools that have been developed to screen for and aid diagnosis of HIV-associated dementia (HAD). They are increasingly being used in clinical practice for minor neurocognitive disorder (MND) as well as HAD, despite uncertainty about their accuracy. Methods and Findings A systematic review of the accuracy of the HDS and IHDS was conducted. Studies were assessed on Standards for Reporting Diagnostic Accuracy criteria. Pooled sensitivity, specificity, likelihood ratios (LR) and diagnostic odds ratios (DOR) were calculated for each scale as a test for HAD or MND. We retrieved 15 studies of the HDS, 10 of the IHDS, and 1 of both scales. Thirteen studies of the HDS were conducted in North America, and 7 of the IHDS studies were conducted in sub-Saharan Africa. Estimates of accuracy were highly heterogeneous between studies for the HDS but less so for the IHDS. Pooled DOR for the HDS was 7.52 (95% confidence interval 3.75–15.11), sensitivity and specificity for HAD were estimated at 68.1% and 77.9%, and sensitivity and specificity for MND were estimated at 42.0% and 91.2%. Pooled DOR for the IHDS was 3.49 (2.12–5.73), sensitivity and specificity for HAD were 74.3% and 54.7%, and sensitivity and specificity for MND were 64.3% and 66.0%. Conclusion Both scales were low in accuracy. The literature is limited by the lack of a gold standard, and variation in estimates of accuracy is likely to be due to differences in reference standard. There is a lack of studies comparing both scales, and they have been studied in different populations, but the IHDS may be less specific than the HDS. These rapid tests are not recommended for diagnostic use, and further research is required to inform their use in asymptomatic screening.


Introduction
HIV-associated neurocognitive disorders (HAND) are defined as impairment of multiple cognitive domains in association with HIV, in the absence of other causes for the impairment [1]. HAND may affect up to half of all HIV positive (HIV+) individuals, even in regions with good access to antiretroviral therapy (ART) [2,3]. Symptomatic HAND (HIV-associated dementia [HAD] or minor neurocognitive disorder [MND]) is recommended as a reason to initiate [4,5] or modify [4] ART in recent European and British clinical guidelines.
The "Frascati criteria" are a research classification system that defines HAD, the most severe grade of HAND, as impairment in at least two cognitive domains, scoring at least 2 standard deviations (SD) below demographically-appropriate means, with marked impairment of activities of daily living (ADL) caused by the cognitive deficits [1]. The two milder grades of HAND, much more common than HAD, are MND (defined as at least 1 SD below the mean in two domains with at least moderate impairment of ADL) and asymptomatic neurocognitive impairment (ANI) (defined similarly to MND but without impairment of ADL).
Fulfilment of the Frascati criteria requires neuropsychological (NP) testing of at least five cognitive domains from a possible seven, assessment of ADL, and exclusion of other diagnoses. The criteria are further limited by the lack of a standardised method of grading ADL, uncertainty about the clinical significance and possible oversensitivity for mild impairment [6], and lack of confirmatory neuropathological, imaging or laboratory biomarkers. The 1991 American Academy of Neurology (AAN) criteria are simpler to use, in that they only require that clinical diagnosis is "supplemented by" neuropsychological assessment, but are otherwise very similar to the Frascati criteria [7]. The 1998 Memorial Sloan-Kettering (MSK) criteria are based largely on clinical assessment and therefore may be more subjective, and are suited to an era prior to the availability of ART when HAD was terminally progressive [8].
Given the complexity of diagnosis, there is a role for rapid tests that can be incorporated into routine asymptomatic screening. The HIV Dementia Scale (HDS) was developed in 1995 as a "brief but sensitive instrument to identify [HIV-associated] dementia" [9]. The scale comprises four tests of subcortical cognitive domains (attention, motor speed, construction, and working memory). In response to culturally-specific elements of the HDS and difficulties with the administration of the anti-saccadic errors test, the International HIV Dementia Scale (IHDS) was developed as an alternative in 2005 [10]. Both tests provide a score but have a standardised cut-off for determining a positive or negative result. Both were proposed as rapid tests for screening (i.e. for individuals free of significant symptoms) and not diagnostic tests to confirm disease in patients with signs or symptoms of HAND, and patients who test positive with either the HDS or IHDS should undergo further assessment for diagnosis [9,10]. Other brief clinical screening tools [2,11-19] and computerised cognitive test batteries [20-22] have been used, but there are fewer studies of their accuracy.
The HDS and IHDS have been used in recent clinical studies in North and Central America [23,24], sub-Saharan Africa [12,25], South Asia [26,27] and Europe [2,28,29], and have been considered for inclusion as screening tools in expert HIV treatment guidelines [19,30] (although the IHDS has recently been replaced with a three-symptom questionnaire in updated European guidelines [4]), but important questions remain. First, they were devised for identifying HAD, and their performance in detecting milder neurocognitive impairment may be quite different. Second, it is unknown whether one scale has better accuracy than the other. And third, the study methods, settings and estimates vary considerably between diagnostic accuracy studies. To enable evidence-based use of these tests in clinical practice, we conducted a systematic review to estimate the accuracy of each scale for the diagnosis of HAD and MND when compared to standard diagnostic criteria.

Search strategy and selection criteria
A literature search was conducted in July 2011 and repeated in January 2013 by the first author, including PubMed and PsycInfo indexes, searchable online HIV/AIDS conference proceedings, specialist journals, and major online sources of HIV-related information. Search terms were formulated to capture all studies using the HDS or IHDS alongside another diagnostic method for HAND in a sample of HIV positive adults (Table 1). Manual searches included reference lists of relevant articles identified in automated searches, conference proceedings, and requests for unpublished data to authors of major articles. PubMed and PsycInfo searches were limited to 1994 onwards (the year prior to publication of the HDS) and conference abstracts were limited to available years (mainly 2001 onwards).
From this initial search, studies were excluded if they duplicated data reported in another study in the search, and were only included if they used either the HDS or IHDS to assess individual HIV+ adults, as well as an appropriate reference standard for comparison. In this review, the highest-quality reference standard was a standardised clinical definition (Frascati, AAN, or MSK) supported by a NP battery evaluating at least five broad cognitive domains (attention and working memory; verbal and/or visual learning and recall; processing speed; executive functions; motor skills). Studies using other reference standards such as a detailed NP battery only, clinical opinion or brief NP tools were reviewed but not included in all stages of the analysis (see below).

Assessment of study quality
Data collected for each study included study identifiers, the year(s) in which the work was conducted, geographical region, details of HIV positive study participants (number, age, education, degree of immunodeficiency, ART coverage, drug and alcohol use, psychiatric conditions, and relevant co-morbidities), test of interest (HDS or IHDS), reference standard for comparison, possible sources of bias and error, and the results of the test of interest and reference standard. Authors of papers with useable data were contacted to clarify their methodology.
Possible sources of bias and error were identified from a prespecified list of quality criteria, based on Standards for Reporting Diagnostic Accuracy (STARD) guidelines [31]. Criteria to assess selection methods were the target population, inclusion and exclusion criteria, sampling methodology (consecutive, random, or opportunistic), information about eligible patients who were not recruited, and whether there was an a priori power calculation. Criteria relating to diagnostic methods included whether assessors completing the test and the reference standard were blinded to each other's assessment, adequacy and appropriateness of methods used for the reference standard, methods of ensuring validity and reliability of the assessments, and time lag or drop-outs between assessments. Studies were also assessed on whether the patient sample was adequately described, and whether there were any characteristics of the sample that might reduce its generalizability.

Collection of screening or diagnostic accuracy data
The number of true and false positives (TP, FP) and number of true and false negatives (TN, FN) among HIV+ study subjects, using standard cut-offs for the test of interest (less than or equal to 10 for both scales), was determined. The reference standard was categorised as having either a severe or a moderate threshold. Severe threshold reference standards were those using Frascati or AAN criteria for HAD, or the MSK grading for AIDS Dementia Complex (grade 1 to 4). All three of these standards are similar in threshold, although the Frascati definition for HAD may represent the more severe end of the impairment spectrum [32]. Severe impairment in studies employing NP batteries was defined by similar criteria to HAD, namely impairment to ≥2 SD below normative means in at least two out of five cognitive domains. Moderate threshold reference standards were those that used MND (Frascati criteria), Minor Cognitive-Motor Disorder (MCMD; AAN criteria), or MSK grade 0.5 as a cut-off, with more severe impairment also included as a positive diagnosis. There is slightly less agreement between MND, MCMD, and MSK grade 0.5 than for more severe impairment [32,33]. Moderate impairment in studies using NP batteries was defined by similar criteria to MND, namely impairment in at least two domains at a level of at least 1 SD below expected means.
If the necessary values could not be extracted from published papers, but it was apparent that the necessary source data might exist elsewhere (e.g. if test scores were reported as a continuous distribution), the corresponding author was contacted to request these data. If it was not possible to dichotomise both the test of interest and the reference standard, the study was excluded from the analysis.
The accuracy of the test of interest in each study was quantified by the sensitivity (true positive rate), specificity (1 − false positive rate), positive likelihood ratio (LR+; equal to sensitivity ÷ [1 − specificity]), negative likelihood ratio (LR−; equal to [1 − sensitivity] ÷ specificity), and diagnostic odds ratio (DOR; equal to [TP × TN] ÷ [FP × FN]). Positive and negative LR can be multiplied by the assumed odds of a diagnosis being present before conducting the test (prior odds) to determine the final odds of a diagnosis being present (posterior odds). According to Jaeschke et al, tests with LR+ >5 or LR− <0.2 provide strong evidence for or against the diagnosis, and LR+ >10 and LR− <0.1 provide convincing diagnostic evidence in most scenarios [34]. 95% confidence intervals (CI) were calculated for each measure.
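As an illustration of these definitions, the following minimal Python sketch computes the five accuracy measures from a single two-by-two table. The class and function names (`TwoByTwo`, `accuracy_measures`) and the counts are hypothetical and are not taken from any study in the review; no continuity correction is applied here, so the sketch assumes that no cell is zero.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from one study (hypothetical container, not from the review)."""
    tp: int  # test positive, reference positive
    fp: int  # test positive, reference negative
    fn: int  # test negative, reference positive
    tn: int  # test negative, reference negative


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return sensitivity, specificity, LR+, LR- and DOR for one table.

    Assumes every cell is non-zero (no continuity correction is applied).
    """
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
        "dor": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Made-up counts for illustration only.
measures = accuracy_measures(TwoByTwo(tp=30, fp=20, fn=10, tn=80))
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the DOR equals LR+ divided by LR−, which is a useful consistency check on any extracted table.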

Statistical analysis
Four groups of studies were defined, according to the test of interest (separate analyses were performed for the HDS and IHDS) and the reference standard threshold (severe or moderate). Some studies reported more than one grade of impairment and therefore contributed to more than one group. Studies were then pooled if they used comprehensive criteria (Frascati, AAN or MSK), but were discounted if they used only a NP battery or a brief tool as the reference standard. Studies were not excluded on the basis of other quality criteria. Where there were two reference standards applied to the same sample, the more comprehensive standard was retained.
For each of these four pools of studies, heterogeneity between estimates of sensitivity and specificity was assessed by chi-squared tests, ignoring studies with cell sizes of <5. Heterogeneity between estimates of LRs was assessed using the I-squared measure. Reasons for heterogeneity between studies were later assessed by meta-regression of LRs with study characteristics as the independent variable.
Pooled sensitivity and specificity were then calculated as averages, weighted by sample size, and pooled LR+ and LR− were calculated using standard meta-analysis methods for risk ratios with a random effects model. These methods have the potential to underestimate test accuracy in the presence of diagnostic threshold variation; such variation was assessed using Spearman's rank test to demonstrate correlation between sensitivity and specificity [35]. A summary DOR that is constant across diagnostic threshold removes this source of error [36].
Summary DORs were calculated using the Littenberg-Moses method [37] in which the linear relationship D = a + bS is examined in a regression model (where D = ln(DOR) = logit[TPR] − logit[FPR] and S = logit[TPR] + logit[FPR]), with points weighted by the square root of sample size. When calculating DORs, it was possible to combine studies with severe and moderate reference thresholds, and those with detailed NP batteries as the reference standard were re-incorporated into the analysis. A continuity correction was applied, because some studies had FN or FP equal to zero.
DORs are a single composite measure of both true- and false-positive rates, and therefore less clinically useful than other measures. To assist interpretation, predicted specificity and LRs were calculated from the average sensitivity and the summary DOR.
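The sample-size weighting described above can be sketched as follows. This is only an illustration: the function name `pooled_proportion` and the study values are hypothetical, and the sketch assumes that the weight for each study is simply its number of participants (the text does not state whether total sample size or the number of reference-positive participants was used).

```python
def pooled_proportion(estimates, sample_sizes):
    """Average of per-study proportions, weighted by sample size.

    Assumption: each weight is the study's sample size as supplied.
    """
    if len(estimates) != len(sample_sizes):
        raise ValueError("one sample size is needed per estimate")
    total = sum(sample_sizes)
    return sum(e * n for e, n in zip(estimates, sample_sizes)) / total


# Made-up sensitivities and sample sizes for three imaginary studies.
sensitivities = [0.60, 0.75, 0.70]
sizes = [50, 100, 50]
print(round(pooled_proportion(sensitivities, sizes), 3))
```

A larger study pulls the pooled value towards its own estimate, which is why a single large sample can dominate a pool of small studies.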
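The regression described above can be sketched in a few lines of Python. This is a hedged illustration rather than the analysis code used in the review: the function names (`d_and_s`, `moses_littenberg`) and the counts are hypothetical, the continuity correction is assumed to be 0.5 added to every cell of every study, and the weights are assumed to be the square root of the uncorrected sample size. Here D = logit(TPR) − logit(FPR), which is the natural logarithm of the DOR.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def d_and_s(table, correction=0.5):
    """Return (D, S) for one (tp, fp, fn, tn) table.

    Assumption: the continuity correction is added to all four cells.
    """
    tp, fp, fn, tn = (cell + correction for cell in table)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return logit(tpr) - logit(fpr), logit(tpr) + logit(fpr)


def moses_littenberg(tables, correction=0.5):
    """Weighted least-squares fit of D = a + b*S; returns (a, b)."""
    points = [d_and_s(t, correction) for t in tables]
    # Weight each study by the square root of its uncorrected sample size.
    weights = [math.sqrt(sum(t)) for t in tables]
    total_w = sum(weights)
    mean_d = sum(w * d for w, (d, _) in zip(weights, points)) / total_w
    mean_s = sum(w * s for w, (_, s) in zip(weights, points)) / total_w
    sxx = sum(w * (s - mean_s) ** 2 for w, (_, s) in zip(weights, points))
    sxy = sum(w * (s - mean_s) * (d - mean_d)
              for w, (d, s) in zip(weights, points))
    b = sxy / sxx
    a = mean_d - b * mean_s
    return a, b


# Made-up (tp, fp, fn, tn) tables; the third has FP = 0, which is why
# a continuity correction is needed before taking logits.
studies = [(30, 20, 10, 80), (12, 5, 8, 45), (20, 0, 15, 40)]
a, b = moses_littenberg(studies)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}, exp(a) = {math.exp(a):.2f}")
```

When the slope b is close to zero, the DOR does not depend on the threshold and exp(a) can be read as a summary DOR; a clearly non-zero slope suggests that a single summary value is not appropriate.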
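One way to obtain such predictions, assuming only the definition of the DOR as the odds of a true positive divided by the odds of a false positive, is to solve for specificity given a sensitivity and a DOR. The sketch below uses a hypothetical function name (`predict_from_dor`) and the pooled HDS values quoted in this review as inputs; it is an illustration of the identity, not the authors' code.

```python
def predict_from_dor(sensitivity, dor):
    """Predict specificity, LR+ and LR- from a sensitivity and a DOR.

    Uses DOR = [sens / (1 - sens)] / [(1 - spec) / spec], rearranged to
    spec / (1 - spec) = DOR * (1 - sens) / sens.
    """
    spec_odds = dor * (1 - sensitivity) / sensitivity
    specificity = spec_odds / (1 + spec_odds)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return specificity, lr_pos, lr_neg


# Pooled HDS sensitivity for severe HAND (68.1%) and summary DOR (7.52).
spec, lr_pos, lr_neg = predict_from_dor(0.681, 7.52)
print(f"specificity = {spec:.3f}, LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
# specificity ≈ 0.779, LR+ ≈ 3.08, LR- ≈ 0.41
```

The same function applied to the IHDS inputs (sensitivity 74.3%, DOR 3.49) gives a specificity of about 54.7%.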

Methodology and study quality
Methodological characteristics relating to study quality are summarised in Figure 2. Eighteen studies were specifically designed to assess one of the screening tools [2,9,10,13,23-25,64-66,68,71-75,78,79,81]. The sampling method was random or consecutive in only seven samples (allowing for some ambiguity in reporting) [2,23,68,72,75,79,80]. The number of eligible patients who were not recruited was available for seven studies. No published articles reported any justification for their sample size, but one author disclosed that they had performed a power calculation [75].
Table 1 note: These search terms were for PubMed, the primary source of citations. Searches of other data sources used modified versions of these terms. doi:10.1371/journal.pone.0061826.t001
Full [24,69,78,79] or partial [23,74,75] blinding between assessments occurred in seven studies, and most studies did not report the use or non-use of blinding. Lack of blinding was usually due to assessments being done by the same investigator. Verification bias was difficult to exclude with available information, but three studies had a time-lag between assessments [9,23,64].

Estimates of accuracy of the HIV Dementia Scale
Sensitivity estimates for detecting severe HAND with the HDS ranged widely from 35.7-91.7%, specificity 60.4-100%, LR+ 1.89-6.29, and LR− 0.12-0.72 (Table 4). After removing studies with NP batteries or brief tools as reference standards, there was evidence of heterogeneity between these estimates (p = 0.10 for LR+; p = 0.06 for LR−; p = 0.03 for sensitivity; p<0.001 for specificity). There was borderline evidence of a correlation between sensitivity and specificity across these studies (Spearman's r = −0.68 for nine observations, p = 0.062). Pooling seven studies that used a comprehensive reference standard gave sensitivity 68 [...]
Sensitivity estimates for detecting moderate-to-severe HAND were again in a wide range from 9.1-61.5%, specificity 62.5-97.8%, LR+ 1.33-7.00, LR− 0.47-0.93. There was also [...]
The summary DOR for the HDS was estimated at 7.52 (3.75-15.11) (Figure 3). Predictions of test accuracy for the HDS were made using the above pooled sensitivity estimates (68.1% and 42.0%), giving specificity of 77.9% for severe HAND and 91.2% for moderate-to-severe HAND, LR+ of 3.08 and 4.79, and LR− of 0.41 and 0.64, respectively. Repeat analysis using only studies with the highest-quality reference standards and populations unselected for neurocognitive symptoms gave slightly poorer [...]
*Sample sizes refer to the number of patients with data that were useable for meta-analysis, following discussion with study authors as necessary.
**Participants were randomly selected from an existing cohort, in proportions approximating published prevalence of MND and HAD.
***Participants were sampled to generate two equal groups (n = 50 each): those with symptoms of cognitive impairment and those without. The quoted prevalence of MND and HAD is based on extrapolation up to a larger sample

Estimates of accuracy of the International HIV Dementia Scale
For the IHDS, sensitivity estimates for detecting severe HAND ranged from 57.1-100%, specificity 22.2-65.6%, LR+ 1.05-2.00, and LR− 0.31-0.82 (Table 4). Two sets of estimates came from the same study, one using MSK grading and one using Frascati criteria [75]; the latter was dropped from further analysis because the researchers found limitations to using Frascati criteria in rural Kenya. There was strong evidence of heterogeneity between studies in the specificity estimates (p<0.001), and correlation between sensitivity and specificity across IHDS studies using a valid reference standard (r = −0.[...]
Sensitivity estimates for detecting moderate-to-severe HAND with the IHDS ranged from 53.8-87.5%, with specificity 45.0-80.6%, LR+ 1.32-2.78, LR− 0.25-0.62, with one conspicuous outlier of low sensitivity and high specificity [25]. There was strong evidence of heterogeneity between specificity estimates (p = 0.004), borderline evidence of heterogeneity between sensitivity estimates (p = 0.07), and no evidence of heterogeneity between LR estimates (p>0.10). There was no evidence of a correlation between sensitivity and specificity (r = −0.80 for four observations, p = 0.20). Pooled estimates were sensitivity 64.3% (55.6-72.1%), [...]

Analysis of sources of heterogeneity
Analysis of study methodological features showed higher average DOR (20.5 vs. 6.85, p = 0.001) and lower average LR− (0.26 vs. 0.59, p = 0.01) in two studies comparing the HDS to a severe-impairment reference standard when the target population was highly immunodeficient [9,64]. When compared to the IHDS, the HDS had a significantly higher summary DOR (p = 0.009) and LR+ (p = 0.019), but both scales had similar LR− (p = 0.98). This comparison may however rest on an artificial foundation, given the differences between target populations in studies of each scale. The single study that used both scales in the same population was of small sample size and failed to find a difference between the two [13].

Discussion
We have systematically reviewed 15 studies of the HDS, ten of the IHDS, and one that included both scales. Most studies in the review apply to the original intended role of the HDS and IHDS (screening rather than diagnosis), in that participants were not selected on the basis of symptoms. Summary estimates for the HDS as a test for HAD or an equivalent diagnosis (severe HAND) were: sensitivity 68%, specificity 78%, LR+ 3.1, LR− 0.41, but its accuracy appeared to be lower when analysis was limited to studies with high-quality reference standards and unselected populations. When using the HDS as a test for MND or equivalent (all symptomatic HAND), estimates of accuracy were: sensitivity 42%, specificity 91%, LR+ 4.8, LR− 0.64. Summary estimates for the IHDS as a test for severe HAND were: sensitivity 74%, specificity 55%, LR+ 1.6, LR− 0.47. When using the IHDS as a test for all symptomatic HAND, estimates of accuracy were: sensitivity 64%, specificity 66%, LR+ 1.9, LR− 0.54. These summary estimates and most individual study estimates for both scales failed to achieve accepted levels of accuracy to provide strong evidence for a diagnosis of HAND [34], confirming their unsuitability for diagnostic purposes when used alone.
Comparing the two scales, the HDS had a higher DOR and LR+ than the IHDS, but the only direct comparison of both scales within the same sample failed to find a difference between the two, and was limited by its small sample size [13]. Furthermore, the two scales were studied in different settings, with most of the HDS studies conducted in North America, and most of the IHDS studies conducted in Africa. Unfortunately, while the IHDS was developed with resource-limited settings in mind, it is not free from culturo-linguistic effects. The four-word recall task (in both tests) must be modified for different languages [24,52], and it was shown in an Indian population that education was associated with IHDS score, but HIV status was not [27]. The two scales were also studied in different years, and considerable changes in our understanding of HIV pathogenesis and treatment occurred in the decade between the publication of the HDS in 1995 and the IHDS in 2005.
Figure 4 (caption fragment). [...] and "Sacktor US" correspond to two separate studies published in a single paper [10]. The cross labelled "Sacktor MCN" corresponds to baseline data from a multicentre trial of minocycline for treatment of cognitive impairment [77]. The two points labelled "Meyer" are derived from the same study [75]; "(Frascati)" and "(MSK)" denote the reference standard in each case. Red circles indicate studies using neuropsychological (NP) test batteries or brief NP tests as the reference standard, again labelled by first author. Solid diamonds indicate predicted values based on pooled sensitivity and summary diagnostic odds ratio. A, Reference standard = AIDS dementia complex, HIV-associated dementia, or severe neurocognitive impairment. B, Reference standard = mild neurocognitive disorder, minor cognitive/motor disorder, or mild/moderate neurocognitive impairment. CI: confidence interval; DOR: diagnostic odds ratio. doi:10.1371/journal.pone.0061826.g004
Estimates of screening accuracy showed wide variation between studies, particularly for the HDS. We did not find strong evidence of a diagnostic threshold effect. However, tests of correlation used to demonstrate this effect are known to have low statistical power [36], and the reference diagnosis of HAD is complex and subject to variations of interpretation. It is therefore plausible that differences between reference standards contributed to the varying accuracy of these well-standardised diagnostic tools.
Regarding other sources of variability, an increased DOR and lower LR− were seen in two studies assessing the HDS in patients with more advanced immunodeficiency. Spectrum bias is a form of selection bias that may occur when the study population is sampled from a limited or specialised clinical setting and therefore has a narrow spectrum of disease. This form of bias could have increased sensitivity in samples of more severely impaired patients, such as those conducted in Africa, in the pre-HAART era, or in hospital wards. Spectrum bias could also reduce specificity in those in whom it was difficult to exclude competing diagnoses, such as in resource-limited settings, or conversely increase specificity in samples with fewer competing diagnoses. Non-random, non-consecutive sampling strategies are known to lead to overestimation of accuracy [82].
There were a number of methodological limitations to this review. First, the literature search and data extraction were carried out by a single author (LJH). Second, the literature search could have missed studies not cited in the target data sources, or articles in which it was not clear from the abstract that neurocognitive testing was done. To minimise this, researchers in the field were asked about the existence of unpublished datasets. Third, it was not always possible to generate two-by-two tables from available data, usually because HDS and IHDS scores were reported as continuous variables. In a few studies, the estimated values were not consistent with other information in the same article, suggesting other unknown errors in the results. This was despite requests for reconfigured data directly from researchers.
More importantly, the review is limited by the lack of a clinical gold standard for neurocognitive impairment in HIV, whether this be neurological criteria, neuroimaging findings, biomarkers in cerebrospinal fluid, or histopathology. The Frascati criteria are relatively detailed, objective, and appropriate for a research definition, so the analysis in this review provides the best available estimates of the accuracy of the HDS and IHDS when used as screening tools for MND or HAD. However, current data do not clearly inform clinicians of the natural history or appropriate treatment of these conditions, particularly milder impairment, and this limits our ability to predict the effects of screening.
British HIV Association (BHIVA) guidelines do not comment on screening for HAND [5], whereas the European AIDS Clinical Society (EACS) guidelines recommend a brief symptom questionnaire in all patients at regular intervals [2,4] and a recent review made similar recommendations but did not support one screening test over another [83]. The general rule that one should minimise false positives if the confirmatory test is expensive or invasive favours the HDS over the IHDS, and the penalty for missing an asymptomatic case of HAND is arguably not high, so the lower-sensitivity test is acceptable. The prevalence of HAD was 2-4% of HIV positive individuals in recent surveys in the US and Switzerland [1-3], lower than the prevalence in most studies included in this review. At this low prior probability, one might confidently exclude the diagnosis with a negative HDS, but the posterior probability would be less than 15% after a positive HDS. In comparison, when used as a test for MND, a positive HDS result would give a posterior probability of 56% in the presence of a prior probability of 20%.
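The arithmetic behind such statements follows from multiplying prior odds by a likelihood ratio. The sketch below is a minimal illustration with a hypothetical function name (`posterior_probability`); it uses the rounded HDS likelihood ratios for HAD quoted in this review (LR+ 3.08, LR− 0.41) and the upper end of the 2-4% prevalence range as the prior probability.

```python
def posterior_probability(prior_probability, likelihood_ratio):
    """Convert a prior probability to a posterior probability.

    posterior odds = prior odds x likelihood ratio
    """
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Prior probability of HAD of 4% with the rounded HDS likelihood ratios.
after_positive = posterior_probability(0.04, 3.08)
after_negative = posterior_probability(0.04, 0.41)
print(f"after a positive HDS: {after_positive:.3f}")  # ≈ 0.114
print(f"after a negative HDS: {after_negative:.3f}")  # ≈ 0.017
```

Under these inputs a positive result still leaves the diagnosis unlikely, while a negative result lowers an already small probability further, which matches the qualitative argument made above.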
A screening test is an intervention that should be subject to interventional research as any other, and for it to be routinely used in clinical practice, the evidence base should address the next steps in the clinical pathway. For example, we need to evaluate how to investigate patients further, how to predict their outcome, and how to modify medical therapy in the light of a positive or negative screening test. On the tests themselves, studies are needed to determine their repeatability, intra-subject variation, and learning effects, and understand the causes of false positive and false negative results (not explored in the studies reviewed). Further studies of the HDS and IHDS should adhere to STARD guidelines. Specific settings of interest are the use of the HDS in an African or other resource-limited setting, or the IHDS in a North American or European setting with high ART coverage and relatively preserved immune function. There may be a role for studying the scales specifically in older adults, given the growing proportion of HIV+ individuals over the age of 50 [84] and their greater risk of HAND [85], although their ability to distinguish between HAND and non-HIV causes of NCI has not been assessed. One could also model theoretical screening programmes for neurocognitive impairment within HIV positive populations of known prevalence.
In conclusion, in current clinical practice, interpretation of the results of assessment with the HDS or IHDS requires an appreciation of their limited accuracy, the lack of generalisability of existing research, and the heterogeneity of estimates. The HDS appears to be more accurate overall and its higher specificity probably makes it the preferred test for detecting asymptomatic HAND, although the IHDS may be preferred in situations where sensitivity is most important, at the expense of loss of specificity. Having reviewed the evidence we advise against their further use as diagnostic tests for HAND in symptomatic patients, even in resource-limited settings, and believe that studies reporting their use should acknowledge their limited validity.

Supporting Information
Flowchart S1 Flowchart in PRISMA format.