Verbal Autopsy: Evaluation of Methods to Certify Causes of Death in Uganda

To assess different methods for determining cause of death from verbal autopsy (VA) questionnaire data, the intra-rater reliability of Physician-Certified Verbal Autopsy (PCVA) and the accuracy of PCVA, expert-derived (non-hierarchical) and data-driven (hierarchical) algorithms were assessed for determining common causes of death in Ugandan children. A verbal autopsy validation study was conducted from 2008-2009 in three different sites in Uganda. The dataset included 104 neonatal deaths (0-27 days) and 615 childhood deaths (1-59 months) with the cause(s) of death classified by PCVA and physician review of hospital medical records (the ‘reference standard’). Of the original 719 questionnaires, 141 (20%) were selected for a second review by the same physicians; the repeat cause(s) of death were compared to the original, and agreement was assessed using the Kappa statistic. Physician reviewers refined non-hierarchical algorithms for common causes of death from existing expert algorithms, from which hierarchical algorithms were developed. The accuracy of PCVA, non-hierarchical, and hierarchical algorithms for determining cause(s) of death from all 719 VA questionnaires was determined using the reference standard. Overall, intra-rater repeatability was high (83% agreement, Kappa 0.79 [95% CI 0.76-0.82]). PCVA performed well, with high specificity for determining cause of neonatal (>67%) and childhood (>83%) deaths, resulting in fairly accurate cause-specific mortality fraction (CSMF) estimates. For most causes of death in children, non-hierarchical algorithms had higher sensitivity, but correspondingly lower specificity, than PCVA and hierarchical algorithms, resulting in inaccurate CSMF estimates. Hierarchical algorithms were specific for most causes of death, and CSMF estimates were comparable to the reference standard and PCVA. Intra-rater reliability of PCVA was high, and overall PCVA performed well.
Hierarchical algorithms performed better than non-hierarchical algorithms due to higher specificity and more accurate CSMF estimates. Use of PCVA to determine cause of death from VA questionnaire data is reasonable while automated data-driven algorithms are improved.


Introduction
Verbal autopsy (VA) is an indirect method of determining cause of death based on an interview with the caretakers of a deceased individual. It has been widely used to collect information on cause-specific mortality where vital registration systems are lacking and medical information on deaths is incomplete [1]. Different approaches for determining cause of death from VA interview information exist, including physician review, algorithms, and more recently, computerized coding of VA (CCVA), which can be either algorithmic or probabilistic in approach [2-4]. However, the optimal approach for determining causes of death from VA data is unclear, and has been the subject of debate [1,3,5].
The most widely used method is physician review, known as physician-certified VA (PCVA), in which physicians are trained to review questionnaire data and determine cause of death. Although the validity of PCVA has been evaluated [6], concerns about the repeatability of PCVA have been raised [4,7]. The level of agreement on causes of death certified by two independent physicians from VA (inter-rater repeatability) has been extensively studied [5,8-15]. However, very few published studies have assessed the repeatability of causes of death certified from VA by the same physicians at different time points (intra-rater reliability) [16].
An alternative to PCVA for determining cause(s) of death from VA data is the use of algorithms. Algorithms can be expert-derived or data-driven. Expert algorithms include a set of predefined diagnostic criteria developed by a panel of physicians, based on experience or review of existing literature [2]. Alternatively, data-driven algorithms are derived from existing data using standard statistical techniques, including logistic regression, decision tree algorithms, and Bayesian classification, which identify discriminatory functions of indicators to be included in an algorithm [2]. Algorithms can be used to guide physicians as they review VA questionnaires and classify cause(s) of death; alternatively, algorithms may be computerized to automate the process [3,17]. Several algorithms based on expert opinion or derived from data have been developed, but their accuracy has been shown to vary widely, and may be lower than that of PCVA [18-22]. In addition to algorithms, probabilistic approaches have been developed [3]. Unlike algorithmic approaches, which assess the presence or absence of a single cause of death based on positive or negative responses to symptom-related questions, these automated methods apply probabilistic reasoning, adjusting the probabilities of a range of possible outcomes simultaneously [2]. Like algorithms, probabilistic methods can be expert-driven or data-driven [23]. Recent reports suggest that automated probabilistic approaches outperform or are equivalent to PCVA [24,25], but these results have been disputed [23,26].
Data from a VA validation study conducted in three epidemiological settings in Uganda were used to investigate the performance of different methods for determining causes of death from VA data. We evaluated the intra-rater reliability of PCVA, and also compared the accuracy of PCVA to that of two algorithms: one developed with the input of expert physicians (non-hierarchical) and another that was data-driven (hierarchical).

Materials and Methods
The VA dataset used to investigate the performance of different methods for determining causes of death was obtained from a VA validation study that was approved by the Ugandan National Council for Science and Technology, the Centers for Disease Control and Prevention, and the ethics committees of Makerere University Faculty of Medicine and the London School of Hygiene and Tropical Medicine. Details of the VA validation study method are published elsewhere [27]. Briefly, the study was conducted from 2008-2009 in selected public hospitals located in three districts: Tororo (high malaria transmission), Kampala (medium transmission), and Kisoro (low transmission). Deaths among hospitalized children aged less than five years, including neonatal deaths, were registered over a period of one year, and VA interviews were conducted with an appropriate caretaker of each child. PCVA was used for determining cause of death following World Health Organization (WHO) standards at the time [28]. The reference standard for assessing the accuracy of PCVA was the cause of death determined by physician review of hospital medical records at each site. The sensitivity, specificity, positive predictive value, and accuracy of cause-specific mortality fraction (CSMF) estimates of the PCVA method for determining cause of death were computed for a select group of common causes of childhood death for each site. Analysis and presentation of results were stratified by two age groups: 1) neonatal deaths (0-28 days), and 2) childhood deaths (1-59 months).

Intra-rater reliability of PCVA
Twenty percent of VA questionnaires were systematically sampled for assessment of intra-rater reliability. Using a list of sequentially ordered identification numbers for each site, we systematically selected every fifth VA questionnaire, together with the corresponding cause of death (COD) originally determined by physician review of the data. VA questionnaires were re-evaluated by the original physician a second time. Re-determination of causes of death from VA questionnaires occurred 3-9 months after the original assessment, and physicians were blinded to the causes of death recorded in the original VA death certificate.
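The systematic selection of every fifth questionnaire can be sketched as follows. This is a minimal illustration in Python, assuming a sequentially ordered list of identifiers for one site; the function name `systematic_sample`, the starting position, and the example identifiers are hypothetical and are not taken from the study.

```python
def systematic_sample(ordered_ids, step=5, start=4):
    """Return every `step`-th identifier from a sequentially ordered list.

    `start` is the zero-based index of the first selected item. With
    start=4 and step=5 the 5th, 10th, 15th, ... identifiers are chosen.
    The starting position is an assumption made for this sketch.
    """
    if step < 1 or start < 0:
        raise ValueError("step must be >= 1 and start must be >= 0")
    return list(ordered_ids[start::step])


# Hypothetical identifiers for one site, already in sequential order.
site_ids = [f"T{n:03d}" for n in range(1, 21)]
selected = systematic_sample(site_ids)
```

With 20 identifiers and a step of five, the sketch selects four questionnaires, which corresponds to the 20% sampling fraction described above.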

Development of non-hierarchical algorithms
The non-hierarchical algorithms were based on previously published expert algorithms [19,21,29-31]. Seven physicians who reviewed the original VA questionnaires were asked to review existing algorithms and develop a refined algorithm (including the criteria for diagnosis), taking into account the diagnostic criteria that they used to attribute malaria and other common childhood illnesses as cause of death when originally reviewing VA questionnaires. The non-hierarchical algorithms underwent a final round of review by a team of the investigators, including a pediatrician and three epidemiologists. Each algorithm consisted of a pre-determined set of diagnostic criteria to be applied to VA questionnaire data: specific combinations of the presence or absence of certain signs and symptoms experienced prior to death indicated different causes of death. For neonatal deaths, non-hierarchical algorithms were developed for the following causes: 1) septicemia, 2) meningitis, 3) pneumonia, and 4) congenital malformation. Final non-hierarchical algorithms for childhood deaths were limited to the most common causes of death, including 1) malaria, 2) pneumonia, 3) meningitis, 4) diarrheal illnesses, 5) malnutrition, and 6) HIV/AIDS (Table 1).
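A non-hierarchical algorithm of this kind can be viewed as a set of independent rules, each evaluated on the same questionnaire, so that one death may satisfy more than one rule. The sketch below illustrates that structure in Python. The symptom names and the criteria in `RULES` are hypothetical placeholders chosen for illustration only; they are not the diagnostic criteria in Table 1.

```python
from typing import Callable, Dict, List

# Closed-ended VA responses for one death: symptom name -> present or absent.
Responses = Dict[str, bool]

# HYPOTHETICAL criteria, for illustration only (not the study's Table 1).
RULES: Dict[str, Callable[[Responses], bool]] = {
    "malaria": lambda r: r.get("fever", False) and not r.get("stiff_neck", False),
    "pneumonia": lambda r: r.get("cough", False) and r.get("fast_breathing", False),
    "diarrhea": lambda r: r.get("loose_stools", False),
}


def classify_non_hierarchical(responses: Responses) -> List[str]:
    """Return every cause whose criteria are met (none, one, or several)."""
    return [cause for cause, rule in RULES.items() if rule(responses)]


example = {"fever": True, "cough": True, "fast_breathing": True}
causes = classify_non_hierarchical(example)
```

Because each rule is evaluated independently, the example questionnaire is assigned two causes, which is the behavior that distinguishes non-hierarchical from hierarchical classification.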

Development of hierarchical algorithms
Hierarchical algorithms were developed by ranking the performance of the non-hierarchical algorithms for the common causes of childhood deaths, including neonatal deaths. Ranking was based on the specificity of each cause of death as determined using the expert algorithms. The cause of death with the highest specificity was placed at the top of the hierarchy, while the least specific was placed at the bottom (Fig 1). Neonatal deaths were ranked in the following order: (1) septicemia, (2) meningitis, (3) pneumonia, and (4) congenital malformations. Childhood causes of death were ranked as follows: (1) meningitis, (2) pneumonia, (3) malnutrition, (4) diarrhea, (5) HIV, and (6) malaria.
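The effect of the hierarchy can be sketched as a selection step applied to the causes flagged for one questionnaire: the highest-ranked flagged cause is kept and the rest are discarded. The Python sketch below uses the childhood ranking listed above. The function name `assign_single_cause` and the choice to return 'others' when no cause is flagged are assumptions made for this illustration.

```python
from typing import Iterable, Sequence

# Childhood ranking as listed in the text, highest rank first.
CHILDHOOD_HIERARCHY = (
    "meningitis", "pneumonia", "malnutrition", "diarrhea", "HIV", "malaria",
)


def assign_single_cause(
    flagged: Iterable[str],
    hierarchy: Sequence[str] = CHILDHOOD_HIERARCHY,
    default: str = "others",
) -> str:
    """Return the highest-ranked flagged cause, or `default` if none match.

    Falling back to 'others' is an assumption of this sketch.
    """
    flagged_set = set(flagged)
    for cause in hierarchy:  # walk from the top of the hierarchy down
        if cause in flagged_set:
            return cause
    return default


single = assign_single_cause(["malaria", "pneumonia"])
```

In this sketch a questionnaire flagged for both malaria and pneumonia is assigned pneumonia only, because pneumonia sits above malaria in the ranking. Malaria is therefore assigned only when no higher-ranked cause is flagged.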

Data analysis
Intra-rater reliability of PCVA. The cause of death determined by physicians upon repeat review of VA questionnaires was compared to the cause of death originally determined by the same physician. The percentage agreement and Kappa statistic were calculated for each physician using Stata 12 (StataCorp, College Station, Texas, USA). Interpretation of Kappa values followed the criteria of Landis and Koch [32], who recommended that a Kappa value greater than 0.8 be considered 'almost perfect', between 0.6 and 0.8 'substantial', between 0.4 and 0.6 'moderate', between 0.2 and 0.4 'fair', between 0 and 0.2 'slight', and between 0 and -1 'poor'. Furthermore, to assess the impact of re-determination of cause of death on the CSMF attributable to malaria and other common illnesses at the population level, we compared the original CSMF (CSMF Original) to the re-determined CSMF (CSMF Repeat).
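For readers unfamiliar with the Kappa statistic, the calculation and its interpretation can be sketched as follows. The study used Stata 12; this Python version is an independent illustration of the unweighted Cohen's Kappa, not the authors' code. The function names and the example cause labels are hypothetical, and the handling of values that fall exactly on a category boundary is an assumption.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's Kappa for two equally long lists of labels."""
    if len(first) != len(second) or not first:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_1, counts_2 = Counter(first), Counter(second)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Map a Kappa value to the Landis and Koch category used in the text."""
    if kappa > 0.8:
        return "almost perfect"
    if kappa > 0.6:
        return "substantial"
    if kappa > 0.4:
        return "moderate"
    if kappa > 0.2:
        return "fair"
    if kappa >= 0:
        return "slight"
    return "poor"


# Hypothetical original and repeat causes assigned by one physician.
original = ["malaria", "malaria", "pneumonia", "pneumonia"]
repeat = ["malaria", "pneumonia", "pneumonia", "pneumonia"]
kappa = cohen_kappa(original, repeat)
```

In the example, three of four assignments agree (75%) while chance agreement is 50%, so Kappa is 0.5, which falls in the 'moderate' category.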

Validation of algorithms
A database comprising responses to the closed-ended sections of VA questionnaires and the reference causes of death derived from medical records was generated. Causes of death determined by non-hierarchical algorithms were derived by applying the non-hierarchical algorithms to the closed-ended sections of VA questionnaires. Non-hierarchical algorithms were capable of classifying more than one cause of death. Hierarchical algorithms were also applied to the same VA questionnaire database, generating a single cause of death for each questionnaire. The sensitivity and specificity of each method for determining cause of death were calculated by comparing the cause of death assigned by each method to the 'reference standard' for causes of death derived from hospital medical records, including malaria, pneumonia, diarrhea, meningitis, malnutrition, and HIV. CSMF estimates of the leading causes of death were also calculated for PCVA (CSMF PCVA), non-hierarchical algorithms (CSMF NHA) and hierarchical algorithms (CSMF HA). The difference between the CSMF determined using each of the three methods and the 'reference standard' (CSMF MR) was calculated for the common causes of death. For neonatal and childhood deaths, where algorithms were developed for five and four commonest causes of death respectively, causes of death that did not fit the commonest cause of death list were categorized as 'others' and were included in all analyses.
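The validation measures described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' analysis code: each death is represented by a set of causes so that methods assigning more than one cause can be handled, the CSMF difference is taken as method minus reference (so a positive value is an overestimate), and all function names and example data are hypothetical.

```python
def sensitivity_specificity(assigned, reference, cause):
    """Sensitivity and specificity of one method for one cause.

    `assigned` and `reference` are lists of sets of causes, one per death.
    """
    tp = fp = tn = fn = 0
    for a, r in zip(assigned, reference):
        predicted, actual = cause in a, cause in r
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def csmf(labels, cause):
    """Fraction of deaths attributed to `cause`."""
    return sum(cause in causes for causes in labels) / len(labels)


def csmf_difference(assigned, reference, cause):
    """Method CSMF minus reference CSMF; positive means overestimation."""
    return csmf(assigned, cause) - csmf(reference, cause)


# Hypothetical data for four deaths.
reference = [{"malaria"}, {"malaria"}, {"pneumonia"}, {"diarrhea"}]
assigned = [{"malaria"}, {"malaria", "pneumonia"},
            {"malaria", "pneumonia"}, {"diarrhea"}]
sens, spec = sensitivity_specificity(assigned, reference, "malaria")
diff = csmf_difference(assigned, reference, "malaria")
```

In this example the method detects both malaria deaths (sensitivity 100%) but also labels one non-malaria death as malaria (specificity 50%), so the malaria CSMF is overestimated by 25 percentage points. Note that when a method assigns several causes per death, the cause-specific fractions can sum to more than one.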

Intra-rater reliability of PCVA
A total of 149 VA questionnaires were selected for re-determining cause of death by four physician reviewers, each with a different number of VA questionnaires (Fig 2). Although the performance of individual physicians varied, intra-rater reliability was almost perfect for physician reviewer '2' (Kappa statistic = 0.87), substantial for physician reviewers '1' and '3' (Kappa statistic = 0.77), and moderate for physician reviewer '4' (Kappa statistic = 0.52). Overall, the level of agreement was substantial (Kappa statistic = 0.79) (Table 2). The repeat estimates of CSMF for the different causes of death did not differ substantially (< 10%) when compared to the original CSMF estimated by the same reviewer (Table 3).

Accuracy of PCVA, non-hierarchical algorithms and hierarchical algorithms for neonatal deaths
A total of 104 questionnaires representing neonatal deaths were evaluated using algorithms (Fig 3). Based on PCVA, common causes of death among neonates included septicemia (29%), meningitis (38%), pneumonia (8%), and congenital malformations (6%). Sensitivity of PCVA, non-hierarchical algorithms, and hierarchical algorithms was generally low (<50%) for the four major causes of neonatal deaths, with the exception of the sensitivity of non-hierarchical algorithms (76%) for septicemia deaths, and of PCVA (61%) for meningitis deaths. For congenital malformation, pneumonia, and septicemia deaths, specificity of PCVA was high (97%, 93%, and 78% respectively), and comparable to that of hierarchical algorithms (94%, 88%, and 52% respectively). With the exception of meningitis deaths, where the specificity of non-hierarchical algorithms (79%) was high, the specificity of non-hierarchical algorithms for the other causes of neonatal deaths was very low (<20%) (Table 4).
CSMF estimates for congenital malformation and pneumonia deaths were accurate and comparable for PCVA (0% and -3% difference respectively), non-hierarchical algorithms (1% and 2% difference respectively), and hierarchical algorithms (1% and 2% difference respectively). Non-hierarchical algorithms (50% difference) and hierarchical algorithms (16% difference) overestimated the CSMF for septicemia deaths, whereas PCVA (-3% difference) performed best. In contrast, non-hierarchical algorithms (5% difference) and hierarchical algorithms (-4% difference) had better CSMF estimates for meningitis deaths than PCVA (-16% difference, Table 5).

Accuracy of PCVA, non-hierarchical algorithms and hierarchical algorithms for causes of childhood deaths
A total of 615 questionnaires representing childhood deaths were evaluated using algorithms (Fig 3). The accuracy of PCVA, non-hierarchical algorithms and hierarchical algorithms ranged widely depending on the cause of death and the site (Table 4). For malaria deaths, the sensitivity of non-hierarchical algorithms (84%) was higher than that of PCVA (61%) and hierarchical algorithms (16%). This pattern was consistent in Kampala and Tororo. In contrast, the specificity of non-hierarchical algorithms for determining malaria deaths was low in Kampala (34%) and Tororo (39%), and much lower than the specificity of PCVA (84-88%) and hierarchical algorithms (93-94%) in determining malaria deaths (Table 4). Sensitivity and specificity of all methods for determining diarrheal deaths followed a pattern similar to that observed in determining malaria deaths. Sensitivity and specificity of non-hierarchical algorithms in determining pneumonia and meningitis deaths were comparable to those of hierarchical algorithms but lower than those of PCVA at all sites (Table 4). CSMF estimates of non-hierarchical algorithms (CSMF NHA) deviated greatly from the reference standard (CSMF MR; difference > 10%), with a tendency to overestimate the CSMF for the leading causes of death across all sites. The CSMF estimated by PCVA (CSMF PCVA) and the hierarchical algorithms (CSMF HA) approximated that of the reference standard (CSMF MR) for all cause(s) of death, performing far better than non-hierarchical algorithms. However, overall CSMF estimates of malaria deaths were best approximated by hierarchical algorithms (0% difference), exceeding the performance of both PCVA (6% difference) and non-hierarchical algorithms (56% difference), which both overestimated the fraction of deaths attributable to malaria when compared to the reference standard (Table 5). This pattern was consistent across all sites with the exception of Tororo, where PCVA was more accurate.

Discussion
To investigate the performance of different methods for determining causes of death from previously collected VA data, we evaluated the intra-rater reliability of PCVA, and compared the accuracy of PCVA and two algorithms, using physician review of hospital medical records as a reference standard. Contrary to prior reports, our findings suggest that the intra-rater reliability for classifying cause of death using PCVA is high [7,33]. Reliability of 3 out of 4 physicians was classified as 'substantial' or better, and repeat CSMF estimates for common causes of death were similar to the original estimates. One physician's score was sub-optimal, possibly due to the low number of records reviewed by that physician. Regardless, the overall performance was good, with a Kappa score indicating 'substantial' agreement between reviews. The physicians' prior knowledge of local epidemiology likely contributed to the good performance by three physicians [2]. Although prior knowledge and subjective application of clinical judgment may be considered 'biases', they are likely to have had a positive impact on the physicians' ability to correctly identify cause of death [34]. However, the subjectivity of the PCVA method may limit the ability to make temporal and spatial comparisons of mortality data. Standardized training of physician reviewers addresses this concern to an extent [11].
Although use of algorithms has been advocated to overcome the issue of subjectivity, the accuracy of algorithms remains a concern [4]. For neonatal deaths, sensitivity of PCVA, non-hierarchical algorithms, and hierarchical algorithms was low (<50%) for all the causes of neonatal deaths, with the exception of meningitis with PCVA (61%). In contrast, the specificity of PCVA and hierarchical algorithms was high compared to that of non-hierarchical algorithms, although specificity was relatively low for meningitis with PCVA (68%) and for septicemia with hierarchical algorithms (52%). In terms of estimating CSMF, all three methods were relatively accurate, with the exception of non-hierarchical algorithms and hierarchical algorithms, which overestimated the CSMF for septicemia deaths, probably because of the low specificity of both algorithms in determining septicemia deaths.
For childhood deaths, compared to PCVA, sensitivity of non-hierarchical algorithms was impressive, particularly for classification of malaria, diarrheal and malnutrition deaths. However, sensitivity was gained at the expense of specificity. This imbalance between sensitivity and specificity undermined the performance of the non-hierarchical algorithms when estimating CSMF for common causes of death, resulting in gross overestimation of the CSMF for the respective causes of death. Importantly, we note that the degree of error in estimating the CSMF was inversely related to the specificity level attained, implying that error in estimating CSMF decreased as specificity increased. With the exception of septicemia deaths, this phenomenon was not observed with neonatal deaths. Overlap of the signs and symptoms of common illnesses used to develop diagnostic criteria for these diseases could have limited the ability of the algorithms to distinguish between illnesses, resulting in assignment of multiple cause(s) of death and a marked decline in specificity. Because hierarchical algorithms assign a single cause of death to each VA questionnaire, specificity in determining causes of death increased, but at the expense of sensitivity, which declined. However, compared to the non-hierarchical algorithms, hierarchical algorithm estimates of the reference CSMF were accurate and as good as those of PCVA for all the common causes of death, a fact attributed to the high specificity levels of hierarchical algorithms. This finding, previously described by Anker et al. [35], demonstrated that specificity is an important driver of the accuracy of CSMF estimates determined by these methods. However, superiority was apparent only when the reference CSMF level was low (approximately <10%) for a particular disease [35]. In Tororo and Kisoro, the reference CSMF levels for malaria and pneumonia deaths were very high, and hierarchical algorithms, despite high specificity, greatly underestimated the CSMF attributable to malaria and pneumonia deaths at these sites, suggesting that the benefits of increased specificity in estimating the CSMF apply only when the true CSMF is low. Indeed, this may explain why non-hierarchical algorithms and hierarchical algorithms overestimated septicemia deaths among neonates. The primary limitation of both types of algorithm is their inflexibility. Unlike physicians, algorithms lack 'clinical acumen' and are not capable of interpreting the potential contribution of multiple disease processes ultimately leading to death. This limitation of algorithms is well-recognized, and has been cited as the primary disadvantage of algorithms and other automated methods for determining cause(s) of death from VA data [4].
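The dependence of CSMF accuracy on specificity and on the true CSMF can be made explicit with a standard misclassification identity. The following is a sketch rather than a formula given by the authors. It assumes a single cause assessed as present or absent for each death, and the symbols are introduced here for illustration: p is the true CSMF, Se and Sp are the sensitivity and specificity of the method for that cause, and \hat{p} is the expected CSMF estimated by the method.

```latex
\hat{p} = Se \cdot p + (1 - Sp)(1 - p),
\qquad
\hat{p} - p = (1 - Sp)(1 - p) - (1 - Se)\,p
```

When p is small, the error is dominated by the false-positive term (1 - Sp)(1 - p), so high specificity keeps the estimate close to the truth. When p is large, the false-negative term (1 - Se)p dominates, and a method with low sensitivity underestimates the CSMF regardless of its specificity. For example, taking Se = 0.16 and Sp = 0.93, similar to the values reported above for hierarchical algorithms and malaria, a hypothetical true CSMF of 0.10 gives an expected estimate of 0.079 (error of -0.021), whereas a hypothetical true CSMF of 0.50 gives 0.115 (error of -0.385).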
Several computerized methods premised on algorithmic (expert-driven or data-driven: Tariff, Artificial Neural Network, and Random Forest) and probabilistic (expert-driven: InterVA; data-driven: King-Lu and Simplified Symptom Pattern) approaches have been developed as alternative methods of determining cause(s) of death from VA questionnaires [23,28,30,33,36-38]. The dataset used to validate the Tariff, Random Forest, King-Lu and Simplified Symptom Pattern methods comprised a randomly selected number of gold standard hospital deaths that formed part of a larger multi-country verbal autopsy validation study [39]. In these validation studies, all three methods were more accurate than PCVA for most of the causes of death [36,37,40]. However, these results have been disputed, with a systematic review of 19 studies finding that no single VA method outperformed the others across selected CODs for both individual and population-level COD assignment [23].
InterVA uses a probability matrix, which was derived from the clinical knowledge of a group of physicians [41], and in addition to the Tariff method, has been recommended by the World Health Organization in their 2012 VA guidelines as one of the preferred methods for determining cause(s) of death [42]. However, two studies validating the performance of InterVA compared to PCVA against a gold standard based on rigorously defined clinical criteria yielded conflicting results: one study, conducted in Kilifi on the coast of Kenya, showed that InterVA performed as well as PCVA in determining the top five underlying causes of death in a rural community, whereas the other, based on a multisite validation study, showed that InterVA performance was suboptimal compared to PCVA [5,43]. Although InterVA has been widely implemented [44-47], inconsistent reports of the performance of this method, as well as of alternative CCVA approaches, should not be overlooked. Until CCVA methods are improved and evaluated, and consistently yield more accurate results than PCVA, it is likely that PCVA will continue to be used widely to determine causes of death from verbal autopsy questionnaires [23].
Our study is not without limitations. Internal evaluation of the performance of the hierarchical algorithms may have biased results in favor of the hierarchical algorithms. However, the results of our analysis are strengthened by the inclusion of three different study sites. Furthermore, the small number of deaths for some of the causes of death in both neonates and children, especially when stratified by site, may have undermined our ability to obtain representative estimates of the measures of performance.

Conclusions
Our study provides insights into the performance of different methods for determining cause(s) of death from VA questionnaire data collected in three sites. Importantly, we demonstrate that repeatability of PCVA is high, contrary to expectation, and that overall PCVA performed well. Thus, based on our results and available evidence so far, PCVA remains a reliable method for determining cause of death from VA questionnaire data. Given the lack of consensus on the accuracy of recently developed CCVA methods, PCVA still has a place in determining cause of death in VA, while existing and newer automated data-driven algorithms, which undoubtedly would be more efficient, are further developed, refined, and evaluated.