Seven assessors compared a new causality assessment tool (CAT), formulated by an expert focus group, with the Naranjo CAT in 80 cases from a prospective observational study and 37 published adverse drug reaction (ADR) case reports (819 causality assessments in total).
Main Outcome Measures
Utilisation of causality categories, measure of disagreements, inter-rater reliability (IRR).
The Liverpool ADR CAT, using 40 cases from an observational study, showed causality categories of 1 unlikely, 62 possible, 92 probable and 125 definite (1, 62, 92, 125) and ‘moderate’ IRR (kappa 0.48), compared to Naranjo (0, 100, 172, 8) with ‘moderate’ IRR (kappa 0.45). In a further 40 cases, the Liverpool tool (0, 66, 81, 133) showed ‘good’ IRR (kappa 0.6) while Naranjo (1, 90, 185, 4) remained ‘moderate’.
Citation: Gallagher RM, Kirkham JJ, Mason JR, Bird KA, Williamson PR, Nunn AJ, et al. (2011) Development and Inter-Rater Reliability of the Liverpool Adverse Drug Reaction Causality Assessment Tool. PLoS ONE 6(12): e28096. https://doi.org/10.1371/journal.pone.0028096
Editor: Antje Timmer, Bremen Institute of Preventive Research and Social Medicine, Germany
Received: January 20, 2011; Accepted: November 1, 2011; Published: December 14, 2011
Copyright: © 2011 Gallagher et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This paper presents independent research commissioned by the National Institute for Health Research (NIHR) under its Programme Grants for Applied Research scheme (RP-PG-0606-1170). The views expressed are those of the author(s) and not necessarily those of the National Health Service, the NIHR or the Department of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: RLS and MP are members of the Commission on Human Medicines. MP Chairs the Pharmacovigilance Expert Advisory Group, while RLS Chairs the Paediatric Medicines Expert Advisory Group.
Adverse drug reactions (ADRs) are a frequent source of morbidity and mortality. Causality assessment of ADRs may be undertaken by clinicians, academics, the pharmaceutical industry and regulators, and in different settings, including clinical trials. At an individual level, health care providers assess causality informally when dealing with ADRs in patients to make decisions regarding therapy. Regulatory authorities assess spontaneous ADR reports, where causality assessment can help in signal detection and aid in risk-benefit decisions regarding medicines.
An early paper by Sir Bradford Hill, describing minimum criteria for establishing causality of adverse events, pre-dates the earliest attempts to formulate ADR causality assessment tools. Bradford Hill set out criteria for establishing causality which included assessment of strength of the association, consistency of the association, specificity, temporal relationship, biological gradient (dose response), biological plausibility, coherence, experimental evidence, and reasoning by analogy. Although these criteria were not meant for ADRs, the elements have been adapted in ADR causality tools. Indeed, attempts to formalise causality assessment of ADRs into structured assessment tools have been ongoing for more than 30 years. It is known that assessing ADR likelihood without a structure can lead to wide disagreements between assessors. These disagreements may be the result of differing clinical backgrounds, specialties and experience. The causality tools thus aim to limit disagreement between assessors of ADR cases as to the likelihood that a reaction is related to a particular medication taken by the patient. A large number of causality tools have been developed ranging from the simple to the complex, but none have gained universal acceptance.
One of the most widely used causality assessment tools is the Naranjo tool. This is a simple 10-item questionnaire that classifies the likelihood that a reaction is related to a drug using concepts such as timing, plausibility/evidence, de-challenge and re-challenge/previous exposure. Each element of the questionnaire is weighted and the total score used to categorise the event into unlikely, possible, probable and definite. The tool was developed 30 years ago by adult pharmacologists/physicians and psychiatrists. Published case reports were used to validate the reliability of the tool in assessing causality. It has subsequently been widely used, including recently in two prospective observational studies of ADRs causing hospital admission and occurring in hospital in-patients. However, the reliability of the Naranjo tool has been questioned by a number of investigators.
While undertaking a prospective observational study of ADRs in children (in preparation), we found several difficulties with using the Naranjo tool. When assessing this heterogeneous mix of potential ADR cases, the investigators found some questions were not appropriate, leading to many answers being categorised as “unknown”. This led to a lack of sensitivity, as the overall score obtained for each causality assessment may be artificially lowered, which in turn underestimates the likelihood of an ADR. The investigators encountered several cases which were unanimously thought to be definite ADRs (e.g. repeated episodes of febrile neutropenia during oncological chemotherapy) but which did not reach the threshold for definite using the published Naranjo tool. Moreover, the weighting for each question and the ADR classification scoring boundaries used in the Naranjo tool were not justified in the original publication, or subsequently. Therefore, we undertook to develop a causality assessment tool that would overcome some of these issues, while at the same time (a) making it as easy as, or easier than, the Naranjo tool to use; and (b) ensuring that the basic principles of assessing causality as defined by Bradford Hill were maintained.
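To make the threshold problem concrete, the following minimal sketch maps a total Naranjo score to a causality category. The cut-offs used (9 or more definite, 5 to 8 probable, 1 to 4 possible, 0 or less doubtful) are the commonly cited boundaries of the published Naranjo scale and are not restated in this paper; the function name is ours and purely illustrative. Because an “unknown” answer adds nothing to the total, a case with several unanswerable items can fall short of the definite threshold even when assessors agree clinically that the reaction is definite.

```python
def naranjo_category(total_score: int) -> str:
    """Map a total Naranjo score to a causality category.

    Assumed cut-offs (commonly cited for the published scale, not taken
    from this paper): >=9 definite, 5-8 probable, 1-4 possible, <=0 doubtful.
    """
    if total_score >= 9:
        return "definite"
    if total_score >= 5:
        return "probable"
    if total_score >= 1:
        return "possible"
    # The published scale labels this band "doubtful"; this paper uses "unlikely".
    return "unlikely"


# A hypothetical case whose answered items sum to 8 stays "probable",
# one point short of "definite".
print(naranjo_category(8))
```

Under these assumed cut-offs, a single unanswerable item can be enough to move a case from definite to probable.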
Each of seven investigators (RG, JM, KB, MP, TN, RS, MT) independently assessed the first 40 consecutive case reports from a study of suspected ADRs causing hospital admission (ADRIC Study 1 – adverse drug reactions in children available at http://www.adric.org.uk/) using the Naranjo tool. The first 40 cases assessed using Naranjo were reviewed in terms of the results of the pair-wise agreements between the seven investigators. The cases where major discrepancies occurred, that is, where the range of causality probability differed by more than one category (e.g. possible and definite), and the cases where close to half of the raters differed from the others by one category were identified. The questions within the Naranjo tool which caused the discrepancies were identified and reviewed.
Each question in the Naranjo tool was reviewed by the investigators at a consensus meeting to assess whether it was appropriate to incorporate, discard or integrate with other questions into a new, more appropriate, causality tool (Table 1). A new causality tool was drawn up and modified through a consensus approach between the seven investigators. The format of the new tool was an algorithm, or flowchart, with dichotomous responses to each decision followed by routing to further, specific questions, rather than the weighted responses used in the Naranjo tool.
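The difference between a flowchart with dichotomous responses and a weighted questionnaire can be sketched as a small decision tree. The questions and routing below are placeholders chosen for illustration only; they are not the questions of the Liverpool tool (which appear in Figure 2), and the `Node` and `classify` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Node:
    """One yes/no decision; each branch leads to another Node or a category."""
    question: str
    if_yes: Union["Node", str]
    if_no: Union["Node", str]


def classify(root: Node, answers: Dict[str, bool]) -> str:
    """Follow yes/no answers through the tree until a category is reached."""
    current: Union[Node, str] = root
    while isinstance(current, Node):
        current = current.if_yes if answers[current.question] else current.if_no
    return current


# Placeholder tree (NOT the Liverpool tool): every question is hypothetical.
alternative = Node("alternative cause more likely?", "possible", "probable")
rechallenge = Node("positive re-challenge or same reaction on prior exposure?",
                   "definite", alternative)
root = Node("event followed exposure to the drug?", rechallenge, "unlikely")

example_answers = {
    "event followed exposure to the drug?": True,
    "positive re-challenge or same reaction on prior exposure?": False,
    "alternative cause more likely?": False,
}
print(classify(root, example_answers))
```

In a structure of this kind the category depends only on the path taken, so no item weights or score boundaries need to be chosen or justified.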
The new Liverpool ADR causality tool was then used to assess 20 new suspected ADR case reports from our observational study. All cases assessed from the ADRIC study contained a similar level of documentation. The collated causality categories for all seven assessors showed 1 (0.7%) unlikely, 18 (12.9%) possible, 2 (1.4%) probable and 119 (85%) definite. The assessors achieved moderate agreement with a kappa of 0.51 (95% CI 0.19, 0.82). However, there was an inappropriate bias towards the category of definite which was caused by decision paths leading to an answer of definite without the need for a positive re-challenge or previous reaction with exposure to the same drug. The assessment tool was reviewed again, and major discrepancies between scorers identified and each question within the algorithm reviewed to assess usefulness. Questions and decision pathways that caused major discrepancies were then modified. The new assessment tool was then tested on a further 20 case reports; ten from the ADRIC study and ten from an observational study of in-patient ADRs in an adult hospital. Collated causality categories for the ten ADRIC 1 cases showed 0 (0%) unlikely, 24 (34%) possible, 39 (56%) probable and 7 (10%) definite with a kappa of 0.27 (95% CI 0.11, 0.44). Collated causality categories for the ten adult cases showed 0 (0%) unlikely, 13 (19%) possible, 48 (69%) probable and 9 (13%) definite with a kappa of 0.13 (95% CI −0.14, 0.38).
The results of these assessments prompted another review of the appropriateness of the tool and questions. A third iteration was used so that the development and evaluation of tool prototypes was based on discussions in which 80 cases were used (Figure 1). After the third iteration the investigators were satisfied with the final version of the new tool (Figure 2) in terms of ease of use, lack of ambiguity, and appropriateness of the causality assignment. This was judged by expert opinion and consensus within the group.
The assessment process for the Liverpool causality assessment tool followed a step-wise procedure:
- The original 40 case reports (case reports of raw clinical data from an observational study) initially assessed with Naranjo were assessed by each of the seven investigators using the new assessment tool to provide a comparison of the inter-rater reliability between the two tools.
- In order to examine the tool using cases other than those collected in our observational study, 37 cases of ADRs were randomly selected from the Annals of Pharmacotherapy (Figure S1) and independently evaluated by the seven assessors using only the new tool. The Annals of Pharmacotherapy requires authors to apply a Naranjo assessment prior to publication of case reports.
- Since the original 40 cases from our observational study had been used in the design of the new tool, a further new set of 40 ADR case reports from our study were then used to compare inter-rater reliability using both the Naranjo and the Liverpool tools.
Categorical scores from both the Naranjo tool and the new tool take the same four point ordinal scale. The inter-rater agreements at each stage of the assessment process were assessed using a linear weighted kappa with 95% confidence intervals for ordered categories. Exact agreement percentages (%EA) were computed to measure the absolute concordances between assessor scores. The percentage of extreme disagreement (%ED), where the causality scores between two raters of the same case are more than one causality interval apart (e.g. definite for one rater and possible for the other), was also computed to measure extreme disagreements between pair-wise rater assessments. To supplement the pair-wise kappas, a global kappa score measuring nominal scale agreement across multiple assessors was calculated with 95% confidence intervals. The global kappa score provides a single statistic to quantify assessor agreement for each set of cases. Kappa values were interpreted according to the guidance from Altman: poor <0.2; fair 0.21–0.40; moderate 0.41–0.60; good 0.61–0.80; and very good 0.81–1.00 agreement.
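The two kappa statistics described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the pair-wise statistic uses linear weights 1 − |i − j|/(k − 1) on the four ordered categories, and the global statistic is assumed to be the multi-rater kappa of Fleiss (1971) listed in the references. Confidence intervals are not computed here, and the function names are ours.

```python
from collections import Counter

CATEGORIES = ["unlikely", "possible", "probable", "definite"]


def linear_weighted_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Linear weighted kappa for two raters scoring the same cases.

    Undefined (division by zero) if both raters use one identical category
    for every case.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Joint proportions: observed[i][j] = share of cases rated i by A and j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    p_o = p_e = 0.0
    for i in range(k):
        for j in range(k):
            weight = 1 - abs(i - j) / (k - 1)   # 1 on the diagonal, 0 at the extremes
            p_o += weight * observed[i][j]
            p_e += weight * row[i] * col[j]
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings_by_case, categories=CATEGORIES):
    """Multi-rater kappa (Fleiss 1971); each case is a list of all raters' categories."""
    n_cases = len(ratings_by_case)
    n_raters = len(ratings_by_case[0])
    totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_by_case:
        counts = Counter(ratings)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this case.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_cases
    p_e = sum((totals[c] / (n_cases * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)
```

The linear weights give partial credit when two raters differ by one category, whereas the global statistic treats the categories as nominal and counts only exact agreement.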
Assessment of the original 40 consecutive ADR cases by the seven investigators using the Naranjo tool showed collated categorisation of causality scores for all assessors (n = 280 assessments) of 0 (0%) unlikely, 100 (36%) possible, 172 (61%) probable and 8 (3%) definite (Table 2). Exact agreement percentages for the pair-wise comparisons between raters ranged from 43%–93%. Percentage of extreme disagreement (%ED) was 2.5% for four of the twenty-one pair-wise comparisons. There were no extreme disagreements in 17/21 pair-wise comparisons. Pair-wise kappas ranged from 0.27 to 0.86 and the assessors achieved moderate inter-rater reliability with a global kappa of 0.45 (95% CI 0.35–0.54) (Table 3). The same cases assessed using the new Liverpool tool showed collated causality categories of 1 (0.4%) unlikely, 62 (22%) possible, 92 (33%) probable and 125 (45%) definite. Exact agreement percentages ranged from 43–93%. All 21 pair-wise comparisons displayed extreme disagreement with percentages ranging from 5–20%. Pair-wise kappas ranged from 0.27 to 0.84 and the assessors achieved moderate inter-rater reliability with a global kappa score of 0.48 (95% CI 0.42–0.54) (Table 3).
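The pair-wise figures above follow from simple counting: seven raters give 21 distinct pairs, and with 40 cases a single extreme disagreement within a pair corresponds to 1/40 = 2.5%. The sketch below computes %EA and %ED for every pair; the rater labels and ratings are invented for illustration and are not study data.

```python
from itertools import combinations

CATEGORIES = ["unlikely", "possible", "probable", "definite"]
RANK = {category: position for position, category in enumerate(CATEGORIES)}


def pairwise_agreement(ratings):
    """Return {(rater_a, rater_b): (%EA, %ED)} for every pair of raters.

    `ratings` maps a rater label to that rater's categories, one per case,
    with cases in the same order for every rater.
    """
    results = {}
    for a, b in combinations(sorted(ratings), 2):
        gaps = [abs(RANK[x] - RANK[y]) for x, y in zip(ratings[a], ratings[b])]
        n_cases = len(gaps)
        exact = 100 * sum(gap == 0 for gap in gaps) / n_cases
        extreme = 100 * sum(gap > 1 for gap in gaps) / n_cases  # more than one category apart
        results[(a, b)] = (exact, extreme)
    return results


# Invented data: seven raters, 40 cases, all "probable" except that
# rater R7 scores the first case "unlikely" (two categories away).
ratings = {f"R{i}": ["probable"] * 40 for i in range(1, 8)}
ratings["R7"] = ["unlikely"] + ["probable"] * 39

results = pairwise_agreement(ratings)
print(len(results))             # number of rater pairs
print(results[("R1", "R7")])    # (%EA, %ED) for one pair involving R7
```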
The 37 randomly selected ADR case reports from the Annals of Pharmacotherapy assessed by the seven investigators using the Liverpool tool showed collated categorisation of causality scores (n = 259 assessments) of 1 (0.4%) unlikely, 67 (26%) possible, 136 (53%) probable and 55 (21%) definite. Exact agreement percentages ranged from 57%–97%. 18/21 pair-wise comparisons between raters showed some extreme disagreement, with the percentage ranging from 5–11%, while three showed no extreme disagreements. Pair-wise kappas ranged from 0.31 to 0.96 and the assessors achieved moderate inter-rater reliability with a global kappa of 0.43 (95% CI 0.34–0.51) (Table 4). These case reports were not assessed by the investigators using the Naranjo tool, as the Annals of Pharmacotherapy requires authors to apply a Naranjo assessment prior to publication of case reports in the journal. The collated categorisation of the case report author assessments for the 37 cases showed 0 unlikely, 5 (14%) possible, 29 (78%) probable and 3 (8%) definite.
The 40 newly selected ADR cases assessed by the seven investigators using the Naranjo tool showed collated categorisation of causality scores (n = 280 assessments) of 1 (0.4%) unlikely, 90 (32%) possible, 185 (66%) probable and 4 (1%) definite. Exact agreement percentages ranged from 63%–90%. Percentage of extreme disagreement was 2.5% for four pair-wise comparisons. There were no extreme disagreements in 17/21 comparisons. The pair-wise kappas ranged from 0.19 to 0.81 with moderate inter-rater reliability and global kappa of 0.44 (95% CI 0.33–0.55) (Table 5). The same cases assessed using the Liverpool tool showed collated causality categories of 0 (0%) unlikely, 66 (24%) possible, 81 (29%) probable and 133 (48%) definite. Exact agreement percentages ranged from 65%–88%. Percentage of extreme disagreement ranged from 2.5–7.5% for 14 pair-wise comparisons. There were no extreme disagreements in 7/21 comparisons. Pair-wise kappas ranged from 0.51 to 0.85 and the assessors achieved good inter-rater reliability with a global kappa of 0.60 (95% CI 0.54–0.67) (Table 5).
A recent systematic review of studies assessing the reliability of causality assessments concluded that “no causality assessment method has shown consistent and reproducible measure of causality.” We are currently undertaking a comprehensive assessment of adverse drug reactions in children. As part of this, we had initially decided to use the Naranjo tool to assess causality in our patients admitted with ADRs, and those who developed ADRs as in-patients. In order to do this, we planned to have assessments conducted independently by seven assessors. Initial assessments revealed some significant issues with the Naranjo tool (as outlined in the introduction above), which led us to develop the Liverpool Causality Assessment Tool.
The development of the Liverpool Causality Assessment Tool involved an iterative process conducted by a multidisciplinary team using raw case data and published case reports. The clinical team included nurses, pharmacists and physicians, including those working with adults and children. Previous experience with formal ADR assessment ranged from minimal to advanced. The assessment team also included medical statisticians who focused discussion on how to classify cases and monitored progress using standard tools for inter-rater agreement. This approach has the strength of timeliness but the potential weakness of “group-think”, in which independent thinking and expression of differences may be lost in the pursuit of group cohesiveness.
We believe that the Liverpool Causality tool has several advantages over the Naranjo tool. First, it performed as well as the Naranjo tool with the first set of cases that were assessed. The inter-rater reliability improved over time with the new tool, whereas the inter-rater reliability when using Naranjo remained similar, despite the fact that there was as much exposure to this tool within the assessing group. The improved inter-rater reliability with the new tool may be explained by increasing experience of its use.
The proportion of exact agreements between assessors was comparable between the two tools for both sets of cases despite the improvement in the global kappa for the new tool. This is because it is difficult to achieve a ‘definite’ category using the Naranjo tool and assessors mainly scored cases as ‘possible’ or ‘probable.’ Therefore, the chances of exact agreement between two assessors of the same case using the Naranjo tool are likely to be falsely elevated compared to the kappa scores which adjust for chance agreement. This paradox has been discussed previously in the literature.
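A small numerical illustration of this paradox, using invented ratings rather than study data: two pairs of raters both agree exactly on 36 of 40 cases, yet kappa differs sharply depending on how concentrated the ratings are in one category. For simplicity the sketch uses the unweighted two-rater kappa rather than the weighted and global statistics reported in this paper.

```python
from collections import Counter


def agreement_and_kappa(rater_a, rater_b):
    """Return (exact agreement proportion, unweighted Cohen's kappa)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's category frequencies.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return p_o, (p_o - p_e) / (1 - p_e)


# Skewed use of categories: almost every case is rated "probable".
skewed_a = ["probable"] * 36 + ["possible"] * 2 + ["probable"] * 2
skewed_b = ["probable"] * 36 + ["probable"] * 2 + ["possible"] * 2

# Balanced use of categories with the same four disagreements.
balanced_a = ["possible"] * 18 + ["probable"] * 18 + ["possible"] * 2 + ["probable"] * 2
balanced_b = ["possible"] * 18 + ["probable"] * 18 + ["probable"] * 2 + ["possible"] * 2

print(agreement_and_kappa(skewed_a, skewed_b))      # 90% agreement, kappa about -0.05
print(agreement_and_kappa(balanced_a, balanced_b))  # 90% agreement, kappa about 0.80
```

In the skewed example the expected chance agreement is 0.905, slightly above the observed 0.90, so kappa is close to zero even though the raters agree on nine cases in ten.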
The percentage of extreme disagreement between raters was higher for the Liverpool tool than for Naranjo. Due to the difficulty in achieving a ‘definite’ score with Naranjo, the chances of finding extreme disagreement, when comparing pair-wise assessments, are likely to be falsely low. The observed percentage of extreme disagreements decreased when using the Liverpool tool from the first set of 40 cases to the last set. This may also be explained by increasing experience of its use.
Second, the inter-rater reliability on assessing published case reports with the new tool was similar to that when we assessed our observational study cases with the Naranjo tool. Five of the seven assessors work in paediatric practice and the published case reports were adult cases. This perhaps provides an indication, albeit indirectly, of the robustness of the tool in assessing a range of case reports, even when used by assessors for cases from unfamiliar clinical settings.
Third, in the Naranjo tool, almost all cases were categorised as possible or probable. With the new tool, the range of categorisations was broader with some cases judged as being definite. A novel aspect of the tool which made this possible was that prior exposure that led to the same ADR, for example during a previous course of chemotherapy, was included and was thus judged as being equivalent to a prospective re-challenge. The high proportion of definite causality assessments can be explained by the fact that our study contained a large number of children with malignancies who had repeated courses of chemotherapy. It is also important to note that the cases were extracted from an observational study of suspected ADRs in children, and thus some case selection had occurred a priori, making a score of ‘unlikely’ improbable with either tool.
Fourth, a flow diagram rather than scoring system was used in the new tool for causality assessment and was felt by assessors to be easy to follow and quick to complete. We used a classification approach based on binary decisions (taking account of “don't know” responses). In this case, it is important to ensure that the binary decisions are robust. Once this has been done, then the instrument should be relatively context-independent. A weighted scoring system, such as the Naranjo tool, will however give more influence to some variables than others. A weighting scheme requires the validation of both the items in the tool and the weightings themselves. Ideally, the weightings need to be developed and validated in a context that is similar to the context in which they are applied. Thus a weighting scheme is more likely to be sensitive and specific within a defined context (as long as a gold standard is available) but is more likely to be context-dependent. We would therefore conclude that, for ADRs, where many different drugs can cause reactions in different settings and where the patient's ADR may be assessed by healthcare professionals from a variety of backgrounds, it is more important to develop a tool that is context-independent.
Not unexpectedly, we were unable to achieve complete agreement about causality assessment for a minority of suspected ADRs. Most likely, this reflects underlying uncertainty arising from issues such as the perceived likelihood of alternative explanations. These perceptions will vary between raters depending on their experience or professional backgrounds.
In summary, we present a new causality assessment tool, developed by a multi-disciplinary team, which performed better than the Naranjo tool. We believe the new tool to be practicable and likely to be acceptable for use by healthcare staff in assessing ADRs. We have undertaken a validation of the tool, with a total of 819 causality assessments by seven investigators, using investigators within our ADRIC research programme. Although this validation is equivalent to, if not better than, that undertaken for many other tools, one limitation is that the increase in IRR for the second set of 40 case reports using the new tool remains unexplained. We plan to investigate this using external validation in a randomised clinical trial. Another limitation is that the validation has been undertaken internally and not independently by other investigators. However, we feel that the tool shows promise, and by publishing it, we hope it will allow other investigators to undertake independent assessments of the usefulness of this tool in other populations (e.g. using data from adult or elderly care settings), not only for spontaneous reports but also for adverse events occurring within trials.
Figure S1. Annals of Pharmacotherapy published adverse drug reaction case reports assessed using the Liverpool ADR Causality Assessment Tool.
Conceived and designed the experiments: RMG JJK PRW RLS MP. Performed the experiments: RMG JRM KAB AJN MAT RLS MP. Analyzed the data: JJK RMG. Wrote the paper: RMG JJK PRW MAT RLS MP.
- 1. Impicciatore P, Choonara I, Clarkson A, Provasi D, Pandolfini C, et al. (2001) Incidence of adverse drug reactions in paediatric in/out-patients: a systematic review and meta-analysis of prospective studies. British Journal Of Clinical Pharmacology 52: 77–83.
- 2. Clavenna A, Bonati M (2009) Adverse drug reactions in childhood: a review of prospective studies and safety alerts. Arch Dis Child 94: 724–728.
- 3. Agbabiaka TB, Savovic J, Ernst E (2008) Methods for causality assessment of adverse drug reactions: a systematic review. Drug Saf 31: 21–37.
- 4. Turner WM (1984) The Food and Drug Administration algorithm. Special workshop–regulatory. Drug Information Journal 18: 259–266.
- 5. Arimone Y, Bidault I, Collignon AE, Dutertre JP, Gerardin M, et al. (2010) Updating of the French causality assessment method: 29. Fundamental & Clinical Pharmacology 6:
- 6. Laine L, Goldkind L, Curtis SP, Connors LG, Zhang YQ, et al. (2009) How Common Is Diclofenac-Associated Liver Injury? Analysis of 17,289 Arthritis Patients in a Long-Term Prospective Clinical Trial. American Journal Of Gastroenterology 104: 356–362.
- 7. Kling A (2004) Sepsis as a possible adverse drug reaction in patients with rheumatoid arthritis treated with TNFalpha antagonists. JCR: Journal of Clinical Rheumatology 10: 119–122.
- 8. Macedo AF, Marques FB, Ribeiro CF, Teixeira F (2005) Causality assessment of adverse drug reactions: comparison of the results obtained from published decisional algorithms and from the evaluations of an expert panel. Pharmacoepidemiol Drug Saf 14: 885–890.
- 9. Hill AB (1965) THE ENVIRONMENT AND DISEASE: ASSOCIATION OR CAUSATION? Proc R Soc Med 295–300.
- 10. Naranjo CA, Busto U, Sellers EM (1981) A method for estimating the probability of adverse drug reactions. Clinical Pharmacology and Therapeutics 30: 239–245.
- 11. Irey NS (1976) Tissue reactions to drugs. American Journal of Pathology 82: 613–648.
- 12. Arimone Y, Begaud B, Miremont-Salame G, Fourrier-Reglat A, Moore N, et al. (2005) Agreement of expert judgment in causality assessment of adverse drug reactions. Eur J Clin Pharmacol 61: 169–173.
- 13. Jones JK (2005) Determining causation from case reports. In: Strom BL, editor. Pharmacoepidemiology. Chichester, England: J. Wiley.
- 14. Davies EC, Green CF, Taylor S, Williamson PR, Mottram DR, et al. (2009) Adverse Drug Reactions in Hospital In-Patients: A Prospective Analysis of 3695 Patient-Episodes. PLOS ONE 4:
- 15. Pirmohamed M, James S, Meakin S, Green C, Scott AK, et al. (2004) Adverse drug reactions as cause of admission to hospital: prospective analysis of 18,820 patients. British Medical Journal 329: 15–19.
- 16. Avner M, Finkelstein Y, Hackam D, Koren G (2007) Establishing causality in pediatric adverse drug reactions: use of the Naranjo probability scale. Paediatr Drugs 9: 267–270.
- 17. Kane-Gill SL (2005) Are the Naranjo criteria reliable and valid for determination of adverse drug reactions in the intensive care unit? Annals Of Pharmacotherapy 39: 1823–1827.
- 18. Garcia-Cortes M, Lucena MI, Pachkoria K, Borraz Y, Hidalgo R, et al. (2008) Evaluation of Naranjo Adverse Drug Reactions Probability Scale in causality assessment of drug-induced liver injury. Alimentary Pharmacology & Therapeutics 27: 780–789.
- 19. Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76: 378–382.
- 20. Altman DG (1991) Practical statistics for medical research. London: Chapman & Hall.
- 21. University of Liverpool (2009) Adverse drug reactions in children. Available at: http://www.adric.org.uk/. Accessed 2011 March 1.
- 22. Feinstein AR, Cicchetti DV (1990) High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43: 543–549.
- 23. Cicchetti DV, Feinstein AR (1990) High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology 43: 551–558.
- 24. Lantz CA, Nebenzahl E (1996) Behavior and interpretation of the kappa statistic: resolution of the two paradoxes. Journal of Clinical Epidemiology 49: 431–434.
- 25. Danan G, Benichou C (1993) Causality Assessment of Adverse Reactions to Drugs.1. A Novel Method Based on the Conclusions of International Consensus Meetings - Application to Drug-Induced Liver Injuries. Journal of Clinical Epidemiology 46: 1323–1330.
- 26. Koh Y, Li SC (2005) A new algorithm to identify the causality of adverse drug reactions. Drug Saf 28: 1159–1161.