Replicating the Violence Risk Appraisal Guide: A Total Forensic Cohort Study

Introduction: The performance of violence risk assessment instruments can be primarily investigated by analysing two psychometric properties: discrimination and calibration. Although many studies have examined the discrimination capacity of the Violence Risk Appraisal Guide (VRAG) and other actuarial risk assessment tools, few have evaluated how well calibrated these instruments are. The aim of the present investigation was to replicate the development study of the VRAG in Europe including measurements of discrimination and calibration.

Method: Using a prospective study design, we assessed a total cohort of violent offenders in the Zurich Canton of Switzerland using the VRAG prior to discharge from prisons, secure facilities, and outpatient clinics. Assessors adhered strictly to the assessment protocol set out in the instrument’s manual. After controlling for attrition, 206 offenders were followed in the community for a fixed period of 7 years. We used charges and convictions for subsequent violent offenses as the outcomes. Receiver operating characteristic analysis was conducted to measure discrimination, and Sanders’ decomposition of the Brier score as well as Bayesian credible intervals were calculated to measure calibration.

Results: The discrimination of the VRAG’s risk bins was modest (area under the curve = 0.72, 95% CI = 0.63–0.81, p < 0.05). However, the calibration of the tool was poor, with Sanders’ calibration score suggesting an average assessment error of 21% in the probabilistic estimates associated with each bin. The Bayesian credible intervals revealed that in five out of nine risk bins the intervals did not contain the expected risk rates.

Discussion: Measurement of the calibration validity of risk assessment instruments needs to be improved, as has been done with respect to discrimination. Additional replication studies that focus on the calibration of actuarial risk assessment instruments are needed. Meanwhile, we recommend caution when using the VRAG probabilistic risk estimates in practice.


Introduction
Many actuarial risk assessment instruments have been developed over the past 30 years in response to seminal research [1,2] and government reports [3][4][5] on the poor predictive validity of unstructured clinical judgments of violence risk. These structured instruments are composed of weighted risk and protective factors that have been found to be statistically associated with the likelihood of violence. To obtain an estimate of risk, one combines the factors using a pre-determined algorithm that assigns subjects to a risk category or "bin" to which the instrument's creators have assigned an empirically determined probability of future violence [6]. In doing so, these empirically developed risk assessment instruments offer a probability model rather than predicting one of two outcomes (recidivism vs. no recidivism; cf. [7,8]).
According to recent surveys [9,10], one of the actuarial instruments most commonly used by clinicians is the Violence Risk Appraisal Guide (VRAG; [11]). The VRAG was developed in Canada using a sample of 618 adult mentally disordered violent offenders at the Mental Health Centre in Penetanguishene [12]. The offenders were followed for 6.8 (SD = 5.1) years after discharge and both charges and convictions for subsequent violent offenses were identified. The scheme was constructed using the Nuffield [13] strategy, which identifies items and subsequently assigns weights according to how well the characteristics differentiate between the base rates of offending. Since its publication, the VRAG has become one of the best-researched instruments in terms of studies designed to measure its performance to assess the risk of recidivism [14].
Despite widespread implementation of the VRAG and its large research base, a systematic review suggests that no studies have been published that replicate the original development study of the VRAG in terms of sex and age composition, sample index offense, use of file information for administration, reliable scoring, lack of item approximations and omissions, length of (fixed) follow-up, controls for attrition, assessment of violent recidivism, and use of conviction as the legal status of outcomes [15]. Moreover, there are few studies attempting to replicate the probabilistic estimates put forth in the VRAG manual for the instrument's nine actuarial risk bins (Table 1). Studies that have investigated the goodness-of-fit between the rates of violent recidivism published in the VRAG manual and rates observed during the research have produced inconsistent findings (Table 2). This inconsistency is important given the equal importance of discrimination and calibration when attempting to establish a valid risk assessment tool.

Components of Performance Measures
When thinking about the performance of a violence risk assessment instrument, two distinct aspects deserve attention: discrimination and calibration [16]. In the present context, discrimination refers to the instrument's ability to differentiate between recidivists and non-recidivists, and calibration refers to the fit between the risk estimates provided by the instrument's creators (which typically are based on the recidivism rates in the sample used to develop the tool) and the observed recidivism rates in the sample of current interest.
As several researchers have pointed out, discrimination findings for actuarial instruments do not necessarily equate to calibration validity (cf. [17][18][19]); rather, discrimination and calibration are equally important sides of the same coin, both of which need to be established in order to argue that a tool is valid [20]. Therefore, despite a considerable evidence base supporting the VRAG's discrimination [21], evidence of the tool's calibration (i.e., the ability of a risk assessment tool to estimate rates of recidivism for single risk scores) remains an essential piece of the puzzle that is currently incomplete.

The Present Study
In the present investigation it was our aim to replicate the initial development study of the VRAG in Europe, paying particular attention to matching the demographic and design characteristics of the tool's normative investigation. Both discrimination and calibration performance indicators were calculated. We hypothesised that discrimination and calibration indices would be satisfactory using a sample and methodology that did not differ substantially from the original VRAG study.

Participants
The sample for the present study was taken from the Zurich Forensic Study, a prospective study of all 465 offenders supervised by the criminal justice system of the Canton of Zurich, Switzerland, as of August 2000 [22]. This total forensic cohort included all offenders regardless of the severity of their index offense, mental health status, criminal responsibility, and length of prison stay, provided a minimum sentence of 10 months or court-ordered therapy was carried out. To make the study sample comparable to the VRAG development sample, we considered only male offenders who were discharged into the community and who achieved a follow-up time of 7 years (n = 287). After elimination of participants who died, were deported before the end of the follow-up period, or were missing five or more VRAG items, the final study sample consisted of 206 offenders (Figure 1). Research using this dataset was approved by the Ethics Committee of the Canton of Zurich. With agreement from the committee, informed consent was not needed because there was no contact with any of the study participants.

Procedure
[Table comparing VRAG studies; table body not recoverable from the source. Recoverable headings: Harris et al. [45], Harris et al. [48], Mill [49], Yessine et al. [50], Snowden et al. [51], Kröner et al. [52], Hastings et al., Zurich Forensic Study, Replication match (3).]
Note. - = Not Applicable; NR = Not Reported; LoFU = Length of follow-up. 1 VRAG development sample. 2 Rates for men only. 3 Out of 12 matching criteria established by Rossegger and colleagues [15]. 4 Base rate of violent (including sexual) recidivism for offenders with a VRAG score. doi:10.1371/journal.pone.0091845.t001

The present study represents the first true replication study of the VRAG performed according to the comprehensive criteria for matching design and demographics that were established by Rossegger and colleagues [15] to compare validation and cross-validation studies (Table 3). Two master's-level psychologists who had attended accredited Psychopathy Checklist-Revised [23] workshops and were blind to the purpose of the study and participant outcomes scored a validated translation of the VRAG [24]. The assessors adhered strictly to the assessment protocol set out in the instrument's manual, avoiding systematic item omissions and using the prorating algorithm published by the VRAG authors [11]. A pilot study revealed substantial inter-rater agreement between the item and total scores of the two assessors (κ = 0.70-0.89 [25]).
Recidivism was defined as a new charge or conviction for a violent (including sexual) offense committed after discharge from prisons, secure facilities, and outpatient clinics. Determination of recidivism was based on criminal records, which included information on charges and convictions, date of offense, type of offense, and length of sentence. Of note, in Switzerland charges are only displayed in the criminal record while a subject is under investigation. The potential time at risk was from August 2000 until May 2011. In order to create a follow-up period comparable to that used in the VRAG development study, we considered only offenses committed within 7 years after discharge.

Statistical Analysis
Discrimination was measured using receiver operating characteristic (ROC) curve analysis and the resulting area under the curve (AUC). The ROC curve plots the true positive rate (the fraction of recidivists correctly identified) as a function of the false positive rate (the fraction of non-recidivists misidentified) as the decision criterion (or cut-off) is moved from the highest to the lowest risk bin. The AUC represents the probability that a randomly selected recidivist would have a higher risk bin classification than a randomly selected non-recidivist.
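The probabilistic reading of the AUC given above can be illustrated with a minimal sketch. The function name `auc_from_bins` and the toy bin assignments are hypothetical and are not taken from the Zurich Forensic Study; counting ties as one half is the usual convention for ordinal scores and is assumed here rather than stated in the text.

```python
from itertools import product


def auc_from_bins(recidivist_bins, non_recidivist_bins):
    """Share of (recidivist, non-recidivist) pairs in which the recidivist
    has the higher risk bin; tied pairs count as one half."""
    wins = 0.0
    for r, n in product(recidivist_bins, non_recidivist_bins):
        if r > n:
            wins += 1.0
        elif r == n:
            wins += 0.5
    return wins / (len(recidivist_bins) * len(non_recidivist_bins))


# Hypothetical toy data: risk bin (1-9) of each subject.
recidivists = [5, 6, 7, 4]
non_recidivists = [3, 4, 5, 2, 6]

auc = auc_from_bins(recidivists, non_recidivists)
print(f"AUC = {auc:.3f}")
```

Because only the ordering of bins enters the calculation, the AUC is unaffected by the probabilities attached to the bins, which is why discrimination and calibration have to be assessed separately.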
Calibration was measured using three methods. First, we compared the violent recidivism rates for each VRAG risk bin during the 7-year fixed follow-up period in the development study published by Harris and colleagues [12] with the recidivism rates of participants in the total forensic cohort of this study. Second, we calculated the squared error between the average predicted recidivism rate and the average observed recidivism rate in each risk bin using Sanders' decomposition of the modified Brier score [26]. The Brier score is a widely used overall performance measure that quantifies the disagreement between expected rates and a binary outcome (i.e., the mean squared error of prediction) [27][28][29]. It thus reflects both the discrimination and the calibration of a model and ranges from 0 to 1, with 0 indicating perfect model performance (cf. [30]). These two properties can be analysed separately using Sanders' decomposition of the modified Brier score. The first term of the decomposition provides information on calibration, as it measures the error between the mean forecast and the mean outcome within each group. The second term contains the discrimination of the model [30,31]. An overview of the Brier score is provided by Ferro [32] and Redelmeier, Bloch, and Hickam [33]. Third, we calculated Bayesian credible intervals for each VRAG risk bin in the Zurich Forensic Study using the Jeffreys prior for the binomial proportion, a Beta(0.5, 0.5) distribution [34]. We applied a Bayesian approach to investigate the observed data by comparing the bin-specific rates with those published by the tool's authors, taking a prior probability distribution into account [35][36][37].
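The two-term split of the Brier score can be sketched as follows. This is the standard textbook form of Sanders' decomposition for forecasts grouped into bins; the "modified" Brier score used in the study may differ in detail, so the sketch should be read as an illustration under that assumption. The function name and the toy bins are hypothetical.

```python
import math


def sanders_decomposition(bins):
    """bins: list of (forecast_probability, n_subjects, n_events).

    Returns (calibration, refinement, brier), where
      calibration = sum_k n_k * (f_k - o_k)^2 / N
      refinement  = sum_k n_k * o_k * (1 - o_k) / N
    and o_k is the observed event rate in bin k.
    """
    total = sum(n for _, n, _ in bins)
    calibration = 0.0
    refinement = 0.0
    for forecast, n, events in bins:
        if n == 0:
            continue  # empty bins contribute nothing
        observed = events / n
        calibration += n * (forecast - observed) ** 2
        refinement += n * observed * (1.0 - observed)
    calibration /= total
    refinement /= total
    return calibration, refinement, calibration + refinement


# Hypothetical toy data: two bins with forecast rates of 10% and 50%.
toy_bins = [(0.10, 10, 2), (0.50, 10, 4)]
calibration, refinement, brier = sanders_decomposition(toy_bins)
print(f"calibration = {calibration:.3f}, refinement = {refinement:.3f}, "
      f"Brier = {brier:.3f}, root calibration error = {math.sqrt(calibration):.3f}")
```

The square root of the calibration term can be read as a typical deviation between forecast and observed rates per bin, expressed on the probability scale.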

Sample Characteristics
The sample population for the present investigation was composed of 206 adult male offenders with a mean age of 34.8 years (SD = 11.5) at the time of conviction for their index offense and 37.6 years (SD = 11.7) at their discharge. Index offenses included the following: homicide (n = 37, 18.0%), robbery (n = 55, 26.7%), assault (n = 31, 15.1%), child sexual abuse (n = 44, 21.4%), and rape (n = 39, 18.9%). Court-mandated therapy was ordered for 131 (63.6%) offenders. Criteria for a personality disorder according to DSM-IV and/or ICD-10 were fulfilled in 45.6% (n = 94) of the offenders, and 11.1% (n = 23) of the sample met the diagnostic criteria for schizophrenia.

Base rate of Violent Recidivism
The cohort was followed for 7 years post-discharge and criminal registers were used to ascertain whether they had recidivated or not. The base rate of violent (including sexual) recidivism was 18.0% (n = 37). When stratified by offense type, the following recidivism rates were documented: homicide, 1.5% (n = 3); robbery, 5.3% (n = 11); assault, 8.7% (n = 18); child sexual abuse, 3.9% (n = 8); and rape, 1.9% (n = 4). Three participants engaged in acts that were classified under more than one category.

Performance Measures
The discrimination analysis of the VRAG, assessed using ROC curve analysis, produced an AUC of 0.72 (95% CI = 0.63-0.81, p < 0.05). This suggests that the probability that a randomly selected violent recidivist had a higher risk bin classification than a randomly selected non-recidivist was 72%. Although there is considerable variability in what constitutes a small, moderate, and large value for AUC [39], there is general agreement that this effect size represents good discrimination [40].
We explored the calibration descriptively and also analysed group differences using Sanders' decomposition of the modified Brier score as well as the Bayesian credible intervals to investigate the significance of differences in risk rates for each VRAG risk bin (cf. Table 4). The mean VRAG score was 4.9 (SD = 11.7, range = −20 to +38). There were no offenders with scores warranting classification in the lowest risk bin. The majority of the offenders (n = 125, 60.7%) were classified in the fourth through sixth bins. The base rate of violent recidivism in the majority of the risk bins was lower in the total forensic cohort than in the VRAG development sample (Figure 2). A good overall performance of the VRAG was indicated by a Brier score of B = 0.18 (AUC = 0.72). However, the Sanders' decomposition score for the prediction of violent recidivism was 0.04, which corresponds to an average error of 21.0% per risk bin. Of particular note was the ratio of the excess forecast variance to the minimum forecast variance for the VRAG, which was 10.4. Ratios higher than 6.0 suggest substantial excess variation in risk predictions [30].
In five out of nine risk bins (bins 3, 5, 6, 7, and 9), the published recidivism rates fell outside the Bayesian 95% credible interval calculated for the data from the Zurich Forensic Study and exceeded the observed rates of recidivism (Table 4, Figure 2). This indicates a significant deviation of the published risk rates from those found in the current study in most of the VRAG risk bins.
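The per-bin check described above, whether a published rate lies inside the 95% credible interval of the observed rate, can be sketched as follows. With a Jeffreys prior, the posterior for a bin with x recidivists among n offenders is Beta(x + 0.5, n − x + 0.5). To stay within the Python standard library, the sketch approximates the posterior quantiles by simulation instead of evaluating the Beta quantile function exactly; the function name, the counts, and the published rate are hypothetical and not taken from Table 4.

```python
import random


def jeffreys_interval(events, n, level=0.95, draws=100_000, seed=0):
    """Approximate equal-tailed credible interval for a binomial proportion
    under the Jeffreys prior, using Monte Carlo draws from the posterior
    Beta(events + 0.5, n - events + 0.5)."""
    rng = random.Random(seed)
    a = events + 0.5
    b = n - events + 0.5
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# Hypothetical bin: 5 recidivists among 50 offenders, published rate of 35%.
lower, upper = jeffreys_interval(events=5, n=50)
published_rate = 0.35
covered = lower <= published_rate <= upper
print(f"95% credible interval: {lower:.3f} to {upper:.3f}; "
      f"published rate covered: {covered}")
```

In this toy case the published rate lies above the interval, which is the pattern reported for the five bins listed above.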

Discussion
The aim of the present study was to assess the performance of a commonly used violence risk assessment instrument, the VRAG. This research represents the first replication of the VRAG in which the dataset fulfilled the methodologic requirements of the tool's development study including its prospective orientation, 7-year fixed length of follow-up, participant inclusion criteria, scoring protocol, and controls for sources of attrition. To ensure a comprehensive evaluation of the tool's performance to assess violent offenders' risk of recidivism, both discrimination (the ability to differentiate between recidivists and non-recidivists) and calibration (the fit between predicted risk and observed risk) were measured in the study.
The overall performance and discrimination validity of the VRAG were found to be good with respect to its ability to differentiate between violent recidivists and non-recidivists (Brier score B = 0.18 and AUC = 0.72, respectively). This level of discrimination is comparable to that reported by a number of other authors [21]. This being said, the calibration validity of the instrument was found to be poor; when we examined the observed rates of violent recidivism in each of the nine VRAG risk bins we found substantial differences compared with the expected rates as published by the tool's developers. In addition to descriptively exploring violence rates, we also investigated calibration validity using two additional approaches: Sanders' decomposition of the Brier score and Bayesian credible intervals for the VRAG risk bins. Using all three approaches we obtained consistent evidence that the VRAG was poorly calibrated for use in Switzerland. This corresponds to reports by other authors of poor calibration for the tool (Table 2).

Implications
Results of the present study suggest that the VRAG lacks calibration validity. This is rather peculiar for actuarial instruments, since their key advantage over alternative approaches to risk assessment such as structured professional judgment lies in their conversion of total risk scores into probabilistic estimates of future violence risk. A poor fit of expected and observed recidivism rates limits the usefulness of actuarial risk assessment instruments in practice, because it reduces the tool's ability to guide resource allocation and level of service classification using recidivism estimates. In legal contexts, lack of calibration validity may also lead to overestimation of the risk of future violence, resulting in long sentences, costly mandated therapy, or unnecessary community supervision. Given these serious consequences, further calibration studies using sound study protocols and comprehensive strategies for data analysis are needed. As part of this effort, the observed rates of recidivism for each risk bin should be routinely reported. Furthermore, discussion concerning the measurement of the calibration validity of risk assessment instruments needs to be advanced, as has been performed for discrimination [39]. Until this has been achieved, caution is needed when using the instrument's probabilistic risk estimates in practice.
In accordance with a Bayesian approach, recent meta-analyses of literature on risk assessment for both violent [41] and sex [42,43] offenders suggest that it might not be possible to reliably assign an expected probability to a group without taking into consideration population-based priors. This raises the following question: if the published expected recidivism rates for the nine VRAG risk bins are not reliable, of what practical use are differences between bins? For example, what actions would be appropriate for the individuals in bin 4 that would not be needed for individuals in bin 3?
Previous studies have endeavoured to measure the calibration validity of the VRAG using either the χ² goodness-of-fit index or the correlation coefficient, both of which have limited usefulness for this task. Regarding the former, the goodness-of-fit index is calculated using the expected rate of violence as specified in the VRAG manual and the rate of violence observed in a given replication study:

$$\chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i},$$

where $O_i$ and $E_i$ denote the observed and expected rates of violence in risk bin $i$. The first notable issue when using this calibration parameter with the VRAG is that the expected rate of violence for the lowest risk bin of the instrument is 0%, resulting in division by zero. Adding a small constant in instances of zero cell counts allows nonparametric analyses to proceed [44] but can result in considerably biased χ² estimates. For example, Harris and colleagues [45] found that 17% of individuals in the lowest risk bin of the VRAG went on to violently recidivate. Using a substitute of 1% for the expected rate results in a single risk bin χ² of 256, well above the α = 0.05 critical threshold of 15.51. A second limitation of the index is that differences in expected and observed rates of violence in lower risk bins have a larger influence on the resulting χ² estimate than differences in higher bins. For example, a deviation of 5% in the highest risk bin has almost no statistical influence, whereas the same 5% mismatch in the lowest risk bin will have considerable impact. A third obstacle that needs to be considered when using the goodness-of-fit test is that risk assessment tools developed in populations with higher base rates will have poorer calibration estimates when replicated in samples with higher rates of violence than in those with lower rates. Given the substantial variability in the rates of violence in VRAG studies [46], this may be an important issue to consider in some studies.
For these reasons, goodness-of-fit tests may be inappropriate for measuring calibration validity in replication studies whose samples are derived from populations with higher overall base rates (or at least higher rates of violence in individuals with lower VRAG scores). The correlation coefficient (r) is similarly limited in that deviations of even 20% in the expected rate of violence in each risk bin can still produce statistically significant evidence of good calibration.
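The sensitivity of the goodness-of-fit index to small expected rates can be made concrete with a short sketch of the contribution of a single risk bin, (O − E)² / E, with rates expressed in percent. The function name is hypothetical. The first example reproduces the figure of 256 quoted in the text (17% observed against a substituted 1% expected); the expected rates of 5% and 90% in the other two examples are illustrative values only and are not taken from the VRAG manual.

```python
def chi_square_term(observed_pct, expected_pct):
    """Contribution of one risk bin to the goodness-of-fit index,
    (O - E)^2 / E, with both rates given in percent."""
    if expected_pct == 0:
        # An expected rate of 0% leaves the term undefined.
        raise ZeroDivisionError("expected rate of 0% makes the term undefined")
    return (observed_pct - expected_pct) ** 2 / expected_pct


# Lowest bin: 17% observed, 1% substituted for the expected 0%.
lowest_bin = chi_square_term(17, 1)

# Same 5-point deviation at a low and a high expected rate (illustrative).
low_bin = chi_square_term(10, 5)
high_bin = chi_square_term(95, 90)

print(f"lowest bin: {lowest_bin:.1f}, low bin: {low_bin:.2f}, high bin: {high_bin:.2f}")
```

The same absolute deviation contributes far more when the expected rate is small, which is the asymmetry described in the second limitation above.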

Limitations
The present sample represented a total forensic cohort in Switzerland, a country with a criminal justice system based on civil law. The VRAG, however, was developed in a common law jurisdiction (Ontario, Canada), meaning that several items are couched in jurisprudence that is not relevant abroad. Thus, it is perhaps understandable that the VRAG performed poorly in terms of calibration validity in our investigation despite our attempts to replicate the conditions of the instrument's development study as closely as possible. This said, the authors of the VRAG manual have previously claimed that the tool can be used in international settings based on discrimination findings using performance indicators such as the AUC and correlation coefficient. Taking into consideration the present findings together with previous reports that the instrument's probabilistic estimates of future violence risk do not hold in other countries including Germany, Sweden, the United Kingdom, and the United States, the developers of actuarial risk assessment tools might need to revise their conclusions concerning generalizability. One way forward could be the establishment and incorporation of jurisdiction-specific norms for group-based risk estimates, which would allow for greater cultural sensitivity when instruments developed in one country are implemented in another.

Conclusion
The performance of violence risk assessment tools has two components: discrimination and calibration. To date, studies have primarily focused on discrimination, and calibration has been largely neglected. However, both components need to be established before concluding that a risk assessment instrument is useful in practice. The large body of discrimination evidence for actuarial instruments such as the VRAG belies scant calibration findings that suggest poor performance in prospective risk assessment using probabilistic risk estimates. In the end, although the performance of the instrument with respect to discrimination indicates potential of the VRAG, its poor calibration results raise questions regarding its practical usefulness.