Assessment of Response to Lithium Maintenance Treatment in Bipolar Disorder: A Consortium on Lithium Genetics (ConLiGen) Report



Abstract
Objective: The assessment of response to lithium maintenance treatment in bipolar disorder (BD) is complicated by variable length of treatment, unpredictable clinical course, and often inconsistent compliance. Prospective and retrospective methods of assessment of lithium response have been proposed in the literature. In this study we report the key phenotypic measures of the “Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder” scale currently used in the Consortium on Lithium Genetics (ConLiGen) study.
Materials and Methods: Twenty-nine ConLiGen sites took part in a two-stage case-vignette rating procedure to examine inter-rater agreement [Kappa (κ)] and reliability [intra-class correlation coefficient (ICC)] of lithium response. Annotated first-round vignettes and rating guidelines were circulated to expert research clinicians for training purposes between the two stages. Further, we analyzed the distributional properties of the treatment response scores available for 1,308 patients using mixture modeling.
Results: Substantial and moderate agreement was shown across sites in the first and second sets of vignettes (κ = 0.66 and κ = 0.54, respectively), without significant improvement from training. However, definition of response using the A score as a quantitative trait and selecting cases with B criteria of 4 or less showed an improvement between the two stages (ICC1 = 0.71 and ICC2 = 0.75, respectively). Mixture modeling of score distribution indicated three subpopulations (full responders, partial responders, non responders).

Conclusions: We identified two definitions of lithium response, one dichotomous and the other continuous, with moderate to substantial inter-rater agreement and reliability. Accurate phenotypic measurement of lithium response is crucial for the ongoing ConLiGen pharmacogenomic study.

Introduction
Bipolar disorder (BD) is a lifelong and severe psychiatric illness characterized by recurrences of episodes of depression and hypomania/mania [1]. Lithium is among the first-line maintenance treatments for BD [2,3], preventing relapses and recurrences of opposite polarity. In addition, lithium decreases the risk of suicidal behaviour and all-cause mortality in mood disorders [4][5][6].
Despite a significant genetic component for lithium-responsive BD [12,19], pharmacogenetic studies have not produced replicated results [20,21]. One possible explanation for the lack of conclusive pharmacogenetic findings is the varying definition of lithium response across studies. Indeed, the assessment of lithium maintenance treatment response, and consequently the definition of the phenotype under study, is complicated by factors inherent to the natural history of BD. The irregular clinical course of BD [22] and variable treatment adherence [23] are only a few of the factors that contribute to the complexity of assessing the response to lithium maintenance treatment.
To reduce the impact of the clinical heterogeneity of BD in pharmacogenetics (and possibly to define genetically more homogeneous subgroups of BD patients), researchers have proposed selecting prospectively followed patients on lithium monotherapy with unequivocal clinical response [24,25]. However, this may not be practical if large patient samples are needed. In such cases, we need to rely on retrospective evaluation of treatment response. Several such methods have been described in the literature, including the Affective Morbidity Index (AMI) [26] and the Illness Severity Index [27]. The AMI takes into account the duration and the severity of an episode, the latter scored on a 4-point scale (0 = no conspicuous affective disturbance, 1 = mild depression or mania, 2 = moderate depression or mania, 3 = severe depression or mania). The area under the curve can be calculated from these two variables and compared between defined treatment periods. Similarly, the Illness Severity Index measures the efficacy of lithium treatment in controlling mood episodes. It is defined as the frequency of affective episodes prior to starting lithium, adjusted for age at the time lithium was started [27]. However, changes in affective morbidity may result not only from the treatment but also from other factors. In the Consortium on Lithium Genetics (ConLiGen, www.ConLiGen.org) study [28], we adopted the “Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder” as the principal method of evaluation of the response to lithium [12,13]. In addition to measuring the degree of clinical improvement, this scale weighs clinical factors considered relevant in determining whether the observed clinical change is in fact due to the lithium treatment.
Since ConLiGen is an international multi-centre collaboration, it has been crucial to assess the key phenotypic measures and the reliability of the assessment of response to long-term lithium treatment across the participating research groups. Here we present: 1) the results of the reliability analysis of response to lithium treatment across the participating centres, and 2) the distributional properties of the scale scores. These two sets of findings have been instrumental in obtaining stringent phenotypic definitions of lithium response. These analyses are of particular importance in light of the genome-wide association study (GWAS) currently being undertaken by ConLiGen.

Assessment of Clinical Response to Lithium Treatment
The response to lithium treatment was measured using a previously published and validated rating scale: the “Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder” [12,28]. Briefly, this scale quantifies the degree of improvement in the course of treatment (A criterion or A score) expressed as a composite measure of change in frequency and severity of mood symptoms. The A score is weighed against 5 factors (B criteria) which allow one to determine if the observed improvement is a result of the treatment rather than a spontaneous improvement or an effect of additional medication. Specifically, the B criteria consider: the number of episodes before/off the treatment (B1), the frequency of episodes before/off the treatment (B2), the duration of the treatment (B3), the compliance during period(s) of stability (B4) and the use of additional medication during the period of stability (B5). The total score (TS) is obtained by subtracting the B score from the A score.

Analysis of the Inter-rater Agreement and Reliability of the Assessment of Lithium Response
The agreement and reliability of the assessment of lithium response between raters of 29 ConLiGen participating centres was measured using a two-stage case-vignette rating procedure (Table 1). Specifically, the study protocol had three phases: 1) twelve standardized case vignettes prepared by investigators (M.A., J.G., C.S.) at Dalhousie University were circulated and rated by 70 investigators; 2) annotated first-round vignettes and rating guidelines were circulated for training purposes after the first stage; 3) sixteen additional more complex vignettes prepared by senior researchers at Dalhousie University, Johns Hopkins University School of Medicine, National Institute of Mental Health (NIMH) and Academia Sinica of Taiwan (M.A., J.G., J.P., T.G.S., F.M., A.C.) were circulated and rated by 48 investigators at the participating sites.
The first set of vignettes was based exclusively on BD patients who had been prospectively followed in a specialty program and with detailed clinical information on the course of illness and treatment history. The second set of vignettes was heterogeneous and included patients treated in various settings, some with limited clinical details assessed cross-sectionally. Since raters had no prior knowledge of the rating scale, this design allowed us to estimate the impact of training on agreement and reliability of lithium response assessment. The rating procedure was performed from April 2009 to October 2012.
The degree of concordance of lithium response definition was assessed with Cohen's kappa (κ) [29] and intra-class correlation (ICC) coefficient [30]. These analytical methods were applied to the dichotomous and continuous definition of lithium response, respectively. The κ statistics (multiple raters with two outcomes) were calculated with 95% confidence interval (CI) for each cut off point of the TS scale in the range from 3 (non response to lithium) to 8 (full response to lithium). Interpretation of the strength of agreement was made according to Landis and Koch: poor (κ < 0.00), slight (0.00-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), almost perfect (0.81-1.00) [31].
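As an illustration of the dichotomous agreement analysis, the sketch below dichotomizes TS ratings at a cut off and computes a multi-rater kappa, then labels it with the Landis and Koch bands quoted above. It is a minimal sketch under stated assumptions: Fleiss' formulation is used as one common kappa for multiple raters with two outcomes (the text does not specify the exact estimator), no confidence interval is computed, and the function names and the rating matrix in the usage note are hypothetical.

```python
def dichotomize(ratings, cutoff):
    """Convert per-vignette TS ratings into [responder, non-responder] counts."""
    return [[sum(ts >= cutoff for ts in row), sum(ts < cutoff for ts in row)]
            for row in ratings]


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters placing vignette i in category j.

    Assumption: this estimator stands in for the multi-rater kappa of the text.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every vignette must be rated by the same number of raters")
    total = n_subjects * n_raters
    n_categories = len(counts[0])
    # Overall proportion of ratings falling in each category
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed agreement among rater pairs, per vignette
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_obs = sum(p_subj) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    # Undefined (division by zero) when all ratings fall in a single category
    return (p_obs - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Strength-of-agreement label following the Landis and Koch bands."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

For example, with four hypothetical vignettes each rated by four raters, `fleiss_kappa(dichotomize([[8, 9, 7, 8], [2, 3, 1, 0], [7, 6, 8, 7], [4, 5, 3, 7]], cutoff=7))` evaluates to 0.5, which `landis_koch` labels as moderate agreement.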
The quantitative scores of the treatment response scale were analyzed in the first (ICC1) and second (ICC2) stage of ratings. Specifically, we analyzed the TS (weighted clinical improvement), the A score (uncorrected clinical improvement), the B score (quantification of confounders), and the A score when B score ≤ 4. The latter measure allows the identification of “valid cases” through selection at the B criteria. Subjects with B score ≤ 4 are likely to have a clinical improvement causally related to lithium treatment. The ICC was tested with the two-way random effects model, which assumes that a random sample of K investigators is selected from a larger population and that each investigator rates all N targets (i.e., case vignettes), and with the two-way mixed effects model, in which each target is rated by each of the same K investigators, who are the only ones of interest. For both models we calculated the single and average measure reliability.
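The two ICC models and their single and average measure forms can be illustrated from the mean squares of a two-way ANOVA on a vignette-by-rater score matrix. This is a minimal sketch, assuming the commonly used Shrout and Fleiss forms (absolute agreement for the random effects model, consistency for the mixed effects model); the exact variant used in the study is not stated in the text, and the function name and dictionary keys are hypothetical.

```python
def icc_two_way(scores):
    """ICC estimates for a complete matrix: scores[i][j] = rating of target i by rater j.

    Assumption: Shrout-Fleiss style formulas; other software may define the
    mixed effects model with absolute agreement instead of consistency.
    """
    n = len(scores)          # N targets (case vignettes)
    k = len(scores[0])       # K raters (investigators)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    return {
        # Two-way random effects: raters are a sample from a larger population
        "random_single": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "random_average": (msr - mse) / (msr + (msc - mse) / n),
        # Two-way mixed effects: these raters are the only ones of interest
        "mixed_single": (msr - mse) / (msr + (k - 1) * mse),
        "mixed_average": (msr - mse) / msr,
    }
```

On a small illustrative matrix of six targets and four raters (not ConLiGen data), `[[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8], [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]`, the single measure estimates are about 0.29 (random effects) and 0.71 (mixed effects). The gap shows how systematic differences between raters lower the absolute agreement form.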

Analysis of the Distributional Properties of the Treatment Response Scale
For the analysis of the distributional properties, we accessed TS data of 1,308 BD patients from the NIMH centralized ConLiGen phenotypic dataset.
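Each TS in such a dataset follows from the scale arithmetic described for the rating instrument: the summed B criteria are subtracted from the A score, and cases with a total B score of 4 or less are treated as valid cases for the A score phenotype. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and flooring negative totals at 0 is an assumption that the text does not state explicitly.

```python
def total_score(a_score, b_criteria):
    """Total score (TS) = A score minus the sum of the five B criteria (B1..B5)."""
    if len(b_criteria) != 5:
        raise ValueError("expected five B criteria (B1..B5)")
    ts = a_score - sum(b_criteria)
    # Assumption: negative totals are reported as 0
    return max(ts, 0)


def is_valid_case(b_criteria, max_b_total=4):
    """True when the total B score is at or below the threshold (B score <= 4)."""
    return sum(b_criteria) <= max_b_total
```

For instance, an A score of 9 with B criteria `[0, 1, 0, 0, 1]` gives a TS of 7 and counts as a valid case, whereas an A score of 3 with B criteria `[2, 2, 1, 0, 0]` gives a TS of 0 under the flooring assumption and is excluded by the B score threshold.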

Mixture analysis: frequentist and Bayesian approach. We used mixture analysis to test whether we could identify subgroups of patients according to the degree of response to lithium as expressed by TS. The mixture model that best fit the distribution of TS was chosen according to Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC). The lowest values of these two criteria indicated the most parsimonious model that best fit the empirical distribution of the total score. The analysis was performed using the “NMixEM” function implemented in the mixAK package [32] of R software (version 2.13.2).
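The model selection step can be sketched as follows: for each candidate number of components, the maximized log-likelihood is penalized by the number of free parameters, and the model with the lowest criterion value is retained. This is a generic illustration rather than the mixAK implementation. The parameter count assumes a univariate normal mixture in which every component has its own mean and standard deviation (an assumption that may not match the fitted model), and the log-likelihood values in the usage note are made-up placeholders, not results from the study.

```python
import math


def free_parameters(n_components):
    """Free parameters of a univariate normal mixture with component-specific
    means and SDs: (K - 1) weights + K means + K SDs. Assumption, see lead-in."""
    return 3 * n_components - 1


def aic(log_likelihood, n_params):
    """Akaike's information criterion."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Schwarz's Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def best_by_criteria(log_likelihoods, n_obs):
    """Return the number of components minimizing AIC and BIC, respectively.

    log_likelihoods maps the number of components K to the maximized log-likelihood.
    """
    aic_by_k = {k: aic(ll, free_parameters(k)) for k, ll in log_likelihoods.items()}
    bic_by_k = {k: bic(ll, free_parameters(k), n_obs) for k, ll in log_likelihoods.items()}
    return min(aic_by_k, key=aic_by_k.get), min(bic_by_k, key=bic_by_k.get)
```

With placeholder log-likelihoods such as `{1: -3400.0, 2: -3300.0, 3: -3250.0, 4: -3248.0}` and 1,308 observations, both criteria select three components, because the small gain in fit from a fourth component does not offset its penalty.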
To verify the findings from the frequentist mixture analysis, we performed a Bayesian mixture analysis employing a minimum message length (MML) approach [33]. Specifically, we used the Snob software [34] to test whether the distribution results from a union of a number of “classes”, where the distributions “within-classes” are homogeneous and have a simple form, but vary significantly “between-classes”. The best fitting model was indicated as the most parsimonious model (i.e., the one with the lowest cost expressed in nits, the unit conventionally used to express message length). The analysis was performed using a measurement error equal to 2.5, empirically estimated by plotting the distribution of TS.
Cut off point calculation. Cut off points were derived using the theoretical TS function and calculating each data point's probability of belonging to each class. Specifically, once the mixture model parameters were estimated, we calculated the posterior probability of any data point x belonging to the i-th class as

$$P(i \mid x) = \frac{\omega_i \, \mathcal{N}(x; \mu_i, \sigma_i)}{\sum_{j} \omega_j \, \mathcal{N}(x; \mu_j, \sigma_j)}$$

where ω is the weight, μ is the mean, σ is the standard deviation, and N(x; μ, σ) denotes the normal density evaluated at x. The resulting probabilities were then compared in order to establish which class the data point belonged to.
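The posterior comparison described here can be sketched in a few lines: each class contributes its weight times a normal density, the contributions are normalized to posterior probabilities, and a cut off lies where two adjacent classes are equally probable. This is a minimal sketch assuming normal component densities; the function names are hypothetical, the bisection search is one simple way to locate the crossing point, and the parameters in the usage note are illustrative rather than estimates from the study.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(x, weights, means, sds):
    """Posterior probability of x belonging to each class of the mixture."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]


def assign_class(x, weights, means, sds):
    """Index of the class with the highest posterior probability for x."""
    probs = posteriors(x, weights, means, sds)
    return max(range(len(probs)), key=probs.__getitem__)


def cutoff_between(i, j, weights, means, sds, tol=1e-9):
    """Point between the means of classes i and j where both are equally probable.

    Found by bisection on the difference of the weighted densities.
    """
    def diff(x):
        return (weights[i] * normal_pdf(x, means[i], sds[i])
                - weights[j] * normal_pdf(x, means[j], sds[j]))

    lo, hi = means[i], means[j]
    if not (diff(lo) > 0 > diff(hi)):
        raise ValueError("no single crossing point between the two class means")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For two equally weighted classes with means 0 and 4 and unit standard deviations, `cutoff_between(0, 1, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])` returns a value of 2.0 to within the tolerance, and `posteriors(2.0, ...)` assigns probability 0.5 to each class.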

Inter-rater Agreement and Reliability of the Assessment of Lithium Response
Raters agreed to a substantial/moderate (first stage of case-vignette ratings) and moderate/fair (second stage of case-vignette ratings) degree in assessing lithium response as a dichotomous variable (response/non response) (Table 2). We did not detect an effect of training, as shown by the lack of improvement in κ. Specifically, in the first stage of ratings, the κ score showed a substantial level of agreement when we considered the TS cut off for response to lithium at 6 (κ = 0.65, 95% CI = 0.36-0.85) and at 8 (κ = 0.61, 95% CI = 0.33-0.83). The highest κ value was for the TS cut off point of 7 (κ = 0.66, 95% CI = 0.38-0.86). The second stage of ratings had overall lower κ values than the first, indicating a moderate level of agreement in the assessment of lithium response (TS = 6: κ = 0.51, 95% CI = 0.29-0.73; TS = 7: κ = 0.54, 95% CI = 0.31-0.76; TS = 8: κ = 0.54, 95% CI = 0.28-0.76). Again, the highest κ value was found for the TS cut off point of 7. Details can be found in Table 2.
We then analyzed the inter-rater reliability for the continuous definition of lithium response. We found that ICC values (two-way random and mixed effects models, single measure) were higher in the first stage of ratings for TS (ICC1 = 0.74 versus ICC2 = 0.55), for A score (ICC1 = 0.66 versus ICC2 = 0.52) and for total B score (ICC1 = 0.59 versus ICC2 = 0.34). However, the training improved the inter-rater reliability of the A score when B score was ≤ 4 (ICC1 = 0.71 versus ICC2 = 0.75). These results are outlined in Table 2.

Assessment of Lithium Response in Bipolar Disorder
PLOS ONE | www.plosone.org

Table 2. Inter-rater agreement and reliability of the assessment of lithium response in the two-stage case-vignette rating procedure: kappa and intra-class correlation analysis.

Figure 1 illustrates the distribution of TS and A score in 1,308 BD patients characterized for lithium response. Two hundred eighty-three patients (21.6%) had TS equal to 0 and 104 patients (8%) had A score equal to 0. In the whole sample the mean A score [± standard deviation] was 6.1 ± 3.1 and the mean TS was 4.4 ± 3.1. The joint distribution of TS and A scores is represented in Figure 2. It illustrates the presence of two frequency peaks at the extreme ends of the scale, namely at 0 and in the area comprised between A score equal to 9 and TS equal to 8-10. A third peak is present at the intersection of A score equal to 6 and TS of 4.

Figure 3B.
Cut off point calculation. The functions of TS identified with the two different mixture analysis approaches (frequentist and Bayesian) were used to derive the probability of belonging and to calculate the cut off point between the components. The frequentist mixture model suggested two cut off points at TS = 3 and TS = 6.4. Considering the Bayesian MML theoretical function, we obtained two cut off points at 2 and 7. These results confirmed that TS ≥ 7 is the most appropriate cut off for the definition of full response to lithium prophylaxis as suggested in previous studies [12,13].

Discussion
The purpose of this study was to assess the key phenotypic measures of response to lithium treatment in the large international collaborative Consortium on Lithium Genetics. To this end, two main analyses have been carried out: the inter-rater agreement and reliability of lithium response definition across the ConLiGen participating sites, and the analysis of the distributional properties of the lithium treatment response scale [12]. We found that two definitions of lithium response, one dichotomous and the other continuous, had moderate to substantial inter-rater agreement and reliability. Specifically, the two-stage case-vignette inter-rater reliability analysis pointed to the measure of clinical improvement under lithium treatment expressed by the A score and with selection of “valid cases” through a total B score ≤ 4. This phenotypic definition of lithium response had a substantial inter-rater reliability in the first stage of ratings (ICC1 = 0.71) with further improvement in the second stage (ICC2 = 0.75). Regarding the dichotomous definition of lithium response, a scale TS ≥ 7 was identified as the best cut off, as shown by inter-rater agreement κ scores in the first (κ = 0.66) and second (κ = 0.54) stages of case-vignette ratings. The analysis of the distributional properties of the treatment response scale further supported this dichotomous definition. In addition, this same measure of lithium response has been previously proposed in several clinical and genetic papers [12,13,35,36].
Some methodological considerations need to be made. For the analysis of the distributional properties, we applied mixture modeling, a method that has been extensively used in psychiatry for the identification of patient subgroups, reducing phenotypic heterogeneity and ultimately helping genetic research [37][38][39]. It should be noted that this method is exploratory and it does not identify the factors determining the differences between the identified subgroups [40]. A validation of the model can be obtained by comparison of the characteristics of each subgroup. In the ConLiGen study, we plan to use the clinical correlates of lithium response as external validators of the phenotypic measure suggested by the mixture modeling. Such analysis will test and compare the direction and magnitude of the association of a number of clinical variables with lithium response in its dichotomous and continuous definition.
Notably, the analysis of inter-rater reliability and agreement involved investigators belonging to different research groups with different clinical backgrounds and training. Nevertheless, the use of standardized case vignettes and the training procedures have produced moderate to substantial agreement in the assessment of lithium response. These findings are of importance, given the evidence that even in the context of inpatient unit settings the inter-rater agreement can be unsatisfactory [41].
We performed a two-stage case-vignette procedure aimed at testing the effect of training on the assessment of lithium response. Contrary to our expectations, we only detected improvement in the inter-rater reliability of lithium response expressed by the A score and with selection of “valid cases” through a total B score ≤ 4, but not in that expressed by TS or A score. Arguably, the second set of vignettes described more complicated clinical cases with comorbidities, lack of compliance and multiple treatments, all factors that could have influenced the scoring of the B criteria. Indeed, the ICC for the total B score decreased noticeably in the second stage of ratings, implying an increased variability in rating that impacted the discrimination among cases [42]. This explanation is corroborated by the finding of the higher ICC2 of A score with total B score ≤ 4. By applying this cut-off we decreased the assessment variability, ultimately increasing the discrimination among cases.
Further, these findings confirm that patients with short duration of lithium treatment, poor compliance, and concomitant medications are unlikely to be assessed reliably. This argues against the inclusion of such complex, non-standard cases in pharmacogenomic studies of lithium response. Finally, the higher inter-rater agreement and reliability found in the first set of vignettes suggest that the assessment of lithium response is reliable if sufficient clinical details are available. On the other hand, if the information is limited, additional rater training will be of little help.
In conclusion, our findings support the use of two definitions of lithium response for the pharmacogenomic GWAS currently being performed by ConLiGen. Accurate phenotypic definitions of treatment response are crucial in pharmacogenomic studies [43,44]. Heterogeneity in the phenotype definition of treatment response can be a problem, especially in the context of psychiatric disorders. In the absence of other reliable clinical measures of response to lithium, this study has suggested two plausible phenotypic definitions that await application and validation in other samples.