Resemblance of Symptoms for Major Depression Assessed at Interview versus from Hospital Record Review

Background: Diagnostic information for psychiatric research often depends on both clinical interviews and medical records. Although discrepancies between these two sources are well known, there have been few studies of the degree and origins of the inconsistencies.
Principal findings: We compared data from structured interviews and medical records on 1,970 Han Chinese women with recurrent DSM-IV major depression (MD). Correlations were high for age at onset of MD (+0.93) and number of episodes (+0.70), intermediate for family history (+0.62) and duration of longest episode (+0.43), and variable but generally more modest for individual depressive symptoms (mean kappa = +0.32). Four factors were identified for twelve symptoms from medical records, and the same four factors emerged from analysis of structured interviews. Factor congruences were high, but the correlations of factors between interviews and records were modest (i.e. +0.2 to +0.4).
Conclusions: Structured interviews and medical records are highly concordant for age of onset, and the number and length of episodes, but agree more modestly for individual symptoms and symptom factors. The modesty of these correlations probably arises from multiple factors including i) inconsistency in the definition of the worst episode, ii) inaccuracies in self-report and iii) difficulties in coding medical records where symptoms were recorded solely for clinical purposes.


Introduction
Accurate clinical diagnosis is a cornerstone of psychiatric research. Many epidemiological findings of importance for public health, for example those that report the lifetime prevalence and development of psychiatric disorders, rely on data obtained by interviewing subjects. Clinical data for the majority of these and other research projects are typically collected using a structured interview such as the Composite International Diagnostic Interview (CIDI) or the Structured Clinical Interview for DSM (SCID) [1]. Lifetime diagnoses based on structured interviews have good interrater reliability [2], but they suffer from a number of limitations.
Foremost among these is that structured clinical interviews assessing a lifetime history of illness rely solely on the accuracy of the subject's memory, which is often imprecise and potentially biased. A number of studies have shown that the reliability of the lifetime prevalence of psychiatric disorders assessed by structured interview is often modest, and that an individual's present mood state impacts on the probability that they will recall a prior depressive episode [3,4,5].
There are, however, alternative sources of information that can be used to augment information from the clinical interview. Medical records provide summaries, usually taken contemporaneously, of information obtained by medical staff involved in the care of the patient. The data typically include summaries of interviews and the results of physical and laboratory examinations, together with diagnoses, treatments, and care plans. Medical records can provide an accurate summary of the course of a disease, often recording important events and symptoms the patients themselves do not recall. However, medical records are unstructured and their quality varies, depending on the skills and diligence of the individual physicians and nurses recording the information [6].
Because of the complementary nature of information gleaned from structured interviews and medical records, some researchers combine both sources of information. However, this raises a number of issues, not the least of which is what to do when the two sources of data contradict each other. For example, a chart diagnosis may not concur with the results of a structured interview such as the SCID [7]. Despite the importance of these issues, research into the degree and origins of inconsistency between the two sources of clinical data is scant [7,8]. We have been unable to find a comparison of the two approaches to the diagnosis of major depression (MD).
Here we used data from 1,970 depressed Chinese women to compare these two assessment methods. Because the patients were given a detailed structured interview, covering known risk factors for depression, as well as being subject to a careful chart review, we were able to explore patterns of response that might throw light on the nature of the agreements and disagreements between the two assessment methods.

Ethics statement
The study protocol was approved centrally by the Ethical Review Board of Oxford University (Oxford Tropical Research Ethics Committee) and the ethics committees in all participating hospitals in China. Major psychotic illness was an exclusion criterion, and the large majority of patients were in remission from illness (seen as out-patients). All interviewers were mental health professionals who are well able to judge decisional capacity. The study posed minimal risk (an interview and saliva sample).

Study subjects
The data for the present study were drawn from the ongoing China, Oxford and VCU Experimental Research on Genetic Epidemiology (CONVERGE) study of MD. These analyses were based on a total of 1,970 cases recruited from 53 provincial mental health centres and psychiatric departments of general medical hospitals in 41 cities in 19 provinces and four central cities: Beijing, Shanghai, Tianjin and Chongqing. All cases were female and had four Han Chinese grandparents. They were aged between 30 and 60, had suffered two or more episodes of MD, with the first episode occurring between the ages of 14 and 50, and had not abused drugs or alcohol before their first episode of MD. Cases were excluded if they had a pre-existing history of bipolar disorder, any type of psychosis or mental retardation.
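The inclusion and exclusion criteria above can be read as a single eligibility check. The following is a minimal sketch in Python, not the study's actual screening procedure: the `Candidate` class, its field names and the `is_eligible` function are hypothetical, and the age bounds are assumed to be inclusive, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical screening record; field names are illustrative only."""
    sex: str                          # "F" or "M"
    han_grandparents: int             # number of Han Chinese grandparents (0-4)
    age: int                          # age at recruitment, in years
    n_md_episodes: int                # lifetime number of MD episodes
    age_first_episode: int            # age at first MD episode, in years
    substance_abuse_before_onset: bool
    prior_bipolar: bool
    prior_psychosis: bool
    mental_retardation: bool


def is_eligible(c: Candidate) -> bool:
    """Apply the stated criteria; inclusive age bounds are an assumption."""
    included = (
        c.sex == "F"
        and c.han_grandparents == 4
        and 30 <= c.age <= 60
        and c.n_md_episodes >= 2
        and 14 <= c.age_first_episode <= 50
        and not c.substance_abuse_before_onset
    )
    excluded = c.prior_bipolar or c.prior_psychosis or c.mental_retardation
    return included and not excluded


# Invented example case, not study data.
example = Candidate("F", 4, 45, 3, 27, False, False, False, False)
print(is_eligible(example))
```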
All cases were interviewed using a computerised assessment system; interviews lasted two hours on average. All interviewers were mental health professionals, largely psychiatrists and a few psychiatric nurses, trained by the CONVERGE team for a minimum of one week in the use of the interview. The interview included assessment of psychopathology, demographic and personal characteristics, and psychosocial functioning. Interviews were tape-recorded and a proportion of them were listened to by trained editors, who provided feedback on the quality of the interviews. The interview was semi-structured and required the interviewers to make a range of judgements about the nature and meaning of the reported symptoms. The section of the interview that assessed major depression was adapted from the Composite International Diagnostic Interview (CIDI) (WHO lifetime version 2.1; Chinese version) and classified diagnoses according to the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) criteria [14].
Additional information was collected using instruments from the Virginia Adult Twin Study of Psychiatric and Substance Use Disorders (VATSPSUD), which were translated and reviewed for accuracy by members of the CONVERGE team. Postnatal depression was assessed using an adaptation of the Edinburgh Scale [15]. The history of lifetime major depression in the parents and siblings was assessed using the Family History Research Diagnostic criteria [16].
All available medical records (n = 1,880) were reviewed, typically before but sometimes after the interview with the respondent. As would be expected, the quantity and quality of medical records varied considerably. Some included only out-patient records while others included material from inpatient hospitalizations. Notes by treating psychiatrists were almost always available. Notes from nursing staff or other mental health professionals were sometimes included. Interviewers were trained in the completion of the Case Record Rating Scale. If records were available from multiple episodes of illness, they were instructed to focus on the worst episode. They were also instructed to focus on admission and discharge summaries as likely to contain the most complete clinical descriptions. All available case notes were assessed to obtain a diagnosis of DSM-IV depression. The presence of each of 12 symptoms was rated as one of "clear evidence absent", "inferred absent", "inferred present", "clearly present moderate", "clearly present severe" or "no information".
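To compare chart ratings with interview responses, the six rating categories have to be reduced to a form comparable with the interview data. The text does not say how this was done, so the sketch below is an assumption: it collapses the two "absent" categories to 0 and the three "present" categories to 1, and treats "no information" as missing. The names `RATING_TO_BINARY`, `recode_rating` and `paired_complete_cases` are hypothetical.

```python
# ASSUMED dichotomisation of the six chart-rating categories; the source
# does not state how the categories were collapsed for analysis.
RATING_TO_BINARY = {
    "clear evidence absent": 0,
    "inferred absent": 0,
    "inferred present": 1,
    "clearly present moderate": 1,
    "clearly present severe": 1,
    "no information": None,  # treated as missing
}


def recode_rating(rating):
    """Map a chart rating to 1 (present), 0 (absent) or None (missing)."""
    return RATING_TO_BINARY[rating.strip().lower()]


def paired_complete_cases(chart_ratings, interview_flags):
    """Pair chart and interview values, dropping cases with no chart information."""
    pairs = []
    for rating, flag in zip(chart_ratings, interview_flags):
        coded = recode_rating(rating)
        if coded is not None:
            pairs.append((coded, flag))
    return pairs


# Invented ratings for three cases, not study data.
chart = ["no information", "clearly present severe", "inferred absent"]
interview = [1, 1, 0]
print(paired_complete_cases(chart, interview))
```

Dropping "no information" cases symptom by symptom means that each symptom may be compared on a different subset of patients, which is one way in which missing chart data could bias agreement estimates.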
Interviewers were also instructed that information obtained from the medical records should never influence their interview of the respondents as the two sources of data were to be kept entirely separate.
The case interview was fully computerized in a bilingual (Mandarin and English) system called SysQ, developed in house in Oxford. Skip patterns were built into SysQ. Interviews were administered by trained interviewers and entered offline in real time into SysQ, which was installed on laptops. Once an interview was completed, a backup file containing all the entered interview data could be generated in a database-compatible format. The backup file, together with an audio recording of the entire interview, was uploaded to a designated server currently maintained in Beijing by a service provider. All the uploaded files on the Beijing server were then transferred to an Oxford server quarterly.

Statistical analysis
Statistical analyses were performed using the software package SPSS 17.0 (SPSS Inc., Chicago, IL). Pearson correlation coefficients were calculated to compare the age of onset of the first episode of depression, the duration of the longest episode of depression, and the number of episodes. Agreement between the interview and hospital records for 12 individual MD symptoms was assessed by Cohen's kappa statistic [17]. Symptoms were subjected to factor analysis with a Varimax rotation, again using SPSS software.
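The two agreement statistics named above can be illustrated with a short, self-contained Python sketch. This is not the SPSS procedure used in the study; the function names `pearson_r` and `cohens_kappa` and all data values are invented for illustration.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two sources, corrected for chance."""
    n = len(ratings_a)
    # Observed proportion of cases on which the two sources agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each source's marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data, invented for illustration (not from the study).
interview_onset = [22, 35, 41, 28, 30]
record_onset = [23, 35, 40, 29, 30]
interview_symptom = [1, 1, 0, 1, 0, 1, 0, 0]
record_symptom = [1, 0, 0, 1, 0, 1, 1, 0]

print(round(pearson_r(interview_onset, record_onset), 3))
print(cohens_kappa(interview_symptom, record_symptom))  # 0.5
```

In the toy symptom data the sources agree on 6 of 8 cases (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5. Kappa is undefined when chance agreement equals 1, for example when both sources rate every case identically in a single category.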

Results
We compared information obtained from medical records and structured interviews on MD episode duration, occurrence, family history and symptomatology. Table 1 shows good correlations for the age of onset of MD, number of episodes and the length of the longest episode (in weeks), with a strikingly high correlation for the age of onset (+0.93). Figure 1 displays graphically the nature of the correlations. Table 2 presents kappa coefficients for family history and for symptoms experienced during the worst MD episode. Family history for depression was relatively reliable across sources (kappa = +0.62). Agreement for symptoms was variable but generally modest (range +0.14 [insomnia] to +0.48 [appetite/weight loss]), with a mean of +0.32. There was no obvious pattern among symptoms that correlated poorly compared to those that correlated more strongly (for example, measures of biological symptoms such as changes in sleep and weight varied as much as measures of psychological state, such as feelings of fatigue and thoughts of suicide).
The modest correlation in symptomatology might reflect errors in the way symptoms were elicited and recorded or, since medical records may contain information collected at a different time from the structured interview, it might reflect real differences in the symptom profile of the disease. If the differences were due to random error, then the relationship between individual symptoms might also be disturbed. We tested this by assessing the similarity in factor structure of the symptoms obtained from medical records and structured interview.
We identified four factors from 12 symptoms obtained from medical records with eigenvalues greater than 1. These factors explained 49% of the variance. The first loaded most strongly on three symptoms (psychomotor agitation, irritable/angry, nervous/jittery/anxious symptoms); we label this factor "Anxiety". The second loaded most strongly on four symptoms (fatigue/loss of energy, appetite/weight loss, psychomotor retardation, and difficulty concentrating); we label this factor "Fatigue". The third loaded on four symptoms (hypersomnia, insomnia, appetite/weight gain, appetite/weight loss); we label this factor "Neurovegetative". The last factor loaded heavily on just two symptoms (suicidal ideation/acts, crying a lot); we label this factor "Suicide". Results for the factor analysis of medical records are shown in Table 3.
The same four factors emerged from an analysis of structured interviews, explaining 40% of the variance, with Table 4 showing similar loadings to the factors obtained from medical records. Factor congruences (cosines of pairs of vectors defined by the loadings matrix) for the factors extracted from the interviews and medical records were quite high: +0.91 for the Anxiety factor, +0.99 for Fatigue, +0.95 for Neurovegetative and +0.86 for Suicide.
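The congruence measure described above, the cosine of the angle between two vectors of factor loadings, can be sketched in a few lines of Python. The function name `congruence` and the loading values are illustrative assumptions, not values from Tables 3 or 4.

```python
from math import sqrt


def congruence(loadings_a, loadings_b):
    """Cosine between two loading vectors defined over the same symptoms."""
    dot = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = sqrt(sum(a * a for a in loadings_a))
    norm_b = sqrt(sum(b * b for b in loadings_b))
    return dot / (norm_a * norm_b)


# Invented loadings of one factor on four symptoms in each source.
records_factor = [0.70, 0.65, 0.10, 0.05]
interview_factor = [0.60, 0.72, 0.15, 0.00]
print(round(congruence(records_factor, interview_factor), 3))
```

Because the coefficient depends only on the pattern of loadings, a high value shows that the factors are defined by the same symptoms in both sources. It says nothing about whether an individual patient receives similar factor scores from the two sources, which is why congruence can be high while score correlations remain modest.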
Finally, we tested the correlation between the factors extracted from chart review and from the structured interviews. Table 5 shows significant correlations between factors obtained from the two sources that are modest to moderate (approximately +0.20 to +0.45). However, the within-factor correlations between the interview and medical records are much higher than the cross-factor correlations (means of 0.27 and 0.04 respectively).
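The comparison of within-factor and cross-factor correlations amounts to comparing the diagonal of the interview-by-records correlation matrix with its off-diagonal entries. The sketch below assumes rows index interview factors and columns index record factors in the same order; the function name and the matrix values are invented and do not reproduce Table 5.

```python
def within_and_cross_means(corr):
    """Return (mean of diagonal, mean of off-diagonal) for a square matrix.

    corr[i][j] is the correlation between interview factor i and
    medical-record factor j, with factors listed in the same order.
    """
    k = len(corr)
    within = [corr[i][i] for i in range(k)]
    cross = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    return sum(within) / len(within), sum(cross) / len(cross)


# Invented 3 x 3 example, not the study's results.
toy = [
    [0.30, 0.05, 0.00],
    [0.02, 0.25, 0.04],
    [0.01, 0.03, 0.40],
]
within_mean, cross_mean = within_and_cross_means(toy)
print(round(within_mean, 3), round(cross_mean, 3))
```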

Discussion
Our results show that clinical information about MD obtained from unstructured medical records correlates with that from structured clinical interviews, with the degree of correlation depending on the nature of the information. For age of onset, the number and length of episodes and family history, the two sources are highly concordant, but we find variable and generally more modest agreement between individual symptoms. We find the same factor structure present in symptoms from both sources, and factor congruence is greater than 0.85 in all cases. This indicates that symptom score differences are unlikely to be due to random error, and that we need to find additional explanations for the discrepancy.
In our study, information from medical records was not contemporaneous with that obtained by interview. This suggests two important reasons for the lack of concordance between symptoms elicited at interview and from medical records. The first is that the information from the two sources may often relate to different episodes of MD. We have no way of knowing whether the worst episode described by the patient corresponds to the worst episode that was picked for rating using hospital records. Consistency of symptoms between depressive episodes is typically modest. For example, the mean correlation between MD symptoms elicited from a large population-based sample of female twins interviewed twice at least one year apart was 0.28 [9]. Similarly modest correlations were found in a study of 78 hospital inpatients examined at intervals of one and two years [10]. Both studies found the highest correlations for suicidal behaviour. In our study the correlation between the suicide factors in the two sources of information was also the highest (+0.44).
Differences in recall are a second reason for discrepancies between the symptoms we acquired from hospital records and from interviews [3,4,5]. When subjects are assessed longitudinally, recall is known to affect the results, sometimes in predictable ways: for example, Bromet et al. found that on a second assessment about twice as many patients reported fewer lifetime depressive episodes than reported more [3]. However, consistency in recall of some items is high: for example, reporting of age at onset rarely differs by more than one year [5], a finding that we corroborated.
Psychiatry's reliance on interviews, rather than objective tests, for basic clinical information has spawned a large literature on the reliability of structured ways of assessing patients [1,11,12], and typically additional information from chart review is incorporated to obtain a best estimate of lifetime psychiatric diagnosis [2]. These studies show that diagnoses based on interview data alone are an adequate substitute for best estimate diagnoses based on all available data. However, there has been much less interest in the validity of chart-derived information. Very few studies have examined the relationship between chart-derived and interview-derived information. Our results can be usefully compared with a recent study of schizophrenia that used a methodology very similar to that of the current report [8]. Ratings of psychotic symptoms from personal interview and from a review of medical records were compared in 1,021 patients with schizophrenia studied in Ireland. Correlations for 21 signs and symptoms of psychotic illness ranged from +0.02 (somatic hallucinations) to +0.55 (religious delusions), with a mean of +0.26. Despite examining a different disease in a different country, these results are quite similar to those obtained in this study.
A few other studies are moderately relevant to our findings. For example, one study of community practitioners reported a kappa of 0.24 for reliability between chart diagnosis and that obtained from a SCID [7]; by contrast, a survey of diagnoses made on 101 psychiatric inpatients reported high concordance, with most errors judged to have occurred in the charts [13]. It is noteworthy that our findings, pointing to relatively good agreement between the sources of information, were also obtained from psychiatric hospitals rather than from community physicians. This suggests that the setting in which medical records are obtained may be an important determinant of the reliability of the information.
Our results do not allow us to decide what to do when medical records and interview data disagree. This is most likely to be true for reports of symptoms. While we have shown that there is consistency in factor structure between the two sources of information, we have no way of determining which is the more accurate measure.
Our results should be interpreted in the context of four potentially important methodological limitations. First, an important concern is that medical staff rated the medical records of the same patients that they interviewed. Although interviewers were instructed to keep the two sources of information separate, this may not always have been possible. Therefore we may be overestimating the degree of concordance between medical records and interview-acquired data. Second, the sample is entirely female and our results may or may not extrapolate to men in China. Third, medical records were missing for a small number of cases (n = 90) and often did not contain information about the presence or absence of some individual depressive symptoms. This may introduce an unacknowledged source of bias into our results. Fourth, we have no data on the reliability of our interviews. While we assume that the quality of our interview data is comparable to that of other studies [11], without analysis of repeat interviews we cannot be certain on this point.