Diagnostic information for psychiatric research often depends on both clinical interviews and medical records. Although discrepancies between these two sources are well known, there have been few studies into the degree and origins of inconsistencies.
We compared data from structured interviews and medical records on 1,970 Han Chinese women with recurrent DSM-IV major depression (MD). Correlations were high for age at onset of MD (+0.93) and number of episodes (+0.70), intermediate for family history (+0.62) and duration of longest episode (+0.43), and variable but generally more modest for individual depressive symptoms (mean kappa = 0.32). Four factors were identified for twelve symptoms from medical records, and the same four factors emerged from analysis of structured interviews. Factor congruences were high, but the correlations of factors between interviews and records were modest (i.e. +0.2 to +0.4).
Structured interviews and medical records are highly concordant for age of onset, and the number and length of episodes, but agree more modestly for individual symptoms and symptom factors. The modesty of these correlations probably arises from multiple factors including i) inconsistency in the definition of the worst episode, ii) inaccuracies in self-report and iii) difficulties in coding medical records where symptoms were recorded solely for clinical purposes.
Citation: Chen Y, Li H, Li Y, Xie D, Wang Z, Yang F, et al. (2012) Resemblance of Symptoms for Major Depression Assessed at Interview versus from Hospital Record Review. PLoS ONE 7(1): e28734. https://doi.org/10.1371/journal.pone.0028734
Editor: Jerson Laks, Federal University of Rio de Janeiro, Brazil
Received: September 9, 2011; Accepted: November 14, 2011; Published: January 11, 2012
Copyright: © 2012 Chen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This study was funded by the Wellcome Trust. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Accurate clinical diagnosis is a cornerstone of psychiatric research. Many epidemiological findings of importance for public health, for example those that report the lifetime prevalence and development of psychiatric disorders, rely on data obtained by interviewing subjects. Clinical data for the majority of these and other research projects are typically collected using a structured interview such as the Composite International Diagnostic Interview (CIDI) or the Structured Clinical Interview for DSM (SCID). Lifetime diagnoses based on structured interviews have good inter-rater reliability, but they suffer from a number of limitations.
Foremost among these is that structured clinical interviews assessing a lifetime history of illness rely solely on the accuracy of the subject's memory, which is often imprecise and potentially biased. A number of studies have shown that the reliability of the lifetime prevalence of psychiatric disorders assessed by structured interview is often modest, and that an individual's present mood state affects the probability that they will recall a prior depressive episode.
There are, however, alternative sources of information that can be used to augment information from the clinical interview. Medical records provide summaries, usually taken contemporaneously, of information obtained by medical staff involved in the care of the patient. The data typically include summaries of interviews and the results of physical and laboratory examinations, together with diagnoses, treatments, and care plans. Medical records can provide an accurate summary of the course of a disease, often recording important events and symptoms the patients themselves do not recall. However, medical records are unstructured and their quality varies, depending on the skills and diligence of the individual physicians and nurses recording the information.
Because of the complementary nature of information gleaned from structured interviews and medical records, some researchers combine both sources of information. However, this raises a number of issues, not the least of which is what to do when the two sources of data contradict each other. For example, a chart diagnosis may not concur with the results of a structured interview such as the SCID. Despite the importance of these issues, research into the degree and origins of inconsistency between the two sources of clinical data is scant. We have been unable to find a comparison of the two approaches to the diagnosis of major depression (MD).
Here we used data from 1,970 depressed Chinese women to compare these two assessment methods. Because the patients were given a detailed structured interview, covering known risk factors for depression, as well as subject to a careful chart review, we were able to explore patterns of response that might throw light on the nature of the agreements and disagreements between the two assessment methods.
Materials and Methods
The study protocol was approved centrally by the Ethical Review Board of Oxford University (Oxford Tropical Research Ethics Committee) and the ethics committees in all participating hospitals in China. Major psychotic illness was an exclusion criterion, and the large majority of patients were in remission from illness (seen as out-patients). All interviewers were mental health professionals who were well able to judge decisional capacity. The study posed minimal risk (an interview and saliva sample).
The data for the present study were drawn from the ongoing China, Oxford and VCU Experimental Research on Genetic Epidemiology (CONVERGE) study of MD. These analyses were based on a total of 1,970 cases recruited from 53 provincial mental health centres and psychiatric departments of general medical hospitals in 41 cities in 19 provinces and four central cities: Beijing, Shanghai, Tianjin and Chongqing. All cases were female and had four Han Chinese grandparents. They were aged between 30 and 60, had suffered two or more episodes of MD, with the first episode occurring between the ages of 14 and 50, and had not abused drugs or alcohol before their first episode of MD. Cases were excluded if they had a pre-existing history of bipolar disorder, any type of psychosis or mental retardation.
All cases were interviewed using a computerised assessment system; interviews lasted on average two hours. All interviewers were mental health professionals, largely psychiatrists and a few psychiatric nurses, trained by the CONVERGE team for a minimum of one week in the use of the interview. The interview included assessment of psychopathology, demographic and personal characteristics, and psychosocial functioning. Interviews were tape-recorded and a proportion of them were listened to by trained editors, who provided feedback on the quality of the interviews. The interview was semi-structured and required the interviewers to make a range of judgements about the nature and meaning of the reported symptoms. The section of the interview that assessed major depression was adapted from the Composite International Diagnostic Interview (CIDI) (WHO lifetime version 2.1; Chinese version) and classified diagnoses according to the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) criteria.
Additional information was collected using instruments adopted from the Virginia Adult Twin Study of Psychiatric and Substance Use Disorders (VATSPSUD), which were translated and reviewed for accuracy by members of the CONVERGE team. Information on postnatal depression was assessed using an adaptation of the Edinburgh Scale. The history of lifetime major depression in the parents and siblings was assessed using the Family History Research Diagnostic criteria.
All available medical records (n = 1,880) were reviewed, typically before but sometimes after the interview with the respondent. As would be expected, the quantity and quality of medical records varied considerably. Some included only out-patient records while others included material from in-patient hospitalizations. Notes by treating psychiatrists were almost always available. Notes from nursing staff or other mental health professionals were sometimes included. Interviewers were trained in the completion of the Case Record Rating Scale. If records were available from multiple episodes of illness, they were instructed to focus on the worst episode. They were also instructed to focus on admission and discharge summaries as likely to contain the most complete clinical descriptions. All available case notes were assessed to obtain a diagnosis of DSM-IV depression. The presence of each of 12 symptoms was evaluated as either “clear evidence absent”, “inferred absent”, “inferred present”, “clearly present moderate”, “clearly present severe” or “no information”. Interviewers were also instructed that information obtained from the medical records should never influence their interview of the respondents, as the two sources of data were to be kept entirely separate.
The case interview was fully computerized in a bilingual (Mandarin and English) system called SysQ, developed in house in Oxford. Skip patterns were built into SysQ. Interviews were administered by trained interviewers and entered offline in real time into SysQ, which was installed on laptops. Once an interview was completed, a backup file containing all the previously entered interview data could be generated in a database-compatible format. The backup file, together with an audio recording of the entire interview, was uploaded to a designated server currently maintained in Beijing by a service provider. All the uploaded files on the Beijing server were then transferred to an Oxford server quarterly.
Statistical analyses were performed using the software package SPSS 17.0 (SPSS Inc., Chicago, IL). Pearson correlation coefficients were calculated to compare the age of onset of the first episode of depression, the duration of the longest episode of depression, and the number of episodes. Agreement between the interview and hospital records for 12 individual MD symptoms was assessed by Cohen's kappa statistic. Symptoms were subjected to factor analysis with a Varimax rotation, again using SPSS software.
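As an illustration of the agreement statistic named above, the following is a minimal sketch of Cohen's kappa for two sets of categorical ratings. It is not the SPSS procedure used in the study; the function name `cohens_kappa` and the toy ratings are hypothetical, and the example assumes each symptom has already been reduced to present (1) or absent (0), a coding step the text does not describe.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of cases rated identically by both sources.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy data (1 = symptom present, 0 = absent); not from the study.
interview = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
records = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
kappa = cohens_kappa(interview, records)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance alone, which is why a symptom that is nearly always present can show high raw agreement but a low kappa.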
We compared information obtained from medical records and structured interviews on MD episode duration, occurrence, family history and symptomatology. Table 1 shows good correlations for the age of onset of MD, number of episodes and the length of the longest episode (in weeks), with a strikingly high correlation for the age of onset (+0.93). Figure 1 displays graphically the nature of the correlations. Table 2 presents kappa coefficients for family history and for symptoms experienced during the worst MD episode. Family history for depression was relatively reliable across sources (kappa = +0.62). Agreement for symptoms was variable but generally modest (range +0.14 [insomnia] to +0.48 [appetite/weight loss]), with a mean of +0.32. There was no obvious pattern among symptoms that correlated poorly compared to those that correlated more strongly (for example measures of biological symptoms such as changes in sleep and weight varied as much as measures of psychological state, such as feelings of fatigue and thoughts of suicide).
Figure 1. Each graph plots interview data on the horizontal axis against medical-record data on the vertical axis: a) the age of onset (in years); b) the number of episodes; c) the duration of the longest episode of major depression (in weeks).
The modest correlation in symptomatology might reflect errors in the way symptoms were elicited and recorded or, since medical records may contain information collected at a different time from the structured interview, it might reflect real differences in the symptom profile of the disease. If the differences were due to random error, then the relationship between individual symptoms might also be disturbed. We tested this by assessing the similarity in factor structure of the symptoms obtained from medical records and structured interview.
We identified four factors from 12 symptoms obtained from medical records with eigenvalues greater than 1. These factors explained 49% of the variance. The first loaded most strongly on three symptoms (psychomotor agitation, irritable/angry, nervous/jittery/anxious symptoms); we label this factor “Anxiety”. The second loaded most strongly on four symptoms (fatigue/loss of energy, appetite/weight loss, psychomotor retardation, and difficulty concentrating); we label this factor “Fatigue”. The third loaded on four symptoms (hypersomnia, insomnia, appetite/weight gain, appetite/weight loss); we label this factor “Neurovegetative”; the last factor loaded heavily on just two symptoms (suicidal ideation/acts, crying a lot), and we have labeled this “Suicide”. Results for the factor analysis of medical records are shown in Table 3.
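The retention rule and the "variance explained" figure above can be sketched as simple arithmetic. The eigenvalues below are hypothetical, chosen only so that four of twelve exceed 1; they are not the study's values. The sketch relies on the fact that the eigenvalues of a correlation matrix sum to the number of variables, so the proportion of variance explained is the sum of the retained eigenvalues divided by that total.

```python
def kaiser_retained(eigenvalues):
    """Return the eigenvalues greater than 1, largest first."""
    return sorted((ev for ev in eigenvalues if ev > 1.0), reverse=True)


def proportion_variance_explained(eigenvalues, retained):
    """Share of total variance carried by the retained factors."""
    # For a correlation matrix, sum(eigenvalues) equals the number of variables.
    return sum(retained) / sum(eigenvalues)


# Hypothetical eigenvalues for 12 symptoms (they sum to 12); not from the study.
eigenvalues = [2.1, 1.5, 1.2, 1.1, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.5]
retained = kaiser_retained(eigenvalues)
share = proportion_variance_explained(eigenvalues, retained)
print(f"{len(retained)} factors retained, {share:.0%} of variance explained")
```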
The same four factors emerged from an analysis of structured interviews, explaining 40% of the variance, with Table 4 showing similar loadings to the factors obtained from medical records. Factor congruences (cosines of pairs of vectors defined by the loadings matrix) for the factors extracted from the interviews and medical records were quite high: +0.91 for the Anxiety factor, +0.99 for Fatigue, +0.95 for Neurovegetative and +0.86 for Suicide.
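The congruence measure described above, the cosine of the angle between two loading vectors, can be sketched as follows. The function name `congruence` and the loadings are hypothetical; the study's actual loadings are those reported in Tables 3 and 4.

```python
import math


def congruence(loadings_x, loadings_y):
    """Cosine between two factor-loading vectors over the same symptoms."""
    if len(loadings_x) != len(loadings_y):
        raise ValueError("loading vectors must cover the same symptoms")
    dot = sum(x * y for x, y in zip(loadings_x, loadings_y))
    norm_x = math.sqrt(sum(x * x for x in loadings_x))
    norm_y = math.sqrt(sum(y * y for y in loadings_y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("congruence is undefined for an all-zero loading vector")
    return dot / (norm_x * norm_y)


# Hypothetical loadings of one factor on four symptoms in each source.
from_records = [0.7, 0.6, 0.1, 0.0]
from_interview = [0.6, 0.7, 0.0, 0.1]
phi = congruence(from_records, from_interview)
print(f"congruence = {phi:.2f}")
```

A value near +1 means the two factors weight the symptoms in nearly the same proportions, regardless of how strongly individual patients' scores on the two factors correlate.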
Finally, we tested the correlation between the factors extracted from chart review and from the structured interviews. Table 5 shows significant correlations between factors obtained from the two sources that are modest to moderate (∼+0.20–0.45). However, the within factor correlations between the interview and medical records are much higher than cross factor correlations (means of 0.27 and 0.04 respectively).
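The comparison of within-factor and cross-factor correlations can be sketched as the mean of the diagonal versus the mean of the off-diagonal entries of a factor-by-factor correlation matrix. This assumes rows index interview factors and columns index medical-record factors in the same order; the matrix below is hypothetical and is not Table 5.

```python
def within_and_cross_means(corr):
    """Mean diagonal (same factor) and mean off-diagonal (different factors)."""
    k = len(corr)
    if k < 2 or any(len(row) != k for row in corr):
        raise ValueError("expected a square matrix with at least two factors")
    within = [corr[i][i] for i in range(k)]
    cross = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    return sum(within) / len(within), sum(cross) / len(cross)


# Hypothetical 3x3 matrix: rows = interview factors, columns = record factors.
corr = [
    [0.30, 0.05, 0.00],
    [0.10, 0.20, -0.05],
    [0.00, 0.05, 0.40],
]
within_mean, cross_mean = within_and_cross_means(corr)
print(f"within = {within_mean:.3f}, cross = {cross_mean:.3f}")
```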
Our results show that clinical information about MD obtained from unstructured medical records correlates with that from structured clinical interviews, with the degree of correlation depending on the nature of the information. For age of onset, the number and length of episodes and family history, the two sources are highly concordant, but we find variable and generally more modest agreement between individual symptoms. We find the same factor structure present in symptoms from both sources and factor congruence is greater than 0.85 in all cases. This indicates that symptom score differences are unlikely to be due to random error, and that we need to find additional explanations for the discrepancy.
In our study, information from medical records was not contemporaneous with that obtained by interview. This suggests two important reasons for the lack of concordance between symptoms elicited at interview and from medical records. The first is that the information from the two sources may often relate to different episodes of MD. We have no way of knowing whether the worst episode described by the patient corresponds to the worst episode that was picked for rating using hospital records. Consistency of symptoms between depressive episodes is typically modest. For example, the mean correlation between MD symptoms elicited from a large population-based sample of female twins interviewed twice at least one year apart was 0.28. Similarly modest correlations were found in a study of 78 hospital inpatients examined at intervals of one and two years. Both studies found the highest correlations for suicidal behaviour. In our study the correlation between the suicide factors in the two sources of information was also the highest (+0.44).
Differences in recall are a second reason for discrepancies between the symptoms we obtained from hospital records and from interviews. When subjects are assessed longitudinally, recall is known to affect the results, sometimes in predictable ways: for example, Bromet et al. found that on a second assessment about twice as many patients reported fewer lifetime depressive episodes than reported more. However, consistency in recall of some items is high; for example, reporting of age at onset rarely differs by more than one year, a finding that we corroborated.
Psychiatry's reliance on interviews, rather than objective tests, for basic clinical information has spawned a large literature on the reliability of structured ways of assessing patients, and typically additional information from chart review is incorporated to obtain a best estimate of lifetime psychiatric diagnosis. These studies show that diagnoses based on interview data alone are an adequate substitute for best estimate diagnoses based on all available data. However, there has been much less interest in the validity of chart-derived information. Very few studies have examined the relationship between chart-derived and interview-derived information.
Our results can be usefully compared with a recent study of schizophrenia that used a methodology very similar to that of the current report. In 1,021 patients with schizophrenia studied in Ireland, ratings of psychotic symptoms from personal interview were compared with those from a review of medical records. Correlations for 21 signs and symptoms of psychotic illness ranged from +0.02 (somatic hallucinations) to +0.55 (religious delusions), with a mean of +0.26. Despite examining a different disease in a different country, these results are quite similar to those obtained in this study.
A few other studies are moderately relevant to our findings. For example, one study of community practitioners reported a kappa of 0.24 for reliability between chart diagnosis and that obtained from a SCID; by contrast, a survey of diagnoses made on 101 psychiatric inpatients reported high concordance, with most errors judged to have occurred in the charts. It is noteworthy that our findings, pointing to relatively good agreement between the sources of information, were also obtained from psychiatric hospitals rather than community physicians, which suggests that the setting in which medical records are obtained may be an important determinant of the reliability of the information.
Our results do not allow us to decide what to do when medical records and interview data disagree. This is most likely to be true for reports of symptoms. While we have shown that there is consistency in factor structure between the two sources of information we have no way of determining the more accurate measure.
Our results should be interpreted in the context of four potentially important methodological limitations. First, an important concern is that medical staff rated the medical records on the same patients that they interviewed. Although interviewers were instructed to keep the two sources of information separate, this may not always have been possible. Therefore we may be overestimating the degree of concordance between medical records and interview-acquired data. Second, the sample is entirely female and our results may or may not extrapolate to men in China. Third, medical records were missing for a small number of cases (n = 90) and often did not contain information about the presence or absence of some individual depressive symptoms. This may introduce an unacknowledged source of bias into our results. Fourth, we have no data on the reliability of our interviews. While we assume that the quality of our interview data is comparable to that of other studies, without analysis of repeat interviews we cannot be certain on this point.
All authors are part of the CONVERGE consortium (China, Oxford and VCU Experimental Research on Genetic Epidemiology) and gratefully acknowledge the support of all partners in hospitals across China.
Conceived and designed the experiments: JF KK HD. Performed the experiments: HD. Analyzed the data: Ying Chen HL JF KK HD. Contributed reagents/materials/analysis tools: Ying Chen HL Y. Li DX ZW FY YS SN YW Y. Liu LL CG JL LY GW KL QH TL JZ YR QD JT HC Y. Luo FZ GS CS X. Wang YZ X. Weng Yunchun Chen ZK JG Yiping Chen SS HD. Wrote the paper: Ying Chen HL JF KK.
- 1. Williams JB, Gibbon M, First MB, Spitzer RL, Davies M, et al. (1992) The Structured Clinical Interview for DSM-III-R (SCID). II. Multisite test-retest reliability. Arch Gen Psychiatry 49: 630–636.
- 2. Leckman JF, Sholomskas D, Thompson WD, Belanger A, Weissman MM (1982) Best estimate of lifetime psychiatric diagnosis: a methodological study. Arch Gen Psychiatry 39: 879–883.
- 3. Bromet EJ, Dunn LO, Connell MM, Dew MA, Schulberg HC (1986) Long-term reliability of diagnosing lifetime major depression in a community sample. Arch Gen Psychiatry 43: 435–440.
- 4. Kendler KS, Neale MC, Kessler RC, Heath AC, Eaves LJ (1993) The lifetime history of major depression in women. Reliability of diagnosis and heritability. Arch Gen Psychiatry 50: 863–870.
- 5. Fendrich M, Weissman MM, Warner V, Mufson L (1990) Two-year recall of lifetime diagnoses in offspring at high and low risk for major depression. The stability of offspring reports. Arch Gen Psychiatry 47: 1121–1127.
- 6. Miller PR, Dasher R, Collins R, Griffiths P, Brown F (2001) Inpatient diagnostic assessments: 1. Accuracy of structured vs. unstructured interviews. Psychiatry Res 105: 255–264.
- 7. Shear MK, Greeno C, Kang J, Ludewig D, Frank E, et al. (2000) Diagnosis of nonpsychotic patients in community clinics. Am J Psychiatry 157: 581–587.
- 8. Fanous AH, Amdur RL, O'Neill FA, Walsh D, Kendler KS (2011) Concordance between chart review and structured interview assessments of schizophrenic symptoms. Compr Psychiatry.
- 9. Kendler KS, Neale MC, Kessler RC, Heath AC, Eaves LJ (1992) Familial influences on the clinical characteristics of major depression: a twin study. Acta Psychiatr Scand 86: 371–378.
- 10. Oquendo MA, Barrera A, Ellis SP, Li S, Burke AK, et al. (2004) Instability of symptoms in recurrent major depression: a prospective study. Am J Psychiatry 161: 255–261.
- 11. Helzer JE, Robins LN, McEvoy LT, Spitznagel EL, Stoltzman RK, et al. (1985) A comparison of clinical and diagnostic interview schedule diagnoses. Physician reexamination of lay-interviewed cases in the general population. Arch Gen Psychiatry 42: 657–666.
- 12. Segal DL, Hersen M, Van Hasselt VB (1994) Reliability of the Structured Clinical Interview for DSM-III-R: an evaluative review. Compr Psychiatry 35: 316–327.
- 13. Helzer JE, Clayton PJ, Pambakian R, Woodruff RA (1978) Concurrent diagnostic validity of a structured psychiatric interview. Arch Gen Psychiatry 35: 849–853.
- 14. American Psychiatric Association (1994) Diagnostic and Statistical Manual of Mental Disorders. Washington, D.C.: American Psychiatric Association.
- 15. Cox JL, Holden JM, Sagovsky R (1987) Detection of postnatal depression. Development of the 10-item Edinburgh Postnatal Depression Scale. Br J Psychiatry 150: 85–100.
- 16. Endicott J, Andreasen N, Spitzer RL (1975) Family History-Research Diagnostic Criteria. New York: Biometrics Research, New York State Psychiatric Institute.
- 17. Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46.