Accuracy of Electronic Health Record Data for Identifying Stroke Cases in Large-Scale Epidemiological Studies: A Systematic Review from the UK Biobank Stroke Outcomes Group

Objective
Long-term follow-up of population-based prospective studies is often achieved through linkages to coded regional or national health care data. Our knowledge of the accuracy of such data is incomplete. To inform methods for identifying stroke cases in UK Biobank (a prospective study of 503,000 UK adults recruited in middle age), we systematically evaluated the accuracy of these data for stroke and its main pathological types (ischaemic stroke, intracerebral haemorrhage, subarachnoid haemorrhage), determining the optimum codes for case identification.
Methods
We sought studies published from 1990 to November 2013 that compared coded data from death certificates, hospital admissions or primary care with a reference standard for stroke or its pathological types. We extracted information on a range of study characteristics and assessed study quality with the Quality Assessment of Diagnostic Studies tool (QUADAS-2). To assess accuracy, we extracted data on positive predictive values (PPV) and, where available, on sensitivity, specificity, and negative predictive values (NPV).
Results
37 of 39 eligible studies assessed accuracy of International Classification of Diseases (ICD)-coded hospital or death certificate data. They varied widely in their settings, methods, reporting, quality, and in the choice and accuracy of codes. Although PPVs for stroke and its pathological types ranged from 6% to 97%, appropriately selected, stroke-specific codes (rather than broad cerebrovascular codes) consistently produced PPVs >70%, and in several studies >90%. The few studies with data on sensitivity, specificity and NPV showed higher sensitivity of hospital versus death certificate data for stroke, with specificity and NPV consistently >96%. Few studies assessed either primary care data or combinations of data sources.
Conclusions
Particular stroke-specific codes can yield high PPVs (>90%) for stroke/stroke types. Inclusion of primary care data and combining data sources should improve accuracy in large epidemiological studies, but there is limited published information about these strategies.


Introduction
Stroke is the second commonest cause of death worldwide and a major global cause of disability [1]. Pathological types and subtypes of stroke differ in their risk factor associations [2,3]. Very large prospective studies, yielding large numbers of stroke cases, are needed to examine these associations reliably [4]. Linkage to routinely collected, coded healthcare data is a practical means of ascertaining stroke and other health-related outcomes. However, such data have variable completeness and accuracy [5-10].
UK Biobank is a very large prospective cohort study of 503,000 participants, aged 40-69 years when recruited in England, Scotland and Wales between 2006 and 2010 [11]. Participants completed a detailed questionnaire at baseline, underwent a range of physical measurements, and provided biological samples for genetic, biochemical and other analyses. Follow-up is chiefly through cohort-wide linkages to National Health Service data, including electronic, coded death certificate, hospital, and primary care data. By 2017, around 5,000 incident strokes are expected to have occurred among UK Biobank participants [12].
In most countries, including the UK, hospital admissions and death certificates are coded using the International Classification of Diseases (ICD) [13-15]. The primary ICD code identifies the main condition treated during a hospital admission, or the underlying cause of death. Secondary codes record additional diagnoses relevant to an admission, or contributing to death. Codes for cerebrovascular disease include a range of presentations. Fig 1 shows which ICD codes most closely match the World Health Organisation (WHO) definition of stroke [16] or of one of its three main pathological types: ischaemic stroke, intracerebral haemorrhage (ICH), and subarachnoid haemorrhage (SAH). Although not all cerebrovascular disease codes represent a diagnosis of the clinical syndrome of stroke, many studies which have looked at determinants of stroke using linked ICD-coded datasets have included all cerebrovascular disease codes in the relevant ICD coding chapter, implicitly assuming that they are all codes for stroke. Over the last 10 years, health care systems in European countries have switched from ICD-9 to ICD-10, while those in North America use ICD-9-CM (a clinically-modified version of ICD-9). Primary care data in the UK are coded by general practitioners using the Read coding system, which encodes diagnoses, symptoms, signs, procedures, prescriptions and other administrative data [17,18].
Fig 1. International Classification of Diseases (ICD) codes for cerebrovascular disease. * 433: occlusion/stenosis of pre-cerebral arteries with or without infarction. † 434: thrombosis/embolism of cerebral arteries with or without infarction. Codes in blue text denote ICD-9 codes which most closely represent stroke when subdivided using additional coding available in the clinically modified version of ICD-9 (ICD-9-CM) used in North America. In ICD-9-CM, 'with infarction' (433.x1, 434.x1) is distinguished from 'without infarction' (433.x0, 434.x0). ‡ 436: acute, ill-defined cerebrovascular disease. ¶ A pathological term for ischaemic stroke. § G46: not a diagnostic code; may be used for the presenting symptoms of either stroke or TIA.
For health-related outcomes such as stroke, UK Biobank aims to maximise statistical power to detect genuine associations in nested case-control or case-cohort studies. This requires a strategy that identifies cases representative of the spectrum of the disease being studied with adequate sensitivity, and that maximises positive predictive value (PPV, the proportion of cases that are true positives). Minimising false positives will minimise loss of statistical power through misclassification of cases. Some false negatives can be tolerated, since these are diluted by the very much larger control population, with much more limited impact on statistical power. UK Biobank aims to fulfil these requirements by using multiple sources of coded data (primary care, hospital and death certificate data) to ascertain possible stroke cases, and then to implement algorithms, using combinations of coded data, supplemented where required by more detailed medical record review, to confirm and sub-classify cases of stroke. An important first step in developing such algorithms is to understand the accuracy of the coded data sources.
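The effect of false positive cases on an observed association can be illustrated with a simple calculation. The sketch below is not part of the review's methods: it assumes a binary exposure, that false positive 'cases' have the same exposure prevalence as controls, and hypothetical values for the true odds ratio and exposure prevalence. The function name is our own.

```python
def observed_odds_ratio(true_or: float, control_exposure: float, ppv: float) -> float:
    """Odds ratio observed when a fraction (1 - ppv) of coded 'cases' are not true cases.

    Assumption (illustrative only): false positive cases share the exposure
    prevalence of the control population.
    """
    odds_controls = control_exposure / (1 - control_exposure)
    odds_true_cases = true_or * odds_controls
    exposure_true_cases = odds_true_cases / (1 + odds_true_cases)
    # The coded case group is a mixture of true cases and false positives.
    exposure_observed = ppv * exposure_true_cases + (1 - ppv) * control_exposure
    odds_observed = exposure_observed / (1 - exposure_observed)
    return odds_observed / odds_controls


# Hypothetical scenario: true odds ratio 2.0, 20% of controls exposed.
for ppv in (1.0, 0.9, 0.7):
    print(f"PPV {ppv:.0%}: observed OR = {observed_odds_ratio(2.0, 0.2, ppv):.2f}")
```

Under these assumptions the observed odds ratio moves towards 1 as PPV falls, which is the loss of power through case misclassification described above.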
To inform approaches to ascertainment, confirmation and sub-classification of stroke in UK Biobank and other large epidemiological studies, we therefore performed a systematic review of published studies of the accuracy of coded health record data for stroke and its main pathological types. We chose not to include transient ischaemic attacks (TIAs), which are clinically harder to diagnose accurately, with poor agreement even amongst experts [19], and which have substantially less public health impact than strokes. We used the traditional, epidemiological 'symptom-based' definition of stroke (symptom duration >24 hours) to distinguish stroke from TIA [16]. The more recent, alternative 'tissue-based' definition relies on the presence of brain infarction to diagnose stroke, irrespective of symptom duration (that is, even when symptoms last <24 hours) [20]. Accurate diagnosis of brain infarction depends on the availability, choice, and timing of brain imaging, which may vary between different centres [21]. We chose to use the 'symptom-based' definition to maximise comparability between different studies.

Methods
The study protocol is displayed in S1 Appendix.

Search Strategy
We searched Medline and Embase from 1990 to November 2013 for studies which compared coded events in electronic health record data against a reference standard data source for stroke or its main types. We used a combination of medical subject heading and text word terms for 'cerebrovascular disease', 'stroke', 'medical records', 'clinical coding', and 'validation studies' (S1 Appendix). We identified additional relevant studies by reviewing the bibliographies of included primary studies and relevant reviews, as well as lists of publications from the Clinical Practice Research Datalink [22] and The Health Improvement Network [23] websites for studies evaluating accuracy of primary care data.

Eligibility Criteria
Included studies had to have assessed International Classification of Diseases (ICD) or Read coded events against a reference standard data source for stroke or for one or more of its three major pathological types (Fig 1), defined according to WHO or equivalent definitions [16]. Studies had to report which codes were validated and either their positive predictive value (PPV) or data from which it could be calculated. We excluded studies with fewer than 50 coded events (since these would have limited precision) and studies in highly selected populations at increased risk of stroke (e.g., those with vascular risk factors or known vascular disease), because of the influence of stroke prevalence on PPV. One author reviewed all titles and abstracts to select potentially relevant studies, and a subset of 10% of titles and abstracts was independently reviewed by a second author, who reached the same conclusions as the first. Two authors independently reviewed full texts of potentially relevant studies and selected studies for inclusion. Any areas of uncertainty from this two-phase study selection process were discussed and resolved with a third, senior author with extensive experience both in stroke epidemiology and in systematic review methodology.
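The dependence of PPV on prevalence follows from Bayes' theorem, and a short calculation makes it concrete. The sketch below uses hypothetical sensitivity, specificity and prevalence values chosen only for illustration; none are taken from the included studies.

```python
def ppv_from_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV implied by sensitivity, specificity and outcome prevalence (Bayes' theorem)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical code with 80% sensitivity and 99% specificity,
# applied to populations with different stroke prevalence.
for prevalence in (0.01, 0.05, 0.10):
    ppv = ppv_from_prevalence(0.80, 0.99, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}")
```

With sensitivity and specificity held fixed, PPV rises with prevalence, so a PPV measured in a high-risk population would overstate the PPV expected in a general population cohort.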

Data Extraction and Analysis
We extracted and tabulated information from each included study on: first author and publication year; geographic setting (country); age (mean and/or range) of included cases (coded events); data source (hospital, death certificates, primary care); coding system and version; codes used to identify cases; diagnostic position of these codes in the electronic health record (primary versus secondary); number of cases (coded events) compared against the reference standard; reference standard used; PPV and, where reported or calculable, sensitivity, specificity, and negative predictive value (NPV) of codes. We only extracted sensitivity, specificity and NPV values where the reference standard was a population-based stroke register which had clearly aimed to include all stroke cases in the population under study.
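For reference, the four accuracy measures can be computed from a 2x2 table of coded events against the reference standard. The following is a minimal sketch; the class name, field names and counts are illustrative and not drawn from any included study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts of coded events cross-classified against the reference standard."""
    true_pos: int   # coded as stroke, stroke on reference standard
    false_pos: int  # coded as stroke, not stroke on reference standard
    false_neg: int  # not coded as stroke, stroke on reference standard
    true_neg: int   # not coded as stroke, not stroke on reference standard

    def ppv(self) -> float:
        return self.true_pos / (self.true_pos + self.false_pos)

    def sensitivity(self) -> float:
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self) -> float:
        return self.true_neg / (self.true_neg + self.false_pos)

    def npv(self) -> float:
        return self.true_neg / (self.true_neg + self.false_neg)


# Made-up counts for illustration.
table = TwoByTwo(true_pos=90, false_pos=10, false_neg=20, true_neg=880)
print(f"PPV {table.ppv():.1%}, sensitivity {table.sensitivity():.1%}, "
      f"specificity {table.specificity():.1%}, NPV {table.npv():.1%}")
```

PPV needs only the coded events (the first two cells), which is why it could be extracted from most studies. The other three measures need the false negative and true negative counts, which are only available when the reference standard covers the whole population under study.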
We assessed study-level quality with a modified version of the Quality Assessment of Diagnostic Studies tool (QUADAS-2), [24] adapted from a recent systematic review of the validity of myocardial infarction diagnoses in administrative databases. [25] We used this to assess reporting quality, generalisability to the UK population (because we sought to recommend codes for UK Biobank), and risk of bias. The study protocol (S1 Appendix) provides a detailed list of questions and scoring methods. An overall quality score (0-14) was derived by combining scores for reporting quality, generalisability, and low risk of bias. We did not exclude studies on the basis of quality assessments.
We calculated 95% confidence intervals for PPV, sensitivity, specificity and NPV values in Stata (version 12) using the Wilson method for binomial proportions [26]. For stroke and each of its main pathological types, we assessed the influence on PPV (and, where available, sensitivity) of the codes used to identify stroke cases, and of other study characteristics, using visual inspection of tabulated data and forest plots, and making within-study comparisons where possible to minimise bias. We did not undertake formal meta-analyses or meta-regression because of the substantial heterogeneity between studies in their settings, methods and reporting.
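The Wilson interval for a binomial proportion has a closed form and is straightforward to reproduce outside Stata. The sketch below is a plain Python implementation of the standard Wilson score formula, not the authors' Stata code; the function name and the example counts are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Illustrative example: 45 of 50 coded events confirmed (PPV 90%).
lower, upper = wilson_interval(45, 50)
print(f"PPV 90%, 95% CI {lower:.1%} to {upper:.1%}")
```

Unlike the simple normal approximation, the Wilson interval stays within 0 to 1 and behaves sensibly for proportions close to 100% and for small numbers of coded events, both of which arise in validation studies of this kind.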

Results

Studies Identified
A total of 39 studies fulfilled our inclusion criteria (Fig 2). Of these, 37 were of ICD-coded hospital data, death certificates, or both. Only two were of Read-coded primary care data [27,28].

Quality Assessment
Detailed results of the quality assessment are displayed in S2 Table. Quality scores ranged from 4 to 12 (median 9, interquartile range 8 to 11). With respect to reporting quality, participant selection criteria and coding algorithms were generally well reported, but only ten studies acknowledged the potential for uncertainty of the reference standard diagnosis in their results [33,36,38,39,41,45,56,58,59,64]. With respect to generalisability to the UK population, only eight studies were conducted in the UK. However, all the other studies were based in high income countries, among populations of predominantly European origin with broadly similar health care provision, and are therefore likely to be broadly generalisable (from a global perspective) to population-based studies in these types of settings (including the UK). Of the UK-based studies, two had suboptimal generalisability because all coded discharges were taken from a single hospital department [61,62], while for the other six generalisability was unclear due to incomplete reporting [56-60,63]. With respect to risk of bias, only five studies achieved the optimum score [33,45,50,54,65]. Incomplete reference standard data (due to a variable proportion of missing or irretrievable records) [29-31, 34, 36, 37, 39, 42-44, 46-48, 51, 52, 55-57, 60, 63, 64] and lack of or inadequate blinding of adjudicators to the coded diagnosis [29, 30, 32, 34, 36-39, 42-45, 47-49, 51-53, 56, 57, 61, 62] were the most common potential causes of bias.

Accuracy of ICD-Coded Events
The range of PPVs reported for various codes used to identify stroke or one of its main pathological types was very broad, reflecting considerable heterogeneity of study characteristics. Results were particularly variable for all stroke (PPV 31-97%) and for ischaemic stroke (PPV 6-95%), while they appeared more consistent for haemorrhagic stroke (PPV 73-89%), SAH (PPV 86-96%) and ICH (PPV 71-96%), although based on fewer studies.
Within-study comparisons. Only six studies used a population-based reference standard and, of these, only four (all from Scandinavian countries) [45,48,50,64] provided sufficient data to calculate sensitivity, specificity and negative predictive value (NPV) of codes for stroke. Sensitivities for identifying stroke were around 80% or more using general cerebrovascular or stroke-specific codes from either hospital data or hospital data combined with death certificates but, unsurprisingly, sensitivity was much lower for death certificates alone (S3 Table). There were no data on sensitivity for the main pathological types of stroke.
Eight studies (all of ICD-9 codes) assessed influence of coding position on PPV for a variety of ICD-9 code groups (cerebrovascular disease codes, ischaemic stroke codes, or haemorrhagic stroke codes) [30,31,34,37,40,43,49,52]. Restriction to the primary position code (versus inclusion of codes from the primary or secondary diagnostic position) increased the PPV, but by no more than about 5-10% in all but two studies [30,37] (S4 Table). It was not possible directly to assess the influence of code position on sensitivity, but restriction to the primary position reduced the number of coded events identified by around 10-30%.
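The trade-off between coding positions can be sketched as a simple filter over admission records. The record layout below (an ordered list of codes per admission, primary code first), the code group and the function name are assumptions made for illustration; real hospital datasets differ in structure.

```python
# Illustrative stroke-specific ICD-9 code prefixes.
STROKE_PREFIXES = ("430", "431", "434", "436")


def select_admissions(admissions: list[list[str]], primary_only: bool) -> list[int]:
    """Return indices of admissions flagged as stroke.

    Each admission is a list of ICD-9 codes with the primary code first
    (an assumed layout). With primary_only=True only the first code counts.
    """
    selected = []
    for index, codes in enumerate(admissions):
        candidates = codes[:1] if primary_only else codes
        if any(code.startswith(STROKE_PREFIXES) for code in candidates):
            selected.append(index)
    return selected


# Made-up admissions: primary code first, then secondary codes.
admissions = [
    ["434.91", "401.9"],   # stroke code in primary position
    ["486", "436"],        # stroke code in secondary position only
    ["410.9", "250.00"],   # no stroke code
    ["431"],               # stroke code in primary position
]
print(select_admissions(admissions, primary_only=True))   # primary position only
print(select_admissions(admissions, primary_only=False))  # any position
```

Restricting to the primary position always returns a subset of the events found using any position, which corresponds to the reduction in coded events reported above.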
Comparisons between groups of studies reporting PPV for stroke and its main types. The PPV of codes for stroke and its main types, stratified according to the code group(s) selected (see below), are displayed in Figs 3-5. They display results of studies which identified: stroke events using either a broad selection of cerebrovascular codes or stroke-specific codes (Fig 3); ischaemic stroke events, using either codes for ischaemic and unspecified type of stroke or for ischaemic stroke alone (Fig 4); and haemorrhagic stroke events using codes for ICH and SAH together or separately (Fig 5). Informed by our within-study comparisons, results exclude studies which included the poorly performing ICD-9 code 433 among the stroke-specific or ischaemic stroke codes, except those which used the clinical modification 433.x1 (Fig 1, Table 2, Fig 4).
For each of stroke and its main pathological types, PPVs of >90% were achieved in some studies (Figs 3-5). In line with results from within-study comparisons (Table 1), stroke-specific codes yielded higher PPVs for stroke (range 68-90%) than general cerebrovascular disease codes (range 31-80%) (Fig 3), while PPVs for ischaemic stroke were slightly higher with codes for ischaemic stroke alone (range 66-95%) than with codes for ischaemic and unspecified stroke (range 65-90%), but identified smaller numbers of outcomes (Fig 4). Codes for haemorrhagic stroke, and for ICH and SAH separately, performed consistently well or very well (PPV range 65-96%) (Fig 5). In general, ICD-10 appeared to perform better than ICD-9 codes, except where the 'clinical modification' (ICD-9-CM, see Fig 1) was available. Studies from the UK, yielding data that might be considered most informative for UK Biobank, reported PPVs of 78% and 86% for ischaemic stroke in one study [57] (the lower value when codes for unspecified stroke were included), 96% for SAH in two studies [57,63] and 96% for ICH in one study [63]. The quality scores did not appear to influence PPV (Figs 3-5).
Selection of the best code using a code hierarchy. Two studies used a 'code hierarchy' to select a single stroke code when more than one was used for an individual hospital admission [34,40]. These studies selected the single 'best code' for each case, based on presumed coding accuracy (SAH>ICH>ischaemic stroke>transient ischaemic attack [TIA]). This approach was no more accurate than selection of the primary position code in one study [40], and less accurate than selection of the primary position code in another [34] (S4 Table).
Distinguishing ischaemic stroke subtypes. Very few studies assessed accuracy of ICD codes for more detailed ischaemic stroke subtypes, and none assessed accuracy for subtypes of SAH or ICH. One study found that out of 106 coded events for ischaemic stroke subtypes, >70% had unspecified ischaemic stroke subtype codes [42]. The PPV of the cardiac embolism subtype code was 73% (based on only 11 coded events), but PPVs for other ischaemic subtypes were not reported.
Another study attempted to classify ischaemic strokes into four subtypes (lacunar stroke, cardiac embolism, large artery atherosclerosis and other) based on the hospital discharge abstract (which was used to generate the ICD codes) rather than the codes themselves [54]. This approach produced PPVs of 66-87% (highest for cardiac embolism and lacunar ischaemic stroke), and sensitivities of 67-74% (highest for cardiac embolism and large artery atherosclerosis).

Studies of Read-Coded Primary Care Data
Two UK-based studies reported PPVs of Read codes from primary care data, one for ischaemic and one for haemorrhagic stroke (S5 Table) [27,28]. Neither study reported code sensitivity. PPV was 89% for ischaemic stroke and 82% for haemorrhagic stroke, increasing to 90% for haemorrhagic stroke with exclusion of haemorrhagic codes which overlapped with antithrombotic drug prescription codes.

Combining Multiple Data Sources
None of the included studies assessed the combination of primary care codes with hospital or death certificate codes for stroke or its main types. A few excluded studies compared primary care and hospital codes to search for stroke plus TIA [66,67]. A UK study found that, compared to hospital ICD codes for stroke plus TIA in a primary care population of ~5800 individuals, Read codes increased sensitivity and decreased PPV by absolute values of 53% and 17% respectively [66]. Similarly, a community-based study in Canada found that combining primary care physician billing data with hospital ICD codes detected more stroke/TIA events, but with lower PPV, compared to ICD codes alone: sensitivity for combined data sources was 78% (95% CI 66%-83%) versus 37% (95% CI 28%-46%) for ICD codes alone; PPV for combined data sources was 40% (95% CI 33%-46%) versus 81% (95% CI 70%-92%) for ICD codes alone [67]. Two UK studies explored the possibility of using medical record extracts to reduce the proportion of unspecified stroke codes (I64) [57,68]. In one, the primary care record held information to classify 74% of ICD-coded 'unspecified strokes' as ischaemic or haemorrhagic [57]. In the other, CT brain scan reports were used to assign ~8400 stroke cases (identified by ischaemic stroke, intracerebral haemorrhage or unspecified stroke codes) to a main pathological type [68]. The proportion of 'unspecified' stroke cases fell from 67% to 33% when ICD coded data plus natural language processing of scan reports was used, versus ICD coded data alone. Using a physician's classification of radiology reports of 300 randomly selected cases as a reference standard, ICD coding plus analysis of scan reports was more accurate for ischaemic (PPV 95%, 95% CI 90% to 97%) than for haemorrhagic stroke (PPV 77%, 95% CI 69% to 73%).
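The pattern seen in these two studies, higher sensitivity but lower PPV when sources are combined, can be reproduced with a toy example using sets of participant identifiers. All identifiers below are invented for illustration and do not correspond to any study data; the function name is our own.

```python
def sensitivity_and_ppv(flagged: set[int], true_cases: set[int]) -> tuple[float, float]:
    """Sensitivity and PPV of a set of flagged participants against the true cases."""
    true_positives = len(flagged & true_cases)
    return true_positives / len(true_cases), true_positives / len(flagged)


# Invented participant identifiers.
true_cases = set(range(1, 11))                     # 10 true stroke cases
hospital = {1, 2, 3, 4, 11}                        # few events, mostly correct
primary_care = {3, 4, 5, 6, 7, 8, 12, 13, 14}      # more events, more false positives

for label, flagged in [
    ("hospital alone", hospital),
    ("hospital or primary care", hospital | primary_care),
]:
    sens, ppv = sensitivity_and_ppv(flagged, true_cases)
    print(f"{label}: sensitivity {sens:.0%}, PPV {ppv:.0%}")
```

Taking the union of sources can only add flagged participants, so sensitivity cannot fall. PPV falls whenever the added source contributes proportionally more false positives, which is why a confirmation step after ascertainment may be needed.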

Discussion
As far as we are aware from published work, this is the first systematic assessment of the accuracy of coded hospital, death certificate and primary care data for identifying stroke. Previous reviews have been less comprehensive in their data presentation and analysis, or less precise in their definition of stroke, with the inclusion of TIA, subdural haemorrhage, or all cerebrovascular disease in the reference standard. A previous review based on US studies alone reported similar results but did not include UK-based studies or consider either ICD-10 codes or the performance of primary care data or combined data sources [10]. Previous UK-based reviews of ICD or Read code accuracy have reviewed overall accuracy for a wide range of diseases rather than accuracy for stroke specifically [69,70], with limited numbers of stroke/cerebrovascular disease studies [9,71-73].
We found wide variation in the performance of ICD codes for stroke and its main types, reflecting the heterogeneity of codes assessed and variation in study settings and methods. Our data also show a lack of consensus among stroke epidemiology studies about which codes should be used for identifying stroke outcomes. We have demonstrated that with appropriate selection of stroke-specific codes, PPVs of close to or >90% can be achieved for stroke and each of its main pathological types. Such PPVs will be adequate for many large scale epidemiological studies of the determinants of stroke. However, we found very few studies of the accuracy for stroke of Read-coded primary care data or of two or more overlapping data sources. Furthermore, the few available studies of ICD-coded data sources for identification of ischaemic stroke subtypes found that the majority of ischaemic subtype codes were 'unspecified' [42], and reliability of ischaemic subtype classification was limited [74,75]. We found no studies of the accuracy of coded data for identification of subtypes of ICH or SAH.
Within- and between-study comparisons revealed several consistent patterns. First, for stroke of any pathological type, PPV is increased by use of stroke-specific rather than general cerebrovascular codes, making it preferable to use stroke-specific codes to maximise PPV if no further adjudication of outcomes is planned after identification using ICD codes. Limited evidence suggests that sensitivity is poor when only death certificate data are used as a data source and is markedly increased by including data from hospital admissions, without compromising PPV [45,64]. Based on one study, using general cerebrovascular rather than stroke-specific codes also seems likely to increase sensitivity, albeit perhaps by only a small amount and at the expense of a lower PPV [48]. To reduce the number of false positives, this method of identifying stroke outcomes is, therefore, probably best used in combination with further steps to confirm which cases are true positives. The best approach for this confirmation process requires further investigation, but could potentially use combinations of ICD codes with coded data from primary care or other sources, or more detailed medical record review. Second, for ischaemic stroke, a greater number of outcomes are identified with little reduction in PPV by using a combination of ischaemic and unspecified stroke codes to identify outcomes. Third, specific codes for ICH and SAH were found to have generally high PPVs (range 71 to 96%). Fourth, across a range of codes for cerebrovascular disease, stroke and pathological stroke types, identification of stroke outcomes using only codes in the primary position increased PPV, but generally by only a modest amount and at the expense of missing true positive outcomes. Furthermore, the relevant studies were of ICD-9 codes only, which are now rarely used outside the USA [31,32,34,37,40,43,49,52]. Thus, use of appropriately selected codes in both the primary and secondary positions would seem appropriate for most purposes.
There were some limitations. First, since we only searched two online databases, we may have missed a few relevant articles. However, we also reviewed bibliographies of all included publications to increase the sensitivity of our search strategy. Second, our finding that use of the primary diagnostic position improved PPV in some studies may have been due to publication or reporting bias, since many studies did not report on this. Third, since PPV increases with increasing prevalence of the outcome studied, the lower prevalence of ICH and SAH (which together comprise around 20% of all strokes) compared with ischaemic stroke means that the PPVs of these different pathological types are not directly comparable. Fourth, some included studies had potentially less accurate sources available as a reference standard, such as hospital discharge summaries (a free text summary of the hospital admission, which is often written by less experienced doctors), or non-specialist primary care records (potentially based on hospital discharge summaries). We may have overestimated PPV of codes for haemorrhagic stroke types by using such reference standard data from two UK-based studies [57,63]. Apart from the examples above, all included studies used more accurate reference standard data sources (independent medical record review and/or expert-led stroke registers), and we excluded studies which did not use WHO or equivalent definitions of stroke and its main types [16]. However, there is no 'gold standard' diagnosis for stroke. Even experts are inconsistent in their ability to diagnose stroke [76], and the choice and timing of imaging (which may vary between centres and therefore between studies) influence the diagnostic accuracy of stroke types [77,78]. Fifth, the paucity of specific published data about the accuracy of Read-coded primary care data for stroke is an important further limitation, since up to half of stroke patients are not admitted to hospital in the UK [79,80], and hospitalised and non-hospitalised strokes may differ in the distribution of pathological types and subtypes and in their risk factor associations [81]. Combining primary care data with other sources (hospital and death certificate data) should improve the detection of non-hospitalised cases, reducing potential bias in the selection of cases. Although we identified six systematic reviews of Read code accuracy for a wide range of diseases [9,69-73], none included data specifically for stroke. Two excluded studies validated Read codes for cerebrovascular disease [66,82] against a reference standard diagnosis of 'cerebrovascular disease'. These 'reference standards' were potentially less accurate because they included hospital ICD codes and patient self-report without medical record review, or used internal validation by GP questionnaire (not an independent data source). In addition to improving case ascertainment, primary care data may enhance the sub-classification of potential stroke cases. Around 40% of ICD codes for stroke are of unspecified type, although this proportion may be declining [83,84]. Diagnostic codes combined with investigation, procedure, and/or medication codes (in primary care or hospital data) may increase PPV for ischaemic or haemorrhagic stroke [28,53].

Conclusions
Informed by this review, we recommend using 430, 431, 434, 436 (ICD-9), or I60, I61, I63, I64 (ICD-10), in either the primary or secondary diagnostic position to identify stroke cases with sufficiently high PPV for use in epidemiological studies where further confirmation steps are not envisaged. This may achieve PPVs of >90% for stroke. To increase the number of potential events identified, we suggest using all cerebrovascular disease ICD codes (ICD-9 430-438, or ICD-10 I60-I69, G45, G46) in both primary and secondary positions, but these would have to be combined with additional methods of stroke confirmation to maintain a high PPV. For ischaemic stroke we recommend codes 434, 436 (ICD-9), 433.x1, 434.x1, 436 (ICD-9-CM), and I63, I64 (ICD-10). For haemorrhagic stroke we recommend 430 (ICD-9) and I60 (ICD-10) for SAH, and 431 (ICD-9) and I61 (ICD-10) for ICH. Identifying more detailed stroke subtypes is likely to require coded data from investigations, procedures, and/or drug prescriptions, as well as diagnostic codes, and possibly more detailed review of medical record and imaging data.
Ultimately, UK Biobank aims to improve the accuracy and completeness of stroke outcomes ascertainment by linking multiple sources of coded data. Further work is needed to examine the use of multiple coded data sources to maximise PPV and sensitivity for stroke.

Supporting Information
S1 Appendix. Study protocol. (DOCX)
S1 Table. Characteristics of studies validating ICD codes from hospital and death certificate data for stroke and its pathological types.