
Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project

  • Giuseppe Roberto,

    Affiliation Regional Agency for Healthcare Services of Tuscany, Epidemiology unit, Florence, Italy

  • Ingrid Leal,

    Affiliation Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, Netherlands

  • Naveed Sattar,

    Affiliation British Heart Foundation Glasgow Cardiovascular Research Centre, University of Glasgow, Glasgow, United Kingdom

  • A. Katrina Loomis,

    Affiliation Pfizer Worldwide Research and Development, Groton, Connecticut, United States of America

  • Paul Avillach,

    Affiliations Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, Netherlands, Department of Biomedical Informatics, Harvard Medical School & Children’s Hospital Informatics Program, Boston Children’s Hospital, Boston, Massachusetts, United States of America

  • Peter Egger,

    Affiliation GlaxoSmithKline, Worldwide Epidemiology GSK, Stockley Park West, Uxbridge, United Kingdom

  • Rients van Wijngaarden,

    Affiliation PHARMO Institute for Drug Outcomes Research, Utrecht, Netherlands

  • David Ansell,

    Affiliation The Health Improvement Network, Cegedim Strategic Data Medical Research Ltd, London, United Kingdom

  • Sulev Reisberg,

    Affiliation Quretec, Software Technology and Applications Competence Center, University of Tartu, Tartu, Estonia

  • Mari-Liis Tammesoo,

    Affiliations Estonian Genome Center, University of Tartu, Tartu, Estonia, Tartu University Hospital, Tartu, Estonia

  • Helene Alavere,

    Affiliations Estonian Genome Center, University of Tartu, Tartu, Estonia, Tartu University Hospital, Tartu, Estonia

  • Alessandro Pasqua,

    Affiliation Health Search, Italian College of General Practitioners and Primary Care, Firenze, Italy

  • Lars Pedersen,

    Affiliation Department of Clinical Epidemiology, Aarhus University Hospital, Aarhus, Denmark

  • James Cunningham,

    Affiliation University of Manchester, Manchester, United Kingdom

  • Lara Tramontan,

    Affiliation Arsenàl.IT Consortium, Veneto's Research Centre for eHealth Innovation, Treviso, Italy

  • Miguel A. Mayer,

    Affiliation Hospital del Mar Medical Research Institute (IMIM) and Universitat Pompeu Fabra, Barcelona, Spain

  • Ron Herings,

    Affiliation PHARMO Institute for Drug Outcomes Research, Utrecht, Netherlands

  • Preciosa Coloma,

    Affiliation Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, Netherlands

  • Francesco Lapi,

    Affiliation Regional Agency for Healthcare Services of Tuscany, Epidemiology unit, Florence, Italy

  • Miriam Sturkenboom,

    Affiliation Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, Netherlands

  • Johan van der Lei,

    Affiliation Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, Netherlands

  • Martijn J. Schuemie,

    Affiliations Janssen Research & Development, Epidemiology, Titusville, New Jersey, United States of America, Observational Health Data Sciences and Informatics, New York, New York, United States of America

  • Peter Rijnbeek,

    Affiliation Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, Netherlands

  • Rosa Gini

    Affiliation Regional Agency for Healthcare Services of Tuscany, Epidemiology unit, Florence, Italy




Abstract

Due to the heterogeneity of existing European sources of observational healthcare data, data source-tailored choices are needed to execute multi-data source, multi-national epidemiological studies. This makes transparent documentation paramount. In this proof-of-concept study, a novel standard data derivation procedure was tested in a set of heterogeneous data sources. Identification of subjects with type 2 diabetes (T2DM) was the test case. We included three primary care data sources (PCDs), three record linkage of administrative and/or registry data sources (RLDs), one hospital data source and one biobank. Overall, data from 12 million subjects from six European countries were extracted. Based on a shared event definition, sixteen standard algorithms (components) useful to identify T2DM cases were generated through a top-down/bottom-up iterative approach. Each component was based on one single data domain among diagnoses, drugs, diagnostic test utilization and laboratory results. Diagnoses-based components were subclassified considering the healthcare setting (primary, secondary, inpatient care). The Unified Medical Language System was used for semantic harmonization within data domains. Individual components were extracted and the proportion of the population identified was compared across data sources. Drug-based components performed similarly in RLDs and PCDs, unlike diagnoses-based components. Using components as building blocks, logical combinations with AND, OR, AND NOT were tested and local experts recommended their preferred data source-tailored combination. The population identified per data source by the resulting algorithms varied from 3.5% to 15.7%; however, age-specific results were fairly comparable. The impact of individual components was assessed: diagnoses-based components identified the majority of cases in PCDs (93–100%), while drug-based components were the main contributors in RLDs (81–100%).
The proposed data derivation procedure allowed the generation of data source-tailored case-finding algorithms in a standardized fashion, facilitated transparent documentation of the process and benchmarking of data sources, and provided bases for interpretation of possible inter-data source inconsistency of findings in future studies.


Introduction

In recent years, an increasing number of projects have been focusing on re-using existing electronic health records (EHR) for clinical research.[1] In particular, huge efforts have been made to combine health data from isolated environments and perform valid multi-data source observational studies.[2, 3]

In this context, the European Medical Information Framework (EMIF) project was launched with the main objective of building an infrastructure for the efficient re-use of existing European health care data for epidemiological research. Within the project, a federation of heterogeneous sources of real world data (e.g. administrative, hospital or primary care databases, disease registries, biobanks), currently collecting health information on around 52 million European citizens, collaborates in the EMIF-Platform, whose focus is the consistent exploitation of currently available patient-level data to support novel research. One of the main challenges for the EMIF-Platform is to deal with the heterogeneous characteristics of the participating data sources and to facilitate the execution of high quality multi-national, multi-data source observational studies based on populations with otherwise inconceivable sample sizes and follow-up time spans.

In general, different strategies can be adopted to identify a population of interest from a single source of EHR.[4, 5] The choice of a particular case-finding algorithm is generally driven by both the specific research question and the data source peculiarities.[6] The chosen algorithm, however, can significantly affect the characteristics of the cases identified [4, 5] and, for this reason, should be carefully taken into account when discussing study results.

In multi-data source studies, tailored choices may be necessary [6–8], and the diversity of local case-identification algorithms may increase along with the heterogeneity of the data sources involved [6, 9, 10]. A transparent process of documentation and evaluation of local case-finding algorithms becomes paramount for the correct interpretation of study results as well as for the discussion of possible inter-data source inconsistency of study findings [10–12]. It must be noted that data sources available to study European populations are much more heterogeneous than data sources from a single country, such as the United States [10]. Therefore, in order to address this issue, the EMIF-Platform designed a novel standard procedure for data derivation which leverages the experience gained from previous European multi-national, multi-data source studies [2, 3, 9, 13]. In this proof-of-concept study, the identification of type 2 diabetes mellitus (T2DM), a common chronic condition with important implications for future health [14], was used as a test case.

Materials and Methods

Data sources

Eight European data sources collecting health care information on around 20 million subjects from six different countries participated in this study (Table 1). Three were primary care data sources (PCDs), three were record linkage systems of different registries (RLDs), one was a hospital data source (HD) and one was a biobank (BD). Specifically, the three primary care data sources were the Health Search IMS Health LPD database (HSD, Italy),[9, 15] the Integrated Primary Care Information database (IPCI, The Netherlands)[16] and The Health Improvement Network database (THIN, UK), in which the general practitioners (GPs) act as keepers of all of the patient’s medical information.[17] The three record linkage data sources were the Aarhus University Hospital (AUH, Aarhus, Denmark),[18, 19] PHARMO (PHARMO, The Netherlands)[20] and the Regional Health Authority of Tuscany (ARS, Italy),[9, 15] which collect data from different sources (e.g. hospital discharge records, death registries, drug dispensing and procedures). The HD was the Information System of Parc de Salut Mar Barcelona (IMASIS, Spain), which records information from routine healthcare activities of Hospital del Mar of Barcelona.[21, 22] The BD was the Estonian Genome Center of University of Tartu (EGCUT, Estonia), in which information from interviews of voluntary donors of biological samples is collected through standard questionnaires.[23] EGCUT is the only cross-sectional data source included in this study. In all data sources except the Spanish HD, IMASIS, information on a representative sample of the population living in the corresponding geographic area is collected. In the Italian PCD and in the Estonian BD only the adult population is represented (>14 and >18 years of age, respectively). The information in the corresponding databases is recorded using different coding systems. Diagnoses are coded according to the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) or ICD-10 (10th version), the International Classification of Primary Care (ICPC) or READ, or are recorded as free text. Prescriptions/dispensings are coded according to the Anatomical Therapeutic Chemical classification (ATC) or BNF/Multilex. The majority of the data sources collect records concerning the utilization of diagnostic procedures and laboratory results. The coding of these data domains is based on local service terminologies.

Study population and design

In each participating data source the study population comprised all subjects who were active on January the 1st 2012 (reference date) and were aged ≥16 years on that date. Due to sample size issues, an exception was made for EGCUT, for which January the 1st 2009 was used as the reference date.

A descriptive, cross-sectional, retrospective multi-database study was performed. Patients with T2DM were identified within the populations selected from the participating data sources by using different case-finding algorithms.

Event definition

T2DM is a chronic clinical condition characterized by hyperglycemia due to insulin resistance and a progressive deficiency in insulin production[14]. It represents the most common form of diabetes, comprising about 90% of all cases of diabetes worldwide[24]. Diagnosis and follow-up of T2DM are based on laboratory tests for blood glucose measurements, and treatment includes lifestyle interventions (i.e. diet and physical exercise) and use of medications[14].

As the first step of the data derivation procedure, a shared clinical definition of T2DM was adopted (Fig 1) and defined according to the European Society of Cardiology and European Association for the Study of Diabetes (ESC/EASD) guideline[14].

Generation of a list of component algorithms

To identify subjects with T2DM in a healthcare data source, information from one or more data domains may be available. Diagnoses and/or records collecting information on patients’ routine clinical care and follow-up, such as drug prescriptions, utilization of diagnostic tests and laboratory results, can be used.[4, 25, 26] By combining data from one or more of these domains, different case-finding strategies, with different sensitivity and positive predictive value (PPV), can be obtained.[4]

As the second step of the data derivation procedure, a unique list of standard algorithms useful to identify cases of T2DM in the selected data sources was generated. Such standard algorithms, referred to as “component algorithms”, were defined as rules to identify subjects with a defined pattern of records selected from a single data domain. For the identification of T2DM, a total of four data domains were concerned: diagnoses (DIAG), drug prescription/dispensing (DRUG), utilization of a diagnostic test (TEST) and laboratory results (LABVAL). Component algorithms could be intended as inclusion, exclusion or refinement criteria. Two sources of knowledge were leveraged and integrated for the design of the component algorithms: a central expert-based definition of T2DM (top-down engineering) and the expertise provided by local data source experts (bottom-up learning).[8] The top-down engineering was embedded in the previously mentioned clinical definition (see previous subsection) and in an operational definition. The latter was intended as a description of typical diagnostic and therapeutic patterns that patients with T2DM are expected to follow. Both the clinical and operational definitions were created by researchers with clinical expertise and agreed upon by both the local data source experts and the central study leader. The bottom-up learning, instead, was embedded in a questionnaire in which local experts were asked to describe the algorithms they would have used to identify T2DM cases in their own data source, possibly mentioning relevant validation studies. All the information gathered was then used by the central study leader to create a unique list of inclusion, exclusion and refinement criteria corresponding to those mentioned at least once in one of the documents generated. The list was created on the grounds of the central study leader's judgement and revised by the clinical and local experts.
Each criterion was then translated into a standard component algorithm, as follows. As already described in greater detail in a previously published paper,[3] the Unified Medical Language System (UMLS) was used to build a shared semantic foundation across the different coding systems: medical concepts pertinent to the clinical and operational definition of T2DM were identified and projected to local terminologies. The final list of local codes, strings and free text keywords was obtained through an iterative process involving local experts’ feedback. Each component algorithm was fully described by two additional rules: the first was the pattern of records that triggers identification of the event (for instance: at least two records in the same calendar year), and the second concerned the criteria to identify the case’s index date (e.g. date of the first record).
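
The two rules that complete a component algorithm (trigger pattern and index date) can be illustrated with a minimal Python sketch. This is not the EMIF or Jerboa implementation: the function name, the record layout (subject identifier plus record date) and the example records are hypothetical, and the index date is assumed to be the first qualifying record overall.

```python
from collections import defaultdict
from datetime import date


def component_min_records_per_year(records, reference_date, min_records=2):
    """Return {subject_id: index_date} for subjects with at least
    `min_records` qualifying records within one calendar year.

    `records` is an iterable of (subject_id, record_date) pairs that have
    already been filtered to the codes of a single data domain.
    Records after `reference_date` are ignored. The index date is assumed
    to be the date of the subject's first qualifying record.
    """
    per_year = defaultdict(lambda: defaultdict(int))  # subject -> year -> count
    first_record = {}
    for subject_id, record_date in records:
        if record_date > reference_date:
            continue
        per_year[subject_id][record_date.year] += 1
        if subject_id not in first_record or record_date < first_record[subject_id]:
            first_record[subject_id] = record_date
    return {
        subject_id: first_record[subject_id]
        for subject_id, years in per_year.items()
        if max(years.values()) >= min_records
    }


# Hypothetical drug records for three subjects (illustrative only).
records = [
    ("A", date(2010, 3, 1)),
    ("A", date(2010, 9, 15)),   # second record in 2010: pattern met
    ("B", date(2010, 12, 20)),
    ("B", date(2011, 1, 10)),   # two records, but in different years
    ("C", date(2012, 5, 1)),    # after the reference date: ignored
]
cases = component_min_records_per_year(records, reference_date=date(2012, 1, 1))
print(cases)
```

In this toy input only subject A satisfies the pattern, with the date of the first record as index date.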

Data extraction and analysis: “the component algorithm strategy”

A distributed network approach was adopted in EMIF to allow partners to maintain control of their data and to benefit from local data source experts’ advice on the appropriate use of data and interpretation of results.[2, 3] Local experts were asked to select and extract all component algorithms considered useful to identify T2DM cases in their data source. All person-time available up to the reference date was considered for algorithm application. Extracted data were prepared for input into Jerboa, a custom-built software developed in the EU-ADR project[2] which was run locally to standardize the data aggregation process. After providing formal approval, local data source experts uploaded aggregated analytical datasets to a common virtual machine.

Using a custom-built analysis tool (a Microsoft Access interface for Stata [StataCorp. 2013. Stata Statistical Software: Release 13. College Station, TX: StataCorp LP] and LaTeX), local experts could test the extracted components in any possible logical combination by using Boolean operators (i.e. AND, OR, AND NOT). This strategy, which we refer to as “the component algorithm strategy”, allowed local experts to build more complex case-finding strategies (composite algorithms) by combining two or more of the extracted components as a means of inclusion, exclusion or refinement. The process of testing different combinations of components was led separately by each local expert, who finally chose a particular combination of components as the recommended composite algorithm for the identification of T2DM in the relevant data source. The local experts, the clinical experts and the central study leader held a series of conference calls to discuss the final choice of the recommended composite algorithm, but in case of disagreement the local expert's opinion prevailed. A comment describing the reasons behind the choice was recorded together with an estimate, either objective or subjective, of the expected sensitivity and PPV. This information, as well as the minutes from the conference calls, was stored and intended as a source of reusable knowledge.
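
As an illustration of how components can be combined with Boolean operators, the sketch below treats each extracted component as a set of subject identifiers. It is a simplified Python stand-in, not the Access/Stata tool used in the study; the helper names and the subject identifiers are hypothetical, while the two combinations mirror the composite algorithms reported for the Spanish HD and the Dutch PCD.

```python
def either(a, b):
    """OR: subjects identified by at least one of the two components."""
    return a | b


def both(a, b):
    """AND: subjects identified by both components."""
    return a & b


def excluding(a, b):
    """AND NOT: subjects identified by `a` but not by `b`."""
    return a - b


# Hypothetical subject identifiers per component (illustrative only).
components = {
    "DIAG_T2DM_INP": {1, 2, 3, 4},
    "DIAG_T1DM": {4, 9},
    "DIAG_T2DM_PC": {1, 2, 5},
    "DRUG_ORAL_ONE": {2, 5, 6},
}

# Hospital-style composite: inpatient T2DM diagnoses AND NOT type 1 diagnoses.
hd_composite = excluding(components["DIAG_T2DM_INP"], components["DIAG_T1DM"])

# Primary-care-style composite: T2DM diagnoses OR a sensitive drug pattern.
pcd_composite = either(components["DIAG_T2DM_PC"], components["DRUG_ORAL_ONE"])

print(sorted(hd_composite), sorted(pcd_composite))
```

Representing components as sets keeps every combination reproducible from the same stored building blocks, which is the property the strategy relies on.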

Presentation of results

Results from the application of individual components and recommended composite algorithms were compared across data sources and presented as age-specific percentages of subjects identified in the study population of the corresponding data source.

In each participating data source, the impact of each extracted component was assessed with respect to the total number of subjects identified using the recommended composite algorithm, which was considered as the reference case population. For this purpose, we calculated: i) the percentage of subjects in the reference case population who were identified by each component and ii) the prevalence rate ratio (PRR) of subjects identified by the recommended composite algorithm with and without the use of the tested component as an additional inclusion criterion, i.e. PRR = ((N in tested component OR in recommended composite algorithm) / N in recommended composite algorithm) - 1.
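
The two impact measures can be written out as a short Python sketch, again representing each algorithm as a set of subject identifiers. The function name and the counts are hypothetical; the PRR is computed as (|union| - |composite|) / |composite|, which is algebraically the same as the expression above.

```python
def component_impact(component, composite):
    """Return (percentage of the reference case population identified by
    the component, PRR when the component is added as an inclusion criterion).

    `component` and `composite` are sets of subject identifiers; `composite`
    is the reference case population and must not be empty.
    """
    n_reference = len(composite)
    pct_of_reference = 100 * len(component & composite) / n_reference
    # Equivalent to len(component | composite) / n_reference - 1.
    prr = (len(component | composite) - n_reference) / n_reference
    return pct_of_reference, prr


# Hypothetical example: 100 reference cases; the tested component identifies
# 70 subjects, 50 of whom are already reference cases.
composite = set(range(1, 101))
component = set(range(51, 121))

pct, prr = component_impact(component, composite)
print(f"{pct:.1f}% of reference cases, PRR = {prr:+.1%}")
```

A PRR close to zero means the component adds few subjects beyond the recommended algorithm, whereas a large PRR signals that adding it would substantially inflate the case population.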

Patient records were anonymized and de-identified prior to analysis and only aggregated data were shared across sites; therefore, no written informed consent was necessary for this study. Permission both for re-use of the data analyzed in this study and for publication of the results obtained was granted by each participating organization's review board.

The full protocol of the research project is publicly available on the electronic register of observational studies of the European Network of Centers for Pharmacoepidemiology and Pharmacovigilance.


Results

Since this was a proof-of-concept study, results presented here are not intended as estimates of disease frequency.

Overall, the EMIF-Platform provided for this study aggregated health data from around 12 million European citizens.

The size of the study populations selected from the participating data sources ranged from 1600 to 3.4 million subjects. Component algorithms included in at least one recommended composite algorithm are reported in Table 2.

In Fig 2, four examples of comparisons of age band-specific results from individual component algorithms across data sources are shown. The full list of comparisons, covering all components extracted from at least two data sources and included in at least one recommended composite algorithm, is available as supporting information in S1 Fig. For DIAG-based components, very different performances were associated with the healthcare setting of data collection (primary, secondary, inpatient care). The components DRUG_ORAL (i.e. ≥2 records of non-insulin antidiabetic drugs utilization in one calendar year) and DRUG_INSULIN (i.e. ≥2 records of insulin utilization in one calendar year) were extracted in all the participating PCDs and RLDs and resulted in comparable age band-specific percentages of subjects identified in the respective study populations.

Fig 2. Comparison of results from individual component algorithms: four examples.

The data source tailored recommended composite algorithms are shown in Fig 3 together with the comments of the local experts. The PCD from UK and the BD from Estonia adopted the component algorithm based on T2DM diagnoses from primary care (DIAG_T2DM_PC) only as the recommended choice. The HD from Spain excluded from the pool of subjects identified through inpatients diagnoses of T2DM (DIAG_T2DM_INP) those with a recorded diagnosis of type 1 diabetes (DIAG_T1DM). Only three data sources (the PCD from Italy and the RLDs from Denmark and Italy) had previous validation studies available [25, 27, 28]. They all adopted a final composite algorithm based on the relevant validation study.

Fig 3. Recommended composite algorithms: age band-specific percentages of subjects identified on the relevant total study population.

PPV: Positive Predictive Value.

The Dutch PCD added a sensitive pattern of utilization of non-insulin antidiabetic drugs (i.e. DRUG_ORAL_ONE) as an inclusion criterion, due to the observed low sensitivity of the DIAG-based algorithm DIAG_T2DM_PC in this data source. The Dutch RLD chose to include only subjects utilizing non-insulin antidiabetics, because the available DIAG-based component that used diagnoses from the inpatient setting was considered unreliable by local experts.

Through the application of the recommended composite algorithm, the lowest percentage of the study population was identified in the Estonian BD (3.5%) and the highest in the Spanish HD (15.7%). In the RLDs it ranged from 4.1% to 7.5%, while in PCDs it ranged from 6.8% to 8.6%. The age band-specific percentages of the total case populations identified using the recommended composite algorithms showed more comparable results across all participating data sources (Fig 3). The expected sensitivity of the recommended composite algorithms, as reported by local experts either from previous validation studies or from subjective judgement, was >0.9 in all data sources except for the Italian and Dutch RLDs, for which a sensitivity between 0.7 and 0.9 was expected. As for PPV, only the Italian and Danish RLDs reported an expected value between 0.7 and 0.9, while for the remaining data sources the figure was >0.9.

The union of any extracted DIAG-based component among the five intended as inclusion criteria identified from 93% to 100% of the reference case population in PCDs (Table 3), 100% in both BD and HD, and from 15% to 73% in RLDs. In RLDs, DRUG-based components identified from 81% to 100% of the respective total case population, compared with 58% to 83% in PCDs. TEST-based components were included in the recommended composite algorithm of the Danish RLD only, in which these algorithms identified 44% of the total case population. Although TEST-based components were also extracted from the Italian RLD, they were not included in the recommended composite algorithm since they would have almost doubled the total case population (PRR = +79.2%), thus suggesting too low a specificity. LABVAL-based algorithms were included in the recommended composite algorithm of the Italian PCD only: overall, the three components from this data domain identified 46% of the total case population. Notably, subjects from the same data source could be identified by one or more components; thus the percentages reported above may overlap.

Table 3. Impact of extracted component algorithms on total case population identified in each participating data source through the application of the relevant recommended composite algorithm.


Discussion

Through the application of the standard data derivation procedure tested in this study, cases of T2DM were identified in eight distinct sources of health data with heterogeneous characteristics. Logical combinations of standardized component algorithms, each based on a single data domain, were used to build data source-tailored case-finding algorithms. This “component algorithm strategy” facilitated both benchmarking and interpretation of results across data sources. It also allowed the assessment of the impact of individual standardized component algorithms on the total population of cases retrieved in each participating data source, which ultimately provided insight into the strengths and limitations of each data source with respect to the identification of T2DM cases.

Compared to previous projects that aimed to combine different European sources of EHR for research purposes,[2, 3] the main innovation of the standard procedure tested in this study was the use of component algorithms as building blocks that could be combined to create more complex case-finding algorithms. As demonstrated by the results presented here, in the context of a multi-national, multi-data source study, the “component algorithm strategy” represents an extremely flexible approach for generating EHR-driven[6] case-finding algorithms in a standardized fashion: on the one hand, it allows the local experts’ knowledge of the EHR “natural system”[8] to be fully leveraged, avoiding loss of information and assuring the correctness of the derived information, while, on the other hand, it facilitates the interpretation and benchmarking of results obtained even across data sources with very different characteristics. Notably, the data derivation procedure tested in this study requires that all component algorithms locally available for the identification of the condition of interest be extracted, tested and stored regardless of whether they will subsequently be included in the final recommended composite algorithm. This also gives investigators and local experts the chance to tweak the preferred identification algorithm at the study design stage, according to the study questions.

Gaining insight into cases identified by data source-tailored case-finding algorithms

In this study, the composite algorithms recommended by local experts for the identification of T2DM were extremely variable, resulting, however, in a selection of cases that are likely to represent the best possible local approximation of the true case identification. Indeed, since the age-specific prevalence of diabetes is expected to be fairly homogeneous across the geographic areas we are considering,[29] the observed differences in terms of percentage of the corresponding study populations can be interpreted in light of both the specific components adopted and the relevant data sources’ characteristics. Among all data sources, the highest percentage of cases was identified in HD because this data source only captures subjects who visit the hospital, who, by definition, will have a higher burden of disease than the general population. At the other extreme, the BD showed the lowest percentage, possibly because people volunteering to participate in this data source are slightly healthier than the general population. Both HD and BD identified patients with T2DM using DIAG-based components only. However, while in HD cases were identified only among inpatients, who are expected to be at a more advanced stage of the disease and more likely to have comorbidities,[5] in BD the characteristics of cases were probably more representative of patients with T2DM in the corresponding source population, because diagnoses are recorded in a primary care setting. As for the three primary care data sources, the Italian PCD adopted a case-finding strategy based on data from DIAG and LABVAL. This strategy was expected to be very sensitive. Moreover, in a previous validation study, it was also proven to have the highest possible PPV.[27] Therefore, its recommended algorithm can be considered an excellent approximation of a true case identification and the observed percentage of cases can be assumed to be a valid estimate of the prevalence of T2DM in the corresponding source population. In the PCD from UK a lower percentage of cases was identified compared to the Italian PCD. This result could be due to a slight underreporting of diagnoses in the data source. As for the Dutch PCD, the age-specific percentage of detected cases was almost identical to that observed in the PCD from UK. However, in the Dutch PCD a DRUG-based algorithm was adopted as an additional inclusion criterion alongside the DIAG-based component DIAG_T2DM_PC, since the latter was not sensitive enough when used alone. In fact, general practitioners participating in the Dutch PCD often record diagnoses using free text descriptions, which may sometimes elude the keyword-based retrieval process. Among RLDs, the percentage of the population identified in the Dutch RLD was slightly lower than that observed in the other two RLDs, from Italy and Denmark. Indeed, local experts of the Dutch RLD recommended the use of one single DRUG-based component (DRUG_ORAL) as the preferred case-finding algorithm, while the other two data sources, on the grounds of previous validation studies,[25, 28] adopted more complex composite algorithms that increased sensitivity by also including components based on DIAG and/or TEST. In particular, the Danish RLD was the only data source collecting diagnoses from secondary care. Notably, TEST-based components, which identify patients through specific patterns of utilization of glycated haemoglobin tests, were not included in the Italian RLD since they proved to be far less specific than in the Danish RLD. This was clearly shown when the impact of TEST-based components on the total population of cases identified in the two data sources was observed.
Such a difference was probably due to local healthcare system organization and guidelines with respect to diagnosis and follow-up of diabetic patients.

Understanding quality of a local case-finding algorithm

In studies utilizing routinely collected health data, understanding the quality of local case-finding algorithms is paramount for the interpretation of study findings[11, 30] and a fortiori in multi-data source studies. The component algorithm strategy proposed in this study can indirectly provide an approximation of algorithm validity indices, even when no formal validation studies are available for one or more of the participating data sources. This is attained through the benchmarking of components and composite algorithms across data sources with similar characteristics but collecting data from different geographic areas or vice versa.

Indeed, in this study, cases in PCDs were basically identified through primary care diagnoses and are thus expected to be fairly representative of the T2DM patients in the corresponding source populations. In RLDs, instead, most cases were captured through non-insulin antidiabetic drug utilization, which cannot identify those patients at an earlier stage of the disease who are not on drug treatment (i.e. on diet only) and may also misclassify as T2DM other diseases for which the same drugs can sometimes be used (e.g. polycystic ovary syndrome).[4] Supposing that the validity of the latter case-finding algorithm was completely unknown, data reported in Table 3 can be used to obtain an approximation of its expected sensitivity and PPV. As an example, the Dutch RLD, which used a case-finding algorithm based on the utilization of non-insulin antidiabetics only (i.e. DRUG_ORAL), can be considered. Since sensitivity corresponds to the percentage of subjects with a true diagnosis of T2DM who also have the DRUG_ORAL pattern of non-insulin antidiabetic drug utilization, this percentage can be estimated from the Dutch PCD to be around 77%, or slightly lower if we accept that sensitivity in the Dutch PCD is not 100% (the corresponding percentage in the other two PCD data sources is lower than 55%). PPV, instead, is the percentage of subjects utilizing oral antidiabetics who really have type 2 diabetes. In this case, a value higher than 90% is expected since other indications for such drugs have a very low prevalence.[4] In fact, this is also confirmed in both PCDs from Italy and UK, where the component DRUG_ORAL added less than 3% of cases when used as an additional inclusion criterion.
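
The benchmarking reasoning above can be sketched in Python under a strong assumption: that cases identified through primary care diagnoses in a comparable PCD can stand in for the true T2DM population. The function name and all counts below are hypothetical and chosen only for illustration; they are not the values of Table 3.

```python
def approximate_validity(drug_component, diagnosed_cases):
    """Approximate sensitivity and PPV of a drug-based component, treating
    the diagnosis-based cases as a proxy for the true case population.

    Both arguments are non-empty sets of subject identifiers from the same
    data source.
    """
    overlap = len(drug_component & diagnosed_cases)
    sensitivity = overlap / len(diagnosed_cases)  # diagnosed cases also on the drug pattern
    ppv = overlap / len(drug_component)           # drug users who are diagnosed cases
    return sensitivity, ppv


# Hypothetical primary care data: 100 diagnosed cases, 79 subjects with the
# drug pattern, 77 of whom also carry a diagnosis.
diagnosed_cases = set(range(1, 101))
drug_component = set(range(24, 103))

sensitivity, ppv = approximate_validity(drug_component, diagnosed_cases)
print(f"sensitivity ~ {sensitivity:.2f}, PPV ~ {ppv:.2f}")
```

Because the proxy itself is imperfect, both figures are rough approximations: the sensitivity estimate is optimistic wherever diagnoses are under-recorded, which is the caveat the text raises for the Dutch PCD.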

Tailoring selection of components to a research question

Since this study was solely intended as an exercise to test the feasibility of the proposed methodology, the research question was rather generic and, consequently, not all the composite algorithms were chosen with the primary objective of addressing specificity or sensitivity. In general, the preferences of local experts were more often directed towards sensitivity, at the expense of specificity, with the notable exception of RLD-NL. Diagnosis-based components were selected preferentially because of their face validity. Components based on unspecified type of diabetes (DIAG_DMUNSPEC or DIAB_OTH) were used in PCD-I and RLD-I due to specific characteristics of the local data source. Moreover, components that minimally or slightly increased capture were generally included, while non-specific components were dropped if they inflated capture, e.g. DRUG_INSULIN and DRUG_ORAL in PCD-UK, in order to avoid misclassifying type 1 as type 2 diabetes.

Indeed, at the design stage of a specific study, the proposed data derivation procedure allows investigators and local experts to modify their preferred identification algorithm according to the type of study question or sensitivity analysis. In the case of a study involving T2DM, if specificity is important, they may switch to DIAG-based components at the expense of sensitivity. This may happen, for instance, when studying the occurrence of organ complications in patients with T2DM. If sensitivity is important, they may add other inclusion criteria, such as TEST-based components: this may be recommended in safety studies. Finally, if homogeneity across different data sources is important, investigators and local experts may agree to adopt a DRUG-based strategy.
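The three design choices described above can be sketched as a small selection rule. This is an illustrative Python sketch under stated assumptions, not the procedure used in the study: the `Priority` enum, the function names, the component names `DIAG_T2DM` and `TEST_HBA1C`, and the example lists are hypothetical, and a composite algorithm is assumed to be a simple union (logical OR) of its components.

```python
from enum import Enum


class Priority(Enum):
    SPECIFICITY = "specificity"
    SENSITIVITY = "sensitivity"
    HOMOGENEITY = "homogeneity"


def family(component):
    """Return the family prefix of a component name, e.g. 'DRUG' for 'DRUG_ORAL'."""
    return component.split("_", 1)[0]


def select_components(preferred, available, priority):
    """Pick components for a composite algorithm according to the study priority.

    SPECIFICITY: keep DIAG-based components only.
    HOMOGENEITY: keep DRUG-based components only.
    SENSITIVITY: keep the preferred algorithm and add TEST-based components.
    """
    if priority is Priority.SPECIFICITY:
        return [c for c in available if family(c) == "DIAG"]
    if priority is Priority.HOMOGENEITY:
        return [c for c in available if family(c) == "DRUG"]
    extra = [c for c in available if family(c) == "TEST" and c not in preferred]
    return list(preferred) + extra


def is_case(patient_components, selected):
    """Assumed composite rule: a patient is a case if any selected component fires."""
    return any(c in patient_components for c in selected)


# Hypothetical component lists for one data source.
AVAILABLE = ["DIAG_T2DM", "DRUG_ORAL", "DRUG_INSULIN", "TEST_HBA1C"]
PREFERRED = ["DIAG_T2DM", "DRUG_ORAL"]

for priority in Priority:
    print(priority.value, select_components(PREFERRED, AVAILABLE, priority))
```

In practice the choice would remain with investigators and local experts; the sketch only makes explicit how the same set of component algorithms can yield different composite algorithms for different study questions.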


Although this was a proof-of-concept study in which the results obtained were not intended as estimates of disease frequencies, limitations that might have biased the results and comparisons discussed in this manuscript must be acknowledged. In particular, formal validation of the retrieved cases against medical chart review was not performed, and important variables other than age were not considered for stratification of results. Finally, identifying T2DM with an algorithm that is not validated represents a limitation because the estimates of validity indices rely on the subjective judgment of the local expert. Nevertheless, in a multi-national, multi-data source study, in the absence of a previously validated algorithm, the local expert recommendation remained the best possible choice.


Through the identification of T2DM cases, this study demonstrates that the standard procedure for data derivation developed within the EMIF project represents a methodological advancement for the execution of multi-national, multi-data source studies. On the basis of a shared definition of any event of interest, the procedure assures interoperability of heterogeneous EHR systems and allows data-source tailored case-identification algorithms to be established in a standardized fashion. It provides sufficient information for contextualization and correct interpretation of study results, and generates transparent and reusable documentation of the entire data derivation process. Further studies are warranted to explore the validity of different components and composite algorithms as well as the heterogeneity of the population identified across data sources.

Supporting Information

S1 Fig. Comparison of results from individual component algorithms.


S1 Table. Mapping of local codes and free text keywords corresponding to the medical concepts embedded in the component algorithms adopted for type 2 diabetes identification.


Author Contributions

  1. Conceptualization: RG GR IL PR MJS.
  2. Data curation: GR RG PR.
  3. Formal analysis: RG GR.
  4. Investigation: RG IL PA RvW DA SR AP LP LT MAM PC PR.
  5. Methodology: RG GR IL PR.
  6. Software: RG PR.
  7. Supervision: RG.
  8. Visualization: GR RG.
  9. Writing – original draft: GR RG IL.
  10. Writing – review & editing: GR IL NS AKL PA PE RvW DA SR M-LT HA AP LP JC LT MAM RH PC FL MS JvdL MJS PR RG.


  1. Richesson RL, Horvath MM, Rusincovitch SA. Clinical research informatics and electronic health record data. Yearb Med Inform 2014;9(1):215–23.
  2. Coloma PM, Schuemie MJ, Trifiro G, Gini R, Herings R, Hippisley-Cox J, et al. Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project. Pharmacoepidemiol Drug Saf 2011 Jan;20(1):1–11. pmid:21182150
  3. Avillach P, Coloma PM, Gini R, Schuemie M, Mougin F, Dufour JC, et al. Harmonization process for the identification of medical events in eight European healthcare databases: the experience from the EU-ADR project. J Am Med Inform Assoc 2013 Jan 1;20(1):184–92. pmid:22955495
  4. Richesson RL, Rusincovitch SA, Wixted D, Batch BC, Feinglos MN, Miranda ML, et al. A comparison of phenotype definitions for diabetes mellitus. J Am Med Inform Assoc 2013 Dec;20(e2):e319–e326. pmid:24026307
  5. Morley KI, Wallace J, Denaxas SC, Hunter RJ, Patel RS, Perel P, et al. Defining disease phenotypes using national linked electronic health records: a case study of atrial fibrillation. PLoS One 2014;9(11):e110900. pmid:25369203
  6. Pathak J, Kho AN, Denny JC. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. J Am Med Inform Assoc 2013 Dec;20(e2):e206–e211. pmid:24302669
  7. Overby CL, Pathak J, Gottesman O, Haerian K, Perotte A, Murphy S, et al. A collaborative approach to developing an electronic health record phenotyping algorithm for drug-induced liver injury. J Am Med Inform Assoc 2013 Dec;20(e2):e243–e252. pmid:23837993
  8. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc 2013 Jan 1;20(1):117–21. pmid:22955496
  9. Valkhoff VE, Coloma PM, Masclee GM, Gini R, Innocenti F, Lapi F, et al. Validation study in four health-care databases: upper gastrointestinal bleeding misclassification affects precision but not magnitude of drug-related upper gastrointestinal bleeding risk. J Clin Epidemiol 2014 Aug;67(8):921–31. pmid:24794575
  10. Gini R, Schuemie M, Brown J, Ryan P. Data Extraction And Management In Networks Of Observational Health Care Databases For Scientific Research: A Comparison Among EU-ADR, OMOP, Mini-Sentinel And MATRICE Strategies. eGEMs 2016;4(1, Article 2).
  11. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Med 2015 Oct;12(10):e1001885. pmid:26440803
  12. Conway M, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, et al. Analyzing the heterogeneity and complexity of Electronic Health Record oriented phenotyping algorithms. AMIA Annu Symp Proc 2011;2011:274–83. pmid:22195079
  13. Trifiro G, Coloma PM, Rijnbeek PR, Romio S, Mosseveld B, Weibel D, et al. Combining multiple healthcare databases for postmarketing drug and vaccine safety surveillance: why and how? J Intern Med 2014 Jun;275(6):551–61. pmid:24635221
  14. Ryden L, Grant PJ, Anker SD, Berne C, Cosentino F, Danchin N, et al. ESC Guidelines on diabetes, pre-diabetes, and cardiovascular diseases developed in collaboration with the EASD. Eur Heart J 2013 Oct;34(39):3035–87. pmid:23996285
  15. Gini R, Francesconi P, Mazzaglia G, Cricelli I, Pasqua A, Gallina P, et al. Chronic disease prevalence from Italian administrative databases in the VALORE project: a validation through comparison of population estimates with general practice databases and national survey. BMC Public Health 2013;13:15. pmid:23297821
  16. Vlug AE, van der LJ, Mosseveld BM, van Wijk MA, van der Linden PD, Sturkenboom MC, et al. Postmarketing surveillance based on electronic patient records: the IPCI project. Methods Inf Med 1999 Dec;38(4–5):339–44. pmid:10805025
  17. Blak BT, Thompson M, Dattani H, Bourke A. Generalisability of The Health Improvement Network (THIN) database: demographics, chronic disease prevalence and mortality rates. Inform Prim Care 2011;19(4):251–5. pmid:22828580
  18. Nexo BA, Pedersen L, Sorensen HT, Koch-Henriksen N. Treatment of HIV and risk of multiple sclerosis. Epidemiology 2013 Mar;24(2):331–2.
  19. Johannesdottir SA, Horvath-Puho E, Ehrenstein V, Schmidt M, Pedersen L, Sorensen HT. Existing data sources for clinical epidemiology: The Danish National Database of Reimbursed Prescriptions. Clin Epidemiol 2012;4:303–13. pmid:23204870
  20. Herk-Sukel MP, Lemmens VE, Poll-Franse LV, Herings RM, Coebergh JW. Record linkage for pharmacoepidemiological studies in cancer patients. Pharmacoepidemiol Drug Saf 2012 Jan;21(1):94–103. pmid:21812067
  21. Mayer MA, Furlong LI, Torre P, Planas I, Cots F, Izquierdo E, et al. Reuse of EHRs to Support Clinical Research in a Hospital of Reference. Stud Health Technol Inform 2015;210:224–6. pmid:25991136
  22. Sancho JJ, Planas I, Domenech D, Martin-Baranera M, Palau J, Sanz F. IMASIS. A multicenter hospital information system—experience in Barcelona. Stud Health Technol Inform 1998;56:35–42. pmid:10351871
  23. Leitsalu L, Haller T, Esko T, Tammesoo ML, Alavere H, Snieder H, et al. Cohort Profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int J Epidemiol 2015 Aug;44(4):1137–47. pmid:24518929
  24. World Health Organization. Definition, diagnosis and classification of diabetes mellitus and its complications. Part 1: Diagnosis and classification of diabetes mellitus. Geneva; 1999. Report No.: WHO/NCD/NCS/99.2.
  25. Carstensen B, Kristensen JK, Ottosen P, Borch-Johnsen K. The Danish National Diabetes Register: trends in incidence, prevalence and mortality. Diabetologia 2008 Dec;51(12):2187–96. pmid:18815769
  26. Kho AN, Hayes MG, Rasmussen-Torvik L, Pacheco JA, Thompson WK, Armstrong LL, et al. Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J Am Med Inform Assoc 2012 Mar;19(2):212–8. pmid:22101970
  27. Gini R, Schuemie MJ, Mazzaglia G, Lapi F, Pasqua A, Dazzi P, et al. Automatic identification of stages of type 2 diabetes, hypertension, ischaemic heart disease and heart failure from Italian General Practioners' electronic medical records: a validation study [Abstract]. Pharmacoepidemiol Drug Saf Sep 2015;24:1–587 2015.
  28. Gini R, Schuemie MJ, Mazzaglia G, Lapi F, Francesconi P, Pasqua A, et al. Identifying chronic conditions from data sources with incomplete diagnostic information: the case of Italian administrative databases [Abstract]. Pharmacoepidemiol Drug Saf Sep 2015;24:1–587 2015.
  29. Prevalence estimates of diabetes, adults aged 20–79 years, 2011. OECD-iLibrary 2011. Available from: URL:
  30. Hernan MA. With great data comes great responsibility: publishing comparative effectiveness research in epidemiology. Epidemiology 2011 May;22(3):290–1. pmid:21464646