Term sets: A transparent and reproducible representation of clinical code sets

Objective: Clinical code sets are vital to research using routinely-collected electronic healthcare data. Existing code set engineering methods pose significant limitations when considering reproducible research. To improve the transparency and reusability of research, these code sets must abide by FAIR principles; this is not currently happening. We propose ‘term sets’, an equivalent alternative to code sets that are findable, accessible, interoperable and reusable.
Materials and methods: We describe a new code set representation, consisting of natural language inclusion and exclusion terms (term sets), and explain its relationship to code sets. We formally prove that any code set has a corresponding term set. We demonstrate utility by searching for recently published code sets, representing them as term sets, and reporting on the number of inclusion and exclusion terms compared with the size of the code set.
Results: Thirty-one code sets from 20 papers covering diverse disease domains were converted into term sets. The term sets were on average 74% the size of their equivalent original code set. Four term sets were larger due to deficiencies in the original code sets.
Discussion: Term sets can concisely represent any code set. This may reduce barriers for examining and reusing code sets, which may accelerate research using healthcare databases. We have developed open-source software that supports researchers using term sets.
Conclusion: Term sets are independent of clinical code terminologies and therefore: enable reproducible research; are resistant to terminology changes; and are less error-prone as they are shorter than the equivalent code set.



Introduction
Clinical code terminologies, such as SNOMED [1] and ICD [2], are dictionaries of terms that allow clinicians to record events in electronic health records (EHRs) using alpha-numeric codes rather than free text. This makes patient records more manageable for clinical care, and allows secondary uses of the data, such as researchers performing retrospective observational studies. Researchers construct clinical code sets [3][4][5] to represent the medical concepts they wish to investigate. This is a time-consuming activity, and prone to errors which can lead to biases in subsequent analyses [6]. Storing code sets in a format that facilitates validation, sharing and reuse is important, and called for frequently [7][8][9][10]. Code sets, also called code lists and value sets [3,9], range from one code to several thousand. The Value Set Authority Centre (VSAC) [11] provides a repository for code sets allowing their sharing and reuse. Their largest, for "Problem", contains 117,930 SNOMED codes. This code set is likely not useful, but there are several that are and that contain thousands of codes: Trauma (ICD-10) 18524, Fracture lower body (ICD-10) 5902, Infection (SNOMED) 4066 and Cancer (SNOMED) 3867. Verifying large code sets, by checking that all included codes are correct, and also that no codes are missing, is an enormous task and acts as a barrier to reuse [3]. Updating code sets as terminologies change over time, and sub-setting or extending code sets, are laborious and error-prone activities.
This is important because differences in code sets can cause large variations in findings. Rodriguez et al [12] found rheumatoid arthritis (RA) incidence to be 0.15 per 1000 person-years, while Watson et al [13], in the same database, found it to be 1.03 per 1000 person-years; a sevenfold difference. Another study [14] calculated the weekly incidence of infectious intestinal disease as: 8.3/100,000 if using the World Health Organisation's ICD-10 code set; 10.24/100,000 if using the Royal College of General Practitioners Research and Surveillance Centre's ICD-9 code set; and 17.93/100,000 if using the ontological definition on which the paper was based.
The FAIR principles [15] aim to improve the transparency and reusability of scientific data and the algorithms and tools for processing and curating that data. Clinical code sets are a key part of the research process and should abide by FAIR principles; they should be findable, accessible, interoperable and reusable. This is not currently the case. Almost all code sets are unpublished [4] and therefore not accessible. Those that are published, on dedicated repositories such as VSAC or clinicalcodes.org [16], are findable but reuse is a challenge. In theory, reuse is achieved by downloading the relevant code set and applying it to an EHR database. However, the task of checking the code set for errors involves reading the definition for each code to confirm that it is correctly in the set, and also speculatively searching the rest of the terminology for codes that may have been omitted. This is arguably as time-consuming as constructing the code set from scratch and is one of the current barriers to reuse. There is also currently no way to determine whether a missing code was omitted accidentally or deliberately; it is therefore impossible to tell whether a mistake was made or whether the code set definition contained a subtlety not otherwise described.

Objective
We propose a new representation of selection criteria for EHR-based studies, based on lists of inclusion and exclusion terms. We introduce a methodology for constructing code sets which takes advantage of this representation, and show that our method can represent any possible code set. In doing so it is typically more concise, and therefore more practical for other researchers to verify, validate and ultimately reuse with confidence.

Materials and methods
We introduce 'term sets' to define cohort selection criteria for EHR-based studies. A 'term set' consists of three parts: inclusion terms describing the feature of interest (e.g. 'stroke', 'heart failure'); exclusion terms describing things of no interest (e.g. 'family history', 'screening'); and the target clinical code terminology and version (e.g. terminology = SNOMED-CT, version = uk-edition-v20180401). A code set is created from a term set by searching the terminology for codes that contain inclusion terms but that don't contain exclusion terms.
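As a minimal sketch of how a term set yields a code set, assume that a term matches a definition when it is a case-insensitive substring (the actual search rules may differ) and that a terminology is a plain mapping from code to definition. The names `TermSet` and `build_code_set` and the toy codes are illustrative and are not taken from the published software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermSet:
    """Inclusion and exclusion terms plus the terminology they target."""
    inclusion: tuple[str, ...]
    exclusion: tuple[str, ...]
    terminology: str
    version: str


def matches(term: str, definition: str) -> bool:
    # Assumption: a term matches when it is a case-insensitive substring.
    return term.lower() in definition.lower()


def build_code_set(term_set: TermSet, terminology: dict[str, str]) -> set[str]:
    """Codes whose definition matches an inclusion term and no exclusion term."""
    return {
        code
        for code, definition in terminology.items()
        if any(matches(t, definition) for t in term_set.inclusion)
        and not any(matches(t, definition) for t in term_set.exclusion)
    }


# Toy terminology with made-up codes, for illustration only.
toy = {
    "X1": "Stroke",
    "X2": "Family history of stroke",
    "X3": "Heart failure",
    "X4": "Stroke screening",
    "X5": "Asthma",
}
stroke = TermSet(("stroke",), ("family history", "screening"), "toy", "v1")
print(sorted(build_code_set(stroke, toy)))  # ['X1']
```

Because the term set stores the terminology name and version alongside the terms, the same object can be re-run against a later version to regenerate the code set.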

Relationship between code sets and term sets
The traditional representations of cohort selection criteria are clinical code sets which are applied to EHR databases via a query language. Code sets are extensional; they enumerate every code in the set. Term sets by contrast are intensional; they provide necessary and sufficient conditions by which a code is a member of the set. When applied to a particular terminology and version, a term set uniquely defines a code set. For example, consider the phrase "countries of the world" which is intensional, as compared with a complete list of countries of the world which is extensional. The list of countries changes over time, but at any point the extensional list can be derived from the intensional definition. Similarly, the extensional code set can be derived from the intensional term set.

Procedure for constructing term sets
Our method to construct a term set:
1. Select a clinical code terminology.
2. Decide upon one or more inclusion terms, e.g. 'heart failure'.
3. Perform a search within the terminology for codes with a definition matching the inclusion terms. The search rules are described below.
4. Optionally exclude matching definitions by adding exclusion terms. E.g. for 'stroke', it would make sense to exclude the term 'family history'.
5. For hierarchical code terminologies, return codes that are descendants of matching codes, with definitions that do not contain an inclusion term. Add inclusion or exclusion terms to explicitly include or exclude these descendant codes.
6. Iterate until all inclusion terms have been added, and there are no unmatched descendants.
Deciding upon inclusion and exclusion terms is often a complex task requiring medical expertise. Therefore when implementing this method a clinician would need to be involved, or at the very least an expert in the particular disease domain. However for now we concentrate on the method itself, rather than its implementation. A worked example for the method can be found in S2 Appendix.
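Steps 5 and 6 of the procedure can be sketched as follows. The sketch assumes a prefix-based hierarchy (a descendant's code starts with its ancestor's code) and case-insensitive substring matching; the function names and the toy codes are hypothetical.

```python
def matches_any(terms, definition):
    """True if any term is a case-insensitive substring of the definition."""
    text = definition.lower()
    return any(t.lower() in text for t in terms)


def unmatched_descendants(terminology, inclusion, exclusion):
    """Descendants of matched codes that no inclusion or exclusion term covers yet."""
    matched = {
        code
        for code, definition in terminology.items()
        if matches_any(inclusion, definition) and not matches_any(exclusion, definition)
    }
    pending = set()
    for code, definition in terminology.items():
        if code in matched:
            continue
        # Assumption: a descendant's code starts with its ancestor's code.
        has_matched_ancestor = any(code.startswith(m) for m in matched)
        decided = matches_any(inclusion, definition) or matches_any(exclusion, definition)
        if has_matched_ancestor and not decided:
            pending.add(code)
    return pending


# Toy terminology with made-up codes, for illustration only.
toy = {
    "G5": "Heart failure",
    "G51": "Congestive cardiac failure",
    "G52": "Heart failure screening",
    "H1": "Asthma",
}
print(sorted(unmatched_descendants(toy, ["heart failure"], ["screening"])))
# ['G51']
print(sorted(unmatched_descendants(toy, ["heart failure", "cardiac failure"], ["screening"])))
# []
```

The loop in step 6 ends when this function returns an empty set, meaning every descendant has been explicitly included or excluded by a term.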

Proof that any code set can be represented as a term set
This ensures that our method can actually be used in practice for all code sets.
Clinical code terminology. A clinical code terminology $T = (C, D, f)$ is a set of codes $C$, a set of definitions $D$, and a mapping function $f: C \to D$ that links each code $c \in C$ with a set of one or more definitions $d \in D$. Examples for Snomed CT, Read v2 and ICD-10 would be:
The mapping function is surjective; each element of $D$ is mapped to by at least one element of $C$. The inverse function $f^{-1}: D \to C$ therefore exists for all definitions in $D$ and is defined such
Matching definition set. For a set of word sequences $W = \{w_1, \ldots, w_m\}$ and a terminology $T = (C, D, f)$ we define the matching definition set $MD(T, W)$ as the set of all definitions $d \in D$ where $w_i$ matches $d$.
Matching definition set with exclusions. Given two sets of word sequences $W$, $E$ and a terminology $T = (C, D, f)$ we define the matching definition set with exclusions $MDE(T, W, E)$ as the set of all definitions $d \in D$ where $w_i$ matches $d$ and $e_j$ does not match $d$.
Matching concept set. For a terminology $T = (C, D, f)$, and two sets of word sequences $W$, $E$, we define the matching concept set $M(T, W, E)$ as all codes in the terminology whose definition matches $W$. Alternatively:
Proposal. Any subset of clinical codes from a terminology can be represented by a set of inclusion terms and a set of exclusion terms. Formally, for terminology $T = (C, D, f)$ and any $X = \{x_1, x_2, \ldots, x_n\}$, a subset of $C$, there exists a set of inclusion word sequences $I = \{i_1, i_2, \ldots, i_r\}$ and a set of exclusion word sequences $E = \{e_1, e_2, \ldots, e_s\}$ such that
$$M(T, I, E) = X$$
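One way to read the three definitions in set-builder notation is sketched below. This is an interpretation, not the authors' formal statement: it assumes that a definition is matched when at least one inclusion sequence matches it, that it is removed when any exclusion sequence matches it, and that a code is selected when at least one of its definitions survives. The authoritative definitions are those in S1 Appendix.

```latex
\begin{align*}
MD(T, W) &= \{\, d \in D : \exists\, w \in W,\ w \text{ matches } d \,\} \\
MDE(T, W, E) &= MD(T, W) \setminus MD(T, E) \\
M(T, W, E) &= \{\, c \in C : f(c) \cap MDE(T, W, E) \neq \emptyset \,\}
\end{align*}
```

Under this reading, the claim is that for every $X \subseteq C$ there exist $I$ and $E$ with $M(T, I, E) = X$.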

Term set software
We have developed a web application (https://getset.herokuapp.com) that implements the above methods and allows users to create and verify term sets. The tool is currently implemented for Read v2 codes [17] which are used in UK general practice, however it is straightforward to extend to other hierarchical terminologies like ICD or SNOMED. Once created, term sets can be automatically verified and then shared via GitHub (https://github.com/). Users are encouraged to add their name, a short title and description, so that researchers reusing their set can easily determine their intent.

Empirical study
The proof above demonstrates "completeness"; any code set can be represented as a term set. We also wished to demonstrate "efficiency": a term set is shorter than the equivalent code set and is therefore easier and quicker to check. We therefore conducted an empirical study which found published clinical code sets, created their equivalent term set representations, and reported on their relative sizes. GetSet is currently configured with Read v2, therefore we searched PubMed for papers using the Clinical Practice Research Datalink (CPRD) [18]; a large primary care database containing Read v2 codes with hundreds of publications annually. We used the search term ("CPRD"[all fields] or "Clinical Practice Research Datalink"[all fields]) and sorted the results by date descending. Reviewing recent papers ensured we could demonstrate that our method is valid for the current state of the art in clinical code set engineering.
We reviewed each paper in turn and included those that required the construction of code sets to define a cohort of patients. Cohort definition is the focal point of each paper, so its code sets are the ones most likely to appear. Also, by focussing on cohort definition, we avoided over-representation from papers with numerous code sets.
For each paper reviewed we extracted any code sets that described a patient cohort for a condition/diagnosis that had not been previously included. Certain conditions will likely be studied more frequently than others; restricting ourselves to one code set per condition ensured we had a sufficient variety of diseases.
We continued to review papers until code sets were discovered from 20 distinct papers. This ensured we would find 20 code sets for a variety of diagnoses and from a variety of authors.
We then created term set representations for each code set, using the above method, with the following caveats:
• Any 'medcodes' (CPRD's code dictionary) were first converted to Read v2 codes.
• We removed all codes except Read v2 (e.g. CPRD also contains Oxmis codes, which were in use pre-2000, and CTV3 codes).
• Where multiple codes have identical definitions, and the code set has included some but not all, we extended the code set to include them all.
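The last caveat can be sketched as a small helper. The function name is hypothetical; the two rheumatoid nodule codes are the example reported in the Results, and the third code is made up to show that unrelated codes are left alone.

```python
def extend_with_identical_definitions(code_set, terminology):
    """Add every code whose definition equals that of a code already in the set."""
    wanted = {terminology[code] for code in code_set if code in terminology}
    twins = {code for code, definition in terminology.items() if definition in wanted}
    return set(code_set) | twins


terminology = {
    "N040R00": "Rheumatoid nodule",
    "N042200": "Rheumatoid nodule",
    "Z9": "Unrelated toy definition",  # made-up code for illustration
}
print(sorted(extend_with_identical_definitions({"N040R00"}, terminology)))
# ['N040R00', 'N042200']
```

A step like this is needed because a term-based search cannot tell apart two codes whose definitions are word-for-word identical.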
For each code set we reported on the code set size and compared this with the number of inclusion and exclusion terms in our equivalent representation.
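The comparison can be expressed as a simple ratio. The sketch below uses hypothetical sizes rather than values from Table 1, and it assumes that an average is taken as the mean of the per-set proportions, which is one of several reasonable choices.

```python
from statistics import mean


def term_set_proportion(n_inclusion, n_exclusion, code_set_size):
    """Total number of terms as a fraction of the number of codes."""
    if code_set_size <= 0:
        raise ValueError("code_set_size must be positive")
    return (n_inclusion + n_exclusion) / code_set_size


# Hypothetical (inclusion terms, exclusion terms, codes) triples.
examples = [(3, 2, 20), (1, 0, 1), (10, 15, 20)]
proportions = [term_set_proportion(*e) for e in examples]
print([f"{p:.0%}" for p in proportions])  # ['25%', '100%', '125%']

# Assumption: the average is the mean of the per-set proportions.
mean_proportion = mean(proportions)
print(f"{mean_proportion:.0%}")  # 83%
```

A proportion above 100% flags a term set that is longer than the code set it represents.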

Results
The PubMed search was executed on 17th January 2018 by the lead author and returned 809 papers. The target of code sets from 20 distinct papers was reached after reviewing 45 papers; no further papers were reviewed. The 20 papers consisted of: 18 which included their code set in the paper, as a supplement, or in an online repository; 1 with code sets available on request so they were requested and received; and 1 that referenced code sets from another paper so this was retrieved to obtain the code sets. A total of 31 code sets for cohort definitions were found in the 20 papers. For further detail see: https://doi.org/10.5281/zenodo.1316984.
The median number of codes in each code set was 48 (IQR [18,120]). The smallest code set was for Stevens-Johnson syndrome and contained 1 code, while the largest code set, for infections that could lead to a potential hospitalization, contained 3,219 codes.
Each code set was successfully converted into a term set using our previously described procedure. The term sets are available at https://doi.org/10.5281/zenodo.1316984. The full list of code set definitions, their sizes, and the equivalent term set sizes are in Table 1. Nine code sets omitted codes with definitions identical to an included code and so these codes were added prior to the conversion process. As an example, the code set for rheumatoid arthritis included the code "N040R00: Rheumatoid nodule", but did not include the code "N042200: Rheumatoid nodule", therefore N042200 was added prior to the conversion to a term set. The full list of extra codes for these nine code sets is available in S1 Table.

Table 1. Code set descriptions and sizes, the size of the related inclusion/exclusion term sets, and the inclusion/exclusion term sizes as proportions of the original code set size. Proportions ≥ 100% are displayed in bold.

The total size of the term sets was on average 74% of the size of the code sets. In four code sets the total number of inclusion and exclusion terms exceeded the size of the code set: marital status, cohabitation, residence and heart failure. The code sets for marital status and cohabitation both use the code "1331.00: Single". The inclusion term "single" matches many unrelated codes therefore many exclusion terms are needed. The code sets for residence and heart failure were perhaps poorly defined by the original authors. The residence code set aims to include codes that describe a person's residential status and includes such wonderful terms as "Fall from cliff, occurrence in residential institution" and "Bitten by crocodile, occurrence in residential institution", but then doesn't include the terms "Prolonged stay in weightless environment, occurrence in residential institution" or "Victim of avalanche, occurrence in residential institution". In order to represent this precisely with a term set we needed to include a large number of unnecessary exclusion terms. Finally the heart failure code set includes some, but not all, cardiomyopathy codes. There is no clinical reason for this and the number of inclusion terms would reduce if "cardiomyopathy" could be included, as opposed to the current situation where the exact definition of 15 cardiomyopathy codes must be included.

Discussion
We have developed a method for creating clinical code sets that incorporates metadata on how the code set was created. We have demonstrated with a formal proof that our method works for any code set, and have shown empirically that the lists of inclusion and exclusion terms are on average shorter than the list of codes themselves.
A recent HL7 initiative provides a method for defining intensional value sets (code sets) [38]. Using this method a researcher can define a set of rules which when applied to a terminology generate a code set. However this does not give the creator of the code set any support, methodology or tools for how to create the rules for the intensional definition. In a similar way, Reference Sets [39] within SNOMED can be used to specify a subset of concepts for use in a particular application, but without creation support. Reference sets are also specific to SNOMED. Our approach provides a generalizable methodology and software tool which are used to build term sets and their associated code sets. Integration of the approaches could be achieved if term sets created with our software were exportable to the HL7 definition of an intensional value set. This would then provide a robust and transparent code set creation process, along with a precise, formal definition.
There are at least four existing tools and associated methodologies for constructing clinical code sets. Davé and Petersen [40] created code sets by searching for synonymous terms and browsing the hierarchy. The final Stata script can be shared so that the process can be scrutinized. Others have developed R/Stata scripts: pcdsearch [41] and CALIBERcodelists [42,43]. These scripts reuse the ideas of Davé and Petersen, while allowing more complex queries using Boolean operators and regular expressions. Recently Watson et al. [5] presented a three-stage process: defining the clinical concept a priori with clinician assistance; searching a clinical terminology using R or Stata to create an initial code set; and producing a final code set via a Delphi exercise with at least two GPs (the main difference to previous approaches).
Our approach builds on the strengths of these methods while addressing certain limitations. Each method above has a way of excluding codes; typically by specifying the codes themselves. By using exclusion terms, we produce metadata that is uncoupled from particular terminologies and is more readable to reviewers of the code set. The output of the above methods is always a script (Stata or R). By not tying our method to a particular scripting language, and using a simple web application, we reduce the barriers to the methodical creation, inspection and reuse of code sets. Allowing regular expressions may help the code set creator, however it will likely act as a further barrier to reuse if the expressions get overcomplicated or if the next researcher is unfamiliar with regular expressions. We have kept our search strategy as simple as possible to mitigate this problem.
Although some of the reviewed code sets may have used one of the above methods, none made available the scripts used to create them. It is probably a safe assumption that this is true for the majority of code sets. The problem, for researchers reusing the code set, is that it is unknown which codes are missing and whether they were omitted deliberately or accidentally. Using our methodology these decisions become explicit. A future researcher may disagree with a decision, but at least it is available for scrutiny, and they can reuse the generated code set by tweaking the definition rather than starting from scratch.
Clinician involvement in code set development is critical, but precisely how research groups incorporate our methodology into their working practices is an open question. One option would be to use the three-stage process from Watson et al. with steps one and two (synonym definition and code set creation) facilitated with our tool.
We found examples where definitions only make sense when considered in the context of the hierarchy. E.g. the term "single" could be a numerical descriptor or a marital status. Our search strategy could be extended to examine the definitions of each code's ancestors. A search for "marital status single" would then return the code with the definition "single" only if it had ancestors that contained the words "marital" and "status". This would alleviate the problem where inclusion terms with low specificity ("single" as a marital status, "white" as an ethnicity) lead to large numbers of exclusion terms.
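The proposed extension could look something like the sketch below. It is not part of the current tool: it assumes a prefix-based hierarchy, whole-word matching, and a made-up toy terminology, and the function names are hypothetical.

```python
def ancestors(code):
    """Assumption: prefix hierarchy, so every proper prefix is an ancestor code."""
    return [code[:i] for i in range(1, len(code))]


def context_search(query, terminology):
    """Codes whose own words plus ancestor words cover every query word,
    with at least one query word in the code's own definition."""
    words = query.lower().split()
    hits = set()
    for code, definition in terminology.items():
        own = set(definition.lower().split())
        context = set(own)
        for ancestor in ancestors(code):
            context |= set(terminology.get(ancestor, "").lower().split())
        if all(w in context for w in words) and any(w in own for w in words):
            hits.add(code)
    return hits


# Toy terminology with made-up codes, for illustration only.
toy = {
    "M": "Marital status",
    "M1": "Single",
    "M2": "Married",
    "P": "Pregnancy",
    "P1": "Single live birth",
}
print(sorted(context_search("marital status single", toy)))  # ['M1']
```

Here "Single live birth" is not returned even though it contains "single", because none of its ancestors supply the words "marital" and "status"; no exclusion term is needed.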
The Read dictionary has a prefix-based hierarchy (G30's parent is G3, G3's parent is G). Two of the code sets we analysed (Dementia and potential hospitalized infections) used wildcards to represent multiple codes, e.g. "A*" to represent "A...." and all of its descendants. This leads to shorter code sets, which are easier to interpret, however it is problematic for two reasons. Firstly, when a code is included in a set it is not necessary that all descendants should also be included, and simply using a wildcard gives no guarantees that the researcher has inspected and accepted each code. Secondly, as the actual codes used in the analysis are not explicitly provided, it is impossible to determine which codes were actually used because code dictionaries change over time, with codes added and removed. Our methodology, which encourages users to specify inclusion (or exclusion) terms to match all descendants of included codes, leads to more complete synonym lists and gives extra confidence to researchers reusing the code set.
Various problems were identified in the code sets (examples in Table 2). They fall into three categories: codes are included which do not correspond to the code set description; codes are omitted when they are obviously part of the code set; and some included and omitted codes are contradictory and should either all be included or all omitted. As we aimed to reproduce the code sets exactly, we have invariably created term sets with more inclusion and exclusion terms than are strictly necessary. By correcting the four code sets which had larger associated term sets we saw the average term set to code set proportion fall from 118.5% to 77.3%; all four term sets are now smaller than the code sets. For code sets constructed from scratch using our tool we would expect the number of inclusion and exclusion terms to be further reduced.
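The second problem can be illustrated with a short sketch. It assumes a wildcard written as a trailing asterisk and two made-up dictionary versions; the function name is hypothetical.

```python
def expand_wildcard(pattern, codes):
    """Expand a trailing-'*' pattern into the explicit codes of one dictionary version."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(code for code in codes if code.startswith(prefix))
    return [pattern] if pattern in codes else []


# Made-up dictionary versions: the later one adds a new descendant of "A".
version_1 = {"A", "A1", "A2", "B"}
version_2 = version_1 | {"A3"}

print(expand_wildcard("A*", version_1))  # ['A', 'A1', 'A2']
print(expand_wildcard("A*", version_2))  # ['A', 'A1', 'A2', 'A3']
```

The same wildcard yields different codes depending on the dictionary version, so a published wildcard alone does not record which codes were used in the analysis.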
There are reasons why published code sets have omissions that aren't necessarily errors. A researcher might justifiably decide that it is more important to capture a short list of codes which occur most frequently in their dataset than to focus on codes that occur infrequently or not at all. This may be true for their own research, but for other researchers wanting to reuse their code sets on different data sources it is not good enough. The burden of large code sets might have encouraged researchers to keep their code sets short, but with our methodology this is no longer a restriction, as validation can be performed on the shorter term sets rather than the code sets.
Another valid reason for omissions is that code dictionaries change over time so it is possible that codes recently added to a terminology do not appear in a code set. This becomes a question of how to best keep code sets updated over time, and our approach provides a simple way to do this. Previously when updating a code set a researcher, who hadn't kept records of their search strategy from several years before, may end up recreating the code set. Now with the inclusion and exclusion terms captured and stored alongside the code set, one simply executes the term set definition against the updated code dictionary to see what additional codes may or may not need to be included.
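The update workflow can be sketched as a comparison between the stored code set and the result of re-running the term set. The sketch assumes case-insensitive substring matching and uses made-up codes; the function names are hypothetical.

```python
def run_term_set(inclusion, exclusion, terminology):
    """Codes whose definition matches an inclusion term and no exclusion term."""
    def hit(terms, text):
        return any(t.lower() in text.lower() for t in terms)

    return {
        code
        for code, definition in terminology.items()
        if hit(inclusion, definition) and not hit(exclusion, definition)
    }


def codes_to_review(inclusion, exclusion, stored_code_set, updated_terminology):
    """Compare a stored code set with a fresh run on the updated dictionary."""
    fresh = run_term_set(inclusion, exclusion, updated_terminology)
    return {
        "added": sorted(fresh - set(stored_code_set)),
        "no_longer_matched": sorted(set(stored_code_set) - fresh),
    }


# Made-up updated dictionary: "K3" did not exist when the code set was stored.
updated = {
    "K1": "Asthma",
    "K2": "Family history of asthma",
    "K3": "Exercise induced asthma",
}
print(codes_to_review(["asthma"], ["family history"], {"K1"}, updated))
# {'added': ['K3'], 'no_longer_matched': []}
```

Only the codes listed under "added" need a decision from the researcher, which is far less work than rebuilding the code set.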
We have demonstrated our method using Read codes, however the only precondition is that a terminology maps codes to definitions in a hierarchy, so our method would easily transfer to other terminologies such as SNOMED and ICD. One interesting avenue for further investigation is whether code sets can be translated into different terminologies. Once a researcher has defined a code set for one terminology, they could use the web tool to switch to a second terminology and automatically apply the same inclusion and exclusion terms to define a code set for that terminology. This would be useful for researchers using UK primary care data which is migrating from Read to SNOMED.

Strengths
We have shown that our method works formally via the proof and empirically via the code set mapping exercise. Using recent code sets from a variety of authors and for a variety of conditions demonstrates the generalisability of our technique. We have built upon the ideas from existing tools and methodologies as well as the recommendations from our earlier review [3].

Limitations
The search for papers was performed by a single author, however given the transparency of the search strategy the biggest risk is that a paper containing a code set has been incorrectly rejected. This would presumably be a random bias and not affect the results. The list of papers reviewed is also available for inspection at https://doi.org/10.5281/zenodo.1316984. The decision to select code sets for the cohort definition, rather than for the outcomes or the confounders, could have affected the results. However we found code sets for a wide variety of conditions and had few problems converting them into our format, so consider it likely that this would extend to other conditions. Code sets can be represented in multiple ways, some of which will be easier to understand than others. Some researchers may therefore be able to produce 'better' term sets. This can also be seen as a strength, as researchers are more likely to use term sets that are more clearly defined, so these term sets will prevail at the expense of those that are harder to understand.
There may be occasions where it is unclear if a code should be included or not, for example if clinicians use the code in different ways. At present one solution is to create two or more term sets that either include or exclude the uncertain codes. These term sets would have slightly different inclusion and exclusion lists, and their associated description would highlight how sensitive or specific the term set was.
Finally, although largely terminology agnostic, on occasion the particular inclusion and exclusion terms are loosely tied to the terminology used. One extreme example in Read v2 is for the term "G21z00: . . .without congestive cardic failure" which misspells the word "cardiac". When selecting this code you would need an inclusion term of "cardic failure" which could be confusing and is unlikely to work in other terminologies. This is, however, an infrequent occurrence.

Conclusion
We have developed a new representation of cohort selection criteria for EHR-based studies, a term set, which consists of: inclusion and exclusion terms; and a clinical code terminology and version. We have described a method to create term sets and developed an open-source web application that implements this procedure. We have shown that our representation is as expressive as clinical code sets, but more efficient. Finally, term sets are easier to share, inspect, and reuse, because they are independent of specific (versions of) clinical terminologies. We expect that this will benefit transparent and reproducible research with EHR data.

Supporting information
S1 Appendix. Full proof and definitions. Full formal proof and all definitions for the claim that a term set can represent any code set. (DOCX)
S1