rEHR: An R package for manipulating and analysing Electronic Health Record data

Research with structured Electronic Health Records (EHRs) is expanding as data becomes more accessible; analytic methods advance; and the scientific validity of such studies is increasingly accepted. However, data science methodology to enable the rapid searching/extraction, cleaning and analysis of these large, often complex, datasets is less well developed. In addition, commonly used software is inadequate, resulting in bottlenecks in research workflows and in obstacles to increased transparency and reproducibility of the research. Preparing a research-ready dataset from EHRs is a complex and time consuming task requiring substantial data science skills, even for simple designs. In addition, certain aspects of the workflow are computationally intensive, for example extraction of longitudinal data and matching controls to a large cohort, which may take days or even weeks to run using standard software. The rEHR package simplifies and accelerates the process of extracting ready-for-analysis datasets from EHR databases. It has a simple import function to a database backend that greatly accelerates data access times. A set of generic query functions allow users to extract data efficiently without needing detailed knowledge of SQL queries. Longitudinal data extractions can also be made in a single command, making use of parallel processing. The package also contains functions for cutting data by time-varying covariates, matching controls to cases, unit conversion and construction of clinical code lists. There are also functions to synthesise dummy EHR. The package has been tested with one for the largest primary care EHRs, the Clinical Practice Research Datalink (CPRD), but allows for a common interface to other EHRs. This simplified and accelerated work flow for EHR data extraction results in simpler, cleaner scripts that are more easily debugged, shared and reproduced.


Introduction
'how-to' papers exist for good practice in observational data management but they address only some of the issues or focus on specific applications [5; 10; 11; 12]. At the same time there is a wealth of health informatics and computer science literature on how to make these research processes more transparent, reducing the duplication of effort and improving the consistency of data processing [13; 14]. Finally, several software packages exist for speeding up data analysis, but these are generic, do not apply directly to EHR manipulation and may require specialist knowledge to effectively use for fast manipulation of dataframes [15], for database integration [16], and for parallel processing (parallel in base R).
rEHR simplifies and accelerates the process of extracting ready-for-analysis datasets from EHR databases. In section 2 we provide instructions on loading the software and importing flat text files of the kind supplied by EHR providers into a local SQL database. In section 3 we describe the basic query operations provided by the package, the building of longitudinal data and calculation of prevalence and incidence statistics. In section 4 we convert the longitudinal data from the previous section to a cohort dataset suitable for survival analysis and illustrate algorithms to match controls to cases and to cut cohort data by time-varying covariates. In section 5 we briefly discuss some accessory functions provided in the package. In the final section we discuss the .ehr environment used to define the EHR database being used and how this can be set to work with different databases.
The package includes a number of simulated flat files to allow users to familiarise themselves with advanced aspects, which we use in this paper to provide examples.

Importing EHR data
rEHR is installed and loaded in the usual way: if(! "rEHR" %in% rownames(installed:packages())) install:packages("rEHR") library (rEHR) The development version of the package is available from Github and is accessible via the devtools package [17]: library(devtools) install À github("rOpenHealth=rEHR") library (rEHR) EHR data are stored as relational databases but are most commonly made available to researchers in the form of flat text files. This has the advantage of easier access for simple tasks and, for example, viewing the files in a spreadsheet. However, most non-trivial operations require researchers to iterate over a series of (potentially large) different groups of files. For example here we present pseudocode for a simple workflow leading to the production of a dataset of prevalent cases for a condition such as diabetes: Each level of iteration (represented by the nested for loops) and each type of file (e.g. clinical, referral, drugs etc.) in the above algorithm introduces the opportunity for bugs to creep into extraction code, while the repeated opening and closing of multiple text files, combined with the inherent inefficiency of for loops in R often result in slow, error prone code. The rEHR package allows researchers to first automatically import these flat files into a SQLite database and then use predefined functions to query this database efficiently and precisely. We use SQLite databases for a variety of reasons: • SQLite databases are stored as files in the directory system of the computer and require no installation setup. SQLite3 is installed automatically as a result of installing the dependencies for the package • SQLite files are stored efficiently and are relatively small compared to text files • The SQL language has been optimised for very rapid and efficient queries of SQLite files, resulting in much faster queries than would be available to multiple flat files • Working with SQLite databases allows users to use some very well developed tools that are already available to the R community such as sqldf [16] and RSQLite [18] if they are familiar with R SQL integration tools. These tools also allow for more specific tool functions to be built to shield users from the complexities of SQL queries.

Selecting all events
Once EHR data has been imported to the database, the rEHR package has a number of flexible built-in querying functions for extracting data. These functions are much faster to execute and less error prone than having to loop through hundreds of text files.
The primary generic query function is select_events() and is able to select all the events in a database table matching a provided where argument. This function is also called by the other more specific query functions. An example set of lists of clinical codes for a number of medical conditions is provided with the package (data(clinical_codes)). select_events() returns a dataframe of extracted data. This collection of disease specific code lists stems from our previous work and are reposited in www.clinicalcodes.org [9]. However, code lists are dynamic and context specific and researchers will very likely need to consider strategies to develop their own code lists, if existing code lists are considered inadequate [19].
diabetes_codes <-clinical_codes[clinical_codes$list == "Diabetes",] select À events(db, tab ¼ "Clinical", columns ¼ c("patid", "eventdate", "medcode"), The tab argument is used to select the file type (Clinical, Consultation, Patient, Practice or Referral in the previous code example), while the columns argument selects variables from these files. The where argument is equivalent to the WHERE clause in SQL, in that it is used to select subsets of the data table. The user must supply a string representation of valid R code, which is then translated to SQL via the dplyr::translate_sql_ function. There are two important caveats to this: 1. If an element of the clause represents an R object to be accessed (such as the elements of a vector) it must be wrapped in a .() (See the example above). String elements wrapped in .() are processed by the expand_string function before being passed to dplyr::translate_sql_.
2. Dates should separately quoted and entered in ISO format ('%Y-%m-%d'). This is because dates are stored as ISO text in the database, not as r Date types.
If the argument sql_only == TRUE, the function only generates the SQL needed for the query, rather than running the query itself. In this way, select_events can be used as the base for more complex query functions. The results of this function can also then be passed to temp_table() to create temporary tables where it is not desirable to keep large query results in RAM. For example: Asthma_codes <-clinical_codes[clinical_codes$list == "Asthma",] q <-select À events(db, tab ¼ "Clinical", columns ¼ c("patid", "eventdate" , "medcode"), where ¼ "medcode %in% :ðAsthma À codes$medcodeÞ", sql À only ¼ TRUE) temp À table(db, tab À name ¼ "Asthma", select À query ¼ q) ## Temporary There are two methods for including R objects in raw SQL strings. First, wrapping the string in a call to expand_string() allows for the .() notation to be used as in where arguments to select_events() based functions. Alternatively, a helper function, wrap_sql_query() is provided that functions in a similar way to base::sprintf but formats objects according to SQL syntax. If the result of evaluating the argument is a vector of length 1, it is inserted as is; if it is a vector of length > 1, it is wrapped in parentheses and comma separated.

Querying longitudinal data with select_by_year()
Researchers will often want to extract data over a range of different time-points, for example they may want to calculate the prevalence of a condition and how this changes through time.
When working with flat text files, this must be done with a complex nested loop that is both slow and error-prone. The select_by_year() function provides a simple interface to extract longitudinal data. On posix-compliant computers (Linux, BSD, Mac), this function can make use of parallel processes to select data for different years concurrently, greatly accelerating the extraction process on multicore machines. The function runs a series of selects over a year range and collects in a list of dataframes. The function applies a database select over a range of years and outputs as a list or a dataframe. Either a database object or a path to a database file can be supplied. If multiple cores are being used (i.e. cores > 1), a path to a database file must be used because the same database connection cannot be used across threads. In this case, a new database connection is made with every fork. Note that when working with temporary tables, cores must be set to 1 and the open database connection must be set with db. This is because the use of parallel:: mclapply means that new database connections need to be started for each fork and temporary files are only available inside the same connection.
Queries can be made against multiple tables, assuming that the columns being extracted are present in all tables. The columns argument is a character vector of column names to be selected. The individual elements can be of arbitrary length. This means it is possible to insert SQL clauses e.g. "DISTINCT patid".
A numeric vector of years is passed to the year_range argument to specify the years to select data for. Selection is done according to the function passed to the selector_fn argument. select_events is the default but first_events and last_events can also be used, as well as custom selection functions. The where argument works in the same way as in select_events except that year-start and year-end criteria can be added as 'STARTDATE' and 'ENDDATE'. These are translated to the correct year-start and end dates. Different start and end dates can be specified by supplying a function to the year_fn argument. This function must accept a single year argument and return a list with two elements-"startdate" and "enddate", each of which must be date characters in posix format (i.e. "%Y-%m-%d"). Three functions are provided to define years (standard_years for 1st January to 31st December, qof_years for UK financial years as used in the UK Quality and Outcomes Framework [20], and qof_15_months for the period starting 1st January in the year in question and finishing on the 31st March the following year) and a convenience function, build_date_fn () is provided to which users can supply lists of year offsets, months and days for year-start and end to return a function that can be supplied as the year_fn argument. Finally the user can set the as_list argument to determine whether data from each year is returned as a separate list element or as a single data frame.

Selecting prevalent and incident events.
To show the utility of the package we demonstrate how one might extract an incident and prevalent cohort of diabetes patients from the simulated example data. Prevalent events for a chronic condition are selected by the earliest diagnostic event prior to the end of the time period in question. The denominator for the calculation of the prevalence is the total number of patients registered at that time point.
# Select all patients with current registration date ðcrdÞ < the start # date for each year: registered_patients<-select À by À year(db ¼ db, tables ¼ "patient", columns ¼ c("patid", "practid", "gender", "yob", "crd", "tod", "deathdate"),  Notice that select_by_year returns a dataframe in long form, with a year column for the longitudinal component. Next we collect the incident cases, which are those patients with first diagnoses at any point before the end of the year in question, plus the dates for the first diagnoses. In this case we include events matching our list of diabetes clinical codes in either clinical or referral files. Because we only want the first diagnosis dates we set the selector_fn argument to first_events: Note that in this case extra columns have been added for both year and table, to identify the table the event was found in. Because events were taken from more than one table (Clinical and Referrals), the incident_cases dataframe should be sorted and duplicates removed to ensure that only the first events are kept. The two datasets are then merged to give the dataset from which the denominators and numerators can be calculated. The dplyr package is imported to the namespace when the rEHR package is loaded. This simplifies and accelerates merging operations, using left_join from the dplyr package in the example below, and is an important part of the rEHR workflow: ## All patients are kept (equivalent to merge(all.x = TRUE)) prevalence_dat <-left À join(registered_patients, incident_cases) ## Remove duplicates across clinical and referral tables: incident_cases %>% group À by(patid, year) %>% arrange(eventdate) %>% distinct() %>% ungroup -> incident_cases Prevalence and incidence can be calculated by the built-in functions prev_terms() and prev_totals(). prev_terms() adds logical columns for membership of incidence and prevalence denominators as well as a column for the contribution of the individual to that year's followup time. prev_totals() summarises this information to calculate the denominators and numerators for prevalence and incidence, according to the users' grouping factors. The criteria for membership of the incidence and prevalence numerators and denominators as well as for followup time are shown in Table 1, where event date is the date the event of interest occurs, transfer out date is the date the patient (may have) exited the practice or the database, and year start date is either 1st of Jan (calendar) or 1st of April (financial).
An example in the use of these functions is provided below: prevalence_dat <-prev À terms(prevalence_dat) totals <-prev À totals(prevalence_dat) totals$prevalence$year_counts Here we see that, in our simulated dataset, we have a diabetes prevalence of 17.7% in 2008 raising to 28.7% in 2012 and an incidence of 2.8% in 2008 increasing to 3.6% in 2012.

Building cohorts, matching and time-varying covariates
In this section we demonstrate how to convert the longitudinal data from the previous section to a cohort dataset suitable for survival analysis and also illustrate algorithms to match controls to cases and to cut cohort data by time-varying covariates.
One of the most common uses of EHR data in research is to build cohorts for survival analyses. The longitudinal data in the previous section is easily converted to survival cohort format using the build_cohort() function. This returns a dataset with a single row for each patient and includes only patients in the numerator or denominator for whichever cohort type is chosen (either incident or prevalent cohorts). Columns are added for start and end dates and for start and end times as integer differences from the cohort start date. A binary column is added to indicate membership of the case group. All patients with start dates greater than their end dates are removed from the dataset. The diagnosis_start argument is used to include the diagnosis date in the definition of the start dates for the patients. If it is not required for the diagnosis date to be included in the start date definition, this argument can be set to NULL.

Matching
Matching cases to controls is an important pre-analysis step. The rEHR package provides three methods for matching cases to controls: 1. Incidence density matching (IDM) 2. Exact matching 3. Matching on a dummy index date sourced from consultation files rEHR: An R package for manipulating and analysing Electronic Health Record data 4.1.1 Incidence density matching. This is performed using the get_matches() function. With IDM, controls are selected for a particular case at the time of diagnosis (or other event such as death) from other members of the cohort who, at that time, do not have the diagnosis. The IDM sampling procedure allows the same patient to be selected as a control for more than one case, thus providing a full set controls for each case while still producing unbiased estimates of risk [7; 21]. This also means that the matching procedure can be parallelised to increase computational efficiency.
In all of the matching algorithms, matching is performed by default on categories selected in the match_vars argument. However, more complex matching strategies can also be employed via the extra_conditions argument. You can wrap calls to expressions in dotted brackets to automatically expand them. This is particularly useful when you want to find the value for each individual case. Each case is denoted by CASE, e.g. "start_date <. (CASE$start_date)" will ensure the start date for controls is prior to the start date for the matched case. The following code also selects controls whose birth year (yob) is within 2 years either side of their matched case: IDM_controls2 <-get À matches(cases ¼ filter(cohort2, case == 1), control À pool ¼ filter(cohort2, case == 0), match À vars ¼ c("gender", "region"), extra À conditions ¼ "yob >¼ ð:ðCASE$yobÞ À 2Þ &yob <¼ ð:ðCASE$yobÞ þ 2Þ", n À controls ¼ 4, cores ¼ 1, method ¼ "incidence À density", diagnosis À date ¼ "eventdate")

Exact matching.
Exact matching only matches controls from the control pool, unlike in IDM matching. Also, matched controls are removed from the control pool after each case has been matched, so each control can be used a maximum of one time. Therefore it is possible to have fewer matched controls for some cases than are requested via the n_controls argument. Because the control pool is being altered for every case, exact matching is not thread safe and so will only run on a single core. The cores and diagnosis_date arguments are ignored when this method is selected. exact_controls3 <-get À matches(cases ¼ filter(cohort2, case == 1), control À pool ¼ filter(cohort2, case == 0), match À vars ¼ c("gender", "region"), n À controls ¼ 4, cores = 1, method ¼ "exact", diagnosis À date ¼ "eventdate") In a small cohort, this can rapidly reduce the control pool, leading to many cases without matches. In this example, 19 out of 23 were matched with mean 3.6 controls matched to every case.

Matching on a dummy index date.
A common matching approach is to match on an index date, for example the diagnosis date of the cases or the date followup starts. There are several reasons to match on index date: 1. It ensures cases and controls are followed-up, on average, for the same amount of time. Not including an index date for controls may result in them being, on average, in the cohort for longer than the cases because their cohort start date is not constrained by the index date 2. There is a possible reduction of detection bias, for example if cases are expected to visit their doctors more often because they have more co-morbidities 3. If controls are known to have attended their practice at around the same time as their matched case, it is likely they will experience similar conditions in terms of practice policy and active GPs 4. Patients who, though registered, have no records of contact with the medical system ("Ghost patients") are excluded However, the controls will often not have the same index to match on (this is true by definition if the diagnosis date is used). In this situation, it is common to match on a dummy index date which may be a clinical event or interaction in the control's electronic health record that occurs around the same time as the index date of the case [22; 23]. The match_on_index() function allows for matching on an arbitrary number of categorical match_var variables and on continuous variables via the extra_conditions argument in the same way as the get_matches() function above. In addition, a supplied index date for each case is matched to event dates in a series of consultation files (1 file for each practice), providing a dummy index date for controls of a consultation date within index_diff_limit days of the matched case's index date.
Note that the consultation files must be in flat-file format, i.e. not as part of the database, but as text (or other filetype, e.g stata dta) files. This is the data format provided by CPRD ("Clinical Practice Research Datalink (CPRD) GOLD"). Although in most situations it is more efficient to process EHR data in SQL databases, as in the earlier functions described here, consultation tables are often very large and searching these for every case in a large cohort would be very slow. By processing consultation files that have been split by practice, it is possible to search for matches a practice at a time which is both efficient and allows for parallel processing to speed the process up still further. For convenience, a function flat_files() is provided that can export a database table to flat files split by practice in a format of their choosing. The match_on_index() function has an import_fn argument to use different file formats (e.g. foreign::read.dta or readstata13:: read.dta13 for Stata 12 or Stata 13 file). consultation_dir <-" $ =R=rEHR À testing" flat À files(db, out À dir ¼ consultation_dir, file À type ¼ "csv") index_controls <-match À on À index(cases ¼ filter(cohort2, case == 1), control À pool ¼ filter(cohort2, case == 0), index À var ¼ "eventdate", match À vars ¼ c("gender", "region"), index À diff À limit ¼ 90, consult À path ¼ consultation_dir, n À controls ¼ 4, import À fn ¼ function(x) convert À dates(read:csv(x))) # clean up constructed dirs after analysis unlink(consultation_dir,recursive ¼ TRUE) This function performs matching that is still more conservative than the previous methods, since it requires matching of patients within the same practice and with consultation dates near the index date. In the test example above, no matched controls were found which is not surprising with a control pool of only 143. In practice this method is only appropriate where there is a control pool of hundreds of thousands or even millions of patients. If too few controls are found, the constraint can be relaxed by setting a higher index_diff_limit. Setting this to an arbitrarily high value effectively means that matching is not done on index date, but just on practice and the other user-specified matching variables. Users may find that this is a more efficient way to perform exact matching than using the get_matches() function. We have used this method to accelerate matching runs with several million controls that previously took days or weeks to minutes or a few hours.

Time-varying covariates
Often, researchers want to cut a survival cohort by time-varying covariates. In this situation, individual patients may run over more than one row in the cohort dataset. For example, a drug exposure may occur after the entry into the cohort and one might be interested in how this might affect the outcome. In this situation, it is useful to have a pre-exposure and postexposure time period in the dataset.
The cut_tv() function cuts up a dataset based on times supplied for the time-varying covariate. If there is already a variable for the time-varying covariate, you can chose to flip the existing values or increment them. This means the function can be called multiple times to, e.g. deal with drugs starting and stopping and also to model the progression of treatment. Other packages implement similar functions (e.g. the cutLexis function from the Epi package [24]). The cut_tv() function is considerably faster than other cutting methods (particularly on large datasetss), does not require conversion of the dataset to other formats (such as Lexis), can be parallelised on posix compliant machines and is designed to be chained with dplyr workflows using the %>% operator. cut_tv() can deal with the following scenarios: • Binary chronic covariates e.g. The time of diagnosis for a chronic (unresolvable) condition.
This requires a single column variable of times from entry in the dataset • Binary covariates e.g. times of starting and stopping medication. This requires more than one column variable in the dataset, one for each start or stop event. The state flips with each new change.

Accessory functions
In this section we briefly discuss some miscellaneous functions provided in the package.

Clinical code list construction
An important part of EHR analyses is the construction of lists of clinical codes to define conditions, comorbidities and other clinical entities of interest to the study [9]. We have previously described methodologies to construct draft lists of clinical codes from keyword and code searches [19]. The R implementation of this methodology is now part of the rEHR package.
Building draft lists of clinical codes is a two-stage process: First, the search is defined by instantiating an object of class MedicalDefinition, containing the terms to be searched for in the lookup tables. MedicalDefinition objects can be instantiated from terms defined within R or imported from a csv file. The constructor function can be provided with lists of: terms(clinical search terms), codes (clinical codes), tests (test search terms), drugs (drug search terms), drugcodes (drug product codes). Within the individual argument lists, vectors of length > 1 are searched for together (logical AND), in any order. Different vectors in the same list are searched for separately (logical OR). Placing a "-" character at the start of a character vector element excludes that terms from the search. Providing NULL to any of the arguments means that this element will not be searched for. Underscores are treated as spaces. When searching for codes, a range of clinical codes can be searched for by providing two codes separated by a hyphen. e.g "E114-E117z".

Unit conversion
HbA1C tests for glycated haemoglobin are one of the best recorded clinical tests in UK primary care databases, to a large extent because of testing being incentivised under the UK Quality and Outcomes Framework pay-for-performance scheme [20; 25]. However, HbA1C data is not recorded in CPRD consistently. Measurements may have been made in mmol/mol, mmol/L or mg/dL. Also the closely analogous fructosamine test can also be converted into the same units for direct comparison. The CPRD-specific cprd_uniform_hba1c_values() function accepts a single argument of a dataframe in the CPRD "Additional" table form containing only entity types for HbA1C and Fructosamine and converts any HbA1C and fructosamine values to a common mmol/mol scale. Once this conversion has taken place, the function also removes obvious mis-coding errors that are far outside the possible range. A dataframe is returned with an extra column hba1c_score.

Exporting data to stata format
Sometimes researchers may need to share data with others in the same group who may not have R expertise. We have provided the to_stata function to export dataframes to stata dta format. This function compresses a dataframe to reduce file size in the following ways: 1. Date variables (as specified by the date_fields argument) are converted to integer days from 1960-01-01 to avoid compatibility issues between R and Stata. An alternative origin can be set with the origin argument.
2. Fields specified in the integer_fields are converted from numeric to integer.
the stata13 boolean argument indicates whether files should be stored in Stata13 format (Using readstata13::savedta13) or in Stata 12 compatible format (using foreign:: write.dta). The former includes a further compression step, similar to the compress command in Stata.

Working with temporary database tables
The size of EHR databases may require keeping intermediate data extractions as database tables, rather than as in-memory R dataframes. For example, extractions of clinical events for a common condition such as diabetes or asthma will require the extraction of millions of rows

Setting EHR type
In the final section we discuss the .ehr environment used to define the EHR database being used and how this can be set to work with different databases.
In many of the functions in this package, specific tables and variables in the database need to be accessed. A particular database system, such as CPRD, will have its own schema describing the organisation of the data within it. To simplify the functions in this package, we have opted to include an interface to the database schema in the form of an environment, .ehr, that is accessed by the various analysis functions in order to extract the correct data from the correct place in the database. This is effectively a list of attributes relating to the EHR system being used. For example there is an attribute specifying the patient id variable in the database. By default, a schema environment for CPRD is loaded when the package is loaded via a call to set_CPRD(). We have provided accessor functions to get and set attributes in the .ehr environment. It is preferable to use these accessor functions rather than setting elements directly. A list of all of the attributes is provided by the list_EHR_attributes() function. For example: list À EHR À attributes()

## [1] "PATIENT"
The default settings can be reverted to using the set_CPRD() function: set À CPRD() ## Using CPRD settings get À EHR À attribute(patient_id) The .ehr environments will allow for the simple definition of interfaces to other EHR systems, via the construction of new setting functions.

Conclusion
Working with structured EHR data requires a combination of computational and statistical expertise. The rEHR package greatly simplifies and accelerates the extraction and processing of coded data from EHR databases, enabling researchers to spend more time on their analyses, time that would otherwise be consumed with laborious preparation of research-ready data. The workflow is straightforward, amounting to a flat series of function calls rather than a complex set of nested loops, therefore errors are much more easily spotted and fixed. The available functions are summarised in Table 3. The combination of SQL native databases, optimised data manipulation packages and multicore functionality results in a package that runs many times faster than equivalent code.

Limitations and future work
Although rEHR is currently only tested with CPRD data, the .ehr environment system will allow it to be easily linked to other EHR databases. For future versions of the rEHR software we will consider: • Implementation of the repsample algorithm for representative sampling of practices [26].
• Iterative proportional fitting for matching on population characteristics between different EHR databases [8].
• A robust algorithm for determining smoking status.
• Interfaces to other EHR systems, in particular UK primary care databases such as THIN, QResearch and Research One.
• Uniform units functions for other clinical measurements such as blood pressure, cholesterol and serum creatinine.
• Probabilistic or other matching for patient record linkage.

Ethics and consent statement
This study is based on data from the Clinical Practice Research Datalink (CPRD) obtained under licence from the UK Medicines and Healthcare products Regulatory Agency. However, the interpretation and conclusions contained in this paper are those of the authors alone. The study was approved by the independent scientific advisory committee (ISAC) for CPRD research (reference number: 16_115R). No further ethics approval was required for the analysis of the data.