
High resolution data modifies intensive care unit dialysis outcome predictions as compared with low resolution administrative data set

  • Jennifer Ziegler ,

    Contributed equally to this work with: Jennifer Ziegler, Barret N. M. Rush

    Roles Methodology, Resources, Writing – original draft, Writing – review & editing

    Affiliation Department of Internal Medicine, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Manitoba, Canada

  • Barret N. M. Rush ,

    Contributed equally to this work with: Jennifer Ziegler, Barret N. M. Rush

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Internal Medicine, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Manitoba, Canada

  • Eric R. Gottlieb,

    Roles Conceptualization, Investigation, Methodology, Validation

    Affiliations Department of Medicine, Mount Auburn Hospital, Cambridge, Massachusetts, United States of America, Harvard Medical School, Boston, Massachusetts, United States of America, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America

  • Leo Anthony Celi ,

    Roles Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    ‡ LAC and DH also contributed equally to this work.

    Affiliations Harvard Medical School, Boston, Massachusetts, United States of America, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America

  • Miguel Ángel Armengol de la Hoz

    Roles Data curation, Formal analysis, Investigation, Methodology, Project administration, Visualization, Writing – original draft

    ‡ LAC and DH also contributed equally to this work.

    Affiliations Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America, Department of Anesthesia, Critical Care and Pain Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, United States of America, Big Data Department, Fundacion Progreso y Salud, Regional Ministry of Health of Andalucia

Abstract


High resolution clinical databases from electronic health records are increasingly being used in the field of health data science. Compared to traditional administrative databases and disease registries, these newer highly granular clinical datasets offer several advantages, including availability of detailed clinical information for machine learning and the ability to adjust for potential confounders in statistical models. The purpose of this study is to compare the analysis of the same clinical research question using an administrative database and an electronic health record database. The Nationwide Inpatient Sample (NIS) was used for the low resolution model, and the eICU Collaborative Research Database (eICU) was used for the high resolution model. A parallel cohort of patients admitted to the intensive care unit (ICU) with sepsis and requiring mechanical ventilation was extracted from each database. The primary outcome was mortality and the exposure of interest was the use of dialysis. In the low resolution model, after controlling for the covariates that are available, dialysis use was associated with increased mortality (eICU: OR 2.07, 95% CI 1.75–2.44, p<0.01; NIS: OR 1.40, 95% CI 1.36–1.45, p<0.01). In the high resolution model, after the addition of the clinical covariates, the harmful effect of dialysis on mortality was no longer significant (OR 1.04, 95% CI 0.85–1.28, p = 0.64). The results of this experiment show that the addition of high resolution clinical variables to statistical models significantly improves the ability to control for important confounders that are not available in administrative datasets. This suggests that the results from prior studies using low resolution data may be inaccurate and may need to be repeated using detailed clinical data.

Author summary

Healthcare administrative databases and disease registries are frequently used in clinical research; however, these sources of data were often not designed for this purpose and lack important detailed clinical data. Therefore, when using these data to answer clinical research questions, important clinical variables are missing and may bias the results. Over the past decade, high resolution databases that integrate administrative information and clinical patient data obtained from electronic health records have been developed specifically for the purpose of clinical research. The purpose of this study is to compare the effects of dialysis on mortality in similar cohorts of critically ill patients with sepsis requiring mechanical ventilation from both an administrative database and from a high resolution database. We found that the addition of clinical variables significantly altered the mortality odds ratio such that it was no longer significant. These results suggest that previous studies using administrative data and repositories may not be valid due to the lack of important clinical variables included in the models.

Introduction


Low resolution databases are data sets that lack granular and detailed clinical data, and often contain pre-specified types of information, such as patient demographics, diagnoses, and hospital information, as well as hospital admission and discharge information [1,2]. Administrative databases, which have been utilized for medical research purposes since they were first created in the 1970s, are one example of a low resolution database [3]. These databases have allowed for the analysis of large amounts of healthcare data over the past decades and have been responsible for numerous practice-changing studies [4–6]. Administrative databases, such as the Nationwide Inpatient Sample (NIS), provide large patient samples and include valuable and reliable information such as patient demographics, diagnostic coding of primary and secondary diagnoses, procedures performed, length of hospitalization and discharge status (i.e., discharge, death, transfer to another facility) [7]. These databases are easily accessible, inexpensive and permit the study of practices and outcomes across a large spectrum of healthcare related research questions. However, these databases were often created with the intent of gathering data for financial, health policy or administrative use, and therefore have inherent limitations. Information bias, including coding misclassification and coding inaccuracy, may be present and must be carefully evaluated when using these data sources [1,8–10]. Furthermore, most administrative databases lack follow-up information and clinical information, such as patient vital signs, laboratory values and medication use. Therefore, important additional clinical variables and confounders may be unmeasured, leading to bias in the results.

Medical registries are another type of low resolution data, which include health services registries, product registries and disease registries, and are also important sources of data for medical research and epidemiological studies [11–13]. In contrast to medical databases, medical registries are created with specific well-defined characteristics: each entry is a unique, individually identifiable person sharing a common feature, the population is geographically defined, the registry has a pre-defined purpose, and the registry is updated systematically [11,14,15]. Medical registries have been used in numerous practice-changing studies, and are particularly useful in the study of rare diseases [16–19]. The data collected by registries are variable and determined by the specific objectives of the registry, but are typically sourced from electronic health records (EHRs), medical charts, administrative databases, and patient reports [15,20]. Although medical registries may contain limited specific clinical data, registries, like administrative databases, lack granularity. Furthermore, registries are also limited by information biases including coding misclassification, data collection errors, and data completeness errors [15].

More recently, the widespread use of EHRs has provided access to large amounts of clinical data [21,22]. EHR data, together with the integration of clinical monitoring systems, have allowed for the development of modern databases containing high fidelity clinical information. These datasets with highly granular and detailed clinical data can be considered high resolution databases. High resolution databases often draw information from EHRs including patient monitoring parameters and physiological data, patient care flow sheets, laboratory values, procedures performed as well as the temporal trends throughout a patient’s clinical course [3,23]. Databases such as the eICU Collaborative Research Database (eICU) and the Medical Information Mart for Intensive Care (MIMIC) are examples of high resolution databases [24,25]. These datasets directly integrate clinical data from EHRs and bedside monitoring linearly over time into comprehensive datasets that can be used for clinical research. Utilization of these datasets may allow for better adjustment for confounding as they contain detailed clinical information that administrative datasets and disease registries lack. However, the use of datasets such as these and other big data resources is similarly prone to errors and bias, including computational errors with complex statistical techniques, problems with the quality and reliability of collected data, and lack of familiarity with knowledge translation to clinical practice [26–28].

Statistical models used to analyze data in non-randomized trials are limited by numerous factors. One of the major barriers to accuracy of modeling is the presence of residual confounding not accounted for by the study design [29]. Specifically, in the intensive care unit, administrative datasets often lack detailed clinical information about severity of illness, laboratory investigations, and vital signs, as well as time-stamped interventions, which are likely important confounders not accounted for in the models [30]. Conversely, the use of robust databases that incorporate clinical and bedside monitoring data might improve modelling accuracy due to the ability to control for additional clinical variables and confounders, that are not available with low resolution datasets. The aim of this experiment is to compare the ability to adjust for confounding between a low resolution large national administrative database and a high resolution large multi-center EHR database examining the same clinical question.

Results


There were 139,367 patients included in the 2014 eICU sample, of which a total of 8,822 (6.3%) patients were included in the cohort (Fig 1). A total of 7,071,762 hospitalizations from the 2014 NIS sample were analyzed and 223,947 (3.2%) were included in the cohort (Fig 2). The overall mortality was 22.6% in the eICU cohort, while the NIS cohort had a mortality of 26.9%. There were 727 (8.2%) patients in the eICU cohort who required dialysis, with a mortality rate of 36.0%; the NIS cohort contained 19,149 (8.5%) patients who required dialysis with a mortality rate of 40.0%. The baseline characteristics of the cohorts are displayed in Table 1. For the high resolution data variables for the eICU cohort, the baseline levels are available in S1 Table.

Fig 1. eICU cohort.

The cohort selection from the eICU database.

Fig 2. NIS cohort.

The cohort selection from the NIS database.

Table 1. The baseline characteristics of the patients in the eICU and NIS cohorts.

The patients receiving dialysis and those not receiving dialysis from each cohort are compared using the Student’s independent t-test or the Wilcoxon Rank Sum test for continuous variables, and the Chi-Square test for categorical variables.

The results of the low-resolution model for in-hospital mortality are displayed in Table 2. After controlling for all variables in the model, dialysis use was associated with an increased risk of mortality in both cohorts (eICU: OR 2.07, 95% CI 1.75–2.44, p<0.01; NIS: OR 1.40, 95% CI 1.36–1.45, p<0.01).

Table 2. The low resolution multivariable logistic regression model predicting in-hospital mortality in the NIS and eICU cohorts.

For the eICU cohort, after addition of the high resolution covariates, the point estimate for the detrimental effect of hemodialysis use was no longer significant (OR 1.04, 95% CI 0.85–1.28, p = 0.64). The results of the high resolution model are displayed in Table 3. The correlation plot (Fig 3) shows that, overall, there was no significant correlation between the majority of the high resolution variables. The pairs of variables that did show high correlation were hemoglobin and hematocrit, AST and ALT, mean arterial pressure and diastolic blood pressure, and mean arterial pressure and systolic blood pressure.
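
To make the link between a reported odds ratio, its confidence interval and statistical significance concrete, the following is a minimal sketch that back-calculates the log-odds coefficient, its standard error and a Wald test from the published estimates. It assumes a symmetric Wald interval on the log-odds scale with a critical value of 1.96; the function name `wald_summary` is hypothetical and not part of the study code. Because the published figures are rounded, the recovered p-values are only approximate and will not match the reported ones exactly.

```python
import math


def wald_summary(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover log-odds, standard error, z and two-sided p from an OR and its 95% CI.

    Assumes a symmetric Wald interval on the log-odds scale (an assumption,
    not something the source states explicitly).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail area
    return beta, se, z, p


# Estimates as reported in the text (rounded to two decimals in the source).
reported = {
    "eICU low resolution": (2.07, 1.75, 2.44),
    "NIS low resolution": (1.40, 1.36, 1.45),
    "eICU high resolution": (1.04, 0.85, 1.28),
}

for label, (or_, low, high) in reported.items():
    beta, se, z, p = wald_summary(or_, low, high)
    excludes_one = low > 1 or high < 1
    print(f"{label}: beta={beta:.3f}, SE={se:.3f}, z={z:.2f}, CI excludes 1: {excludes_one}")
```

An interval that contains 1 corresponds to a non-significant association at the 0.05 level, which is the case only for the high resolution eICU model.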

Fig 3. The correlation plot for the variables in the high resolution model.

Table 3. The high resolution multivariable logistic regression model predicting in-hospital mortality in the eICU cohort.

The authors provide open access to all their data extraction, filtering, data wrangling, modeling, figures and tables, code, and queries on

Discussion


In this comparative analysis utilizing comparable cohorts of mechanically ventilated patients with sepsis, the addition of the high-resolution clinical variables in the eICU database allowed for greater adjustment of severity of illness and significantly altered the point estimate for the association of hemodialysis use and hospital mortality. We demonstrate that the cohorts of patients that were obtained from the low-resolution NIS database and the high resolution eICU-CRD are comparable by baseline patient and hospital demographics, dialysis use, and in-hospital mortality. The results show that the baseline low resolution models from both the cohorts predicted a significant association of in-hospital mortality with dialysis use. This is in keeping with previously published epidemiological studies that have reported that dialysis is associated with increased in-hospital mortality among critically ill patients with sepsis who require mechanical ventilation [31,32]. After adjusting for the high resolution variables in our model, the association of in-hospital mortality and dialysis use in the eICU cohort was no longer significant. These results demonstrate that after the addition of the high resolution clinical variables to the model, the results of the analysis changed significantly. This highlights the importance of the granular clinical data within the high resolution model, as the low resolution model did not account for these important clinical confounders.
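
The mechanism described above, in which an apparent association disappears once severity of illness is accounted for, can be illustrated with a small sketch. The counts below are entirely hypothetical and are not study data; they are chosen so that dialysis has no effect within either severity stratum, yet sicker patients are more likely both to receive dialysis and to die. The Mantel-Haenszel estimator is used here only as a simple stand-in for covariate adjustment; the study itself used logistic regression.

```python
# Toy 2x2 tables per severity stratum (hypothetical counts, not study data).
# Each tuple: (exposed_died, exposed_survived, unexposed_died, unexposed_survived)
strata = {
    "low_severity": (10, 90, 90, 810),
    "high_severity": (200, 200, 100, 100),
}


def odds_ratio(a, b, c, d):
    """Odds ratio of death for exposed vs. unexposed patients."""
    return (a * d) / (b * c)


def crude_odds_ratio(tables):
    """Pool all strata into one table, ignoring severity."""
    a, b, c, d = (sum(col) for col in zip(*tables))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(tables):
    """Severity-adjusted odds ratio (Mantel-Haenszel estimator)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


tables = list(strata.values())
crude_or = crude_odds_ratio(tables)
adjusted_or = mantel_haenszel_odds_ratio(tables)
print(f"crude OR = {crude_or:.2f}, severity-adjusted OR = {adjusted_or:.2f}")
# prints: crude OR = 3.47, severity-adjusted OR = 1.00
```

In this toy example a dataset without a severity variable could only report the crude estimate, whereas a dataset that records severity recovers the null association.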

Administrative databases and disease registries have been widely used for decades to examine clinical questions in all areas of healthcare, and the publication of such epidemiological studies has increased over time [6]. Often these data sources were not created with the intention of answering clinical questions and lack the detailed clinical information required for proper adjustment in statistical modelling to remove residual confounding [7,10]. The use of administrative databases such as the NIS has inherent limitations related to coding and misclassification bias [33]. Furthermore, a recent study found that up to 85% of studies published using the NIS database did not adhere to the specified methodological standards, which can further bias study results and interpretations [34]. Disease registries, which are non-randomized observational datasets containing patient, medical treatment or device information, are also limited by selection biases, information biases, and data quality errors [35,36]. Furthermore, data registries in particular may be prone to data linkage errors, which can also lead to misleading study findings [37,38]. For example, a recent study evaluating prostate specific antigen values in cancer registries found high rates of misclassification error when compared to the gold standard EHR laboratory value, resulting in important differences in clinical outcomes [39]. Both administrative datasets and disease registries lack detailed clinical variables, which are important confounders to consider in clinical research, as study results can be greatly influenced by the addition of these variables.

In contrast to administrative datasets and disease registries, more recently developed high resolution databases such as eICU were created with clinical research as a primary objective. Globally, there are several other examples of commercial and non-commercial high resolution databases and national repositories in the field of critical care that are currently used for medical research. In the United States, examples include the Veterans Affairs patient database, the MIMIC-II database and the APACHE and Project IMPACT databases [25,40–42]. In the United Kingdom, the Intensive Care National Audit and Research Center (ICNARC) provides high resolution data from national repositories, and similarly, the Australian and New Zealand Intensive Care Society also curates a large database from patient ICU stays [43]. These high resolution datasets are well designed and structured to answer clinical questions. They also contain vast amounts of clinical data such as patient vital signs and laboratory values captured from various electronic sources in hospitals [24]. These datasets are increasingly being used due to the powerful data available to address clinical research questions, and have provided data to support hundreds of studies in recent years. However, given the large amount of data drawn from multiple sources within these databases, data integrity and accuracy of the collected data are very important considerations when performing analyses because systematic errors in the data lead to propagated error within models and may influence the conclusions of the analysis [44,45]. While data integrity is also a concern in administrative datasets, the sheer volume of data collated in high resolution databases may compound the degree of error. As our experiment demonstrates, the inclusion of detailed clinical variables in the model leads to a difference in research conclusions compared to when using the administrative database. This is despite baseline similarities between the two cohorts with respect to patient demographics, dialysis use and clinical outcomes. Not surprisingly, this supports the argument that important confounders are lacking in administrative databases, and therefore conclusions drawn from prior research using these types of databases may not be accurate.

The strengths of this analysis lie in the large sample sizes and multi-center scope of both datasets [24,46]. Our analysis is generalizable to a significant proportion of the critical care population from across the United States, taking into account the limitations of each dataset. The accuracy of coding for dialysis use has also been shown to be highly reliable by prior studies, and the diagnostic coding has been validated for use in administrative databases [47].

The results of the study must be interpreted while acknowledging several important limitations. As described above, the use of administrative and EHR based datasets carries inherent risks of data error, misclassification bias and coding errors. The NIS database does not identify individual patients, but rather each entry is a patient encounter [48]. Accordingly, it is possible that if a patient is transferred between hospitals, this may represent more than one entry in the database. Limitations with the eICU database are inherent in the fact that the data are derived from multiple eICUs across the United States. A recent study by O’Halloran and colleagues showed that the eICU database may lack generalizability to large ICUs with more acuity, as the majority of data from within the database is obtained from small and medium sized ICUs [49]. Furthermore, this study identified potential ambiguity in the coding of mechanical ventilation within the dataset, which may have influenced our cohort selection [49].

This study serves as a thought experiment and proof of concept that highlights the benefits high-resolution data sources from EHRs have over low-resolution administrative datasets. Our findings show that the addition of granular clinical data to administrative data elements significantly altered the point estimate for the association of dialysis use and mortality in patients with sepsis. The results of this experiment suggest that many prior analyses utilizing low resolution administrative data may need to be repeated with more powerful data sources in order to better assess causal relationships while controlling for important clinical confounders. Another avenue that is worth exploring is the linking of multi-center EHR datasets with claims and registry data to combine the benefits of these different types of data sources. Further studies examining the outcome of clinical research questions using both high and low resolution datasets should be performed in the future to further evaluate the impact of dataset granularity on the results of observational studies.

Materials and methods

Study rationale

For this experiment we attempted to create comparable cohorts of patients in each dataset with inclusion criteria that would allow for a high degree of certainty. Mechanical ventilation and the use of hemodialysis are well coded in administrative datasets, whereas sepsis is a common indication for admission to the intensive care unit.

The Nationwide Inpatient Sample (NIS) was selected to represent the low resolution administrative dataset [48]. This national, all-payer database is produced by the United States Agency for Healthcare Research and Quality (AHRQ). It captures approximately 20% of all inpatient hospitalizations and is designed to approximate >95% of all inpatient care (prison hospitals and non-traditional hospitals are excluded). It is a well-validated database that has excellent data integrity and has been used for decades for health services research. The eICU Collaborative Research Database (eICU-CRD) was selected to represent the high resolution dataset [24]. The eICU-CRD is a publicly available database curated from a partnership between the Laboratory for Computational Physiology (LCP) at Massachusetts Institute of Technology (MIT) and the Electronic ICU Research Institute. The eICU database contains highly granular de-identified data for over 200,000 inpatient admissions to telehealth ICUs across the United States. This database has been used for hundreds of research projects since it became publicly available in 2017, and the data have undergone stringent technical validation [24]. The use of the eICU-CRD is exempt from institutional review board approval due to the retrospective design, lack of direct patient intervention, and the security schema, for which the re-identification risk was certified as meeting safe harbor standards by an independent privacy expert (Privacert, Cambridge, MA) (Health Insurance Portability and Accountability Act Certification no. 1031219–2).

A waiver of consent for this analysis was obtained from the Research Ethics Board at the University of Manitoba, as all of the data is publicly available and de-identified. This study is reported in accordance with the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) statement [50].

Cohort selection

Comparable cohorts of patients with sepsis who required mechanical ventilation were created in the NIS and eICU databases. The details of cohort selection for each database can be found in Figs 1 and 2. The baseline characteristics of the patients in both the NIS and eICU cohorts are shown in Table 1. Within each cohort, patients who received dialysis were compared with those who did not using either the Student’s independent t-test or the Wilcoxon Rank Sum test, depending on normality, for continuous variables, and the Chi-Square test for categorical variables. All statistical analyses in this study were performed assuming a two-sided alpha level of 0.05.
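
As an illustration of the categorical comparison described above, the following is a minimal sketch of a Pearson Chi-Square test on a 2x2 table evaluated against the stated alpha of 0.05. The counts are hypothetical and the function name `chi_square_2x2` is an assumption for this example; no continuity correction is applied, and the continuous-variable tests (t-test and Wilcoxon Rank Sum) are not shown.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Rows are exposure groups, columns are outcome categories.
    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, df = 1
    return stat, p_value


ALPHA = 0.05  # two-sided significance level stated in the text

# Hypothetical counts: a binary characteristic present/absent among
# dialysis patients (first row) and non-dialysis patients (second row).
stat, p_value = chi_square_2x2(60, 40, 300, 600)
print(f"chi2 = {stat:.2f}, significant at alpha: {p_value < ALPHA}")
```

For tables larger than 2x2 the degrees of freedom change and the closed-form p-value used here no longer applies.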

Base Covariates (eICU and NIS)

For each dataset, a baseline set of common variables was collected. Covariates included age, gender, race (White, Black, Hispanic, Other/Missing), Region of Country (West, Northeast, South, Midwest, Missing), Hospital size (<100 beds, 100–249 beds, ≥ 250 beds, other/unknown), Charlson comorbidity index, the presence of shock, teaching status of hospital, and the use of dialysis.

High Resolution Covariates (eICU only)

Additional high-resolution covariates were obtained for each patient from the eICU-CRD. These included the patient laboratory results (sodium, potassium, bicarbonate, blood urea nitrogen (BUN), creatinine, glucose, calcium, phosphate, hematocrit, hemoglobin, red cell distribution width (RDW), platelet count, white blood cell count, international normalized ratio (INR), lactate, liver enzymes and function tests), patient vital signs (fraction of inspired oxygen (FiO2), heart rate, respiratory rate, oxygen saturation, blood pressure and temperature) as well as medication use (any vasopressor or inotrope including dopamine, dobutamine, norepinephrine, phenylephrine, epinephrine, vasopressin, milrinone and heparin). For variables with more than one record per hour, a median value per hour was computed. The hourly average was then determined and aggregated for each variable per patient. Missing data for the high-resolution model continuous variables were imputed using a forward-backward filling imputation method. Forward filling means filling missing values with the previous data point available for a given patient. Backward filling means filling missing values with the next data point available for a given patient. The function first attempts to fill the missing value with the backward method if a data point is available, and if not, then with the forward method. For binary data, missing data were treated as 0. Missing data for the high-resolution binary medication variables were treated as not administered. The plot of missingness for each variable is shown in supplemental material S1 Fig.
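
The hourly median step and the backward-then-forward filling order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `(hour, value)` record layout and the example values are hypothetical, and the subsequent per-patient aggregation of hourly values is not shown.

```python
from statistics import median


def hourly_medians(records, n_hours):
    """Collapse (hour, value) records into one median per hour; None if no record."""
    by_hour = {}
    for hour, value in records:
        by_hour.setdefault(hour, []).append(value)
    return [median(by_hour[h]) if h in by_hour else None for h in range(n_hours)]


def backward_then_forward_fill(series):
    """Fill gaps with the next available value, else with the previous one."""
    filled = list(series)
    # Backward fill: walk from the end, carrying the next observed value back.
    next_value = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_value
        else:
            next_value = filled[i]
    # Forward fill: gaps that remain (after the last observation) take the
    # most recent observed value.
    last_value = None
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = last_value
        else:
            last_value = value
    return filled


# Hypothetical potassium records for one patient: two in hour 0, one in hour 2.
records = [(0, 4.0), (0, 5.0), (2, 5.5)]
hourly = hourly_medians(records, n_hours=5)
print(hourly)                              # [4.5, None, 5.5, None, None]
print(backward_then_forward_fill(hourly))  # [4.5, 5.5, 5.5, 5.5, 5.5]
```

A series with no observations at all is left as missing by this sketch, since neither filling direction has a value to carry.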

Low resolution model

In order to compare the output of the two datasets, identical multivariable logistic regression models predicting mortality after ICU admission for sepsis with mechanical ventilation were created. The variables included in the baseline low-resolution model, which were the same for both datasets, were chosen a priori and are the same variables described in the Base Covariates section of the Methods. Normalization of the variables was not performed prior to performing the logistic regression analysis.

High resolution model

A multivariable logistic regression model (estimated using maximum likelihood) was fitted in the eICU cohort to predict in-hospital mortality for patients admitted to the ICU with sepsis and requiring mechanical ventilation. Standardized parameters were obtained by fitting the model on a standardized version of the dataset. As is typical for binomial families, the response was specified as a factor in which the first level denotes failure and all other levels denote success. Both the low resolution (listed in the Base Covariates section) and high-resolution (listed in the High Resolution Covariates section) covariates described above were used, and were not normalized prior to analysis. The variables included in the high resolution model were determined a priori. In order to assess collinearity in the model, a correlation plot was generated to examine the pairwise correlations between the model variables (Fig 3).
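
The collinearity screen that a correlation plot supports can be sketched as a pairwise correlation check. The example below is illustrative only: the patient values are hypothetical, the function names are not from the study code, and the 0.8 cut-off is an assumption, since the source does not state the threshold used to call a pair highly correlated.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return variable pairs whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Hypothetical values for five patients (illustration only, not study data).
columns = {
    "hemoglobin": [8.0, 10.0, 12.0, 14.0, 9.0],
    "hematocrit": [24.5, 30.0, 35.5, 42.5, 27.0],
    "heart_rate": [100, 80, 110, 90, 95],
}
print(highly_correlated_pairs(columns))
```

With these toy values only the hemoglobin and hematocrit pair is flagged, mirroring one of the correlated pairs reported in the Results.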

Supporting information

S1 Table. The eICU high resolution variables. The baseline characteristics of the high resolution variables for the patients in the eICU database.


S1 Fig. The plot of missingness for the high resolution variables.



References

  1. Garland A, Gershengorn HB, Marrie RA, Reider N, Wilcox ME. A Practical, Global Perspective on Using Administrative Data to Conduct Intensive Care Unit Research. Ann Am Thorac Soc. 2015;12: 1373–1386. pmid:26148250
  2. Tange HJ, Schouten HC, Kester AD, Hasman A. The granularity of medical narratives and its effect on the speed and completeness of information retrieval. J Am Med Inform Assoc. 1998;5: 571–82. pmid:9824804
  3. Marshall J, Chahin A, Rush B. Review of Clinical Databases. Secondary Analysis of Electronic Health Records. Cham: Springer International Publishing; 2016. pp. 9–16.
  4. Sarrazin MSV, Rosenthal GE. Finding pure and simple truths with administrative data. JAMA. 2012;307: 1433–5. pmid:22474208
  5. Mohammed MA, Stevens A. The value of administrative databases. BMJ. 2007;334: 1014–5. pmid:17510106
  6. Khera R, Krumholz HM. With Great Power Comes Great Responsibility: Big Data Research From the National Inpatient Sample. Circ Cardiovasc Qual Outcomes. 2017;10. pmid:28705865
  7. Hashimoto RE, Brodt ED, Skelly AC, Dettori JR. Administrative database studies: goldmine or goose chase? Evid Based Spine Care J. 2014;5: 74–6. pmid:25278880
  8. Hertzer NR. Reasons Why Data from the Nationwide Inpatient Sample Can Be Misleading for Carotid Endarterectomy and Carotid Stenting. Semin Vasc Surg. 2012;25: 13–17. pmid:22595476
  9. McPhee JT, Schanzer A, Messina LM, Eslami MH. Carotid artery stenting has increased rates of postprocedure stroke, death, and resource utilization than does carotid endarterectomy in the United States, 2005. J Vasc Surg. 2008;48: 1442–1450.e1. pmid:18829236
  10. Johnson EK, Nelson CP. Values and pitfalls of the use of administrative databases for outcomes assessment. J Urol. 2013;190: 17–8. pmid:23608038
  11. Solomon DJ, Henry RC, Hogan JG, Van Amburg GH, Taylor J. Evaluation and implementation of public health registries. Public Health Rep. 1991;106: 142–50. Available: pmid:1902306
  12. Gladman DD, Menter A. Introduction/overview on clinical registries. Ann Rheum Dis. 2005;64 Suppl 2: ii101–2. pmid:15708919
  13. Stausberg J, Altmann U, Antony G, Drepper J, Sax U, Schütt A. Registers for Networked Medical Research in Germany: Situation and prospects. Appl Clin Inform. 2010;1: 408–18. pmid:23616850
  14. Donaldson L. Registering a need. BMJ. 1992;305: 597–8. pmid:1393067
  15. Pop B, Fetica B, Blaga ML, Trifa AP, Achimas-Cadariu P, Vlad CI, et al. The role of medical registries, potential applications and limitations. Med Pharm reports. 2019;92: 7–14. pmid:30957080
  16. van Hest NAH, Story A, Grant AD, Antoine D, Crofts JP, Watson JM. Record-linkage and capture-recapture analysis to estimate the incidence and completeness of reporting of tuberculosis in England 1999–2002. Epidemiol Infect. 2008;136: 1606–16. pmid:18346285
  17. Navarro C, Martos C, Ardanaz E, Galceran J, Izarzugaza I, Peris-Bonet R, et al. Population-based cancer registries in Spain and their role in cancer control. Ann Oncol Off J Eur Soc Med Oncol. 2010;21 Suppl 3: iii3–13. pmid:20427357
  18. Bray F, Ferlay J, Laversanne M, Brewster DH, Gombe Mbalawa C, Kohler B, et al. Cancer Incidence in Five Continents: Inclusion criteria, highlights from Volume X and the global status of cancer registration. Int J cancer. 2015;137: 2060–71. pmid:26135522
  19. Taruscio D, Gainotti S, Mollo E, Vittozzi L, Bianchi F, Ensini M, et al. The current situation and needs of rare disease registries in Europe. Public Health Genomics. 2013;16: 288–98. pmid:24503589
  20. Gliklich R, Dreyer N, Leavy M. Registries for Evaluating Patient Outcomes: A User’s Guide. Third edition. Two volumes. (Prepared by the Outcome DEcIDE Center [Outcome Sciences, Inc., a Quintiles company] under Contract No. 290 2005 00351 TO7.) AHRQ Publication No. 13(14)-EHC111. Rockville, MD; 2014. Available:
  21. Bates DW, Saria S, Ohno-Machado L, Shah A, Escobar G. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff (Millwood). 2014;33: 1123–31. pmid:25006137
  22. Iwashyna TJ, Liu V. What’s so different about big data? A primer for clinicians trained to think epidemiologically. Ann Am Thorac Soc. 2014;11: 1130–5. pmid:25102315
  23. Celi LA, Mark RG, Stone DJ, Montgomery RA. “Big data” in the intensive care unit: Closing the data loop. Am J Respir Crit Care Med. 2013;187: 1157–1160. pmid:23725609
  24. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci data. 2018;5: 180178. pmid:30204154
  25. Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci data. 2016;3: 160035. pmid:27219127
  26. Shilo S, Rossman H, Segal E. Axes of a revolution: challenges and promises of big data in healthcare. Nat Med. 2020;26: 29–38. pmid:31932803
  27. Yang S, Stansbury LG, Rock P, Scalea T, Hu PF. Linking Big Data and Prediction Strategies: Tools, Pitfalls, and Lessons Learned. Crit Care Med. 2019;47: 840–848. pmid:30920408
  28. Soriano-Valdez D, Pelaez-Ballestas I, Manrique de Lara A, Gastelum-Strozzi A. The basics of data, big data, and machine learning in clinical practice. Clin Rheumatol. 2021;40: 11–23. pmid:32504192
  29. Vetter TR, Mascha EJ. Bias, Confounding, and Interaction: Lions and Tigers, and Bears, Oh My! Anesth Analg. 2017;125: 1042–1048. pmid:28817531
  30. Sanchez-Pinto LN, Luo Y, Churpek MM. Big Data and Data Science in Critical Care. Chest. 2018;154: 1239–1248. pmid:29752973
  31. 31. Sakhuja A, Kumar G, Gupta S, Mittal T, Taneja A, Nanchal RS. Acute Kidney Injury Requiring Dialysis in Severe Sepsis. Am J Respir Crit Care Med. 2015;192: 951–957. pmid:26120892
  32. 32. Vaara ST, Pettilä V, Reinikainen M, Kaukonen K-M. Population-based incidence, mortality and quality of life in critically ill patients treated with renal replacement therapy: a nationwide retrospective cohort study in finnish intensive care units. Crit Care. 2012;16: R13. pmid:22264319
  33. 33. van Walraven C, Austin P. Administrative database research has unique characteristics that can risk biased results. J Clin Epidemiol. 2012;65: 126–31. pmid:22075111
  34. 34. Khera R, Angraal S, Couch T, Welsh JW, Nallamothu BK, Girotra S, et al. Adherence to Methodological Standards in Research Using the National Inpatient Sample. JAMA. 2017;318: 2011. pmid:29183077
  35. 35. Toppari J, Kaleva M, Virtanen HE. Trends in the incidence of cryptorchidism and hypospadias, and methodological limitations of registry-based data. APMIS. 2001;109: S37–S42.
  36. 36. Nathan H, Pawlik TM. Limitations of Claims and Registry Data in Surgical Oncology Research. Ann Surg Oncol. 2008;15: 415–423. pmid:17987343
  37. 37. Brenner H, Schmidtmann I, Stegmaier C. Effects of record linkage errors on registry-based follow-up studies. Stat Med. 1997;16: 2633–2643. pmid:9421866
  38. 38. Siegler JE, Boehme AK, Dorsey AM, Monlezun DJ, George AJ, Shaban A, et al. A comprehensive stroke center patient registry: advantages, limitations, and lessons learned. Med student Res J. 2013;2: 21–29. pmid:26913217
  39. 39. Guo DP, Thomas I-C, Mittakanti HR, Shelton JB, Makarov D V., Skolarus TA, et al. The Research Implications of Prostate Specific Antigen Registry Errors: Data from the Veterans Health Administration. J Urol. 2018;200: 541–548. pmid:29630980
  40. 40. Wang XQ, Vincent BM, Wiitala WL, Luginbill KA, Viglianti EM, Prescott HC, et al. Veterans Affairs patient database (VAPD 2014–2017): building nationwide granular data for clinical discovery. BMC Med Res Methodol. 2019;19: 94. pmid:31068135
  41. 41. Zimmerman JE, Kramer AA, McNair DS, Malila FM, Shaffer VL. Intensive care unit length of stay: Benchmarking based on Acute Physiology and Chronic Health Evaluation (APACHE) IV. Crit Care Med. 2006;34: 2517–29. pmid:16932234
  42. 42. Cook SF, Visscher WA, Hobbs CL, Williams RL, Project IMPACT Clinical Implementation Committee. Project IMPACT: results from a pilot validity study of a new observational database. Crit Care Med. 2002;30: 2765–70. pmid:12483071
  43. 43. Stow PJ, Hart GK, Higlett T, George C, Herkes R, McWilliam D, et al. Development and implementation of a high-quality clinical database: the Australian and New Zealand Intensive Care Society Adult Patient Database. J Crit Care. 2006;21: 133–41. pmid:16769456
  44. 44. Kruse CS, Goswamy R, Raval Y, Marawi S. Challenges and Opportunities of Big Data in Health Care: A Systematic Review. JMIR Med informatics. 2016;4: e38. pmid:27872036
  45. 45. Gao J, Xie C, Tao C. Big Data Validation and Quality Assurance—Issuses, Challenges, and Needs. 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE). IEEE; 2016. pp. 433–441.
  46. 46. Introduction to the HCUP Nationwide Inpatient Sample [Internet]. In: Agency Healthc Res Qual Healthc Cost Util Proj. 2011.
  47. 47. Waikar SS, Wald R, Chertow GM, Curhan GC, Winkelmayer WC, Liangos O, et al. Validity of International Classification of Diseases, Ninth Revision, Clinical Modification Codes for Acute Renal Failure. J Am Soc Nephrol. 2006;17: 1688–94. pmid:16641149
  48. 48. Databases HCUP. Healthcare Cost and Utilization Project (HCUP). In: Agency for Healthcare Research and Quality, Rockville, MD. [Internet]. 2021 [cited 1 Jun 2021]. Available:
  49. 49. O’Halloran HM, Kwong K, Veldhoen RA, Maslove DM. Characterizing the Patients, Hospitals, and Data Quality of the eICU Collaborative Research Database*. Crit Care Med. 2020;48: 1737–1743. pmid:33044284
  50. 50. Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. PLoS Med. 2007;4: e297. pmid:17941715