
Pooling individual participant data from randomized controlled trials: Exploring potential loss of information

  • Lennard L. van Wanrooij ,

    Roles Conceptualization, Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    l.l.vanwanrooij@amc.uva.nl

    Affiliation Department of Neurology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands

  • Marieke P. Hoevenaar-Blom,

    Roles Conceptualization, Data curation, Methodology, Supervision, Writing – review & editing

    Affiliations Department of Neurology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands, Department of Neurology, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands

  • Nicola Coley,

    Roles Data curation, Writing – review & editing

    Affiliations Department of Epidemiology and Public Health, Toulouse University Hospital, Toulouse, France, INSERM, University of Toulouse UMR1027, Toulouse, France

  • Tiia Ngandu,

    Roles Data curation, Writing – review & editing

    Affiliation Chronic Disease Prevention Unit, National Institute for Health and Welfare, Helsinki, Finland

  • Yannick Meiller,

    Roles Conceptualization, Data curation, Methodology, Writing – review & editing

    Affiliation Department of Information and Operations Management, ESCP Europe, Paris, France

  • Juliette Guillemont,

    Roles Data curation, Writing – review & editing

    Affiliation INSERM, University of Toulouse, Toulouse, France

  • Anna Rosenberg,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Neurology, Institute of Clinical Medicine, University of Eastern Finland, Kuopio, Finland

  • Cathrien R. L. Beishuizen,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Neurology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands

  • Eric P. Moll van Charante,

    Roles Data curation, Writing – review & editing

    Affiliation Department of General Practice, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands

  • Hilkka Soininen,

    Roles Data curation, Writing – review & editing

    Affiliations Department of Neurology, Institute of Clinical Medicine, University of Eastern Finland, Kuopio, Finland, Neurocenter, Neurology, Kuopio University Hospital, Kuopio, Finland

  • Carol Brayne,

    Roles Conceptualization, Data curation, Methodology, Writing – review & editing

    Affiliation Department of Public Health and Primary Care, Cambridge Institute of Public Health, University of Cambridge, Cambridge, United Kingdom

  • Sandrine Andrieu,

    Roles Data curation, Writing – review & editing

    Affiliations Department of Epidemiology and Public Health, Toulouse University Hospital, Toulouse, France, INSERM, University of Toulouse UMR1027, Toulouse, France

  • Miia Kivipelto,

    Roles Data curation, Writing – review & editing

    Affiliations Chronic Disease Prevention Unit, National Institute for Health and Welfare, Helsinki, Finland, Department of Neurology, Institute of Clinical Medicine, University of Eastern Finland, Kuopio, Finland, Aging Research Center, Karolinska Institutet, Stockholm University, Stockholm, Sweden, Karolinska Institutet Center for Alzheimer Research, Stockholm, Sweden

  • Edo Richard

    Roles Conceptualization, Data curation, Funding acquisition, Methodology, Supervision, Writing – review & editing

    Affiliations Department of Neurology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands, Department of Neurology, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands

Abstract

Background

Pooling individual participant data for pooled analyses is often complicated by diversity in variables across the available datasets. Recoding original variables is therefore often necessary to build a pooled dataset. We aimed to quantify how much information is lost in this process and to what extent this jeopardizes the validity of analysis results.

Methods

Data were derived from a platform that was developed to pool data from three randomized controlled trials on the effect of treatment of cardiovascular risk factors on cognitive decline or dementia. We quantified loss of information using the R-squared of linear regression models with pooled variables as a function of their original variable(s). If the R-squared was below 0.8, we additionally explored the potential impact of this loss of information on future analyses by assessing whether the Beta coefficient of the predictor differed by more than 10% when the original versus the recoded variable was added as a confounder in a linear regression model. In a simulation we randomly sampled numbers, recoded those ≤1000 to 0 and those >1000 to 1, varied the range of the continuous variable, the ratio of recoded zeroes to recoded ones, or both, and again extracted the R-squared from linear models to quantify information loss.

Results

The R-squared was below 0.8 for 8 out of 91 recoded variables. In 4 cases this had a substantial impact on the regression models, particularly when a continuous variable was recoded into a discrete variable. Our simulation showed that the least information is lost when the ratio of recoded zeroes to ones is 1:1.

Conclusions

Large, pooled datasets provide great opportunities, justifying the efforts for data harmonization. Still, caution is warranted when using recoded variables whose variance is only partly explained by their original variables, as this may jeopardize the validity of study results.

Introduction

The sample size in individual cohort studies and randomized controlled trials (RCTs) is often too small to answer specific (secondary) research questions. Pooling individual participant data (IPD) from multiple studies increases the sample size and statistical power, enables subgroup analyses, and allows assessment of the consistency of findings across different studies. To enable IPD meta-analyses it is necessary to harmonize data from different studies, which is often complicated by diversity across the available datasets.

Differences in data collection include, but are not limited to, data in multiple file formats and languages, the use of different instruments or scales to measure the same domains, and the use of different units of measurement. When syntaxes to recode data are developed ad hoc, without detailed information on the data collection procedure, misinterpretation may reduce the validity of the data [1].

For this reason, much attention is given to state-of-the-art data harmonization techniques, as evidenced by the development of software such as DataSHIELD and ViPAR [2, 3]. These initiatives facilitate data pooling from different studies while overcoming differences in the collection of the data. Regardless of whether data harmonization is performed using specific software or by hand, little is known about how much information is lost during the data harmonization process.

The aim of this study was to quantify the consequences of recoding variables for data pooling with regard to loss of information and the resulting loss of validity of study results. We therefore explored how much information was lost after recoding of variables, expressed as the proportion of the variance in the pooled variables that is explained by the data in the original datasets. In line with previous studies, we hypothesized that most variance would be lost when variables were recoded from continuous to discrete. We used data that were pooled from three clinical trials as well as simulated data to study this hypothesis. When a substantial loss of information occurred for a non-simulated variable, we additionally explored the potential impact on the validity of analysis outcomes. In a simulation study, we explored the influence of the range of continuous variables and the ratio of dichotomous variables on information loss. Finally, we share all recoding schemes that we used for pooling variables relevant for research on cardiovascular risk factors and cognitive decline.

Methods

Data collection

Individual participant data from three recently completed RCTs on multi-domain interventions to prevent cognitive decline or dementia, with a total of 6435 participants, were pooled. These were the Prevention of Dementia by Intensive Vascular Care trial (preDIVA, ISRCTN 29711771 [4]), the Finnish Geriatric Intervention Study to Prevent Cognitive Impairment and Disability trial (FINGER, NCT 01041989 [5]) and the Multidomain Alzheimer Preventive Trial (MAPT, NCT 00672685 [6]). The study teams of these three clinical trials collaborate in the HATICE consortium (www.hatice.eu) and are committed to sharing data with each other. All data had been pseudonymized prior to access for analysis. Study characteristics are shown in Table 1.

Recoding schemes

The pooled dataset consisted of 170 variables, of which 101 were available for the original FINGER subset of the database, 137 for the MAPT subset and 91 for the preDIVA subset. This sums to a total of 329 recoding schemes that were necessary to create the pooled variables. For 238 recoding schemes, no algorithmic transformation of the original variable was necessary to create the pooled variable. We did not perform analyses for these recoding schemes, since these would be redundant by definition. This means we conducted analyses only for the 91 recoding schemes in which an algorithmic transformation of the original variable(s) was required to create the pooled variable. Three different algorithmic transformations were used: (1) continuous variables that were recoded into discrete variables (Noriginal = 8); (2) discrete variables that were recoded into discrete variables with a different number or order of categories (Noriginal = 44); and (3) recoded variables that were based on multiple original variables (Noriginal = 39).

Statistical analyses

For the current analyses we focus on available baseline data. Handling missing data with context-free data encoding has been described previously [7, 8]. We used a stepwise procedure to first quantify the information loss of pooled variables, and subsequently explored the potential impact on future analyses by using variables that had lost most information (Fig 1).

Step 1: We used linear regression models with the recoded variables as dependent variables and the original variables as independent variables, and extracted the R-squared as a measure of explained variance. In case a recoded variable was based on multiple original variables, we used all of these as predictors in a single linear regression model. When the R-squared was at least 0.8, an arbitrarily set threshold, we considered the amount of information lost acceptable; otherwise we continued with step 2.
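
As an illustration, a minimal sketch of this step 1 check in R follows; the variable names, sample size and cut-off are hypothetical and do not correspond to a specific variable in the pooled dataset.

# Hypothetical example: a continuous score recoded into a dichotomous pooled variable
set.seed(1)
original_score <- rnorm(500, mean = 10, sd = 5)
recoded_score  <- as.numeric(original_score > 12)   # illustrative cut-off

fit <- lm(recoded_score ~ original_score)           # recoded as a function of original
summary(fit)$r.squared                              # proportion of explained variance

# When a pooled variable is based on multiple original variables, all of them
# enter one model, e.g. lm(recoded ~ original_1 + original_2)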

Step 2: To explore the potential impact of the information loss, we assessed in a linear regression model whether the Beta coefficient of an independent variable changed by more than 10% when using the pooled variable instead of the original variable as a confounder. A Beta coefficient is the degree of change in the dependent variable for each one-unit increase of the independent variable [9]. This is analogous to commonly applied criteria for confounders [10]. For these regression models we chose an independent and a dependent variable which, based on the literature, we expected to be associated with each other as well as with the confounder. We therefore used various independent and dependent variables for these step 2 analyses.
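
A minimal sketch of this step 2 comparison in R is shown below; the simulated data and variable names are hypothetical and only illustrate the 10% change-in-estimate check.

set.seed(2)
dat <- data.frame(predictor = rnorm(500))
dat$confounder_original <- 0.5 * dat$predictor + rnorm(500)          # continuous confounder
dat$confounder_recoded  <- as.numeric(dat$confounder_original > 0)   # dichotomized version
dat$outcome <- dat$predictor + dat$confounder_original + rnorm(500)

fit_original <- lm(outcome ~ predictor + confounder_original, data = dat)
fit_recoded  <- lm(outcome ~ predictor + confounder_recoded,  data = dat)

b_orig <- coef(fit_original)["predictor"]
b_rec  <- coef(fit_recoded)["predictor"]
abs(b_rec - b_orig) / abs(b_orig) > 0.10   # TRUE if the Beta changes by more than 10%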

In our simulation study we randomly generated continuous variables and recoded these to dichotomous variables. For these simulated variables we explored how much information was lost with a varying range of the continuous variable (type 1a), a varying ratio of recoded ones to recoded zeroes (type 1b) and a combination of both (type 2). For all simulations we sampled two sets of numbers. The sampling method for the first set was the same across iterations, while the second set differed in range of continuous numbers, in the ratio of recoded ones to recoded zeroes (by varying the number of values recoded to 1 while the number recoded to 0 remained the same), or both. For the first set, 1000 numbers between 0 and 1000 were sampled. For the second set we sampled K numbers between 1001 and N. For type 1a, K was kept constant at 1000, while the width of the second range was between 34 times smaller and 34 times larger than 1000. The lowest N was therefore 1030 (1000/34 + 1001) and the highest N was 35001 (1000 × 34 + 1001). For type 1b, N was kept constant at 2001, while K was between 34 times smaller and 34 times larger than 1000. The lowest K was therefore 29 (1000/34) and the highest K was 34000 (1000 × 34). For types 1a and 1b, each of the 67 simulations (a factor of 1 higher or lower is the same) was replicated 100 times. For our type 2 simulations, all combinations of K and N from types 1a and 1b were used, yielding 4489 (67 × 67) combinations. Each of these combinations was replicated 10 times. With this latter simulation type we tested to what extent information loss caused by a difference in range between the set 1 and set 2 numbers can be compensated for or exaggerated by varying the ratio of ones to zeroes. For all three types of simulations, numbers between 0 and 1000 were recoded to 0 and those above 1000 to 1. We could then use the same linear models as in the step 1 analyses, in which the recoded zeroes and ones were the dependent variables and the sampled numbers the independent variables. From these models we again extracted the R-squared to explore to what extent it changed depending on the difference in range of numbers, the ratio of ones to zeroes, or a combination of both. The full syntax of this simulation is available in the supplement.
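
The full simulation syntax is provided in the supplement; the sketch below condenses one iteration into a single R function, assuming uniform sampling, which may differ in detail from the supplementary code.

simulate_r2 <- function(K, N) {
  set1 <- runif(1000, min = 0,    max = 1000)   # values later recoded to 0
  set2 <- runif(K,    min = 1001, max = N)      # values later recoded to 1
  x <- c(set1, set2)                            # "original" continuous variable
  y <- as.numeric(x > 1000)                     # recoded dichotomous variable
  summary(lm(y ~ x))$r.squared                  # information retained after recoding
}

simulate_r2(K = 1000, N = 35001)   # type 1a: vary the range of set 2, K fixed at 1000
simulate_r2(K = 29,   N = 2001)    # type 1b: vary the ratio of ones to zeroes, N fixed at 2001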

Normality and homoscedasticity of residuals were checked for the linear regression models. The impact of violations of these assumptions is discussed at the end of the Results section. All analyses were conducted using RStudio [11], specifically the built-in package ‘stats’ [12] for the linear models and the additionally loaded packages ‘ggplot2’ [13] and ‘gridExtra’ [14] for the visualizations.
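
The exact diagnostics are not reported here; one common way to inspect these assumptions in R is sketched below, reusing the hypothetical model from the step 1 sketch.

res <- residuals(fit)                    # fit from the step 1 sketch above
qqnorm(res); qqline(res)                 # normality: points should follow the line
plot(fitted(fit), res,                   # homoscedasticity: no funnel shape expected
     xlab = "Fitted values", ylab = "Residuals")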

Participants gave written informed consent prior to their baseline visit. The preDIVA study was approved by the Medical Ethics Committee of the Academic Medical Center, Amsterdam. The MAPT trial protocol was approved by the French Ethical Committee located in Toulouse (CPP SOOM II) and was authorized by the French Health Authority. FINGER was approved by the coordinating ethics committee of the Hospital District of Helsinki and Uusimaa.

Results

For the eight continuous-to-discrete recoded variables, the median R-squared of the regression models, as described in step 1 of our analyses, was 0.54 (IQR: 0.37–0.67). For the 44 discrete-to-discrete recoded variables it was 0.97 (IQR: 0.92–1.00) and for the 39 multiple-to-single recoded variables 0.98 (IQR: 0.92–1.00) (Fig 2). All individual R-squareds and the recoding schemes, listed by content category, are provided in S1 File.

Fig 2. Pooling accuracy for three data recoding categories.

https://doi.org/10.1371/journal.pone.0232970.g002

Eight regression models yielded an R-squared below 0.8. For these we performed the exploratory analyses described in our step 2 method (Table 2). The Beta coefficient of the independent variable changed by more than 10% in four out of eight models depending on whether recoded or original confounders were used. In other words, in half of the models the Beta coefficient changed considerably when using the recoded instead of the original variables as confounders, which may lead to a different interpretation of the results.

Table 2. Change in the Beta coefficient of an association when using the recoded versus the original variable as a confounder, for variables with less than 80% explained variance after recoding, to assess the impact of information loss on the validity of associations.

https://doi.org/10.1371/journal.pone.0232970.t002

The type 1a simulation study, in which the range of numbers recoded to 1 differed from the range of numbers recoded to 0, yielded a median R-squared of 0.63 (IQR: 0.62–0.64). The median R-squared of the type 1b simulation, in which the ratio of ones to zeroes varied, was 0.38 (IQR: 0.30–0.51). This means that the R-squared is more negatively affected by differences in the number of ones compared to zeroes than by differences in range between the numbers recoded to 0 and those recoded to 1. When both were varied, in type 2 of our simulation study, the median R-squared was 0.54 (IQR: 0.16–0.74). When the range of the set 2 numbers increased, the R-squared increased if the ratio of ones to zeroes simultaneously decreased, whereas it decreased even further if the ratio of ones to zeroes also increased (Fig 3).

Fig 3. R2s for simulations in which numbers between 0 and 1000 are recoded to 0 and those above 1000 to 1.

For all iterations, 1000 numbers between 0 and 1000 were sampled and recoded to 0. Numbers that were recoded to 1 originated from sampling numbers between 1001 and N with sample size K. Left: K was a constant of 1000, N was between 1 and 34 times as high or low as 1000 (simulation type 1a, red circles); N was a constant of 2001, K was between 1 and 34 times as high or low as 1000 (simulation type 1b, blue triangles). Right: N was between 1 and 34 times as high or low as 1000 and K was between 1 and 34 times as high or low as 1000 (simulation type 2).

https://doi.org/10.1371/journal.pone.0232970.g003

For most linear regression models the assumptions of normality and homoscedasticity of residuals were met. In an exploratory analysis we assessed to what extent the R-squared changed when log-transformed rather than untransformed dependent and independent variables were used. We did this for all 91 linear regression models of pooled variables that required an algorithmic transformation. We observed a median change in the R-squared of 0.03 (IQR: 0.00–0.06) and a maximum change of 0.27. This suggests that using log-transformed variables for this purpose generally does not considerably alter the R-squared.
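
A minimal sketch of this log-transformation check in R follows; the data are hypothetical and an offset of 1 (log1p) is assumed so that zero values can be transformed, which may differ from the original analysis.

set.seed(3)
original <- rexp(500, rate = 0.1)        # hypothetical non-negative original score
recoded  <- as.numeric(original > 10)    # hypothetical dichotomization

r2_raw <- summary(lm(recoded ~ original))$r.squared
r2_log <- summary(lm(log1p(recoded) ~ log1p(original)))$r.squared
abs(r2_log - r2_raw)                     # change in R-squared after transformation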

Discussion

In pooling data from three RCTs, recoding of 91 variables resulted in a loss of explained variance of more than 20% in 8 variables. Most substantial loss was observed for variables that were recoded from continuous to discrete. Exploratory analysis suggested that the impact of recoded variables with substantial loss of explained variance on multivariate analyses might not be trivial.

To our knowledge, this study is the first to explore the degree of information in original data that is lost when harmonizing data, including analyses to assess potential consequences for the validity of findings using recoded variables from a pooled dataset. Although most of the recoded variables within the harmonized database appear to be valid and reliable, those variables that lost a substantial part of their explained variance following recoding should perhaps be left out of future analyses or be handled with caution. These include important areas of study that are measured on continuous scales, such as cognitive and depressive symptomatology.

Using data dictionaries such as CDISC when designing studies is recommended. Within a specific research field, a certain level of harmonization of assessment instruments would also reduce the need for recoding. However, even if such agreement is attained, recoding variables is sometimes inevitable, for example when pooling data from trials from countries with different standard measures. Specific attention should be paid to assessing potentially altered findings when analyzing data with variables that were recoded from a continuous to a discrete scale. We encourage other researchers conducting pooled analyses to include a quantification of the information that was lost due to the recoding process. We also recommend that when data have been recoded for a pooled analysis, the analysis should be repeated in the original dataset with both the original and the recoded variable, and the results reported (at least in a supplement), to illustrate the impact of recoding. The R-squared of linear regression models is appropriate as a crude summary to quantify the information loss of pooled variables. When the dependent variable is nominal, a linear regression may not be the appropriate analysis. For consistency and to enhance comparability of the R-squareds as a crude summary of information loss, we decided to use linear regression models for all types of recoded variables. More generally, this method may be less valid in case of violations of the assumptions of linear models. However, exploratory analysis showed that using log-transformed instead of untransformed variables has only limited impact on the R-squared (median change in R-squared after transformation across all 91 non-directly pooled models: 0.03 (IQR: 0.00–0.06)). Reporting the R-squared of key variables of pooled analyses in the summary results allows readers to assess the full impact of data harmonization on research findings.

As hypothesized, the impact of data harmonization is most substantial when a continuous variable is recoded into a dichotomous variable. This is in line with findings of previous studies [15, 16] and followed both from our main analyses and from our simulation study. Our different types of simulations showed that the R-squared is more influenced by differences in the ratio of ones to zeroes than by differences in range between the numbers recoded to 0 or 1. It also followed that an increasing difference in range between the two sets can be compensated for by decreasing the size difference between the two groups of numbers recoded to either 0 or 1, and vice versa. We do not, however, recommend excluding variables that have been recoded from continuous to dichotomous from pooled analyses merely to increase the overall explained variance of the pooled variables.

To conclude, large, pooled datasets provide important opportunities, justifying the efforts for data harmonization. However, caution is warranted when using recoded variables whose variance is poorly related to that of the original variables as this may jeopardize the validity of study results.

Supporting information

S1 File. Recoding schemes and R-squareds for all pooled variables.

https://doi.org/10.1371/journal.pone.0232970.s001

(DOCX)

Acknowledgments

The authors would like to thank the ‘Prevention of dementia by intensive vascular care’ (preDIVA) team, the ‘Finnish Geriatric Intervention Study to Prevent Cognitive Impairment and Disability’ (FINGER) team and the ‘Multidomain Alzheimer Preventive Trial’ (MAPT) team.

References

  1. Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2016;0:1–13
  2. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: Taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43:1929–1944 pmid:25261970
  3. Carter KW, Francis RW, Carter KW, Francis RW, Bresnahan M, Gissler M, et al. ViPAR: A software platform for the virtual pooling and analysis of research data. Int J Epidemiol. 2015;45:408–416 pmid:26452388
  4. Moll van Charante EP, Richard E, Eurelings LS, van Dalen J-W, Ligthart SA, van Bussel EF, et al. Effectiveness of a 6-year multidomain vascular care intervention to prevent dementia (preDIVA): A cluster-randomised controlled trial. The Lancet. 2016;388:797–805
  5. Ngandu T, Lehtisalo J, Solomon A, Levälahti E, Ahtiluoto S, Antikainen R, et al. A 2 year multidomain intervention of diet, exercise, cognitive training, and vascular risk monitoring versus control to prevent cognitive decline in at-risk elderly people (FINGER): A randomised controlled trial. The Lancet. 2015;385:2255–2263
  6. Andrieu S, Guyonnet S, Coley N, Cantet C, Bonnefoy M, Bordes S, et al. Effect of long-term omega 3 polyunsaturated fatty acid supplementation with or without multidomain intervention on cognitive function in elderly adults with memory complaints (MAPT): A randomised, placebo-controlled trial. The Lancet Neurology. 2017;16:377–389 pmid:28359749
  7. Hoevenaar-Blom MP, Guillemont J, Ngandu T, Beishuizen CRL, Coley N, Moll van Charante EP, et al. Improving data sharing in research with context-free encoded missing data. PLoS One. 2017;12:e0182362 pmid:28898245
  8. Meiller Y, Guillemont J, Beishuizen CR, Richard E, Andrieu S, Kivipelto M. An IS approach for handling missing data in collaborative medical research. Twenty-second Americas Conference on Information Systems. 2016:1–10
  9. Freedman DA. Statistical models: Theory and practice. Cambridge University Press; 2009.
  10. Lee PH. Is a cutoff of 10% appropriate for the change-in-estimate criterion of confounder identification? Journal of Epidemiology. 2014;24:161–167 pmid:24317343
  11. RStudio: Integrated development for R. 2016
  12. R Core Team. R: A language and environment for statistical computing. 2019
  13. Wickham H. ggplot2: Create elegant data visualisations using the grammar of graphics. 2016
  14. Auguie B. gridExtra: Miscellaneous functions for "grid" graphics. 2017
  15. Diniz MA, Tighiouart M, Rogatko A. Comparison between continuous and discrete doses for model based designs in cancer dose finding. PLoS One. 2019;14:e0210139 pmid:30625194
  16. Shaw DG, Huffman MD, Haviland MG. Grouping continuous data in discrete intervals: Information loss and recovery. Journal of Educational Measurement. 1987;24:167–173