Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis

Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner; on behalf of TG3 of the STRATOS Initiative

doi:10.1371/journal.pone.0295726

Abstract

Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings is an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.

Citation: Lusa L, Proust-Lima C, Schmidt CO, Lee KJ, le Cessie S, Baillie M, et al. (2024) Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis. PLoS ONE 19(5): e0295726. https://doi.org/10.1371/journal.pone.0295726

Editor: Arka Bhowmik, Memorial Sloan Kettering Cancer Center, UNITED STATES

Received: November 27, 2023; Accepted: March 13, 2024; Published: May 29, 2024

Copyright: © 2024 Lusa et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data are available for research purposes at https://share-eric.eu/data/data-access.

Funding: Lara Lusa was partially supported by the research program "Methodology for data analysis in medicine" (P3-0154) of the Slovenian Research and Innovation Agency (ARIS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Initial data analysis (IDA) is the part of the data pipeline that commonly takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. The main aim of IDA is to provide reliable knowledge about data so that appropriate statistical analyses can be conducted, ensuring transparency, integrity, and reproducibility. This is necessary for accurate interpretation of results to answer predefined research questions.

A general framework of IDA for observational studies includes the following six steps: (1) metadata setup (to summarize the background information about data), (2) data cleaning (to identify and correct technical errors), (3) data screening (to examine data properties), (4) initial data reporting (to document findings from the previous steps), (5) refining and updating the research analysis plan, and (6) documenting and reporting IDA in research papers [1]. Statistical practitioners often do not perform such necessary steps in a systematic way. They may combine data screening steps with analyses steps leading to ad-hoc decisions; however, maintaining a structured workflow is a fundamental step towards reproducible research [2].

The value of an effective IDA strategy for data analysts lies in ensuring that data are of sufficient quality, that model assumptions made in the analysis strategy are satisfied and are adequately documented; IDA also supports decisions for the statistical analyses [3]. IDA data screening investigations could lead to discovery of data properties that may identify errors beyond those addressed in data cleaning, affect the interpretation of results of statistical models, and/or modify the choices linked to the specification of the model.

There are many tools for data screening, including numerical data summaries and data visualizations, that are common for IDA and for exploratory data analysis (EDA, [4]) [5]. However, IDA and EDA have different objectives and differ in emphasis. EDA inspects data to uncover patterns in the data or identifies possible errors and anomalies, typically through graphical approaches [6]. In addition, EDA may include investigations for model building and hypothesis generation in a cyclical manner [7, 8].

In contrast, IDA systematically examines data properties and their appropriateness for the intended statistical analyses to answer a pre-specified research question. Thus, IDA is more closely linked to inference than EDA, prioritizing the assurance that data quality is sufficient. A systematic process for IDA is crucial, and steps for IDA should be included in a statistical analysis plan.

The term IDA was first used by Chatfield [9], aimed at data description and model formulation. However, Chatfield also recognized that post-IDA decisions could bias analyses with spurious deviations to the original analysis plan [10]. For a more in-depth discussion on the relationship between IDA and EDA, interested readers can refer to [11] and the introduction chapter of [6].

The STRengthening Analytical Thinking for Observational Studies (STRATOS) Topic Group 3 on Initial data analysis was established in 2014 [12] to provide guidance on these controversial issues; they proposed an IDA framework aimed at separating activities to understand data properties and purposefully refraining from hypothesis-generating activities.

An IDA checklist for data screening in the context of regression models for continuous, count or binary outcomes was proposed recently [13], not considering outcomes that were of survival-type, multivariate or longitudinal. The goal of this study is to extend the checklist to longitudinal studies, where participants are measured repeatedly over time and the main research question is addressed using a regression model; the focus is on data screening (IDA step 3), where the examination of the data properties provide the data analyst with important information related to the intended analysis. While an investigation of missing data, and univariate and multivariate description of variables is common across studies [13], longitudinal studies pose additional challenges for IDA. Different time metrics, the description of how much much data was collected through the study (how many observations and at which times), missing values across time points, including drop-out, and longitudinal trends of variables should be considered. Model building and inference for longitudinal studies have received much attention [14], and many textbooks on longitudinal studies discuss data exploration and the specific challenges due to missing values [14–16]; however, a systematic process for data screening is missing.

We propose a comprehensive checklist for the data screening step of the IDA framework, which includes data summaries to help understanding data properties, and their potential impact on the analyses and interpretation of results. The checklist can be used for observational longitudinal studies, which include panel studies, cohort studies, or retrospective studies. Potential applications include medical studies designed to follow-up with patients over time and electronic health records with longitudinal observations and complex surveys. Other aspects of the IDA framework such as data preparation and data cleaning have been discussed elsewhere [17, 18].

This paper is an effort to bring attention to a systematic approach of initial data analysis for longitudinal studies that could affect the analysis plan, presentation, or interpretation of modeling results. This contributes to the general aim of the international initiative STRATOS (http://stratos-initiative.org, [12]).

We outline the setting and scope of our paper in Section 2. We describe the necessary steps for data screening of longitudinal studies in Section 3, where a check list is also provided. A case study is presented in Section 4, using hand grip strength from a data collection across several waves in a complex survey [19]; we present several data summarizations and visualizations, and provide a reproducible R vignette for this application. Possible consequences of the IDA findings for the analyses in this case study are presented in Section 4.5, where we discuss the potential implications to the statistical modeling or interpretation of results based on the evaluation of the data properties. In the discussion we emphasize the importance of integrating IDA into study protocols and statistical analysis plans and comment on the anticipated utility of our checklist and reproducible code for researchers working with longitudinal data.

2 Setting and scope

A plan for data screening should be matched to the research aims, study settings, and analysis strategy. We assume that the study protocol describes a research question that involves longitudinal data, where the outcome variable is measured repeatedly over time, and is analysed using a regression model applied to all time points or measurements. We assume that baseline explanatory variables are measured, and consider also the possibility of time-varying explanatory variables.

“Measurement” in longitudinal studies could refer to a data collection with survey instruments, interviews, physical examinations, or laboratory measurements. Time points at which the measurements are obtained and the number of measurements can vary between individuals. Time series, time-to-event models, or applications where the number of explanatory variables is extremely large (omics/high-dimensional) are out-of-scope for this paper. We assume that only one outcome variable is measured repeatedly over time, but most of the considerations would apply also to longitudinal studies with multiple outcomes. We focus on observational longitudinal studies, but most of the explorations that we propose would also be appropriate for experimental studies.

Important prerequisites for a data-screening checklist have been described in [13]. A clearly defined research question must be defined, and an analysis strategy for addressing it must be known. The analysis strategy includes the type of statistical model for longitudinal data, defines variables to be considered for the model, expected methods for handling missing data, and model performance measurements. A statistical analysis plan can be built from the analysis strategy and the data screening plan. Structural variables in the context of IDA were introduced in [13]; these are variables that are likely to be critical for describing the sample (e.g. variables that could highlight specific patterns) and that are used to structure IDA results. They can be demographic variables, variables central to the research aim, or process variables (e.g., variables that describe the process under which data was collected, might be centers where data is collected, providers, locations); they may or be not also explanatory variables used in the analysis strategy. They help to organize IDA results to provide a clear overview of data properties. In particular, this might reduce the number of cocievable multivariable explorations. For example, the attention can be directed towards a detailed exploration of the association between the explanatory variables and structural variables, rather than presenting all possible associations among explanatory variables [13].

Moreover, data summaries stratified by structural variables can provide further insights about the data. For example, important characteristics of the data collection process can be apparent when the data are stratified by the center where data were collected. Key predictors in a model, such as sex and age could reveal important information about the variable distributions in the study population.

We assume that data retrieval, data management and data cleaning (identification of errors and inconsistencies) have already been performed. These aspects comprise specific challenges with longitudinal data, where data sets are prepared in multiple formats (long, one row per measurement, the preferred format for data modeling, and wide, one row per participant, for data visualizations), the harmonization of variable definitions across measurements/over time is often needed, and inconsistencies of repeated measurements across time might be identified during data cleaning. A data dictionary and sufficient meta-data should be available to clarify the meaning and expected values of each variable and information about study protocol and data collection.

An important principle of IDA is, as much as possible, to avoid hypothesis-generating activities. Therefore, in the data-screening process, associations between the outcome variable and the explanatory variables are not evaluated. However, evaluating the changes of the outcome in time is part of the outcome assessment in the IDA for longitudinal data.

Because longitudinal studies can be very heterogeneous in their data structure, it is challenging to propose a unified data-screening checklist. The topics addressed in this paper and summarized in our checklist can be considered a minimum set of analyses to include in an IDA report for transparency and reproducibility to prepare for the statistical modeling that addresses the research questions; the optional extensions present explorations that might be relevant only in some studies.

3 IDA data-screening checklist for longitudinal data

An IDA data-screening domain refers to the type of explorations, such as evaluating missing data, describing univariate and multivariate distributions, as proposed in the checklist for regression models in cross-sectional studies [13]. To address the specificities of longitudinal studies we extend these IDA domains to include participation profile and longitudinal aspects. The missing values domain is substantially extended, and the univariate and multivariate descriptions include explorations at time points after baseline.

Several items of the IDA screening checklist suggest to summarize data for each time point, which is sensible for study designs where all the individuals have pre-planned common times of measurements or when the number of different times is limited; in this case these times can be used as structural variables in IDA (for instance time visits or waves). For studies where the time points are many and/or uncommon, or not determined by design (random times of observation), we suggest that, for description purposes, the time metric is summarized in intervals and the summaries are provided by time intervals rather than for each of the time points.

The aims of the IDA screening domains and the main aspects of each domain are presented in the following sections, and summarized in Table 1.

Download:

Table 1. Initial data analysis checklist for data screening in longitudinal studies.

https://doi.org/10.1371/journal.pone.0295726.t001

3.1 Participation profile

Aim: (1) to summarize the participation pattern of individuals in the study over time; (2) to describe the time metric(s).

A participation profile represents the temporal patterns in participation. The number of participating individuals, the number of times they were measured, and the distribution of the number of measurements per time point and per individual are described in this IDA screening domain.

Different choices of time metrics are possible depending on the research question. It can be time since inclusion in the study, time since an event, calendar time, age, or measurement occasion (defined as order of pre-planned measurement times for a participant). In some studies, it may be useful to use more than one time metric to describe the study.

Most timescales induce subject-specific times of measurements, which is naturally handled in regression analyses (for instance with mixed models that use the actual times of measurement and where using measurements at time points that are not common for all subject is not problematic), but this poses an additional challenge for summary statistics during IDA steps. When subject-specific times remain closely linked to a shared timescale, for instance planned visits or waves (nominal times), IDA can be done according to the shared timescale, with a mention of the variability the approximation in time induced. The deviations between nominal and actual times should also be explored. In other contexts, for instance, when using age as the time scale in cohorts with heterogeneous ages at baseline, relevant intervals of time need to be considered for summary statistics and overall trends.

The description of the number of observations at each time point in studies with pre-planned times of observations provides information about missing values (discussed more in detail in the next domain), while it does not in study designs that foresee random times of observation [16].

3.2 Missing values

Aim: (1) to describe missing data over time and by types of missingness (non-enrollment, intermittent visit missingness, loss to follow-up, missing by design, or death); (2) to summarize the characteristics of participants with missing values over time; (3) to describe the variables with missing values; (4) to find possible patterns of missing data across variables; and (5) to evaluate possible predictors of missingness and missing values.

Longitudinal data with complete information are very rare, and missing data are one of the major challenges in the design and analysis of longitudinal studies. Different analysis methods can rely on different assumptions about the missing data mechanism. An incorrect handling of missing data can lead to biased and inefficient inference [20]. Therefore, a thorough investigation of the pattern of observed missingness before the beginning of the statistical analysis can have major implications for the interpretation of the results or imply possible changes in the analysis strategy.

In longitudinal studies it is important to distinguish between unit missingness (of participants) due to non-enrollment (participants that fulfill inclusion criteria that do not participate in the study), intermittent visit missingness (a missing visit) and drop-out, defined as visit missingness due to attrition/loss-to-follow-up (missing values for participants that previously participated in the study). Additionally, participants may have incomplete follow-up due to death.

It is also possible that some variables (outcome and/or explanatory variables) are missing among participants for which the measurements of the other variables are available at the same visit. This type of partial missingness, at variable rather than participant level, is often defined as variable or item missingness. The methods used to handle different types of missingness in the analyses may differ (for example, survey weights, multiple imputation, maximum likelihood estimation). The analysis strategy determines which aspects of missing value is important to describe.

Missing values in explanatory variables can be handled either by considering complete cases or by performing multiple imputation (MI) [21, 22]. In maximum likelihood mixed-based models the imputation of outcome is not needed, as the model intrinsically handles the missing data in the outcome. In survey studies, unit non-enrollment missingness is often addressed using survey weights, which can be used to adjust the analyses for the selection of participants that makes the sample non-representative of its target population.

The number and the known characteristics of the non-responders should be described, along with the characteristics of the participants that are lost during follow-up, the corresponding time points and reasons, if available, and the time of last observed response.

To understand how non-enrollment influences the characteristics of the available sample, some of the main characteristics of the enrolled and non-enrolled (the participants that were selected but did not enter the study) can be compared, if data are available, or the sample of enrolled can be compared to the target population. It is also useful to estimate the probability of drop-out after inclusion during study, stratifying by structural variables. The display of the mean outcome as a function of time stratified by different drop-out times can suggest a relationship between the outcome and the drop-out process [15].

For item missingness, the frequency and reasons for missing data within single explanatory variable, and the co-occurrence of missing values across different variables (for example, using visualization techniques such as clustering of indicators of missing values) may be used to identify patterns of missingness. The characteristics and number of the participants for which an individual item is missing can also be described separately.

Predictors of missing values can be identified by comparing the characteristics of subjects with complete and incomplete data at each measurement occasion; it is common to compare the baseline characteristics, where the extent of missing values is usually smaller compared to longitudinal measurements.

Another aim within this domain can be to identify potential auxiliary variables, i.e., variables not required for the analysis but that can be used to recover some missing information through their association with the incomplete variables, for example via inclusion in an imputation model (if envisioned in the analysis strategy) or for the construction of survey weights. As this often requires looking at the association between variables, this can be assessed via the multivariate descriptions.

3.3 Univariate descriptions

Aim: (1) to describe all variables that are used in the analysis (outcomes, explanatory variables, structural variables, auxiliary variables) with numerical and graphical summaries at baseline; (2) to describe the time-varying variables at all time points.

The univariate descriptions explore the characteristics of the variables, one at a time. The results can be used to evaluate if the observed distributions are as expected, or to identify problematic aspects (unexpected values, sparse categories, etc). Descriptive statistics can be used to summarize the variables, as described in [13].

The time-varying variables should be summarized also at time points after baseline. As evoked earlier, discretization into intervals may be indicated if the time metric is on a continuous rather than on a categorical scale and the number of different observed times is large. Different time metrics can be used to summarize the variables. Using the time metric of the data collection process can be useful for the identification of data collection problems (e.g., specific characteristics or problems in some waves). In contrast, the time metric linked to the analysis strategy can provide more useful information about the distributions of the variables to be modelled.

3.4 Multivariate descriptions

Aim: (1) to describe associations between explanatory variables and structural variables or process variables; (2) to describe associations between explanatory variables (focusing mostly on baseline values); (3) to provide stratified summaries of the data.

The explorations proposed in the multivariate domain are very similar to those proposed in the context of IDA for regression modeling [13], and include the exploration of associations between explanatory variables with structural variables, and the evaluations of associations among explanatory variables. If interactions between explanatory variables are considered, the exploration of the association between these variables should be carefully addressed in IDA [13].

We suggest to focus primarily on associations between variables at baseline (where usually the missing values are less common). Follow-up times can be considered if the aim is to evaluate if/how the associations change during follow-up; however, the interpretations should be cautious, as the results are based only on observed data and the missing data mechanism that occurs during follow-up can alter the associations.

The distributions of explanatory variables stratified by the values of the structural variables are also described in the multivariate descriptions; the considerations about the influence of missing values on the results apply also for these descriptions; numerical structural variables might require some type of discretization.

We emphasize that a core principle of IDA is to refrain from conducting analyses related to the research question, avoiding numerical or graphical exploration of associations between explanatory variables and the outcome.

3.5 Longitudinal aspects

Aim: (1) to describe longitudinal trends of the time-varying variables including changes and variability within and between subjects; (2) to evaluate the strength of correlation of the repeated measurements across time points.

The exploration of the characteristics of the participants through time is of upmost importance and should be described using the time metric chosen in the analysis strategy. The repeated measurements from the same subject in longitudinal studies are usually correlated, thus IDA should explore the trend of the repeated variables but also the degree of dependence within subjects by evaluating the variance, covariance and correlation on repeated measurements of the outcome variable.

The time-varying explanatory variables can be explored; these explorations are useful for providing domain experts a description of some of the characteristics of the sample that can be compared to the expected. As discussed earlier, descriptive summaries based on the observed longitudinal data might be biased, and should therefore be interpreted carefully.

In many applications it is important to summarize the cohort (individuals who experience the same event in the same time) or period (time when the participants are measured) effect on the outcome and on the explanatory variables. The design of the longitudinal study might make the effect of age, cohort and period difficult to separate and subject to confounding. The results from IDA explorations might indicate the need to take cohort or period effects into account in the modelling.

4 Case study: Age-associated decline in grip strength in the Danish data from the SHARE study

To illustrate the use of the data-screening checklist for longitudinal data we conducted the IDA screening step for a case study, where the research aim was to evaluate the age-associated decline in grip strength. An IDA data screening plan was developed (S1 File) and a reproducible and structured IDA report for the analysis was implemented using R language [23] (version 4.0.2) and made available at https://stratosida.github.io/longitudinal/; the report presents the full IDA data screening results and provides the R code for reproducibility.

First, we briefly illustrate the data and the analysis strategy and present the IDA data screening plan; a selected set of explorations are presented in the results, and the possible consequences of the IDA findings are reported and discussed in Section 4.5.

4.1 SHARE data

We used the data from the Survey of Health, Ageing and Retirement in Europe (SHARE). SHARE is a multinational panel data survey, collecting data on medical, economic and social characteristics of about 140,000 unique participants after age 50 years, from 28 European countries and Israel [19]. The SHARE study contains health, lifestyle, and socioeconomic data at individual and household level. These data have been collected over several waves since 2004, using questionnaires and conducting a limited number of performance evaluations. The baseline and the longitudinal questionnaires differ in some aspects, and some questions were modified during the course of the study; in wave 3 and partly in wave 7 a different questionnaire (retrospective SHARELIFE) was used to collect retrospective information about participants. Leveraging these data for research purposes can be daunting due to the complex structure of the longitudinal design with refresher samples organized in 25 modules with about 1000 questions. Functions written in the R language [23] are available that facilitate data extraction and data preparation of SHARE data [17].

We provide an explanation and elaboration of an IDA checklist for data screening using SHARE data collected during the first seven waves 2004 to 2017 in Denmark, which based the selection of participants on simple random sampling.

4.2 Study aims and corresponding analysis strategy

The research question aims at assessing the age-associated decline of hand grip strength by sex, after adjusting for a set of explanatory variables that are known to be associated with the outcome (weight, height, education level, physical activity and smoking). Here we give a basic overview of the corresponding statistical analysis strategy.

The study population are individuals from Denmark aged 50 or older at first interview. The outcome is maximum grip strength measured at different interviews (recorded with a hand-held dynamometer, assessed as the maximum score out of two measurements per hand). The time metric is the age at interview. The time-fixed variables evaluated at first interview are sex, height and education (categorized in three levels); the time-varying variables are weight, physical activity (vigorous or low intensity, both dichotomized) and smoking status. Interaction terms between age and all the time-fixed variables (sex, education, height) will be included in the pre-specified statistical analysis models to evaluate the association between these time-fixed variables with the trajectory of the outcome; the main interest is in the interpretation of the interaction terms between sex and functions of age on the grip strength. Nonlinear functional forms for continuous variables will be assessed using linear, quadratic, and cubic polynomials.

A linear mixed model [24] is planned to be used to address the research question. The trajectory over time of the outcome is explained at the population level using fixed effects and individual-specific deviations from the population trajectory are captured using random effects to account for the intra-individual serial correlation. The model accommodates individual-specific times of outcome measurements.

The linear mixed model, estimated by maximum likelihood, is robust to missing at random outcome data, that is when the missingness can be predicted by the observations (outcome and explanatory). Missing data at variable/item level (for the time-fixed explanatory variables) can be handled either by considering complete cases or by performing multiple imputation. We will use data from the SHARE study that are publicly available upon registration for use for research purposes (http://www.share-project.org/data-access.html). All analyses will be carried out using R statistical language [23].

4.3 IDA plan

The detailed IDA data screening plan for this study includes most of the points included in our checklist, describing the specific explorations that should be addressed and their aim (S1 File).

Structural variables in the context of IDA are: sex and grouped age (because of their known association with the outcome), wave and type of interview (baseline vs. longitudinal) (because of differences in data collection process).

4.4 Results of IDA

Here we present the main IDA findings for each domain; the consequences are discussed in the next section.

4.4.1 Participation profile.

The interviews were carried out between April 2004 and October 2017, in seven waves (Fig 1). The median time between interviews in successive waves was about 2 years, the longest times passed between wave 1 and 2 (median: 2.5 years, see the online IDA report for more details).

Download:

Fig 1. Distribution of the number of interviews carried out in Denmark in the SHARE study in time.

https://doi.org/10.1371/journal.pone.0295726.g001

Overall, 5452 unique participants were interviewed 18632 times during the study. The number of participants who attended the interview in each wave, stratified by baseline wave are shown in Fig 2, which highlights that new participants (refreshment samples) were included during the study and that wave 5 had the most interviews. The exploration of the age at inclusion shows that full range refreshment samples were used in wave 2 and 5, and refreshment samples only of the younger people in wave 4 and 6 (Fig 3), as described in the study protocol.

Download:

Fig 2. Number of participants in each wave, stratified by baseline wave.

https://doi.org/10.1371/journal.pone.0295726.g002

The median and modal number of interviews per participant was 3, 18% were interviewed only once, only 22% were interviewed 6 or 7 times (Table 2); further aspects about drop-out are discussed in the missing value domain.

Download:

Table 2. Number of interviews per participant.

https://doi.org/10.1371/journal.pone.0295726.t002

Age is the time metric of interest in the analysis described in the analysis strategy, therefore its distribution is described in the participation profile. In later waves the participants were on average older (for example, the median age increased from 62 to 66 from wave 1 to wave 7), but the age distribution in the sample and in the target population was similar. Fig 3 shows the distribution of age over waves, overall and by type of interview.

Download:

Fig 3. Distribution of age across waves and by baseline or longitudinal/SHARELIFE interview.

Note that SHARELIFE interviews were conducted in waves 3 (all participants) and 7 (60% of the participants).

https://doi.org/10.1371/journal.pone.0295726.g003

The participation profile highlighted the complexity of the study design and the fact that most participants were measured few times; it also provided information about the distribution of age, which is the continuous time of interest and for which we did not identify any specific problems.

4.4.2 Missing data.

The characteristics of non-enrolled subjects could be studied only through the comparison of the observed samples with some known characteristics of the target population (sex, age and education composition, data were available from the statistical office of the European Union—Eurostat https://commission.europa.eu/index_en, from year 2007, wave 2 of the study). The aim of this comparison is to evaluate if the sample differs from the target population.

The responders that participated in the survey at least once had substantially higher education compared to the population in the same age and sex groups; males in the younger age groups were slightly underrepresented, as were older females (Fig 4 for the distribution of education for data from wave 2, the complete results are similar for the other waves and presented in the online IDA report).

Download:

Fig 4. Distribution of education in the population in year 2007 in Denmark and the refreshment sample of wave 2, by sex and age group.

The analyses were limited to the ages between 50 and 85, as population data on education were unavailable for older inhabitants; the sample displayed from wave 2 is a random sample used in this wave as refreshment sample; details are given in the online IDA report.

https://doi.org/10.1371/journal.pone.0295726.g004

Many participants that entered the study had missing data during the longitudinal follow-up. The deaths were reported with high quality and timely, as only 1% of the participants had unknown vital status at the end of the study and overall, 978 deaths were reported by wave 7.

In Fig 5 participants were classified in seven categories based on their participation at each measurement occasion. The figure highlights that some participants had intermittent missingness, missingness by design because participants were not eligible (out-of-household) was very rare, while administrative censoring was common due to the study design (for example, many new participants were included in wave 5 and the follow-up ended in wave 7), and so were deaths and losses to follow-up (missing and out-of-sample).

Download:

Fig 5. Number of participants with observed and missing data by measurement occasion and by type of missingness.

Interview: participant participated with a valid interview; intermittent missingness: missing at measurement occasion but with valid interview later; missing: missing at measurement occasion and no interview later; out-of-sample: was removed from the sample because lost to follow-up (by study definition after at least three missing interviews, here the definition was applied retrospectively); out-of-household: not interviewed because not member of the household; death: died at measurement occasion or earlier; administrative censoring: did not have interview because the study ended.

https://doi.org/10.1371/journal.pone.0295726.g005

For the analysis purposes, the participants of some of the groups described in Fig 5 would be classified as lost to follow-up (out-of-sample, missingness, out-of-household if not re-included in the sample later); using this definition, we estimated the probability of loss to follow-up, death and death after follow-up. Estimate of cumulative incidence functions (using Aalen-Johansen estimators for loss to follow-up and deaths) indicated that the probability of loss to follow-up was virtually the same across age groups and sex. In contrast, the probability of death prior and post loss to follow-up substantially increased with age, as expected, and tended to be higher for males at younger ages (Fig 6). Additional analyses showed that participants that died differed from the others also because they were more frequently smokers, had lower education and engaged in less physical activity, and had considerably lower levels of grip strength at baseline measurement (online IDA report); compared to complete responders, those that dropped out of the study for reasons different than death, had lower education, had less physically activity and smoked more frequently (online IDA report).

Download:

Fig 6. Cumulative incidence estimates of loss to follow-up, death without loss to follow-up and death after loss to follow-up, stratified by sex and age category.

The first two incidence functions are obtained using the Aalen-Johansen estimator, the third is based on the Kaplan-Maier estimator. Aalen-Johansen estimators, stratified by sex and age group at first interview, obtained using the survivalR package.

https://doi.org/10.1371/journal.pone.0295726.g006

The mean outcome profiles of participants that died during follow-up were lower compared to those that survived, especially among older males (Fig 7, left panel), while the difference in outcome between complete and incomplete cases due to loss to follow-up was smaller (Fig 7, right panel).

Download:

Fig 7. Mean maximum grip strength for groups with reported death (left panel) or with loss to follow-up (right panel) at different measurement occasions, stratified by age group and sex.

Participants classified in the groups still in the cohort had complete measurements for 7 waves.

https://doi.org/10.1371/journal.pone.0295726.g007

We explored the amount of missing outcomes among the interviews that were conducted (item missingness in the outcome) to evaluate the frequency of outcome missingness with valid interview, and its association with the characteristics of the participants. The amount of this type of outcome missingness varied from 2.2 to 6.5% across measurement occasions, females had more missing values than males and the proportion of missingness increased with longer follow-up (Table 4) and with age (Table 3). Participants with missing outcome data were unable to perform the grip strength test in 36% of cases, indicating that missing values might be related to bad physical conditions; 21% refused to take the measurement, 2% had a proxy interview, while the reason for missingness was unknown for the others.

Download:

Table 3. Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.

https://doi.org/10.1371/journal.pone.0295726.t003

Download:

Table 4. Percentage (%) and number (n) of missing values in the explanatory variables and outcome by measurement occasion and sex.

PA: physical activity. Here we show only the first interview data for variables used as time-fixed in the model (height, education and smoking—following the change suggested by IDA) and remove the observations missing by design.

https://doi.org/10.1371/journal.pone.0295726.t004

There was no clear association between missingness in different measurement occasions in the outcome, and a relatively small proportion of subjects had outcome missingness in more than one occasion, when the interview was performed (Fig 8).

Download:

Fig 8. Co-occurrence of outcome missingness across measurement occasions.

The number on the bars indicate the number of participants that have certain variables missing together (the missing variables are indicated using dots on the horizontal axis, M1_NA indicates that the variable is missing at first measurement occasion, etc.).

https://doi.org/10.1371/journal.pone.0295726.g008

In this case study, item missingness of the explanatory variables is considered separately from unit missingness, as the analysis strategy considers using multiple imputations to handle item missingness of the explanatory variables, or complete case analysis if the amount of missing values is relatively small.

Some of the time-varying explanatory variables were missing by design (weight in wave 3 and physical activity variables in SHARELIFE interviews, current smoking in longitudinal interviews in waves 6 and 7), as highlighted by Fig 9. The analyst might thus decide to consider smoking status at baseline rather than current smoking in the statistical analysis. Item missingness was very low for all variables when missing by design missingness was not considered (Table 4).

Download:

Fig 9. Graphical representation of the percentage of missing values (item missingness) for time varying variables, stratified by wave and type of interview and for the outcome.

By design new participants were not included in wave 3 or 7, SHARELIFE interviews were conducted in wave 3 (all participants) and in partly in wave 7 (only for participants that did not have a SHARELIFE interview in wave 3, about 60%). n is the sample size.

https://doi.org/10.1371/journal.pone.0295726.g009

4.4.3 Univariate descriptions.

The characteristics of the participants at baseline interview are summarized in Table 5 (overall and by sex, discussed in the multivariate descriptions). The summary statistics did not indicate specific problems (unexpected location or variability values for numerical variables, sparse categories for categorical variables).

Download:

Table 5. Descriptive statistics of the baseline characteristics of n = 5452 participants, overall and stratified by sex.

Md (Q1, Q3) represent the median, lower quartile and the upper quartile for continuous variables. Numbers after percentages are frequencies.

https://doi.org/10.1371/journal.pone.0295726.t005

The variables weight, height, and grip strength were reported with terminal digit preference (values ending with 0 and 5 were more frequent than expected). Fig 10 shows the distribution of grip strength and indicates that digit preferences did occur with examiners choosing more likely numbers ending with 0 or 5. This likely increases measurement error, and the IDA suggests that the impact on regression analyses would be worth exploring. The bimodality of the distribution is due to the inclusion of males and females, as shown in multivariate descriptions.

Download:

Fig 10. Distribution of maximum grip strength across all participants (gray bars indicate numbers ending with figure 0 or 5).

https://doi.org/10.1371/journal.pone.0295726.g010

4.4.4 Multivariate descriptions.

Sex is a structural variable in our case study, therefore the distributions of all the explanatory variables, stratified by sex, are explored in the multivariate descriptions. Females and males differed substantially in the distribution of height, weight, vigorous (but not low-intensity) physical activity, and education, while the distribution of age and the proportion of current smokers was similar (S2 File and online IDA report).

The bimodal distribution of grip strength was explained by the large average differences between males and females and the histogram of age indicated that a Gaussian distribution assumption at each wave is appropriate when separated by sex (Fig 11).

Download:

Fig 11. Generalized pairs plot for grip strength, across waves and by sex.

https://doi.org/10.1371/journal.pone.0295726.g011

As expected, at baseline the couples of variables with highest positive correlation were weight and height, and the two variables measuring physical activity (Table 6 for overall correlations and Fig 12 stratifying by sex).

Download:

Fig 12. Correlation between explanatory variables at baseline, stratified by sex; the education levels were used as numbers.

https://doi.org/10.1371/journal.pone.0295726.g012

Download:

Table 6. Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.

https://doi.org/10.1371/journal.pone.0295726.t006

Age was negatively correlated with all the explanatory variables, the negative association between age and education among females was the highest, which can be explained by the study design, where there is a strong association between age and birth year. The SHARE study encompasses multiple age cohorts, followed in different calendar periods, as summarized in S2 File and as expected due to the design of the cohort.

4.4.5 Longitudinal aspects.

To visualize individuals’ grip strength trajectories we used profile plots; interactive plots are also available (online IDA report). The profiles based on subgroups of participants facilitate the visualizations of individual trajectories (Fig 13 and S2 File show 100 random participants per group of initial grip strength quantile), which are not visible using complete data when the number of participants is large (Fig 14 and S2 File show the profiles for all data). Even though age was included as a continuous time metric in the analysis strategy, a summary stratified by ten-year groups can serve as a quick overview of the longitudinal trends by age. The graphs that use age as a time metric give an idea of the shape of trajectory for model specification (which has to be determined a priori), those based on measurement occasion give a clearer overview of the individual trajectories, as participants enter the study at different ages.

Download:

Fig 13. Profile plots of grip strength across measurement occasion, for a subset of participants (the selection of 100 participants for each group is based on the quantile of grip strength at baseline).

https://doi.org/10.1371/journal.pone.0295726.g013

Download:

Fig 14. Profile plots of grip strength across measurement occasion for all participants, stratified by sex; age is used as time metric.

https://doi.org/10.1371/journal.pone.0295726.g014

The profile plots highlight the trend towards the diminishing grip strength with age and show that the rate of change seems to accelerate over age (the slope at later ages is bigger than at the beginning). Older participants are followed up for shorter times, substantial increases or decreases in grip strength between measurement occasions can also be observed. The variability of the outcome tended to decrease at later measurement occasions, especially in the older age groups.

The scatterplots of the outcome measurements across waves and their correlations are shown using a generalized pairs plot (Fig 11); across waves there were no substantial differences in the correlations (slightly lower in wave 1) or variability of the outcome (Table 6).

The correlation of grip strength between measurements taken at different ages, indicated that the serial correlation was very high, generally above 0.70 for two-year periods and reduced slightly with the distance; correlations were generally slightly larger for males than for females (Fig 15).

Download:

Fig 15. Correlation between successive outcome measurements taken at different ages in males and females; age of participants is grouped in two-year classes.

Only estimates based on more than 20 observations are shown; the correlation between each pair of variables is computed using all complete pairs of observations on those variables.

https://doi.org/10.1371/journal.pone.0295726.g015

Fig 16 shows the smoothed estimated association between age and outcome for females, stratifying the data for grouped year of birth cohorts, and compares them to the estimates obtained using all longitudinal data, or cross-sectional data from only the first interview. Heterogeneity in the association between age and the outcome across year of birth cohorts were observed also for males, or considering different waves, and similar year of birth cohort effects could be observed for weight or height (online IDA report). These summaries should not be overintepreted, as they are not robust to missing data and assume independence between repeated measurements, but they suggest once again the potential importance of taking into account the year of birth cohort effect in the modelling, which can be addressed more formally during the modelling of the data.

Download:

Fig 16. Estimated association between age and grip strength within different subsets of data; participants are stratified in grouped year of birth cohorts.

Black lines are the estimates using all longitudinal data (dashed line) and cross-sectional data from the first interview (solid line).

https://doi.org/10.1371/journal.pone.0295726.g016

Finally, the longitudinal changes of vigorous physical activity at least once a week is examined graphically, using a Sankey diagram (Fig 17). The graph highlights that the transitions between active/not active state are common and that missing data are common (missing by design and losses to follow-up). These explorations are useful for providing domain experts a description of some of the characteristics of the sample that can be compared to the expected characteristics.

Download:

Fig 17. Sankey diagram of vigorous activity status across measurement occasions (all participants are displayed, with different reasons for missing values, measurement occasions are displayed from M1 (leftmost) to M7 (rightmost)).

https://doi.org/10.1371/journal.pone.0295726.g017

4.5 Examples of potential consequences of data screening

Table 7 lists examples of how results from the IDA data screening could lead to new considerations for the data analyses.

Download:

Table 7. Potential consequences of data screening.

https://doi.org/10.1371/journal.pone.0295726.t007

5 Discussion

IDA is crucial to ensure reliable knowledge about data properties and necessary context so that appropriate statistical analyses can be conducted and pitfalls avoided [3]. Often it is not transparent in publications what initial analyses researchers conducted, and the reporting is poor. A multitude of decisions after examining data have an impact on results and conclusions [28].

This work builds on the IDA checklist for regression models with a continuous, count or binary outcome [13]. Addressing longitudinal cases requires additional considerations. This includes describing time metrics, assessing participation over time, conducting a thorough examination of missing data and patterns, and examining changes in time-varying variables. In the longitudinal setting IDA explorations can quickly become overwhelming even with a small number of variables. To manage this, we provide guidance with explanations and elaborations of the items that should be explored prior to undertaking the analysis detailed in the analysis strategy or in the statistical analysis plan (SAP).

An aim of IDA is to examine data properties that support the use of statistical methods, and IDA findings may provide insights for interpretation and presentation of model results. For example, when using mixed models to analyze longitudinal data, many different options exist for using random effects across different time functions and/or addressing autocorrelation. Within a parametric model, IDA findings might suggest suitable bases for time functions. Changes to the statistical models may relate to the selection and modeling of explanatory variables, strategies for handing informative drop-outs, or adjustments for variables associated to drop-out and/or selection. One outcome of IDA could be specifying sensitivity analyses (for example, modifying the functional form of an explanatory variable and evaluating the robustness of the results).

Reporting guidelines such as the STROBE checklist [29] include a limited number of IDA elements such as characteristics of study participants, number of missing participants, information about confounders, summary of follow-up time, and summary measures over time. However, our IDA recommendations are more comprehensive and provide suggestions of summaries and visualization to understand details about the data and possible implications for the analyses.

It is important to remember that an IDA workflow is not a standalone procedure but is closely linked to the study protocol and the analysis strategy or the SAP. Per ICH guidelines for good clinical practice [30], a SAP describes the variables and outcomes that will be collected and includes ‘detailed procedures for executing the statistical analysis of the primary and secondary variables and other data’.

Guidelines for SAPs in clinical trials [31] and in observational studies [32] mention time points at which the outcomes are measured, timing of lost to follow-up, missing data, description of baseline characteristics and outcomes. By describing choices for the statistical methods, the SAP identifies what data properties should be explored and then anticipates how these could be addressed in the statistical models. However, a carefully conducted IDA workflow as proposed here can improve the understanding of potential sources of bias. There may be unanticipated findings, or IDA could potentially reveal different ways of conducting the analyses. Adjustments to the analytic strategy in light of IDA discoveries are permissible, but they should be thoroughly documented and justified to ensure complete transparency [33, 34]. For example, the recently proposed framework for the treatment and reporting of missing data in observational studies places a strong emphasis on pre-specifying how missing data will be handled, highlighting the need to assess the validity of the pre-planned methods once the data are available by examining the data [22]. Thus, a SAP needs to include both an IDA plan and details of the analysis strategy for transparency and reproducibility and to avoid ad-hoc decisions.

A worked example with available data and reproducible R code including many effective data visualizations is provided to help data analysts with performing IDA data screening for longitudinal studies.

To summarize, we provide recommendations for a check list for IDA data screening in longitudinal studies. This includes examples for data visualizations to enable researchers follow a systematic approach and reproducible strategies. IDA reports examine data properties to assure assumptions for statistical models are satisfied and aid in accurately interpreting model results. It can facilitate discussion among research teams about expectations of data properties and lead to a more thorough understanding about the potential for their study.

Supporting information

S1 File. IDA plan for the application presented in the paper.

Detailed description of the IDA plan.

https://doi.org/10.1371/journal.pone.0295726.s001

(PDF)

S2 File. Additional tables and figures.

https://doi.org/10.1371/journal.pone.0295726.s002

(PDF)

Acknowledgments

This work was developed as part of the international initiative of Strengthening Analytical Thinking for Observational Studies (STRATOS). The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies (http://stratos-initiative.org/). Members of the Topic Group Initial Data Analysis of the STRATOS Initiative are Mark Baillie (Switzerland), Marianne Huebner (USA), Saskia le Cessie (Netherlands), Lara Lusa (Slovenia), Michael Schell (USA) and Carsten Oliver Schmidt (Germany).

References

1. Huebner M, le Cessie S, Schmidt C, Vach W. A Contemporary Conceptual Framework for Initial Data Analysis. Observational Studies. 2018;4:171–192.
- View Article
- Google Scholar
2. Stoudt S, Vásquez VN, Martinez CC. Principles for data analysis workflows. PLOS Computational Biology. 2021;17(3):e1008770. pmid:33735208
- View Article
- PubMed/NCBI
- Google Scholar
3. Baillie M, le Cessie S, Schmidt CO, Lusa L, Huebner M, Topic Group ‘Initial Data Analysis’ of the STRATOS Initiative. Ten simple rules for initial data analysis. PLoS Computational Biology. 2022;18(2):e1009819. pmid:35202399
- View Article
- PubMed/NCBI
- Google Scholar
4. Tukey JW. Explanatory Data Analysis. vol. 2. Reading, MA; 1977.
5. Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Applied Sciences. 2022;12(9):4238.
- View Article
- Google Scholar
6. Cook D, Swayne DF. Interactive and Dynamic Graphics for Data Analysis: With R and Ggobi. New York, NY: Springer New York; 2007.
7. Behrens JT. Principles and procedures of exploratory data analysis. Psychological methods. 1997;2(2):131.
- View Article
- Google Scholar
8. Wickham H, Çetinkaya-Rundel M, Grolemund G. R for data science. O’Reilly Media, Inc.; 2023.
9. Chatfield C. The initial examination of data. Journal of the Royal Statistical Society: Series A (General). 1985;148(3):214–231.
- View Article
- Google Scholar
10. Chatfield C. Confessions of a pragmatic statistician. Journal of the Royal Statistical Society Series D: The Statistician. 2002;51(1):1–20.
- View Article
- Google Scholar
11. Cook D, Reid N, Tanaka E. The foundation is available for thinking about data visualization inferentially. Harvard Data Science Review. 2021;3.
- View Article
- Google Scholar
12. Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J, STRATOS Initiative. STRengthening analytical thinking for observational studies: the STRATOS initiative. Statistics in Medicine. 2014;33(30):5413–5432. pmid:25074480
- View Article
- PubMed/NCBI
- Google Scholar
13. Heinze G, Baillie M, Lusa L, Sauerbrei W, Schmidt C, Harrell F, et al. Regression without regrets—initial data analysis is an essential prerequisite to multivariable regression; PREPRINT (Version 1) available at Research Square https://doi.org/10.21203/rs.3.rs-3580334/v1. November 14, 2023.
14. Verbeke G, Molenberghs G. Linear Mixed Models for Longitudinal Data. Springer—Verlag, New York; 2000.
15. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. Oxford University Press; 2002.
16. Weiss RE. Modeling Longitudinal Data. Springer; 2005.
17. Lusa L, Huebner M. Organizing and Analyzing Data from the SHARE Study with an Application to Age and Sex Differences in Depressive Symptoms. International Journal of Environmental Research and Public Health. 2021;18(18). pmid:34574605
- View Article
- PubMed/NCBI
- Google Scholar
18. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology. 2021;21(1):1–15. pmid:33810787
- View Article
- PubMed/NCBI
- Google Scholar
19. Börsch-Supan A, Brandt M, Hunkler C, Kneip T, Korbmacher J, Malter F, et al. Data resource profile: the Survey of Health, Ageing and Retirement in Europe (SHARE). International Journal of Epidemiology. 2013;42(4):992–1001. pmid:23778574
- View Article
- PubMed/NCBI
- Google Scholar
20. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. British Medical Journal. 2009;338.
- View Article
- Google Scholar
21. Little RJ, Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons; 2019.
22. Lee KJ, Tilling KM, Cornish RP, Little RJ, Bell ML, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. Journal of Clinical Epidemiology. 2021;134:79–88. pmid:33539930
- View Article
- PubMed/NCBI
- Google Scholar
23. R Core Team. R: A Language and Environment for Statistical Computing; 2021. Available from: https://www.R-project.org/.
24. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982; p. 963–974. pmid:7168798
- View Article
- PubMed/NCBI
- Google Scholar
25. Rouanet A, Avila-Rieger J, Dugravot A, Lespinasse J, Stuckwisch R, Merrick R, et al. How selection over time contributes to the inconsistency of the association between sex/gender and cognitive decline across cognitive aging cohorts. American Journal of Epidemiology. 2021;191(3):441–452.
- View Article
- Google Scholar
26. Wagner M, Grodstein F, Proust-Lima C, Samieri C. Long-term trajectories of body weight, diet, and physical activity from midlife through late life and subsequent cognitive decline in women. American Journal of Epidemiology. 2020;189(4):305–313. pmid:31781745
- View Article
- PubMed/NCBI
- Google Scholar
27. Bertoni M, Maggi S, Weber G. Work, retirement, and muscle strength loss in old age. Health Economics. 2018;27(1):115–128. pmid:28512845
- View Article
- PubMed/NCBI
- Google Scholar
28. Ebrahim S, Sohani ZN, Montoya L, Agarwal A, Thorlund K, Mills EJ, et al. Reanalyses of randomized clinical trial data. JAMA. 2014;312(10):1024–1032. pmid:25203082
- View Article
- PubMed/NCBI
- Google Scholar
29. Vandenbroucke J, von Elm E, Altman D, Gøtzsche P, Mulrow C, Pocock S, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and Elaboration. PLoS Medicine. 2007;4(10):e297. pmid:17941715
- View Article
- PubMed/NCBI
- Google Scholar
30. International Conference on Harmonisation of technical requirements for registration of pharmaceuticals for human use. ICH harmonized tripartite guideline: Guideline for Good Clinical Practice. Journal of Postgraduate Medicine. 2001;47(1):45–50.
- View Article
- Google Scholar
31. Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Doré C, et al. Guidelines for the Content of Statistical Analysis Plans in Clinical Trials. JAMA. 2017;318(23):2337–2343. pmid:29260229
- View Article
- PubMed/NCBI
- Google Scholar
32. Hiemstra B, Keus F, Wetterslev J, Gluud C, van der Horst IC. GEBATE-statistical analysis plans for observational studies. BMC Medical Research Methodology. 2019;19(233). pmid:31818263
- View Article
- PubMed/NCBI
- Google Scholar
33. Thomas L, Peterson ED. The value of statistical analysis plans in Ooservational research: defining high-quality research from the start. JAMA. 2012;308(8):773–774. pmid:22910753
- View Article
- PubMed/NCBI
- Google Scholar
34. Islam N, Cole TJ, Ross JS, Feeney T, Loder E. Post-submission changes to prespecified statistical analysis plans. British Medical Journal. 2022;378. pmid:36130792
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Huebner M, le Cessie S, Schmidt C, Vach W. A Contemporary Conceptual Framework for Initial Data Analysis. Observational Studies. 2018;4:171–192.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Stoudt S, Vásquez VN, Martinez CC. Principles for data analysis workflows. PLOS Computational Biology. 2021;17(3):e1008770. pmid:33735208
View Article
PubMed/NCBI
Google Scholar

[5] View Article

[6] PubMed/NCBI

[7] Google Scholar

[ref3] 3. Baillie M, le Cessie S, Schmidt CO, Lusa L, Huebner M, Topic Group ‘Initial Data Analysis’ of the STRATOS Initiative. Ten simple rules for initial data analysis. PLoS Computational Biology. 2022;18(2):e1009819. pmid:35202399
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Tukey JW. Explanatory Data Analysis. vol. 2. Reading, MA; 1977.

[ref5] 5. Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Applied Sciences. 2022;12(9):4238.
View Article
Google Scholar

[14] View Article

[15] Google Scholar

[ref6] 6. Cook D, Swayne DF. Interactive and Dynamic Graphics for Data Analysis: With R and Ggobi. New York, NY: Springer New York; 2007.

[ref7] 7. Behrens JT. Principles and procedures of exploratory data analysis. Psychological methods. 1997;2(2):131.
View Article
Google Scholar

[18] View Article

[19] Google Scholar

[ref8] 8. Wickham H, Çetinkaya-Rundel M, Grolemund G. R for data science. O’Reilly Media, Inc.; 2023.

[ref9] 9. Chatfield C. The initial examination of data. Journal of the Royal Statistical Society: Series A (General). 1985;148(3):214–231.
View Article
Google Scholar

[22] View Article

[23] Google Scholar

[ref10] 10. Chatfield C. Confessions of a pragmatic statistician. Journal of the Royal Statistical Society Series D: The Statistician. 2002;51(1):1–20.
View Article
Google Scholar

[25] View Article

[26] Google Scholar

[ref11] 11. Cook D, Reid N, Tanaka E. The foundation is available for thinking about data visualization inferentially. Harvard Data Science Review. 2021;3.
View Article
Google Scholar

[28] View Article

[29] Google Scholar

[ref12] 12. Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J, STRATOS Initiative. STRengthening analytical thinking for observational studies: the STRATOS initiative. Statistics in Medicine. 2014;33(30):5413–5432. pmid:25074480
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref13] 13. Heinze G, Baillie M, Lusa L, Sauerbrei W, Schmidt C, Harrell F, et al. Regression without regrets—initial data analysis is an essential prerequisite to multivariable regression; PREPRINT (Version 1) available at Research Square https://doi.org/10.21203/rs.3.rs-3580334/v1. November 14, 2023.

[ref14] 14. Verbeke G, Molenberghs G. Linear Mixed Models for Longitudinal Data. Springer—Verlag, New York; 2000.

[ref15] 15. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. Oxford University Press; 2002.

[ref16] 16. Weiss RE. Modeling Longitudinal Data. Springer; 2005.

[ref17] 17. Lusa L, Huebner M. Organizing and Analyzing Data from the SHARE Study with an Application to Age and Sex Differences in Depressive Symptoms. International Journal of Environmental Research and Public Health. 2021;18(18). pmid:34574605
View Article
PubMed/NCBI
Google Scholar

[39] View Article

[40] PubMed/NCBI

[41] Google Scholar

[ref18] 18. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology. 2021;21(1):1–15. pmid:33810787
View Article
PubMed/NCBI
Google Scholar

[43] View Article

[44] PubMed/NCBI

[45] Google Scholar

[ref19] 19. Börsch-Supan A, Brandt M, Hunkler C, Kneip T, Korbmacher J, Malter F, et al. Data resource profile: the Survey of Health, Ageing and Retirement in Europe (SHARE). International Journal of Epidemiology. 2013;42(4):992–1001. pmid:23778574
View Article
PubMed/NCBI
Google Scholar

[47] View Article

[48] PubMed/NCBI

[49] Google Scholar

[ref20] 20. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. British Medical Journal. 2009;338.
View Article
Google Scholar

[51] View Article

[52] Google Scholar

[ref21] 21. Little RJ, Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons; 2019.

[ref22] 22. Lee KJ, Tilling KM, Cornish RP, Little RJ, Bell ML, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. Journal of Clinical Epidemiology. 2021;134:79–88. pmid:33539930
View Article
PubMed/NCBI
Google Scholar

[55] View Article

[56] PubMed/NCBI

[57] Google Scholar

[ref23] 23. R Core Team. R: A Language and Environment for Statistical Computing; 2021. Available from: https://www.R-project.org/.

[ref24] 24. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982; p. 963–974. pmid:7168798
View Article
PubMed/NCBI
Google Scholar

[60] View Article

[61] PubMed/NCBI

[62] Google Scholar

[ref25] 25. Rouanet A, Avila-Rieger J, Dugravot A, Lespinasse J, Stuckwisch R, Merrick R, et al. How selection over time contributes to the inconsistency of the association between sex/gender and cognitive decline across cognitive aging cohorts. American Journal of Epidemiology. 2021;191(3):441–452.
View Article
Google Scholar

[64] View Article

[65] Google Scholar

[ref26] 26. Wagner M, Grodstein F, Proust-Lima C, Samieri C. Long-term trajectories of body weight, diet, and physical activity from midlife through late life and subsequent cognitive decline in women. American Journal of Epidemiology. 2020;189(4):305–313. pmid:31781745
View Article
PubMed/NCBI
Google Scholar

[67] View Article

[68] PubMed/NCBI

[69] Google Scholar

[ref27] 27. Bertoni M, Maggi S, Weber G. Work, retirement, and muscle strength loss in old age. Health Economics. 2018;27(1):115–128. pmid:28512845
View Article
PubMed/NCBI
Google Scholar

[71] View Article

[72] PubMed/NCBI

[73] Google Scholar

[ref28] 28. Ebrahim S, Sohani ZN, Montoya L, Agarwal A, Thorlund K, Mills EJ, et al. Reanalyses of randomized clinical trial data. JAMA. 2014;312(10):1024–1032. pmid:25203082
View Article
PubMed/NCBI
Google Scholar

[75] View Article

[76] PubMed/NCBI

[77] Google Scholar

[ref29] 29. Vandenbroucke J, von Elm E, Altman D, Gøtzsche P, Mulrow C, Pocock S, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and Elaboration. PLoS Medicine. 2007;4(10):e297. pmid:17941715
View Article
PubMed/NCBI
Google Scholar

[79] View Article

[80] PubMed/NCBI

[81] Google Scholar

[ref30] 30. International Conference on Harmonisation of technical requirements for registration of pharmaceuticals for human use. ICH harmonized tripartite guideline: Guideline for Good Clinical Practice. Journal of Postgraduate Medicine. 2001;47(1):45–50.
View Article
Google Scholar

[83] View Article

[84] Google Scholar

[ref31] 31. Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Doré C, et al. Guidelines for the Content of Statistical Analysis Plans in Clinical Trials. JAMA. 2017;318(23):2337–2343. pmid:29260229
View Article
PubMed/NCBI
Google Scholar

[86] View Article

[87] PubMed/NCBI

[88] Google Scholar

[ref32] 32. Hiemstra B, Keus F, Wetterslev J, Gluud C, van der Horst IC. GEBATE-statistical analysis plans for observational studies. BMC Medical Research Methodology. 2019;19(233). pmid:31818263
View Article
PubMed/NCBI
Google Scholar

[90] View Article

[91] PubMed/NCBI

[92] Google Scholar

[ref33] 33. Thomas L, Peterson ED. The value of statistical analysis plans in Ooservational research: defining high-quality research from the start. JAMA. 2012;308(8):773–774. pmid:22910753
View Article
PubMed/NCBI
Google Scholar

[94] View Article

[95] PubMed/NCBI

[96] Google Scholar

[ref34] 34. Islam N, Cole TJ, Ross JS, Feeney T, Loder E. Post-submission changes to prespecified statistical analysis plans. British Medical Journal. 2022;378. pmid:36130792
View Article
PubMed/NCBI
Google Scholar

[98] View Article

[99] PubMed/NCBI

[100] Google Scholar

Figures

Abstract

1 Introduction

2 Setting and scope

3 IDA data-screening checklist for longitudinal data

3.1 Participation profile

3.2 Missing values

3.3 Univariate descriptions

3.4 Multivariate descriptions

3.5 Longitudinal aspects

4 Case study: Age-associated decline in grip strength in the Danish data from the SHARE study

4.1 SHARE data

4.2 Study aims and corresponding analysis strategy

4.3 IDA plan

4.4 Results of IDA

4.4.1 Participation profile.

4.4.2 Missing data.

4.4.3 Univariate descriptions.

4.4.4 Multivariate descriptions.

4.4.5 Longitudinal aspects.

4.5 Examples of potential consequences of data screening

5 Discussion

Supporting information

S1 File. IDA plan for the application presented in the paper.

S2 File. Additional tables and figures.

Acknowledgments

References