Figures
Abstract
Analytical validation is a crucial step in the evaluation of algorithms that process data from sensor-based digital health technologies (sDHTs). Analytical validation of novel digital measures can be complicated when reference measures with directly comparable units are not available. To address this, we conducted a simulation study. Data was simulated assuming a latent physical ability trait, indirectly accessed through an sDHT-derived target measure collecting step count data, and the items of a clinical outcome assessment (COA) measuring self-reported physical activity. We quantified the ability of two methods to assess the latent relationship between reference and target measures: the Pearson Correlation Coefficient (PCC) and factor correlations from a two-factor confirmatory factor analysis (CFA) model. Additionally, three multiple linear regression models were used to evaluate if multiple COA reference measures can more completely represent a target measure of interest. Our findings show that PCC was more stable, easier to compute, and relatively robust with respect to violations of parametric assumptions than CFA, particularly with small sample sizes. However, CFA was less biased than PCC in all scenarios investigated. We demonstrate that using both PCC and CFA generates more confidence in the results of a target and reference measure comparison. Finally, regression results suggest that incorporating multiple reference measures with more frequent collection time points can provide a more complete presentation of the sDHT’s analytical validity. Novel digital measures are being developed at an accelerating pace and promise to revolutionize patient care and medical product development. Our findings provide investigators with crucial information for choosing appropriate methods to perform rigorous analytical validation of these novel measures, including an open-access simulation toolkit.
Citation: Turner S, Chen C, Acosta R, Chon R, Daza EJ, Floden L, et al. (2026) Methods for analytical validation of novel digital clinical measures: A simulation study. PLoS One 21(5): e0308190. https://doi.org/10.1371/journal.pone.0308190
Editor: Ambiga Natesan, Sri Akilandeswari Women’s College, INDIA
Received: July 18, 2024; Accepted: March 16, 2026; Published: May 13, 2026
Copyright: © 2026 Turner et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All simulation code, simulated data, full results, the data generating mechanism, and guidance of how to adapt these materials to different digital measure scenarios, are available open-access through the website of the Digital Medicine Society (https://datacc.dimesociety.org/validating-novel-digital-clinical-measures). The code and simulated data is housed on GitHub (https://github.com/Digital-Medicine-Society/VNDCM-Simulation-Toolkit).
Funding: This work was supported by Arnold Ventures (grant number 23-08673). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: ST is a contractual employee of the Digital Medicine Society. CC reports employment and equity ownership in Verily Life Sciences. PG is a contractual employee of the Digital Medicine Society and President of SeeingTheta. The Digital Medicine Society received a grant from Arnold Ventures (URL: https://www.arnoldventures.org) to support this work (grant number 23-08673). Arnold Ventures had no involvement in the design of the simulation study, analysis and interpretation of the simulated data, or writing of the report; nor do they require any restrictions regarding the submission of the report for publication.
Introduction
For a sensor-based digital health technology (sDHT) to aid scientific and clinical decision-making, its sensors must be verified against a technical specification, algorithms analytically validated, usability validated, and clinical validation conducted to ensure relevance in specific contexts, as outlined in the V3 + framework by the Digital Medicine Society (DiMe) [1,2].
Analytical validation confirms that an sDHT algorithm accurately captures physiological or behavioral measures against a reference standard. How to identify a fit-for-purpose reference measure is outlined in the framework developed by Bakker et al. [3]). When no reference measures exist with directly comparable units, clinician-reported outcomes (ClinROs) and patient-reported outcomes (PROs) are often used [4,5], particularly in challenging therapeutic areas [6,7] such as neurodegenerative diseases [8], mental health, and cardiometabolics [9,10].
However, these types of reported episodic reference measures may not correlate well with intensive longitudinal data from sDHTs [11], complicating analytical validation. They typically do not directly assess the desired concept or construct and have inherent measurement errors that make it challenging to compare them directly to a sDHT-derived measure. Directly evaluating the correspondence between the target and reference measures is often infeasible, and may not be needed.
For instance, the Movement Disorder Society-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) [12] is a common reference for Parkinson’s disease but is known to be subjective with high within-subject variability and low test-retest reliability [13]. Comparisons between sDHT measures and MDS-UPDRS often yield poor results, which does not necessarily disprove the validity of the sDHT-derived measure.
The focus has shifted from replicating MDS-UPDRS components to developing fit-for-purpose sDHT measures that objectively assess similar concepts. This highlights the importance of convergent validity [14], which examines relationships between sDHT outputs and reference measures without directly comparable units, assuming the existence of an underlying latent trait. Statistical methods like correlation, regression, and confirmatory factor analysis (CFA) can estimate these relationships.
Our work explores the statistical properties of Pearson correlation coefficients (PCC) and confirmatory factor analysis (CFA) in evaluating convergent validity. We also investigate the validity of a given target measure by evaluating how multiple reference measures can improve an analytical validation strategy. These results should be used to inform statistical analysis planning when conducting analytical validation with a reference measure without directly comparable units. Additionally, the simulation approach provides a way to test an analytical validation approach in silico before investing additional resources.
Methods
We evaluated statistical methods for analytically validating a digital measure (target) against reference measures of clinical outcome assessments (COAs). We simulated total daily step count data from a fictive sDHT as a measure of physical activity. The associated latent trait is physical ability which fluctuates daily. Additionally, we simulated COAs that measure patient or clinician-reported physical activity.
Software
All simulations and analyses were conducted in R version 4.3.3 (R Core Team, 2021) on a Windows 11 operating system.
Data generation mechanism (DGM)
We simulated seven days of sDHT data (i.e., the target measure), and data from multiple COAs (i.e., the reference measures). The seven-day period was chosen to balance the need for sufficient data collection while minimizing participant burden and potential dropout rates in real-world studies. Fig 1 depicts a summary of the DGM we used (Table 1).
C{1-6} = fluctuation factors 1-6; COA = clinical outcome assessment; GRM = graded response model; LT = latent trait; LTC = latent trait for the COA; LT-sDHT = latent trait for the sDHT.
Simulating the latent variables
The data generation mechanism simulated latent variables from a standard Normal distribution. The latent variable for each subsequent day was simulated by adding a fluctuation factor to the latent variable from the previous day, and then rescaling to give a sample variance of 1.0 for the data across all individuals on that day. The fluctuation factor was taken from a Normal variable with mean 0 and standard deviation (SD) 0.30. This SD value represents a moderate daily variability in physical ability, based on the physical activity patterns referenced in Fig 1.
Simulating the target measure data
We chose to use the stimulation of seven daily counts for each individual based on contexts-of-use and population data [15–17]. Data were to be assessed on consecutive days; effects of weekdays and weekends were not specified.
To mimic the imperfect ability of the target measure to observe the latent trait, Gaussian noise with mean 0 and SD 0.5 was added to to obtain
for individual j on day i. This “method filter” did not vary between days.
To simulate the count data, a random value was drawn from a Poisson distribution with mean , where
was calculated as:
Here, is the hypothesized mean of the target measure for a latent variable value of zero (i.e., an individual from the population with mean physical ability). This term is independent of an individual’s latent ability. In this study,
= 10,000. The
term represents the proportional effect of an individual’s latent physical ability on their expected target measure count, and scales an individual’s “filtered” latent trait. We set
= 1,250 (i.e., the effect of latent ability on step count). The final term (
) represents measurement error, and is Normally distributed with mean 0 and SD
. This term is varied as a simulation parameter. Each Poisson mean was constrained as
. The observed count data for individual j on day i is denoted as
The values were chosen to reflect a correlation matrix representative of real-world physical activity measurements [18], considering typical step count ranges and variability observed in population-based studies [15–17].
Simulating the reference measure data
Primary reference measure: Weekly PRO.
The primary reference measure for the study is a weekly PRO, a four-response, twelve-item patient-reported outcome administered at the end of the seven-day period. Since each individual is assumed to recall information across the entire study period when responding to the PRO, their daily latent variable values influence their answers. To model this, we used the mean of each individual’s latent traits over the seven simulated days, denoted as , as the starting point.
We assumed individuals have an imperfect ability to perceive their value when recalling their activities during the previous seven days. To simulate this, we added random noise generated from a Normal distribution with mean 0 and SD 1.0.
denotes the “perception-filtered” latent trait for each individual j.
Each individual’s value was then used to simulate their responses to the twelve-item PRO with four-response options. The item scores were simulated using an item response theory (IRT) graded response model (GRM). The difficulty threshold parameters took the following fixed values across each individual:
The discrimination parameter α was set to a fixed value across each individual, chosen to result in an approximate reliability of 0.8 for the simulated PRO. A total score was then derived as the sum of the item scores, rescaled to a 0–100 scale. These total scores are the final reference measure data that were compared to the target measure data, and are denoted by for individual j.
Secondary reference measures: Weekly ClinRO and daily PRO
The secondary aim of this study investigates whether introducing multiple reference measures could provide more information for assessing the convergent validity of a digital measure. We simulated two additional types of clinical outcome assessments: (1) weekly ClinRO: A 7-item questionnaire with 5 response options; and (2) daily PRO: A single-item measure with 5 response options.
The weekly ClinRO was simulated in a similar way to the weekly PRO. We simulated an individual’s imperfect recall over the seven days by applying a perception filter to and adding random noise generated from a Normal distribution with mean 0 and SD 1.0. We then used an IRT GRM to simulate each individual’s responses to the seven items on the five-point scale. The difficulty threshold parameters took the following fixed values across each individual:
Here,
The discrimination parameter α was set to a fixed value across each individual, chosen to result in an approximate reliability of 0.7 for the simulated ClinRO. A total score was calculated as the sum of the item scores, rescaled to a 0–100 scale. These rescaled total scores were the variables analyzed in our study.
The single-item daily PRO was simulated using the approach in Griffiths et al [19]. The response of an individual j to a single item (on a five-point [0–4] scale) on day i depends on .
As before, we added random noise generated from a Normal distribution with mean 0 and SD 0.5 to mimic the imperfect ability of an individual to perceive their value. This “perceived latent trait” value is denoted
, and is assumed to exactly determine (i.e., without noise) the response of individual j on day i.
We assumed the thresholds at which an individual gives a particular response were −1.5, −0.5, 0.5, and 1.5. For example, a perceived latent trait value of −1.6 should on average result in a score of 0; a perceived latent trait value of 0.8 should on average result in a score of 3.
Finally, we assume that these score threshold values vary slightly based on an individual’s own perception. What one individual may consider to be “vigorous” physical activity, another may consider “moderate”. Therefore, each individual’s score thresholds were generated from Normal distributions with means −1.5, −0.5, 0.5 and 1.5, each with SD 0.075. An individual’s value was then compared against their score thresholds to determine their daily PRO score on day i, and this process was repeated for each of the seven days in the study.
Statistical methods
We used PCC and CFA to assess the relationship between the sDHT measure and the weekly PRO. Of note, we assumed Normal distributions for many of our variables. This may not apply to all clinical variables in real-world scenarios.
First, the true relationship between the target measure and the reference measure was calculated. This was done by computing , the linear correlation between the mean of the sDHT “method-filtered” latent values and the “perception-filtered” mean of the latent values.
Two estimates of the true relationship were computed. First, the PCC between the weekly mean of the observed target measure values ( and the observed reference measure values (
) was computed. Second, a two-factor CFA model with correlated factors was created, and the model-implied correlation coefficient of the factors was obtained. To construct the CFA model, each item from the reference measure was loaded as an indicator onto a “reference measure factor”, and each day of the target measure data was loaded as an indicator onto a separate “sDHT” factor. The standardized value of the factor covariance was computed, and these were the factor correlation variables analyzed in our study. Fig 2 depicts the path diagram for the two-factor CFA model.
For the secondary aim of the study, simple linear regression (SLR) and multiple linear regression (MLR) were used to assess the impact of using more than one reference measure to conduct analytical validation.
For each regression model, was used as the dependent variable. Ordinary least squares was used for the SLR model, with
as the predictor variable. MLR was employed in three different ways: First, by introducing the weekly ClinRO as a predictor variable alongside the weekly PRO (the “weekly COAs model”); second, by introducing the mean of the daily PRO values as a third predictor variable alongside the weekly COAs (the “daily means” model); and lastly, by instead introducing the individual daily PRO scores as seven distinct predictor variables alongside the weekly COAs (the “daily distinct” model).
For each linear regression model used, a coefficient of determination was computed: R2 statistic in the SLR case and adjusted R2 in the MLR case, calculated as , where n is the number of observations and p is the number of reference measures.
Parameters of the simulation
The simulation varied the number of included COA measures and other parameters to assess their impact on the performance of the statistical methods investigated. A full factorial design was used to test the following parameters:
- Sample size (N = 35, 100, 200, 1000)
- Number of repeated assessments of the target measure included in the analysis (RA = 1, 3, 5, 7)
- Missing data rate (MDR = 0, 0.10, 0.25, 0.40)
- Missing data mechanism (No missing data, missing completely at random (MCAR), missing not at random (MNAR)) [20,21]
- sDHT-measurement error magnitude (MEM):
For a more detailed explanation of the missingness mechanisms employed, and the implementation of fewer repeated assessments of the target measure, in the simulation, please see the Supplementary Materials.
Performance measures
Primary aim of the study.
To assess the ability of PCC and CFA factor correlation to observe the underlying relationship between the target and primary reference measure, the empirical bias of both methods was computed. The empirical bias of a method, defined in terms of statistical estimators, is defined as the difference between 1) the true relationship between the target measure and the reference measure, and 2) the sample-based estimate of this true relationship from that method. The empirical bias for PCC is the difference between and
. For CFA, the empirical bias is the difference between
and the factor correlation estimate.
To assess the precision of the above methods, the empirical standard error (empSE) of the PCC and the CFA factor correlation were computed for each simulation condition. EmpSE is defined as the square root of the variance of each method’s correlation estimates.
In addition, four model fit statistics were calculated for the CFA model to evaluate the goodness of fit between the proposed model and the observed data. The Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) were used as fit statistics. The rate for which the model fit statistics suggested an acceptable model fit are reported for each simulation condition. Interpretation of the fit statistics was guided by the following thresholds [22,23]: (1) CFI and TLI acceptable fit: values ≥ 0.90; and (2) RMSEA and SRMR acceptable fit: values < 0.08. The convergence, or failure rates (), for the CFA model are reported for each simulation condition.
Secondary aim of the study
To assess the ability of the MLR models to recover more information about the underlying target and reference measure relationship than the SLR model, the change in information added, denoted , was calculated for each MLR model.
is defined as the difference between the MLR adjusted R2 estimate and the SLR R2 estimate; in other words,
. To assess the precision of each regression method, for each simulation condition, the empSE of the SLR model and each MLR model were calculated. Finally, any failures that occurred in each of the regression models are reported.
Number of simulation repetitions and Monte Carlo standard error
The number of simulation repetitions (nsim) was chosen to ensure that the Monte Carlo Standard Error (MCSE) of the empSE for each statistical method was less than 0.025. To achieve this, nsim must satisfy the inequality , where
is the sample variance of the estimates [24,25]. Based on a test simulation, nsim = 500 is sufficient to achieve the desired MCSE. The MCSEs of each model are reported for each simulation condition.
Results
Primary aim of the study
Table 2 depicts a summary of the mean empirical bias and mean empSE for the PCC and CFA methods under the influence of different simulation conditions, as well as an overall summary across all conditions.
Empirical Bias
Overall, PCC was negatively biased and CFA was positively biased, but CFA was the less-biased of the two methods. The same behavior was observed when splitting the bias results by any single simulation parameter. When considering the impact of any single simulation parameter on the mean empirical bias, the negative bias of PCC increased as the MEM and missing data rate increased. The choice of missing data mechanism only had a minor impact on the mean empirical bias of PCC, and the impact of sample size was negligible. The negative bias of PCC decreased as the number of repeated assessments increased.
For the CFA method, any single simulation parameter had little impact on the mean empirical bias. At the maximum MEM level, the CFA method was slightly less biased than for lower MEM values; the same behavior was observed for the maximum missing data rate. Data that were MNAR, produced a slightly smaller magnitude of the bias when compared to no missing data or MCAR data. Varying the number of repeated target measure assessments, had no discernible impact on the mean empirical bias. When increasing the sample size, there was a very slight trend for the mean empirical bias of CFA to decrease, and this was the strongest of the trends exhibited when considering a single simulation condition.
Focusing on a specific simulation condition (no missing data and 7 repeated assessments per subject; Fig 3), sample size impacted the average trend of empirical bias less than MEM, while the dispersion of empirical bias appeared to decrease as sample size increased. We observed a pattern of increasing underestimation from PCC as MEM increased. In contrast, CFA slightly overestimated the bias, but the average trend was not greatly affected by the varying measurement error. This pattern was repeated in the 3-assessment and 5-assessment case.
EmpSE
Table 2 shows that, overall, PCC had a lower mean empSE than CFA. In fact, a stronger condition is true – for every combination of simulation parameters investigated in the study, the empSE of CFA was greater than the corresponding empSE of PCC. This is also illustrated by Fig 4, which shows the empSEs of PCC plotted against the empSEs of the CFA factor correlation, colored by sample size.
Colors represent the different sample sizes.
Continuing with Table 2, when considering the impact of any single simulation parameter on the mean empSE, decreasing the MEM, or the missingness rate caused the mean empSE to decrease for both PCC and CFA. For both methods, the empSE was lower with no missing data compared to that for MNAR data. The empSE for MNAR data was lower than that for MCAR data. When sample size increased, the mean empSE of both methods decreased. An increase in the number of assessments corresponded with a decrease in mean empSE for both methods. These trends persisted in more granular simulation scenarios, such as the one shown in Table 3 (i.e., no missing data, seven repeated assessments).
Fig 5a depicts the empSEs for PCC and CFA factor correlations, as a function of increasing sample size with a fixed missing data rate of 0.40; Fig 5b depicts the corresponding empirical biases. When holding the missing data rate constant and allowing sample size to vary (Fig 5a), we observed a pattern of increasing empSE with increasing MEM, which was mitigated by increasing sample size. For example, the difference between the median empSE at the highest and lowest MEM levels in the PCC condition was 0.025 for n = 35, whereas the difference between the highest and lowest MEM values was only 0.007 for n = 1000 (i.e., a 3.6-fold difference). Furthermore, the empSE of CFA was more sensitive to increasing MEM at all sample sizes. The difference between the median empSE at the highest and lowest MEM levels was 0.202 and 0.018, for the n = 35 and n = 1000 cases respectively (i.e., an 11.2-fold difference). While the empSE of PCC also grew as MEM increased, its impact was less severe across all sample sizes. Empirical biases (Fig 5b) in these scenarios did not behave differently than the overall trend (Table 2).
When holding sample size constant (n = 35) and allowing MEM to vary, there was a trend for the empSE (Fig 6a) to perform notably worse in the 3-assessment cases than larger number of repeated assessments (5 and 7); this trend was mitigated by reducing the MEM, except that for the largest simulated MEM () empSE for all repeated assessments were equally poor. While following a similar trend, PCC has a notably smaller empSE across these scenarios as compared to CFA correlations. The same pattern holds in larger sample sizes (see S2a Fig in S1 File for n = 100). Empirical biases (Fig 6b and S2b Fig in S1 File) did not differ notably from the overall trend (Table 2).
CFA model fit statistics: rates of acceptable fit and failure rates
Table 4 depicts a summary of the rates of acceptable fit for each CFA model fit statistic and the model failure rate under the influence of different simulation conditions, as well as an overall summary across all conditions.
Across all conditions combined, fit statistics CFI and TLI were generally within acceptable bounds (84.8% and 82.2% respectively), and SRMR were generally acceptable but lower (58.0%). RMSEA were the least consistently acceptable, with only 45.4% meeting the criterion.
When considering the impact of varying any single simulation condition, increasing the parameters that control information quality (MEM and missing data rate) reduced the mean rate of acceptable fit across all fit statistics. MEM had a particularly large impact on RMSEA, decreasing the mean rate of acceptable fit by 67.3% as MEM increased from the minimum to the maximum value. The equivalent decrease in mean acceptable fit for SRMR was 16.9%, 37.7% for CFI and 42.7% for TLI. Varying the missing data mechanism had a much smaller impact on the mean acceptable rate for any of the fit statistics.
Increasing the sample size increased the mean acceptable fit rate for each statistic, especially for RMSEA and SRMR. For n = 1000, CFI, TFI and SRMR always indicated an acceptable model fit, whereas RMSEA had an acceptable fit rate 84.7%. For n = 35, almost none of the models were an acceptable fit according to SRMR, less than third were an acceptable fit according to RMSEA, and fewer than two-thirds were an acceptable fit according to CFI and TLI. Based on SRMR, almost all of the unacceptable models had a sample size of n ≤ 100.
Increasing the number of repeated assessments decreased the mean rate of acceptable fit for each fit statistic. The decrease in rate of acceptability was slight for each of CFI, TLI and SRMR, and slightly greater for RMSEA.
Fig 7 shows the rate of acceptable fit according to RMSEA and SRMR, restricted to n = 100 and grouped by MEM. Fig 7 shows that the model fit according to RMSEA when n = 100 was generally acceptable when MEM was at the minimum level of ; the mean acceptable fit rate was 91.2% compared to 31.4% when considering all MEM conditions together for n = 100. In contrast, almost no models had an acceptable RMSEA fit when MEM was at the maximum value – the mean acceptable rate was less than 1% in these cases as opposed to the overall 31.4% rate. The SRMR acceptability rate exhibited a similar but weaker trend to RMSEA, with a mean acceptable rate of 68.4% at minimum MEM, compared to 9.0% at maximum MEM (with an overall mean SRMR acceptable fit rate of 34.7% when n = 100).
Similar but weaker patterns were seen for RMSEA and SRMR acceptable fit when grouping the n = 100 cases by both the number of repeated assessments and by missing data rate: as the grouping parameter increased, the rate of acceptable fit decreased.
In approximately 9% of the total individual simulation repetitions, the RMSEA value was exactly zero. In all of these repetitions, the CFI was exactly 1.0 and the TLI was greater than 1.0. These repetitions were not limited to any particular combination of simulation conditions, although they were more likely to occur in conditions with lower levels of information (i.e., small number of assessments and small sample size) and better levels of signal to noise (i.e., small MEM and low missing data rate).
Almost all of the failures of the CFA model to converge occurred when sample size was 35, and sample size had the largest impact on failure rate when compared with other simulation parameters in this study. Increasing the MEM and missing data rate resulted in slight increases in failure rate. When including repeated target measure assessments, increasing the number of assessments led to a reduction in failure rate.
When attempting to use a single target measure assessment in the CFA model, the model was consistently non-identifiable (i.e., multiple model specifications fit the data equally well). While factor correlations were computed in these cases, they are not been reported as they are unlikely to be meaningful.
Secondary aim of the study
Table 5 depicts a summary of the mean change in information added () and mean empSE for each regression method (SLR, MLR with “daily means”, MLR with “daily distinct”, and MLR with “weekly COAs”) under the influence of the different simulation conditions, as well as an overall summary across all conditions.
Change in information added, 
Both MLR methods that included the daily PRO data performed well, recovering more information than the SLR model in the vast majority of cases (96.9% for the daily PRO means model, and 95.9% for the individual daily PRO days model). The MLR method including only the weekly COAs performed less well but was still generally better than the SLR model, adding more information than the SLR method in 77.8% of cases.
When considering the mean , the MLR methods performed similarly well to each other, with a slight increase in mean
for the “daily distinct” model as compared to the “daily means” model (0.206 and 0.198, respectively) The “weekly COAs” model performed much worse, with a mean
of 0.031. This pattern was repeated when considering the results grouped by any single simulation parameter.
When grouping the results by any single simulation parameter, the largest impacts on
were observed for MEM and the number of repeated assessments. In general, increasing the sample size and number of assessments led to an increase in
for all models. Moreover, reducing the MEM and missing data rate also increased
for all models. Sample size had a notable impact on the variability of
under the daily distinct model, with an SD of 0.190 at n = 35 compared to 0.072 at n = 1000. Values of
from simulations with a sample size of 35 exhibited the greatest variability of all models across all conditions.
When investigating simulations with negative values (3.1% of all values for the means model, 4.1% of all values for the daily distinct model, and 22.2% of all values for the weekly COAs model), sample size had the greatest impact out of any single simulation parameter. When sample size was 35, over half of the
values for the weekly COAs model were negative, decreasing to around 25% simulations at n = 100, then around 10% at n = 200, and almost no simulation at n = 1000 produced a negative
for this model. This pattern was repeated for the other two models; between 10% and 15% of simulations at n = 35 returned a negative
for the means and daily distinct models, respectively, which reduced to almost zero for larger sample sizes.
There were a limited number of simulations where the MLR values were negative; the majority of these negative values occurred for n = 35.
EmpSE
Overall, the empSE was similar between all MLR and SLR methods. The SLR method had a slightly smaller mean empSE, with the “daily means” model and “weekly COA” models having slightly larger but broadly similar means, and the “daily distinct” model having the largest mean empSE among all four methods. When considering any single simulation parameter, varying the sample size resulted in the strongest impact on empSE, with an increase in sample size tending to result in a decrease in empSE for each regression method.
This trend was less clear when varying the number of repeated assessments. For MLR methods there was a slight trend of decrease in mean empSE as the number of assessments increased, but for the SLR method, varying the number of assessments had broadly no impact. Introducing missing data via MNAR tended to produce a slightly reduced empSE in each method as compared to MCAR, and increasing the missing data rate tended to slightly increase empSE for each method. When allowing just the MEM to vary, a more complicated picture emerged. For both models that included the daily PRO, empSE exhibited a slight tendency to increase as MEM increased. However, for methods only including weekly COAs (i.e., the SLR and the MLR “Weekly COAs model) this was reversed, and the empSE exhibited a slight tendency to decrease as MEM increased.
Regression model failure rates
We observed a limited number of failures when the estimate returned as NaN in the “daily distinct” MLR model, all of which occurred when sample size was 35 and missing data rate was at its maximum of 0.40 under the MCAR mechanism. Table 6 lists the specific simulation parameter combinations that led to failures, along with their failure rates.
There were a small number of estimates produced from the daily distinct model which strongly underperformed the SLR
; these estimates tended to arise in the same simulation conditions where model failures were reported. The SLR model, the “daily means” MLR model, and the “weekly COAs model exhibited no failures during the simulation.
Monte Carlo standard errors of the empSE
All MCSE values across each of the simulation conditions were smaller than the acceptability criteria of 0.025, for each of the six statistical methods investigated in this work. Table 7 shows the mean MCSE of the empSE for each method, and we observed that these values are all comfortably below the acceptability threshold.
Discussion
Summary of results and how they help in practice
Our simulation study highlights the challenges in validating novel digital endpoints against reference measures with non directly comparable units. While sDHTs offer the potential for increasingly more objective and continuous methods of measurement, rigorous analytical validation is critical to support the continued development and implementation of such novel digital clinical measures.
This simulation study investigated the convergent validity of an sDHT-derived measure with reported reference measures. PCC and CFA factor correlations were used. The study design and parameter choices to simulate these measures were informed by real-world data and observations, including missing data mechanisms.
Our analysis found that Pearson correlation was more stable, easier to compute, and relatively robust to violations of parametric assumptions [26], making it a recommended choice for small sample size studies. However, PCC tends to underestimate the true correlation, and this underestimation becomes more pronounced as the degree of measurement error increases.
CFA factor correlation has gained popularity as its approach to extracting latent information can better mitigate the statistical impacts of measurement error on estimation. Because it is a more complex model, it requires a well-sized sample (n > 100) and frequent repeated assessments from the sDHT. We deemed the CFA model unsuitable for a single assessment scenario due to model non-identifiability. Additionally, model fitting evaluation needs careful investigation, as commonly used fit statistics, such as CFI and RMSEA, yielded different qualitative conclusions in the simulated examples. We speculate that these results do not indicate model misspecification but can be attributed to small sample sizes and violations of parametric assumptions [27].
The strengths and weaknesses of PCC and CFA factor correlation complement each other. We encourage researchers to present results from both methods when assessing convergent validity during analytical validation. PCC is a conservative and stable method that demonstrates the relationship between the target and reference measures and may provide a lower bound of the true correlation. Conversely, CFA factor correlation offers an optimistic but intuitive estimate, serving as an upper bound. The true correlation likely lies between the results from these two methods.
Furthermore, multiple reference measures collected at a higher time resolution (e.g., daily instead of weekly) can increase the information available for mapping the sDHT to these reference measures, providing a more comprehensive presentation of the sDHT’s convergent validity. Using an MLR approach with daily collected reference measures notably boosted the explainable variance from the sDHT, compared to when the reference measure was only collected once.
Novelty of the work
We operationalized the analytical validity of an sDHT using psychometric concepts (latent and observable variables). Our approach provides investigators with a general-use simulation-based framework for evaluating how well an sDHT corresponds to reported reference measures. Our approach can be adopted to other measures, beyond the physical activity case presented here.
Investigators can use our simulation code to explore various analytical validation scenarios for novel measures. The simulation code, including data generation methods, simulated data, and guidance on adapting these materials to different scenarios are publicly available on the website of the Digital Medicine Society [28] (see also the data availability statement).
We evaluated CFA factor correlation as a method for handling data believed to be generated by latent traits, alongside PCC. We believe our study is the first to adopt this approach to conduct analytical validation of sDHTs and provide a simulation toolkit for researchers to test out various scenarios in silico before starting data collection.
Limitations and future research
Future research may consider specifying data generation methodologies in a manner that explicitly controls the true relationship between the sDHT and the reference measure, such as explicitly controlling the true correlation between the measures. This would allow a researcher to specify a single parameter, independent of simulation repetitions, that the statistical models can attempt to identify, thereby estimating the true bias of each model, rather than relying on empirical bias as we did here.
When modeling missingness, the MCAR mechanism was chosen to recreate missing data patterns observed in the collection of physical activity data during clinical trials [29]. The simulation revealed that whether values were MCAR or MNAR did not substantially impact the performance of the methods. Further work could explore the impact of other missing data mechanisms on the performance of statistical models, particularly mechanisms that more closely model observed real-world trends. These might encompass mechanisms applied at the epoch level instead of the summary level, such as an inverted MNAR method (i.e., making smaller data points more likely to be deleted than larger ones) [30]. Additionally, future work should investigate the impact of missing reference measure data on the analysis results, as the current simulation only included missing data for the target sDHT measure.
Our data generation approaches did not explicitly account for demographic factors. We acknowledge the importance of these factors in real-world scenarios but simulating them was beyond the scope of this work.
Finally, statistical methods examining errors-in-variables, such as Deming regression, may warrant exploration. These models would be particularly suitable because they account for error in the predictor variables. Another method that merits investigation is CFA when modeling multiple reference measures. A potentially suitable correlated two-factor model could involve loading the items from the weekly COAs and the daily assessments from the daily PRO onto separate factors, which are then loaded onto a combined “reference measure” factor, correlated against a target measure factor with each day of sDHT data loaded as separate indicators. The two-factor CFA model we used could also be modified to explore if the model fit can be improved for small sample sizes and other conditions, such as introducing covariances between consecutive days of target measure data, which more closely aligns with the DGM design.
Supporting information
S1 File. Further details on the simulation parameters and empSEs and empirical biases of correlation methods.
https://doi.org/10.1371/journal.pone.0308190.s001
(DOCX)
Acknowledgments
The authors gratefully acknowledge the contributions of the following experts through participation in the statistical advisory committee and async review of the simulation protocol and results: Chakib Battoui, Jakob Bjørner, Yiorgos Christakis, Valentin Hamy, Andrew Potter, Bohdana Ratitch, David Reasner, Colleen Russell, Sachin Shah, Berend Terluin, Andrew Trigg, Kevin Weinfurt, Robert Wright. In addition, the authors gratefully acknowledge the contributions of DiMe members for their support: Samantha McClenahan and Bethanie McCrary.
References
- 1. Goldsack JC, Coravos A, Bakker JP, Bent B, Dowling AV, Fitzer-Attas C, et al. Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs). NPJ Digit Med. 2020;3:55. pmid:32337371
- 2.
DiMe. V3 : An extension to the V3 framework to ensure user-centricity and scalability of sensor-based digital health technologies. https://datacc.dimesociety.org/resources/v3-an-extension-to-the-v3-framework-to-ensure-user-centricity-and-scalability-of-sensor-based-digital-health-technologies/. 2024. Accessed 2024 June 14.
- 3. Bakker JP, McClenahan SJ, Fromy P, Turner S, Peterson BT, Vandendriessche B, et al. A hierarchical framework for selecting reference measures for the analytical validation of sensor-based digital health technologies. JMIR Preprints. 2024.
- 4. Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther. 2001;69(3):89–95. pmid:11240971
- 5.
Center for Drug Evaluation, Research. Surrogate Endpoint Resources for Drug and Biologic Development. https://www.fda.gov/drugs/development-resources/surrogate-endpoint-resources-drug-and-biologic-development. 2021. Accessed 2024 June 13.
- 6. Wittkampf KA, Naeije L, Schene AH, Huyser J, van Weert HC. Diagnostic accuracy of the mood module of the Patient Health Questionnaire: a systematic review. Gen Hosp Psychiatry. 2007;29(5):388–95. pmid:17888804
- 7. Paulsen JS, Wang C, Duff K, Barker R, Nance M, Beglinger L. Challenges assessing clinical endpoints in early Huntington disease. Mov Disord. 2010;25(15):2595–603.
- 8. Byrom B, McCarthy M, Schueler P, Muehlhausen W. Brain monitoring devices in neuroscience clinical research: the potential of remote monitoring using sensors, wearables, and mobile devices. Clin Pharmacol Ther. 2018;104(1):59–71. pmid:29574776
- 9.
Bergemann T. Use of accelerometer data to evaluate physical activity as a surrogate endpoint in heart failure clinical trials. In: ASA Biopharmaceutical Section Regulatory-Industry Statistics Workshop. American Statistical Association. 2017.
- 10. Sen-Gupta E, Wright DE, Caccese JW, Wright JA Jr, Jortberg E, Bhatkar V, et al. A Pivotal Study to Validate the Performance of a Novel Wearable Sensor and System for Biometric Monitoring in Clinical and Remote Environments. Digit Biomark. 2019;3(1):1–13. pmid:32095764
- 11.
Walls TA, Schafer JL. Models for intensive longitudinal data. Walls TA, Schafer JL, editors. Oxford University Press. 2006.
- 12. Goetz CG. Movement Disorder Society-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): a new scale for the evaluation of Parkinson’s disease. Rev Neurol (Paris). 2010;166(1):1–4. pmid:19910010
- 13. Evers LJW, Krijthe JH, Meinders MJ, Bloem BR, Heskes TM. Measuring Parkinson’s disease over time: The real-world within-subject reliability of the MDS-UPDRS. Mov Disord. 2019;34(10):1480–7. pmid:31291488
- 14. Tucker CA, Bevans KB, Teneralli RE, Smith AW, Bowles HR, Forrest CB. Self-reported pediatric measures of physical activity, sedentary behavior, and strength impact for PROMIS: item development. Pediatr Phys Ther. 2014;26(4):385–92. pmid:25251790
- 15. Yao J, Tan CS, Lim N, Tan J, Chen C, Müller-Riemenschneider F. Number of daily measurements needed to estimate habitual step count levels using wrist-worn trackers and smartphones in 212,048 adults. Scientific Reports. 2021;11(1):9633.
- 16. Hart TL, Swartz AM, Cashin SE, Strath SJ. How many days of monitoring predict physical activity and sedentary behaviour in older adults?. Int J Behav Nutr Phys Act. 2011;8:62.
- 17. Dillon CB, Fitzgerald AP, Kearney PM, Perry IJ, Rennie KL, Kozarski R, et al. Number of days required to estimate habitual activity using wrist-worn geneactiv accelerometer: a cross-sectional study. PLoS One. 2016;11(5):e0109913. pmid:27149674
- 18. Zhang GQ, Cui L, Mueller R, Tao S, Kim M, Rueschman M. The National Sleep Research Resource: Towards a Sleep Data Commons. J Am Med Inform Assoc. 2018;25(10):1351–8.
- 19. Griffiths P, Terluin B, Trigg A, Schuller W, Bjorner JB. A confirmatory factor analysis approach was found to accurately estimate the reliability of transition ratings. J Clin Epidemiol. 2022;141:36–45. pmid:34464687
- 20.
Little RJ, Rubin DB. Statistical analysis with missing data. John Wiley & Sons. 1986.
- 21. RUBIN DB. Inference and missing data. Biometrika. 1976;63(3):581–92.
- 22. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling. 1999;6(1):1–55.
- 23.
Kline R. Principles and Practice of Structural Equation Modeling. 5th. Ed. Guildford Press; 2023.
- 24. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11).
- 25.
Siepe BS, Bartoš F, Morris TP, Boulesteix AL, Heck DW, Pawel S. Simulation Studies for Methodological Research in Psychology: A Standardized Template for Planning, Preregistration, and Reporting. http://dx.doi.org/10.31234/osf.io/ufgy6. 2023.
- 26. Havlicek LL, Peterson NL. Robustness of the Pearson Correlation against Violations of Assumptions. Percept Mot Skills. 1976;43(3_suppl):1319–34.
- 27. Lai K, Green SB. The problem with having two watches: assessment of Fit When RMSEA and CFI Disagree. Multivariate Behav Res. 2016;51(2–3):220–39. pmid:27014948
- 28.
DiMe. Validating Novel Digital Clinical Measures. https://datacc.dimesociety.org/validating-novel-digital-clinical-measures/. 2024. Accessed 2024 June 13.
- 29. Cho S, Ensari I, Weng C, Kahn MG, Natarajan K. Factors Affecting the Quality of Person-Generated Wearable Device Data and Associated Challenges: Rapid Systematic Review. JMIR Mhealth Uhealth. 2021;9(3):e20738. pmid:33739294
- 30. Yingling LR, Mitchell V, Ayers CR, Peters-Lawrence M, Wallen GR, Brooks AT. Adherence with physical activity monitoring wearable devices in a community-based population: observations from the Washington, D.C., cardiovascular health and needs assessment. Transl Behav Med. 2017;7(4):719–30.