
Enhancing site selection strategies in clinical trial recruitment using real-world data modeling

  • Lars Hulstaert,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    lhulsta1@its.jnj.com

    Affiliation R&D Data Science & Digital Health, Janssen-Cilag GmbH, Neuss, North Rhine-Westphalia, Germany

  • Isabell Twick,

    Roles Conceptualization, Methodology, Writing – review & editing

    Affiliation R&D Data Science & Digital Health, Janssen-Cilag GmbH, Neuss, North Rhine-Westphalia, Germany

  • Khaled Sarsour,

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliation R&D Data Science & Digital Health, Janssen Pharmaceuticals, Titusville, New Jersey, United States of America

  • Hans Verstraete

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – review & editing

    Affiliation R&D Data Science & Digital Health, Janssen Pharmaceutica NV, Beerse, Antwerp, Belgium

Abstract

Slow patient enrollment or failing to enroll the required number of patients is a disruptor of clinical trial timelines. To meet planned trial recruitment, site selection strategies are used during clinical trial planning to identify research sites that are most likely to recruit a sufficiently high number of subjects within trial timelines. We developed a machine learning approach that outperforms baseline methods in ranking research sites by their expected recruitment in future studies. The approach uses indication-level historical recruitment and real-world data to predict patient enrollment at site level. We define covariates based on published recruitment hypotheses and examine the effect of these covariates on predicted patient enrollment. We compare the performance of a linear and a non-linear machine learning model with common industry baselines that are constructed from historical recruitment data. Performance of the methodology is evaluated and reported for two disease indications, inflammatory bowel disease and multiple myeloma, both of which are actively being pursued in clinical development. We validate the recruitment hypotheses by reviewing the covariates' relationship with patient recruitment. For both indications, the non-linear model significantly outperforms the baselines and the linear model on the test set. In this paper, we present a machine learning approach to site selection that incorporates site-level recruitment and real-world patient data. The model ranks research sites by predicting the number of recruited patients, and our results suggest that the model can improve site ranking compared to common industry baselines.

Introduction

Background

Slow patient enrollment or failing to enroll the required number of patients is a disruptor of clinical trial timelines, leading to potential delays in drug approval, underpowered studies, the need to include additional study sites, or even trial terminations [1–3]. Site selection, the process in which research sites, i.e. healthcare organizations and their associated investigators, are chosen to participate in clinical trials, is critical to enable timely recruitment.

Common site selection strategies use past trial data to assess how well a site would perform in a prospective clinical trial, and different standardized and objective methods have been developed across industry and academia [4–8]. These methods include analyzing factors such as prior trial participation and performance, which are interrogated through database searches in investigator, site, and enrollment data sources. In certain cases, this process is complemented with epidemiologic and geographical analyses to create short lists of research sites that have both relevant research experience and direct access to a sufficiently large target patient population [5,9,10]. Research sponsors subsequently use site-level feasibility questionnaires to obtain estimates of expected recruitment for shortlisted sites, often resulting in an overestimation of their ability to recruit patients in trials [11]. It is hypothesized that this overestimation occurs because investigators do not have full protocol information and have limited time for a thorough trial feasibility assessment [6].

Estimating site-level trial performance is a complex problem, further complicated by an increasingly competitive trial landscape and complex clinical trial designs [12]. Table 1 summarizes factors that have been reported in the literature to influence site recruitment performance. A common challenge in site selection strategies is aggregating the collected information to make final decisions. The range of variables that impact the performance of a site requires an effective measurement of the trade-offs that influence prospective recruitment performance [3].

Table 1. An overview of variables published in literature that are hypothesized to drive site recruitment performance.

https://doi.org/10.1371/journal.pone.0300109.t001

Quantitative research in this field is limited by the volume of clinical trial data needed to generate meaningful recruitment insights. Typically, the impact of the reported site-level factors on recruitment performance either is not validated, is validated only on a small sample of studies, or is validated only with feasibility questionnaire data from a single study. The low power observed in these analyses signals that a cross-study analysis is needed to yield generalizable results [17].

To address these challenges, different statistical and machine learning methods have been developed to estimate trial recruitment. Approaches differ in whether they predict enrollment at study [18,19] or study-site level [20,21], and whether enrollment is predicted at the start of the trial [18,20,21] or over the course of a trial's enrollment duration [20]. Several study-related factors associated with trial enrollment have been studied (e.g. trial design, phase, sponsor, disease indication, competing trials recruiting similar patients, investigator experience and characterization) [20,21]. Existing study-site models examine variables that describe a small number of site-specific factors such as research experience, but these modeling approaches and their data are not tailored to the study indication and population. The use of electronic health record and claims data to characterize the available patient population, for example, remains limited [1,8].

Objective

The goal of this work is to predict the number of patients enrolled at a clinical trial site, before the start of a trial’s enrollment phase, using a broad set of indication-specific and site level characteristics. We explore a machine learning method that considers research experience, historical performance, patient availability and other site and investigator factors to make site-level enrollment predictions. The model predictions can be used in the operational planning phase prior to the start of a study when potential study sites are selected.

The methodology is validated in inflammatory bowel disease (IBD) and multiple myeloma (MM). Given the limited availability of real-world data sources with large-scale hospital coverage outside of the US, the analysis is limited to predicting patient enrollment for US research sites. The approach aims to address the following research questions:

  • How do approaches that leverage a broad range of (study)-site variables compare to baseline site selection strategies?
  • Does the use of non-linear models improve generalization performance compared to a simple linear Poisson regression model? Does the model benefit from capturing non-linear relations between covariates and the model target?

To allow for comparison with previously published model results, the models are also compared in their ability to identify the bottom and top 30% performing investigators [21].

Materials and methods

Data sources

The data used in this work is sourced from different systems that contain structured data related to studies, research sites, investigators, and patient populations. The following section describes the data collected from these sources. A summary of the data sources is provided in Table 2, describing the data type, provider, coverage, and time frame.

Table 2. An overview of the different data sources with a description of the data, coverage, and time frame.

https://doi.org/10.1371/journal.pone.0300109.t002

Enrollment data from the DrugDev DataQuerySystem (DQS) is used to compute study-site level recruitment variables. DQS is a data platform that allows trial sponsors to share information on clinical trial recruitment and is used to capture study performance variables at site level, such as the site open date, the first and last subject enrolled dates, the enrollment duration, and the number of patients enrolled in a trial. The data is available for pharmaceutical trials across different disease indications. New data is made available monthly through DQS; once a study has been finalized, sponsors publish enrollment data from their systems, consolidated into a common data format.

For each site selection exercise, enrollment data is collected from DQS for so-called benchmark studies within a given indication. Benchmark studies are defined by manually reviewing the available studies within a given indication. These are further refined to ensure that benchmark trials resemble prospective studies in terms of study phase, target indication, eligibility criteria, study duration, and type of intervention.

Fig 1 visualizes the benchmark studies used across the two exercises by study phase and study indication. The enrollment data, in terms of the number of patients enrolled and enrollment months, is shown in Fig 2. Fig 3 shows the site distribution across US states, as well as the distribution of site open years. The complete list of benchmark studies for each exercise is provided in S1 File.

Fig 1. Overview of number of benchmark studies across phase and indication.

An overview of the number of benchmark studies across study phase and study indication for IBD and MM, respectively.

https://doi.org/10.1371/journal.pone.0300109.g001

Fig 2. Overview of number of patients enrolled and number of enrollment months across study-site combinations.

An overview of the number of patients enrolled and the number of enrollment months per site across the benchmark studies for IBD and MM, respectively.

https://doi.org/10.1371/journal.pone.0300109.g002

Fig 3. Overview of number of sites across US states and open year.

An overview of the number of sites across US states and site open year for IBD and MM, respectively.

https://doi.org/10.1371/journal.pone.0300109.g003

Data from the Komodo Healthcare map, a real-world data (RWD) source with significant geographical coverage in the US, is used to characterize the study population. This data contains longitudinal patient claims data with information on prescribed drugs, diagnoses, procedures, and treatments.

Linkage of patients across longitudinal data (e.g., instances where a patient is treated at multiple institutions during a follow-up period and across healthcare providers) is performed by Komodo Health prior to sharing the data. Based on these complete patient journeys, we can characterize the referral patterns across healthcare providers. Additional tables provided by Komodo describe provider-level publications and trial participation. These tables are processed so that they can easily be linked to healthcare providers via the National Provider Identifier (NPI) system [22].

Patient cohorts are created from common trial eligibility criteria of the benchmark studies to mimic the target population of the prospective and benchmark studies. Exact replication of the target patient population is often not possible with the available claims data: patient outcomes and lab measurement results, for example, are typically not available in claims data, while they are often part of a trial's eligibility criteria. Patient cohort definitions therefore mimic the general target patient population across benchmark studies. Patient populations are specified through an observation period, specialist type, patient age, diagnosis, drug, and procedure codes. Inclusion criteria of benchmark studies are used to define a superset of relevant diagnosis, drug, and procedure codes. These codes define a patient cohort that represents the broad patient population that is eligible for the benchmark studies. The cohort definitions for each exercise are shared in S2 File. The publication and trial databases are filtered only at indication level to capture the breadth of the research experience and interest of the healthcare organization (HCO). A minimal sketch of this cohort logic is given below.
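
As a minimal sketch of the cohort logic described above, the Python snippet below filters a claims table on an observation period, age, and code lists. The column names (patient_id, age, diagnosis_code, drug_code, service_date) are hypothetical; the actual Komodo schema and the real code lists are defined in S2 File.

import pandas as pd

def build_cohort(claims: pd.DataFrame,
                 diagnosis_codes: set, drug_codes: set,
                 min_age: int, obs_start: str, obs_end: str) -> pd.Series:
    """Return patient_ids in the broad benchmark cohort (illustrative only)."""
    window = claims[claims["service_date"].between(obs_start, obs_end)]
    has_dx = window.loc[window["diagnosis_code"].isin(diagnosis_codes), "patient_id"]
    has_rx = window.loc[window["drug_code"].isin(drug_codes), "patient_id"]
    of_age = window.loc[window["age"] >= min_age, "patient_id"]
    # A patient qualifies when all criteria are met within the observation period.
    cohort = set(has_dx) & set(has_rx) & set(of_age)
    return pd.Series(sorted(cohort), name="patient_id")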

Real-world data is available at patient level but is aggregated at HCO level. Different covariates are extracted at HCO level to characterize the available study population, procedures, treatments, staff, publications, and trial experience. The RWD and recruitment data sources are linked at HCO level. As DQS and Komodo use different HCO identifiers, manual validation is performed to ensure that each HCO is correctly linked across the data sources.

The real-world data accessed for this study was deidentified in accordance with the Health Insurance Portability and Accountability Act, and no personal health information was extracted. Therefore, no informed consent or institutional review board approval was required for this study.

Outcome of interest and covariates

The outcome of interest, enrollment at study-site level, is defined as the total number of recruited patients at a given site for a given study. A summary of the enrollment characteristics of the two exercises is provided in S1 Table. Covariates are constructed from enrollment and real-world data to characterize a site within the context of a study. From the enrollment data, historical performance and experience variables are generated, such as covariates that summarize historical recruitment over a time window, the enrollment period, and the year the site opened for a given study. The real-world data is used to build a set of indication-specific covariates that characterize the study population, treatments, staff (physicians and specialists), referrals, publications, and trial participation.

Table 3 shows the different covariates that are generated and how they are constructed from the respective data sources. Two types of covariates exist: site level covariates, which characterize the site, are assumed to remain static over time, and do not differ across benchmark studies; and study-site level covariates, which change over time and are unique to a study-site combination. The covariate creation process was largely hypothesis driven, following the hypotheses in Table 1. Covariates are clipped to the [0%, 97.5%] percentile interval, with outliers set to the nearest boundary value; only right-side clipping is effectively applied, as the covariates are gamma distributed. Missing values are imputed with 0. A minimal sketch of this preprocessing is shown below.
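
The snippet below is a minimal sketch of this preprocessing step, assuming the covariates sit in pandas DataFrames: right-side clipping at the 97.5th percentile (thresholds computed on the train set and reused on the test set) and zero imputation of missing values.

import pandas as pd

def preprocess_covariates(train: pd.DataFrame, test: pd.DataFrame):
    upper = train.quantile(0.975)            # per-covariate clipping threshold
    train = train.clip(upper=upper, axis=1)  # clip outliers to nearest value
    test = test.clip(upper=upper, axis=1)    # reuse train thresholds on test
    return train.fillna(0), test.fillna(0)   # impute missing values with 0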

Although real-world data represents a broad set of patients that are potentially eligible for trial participation at any given time, its covariates are not aligned at the study-site level. While temporal alignment of RWD and recruitment data is possible based on the claim dates and the enrollment period of a site in each study, the real-world data is available only from 2016 onwards, while the benchmark studies start as early as 2006. As such, the cohort observation period is used instead to characterize the real-world clinical practice of a site. The variability in yearly values of the site level RWD covariates across the available data is sufficiently small, allowing them to be approximated as constant when averaged across the cohort observation period. Before 2016, this hypothesis cannot be validated, which has the potential to introduce bias in RWD covariates for studies conducted before 2016.

Proposed approach

To predict site performance based on enrollment and real-world data covariates, different machine learning models have been developed. The numbers of recruited patients at study-site level are discrete counts that are assumed to follow a Poisson distribution. The machine learning problem is therefore defined as a Poisson regression problem in which the enrollment months represent the exposure period, as sketched below.
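
A minimal sketch of this setup with XGBoost's count:poisson objective: the log of the exposure (enrollment months) enters as the base margin, so the model effectively learns a recruitment rate per exposure month. Variable names and hyperparameter values are illustrative, not the paper's final configuration.

import numpy as np
import xgboost as xgb

def fit_poisson_xgb(X_train, y_counts, exposure_months, params=None):
    dtrain = xgb.DMatrix(X_train, label=y_counts)
    # Offsetting by log(exposure) turns counts into a rate model, so
    # predictions scale with each site's enrollment duration. Note that new
    # data passed to predict() needs its base margin set the same way.
    dtrain.set_base_margin(np.log(exposure_months))
    params = {"objective": "count:poisson", "eta": 0.05, "max_depth": 4,
              **(params or {})}
    return xgb.train(params, dtrain, num_boost_round=200)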

We use a random train (80%) and test (20%) data split at site level to avoid a potential data distribution bias and its corresponding impact on model generalization. The use of study-specific variables is limited to ensure generalizability across studies and to limit data leakage. A similar site-level grouping is used for cross-validation, with 5-fold cross-validation groups. In line with the Poisson modeling objective, models are compared with different regression metrics: mean absolute error (MAE), root mean squared error (RMSE), and the Spearman correlation coefficient are evaluated on both the train and test sets. The coefficient of determination (R2) is provided as reference, as the models are optimized for their ability to rank sites using the Spearman correlation coefficient. A sketch of this evaluation setup follows.
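
The sketch below shows one way to implement this evaluation, assuming numpy feature and target arrays: GroupKFold keeps all rows of a site in the same fold, mirroring the site-level split, and the helper computes the metrics named above. The model_factory callable is a hypothetical stand-in for any of the compared models.

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold

def evaluate(y_true, y_pred):
    return {"MAE": mean_absolute_error(y_true, y_pred),
            "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
            "Spearman": spearmanr(y_true, y_pred).correlation,
            "R2": r2_score(y_true, y_pred)}

def cross_validate(model_factory, X, y, site_ids, n_splits=5):
    scores = []
    # Grouping by site_id keeps each site entirely in one fold.
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=site_ids):
        model = model_factory().fit(X[train_idx], y[train_idx])
        scores.append(evaluate(y[test_idx], model.predict(X[test_idx])))
    return scores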

We also assess whether the models succeed in identifying the top and bottom 30% of research sites in terms of enrollment, to allow for comparison with the results provided in prior work [21].

The regression outputs are converted into a ranked list, and sites are grouped into two classes for each task: top 30% versus bottom 70% for identifying top performers, and bottom 30% versus top 70% for identifying bottom performers. This group assignment is done with both the actual and the predicted enrollment counts to create the actual and predicted labels. We use the area under the curve (AUC) classification metric to compare the different models on these classification tasks, as sketched below.
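
One simple variant of this conversion is sketched below, using the continuous predictions directly as ranking scores against the actual top and bottom 30% labels (the paper derives labels from both actual and predicted counts; this is an assumption for illustration).

import numpy as np
from sklearn.metrics import roc_auc_score

def top_bottom_auc(y_true, y_pred, quantile=0.30):
    top_cut = np.quantile(y_true, 1 - quantile)
    bottom_cut = np.quantile(y_true, quantile)
    # Top task: actual top 30% vs rest; higher predictions should rank higher.
    auc_top = roc_auc_score(y_true >= top_cut, y_pred)
    # Bottom task: actual bottom 30% vs rest; lower predictions should rank
    # higher, so the prediction sign is flipped.
    auc_bottom = roc_auc_score(y_true <= bottom_cut, -y_pred)
    return auc_top, auc_bottom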

As there are no guidelines for the systematic evaluation of site selection methods, the performance of new methods is compared to the median historical enrollment as a baseline method. With this baseline, referred to as the median baseline, the median enrollment in the train set is used to predict the enrollment of sites in the test set.

To reflect the common industry practice of using historical performance, we add a baseline method based on site-level historical enrollment, referred to as the site level baseline. With this baseline, the median of the historical enrollment of a site is used to predict the enrollment of that site in other studies. If no historical enrollment data is available for a site, we impute the historical enrollment with the median historical enrollment in the train set. Both baselines are sketched below.
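
Both baselines reduce to a few lines of pandas, sketched here under the assumption of one row per study-site pair with illustrative columns site_id and enrolled.

import pandas as pd

def median_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    # Predict the train-set median enrollment for every test-set row.
    return pd.Series(train["enrolled"].median(), index=test.index)

def site_level_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    # Predict each site's median historical enrollment; fall back to the
    # overall train-set median when a site has no history in the train set.
    site_medians = train.groupby("site_id")["enrolled"].median()
    preds = test["site_id"].map(site_medians)
    return preds.fillna(train["enrolled"].median())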

Covariate selection and model training

For each exercise, a linear Poisson regression model and two non-linear machine learning models, a RandomForest and an XGBoost model (v1.7.2) [23], are trained and compared with the median and site level baselines. We considered other non-linear models but did not observe a significant difference in performance. The open-source framework Tune (v0.1.5) [24] is used to train and perform hyperparameter tuning on the non-linear models. The range of hyperparameters across which the models are optimized is shared in S4 File, as well as the optimal set of hyperparameters for each experiment. In the hyperparameter optimization framework, a new set of hyperparameters is randomly sampled in each experiment and evaluated using cross-validation. Across 128 experiments, the set of optimal hyperparameters is identified for a given dataset, as sketched below.
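
A sketch of this random-search loop with Ray Tune is shown below. The API shown is from a more recent Ray Tune release than the v0.1.5 used in the paper, the search space is illustrative rather than the grid in S4 File, and cross_validated_spearman is a hypothetical helper standing in for the cross-validation step.

from ray import tune

def train_fn(config):
    # Fit an XGBoost model with this hyperparameter config and report the
    # cross-validated ranking score (hypothetical helper, see S4 File for
    # the actual search grid).
    score = cross_validated_spearman(config)
    tune.report(spearman=score)

search_space = {
    "eta": tune.loguniform(1e-3, 3e-1),
    "max_depth": tune.randint(2, 10),
    "subsample": tune.uniform(0.5, 1.0),
}

analysis = tune.run(train_fn, config=search_space,
                    num_samples=128,  # 128 randomly sampled configurations
                    metric="spearman", mode="max")
best = analysis.best_config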

The SHapley Additive exPlanations (SHAP) [25] algorithm is used to estimate the importance of the covariates and to determine the partial dependency relationship between covariates and enrollment. Manual covariate selection is performed by assessing covariate importance with the model trained on all covariates using the training data. Covariates whose importance, defined as the covariate mean SHAP value, is below 0.005 are removed from the covariate set. For each model, the selected set of covariates is defined in S3 File and is a subset of the full set of covariates described in Table 3. To assess whether accuracy differences across modeling approaches are statistically significant, a dependent t-test for paired samples is conducted on the model absolute errors. Both steps are sketched below.
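
The sketch below shows both steps, assuming a fitted tree model and DataFrame features, and assuming the importance measure is the mean absolute SHAP value per covariate (the usual SHAP importance definition).

import numpy as np
import shap
from scipy.stats import ttest_rel

def select_covariates(model, X_train, threshold=0.005):
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    importance = np.abs(shap_values).mean(axis=0)  # mean |SHAP| per covariate
    return X_train.columns[importance >= threshold]

def compare_models(y_true, pred_a, pred_b):
    # Dependent t-test for paired samples on per-observation absolute errors.
    return ttest_rel(np.abs(y_true - pred_a), np.abs(y_true - pred_b))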

Results

The model performance results across the different indications are shared in Table 4 for the test dataset. Train set performance results are provided in S2 Table for completeness. The different performance metrics are computed between the target and the predicted enrollment. We compare a simple median baseline site selection strategy with the more advanced site level baseline. Finally, we compare the use of non-linear models with simpler linear modeling methods.

Table 4. Performance metrics are computed between the actual and predicted enrollment.

https://doi.org/10.1371/journal.pone.0300109.t004

Significance testing (paired Student's t-test) has been applied to assess the significance of the performance differences between the models, as measured by the mean absolute error. For each experiment, the non-linear XGBoost model had a mean absolute error that was significantly lower than that of the linear model (p-value < 0.001). As no significant performance difference was observed among the non-linear models, only the XGBoost models are studied further.

We use Shapley values [25] to estimate covariate importance in the model (Fig 4). We also assess the relationship the model has learned between study-site level enrollment and the covariates of interest. Partial dependence plots, computed from the Shapley values of the XGBoost models, are used to visualize the relationship of a model covariate with the target variable. We explore the relationships between all selected covariates and the model target in Figs 5 and 6; the sketch below shows how such plots can be produced.
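
As a minimal sketch, the shap package can produce both plot types directly from a fitted XGBoost model; the covariate name passed to the dependence plot is hypothetical.

import shap

explainer = shap.TreeExplainer(xgb_model)   # fitted XGBoost model
shap_values = explainer.shap_values(X_test)

# Fig 4 analogue: mean |SHAP| bar chart of covariate importance.
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Figs 5-6 analogue: dependence of the model output on one covariate
# ("historical_enrollment" is an illustrative covariate name).
shap.dependence_plot("historical_enrollment", shap_values, X_test)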

Fig 4. Covariate importance of the selected covariates for the XGBoost models.

The mean SHAP value represents the average impact of a covariate on the model output magnitude.

https://doi.org/10.1371/journal.pone.0300109.g004

Fig 5. Covariate dependence plots for the IBD XGBoost model.

The SHAP value represents the impact on the model output of a given covariate value.

https://doi.org/10.1371/journal.pone.0300109.g005

Fig 6. Covariate dependence plots for the MM XGBoost model.

The SHAP value represents the impact on the model output of a given covariate value.

https://doi.org/10.1371/journal.pone.0300109.g006

Discussion

The proposed modeling approach is versatile and applicable across indications when sufficient benchmark studies are available and the study population can be defined as a RWD cohort. The non-linear model improves the ability to rank sites by expected enrollment compared to the linear model and the baselines, visible in the increase of both the Spearman correlation coefficient and the AUC on the test set. The ability to generate an accurate site level ranking allows trial organizers to accurately identify and prioritize top performing sites.

Comparing our results to those of earlier data-driven site selection methodologies [21] is not straightforward due to variability in study context (single indication vs. multiple indications), methodology (site vs. investigator ranking), and evaluation approach (random vs. study split). Comparing the top and bottom 30% AUCs of our methods with the published results, we observe that the average AUC on the test set (0.79) is higher than prior results (0.75), while our approach maintains a high level of interpretability on the relationship of study recruitment with the different covariates. Although the R2 remains low to modest, the models have been optimized for their ability to rank sites, as expressed through the Spearman correlation coefficient and AUC.

Across the two experiments, there are important differences in the key covariates, highlighting that different factors play a role in recruitment depending on the indication. For instance, in trials targeting newly diagnosed patients, as is the case for IBD, the research site must wait for patients to become available. In such cases, the recruitment period is an important covariate. On the other hand, for indications where patients are already undergoing treatment, such as MM, covariates that characterize the research setting, including the number of specialists, publications, and ongoing trials, are key.

Regardless of the indication, past research experience, past high research performance, and a high number of patients are consistently strong positive indicators of recruitment potential, in line with previous research findings [1–17]. The site open year covariate captures the recruitment trend in a disease area over time and provides insight into the level of trial recruitment activity. In the case of IBD, competition in the trial landscape has increased greatly, as highlighted by the high importance of this covariate. While the insights gained from these machine learning models are specific to each indication, they can serve to inform future trial designs and recruitment strategies.

The proposed site selection methodology represents a notable advancement; however, challenges with respect to data availability remain. The utility of real-world data for site selection relies on its availability across large geographical areas; at present, this approach is only viable in the United States. Moreover, because US claims data is only available from 2016 onwards, the data cannot be aligned with the study period of interest for earlier studies. Furthermore, the absence of large-scale linkage between claims data and electronic health records, lab, and genomic data poses challenges for the replication of study cohorts.

While expected recruitment is an important consideration in site selection strategies, it should not be the sole determinant in trial planning. Other factors, such as the overall experience of collaborating with a research site and its research capabilities, must also be considered. Additionally, sites with a diverse patient population need to be considered to improve the representativeness of the study population of clinical trials, and consequently the validity and generalizability of clinical trial results. Nonetheless, within the United States, several barriers to diversity in clinical trial participation still exist [26]. Therefore, new and diverse research sites, in addition to historically strong performing ones, need to be considered during site selection to ensure novel therapies are more broadly accessible geographically and across underrepresented populations.

Conclusion

This work demonstrated empirically the importance of real-world data in predicting the patient recruitment of research sites in clinical trials. To the best of our knowledge, this is the first study that leverages machine learning methods and indication-level real-world data for site level enrollment prediction. This study contributes to an improved understanding and quantitative validation of the factors that are critical to predicting site study recruitment, and provides a data-driven decision support system to help select and assess research sites for a proposed trial.

Supporting information

S1 File. List of benchmark studies per indication.

https://doi.org/10.1371/journal.pone.0300109.s001

(DOCX)

S2 File. Real-world data cohort definition per indication.

https://doi.org/10.1371/journal.pone.0300109.s002

(DOCX)

S3 File. Selected set of covariates per indication.

https://doi.org/10.1371/journal.pone.0300109.s003

(DOCX)

S4 File. XGBoost hyperparameter grid and final model hyperparameters per indication.

https://doi.org/10.1371/journal.pone.0300109.s004

(DOCX)

S1 Table. Summary of the enrollment statistics across the two experiments.

https://doi.org/10.1371/journal.pone.0300109.s005

(DOCX)

S2 Table. Model performance on train set across the two experiments.

https://doi.org/10.1371/journal.pone.0300109.s006

(DOCX)

References

  1. Desai M. Recruitment and retention of participants in clinical studies: Critical issues and challenges. Perspect Clin Res 2020;11:51–3. pmid:32670827
  2. Demaerschalk BM, Brown RD, Roubin GS, Howard VJ, Cesko E, Barrett KM, et al. Factors Associated With Time to Site Activation, Randomization, and Enrollment Performance in a Stroke Prevention Trial. Stroke 2017;48:2511–8. pmid:28768800
  3. Fogel DB. Factors associated with clinical trials that fail and opportunities for improving the likelihood of success: A review. Contemp Clin Trials Commun 2018;11:156–64. pmid:30112460
  4. Hurtado-Chong A, Joeris A, Hess D, Blauth M. Improving site selection in clinical studies: a standardised, objective, multistep method and first experience results. BMJ Open 2017;7:e014796. pmid:28706090
  5. Potter JS, Donovan D, Weiss RD, Gardin J, Lindblad B, Wakim P, et al. Site selection in community-based clinical trials for substance use disorders: Strategies for effective site selection. Am J Drug Alcohol Abuse 2011;37:400–7. pmid:21854283
  6. Dombernowsky T, Haedersdal M, Lassen U, Thomsen SF. Criteria for site selection in industry-sponsored clinical trials: a survey among decision-makers in biopharmaceutical companies and clinical research organizations. Trials 2019;20:708. pmid:31829234
  7. The Role of Retail Pharmacies in the Evolving Landscape of Clinical Research. Appl Clin Trials Online 2023. https://www.appliedclinicaltrialsonline.com/view/the-role-of-retail-pharmacies-in-the-evolving-landscape-of-clinical-research (accessed April 20, 2023).
  8. Laaksonen N, Bengtström M, Axelin A, Blomster J, Scheinin M, Huupponen R. Clinical trial site identification practices and the use of electronic health records in feasibility evaluations: An interview study in the Nordic countries. Clin Trials 2021;18:724–31. pmid:34431408
  9. Miseta E. Bring Down The Cost Of Clinical Trials With Improved Site Selection n.d. https://www.clinicalleader.com/doc/bring-down-the-cost-of-clinical-trials-with-improved-site-selection-0001 (accessed April 20, 2023).
  10. Luo J, Chen W, Wu M, Weng C. Systematic data ingratiation of clinical trial recruitment locations for geographic-based query and visualization. Int J Med Inf 2017;108:85–91. pmid:29132636
  11. Barnard KD, Dent L, Cook A. A systematic review of models to predict recruitment to multicentre clinical trials. BMC Med Res Methodol 2010;10:63. pmid:20604946
  12. Bakhshi A, Senn S, Phillips A. Some issues in predicting patient recruitment in multi-centre clinical trials. Stat Med 2013;32:5458–68. pmid:24105891
  13. Getz. Predicting Successful Site Performance n.d. https://www.appliedclinicaltrialsonline.com/view/predicting-successful-site-performance (accessed April 20, 2023).
  14. Gheorghiade M, Vaduganathan M, Greene SJ, Mentz RJ, Adams KF, Anker SD, et al. Site selection in global clinical trials in patients hospitalized for heart failure: perceived problems and potential solutions. Heart Fail Rev 2014;19:135–52. pmid:23099992
  15. Gehring M, Taylor R, Mellody M, Casteels B, Piazzi A, Gensini G, et al. Factors influencing clinical trial site selection in Europe: The Survey of Attitudes towards Trial sites in Europe (the SAT-EU Study). BMJ Open 2013;3:e002957. pmid:24240138
  16. Huang GD, Bull J, McKee KJ, Mahon E, Harper B, Roberts JN. Clinical trials recruitment planning: A proposed framework from the Clinical Trials Transformation Initiative. Contemp Clin Trials 2018;66:74–9. pmid:29330082
  17. van den Bor RM, Grobbee DE, Oosterman BJ, Vaessen PWJ, Roes KCB. Predicting enrollment performance of investigational centers in phase III multi-center clinical trials. Contemp Clin Trials Commun 2017;7:208–16. pmid:29696188
  18. Bieganek C, Aliferis C, Ma S. Prediction of clinical trial enrollment rates. PloS One 2022;17:e0263193. pmid:35202402
  19. Zhang X, Long Q. Stochastic modeling and prediction for accrual in clinical trials. Stat Med 2010;29:649–58. pmid:20082363
  20. Liu J, Allen PJ, Benz L, Blickstein D, Okidi E, Shi X. A Machine Learning Approach for Recruitment Prediction in Clinical Trial Design. 2021.
  21. Gligorijevic J, Gligorijevic D, Pavlovski M, Milkovits E, Glass L, Grier K, et al. Optimizing clinical trials recruitment via deep learning. J Am Med Inform Assoc JAMIA 2019;26:1195–202. pmid:31188432
  22. NPPES n.d. https://nppes.cms.hhs.gov/#/ (accessed June 19, 2023).
  23. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. 2016.
  24. Liaw R, Liang E, Nishihara R, Moritz P, Gonzalez JE, Stoica I. Tune: A Research Platform for Distributed Model Selection and Training. 2018. https://doi.org/10.48550/arXiv.1807.05118.
  25. Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Adv. Neural Inf. Process. Syst. 30, Curran Associates, Inc.; 2017, p. 4765–74.
  26. Clark LT, Watkins L, Piña IL, Elmer M, Akinboboye O, Gorham M, et al. Increasing Diversity in Clinical Trials: Overcoming Critical Barriers. Curr Probl Cardiol 2019;44:148–72. pmid:30545650