An interactive retrieval system for clinical trial studies with context-dependent protocol elements

A well-defined protocol for a clinical trial guarantees a successful outcome report. When designing the protocol, most researchers refer to electronic databases and extract protocol elements using a keyword search. However, state-of-the-art database systems only offer text-based searches for user-entered keywords. In this study, we present a database system with a context-dependent and protocol-element-selection function for successfully designing a clinical trial protocol. To do this, we first introduce a database for a protocol retrieval system constructed from individual protocol data extracted from 184,634 clinical trials and 13,210 frame structures of clinical trial protocols. The database contains a variety of semantic information that allows the filtering of protocols during the search operation. Based on the database, we developed a web application called the clinical trial protocol database system (CLIPS; available at https://corus.kaist.edu/clips). This system enables an interactive search by utilizing protocol elements. To enable an interactive search for combinations of protocol elements, CLIPS provides optional next element selection according to the previous element in the form of a connected tree. The validation results show that our method achieves better performance than that of existing databases in predicting phenotypic features.


Introduction
Clinical trial protocols play a primary role in clinical trials [1]. Well-established protocols simplify clinical procedures, help avoid unnecessary protocol amendments, and facilitate preliminary assessments of latent issues. Thus, the protocols contribute to the success of clinical trials not only by reducing costs but also by improving performance [2,3].
In recent years, optimized protocol design has become increasingly important with the increase in the cost of clinical trials and the complexity of protocols. According to an investigation conducted in 2016, the drug development cost has skyrocketed in past decades, with the total capitalized cost per approved new drug VOLUME XX, 2017 9 reaching $2.6 billion. The clinical cost per approved new drug also increased, reaching $1.5 billion [4]. The capitalized clinical trial cost has grown by 8.3% per year over the past decade. The highly sophisticated nature of modern drug development is the cause of this upward trend in protocol complexity and work load [5].
Despite the importance of clinical trial protocols and the increasing demand for streamlining them, contemporary drug development is rather inefficient. The Tufts Center for the Study of Drug Development (Tufts CSDD) reported that 57% of analyzed protocols had at least one major amendment, and 45% of these amendments had "avoidable" reasons that originated from imperfect protocol design, such as design flaws or recruitment difficulties [6]. Moreover, the number of amendments and the number of changes per amendment were concentrated in phase III of the trials, resulting in a greater cost impact [7].
As there is a clear demand for a better method of protocol design, there have been various attempts to aid the protocol design process [8]. These approaches can be categorized into two groups: expert guidelines and computerized systems. Computerized systems can be further subdivided into automated and database systems.
The advantage of computerized systems is that they are quick and allow iterative searches for previous protocols related to the study of interest. In this regard, computerized systems are based on information retrieval technology.
First, researchers can refer to expert guidelines when designing their own trial protocols [9]. While referring to expert opinions guarantees the credibility of the protocol design, two obvious limitations follow: the protocol can only be applied if credible guidelines exist in that particular clinical field and not all guidelines offer specific values for all elements of trial design. Consequently, the determination of these specific elements relies on the subjective intuition of individual researchers.
Computerized systems offer an automated method for designing a protocol. For example, a context-aware architecture for clinical trial protocol design composed of a decision support module and semantic search engine has been developed [10]. While the idea of constructing an automated system was an innovative approach, this system offers only limited performance. For instance, the idea focuses on creating scientific queries for finding information about a clinical trial protocol and retrieving only related papers through queries. Furthermore, the web-based service is no longer available.
To the best of our knowledge, the most promising computerized system uses a database of clinical trial protocols. Current databases contain extensive information on previous clinical trials covering a wide range of clinical fields [11], [12]. Researchers can retrieve information from specific clinical trials according to their purpose. However, current clinical trial databases offer only limited support in searching for clinical trial protocols. The systems use medical subject heading (MeSH) terms but do not cover all of the text in the protocols, preventing a truly semantic search [12]. Moreover, the current databases do not allow structural protocol searches to retrieve context-dependent protocol elements. A biomedical literature database system (PubMed) could be used to search for clinical trial protocols [13]. However, this would not be efficient because additional work would be required to extract the necessary information from the retrieved literature.
To overcome the limitations of the current database systems, we present our clinical trial protocol database system (CLIPS). We developed a database that enables a semantic search for the core contents of clinical trial protocols, along with semantic filterable features and frame structures for the protocols. Furthermore, our system builds on this database to find clinical trial protocols efficiently by using a query refinement method.
To resolve the difficulty of retrieving specific protocols from a database of complex structures, we developed a graph-based querying system (Fig 1).

Definitions
Clinical trial protocols consist of several elements that can be grouped into factors according to their characteristics [8]. We define the terms "element" and "factor" as follows.
• Element: individual items constituting the clinical trial protocol. An element has a value that defines the protocol. For example, in a protocol, "model" is an element, and the value of this element can be "crossover."
• Factor: a common characteristic of grouped elements. A factor can have multiple elements. For instance, "model" and "allocation" elements are used to design a protocol. Thus, they belong to the "design" factor. Another example is the "enrollment type" and "gender" elements, which determine the subject of a protocol and are part of the "subject" factor.
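The relationship between elements and factors can be pictured as a small data structure. The following is a minimal sketch under stated assumptions, not the CLIPS implementation: the class names, the helper `factor_of`, and all example values other than "crossover" are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Element:
    """A single protocol item together with the value it takes in one protocol."""
    name: str
    value: str


@dataclass
class Factor:
    """A named group of elements that share a common characteristic."""
    name: str
    elements: List[Element] = field(default_factory=list)


# Values other than "crossover" are illustrative assumptions.
design = Factor("design", [Element("model", "crossover"),
                           Element("allocation", "randomized")])
subject = Factor("subject", [Element("enrollment type", "actual"),
                             Element("gender", "both")])


def factor_of(element_name: str, factors: List[Factor]) -> Optional[str]:
    """Return the name of the factor that contains the given element, if any."""
    for factor in factors:
        if any(e.name == element_name for e in factor.elements):
            return factor.name
    return None
```

With these definitions, `factor_of("model", [design, subject])` resolves to the "design" factor, mirroring the grouping described in the definitions.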

Related work

Guidelines
The retrieval of documents containing the design contents of a protocol is one method of determining a clinical research protocol. The document containing guidelines covers the overall information of clinical research protocols. This is the most basic approach for gathering information to develop a protocol. Chan et al. [1] proposed SPIRIT, a high-quality guideline containing 33 checklist items for the development of a clinical trial protocol. Meeker-O'Connell et al. [14] developed a principle document that defines the factors needed to assure patient safety and reliability in a trial. Moreover, some guides specify protocols for examining the efficacy of food or food components for specific diseases [15]. For instance, documents comprising gut health and immunity, diet-related cancer, and atherosclerosis are included [16], [17], [18].
However, guideline-based approaches use subjective judgement in determining what information is included [19]. This limitation can result in different outcomes depending on the user.

Database systems
Database-based information retrieval technologies can be utilized to retrieve clinical protocols. Zarin et al. [11] developed clinicaltrials.gov, the largest database-based retrieval system for all clinical trials, including regulatory mandates and a broad group of trial sponsors. Tasneem et al. [12] established and operated a relational database containing all clinical trials registered with clinicaltrials.gov. Furthermore, there are systems for protocol retrieval using general document retrieval technology, e.g., PubMed, Scopus, Web of Science, and Google Scholar [20], [21], [22], [23]. However, current database-based retrieval systems have limited ability for protocol-specific search objectives, such as retrieving the protocol structure or selecting context-dependent protocol elements sequentially.

Intelligent systems
Intelligent systems are an effective approach for retrieving clinical trial protocols. Tsatsaronis et al. developed an intelligent system based on a context-aware approach for automated protocol design [10]. Their system supports study- and domain-driven searches. Study-driven searches use the parameters (i.e., condition, intervention) of a particular trial as provided by a researcher. In domain-driven searches, a researcher selects options according to the study domain, and then the system automatically searches and categorizes the retrieved information. However, such a system is currently not available.

Clinical protocol database
A clinical trial protocol presents the structure of a clinical trial and is composed of various elements that can be clustered into key factors. In this study, we defined five key factors based on previous baseline research [8]: design, subject, variables, statistical issues, and descriptions. The design factor determines how the trial is structured and modelled to measure data generated during the trial, whereas the subject factor determines who is eligible to participate in the trial and how they are treated to ensure generalizability to the target population. The variables are the parameters to be measured to evaluate the efficacy or safety of a drug or treatment. Statistical issues describe how the clinical trial will be analyzed, specifying sampling procedures or statistical significance. Finally, the description factor covers additional information such as the organization, different phases, and additional explanations of the protocol or trial itself.
We selected and clustered elements from the Aggregate Analysis of ClinicalTrials.gov (AACT) database, which was released on March 27, 2015 [12]. We downloaded a dump file of the AACT database and completely overhauled the loaded database to give 42 tables of 270 columns. We classified the elements into four data types: categorical, value, description, and not union (N/U). Categorical-type elements contain categorical variables, and the sequential selection of these elements can determine the protocol structure (S1 Data). Value-type elements include interval and ratio data, which are important values in key factors. Description-type elements contain additional explanatory text, numeric values, and abbreviated words or dates for the description factor. The N/U-type element consists of primary keys, foreign keys, and database management values. Based on this classification, we selected the categorical and value types, and clustered these elements into the design, subject, variable, and statistical issue factors according to the above-mentioned criteria (Table 1). The N/U elements were eliminated because we constructed a relational table with key-value attributes, discarding unnecessary keys and values for database management. The use of key-value attributes makes it possible to search the skeleton structure of a clinical trial protocol efficiently and to manage data storage effectively when handling inconsistent data [24]. We designed a table schema accounting for these aspects (Table 2). The next step was data compilation. We organized the element values by resolving typographical errors, reflecting dependency structures, removing some control characters, and applying type conversions. Furthermore, we removed ambiguous design types, which are null, and expanded the access and observations (patient registry) to search for specific clinical trial protocols.
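To illustrate why key-value attributes make structural searches straightforward, the following sketch stores each element of a protocol as one (trial, element, value) row and retrieves the trials that match every selected element. It is a simplified illustration, not the schema of Table 2: the table name `protocol_element`, its columns, the trial identifiers, and the function `trials_matching` are all hypothetical.

```python
import sqlite3

# In-memory database with one row per (trial, element, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protocol_element (trial_id TEXT, element TEXT, value TEXT)"
)
rows = [
    ("T1", "model", "crossover"), ("T1", "allocation", "randomized"),
    ("T2", "model", "crossover"), ("T2", "allocation", "randomized"),
    ("T3", "model", "parallel"), ("T3", "allocation", "randomized"),
]
conn.executemany("INSERT INTO protocol_element VALUES (?, ?, ?)", rows)


def trials_matching(connection, selections):
    """Return trial IDs whose stored values match every (element, value) pair."""
    if not selections:
        return []
    # One OR-ed condition per selected element; a trial qualifies only if
    # it matched on all of the selected elements.
    clause = " OR ".join("(element = ? AND value = ?)" for _ in selections)
    params = [x for pair in selections.items() for x in pair]
    sql = (
        f"SELECT trial_id FROM protocol_element WHERE {clause} "
        "GROUP BY trial_id HAVING COUNT(DISTINCT element) = ? "
        "ORDER BY trial_id"
    )
    return [r[0] for r in connection.execute(sql, params + [len(selections)])]
```

Because every element is a row rather than a column, adding a new element type requires no schema change, which is one plausible reading of the storage advantage mentioned above.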
As a result, we collected 184,634 clinical trial protocols and their detailed information. The resulting database can be used to optimize query refinement for retrieving protocol information.

Semantic filtering feature generation
Although we developed our clinical trial protocol database by using a frame structure so that all protocols had a similar structure, the similarity of frame structures does not guarantee the similarity of detailed protocol content.
MeSH offers a potential solution and is used for indexing and cataloging clinical trials in ClinicalTrials.gov and AACT [12], [15]. However, MeSH has limited coverage that does not extend across the spectrum of various biomedical terminologies [26]. Consequently, we extracted various biomedical semantic features to find or filter similar clinical trials in the searched structure.
We generated semantic filterable features related to the conditions and interventions that are considered significant in clinical trials, resulting in a subdivided semantic similarity search. The condition was a phenotype, including any diseases and disorders, observed during clinical trials as well as reported symptoms. The disease-specific phenotype is a set of observable characteristics. The intervention that is the focus of a clinical trial is commonly a drug, which can involve chemical compounds [27]. Similar clinical trials can be searched for or filtered through each of the corresponding elements. In addition, the identification of similar target genes or proteins is a promising method of searching for similarities among chemical compounds and phenotypes, as they are a molecular proxy that links them [28], [29]. Thus, we applied named entity recognition (NER) to the phenotypes, chemical compounds, and genes to enable a semantic search. Semantic filters were employed for the following description elements: brief title, official title, brief summary, detailed description, keywords, and conditions (Fig 2).

Phenotype
We extracted semantic features to represent disease-specific phenotype words. The unified medical language system (UMLS) is a repository of integrated biomedical terminologies, and thus we used UMLS 2015AB to process phenotype words [26]. To employ NER on descriptive values, we applied MetaMap 2016 and cTAKES 3.2.2 [30], [31]. We combined the results of the two tools and removed duplicates to exploit the advantages of both [32]. Next, we selected 15 semantic types, which are considered disease phenotypic types, and removed other types from the results (Table 3). As a result, disease phenotypic features with unique concept IDs were generated for each clinical trial.
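The combination of two NER outputs, followed by de-duplication and semantic-type filtering, can be sketched as follows. This is an illustrative sketch only: it does not call MetaMap or cTAKES, the `Concept` record and `merge_phenotypes` function are hypothetical, and `PHENOTYPE_TYPES` is a three-item placeholder rather than the 15 semantic types of Table 3. Concept IDs beginning with "C99" are made up for the example.

```python
from typing import Iterable, List, NamedTuple


class Concept(NamedTuple):
    cui: str            # UMLS concept unique identifier
    semantic_type: str  # UMLS semantic type abbreviation


# Placeholder subset (assumption); the selected types are listed in Table 3.
PHENOTYPE_TYPES = {"dsyn", "neop", "sosy"}


def merge_phenotypes(*tool_outputs: Iterable[Concept]) -> List[str]:
    """Union the concepts found by several tools, keeping each phenotype CUI once."""
    seen = set()
    merged = []
    for output in tool_outputs:
        for concept in output:
            is_phenotype = concept.semantic_type in PHENOTYPE_TYPES
            if is_phenotype and concept.cui not in seen:
                seen.add(concept.cui)
                merged.append(concept.cui)
    return merged


# Toy outputs standing in for the two tools' results on one trial description.
tool_a = [Concept("C0206686", "neop"), Concept("C9999991", "phsu")]
tool_b = [Concept("C0206686", "neop"), Concept("C9999992", "sosy")]
phenotype_features = merge_phenotypes(tool_a, tool_b)
```

Here the concept found by both tools is kept once, the concept found by only one tool is retained, and the concept with a non-phenotypic type is dropped.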

Chemical compound
We applied NER to extract chemical compound entities from the descriptions using ChemSpot [33]. ChemSpot provides Chemical Abstracts Service (CAS) IDs and International Chemical Identifiers (InChI) but does not provide standard InChIKeys. The InChIKey is the compacted version of InChI, and the standard InChIKey is a stable identifier for reflecting the identifier version designation [34]. Moreover, standard InChIKeys are considered to provide equivalent descriptions between compounds in drug discovery [35]. To take advantage of standard InChIKeys for chemical compound entities, we examined the original words of the NER-processed entities by using ChemSpider [36]. The simple application programming interface (API) of ChemSpider was exploited, allowing us to generate chemical compound entities with standard InChIKeys, InChIs, and the simplified molecular-input line-entry system notation.
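As a small aid to the distinction between standard and non-standard InChIKeys, the following sketch checks the textual format of a key. It is not part of the described pipeline and does not call ChemSpot or ChemSpider; the function name `is_standard_inchikey` is hypothetical. It relies on the published InChIKey layout: a 14-letter block, a 10-letter block whose ninth letter is "S" for standard or "N" for non-standard keys, and a final single letter.

```python
import re

# 14 letters, hyphen, 8 letters + kind flag + version letter, hyphen, 1 letter.
_INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{8}([SN])[A-Z]-[A-Z]")


def is_standard_inchikey(key: str) -> bool:
    """Return True if the string has InChIKey format and carries the 'S' flag."""
    match = _INCHIKEY.fullmatch(key)
    return match is not None and match.group(1) == "S"


# The standard InChIKey of aspirin passes the check.
aspirin_ok = is_standard_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
```

A check of this kind only validates the format; it cannot confirm that a key corresponds to a real compound.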

Gene
We added gene entities to the elements of the semantic filters. The gene annotation tool Moara was used for gene NER because it is capable of performing both recognition and normalization of gene entities: it recognizes entities and their positions in the input text and links the entities to gene IDs in a known gene database [37]. Moara provides various preconstructed machine learning-based models for various organism species. For our task, we adopted the human-oriented model. For gene normalization, we obtained lists of gene IDs corresponding to each gene entity. The gene ID with the highest score was selected and mapped to the recognized gene term.
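The final normalization step, choosing the highest-scoring gene ID for each recognized term, can be sketched as below. This sketch does not use the Moara API; the function names, the candidate format (ID, score), and all identifiers and scores are hypothetical placeholders.

```python
from typing import Dict, List, Optional, Tuple

Candidate = Tuple[str, float]  # (gene ID, normalization score)


def normalize_gene(candidates: List[Candidate]) -> Optional[str]:
    """Return the gene ID with the highest score, or None if there are no candidates."""
    if not candidates:
        return None
    best_id, _ = max(candidates, key=lambda candidate: candidate[1])
    return best_id


def normalize_mentions(mentions: Dict[str, List[Candidate]]) -> Dict[str, str]:
    """Map each recognized gene term to its best-scoring gene ID."""
    result = {}
    for term, candidates in mentions.items():
        best = normalize_gene(candidates)
        if best is not None:
            result[term] = best
    return result


# Placeholder terms, IDs, and scores for illustration only.
mapping = normalize_mentions({
    "term_1": [("ID_A", 0.4), ("ID_B", 0.9)],
    "term_2": [],
})
```

Terms without any candidate are simply left unmapped in this sketch; how such cases were handled in practice is not stated.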

Web application development for query refinement
Clinical trial protocols have increasingly complex structures [38]. The level of protocol complexity is inversely related to clinical trial performance, as complex protocols negatively impact factors such as protocol amendment rates, patient recruitment, and retention rates [39]. In addition, the increasing complexity of the protocols hinders the design of new protocols, as clinicians referring to previous clinical trials to design a protocol inevitably face difficulties in searching for suitable examples. Thus, from a query refinement standpoint, we developed the CLIPS web application to provide a graph querying interface for retrieving information about reliable clinical trial protocols, rather than a text querying interface that cannot visualize the dependency among prior elements affecting the protocol structure [40], [41].
We defined categorical-type elements as the frame structures of protocols. Although we have provided default orders of the elements, the user is free to choose the order. Once a decision about the order has been made, the user  The backend of the interface was developed using Node.js [42] and the visualization of the interface is manipulated by d3.js [43]. We developed custom functions on d3.js to show each element title and protocol count for the user's selection. To implement semantic filtering of the user's free-text input, a backend engine is connected to the representational state transfer (REST) NER API. This provides NER of processed entities and types, allowing relevant entities to be searched for in the database. We designed the system architecture to combine the interface application, APIs, and database for stable operation in a cloud-computing environment (S1 Fig).

Database
We collected 184,634 clinical trial protocols; 13,210 frame structures of clinical trial protocols; and extracted 5,765,054 phenotypes, 1,151,053 chemical compounds, and 222,966 gene features for semantic filtering (Table 4).
Furthermore, we designed a continuous data update procedure so that the protocol methods could evolve naturally, thus enhancing the quality of the database (S2 Fig). In conclusion, we developed a database system that efficiently retrieves information about existing clinical trial protocols for use in designing new clinical trial protocols.

Application
The web application provides a service for retrieving protocol structures and inquiring about protocol information.
The user goes through four stages in using this service. (1) First, the order of the protocols must be set. (2) The protocol structure is then designed by selecting the elements that correspond to each sequence. (3) Next, various functions required to search for the desired protocol information are set. (4) Finally, the user receives the desired protocol information and explores the contents in detail. We developed the necessary interfaces to perform all of these processes (Fig 4).

Fig 4. System overview.
Before retrieving the protocol structure, the user should define the protocol sequence. This process uses a drag-and-drop interface to sort the list into one box and set the order. This allows users to work more intuitively [44]. After determining the protocol sequence, the vector-based collapsible tree diagram visualization interface illustrates the protocol level and structure [45,46]. The loaded protocol data are assembled into a hierarchical data structure. This visualization is rendered as a relation tree with a parent/child structure. The user must click on the edge of the tree to add the next-step protocol. Conversely, to remove protocol edges from the current stage, the user must click on the parent edge of the previous step. The entire data structure is synchronized and updated every time the process occurs [47].
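The hierarchical structure behind such a tree can be sketched as a prefix tree over element values, where each node stores how many protocols share the path from the root. The sketch below is an assumption-laden illustration in Python rather than the Node.js and d3.js code of the application; `Node`, `build_tree`, `next_options`, and the toy protocols are all hypothetical.

```python
from typing import Dict, List


class Node:
    """One selectable value in the tree, with the number of protocols below it."""

    def __init__(self) -> None:
        self.count = 0
        self.children: Dict[str, "Node"] = {}


def build_tree(protocols: List[Dict[str, str]], order: List[str]) -> Node:
    """Insert every protocol as a path of element values in the chosen order."""
    root = Node()
    for protocol in protocols:
        node = root
        node.count += 1
        for element in order:
            value = protocol.get(element, "N/A")
            node = node.children.setdefault(value, Node())
            node.count += 1
    return root


def next_options(root: Node, selected: List[str]) -> Dict[str, int]:
    """Return the values available at the next step and their protocol counts."""
    node = root
    for value in selected:
        node = node.children[value]
    return {value: child.count for value, child in node.children.items()}


# Toy protocols; the values are illustrative only.
protocols = [
    {"model": "crossover", "allocation": "randomized"},
    {"model": "crossover", "allocation": "non-randomized"},
    {"model": "parallel", "allocation": "randomized"},
    {"model": "crossover", "allocation": "randomized"},
]
tree = build_tree(protocols, ["model", "allocation"])
```

Calling `next_options(tree, ["crossover"])` lists only the allocation values that occur together with a crossover model, which is the context dependence the interface exposes. Rebuilding the tree with a different `order` corresponds to the user rearranging the protocol sequence.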
After the protocol structure has been retrieved, the user obtains protocol information based on the selected protocol structure. We developed a function called Clip to back up the protocol design. This allows the user to reuse previously selected protocol structures and receive corresponding study information. In addition, a protocol-information-filter function allows the user to retrieve the study information. The user searches for a disease and generates a label that contains the disease code. The protocol information is then filtered according to the set label. The resulting data are rendered as a table, which can be sorted with respect to the column entities to focus on and export specific data. When exploring detailed protocol information, our system transforms the data into a collapsible interface instead of providing raw text.
Although protocol design concepts have evolved around the world, the development of tools to design clinical trial protocols has been limited [48][49][50]. Our aim was to simplify clinical trial retrieval and the design stage by developing a dedicated interface. We expect this to be the starting point for the creation, sharing, and development of more clinical trial protocols.

Validation

Technical validation
The goal of CLIPS was to provide an information retrieval method that can search complex clinical trial protocols.
For this, we developed a search tool that can build and utilize a database suitable for the protocol structure.
Furthermore, we created semantic features in CLIPS using text-mining methods. As a result, it is possible to perform accurate searches using the contextual meaning of the protocol. To evaluate the performance of CLIPS, we attempted to verify whether the semantic filter results in better performance than a keyword search.
For the technical validation of the semantic filter of CLIPS, we used the relation information between clinical trial protocols and corresponding disease conditions as collected from clinicaltrials.gov [11]. As this disease condition assignment is manually curated by experts and does not originate from the protocol itself, it can be used as a gold standard to evaluate the semantic filters of CLIPS.
The gold standard set of disease conditions and corresponding trial protocols was obtained by crawling the topic page of clinicaltrials.gov [11]. Among 25 conditional categories provided by clinicaltrials.gov, the "Cancers and Other Neoplasms" condition category was selected, as it covers 44.74% of the total protocol set. Consequently, the corresponding trial protocols and corresponding disease conditions were identified. For instance, in our gold standard set, 353 distinct protocols were associated with the disease condition "Abdominal neoplasm." As a result, a set of 82,584 distinct protocols corresponding to 520 disease conditions was compiled (as of July 12, 2017) and used as a gold standard set for technical validation.
The semantic search performance of CLIPS was validated using the following procedure. In CLIPS, search keywords containing the disease condition names were supplied as input queries to the system. The semantic entities were translated from the search keyword through the text-mining-based models described in the previous section. The results were obtained by conducting a search using the translated semantic entities from the CLIPS database. Exact matching with the AACT database was used as a baseline. The CLIPS and AACT databases were configured on a single local server.
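The difference between the exact-match baseline and the semantic search can be illustrated with a toy example. Everything in this sketch is hypothetical except the concept ID C0206686 and the condition name "Adrenocortical Carcinoma": the records, the second condition wording, the lookup table standing in for the NER step, and the function names are assumptions made for illustration.

```python
from typing import List

# Toy records: a condition field plus concept IDs obtained by text mining.
TRIALS = [
    {"id": "T1", "condition": "Adrenocortical Carcinoma", "cuis": {"C0206686"}},
    {"id": "T2", "condition": "Adrenal Cortex Cancer", "cuis": {"C0206686"}},
    {"id": "T3", "condition": "Hypertension", "cuis": {"C9999993"}},
]

# Stand-in for the NER step that translates a keyword into concept IDs.
KEYWORD_TO_CUIS = {"adrenocortical carcinoma": {"C0206686"}}


def keyword_search(keyword: str) -> List[str]:
    """Baseline: exact, case-insensitive match on the condition field."""
    return [t["id"] for t in TRIALS if t["condition"].lower() == keyword.lower()]


def semantic_search(keyword: str) -> List[str]:
    """Match trials that share at least one concept ID with the keyword."""
    cuis = KEYWORD_TO_CUIS.get(keyword.lower(), set())
    return [t["id"] for t in TRIALS if t["cuis"] & cuis]
```

In this toy setting the semantic search also returns the trial whose condition is worded differently but maps to the same concept, which is the mechanism by which a concept-based filter can raise recall.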
We used a condition name (e.g., Adrenocortical Carcinoma) as a search keyword to retrieve the condition field of the source database and semantic entities (e.g., C0206686) from CLIPS. We then validated our retrievals by computing the precision, recall, and F1-score against the gold standard set. The F1-score of CLIPS (0.515) was higher than that of the keyword search (0.38) (Fig 5). The precision of CLIPS (0.437) was lower than that of the keyword search (0.668), but CLIPS outperformed the keyword search by more than a factor of two in terms of recall (0.63 and 0.26, respectively). As higher recall values are a positive factor in a clinical trial design, CLIPS can retrieve more protocols that provide more suitable references for protocol design.
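For reference, precision, recall, and the F1-score for a single query can be computed from the retrieved set and the gold standard set as sketched below. The function name and the toy identifiers are hypothetical, and the sketch uses the standard definitions; how the per-condition scores were aggregated into the reported values is not stated here, so no aggregation is shown.

```python
from typing import Set, Tuple


def precision_recall_f1(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one query against a gold standard."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: four retrieved protocols, three relevant ones, two in common.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Because F1 is the harmonic mean of precision and recall, a system with lower precision can still reach a higher F1-score when its recall is substantially higher, as in the comparison above.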

Fig 5. Evaluation results of keyword search and using semantic filter of CLIPS for "Cancers and Other Neoplasms" categorized conditions.

User experience
As described earlier, the CLIPS search system was developed for different purposes than those of the existing clinical trial search systems. CLIPS is intended to assist in the effective design of a specific trial protocol, whereas existing search systems are generally used to process various information on clinical trials. Therefore, to evaluate the performance of CLIPS, an evaluation method that reflects this purpose should be constructed. As the ultimate goal of a search system is to help users collect the information they require, the subjective satisfaction of users is a significant measure of performance. Thus, the evaluation of a search system should be able to quantify the subjective impressions of users as well as objective indicators. By considering these factors, we conducted an evaluation trial that compared CLIPS with the conventional search system provided by clinicaltrials.gov.
Ten participants aged between 24 and 33 were recruited from a group of experts in the field of bioinformatics (Bio and Brain Engineering Department of Korea Advanced Institute of Science and Technology, Republic of Korea). They included two undergraduates, three master's students, four PhD candidates, and one postdoctoral researcher. The participants were assigned the task of finding a suitable clinical trial set for a simulated problem.
Two separate tasks were given to the participants, who were asked to construct the most common previous protocol design under the given trial conditions and research questions and perform each task by using each of the two search systems within the time limit (5 min each). To construct the most common trial design, participants had to collect information about the various elements of the trial protocol (Fig 6). After the task, participants completed a questionnaire on their subjective satisfaction regarding the system. The questionnaire consisted of six questions that collected participants' satisfaction using a 7-point Likert scale [53].

Fig 6. Tasks given to participants for evaluation.
Participants were observed to perform better when using CLIPS than when using the clinicaltrials.gov search system. By using CLIPS, participants obtained more answers within the time limit, and the average time required to perform the task was shorter. The number of clinical trials retrieved from CLIPS was less than that from clinicaltrials.gov, as it was possible to apply more detailed search filters to narrow the search scope. Participants were more satisfied with CLIPS than the existing search systems, as evidenced by the average score of the questionnaire responses (Table 5). These results show that CLIPS can be effectively used to retrieve certain types of trial protocols.

Discussion
Clinical trial protocols are the foundation for planning, approving, conducting, and reporting clinical trials [1].
They include general information, objectives, trial design, the selection and withdrawal of subjects, treatment, safety assessments, quality control procedures, and record keeping processes [54]. This study aimed to develop an efficient method of providing the information necessary for clinical trial protocol development. In particular, we have made it possible to find previous protocols of the desired type by using the structural features of the protocol composed of context-dependent protocol elements. Furthermore, semantic filtering was included to ensure the retrieval of relevant protocol context information.
CLIPS can search for protocols or specific disease names and structures and can be used for a combination of structural searches, structural order searches, semantic searches, or searches including both structure and semantic context. For instance, our system can perform the following functions:
(1) Define key and ordering elements.
(2) Search each dependent element for its prior selected elements.
(3) Select clinical trial protocols and more clinical trial protocols through the clipped protocol.
(4) Search for relevant clinical trial protocols including all information about that protocol.
We believe that many clinicians will be able to utilize our system to design more reliable clinical trial protocols.
The developed semantic filter can be used to search for protocols, and the retrieved protocols can in turn be used for drug discovery. CLIPS provides search results as a downloadable file containing the semantic filters as well as protocol structure information. This ensures the wide coverage of the protocol search. For example, we used UMLS as a phenotype semantic filter. UMLS is an integrated terminology system that combines biomedical terminologies including SNOMED CT, MeSH, and MedDRA [55]. Clinical terms in SNOMED CT have been integrated into the UMLS Metathesaurus since 2003 [56]. For instance, coverage of HPO terms is higher in UMLS than in SNOMED CT [57]. According to Bodenreider's study, UMLS covered 54% of the HPO phenotype terms, whereas SNOMED CT covered only 30% [26]. Based on this semantic filter information, it is possible to screen chemical compounds of drug candidate substances, regardless of whether they are known to be effective, from the previously retrieved protocols [58]. Genes and phenotypes can also be used for drug efficacy screening in massive biological networks using a similar approach [59,60]. Furthermore, users can download the entire database. The efficacy weight of edges can then be predicted and the predicted pathways validated in large-scale biological networks [61].
The present study is limited in terms of user-experience validation. This is because clinicaltrials.gov and CLIPS have different objectives. Clinicaltrials.gov was developed to register clinical trials [11]. The registered information can be retrieved by clinical researchers, patients, and families of patients. This is a different objective from that of CLIPS, which is specially designed for the protocol search task. As the objective is different, the method is different.
Therefore, it cannot be claimed that clinicaltrials.gov offers worse performance than CLIPS, although the user-experience validation experiment showed a low score for the use of clinicaltrials.gov. The satisfaction score of CLIPS is high in terms of the objective of protocol searching. If clinicaltrials.gov were to include the search function of CLIPS, it would offer a comprehensive and specific information search ability to clinical researchers.

Conclusion
Clinical trial protocols are a crucial factor in allowing clinical trials to achieve their primary purposes. However, clinical researchers tend to design clinical trials according to their individual expertise. This can cause inconsistencies and objectivity problems among clinical trial protocols. To solve these problems, an information retrieval system for clinical trial protocols is needed. Therefore, we developed a clinical trial protocol database system and an information retrieval web application that allow clinicians, pharmaceutical companies, and even regulatory agencies to design and check clinical trial protocols conveniently. This paper has described the formulation of the CLIPS database system and explained its implementation and advantages over existing keyword-based search systems. The whole database is available for download (http://corus.kaist.edu/clips).