Table 1.
Comparison of representative systems for eligibility criteria interpretation and patient–trial matching.
Fig 1.
The overall framework of EC2Seq2Sql.
The workflow starts from clinical trial eligibility criteria collected from ClinicalTrials.gov (input), parses the free-text criteria into lightweight structured patterns with a BART-based semantic parser, and converts the structured patterns into executable SQL through an LLM-based agent that retrieves eligible patients from the EHR database (output). The three main stages are (1) eligibility criteria acquisition, (2) eligibility criteria parsing, and (3) patient matching, connected by directional arrows that indicate the processing order. Each stage produces intermediate outputs (retrieved trials, structured patterns, SQL queries) that serve as inputs to the subsequent stage.
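To make the intermediate representation concrete, the following sketch shows what a "lightweight structured pattern" for one criterion might look like. The field names and layout here are illustrative assumptions, not the paper's exact schema; only the seven-domain keying follows the text.

```python
# Illustrative sketch (field names are assumptions, not the paper's schema):
# one free-text criterion and the lightweight structured pattern a semantic
# parser might emit, with each condition keyed to one of the seven domains.

criterion = "Adults aged 18-65 with type 2 diabetes, not on insulin therapy"

structured_pattern = {
    "inclusion": [
        {"domain": "age",       "op": "between", "value": [18, 65]},
        {"domain": "condition", "concept": "type 2 diabetes"},
    ],
    "exclusion": [
        {"domain": "drug", "concept": "insulin"},
    ],
}

# Downstream, the SQL-generation agent consumes this pattern instead of
# the raw free text.
print(len(structured_pattern["inclusion"]))  # 2 inclusion conditions
```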
Table 2.
Definitions of the seven clinical domains used in this study.
Fig 2.
This figure details the workflow that transforms the structured eligibility patterns into an executable SQL query. The input is the lightweight, seven-domain structured representation produced in the previous stage. First, the front-end sends the structured input to the FastAPI service, which forwards it to the LangChain-based workflow. LangChain constructs a prompt that combines a system prompt (task description and database context) and a human prompt (explicit inclusion/exclusion conditions). Guided by this hierarchical prompt, the GPT-4 model generates a syntactically correct and schema-aligned SQL statement. The final output is an SQL query that can be executed on the hospital EHR database to return the list of patients matching the criteria.
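The hierarchical system/human prompt described above can be sketched as follows. This uses plain Python strings rather than LangChain's template classes, and the prompt wording and schema excerpt are assumptions for illustration, not the paper's actual prompts or database schema.

```python
# Minimal sketch of the hierarchical prompt (wording and schema excerpt are
# illustrative assumptions, not the paper's exact prompts or EHR schema).

SYSTEM_PROMPT = (
    "You translate structured clinical-trial eligibility patterns into SQL.\n"
    "Target schema (excerpt): patients(id, birth_date, gender), "
    "diagnoses(patient_id, icd_code, description), "
    "prescriptions(patient_id, drug_name).\n"
    "Return one syntactically valid SQL query; do not invent columns."
)

def build_human_prompt(pattern: dict) -> str:
    """Render inclusion/exclusion conditions as explicit bullet lists."""
    lines = ["Inclusion criteria:"]
    lines += [f"- {c}" for c in pattern["inclusion"]]
    lines.append("Exclusion criteria:")
    lines += [f"- {c}" for c in pattern["exclusion"]]
    return "\n".join(lines)

pattern = {
    "inclusion": ["condition: type 2 diabetes", "age: 18-65"],
    "exclusion": ["drug: insulin"],
}
messages = [("system", SYSTEM_PROMPT), ("human", build_human_prompt(pattern))]
# `messages` would then be sent to the chat model (e.g., GPT-4 via an API
# client) to obtain the schema-aligned SQL statement.
```

Keeping the system prompt (task plus schema context) separate from the human prompt (the per-trial conditions) lets the database context be reused across trials while only the criteria vary.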
Fig 3.
Distribution of the seven conceptual domains.
This figure shows how EC elements are mapped to the seven domains (condition, procedure, observation, laboratory, drug, age, and gender), motivating the use of this schema in the EC2Seq2Sql pipeline.
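A toy illustration of this mapping is sketched below. The keyword lookup is a hypothetical stand-in: a real system would rely on concept normalization (e.g., against a medical vocabulary) rather than substring matching, and the fallback rule is an assumption.

```python
# Hypothetical keyword-to-domain lookup illustrating the seven-domain schema;
# production systems would use concept normalization, not string matching.

DOMAINS = ("condition", "procedure", "observation", "laboratory",
           "drug", "age", "gender")

KEYWORD_MAP = {
    "diabetes": "condition",
    "appendectomy": "procedure",
    "blood pressure": "observation",
    "hba1c": "laboratory",
    "metformin": "drug",
    "years old": "age",
    "female": "gender",
}

def map_to_domain(ec_element: str) -> str:
    """Assign an EC element to one of the seven domains (toy heuristic)."""
    text = ec_element.lower()
    for keyword, domain in KEYWORD_MAP.items():
        if keyword in text:
            return domain
    return "condition"  # assumed fallback for unmatched elements

print(map_to_domain("HbA1c below 7%"))  # laboratory
```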
Fig 4.
Sentence-length distribution of EC texts.
This figure summarizes token lengths of EC sentences in the dataset, indicating the presence of both short rules and long, nested clauses that require transformer-based parsing.
Table 3.
BART-large-CNN hyperparameters.
Table 4.
Performance comparison of different sequence models on the dataset.
Fig 5.
Interface for BART-based EC parsing.
Users input free-text eligibility criteria, and the system returns the corresponding lightweight structured pattern, which is then fed to the SQL generation stage.
Fig 6.
Example of executing an auto-generated SQL for patient retrieval.
The SQL produced from the structured eligibility patterns is run on the de-identified hospital EHR to return patients who satisfy the trial criteria; the figure shows this end-to-end result display.
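The execution step can be exercised end to end on a toy database. In this sketch the table and column names are illustrative assumptions (the paper's EHR schema is not reproduced here), and SQLite stands in for the hospital database.

```python
# Toy end-to-end check: run a generated-style SQL query against an in-memory
# SQLite "EHR" (table and column names are illustrative assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, gender TEXT);
CREATE TABLE diagnoses (patient_id INTEGER, description TEXT);
INSERT INTO patients VALUES (1, 54, 'F'), (2, 71, 'M'), (3, 33, 'F');
INSERT INTO diagnoses VALUES (1, 'type 2 diabetes'), (2, 'type 2 diabetes');
""")

# SQL in the style the agent would emit for
# "adults aged 18-65 with type 2 diabetes".
sql = """
SELECT p.id FROM patients p
JOIN diagnoses d ON d.patient_id = p.id
WHERE d.description = 'type 2 diabetes' AND p.age BETWEEN 18 AND 65;
"""
eligible = [row[0] for row in conn.execute(sql)]
print(eligible)  # [1] — patient 2 fails the age filter, patient 3 the diagnosis
```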
Fig 7.
Radar visualization of multi-metric ablation results.
The full system encloses the largest area across all five metrics. Removing structured patterns mainly harms EM/EX despite similar text-level scores; removing agent prompting causes the most severe SQL degradation; dropping the seven-domain constraint preserves SQL accuracy but reduces clinical matching on the real-world cohort.
Table 5.
Ablation study on the benchmark dataset and the real-world EHR cohort.
Table 6.
Common error types and representative examples in the EC2Seq2Sql framework.