
A global analysis of national cardiovascular disease control plans using a multi-agent artificial intelligence model

  • Hugh Pearson,

    Roles Conceptualization, Formal analysis, Investigation, Project administration, Writing – original draft, Writing – review & editing

    Affiliation Health Systems Innovation Lab, Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America

  • Caleb J. Kumar,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization

    Affiliation Health Systems Innovation Lab, Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America

  • Che L. Reddy,

    Roles Methodology, Writing – review & editing

    Affiliation Health Systems Innovation Lab, Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America

  • Estella Rose LeBlanc,

    Roles Resources, Writing – review & editing

    Affiliation Harvard College, Harvard University, Cambridge, Massachusetts, United States of America

  • Rifat Atun,

    Roles Conceptualization, Supervision, Writing – review & editing

    ratun@hsph.harvard.edu

    Affiliations Health Systems Innovation Lab, Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America, Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America, Department of Global Health and Social Medicine, Harvard Medical School, Harvard University, Boston, Massachusetts, United States of America, Faculty of Medicine, Imperial College London, London, United Kingdom

  • With the CVD Control Collaborative

    Membership of the CVD Control Collaborative is provided in the Acknowledgements.

Abstract

Cardiovascular diseases cause nearly one-third of global deaths, yet standalone National Cardiovascular Disease Control Plans remain uncommon and inconsistently structured. We assessed the comprehensiveness of recent national plans using a validated health-systems framework and a multi-agent artificial intelligence model. We identified the most recent official plan for 45 countries from World Health Organisation and World Heart Federation repositories and government sources. We adapted a health-systems planning framework for cardiovascular disease and validated it through a two-stage expert consensus process involving 42 specialists from 28 countries, resulting in 11 elements and 69 sub-elements with standardised definitions and scoring criteria. Plans were analysed using a three-stage artificial intelligence pipeline that ingested documents, applied framework-based scoring, and performed automated validation checks. Sub-elements were scored on a 0–5 scale and summarised by element, World Health Organisation region, and World Bank income group. Overall comprehensiveness was low (median 1.20/5). Plans most consistently addressed strategic direction (median 2.80) and governance arrangements (2.14). Contextual assessment was deficient — threats (0.12) and opportunities (0.29) — as were performance specification elements, including objectives (0.50) and health system outcomes (0.67). The Western Pacific region scored highest (median 1.71) and Africa lowest (0.90), though scores remained below moderate levels across all regions. Income group pairwise comparisons were non-significant across all groups; given the small LIC sample (n = 2), no inferential conclusions about income group differences are drawn. 
Validation against blinded human review across six countries showed 43.7% exact agreement and 68.0% agreement within one point; ordinal agreement statistics were uniformly weak and non-significant, indicating the approach is validated for structural benchmarking rather than fine-grained qualitative judgement. Most national cardiovascular disease plans articulate vision without sufficient operational detail, particularly for contextual analysis, measurement, and integrated financing. Standardised planning templates and artificial intelligence–supported benchmarking, complemented by expert review, could strengthen national planning quality and enable scalable global comparisons.

Author summary

Cardiovascular disease remains the world’s leading cause of death, yet the national strategies designed to fight it often lag behind those for other major illnesses like cancer. In this study, we set out to understand the quality of these strategies by analysing 45 national control plans from around the globe. Using a specialised artificial intelligence tool validated against independent human review, we evaluated how comprehensive these plans truly are against a standard health-system framework. We found that while most countries are effective at setting high-level goals and identifying leadership structures, they frequently fail to include the practical details necessary for implementation, such as specific budgets, local risk analysis, and clear methods to measure progress. Plan quality did not differ significantly across income groups in this sample, suggesting that the global deficit in planning may reflect the absence of standardised methodology more than resource constraints alone. Our findings highlight an urgent need for evidence-based planning guides to help governments transition from political promises to practical action. We also demonstrate that while artificial intelligence can efficiently analyse large volumes of health policy, it serves best as a partner to human judgment rather than a replacement.

Introduction

Cardiovascular diseases (CVDs) are the leading cause of death globally, accounting for an estimated 19.8 million deaths annually and 32% of all global mortality in 2023 [1–3]. Although this burden is widespread, differences in demographics, economic conditions, and health-system capacity create substantial disparities in how CVDs are prevented, diagnosed, treated, cared for, and rehabilitated worldwide [4,5].

National Cardiovascular Disease Control Plans (NCVDCPs) are policy instruments used to organise the prevention, treatment, and rehabilitation of CVD [6,7]. Implementing these integrated, system-wide plans is central to fulfilling the World Health Assembly’s WHA70.12 resolution (2017), which calls on Member States to adopt coordinated approaches to non-communicable disease (NCD) control that set clear objectives, indicators, and monitoring arrangements, and that strengthen financing, workforce, and access to essential medicines with an explicit equity focus [8]. NCVDCPs are also critical to achieving the UN Sustainable Development Goal target of a 30% reduction in premature mortality from NCDs by 2030 [9].

However, national strategic planning for CVD lags behind other major disease areas. Research by the World Heart Federation (WHF) indicates that while 87% of countries have national cancer control plans, only 7% have developed standalone strategies for CVD [7]. Where CVD plans do exist, there is no globally agreed or standardised format to guide their structure, comprehensiveness or quality, despite roadmaps published by the WHF and WHO [6,10].

Large language models (LLMs) offer a scalable approach to policy appraisal, enabling rapid, consistent, high-granularity analysis of complex documents while mitigating the resource constraints and subjectivity inherent in traditional manual reviews [11,12]. To address current gaps in CVD strategic planning, we applied artificial intelligence-based methods, with human oversight embedded at every stage of development and assessment [13,14], to evaluate the comprehensiveness of a global sample of NCVDCPs drawn predominantly from high-income and European settings. Applying a validated health-systems framework, we compared comprehensiveness across country income groups and WHO regions to identify opportunities to strengthen national CVD planning.

Methods

National cardiovascular disease control plan collection

We identified the most recent official NCVDCP for each country by searching the WHO NCD Document Repository [15], WHF resources [6], and government websites. Inclusion was limited to CVD-specific national plans in any language. We excluded broader NCD plans with CVD components, sub-national plans, draft documents, and superseded versions.

Analytical framework development

To define the foundational content of a comprehensive plan, we adapted the 2008 health-systems framework by Atun et al. [16]. Originally used to analyse European National Cancer Control Plans (NCCPs), this framework takes a systems perspective to define the elements of an ‘ideal’ plan and is designed for scalable application across contexts. We recontextualised this model for CVD by undertaking a literature review identifying current approaches to NCVDCP development and by incorporating guidance from the WHF CVD Roadmaps [6] and the WHO HEARTS Technical Package [10]. The resulting template comprises 11 elements and 69 sub-elements, each with standardised definitions, indicators, and scoring criteria specific to NCVDCPs.

Sub-elements were scored on a 0–5 ordinal scale with criteria defined prospectively in the analytical framework: 0 = not discussed; 1 = partially discussed, no recommended indicators used; 2 = partially discussed with some recommended indicators; 3 = fully addressed with some or all recommended indicators; 4 = fully addressed with all recommended indicators and national baseline data; 5 = fully addressed with all indicators, baseline data, and specific measurable targets. Scoring criteria were sub-element-specific and applied identically by PRISM and human reviewers; full criteria are provided in the supplementary information (S1 Table).
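As an illustration only, the 0–5 scale described above can be written as a small decision rule. The sketch below is not the PRISM implementation: the function name, argument names, and the reduction of each sub-element to four simplified judgements are assumptions, and the sub-element-specific criteria in S1 Table are not reproduced. Combinations that the scale as summarised here does not define raise an error rather than being guessed.

```python
def assign_level(coverage, indicators, has_baseline=False, has_targets=False):
    """Map simplified judgements to a 0-5 rubric level (hypothetical helper).

    coverage:   "none", "partial" or "full"
    indicators: "none", "some" or "all" of the recommended indicators
    """
    if coverage == "none":
        return 0  # not discussed
    if coverage == "partial":
        # 1 = partially discussed, no recommended indicators used
        # 2 = partially discussed with some recommended indicators
        return 1 if indicators == "none" else 2
    if coverage == "full":
        if indicators == "none":
            # The scale as summarised in the text does not cover this case.
            raise ValueError("full coverage without indicators is not defined")
        if indicators == "all" and has_baseline and has_targets:
            return 5  # all indicators, baseline data, measurable targets
        if indicators == "all" and has_baseline:
            return 4  # all indicators and national baseline data
        return 3  # fully addressed with some or all indicators
    raise ValueError(f"unknown coverage value: {coverage!r}")


# Example: full coverage, every recommended indicator, baseline but no targets.
level = assign_level("full", "all", has_baseline=True, has_targets=False)
```

Raising an error for undefined combinations keeps the sketch from silently inventing rubric rules that the framework does not state.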

Analytical framework validation

The analytical framework was validated through a modified two-stage Delphi process [17]. The expert panel, the CVD Control Collaborative, comprised 42 specialists in clinical cardiology, health policy, public health, and health systems recruited from 28 countries via the networks of the WHF, Global Health Policy Lab, Wellcome Trust/India Alliance, and the Harvard Health Systems Innovation Lab.

In the first stage (August to October 2025), participants completed an online questionnaire to rate the relevance of sub-elements on a 1–5 scale, assess definition accuracy, and suggest additions to definitions and indicators. Initial ratings indicated moderate-to-high relevance (overall mean 3.90/5). Elements scoring below 4 – Health System Performance Outcomes (3.40), Outputs (3.42), Objectives (3.44), Threats (3.79), and Opportunities (3.87) – were prioritised for detailed revision in the subsequent stage (S2 Text). Qualitative feedback from stage one prompted substantial refinements to the framework, particularly regarding the integration of comprehensive care models and value-based metrics (S3 Text). Definitions were expanded to explicitly include the management of key comorbidities, such as periodontal disease, and to incorporate patient-reported outcome measures that capture wellbeing alongside traditional epidemiological data. The revised framework also added specific indicators for vulnerable populations, multidisciplinary workforce integration, and high-impact, low-cost interventions such as polypills and sodium reduction strategies.

During stage two, conducted on October 23, 2025, a subset of 16 experts from the original 42 convened in a virtual roundtable to finalise the framework. Revisions included nine major content additions, such as metrics for pregnancy-related CVD risk, gender equity, and economic productivity gains. Furthermore, definitions were refined to emphasise primary care–led hypertension management, community-based delivery models, and digital health integration (S4 Text). The final framework shown in Table 1 encompasses 11 elements and 69 sub-elements, requiring the assessment of system performance, contextual threats, and opportunities; the definition of strategy; the specification of governance, financing, and resource reforms; the delineation of service interventions; and a comprehensive implementation plan.

Table 1. Framework to analyse National Cardiovascular Disease Control Plans: Description of the 11 elements and 69 sub-elements by health system theme.

https://doi.org/10.1371/journal.pdig.0001447.t001

Policy Reasoning Integrated Sequential Model Development (PRISM)

To facilitate rapid, consistent and scalable analysis, we developed the Policy Reasoning Integrated Sequential Model (PRISM), a three-stage multi-agent system (S1 Text). Consistent with human-in-the-loop principles, domain experts were embedded at each stage of the system’s lifecycle: informing framework design through the Delphi process, guiding prompt engineering and model selection, and validating outputs against independent human review [13,14]. In this context, an agent is defined as an autonomous AI model executing designated tasks within a coordinated workflow. In stage one, a document-ingestion agent (Qwen2.5-VL-72B) processed policy documents of heterogeneous format and layout. This agent applied automated preprocessing to mitigate noise and geometric distortion, utilising spatial-aware text extraction with integrated visual recognition to digitise and structure the content [18]. In stage two, a policy-analysis agent (Llama 4 Scout 70B) applied the 69 sub-elements of the analytical framework to the NCVDCPs. Scoring criteria required only that relevant content be discussed; implicit or narrative treatment without explicit headings was eligible for non-zero scores. By employing retrieval-augmented generation (RAG) and structured prompts, this agent produced standardised, machine-readable JavaScript Object Notation (JSON) outputs [11,12]. Finally, in stage three, a semantic-validation agent assessed semantic coherence and JSON-schema conformity, triggering threshold-based re-analysis of flagged sections to yield the final analytic dataset.
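The schema-conformity part of stage three can be pictured with a minimal sketch. This is an assumption-laden illustration, not the PRISM code: the record fields (`country`, `sub_element_id`, `score`, `evidence`), the evidence rule, and the 10% re-analysis threshold are all hypothetical, and the semantic-coherence check performed by the validation agent is not modelled.

```python
import json

# Hypothetical record layout; the actual PRISM schema is described in S1 Text.
REQUIRED_FIELDS = {"country": str, "sub_element_id": str, "score": int, "evidence": str}


def validate_record(raw_json):
    """Return a list of problems found in one scored sub-element record."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type) or isinstance(record[field], bool):
            problems.append(f"wrong type for field: {field}")
    score = record.get("score")
    if isinstance(score, int) and not isinstance(score, bool):
        if not 0 <= score <= 5:
            problems.append("score outside 0-5")
        elif score > 0 and not str(record.get("evidence", "")).strip():
            # Assumed rule: a non-zero score should cite supporting text.
            problems.append("non-zero score without supporting evidence")
    return problems


def needs_reanalysis(raw_records, max_problem_share=0.1):
    """Flag a section when the share of problematic records exceeds a threshold."""
    if not raw_records:
        return False
    flagged = sum(1 for raw in raw_records if validate_record(raw))
    return flagged / len(raw_records) > max_problem_share
```

A section whose records fail these checks too often would be sent back to the policy-analysis agent; where the threshold should sit is a design choice that the text does not specify.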

Policy reasoning integrated sequential model validation

We assessed PRISM’s accuracy of analysis and concordance against independent human review, consistent with human-in-the-loop principles. Six countries were purposively selected to reflect variation in programme maturity, region, and income group: Ghana, India, Myanmar, Turkey, the United Kingdom, and the United States. Three human experts blinded to PRISM outputs independently scored a subset of plans using the same analytical framework. PRISM evaluated the same plans with identical criteria. Each reviewer was assigned two countries (Reviewer 1: Ghana and India; Reviewer 2: Myanmar and the United States; Reviewer 3: Turkey and the United Kingdom), with no shared observations across reviewers. Scores were compared using Spearman ρ, linear- and quadratic-weighted κ, and ICC (3,1) as ordinal-appropriate statistics, alongside exact and within-1-point agreement rates.
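Two of the agreement measures named above can be sketched in plain Python. The code below is illustrative rather than the study's analysis script: the function names are hypothetical, the example scores are invented, and Spearman ρ and ICC (3,1) are omitted. Weighted κ is computed in its standard disagreement form, one minus the ratio of observed to chance-expected weighted disagreement.

```python
from collections import Counter


def agreement_rates(model_scores, human_scores):
    """Return (exact agreement, within-one-point agreement) as proportions."""
    pairs = list(zip(model_scores, human_scores))
    n = len(pairs)
    exact = sum(m == h for m, h in pairs) / n
    within_one = sum(abs(m - h) <= 1 for m, h in pairs) / n
    return exact, within_one


def weighted_kappa(model_scores, human_scores, power=1, levels=6):
    """Cohen's weighted kappa; power=1 is linear, power=2 quadratic weighting.

    levels=6 corresponds to the integer scores 0 to 5.
    """
    pairs = list(zip(model_scores, human_scores))
    n = len(pairs)
    observed = Counter(pairs)
    model_marginal = Counter(m for m, _ in pairs)
    human_marginal = Counter(h for _, h in pairs)
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(levels):
        for j in range(levels):
            weight = (abs(i - j) / (levels - 1)) ** power
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * model_marginal[i] * human_marginal[j] / (n * n)
    if expected_disagreement == 0:
        return float("nan")  # both raters used a single identical category
    return 1 - observed_disagreement / expected_disagreement


# Invented scores for six sub-elements, for illustration only.
model = [0, 0, 1, 2, 3, 5]
human = [0, 1, 1, 4, 3, 5]
exact, within_one = agreement_rates(model, human)
kappa_quadratic = weighted_kappa(model, human, power=2)
```

Because κ corrects for chance agreement implied by the marginal distributions, two raters who both assign 0 most of the time can show high exact agreement and still have a κ near zero.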

PRISM showed variable concordance with human ratings across 403 matched sub-element comparisons. Exact agreement occurred in 43.7% of comparisons (176/403), and 68.0% were within one point (274/403). Ordinal agreement statistics were uniformly weak and non-significant (Spearman ρ = 0.061, p = 0.222; linear-weighted κ = 0.049; quadratic-weighted κ = 0.045; ICC (3,1) = 0.046, p = 0.181; Pearson r = 0.048, p = 0.34, reported for reference only). A floor effect was present: 61.0% of PRISM scores and 62.3% of human scores were 0, inflating per-element exact agreement for elements with sparse policy content. Within-one-point agreement by country ranged from 75.0% (Ghana, United States) to 45.2% (India), with the United Kingdom at 73.1%, Turkey at 69.6%, and Myanmar at 68.1%. Bias was bidirectional and score-dependent: PRISM over-scored relative to human reviewers when expert scores were 0 (mean difference +0.96) but under-scored when expert scores were ≥ 2 (mean differences −0.93 to −4.00). Overall mean difference was +0.28 points. Collapsing scores into three tiers, PRISM over-scored at low human ratings (scores 0–1, n = 318, mean difference +0.75) and under-scored at medium (scores 2–3, n = 72, mean difference −1.22) and high ratings (scores 4–5, n = 13, mean difference −3.08). As each reviewer assessed a distinct pair of countries with no overlapping observations, formal inter-rater reliability could not be assessed.
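The reported percentages and the overall bias can be cross-checked from the counts given in this paragraph. The short calculation below uses only figures stated in the text; the variable names are illustrative. Because the tier means are themselves rounded to two decimals, the pooled value is expected to match the reported +0.28 only approximately.

```python
# Agreement rates recomputed from the reported counts.
exact_rate = 176 / 403        # about 0.437, matching the reported 43.7%
within_one_rate = 274 / 403   # about 0.680, matching the reported 68.0%

# (number of comparisons, mean PRISM-minus-human difference) per human-score tier.
tiers = [(318, 0.75), (72, -1.22), (13, -3.08)]
n_total = sum(n for n, _ in tiers)

# Sample-size-weighted mean of the tier differences.
pooled_difference = sum(n * diff for n, diff in tiers) / n_total

print(f"{exact_rate:.1%} exact, {within_one_rate:.1%} within one point")
print(f"pooled mean difference over {n_total} comparisons: {pooled_difference:+.3f}")
```

The tier sizes sum to 403 and the pooled difference comes out at roughly +0.27, consistent with the reported overall mean of +0.28 once rounding of the tier means is taken into account. The small positive overall bias therefore masks much larger offsetting errors at the two ends of the scale.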

Policy reasoning integrated sequential model application

We analysed NCVDCPs on an Amazon Web Services Elastic Compute Cloud Instance, generating standardised scores for all 69 sub-elements with automated quality assurance (S1 Text).

Comparative analysis

Following the application of PRISM, we analysed NCVDCP comprehensiveness within and across World Bank country income groups and WHO country regions.

In this context, comprehensiveness is defined as the extent to which a plan addresses the 11 elements and 69 sub-elements of the validated framework, reflecting both the breadth of topic coverage and the depth of specific detail. Non-zero scores reflect the documented presence and specificity of content within the plan; they do not indicate the adequacy, effectiveness, or implementation success of the policies described. Sub-element scores are ordinal; numerical aggregates (medians, means, IQRs) are reported as heuristic indicators of directional patterns across plans and should not be interpreted as precise quantitative distances between rubric levels. Data processing and visualisation were conducted in Python, utilising the pandas, numpy, and matplotlib libraries. Scores for the 69 sub-elements were normalised to a standard 0–5 scale. Element and country-level scores were derived by taking the median across constituent sub-elements, applying equal weight to each sub-element; this treats breadth of coverage within an element as substantively meaningful. Median rather than mean was used as the primary summary statistic, consistent with the ordinal scale. A sensitivity analysis comparing sub-element-equal weighting against element-equal weighting and mean against median aggregation confirmed that principal findings were robust across approaches (S2 Table). Non-parametric Mann-Whitney U tests were used for income group comparisons.
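The difference between the two weighting schemes compared in the sensitivity analysis can be shown with a toy example. This sketch reflects one reading of the aggregation described above and is not the study code: the function names are hypothetical and the three-element plan is invented, whereas the real framework has 11 elements and 69 sub-elements.

```python
from statistics import mean, median


def element_scores(plan, summary=median):
    """Summarise each element from its sub-element scores."""
    return {element: summary(scores) for element, scores in plan.items()}


def overall_sub_element_equal(plan, summary=median):
    """Pool every sub-element so each one carries equal weight."""
    pooled = [score for scores in plan.values() for score in scores]
    return summary(pooled)


def overall_element_equal(plan, summary=median):
    """Summarise element-level scores so each element carries equal weight."""
    return summary(list(element_scores(plan, summary).values()))


# Invented plan: element name -> sub-element scores on the 0-5 scale.
toy_plan = {
    "strategy": [3, 3, 2, 4, 3],
    "threats": [0, 0, 1],
    "financing": [1, 2, 1, 2],
}

by_element = element_scores(toy_plan)
results = {
    ("sub-element-equal", "median"): overall_sub_element_equal(toy_plan, median),
    ("element-equal", "median"): overall_element_equal(toy_plan, median),
    ("sub-element-equal", "mean"): overall_sub_element_equal(toy_plan, mean),
    ("element-equal", "mean"): overall_element_equal(toy_plan, mean),
}
```

Under sub-element-equal weighting, elements with many sub-elements pull the overall score towards their own level; element-equal weighting removes that effect. Passing the summary function as a parameter makes the mean-versus-median comparison a one-line change.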

Results

Plan characteristics

We identified 45 NCVDCPs outlined in Table 2. The majority originated from high-income countries (n = 27), followed by upper-middle-income (n = 10), lower-middle-income (n = 6), and low-income countries (n = 2). Geographically, the plans were predominantly from the European region (n = 22, 48.9%), with limited representation from the Americas (n = 8), Africa (n = 4), the Western Pacific (n = 4), the Eastern Mediterranean (n = 5), and South-East Asia (n = 2). This distribution reflects the global scarcity of standalone CVD plans and limits generalisability beyond high-income, European settings.

Table 2. List of National Cardiovascular Disease Plans and broader policy landscape by country.

https://doi.org/10.1371/journal.pdig.0001447.t002

Overall comprehensiveness

The median overall comprehensiveness score across all plans was 1.20/5, as shown in Table 3, indicating that most plans address framework elements only partially, without specific indicators or baseline data. Ten of the eleven elements scored below 2.0. Contextual analysis represented the most critical gap, with negligible scores for threats (median 0.12) and opportunities (0.29); elements assessing current health system performance were also consistently under-specified, including objectives (0.50), outcomes (0.67), and outputs (0.75). Conversely, CVD strategy (2.80) and governance and organisation (2.14) were the most comprehensively addressed elements. Implementation/monitoring and evaluation (1.75), financing (1.44), and health services (1.43) were mentioned but lacked the specificity, indicators, or baseline data required for substantive coverage.

Table 3. Comprehensiveness scores for the National Cardiovascular Disease Plans by World Bank income group and World Health Organisation region.

https://doi.org/10.1371/journal.pdig.0001447.t003

Regional variation

Comprehensiveness varied markedly across regions as shown in Fig 1. Regional patterns are presented descriptively. Findings for South-East Asia (n = 2) and for Africa and the Western Pacific (n = 4 each) should be interpreted with caution given small subgroup sizes. The Western Pacific region (WPRO; n = 4) achieved the highest median total score at 1.71 [IQR 1.54, 1.84]. This was driven by high scores in strategy (3.00 [2.95, 3.10]), implementation (2.50 [2.38, 2.62]), and health services (2.24 [1.68, 2.66]). Mid-range scores included financing (1.87 [1.23, 2.56]) and resource management (1.79 [1.34, 2.32]), while threats (0.41 [0.22, 0.87]) and opportunities (0.61 [0.38, 0.98]) remained low.

Fig 1. Comparison of health system performance in relation to elements by World Health Organisation region.

https://doi.org/10.1371/journal.pdig.0001447.g001

The Eastern Mediterranean region (EMRO; n = 5) recorded a median total of 1.46 [1.41, 1.79]. Leading elements were strategy (3.00 [2.80, 3.00]), governance and organisation (2.43 [2.29, 2.71]), and health services (2.38 [1.62, 2.57]). Financing (2.10 [1.90, 2.10]) was also notable. However, significant gaps persisted in threats (0.25 [0.12, 0.29]), opportunities (0.60 [0.25, 0.67]), and outcomes (0.67 [0.33, 0.75]).

In the Americas (PAHO; n = 8), the median total was 1.26 [1.11, 1.55]. While strategy scored highly (3.00 [2.70, 3.20]), followed by implementation (2.25 [1.31, 2.50]), other areas were weaker, including health services (1.50 [1.14, 1.89]) and financing (1.60 [1.08, 1.82]). The lowest scores were observed in threats (0.12 [0.12, 0.17]) and opportunities (0.19 [0.12, 0.38]).

The European region (EURO; n = 22) had a median total of 1.14 [0.87, 1.40]. The highest scoring elements were strategy (2.40 [2.00, 3.15]) and governance (2.29 [1.43, 2.43]). Implementation (1.38 [0.54, 2.25]), financing (1.35 [0.93, 1.70]), and health services (1.33 [0.82, 1.68]) showed moderate completeness. Contextual analysis was poor, with threats scoring 0.12 [0.12, 0.34].

South-East Asia (SEARO; n = 2) recorded a median total of 0.97 [0.82, 1.11]. Despite high scores in strategy (2.70 [2.45, 2.95]) and financing (2.45 [2.42, 2.48]), most elements scored poorly, including implementation (0.50 [0.25, 0.75]), outputs (0.50 [0.25, 0.75]), and threats (0.25 [0.12, 0.38]).

The African region (AFRO; n = 4) showed the lowest median total at 0.90 [0.80, 1.02]. While strategy (2.30 [1.50, 3.05]) was relatively strong, threats scored 0.00 [0.00, 0.06]. Implementation (0.38 [0.19, 0.69]), health services (0.86 [0.71, 1.21]), and financing (0.95 [0.67, 1.23]) were also low.

Fig 1 shows comprehensiveness scores for 45 NCVDCPs across 11 elements, stratified by WHO region. Box plots display the distribution (median and IQR) of element scores, with individual countries shown as dots and coloured by region. Element scores represent the median across constituent sub-elements.

Income group variation

Income group comparisons are presented descriptively; all pairwise Mann-Whitney comparisons were non-significant (all p > 0.05), and the LIC group (n = 2) is insufficient for any inferential conclusion. Analysis by World Bank country income classification is presented in Fig 2. Low-income countries (LIC; n = 2) recorded a median total score of 1.39 [1.32, 1.46]. These plans recorded their highest scores in governance (2.64 [2.04, 3.25]), financing (2.55 [2.48, 2.62]), strategy (2.50 [2.15, 2.85]), and implementation (2.00 [1.50, 2.50]). The lowest scores were observed in outcomes (0.50 [0.42, 0.58]) and threats (0.31 [0.22, 0.41]).

Fig 2. Comparison of health system performance in relation to elements by World Bank country income group.

https://doi.org/10.1371/journal.pdig.0001447.g002

High-income countries (HIC; n = 27) recorded a median total score of 1.30 [1.05, 1.67]. Leading elements included strategy (3.00 [2.40, 3.20]), governance and organisation (2.29 [1.79, 2.43]), and implementation (2.00 [0.75, 2.50]). Health services (1.50 [1.14, 2.29]) and financing (1.50 [1.10, 1.95]) scored in the moderate range. Contextual threats (0.12 [0.12, 0.33]) and opportunities (0.38 [0.25, 0.71]) scored lowest.

Upper-middle-income countries (UMIC; n = 10) recorded a median total of 1.13 [0.90, 1.37]. Strategy (2.40 [2.10, 2.95]) and resource management (1.94 [0.92, 2.08]) were the highest scoring elements. Financing (1.50 [0.93, 1.90]) and implementation (1.25 [0.56, 2.19]) showed moderate coverage. Threats (0.13 [0.12, 0.34]) and opportunities (0.12 [0.12, 0.28]) were lowest.

Lower-middle-income countries (LMIC; n = 6) recorded the lowest median total at 0.90 [0.72, 1.14]. Strategy (2.60 [1.75, 3.00]) and governance (1.60 [1.42, 1.80]) were the highest scoring elements. Implementation (0.38 [0.06, 1.06]) and threats (0.00 [0.00, 0.19]) recorded the lowest scores.

Fig 2 shows comprehensiveness scores for 45 NCVDCPs across 11 elements, stratified by World Bank income group. Box plots display the distribution (median and IQR) of element scores, with individual countries shown as dots and coloured by income group. Element scores represent the median across constituent sub-elements.
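The constraint imposed by a two-plan group can be made concrete with an exact permutation version of the Mann-Whitney U test. The sketch below is illustrative only: the paper does not state whether exact or asymptotic p-values were used, the function names are hypothetical, and the scores are invented placeholders rather than study data.

```python
from itertools import combinations


def u_statistic(group_a, group_b):
    """Mann-Whitney U for group_a: pairs where a > b count 1, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )


def exact_two_sided_p(group_a, group_b):
    """Exact two-sided p-value by enumerating every relabelling of the pooled data."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    centre = n_a * len(group_b) / 2  # expected U under the null hypothesis
    observed = abs(u_statistic(group_a, group_b) - centre)
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(u_statistic(a, b) - centre) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Invented totals: a two-plan group that lies entirely above a six-plan group.
small_group = [1.32, 1.46]
larger_group = [0.60, 0.72, 0.85, 0.95, 1.14, 1.20]
u_value = u_statistic(small_group, larger_group)
p_value = exact_two_sided_p(small_group, larger_group)
```

With two plans compared against six there are only 28 possible allocations, so even complete separation gives a two-sided exact p-value of 2/28, about 0.071. A comparison of that size cannot fall below 0.05 whatever the data show, which is one reason to treat such contrasts descriptively.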

Discussion

The validated 11-element, 69-sub-element framework constitutes a reusable policy instrument applicable to national CVD planning assessment independent of the computational pipeline described here. It was developed through systematic literature review and two-stage Delphi consensus with 42 specialists. Our analysis of 45 NCVDCPs indicates that while high-level strategic vision is often present, the financing, governance and operational architecture required to execute that vision is frequently under-specified. A median overall comprehensiveness score of 1.20/5 reveals a substantial gap between the scale of the global CVD burden and the granularity of national planning documentation.

Most plans articulate a Strategy (median 2.80) and define Governance structures (2.14); however, they frequently lack a documented grounding in local context or baseline system performance. Negligible scores for Contextual Threats (0.12) and Opportunities (0.29), combined with limited definitions of Health System Outcomes (0.67), indicate that strategies are frequently developed without documented epidemiological baselines or health-system assessments. Without a documented assessment of the current system status or external factors that might impede progress – such as economic instability or demographic shifts – even a well-articulated strategy risks remaining a signal of political intent rather than an actionable roadmap.

Among the sampled plans, the Western Pacific region recorded the highest median (1.71), with stronger documentation of Implementation and Health Services; the African region recorded the lowest (0.90), with Contextual Threats scoring 0.00. Both findings are based on n = 4 countries and should not be interpreted as representative regional assessments. Income group comparisons were non-significant across all pairwise tests, and the LIC sample (n = 2) is insufficient for inference. Descriptive patterns, including LMIC scoring lowest (median 0.90), are reported without causal attribution. The primary finding, that comprehensiveness was low across all income groups, suggests the deficit reflects the absence of standardised planning methodology more than resource constraints, reinforcing the case for universally adaptable templates similar to those established for cancer control [19–23].

Three major implications follow from these findings. First, national CVD planning must transition from vision-based to evidence-based approaches. The divergence between strong scores for Strategy and weaker scores for Outcomes (0.67) and Objectives (0.50) indicates a disconnect between goals and the metrics required to track them. To fulfil WHA and SDG mandates and targets, CVD planning frameworks should explicitly link strategic goals to baseline data and measurable targets. This alignment is critical for tracking progress toward mortality reduction targets, yet monitoring and evaluation frameworks currently score poorly (1.75) across the sample.

Second, limited detail regarding Financing (1.44) and Contextual Analysis jeopardises plan sustainability. A comprehensive plan requires not just a budget, but an analysis of fiscal space, funding sources, and financial risk protection. The omission of these details suggests financial planning is frequently decoupled from service delivery planning. Future planning should integrate service delivery objectives with financing and implementation architecture to build resilience against political or economic shocks.

Third, the application of PRISM demonstrates both the potential and the current limitations of AI-enabled policy appraisal. By automating the analysis of complex policy documents, PRISM enables scalable benchmarking and a dynamic global observatory for CVD policy. However, validation against human review reveals important nuances in its application. While PRISM achieved exact agreement with human experts in 43.7% of comparisons and fell within a one-point margin in 68.0% of cases, the weak ordinal agreement statistics (Spearman ρ = 0.061, weighted κ = 0.049, ICC = 0.046) indicate that PRISM identifies structural presence but does not replicate expert qualitative depth assessment. Bias was score-dependent rather than uniformly generous: PRISM over-scored when reviewers found no substantive content (mean difference +0.96 at human score 0) but under-scored substantially when reviewers identified genuine policy depth (−0.93 to −4.00 at scores 2–5), consistent with a keyword-trigger effect, where the model rewards stated intent over demonstrated specificity.

This pattern was reflected at the element level: PRISM performed best on elements requiring structural detection (Element 4, exact agreement 72.9%; Element 5, 72.9%), and worst on elements requiring qualitative depth assessment (Element 6, 16.7%; Element 7, 19.0%; Element 8, 37.3%). Representative cases illustrate both failure modes: PRISM assigned scores of 4–5 for Vision and Financing sub-elements in Ghana, Myanmar, and the United States where human reviewers scored 0, rewarding aspirational language without implementation architecture; conversely, in India, PRISM scored 0 across Governance, Outputs, and Health Services sub-elements where reviewers assigned 3–4, failing to integrate content across document sections or apply contextual inference. Full disagreement cases (|diff| ≥ 3, n = 84) are provided in S3 Table.

This bidirectional pattern reflects score compression — PRISM’s output clusters in the low-to-middle range, missing both the complete absence and the substantive depth that expert reviewers reliably distinguish. PRISM is therefore validated for structural benchmarking of documented plan content. It should not be used as a fine-grained ordinal scoring instrument equivalent to expert judgement, and serves best as a screening tool to augment, rather than replace, expert review.

Future iterations should explore few-shot prompting — providing PRISM with scored exemplars prior to evaluation — and stricter criteria for awarding scores above 0; both represent tractable approaches to reducing over-scoring of nominal policy language.

These findings must be interpreted in light of significant limitations.

First, PRISM evaluates document content, which may not reflect implementation reality or policies detailed in separate, unlinked documents (e.g., national budgets). Thus, low scores indicate a lack of integration within the primary CVD strategy, not necessarily government inaction. Additionally, formal analysis of scoring reliability by document language or format was not conducted. RAG-based retrieval may also perform less reliably on contextual content dispersed across narrative text than on explicitly structured policy sections, potentially contributing to the very low observed scores for threats and opportunities. Plans were processed in any language using a multilingual ingestion agent; whether reliability varied systematically by language or document structure cannot be determined from the current validation sample.

Second, the sample is geographically and economically skewed: 22 of 45 countries are European and 27 are high-income, while low-income (n = 2), African (n = 4), and South-East Asian (n = 2) contexts are substantially under-represented. Findings are therefore most generalisable to high-income, European settings. Countries without accessible standalone CVD plans, including China, could not be included, which is itself consistent with the study’s central finding on the scarcity of dedicated CVD planning. Expanding coverage across Africa, South-East Asia, and Latin America is the primary direction for future work.
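To make the skew concrete, the counts reported above can be converted to shares of the 45-country sample. This is a small arithmetic sketch using only the figures stated in the text; the variable names are illustrative, and the categories overlap (region and income group), so the shares are not expected to sum to 100%.

```python
SAMPLE_SIZE = 45  # countries in the sample, as reported in the text

# Counts as reported in the text; region and income categories overlap.
counts = {
    "European": 22,
    "High-income": 27,
    "Low-income": 2,
    "African": 4,
    "South-East Asian": 2,
}

# Percentage of the sample in each category, to one decimal place.
shares = {group: round(100 * n / SAMPLE_SIZE, 1) for group, n in counts.items()}

for group, pct in shares.items():
    print(f"{group}: {pct}%")
```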

Finally, PRISM’s validated scope is structural: it identifies whether defined policy architecture is present, not whether that architecture is sufficient, feasible, or of high quality.

Supporting information

S1 Text. Large language model pipeline architecture overview.

https://doi.org/10.1371/journal.pdig.0001447.s001

(DOCX)

S2 Text. Delphi process stage one questionnaire quantitative scoring by element.

https://doi.org/10.1371/journal.pdig.0001447.s002

(DOCX)

S3 Text. Delphi process stage one questionnaire qualitative feedback by element.

https://doi.org/10.1371/journal.pdig.0001447.s003

(DOCX)

S4 Text. Delphi process stage two qualitative feedback summary by key themes.

https://doi.org/10.1371/journal.pdig.0001447.s004

(DOCX)

S5 Text. Quantitative analysis of national cardiovascular disease control plans across all sub-elements.

https://doi.org/10.1371/journal.pdig.0001447.s005

(DOCX)

S1 Table. National cardiovascular disease control plan scoring rubric.

https://doi.org/10.1371/journal.pdig.0001447.s006

(DOCX)

S2 Table. Sensitivity analysis of scoring aggregation methods.

https://doi.org/10.1371/journal.pdig.0001447.s007

(DOCX)

S3 Table. Large disagreements between LLM and human reviewer scores (|difference| ≥ 3).

https://doi.org/10.1371/journal.pdig.0001447.s008

(DOCX)

Acknowledgments

Members of the CVD Control Collaborative

Note: Stage One, 42 completed questionnaires (participants not listed elected to remain anonymous); Stage Two, 16 participants (participant not listed elected to remain anonymous).

Human Review of LLM

Completed by Dr Aminu Osman Alem, Maia Cullen, and Brooke Forde of the Harvard University Health Systems Innovation Lab.

References

  1. World Health Organisation. The top 10 causes of death. Geneva: World Health Organisation; 2024. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
  2. Hamza Khalifa I. Global Trends in Cardiovascular Mortality and Risk Factors: Insights from WHO and Global Burden of Disease Data. Libyan Open Univ J Med Sci Sustain. 2025:28–36.
  3. World Health Organisation. Cardiovascular diseases (CVDs). Geneva: World Health Organisation. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds
  4. Chaturvedi A, Zhu A, Gadela NV, Prabhakaran D, Jafar TH. Social determinants of health and disparities in hypertension and cardiovascular diseases. Hypertension. 2024;81(3):387–99. pmid:38152897
  5. Coronado F, Melvin SC, Bell RA, Zhao G. Global responses to prevent, manage, and control cardiovascular diseases. Prev Chronic Dis. 2022;19:E84. pmid:36480801
  6. World Heart Federation. CVD roadmaps. Geneva: World Heart Federation; https://world-heart-federation.org/cvd-roadmaps/
  7. World Heart Federation. WHF urges countries to develop cardiovascular action plans, launches global petition. Geneva: World Heart Federation; https://world-heart-federation.org/news/whf-urges-countries-to-develop-cardiovascular-action-plans-launches-global-petition/
  8. World Health Organisation. Cancer prevention and control in the context of an integrated approach: World Health Assembly resolution WHA70.12. Geneva: World Health Organisation; 2017.
  9. The Global Goals. Goal 3: good health and well-being. https://globalgoals.org/goals/3-good-health-and-well-being/
  10. World Health Organisation. HEARTS technical package for cardiovascular disease management in primary health care. Geneva: World Health Organisation; 2016. https://iris.who.int/server/api/core/bitstreams/aef55845-a638-45d0-b816-70dd5bf8885a/content
  11. Meta AI. The Llama 4 herd: the beginning of a new era of natively multimodal AI innovation. 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
  12. Guha N, Nyarko J, Ho DE, et al. LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models. Adv Neural Inf Process Syst. 2023;36:35327–43.
  13. Mosqueira-Rey E, Hernández-Pereira E, Alonso-Ríos D, Bobes-Bascarán J, Fernández-Leal Á. Human-in-the-loop machine learning: a state of the art. Artif Intell Rev. 2023;56:3005–54.
  14. Amershi S, Cakmak M, Knox WB, Kulesza T. Power to the People: The Role of Humans in Interactive Machine Learning. AI Mag. 2014;35(4):105–20.
  15. World Health Organisation. Noncommunicable disease (NCD) document repository. Geneva: World Health Organisation. https://extranet.who.int/ncdccs/documents/
  16. Atun R, Ogawa T, Martin-Morena J. Analysis of national cancer control programmes in Europe. London: Imperial College London Business School; 2008.
  17. Nasa P, Jain R, Juneja D. Delphi methodology in healthcare research: How to decide its appropriateness. World J Methodol. 2021;11(4):116–29. pmid:34322364
  18. Bai S, Yang A, Cheng J, et al. Qwen2.5-VL technical report. arXiv. 2025. https://arxiv.org/abs/2502.13923
  19. International Cancer Control Partnership (ICCP). National plans. https://iccp-portal.org/map
  20. Nicholson BD, Shinkins B, Price S, Verbakel JY, Merriel S, Society of Academic Primary Care Cancer Special Interest Group. National cancer control plans. Lancet Oncol. 2018;19(12):e665. pmid:30507425
  21. Fadhil I, Alkhalawi E, Nasr R, Fouad H, Basu P, Camacho R, et al. National cancer control plans across the Eastern Mediterranean region: challenges and opportunities to scale-up. Lancet Oncol. 2021;22(11):e517–29. pmid:34735820
  22. Stefan DC, Elzawawy AM, Khaled HM, Ntaganda F, Asiimwe A, Addai BW, et al. Developing cancer control plans in Africa: examples from five countries. Lancet Oncol. 2013;14(4):e189–95. pmid:23561751
  23. Romero Y, Tittenbrun Z, Trapani D, Given L, Hohman K, Cira MK, et al. The changing global landscape of national cancer control plans. Lancet Oncol. 2025;26(1):e46–54. pmid:39701116