Refining biomarker-based clustering of cardiovascular inflammatory phenotypes in HIV using Recursive Feature Addition: A comparative evaluation approach

Rachel Mac Cann; Dana Alalwan; Gurvin Saini; Alejandro Abner Garcia Leon; Neeltje A. Kootstra; Padraig McGettrick; Aoife G Cotter; Alan Winston; Peter Reiss; Caroline Sabin; Patrick W. Mallon; on behalf of the UPBEAT-CAD, AIID and COBRA cohorts

doi:10.1371/journal.pcbi.1014209

Abstract

Background

People living with HIV remain at elevated risk for a number of non-communicable diseases, including cardiovascular disease (CVD), driven in part by chronic inflammation. While prior studies have identified inflammatory biomarker patterns linked to CVD in people with HIV, it remains unclear which combinations of biomarkers most effectively predict clinical outcomes. We aimed to develop and evaluate a framework for refining biomarker-based clustering approaches to better capture inflammatory patterns associated with a cardiovascular phenotype (CVP) in people with HIV.

Methods

We developed and evaluated three recursive feature addition (RFA) models to enhance biomarker-driven clustering of people with and without HIV. Using a 24-marker initial panel of biomarkers chosen for their links to clinical CVP in people with HIV, we compared three models for selective inclusion of 31 additional, exploratory biomarkers: (1) a stepwise additive model evaluating biomarkers cumulatively based on biological relevance; (2) a stepwise additive model evaluating biomarkers individually; and (3) a greedy forward-backward selection model. Each model was assessed using principal component analysis (PCA), cluster stability, biological coherence and association with a CVP and 10-year Atherosclerotic Cardiovascular Disease (ASCVD) risk.

Results

All three RFA models generated three biomarker-derived clusters. Post-RFA cluster biomarker composition, model stability and clinical associations of these clusters differed across models. The individual additive model (Model 2) produced the most distinct separation of inflammatory profiles, incorporating 11 additional biomarkers (including GDF-15, IFN-λ2 and Thrombopoietin). In this model, Cluster 3 was characterised by heightened innate and adaptive immune activation, the highest CVP prevalence (11%) and the strongest association with CVP (adjusted odds ratio (aOR) 2.3, 95% CI 1.04–5.09).

Conclusion

We demonstrate that an RFA framework using a stepwise, additive model evaluating biomarkers individually to enhance clustering profiles provides optimal unsupervised clustering of exploratory biomarkers to reveal additional associations between inflammatory patterns and CVP in people with and without HIV.

Author summary

People living with HIV are living longer, healthier lives thanks to effective treatment, but they remain at greater risk of developing non-communicable diseases, such as cardiovascular disease. This risk appears to be linked to ongoing systemic inflammation and previous studies have shown that people with HIV often have higher levels of circulating inflammatory biomarkers. Understanding which of these biomarkers, or combinations of them, best identify people at higher risk is an important step toward more personalised care, but is also a challenging computational problem. In our study, we used a computer-based approach called recursive feature addition to explore whether adding new biomarkers to an existing panel associated with cardiovascular disease could improve how individuals are grouped based on their inflammatory profiles. Combining data from three large international cohort studies, we compared different modelling strategies and found that one approach produced particularly distinct and biologically meaningful clusters, revealing a subgroup with higher inflammation and greater cardiovascular disease risk. Our findings demonstrate how data-driven feature selection can refine the identification of biological patterns linked to disease, bridging computational modelling with clinical understanding.

Citation: Mac Cann R, Alalwan D, Saini G, Garcia Leon AA, Kootstra NA, McGettrick P, et al. (2026) Refining biomarker-based clustering of cardiovascular inflammatory phenotypes in HIV using Recursive Feature Addition: A comparative evaluation approach. PLoS Comput Biol 22(4): e1014209. https://doi.org/10.1371/journal.pcbi.1014209

Editor: Lun Hu, Xinjiang Technical Institute of Physics and Chemistry, CHINA

Received: November 24, 2025; Accepted: April 7, 2026; Published: April 27, 2026

Copyright: © 2026 Mac Cann et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All analyses were performed using R (version 4.3.2). All relevant data are within the manuscript and all intermediate model outputs and bootstrap results are provided as Supporting Data files (S1–S3 Datas) to facilitate reproducibility. Processed, de-identified summary data required to reproduce the key findings are provided within the Supporting Information files (S1–S4 Datas). Because the recursive feature addition framework incorporates random sampling and model initialisation, minor variations in intermediate results may occur between runs. To ensure reproducibility, all analyses were executed using fixed random seeds and dependency management with the renv package (version 1.1.0), which preserves package versions and the computational environment used in the original analysis. De-identified participant-level data from the AIID and UPBEAT-CAD cohorts contain sensitive personal data and cannot be shared publicly due to ethical and legal restrictions under institutional and national data protection regulations. Requests for access to these data may be submitted to the University College Dublin Data Protection Office, which serves as the data access authority for both cohorts: University College Dublin Data Protection Office, Ulrike Kolch, Roebuck Castle, University College Dublin, Belfield, Dublin 4, Ireland (email: gdpr@ucd.ie). De-identified participant-level data from the COBRA cohort are subject to cohort-specific governance agreements. Data access requests are reviewed on a case-by-case basis following submission of a concept proposal, directed to the COBRA study data access committee via info@aighd.org.

Funding: The HIV UPBEAT Cohort Study has received funding from the Health Research Board (HRB) (award number HRA_POR/2010/66; both grants awarded to P. W. G. M. as principal investigator). R. M. C. was supported by the Irish Clinical Academic Training program, supported by the Wellcome Trust and the Irish Health Research Board (grant number 203930/B/16/Z), the Health Service Executive, National Doctors Training and Planning and the Health and Social Care, Research and Development Division, Northern Ireland. RMC also received additional funding from the British HIV Association (BHIVA) from a 2022 BHIVA research award. The COBRA cohort received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement n° 305522. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: A.C. has received honoraria, educational or travel support, or unrestricted research grants from Gilead Sciences, MSD, ViiV Healthcare, and Janssen-Cilag. A.W. has received honoraria or research grants on behalf of Imperial College London, or has served as a consultant or investigator in clinical trials sponsored by Bristol-Myers Squibb, Gilead Sciences, GlaxoSmithKline, Janssen-Cilag, Roche, and ViiV Healthcare. P.W.G.M. has received honoraria and/or travel grants from Gilead Sciences, MSD, Bristol-Myers Squibb, and ViiV Healthcare. P.R. has received independent scientific grant support from Gilead Sciences, Janssen Pharmaceuticals, Inc., Merck & Co., and ViiV Healthcare, and has served on scientific advisory boards for Gilead Sciences, ViiV Healthcare, and Merck & Co.; all honoraria were paid to his institution. C.S. has received honoraria for the preparation and delivery of educational materials from Gilead Sciences and ViiV Healthcare. All other authors declare no competing interests.

Introduction

Worldwide, people living with HIV are living longer due to the success of antiretroviral therapy (ART) [1]. However, despite these advances, people with HIV remain at increased risk of non-communicable diseases (NCDs) such as cardiovascular disease (CVD), metabolic syndrome, certain cancers, and cognitive impairment, likely reflecting the combined effects of chronic inflammation and traditional risk factors. CVD has now emerged as a leading cause of death in people with HIV receiving ART [2], with traditional CVD risk factors, although prevalent in this group, not fully accounting for this excess risk [3].

Chronic inflammation and immune activation are hallmarks of HIV infection, persisting even in individuals on ART with fully suppressed viral replication [4,5]. Elevated markers of a number of inflammatory pathways, such as innate immune activation (sCD14) and systemic inflammation (hsCRP, IL-6), have been linked to the development of subclinical and clinical coronary artery disease (CAD) in people with HIV [6]. Although associations between single biomarkers and CVD risk have been well documented among people with HIV [7,8], discovery of a single biomarker that fully reflects the complex, multifactorial nature of inflammation has remained elusive, with limited predictive accuracy for CVD risk. A more integrated approach to understanding how multiple inflammatory pathways interact and contribute to the development of NCD in people with HIV is only beginning to emerge.

An increasing number of studies have measured multiple biomarkers to identify inflammatory patterns associated with higher CVD and NCD risk, rather than relying on single-marker analyses [5,9,10]. Previous analyses within our group have identified three distinct biomarker-derived inflammatory patterns associated with both subclinical CAD (measured by CT coronary angiography) and prevalent CVD events [10]. One cluster, characterised by elevated markers of gut epithelial barrier disruption (I-FABP), T-cell activation, and systemic inflammation, was particularly notable for its strong association with increased coronary artery plaque burden (CAP) and clinical CVD, even after adjustment for traditional CVD risk factors and HIV status [10]. These findings were validated in the Pharmacokinetic and clinical Observations in PeoPle over FiftY (POPPY) study in the UK/Ireland, which also identified three clusters, one of which was also associated with higher estimated CVD risk [5]. An analysis of the AGEhIV cohort in the Netherlands also demonstrated that a preserved-thymic/low-inflammation cluster in people with HIV was linked to lower comorbidity burden, including CVD, during longitudinal follow-up [9]. These studies suggest that people with HIV can be stratified into higher and lower risk cardiovascular categories based on distinct inflammatory profiles, independent of conventional CVD risk factors.

However, while these prior studies highlight the relevance of key inflammatory pathways, it remains unclear which individual biomarkers, or their combinations, most robustly associate with clinical outcomes. As high-throughput platforms have introduced a new era of multiplex biomarker assessment that is feasible to implement into clinical care, it is important to determine how this expanded data can be incorporated into existing frameworks to refine disease stratification in people living with HIV.

A key challenge in this setting is that naïvely expanding biomarker panels can degrade clustering performance by introducing redundant or weakly informative features, reducing cluster stability and interpretability. Recursive feature addition (RFA) is a wrapper-based, forward feature selection strategy in which candidate features are iteratively added to an existing model and retained only if they provide incremental improvement in performance or cluster discrimination relative to a predefined baseline feature set. By evaluating the marginal contribution of each additional feature, RFA enables controlled model refinement while preserving biologically meaningful structure derived from prior knowledge. This makes RFA particularly well suited to optimising biomarker-driven clustering in high-dimensional, correlated immunological data, where the goal is improved outcome discrimination while maintaining interpretability and translational relevance.

In this study we applied an RFA strategy, comparing three different biomarker inclusion strategies to assess the contribution of new biomarkers to an existing clustering model. Our goal was to refine biomarker-based clustering by identifying additional markers that enhance the biological relevance and clinical stratification of clusters associated with cardiovascular outcomes in people with HIV.

Methods

Ethics statement

The AIID Cohort Study was approved by the National Research Ethics Committee in Ireland, reference 20-NREC-COV-056. The UPBEAT-CAD study was approved by the Mater Misericordiae University Hospital and Mater Private Hospital Institutional Review Board. The COBRA cohort study was approved by the institutional review board of the Academic Medical Center (AMC) (reference number NL 30802.018.09) and a UK Research Ethics Committee (REC) (reference number 13/LO/0584 Stanmore, London). Participants across all cohorts gave written informed consent.

Dataset and study cohort

People with and without HIV from three different prospective, multicentre cohort studies were included. The first cohort comprised people with HIV from the All-Ireland Infectious Diseases Cohort (AIID) study, a prospective, multicentre cohort study in Ireland. The second cohort included participants >40 years old with no known history of CVD from the Understanding the Pathology of Comorbid Disease in HIV-Infected Individuals With Coronary Artery Disease (HIV UPBEAT CAD) sub-study, a cross-sectional study of people with and without HIV who were propensity score matched for traditional cardiovascular risk factors. The third cohort included a subset of participants from the Co-morBidity in Relation to HIV/AIDS (COBRA) study, which investigated age-related comorbidities in people living with and without HIV enrolled in the AGEhIV and POPPY studies [11]. All participants with HIV were virally suppressed. All cohorts contributed blood samples, along with clinical and socioeconomic data.

Cardiovascular conditions were defined as a composite vascular phenotype (CVP), encompassing a history of hypertension, myocardial infarction, stroke, transient ischaemic attack (TIA), coronary artery disease, or peripheral vascular disease. This definition aligns with the broader vascular disease framework previously applied in the UPBEAT-CAD and AGEhIV cohorts and underpinned the original biomarker selection strategy used here [10,12]. To complement this composite phenotype and to capture cardiovascular risk independent of established disease, particularly given the inclusion of hypertension within this definition, the 10-year Atherosclerotic Cardiovascular Disease (ASCVD) risk score was calculated using pooled cohort equations and used as an additional outcome [13].

Biomarker measurement and data preparation

We used quantitative immunoassays (as previously described) to measure 55 plasma biomarkers associated with specific immune and inflammatory pathways (Table 1) [10]. Biomarker data from all platforms were combined into a single dataset for integrated analysis. For markers with incomplete data (less than 10% of the overall data), missing values were imputed using multiple imputation with predictive mean matching (m = 5, 50 iterations) via the mice package (v3.16.0) in R [14].

Table 1. Full Biomarker panel.

https://doi.org/10.1371/journal.pcbi.1014209.t001

To adjust for any plate-specific batch effects, ComBat batch correction (from the sva package) was applied to the entire combined dataset using an empirical Bayes framework [15]. This approach adjusts for systematic batch effects while preserving biological variation relevant to inflammatory biomarker patterns. Biomarker concentrations were log-transformed to approximate normality and scaled to unit variance to ensure that markers with larger intrinsic variability did not disproportionately influence the analysis. These preprocessing steps were performed on the full biomarker dataset prior to downstream analyses to ensure comparability of measurements across cohorts and assay batches.
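
The transformation step described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the original analyses were run in R) and makes two assumptions that the text does not state: that the natural logarithm was used, and that markers were mean-centred as well as scaled, as R's scale() does by default. The function name log_scale is hypothetical.

```python
import math
import statistics


def log_scale(values):
    """Log-transform positive concentrations, then centre and scale to unit variance.

    Assumptions (not stated in the text): natural log, mean-centring,
    and the sample standard deviation (n - 1 denominator).
    """
    logged = [math.log(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]
```

After this step each marker has a standard deviation of 1 on the log scale, so a marker measured in ng/mL cannot dominate the principal components simply because its raw values are numerically larger than one measured in pg/mL.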

Definition of cardiovascular outcomes

Associations between biomarker-defined clusters and CVP and 10-year ASCVD risk were assessed using univariable logistic regression, with the uninflamed cluster as the reference. Multivariable models for CVP were adjusted for age, sex, smoking history (current or ex-smoker), body mass index (BMI) and dyslipidaemia (defined as elevated total cholesterol or triglycerides, or a documented history of dyslipidaemia). ASCVD risk was not adjusted for these covariates, as it is itself a composite risk score derived from established cardiovascular risk factors.

Missing data for clinical covariates (BMI and smoking history) were imputed using multiple imputation with predictive mean matching (m = 5, 50 iterations) [14]. Continuous variables were imputed using predictive mean matching, and categorical variables using classification and regression tree (CART) methods. All analyses were conducted in R version 4.3.2, and odds ratios (ORs) for the association between outcomes and biomarker cluster were reported with profile-likelihood 95% confidence intervals (95% CIs).

Initial clustering and baseline model replication

An initial panel of 24 biomarkers was selected based on prior biomarker–based clustering analysis conducted in the UPBEAT-CAD and AIID cohorts, which identified a distinct inflammatory cluster associated with increased prevalence of both subclinical and clinical CVD (Table 1, highlighted in bold) [10]. This 24-biomarker panel was used as the common baseline input for all subsequent clustering and modelling analyses.

Clustering was performed using a consistent analytical pipeline comprising principal component analysis (PCA) for dimensionality reduction, followed by hierarchical clustering using Ward’s minimum variance method and squared Euclidean distance (11). Principal components were retained according to the default Hierarchical Clustering on Principal Components (HCPC) criterion implemented in the FactoMineR package, which selects a reduced set of components that capture the majority of total variance while minimising noise. In this baseline 24-biomarker model, the retained principal components collectively explained more than 70% of the total variance. The optimal number of clusters was automatically determined using the HCPC algorithm [16]. This PCA–HCPC pipeline and parameterisation were used consistently across all subsequent clustering analyses.

Following baseline clustering, biomarker contributions to cluster formation were quantified as the standardised difference between the conditional mean within a cluster and the overall mean across all participants. Biomarkers with the largest standardised differences, exceeding the threshold defined by the corresponding normal distribution quantile, were considered the most influential in defining that cluster’s profile. Within clusters, these influential biomarkers were ranked in descending order to highlight those most characteristic of the cluster-specific inflammatory pattern.
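
As an illustration of this standardised difference, the sketch below assumes it takes the form of the v-test commonly reported alongside HCPC output: the cluster mean minus the overall mean, divided by the standard error of a mean drawn without replacement from the full sample. The exact variance convention and the 5% two-sided threshold are assumptions, and the names v_test and V_TEST_THRESHOLD are hypothetical.

```python
import math
import statistics

# Assumed threshold: two-sided 5% normal quantile (about 1.96).
V_TEST_THRESHOLD = statistics.NormalDist().inv_cdf(0.975)


def v_test(values, labels, cluster):
    """Standardised difference between a cluster mean and the overall mean.

    values: one biomarker's (scaled) measurements for all participants
    labels: cluster assignment for each participant
    Requires that the cluster does not contain every participant.
    """
    n_total = len(values)
    in_cluster = [v for v, lab in zip(values, labels) if lab == cluster]
    n_k = len(in_cluster)
    overall_mean = statistics.fmean(values)
    overall_var = statistics.pvariance(values)  # population variance (assumption)
    # Variance of the mean of n_k values sampled without replacement from n_total
    var_of_mean = (overall_var / n_k) * (n_total - n_k) / (n_total - 1)
    return (statistics.fmean(in_cluster) - overall_mean) / math.sqrt(var_of_mean)
```

Under these assumptions, a marker would be treated as characteristic of a cluster when the absolute value of v_test exceeds V_TEST_THRESHOLD, with the sign indicating whether the marker is elevated or suppressed in that cluster.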

Recursive feature addition methodology

To evaluate how different strategies for incorporating additional biomarkers altered cluster structure and downstream cardiovascular associations, three RFA approaches were compared (Fig 1). Each strategy extended the initial 24-biomarker model by incorporating up to 31 additional candidate biomarkers, with inclusion criteria specific to each strategy.

Fig 1. Workflow for biomarker-based clustering and recursive feature addition to characterise cardiovascular phenotypes in HIV.

https://doi.org/10.1371/journal.pcbi.1014209.g001

Model 1. Cumulative biomarker addition guided by functional pathway relevance
Model 2. Independent single-marker evaluation without cumulative retention
Model 3. Iterative forward–backward selection driven by classification performance.

At each iteration, model performance was evaluated against the initial 24-biomarker model, allowing the marginal utility of each additional biomarker to be quantified. Performance metrics included classification accuracy and Cohen’s kappa.

Analytical workflow used to refine biomarker-based clustering for the identification of cardiovascular phenotypes (CVP) in people with HIV. An initial set of 24 CVP-related biomarkers was selected and used for primary clustering. RFA was applied to evaluate three feature selection strategies: (i) cumulative biomarker addition guided by biological relevance, (ii) independent single-marker evaluation, and (iii) greedy forward–backward selection. Biomarkers were retained for further analysis if they met predefined performance criteria based on importance, classification accuracy, and agreement. For each model, clustering was repeated using the baseline biomarkers together with retained markers. Refined clusters were subsequently compared using principal component analysis (PCA) visualisation, cluster stability assessment, and associations with cardiovascular phenotypes and atherosclerotic cardiovascular disease (ASCVD). Created in BioRender. Alalwan, D. (2026) https://BioRender.com/ahf37oh

Common Processing Pipeline

The primary analytical tool used for all modelling strategies was a random forest (RF) classifier [17] due to its ability to handle high-dimensional, potentially collinear predictors and its internal bootstrap-based validation framework [18]. RF models were used to evaluate the incremental contribution of candidate biomarkers to cluster classification.

Models were implemented using the caret framework with the ranger backend, with 500 trees grown per model and permutation-based variable importance enabled. For each modelling strategy, data were partitioned into training (70%) and test (30%) subsets using stratified sampling to preserve cluster proportions.
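
The stratified 70/30 partition can be sketched as follows. This is a plain-Python illustration of the idea and not the caret routine used in the analysis; the function name stratified_split, the rounding rule and the seed value are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Split participant indices into train/test while preserving cluster proportions.

    labels: cluster assignment per participant (labels must be sortable).
    Returns two sorted lists of indices (train, test).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)

    train, test = [], []
    for lab in sorted(by_cluster):
        members = by_cluster[lab]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))  # per-cluster training size
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)
```

Splitting within each cluster matters most for the smallest cluster: a simple random split could leave too few of its members in the test set to estimate accuracy and kappa reliably.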

Within the training set, 3-fold cross-validation was applied to ensure internal validation and mitigate overfitting [19]. All feature selection, variable importance estimation, and model tuning steps were performed exclusively within the training data during cross-validation, while the held-out test set was used only for final model evaluation. Candidate biomarkers were retained if they demonstrated evidence of improved model performance, defined by exceeding thresholds for variable importance, classification accuracy (> 0.80), and Cohen’s kappa (> 0.65). As the biomarker data were harmonised prior to modelling, RF accuracy was interpreted as a relative measure of cluster separability rather than as an independent indicator of cluster validity. Variable importance values were normalised to obtain percentage contributions to facilitate comparison across biomarkers [20–22].
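
For reference, the two performance metrics and the stated retention thresholds can be written out directly. The sketch below is illustrative Python rather than the caret output used in the analysis; the variable importance threshold is omitted because its value is not given in the text, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Proportion of participants assigned to their original cluster."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def cohen_kappa(truth, predicted):
    """Agreement between true and predicted labels, corrected for chance."""
    n = len(truth)
    observed = accuracy(truth, predicted)
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    # Chance agreement from the product of the marginal label frequencies
    expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one label occurs
    return (observed - expected) / (1 - expected)


def meets_thresholds(truth, predicted, min_accuracy=0.80, min_kappa=0.65):
    """Apply the accuracy (> 0.80) and kappa (> 0.65) thresholds from the text."""
    return (accuracy(truth, predicted) > min_accuracy
            and cohen_kappa(truth, predicted) > min_kappa)
```

Kappa is reported alongside accuracy because the clusters are unequal in size: a classifier that ignores the smallest cluster entirely can still reach high accuracy, whereas kappa discounts the agreement expected from the marginal frequencies alone.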

RFA model implementation

Model 1: Stepwise addition with cumulative evaluation

In this approach, biomarkers were added cumulatively to the initial model in a fixed sequence determined by the authors, based on their involvement in hypothesised pathways (Table 2). Beginning with the initial set of 24 biomarkers, each subsequent model iteration incorporated one additional biomarker in the predefined order (i.e., 24 + 1 (25), 24 + 2 (26),..., 24 + 31 (55)). This cumulative inclusion allowed for the assessment of incremental improvements in model performance and identification of saturation points.

Table 2. Sequential addition of biomarkers (25–55) used in Model 1 to the initial 24-marker model, ordered by functional relevance to biological pathways identified in initial clustering.

https://doi.org/10.1371/journal.pcbi.1014209.t002

Model 2: Independent Addition Without Order Assumptions

In this approach, each candidate biomarker was evaluated independently by adding it to the baseline model one at a time. Unlike Model 1, no cumulative feature retention was performed; each model consisted of the 24 initial biomarkers plus one additional candidate (i.e., 24 + 1 (marker 25), 24 + 1 (marker 26),..., 24 + 1 (marker 55)). This strategy allowed for direct comparison of the individual predictive utility of each biomarker when added in isolation to the initial core set.
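
The difference between the cumulative design of Model 1 and the independent design of Model 2 lies only in how the candidate feature sets are built. A minimal Python sketch, with hypothetical function names and marker labels, is:

```python
def cumulative_feature_sets(baseline, ordered_candidates):
    """Model 1 style: each set keeps every previously added candidate."""
    sets = []
    current = list(baseline)
    for marker in ordered_candidates:
        current = current + [marker]
        sets.append(list(current))
    return sets


def independent_feature_sets(baseline, candidates):
    """Model 2 style: each set is the baseline plus exactly one candidate."""
    return [list(baseline) + [marker] for marker in candidates]
```

With 24 baseline markers and 31 candidates, the cumulative sets grow from 25 to 55 features and depend on the chosen order, whereas the independent sets all contain 25 features and are unaffected by the order in which candidates are listed.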

Model 3: Bidirectional feature selection

A data-driven, greedy feature selection algorithm was implemented to iteratively identify the optimal subset of biomarkers that maximised classification performance. Unlike traditional stepwise selection methods that use statistical significance (p-values) to guide inclusion or exclusion, this greedy recursive procedure relied solely on empirical model performance, specifically accuracy and Kappa, to determine which biomarkers to retain. The algorithm began with the initial 24 biomarker model and performed forward selection, sequentially testing each of the remaining 31 biomarkers one at a time. For each candidate biomarker, a new RF model was trained and its classification accuracy (and Kappa) evaluated on the held-out test set. The biomarker yielding the greatest improvement was retained and the model repeated, creating successive models of 24 + 1, 24 + 2, and so on. Once no further forward gains were achieved, a backward elimination step followed, in which previously included biomarkers were temporarily removed and the model re-evaluated. If exclusion led to improved performance, the biomarker was permanently discarded. This forward–backward iterative process continued until the model reached a performance plateau, yielding a final, parsimonious biomarker set optimised purely on empirical accuracy rather than predefined biological pathways.
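
The forward–backward procedure can be sketched generically. In the Python illustration below, score stands in for training an RF model on a feature set and returning its test-set performance; it is a hypothetical callable, as are min_gain and the function name. The sketch also assumes that the 24 baseline markers are never removed, which the text implies but does not state.

```python
def greedy_forward_backward(baseline, candidates, score, min_gain=0.0):
    """Greedy forward-backward selection of candidates added to a fixed baseline.

    score: callable taking a list of feature names and returning a number
           (higher is better); assumed deterministic, e.g. via a fixed seed.
    Returns (selected candidates, best score).
    """
    selected = []
    best = score(list(baseline))
    improved = True
    while improved:
        improved = False
        # Forward pass: repeatedly add the single best remaining candidate
        while True:
            remaining = [c for c in candidates if c not in selected]
            if not remaining:
                break
            trials = [(score(list(baseline) + selected + [c]), c) for c in remaining]
            top_score, top_marker = max(trials, key=lambda t: t[0])
            if top_score - best <= min_gain:
                break  # no further forward gain
            selected.append(top_marker)
            best = top_score
            improved = True
        # Backward pass: drop any selected candidate whose removal helps
        for marker in list(selected):
            reduced = [m for m in selected if m != marker]
            trial_score = score(list(baseline) + reduced)
            if trial_score - best > min_gain:
                selected = reduced
                best = trial_score
                improved = True
    return selected, best
```

Because every accepted step strictly increases the score, the loop terminates for a deterministic scorer. The result is path-dependent: two candidates carrying similar information compete for the same slot, and whichever scores marginally higher on a given split is retained.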

Biomarker selection stability

To assess the robustness of biomarker selection beyond a single train–test split, biomarker selection stability was evaluated for each modelling strategy. For Models 1 and 2, stability was evaluated using repeated resampling (n = 50 iterations). In each iteration, a baseline RF model using the initial 24 biomarkers was trained and evaluated on a training-only split. Candidate biomarkers (markers 25–55) were then added according to model design and re-evaluated. Selection was defined using a delta-based criterion, whereby a biomarker was considered selected in a given iteration only if it produced a minimum improvement in both classification accuracy and Cohen’s kappa relative to the baseline model. Selection frequency across resampling iterations was used as an indicator of relative stability of biomarker prioritisation within the correlated biomarker panel. As candidate biomarkers may be partially correlated, no strict a priori threshold for stable selection was imposed; instead, frequencies were interpreted comparatively across models.
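
The delta-based criterion and the resulting selection frequency can be sketched as below. This is an illustrative Python fragment: the minimum improvement is not specified in the text, so min_delta is a hypothetical parameter, as are the function name and the input layout.

```python
def selection_frequency(baseline_metrics, candidate_metrics, min_delta=0.0):
    """Fraction of resampling iterations in which each candidate is 'selected'.

    baseline_metrics: list of (accuracy, kappa) for the baseline, one per iteration
    candidate_metrics: dict mapping marker -> list of (accuracy, kappa),
                       one per iteration, for baseline plus that marker
    A candidate counts as selected only if BOTH metrics improve by more
    than min_delta relative to the baseline in the same iteration.
    """
    n_iter = len(baseline_metrics)
    frequencies = {}
    for marker, runs in candidate_metrics.items():
        selected = 0
        for (base_acc, base_kappa), (acc, kappa) in zip(baseline_metrics, runs):
            if acc - base_acc > min_delta and kappa - base_kappa > min_delta:
                selected += 1
        frequencies[marker] = selected / n_iter
    return frequencies
```

Comparing against the baseline fitted on the same resample, rather than against a fixed reference value, removes split-to-split variation in overall difficulty from the comparison.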

For Model 3, given the path-dependent nature of greedy selection, the full forward–backward procedure was repeated across multiple resampled training–testing splits. For each iteration, the final selected biomarker set and corresponding performance were recorded. Selection frequencies were summarised to characterise the reproducibility of the greedy procedure.

Cluster reconstruction and stability assessment

To assess the impact of model-specific biomarker sets on clustering, secondary clustering was performed separately for each RFA model. The baseline 24 biomarkers were combined with model-specific selected biomarkers, and PCA–HCPC clustering was repeated using the same pipeline and parameters as the baseline analysis. This yielded three updated clustering solutions corresponding to each model.

Cluster robustness was evaluated using bootstrap resampling (n = 500 iterations) of the biomarker matrix. For each bootstrap iteration, participants were resampled with replacement to generate datasets of equal size to the original cohort. PCA and HCPC were re-applied using the same pipeline as in the primary analysis. Cluster assignments from each bootstrap iteration were compared with those from the original solution using the Adjusted Rand Index (ARI), which quantifies agreement while correcting for chance.

Iterations in which the clustering procedure failed to converge or produced invalid solutions were excluded (<2% of iterations). The resulting distribution of ARI values (median, mean, standard deviation, minimum, maximum, and interquartile range) was used to summarise cluster stability and to characterise the presence of any unstable tail in the stability distribution. Higher ARI values indicate greater agreement between bootstrap-derived and original cluster assignments, with values closer to 1 reflecting highly stable clustering solutions.
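
For readers unfamiliar with the ARI, a self-contained Python sketch of the standard pair-counting formula is given below. It is illustrative and not the implementation used in the analysis. In the bootstrap setting, labels_a would hold the original cluster assignments of the resampled participants and labels_b their assignments in the bootstrap solution; that pairing is an assumption about how the comparison was made.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same participants."""
    n = len(labels_a)
    if n < 2:
        raise ValueError("ARI needs at least two observations")
    # Pairs of participants grouped together in both, in A only, and in B only
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put everyone together
    return (sum_both - expected) / (max_index - expected)
```

The index depends only on which participants are grouped together, so it is unaffected by the arbitrary numbering of clusters between bootstrap runs. A value near 0 indicates agreement no better than chance, and negative values are possible.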

Given the multi-cohort structure of the dataset, potential cohort effects were examined using chi-squared tests for cluster–cohort association and analysis of variance to assess whether cohort explained variation along principal components.

Cluster–outcome associations and regression stability

The clinical relevance of the updated clusters was assessed by examining associations with CVP and its components. Sensitivity analyses were conducted separately for cardiovascular disease events (myocardial infarction, stroke, TIA, CAD, or peripheral vascular disease) and for hypertension alone to account for outcome heterogeneity and differential event prevalence. Predictive validity was assessed by comparing the degree to which CVP was differentiated across clusters relative to the initial clustering solution.

To assess the robustness of downstream associations, stratified bootstrap resampling (n = 1000 iterations) was applied to adjusted logistic regression models. For each iteration, samples were resampled with replacement while preserving outcome prevalence, and models were refitted. Bootstrap analyses were performed separately for CVP, cardiovascular events, and hypertension. OR distributions were summarised using medians, percentile-based confidence intervals, and the proportion of iterations in which ORs exceeded unity, providing a measure of cluster–outcome reproducibility.
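
Two parts of this procedure, the outcome-stratified resampling and the summary of the bootstrap OR distribution, can be sketched in Python as follows. The refitting of the adjusted logistic regression is deliberately not shown. The function names are hypothetical, and the 95% level of the percentile interval is an assumption, since the text does not state it.

```python
import random
import statistics


def stratified_bootstrap_indices(outcomes, rng):
    """Resample with replacement within outcome strata, preserving prevalence.

    outcomes: list of 0/1 outcome indicators, one per participant
    rng: a random.Random instance (seeded for reproducibility)
    """
    cases = [i for i, y in enumerate(outcomes) if y == 1]
    controls = [i for i, y in enumerate(outcomes) if y == 0]
    return rng.choices(cases, k=len(cases)) + rng.choices(controls, k=len(controls))


def summarise_odds_ratios(odds_ratios):
    """Median, 95% percentile interval and proportion of bootstrap ORs above 1."""
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%
    cuts = statistics.quantiles(odds_ratios, n=40, method="inclusive")
    return {
        "median": statistics.median(odds_ratios),
        "ci_low": cuts[0],
        "ci_high": cuts[-1],
        "prop_above_one": sum(o > 1 for o in odds_ratios) / len(odds_ratios),
    }
```

The proportion of iterations with an OR above 1 complements the interval: it indicates how consistently the direction of the association is reproduced, even when the interval itself includes unity.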

Results

Participant characteristics

A total of 408 participants (318 (77.9%) people with HIV) were included in the analysis (Table 3). Median age was 50 (interquartile range [IQR], 43–58) years, 83% were male and 68% were White. Overall, 136 participants (33.3%) met criteria for a CVP, including hypertension (n = 120), coronary artery disease (n = 23), heart failure (n = 4), myocardial infarction (n = 3), peripheral artery disease (n = 3) and TIA/cerebrovascular accident (n = 2). Seventeen participants reported more than one type of event. Median ASCVD risk score was 6% (IQR 2.4–13.0%).

Table 3. Baseline demographics of combined cohorts.

https://doi.org/10.1371/journal.pcbi.1014209.t003

Initial model evaluation

Modelling the initial 24 biomarkers revealed three clusters. Cluster 1 (n = 181, 44.3%) (“Uninflamed”) exhibited low levels of inflammatory biomarkers such as soluble CD40-Ligand (sCD40L), CD163, CRP, IL18, IL6, and TNF-α (Fig 2). Cluster 2 (n = 183, 45%) showed elevated systemic inflammation and endothelial activation markers (CD40-Ligand, IL6, TNF-α, vWF, E-selectin, P-selectin) and elevated IL-10 but suppressed Th1 cytokines (IL2, IL12, IL1β) (Table 4). Cluster 3 (n = 44, 10.7%) displayed increases in both pro-inflammatory and Th1-associated pathways, including IFN-γ, IL1β, IL2, IL12, TNF-α, TSLP and MIP1α. CVP rates were comparable across Clusters 2 (38%) and 3 (39%) and slightly lower in Cluster 1 (28%) (p = 0.091). Median 10-year ASCVD risk scores also differed across clusters: Cluster 1, 4.95% (IQR 1.83–11.5%); Cluster 2, 7.75% (IQR 2.63–14.9%); and Cluster 3, 7.12% (IQR 2.54–13.8%) (p = 0.030).

Table 4. Comparison of Baseline and RFA Models: Key Biomarkers and CVP associations and regression outcomes by Cluster.

https://doi.org/10.1371/journal.pcbi.1014209.t004

Fig 2. Heatmap showing biomarker contribution to cluster formation in the initial model.

https://doi.org/10.1371/journal.pcbi.1014209.g002

In univariate analyses, older age (OR 1.07 per 1 year increment; 95% CI 1.05–1.10; p < 0.001), higher BMI (OR 1.11 per 1 kg/m² increment; 95% CI 1.06–1.16; p < 0.001), smoking history (current or ex-smoker) (OR 1.86; 95% CI 1.23–2.84; p = 0.003), and dyslipidaemia (OR 2.50; 95% CI 1.64–3.82; p < 0.001) were all associated with greater odds of CVP (Fig 3). Membership in Cluster 2 was also associated with significantly higher CVP odds (OR 1.59; 95% CI 1.02–2.48; p = 0.041) (S1 Table). Although not significant, Cluster 3 showed a similar trend toward increased CVP risk (OR 1.65; 95% CI 0.82–3.27; p = 0.15). However, after adjustment, the association for Cluster 2 (OR 1.23; 95% CI 0.76–2.01; p = 0.55) was attenuated and no longer significant (S2 Table). Univariate analyses determining associations with 10-year ASCVD risk as a continuous outcome yielded broadly similar findings. As with CVP, membership in Cluster 2 was associated with higher ASCVD scores (fold-change 1.40; 95% CI 1.10–1.78; p = 0.005), whereas Cluster 3 was not (fold-change 1.30; 95% CI 0.88–1.91; p = 0.18).

Fig 3. Baseline 24-biomarker model: associations with CVP.

https://doi.org/10.1371/journal.pcbi.1014209.g003

Forest plot showing unadjusted and adjusted logistic regression analyses for inflammatory clusters and clinical covariates in relation to CVP, using cluster 1 as the reference group. Cluster membership was derived from 24 initial inflammatory biomarkers. Adjusted models included age, sex, BMI, smoking history, and dyslipidaemia. Points represent odds ratios with 95% confidence intervals.

Model 1: Stepwise Addition with Cumulative Evaluation

This model, which sequentially added biomarkers cumulatively to the baseline model, resulted in the selection of six additional biomarkers that had the highest impact on model accuracy and multicollinearity: CXCL9, IL-17, EGF, IL-8, Thrombopoietin and GDF-15 (Table 5). Biomarkers identified in the Model 1 analysis showed low to moderate selection frequencies across repeated resamples.

Table 5. Biomarker selection criteria scores model 1.

https://doi.org/10.1371/journal.pcbi.1014209.t005

Repeat PCA of these 6 markers added to the 24 initial biomarkers yielded 3 clusters (Fig 4). Cluster 1 (n = 141, 35%) again displayed low levels of inflammation, with suppressed TNF-α, CXCL9, IL6, IL1RA, TGF-α and EGF. Cluster 2 (n = 200, 49%) was characterised by elevated IL18, IL10, CRP, TNF-α and GDF-15 and decreased IL-1β, IL12 and MIP-1α2, whilst Cluster 3 (n = 67, 16.4%) displayed an overall inflamed pattern with high TNF-α, IL1β, IL2, IL-8, CRP and MIP1α. Cluster reproducibility was assessed using 500 bootstrap iterations of PCA and HC. Bootstrap clustering consistently reproduced the same number of clusters as identified in the primary analysis, allowing direct comparison of cluster assignments across resampled datasets. Cluster assignments from each bootstrap were compared to the original solution using the ARI. The distribution of ARI values for Model 1 indicated moderate cluster stability, with a median ARI of 0.55 and standard deviation of 0.21 across 500 iterations (S3 Table).
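For readers less familiar with the ARI, the following minimal Python sketch shows how the index compares two cluster assignments of the same participants. It is an illustration under stated assumptions, not the authors' R implementation: the function name `adjusted_rand_index` and the toy label vectors are hypothetical, and the re-clustering of each bootstrap sample is not shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length (>= 2 items)")
    # Pairs of items that share a cluster in both labelings, in A only, in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:                            # e.g. all items in one cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy labelings: cluster names are arbitrary, so a pure relabelling scores 1.0.
original = [1, 1, 1, 2, 2, 3]
relabelled = [3, 3, 3, 1, 1, 2]
perturbed = [1, 1, 2, 2, 2, 3]
print(adjusted_rand_index(original, relabelled))
print(adjusted_rand_index(original, perturbed))
```

Because the index is corrected for chance agreement, values near 0 indicate agreement no better than random assignment and values near 1 indicate near-identical partitions, which is why the median across bootstrap iterations is a convenient summary of cluster stability.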

Fig 4. Heatmap showing biomarker contribution to cluster formation in RFA model 1.

https://doi.org/10.1371/journal.pcbi.1014209.g004

CVP rates in Model 1 varied significantly across clusters, with the highest rate observed in Cluster 3 (45%), compared to Cluster 1 (35%) and Cluster 2 (26%) (p = 0.018). Median 10-year ASCVD risk scores showed a similar pattern: Cluster 1, 4.02% (1.26–10.5%); Cluster 2, 7.58% (2.97–15.6%); Cluster 3, 8.64% (3.06–14.9%) (Kruskal-Wallis p < 0.0005) (Table 6).
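The Kruskal-Wallis comparison of ASCVD scores across three clusters can be illustrated with a short, self-contained Python sketch. It assumes the usual tie-corrected H statistic with an asymptotic chi-square p-value; with three groups there are 2 degrees of freedom, for which the chi-square upper tail is exactly exp(−H/2). The function name `kruskal_wallis_3groups` and the toy values are hypothetical and are not study data, and the source does not state whether an asymptotic or exact p-value was used.

```python
from math import exp


def kruskal_wallis_3groups(g1, g2, g3):
    """Tie-corrected Kruskal-Wallis H and asymptotic p-value for three groups."""
    groups = [list(g1), list(g2), list(g3)]
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    rank_sums = [0.0, 0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i                            # size of this block of tied values
        tie_term += t ** 3 - t
        i = j
    h = 12 / (n * (n + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h -= 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)         # undefined if every value is identical
    p = exp(-h / 2)                          # chi-square upper tail with 2 df
    return h, p


# Toy risk scores (%) for three hypothetical clusters.
h_stat, p_value = kruskal_wallis_3groups([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h_stat, 3), round(p_value, 4))
```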

Table 6. Median ASCVD 10-year risk and fold-change associations with CVP across models and clusters.

https://doi.org/10.1371/journal.pcbi.1014209.t006

In Model 1, univariate analysis showed that membership in Cluster 2 was not significantly associated (OR 1.57, 95% CI 0.98–2.55) but Cluster 3 membership was associated with higher odds of CVP (OR 2.36, 95% CI 1.28–4.38) (S4 Table). However, after adjustment, Cluster 3 (OR 1.68, 95% CI 0.84–3.32) no longer remained significantly associated with CVP (Fig 5) (S5 Table). A similar pattern was observed when 10-year ASCVD risk was examined, with both clusters 2 and 3 associated with higher ASCVD scores (Cluster 2 fold change 1.65, 95% CI 1.29–2.12, p < 0.001, Cluster 3 fold change 1.74, 95% CI 1.25–2.44, p = 0.0012).

Fig 5. Unadjusted and adjusted odds ratios (95% CI) for CVP in all three RFA models, using cluster 1 as the reference group.

https://doi.org/10.1371/journal.pcbi.1014209.g005

Stratified bootstrap analyses showed that odds ratio estimates were directionally consistent (S6 Table) and robust to sampling variability. When the CVP was stratified into cardiovascular events and hypertension, associations with cardiovascular events were less stable across resamples, whereas associations with hypertension were more consistent, particularly for Cluster 3.

Model 2: Independent addition without order assumptions

This model, which assessed each marker individually, resulted in 11 biomarkers being added to the baseline model: CXCL9, IL-17, IL-8, Thrombopoietin, GM-CSF, GDF-15, IFN-λ2, IFN-α2a, IL-5, EGF and TGF-α (Table 7). Biomarkers identified in the Model 2 analysis showed low to moderate selection frequencies across repeated resamples.

Table 7. Biomarker selection criteria scores model 2.

https://doi.org/10.1371/journal.pcbi.1014209.t007

Repeat PCA on this expanded biomarker set again identified three distinct clusters (Fig 6). Cluster 1 (n = 180) was characterised by low levels of inflammatory markers, including sCD40L, CXCL9, GDF-15, IFN-λ2, IL-6, IL-17, IL1RA, TGF-α, Thrombopoietin, and TNF-α. Cluster 2 (n = 192) included individuals with elevated levels of sCD40L, CXCL9, EGF, GDF-15, IFN-λ2, IL-6, IL-18, MCP-1, TGF-α, and TNF-α, and low levels of GM-CSF, IFN-α2a, IL-2, and IL-12. Similar to Model 1, Cluster 3 (n = 36) was the smallest cluster and exhibited high levels of inflammatory markers such as IL-1β, IL-2 and MIP-1α. However, unlike Model 1, this revised Cluster 3 also showed additional markers of innate and adaptive immunity, including elevated GM-CSF, IFN-α2a, IFN-γ, IL-12, TSLP, TGF-α, and Thrombopoietin, with low P-selectin. Bootstrap re-clustering of the expanded biomarker set demonstrated consistent cluster numbers and good stability of the resulting cluster structure, with a median ARI of 0.74 and standard deviation of 0.16 across 500 bootstrap iterations.

Fig 6. Heatmap showing biomarker contribution to cluster formation in model 2.

https://doi.org/10.1371/journal.pcbi.1014209.g006

When examining CVP outcomes, Cluster 3 had the highest proportion of CVP (42%), followed by Cluster 2 (36%) and Cluster 1 (28%) (p = 0.14). Median 10-year ASCVD risk scores also differed significantly across clusters, with Cluster 1 at 4.9% (1.4–10.8%), Cluster 2 at 7.7% (2.9–15.5%), and Cluster 3 highest at 8.4% (2.9–14.3%) (Kruskal-Wallis p = 0.008).

In univariate analysis, both Cluster 2 (OR 1.61; 95% CI 1.03–2.52; p = 0.037) and Cluster 3 (OR 2.52; 95% CI 1.24–5.12; p = 0.010) were associated with higher odds of CVP compared with the uninflamed Cluster 1 (Fig 5) (S7 Table). After adjustment, only Cluster 3 remained significantly associated with CVP (OR 2.30; 95% CI 1.05–5.06; p = 0.037) (S8 Table). Similarly, for 10-year ASCVD risk, Cluster 2 (fold change 1.48; 95% CI 1.17–1.88; p = 0.001) and Cluster 3 (fold change 1.48; 95% CI 0.99–2.22; p = 0.053) showed higher risk.

Stratified bootstrap analyses demonstrated that Cluster 3 membership was consistently associated with higher odds of the CVP, with a median odds ratio of 2.25 and the odds ratio exceeding unity in 97.5% of bootstrap iterations (S6 Table). When the CVP was stratified, associations with cardiovascular events were less stable across resamples, whereas associations with hypertension were more consistent for both clusters.

Model 3: Bidirectional feature selection

Model 3 applied both forward and backward selection methods to include only biomarkers that improved model performance metrics in both directions. Five biomarkers were ultimately retained: IL-17, TGF-α, GDF-15, MDC and IFN-α2a (Table 8). In repeated resampling analyses, the greedy forward–backward selection procedure showed substantial instability. Across 50 repeated train–test splits, no candidate biomarker (markers 25–55) was selected in more than 10% of runs, and the number of added biomarkers varied between 0 and 3 per iteration.
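The general shape of a greedy forward–backward search can be sketched as follows. This is a generic illustration rather than the procedure in S4 Data: the scoring function stands in for whatever performance metric is being optimised (in the study, metrics derived from RF classifiers), and the names `greedy_forward_backward`, `toy_score`, `WEIGHT` and `REDUNDANT`, together with the integer scores, are hypothetical.

```python
from itertools import combinations


def greedy_forward_backward(base, candidates, score):
    """Add the best-scoring candidate, then drop any added feature whose removal helps."""
    selected = list(base)
    added = []
    best = score(selected)
    changed = True
    while changed:
        changed = False
        # Forward step: try each unused candidate and keep the single best improvement.
        trials = [(score(selected + [c]), c) for c in candidates if c not in selected]
        if trials:
            top_score, top = max(trials)
            if top_score > best:
                selected.append(top)
                added.append(top)
                best = top_score
                changed = True
        # Backward step: remove a previously added feature if the score rises without it.
        for c in list(added):
            reduced = [f for f in selected if f != c]
            s = score(reduced)
            if s > best:
                selected, best = reduced, s
                added.remove(c)
                changed = True
    return added, best


# Toy score: each feature contributes points, and correlated pairs are penalised.
WEIGHT = {"b1": 30, "b2": 30, "x": 10, "y": 8, "z": 8}
REDUNDANT = {frozenset(("x", "y")): 6, frozenset(("x", "z")): 6}


def toy_score(features):
    total = sum(WEIGHT[f] for f in features)
    for pair in combinations(features, 2):
        total -= REDUNDANT.get(frozenset(pair), 0)
    return total


added, final_score = greedy_forward_backward(["b1", "b2"], ["x", "y", "z"], toy_score)
print(added, final_score)
```

In this toy run the feature picked first ("x") is later discarded once two features that overlap with it have entered, which shows how the retained set depends on the order in which candidates are evaluated. Every accepted move strictly increases the score, so the loop always terminates.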

Table 8. Model 3 (Greedy forward–backward) biomarker selection path and resampling stability.

https://doi.org/10.1371/journal.pcbi.1014209.t008

PCA based on these markers once again revealed three clusters (Fig 7). Cluster 1 (n = 186) was characterised by lower inflammatory markers, with reduced levels of sCD40L, E-selectin, CRP, IL-18, IL-6, IL1RA, MCP-1, TNF-α, and vWF. Cluster 2 (n = 183) showed elevated pro-inflammatory and endothelial markers, including P-selectin, TNF-α, and vWF, but low levels of adaptive immune markers like IL-2, IL-12, and IFN-α2a. Cluster 3 (n = 39) exhibited higher levels of immune activation, with high levels of TSLP, IFN-γ, IL-1β, IL-6, IL-12, MIP-1α, and TNF-α, and low expression of the endothelial marker VCAM, reflecting dysregulated adaptive and innate immune responses. Bootstrap re-clustering of the expanded biomarker set demonstrated good stability of the resulting cluster structure, with a median ARI of 0.79 and standard deviation of 0.17 across 500 bootstrap iterations.

Fig 7. Correlation Heatmap showing biomarker contribution to cluster formation in model 3.

https://doi.org/10.1371/journal.pcbi.1014209.g007

CVP prevalence was highest in Cluster 3 (38%) and Cluster 2 (37%) compared with Cluster 1 (29%) (p = 0.64). Median ASCVD 10-year risk was likewise highest in Cluster 3 (8.0%, 2.7–14.0) and Cluster 2 (7.7%, 2.9–14.9) compared with Cluster 1 (4.9%, 1.8–11.5) (p = 0.024).

In univariate analysis, neither Cluster 2 (OR 1.41; 95% CI 0.91–2.19; p = 0.12) nor Cluster 3 (OR 1.53; 95% CI 0.73–3.11; p = 0.25) was significantly associated with CVP compared with the uninflamed Cluster 1 (Fig 5) (S9 Table). These were not appreciably altered after adjustment (S10 Table). ASCVD 10-year risk was higher in Cluster 2 (fold change 1.41; 95% CI 1.11–1.79; p = 0.005) but not in Cluster 3 (fold change 1.36; 95% CI 0.91–2.04; p = 0.13) relative to Cluster 1. Bootstrap resampling showed modest, directionally consistent associations between cluster membership and outcomes, but with wide confidence intervals and limited stability across resamples (S6 Table).

Comparison of biomarker composition across models

Across models, IL-17 and GDF-15 were consistently selected, underscoring their central roles in inflammatory and vascular processes (Fig 8). Model 1 introduced four additional biomarkers—CXCL9, EGF, IL-8 and Thrombopoietin—broadening coverage of immune activation and growth factor pathways. Model 2 incorporated all of these markers and further expanded the panel to include GM-CSF, IFN-α2a, IFN-λ2, IL-5, and TGF-α, thereby capturing additional interferon and regulatory cytokine pathways. While TGF-α and IFN-α2a were shared with Model 3, the latter retained a more restricted subset, reflecting a narrower focus on immune modulation. Together, these patterns highlight IL-17 and GDF-15 as stable indicators of inflammation related to CVP, with Model 2 offering the most biologically comprehensive framework that was most closely aligned with prevalent CVP.
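The overlap described above can be checked directly from the biomarker lists reported for each model. The short Python sketch below uses only the marker names given in the Results; the variable names are illustrative.

```python
# Additional biomarkers retained by each model, as listed in the Results.
model_1 = {"CXCL9", "IL-17", "EGF", "IL-8", "Thrombopoietin", "GDF-15"}
model_2 = {"CXCL9", "IL-17", "IL-8", "Thrombopoietin", "GM-CSF", "GDF-15",
           "IFN-λ2", "IFN-α2a", "IL-5", "EGF", "TGF-α"}
model_3 = {"IL-17", "TGF-α", "GDF-15", "MDC", "IFN-α2a"}

shared_by_all = model_1 & model_2 & model_3          # selected by every model
model_1_in_model_2 = model_1 <= model_2              # is Model 1 a subset of Model 2?
model_2_and_3_only = (model_2 & model_3) - model_1   # shared by Models 2 and 3 only
unique_to_model_3 = model_3 - model_1 - model_2      # selected by Model 3 alone

print(sorted(shared_by_all))
print(model_1_in_model_2)
print(sorted(model_2_and_3_only))
print(sorted(unique_to_model_3))
```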

Fig 8. Overlap of Biomarkers Across the Three Predictive Models.

https://doi.org/10.1371/journal.pcbi.1014209.g008

Comparison of models and model stability

These three models each evaluated the contribution of additional biomarkers beyond a common initial model and were compared based on biomarker selection reproducibility, cluster stability, and downstream clinical relevance. Overall, selection frequencies indicated low-to-moderate reproducibility of individual biomarkers across resampling iterations, consistent with the presence of correlated inflammatory markers, and stability varied across modelling strategies. Model comparison therefore prioritised unsupervised cluster stability (ARI distributions) and robustness of clinical associations (bootstrap odds ratio distributions), with RF metrics interpreted as secondary indicators of cluster separability. Considering all evaluation criteria, Model 2 demonstrated the strongest overall performance, achieving a balance between robust biomarker selection, stable clustering, and meaningful associations with clinical outcomes (Table 9).

Table 9. Comparison of biomarker selection robustness, cluster stability, and bootstrap regression results across models.

https://doi.org/10.1371/journal.pcbi.1014209.t009

Model 1 identified a limited set of additional biomarkers with low-to-moderate selection frequencies and produced clusters with moderate stability (median ARI 0.55). Although Cluster 3 showed higher associations with the CVP in unadjusted analyses, these associations were attenuated after covariate adjustment and were less stable when cardiovascular outcomes were stratified into clinical events and hypertension.

In contrast, Model 2 identified a broader and more reproducible set of biomarkers that more reliably improved classification beyond the baseline model. Clusters derived under Model 2 exhibited greater stability (median ARI 0.74) and showed greater and more consistent associations with clinical outcomes. In adjusted analyses, Cluster 3 remained significantly associated with CVP, and bootstrap resampling demonstrated a high proportion of iterations in which odds ratios exceeded unity for both CVP and hypertension.

Model 3, which applied a greedy forward–backward selection strategy to iteratively optimise model performance, achieved the highest cluster stability (median ARI 0.79) but showed limited reproducibility in biomarker selection, with most candidate biomarkers selected infrequently across resampling iterations. Correspondingly, associations between cluster membership and clinical outcomes were weaker and less stable, with wide confidence intervals and lower bootstrap consistency, suggesting reduced robustness of downstream inferences despite stable cluster structure.

Although cohort sizes differed substantially, with Dublin contributing the largest number of participants, the distribution of cohorts across clusters was broadly consistent across all clustering models (Table 10). Each model demonstrated similar proportional representation of Amsterdam, Dublin, and London participants within corresponding clusters, indicating that cluster structure was not driven by a single cohort or geographic site. Cohort effects explained a small proportion of variance in the principal component space (S11 Table), and within-cohort regression analyses yielded directionally consistent but imprecise estimates, consistent with limited statistical power in smaller cohorts (S12 Table).

Table 10. Distribution of study cohorts across clusters for each clustering model.

https://doi.org/10.1371/journal.pcbi.1014209.t010

Discussion

In this study, we compared different RFA modelling strategies for enhancing biomarker clustering so that the resulting clusters associate more closely with CVP in people with HIV. Of the three expanded RFA strategies, independent single-marker evaluation without cumulative retention (Model 2) provided the greatest enhancement of the original clustering and the strongest associations with CVP.

This novel analytical approach provides a structured framework for modelling associations between multiple biomarkers and clinical outcomes, based on an initial choice of markers known to be associated with the outcome, subsequently enriched by additional biomarkers to provide enhanced insights into disease pathogenesis. The result is an enhanced model that represents a meaningful precision medicine approach to help better determine inflammatory patterns associated with important non-communicable diseases.

In our analysis, the initial clustering included a previously characterised panel of 24 biomarkers, chosen for their known association with CAD in people with HIV [5,10]. Compared to previous studies, while this initial clustering successfully stratified individuals into three clusters with distinct inflammatory profiles and moderate differences in CVP prevalence, its discriminatory power in this combined cohort was relatively limited. This may reflect the limited biological scope of the included biomarkers, which, although chosen for their known link to CVP in people with HIV, also capture overlapping aspects of systemic inflammation and immune activation. Building on this initial clustering, our analysis expanded the initial 24 biomarker panel with 31 additional biomarkers encompassing multiple inflammatory, endothelial, and metabolic pathways.

Of the three models explored, Model 2 demonstrated clear advantages over the initial clustering, with the inclusion of 11 biomarkers mapping to pathways relevant to endothelial function, tissue remodelling, and systemic inflammation, aligning well with known CVD mechanisms [23,24]. Compared with Models 1 and 3, Model 2 achieved greater differentiation between clusters, with Cluster 3 characterised by elevated levels of circulating markers of immune regulation (GM-CSF), antiviral activity (IFN-α2a, IFN-γ), systemic inflammation (IL-1β, IL-6), innate immune activation (IL-2, IL-12, TSLP, MIP-1α), and anti-inflammatory activity (IL1RA). Associations between this cluster and CVP persisted after adjustment (adjusted OR: 2.3; 95% CI: 1.04, 5.09), and were supported by stability analyses showing reproducible cluster structure and consistent downstream associations under bootstrap resampling. Although ASCVD scores did not show strong separation between Clusters 2 and 3, the overall pattern supports the biological profile of the clusters and is consistent with the associations observed with the CVP.

Model 1, which resulted in the incorporation of six additional biomarkers, including interferons and growth factors, increased the biological diversity of inflammatory pathways involved but did not substantially improve model performance compared to the baseline model. Although this model offered a comprehensive view of immune regulation, the resulting clusters did not show strong differentiation in CVP prevalence, and the association observed for the inflamed cluster in unadjusted analyses did not persist after adjustment. Stability analyses further indicated only moderate reproducibility of the cluster structure and less consistent associations with CVP across resamples, suggesting that the observed effects were sensitive to sampling variability. The ASCVD findings showed a similar effect, with both inflamed clusters exhibiting higher predicted ASCVD risk than the uninflamed Cluster 1, supporting the underlying biological distinctions between clusters, although without strong separation between Cluster 2 and Cluster 3. Although the ASCVD results were favourable and directionally consistent, ASCVD scores were broadly comparable across Models 1 and 2.

Model 3, which employed a greedy backward and forward selection strategy, identified five biomarkers to be added to the initial model. The repeat PCA based on this model again identified an inflamed cluster (Cluster 3) defined by markers such as IL-6, TNF-α, and VCAM-1, all of which align with well-established pathways in inflammation and vascular dysfunction [25–27]. However, performance of this model was less consistent, with lower cluster separation and reduced predictive utility for CVP. Despite the high structural stability of the clustering solution, stability analyses of downstream associations showed limited reproducibility. Additionally, because greedy selection evaluates individual features one step at a time, it may settle on suboptimal sets of biomarkers, missing better combinations that might only emerge when features are considered jointly, particularly when feature interactions are complex and non-linear. This risk of overfitting and limited generalisability, combined with increased model complexity, may explain why Model 3 was less robust.

Together, these findings highlight how different biomarker integration strategies can yield complementary perspectives on inflammation-driven cardiovascular risk. The consistency of cluster patterns across models, with a less inflamed cluster, a more inflamed cluster and a third, differentially inflamed cluster all represented despite varying marker inputs, suggests that a biologically meaningful subset of individuals with distinct, altered inflammation can be reproducibly identified.

Interestingly, while the inflamed clusters consistently showed higher observed CVP in unadjusted associations, these effects were mostly attenuated after multivariable adjustment, with the exception of Model 2. Likewise, in analyses of the continuous ASCVD risk score, Clusters 2 and 3 both demonstrated higher predicted risk than the uninflamed cluster, but the separation between clusters was limited and similar across Models 1 and 2. This likely reflects the different constructs represented by the two outcomes. The CVP endpoint captures realised disease events, integrating both biological and environmental exposures over time, whereas the ASCVD score estimates predicted risk based on traditional risk factors (age, blood pressure, cholesterol, etc.) that may not fully incorporate the contribution of inflammation. Nevertheless, the complementary nature of these outcomes underscores the added value of biomarker-driven clustering in capturing underlying pathophysiological processes that traditional risk models might overlook.

This analysis demonstrates the utility of an RFA framework for uncovering clinically meaningful groups based on immune and inflammatory biomarker profiles. By iteratively evaluating biomarker contributions to clustering performance, the method enables data-driven identification of informative biomarker panels, improving both interpretability and potential clinical applicability. Compared to traditional clustering approaches, this framework allows for systematic exploration of the importance of each biomarker while maintaining flexibility across diverse datasets and outcomes. The resulting clusters not only reflect underlying biological variation but also show associations with the CVP, highlighting the potential translational relevance of this method. Importantly, the adaptability of this approach enables further modifications as new biomarkers emerge and makes it suitable for application across a range of biomarker-driven disease domains and non-communicable diseases, not just CVD.

Despite the strengths of the RFA framework, this study has several limitations. First, the approach requires access to large, high-quality biomarker datasets to ensure reliable clustering and model stability. Second, the framework is dependent on model-specific performance metrics, such as those derived from RF classifiers, which may introduce bias or limit comparability across different modelling approaches. Additionally, the cross-sectional design limits causal inference and statistical power, while survivor bias and residual confounding cannot be excluded. It is also possible that individuals with established CVD may have modified their behaviour or received interventions that influenced their biomarker profiles, introducing potential reverse causation. Finally, translating this approach to clinical settings poses challenges, as the specialised platforms required to measure these biomarkers are resource-intensive, technically demanding, and not yet integrated into routine diagnostic workflows. Furthermore, the interpretability of machine learning-derived clusters can be complex, potentially limiting clinical uptake without further validation and simplification. To address these limitations, future work should focus on validating the approach in diverse, independent and longitudinal cohorts with incident CVD outcomes to better assess temporal and causal relationships. Additional work should also explore methods to streamline biomarker panels for clinical feasibility, and evaluate ensemble clustering techniques that combine multiple algorithms for greater stability. Integration with broader omics data could also enhance the biological interpretability of clusters, improve predictive power, and potentially reduce reliance on large single-modality biomarker sets.

Conclusion

In this analysis of a combined, international cohort of people with and without HIV, we identified an RFA framework that can enhance biomarker-derived clustering to predict cardiovascular outcomes. The most robust model, which added biomarkers individually in an unsupervised, data-driven manner, outperformed other models in identifying distinct, outcome-associated clusters. Its adaptive and unbiased design makes it broadly applicable across clinical and biomarker discovery settings, offering an analytical framework for improving host stratification and informing precision medicine for prediction of common comorbidities.

Supporting information

S1 Table. Univariate associations with the composite vascular phenotype for the baseline clustering model.

https://doi.org/10.1371/journal.pcbi.1014209.s001

(DOCX)

S2 Table. Multivariable associations with the composite vascular phenotype for the baseline clustering model.

https://doi.org/10.1371/journal.pcbi.1014209.s002

(DOCX)

S3 Table. Bootstrap cluster stability across the three modelling strategies.

https://doi.org/10.1371/journal.pcbi.1014209.s003

(DOCX)

S4 Table. Univariate associations with the composite vascular phenotype for recursive feature addition Model 1.

https://doi.org/10.1371/journal.pcbi.1014209.s004

(DOCX)

S5 Table. Multivariable associations with the composite vascular phenotype for recursive feature addition Model 1.

https://doi.org/10.1371/journal.pcbi.1014209.s005

(DOCX)

S6 Table. Bootstrap sensitivity analysis of cluster–outcome associations across recursive feature addition models.

https://doi.org/10.1371/journal.pcbi.1014209.s006

(DOCX)

S7 Table. Univariate associations with the composite vascular phenotype for recursive feature addition Model 2.

https://doi.org/10.1371/journal.pcbi.1014209.s007

(DOCX)

S8 Table. Multivariable associations with the composite vascular phenotype for recursive feature addition Model 2.

https://doi.org/10.1371/journal.pcbi.1014209.s008

(DOCX)

S9 Table. Univariate associations with the composite vascular phenotype for recursive feature addition Model 3.

https://doi.org/10.1371/journal.pcbi.1014209.s009

(DOCX)

S10 Table. Multivariable associations with the composite vascular phenotype for recursive feature addition Model 3.

https://doi.org/10.1371/journal.pcbi.1014209.s010

(DOCX)

S11 Table. Assessment of cohort effects on clustering across recursive feature addition models.

https://doi.org/10.1371/journal.pcbi.1014209.s011

(DOCX)

S12 Table. Within-cohort adjusted associations between cluster membership and cardiovascular outcomes.

https://doi.org/10.1371/journal.pcbi.1014209.s012

(DOCX)

S1 Data. Model 1 data archive.

Compressed folder containing input biomarker matrices, random forest outputs, cluster assignments, bootstrap resampling results, and summary files used to generate Model 1 results.

https://doi.org/10.1371/journal.pcbi.1014209.s013

(ZIP)

S2 Data. Model 2 data archive.

Compressed folder containing input data, feature selection outputs, cluster solutions, bootstrap stability metrics, and regression results for recursive feature addition Model 2.

https://doi.org/10.1371/journal.pcbi.1014209.s014

(ZIP)

S3 Data. Model 3 data archive.

Compressed folder containing greedy forward–backward feature selection outputs, cluster assignments, bootstrap analyses, and supporting files for recursive feature addition Model 3.

https://doi.org/10.1371/journal.pcbi.1014209.s015

(ZIP)

S4 Data. R code for recursive feature addition models.

Annotated R scripts implementing data preprocessing, imputation, principal component analysis, recursive feature addition strategies, clustering, bootstrap stability analyses, and regression modelling used across all three models.

https://doi.org/10.1371/journal.pcbi.1014209.s016

(ZIP)

S1 Text. Membership list for the UPBEAT-CAD, AIID and COBRA cohort working groups.

https://doi.org/10.1371/journal.pcbi.1014209.s017

(DOCX)

Acknowledgments

The authors wish to thank all study participants and their families for their participation and support in the conduct of the AIID Cohort Study, the UPBEAT-CAD study and the COBRA Cohort study.

References

1. Trickey A, Sabin CA, Burkholder G, Crane H, d’Arminio Monforte A, Egger M, et al. Life expectancy after 2015 of adults with HIV on long-term antiretroviral therapy in Europe and North America: a collaborative analysis of cohort studies. Lancet HIV. 2023;10(5):e295–307. pmid:36958365
2. Trickey A, McGinnis K, Gill MJ, Abgrall S, Berenguer J, Wyen C, et al. Longitudinal trends in causes of death among adults with HIV on antiretroviral therapy in Europe and North America from 1996 to 2020: a collaboration of cohort studies. Lancet HIV. 2024;11(3):e176–85. pmid:38280393
3. McGettrick P, Mallon PWG, Sabin CA. Cardiovascular disease in HIV patients: recent advances in predicting and managing risk. Expert Rev Anti Infect Ther. 2020;18(7):677–88. pmid:32306781
4. Vos AG, Idris NS, Barth RE, Klipstein-Grobusch K, Grobbee DE. Pro-Inflammatory Markers in Relation to Cardiovascular Disease in HIV Infection. A Systematic Review. PLoS One. 2016;11(1):e0147484. pmid:26808540
5. Sukumaran L, Kunisaki KM, Bakewell N, Winston A, Mallon PWG, Doyle N, et al. Association between inflammatory biomarker profiles and cardiovascular risk in individuals with and without HIV. AIDS. 2023;37(4):595–603. pmid:36541572
6. Nordell AD, McKenna M, Borges ÁH, Duprez D, Neuhaus J, Neaton JD. Severity of cardiovascular disease outcomes among patients with HIV is related to markers of inflammation and coagulation. J Am Heart Assoc. 2014;3(3):e000844.
7. Subramanya V, McKay HS, Brusca RM, Palella FJ, Kingsley LA, Witt MD, et al. Inflammatory biomarkers and subclinical carotid atherosclerosis in HIV-infected and HIV-uninfected men in the Multicenter AIDS Cohort Study. PLoS One. 2019;14(4):e0214735. pmid:30946765
8. Mooney S, Tracy R, Osler T, Grace C. Elevated Biomarkers of Inflammation and Coagulation in Patients with HIV Are Associated with Higher Framingham and VACS Risk Index Scores. PLoS One. 2015;10(12):e0144312. pmid:26641655
9. Vanbellinghen MC, Boyd A, Kootstra NA, Schim van der Loeff MF, van der Valk M, Reiss P, et al. A Biomarker Profile Reflective of Preserved Thymic Function Is Associated With Reduced Comorbidities in Aging People With HIV: An AGEhIV Cohort Analysis. J Infect Dis. 2025;231(3):622–32. pmid:39658325
10. McGettrick P, Tinago W, O’Brien J, Miles S, Lawler L, Garcia-Leon A. Distinct inflammatory phenotypes are associated with subclinical and clinical cardiovascular disease in people living with HIV. J Infect Dis. 2024.
11. De Francesco D, Wit FW, Cole JH, Kootstra NA, Winston A, Sabin CA, et al. The “COmorBidity in Relation to AIDS” (COBRA) cohort: Design, methods and participant characteristics. PLoS One. 2018;13(3):e0191791. pmid:29596425
12. De Francesco D, Verboeket SO, Underwood J, Bagkeris E, Wit FW, Mallon PWG, et al. Patterns of Co-occurring Comorbidities in People Living With HIV. Open Forum Infect Dis. 2018;5(11):ofy272. pmid:30465014
13. Jaeger B. PooledCohort: Predicted Risk for CVD using Pooled Cohort Equations, PREVENT Equations, and Other Contemporary CVD Risk Calculators [R package]. CRAN; 2025. https://cran.r-project.org/web/packages/PooledCohort/index.html
14. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011;45(3).
15. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.
16. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25(1).
17. Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32.
18. Matsuki K, Kuperman V, Van Dyke JA. The Random Forests statistical technique: An examination of its value for the study of reading. Sci Stud Read. 2016;20(1):20–33. pmid:26770056
19. Tougui I, Jilbab A, Mhamdi JE. Impact of the Choice of Cross-Validation Techniques on the Results of Machine Learning-Based Diagnostic Applications. Healthc Inform Res. 2021;27(3):189–99. pmid:34384201
20. Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9:307. pmid:18620558
21. Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26(10):1340–7. pmid:20385727
22. Vieira SM, Kaymak U, Sousa JMC. Cohen’s kappa coefficient as a performance measure for feature selection. In: International Conference on Fuzzy Systems, 2010. 1–8.
23. Canbay A, Celebi OO, Celebi S, Aydogdu S, Diker E. Procalcitonin: a marker of heart failure. Acta Cardiol. 2015;70(4):473–8.
24. Sinning CR, Sinning J-M, Schulz A, Schnabel RB, Lubos E, Wild PS, et al. Association of serum procalcitonin with cardiovascular prognosis in coronary artery disease. Circ J. 2011;75(5):1184–91. pmid:21378450
25. Ridker PM, Rifai N, Stampfer MJ, Hennekens CH. Plasma concentration of interleukin-6 and the risk of future myocardial infarction among apparently healthy men. Circulation. 2000;101(15):1767–72. pmid:10769275
26. Wang TJ, Wollert KC, Larson MG, Coglianese E, McCabe EL, Cheng S, et al. Prognostic utility of novel biomarkers of cardiovascular stress: the Framingham Heart Study. Circulation. 2012;126(13):1596–604. pmid:22907935
27. Sattar N, Wannamethee G, Sarwar N, Chernova J, Lawlor DA, Kelly A. Leptin and coronary heart disease. J Am Coll Cardiol. 2009;53(2):167–75.

