Correction
17 Mar 2026: The PLOS One Staff (2026) Correction: Does C-reactive protein exhibit high prognostic information value in acute pulmonary embolism? A novel structural pathway for disease progression beyond classical statistical associations. PLOS ONE 21(3): e0345304. https://doi.org/10.1371/journal.pone.0345304
Abstract
Acute pulmonary embolism (APE) is a life-threatening condition requiring precise risk stratification. Although numerous prognostic factors have been proposed, redundancy and limited predictive utility often obscure clinical interpretation. We aimed to analyze a predefined set of clinical and laboratory variables in patients with APE using both classical statistical models and a novel taxonomic structural analysis, in order to identify factors associated with early mortality beyond conventional outcome-based associations. We retrospectively analyzed 366 patients diagnosed with APE between 2009 and 2018, of whom 76 died within one year of the acute event. A total of 20 clinical and laboratory variables, including both established prognostic markers and features with no presumed direct impact on mortality, were assessed using Cox and logistic regression models with the concordance index (C-index) and Akaike’s Information Criterion (AIC). A structural analysis based on Marczewski–Steinhaus (M–S) taxonomic distances was applied to all 1,140 unique triads of risk factors to identify clusters of high patient variability. Segmented regression was then used to determine the transition between homogeneous and heterogeneous predictor spaces. Classical regression identified age as the strongest mortality predictor in APE. In contrast, the taxonomic outcome-agnostic approach revealed CRP as the most prominent structural signal, followed by other key biomarkers such as D-dimer, high-sensitivity troponin T (hsTnT), and activated partial thromboplastin time (aPTT). Age, along with certain hematological parameters (e.g., hemoglobin) and major electrolytes (Na⁺, K⁺, Cl⁻), appeared taxonomically insensitive to acute disease-related changes, reflecting more stable background characteristics. Several other variables, including renal biomarkers (urea, creatinine, and GFR), showed no significant role in APE, with their levels varying randomly between patients.
Within this framework, CRP exhibits the highest structural variability among the analyzed factors, suggesting prognostic relevance beyond classical outcome-based associations (such as age). The proposed taxonomic approach complements traditional methods by reducing redundancy, enhancing interpretability, and improving the identification of truly relevant prognostic factors.
Citation: Tukiendorf A, Feusette P (2026) Does C-reactive protein exhibit high prognostic information value in acute pulmonary embolism? A novel structural pathway for disease progression beyond classical statistical associations. PLoS One 21(2): e0343108. https://doi.org/10.1371/journal.pone.0343108
Editor: Yoshiaki Taniyama, Osaka University Graduate School of Medicine, JAPAN
Received: November 5, 2025; Accepted: February 2, 2026; Published: February 18, 2026
Copyright: © 2026 Tukiendorf, Feusette. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Acute pulmonary embolism (APE) is a serious and potentially life-threatening condition. Its clinical significance stems from its high mortality rate, the need for accurate risk stratification, diverse clinical presentation, diagnostic complexity, treatment challenges, and implications for long-term management [1]. Identifying key prognostic variables is therefore crucial. These include hemodynamic instability, right ventricular dysfunction, cardiac biomarkers (e.g., troponins, natriuretic peptides), demographic factors (e.g., age, comorbidities), inflammatory markers, imaging findings, laboratory values (e.g., white blood cell count, fibrinogen-to-albumin ratio), and composite clinical scores such as the Pulmonary Embolism Severity Index (PESI) [2–5].
The high dimensionality of clinical datasets necessitates robust variable selection techniques to reduce redundancy and improve predictive performance [6]. Redundant predictors—particularly in cardiovascular datasets—may reflect internal clustering and compromise model interpretability [7]. Studies have shown that predictive accuracy can often be maintained even when the number of predictors is drastically reduced (e.g., from 433 to 38 [8]); however, this does not necessarily imply that the retained variables are the most clinically meaningful. This highlights the need for approaches that assess the structural relevance of variables beyond their predictive contribution [8].
In this study, we introduce a taxonomic, outcome-agnostic framework for evaluating prognostic variables. This approach assesses structural interrelationships between variables, regardless of their direct associations with clinical endpoints. By applying a matrix of Marczewski–Steinhaus distances and segmented regression, we aim to identify variables with high informational impact that may be overlooked by classical statistical models [9,10].
Materials and methods
Ethics approval
The study was approved by the Bioethics Committee at the Opole Chamber of Physicians (Resolution No. 268, dated May 17, 2018). The Committee granted permission to use anonymized clinical data of patients diagnosed with acute pulmonary embolism for scientific research purposes, without the requirement to obtain individual informed consent, provided that patient confidentiality is maintained. The study was conducted in accordance with the principles of the Declaration of Helsinki.
Study population and data sources
This study included all patients diagnosed with acute pulmonary embolism (APE) who were treated at the Department of Cardiology, University Clinical Hospital in Opole, between December 2009 and December 2018. The dataset comprised medical records of 366 consecutive individuals with confirmed APE (150 [41%] male, 216 [59%] female; mean age, 65.0 ± 16.6 years; median, 68; range, 19–94). A broad range of clinical data was collected, including demographic information, comorbidities, imaging results (echocardiography and ultrasound), and laboratory findings—yielding nearly one hundred parameters used for clinical characterization.
APE was diagnosed based on the following criteria:
- Chest computed tomography pulmonary angiography (CTPA) showing a filling defect in the pulmonary arteries, accompanied by clinical symptoms such as dyspnea and chest pain (325 patients, 89%);
- Characteristic clinical presentation, elevated D-dimer levels, and visualization of a thrombus in the right heart chambers (20 patients, 6%);
- Positive Doppler ultrasound of the lower limb veins in conjunction with symptoms and elevated D-dimer levels (15 patients, 4%);
- Abnormal pulmonary scintigraphy findings, supported by clinical symptoms and elevated D-dimer levels (4 patients, 1%).
Patients without objective imaging confirmation or consistent clinical and laboratory features were excluded from the study.
Mortality data were retrieved from the regional database of the National Health Fund (Opole branch), with 76 deaths observed within one year after the APE event.
The remaining clinical data were retrieved from the medical records of the University Clinical Hospital in Opole, Poland. All data were accessed for research purposes between 3 July 2023 and 29 December 2023.
We evaluated 20 potential prognostic variables: age, body mass index (BMI), hemoglobin (Hb), white blood cell count (WBC), red blood cell count (RBC), platelet count (PLT), urea (UREA), creatinine (CREA), glomerular filtration rate (GFR), glucose (GLU), sodium (Na⁺), potassium (K⁺), chloride (Cl⁻), CK-MB mass, high-sensitivity troponin T (hsTnT), D-dimer, adjusted D-dimer (aD-dimer), international normalized ratio (INR), activated partial thromboplastin time (aPTT), and C-reactive protein (CRP). In addition to clinically established prognostic markers, variables with no presumed direct impact on APE-related mortality (e.g., sodium, chloride) were deliberately included to enhance the overall variability of the dataset.
Table 1 summarizes the descriptive statistics for all evaluated variables, including measures of central tendency (mean and median), dispersion (standard deviation [SD] and interquartile range [IQR]), coefficient of variation (CV), and the results of the Kolmogorov–Smirnov test for normality (where a p-value > 0.05 indicates no significant deviation from a normal distribution).
Table 1 reveals substantial variability across the examined variables, with adjusted D-dimer (CV = 2.57), hsTnT (CV = 1.88), and CRP (CV = 1.23) exhibiting the highest relative dispersion. In contrast, sodium (Na⁺) showed the lowest variability (CV = 0.03). According to the Kolmogorov–Smirnov test, most variables significantly deviated from a normal distribution (p < 0.05), with only a few—such as hemoglobin and red blood cell count—demonstrating approximately normal characteristics.
Overall survival probability during the 1-year follow-up period was estimated using the Kaplan–Meier method (Fig 1).
The shaded area indicates the 95% confidence interval. The risk table below the graph shows the number of patients remaining under observation at selected time points.
The survival curve presented in Fig 1 demonstrates a gradual decline in survival probability, with the most pronounced decrease occurring within the first few days following admission. Thereafter, the curve levels off, indicating a slower rate of mortality in the subsequent months. Notably, the median survival time was not reached during the one-year follow-up period.
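As an illustration of how a Kaplan–Meier curve of this kind is built, the following minimal Python sketch computes the product-limit estimate at each observed death time. The function name `kaplan_meier` and the toy follow-up data are hypothetical and are not taken from the study cohort; the authors' own analysis was carried out in R.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)   # each patient leaves the risk set at their own time
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # product-limit step: multiply by the share surviving time t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]
    return curve

# Hypothetical toy data (days, event flag); NOT the study data.
toy_times = [2, 3, 3, 10, 40, 200, 365, 365]
toy_events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"day {t:>3}: S(t) = {s:.3f}")
```

In this toy series the estimate never falls to 0.5, which is the situation described in the text as the median survival time not being reached.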
Model discrimination and parsimony were assessed using the concordance index (C-index), Akaike’s Information Criterion (AIC), and p-values. In Cox regression, the C-index quantifies the model’s ability to correctly rank survival times, ranging from 0.5 (no better than chance) to 1.0 (perfect discrimination). The AIC facilitates model comparison by balancing goodness-of-fit with model complexity and is calculated as AIC = 2k – 2ln(L), where k is the number of model parameters and L is the maximized likelihood function.
Lower AIC values indicate better-fitting models with fewer parameters. In both Cox and logistic regression analyses, the statistical significance of a risk factor’s association with the clinical outcome was assessed using the p-value, with p < 0.05 considered statistically significant and higher values indicating insufficient evidence of association.
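The two criteria can be sketched in a few lines of Python. This is a simplified illustration, not the authors' R workflow: `aic` applies the formula AIC = 2k − 2ln(L) to a log-likelihood supplied by the caller, and `c_index` is a basic Harrell-type concordance that skips pairs tied in time. All input values below are hypothetical.

```python
from itertools import combinations

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L); ln(L) is passed in directly."""
    return 2 * k - 2 * log_likelihood

def c_index(times, events, risk_scores):
    """Share of comparable patient pairs whose risk ranking matches survival.

    A pair is comparable when the patient with the shorter follow-up
    actually died (event == 1). Pairs tied in time are skipped here,
    which is a simplification.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so that `a` has the shorter follow-up
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0      # higher risk died earlier: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5      # tied risk scores count as half
    return concordant / comparable

# Hypothetical inputs for illustration only.
example_aic = aic(k=2, log_likelihood=-173.0)
example_c = c_index(times=[5, 10, 20, 30],
                    events=[1, 1, 0, 1],
                    risk_scores=[0.9, 0.4, 0.5, 0.1])
print(example_aic, example_c)
```

A C-index of 0.5 from this function corresponds to ranking no better than chance, in line with the range given above.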
In our proposed structural analysis of predictors, we considered all $\binom{20}{3} = 1{,}140$ unique triads (combinations without repetition, counted by the binomial coefficient) of the 20 clinical variables. For each triad, square matrices of pairwise taxonomic distances between patients (symmetric dissimilarities) were computed using the Marczewski–Steinhaus (M–S) metric [10]. In this approach, the symmetric taxonomic distance (D) between two subjects (A and B) is defined as:

$$D(A, B) = \frac{|A - B|}{\max(A, B)}$$

where the numerator represents the absolute difference between A and B, and the denominator is their maximum value [10].
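A minimal Python sketch of this distance follows. For a single variable it computes |A − B| / max(A, B), as described above. How the three variables of a triad are combined into one patient-to-patient distance is not spelled out in the text, so the summed form used here (sum of absolute differences over sum of element-wise maxima) is an assumption, as are the function names and the toy profiles. Values are assumed to be non-negative.

```python
def ms_distance(a, b):
    """M-S-type distance between two non-negative profiles.

    Assumed aggregation: sum of absolute differences divided by the
    sum of element-wise maxima. For one variable this reduces to
    |a - b| / max(a, b).
    """
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0   # two all-zero profiles are identical

def distance_matrix(patients):
    """Square, symmetric matrix of pairwise distances between patients."""
    return [[ms_distance(p, q) for q in patients] for p in patients]

def mean_distance(matrix):
    """Average of the off-diagonal (upper-triangle) distances."""
    n = len(matrix)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs)

# Hypothetical triad values for three patients (already on a common scale).
toy_patients = [[1, 2, 3], [2, 2, 1], [4, 1, 3]]
toy_matrix = distance_matrix(toy_patients)
toy_mean = mean_distance(toy_matrix)
print(round(toy_mean, 3))
```

Because each distance is a ratio, it stays between 0 and 1 for non-negative inputs, which is what allows average distances to be compared across triads.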
This metric has been successfully applied in several prior studies (e.g., [11]). Notably, one such study demonstrated that the seemingly simple M–S metric produced classification results in striking agreement with those generated by the more complex Expectation–Maximization (E–M) algorithm [12]—which, incidentally, was reported in 2014 to be the second most cited statistical paper globally, following Sir D. R. Cox’s seminal 1972 article on regression models [13,14].
The foundational concept of the M–S metric was later highlighted by Stanisław Marcin Ulam—Steinhaus’s student and collaborator, participant in the renowned Scottish Café gatherings, and a key contributor to the Manhattan Project—who noted its usefulness in practical and biological applications [15].
The Marczewski–Steinhaus (M–S) distance was selected because it measures relative rather than absolute differences between objects. Unlike Euclidean distance, which aggregates squared deviations across all dimensions, M–S focuses on the proportion of difference to the maximum observed value within each pairwise comparison. This makes it inherently scale-invariant after standardization and more appropriate when the interest lies in structural dissimilarity patterns rather than magnitude alone. Moreover, M–S satisfies the triangle inequality while preserving metric properties even in cases where variables have sparse or zero-inflated distributions—conditions under which Euclidean distance can artificially inflate separation. The original formulation by Marczewski and Steinhaus also emphasized its adaptability for set-theoretic interpretations, allowing the same measure to be meaningfully applied to both numerical and categorical data, which broadens its applicability in heterogeneous clinical datasets.
Then, we generated a scree plot of triads ranked by average distance and applied segmented regression to detect a breakpoint distinguishing low- from high-variability segments (the division into two segments was chosen to avoid complexity and allow clearer interpretation of the statistical findings).
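The idea of a two-segment fit on a ranked series can be sketched as a brute-force search. Note that this is not the estimation procedure of the R `segmented` package used in the study: the sketch below simply fits a separate least-squares line on each side of every candidate split and keeps the split with the smallest total squared error. The function names and the synthetic scree values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def two_segment_fit(ys, min_size=3):
    """Return (last rank of segment 1, slope1, slope2) for a ranked series."""
    xs = list(range(1, len(ys) + 1))          # ranks 1..n
    best = None
    for cut in range(min_size, len(ys) - min_size + 1):
        left = fit_line(xs[:cut], ys[:cut])
        right = fit_line(xs[cut:], ys[cut:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, cut, left[0], right[0])
    _, cut, slope1, slope2 = best
    return cut, slope1, slope2

# Synthetic scree: 30 slowly rising values, then 10 steeply rising ones.
scree = [0.30 + 0.001 * i for i in range(30)]
scree += [0.33 + 0.006 * (i + 1) for i in range(10)]
cut, slope1, slope2 = two_segment_fit(scree)
print(cut, round(slope2 / slope1, 2))
```

The ratio `slope2 / slope1` plays the same role as the ratio of segment slopes used later to compare the two parts of the scree plot.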
In each defined segment, the frequency of occurrence for each clinical variable was calculated across all $\binom{19}{2} = 171$ possible triads in which it was present, complemented by the remaining 1,140 − 171 = 969 triads in which it was absent.
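These counts can be checked by direct enumeration. The short Python sketch below lists all triads of the 20 variables and counts those containing one given variable; the plain-text variable labels are shorthand chosen for this example.

```python
from itertools import combinations
from math import comb

# Shorthand labels for the 20 variables evaluated in the study.
variables = ["age", "BMI", "Hb", "WBC", "RBC", "PLT", "UREA", "CREA",
             "GFR", "GLU", "Na", "K", "Cl", "CK-MB", "hsTnT",
             "D-dimer", "aD-dimer", "INR", "aPTT", "CRP"]

triads = list(combinations(variables, 3))        # unordered, no repetition
with_crp = [t for t in triads if "CRP" in t]     # triads containing CRP
without_crp = len(triads) - len(with_crp)

print(len(triads), len(with_crp), without_crp)
# The closed-form counts agree with the enumeration:
print(comb(20, 3), comb(19, 2))
```

The same presence and absence counts hold for every variable, since each one is paired with two of the remaining 19.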
The frequency of each variable’s appearance across segments was calculated, and chi-square tests identified significant overrepresentation.
All statistical computations were performed using R software [16], with the aid of the ‘survminer’, ‘cluster’, and ‘segmented’ packages for survival analysis, taxonomic clustering, and regression segmentation, respectively [17–19].
Methodological note
In summary, the taxonomic approach applied in this study differs fundamentally from conventional regression-based methods. While classical models (e.g., Cox or logistic regression) rely on outcome-dependent associations and require explicit model assumptions, our method is unsupervised and outcome-agnostic. By calculating Marczewski–Steinhaus (M–S) distances within triads of clinical variables, it quantifies structural divergence between patients in a purely geometric sense. This allows for the detection of latent heterogeneity that may be masked by additive or linear assumptions. Unlike subject-based analyses, this approach operates on combinations of variables, evaluating their behavior across all possible triads. As a result, it enables the identification of variables that make the largest contribution to structural variability within the dataset.
Results
The statistical analyses yielded results from both classical regression models and the taxonomic approach, enabling a comparison between outcome-dependent and outcome-agnostic perspectives. Table 2 presents the findings from classical regression analyses, including coefficients of variation, p-values from Cox and logistic regressions, concordance indices (C-index), and Akaike’s Information Criterion (AIC) values from both univariate and multivariate models.
Table 2 shows that Cox regression identified 15 variables as statistically significant (p < 0.05), while logistic regression identified 14. Among them, age emerged as the most predictive factor (C-index = 66.6%, AIC = 350.71), whereas platelet count demonstrated the weakest discriminatory performance (C-index = 49.5%, AIC = 377.70). In multivariate Cox regression, age, hemoglobin, and INR remained significant; in multivariate logistic regression, only age retained statistical significance.
The taxonomic analysis identified the triad Na⁺/Cl⁻/INR as having the lowest average structural distance (0.259), while the triad hsTnT/adjusted D-dimer/CRP showed the highest average distance (0.709), as illustrated in Fig 2, Panels A and B, respectively.
Fig 2. Panel A. The most structurally cohesive triad: Na⁺/Cl⁻/INR (mean taxonomic distance = 0.259). Panel B. The most structurally divergent triad: hsTnT/adjusted D-dimer/CRP (mean taxonomic distance = 0.709).
In the dendrograms shown in Fig 2, individual patients (labeled as “P” followed by identification numbers) are hierarchically clustered based on nearest-neighbor relationships derived from Marczewski–Steinhaus distances. Each leaf represents a single patient, and patients that are more similar in the three variables forming a given triad are joined together at lower levels of the tree, while more dissimilar patients merge at higher levels. Thus, the vertical height at which branches join reflects the magnitude of taxonomic distance between patients: short branches indicate structural similarity, whereas long branches indicate pronounced divergence in clinical profiles. The red dashed horizontal line marks the mean taxonomic distance for a given triad and serves as a reference level separating relatively homogeneous patient groupings from more dispersed ones. Triads whose dendrograms extend far above this line therefore represent variable combinations that generate high inter-patient variability, suggesting that these variables are strongly involved in differentiating patients in a disease-relevant manner. Conversely, triads with compact dendrograms below this line indicate more uniform, stabilizing variables that contribute little to structural heterogeneity.
This visualization allows the reader to directly see how different combinations of clinical variables either compress patients into similar profiles or spread them apart into highly heterogeneous patterns—an effect that is not accessible from regression coefficients alone.
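To make the reading of branch heights concrete, the sketch below performs a naive nearest-neighbor agglomeration on a small distance matrix and returns the height of each merge. Interpreting "nearest-neighbor" as single-linkage clustering is an assumption, the function name is hypothetical, and the 4 × 4 matrix is invented for illustration.

```python
def single_linkage_heights(dist):
    """Merge heights of nearest-neighbour (single-linkage) clustering.

    dist : square symmetric matrix of pairwise distances
    Returns the distance at which each successive merge happens.
    """
    clusters = [{i} for i in range(len(dist))]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # distance between clusters = closest pair of members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        heights.append(d)
        clusters[a] |= clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return heights

# Invented distances between four patients (P1..P4).
toy_dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
]
heights = single_linkage_heights(toy_dist)
print(heights)
```

Low merge heights correspond to the short branches joining similar patients, and the final, highest merge corresponds to the long branches that separate dissimilar groups.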
The scree plot of average taxonomic distances across all 1,140 triads formed from 20 clinical risk factors is shown in Fig 3, Panel A. Panel B displays the corresponding segmented regression model used to identify a structural change-point in the distribution.
Fig 3. Panel A. Distribution of average taxonomic distances across all triads. Panel B. Segmented regression identifying structural change-point.
As detailed in Table 3 and illustrated in Fig 3, Panel B, a distinct change-point was identified between triads 963 and 964, corresponding to an average taxonomic distance of 0.478.
As shown in Table 3, triads ranked 1 through 963 define the first segment, characterized by modest increases in average taxonomic distance (‘slope1’), whereas the remaining 177 triads (ranks 964–1,140) constitute the second segment, which exhibits a markedly steeper increase in structural dissimilarity (‘slope2’). Both slopes were statistically significant (p < 0.05), and the ratio of the mean increases between the second and first segments was 0.00099/0.00017 = 5.82, indicating that the increase in structural divergence in the second segment was nearly six times greater (see Table 3).
The distribution of individual risk factors across the two segments—based on their presence or absence within the 1,140 evaluated triads—is presented in Table 4. Chi-square testing was used to assess the statistical significance of overrepresentation.
Table 4 presents the distribution of individual risk factors between the two structural segments identified in the taxonomic analysis. Sixteen of the 20 analyzed variables showed statistically significant differences (p < 0.05) in their occurrence between segments. The most prominent overrepresentation in the high-variability segment was observed for C-reactive protein (CRP), which was present in 61 of the 177 triads in segment 2 (χ² = 62.25, p < 0.0001). Other factors strongly associated with the high-variability segment included D-dimer, adjusted D-dimer, and hsTnT, whereas age, hemoglobin, and the electrolytes (Na⁺, K⁺, Cl⁻) differed significantly in the opposite direction, predominating in the low-variability segment.
In contrast, white blood cell count (WBC) together with three renal function markers—urea (UREA), creatinine (CREA), and glomerular filtration rate (GFR)—showed no statistically significant difference in distribution between segments (p > 0.05).
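The CRP statistic can be reproduced from the counts given in the text. The sketch below assumes a Pearson chi-square test on a 2 × 2 table without continuity correction, and derives the segment 1 count from the 171 triads that contain each variable; the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

seg1_total, seg2_total = 963, 177    # triads in each segment
crp_total = 171                      # triads containing CRP
crp_seg2 = 61                        # CRP triads in segment 2 (reported)
crp_seg1 = crp_total - crp_seg2      # derived: CRP triads in segment 1

chi2 = chi_square_2x2(
    crp_seg1, seg1_total - crp_seg1,   # segment 1: CRP present, absent
    crp_seg2, seg2_total - crp_seg2,   # segment 2: CRP present, absent
)
print(round(chi2, 2))   # about 62.25, matching the reported value
```

Under independence about 26.6 of the 177 segment 2 triads would be expected to contain CRP (177 × 171 / 1,140), compared with the 61 observed.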
Overall, the examined variables exhibited three distinct statistical patterns (Table 4):
- Lack of significant differentiation – parameters whose frequency differences between segments were statistically indistinguishable from random variation in the studied population (WBC, UREA, CREA, GFR).
- Significant predominance in the low-variability segment (segment 1) – factors such as age, BMI, Hb, RBC, PLT, and the electrolyte panel (Na⁺, K⁺, Cl⁻), whose overrepresentation may reflect a baseline structural configuration observed in the majority of patient triads.
- Significant overrepresentation in the high-variability segment (segment 2) – factors including GLU, CK-MB mass, hsTnT, D-dimer, aD-dimer, INR, aPTT, and CRP, whose clustering within this segment reflects statistically significant deviations from the baseline structure, consistent with a state of increased structural divergence between triads.
Discussion
In this study, we first applied conventional outcome-based models, including Cox proportional hazard and logistic regression, to identify prognostic factors associated with mortality in acute pulmonary embolism. These analyses confirmed that age was the strongest statistical predictor of death, while several additional biomarkers reached nominal statistical significance. However, this pattern raised an important interpretative problem: age is a non-modifiable background characteristic that reflects accumulated vulnerability rather than acute pathophysiological activity, and the large number of statistically significant predictors suggested substantial redundancy among the modeled variables. In this context, regression alone provided limited insight into which biomarkers were actively involved in the disease process itself, as opposed to merely correlating with outcome.
To address this limitation, we introduced a complementary outcome-agnostic taxonomic framework based on Marczewski–Steinhaus distances. Its purpose was not to infer causality, but to examine how individual variables contribute to the geometric structure of inter-patient variability independently of clinical endpoints. Within this framework, C-reactive protein (CRP) emerged as the most active biomarker in shaping multidimensional patient dispersion, in contrast to age, which primarily promoted structural homogeneity rather than disease-driven divergence. Thus, the taxonomic approach serves as a hypothesis-generating filter, highlighting biomarkers that may carry pathophysiological relevance beyond what is captured by conventional regression models. The further findings from this approach are set out below.
Although this interpretation is quantitatively justified, it should be regarded as hypothesis-generating rather than confirmatory. The triadic taxonomic structure of the 20 patient characteristics (Fig 2) may be divided into two segments based on a statistically defined change-point in structural variability and corresponding frequency counts (Fig 3, Table 4). Intuitively, the first segment reflects a more homogeneous configuration consistent with a clinically stable or health-related state, whereas the second segment captures increased structural heterogeneity that may be associated with disease-related processes. Within this two-segment framework, three distinct profiles of risk factor behavior can be identified:
- Factors without practical relevance for the clinical assessment of disease, whose signaling of health or illness is neutral and random (p > 0.05). In this study, these include WBC and the renal markers UREA, CREA, and GFR.
- Factors insensitive to pathological processes, reflecting stable and enduring patient characteristics more typical of a healthy state than of a dynamic response to disease (p < 0.05). Among those analyzed, these include age, BMI, Hb, RBC, PLT, and the electrolytes Na⁺, K⁺, and Cl⁻.
- Factors showing potential susceptibility to the adverse influence of disease and sensitivity to pathological processes, manifested by significant overrepresentation in the diseased structural segment and increased taxonomic dispersion (p < 0.05). These life-threatening signals, in descending order of strength, include CRP, D-dimer, hsTnT, aPTT, CK-MB mass, GLU, aD-dimer, and INR.
Moreover, given the nearly sixfold ratio of slope coefficients between segment 2 and segment 1 (see Table 3), one may infer high biochemical activity of the “overrepresented” biomarkers and, consequently, metabolic stimulation of the organism in the inflammatory state compared to the homeostatic state.
Taking regression relationships into account, it appears reasonable to state that age is the strongest predictor of mortality in patients with APE not simply because this follows from calculated statistics (Table 2), but because it is a non-modifiable background determinant of risk rather than a marker of acute pathological processes. This conclusion arises from the geometric structure of this variable in statistical space with other patient characteristics, making patients clinically more similar to one another. Furthermore, by accumulating the burden of years—decline in physiological reserves, comorbidities, and reduced capacity to compensate for acute disturbances—age directly translates into higher mortality regardless of other clinical parameters (e.g., reduced cardiovascular performance, diminished respiratory reserve, slower immune response, impaired metabolic regulation). Thus, while age is the best regression-derived predictor of death (Table 2), it is not a direct signaler of life-threatening risk induced by APE, but rather an indirect one.
In view of the above statistical and clinical considerations, one-dimensional analysis of individual risk factors—e.g., via the coefficient of variation (see Table 1)—appears misleading. Only a combined regression and non-regression approach enables the identification of the most reliable mortality risk factor in APE, which, in light of the presented methodology, is CRP (χ² = 62.25). Next in rank are D-dimer, hsTnT, aPTT, CK-MB mass, GLU, aD-dimer, and INR (with χ² statistics of 55.24, 39.53, 31.36, 19.84, 17.86, 15.97, and 8.13, respectively; see Table 4). Thus, of the initial 15 and 14 biomarkers identified as statistically significant (p < 0.05) in univariate Cox and logistic regression, only GLU, CK-MB mass, hsTnT, D-dimer, INR, and—most strongly—CRP meet the criteria of a risk factor in both analytical approaches.
Notably, elevated CRP levels have previously been associated with higher mortality and adverse outcomes in APE, with thresholds above 48 mg/L [20] and 124 mg/L [21] linked to increased risk of death and hemodynamic instability. CRP has also been consistently associated with elevated 30-day mortality [22,23]. Our findings extend this body of evidence by showing that CRP has the strongest structural contribution to inter-patient variability in multidimensional clinical space, independent of classical regression-based associations. In the case of CRP, its distinct structural behavior may have implications for early risk stratification, patient monitoring, and therapeutic decision-making.
Beyond APE-specific reports, the prominent structural behavior of CRP observed in our taxonomic analysis is consistent with a broader body of evidence linking systemic inflammation to adverse outcomes across cardiovascular and thrombo-inflammatory conditions. In the setting of atrial fibrillation, inflammatory markers have been shown to predict recurrence following ablative procedures; for example, in a recent study comparing inflammatory markers for the prediction of atrial fibrillation recurrence following cryoablation, CRP-related indices were significantly associated with recurrent arrhythmia, underscoring the role of persistent inflammatory activity in arrhythmogenic substrate and disease progression [24]. Similarly, composite indices of systemic inflammation have been associated with non-cardiac complications and poor prognosis in acute coronary syndromes: a study on the systemic immune-inflammatory index demonstrated its predictive value for contrast-induced nephropathy in patients with non-ST-segment elevation myocardial infarction [25], indicating that inflammatory burden may translate into organ vulnerability beyond the primary cardiac event. Moreover, in acute infectious and thrombo-inflammatory contexts such as COVID-19 pneumonia, the C-reactive protein–to–albumin ratio (CAR) has been reported as a strong predictor of in-hospital mortality, reinforcing the prognostic relevance of CRP-based signatures in systemic disease [26].
Taken together, these studies place CRP within a coherent biological framework in which inflammation interacts with thrombosis, endothelial dysfunction, and tissue injury across diverse clinical contexts. From this perspective, our taxonomic findings on CRP do not merely replicate known outcome-based associations but provide an additional structural layer: they indicate that inflammatory activity linked to CRP organizes multidimensional clinical variability in ways that classical regression models do not fully capture. This supports the interpretation that CRP is not only a correlate of clinical severity, but a key informational driver of disease heterogeneity in APE.
To our knowledge, this is the first study to apply a complementary taxonomic, outcome-agnostic approach to risk factor evaluation in APE. Our findings underscore the potential of information-centric metrics to enhance classical models, minimize redundancy, and prioritize variables exhibiting unique structural properties. Importantly, this methodology does not imply causality; rather, it provides a complementary perspective on the informational value of presumed risk factors. Nevertheless, future research should further investigate the biological role of CRP in APE pathogenesis and explore the applicability of this framework in other clinical contexts.
Conclusions
This study demonstrates that conventional regression analyses may overlook the multidimensional structure of prognostic variables in acute pulmonary embolism (APE). Using a taxonomic, outcome-agnostic approach based on variable triads, we found that C-reactive protein (CRP) emerged as the strongest structural signal among all analyzed factors, displaying a pattern of variability not captured by standard statistical techniques.
Our findings suggest that CRP may carry clinically meaningful information, potentially reflecting underlying inflammatory mechanisms involved in APE progression. Its consistent overrepresentation in high-variability combinations supports the need to reconsider its role—not merely as a passive marker of inflammation, but as a possible contributor to adverse outcomes.
The taxonomic framework introduced here offers a novel and complementary tool for identifying high-value prognostic indicators. By focusing on structural inter-variable relationships rather than direct outcome associations, this method has the potential to enrich clinical insight, reduce redundancy, improve model interpretability, and uncover latent prognostic signals that might otherwise remain hidden in traditional analyses.
The results presented in this study warrant further exploration in future research, including validation in independent datasets and prospective study designs, to expand their clinical applicability.
Study limitations
This study has several limitations. First, its retrospective design relied on data collected during routine clinical care, which—although complete—was not originally intended for the specific analytical purposes of this study. Second, all participants were drawn from a single tertiary care center, potentially limiting the generalizability of the findings to other populations or clinical settings. Third, the taxonomic method employed is unsupervised and outcome-agnostic; therefore, the segment analysis does not support direct causal inference. Finally, although the geometric approach provides a novel perspective on variable interactions, it requires external validation in independent datasets and prospective cohorts to confirm its reproducibility and clinical relevance.
References
- 1. Adams AG, Awsare BK. Review for hospitalists: acute pulmonary embolism. Hosp Pract (1995). 2011;39(4):55–62. pmid:22056823
- 2. Janata K. Relevance of cardiac biomarkers for prognosis and therapy in pulmonary embolism. Intens Care Emerg Treat. 2006;31(4):237–44.
- 3. Becattini C, Agnelli G. Acute pulmonary embolism: risk stratification in the emergency department. Intern Emerg Med. 2007;2(2):119–29. pmid:17619833
- 4. Sanchez O, Trinquart L, Caille V, Couturaud F, Pacouret G, Meneveau N, et al. Prognostic factors for pulmonary embolism: the PREP study, a prospective multicenter cohort study. Am J Respir Crit Care Med. 2010;181(2):168–73. pmid:19910608
- 5. Zhan Y, Che X. A prognostic prediction model for acute pulmonary embolism. J Investig Med. 2024;72(8):930–7. pmid:39262152
- 6. de Mutsert R, Jager KJ, Zoccali C, Dekker FW. The effect of joint exposures: examining the presence of interaction. Kidney Int. 2009;75(7):677–81. pmid:19190674
- 7. Yuda E, Ueda N, Kisohara M, Hayano J. Redundancy among risk predictors derived from heart rate variability and dynamics: ALLSTAR big data analysis. Ann Noninvasive Electrocardiol. 2021;26(1):e12790. pmid:33263196
- 8. Brester C, Kauhanen J, Tuomainen T-P, Voutilainen S, Rönkkö M, Ronkainen K, et al. Evolutionary methods for variable selection in the epidemiological modeling of cardiovascular diseases. BioData Min. 2018;11:18. pmid:30127856
- 9. Weiser EB. Structural equation modeling in personality research. In: Zeigler-Hill V, Shackelford TK, editors. The Wiley Encyclopedia of Personality and Individual Differences. Vol. 1–4. Wiley; 2020.
- 10. Marczewski E, Steinhaus H. On a certain distance of sets and the corresponding distance of functions. Colloq Math. 1958;6:319–27.
- 11. Tukiendorf A, Kaźmierski R, Michalak S. The taxonomy statistic uncovers novel clinical patterns in a population of ischemic stroke patients. PLoS One. 2013;8(7):e69816. pmid:23875000
- 12. Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J R Stat Soc Series B: Stat Methodol. 1977;39(1):1–22.
- 13. Gelman A. The most-cited statistics papers ever. Statistical Modeling, Causal Inference, and Social Science [Internet]. 2014 [cited 2025 Aug 3]. Available from: https://statmodeling.stat.columbia.edu/2014/03/31/cited-statistics-papers-ever
- 14. Cox DR. Regression models and life-tables. J R Stat Soc Series B. 1972;34(2):187–220.
- 15. Ulam SM. Analogies between analogies. The mathematical reports of S M Ulam and his Los Alamos collaborators. Berkeley (CA): University of California Press; 1990. pp. 469–73.
- 16. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. 2024. Available from: https://www.R-project.org/
- 17. Kassambara A, Kosinski M, Biecek P. survminer: Drawing Survival Curves using ‘ggplot2’ [R package]. Version 0.4.9. 2021. Available from: https://CRAN.R-project.org/package=survminer
- 18. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. cluster: Cluster Analysis Basics and Extensions [R package]. Version 2.1.4. 2023. Available from: https://CRAN.R-project.org/package=cluster
- 19. Muggeo VM. segmented: Regression Models with Break-Points/Change-Points Estimation [R package]. Version 1.5-0. 2022. Available from: https://CRAN.R-project.org/package=segmented
- 20. Abul Y, Karakurt S, Ozben B, Toprak A, Celikel T. C-reactive protein in acute pulmonary embolism. J Investig Med. 2011;59(1):8–14. pmid:21218608
- 21. Eggers A-S, Hafian A, Lerchbaumer MH, Hasenfuß G, Stangl K, Pieske B, et al. Acute infections and inflammatory biomarkers in patients with acute pulmonary embolism. J Clin Med. 2023;12(10):3546. pmid:37240652
- 22. Milic R, Dzudovic B, Subotic B, Obradovic S, Soldatovic I, Petrovic M. The significance of C-reactive protein for the prediction of net-adverse clinical outcome in patients with acute pulmonary embolism. Vojnosanit Pregl. 2020;77(1):35–40.
- 23. Büyükşirin M, Anar C, Polat G, Karadeniz G. Can the level of CRP in acute pulmonary embolism determine early mortality? Turk Thorac J. 2021;22(1):4–10. pmid:33646097
- 24. Kalenderoglu K, Hayiroglu MI, Cinar T, Oz M, Bayraktar GA, Cam R, et al. Comparison of inflammatory markers for the prediction of atrial fibrillation recurrence following cryoablation. Biomark Med. 2024;18(17–18):717–25. pmid:39263796
- 25. Tezen O, Hayıroğlu Mİ, Pay L, Yumurtaş AÇ, Keskin K, Çetin T, et al. The role of systemic immune-inflammatory index in predicting contrast-induced nephropathy in non-ST-segment elevation myocardial infarction cases. Biomark Med. 2024;18(21–22):937–44. pmid:39469834
- 26. Güney BÇ, Taştan YÖ, Doğantekin B, Serindağ Z, Yeniçeri M, Çiçek V, et al. Predictive value of CAR for in-hospital mortality in patients with COVID-19 pneumonia: a retrospective cohort study. Arch Med Res. 2021;52(5):554–60. pmid:33593616