Identifying risk patterns for sudden cardiac death in athletes: A clustering and principal component analysis approach

  • Giacinto Angelo Sgarro ,

    Contributed equally to this work with: Giacinto Angelo Sgarro, Paride Vasco, Domenico Santoro, Luca Grilli, Marco Giglio, Natale Daniele Brunetti, Luigi Traetta, Giuseppe Cibelli, Anna Antonia Valenzano

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    giacintoangelo.sgarro@unifg.it (GAS); giuseppe.cibelli@unifg.it (GC)

    Affiliation Department of Social Sciences, University of Foggia, Foggia, FG, Italy

  • Paride Vasco ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Clinical and Experimental Medicine, University of Foggia, Foggia, FG, Italy

  • Domenico Santoro ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Economics, Statistics and Business, Faculty of Technological and Innovation Sciences, Universitas Mercatorum, Rome, RM, Italy

  • Luca Grilli ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Economics, University of Foggia, Foggia, FG, Italy

  • Marco Giglio ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Clinical and Experimental Medicine, University of Foggia, Foggia, FG, Italy

  • Natale Daniele Brunetti ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Medical and Surgical Sciences, University of Foggia, Foggia, FG, Italy

  • Luigi Traetta ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Humanities, Letters, Cultural Heritage and Educational Studies, University of Foggia, Foggia, FG, Italy

  • Giuseppe Cibelli ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Clinical and Experimental Medicine, University of Foggia, Foggia, FG, Italy

  • Anna Antonia Valenzano

    Roles Conceptualization, Data curation, Funding acquisition, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Clinical and Experimental Medicine, University of Foggia, Foggia, FG, Italy

Abstract

Sudden Cardiac Death (SCD) is a critical and unexpected condition that occurs due to cardiac causes within one hour of the onset of acute cardiovascular symptoms or twenty-four hours in unwitnessed cases. Despite advancements in cardiovascular medicine, practical methods for predicting SCD are still lacking, and there are no standardized systems to identify individuals at risk, especially in seemingly healthy populations such as athletes. In this study, we employed hierarchical clustering and principal component analysis (PCA) on data from 711 competitive athletes, revealing distinct patterns and cluster distributions in PCA space. Specifically, clustering revealed characteristic feature combinations associated with increased SCD risk in athletes. Notably, certain clusters shared traits, including participation in Class C sports, sinus tachycardia, ventricular pre-excitation, personal or family history of heart disease, T-wave inversions, and prolonged QTc intervals. PCA helped visualize these patterns in distinct spatial regions, highlighting underlying structures and aiding intuitive risk interpretation. These results enable scientists to derive cluster metrics that serve as reference points for classifying new individuals and visually representing risk patterns in a clear graphical format. These findings establish a foundation for predictive tools that, with additional clinical validation, could aid in the prevention of SCD. The dataset used in this study, along with the clustering and PCA results, is available to the scientific community in an open format, together with the necessary tools and scripts to enable independent experimentation and further analysis.

Introduction

Regular physical activity is widely recognized for its numerous health benefits, including significant reductions in all-cause mortality and the incidence of cardiovascular diseases, cancer, and metabolic disorders [1–5]. Despite these advantages, the risk of SCD remains a challenging reality, even among healthy individuals and athletes. Sudden cardiac death is defined as an unexpected death caused by cardiac conditions occurring within one hour (or within twenty-four hours in unwitnessed cases) after the onset of acute symptoms, without external contributing factors [6]. In Western countries, SCD accounts for an estimated 13–20% of all deaths [6,7], with its occurrence influenced by age, gender, ethnicity, and the type of physical activity practiced.

Incidence and prevalence

Age plays a crucial role in the etiology of SCD. In individuals under thirty-five, hereditary conditions such as cardiomyopathies and channelopathies are the predominant causes. Notably, hypertrophic cardiomyopathy is responsible for 36% of SCD cases in young athletes in the United States [8], whereas arrhythmogenic idiopathic ventricular cardiomyopathy accounts for 23% of cases in Italy, with hypertrophic cardiomyopathy contributing to only 2% [9]. Conversely, in adults over thirty-five, atherosclerotic coronary artery disease becomes the leading cause of SCD, comprising 73% of cases in military personnel within this age group [10]. The incidence of SCD also increases with age among athletes, rising from 0.47–1.21 per 100,000 person-years in young competitive athletes to 6.64 per 100,000 person-years in those over thirty-five [11].

Risk factors and co-morbidities

Gender disparities are another critical aspect of SCD risk. Studies consistently reveal that male athletes face a higher risk compared to females. Finocchiaro et al. (2021) [12] reported an SCD incidence rate of 2.6 per 100,000 person-years in male athletes versus 1.1 per 100,000 in females [9]. Furthermore, a UK registry of 748 SCD cases during sports activities showed that only 13% involved women, with sudden deaths during intense physical exertion occurring significantly less frequently in females than in males (58% versus 83%) [12]. Possible explanations include gender differences in physiological cardiac adaptations to exercise, chamber remodeling, myocardial fibrosis prevalence, and estrogens’ protective role in women [12–14].

Ethnicity also shapes SCD risk. Young African American athletes experience a threefold higher incidence compared to their white counterparts, at 5.6 per 100,000 annually [8,15,16]. This aligns with studies from the United States highlighting racial disparities in SCD occurrence.

The type and intensity of physical activity further modulate SCD risk. Competitive athletes, exposed to higher training loads and adrenergic surges, face more significant risks than recreational athletes, with SCD rates of 1 in 100,000 compared to 0.32 in 100,000, respectively [7]. Intense exercise may exacerbate pre-existing cardiac abnormalities through mechanisms such as dehydration, electrolyte imbalances, and acid-base disturbances, triggering fatal arrhythmias [8,17,18].

Challenges in early detection and screening

Early detection remains challenging despite the apparent associations between SCD and these risk factors. Many SCD victims present no symptoms or experience nonspecific warning signs. Post-mortem examinations often reveal structurally normal hearts in younger individuals, suggesting undetectable electrical abnormalities as the underlying cause in 41% of cases [19]. When anomalies are detected, they may include idiopathic left ventricular hypertrophy or myocardial fibrosis, though their clinical significance remains debated [20]. In Italy, mandatory pre-participation physical evaluations (PPPE), including electrocardiograms (ECG), have significantly reduced the incidence of SCD linked to hypertrophic cardiomyopathy [9,21]. However, screening presents limitations: high costs, frequent false positives, and an inability to identify all at-risk individuals [22,23].

Methodological and technological advances

Artificial intelligence (AI) has emerged as a promising tool for overcoming these limitations. By processing complex, heterogeneous data (clinical measurements, environmental observations, and experimental results), AI creates predictive models for human diseases. Machine learning algorithms have shown efficacy in forecasting hypertension and atrial fibrillation using datasets derived from ECG and electronic health records [24–26]. Advanced methodologies such as hierarchical clustering (HC) and PCA offer new avenues for SCD risk stratification. Hierarchical clustering identifies groups of patients with shared risk profiles by analyzing intrinsic data relationships, while PCA reduces data dimensionality, isolating key variables for more intuitive visualization of risk factors and patient clusters [27–30].

The present study aims to develop a novel SCD risk diagnostic system by applying an HC algorithm to a dataset collected from hospital patients. This method enables the targeted identification and stratification of risk, revealing clusters of individuals at risk. We propose a PCA-based visualization approach that translates complex patient data into graphical representations, highlighting risk profiles and interconnected factors. The clustering outcomes, visualized through PCA, depict patient risk stratification and are illustrated in Figs 4 and 5. In both images, the elements belonging to the blue cluster are considered non-risk subjects. Using these data and visualizations, it is also possible to classify and position new patients within the PCA space, as demonstrated in Figs 10 and 11. These methods can advance our understanding of SCD risk and support clinical decision-making through data-driven insights. S1 Fig presents a graphical representation of our work.

Materials and methods

Data collection and feature extraction

Data collection.

In this study, we utilized a dataset comprising 711 athletes, each represented by a row, along with 151 initial features derived from pre-participation physical evaluations (PPPEs) conducted at the Sports Medicine Unit of the Policlinics of Foggia between 2019 and 2024. We used Excel as our primary tool for data management, systematically organizing and categorizing the information. For each athlete, we collected demographic and personal data (age, date of birth, gender, ethnicity, sport discipline and category), family history of SCD and cardiovascular/metabolic diseases, and personal medical history. The medical history encompassed previous cardiac events or diagnoses, including hypertrophic cardiomyopathy and arrhythmogenic cardiomyopathy, and the use of medications, particularly those that affect heart function or electrolyte balance. We also documented any history of syncope, palpitations, or chest pain during exercise. Lifestyle and training data included the type and intensity of sports practiced, distinguishing between competitive and recreational activities, as well as the duration and frequency of training sessions.

Cardiac and functional assessments.

Cardiac assessments included ECG findings such as QRS complex abnormalities and T-wave inversions. Echocardiography provided measures of chamber size and myocardial thickness. Cardiac MRI was used to identify fibrosis or structural anomalies. Exercise stress tests evaluated exertion-induced arrhythmias. Genetic testing was conducted to screen for mutations associated with cardiomyopathies and channelopathies. Functional tests analyzed heart rate variability (HRV), blood pressure responses during and after exercise, and blood markers like troponin levels and electrolytes. Psychosocial factors were also assessed, including stress levels, coping mechanisms, sleep patterns, and overall quality of life. The physical examinations assessed body composition (BMI), resting blood pressure, and the musculoskeletal, respiratory, cardiovascular, and other systems. Additionally, mandatory tests, including electrocardiograms (ECGs) at rest and under exertion, were performed to identify potential abnormalities.

Feature selection and dataset creation.

At the outset, the dataset included a total of 151 variables, encompassing a broad range of demographic, clinical, and instrumental features. These are fully listed and described in Table 13. To enable a more interpretable and targeted analysis, we organized the variables into three hierarchical subsets. The first subset consists of 45 features, defined as the “Literature + Extra” group, which includes variables supported by both established research and emerging clinical hypotheses. The second subset, referred to as “Complete Literature,” narrows this down to 26 features that are strongly corroborated by evidence in the scientific literature regarding sudden cardiac death (SCD). Finally, from this set of 26, we performed a further selection of the 8 variables most strongly correlated with SCD, based on statistical relationships and clinical relevance, referred to as the “Literature” group. All three groups (45, 26, and 8 features) are described and numerically identified in Table 14.

For the purpose of this study, we chose to work with the datasets corresponding to the “Literature” and “Literature + Extra” groups, respectively datasetL and datasetLE.

The selection criteria used to build datasetL were informed by the meta-analysis conducted by Harmon et al. [31], which examined the effectiveness of screening history, physical examinations, and ECGs in detecting potentially fatal cardiac disorders in athletes. This study analyzed 47,137 athletes and reported 160 severe cardiovascular conditions, a prevalence of 0.3%. The meta-analysis highlighted several key disorders associated with SCD, such as Wolff-Parkinson-White syndrome (WPW), long QT syndrome (LQTS), hypertrophic cardiomyopathy (HCM), dilated cardiomyopathy, coronary artery disease (CAD), arrhythmogenic right ventricular cardiomyopathy (ARVC), myocarditis, and Brugada syndrome. Guided by these findings, we carefully selected variables that showed strong correlations with SCD risk. The features included in datasetL consisted of ECG findings like T-wave inversion, short PR intervals, Q waves, ST-segment depression, ventricular extrasystoles, and Brugada type 1 pattern. We also incorporated clinical indicators such as family history of CAD and personal history of myocardial infarction. To further account for physiological risk factors, we added metrics like body mass index (BMI) and blood pressure, which are closely linked to left ventricular hypertrophy and other cardiovascular abnormalities associated with SCD. In parallel, we constructed datasetLE, extending the analysis by combining the full set of 26 literature-based features with additional variables derived from exploratory clinical observations and emerging hypotheses. This resulted in a broader feature space comprising 45 variables.

Data analysis.

We applied clustering analyses to both datasets to identify groups of athletes with shared characteristics, and used PCA for dimensionality reduction and pattern recognition. Comparing results allowed us to evaluate the impact of additional variables on clustering and SCD risk assessment.

The complete forty-five-feature dataset is available for download at https://github.com/hyacintus/Sudden-Cardiac-Death-Survey.git

Alongside the dataset, the repository also provides the tools and scripts necessary to reproduce the research results, perform testing, and independently evaluate the system. Further in the article, links to specific subfolders within the repository are provided in the relevant sections for ease of access.

Hierarchical clustering for pattern recognition

We utilized a hierarchical clustering algorithm to categorize athletes based on their risk of SCD and to identify distinct risk profiles. This methodology was organized into four phases: dataset preparation, clustering, dendrogram analysis and cutting, and analysis of clusters and prototypes.

Dataset preparation.

We prepared two datasets of 711 athletes by selecting features associated with SCD risk as identified in existing literature. These features included eight or forty-five variables related to risk factors, physiological parameters, and resting and stress ECG features. The majority of variables are binary (Yes/No), except Body Mass Index (BMI) and Heart Rate, which are continuous (see Table 14 in the Appendix section).

To ensure consistency and allow meaningful comparisons between features during mathematical modeling and clustering, we applied min-max normalization to the continuous variables (BMI and Heart Rate). In datasets containing variables with different value ranges, features with wider or numerically larger scales can dominate the outcome of algorithms, skewing the results and reducing interpretability. Normalization addresses this issue by rescaling all features to the same range, allowing them to contribute equally to the analysis.

In our case, each individual value was normalized with respect to the global distribution of its corresponding variable across the entire cohort. Specifically, for each continuous feature, we identified the minimum and maximum values in the dataset and applied the following transformation to each athlete’s data:

\[ x_{i,\mathrm{norm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \tag{1} \]

where $x_i$ represents the original value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature, and $x_{i,\mathrm{norm}}$ is the normalized value. This transformation maps all values to the [0,1] interval, making them directly comparable with one another and mitigating the distortion caused by features with inherently larger ranges.
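As a minimal sketch of the min-max normalization in Eq. (1), the following Python function rescales one continuous feature across a cohort. The function name, the handling of a constant feature, and the BMI values are assumptions made for illustration; they are not taken from the study's scripts or dataset.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] interval, as in Eq. (1)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption of this sketch: a constant feature is mapped to 0.0,
        # since Eq. (1) is undefined when x_max equals x_min.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical BMI values for five athletes (illustrative only).
bmi = [18.5, 22.0, 24.5, 27.0, 31.0]
bmi_norm = min_max_normalize(bmi)
```

The minimum and maximum are taken over the whole cohort, so the smallest BMI maps to 0 and the largest to 1, which puts the continuous variables on the same scale as the binary (0/1) ones.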

Clustering.

In the clustering phase, we applied agglomerative HC, starting with each athlete as a separate cluster. The algorithm iteratively merges the closest clusters based on a specific distance metric and linkage criterion [32]. The metrics and linkage criteria used in this study are detailed in Tables 1 and 2.

Table 1. Metrics employed to calculate the distance between two generic elements [27].

https://doi.org/10.1371/journal.pone.0339377.t001

Table 2. Linkage criteria used to determine the distance between two generic clusters A and B.

Here, the cardinality of a cluster is its number of patients, while CA and CB represent the centroids of clusters A and B [27].

https://doi.org/10.1371/journal.pone.0339377.t002

To illustrate the algorithm’s functionality, consider an example with three clusters (A, B, and C) at the xth iteration. Cluster A contains two elements, p1 and p2, cluster B includes three components, p3, p4, and p5, while cluster C consists of a single element, p6. During this iteration, one of the cluster combinations (A–B, A–C, or B–C) is selected for merging.

For instance, using the “euclidean” metric and the “single” linkage criterion, the Euclidean distances for the A–B combination are calculated as follows: $d(p_1,p_3)$, $d(p_1,p_4)$, $d(p_1,p_5)$, $d(p_2,p_3)$, $d(p_2,p_4)$, and $d(p_2,p_5)$. With the single-linkage method, the minimum distance among these values is selected. At this stage, each cluster combination is represented by a single distance value, and the pair of clusters with the smallest distance is merged. The clustering process was implemented in MATLAB using the “linkage” function.
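The merge step in this example can be sketched in a few lines of Python. This is an illustration only: the study used MATLAB's “linkage” function, and the 2-D coordinates assigned to p1 through p6 below are hypothetical values chosen so the distances are easy to check by hand.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical 2-D coordinates for the six elements of the example.
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0)],              # p1, p2
    "B": [(3.0, 0.0), (4.0, 0.0), (4.0, 1.0)],  # p3, p4, p5
    "C": [(0.0, 5.0)],                          # p6
}


def single_linkage(cluster_a, cluster_b):
    """Smallest pairwise Euclidean distance between two clusters."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


# One linkage distance per cluster combination: A-B, A-C, B-C.
linkage_distances = {
    (a, b): single_linkage(clusters[a], clusters[b])
    for a, b in combinations(clusters, 2)
}

# The pair with the smallest linkage distance is merged at this iteration.
pair_to_merge = min(linkage_distances, key=linkage_distances.get)
```

With these coordinates the A–B distance is 3 (between p1 and p3), the A–C distance is 4, and the B–C distance is about 5.66, so A and B would be merged. Swapping `min` for `max` inside `single_linkage` would give the complete-linkage criterion instead.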

Dendrogram analysis and cutting.

After clustering, the dendrogram was analyzed to identify the optimal number of clusters (Fig 1). The tree-cutting procedure, which involves selecting the point at which to “cut” the tree, was guided by an automatic rule based on the Rk index described below, which measures the ratio between minimal inter-cluster and maximal intra-cluster variability.

Fig 1. Hierarchical clustering dendrograms with different metrics and datasets.

Dendrograms obtained using the variables from datasetL with the “Cityblock” metric and “Single” linkage (left image), and from datasetLE with the “Euclidean” metric and “Single” linkage (right image).

https://doi.org/10.1371/journal.pone.0339377.g001

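Intra-cluster variability is calculated as the sum of Euclidean distances of each element from its cluster centroid, divided by the cluster size:

\[ V_{\mathrm{intra}}(A) = \frac{1}{|A|} \sum_{p \in A} \lVert p - C_A \rVert_2 \tag{2} \]

where $C_A$ is the centroid of cluster $A$ and $|A|$ is its size. Inter-cluster variability between two clusters is calculated as the Euclidean distance between their centroids:

\[ V_{\mathrm{inter}}(A,B) = \lVert C_A - C_B \rVert_2 \tag{3} \]

The Rk index is computed as the ratio of the minimal inter-cluster variability to the maximal intra-cluster variability during each iteration:

\[ R_k = \frac{V_{\mathrm{inter}}^{\min}}{V_{\mathrm{intra}}^{\max}} \tag{4} \]

with

\[ V_{\mathrm{inter}}^{\min} = \min_{A \neq B} V_{\mathrm{inter}}(A,B) \tag{5} \]

\[ V_{\mathrm{intra}}^{\max} = \max_{A} V_{\mathrm{intra}}(A) \tag{6} \]

where the minimum and the maximum are taken over the clusters present at that iteration.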

The tree was cut at the point where the Rk value was maximized, indicating the most meaningful clustering configuration (Fig 2).
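The cutting rule can be illustrated with a short Python sketch that evaluates the Rk index for two candidate partitions and keeps the one with the larger value. The function names and the four 2-D points are hypothetical; the definitions follow the verbal descriptions above (mean distance to the centroid, distance between centroids, ratio of the minimum to the maximum).

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))


def intra_variability(cluster):
    """Mean Euclidean distance of the elements from their centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) for p in cluster) / len(cluster)


def inter_variability(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return dist(centroid(cluster_a), centroid(cluster_b))


def rk_index(clusters):
    """Minimal inter-cluster over maximal intra-cluster variability."""
    min_inter = min(inter_variability(a, b) for a, b in combinations(clusters, 2))
    max_intra = max(intra_variability(c) for c in clusters)
    # Undefined (division by zero) if every cluster is a single point.
    return min_inter / max_intra


# Hypothetical partitions of four points into two and three clusters.
two_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
three_clusters = [[(0.0, 0.0)], [(0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]

candidates = {2: two_clusters, 3: three_clusters}
best_k = max(candidates, key=lambda k: rk_index(candidates[k]))
```

In this toy case the two-cluster partition has compact, well-separated groups, so its Rk value (10) exceeds that of the three-cluster partition (2), and the tree would be cut at two clusters.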

Fig 2. Rk index evolution for different metrics and datasets.

Rk index for the last ten iterations using the Cityblock metric and Single linkage from datasetL (left image), and the Euclidean metric and Single linkage from datasetLE (right image). In both cases, the best Rk index occurs with two clusters.

https://doi.org/10.1371/journal.pone.0339377.g002

The resulting clusters were interpreted as distinct SCD risk states, each representing a prototype of patients with similar health characteristics.

Clustering patterns analysis.

Clustering aims to achieve an unsupervised classification based on features related to the subject of study. The identified clusters may have physical significance, representing meaningful groupings within the data. In particular, the smaller clusters may suggest a higher correlation with the risk of experiencing sudden cardiac death (SCD). The clustering analysis conducted on datasetL (8 features) and datasetLE (45 features) revealed that various metrics and similarity criteria often produced identical or highly similar clusterings.

Specifically, for datasetL, only four distinct clustering patterns were identified, involving 37 patients, with the minority clusters frequently involving the same type of individuals. For datasetLE, 10 different clustering patterns emerged; however, two were deemed insignificant in the context of sudden death analysis. These two clusterings produced configurations with one minority cluster of 332 patients (using the complete linkage method with Euclidean, squared Euclidean, and Minkowski metrics) and another with 78 patients (using complete linkage and the Minkowski metric). The complete Chebyshev clustering result was also excluded because the Rk index diverged as the number of clusters increased. Consequently, seven valid clusterings were considered for datasetLE, involving 33 patients. The clustering patterns are evident in Tables 3 and 4, which report the number of clusters obtained for various metrics and similarity criteria.

Table 3. Number of clusters obtained using different metrics and similarity criteria: (datasetL).

Cells with the same color indicate coincident clusterings.

https://doi.org/10.1371/journal.pone.0339377.t003

Table 4. Clustering results for different metrics and similarity criteria: (datasetLE).

Matching clusterings are highlighted with the same color. Excluded clusterings are represented by numbers followed by a comma (where the second number indicates the smallest cluster size) or marked as “div” to denote divergence of the Rk coefficient.

https://doi.org/10.1371/journal.pone.0339377.t004

Cells with the same color indicate coincident clusterings, while cells with numbers followed by a comma and another number or cells marked with “div” indicate excluded clusterings. The number to the right of the comma represents the size of the smallest cluster, while “div” indicates the divergence of the Rk coefficient. Seven patients appeared in common clusters across the clusterings derived from both datasets. Therefore, out of 711 analyzed patients, sixty-four were identified as “different” compared to the rest. These patients, distributed across eleven total clusterings, were analyzed further. Table 5 and Fig 3 present the clusterings of interest for these patients in tabular form and as Euler-Venn diagrams, respectively. Notably, these sixty-four individuals exhibited distinct characteristics.

Table 5. Tabular representation of the clusterings of interest for the identified patients.

Among the 711 analyzed patients, 64 were classified as “different” and distributed across 11 clusterings. This table provides a detailed overview of their distribution. The abbreviations “e”, “se”, “ci”, “ch”, “mi”, and “sp” stand for Euclidean, Squared Euclidean, Cityblock, Chebyshev, Minkowski, and Spearman, respectively.

https://doi.org/10.1371/journal.pone.0339377.t005

Fig 3. Euler-Venn diagrams representing the identified patients’ clusterings of interest.

These diagrams offer a graphical visualization of the relationships between clusters, complementing the tabular data in Table 5.

https://doi.org/10.1371/journal.pone.0339377.g003

These individuals are classified as Class C athletes with a familial and/or personal history of cardiopathy. Typical findings among them include atrioventricular block on both resting and stress ECG, QT segment prolongation during stress ECG, right axis deviation on resting and stress ECG, tachycardia on both resting and stress ECG, and ventricular pre-excitation seen on either or both types of ECG. Additional traits may include being born prematurely via cesarean section, complete right bundle branch block observed on stress ECG, hypertension with T-wave inversion in inferior and lateral leads on stress ECG, or a combination of these conditions.

While the clustering results provided intriguing insights, several challenges emerged. First, the data were analyzed across eight or forty-five variables, predominantly binary (Y/N), making interpretation difficult. Second, despite their interpretative interest, the results were challenging to present practically and graphically. To address this, clustering results were visualized using PCA for both datasets. PCA was also applied to a subset of selected features, and a detailed discussion is provided in the section dedicated to PCA analysis.

Principal component analysis for graphical representation

Principal Component Analysis (PCA) is a dimensionality reduction technique that simplifies complex datasets by transforming the original data into a new set of variables called principal components. These components are linear combinations of the original variables and are ordered such that the first principal component captures the largest share of the variance, the second captures the next largest share, and so on. Mathematically, PCA involves the following steps. First, the data are standardized by subtracting the mean and dividing by the standard deviation for each variable:

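\[ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \tag{7} \]

where $x_{ij}$ is the value of variable $j$ for patient $i$, and $\bar{x}_j$ and $s_j$ are the mean and standard deviation of variable $j$. This ensures that all variables contribute equally to the analysis. Next, the covariance matrix is computed to quantify the relationships between variables:

\[ C = \begin{pmatrix} \operatorname{cov}(Z_1, Z_1) & \cdots & \operatorname{cov}(Z_1, Z_m) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(Z_m, Z_1) & \cdots & \operatorname{cov}(Z_m, Z_m) \end{pmatrix} \tag{8} \]

where:

\[ \operatorname{cov}(Z_j, Z_k) = \frac{1}{n-1} \sum_{i=1}^{n} (z_{ij} - \bar{z}_j)(z_{ik} - \bar{z}_k) \tag{9} \]

with $n$ the number of patients and $m$ the number of variables.

For the covariance matrix, the eigenvectors indicate the directions of the data’s maximum variance (the principal components), while the eigenvalues represent the magnitude of variance along each eigenvector. The principal components are ranked by their eigenvalues in descending order. A feature vector is created by selecting the eigenvectors corresponding to the largest eigenvalues; this retains the most informative components, reducing dimensionality while preserving most of the information. The final step involves reorienting the data along the principal component axes using the transformation (see [33]):

\[ Y = Z\,W \tag{10} \]

where $Z$ is the standardized data matrix (one row per patient) and $W$ is the feature vector, a matrix whose columns are the selected eigenvectors.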
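The steps above (standardization, covariance matrix, eigen-decomposition, projection) can be sketched with NumPy as follows. This is an illustrative sketch rather than the study's MATLAB code: the function names, the random data, and the "new patient" vector are all hypothetical, and the sketch assumes no feature is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Return the mean, standard deviation, feature vector W and eigenvalues."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std                   # standardization, Eq. (7)
    C = np.cov(Z, rowvar=False)            # covariance matrix, Eqs. (8)-(9)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, std, eigvecs[:, order], eigvals[order]


def project(X, mean, std, W):
    """Standardize with the fitted statistics and project onto W, Eq. (10)."""
    return ((X - mean) / std) @ W


# Hypothetical data: 50 "patients", 4 features, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

mean, std, W, eigvals = fit_pca(X)
scores = project(X, mean, std, W)          # coordinates on PC1 and PC2

# A new observation is placed in the same space by reusing mean, std and W.
new_patient = np.array([[0.5, 1.0, -0.2, 0.3]])
new_scores = project(new_patient, mean, std, W)
```

Reusing the cohort's mean, standard deviation, and eigenvectors for `new_patient`, rather than refitting, is what keeps the new point comparable with the existing clusters in the plot.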

In clustering analysis, PCA is beneficial for visualizing high-dimensional data. Applying PCA to a dataset containing different clusters of patients, each described by multiple features, can reduce the feature space to two or three dimensions, making it easier to visualize. When plotting the principal components, each data point can be assigned a color corresponding to its cluster membership. This allows researchers to observe whether the points in the same cluster are positioned close to each other and far from points in other clusters. The clustering algorithm successfully captures the data’s inherent structure if clusters appear well-separated in the principal component space.

Moreover, when clusters appear differentiated in the PCA representation, this approach can also be extended to include new patients. By projecting the data of new individuals onto the same PCA-transformed space, their positions relative to existing clusters can be visualized. This provides a clear, intuitive way to assess how a new patient aligns with specific clusters, which may have physical or clinical significance. This ability highlights PCA, particularly the feature vectors, as a powerful data representation tool, facilitating dimensionality reduction and the interpretation of new data within existing patterns and structures.

This study observed a significant overlap between many clusters in the 11 identified clusterings. As a result, the cluster information was consolidated for representation. For both datasets, datasetL and datasetLE, the common clusters were merged, leading to a total of 21 clusters: 10 derived from the clustering of datasetL and 11 from the clustering of datasetLE. The information on the clusters and the respective patients identified through the consolidation process is presented in Table 6. The patient IDs in bold correspond to those included in clusters identified from both datasetL and datasetLE.

Table 6. Consolidated clustering results from datasetL and datasetLE, resulting in 21 clusters.

Bold patient IDs represent those in clusters identified from both datasets. The term “others” indicates the rest of the patients in the dataset.

https://doi.org/10.1371/journal.pone.0339377.t006

PCA was then applied to the two datasets, and the final clusters (i.e., the overall clusters) were visualized by assigning different colors to the data points of each cluster on the principal component plots. The data was projected onto the first two principal components for visualization in both cases. The results showed that, in most cases, the elements of the clusters are positioned in distinct regions of the plot, clearly separating the clusters.

Interestingly, in the case of datasetL, the data points of the patients appeared to organize themselves along parallel regions, which were linearly separable. This suggests that the clusters in datasetL are well-structured and can potentially be discriminated with a linear classifier in the PCA-transformed space. The only exception is represented by Cluster 5, which is composed of just two elements that fall within the lower part of the band formed by the elements of Cluster 1, thus partially overlapping with it. The plot for the first two principal components of datasetL is shown in Fig 4.
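To illustrate what linear separability of parallel bands means in practice, the sketch below builds two synthetic bands in a hypothetical (PC1, PC2) plane and fits a least-squares linear discriminant to them. The band geometry, noise level, and cluster names are assumptions chosen for illustration; they are not taken from Fig 4, and the least-squares discriminant is only one of several linear classifiers that could be used.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_band(offset, n=100):
    """Points scattered along the line pc2 = 0.5 * pc1 + offset."""
    pc1 = rng.uniform(-5.0, 5.0, size=n)
    pc2 = 0.5 * pc1 + offset + rng.normal(scale=0.2, size=n)
    return np.column_stack([pc1, pc2])

band_a = make_band(offset=0.0)   # hypothetical cluster A
band_b = make_band(offset=3.0)   # hypothetical cluster B
points = np.vstack([band_a, band_b])
targets = np.array([-1.0] * len(band_a) + [1.0] * len(band_b))

# Least-squares linear discriminant: find w so that [pc1, pc2, 1] @ w is close to -1 or +1.
design = np.column_stack([points, np.ones(len(points))])
w, *_ = np.linalg.lstsq(design, targets, rcond=None)

predicted = np.sign(design @ w)
accuracy = float((predicted == targets).mean())
print("boundary weights:", w, "training accuracy:", accuracy)
```

The fitted weight vector is close to perpendicular to the bands, which is why a single straight line is enough to split them; bands that overlap, as Cluster 5 does with Cluster 1, would not be separated this cleanly.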

Fig 4. PCA visualization of the final clusters for datasetL, with data points colored according to their respective clusters.

The first two principal components separate the clusters, with data points in datasetL organizing along parallel, linearly separable regions, suggesting potential for linear classification in the PCA-transformed space.

https://doi.org/10.1371/journal.pone.0339377.g004

On the other hand, for datasetLE, the elements belonging to the clusters tended to arrange themselves peripherally relative to the rest of the sample. This indicates a different underlying structure, where clusters form around the edges of the dataset, separating themselves from the central data points. An exception to this pattern is observed for all the elements of clusters 4, 10, and 11, as well as one element from Cluster 3, which are instead located closer to the central region of the plot. The plot for the first two principal components of datasetLE is shown in Fig 5. This differentiation in the clusters’ spatial organization highlights the effectiveness of PCA in revealing structural patterns in the data.

Fig 5. PCA visualization of the final clusters for datasetLE, with data points colored according to their respective clusters.

The elements of the clusters are arranged peripherally around the central data points, indicating a different underlying structure, where clusters form at the edges of the dataset.

https://doi.org/10.1371/journal.pone.0339377.g005

By visualizing the clustering results in the principal component space, it becomes possible to better understand the relationships between the clusters and the data distribution and even to hypothesize about the physical or clinical significance of these patterns. The eigenvalues and eigenvectors are presented in Table 15.

Discussion and application

Interpretation of findings

This study introduces a novel data-driven methodology for assessing SCD risk in athletes by integrating unsupervised HC with PCA. Traditional screening methods rely on clinical guidelines and predefined risk factors. In contrast, this approach identifies natural groupings in high-dimensional data, offering a more nuanced stratification of athletes based on shared physiological and clinical characteristics. By leveraging clustering, the study uncovers hidden patterns in athlete profiles, distinguishing subgroups that may correspond to varying levels of SCD risk. Principal component analysis further enhances interpretability by reducing dimensionality, allowing for a visual representation of risk groupings in a lower-dimensional space. This combination offers an innovative perspective on SCD risk assessment, complementing standard pre-participation screenings [34].

The clustering results align with previous research on SCD risk stratification, supporting the notion that a small subset of individuals consistently emerges as distinct, potentially corresponding to an elevated cardiovascular risk. These clusters were medically interpretable as individuals in high-risk groups exhibited characteristics associated with known clinical predictors of SCD, such as electrocardiographic abnormalities, hypertrophic markers, and family history of cardiovascular disease [35]. The analysis of the two datasets showed clear differences in high-risk groups of athletes.

In datasetL, the high-risk group showed more frequent ECG abnormalities. Specifically, 10.8% of these athletes had T-wave inversions in the lateral or inferolateral leads of the resting ECG, and 2.7% showed this in the anterior leads. Additionally, 10.8% had prolonged QTc intervals on the stress ECG, and a significant 56.8% showed signs of ventricular pre-excitation. Notably, syncope (fainting) occurred in 8.1% of high-risk individuals, while no one in the low-risk group reported it. The high-risk group also exhibited a slightly lower average body mass index (BMI), at 20.86 compared to 22.04, and a lower average heart rate (72.05 bpm vs. 75.05 bpm), indicating different physiological patterns.

In datasetLE, similar trends were noted. Here, 57.6% of individuals in the high-risk group had a family history of heart disease, in contrast to 19.3% in the low-risk group. A personal history of heart disease was seen in 51.5% of high-risk individuals, compared to just 4.4% in the low-risk group. Furthermore, the high-risk subgroup exhibited higher rates of sinus tachycardia (24.2% vs. 2.8%) and ventricular pre-excitation (12.1% vs. 2.5%) on the resting ECG. Interestingly, the average heart rate was significantly higher in the high-risk group (80.2 bpm vs. 74.6 bpm). Some features, like pectus excavatum and certain axis deviations, were more common in the low-risk group, which could suggest some protective factors or random variability.

These differences are summarized in Tables 7 and 8, which present only those features showing statistically significant differences between the high-risk and low-risk groups for each dataset. The high-risk and low-risk groups comprised 37 and 674 individuals in datasetL, and 33 and 678 individuals in datasetLE.
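Because the high-risk groups are small, each reported percentage corresponds to only a handful of athletes. The short sketch below converts the percentages quoted in the text back into approximate head counts using the stated group sizes. These counts are inferred by rounding and are not reported by the authors, so Tables 7 and 8 remain the authoritative source; the function name `implied_count` and the feature labels are illustrative.

```python
def implied_count(percent, group_size):
    """Nearest whole number of individuals matching a reported percentage."""
    return round(percent / 100 * group_size)

HIGH_RISK_SIZE = {"datasetL": 37, "datasetLE": 33}   # group sizes quoted in the text

# Percentages quoted in the text for the high-risk groups.
reported_percent = {
    "datasetL": {
        "T-wave inversion, lateral/inferolateral (resting ECG)": 10.8,
        "T-wave inversion, anterior (resting ECG)": 2.7,
        "prolonged QTc (stress ECG)": 10.8,
        "ventricular pre-excitation": 56.8,
        "syncope": 8.1,
    },
    "datasetLE": {
        "family history of heart disease": 57.6,
        "personal history of heart disease": 51.5,
        "sinus tachycardia (resting ECG)": 24.2,
        "ventricular pre-excitation (resting ECG)": 12.1,
    },
}

implied = {
    name: {feat: implied_count(p, HIGH_RISK_SIZE[name]) for feat, p in feats.items()}
    for name, feats in reported_percent.items()
}

for name, feats in implied.items():
    for feat, count in feats.items():
        # Round-trip check: the inferred count reproduces the quoted percentage.
        assert round(100 * count / HIGH_RISK_SIZE[name], 1) == reported_percent[name][feat]
        print(f"{name}: {feat}: about {count} of {HIGH_RISK_SIZE[name]}")
```

Seen this way, a single athlete shifts a percentage by roughly three points in either high-risk group, which is worth keeping in mind when comparing prevalences between groups of such different sizes.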

Table 7. Comparison of feature prevalence in risk and no-risk clusters for datasetL and datasetLE.

Variables from the original literature-based set (datasetL) are shown in bold in the leftmost column. Values with statistically significant differences within each dataset are also indicated in bold. Feature numbers correspond to those reported in Table 14.

https://doi.org/10.1371/journal.pone.0339377.t007

Table 8. Comparison of feature statistics in risk and no-risk clusters for datasetL and datasetLE.

Feature numbers correspond to those reported in Table 14.

https://doi.org/10.1371/journal.pone.0339377.t008

Analyzing the distinct compositions of individual clusters within the high-risk group offers significant clinical insights, enhancing our understanding of the risk class’s internal structure. This nuanced analysis enables healthcare professionals to interpret how specific feature combinations relate to cluster placements in PCA space, thereby refining diagnostic strategies. Tables 9 and 10 illustrate representative configurations of features found in high-risk clusters within datasetL, while Tables 11 and 12 refer to datasetLE.

Table 9. Feature distribution across clusters for datasetL.

Columns indicate feature numbers as listed in Table 14 or Table 10. Rows represent groups of cluster elements that share the same feature activation pattern, as listed in Table 6. Numbers in parentheses denote the number of individuals within each subgroup (or pattern). When these numbers are shown in bold, they identify elements also found in datasetLE; when they are underlined, they indicate elements located in the PCA region defined as non-risk.

https://doi.org/10.1371/journal.pone.0339377.t009

Table 10. Representative feature patterns across high-risk clusters for datasetL.

Rows correspond to features, as listed in Table 14. Columns represent cluster IDs, as defined in Table 6, and indicate all features activated by at least one element within each cluster.

https://doi.org/10.1371/journal.pone.0339377.t010

Table 11. Feature distribution across clusters for datasetLE.

Columns indicate feature numbers as listed in Table 14 or Table 12. Rows represent groups of cluster elements that share the same feature activation pattern, as listed in Table 6. Numbers in parentheses denote the number of individuals within each subgroup (or pattern). When these numbers are shown in bold, they identify elements also found in datasetL; when they are underlined, they indicate elements located in the PCA region defined as non-risk.

https://doi.org/10.1371/journal.pone.0339377.t011

Table 12. Representative feature patterns across high-risk clusters for datasetLE.

Rows correspond to features, as listed in Table 14. Columns represent cluster IDs, as defined in Table 6, and indicate all features activated by at least one element within each cluster.

https://doi.org/10.1371/journal.pone.0339377.t012

For instance, in datasetL, Cluster 2 is notable for its inclusion of athletes presenting with Class C sport participation, conditions such as sinus tachycardia and ventricular pre-excitation (Stress ECG), and prevalent T-wave inversions in anterior or lateral leads. Coupled with personal and family histories of heart disease, this cluster highlights critical profiles requiring targeted clinical intervention. Conversely, Cluster 7 reveals rare but alarming profiles characterized by prolonged QTc and syncope, indicating an urgent need for thorough evaluation and monitoring. A closer look at Cluster 5 suggests a different clinical narrative, as its members exhibit minimal activations across descriptive variables. This potentially identifies them as a cohort of low-risk subjects. In the PCA projection, these individuals are situated at the lower boundary of Cluster 1, which similarly includes low-risk cases. Merging these clusters could enhance the interpretation of subjects exhibiting a non-risk profile.

Turning to datasetLE, Cluster 1 shows a distinct pattern of Class C sport participation, family history of heart disease, and sinus tachycardia on the resting ECG. These interrelated patterns underscore the importance of considering both individual features and specific combinations when determining risk groupings in PCA space. Similarly, Clusters 10 and 11 present low levels of feature activation, indicating a prevalence of non-risk profiles. Their positioning within the central region of Cluster 8 suggests potential for consolidation, reinforcing the clinical approach of distinguishing between at-risk and non-risk groups. However, Clusters 3 and 4 are harder to interpret. Cluster 3 is problematic for two main reasons: (i) its members are spatially distributed across different regions in the PCA plot, indicating internal heterogeneity; and (ii) element 375, despite showing a moderate number of activations, is positioned centrally within the band formed by Cluster 8, suggesting potential non-risk status. A similar ambiguity arises with element 198 from Cluster 4, which also falls in the dense central region dominated by Cluster 8 elements despite exhibiting numerous feature activations. Since this area also includes several members of Cluster 11, it may be interpreted as either at-risk or non-risk, depending on the severity and clinical relevance attributed to the overlapping features in relation to SCD.

All analytical tools employed to produce the visual results shown in Figs 4 and 5 are accessible in the Experiment subfolder of the central repository at https://github.com/hyacintus/Sudden-Cardiac-Death-Survey/tree/26ba71a0a69e5de58d015f1368512e99ba99dc03/Experiment. This comprehensive analysis underscores the importance of cluster evaluation in driving informed clinical decisions and advancing patient care.

However, this study diverges from conventional models in several key ways. Unlike traditional methods that assess risk factors individually, this model analyzes their combined interactions, offering a more comprehensive risk assessment. Most prior studies rely on supervised classification models trained on known SCD cases; here, clustering provides an unbiased grouping, potentially identifying previously unrecognized risk profiles. While prior research establishes SCD risk factors, the graphical PCA representation is unique, offering a dynamic decision-support tool for sports cardiologists [36].

Testing the model on new data

There are several potential approaches to building a classifier based on the data used for clustering and subsequent PCA in this study. Among the viable methods are fuzzy logic systems [37], artificial neural networks [38], k-means-based classification strategies [39], or even the direct use of the similarity metrics and distance measures already described in this work [32], combined and adapted for classification purposes. However, initiating the development of such a classifier at this stage would be premature and of limited utility, given that the internal characteristics of the identified clusters have not yet been longitudinally studied or clinically validated over time through dedicated follow-up investigations.

Instead, this section aims to demonstrate how new data points—i.e., newly acquired subjects—naturally align with specific clusters in the PCA-transformed space, based on their feature profiles. More precisely, the idea is to show that new elements tend to distribute across the PCA plots in regions where clusters with similar feature combinations are located, as identified in Figs 6 and 7.

Fig 6. PCA visualization of the final clusters for datasetL, with data points colored according to their respective clusters.

Additionally, the corresponding hypothetical cluster membership regions are shown using the same colors, highlighting the areas each cluster occupies in the PCA space.

https://doi.org/10.1371/journal.pone.0339377.g006

Fig 7. PCA visualization of the final clusters for datasetLE, with data points colored according to their respective clusters.

Additionally, the corresponding hypothetical cluster membership regions are displayed using the same colors, emphasizing the peripheral arrangement of clusters around the central data points. This suggests a distinct underlying structure, where clusters emerge at the edges of the dataset.

https://doi.org/10.1371/journal.pone.0339377.g007

To this end, a set of 68 new subjects was introduced. For simplicity and for illustrative purposes, these new subjects were manually labeled as either “at risk” or “not at risk” based on the characteristic features associated with the risk profiles defined in both datasetL and datasetLE. According to these criteria, 7 subjects were classified as “at risk” based on the feature structure of datasetL, and 8 based on that of datasetLE, with 3 subjects overlapping across both designations.

The new subjects were then projected into the PCA space of each dataset and visualized alongside the original clustered data. In these plots, shown in Figs 8 and 9, the new subjects are colored in yellow (at risk) and green (not at risk), while the original dataset elements retain their previous labeling: red for at-risk and blue for not-at-risk individuals. As previously discussed, uncertain clusters or ambiguous elements (e.g., Cluster 5 in datasetL, and Clusters 4, 10, and 11 in datasetLE) were treated as “not at risk” and are represented in black.

The resulting visualization clearly shows that the new data points tend to occupy the same regions as the original clusters with similar characteristics. This provides preliminary qualitative validation of the cluster configurations, supporting the hypothesis that the PCA space preserves meaningful structural relationships between subjects.
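The projection step can be sketched as follows, under stated assumptions: the PCA parameters (mean, scale, and the first two eigenvectors) are fitted once on the original cohort and then reused unchanged for new subjects. Everything in the snippet is synthetic, and the nearest-centroid check is only an illustrative numeric stand-in for the visual comparison made in the figures; it is not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for an original cohort (8 features); 0 = not at risk, 1 = at risk.
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 8)),
               rng.normal(4.0, 1.0, size=(30, 8))])
labels = np.array([0] * 300 + [1] * 30)

# Fit the PCA parameters once, on the original cohort only.
mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.cov((X - mean) / std, rowvar=False))
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # first two principal axes

def project(subjects):
    """Place subjects in the existing PCA plane; nothing is refitted."""
    return ((subjects - mean) / std) @ W

scores = project(X)
centroids = np.array([scores[labels == k].mean(axis=0) for k in (0, 1)])

def nearest_group(points_2d):
    """Index (0 or 1) of the closest original-group centroid for each point."""
    dists = np.linalg.norm(points_2d[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical new subjects, labelled by hand as in the text.
new_not_at_risk = rng.normal(0.0, 1.0, size=(5, 8))
new_at_risk = rng.normal(4.0, 1.0, size=(5, 8))

landed_safe = nearest_group(project(new_not_at_risk))
landed_risk = nearest_group(project(new_at_risk))
print("not-at-risk subjects land near group:", landed_safe)
print("at-risk subjects land near group:", landed_risk)
```

Reusing the stored parameters matters: refitting PCA with the new subjects included would move the axes, so the new points could no longer be compared with the regions drawn from the original data.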

All tools used to generate the visual results presented in Figs 8 and 9 are available in the Test subfolder within the main repository at https://github.com/hyacintus/Sudden-Cardiac-Death-Survey/tree/809e6aac122aba71379db23d426af654bda3a944/Test.

Fig 8. PCA test visualization of the Risk/No Risk clusters for datasetL, with data points colored according to their respective clusters.

The new subjects are colored yellow (at risk) and green (not at risk), while original points remain red (at risk), blue (not at risk), and black for ambiguous cases (e.g., Cluster 5).

https://doi.org/10.1371/journal.pone.0339377.g008

Fig 9. PCA test visualization of the Risk/No Risk clusters for datasetLE, with data points colored according to their respective clusters.

The new subjects are colored yellow (at risk) and green (not at risk), while original points remain red (at risk), blue (not at risk), and black for ambiguous cases (e.g., Clusters 4, 10, and 11).

https://doi.org/10.1371/journal.pone.0339377.g009

Clinical applications and prospects for implementation

This methodology presents a potential real-world application for pre-participation cardiovascular screenings in athletes. Defining an athlete’s profile using either a reduced feature set of eight key variables or the entire dataset of forty-five features, assigning the athlete to a cluster based on distance metrics, and using PCA-based visualization to determine their proximity to high-risk groups offer an intuitive interpretation of their risk status. This approach could augment traditional screening methods, helping sports cardiologists prioritize further testing in athletes who appear closer to high-risk clusters. It could also serve as an early flagging system, directing individuals toward more in-depth cardiac evaluations [40].
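One way such an early flagging step could look is sketched below: athletes already projected into the (PC1, PC2) plane are ranked by their distance to the nearest high-risk cluster centroid, so that those closest to a high-risk region are reviewed first. The centroid coordinates, athlete IDs, and positions are hypothetical, and plain Euclidean distance is an assumption made here; the distance measures used in the study are not reproduced.

```python
import math

# Hypothetical centroids of high-risk clusters in the (PC1, PC2) plane.
HIGH_RISK_CENTROIDS = [(4.0, 0.5), (-3.0, 2.0)]

def distance_to_high_risk(point, centroids=HIGH_RISK_CENTROIDS):
    """Euclidean distance from a projected athlete to the closest high-risk centroid."""
    return min(math.dist(point, c) for c in centroids)

def rank_for_follow_up(athletes):
    """Athlete IDs sorted from closest to farthest from any high-risk centroid."""
    return sorted(athletes, key=lambda a: distance_to_high_risk(athletes[a]))

# Hypothetical projected positions of four athletes.
athletes = {"A": (0.1, 0.0), "B": (3.5, 0.7), "C": (-2.0, 1.0), "D": (1.5, -0.5)}
ranking = rank_for_follow_up(athletes)

for athlete_id in ranking:
    print(athlete_id, round(distance_to_high_risk(athletes[athlete_id]), 3))
```

A ranking of this kind expresses relative proximity only; it does not define a threshold below which an athlete should be considered at risk.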

Despite its potential, this approach has three key limitations. First, while high-risk clusters correspond to known clinical markers, there is no definitive proof that they predict actual SCD events. Future studies should conduct external validation using independent datasets incorporating diverse athletic and non-athletic populations to assess model generalizability [41]. Second, clustering groups individuals with shared characteristics but does not produce explicit risk equations. Future research should integrate supervised learning models such as logistic regression, deep learning, and support vector machines to refine interpretability and predictive power [42]. Third, principal component analysis reduces dimensionality, but the components lack direct physiological meaning, so proximity to a cluster suggests relative risk rather than absolute thresholds. Incorporating biomarkers such as genetic predisposition indicators and electrocardiographic parameters could enhance clinical applicability [43].

Future research should prioritize prospective cohort studies to enhance and validate this approach, assessing whether clustering-based stratifications correspond with actual SCD incidence. Additionally, optimizing feature selection through advanced techniques, such as recursive feature elimination and Shapley additive explanations, will help isolate the most predictive variables. Incorporating physiological markers, such as cardiac MRI data, genetic risk scores, and exercise-induced ECG responses, can further enhance clinical relevance. Collaborating with experts in sports medicine and cardiology will help refine risk thresholds and enhance real-world applications for both elite and amateur athletes. The integration of clustering and PCA in athlete screening represents a significant shift from traditional, static risk assessment models to a more flexible, data-driven approach. While further validation is required, this methodology offers a promising tool to enhance pre-participation screenings, identify previously undetected risk groups, and ultimately contribute to improved athlete safety. By refining this approach through machine learning advancements, external validation, and clinical collaboration, it has the potential to become an essential component of modern sports cardiology [44–46].

To support wider exploration and independent testing of this approach, a dedicated testing tool is provided in the Trial subfolder of the main repository at https://github.com/hyacintus/Sudden-Cardiac-Death-Survey/tree/3fc0849814be127676081ef15c8a1903d06d3c9c/Trial. This tool is designed to reproduce visualizations such as those shown in Figs 10 and 11, and includes step-by-step instructions.

Fig 10. PCA trial visualization of the Risk/No Risk clusters for datasetL, with data points colored according to their respective clusters.

Original subjects are shown in red (at risk), blue (not at risk), and black for ambiguous cases (e.g., Cluster 5). Newly tested subjects are displayed in yellow.

https://doi.org/10.1371/journal.pone.0339377.g010

Fig 11. PCA trial visualization of the Risk/No Risk clusters for datasetLE, with data points colored according to their respective clusters.

Original subjects are shown in red (at risk), blue (not at risk), and black for ambiguous cases (e.g., Clusters 4, 10, and 11). Newly tested subjects are displayed in yellow.

https://doi.org/10.1371/journal.pone.0339377.g011

To provide clinicians with a cutting-edge advantage in sports cardiology, the integration of advanced data analytics and machine learning approaches can revolutionize the understanding of cardiac health in athletes. By leveraging sophisticated algorithms, clinicians can analyze large datasets effectively, identifying underlying patterns and risk factors that may not be apparent through traditional methods.

  • Real-Time Monitoring: Implement wearable technology that continuously monitors vital signs and cardiac performance during training and competitions. This data can be integrated into predictive models to assess individual risks in real time.
  • Personalized Risk Assessment: Utilize PCA plots and machine learning to create individualized profiles, identifying athletes’ specific risk factors for SCD. This enables tailored recommendations regarding training intensity, recovery protocols, and screenings.
  • Interactive Visualization Tools: Develop user-friendly dashboards that display PCA diagrams in an interactive format, allowing clinicians to explore various datasets and understand the spatial distribution of patient clusters in a more engaging way.
  • Collaborative Research Networks: Encourage partnerships between sports organizations, universities, and healthcare providers to share data on athlete health, which can enhance the development of robust predictive models and improve overall understanding of cardiac risks in sports.
  • Educational Resources: Provide ongoing training and resources for clinicians on the latest advancements in sports cardiology, including interpretation of PCA data and implementing machine learning insights into clinical practice.

By adopting these cutting-edge strategies, clinicians can enhance their ability to assess, monitor, and manage the cardiac health of athletes, leading to improved outcomes and safer sports environments.

Conclusions

This study began with a dataset of 711 individuals who underwent screening to assess their physical fitness at a competitive level. We focused on variables correlated with SCD and initially selected twenty-six relevant variables from the literature. After refinement, we narrowed this down to eight core variables. Additionally, we incorporated other variables that we believed might be associated with the disease, resulting in a final set of forty-five variables. Consequently, we created two datasets with 711 entries each: one characterized by eight features (datasetL) and the other by forty-five features (datasetLE). We applied hierarchical clustering to both datasets using various metrics and combinations of similarity measures. The resulting clusterings were notably similar, allowing us to consistently reorganize the clusters. Specifically, we defined ten clusters for datasetL and eleven for datasetLE. We comprehensively interpreted all twenty-one clusters, emphasizing their potential medical significance. Following this, we visualized the data for both datasets using PCA. In most instances, elements of the clusters identified as associated with the disease were situated in regions distinctly separate from those considered healthy concerning SCD. We proposed using distance metrics from the clusters and PCA-based visualizations to assess risk in a diagnostic context. However, a significant limitation of these methods is the lack of a direct medical interpretation tied to specific formulas or indices. To address these challenges, future studies will focus on three key areas:

  • In-depth Analysis of Individual Clusters: We will investigate the medical and statistical significance of each cluster identified in this research to enhance our understanding of their role in stratifying the risk of sudden cardiac death.
  • Development of Risk Equations: We aim to derive mathematical equations that can define a risk index for sudden cardiac death based on the data from the clusters. This may involve using regression models, algorithms for feature selection such as genetic algorithms, and employing neural networks to optimize weights for predictive formulas.
  • Identification of Physically Meaningful Variables: We will seek to find variables or data representations with explicit physical meaning that can help distinguish clusters in regions separate from those associated with healthy individuals. This could involve exploring dimensionality reduction techniques, advanced machine learning algorithms, or specialized feature selection methods, such as recursive feature elimination, principal component regression, or autoencoders, to derive latent features.
Table 13. Comprehensive list of the 151 variables recorded for 711 athletes undergoing pre-participation physical evaluation (PPPE).

The variables are categorized into athlete data, family, physiological, and pathological history, clinical examination findings, resting and exertion electrocardiogram (ECG) results, spirometry, urinalysis, additional specialized tests, and final medical judgment.

https://doi.org/10.1371/journal.pone.0339377.t013

By addressing these objectives, we aim to improve the interpretability and utility of this clustering-based approach, transitioning from abstract mathematical representations to practical diagnostic tools with clear physical and medical implications.

Appendix

Table 14. Overview of features included in the datasets.

This table summarizes the features included in the three datasets: the literature-based features plus extra features (datasetLE; 45 variables), the complete literature-based features (26 variables), and the literature-based features (datasetL; 8 variables), categorized into Risk and Physiological Factors, Resting ECG, and Stress ECG. Most features are binary (Y/N), except for Body Mass Index and Heart Rate.

https://doi.org/10.1371/journal.pone.0339377.t014

Table 15. Eigenvalues and eigenvectors corresponding to the principal components.

These values provide further insight into the structural patterns revealed by PCA, enhancing the understanding of the clustering results and their potential physical or clinical significance.

https://doi.org/10.1371/journal.pone.0339377.t015

Supporting information

S1 Fig. Graphical representation of the study.

On the left, the study development: data from 711 athletes were analyzed using clustering and PCA, revealing that risk clusters occupy distinct regions. The proposed risk diagnostic tool is on the right: patient data are plotted on a region-based graph to determine whether the patient’s point falls within a high-risk area.

https://doi.org/10.1371/journal.pone.0339377.s001

(TIFF)

Acknowledgments

We would like to express our sincere gratitude to Drs. Eleonora Desideri and Simona Gorgoglione for their invaluable support during the course of our research. Their expertise and dedication contributed to the success of this study. Special thanks are also due to Angelina Libertazzi, head nurse, whose commitment and practical support during the analysis of the athletes was crucial. Her professionalism and attention made it possible to collect valuable data and significantly improved the research.

References

  1. 1. Mandsager K, Harb S, Cremer P, Phelan D, Nissen SE, Jaber W. Association of cardiorespiratory fitness with long-term mortality among adults undergoing exercise treadmill testing. JAMA Netw Open. 2018;1(6):e183605. pmid:30646252
  2. 2. Shiroma EJ, Lee I-M. Physical activity and cardiovascular health: lessons learned from epidemiological studies across age, gender, and race/ethnicity. Circulation. 2010;122(7):743–52. pmid:20713909
  3. 3. Radford NB, DeFina LF, Leonard D, Barlow CE, Willis BL, Gibbons LW, et al. Cardiorespiratory fitness, coronary artery calcium, and cardiovascular disease events in a cohort of generally healthy middle-age men: results from the cooper center longitudinal study. Circulation. 2018;137(18):1888–95. pmid:29343464
  4. 4. Shah RV, Murthy VL, Colangelo LA, Reis J, Venkatesh BA, Sharma R, et al. Association of fitness in young adulthood with survival and cardiovascular risk: the Coronary Artery Risk Development in Young Adults (CARDIA) Study. JAMA Intern Med. 2016;176(1):87–95. pmid:26618471
  5. 5. Hussain N, Gersh BJ, Gonzalez Carta K, Sydó N, Lopez-Jimenez F, Kopecky SL, et al. Impact of cardiorespiratory fitness on frequency of atrial fibrillation, stroke, and all-cause mortality. Am J Cardiol. 2018;121(1):41–9. pmid:29221502
  6. 6. Deo R, Albert CM. Epidemiology and genetics of sudden cardiac death. Circulation. 2012;125(4):620–37. pmid:22294707
  7. 7. Sollazzo F, Palmieri V, Gervasi SF, Cuccaro F, Modica G, Narducci ML, et al. Sudden cardiac death in Athletes in Italy during 2019 : internet-based epidemiological research. Medicina (Kaunas). 2021;57(1):61. pmid:33445447
  8. 8. Maron BJ, Doerer JJ, Haas TS, Tierney DM, Mueller FO. Sudden deaths in young competitive athletes: analysis of 1866 deaths in the United States, 1980-2006. Circulation. 2009;119(8):1085–92. pmid:19221222
  9. 9. Corrado D, Basso C, Rizzoli G, Schiavon M, Thiene G. Does sports activity enhance the risk of sudden death in adolescents and young adults?. J Am Coll Cardiol. 2003;42(11):1959–63. pmid:14662259
  10. Harmon KG, Drezner JA, Maleszewski JJ, Lopez-Anderson M, Owens D, Prutkin JM, et al. Pathogeneses of sudden cardiac death in national collegiate athletic association athletes. Circ Arrhythm Electrophysiol. 2014;7(2):198–204. pmid:24585715
  11. Risgaard B, Winkel BG, Jabbari R, Glinge C, Ingemann-Hansen O, Thomsen JL, et al. Sports-related sudden cardiac death in a competitive and a noncompetitive athlete population aged 12 to 49 years: data from an unselected nationwide study in Denmark. Heart Rhythm. 2014;11(10):1673–81. pmid:24861446
  12. Finocchiaro G, Westaby J, Bhatia R, Malhotra A, Behr ER, Papadakis M, et al. Sudden death in female athletes: insights from a large regional registry in the United Kingdom. Circulation. 2021;144(22):1827–9. pmid:34843396
  13. Haukilahti MAE, Holmström L, Vähätalo J, Kenttä T, Tikkanen J, Pakanen L, et al. Sudden cardiac death in women. Circulation. 2019;139(8):1012–21. pmid:30779638
  14. Finocchiaro G, Dhutia H, D’Silva A, Malhotra A, Steriotis A, Millar L, et al. Effect of sex and sporting discipline on LV adaptation to exercise. JACC: Cardiovascular Imaging. 2017;10(9):965–72.
  15. Maron BJ, Haas TS, Murphy CJ, Ahluwalia A, Rutten-Ramos S. Incidence and causes of sudden death in U.S. college athletes. J Am Coll Cardiol. 2014;63(16):1636–43. pmid:24583295
  16. Maron BJ, Doerer JJ, Haas TS, Tierney DM, Mueller FO. Abstract 3872: Profile and frequency of sudden deaths in 1,463 young competitive athletes: from a 25-year U.S. national registry, 1980–2005. Circulation. 2006;114:II_830-II_830.
  17. Besenius E, Cabri J, Delagardelle C, Stammet P, Urhausen A. Five years-results of a nationwide database on sudden cardiac events in sports practice in Luxembourg. Dtsch Z Sportmed. 2022;73(1):24–9.
  18. Toresdahl BG, Rao AL, Harmon KG, Drezner JA. Incidence of sudden cardiac arrest in high school student athletes on school campus. Heart Rhythm. 2014;11(7):1190–4. pmid:24732370
  19. Eckart RE, Shry EA, Burke AP, McNear JA, Appel DA, Castillo-Rojas LM, et al. Sudden death in young adults: an autopsy-based series of a population undergoing active surveillance. J Am Coll Cardiol. 2011;58(12):1254–61. pmid:21903060
  20. Finocchiaro G, Dhutia H, Gray B, Ensam B, Papatheodorou S, Miles C, et al. Diagnostic yield of hypertrophic cardiomyopathy in first-degree relatives of decedents with idiopathic left ventricular hypertrophy. Europace. 2020;22(4):632–42. pmid:32011662
  21. Corrado D, Basso C, Pavei A, Michieli P, Schiavon M, Thiene G. Trends in sudden cardiovascular death in young competitive athletes after implementation of a preparticipation screening program. JAMA. 2006;296(13):1593–601. pmid:17018804
  22. Malhotra A, Dhutia H, Finocchiaro G, Gati S, Beasley I, Clift P, et al. Outcomes of cardiac screening in adolescent soccer players. N Engl J Med. 2018;379(6):524–34. pmid:30089062
  23. Drezner J, O’Connor F, Harmon K, Fields K, Asplund C, Asif I, et al. AMSSM position statement on cardiovascular preparticipation screening in athletes: current evidence, knowledge gaps, recommendations, and future directions: erratum. Clinical Journal of Sport Medicine. 2018;28(3).
  24. Kanegae H, Suzuki K, Fukatani K, Ito T, Harada N, Kario K. Highly precise risk prediction model for new-onset hypertension using artificial intelligence techniques. J Clin Hypertens (Greenwich). 2020;22(3):445–50. pmid:31816148
  25. Raghunath S, Pfeifer JM, Ulloa-Cerna AE, Nemani A, Carbonati T, Jing L, et al. Deep neural networks can predict new-onset atrial fibrillation from the 12-lead ECG and help identify those at risk of atrial fibrillation-related stroke. Circulation. 2021;143(13):1287–98. pmid:33588584
  26. Hirota N, Suzuki S, Arita T, Yagi N, Otsuka T, Kishi M, et al. Prediction of current and new development of atrial fibrillation on electrocardiogram with sinus rhythm in patients without structural heart disease. Int J Cardiol. 2021;327:93–9. pmid:33188796
  27. Sgarro GA, Grilli L, Valenzano AA, Moscatelli F, Monacis D, Toto G, et al. The role of BIA analysis in osteoporosis risk development: hierarchical clustering approach. Diagnostics (Basel). 2023;13(13):2292. pmid:37443685
  28. Jafarzadegan M, Safi-Esfahani F, Beheshti Z. Combining hierarchical clustering approaches using the PCA method. Expert Systems with Applications. 2019;137:1–10.
  29. Granato D, Santos JS, Escher GB, Ferreira BL, Maggio RM. Use of principal component analysis (PCA) and hierarchical cluster analysis (HCA) for multivariate association between bioactive compounds and functional properties in foods: A critical perspective. Trends in Food Science & Technology. 2018;72:83–90.
  30. Bruse JL, Zuluaga MA, Khushnood A, McLeod K, Ntsinjana HN, Hsia T-Y, et al. Detecting clinically meaningful shape clusters in medical image data: metrics analysis for hierarchical clustering applied to healthy and pathological aortic arches. IEEE Trans Biomed Eng. 2017;64(10):2373–83. pmid:28221991
  31. Harmon KG, Zigman M, Drezner JA. The effectiveness of screening history, physical exam, and ECG to detect potentially lethal cardiac disorders in athletes: a systematic review/meta-analysis. J Electrocardiol. 2015;48(3):329–38. pmid:25701104
  32. Rosati S, Agostini V, Knaflitz M, Balestra G. Muscle activation patterns during gait: a hierarchical clustering analysis. Biomedical Signal Processing and Control. 2017;31:463–9.
  33. Jaadi Z. Principal Component Analysis (PCA): a step-by-step explanation. Built In. 2023.
  34. Abdelfattah OM, Martinez M, Sayed A, ElRefaei M, Abushouk AI, Hassan A, et al. Temporal and global trends of the incidence of sudden cardiac death in hypertrophic cardiomyopathy. JACC Clin Electrophysiol. 2022;8(11):1417–27. pmid:36424010
  35. Jones JC, Sugimoto D, Kobelski GP, Rao P, Miller S, Koilor C, et al. Parameters of cardiac symptoms in young athletes using the Heartbytes database. The Physician and Sportsmedicine. 2020;49(1):37–44.
  36. Austin AV, Owens DS, Prutkin JM, Salerno JC, Ko B, Pelto HF, et al. Do “pathologic” cardiac murmurs in adolescents identify structural heart disease? An evaluation of 15 141 active adolescents for conditions that put them at risk of sudden cardiac death. Br J Sports Med. 2022;56(2):88–94. pmid:33451997
  37. Umoh UA, Ntekop MM. A proposed fuzzy framework for cholera diagnosis and monitoring. IJCA. 2013;82(17):1–10.
  38. Yulita IN, Rosadi R, Purwani S, Suryani M. Multi-layer perceptron for sleep stage classification. J Phys: Conf Ser. 2018;1028:012212.
  39. Awad FH, Hamad MM, Alzubaidi L. Robust classification and detection of big medical data using advanced parallel K-means clustering, YOLOv4, and logistic regression. Life (Basel). 2023;13(3):691. pmid:36983845
  40. Lundberg SM, Lee SI. Consistent feature attribution for tree ensembles. arXiv preprint 2017. https://arxiv.org/abs/1706.06060v6
  41. Drezner JA, Harmon KG. Incidence of sudden cardiac death in athletes. In: Pelliccia A, Heidbuchel H, Corrado D, Borjesson M, Sharma S, editors. The ESC Textbook of Sports Cardiology. 2019.
  42. D’Ascenzi F, Valentini F, Pistoresi S, Frascaro F, Piu P, Cavigli L, et al. Causes of sudden cardiac death in young athletes and non-athletes: systematic review and meta-analysis: Sudden cardiac death in the young. Trends Cardiovasc Med. 2022;32(5):299–308. pmid:34166791
  43. Pelliccia A, Solberg EE, Papadakis M, Adami PE, Biffi A, Caselli S, et al. Recommendations for participation in competitive and leisure time sport in athletes with cardiomyopathies, myocarditis, and pericarditis: position statement of the Sport Cardiology Section of the European Association of Preventive Cardiology (EAPC). Eur Heart J. 2019;40(1):19–33. pmid:30561613
  44. Watson CJ, Stone GL, Overbeek DL, Chiba T, Burns MM. Performance-enhancing drugs and the Olympics. J Intern Med. 2022;291(2):181–96. pmid:35007384
  45. Adami PE, Koutlianos N, Baggish A, Bermon S, Cavarretta E, Deligiannis A, et al. Cardiovascular effects of doping substances, commonly prescribed medications and ergogenic aids in relation to sports: a position statement of the sport cardiology and exercise nucleus of the European Association of Preventive Cardiology. Eur J Prev Cardiol. 2022;29(3):559–75. pmid:35081615
  46. Sharma S, Merghani A, Mont L. Exercise and the heart: the good, the bad, and the ugly. Eur Heart J. 2015;36(23):1445–53. pmid:25839670