
Strokeformer: A novel deep learning paradigm training transformer-based architecture for stroke prognosis prediction

Abstract

Stroke, a common neurological disorder, is considered one of the leading causes of death and disability worldwide. Stroke prognosis involves using clinical characteristics collected from patients, presented in tabular form, to determine whether they are suitable for thrombolytic therapy. Transformer-based deep learning methods have achieved state-of-the-art performance in various classification tasks, but flaws remain in dealing with tabular data: these models and algorithms tend to overfit and exhibit performance degradation on small-scale, class-imbalanced datasets. Medical datasets are typically small and imbalanced because labelled medical data samples are scarce. Therefore, this study proposes a novel stroke prognosis prediction model called Strokeformer to address these issues. Specifically, novel intra- and interfeature interaction modules are designed to capture internal and mutual information among individual features for more effective latent representations. In addition, we explore performing the training process by pretraining on large-scale, class-balanced datasets and then fine-tuning on small-scale, class-imbalanced downstream datasets. This pretraining and fine-tuning paradigm is highly effective in preventing overfitting. To verify the effectiveness of the proposed model and training method, experiments are conducted on 20 public datasets from OpenML and two private stroke prognosis datasets provided by Shenzhen Fuyong People’s Hospital and The Affiliated Taizhou People’s Hospital of Nanjing Medical University, China, respectively. The results show that Strokeformer significantly outperforms the comparison models on the introduced datasets. The principal limitation of the model lies in its lack of interpretability from the clinicians’ perspective. Nevertheless, given that the interpretability of deep learning remains an open challenge, the promising empirical results achieved by Strokeformer on real-world stroke prognosis datasets highlight its potential to assist in clinical decision-making.

Introduction

Stroke is one of the leading causes of disability and death worldwide [1]. Projections suggest that by 2030, stroke could cause 12 million deaths and the loss of more than 200 million disability-adjusted life-years each year [2]. Stroke can be categorized into two major types depending on the cause: ischaemic stroke caused by vascular blockage and haemorrhagic stroke resulting from the rupture of a blood vessel. Among these, ischaemic stroke is the most common, accounting for approximately 71% of all stroke cases [3]. Ischaemic stroke disrupts the blood supply to a specific area of the brain, which causes damage to and death of neurons, leading to the loss of speech, motor skills, memory, and cognitive functions [4]. Patients may experience limb paralysis, difficulty walking, and problems swallowing. Additionally, cognitive and emotional aspects can also be affected, manifesting as memory decline, a lack of concentration, depression, and anxiety. In severe cases, it can be lethal [5].

The standard treatment for ischaemic stroke is thrombolytic therapy, which involves the use of thrombolytic agents to dissolve blood clots obstructing blood vessels and restoring cerebral blood flow. This therapy must be administered within a critical time window, typically within a few hours after the onset of stroke, to achieve optimal results [6]. In addition, not all patients are suitable candidates for this treatment, even within the time window [7]. Therefore, a comprehensive patient evaluation is needed before treatment to minimize risks. In the early stages of stroke prognosis, scoring methods are commonly used for clinical decision-making, which predict the patient outcome based on their clinical characteristics at the time of admission [8,9]. These methods are typically rule-based and developed from data specific to certain populations, resulting in poor generalization capabilities when applied to different populations. The application of artificial intelligence (AI) technologies has provided new approaches for stroke prognosis. By leveraging vast amounts of patient data, specifically developed AI models and algorithms have been demonstrated to help identify the intricate patterns and factors influencing stroke recovery, thereby increasing the accuracy of prognostic predictions.

Since stroke prognosis is essentially a data classification problem, many machine learning methods have been applied in the field of stroke prognosis prediction. Various machine learning techniques, including logistic regression (LR) [10], support vector machines (SVMs) [11], and random forests (RFs) [12], have been employed to predict functional outcomes in patients with ischaemic stroke [13], and the findings demonstrated that these machine learning approaches significantly outperformed rule-based scoring systems. In a previous study [14], three distinct machine learning models, artificial neural networks (ANNs), LR, and RFs, were utilized to predict long-term outcomes in patients with acute ischaemic stroke; this research revealed that the ANN outperformed both the RF and the LR. SVMs, RFs, and ANNs were used in [15] to analyse the data collected by the Taiwan Stroke Registry since 2006: from 206 selected clinical variables, 17 key features were identified from the ischaemic stroke dataset and 22 from the haemorrhagic stroke dataset. LR, RF, and XGBoost [16] were employed in [17] to analyse data from 4,237 acute ischaemic stroke patients. The clinical data and CT brain images of 116 acute ischaemic stroke patients were analysed in [18] to train an SVM aimed at predicting symptomatic intracranial haemorrhage; the results demonstrated that the SVM significantly outperformed traditional scoring systems such as SEDAN [19] and HAT [20]. However, machine learning methods often rely on feature engineering [21], which is not only time-consuming but also requires substantial expertise. Additionally, machine learning methods have limitations in capturing complex data relationships [22,23].

In contrast, deep learning models, with their multilayer structures and nonlinear activation functions, can learn more complex data patterns and relationships [24–26]. Therefore, researchers have attempted to apply deep learning methods to the domain of stroke prognosis prediction. Deep convolutional neural networks have been employed to predict the final lesion volume in stroke patients [27]; this model was capable of automatically identifying and integrating acute-phase imaging features, thereby enhancing predictive accuracy. Furthermore, the benefits of combining clinical data with neuroimaging features for predicting the three-month prognosis of acute ischaemic stroke patients have been demonstrated [28]. With the swift development of deep learning in recent years, more effective models have been proposed for tabular data classification problems. The transformer [29] is an innovative deep learning architecture that has driven significant advances in various fields, such as natural language processing [30] and image recognition [31], and it has also been applied to tabular data classification problems [32]. Although the introduction of machine learning and deep learning methods has greatly improved the performance of stroke prognosis prediction models, conventional learning systems still suffer from problems such as limited data size and weak generalization capabilities because of class-imbalanced training sets.

Therefore, in this study, a novel transformer-based deep learning model named Strokeformer is proposed to address the problems described above. Strokeformer includes a feature embedding layer, a feature interaction layer, and an output layer. The feature embedding layer performs encoding of the input data. The feature interaction layer consists of an intrafeature interaction module and an interfeature interaction module, where the intrafeature interaction module applies a linear layer to individual feature vectors to learn information within each feature, and the interfeature interaction module contains several transformer encoder layers to capture the information between feature embeddings. The output layer projects the features learned from the former structure through a linear layer to obtain the final prediction result. Furthermore, this study applies a pretraining and fine-tuning learning paradigm in response to the problems of overfitting and performance deterioration caused by data imbalance. A large amount of unlabelled tabular data is first used for self-supervised learning in the pretraining phase, and then the small-scale target dataset with labelled samples is utilized for fine-tuning. This pretraining and fine-tuning method can effectively enhance the generalization ability of the model and alleviate overfitting on small-scale, class-imbalanced datasets. Experiments are performed on 19 benchmark tabular datasets and two real-world stroke prognosis datasets. The results show that, compared with other competitive methods, the proposed model and learning algorithm significantly improve the prediction results on the stroke prognosis datasets. The main contributions of this study can be summarized as follows:

  • A novel Strokeformer model is introduced for stroke prognosis prediction, which contains customized intrafeature and interfeature interaction modules to capture internal and mutual information among individual features.
  • A sophisticated learning paradigm is proposed to train Strokeformer to overcome overfitting and class imbalance issues, including a pretraining process on a large-scale, balanced dataset and a fine-tuning process on the downstream dataset.
  • Extensive experiments have demonstrated that the proposed Strokeformer outperforms other state-of-the-art machine-learning and deep-learning methods on both benchmarks and two real-world stroke prognosis tasks.

The rest of the paper is organized as follows. The corresponding research on transformer-based models for medical data analysis and the latest machine learning and deep learning methods for tabular data classification tasks is summarized in the Related works section. Links to the code and data are provided in the Materials and ethics statement section. The structural design and dataflow of the proposed Strokeformer model are described in the Model architecture section. The training algorithm, consisting of a self-supervised pretraining phase and a supervised fine-tuning phase, is illustrated in the Learning paradigm section. The experimental results of the proposed model and algorithm, compared with those of conventional methods on all the introduced datasets, are presented in the Experiments section. The Discussion section discusses some implementation details of the proposed model and algorithm, including an ablation study of the novel intrafeature interaction module and a sensitivity analysis of some variants of the model. Finally, conclusions are given in the Conclusions section.

Related works

On the one hand, the application of the transformer model in medical data analysis is explored. Transformer models have been widely applied in this field and have been successfully employed in various tasks, including image synthesis and reconstruction, registration, segmentation, detection, and diagnosis [33]. The Vision Transformer (ViT) [31] was further advanced by introducing a novel visual attention mechanism and achieved optimal performance on the COVID-19 diagnostic task without requiring pretraining on ImageNet [34]. By investigating the impact of image patch size when using ViT for tasks such as lesion lung classification and COVID-19 diagnosis, [35] revealed that increasing the patch size led to a decline in model performance, highlighting the trade-off between local and global information. A global-local transformer model designed for rapid brain age estimation using magnetic resonance imaging was proposed in [36]. The model consists of a global pathway that extracts contextual information from the entire input image and a local pathway that captures fine-grained details from image patches. Compared with other models, this approach significantly enhances the accuracy of brain age prediction. A transformer-based representation learning model designed to assist in clinical diagnosis was proposed in [37]. The model is capable of handling multimodal inputs, including patient X-rays, unstructured chief complaints, and structured clinical history data. Bidirectional intramodal and cross-modal attention layers are employed to learn comprehensive representations. This model outperforms image-only and nonintegrated multimodal diagnostic models in tasks such as identifying lung disease and predicting adverse clinical outcomes in COVID-19 patients.

On the other hand, state-of-the-art machine learning and deep learning-based tabular data classification solutions are also investigated in this study, and several classic methods are employed as competitors of the proposed model and algorithm in our experiments. XGBoost is an efficient machine-learning algorithm based on gradient-boosting decision trees; it uses a precise approximation method for finding split points together with regularization techniques, which effectively avoids overfitting while enhancing predictive accuracy and inference speed. CatBoost is also a gradient-boosting decision tree-based algorithm that automatically handles categorical features without requiring complex preprocessing; it introduces symmetric trees as base models, using identical splitting conditions at each layer, which accelerates training and enhances generalization [38]. The adaptive relation modeling network (ARMNet) transforms input features into an exponential space and dynamically sets cross orders and cross weights for each feature, selectively modelling relationships between features and handling cross features of arbitrary order [39]. TabNet, a neural network architecture for tabular data, employs a multistep decision-making mechanism that selects features for inference at each step through a sequential attention mechanism, thereby achieving model interpretability while enhancing learning efficiency [40]. TabTransformer applies the self-attention-based transformer architecture to tabular data, converting categorical feature embeddings into context-rich embeddings and thereby improving predictive accuracy in supervised and semisupervised learning scenarios. TabTranSELU is a simple yet effective adaptation of the transformer architecture for tabular data: the features, alongside their respective names, are encoded and fed into an enhanced transformer structure with scaled exponential linear unit activation [41]. SAINT combines self-attention and intersample attention to capture complex interactions across features and data points, and both transformer blocks help advance scalability and reduce computational overhead [42]. MambaTab is a recent approach for tabular data based on an emerging structured state-space model variant named Mamba; it has been shown to efficiently extract effective representations from data with long-range dependencies [43]. The Feature-Tokenizer Transformer (FT-Transformer) combines feature tokenization with the transformer architecture, processing categorical and numerical features differently and achieving superior performance [44].

Materials and ethics statement

Code for the Strokeformer model and its pretraining and fine-tuning learning paradigm is available at https://github.com/jhc050998/Strokeformer. The two private stroke prognosis datasets used in this paper are available at https://ieee-dataport.org/documents/stroke-prognosis-dataset-taizhou-and-fuyong. The datasets were accessed on October 27, 2024. All the medical records were fully anonymized before we accessed them. The authors do not have access to information that could identify individual participants during or after data collection. This research passed the formal ethical review process: we received Ethical Review Approval Letter No. KY-2024-35 from Shenzhen Fuyong People’s Hospital and Ethical Review Committee Approval Letter No. KY 2024-018-01 from the Clinical Research Ethics Committee of Taizhou People’s Hospital.

Model architecture

As shown in Fig 1, the Strokeformer architecture comprises three main parts: the feature embedding layer, the feature interaction layer, and the output layer. The feature embedding layer adopts two different encoding methods for categorical and numerical features. The feature interaction layer consists of an intrafeature interaction module followed by an interfeature interaction module. A stack of transformer layers is built in the interfeature interaction module, each consisting of a multihead self-attention layer followed by a positionwise feedforward layer. The output layer contains a linear layer to provide the final prediction result.

Fig 1. Description of the model architecture of Strokeformer. The model contains a feature embedding layer, a feature interaction layer consisting of an intrafeature interaction module and an interfeature interaction module, and an output layer that outputs the prediction result.

https://doi.org/10.1371/journal.pone.0330530.g001

Dataflow overview

Let $(X, y)$ denote a feature–target pair, where $X^{\mathrm{num}} \in \mathbb{R}^{n}$ denotes all the $n$ numerical features and $X^{\mathrm{cat}}$ denotes all of the $m$ categorical features. For the embedding of the numerical features, the Hadamard product is used to multiply $X^{\mathrm{num}}$ with a parameter matrix $W$, and a learnable bias matrix $B^{\mathrm{num}}$ is added to improve the capacity of the model. $T^{\mathrm{num}} \in \mathbb{R}^{n \times d}$ represents the matrix of embeddings for all the numerical features. Let $X^{\mathrm{cat}} = (x_1, \ldots, x_m)$, with each $x_i$ being a categorical feature, for $i = 1, \ldots, m$. Each of the categorical features is embedded into a parametric embedding of dimension $d$ using a lookup table approach, which is explained below in detail. Let $T^{\mathrm{cat}}_i$ for $i = 1, \ldots, m$ be the embedding of the feature $x_i$ and $T^{\mathrm{cat}} \in \mathbb{R}^{m \times d}$ be the matrix of embeddings for all the categorical features.

Then, the numerical embeddings are concatenated with the categorical embeddings to form an $(n+m) \times d$ matrix $T$. The matrix of all the parametric embeddings is input into the intrafeature interaction module, which contains a multilayer perceptron (MLP) structure, to learn the information within each feature. Next, the output of the intrafeature interaction module is input into the interfeature interaction module consisting of $N$ transformer layers followed by a feedforward layer. Each parametric embedding is transformed into a contextual embedding when output from the interfeature interaction module through successive aggregation of context from the other embeddings. We denote the intrafeature interaction module as a function $f_\phi$ and the interfeature interaction module as $f_\psi$. The functions $f_\phi$ and $f_\psi$ operate on the parametric embeddings in order and return the corresponding contextual embeddings. These embeddings are input into an MLP, denoted by $g_\theta$, to predict the target $y$.

Let $H$ be the cross-entropy loss for classification tasks. The following loss function $L(X, y)$ is minimized to learn all the Strokeformer parameters via end-to-end learning with first-order gradient methods. The Strokeformer parameters include $W$, $B^{\mathrm{num}}$, and $B^{\mathrm{cat}}$ for the feature embedding layer; $\phi$ and $\psi$ for the feature interaction layer; and $\theta$ for the top MLP structure in the output layer. The overall loss function can be described as follows:

$$L(X, y) = H\big(g_\theta\big(f_\psi\big(f_\phi(T)\big)\big),\ y\big) \tag{1}$$

Below, more implementation details of each layer and module are explained.

Feature embedding layer

Considering that tabular data typically include two different kinds of features, numerical features and categorical features, Strokeformer adopts different embedding methods for these two types of features. On the one hand, for numerical features, embedding via the Hadamard product is described as follows:

$$T^{\mathrm{num}} = X^{\mathrm{num}} \odot W + B^{\mathrm{num}} \tag{2}$$

where $X^{\mathrm{num}}$ is the matrix of raw numerical features in the input data and $T^{\mathrm{num}}$ denotes the matrix of the numerical embeddings. $W$ and $B^{\mathrm{num}}$ represent the learnable parameter matrix and bias matrix, respectively. On the other hand, for each categorical feature termed $x_i$, we have an embedding lookup table, for $i = 1, \ldots, m$. The lookup table is an $S \times d$ matrix, where $S$ represents the number of all categories in the input table data and $d$ represents the dimension of the embedding. Each category corresponds to a unique vector. For example, assume that the representation of gender is 1 for males; then, the embedding vector for this category is the vector in the first row of the lookup table. After all the corresponding vectors in the lookup table are obtained and a bias matrix $B^{\mathrm{cat}}$ is added, the result serves as the final embedding representation for the categorical features. It can be summarized as follows:

$$T^{\mathrm{cat}} = \mathrm{Lookup}\big(X^{\mathrm{cat}}\big) + B^{\mathrm{cat}} \tag{3}$$

where $T^{\mathrm{cat}}$ denotes the matrix of the categorical embeddings, $X^{\mathrm{cat}}$ is the matrix of raw categorical features in the input table data, and $\mathrm{Lookup}(X^{\mathrm{cat}})$ represents the matrix consisting of all the corresponding vectors grabbed from the lookup table. Finally, by concatenating the embeddings of the numerical and categorical features, the output of the feature embedding layer can be described as:

$$T = \big[\,T^{\mathrm{num}} ;\ T^{\mathrm{cat}}\,\big] \in \mathbb{R}^{(n+m) \times d} \tag{4}$$
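As a concrete illustration of the two embedding paths and their concatenation, the following NumPy sketch builds the numerical embeddings via a broadcast Hadamard product, grabs categorical embeddings from a lookup table, and stacks them. All sizes, the random parameter values, and the single shared lookup table are toy assumptions, not the released implementation.

```python
import numpy as np

# Toy sizes: n numerical features, m categorical features, embedding dim d,
# S total categories (all illustrative).
n, m, d, S = 3, 2, 4, 10
rng = np.random.default_rng(0)

W = rng.normal(size=(n, d))       # learnable weight matrix for numerical features
B_num = rng.normal(size=(n, d))   # bias matrix for numerical embeddings
lookup = rng.normal(size=(S, d))  # lookup table: one row per category
B_cat = rng.normal(size=(m, d))   # bias matrix for categorical embeddings

x_num = rng.normal(size=(n, 1))   # raw numerical features of one sample
x_cat = np.array([1, 7])          # raw categorical codes, each in [0, S)

T_num = x_num * W + B_num                    # Eq (2): Hadamard product plus bias
T_cat = lookup[x_cat] + B_cat                # Eq (3): lookup plus bias
T = np.concatenate([T_num, T_cat], axis=0)   # Eq (4): (n+m) x d embedding matrix
print(T.shape)
```

In a trained model, `W`, the biases, and the lookup table would be learned parameters rather than random draws.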

Intrafeature interaction module

The intrafeature interaction module constitutes one of the core parts of the Strokeformer model. Combined with the interfeature interaction module, it deeply explores the interrelationships among the input feature embeddings. To effectively learn the information within a single embedding, a feedforward neural network module is designed for each feature, with its architecture shown in Fig 1. The operation process of the intrafeature interaction module is summarized as follows:

$$\tilde{T}_i = \mathrm{Linear}\Big(\mathrm{Dropout}\big(\mathrm{ReLU}\big(\mathrm{Linear}(T_i)\big)\big)\Big) \tag{5}$$

where $T_i$ represents the input embedding of the $i$th feature and $\tilde{T}_i$ denotes the output. A two-layer network structure is utilized to learn the information within each feature, where Linear represents a linear layer operation. A ReLU activation function is added between the two linear layers to capture nonlinear relationships within the input embeddings. In addition, the Dropout mechanism is introduced to reduce overfitting and enhance the model’s generalization ability; it is placed after the ReLU function and before the second linear layer.

The module is specifically designed in this study to extract information within individual features, enhancing Strokeformer’s power on datasets containing a limited number of samples. In the implementation, because the same feedforward network structure described in Eq (5) is used for each feature, the calculations are performed in parallel to improve the model’s computational efficiency.
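A minimal NumPy sketch of the per-feature operation in Eq (5) follows; the hidden width and the random weights are illustrative stand-ins for learned parameters, and applying one matrix product per layer processes all feature embeddings in parallel, as described above.

```python
import numpy as np

def intra_feature(T, W1, b1, W2, b2, drop_p=0.0, rng=None):
    """Eq (5) sketch: Linear -> ReLU -> Dropout -> Linear, applied to all
    feature embeddings in parallel via one matrix product per layer."""
    H = np.maximum(T @ W1 + b1, 0.0)      # first linear layer + ReLU
    if drop_p > 0.0 and rng is not None:  # inverted dropout (training mode only)
        keep = rng.random(H.shape) >= drop_p
        H = H * keep / (1.0 - drop_p)
    return H @ W2 + b2                    # second linear layer

rng = np.random.default_rng(1)
d, h = 4, 8                               # embedding dim and hidden width (toy)
T = rng.normal(size=(5, d))               # five feature embeddings
out = intra_feature(T, rng.normal(size=(d, h)), np.zeros(h),
                    rng.normal(size=(h, d)), np.zeros(d))
print(out.shape)
```

Note that the output keeps the input shape, so the result can be fed directly to the interfeature interaction module.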

Interfeature interaction module

The interfeature interaction module contains a stack of $N$ encoder layers derived from the transformer. Each encoder layer consists of a multihead self-attention layer followed by a feedforward layer, with layer normalization being performed before each layer, drawing on the PreNorm model [45]. A self-attention layer comprises three parametric projections: key, query, and value. Each input embedding is projected through these matrices to generate its key, query, and value vectors. Formally, let $K$, $Q$, and $V$ be the matrices comprising the key, query, and value vectors of all the embeddings, respectively, where $n+m$ embeddings are input to the self-attention layer and $d_k$ and $d_v$ are the dimensions of the key and value vectors, respectively. Every input embedding attends to all the other embeddings through an attention head, which is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{6}$$

where $\mathrm{softmax}\big(QK^{\top}/\sqrt{d_k}\big)$ (termed $A$) is called the attention matrix. For each embedding, $A$ calculates how much it attends to the other embeddings, thus transforming it into a contextual embedding. The output of the attention head, of dimension $d_v$, is projected back to an embedding of dimension $d$ through a fully connected structure consisting of two linear layers and an activation function.
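The single-head attention computation of Eq (6) can be sketched as follows; the shapes are toy values, and a full multihead layer would additionally learn per-head key/query/value projections and an output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq (6): A = softmax(Q K^T / sqrt(d_k)); output = A V."""
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V, A

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out, A = attention(Q, K, V)
# Each row of A is a probability distribution over the embeddings.
print(np.allclose(A.sum(axis=1), 1.0))
```

Because each row of $A$ sums to one, every output embedding is a convex combination of the value vectors, which is what turns a parametric embedding into a contextual one.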

Let $T^{(j)}$ represent the output of the $j$th transformer encoder layer, with the input $T^{(0)} = \tilde{T}$, the output of the intrafeature interaction module. $N$ represents the number of encoder layers. The computational process of the $j$th encoder layer is as follows:

$$T^{(j)} = \mathrm{Layer}_j\big(T^{(j-1)}\big) = \mathrm{AddNorm}\Big(\mathrm{MLP}\Big(\mathrm{AddNorm}\big(\mathrm{MultiHead}\big(T^{(j-1)}\big)\big)\Big)\Big) \tag{7}$$

where $\mathrm{Layer}_j$ indicates the operation of the $j$th encoder layer, $\mathrm{AddNorm}$ represents a residual connection along with a layer normalization operation, $\mathrm{MultiHead}$ denotes the multihead attention operation, and $\mathrm{MLP}$ indicates the feedforward neural network module including two linear layers and an activation function between them. With $N$ stacked encoder layers, the entire operating process of the interfeature interaction module is summarized as follows:

$$T^{(N)} = \mathrm{Layer}_N\Big(\mathrm{Layer}_{N-1}\big(\cdots \mathrm{Layer}_1\big(T^{(0)}\big)\cdots\big)\Big) \tag{8}$$

The output feature matrices of the first and last layers of the transformer are denoted as $T^{(1)}$ and $T^{(N)}$, respectively. The final Mean operation, referring to averaging along the first dimension of the tensor, is given by:

$$Z = \mathrm{Mean}\big(T^{(N)}\big) \tag{9}$$

where $Z$ represents the final output of the interfeature interaction module.

This module is designed to utilize the power of the transformer structure. Layer normalization and residual connections are introduced to ensure that the transformer layers converge. The module works as a main part of the proposed Strokeformer encoder, which aims to capture the information between the features effectively.

Output layer

For the output matrix $Z$ from the feature interaction layer, the final classification result of the Strokeformer model is obtained by introducing a linear layer in the output layer, whose operation is described as follows:

$$\hat{y} = \mathrm{Linear}(Z) \tag{10}$$

where $\hat{y}$ represents the final predictive output of the Strokeformer model. The output dimension of the Linear layer in Eq (10) is one, indicating the probability that the model assigns to the positive class. Typically, a sigmoid function is used to constrain the final output to the range between 0 and 1, and a threshold of 0.5 is set: when the output probability is equal to or greater than the threshold, the prediction is positive; otherwise, it is negative. The binary cross-entropy (BCE) with logits loss function is used to calculate the error between the predicted value and the sample label $y$; this loss function conducts the sigmoid transformation and the computation of the BCE simultaneously.
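The thresholding and loss computation described above can be sketched as follows; `bce_with_logits` is a hand-rolled, numerically stable equivalent of the combined sigmoid-plus-BCE loss mentioned in the text, and the logits and labels are toy values.

```python
import numpy as np

def bce_with_logits(z, y):
    """Stable BCE on raw logits: max(z,0) - z*y + log(1 + exp(-|z|))."""
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([2.0, -1.5, 0.0])     # raw outputs of the final linear layer (toy)
y = np.array([1.0, 0.0, 1.0])      # binary ground-truth labels

prob = 1.0 / (1.0 + np.exp(-z))    # sigmoid squashes logits into (0, 1)
pred = (prob >= 0.5).astype(int)   # 0.5 decision threshold
loss = bce_with_logits(z, y).mean()
print(pred, float(loss))
```

Working on raw logits rather than probabilities avoids `log(0)` overflow, which is the same reason deep learning libraries fuse the sigmoid and the BCE into one operation.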

Learning paradigm

Strokeformer can be trained in an end-to-end supervised manner via labelled examples. However, to address the overfitting problem, the proposed training method comprises a self-supervised pretraining process to learn feature representations from unlabelled data, followed by a fine-tuning process on a target-supervised task.

As shown in Fig 2, two prediction tasks, namely, predicting the masked features and determining whether the features are masked, are employed in the pretraining phase. First, mask matrices are randomly initialized, and the original input features are masked to obtain the masked features. Then, the masked features are processed through the feature embedding layer and the feature interaction layer of the Strokeformer model, and different additional linear layers are used to separately predict the mask matrix and the masked features. Finally, the mask loss and feature loss are summed to obtain the final loss value, which is minimized to train the model. The pretraining process uses a large amount of unlabelled data to conduct self-supervised learning, enabling the model to learn correlations among features. After that, the parameters of the pretrained model are reused and trained on a smaller labelled dataset in the fine-tuning phase.

Fig 2. Pretraining and fine-tuning paradigm for the Strokeformer model.

In the pretraining phase, the feature interaction layer is trained on a large-scale, unlabelled dataset from another corresponding area via self-supervised learning. In the fine-tuning phase, the parameters in the feature interaction layer are frozen, and the feature embedding layer and the output layer are trained on the small-scale target dataset via supervised gradient-based learning.

https://doi.org/10.1371/journal.pone.0330530.g002

Owing to the difficulties in obtaining labelled clinical data, the stroke prognosis challenge naturally faces the problems of an insufficient number of samples and class imbalance in the dataset. The learning paradigm, consisting of pretraining and fine-tuning, is designed to handle these problems. A large-scale dataset with well-balanced samples is employed in the pretraining stage, which is believed to help reduce the overfitting problem caused by the limited sample counts and imbalance existing in the target medical dataset. Details of the implementation of the pretraining and fine-tuning processes are described in the following sections.

Pretraining

Let $m$ denote the binary mask matrix and $g_{\mathrm{mask}}$ denote the masked feature generation function. The process of obtaining the masked features from the original features can be represented as:

$$\tilde{X} = g_{\mathrm{mask}}\big(X, \bar{X}, m\big) = (1 - m) \odot X + m \odot \bar{X} \tag{11}$$

where $X \in \mathbb{R}^{k \times d}$ signifies the original feature matrix, $\bar{X}$ refers to the feature matrix obtained by randomly shuffling the order of the samples in the original feature matrix $X$, and $\tilde{X}$ represents the masked feature matrix. $k$ represents the number of samples, and $d$ indicates the feature dimension. Each $m_{ij}$ ($1 \le i \le k$, $1 \le j \le d$) in $m$ is generated from a Bernoulli distribution, and $m_{ij} = 1$ indicates that the $j$th feature of the $i$th sample is masked.
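The corruption step of Eq (11) can be sketched in NumPy as follows; the masking probability of 0.3 and the matrix sizes are assumed toy values, not the hyperparameters used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
k, d = 6, 4                               # k samples, d features (toy sizes)
X = rng.normal(size=(k, d))               # original feature matrix
X_bar = X[rng.permutation(k)]             # row-shuffled copy of X
m = rng.binomial(1, 0.3, size=(k, d))     # Bernoulli mask; 1 = masked position

# Eq (11): masked positions take their values from the shuffled matrix.
X_tilde = (1 - m) * X + m * X_bar
print(X_tilde.shape)
```

Drawing replacement values from a shuffled copy keeps each corrupted entry within the empirical marginal distribution of its own column, which makes the mask-detection task nontrivial.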

The masked feature matrix is fed to the former part of Strokeformer (feature embedding layer and feature interaction layer, termed s) to yield the output Z. The two tasks adopted by pretraining, i.e., predicting the masked features and whether the features are masked, can be depicted as follows:

$$\hat{m} = g_m(Z), \qquad \hat{X} = g_f(Z) \tag{12}$$

where $\hat{m}$ represents the predicted mask matrix, $g_m$ denotes the linear layer used for predicting the mask matrix, $\hat{X}$ represents the predicted features, and $g_f$ indicates the linear layer utilized for predicting the original features.

If s refers to the former part of the Strokeformer model without the final classifier, the parameters s, gf, and gm are adjusted through backpropagation by computing loss values. The entire process of pretraining can be represented as:

$$\min_{s,\, g_m,\, g_f}\ \mathbb{E}_{X \sim p_X,\; m \sim p_m}\Big[\, L_m\big(\hat{m}, m\big) + \alpha\, L_f\big(\hat{X}, X\big) \Big] \tag{13}$$

where $\mathbb{E}$ represents the calculated expected value of the loss function, $p_X$ is the probability distribution of the original features, $p_m$ is the Bernoulli distribution utilized to generate the mask matrix, $\alpha$ indicates the weight coefficient, $L_m$ denotes the BCE loss function for predicting the mask matrix, and $L_f$ signifies the loss function for predicting the features: mean squared error for numerical features and cross-entropy for categorical features.
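Restricting to numerical features for brevity, the combined objective of Eq (13) can be sketched as follows; the weight $\alpha$, the hand-rolled BCE, and all tensors are illustrative assumptions rather than the paper's actual settings.

```python
import numpy as np

def bce(p, t, eps=1e-9):
    """Elementwise binary cross-entropy on probabilities, averaged."""
    return -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)).mean()

rng = np.random.default_rng(4)
m = rng.binomial(1, 0.3, size=(6, 4)).astype(float)  # true mask matrix
m_hat = np.clip(rng.random((6, 4)), 1e-6, 1 - 1e-6)  # predicted mask probabilities
X = rng.normal(size=(6, 4))                          # original numerical features
X_hat = X + 0.1 * rng.normal(size=(6, 4))            # reconstructed features

alpha = 1.0                                          # assumed weight coefficient
loss = bce(m_hat, m) + alpha * np.mean((X_hat - X) ** 2)  # Eq (13) summand
print(loss > 0.0)
```

In the full method, categorical features would contribute a cross-entropy reconstruction term instead of the mean squared error used here.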

Fine-tuning

In this stage, the parameters from the transformer encoder layers of the pretrained model are reused and frozen during the supervised learning process. Consequently, the primary focus of training in the fine-tuning phase is on the embedding layer, which generates the embeddings of the input features, and the output layer, which provides the final predictions. The optimization process of fine-tuning is summarized as follows:

$$\min_{e,\, g}\ \mathbb{E}_{(X, y) \sim p_{Xy}}\Big[\, L_s\big(g(f(e(X))),\ y\big) \Big] \tag{14}$$

where $\mathbb{E}$ represents the expectation of the loss function, $X$ denotes the input features, $y$ represents the labels, and $(X, y)$ is sampled according to the joint probability distribution $p_{Xy}$. $L_s$ is the BCE loss function. $e$ represents the feature embedding layer, $f$ indicates the feature interaction layer, and $g$ is the output layer that provides the predicted label values. In addition, an early stopping strategy [46] is incorporated during the model training phase to prevent overfitting.
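The selective updating of Eq (14) can be illustrated with a toy gradient step in which the interaction-layer parameters are excluded from the trainable set; all names, values, and the learning rate are illustrative, not taken from the released code.

```python
import numpy as np

# Toy parameter groups mirroring the three Strokeformer parts (illustrative).
params = {"embed": np.ones(3), "interact": np.ones(3), "output": np.ones(3)}
trainable = {"embed", "output"}            # interaction layer stays frozen
grads = {name: np.full(3, 0.1) for name in params}

lr = 0.5
for name in params:
    if name in trainable:                  # skip frozen parameter groups
        params[name] = params[name] - lr * grads[name]

print(params["interact"], params["embed"])
```

After the step, the frozen interaction-layer weights are unchanged while the embedding and output layers have moved, which mirrors reusing pretrained transformer encoder weights during fine-tuning.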

Experiments

In this section, the 22 introduced binary classification datasets and the five performance metrics used are first described in the Dataset description and Performance metrics subsections, respectively. More details about the experimental design and setup are presented in the Experimental setup subsection. Finally, the classification performance of the proposed Strokeformer model and training algorithm is compared against that of several state-of-the-art machine learning methods and deep learning methods specially developed for classification on tabular data, including other stroke prognosis prediction methods.

Dataset description

To investigate the classification performance of the proposed model and training algorithm, 19 benchmark datasets sourced from OpenML [47] and two private stroke prognosis datasets provided by Shenzhen Fuyong People’s Hospital (termed the Fuyong dataset) and The Affiliated Taizhou People’s Hospital of Nanjing Medical University (termed the Taizhou dataset) are used in our experiments. In addition, the pretraining phase of the proposed learning algorithm uses another open-source dataset called the Covertype dataset, which is also publicly available from OpenML. Table 1 summarizes the number of samples, the number of positive and negative samples, the number of attributes, and the attribute characteristics of all the introduced datasets.

Table 1. Description of the employed classification datasets.

https://doi.org/10.1371/journal.pone.0330530.t001

In this study, the 19 introduced benchmarks are divided into two subsets: nine small-scale datasets and ten additional datasets with relatively large scales. On the one hand, as shown in Table 1, the small-scale datasets each contain several hundred instances, and the number of features varies across these datasets, ranging from 4 to 41. Notably, Credit-A and Ilpd contain both numerical and categorical features, Tic has only categorical features, and the other datasets include only numerical features. Serious imbalances can be observed in datasets such as Blood and Kc2. On the other hand, the large-scale datasets span a broad range of sample counts, varying from 4,966 to 58,252, and incorporate a mix of categorical and numerical features. The feature counts of these datasets range from 8 to 31. Table 1 shows that the Bank, Customer, and Income datasets present imbalanced scenarios, with positive-to-negative ratios of 1:7, 1:2, and 1:3, respectively. In contrast, the remaining seven datasets exhibit a balanced ratio of positive to negative samples.

Labelled medical datasets, such as stroke prognosis records, are scarce because collecting and labelling these records typically requires substantial time and effort from domain experts. Therefore, the training paradigm of performing pretraining on large-scale datasets from other domains (without using their labels) and then using small-scale labelled medical datasets for fine-tuning is investigated in this study. The Covertype dataset introduced in our pretraining is provided by the United States Geological Survey and the United States Forest Service to predict forest cover types based on cartographic data. It comprises ten numerical attributes, such as elevation, aspect, and slope, along with 44 categorical attributes encompassing four wilderness areas and 40 soil types. There are seven forest cover types, including spruce/fir, lodgepole pine, and ponderosa pine, among others. This study selects the spruce/fir and lodgepole pine classes for a binary classification task. The experimental results demonstrate the effectiveness of the pretraining process, which was conducted on data from another domain.

Finally, two private stroke prognosis datasets are adopted to estimate the performance of the proposed methods on real-world tasks. The Fuyong dataset contains records of 117 stroke patients who received treatment at Shenzhen Fuyong People’s Hospital between March 16, 2022, and August 27, 2024. In addition, the medical records of 248 stroke patients treated at The Affiliated Taizhou People’s Hospital of Nanjing Medical University between January 1, 2020, and December 31, 2023, were included in the Taizhou dataset. The two private stroke prognosis datasets meticulously track symptoms and their changes in patients before and after thrombolysis therapy, encompassing variables such as patient age, sex, prethrombolysis National Institutes of Health Stroke Scale (NIHSS) score [48], postthrombolysis NIHSS score, and the presence of hypertension, atrial fibrillation, and diabetes, among other clinical indicators. Notably, the NIHSS score is a crucial metric used to gauge the severity of a stroke, with lower scores indicating milder strokes. This study defines patients whose NIHSS score decreases by four points or more after thrombolysis as positive samples; otherwise, they are considered negative samples. Samples containing missing values are discarded during the preprocessing phase. The datasets exhibit class imbalance, with a ratio of approximately 1:4.1 between positive and negative samples in the Fuyong dataset and 1:1.9 in the Taizhou dataset. The distributions of NIHSS scores from 0 to 36 among patients recorded in the Fuyong and Taizhou datasets are shown in Figs 3 and 4, respectively. The concentration of scores at the lower end of the scale in both datasets indicates that milder stroke cases are much more common than severe cases. In the figures, the red and yellow dashed lines mark the mean and median scores, respectively, further illustrating the skewed nature of the two datasets; the standard deviations are also shown. These visualizations also highlight the heterogeneity of the two real-world stroke datasets.
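The labelling rule described above (a decrease of four or more NIHSS points after thrombolysis counts as a positive sample) can be sketched as a simple function; the function name is ours.

```python
def prognosis_label(nihss_pre, nihss_post):
    """Label a patient as positive (1) if the NIHSS score drops by
    four points or more after thrombolysis, else negative (0)."""
    return 1 if nihss_pre - nihss_post >= 4 else 0

# Three toy patients: drops of 7, 2 and exactly 4 points.
labels = [prognosis_label(12, 5), prognosis_label(8, 6), prognosis_label(4, 0)]
```

Applying this rule to every record (after discarding rows with missing values) yields the class-imbalanced label vectors described for the Fuyong and Taizhou datasets.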

Fig 3. Prethrombolysis NIHSS score distribution in the Fuyong dataset.

https://doi.org/10.1371/journal.pone.0330530.g003

Fig 4. Prethrombolysis NIHSS score distribution in the Taizhou dataset.

https://doi.org/10.1371/journal.pone.0330530.g004

Performance metrics

To evaluate the classification performance of the proposed Strokeformer model comprehensively, five widely used classification metrics for binary classification tasks are employed, including accuracy (Acc), precision (Pre), recall (RCL), F-measure (F1), and area under the receiver operating characteristic curve (AUC) [49]. Their mathematical expressions are given as follows:

\text{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}

\text{Pre} = \frac{TP}{TP + FP} \tag{16}

\text{RCL} = \frac{TP}{TP + FN} \tag{17}

F1 = \frac{2 \times \text{Pre} \times \text{RCL}}{\text{Pre} + \text{RCL}} \tag{18}

\text{AUC} = \int_0^1 \text{TPR}\; d\,\text{FPR} \tag{19}

where TP and TN indicate the number of correctly classified positive and negative samples, respectively, and where FP and FN represent the number of instances misclassified as positive and negative, respectively. Acc measures the proportion of correctly predicted samples out of all the records, offering a general assessment of classification performance. However, in cases of imbalanced data distribution, Acc alone fails to fully reflect the actual performance of the model. Pre, also referred to as the positive predictive value, represents the ratio of correctly predicted positive samples to all samples predicted as positive. In the context of stroke prognosis prediction, Pre is particularly crucial. Compared with failure to administer thrombolytic therapy in a timely manner to eligible patients, mistakenly applying thrombolytic treatment to those who do not meet the criteria poses a greater risk. Incorrect treatment decisions may subject patients to unnecessary health risks, especially for those with contraindications, where inappropriate thrombolytic therapy could lead to serious, potentially fatal bleeding events, worsening the patient’s prognosis. RCL, also known as sensitivity, indicates the proportion of true positives correctly identified and reflects the model’s ability to correctly identify positive instances from the actual positives. Pre and RCL often exhibit a trade-off relationship. The F1 score provides a comprehensive performance indicator through the harmonic mean of Pre and RCL; by balancing the two, it has become a significant metric for assessing performance in complex prediction tasks. In this study, AUC was selected as the core metric. AUC quantifies the overall ability of the model to discriminate between positive and negative samples across all decision thresholds (the true positive rate integrated over the false positive rate), with values closer to 1 indicating better model performance. Because it integrates over all thresholds, AUC is also relatively insensitive to class imbalance.
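The five metrics can be computed directly from a binary confusion matrix and from predicted scores. The sketch below uses the rank-free pairwise (Mann-Whitney) form of the AUC, which is equivalent to the area under the ROC curve; the function names are ours.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 (Eqs 15-18) from the
    entries of a binary confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    rcl = tp / (tp + fn)
    f1 = 2 * pre * rcl / (pre + rcl)
    return acc, pre, rcl, f1

def auc_score(scores, labels):
    """AUC (Eq 19) in its pairwise form: the probability that a
    random positive sample is scored above a random negative one
    (ties counted as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

acc, pre, rcl, f1 = classification_metrics(tp=8, tn=80, fp=4, fn=8)
auc = auc_score([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

The toy confusion matrix (8/80/4/8) mimics an imbalanced setting: accuracy is high while recall is only 0.5, illustrating why Acc alone is insufficient.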

Experimental setup

Several state-of-the-art methods, including the machine learning techniques XGBoost and CatBoost and the deep learning methods ARMNet, TabNet, TabTransformer, and FT-Transformer, are introduced as competitors of the proposed Strokeformer model. Strokeformer is trained separately by direct stochastic gradient descent (SGD) on the raw training set (StrokeformerSGD), by SGD with the additional Mixup data augmentation technique [50] (StrokeformerSGD + M), and by the described pretraining and fine-tuning framework (StrokeformerPT + FT). Mixup is a data augmentation technique aimed at enhancing the generalization capability of deep learning models across various tasks [51,52]. To ensure optimal model performance across different datasets, the hyperparameters were adjusted accordingly. Table 2 details the adjustable hyperparameter ranges used for each model in the experiments.
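Mixup forms convex combinations of pairs of samples and their labels. A minimal sketch for numerical feature vectors follows; the beta-distribution parameter `alpha=0.2` is a common illustrative choice, not a value reported in the paper.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mixup augmentation: a convex combination of two samples and
    their labels. Only meaningful for numerical features, which is
    why the text notes it is unsuitable for categorical inputs."""
    lam = random.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = lam * y1 + (1 - lam) * y2
    return x, y

random.seed(0)  # for reproducibility of the sketch
x_mix, y_mix = mixup([1.0, 2.0], 1.0, [3.0, 4.0], 0.0)
```

Because the mixed label `y_mix` is a soft value between 0 and 1, the BCE loss accepts it directly during training.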

Table 2. Hyperparameter settings of the machine learning and deep learning methods.

https://doi.org/10.1371/journal.pone.0330530.t002

In addition, to ensure the stability and reliability of the results, a ten-fold cross-validation [53] approach is employed for model training on each dataset, with the training, validation, and test sets comprising 72%, 18%, and 10% of the data, respectively. The ten-fold cross-validation is repeated three times for each dataset to further validate model performance, yielding 30 sets of model predictions. The final reported result for each dataset is obtained by averaging over these 30 runs.
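The 72/18/10 proportions arise from nesting a further 80/20 train/validation split inside each fold of ten-fold cross-validation. A sketch over sample indices (shuffling and stratification omitted for brevity; the function name is ours):

```python
def nested_split(n_samples, fold, n_folds=10, val_frac=0.2):
    """One fold of ten-fold CV: 10% test, then 20% of the remaining
    90% as validation -> 72% train / 18% validation / 10% test."""
    idx = list(range(n_samples))
    fold_size = n_samples // n_folds
    test = idx[fold * fold_size:(fold + 1) * fold_size]
    rest = [i for i in idx if i not in test]
    n_val = int(len(rest) * val_frac)
    return rest[n_val:], rest[:n_val], test  # train, val, test

train, val, test = nested_split(100, fold=0)
```

Iterating `fold` from 0 to 9, and repeating the whole procedure three times with different shuffles, gives the 30 runs whose results are averaged.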

Comparison to other stroke prognosis prediction methods

Experiments on nine small-scale benchmark datasets. The performances of Strokeformer trained by SGD (StrokeformerSGD), by SGD with Mixup (StrokeformerSGD + M), and by the pretraining and fine-tuning strategy (StrokeformerPT + FT) on the nine small-scale binary datasets are evaluated in this section. The AUC results of the three methods and all other introduced competitors are presented in Table 3. The table shows that StrokeformerPT + FT outperforms all the other competitors on seven datasets: Blood, Breast, Credit-A, Diabetes, Kc2, Qsar, and Wdbc. On Ilpd, the proposed StrokeformerSGD + M ranks first, and StrokeformerPT + FT is only slightly worse than the best method (by less than 0.5%). On Tic, all the employed models except TabNet achieve very high AUCs, with insignificant differences among them; StrokeformerSGD + M and StrokeformerPT + FT both achieve AUCs above 99%. Furthermore, this study focuses on the average predicted AUC across all test datasets because the diversity of dataset distributions makes it challenging for any single model to achieve optimal results on all datasets. Fig 5 shows the average prediction AUC for each model across the nine small-scale datasets, revealing that StrokeformerPT + FT significantly outperforms the other methods in terms of average predictive performance. Notably, the average prediction AUC of StrokeformerPT + FT clearly exceeds that of FT-Transformer, indicating that introducing the intrafeature interaction module after the embedding layer to learn information within individual features effectively enhances the predictive performance of the model.
A comparison between StrokeformerSGD and StrokeformerPT + FT shows that the pretraining and fine-tuning approach significantly improves the model’s generalizability and enhances its performance on small-scale datasets. StrokeformerSGD + M does not surpass StrokeformerPT + FT, possibly because Mixup is applicable only to numerical features and is unsuitable for categorical features. In contrast, the pretraining plus fine-tuning approach of StrokeformerPT + FT more effectively accounts for the characteristics of both numerical and categorical features.

Table 3. Prediction AUC of Strokeformer and the competitor methods on the nine small-scale binary classification datasets.

https://doi.org/10.1371/journal.pone.0330530.t003

Fig 5. Performance comparison in terms of the average predicted AUC of the introduced methods on nine small-scale datasets.

https://doi.org/10.1371/journal.pone.0330530.g005

Experiments on ten large-scale benchmark datasets. To further explore the predictive performance of Strokeformer on binary classification problems, this experiment moves beyond the datasets with fewer than 2,000 samples and specifically selects ten relatively large-scale datasets with more than 2,000 samples for testing. The experimental results are shown in Table 4. In this experiment, Strokeformer is trained directly by SGD: the Mixup method and the pretraining plus fine-tuning paradigm are designed to address insufficient training data, whereas these large-scale datasets contain sufficient labelled samples. The proposed model outperforms all the competitors on seven datasets: Albert, Bank, Compas, Customer, Electricity, Eye, and Income. ARMNet appears to perform better on large-scale datasets, achieving a higher prediction AUC than all other models on California, Credit, and House. However, Strokeformer still outperforms TabTransformer and FT-Transformer on all the testing datasets, indicating that the proposed model achieves state-of-the-art performance among the transformer-based models for tabular data. In addition, as shown in Table 1, the California, Credit, and House datasets, on which ARMNet outperforms Strokeformer, are all class balanced, containing equal numbers of positive and negative samples. In contrast, the Bank, Customer, and Income datasets are class imbalanced. According to the results presented in Table 4, Strokeformer achieves better classification performance on these datasets than ARMNet and the other competitors, demonstrating the advantages of Strokeformer on class-imbalanced datasets. The average prediction AUC shown in Fig 6 also supports the superiority of Strokeformer on large-scale datasets, where it outperforms all other comparative models.
Relative to the small-scale datasets, ARMNet performs better on large-scale datasets, with its predictive performance approaching that of Strokeformer.

Table 4. Prediction AUC of Strokeformer and the competitor methods on the ten large-scale binary classification datasets.

https://doi.org/10.1371/journal.pone.0330530.t004

Fig 6. Performance comparison in terms of the average predicted AUC of the introduced methods on ten large-scale datasets.

https://doi.org/10.1371/journal.pone.0330530.g006

Experiments on two real-world stroke prognosis datasets. The preceding experiments adequately verified the predictive performance of the Strokeformer model on tabular data classification problems. This section applies the model to real-world stroke prognosis prediction tasks using the private Fuyong and Taizhou datasets. The experimental results, evaluated across multiple performance metrics such as accuracy, recall, precision, and F1 score, are shown in Tables 5 and 6. The results indicate that StrokeformerPT + FT achieves an AUC that significantly outperforms the predictive capabilities of the other comparative models. Strokeformer also outperforms the competitor models in terms of other predictive metrics, such as accuracy and precision. On the Taizhou dataset, the predicted AUC of StrokeformerPT + FT reached 92.79%, an improvement of 8.57 percentage points over the 84.22% achieved by StrokeformerSGD, highlighting the significant performance enhancement derived from the pretraining and fine-tuning methods. Figs 7 and 8 display the ROC curves of deep learning models such as ARMNet and TabTransformer and of the proposed Strokeformer model on the Fuyong and Taizhou datasets, respectively. The figures clearly show the significant superiority of StrokeformerPT + FT. As presented in Eqs (17) and (18), the Recall and F1 values are highly dependent on the number of correctly classified positive samples; they appear low on the Fuyong dataset because of its small size and class imbalance, as it contains only 23 positive instances and 94 negative instances. As shown in Table 5, TabNet achieves the highest Recall value; however, its accuracy and AUC are very low, indicating that the model fails to converge on this dataset. In such a case, a high Recall value is considered incidental and unreliable.
Similarly, although ARMNet has relatively high Recall and F1 values, its low accuracy and AUC suggest that it does not perform well on the Fuyong dataset. The performance improvement of Strokeformer on the stroke prognosis datasets is more pronounced than that on the other datasets because the pretraining and fine-tuning processes greatly alleviate the overfitting and performance degeneration caused by class imbalance in the training set. These results support our assumption that pretraining on data from other domains can improve classification performance on small-scale medical datasets.

Table 5. Comparisons among the proposed Strokeformer model and the state-of-the-art competitors on the private Fuyong stroke prognosis dataset.

https://doi.org/10.1371/journal.pone.0330530.t005

Table 6. Comparisons among the proposed Strokeformer model and the state-of-the-art competitors on the private Taizhou stroke prognosis dataset.

https://doi.org/10.1371/journal.pone.0330530.t006

Fig 7. ROC curves of the proposed Strokeformer model and its competitors on the private Fuyong stroke prognosis dataset.

https://doi.org/10.1371/journal.pone.0330530.g007

Fig 8. ROC curves of the proposed Strokeformer model and its competitors on the private Taizhou stroke prognosis dataset.

https://doi.org/10.1371/journal.pone.0330530.g008

Discussion

In this section, an ablation study is discussed, in which the intra- and interfeature interaction modules are removed from Strokeformer separately. The experiments compare the performance of the model before and after the removal of the two modules on nine small-scale datasets to analyse the impact of each module on the final prediction outcomes. Additionally, the influence of varying encoder layer counts and different feature representation methods on predictive performance is assessed. Three distinct representation strategies are investigated: adding a classification (CLS) token [54] to the input features and using its associated embedding, employing the average of the last encoder layer’s feature embeddings, and utilizing the mean of the first and last encoder layers’ embeddings.

Ablation study

To evaluate the effectiveness of each component in the Strokeformer model, ablation experiments were conducted on the intra- and interfeature interaction modules. The experiments aimed to elucidate the impact of these critical components on the final predictive performance of the model. By individually removing the intrafeature interaction module and the interfeature interaction module and observing changes in the prediction outcomes before and after removal, the contribution of each module to Strokeformer can be assessed. The experimental results on the nine small-scale datasets are presented in Table 7, where ‘w/o intrafeature’ denotes removing the intrafeature interaction module and ‘w/o interfeature’ indicates the removal of the interfeature interaction module. On the one hand, the results suggest that after removing the intrafeature interaction module, a decline in performance is observed on seven of the nine datasets, which aligns with expectations and further confirms the effectiveness of the intrafeature interaction module and its significant contribution to the Strokeformer model. However, performance improvements are noted on the Tic and Wdbc datasets. We hypothesize that in these two datasets, adding the intrafeature interaction module fails to capture effective internal feature information while increasing model complexity, leading to overfitting. The overfitting problem can be addressed by introducing a pretraining and fine-tuning learning strategy. On the other hand, the performance decreases across all nine datasets when the interfeature interaction module is removed, underscoring the critical role of the transformer encoder module in feature processing. Overall, the removal of either the intrafeature interaction module or the interfeature interaction module leads to a certain degree of decline in predictive performance.
The results of the ablation experiment prove the effectiveness of both the intrafeature interaction module and the interfeature interaction module in the design of the proposed Strokeformer model.

Table 7. Prediction AUC from ablation experiments of Strokeformer on nine small-scale datasets.

https://doi.org/10.1371/journal.pone.0330530.t007

Variants sensitivity analysis

The experimental results of Strokeformer with different variants on the nine small-scale datasets are presented in Table 8. Among the three compared representation methods, ‘first-last-avg’ exhibits superior performance on four of the nine datasets: Blood, Credit-A, Diabetes, and Qsar. Moreover, the ‘last-avg’ method notably stands out on the Breast, Ilpd, and Wdbc datasets. The remaining Kc2 and Tic datasets favour the variant that includes the CLS token. According to these results, the ‘first-last-avg’ method generally displays the best performance among the three compared variants, although the differences between them are not substantial. Furthermore, the experiment explores how the performance of the three representation methods changes as the number of transformer encoder layers increases from one to five, to reveal the trend of model prediction performance with an increasing number of encoder layers. Performance clearly varies across datasets as the model deepens, and the trends can be roughly categorized into three types. First, there is an overall upwards trend in model performance as the number of layers increases, which is evident on datasets such as Breast and Wdbc. Second, model performance initially rises with additional layers, reaches an optimum, and then declines as more layers are added; this trend is observed on the Credit-A, Kc2, Ilpd, and Diabetes datasets. Finally, the last trend is characterized by an overall decrease in performance as the number of layers increases, with the best performance achieved at just one layer, as observed on the Blood and Qsar datasets. Typically, if a dataset achieves its optimal performance with few encoder layers, the relationships between its features are relatively simple.
When the layer count becomes too high, the network may become overly complex, which counteracts further performance enhancement.
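The three pooling variants compared above can be sketched as follows for per-feature embeddings produced by the first and last encoder layers; the function names and toy embeddings are ours.

```python
def avg(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cls_pool(last_layer):
    """Use the embedding at the CLS-token position (index 0)."""
    return last_layer[0]

def last_avg(last_layer):
    """Average the last encoder layer's feature embeddings."""
    return avg(last_layer)

def first_last_avg(first_layer, last_layer):
    """Average the mean embeddings of the first and last encoder layers."""
    return avg([avg(first_layer), avg(last_layer)])

first = [[1.0, 0.0], [3.0, 2.0]]   # toy first-layer embeddings (2 features, dim 2)
last = [[2.0, 4.0], [0.0, 2.0]]    # toy last-layer embeddings
rep = first_last_avg(first, last)
```

Whichever variant is chosen, the resulting pooled vector `rep` is the representation fed to the output classifier.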

Table 8. Prediction AUC of different variants on nine small-scale datasets.

https://doi.org/10.1371/journal.pone.0330530.t008

Conclusions

Deep learning models have been widely applied to medical data analysis tasks, including stroke prognosis prediction. However, conventional models suffer from overfitting on small-scale, class-imbalanced datasets. Thus, a novel Strokeformer model is developed in this study, which overcomes these problems and achieves better performance on real-world stroke prognosis prediction tasks. The model innovatively incorporates customized intra- and interfeature interaction modules to learn the information embedded within and between individual features. Furthermore, the training process of Strokeformer is extended by a pretraining and fine-tuning methodology aimed at bolstering the generalization capacity of the model. Our experimental findings reveal that the predictive performance of the proposed paradigm markedly exceeds that of the other comparative methods. The primary limitation of the model lies in its lack of interpretability from the perspective of clinicians. Nevertheless, when Strokeformer is applied to the real-world Fuyong and Taizhou datasets, the model effectively increases the accuracy of stroke prognosis prediction, while the learning process consisting of pretraining and fine-tuning further enhances its predictive ability. In summary, the proposed Strokeformer model and its training algorithm carry significant implications for stroke prognosis prediction, offering dependable experimental support to physicians in their diagnostic assessments of stroke prognosis. Our planned future work includes two main components. First, we intend to introduce additional types of input data, such as medical images, into our model to enhance its capabilities in medical applications by integrating tabular data with other data modalities.
Second, we plan to leverage large language models to further improve the model’s classification performance on tabular data while expanding our focus to critical health care challenges beyond stroke.

References

  1. 1. Wolfe CD. The impact of stroke. Br Med Bull. 2000;56(2):275–86. pmid:11092079
  2. 2. Feigin VL, Forouzanfar MH, Krishnamurthi R, Mensah GA, Connor M, Bennett DA, et al. Global and regional burden of stroke during 1990 -2010: Findings from the Global Burden of Disease Study 2010. Lancet. 2014;383(9913):245–54. pmid:24449944
  3. 3. Campbell BCV, De Silva DA, Macleod MR, Coutts SB, Schwamm LH, Davis SM, et al. Ischaemic stroke. Nat Rev Dis Primers. 2019;5(1):70. pmid:31601801
  4. 4. Lo EH, Dalkara T, Moskowitz MA. Mechanisms, challenges and opportunities in stroke. Nat Rev Neurosci. 2003;4(5):399–415. pmid:12728267
  5. 5. Mayo NE, Wood-Dauphinee S, Ahmed S, Gordon C, Higgins J, McEwen S, et al. Disablement following stroke. Disabil Rehabil. 1999;21(5–6):258–68. pmid:10381238
  6. 6. Frankel MR, Morgenstern LB, Kwiatkowski T, Lu M, Tilley BC, Broderick JP, et al. Predicting prognosis after stroke: A placebo group analysis from the National Institute of Neurological Disorders and Stroke rt-PA Stroke Trial. Neurology. 2000;55(7):952–9. pmid:11061250
  7. 7. Bluhmki E, Chamorro A, Dávalos A, Machnig T, Sauce C, Wahlgren N, et al. Stroke treatment with alteplase given 3.0-4.5 h after onset of acute ischaemic stroke (ECASS III): Additional outcomes and subgroup analysis of a randomised controlled trial. Lancet Neurol. 2009;8(12):1095–102. pmid:19850525
  8. 8. Papavasileiou V, Milionis H, Michel P, Makaritsis K, Vemmou A, Koroboki E, et al. ASTRAL score predicts 5-year dependence and mortality in acute ischemic stroke. Stroke. 2013;44(6):1616–20. pmid:23559264
  9. 9. Flint AC, Faigeles BS, Cullen SP, Kamel H, Rao VA, Gupta R, et al. THRIVE score predicts ischemic stroke outcomes and thrombolytic hemorrhage risk in VISTA. Stroke. 2013;44(12):3365–9. pmid:24072004
  10. 10. Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. John Wiley & Sons; 2013.
  11. 11. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support vector machines. IEEE Intell Syst Their Appl. 1998;13(4):18–28.
  12. 12. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
  13. 13. Monteiro M, Fonseca AC, Freitas AT, Pinho E Melo T, Francisco AP, Ferro JM, et al. Using Machine Learning to Improve the Prediction of Functional Outcome in Ischemic Stroke Patients. IEEE/ACM Trans Comput Biol Bioinform. 2018;15(6):1953–9. pmid:29994736
  14. 14. Heo J, Yoon JG, Park H, Kim YD, Nam HS, Heo JH. Machine learning-based model for prediction of outcomes in acute stroke. Stroke. 2019;50(5):1263–5. pmid:30890116
  15. 15. Lin C-H, Hsu K-C, Johnson KR, Fann YC, Tsai C-H, Sun Y, et al. Evaluation of machine learning methods to stroke outcome prediction using a nationwide disease registry. Comput Methods Programs Biomed. 2020;190:105381. pmid:32044620
  16. 16. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 785–94.
  17. 17. Matsumoto K, Nohara Y, Soejima H, Yonehara T, Nakashima N, Kamouchi M. Stroke prognostic scores and data-driven prediction of clinical outcomes after acute ischemic stroke. Stroke. 2020;51(5):1477–83. pmid:32208843
  18. 18. Bentley P, Ganesalingam J, Carlton Jones AL, Mahady K, Epton S, Rinne P, et al. Prediction of stroke thrombolysis outcome using CT brain machine learning. Neuroimage Clin. 2014;4:635–40. pmid:24936414
  19. 19. Mazya MV, Bovi P, Castillo J, Jatuzis D, Kobayashi A, Wahlgren N, et al. External validation of the SEDAN score for prediction of intracerebral hemorrhage in stroke thrombolysis. Stroke. 2013;44(6):1595–600. pmid:23632975
  20. 20. Lou M, Safdar A, Mehdiratta M, Kumar S, Schlaug G, Caplan L, et al. The HAT Score: A simple grading scale for predicting hemorrhage after thrombolysis. Neurology. 2008;71(18):1417–23. pmid:18955684
  21. 21. Alakbari FS, Mohyaldinn ME, Ayoub MA, Muhsan AS, Abdulkadir SJ, Hussein IA, et al. Prediction of critical total drawdown in sand production from gas wells: Machine learning approach. Can J Chem Eng. 2022;101(5):2493–509.
  22. 22. Hassan AM, Ayoub MA, Mohyadinn ME, Al-Shalabi EW, Alakbari FS. A new insight into smart water assisted foam SWAF technology in carbonate rocks using artificial neural networks ANNs. In: Proceedings of the offshore technology conference Asia; 2022. p. D041S040R002.
  23. 23. Zhou Z, Wu R. Stock price prediction model based on convolutional neural networks. J Ind Eng Appl Sci. 2024;2:1–7.
  24. 24. Alakbari FS, Mohyaldinn ME, Ayoub MA, Muhsan AS. Deep learning approach for robust prediction of reservoir bubble point pressure. ACS Omega. 2021;6(33):21499–513. pmid:34471753
  25. 25. Alakbari FS, Mohyaldinn ME, Ayoub MA, Hussein IA, Muhsan AS, Ridha S, et al. A gated recurrent unit model to predict Poisson’s ratio using deep learning. J Rock Mech Geotechn Eng. 2024;16(1):123–35.
  26. 26. Wu R, Zhang T, Xu F. Cross-market arbitrage strategies based on deep learning. Acad J Sociol Manag. 2024;2:20–6.
  27. 27. Nielsen A, Hansen MB, Tietze A, Mouridsen K. Prediction of tissue outcome and assessment of treatment effect in acute ischemic stroke using deep learning. Stroke. 2018;49(6):1394–401. pmid:29720437
  28. 28. Jo H, Kim C, Gwon D, Lee J, Lee J, Park KM, et al. Combining clinical and imaging data for predicting functional outcomes after acute ischemic stroke: An automated machine learning approach. Sci Rep. 2023;13(1):16926. pmid:37805568
  29. 29. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is all you need. Adv Neural Inform Process Syst. 2017;30.
  30. 30. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint. 2018. https://arxiv.org/abs/1810.04805
  31. 31. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint. 2020. https://doi.org/10.48550/arXiv.2010.11929
  32. 32. Huang X, Khetan A, Cvitkovic M, Karnin Z. Tabtransformer: Tabular data modeling using contextual embeddings. arXiv preprint. 2020. https://doi.org/10.48550/arXiv.201206678
  33. 33. He K, Gan C, Li Z, Rekik I, Yin Z, Ji W, et al. Transformers in medical image analysis. Intell Med. 2023;3(1):59–78.
  34. 34. Liu C, Yin Q. Automatic diagnosis of COVID-19 using a tailored transformer-like network. J Phys: Conf Ser. 2021;2010(1):012175.
  35. 35. Than JCM, Thon PL, Rijal OM, Kassim RM, Yunus A, Noor NM, et al. Preliminary study on patch sizes in vision transformers (ViT) for COVID-19 and diseased lungs classification. In: 2021 ieee national biomedical engineering conference (NBEC), 2021. 146–50. http://dx.doi.org/10.1109/nbec53282.2021.9618751
  36. 36. He S, Grant PE, Ou Y. Global-local transformer for brain age estimation. IEEE Trans Med Imaging. 2022;41(1):213–24. pmid:34460370
  37. 37. Zhou H-Y, Yu Y, Wang C, Zhang S, Gao Y, Pan J, et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat Biomed Eng. 2023;7(6):743–55. pmid:37308585
  38. 38. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: Unbiased boosting with categorical features. Adv Neural Inform Process Syst. 2018;31.
  39. 39. Cai S, Zheng K, Chen G, Jagadish HV, Ooi BC, Zhang M. ARM-Net: Adaptive relation modeling network for structured data. In: Proceedings of the 2021 international conference on management of data; 2021. 207–20. https://doi.org/10.1145/3448016.3457321
  40. 40. Arik SÖ, Pfister T. TabNet: Attentive interpretable tabular learning. AAAI. 2021;35(8):6679–87.
  41. 41. Mao Y. TabTranSELU: A transformer adaptation for solving tabular data. ACE. 2024;51(1):81–8.
  42. 42. Somepalli G, Goldblum M, Schwarzschild A, Bruss CB, Goldstein T. Saint: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2106.01342
  43. 43. Ahamed, MA.; Cheng, Q. MambaTab: A plug-and-play model for learning tabular data. In Proceedings of the 2024 IEEE 7th international conference on multimedia information processing and retrieval (MIPR). IEEE; 2024, p. 369–75.
  44. 44. Gorishniy Y, Rubachev I, Khrulkov V, Babenko A. Revisiting deep learning models for tabular data. Adv Neural Inform Process Syst. 2021;34:18932–43.
  45. 45. Wang Q, Li B, Xiao T, Zhu J, Li C, Wong DF. Learning deep transformer models for machine translation. In: 2019. https://arxiv.org/abs/1906.01787
  46. 46. Prechelt L. Early stopping-but when?. Neural networks: Tricks of the trade. Springer; 2002. p. 55–69.
  47. 47. Vanschoren J, Van Rijn JN, Bischl B, Torgo L. OpenML: Networked science in machine learning. ACM SIGKDD Explor Newsl. 2014;15(2):49–60.
  48. 48. Kwah LK, Diong J. National Institutes of Health Stroke Scale (NIHSS). J Physiother. 2014;60(1):61. pmid:24856948
  49. 49. Lobo JM, Jiménez-Valverde A, Real R. AUC: A misleading measure of the performance of predictive distribution models. Global Ecol Biogeogr. 2007;17(2):145–51.
  50. 50. Liang D, Yang F, Zhang T, Yang P. Understanding mixup training methods. IEEE Access. 2018;6:58774–83.
  51. 51. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. Mixup: Beyond empirical risk minimization; 2017. https://arxiv.org/abs/1710.09412
  52. 52. Zhang L, Deng Z, Kawaguchi K, Ghorbani A, Zou J. How does mixup help with robustness and generalization?. arXiv preprint. 2020. https://doi.org/10.48550/arXiv.2010.04819
  53. 53. Shao J. Linear model selection by cross-validation. J Am Stat Assoc. 1993;88(422):486–94.
  54. 54. Zhang C, Liwicki S, Cipolla R. Beyond the CLS Token: Image reranking using pretrained vision transformers; 2022. p. 80.