Fig 1.
Flow chart for splitting the datasets.
Flow chart showing how the datasets used for model training and validation were created. LVSD, left ventricular systolic dysfunction; MAE38K, Vision Transformers pretrained on ECG data from UTokyo using a masked autoencoder; Large-dataset, electrocardiography and echocardiography paired dataset from three institutions (UTokyo, Mitsui, and Asahi); MAE130K, Vision Transformers pretrained on ECG data from three institutions (UTokyo, Mitsui, and Asahi).
Table 1.
Study-level demographic information.
Table 2.
Patient characteristics in the internal cohort.
Fig 2.
Network architecture of MAE for 12-lead ECGs.
This figure shows the network architecture of MAE-based self-supervised learning for 12-lead ECGs; ViT-Huge is shown as an example. We treated the original ECG data from each lead as a 1 × 5000 matrix of ECG voltage values. The input was divided into 1 × 250 patches, so that each lead yields a sequence of 20 patches and a 12-lead ECG yields 240 patches in total. These patches were randomly masked, and only the unmasked patches (60 patches) were input to the MAE encoder, for which we used a ViT-Huge encoder. The encoder then output 60 encoded patches, each a 1280-dimensional feature vector. The input to the MAE decoder was the full set of patches, consisting of the encoded patches and the masked patches. The proposed MAE reconstructs the input by predicting the voltage values of each masked patch of the 12-lead ECG. Each element of the decoder's output is a vector of voltage values representing one patch; the last layer of the decoder is a linear projection whose number of output channels equals the number of voltage values in a patch. The loss function computes the mean squared error between the reconstructed and original 12-lead ECGs as the reconstruction loss. As in the original MAE, the loss is computed only on masked patches. This process yields ViT-Huge encoders for 12-lead ECGs with high performance on downstream tasks. Other implementation details followed those of a previous study [15]. Although we also used ViT-Large and ViT-Base in this study, the primary model is ViT-Huge; because the MAE structure does not change with the size of the ViT model, Fig 2 is presented using ViT-Huge as an example. MAE, masked autoencoder; ECG, electrocardiography; ViT, Vision Transformer.
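The patching, masking, and masked-patch loss described above can be sketched as follows. This is a minimal NumPy illustration, not the study's implementation: the shapes (12 leads × 5000 samples, 1 × 250 patches, 240 patches total) come from the caption, the 75% masking ratio is inferred from 60 of 240 patches being visible, and all function names are hypothetical.

```python
import numpy as np

PATCH_LEN = 250              # each patch is a 1 x 250 voltage segment
N_LEADS, N_SAMPLES = 12, 5000

def patchify(ecg):
    """(12, 5000) voltage matrix -> (240, 250) patch sequence."""
    assert ecg.shape == (N_LEADS, N_SAMPLES)
    per_lead = ecg.reshape(N_LEADS, N_SAMPLES // PATCH_LEN, PATCH_LEN)  # 20 patches per lead
    return per_lead.reshape(-1, PATCH_LEN)                              # 240 patches in total

def random_mask(n_patches, mask_ratio=0.75, rng=None):
    """Randomly split patch indices into visible (unmasked) and masked sets."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(n_patches)
    n_keep = int(n_patches * (1 - mask_ratio))  # 240 * 0.25 = 60 visible patches
    return perm[:n_keep], perm[n_keep:]

def masked_mse(reconstruction, original, masked_idx):
    """MSE reconstruction loss, computed only on the masked patches."""
    diff = reconstruction[masked_idx] - original[masked_idx]
    return float(np.mean(diff ** 2))

# Toy example with random voltages standing in for a real 12-lead ECG.
ecg = np.random.default_rng(42).standard_normal((N_LEADS, N_SAMPLES))
patches = patchify(ecg)                      # (240, 250)
visible, masked = random_mask(len(patches))  # 60 visible, 180 masked
loss = masked_mse(np.zeros_like(patches), patches, masked)
```

Only the 60 visible patches would be fed to the ViT encoder; the decoder then predicts voltages for all 240 positions, and the loss above ignores reconstruction error on the patches the encoder already saw.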
Fig 3.
Example of the reconstruction process in leads II and V5.
(A) Original ECGs; (B) masked ECGs; and (C) reconstructed ECGs.
Fig 4.
Model performance values used to detect LVSD from 12-lead ECGs on the internal test dataset and external validation cohorts.
The bars indicate the AUROC for LVSD detection of each model on the internal test dataset and the validation cohorts of Mitsui, Asahi, Sakakibara, Jichi, TokyoBay, JR, and NTT. LVSD, left ventricular systolic dysfunction; AUROC, area under the receiver operating characteristic curve; ViT-Huge38K, Vision Transformer Huge pretrained on ECG data from UTokyo using a masked autoencoder; ViT-Large38K, Vision Transformer Large pretrained on ECG data from UTokyo using a masked autoencoder; ViT-Base38K, Vision Transformer Base pretrained on ECG data from UTokyo using a masked autoencoder; Baseline-CNN, two-dimensional convolutional neural network; ViT-IN1K, Vision Transformer pretrained on ImageNet-1K using a masked autoencoder.
Table 3.
Model performances for LVSD detection.
Table 4.
MAE-based ECG model performances on the benchmark dataset (PTB-XL test fold).