
Table 1.

Summary of related work on virtual staining and image-to-image translation models.


Fig 1.

Two classes of staining frameworks shown in a top-to-bottom order.

The top image (a) displays a classic GAN architecture, featuring its generator(s), discriminator(s), and associated losses. The bottom image (b) features the ViT-Stain architecture, illustrating its hybrid encoder–decoder configuration and the five core modules that underpin this work.


Table 2.

Transformer and ViT architectures considered for our virtual staining framework.


Table 3.

Key modifications in ViT-base and its hybrid encoder–decoder architecture tailored for high fidelity virtual staining.


Table 4.

Training parameters, hyperparameters, and hardware used for the staining frameworks during training and inference.


Fig 2.

The top row (a) shows the unstained source image patches, while the middle row (b) shows the paired H&E-stained patches, and the bottom row (c) shows corresponding virtual H&E-equivalent image patches generated by the respective staining frameworks.


Fig 3.

Virtually stained H&E-equivalent image patches generated by the respective staining frameworks are represented in five rows from top to bottom.

The top row (a) shows the virtually stained patches generated by Pix2Pix, demonstrating low-frequency details. The second row (b) depicts CUTGAN patches, demonstrating weak distributional details. The third row (c) depicts DCLGAN output patches with slightly fewer artifacts and hallucinations than CUTGAN. The fourth row (d) shows CycleGAN output with good content and stain preservation, while the bottom row (e) shows ViT-Stain-generated patches with strong structural coherence, high distributional detail, the lowest hallucinations, and superior stain specificity.


Fig 4.

In the left image (a), the virtual staining output is noisy and inconsistent, with visible artifacts.

In contrast, the middle image (b) shows a GAN image and the right image (c) an ADVS image after merging and blending patches; both are smooth and seamless, with negligible edge artifacts.
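The seamless appearance in (b) and (c) comes from merging overlapping patches with a blending step. As an illustration only, the following minimal sketch shows one common way to do this, feathered (weighted) averaging of overlapping tiles; the function name, tile size, and inputs are hypothetical, and the article's exact blending scheme may differ.

```python
import numpy as np

def blend_patches(patches, coords, out_shape, patch_size=256):
    """Merge overlapping stained tiles into one image using a feathered
    (tent-shaped) weight mask so that seams average out smoothly.

    patches   : list of (patch_size, patch_size, 3) float arrays (stained tiles)
    coords    : list of (y, x) top-left positions for each tile
    out_shape : (H_out, W_out) of the full reconstructed image
    """
    acc = np.zeros((*out_shape, 3), dtype=np.float64)   # weighted sum of tiles
    weight = np.zeros(out_shape, dtype=np.float64)      # accumulated weights

    # 2-D "tent" mask: 1.0 at the tile centre, falling off towards the edges,
    # so overlapping borders contribute less and edge artifacts are suppressed.
    ramp = 1.0 - np.abs(np.linspace(-1, 1, patch_size))
    mask = np.outer(ramp, ramp) + 1e-8                  # small offset avoids /0

    for patch, (y, x) in zip(patches, coords):
        acc[y:y + patch_size, x:x + patch_size] += patch * mask[..., None]
        weight[y:y + patch_size, x:x + patch_size] += mask

    return acc / weight[..., None]                      # normalised blend
```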


Fig 5.

Quantitative results for perceptual and distributional metrics across two image distributions are shown from left to right.

The left image (a) displays quantitative results for unstained vs. virtually generated patches, while the right image (b) shows the same comparison for H&E vs. virtually generated patches across the respective staining frameworks.
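As an illustration of how a distributional metric such as FID can be computed between two sets of patches, the following sketch uses the torchmetrics implementation; the patch tensors below are random placeholders, not the article's data.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder batches: (N, 3, H, W) uint8 tensors standing in for real H&E
# patches and virtually stained patches from one of the frameworks.
real_patches = torch.randint(0, 256, (128, 3, 256, 256), dtype=torch.uint8)
fake_patches = torch.randint(0, 256, (128, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # InceptionV3 pooled features
fid.update(real_patches, real=True)           # accumulate real-distribution stats
fid.update(fake_patches, real=False)          # accumulate generated-distribution stats
print(f"FID: {fid.compute().item():.2f}")     # lower = closer distributions
```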


Fig 6.

Quantitative results for perceptual error and diagnostic fidelity across two image distributions (unstained vs. virtually generated patches, and H&E vs. virtually generated patches) are shown as bar plots from left to right.

The plots on the left (a) show the perceptual error measured by LPIPS, while those on the right (b) show the diagnostic fidelity (HSFI) of the respective staining frameworks.
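The LPIPS values in (a) can in principle be reproduced with the lpips package; a minimal sketch follows, with placeholder tensors standing in for a paired H&E and virtually stained patch (HSFI is the article's own measure and is not reproduced here).

```python
import torch
import lpips  # pip install lpips

# Placeholder pair: an H&E patch and the corresponding virtually stained patch,
# as float tensors of shape (N, 3, H, W) scaled to the [-1, 1] range LPIPS expects.
he_patch      = torch.rand(1, 3, 256, 256) * 2 - 1
virtual_patch = torch.rand(1, 3, 256, 256) * 2 - 1

loss_fn = lpips.LPIPS(net='alex')             # AlexNet backbone, the common default
distance = loss_fn(he_patch, virtual_patch)   # lower = perceptually more similar
print(f"LPIPS: {distance.item():.4f}")
```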


Fig 7.

Training time and convergence plots of the respective staining frameworks distributed over 200 epochs.

The ViT-Stain plot shows an initial sharp and continuous increase in per-epoch time, which stabilizes quickly and then declines sharply at convergence, within ~190 epochs. In contrast, the GANs exhibit a moderate initial increase in per-epoch time but behave similarly to ViT-Stain during stabilization and convergence, also within ~190 epochs.


Fig 8.

Inference time (latency) behavior of the respective staining frameworks during patch merger.

ViT-Stain exhibits higher inference latency, whereas the GANs remain largely unchanged, differing only slightly in inference time from one another.


Table 5.

Qualitative evaluations by board-certified dermatopathologists comparing real and virtual H&E images and patches from each staining framework.


Table 6.

Results of the ablation experiment, evaluated on five model variants versus the full ViT-Stain.


Table 7.

Classification results of the H&E- and ViT-Stain-trained classifiers, emphasizing precision, recall, and F1-score to enable a direct comparison.
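For reference, per-class precision, recall, and F1-score of the kind reported here can be obtained with scikit-learn; the label arrays below are illustrative placeholders, not the article's test data.

```python
from sklearn.metrics import classification_report

# Placeholder ground-truth labels and predictions from one lesion classifier
# on a shared test set (class indices are illustrative only).
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

# Per-class precision, recall, and F1-score, plus macro/weighted averages.
print(classification_report(y_true, y_pred, digits=3))
```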


Fig 9.

Both images represent the lesions used in the classification experiment to demonstrate the diagnostic potential of ViT-Stain.

The top image (a) presents those lesions on H&E images, while the bottom image (b) displays the same lesion classes on ViT-Stain-generated images; training patches were extracted from both to train the respective classifiers.


Fig 10.

Confusion matrices of each classifier showing per-class and overall classification accuracy on the held-out test set.

More precisely, the image on the left (a) represents the confusion matrix for the H&E classifier, and the right one (b) shows the confusion matrix for the ViT-Stain classifier.
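A minimal sketch of how such a confusion matrix and its overall accuracy can be produced with scikit-learn is shown below; the labels and predictions are placeholders, not the article's held-out test set.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Placeholder test-set labels and predictions for one classifier (class indices).
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

cm = confusion_matrix(y_true, y_pred)   # rows = true classes, columns = predicted
acc = cm.trace() / cm.sum()             # overall accuracy from the diagonal
ConfusionMatrixDisplay(cm).plot()
plt.title(f"Overall accuracy: {acc:.2%}")
plt.show()
```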


Fig 11.

The image shows the ROC curves for the respective classifiers on the held-out test set, highlighting the true positive rate, false positive rate, and AUC.

The left image (a) highlights the ROC curve for the H&E classifier, while the right image (b) shows the ROC curve for the ViT-Stain classifier.
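As an illustration, the ROC curve and AUC for a single class in a one-vs-rest setting can be computed with scikit-learn as sketched below; the ground-truth labels and scores are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Placeholder one-vs-rest setup for a single lesion class: binary ground truth
# and the classifier's predicted probability for that class on the test set.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

fpr, tpr, _ = roc_curve(y_true, y_score)   # false/true positive rates per threshold
roc_auc = auc(fpr, tpr)                    # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```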


Fig 12.

Training and validation curves for accuracy and loss for each classifier on the held-out test set are presented.

The top panel (a) presents the curves for the H&E classifier, while the bottom panel (b) displays the curves for the ViT-Stain classifier.


Table 8.

Outcomes of H&E stains compared with virtual stains from ViT-Stain and leading baseline models.


Table 9.

Pearson correlation coefficient (r) between two pairs of measurements, HSFI vs. Turing-test success and FID vs. Turing-test success, computed on corresponding images from the virtual staining frameworks and real H&E.
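A minimal sketch of the correlation computation, using scipy.stats.pearsonr on hypothetical per-framework scores (the numbers below are placeholders, not the reported values):

```python
from scipy.stats import pearsonr

# Placeholder per-framework scores: HSFI, FID, and Turing-test success rates.
hsfi_scores    = [0.71, 0.78, 0.83, 0.88, 0.93]
fid_scores     = [95.0, 80.0, 62.0, 48.0, 31.0]
turing_success = [0.22, 0.30, 0.41, 0.47, 0.55]

r_hsfi, p_hsfi = pearsonr(hsfi_scores, turing_success)  # HSFI vs Turing-test success
r_fid,  p_fid  = pearsonr(fid_scores,  turing_success)  # FID vs Turing-test success
print(f"HSFI vs Turing: r = {r_hsfi:.3f} (p = {p_hsfi:.3f})")
print(f"FID  vs Turing: r = {r_fid:.3f} (p = {p_fid:.3f})")
```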
