Abstract
Digital pathology enables automatic analysis of histopathological sections using artificial intelligence. Automatic evaluation could improve diagnostic efficiency and reveal associations between morphological features and clinical outcome. For the development of such prediction models in breast cancer, identifying invasive epithelial cells and separating them from benign epithelial cells and in situ lesions is important. In this study, we trained an attention-gated U-Net for segmentation of epithelial cells in hematoxylin and eosin stained breast cancer sections. We generated epithelial ground truths by immunohistochemistry, restaining hematoxylin and eosin sections with cytokeratin AE1/AE3, combined with pathologists’ annotations. Tissue microarrays from 839 patients, and whole slide images from two patients, were used for training and evaluation of the models. The sections were derived from four breast cancer cohorts. Tissue microarray cores from a fifth cohort of 21 patients were used as a second test set. In quantitative evaluation, mean Dice scores of 0.70, 0.79, and 0.75 were achieved for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively. In qualitative scoring (0-5) by pathologists, the best results were reached for all epithelium and invasive epithelium, with scores of 4.7 and 4.4, respectively. Scores for benign epithelium and in situ lesions were 3.7 and 2.0, respectively. The proposed model segmented epithelial cells well, but further work is needed for accurate subclassification into benign, in situ, and invasive cells.
Citation: Høibø M, Pedersen A, Dale VG, Berget SM, Ytterhus B, Lindskog C, et al. (2025) Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides. PLoS One 20(7): e0328033. https://doi.org/10.1371/journal.pone.0328033
Editor: James A. L. Brown, University of Limerick, IRELAND
Received: November 8, 2024; Accepted: June 25, 2025; Published: July 17, 2025
Copyright: © 2025 Høibø et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used in this study are not publicly available due to restrictions in the ethical approval of the study (Regional Committee for Medical Research Ethics Central Norway, 2018/2141). However, data sharing may be possible upon reasonable request, on the basis of collaboration. Requests to access the datasets on this basis can be directed to marit.valla@ntnu.no (Cohorts BCS-1, BCS-2, and BCS-3), lars.akslen@uib.no (HUS-BC-1), cecilia.lindskog@igp.uu.se (HPA-BC) or kontakt@ikom.ntnu.no.
Funding: The work was funded by The Liaison Committee for Education, Research, and Innovation in Central Norway [Grant Number 2018/42794 and 2020/39645], the Joint Research Committee between St. Olavs hospital and the Faculty of Medicine and Health Sciences, NTNU (FFU) [Grant Number 2019/38882 and 2021/51833], and the Clinic of Laboratory Medicine, St. Olavs hospital, Trondheim University Hospital, Trondheim, Norway. The work was also supported by grants from the Research Council of Norway through its Centres of Excellence funding scheme, project number 223250 (to LAA).
Competing interests: The authors have declared that no competing interests exist.
Introduction
Most pathology laboratories are burdened by an increasing workload: both the number of biopsies and the number of additional molecular analyses per case are growing. Advances within molecular pathology have resulted in more complex diagnostics, expanding the workload for each biopsy [1,2]. The recent and ongoing implementation of digital pathology makes it possible to analyze tissue sections on computer screens, and it facilitates distant collaboration and the use of digital microscopy in teaching [3]. Artificial intelligence (AI) opens opportunities for automatic interpretation of digital tissue slides, with the potential to improve diagnostic efficiency [4] and discover novel features of clinical importance. In pathology, AI has shown promising results in tasks like tissue segmentation, mitosis detection, and prediction of prognosis [5–7]. Within AI, deep learning has become the preferred method for image analysis. In segmentation of medical images, U-Nets [8] are widely used [9], usually with supervised learning through annotated data. They are convolutional networks comprised of an encoder and a symmetric decoder [8]. Attention-gated U-Nets with multiscale input and deep supervision have been shown to outperform regular U-Nets [10]. In semantic segmentation tasks, each pixel in the training images has a label assigned to it, and the neural network learns to label each pixel correctly. The large size of whole slide images (WSIs) represents a significant challenge in image analysis: they can be as large as 200 000 x 100 000 pixels with multiple image planes, which results in a need for large storage capacity, long processing times, and complicated data handling.
Methods that enable automatic detection of invasive epithelial cells could be useful in numerous tasks, such as automatic identification of lymph node metastases, automatic biomarker assessment, and prediction of prognosis. Epithelial cells can be detected through immunohistochemical staining with cytokeratins (CK), such as CK AE1/AE3 [11,12]. However, the marker does not differentiate between benign epithelial cells, non-invasive in situ lesions, and invasive epithelial cells. In tasks such as automatic biomarker assessment, separating these cells is important, since only invasive cells are included in the analysis.
Segmentation of the tumor border using manual annotations as ground truth has been done previously [13]. Manual annotations have also been used to segment invasive, benign, and in situ lesions in multiclass models [14–16]. However, accurate manual annotations for separating invasive epithelial cells from other cell types within the tumor region would be too time-consuming on large datasets and therefore not feasible. IHC generated ground truths could therefore be an alternative. Bulten et al. [17] segmented epithelial cells in hematoxylin and eosin (HE) stained prostate cancer slides, using CK to detect epithelial cells, and a myoepithelial cell marker to separate benign and invasive cells. Brázdil et al. [18] used CK to distinguish epithelial cells from surrounding stromal tissue in sections from breast and colon cancer. They did not differentiate between neoplastic and non-neoplastic epithelium. In automatic Ki-67, estrogen receptor (ER), and progesterone receptor (PR) analysis, Valkonen et al. [19] segmented breast cancer epithelial cells using a pan-cytokeratin antibody. Their model did not differentiate between invasive and non-invasive epithelium. Although the use of IHC for segmentation tasks shows promise, separating invasive epithelial cells from benign epithelium and in situ lesions still remains a challenge that needs to be solved.
The aim of this study was to construct an AI model for segmentation of benign, in situ, and invasive epithelial cells in HE stained breast cancer sections, using HE and IHC image pairs and pathologists’ annotations to create ground truth.
The main contributions of this paper are:
- A novel breast cancer dataset comprising HE and CK tissue microarray (TMA) image pairs from 860 patients, as well as whole slide images from two breast cancer patients. All sections include pathologists’ annotations of benign epithelium and in situ lesions.
- An algorithm for extracting TMA cores from histopathological images (TissueMicroArrayExtractor in pyFAST [20,21]).
- An algorithm for creating ground truths for HE stained images based on CK stained images.
- A trained multiclass attention-gated U-Net model, segmenting epithelium into benign, in situ, and invasive classes.
- Comprehensive quantitative and qualitative validation studies.
- The model is made available in FAST and the open software FastPathology [22], and the project code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation
Materials and methods
A short summary of the methods can be found in Fig 1.
1. The slides were stained with HE, scanned, restained with CK, and scanned again. 2. Benign epithelium and in situ lesions were annotated manually. The DAB channel in the CK image was thresholded to create initial masks positive for all epithelium. 3. The slides were separated into train, validation, and test sets. 4. TMA cores were extracted and ground truths created, at x10 for training and validation, and x20 for evaluation. The ground truth was one-hot-encoded and separated into patches of size 1024x1024 for training/validation. 5. The AGU-Net was trained using HE patches and the corresponding ground truth patches. 6. The AGU-Net was evaluated on whole TMA cores at x20. Abbreviations: HE = hematoxylin and eosin, CK = cytokeratin, DAB = 3,3’-diaminobenzidine, TMA = tissue microarray, AGU-Net = attention gated U-Net.
Cohorts and tissue specimens
In this study, TMAs from five cohorts [23–27] of breast cancer patients were used.
- The BCS-1 cohort comprises 909 breast cancer patients derived from a background population of 25 727 women who were invited to participate in a population-based study for the early detection of breast cancer in Trøndelag County, Norway [23]. The women in the background population were born between 1886 and 1928, and they were followed for breast cancer occurrence from January 1st, 1961, to December 31st, 2008.
- The BCS-2 cohort comprises 514 women diagnosed with breast cancer from a background population of 34 221 women, born between 1897 and 1977 in Trøndelag County, Norway, who participated in a population-based survey [24]. They were followed from attendance (1995-1997) to December 31st, 2009, for breast cancer occurrence.
- The BCS-3 cohort comprises 533 women diagnosed with breast cancer from a background population of 22 931 women born between 1920 and 1966 at EC Dahl’s Foundation in Trondheim, Norway. They were followed for breast cancer occurrence from 1961 to 2012 [25].
- The HUS-BC-1 cohort comprises 534 women from Hordaland County, Norway, diagnosed with breast cancer from 1996–2003. The patients in this cohort were diagnosed through the national breast cancer screening program and were in the age range 50–69 years [26].
- The HPA-BC cohort comprises TMA sections from 25 breast cancer patients from the Uppsala Biobank/Human Protein Atlas [27].
The project was approved by the Regional Committee for Medical Research Ethics Central Norway (2018/2141). The requirement of written informed consent was waived by the ethics committee. The data presented is not publicly available due to requirements in the ethical approval of the study. However, data sharing may be possible upon request, on the basis of mutual collaboration.
The data was accessed between October 1st 2021 and October 1st 2024. The authors did not have access to information that could identify individual participants during or after data collection.
Four TMA slides from each of BCS-1, BCS-2, and BCS-3, twelve TMA slides from HUS-BC-1, and four TMA slides from the HPA-BC cohort were used. In addition, two WSIs from BCS-2 were included. The TMA slides from BCS-1, BCS-2, BCS-3, HUS-BC-1, and HPA-BC comprised 3021 TMA cores from 992 patients. A total of 749 TMA cores were excluded, either because they were missing, displaced, or improperly CK stained, or due to poor tissue quality after staining. After these exclusions, 2272 cores from 860 patients were included in the study.
The TMA slides from BCS-1, BCS-2, BCS-3, and HPA-BC were 4 µm thick with a core diameter of 1 mm. The slides from HUS-BC-1 were 5 µm thick with core diameters of 0.6 and 1 mm. All TMA slides were stained as follows: The slides were placed in TissueClear (Sakura Finetek Norway) for 2x3 minutes and rehydrated in four decreasing ethanol dilutions (100, 100, 96, and 80%) for 35 seconds each, before being placed in water for 1 minute. They were then stained with hematoxylin for 5 minutes, followed by 5 minutes in water, and 1 minute in 80% ethanol. The slides were stained with eosin (alcoholic) for 1 minute, followed by placement in 96, 96, 100, 100, and 100% ethanol for 30, 30, 60, 60, and 60 seconds, respectively. Finally, they were placed in TissueClear, first for 30, then for 40 seconds, before being air-dried for 2 minutes. The HE stained slides were scanned using an Olympus BX61VS with VS120S5 at x40 magnification and extended focal imaging (EFI) with seven planes. The scanned HE slides were inspected by a pathologist. The coverslips were then removed from the HE stained slides by placement in xylene for 2-3 days; the slides were then placed in TissueClear for 3x2 minutes and rehydrated in decreasing ethanol dilutions: 3x2 minutes in 100%, 1x2 minutes in 96%, 1x2 minutes in 80%, and 2x5 minutes in water. Antigen retrieval was performed through heat induced epitope retrieval (HIER) by placing the slides in a TRS buffer with pH 9 (Agilent DAKO, Agilent Technologies Denmark ApS) at 98 °C, followed by placement in a wash buffer (Agilent DAKO, Agilent Technologies Denmark ApS), for 20 and 2x5 minutes, respectively. The HE stain was removed during the HIER procedure. The slides were restained with the CK AE1/AE3 antibody (Agilent DAKO, Agilent Technologies Denmark ApS) through four incubation steps: hydrogen peroxide, primary antibody for pan CK AE1/AE3 (concentration 176.7 mg/L, dilution 1:80), EnVision K5007 rabbit/mouse Horseradish Peroxidase (HRP)/Diaminobenzidine (DAB)+ Detection Kit (Agilent DAKO, Agilent Technologies Denmark ApS), and DAB+ chromogen, for 5 minutes, 90 minutes, 30 minutes, and 2x5 minutes, respectively.
A wash buffer (Agilent DAKO, Agilent Technologies Denmark ApS) was used between each incubation step. The slides were then stained with contrasting hematoxylin for 15 seconds, dehydrated, and coverslips were applied. Dehydration was performed through placement in ethanol (1 minute in 80%, 1 minute in 96%, and 3x1 minute in 100%). The CK AE1/AE3 stained slides were scanned with Olympus BX61VS with VS120S5 at x40 magnification using EFI with seven image planes.
The two WSIs were stained with HE and scanned using Olympus BX61VS with VS120S5 at x40 magnification without EFI, then CK stained and rescanned. The staining procedures were identical to those for the TMAs.
Cytokeratin staining and creation of masks
The stains in the CK images were separated automatically in QuPath [28] using color deconvolution [29] by setting the image type to brightfield hematoxylin-3,3’-diaminobenzidine (H-DAB). This produced separate hematoxylin, DAB, and residual stain channels. The default stain vector for one slide was used as reference for all slides. Preliminary ground truth masks were created using the pixel classifier tool in QuPath by thresholding the DAB stain channel with a Gaussian prefilter (smoothing sigma 3.0) and a threshold of 0.25 at 0.3448 µm/pixel resolution. Since CK is a cytoplasmic marker, nuclear holes were present in the initial mask. Holes with an area below 150 µm² were therefore filled, and small fragments were discarded by removing objects with an area smaller than 25 µm² (see Fig 2). The masks were exported from QuPath to GeoJSON, then converted to tiled, non-pyramidal TIFF, followed by a conversion to pyramidal BigTIFF. Since CK, which in the image corresponds to the DAB stain, stains all epithelium, the initial mask consisted of only two classes: epithelium and non-epithelium (see Figs 2 and 3).
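The hole-filling and fragment-removal step can be illustrated with a minimal NumPy/SciPy sketch (the study performed this in QuPath; the function below and its name are illustrative only). The area thresholds in µm² are converted to pixel counts using the 0.3448 µm/pixel resolution:

```python
import numpy as np
from scipy import ndimage

PIXEL_SIZE_UM = 0.3448               # resolution used for thresholding
UM2_PER_PIXEL = PIXEL_SIZE_UM ** 2   # area of one pixel in µm²

def clean_epithelium_mask(mask, max_hole_um2=150.0, min_object_um2=25.0):
    """Fill small nuclear holes and drop small fragments in a binary mask."""
    max_hole_px = int(max_hole_um2 / UM2_PER_PIXEL)   # ≈ 1261 pixels
    min_obj_px = int(min_object_um2 / UM2_PER_PIXEL)  # ≈ 210 pixels

    mask = mask.astype(bool)
    # Identify holes (background regions enclosed by foreground)
    filled = ndimage.binary_fill_holes(mask)
    holes = filled & ~mask
    labels, n = ndimage.label(holes)
    if n:
        sizes = ndimage.sum(holes, labels, range(1, n + 1))
        small = np.isin(labels, np.where(sizes < max_hole_px)[0] + 1)
        mask = mask | small          # fill only holes below the threshold

    # Remove connected components smaller than the minimum object area
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep_labels = np.where(sizes >= min_obj_px)[0] + 1
    return np.isin(labels, keep_labels)
```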
(a) Image of HE stained tissue. (b) Image of CK stained tissue. (c) Image of thresholded DAB channel in CK stained tissue where small holes are filled and fragments removed. (d) Illustration of binarized thresholded DAB channel. Abbreviations: HE = hematoxylin and eosin, CK = cytokeratin, DAB = 3,3’-diaminobenzidine.
Illustration of (a) HE stained TMA with manual annotations of benign epithelium and in situ lesions. (b) Pyramidal mask of benign epithelium (blue areas) and in situ lesions (red areas) created from the manual annotations in (a). (c) A single TMA core from the HE stained TMA with manual annotations of benign epithelium (blue circle) and in situ lesions (red circle). (d) The same TMA core as in (c) from the pyramidal mask in (b). (e) CK stained TMA. (f) CK stained TMA converted into a binary mask by thresholding the DAB channel. (g) A single CK TMA core. (h) The same TMA core as in (g), from the binary mask in (f). Abbreviations: TMA = tissue microarray, DAB = 3,3’-diaminobenzidine, HE = hematoxylin and eosin.
Annotations
Annotations were performed manually in both the CK and HE stained images. A pathologist reviewed the CK images and identified TMA cores with strong background staining to avoid false positives, and cores with false negative CK stains. TMA cores with more than 10% false negative epithelium staining, or strong background staining, were tagged in QuPath for exclusion. The HE images were annotated by two pathologists. In QuPath, they digitally marked benign epithelial structures and all in situ lesions (including all non-invasive atypical epithelial cell proliferations), to separate them from invasive epithelial cells (see Figs 3 and 4). The pathologists reviewed each other’s annotations in the HE images. Consensus was reached through discussion in case of discrepancies. To enable evaluation of TMA cores according to histological subtype and grade, each case (TMA triplet or duplet) was identified and marked in QuPath. The annotations of benign epithelial structures and in situ lesions, annotations of cores with insufficient staining, and case annotations were exported from QuPath to OME-TIFF format as three separate images.
Illustrations of (a) TMA slide of binarized DAB channel. (b) HE stained TMA slide. (c) TMA slide of manual annotations of benign epithelium (blue) and in situ lesions (red). (d) HE stained TMA core with manual annotations of benign epithelium (blue) and in situ lesions (red). (e) Manual annotations of benign epithelium and in situ lesions in TMA core. (f) Binarized DAB channel, positive for all epithelium. (g) Invasive epithelium. (h) Benign epithelium. (i) In situ lesion. Abbreviations: TMA = tissue microarray, DAB = 3,3’-diaminobenzidine, HE = hematoxylin and eosin.
Dataset creation
Since multiple TMA cores existed for each patient, the TMA data was divided into training, validation, and test sets on slide level. This ensured that the same patient was not present in multiple data sets. The TMA slides from the BCS-1, BCS-2, BCS-3, and HUS-BC-1 cohorts were randomly divided into training, validation, and test sets, where 16, 4, and 4 TMA slides were used in training, validation, and test sets, respectively. The HPA-BC cohort was used exclusively as a second test set (Fig 5).
(a) Data stratification from TMAs. Blue boxes represent whole TMA cores (image level 1). White boxes represent training and evaluation data (patches extracted at image level 2). (b) Data stratification from WSIs. Abbreviations: TMA = tissue microarray, WSI = whole slide image.
Training and validation sets.
The HE and CK images, the preliminary DAB ground truth mask, annotations of benign epithelium and in situ lesions, annotations of cores to be removed, and case annotations were imported into Python as six separate pyramidal images. An image processing algorithm (called TissueMicroArrayExtractor) for extracting TMA cores automatically from TMA whole slide images was developed and added to the open-source library FAST [20,21] and the corresponding Python interface pyFAST. This algorithm first performs tissue segmentation by color thresholding on a low-resolution version of the image. The segmented regions are then extracted using flood fill. Small regions of less than 100 pixels are removed, and for the remaining regions, the median area and diameter are calculated. Any region that differs by more than 50% from the median area or diameter is excluded. The final regions are then extracted at the desired magnification level.
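The steps of this extraction heuristic can be sketched in Python as follows. The actual TissueMicroArrayExtractor is implemented in FAST/pyFAST; this simplified version assumes a grayscale low-resolution image with a bright glass background, and approximates region diameter by the largest bounding-box side:

```python
import numpy as np
from scipy import ndimage

def extract_tma_core_regions(lowres_gray, tissue_thresh=0.85,
                             min_pixels=100, max_dev=0.5):
    """Locate TMA core regions in a low-resolution grayscale image.

    Mirrors the described heuristic: threshold tissue, take connected
    components, drop tiny regions, then exclude regions whose area or
    diameter deviates by more than `max_dev` (50%) from the median.
    Returns the bounding-box slices of the accepted cores.
    """
    tissue = lowres_gray < tissue_thresh        # tissue is darker than glass
    labels, n = ndimage.label(tissue)
    boxes = ndimage.find_objects(labels)

    stats = []
    for i, sl in enumerate(boxes, start=1):
        area = int((labels[sl] == i).sum())
        if area < min_pixels:                   # remove tiny regions
            continue
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        stats.append((sl, area, max(h, w)))     # diameter ≈ max bbox side

    if not stats:
        return []
    med_area = np.median([s[1] for s in stats])
    med_diam = np.median([s[2] for s in stats])
    return [s[0] for s in stats
            if abs(s[1] - med_area) <= max_dev * med_area
            and abs(s[2] - med_diam) <= max_dev * med_diam]
```

The median-based filter makes the extractor robust to staining debris and partial cores, since anything far from the typical core size is discarded.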
For the training and validation sets, the TMA cores were extracted automatically from the pyramidal HE and CK images at x10 magnification using the TissueMicroArrayExtractor; this magnification provided sufficient detail while still including larger tissue structures within a patch. HE/CK core pairs were identified by comparing the coordinates of the extracted cores. Corresponding areas were extracted from the annotated and thresholded images using the coordinates and width/height of the HE/CK cores. Cores marked as insufficiently CK stained were not included in the analysis, and neither were cores that were severely displaced or destroyed. Small displacements were identified and adjusted for by registering the CK and HE cores using phase cross-correlation. Histogram equalization was used to increase the contrast in the CK cores to improve registration of tissue with few details and poor contrast. The TMA core images were downsampled by a factor of four to reduce memory use during registration. The shifts in the x and y directions between the HE and CK cores were calculated, then upscaled and used to register the images at full resolution.
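The study used scikit-image's phase cross-correlation; the underlying principle, including the downsample-estimate-upscale scheme described above, can be sketched with plain NumPy FFTs (function names are illustrative, and only integer circular shifts are handled):

```python
import numpy as np

def phase_correlation_shift(ref, mov):
    """Estimate the integer translation (dy, dx) such that
    np.roll(mov, (dy, dx), axis=(0, 1)) aligns `mov` to `ref`,
    via the normalized cross-power spectrum (phase correlation)."""
    F1 = np.fft.fft2(ref)
    F2 = np.fft.fft2(mov)
    cross = F1 * np.conj(F2)
    cross /= np.maximum(np.abs(cross), 1e-12)   # keep phase only
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts into the signed range
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

def register_downsampled(ref, mov, factor=4):
    """Estimate the shift on images downsampled by `factor` (as done for
    the TMA cores to save memory), then upscale the shift to full
    resolution."""
    dy, dx = phase_correlation_shift(ref[::factor, ::factor],
                                     mov[::factor, ::factor])
    return dy * factor, dx * factor
```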
Invasive epithelium, benign epithelium, and in situ lesions were separated into three separate classes for each TMA core (see Fig 4). Ground truth for invasive epithelium was created by subtracting CK stained epithelium within areas manually annotated as benign or in situ from the mask made by thresholding the DAB channel. Benign epithelium and in situ lesion ground truths were created by identifying the positive cells from the mask within areas annotated as benign or in situ by the pathologists (Fig 4).
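At pixel level, this class separation amounts to combining the binary CK mask with the annotation masks. A minimal NumPy sketch with illustrative names, assuming all inputs are boolean arrays of the same shape:

```python
import numpy as np

def build_ground_truth(ck_mask, benign_annot, insitu_annot):
    """Split a binary CK epithelium mask into three classes using the
    pathologists' region annotations.

    benign   = CK-positive pixels inside benign-annotated regions
    in situ  = CK-positive pixels inside in situ-annotated regions
    invasive = remaining CK-positive pixels (CK mask minus annotated areas)
    """
    ck = ck_mask.astype(bool)
    benign = ck & benign_annot
    insitu = ck & insitu_annot
    invasive = ck & ~benign_annot & ~insitu_annot
    return benign, insitu, invasive
```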
Due to the large size of histopathological images, patches of size 1024 x 1024 pixels were created from the HE, CK, benign, in situ, and invasive TMA core images with 25% overlap on all sides using pyFAST. Patches with less than 25% tissue were excluded from training. To improve registration, the HE and CK patches were registered again, as described for the whole TMA cores. A ground truth patch was created by one-hot encoding the non-epithelial tissue, invasive epithelium, benign epithelium, and in situ lesions as four separate classes. Due to the large imbalance between the number of patches including invasive epithelium versus benign and in situ, the patches were divided into benign, in situ, and invasive sets, allowing for a balanced sampling scheme during training. Each patch was assigned to the in situ set if it included in situ lesions; to the benign set if it included benign epithelium but no in situ lesions; and otherwise to the invasive set (Fig 5).
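The one-hot encoding and set-assignment logic can be sketched as follows (illustrative names; the three class masks are assumed mutually disjoint, as produced by the ground truth creation step):

```python
import numpy as np

def one_hot_ground_truth(benign, insitu, invasive):
    """Stack the three class masks, plus background, into a one-hot tensor
    of shape (H, W, 4): background, invasive, benign, in situ."""
    background = ~(benign | insitu | invasive)
    return np.stack([background, invasive, benign, insitu],
                    axis=-1).astype(np.uint8)

def assign_patch_set(gt):
    """Assign a one-hot patch to a sampling set with the stated priority:
    in situ > benign > invasive."""
    if gt[..., 3].any():      # any in situ pixels
        return "in_situ"
    if gt[..., 2].any():      # benign pixels but no in situ
        return "benign"
    return "invasive"
```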
Selected areas from two whole sections of breast cancer were added to the training data to include more tissue from areas poorly represented in TMAs, such as benign epithelium, in situ lesions, stromal tissue, and adipose tissue. Areas with only invasive epithelial cells or glass were annotated for removal, as well as some regions including large amounts of adipose tissue. The two slides were used for training and validation. The WSIs were annotated for benign epithelium and in situ lesions, and reviewed for false positive and false negative CK staining, as previously described. The WSIs were first divided into eight squares, which were registered using phase cross-correlation (downsampled, registered, and shifted). Six of the large squares were included in the training set and two in the validation set. Patches from the same slide could thus be found in both the training and validation sets. Creation of HE patches and corresponding ground truths was performed as described for the TMAs.
Test sets.
The internal test set consisted of TMA slides from BCS-1, BCS-2, BCS-3, and HUS-BC-1 (test set 1), while the external test set consisted of TMA slides from HPA-BC (test set 2), see Fig 5. TMA cores were extracted as whole cores, not split into patches, at x20 magnification using pyFAST’s TissueMicroArrayExtractor; ground truths were created, and the HE cores and corresponding ground truths were saved to disk.
Training and evaluation
The implementation was done in Python 3.8, with TensorFlow v2.10.0 for implementation and training of the AGU-Net model. Scikit-image v0.18.3 was used for image-to-image registration using phase cross-correlation [30]. WSI processing and TMA extraction were performed with pyFAST v4.7.0 [20,21]. For running the experiments an Intel Xeon Gold 6230 @2.10GHz central processing unit (CPU) with 256 GB RAM, and an NVIDIA Quadro RTX 6000 dedicated graphics processing unit (GPU) were used. The source code to reproduce the experiments is made openly available at https://github.com/AICAN-Research/breast-epithelium-segmentation
To counter class imbalance, patches were randomly sampled during training from one of the three sets (benign, in situ, or invasive) from the TMAs and WSIs. Data augmentation was applied on the fly to the training patches to improve model robustness. The following data augmentation techniques were used: random flip, 90° rotations, brightness, hue, saturation, shift, and blur. For each patch, each augmentation technique had a 50% chance of being enabled, except for blur augmentation, which had a 10% chance. The network used was an attention-gated U-Net (AGU-Net) [10,13] with seven spatial levels and 16, 32, 32, 64, 64, 128, 128 filters, multiscale input, and deep supervision [10,13]. Adam [31] was used as the optimizer, with an initial learning rate of 0.0005 that was reduced by a factor of 0.5 after every tenth epoch without improvement [17,18]. The model was trained for 500 epochs using the Dice loss function, with early stopping (patience 200). An epoch was defined as 160 and 40 weight updates for the training and validation sets, respectively. The following three models were compared: an AGU-Net trained on TMAs only without data augmentation (model one), an AGU-Net trained on TMAs only with data augmentation (model two), and an AGU-Net trained on TMAs and WSIs with data augmentation (model three). The three models were evaluated on the validation set and on test sets 1 and 2.
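The balanced sampling scheme with per-technique augmentation probabilities can be sketched as follows (illustrative; the actual image transforms are omitted, only the selection logic is shown):

```python
import random

# Per-patch probability of enabling each augmentation technique
AUG_PROBABILITIES = {
    "flip": 0.5, "rot90": 0.5, "brightness": 0.5, "hue": 0.5,
    "saturation": 0.5, "shift": 0.5, "blur": 0.1,
}

def sample_training_patch(benign_set, insitu_set, invasive_set, rng=random):
    """Draw one training patch: first pick one of the three class sets
    uniformly (countering class imbalance), then pick a patch from it,
    then decide which augmentations to enable."""
    chosen_set = rng.choice([benign_set, insitu_set, invasive_set])
    patch = rng.choice(chosen_set)
    enabled = [name for name, p in AUG_PROBABILITIES.items()
               if rng.random() < p]
    return patch, enabled
```

Sampling the set first, rather than the patch, means each class contributes roughly a third of the training patches regardless of how rare it is in the data.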
The quantitative segmentation performance was evaluated at TMA core level, using the Dice similarity coefficient and pixel-wise precision and recall, disregarding the background class. Each HE/CK TMA core pair in the test and validation sets was saved to disk at x20 magnification. For each TMA core, inference with 30% patch overlap was performed using pyFAST, resulting in segmentation predictions at TMA level used for evaluation. The metrics were computed for the three classes on all TMA cores. Metrics were also computed exclusively for cores where the respective class was present in either the ground truth or the prediction, and exclusively for cores where the respective class was present in the ground truth. In cases where the denominator was zero, the metric score for that core was set to one; this is the case for Dice scores when the prediction and the ground truth are both empty. The best performing model was then converted to the Open Neural Network Exchange (ONNX) format for deployment in FastPathology. Dice scores were also calculated separately for different histological subtypes and for each histological grade.
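The per-core metrics, including the zero-denominator convention, can be sketched as follows (an illustrative NumPy implementation for a single binary class):

```python
import numpy as np

def dice_precision_recall(pred, gt):
    """Pixel-wise Dice, precision, and recall for one binary class on one
    TMA core. Any metric whose denominator is zero is set to 1.0, matching
    the convention described for cores with neither true instances nor
    false positives of the class."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)                       # true positive pixels
    dice_den = pred.sum() + gt.sum()
    dice = 2 * tp / dice_den if dice_den else 1.0
    precision = tp / pred.sum() if pred.sum() else 1.0
    recall = tp / gt.sum() if gt.sum() else 1.0
    return dice, precision, recall
```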
An additional TMA slide was segmented by a preliminary model in FastPathology at x10 magnification. Segmentations in 26 cores (9 patients) were adjusted manually by two pathologists in QuPath. These segmentations were then used as ground truth for a new quantitative evaluation of the final model. Dice scores were calculated for the three classes on TMA core level for all cores.
A qualitative evaluation of the segmentations of the final model was also performed by two pathologists through manual inspection of all TMA cores in the test set slides (test sets 1 and 2) in QuPath. The TMA slides were segmented by the final model in FastPathology at x10 magnification and exported to QuPath. Each case was assigned a score between zero and five, similarly as done by Valkonen et al. [19]. The scoring system is described in detail in Table 1.
Results
Characteristics of the five cohorts are described in Table 2. The proportion of the histological subtype invasive carcinoma of no special type (NST) varied between 67-89%, and the proportion of lobular carcinoma varied between 2-24% in the five cohorts. The proportions of histological grades 1, 2, and 3 in the five cohorts varied between 10-41%, 43-61%, and 16-43%, respectively. Table 3 shows three scores for each model on the evaluation sets (test set 1, test set 2, validation set). The first score (row I) is evaluated on all TMA cores, and a score of one is given if the denominator is zero, indicating that there are no true instances of the given class in the core, nor any false positives. The second score (row II) includes cores with positive values in either the ground truth or the prediction. The third score (row III) only includes cores with positive values in the ground truth. In the validation set, a total of 105, 17, and 342 TMA cores included benign epithelium, in situ lesions, and invasive epithelium, respectively. The corresponding numbers in test set 1 were 77, 14, and 284 cores. In test set 2, a total of 13, 3, and 56 cores included benign epithelium, in situ lesions, and invasive epithelium, respectively. Examples of segmentations can be seen in Fig 6.
(a, e, i) 1000 x 1000 patches of three HE images with (b, f, j) corresponding ground truth, (c, g, k) prediction and (d, h, l) true positive (blue), false positive (yellow), and false negative (green). (c) An almost perfect segmentation of invasive cells. (g) The segmentation connects the invasive cells into larger sheets. (k) An almost perfect segmentation of parts of a benign structure. The ground truth in (j) includes the myoepithelium and fills the upper part of the lumen. Abbreviations: HE = hematoxylin and eosin, TP = true positive, FP = false positive, FN = false negative.
On the TMA slide corrected by two pathologists, Dice scores of 0.56, 0.82, and 0.70 were achieved for benign epithelium, in situ lesions, and invasive epithelium, respectively, using model three on all cores. Precision was 0.66, 0.82, and 0.78, and recall 0.82, 1.00, and 0.67, for benign, in situ, and invasive, respectively. When only including cores where the class was present in the ground truth, Dice scores of 0.40 and 0.70 were reached for benign and invasive, respectively. In total, six and 22 cores included benign and invasive epithelium in their ground truth, respectively. No TMA cores included in situ lesions.
Dice similarity coefficient was calculated separately for different histological subtypes (invasive carcinoma NST, lobular carcinoma, and all other subtypes) and for histological grade 1-3 using model three on all TMA cores (see Table 4).
Qualitative evaluation of the TMA scores on test set 1 gave mean scores of 4.7, 3.7, 2.0, and 4.4 for all epithelium, benign epithelium, in situ lesions, and invasive epithelium, respectively, when excluding class zero on case level (see Table 5). On test set 2, the scores were 4.7, 3.5, 1.9, and 4.6, respectively. On test set 1, when only evaluating cores including the respective class in their ground truth, mean scores of 4.2 and 2.8 were achieved for benign epithelium and in situ lesions (see Table 5). Average scores of 4.0 and 4.2 were reached for benign epithelium and in situ lesions on test set 2 when only evaluating cores including the respective class in their ground truth (see Table 5). The number of cases assigned each score can be found in Additional File 1. Model three was used to create the segmentations that were evaluated qualitatively. Examples of TMA cores with their respective scores between 1-5 are shown in Figs 7 and 8.
TMA cores and their respective qualitative scores. The scores were given on case level; here, only one core per patient is presented. (a) All epithelium score 5, benign score 5, in situ score 0, invasive score 5. (b) All epithelium score 5, benign score 3, in situ score 1, invasive score 5. (c) All epithelium score 5, in situ score 5, invasive score 4. (d) All epithelium score 5, benign score 0, in situ score 1, invasive score 5. Red = benign epithelium, Blue = in situ lesion, Purple = invasive epithelial cells. Abbreviations: TMA = tissue microarray.
A TMA core and its respective qualitative and quantitative scores. The qualitative scores were given on case level; here, only one core is presented. Qualitative scores: all epithelium score 5, benign score 5, in situ score 0, invasive score 5. Quantitative scores: benign and invasive Dice scores of 0.77 and 0.86, respectively. Red = benign epithelium, Purple = invasive epithelial cells. Abbreviations: TMA = tissue microarray.
Discussion
In this study, we have developed a deep learning-based method for segmentation of benign, in situ, and invasive epithelial cells in HE stained breast cancer sections, using IHC and pathologists’ annotations to create ground truths. A new dataset comprising image pairs of HE and CK stained slides with annotations of benign and in situ lesions was created and used for training AI models. In qualitative evaluation, high performance was achieved for all epithelium (4.7/5) and invasive epithelial cells (4.4/5), whereas lower performance was achieved for benign epithelial cells (3.7/5) and in situ lesions (2.0/5). In quantitative evaluation, Dice scores of 0.79, 0.75, and 0.70 were achieved for benign, in situ, and invasive cells, respectively, when evaluating on all TMA cores.
Other studies have used IHC for segmentation of epithelial cells in breast, colon, and prostate cancer sections [17–19]. Brázdil et al. [18] developed an AI model for segmentation of epithelial cells using TMA slides and WSIs from breast and colon cancer. They achieved a sensitivity and specificity of 0.79 and 0.94, respectively. Their BC dataset was limited in size, including only 20 TMA cores (12 patients) and WSIs from five patients. Bulten et al. [17] segmented epithelial cells in sections from prostate cancer patients, separating benign and invasive epithelial cells with a myoepithelial cell marker. They achieved Jaccard indices of 0.78 and 0.83 for invasive and benign epithelium regions, respectively. In combined assessment of all epithelial cells, they achieved a Dice score of 0.84 on an external test set, and 0.88 on areas with only invasive epithelium in an internal test set. Their Dice score for invasive epithelium was higher than ours. However, a fair comparison is difficult, as our model must both correctly classify the epithelial type (benign, in situ, or invasive) and segment the cells to achieve a good score. Bulten et al. [17] observed a performance degradation with higher Gleason grade. We found better performance with higher histological grade on test set 1, and poorer performance with higher histological grade on test set 2. Morphologically, high grade breast cancer cells are often pleomorphic and therefore differ more from benign epithelial cells than breast cancer cells of lower grade. Grade 1 breast cancers are often highly differentiated and may therefore be more similar to benign epithelial structures. Valkonen et al. [19] reached mean scores of 4.0 and 4.7 when two pathologists qualitatively scored the epithelial masks from 0 to 5. This is similar to our qualitative score of 4.7 when evaluating all epithelium.
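When relating Jaccard indices to Dice scores, it may help to note that the two overlap metrics are deterministically linked by D = 2J / (1 + J), so a reported Jaccard of 0.78 or 0.83 sits higher on the Dice scale. A minimal sketch of this conversion (illustrative only; the cited studies report their own metrics directly):

```python
def jaccard_to_dice(j: float) -> float:
    """Convert a Jaccard index J to the equivalent Dice score via D = 2J / (1 + J)."""
    return 2 * j / (1 + j)

# Jaccard indices of 0.78 and 0.83 expressed on the Dice scale
print(round(jaccard_to_dice(0.78), 2))  # -> 0.88
print(round(jaccard_to_dice(0.83), 2))  # -> 0.91
```

The conversion is monotonic, so rankings between methods are preserved, but absolute values are not directly comparable across the two metrics.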
However, comparing different studies, with different methods, data material, and end points, is challenging, as the combination of many parameters and methodology choices influence the result.
Breast cancer is known for being morphologically heterogeneous [12]. Grade 1 tumors generally have more ductal structures and are less pleomorphic than grade 3 tumors, and there is large variation in morphological appearance between different histological subtypes. Our model performed better on invasive carcinoma NST than on lobular carcinomas. Lobular carcinomas are often characterized by single cell growth and scant cytoplasm. These tumors are underrepresented in the dataset, and they lack certain characteristics of epithelial cells, such as the cells’ tendency to cluster together. The scant cytoplasm, and hence sparse CK staining, of lobular invasive cells may have caused false negative CK staining. This may have led to removal of some cells during preprocessing of the CK masks, making invasive lobular cells even more underrepresented. Creating a model that performs equally well on all histological subtypes might not be possible. Artefacts like pen marks and CK marks surrounding TMA cores can also influence the ground truth, as can large shifts or broken tissue due to restaining. A non-rigid registration method could have resulted in better ground truths, and artefacts and false positive or negative staining could have been corrected either manually or by training an additional model [17]. Alternatives to using IHC to provide ground truth could be manual annotation of epithelial cells, or unsupervised methods. Manual annotations are extremely time-consuming, and not feasible if aiming to generate a large and accurate dataset. In the manual corrections that were done in the present study for a final quantitative evaluation, the pathologists could spend up to two hours annotating a single TMA core. Two whole sections of breast cancer were added to the training data. The slides were subdivided and separated between the training and validation sets. The addition of two WSIs did not improve the model quantitatively but was chosen to increase robustness.
Having patches from the same WSI in the training and validation set was not ideal. However, the two WSIs differed in the amount of benign and in situ lesions, and to ensure representation of both classes we allowed parts of the same WSI in both the training and validation sets.
In situ lesions are composed of atypical epithelial cells, and thus have cellular features similar to invasive cells. They are, however, surrounded by myoepithelial cells, similar to benign epithelial structures. The poor performance on in situ lesions could therefore be explained by morphological similarity to both benign and invasive epithelial structures. To obtain a perfect Dice score, the model must identify and classify the epithelial structures correctly and mark the exact same cell boundary as in the ground truth. This is a challenging task. Evaluation on the slide corrected by pathologists gave lower Dice scores than evaluation of the model’s segmentations against CK generated ground truths. In the manually corrected dataset, only TMA cores that were automatically excluded by pyFAST’s TissueMicroArrayExtractor, or that were impossible to annotate manually due to very low tissue quality, were removed. This could have led to the inclusion of more cores with fragile or broken tissue in this dataset than in the dataset for evaluation against CK generated ground truths, thus making the task more challenging. Furthermore, making accurate manual delineations of individual cells is a challenging task, and ground truths created by CK staining could be more precise than manual annotations.
The low number of non-invasive lesions and morphological heterogeneity within these [32] could have affected the models’ performance. TMAs were taken selectively from the invasive tumor region, probably affecting the amount of non-invasive epithelial tissue present. An advantage of using TMAs is, however, the inclusion of more patients. A more extensive dataset, including more patients and WSIs might still be needed to improve model performance. Including more benign and in situ lesions in the dataset would be of particular importance, as these lesions were underrepresented in our data. The low numbers of in situ and benign epithelial structures give an unnaturally high Dice score when including all cores in the calculations since a score of one is given when the model correctly predicts no pixels of the given class. On the other hand, a single misclassification would more strongly influence the result of an underrepresented class.
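The inflation described above follows from the per-core Dice convention: a core where a class is absent from both the ground truth and the prediction is scored 1.0, and averaging many such cores raises the mean for rare classes. A minimal sketch of this convention (an illustrative implementation, not the exact code used in this study):

```python
def dice_score(gt, pred):
    """Dice similarity between two flattened binary masks (0/1 sequences).

    By convention, returns 1.0 when the class is absent from both masks;
    averaging such cores inflates the mean Dice of rare classes, e.g.
    in situ lesions, while a single misclassified core pulls it down sharply.
    """
    intersection = sum(g and p for g, p in zip(gt, pred))
    total = sum(gt) + sum(pred)
    if total == 0:
        return 1.0  # class absent in both ground truth and prediction
    return 2 * intersection / total

# A core with no pixels of the class in either mask scores a "perfect" 1.0
print(dice_score([0, 0, 0, 0], [0, 0, 0, 0]))  # -> 1.0
# A genuine partial overlap: intersection 1, mask sums 2 + 2, Dice = 2/4
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # -> 0.5
```

This is one reason evaluation restricted to cores where the class is present in the ground truth, as reported above, gives lower but arguably more informative scores.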
All slides were stained and scanned at the same laboratory. The models’ results were evaluated both quantitatively and qualitatively, which is important as the quantitative scores may not be representative of the segmentations due to incorrect ground truths. Metrics like the Dice similarity coefficient might not be ideal for evaluating the model’s performance, and a perfect Dice score may not be necessary for clinical use. A pathologist’s qualitative evaluation could provide a more relevant score. The requirements of a model’s performance may depend on the task for which it is used; for some clinical tasks, a segmentation model may need near-perfect results.
To further improve the model, more annotated data would be valuable. However, producing such datasets is extremely time consuming, and access to sufficient annotated data is often a limitation in AI studies. An alternative, time saving annotation approach, could be to iteratively improve the ground truth masks and consequently the model through active learning [33]. Such initial annotations could even be made by non-experts. Ground truths could be obtained by repetitive correction of annotations through multiple iterations of running the model followed by correction of the masks, or by exploring unsupervised methods. Experimenting with different image levels might also improve the results. By making the final model openly available in FastPathology (https://github.com/AICAN-Research/FAST-Pathology) anyone can use it to generate segmentations on their own digitized tissue slides.
Conclusion
The proposed method used IHC and pathologists’ annotations to make epithelial cell ground truths. The resulting segmentation model performed well in detecting epithelial cells and invasive epithelial cells in sections from breast cancer patients. However, correct classification of benign epithelial structures and in situ lesions was more challenging. The need for large, annotated datasets and the great morphologic heterogeneity in breast cancer represent challenging aspects of model development.
Supporting information
S1 Table. Qualitative results on test set 1 and 2.
Number of cases assigned to each score for each class (all epithelium, benign, in situ, and invasive). The scores under “Pre” represent scores where only cores with the respective class in the ground truth are included; otherwise, the score is set to zero. The numbers in parentheses represent the percentages excluding score zero.
https://doi.org/10.1371/journal.pone.0328033.s001
(TIFF)
Acknowledgments
The staining and scanning were performed at the Cellular and Molecular Imaging Core Facility (CMIC) Histology lab, Norwegian University of Science and Technology (NTNU).
References
- 1. Salto-Tellez M, Maxwell P, Hamilton P. Artificial intelligence-the third revolution in pathology. Histopathology. 2019;74(3):372–6. pmid:30270453
- 2. Bonert M, Zafar U, Maung R, El-Shinnawy I, Kak I, Cutz J-C, et al. Evolution of anatomic pathology workload from 2011 to 2019 assessed in a regional hospital laboratory via 574,093 pathology reports. PLoS One. 2021;16(6):e0253876. pmid:34185808
- 3. Evans AJ, Depeiza N, Allen S-G, Fraser K, Shirley S, Chetty R. Use of whole slide imaging (WSI) for distance teaching. J Clin Pathol. 2021;74(7):425–8. pmid:32646928
- 4. Litjens G, Sánchez CI, Timofeeva N, Hermsen M, Nagtegaal I, Kovacs I, et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep. 2016;6:26286. pmid:27212078
- 5. Graham S, Chen H, Gamper J, Dou Q, Heng P-A, Snead D, et al. MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images. Med Image Anal. 2019;52:199–211. pmid:30594772
- 6. Li C, Wang X, Liu W, Latecki LJ, Wang B, Huang J. Weakly supervised mitosis detection in breast histopathology images using concentric loss. Med Image Anal. 2019;53:165–78. pmid:30798116
- 7. Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep. 2018;8(1):3395. pmid:29467373
- 8. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention – MICCAI 2015. vol. 9351. Springer; 2015. p. 234–41.
- 9. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203–11. pmid:33288961
- 10. Bouget D, Pedersen A, Hosainey SAM, Solheim O, Reinertsen I. Meningioma segmentation in T1-weighted MRI leveraging global context and attention mechanisms. Front Radiol. 2021;1:711514. pmid:37492175
- 11. Vyberg M. Kompendium in anvendt immunhistokemi. 6th ed. 2005.
- 12. WHO Classification of Tumours Editorial Board. Breast tumours. 5th ed. International Agency for Research on Cancer; 2019.
- 13. Pedersen A, Smistad E, Rise TV, Dale VG, Pettersen HS, Nordmo T-AS, et al. H2G-Net: a multi-resolution refinement approach for segmentation of breast cancer region in gigapixel histopathological images. Front Med (Lausanne). 2022;9:971873. pmid:36186805
- 14. Galal S, Sanchez-Freire V. Candy cane: breast cancer pixel-wise labeling with fully convolutional densenets. In: Image analysis and recognition. Cham: Springer; 2018. p. 820–6.
- 15. Ni H, Liu H, Wang K, Wang X, Zhou X, Qian Y. WSI-Net: branch-based and hierarchy-aware network for segmentation and classification of breast histopathological whole-slide images. In: Suk HI, Liu M, Yan P, Lian C, editors. Machine learning in medical imaging. Cham: Springer; 2019. p. 36–44.
- 16. Mehta S, Mercan E, Bartlett J, Weaver D, Elmore J, Shapiro L. Learning to segment breast biopsy whole slide images. In: IEEE Winter Conference on Applications of Computer Vision (WACV); 2018. p. 663–72.
- 17. Bulten W, Bándi P, Hoven J, Loo R van de, Lotz J, Weiss N, et al. Epithelium segmentation using deep learning in H&E-stained prostate specimens with immunohistochemistry as reference standard. Sci Rep. 2019;9(1):864. pmid:30696866
- 18. Brázdil T, Gallo M, Nenutil R, Kubanda A, Toufar M, Holub P. Automated annotations of epithelial cells and stroma in hematoxylin-eosin-stained whole-slide images using cytokeratin re-staining. J Pathol Clin Res. 2022;8(2):129–42. pmid:34716754
- 19. Valkonen M, Isola J, Ylinen O, Muhonen V, Saxlin A, Tolonen T, et al. Cytokeratin-supervised deep learning for automatic recognition of epithelial cells in breast cancers stained for ER, PR, and Ki-67. IEEE Trans Med Imaging. 2020;39(2):534–42. pmid:31398111
- 20. Smistad E, Bozorgi M, Lindseth F. FAST: framework for heterogeneous medical image computing and visualization. Int J Comput Assist Radiol Surg. 2015;10(11):1811–22. pmid:25684594
- 21. Smistad E, Ostvik A, Pedersen A. High performance neural network inference, streaming, and visualization of medical images using FAST. IEEE Access. 2019;7:136310–21.
- 22. Pedersen A, Valla M, Bofin AM, De Frutos JP, Reinertsen I, Smistad E. FastPathology: an open-source platform for deep learning-based research and decision support in digital pathology. IEEE Access. 2021;9:58216–29.
- 23. Kvåle G, Heuch I, Eide GE. A prospective study of reproductive factors and breast cancer. I. Parity. Am J Epidemiol. 1987;126(5):831–41. pmid:3661531
- 24. Holmen J, Midthjell K, Krüger Ø, Langhammer A, Holmen TL, Bratberg GH. The Nord-Trøndelag Health Study 1995-97 (HUNT2): Objectives, contents, methods and participation. Norsk Epidemiologi. 2003;13(1):19–32.
- 25. Sandvei MS, Lagiou P, Romundstad PR, Trichopoulos D, Vatten LJ. Size at birth and risk of breast cancer: update from a prospective population-based study. Eur J Epidemiol. 2015;30(6):485–92. pmid:26026723
- 26. Knutsvik G, Stefansson IM, Aziz S, Arnes J, Eide J, Collett K, et al. Evaluation of Ki67 expression across distinct categories of breast cancer specimens: a population-based study of matched surgical specimens, core needle biopsies and tissue microarrays. PLoS One. 2014;9(11):e112121. pmid:25375149
- 27. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419. pmid:25613900
- 28. Bankhead P, Loughrey MB, Fernández JA, Dombrowski Y, McArt DG, Dunne PD, et al. QuPath: open source software for digital pathology image analysis. Sci Rep. 2017;7(1):16878. pmid:29203879
- 29. Ruifrok AC, Johnston DA. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol. 2001;23(4):291–9. pmid:11531144
- 30. Guizar-Sicairos M, Thurman ST, Fienup JR. Efficient subpixel image registration algorithms. Opt Lett. 2008;33(2):156–8. pmid:18197224
- 31. Kingma DP, Ba J. ADAM: a method for stochastic optimization. 2015. https://doi.org/10.48550/arXiv.1412.6980
- 32. Gorringe KL, Fox SB. Ductal carcinoma in situ biology, biomarkers, and diagnosis. Front Oncol. 2017;7:248. pmid:29109942
- 33. Wang H, Jin Q, Li S, Liu S, Wang M, Song Z. A comprehensive survey on deep active learning in medical image analysis. Med Image Anal. 2024;95:103201. pmid:38776841