Multicenter automatic detection of invasive carcinoma on breast whole slide images

Breast cancer is one of the most prevalent cancers worldwide and pathologists are closely involved in establishing a diagnosis. Tools to assist in making a diagnosis are required to manage the increasing workload. In this context, artificial intelligence (AI) and deep-learning based tools may be used in daily pathology practice. However, it is challenging to develop fast and reliable algorithms that can be trusted by practitioners, whatever the medical center. We describe a patch-based algorithm that incorporates a convolutional neural network to detect and locate invasive carcinoma on breast whole-slide images. The network was trained on a dataset extracted from a reference acquisition center. We then performed a calibration step based on transfer learning to maintain the performance when translating on a new target acquisition center by using a limited amount of additional training data. Performance was evaluated using classical binary measures (accuracy, recall, precision) for both centers (referred to as “test reference dataset” and “test target dataset”) and at two levels: patch and slide level. At patch level, accuracy, recall, and precision of the model on the reference and target test sets were 92.1% and 96.3%, 95% and 87.8%, and 73.9% and 70.6%, respectively. At slide level, accuracy, recall, and precision were 97.6% and 92.0%, 90.9% and 100%, and 100% and 70.8% for test sets 1 and 2, respectively. The high performance of the algorithm at both centers shows that the calibration process is efficient. This is performed using limited training data from the new target acquisition center and requires that the model is trained beforehand on a large database from a reference center. This methodology allows the implementation of AI diagnostic tools to help in routine pathology practice.

Introduction Cancer detection is a major public health issue, with almost 10 million cancer deaths worldwide in 2020, 19.3 million new cases diagnosed, and an expected rise to 28.4 million cases from 2020 to 2040 (+47%) [1]. Breast cancer (BC) has now surpassed lung cancer as the most commonly diagnosed malignancy (2.3 million new cases diagnosed worldwide, 11.7% of all cancer diagnoses) and is the leading or second cause of premature death in women in many countries according to the World Health Organization. Accurate and prompt detection of BC is essential to improve treatment efficacy and survival.
Current diagnosis of BC relies on close visual examination of surgical or biopsy material at cellular level by highly qualified pathologists. The average annual workload of pathologists has increased by around 5-10% [2] and current data indicate a shortage of histopathologists worldwide [3] leading to overwork, fatigue, and a higher risk of mistakes and diagnostic errors [2]. Furthermore, the recent Covid-19 pandemic has led to a significant delay and backlog in cancer diagnosis with the number of new cases diagnosed falling by 23.3% in 2020 according to the National Cancer Institute [4].
An increasing number of histopathology departments are going digital to overcome this problem [5]. Recent advances in slide scanners and information technology infrastructure enable the use of high-resolution digitized images called whole slide images (WSI) in pathologists daily routine. Digital pathology provides telemedicine, slide sharing for collaboration or second opinions, easier generation of reports, and long-term slide preservation. Above all, digital pathology allows the development of computational pathology (i.e., the analysis of WSI by algorithms in order to help pathologists in their diagnosis). This new field resulting from computer vision has recently come to the attention of the pathology community by promising time saving, better reproducibility, and better accuracy [6]. The use of advanced machine learning and deep-learning has been demonstrated on breast [7,8], prostate [9,10], skin [11], lung [12], and colorectal [13] cancer detection in numerous studies.
Several studies [14][15][16][17] have shown that machine learning based classifiers are built and used directly on sub-images extracted from WSI to predict whether they contain cancer or not. However, in a clinical routine context, a pipeline locating cancer on raw WSI is needed.
Few studies take into account computing times. In the study carried out by Pantanowitz et al. [18] a complete pipeline was proposed on prostate tissues, but computation times were approximately 20 min per slide and were therefore not compatible with daily clinical practice. In this study, we propose a fast end-to-end pipeline from raw WSI to cancer location meeting the needs of routine pathology practice.
Another requirement when designing a tool for clinical practice is the ability to maintain performance when transferring to a new medical center. Slide preparation and acquisition scanners might differ, leading to various image aspects. An algorithm that has learnt on data from a given acquisition center can be very efficient on that specific acquisition center, while performance may deteriorate when moving to a new center. In a previous study [19], the authors used a deep-learning model for style normalization before their classification model in order to make their system center agnostic. In another study [20], the authors compared color normalization techniques and color augmentation. They showed that both techniques improved generalization to unseen stain variations and that color augmentation led to better accuracy. However, a drop in performance with both techniques for the classification task on unseen center data was observed, which is not satisfactory for clinical use.
One of the most challenging aspects of computational pathology is the heavy size and weight of WSI (usually more than 10 000 x 10 000 pixels, and over 500 Mo, respectively). Traditional deep learning techniques using WSI as raw data are inapplicable, and when it comes to analyzing WSI, both storage and computation time are heavy challenges. The acceptance of diagnostic tools by pathologists requires the latter to be fit for usage and effective (i.e., fast and precise) in their daily routine practice.
With these challenges in mind, we aimed to design a fast and simple processing pipeline that can be generalized to any center using a limited amount of training data, sticking to light model architecture for performance purposes.
Here we present a patch-based approach that detects and locates invasive carcinoma (IC) on WSI. The algorithm consists of a two-step pipeline. First, a filtering algorithm parses WSI regions that are relevant to IC detection, focusing the analysis on a small part of the slide and leading to faster processing. Filtered regions are partitioned into patches and fed into a convolutional neural network (CNN) based classifier that predicts whether those patches contain IC or not. This model was first trained with labeled data from a so-called "reference center", and performance was measured on a test dataset containing WSI from this same reference center. Scanners and staining methods for each center are summarized in Table 1. When testing this so-called "master model" (or "master classifier") on data from another center, termed "target center", performance appeared to be greatly decreased. To maintain a good performance level, we used the master classifier as a starting state for transfer learning with training data extracted from the target center. The obtained "calibrated model" (or "calibrated classifier") performed well on data from the target center.
Below, we first explain the WSI processing pipe by describing the filtering process and the architecture of the IC classifier. We then present the various datasets used in this study, both for training the classifier and testing the IC detection algorithm. We give insights on the labeling process and on the composition of the datasets. We then describe the training processes leading to the master classifier and the calibration step used to obtain the target classifier. The performance evaluation methodology is then described, allowing us to measure the system's efficiency both at patch level (i.e., how accurately IC is localized) and at slide level (i.e., whether a slide containing IC is predicted as such and vice-versa). Finally, the results obtained and future studies are discussed.

Whole slide image pipeline processing
To locate IC on WSI we adopted a patch-based approach [8,17,21]. An input WSI was parsed into small images called patches which were filtered and later fed to a CNN-based classifier to

PLOS DIGITAL HEALTH
determine whether they contained IC or not. From there a slide level score was determined to assess whether the slide contained IC or not. The full processing pipeline is illustrated in Fig 1. In the parsing step, we chose to focus only on nuclear epithelial areas since these are the regions where IC diagnosis is performed. To do so, epithelial regions were first segmented automatically and were then parsed into patches at zoom level x20 and with size 256 x 256 pixels; non relevant patches: blur, no tissue, or those that did not contain nuclei were finally discarded. Doing so dramatically reduced the number of patches going through further processes; depending on the size of the epithelial regions filtering can discard up to 99% of the tissue (cf. Fig 1c). The filtering process is further described in S1 Text, Fig A. For each retained patch, a subsidiary patch with the same size and from the same center, but extracted at zoom level x5, was retrieved and fed into a two-class CNN-based classifier. A major difference with traditional patch-based approaches [8,17,21] was that we fed the network with x5 zoom patches to determine IC scores, these scores being attributed to the associated x20 zoom patches. The IC scores represent the probability that the patch contains IC (see section Evaluation method for more details). In other words, the contextual information contained in the x5 zoom patch determined the label of its central region (see Fig 1D and 1E). Keeping the x20 patch (and not the x5 one) as a base unit for the analysis allowed us to have a better resolution for cancer localization.

Network architecture
The model architecture consisted of a Resnet50 [22] feature extractor, the final fully connected layers being replaced by a random forest classifier. This hybrid architecture was found to improve results compared to using a simple ResNet50 architecture. The classifier assigned each patch with a probability of belonging to the IC class. . Epithelial nuclear regions are first detected and parsed (c) into square patches at zoom x20 (e). For each x20 patch, an auxiliary patch with the same center and size is extracted at zoom x5 (d) and fed to a CNN based classifier to predict an IC score. This score is attributed to the x20 patch. The set of IC scores for x20 patches belonging to nuclear epithelial regions enables the location of cancer on a slide (f). Note that a patch is considered to fall into the IC class when its score is above a threshold determined through the ROC method. NB: the color code in (f) ranges from blue (very low IC score-low probability of IC) to red (very high IC score-high probability of IC). A slide IC score is then computed by taking the weighted average of the IC patches score. https://doi.org/10.1371/journal.pdig.0000091.g001

Material and dataset
In this section the main libraries incorporated in our algorithm are mentioned and the datasets used for both the training and test are described.

Material
The filtering step was performed on CPU units: Intel(R) Core(TM) i9 3.6 GHz.
Model training and IC inference were performed on GPU: NVIDIA GeForce RTX 2080 Ti, 11 Gb, frequency boost 11 GHz.
The training of the CNN was managed with tensorflow 2.5 and the training of the random forest classifier was managed with scikit-learn. To process WSI, we used the Openslide library.

Datasets
The datasets used in this study were built from WSI originating from two different medical centers referred to respectively as "reference center" and "target center". As described in this section, slides were first labeled, then patches were extracted and filtered to generate training and test datasets. The training datasets were used to train the networks and the test datasets were used to assess the network's performance. The complete extraction and training pipe is illustrated in Fig 2. Both datasets were annotated according to the same process: using in-house software, 25 experienced pathologists were asked to draw regions of interest (ROI) on the WSI. Each ROI was checked and validated by an expert pathologist, in case of disagreement the WSI is sent back for revision until a consensus is found. ROI were drawn for regions presenting pathological patterns and for some healthy tissue as well. Each ROI was associated with a label Two-stage dataset generation and training process. This figure illustrates our two stage dataset generation training process. Whole slide images (WSI) from the reference center are parsed and broken down into patches. These are filtered so as to keep only those belonging to nuclear epithelial regions. The patches obtained are split between the training and test sets (at a ratio of 0.8, resp 0.2). Note that a given patient's slides are either fully included in the test or training set so as to avoid data leakage. The reference training set is used to train the master IC classifier. The performance of the master classifier is evaluated on the reference test set. Regarding the calibration process, the master model is used as an initial state for transfer learning training on the target center training set. The resulting calibrated model is evaluated on the target test set. https://doi.org/10.1371/journal.pdig.0000091.g002

PLOS DIGITAL HEALTH
corresponding to its diagnosis. Labels corresponding to invasive ductal carcinoma, invasive lobular carcinoma or mucinous carcinoma were grouped into the IC class and every other label corresponding to benign pathologies or healthy tissue was grouped into a Rest class.
Annotated slides were then processed and parsed into patches, each patch being assigned the label corresponding to the ROI it belonged to. Annotated ROIs may contain non-tissue pieces, blurry regions or stroma; to obtain quality data, those patches were filtered out according to the same criteria as those applied during the filtering step in the processing pipeline (see section "Whole slide processing pipeline"). The datasets are described in Table 2. For each center, several slides from the same patient were usually available. In order to split each dataset into a training and a test set, patient folders were divided so that~80% of the slides lay in the training set. A given patient folder exclusively belonged to the test or the training set. This ensured that no data leakage occurred between the training and test sets.
This table summarizes, for each class, the number of patient folders, slides, and patches that are contained in both datasets (training and test subsets). Note that a slide is referred to as an "IC slide" if it contains at least one region of interest with a label being one of the following: invasive ductal carcinoma, invasive lobular carcinoma or mucinous carcinoma. Dataset 1 (resp. 2) contains 135 (resp. 48) slides with ductal IC, 52 (resp. 0) slides with lobular IC, and five (resp. 1) slides with mucinous carcinoma. Note that slides come from both biopsies and mastectomies, in comparable proportions. The proportion of cases between various labels was not chosen, it is representative of the routine practice of the medical centers that provided the data.
As shown in Table 2, the reference dataset contained much more data than the target dataset; the former was used for training the master classifier whereas the target dataset was used for calibration.

Training of the classification model
In order to obtain an accurate classification model, relevant image features must be extracted from the images. This was achieved using a CNN that automatically learned optimal information for the task at hand, in this work the CNN used was a ResNet50. This network was a critical component of the proposed processing pipeline and was trained using a large amount of quality annotated data.
In [20], the authors show that data augmentation significantly improves models' performance on histological data. Building on this knowledge, a similar set of data augmentation functions were used during the CNN training to reduce overfitting and improve results on unseen data, namely flips and rotations, additive Gaussian noise, hue, saturation, contrast, and brightness transformations. The various parameters controlling augmentation functions were set using a grid search scheme.
To train the ResNet50, binary cross entropy was optimized with an Adam optimizer [23] with an initial learning rate of 0.001. ResNet50 weights were initialized with weights resulting

PLOS DIGITAL HEALTH
from pre-training on ImageNet and no layer was frozen. An early stopping condition was set. The features learnt from this feature extractor were used as inputs to train a random forest classifier. More specifically, the random forest was fed with feature maps output by the ResNet50 backbone from which the final fully connected layer was removed. The network resulting from the training phase was the master model and could then be calibrated with data from another center in order to perform well on the latter.

Calibration step to address a new center
From one medical center or laboratory to another, slide preparation methods, staining processes, and slide scanner types can differ considerably. Because of these variations, the general aspect of the scanned slides is notably different (see Fig 3). Consequently, although the master model was shown to be robust to analyze data from the same center as the training dataset, its performance deteriorated when moving to a new acquisition center (see Results section). In order to address this problem, we set up a calibration step which made it possible to specifically adapt the master model to a new target center. The calibration step was based on transfer learning: weights from the master model were taken as the starting state for new training with labeled data from the target center. An almost 10-fold decrease in data required during the calibration step was observed when compared to the training on the reference center (see Table 3). The calibration training was performed with the same hyperparameters and training strategy as the master CNN training.

PLOS DIGITAL HEALTH
This table illustrates the results of both the master IC classifier model and the target IC classifier model, on both the reference and target test sets. The master model performs well on the reference test set but its performance is greatly decreased when translating to the target center. Differences in slides preparation and acquisition scanners make it challenging to generalize to a new center. The calibrated model is obtained through a transfer learning approach by taking the master model as the starting state when learning on a training dataset from the target center. The performance of the target model obtained on the target test set is highly improved and is comparable to that obtained by the master model when tested on the reference test. Note that the target model has its metrics decreased on the reference test set.
The target model resulting from this calibration step was especially trained to perform well on data from the target center. This model could be directly plugged into the pipeline of WSI processing described above, as illustrated in Fig 4.

Evaluation method
The performance of the master and calibrated models was measured on both reference and target centers. As described in "Material and dataset", patch test sets (i.e., reference test set and target test set) were built from slides from both centers to assess the generalizability of our algorithm.
The performance of each model was evaluated both at the patch and slide level. For patch level classification, a simple threshold of the IC score determined whether a given patch was classified as IC or Rest. This threshold was determined as the best precision/recall tradeoff by maximizing the F1-score on validation data. For slide level classification, we defined the slide score S IC as the sum of the scores of patches classified as IC within this slide normalized by the total number of patches detected on the WSI.
where P is the patch classification score, P 0 is the threshold from which a patch is positive to IC and N is the total number of epithelial patches. If this score was above a fixed threshold (also determined using F1-score maximization) then the slide was considered to contain IC. The number of false negatives (FN), false positives (FP), true positives (TP), and true negatives (TN) were then computed both at the patch and slide level, and standard accuracy, precision, and recall metrics were measured. Accuracy was defined as (TP+TN)/(TP+FP+FN+TN) and represented the percentage of patches (resp. slides) correctly predicted; precision was defined as TP/(TP+FP) and represented the percentage of cancer-predicted patches (resp. slides) that were truly cancerous; recall was defined as TP/(TP+FN) and represented the percentage of patches (resp. slides) presenting IC regions that were correctly predicted as cancer.

Results
The results are summarized in Table 4.
The table details the classification metrics at patch and slide levels for the reference and target datasets. This illustrates the calibrated system's performance.

Accuracy of the model
At the patch level, the model had an accuracy of 92.1% and 96.3% on the reference and target test sets, respectively. At the slide level, accuracy was 97.6% for the reference test set and 92.0% for the target test set.

Model recall
At the patch level, the model had a recall of 95% and 87.8% on the reference and target test sets, respectively. At the slide level, recall was 90.9% for the reference test set and 100% for the target test set ( Table 4).

Precision of the model
At the patch level, precision was 73.9% and 70.6% for the reference and target test sets, respectively. At the slide level, precision was 100% for test set 1 and 70.8% for test set 2 (Table 4).

Computation times
The average computation time of the complete pipeline for slides from our test set was 114.9 s with standard variation of 95 s. This could be broken down as follows: 108.3 s on average for the filtering process and 16.6 s on average for the IC inference step.

Discussion
The algorithm developed and described in this paper aimed at detecting IC on WSI. It proved to be accurate and precise when tested at patch and slide level on the test datasets. However, to the best of our knowledge, there is no common standard either for metrics used or for datasets, and authors usually test their algorithms on their own data using different metrics. Some datasets do not even allow slide level evaluation and metrics are only computed at patch level. These reasons make any comparison complicated. For instance, Narayanan and colleagues 17 used a simple patch-based 5-layer CNN with color constancy pre-processing and achieved an AUC of 0.935 on a breast histology image dataset from Kaggle [24]. In another study, Cruz-Roa and colleagues introduced a 3-layer convnet classifier with a single convolutional layer [25]. A slide level DICE coefficient was computed to evaluate the model. These authors reported a DICE of 0.7494 ± 0.2071, positive predictive value (PPV) of 0.6464 ± 0.2870, and negative predictive value (NPV) of 0.9709 ± 0.0350 on a dataset consisting of TCGA breast images [26]. Celik and coworkers [8] were able to achieve 91.96% accuracy on the BreakHis dataset [27] at patch level using a pretrained ResNet-50 with no data-augmentation. Using a FusionNet encoder + softmax classifier, Brancati's group [28] reported 87.76% accuracy on Janowczyk's dataset [29]. Finally, Zeng and Zhang [7] used Google Cloud AutoML Vision on breast histopathology images from Mooney's dataset [24] and claimed 85.26% balanced accuracy at the patch level. As described above, our technique led to a patch level accuracy of more than 90% (see section Results) and a slide level accuracy of more than 92%. This is a state-of-the-art performance although, for previously mentioned reasons, a fair and closer comparison with other techniques is difficult.
A key point of the proposed algorithm pipeline is that it focuses the analysis solely on nuclear epithelial regions, leading to faster treatment, compatible with routine clinical use. Contrary to previous reports, where regions on which the analysis will be performed are determined through machine learning techniques, we propose simple, fast, and efficient filtering criteria based on standard image processing methods. No additional training is therefore needed. To the best of our knowledge, this is an original approach.
Another focus of our work was the ability of our model to generalize to different acquisition centers with various scanners and preparation processes that cause changes in slide appearance. The use of a two phase-training process provided a similar performance on the target test dataset compared to the reference test set while using much less training data. The target dataset was more than 10-times smaller than the reference dataset. It should also be noted that the diversity of pathologies found in the target dataset was smaller than that observed in the target dataset. Therefore, the reference dataset contained a lot of information that was not contained in the target dataset. It is important not to forget the information learnt during the first stage of training when transferring the knowledge to the target. As can be observed from the decreased performance on the reference center when using the calibrated CNN instead of the master CNN, it can be seen that some information is lost during the calibration phase (see Table 3). This problem will be addressed in a future study. Zaneta Swiderska-Chadaj et al. [19] proposed a technique using Cycle-GAN style normalization for a multicenter algorithm with good performance. They achieved similar accuracy on their development dataset and test datasets built using data originating from different centers. However, they used a training set containing data from several centers to train their classification model. With our proposed method, the network needs only one reference center dataset and calibration can then be performed using a small amount of data from a given target center. Furthermore, in order to reduce the computation time and the expectancy on hardware capabilities at inference, the number of models used and their size is minimal. It is therefore more adapted to use a single calibrated CNN instead of a Cycle-GAN normalization step followed by a CNN classification step.

Conclusion
Our IC detection pipe is efficient and could be applied to WSI from different medical centers using a limited amount of additional data. This tool may help pathologists to make a more accurate and faster diagnosis and postoperative treatment planning. Such a support can be used for quickly screening slides in high-throughput laboratories, selecting slides needed for fast immunohistochemistry, making practice more consistent, or assisting in reporting. Future studies include a parametric study on the size of the target dataset, for various medical centers, to determine the minimum volume of data necessary to maintain performance when moving from the reference to a target center, improvement of the processing time, generalization to pathologies other than IC, and organs other than the breast. We will also investigate multidomain adaptation as a means of obtaining proper generalization results with an even more limited amount of data when addressing a specific target center.