DeepFocus: Detection of out-of-focus regions in whole slide digital images using deep learning

The development of whole slide scanners has revolutionized the field of digital pathology. Unfortunately, whole slide scanners often produce images with out-of-focus/blurry areas that limit the amount of tissue available for a pathologist to make an accurate diagnosis/prognosis. Moreover, these artifacts hamper the performance of computerized image analysis systems. These areas are typically identified by visual inspection, which leads to a subjective evaluation causing high intra- and inter-observer variability. This process is also both tedious and time-consuming. The aim of this study is to develop a deep learning based software tool called DeepFocus, which can automatically detect and segment blurry areas in digital whole slide images to address these problems. DeepFocus is built on TensorFlow, an open source library that exploits data flow graphs for efficient numerical computation. DeepFocus was trained using 16 different H&E and IHC-stained slides that were systematically scanned on nine different focal planes, generating 216,000 samples with varying amounts of blurriness. When trained and tested on two independent datasets, DeepFocus resulted in an average accuracy of 93.2% (± 9.6%), which is a 23.8 percentage point improvement over an existing method. DeepFocus has the potential to be integrated with whole slide scanners to automatically re-scan problematic areas, hence improving the overall image quality for pathologists and image analysis algorithms.


Introduction
High-quality digital slides are becoming ubiquitous and indispensable in clinical workflow and research in pathology. Along with spatial resolution and color depth, image sharpness is often used to gauge the quality of digital slides. Most modern scanners come equipped with an autofocus (AF) optics system that selects focal planes to accurately capture three-dimensional tissue morphology as the best two-dimensional digital image. AF optics systems determine a set of focus points at different focal planes so that they align with the tissue height, which may vary slightly within a slide. From these focal planes, scanners capture images to produce a sharp tissue representation. However, commercial scanners may still produce digital images with out-of-focus/blurry areas if their AF optics system erroneously selects focus points that lie in a different plane than the proper height of the tissue [1]. The amount of blurriness is proportional to the distance of the focus points from the actual tissue plane, i.e. the larger the distance, the blurrier the result. When parts of an image are blurry, this affects the performance of both pathologists and automated image analysis algorithms. If these areas can be identified, the slides can be rescanned with additional focus points in the areas of blurring. If the amount of blurring is minimal (acceptable levels are to be determined in advance), the resulting images can still be presented to pathologists or algorithms. Currently, blurry regions are identified manually, a process that is subjective, tedious, error-prone, and time-consuming [2]. It also disrupts the workflow, because blurry regions are easily missed until the time of clinical review, which delays the diagnosis of the case.
In recent years, a few systems have been developed to identify scanning artifacts automatically from digital slides [1,3-6]. Walkowski and Szymas [3] developed an algorithm to compare the quality of digital slides generated by different scanners. They scanned the same glass slide with multiple scanners and noticed that each scan resulted in a constant amount of translation in the planar space, which made it difficult to compare the digital slides fairly. The resulting digital slides were manually registered to compensate for the constant translation. To reduce the computational complexity, the algorithm randomly selected relatively small areas that corresponded to the same fragments among digital images of the same slide. In these small areas, the Gray Level Co-occurrence Matrix (GLCM) was computed, which aggregated the distribution of co-occurring values [7]. The authors also used the GLCM to compute contrast and entropy statistics, which were then used to compare the quality of images captured by different scanners. Although the authors did not explicitly propose an algorithm to identify blurry regions, the algorithm can easily be adapted to identify such regions.
In another study, Zerbe et al. developed a distributed image analysis system to calculate the amount of sharpness of image patches, classifying each patch into one of the four sharpness categories: excellent, okay, review and defective [5]. The sharpness was computed by a modified Tenenbaum gradient (Tenengrad) operator [8]. In a similar study, Hashimoto et al. proposed a method that is capable of assessing both blur and noise by an evaluation index [4]. The index was calculated by linear regression analysis using the sharpness and noise information from the training dataset. The training dataset was selected according to the intended purpose of the image quality evaluation, i.e. clinical usage or image analysis. The method was tested with both objective and subjective image quality measurements on small regions sampled from a single hematoxylin and eosin (H&E) digital slide. Because both training and test patches were extracted from the same H&E digital slide, it's hard to assess how generalizable this method is.
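For orientation, the Tenengrad focus measure referred to above can be sketched as the mean squared Sobel gradient magnitude of a grayscale patch. This is the plain textbook form, not the modified operator used by Zerbe et al.; the function names and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def sobel_gradients(gray):
    """Return horizontal and vertical Sobel responses on the valid interior."""
    g = np.asarray(gray, dtype=np.float64)
    # 3x3 Sobel kernels applied through shifted slices (no SciPy required).
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return gx, gy

def tenengrad(gray):
    """Plain Tenengrad score: mean squared gradient magnitude (higher = sharper)."""
    gx, gy = sobel_gradients(gray)
    return float(np.mean(gx ** 2 + gy ** 2))

# A sharp vertical edge and a horizontally smoothed copy of it.
sharp = np.zeros((16, 16))
sharp[:, 8:] = 1.0
blurred = (sharp[:, :-2] + sharp[:, 1:-1] + sharp[:, 2:]) / 3.0
print(tenengrad(sharp) > tenengrad(blurred))
```

A threshold on such a score is what separates sharp from blurry patches in the gradient-based methods; choosing that threshold for varying tissue and stain types is the part that is hard to generalize.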
Lahrmann et al. presented an algorithm to overcome the difficulties of the liquid-based cytology scanning [6]. The algorithm first performs a systematic analysis of the height variations within cytological samples in the z-dimension. Then, it performs a cell-based analysis to decide whether a focus point is valid, i.e. a cell, not an artifact, is detected. After the sample is imaged with the focus points, the algorithm divides the image into 16 sub-regions and then detects cells in each sub-region by their color intensity values. In the experiments, for each sub-region, 200 of the detected cells were classified as sharp or blurred by a Support Vector Machine classifier using five gradient based features. The percentage of in-focus cells (0-100%) defines a score for each region, and a combination of these scores in a slide determines its sharpness. If the image quality is below a threshold, the algorithm reselects other focus points and starts the sharpness calculation again.
Lopez et al. proposed another approach to detect blurred regions due to incorrect (or suboptimal) focusing during acquisition [1]. To train their algorithm, they used 48,000 tiles of size 200x200 pixels at 20x magnification. For each tile, the algorithm extracted a set of features: Haralick features from the GLCM and the Tenengrad operator. Using these features, a Decision Tree classified each test tile as in-focus or blurred. To reduce the number of false positives, the algorithm applied a gray-scale morphological closing followed by a gray-scale morphological opening with a 3x3 structuring element. To validate the method, an expert randomly selected blurry and sharp tiles from digital slides of H&E and immunohistochemically (IHC) stained glass slides and evaluated the classification performance on these 3,438 tiles. A similar approach was proposed by Jimenez et al., which starts by extracting the tissue map using Otsu thresholding [2]. The algorithm divided the tissue into 64x64 pixel tiles. For each tile, the Cumulative Probability of Blur Detection, contrast, entropy, and Tenengrad statistics were calculated. Each of these statistical measures was subjected to thresholding to decide whether the tile under consideration was in focus or blurry. Ties were resolved by the majority voting algorithm [9]. To reduce classification errors, the resulting map was smoothed by applying a morphological closing followed by a morphological opening with a disc-shaped structuring element. Although all these studies have shown promising results, they were not systematically validated to establish how they generalize to unseen data, or how well they perform for varying amounts of blurriness. To illustrate the complexity of the blur detection problem, we randomly selected 80 tiles from four different slides (with different amounts of blurring) and subjected them to a blind deconvolution process [10].
We initialized the blind deconvolution with a Gaussian function (zero mean and a standard deviation of two) and estimated the deblurring function via maximum likelihood. Fig 1 shows the mean and standard deviation of the resulting Gaussian deblurring functions. From this figure, it is evident that a family of deblurring functions would be necessary to recover the correct amount of blurring from histology specimens. Considering the variation and the amount of non-linearity involved in the process, we decided to develop a deep learning [11] based approach, which we expect to be better suited to this task than the conventional image analysis methods previously considered [1,3-6]. The success of deep learning over conventional image analysis methods [12,13] is mainly attributed to its ability to identify discernible features without human intervention. Deep learning has recently been successfully applied in digital pathology to detect and segment nuclei and for diagnostic classification [14-17]. However, to the best of our knowledge, deep learning has not yet been used to identify blurry regions in digital pathology images.
In this paper, we propose a novel convolutional neural network to automatically identify out-of-focus regions in histopathological images. Our method is novel in terms of: 1) data curation, 2) generalization to different types of tissues, 3) being agnostic to H&E and IHC staining, and 4) algorithmic efficiency.

Training and validation datasets
Our training dataset contained four digital slides with different stains (H&E, Ki67, CD21, and CD10) from each of four different patients, i.e. 16 slides in total. For each slide, we scanned a region of interest (ROI) of approximately 6 mm² area with an Aperio ScanScope (Leica Biosystems Inc., Buffalo Grove, IL) at 40x magnification, where the pixel size is 0.2461 μm x 0.2461 μm. For each ROI, a trained operator manually selected 25 focus points and fine-tuned the autofocus values of the selected points to ensure that the focal planes aligned well with the tissue height. Then, the operator perturbed the focus points with a fixed offset value, O, where O ∈ {−2.5 μm, −2.0 μm, −1.5 μm, −0.5 μm, 0.5 μm, 1.5 μm, 2.0 μm, 2.5 μm} (Fig 2). Together with the unperturbed plane, this yielded nine focal planes per ROI. This method enabled us to obtain the same ROI at different focal planes, some of which did not align with the proper tissue height, resulting in different levels of blurring. Finally, each of these ROI images was divided into tiles of 64x64 pixels and 2,500 of these tiles were randomly selected from each image, resulting in a total of 360,000 tiles (16 slides x 9 focal planes x 2,500 tiles). To create the ground truth, tiles whose offset values were within [−0.5 μm, 0.5 μm] were labeled as in-focus and the rest were labeled as blurry. The in-focus range (i.e. [−0.5 μm, 0.5 μm]) was determined empirically, as it was practically impossible for an expert observer to differentiate between these ROIs visually, and inter- and intra-reader variability could play a role. Since the number of blurry tiles (240,000) was larger than the number of in-focus tiles (120,000), 108,000 tiles, sampled from the four different stains, were randomly selected from each class for training to prevent class imbalance in the training set. Lastly, ten percent of these sample tiles (i.e. 21,600) were selected as the validation set.
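The tile counts quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the ninth focal plane is the unperturbed one (offset 0), since only eight non-zero offsets are listed; the constant and function names are illustrative and not taken from the DeepFocus code.

```python
# Offsets in micrometres. 0.0 stands for the unperturbed focal plane, which is
# ASSUMED here to be the ninth plane (the text lists eight non-zero offsets).
OFFSETS_UM = [-2.5, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 2.5]
N_SLIDES = 16
TILES_PER_ROI_IMAGE = 2500
IN_FOCUS_LIMIT_UM = 0.5

def label_for_offset(offset_um):
    """Ground-truth rule: offsets within [-0.5, 0.5] um count as in-focus."""
    return "in-focus" if abs(offset_um) <= IN_FOCUS_LIMIT_UM else "blurry"

def tile_counts():
    """Count tiles per class over all slides and focal planes."""
    counts = {"in-focus": 0, "blurry": 0}
    for offset in OFFSETS_UM:
        counts[label_for_offset(offset)] += N_SLIDES * TILES_PER_ROI_IMAGE
    return counts

counts = tile_counts()
balanced_per_class = 108_000               # tiles drawn from each class
training_pool = 2 * balanced_per_class     # balanced sample of both classes
validation_size = training_pool // 10      # ten percent held out for validation

print(counts, training_pool, validation_size)
```

Under these assumptions the counts reproduce the 120,000 in-focus and 240,000 blurry tiles, the balanced pool of 216,000 samples, and the 21,600 validation tiles.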

Testing dataset
For testing, we used two different datasets, the first of which was generated by the same scanner that produced our training data. We acquired six additional digital slides (3 H&E, 2 Ki67, and 1 CD10 cases) to form the first test dataset. Fig 2 shows example images cropped from these digital slides with different focus offset values. The digital slides resulted in a total of 218,304 in-focus and 436,608 blurred tiles. To compare our results with those presented in Lopez et al.'s work [1], we down-sampled the images to 20x magnification and obtained 200x200 tiles (6,168 in-focus tiles and 12,336 blurred tiles). Importantly, the tile size used in [1] is 36 times coarser than that of our approach.
The second test dataset, which was acquired with a Hamamatsu NanoZoomer 2.0HT scanner (Hamamatsu, Japan), consists of two H&E slides. Because these images were acquired at another facility, the amount of blurring was completely random; therefore, we visually evaluated the performance of the proposed method on these slides for comparison.

The deep learning architecture
In this study, we designed a convolutional neural network based system, called DeepFocus, to classify each image tile as either in-focus or blurry. DeepFocus consists of five convolution layers, three max-pooling layers (after the third, fourth and fifth convolution layers) and fully connected layers (Fig 3). The last layer in DeepFocus is a softmax, which outputs the probability of a tile belonging to each of the two classes. We defined the objective function as the categorical cross entropy between the label and the prediction.
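To make the output layer and objective concrete, the following is a minimal NumPy sketch of a two-class softmax and the categorical cross entropy. It illustrates the standard definitions only; the function names are assumptions and this is not the TensorFlow code used in DeepFocus.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() stable."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, one_hot, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(one_hot * np.log(probs + eps), axis=1)))

# Two tiles, columns = (in-focus, blurry); the logits are made up.
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = softmax(logits)
print(probs, categorical_cross_entropy(probs, labels))
```

Each row of `probs` sums to one, so the first column can be read directly as the probability that a tile is in focus.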

Preprocessing
Data preprocessing plays a critical role in many deep learning algorithms [18]. Several studies have reported better results when data is scaled between zero and one [19]. For this reason, we linearly scaled the color intensities between 0 and 1. We further applied zero-centering to transform the data in such a manner that all images in the dataset have the same average value of zero.
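The two preprocessing steps can be sketched as follows. The sketch assumes 8-bit RGB tiles and a per-channel mean computed over the training set; the text does not specify whether the mean was taken per image, per channel, or over the whole dataset, so that choice and all names below are assumptions.

```python
import numpy as np

def scale_to_unit(tiles_uint8):
    """Linearly map 8-bit intensities from [0, 255] to [0, 1]."""
    return np.asarray(tiles_uint8, dtype=np.float32) / 255.0

def channel_mean(tiles):
    """Per-channel mean over a stack of tiles shaped (N, H, W, C)."""
    return tiles.mean(axis=(0, 1, 2))

def zero_center(tiles, mean):
    """Subtract a precomputed mean so the data is centred on zero."""
    return tiles - mean

# Random stand-in data in place of real 64x64 RGB tiles.
rng = np.random.default_rng(0)
train_tiles = rng.integers(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
scaled = scale_to_unit(train_tiles)
train_mean = channel_mean(scaled)
centered = zero_center(scaled, train_mean)
print(centered.mean(axis=(0, 1, 2)))
```

If the mean is estimated on the training set, the same `train_mean` would normally be reused for validation and test tiles so that all data pass through an identical transform.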

Data augmentation
With the exception of transfer learning, deep networks often require a huge number of training samples to achieve satisfactory classification accuracy. Because of the limited number of images in our training dataset, we employed commonly used data augmentation approaches

Design details and parameter optimization
During training, we used stochastic gradient descent (SGD) [20] with mini-batches of 64 tiles. The operation performed by a layer can be represented as $x = g(Wu + b)$, where $x$ is the layer output, $W$ (weights) and $b$ (bias) are the parameters to be learned, $g$ is an activation function, and $u$ is the input vector from the previous layer. At each layer, we used the Rectified Linear Unit (ReLU), $g(Wu + b) = \max(0, Wu + b)$, as the activation function, since its simplicity coupled with SGD results in faster learning [21].
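As a small worked example of the layer operation described above, the sketch below evaluates a fully connected layer with a ReLU activation in NumPy. The weights, bias, and input are made-up numbers chosen only to show the clipping of negative pre-activations.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: element-wise max(0, z)."""
    return np.maximum(0.0, z)

def dense_layer(u, W, b):
    """Layer output x = g(W u + b) with g = ReLU."""
    return relu(W @ u + b)

W = np.array([[1.0, -1.0],
              [2.0,  0.0]])   # illustrative weights
b = np.array([0.0, -1.0])      # illustrative bias
u = np.array([1.0, 2.0])       # input from the previous layer

# W @ u + b = [-1, 1]; the negative entry is clipped to zero by ReLU.
print(dense_layer(u, W, b))    # prints [0. 1.]
```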
Like other optimization algorithms, SGD requires initial values for the model parameters, and each layer's input is affected by the parameters of the previous layers [22]. As a result, a small change in the previous layers' parameters is amplified as the network becomes deeper. The resulting inconsistency in the distribution of the layers' inputs causes a problem called internal covariate shift, since the layers need to adapt continuously to the new distribution. To mitigate the effects of this problem, we used Batch Normalization (BN) [22]. For a given mini-batch, the input vector of a layer is represented by $u$. We calculated the mean ($\bar{u}_i$) and variance ($\sigma_i^2$) of each feature, $i$, at each layer. Subsequently, we normalized $u_i$ as follows:

$$\hat{u}_i = \frac{u_i - \bar{u}_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

where $\epsilon$ is a small constant ($\epsilon = e^{-15}$). To increase the representational power, we scaled and shifted the normalized value by using additional tunable parameters $\gamma$ and $\beta$:

$$y_i = \gamma \hat{u}_i + \beta$$

Although BN was applied before ReLU in [22], there is growing evidence that BN results in quicker convergence if applied after ReLU [18]; therefore, we opted for BN after ReLU. To reduce the risk of overfitting, we used Dropout regularization in the fully connected layers [23]. Training with BN is known to have a regularization effect, so we reduced the dropout strength to p = 0.2 [23]. We implemented DeepFocus using TensorFlow [24].
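The normalization step can be illustrated with a training-mode batch normalization written in NumPy, applied after ReLU as chosen in this work. This is a sketch of the standard formulation: the default `eps` value, the function names, and the omission of running statistics for inference are simplifying assumptions, not details of the DeepFocus implementation.

```python
import numpy as np

def batch_norm(u, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    u has shape (batch, features). eps is an assumed illustrative value.
    """
    mean = u.mean(axis=0)                      # per-feature mini-batch mean
    var = u.var(axis=0)                        # per-feature mini-batch variance
    u_hat = (u - mean) / np.sqrt(var + eps)    # normalized activations
    return gamma * u_hat + beta                # learnable scale and shift

def relu_then_bn(z, gamma, beta):
    """Ordering used in the text: activation first, BN afterwards."""
    return batch_norm(np.maximum(0.0, z), gamma, beta)

z = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
y = relu_then_bn(z, gamma=np.ones(2), beta=np.zeros(2))
print(y.mean(axis=0), y.var(axis=0))
```

With `gamma = 1` and `beta = 0`, each output feature has approximately zero mean and unit variance over the mini-batch, which is the stabilizing effect BN is used for.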
The training phase, which ran on the Owens supercomputer at the Ohio Supercomputer Center (OSC) (Tesla P100-PCIE-16GB), took about 180 seconds per epoch. During training, we observed that the underlying function is relatively smooth, as SGD was able to achieve relatively high classification accuracy in a few iterations. We stopped the training at the end of the 20th epoch since the validation accuracy stopped increasing. The hyper-parameters of the architecture (kernel size, the number of layers and learning rates) were tuned using grid search and cross-validation on the validation set [25]. Table 1 shows the explored and selected values for our hyper-parameters.

Evaluation methodology
To evaluate the proposed method, we considered a tile as in-focus if its probability of being in-focus was higher than 0.5, and then measured the accuracy:

$$\text{Accuracy} = \frac{TP + TN}{P + N}$$

where TP, TN, P, and N correspond to the number of correctly classified in-focus tiles, the number of correctly classified blurry tiles, the total number of in-focus tiles and the total number of blurry tiles, respectively. Additionally, the Receiver Operating Characteristic (ROC) curve was plotted to observe the TP and TN rates (Fig 4).
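The evaluation rule can be sketched in plain Python, taking accuracy as the fraction of correctly classified tiles, (TP + TN) / (P + N), with in-focus as the positive class. The probabilities and labels in the example are made up for illustration.

```python
def classify(p_in_focus, threshold=0.5):
    """A tile is called in-focus when its probability exceeds the threshold."""
    return [p > threshold for p in p_in_focus]

def accuracy(p_in_focus, is_in_focus, threshold=0.5):
    """Accuracy = (TP + TN) / (P + N), with in-focus as the positive class."""
    predicted = classify(p_in_focus, threshold)
    tp = sum(1 for pred, truth in zip(predicted, is_in_focus) if pred and truth)
    tn = sum(1 for pred, truth in zip(predicted, is_in_focus)
             if not pred and not truth)
    p = sum(1 for truth in is_in_focus if truth)    # all in-focus tiles
    n = len(is_in_focus) - p                        # all blurry tiles
    return (tp + tn) / (p + n)

probs = [0.9, 0.4, 0.6, 0.1, 0.5]            # made-up network outputs
truth = [True, True, False, False, False]    # made-up ground truth
print(accuracy(probs, truth))                # prints 0.6
```

Sweeping `threshold` from 0 to 1 and recording the true-positive and true-negative rates at each value gives the points of an ROC curve.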

Comparison with a prior method for different offset values
We compared our proposed method with a state-of-the-art algorithm proposed by Lopez et al. [1]. Since the Lopez algorithm is designed to analyze 200x200 images captured at 20x magnification, we downsampled the test images to accommodate these differences. Table 2 shows the average accuracy values of the two approaches for different focus offset values. The fourth column shows the difference in accuracy between the two approaches. On average, DeepFocus is 23.8 percentage points more accurate than Lopez's approach [1], and its variability (as measured by σ) is lower. In our previous studies, we did extensive analysis of optimization approaches and how they can help achieve better accuracies [26,27]. Considering that our current performance is 93.2% and the state-of-the-art performance is 69.4%, the current optimization seems to be satisfactory.

Slide-based comparison
Table 3 compares the proposed approach (DeepFocus) and Lopez's approach [1] for each slide in the test dataset. This experiment revealed that DeepFocus performs better than Lopez's approach [1] for all offset values between −2.5 μm and 2.5 μm. We also observed that Lopez's approach struggled to identify moderately blurry regions (i.e. 0.5 μm < |O| < 2 μm). It is worth mentioning that the standard deviation of DeepFocus for the offset values in the range [−2 μm, −1.5 μm] is higher than for the other offset values. Due to the finite thickness of the slides (5 μm), digital images represent nuclei in different focal planes. Therefore, with negative offset values, the scanner focused on some of the nuclei and created a sharper image for these regions. Slide 1 and Slide 4 are typical examples of this type of problem (Fig 5). Interestingly, Lopez's approach failed to identify in-focus regions in Slide 1.

Evaluation on a different scanner
We also evaluated the robustness of DeepFocus on full digital slides in the second test dataset, which was acquired with a different scanner. The images were approximately 95,000 x 70,000 pixels in size. To avoid analyzing tiles outside the tissue area, we found the tissue map at 1x magnification of the entire slide by Otsu thresholding [28]. Once the tissue was detected, we classified each non-overlapping tile (64x64 pixels) using DeepFocus to create a binary mask of the same size as the input digital slide. Fig 6B shows an example mask for one of the whole slide images (Fig 6A), where the green and red colors represent in-focus and blurry regions, respectively. As proposed in [1], this mask can be used to generate new focus points for the scanner, resulting in a higher quality image. Fig 6C shows the output of Lopez's approach for the same image. Fig 6D shows some of the in-focus and blurred regions identified by DeepFocus and how these areas are labeled by Lopez's algorithm.
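The whole-slide procedure above (tissue detection, then tile-wise classification) can be sketched as follows. This is an illustrative NumPy version, not the released DeepFocus code: `otsu_threshold` is a textbook Otsu implementation, `predict_in_focus` is a hypothetical stand-in for the trained network, tissue is assumed to be darker than the background, and the tissue map is assumed to have been resampled to one entry per tile.

```python
import numpy as np

def otsu_threshold(gray_uint8):
    """Textbook Otsu: pick the level that maximizes between-class variance."""
    hist = np.bincount(gray_uint8.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                      # probability of class <= t
    mu = np.cumsum(prob * np.arange(256))        # cumulative mean up to t
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0                    # empty classes carry no signal
    return int(np.argmax(sigma_b))

def tissue_map(thumbnail_gray):
    """Assumption: tissue is darker than the bright slide background."""
    return thumbnail_gray <= otsu_threshold(thumbnail_gray)

def tile_focus_mask(image, tissue_per_tile, predict_in_focus, tile=64):
    """Label each non-overlapping tile: 0 background, 1 in-focus, 2 blurry."""
    rows, cols = image.shape[0] // tile, image.shape[1] // tile
    mask = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            if not tissue_per_tile[r, c]:
                continue                         # skip tiles outside the tissue
            patch = image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            mask[r, c] = 1 if predict_in_focus(patch) > 0.5 else 2
    return mask

# Toy usage with a dummy "classifier" that only looks at mean brightness.
image = np.zeros((128, 128), dtype=np.uint8)
image[:64, :64], image[:64, 64:], image[64:, 64:] = 200, 10, 255
tissue = np.array([[True, True], [False, True]])
print(tile_focus_mask(image, tissue, lambda patch: patch.mean() / 255.0))
```

The resulting tile-level mask could be upsampled to the slide resolution for display, or its blurry entries could be passed back to the scanner as candidate locations for additional focus points.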

Functional comparison with existing methods
In this study, we designed a new convolutional neural network, called DeepFocus, to identify blurry regions. Unlike prior studies, we systematically acquired data at different focal planes to vary the amount of blurring in a controlled manner, and performed rigorous validation on independent test sets. While existing algorithms can take a long time to run, we achieved reasonable execution times (around 10 minutes) for a full digital slide (16GB) at 40x magnification. We also demonstrated that our method generalizes to multiple stain types (i.e. both H&E and IHC). Lastly, we compared the proposed method with a recently published study [1] in terms of accuracy, robustness and computational complexity. Unlike DeepFocus, the other approach was designed to work at 20x magnification and the analyzed unit area was 36 times larger than ours, resulting in a coarser output. A functional comparison between DeepFocus and the previous methods is summarized in Table 4.
The algorithm can be used in conjunction with scanners and image analysis algorithms to identify out-of-focus regions. Likewise, a pathologist can use it to automatically exclude out-of-focus regions from further analysis. To make our proposed method responsive to the needs of different kinds of users (e.g. engineers, image analysts, or pathologists), it needs to have a low computational overhead. To improve the computational efficiency of the proposed method, we opted for a tile size of 64x64 pixels. This tile size provides a good tradeoff between computational efficiency and the granularity/accuracy of identifying out-of-focus regions. If a higher level of accuracy is required, we can opt for either overlapping tiles or smaller tiles, but at the cost of higher computational overhead.
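The cost side of this tradeoff can be estimated by counting how many tiles must be classified. The sketch below uses the approximate slide size of 95,000 x 70,000 pixels mentioned earlier and counts every tile position, including background that a tissue map would skip; the stride of 32 pixels is an arbitrary example of overlapping tiles, not a setting used in the study.

```python
def tiles_along(length_px, tile=64, stride=64):
    """Number of tile positions that fit along one image dimension."""
    if length_px < tile:
        return 0
    return (length_px - tile) // stride + 1

def tile_count(width_px, height_px, tile=64, stride=64):
    """Total number of tiles for a given tile size and stride."""
    return (tiles_along(width_px, tile, stride)
            * tiles_along(height_px, tile, stride))

non_overlapping = tile_count(95_000, 70_000, tile=64, stride=64)
half_overlap = tile_count(95_000, 70_000, tile=64, stride=32)
print(non_overlapping, half_overlap, half_overlap / non_overlapping)
```

Halving the stride roughly quadruples the number of tiles, and therefore the number of network evaluations, which is the overhead referred to above.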
Like most deep learning approaches, DeepFocus involves matrix multiplication and convolution, which can be parallelized. Since GPUs have a massively parallel architecture compared to CPUs, they are advantageous for this task. DeepFocus benefits from the technological improvements in GPUs, which enable it to analyze a digital slide in a matter of a few minutes on a standard laptop computer. Recent studies show that GPU-based deep learning approaches sped up by 50x in just three years, and researchers expect another 10x boost in the next few years [30]. As a result, we expect DeepFocus to become much faster in the near future.
Although deep learning techniques are being successfully applied to other digital pathology problems, DeepFocus is the first implementation of such techniques to characterize problems of digital image generation in pathology. This first application focused on the problem of accurately identifying blurry (out-of-focus) regions in whole slide images. These problems are very common: Stathonikos et al. report that 5% of the cases had problems with scanning artifacts, such as blurry images and incomplete slides, in the Dutch digital pathology experience [31]. Digital imaging problems are not limited to blurring; tissue folding, over- or under-staining, air bubbles, and compression artifacts are some of the many other problems. The DeepFocus framework needs to be extended to identify these problems before the images reach pathologists or image analysis algorithms. In the future, we plan to expand our dataset to include more disease categories, different types of stains, and other types of scanners. Additionally, we plan to estimate the focal offset error, which may be useful during rescanning.

Code availability
The source code for running DeepFocus on a whole slide is available from https://github.com/cialab/DeepFocus