Computational Pathology to Discriminate Benign from Malignant Intraductal Proliferations of the Breast

The categorization of intraductal proliferative lesions of the breast based on routine light microscopic examination of histopathologic sections is in many cases challenging, even for experienced pathologists. The development of computational tools to aid pathologists in the characterization of these lesions would have great diagnostic and clinical value. As a first step to address this issue, we evaluated the ability of computational image analysis to accurately classify DCIS and UDH and to stratify nuclear grade within DCIS. Using 116 breast biopsies diagnosed as DCIS or UDH from the Massachusetts General Hospital (MGH), we developed a computational method to extract 392 features corresponding to the mean and standard deviation in nuclear size and shape, intensity, and texture across 8 color channels. We used L1-regularized logistic regression to build classification models to discriminate DCIS from UDH. The top-performing model contained 22 active features and achieved an AUC of 0.95 in cross-validation on the MGH data-set. We applied this model to an external validation set of 51 breast biopsies diagnosed as DCIS or UDH from the Beth Israel Deaconess Medical Center, and the model achieved an AUC of 0.86. The top-performing model contained active features from all color-spaces and from the three classes of features (morphology, intensity, and texture), suggesting the value of each for prediction. We built models to stratify grade within DCIS and obtained strong performance for stratifying low nuclear grade vs. high nuclear grade DCIS (AUC = 0.98 in cross-validation) with only moderate performance for discriminating low nuclear grade vs. intermediate nuclear grade and intermediate nuclear grade vs. high nuclear grade DCIS (AUC = 0.83 and 0.69, respectively). These data show that computational pathology models can robustly discriminate benign from malignant intraductal proliferative lesions of the breast and may aid pathologists in the diagnosis and classification of these lesions.


Introduction
The pathological classification of ductal carcinoma in situ (DCIS) versus usual ductal hyperplasia (UDH) on core biopsy of has major implications for patient management. UDH is considered a benign proliferation, and patients with UDH carry only a small increased risk of developing subsequent breast cancer compared with patients without proliferative breast disease [1]. No treatment is necessary, and clinical management includes the continuation of routine breast cancer screening. In contrast, DCIS is a preinvasive malignant proliferation, and approximately 25% of patients diagnosed with DCIS on core biopsy are found to have invasive carcinoma upon surgical excision [2]. Primary treatment recommendations for DCIS include lumpectomy with or without whole breast radiation therapy and/or postoperative tamoxifen or total mastectomy with or without sentinel lymph node biopsy [3]. Thus, DCIS patients receive aggressive treatment, while UDH patients receive no treatment.
The pathological distinction of DCIS and UDH is based on multiple architectural and cytologic features, with nuclear atypia being particularly important in distinguishing the benign (UDH) from the malignant (DCIS) lesions. Although the defining features of DCIS and UDH are well established, the accurate and reproducible categorization of intraductal proliferative breast lesions into these categories remains a challenge in many cases, even for experienced pathologists [4][5][6][7][8][9][10]. The lack of reproducible and objective methods for classifying intraductal proliferative lesions of the breast has clear negative consequences, potentially resulting in both over-and under-treatment in clinical practice and complicating the use of pathological diagnoses to guide research in this area.
Given the importance of accurate and reproducible pathological diagnosis for guiding clinical care and translational research, there would be value to the development of computational tools to aid in the morphological characterization of intraductal proliferative lesions. Descriptive morphological features of nuclear atypia may be modeled with digitization of H and E-stained sections followed by image processing and analysis. The process involves image capture via digital photography or whole slide scanning [11], image segmentation via manual or automated methods [12], the measurement of extracted features, and the application of statistical methods to determine the association of features and feature-based predictive models with pathological diagnosis or clinical outcome [13,14]. Quantitative nuclear features have been demonstrated to correlate with pathological findings in prior studies of mitotic activity [15,16] and epithelial proliferations of the breast [17][18][19][20][21]. In particular, the measurement of nuclear area has been demonstrated to have diagnostic and prognostic value in malignant and premalignant lesions [22][23][24][25][26][27].
Although the current literature demonstrates correlations between quantitative features and proliferative breast lesions, this methodology has not yet been widely adopted within the pathology community. One reason is due to the inconvenience of applying previously published quantitative methods, which historically have required significant manual intervention (e.g. for nuclear tracing) prior to the measurement and extraction of one-to-several quantitative features of cellular morphology. A second reason is few prior studies have used external validation datasets, limiting our knowledge of the generalizability and robustness of the findings. A third major challenge has been that few pathology laboratories have possessed the requisite hardware and software for implementing computational pathology algorithms, and thus, it has been challenging to translate research advances to clinical practice. However, increasing numbers of laboratories are acquiring whole slide imaging (WSI) platforms and digital pathology tools, which should significantly facilitate the dissemination and ultimate clinical translation of computational pathology algorithms [11,14,28,29].
To build on the strengths and to address some of the limitations of prior work in this area, we designed and implemented a computational pathology method for the identification of nuclei and the quantification of nuclear features from histological images of intraductal proliferative lesions of the breast. In this study, we demonstrate the ability of this method to build classification models to discriminate DCIS from UDH and to discriminate low grade from high grade DCIS. These investigational tools provide new biological insights into the key cellular phenotypic differences that differentiate UDH and DCIS and ultimately, with further development, may provide real-time decision support to aid pathologists in the interpretation of proliferative lesions of the breast. All image processing code, images, and statistical code are provided at the accompanying website: earlybreast.becklab.org.

Massachusetts General Hospital (MGH) Dataset
The study was approved by the Partners Human Research Committee (Partners IRB), and the Partners IRB waived the need for consent. Cases were identified via a search of breast biopsies with a diagnosis of DCIS or UDH at MGH. 80 cases of DCIS and 36 cases of UDH from MGH were included in the study.
Core biopsy tissue was processed according to standardized laboratory protocol. Formalin fixed and paraffin embedded (FFPE) tissue was cut into 5 mm sections and stained with hematoxylin and eosin. The pathological grading of DCIS cases were obtained from the diagnostic pathology reports, and cases of DCIS were graded as low, intermediate or high based on the degree of nuclear atypia. If a case was reported to be intermediate between two grades, for the purposes of this analysis we classified the case as the lower of the grades.

Beth Israel Deaconess Medical Center (BIDMC) Dataset
The study was approved by the Beth Israel Deaconess Medical Center IRB, and the IRB waived the need for consent. Cases were identified via a search of breast biopsies with a diagnosis of DCIS or UDH at BIDMC. 20 cases of DCIS and 31 cases of UDH from BIDMC were included in the study. Similar to the MGH dataset, FFPE tissue was cut into 5 mm sections and stained with hematoxylin and eosin. The pathological grading of DCIS cases were obtained from the diagnostic pathology reports, and cases of DCIS were graded as low, intermediate or high based on the degree of nuclear atypia. If a case was reported to be intermediate between two grades, for the purposes of this analysis we classified the case as the lower of the grades.

Image acquisition
One representative diagnostic hematoxylin and eosin stained slide per case was digitized using Philips Ultra Fast Scanner 1.6 (Philips Digital Pathology; Best, Netherlands) at 406 magnification with a resolution of 0.25 mm per pixel. Whole slide images were reviewed, and diagnostic ROIs (1 to 4 per case) were manually selected for image analysis. For cases with greater than 4 diagnostic foci, up to 4 regions with the highest cellularity were selected. The MGH cases were scanned using a Philips Scanner at the MGH facility, while the BIDMC cases were scanned using a Philips Scanner at BIDMC.

Image processing and feature extraction
The proposed image processing and analysis framework consists of three main steps: nuclei segmentation, nuclei feature computation and statistical analysis and machine learning on computed features.

Nuclei segmentation
Nuclei segmentation was performed using Fiji (ImageJ, National Institutes of Health) [30]. For each image, the segmentation algorithm was applied for nuclei segmentation. Initially, the RGB color image was converted into HSV color space, in which image intensity (luma) is separated from color information (chroma), which makes the HSV more closely match human perception. Color thresholding was performed to obtain nuclear regions which were later processed with morphological operations to fill holes and merge scattered nuclear regions. Touching and overlapping nuclei were separated by watershed transformation. Following nuclear segmentation, a size filter of 200-4000 pixels was applied to exclude extracted objects of extremely small or large size to improve the specificity of nuclear detection. Identified objects were then analyzed from the original image to measure quantitative features of morphology and color of each nucleus (Fig. 1).

Nuclei feature computation
After nuclei segmentation, we computed morphological and statistical features from the selected nuclear regions. The computed morphological features include shape and geometric features, which are: area, perimeter, equivalent spherical perimeter, bounding rectangle (width and height), fit ellipse (major and minor axis), shape descriptors (circularity, aspect ratio, roundness and solidity) and Feret's diameter. The statistical features are intensity based (first order) and texture based (second order). We also explored the statistical features in different color models. In order to study the specific information carried by the hematoxylin stain, which highlights different cellular structures in the tissue, we separated hematoxylin and eosin stains using color deconvolution [31]. Color deconvolution reduced the problem of color variations in tissue appearance due to variation in tissue preparation, stain reactivity from different batches, protocol and scanners. In addition, different color models are proposed to separate a color into more useful components that may bring new information to the system. In this framework, our goal is to investigate the various color channels of different color models and select those channels that produce the top-performing classification models. We convert RGB images into two other color models, namely HSV (more intuitive for human perception) and Lab and Luv (uniform color separation).
In H&E stained images, nuclear and cytoplasmic regions appear as hues of blue and purple while extracellular material have hues of pink. In order to reduce the influence of extracellular region intensity, the RGB images are transformed into a new image called Blue-Ratio (BR) image to accentuate the nuclear dye [32] as: where R, G and B are red, green and blue channels of RGB, respectively. In a BR image, a pixel with a high blue intensity relatively to its red and green components is given a high value, whereas, a pixel with a low blue intensity as compared to its red and green components is given a low value. As we are interested in nuclei, which appear as blue-purple areas, Blue ratio intensity indicates spatial distribution of nuclear content in the image. For statistical feature computations, we selected eight color channels; red, green and blue from RGB color model, lightness (Value) from HSV color model, lightness from Lab and Luv color model, BR grey scale image and Hematoxylin channel from H&E color deconvolution.
The first order statistical features determine the distribution of grey level values within the nuclei regions. Using grey level information of the selected color channels, we computed mean, median, standard deviation, skewness and kurtosis. These five features were computed for each nucleus in selected eight color channels, resulting in a total of 40 first order statistical features.
We computed two types of second order statistical features using grey level Haralick co-occurrence [33] and run-length matrices [34]. The co-occurrence matrix GLCM (i,j; d,q) is square with dimension N g where N g is the total number of grey levels in the image. The value at i th and j th column in the matrix is produced by counting the total occasions a pixel with value i is adjacent to a pixel with value j at a distance d and angle q. Then the whole matrix is divided by the total number of such comparisons that have been made. Alternatively we can say that each element of GLCM matrix is considered as the probability that a pixel with grey level i is to be found with pixel with grey level j at a distance d and angle q. We defined adjacency in four directions (vertical, horizontal, left and right diagonals) with one displacement vector, which produced four GLCMs matrices. In our case, texture information is rotationally invariant. So, we take the average in all four directions and produce one GLCM matrix. Later, we compute 8 features proposed by Haralick from the GLCM in order to identify texture more compactly. These eight features are correlation, cluster shade, cluster prominence, energy, entropy, hara-correlation, homogeneity and inertia. These eight features were computed for each nucleus in the selected eight color channels, resulting in a total of 64 co-occurrence features.
The set of consecutive pixels, with the same grey level, collinear in a given direction, constitute a grey level run length matrix GLRLM (i,j; q). The dimension of GLRLM is N g 6R, where N g is the number of grey levels and R is the maximum run length. Similar to the GLCM, we compute GLRLMs for four directions and later average them. The 10 run-length features, derived from GLRLM, are short run emphasis (SRE), long run emphasis (LRE), grey-level non-uniformity (GLN), run length non-uniformity (RLN), low grey level runs emphasis (LGLRE), high grey level runs emphasis (HGLRE), short run low grey level emphasis (SRLGLE), short run high grey level emphasis (SRHGLE), long run low grey level emphasis (LRLGLE) and long run high grey level emphasis (LRHGLE). These features were computed for each nucleus in the selected eight color channels, resulting in a total of 80 run-length features. In total, we computed 196 texture features for each nucleus.
Prior to statistical and machine learning-based analyses, feature measurements were summarized at the patient level by computing the mean and standard deviation of each feature per patient, producing a total of 392 summary features per patient.
Statistical analysis and machine learning: We performed logistic regression with Lasso regularization [35] to build multivariate image feature-based models to classify DCIS versus UDH and low grade versus high grade DCIS. The analyses were implemented in R (http://www.r-project.org/), using the glmnet [36] package. Lasso regularization was used to create simpler models, less prone to overfitting, than those that would be obtained from standard logistic regression. The Lasso procedure consists of performing logistic regression with an L 1 regularization penalty, which has the effect of shrinking the regression weights of the least predictive features to 0 [35]. The amount of the penalty (and the number of nonzero features in the model) is determined by the regularization parameter l. This method has been shown to perform well in the setting of colinearity [37] and has been widely used to build predictive models from high-dimensional data in translational cancer research [38][39][40]. Features were standardized separately in the MGH and BIDMC data-sets prior to model construction, using the default setting in glmnet. We evaluated model performance within the MGH data-set by 9-fold cross-validation. For external validation on the BIDMC cases, we selected the value of l that achieved the maximum AUC in cross-validation on the MGH data-set and applied this fixed model to the BIDMC cases. Model performance was assessed by computing the AUC.

Accompanying website
All images, image processing code, statistical code, and results are provided at the accompanying website: http://earlybreast.becklab.org and data are deposited at the Dryad database (http://dx.doi.org/10.5061/dryad.pv85m).

DCIS and UDH Analyses in the Massachusetts General Hospital Dataset
We performed L 1 -regularized logistic regression to construct classification models to discriminate DCIS (n580) from UDH (n536). The top-performing model contained 22 active features ( Table 1) and achieved an AUC of 0.95 in crossvalidation (Fig. 2).

External Validation of the DCIS/UDH Classification Model
The most important test of a predictive model is its ability to generalize to new, unseen data from an independent institution. Thus, we collected a set of 51 samples, containing cases of both DCIS (n520) and UDH (n531) as an external validation dataset. We processed the samples in a similar fashion to that described for the MGH samples. We trained the predictive model on the full MGH dataset and applied the fixed model to the BIDMC dataset. On the BIDMC dataset, the model showed strong classification performance, achieving an AUC50.86 (  Fig. 3). Later, we combined both training and validation datasets and performed cross validation on all 167 cases, resulting in a model performance of AUC50.93 (  Fig. 4).
We next compared the performance of the DCIS vs UDH classifier on specific grades of DCIS (

Ablation analyses to compare performance of different feature classes and color spaces
To identify the subset of most informative features, we performed ablation analyses, in which we only consider a subset of features prior to model building and evaluation. Of the feature classes, we obtained the strongest performance with the texture features alone, which achieved an AUC of 0.91, followed by morphology and intensity features alone (AUC50.89 and 0.85, respectively). Of

Discussion
Previous studies have used quantitative image analysis to aid in the histological diagnosis and grading of breast lesions. Nuclear size and shape have been associated with pathological findings such as tumor grade and necrosis in malignant cytological and surgical breast specimens [17][18][19][20]. Morphometric nuclear size measurements, including area, diameter, and short axis, have also been associated with survival in patients with invasive carcinoma [22,23]. A comprehensive machine learning-based analysis of invasive breast carcinoma morphology has identified both epithelial and stromal features associated with survival [38]. Two prior studies of male breast carcinoma have associated morphometric parameters with tumor grade, immunohistochemical staining of tumor-associated proteins, and clinical outcome [21,24]. Although relatively less work has focused on the evaluation of breast biopsies with benign pathological diagnoses, nuclear morphometry in benign proliferative lesions has been associated with the subsequent development of breast cancer  [25,26]. These previous studies, with methods ranging from manual measurement to automated computational segmentation of cellular morphology, suggest the value of developing and applying methods in quantitative image analysis to link morphologic phenotypes with pathological diagnoses and clinical outcomes.
In contrast to most prior studies, we take a more unbiased data-driven approach, where we develop automated image processing software to extract a relatively high dimensional nuclear feature set (392 features) from each image and then use machine learning-based methods to build models to predict pathological diagnosis. Dundar et al. developed a morphometric pipeline to measure nuclear perimeter, aspect ratio and mean gray level intensity across 62 training cases and 33 test cases of UDH, DCIS, and atypical ductal hyperplasia (ADH), and they demonstrated an overall accuracy of 87.9% in distinguishing clinically actionable from non-actionable lesions [27]. This prior study and others have largely focused on the quantitation of existing diagnostic criteria used by surgical pathologists for disease classification. Our study builds on this approach and identifies novel morphological features that are important in discriminating pathological diagnoses of breast disease. Furthermore, our study demonstrates the ability for pathology image analysis to accurately classify proliferative breast lesions across two institutions using different histological protocols. This finding suggests that these techniques may be implemented across multiple laboratories. The image processing method could be further developed in several areas. First, the current system requires expert selected regions of interest. The nuclear detection algorithm currently implemented will not perform well on images that are pure stroma and assumes that the image is selected to enrich for cells involved in a breast epithelial proliferative lesion. A second area for future technical development is incorporation of relational nuclear features (e.g. average distance between nuclei, proportion overlapping nuclei) and spatial nuclear features (e.g. nuclear streaming). A third area for future development is the inclusion of stromal features into the predictive model. It is well known that DCIS may develop reactive-type stroma, and this feature may have value in diagnosis and prognosis. Lastly, performance may be further improved through the use of additional machine learning methods, including convolutional neural networks, which have recently shown strong performance for a range of image classification tasks [41,42].
Despite these limitations, the current study represents a first step in applying a computational pathology approach to the classification of intraductal proliferations of the breast. Our study demonstrates that models comprised of relatively small sets of quantitative nuclear features can achieve high accuracy for the classification of UDH vs. DCIS and low grade vs high grade DCIS. Building on this work, we plan to extend the method to the analysis of additional challenging intraductal proliferative lesions, including ADH, which represents a major challenge in diagnostic pathology [9] and is an area where a computational algorithm may be particularly useful for clinical practice.
With further development and validation, we envision this approach could be incorporated into clinical practice in several ways. Today, the standard method used to differentiate difficult cases on the borderline between low grade DCIS and UDH is to perform immunohistochemistry (IHC) for estrogen receptor (ER) and basal cytokeratins (CK5/6), with low grade DCIS tending to show the pattern of ER positive and CK5/6 negative, while UDH shows a mosaic-staining pattern for CK5/6 and variable ER positivity [43]. A computational pathology algorithm may replace the need for IHC altogether in a subset of cases, resulting in increased turnaround time and decreased cost, or alternatively, the algorithm may supplement the use of IHC for difficult cases, providing another layer of evidence to support one diagnosis or the other. In addition, we envision that a computational pathology algorithm for intraductal proliferative lesions could serve as a real-time second reader, which could point the pathologist to suspicious lesions or could trigger the pathologist's attention when the pathological diagnosis disagrees with the computational interpretation, and in this way, the algorithm could be used to identify cases requiring additional diagnostic work up.
The image processing algorithms, full set of images used in the analysis, and the code for statistical analysis are made publicly available on the accompanying website (earlybreast.becklab.org) and data are deposited at the Dryad Database (http://dx.doi.org/10.5061/dryad.pv85m).