Automatic cell counting from stimulated Raman imaging using deep learning

In this paper, we propose an automatic cell counting framework for stimulated Raman scattering (SRS) images, which can assist tumor tissue characterization, cancer diagnosis, and surgery planning. SRS microscopy has advanced tumor diagnosis and surgery by mapping lipids and proteins from fresh specimens, rapidly disclosing fundamental diagnostic hallmarks of tumors at high resolution. However, cell counting from label-free SRS images has been challenging due to the limited contrast between cells and tissue, along with the heterogeneity of tissue morphology and biochemical composition. To this end, a deep learning-based cell counting scheme is proposed by modifying and applying U-Net, an effective medical image semantic segmentation model that requires only a small number of training samples. The distance transform and watershed segmentation algorithms are also implemented to yield the cell instance segmentation and cell counting results. By performing cell counting on SRS images of real human brain tumor specimens, promising results are obtained, with an area under the curve (AUC) above 98% and R = 0.97 for the cell counting correlation between SRS and histological images with hematoxylin and eosin (H&E) staining. The proposed cell counting scheme illustrates the possibility and potential of performing cell counting automatically in near real time and encourages the application of deep learning techniques to biomedical and pathological image analysis.

Thank you for the comment. We have summarized the cell counting research in Table 1 and illustrated the differences between this research and the literature. Regarding SRS image analysis, there have been some attempts in the literature to integrate ML techniques, specifically deep learning, into the analysis. Most research performed the lesion prediction task for image patches within a specimen [1,2]. This research extends simple image classification to pixel-level analysis by providing cell segmentation and cell counting results that can reveal the intrinsic characteristics of brain tissue samples. Motivated by the superiority of U-Net over FCN on medical image semantic segmentation tasks, this research employs and modifies U-Net to segment cells in SRS images. Also, the proposed cell counting scheme enables a mix of detection- and regression-based counting because cells are segmented without the requirement of identifying each cell instance, but are identified and counted through morphological analysis.

In Table 2, performance criteria of the classification method are presented; however, what are the meanings of the mean and standard deviation? Does each criterion have a single percentage? Moreover, you should report other factors such as precision, negative predictive value, and fall-out.

Thank you for pointing this out. To obtain the cell segmentation results, the experiments are performed with five replications. Therefore, the means and standard deviations are the metrics calculated from the five replications. The description of the mean and standard deviation is highlighted in Subsection 4.1 Pixel-Wise Evaluation on the First Testing Region: Tables 3 and 4 summarize the mean and standard deviation (SD) of the pixel-wise segmentation evaluation for the two specimens.
The calculations of sensitivity and precision are shown as follows:

Sensitivity = TP / (TP + FN)

Precision = TP / (TP + FP)    (5)

Also, fall-out is known as the false positive rate (FPR). According to the definition of specificity,

Specificity = TN / (TN + FP)    (8)

FPR can be calculated by

FPR = FP / (FP + TN) = 1 - Specificity    (9)

Based on Equation (5) and Equation (9), precision and FPR can be inferred from the performance metrics in Table 3 and Table 4. Because of the page limitation, these metrics are not provided in the manuscript.

The dataset is described in [3]. In particular, the brain tissues were collected from Brigham and Women's Hospital and Dana-Farber Cancer Institute. A flash-freezing process was conducted at -80 °C, followed by sectioning to 12-µm thickness. The brain tumor samples were imaged by SRS and then stained using the H&E technique. A non-neoplastic benign brain specimen with epilepsy and a malignant anaplastic oligodendroglioma specimen are used for the cell counting task. Specifically, the resolutions of the two specimens are 0.37 µm/pixel and 0.18 µm/pixel, respectively. Each SRS image is split into three regions: one training region and two testing regions. It is noted that there are mismatches between the obtained H&E and SRS images of the same specimen regarding cell shape, size, and position. Cell shifting and vanishing are also observed during the image collection process, which leads to a lack of ground-truth cell distribution information. Therefore, the cells within the training region and the first testing region are annotated manually and used to train the U-Net model and evaluate the cell segmentation results.
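For reference, the four metrics above follow directly from the confusion-matrix counts. The minimal sketch below (function names and example counts are illustrative, not from the manuscript) shows the relationships, including FPR = 1 - Specificity:

```python
# Confusion-matrix metrics used above: sensitivity, precision,
# specificity, and fall-out (false positive rate, FPR).

def sensitivity(tp, fn):
    # Fraction of actual positives that are detected.
    return tp / (tp + fn)

def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp)

def specificity(tn, fp):
    # Fraction of actual negatives that are correctly rejected.
    return tn / (tn + fp)

def fall_out(fp, tn):
    # FPR = FP / (FP + TN) = 1 - specificity.
    return fp / (fp + tn)

# Hypothetical counts from a pixel-wise evaluation.
tp, fp, tn, fn = 80, 10, 90, 20
print(sensitivity(tp, fn))  # 0.8
print(precision(tp, fp))    # ~0.889
print(specificity(tn, fp))  # 0.9
print(fall_out(fp, tn))     # 0.1 (= 1 - specificity)
```

This makes explicit that reporting specificity already determines fall-out, which is why the two tables suffice to recover FPR.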
7. How do you extract the ground truth images? It should be explained clearly, and the accuracy of the clustering techniques should be described.
Thank you for the comment. The brain tumor samples are first imaged by SRS and then stained using the H&E technique. Therefore, we assume the number of cells from the H&E images is the ground truth. Regarding cell segmentation, it is noted that there are mismatches between the obtained H&E and SRS images of the same specimen regarding cell shape, size, and position. Cell shifting and vanishing are also observed during the image collection process, which leads to a lack of ground-truth cell distribution information. Therefore, the cells within the training region and the first testing region for each specimen are annotated manually.
The clustering is conducted on the H&E images. Because of the clear differentiation between cells and background generated by the staining process, the clustering technique can easily segment cells. A demonstration of an H&E image and the corresponding clustering results is shown in Fig 6. The assumption is that the cell segmentation results from the H&E images are the ground truth with 100% accuracy.

Section 4. Experimental Results (lines 338-344, pg. 10): It is noted that there are mismatches between the obtained H&E and SRS images of the same specimen regarding cell shape, size, and position. Cell shifting and vanishing are also observed during the image collection process, which leads to a lack of ground-truth cell distribution information. Therefore, the cells within the training region and the first testing region for each specimen are annotated manually, which can be used to train the U-Net model and evaluate the cell segmentation results. However, overlapping cells exist; in this case, multiple cells can be recognized as one region. Therefore, a post-segmentation morphological analysis that uses the distance transform and watershed segmentation algorithms is further employed for each identified region, where connected cells can be split, which enhances the cell counting results [4].

This research demonstrates that AI can perform SRS image analysis at a detailed level and enhances the potential of promoting the SRS technique into the surgical process. Currently, due to the limited application of the SRS technique, this dataset is the only data available for brain tumors. Therefore, we only apply the cell counting to this dataset, which includes 5388 image patches in total. The related statements are shown as follows:

Section 1. Introduction (lines 87-91, pg. 3): This research not only demonstrates the possibility of performing SRS image analysis at a much more detailed level but also enhances the potential of promoting the SRS technique into the surgical process, which quickly provides surgical guidance without the requirement of the time-consuming staining process. This study aims to promote the implementation of AI in biomedical analysis for SRS images.
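The splitting step described above (distance transform followed by watershed segmentation) can be illustrated with a small sketch. For simplicity, this substitutes a threshold-and-label marker step for the full watershed flooding, so it shows only the core idea of how the distance transform separates touching cells; it is not the paper's implementation, and the threshold fraction is an illustrative choice:

```python
import numpy as np
from scipy import ndimage

def count_cells(mask, marker_frac=0.6):
    """Split touching cells in a binary mask via the distance transform.

    Markers are taken where the distance to the background exceeds
    `marker_frac` of the maximum distance; labeling the markers
    separates touching cells (a simplified stand-in for watershed).
    """
    dist = ndimage.distance_transform_edt(mask)
    markers = dist > marker_frac * dist.max()
    _, num_cells = ndimage.label(markers)
    return num_cells

# Two overlapping disks merge into a single connected region,
# mimicking two touching cells recognized as one.
rr, cc = np.mgrid[0:50, 0:50]
mask = ((rr - 15) ** 2 + (cc - 15) ** 2 <= 100) | \
       ((rr - 15) ** 2 + (cc - 33) ** 2 <= 100)
_, merged = ndimage.label(mask)
print(merged)             # 1: plain labeling sees one blob
print(count_cells(mask))  # 2: distance-based markers split them
```

In the full pipeline, the watershed would additionally assign every foreground pixel to one of the markers, yielding per-cell instance masks rather than just a count.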
10. You should also report the ROC curve for segmentation.
Thank you for the comment. We didn't include the ROC curves in the manuscript because of the page limitation. We have added the ROC curves in the Appendix.
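For readers reproducing the ROC analysis, the AUC can be computed directly from pixel labels and prediction scores via the Mann-Whitney (rank) formulation. The sketch below is illustrative, not the code used in the paper, and the toy labels and scores are invented for the example:

```python
def auc_score(labels, scores):
    """AUC as the Mann-Whitney statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative
    (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four pixels with ground-truth labels and predicted scores.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y, s))  # 0.75
```

This rank formulation gives the same value as integrating the ROC curve, which is convenient for spot-checking plotted AUCs.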
[Appendix figure: ROC curves, false positive rate vs. true positive rate, with AUC values.]

11. The results are weak; they should be developed further, and powerful papers should be used for the description.
Thank you for the suggestion. Other than FCN and U-Net, image segmentation methods include region-based convolutional neural networks (R-CNN), dilated convolutional models, recurrent neural networks, attention-based models, generative models, etc. [5]. These models have different advantages and applications. For instance, R-CNN was proposed for object detection in complicated images. Attention enables an understanding of the relationships between different positions and scales within an image. Although these models achieved promising performance, they require a great number of training samples and complicated architectures. Regarding cell segmentation for medical images, due to the simple cell structure and the lack of available annotation information, most research has applied simple model structures such as FCN and U-Net [6][7][8][9][10]. Therefore, we modified the U-Net model and compared it to the conventional U-Net and FCN. Also, this research focuses more on applying the deep learning technique to solve practical problems in the surgical process efficiently from an application perspective, instead of showing that the modified U-Net is theoretically the best model. As stated in the Introduction section, this study aims to demonstrate the possibility of performing SRS image analysis using deep learning and to enhance the potential of promoting the SRS technique into the surgical process from an application perspective.

Section 1. Introduction (lines 87 -91, pg. 3):
This research not only demonstrates the possibility of performing SRS image analysis at a much more detailed level but also enhances the potential of promoting the SRS technique into the surgical process, which quickly provides surgical guidance without the requirement of the time-consuming staining process.
12. Therefore, it seems that your manuscript needs additional evaluation and comparison with other works. Moreover, the implemented code should be presented on GitHub.
Thank you for the comment. We have uploaded the implementation code to https://github.com/vestal-doublekuan/SRS-image-cell-counting .
1. It would be nice for the authors to provide the reasons to include two images in Fig 1 under the same magnification, 50 µm.
Thank you for pointing this out. We have modified the images in Fig 1 accordingly.

Thank you for the comment. We have enlarged the texts within the figure (Fig 2. Overview of the cell counting framework).
3. The modified U-Net essentially used a set of different hyper-parameters to alter the network structure from the original. Is there any other novel contribution applied to the network?
Thank you for the comment. This research focuses on the application of U-Net to a new medical image modality for cell counting. The differences between M-UNet and the original U-Net include the number of kernels involved, the number of input color channels, and the use of the early-stopping technique. The modification is motivated by the simple color range of SRS images, which does not require the complexity of the original U-Net structure. The simplified M-UNet also improves model training efficiency.
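To illustrate why reducing the number of kernels and input channels shrinks the model, the parameter count of a 3x3 convolutional layer can be compared for two configurations. The kernel counts below are hypothetical examples, not M-UNet's actual configuration:

```python
def conv_params(in_ch, out_ch, k=3):
    """Parameters of a k x k convolution layer (weights + biases)."""
    return k * k * in_ch * out_ch + out_ch

# Hypothetical first encoder block (two conv layers): a U-Net starting
# with 64 kernels on a 1-channel grayscale input, versus a trimmed
# variant starting with 32 kernels. Numbers are illustrative only.
original = conv_params(1, 64) + conv_params(64, 64)
trimmed = conv_params(1, 32) + conv_params(32, 32)
print(original, trimmed)  # 37568 9568
```

Because the second layer's cost scales with the product of its input and output channels, halving the kernel count cuts the block's parameters by roughly a factor of four, which is the kind of saving that speeds up training on a simple modality.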
4. The authors might want to compare the performance of U-Net with some other models' performances. Please see the article "Image Segmentation using Deep Learning: A survey" for detail.
Thank you for the suggestion. Minaee et al. (2021) summarized image segmentation methods comprehensively. Other than FCN and U-Net, methods include region-based convolutional neural networks (R-CNN), dilated convolutional models, recurrent neural networks, attention-based models, generative models, etc. These models have different advantages and applications. For instance, R-CNN was proposed for object detection in complicated images. Attention enables an understanding of the relationships between different positions and scales within an image. Although these models achieved promising performance, they require a great number of training samples and complicated architectures. Regarding cell segmentation for medical images, due to the simple cell structure and the lack of available annotation information, most research has applied simple model structures such as FCN and U-Net [6][7][8][9][10]. Therefore, we modified the U-Net model and compared it to the conventional U-Net and FCN. Also, as stated in the Introduction section, this study aims to demonstrate the possibility of performing SRS image analysis using deep learning and to enhance the potential of promoting the SRS technique into the surgical process from an application perspective. This research not only demonstrates the possibility of performing SRS image analysis at a much more detailed level but also enhances the potential of promoting the SRS technique into the surgical process, which quickly provides surgical guidance without the requirement of the time-consuming staining process.

Authors need to explain the difference between the 5-layer U-Net and M-UNet.
Thank you for the comment. The 7layer-UNet and 5layer-UNet are implemented and compared to the M-UNet to confirm the necessity of the U-Net structure. As shown in Fig 3, the M-UNet and the original U-Net can be referred to as 9layer-UNets. The main difference among the 7layer-UNet, 5layer-UNet, and M-UNet is the number of convolutional layers. In particular, the 7layer-UNet removes one encoder block and one decoder block, whereas the 5layer-UNet removes two encoder blocks and two decoder blocks. The difference between the three structures is addressed in Section 4: to confirm the necessity of the U-Net structure, a simplified U-Net that removes one encoder block and one decoder block is evaluated as the 7layer-UNet, and a variant that excludes two blocks from both the encoder and decoder is evaluated as the 5layer-UNet.
2. In lines 317-327, authors need to specify the maximum size of the cell.
Thank you for the comment. We have added the maximum cell size in the manuscript.

3. Authors need to justify why the training time difference is so large between the 5-layer U-Net and M-Net.
Thank you for pointing this out. The time difference was caused by mistakenly using different machines in the experiments. To correct this, we have replaced the training time with the number of epochs employed. In the training process, if the validation loss has not been reduced for 25 epochs, the training process stops. Previously, U-Net was trained without early stopping, with a default of 300 epochs.

4. The proposed M-Net lacks architectural difference compared to the 5-layer U-Net.
Thank you for the comment. The 7layer-UNet and 5layer-UNet are implemented and compared to the M-UNet to confirm the necessity of the U-Net structure. As shown in Fig 3, the M-UNet and the original U-Net can be referred to as 9layer-UNets. The main difference among the 7layer-UNet, 5layer-UNet, and M-UNet is the number of convolutional layers. In particular, the 7layer-UNet removes one encoder block and one decoder block, whereas the 5layer-UNet removes two encoder blocks and two decoder blocks. The difference between the three structures is addressed in Section 4.1 Pixel-Wise Evaluation on the First Testing Region (lines 382-385, pg. 11): to confirm the necessity of the U-Net structure, a simplified U-Net that removes one encoder block and one decoder block is evaluated as the 7layer-UNet, and a variant that excludes two blocks from both the encoder and decoder is evaluated as the 5layer-UNet.
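The early-stopping rule described in the responses above (training halts once the validation loss fails to improve for 25 consecutive epochs, with a 300-epoch cap) can be sketched framework-agnostically. The function name and the precomputed loss sequence below are illustrative, not the paper's training code:

```python
# Early-stopping sketch: stop when validation loss has not improved
# for `patience` consecutive epochs, up to a `max_epochs` cap.

def train_with_early_stopping(val_losses, patience=25, max_epochs=300):
    """Return the number of epochs actually run, given a stream of
    per-epoch validation losses (precomputed here for illustration)."""
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stalled for `patience` epochs
    return epochs_run

# Example: loss improves for 10 epochs, then plateaus.
losses = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 100
print(train_with_early_stopping(losses, patience=25))  # 35
```

Reporting the epoch count produced by such a rule, rather than wall-clock time, removes the machine-dependence noted in the response.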