Object detection for automatic cancer cell counting in zebrafish xenografts

Cell counting is a frequent task in medical research studies. However, it is often performed manually, which makes it time-consuming and prone to human error. Automating cell counting can nonetheless be challenging, especially in crowded scenes with overlapping cells of varying shapes and sizes. In this paper, we introduce a deep learning-based cell detection and quantification methodology to automate the cell counting process in the zebrafish xenograft cancer model, an innovative technique for studying tumor biology and for personalizing medicine. First, we implemented a fine-tuned architecture based on Faster R-CNN using the Inception ResNet V2 feature extractor. Second, we performed several adjustments to optimize the process, paying attention to constraints such as the presence of overlapping cells, the high number of objects to detect, the heterogeneity of the cells' size and shape, and the small size of the data set. This method resulted in a median error of approximately 1% of the total number of cell units. These results demonstrate the potential of our novel approach for quantifying cells in poorly labeled images. Compared to traditional Faster R-CNN, our method improved the average precision from 71% to 85% on the studied data set.


Introduction
Cancer caused 10 million deaths in 2020 and is ranked as the sixth leading cause of death worldwide. Moreover, approximately one in five people worldwide develop cancer during their lifetime, and by 2040, the burden of cancer is projected to increase by 47% [1]. Despite the recent advances in this field, cancer treatment usually follows a "one-size-fits-all" approach, which leads to a good response for some patients but not for all. Many patients undergo trial-and-error treatment and are subjected to needless toxicity in pursuit of the best treatment. Targeted cancer treatment is a response to this inefficiency; however, methods for predicting how a particular cancer will respond to a specific treatment are still lacking [2]. In evaluating cancer treatments, the zebrafish larvae xenograft is a promising assay for identifying which therapies can lead to better results for precise and targeted medicine [2][3][4][5]. We conducted experiments using a data set of zebrafish xenografts provided by the Champalimaud Foundation of Lisbon, Portugal. Due to the cell morphology in the provided images, simple pre-processing techniques do not allow for accurate cell counts, and poor labeling prevents the development of a segmentation process. Our experiments focused on optimizing the cell-quantification process. The results indicate that the employed method can achieve promising detection performance, despite the dense cell overlap, the heterogeneity of the cell morphology, and the reduced sample size.

Main contributions
We present a fully automatic approach for detecting and counting cancer cells in zebrafish xenograft tumor images (see S2 Fig), with the following main contributions:
1. We demonstrate the superior capability of Faster R-CNN compared to the single shot detector (SSD), You Only Look Once (YOLO), and region-based fully convolutional networks (RFCNs) as a detection and classification system for complex imaging, when accuracy is a major concern.
2. We comprehensively analyze the parameters of Faster R-CNN that can influence the ability to detect small objects, a common problem in the medical imaging research field.
3. We evaluate the effect of changing the number of proposals when dealing with overcrowded situations and cell overlapping.
4. We demonstrate the potential of data augmentation to increase a network's performance on small data sets, which are typical of many medical applications.
5. We analyze the effect of defining accurate anchor ratios and scales, based on a detailed exploration of the cells, to optimize the detection of objects with different shapes.
6. We demonstrate our system's ability to detect cancer cells in images featuring several problems, by refining the system and adjusting it to address the problems at hand. In this way, we prove the suitability of object-detection frameworks as automatic cell-counting tools for problems in which segmentation is impossible due to inadequate labeling.
Furthermore, we present an improved version of Faster R-CNN, with precise fine-tuning that can handle common issues such as overlapping, small object size, and small data sets in medical imaging. This contribution should encourage further research, beyond the specific data used here.

Object-detection algorithms
In recent years, various strategies have been used to address object-detection problems. However, comparing the performance of systems in the literature is challenging because they feature different base feature extractors, image resolutions, and software and hardware platforms. Nevertheless, these systems can be distinguished by considering the trade-off between speed and accuracy [21].
In this work, we focus on four recent meta-architectures that provide differing trade-offs between speed and accuracy: SSD, YOLO, Faster R-CNN, and RFCNs. SSD [26] is an architecture that uses a single feed-forward convolutional network to predict classes and anchor offsets without the need for a second classification stage, as Faster R-CNN and RFCN require. This characteristic tends to increase the system's speed while decreasing the model's accuracy (a trade-off that is preferable, for instance, for video object detection). The accuracy of any of these meta-architectures can always be increased by using a more robust feature extractor.
YOLO [27] was originally published in 2016 by Redmon et al. It was announced as an object-detection framework that combines object localization and classification in a single network. YOLO treats detection as a regression problem, in which the image is divided into a grid, and for each grid cell there are bounding-box confidences and class probabilities. Known for its efficiency in real-time detection, YOLO has had several proposed improvements [28][29][30] over the years. The most recent improvement, YOLO v5, developed by Ultralytics [31], uses a cross stage partial network (CSPNet) [32] as the model backbone and a path aggregation network (PANet) [33] as the neck for feature aggregation.
Faster R-CNN [34] was developed based on the architecture of the Fast R-CNN method [35], which, in turn, was based on the regions of CNNs (R-CNN) method [36]. Faster R-CNN comprises two networks: the region-proposal network (RPN), whose primary purpose is to generate a set of proposed regions where objects could be present, and a network that uses the first network's output to detect objects in those regions.
Finally, RFCN is an approach that should be faster than Faster R-CNN because the former is fully convolutional, with almost all computation shared on the entire image. RFCN uses position-sensitive score maps to address the problem of translation-invariance in image classification and translation-variance in object detection [37].
In parallel with the definition of the object-detection system used, feature extractors also can be chosen. Depending on the problem's complexity, various feature extractors should be tested and have their performance compared.
Residual networks (ResNets) were introduced with the ResNet50 architecture [38]. This type of network uses so-called skip connections, also called residual blocks, in which activations from one layer are fed into a deeper layer. This allows the number of layers in the network to be increased efficiently [39].
GoogLeNet/Inception [40] introduced the concept of inception modules to reduce the number of parameters, even at 22 layers of depth. The network comprises nine inception modules, with a total of 100 layers. The inception modules create micro-architectures within the network's macro-architecture, in which operations happen in parallel and filters are applied to the output to reduce dimensionality [41].
The Inception ResNet [42] is based on the Inception architecture and introduces residual blocks to the architecture. Szegedy et al. found that Inception ResNet models can achieve higher accuracy values at lower epochs.
The NASNet system [43] was developed to optimize convolutional architectures. Inspired by the neural architecture search (NAS) framework, Zoph et al. [43] developed a system for "searching" the space of network architectures on a small proxy data set (CIFAR-10). Then, the learned architecture was transferred to a larger data set (ImageNet), using concepts such as architecture scalability and flexibility. The system outperformed a set of human-designed models.
Most state-of-the-art detectors perform well on challenging data sets such as COCO or PASCAL VOC, since these data sets typically contain objects occupying medium or large parts of an image, and the number of samples is significant. However, most detectors struggle with overlapping and/or small objects in data sets of small size. In crowded scenes, objects tend to overlap largely with each other, leading to occlusions, and detection boxes with high overlaps can match the same object. In such situations, it is often appropriate to apply strategies such as non-maximum suppression, in which the most appropriate bounding box for each object is selected. Additionally, small objects can be difficult to detect due to their low resolution. Moreover, existing deep neural networks lose the features of objects after several convolution and pooling operations [44]. The approaches to address these problems are typically meta-architecture dependent, since the strategies and hyperparameters used can change significantly from one type of architecture to another. The following subsections describe some of the techniques used to address these problems with Faster R-CNN, the algorithm used in this study.

Detection of small objects
In recent years, several authors have proposed various solutions to optimize small-object detection in images. Hu et al. [44] extracted image features from their third, fourth, and fifth convolutions and merged those into a one-dimensional vector. Some methods for detecting small objects were suggested using Faster R-CNN or SSD as the base framework [45].
Eggert et al. [46] presented an improved scheme for generating anchor suggestions and they proposed a modified Faster R-CNN that used higher-resolution feature maps for small objects. In [47], an improved loss function was proposed based on the intersection over union (IoU) for bounding box regression, and bilinear interpolation was used to improve the pooling operation for regions of interest (RoIs), to solve positioning error. In the detection phase, the authors used multiscale convolution feature fusion so that the feature map would contain more information and to improve the non-maximum suppression (NMS) technique in order to avoid the loss of overlapping objects.
Fu et al. [48] added deconvolution layers to SSD+Residual-101 to introduce additional large-scale context into object recognition and improve accuracy, especially for small objects. Cao et al. [49] proposed a multi-level feature-fusion method for introducing context information into SSD to improve accuracy for small objects. In [50], several high-level feature maps at various scales were extrapolated simultaneously to increase the spatial resolution. Tong et al. [51] delved deeper into handling small objects in computer vision. They suggested five perspectives for possible future research: emerging small-object-detection data sets and benchmarks, multi-task joint learning and optimization, information transmission, weakly supervised small-object-detection methods, and frameworks for small-object detection. Moreover, the atrous rate is a mechanism that increases the model's performance when detecting small objects [45]. In the first stage, this rate is applied to the feature tensor from which boxes are cropped to obtain box predictions.

Detection in crowded and overlapping scenes
Faster R-CNN outperforms Fast R-CNN by using an RPN with a CNN model, whose primary purpose is to propose a set of regions where objects could be present. By default, Faster R-CNN assumes 2,000 proposals, which are then reduced to a small number of proposals, based on the number of relevant objects detected in the first stage, and reshaped to a fixed size in a process called RoI pooling [34]. Because the original version of Faster R-CNN was designed for images with a relatively small number of relevant objects, the network must be adjusted to the new context, in which the number of cells in an image can be relatively high, similar to what is done when detecting humans in crowded scenes [52]. This change demands higher precision in adjusting the NMS threshold, which is responsible for selecting the most appropriate bounding box for each detected object: a higher value than needed can lead to more false positives, whereas a lower threshold can miss possibly relevant objects.
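For reference, the greedy NMS procedure that this threshold controls can be sketched in a few lines of numpy. This is a minimal illustration of the technique, not the TensorFlow API's own implementation:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # best-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with the remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Suppress boxes whose overlap exceeds the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```

A lower `iou_threshold` merges more overlapping detections into one, which in a dense cell cluster can erase true neighbors; a higher value keeps near-duplicates, inflating the count.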

Small data sets
Data augmentation is the process whereby new instances are created from existing ones, usually on the fly, to avoid wasting storage. Several techniques can be applied, such as rotation, zoom, color-channel changes, and saturation changes.
These approaches introduce noise into the training process and force the model to be more tolerant of possible changes to the images, including position, orientation, and size, making the model more robust and mitigating overfitting. This process can be useful when dealing with small data sets, a common problem when analyzing medical images. In fact, it artificially boosts the size of the training set [53]. Several techniques were applied in this work:
1. Random horizontal flip: The image and detections were flipped horizontally. This occurred randomly with a 50% probability.
2. Random vertical flip: The image and the ground-truth annotations were flipped vertically. This occurred randomly with a 50% probability.
3. Pixel value scale: The values of all pixels in the image were scaled randomly by a constant value between fixed minimum and maximum values.
4. Random image scale: Images were enlarged or shrunk randomly, while keeping the same aspect ratio.
5. RGB to gray: Entire images were converted to grayscale randomly, with a 10% probability.
6. Adjust brightness: The image brightness was changed randomly, up to a maximum prefixed threshold. The image outputs were saturated at values between 0 and 1.
7. Adjust contrast: The contrast was scaled randomly by a value between fixed minimum and maximum values.
8. Adjust saturation: The saturation was altered randomly by a value between fixed minimum and maximum values.
9. Distort color: A random color distortion was performed.
10. Jitter boxes: The corners of the boxes in the images were jittered randomly, as determined by a ratio. For instance, if a box was [100, 200] and the ratio was 0.02, the corners could move by [1,4].
11. Crop image: The images and bounding boxes were cropped randomly.
12. Crop to aspect ratio: The images were cropped randomly to a given aspect ratio.
13. Black patches: Black square patches were randomly added to an image.
14. Rotation 90: The image and detections were rotated randomly by 90° counter-clockwise 50% of the time.
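As a concrete illustration, two of the augmentations above (random horizontal flip and brightness adjustment) can be sketched in numpy. The function names and the [0, 1] image range are our assumptions; in practice, these transformations are configured as preprocessing options of the TensorFlow Object Detection API rather than written by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_horizontal_flip(image, boxes, p=0.5):
    """Flip the image and its boxes left-right with probability p.

    image: (H, W, C) array; boxes: (N, 4) array of [x1, y1, x2, y2].
    """
    if rng.random() < p:
        w = image.shape[1]
        image = image[:, ::-1, :]
        # Mirror the x coordinates; x1 and x2 swap so x1 stays the left edge
        boxes = np.stack([w - boxes[:, 2], boxes[:, 1],
                          w - boxes[:, 0], boxes[:, 3]], axis=1)
    return image, boxes

def adjust_brightness(image, max_delta=0.2):
    """Shift pixel values by a random delta, saturating to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)
```

Note that geometric augmentations (flips, crops, rotations) must transform the bounding boxes together with the pixels, whereas photometric ones (brightness, contrast, saturation) leave the boxes untouched.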

Dealing with objects of different shapes
The original version of Faster R-CNN, by default, generated anchor boxes with three aspect ratios (1:1, 1:2, 2:1) and three scales (128 × 128, 256 × 256, 512 × 512), resulting in nine anchor boxes at each location. The anchors can be adjusted to the problem at hand, considering that different objects have different shapes, which affects the algorithm's performance in the detection stage, as presented in [54], where the anchors were adjusted to detect cars. In the present work, this parameter can be changed within Faster R-CNN to deal with the morphology of the studied cells.
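The anchor-generation scheme above can be sketched as follows. This is a minimal illustration, and the convention that the ratio denotes height over width is our assumption (implementations differ on this point):

```python
import numpy as np

def generate_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes [x1, y1, x2, y2] centered at `center`.

    Each anchor has area scale**2; `ratio` is height / width, so a
    ratio of 2.0 yields a tall box and 0.5 a wide one.
    """
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # width shrinks as the box gets taller
            h = s * np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```

With the default three scales and three ratios, this yields the nine anchors mentioned above, all sharing the same center and the same area per scale.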

Data
The data we used in this study included 97 RGB images of zebrafish xenografted tumors, acquired with the initial purpose of evaluating targeted cancer treatment and assessing tumor response to various therapies. For this specific study, the images were labeled manually using the Fiji software [55]. The labeling was performed by a domain expert at the Champalimaud Foundation, who applied a single dot over each cell. Among the images, 89 had dimensions of 512 × 512 pixels, and the remaining eight had dimensions of 1280 × 1280 pixels, all with three channels as a result of the immunofluorescence technique used to label nuclei, human cells, and apoptotic or immune cells. These eight images were scaled to 512 × 512 pixels using the inter-area technique, which is considered the best for decreasing image dimensions [56]. Because the number of cells varied significantly across the images (ranging between 18 and 661), the images were divided into three sections based on the cells' complexity and density:
• Low: images with fewer than 50 cells;
• Medium: images with cell counts in the range [50, 250]; and
• High: images with more than 250 cells.
A descriptive summary is provided in Table 1.
High-skewness situations are assumed when the value of skewness is less than −1 or greater than 1 [57]. Similarly, kurtosis is considered high when its value is less than −2 or greater than 2. As shown in Table 2, the whole data set has a skewness of 1.7 and a leptokurtic distribution (kurtosis higher than 3), which can lead to situations in which the predictive model underestimates some predictions. With that in mind, the data set was partitioned into three sets (low, medium, and high) to reduce the skewness and kurtosis of each sample.
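The skewness and kurtosis figures above can be reproduced from the per-image cell counts with the standardized moments; a minimal numpy sketch (scipy.stats provides the same computations, with kurtosis usually reported in excess form, i.e., minus 3):

```python
import numpy as np

def skewness(x):
    """Fisher-Pearson skewness: E[(x - mu)^3] / sigma^3."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** 3).mean() / sigma ** 3

def kurtosis(x):
    """Kurtosis E[(x - mu)^4] / sigma^4 (equals 3 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** 4).mean() / sigma ** 4
```

The non-excess form is used here so that "leptokurtic" corresponds to values above 3, matching the convention in the text.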
Taking into account the small number of samples available, the fine-tuning of the model considered the entire data set, but the regression measures calculated took the data set's partitioning into account.
Besides the number of cells present in each image, it was important to establish aspect ratios and scales when generating anchors to evaluate the distributions of the width, height, and aspect ratios of the 17,128 cells present in the images. The statistical measures reported in Table 2, and displayed in Fig 2, refer to images with dimensions 512 x 512 pixels. If the image is resized when fine-tuning the model, the scaling of these values should be considered.
As shown in Fig 3, the cells' aspect ratio ranged between 0.5 and 2.0. One possible approach to selecting aspect ratios for the anchor generator was to start with three ratios (e.g., 0.5, 1.0, and 2.0) that cover the majority of the distribution, as in Fig 4.

Data set splitting and labeling. To divide the initial data set, we considered not only the number of images but also the number of available cells or objects shown in the various images and the average number of cells in each image for the data sets, as this value could range between 18 and 661 cells.

Table 1. Descriptive summary (mean, range, percentiles, skewness, and kurtosis) of the cell quantities in the provided images, for the whole data set and for the various complexity levels.
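The three-way complexity split described above reduces to a simple rule per image; how ties at exactly 50 and 250 cells are handled is our assumption, inferred from the "fewer than 50" and "greater than 250" definitions:

```python
def complexity_level(cell_count):
    """Assign an image to a complexity section by its cell count."""
    if cell_count < 50:
        return "low"
    if cell_count <= 250:
        return "medium"
    return "high"
```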


PLOS ONE
Object detection for automatic cancer cell counting in zebrafish xenografts

The initial data set was split into three data sets, with approximately 80% for training, 15% for validation, and 5% for testing, as shown in Table 3. Considering the data set's small size, and given that this work aimed to tune a network able to deal with the several limitations present in the data, we decided to assign a larger proportion of the data to the validation set and a smaller proportion to the test set, the latter of which was used only after we selected the best model.
The labeling of the images in the data set was inadequate for segmentation. Each cell was annotated by a dot, which did not correspond exactly to the center of the cell. To overcome this problem and apply object detection, we performed additional labeling considering those initial annotations using the labelImg software. This software allows bounding boxes to be created around objects to extract the coordinates of those boxes in the image.
Loss functions. The loss function in Faster R-CNN is evaluated during the algorithm's two stages. At each stage, the error is measured using a multi-loss function, which sums up to four losses in total. The first stage of the algorithm, the RPN, measures the model's performance taking into account the objectness loss and the localization loss, which result in the multi-loss function defined in Eq 1:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \quad (1)$$

Thus, the RPN takes an image (of any size) as its input and outputs a set of rectangular object proposals, each with an objectness score [34]. In Eq 1, p stands for the predicted objectness class, p* for the ground-truth objectness class, t for the predicted bounding box, and t* for the ground-truth bounding box. The objectness loss classifies a patch as having, or lacking, an object. In this phase, the goal is not to identify the class of the identified object but to determine whether a patch contains an object or the background. This objectness score is used to filter out bad predictions in the second stage. With this purpose in mind, Faster R-CNN uses a classifier with two possible classes: one for the presence of an object and one for the background.

Fig 4. Example of possible aspect ratios, taking into account the ratio distribution. The magenta shape corresponds to a 1:2 aspect ratio, the blue corresponds to a 2:1 aspect ratio, and the green corresponds to a square. The red dot is the anchor.
Localization loss, in which a regression is applied to the bounding box, considers the parametrization of the following four coordinates, as shown in Eq 2:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a),$$
$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a), \quad (2)$$

where x, y, w, and h denote the box's center coordinates and its width and height. The variables x, x_a, and x* denote the predicted box, anchor box, and ground-truth box, respectively (likewise for y, w, and h).
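The parametrization of Eq 2 can be written as a small helper; boxes are given here in center/width/height form, and this is an illustrative sketch rather than the API's own box coder:

```python
import numpy as np

def encode_box(box, anchor):
    """Parametrize a box [cx, cy, w, h] relative to an anchor (Eq 2)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    # Center offsets are normalized by the anchor size; width and
    # height are encoded as log ratios.
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])
```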
For the second stage of the algorithm, two new losses are computed: the box-predictor localization loss and the box-predictor classification loss. Each RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. The multi-loss function is defined in Eq 3:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v), \quad (3)$$

where p stands for the discrete probability distribution over the categories and t^u is the bounding-box regression achieved from the first stage using the RPN localization loss. The indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise (the background is labeled as 0). The hyperparameter λ controls the balance between the two losses [35].
Box-predictor localization loss. Ren et al. [34] applied a smooth-L1 loss on the position (x, y) of the top-left of the bounding box and the logarithm of the heights and widths, similar to the RPN localization loss applied in the first stage, as shown in Eq 4:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases} \quad (4)$$

where x is the difference between the prediction and the target. A smooth L1 loss was chosen because this type of loss is less sensitive to outliers and thus more robust. When the regression targets are unbounded, training with an L2 loss can demand careful tuning of the learning rates to avoid exploding gradients [35].
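The smooth-L1 loss of Eq 4 translates directly into code; a minimal elementwise sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss of Eq 4, applied elementwise.

    Quadratic near zero, linear beyond |x| = 1, so large residuals
    (outliers) contribute less steeply than under an L2 loss.
    """
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)
```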
This loss is aimed at refining the localization of the bounding boxes before classification.
Box-predictor classification loss. By default, a weighted sigmoid loss was used to quantify the classification loss. However, focal sigmoid classification loss and bootstrapped sigmoid classification loss were also tested.
As this is a binary classification, all the former classification losses are based on the cross-entropy loss for binary classification [58], presented in Eq 5:

$$CE(p, y) = -\log(p_t) \quad (5)$$

In Eq 5, y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1. For notational convenience, p_t is defined as follows:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise.} \end{cases}$$

Evaluation metrics. Mean average precision (mAP) is the standard metric used to evaluate an object-detection algorithm. This metric summarizes the precision and recall of the detected bounding boxes and ranges from 0 to 1, with higher values indicating better performance. The mAP, which is based on the concept of IoU, is a good measure of the network's sensitivity [41]. The IoU is the ratio of the overlapping area between the ground-truth and predicted boxes to their total, or union, area.
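The IoU underlying both the mAP computation and the NMS step is straightforward to compute for axis-aligned boxes; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```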
In addition to the mAP metric used to fine-tune the model, and taking into account the problem at hand, other metrics were considered to evaluate the model's performance. Although mAP is the standard metric in object-detection systems, it has limitations, especially when dealing with a large number of objects: if a specific object is not covered by a region proposal in the first stage of the algorithm, it will not be taken into account in the model's accuracy. Therefore, each model's performance on the validation data set was tested, and several metrics were applied to measure the regression performance between the predicted value (i.e., the number of objects predicted) and the ground-truth value: the root mean squared error (RMSE), mean absolute error (MAE), and median absolute error (MedAE). In the end, due to the nature of the problem, our best models should account not only for the mAP but also for the regression metrics. With that purpose, six champion models were selected and trained for additional epochs. The model that provided the best performance was considered the winner.
Implementation details. The experiments were implemented using the TensorFlow Object Detection API, an open-source framework built on top of TensorFlow. All tests described in the following sections were carried out using Azure cloud computing resources, namely four Tesla K80 GPUs (two physical cards) with 32 GiB of GPU memory, on a virtual machine with 24 vCPUs, 224 GiB of memory, and a 1.44 TB SSD as temporary storage.
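The count-regression metrics above (RMSE, MAE, MedAE) compare the number of detected cells with the ground-truth count per image; a minimal numpy sketch:

```python
import numpy as np

def count_metrics(predicted, actual):
    """RMSE, MAE, and MedAE between predicted and ground-truth cell counts."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "medae": float(np.median(np.abs(err))),
    }
```

MedAE is robust to a few badly miscounted images, which is why a median error of about 1% of the total cell count is a meaningful summary for this data set.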

Performance comparison of different object-detection systems
In this experiment, we tested the performance of different feature extractors and meta-architectures using a model pretrained on the COCO data set [59]. Each model was run for 4,000 steps (200 epochs) with images resized to 600 x 600 pixels, the default value in TensorFlow API.
Choosing the meta-architecture. During the initial stage, our concern was defining the best meta-architecture for the problem at hand.
As shown in Fig 5, we created four models using RFCN, SSD, YOLO, and Faster R-CNN. The Faster R-CNN model showed a clear advantage over the others. When comparing SSD and Faster R-CNN, using Inception V2 as the feature extractor, Faster R-CNN achieved a mAP value of 0.34 after 4,000 steps, whereas SSD was unable to exceed 0.04 mAP. When comparing RFCN and Faster R-CNN, this time with ResNet101 as the feature extractor, RFCN achieved a mAP of 0.31 after 4,000 steps, whereas Faster R-CNN achieved a value of 0.53 mAP. When implementing YOLO v5, provided by Ultralytics [31,60], we achieved a value of 0.43 mAP at 4,000 steps. This experiment shows the suitability of Faster R-CNN for the problem at hand, compared to the other meta-architectures tested.
Defining the best feature extractor. To better understand the influence of the feature extractor, we applied five feature extractors to Faster R-CNN: Inception V2, NAS, ResNet50, ResNet101, and Inception ResNet V2. Table 4 compares the mAP values on the validation data set and the training times of the feature extractors. At 4,000 steps, when Faster R-CNN achieves convergence, Inception ResNet V2 performed the best, achieving a value of 0.71 mAP. As expected, increasing the feature extractor's complexity increases the training time per epoch; with NAS, this value reaches 96 seconds per epoch, even though the model's performance is not the highest, with a mAP of 0.44.
These results demonstrate that Faster R-CNN can achieve a performance of around 18 times better when upgrading the feature extractor from Inception V2 to Inception ResNet V2, thus demonstrating the advantage of using the latter when accuracy is the primary goal.

Performance evaluation for detecting small objects
The purpose of the experiments that followed the testing of the best feature extractor was to increase the system's capability to detect small cells. Thus, we trained our model (Faster R-CNN with Inception ResNet V2) in various experiments by changing the image size, changing the stride used in the anchor generator, using an atrous rate, and adjusting the IoU threshold. Starting from the default values in the TensorFlow API (S1-S3 Tables), grid search was used on those hyperparameters to maximize the model's performance without overfitting. This tuning process is clarified in the following sections.
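The grid search mentioned above can be sketched as an exhaustive loop over the hyperparameter grid. Here `evaluate_model` is a hypothetical stand-in for training one configuration and returning its validation mAP, and the grid values are only examples drawn from the ranges explored in this section:

```python
import itertools

# Hypothetical hyperparameter grid; in practice each setting maps to a
# field in the TensorFlow Object Detection API pipeline configuration.
grid = {
    "image_size": [512, 600, 1100],
    "anchor_stride": [8, 16],
    "nms_iou_threshold": [0.3, 0.5, 0.7],
}

def grid_search(evaluate_model):
    """Return the hyperparameter combination with the highest validation score."""
    best_score, best_params = -1.0, None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate_model(**params)   # e.g., validation mAP for this setting
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

Because every configuration requires a full training run, the grid must stay small; the sections below explore each hyperparameter's range one at a time for the same reason.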
Changing the image size. Because we are dealing with small objects and considering the known limitations of object-detection algorithms, one possible approach to dealing with this situation is to increase the size of the input images. To test the effect of image resizing on the algorithm's performance, we conducted experiments using various sizes and resizing methods.
We trained five models for 3,000 steps and changed the input images' size, where the lowest resolution was 256 × 256 and the highest was 1100 × 1100.
As a default, the TensorFlow API resizes images so that the smaller side measures at least 600 pixels and the larger side at most 1024 pixels, maintaining the aspect ratio. Because we were dealing with square images, and knowing that an increase in the number of pixels demands a longer training time and more computational resources, we made several adjustments to understand how image resizing can affect the algorithm's performance. At most, we resized the images to 1100 × 1100 pixels, the maximum possible size for training the network with a batch size of 1 without a RAM leak. As Fig 6 indicates, the images' dimensions significantly influenced the model's mAP, probably because of the presence of small objects in the input data. This was corroborated by the results presented in Table 5, which indicate a decrease in the RPN localization loss and RPN objectness loss for the higher-resolution images.
Furthermore, the dimensions also affected the model's convergence: when using the original dimensions (512 × 512), the model began to stabilize only at 3,000 steps. However, when using the maximum dimensions, a convergent state was achieved at around 1,000 steps, as shown in Fig 6. Moreover, it is possible to conclude that low-resolution images do not allow the model's performance to improve during training, as seen with the images with dimensions of 256 × 256. Finally, as expected, increasing the dimensions of the images also increases the time required to complete each epoch, which ranged from 28 seconds for the lowest resolution to 94 seconds for the highest. Before running further experiments, we adjusted the optimizer and the learning rate (LR). Whereas the Momentum optimizer was applied in the original Faster R-CNN, we tested three optimization algorithms to check their impact on the model (Fig 7).
As shown in Fig 7, changing the optimizer from Momentum to Adam with an exponential LR allowed the model to stabilize at around 400 steps, with an approximate mAP value of 0.8, whereas Momentum required 600 steps to achieve that performance. Due to this faster convergence, Adam was selected as the optimizer in the subsequent experiments. Furthermore, with Adam as the optimizer, the initial LR was defined as 0.0002, which seemed to offer better accuracy and to boost the model's performance in earlier stages when compared to other LR values.

Changing the anchor generator's stride. When generating bounding boxes, different parameters should be adjusted to optimize the system's performance. One of these parameters is the stride (i.e., the spacing between anchors). The TensorFlow API defines a default value of 8 pixels, but this value can be doubled, when needed, to achieve a faster training time.
As seen in Table 6, because we were dealing with small cells, a stride of 8 pixels was adequate for the problem at hand: with a stride of 16 pixels, the mAP reached 0.80, whereas with 8 pixels, it reached 0.85.
Changing the stride also affects the training time. In particular, decreasing the stride from 16 pixels to 8 pixels increased the time per epoch from 46 seconds to 96 seconds, slowing down training.
Using an atrous rate. The atrous rate allows object-detection models to better detect smaller objects. In this experiment, we tested the model's performance with and without an atrous rate. As seen in Table 7, using an atrous rate during training provided slightly better results than not using one, without affecting the training time.
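In the same pipeline.config format, the stride and atrous settings discussed above would look roughly as follows. This is a hedged sketch; the atrous value of 2 is the usual choice in the API's inception_resnet_v2_atrous configurations and is our assumption:

```protobuf
model {
  faster_rcnn {
    first_stage_anchor_generator {
      grid_anchor_generator {
        # 8-pixel spacing keeps the anchor grid dense enough for small cells;
        # doubling it to 16 trains faster but lowered the mAP from 0.85 to 0.80
        height_stride: 8
        width_stride: 8
      }
    }
    # Dilated (atrous) convolutions enlarge the receptive field without
    # reducing feature-map resolution, which helps with small objects
    first_stage_atrous_rate: 2
  }
}
```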
Changing the IoU threshold. The IoU threshold defines which bounding boxes should be considered as overlapping during the NMS step. The chosen threshold, or cutoff value, determines the model's balance between sensitivity and specificity. The higher the value, the more specific and less sensitive the model, leading to fewer false positives and more true negatives. In contrast, a lower value translates into higher sensitivity and lower specificity, yielding more false positives and fewer true negatives. Because this threshold is a problem-dependent parameter, it was adjusted via grid search, and the value that produced the best evaluation metric on the validation set was chosen.
We evaluated the model's performance under various IoU thresholds, ranging from 0.3 to 0.8, as seen in Table 8.
Decreasing the IoU threshold in NMS increases the number of accepted proposals, which in turn increases the training time. In addition, decreasing the IoU threshold seems to affect the second-stage validation losses for both classification and localization.
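In the API, this cutoff corresponds to the RPN's NMS IoU threshold; a sketch with the value that performed best in our grid search:

```protobuf
model {
  faster_rcnn {
    # Proposals whose IoU with a higher-scoring proposal exceeds this value
    # are suppressed during NMS; 0.7 gave the best validation performance
    # in the 0.3-0.8 grid search
    first_stage_nms_iou_threshold: 0.7
  }
}
```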

Performance evaluation for overcrowded scenes
The maximum-number-of-proposals parameter should always be at least the maximum number of ground-truth boxes present in the input images. We tested how the number of proposals influenced the model's performance by manually adjusting this value from 1,000 to 3,500 proposals in steps of 500, as seen in Table 9. Increasing the number of proposals increased the training time.
Although the final mAP indicates no apparent differences between the numbers of proposals, the model performed best with 3,000 proposals. Thus, we can conclude that the number of proposals generated can affect the training speed without any significant effect on the model's accuracy. However, the losses associated with the second stage of the algorithm decreased significantly when the number of proposals was increased, even though this reduction was not reflected in the final mAP.
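A sketch of the corresponding proposal settings; the second-stage detection caps are our assumption, added so that the final output can also hold up to 3,000 detections per image:

```protobuf
model {
  faster_rcnn {
    # Must be at least the largest ground-truth count in any training image
    first_stage_max_proposals: 3000
    second_stage_post_processing {
      batch_non_max_suppression {
        max_detections_per_class: 3000
        max_total_detections: 3000
      }
    }
  }
}
```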

Dealing with small datasets
In this experiment, we evaluated the effectiveness of various data-augmentation techniques on the model's performance. We tested the approaches individually for 400 steps (20 epochs). Table 10 shows that the model's performance increased significantly when applying data-augmentation techniques. In particular, the mAP values increased by 0.3 with respect to the baseline model, which did not use data augmentation.

Table 8. Validation losses, mAP, and training speed comparison using various IoU thresholds. Total loss for validation, box-validation loss for localization (Loc.) and classification (Class.), best mAP before 400 steps, and training time for one epoch in seconds using IoU thresholds ranging from 0.3 to 0.8. Although the change in the IoU threshold did not lead to significant changes in mAP, the best performance was obtained with a value of 0.7. https://doi.org/10.1371/journal.pone.0260609.t008

Table 9. Validation losses, mAP, and training speed comparison using different numbers of proposals, ranging from 1,000 to 3,500. Total loss for validation, box-validation loss for localization (Loc.) and classification (Class.), best mAP until 400 steps, and training time for one epoch in seconds using between 1,000 and 3,500 proposals. Increasing the number of proposals decreased the box-validation loss and, consequently, improved the mAP. https://doi.org/10.1371/journal.pone.0260609.t009

The results indicate that the most effective augmentation techniques for short-term training were image scaling, jitter boxes, brightness adjustment, and image cropping. The saturation adjustment proved its efficiency only after more than 300 steps of training. After identifying the individual data-augmentation techniques that produced the best performance, we tested two configurations: combining the four techniques that gave the best individual results (0.82 mAP), named the Top 4 (15), and applying all augmentation techniques simultaneously (16). Although the mAP of the Top 4 did not differ from that of the best individual techniques, using a set of data-augmentation techniques can provide results that are better than or comparable to those of the techniques used in isolation.
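Assuming the Top 4 techniques plus the saturation adjustment map onto the API's standard preprocessing options (the mapping from the technique names to these proto messages is our interpretation), the combined configuration would be sketched as:

```protobuf
train_config {
  # Each option is applied independently during training
  data_augmentation_options { random_image_scale { } }        # image scaling
  data_augmentation_options { random_jitter_boxes { } }       # jitter boxes
  data_augmentation_options { random_adjust_brightness { } }  # brightness
  data_augmentation_options { random_crop_image { } }         # image cropping
  data_augmentation_options { random_adjust_saturation { } }  # saturation
}
```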

Dealing with objects of different shapes
The grid-anchor generator is defined by stride, scales, and aspect ratios. The experiments that followed the evaluation of the augmentation techniques analyzed the training behavior under changes to those parameters. Notably, at any given location (i.e., at each anchor position), the number of anchors generated equals the number of scales times the number of aspect ratios, covering all possible combinations of the two. In this experiment, we optimized the anchor-generation process in a quantifiable way to match the receptive-field shapes and to reduce the number of invalid anchors. Taking as the baseline the cell exploration performed in the Data section, in which the dimensions and aspect ratios of the different cells were analyzed, we made some changes to the default values used in Faster R-CNN and evaluated how those changes affected performance during the detection stage.

Changing the anchor generator's aspect ratio. By default, Faster R-CNN considers three aspect-ratio values (0.5, 1.0, and 2.0). In this experiment, we checked whether increasing the number of aspect ratios could also increase the model's performance. Therefore, we added the value 1.5, considering that the cells can assume different shapes and could demand more flexible aspect ratios. As the results in Table 11 show, adding one more aspect ratio to the anchor generator did not improve the model's performance in the long term. Although the mAP rose faster when using this additional value, we verified that three aspect ratios were sufficient to detect the objects in our images. This could be explained by the fact that the bounding boxes are eventually readjusted to the ground-truth boxes through the localization losses in the first and second stages of the object detection.
Changing the anchor generator's scale. Resizing the images to 1100 × 1100 pixels affects the scale value used in the anchor generator. Therefore, the values presented in Table 2 for the distribution of the cells' width and height had to be adjusted to the new dimensions. Table 12 shows those values rescaled accordingly, along with a descriptive statistical summary.
To calculate the best scales to use for the anchors, and because the base size for the anchors was 256 pixels, the pixel values presented in Table 13 were scaled to this size.
The cells' mean width and height were around 0.13 of the base anchor size; the minimum was around 0.03, and the maximum was 0.38. Considering these values, we ran experiments with various scale lists to check the behavior of the mAP at 400 steps.
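The conversion from pixel measurements to anchor-relative scales is straightforward arithmetic; the sketch below illustrates it with hypothetical pixel values chosen to reproduce the ratios above (only the 256-pixel anchor basis comes from our setup):

```python
# Express cell dimensions (pixels, after resizing to 1100 x 1100) as
# fractions of the 256-pixel base anchor used by the grid-anchor generator.
ANCHOR_BASE = 256  # base anchor size in pixels

def to_anchor_scale(size_px: float, base: int = ANCHOR_BASE) -> float:
    """Return an object dimension as a fraction of the base anchor size."""
    return size_px / base

# Hypothetical measurements (px) standing in for the data-set statistics:
mean_px, min_px, max_px = 33.0, 8.0, 97.0

print(round(to_anchor_scale(mean_px), 2))  # -> 0.13
print(round(to_anchor_scale(min_px), 2))   # -> 0.03
print(round(to_anchor_scale(max_px), 2))   # -> 0.38
```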
The results shown in Table 13 indicate that limiting the scales to values similar to the cell size increases overfitting, thus leading to poorer performance. However, reducing the lowest scale value improves the model because it is better adjusted to the problem's reality.
By testing a single scale (in this case, 0.3), we confirmed that object detection was profoundly affected. After concluding that the scale list [0.15, 0.3, 0.5, 1.0] achieved the best results in the previous experiments, we re-evaluated the mAP using the default list of aspect ratios to determine the best configuration for continuing the experiments. By combining the information on scales and aspect ratios, we confirmed, as shown in
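The resulting anchor-generator configuration can be sketched in pipeline.config form as follows (our reconstruction, combining the best scale list with the default aspect ratios):

```protobuf
model {
  faster_rcnn {
    first_stage_anchor_generator {
      grid_anchor_generator {
        # Lowest scale pulled down toward the smallest observed cell size;
        # larger scales retained to avoid overfitting to the cell dimensions
        scales: [0.15, 0.3, 0.5, 1.0]
        # The API's default three ratios proved sufficient
        aspect_ratios: [0.5, 1.0, 2.0]
      }
    }
  }
}
```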

Evaluating the models' regression metrics
To select the models in the previous steps, we mainly considered the mAP obtained after training. To evaluate the models further, we computed several regression metrics, including the RMSE, MAE, and MEDAE, on the data set with medium complexity. The results are shown in S1-S3 Tables. After evaluating the results, we selected the six best models and ran each for 2,000 steps to check their performance. The six models were as follows:

1. Fine-tuned (A): the model fine-tuned to balance mAP and speed over several adjustments. It is represented in blue, and it achieved a mAP of 0.8457.

2. Best regression (B): the model that returned the best average regression metrics. It is represented in yellow, and it achieved a mAP of 0.7705.

3. Mixed regression and fine-tuned (C): a mix between the two previous options, in which the best regression-metric results were chosen only if the mAP did not change by more than 0.01 from the best mAP obtained with the present configuration. It is represented in green, and it achieved a mAP of 0.8300.

As Table 15 shows, the model with the highest mAP is not necessarily the one with the best regression results.
Upon noticing that image complexity affected the final architectures differently, we defined a model for each complexity level and adjusted the score threshold for each.
For the low-complexity images, we chose the best-MEDAE model and set the score threshold to 0.9. For the medium level, we selected the best RMSE & MAE model with a threshold of 0.4. Finally, for the high-complexity images, we selected the best RMSE & MAE model with a threshold of 0.2.
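The three count-regression metrics can be computed directly from per-image true and predicted cell counts; a minimal standard-library sketch (the count arrays are illustrative, not data from the study):

```python
import math
from statistics import median

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted cell counts."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def medae(y_true, y_pred):
    """Median absolute error: robust to occasional large counting mistakes."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))

# Illustrative per-image counts:
true_counts = [120, 80, 200, 150]
pred_counts = [118, 85, 190, 151]

print(mae(true_counts, pred_counts))    # -> 4.5
print(medae(true_counts, pred_counts))  # -> 3.5
print(round(rmse(true_counts, pred_counts), 3))  # -> 5.701
```

MEDAE is the metric least sensitive to a single badly miscounted image, which is why it can rank models differently from RMSE and MAE.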
Finally, descriptive statistics were computed from the final results, as shown in Table 16, and conclusions were drawn.

Conclusion
This paper proposes a refined version of Faster R-CNN (S6 Fig) for the automatic counting of cancer cells with a promising mAP. We overcame image constraints such as small objects, overcrowded scenes, and small data sets by carefully adjusting the distinct hyperparameters involved in this process. We showed that object detection is a potentially effective and efficient counting technique that can lead to good results, even given weak labeling of the ground truth. In future work, we will attempt to extend this process to other cell types that present similar problems. Applying the insights obtained during this project to a new, larger data set should eventually lead to models with higher accuracy. Moreover, at least three labeled data sets with significant numbers of images should be defined, taking the images' complexity into account. As perceived during the experiments, the model's performance in inferring counts is positively associated with the degree of cell clumping present in the images; this should be considered in future work.

S6 Fig. Overview of the Faster R-CNN object-detection system implemented in this research. This system includes a first stage (RPN) whose main purpose is to propose a set of regions where objects could be present, and a second stage in which the output of the first stage is used to detect and classify objects in those regions. (TIF)

S1