POSEA: A novel algorithm to evaluate the performance of multi-object instance image segmentation

  • Nianchao Wang,

    Roles Formal analysis, Investigation, Methodology, Project administration, Validation, Writing – original draft

    Affiliation Texas A&M University, TAMU, College Station, Texas, United States of America

  • Linghao Hu,

    Roles Resources, Writing – review & editing

    Affiliation Texas A&M University, TAMU, College Station, Texas, United States of America

  • Alex J. Walsh

    Roles Conceptualization, Data curation, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    walshaj@tamu.edu

    Affiliation Texas A&M University, TAMU, College Station, Texas, United States of America

Abstract

Many techniques and software packages have been developed to segment individual cells within microscopy images, necessitating a robust method to evaluate images segmented into a large number of unique objects. Currently, segmented images are often compared with ground-truth images at a pixel level; however, this standard pixel-level approach fails to compute errors due to pixels incorrectly assigned to adjacent objects. Here, we define a per-object segmentation evaluation algorithm (POSEA) that calculates segmentation accuracy metrics for each segmented object relative to a ground truth segmented image. To demonstrate the performance of POSEA, precision, recall, and f-measure metrics are computed and compared with the standard pixel-level evaluation for simulated images and segmented fluorescence microscopy images of three different cell samples. POSEA yields lower accuracy metrics than the standard pixel-level evaluation due to correct accounting of misclassified pixels of adjacent objects. Therefore, POSEA provides accurate evaluation metrics for objects with pixels incorrectly assigned to adjacent objects and is robust for use across a variety of applications that require evaluation of the segmentation of unique adjacent objects.

Introduction

Altered cellular-level heterogeneity within tissues is a characteristic of many diseases, including autoimmune disease [1], fibrotic skin disease [2], lysosomal storage disease [3], and cancer [4]. Tumors are complex tissues composed not only of cancer cells but also vasculature, immune cells, and stromal cells. High levels of intratumoral heterogeneity predispose patients to inferior clinical outcomes since resistance can emerge from drug-tolerant populations [5]. Therefore, identification and quantification of cellular-level heterogeneity are important for the study of tissue pathologies and therapies. However, the assessment of cellular heterogeneity remains challenging. Traditional biochemical assays such as western blot [6], mRNA analysis [7], and oxygen consumption assays (e.g., the Seahorse assay [8]) typically require the pooling of substrates from thousands of cells and do not provide single-cell information. Alternatively, single-cell assessment technologies such as flow cytometry and single-cell RNA sequencing require a homogenized cell suspension, which destroys the spatial integrity of the sample. Traditional biochemistry techniques also often require cell permeabilization for labeling with exogenous contrast agents, which limits in vivo and dynamic or time-course studies.

Fluorescence microscopy can be used for cell heterogeneity analysis if the images are segmented and analyzed at a single-cell level. Fluorescence imaging of the endogenous metabolic cofactors reduced nicotinamide adenine dinucleotide (NAD(P)H) and flavin adenine dinucleotide (FAD) enables nondestructive evaluation of cellular metabolism [9–11]. Single-cell segmentation of fluorescence images has been used to assess immune cell heterogeneity [12–14], cancer heterogeneity [15–18], cellular heterogeneity in response to treatment [19–22], and spatial intratumoral heterogeneity [23,24]. Heterogeneity analysis of fluorescence images requires instance segmentation of the image into individual cells. Multiple solutions for instance segmentation of fluorescence images exist. These algorithms are often tailored to a specific image dataset and include a series of traditional image processing steps, such as intensity-based thresholding and watershed [25–27], or use machine learning techniques, such as convolutional neural networks, to classify pixels for image segmentation [28–31]. Many software packages are available for fluorescence image segmentation, including ImageJ, CellProfiler, Ilastik, and Imaris [32,33].

Due to the large number of available image segmentation tools, it is important to robustly evaluate segmentation results. A variety of methods can be used to evaluate image segmentation. Traditional subjective methods use human evaluators to provide qualitative assessment scores of the segmentation results, but subjective evaluation lacks consistency and is time-consuming [34]. Objective methods quantify the segmentation results relative to a ground truth segmented image [35]. Evaluation metrics such as Precision and Dice Score are based on true positive (TP), false positive (FP), true negative (TN), and false negative (FN) classifications of pixels in the segmented image determined relative to the ground truth image [36]. While the assignment of pixels is relatively simple for semantic segmentation, where pixels are typically assigned to one of two classes, object or background, pixel-level assessment is not accurate for instance segmentation, where a pixel must not only be correctly labeled as object but also attributed to the correct object. Often, fluorescence signals within cells are localized to compartments such as the cytoplasm, and downstream analysis of cell heterogeneity requires not only high accuracy of cell identification but also accurate pixel assignments to ensure the data are not confounded by background pixels.

Metrics can also evaluate instance segmentation performance. Several object-based evaluation algorithms have been demonstrated to evaluate over-segmentation in multispectral satellite images, which present multiple object classes of generally disparate objects [37,38]. Additionally, the COCO metrics are commonly used evaluation metrics for instance segmentation and include 12 metrics in four categories: Average Precision (AP), AP across scales, Average Recall (AR), and AR across scales [39]. However, these evaluation techniques provide object-level metrics and are not directly applicable to evaluating the accuracy of both object detection and within-object pixel assignment. A robust algorithm for the accurate evaluation of instance cell segmentation results at both an object and pixel level is still lacking for images with a large number of adjacent objects.

In this paper, we define and demonstrate the performance of a supervised per-object segmentation evaluation algorithm (POSEA). Traditionally, segmented images are often compared at a pixel level without accounting for object-specificity [34]; however, a standard pixel-level approach fails to compute errors due to pixels correctly attributed to the object class but incorrectly assigned to an adjacent object. POSEA addresses this inaccuracy problem by extracting each object from the ground truth segmented image, matching it with the colocalized segmented object, and assigning each pixel within the segmented image as true positive (TP), true negative (TN), false positive (FP), or false negative (FN) for computation of traditional performance metrics such as Precision, Recall, and F-measure. POSEA was tested on simulated binary and grayscale images and was used to evaluate the performance of CellProfiler segmentation results of autofluorescence images of three different cell samples of varying segmentation complexity. Based on the results, POSEA is advantageous for the supervised evaluation of segmented images with a large number of unique, adjacent objects.

Materials and methods

POSEA

Traditional pixel-level evaluation.

The POSEA code computes both traditional pixel-level accuracy metrics [34] and per-object accuracy metrics for the comparison of two segmented images (Fig 1). The first step for POSEA is to load two segmented images, a “test” image and a “ground-truth” image, into the algorithm to compare. The ground truth image contains segmented objects with four unique (non-zero) intensity values, and background pixels have a value of 0. The test image contains segmented objects with sequential intensity values from 1 to n, the number of objects in the test image. POSEA computes traditional pixel-level outputs by comparing binary masks of both input images. The masks of the two input images are obtained by changing the intensity of all pixels above zero to one. Then, each pixel is assigned as TP, FP, or FN based on a comparison of the two images. TP pixels are 1 in both images. FP pixels are 1 in the test image and 0 in the ground truth image. FN pixels are 1 in the ground truth image and 0 in the test image.
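The pixel-level step described above can be sketched in a few lines of NumPy; the function name and array handling here are illustrative, not the published POSEA implementation:

```python
import numpy as np

def pixel_level_counts(ground_truth, test):
    """Count TP, FP, and FN pixels between two segmented images.

    Any non-zero intensity is treated as object (mask value 1);
    zero-valued pixels are background, matching the POSEA convention.
    """
    gt_mask = np.asarray(ground_truth) > 0
    test_mask = np.asarray(test) > 0
    tp = int(np.sum(gt_mask & test_mask))    # object in both images
    fp = int(np.sum(~gt_mask & test_mask))   # object only in the test image
    fn = int(np.sum(gt_mask & ~test_mask))   # object only in the ground truth
    return tp, fp, fn
```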

Fig 1. Flow chart of the steps in POSEA to assign pixels as TP, FP, or FN.

https://doi.org/10.1371/journal.pone.0283692.g001

Object level evaluation.

Next, POSEA assigns pixels as TP, FP, or FN at the object level (Fig 1). For robustness, and so that prior knowledge of the object values of the ground truth image is not required, the POSEA algorithm finds the specific intensity values of the objects within the ground truth image by calculating the most frequent non-zero values in the array. Then, an image is created for each unique intensity value within the ground truth image. A minimum of four intensity values ensures the separation of adjacent cells. Each object within the ground truth image is then assigned a unique intensity value, 1 to x’, where x’ is the number of objects in the ground truth image. Similarly, each object in the test image has a unique intensity value from 1 to n’, where n’ is the number of objects in the test image. To calculate pixel assignments, each of the ground truth objects is iteratively evaluated. A loop from 1 to x’ is used to extract each object from the ground truth image, and an image mask is generated with the object pixels retaining a value of 1 and the rest of the image set to the background value (0). Next, the object image is multiplied by the test image, and the most frequent pixel intensity in the resulting image is considered to be the matched object from the test image. Once the same object is identified in both the ground truth and test images, the number of TP pixels is calculated. The FP pixels per object are the number of pixels of the object in the test image minus the number of TP pixels of the object. The FN pixels per object are the number of pixels of the object in the ground truth image minus the number of TP pixels. To calculate the object-level accuracy metrics for the entire image, the sum of the TP pixels of each cell is calculated. The FP value is the number of pixels of all objects in the test image minus the total TP value. The FN value is the number of pixels of all objects in the ground truth image minus the total TP value.
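A minimal sketch of this per-object loop, assuming label images where 0 is background; the matching rule (the most frequent non-zero test label under each ground-truth object) follows the description above, but the function and variable names are illustrative:

```python
import numpy as np

def per_object_counts(ground_truth, test):
    """Per-object TP/FP/FN counts using POSEA-style matching: each
    ground-truth object is paired with the test object whose label is
    most frequent within the ground-truth object's footprint."""
    gt = np.asarray(ground_truth)
    te = np.asarray(test)
    counts = {}
    for label in np.unique(gt):
        if label == 0:                      # 0 is background by convention
            continue
        obj_mask = gt == label
        overlap = te[obj_mask]              # test labels under this object
        overlap = overlap[overlap > 0]
        if overlap.size == 0:               # object missed entirely: all FN
            counts[int(label)] = (0, 0, int(obj_mask.sum()))
            continue
        matched = int(np.bincount(overlap).argmax())
        tp = int(np.sum(obj_mask & (te == matched)))
        fp = int(np.sum(te == matched)) - tp  # matched object's pixels outside the GT object
        fn = int(obj_mask.sum()) - tp         # GT object's pixels the test missed
        counts[int(label)] = (tp, fp, fn)
    return counts
```

Summing the per-object counts over all objects yields the image-level object-based TP, FP, and FN values described above.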

Calculations.

POSEA calculates Precision (P), Recall (R), and F-measure (F) metrics by both the traditional pixel-level method and by the per-object method. Perfect segmentation results in a value of 1 for Precision, Recall, and F-measure. A value of 0 means no pixels are correctly segmented.
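These are the standard definitions, Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F-measure = 2PR/(P+R). A small helper implementing them, with zero denominators mapped to 0, might look like the following (names are illustrative):

```python
def metrics(tp, fp, fn):
    """Precision, Recall, and F-measure from pixel counts.
    A metric is defined as 0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure
```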

How to use POSEA

POSEA is implemented in Python 3.7.7. The POSEA code, pseudocode, and example images are posted on GitHub (https://github.com/walshlab/POSEA). POSEA requires two input images, first a ground truth image and second a test image. The background of the ground truth image should be zero. The input images should be grayscale, with adjacent segmented objects assigned unique integer values. The first outputs of POSEA are the unique intensity levels of the ground truth image. The next outputs are F-measure, Precision, and Recall for a traditional pixel-level evaluation. Then, the final outputs are F-measure, Precision, and Recall metrics for the per-object assessment, computed at the image level. These three metrics are also calculated for each cell and saved in a CSV document in the Python working directory.

Creation and testing of simulated images

Validation with binary images.

First, POSEA was tested on images with perfectly overlapping objects and objects with no overlap. Two binary 256*256 pixel images were created; one with all pixels having a value of 1, and one image with all pixels having a value of 0. These two images were each assigned as ground truth or test and evaluated by POSEA against itself or the other image for four comparisons.

Next, four half-black (pixel values = 0), half-white (pixel values = 1) binary 256*256 pixel images were created (Fig 2). Iteratively, each image was considered to be the ground truth and evaluated against itself and the three additional images. Therefore, sixteen comparisons were evaluated by POSEA.
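These test images are straightforward to reproduce. The sketch below assumes Image 1 is divided horizontally and Image 2 vertically (the orientation assignment is illustrative) and checks the half-overlap case at the pixel level:

```python
import numpy as np

size = 256
# Image 1: top half white (1), bottom half black (0); Image 3 is its inverse.
img1 = np.zeros((size, size), dtype=np.uint8)
img1[: size // 2, :] = 1
img3 = 1 - img1
# Image 2: left half white; Image 4 is its inverse.
img2 = np.zeros((size, size), dtype=np.uint8)
img2[:, : size // 2] = 1
img4 = 1 - img2

# Pixel-level Precision of Image 2 evaluated against Image 1 as ground truth:
tp = int(np.sum((img1 == 1) & (img2 == 1)))  # overlapping top-left quarter
fp = int(np.sum((img1 == 0) & (img2 == 1)))  # bottom-left quarter
print(tp / (tp + fp))  # 0.5: half of the test object's pixels overlap
```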

Fig 2. Four half-white (pixel values = 1) and half-black (pixel values = 0) 256*256 pixel images evaluated by POSEA.

For each image, the white area was considered the segmented region. Each image was evaluated against the other image, including its replicate.

https://doi.org/10.1371/journal.pone.0283692.g002

Simulated adjacent cell images

Two 256*256 pixel grayscale images of two adjacent circular objects of equal size but different intensities were made using OpenCV [40]. The intensities of the two circles are 100 and 200 (Fig 3). POSEA was used to evaluate each image as the ground truth against itself and the other image, respectively. The Precision metric was computed and compared for POSEA and the traditional pixel-level method.
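The paper does not specify the exact geometry, so the sketch below is illustrative. It draws the two circles with NumPy (rather than OpenCV, for self-containment) and uses a movable dividing line so that a second image misassigns boundary-region pixels to the adjacent object while the overall outline stays identical, which is exactly the case the traditional pixel-level method cannot detect:

```python
import numpy as np

def two_circles(size=256, r=40, shift=0):
    """Two horizontally adjacent, touching circles with intensities 100 and 200.

    `shift` moves the vertical line dividing the two labels, so a shifted
    image assigns a strip of the second object's pixels to the first
    without changing the union of the two objects. Radii and centers
    are illustrative; the paper does not specify them.
    """
    yy, xx = np.mgrid[:size, :size]
    c1x, c2x = size // 2 - r, size // 2 + r   # circle centers on one row
    in1 = (yy - size // 2) ** 2 + (xx - c1x) ** 2 <= r ** 2
    in2 = (yy - size // 2) ** 2 + (xx - c2x) ** 2 <= r ** 2
    img = np.zeros((size, size), dtype=np.uint8)
    img[in1 | in2] = 100                                  # object 1 by default
    img[(in1 | in2) & (xx >= size // 2 + shift)] = 200    # object 2 right of divide
    return img

img5 = two_circles()          # boundary exactly between the circles
img6 = two_circles(shift=10)  # strip of object 2 relabeled as object 1
```

Both images have the same set of object pixels, so a traditional pixel-level comparison scores Precision = 1, while a per-object comparison penalizes the relabeled strip.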

Fig 3. Simulated grayscale images of adjacent objects created to compare the evaluation performance of POSEA and traditional pixel-level analysis for images with pixels misassigned to an adjacent object.

https://doi.org/10.1371/journal.pone.0283692.g003

POSEA evaluation of segmented fluorescence images

Fluorescence images of T cells.

A previously published [12] dataset of fluorescence images of T cells and paired CellProfiler segmentation results were provided by Drs. Walsh and Skala. The dataset includes about 200 autofluorescence images and matched CellProfiler segmented images of two T cell populations, quiescent T cells and activated T cells. The fluorescence intensity and CellProfiler segmented images are 32-bit grayscale, 256x256 pixel images. Five images of quiescent T cells and five images of activated T cells were selected for manual segmentation of the fluorescence image to create a ground truth image. The images were selected at random to reduce selection bias. The selected quiescent T cell images contained 150–160 cells or unique objects per image while the activated T cell images contained 32–60 cells per image. Quiescent T cells are uniformly sized, round, and separable while activated T cells clump together and are heterogeneous in size, which is more challenging for automated segmentation.

Fluorescence images of MCF7 cells.

MCF7 breast cancer cells were cultured in high glucose Dulbecco’s Modified Eagle’s Medium (DMEM), supplemented with 1% penicillin-streptomycin and 10% fetal bovine serum. Cells were plated at a density of 4 x 10^5 cells per 35 mm glass-bottom imaging dish (MatTek) 48 hours before imaging. The culture media in the dish was refreshed before imaging.

NAD(P)H fluorescence images were acquired on a customized multiphoton fluorescence microscope (Marianas, 3i) with a 40X water-immersion objective (1.1 NA). A stage-top incubator (Okolab) was used during imaging to maintain a physiological environment (37°C, 5% CO2, 85% relative humidity). NAD(P)H fluorescence was excited by a titanium:sapphire laser (Coherent Chameleon) at 750 nm with 18 mW to 20 mW average laser power at the sample. NAD(P)H emission was detected by a photomultiplier tube (Hamamatsu) coupled with a 550/88 nm bandpass filter. The total time to collect a 256 x 256-pixel fluorescence lifetime image was around 60 seconds, with a pixel dwell time of 50 μs and 5 repeats. Fluorescence lifetime images were integrated across time to generate fluorescence intensity images.

Five NAD(P)H intensity images of MCF7 cells were randomly selected. The instance segmentation of the cell cytoplasm was generated using a published CellProfiler pipeline [25]. The pipeline first segments cell objects to define the boundaries between cells or cell clumps and background [25]. Then, nuclei regions are identified due to the intensity differences between the nuclei and cytoplasm [25]. Finally, individual cells are identified by propagating out from the nuclei to terminate at either a cell-background boundary or another propagating cell [25]. This process requires the optimization of CellProfiler parameters to match the nucleus size and cell size. Nuclei were identified for diameters of 5 to 25 pixels, a range that encompasses the typical diameters of MCF7 nuclei. Additionally, a threshold correction factor of 0.8 (a value less than 1 alters the threshold for improved cell boundary identification) and a 30-pixel adaptive window (selected to match the average MCF7 cell size) optimized the performance of the Otsu threshold algorithm for identification of MCF7 cells. Cytoplasm regions were identified by removing nucleus objects from cell objects. Finally, to remove clumped cells and noise, all identified objects were filtered based on cytoplasm area, which ranged between 100 and 500 pixels for MCF7 cells.

Segmentation of ground truth images.

A ground truth segmented image was created by hand-segmentation of the NAD(P)H intensity images of the cells in ImageJ [41]. Cells were highlighted using the “paintbrush” tool for segmentation. A brush width of 1–5 pixels was chosen based on the need for accurate hand segmentation. Four different intensities were used for the hand segmentation because the four-color map theorem states no more than four colors are required so that no two adjacent regions have the same color [42]. The four intensity values of the segmented ground truth image do not need to be specific values since POSEA automatically recognizes unique intensity values in the image.
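The four-intensity labeling can also be produced programmatically from an arbitrary label image. The sketch below is a hypothetical illustration (not part of POSEA) that greedily recolors objects so no two 4-adjacent objects share an intensity; greedy coloring is not guaranteed to stay within four colors for every planar map, but it does for typical cell layouts:

```python
import numpy as np

def recolor_labels(label_img, max_colors=4):
    """Greedily reassign object labels so that no two 4-adjacent objects
    share an intensity, illustrating the four-color relabeling idea."""
    lab = np.asarray(label_img)
    # Build adjacency between labels from horizontal and vertical neighbors.
    adj = {}
    pairs = np.concatenate([
        np.stack([lab[:, :-1].ravel(), lab[:, 1:].ravel()], axis=1),
        np.stack([lab[:-1, :].ravel(), lab[1:, :].ravel()], axis=1),
    ])
    for a, b in pairs:
        if a and b and a != b:
            adj.setdefault(int(a), set()).add(int(b))
            adj.setdefault(int(b), set()).add(int(a))
    colors = {}
    for label in sorted(int(v) for v in np.unique(lab) if v):
        used = {colors[n] for n in adj.get(label, ()) if n in colors}
        # Raises StopIteration if the greedy order needs more than max_colors.
        colors[label] = next(c for c in range(1, max_colors + 1) if c not in used)
    out = np.zeros_like(lab)
    for label, c in colors.items():
        out[lab == label] = c
    return out
```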

POSEA evaluation of fluorescence images.

POSEA was used to compare the CellProfiler segmented images as test images with the corresponding hand-segmented images as ground truth images. The Precision, Recall, and F-measure output values were recorded. The per-cell data were obtained from the CSV document saved in the Python working directory.

Comparison of POSEA and traditional method.

The evaluation metrics F-measure, Precision, and Recall computed by POSEA and the traditional pixel-level method were compared for segmented fluorescence images using RStudio [43]. Similarly, RStudio was used to build the violin plot of the POSEA per-cell data.

POSEA evaluation of vehicle images.

POSEA was tested on 5 randomly selected synthetic images of cars from a public dataset, ‘Vehicle Rear Side View Synthetic Data Set’ (https://www.kaggle.com/datasets/saratrajput/vehicle-rear-side-view-synthetic-data-set). The dataset contains 5000 8-bit RGB images and corresponding depth images, instance segmentation images, and class segmentation images. The size of each image is 1024*768 pixels. For each selected image, POSEA compared the class segmentation image, as the ground truth image, with the instance segmentation image, converted to grayscale, as the test image.

Results

POSEA performance on simulated images

Simulated binary images of completely matching or no matching pixels were evaluated by POSEA to verify the range of the outputs of the algorithm. Images with perfectly overlapping segmented (non-0 intensity) pixel values have an output Precision value of 1 whereas completely different pixel values have an output Precision value of 0. The comparison of an all-black image against an all-black image returns a Precision value of 0, due to POSEA’s assumption that background (non-segmented) pixels have a value of 0.

Then, four half-black, half-white binary images were evaluated by POSEA (Table 1). In these images, the white portion represents a segmented object. The Precision metric is 0 if there is no overlap of objects between the ground truth and test images, 0.5 if half of the pixels overlap between the ground truth and test objects, and 1 if the objects completely overlap in both the ground truth and test images.

Table 1. Precision values output from POSEA evaluation of simulated binary images (Fig 2) that are half-white and half-black either horizontally divided (Image 1, Image 3) or vertically divided (Image 2, Image 4).

https://doi.org/10.1371/journal.pone.0283692.t001

Two simulated grayscale images of adjacent objects (Fig 3) were evaluated by POSEA (Tables 2 and 3). The Precision value for the comparison of these two images by the traditional pixel-based method is 1. POSEA calculates values of 1 for Precision, Recall, and F-measure for each object in Image 5 and Image 6 when the same image is used as both the ground truth and test image. When Image 5 is evaluated with Image 6 as the ground truth image, the Precision, Recall, and F-measure values for Object 1 are 0.65, 1, and 0.79, respectively. For Object 2, the Precision, Recall, and F-measure values are 1, 0.75, and 0.85, respectively. When Image 6 is evaluated with Image 5 as the ground truth image, the Precision, Recall, and F-measure values for Object 1 are 1, 0.65, and 0.79, respectively. For Object 2, the Precision, Recall, and F-measure values are 0.75, 1, and 0.85, respectively.

Table 2. Precision values output from the traditional pixel-level evaluation method of the simulated grayscale images of adjacent cells (Fig 3).

https://doi.org/10.1371/journal.pone.0283692.t002

Table 3. Precision, Recall, F-measure values output from POSEA per cell evaluation of the simulated grayscale images of adjacent cells (Fig 3).

https://doi.org/10.1371/journal.pone.0283692.t003

Evaluation of segmented autofluorescence images.

Within cells, NAD(P)H is primarily localized to the cytosol and mitochondria. Therefore, cells in autofluorescence images exhibit high intensity in the cytosol while the nucleus remains dim (Fig 4). Representative fluorescence and segmentation images of quiescent T cells, activated T cells, and MCF7 cells show the differences in cell shape and clustering among the groups (Fig 4).

Fig 4. Representative fluorescence (first column) images, ground truth segmented images (second column), and CellProfiler segmented images (third column) of quiescent T cells (first row), activated T cells (second row), and MCF7 cells (third row).

Quiescent and activated T cells are segmented into individual cells whereas MCF7 cells are segmented into the cytoplasm.

https://doi.org/10.1371/journal.pone.0283692.g004

POSEA outputs for image-level F-measure, Precision, and Recall metrics were compared with traditional pixel-level metrics (Fig 5). F-measure, Precision, and Recall values computed by the traditional pixel-level method are higher than the corresponding metrics computed by POSEA for each cell type (Fig 5). The POSEA output metrics of F-measure, Precision, and Recall are greatest for the segmentation results of quiescent T cells and lowest for the MCF7 cells. For the traditional pixel-level evaluation method, the F-measure, Precision, and Recall values of activated T cells are slightly greater than the metrics of quiescent T cells. The MCF7 cells have the lowest evaluation metrics by the traditional pixel-level analysis; however, the F-measure, Precision, and Recall values are still higher than 0.6.

Fig 5. F-measure, Precision, and Recall values calculated by POSEA and the traditional pixel-level method (TM) to compare CellProfiler segmentation results with hand-segmented, ground truth images for quiescent T cells (A), activated T cells (B), and MCF7 cells (C). Each colored triangle is the value for an image (n = 5 images per group). The black circle and lines represent the mean and standard deviation of the 5 data points in each group.

https://doi.org/10.1371/journal.pone.0283692.g005

Evaluation results per cell using POSEA

POSEA evaluation metrics for each object within the activated T cell images were analyzed to compare the CellProfiler segmented image with the hand-segmented ground truth image (Fig 6). At the object level for activated T cells, Precision has the highest mean value and Recall the lowest. The distribution of each evaluation metric is skewed, with a tail of low values.

Fig 6. Violin plots and boxplots showing the distribution of the F-measure, Precision, and Recall values calculated by POSEA for each segmented object within the activated T cell images (n = 225 cells from 5 images).

https://doi.org/10.1371/journal.pone.0283692.g006

POSEA time consumption

The time required for POSEA to evaluate each image was measured (Table 4) using a desktop with an Intel i9-9900KF CPU and an NVIDIA GeForce RTX 3080 GPU.

Table 4. Average (n = 5 images) time consumption of POSEA and the traditional pixel-level method to compare ground truth and segmented images of Quiescent T cells, Activated T cells, and MCF7 cells.

The average number of cells per image is 154 for Quiescent T cells, 45 for Activated T cells, and 44 for MCF7 cells.

https://doi.org/10.1371/journal.pone.0283692.t004

Evaluation results of the vehicle dataset using POSEA

Average (n = 5 images; 15 objects) F-measure, Recall, and Precision values calculated by POSEA are 0.9691, 1, and 0.9843, respectively. The traditional pixel-level method matches POSEA in image-level accuracy because no pixels are misassigned to adjacent objects. For the object-level analysis, the average (standard deviation) F-measure, Recall, and Precision values are 0.9860 (0.009744), 1 (0), and 0.9725 (0.018951), respectively.

Discussion

Here, an object-based supervised evaluation algorithm (POSEA) is demonstrated for accurate assessment of segmented images with a large number of unique, adjacent objects. POSEA was tested on simulated images of increasing segmentation complexity to demonstrate the accuracy and performance of POSEA. Using autofluorescence images of quiescent T cells, activated T cells, and breast cancer cells, segmented by an automated CellProfiler pipeline, the differences between per-object evaluation and a traditional pixel-level evaluation method were demonstrated. Finally, the unique ability of POSEA to calculate segmentation performance at an object level was shown for autofluorescence images of activated T cells.

POSEA was rigorously tested on simulated images of increasing complexity to define the performance of the algorithm. POSEA Precision outputs for completely matching non-zero pixel intensities or completely mismatching pixel values are the expected values of 1 or 0, respectively. Due to the POSEA assumption that pixel values of 0 are background and do not belong to a segmented object, the output Precision value of a black (intensity value = 0) image evaluated against itself is 0. Likewise, POSEA calculates the expected output Precision values for half-white and half-black images: when 50% of pixels match, the Precision value is 0.5 (Table 1). Finally, images of two adjacent objects, with the same outer perimeter for the object cluster but different individual objects (Fig 3), were simulated to directly compare the performance of POSEA with the traditional pixel-level method. The traditional pixel-level method resulted in inaccurate Precision values of 1 (Table 2). In contrast, POSEA calculated accurate Precision, Recall, and F-measure metrics for both objects, accounting for the pixels incorrectly assigned to the adjacent object. Although POSEA sometimes outputs Precision or Recall values of 1 when non-identical objects are compared, this is a consequence of the formulas for Precision, which only includes TP and FP, and Recall, which only includes TP and FN. Altogether, the simulated image experiments define the range of output values for POSEA and demonstrate the advantage of POSEA for computing segmentation accuracy of adjacent objects. Since the accuracy metrics are calculated by matching each object in the ground truth image with the corresponding object in the segmentation results, the performance of POSEA is insensitive to the number of objects, object size, and spatial distribution.

POSEA-computed Precision, Recall, and F-measure values are more accurate than traditional pixel-level evaluation for fluorescence images of three different cell types: quiescent T cells, activated T cells, and MCF7 breast cancer cells. The POSEA Recall, Precision, and F-measure metrics are lower than the metrics computed by traditional mask assessment due to the inclusion of per-object accuracy by POSEA (Table 2). Lower metrics are expected for POSEA due to pixels incorrectly assigned to adjacent objects, as demonstrated with the simulated objects in Images 5 and 6 (Fig 3, Tables 2 and 3). POSEA, but not the traditional pixel-level method, computed higher values for Precision, Recall, and F-measure for the CellProfiler segmentation of quiescent T cells than activated T cells (Fig 5), as expected since the quiescent T cells, with their isolated and round shapes, are easier to segment automatically (Fig 4). POSEA is not limited to specific output metrics but can calculate multiple metrics based on TP, FP, and FN, which are the basic quantities for evaluation algorithms [44]. Recall, Precision, and F-measure metrics were chosen since they are sensitive to under-segmentation and over-segmentation [45]. The POSEA results for the instance and class segmented automobile images demonstrate its robustness and transferability across image types and datasets.

POSEA uses ground truth objects to match segmentation results, which allows the computation of the per-object evaluation metrics. Iteration by object allows the calculation of pixel classifications as TP, FP, and FN for each cell. These per-object metrics allow the identification of poorly segmented objects. Therefore, the per-object metrics are unique to POSEA and provide an advantage over other evaluation methods that output metrics only at the image level [26,28,46]. For example, as shown in Fig 6, activated T cells segmented by CellProfiler generally have high Precision values (>0.8), yet a number of cells have low Precision values. Objects with low Precision values have a high FP rate which suggests under-segmentation. To visualize and further investigate why those cells have low Precision values, a Precision threshold could be set to display the locations of the cells that were poorly segmented and visual inspection could provide information on why the segmentation protocol failed for those particular cells. Interpretation of the POSEA per-object evaluation metrics can be used to inform strategies to improve segmentation.

POSEA is a robust and easily implemented tool for segmentation evaluation. However, POSEA also has some limitations. First, POSEA requires grayscale images in which the intensity values correspond to unique objects. Next, as a supervised evaluation method, POSEA requires a reference or ground truth image. Hand-segmentation to generate a ground truth image can be time-consuming for a large number of images or a large number of objects [35]. POSEA is slower than the traditional method (Table 4) because it iteratively computes the accuracy of each object. The quiescent T cell images required a longer time than the activated T cell and cancer cell images because of the greater number of cells in the quiescent T cell images. Currently, POSEA can only analyze one object class at a time, for example, the cell mask or cytoplasm. If the segmentation output includes multiple object classes, POSEA must evaluate them separately.

In summary, POSEA provides accuracy metrics combining object- and pixel-level assessments of segmented images with a large number of unique, adjacent objects. Therefore, POSEA is a useful tool for the optimization of instance segmentation methods for applications that require high pixel-level performance for the segmentation of adjacent objects, as is necessary for cell segmentation within microscopy images. Evaluation of segmented microscopy images demonstrates that POSEA is more accurate than traditional pixel-level evaluation for images with a large number of adjacent objects. Moreover, POSEA provides segmentation accuracy metrics for each object, enabling the identification of poorly segmented objects in an image. POSEA is not limited in application to microscopy images of cells but can be applied to evaluate any pair of segmented images to compare segmentation methods and optimize automated segmentation techniques.

References

  1. Cha J. and Lee I., Single-cell network biology for resolving cellular heterogeneity in human diseases. Experimental & Molecular Medicine, 2020. 52(11): p. 1798–1808.
  2. Deng C.-C., et al., Single-cell RNA-seq reveals fibroblast heterogeneity and increased mesenchymal fibroblasts in human fibrotic skin diseases. Nature Communications, 2021. 12(1): p. 1–16.
  3. Gieselmann V., What can cell biology tell us about heterogeneity in lysosomal storage diseases? Acta Paediatrica, 2005. 94: p. 80–86.
  4. Marusyk A. and Polyak K., Tumor heterogeneity: causes and consequences. Biochimica et Biophysica Acta (BBA)-Reviews on Cancer, 2010. 1805(1): p. 105–117.
  5. Dagogo-Jack I. and Shaw A.T., Tumour heterogeneity and resistance to cancer therapies. Nature Reviews Clinical Oncology, 2018. 15(2): p. 81–94.
  6. Mahmood T. and Yang P.-C., Western blot: technique, theory, and trouble shooting. North American Journal of Medical Sciences, 2012. 4(9): p. 429.
  7. Kozak M., An analysis of vertebrate mRNA sequences: intimations of translational control. The Journal of Cell Biology, 1991. 115(4): p. 887–903.
  8. Van der Windt G.J., Chang C.H., and Pearce E.L., Measuring bioenergetics in T cells using a Seahorse extracellular flux analyzer. Current Protocols in Immunology, 2016. 113(1): p. 3.16B.1–3.16B.14.
  9. Kolenc O.I. and Quinn K.P., Evaluating cell metabolism through autofluorescence imaging of NAD(P)H and FAD. Antioxidants & Redox Signaling, 2019. 30(6): p. 875–889. pmid:29268621
  10. Chance B., et al., Oxidation-reduction ratio studies of mitochondria in freeze-trapped samples. NADH and flavoprotein fluorescence signals. J Biol Chem, 1979. 254(11): p. 4764–71.
  11. Georgakoudi I. and Quinn K.P., Optical imaging using endogenous contrast to assess metabolic state. Annu Rev Biomed Eng, 2012. 14: p. 351–67. pmid:22607264
  12. Walsh A.J., et al., Classification of T-cell activation via autofluorescence lifetime imaging. Nat Biomed Eng, 2020. pmid:32719514
  13. Alfonso-Garcia A., et al., Label-free identification of macrophage phenotype by fluorescence lifetime imaging microscopy. J Biomed Opt, 2016. 21(4): p. 46005.
  14. Heaster T.M., et al., Autofluorescence imaging of 3D tumor-macrophage microscale cultures resolves spatial and temporal dynamics of macrophage metabolism. 2020: p. 2020.03.12.989301. pmid:33093167
  15. Walsh A.J. and Skala M.C., Optical metabolic imaging quantifies heterogeneous cell populations. Biomed Opt Express, 2015. 6(2): p. 559–73.
  16. Sharick J.T., et al., Cellular Metabolic Heterogeneity In Vivo Is Recapitulated in Tumor Organoids. Neoplasia, 2019. 21(6): p. 615–626.
  17. Wallrabe H., et al., Segmented cell analyses to measure redox states of autofluorescent NAD(P)H, FAD & Trp in cancer cells by FLIM. Scientific Reports, 2018. 8(1): p. 1–11.
  18. Cardona E.N. and Walsh A.J., Identification of rare cell populations in autofluorescence lifetime image data. Cytometry A, 2022.
  19. Shah A.T., et al., In vivo autofluorescence imaging of tumor heterogeneity in response to treatment. Neoplasia, 2015. 17(12): p. 862–870.
  20. Walsh A.J., et al., Optical Imaging of Drug-Induced Metabolism Changes in Murine and Human Pancreatic Cancer Organoids Reveals Heterogeneous Drug Response. Pancreas, 2016. 45(6): p. 863–9.
  21. Walsh A.J., et al., Quantitative optical imaging of primary tumor organoid metabolism predicts drug response in breast cancer. Cancer Res, 2014. 74(18): p. 5184–94.
  22. Walsh A.J., et al., Optical metabolic imaging identifies glycolytic levels, subtypes, and early-treatment response in breast cancer. Cancer Res, 2013. 73(20): p. 6164–74.
  23. Spagnolo D.M., et al., Platform for quantitative evaluation of spatial intratumoral heterogeneity in multiplexed fluorescence images. Cancer Research, 2017. 77(21): p. e71–e74.
  24. Heaster T.M., Landman B.A., and Skala M.C., Quantitative Spatial Analysis of Metabolic Heterogeneity Across in vivo and in vitro Tumor Models. Front Oncol, 2019. 9: p. 1144.
  25. Walsh A.J. and Skala M.C., An automated image processing routine for segmentation of cell cytoplasms in high-resolution autofluorescence images. Multiphoton Microscopy in the Biomedical Sciences XIV, 2014. 8948.
  26. Gamarra M., et al., Split and merge watershed: A two-step method for cell segmentation in fluorescence microscopy images. Biomedical Signal Processing and Control, 2019. 53: p. 101575. pmid:33719364
  27. Salvi M., et al., Automated segmentation of fluorescence microscopy images for 3D cell detection in human-derived cardiospheres. Scientific Reports, 2019. 9(1): p. 1–11.
  28. Raza S.E.A., et al., MIMO-Net: A multi-input multi-output convolutional neural network for cell segmentation in fluorescence microscopy images, in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). 2017. IEEE.
  29. Hu Z., et al., Automated segmentation of geographic atrophy in fundus autofluorescence images using supervised pixel classification. Journal of Medical Imaging, 2015. 2(1): p. 014501.
  30. Aydin A.S., et al., CNN Based Yeast Cell Segmentation in Multi-modal Fluorescent Microscopy Data, in CVPR Workshops. 2017.
  31. Al-Kofahi Y., et al., A deep learning-based algorithm for 2-D cell segmentation in microscopy images. BMC Bioinformatics, 2018. 19(1): p. 1–11.
  32. Carpenter A.E., et al., CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology, 2006. 7(10): p. 1–11.
  33. Sommer C., et al., Ilastik: Interactive learning and segmentation toolkit, in Eighth IEEE International Symposium on Biomedical Imaging (ISBI). 2011. Proceedings.
  34. Anbeek P., et al., Automatic segmentation of eight tissue classes in neonatal brain MRI. PLoS ONE, 2013. 8(12): p. e81895.
  35. Wang Z., Wang E., and Zhu Y., Image segmentation evaluation: a survey of methods. Artificial Intelligence Review, 2020. 53(8): p. 5637–5674.
  36. Chang H.-H., et al., Performance measure characterization for evaluating neuroimage segmentation algorithms. NeuroImage, 2009. 47(1): p. 122–135. pmid:19345740
  37. Zhang X., et al., Segmentation quality evaluation using region-based precision and recall measures for remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 2015. 102: p. 73–84.
  38. Su T. and Zhang S., Local and global evaluation for remote sensing image segmentation. ISPRS Journal of Photogrammetry and Remote Sensing, 2017. 130: p. 256–276.
  39. Chen X., et al., Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  40. Bradski G. and Kaehler A., Learning OpenCV: Computer vision with the OpenCV library. 2008: O’Reilly Media, Inc.
  41. Abràmoff M.D., Magalhães P.J., and Ram S.J., Image processing with ImageJ. Biophotonics International, 2004. 11(7): p. 36–42.
  42. Fritsch R., et al., Four-Color Theorem. 1998: Springer.
  43. Allaire J., RStudio: integrated development environment for R. Boston, MA, 2012. 770(394): p. 165–171.
  44. Smochină C., Image processing techniques and segmentation evaluation. Technical University "Gheorghe Asachi", Doctoral School of the Faculty of Automatic Control and Computer Engineering, 2011.
  45. Van Rijsbergen C.J., Information Retrieval. 2nd ed. 1979, Newton, MA, USA: Butterworth-Heinemann.
  46. Jiang K., Liao Q.-M., and Dai S.-Y., A novel white blood cell segmentation scheme using scale-space filtering and watershed clustering, in Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693). 2003. IEEE.