Kaizen: Decomposing cellular images with VQ-VAE

Abstract

A fundamental problem in cell and tissue biology is finding cells in microscopy images. Traditionally, this detection has been performed by segmenting the pixel intensities. However, these methods struggle to delineate cells in more densely packed micrographs, where local decisions about boundaries are not trivial. Here, we develop a new methodology to decompose microscopy images into individual cells by making object-level decisions. We formulate the segmentation problem as training a flexible factorized representation of the image. To this end, we introduce Kaizen, an approach inspired by predictive coding in the brain that maintains an internal representation of an image while generating object hypotheses over the external image, keeping only the hypotheses that improve the consistency of the internal and external representations. We achieve this by training a Vector Quantised-Variational AutoEncoder (VQ-VAE). During inference, the VQ-VAE is iteratively applied at locations where the internal representation differs from the external image, making new guesses and keeping only the ones that improve the overall image prediction, until the internal representation matches the input. We demonstrate Kaizen’s merits on two fluorescence microscopy datasets, improving the separation of nuclei and neuronal cells in cell culture images.

1 Introduction

Recent advances in imaging techniques have improved the quality and quantity of medical image data. At the same time, deep learning algorithms have opened new possibilities for automatic medical image analysis [1]. However, health research and medical diagnosis require near perfection, since marginal errors can lead to severe harm or even death. To make matters worse, biological structures can be very challenging, with small, overlapping elements of complex shape. Thus, current approaches [2, 3] for image analysis tend to miss objects. In contrast, humans seamlessly recognize all the different objects that compose a given medical image.

Thus, a potential avenue for advancing medical image analyses lies in exploring aspects of human perception that diverge from contemporary computer algorithms. For instance, the intricacies of human perception extend beyond a mere bottom-up process, not relying solely on sensory input. Instead, humans leverage acquired knowledge and experiences to generate plausible internal hypotheses [4]. Such internally generated hypotheses are continuously evaluated against external inputs, rejecting the hypotheses that diverge more from reality. In essence, humans possess generative models to dynamically construct and refine their perception of reality.

The use of generative models in computer vision makes it possible to optimize the output during inference. Consider, for example, the case where the optimal solution for an image is available during inference. In this scenario, we can optimize the output to match the optimal solution. Typically, however, we do not possess the optimal solution at inference time. What we do possess is the ground truth used in self-supervised learning: the structure of the image itself. Thus, it is possible to optimize the model’s solution to the degree that the image contains information about the optimal solution.

Generating images that already exist in the input might seem inefficient. However, generating internal images makes it possible to contrast the internal prediction with the input, providing continuous error feedback. This error feedback can be used to update the model priors in an unsupervised manner, but also to improve object prediction during inference. For example, any object predictions inconsistent with the input image can be eliminated, improving results. Furthermore, failing to predict an object in a region leads to a high local error, signalling where new object predictions should be generated.

Following this reasoning, we present Kaizen, a practical implementation for decomposing cellular images into individual objects. First, we train a VQ-VAE to encode individual cells. Then, during inference, the VQ-VAE makes predictions about the input at image locations with maximal error. As in an evolutionary algorithm, only the predictions that decrease the difference between the input image and the global prediction survive.

2 Related work

2.1 VQ-VAE

Variational Autoencoders [5] can compress an image dataset into a latent multivariate Gaussian distribution. For further image compression, quantization methods have been applied successfully [6, 7]. Similarly, Van den Oord et al. proposed the Vector Quantised-Variational AutoEncoder (VQ-VAE) [8], which also encodes input data as a vector of discrete latent variables. However, the VQ-VAE employs the discrete latent variables as indices into a memory table to recover a set of embeddings. The embeddings are decoded to produce an image. Later work [9] proposed a hierarchical VQ-VAE with several latent codes to improve image quality.

2.2 Cell instance segmentation

Segmentation is considered the first critical step for biomedical microscopy image analysis. Although standard deep learning approaches have been applied to the task [10, 11], some proposals explore alternative data representations that might be more suitable for microscopy. For example, StarDist [12, 13] segments cell nuclei by describing them as star-convex polygons. StarDist first predicts, for each pixel, the distance to the cell nucleus boundary along a set of predefined equidistant angles, and afterward performs non-maximum suppression (NMS) to discard duplicated polygons. MultiStar [14] extends StarDist by identifying overlapping objects, while SplineDist [15] modifies StarDist by representing objects as planar spline curves. Another example is Cellpose [16, 17], which learns during training to predict the gradients of a diffusion process starting at the cell centre. Later, during inference, Cellpose backtracks the predicted gradients to see which pixels converge to the same cell centre.

Finally, in amodal blastomere instance segmentation, a VQ-VAE has been used after a typical detection pipeline to generate mask representations from the image features [18]. In contrast, our method employs the VQ-VAE to generate individual object representations and performs the segmentation afterwards.

2.3 Image decomposition

Early work [19] already proposed image decomposition as an alternative to object detection pipelines. Other approaches tackle unsupervised learning of object representations through image reconstruction [20–23]. While MONet [20] and IODINE [21] apply a network recurrently, Slot Attention [22] and capsule networks [23] apply an algorithm iteratively. Although Kaizen is similar in that it iteratively applies a VQ-VAE, it is trained in a supervised manner. Unlike those models, which incorporate an explicit attention mechanism, Kaizen relies on a VQ-VAE that generates only individual objects. More recently, Composer [24] decomposes an image into several decoupled representations, including object instances, and then trains a diffusion model conditioned on these representations to improve control over image generation.

3 Methodology

An illustration of Kaizen is shown in Fig 1. Kaizen uses a VQ-VAE model trained on microscopy images to predict one individual cell from an image with multiple cells. During inference, the VQ-VAE iteratively predicts individual cells in the input microscopy image (Fig 1A). Kaizen maintains an internal predicted image formed by all the cells predicted so far (Fig 1B). The difference between the internal predicted image and the external image is the error image (Fig 1C). Kaizen accepts a new prediction only when it reduces the error, making the external image and the internal prediction more similar. Furthermore, new predictions are made in regions with higher error, avoiding duplicate predictions. The process is repeated until the method is unable to predict new cells. Kaizen’s components are described in more detail below.

Fig 1. (a) Multicolored squares correspond to various regions serving as input to a VQ-VAE tasked with reconstructing the central cell within. The reconstructed individual cells are merged into an internal image, (b), with all the cells predicted so far. Subtracting this internal image from the original yields an error image, (c). New prediction points, indicated by crosses, are selected at local maxima in the error image. The prediction indicated by a red cross will be discarded since adding a cell there increases the overall error.

https://doi.org/10.1371/journal.pone.0313549.g001

3.1 VQ-VAE

As the generative model for Kaizen, a VQ-VAE [8] was chosen, favouring inference speed over sample quality. The VQ-VAE was trained on small patches containing a few of the hundreds of cells in a microscopy image. The purpose of training was for the VQ-VAE to produce a single-cell image as output when given an image containing multiple cells. Thus, for a given input training patch, the corresponding training output was generated by multiplying the input patch by the mask of the cell touching the central pixel. As shown in Fig 2, after training the VQ-VAE encodes a single central cell from small patches containing multiple cells, while disregarding other cells and noise present in the input. To avoid encoding non-existent cells, twenty percent of the training patches did not have any cell touching the central pixel; in these cases the VQ-VAE was trained to output an empty image.
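The construction of one training pair can be sketched as follows. This is an illustrative helper, not code from the paper; it assumes each patch comes with an integer instance-label mask (0 = background):

```python
import numpy as np

def make_training_pair(patch, instance_labels):
    """Build one (input, target) pair for the VQ-VAE.

    `patch` is a 2-D intensity patch and `instance_labels` an integer
    mask of the same shape.  The target keeps only the cell whose mask
    touches the central pixel; if no cell touches the centre (as in the
    twenty percent of empty training patches), the target is empty.
    """
    h, w = patch.shape
    center_id = instance_labels[h // 2, w // 2]
    if center_id == 0:
        # no cell at the centre: train the model to output an empty image
        target = np.zeros_like(patch)
    else:
        # keep the input intensities only under the central cell's mask
        target = patch * (instance_labels == center_id)
    return patch, target
```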

Fig 2. Samples of the VQ-VAE encoding individual cells for the U2OS dataset.

The first row corresponds to training image patches employed as input to the VQ-VAE. In the second row, under each input, the corresponding ground truth is shown. Finally, the third row shows the corresponding VQ-VAE output after training.

https://doi.org/10.1371/journal.pone.0313549.g002

3.2 Core algorithm

The core algorithm of Kaizen uses the VQ-VAE described above, which encodes individual cells, as a predictor, taking inspiration from predictive coding in the brain. Algorithm 1 calculates an error image as the difference between the input and an internal image prediction. At each iteration, several points are proposed where the error image reaches local maxima, keeping the proposed points distant from each other. Then, at each proposed point, an object prediction is made. Finally, only predictions that diminish the loss when added to the internal prediction are kept.

3.3 Predicting on the error

Notably, keeping an internal image allows estimation of an error image, which is derived by subtracting the internal prediction from the input image. Predicting on the error image filters out already predicted objects. Thus, predicting on the error image is easier, given its reduced complexity compared to the original image, minimizing potential confusion for the predictor. However, erroneous predictions in the initial stages could propagate into subsequent ones, compounding any inaccuracies committed by the predictor.

Algorithm 1. Core algorithm
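Based on the description in Section 3.2, the core loop can be sketched as follows. This is a simplified reconstruction, not the paper's listing; `predict_cell` and `propose_points` are placeholder callables standing in for the VQ-VAE and the peak-selection step:

```python
import numpy as np

def kaizen_core(image, predict_cell, propose_points, n_points=10, max_iter=30):
    """Iteratively add single-cell predictions that reduce the L1 error.

    `predict_cell(error, point)` returns a candidate single-cell image;
    `propose_points(error, n)` returns up to `n` distant local maxima.
    Both are stand-ins for the components described in the paper.
    """
    internal = np.zeros_like(image, dtype=float)
    for _ in range(max_iter):
        error = image - internal
        points = propose_points(error, n_points)
        if len(points) == 0:
            break
        accepted = 0
        for p in points:
            candidate = predict_cell(error, p)
            new_internal = internal + candidate
            # keep the guess only if it brings the internal image
            # closer to the input (lower global L1 loss)
            if np.abs(image - new_internal).sum() < np.abs(image - internal).sum():
                internal = new_internal
                accepted += 1
        if accepted == 0:
            break
    return internal
```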

To enhance object detection while minimizing error compounding, Kaizen first predicts on the input image with the core algorithm until no more object predictions are found. With this first set of predictions an error image is generated. Then, with the error image as input, we apply Kaizen again generating new predictions and a second error image. We repeat this process until no new objects are found in a given iteration, aiming to detect all objects even in the most densely populated images.

3.4 Datasets

U2OS dataset: Fluorescent microscopy images from U2OS cell lines. Image set BBBC039 version 1, available from the Broad Bioimage Benchmark Collection [25]. The ground truth consists of individual nucleus instances. Although the ground truth might contain some errors, we did not alter the ground truth. We randomly selected 100 images for the training set, 50 for the validation set, and 50 for the test set.

Neuroblastoma dataset: Fluorescent images of cultured neuroblastoma cells, available from the Cell Image Library [26]. The ground truth consists of manually annotated cell boundaries. Although the ground truth might contain some errors, we did not alter the ground truth. However, since the ground truth size was half the width and height of the images, we shrank the images with cubic interpolation to match the ground truth size. We randomly selected 71 images for the training set, 12 for the validation set, and 17 for the test set.

3.5 Evaluation metric

A predicted object is considered a true positive (TP) if its intersection over union (IoU) with a ground truth object is above a certain threshold τ. For the same threshold, predicted objects without a matching ground truth object are considered false positives (FP), and ground truth objects without a matching prediction are considered false negatives (FN). The average precision (AP) for one image is given by:

AP = TP / (TP + FP + FN)    (1)

The average precision reported is the average over all images in the test set.
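For a single image, the metric can be computed from a matrix of pairwise IoU values. The greedy one-to-one matching below is an assumption, since the paper does not spell out the matching procedure:

```python
import numpy as np

def average_precision(iou_matrix, thresh=0.5):
    """AP = TP / (TP + FP + FN) for one image, following Eq (1).

    `iou_matrix[i, j]` is the IoU between predicted object i and
    ground-truth object j.  A greedy one-to-one matching above the
    threshold is assumed here.
    """
    n_pred, n_gt = iou_matrix.shape
    iou = iou_matrix.astype(float).copy()
    tp = 0
    while iou.size and iou.max() > thresh:
        i, j = np.unravel_index(np.argmax(iou), iou.shape)
        tp += 1
        iou[i, :] = 0.0  # each prediction and each ground truth match once
        iou[:, j] = 0.0
    fp = n_pred - tp
    fn = n_gt - tp
    return tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
```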

3.6 Implementation details

The VQ-VAE encoder consists first of 5 strided convolutional layers with padding 1. The first three layers have stride 2 and window size 4×4; the fourth and fifth layers have stride 1. The first two layers have 64 hidden units, while the remaining layers have 128 hidden units. The convolutional layers are followed by two residual blocks (implemented as ReLU, conv, ReLU, conv).
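As a sanity check on these strides, the standard convolution output-size formula shows that three stride-2 layers with window size 4×4 and padding 1 downsample each spatial dimension by a factor of eight:

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1

# Three stride-2 layers shrink each spatial dimension by 8x.
size = 40          # U2OS patch width (40x40 inputs)
for _ in range(3):
    size = conv_out(size)
print(size)  # 5: a 40x40 patch maps to a 5x5 latent grid
```

The same arithmetic gives a 15×15 latent grid for the 120×120 Neuroblastoma inputs.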

The VQ-VAE decoder first has one convolutional layer with stride 1, padding 1, and 128 hidden units, followed by two residual blocks. Next come three transposed convolutions with stride 2, the first of them with padding 1. The final layer is a transposed convolution with stride 1 and 64 hidden units.

The VQ-VAE codebook contains 128 embeddings of dimension 2, and we use exponential moving averages [8] to update the dictionary, with commitment cost β = 0.25. We use the Adam optimiser [27] with learning rate 1e–4 and an L1 loss, and train for 500,000 steps with batch size 32. The L1 loss was also employed for the Kaizen algorithm.
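The quantization step itself replaces each 2-dimensional encoder output vector with the nearest of the 128 codebook embeddings. A minimal numpy sketch (not the paper's implementation) is:

```python
import numpy as np

def quantize(z, codebook):
    """Replace each row of `z` with its nearest codebook embedding.

    `z` has shape (n, d) and `codebook` shape (k, d); here d = 2 and
    k = 128 as in the paper.  Returns the codebook indices and the
    quantized vectors that the decoder receives.
    """
    # squared Euclidean distance between every vector and every code
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]
```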

To predict at the border of the image, the input image was padded by half the input size of the VQ-VAE on each side: eight pixels of mirror-padding followed by zero-padding (20 pixels in total per side for the U2OS dataset and 60 for the Neuroblastoma dataset).
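This padding scheme might be sketched as follows, with `half_input` equal to half the VQ-VAE input size (20 for U2OS, 60 for Neuroblastoma); the helper name is illustrative:

```python
import numpy as np

def pad_for_border(image, half_input, mirror=8):
    """Pad by `half_input` pixels per side: first `mirror` pixels of
    mirror-padding, then zero-padding for the remainder."""
    padded = np.pad(image, mirror, mode="reflect")
    rest = half_input - mirror
    return np.pad(padded, rest, mode="constant")
```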

To select several distant points in parallel, we convolve the error image with a kernel of ones with stride one. The highest value in the resulting image gives the first point. The region surrounding this point is then set to zero, and the process is repeated to select subsequent points. This procedure aims to minimize the occurrence of predictions on background noise and cell boundaries.
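A sketch of this selection procedure is given below; the kernel and suppression sizes are illustrative defaults, since the exact values are not given in the text:

```python
import numpy as np

def select_points(error, n, kernel=5, suppress=5):
    """Pick up to `n` well-separated prediction points.

    The error image is convolved with a kernel of ones (a local sum),
    the maximum is taken as a point, and a square region around it is
    zeroed before picking the next point.
    """
    k = kernel // 2
    padded = np.pad(error, k)
    # local sum via a sliding-window view: equivalent to convolving
    # with a kernel of ones at stride 1
    win = np.lib.stride_tricks.sliding_window_view(padded, (kernel, kernel))
    score = win.sum(axis=(2, 3))
    points = []
    s = suppress // 2
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(score), score.shape)
        if score[i, j] <= 0:
            break
        points.append((i, j))
        # suppress the surrounding region to keep points distant
        score[max(0, i - s):i + s + 1, max(0, j - s):j + s + 1] = 0.0
    return points
```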

For the U2OS dataset the number of simultaneous points of prediction was set to ten (N in Algorithm 1). To avoid a hypothetical infinite loop, the repeat loop in Algorithm 1 was set to a maximum of 30 iterations. For the neuroblastoma dataset the number of simultaneous points of prediction was set to one (N in Algorithm 1) and the repeat loop in Algorithm 1 was set to a maximum of 100 iterations.

As pre-processing, the images of both datasets were normalized. No data augmentation was applied. As post-processing, to avoid empty predictions in the U2OS dataset, we eliminate all masks with fewer than 20 pixels. For the same reason, in the Neuroblastoma dataset, all masks with fewer than 20 blue pixels or less than one green pixel are eliminated.
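The U2OS post-processing rule reduces to a one-line filter (a sketch; `masks` is assumed to be a list of binary arrays):

```python
import numpy as np

def filter_small_masks(masks, min_pixels=20):
    """Drop predicted masks with fewer than `min_pixels` foreground
    pixels, as done for the U2OS dataset."""
    return [m for m in masks if m.sum() >= min_pixels]
```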

4 Results

Kaizen decomposes an image into object representations, including superpositions of objects or occlusions. In contrast, typical biomedical computer-vision models are classifiers that predict box coordinates plus a category (object detection) or perform pixel classification (segmentation). Thus, to compare with the current methodology, our method was modified to perform instance segmentation. Specifically, the predicted objects are converted to binary masks by setting a minimum brightness threshold (ten percent of the input image average), such that all pixels above it are set to one and those below it to zero. Such an approach might understate Kaizen’s results but provides a reasonable comparison to the current methodology.
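The binarization step can be written as a small helper (illustrative, using the ten-percent fraction from the text):

```python
import numpy as np

def object_to_mask(pred, input_image, frac=0.10):
    """Binarize a predicted object image at `frac` (10%) of the
    input image's mean brightness."""
    return pred > frac * input_image.mean()
```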

First, we evaluate Kaizen on the U2OS dataset of cell nuclei fluorescence microscopy images [25]. To train the VQ-VAE, we generated image patches of 40×40 pixels, with eighty percent of them containing a cell touching the patch centre. The VQ-VAE was then trained to code representations of individual cell nuclei, as illustrated in Fig 2 and described in Section 3.1. Kaizen was then applied to 50 left-out images from the dataset. Qualitative results for a test image are illustrated in Fig 3.

Fig 3. Examples of Kaizen segmentation for the U2OS dataset.

The first two rows correspond to two dataset images, while the last row magnifies an image region. Panel a) depicts the original dataset image, and panel b) shows the corresponding internal image reconstruction created by merging the individual nuclei generated by the VQ-VAE. Panels d) and e) show Kaizen and Stardist corresponding image segmentation. Panel c) depicts the ground truth segmentation.

https://doi.org/10.1371/journal.pone.0313549.g003

Regarding numerical results, we compare Kaizen to the StarDist model [12], because it was specifically designed to predict cell nuclei in this type of data, and it is superior to more general approaches such as U-Net [2] or Mask R-CNN [3]. The numerical evaluation in Table 1 shows that our method obtains superior average precision across all thresholds.

Table 1. Average precision (AP) at several intersection over union (IoU) thresholds for the U2OS nuclei dataset and the Neuroblastoma dataset.

https://doi.org/10.1371/journal.pone.0313549.t001

On the whole U2OS test set, Kaizen produces 100 false positives and 486 false negatives. Thus, the majority of Kaizen’s errors correspond to false negatives, that is, ground-truth masks with no valid match. Most of these non-predicted nuclei are very small nuclei that are labeled inconsistently in the ground truth.

The impact of variations across entire images on the algorithm was also analyzed. As illustrated in Fig 4, with a fixed number of parallel predictions the processing time of Kaizen scales proportionally with the number of cells present in the image. For empty images, the computation is nearly instantaneous, as the algorithm primarily involves convolving a kernel of ones across the image and performing a limited number of predictions. Consequently, Kaizen remains highly efficient for large images with a low cell density. However, computational time increases in cases of high cell density, images with significant noise and artifacts, or instances where the model encounters cells that are challenging to predict.

Fig 4. Kaizen processing time versus the number of cells in the image for the U2OS test dataset at a fixed number of parallel predictions.

https://doi.org/10.1371/journal.pone.0313549.g004

Next, to assess Kaizen on more complex images, we evaluate it on a Neuroblastoma dataset of fluorescent microscopy images. In contrast to the U2OS dataset, the entire cell is predicted, including the cytoplasm. To account for the increased prediction size, we increase the input size of the VQ-VAE to 120×120 pixels. Kaizen was applied to 17 left-out images from the dataset. Qualitative results for a test image are illustrated in Fig 5.

Fig 5. Examples of Kaizen segmentation for neuroblastoma dataset.

The first two rows correspond to two dataset images, while the last row magnifies an image region. Panel a) depicts the original dataset image, and panel b) shows the corresponding internal image reconstruction created by merging the individual cells generated by the VQ-VAE. Panels d) and e) show Kaizen and Cellpose corresponding image segmentation. Panel c) depicts the ground truth segmentation.

https://doi.org/10.1371/journal.pone.0313549.g005

For the numerical results on the Neuroblastoma dataset, we compare Kaizen to Cellpose [16], since it was designed using this specific dataset. Numerical evaluation is presented in Table 1. Both models obtain worse results than expected; this might be the result of chance, since the 17 left-out images seem difficult compared to the typical image in the dataset. Kaizen obtains superior average precision for the three lowest thresholds. However, it falls behind for the 0.8 and 0.9 thresholds. Upon inspection, the average precision decay at higher thresholds might be related to the conversion from object images to binary masks.

5 Conclusion

We have introduced a new approach to learning object representations in microscopy images: Kaizen. The approach is inspired by human perception, which contains an inherent predictive component that provides feedback to an internal model of the world.

In the implementation presented here, a VQ-VAE was trained to encode discrete representations of individual cells in microscopy images. Afterward, during inference, the VQ-VAE was applied iteratively on the input image to make new guesses, keeping only those that diminish the loss between the image and the global prediction. Furthermore, once no more predictions were found, the error image (input image minus global prediction) was used as new input to detect more cells and generate a second error image. This process was repeated several times to avoid missing cells.

Since typical models do not learn object representations, Kaizen was evaluated on two different instance segmentation datasets, showing competitive performance. More specifically, Kaizen obtained higher AP than StarDist across all thresholds, and higher AP50, AP60, and AP70 than Cellpose.

In future work, Kaizen can be extended by allowing several models to make simultaneous predictions. Adding evolutionary algorithms and more powerful generative models should further improve Kaizen’s performance.

References

  1. Esteva A, Chou K, Yeung S, Naik N, Madani A, Mottaghi A, et al. Deep learning-enabled medical computer vision. NPJ Digit Med. 2021;4(1):5. pmid:33420381
  2. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015, pp. 234–41.
  3. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE; 2017, pp. 2961–9.
  4. Gregory RL. Perceptions as hypotheses. Philos Trans R Soc Lond B Biol Sci. 1980;290(1038):181–97. pmid:6106237
  5. Kingma DP, Welling M. Auto-encoding variational Bayes. In: 2nd International Conference on Learning Representations (ICLR 2014), Conference Track Proceedings. 2014. https://doi.org/10.48550/arxiv.1312.6114
  6. Agustsson E, Mentzer F, Tschannen M, Cavigelli L, Timofte R, Benini L. Soft-to-hard vector quantization for end-to-end learning compressible representations. In: Advances in Neural Information Processing Systems. 2017.
  7. Theis L, Shi W, Cunningham A, Huszár F. Lossy image compression with compressive autoencoders. arXiv, preprint, 2017.
  8. Van den Oord A, Vinyals O, Kavukcuoglu K. Neural discrete representation learning. In: Advances in Neural Information Processing Systems (NeurIPS). 2017.
  9. Razavi A, Van den Oord A, Vinyals O. Generating diverse high-fidelity images with VQ-VAE-2. In: Advances in Neural Information Processing Systems. 2019.
  10. Hollandi R, Szkalisity A, Toth T, Tasnadi E, Molnar C, Mathe B. A deep learning framework for nucleus segmentation using image style transfer. bioRxiv, preprint, 2019:580605.
  11. Fishman D, Salumaa S-O, Majoral D, Laasfeld T, Peel S, Wildenhain J, et al. Practical segmentation of nuclei in brightfield cell images with neural networks trained on fluorescently labelled samples. J Microsc. 2021;284(1):12–24. pmid:34081320
  12. Schmidt U, Weigert M, Broaddus C, Myers G. Cell detection with star-convex polygons. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2018, pp. 265–73.
  13. Weigert M, Schmidt U, Haase R, Sugawara K, Myers G. Star-convex polyhedra for 3D object detection and segmentation in microscopy. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2020, pp. 3666–73.
  14. Walter F, Damrich S, Hamprecht F. MultiStar: instance segmentation of overlapping objects with star-convex polygons. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE; 2021, pp. 295–8.
  15. Mandal S, Uhlmann V. SplineDist: automated cell segmentation with spline curves. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE; 2021, pp. 1082–6. https://doi.org/10.1109/isbi48211.2021.9433928
  16. Stringer C, Wang T, Michaelos M, Pachitariu M. Cellpose: a generalist algorithm for cellular segmentation. Nat Methods. 2021;18(1):100–6. pmid:33318659
  17. Pachitariu M, Stringer C. Cellpose 2.0: how to train your own model. Nat Methods. 2022;19(12):1–8.
  18. Jang W, Wei D, Zhang X, Leahy B, Yang H, Tompkin J. Learning vector quantized shape code for amodal blastomere instance segmentation. arXiv, preprint, 2020.
  19. Park E, Berg A. Learning to decompose for object detection and instance segmentation. arXiv, preprint, 2015.
  20. Burgess CP, Matthey L, Watters N, Kabra R, Higgins I, Botvinick M. MONet: unsupervised scene decomposition and representation. arXiv, preprint, 2019.
  21. Greff K, Kaufman R, Kabra R, Watters N, Burgess C, Zoran D. Multi-object representation learning with iterative variational inference. In: Proceedings of the International Conference on Machine Learning. PMLR; 2019, pp. 2424–33.
  22. Locatello F, Weissenborn D, Unterthiner T, Mahendran A, Heigold G, Uszkoreit J. Object-centric learning with slot attention. In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020). 2020, pp. 11525–38.
  23. Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). 2017.
  24. Huang L, Chen D, Liu Y, Shen Y, Zhao D, Zhou J. Composer: creative and controllable image synthesis with composable conditions. arXiv, preprint, 2023.
  25. Ljosa V, Sokolnicki KL, Carpenter AE. Annotated high-throughput microscopy image sets for validation. Nat Methods. 2012;9(7):637. pmid:22743765
  26. Yu W, Lee HK, Hariharan S, Bu WY, Ahmed S. CCDB: 6843, Mus musculus, neuroblastoma. Cell Image Library. 2019. https://doi.org/10.7295/W9CCDB6843
  27. Kingma D, Ba J. Adam: a method for stochastic optimization. arXiv, preprint, 2014.