Synthetic data enables human-grade microtubule analysis with foundation models for segmentation

doi:10.1371/journal.pcbi.1013901

Fig 1.

The SynthMT instance segmentation benchmark evaluates methods on synthetic interference reflection microscopy (IRM)-like images containing microtubules (MTs).

(a) Synthetic image mimicking IRM of in vitro reconstituted MTs nucleated from fixed seeds (visualized in red), reproducing key mechanical and geometrical properties such as filament length and curvature. (b) Our pipeline generates accompanying ground-truth instance masks that enable quantitative evaluation. (c) The classical FIESTA [11] algorithm predicts anchor points for each instance (for visual clarity, only the first and last point of each instance are shown), which we connect through splines. The example demonstrates typical failure modes: filament fragmentation (single MTs split into multiple instances), incomplete segmentation, and artifacts at intersections. (d) SAM3 [12] guided by a simple text prompt (“thin line”) produces precise, human-grade segmentation, accurately tracing intersecting MTs. This is supported by its high Skeleton Intersection over Union (SKIoU) for this specific image.


Fig 2.

Our synthetic data generation pipeline produces realistic microtubule (MT) images with corresponding instance segmentation masks, conditioned on a set of generation parameters.

(1) Geometry generation creates instance masks from geometric parameters (count, length, curvature) using polylines. (2.1) Physical rendering applies point-spread function (PSF) convolution to replicate optical properties, and adds red seeds and a uniform background. (2.2) Artifact simulation introduces realistic distractor features (circular spots, irregular structures). (2.3) Noise addition models signal-dependent (Poisson) and signal-independent (Gaussian) noise sources. (2.4) Global distortions apply spatially varying effects (vignetting, blur, contrast variations) to match real microscopy conditions. This approach generates labeled data that closely approximates experimental interference reflection microscopy (IRM) images when its generation parameters are tuned accordingly (as explained in Section 4).
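The stages in this caption can be illustrated with a minimal NumPy sketch. It is not the SynthMT implementation: the function names, parameter names, and default values are hypothetical, the PSF is approximated by a Gaussian, curvature is modeled as a random change in heading per segment, and the sign of the filament contrast is an arbitrary choice. Red seeds and artifact simulation (2.2) are omitted for brevity.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel (assumed stand-in for the PSF)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur with edge padding; output keeps the input shape."""
    k = gaussian_kernel(sigma)
    padded = np.pad(img, len(k) // 2, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def draw_polyline(mask, points, label):
    """Rasterize a polyline into `mask` by densely sampling each segment."""
    for (y0, x0), (y1, x1) in zip(points[:-1], points[1:]):
        n = int(max(abs(y1 - y0), abs(x1 - x0))) + 1
        ys = np.linspace(y0, y1, n + 1).round().astype(int)
        xs = np.linspace(x0, x1, n + 1).round().astype(int)
        ok = (ys >= 0) & (ys < mask.shape[0]) & (xs >= 0) & (xs < mask.shape[1])
        mask[ys[ok], xs[ok]] = label

def generate(rng, size=64, n_filaments=3, n_segments=8, step=6.0, bend=0.15,
             psf_sigma=1.2, background=0.5, contrast=0.3, photons=200.0,
             read_noise=0.01, vignette=0.2):
    """Return (image, instance_mask); all parameters are illustrative."""
    # (1) Geometry: one polyline per filament, written as an instance label.
    mask = np.zeros((size, size), dtype=np.int32)
    for label in range(1, n_filaments + 1):
        pos = rng.uniform(0.2 * size, 0.8 * size, 2)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        pts = [pos.copy()]
        for _ in range(n_segments):
            angle += rng.normal(0.0, bend)  # heading change controls curvature
            pos = pos + step * np.array([np.sin(angle), np.cos(angle)])
            pts.append(pos.copy())
        draw_polyline(mask, np.array(pts), label)
    # (2.1) Rendering: uniform background plus PSF-blurred filaments.
    img = background - contrast * blur((mask > 0).astype(float), psf_sigma)
    # (2.3) Noise: signal-dependent Poisson, then signal-independent Gaussian.
    img = rng.poisson(np.clip(img, 0.0, None) * photons) / photons
    img = img + rng.normal(0.0, read_noise, img.shape)
    # (2.4) Global distortion: radial vignetting, darker toward the corners.
    yy, xx = np.mgrid[0:size, 0:size]
    r2 = ((yy - size / 2) ** 2 + (xx - size / 2) ** 2) / (size / 2) ** 2
    img = img * (1.0 - vignette * r2)
    return np.clip(img, 0.0, 1.0), mask
```

Calling `generate(np.random.default_rng(0))` returns an image and an integer mask of the same shape, so every synthetic image comes with its ground-truth instance labels.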


Fig 3.

Optimizing the generation parameters aligns synthetic image distributions with real, annotation-free microscopy data.

Real interference reflection microscopy (IRM) images (left) and synthetic images (center) are embedded using DINOv2. The parametric generator (right) creates images by sampling from distributions governing geometric properties (filament count, length, curvature) and imaging characteristics (PSF, noise, artifacts, contrast, distortions), all controlled by the generation parameters. An optimization loop iteratively refines these parameters by maximizing the cosine similarity between real and synthetic embeddings, so that synthetic images match the statistical properties and visual characteristics of the experimental data.
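The loop described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' optimizer: `generate` and `embed` are hypothetical callables supplied by the user (in the paper the embedding is DINOv2), embeddings are compared through their means, and a simple random search stands in for whatever optimization method is actually used.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score(params, generate, embed, real_mean, rng, n_images=8):
    """Similarity between the mean synthetic embedding and the mean real one."""
    synth = np.mean([embed(generate(params, rng)) for _ in range(n_images)], axis=0)
    return cosine_similarity(synth, real_mean)

def random_search(init, generate, embed, real_images, rng, n_iter=50, scale=0.1):
    """Keep a perturbed parameter set only if it raises the similarity score.

    `init` is a dict of float-valued parameters (hypothetical names).
    """
    real_mean = np.mean([embed(x) for x in real_images], axis=0)
    best = dict(init)
    best_score = score(best, generate, embed, real_mean, rng)
    for _ in range(n_iter):
        # Propose a candidate by adding Gaussian noise to every parameter.
        cand = {k: v + rng.normal(0.0, scale) for k, v in best.items()}
        s = score(cand, generate, embed, real_mean, rng)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```

No labels enter the loop: only real images and their embeddings are needed, which is what makes the alignment annotation-free.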


Fig 4.

Real IRM images with human ground-truth labels (for evaluation only).

Four exemplary frames from the 66 video crops used to establish target distributions (each crop contributing 10 randomly sampled frames). Each image shows individual in vitro MTs growing from stabilized seeds under different experimental conditions, exhibiting natural variation in quantity, length, curvature, overlapping MTs, contrast, noise characteristics, filament density, and background properties. These target distributions act as references in the optimization process (Fig 3), where DINOv2 embeddings of the real images guide the synthetic data generation. Ground-truth labels from one annotator are overlaid to illustrate the filament structures present in the real data; they are used solely for later method evaluation and are not required for generating the synthetic dataset SynthMT.


Fig 5.

Domain experts confirm perceptual quality of SynthMT images.

Violin plots show z-normalized ratings of n = 6 domain experts across five quality dimensions for real IRM images and synthetic images from SynthMT and DRIFT [7] (10 images each, 30 in total). Each violin displays the full distribution, median (white line) and interquartile range (thick bar). Ratings were collected on a 7-point Likert scale. DRIFT permits evaluation only of structural fidelity due to its black-and-white outputs (see exemplary images in Fig B in S1 Appendix). SynthMT images score higher than DRIFT on this dimension, indicating that parameter-optimized synthesis yields structures that more closely resemble real microscopy data. A measurable gap to real images persists across all dimensions. Nevertheless, experts rate the backgrounds, lighting, and noise patterns of SynthMT as internally coherent and plausibly aligned with real IRM, in contrast to DRIFT’s limited realism.
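As a small illustration of the z-normalization mentioned above, the sketch below standardizes Likert ratings separately for each rater, so that experts who use the scale differently become comparable. Normalizing per rater (rather than per dimension or over all ratings) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def z_normalize_per_rater(ratings):
    """Standardize a (raters x items) array of ratings row by row.

    Each rater's ratings are shifted to mean 0 and scaled to unit standard
    deviation; a rater who gives the same rating everywhere maps to zeros.
    """
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean(axis=1, keepdims=True)
    std = ratings.std(axis=1, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero
    return (ratings - mean) / std
```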


Table 1.

Results on SynthMT signal unprecedented segmentation performance of the new SAM3 model.


Table 2.

Hyperparameter optimization on synthetic SynthMT images improves SAM3Text to human-grade performance on unseen, real IRM data.


Fig 6.

SAM3Text + HPO closely matches ground-truth MT length and curvature distributions across scales and datasets.

Normalized histograms compare predicted and ground-truth distributions for MT length (top row) and curvature (bottom row) on (a) SynthMT and (b) unseen, real data. SAM3Text + HPO preserves both low and high values across the full range of lengths and curvatures, as reflected by low KL divergence values computed from these histograms. Distributions for the other methods on SynthMT are shown in Fig E in S1 Appendix and Fig F in S1 Appendix.
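A KL divergence between two such histograms could be computed as in the following sketch. The bin count, the shared bin range, the smoothing constant, and the direction of the divergence (ground truth relative to prediction) are assumptions made for illustration; the paper's exact settings are not given in this caption.

```python
import numpy as np

def histogram_kl(gt, pred, bins=20, eps=1e-9):
    """KL(gt || pred) between normalized histograms over shared bins."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Use one common range so both histograms share identical bin edges.
    lo = min(gt.min(), pred.min())
    hi = max(gt.max(), pred.max())
    p, edges = np.histogram(gt, bins=bins, range=(lo, hi))
    q, _ = np.histogram(pred, bins=edges)
    # Smooth empty bins, then normalize counts to probabilities.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

The value is zero when the two histograms coincide and grows as the predicted length or curvature distribution drifts away from the ground truth.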


Fig 7.

Only SAM3Text + HPO reaches human segmentation performance on unseen, real IRM data.

Split violin plots show the distribution of per-image SKIoU scores (n = 66) for each method evaluated on unseen, real IRM data. The left side of each violin (blue) represents the default configuration, while the right side (purple) shows the performance after HPO on 10 random synthetic images from SynthMT. Each violin includes a horizontal line indicating the mean SKIoU across all images. Human performance is shown as a solid gray violin for reference, with its mean and standard deviation indicated by horizontal lines across the plot. The optimized SAM3Text matches the human inter-annotator baseline. While other methods also improve with HPO, none reaches the performance of SAM3Text. Notably, CellSAM already approaches human-level performance in its default configuration, but its performance decreases after HPO.


Fig 8.

Qualitative comparison on an unseen, real-world in vitro reconstituted MT assay.

For each method, we show predictions after HPO on 10 synthetic images from SynthMT (few-shot setting). The selected real image is particularly challenging, as it contains many intersecting MTs and exhibits a low signal-to-noise ratio (SNR), exposing a wide range of failure modes. For anchor-point methods such as FIESTA and TARDIS, only the first and last predicted points per instance are shown for visual clarity. Underneath each image we report the mean SKIoU value for this specific image, in order to correlate it with a visual impression. SAM3Text clearly performs best in this setting, while all other methods show limitations that may hinder their suitability for large-scale fully automated analysis. For more comparisons and dynamic exploration of this kind, we refer to our project page at DATEXIS.github.io/SynthMT-project-page.
