ColiCoords: A Python package for the analysis of bacterial fluorescence microscopy data

Single-molecule fluorescence microscopy studies of bacteria provide unique insights into the mechanisms of cellular processes and protein machineries in ways that are unrivalled by any other technique. With the cost of microscopes dropping and the availability of fully automated microscopes, the volume of microscopy data produced has increased tremendously. These developments have moved the bottleneck of throughput from image acquisition and sample preparation to data analysis. Furthermore, requirements for analysis procedures have become more stringent given the demand of various journals to make data and analysis procedures available. To address these issues we have developed a new data analysis package for analysis of fluorescence microscopy data from rod-like cells. Our software ColiCoords structures microscopy data at the single-cell level and implements a coordinate system describing each cell. This allows for the transformation of Cartesian coordinates from transmission light and fluorescence images and single-molecule localization microscopy (SMLM) data to cellular coordinates. Using this transformation, many cells can be combined to increase the statistical power of fluorescence microscopy datasets of any kind. ColiCoords is open source, implemented in the programming language Python, and is extensively documented. This allows for modifications for specific needs or to inspect and publish data analysis procedures. By providing a format that allows for easy sharing of code and associated data, we intend to promote open and reproducible research. The source code and documentation can be found via the project’s GitHub page.

Solving Eq 2 gives a cubic equation which is solved analytically to ensure fast transformation of many data points. To account for points B at the poles of the cells the coordinate x c is restricted to the domain [x l , x r ]. The cellular coordinates (l c , r c , ϕ) can then be calculated from x c and p(x). The longitudinal coordinate l c is given by the arc length along p(x) from x l to x c : ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi This expression also gives the full length of the cell when x c is substituted by x r . The radial coordinate r c is simply the euclidean distance between (x c , p(x c )) and (x p , y p ): r c ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi The third coordinate is the angle ϕ (in degrees) and it uniquely defines points A along the body of the cell by distinguishing between the top and bottom of the cell, as well as defining the position of points B at the poles. The value of ϕ is 0 at the top of the cell and runs from 0 to 180 along the right pole, where ϕ is defined as the angle between the line perpendicular to p(x) at x r and the line from (x c , p(x c )) to (x p , y p ). For points below the midline of the cell, the value of ϕ is 180, which then runs from 180 back to 0 along the left pole.
Note that the top area of the cell, as it is displayed in Panel (A) of Fig 2, is given by y p < p(x p ), since the origin of the Cartesian coordinates is at the top left, where the postive y-axis is running down. By this definition, the coordinate of the midpoint of the top-left pixel is (0.5, 0.5), in line with the coordinate definitions in the image analysis software ImageJ [71,72].

Preprocessing and optimization
To analyse fluorescence microscopy data with ColiCoords some preprocessing of the input data is required. Notably, ColiCoords is not designed to segment or detect cells in the raw microscopy images. Binary (labelled) input images are required for ColiCoords processing to identify the location of cells in the images. However, these binaries are only needed to localize cells and initialize the coordinate system, and not to form the final optimized coordinate system, and so consequently a high accuracy of segmentation is not required. Binary or segmented images can be derived from brightfield, differential interference contrast (DIC), phase contrast or fluorescence images by established tools such as CellProfiler [62], Ilastik [63], Oufti [55] or Convolutional Neural Networks (e.g. implemented by TensorFlow [64] or Keras [65]). An implementation of the U-Net segmentation architecture [73] is included in ColiCoords together with examples on how to use it.
Fluorescence images should be (optionally) background or illumination corrected and different channels such as brightfield or DIC should be aligned if they are acquired with different optics. ColiCoords allows the processing of any kind of image data as well as sparse data, such as single-molecule tracking and single-molecule localization microscopy (SMLM) data. Based on the provided binary images, single cells are automatically cut out from all image data, the positions of the localizations are selected and transferred to the coordinates of the cropped image. The orientation of the cell is calculated using the binary image by calculating the image moments [74]. All data elements are then rotated to orient the cells horizontally. The result of this process is a collection of single-cell Python-objects where all data are organized on a cellby-cell basis, and every cell has its own coordinate system which can be used to perform calculations, analysis or visualizations on the data element of choice. These sets of cells can then be indexed and selected based on user-defined criteria (shape, size, fluorescent signal etc.) and further analysed or inspected through interactive plots within the Jupyter Notebook environment.
After initialization of the cell objects, the coordinate system needs to be refined to allow it to more accurately describe the cell's shape. ColiCoords implements the optimization of the coordinate system based on different data sources by providing several objective functions which, when minimized, give the best matching coordinate system for the respective data source. Any image-type data (e.g. brightfield) can be used provided there is an anisotropic signal along the length of the cell, as well as localization (SMLM) data provided the localizations describe the outline of the cell. Fig 3 shows the iterative optimization process. In Panel (A) the ground truth binary image (derived from measurement) is shown together with the cell's coordinate system. The coordinate system is described by a total of 6 parameters, 3 coefficients a 0 , a 1 , a 2 which define the second degree polynomial p(x), the left and right endpoints x l and x r , and the cell's radius r. In the figure, the polynomial p(x) is shown together with the isodistance line from p(x) with distance r. As can be seen from the figure, the initial coordinate system does not accurately describe the cell's shape. To refine the coordinate system, first a radial distance image is calculated (Panel (B)). This image is then thresholded with the initial guess value of r, which gives a new binary image (Panel (C), black). Comparison of this image with the ground truth binary image gives the χ 2 value, which is minimized to optimize the fit of the coordinate system. The fitting in ColiCoords is done via the package Symfit [75], which provides an API to the minimizers implemented by scipy.minimize [57].
Other microscopy images can be used to optimize the coordinate system, where the only constraint is that the signal needs to be isotropic along the length of the cell. This is shown for binary, brightfield and fluorescence images in   channels measured for a single cell. The binary is derived from the brightfield and is used for initial guesses of the coordinate system. This initial guess coordinate system is shown in red in the left column. This coordinate system is used to calculate a synthetic image, averaged along the angular coordinate (middle). A comparison between this image and the measurement gives a measure for how well the current parameters describe the shape of the cell. The optimal parameters can be found via iterative optimization (right). For brightfield-or fluorescence data-based optimization the radius r of the cell is determined in a second step by determining the half-maximum point of pixel intensity.
The optimization process and a possible application thereof is illustrated in Fig 5. Panel (A) shows a STORM super-resolution reconstruction of the membrane protein marker LacY-eYFP. In the top panel, the coordinate system based on initial guesses (derived from binary image; grey) is shown. The bottom panel shows the result upon optimization based on the STORM-localizations data. Here, the radial distance r c for all localizations is calculated and by comparison to the value for r the χ 2 (squared differences) is calculated. Minimization of the χ 2 gives the parameters for the coordinate system that best describes the bacterial membrane. Interestingly, the localizations along the membrane show a periodic fine structure. To further characterize this, the position along the perimeter of the cell was plotted in panel (B). The zero-point position is defined as the beginning of the 'top' part of the membrane (l c = 0, ϕ = 0, r c = r, panel (A)) and the positive directions runs clockwise along the cell outline.
To extract the size of the periodic structures observed, at first a spatial auto-correlation [83] function was calculated and its low-frequency components were subtracted by means of a sliding window, to reveal clear oscillations (panel (C)). By Fourier transforming this signal (panel (D)) we found a periodicity of the signal of 56 nm, which we attribute to an artefact resulting from the YFP fusion, as previously reported for MreB-YFP constructs [84].
In conclusion, optimization based on a SMLM super-resolution membrane marker is expected to provide the most accurate approach for creating a coordinate system within the cell (vide infra, section (D)). This allows for aligning and combination of multiplexed localization data from many cells with high accuracy.

Batch processing
To demonstrate the ability of ColiCoords to analyse a dataset consisting of many cells, we applied a typical analysis procedure to two sets of fluorescence microscopy images. The first set of E. coli cells were incubated with the reactive dye Cy3B-NHS to label the outer membrane, while the second set of cells expressed eGFP in the cytosol.
For each condition, a subset of 100 images were manually annotated through the use of a custom GUI element. These annotated binary images were then combined and used to train a convolutional neural network. An implementation of the U-Net architecture [73] neural network was used, based on the Keras [65] API using Tensorflow [64] as backend.
The brightfield images were first scaled down to 256x256 to preserve graphics card memory. A custom implementation of Keras' Sequence was used for preprocessing and augmentation. All input images were normalized using tanh-estimators [85] and augmented 8-fold through permutations of horizontal mirroring, vertical mirroring and transposing the images. This augmented data was randomly split into validation data and training data.
After training the neural network, the network was applied to the whole set of images to generate binary images. The images were then scaled back to 512x512 pixels. Next, binary shapes were filtered from the segmentation mask by their shape and size. The resulting binary images were used as input for ColiCoords, together with the corresponding brightfield and fluorescence images. The cell objects were filtered based on the results of optimizations by binary images and brightfield images as well as a measurement of the cell's radius from the brightfield image. This yielded a total of 2341 and 1691 cells for the Cy3B and eGFP datasets, respectively.
The analysis output is shown in Fig 6. In panel A, an individual cell for each condition are shown. Their radial distribution profile is shown in panel (B) (blue line), together with the individual datapoints from each pixel as red points. In panel (C), all cells from each dataset were aligned by transforming all pixel coordinates in the fluorescence images from each cell to cellular coordinates. These cellular coordinates are then transformed back onto a model cell with user-specified dimensions. The resulting point cloud is then used to generate an aligned fluorescence image by convolution with a 2D gaussian. The final image is displayed with spline interpolation. In panel (D), the average radial distribution for all cells is shown. The curves are individually normalized in the x-direction by setting the brightfield-measured radius to 1 whereas the y-direction is normalized by the maximum of each curve. The shaded region (standard deviation) therefore only reports on variations in the shape of the radial distribution profile.

Synthetic benchmarks
In order to evaluate ColiCoords' performance a synthetic dataset was generated and subsequently analyzed. By comparing the results for different conditions and optimization methods the accuracy can be benchmarked.
First, a set of 10000 synthetic cells were generated using ColiCoords' synthetic_data module. The geometric parameters describing these synthetic cells were chosen based on a set of measured E. coli cells. These synthetic cells have a binary image data element by default. A brightfield data element was added based on normalized brightfield radial distributions measured from E. coli. The ratio between the background intensity (no cells) of the brightfield and the maximum of the scattered light from the bacterial membrane was measured and kept constant to replicate realistic signal-to-noise ratios. The background light was set to one and then multiplied by the chosen photon count number (500, 1000 and 10000), after which the resulting values where drawn from a Poisson distribution to simulate shot noise. Finally, normally distributed noise was added with a standard deviation of 20 photon counts.
To each cell two homogeneously distributed membrane-localized STORM data elements were added. The first data element describes localizations at the inner membrane. The mean radial distance of the localizations was set according to where the inner membrane should be with respect to the radius of the cell measured from the brightfield image based on measurements with the construct LacY-eYFP. The standard deviation from the mean radius was set to 0.25 pixels (20 nm). A second STORM data element was added 100 nm further out with the same standard deviation.
From this set of synthetic cells, images were reconstructed. An average of 10 ± 3 cells were taken from the set and randomly rotated and placed in each image of 512x512 pixels (40 μm x 40 μm), ensuring at least 5 pixels distance between the cells' binary images. The coordinates of the cell's STORM data element were combined into one big STORM localizations table.
These processes yielded 1000 ground-truth binary images together with 3 sets of 1000 brightfield images with different photon numbers (500, 1000, 10000 per pixel) and therefore different signal-to-noise ratios (Example images in Fig 7A). The whole ColiCoords' data analysis pipeline was performed on this dataset to benchmark its performance.
First, the brightfield images were segmented. For all 3 signal-to-noise conditions, the first 400 images were used to train the neural network. After training, the network was applied to the whole set of 1000 images to generate binary images. Next, the binary images were filtered. Bordering cells were removed and the remaining binary objects were selected based on their size and ellipticity where the selection criteria were obtained from the ground-truth binary images. The filtered and labelled binary images were then used by ColiCoords together with the brightfield and STORM data to generate a set of 'measured' cell objects.
The 'measured' cell objects obtained were correlated to the original ground-truth cell objects by first performing a cluster analysis on the input STORM table [86]. The groundtruth cells were then identified by encoding the combined STORM intensity and comparing to the ground-truth intensity of the STORM clusters. The corresponding measured cell was then found by comparing the centre-of-mass positions of the STORM clusters and the filtered binary images used as input. This yielded two sets of cells, one with ground-truth cells and one with corresponding 'measured' cells for comparison. The radial position is normalized to the ground-truth inner membrane position. The ground-truth radial distance (light colours, filled histogram) is compared with radial distances calculated with the coordinate system as calculated for the different data elements (binary, brightfield, STORM) for different brightfield image photon counts (dark colours, line only). The absolute mean deviation (D) and the root mean squared deviation (D 2 ) are shown in the graph. Right: Relative minimization objective function χ 2 . The χ 2 is calculated for the obtained coordinate system for each cell for the STORM data element for each condition. The obtained value is divided by the ground-truth χ 2 value, therefore a value of 1 indicates a perfectly fitted coordinate system (red line). https://doi.org/10.1371/journal.pone.0217524.g007 Cells which had too few STORM localizations compared to their ground-truth counterparts were discarded due to them being too poorly segmented. Several cells had too many STORM localizations due to localizations from neighbouring cells. These additional localizations were also removed. This yielded a total of 6569, 7129 and 7245 cells out of maximum 7245 cells for brightfield images with 500, 1000 and 10000 photons, respectively. This number is lower for input images with lower signal-to-noise because some cells are segmented poorly by the neural network.
Finally, the coordinate system of the cells was optimized using different approaches to compare their accuracy. The cells in each condition were initialized with guesses derived from segmented binary images and subsequently optimized based on the data elements binary, brightfield and storm (inner membrane). This yielded a total of 9 sets of cells which are summarized in Fig 7B. In the left column of each condition, the relative radial distances of all STORM localizations in the dataset are plotted as calculated by its coordinate system. The radial distances were normalized to the ground truth inner membrane distance to eliminate cell-to-cell radius variations. The ground truth radial distances of the inner (cyan) and outer (magenta) are shown in light colours and filled histogram. The calculated radial distances are shown on top with dark colours where the outer line of the histogram is shown only.
From these graphs the effect of the accuracy of the coordinate system can be shown on the resulting radial distribution graph. As can be seen, in the condition with the lowest number of photons (500) and a coordinate system optimized on the binary data element, the resulting radial distribution histogram is significantly wider than the ground-truth radial distribution. Compared to binary images derived from higher signal-to-noise images, the result more accurately describes the underlying ground-truth. This is further reflected by comparing the obtained χ 2 values from the optimization process. In the histogram on the right (second column) the relative χ 2 of each cell derived from its optimized coordinate system is plotted. The values are obtained by calculating the χ 2 value for STORM optimization (mean squared difference in distance between coordinate system outline and STORM localizations) and dividing this value by the ground-truth χ 2 value to get a relative χ 2 . This value is an indication for the goodness of the fit, where a value of one indicates a perfect fit (red line in histograms).
As can be seen in the histogram, for 500 photons and optimization based on the binary image (Panel (B), top left), the median χ 2 value of optimization is roughly 10-fold higher compared to the ground-truth values. The inset (log scale x-axis and logarithmically increasing bin size) shows a long tail in the distribution to higher relative χ 2 values up to 500.
When comparing the binary optimization result for 1000 and 10000 photons, it can be seen that the radial distribution histograms become narrower and more accurately match the ground truth radial distribution. The distribution of relative χ 2 values also reflect this with median values of 5.8 an 3.8, respectively.
Moving one row down, the results are plotted for optimization based on the brightfield image. The initial guesses for optimization derived from the binary images and therefore the quality of the segmentation influences the final result. As can be seen from the radial distribution histograms, the coordinate system obtained more accurately describes the cell compared to binary optimization. For 10000 photons the radial distribution graph very closely matches the ground truth. In the relative χ 2 distributions a second population can be seen which is mostly present for lower photon numbers. This is a consequence of poor initial guesses and the optimization therefore returns a different local minimum. The median relative χ 2 values are 4.1, 2.6 and 1.9, respectively.
The third row shows the result from optimization based on the inner membrane STORM localization dataset. Since the STORM localizations are identical in all cells and are independent of the brightfield image, any difference in the optimization result is due to the initial guesses. As expected, the obtained radial distribution histograms almost perfectly overlap with the ground truth histograms. The obtained median relative χ 2 values are all 0.99, however some small outliers are found, where the population sizes of values >50 are 2, 3 and 9 respectively. In Fig 7 for 10000 photons and brightfield optimization it can be seen that the ground-truth result is matched closely but not completely. This is unexpected since the cells were generated from a perfect second-degree polynomial shape and therefore it was expected that the optimization result would more accurately describe the cell's shape. The χ 2 value of the obtained result was lower compared to the ground-truth, thus the problem is not that the algorithm returned a local minimum instead of a global minimum. Therefore, switching to a global optimization algorithm such as differential evolution [87] will not yield the desired improved result.
Instead, it was found that in the optimizer's implementation a choice of speed over accuracy is the origin of the discrepancy. Specifically, in the bootstrapping process to calculate a simulated brightfield image to compare with the measured image, the radial distribution of the brightfield image is calculated by normal binning (histogram) compared to convolution with a Gaussian kernel. This choice was made to keep the calculation times manageable, and optimizing of this procedure, possibly through a two step (course and fine) optimization process is expected to yield improved results.
The STORM optimization results (lower row, Fig 7) do closely match the ground-truth result. Any deviation originates from a small number of cells which do not converge to the correct solution. Since this population is larger for lower photon brightfield images, which has a less accurate segmentation result, this is be attributed to poor initial guesses for the coordinate system.
Finally, using an optimizer that is restricted by bounds on the optimization parameters is likely to reduce the subpopulations in both brightfield and STORM optimization. These features and the issues mentioned above will be addressed in a future performance update of ColiCoords.

Conclusions
ColiCoords is an open-source software package to analyse fluorescence microscopy data of rod-shaped cells. It allows for the transformation of Cartesian coordinates from any data source to cellular coordinates. This transformation can then be used to obtain distributions of fluorescence along the cell long or short axes or its perimeter and align the whole cell to allow for the combination of data from many cells.
The open-source nature and Jupyter-notebook based workflow together with an open and compact file format promotes open and reproducible analysis of microscopy data. As illustrated in this paper, ColiCoords facilitates the publication in the form of a 'reproducible article' [88], as recently featured in eLife, where data and code to generate the figures are bundled together with the publication. The code and data to generate the figures in this article is available online (see Methods).
ColiCoords' code is well documented and the main features are illustrated by example notebooks, which can be tested live on Binder. The software itself can be installed directly from both the Anaconda and PyPi package managers.
Analysis of synthetic data has shown that we can determine the cellular coordinates of SMLM super-resolution datapoints based on only the brightfield image of the cell with high accuracy (5%). If the coordinate system is derived from super-resolution measurements of a membrane marker, the accuracy is further increased (1%).
Although the current state of ColiCoords is a mature project, future updates are planned to be released, either by us or in collaboration with the community. Three major updates are planned; first a performance update to increase speed and accuracy of the optimization process, second a metadata update allowing the storage of image-associated metadata, and finally an update to allow the organization of cell objects into a hierarchical structure to process lineage and temporal information.

Cy3B-NHS staining
E. coli MG1655 cells were grown overnight in LB medium and subsequently diluted 1000 times in EZ rich medium (Teknova) with 0.2% glucose. The cells were grown until the culture reached an OD600 value of 0.3 after which the cells were centrifuged at 3000 rcf for 5 minutes and resuspended in PBS buffer. The amino-reactive dye Cy3B-NHS (GE Healthcare) was added to a final concentration of 1 mM and incubated while protected from light at room temperature for 1 hour [89]. Next, the cells were centrifuged at 3000 rcf for 10 minutes and the cell pellet was resuspended in PBS.

eGFP staining
E. coli BL21(DE3) cells were grown overnight in LB medium and subsequently diluted 1000 times in EZ rich medium (Teknova) with 0.2% glucose. The growth media was supplemented with the appropriate antibiotics. When the cell culture reached an OD600 of 0.3, eGFP expression was induced with IPTG at a final concentration of 200 μM for 30 minutes at 37˚C. For imaging, 3 μL of the bacterial culture was transferred to a microscope coverslip and covered by an agarose pad [90,91].
Imaging was done on an Olympus IX83 inverted microscope with an Olympus UAPON 100x NA 1.49 TIRF oil immersion objective. The excitation light (514 nm, Coherent) was coupled into the objective via the ET-442/514/561 Laser Triple band set (69904, Chroma) and the fluorescence was collected on a electron multiplying charge-coupled CCD camera (512x512 pixel, C9100-13, Hamamatsu). The focal position was held constant by the Olympus ZDC2 and images on multiple positions were collected automatically using Olympus' CellSens software.

Super-resolution imaging
LacY-eYFP expressing E. coli C41 cells were grown overnight in LB medium with appropriate antibiotics and subsequently diluted 1000 times in EZ rich medium (Teknova) with 0.4% glycerol. When the cell culture reached an OD600 of 0.3, Lacy-eYFP expression was induced with 0.01%(v) L-arabinose for 30 minutes at 37˚C. The cells were centrifuged at 3000 rcf for 5 minutes and resuspended in EZ rich medium with 0.2% glucose. 3 μL of the bacterial culture was transferred onto a microscope coverslide and covered by an agarose gel pad.
Images were captured with an exposure time of 50 ms using the open source software μManager [92,93].
To obtain the final STORM reconstruction, 2000 frames were collected and processed with the ImageJ plugin ThunderSTORM [68].
All data and code used to generate the figures in this article can be found at Zenodo (DOI: 10.5281/zenodo.2637790) and GitHub, respectively.