Robust nonparametric quantification of clustering density of molecules in single-molecule localization microscopy

We report a robust nonparametric descriptor, J′(r), for quantifying the density of clustering molecules in single-molecule localization microscopy. J′(r), based on nearest neighbor distribution functions, does not require any parameter as an input for analyzing point patterns. We show that J′(r) displays a valley shape in the presence of clusters of molecules, and the characteristics of the valley reliably report the clustering features in the data. Most importantly, the position of the J′(r) valley (rJm′) depends exclusively on the density of clustering molecules (ρc). Therefore, it is ideal for direct estimation of the clustering density of molecules in single-molecule localization microscopy. As an example, this descriptor was applied to estimate the clustering density of ptsG mRNA in E. coli bacteria.

Many algorithms have been adopted, utilized, or developed, in the field of SMLM for analyzing localization data of molecules and quantifying inter-molecular organizations [13,14,[16][17][18][19][20][21][22][23]. These methods provide means to identify statistically the forming of clustering molecules from random populations, to examine complex patterns of molecular organization, to segment molecules into clusters, and to quantify clustering features. For example, pair-correlation analysis has been applied to SMLM data on membrane proteins to identify the presence of clusters, as well as to estimate various cluster features, such as the density of molecules in a a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 cluster and overall size of a cluster [16,24,25]. In addition, density-based algorithms such as DBSCAN (density-based spatial clustering of applications with noise) [26,27] and OPTICS (ordering points to identify the clustering structure) [28,29] have been exploited to identify clusters of proteins and nucleic acids, as well as to probe the clustering structures, in both bacteria and animal cells [13,14,[17][18][19]. Other methods that have been used for analyzing SMLM data include Ripley's K/L/H functions and their derivatives [20,21]. More recently, Bayesian analysis and Voronoï diagrams have been utilized to segment molecules into clusters and to analyze the clustering properties [22,23].
Segmentation and tessellation methods typically require human inputs as algorithmparameters. For example, DBSCAN requires two parameters (a radius, eps, and the minimum number of points in the neighborhood for a point to be considered as a core point, minPts) [26][27][28][29], and they are known to be sensitive to the chosen parameters [18,30]. The identification of clusters in the Voronoï diagram based method also requires a density threshold to determine whether points form clusters [23]. Although various techniques have been proposed to determine "appropriate" parameters for use [23,27,29,31], bias is inevitably introduced by the choice of parameters in these algorithms.
It has been found that nonparametric algorithms could directly report some of the clustering features of molecules. For example, pair correlation analysis allowed to fit the computed correlation from experimental data to collect two fitting parameters that are coupled to the density of clustering points (ρ c ), the number of clusters N c and the density of random points ρ r [16,24,25]. In addition, it has been reported that the derivative of Ripley's H function, H 0 (r) gave the size of clusters (R c ) reliably from the r-value corresponding to the minimum of [32,33]. More importantly, it was found that r H 0 m only depends on the cluster size but insensitive to other clustering features such as the densities of clustering and random points [32].
Here we present another descriptor based on nearest neighbor distribution functions for directly reporting the density of clustering molecules (ρ c ) in SMLM data. We examined the nearest neighbor function G(r) [34], the spherical contact distribution function F(r) [34], and the J-function J(r) = (1 − G(r)) / (1 − F(r)) [35,36], and found that the associated derivative functions, G 0 (r) and J 0 (r), reliably report the clustering features of points. In the presence of clusters, G 0 (r) and J 0 (r) are peak/valley shaped. Most importantly, we observed that the position of the J 0 (r) valley, r J 0 m , depends exclusively on the density of clustering points (ρ c ). Therefore, unlike r H 0 m from Ripley's H function that reports the cluster size, our descriptor, r J 0 m , is ideal for direct measurements of the clustering density of molecules. As an example, we applied J 0 (r) and r J 0 m to estimate the clustering of ptsG mRNA in E. coli. We expect that this nonparametric descriptor, J 0 (r), together with H 0 (r) [32,33], will be useful in a broad range of applications in SMLM.

G(r), F(r) and J(r), and their derivatives
When quantifying the spatial organization of biological molecules in SMLM data, of particular interest in certain situations is the clustering or aggregation of molecules [37][38][39][40], which is featured by an enhancement in the local density of molecules. This enhancement in density has been used to identify clusters methods such as DBSCAN, OPTICS, and Voronoï tessellation [13,14,[17][18][19][20][21][22][23]. On the other hand, the enhancement in the molecular density is also accompanied by the decrease of intermolecular distances, which could be described by functions based on nearest neighbor distances, such as pair-wise correlation function [16], nearest neighbor function G(r), and spherical contact distribution function F(r) [34]. The nearest neighbor function G(r) is the distribution function of the distance r of a point (existing in the data) to the nearest other point, while the spherical contact distribution F(r) is the distribution function of the distance r of an arbitrary point in the space (not necessarily existing in the data) to the nearest point in the data [34]. In addition, another function, J(r), has been suggested by van Lieshout and Baddeley in 1996 [35], JðrÞ ¼ 1À GðrÞ 1À FðrÞ , as a better nonparametric test to determine whether data were from a Poisson process.
We first explored how G(r), F(r) and J(r) functions depend on the clustering features of points using numerical simulations. Briefly, we generated points forming various clusters in the presence of noises (i.e., Poisson random points) in a region of interest, and computed these three functions. In a two-dimensional Poisson random process where points were not forming clusters (Fig 1A), the nearest neighbor functions gave the expected curves, G p (r) = F p (r) = 1 − exp(−λπr 2 ) (where λ is the density of points) and J p (r) = 1 (Fig 1C). However, when points aggregated into clusters (Fig 1B), both G(r) and J(r) deviated significantly from those for random points, while F(r) became only slightly different (Fig 1D). We observed that J(r) droped from 1 to * 0.4 when r increased from 0 to 5 nm, while G(r) raised in the same r-range (0-5 nm). This observation indicates that G(r) and J(r) can be used for detection of clusters.
Furthermore, to remove accumulative effects, and inspired by Kiskowski et.al. [32], we calculated the derivatives of these functions: G 0 (r), F 0 (r) and J 0 (r). Striking peaks or valleys appeared in G 0 (r) and J 0 (r) if points formed clusters ( Fig 1F). In contrast, these derivative functions remained essentially flat for random points (Fig 1E). On the other hand, F 0 (r)'s were very similar in the two cases (Fig 1E and 1F).
Dependence of G 0 (r) and J 0 (r) functions on clustering features To explore quantitative applications of G 0 (r) and J 0 (r), we examined how they change with varying clustering features in the point patterns. Here we focused on the following features: the radius of clusters, R c , the density of clustering points (i.e., clustering density), ρ c , the number of clusters, n c , the density of random noise points (i.e., background points), ρ r , and the width (W) and height (H) of the region of interest (ROI). The first three features, R c , ρ c and n c , are directly related to the properties of clusters in the data, while ρ r is an indicator of the noise level. By varying one feature at a time, we observed that changes in ρ c , ρ r , or R c resulted in horizontal shifting or vertical scaling of both G 0 (r) and J 0 (r) (Fig 2A-2C). For example, both G 0 (r) and J 0 (r) shifted to the left and scaled up when the clusters became denser (ρ c increased). If the clusters became bigger (R c increased) while keeping the clustering density constant, little horizontal translation was observed (Fig 2C), although both G 0 (r) and J 0 (r) scaled up too. In contrast, G 0 (r) and J 0 (r) were not as sensitive to the number of clusters (N c ) or the size of the ROI, W and H (Fig 2D-2F).
We further quantified the dependence of G 0 (r) and J 0 (r) on the clustering features. By fitting G 0 (r) and J 0 (r) with polynomials, both the amplitude (i.e., height of G 0 ðrÞ: G 0 m , or depth of J 0 ðrÞ: J 0 m ) and the positions of the peaks and valleys (r G 0 m and r J 0 m , respectively) were determined. The dependence of these values on the clustering features are shown in Fig 3, and S1-S3 Figs. We observed that both G 0 m and J 0 m depend on all the clustering features (S1 and S3 Figs), but r G 0 m and r J 0 m are most sensitive to the density of clustering points ρ c (Fig 3 and S2 Fig). Most interestingly, r J 0 m is essentially independent on all the other clustering features except the density of clustering points ρ c (Fig 3), providing a way to correlate r J 0 m with directly measuring the clustering densities of molecules, as shown below.
Robust direct measurement of clustering density by r J 0 m Our quantifier r J 0 m can be used for direct measurements of clustering densities of molecules. We first confirmed that the r J 0 m À r c relation is independent on other clustering features when simultaneously varing both ρ c and R c , or N c , or ρ r Á Á Á. We found that the r J 0 m À r c relation from all the simulations collapsed onto a single curve, as shown in Fig 4A. This curve was fitted very well (R 2 = 0.9946) by a power-law function r J 0 03. This curve provides a "calibration" that can be used to directly estimate the clustering density of molecules.
"Noises" are almost always present in SMLM data, due to individual molecules not forming clusters, non-specific labeling, and/or false-positive localizations. A crucial question to examine is how this quantifier r J 0 m is affected by noises. As shown in Figs 3B and 4, r J 0 m is independent on the density of random noise points (or background points) in the data, strongly suggesting that it is likely to be robust to use r J 0 m to measure the clustering density of molecules (ρ c ). To rigorously assess the robustness of the r J 0 m À r c relation, we systematically investigated how r J 0 m deviates in the presence of various amount of noises for a given clustering density. First we looked at how r J 0 m changes with increasing ratios between the number of clustering points n cp to the number of random (background) points n rp , β = n rp /n cp . We found that r J 0 m remained constant when there were up to *10 times more noise points than clustering points. The relative errors m is without background points) were below 5% for β ≲ 10 (Fig 5A), indicating that the r J 0 m À r c relation is very robust. In addition, as a more rigorous test, we also examined how the relative error d r J 0 m behaves with increasing relative density between clusters and background, i.e., ρ c /ρ r . We found that r J 0 m was robust ρ c /ρ r ! 2 with the relative error d r J 0 m below 10% ( Fig 5B). As ρ c /ρ r decreased below 2, d r J 0 m started to increase quickly,  reaching * 30 − 40% for ρ c /ρ r = 1.5. Although not completely degraded, the accuracy of r J 0 m started to compromise for ρ c /ρ r < 2. Therefore, it is suggested that r J 0 m be used for SMLM data with ρ c /ρ r ! 2 to ensure the accuracy.
It is expected that the error in measuring the density of clustering points is more relevant in real applications. Therefore, we also investigated the capability of using the r J 0 m À r c "calibration" curve to estimate the clustering density of molecules in the presence of various amount of background noise points. Briefly, for each tested ground-truth clustering density (ρ c ), we varied the density of background points (ρ r ) such that ρ c /ρ r ranged from 2 to 10. For each pair of (ρ c , ρ r ), we generated 50 simulated data and computed J 0 (r) and r J 0 m for each simulation. The "measured" clustering density r m c (averaged over the 50 simulations) was then obtained from the r J 0 m À r c "calibration" curve, r m c ¼ ððr J 0 m À bÞ=AÞ À 1=a (Fig 4A). The relative error in the measured clustering density was quantified by d r c ¼ jr m c À r c j=r c Â 100%. We observed that the error in "measured" clustering densities r m c were close to the ground-truth density ρ c (≲ 10% for ρ c /ρ r ! 3 as shown in Fig 6) although the relative error increased as ρ c /ρ r decreased (* 20 − 25% for ρ c /ρ r = 2, shown in Fig 6), suggesting that it is robust to use r J 0 m to estimate clustering density (ρ c ) in point patterns.

J 0 (r) for heterogeneous clusters
It is known that, in certain applications, molecules of interest might form heterogeneous clusters [16,23]. We examined heterogeneity arising from either clustering radius (R c ) or clustering density (ρ c ). Briefly, simulations were run for clusters with two different clustering radii (R c1 and R c2 ), or two different clustering densities (ρ c1 and ρ c2 ), in the presence of random noises. We noticed that J 0 ðr c1 ;r c2 Þ ðrÞ from heterogeneous clusters with different clustering densities shifted both horizontally and vertically, and fell between the two curves from homogeneous clusters, J 0 r c1 ðrÞ and J 0 r c2 ðrÞ (Fig 7). In addition, we observed that J 0 ðr c1 ;r c2 Þ ðrÞ overlapped very well with J 0 " r c ðrÞ from a homogeneous sample with a clustering density equal to the algebraic mean, " r c ¼ ðr c1 þ r c2 Þ=2 (Fig 7). It is noted that G 0 (r) shows a similar behavior. Therefore, G 0 (r) and J 0 (r) report only the average clustering density throughout the region of interest; they cannot distinguish different clustering densities in heterogeneous clusters. In contrast, for heterogeneous clusters with different radii, J 0 ðR c1 ;R c2 Þ ðrÞ shifted only in the vertical direction. The position of the valley, r J 0 m , did not change for heterogeneous clusters with different radii (S4 Fig), which is expected because the r J 0 m À r c relation does not depend on R c . In addition, we found that J 0 ðR c1 ;R c2 Þ ðrÞ is equivalent to J 0 " R c ðrÞ from homogeneous clusters with a radius of " Fig). Therefore, r J 0 m can be robustly used for heterogeneous clusters with different cluster sizes but the same clustering density.  Application of J 0 (r) to ptsG mRNA in E. coli bacteria As a simple example, we applied our method based on J 0 (r) and r J 0 m to measure the clustering density of ptsG mRNA, encoding a primary glucose transporter in E. coli bacteria. The ptsG mRNA were labeled through fluorescence in situ hybridization (FISH) by 7 Alexa 568-conjugated oligonucleotide probes, and were imaged by stochastic optical reconstruction microscopy (STORM) with a resolution of * 20 nm in x/y and * 50 nm in z. Three example bacteria were shown in Fig 8A. The average number of localizations per bacterial cell was 1576 ± 357 (mean ± standard error). The J 0 (r) function from the localizations were computed (orange curve in Fig 8C), which gave r J 0 m % 1:707 nm and an estimated density of ρ c % 0.187 nm −2 .
As a comparison, the same ptsG mRNA in E. coli bacteria were labeled by 14 probes via FISH, with three example cells shown in Fig 8B. The clusters of localizations appeared larger than those with 7 probes. Quantitatively, we measured 3090 ± 377 (mean ± standard error) localizations per bacterial cell, which was expected as the number of probes was doubled. However, as the spacing between 14 probes was similar to that between 7 probes, we expected that the density of localizations remained the same. We computed the J 0 (r) function for the sample labeled with 14 probes and found that the curve (blue curve in Fig 8C) overlapped well with that from the sample with 7 probes, indicating that the clustering density was unchanged. This observation was confirmed by examining r J 0 m (1.699 vs. 1.707) and the estimated clustering density (0.195 nm −2 vs. 0.187 nm −2 , or * 4% difference, Fig 8D), showing that the density estimated from r J 0 m was independent on the cluster size.

Discussion
To conclude, we explored the possibility of utilizing nearest neighbor functions to quantify spatial patterns of molecules in single-molecule localization microscopy. We observed that the associated derivative functions, G 0 (r) and J 0 (r), can reliably report the clustering features of point patterns. We found that J 0 (r) is particularly useful because its position, r J 0 m , relies exclusively on the density of clustering points (ρ c ). More importantly, we showed that this r J 0 m À r c relation is very robust in the presence of up to *10 times more noise points than clustering points, or when the relative density (ρ c /ρ r ) is ≳ 2. As an example, we applied J 0 (r) and r J 0 m to robustly estimate the clustering of ptsG mRNA in E. coli.
In the current study, we chose not to exploit any border correction when computing the nearest neighbor functions. A simplest approach for border correction is the "reduced sample" method [41], which focuses on the points lying more than r away from the boundary of the region of interest. However, the "reduced sample" method discards much of the data, and therefore unacceptably wasteful. In addition, it's particularly inappropriate in certain applications where points are preferentially located at the boundary, an example of which is the spatial organization of high-copy number plasmids in bacteria [14]. We note that more sophisticated methods for border correction are available, including the Kaplan-Meier correction [42] and the Hanisch correction [43], both are provided in the spatstat R-package [44,45]. These edge corrections can be readily used in our method. However, for the sake of simplicity, uncorrected estimators for the nearest neighbor functions have been used in the current study.
We would like to emphasize that the current method based on nearest neighbor functions is nonparametric and robust. Computing the nearest neighbor functions and their derivatives does not require any parameters as human inputs, eliminating possible subjective biases that might exist in other algorithms such as DBSCAN and OPTICS. In addition, the performance of this method is robust in the presence of noise/background points. The nonparametric nature and robustness of the current method would allow broad applications in the field of single-molecule localization microscopy.
We expect several types of applications of our method in the field of SMLM. First, it can be used as a direct quantification of the clustering density (ρ c ) of molecules in biological samples. Second, although it does not identify clusters by itself, our method, in combination with Ripley's H 0 (r) function [32,33], provides objective means to determine parameters (i.e., clustering density and cluster size) that can be used in other clustering-identification algorithm such as DBSCAN and Voronoï tessellation. In addition, in the current work, we focused on the r J 0 m À r c relation for non-parametric measurement of the clustering density of molecules; however, we expect that it is possible to design ways to figure out other cluster features (such as R c and ρ r ) by taking advantage of the dependence of G 0 m and J 0 m on those features (S1 and S3 Figs), together with the information of ρ c .

Spherical contact distribution function F(r), nearest-neighbor distribution function G(r), and the J function J(r)
In a set of points, X, in the k-dimensional space, the spherical contact distribution function, or sometimes referred to as the empty space function, F(r), of X is defined as F(r) = P{d(y, X) r}, where d(y, X) = min{|y − x|: x 2 X} is the distance from an arbitrary point, y, to the nearest point of the point process, X [34]. For a Poisson process with arrival intensity λ (equivalent to density in the context here) in the k-d space, F p ðrÞ ¼ 1 À exp À l p k=2 r k Gð1þk=2Þ [34]. The nearestneighbor distribution function G(r) is very similar to F(r): G(r) = P y {d(y, X) r} where P y is the Palm distribution, which is the conditional distribution of the entire process given that y is one point in X [34]. Therefore, G(r) is the distribution function of the distance from a point of the process to the nearest other point of the process, i.e., the "nearest-neighbor". For a Poisson process in the k-d space, G p ðrÞ ¼ 1 À exp À l p k=2 r k Gð1þk=2Þ ¼ F p ðrÞ [34]. In 1996, van Lieshout and Baddeley suggested using the quotient JðrÞ ¼ 1À GðrÞ 1À FðrÞ to characterize a point process [35]. For a Poisson process, J p (r) = 1.

Simulation and computation of G(r), F(r), J(r) and their derivatives
Sets of points were generated in R programing language [46]. In a region of interest with a width (W) and a height (H), n c circular clusters with radii of R c were randomly distributed. Each cluster contains random points at a density of ρ c . Poisson noise points were added randomly to the whole region of interest, with a density ρ r . The total number of clustering points (n cp ¼ n c Á r c Á pR 2 c ) and the total number of noise points (n rp = ρ r Á WH) define the noise level β = n rp /n cp .
Simulations were run using various sets of cluster features (W, H, ρ r , ρ c , n c , R c ). For each set of features, 50-200 trials were run. The G(r), F(r), J(r) functions and their derivatives were computed using the spatstat package [44,45], without applying any edge corrections.

Bacterial sample preparation
Bacterial sample for imaging was prepared as previously published [13]. Briefly, an E.coli MG1655 derivative strain DJ480 (D. Jin, National Cancer Institute) was grown in MOPS EZ rich defined medium (TEKnova) supplemented with 0.2% fructose and 0.2% glucose at 37˚C until OD600 reached 0.15-0.25. Cells were then fixed with 4% formaldehyde in 1X PBS and permeabilized with 70% ethanol. Chemically synthesized single molecule FISH (smFISH) probes (20 nucleotides each) were designed using Stellaris Probe Designer and ordered from Biosearch Technologies (http://www.biosearchtech.com). Seven or 14 probes against ptsG mRNA were then polled and labeled with Alexa Fluor 568 succinimidyl ester (Life Technologies). Permeabilized cells were washed once with FISH wash solution (10% formamide in 2X SSC) and resuspended in hybridization buffer (10% dextran sulfate and 10% formamide in 2X SSC) containing labeled FISH probes. Hybridization reactions were incubated in the dark at 30˚C overnight. On the second day, the cells were washed three times with FISH wash solution. After the wash, the cells were pelleted, resuspended into 4X SSC. For imaging, cells were immobilized to poly-L-lysine treated 1.5 borosilicate chambered coverglass (Thermo Scien-tific™ Nunc™ Lab-Tek™).

Super-resolution imaging and reconstruction
SMLM was performed on an inverted optical microscope (Nikon Ti-E with 100X NA 1.49 CFI HP TIRF oil immersion objective) with a yellow laser (561 nm, 150 mW, Coherent Obis LS) and a violet laser (405 nm, 25 mW, CrystaLaser) fiber coupled to the microscope body. Laser lines are reflected by a dichroic mirror (Chroma zt405/488/561/647/752rpc-UF3) having near-TIRF excitation. The emission signal was collected by the objective, filtered by emission filters (Chroma ET595/50m), and imaged on a 1024X1024 EMCCD camera (Andor iXon Ultra 888).
Although a cylindrical lens with 10 m focal length (CVI RCX-25.4-50.8-5000.0-C-415-700) was inserted in the emission path, allowing 3D imaging [3], detected spots within a z slice (Δz = ±100 nm) were used as a 2D projection. Violet laser power was modulated to keep the number of blinking-on spots above 50% of the number of cells in the field of view. When the number of blinking-on spots reached less than this, even with the maximum violet laser power, the acquisition was terminated. The power density lasers on the sample was *4300 W Á cm −2 for yellow laser and the maximum power density for the violet laser was about 130 W Á cm −2 . Imaging buffer was composed of 10mM NaCl, 50mM Tris (pH = 8.0), 10% glucose, 30 Unit of glucose oxidase (G2133-10KU, Sigma-Aldrich) and 454.5 Unit of catalase (219001, EMD Millipore) in 4X SSC.
The data analysis algorithm was adopted from previous published [2,3], and modified to handle multi-color and 3D images as previously published [13]. Briefly, all the pixels with intensity values greater than 3.5-4.5 fold of the standard deviation in each frame were identified. Within a 5-by-5 pixel area, local maximum intensity pixels whose intensity values were greater than its 24 surrounding pixels were found to represent the intensity peak of a single fluorophore. For identified peaks, a square region of 19×19 pixels surrounding local maximum intensity pixel was fitted with an Elliptical Gaussian function [3].

Gðx; yÞ
where b is the background level, h is the amplitude of the peak, w x and w y are elliptical widths, x 0 and y 0 are the center coordinates of the peak. The z-positions of the fluorophores were determined by comparing their w x and w y values to a calibration curve. Z-drift was prevented in real time Nikon perfect focus system. The horizontal drift was corrected during data analysis by fast Fourier transformation [13]. Finally, the acquired localization were used to generate reconstructed super-resolved images [3,13,14] and for quantitative analysis using G(r), F(r), and J(r), as well as their derivatives.