Accounting for Limited Detection Efficiency and Localization Precision in Cluster Analysis in Single Molecule Localization Microscopy

Single Molecule Localization Microscopy (SMLM) techniques such as PhotoActivated Localization Microscopy (PALM), with their sub-diffraction-limit spatial resolution, are widely used to characterize the spatial organization of membrane proteins by means of quantitative cluster analysis. However, such quantitative studies remain challenged by the techniques' inherent sources of error: a limited detection efficiency of less than 60%, due to incomplete photo-conversion, and a limited localization precision in the range of 10–30 nm, which varies across the detected molecules, mainly depending on the number of photons collected from each. We provide analytical methods to estimate the effect of these errors on cluster analysis and to correct for them. These methods, based on Ripley's L(r) − r or the Pair Correlation Function, both widely used by the community, can facilitate potentially breakthrough results in quantitative biology by providing a more accurate and precise quantification of protein spatial organization.

The randomness in the estimate of $K(r)$ due to subsampling arises from the Bernoulli random variables $B_1, B_2, \ldots, B_N$. The mean and variance of $K(r)$ can be estimated using its first and second moments.
The first moment, $E[K(r)]$, is given in (1) and the second moment, $E[K(r)^2]$, in (2). Since the $B_i$ are i.i.d. and $B_i^2 = B_i$, evaluating the expectations in (1) and (2) reduces to computing one expectation over the $B_i$, which can be computed directly or approximated with an integral.
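As an illustration of how the moments follow from $B_i^2 = B_i$, the sketch below computes the exact mean and variance of a thinned pair count. It is a simplified stand-in rather than the estimator of the paper: it assumes the unnormalised sum $S = \sum_{i \neq j} B_i B_j \, 1(\|X_i - X_j\| \le r)$ with a fixed detection probability `p`, so any normalisation by the (random) number of detected points is ignored. The function names and the example coordinates are hypothetical.

```python
import itertools
import math

def pair_indicators(points, r):
    """a[i][j] = 1 if points i and j are distinct and within distance r."""
    n = len(points)
    return [[1 if i != j and math.dist(points[i], points[j]) <= r else 0
             for j in range(n)] for i in range(n)]

def thinned_pair_count_moments(points, r, p):
    """Exact mean and variance of S = sum_{i != j} B_i B_j a_ij,
    with B_i i.i.d. Bernoulli(p) (assumed, unnormalised pair count)."""
    a = pair_indicators(points, r)
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(n) if a[i][j]]
    mean = p ** 2 * len(pairs)
    # B_i^2 = B_i, so E[B_i B_j B_k B_l] = p ** (number of distinct indices).
    second = sum(p ** len({i, j, k, l}) for i, j in pairs for k, l in pairs)
    return mean, second - mean ** 2

def brute_force_moments(points, r, p):
    """Same moments by enumerating all 2^N detection patterns (tiny N only)."""
    a = pair_indicators(points, r)
    n = len(points)
    m1 = m2 = 0.0
    for b in itertools.product((0, 1), repeat=n):
        prob = math.prod(p if bi else 1 - p for bi in b)
        s = sum(b[i] * b[j] * a[i][j] for i in range(n) for j in range(n))
        m1 += prob * s
        m2 += prob * s * s
    return m1, m2 - m1 ** 2

# Hypothetical toy configuration (arbitrary units) and 60% detection efficiency.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (3.0, 3.0), (3.5, 3.0)]
mean, var = thinned_pair_count_moments(points, r=2.0, p=0.6)
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

The quadruple sum costs on the order of the squared number of close pairs, so this form is only practical for small point sets; it is meant to make the moment calculation concrete, not to be efficient.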

Exact computation of K-function in the presence of localization uncertainty
For example, if the $W_i$ are assumed to be independently drawn zero-mean Gaussian random vectors, the vector $X_i - X_j + W_i - W_j$ is a Gaussian random vector with mean $X_i - X_j$ and covariance equal to the sum of the covariances of $W_i$ and $W_j$, and hence $\|X_i - X_j + W_i - W_j\|^2$ is a non-central $\chi^2$ random variable, with known distribution. The expected value of $K(r)$ is then a sum over pairs of points, and the expected value inside the summation is nothing but the complementary cumulative distribution function of a non-central $\chi^2$ random variable, which is easily computed using the Marcum Q-function.
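The per-pair term can be sketched numerically as follows. The example assumes two-dimensional, isotropic Gaussian localization errors with per-axis standard deviations `sigma_i` and `sigma_j`, so that the squared noisy distance divided by $\sigma_i^2 + \sigma_j^2$ is non-central $\chi^2$ with 2 degrees of freedom and non-centrality $\|X_i - X_j\|^2 / (\sigma_i^2 + \sigma_j^2)$. The probability that the noisy distance is at most $r$ is then the distribution function of that variable, equivalently one minus a Marcum Q-function value. The function names, the series truncation, and the numbers are illustrative assumptions.

```python
import math
import random

def ncx2_cdf_2dof(x, nc, terms=200):
    """CDF of a non-central chi-square with 2 degrees of freedom.

    Poisson-mixture series: Poisson(nc/2) weights applied to central
    chi-square CDFs with 2 + 2k degrees of freedom. Assumes nc/2 is well
    below `terms` so that the truncation error is negligible.
    """
    if x <= 0:
        return 0.0
    half_x, half_nc = x / 2.0, nc / 2.0
    poisson_w = math.exp(-half_nc)   # Poisson weight for k = 0
    gamma_term = math.exp(-half_x)   # exp(-x/2) (x/2)^j / j! for j = 0
    gamma_tail = gamma_term          # sum of gamma_term over j = 0..k
    total = 0.0
    for k in range(terms):
        # 1 - gamma_tail is the central chi-square CDF with 2k + 2 dof.
        total += poisson_w * max(0.0, 1.0 - gamma_tail)
        poisson_w *= half_nc / (k + 1)
        gamma_term *= half_x / (k + 1)
        gamma_tail += gamma_term
    return total

def prob_pair_within(xi, xj, sigma_i, sigma_j, r):
    """P(||xi - xj + Wi - Wj|| <= r) for isotropic 2D Gaussian errors."""
    s2 = sigma_i ** 2 + sigma_j ** 2   # per-axis variance of Wi - Wj
    nc = math.dist(xi, xj) ** 2 / s2
    return ncx2_cdf_2dof(r ** 2 / s2, nc)

def monte_carlo_pair_within(xi, xj, sigma_i, sigma_j, r, n=100000, seed=0):
    """Simulation check of prob_pair_within."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        dx = xi[0] - xj[0] + rng.gauss(0.0, sigma_i) - rng.gauss(0.0, sigma_j)
        dy = xi[1] - xj[1] + rng.gauss(0.0, sigma_i) - rng.gauss(0.0, sigma_j)
        hits += dx * dx + dy * dy <= r * r
    return hits / n

# Hypothetical pair: 30 nm apart, precisions 15 nm and 20 nm, radius 40 nm.
p_exact = prob_pair_within((0.0, 0.0), (30.0, 0.0), 15.0, 20.0, 40.0)
print(f"P(noisy distance <= 40 nm) = {p_exact:.4f}")
```

Summing `prob_pair_within` over all pairs, with whatever normalisation the chosen K-function estimator uses, gives the expected K-function under localization uncertainty. In practice a library routine such as `scipy.stats.ncx2.cdf` would normally replace the hand-written series.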

Justification for the choice of estimator of true locations
The estimators (3), (4) and (5) are defined in the Methods section of the main text.
The estimate of (3) can be justified under the assumption that the estimates of the x-coordinate of the cluster center in (4) and of the cluster spread of the x-coordinate in (5) are accurate. Let $\check{X}_i$ denote the estimate of (3) when the cluster center and spread are accurate. A direct calculation with $\check{X}_i$ suggests that, by using the estimates of (3), the estimate of the squared distance between any pair of points in the cluster is unbiased, and thus the K-function computed using distances between the estimated points is expected to be accurate. This is the main reason for using the estimate of (3). Note that if one were interested in minimising the squared error $E[\|\hat{X}_i - X_i\|^2]$ in the position of each molecule, then one would use the MMSE estimator of $X_i$ in place of (3). However, it was observed that in practice this leads to a shrinking of the reconstructed clusters. The current estimator does not have this drawback and has the added advantage of accurately approximating distances between points in the cluster.
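The contrast between the two kinds of estimator can be illustrated with a one-dimensional simulation. Since the exact form of (3) is not reproduced here, the sketch uses a hypothetical variance-matching shrinkage, $c + \sqrt{g_i}\,(Y_i - c)$ with $g_i = s^2 / (s^2 + \sigma_i^2)$, as a stand-in that has the unbiased-squared-distance property described above; it should not be read as the estimator of the main text. It is compared with the Gaussian MMSE estimator $c + g_i (Y_i - c)$. The cluster center $c$ and spread $s$ are assumed to be known exactly, and all numerical values are illustrative.

```python
import math
import random

def mean_sq_pair_distance(xs):
    """Average squared distance over all unordered pairs of 1D coordinates."""
    n = len(xs)
    mean = sum(xs) / n
    # Identity: the pair average of (a - b)^2 equals 2 * sum((x - mean)^2) / (n - 1).
    return 2.0 * sum((x - mean) ** 2 for x in xs) / (n - 1)

def simulate(n=20000, center=0.0, spread=30.0, seed=1):
    """Compare mean squared pair distances for true, raw and shrunk positions."""
    rng = random.Random(seed)
    true, raw, mmse, matched = [], [], [], []
    for _ in range(n):
        x = rng.gauss(center, spread)        # true x-coordinate (nm)
        sigma = rng.uniform(10.0, 30.0)      # per-molecule precision (nm)
        y = x + rng.gauss(0.0, sigma)        # observed localization
        gain = spread ** 2 / (spread ** 2 + sigma ** 2)
        true.append(x)
        raw.append(y)
        mmse.append(center + gain * (y - center))                # Gaussian MMSE
        matched.append(center + math.sqrt(gain) * (y - center))  # assumed stand-in
    return {
        "true": mean_sq_pair_distance(true),
        "raw": mean_sq_pair_distance(raw),
        "mmse": mean_sq_pair_distance(mmse),
        "matched": mean_sq_pair_distance(matched),
    }

result = simulate()
for name, value in result.items():
    print(f"{name:8s} mean squared pair distance = {value:8.1f} nm^2")
```

Under these assumptions the raw localizations overestimate pair distances, the MMSE estimates underestimate them (the cluster appears shrunk), and the variance-matched estimates recover them on average.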

Effect of clustering of localizations on reconstruction
The reconstruction method presented in the paper works on a cluster-by-cluster basis, so the SMLM localizations must first be preprocessed with a clustering algorithm such as DBSCAN [1] or others [2] before the method is applied. The method assumes that clustering errors are minimal. Example reconstructions that include clustering by DBSCAN are shown in Figure S5; they gave satisfactory results. Users are advised to try different clustering methods and parameters for a given dataset, so as to minimize clustering errors. There are obvious limitations to this approach: where clusters overlap, it may be difficult for clustering algorithms to identify the true clusters.
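The preprocessing step can be sketched as below: a minimal, quadratic-time DBSCAN written out in plain Python, followed by grouping of the localizations so that a per-cluster method can be applied to each group. This is an illustrative sketch, not the implementation used for Figure S5; the parameter values `eps` and `min_samples` and the toy coordinates are assumptions chosen only for the example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster index (0, 1, ...) or NOISE."""
    n = len(points)
    # O(N^2) neighbour search; each point counts as its own neighbour.
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue                   # already assigned
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                frontier.extend(neighbours[j])   # j is a core point: keep expanding
    return labels

def group_by_cluster(points, labels):
    """Collect the points of each cluster, dropping noise."""
    groups = {}
    for pt, lab in zip(points, labels):
        if lab != NOISE:
            groups.setdefault(lab, []).append(pt)
    return groups

# Hypothetical localizations (nm): two compact groups and one isolated point.
blob_a = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
blob_b = [(200.0, 200.0), (210.0, 200.0), (200.0, 210.0), (210.0, 210.0), (205.0, 205.0)]
points = blob_a + blob_b + [(100.0, 100.0)]

labels = dbscan(points, eps=20.0, min_samples=4)
clusters = group_by_cluster(points, labels)
for lab, members in clusters.items():
    print(f"cluster {lab}: {len(members)} localizations")
```

Re-running with different `eps` and `min_samples` values and checking how the groups change is one simple way to gauge how sensitive a dataset is to the clustering step.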
Since the reconstruction method estimates the X and Y coordinates separately, it works best if the clusters are elliptical, if not circular. As already mentioned, it is best if the clusters are well separated. Moreover, if the clustering algorithm splits a cluster with an arbitrarily complicated shape into multiple small symmetric clusters, the reconstruction may introduce artifacts: the method shrinks each cluster about its own central point, so each small cluster will be shrunk about its own center rather than about the center of the true cluster. Therefore, the user must take care that the clustering step does not introduce major errors or artifacts.
If the clusters are expected to have other specific parametric shapes (e.g., polygonal or helical), it might be possible to adapt the reconstruction method proposed in this paper to these alternative cluster shapes.