ChromoTrace: Reconstruction of 3D Chromosome Configurations by Super-Resolution Microscopy

Motivation: The 3D structure of chromatin plays a key role in genome function, including gene expression, DNA replication, chromosome segregation, and DNA repair. Furthermore the location of genomic loci within the nucleus, especially relative to each other and nuclear structures such as the nuclear envelope and nuclear bodies strongly correlates with aspects of function such as gene expression. Therefore, determining the 3D position of the 6 billion DNA base pairs in each of the 23 chromosomes inside the nucleus of a human cell is a central challenge of biology. Recent advances of super-resolution microscopy in principle enable the mapping of specific molecular features with nanometer precision inside cells. Combined with highly specific, sensitive and multiplexed fluorescence labeling of DNA sequences this opens up the possibility of mapping the 3D path of the genome sequence in situ. Results: Here we develop computational methodologies to reconstruct the sequence configuration of all human chromosomes in the nucleus from a super-resolution image of a set of fluorescent in situ probes hybridized to the genome in a cell. To test our approach we develop a method for the simulation of chromatin packing in an idealized human nucleus. Our reconstruction method, ChromoTrace, uses suffix trees to assign a known linear ordering of in situ probes on the genome to an unknown set of 3D in situ probe positions in the nucleus from super-resolved images using the known genomic probe spacing as a set of physical distance constraints between probes. We find that ChromoTrace can assign the 3D positions of the majority of loci with high accuracy and reasonable sensitivity to specific genome sequences. By simulating spatial resolution, label multiplexing and noise scenarios we assess algorithm performance under realistic experimental constraints. Our study shows that it is feasible to achieve genome-wide reconstruction of the 3D DNA path in chromatin based on super-resolution microscopy images.

The primary nucleic acid sequence of the human genome is not sufficient to understand 2 its functions and their regulation. Fitting the 6 billion basepairs or approximately 2 m of 3 double-helical DNA into an approximately 10 µm radius nucleus requires tight packing 4 of DNA into chromatin, where about 150 bp of DNA are wrapped around cylindrical 5 nucleosome core particles, which in turn can be tightly packed due to interspersed 6 flexible linker DNA [1]. In addition, each chromosomal DNA molecule occupies a discrete 7 3D volume inside the nucleus and the arrangement of these chromosome territories is 8 non-random and changes with cell differentiation [2,3]. This remarkable management of 9 23 large linear polymer molecules controls crucial functions of the genome, such as gene 10 expression, DNA replication, chromosome segregation, and DNA repair. 11 12 Structural biology techniques, such as electron microscopy, crystallography, and NMR 13 have given atomic level insights into the physical structure of the DNA double helix 14 and the nucleosome [4]. In vitro, also higher order structures such as nucleosomes 15 stacked into 11 or 30 nm chromatin fibres can be observed and studied at high resolution. 16 However, the existence of regular higher order nucleosome structures in vivo has not 17 been demonstrated under physiological conditions. To date, little direct information is 18 available about the functionally crucial 3D folding and structure of chromatin between 19 the scale of single nucleosomes (approximately 5 nm) and the diffraction limit of light 20 (200 nm), which can only resolve entire chromosome territories with a size of a few µm. 21 22 In situ, classically two general types of higher order chromatin organization have been 23 distinguished at a coarser level, euchromatin which tends to be less compact and displays 24 high gene density and activity, and heterochromatin, with a higher degree of compaction 25 and lower gene density and activity [5]. Due to the arrangement of chromatin from indi- 26 vidual chromosomes in territories, the majority of DNA-DNA interactions occur in cis, 27 while trans interactions are more rare and mostly observed on the surface of or on loops 28 outside of territories [6][7][8]. Within territories and across the whole nucleus euchromatin 29 and heterochromatin are generally spatially separated [9], leading to heterochromatin 30 rich and gene expression poor domains at the nuclear periphery and around nucleoli. 31 Gene expression is intrinsically linked to the 3D structure of chromosomes, chromatin 32 packing densities and the accessibility of DNA by e.g. the transcriptional machinery. 33 34 In the last 10 years, biochemical DNA crosslinking technologies based on chromo-35 some conformation capture (3C), have been developed to address the issue of higher 36 order chromatin structure in an indirect manner [10]. These methods have been widely 37 used to measure the average linear proximity of genome sequences to each other in cell 38 populations with good throughput and at kb resolution. The resulting contact frequency 39 maps analyzed with computational models have indirectly inferred principles of genome 40 organization [11]. A major result of these studies, is that chromosomes are organized into 41 domains of 400-800 kb that are topologically associated. These TADs are the smallest 42 structuring units of chromatin above the 150 bp nucleosome level that can be reliably 43 detected biochemically so far. Although good correlations between contact frequency and 44 function of regulatory elements has been shown for several genes [12], such crosslinking 45 technologies cannot determine the 3D position and physical distances of genomic loci 46 inside the nucleus directly.
studies have already explored the use of super-resolution microscopy to investigate 53 chromatin structure [17][18][19], such as the organization of distinct epigenetic states in 54 Drosophila cells [20] that suggested distinct folding mechanisms and packing densities 55 that correlate with gene expression. Dissection of nucleosome organization inside the 56 nucleus in single cells using super-resolution shows that higher nucleosome compaction 57 corresponds to heterochromatin while lower compaction associates with active chromatin 58 regions and RNA polymerase II, and that the spatial distribution, size and compaction 59 of nucleosome correlate to cell pluripotency [18]. While these studies provide first new 60 intriguing insights into chromatin organization they have so far largely focused on single 61 loci without a complete 3D reconstruction of a chromosome or the genome.

63
However, the resolving power of super-resolution microscopy raises the tantalizing 64 possibility to directly reconstruct the 3D path of large parts of the chromosomal DNA se-65 quence. Super-resolution microscopy can resolve unprecedentedly small volume elements 66 (approximately 20 x 20 x 20 nm [21]) inside the total nuclear volume (approximately 67 8 × 10 −6 µm 3 ), which will on average contain only up to 2 kb or a few nucleosomes. This 68 fundamental increase in information of the relative positioning of defined loci in the 69 genome can now be leveraged computationally.

71
This increase in resolution, which enables to distinguish around 60 million volume 72 elements inside a single nucleus, can be combined with any sensitive and site specific 73 fluorescence in situ hybridization (FISH) probe design that allows for spectral and/or 74 temporal multiplexing. Several methods that fulfill these criteria have recently been 75 developed, and fall within two general probe design categories. Either a primary imager 76 strand with fluorophore-containing DNA is hybridized to the genome directly [22] or 77 a primary genome-sequence specific DNA probe that facilitates transient binding of 78 the fluorophore-containing secondary imager strand is used (DNA-PAINT) [23]. Our 79 reconstruction algorithm should in principle allow the mapping of the genome sequence 80 in 3D with a resolution of tens of nucleosomes, depending on their local packing density. 81

82
Carrying out such large-scale genome mapping studies by systematic super-resolution 83 microscopy will critically depend upon choosing the best design of the necessary chro-84 mosome or genome wide fluorescent probe libraries and use sufficient resolution in the 85 employed 3D super-resolution imaging technology. To prove that such studies are feasible 86 and guide their probe design and microscope technology choices, we have developed 87 an algorithm, called ChromoTrace that uses an efficient combinatorial search to test 88 the theoretical possibility of complete three-dimensional reconstruction of chromosomal 89 scale regions of DNA inside nuclei of single human cells (Fig 1). To thoroughly test our 90 algorithm, we have developed a simulation to model chromatin packing of the human 91 genome sequence within a realistic geometry of the nucleus. Our modeled 3D architecture 92 of the genome reproduces several key characteristics of real chromosomes, and our Chro-93 moTrace reconstruction algorithm, based on suffix trees, then maps the simulated 3D 94 sequence label positions back to the reference genome. By simulating realistic resolution, 95 label multiplexing and noise scenarios we assessed the algorithm performance for different 96 experimental scenarios. Our results show that ChromoTrace can map the positions of 97 the labeled probes to the genome with very high precision and recall. Importantly, our 98 study shows for the first time that it is feasible to achieve genome-wide reconstruction of 99 the 3D DNA path in chromatin based on current super-resolution microscopy and DNA 100 labeling technology and defines the required quality of experimental data to achieve a 101 certain bp resolution and reconstruction completeness, which will be invaluable to guide 102 experimental efforts to generate such data sets systematically. In the following text we 103 use the term color equally to either represent different fluorophores, ratiometric labeling 104 with fluorophore mixtures or barcodes, or temporally separated localizations of one or 105 several fluorophores.

107
Simulation of chromatin structure of the nucleus 108 In our simulations we will model the chromatin structure of the nucleus as a self avoiding 109 walk (SAW) through a 3D lattice. A 3D lattice graph is a three dimensional grid of 110 equally spaced points, where only the nearest neighbours are connected and a SAW 111 is a path through a lattice which does not intersect itself. We choose to use SAWs as 112 they are commonly used to model chain like structures such as solvents and polymers, 113 including DNA [24].

114
To simulate the human genome we generate SAWs through a simple random process. 115 For each copy of each chromosome we pick a random starting node in the lattice that 116 satisfies three conditions.    From here the SAW is extended by picking one of the adjacent nodes at random each 121 with equal probability. If the point satisfies the above three conditions then it is added 122 to the SAW; otherwise another adjacent node is picked at random and the conditions 123 checked. This process continues until the SAW reaches the desired length or the SAW 124 becomes stuck and unable to pick any adjacent node. Should the SAW become stuck, we 125 restart this process from another node on the current SAW. Assume the current SAW is 126 of length i, we truncate the SAW to length max(0, 0.8i) and begin the process again 127 until the SAW reaches the desired length. An example of the output of the method is 128 shown in  In this section we describe ChromoTrace, a new algorithm to identify the 3D structure of 134 chromosomes from a set of labeled points. We begin with an intuitive description of the 135 algorithm and then present the process more formally. The input given to the algorithm 136 is a segmentation file, consisting of a list of (x, y, z) coordinates with associated colors 137 and a labeling file, consisting of a list of genomic locations with associated colors and a 138 distance threshold. The goal of the algorithm is to correctly map the (x, y, z) coordinates 139 to their genomic locations. A brief outline of the ChromoTrace algorithm is given below. 140 1. Build a suffix tree of the labeling data.     ii. Extend located paths one character at a time until the extension becomes 147 ambiguous.

148
iii. Attempt to resolve ambiguous extensions. 149 iv. Repeat from Step 2)b)i) until no paths can be extended.

150
(c) Remove located paths from suffix tree and distance graph. Suffix Trees A suffix tree is a well understood indexing data structure [25,26] that 155 allows for very fast searching of a subsequence within a sequence. A suffix tree efficiently 156 stores every subsequence of the indexed sequence, for a sequence of length n it has 157 exactly n leaves and has total size proportional to n. The subsequence of the indexed 158 sequence are spelled out as paths from the root of the suffix tree (Fig 3).

159
Distance Graph Key to our algorithm is the construction of a graph from the (x, y, z) 160 coordinates in the segmentation file. Each (x, y, z) coordinate will be represented as a 161 node in the graph and two nodes are connected if and only if the Euclidean distance 162 between them is less than a threshold . T is a user defined value that should be modified 163 depending on spacing of the probe design and resolution of the image.

164
A path in a graph is a sequence of edges, e 1 , e 2 , . . . , e m , connecting a sequence of 165 nodes v 1 , v 2 , . . . , v m+1 . We define a trivial path in the distance graph as a path such 166 that every node connected by the path other than v 1 and v m+1 has exactly two adjacent 167 nodes. A trivial path is maximal if it cannot be extended at either end. More formally we 168 define a trivial path as a path e 1 , e 2 , . . . , e m such that it's node sequence v 1 , v 2 , . . . , v m+1 169 satisfies the following: where |v i | denotes the number of adjacent nodes of v i . A trivial path is maximal if it is 171 also true that for i equal to 1 and m that |v i | = 2.

173
After building the suffix tree and processing the segmentation file to build the distance 174 graph we must search the graph to find all of the maximal trivial paths. The set of 175 maximal trivial paths can be found by first storing the number of neighbours each node 176 has and then processing this list. Given the set of maximal trivial paths it is simple to 177 extract the sequence of colors each trivial path represents and to search for this sequence 178 in the suffix tree. If the sequence occurs uniquely in the suffix tree we associate this 179 path with the genomic location found in the suffix tree.

180
Once we have a set of paths mapped to a genomic location we also know which color 181 is expected at the next position in the path. Using this information we explore the 182 distance graph and extend the path with the expected color if there is only one adjacent 183 node with this color. Once we have extended in this way as much as possible there may 184 exist paths where the expected extension is ambiguous. More specifically we may have a 185 path where there are two or more adjacent nodes that are labelled with the expected 186 color. In this situation we find the next L expected colors and search the distance graph 187 for this combination. if there exists an unambiguous extension we add this to the path, 188 otherwise we stop. L is a user defined input where larger values of L will make the 189 algorithm slower but likely increase recall. We repeat this extension process iteratively 190 until no paths can be extended. All of the mapped loci are then removed from the 191 distance graph and suffix tree and the algorithm started again. This entire process is 192 repeated until no more paths can be extended and no more trivial paths found.

194
Simulation of reasonable chromatin structure in the nucleus 195 The nucleus is delimited by the nuclear envelope and contains the nucleolus and chromatin. 196 To simulate the results of super-resolved detection of large-scale probe hybridizations, 197 we need to build a model of reasonable packing densities of DNA in the human nucleus. 198 The precise local density of DNA in the human nucleus is surprisingly unclear due to 199 the uncertainty regarding the in vivo structure of chromatin. To tackle this problem 200 with reasonable computational complexity at the scale relevant for super-resolution 201 microscopy our simulation uses an intermediate grained self-avoiding-walk (SAW) model 202 for chromatin on a grid of points. We assume random packing of chromatin and 203 an average density corresponding to the highest values estimated in human cells, to 204 conservatively estimate the sequence reconstruction challenge at the single cell level. 205 Although real biological data will likely show stronger heterogeneity in local packing 206 density, this would aid rather than hinder the reconstruction task (see Discussion). 207 We focus here purely on the single cell reconstruction problem, although cell-to-cell 208 structure conservation is likely to improve the reconstruction ability by allowing averag-209 ing of conserved structures of the same genomic sequence between cells (see Discussion). 210 211 The SAW model has been widely used in literature to simulate the structure of polymer 212 chains [24], and as such should also be suitable for the generation of chromatin confor-213 mation producing a structure with similar features to those observed in real experiments. 214 For this simulation we wanted to generate a model that was in agreement with a simple 215 yet realistic structure of the human nucleus ( Fig 1B). We used a 3D sphere of 500 µm 3 216 volume, with an internal sphere of 50 µm 3 volume devoid of chromatin that represents 217 the nucleolus. In this space we simulated the 46 human chromosomes (two copies of the 218 22 autosomes and the two sex chromosomes X and Y for a total of 6,053,303,898 bp). 219 Each chromosome was modeled as a separate SAW with a length proportional to the 220 real chromosome size and the simulated chromatin paths were forced to remain inside 221 the nucleus but not allowed to enter the nucleolus. Overall the simulated chromosomes display a number of structural characteristics that 226 are similar to previous experimental data [7,27,28]. Interestingly, although we assumed 227 random packing and no biologically driven heterogeneity in density, the simulation results 228 in a variety of SAW conformations: similar to observed open, fractal, and compact 229 chromatin states [27]. Furthermore, each chromosome resides roughly within its own 230 chromosome territory (Fig 2).

232
In order to assess the properties of our simulated genome architectures statistically, we 233 next simulated 100 synthetic nuclear chromosome sets. We define a contact between 234 SAWs as two adjacent points occupied by two different SAWs. To quantify the packing 235 density of chromatin in our simulated genomes we used the average number of contacts 236 between labeled probe positions within the SAWs across the 3D search space. For each 237 chromosome Fig 4A shows the average number of contacts made within the same chro-238 mosomal SAWs (intra-chromosome contacts) and the average number of contacts made 239 between different chromosomal SAWs (inter-chromosome contacts). Unsurprisingly the 240

PLOS
6/17 average number of contacts within chromosomes is highly correlated with chromosome 241 length ( Fig 4A) and after adjusting of chromosome length each chromosome shows 242 proportionally similar numbers ( Fig 4B).

244
As expected, the average physical distance between two loci in 3D space is highly 245 correlated to their genomic distance and the variance in distance increases with ge-246 nomic distance (Fig 4C). Loci with a large genomic distance but short 3D distance 247 are reminiscent of chromatin looping behavior. Overall we conclude that we have a 248 reasonable simulation of genome architecture, without heterogeneity in packing density, 249 that recapitulates many features of chromatin. determination, providing a set of the 3D coordinates (x, y, z) and color classification 258 but without the indication of the locus (Fig 5A). The colors where assigned to a loci 259 uniformly at random. The goal of the reconstruction algorithm is to assign each of the 260 in situ locus with a specific (x, y, z) position.

262
If we had the same number of colors as loci this task would be trivial, however, the 263 experimental constraints mean that we will have vastly more loci than different colors. 264 Further challenges will occur due to errors in the labeling and imaging experiment. We 265 proposed to solve this problem using the fact that the linear sequence of the probe design 266 dramatically constrains the search space for solutions. Furthermore we can use efficient 267 string based data structures, such as a suffix tree, to efficiently explore compatible places 268 of the design space relative to the 3D space. We named this combined combinatorial 269 exploration followed by expansion the ChromoTrace method (Methods).

271
Our simulations puts us in a position to explore these experimental and technical 272 constraints in a controlled manner, since we can vary the probe design both in terms 273 of number of colors and spacing along the linear genome sequence. Since we know 274 the underlying ground truth of sequence identity and probe color, we can test the 275 hypothesis that the high resolution of 3D position determination and high reliability 276 of color classification provided by super-resolution microscopy should provide enough 277 information to find unique solutions for mapping back probe positions to the linear DNA 278 sequence.

280
We created probe designs using a regular fixed spacing between probes (in our simulations 281 we use 10.8 kb spacing), resulting in an effective spatial imaging resolution of 4.3 x10 −5 282 µm 3 volume which is well within the limits of super-resolution. We then convert the 283 3D positions of the simulated imaging data to a graph of potential adjacencies, using a 284 threshold distance of T which relates to the maximum distance between two sequential 285 probe positions in space (10.8 kb) assuming an average compaction of DNA. The resulting 286 distance graph should in theory contain most of the true paths of the probes along 287 the genome plus spurious links of physically close but non-adjacent probes. We then 288 create a suffix tree containing the expected probe colors along the genome, capturing 289 the two possible directions of reading the labels (p to q and q to p direction) resulting 290 in a reversible suffix tree with path information for both forward and reverse genome 291 directions. The algorithm then iteratively explores the distance graph to find regions 292 with a unique solution of matching potentially physically adjacent color combinations 293 with the genome sequence ( Fig 5A). Once such anchor regions are found, the algorithm 294 has a vastly reduced search space and extends them into the distance graph until it 295 hits regions with high combinatorial complexity (Fig 5B), such as highly compact regions. 296

297
To test the performance of ChromoTrace for determining the DNA path through the 298 nucleus we first loaded the labeling file into the reversible suffix tree and jointly searched 299 the suffix tree and distance graph (x, y, z) to find unique sequences of colors found in 300 both. We chose a value for the distance threshold as the value which maximises the 301 number of trivial paths that we find. We performed this analysis for all 100 synthetic 302 nuclear sets, for all 22 probe designs, for all chromosomes separately as well as for the 303 whole genome. We choose to use precision (specificity) and recall (or sensitivity) to 304 assess algorithm performance. Recall is the ratio of the number of correctly mapped 305 probes to the total number of probes and precision is the ratio of the number of correctly 306 mapped probes to the total number of mapped probes. Since the ground truth is known 307 a priori in our simulation there is no ambiguity in how to measure performance.

309
Analyzing these 55,000 reconstruction attempts shows that the algorithm is highly 310 precise (mean of 0.99 across all simulations), however the recall rate is much more 311 variable (Fig 6A). This variability can largely be explained by two factors i) the number 312 of colors available in the probe design ii) the density of probe positions in 3D space. 313 For individual chromosomes the mean recall rate is approximately 0.99 when using a 314 10 color probe design, however for the same probe design genome wide the mean recall 315 rate drops to 0.64 (Fig 6A). This reflects the increased number of ambiguous sequence 316 paths available when the spatial search space is more densely packed, due to labeled 317 sequences from physically close chromosome territories.

319
As expected the mean number of contacts for synthetic chromatin paths per chromosome 320 are highly correlated to chromosome length (Fig 4A). To assess the reconstruction per-321 formance in dependence of the spatial probe position density (i.e. chromatin compaction) 322 we show the area under the precision-recall curve values (PR AUC) against the total 323 number of intra-chromosomal contacts in 100 kb windows across all autosomes and for 324 all probe designs (Fig 6B). The contacts are defined as the total number of occupied 325 spaces around each labeled probe position given the distance graph distance threshold 326 T. There is a clear trend for increased PR AUC values for probe designs with a greater 327 number of colors irrespective of chromatin density. Across all probe designs there is a 328 marked drop in performance as the chromatin density increases, and this drop is much 329 sharper for probe designs with fewer colors (Fig 6B).

331
For optimum reconstruction it is important to address performance in terms of the 332 completeness of the reconstructed paths. We assessed the length of reconstructed paths 333 in the context of the number of colors available for the labeling design (Fig 7). As 334 expected we see a clear trend towards increased mean and variance of path lengths as 335 the number of colors is increased. Furthermore, the minimum path length across all 336 simulations for each labeling design was 6. The necessity of ChromoTrace to find unique 337 anchors points (trivial paths) before extending further out into the distance graph results 338 in paths never being smaller than 6 points long which is an expected observation when 339 considering the random placement of colors across the simulation space and the global 340 compaction of the chromosomes.

341
Real experimental super-resolution data will contain noise, likely from two major sources, 343 firstly missing probes due to hybridization failure and secondly mislabeled probes, either 344 due to chemical mislabeling or crosstalk between different dyes in the super-resolution 345 microscope. To assess the performance of the reconstruction algorithm in the presence 346 of errors we simulated 99 datasets for each error mode, containing error rates ranging 347 from 1% to 99%, across all 22 probe designs for the 100 simulated nuclear sets, for all 348 chromosomes separately as well as for the whole genome (a total of over 5.4 Million 349 simulations). The errors were added to our simulations at random with the appropriate 350 error rate.

352
For all probe designs the proportion of mislabeled probes has a dramatic effect on 353 the reconstruction precision and we observe a clear decrease in precision as the propor-354 tion of probes with the wrong color is increased (Fig 8A). At 10% mislabeled probes 355 for the 24 color probe designs the mean precision is 0.94 (SD=0.003), dropping to 0.92 356 (SD=0.006) for 11 colors and to 0.7 (SD=0.003) for 3 colors. Recall rates are even more 357 strongly effected by the proportion of mislabeled probes, starting from a maximum 358 recall rate of approximately 0.99 for the 24 color probe designs with no mislabeled 359 probes, recall rates drop sharply for all probe designs as the proportion of mislabeled 360 probes increases (Fig 8B). At 10% mislabeled probes for the 24-color probe designs the 361 mean recall is 0.85 (SD=0.012), dropping to 0.59 (SD=0.002) for 11 colors and to 0.1 362 (SD=0.01) for 3 colors. At above 60% of mislabeled probes both precision and recall is 363 too low to be useful. The rapid drop in performance for recall compared to precision is 364 not unexpected considering that chromotrace uses exact matching.

366
For missing probes the relationship between recall and percentage of errors is very 367 similar (Fig 8B and 8D). This is not surprising since either removing or replacing 368 probes with a wrong color in a sequence of colors is likely to stop the extension of 369 correct paths at a similar rate. Precision however, only starts to drop at a much higher 370 percentage of missing, compared to mislabeled probes (Fig 8A and 8C). This suggests 371 that the chance of creating an error in path extension when removing probes is lower 372 than if mislabeled probes are present. If DNA paths were linear in 3D space this would be 373 entirely expected as the distance threshold between sequential probes would ensure that 374 most paths are not incorrectly extended across missing probe locations, while mislabeled 375 probes will not only terminate extension but also cause mismatches to the genome 376 sequence. These results suggest that removing relatively large numbers of probes is 377 unlikely to cause incorrect path extensions across a majority of the simulated chromatin 378 space.

380
Encouragingly even for the probe designs with the lowest numbers of colors (3) precision 381 remains at approximately 0.8 with a missing probe rate of 25%. Furthermore, precision 382 is also relatively robust to mislabeled probe errors, remaining above 0.75 with more than 383 15% of mislabeled probes for probe designs with greater than 7 colors. As expected recall 384 is far more sensitive to error and there is only a marginal difference observed between 385 the two error modes.

386
Differences in chromatin packing density 387 It is unclear how much of the available volume chromatin occupies locally within the 388 nucleus under physiological conditions, but the literature suggests nucleosome concen-389 trations of 140 ± 28 µM with nucleosomes every 185 bp in HeLa cells leading to a 390 packing density of 10 % when assuming a nucleosome volume of 1296 nm 3 [29]. To be 391 conservative, our simulation used a higher than average density of chromatin, with 34% 392 of the available local volume occupied by chromatin (545 thousand points per genome 393 from a 1.59 million point grid). An increased density of the SAWs will result in a harder 394 reconstruction problem, because a higher number of occupied adjacent spaces within 395 the simulation leads to an increase in the number of ambiguous choices for path extension. 396

397
To assess the effect of lowering the chromatin density we performed additional simulations 398 by omitting the nucleolus and doubling the radius of the nucleus resulting in filling 399 approximately 6.9% of the nuclear volume with chromatin (982 thousand points per 400 genome from a 14.1 million point grid). Unsurprisingly these SAWs are less densely 401 packed, an effect that can be visualized by looking at the proportion of adjacent spaces 402 that are occupied, given the distance threshold T, for all labeled genomic locations in 403 the simulations (Fig 9). While in our original simulations the median proportion of 404 occupied spaces around each probe position from the labeling design is 0.52 (Fig 9A), 405 in the lower density simulations this is decreased to 0.37 (Fig 9B).

407
To test how this effects the reconstruction performance, we generated 100 synthetic 408 nuclear sets using the approach described above and produced 22 different probe designs 409 containing 3 to 24 colors for the lower density simulations. We then reconstructed using 410 the ChromoTrace algorithm for all synthetic data sets for each chromosome separately 411 and for the whole genome. As expected performance, in terms of both precision and 412 recall, is significantly improved for the less densely packed simulations (Fig 9C). The 413 genome wide mean precision remains high (greater than 0.99) for all probe designs. The 414 difference in recall is much more pronounced with mean recall rates of 0.99, 0.92 and 415 0.12, compared to 0.97, 0.48 and 0.08, for probe designs with 24, 11 and 3 colors for 416 the lower density compared to the higher density simulations respectively. Importantly 417 when comparing the lower to the higher density simulations the recall rate is improved 418 by a mean factor of 3 across all different color probe designs (Fig 9C). This marked 419 improvement in sensitivity reflects the decreased number of occupied adjacent 3D spaces 420 around each individual probe position and consequently a reduced number of ambiguous 421 sequence path extension choices when lowering the density of the simulated chromatin 422 paths (Fig 9A and 9B). Overall across all probes designs the lower density simulations 423 have a genome wide mean recall rate of 0.84 compared to 0.58 for the higher density 424 simulations.

426
Until this point we have been using simulations containing uniform spacing between 427 adjacently labeled positions (loci), however the distance between adjacent labels in real 428 super resolution experiments will be variable. The main factors effecting this variability 429 are likely to be the lack of absolute uniformity of probe spacing along the genome, the 430 super-resolution image localization precision and the probability of effective probe hy-431 bridization. To create a more accurate simulation of real experimental data we developed 432 a full simulation of a super resolution experiment. Starting from probe level localization 433 we simulated the results of image acquisition, followed by event clustering leading to 434 observed 3D positions. Importantly this leads to a more varied set of distances between 435 observed loci positions (Supplementary Information).

437
Briefly, starting with the 100 SAWs and the 10 color probe designs, we used each 438 3D coordinate as the starting point in 3D space and placed 10 probes equally spaced 439 along a single direction (x, y, or z) based on the direction of travel along the walk. The 440 midpoint of each group of probes is the original starting position and each probe was 441 given a 0.3 probability of being missing. Next, for each probe, we simulated a number of 442 localization events (LE's) drawn from a poisson distribution with a mean of 5 and added 443 error in all directions independently, drawing from normal distributions with standard 444 deviations of 5 nm, 5 nm, and 15 nm for x, y and z, respectively. For clustering these 445 LE profiles we used the DBSCAN (Density-based spatial clustering of application with 446 noise) algorithm [32]. To define the final 3D coordinates for each locus we took the 447 mean coordinate from each direction separately across all LE's for each cluster that was 448 defined by DBSCAN. These along with their relevant 10 color labeling designs were used 449 as the input to ChromoTrace.

451
We investigated the result of applying this process to the starting simulations in terms of 452 three different types of error. Firstly, the overall percent of missing loci is approximately 453 6% for both genomes and chromosomes (Fig 10B), as seen previously the number of 454 missing loci has an extremely adverse effect on the reconstruction performance partic-455 ularly in terms of recall (see Robustness and error tolerance). Next we looked at 456 the percentage of LE's that were clustered into the wrong locus by DBSCAN, we see 457 that the mean percentage of loci containing erroneous LE's is approximately 5.8% for 458 both genomes and chromosomes ( Fig 10C). The observed position of loci in 3D space 459 whose clusters contain erroneous LE's are likely to be far less accurate than those whose 460 LE's are consistent. Moving individual loci around in space is likely to adversely effect 461 the performance of ChromoTrace due to points falling outside of the chosen distance 462 threshold T used in the distance graph. Finally we looked at the percentage of DBSCAN 463 defined clusters that contained LE's from multiple loci and observe a mean percentage 464 of approximately 1.9% for genomes and chromosomes (Fig 10D). It is reasonable to 465 assume that as the number of unique loci contributing LE's to a defined DBSCAN 466 cluster increases so the accuracy of the final observed loci coordinates is likely to decrease. 467

468
We ran ChromoTrace across all 100 simulated super resolution experiments for whole 469 genome reconstruction and for individual chromosomes. Overall the performance in 470 terms of recall was significantly lower than for the original simulations however precision 471 remained high (Fig 10A). The mean precision for both genomes and chromosomes is 472 higher than 0.95 and the mean recall is 0.14 and 0.40 genome wide and for chromosome 20 473 respectively. The improved recall rates for reconstructing individual chromosomes reflects 474 both the decreased complexity of the distance graph space and importantly a decrease 475 in the number of potential LE's from different loci that could be incorrectly clustered 476 together by DBSCAN. Here we have chosen challenging parameters for the problem of 477 LE profile loci labeling, profile segmentation and observe a significant decrease in the 478 reconstruction performance achieved by ChromoTrace. However, for chromosome scale 479 genomic regions we are still able to reconstruct approximately 40% of the 3D structure 480 and make very few mistakes with precision remaining above 0.95. The parameters 481 used for simulating these LE profiles are by no means optimal and could certainly be 482 improved when designing real experiments, for example, the number of different color 483 labels could be increased and the rate of missingness improved by using highly specific 484 probes. Furthermore the use of DBSCAN in our hands was 'out of the box' and we did 485 not attempt to optimise the clustering of individual LE's from different loci. Improving 486 the clustering of LE's using DBSCAN or more sophisticated custom algorithms would 487 certainly improve the accuracy of estimated loci coordinates.

489
Although we simulated reasonable chromatin paths and used a challenging density of 490 chromatin in the nucleus, our simulation is coarse grained and does not at this time 491 take the known structural heterogeneity of chromatin packing of different genomic 492 sequences into account, for example eu-and heterochromatic domains or TADs. It 493 is therefore necessary to consider how such structural heterogeneity would affect the 494 reconstruction problem. For a given packing density, such structures should lead to one 495 of two outcomes, firstly that the entire chromosome (or probed region of interest) is 496 overall more compact than simulated, leading to a significantly smaller volume of the 497 chromosome territory. This would effectively reduce the amount of resolvable spatial 498 information present for the reconstruction. Such a result would be disappointing in 499 terms of the reconstruction algorithm, but fascinating in terms of how such chromosomal 500 domains are created and maintained. However, the extended conformation of many 501 chromosomes seen previously [31], along with the distribution of their contacts to the 502 nuclear lamina [32], suggest that overall compaction is an unlikely configuration, except 503 for specific cases such as mitotic chromosomes or the inactive X chromosome. The 504 second outcome is that the more highly packed regions are interspersed with more 505 extended regions. The extended regions would be easier to reconstruct, as the bet-506 ter resolved 3D information will be more accurately able to place these regions to a 507 unique position on the genome. At the extreme of this model one would have a se-508 ries of resolvable linkers with interspersed globules of packed chromatin that would 509 not be resolvable. In such a scenario integration with the HiC data or other contact 510 maps, whose resolution is good in these more dense regions [33] would be very interesting. 511

512
On the other hand, when the density of chromatin in the nucleus is lower, the re-513 construction improves dramatically in terms of recall. In experimental HiC data if 514 unusual numbers of contacts are observed relative to chromosome size it may be indica-515 tive of biological processes effecting chromatin condensation [34]. It is feasible to resolve 516 a large fraction of chromosomal scale regions with a resolution of 10.8 kb and reconstruc-517 tion at this level would provide very high-resolution chromosomal scale chromatin maps 518 (including the internal structure of TADs, TAD boundaries and inter-TAD regions). 519 Even if the very fine details of high density chromatin structures remain challenging with 520 the currently available imaging technology, the spatial information provided by even 521 partial reconstruction of the chromatin path is certain to increase our understanding 522 of how chromosome folding and partitioning is related to active processes such as gene 523 expression [35,36].

525
The other important consideration is the number of distinct fluorescence colors that the 526 reconstruction requires. The number of flourophores compatible with 3D super-resolution 527 microscopy and in-situ hybridization conditions is currently limited to three dyes that 528 can be reliably spectrally separated if imaged at the same time. Since DNA in situ probes 529 can be coupled to more than one flurophore, combinatorial labeling can create different 530 color ratios. In our simulations, up to 10 colors for simultaneous detection could easily be 531 generated in this manner, however this will also introduce noise due to chemical labeling 532 errors (the chance by which a probe will be labeled with a different color ratio than 533 intended) which would lead to wrong probe assignments. However, since any given color 534 will have only a finite set of possible neighboring mistakes with associated error rates, a 535 substitution matrix of possible errors can be integrated into both the extension phase 536 and exploration phase of the suffix tree [26], changing the formulation of the problem into 537 a likelihood model of seeing the 3D position of probes given a certain path labeling. In 538 addition, recent advances in labeling techniques such as the 'Exchange-PAINT' method 539 now allow sequential hybridization and image capture, allowing to separate up to 10 540 pseudocolors based on a single dye in time [21]. This labeling technology requires long 541 super-resolution image acquisition times, but could massively increase the number of 542 probes available for the reconstruction algorithm. For example, a binary code with 543 2 colors and 10 labeling rounds could distinguish in the region of 2 10 labels, which 544 would make reconstruction almost trivial. It is therefore very likely that a well-designed 545 combination of spectral and temporal multiplexing of fluorescent dyes, will make it 546 possible to generate image data with sufficiently large numbers of differently 'colored' 547 probes. Therefore it should be possible to optimise data acquisition times with different 548 numbers of colors to allow high resolution reconstruction of the chromatin paths for 549 individual chromosomes within the nucleus. Our comprehensive simulation framework 550 will be valuable in guiding the optimal design of such probes, since it allows to simulate 551 the effect of different designs on the reconstruction performance rapidly in silico.

553
In this paper we proposed a novel algorithm, ChromoTrace, to, in theory, leverage 554 super-resolution microscopy of thousands to millions of in situ genome sequence probes 555 to provide accurate physical reconstructions of 3D chromatin structure at the chromo-556 somal scale in single human cells. To test this algorithm we have made simulations of 557 chromatin paths in realistic nuclear geometries, and explored different labeling strategies 558 of in situ probes. Our study shows that near complete resolution of a chromosome with 559 10 kb resolution can be achieved with realistic microscope resolution and fluorescent 560 probe multiplexing parameters. Extensions to this method such as leveraging between 561 nucleus consistency effects and using a likelihood-based scheme will allow even more 562 sophisticated modeling of experimental error sources in the future [37].

564
There is currently no suitable experimental data to substantiate this work; this is 565 firmly a theoretical exploration of the possibility to achieve this and the constraints 566 any experimental method would need to satisfy for a successful reconstruction. For 567 example, it is clear that minimizing mislabeling is more important than minimizing 568 missing probes. Our simulations are based on known and realistic experimental parame-569 ters, where available. We have tested our method under challenging chromatin density 570 levels and aggressive error models of missing, misreported data and LE precision. Our 571 algorithm and assumptions are compatible with leading super-resolution techniques; in 572 particular our method assumes isotropic resolution of the probes, which has been shown 573 using methods such as direct stochastical optical reconstruction microscopy combined 574 with interference [21,38]. Nevertheless real experimental data will likely have proper-575 ties that we have not anticipated. Some of these properties, such as systematic error 576 behavior, or changes in resolution across the nucleus might hinder our reconstruction. 577 On the other hand, properties such as structured heterogeneity in packing density and 578 cell-to-cell structure conservation are likely to improve our ability to reconstruct. Our 579 reconstructions based on single cell image data are initially most likely to work in a 580 patchwork manner across a chromosome, and will be very complementary to the contact 581 based maps based on HiC or promoter-capture HiC [39]. Combining super resolution 582 imaging and contact mapping should provide fundamentally new insights into chromatin 583 organization and function within the nucleus.