Tight basis cycle representatives for persistent homology of large data sets

Persistent homology (PH) is a popular tool for topological data analysis (TDA) that has found applications across diverse areas of research. It provides a rigorous method to compute robust topological features in discrete experimental observations that often contain various sources of uncertainty. Although powerful in theory, PH suffers from a computation cost high enough to preclude its application to large data sets. Additionally, most analyses using PH are limited to establishing the existence of nontrivial features. Precise localization of these features is not generally attempted because, by definition, localized representations are not unique, and because of the even higher computation cost. For scientific applications, such precise localization is a sine qua non for determining functional significance. Here, we provide a strategy and algorithms to compute tight representative boundaries around nontrivial robust features in large data sets. To showcase the efficiency of our algorithms and the precision of the computed boundaries, we analyze three data sets from different scientific fields. In the human genome, we found an unexpected effect of impaired chromatin loop formation on loops through chromosome 13 and the sex chromosomes. In a distribution of galaxies in the universe, we found statistically significant voids. In protein homologs with significantly different topology, we found voids attributable to ligand interaction, mutation, and differences between species.


Introduction
Quantitative observations of physical systems, from galaxies to proteins, often lead to data sets that are discrete and have uncertainties [Ayyub and Klir, 2006]. The identification of robust features in such data sets is the impetus for the development of topological data analysis (TDA). Persistent homology (PH) is, perhaps, the most mathematically rigorous method developed for TDA. PH applies techniques developed in algebraic topology to find robust lacunae in discrete data sets. Most implementations of PH end their analysis at the stage where the existence of nontrivial homology cycles has been demonstrated. However, scientific interest in nontrivial cycles often requires finding specific locations for homology cycles. For example, while the existence of voids in the distribution of galaxies is interesting, cosmologists need to know the locations and sizes of these voids to compare them to theoretical models of galaxy formation. As another example, the looping of chromatin strands of the genome is not random. It regulates gene expression by bringing regulators close to regulated genes that otherwise are far apart along the linear strands [Kadauke and Blobel, 2009, Rowley and Corces, 2018]. Hence, determining the precise genomic location of loops is important. PH has found useful applications in areas as diverse as neuroscience [Bendich et al., 2016], computational biology [McGuirl et al., 2020], natural language processing [Zhu, 2013], the spread of contagions [Taylor et al., 2015], cancer [Nicolau et al., 2011, Lawson et al., 2019], materials science [Kramár et al., 2013], and computer graphics [Brüel-Gabrielsson et al., 2020], among many others. However, its application is still limited by the challenges of high computation cost and of identifying the precise locations of features.
Colloquially, we can think of the features that PH finds as 'holes' in the cloud of points that constitutes a data set: regions of lower density enclosed in regions of higher density. As one can surmise from the appearance of the term 'density', the features that PH finds depend on the scale of distances between neighboring data points. An objective, mathematically sound method to compute the existence of holes is called for because the human eye is adept at pattern detection even when there is no pattern. For example, we can readily pick out a chain in a uniform distribution of constituents in a three-dimensional space [Soneira and Peebles, 1978]. On the other hand, we may be unable to find topologically significant holes in a 3D point cloud by visual inspection (Figure 1e).
We briefly introduce terminology and background to explain the challenges in applying PH. An n-simplex is a set of (n + 1) points (Supplementary figure 13). A simplicial complex at a spatial scale of τ is defined as the set of all possible simplices with maximal pairwise distance at most τ between their points. Hence, we get a sequence of complexes as τ increases (Figure 1a). Loops in a simplicial complex are computed as basis elements of its homology group of dimension one (H_1). Voids are computed as basis elements of its homology group of dimension two (H_2). PH tracks changes in the number of basis elements of homology groups across the sequence of complexes. The τ values of the births and deaths of topological features or holes are recorded and plotted as a persistence diagram (PD) (Figure 1c). We refer to these τ values as births and deaths for brevity in this paper. Every hole is characterized by the range of values of τ across which it persists, called its persistence or barcode length and computed as death − birth. The key point is that features with relatively high persistence are robust to larger experimental variability. This multiscale construction and processing of high-dimensional simplices (0-, 1-, 2-, and 3-simplices in this work) incurs a high computation cost. Test data sets used to benchmark PD computation are commonly limited to a few thousand points. Among the different methods to compute PDs are matrix reduction [Edelsbrunner and Harer, 2008], Morse theory [Mischaikow and Nanda, 2013], and matroid theory [Henselman and Ghrist, 2016]. Ripser is one of the most efficient software packages for computing PDs [Bauer, 2021]. However, even Ripser was unable to process human genome data sets because they contain millions of discrete points. In previous work, we developed Dory [Aggarwal and Periwal, 2021] and used it to compute the PD of human genome data sets within five minutes using only 6 GB of memory. We also showed that its efficiency is not limited to these
specific data sets, since it used the least memory and run-time on almost all test data sets. A different approach to tackling the computational feasibility of PDs is to approximate a topological structure with a large number of simplices by one with fewer simplices [Dey et al., 2019]. However, theorems justifying such a coarsening, so as not to lose information about possibly functionally important features, are hard to come by, and adding sampling stochasticity to an approach that is trying to overcome data uncertainty is counter-intuitive. Most applications of PH are limited to computing the PD and thereby demonstrating the existence and persistence of holes, but not their locations, and therefore do not allow an assessment of their functional significance. Why is the location information elided? By definition, representative boundaries around topological features are not uniquely defined [Carlsson, 2009], so the resulting boundaries can be geometrically imprecise (Figure 1d). Furthermore, computing candidate representative boundaries is expensive in both memory usage and run time. Hence, the development of efficient and scalable algorithms to compute PDs and precise representative boundaries of persistent features is an active area of research.
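As a toy illustration of the filtration idea introduced above (and not of the algorithms developed in this work), the following sketch computes H_0 persistence pairs of a small point cloud with a union–find structure: each point is a component born at τ = 0, and a component dies at the edge length at which it merges into another. The function name and example points are hypothetical.

```python
import itertools, math

def h0_persistence(points):
    """H_0 persistence pairs of a Rips filtration via union-find.

    Each point is born at tau = 0; when two components merge at the
    length of the edge joining them, one component dies at that scale.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the root of i's component, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges of the 1-skeleton, sorted by length: the scale tau at which
    # each edge enters the filtration.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    pairs = []
    for tau, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            pairs.append((0.0, tau))   # a component dies at scale tau
    pairs.append((0.0, math.inf))      # one component never dies
    return pairs

# Two clusters far apart: the last finite death is large, reflecting
# the scale at which the clusters merge.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
print(h0_persistence(pts))
```

The long-persistence pair produced by the two well-separated clusters illustrates why high persistence signals robustness to perturbations of the points.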
One approach to computing a canonical set of representative boundaries has been to define minimal boundaries as solutions to an optimization problem. For example, a possible set of boundaries is one that minimizes the aggregated weight of boundaries at different spatial scales [Dey et al., 2011]. Boundaries that are solutions to such optimization problems are called optimal homologous cycles. Strategies for their efficient computation were introduced in Dey et al. [2018]. However, computing optimal homologous cycles is still of order O(n^11) for a data set with n points [Guerra et al., 2021]. Hence, they are usually computed only for small data sets. A comparison of different possible optimization functions and implementations is given in Li et al. [2021]. Another issue is that they are not computed for voids: an optimization strategy for H_2 representative boundaries has been shown to be infeasible (NP-hard) [Chen and Freedman, 2011].
Our contribution in this work is a set of algorithms that can compute minimal representative boundaries around significant loops and voids in large data sets. Figure 2b shows an outline of our strategy. We classify a hole as significant if it has persistence larger than a threshold ε (a higher value means robustness to larger variability in the data set) and is born at a spatial scale of at most τ_u (a lower value suggests holes with denser boundaries). First, we developed a recursive algorithm that can compute a comprehensive set of representative boundaries around all nontrivial holes (persistence > 0). This is not always possible with extant algorithms because of high computation cost. We call these boundaries birth-cycles. Next, we developed a greedy algorithm that decreases the lengths of birth-cycles to give a set of shortened representative boundaries. Third, we implemented a local smoothing of the boundaries. Figures 1d and 1g show examples of computed shortened and smoothed H_1 and H_2 representatives; note how they tighten around the respective holes when they are shortened. Fourth, the algorithm identifies sub-regions of a spatial embedding of the data set that contain significant features, so we can use a divide-and-conquer strategy. We show in examples that this significantly decreases the maximum size of the data sets that need to be processed in subsequent analysis. We also developed a graphical contraction that can possibly decrease the number of sub-regions. Finally, we introduce stochasticity in two different ways to help overcome the greedy algorithm's local-minimum problem. The computational efficiency of our algorithms is evidenced by the ability to process a human genome data set with three million points within a few minutes, using around 3 GB of memory. We illustrate the geometrical precision of the computed representative boundaries for voids with case studies of the distribution of galaxies in the universe and of protein crystal structures. Hence, this work enables scientific investigation of the functional significance of topologically robust features in large noisy data sets from experimental observations.
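The significance criterion described above can be sketched as a simple filter on persistence pairs. The function name and example pairs below are hypothetical, and we take "persistence larger than a threshold" to mean at least ε here.

```python
def significant_features(persistence_pairs, tau_u, eps):
    """Filter persistence pairs (birth, death): a hole is classified as
    significant if its persistence death - birth is at least eps and it
    is born at a spatial scale of at most tau_u."""
    return [
        (b, d) for b, d in persistence_pairs
        if b <= tau_u and (d - b) >= eps
    ]

# Hypothetical persistence pairs (birth, death).
pairs = [(0.5, 0.6), (0.8, 3.0), (2.5, 9.0), (4.0, 4.2)]
print(significant_features(pairs, tau_u=3.0, eps=1.0))
# -> [(0.8, 3.0), (2.5, 9.0)]
```

Features that never die (death = ∞) pass the persistence test automatically, as one would expect.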

2.1 Genome-wide high-resolution loops in the human genome computed within a minute
We analyzed the human genome at high resolution to show the scalability and application of our algorithms to large data sets. This data set has approximately three million points. The 23 chromosomes in human cells collectively contain approximately 6.7 billion base pairs (bp). The 2 m length of the genome is folded at multiple scales to fit inside a cell of around 10 µm diameter. Loops arising from this folding play specific roles in gene regulation. They enable long-range interactions between gene regulatory regions that lie far apart along a linear chromosome or even on different chromosomes. A correlation between chromatin loops and aggregation of the cohesin protein at loop anchors has been demonstrated [Rao et al., 2014]. The resulting hypothesis, that cohesin plays a role in chromatin loop formation, was confirmed by treating cells with auxin, a molecule that disrupts cohesin function: addition of auxin led to a loss of chromatin loops [Rao et al., 2017]. We observed a congruent trend for H_1 loops when we computed PH for data sets with and without auxin treatment.
We estimated genome-wide pairwise distance matrices at 1 kb resolution using data from Hi-C experiments on control and auxin-treated human cells [Rao et al., 2017] (Section 4.2). 1 kb resolution means that all linear chromosomes are partitioned into bins of a thousand contiguous base pairs. A distance matrix D̂ = (d̂_ij) then estimates the spatial distance between bin i and bin j in the folded genome (see Section 4.2 for details). This resulted in matrices of size 3,088,281 × 3,088,281 for genome-wide analysis of the human genome. We denote the matrices for control and auxin-treated cells by D̂_c and D̂_a, respectively.
Figure 3a shows distributions of our estimates, d̂^c_ij and d̂^a_ij, for bin distances |i − j| = 1, 2, 3, and 4. Figure 3b shows that their means and medians increase with bin distance. We expect that in most cases bins that are farther apart along the linear chromosome will also be farther apart in the folded chromosome. Therefore, this is an expected property of our pairwise estimation and provides a sanity check. Figure 3a shows that the distributions for bin distances greater than 2 are not as significantly different from each other as compared to that for a bin distance of one. This suggested that the estimates might not be accurate for bin distances larger than one. Hence, we defined a feature to be significant if it is born at a spatial scale less than or equal to τ_u = 100 and has persistence at least ε = 50. These choices are based on the fact that 100 is above the 75%-ile of estimates for a bin distance of 1 and 150 is below the 25%-ile of estimates for a bin distance of 2 (dashed lines in Fig 3). Computing PH up to τ = τ_u + ε = 150 resulted in 3,206,797 and 2,701,615 valid entries in D̂_c and D̂_a, respectively, out of the (3,088,281 choose 2) ≈ 4.7 × 10^12 possible pairs.
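The sparsity that makes genome-scale computation feasible comes from discarding all pairwise estimates longer than τ_u + ε. A minimal sketch of this filtering on a sparse dictionary of estimates (the entries are hypothetical) is:

```python
def valid_edges(dist_entries, tau_max):
    """Keep only the pairwise-distance estimates needed to compute PH up
    to the scale tau_max = tau_u + eps; longer pairs can be ignored,
    which makes the effective input sparse (n_e << n^2 entries)."""
    return {pair: d for pair, d in dist_entries.items() if d <= tau_max}

# Hypothetical estimates d_ij for a few bin pairs (i, j).
d_hat = {(0, 1): 80.0, (0, 2): 140.0, (1, 2): 90.0, (0, 3): 400.0}
print(valid_edges(d_hat, tau_max=150.0))
# -> {(0, 1): 80.0, (0, 2): 140.0, (1, 2): 90.0}
```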
Our algorithms computed the PD and representative boundaries in around one minute using 3.1 GB of memory for D̂_c, and in 35 seconds with 3 GB for D̂_a (see Table 1). The low memory usage is because all data structures throughout our algorithms and strategy are designed to be on the order of the maximal number of valid edges (n_e). This is specifically beneficial in this case because n_e ≪ O(n^2). In general, we expect our memory usage to be lower than that of extant algorithms by multiple orders of magnitude for topological structures in which the number of simplices is much less than the maximum number possible. Such topological structures are termed sparse. It is difficult to theoretically generalize the run-time efficiency of the matrix reduction algorithm a priori, since it depends on the number of reduction operations, which is decided by the structure of the coboundary and boundary matrices. The coboundary (and boundary) matrices in turn depend upon the topology of the data set. Even for the same data set,

[Figure 2 panels. (a) Example boundary matrix with H_0 and H_1 persistence pairs. (b) Strategy flowchart: compute reduced boundary matrix; compute persistence pairs by reducing the coboundary matrix; recursive algorithm to compute reduction operations for the boundary matrix, which form a comprehensive set of representative cycles; greedy algorithm to shorten cycles; local smoothing to further shorten cycles; if the data are embedded in R^2 or R^3, construct covers of cycles, eliminate covers with no significant features, apply graphical contraction to decrease the size or number of covers, and introduce stochasticity by perturbations and permutations to compute multiple sets of representative homology cycles (otherwise, stochasticity by permutations only); select minimal representatives.]
Figure 2: Matrix reduction algorithm and our strategy to compute representative boundaries in large data sets with improved precision. (a) Simplices are indexed by the order in which they are added to the simplicial complex. Column and row i of the boundary matrix D correspond to simplex σ_i. Columns for 0-simplices are not shown because they are always empty. Column D(σ_i) has 1 at the boundary-simplices of simplex σ_i, and 0 otherwise. The lowest non-zero entry in each column is called a low, shown as a red box. Columns are reduced (mod-2 sums) from left to right until each row has at most one red box, called a pivot. A pivot at (i, j) is a persistence pair (σ_i, σ_j). If R(σ_i) = 0 and σ_i is not in any persistence pair, then (σ_i, ∞) is a feature that did not die. In the example, there are two H_1 persistence pairs: (σ_9, σ_10) and (σ_8, ∞), i.e. (2.75, 2.75) and (2.5, ∞).

different coboundary and boundary matrices might be constructed that require different numbers of reduction operations. This is possible because of the arbitrary choice of ordering of simplices that are added to the topological structure at the same spatial scale.
Figure 4 shows that the median and 75%-ile boundary lengths decreased across multiple log scales (base 2) after being shortened by the greedy algorithm. Figure 4 also shows that smoothing of the shortened cycles reduced some to degenerate cycles of length 2 (the lowermost humps in the distribution for the smoothed cycles are lacking in the distribution of lengths of the shortened cycles). Specifically, 902 cycles in control and 209 in auxin-treated were reduced to degenerate cycles. These were ignored in subsequent analysis. Next, we determined which of the shortened and smoothed boundaries cannot be around a significant feature. For every cycle C, we estimate the theoretical maximum persistence that a feature with C as its representative boundary can have. The birth is estimated as the longest edge in the boundary, because at that spatial scale the cycle is born. The death is estimated as the maximum of all possible pairwise distances between points of the cycle, because at that spatial scale all simplices on the points of the cycle will have been added to the simplicial complex. Hence, the theoretical maximum persistence is (estimated death) − (estimated birth). If this value is less than ε, then C is excluded from subsequent analysis. There were 271 such cycles in control and 22 in auxin-treated. This left us with 42695 cycles in control and 21137 in auxin-treated. Hence, auxin treatment resulted in a loss of around 50% of possibly significant H_1 loops. For further analysis, we considered cis- and trans-cycles separately. If a cycle is defined by edges with all bins from the same chromosome, we call it a cis-chromosome cycle or cis-cycle. If a cycle is defined by edges with at least two bins on different chromosomes, we call it a trans-chromosome cycle or trans-cycle. Analysis of these sets of minimal cycles showed that the control data set has a higher number of cis-cycles in all chromosomes, irrespective of their length and the maximal bin distance between contiguous bins in the cycle (see Figures 5b and 5a). The differences in the number of cycles at bin distances of up to 10^5 in Figure 5a indicate that we found cis-cycles whose contiguous bins are at some distance along the linear chromosome. This provides evidence for long-range interactions. The number of trans-cycles also decreased upon treatment with auxin. However, there are relatively more trans-cycles in the auxin-treated data (more negative points in Figure 5c). For every pair of chromosomes, we counted the number of trans-cycles that go through the pair. We plot the difference in the counts between control and auxin treatment as a graph in Figure 5d. A red edge between a pair of chromosomes (c_i, c_j) implies that there are more trans-cycles going through c_i and c_j in auxin-treated than in control; blue edges indicate the opposite. The predominance of blue edges indicates that for almost all chromosome pairs there are more trans-cycles in control than in auxin-treated. An exception that stands out is chromosome 13, which has more red edges. We repeated this analysis without the sex chromosomes to analyze trans-cycles that go through only the autosomes. Figure 5e shows that such trans-cycles through chromosome 13 are mostly blue. Hence, the auxin-treated data set has more trans-cycles that go through chromosome 13 and the sex chromosomes. This might yield insights into functional relations between chromosome 13 and the sex chromosomes. For example, ambiguous genitalia is known to be associated with sex chromosome disorders. However, at least one case has been observed of ambiguous genitalia with the autosomal chromosome anomaly of ring chromosome 13 [Özsu et al., 2014].

Figure 5: (d) The opacity of edge (c_i, c_j) indicates the difference in the number of trans-cycles between control and auxin-treated that go through chromosomes c_i and c_j. Blue indicates more in control; red indicates more in auxin-treated. Chromosome 13 stands out with many red edges. (e) Same as (d) but without the sex chromosomes. There are very few red edges at chromosome 13, indicating that auxin-treated has a larger number of trans-cycles that go through chromosome 13 and the sex chromosomes.
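The persistence bound used above to discard cycles (birth estimated as the longest boundary edge, death as the largest pairwise distance among the cycle's points) can be sketched as follows; the function name and example cycle are hypothetical.

```python
import itertools, math

def max_persistence_bound(cycle_points):
    """Upper bound on the persistence of a feature with this cycle as a
    representative boundary: the birth is at most the longest edge
    between consecutive points of the cycle, and the death is at most
    the largest pairwise distance among its points, since at that scale
    every simplex on the cycle's points is present."""
    n = len(cycle_points)
    birth = max(
        math.dist(cycle_points[i], cycle_points[(i + 1) % n])
        for i in range(n)
    )
    death = max(
        math.dist(p, q) for p, q in itertools.combinations(cycle_points, 2)
    )
    return death - birth

# A unit square cycle: edges of length 1, diagonal sqrt(2).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(max_persistence_bound(square))  # sqrt(2) - 1, about 0.414
```

A cycle whose bound falls below ε cannot wrap a significant feature, so it can be dropped before any further analysis.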
2.2 Rare voids in a distribution of more than 100,000 galaxies

We benchmarked the computation of minimal boundaries around voids in R^3 by analyzing a distribution of 108,030 galaxies in the universe. We also showed that the computed boundaries identified distinct voids in the universe. The locations of galaxies in R^3 were estimated from a redshift data set (http://www-wfau.roe.ac.uk/6dFGS/) using Hubble's law (see Section 4.3 for details). Significant features were defined by τ_u = 15 and ε = 2. The number of nontrivial H_2 features was 2351, out of which 42 were classified as significant (shown in red in the H_2 PD in Figure 6a). It took 42 minutes on a 2.4 GHz Intel Core i9 to compute the PD, compute a set of birth-cycles, shorten them, and finally smooth them. Figure 6b shows that both the shortening and smoothing algorithms decreased the lengths of boundaries.
Next, we show that the computed birth-cycles wrap around all features classified as significant in the data set. We define the cover of a set of embedded points as the smallest hyper-rectangle in Cartesian coordinates that contains all those points on its boundary or in its interior (see Section 4.1.2 for a rigorous definition). We estimated the significant features that a birth-cycle might contain by computing the PD of its cover. In Figure 6c, we indicate the significant features in the full data set by blue o's and the estimated significant features in the birth-cycles by red x's. For every o there exists an exactly matching x. Hence, birth-cycles wrap around all features classified as significant in the PD of the full data set. Do the shortened boundaries also wrap around all features that are classified as significant in the PD of the full data set? We estimated significant features for the set of shortened cycles to compare with the significant features in the full data set. Figure 6d shows that not all significant features in the full data set (o) have a matching significant feature in the shortened cycles (x).
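A minimal sketch of the cover construction described above (the smallest axis-aligned hyper-rectangle containing a set of embedded points; the paper's Section 4.1.2 gives the rigorous definition) is:

```python
def cover(points):
    """Smallest axis-aligned hyper-rectangle containing the points,
    returned as (min_corner, max_corner)."""
    dim = len(points[0])
    mins = tuple(min(p[k] for p in points) for k in range(dim))
    maxs = tuple(max(p[k] for p in points) for k in range(dim))
    return mins, maxs

# Hypothetical points of a cycle embedded in R^3.
pts = [(1.0, 2.0, 3.0), (-1.0, 5.0, 0.0), (2.0, 2.5, 1.0)]
print(cover(pts))
# -> ((-1.0, 2.0, 0.0), (2.0, 5.0, 3.0))
```

Computing the PD only inside each cover is what makes the divide-and-conquer step cheap: the cover of a single cycle typically contains a tiny fraction of the full data set.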
We explain why this happens, and how it is a feature of our algorithm, by comparing birth-cycles with shortened cycles for an annotated example. Figure 6e shows six different views (two rows, three columns) of the embedded points in the spatial region of the example. The columns show three different orientations. The top row shows the birth-cycle. The bottom row shows three shortened cycles in that region as computed by the greedy algorithm. In the full data set, this region has a significant feature found by the matrix reduction algorithm, with a basis cycle that is born at 10.24 and dies at 15.3, denoted (10.24, 15.3). However, the persistence pairs with maximum persistence in the covers of the shortened cycles are (13.2, 15.3), (14.32, 15.34), and (13.96, 14.2). The persistence of all of these is less than ε, but the constituents of these boundaries can interact because their births are less than τ_u. Hence, the shorter cycles found by the greedy algorithm show that the annotated feature is not significant.
Covers of 41 of the smoothed boundaries had a non-zero number of significant features. The largest cover had 0.6% of the number of points in the original data set. Hence, the computational cost of any subsequent analysis was considerably reduced. Graphical contraction reduced the number of possible covers with significant features from 41 to 40, but it did not reduce the number of points in any cover. All 40 covers had exactly one significant feature. We constructed n_pert = 50 perturbations for every cover. Figure 7a shows that features with larger persistence have a larger perturbation parameter (see Section 4.1.2 for the definition of the perturbation parameter). This is expected, since larger persistence implies robustness to larger perturbations. We constructed n_perm = 50 permutations for every perturbation by permuting the indices of edges of the same length in the topological structure. Hence, every random permutation is a re-indexing of the edges in the simplicial complex. However, it is possible that all edges are indexed identically in two permutations; the chance of this is higher when there are very few edges of the same length in the complex. Therefore, we only consider unique sets of labels out of the 50 random permutations (see Section 4.1 for more details on perturbations and permutations).
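A sketch of the permutation step described above, which re-indexes only edges of equal length so that the filtration order remains valid, is given below; the function name and example lengths are hypothetical.

```python
import random
from collections import defaultdict

def permute_equal_length_edges(edge_lengths, rng):
    """Randomly permute the indices of edges that have the same length.
    Ties in the filtration order are broken arbitrarily, so any such
    re-indexing yields a valid ordering of the simplicial complex."""
    by_length = defaultdict(list)
    for idx, length in enumerate(edge_lengths):
        by_length[length].append(idx)
    new_index = list(range(len(edge_lengths)))
    for idxs in by_length.values():
        shuffled = idxs[:]
        rng.shuffle(shuffled)       # shuffle only within the tie group
        for old, new in zip(idxs, shuffled):
            new_index[old] = new
    return new_index

lengths = [1.0, 2.0, 1.0, 2.0, 3.0]
perm = permute_equal_length_edges(lengths, random.Random(0))
# Every edge keeps its length under the re-indexing.
assert all(lengths[perm[k]] == lengths[k] for k in range(len(lengths)))
print(perm)
```

As in the text, only the unique re-indexings among the n_perm draws would be kept, since two draws can coincide when few edges share a length.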
It was feasible to compute PH up to the maximum possible threshold (see Figure 7b). Consequently, we computed representative homology boundaries for all unique permutations of perturbations of every cover. Figure 7c shows that the lengths of the computed homology boundaries varied considerably across all permuted sets of all perturbations of a cover. Figure 7d shows that both permutations and perturbations contributed to the variation in lengths.
We used the computed multiple sets of homology boundaries for all permutations to define the boundary-simplices that are in minimal representatives (see Section 4.1.2 for details). Do these minimal boundaries identify unique singular voids? They wrapped around only one significant feature each, since each cover had one significant feature. Next, we confirmed that they wrapped around unique voids. We computed their covers and checked whether they have non-empty set intersections. Figure 8b shows a graph with covers as nodes and edges defined by non-empty intersections. Each node is labeled by the number of significant features in it. The presence of 24 singletons in this graph shows that 24 covers do not intersect any other cover. This ensures that the corresponding minimal representatives are around distinct singular voids. There are 6 components (marked with bigger colored circles) that indicate non-empty intersections of covers. We visually confirmed that the minimal boundaries in these components are also around distinct voids (Figures 8e and 8f). Hence, we computed 40 minimal representatives that are around distinct significant voids in the distribution of galaxies in the universe. Figure 16 shows each of the voids. Figures 8c and 8d show two views of all voids in the embedding of the full data set. Figure 8a shows the 40 significant persistence pairs in the covers of minimal representatives and the 42 classified as significant in the full data set. The decrease in the number of significant features by two is attributed to the aforementioned possibility of a decrease in persistence as boundaries are shortened. Note that we were able to compute the deaths of all features in these regions because of the small cover sizes. This was not possible for the full data set because of the large number of galaxies.
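The pairwise cover-intersection test used above to identify singleton covers can be sketched as follows: two axis-aligned hyper-rectangles intersect exactly when their coordinate intervals overlap on every axis. The function names and example covers are hypothetical.

```python
def rectangles_intersect(a, b):
    """Axis-aligned hyper-rectangles given as (min_corner, max_corner);
    they intersect iff their intervals overlap in every coordinate."""
    (amin, amax), (bmin, bmax) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(amin, amax, bmin, bmax))

def singleton_covers(covers):
    """Indices of covers that intersect no other cover; their minimal
    representatives are guaranteed to wrap distinct voids."""
    return [
        i for i, c in enumerate(covers)
        if not any(i != j and rectangles_intersect(c, d)
                   for j, d in enumerate(covers))
    ]

covers = [
    ((0, 0, 0), (1, 1, 1)),
    ((0.5, 0.5, 0.5), (2, 2, 2)),   # overlaps the first cover
    ((5, 5, 5), (6, 6, 6)),         # a singleton
]
print(singleton_covers(covers))  # -> [2]
```

Non-singleton components, as in the text, would still need visual or further automated confirmation that their boundaries wrap distinct voids.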
Next, we show that it is rare to find similar voids at random. We conducted random searches in two different ways, spatial and graphical (see Section 4.3 for details). In spatial sampling, we sampled hyper-rectangles in the embedding. In graphical sampling, we sampled connected subgraphs of a graph with galaxies as nodes and edges defined for pairs of galaxies at most τ_u apart. We defined four features for comparison between random samples and the sets of minimal representatives: the size of their cover, the radial distance of the galaxy closest to the center of the cover (in Mpc), the spherical uniformity of the points in the cover, and the eccentricity of the cover. Figure 9 shows projections of the
four-dimensional feature set, with circle markers for the random samples and crosses for the voids computed by our strategy. The color of the random samples shows a Gaussian kernel density estimate (kde). We defined and computed a pseudo p-value (see Methods 4.3.4) for every computed void that is indicative of the statistical significance of its feature set. Figure 9c shows that feature sets with larger cover size are statistically significant (p-value ≤ 0.01) in spatial sampling, and that feature sets with larger radius are statistically significant in graphical sampling.
Table 2 shows the feature sets of all 40 voids and the corresponding pseudo p-values. There are 12 voids with feature sets that are statistically significant in both spatial and graphical sampling.
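The pseudo p-value is defined in the paper's Methods (Section 4.3.4), which we do not reproduce here. The sketch below shows one standard empirical variant for a single feature, with hypothetical sample values, purely to illustrate the idea of comparing a computed void's feature value against random samples.

```python
def empirical_p_value(observed, random_values):
    """One common empirical pseudo p-value (an assumption, not
    necessarily the paper's exact definition): the fraction of random
    samples whose value is at least as extreme as the observed one,
    with the usual +1 correction so the estimate is never zero."""
    n_extreme = sum(1 for v in random_values if v >= observed)
    return (n_extreme + 1) / (len(random_values) + 1)

# Hypothetical cover sizes from 99 random spatial samples vs one void.
random_sizes = list(range(1, 100))          # 1 .. 99
print(empirical_p_value(95, random_sizes))  # (5 + 1) / 100 = 0.06
```

With this convention, a computed void whose cover size exceeds all 99 random samples would get a pseudo p-value of 0.01, matching the significance cutoff quoted above.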

2.3 Protein homologs with significantly different topology
Can we find protein homologs with differences in the number of significant H_2 features in their backbones? We classified voids in protein structures as significant if they are born at a spatial scale of at most 10 and if their persistence is at least 3.5, i.e. τ_u = 10 and ε = 3.5. These choices are based on the fact that average bond lengths in protein molecules range from around 1.5 Å to 3 Å. We considered all publicly available Protein Data Bank (PDB) entries with at least 10 and at most 20000 atoms in their backbone (N, C, Cα, and O atoms). PH of the resulting 174,574 valid PDB entries was computed for the backbone, up to and including a threshold of τ = τ_u + ε = 13.5, on x2650 processors. We recorded the computation times against the number of atoms in the backbones.

Figure 7: Statistics on perturbations and permutations. (a) Covers with a significant feature of higher persistence have a higher perturbation parameter. (b) Minimum, median, and maximum of the distribution of computation times for all permutations of all perturbations of a cover. The median and minimum increase with the number of points in the cover. A large disparity is seen in the maximum time taken for covers of small size, showing that it is difficult to estimate run-time a priori with high precision. (c) Lengths of representative homology boundaries vary across all permutations of all perturbations for every cover. (d) Both perturbations and permutations contribute to the variation in lengths of the representative boundaries.

Out of these, 12198 PDB entries had at least one significant H_2 feature. For each of these, we queried for PDB entries that matched at least 75% of the length with a score of at least 75% and whose backbone lengths differed by at most 25%. It is generally presumed that sequences with high similarity did not arise independently and share a common ancestor [Pearson, 2013]. Therefore, we call the matching entries homologs. We determined 874 unique sets of homologs. A higher value of the L_0 norm of the difference in barcode lengths indicates that the topological differences are more significant (see Section 4.4 for the definition). We iterated over pairs of PDB structures in each set of homologs to determine those with a different number of significant features and an L_0 norm of at least 3. We constructed a new graph with edges given by all pairs that satisfied these criteria. Figures 10d and 10e show the 25 components of this graph. Overall, there are 110 nodes, or PDB entries. Out of these, 71 had at least one significant void. The largest of these proteins had around 4000 atoms in its backbone. First, we computed the shortened and smoothed birth-cycles. Figure 10b shows that the largest set in the covers of smoothed cycles has fewer than 1000 points (open o in Figure 10b). Hence, again, the computational cost of any subsequent analysis was significantly reduced compared to the largest backbone. Graphical contraction resulted in 77 covers that possibly contain significant voids. The numbers of significant voids are shown by open squares in Figure 10b. The size of the largest of these covers in each PDB entry is marked by a red x. The overlapping x's and open o's show that graphical contraction did not result in a further decrease in the size of the covers for subsequent analysis.
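Selecting the backbone atoms (N, Cα, C, O) used throughout this section can be sketched by parsing ATOM records of a PDB-format file with the standard fixed-width columns; the example records below are hypothetical.

```python
def backbone_atoms(pdb_lines):
    """Extract backbone atoms (N, CA, C, O) from ATOM records of a
    PDB-format file; column offsets follow the standard fixed-width
    PDB layout (atom name in columns 13-16, x/y/z in 31-54)."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            name = line[12:16].strip()
            if name in ("N", "CA", "C", "O"):
                x = float(line[30:38])
                y = float(line[38:46])
                z = float(line[46:54])
                coords.append((name, (x, y, z)))
    return coords

# Two hypothetical ATOM records in fixed-width PDB format; the second
# (a side-chain CB atom) is filtered out.
pdb = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CB  ALA A   1      12.560   6.351  -6.509  1.00  0.00           C",
]
print(backbone_atoms(pdb))
```

The resulting coordinate list is exactly the kind of point cloud on which the PH and boundary computations of this section operate.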
We implemented the stochastic analysis with n_pert and n_perm chosen as 15. We were able to compute PH and representative homology boundaries. Figure 10c shows the variation in computation times. The maximum time taken was around 15 mins. Note that the computation time is not strictly increasing with the number of points in the data set. We constructed the set of minimal representatives for significant voids in every protein, as detailed in our strategy. All of the 25 homolog graphs, along with 3D figures showing protein crystal structures and voids, are in Supplementary Figures 17 to 41. All boundaries are around distinct singular voids. We showcase three examples here. The first is an example of ligand binding (Figure 11b). The pheromone-binding protein of Bombyx mori has no significant voids in its unbound form. However, it has significant voids when bound with bell pepper odorant (PDB 2p70) and also when bound with iodohexadecane (PDB 2p71). The second is an example of different topology between different species. The growth arrest and DNA damage (GADD45) family is a highly conserved family of proteins (GADD45a, GADD45b, and GADD45g) that respond to stress on mammalian cells and are crucially involved in DNA repair. We found that a dimeric conformation of GADD45g from humans (PDB 2wal) has a significant void, but the dimeric configuration reported for mouse (PDB 3cg6) does not have any voids (Figure 11b). This structural difference might lead to differences in gene expression. For example, a closely related gene in the same family, GADD45a, is up-regulated in humans but down-regulated in mice upon irradiation [Ghandhi et al., 2019]. The third is an example of a mutation leading to a different topology. Cocosin, a protein in coconut fruit, is a possible food allergen. Its reported crystal structure (PDB 5xty) has no significant voids. However, it has 3 significant voids when mutated at two residues in each of its two chains (PDB 5wpw). Figure 11c shows the 3D crystal structures with the voids. The mutated regions are shown with large red markers.

Discussion
PH is an immensely popular method of TDA because it is based on clear mathematical foundations and is computable with, essentially, basic linear algebra. The existence of nontrivial robust holes in noisy experimental observations, as computed by PH, has found applications across diverse areas of research. However, computing the locations of holes has not received much attention because of the non-uniqueness of their boundaries and the high computational cost. This information, unfortunately, is crucial for understanding the functional significance of nontrivial features. This is especially important in scientific contexts, as we have demonstrated in three disparate areas. In this work, we provided a set of algorithms and a strategy to compute the locations of nontrivial holes with high precision. Moreover, we were able to process large data sets with millions of points.
Our method is based on the established matrix reduction algorithm that reduces boundary or coboundary matrices to compute PH. The following is a summary of our contributions and the main ideas of our strategy. We use the duality between cohomology and homology to reduce the cost of reducing the boundary matrix. We show that the reduction operations of the boundary matrix form a comprehensive set of representative boundaries in every scenario, which we call birth-cycles. We developed a recursive algorithm that computes the reduction operations for the boundary of any given simplex on the fly. This eliminates memory overhead, since the entire matrix of reduction operations is never stored in memory. It was further optimized in both memory usage and run-time by selective storage of the reduction operations of some simplices. It may be interesting to explore whether these simplices are related to critical simplices in Morse theory. The next major algorithm is the greedy shortening algorithm, which updates the set of representative boundaries with shorter ones. This algorithm is optimized by division into different cases. Greedy shortening decreased the lengths of representatives significantly in our case studies, shortening them by multiple log-scales in some data sets. We also locally smooth these boundaries. If an embedding of the data set is available, we use the shortened and smoothed boundaries to find subsets of the complete data set that contain significant features. In our case studies, the number of points in these subsets was much smaller than the number of points in the full data set. This reduced the computational cost of subsequent analysis. We add stochasticity in two different ways to possibly find shorter boundaries in the embedding of these subsets. From the multiple computed sets of boundaries, we construct a set of minimal representatives. It is important to note that our introductions of stochasticity were analytical tools, in the same way that randomization can improve the performance of sorting algorithms; in no way was the data reduced or sampled.
We suggest some alternative algorithm design choices that could be explored and might lead to improvements on our methodology. For example, we have defined the cover of an embedded set of points as the smallest hyper-rectangle that contains the points on or inside it. A tighter cover might be obtained by fitting an oriented bounding box computed from the eigenvectors of the covariance matrix of the embedded points. However, this would incur a higher computational cost than the current choice of cover. As another example, the perturbation parameter of a cover is based on the number of significant features being the same across all of its perturbations. This is a weak check that the topology does not change significantly across perturbations. A stronger check could use the L_0 norm between two PDs that we defined for our case study of proteins. However, the threshold of the L_0 norm at which two PDs are judged to be significantly different would have to be defined as a hyperparameter, and we generally avoided increasing the number of hyperparameters. An even stronger check could compute metrics such as the bottleneck and Wasserstein distances between two PDs. However, in addition to requiring a threshold hyperparameter that defines significantly different PDs, these are computationally expensive.
We list some technical limitations and possible resolutions as avenues for future work. All algorithms have been designed with the maximal size of data structures as a linear factor of O(n_e), where n_e is the number of edges in the final simplicial complex. However, the sizes of the reduction operations of the coboundary matrix and of the reduced boundary matrix depend on the reduction operations themselves and cannot be predetermined. They are stored as one-dimensional sparse matrices. Their maximum permissible size in our current implementation is the limit of unsigned int, l = 4,294,967,295. This limit was reached when we attempted to compute the PD of the HIV capsid (PDB 3j3q). One possible resolution is to increase the maximum permissible size of our data structures from l to l² by using two-dimensional sparse matrices, such that an overflow continues data entry in a new row. Another limitation is the dependence of our definition of covers on the availability of a spatial embedding. It is possible to define covers using only the pairwise distances. A comparison of computational benchmarks and results for different definitions of covers may be useful. A third limitation is the assumption that it is feasible to compute PH up to the maximum possible threshold for covers of shortened and smoothed boundaries. If that is not feasible, then we cannot compute representative homology boundaries for those covers. Instead, we can compute birth-cycles by computing PH up to and including a threshold of τ = τ_u + ε for all permutations of all perturbations. A strategy would then have to be developed to construct a set of minimal representatives from the multiple sets of shortened and smoothed birth-cycles.
To illustrate its applications, we provided three case studies from real data sets across diverse scientific fields. First, we computed loops genome-wide in the human genome at a high resolution of 1 kb. We found that auxin affects trans-cycles that pass through chromosome 13 and the sex chromosomes differently, as compared to all other chromosomes. Second, we computed voids in a distribution of 108,030 galaxies in the universe. We showed that the computed voids have properties that are statistically rare, and are therefore unlikely to be a product of chance. Third, we computed voids in protein homologs with different numbers of significant H_2 features. We highlighted three examples in which the differences in significant voids related to ligand interaction, mutation, and differences between species. We visually confirmed that the representatives that we computed for voids in the universe and in protein structures wrap tightly around singular unique voids. In conclusion, we believe that this work enables research into the functional significance of robust features in a plethora of scientific data sets.

Computing boundaries
We describe our algorithms and strategy to compute minimal boundaries around all significant holes (loops and voids) in a discrete data set, U, with all pairwise distances available.

Background and terminology
An n-simplex is a set of (n + 1) points (Supplementary Figure 13). We require 0-, 1-, 2-, and 3-simplices to compute PH up to and including H_2. These simplices are also called vertices, edges (e), triangles (t), and tetrahedrons (h), respectively. We define the diameter of a simplex as the maximal pairwise distance between its points. A simplicial complex at a spatial scale of τ is defined as the collection of all simplices with diameter at most τ. Topologically distinct loops and voids in a simplicial complex are computed as basis elements of its homology groups H_1 and H_2, respectively [Hatcher, 2001]. Persistent homology computes changes in the number of basis elements as the spatial scale of observation, τ, increases [Edelsbrunner and Harer, 2008]. These changes are represented as births and deaths of basis elements of the homology groups. A (birth, death) pair is also called a persistence pair. Nontrivial persistence pairs are plotted with birth along the x-axis and death along the y-axis. These plots are called persistence diagrams (PDs). A feature with a higher persistence = death − birth is robust to larger variability in the data set.
The matrix reduction algorithm [Edelsbrunner and Harer, 2008] is an established method to compute PDs. It also yields a set of representative boundaries. We briefly describe the method and refer to Figures 1a and 2a as an example. Simplices are denoted by σ_i, where 1 ≤ i ≤ N is the index of the simplex. They are indexed in the order in which they are added to the simplicial complex. Hence, they are indexed in order of increasing diameters. Those with the same diameter are assigned unique indices arbitrarily. The boundary matrix D is constructed for a simplicial complex as follows. Row and column i of D correspond to simplex σ_i in the simplicial complex. Column i of D has 1 at the boundary-simplices of σ_i (blue boxes in Figure 2a), and is 0 (gray boxes) otherwise. The boundary-simplices of an n-simplex are all (n − 1)-simplices that can be constructed from its points. We denote the column corresponding to a simplex σ_i in a matrix M by M(σ_i).
Every non-zero column of D has a lowest non-zero element (shown by red boxes), called a low. The matrix reduction algorithm then dictates reducing the columns of D such that each row of D has at most one low (red box). Each reduction operation is a sum of columns mod p, where p is a prime number. In this work, p = 2. The reduction of D proceeds from left to right as follows. A column j of D is reduced only when all columns i < j have been reduced. Also, column j is reduced only with a column i < j. This is written as a matrix multiplication, DV = R, where V records the reduction operations for columns of D that result in the reduced matrix R. Note that V is always an upper triangular matrix because reduction operations are from left to right. There are two kinds of persistence pairs determined from R. First, a low (red box) in R at (i, j) implies that a feature was born when σ_i was added to the simplicial complex and died when σ_j was added. This persistence pair is denoted by (σ_i, σ_j). It is also called a pivot element of R. Also, R(σ_j) is a representative homology cycle for this feature. Second, if R(σ_i) = 0 and σ_i is not in any low of R (σ_8 in the example), then the persistence pair (σ_i, ∞) is a feature that was born when σ_i was added but does not die. Such features might exist when PH is computed up to a spatial scale that is less than the maximum of all pairwise distances. There is no representative boundary in R for such a feature. Another way to compute exactly the same PD is by reducing the coboundary matrix, D^⊥. It is the off-diagonal transpose of D. Hence, column and row i of D^⊥ correspond to simplex σ_{N−i+1}. Similar to the reduction of the boundary matrix, the coboundary matrix D^⊥ is also reduced from left to right and written as D^⊥ V^⊥ = R^⊥. Persistence pairs from R^⊥ and R have a bijective mapping [De Silva et al., 2011]. If (σ_j, σ_i) is a pivot of R^⊥, then (σ_i, σ_j) is a pivot of R.
The pair (σ_i, σ_j) is a persistence pair with birth when σ_i is added to the complex and death when σ_j is added. If R^⊥(σ_i) = 0 and there is no pivot in the row for σ_i, then R(σ_i) = 0 and there will be no pivot in the row for σ_i in R. Hence, the pair (σ_i, ∞) is a feature that does not die. In the example there are two H_1 features, (σ_9, σ_10) = (2.75, 2.75) and (σ_8, ∞) = (2.5, ∞). The non-zero columns of R form a set of representative boundaries. In the example, R(σ_10) = {σ_5, σ_6, σ_9} is the only H_1 representative homology boundary. However, there are two H_1 features. Hence, the columns of R do not form a comprehensive set of boundaries when there are features that do not die.
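The left-to-right reduction DV = R described above can be sketched in a few lines. This is a minimal mod-2 illustration, with columns stored as sets of row indices; the function and variable names are ours, not the paper's implementation.

```python
# A minimal sketch of mod-2 boundary matrix reduction, assuming each
# column is the set of row indices of its non-zero entries.

def low(col):
    """Largest row index of a non-zero entry, or None for a zero column."""
    return max(col) if col else None

def reduce_boundary(D):
    """Left-to-right reduction D V = R over GF(2).

    D is a list of columns (sets of row indices) in filtration order.
    Returns (R, V), where V[j] records which columns were added into column j.
    """
    n = len(D)
    R = [set(c) for c in D]
    V = [{j} for j in range(n)]        # V starts as the identity
    pivot_of = {}                      # row index of a low -> column index
    for j in range(n):
        while R[j] and low(R[j]) in pivot_of:
            i = pivot_of[low(R[j])]
            R[j] ^= R[i]               # addition mod 2 = symmetric difference
            V[j] ^= V[i]
        if R[j]:
            pivot_of[low(R[j])] = j
    return R, V
```

Running this on the boundary matrix of a filled triangle (vertices 0-2, edges 3-5, triangle 6) reproduces the behavior in the text: the last edge column reduces to zero, and its column of V is the birth-cycle of the loop.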

Our strategy and algorithms
The aim is to compute the tightest dense boundaries that surround robust topological features. We quantify these conditions by two user-defined thresholds. The first is a lower limit on persistence, denoted by ε. It defines the desired robustness. The second is an upper threshold for the birth of features, denoted by τ_u. A lower value of τ_u is expected to restrict the results to denser boundaries. This is because any point in an H_1 representative that is born at a spatial scale ≤ τ_u will have at least two neighbors within spatial distance τ_u. Similarly, a point in an H_2 representative with birth less than τ_u will have at least three neighbors within τ_u. Hence, we define a hole with birth ≤ τ_u and persistence ≥ ε as significant. Now, the persistence of a feature (b, ∞) is at least τ − b, where τ is the threshold for PH computation. The threshold for PH computation then has to be at least τ = τ_u + ε. Otherwise, a feature born at b ∈ (τ_u − ε, τ_u] that does not die might be erroneously considered insignificant if its persistence is estimated as τ − b.
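The significance test described above can be sketched directly; this is an illustrative helper of our own (not the paper's API), with the persistence of an immortal feature bounded below by τ − birth.

```python
# A small sketch of the significance test, assuming a persistence pair is
# (birth, death) with death = float('inf') for features that never die
# within the computed threshold tau.
import math

def is_significant(birth, death, tau, tau_u, eps):
    """A feature is significant if birth <= tau_u and persistence >= eps.
    For an immortal feature, persistence is at least tau - birth."""
    persistence = (tau - birth) if math.isinf(death) else (death - birth)
    return birth <= tau_u and persistence >= eps
```

With the protein-study values τ_u = 10 and ε = 3.5, an immortal feature born at 9 is correctly judged significant when τ = 13.5, but would be missed if PH were computed only up to τ = 10, which is exactly the pitfall the text warns about.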
We first compute the persistence pairs by reducing the coboundary matrix D^⊥ to compute R^⊥. From the bijective mapping between pivots of R^⊥ and R, we know that R(σ_i) ≠ 0 if σ_i is in some pivot of R^⊥. Hence, we reduce all such D(σ_i). As a result, we determine exactly which columns of D need reduction. This significantly reduces the run-time of computing R. For an even lower run-time, we implement the reduction of columns of D using the paired-indexing and serial-parallel algorithms that we developed for coboundary matrix reduction in our previous work (Supplementary 5.1.2). The default batch-size for parallel reduction of boundaries is chosen to be 1000.
Next, we compute a comprehensive set of representative boundaries. The columns of R will not yield a comprehensive set of boundaries if there are features that do not die. Therefore, we define a comprehensive set of representative boundaries by the columns V(σ_i) for all simplices σ_i such that a feature is born when they are added to the complex. We call these birth-cycles. In the example in Figure 2a, {V(σ_8), V(σ_9)} = {{σ_5, σ_6, σ_7, σ_8}, {σ_5, σ_6, σ_9}} is the set of birth-cycles.
The matrix V is a sparse upper triangular matrix. Even so, it is not feasible to store it in computer memory, since the size of a column can go up to O(n^4) for a data set of n points in the worst case for H_2 computation (see Figure 14). We developed a recursive algorithm that uses R to compute a column of V on the fly without having to store any other column of V in computer memory (see Supplementary 5.2 for details). The computed column is directly written to a file. This eliminates the memory overhead of computing birth-cycles but can have a high run-time, depending on the number of recursions and their depths. We observed in our test data sets that a small number of columns of V were used more frequently in recursions than average. The algorithm selectively stores such columns in computer memory when they are first computed. This decreases the run-time of any further recursions that require these columns. Hence, the recursive algorithm is efficient in both memory and run-time.
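The idea of recomputing a column of V on the fly, with selective memoization of frequently reused columns, can be sketched as follows. This is our own simplified illustration of the recursion, not the paper's optimized implementation: it replays the reduction of one column of D against the already-reduced matrix R.

```python
# A sketch of computing a birth-cycle V(sigma_j) on the fly from R.
# Columns are sets of simplex indices; pivot_of maps the row index of a
# low in R to its column index. Names are illustrative.

def low(col):
    return max(col) if col else None

def birth_cycle(j, D, R, pivot_of, cache, hits):
    """Replay the reduction of column j of D against R, recursing for the
    birth-cycles of the columns that were added into it."""
    if j in cache:
        return set(cache[j])
    col = set(D[j])
    v = {j}
    while col and low(col) in pivot_of and pivot_of[low(col)] != j:
        i = pivot_of[low(col)]
        col ^= R[i]                    # addition mod 2
        v ^= birth_cycle(i, D, R, pivot_of, cache, hits)
    hits[j] = hits.get(j, 0) + 1
    if hits[j] > 1:                    # store only columns reused in recursions
        cache[j] = set(v)
    return v
```

On the filled-triangle example (vertices 0-2, edges 3-5, triangle 6), the recursion recovers V(σ_5) = {σ_3, σ_4, σ_5} from R alone, without ever storing V.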
After computing the birth-cycles, we shorten them in multiple stages as follows.
I Greedy shortening: We initialize the set of representative boundaries, C = {C_i}, as the set of all columns V(σ_i) such that a nontrivial feature is born when σ_i is added to the simplicial complex. The length of a cycle C is the number of simplices in it (the number of edges in H_1 representatives and the number of triangles in H_2 representatives). We denote it by |C|. The basic idea of the shortening algorithm is defined by the following steps: a. Find a pair of distinct cycles (C_i, C_j) such that |C_i ⊕ C_j| < |C_i|, and replace C_i by C_i ⊕ C_j. b. If for a C_i there are multiple such C_j, we arbitrarily pick one to shorten C_i. c. Go to step a.
A straightforward implementation of computing d* is to check |C_i ⊕ C_j| for all possible pairs, where ⊕ denotes addition mod 2. However, this might not be feasible in run-time if the number of cycles is large. For example, it took around 12 hours to shorten a set of around 100,000 H_1 representatives using this approach. We optimized the greedy algorithm into different cases (Supplementary 5.3 for details) such that it took under 20 minutes to shorten the 100,000 H_1 representatives. We implement this optimization for H_1 representatives but not for H_2. This is because a similar optimization for H_2 representatives would theoretically require a data structure of size O(n^3) in the worst case, whereas our aim is to ensure that all data structures are O(n^2). The straightforward implementation for H_2 representatives was computationally feasible in all of our examples because of the small number of H_2 features.

II Check for connectedness: C_i ⊕ C_j can possibly result in multiple disconnected cycles. We say that two H_1 representatives (H_2 representatives) are disconnected if they have no common 1-simplex (2-simplex). At the conclusion of the greedy algorithm, we check the connectedness of every cycle (Supplementary 5.4), and we record disconnected ones as separate cycles in C.

III Local smoothing: We smooth the resulting shortened cycles by reducing them with trivial topological features.
Hence, we reduce H_1 representatives with triangles and H_2 representatives with tetrahedrons. We developed algorithms in which we smooth a cycle by iterating over only those trivial features that share a boundary with it (Supplementary 5.5 for details).
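The straightforward (all-pairs) variant of the greedy shortening step can be sketched as follows, with cycles as sets of simplices and the mod-2 sum as the symmetric difference. This is a naive illustration of our own, not the optimized case analysis of Supplementary 5.3.

```python
# A naive sketch of greedy shortening, assuming each cycle is a set of
# simplices and addition mod 2 is the symmetric difference.

def greedy_shorten(cycles):
    """Repeatedly replace a cycle C_i by C_i ^ C_j whenever that strictly
    decreases its length, until no such pair remains."""
    cycles = [set(c) for c in cycles]
    changed = True
    while changed:
        changed = False
        for i, ci in enumerate(cycles):
            for j, cj in enumerate(cycles):
                if i == j or not cj:
                    continue
                cand = ci ^ cj
                if len(cand) < len(ci):
                    cycles[i] = ci = cand
                    changed = True
    return cycles
```

For two overlapping cycles, the longer one is replaced by the shorter symmetric difference, which is exactly the shortening move described in step I.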
We now have C = {C_i} as a set of shortened and smoothed boundaries. If a spatial embedding of the data set is available, we add stochasticity to possibly get shorter representative boundaries.
IV We presume that a spatial embedding E of the data set U is available that maps all of its points bijectively to points in either R² or R³ (Cartesian coordinates), E : U → R² or R³. WLOG we will consider the embedding to be in R³.
a. Defining and computing covers of cycles: We define a local cover of a cycle C in the embedding of the full data set as follows. Let there be k points in a cycle C, denoted by {c_1, ..., c_k} ⊂ U. Its embedding is E(C) = {p_1, ..., p_k}, where p_j = (p_j^1, p_j^2, p_j^3), 1 ≤ j ≤ k. Then, the smallest hyper-rectangle (in Cartesian coordinates) that contains the cycle is [x↓, x↑] × [y↓, y↑] × [z↓, z↑], where x↓ = min_j p_j^1, x↑ = max_j p_j^1, and similarly for y and z. We define a cover of C, denoted C̄, as the set of all points of the data set that are embedded inside or on this hyper-rectangle.

b. Eliminating cycles that cannot be around significant holes: We determine whether a cycle C_i ∈ C cannot wrap around a significant topological feature. The cover of every cycle C_i is computed, denoted by C̄_i. We compute PH up to and including τ for the embedded points E(C̄_i). The number of significant features is denoted by n(C̄_i). If n(C̄_i) = 0, then the embedding of cover C̄_i does not contain any significant feature. Hence, cycle C_i cannot wrap around any significant feature in E(U). We ignore such cycles in subsequent analysis.

c. Graphical contraction of covers: At this point, C̄ = {C̄_i} is a collection of covers such that E(C̄_i) possibly contains at least one significant feature. We define two rules to update this collection such that there is a possible decrease in the number of covers and/or in the number of points in some covers. We implement these two rules by representing the elements of C̄ as nodes of a graph and conducting a graphical analysis (Supplementary 5.6 for the algorithm).

d. Stochasticity to find multiple sets of representative boundaries: We add stochasticity in two different ways in each C̄_i. In Supplementary 5.7, we give a rationale for why these might make it possible to discover shorter representative cycles. We detail these two methods as follows.
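The cover computation in step a can be sketched directly. This is an illustrative implementation of our own, assuming the embedding is given as a dict from point id to (x, y, z) and a cycle is represented by the set of point ids on it.

```python
# A sketch of computing the cover of a cycle: all data-set points inside
# or on the smallest axis-aligned hyper-rectangle containing the cycle.

def cover(cycle_points, embedding):
    """cycle_points: ids of the points on the cycle.
    embedding: dict mapping every point id to an (x, y, z) tuple."""
    xs, ys, zs = zip(*(embedding[c] for c in cycle_points))
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    return {u for u, p in embedding.items()
            if all(l <= v <= h for v, l, h in zip(p, lo, hi))}
```

Note that the cover can pick up points of U that are not on the cycle itself, which is exactly why it is a useful subset for re-running PH locally.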
i. Spatial perturbation of points in the embedding: For every cover C̄_i, we construct a user-defined number of perturbations, denoted by n_pert. Every point p_j in E(C̄_i) is perturbed randomly in a ball centered at the point. The radius of this ball is defined as the smaller of (the distance to the nearest neighbor of p_j)/3 and a maximum perturbation ∆_i that is allowed for all points in E(C̄_i). Such an upper threshold on the magnitude of a perturbation is required so that the topology does not change significantly compared to the unperturbed embedding. We call ∆_i the perturbation parameter for C̄_i; it is set using the smallest positive integer m such that the number of significant features in all the n_pert perturbations is the same as that in E(C̄_i) (PH computed up to τ = τ_u + ε).

ii. Permutations for every perturbation: We permute the indices of edges that have the same diameter.
This results in a permutation of the columns of the coboundary and boundary matrices. This stochasticity is merely a change of labels and therefore introduces nothing topologically new into the computation, but it helps to overcome the greedy algorithm's local-minimum problem. We construct n_perm permutations, a user-defined hyperparameter. It is possible that some re-indexed sets result in exactly the same indexing of edges. There is a higher chance of this if the maximum possible number of unique indexings is not very large compared to n_perm. Therefore, we discard a permutation if its indexing of edges already exists in the set of permutations.

iii. Processing permutations of perturbations to construct the final set of minimal representatives: We compute representative homology boundaries for the valid permutations of all perturbations of all covers. We then determine minimal representative boundaries from the multiple computed sets. One way could be to compare lengths of the multiple computed boundaries around a topological feature.
To do so, a matching of features across different PDs is first required. There are methods to match persistence pairs between two PDs, for example, minimizing the bottleneck distance or the Wasserstein distance, to name a few. However, no metric can ensure that two features matched across PDs will correspond to a significant hole located in a similar region of the embedding. PDs across permutations of the same perturbation will have the same set of persistence pairs, but even then there can be ambiguity if there exist two significant features with the same birth and death. Therefore, we do not match persistence pairs and instead define the set of minimal representative boundaries as follows. We first disregard all representative homology boundaries that do not contain significant features. Then, we pick all sets that minimize the longer of the remaining boundaries as follows. We sort the boundaries in each set in decreasing order of length. The number of boundaries is equalized across all sets by inserting boundaries of length 0. Then the list of sets of representative boundaries is sorted in increasing order of the length of the longest boundary as first priority, the second longest length as second priority, and so on. We select the sets of representative boundaries with the lowest order in the resulting sorted list. If only one set has the minimum order, then that set is defined as the minimal set and we are done. If there are multiple sets with the minimum order, then we define the union of all simplices in the representatives of these sets as the set of minimal representatives.
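The selection by sorted length profiles can be sketched as follows. This is a simplified illustration of our own that assumes insignificant boundaries have already been discarded; boundaries are sets of simplices, and "lowest order" is the lexicographically smallest profile.

```python
# A sketch of selecting minimal sets of representative boundaries by
# comparing zero-padded, descending length profiles.

def minimal_sets(sets_of_boundaries):
    """Each element is one computed set of boundaries (sets of simplices).
    Returns the single winning set, or [union of simplices] on a tie."""
    width = max(len(s) for s in sets_of_boundaries)
    profiles = []
    for s in sets_of_boundaries:
        lengths = sorted((len(b) for b in s), reverse=True)
        lengths += [0] * (width - len(lengths))   # pad with length-0 boundaries
        profiles.append(tuple(lengths))
    best = min(profiles)                          # lexicographic minimum
    winners = [s for s, p in zip(sets_of_boundaries, profiles) if p == best]
    if len(winners) == 1:
        return winners[0]
    return [set().union(*(b for s in winners for b in s))]
```

Comparing padded tuples lexicographically implements the priority order in the text: longest boundary first, then the second longest, and so on.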
IV' If a spatial embedding is not available, then we can implement multiple permutations and select a set of minimal boundaries as defined above.

Hi-C analysis
We first summarize the methodology of a Hi-C experiment to understand the data obtained. Nuclear DNA from millions of cells is shredded. Pairs of spatially close shredded fragments have a high chance of ligating due to their inherent stickiness. Paired fragments are sequenced and counted. The counts are aggregated at a specific resolution, say r kb, and reported as follows. The linear chromosomes are partitioned into fixed-length segments of r kb, called bins.
The Hi-C contact matrix, M = [m_ij], is constructed where m_ij reports the aggregated counts, or contact frequency, of paired fragments where one fragment is in bin i and the other is in bin j. The contact matrix is balanced to account for experimental biases. There are multiple methods to balance or normalize Hi-C contact matrices [Lyu et al., 2020]. We implemented iterative cross-entropy (ICE) normalization using Cooler [Abdennur and Mirny, 2020]. Fragments that are spatially close in the folded genome are expected to have higher counts, or contact frequencies. Hence, we estimated the pairwise-distance matrix by taking the multiplicative inverse of the non-zero contact frequencies, i.e., D̂ = [d̂_ij = 1/m_ij]. The pairwise distance for zero contact frequencies was estimated as infinity. In other words, an edge between bin i and bin j is not added to the simplicial complex if m_ij = 0, regardless of the threshold of the PH computation. ICE normalization resulted in m_ij = NaN for a few pairs. The corresponding edges were also not added to the simplicial complex.
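The contact-to-distance conversion above can be sketched in pure Python. This is an illustrative helper of our own; zero and NaN contacts both map to an infinite distance, so the corresponding edges never enter the simplicial complex.

```python
# A sketch of estimating pairwise distances from a balanced Hi-C contact
# matrix: d_ij = 1/m_ij for positive finite entries, infinity otherwise.
import math

def contacts_to_distances(M):
    """M: square list-of-lists of contact frequencies (may contain NaN)."""
    n = len(M)
    D = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            m = M[i][j]
            if m and not math.isnan(m) and m > 0:
                D[i][j] = 1.0 / m
    return D
```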

Redshift data set
The redshift data set was obtained from http://www-wfau.roe.ac.uk/6dFGS/. It comprises 110,256 unique and reliable redshifts and holds 136,304 spectra. We only considered entries classified as galaxies by the SuperCOSMOS star/galaxy classifier provided in the data set. This resulted in a data set of 108,030 galaxies. We computed an embedding of this data set in R³ using the recession velocity (cz, km s⁻¹), galactic latitude (b), and galactic longitude (l) as follows. Hubble's law was used to compute the proper distance, D = cz/H_0 Mpc, where H_0 = 72.1 km s⁻¹ Mpc⁻¹ is the Hubble constant. We computed spherical coordinates (ρ, θ, φ) from (D, b, l) as follows: ρ = D, θ = l, and φ = (π/2 − b). Spherical coordinates were transformed to Cartesian coordinates, (x, y, z), which define the embedding of this data set in R³.
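The embedding described above can be sketched as follows, assuming angles in radians and the standard spherical-to-Cartesian convention with φ as the polar angle; the function name and signature are ours.

```python
# A sketch of the 6dFGS embedding: recession velocity cz (km/s), galactic
# latitude b and longitude l (radians) -> Cartesian (x, y, z) in Mpc,
# assuming Hubble's law with H0 = 72.1 km/s/Mpc as in the text.
import math

H0 = 72.1  # km/s/Mpc

def embed(cz, b, l):
    rho = cz / H0                  # proper distance in Mpc
    theta = l                      # azimuthal angle from longitude
    phi = math.pi / 2 - b          # polar angle from latitude
    x = rho * math.sin(phi) * math.cos(theta)
    y = rho * math.sin(phi) * math.sin(theta)
    z = rho * math.cos(phi)
    return x, y, z
```

As a sanity check, a galaxy at the galactic north pole (b = π/2) lands on the positive z-axis, and one at b = 0, l = 0 lands on the positive x-axis.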
To test whether the computed voids have a rare set of features, we implemented two random sampling schemes, spatial and graphical, as follows.

Spatial sampling
1. For every computed minimal representative boundary C_i, 1 ≤ i ≤ 40, we computed its cover C̄_i = [x↓_i, x↑_i] × [y↓_i, y↑_i] × [z↓_i, z↑_i]. Its dimensions are then ∆x_i = x↑_i − x↓_i, ∆y_i = y↑_i − y↓_i, and ∆z_i = z↑_i − z↓_i. The minimum number of galaxies in or on the covers is denoted by n_min.
2. We defined the minimum and maximum dimensions over all covers by ∆x_min = min_i{∆x_i}, ∆x_max = max_i{∆x_i}, and similarly in the y and z dimensions.
3. We created a grid in 3D Cartesian coordinates with each voxel of dimensions ∆x = ∆x_min, ∆y = ∆y_min, and ∆z = ∆z_min.
4. We picked a point (x, y, z) randomly inside or on a non-empty voxel.
5. We randomly picked a set of dimensions, ∆x, ∆y, ∆z, from [∆x_min, ∆x_max], [∆y_min, ∆y_max], [∆z_min, ∆z_max], respectively. A hyper-rectangle of the picked dimensions with (x, y, z) as center is defined as a sampled cover. If the number of galaxies in or on this cover is at least n_min, then we recorded it as a sample; otherwise it was discarded.
6. The locations of all galaxies inside and on a valid sampled cover from the previous step were transformed to spherical coordinates, with the center of the cover translated to the origin. Then we computed four features: size of the cover as the number of galaxies in and on it, spherical uniformity (see 4.3.3 for details on our definition and estimation), radius as the radial distance of the closest galaxy from the center, and eccentricity as the ratio of the smallest dimension of the cover to the longest.

Graphical sampling
1. We constructed a graph, G = (V, E), on the embedding. The set of nodes V was defined as all galaxies, and an edge between two nodes was defined iff the spatial distance between them is at most τ_u (picked as 15 in this case study).
2. The number of 0-simplices in the computed set of minimal representatives varies in the interval [8, 149]. We randomly selected an integer, n_i, in this range, which defines the number of nodes of our graphical sample. We then selected a random connected subgraph of G with number of nodes equal to n_i and minimum degree 3. This was done for a fair comparison with the computed representative H_2 homology boundaries, because every 0-simplex in them has at least three neighbors, as they are sums of tetrahedrons. For a valid subgraph, we computed its cover and the features of the cover that were defined in spatial sampling.

Spherical uniformity
We defined the spherical uniformity of a set of points in R³ by estimating how evenly they are distributed in the space of the polar angle (θ) and azimuthal angle (φ). A grid of pixel size π/10 rad × π/10 rad was constructed on this space. We defined the spherical uniformity, u, as the ratio of non-empty pixels to the total number of pixels.
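The estimate above can be sketched directly. This is an illustrative implementation of our own, assuming θ ∈ [0, π] and φ ∈ [0, 2π), so the π/10 × π/10 grid has 10 × 20 = 200 pixels.

```python
# A sketch of the spherical-uniformity estimate: the fraction of occupied
# pixels on a (theta, phi) grid with pixel size pi/10 x pi/10.
import math

def spherical_uniformity(points):
    """points: iterable of (theta, phi), theta in [0, pi], phi in [0, 2*pi)."""
    n_theta, n_phi = 10, 20            # pi/(pi/10) and 2*pi/(pi/10) pixels
    pixel = math.pi / 10
    occupied = set()
    for theta, phi in points:
        i = min(int(theta / pixel), n_theta - 1)
        j = min(int(phi / pixel), n_phi - 1)
        occupied.add((i, j))
    return len(occupied) / (n_theta * n_phi)
```

A single point, or several points falling in the same pixel, gives u = 1/200; a perfectly spread distribution approaches u = 1.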

Pseudo p-values
To estimate the rarity of the set of four features of the voids computed by our strategy, we computed a pseudo p-value as follows. Let V = {(v_i^1, v_i^2, v_i^3, v_i^4)} denote the set of features of the covers of voids computed by PH, 1 ≤ i ≤ 40. There are two sets of 100,000 random covers, one each from spatial and graphical sampling. WLOG let S = {(f_i^1, f_i^2, f_i^3, f_i^4)} denote the set of the four features of N random samples, 1 ≤ i ≤ N. We re-scale each feature of S to [0, 1]. We performed PCA on this set of points, separately for the re-scaled spatial and graphical samples. Figure 12 shows the histograms of the coordinates in the four PCA dimensions for both. The variance ratios of the four PCA components for spatial sampling are [0.52, 0.31, 0.15, 0.007] and for graphical sampling are [0.8, 0.093, 0.072, 0.034]. Note that the last PCA component of spatial sampling accounts for only 0.7% of the variance in the embedded points. We re-scaled V in the same way and computed the coordinates, (w_i^1, w_i^2, w_i^3, w_i^4), of the re-scaled feature sets along the PCA dimensions. Then, for a void in V with re-scaled and transformed coordinates (w^1, w^2, w^3, w^4) along the PCA dimensions, we computed a pseudo p-value as follows. Let c_d be the number of samples with PCA coordinate value greater than w^d in PCA dimension d. Then, the p-value for the void was defined as ∏_{d=1}^{4} (c_d/n), where n is the total number of samples.

Table 3: Some basic terminology, symbols, and descriptions.
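The pseudo p-value itself is a simple product of tail fractions and can be sketched as follows. This is an illustrative helper of our own; the re-scaling and PCA transformation are assumed to have been applied already.

```python
# A sketch of the pseudo p-value: for a void's PCA coordinates, multiply
# the fraction of samples exceeding it in each PCA dimension.

def pseudo_p_value(void_coords, sample_coords):
    """void_coords: tuple (w1, ..., wd) of the void along PCA dimensions.
    sample_coords: list of same-length tuples for the random samples."""
    n = len(sample_coords)
    p = 1.0
    for d, w in enumerate(void_coords):
        c = sum(1 for s in sample_coords if s[d] > w)   # samples beyond w
        p *= c / n
    return p
```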

PDB analysis to find homologs with different H_2 topology
Spatial coordinates of the backbones of PDB entries were extracted using the ProDy package [Zhang et al., 2021]. A PDB entry q_i was marked as a homolog of p_i if it satisfied the following criteria: at least 75% of p_i matched q_i with a score of at least 75%, and the lengths of their backbones differed by at most 25%. A script was written to automate this search using the Python package PyPDB [Gilpin, 2016]. PDB entries that had a residue entry not among the 20 amino acids were ignored.
We defined an L_0 distance between two H_2 PDs as follows. For each PD, we constructed a list of significant barcode lengths, sorted in decreasing order. The number of entries in the two lists was equalized by padding with 0. The maximum of the element-wise difference between these two lists was defined as the L_0 distance between the two PDs.
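A minimal sketch of this distance, assuming the element-wise difference is taken in absolute value:

```python
def l0_distance(bars_a, bars_b):
    """L_0 distance between two lists of significant barcode lengths.

    Sorts each list in decreasing order, pads the shorter one with 0,
    and returns the maximum absolute element-wise difference.
    """
    a = sorted(bars_a, reverse=True)
    b = sorted(bars_b, reverse=True)
    n = max(len(a), len(b))
    if n == 0:
        return 0
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    return max(abs(x - y) for x, y in zip(a, b))

# [5, 3] vs [4, 0]: element-wise differences are 1 and 3.
assert l0_distance([5, 3], [4]) == 3
```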
5 Supplementary information

Computing homology
We first compute the persistence pairs by reducing the coboundary matrix and store them as p⊥. From the duality between homology and cohomology, it follows that we need to reduce a column D(σ_i) only if (σ_i, σ_j) is a persistence pair in p⊥. Additionally, we provide a way to efficiently determine trivial persistence pairs that, similar to apparent pairs [Bauer, 2021], do not require any reduction. Hence, we reduce the column D(σ_i) only if (σ_i, σ_j) is a persistence pair in p⊥ and is not a trivial persistence pair.

Trivial persistence pairs
Suppose t is the smallest triangle in the coboundary of an edge e. Additionally, if e is the diameter of t, then it will be the lowest edge in D(t). Further, since t is the smallest triangle in the coboundary of e, there is no triangle smaller than t that has e in its boundary. Hence, all entries in row e to the left of column t will be 0. As a result, the lowest entry in column t is the first non-zero entry in row e, making it a pivot entry (first low in the row from the left). Hence, column t will not require any reduction operations, and we say that (e, t) is a trivial persistence pair in H_1.
Now, we will show that if (e, t) is a trivial persistence pair in H_1, then (t, e) is a persistence pair in H_1 such that the coboundary of e does not require any reduction operations. Since t is the lowest triangle in the coboundary of e, all entries below row t in column e of the coboundary matrix will be 0. Also, since e is the diameter of t, all entries in row t to the left of column e will be 0, making (t, e) a pivot entry. As a result, column e of the coboundary matrix will not require any reduction operations, and we say that (t, e) is a trivial persistence pair in H_1. In Dory, we compute cohomology and store the persistence pairs that are not trivial in p⊥. Therefore, to compute H_1, we need to iterate only over the triangles that are in some persistence pair in p⊥, because these triangles are also the ones that are not in a trivial persistence pair of H_1.
Since we do not store trivial persistence pairs, we have to check for them at every reduction step as well. Suppose e is the lowest edge in the partially reduced boundary of a triangle. Then, if (e, t′) is a trivial persistence pair, the next reduction has to be with the boundary of t′. So, our aim is to determine whether there is a trivial persistence pair (e, t′) for a given edge e. To do so efficiently, we use the paired-indexing that we introduced in Aggarwal and Periwal [2021]. Paired-indexing encodes a simplex such that it stores information about its diameter. A triangle t is stored as ⟨k_p, k_s⟩, where k_p is its diameter and k_s is the third vertex of the triangle that is not in the edge corresponding to its diameter. Then, for an edge e, if t′ = ⟨e, k_s⟩ is the lowest triangle in the coboundary of e, then (e, t′) is a trivial persistence pair. These checks are computationally feasible because the number of reduction steps is lowered on account of processing only the triangles that are in some persistence pair in p⊥.
Similarly, to compute H_2 we iterate over the tetrahedra that are in some persistence pair in p⊥. As before, we have to check for trivial persistence pairs during reduction because we do not store them in computer memory. Suppose t is the lowest triangle in the partially reduced boundary of a tetrahedron. Then, if (t, h′) is a trivial persistence pair, the next reduction has to be with the boundary of h′. If h′ is the smallest tetrahedron in the coboundary of t and the maximum triangle in the boundary of h′ is also t, then (t, h′) is a trivial persistence pair, and the next reduction has to be with the boundary of h′.
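The H_1 trivial-pair criterion can be illustrated on a toy point set (a sketch, not Dory's implementation; the (diameter, vertex-tuple) tie-break for the "smallest" cofacet is our assumption):

```python
from itertools import combinations
from math import dist  # Python 3.8+

def trivial_pair(points, e):
    """Return the triangle t making (e, t) a trivial H_1 persistence pair, else None.

    points: list of coordinate tuples; e: pair of point indices.
    A triangle's diameter is its longest edge; among the cofacet triangles
    of e, the smallest is taken by (diameter, vertex tuple) — an assumed
    tie-break. (e, t) is trivial iff e realizes the diameter of t.
    """
    def edge_len(u, v):
        return dist(points[u], points[v])

    cofacets = []
    for w in range(len(points)):
        if w in e:
            continue
        t = tuple(sorted((*e, w)))
        diam = max(edge_len(u, v) for u, v in combinations(t, 2))
        cofacets.append((diam, t))
    if not cofacets:
        return None
    diam, t = min(cofacets)
    return t if abs(edge_len(*e) - diam) < 1e-12 else None

pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]      # right triangle, sides 3, 4, 5
assert trivial_pair(pts, (1, 2)) == (0, 1, 2)   # the hypotenuse is the diameter
assert trivial_pair(pts, (0, 1)) is None        # the side of length 3 is not
```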

Final algorithm for homology computation
We reduce the boundaries of simplices in batches using the serial-parallel reduction that we introduced in Aggarwal and Periwal [2021] (Figure 15). The default batch size for the parallel reduction of boundaries is chosen to be 1000. The flowcharts for the serial and parallel reductions are shown in Appendix A. Only those triangles that are in some persistence pair in p⊥ need to be considered for reduction.

Recursive algorithm to compute columns of V
We develop a recursive algorithm that computes V(σ) without having to store V. We also show that by selectively storing a few columns of V, we can significantly reduce the computation time at a very low cost of additional memory.

Computing birth-cycles for H 1
We show how we use R to reduce the boundary of any edge e_o in the simplicial complex on the fly. We consider two cases: R(e_o) is non-empty (or non-zero), and R(e_o) is empty (or zero). We first go over the former case, since the latter will require the former. The function FindV in Algorithm 1 computes V(e_o) when R(e_o) is non-empty. We begin by defining r_{e_o} = ∂e_o and V(e_o) = [ ]. We reduce r_{e_o} with R by updating r_{e_o} ← r_{e_o} ⊕ R(e′) if low(R(e′)) = low(r_{e_o}), and append e′ to V(e_o). However, we also have to append to V(e_o) the reduction operations corresponding to R(e′), that is, V(e′). This sets up a recursion to compute V(e′). It remains to define the termination of the recursion. We show that ∂e_o will always reduce to zero. Initially, r_{e_o} = ∂e_o. Since R(e_o) is the complete reduction of ∂e_o that exists in R, we know that reduction of ∂e_o with R will eventually make it equal to R(e_o). Hence, r_{e_o} = R(e_o) at some point in the reduction, and the next reduction step will be summing r_{e_o} and R(e_o), because they have the same low. This will result in zero.
Algorithm 1 Compute V(e_o) for edge e_o with R(e_o) non-empty
1: Input: R, e_o
2: Output: V(e_o)
3: V(e_o) ← [ ]
4: FindV(e_o, V(e_o))
5: function FindV(e, V(e_o))
6: Append e to V(e_o)
7: r_e ← ∂e
8: while r_e is not empty do
9: if low(R(e′)) = low(r_e) then
10: r_e ← r_e ⊕ R(e′)
11: if r_e is not empty then
12: FindV(e′, V(e_o))

We now tackle the latter case of R(e_o) empty. Initially, r_{e_o} = ∂e_o. In contrast to the former case, V(e_o) is initialized with [e_o] and is not empty. We reduce r_{e_o} with R by updating r_{e_o} ← r_{e_o} ⊕ R(e′) if low(R(e′)) = low(r_{e_o}). Since R(e′) will be a non-empty column of R, we use FindV (Algorithm 1) to compute the reduction operations required to reduce ∂e′ to R(e′) and append them to V(e_o). The reduction ends when r_{e_o} inevitably reduces to zero, since R(e_o) is empty. We do not keep track of the coefficients of the edges in V(e_o) during the reduction, to reduce computation time. Hence, at the conclusion of the recursion, we remove from V(e_o) the edges with zero coefficient (modulo 2), since their boundaries sum to 0 at the end of the reduction loop.
Algorithm 2 Compute V(e_o) for edge e_o with R(e_o) empty
1: Input: R, e_o
2: Output: V(e_o)
3: V(e_o) ← [e_o]
4: r_{e_o} ← ∂e_o
5: while r_{e_o} is not empty do
6: if low(R(e′)) = low(r_{e_o}) then
7: r_{e_o} ← r_{e_o} ⊕ R(e′)
8: FindV(e′, V(e_o))
9: Remove edges from V(e_o) with zero coefficient
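Algorithms 1 and 2 can be sketched in Python over Z/2 coefficients (a toy illustration, not Dory's implementation). We assume R is a map from edges to their non-empty reduced boundary columns, `pivot` maps each column's low to the edge owning it, and the final self-cancellation with R(e) itself is skipped instead of recursed into:

```python
def find_v(e, R, bnd, pivot, V):
    """Append to V the columns used to reduce ∂e to zero (R(e) non-empty case)."""
    V.append(e)
    r = set(bnd[e])
    while r:
        ep = pivot[max(r)]       # column of R whose low matches low(r)
        r ^= R[ep]               # mod-2 column addition
        if ep != e:              # recover ops for R(e') recursively
            find_v(ep, R, bnd, pivot, V)

def birth_cycle(e0, R, bnd, pivot):
    """Birth-cycle V(e0) for an edge e0 whose reduced boundary R(e0) is empty."""
    V = [e0]
    r = set(bnd[e0])
    while r:
        ep = pivot[max(r)]
        r ^= R[ep]
        find_v(ep, R, bnd, pivot, V)
    # mod-2 coefficients: keep edges appearing an odd number of times
    return {e for e in set(V) if V.count(e) % 2 == 1}

# Toy example: a single triangle on vertices 0 < 1 < 2.
bnd = {(0, 1): {0, 1}, (0, 2): {0, 2}, (1, 2): {1, 2}}
R = {(0, 1): {0, 1}, (0, 2): {0, 2}}   # reduced columns; R((1, 2)) is empty
pivot = {1: (0, 1), 2: (0, 2)}         # low(R(e)) -> e
assert birth_cycle((1, 2), R, bnd, pivot) == {(0, 1), (0, 2), (1, 2)}
```

The birth-cycle of the edge (1, 2) is the full triangle boundary, as expected.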

Computing birth-cycles for H 2
The algorithm to compute birth-cycles for H_2 is written similarly, by recursively reducing the boundaries of the triangles at which topological features are born. However, we do not store trivial persistence pairs of H_1 in R, and compute these on the fly. Also, if (e′, t′) is a trivial persistence pair, we know that R(t′) = ∂t′. Hence, we do not compute V(t′) in such cases (Algorithm 3).
Algorithm 3 Compute V(t_o) for triangle t_o with R(t_o) empty
1: Input: R, t_o
2: Output: V(t_o)
3: V(t_o) ← [t_o]
4: r_{t_o} ← ∂t_o
5: while r_{t_o} is not empty do
6: if (low(r_{t_o}), t′) is a trivial persistence pair then
7: r_{t_o} ← r_{t_o} ⊕ ∂t′
8: Append t′ to V(t_o)
9: else if low(R(t′)) = low(r_{t_o}) then
10: r_{t_o} ← r_{t_o} ⊕ R(t′)
11: FindV(t′, V(t_o))
12: Remove triangles from V(t_o) with zero coefficient
13: function FindV(t, V(t_o))
14: Append t to V(t_o)
15: r_t ← ∂t
16: while r_t is not empty do
17: if (low(r_t), t′) is a trivial persistence pair then
18: r_t ← r_t ⊕ ∂t′; append t′ to V(t_o)
19: else if low(R(t′)) = low(r_t) then
20: r_t ← r_t ⊕ R(t′); FindV(t′, V(t_o))

5.2.3 Selective storage for computational optimization

V does not need to be stored in memory in the recursive algorithm. We write each birth-cycle to a file when it is computed. The only additional memory taken is the maximum possible size of V(σ). However, the recursive reduction can be expensive, depending upon the number of reductions and the recursion depth of every reduction. To alleviate this problem, we note that it is possible for a column R(σ_i) to be used multiple times in the recursive computation of birth-cycles. In this case, storing its reduction operations, V(σ_i), in computer memory can decrease the computational overhead. It is also possible that the time taken to compute V(σ_i) is low and the multiple computations have a negligible computational overhead. Hence, we developed selective storage of columns of V based on two criteria. For notational convenience, we drop the subscript i. First, we consider the number of reductions, inclusive of the reductions carried out recursively, required to compute V(σ). We denote this parameter by n_r(σ) (#reductions). Second, we maintain a counter for the number of times V(σ) is requested as the birth-cycles are being computed. We denote this parameter by n_u(σ) (#usage). If n_r(σ) > n_r* and n_u(σ) surpasses a user-defined threshold, denoted by n_u*, then we store V(σ) in memory. Note that an optimal value of n_u* that can provide the best trade-off between memory usage and run-time can be determined only after computing all of the birth-cycles recursively and analyzing the usage of V(σ). We chose n_u* = 2 as its default value, and it can be increased if storage of fewer V(σ) is desired and the resulting increase in computation time is insignificant.
The benefits of the above strategy will be most significant if n_r is low for most simplices and the ones with large n_r also have a large n_u. The former is desirable because it will reduce the memory taken by selective storage, and the latter means that the reduction in computation time from selective storage will be significant.

Greedy algorithm to compute a set of shorter cycles
A birth-cycle corresponding to a topological feature that is born in the homology group H_d is a set of d-simplices. We will denote the set of all birth-cycles computed for H_d by B_d. For notational convenience, we drop the subscript d, since the analysis follows similarly in any dimension. We denote a birth-cycle by X_i ∈ B, where X_i is a set of simplices. We define the length of X_i as the number of simplices in X_i, denoted by |X_i|. Since the set of birth-cycles forms a basis for the corresponding homology group, we propose a Gram–Schmidt-like method to reduce the lengths of cycles. We substitute a cycle X_i by X_i ⊕ X_j = (X_i ∪ X_j) \ (X_i ∩ X_j) if such a substitution reduces its length by r_m, the maximum possible reduction in length when all pairs of cycles are considered. The iterations are continued until the sum of no two cycles results in a cycle of shorter length. We know that this condition will be satisfied eventually, because every iteration results in a reduction in length of at least one cycle and the lengths are bounded below by 0. Note that X_j is possibly one of multiple cycles that can be added to X_i to reduce its length by r_m. A choice of a different cycle might result in a different set of cycles at the end of the algorithm that may be shorter. Hence, it is not guaranteed that a single run of this algorithm to termination will result in cycles of minimal lengths; the only condition satisfied is that the sum of no two cycles gives a cycle of shorter length. We also note that X_i ⊕ X_j might result in disconnected cycles.
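A brute-force version of this greedy scheme can be sketched as follows (cycles are plain sets of simplex labels here; the paper's optimized implementation avoids the all-pairs scan):

```python
def greedy_shorten(cycles):
    """Repeatedly substitute X_i by X_i ⊕ X_j for the pair giving the
    largest reduction in length, until no substitution shortens any cycle.

    Terminates because every substitution strictly shortens one cycle
    and lengths are bounded below by 0.
    """
    cycles = [set(c) for c in cycles]
    while True:
        best = None  # (reduction, i, j)
        for i, Xi in enumerate(cycles):
            for j, Xj in enumerate(cycles):
                if i == j:
                    continue
                red = len(Xi) - len(Xi ^ Xj)   # ^ is symmetric difference
                if red > 0 and (best is None or red > best[0]):
                    best = (red, i, j)
        if best is None:
            return cycles
        _, i, j = best
        cycles[i] = cycles[i] ^ cycles[j]

shortened = greedy_shorten([{1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 7}])
assert sorted(len(c) for c in shortened) == [3, 5]
```

The first cycle is replaced by the symmetric difference {5, 6, 7}, after which no pairwise sum shortens either cycle.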
A brute-force implementation that scans all possible pairs is computationally infeasible. N cycles will require O(N²) comparisons in every iteration, and every comparison involves determining (X_i ∪ X_j) \ (X_i ∩ X_j), which can be costly for long cycles. Hence, such a brute-force approach is not feasible for processing a large number of cycles.
We make the following observations.
1. If |X_j| ≤ r, then substituting X_i by X_i ⊕ X_j cannot reduce the length of X_i by more than r: the reduction equals |X_i| − |X_i ⊕ X_j| = 2|X_i ∩ X_j| − |X_j| ≤ |X_j|.
2. If |X_i| ≥ |X_j|, then the larger of the two possible reductions for the pair (X_i, X_j) is obtained by substituting the longer cycle X_i, since 2|X_i ∩ X_j| − |X_j| ≥ 2|X_i ∩ X_j| − |X_i|.
To utilize these observations, we index the birth-cycles in decreasing order of their lengths. Hence, if i < j, then |X_i| ≥ |X_j|. Indices of cycles of the same length are assigned arbitrarily by the sorting algorithm. Since the relative lengths of cycles might change in an iteration, a re-indexing of the cycles, by sorting them in decreasing order of lengths, is done at every iteration. Then, from observation 2 it follows that, for a given cycle X_i, we only need to consider substitution of X_i by X_i ⊕ X_j if j > i. Hence, we decrease the number of comparisons required in every iteration.
We begin by initializing two data structures. First, for every cycle X_i we store f(X_i) = (X_i*, r_i), where X_i* is the cycle such that updating X_i by X_i ⊕ X_i* decreases its length the most, and the decrease in length is stored as r_i. Second, for every simplex σ that is in at least one cycle, we maintain a list of all cycles that contain it. We denote it by g(σ) = [X_{i1}, X_{i2}, ...].
To initialize f(X_i), we first set r_i ← 0 and X_i* ← NULL, and consider the sum of X_i with all cycles X_j with j > i. The following ideas reduce the number of comparisons necessary.
1. Starting from j = i + 1, we compute the reduction in length obtained by substituting X_i by X_i ⊕ X_j and, if it is greater than r_i, we update r_i and X_i*. We increment j by 1 and repeat until either j = N or |X_j| < r_i. The latter condition follows from observation 1 because, if |X_j| < r_i, then summing X_i with any cycle X_k for k ≥ j cannot result in a reduction in length greater than r_i. Since the cycles are indexed by decreasing lengths, we can quit when |X_j| < r_i. This potentially reduces the number of comparisons to be done. Note that if r_i = 0 upon termination, then there is no cycle that can be added to X_i to reduce its length.
2. To compute |X_i ⊕ X_j|, we iterate over the elements in X_i and X_j and count the elements that are in either but not in both. To make this an operation of order O(|X_i| + |X_j|), we store every cycle as a list of simplices ordered by their index. We quit when the count is more than |X_i| − r_i, because that implies that the reduction in length will be less than r_i. In such a case, we do not iterate over all elements in X_i and X_j, reducing the computation time.
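Idea 2, the capped symmetric-difference count over index-ordered lists, can be sketched as:

```python
def symdiff_size_capped(a, b, cap):
    """Size of the symmetric difference of two ascending lists, with early exit.

    Returns None as soon as the count exceeds cap, mirroring the
    |X_i| - r_i cutoff; otherwise returns the exact size in O(|a| + |b|).
    """
    i = j = n = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            n += 1
            i += 1
        else:
            n += 1
            j += 1
        if n > cap:
            return None        # reduction would be smaller than r_i: give up
    n += (len(a) - i) + (len(b) - j)
    return None if n > cap else n

assert symdiff_size_capped([1, 2, 3], [2, 4], 10) == 3   # {1, 3, 4}
assert symdiff_size_capped([1, 2, 3], [2, 4], 2) is None
```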
We compute the initial f(X_i) for all birth-cycles in parallel, using an OpenMP for-loop with static scheduling and a chunksize of 1000.
We start the iterative algorithm after f(X_i) and g(σ) have been initialized. We denote the maximum r_i in f(X_i) by r_m. If r_m = 0, the algorithm terminates. Otherwise, all X_i with r_i = r_m are substituted with X_i ⊕ X_i*. For every substitution, g(σ) has to be updated as follows: X_i is to be added to g(σ) for all σ ∈ X_i* \ (X_i ∩ X_i*), and it has to be removed from g(σ) for all σ ∈ X_i ∩ X_i*. This information is tracked during the summation of cycles as a list u(σ) = [(X_{i1}, f_1), ...], where f_1 = 0 if X_{i1} has to be removed from g(σ) and f_1 = 1 otherwise. Then, g(σ) is updated in parallel.
After these substitutions, we update the indices of the cycles by re-sorting them in decreasing order of lengths. We also have to determine whether f(X_i) has changed. To reduce the number of comparisons, we consider four different cases for every cycle X_i, as follows.
1. X_i was updated: If X_i was updated in this iteration, then we reset r_i ← 0 and check the sum of X_i with each cycle X_j where j > i.
2. X_i was not updated, r_i = 0: This means that, prior to the updates, there was no cycle that could be added to X_i to result in a cycle of shorter length. Hence, after the updates, we need to check the sum of X_i with only the cycles that were updated.
3. X_i was not updated, r_i ≠ 0, X_i* was updated: Since X_i* has been updated, we check the sum of X_i with each cycle X_j where j > i.
4. X_i was not updated, r_i ≠ 0, X_i* was not updated: In this case, we only need to check the sum of X_i with the updated cycles to determine whether a reduction in length of more than r_i is possible.
The above four cases can be categorized into two scenarios: checking the sum of X_i with all cycles, and checking the sum of X_i with only the cycles that were updated in the iteration. These scenarios are implemented using different strategies.
Strategy 1, to check the sum of X_i with all cycles (cases 1 and 3): For every simplex σ in X_i, we iterate over g(σ). Then, if X_j ∈ g(σ), we know that σ ∈ X_i ∩ X_j. Hence, we can compute |X_i ∩ X_j| on the fly by initializing it as 0 and incrementing it by 1 whenever X_j is in g(σ) for σ ∈ X_i. Since |X_i ⊕ X_j| = |X_i| + |X_j| − 2|X_i ∩ X_j|, we can also compute the reduction in length on the fly when X_i is summed with X_j. If this reduction is greater than r_i, we update r_i ← |X_i| − |X_i ⊕ X_j| and X_i* ← X_j.
Strategy 2, to check the sum of X_i with the updated cycles (cases 2 and 4): Suppose U = [X_{i1}, X_{i2}, ...] is the list of updated cycles, ordered in decreasing order of their lengths. Then, we sum X_i with the cycles in U in the order they appear in U. If we get a reduction in length of X_i that is greater than r_i, we update r_i and X_i* accordingly. For efficiency, we implement the two ideas, 1 and 2, from the initialization. Hence, we may not need to compare X_i with all cycles in U, and we also may not need to iterate over all simplices in the two cycles that are being summed.
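Strategy 1's on-the-fly intersection count via g(σ) can be sketched as follows (a toy version with cycles as plain sets; the function name is hypothetical):

```python
def best_partner(cycles, i):
    """Find X_j maximizing the reduction when X_i is replaced by X_i ⊕ X_j.

    Builds the inverted index g(σ) (simplex -> indices of cycles containing
    it), counts |X_i ∩ X_j| on the fly, and uses
    reduction = |X_i| - |X_i ⊕ X_j| = 2|X_i ∩ X_j| - |X_j|.
    Returns (j, reduction), or (None, 0) if no sum shortens X_i.
    """
    g = {}
    for k, X in enumerate(cycles):
        for s in X:
            g.setdefault(s, []).append(k)
    inter = {}
    for s in cycles[i]:
        for j in g[s]:
            if j != i:
                inter[j] = inter.get(j, 0) + 1
    best_j, best_r = None, 0
    for j, c in inter.items():
        r = 2 * c - len(cycles[j])
        if r > best_r:
            best_j, best_r = j, r
    return best_j, best_r

cycles = [{1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 3, 4, 9}, {8, 10}]
assert best_partner(cycles, 0) == (1, 3)   # summing with cycle 1 removes 3 simplices
```

Only cycles sharing at least one simplex with X_i are ever touched, which is the point of the inverted index.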
Strategy 1 and strategy 2 are implemented as embarrassingly parallel over all cycles and over the updated cycles, respectively. The four cases (cases 1 to 4) are implemented separately in parallel for better load balance. We use an OpenMP for-loop with static scheduling and a chunksize of 50.
Why do we use two different strategies? The strength of strategy 1 lies in the fact that iterating over g(σ) compares only those cycles that have at least one simplex in common. However, it also means that we iterate over every simplex in every cycle X_i that is to be processed. In our experiments with test data sets, we observed that in the first few iterations almost all of the cycles are in cases 2 and 4 and only a few cycles are updated. Hence, strategy 1 takes a long time, because it iterates over the entire length of every cycle. As an alternative, we implement strategy 2, which iterates over pairs of cycles and checks their sums. It is significantly faster than strategy 1 because ideas 1 and 2 reduce the number of pairs that are considered. As iterations of the algorithm reduce the lengths of the cycles, there is a decrease in both the number of simplices in the cycles and the number of cycles that have common simplices. As a result, combined with strategy 1, the algorithm scales efficiently for a large number of cycles. The efficiency is higher if the iterations of the algorithm are accompanied by an increase in the number of cycles in cases 1 and 3. This algorithm can be implemented for both H_1 and H_2 birth-cycles. However, since the memory requirement for g(σ) can be O(n³) in the worst case, we implement it only for H_1. To shorten H_2 birth-cycles, we simply iterate over all pairs of nontrivial cycles in every iteration. This was computationally feasible for all of the data sets in this work because the number of nontrivial H_2 features was very low.

Connectedness
For every shortened H_1 representative, we compute its cycle basis using the networkx Python package and record different basis elements as separate cycles. For every shortened H_2 representative, we compute a connectivity graph, G = (V, E), with the set of nodes V as all triangles in the representative, and an edge between two nodes is defined if the corresponding triangles have two points in common. We record disconnected components of G as separate H_2 representatives.
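The H_2 connectivity-graph construction can be sketched without networkx (a minimal BFS version):

```python
from collections import deque

def h2_components(triangles):
    """Split an H_2 representative into connected components.

    triangles: list of 3-tuples of vertex ids. Two triangles are adjacent
    when they share exactly two vertices; components are found by BFS.
    """
    tris = [frozenset(t) for t in triangles]
    n = len(tris)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(tris[i] & tris[j]) == 2:
                adj[i].append(j)
                adj[j].append(i)
    seen, comps = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, q = [], deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            comp.append(triangles[u])
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

# Two glued triangles plus one isolated triangle: two components.
comps = h2_components([(0, 1, 2), (1, 2, 3), (4, 5, 6)])
assert len(comps) == 2
```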

Smoothing
An H_1 representative can be written as a sequence of 0-simplices, [v_0, ..., v_n, v_0]. To smooth it, we simply remove v_i if the pairwise distance between v_{i−1} and v_{i+1} is at most τ_u.
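A minimal sketch of this smoothing step (the sweep-until-stable loop and the minimum cycle length of 3 are our assumptions):

```python
from math import dist  # Python 3.8+

def smooth_h1(cycle, tau_u):
    """Remove v_i from a closed H_1 representative whenever its neighbours
    v_{i-1} and v_{i+1} are within tau_u of each other.

    cycle: list of coordinate tuples, interpreted cyclically.
    """
    verts = list(cycle)
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(verts) and len(verts) > 3:
            prev = verts[i - 1]
            nxt = verts[(i + 1) % len(verts)]
            if dist(prev, nxt) <= tau_u:
                verts.pop(i)     # skip v_i: its neighbours are close enough
                changed = True
            else:
                i += 1
    return verts

# The small bump at (1.0, 0.1) is removed; the square's corners survive.
loop = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
assert smooth_h1(loop, 2.0) == [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
```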
An H_2 representative boundary is a set of triangular faces, B = {t_i}, and each triangle, t_i, is the set of its three vertices. We define a graph G with the set of nodes V = {t_i} and the set of edges E = {(t_i, t_j) : |t_i ∩ t_j| = 2}, i.e. an edge between two nodes of G denotes that the corresponding triangular faces share two vertices. For each triangle t_i, we define its diameter as the length of the longest edge in it, denoted by d_i.
Suppose a tetrahedron h = {t_1, t_2, t_3, t_4} exists in the embedding, with birth ≤ τ_u, such that exactly three of its four faces are in B. Then, we remove those three faces from B, add the fourth face of h to B, and add edges between the fourth face and the neighbors of the three removed faces. If there are multiple such possibilities, then the algorithm greedily picks a tetrahedron with the smallest diameter. We update G and repeat until no update to G is possible. At the end of the iterations, we remove any disconnected tetrahedra in B. It is not efficient to loop through all valid (with birth ≤ τ_u) tetrahedra in the embedding of the entire data set. We developed an algorithm that, instead, is of the order of the number of triangular faces in B (Algorithm 4).
Algorithm 4 Smoothing a 2-cycle
1: Input: boundary B = {t_i} (set of triangular faces), birth threshold τ_u
2: Make graph G ← (B, E), where (t_i, t_j) ∈ E if they share two points
3: repeat
4: C ← cliques in G with at least 3 nodes/triangular faces
5: for C in C do
6: for (t_1, t_2, t_3) in combinations of 3 from C do
7: Skip if the tetrahedron h = {t_1, t_2, t_3, t_4} is already in B
8: d ← diameter of h = {t_1, t_2, t_3, t_4}
9: if d < d* and d ≤ τ_u then
10: Mark {t_1, t_2, t_3} to be removed and t_4 to be added to G
11: N ← union of neighbors of {t_1, t_2, t_3}
12: Remove {t_1, t_2, t_3} from B, add t_4, and connect it to N
13: until no update to G is possible
14: Remove disconnected tetrahedra from B by removing components of length ≤ 4
15: Output: nodes in B

Perturbations and permutations
The rationale for perturbations and permutations possibly yielding shorter representatives is as follows.
a. Perturbations: Consider a persistence pair (b, d). The algorithm will find a representative boundary around this feature, say C_1, that is born at b, i.e. the length of the longest edge in C_1 is b. Suppose that there exists a cycle C_2 in the embedding that is born at b + Δ (Δ > 0) and is closer to the hole as compared to C_1. We are interested in discovering C_2 if Δ is sufficiently small (b + Δ < τ_u). By perturbing the data set, we allow the possibility that the longest edge in C_2 is smaller than the longest edge in C_1 in the perturbed data set, and hence, the former can be discovered by the algorithm as a representative boundary.
b. Permutations: The indices of the edges that are added to the simplicial complex at the same spatial scale are permuted. The resulting change in the order of the columns of the boundary matrices can give a different set of representative boundaries. Note that, unlike perturbations, which essentially rely on the robustness of meaningful topological features, a permutation preserves the PD.

Universe
Figure 16 shows the computed set of minimal representative boundaries around significant voids in the universe.

Acknowledgement
This research was supported by the Intramural Research Program of the NIH, the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).

Figure 1 :
Figure 1: PH finds robust holes in discrete data sets and our algorithms improve the geometric precision of representative boundaries. (a) A data set with 4 points and annotated pairwise distances. The simplicial complex (D_i) at a spatial scale τ is defined as a set of simplices (σ_i) with diameter at most τ. At τ = 2.5, the number of holes in the simplicial complex increases from 0 to 1 when σ_8 is added (D_8). At τ = 2.75, the number of holes increases from 1 to 2 when σ_9 is added (D_9). One hole gets filled in at τ = 2.75, when σ_10 = {a, b, c} is added (D_10), but one remains alive. (b) A noisy discrete data set in R² with one significant hole. (c) Births and deaths of holes are plotted as a persistence diagram. It shows that one hole stands out with relatively high persistence. (d) Our algorithms improve the geometric precision of representative boundaries in multiple steps. The birth-cycle is from the recursive algorithm. The shortened cycle is from the greedy shortening algorithm. The smooth cycle is from the smoothing algorithm. (e) A point cloud in R³. It is difficult to identify any significant hole by visual inspection. (f) The PD shows that one feature has relatively high persistence. (g) Our algorithms again improve geometric precision. The blue boundary is the birth-cycle from the recursive algorithm. The red boundary is the smooth cycle after the greedy shortening and smoothing algorithms.

Figure 3 :Figure 4 :
Figure 3: Spatial pairwise estimates from Hi-C matrices reasonably estimate the distance between adjacent bins on a chromosome. (a) The distribution of pairwise estimates for a bin distance of 1 is significantly lower as compared to those of higher bin distances. (b) Both the means and the medians of the distributions increase with increasing bin distance.

Figure 5 :
Figure 5: Both cis- and trans-cycles decrease in number upon treatment with auxin. (a) and (b) The number of cis-cycles is decreased upon addition of auxin, irrespective of cycle length and the maximal bin distance between contiguous bins. (c) The number of trans-cycles is also decreased, but there are relatively more instances of more trans-cycles in auxin-treated. (d) The opacity of edge (c_i, c_j) indicates the difference in the number of trans-cycles between control and auxin-treated that go through chromosomes c_i and c_j. Blue indicates more in control and red means more in auxin-treated. Chromosome 13 stands out with many red edges. (e) Same as previous, but without the sex chromosomes. There are very few red edges at chromosome 13, indicating that auxin-treated has a larger number of trans-cycles that go through chromosome 13 and the sex chromosomes.

Figure 6 :
Figure 6: Results of PH computation, shortening, and smoothing for the spatial embedding of around 108,000 galaxies. (a) H_2 PD with significant features (τ_u = 15, ϵ = 2) marked in red. (b) Lengths of birth-cycles are decreased by both shortening and smoothing. (c) Open-face blue circles are significant features in the H_2 PD of the full data set. Red crosses are significant features in covers of birth-cycles. The set of birth-cycles wraps around all features classified as significant in the PD of the full data set. (d) Open-face blue circles are significant features in the H_2 PD of the full data set. Red crosses are significant features in covers of shortened cycles. Shortened cycles do not wrap around all features classified as significant in the PD of the full data set. This is a feature of the shortening algorithm, as we show by comparing homology cycles with shortened cycles for the annotated example. (e) The top row shows the birth-cycle, for the annotated example, in three different orientations. The bottom row shows the shorter cycles found by the greedy algorithm, in the same orientations. These shorter boundaries have birth < τ_u, but higher than that of the birth-cycle. The persistence of the holes that they wrap around is less than ϵ, and they are not classified as significant.

Figure 8 :
Figure 8: The final computed boundaries identified the locations of 40 distinct voids in the spatial embedding. (a) Persistence pairs of significant features in covers of the computed set of minimal representatives and of significant features in the full data set. (b) A graph with nodes as covers of minimal boundaries. Labels show that each cover has one significant feature. An edge denotes non-empty intersection between two covers. 24 out of 30 components are singletons. They identify distinct regions in the universe, each with one significant void. (c) and (d) Two orientations of all computed minimal boundaries around significant H_2 voids in the spatial embedding of galaxies. (e) 5 of the components with intersecting covers, to visually confirm that all minimal representatives are around distinct voids. (f) The last remaining component, which has 5 intersecting covers and up to 5 significant voids. Two different orientations show that the 5 minimal cycles are around 5 distinct voids.

Figure 9 :
Figure 9: Voids determined by minimal representatives have sets of features that are rare to find by random sampling. 100,000 samples each of (a) spatial sampling and (b) graphical sampling show that the feature sets of most voids have very low kde (0.001), especially for voids that have large covers. Statistical significance of the feature sets of the computed voids in (c) spatial sampling and (d) graphical sampling.

Figure 10 :
Figure 10: Homologs with significant topological differences. (a) PD computation of 174,574 PDB entries up to and including the threshold τ = τ_u + ϵ = 10 + 3.5. (b) 71 proteins in 25 homolog graphs that have significant features. There are 77 covers with significant features after graphical contraction. The largest of the covers around smooth cycles is less than 1000 for every protein. (c) Run-times for computing representative homology boundaries in permutations of perturbations of the 77 covers. (d) and (e) 25 homolog graphs (different numbers of significant features and an L_0 distance of at least 3). Black labels are the number of significant H_2 features for every PDB entry in that column.

Figure 11 :
Figure 11: Three selected examples from the 25 homolog graphs. (a) Ligand binding: 1gm0 is a form of pheromone-binding protein from Bombyx mori that has no significant voids. Its bound configurations, with bell pepper odorant (2p70) and with iodohexadecane (2p71), have significant voids at the binding sites. (b) Different species: The dimeric configuration of GADD45g in humans has a void, but not in mice. (c) Mutation: Two mutations (large red markers) in each of the two chains of the protein Cocosin introduce three significant voids.

Figure 12 :
Figure 12: Histograms of the PCA coordinates of random samples in (a) Spatial sampling and (b) Graphical sampling.

Figure 14 :
Figure 14: Matrix representations and their bounds in PH computation, up to and including H_2, for a data set with n points. (a) Homology computation by reducing boundaries. (b) Cohomology computation by reducing coboundaries.

Algorithm 5 (contraction within each component c of G):
7: G_c ← subgraph of G defined by the nodes in c
8: for (c_i, c_j) in pairs of nodes of G_c do ▹ Intersection check
9: c_k ← c_i ∩ c_j
10: Continue/skip if c_k ∈ {c_i, c_j} or |c_k| < 4
11: Compute n(c_k) by computing the PD up to τ_u + ϵ
12: if n(c_k) = n(c_i) then remove c_i from G_c
13: if n(c_k) = n(c_j) then remove c_j from G_c
14: if c_i and/or c_j removed then add c_k to G_c
15: Update edge set E to add the neighbors of c_k
16: for c_i in G_c do ▹ Subset check
17: If c_k ⊂ c_i and n(c_k) = n(c_i), remove c_i
18: G ← composition of all G_c
19: Output: list of nodes of G

Table 1 :
Our algorithms process large data sets within a minute. n is the number of points, τ = τ_u + ϵ is the threshold for the PH computation, #H_1 is the number of nontrivial H_1 features, #cycles is the number of cycles born at a spatial scale less than or equal to τ_u, T_tot is the time in seconds to compute the PD and the birth-cycles and to shorten and smooth the birth-cycles, and M_tot is the peak memory usage in GB.
Columns of R give a set of representative homology boundaries. There is only one H_1 representative boundary, R(σ_10), but there are two H_1 features. We propose to use the columns of V as a comprehensive set of representative boundaries, which we call birth-cycles. The coboundary matrix D⊥ is the off-diagonal transpose of D. It is reduced similarly, and the same persistence pairs are obtained from R⊥. (b) Flowchart of our algorithms and strategy.

Table 2 :
Features of voids computed by PH. The last two columns are −log₁₀ of the p-value, and those ≥ 2 (i.e. p-value ≤ 0.01) are highlighted. Entire rows of voids with statistical significance in both spatial and graphical sampling are highlighted.
Figure 15: Serial-parallel reduction to reduce the boundaries of triangles.
5.6 Covers and graphical contraction
Given an upper threshold τ_u for PH, a threshold ϵ for significance, and C̃ = {C̃_i} with n(C̃_i) > 0, we implement the following conditions.
1. Subset check: If C̃_j ⊂ C̃_i and n(C̃_i) = n(C̃_j), then remove C̃_i from C̃.
2. Intersection check: If C̃_i ∩ C̃_j ∉ {φ, C̃_i, C̃_j} and n(C̃_i ∩ C̃_j) = n(C̃_i) and/or n(C̃_j), then replace C̃_i and/or C̃_j by C̃_i ∩ C̃_j. Note that we can impose the stronger condition |C̃_i ∩ C̃_j| > 3, because otherwise the intersection cannot have a significant feature.
We implement this by constructing a graph with nodes as the elements of C̃ (Algorithm 5).
Algorithm 5
1: Input: τ_u, ϵ, C̃ = {C̃_i} where n(C̃_i) > 0
2: Set of vertices V ← C̃
3: Remove C̃_i from V if there exists C̃_j with n(C̃_i) = n(C̃_j) and C̃_j ⊂ C̃_i ▹ Subset check
4: Define edge set E ← {(C̃_i, C̃_j)} if C̃_i ∩ C̃_j ∉ {φ, C̃_i, C̃_j}
5: Intersection graph G ← (V, E)
6: for component c in components of G do contract the covers within c