Determining clinically relevant features in cytometry data using persistent homology

Cytometry experiments yield high-dimensional point cloud data that is difficult to interpret manually. Boolean gating techniques coupled with comparisons of relative abundances of cellular subsets is the current standard for cytometry data analysis. However, this approach is unable to capture more subtle topological features hidden in data, especially if those features are further masked by data transforms or significant batch effects or donor-to-donor variations in clinical data. We present that persistent homology, a mathematical structure that summarizes the topological features, can distinguish different sources of data, such as from groups of healthy donors or patients, effectively. Analysis of publicly available cytometry data describing non-naïve CD8+ T cells in COVID-19 patients and healthy controls shows that systematic structural differences exist between single cell protein expressions in COVID-19 patients and healthy controls. We identify proteins of interest by a decision-tree based classifier, sample points randomly and compute persistence diagrams from these sampled points. The resulting persistence diagrams identify regions in cytometry datasets of varying density and identify protruded structures such as ‘elbows’. We compute Wasserstein distances between these persistence diagrams for random pairs of healthy controls and COVID-19 patients and find that systematic structural differences exist between COVID-19 patients and healthy controls in the expression data for T-bet, Eomes, and Ki-67. Further analysis shows that expression of T-bet and Eomes are significantly downregulated in COVID-19 patient non-naïve CD8+ T cells compared to healthy controls. This counter-intuitive finding may indicate that canonical effector CD8+ T cells are less prevalent in COVID-19 patients than healthy controls. This method is applicable to any cytometry dataset for discovering novel insights through topological data analysis which may be difficult to ascertain otherwise with a standard gating strategy or existing bioinformatic tools.


Introduction
Cytometry data contain information about the abundance of proteins in single cells and are widely used to determine mechanisms and biomarkers that underlie infectious diseases and cancer. Recent advances in flow and mass cytometry techniques enable measurement of abundances of over 40 proteins in a single cell [1,2]. Thus, in the space spanned by protein abundance values measured in cytometry experiments, a cytometry dataset is represented by a point cloud composed of thousands of points where each point corresponds to a single cell. Abundances of proteins or chemically modified forms (e.g., phosphorylated forms) of proteins in single immune cells change due to infection of the host by pathogens (e.g., a virus) or due to the presence of tumors which usually result in changes in the 'shape' of point cloud data measured in cytometry experiments [3][4][5]. Cytometry data analysis techniques commonly rely on Boolean gating and calculation of relative proportions of resulting populations as a method to compare datasets across control/healthy and experimental/diseased conditions. In recent years, state-of-the-art analyses based on sophisticated machine learning algorithms capable of mitigating batch effects, ad hoc gating assumptions, and donor-donor variability have been developed [6,7]. However, these methods are not designed to quantitatively characterize shape features (e.g., connected clusters, cycles) in high dimensional cytometry datasets that can contain valuable information regarding unique co-dependencies of specific proteins in diseased individuals compared to healthy subjects.
Topological Data Analysis (TDA) aims to capture the underlying shape of a given dataset by describing its topological properties. Unlike geometry, topological features (e.g., the hole in a doughnut) are invariant under continuous deformation such as rotation, bending, twisting but not tearing and gluing. One of the tools by which TDA describes topological features latent in data is persistent homology [8,9]. For example, for a point cloud data, persistent homology captures the birth and death of topological features (e.g., 'holes') in a dataset after building a scaffold called a simplicial complex out of the input points. This exercise provides details regarding topological features that 'persist' over a range of scale and thus contain information regarding the shape topology at different length scales (see S1 Fig for details). Persistent homology has been applied to characterize shapes and shape-function relationships in a wide variety of biological systems including skin pattern formation in zebra fish [10], protein structure, and pattern of neuronal firing in mouse hippocampus [11]. TDA has additionally previously been applied to identify immune parameters associated with transplant complications for patients undergoing allogenic stem cell transplant using populations of immune cell types assayed via mass cytometry [12]. However, this work did not use persistent homology or expression levels of proteins in their analysis, leaving the shape of cytometry data uncharacterized. Another work focuses on the use of TDA as a data reduction method for single-cell RNA sequencing data [13], but again do not attempt to characterize how topologies derived from point clouds differ among disparate data sources such as healthy and diseased individuals.
The challenges of directly applying current persistence methodologies to cytometry data to characterize distinguishing features between healthy and diseased states are the following: 1. Features that separate healthy from diseased state can pertain to the change in density of points in a region in point cloud data-therefore, the information of local density should be incorporated in persistent homology methods, in particular in the filtration step that brings in sequentially the simplices connecting the points. In commonly used Rips filtration [14] the density of points is not included. 2. There can be shape changes giving a different length scale in the point cloud data, such as formation of an elbow, in a diseased condition. 3. There can be systematic differences between healthy and diseased states across batch effects and donor-donor variations. Topological features should capture these global differences being oblivious to the local variations caused by measurement noise.
We address the above challenges by developing an appropriate filtration function to compute persistence and applying the method to characterize distinguishing features of non-naïve CD8+ T cells between healthy and SARS-CoV-2 infected patients.

Persistence framework for SARS-CoV-2 infection
Topological signatures given by persistence are stable, global, scale invariant and show resilience to local perturbations [15]. It is this property of persistent homology that motivates us to use TDA in distinguishing clinically relevant features in flow cytometry data in COVID-19 patients.
Persistent homology. Persistent homology builds on an algebraic structure called homology groups graded by its dimension i and denoted by H i . Intuitively, they describe the shape of the data by 'connectivity' at different levels. For example, H 0 describes the number of connected components, H 1 describes the number of holes, and, H 2 describes the number of enclosed voids apparently present in the shape that the dataset represents. Three and higher dimensional homology groups capture analogous higher (� 3) dimensional features. A point cloud data (henceforth abbreviated as PCD) itself does not have much of a 'connected structure'. So, a scaffold called a simplicial complex is built on top of it. This simplicial complex, in general, is made out of simplices of various dimensions such as vertices, edges, triangles, tetrahedra, and other higher dimensional analogues. Given a growing sequence of such complexes called filtrations, a persistence algorithm tracks information regarding the homology groups across this sequence. In our case, these complexes can be restricted only to vertices and edges. With the restriction that both vertices of an edge appear before the edge, we get a nested sequence of graphs Persistence diagram. Appearance ('birth') and disappearance ('death') of topological features, that is, cycles whose classes constitute the homology groups, can be captured by persistence algorithms [8,16]. These 'birth' and 'death' events are represented as points in the socalled persistence diagram. If a topological feature is born at filtration step b and dies at step d, we represent this by persistence pair (b, d) with persistence d − b. The pair (b, d) becomes a point in the persistence diagram with the 'birth' as x-axis and 'death' as y-axis. This 2D plot summarizes topological features latent in the data. In the example-filtration of Fig 1, a new component gets 'born' when a vertex v i appears in the filtration for the first time. When an edge is introduced, one of the two things can happen-either two components are joined, or a cycle is created. In the first case, a 'death' happens for 0-th homology group H 0 , and in the second case, a 'birth' happens for the 1-st homology group H 1 . For example, when e 0 comes in the filtration (G 6 ), it merges two components created by v 0 and v 1 . By convention, we choose to kill the component that got created later in the filtration and thus we let the component created by v 1 die. We obtain a persistence pairing (1,6) since edge e 0 at filtration step 6 kills the component created by v 1 at step 1. Similarly, we obtain pairs (2, 7), (4,8), (5,9), and (3,11). These points, tracking the 'birth' and 'death' of components, produce the persistence diagram for the 0-th homology group H 0 and hence we refer to it as the H 0 -persistence diagram. Note that the edge e 4 creates a cycle (yellow) that never dies. In such cases, i.e. when a topological feature never dies, we pair it with 1. For the edge e 4 , we obtain a persistence pair (10, 1). But, this feature concerns the 1-st homology group H 1 and thus it becomes a point in the persistence diagram for H 1 which we refer to as H 1 -persistence diagram. One way to leverage the above framework for studying a function is to assign function values to vertices and edges and construct a filtration by ordering them according to these assigned values. For such cases the persistence pairs take the form (b, d) where b is the value at which a feature is born and d is the value at which it dies. The function values that induce the filtration (Fig 2) are chosen to capture two features of the input PCD-(i) the density variations, and (ii) the anisotropy of the features, that is, how elongated it is in a certain direction, henceforth termed as length scale 'of the feature' or collectively 'of the data'. In particular, length scales refer to the prominence of protrusions such as 'elbows' in COVID-19 data. Below we briefly describe how we adapt the above persistence framework for analyzing point cloud data (PCD) representing CD8+ T cells in SARS-CoV-2 infection. Details regarding the approach are provided in Section 4 and S2 Appendix and S1 Fig. Computing persistent homology for cytometry datasets. Our datasets consist of cytometry data for non-naïve CD8+ T cells. Given protein expressions (real values) for d proteins in such a single cell, we can represent it as a d-dimensional point in R d . Considering a population of single cells, we get a point cloud (PCD) in R d . Now, we study the shape of this PCD using the persistence framework that we describe above. We compute persistence diagrams for the PCDs generated with protein expressions from different individuals and compare them. It turns out that, for computational purposes, we need a limit on the dimension d for PCD which means we need to choose carefully the proteins that differentiate effectively the subjects of our interest, namely the healthy individuals, COVID-19 patients, and recovered patients. We typically choose 3 (sometimes 2) protein expressions to generate the PCD and call it a PCD in the P1, P2, P3 space if it is generated by proteins P1, P2, and P3 respectively.
Flow cytometry data for non-naïve CD8+ T cells in Mathew et al. [3] show generation of CD8+ T cells with larger abundances of the proteins CD38 and HLA-DR (CD38+HLA-DR + cells) for some COVID-19 patients, forming an 'elbow' in the two dimensional PCD with CD38 and HLA-DR protein expressions (see S2 Fig). Moreover, there is an increase in the local density of the points (or single CD8+ T cells) in the 'elbow' region. This suggests that, to study the PCD generated by the protein expressions by persistence framework, we need to choose a filtration that is able to capture such geometric shapes and variations in the local density.
We briefly describe our choice of filtration by considering the example of a point cloud P � R 2 shown in Fig 2. Mathematical and computational details regarding the filtration are provided in the Section 4. We build a filtration according to assigned values to the vertices and edges of a graph connecting the input points. For a vertex p which is a point in the input PCD P, we denote this value f v (p) (given by Eq 1 in Section 4). Similarly, we denote the assigned value to an edge e as f e (e) (given by Eq 2 in Section 4); see Fig 2. The values satisfy the conditions that f v (p) < 0 and f e (e) � 0; implications of this specific choice will become clear in the next paragraph. It is noteworthy to mention that f v (p) is the negative of distance-to-measure originally defined in [17] and later used in [18] for the PCD case and captures the density distribution of points, whereas f e (e) captures the inter-point distances between the points in the given point cloud.
The persistence algorithm processes each vertex and edge in the order of their appearance in the filtration. We execute it using a threshold value λ from −1 to 1 and generate the persistence diagram accordingly. Intuitively, as λ is increased from −1 to 1, vertices p for which f v (p) � λ and edges for which f e (e) � λ appear in the filtration for a particular value of λ (see Fig 2). Since f v (p) < 0 and f e (e) � 0, all the vertices first appear as λ is increased from −1 to 0, and then edges start appearing as λ becomes positive. The birth-death events for H 0 and H 1 constituting the persistence diagram ( Fig 2H) contain information about the density and length scales present in the point-cloud. For example, the points showing birth and death events for the H 0 -persistence diagram are more densely organized for the single cell protein expression data from the healthy donor than the SARS-CoV-2 infected patient in the HLA-DR -CD38 plane shown in S3 Fig

Application of persistence to healthy and patient data
Our aim is to find out systematic differences in topological features extracted from cytometry data for healthy individuals and COVID-19 patients. Ideally one would like to compute persistence diagrams for all 25 proteins that were measured in single CD8+ T cells, however, this task encounters two major problems. First, as we mentioned before taking the full 25 dimensional PCD introduces the curse of dimensionality [19] making it computationally infeasible to produce the persistence diagrams. The second one is more subtle. In order to measure how the density of data differs from a healthy to infected person in a quantitative way, we need to ensure that the number of points in each PCD, to be analyzed by persistent homology, is the same. Cytometry data usually contain different numbers of single cells in datasets obtained from different donors or replicates. To address the curse of dimensionality we use a classifier (XGBoost [29]) that distinguishes single CD8+ T cells in healthy donors from those in COVID-19 patients and we choose the top r (taken to be 3) features (proteins) that are deemed important by the classifier while classifying the data points (cells). This reduces the dimension of the data from 25 to a much smaller value denoted r.
To address the second issue, we perform uniform random sampling on every r-dimensional dataset and take equal number of samples from it. We then use the filtration defined in Eqs 1 and 2 to construct persistence diagrams for each dataset. To quantify the structural differences in the datasets as captured by the corresponding persistence diagrams, we compute the Wasserstein distance [20] between persistence diagrams from randomly selected pairs of either two healthy donors (H×H) or a healthy donor and an infected patient (H×P) and compute distributions of the Wasserstein distances for a large number of (H×H) and (H×P) pairs. The comparison of these distributions via Kolmogorov-Smirnov (KS) tests provides information regarding the systematic differences in shape features in the CD8+ T cell cytometry data across healthy individuals and COVID-19 patients. The computational pipeline is summarized in (Fig 3). Below we describe results from the application of our computational pipeline to the CD8+ T cell cytometry data in Mathew et al. [3] A few protein expressions in CD8+ T cells separate healthy donors from COVID-19 patients. We use XGBoost [29], a decision tree based classifier, to rank order proteins for their ability to distinguish CD8+ T cell point cloud data between healthy individuals and COVID-19 patients. The average accuracy of the classifier is about 92%. The classifier returns a feature score for each protein that characterizes its importance relative to other proteins in distinguishing cells from healthy individuals and COVID-19 patients. Intuitively, feature score is an indicator of the importance of a particular feature while classifying the data. By ranking the proteins by their feature scores, we can reduce our further analysis to only a subset of the most important proteins. Our analysis (Fig 4) shows that the top three most important proteins to the XGBoost classifier are proteins T-bet, Eomes, and Ki-67. T-bet induces gene expressions leading to an increase in cytotoxic functions of CD8+ T cells. CD8+ T cells with increased cytotoxic functions are known as 'effector' CD8+ T cells and these cells show higher T-bet abundances. Conversely, Eomes induces gene expressions that contribute towards increased life span and re-activation potential of CD8+ T cells to specific antigens [21]. These long-lived T cells are known as 'memory' T cells which show increased expressions of Eomes. Memory T cells provide key protection against re-exposure to the same infection. Ki-67 is a marker for actively proliferating cells [22]. Mathew et al. [3] identified Ki-67 as one marker that is upregulated (increased) in some COVID-19 patients. These three proteins are most likely to distinguish CD8+ T cells in healthy donors from those in patients. Further details regarding the application of the classifier are provided in the Materials and Methods section (Section 4).
Persistence diagrams distinguish structural features in CD8+ T cell data occurring in healthy individuals and COVID-19 patients across batch effects and donor-donor variations. We select the proteins T-bet, Eomes, and Ki-67 as relevant markers and compute the persistence diagrams of the PCD given by them for each individual belonging to groups of healthy donors, COVID-19 patients, and recovered patients. The persistence diagrams vary from individual to individual in each group and between groups which could arise due to batch effects in samples and/or donor-to-donor variations. To determine if there are systematic differences in persistence diagrams for individuals across the three groups (healthy, patient, and recovered), we compute Wasserstein distance between persistence diagrams for 3 categories of pairings: 1) two healthy donors (H×H), 2) one healthy donor and one patient (H×P), and 3) one healthy donor and one recovered individual (H×R). We compute distances for 100 randomly chosen pairs of individuals for each category of pairings. Wasserstein distances of the persistence diagrams for 0-th and 1-st homology groups H 0 and H 1 respectively are higher when comparing H×P pairs than when comparing H×H pairs (Fig 5). A 2-sided KS test showed that the Wasserstein distances for H×P and H×H belong to different probability distribution functions (p � 0.01); see This suggests that systematic differences in the geometry of the PCD occur only for specific sets of proteins. Details regarding computation of persistence diagrams and Wasserstein distances are given in Section 4.
Next, we select a comparison pair that generates a large Wasserstein distance between H 1persistence diagrams to further investigate what structural differences exist between the A readily apparent difference between the resulting persistence diagrams is given by the lower birth times in H 1 of the COVID-19 patient compared to the healthy control (Fig 6E and  6F). This result indicates that the length scale of the data is smaller in the COVID-19 patient, which can be visually confirmed in the scatter plots of the data (Fig 6A and 6B). Specifically, the single cell abundances of T-bet and Eomes in CD8+ T cells are clustered significantly tighter around the origin for the COVID-19 patients than for the healthy controls. Similar manual inspection of other H×P pairs that generate large Wasserstein distances between their persistence diagrams confirms that this trend is not limited to this pair alone.
Additionally, the points in the H 0 -persistence diagram are spread out more widely for the healthy control than the COVID-19 patient (Fig 6C and 6D). A wider distribution of births and deaths in the 0-th homology H 0 implies that there are regions of disparate densities. This suggests that the densities in the protein expressions of T-bet and Eomes are more homogeneous in the PCD in the COVID-19 patient than in the healthy control.
The structural change in the PCD for CD8+ T cells in the T-bet/Eomes plane that occurs during COVID-19 infection implies that T-bet and Eomes expression should be downregulated (decreased) in non-naïve CD8+ T cells. This result is consistent with analysis of clusters of CD8+ T cells by Mathew et al. [3] via a software package FlowSOM [27] that shows that clusters high in T-bet and/or Eomes are downregulated in COVID-19 patients.
The relevance of the above proteins in distinguishing healthy controls from patients is further demonstrated by the statistically significant differences (p-values � 10 −8 ) in the mean Tbet, Eomes, and Ki-67 abundances in the CD8+ T cells between the groups (S10 Fig). However, the distributions of the mean abundances for the above proteins also showed regions of overlap between healthy and patient populations (S10 Fig) indicating existence of H × P pairs with much smaller differences in the mean values between them than the population averaged mean values of these proteins. Our method specifically identifies differences in topological

PLOS COMPUTATIONAL BIOLOGY
Persistent homology determines clinically relevant features in cytometry data features in the shape of the PCD between a H × P pair which can be present despite small differences in the mean values of specific proteins (e.g., T-bet). We further investigated this point by analyzing correlations between the Wasserstein distance between the persistence diagrams of H × P pairs with the difference in the mean protein abundances (S11(A) and S11(B) Fig) which showed moderate correlations (� 0.5). However, there are several instances in which the Wasserstein distance captures differences in the shape of the PCD via persistent homology even when the difference in mean protein abundance (e.g., mean T-bet abundance) is small (S11(C)-S11(F) Fig). This is due to changes in the shape of the point-cloud that are not easily captured by summary statistics such as the mean. Therefore, our analysis shows that healthy and COVID-19 pairs can be better separated by our persistent homology analysis than by summary statistics measures in such cases. The downregulation of T-bet and Eomes in response to viral infections is not well documented, as CD8+ T cells commonly differentiate into phenotypes with high T-bet, high Eomes, or both in response to infections [21,23].
We next explore the application of our approach to other datasets. We apply our method to the single cell cytometry dataset in Mathew et al. [3] for B cells obtained from healthy donors and COVID-19 patients. The B cells are major orchestrators of the humoral component of adaptive immunity against infections. We compute persistence diagrams for the proteins CXCR5, PD-1, and TCF-1, identified by XGBoost as the three most important proteins for classifying healthy donors and patients. A chemokine receptor, CXCR5, is responsible for B cell trafficking and is found to be downregulated in B cells in COVID-19 patients [3]. Both PD-1, a checkpoint inhibitory receptor [24], and TCF-1, a transcription factor important for T cell differentiation and effector functions [25] are increased in B cells in infected individuals (S12 Fig). Immunosuppressive effects of high PD-1 expression in B cells have been reported earlier [24]. We find that Wasserstein distances of the persistence diagrams for both homology groups (H 0 and H 1 ) are significantly different (Fig 7) between the healthy donors and the patient population for B cells. This demonstrates that our approach is able to distinguish healthy individuals from patient populations using PCDs of other immune cells. Furthermore, we find that the margin of separation, quantified by the QF-distance (QFD), in these Wasserstein distances is smaller with B cells than the CD8+ T cells. This implies that the structure of PCDs for CD8+ T cells differ more between healthy controls and patients than that for the B cells. These findings may point to previously uncharacterized impact of PD-1 and TCF-1 on B cell function or phenotype in SARS-CoV-2 infection.
Distributions of mean PD-1 expression on B cells in COVID-19 patients is not largely different than that of healthy controls (S12(B) Fig). Therefore, we further analyzed how the difference between mean PD-1 values and the differences in PCDs quantified by the Wasserstein distances are related (S13(A)-S13(D) Fig). We found that unlike T-bet for CD8+ T cells, mean PD-1 expression differences in B cells are not tightly correlated with Wasserstein distances in healthy control-patient pairs (S13(A)-S13(B) Fig). We visualized the PCD in the space of TCF-1, PD-1, and CXCR5 (S13(E) and S13(F) Fig) to gain further insights regarding the differences in the shapes of the PCD in healthy control-patient pairs with similar mean PD-1 expressions. The differences in the shape of the PCDs for these pairs can be largely attributed to the higher expressions of TCF-1 in healthy controls (S13(E) and S13(F) Fig).
Comparison with existing methods. To determine how our selection of filtration compares with an existing method, we compare how our results might change if we use Rips filtration. In Rips filtration, simplices appear when all of their edges appear in the filtration. The edges appear in non-decreasing order of their lengths. We use the entire dataset as the PCD and generate persistence diagrams subsequently, using Rips filtration [8,14,26] rather than the filtration we use in our approach. Note that in the standard Rips filtration, all vertices appear at the same instant whereas in our case the vertices are ordered by Eq 1. We then compute Wasserstein distances as done previously. We find that Rips filtration is also able to distinguish persistence diagrams of healthy controls and patients, but the margin of separation is much lower, as evidenced by a higher p-value and lower QFD than our choice of filtration offers (S14 Fig). This indicates that our method, which is designed to identify protrusions such as "elbows" in the data, characterizes greater differences in CD8+ T cell protein expression structures than existing methods such as Rips filtration.
We then compare our TDA approach with an existing algorithm FlowSOM [27], widely used for visualizing, clustering, and analyzing PCDs from cytometry experiments. FlowSOM uses a self-organizing map algorithm for generating single cell subsets with unique marker protein expressions. FlowSOM is capable of clustering similar cells together and offers a robust way to determine which cellular subsets are differentially expressed between data sources [28]. We run a FlowSOM analysis and clustering on the CD8+ T cell data and determine that 6 of the 15 clusters are differentially expressed between healthy controls and COVID-19 patients (S15 Fig). We select one FlowSOM cluster which is differentially expressed (Cluster #3) and one that is not differentially expressed (Cluster #1) between healthy donors and patients for downstream analysis. Visual inspection of these clusters in the 3-dimensional space of T-bet, Eomes, and Ki-67 shows that PCD structure may be different between healthy controls and patients in Cluster #1 (S15(C) Fig). This is because proteins can co-vary in different ways in healthy controls and patients, affecting topological features hidden in the PCD. We then perform our persistent homology analysis to determine if the structure of PCDs for proteins Eomes, Ki-67, and T-bet in subsets of single cells associated with these FlowSOM clusters differ between healthy controls and patients. We find distributions of Wasserstein distances between the persistence diagrams obtained for the above FlowSOM clusters are significantly different (Fig 8 and S16  Fig).
Performing FlowSOM on B cells reveals 12 of the 15 clusters are differentially expressed between healthy controls and COVID-19 patients (S17(A) and S17(B) Fig). The three clusters (Cluster #1, #2, and #14) that are not differentially expressed are all high in PD-1. Visualization of the PCDs for Cluster #2 and Cluster #4 shows that regions high in TCF-1 S17(C) and S17 (D) Fig) distinguish the PCDs for the pairs. Thus, our method is able to identify topological features hidden in the PCD that separate healthy controls from COVID-19 patients in Flow-SOM clusters, including clusters which are not differentially expressed between these groups.

Discussions and conclusions
We develop a persistent homology based approach to determine topological features hidden in point cloud data representing single cell protein abundances measured in cytometry data. In particular, we characterize the number of connected components or H 0 , and the number of holes or H 1 in our persistence calculations, and show that our approach is able to determine systematic shape differences in the cytometry data for CD8+ T cells obtained from healthy individuals and COVID-19 patients. Therefore, the approach is able to successfully determine systematic shape differences that exist in the presence of batch effect noise and donor-donor variations in the cytometry data. Furthermore, our approach does not use data transformations (e.g., arc-sinh transformation) or any ad-hoc subtype gating to determine these systematic differences, thus we expect persistent homology based approaches will be especially useful in identifying high-dimensional structural trends hidden in cytometry data.
We determine structural changes in T-bet and Eomes abundances in single CD8+ T cells in COVID-19 patients that can be summarized as downregulation. This result is non-intuitive as previous findings show that T-bet and Eomes protein abundances are highest in effector CD8+ T cells, which are induced in response to acute infections, suggesting T-bet and Eomes expressions should be upregulated [21,23]. The clinical implications of this result are unclear. Mathew et al. [3] describe a immunophenotype in which Eomes+, T-bet+, CD8+ T cells are more abundant in COVID-19 patients who respond poorly to Remdesevir and NSAIDs, have high levels of IL-6, and have fewer eosinophils. Our analysis identifies that this immunophenotype (i.e., Eomes+, T-bet+, CD8+ T cells) is systematically less prevalent in COVID-19 patients than in healthy controls. The ability of our approach to identify shape features in single immune cell PCD without any 'supervision' (e.g., specific gating) of the cytometry data shows that it can potentially determine more complicated immunologically relevant shape features. Furthermore, our approach inferred finer geometric structures from PCDs in B cells for proteins CRCX5, PD-1, and TCF-1 which helped distinguish COVID-19 patients from healthy individuals. These proteins are associated with cell migration, immunosuppression, and effector functions in lymphocytes and can potentially provide further insights into B cell response in COVID-19.
We compare our approach with an existing algorithm FlowSOM which is widely used for analyzing and visualizing multidimensional cytometry data. Our comparison reveals PCDs for subtypes of CD8+ T cells in FlowSOM clusters that do not differentiate healthy controls and COVID-19 patients contain topological features separating the above groups. Therefore, detecting topological features hidden in the PCDs can provide important biological insights regarding response of the lymphocytes in COVID-19.
Our approach integrates cellular comparisons with dataset comparisons. First, the classifier pools all data and determines which proteins are significant in discriminating whether cells come from healthy controls or COVID-19 patients. In this way, the classifier identifies a way to compare cellular phenotypes across experimental groups. Next, the computation of Wasserstein distances for persistence diagrams compares individuals against each other, integrating cellular phenotypes with donor information (e.g., healthy and COVID-19 patients). Thus, this approach allows us to automatically identify individuals that are associated with distinguishing structural features in the point cloud data.
Currently, the limitations are mostly due to the curse of dimensionality that increases the computational complexity. Since we are computing pairwise distances between datapoints to obtain the persistence diagram (S2 Appendix), computation time increases as the dimension of data increases. Computational cluster resources that we use currently complete all computations in about 20 minutes. This is comparable to other data science applications using large datasets, but this approach can be a barrier to those without access or experience with computational clusters. Additionally, it is unclear how additional dimensions impact the statistical properties of the data and interpretability of the results. To expand into many (i.e. 25) dimensions, computational interpretation and validation tools are necessary.

Relevant feature selection by the XGBoost classifier
Let D = {c 1 , c 2 , . . ., c m } be the collection of m cytometry datasets. Each dataset, c i , can be viewed as a M n�p matrix where n is the number of datapoints (cells) and p is the number of proteins with which each c i is generated. We denote the collection of cytometry datasets of healthy individuals as C H � D and similarly the set of individuals infected with SARS-CoV-2 as C P � D. We proceed to label the data in the following manner: If c i 2 C H then we assign the label + 1 to each of the n datapoints, similarly we assign −1 if c i 2 C P . Essentially, we now have a binary classification problem where our labeled dataset is [c j , with labels defined as above. We solve this binary classification problem with XGBoost [29], a gradient boosted decision tree based classifier, and as a byproduct we get feature scores that correspond directly to each feature's importance in the classification. The higher the score for a protein, the more important it is for the classifier's decision. After our classifier orders the proteins by their scores, we take first r proteins to construct the point-cloud on which persistence diagrams are computed. We set r = 3 for all our analysis reported here. We used data from 56 healthy individuals and 108 COVID-19 patients for our feature selection.
The XGBoost classifier was implemented using the open-source python XGBoost package [29]. The model was then trained and validated with K-fold cross-validation, with K = 10. The average accuracy of the classifier was 92.14 ± 0.04%. The protein scores are shown in Fig 4.

Random subsampling of the datapoints
Each PCD can be thought as a set of indexed points. These indices were first shuffled randomly and then 20, 000 indices and hence respective points were sampled uniformly from this shuffled set. The samples drawn from each PCD were further analyzed using persistent homology. We discarded datasets that had less than 20, 000 data points. Among 55 healthy individuals only 1 had less than 20, 000 data points. For the patient data, the number of such datasets was 34.

Details of persistent homology computation
As mentioned before (Section 2.1), computation of persistence diagrams needs a filtration. We set the filtration induced by the function f = {f v , f e } where f v (p) computes an 'average' Euclidean distance between the vertex p and its k neighbors according to Eq (1) and f e (e) computes the length of the edge e according to Eq (2).
The term kp − q i k in the above equation is the Euclidean distance between the vertices p and q i . The function value f e (e) for an edge e = (p, q) is given by the Euclidean distance between p and q. For the experiments, the number of nearest neighbors is fixed to k = 40.
f e ðeÞ ¼ kp À qk ; 8p; q 2 P and p 6 ¼ q ð2Þ We begin by sampling every cytometry PCD c i and take n(= 20, 000) samples. We do this to make c i uniform w.r.t. number of data points (single CD8+ T cells). We compute a complete weighted graph G(V, E) with vertices in the sampled data. This complete graph G is a key-step that enables us to compute the persistence diagram, Dgm(c i ) of the dataset c i , w.r.t. the filtration function f. We show the algorithm (Algorithm B in S2 Appendix) that executes this step in detail in the supplementary material. Notice that the graph G is weighted in the sense that each vertex v 2 V and edge e 2 E carries a weight of f v (v) and f e (e) respectively. Observe that f : We compute persistence diagrams for each c i 2 D according to Algorithm C in S2 Appendix. The next step involves comparing the persistence diagrams. We do this by computing the Wasserstein distance between persistence diagrams and plotting their distributions. We take two persistence diagrams of randomly selected healthy individuals and compute the Wasserstein distance between them with the help of Gudhi [20, 30] and scikit-learn Python library [31]. Similarly, we compute Wasserstein distance between persistence diagrams of a healthy and an infected individual (both are randomly drawn from the collection). We plot the resulting distances. We do this for 108 pairs to obtain two distributions. Note that, results described in Section 2.1 still holds for 200 pairs (S4 Fig). Intuitively, a large Wasserstein distance between two persistence diagrams implies the datasets on which they were constructed are structurally very different while a small distance implies they are structurally similar.

Statistical testing of difference in Wasserstein distance distributions
2-sided Kolmogorov-Smirnov (KS) tests were performed on Wasserstein distances for pairs of individuals to determine if they arise from the same or different probability distribution functions [32]. MATLAB's subroutine kstest2 was used to determine p-values, where the null hypothesis is that the Wasserstein distances from H×H comparisons and the experimental condition (either H×P or H×R) arise from the same non-parametric distribution, and the alternative hypothesis is that they come from different distributions. A p-value of � 0.05 (or � 0.05) indicates the support for the alternate hypothesis (i.e., data occur from different distributions) is statistically significant (or not significant).

Computing quadratic form (QF) distance
To measure the dissimilarity between a pair of Wasserstein distance distributions, quadratic form (QF) distance was computed as proposed by Bernas et al. in [33] (originally introduced in [34]). The QF distance was calculated using the formula where f and h are two vectors that list counts corresponding to two histogram bin counts. The quantities f and h can be normalised so that ∑ i f i = ∑ i h i = 1 when indexed by i. In our case,

FlowSOM analysis
FlowSOM was performed on the entire non-naïve CD8+ T cell dataset, and separately, the non-naïve B cell dataset. Data was scaled with the transform asinh(x/150) before analysis. See Code Availability for FlowSOM source code. Comparisons of relative cluster abundances between healthy controls and COVID-19 patients were performed with a Wilcoxon rank sum test. Subsequent persistent homology computation was performed on the selected clusters by sampling 20, 000 cells from either the aggregated healthy control data or aggregated COVID-19 patient data. This sampling was repeated to form "synthetic" individual healthy control or COVID-19 patient data. We cannot sample data from each individual, as was done in the prior computations, because many individuals display too few cells in the selected clusters to reliably sample 20, 000 cells. In the CD8+ T cell analysis, clusters #1 and #3 were chosen for persistent homology calculations because they contain the most cells, and thus are most likely to have cells in each sample and be unaffected by random sampling.