STATISTICAL INFERENCE IN CELL LINEAGE TREES

Cell diﬀerentiation is often associated with speciﬁc divisions and generations in a lineage tree. The presence of phenotypic noise, however, can make it diﬃcult to observe such patterns. Using the group symmetry representation of a binary tree, it is shown how variation in a lineage can be compactly described by a set of natural variables each of which is labelled by the division at which a type of variation arises and the generation at which it is expressed. This harmonic analysis for a rooted tree provides a disciplined way to aggregate tree-structured data, improving the ability to identify diﬀerentiation patterns in noisy lineages. It also allows the proportion of variation of a phenotypic fate associated with each division to be estimated and compared to the proportion of variation expressed at each generation. The method has been applied to T-lymphocyte lineages tracked using time-lapse microscopy over several generations. For comparison, the analysis has been applied to C. elegans , a lineage with clear diﬀerentiation stages, and to a stationary branching process, which has none. to DGH, the National Health and Medical Research Council of Australia (NHMRC) Program Grant 1054618 to TPS, and NHMRC grants 620500 and APP1099140 and ARC grant FT0990405 to SMR. We thank Alan Rubin for suggesting we test our method on the C. elegans lineage.

1. Introduction. In embryonic cell lineages, differentiation is often tightly synchronised to specific cell divisions and generations (Chisholm, 2001;Hadjantonakis and Arias, 2016). Low levels of variability in lineages such as for the roundworm C. elegans has allowed unambiguous identification of the developmental patterns (Sulston et al., 1983) without the need for statistical analysis. In contrast, the presence of substantial phenotypic noise in, for example, lymphocyte lineages (Hawkins et al., 2007) may be preventing underlying differentiation patterns from being observed. Whether noise has a functional role in multi-cellular development has been much debated (Balázsi, van Oudenaarden and Collins) with indications that the stability of certain hematopoietic subpopulations arises simply from the law of large numbers, not from programmed differentiation (Gerlach et al., 2013). A statistical method that can detect underlying differentiation patterns in noisy lineage trees would be an important contribution to this debate.
Early approaches to statistical inference in cell lineages (Cowan and Staudte, 1986;Huggins and Staudte, 1994) focused on stationary processes in microbial populations and were not designed to address differentiation patterns. More recently, models have been developed in the context of particular hypotheses, such as chaotic dynamics (Sandler et al., 2015) or state-switching processes  but do not address the general problem of characterising variation in a lineage. The theory of branching processes (Haccou et al., 2005), which has been applied to lymphocyte proliferation and death (Zilman, Ganusov and Perelson, 2010), does not address lineage inference. Phylogenetic inference (Felsenstein, 2003) involves reconstructing an unknown tree structure but is not designed to infer developmental patterns in a known tree structure.
Identifying signal from noisy measurements requires aggregating data, yet for tree-structured data this most elementary of operations is non-trivial. The source of the difficulty is that, with noisy data, daughters are statistically indistinguishable (that is, unindentifiable or exchangeable). This means that their subtrees are indistinguishable too, and so on, recursively through all descendants, leading to a particular pattern of indistinguishability. Now in statistical analysis, one generally aims to aggregate as much of the data as possible to maximise signal-to-noise, but not so much that meaningful associations are lost. Achieving this requires both respecting and exploiting the pattern of indistinguishability.
The optimal aggregation scheme can be found by examining the symmetry invariance of a rooted tree. The indistinguishability of daughters and their subtrees means it is possible to permute family members in ways that preserve the joint distribution over the tree, with the set of all possible permutations forming a group. Group representation theory (Diaconis, 1988;Stiefel and Fässler, 1992) can then be used to identify a linear transformation that not only aggregates tree-structured data optimally but also defines a natural set of variables that is free of the redundancies caused by indistinguishable variables.
This analysis is related to the conventional practice of blocking in nested groups. What is new here is that the observations are distributed at all levels of the nested groups (the tree structure), not just at the lowest level as would be necessary for a conventional analysis of variance (ANOVA) to apply.
This generalisation has profound consequences as the variance components are no longer just scalar eigenvalues, as in an ANOVA (Speed, 1987), but instead form orthogonal subsets of dependent variables where each subset is associated with variation from a particular division.
In this paper this statistical framework is developed and applied to differentiation patterns in various lineage trees, from the highly-ordered structure observed in C. elegans to the featureless character of a simulated branching process. The framework is able to characterise the continuum between these two extreme cases of differentiation where most systems of interest, including the T-lymphocytes discussed here, lie. The technique is analogous to the spectral analysis of a noisy time series where spectral lines can exist concurrently with a broadband background. In that case, the lines are interpreted as arising from ordered processes while the flat background is considered to be from unstructured noise. It is in the spectral domain that this distinction between signal and noise becomes clear.
The rest of the paper is organised as follows. Section 2 shows aspects of the 3 lineage types used in this paper. The framework of the model, and how family members are assigned to variables, is given in Section 3. The core of the paper, Section 4, examines ways to improve inference on trees by progressively increasing model constraints until real data can be analysed. Graphical models are used to visualise and interpret the dynamics of variation in a lineage in Section 5. The progression and expression of phenotypic fate is defined and illustrated in Section 6. A discussion about the interpretations and prospects for this analysis is given in Section 7.
2. Lineage Data. Three types of lineage data are analysed: T cells Unpublished lineage data on CD8 + T cells from GFP:OT-1 transgenic mice. Naive cells, expressing a T cell receptor for SIINFEKL peptide from ovalbumin, interact with peptide-pulsed bone marrowderived dendritic cells to activate clonal expansion (Oliaro et al., 2010). Cells and their descendants are tracked using time-lapse fluorescence microscopy and analysed using custom software (Shimoni et al., 2013). Although multiple phenotypic traits were recorded, in this paper the only trait analysed is the average area of a dividing cell over its lifetime. Note that only dividing cells were used in the analysis; cells that die or whose fate is unknown were counted as missing data. 19 replicate families were used. Worm Published (Santella et al., 2016) embryonic lineage data from the RW10425 transgenic strain of C. elegans. In this strain the PHA-4 protein, a marker for pharyngeal and intestinal tissue, is tagged with green fluorescent protein. Gut differentiation occurs early during embryogenesis, with PHA-4 expression beginning by generations 7 and 8. There are 10 replicate families. Branching Process Simulated lineages from a stationary branching process. 20 replicate families are used, with a missing data fraction of 20% assumed. Here we define a branching process to be one with a particularly simple pattern of correlations between family members ς and ς : The mother-daughter correlation is h while the correlation between any two family members ς and ς separated by a kinship distance ∆ ςς is given by h ∆ ςς . Mother-daughters have ∆ = 1, sisters have ∆ = 2, cousins ∆ = 4 and so on. Importantly, in this scheme daughterdaughter correlations are the square of mother-daughter correlations making daughters independent conditional on their common mother. As will be shown in Section 5, the underlying graphical model for this branching process is a binary tree which is generally not the case for real lineages. For our purposes, this stationary branching process represents a null model since, despite the presence of correlations between family members, there are no preferred differentiation stages.
Sample lineages from these 3 lineage types are shown in Fig. 1 while the expression of each phenotype as a function of generation is shown in Fig. 2.
3. Modelling Framework and Labelling Conventions. As with any statistical model, we must first assign variables to each data point. In general, a lineage measurement, Y hijk , might be indexed by 4 factors: Conditions (h), Family (i), Member (j), and Trait (k). The Condition factor corresponds to the cell type being studied or the experimental arrangement under which a founder cell is chosen or cultured (and may itself consist of multiple factors). Each factor level of Family refers to a particular founder cell and its descendents; each level of Member corresponds to a position in the family tree; and each level of Trait refers to a given phenotype recorded for a cell (such as average size or marker expression intensity).
To focus on the associations among family members, we restrict our attention to modelling a single trait from families subject to the same conditions. A multi-family sample can then be represented by a two-factor array (Y ij ), where i has n levels corresponding to the number of families and j has p levels corresponding to the number of members within a family. With no meaningful distinctions among families (they are all of the same cell type and subject to the same conditions) we assume families are independent and identically distributed replicates. The data can thus be represented by a matrix Y with n replicates (rows) and p variables (columns). Comparison of some sample lineages. Colouring of the nodes reflects quantification of the phenotype under analysis (average area over lifetime for T cells, PHA-4 expression for C. elegans). The absence of a node on a branch represents a missing data point. Note that for the T cell lineage the root node is the naive cell while for the worm lineage the root node is the zygote (labelled P0 in the C. elegans naming convention). Each of the p dimensions corresponds to a family member. We use a binary number to label each family member so that, for example, the first 3 generations are labelled as founder (1), daughters (10, 11), and granddaughters (100,101,110,111), where each label thus encodes the family member position.
The group symmetry methodology described in Section 4.2.5 demands a clear distinction between the terms generation and division: generation refers to the depth of a family member in a tree while division refers to a process occurring between two adjacent generations. A cell thus belongs to a generation but arises from a division. It will be necessary to assign each division a unique two-factor index ( , τ ) with referring to the longitudinal coordinate of the division and τ to the transverse coordinate. In our convention, a cell in generation g always arises from a division with = g. We choose the convention that the founder cell is in generation 1. This means that the first division within the family is actually = 2 since it produces daughters in g = 2. The 'division' = 1 refers to a process, occurring outside the family, that gave rise to the founder cell. Variation attributed to = 1 therefore refers to inter-family variation. Fig. 3 gives a summary of these definitions and conventions.
Often in lineage measurements there are many more members of a family (p) than there are families (n). Thus p n, with the disparity getting exponentially worse with the number of generations studies. Performing reliable inference when p/n > 1 is an open research question (Hastie, Tibshirani and Wainwright, 2015). Best results are achieved when prior knowledge of the problem can be incorporated.
In the next section we describe increasingly more sophisticated steps to reduce the effective dimensionality of the inference calculation, first by exploiting known symmetry properties and then by using observed sparsity properties. Our practical goal is to identify a scheme where the data requirement of the model (the number of replicates required to infer its parameters) is independent of the number of generations studied.

Statistical Inference.
Our objective is to infer the joint probability distribution P(y) from a sample of size n where y is a p-dimensional random variable representing the single trait for each family member in the lineage. In this study we discuss inference for a multivariate Gaussian since the traits we examine for T cells and C. elegans are continuous and approximately marginally Gaussian. Thus, (1, 1) actually represents the process that distinguishes different founder cells and will be use to label inter-family variation. According to group representation theory, each division is a potential source of variation. Importantly, in this study, only the longitudinal division coordinates are distinguishable.
where Σ is the variance-covariance matrix and µ is the multivariate mean. Our preliminary goal is to infer the maximum-likelihood estimateΣ and its inverse, the precision matrixK =Σ −1 . Since there are many shared associations between family members it will turn out to be more useful to estimate the covariance matrix for a natural set of variables. These will be determined from group symmetry arguments. 4.1. Unstructured Gaussian. A naive method for findingΣ is to assume no prior structure on Σ, allowing for all possible associations between family members. Then, if the sample mean (y) and (biased) sample covariance (S) are given by the usual where Y i is the data vector from family i, the mean and variance-covariance estimates are given explicitly bŷ µ = y,Σ = S (2) imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; Using this to analyse the first G generations, the effective number of dimensions p eff , the number of unknown variance-covariance parameters N Σ , and the minimum number of replicates n min required to ensure existence of the MLE are: where the number of family members p = 2 G − 1. Although p eff = p for this unstructured case, with group symmetries p eff < p.
Note how n min increases exponentially with the number of generations G being studied, making this simple model impractical for analyzing trees. Nevertheless it provides a reference for our improved models which aim to make n min independent of G. For each model we will examine the reduction in p eff , N Σ and n min , as will be shown.

4.2.
Symmetry. To reduce n min requires identifying constraints. Constraints on Σ can be found from the group symmetry properties of a tree. We first demonstrate how this pattern can be determined by inspection alone and then use group representation theory to identify the natural set of variables associated with this pattern.

Shared Parameters.
To reduce the number of unknowns in the model, we start by identifying a pattern of shared parameters in the covariance matrix. The shared parameters arise from the indistinguishability of daughters and their subtrees.
For example, consider the pair of cells 10 and 110 which have 1 as their Most Recent Common Ancestor (MRCA). We can uniquely identify the covariance matrix element for this pair by using the 3-index 231 to specify the generation of each cell (2 and 3) and the generation of their MRCA (1). Now because of daughter indistinguishability, the association between a different cell pair, 11 and 101, must have the same 3-digit label 231. We can proceed to give a 3-index label to each covariance matrix element, leading to the following patterned covariance matrix Σ G for the first 3 generations: where lines separate different generations. We use the subscript G to indicate when a matrix has this patterned structure. Because a generation denotes the lineage distance to the founder cell, a meaningful quantity in cell differentiation, we need to specify the generations of the cell pair and its MRCA, not just generational differences. For example, sisters in generation 3 may have different statistical associations to sisters in generation 2. The MLE of a Gaussian with this form of Σ can be found explicitly. To understand why, consider the log-likelihood of the multivariate Gaussian over positive definite matrices K. If the pattern is in the covariance matrix (Anderson, 1973), maximum likelihood (ML) optimisation is not in general convex; if the pattern is in the precision matrix, ML optimization is convex and leads to explicit estimates for Σ (Hojsgaard and Lauritzen, 2008;Szatrowski, 1980). Since, as we will show in the next section, this particular pattern arises from group symmetries in the binary tree, the pattern is invariant under the action of an inverse and is thus the same for covariance and precision matrices. Convex optimisation is thus guaranteed.
To constrain the structure of K we represent it as a linear combination of matrices. The resulting patterned inverse covariance is then where each A α is a matrix of 0's and 1's which has the same dimensions as K G , with 1's identifying a particular shared parameter and a α giving the value of that parameter.
The MLE is found by differentiating Eq. 5 with respect to each a α and setting dL/da α = 0, giving Substituting Eq. 6 gives Since the matrices are symmetric we can equate the inner products of the matrices: Thus the MLE of each shared parameter inΣ G is found by averaging the corresponding elements in S (Hojsgaard and Lauritzen, 2008). The result, S G , is thus the MLE of the patterned covariance: This aggregation of selected elements improves the signal-to-noise in just the right way, enhancing information about all possible associations in the tree by averaging over the indistinguishability pattern.
Note that the MLE of the mean,μ, is also given explicitly for a multivariate Gaussian with group symmetries (Gehrmann and Lauritzen, 2012). For the case of a binary treeμ is found by pooling data from family members sharing the same generation, following the pattern in the variances shown on the diagonal of Σ G (Eq. 4).
While the shared parameters have reduced the number of variance and covariance elements, it is difficult to evaluate how many replicates, n min , are now required to ensure that S is positive definite (Uhler, 2012). The answer will become apparent when we examine the invariant subspaces of the group representation.

4.2.2.
Indistinguishability and Symmetry Invariance. The shared parameter pattern, found above by inspection, arises directly from the symmetry invariance properties of the tree. These symmetries have deep implications, beyond just optimal data aggregation.
An object possesses a symmetry if it remains invariant under the actions of a group. Symmetries are possible in a binary tree because daughters are statistically indistinguishable. We can thus exchange the two subtrees descended from any ancestor without altering the joint distribution over the imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; tree. Here we use the term ancestor to mean a branch node, i.e. any family member that has daughters.
The complete set of group actions is found by considering all the ancestors together, allowing each to be in one of two 'states': having its subtrees exchanged or not. For a tree with G generations and thus A = 2 G−1 − 1 ancestors, there are 2 A unique configurations of all ancestor states that keep the family relationships invariant. These configurations form the complete set of elements in the group of order 2 A .
For example, a tree consisting of the first 3 generations has A = 3 (corresponding to members 1, 10, and 11). The order of the group is thus 2 3 = 8. Each of the 8 permutations is shown in Fig. 4.

4.2.3.
Group-Averaged MLE. Symmetry constrains the joint distribution because each relationship-preserving configuration of family members is a permutation of the p variables that keeps P(y) invariant. Thus if D s is the p-dimensional permutation matrix representing an action s of the group G, symmetry invariance (referred to as G-invariance) requires P(D s y) = P(y), ∀s ∈ G. For a multivariate Gaussian, this means that the covariance matrix belongs to the set referred to as the fixed point subspace of the group G. Note that a G-invariant imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; A fundamental technique for transforming an unconstrained matrix M into one that is symmetry invariant is the group-average or Reynolds operator given by: where |G| is the order of the group. This projects the matrix onto the fixed point subspace by averaging with respect to the orbits of the group. It is straightforward to demonstrate that the pattern that arises from P G (Σ), when G is the symmetry group of the tree, is the same as that shown in Eq 4 that was found by inspection.
The requirement that Σ ∈ W G places a constraint on the log-likelihood optimisation (Eq. 5). The resulting constrained MLE is given by (Shah and Chandrasekaran, 2012) or justΣ G = P G (S) from Eq. 12. Thus the MLE is found by projecting the sample covariance onto the fixed point subspace of the group, a result that necessarily agrees with Eq. 9 since the underlying pattern is the same. P G () is the general operation for pooling data with group symmetries that preserves associations between variables.
This method for finding the shared parameters is only practical when the order of the group, |G|, is small. In a binary tree, |G| increases superexponentially, as 2 A where A = 2 G−1 − 1. Thus, even with 4 generations, |G| = 128, and the group-averaging approach of Eq. 13 is awkward to implement in practice.

Decomposition into Irreducible Components.
Deeper insight, and computational benefit, arises from examining the invariant subspaces of the group representation. This identifies a set of orthogonal components that are a particularly meaningful and concise way of describing tree-structured variation. In this subsection we briefly summarise the general theory for decomposition according to group symmetries, following (Stiefel and Fässler, 1992;Shah and Chandrasekaran, 2012). In the next subsection we apply it to a binary tree.
Let ϑ : s → D s ∀s ∈ G be the representation of G on the vector space V ∈ R p . From Maschke's theorem, and by induction, a linear representation ϑ of a finite group is a direct sum of irreducible representations. Accordingly, imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; has a multiplicity m ω and a dimensionality d ω such that p = I ω=1 m ω d ω . This ensures that any matrix M ∈ W G can be decomposed as where each C (ω) ∈ R mωdω×mωdω corresponds to an isotypic component. Here T is a unitary change-of-basis matrix that transforms M to a symmetryadapted basis where its block diagonal form is revealed.
Furthermore, Schur's lemma states that, since C (ω) and ϑ ω commute, the isotypic components themselves decompose and can be written as a direct sum of repeated subblocks, Ω ∈ R mω×mω . Note that this requires the appropriate arrangement of symmetry-adapted basis vectors within each isotypic component as specified in T (Stiefel and Fässler, 1992).
The fundamental decomposition of M ∈ W G is thus where ν indexes the repeated subblocks M Ω is referred to as an irreducible block while the set of m ω variables from which it is composed is an irreducible component. Irreducible blocks are thus the main objects of inference and, as we will see for the case of a binary tree, represent the variation arising from each division.
Since M is either Σ G or K G in this paper, Eq. 16 states that a model with p variables decomposes into I unique irreducible components each with m ω variables. This reduces the number of pairwise associations in the model since Eq. 16 only permits associations between variables within an irreducible component. It also reduces the total number of unique variables since only one of the d ω identical copies of each irreducible component needs to be considered.
imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; 4.2.5. Decomposition for a Binary Tree. Here we show the results of applying group representation theory (Serre, 1977;Diaconis, 1988;Stiefel and Fässler, 1992) to the vector space defined by all members of a binary tree. Detailed calculations (see Supplement S1) show that for a completely reducible representation on a binary tree with G generations, the number of active irreducible representations I (here indexed by 1 ≤ ω ≤ I), their dimensionalities d ω and multiplicities m ω are As required, this satisfies p = I ω=1 m ω d ω when p = 2 G − 1. The change-of-basis matrix T is calculated using the standard method for identifying the symmetry-adapted basis (Stiefel and Fässler, 1992). For example, with 3 generations: This unitary (and orthonormal) matrix defines a set of symmetry-adapted, or natural, variables in columns based on linear combinations of the standard variables in rows. This transformation of data to its natural basis defined by a group symmetry is referred to simply as spectral analysis by Diaconis, see Chapter 8 (Diaconis, 1988). The parts of Haar transformation matrices (Strang, 1993) for each generation can be recognised and are outlined in blue for generation 2 and green for generation 3. Because representation theory requires the natural bases be grouped into isotypic components, shown here separated by vertical lines, T is not simply a direct sum of Haar matrices. The column headers give imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; the natural variables in terms of divisions and generations, following the interpretation to be described next. How the natural variables are constructed from standard variables is illustrated in Fig. 5 for the case of 4 generations. Examining these combinations suggests labelling each natural variable with a 3-integer tuple ( , τ, g) where : Longitudinal division The longitudinal coordinate of a division, , corresponds to an isotypic component, ω. The fact that there are G longitudinal division coordinates is consistent with there being I = G isotypic components (Eq. 17). τ : Transverse division The transverse coordinate of a division, τ , corresponds to an irreducible component ν within an isotypic component. The fact that there is 1 transverse division coordinate for = 1, 2 and 2 −2 coordinates for ≥ 3 (see e.g. Fig. 5) is consistent with the dimensionalities d ω found for each irreducible representation (Eq. 18). g: Generation The variables within each irreducible component correspond to the generations g at which variation arising from division ( , τ ) can be observed. Given that variation from a division can only be observed in members after that division, g is restricted to ≤ g ≤ G and thus has G − + 1 values. This is consistent with the multiplicity imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; of each irreducible representation being m ω = G − ω + 1 (Eq. 19). Importantly, since each accessible g occurs exactly once in each irreducible component, the irreducible component consists of an ordered sequence and is thus a time series.
The most important aspect of the natural variables is their second order properties. It is straightforward to check that, using T from Eq. 20, the natural decomposition Eq. 16 transforms the patterned covariance, Σ G (Eq. 4), into its block-diagonal form, Σ Ω (see Supplement S2). Just to be clear on the subscript notation, the covariance matrix (or its inverse) can take on 3 forms: unstructured Σ, patterned Σ G , and transformed into its natural basis Σ Ω . Then for the 3-generation tree, Matrix elements within each isotypic block are labelled with a superscript ( ) and 2 subscripts referencing the interacting generations. Each division forms an isotypic block. Starting at = 3, degeneracy occurs and repeated eigenvalues appear. Elements outside the isotypic blocks are zero and labelled with a dot; elements inside an isotypic block but outside an irreducible block are zero and labelled with a 0. For the case of 4 generations, Σ Ω is shown as a heat map in Fig. 6, alongside Σ G Each division ( , τ ) thus represents an independent source of variation. This means that variation at generation g can be decomposed into independent contributions from all divisions where ≤ g. This is simply the traditional concept of variance components in our language of divisions and generations. We will use this to estimate the contributions of each division to the pattern of fate expression (see Section 6).
The decomposition has thus taken a model with associations between all variables and partitioned it into a set of independent ordered sequences each representing the effects of variation from a division ( , τ ). In summary: (i) Each natural variable ( , τ, g) represents variation originating at division ( , τ ) and observed at generation g, (ii) Each division is a source of varia-  tion that can influence generations after it, (iii) Variation originating from different divisions is independent.
Note that because the transverse division coordinates identify identical irreducible components we only need to consider one irreducible component per isotypic block, say ( , 1). Variation from different transverse divisions τ with the same are thus identically distributed. This is a consequence of the degeneracy of eigenvalues arising from the non-commutative symmetry group. Physically this is because transverse division coordinates cannot be distinguished. Thus, in practice we will only decompose variation into longitudinal division components, not transverse division components.
These natural variables provide an intuitive language for characterising differentiation patterns in a tree. Fig. 7 shows examples of some differentiation 'collective modes' (to use a term from physics) and their corresponding and g. Since τ 's are not distinguishable we will ignore their values. Note that if we only had data from a single generation, G, the symmetry group would still be the same as for the case of the entire tree. However, the group would be represented on a vector space with dimension 2 G−1 , given by the number of leaves of the tree. In this case T reduces to the Haar transformation matrix (Strang, 1993), and Σ Ω is diagonal with eigenvalues given by classical nested ANOVA (Speed, 1987). Σ Ω is then the spectral covariance. Examples of asymmetric expression patterns in a tree and the -divisions to which they correspond. Each colour identifies an irreducible component associated with division and observed at some generations {g}. For example, the green asymmetry arises at division 5 but is not observed until generations 6 and 7. Note that transverse division coordinates, τ , cannot be diagnosed and are ignored.
What is new here is that we have examined the group representation on all the generations up to and including G, a vector space with dimension 2 G − 1. The abstract group is same in both cases; the difference is in its representation.
We remark that these variables are called natural because the underlying group symmetry of the tree required the standard variables be aggregated in just this way. It is satisfying that the groupings correspond to the those one would intuitively want to define in a nested ANOVA if we were to consider each generation separately.
Note that K ( ) Ω has the same dimensionality as S ( ,τ ) Ω but is independent of τ . Substituting Eqs. 23 and 24 in Eq. 5, it is thus apparent that each irreducible block can be treated as an independent MLE calculation. The result,Σ ( ) states thatΣ ( ) Ω is found by averaging the d irreducible subblocks in the transformed sample covariance. This procedure ensures that elements of S Ω that are outside the block diagonal are ignored.
The resultingΣ can be reconstructed by substituting Eq. 25 in Eq. 16 and transforming back to the original basis: The procedure for findingΣ for a G-invariant covariance is thus as follows: 1. Transform S into the symmetry-adapted basis. 2. Zero the elements outside the irreducible blocks. 3. Average the irreducible blocks within each isotypic block (if there is more than one).

Transform back to the original basis.
This result is necessarily the same as the projection of S onto the fixed point subspace of the group (Eq. 13) butΣ Ω provides a more compact and informative way of describing tree-structured variation.
imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; Importantly, Eq. 25 gives an explicit MLE of the irreducible block that involves simply re-arranging terms in the sufficient statistic S. Each of these blocks is thus a descriptive statistic for tree-structured data, involving linear combinations of data points, sums of squares and no parameters. 4.2.7. Complexity of MLE calculation. Transforming Σ to the symmetryadapted basis makes it clear how much group symmetry reduces the complexity of the model. The effective number of dimensions, p eff , is found by summing the dimensions of each unique irreducible component. The number of free parameters in the covariance matrix, N Σ , is found by summing the number of parameters in each unique irreducible block. The minimum number of replicates required, n min , is found from the dimensionality of the largest irreducible block ( = 1). Thus The group-symmetric model is thus significantly more constrained than the unstructured model (compare Eq. 3), with the number of parameters growing polynomially with G instead of exponentially. Note how p eff < p (when G ≥ 3), a reduction in the effective number of dimensions that was not apparent from the fixed-point subspace perspective. Even with these symmetry constraints however, n min still grows with G, albeit linearly (Eq. 30) instead of exponentially. This means that for a fixed set of n replicates there will always be a limit to the number of generations that can be analysed. An additional constraint is required. 4.3. Sparsity. The additional constraint comes from recognising that each irreducible component is a time series from generation to G (see Section 4.2.5). Together, the irreducible components form a set of G independent time series each starting at a different generation but all ending at G. A standard procedure for restricting the complexity of a time series is to consider a fixed order Markov process. This restricts each irreducible block of the transformed precision matrix to having non-zero values along a diagonal band (the tri-diagonal in the case of a 1st order Markov process). Remember that here it is the structure of each K A restricted-order Markov chain is a simple case of a decomposable Gaussian graphical model (Speed and Kiiveri, 1986;Lauritzen, 1996) and thus yields an explicit MLE whose calculation we briefly summarise here. Following the standard procedure for a decomposable model, variables in the block are organised into cliques and separators, a straightforward exercise for a Markov chain of any order. The corresponding sub-blocks within S where the subscript c i refers to a clique, s i refers to a separator, and N C is the number of cliques in the irreducible block. The MLE for an irreducible block is then given explicitly by (Lauritzen, 1996) where the expression {Υ} 0 denotes a matrix with the dimensions ofK ( ) Ω which has its appropriate sub-block occupied by Υ and zeros elsewhere.
This expression makes it clear that, since it is the inverse of the clique and separator sub-blocks that are required, it is only these sub-blocks that need to be positive definite. The mininum number of replicates required for positive definiteness is thus set by the order M of the Markov process, which is fixed, rather than by the size of the irreducible block, which grows linearly with G. In general, n min = 2 + M and we have finally achieved our goal of having the data requirements be independent of the number of generations being analysed. Note that restricting the non-zero parameters in the precision matrix to be on the diagonal band means that N Σ ∼ O(G 2 ), down from the cubic dependence in Eq. 29. p eff remains unchanged.
Inspection of the T-cell and worm lineage data show that, at least up to generation 4, non-zero values in K ( ) Ω are indeed primarily confined to the tri-diagonal. This justifies the (first-order) Markov process assumption, and we hereafter use it to extend the analysis to higher generations. 4.4. Missing Data. The MLE calculations for the models described above assume complete data. In reality, some measurements are missing, often because data collection is imperfect but also because cells die and have no descendants. This creates problems. Having to marginalise over the missing variables would mean that the MLE calculations are no longer convex imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; and the explicit expressions for the MLE no longer apply. Also, with such partially-balanced data we would not be able to perform the similarity transformation to the symmetry-adapted basis. A simple solution is to apply the Expectation-Maximization (EM) Algorithm (Dempster, Laird and Rubin, 1977). This iteratively improves the estimate of the covariance matrix, generating expected values of the sufficient statistic at each step. In the E-step, the current estimate of the mean µ and covariance matrixΣ are used to calculate the expected sufficient statistic for each replicate, conditioned on the observed data. The average sufficient statisticŜ over all replicates is then calculated. In the M-step,Ŝ is used in the MLE calculation of the irreducible blocks (as described above) to update the estimateΣ. The E and M steps are then repeated untilΣ converges.
In more detail (Little and Rubin, 2002), the first and second order statistics are calculated for each replicate i by partitioning the variables into observed sets, labelled o i , and unobserved sets, labelled u i . Members of each set usually differ from one replicate to the next. The vector of unobserved values in each replicate is then filled by its expected value conditioned on the vector of observed values: These combined with the observed values completes the first order statistic, The second order statistic (Y Y ) i for each replicate i, partitioned into observed and unobserved sections, is found from is the residual covariance of the unobserved variables after conditioning on the observed variables.
imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; Once this exercise has been completed for all replicates, the sample mean and covariance are calculated from the usual The estimated sample covariance,Ŝ, is then used in the procedures described in the previous sections to calculate a new estimate,Σ.
Iterating these steps gives the following algorithm: 1. Initializeμ andΣ. 2. Expectation step to determine the expected value of the sufficient statistics for each replicate. Use Eqs. 33, 34, 35 to calculate the updated estimateμ and the estimated sample covariance,Ŝ 3. Maximization step to findΣ fromŜ.

Return to
Step 2 until convergence.

Graphical Models of a Lineage.
Having estimated the parameters in the model we can visualise and interpret the results using graphical models. Here we use two types of graphs: an undirected graph in the standard basis where the individual family members are nodes, and a directed graph in the symmetry-adapted basis where the natural variables are nodes. As will become apparent, the natural basis is a more compact way of viewing all the parameters at once. 5.1. Standard Basis, Undirected Graph. To visualise the network of statistical associations between different family members we use an undirected graph (Speed and Kiiveri, 1986;Lauritzen, 1996). Here we examine the graphs defined either by marginal or by conditional associations. The first is found from Σ while the second is found from K.
For the network of marginal associations the strength of an edge between a pair of variables is defined by the Pearson correlation coefficient, ρ jj = σ jj / √ σ jj σ j j where σ jj is an element ofΣ. For the network of conditional Undirected graphs in the standard basis. The colour of edges in each graph corresponds to the correlation (top row) or partial correlation (bottom row) between pairs of family members. To avoid clutter only the first 4 generations are shown. Note how the graph of partial correlations (8f) is a binary tree for the simulated branching process but not for the real lineages. This emphasises how the graphical model of real lineages has the symmetry properties of a tree but not the sparsity properties.
imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; associations the strength of an edge is determined by the partial correlation jj |V \{j,j } = −κ jj / √ κ jj κ j j where κ jj is an element ofK, and V \{j, j } refers to the set of variables excluding j and j . Both types of undirected graphs are shown in Fig. 8 for the 3 lineage types. The network of conditional associations identifies direct interactions between variables, conditioned on all other variables, and, as expected, provides a sparser representation than does the network of marginal associations. Note how a binary tree is revealed in the graph of partial correlations for the branching process (Fig. 8f). This is expected since the branching process was designed so that daughters were independent when conditioned on their common mother. In the network of partial correlations this assumption reveals itself as the lack of an edge between sisters. In contrast, in the partial correlation graphs for T-cell (Fig. 8d) and worm (Fig. 8e) lineages, sisters are often joined by edges.
We emphasise that the inferred undirected graph for all lineages has the symmetry structure of a binary tree but not necessarily its sparsity structure. We are thus able to examine how the inferred network of statistical relationships compares to the known network of familial relationships; though the latter is a binary tree, the former may not be. 5.2. Natural Basis, Directed Graph. One problem with representing each family member as a node is that the graph appears cluttered since that there are many edges and nodes with similar parameters. This problem gets exponentially worse with increasing generations. Such redundancies disappear when examining the tree over its natural, or symmetry-adapted, variables, where the indistinguishabilities have been removed.
Since the natural variables in each irreducible component are ordered by generation they can be represented by a directed graph, with each variable conditioned on the past (Wermuth, 1980;Kiiveri, Speed and Carlin, 1984;Pearl, 1988). Each irreducible component is a chain and the tree is represented by G independent chains. The structural equation model, sometimes called a causal model, underlying each chain is a non-stationary time series given by the following system of equations: Note that each irreducible component is represented by its own system of equations but we avoid the superscripts to reduce index clutter. Here z j is a natural variable corresponding to a generation j, β jj is the regression imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; coefficient of generation j on j , and ε j is an independent zero-bias random variable representing the noise originating at generation j. Defining a lowertriangular coefficient matrix B = (b jj ) gives the system of equations in matrix form: To find the parameters in the structural equation model, β jj and E(ε 2 j ), from the inferredΣ Ω we use the modified Cholesky decomposition: where Φ = (ϕ jj ) is diagonal and L is lower triangular. Then since E(zz ) = Σ Ω , we find that L −1 = (b jj ). Thus β jj can be found from Eq. 37 while the noise terms are found directly from E(ε 2 j ) = ϕ jj . The directed graph can then be defined with edge weights given by β jj and node strengths given by E(ε 2 j ). The edges represent transmission of variation while the nodes represent innovations. If |β jj | < 1 then transmission is regressive, with descendants gradually losing memory of previous generations. However, if |β jj | > 1 then variation from that division observed at generation j is amplified during transmission to generation j. Thus variation can arise directly from a noisy generation (node) or it can by amplified by strong transmission between generations (edge), or both.
The directed graphs for the 3 lineage types are shown in Fig. 9. It is striking how β jj is particularly high for certain generations and divisions in the worm lineage. For the T cells division = 1 has large innovations and high transmission between generations whereas the other divisions are fairly quiet. The branching process is largely featureless across all generations and divisions, as expected.
6. Explaining Fate. The directed graphs shown in Fig. 9 give a detailed summary of variation throughout lineage. We now examine what the results mean for our understanding of cell fate, addressing two questions in particular: How much each division contributes to cell fate, and how well an ancestor's phenotype predicts the fate of its descendants.
In this study, fate is defined to be the phenotype y observed at the latest generation modelled, G. This practical definition allows us to develop the quantitative framework for understanding fate in terms of explained variance. Of course fate at generation G may not represent the ultimate fate of cells in a lineage, especially if G is an early generation. Furthermore, cell fate is often conceptualised in terms of cell types and thus represented by discrete states. However cell types are usually identified by the continuous expression levels of certain underlying phenotypes. It seems reasonable then to define fate directly in terms of the measured (continuous) traits rather than the derived (discrete) cell types.
With this definition, a statistical analysis of cell fate involves explaining the variability of the phenotype at generation G in terms of other variables. If a set of variables can be shown to account for most of the variability at generation G then we can say that those variables explain fate (in the statistical sense). First we will explain a cell's fate by the hidden contributions from each division; then we will explain it by the observed phenotypes of the cells' ancestors. The first we interpret as a measure of fate progression (or commitment) while the second we interpret as a measure of fate expression.
6.1. Contributions to Fate. Whether certain divisions are more important than others, and how committed cells are to particular fates at different stages in the lineage are questions of obvious scientific interest.
It is straightforward to estimate the contributions to cell fate at generation G from each division , where 1 ≤ ≤ G. This is just the standard problem of estimating the variance components for nested groups of family members in a single generation, a trivial exercise now that we have estimated the transformed covariance.
Consider the variance of a cell in generation G, represented by a diagonal imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; element in Σ G and, in the 3-index notation used in Eq. 4, denoted by σ GGG . This can be written in terms of the elements ξ ( ) gg of Σ Ω (from Eq. 22) by applying the similarity transformation Σ G = T Σ Ω T † to Σ Ω . Supplement S2 gives a specific example. The result shows that σ GGG is just the sum of independent contributions from each division ( , τ ) : where N div = G =1 d = 2 G is the total number of divisions (which is equal to the number of members of generation G). As before, d is the dimension of the -th irreducible eigenspace or, equivalently, the number of transverse divisions for the -th division (Section 4.2.5).
The components of variance, ξ 6.2. Predictability of Fate. The question of how well a cell's fate can be predicted from the phenotypes of its ancestors is a measure of how much information about a given phenotypic fate is being expressed in earlier generations.
We define the predictability of fate to be the proportion of variance of a family member in generation G that can be explained by the phenotypes of its ancestors, where here the 'explaining' is by linear regression. Given a family member in generation G and its direct ancestor in generation g, the proportion of explained variance is just the squared correlation coefficient, or coefficient of determination, In the subscripts we have simplified the 3-index notation from Eq. 4 by ignoring the third index. This does not cause confusion since in this context we are only concerned with direct ancestors; the third index, associated with the MRCA, is thus redundant. For example, σ 34 is understood to be the covariance between a mother in generation 3 and one of its daughters in generation 4, not any other daughter.
Generalising to prediction using multiple generations of direct ancestors up to and including that in generation g gives where g represents a vector of direct ancestors of the cell in generation G that are from generations 1 to g inclusive. Note that Eq. 43 accounts for possible dependencies in the variation between ancestors. Unlike for the case of components of variance, variation between ancestral generations is not (in general) orthogonal.
6.3. Comparing Explanations of Variance. The two explanations of phenotypic variability are complementary: η 2 explains variance in terms of shared ancestral divisions, compares members within the same generation, and involves the transformed covariance matrix Σ Ω ; in contrast, R 2 explains variance in terms of ancestral generations, compares members across generations, and involves the covariance matrix Σ. The two quantities are shown in Fig. 10 for the 3 lineage types with the top row giving the explained variance and the bottom row giving the cumulative explained variance. Note that because of the first order Markov process assumption (Section 4.3),  R 2 = R 2 cml in these plots. For the case of the simulated branching process the exact result is also shown.
η 2 (G, ) (blue line, top row) gives the contributions to fate at generation G from each of the earlier divisions . For the worm, = 3, 4, 6 are particularly important divisions for explaining fate (at G = 8) while = 1 is irrelevant. For the T cell = 1 is by far the important division for explaining fate (at G = 5) while higher divisions are unimportant. As expected, for the branching process, contributions for all divisions are comparable.
R 2 (G, g) (orange line, top row) gives the predictability of fate at generation G using each of the earlier generations g. For the worm R 2 is zero until g = 7, when differentiation, and thus expression of fate, has started to occur and successive generations start to resemble each other. Strikingly then, none of the structure in η 2 from 1 ≤ ≤ 6 is reflected in R 2 . For the T cell, even though most of the variation is explained by division = 1, R 2 is low for g = 1, 2 indicating that the phenotypes of those ancestors contain little information about their descendants despite their fate having largely been set. For both these lineages then, cell fate is being determined in early generations but is not being expressed until later.
Such 'hidden' fate progression is best visualised in the cumulative eximsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; plained variance shown in the bottom row of Fig. 10. For the worm, η 2 cml increases unevenly with each division while R 2 cml (G, g) remains zero. For the T cell, even though η 2 cml starts high at = 1, R 2 cml (G, g) starts at zero. Contrast both these lineages with the branching process where η 2 cml and R 2 cml both start near zero and increase steadily in a similar fashion. This would be expected since all the variation in the branching process is expressed. Clearly a T cell lineage cannot be modelled by a branching process. Based on these observations we interpret η 2 cml as a measure of fate progression (or commitment) and R 2 cml as a measure of fate expression. In the two real lineages fate progression is always higher than fate expression, with the difference representing hidden fate progression. This reflects cells committing to a particular fate before they express individual markers of that fate.
7. Discussion. We have developed a general method for inferring the structure of lineage heterogeneity. Symmetry invariance, invoked to constrain the joint probability distribution, identifies a set of natural variables which compactly describe tree-structured variation. To apply the method to real measurements we employed a Markovian constraint and used the EM algorithm to account for missing data.
The inferred parameters were interpreted in two ways. First, directed graphs over the irreducible components gave a fine-grained, causal view of all variation throughout the lineage. Second, variation in a late generation was explained in terms of earlier divisions and generations, allowing the progression and expression of cell fate to be inferred. Comparing the two methods distills how much of the fine-grained behaviour seen in the directed graphs actually affects later generations: ancestral variation only influences fate if it is effectively transmitted through intermediate generations. This makes it possible to distinguish between variation that influences fate and variation that is inconsequential noise.
Examining the differences between the worm and branching process highlights how regularities in noisy lineage patterns can be distinguished from noise. The sharp features, analogous to spectral lines, seen in the 'fate spectrum' of the worm, Fig. 10b, are associated with asymmetric divisions that give rise to well-known differentiation structure. Supplement S3 shows a detailed worm lineage with each family member annotated by its standard label, highlighting the specific divisions contributing to the spectral features. In contrast the branching process seen in Fig. 10c has a comparatively featureless structure, analogous to a white noise spectrum. The T cell lineage in Fig. 10a lies somewhere in between, showing a single peak at = 1, but imsart-aos ver. 2014/10/16 file: main1.tex date: December 18, 2017 . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/267450 doi: bioRxiv preprint first posted online Feb. 18, 2018; no other structure. We emphasise that lineage patterns do not have to be invariant (as in C. elegans) in order to be detected. In mammalian lineages, where cell differentiation is more likely to occur over clusters of closely-related divisions and generations rather than at specific ones, peaks in η 2 would be spread over neighbouring . This would be analogous to the broadening of spectral lines. We note that the concept of a fate map has largely been a deterministic one, describing at what stage a particular fate is specified (Chisholm, 2001). This method can thus be regarded as the first steps towards a probabilistic representation of the fate map.
The development of new lineage measurement techniques (Amat et al., 2014;Frieda et al., 2016) has been increasing the need for statistical methods to analyse lineage variation. Moreover, since the method can be viewed as a generalisation of ANOVA or multilevel modelling, it may find similarly broad use. The ubiquity of binary trees in both natural and human-designed systems suggest that potential applications exist in areas beyond cell lineages.

SUPPLEMENTARY MATERIAL
Supplement to 'Statistical Inference in Cell Lineage Trees': (sup1.pdf). Additional information is provided in a separate file. This covers the derivation of the group representation for a binary tree (Supplement S1), explicit expressions for the decomposition of a tree-structured covariance matrix (Supplement S2), and a figure showing C. elegans lineage with standard notation (Supplement S3).