DF, AW, and ES conceived and designed the experiments. DF, AW, and SK performed the experiments. DF, AW, SK, UF, and ES analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper.

A patent application may be made on the results reported.

What is the lineage relation among the cells of an organism? The answer is sought by developmental biology, immunology, stem cell research, brain research, and cancer research, yet complete cell lineage trees have been reconstructed only for simple organisms such as

A multicellular organism develops from a single cell, the zygote, through numerous binary cell divisions and cell deaths. Consequently, at any given time, the lineage relations between the cells of the organism can be represented by a rooted labeled binary tree called the cumulative cell lineage tree (

(A) Multicellular organism development can be represented by a rooted labeled binary tree called the organism cumulative cell lineage tree. Nodes (circles) represent cells (dead cells are crossed), and each edge (line) connects a parent with a daughter. The uncrossed leaves, marked blue, represent extant cells.

(B) Any cell sample (A–E) induces a subtree, which can be condensed by removing nonbranching internal nodes and labeling the edges with the number of cell divisions between the remaining nodes. The resulting tree is called the cell sample lineage tree.

(C) A small fraction of a genome accumulating substitution mutations (colored) is shown. Lineage analysis utilizes a representation of this small fraction, called the cell identifier. Phylogenetic analysis reconstructs the tree from the cell identifiers of the samples. If the topology of the cell sample lineage tree is known, reconstruction can be scored.

(D) Coincident mutations, namely two or more identical mutations that occur independently in different cell divisions (blue mutation in A and B), and silent cell divisions, namely cell divisions in which no mutation occurs (D–F), may result in incorrect (red edge) or incomplete (unresolved ternary red node) lineage trees. Excessive mutation rates might result in successive mutations (not shown), which cause the lineage information to be lost.

Lineage relations among cells have been studied using a variety of clonal assays [2,3,6,8,10,16–24]. Such assays act by detecting the progeny of a single founder cell, which has been marked by a heritable marker. Some assays mark the founder cell by an invasive technique such as injection of a tracer molecule [

Genetic variability has also been used for reconstructing lineage trees of several tissue samples extracted from the same individual. In one study [

In this paper we show that cell lineage trees can be reconstructed from genomic variability caused by somatic mutations, and that somatic mutations in higher organisms contain sufficient information to allow precise reconstruction of the organism cell lineage tree. We describe a hybrid in vitro/in silico automated procedure for reconstructing cell lineage trees from DNA samples, and demonstrate its effectiveness and precision in a controlled environment.

Somatic mutations are sufficiently rare for common wisdom to say that “the genome is the same in every cell in the body” except for some white blood cells [

However, precise reconstruction may be hampered by three factors: coincident mutations (

In principle, any mutation information may assist lineage tree reconstruction. We focus on mutations in MSs for the following reasons: (1) MS slippage mutations, which insert or delete repeated units in an MS, are thought to occur during DNA replication [

In order to assess the theoretical potential of lineage analysis using genomic MSs, we obtained data regarding human and mouse MSs and performed calculations and computer simulations based on these data. We searched the human and mouse genomes for MSs and found about 1.5 million loci interspersed on all chromosomes and containing a variable number of tandem repeats. Based on these data, and on published data regarding human and mouse MS mutation rates [^{12} leaves and correspond, under reasonable assumptions, to a newborn mouse;

Two types of random trees with 32 leaves were generated, and MS stepwise mutations were simulated. Results of simulations of wild-type human using different numbers of MS loci are shown. The white line marks the perfect score limit (according to the Penny and Hendy tree comparison algorithm [

We prove that our approach can, in principle, reconstruct without error the condensed trees of sets of cells that are many orders of magnitude larger than any reconstructed in the past. Our proof is based on certain assumptions regarding the DNA contents and nature of mutations in the human genome. These assumptions are stated explicitly in this manuscript, and (to the best of our knowledge) are in agreement with existing biological literature.

In describing our theoretical results, we prefer simplicity over achieving the best possible results based on our assumptions. In particular, the “triplet algorithm” that we present for reconstructing the condensed tree does not make use of all information available to it, and there is a lot of slack in the analysis. One may expect that more sophisticated algorithms coupled with tighter analysis will allow one to extend the family of trees that can be inferred using our methods.

We have chosen not to try to strengthen our theoretical results at this point for the following reasons. First, we believe that the vast potential of our method is made clear already by the analysis that we provide here. Second, and perhaps more importantly, it may be premature to enter into a lengthy theoretical analysis before establishing more firmly the biological assumptions on which the analysis is based. The biological assumptions that we make here may eventually turn out to be too optimistic in some respects (e.g., that mutation events are, or can be viewed as being, statistically independent), and too pessimistic in other respects (e.g., that the only significant source of variability in the human genome is MSs). Hence, there is not much point in performing tedious and time-consuming analysis based on current biological assumptions.

We now describe the mutation model that we assume in the analysis, which we call the “uniform model.” We use the following notation: ^{6} and ^{−5}.

This completes the description of the uniform model. Our analysis addresses the following question: assuming that one could read with no error the identifiers of all extant cells, can one (with high probability, over the events of random mutations) reconstruct the underlying condensed tree with no error at all? The answer depends on the “shape” of the extant tree. We present some ranges of parameters for which reconstruction is possible. Specifically, we take ^{40}). Note that ^{47}).

We performed simulations on two types of randomly generated cell lineage trees (

In order to assess the extent of silent cell divisions, which might act as a limiting factor for reconstructing human cell lineage trees, we calculated the probability for a silent cell division and estimated the total number of cells in the tree for a newborn human. We found that in a single cell division the probability of a daughter cell acquiring no new mutations is less than 10^{−21}. For estimating the total number of cells in the tree we created a model of human embryonic development that overestimates the number of cells and cell divisions, and thus can serve as a theoretical upper bound on the size of the cumulative cell lineage tree of a newborn human. We found that in the model, in more than 99.9% of newborns, there is at least one new mutation in each daughter cell in each cell division. This suggests that during human prenatal development, even a single silent cell division is unlikely to occur. As mentioned above, coincident mutations may cause erroneous tree reconstruction, but because there are no data on the topology and depth of newborn cumulative cell lineage trees, it is difficult to estimate their effect in this model.
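The relation between the two figures quoted above can be checked with a short calculation. The following is a minimal sketch, not the authors' code: it assumes that silent divisions in different cells are statistically independent, uses 10^{−21} as the per-daughter-cell probability of acquiring no mutation, and takes the model's upper bound of about 4.9 × 10^{16} cells in the neonatal tree. The names `P_SILENT`, `TOTAL_CELLS`, and `prob_no_silent_division` are illustrative.

```python
import math

# Upper bound on the probability that a daughter cell acquires no new mutation.
P_SILENT = 1e-21
# Assumed upper bound on the number of cells (nodes) in the neonatal tree.
TOTAL_CELLS = 4.9e16


def prob_no_silent_division(p_silent, total_cells):
    """Return (1 - p_silent) ** total_cells, assuming independent cells.

    The power is evaluated in log space because 1 - 1e-21 rounds to 1.0
    in double precision; math.log1p keeps the tiny term.
    """
    return math.exp(total_cells * math.log1p(-p_silent))


p_all_mutated = prob_no_silent_division(P_SILENT, TOTAL_CELLS)
print(f"P(every cell carries a new mutation) = {p_all_mutated:.6f}")
```

Under these assumptions the result exceeds 0.999, in line with the statement that more than 99.9% of newborns have no silent cell division.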

(A) Photograph and scheme of the

(B)

(C) Transverse scheme of the

Past work on

In order to analyze genomic variability within a plant, we grew an MMR-deficient

We developed a procedure that takes as input a set of DNA samples, primers for MS loci, information on expected MS sizes, and information on PCR and capillary electrophoresis multiplexing compatibility between MS loci, and outputs a reconstructed cell lineage tree (with edge lengths) correlated with the DNA samples (

The procedure accepts biological samples and PCR primers as input, and outputs a reconstructed lineage tree. It consists of a series of seven consecutive steps (numbered), during which the physical biological samples are “transformed” into digital data, which are then analyzed algorithmically. We built a hybrid in vitro/in silico automated system that performs steps 2–7 of the procedure (outlined), and used it to process DNA from tissue samples and single-cell clones. Incorporation of whole genome amplification techniques in the future may enable processing of single cells as well. For a detailed specification of the procedure, see

We accomplished the procedure in a hybrid in vitro/in silico automated system, which operates as follows. A predetermined set of

To quantitatively evaluate the cell lineage tree reconstruction procedure, we cultured ex vivo cell trees with known topologies and well-estimated edge lengths, called cultured cell trees (CCTs). We constructed three CCTs (A–C; ^{2} = 0.955). Furthermore, reconstructions of the CCTs without using root identifiers (

(A–C) A cell sample lineage tree with a predesigned topology is created by performing single-cell bottlenecks on all the nodes of the tree. Lineage analysis is performed on clones of the root and leaf cells. Three CCTs (A–C) were created using LS174T cells that display MS instability. All topologies were reconstructed precisely. Edge lengths are drawn in proportion to the output of the algorithm. Gray edges represent correct partitions according to the Penny and Hendy tree comparison algorithm [

(D) There is a linear correlation (^{2} = 0.955) between reconstructed and actual node depths.

(E) Reconstruction scores of CCTs A–C using random subsets of MS loci of increasing sizes (average of 500).

We discovered that somatic mutations in higher organisms carry enough information to enable precise reconstruction of the entire organism cell lineage tree. We demonstrated the practical utility of the discovery by developing a prototype automated procedure for the reconstruction of cell lineage trees from DNA samples.

In the short term, small-scale projects utilizing this discovery and its associated procedure may aim to gain preliminary understanding of partial lineage trees associated with different organs or systems, by analyzing cell samples containing only dozens or hundreds of cells. In addition, analysis of the development of cancer using this method may provide immediate benefits. Cancer analysis may not require the perfection of single-cell methods, since clonal tissue samples may be obtainable from solid tumors.

In the longer term, with the improvement of DNA sequencing technologies [

We downloaded the human (build 35) and mouse (build 33) genomes from UCSC Genome Bioinformatics (

For estimation of the number of MS mutations in each cell division in human, we obtained from the literature [

The mutation-rate function for MSs with 9–15 repeat units is therefore

where e is the base of the natural logarithm. For all MSs with fewer than nine repeat units, we made a conservative assumption that their rate of mutation is zero. Because of the lack of information regarding mutation rates in MSs with more than 15 repeat units, we made another conservative assumption that the mutation rates of all such loci are the same as for loci with 15 repeat units. Therefore, our estimated mutation rates for short and long MSs most likely represent an underestimate of the actual rate. We sorted the human MSs according to their length, and for each length we computed the expected number of mutations acquired by a daughter cell in a single cell division by multiplying the mutation rate by the number of MSs. The total number of expected MS mutations was computed by summing the expected number of mutations in each length category. See complete data in

In contrast to the information regarding human MS mutation rates, there are fewer published data regarding mouse MS mutation rates, and data from different sources may be inconsistent. Comparison of data from several studies of human [

Our mathematical analysis, simulations, and the reconstruction of CCTs assume a uniform MS mutation rate across tissue types, as there is insufficient knowledge at present to assign different somatic MS mutation rates to different tissue types.

The simplifying assumptions underlying the uniform model make the calculation of some key quantities (which we call “signal,” “noise,” and “loss,” as indication of their effect on our eventual reconstruction algorithm) rather straightforward. We provide such calculations now (omitting some details). For signal, the expected number of mutations per edge of the extant tree is ^{t}^{−50}/^{2}/2

We now describe the reconstruction algorithm that we analyzed. This is a new algorithm, which we call the “triplet algorithm,” designed to facilitate the proof. This algorithm is chosen here because its analysis is simple, but we do not necessarily advocate its use in practice. We suspect that similar (and perhaps even better) results are true for other algorithms as well.

The basic primitive of the triplet algorithm is a “triplet subroutine.” Given identifiers for three cells (say, A, B, and C), the triplet subroutine counts, for every pair of cells, the number of common mutations, namely, the number of loci in which the two cells have the same label and this label differs from the corresponding label of the root. The triplet subroutine outputs the pair of cells that maximizes this count (say, A and B). We say that the triplet subroutine is “successful” if the pair of cells that it outputs is the one that has the longer common branch (or equivalently, the deeper common ancestor).
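The triplet subroutine can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation analyzed in the proof: identifiers are represented as equal-length lists of integer labels, the root identifier is supplied by the caller, and ties are broken in favor of the first pair in alphabetical order. The function names are hypothetical.

```python
from itertools import combinations


def common_mutations(x, y, root):
    """Count loci where x and y share a label that differs from the root's."""
    return sum(1 for a, b, r in zip(x, y, root) if a == b and a != r)


def triplet_subroutine(identifiers, root):
    """Return the pair of cell names with the most common mutations.

    identifiers: dict mapping three cell names to their identifiers.
    Tie-breaking (first pair in sorted order) is an assumption of this sketch.
    """
    if len(identifiers) != 3:
        raise ValueError("the triplet subroutine needs exactly three cells")
    pairs = combinations(sorted(identifiers), 2)
    return max(
        pairs,
        key=lambda p: common_mutations(identifiers[p[0]], identifiers[p[1]], root),
    )


# Toy example: A and B share two derived labels; C shares none with either.
root = [0, 0, 0, 0, 0]
cells = {
    "A": [1, 0, -1, 0, 0],
    "B": [1, 0, -1, 1, 0],
    "C": [0, 2, 0, 0, 0],
}
closest_pair = triplet_subroutine(cells, root)
print(closest_pair)
```

In the toy example, cells A and B are reported as the pair with the deeper common ancestor because they share mutations at two loci.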

The triplet subroutine will be successful unless there is some value of ^{t}^{−51} × (1/^{2}, which is maximized when ^{−18}. Summing over all values of ^{−17}. Hence one can execute the triplet subroutine on 10^{17} arbitrary (not just random!) triplets, and still be likely to be successful in all executions.

We now describe the triplet algorithm. View every cell as a vertex in an auxiliary graph G. In an execution of a triplet subroutine that outputs cells A and B (say, on input A, B, and C), put an edge between A and B. As long as there are more than two connected components in the graph G, pick three vertices from three different components and execute a triplet subroutine on them, thereby adding an edge to the graph and decreasing the number of connected components by one. After

We have just shown that for every extant tree of depth ^{17} is much larger than 40 × 2^{40}. In fact, the ratio between these numbers is such that the probability that the tree is constructed with no error is greater than 0.9995. This completes our proof.
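The final probability can be checked with a union bound. The sketch below assumes that 10^{−17} bounds the failure probability of a single execution of the triplet subroutine and that 40 × 2^{40} bounds the number of executions; both readings are assumptions of this sketch, and the variable names are illustrative.

```python
from fractions import Fraction

# Assumed upper bound on the failure probability of one triplet execution.
P_FAIL_PER_TRIPLET = Fraction(1, 10 ** 17)
# Assumed upper bound on the number of triplet executions.
TRIPLET_EXECUTIONS = 40 * 2 ** 40

# Union bound: P(any failure) <= executions * per-execution failure bound.
failure_bound = TRIPLET_EXECUTIONS * P_FAIL_PER_TRIPLET
success_lower_bound = 1 - failure_bound

print(f"failure bound       ~ {float(failure_bound):.3e}")
print(f"success lower bound ~ {float(success_lower_bound):.6f}")
```

Exact rational arithmetic is used so that the comparison with 0.9995 does not depend on floating-point rounding.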

The simulations demonstrate, first, that human wild-type MS mutations enable accurate reconstruction of cell lineage trees and, second, that with higher mutation rates, as in MMR-deficient cells, cell lineage trees can be accurately reconstructed with no more than 800 MS loci (a kit containing 800 primer pairs for human MS amplification is commercially available).

The simulation proceeds as follows. A random tree is generated according to chosen topology type, maximal depth, and number of leaves. MS mutations are simulated according to number of loci and mutation rates, and leaf identifiers are generated. A lineage inference algorithm reconstructs the lineage tree. The inferred tree is compared with the generated tree, and the result is scored. Mutations and tree inference are performed ten times for each generated tree. For each depth, five random trees from each topology type are generated.

Since we do not usually have prior knowledge of the real organism's tree topology and branch lengths, we simulated two types of random trees that reflect topology space variability to a reasonable extent (see

Type I random trees have a random binary path, and are generated as follows. Generate LEAF_NUMBER of unique nonoverlapping binary strings, each of a random length of up to MAX_DEPTH bits. Each string represents a path in a binary tree leading to a leaf. In such a tree the least common ancestor of any pair of leaves is usually relatively close to the root. There is only a (1/2)^{n}

Trees of type II are generated by random node addition, as follows. The tree is initialized with two paths of random length up to MAX_DEPTH that start at the root and lead to two leaves. An iteration adds a leaf by randomly picking a path, and on it randomly picking an internal node as the source of the new path. The depth of the new leaf is determined randomly between (new internal node + 1) and MAX_DEPTH. Leaves are added until LEAF_NUMBER is obtained. The procedure generates a variety of tree topologies with branches at various depths. The procedure often produces nonbalanced trees. In this family of trees, increasing the maximum depth does not always result in an increase of noise over signal since internal branches are often deep.

Mutations in MS loci were simulated by a stepwise model. The root was assigned a vector of zeros of the size of the simulated ALLELE_NUMBER. An ALLELE_MUTATION_RATE, which is the probability of each MS locus mutating in a single cell division, was chosen. Then in each simulated cell division, each locus could mutate by increasing or decreasing the repeat number according to its assigned probability. Starting with the root identifier we generated the identifier of all tree nodes and leaves by simulating mutations as described above.
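The stepwise model can be illustrated along a single root-to-leaf path. This is a minimal sketch, not the simulation code used in the study: it assumes a single mutation rate shared by all loci, steps of exactly one repeat unit, and equal probability of expansion and contraction. The function names and the seeding scheme are hypothetical.

```python
import random


def mutate(identifier, mutation_rate, rng):
    """Simulate one cell division for one daughter cell.

    Each locus changes by +1 or -1 repeat unit with probability
    mutation_rate; equal odds for +1 and -1 are an assumption of this sketch.
    """
    return [
        value + rng.choice((-1, 1)) if rng.random() < mutation_rate else value
        for value in identifier
    ]


def simulate_path(allele_number, mutation_rate, divisions, seed=0):
    """Return the identifiers along one path, starting from an all-zero root."""
    rng = random.Random(seed)
    identifier = [0] * allele_number
    path = [identifier]
    for _ in range(divisions):
        identifier = mutate(identifier, mutation_rate, rng)
        path.append(identifier)
    return path


# Example: 800 loci, a rate of one mutation per 100 divisions, 20 divisions.
path = simulate_path(allele_number=800, mutation_rate=0.01, divisions=20)
mutated_loci = sum(1 for value in path[-1] if value != 0)
print(f"loci differing from the root after 20 divisions: {mutated_loci}")
```

A full simulation would apply `mutate` independently to both daughters at every internal node of the generated tree, so that sister lineages diverge from their shared ancestor.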

In simulated MMR-deficient cells we used mutation rates not higher than one per 100 cell divisions, according to

Mutation Rates in MMR-Deficient Human Simulations

Lineage tree inference was done with the NJ algorithm [

The generated tree and the reconstructed tree were compared using Penny and Hendy's topological distance algorithm [

Simulations were performed for samples with up to 100 cells, because of the computational resources required by the phylogenetic analysis algorithm.

The probability for no mutations in each daughter cell in each length category was calculated by the formula

The probability for no mutations in all loci was calculated by multiplying the probabilities for no mutations in all length categories (see data in

In order to estimate the total number of cells in a human neonatal cell lineage tree, we developed a model of human wild-type development^{14} leaves and approximately 10^{14} internal nodes, and hence has approximately 2 × 10^{14} nodes altogether (according to published data the adult human has about 10^{14} cells). This series of 46 divisions lasts for 23 d because each cell cycle is exactly 12 h long (according to published data the cell cycle in early human embryogenesis is 12–24 h). From this point on, there are an additional 486 cycles of 12 h in 243 d until birth. In each cycle, each cell divides with a probability of 0.5 and dies with a probability of 0.5. Therefore, the number of living cells remains relatively constant from day 23 to day 266 (birth) at about 10^{14}, and each day 2 × 10^{14} cells are produced. The total number of cells produced during this process, and therefore the total number of nodes in the complete cell lineage tree at birth, is approximately 2 × 10^{14} + 486 × 10^{14} = 4.9 × 10^{16}.
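The arithmetic of this model can be retraced as follows. The sketch uses only the figures given above (46 divisions, 12-h cycles, 486 further cycles, and about 10^{14} living cells) and adopts the text's rounding of the leaf count to 10^{14}; the variable names are illustrative.

```python
CYCLE_HOURS = 12
EXPANSION_DIVISIONS = 46     # initial series of divisions
STEADY_STATE_CYCLES = 486    # cycles from the end of expansion until birth
LIVING_CELLS = 1e14          # rounded number of living cells, as in the text

# Duration of each phase in days.
expansion_days = EXPANSION_DIVISIONS * CYCLE_HOURS / 24
steady_state_days = STEADY_STATE_CYCLES * CYCLE_HOURS / 24

# Nodes of the expansion tree: about 1e14 leaves plus about 1e14 internal nodes.
expansion_nodes = 2 * LIVING_CELLS
# In every later cycle half of the cells divide, yielding about 1e14 new cells.
steady_state_nodes = STEADY_STATE_CYCLES * LIVING_CELLS

total_nodes = expansion_nodes + steady_state_nodes
print(f"days until birth: {expansion_days + steady_state_days:.0f}")
print(f"total nodes at birth: {total_nodes:.2e}")
```

The unrounded sum is 4.88 × 10^{16}, which the text reports as 4.9 × 10^{16}.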

The probability for at least one new MS mutation in every cell in the human neonatal cell lineage tree was calculated by the formula (1 − probability for no MS mutations in a single daughter cell)^{total number of cells}. This calculation gives

DNA from

Individuals of

Primers for most MS loci were designed using Primer3 (

CCTs were created using LS174T human colon adenocarcinoma cells, which were obtained from the European Collection of Cell Cultures (Salisbury, United Kingdom) and were grown in medium containing EMEM (Eagle's minimum essential medium, in Earle's balanced salt solution, GIBCO, San Diego, California, United States), 2 mM glutamine, 1% nonessential amino acids, 10% fetal bovine serum, and 1% penicillin-streptomycin. We estimated that LS174T cells divide every 1.5 d according to the frequency of routine plate passages. We created CCTs as follows. Initially, a single cell was isolated from a cell stock and was defined as the tree root. This cell was allowed to proliferate for a desired number of cell divisions (passages were performed when required). Then, two cells were isolated from the root progeny, and were defined as its daughter cells in the tree. This procedure was continued for each daughter cell, creating the granddaughter cells, etc., until the entire tree was grown. The tree root and leaf cells were cloned in plates, and lineage analysis was performed on DNA obtained from these clones. Lineage analysis performed on clones is expected to yield the same results as analysis on the founder cells of the clones (see Text

^{+/−}), with each parent from a different inbred line. Animals that are heterozygous for a mutant MMR gene have normal or slightly elevated MS mutation rates. A cross between two such animals produces (with a frequency of 1:4) an animal that is homozygous for the mutant gene, with greatly elevated MS mutation rates. In order to deduce the identifier of the root (zygote) of such an organism, which is used in an experiment, the identifiers of its parents should be obtained. Because the parents come from inbred lines, they are homozygous at each MS locus and therefore deducing the identifier of the zygote is straightforward. The deduced identifier is very close (and when analyzing a few hundred MS loci may be identical) to the actual identifier because somatic MS mutations in the parents are very rare. It is important to note that this procedure deduces the identifier of the zygote of the organism, which may or may not be identical to the root of the reconstructed tree.

Deduction from the most common allele (deduction of root identifier) is performed as follows. In this procedure, the most common allele is determined for each MS locus in the population of sampled cells, and this value is assigned to the root identifier. Thus, the root identifier consists of the most common values in the cell population. In balanced trees that are not too deep, the deduced identifier will be very close (and may be identical) to the actual root identifier. However, in unbalanced (nonsymmetric) trees, this procedure will result in the deduced identifier being “tilted” towards the larger branch, and in deep trees the deduced identifier may differ from the actual identifier in MS loci that accumulate mutations in a nonsymmetric fashion. For example, in an MS locus that is biased towards MS contraction, the deduced identifier value may be smaller than the actual value. It is important to note that this procedure deduces the identifier of the root of the tree, which is not necessarily the zygote of the organism.
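Deduction from the most common allele amounts to a per-locus majority vote. The following is a minimal sketch, assuming that each cell identifier is a list of integer allele values of equal length; when two alleles are equally common, the one encountered first among the sampled cells is chosen, which is an arbitrary choice of this sketch. The function name is hypothetical.

```python
from collections import Counter


def deduce_root_identifier(cell_identifiers):
    """Return, for each MS locus, the most common allele among sampled cells."""
    if not cell_identifiers:
        raise ValueError("at least one cell identifier is required")
    # zip(*...) regroups the data by locus instead of by cell.
    return [
        Counter(locus_values).most_common(1)[0][0]
        for locus_values in zip(*cell_identifiers)
    ]


# Toy example with four cells and three loci.
sampled_cells = [
    [0, 1, -1],
    [0, 0, -1],
    [1, 0, -1],
    [0, 0, 2],
]
deduced_root = deduce_root_identifier(sampled_cells)
print(deduced_root)
```

The third locus of the toy example shows the bias discussed above: if most sampled cells descend from a branch in which the locus contracted, the deduced value follows that branch rather than the true root.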

A complete description of the protocol for reconstructing lineage trees from DNA samples, including the capillary histogram signal analysis algorithm and tree reconstruction and scoring algorithms.

(A) Reconstructing cell lineage trees from DNA extracted from cell clones.

(B) Selection criteria for MSs.

We thank A. Levy for pointing us to MSs and MMR-deficiency, which are the basis of our current implementation; Z. Livneh for advice and support; K. Katzav for the design and preparation of the figures; J. Leonard for mutant

Abbreviations: CCT, cultured cell tree; MMR, mismatch repair; MS, microsatellite; NJ, neighbor joining