What Is the Tree of Life?

  • W. Ford Doolittle
  • Tyler D. P. Brunet


A universal Tree of Life (TOL) has long been a goal of molecular phylogeneticists, but reticulation at the level of genes and possibly at the levels of cells and species renders any simple interpretation of such a TOL, especially as applied to prokaryotes, problematic.

One of the several ways in which microbiology puts the neo-Darwinian synthesis in jeopardy is by threatening to “uproot the Tree of Life (TOL)” [1]. Lateral gene transfer (LGT) is much more frequent than most biologists would have imagined up until about 20 years ago, so phylogenetic trees based on sequences of different prokaryotic genes are often different. How to tease out from such conflicting data something that might correspond to a single, universal Tree of Life becomes problematic. Moreover, since many important evolutionary transitions involve lineage fusions at one level or another, the aptness of a tree (a pattern of successive bifurcations) as a summary of life’s history is uncertain [2–4].

Darwin’s Tree of Life Hypothesis

Before 1859, hierarchical (tree-like) patterns of organismal relationships were commonly drawn and taken to mirror some natural order, most likely divine [5]. What Darwin gave us when he published The Origin of Species was a nontheistic reason for the discoverability and utility of such classifications. “Community of descent is the hidden bond which naturalists have been unconsciously seeking, and not some unknown plan of creation,” he wrote [6]. There was to be an actual TOL whose “ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups” [6]. Bapteste and one of us have called this claim by Darwin his “TOL Hypothesis”—that is, that the tree-like pattern of relationships recognized by systematists reflects an underlying tree-like evolutionary process [7].

After Darwin, classifications were most often assumed without proof to be evolutionary: phenetics was taken to be identical to phylogenetics. As de Queiroz noted, “…the relationships expressed in existing taxonomies were merely reinterpreted as the result of evolution, and evolutionary concepts were developed to justify existing methods” [8]. Not much could be done to improve on this as long as the methods of comparative biology (mostly anatomical) remained the basis for classification, as they had been for centuries.

In the mid-1960s, molecular sequencing (first proteins, then ribosomal RNAs, then genes, and now genomes) appeared to offer a way out of such circularity, a possible independent proof of the TOL hypothesis, some thought. Emile Zuckerkandl and Linus Pauling wrote:

“…molecular phylogenetic trees should in principle be definable in terms of molecular information alone. It will be determined to what extent the phylogenetic tree, as derived from molecular data in complete independence of organismal biology, coincides with the tree constructed on the basis of organismal biology…”

And thus,

“…If the two trees are mostly in agreement with respect to the topology of branching, the best available single proof of the reality of macro-evolution would be furnished” [9].

This made a lot of sense at the time. The way in which genes replicate and mutate is, barring recombination, tree-like. One consequence is that ancestral nodes in a tree constructed from gene sequences are interpreted as corresponding to real ancestors (actual genes). Nothing like that would necessarily be the case for trees based on “organismal biology,” that is, on phenotypic similarities and differences. If evolution were not the cause of phenotypic similarities and differences, then phenetic classifications would be like those used to order books in a library: all the books on Canadian cuisine might (arguably) belong on the same shelf, but no one would claim that they descended from one first book on that subject. Nodes in such a phenetic classification need not represent ancestors [7].

So, agreement between trees would indeed have been some sort of “proof” of Darwin’s TOL hypothesis, with two caveats. The first is that no widely accepted prokaryotic tree “constructed on the basis of organismal biology” was actually available for comparison: microbiologists had given up the attempt to make one in the mid-1950s, and no one since has been foolish enough to reboot the effort. The second is that agreement of molecular and organismal trees is not really a disproof of the theistic explanation Darwin wanted to supplant. Any sensible creator would surely use similar genes to make similar organisms.

The molecular phylogenetic tree-making enterprise has grown more spectacularly than Zuckerkandl and Pauling could have dared hope, and, in particular, the use of small-subunit ribosomal RNA (ssu rRNA) as a “molecular chronometer” has revolutionized systematics, most importantly making it possible to put all prokaryotes and all eukaryotes into a single universal TOL. The story of how Carl Woese courageously pioneered the use of this molecule has often been told [10,11]. And the discovery by Woese and George Fox that prokaryotes were deeply divided into Bacteria and Archaea, so that life appeared to comprise three domains, not two (Prokaryotes and Eukaryotes), has by now made it into all the textbooks.

That LGT might be a spoiler for the TOL was only a suspicion at the beginning of the era of molecular phylogenetics. Japanese clinical microbiologists found in the early 1960s that antibiotic resistance could spread between bacterial species via “resistance transfer factors,” now called plasmids [12]. But at that time, such cases were thought to be special, disease-related, and possibly human-caused. Sporadic cases of incongruence between protein-coding and rRNA gene trees were subsequently discovered [13,14], but it was probably the evidence from complete bacterial genome sequences that gave rise to the first serious questioning of the TOL. The claim that a quarter of the genes of the thermophilic bacterium Thermotoga maritima had been transferred from Archaea [15] stunned many of us when it appeared in 1999, and prompted one of us to write that if…

“…different genes give different trees, and there is no fair way to suppress this disagreement, then a species (or phylum) can ‘belong’ to many genera (or kingdoms) at the same time: There really can be no universal phylogenetic tree of organisms based on such a reduction to genes” [2].

Nearly Universal Trees and the Statistical Tree of Life

A decade and a half of prokaryotic tree-making has not produced general agreement on just how much LGT there is, other than “much more than we expected,” although there have been many serious attempts [16]. Indeed, there is little agreement on how to quantify LGT as a process and its impact on genomes over all of evolutionary time. The most reasonable way, we think, is to ask of any contemporary genome how many of its genes have been strictly vertically inherited along a lineage of replicating genomes, tracing all the way back to that of the last universal common ancestor (LUCA). There are only about 100 genes that are found in all or nearly all prokaryotic genomes and thus have a strong claim to having been present continuously in all lineages [17,18]. These are largely involved with the ribosome and its functions (or transcription), and are considered to be relatively immune to LGT, according to the “complexity hypothesis” [19].

The complexity hypothesis is the notion that genes whose products interact with the products of many others will have so many coevolved molecular dependencies that function in a distantly related cellular environment is impaired. “Connectivity” (number of protein–protein interactions), not functional category, is in fact the key determinant [20], and there are highly connected systems other than translation in many cells. So it may be equally relevant that translation and transcription are conservative and essential cellular activities, while many frequently transferred genes function in “optional” metabolic systems, representing alternative and therefore evolutionarily more changeable ways cells can make livings [18].

Puigbò et al. [21,22] have for some time been conducting large-scale comparisons of gene family trees across bacteria and archaea: the Forest of Life (FOL) project, which involves thousands of such families. Even among their 102 Nearly Universal Trees (NUTs), made from genes found in all or nearly all bacteria and archaea, there are very few with identical histories. However, NUT topologies are “far more congruent than expected by chance,” so “they appear to reflect a significant central trend, an attractor in the tree space that could be equated with the STOL [the Statistical Tree of Life]” [22].
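To make the idea of comparing gene tree topologies concrete, here is a minimal sketch of one common congruence measure, the Robinson-Foulds distance, which counts the branch-defined splits present in one tree but not the other. This is a stand-in chosen for illustration and is not claimed to be the measure used in the FOL analyses [21,22]. The nested-tuple tree encoding, the function names, and the five toy taxa are all assumptions of the sketch, and the trees being compared are assumed to share the same set of leaves.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf names under a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the leaf set implied by internal branches."""
    all_leaves = leaves(tree)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        rest = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) >= 2 and len(rest) >= 2:
            # A frozenset of the two sides makes the split orientation-free,
            # so the position of the root does not matter.
            found.add(frozenset([clade, rest]))
        for child in node:
            walk(child)

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in one tree but not the other."""
    return len(splits(tree_a) ^ splits(tree_b))


def mean_pairwise_rf(trees):
    """Average Robinson-Foulds distance over all pairs (needs at least two trees)."""
    pairs = list(combinations(trees, 2))
    return sum(rf_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene trees for five taxa (A-E); not data from the FOL project.
gene_tree_1 = ((("A", "B"), "C"), ("D", "E"))
gene_tree_2 = (("D", "E"), ("C", ("B", "A")))  # same topology, drawn differently
gene_tree_3 = ((("A", "C"), "B"), ("D", "E"))  # A and C grouped instead of A and B

print(rf_distance(gene_tree_1, gene_tree_2))
print(rf_distance(gene_tree_1, gene_tree_3))
print(mean_pairwise_rf([gene_tree_1, gene_tree_2, gene_tree_3]))
```

A collection of gene trees whose mean pairwise distance is well below that of randomly shuffled trees on the same taxa would be "more congruent than expected by chance" in the sense quoted above, although a real analysis would also have to handle missing taxa, unresolved nodes, and branch support.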


Most gene families are not among the NUTs, however, and even within a designated species, a sizeable fraction of any genome participates in rapid gene loss and gain (by LGT). The pangenome concept, which aims to describe the gene repertoire of a bacterial species by comparing gene contents of several to many of its strains, supports this notion. A typical bacterial genome comprises a “core” of genes present in all or nearly all of the strains of its species and often many more “dispensable” or “accessory” genes [23] present in only some strains (as few as one). With Escherichia coli (several thousand genomes sequenced), the average strain carries about 5,000 genes. Genes shared between all or almost all strains number little more than 3,000, but the number of gene families with a representative in at least one E. coli genome is approaching 100,000 [24,25]. Prochlorococcus, the world’s most plentiful organism and most important oxygen-provider, shows an average genome content of only about 2,000 genes but a pangenome calculated at 85,000, so far [26,27]. Archaea, too, boast pangenomes [28], and we have come to think in terms of “distributed” prokaryotic genomes and “shared genomic resource” models, in which genes are lost when superfluous and regained when needed, by LGT.
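The core, accessory, and pangenome bookkeeping described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: gene families are taken as already assigned (a hard problem in practice), each strain is reduced to a set of family names, and the strain names, gene names, and the `core_fraction` parameter are hypothetical choices made for the example.

```python
from collections import Counter


def pangenome_summary(strains, core_fraction=1.0):
    """Summarize core, accessory, and pangenome for a dict of strain -> gene families.

    core_fraction is the minimum share of strains that must carry a gene family
    for it to count as core (1.0 means present in every strain).
    """
    n_strains = len(strains)
    # Count in how many strains each gene family occurs.
    occurrences = Counter(gene for genes in strains.values() for gene in set(genes))
    pangenome = set(occurrences)
    core = {gene for gene, count in occurrences.items()
            if count / n_strains >= core_fraction}
    accessory = pangenome - core
    mean_genome_size = sum(len(set(genes)) for genes in strains.values()) / n_strains
    return {
        "pangenome": pangenome,
        "core": core,
        "accessory": accessory,
        "mean_genome_size": mean_genome_size,
    }


# Toy data: three hypothetical strains, each reduced to a handful of gene families.
toy_strains = {
    "strain_1": {"rpoB", "rpsL", "lacZ", "tetA"},
    "strain_2": {"rpoB", "rpsL", "lacZ"},
    "strain_3": {"rpoB", "rpsL", "nifH"},
}

summary = pangenome_summary(toy_strains)
print(sorted(summary["core"]))
print(sorted(summary["accessory"]))
print(len(summary["pangenome"]), summary["mean_genome_size"])
```

Relaxing `core_fraction` below 1.0 corresponds to the "all or nearly all" wording in the text. With real data the pangenome keeps growing as strains are added, which is how an organism with a few thousand genes per genome can have a pangenome in the tens of thousands.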

According to a “strong version” [29] of the popular Black Queen hypothesis [30], different strains (or species) will lose different genes for “leaky” synthetic functions—those whose products leak out of cells and thus allow “cross-feeding.” They become mutually dependent. Several recent metagenomic surveys have uncovered heterogeneous populations of previously unknown small-genomed bacteria or archaea that may be metabolically dependent on each other in this way [31,32]. In line with this, Wolf and Koonin [33] suggest a general model in which gene loss is the dominant mode of prokaryotic evolution: what saves genomes from extinction is periodic genomic expansion, in part via multiple simultaneous (or nearly simultaneous) LGT events. Whether such a punctuational pattern holds or gene gain and loss are more regularly ongoing processes, the result is that often less than half of the genes in any strain’s genome are likely to have been continuously present along a lineage of genomes stretching back to the first member (common ancestor) of that species.

As we go deeper than species into the TOL, the fraction of vertically inherited genes can only get smaller. In an influential 2009 analysis, Lapierre and Gogarten extended and deepened the pangenome analysis to 573 bacterial genomes spanning the then available diversity [34]. The core, about 250 genes, which would have included the NUTs, comprises 8% of the typical bacterial genome and would shrink further if archaea were included. Among dispensable genes, they called 64% “character genes”—“essential for colonization and survival in particular environmental niches” and found in related species—while the 28% more rapidly turning-over genes were strain-specific “accessory genes.”

Dreams of LUCA

What, then, did LUCA’s genome contain? At one extreme, one might imagine that all contemporary gene families have their coalescents (roots) or direct ancestors thereof, in the genome of LUCA. On this model, either (i) that genome was large, and descendant lineages have differentially lost gene families, or (ii) many important and widely distributed gene families have evolved since LUCA and are not detectably homologous to other gene families with which they share common origins. Mixtures of (i) and (ii) might also be entertained.

At the other extreme, one could hold that LUCA as a cell had a “normal” or even smallish prokaryotic genome (1,500–5,000 genes) with only 100 or so genes (represented in NUTs) that might potentially have been passed down directly to all or almost all contemporary genomes. This would be a “uniformitarian” model, explaining the past in terms of processes familiar from the present. In all descendant lineages, the majority of LUCA’s genes will have been replaced via LGT from closely related and distant lineages by other genes, homologous or not, carrying out the same or quite different cellular functions (structural and metabolic). A key difference from the first model is that some contemporary gene families will have had coalescents that predate LUCA [35–37]. That is, the donors for many LGT events would have been cellular lineages that have since gone extinct, because LUCA is by this and the previous conceptualization the last universal common cellular (not necessarily genomic) ancestor.

More explicitly, according to the second model, the fraction of any particular contemporary prokaryote’s genes that have been strictly vertically inherited along a lineage of replicating genomes tracing all the way back to LUCA would comprise a universally conserved core (roughly the NUTs) plus “character genes” defining its metabolic lifestyle—methanogenesis genes if both LUCA and the contemporary prokaryote were methanogens, for instance. If the contemporary prokaryote we picked was instead a cyanobacterium while LUCA was methanogenic, non-core shared genes would be substantially fewer. Given that even within the NUTs there is some incongruence of trees, the fraction of a randomly chosen or “average” contemporary prokaryote’s genes that have been inherited only vertically since LUCA could easily be only a small percentage.

A third model, favored by some, is to redefine LUCA as a community, not a single cell or species [38,39]. Although in our view conflating “having common ancestry” with “having a common ancestor,” this model points out the basic conflict implicit in reconstructing the TOL from the sequences of genes that disagree. We are using genes as a proxy. What we have really been after is a tree of entities at some level of the biological hierarchy higher than genes or genomes—cells, organisms, populations, or (most likely) species. Perhaps our quest should be recast as simply an attempt to retrace life’s history at such a higher level, and not as an effort to achieve the proof of the “TOL Hypothesis” that Zuckerkandl and Pauling thought gene sequences could provide. Perhaps we have already unknowingly done this: Tassy recently mused that “we enter a world of non systematic phylogenetics, a surprising oxymoron” [40].

Trees of Cells and Species

If one assumes, almost as an entailment of Schleiden and Schwann’s cell theory, that all cells derive from previous cells, then it could be the case that there is a single tree-like pattern relating all cells, although one would need to look up to the level of “species” in lineages where sex became a reproductive necessity. Although incongruence of gene trees due to LGT means that no single one can be taken as topologically equivalent to such a tree of cells and species, the STOL might still be the best (though not a guaranteed) proxy for it. Patrick Forterre endorses this common view when he writes:

“The universal tree should depict evolutionary relationships between domains defined according to the translation apparatus reflecting the history of cells (and their envelope) and not according to the global genome composition that is influenced by LGT, virus integration and endosymbiosis, the history of which is incredibly complex” [41].

But we might also ask what warrant there is to believe in such a tree-like history of cells, coupled or not to molecular phylogenies. Reticulating events that would compromise trees at the cellular level are occasional or regular whole-cell fusions, mixing entire genomes and, more contentiously, conjugation and the compromising of cellular integrity represented by phage attachment or transformation. These latter introduce only a little genetic material each time, but when summed over many events, can replace every gene: the argument would be over whether the pattern of material as well as genetic relationships might be seen as non-tree-like.
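The claim that many small transfers can add up to wholesale replacement can be illustrated with a deliberately crude calculation. Assume, purely for illustration, a genome of fixed size G in which each transfer event replaces one gene chosen uniformly at random and independently of earlier events. The expected fraction of genes never replaced after n events is then (1 - 1/G)^n, which is close to exp(-n/G). Neither the uniform-replacement assumption nor the function names below come from the text; real transfer is strongly biased by gene function and connectivity.

```python
import random


def expected_untouched_fraction(genome_size, n_events):
    """Expected fraction of genes never replaced after n_events uniform single-gene replacements."""
    return (1 - 1 / genome_size) ** n_events


def simulate_untouched_fraction(genome_size, n_events, seed=0):
    """Simulate the same toy process once and report the fraction of genes never replaced."""
    rng = random.Random(seed)
    replaced = set()
    for _ in range(n_events):
        # Each event overwrites one gene position chosen uniformly at random.
        replaced.add(rng.randrange(genome_size))
    return 1 - len(replaced) / genome_size


# Hypothetical numbers: a 3,000-gene genome and increasing numbers of transfer events.
for events in (1_000, 3_000, 10_000, 30_000):
    expected = expected_untouched_fraction(3_000, events)
    simulated = simulate_untouched_fraction(3_000, events, seed=1)
    print(events, round(expected, 4), round(simulated, 4))
```

Under these assumptions the expected number of untouched genes falls below one once n exceeds roughly G ln G, so a lineage of cells can remain materially continuous while retaining essentially none of its ancestral genes. That is the situation in which trees of cells and trees of genes come apart.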

Interspecies whole-cell fusion resulting in heterodiploid cells does occur in haloarchaea [42] but is probably rare in other prokaryotes. Rare does not mean evolutionarily unimportant, however. Whatever theory we might support as to the origin of eukaryotes, we must acknowledge that something like a fusion of (presumably unicellular) archaeal and bacterial cells was instrumental in this most “basic” divergence. Although a claim that many major archaeal phyla were independently founded by the simultaneous importation of many bacterial genes [43] is now in doubt [44], such events as the acquisition of suites of genes for respiration by the anaerobic methanogenic ancestors of haloarchaea are hard to reconcile as one-gene-at-a-time LGT events. And there is no doubt that a fusion of cyanobacterial and eukaryotic cellular lineages ultimately gave rise to higher plants and, through secondary and tertiary endosymbiotic events, to several algal lineages.

So, even a tree of cellular lineages is not an unproblematic concept. Students of animals and plants have long accepted that incomplete lineage sorting, introgression, and full-species hybridization pose difficulties for the sorts of trees that Darwin might have had us draw. But it is microbes, with their promiscuous willingness to exchange genes between widely separated branches of any “tree,” that have most seriously jeopardized the neo-Darwinian synthesis, in the oversimplified form that we have often presented it to the public [45]. More sophisticated understandings do remain possible [3] and should be debated in a more conceptually and science-historically self-aware context [4].


We thank Laura Eme and Olga Zhaxybayeva for comments on this manuscript.


  1. Lawton G (2009) Uprooting Darwin’s tree. New Scientist 201: 34–39.
  2. Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284: 2124–2128. pmid:10381871
  3. Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Syst Biol 62: 479–489. pmid:23291311
  4. Morrison DA (2014) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Syst Biol 63: 628–638. pmid:24671618
  5. Hellström NP (2012) Darwin and the Tree of Life: the roots of the evolutionary tree. Arch Nat Hist 39: 234–252.
  6. Darwin C (1859) On the Origin of Species (Murray, London).
  7. Doolittle WF, Bapteste E (2007) Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci USA 104: 2043–2049. pmid:17261804
  8. de Queiroz K (1988) Systematics and the Darwinian revolution. Phil Sci 55: 238–259.
  9. Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins, in Evolving Genes and Proteins, eds Bryson V, Vogel HJ (Academic, New York) pp. 97–106.
  10. Woese CR (2004) The archaeal concept and the world it lives in: A retrospective. Photosynth Res 80: 361–372.
  11. Pace NR, Sapp J, Goldenfeld N (2012) Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life. Proc Natl Acad Sci USA 109: 1011–1018. pmid:22308526
  12. Watanabe T (1963) Infectious heredity of multiple drug resistance in bacteria. Bact Rev 27: 87–115. pmid:13999115
  13. Hilario E, Gogarten JP (1993) Horizontal transfer of ATPase genes–the tree of life becomes a net of life. Biosystems 31: 111–119. pmid:8155843
  14. Brown JR, Doolittle WF (1997) Archaea and the prokaryote-to-eukaryote transition. Microbiol Mol Biol Rev 61: 456–502. pmid:9409149
  15. Nelson KE, Clayton RA, Gill SR, Gwinn MK, Dodson RJ, Haft DH et al. (1999) Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399: 323–329. pmid:10360571
  16. Dagan T, Martin W (2007) Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci USA 104: 870–875. pmid:17213324
  17. Koonin EV (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Revs Microbiol 1: 127–136.
  18. Charlebois RL, Doolittle WF (2004) Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res 14: 2469–2477. pmid:15574825
  19. Jain R, Rivera MC, Lake JA (1999) Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci USA 96: 3801–3806. pmid:10097118
  20. Cohen O, Gophna U, Pupko T (2011) The complexity hypothesis revisited: connectivity rather than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol 28: 1481–1489. pmid:21149642
  21. Puigbò P, Wolf YI, Koonin EV (2009) Search for a Tree of Life in the thicket of the phylogenetic forest. J Biol 8: 59. pmid:19594957
  22. Puigbò P, Wolf YI, Koonin EV (2013) Seeing the Tree of Life behind the phylogenetic forest. BMC Biol 11: 46. pmid:23587361
  23. Vernikos G, Medini D, Riley DR, Tettelin H (2015) Ten years of pan-genome research. Curr Op Microbiol 23: 148–154.
  24. Land M, Hauser L, Jun SR, Nookaew I, Leuze MR, Ahn TH et al. (2015) Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics 15: 141–161. pmid:25722247
  25. Huang K, Brady A, Mahurkar A, White O, Gevers D, Huttenhower C et al. (2014) MetaRef: a pan-genomic database for comparative and community microbial genomics. Nucleic Acids Res 42: D617–D624. pmid:24203705
  26. Baumdicker F, Hess WR, Pfaffelhuber P (2012) The infinitely many genes model for the distributed genome of bacteria. Genome Biol Evol 4: 443–456.
  27. Biller SJ, Berube PM, Lindell D, Chisholm SW (2015) Prochlorococcus: the structure and function of collective diversity. Nature Rev Microbiol 13: 13–27.
  28. Deschamps P, Zivanovic Y, Moreira D, Rodriguez-Valera F, Lopez-Garcia P (2014) Pangenome evidence for extensive interdomain horizontal transfer affecting lineage core and shell genes in uncultured planktonic Thaumarchaeota and Euryarchaeota. Genome Biol Evol 6: 1549–1563. pmid:24923324
  29. Fullmer MS, Soucy SM, Gogarten JP (2015) The pan-genome as a shared genomic resource: mutual cheating, cooperation and the black queen hypothesis. Front Microbiol 6: 728. pmid:26284032
  30. Morris JJ, Lenski R, Zinser ER (2012) The Black Queen hypothesis: evolution of dependencies through adaptive gene loss. MBio 3: e00036–12. pmid:22448042
  31. Castelle CJ, Wrighton KC, Thomas BC, Hug LA, Brown CT, Wilkins MJ et al. (2015) Genomic expansion of domain Archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr Biol 25: 690–701. pmid:25702576
  32. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A et al. (2015) Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523: 208–211. pmid:26083755
  33. Wolf YI, Koonin EV (2013) Genome reduction as the dominant mode of evolution. Bioessays 35: 829–837. pmid:23801028
  34. Lapierre P, Gogarten JP (2009) Estimating the size of the bacterial pan-genome. Trends Genet 25: 107–110. pmid:19168257
  35. Zhaxybayeva O, Gogarten JP (2004) Cladogenesis, coalescence and the evolution of the three domains of life. Trends Genet 20: 182–187. pmid:15041172
  36. Doolittle WF (2000) The nature of the universal ancestor and the evolution of the proteome. Curr Opin Struct Biol 10: 355–358. pmid:10851188
  37. Fournier GP, Andam CP, Gogarten JP (2015) Ancient horizontal gene transfer and the last common ancestors. BMC Evol Biol 15: 70.
  38. Woese C (1998) The universal ancestor. Proc Natl Acad Sci USA 95: 6854–6859. pmid:9618502
  39. Arnoldt H, Strogatz SH, Timme M (2015) Toward the Darwinian transition: switching between distributed and speciated states in a simple model of early life. Physical Rev E 92: 052909.
  40. Tassy P (2011) Trees before and after Darwin. J Zool Syst Evol Res 49: 89–101.
  41. Forterre P (2015) The universal tree of life: an update. Front Microbiol 6: 717. pmid:26257711
  42. Naor A, Lapierre P, Mevarech M, Papke RT, Gophna U (2012) Low species barriers in halophilic archaea and the formation of recombinant hybrids. Curr Biol 22: 1444–1448. pmid:22748314
  43. Nelson-Sathi S, Sousa FL, Roettger M, Lozada-Chavez N, Thiergart T, Janssen A et al. (2015) Origins of major archaeal clades correspond to gene acquisitions from bacteria. Nature 517: 77–80. pmid:25317564
  44. Groussin M, Boussau B, Szöllősi G, Eme L, Gouy M, Brochier-Armanet C et al. (2015) Gene acquisitions from bacteria at the origins of major archaeal clades are vastly overestimated. Mol Biol Evol
  45. Dennett D, Coyne J, Dawkins R, Myers P (2009) Darwin was right. New Scientist 201: 22–24.