A Universal Trend among Proteomes Indicates an Oily Last Common Ancestor

Despite progress in ancestral protein sequence reconstruction, much remains to be unraveled about the nature of the putative last common ancestral proteome that served as the prototype of all extant lifeforms. Here, we present data that indicate a steady decline (oil escape) in proteome hydrophobicity over species evolvedness (node number) evident in 272 diverse proteomes, which indicates a highly hydrophobic (oily) last common ancestor (LCA). This trend, obtained from simple considerations (free from sequence reconstruction methods), was corroborated by regression studies within homologous and orthologous protein clusters as well as phylogenetic estimates of the ancestral oil content. While indicating an inherent irreversibility in molecular evolution, oil escape also serves as a rare and universal reaction-coordinate for evolution (reinforcing Darwin's principle of Common Descent), and may prove important in matters such as (i) explaining the emergence of intrinsically disordered proteins, (ii) developing composition- and speciation-based “global” molecular clocks, and (iii) improving the statistical methods for ancestral sequence reconstruction.

S1 Discussions regarding molecular evolution and proteome composition change.
The following subsections discuss three points: (i) even though every organism today is expected to have existed in some form since the emergence of life (Darwin's common descent axiom), not all proteomes have evolved/mutated to the same extent since the last common ancestor (LCA); speciation events account for a large number of sequence substitutions; (ii) oil-content-changing substitutions are expected to occur more often during speciation events; (iii) in the absence of an entropic drive to change oil content, one would need an unacceptably large number of random mutations to produce the range of oil contents observed in our dataset.

S1.1 A sequence's oil content is expected to change at a glacial pace compared to the molecular evolution that drives it; a continual random walk in sequence space for billions of years cannot explain the range of protein hydrophobicity observed

As an alternative (null) hypothesis to the oil escape described in the paper, let us assume that the (8%) range in oil content observed in the paper (Fig. 2) is due to a random drift in sequence space since the LCA. We can show, by modeling the evolution of the proteome as a random walk, that this null hypothesis is very unlikely to hold. Take a proteome of size N, with H hydrophobic residues (0 ≤ H ≤ N). Also, assume that any mutation independently changes the hydrophobicity (H) by ±1 amino acid. Each successive accrued mutation may then be considered as a step of a random walk on an integer number line (shown below), where the current position is the hydrophobicity of the proteome.
Assuming that M point mutations have been accrued, the change in the number of hydrophobic residues in the proteome, ∆H, will be approximately of the order of √M (a standard result for random walks on an integer number line), i.e.,

$$\Delta H \approx \sqrt{M}$$

So, the change in the fraction (or percent) of hydrophobic residues in the proteome is related to the number of accumulated mutations per amino acid position:

$$\frac{\Delta H_{\%}}{100} \quad \text{or} \quad \Delta H_{\mathrm{fraction}} = \frac{\Delta H}{N} \approx \frac{\sqrt{M}}{N}$$

Now, assuming a proteome size of N = 2,000,000 (which is the size scale of a proteome belonging to a large bacterium such as Mycobacterium smegmatis), the equation above implies that we would require about $(N \times 0.08)^2 = 25.6 \times 10^{9}$ random mutations per proteome (or 12,800 mutations per amino acid position) to shift the hydrophobicity of the proteome by the observed 8%. Given the accepted average mutational rate of $10^{-9}$ mutations per amino acid position per year, or 1 mutation per position per one billion years (a commonly reported magnitude 1 ), we would expect a random walk to produce an 8% drift in hydrophobicity (for N = 2,000,000) only after roughly $12.8 \times 10^{12}$ or 12.8 trillion years of evolution, which is substantially larger than the expected age of the universe, much less the expected age of the earth.
Although the example above uses a specific value for proteome size (N), plugging other values of N into the above equation provides similar results, indicating the implausibility of attaining the observed range of hydrophobicity by simple random mutational diffusion. For example, a high average mutation rate of $6 \times 10^{-9}$ brings the number of years required to obtain an 8% shift in hydrophobicity to a still high 2.1 trillion years. For the present N, one would need a substitution rate of $4.26 \times 10^{-6}$ to cause an 8% shift in hydrophobicity, which is far from the observed regime of $10^{-9}$ amino acid substitutions per site per year. The observed change in % hydrophobicity (∆H%) therefore cannot be attributed to random mutational drift in the sequence, which raises the plausibility of drift by oil escape. The following subsection further strengthens this claim.
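The arithmetic of this random-walk estimate can be checked with a short script. This is a minimal sketch that assumes only the relation ∆H ≈ √M and the values quoted in the text (N = 2,000,000, an 8% shift, and rates of 10^-9 and 6 × 10^-9 per site per year); the function names are illustrative and do not come from the original analysis.

```python
import math

def mutations_for_shift(proteome_size, fraction_shift):
    """Total mutations M for which sqrt(M) / N equals the requested shift."""
    return (proteome_size * fraction_shift) ** 2

def years_for_shift(proteome_size, fraction_shift, rate_per_site_per_year):
    """Years needed to accumulate M mutations at a given per-site rate."""
    per_site = mutations_for_shift(proteome_size, fraction_shift) / proteome_size
    return per_site / rate_per_site_per_year

N = 2_000_000   # proteome size used in the text
SHIFT = 0.08    # observed 8% range in oil content

total_mutations = mutations_for_shift(N, SHIFT)
print(f"mutations per proteome: {total_mutations:.3g}")
print(f"mutations per position: {total_mutations / N:.0f}")
for rate in (1e-9, 6e-9):
    print(f"rate {rate:g}: {years_for_shift(N, SHIFT, rate):.3g} years")
# Back-check of the walk itself: sqrt(M) / N recovers the shift.
print(f"implied shift: {math.sqrt(total_mutations) / N:.2f}")
```

Changing `N` in the script shows how the required time scales with proteome size, since the number of mutations per position grows linearly with N for a fixed fractional shift.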
Further empirical problems with randomly changing a proteome's H. Other considerations further diminish the effect of random point mutations on drift in hydrophobic content:

(A) The random walk in H (described above) requires that mutations are always "non-synonymous" (i.e., they must change H). This move set does not consider "synonymous" mutations, where one hydrophobic residue is replaced with another (and likewise for non-hydrophobic residues). Including them drastically increases the number of mutations required to move 1 unit along the number line above (since many "walks" will be ineffective in translating H). For example, if we consider only F, I, L and V as hydrophobic amino acids (chosen since %FILV was our metric for hydrophobicity in the manuscript), then an attempted mutation will be a hydrophobicity-modifying mutation only 32% of the time.

(B) As evident in the sequence substitution matrices (Table S1), mutations that maintain proteome (oil or charge) composition are much more likely to occur (and be kept for posterity) than mutations that modify oil/charge content.

Both (A) and (B) indicate that the actual extent to which randomly accumulated mutations or sequence "drift" affect oil content will be even lower than that estimated by a simple random walk in H space (since more mutations are required in order to shift a proteome's H by even 1 unit on the number line above).
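The 32% figure in point (A) can be reproduced by enumeration. This is a minimal sketch that assumes the current residue and its replacement are each drawn uniformly and independently from the 20 standard amino acids; that is our reading of the figure, not a stated assumption of the text. The helper `filv_fraction` is a hypothetical illustration of the %FILV metric.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = frozenset("FILV")

def filv_fraction(sequence):
    """Fraction of residues in a sequence that are F, I, L or V."""
    return sum(residue in HYDROPHOBIC for residue in sequence) / len(sequence)

def class_changing_probability(alphabet=AMINO_ACIDS, group=HYDROPHOBIC):
    """Share of (old, new) residue pairs that move a site into or out of the group."""
    pairs = list(product(alphabet, repeat=2))
    changing = sum((old in group) != (new in group) for old, new in pairs)
    return changing / len(pairs)

print(f"P(attempted mutation changes H) = {class_changing_probability():.2f}")
print(f"%FILV of a toy peptide = {100 * filv_fraction('FILVAAAA'):.0f}")
```

If self-replacements are excluded (380 ordered pairs instead of 400), the same count gives roughly 34% rather than 32%, so the conclusion that most attempted mutations leave H unchanged does not depend on this detail.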
In conclusion, the composition-shifting substitutions within a genome are likely not made during neutral drift, i.e., the contribution of g (the neutral drift term in Equation 1) to the composition change by substitution that we term "oil escape" is expected to be minimal compared to the substitutions incurred during speciation events due to increased substitution rates, [2][3][4][5][6][7][8][9][10][11][12] hitchhiking, [13][14][15] and presumably reduced population sizes (and high fixation probabilities) 16 during speciation events.

S1.2 Mutations tend to favor the preservation of oil and charge content in a sequence.
[Table S1 columns: Substitution matrix used | Average log odds for a random mutation | Average log odds for a composition-shifting mutation | Average log odds for a composition-preserving mutation]

Table S1: Historically, fixed mutations that preserve oil/charge content are much more probable than those that do not. Proteome composition-maintaining mutations are highly probable, which is indicated by a positive average log-odds score (see equation below; each log odds value was obtained from matrices found on the NCBI FTP database at ftp://ftp.ncbi.nih.gov/blast/matrices/). Conversely, proteome composition-shifting mutations appear to be even more improbable than mutations that appear randomly. This indicates that a random drift in sequence space is under active pressure to maintain proteome composition (almost certainly due to the biophysical requirement of maintaining protein structure/function). Note that while the BLOSUM matrices are general indicators of trends of acceptable substitutions, such a matrix cannot represent all sequence evolution situations given the heterogeneity in rates of nucleotide substitution. 18,19

AMINO ACID KEYS: X is the set of all possible amino acids, the hydrophobic amino acids are H ∈ [F, I, L, V] (the top four most hydrophobic residues as per the Kyte-Doolittle scale), and the charged amino acids are C ∈ [E, R, D, K], while H′ and C′ are all amino acids not including H and C, respectively.

EQUATION:
The log-odds of substitution ($M_{ij}$, in bit units) for any two amino acids i, j was obtained from published BLOSUM matrices 17 and normalized to represent bit units; so, $M_{ij} = \log_2\left(P_{ij} / (q_i q_j)\right)$, where $q_i$ is the natural frequency of amino acid i in sequences, and $P_{ij}$ is the probability of the two amino acids i and j replacing each other at a homologous position. The odds of a mutation may be obtained from the expression $2^{M_{ij}}$.
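The averaging behind Table S1 can be sketched as follows. This is a minimal illustration under stated assumptions: identity pairs are excluded, a "composition-preserving" mutation is one that keeps a residue inside or outside the chosen group, and `toy_score` is a made-up placeholder rather than a BLOSUM matrix. In practice, a matrix parsed from the NCBI files would be supplied in its place, and the exact averaging conventions of Table S1 may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = frozenset("FILV")   # group H in the amino acid keys
CHARGED = frozenset("ERDK")       # group C in the amino acid keys

def average_log_odds(score, group, alphabet=AMINO_ACIDS):
    """Mean log-odds for random, composition-shifting and composition-preserving pairs."""
    buckets = {"random": [], "shifting": [], "preserving": []}
    for old, new in product(alphabet, repeat=2):
        if old == new:
            continue  # assumption: identities are not counted as mutations
        value = score(old, new)
        buckets["random"].append(value)
        kind = "preserving" if (old in group) == (new in group) else "shifting"
        buckets[kind].append(value)
    return {name: sum(values) / len(values) for name, values in buckets.items()}

def toy_score(old, new):
    """Placeholder scores in bits; NOT real BLOSUM values."""
    return 1.0 if (old in HYDROPHOBIC) == (new in HYDROPHOBIC) else -1.0

averages = average_log_odds(toy_score, HYDROPHOBIC)
for name, mean_bits in averages.items():
    # Odds follow from the log-odds as 2 ** M, as described in the text.
    print(f"{name}: {mean_bits:+.2f} bits, odds {2 ** mean_bits:.2f}")
```

Passing `CHARGED` instead of `HYDROPHOBIC` as the group gives the corresponding averages for charge content.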
The various BLOSUM 17 substitution matrices (obtained from the NCBI FTP database ftp://ftp.ncbi.nih.gov/blast/matrices/) indicate that mutations that preserve the proteome/sequence's % hydrophobic content (H) (and % charge content, C) are much more probable than those that cause fluctuations in oil or charge content. Notably, H and C modifying mutations are not even marginally tolerated (as indicated by their lower average log odds score than even the expected log odds score for a randomly picked amino acid pair).

S1.3 Rates of molecular evolution (by nucleotide substitutions) and speciation events are linked (i.e., node number ∝ number of accumulated substitutions).

Summary:
• It has been shown that speciation events are concomitant with an increased rate of substitution.
The emergence of the molecular clock hypothesis, [20][21][22] along with the neutral [23][24][25] and nearly neutral [26][27][28] theories of molecular evolution (reviewed by Ohta and Gillespie 29 ), posited that a vast majority of nucleotide substitutions have a neutral effect on fitness, which, in turn, indicates a constant substitution rate over time for all sequences regardless of species (like the ticking of a molecular clock). Although it is undeniable that neutral drift in sequence space must happen (given the finite size of all populations), the inaccuracies within the rates of molecular clocks (reviewed by Bromham 30 ), together with the emergence of alternative molecular clock models (e.g., the episodic molecular clock 31,32 ) and a better understanding of mutation rate change during speciation (reviewed below), make us more certain that speciation events (and node formation in the tree) account for substantial additional genome deviation.
One of the early pieces of evidence indicating that speciation events were concomitant with higher rates of molecular evolution was reported in 1990, when Mindell, Sites and Graur 2 found that a noticeable increase in nucleotide substitution rates can be associated with speciation (or diversification) events in sceloporine lizards. Although this study dealt with lizards in particular, and the study has been contested, 33 an empirical precedent was set in trying to determine the linkage between substitution rates and speciation. Today, evidence of linkage between molecular rates and speciation/diversification has been found within clades as diverse as angiosperms, 3 birds, 4,5 and confamilial sauropsids (birds and reptiles). 6 Importantly for the manuscript, a study of a large number of available phylogenies (unbiased by node density artifacts 34,35 ) indicated direct correlations between node number (identical in form to the "node number" in the MS) and extent of sequence divergence from the last common ancestor 7-11 (for striking graphs, please refer to . This direction finally culminated in the finding that increased mutation rates during speciation events explain rate discrepancies in molecular clocks, 12 which was otherwise a scarcely understood issue (with specific exceptions 36,37 ). It is very interesting that, among a diverse group of phylogenetic trees, the discrepancies in the expected molecular clock rates have been explained merely by the additional effects of increasing substitution rates during cladogenesis (Fig. 4 in Pagel et al.'s report 12 ).
From these studies, it appears incontrovertible that increased mutation rates during speciation events contribute significantly to the rate of non-synonymous substitution, which was previously thought to be dominated by neutral drift and phyletic gradualism.

S1.4 Oil escape is expected to happen predominantly during speciation events and less so during neutral drift.
As described in the previous section, the total number of substitutions (or total "deviation") of a sequence from its last common ancestor can be split into (i) the deviation caused by a time-dependent neutral drift (which is expected to be generally equal for all sequences today 2,12,23,24,29 ) and (ii) a node-number-dependent burst of substitutions that are indicative of speciation events. [2][3][4][5][6][7][8][9][10][11][12] Pagel et al. provide an equation describing this cumulative drift 12 (which is a variation of an earlier formalism 2 ), where the total amount of deviation (x) of a species from the common ancestor (placed at the root) in a phylogenetic tree is described as

$$x = n\beta + g \qquad (1)$$

Here, n is the node number of the sequence (related to the number of speciation events in its lineage leading to the common ancestor), β is the rate of substitution during one speciation event, and g is the constant number of substitutions encountered by neutral drift (Mindell et al. describe g as time multiplied by a constant rate of divergence by neutral or "anagenic" drift 2 ).
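To make the roles of n, β and g concrete, the sketch below fits a straight line to pairs of node number and root-to-tip deviation. It assumes the linear form x = nβ + g implied by the definitions above and uses ordinary least squares on synthetic, made-up values; the cited studies use phylogenetically informed methods, so this is an illustration rather than their procedure.

```python
def fit_speciation_model(node_numbers, deviations):
    """Ordinary least-squares estimates (beta, g) for x = n * beta + g."""
    count = len(node_numbers)
    mean_n = sum(node_numbers) / count
    mean_x = sum(deviations) / count
    spread = sum((n - mean_n) ** 2 for n in node_numbers)
    covariance = sum(
        (n - mean_n) * (x - mean_x) for n, x in zip(node_numbers, deviations)
    )
    beta = covariance / spread      # substitutions gained per speciation event
    g = mean_x - beta * mean_n      # node-independent (neutral drift) component
    return beta, g

# Synthetic example generated with beta = 0.02 and g = 0.3 (illustrative only).
nodes = [5, 10, 15, 20]
deviation = [0.02 * n + 0.3 for n in nodes]
beta_hat, g_hat = fit_speciation_model(nodes, deviation)
print(f"beta = {beta_hat:.3f} per node, g = {g_hat:.3f}")
```

A slope β indistinguishable from zero would correspond to a purely clock-like tree, while a positive slope indicates that lineages with more speciation events have accumulated more substitutions.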
Given the previous section, the following inferences can be made:

• Constant neutral drift (g in Equation 1) acts on all genomes at approximately the same rate 2,12 (especially under the nearly neutral theory reviewed by Ohta and Gillespie 29 ), and cannot account for differences between species in the number of substitutions accumulated, i.e., all sequences have diverged equally due to this phenomenon.
• Neutral drift accounts for only those nucleotide substitutions that are "neutral" or "nearly neutral", which include synonymous substitutions that display neutral fitness effects and a fraction of the non-synonymous mutations that display "nearly neutral" fitness effects. 23,24,29 This indicates that neutral drift would largely account for mutations that are not under strong selective pressures, such as those mutations that maintain homology (or biochemistry) among protein sequences (as predicted by high log-odds scores for those residue substitutions in BLOSUM matrices 17 ). As shown in Table S1, mutations that tend to change oil (FILV) or charge (ERDK) content within a sequence have significantly lower log-odds scores than mutations that maintain sequence oil and charge composition, i.e., neutral drift manifested in non-speciation times of a genome (the g component of Equation 1) is expected to have little effect on (oil and charge) composition change.
• Given the inability of neutral drift to explain the shift in a proteome's oil content, what else could explain such a phenomenon? An alternative explanation for the introduction of composition-changing (and low-BLOSUM-score) substitutions into a sequence is the hitch-hiking of slightly deleterious mutations along with adaptive mutations (the genetic "draft" or hitch-hiking hypothesis [13][14][15] ) during the burst of possibly adaptive mutations expected to accompany speciation events, [2][3][4][5][6][7][8][9][10][11][12] i.e., oil escape is expected to happen mainly during speciation events, that is, during the creation of additional nodes in a species' lineage.
The true phenotype of oil-composition-changing mutations (whether these mutations are functionally adaptive, or non-adaptive and possibly deleterious) might indicate alternative mechanisms involving oil-composition-changing substitutions; however, those distinctions are not as important as the statement that these substitutions are not expected to be nearly neutral, and so are not expected to happen during Kimura (or Ohta) style neutral genetic drift. Also, as discussed in Section S1.1, unbiased random drift is not capable of shifting the oil content of a proteome to the extent observed; in fact, a bias (or "field" applied to the random choices of substitution made, e.g., one that exists when starting at higher oil contents) is necessary to explain such an oil escape.

S2 The seed proteins used to produce Fig. 3A

FASTA file of the seed protein used to produce Fig. 3A

Figure S3: Proteomes are increasing in "specificity" over evolutionary time. An increase in the frequency of oily (non-specific) residues is mirrored by those residues that provide specificity to a protein (such as disulphide and hydrogen bonding polar residues), while those residues that have ambivalent features (such as charges, which may be used to maintain disorder, as in intrinsically disordered proteins, or order, as in proteins stabilized by salt bridges) show "ambivalent drift" over node space (which would indicate more niche-like constraints for these residues). These trends support the pluripotent mechanism (described in the text) as the originator of the protein repertoire.

Kyte-Doolittle hydrophobicity or $H_k$ (B) correlates well with the extent of each amino acid's evolutionary "drift" over node space (assessed by the Spearman correlation number $r_{s,AA}$ between the fraction of a single type of amino acid $F_{AA}$ in a proteome and species node number, which are explicitly shown in Fig. S8). Note that, in panel B, cysteine, marked as a red "x", was discounted when calculating $r_s$, due to its non-canonical property of possessing both high hydrophobicity and high interaction specificity. However, $C_{GC}$ and $H_k$ are themselves not even negligibly correlated (given the high probability, p-val = 0.3, that the scatter plot in C is random). This indicates that there are two separate constraints/drifts at play simultaneously: one at the genome/nucleotide level (GC drift) and one at the proteome/amino acid level (which is the "oil escape" discussed in the main text). Actual plots of individual amino acid drifts are displayed in Fig. S8.

We calculated the Spearman correlation coefficients ($r_s$) between a protein's oil content and its length, which are plotted in histogram form. The mean $r_s$ in this study lies very close to zero (at $r_s \sim 0.03$), indicating essentially no systematic dependence of oil content on protein length within size-homogenized homologous clusters. The broad distribution of these $r_s$ values exists due to the small average number of proteins per cluster. These results indicate that oil escape within globular protein domains was not driven primarily by the addition of loops and intrinsically disordered regions within globular proteins (which would cause an increase in protein length).

In doing so, each cluster is a collection of orthologs (if we find one for the seed protein) and paralogs (if we do not find an ortholog). However, to many, this study is flawed, as paralogs (that may or may not be directly related to the seed protein) are added to the cluster, thereby reducing the validity of the study. In response, we have performed two new studies using the most recent COG database (each cluster in this database corresponds to a Cluster of Orthologous Groups of proteins ranging through 66 complete genomes 40 ).
From the database, we performed our regression analysis on both exclusively orthologous protein clusters (A) and homologous clusters (B). Unfortunately, due to sampling problems (a smaller dataset size and a shorter range of node numbers, as only lower eukaryotes have been included in the study), the available clusters often result in statistically insignificant correlations. However, when we shift the threshold for statistical relevance, we find that an increasing number of clusters display oil escape, and the clusters strictly tend to favor oil escape as we pass the significance threshold of p-value ≤ 0.05. Also, interestingly, the percentage of statistically significant oil escapes actually increases as one goes from a mix of homologous (orthologous and paralogous) clusters to exclusively orthologous clusters, which only reasserts the claims regarding oil escape among clusters of related proteins.
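For a single cluster, the monotonic trend between node number and oil content can be summarized with a Spearman rank correlation. The sketch below is a minimal, dependency-free version with average ranks for ties; the cluster values are invented for illustration, and the p-values used for the significance thresholds above would need a statistics library and are not computed here.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman r_s: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cluster: node number of each species and %FILV of its protein.
cluster_nodes = [3, 8, 12, 17, 21]
cluster_oil = [31.0, 29.5, 28.0, 27.2, 25.9]
print(f"r_s = {spearman(cluster_nodes, cluster_oil):.2f}")
```

A negative r_s for a cluster corresponds to oil escape within that cluster; repeating the calculation over all clusters gives the distribution of trends discussed above.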

Details:
The tree of life comprises a collapsed, pruned tree obtained from the iToL website. The newick format of this tree is as follows:

Figure S11: Oil escape is described by a shift in the near-Gaussian distribution of oil content per protein, which precludes theories favoring the preferential adaptation of IDPs in complex organisms. The figure above provides two histograms that describe the distribution of protein oil content (this time using the Kyte-Doolittle scheme) among all proteins found in proteomes belonging to all animals (red curve) and bacteria (black curve) described in the SI. It is interesting to note that the two distributions are generally Gaussian-like. As discussed in the MS and SI, we expect that the distribution described by the bacterial aggregate would most resemble the last common universal ancestor's oil content, while the red curve represents newer proteomes that emerged only later; this picture can be described by the shift of a unimodal (Gaussian) distribution of protein oil contents to lower mean oil contents, where the Gaussian distribution slides to the left in the picture. It is important to note that Uversky (Protein Sci. 2002, 11(4):739-56) has shown that a strong divide between IDPs and globular/folded proteins is dictated primarily by a threshold hydrophobicity and secondarily (to a less important degree) by the protein's mean net charge per amino acid. The two vertical, dashed lines describe the threshold for proteins ranging in 10 varying charge configurations (which would involve a large number of proteins we know). The band delineated by the two lines indicates a general sequence-composition threshold describing the transition between globular proteins and IDPs.
If IDPs were to be enriched in later proteomes due to adaptation, then one would expect to find a more aggressive recruitment/invention of IDPs in the proteome, from which we would predict a second hump in the protein hydrophobicity histogram in complex organisms (red). However, all we see in animal proteomes is a unimodal (single hump) oil distribution whose tail drifts into the Uversky threshold, thereby putatively allowing for the chance encountering of IDPs in the first place. Given the lack of a new hump in the "Uversky zone" (left of the dashed lines), which would indicate the active enrichment of IDPs in complex organisms, we can infer that Uversky-like IDPs did not cause, but rather emerged from, the Gaussian shift in oil content over evolutionary time. In conclusion, oil escape since the emergence of the LUCA explains the emergence of IDPs, and there is no reason to expect that any adaptive force drove the emergence of IDPs.

[Axis label: Proteome length (L)]

Figure S12: The complete correlation graph (A) along with the individual relationships (B). %FILV vs node number displays the strongest correlation among the relationships between the four variables: %FILV, %GC (obtained from Ensembl cDNA transcripts), node number, and proteome length (L). The complete correlation network indicates that while GC content is correlated with node number, only three correlations, (i), (ii) and (iv), may be sufficient to explain all other correlations. Also, it is unlikely that the weak relationship between GC content and node number will be able to cause the relationship between FILV and node number, i.e., the relationship between GC content and node number is likely either caused by the coupling between FILV and GC (ii) or by unknown and independent constraints.

Pagel's λ metric 41,42 is used to determine whether, given a species tree, a species trait (assigned to the leaves of the tree) is independent of phylogenetic relationships (λ = 0) or dependent on the species' shared genetic history (λ = 1). Using the same methodology described in Figure 1 of Freckleton et al., 42 we find that the maximum likelihood (ML) estimates of λ in our combined database (A) and our individual databases (C) are very high (≥ 0.97). The λ = 0.99 estimate for the original tree, with branch lengths of unit length (A; black circles), is also obtained for a "neutral drift" tree, where branch lengths were modified to ensure that all species-to-root lengths are equal (A; red circles). Also, the critical value of the log-likelihood ratio test (red dashed line; p-val = 0.05) indicates that the value of λ is significantly different from 0 (i.e., a non-phylogenetic explanation for the character trait, oil escape, is highly unlikely). As a control, we studied 1000 original trees whose species character trait values were (i) shuffled and (ii) randomized with a standard normal probability distribution ($\mu = 1$, $\sigma^2 = 1$).
Panel (B) describes the search for Pagel's ML λ's for the shuffled (black) and randomized (red) trees, which resulted in estimated λ values of 0.14 ± 0.08 and 0.08 ± 0.05, respectively.

Calculating λ: Each branch in the iToL/NCBI tree used (see Section S6.2) is set to 1, which means that the "operational time" of this tree is in node numbers. We build a variance-covariance matrix V of the tree, where the diagonal elements $V_{i,i}$ are set to the node number of species i, and the off-diagonal elements $V_{i,j}$ are the shared history (in node numbers) between the two species i and j. V(λ) is a modification of V by an off-diagonal multiplier λ (normally ∈ [0, 1]). At each value of λ, we calculate the likelihood p(λ, y) (Equation 4 in Freckleton et al. 42 ), which yields the maximum likelihood (ML) λ value. The x-axis in the graphs indicates the difference $-[\ln(p(\lambda_0, y)/p(\lambda_1, y))]$, where $\lambda_0$ is the maximum likelihood λ, and $\lambda_1 = \lambda_0$ is the null hypothesis. Any difference $-[\ln(p(\lambda_0, y)/p(\lambda_1, y))]$ that is lower than the $\chi^2$ significance cutoff (-0.92 or the red dashed line, which is obtained from precompiled tables setting p-val = 0.05 with one degree of freedom) indicates rejected values for λ.

[Axis label: Average proteome oil content]

Figure S16: Oil escape is robust to data culling. For each culling fraction (C), we obtained 100 species sets, each amounting to a subset of the total number of species in our database. From each set, utilizing the original tree and the subset of oil contents per species, we obtained the estimated LCA's oil content using a generalized least squares method. 43 For each C, the resulting 100 estimated ancestral oil contents were averaged and displayed above (with error bars representing variance). Our last common ancestor's oil content is consistently predicted, on average, to be similar in value (between 30 and 30.5).
While the mean is steady across testing parameters, the variance ($\sigma^2$) of the estimation per C increases, but is remarkably low even with 70% of the species culled. This indicates that while the set of sequenced species is relatively small, the LCA's oil content is consistently predicted to be higher than average (26.1). This study was repeated independently with no difference in inference.
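The ancestral oil content estimate behind Figure S16 can be sketched as follows. This is a minimal illustration that assumes the standard generalized least squares estimator, $(T^{\top} V^{-1} T)^{-1} T^{\top} V^{-1} Y$, applied to a matrix V of shared node numbers, a design matrix T with a column of ones and a column of node numbers, and a vector Y of oil contents. The three-species tree and its oil contents are invented, and the helper names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    size = len(matrix)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gls_ancestral_state(V, node_numbers, traits):
    """Return (alpha, beta): trait at the root and its change per node."""
    ones = [1.0] * len(traits)
    vi_one, vi_t, vi_y = solve(V, ones), solve(V, node_numbers), solve(V, traits)
    # Entries of the 2 x 2 system (T' V^-1 T) theta = T' V^-1 Y.
    a11, a12, a22 = dot(ones, vi_one), dot(ones, vi_t), dot(node_numbers, vi_t)
    b1, b2 = dot(ones, vi_y), dot(node_numbers, vi_y)
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Toy tree ((A,B),C) with unit branches: node numbers 2, 2, 1; A and B share 1 node.
V = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
nodes = [2.0, 2.0, 1.0]
oil = [26.0, 26.0, 28.0]   # generated from alpha = 30, beta = -2
alpha_hat, beta_hat = gls_ancestral_state(V, nodes, oil)
print(f"estimated LCA oil content: {alpha_hat:.1f}, change per node: {beta_hat:.1f}")
```

For the culling experiment, the same function would be applied to random subsets of species (with V, the node numbers, and the oil contents restricted to the retained species), and the resulting α values averaged per culling fraction.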

METHOD:
Set up for estimating the ancestral oil content: Each branch in the iToL/NCBI tree used (see Section S6.2) is set to 1, which means that the "operational time" of this tree is in node numbers (an alternative tree was also used to model neutral drift). Let V be the n × n variance-covariance matrix of the tree, where the diagonal elements $V_{i,i}$ are set to the node number of species i, and the off-diagonal elements $V_{i,j}$ are the shared history (in node numbers) between the two species i and j in our ToL. Let Y be the n × 1 (column) vector of character states (oil contents), i.e., for any species i, $Y_i$ is $\%FILV_i$. Finally, let T be an n × 2 matrix whose first-column elements all equal 1 and whose second-column elements give the species node number (i.e., $T_{i,1} = 1$, $T_{i,2}$ = i's node number).

Estimating the ancestral oil content: Then, the model of evolution 43 for species i is $Y_i = \alpha + \beta T_{i,2} + \epsilon_i$, where α is the character state of the ancestor at operational time 0 (i.e., the LCA), β is the estimated rate of change of the character state per operational time unit (e.g., node number), and $\epsilon_i$ is the random error. 43 From a generalized least squares method, 43 we can estimate both α and β by solving for

$$(\hat{\alpha}, \hat{\beta})^{\top} = (T^{\top} V^{-1} T)^{-1} T^{\top} V^{-1} Y$$

S4 Statistics of the asymptotic nature of oil escape.
We use the following asymptotic form to fit the scatter plot (oil escape) described in Fig. 2B:

Here, N is the species' node number, $\phi_1$ is the asymptote, $\phi_2$ is the ordinate-intersect and $\phi_3$ is the rate constant of the trend. Keeping all φ's variable, and fitting to the data in Fig. 2B, we obtained the following values: r = 0.879, $\phi_1$ = 23.91%, $\phi_2$ = 883.6, $\phi_3$ = 0.89. Constraining $\phi_1 = \%FILV_C$ still results in a comparable fit (r = 0.877, $\phi_2$ = 301.49, $\phi_3$ = 0.62), with r higher than the corresponding monotonicity-based Spearman coefficient $r_s$ of ∼ 0.846.

S5 The proteomes analyzed
The "proteomes" analyzed were annotated from genomes obtained from the Ensembl databases. 44 All species listed below (and in Table S2) were used to create Fig. 2A,B, Fig. 3 and Fig. S4, while only the italicized ones (those with first occurrence data in brackets available from pbdb.org) were used to calculate Fig. 2C

S6.1 Node number is dependent on the "universal" tree of life (not the pruned tree).
The pruned tree of life was obtained from iToL with the option to leave "Internal nodes expanded", which means that the node number obtained for each species depends not on the number of sequenced species in the tree of life, but on the number of species in the NCBI taxonomy database. For simplicity, each branch length is set to 1 (and hence, node number is the metric for time). Below is an example of two trees in newick format obtained using the "Internal nodes expanded" option, which shows how, even when the number of species in the tree of life is increased, the node numbers of Homo sapiens and Mus musculus do not change. Given that the number of sequenced species (hundreds) is much less than the number of species in the NCBI taxonomy database (as of 2012, NCBI records 249,450 species), we can conclude that, barring the refreshing of the NCBI taxonomy database, the addition of more sequenced species to our ToL will not change the node numbers of the extant species.

S6.2 THE COMMON TREE USED TO CALCULATE NODE SPACE
The following "tree" in newick format was obtained from iToL's NCBI algorithm. Each bracket pair indicates an additional node that separates the species (or group of species) from the last common ancestor. The node number of a species in Fig. 2 onwards is the number of brackets in which the species is embedded.
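The bracket-counting rule can be sketched in a few lines. This is a minimal illustration that assumes a plain newick string without quoted labels, counts the outermost bracket pair as a node, and ignores internal node labels and branch lengths; the example tree is a made-up three-species tree, not the tree used in the study.

```python
def node_numbers(newick):
    """Map each leaf name to the number of bracket pairs that enclose it."""
    depth = 0
    label = []
    after_close = False   # True while reading an internal node label
    result = {}
    for char in newick:
        if char in "(),;":
            name = "".join(label).split(":")[0].strip()  # drop branch length
            if name and not after_close:
                result[name] = depth
            label = []
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
            after_close = char == ")"
        else:
            label.append(char)
    return result

toy_tree = "((Homo_sapiens,Mus_musculus),Escherichia_coli);"
for species, number in node_numbers(toy_tree).items():
    print(species, number)
```

Because the count depends only on the brackets enclosing a species, adding a new species inside an existing bracket pair leaves the node numbers of the other species unchanged.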