Abstract
As biological sequence databases continue to grow, so does the insight they promise into the shape of the genetic diversity of life. However, to fulfil this promise, software must remain usable, accommodate large amounts of data, and make use of modern high-performance computing infrastructure. In this study we present a reimplementation and an extension of a technique that uses indicator vectors to compute and visualize similarities between sets of nucleotide sequences. We provide a flexible and easy-to-use Python program relying on standard, open-source libraries. Our tool allows the analysis of very large sets of sequences through code parallelization, and by providing routines to split a computational task into smaller, manageable subtasks whose results are then merged. This implementation also facilitates adding new sequences to an indicator vector-based representation without recomputing the whole set. The efficient synthesis of data into knowledge is no trivial matter given the size and rapid growth of biological sequence databases. Based on previous results regarding the properties of indicator vectors, the open-source approach proposed here efficiently and flexibly supports comparative analysis of genetic diversity at a large scale. Our software is freely available at: https://github.com/WandrilleD/pyKleeBarcode.
Citation: Duchemin W, Thaler DS (2023) PyKleeBarcode: Enabling representation of the whole animal kingdom in information space. PLoS ONE 18(6): e0286314. https://doi.org/10.1371/journal.pone.0286314
Editor: Matthew Cserhati, AbbVie Inc, UNITED STATES
Received: January 13, 2023; Accepted: May 13, 2023; Published: June 2, 2023
Copyright: © 2023 Duchemin, Thaler. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code, documentation, and instruction of use for PyKleeBarcode can be found in the GitHub repository (https://github.com/WandrilleD/pyKleeBarcode). All relevant data and code for the experiment conducted in this article can be found in the GitHub repository (https://github.com/WandrilleD/pyKleeBarcode-publication-supporting-code-and-data).
Funding: This research was funded by the SIB Swiss Institute of Bioinformatics. WD and DT were supported in this research by the Lounsbery Foundation grant "A Cosmic View of Life on Earth". The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. URL of funder websites: * sib.swiss * https://www.rlounsbery.org/.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The study of biological evolution through comparing linear sequences from different species, now known as molecular phylogeny, was first proposed by Crick in a 1957 lecture [1, 2]. The passage that heralds the coming of molecular phylogeny is clear and to the point:
“Biologists should realize that before long we shall have a subject which might be called ’protein taxonomy’ -the study of the amino acid sequences of the proteins of an organism and the comparison of them between species. It can be argued that these sequences are the most delicate expression possible of the phenotype of an organism and that vast amounts of evolutionary information may be hidden away within them.”
Linear sequence analysis of proteins from different species combined with paleontology to estimate chronological time of their divergence gave rise to the concept of a ‘molecular clock’ [3, 4], which proposes a constant number of amino acid changes per year in chronological time [5]. Directly inspired by the molecular clock, the neutral theory of evolution proposes that many sequence changes in evolution result from population effects such as drift, minor intermittent selection, and bottlenecks [6, 7]. The subsequent advent and development of DNA sequencing led Kimura to explicitly extend the neutral theory to DNA sequences with the reasonable hypothesis that synonymous mutations were neutral [8]. While the neutrality of synonymous sequences is not 100% valid, it is a useful first order approximation in many systems [9].
Woese and colleagues pioneered the use of small subunit ribosomal RNAs to construct phylogenies that encompass all of cellular life [10, 11]. SSU RNAs have two important features in this regard: a) Presence and sufficient similarity in all cellular life to allow them to be aligned and compared. b) Abundance and relative stability, facilitating their purification and characterization. Molecular methods including in situ hybridization and PCR allowed the characterization of SSU RNAs and the genes encoding them in situ including uncultured and unculturable organisms [12].
Mitochondrial genomes are small, lack recombination, and are technically straightforward to work with in terms of molecular biology because their copy number is approximately a thousand times greater than that of nuclear genes. Mitochondrial analysis both confirmed and refined other methods for species and population identification, especially among animals [13, 14] but, with certain caveats, among other Eukaryotes as well, e.g., plants and fungi [15]. Hebert and colleagues proposed that a partial segment of the mitochondrial genome agreed upon by the community could universalize and democratize species identification [16, 17]. The mitochondrial sequence that has become the most widely used is a 648 base pair (bp) segment of the gene encoding mitochondrial cytochrome c oxidase subunit I (COI); it prevailed because reliable primers and methods useful for both vertebrates and invertebrates were adopted by a critical mass of the community. Intrinsically, there is nothing special about the mtCOI DNA barcode region compared to other protein-encoding regions of the mitochondrial genome [18].
The Barcode Of Life Database (BOLD) is an open-access compilation including mtCOI barcode data from approximately five million individuals that collectively cover a great deal of the extant animal kingdom [19]. The BOLD approach is distinctive in the world of sequence analysis because it covers only a small proportion of the genome (e.g., in humans less than one millionth of the total genome) but is available from many different individuals in many species. For many animal species, DNA barcodes are the only sequence information available. The use of BOLD was at first limited to species identification. However, the broad but not deep nature of barcode data can be used for more than identification: it can support phylogenetic reconstructions and evolutionary models [20].
A method of choice to extract and recapitulate the evolutionary information contained in a set of homologous sequences is to construct a phylogenetic tree based on the likelihood of a substitution model. These methods have benefitted from numerous advances in the past decades, but the explosion of the number of possible tree topologies as the number of leaves increases, with (2n-5)!! unrooted topologies for n leaves, still makes the reconstruction of phylogenetic trees with tens of thousands of sequences using likelihood methods daunting (although there exist strategies to improve scalability using phylogenetic placement, see [21]). Aside from computational considerations, one has to consider cases where the evolutionary history of sequences does not follow a strictly tree-like pattern because of horizontal and modular evolution [22].
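To give a sense of this combinatorial explosion, the double factorial (2n-5)!! can be evaluated directly. The short sketch below is only an illustration; the function name `unrooted_topologies` is ours and is not part of pyKleeBarcode or of any library.

```python
def unrooted_topologies(n_leaves: int) -> int:
    """Number of unrooted binary tree topologies for n labelled leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("the formula applies to n >= 3 leaves")
    count = 1
    # (2n-5)!! is the product of the odd numbers 1 * 3 * 5 * ... * (2n-5)
    for odd in range(1, 2 * n_leaves - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 50):
    print(n, unrooted_topologies(n))
```

Ten leaves already give 2,027,025 topologies, which illustrates why exhaustive exploration is out of reach for datasets of tens of thousands of sequences.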
Sirovich and colleagues developed an innovative approach to compare multiple DNA sequences both within and among many animal species, such that the overall structure of animal life can be visualized [23, 24]. Their method relies on a contrastive approach to build indicator vectors, which recapitulate the genetic information of a sequence, or of a group of sequences, and can then be used to probe the genetic diversity, or closeness, of the taxonomic groups they represent, for example in the form of a similarity matrix called the structure matrix. Visualization of the structure matrix has been a valuable part of a number of publications [18, 25–31]. However, the indicator vector and structure matrix approach for the comparative study of DNA sequences has not yet been used to its full potential. The present work was originally motivated by the wish to make the prior work of Sirovich and colleagues more accessible: software to compute the indicator vectors and Klee diagrams has been limited to the original code released with [23] (currently available at https://phe.rockefeller.edu/barcode/klee.php), which runs on Matlab2009a, an older version of a proprietary commercial software that is not only unmaintained but whose documentation is no longer available on the developer’s website. We used the opportunity of a re-implementation to introduce significant extensions to the original methods, in particular to accommodate a variable number of sequences per species and to provide a more flexible interface to the computational steps involved in going from a multiple sequence alignment to a structure matrix. Refactoring the process into separate steps allowed an efficient and scalable parallelization of the pipeline, as well as added flexibility. In the previous Matlab implementation, updating a structure matrix with one or several new sequences required recomputing it entirely.
Separation of different steps in the implementation grants additional flexibility, both computationally and conceptually. Real-life datasets from BOLD (The Barcode Of Life Database) are used to illustrate the new implementation and in places contrast it with the published Matlab-based method.
Methods
Derivation of the original method’s formula
After a series of pre-processing steps, which are detailed in the Results section of this manuscript as well as in [24], the multiple sequence alignment is split into several sequence sets.
A sequence set i is represented as a matrix $s_i$ of size $m \times 4s$, where m is the number of sequences in the sequence set and s is the sequence length, in nucleotides.
Then, the indicator vector of sequence set i is computed as the eigenvector with the largest positive eigenvalue of the matrix:
$$s_i^T s_i - \frac{1}{N-1} \sum_{j \neq i} s_j^T s_j$$
where N is the number of sequence sets, as defined in equation 15 of [24]. Note that this formulation presumes that all sequence sets contain the same number of sequences. We begin by writing:
$$S_i = s_i^T s_i$$
such that the previous equation can be written:
$$S_i - \frac{1}{N-1} \sum_{j \neq i} S_j$$
This formula can be rewritten as:
$$\frac{N}{N-1} S_i - \frac{1}{N-1} \sum_{j=1}^{N} S_j$$
which makes it evident that once the matrix $\sum_{j=1}^{N} S_j$ has been computed, the indicator vector of a sequence set can be computed independently from the other sequence sets. Additionally, the original method only considers the case where the number of sequences per set, noted m, is the same for each set. We generalize this approach to any number of sequences per set, noted $m_i$ for sequence set i, by normalizing the $S_i$ matrices by $m_i$, such that the previous equation becomes:
$$\frac{N}{N-1} \frac{S_i}{m_i} - \frac{1}{N-1} \sum_{j=1}^{N} \frac{S_j}{m_j}$$
Given its role in contrasting a given sequence set against all others, we call the matrix $R = \sum_{j=1}^{N} S_j / m_j$ the reference matrix.
Finally, it can be remarked that the computation of all, or parts of, the structure matrix depends only on the indicator vectors of the sequences of interest.
Consequently, we describe the method of [24] as three consecutive steps:
- Computation of the reference matrix from all sequence sets
- Computation of an indicator vector for each sequence set independently
- Computation of the structure matrix from the indicator vectors of all sequence sets
We detail each of these steps in the results section.
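The key point of this derivation, namely that a single matrix shared by all sequence sets is enough to compute each indicator vector independently, can be checked numerically. The sketch below is a toy verification under a stated assumption: it takes the contrast matrix of set i to be $S_i$ minus the average of the other $S_j$, with $S_i = s_i^T s_i$ and equal set sizes. Random matrices stand in for encoded sequences, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sets, m, width = 5, 3, 8            # toy sizes: width = 4 * (2 nucleotides)
sets = [rng.random((m, width)) for _ in range(n_sets)]   # stand-ins for the s_i
S = [s.T @ s for s in sets]           # S_i = s_i^T s_i
S_total = sum(S)                      # computed once, shared by all sets


def contrast_direct(i):
    """S_i minus the average of the other S_j (needs every other set)."""
    others = sum(S[j] for j in range(n_sets) if j != i)
    return S[i] - others / (n_sets - 1)


def contrast_from_total(i):
    """Same matrix, obtained from S_i and the precomputed total only."""
    return n_sets / (n_sets - 1) * S[i] - S_total / (n_sets - 1)


same = all(np.allclose(contrast_direct(i), contrast_from_total(i))
           for i in range(n_sets))
print(same)
```

The second form is the one that makes the three-step decomposition possible: the total is computed once, after which each set only needs its own data.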
Mammalian sequences dataset
In order to test and demonstrate our method, we accessed the Barcode Of Life Database [19] for primary COI5-P sequences of mammals. Out of the >100,000 mammalian sequences, we selected a limited set, in order to facilitate results analysis and interpretation.
This set was built such that each of the following mammalian taxa is represented: Afrotheria, Artiodactyla, Carnivora, Chiroptera, Dermoptera, Eulipotyphla, Glires, Metatheria, Perissodactyla, Pholidota, Primates, Prototheria, Scandentia, Xenarthra.
To ensure this, we randomly selected a maximum of 40 species per taxon, and 3 sequences per species, except for Prototheria, Pholidota, Scandentia and Dermoptera where the low number of sampled species and sequences caused us to take all available sequences (resp. 3, 7, 9, and 1 species, and 7, 21, 16, and 3 sequences). The resulting set is composed of 1049 sequences spread among 354 species.
The retrieved sequences were then aligned using MAFFT [32, 33]. The sequence alignment extremities were then trimmed to avoid the spurious gaps resulting from differences in the size of the sequenced fragment. The resulting multiple sequence alignment presented no gaps (-), no missing base (N) characters, and 0.007% ambiguous IUPAC characters (Y,R,W,etc.).
Impact of the reference matrix on the indicator vector and structure matrix
One of the new possibilities offered by our re-implementation is to compute the indicator vectors of sequences that are absent from the reference matrix. This can be useful, for instance, when a new sequence has just been acquired, because it can be integrated into the structure matrix without having to recompute it entirely.
In order to evaluate the impact of an incomplete reference matrix, we devised a number of experiments in which we computed the indicator vectors and structure matrix of our mammalian dataset sequences with a reference matrix computed from a limited set of sequences. We then compared these with the indicator vectors and structure matrix obtained with a complete reference matrix.
The initial experiments were “primates-only”: reference matrix with primate sequences only (120 primate sequences); “no-Laurasiatheria”: reference matrix with Laurasiatherian sequences missing (488 non-Laurasiatherian sequences); and “missingXX%”: random percentages of sequences missing.
For the “missingXX%”, we tested percentages from 10 to 90 percent (by increments of 10%). We performed 10 replicates per condition to assess the variability in the results.
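The design of the “missingXX%” experiment can be sketched as follows. This is a minimal illustration of the sampling plan only, not the code used for the study; the function name `reference_subsets`, the rounding rule and the seed are our assumptions.

```python
import random


def reference_subsets(seq_ids, percentages=range(10, 100, 10), replicates=10, seed=0):
    """Yield (percent_missing, replicate, kept_ids) for each experimental condition."""
    rng = random.Random(seed)
    ids = list(seq_ids)
    for pct in percentages:
        # Number of sequences kept in the reference once pct% have been removed.
        n_keep = len(ids) - round(len(ids) * pct / 100)
        for rep in range(replicates):
            yield pct, rep, rng.sample(ids, n_keep)


# 1049 sequences, as in the mammalian dataset; integers stand in for sequence names.
plan = list(reference_subsets(range(1049)))
print(len(plan))
```

Each entry of `plan` lists the sequences retained to build one limited reference matrix; indicator vectors are then computed for all sequences against it.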
Results
We have implemented each of the following steps as an independent piece of code with its specific parallelization scheme, as well as in a single executable regrouping all three steps for convenience. All implementations were done in Python 3 using core modules and the publicly accessible libraries numpy, scipy, matplotlib, and mpi4py. The source code is available at https://github.com/WandrilleD/pyKleeBarcode, complete with documentation, tests, and usage examples on toy datasets. Fig 1 gives a schematic overview of the steps involved in going from an input multiple sequence alignment to a structure matrix.
PyKleeBarcode takes as input a multiple sequence alignment in the fasta format, whose sequences are grouped into sequence sets, most frequently corresponding to species.
The grouping of individual sequences into sets is deduced from part of the sequence name in the input fasta file, or via an association table provided using an optional argument to the script. Note that a sequence set may contain a single sequence. Also, note that while the original method from [24] enforces the constraint that each sequence set has the same number of sequences, pyKleeBarcode allows sequence sets of different sizes.
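As an illustration of this grouping, the sketch below reads fasta-formatted lines and groups sequences either by a field of the sequence name or through an explicit association table. The header layout (`identifier|species`), the separator and the function names are hypothetical and chosen for the example; they do not describe pyKleeBarcode's actual options.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from fasta-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_sequences(records, field=1, sep="|", table=None):
    """Group sequences by a field of their name, or by an explicit name -> set table."""
    sets = defaultdict(list)
    for name, seq in records:
        key = table[name] if table is not None else name.split(sep)[field]
        sets[key].append(seq)
    return dict(sets)


fasta = """>id1|Homo_sapiens
ACGT
>id2|Homo_sapiens
ACGA
>id3|Pan_troglodytes
ACGG
""".splitlines()

sets = group_sequences(read_fasta(fasta))
print({species: len(seqs) for species, seqs in sets.items()})
```

Note that the second set contains a single sequence, which is allowed.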
Sequence alignment pre-processing steps
To compute the structure matrix from a multiple sequence alignment, the DNA sequences are first pre-processed and transformed into numerical vectors. The pre-processing consists of a tentative imputation of the ambiguous N characters: any N character is replaced by the modal nucleotide at this position among its sequence set (typically, sequences of the same species in the multiple sequence alignment), unless that modal value is the gap character (“-”), or there are multiple non-gap modal values, in which case the value stays N at this position. This pre-processing step is unchanged from the original method of [24]. Then the sequences are transformed into numerical vectors where each nucleotide is represented using four numbers, such that A corresponds to [1,0,0,0], C to [0,1,0,0], G to [0,0,1,0], T to [0,0,0,1], and the gap character “-” to [0,0,0,0] (formula 1 and 2 of [22]). Additionally, we have added support for ambiguous IUPAC characters, each of which corresponds to a vector in which the different possibilities have the same weight and sum to 1 (e.g., B may correspond to C, G, or T and is represented as [0,1/3,1/3,1/3]).
The numerical representations of each sequence are then grouped by set into matrices where each row is the numerical vector of a single sequence. Thus, sequence set i is represented as a matrix $s_i$ of size $m \times 4s$, where m is the number of sequences in the sequence set and s is the sequence length, in nucleotides.
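The pre-processing described above can be sketched in a few lines. This is an illustrative reading of the text rather than pyKleeBarcode's own code, and it rests on assumptions that the text leaves open: N characters are ignored when counting the modal character, any tie leaves N in place, and a remaining N is encoded like any other IUPAC code (equal weights over A, C, G, T). Function names are ours.

```python
from collections import Counter

import numpy as np

BASES = "ACGT"
# Standard IUPAC nucleotide codes and the bases they may stand for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def impute_n(seqs):
    """Replace N by the unique, non-gap modal character of its column within the set."""
    seqs = [s.upper() for s in seqs]
    imputed = [list(s) for s in seqs]
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs if s[pos] != "N")
        if not counts:
            continue                      # column made of N only: nothing to impute
        best = max(counts.values())
        modes = [char for char, n in counts.items() if n == best]
        if len(modes) == 1 and modes[0] != "-":
            for row in imputed:
                if row[pos] == "N":
                    row[pos] = modes[0]
    return ["".join(row) for row in imputed]


def encode(seq):
    """Encode a sequence as a flat vector of 4 numbers per position (gap = zeros)."""
    vec = np.zeros((len(seq), 4))
    for k, char in enumerate(seq.upper()):
        if char == "-":
            continue
        for base in IUPAC[char]:
            vec[k, BASES.index(base)] = 1.0 / len(IUPAC[char])
    return vec.ravel()


def set_matrix(seqs):
    """Matrix s_i of one sequence set: one row per sequence, 4 columns per position."""
    return np.vstack([encode(s) for s in impute_n(seqs)])


s_i = set_matrix(["ACGT", "ANGT", "AC-B"])
print(s_i.shape)
```

In this toy set the N of the second sequence is imputed as C, the gap of the third sequence is encoded as four zeros, and B is spread evenly over C, G and T.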
This series of pre-processing steps is represented in Fig 1. The steps are performed either once, or at the beginning of step 1 and step 2 when these are run as independent executables.
Step1: Reference matrix computation
This step corresponds to the computation of a matrix representation of the diversity of a DNA sequence across several individuals or sets of individuals (typically, species) that will serve as a reference against which individual sequences will later be contrasted (see step 2). After sequences have been transformed into numerical vectors and grouped into matrices $s_i$ as described in the pre-processing steps above and the top part of Fig 1, the reference matrix R is obtained by computing, for each sequence set i:
$$S_i = s_i^T s_i$$
and then summing all $S_i$ matrices, normalized by the number of sequences in each set $m_i$:
$$R = \sum_{i=1}^{N} \frac{S_i}{m_i}$$
See the Methods section for details on how we derive this formulation from the original formulas of [24].
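A minimal numpy sketch of this computation is given below. It assumes that each sequence set is already available as a matrix with one row per sequence and four columns per nucleotide position; the function name `reference_matrix` is ours and does not refer to pyKleeBarcode's internals.

```python
import numpy as np


def reference_matrix(set_matrices):
    """Sum of s_i^T s_i / m_i over all sequence sets (each s_i has shape (m_i, 4*s))."""
    width = set_matrices[0].shape[1]
    R = np.zeros((width, width))
    for s_i in set_matrices:
        m_i = s_i.shape[0]                # number of sequences in this set
        R += (s_i.T @ s_i) / m_i
    return R


# Two toy sets over a single nucleotide position (columns: A, C, G, T).
set_a = np.array([[1.0, 0, 0, 0], [1.0, 0, 0, 0]])   # two sequences, both "A"
set_b = np.array([[0, 1.0, 0, 0]])                   # one sequence, "C"
R = reference_matrix([set_a, set_b])
print(R)
```

Thanks to the normalization by $m_i$, the set with two identical sequences contributes as much to R as the set with a single sequence.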
Reference matrix computation has a time complexity of $O(Ns^2)$ and a memory complexity of .
When executed as an independent script, this step takes as input a multiple sequence alignment in the fasta format, and outputs the reference matrix in a simple binary format, chosen for its read/write performance.
The script is parallelized using MPI and can thus be deployed on multiple-CPU architectures. Briefly, the sequence sets are split between the processes, each of which computes a local reference matrix; the local matrices are then combined before being written to a file.
Furthermore, we provide a utility script to merge pre-existing reference matrices (provided they were computed on independent sets of sequences and that no grouping/species is shared). This allows one to update a previously obtained matrix with new sequences.
Additionally, it permits the computation of a reference matrix to be subdivided into any number of subtasks, by simply splitting the input alignment and subsequently merging the resulting matrices. This makes the computation of very large datasets tractable and easy to deploy on one or multiple HPC architectures.
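Splitting and merging work because the reference matrix is a plain sum over sequence sets, so partial matrices computed on disjoint groups of sets simply add up. The toy check below illustrates this property with numpy only; it does not use MPI or pyKleeBarcode's file format, and random matrices stand in for encoded sequence sets.

```python
import numpy as np


def partial_reference(set_matrices):
    """Local reference matrix for one chunk of sequence sets."""
    return sum((s.T @ s) / s.shape[0] for s in set_matrices)


rng = np.random.default_rng(1)
# Ten toy sets of unequal size (1 to 4 sequences), 3 nucleotide positions each.
sets = [rng.random((rng.integers(1, 5), 12)) for _ in range(10)]

full = partial_reference(sets)
chunks = [sets[0:3], sets[3:7], sets[7:10]]      # e.g. one chunk per process or subtask
merged = sum(partial_reference(chunk) for chunk in chunks)

print(np.allclose(full, merged))
```

The same additivity explains the condition stated above: a sequence set must not be shared between two matrices being merged, since its sequences would otherwise be normalized separately in each part.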
Step2: Indicator vector computation
This step corresponds to the computation of indicator vectors for each sequence or sets of sequences provided.
As per [24], the indicator vector of a sequence set is defined as the eigenvector with the largest positive eigenvalue of a matrix contrasting the sequence set against all other sequence sets. Using the definitions provided above, for sequence set i this matrix is computed as:
$$\frac{N}{N-1} \frac{S_i}{m_i} - \frac{1}{N-1} R$$
See the Methods section for details on how we derive this formulation from the original formulas of [24].
Indicator vector computation has a time and a memory complexity of $O(Ns^2)$. When executed as an independent script, it takes as input a multiple sequence alignment in the fasta format and the reference matrix (as obtained from the previous step), and it outputs indicator vectors in a csv format file.
The computation of each individual indicator vector being independent from the rest, the parallelization of this script using MPI is fairly straightforward.
As with step 1, it is possible to further split the computations into many subtasks by splitting the input alignment before invoking pyKleeBarcode. The resulting files are then merged by simple concatenation.
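The computation of one indicator vector can be sketched as follows. The contrast matrix used here, $\frac{N}{N-1} \frac{S_i}{m_i} - \frac{1}{N-1} R$, is our reading of the method and should be taken as an assumption, as should the function name; `numpy.linalg.eigh` is used for simplicity because the matrix is symmetric.

```python
import numpy as np


def indicator_vector(s_i, R, n_sets):
    """Top eigenpair of the matrix contrasting set i against the reference matrix R."""
    m_i = s_i.shape[0]
    S_i = s_i.T @ s_i
    # Assumed form of the contrast matrix (see the lead-in above).
    contrast = n_sets / (n_sets - 1) * S_i / m_i - R / (n_sets - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(contrast)
    return eigenvalues[-1], eigenvectors[:, -1]


rng = np.random.default_rng(2)
sets = [rng.random((3, 8)) for _ in range(4)]          # stand-ins for encoded sets
R = sum((s.T @ s) / s.shape[0] for s in sets)          # reference matrix
top_value, v = indicator_vector(sets[0], R, n_sets=len(sets))
print(top_value, np.linalg.norm(v))
```

Each call only needs the data of one set and the shared matrix R, which is why the sets can be distributed freely between processes. The returned vector has unit length and its sign is arbitrary.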
Step3: Structure matrix computation
In this step a structure matrix, containing the correlation between the indicator vectors of different sets of sequences, is computed.
In pyKleeBarcode it may be computed in two fashions: either in one go, following the formulation of [24], with a single multiplication of a matrix in which each row is an indicator vector by its transpose, or in multiple steps, where each step computes one line of the structure matrix.
The first option is faster (although both options have the same time complexity) but more memory-intensive, because the whole structure matrix needs to be held in memory at once, while the second one is slower but requires less memory.
The switch between these two options is, by default, set at 5,000 indicator vectors, corresponding to about 800 MB of memory, and can be modified to suit the available resources.
Structure matrix computation has a time complexity of $O(Ns)$ and a memory complexity of $O(N^2 s)$ when the number of indicator vectors is small, and $O(Ns)$ otherwise.
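The two strategies can be contrasted with a small numpy sketch. It is a toy illustration, with random unit vectors standing in for indicator vectors and function names of our choosing; it does not reproduce pyKleeBarcode's switching logic or output format.

```python
import numpy as np


def structure_matrix_full(V):
    """All pairwise inner products at once; holds the full N x N result in memory."""
    return V @ V.T


def structure_matrix_rows(V):
    """Yield row i of the lower triangle (entries 0..i), one row at a time."""
    for i in range(V.shape[0]):
        yield V[: i + 1] @ V[i]


rng = np.random.default_rng(3)
V = rng.normal(size=(6, 16))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit-length rows, like eigenvectors

full = structure_matrix_full(V)
rows = list(structure_matrix_rows(V))
print(all(np.allclose(row, full[i, : i + 1]) for i, row in enumerate(rows)))
```

In the row-wise variant, each row can be written out as soon as it is computed, so only one row needs to be kept in memory in addition to the indicator vectors.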
This step takes as input the set of indicator vectors, in csv format, for which to compute pair-wise correlations; and it outputs the structure matrix (in a binary format of its lower triangular portion).
While the previous steps could be parallelized and split into subtasks in a straightforward manner, the structure matrix computation is a more complicated affair, because it must look at all pairs of indicator vectors.
Nevertheless, we have devised an algorithm, and provide the corresponding script, to update an existing structure matrix with new indicator vectors (and thus, new sequences).
In fact, as structure matrices grow quadratically in size with the number of sequence sets they represent (about 38 GB for 100,000 sequence sets), the structure matrix file format has been explicitly devised with the goal of allowing an update without having to read, or even rewrite, the whole file; the new information is merely appended to it.
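One way to see why appending suffices is to store the lower triangle row by row: the rows of new sequence sets then come after all existing entries, which are left untouched. The sketch below illustrates the idea in memory with numpy. The row-major layout, the inclusion of the diagonal and the 8-byte values are assumptions made for the example and may differ from pyKleeBarcode's actual file format.

```python
import numpy as np


def packed_lower_triangle(V):
    """Row-major lower triangle (diagonal included) of V @ V.T as a flat array."""
    return np.concatenate([V[: i + 1] @ V[i] for i in range(V.shape[0])])


def append_rows(packed, V_old, V_new):
    """Extend a packed lower triangle with the rows of the new vectors only."""
    V_all = np.vstack([V_old, V_new])
    n_old = V_old.shape[0]
    new_rows = [V_all[: i + 1] @ V_all[i] for i in range(n_old, V_all.shape[0])]
    return np.concatenate([packed] + new_rows)


def packed_size_bytes(n_sets, bytes_per_value=8):
    """Size of the packed triangle, assuming 8-byte floats (an assumption)."""
    return n_sets * (n_sets + 1) // 2 * bytes_per_value


rng = np.random.default_rng(4)
V = rng.normal(size=(6, 16))                       # stand-ins for indicator vectors
updated = append_rows(packed_lower_triangle(V[:4]), V[:4], V[4:])
print(np.allclose(updated, packed_lower_triangle(V)))
print(packed_size_bytes(100_000) / 1e9)            # size in GB under these assumptions
```

Under these assumptions, 100,000 sequence sets require roughly 40 GB, which is of the same order as the size quoted above.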
Overall, throughout our various benchmarks and experiments, step 2 has consistently been the step that took most of the computational time (usually between 80% and 90%). When using a single MPI process, pyKleeBarcode has a performance of the same order of magnitude as the original Matlab code (see S1 Fig); the use of several MPI processes reduces runtime thanks to the increased amount of computational resources (see S2 Fig).
Impact of the reference matrix on the indicator vector and structure matrix
Fig 2A presents a view of the structure matrix of the mammalian dataset obtained with the whole dataset as reference matrix. To help interpretation, rows and columns have been ordered and annotated with taxonomic groups retrieved from the NCBI taxonomy [34], whose hierarchical structure is displayed in Fig 2B. It should be noted that the ordering of groups inside the same taxonomic unit is arbitrary (e.g., the appearance of Carnivora next to Artiodactyla in Laurasiatheria). In this structure matrix, the correlation of indicator vectors is higher among closely related sequences within the same lower taxonomic unit (such as Equidae or Pecora). However, between less closely related groups (i.e., off the structure matrix diagonal) the correlation fluctuates between 0.2 and 0.4. This effect is related to the genetic saturation of the studied COI5-P sequence (see S3 Fig). For reference, on this dataset of 1049 sequences of 384 bp, steps 1, 2 and 3 took respectively 5, 30 and 1 seconds and used 95, 162, and 92 megabytes of memory on an Intel i7-8665U CPU (for the purpose of this test each sequence was kept separate as its own set).
Fig 2. A. View of the structure matrix of the mammalian dataset and taxonomic structure of Mammalia. B. Phylogenetic tree structure of the taxonomic groups retrieved from NCBI taxonomy.
To assess the impact of an incomplete reference matrix on the indicator vectors and structure matrix, we designed three experiments as described in the Methods section. The “primates-only” experiment serves to assess a case of extrapolation: computing indicator vectors on sequences which are outgroups to the ones included in the reference matrix. In contrast, the “no-Laurasiatheria” experiment explores a case of interpolation, albeit an extreme one where an entire super-order is missing. Finally, the “missingXX%” experiment addresses mixed cases of extrapolation and interpolation but where no large taxonomic group is missing.
Fig 3A presents the absolute Pearson correlation values between the indicator vectors obtained with a reference matrix containing all sequences and those obtained with a reference matrix containing the primate sequences only (“primates-only” experiment). All values are above 0.980, despite the “primates-only” reference matrix containing less than 12% (120/1049) of the sequences, all coming from the same subgroup of the data. Interestingly, the indicator vectors of the primate sequences themselves appear to be among the most affected: primates present the lowest average correlation value of all the main taxa and contain the vector with the overall lowest correlation value. Fig 4A reinforces this impression, as it shows that values in the structure matrix obtained from the primates-only reference are highly correlated with the ones obtained with the full reference, but with a notable decrease for values describing the similarity between primates and other groups. This is suggestive of a behaviour where the effect of extrapolation (here, computing a structure matrix including non-primates on a reference matrix containing primate sequences only) mostly affects the displayed relationship between the extrapolated sequences and the sequences used for building the reference matrix, rather than between the different extrapolated sequences.
Fig 3. A. Absolute Pearson correlation between indicator vectors obtained on the complete and the primates-only reference matrices. B. Absolute Pearson correlation between indicator vectors obtained on the complete and the no-Laurasiatheria reference matrices. C. Evolution of absolute Pearson correlation between indicator vectors obtained on the complete and a limited reference matrix with an increasing percentage of randomly missing sequences. Each boxplot corresponds to a single random replicate.
Fig 4. A. Comparison of the primates-only structure matrix and the reference one. B. Comparison of the no-Laurasiatheria structure matrix and the reference one.
The results of the “no-Laurasiatheria” experiment, presented in Fig 3B, also show very good correlations with the full reference matrix, with all values being above 0.997. The correlations of the sequences included in the reference matrix (i.e., the non-laurasiatherians) are on average lower than those of the excluded ones (i.e., laurasiatherians); however, while a Mann-Whitney U test yields a small p-value ($<10^{-12}$), the actual difference in medians is only about 0.0001. Fig 4B also does not exhibit a pattern which specifically differentiates Laurasiatherians from others when comparing the structure matrices. Thus, it would seem that “interpolation” of new sequences in a structure matrix does not specifically affect the new sequences, even when they are from a group which is entirely missing from the reference matrix.
Fig 3C shows the results of the “missingXX%” experiment, where sequences are missing at random from the reference matrix. We see an expected pattern of general decrease in absolute correlation as the percentage of missing sequences increases, albeit all values stay above 0.9998 until 60% of sequences are missing and remain above 0.998 even when 90% of the sequences are missing. We also observe that the variation between replicates is limited, showing that this trend is robust to random sampling effects.
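The comparison metric used in these experiments can be sketched as follows. An eigenvector is only defined up to its sign, which is presumably why absolute correlation values are reported; this interpretation, the function name and the synthetic vectors below are ours.

```python
import numpy as np


def abs_pearson(u, v):
    """Absolute Pearson correlation between two vectors."""
    return abs(np.corrcoef(u, v)[0, 1])


rng = np.random.default_rng(5)
v_full = rng.normal(size=32)                          # stand-in: vector from the full reference
v_partial = -(v_full + 0.01 * rng.normal(size=32))    # nearly the same direction, flipped sign
print(abs_pearson(v_full, v_partial))
```

With the absolute value, two vectors that differ only by a sign flip are treated as identical, so the metric reflects the direction of the indicator vector rather than an arbitrary orientation.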
Discussion
PyKleeBarcode builds on previous work on indicator vectors and opens several new possibilities, such as extended handling of ambiguous nucleotide characters but, more importantly, the ability to have different numbers of sequences per group, as well as the efficient integration of new results into a pre-existing structure matrix, something hitherto not possible. As our experiments have shown, this has only minimal impact on the computed indicator vectors and similarities between sequences, except in the direst of scenarios, such as in the primates-only experiment, or when the percentage of new sequences is above 50%. Consequently, our method is appropriate for integrating a growing set of sequences, continuously adding new specimens to see where they fit in the existing structure, which would only need to be entirely recomputed for major version releases, for example.
Considering PyKleeBarcode, and the indicator vectors approach in general, in the context of other methods used to extract insight from homologous sequences, it first presumes that a multiple sequence alignment has already been obtained, using MAFFT [32, 33] for example, in order to establish homology nucleotide by nucleotide. Thus, in an analysis pipeline it would come after tools used to establish a distance-based homology between a single sequence and among a larger database, such as BLAST [35], mapping tools such as bwa [36] or minimap2 [37], or dedicated software and databases (such as SiLiX [38] and HOGENOM [39] for instance), which are not the most appropriate or efficient to compute all pairwise distances between already aligned sequences.
As we noted above our approach can be contrasted with phylogenetic tree reconstruction methods: pyKleeBarcode does not create an explicit evolutionary scenario, but only a distance matrix. Consequently, pyKleeBarcode is robust to cases where sequences do not follow a strictly tree-like evolutionary scenario and is not hindered by the unfavorable combinatorics of the number of possible tree topologies. In contrast, likelihood-based tree reconstruction methods must constantly engage in compromises between result approximation and runtime to completion.
Furthermore, the distance matrix produced by pyKleeBarcode may be used as the basis to construct a tree, using a distance-based method such as Neighbor-Joining [40] or UPGMA [41]. PyKleeBarcode may be used in complement with the results of likelihood-based methods to point out discrepancies that suggest reticulated evolution or convergences.
The indicator vector approach is most closely related with pairwise-distance methods and shares some of their limitations such as a sensitivity to genetic saturation. The main specificity of our approach relies on the elegant approach of [23] to conserve the richness of sequence diversity information when grouping sequences at various taxonomic levels rather than relying on a single consensus sequence per group.
Some properties of indicator vectors are still unclear. For example, the contrastive nature of the approach should make them robust against heterogeneities in the mutation patterns of sequences, but this remains to be tested. Similarly, it would be of interest to investigate potential biases introduced by sampling differences between studied clades.
With pyKleeBarcode we propose a flexible and up-to-date interface to compute indicator vectors and structure matrices from a multiple sequence alignment in a manner adapted to the size of modern datasets and large computer infrastructures.
We anticipate that pyKleeBarcode will help undertake deeper analyses of biological sequence databases, including BOLD, and allow new insights into large scale features of extant life.
Supporting information
S1 File.
Supporting information for “PyKleeBarcode: Enabling representation of the whole animal kingdom in information space”. Appendix A: Benchmarking of pyKleeBarcode against the previous Matlab implementation, and Appendix B: Investigating the saturation of COI5-P during mammalian evolution.
https://doi.org/10.1371/journal.pone.0286314.s001
(PDF)
S1 Fig.
Evolution of execution time (A) and peak RAM usage (B) for the computation of a structure matrix with the number of DNA sequences for different implementations.
https://doi.org/10.1371/journal.pone.0286314.s002
(TIF)
S2 Fig. Estimation of the speedup of pyKleeBarcode achieved with 4 MPI processes.
The red line represents the average on a rolling window of 20 points.
https://doi.org/10.1371/journal.pone.0286314.s003
(TIF)
S3 Fig. Evolution of COI5-P Hamming distances with divergence time among mammals.
The red line corresponds to a rolling average over a 20-million-year window.
https://doi.org/10.1371/journal.pone.0286314.s004
(TIF)
Acknowledgments
We thank Brian Abbott, Jesse Ausubel, Jacqueline Faherty, and Mark Stoeckle for valuable discussions. Calculations were performed at the sciCORE (http://scicore.unibas.ch/) scientific computing center at the University of Basel, with support from the SIB (Swiss Institute of Bioinformatics). The title is a reference to Stewart Brand’s question “Why haven’t we seen a photograph of the whole earth yet?” (https://en.wikipedia.org/wiki/The_Blue_Marble).
References
- 1. Cobb M. 60 years ago, Francis Crick changed the logic of biology. PLoS biology 2017, 15(9):e2003243. pmid:28922352
- 2. Crick FH. On protein synthesis. Symp Soc Exp Biol 1958, 12:138–163. pmid:13580867
- 3. Zuckerkandl E, Pauling L. Molecular Disease, Evolution, and Genic Heterogeneity. In: Horizons in Biochemistry: Albert Szent-Györgyi Dedicatory Volume. Edited by Kasha M, Pullman B: Academic Press; 1962: 189–225.
- 4. Zuckerkandl E. Fifty-year old and still ticking… an interview with Emile Zuckerkandl on the 50th anniversary of the molecular clock. Interview by Giacomo Bernardi. Journal of molecular evolution 2012, 74(5–6):233–236. pmid:22739996
- 5. Koonin EV. A half-century after the molecular clock: new dimensions of molecular evolution. EMBO reports 2012, 13(8):664–666. pmid:22791022
- 6. Kimura M, Crow JF. The Number of Alleles That Can Be Maintained in a Finite Population. Genetics 1964, 49:725–738. pmid:14156929
- 7. Kimura M. DNA and the neutral theory. Philosophical transactions of the Royal Society of London 1986, 312(1154):343–354. pmid:2870526
- 8. Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 1977, 267(5608):275–276. pmid:865622
- 9. Parvathy ST, Udayasuriyan V, Bhadana V. Codon usage bias. Mol Biol Rep 2022, 49(1):539–565. pmid:34822069
- 10. Woese CR, Kandler O, Wheelis ML. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 1990, 87:4576–4579.
- 11. Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences of the United States of America 1977, 74(11):5088–5090. pmid:270744
- 12. Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA. Microbial ecology and evolution: a ribosomal RNA approach. Annual review of microbiology 1986, 40:337–365. pmid:2430518
- 13. Avise JC, Arnold J, Ball RM, Bermingham E, Lamb T, Neigel JE, et al. Intraspecific phylogeography: the mitochondrial bridge between population genetics and systematics. Ann Rev Ecol Syst 1987, 18:489–522.
- 14. Moore WS. Inferring phylogenies from mtDNA variation: Mitochondrial-gene trees versus nuclear-gene trees. Evolution 1995, 49:718–726. pmid:28565131
- 15. Ausubel JH. A botanical macroscope. Proceedings of the National Academy of Sciences of the United States of America 2009, 106(31):12569–12570. pmid:19666620
- 16. Hebert PD, Ratnasingham S, deWaard JR. Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc Biol Sci 2003, 270 Suppl 1:S96–99. pmid:12952648
- 17. Stoeckle MY, Hebert PD. Barcode of life. Sci Am 2008, 299(4):82–86, 88. pmid:18847089
- 18. Thaler DS, Stoeckle MY. Bridging two scholarly islands enriches both: COI DNA barcodes for species identification versus human mitochondrial variation for the study of migrations and pathologies. Ecology and Evolution 2016, 6:6824–6835. pmid:28725363
- 19. Ratnasingham S, Hebert P. bold: The Barcode of Life Data System (http://www.barcodinglife.org). Mol Ecol Notes. 2007 May 1;7(3):355–364.
- 20. Hebert PD, Hollingsworth PM, Hajibabaei M. From writing to reading the encyclopedia of life. Philosophical transactions of the Royal Society of London 2016, 371(1702).
- 21. Chu G, Warnow T. SCAMPP+FastTree: improving scalability for likelihood-based phylogenetic placement. Bioinformatics Advances 2023, 3(1):vbad008. pmid:36818728
- 22. Soucy S, Huang J, Gogarten J. Horizontal gene transfer: building the web of life. Nat Rev Genet 2015, 16:472–482.
- 23. Sirovich L, Stoeckle MY, Zhang Y. A scalable method for analysis and display of DNA sequences. PLoS One 2009, 4(10):e7051. pmid:19798412
- 24. Sirovich L, Stoeckle MY, Zhang Y. Structural analysis of biodiversity. PLoS One 2010, 5(2):e9266. pmid:20195371
- 25. Stoeckle M, Coffran C. TreeParser-Aided Klee Diagrams Display Taxonomic Clusters in DNA Barcode and Nuclear Gene Datasets. Sci Rep 2013, 3:2635. pmid:24022383
- 26. von Beeren C, Stoeckle M, Xia J, Burke G, Kronauer DJC. Interbreeding among deeply divergent mitochondrial lineages in the American cockroach (Periplaneta americana). Sci Rep 2015, 5, 8297. pmid:25656854
- 27. Raupach MJ, Astrin JJ, Hannig K, Peters MK, Stoeckle MY, Wägele JW. Molecular species identification of Central European ground beetles (Coleoptera: Carabidae) using nuclear rDNA expansion segments and DNA barcodes. Front Zool 2010, 7, 26. pmid:20836845
- 28. Raupach MJ, Barco A, Steinke D, Beermann J, Laakmann S, Mohrbeck I, et al. The Application of DNA Barcodes for the Identification of Marine Crustaceans from the North Sea and Adjacent Regions. PLoS One. 2015 Sep 29;10(9):e0139421 pmid:26417993
- 29. Modica MV, Puillandre N, Castelin M, Zhang Y, Holford M. A good compromise: rapid and robust species proxies for inventorying biodiversity hotspots using the Terebridae (Gastropoda: Conoidea). PLoS One. 2014 Jul 8;9(7):e102160. pmid:25003611; PMCID: PMC4086986.
- 30. Stoeckle M, Thaler D. Why Should Mitochondria Define Species? Human Evolution 2018, 33:1–30.
- 31. Stoeckle MY, Thaler DS. DNA barcoding works in practice but not in (neutral) theory. PLoS One 2014, 9:e100755. pmid:24988408
- 32. Katoh K, Misawa K, Kuma KI, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research 2002, 30(14):3059–66. pmid:12136088
- 33. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution 2013, 30(4):772–80. pmid:23329690
- 34. Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020 Jan 1;2020:baaa062. [cited 2023 Jan 4] pmid:32761142
- 35. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990, 215:403–410. pmid:2231712
- 36. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 2009, 25:1754–60. pmid:19451168
- 37. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34:3094–3100 pmid:29750242
- 38. Miele V, Penel S, Duret L. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics 2011, 12, 116 pmid:21513511
- 39. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, et al. Databases of homologous gene families for comparative genomics. BMC Bioinformatics 2009, 10 (Suppl 6):S3 pmid:19534752
- 40. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 1987, 4(4): 406–425 pmid:3447015
- 41. Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 1958, 38:1409–1438