Fig 1.

Topological data analysis.

(A) Topological data analysis (TDA) aims to infer the topological features (e.g. loops and voids) of an unknown space from a finite set of sampled points. (B) Persistent homology, a tool of TDA, builds simplicial complexes (generalizations of networks that include higher-dimensional elements such as triangles and tetrahedra) by taking balls of radius ϵ centred on the sampled points. Points are connected in the simplicial complex if the corresponding balls intersect. This construction is known as the Vietoris-Rips complex. Persistent homology tracks how the topological features of Vietoris-Rips complexes change with ϵ. (C) Barcodes are suitable representations of persistent homology. Each interval in the barcode represents the range of ϵ across which a particular topological feature (for instance, a loop) is present in the inferred topology. In this figure, the barcode of the first persistent homology, which tracks the presence of loops, is shown. The two intervals in the barcode correspond to the two loops present in the original space.
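
The construction described in (B) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `gf2_rank` and `rips_betti1` and the four-point `square` sample are hypothetical, homology is taken with coefficients in GF(2) as a simplifying assumption, and two points are joined when their balls of radius ϵ intersect, i.e. when their distance is at most 2ϵ. The sketch counts independent loops (the first Betti number) of the Vietoris-Rips complex at a fixed ϵ.

```python
from itertools import combinations
from math import dist


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]  # eliminate the leading bit and keep reducing
    return len(pivots)


def rips_betti1(points, eps, metric=dist):
    """Number of independent loops of the Vietoris-Rips complex at scale eps."""
    n = len(points)
    # Assumption: balls of radius eps intersect when the distance is <= 2 * eps.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if metric(points[i], points[j]) <= 2 * eps]
    edge_index = {edge: k for k, edge in enumerate(edges)}
    # A triangle is filled in whenever all three of its edges are present.
    triangles = [t for t in combinations(range(n), 3)
                 if all(pair in edge_index for pair in combinations(t, 2))]
    # Boundary of an edge: its two endpoints. Boundary of a triangle: its three edges.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << edge_index[pair] for pair in combinations(t, 2))
          for t in triangles]
    # betti_1 = dim ker(d1) - rank(d2) = (#edges - rank(d1)) - rank(d2)
    return len(edges) - gf2_rank(d1) - gf2_rank(d2)


# Hypothetical sample: the four corners of a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
for eps in (0.4, 0.5, 0.75):
    print(eps, rips_betti1(square, eps))
```

For this sample the loop count is 0 at ϵ = 0.4 (no edges yet), 1 at ϵ = 0.5 (the four sides form a loop) and 0 at ϵ = 0.75 (the diagonals appear and the loop is filled in). A bar in a barcode records exactly such a range of ϵ over which a loop is present. Passing a different `metric`, such as a Hamming distance between sequences, would be one way to adapt the sketch to genetic data.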


Fig 2.

ARGs and condensed graphs.

Two examples of ultra-minimal ARGs and the condensed graphs that result from collapsing their unlabelled edges. The root node is marked red whereas sampled nodes are marked green. Mutations in the r-th character are indicated by m_r. Edges pointing to a recombination node are labelled with the letter P or S, depending on whether they contribute to the prefix or suffix of the recombinant sequence. Recombinant nodes are marked with the position of the recombination breakpoint. All nodes are labelled by their sequence of characters. Condensed graphs of ARGs can be embedded into m-dimensional hypercubes and their diagonals.


Fig 3.

Ultra-minimal ARGs.

Two examples of ARGs containing the minimum number of recombination events, Rmin = 3, required to explain a sample of n = 7 sequences with m = 3 segregating sites. Both ARGs are minimal ARGs. However, only the minimal ARG at the bottom is an ultra-minimal ARG.


Fig 4.

Topological ARGs.

Examples of condensed ultra-minimal ARGs (left) and their corresponding tARGs (right). In a tARG the edges are completely determined by the vertices. The topology of the resulting tARG can differ from that of the original condensed ultra-minimal ARGs.


Fig 5.

Persistent homology of a sample of genetic sequences.

Barcode and Vietoris-Rips complexes at several values of the parameter ϵ, for the sample of sequences . Only the first homology group (H1) is shown. At small ϵ the four sampled points are disconnected. Increasing ϵ leads to a loop, which appears as a single element of H1. Further increasing ϵ fills in the loop, leading to a single connected surface. An ultra-minimal ARG explaining , and the corresponding tARG are shown in Fig 2 (bottom). The barcode only captures one of the Rmin = 2 recombination events.


Fig 6.

Barcode ensemble of a sample.

(A) Schematic representation of the barcode ensemble of a genomic sample. Persistent homology is computed for each genomic interval of a partition of the sequence. Barcodes associated with different genomic intervals capture different recombination events. The union of all barcodes is the barcode ensemble. The total number of intervals in the barcode ensemble is denoted as . The partition is chosen such that is maximized. (B) Comparison between lower bounds and RMGRmin in coalescent simulations. Values of and RMG for simulated samples of 40 sequences with 12 segregating sites, sampled from a population under the coalescent model with recombination. 4,000 samples were simulated in total. The colored band represents the interdecile range, whereas the central line represents the mean. The values of and RMG are strongly correlated (Pearson’s r = 0.98, p < 10^−100). At very high recombination rates, tends to be larger than RMG, as cases where occur more frequently.


Fig 7.

Ultra-minimal ARG, first-homology barcode ensemble and reconstructed tARG of a sample of 4 sequences.

The four sampled sequences are represented by green leaf nodes in the ultra-minimal ARG depicted in (A). The ARG involves two single-crossover recombination events. Both recombination events and their genetic scales (mutational distance between recombining sequences) are correctly captured by the barcode ensemble of the samples, shown in (B). Intervals containing the location of recombination breakpoints are indicated over each bar. Persistent homology generators can be used to reconstruct the topology of the tARG, as depicted in (C). Without adding any extra sequences to the sample, the two bars are associated with the same four generators, so only the large envelope of the two loops in the tARG can be reconstructed. Adding sequences E and F to the sample (represented by blue leaf nodes in (A)) disentangles the generators of the two loops, fully reconstructing the topology of the tARG.


Fig 8.

Barcode ensemble of two divergent sexually-reproducing populations.

The case in (A) assumes the two populations are completely isolated. All recombination events present in the barcode ensemble involve genetically close parental gametes. The case in (B) considers a small migration rate between the two populations. Some of the recombination events present in the barcode ensemble involve genetically distant parental strains, leading to larger death times ϵd in the barcode ensemble. The total number of detected recombination events is similar in both cases and uniform across the entire genome. Intervals with the location of the recombination breakpoints are indicated for each recombination event, where positions refer to segregating sites.


Fig 9.

Barcode ensemble across the HLA and MS32 mini-satellite loci of the LWK population.

(A) Recombination rates (top) across a 250 kilobase region of the HLA locus according to the African-American recombination map, based on 30,000 individuals [55]. The vertical axis is in logarithmic scale. The distribution of recombination events (bottom) detected by the barcode ensemble of a sample of 90 individuals from the LWK population sequenced by the International HapMap Consortium [53] is consistent with the observed recombination rates. Note that in neutral models of evolution the number of recombination events in minimal ARGs is roughly expected to grow logarithmically with the recombination rate of the population [57]. (B) Distribution of recombination events detected by the barcode ensemble of a sample of 97 individuals from the LWK population, sequenced by the 1,000 Genomes Project Consortium [56]. The higher density of SNPs in this dataset allows for a higher resolution in the localization of recombination events as well as a higher sensitivity. (C) Density of recombination events per nucleotide against their average death time 2ϵd, for recombination events captured by the barcode ensemble in (A). Each point represents a genomic position for which the barcode ensemble detects recombination. The horizontal axis represents the average death time of the bars in the barcode ensemble that are associated with that genomic position. Events with large ϵd, corresponding to recombination events with a large mutational distance between recombining sequences, are mostly associated with regions with a low number of recombination events, as expected from neutral models of evolution [57]. (D) Recombination rates (top) across a 100 kilobase region near the MS32 mini-satellite locus according to the African-American recombination map [55]. The vertical axis is in logarithmic scale. The distribution of recombination events (bottom) detected by the barcode ensemble of a sample of 97 individuals from the LWK population sequenced by the 1,000 Genomes Project Consortium [56] is consistent with the observed recombination rates.


Fig 10.

Barcode ensemble and partially reconstructed tARG of a sample of 112 Darwin’s finches.

The barcode ensemble is shown in (A), based on 140 homozygous SNPs present in a 9 megabase scaffold. In total, 13 recombination/gene flow events are captured in the barcode ensemble, with different genetic scales. Bars are colored according to the position of the corresponding recombination breakpoint in the genome, as depicted in (C). We also indicate the number of recombination events detected at each genomic interval, as well as some of the orthologous genes present at regions where recombination events are detected. The reconstructed tARG is presented in (B). Loops in the reconstructed tARG are outlined using the same color code. We have also included leaf nodes that do not participate in any recombination event, using a nearest neighbour algorithm based on genetic distance. Edge lengths are arbitrary.


Fig 11.

Parameter estimation in models of evolution.

(A) Dependence of on the recombination rate parameter for a set of 1,000 simulations of a basic coalescent model. Each simulation consists of 200 sequences, each 30 kilobases long. The expected of the barcode ensemble grows monotonically with the recombination rate, providing a good measure of the latter. The smoothed average is shown in red. (B) Dependence of the average death time, 〈ϵd〉, on the migration rate of two divergent populations with fixed recombination and variable migration rates, based on 900 simulations. Each simulation consists of 150 sampled sequences, each 10 kilobases long. The same structure as in Fig 8 was considered for the two populations. The expected value of 〈ϵd〉 decreases monotonically with the migration rate, being informative of the latter.
