A New Method to Reconstruct Recombination Events at a Genomic Scale

doi:10.1371/journal.pcbi.1001010

Figure 1.

Scheme of the recombination detection process for one run of the algorithm.

(A) Input dataset of 10 sequences and 83 SNPs. Colors on sequences represent similar patterns of SNPs, and a change of color along a sequence represents the signal of past recombination events. (B) Recoded matrix. The patterns of SNPs within a column of grain size n (10 SNPs in this example) have been recoded into numbers. Sequences that share the same pattern within a column are assigned the same number. Numbers are not comparable between columns: the same number in two different columns represents completely different patterns. Unique patterns are assigned the number zero and are not considered. (C) Trees one, two and three, constructed from the recoded matrix. Going from left to right, the recoded matrix is segmented into sets of compatible [30] columns of patterns. Compatibility of columns is checked using a variant of the four gamete test [31] for multi-allelic markers. Each segment is represented as a tree in which the leaf nodes contain the sequences analyzed and the edges contain the patterns inherited, similar to point mutations. Recurrence is not allowed. (D) Networks 1–2 and 2–3, constructed by merging consecutive trees one, two and three pairwise. All the information contained in the two original trees is present in the compatible network. Recombinant sequences are leaf nodes descending from nodes that have two parents, which means that they have inherited patterns from two different nodes (similar to an Ancestral Recombination Graph). (E) Information saved for each detected recombination event: the recombinant sequences and the start and end positions of the network. For a more detailed description of the algorithm see [12]. The recombination event shown in red is studied further in Figure 2.
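
The recoding step in panel (B) can be illustrated with a minimal Python sketch. It is not the IRiS implementation: the function name `recode`, the representation of sequences as equal-length allele strings, and the choice to number shared patterns in order of first appearance are all assumptions made for illustration.

```python
from collections import Counter


def recode(sequences, grain):
    """Recode blocks of `grain` consecutive SNPs into per-block pattern labels.

    Hypothetical helper (not from the paper). `sequences` is a list of
    equal-length strings, one character per SNP allele.
    Returns one list of integer labels per sequence, one label per block.
    """
    n_snps = len(sequences[0])
    recoded = [[] for _ in sequences]
    for start in range(0, n_snps, grain):
        # Pattern carried by every sequence inside this block of SNPs.
        patterns = [seq[start:start + grain] for seq in sequences]
        counts = Counter(patterns)
        labels = {}  # labels restart in every block, so they are block-local
        for row, pattern in zip(recoded, patterns):
            if counts[pattern] == 1:
                row.append(0)  # unique pattern: coded as zero and ignored later
            else:
                # Shared patterns get 1, 2, ... in order of first appearance.
                row.append(labels.setdefault(pattern, len(labels) + 1))
    return recoded


# Toy example: 4 sequences, 4 SNPs, grain size 2.
toy = ["0011", "0010", "0111", "0011"]
print(recode(toy, grain=2))  # [[1, 1], [1, 0], [0, 1], [1, 1]]
```

Because labels restart in every block, a 1 in the first column and a 1 in the second column refer to unrelated patterns, which matches the statement that numbers are not comparable between columns.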

Figure 2.

Scheme of the recombination detection process integrating 10 runs of the algorithm.

The analyzed dataset is the one shown in Figure 1. (A) Integration of the information from 10 runs regarding the recombination event of sequence 5. For each run of the algorithm, the start and end positions of the network in which the recombination is detected are saved. The size of the first column varies from run to run (10, 1, 2, 3… up to 9), so the number of runs corresponds to the grain size. At the end, each recombination event has a set of intervals in which it was detected, which can be represented graphically as a distribution. The maximum interval is the region in which the recombination has been seen the maximum number of times. The midpoint of the maximum interval is defined as the estimated breakpoint position. The threshold indicates the number of times a recombination has to be detected to be considered true. The intersection between the threshold and the detection distribution defines the threshold interval, in which the algorithm guarantees that the recombination event is located. (B) Integration of the information from all detections for the 10 runs of the algorithm. Each line represents a set of sequences in which the same recombination event has been detected; the distribution of the line shows the number of times the event has been detected along the sequence. (C) Final output of the algorithm: breakpoint position in the first row, the recotypes in rows and the recombination events detected in columns. The presence of a particular recombination event in a particular sequence is represented as a 1, and its absence as a 0. Note that the recotypes represent exactly the coloring of the sequences in Figure 1 and that only recombinations whose distribution rose above the threshold are represented in the recotypes.
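
The integration in panel (A) can be sketched in Python as follows. This is an illustrative reading of the caption, not the IRiS code: the function name `integrate_runs`, the use of inclusive 0-based positions, and the assumption that the positions at the maximum (and above the threshold) form a single contiguous interval are choices made for this example.

```python
def integrate_runs(intervals, n_positions, threshold):
    """Combine per-run detection intervals for one recombination event.

    Hypothetical helper (not from the paper). `intervals` holds one
    (start, end) pair, inclusive and 0-based, for every run in which the
    event was detected. Returns None when the event never reaches
    `threshold` detections.
    """
    # Detection distribution: number of runs covering each position.
    counts = [0] * n_positions
    for start, end in intervals:
        for pos in range(start, end + 1):
            counts[pos] += 1

    peak = max(counts)
    if peak < threshold:
        return None  # not detected often enough to be considered true

    # Assumes each set of positions is contiguous (a single-peaked distribution).
    at_peak = [p for p, c in enumerate(counts) if c == peak]
    above = [p for p, c in enumerate(counts) if c >= threshold]
    max_interval = (at_peak[0], at_peak[-1])
    return {
        "max_interval": max_interval,
        "threshold_interval": (above[0], above[-1]),
        # Estimated breakpoint: midpoint of the maximum interval.
        "breakpoint": (max_interval[0] + max_interval[1]) / 2,
    }


# Toy example: three runs detect the same event in overlapping networks.
result = integrate_runs([(10, 30), (15, 35), (20, 40)], n_positions=50, threshold=2)
print(result)
# {'max_interval': (20, 30), 'threshold_interval': (15, 35), 'breakpoint': 25.0}
```

In the toy example, all three runs overlap between positions 20 and 30, so that region is the maximum interval, while the wider interval from 15 to 35 is covered by at least two runs and forms the threshold interval.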

Figure 3.

Distribution of the number of detections using the optimal method.

Each line represents the distribution of detections for a particular recombination event. The dataset corresponds to one COSI simulation. Only those recombinations reaching the threshold are considered true events. The peak of each distribution locates the breakpoint position of that recombination event along the sequence. The optimal method (grains 20, 10 and 5, forward and reverse, with a threshold of 42) creates narrower maximal intervals in the detection distributions than grain 10 alone.

Figure 4.

Values of the aggregate Z scores for different settings.

Z scores were calculated over the mean values, across 100 simulations, of the false discovery rate, the sensitivity and the 90th percentile of the distance between the inferred breakpoint and the true position. Different colored lines represent different methods; the numbers in the legend indicate the grain size used and whether more than one grain size is combined. All methods are run using a sliding window, in both forward and reverse directions. Different thresholds are represented along the X axis. The threshold is defined as the number of detections required for an event to be considered true, divided by the number of runs of the algorithm.
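
As a rough illustration of the two quantities in this caption, the Python sketch below computes a threshold as a fraction of runs and combines three metrics into one aggregate Z score. Both functions are hypothetical. The run count assumes one run per sliding-window offset (so as many runs as the grain size) for each grain and each direction, and the aggregation rule (sensitivity counted positively, false discovery rate and distance counted negatively, population standard deviation) is an assumption, since the caption does not state how the Z scores were combined.

```python
from statistics import mean, pstdev


def threshold_fraction(min_detections, grains, both_directions=True):
    """Detections required, divided by the total number of runs.

    Assumes one run per window offset for every grain size, doubled when
    the algorithm is run forward and reverse (an assumption, see lead-in).
    """
    runs = sum(grains) * (2 if both_directions else 1)
    return min_detections / runs


def z_scores(values):
    """Standardize values across settings (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def aggregate_z(fdr, sensitivity, distance_p90):
    """One score per setting; higher is better under the assumed signs."""
    z_fdr, z_sens, z_dist = z_scores(fdr), z_scores(sensitivity), z_scores(distance_p90)
    return [s - f - d for f, s, d in zip(z_fdr, z_sens, z_dist)]


# Grains 20, 10 and 5, forward and reverse: 2 * (20 + 10 + 5) = 70 runs.
print(threshold_fraction(42, [20, 10, 5]))  # 0.6

# Made-up metric values for three settings, for illustration only.
scores = aggregate_z(
    fdr=[0.1, 0.2, 0.3],
    sensitivity=[0.5, 0.6, 0.7],
    distance_p90=[10, 20, 30],
)
print(scores)
```

With these made-up values the first setting scores highest: its lower sensitivity is outweighed by its lower false discovery rate and smaller breakpoint distance.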

Figure 5.

Sensitivity of the optimal method to detect recombinations depending on age.

Results plotted are averages over 100 simulations. The black curve depicts how the sensitivity of IRiS varies with the age of the recombination events (in bins of 500 generations) and follows the left axis. The two gray curves represent the number of recombination events generated by COSI and the number detected by IRiS, and follow the right axis.

Figure 6.

Recombination rates inferred from sperm typing, LD-based methods and IRiS on the MS32 region.

(A) Recombination rates inferred from sperm typing; adapted from the figures in [17], in which recombination rates are calculated through sperm typing. (B) Recombination rate inferred by LDhat. (C) Number of recombination events detected by IRiS using the optimal method. Recombination rates inferred in (A) are based on a single individual, whereas those inferred in (B) and (C) are based on the same population data. Position zero marks the location of the minisatellite MS32.

Figure 7.

Sensitivity of the optimal method evaluated in silico.

The plot shows the number of times that in silico recombination events along the sequence were detected by IRiS, depending on the breakpoint location. Different colors, from light gray to black, indicate different ways of producing the recombinant sequence: “random” indicates that parental haplotypes were taken at random; “1dif near bkp” indicates that parental sequences had to be different near the breakpoint region (±10 SNPs); “2 dif near bkp” indicates that parental sequences had to be different near the breakpoint region on both sides of the breakpoint; and “unique” indicates that the parental sequences had to be different near the breakpoint region and the recombinant sequence had to be unique within the breakpoint region. Below, the recombination rate estimated by LDhat is shown, following the right axis.

Figure 8.

Nucleotide and recombination diversity.

Values were calculated for each of the populations, based on both haplotypes and recotypes, for the 18 regions. Values of recombination diversity have been multiplied by 100 to make them comparable.

Figure 9.

First and second components of the Principal Components Analysis.

Only recombinations present in at least two individuals were included in the analysis. The first component explained 18.03% of the variance and the second component 14.53%.
