Fig 1.
(A) A genome of length L is “circularized” by taking the first half of the sequence (L/2) and concatenating it onto the end of the genome. The algorithm then splits the sequence into many shorter windows of length w. We assign each window an α value of 1, -1, or 0 based on whether it contains more Gs, more Cs, or equal quantities of both. (B) The GC skew statistic (left) plotted across the E. coli genome, with a purple dotted line showing where the original sequence ended before half of the genome was concatenated to the end. The plot on the right shows the α value for the same genome. (C) SkewIT finds the location in the genome with the greatest difference in GC skew between the first half and the second half of the genome, using a pair of sliding windows to find the greatest sum of differences between the α values for the two halves.
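The steps in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SkewIT implementation: the function names `window_alphas` and `best_split` are hypothetical, windows are taken as non-overlapping, and the score is left as a raw, unnormalized difference because the caption does not specify a normalization.

```python
def window_alphas(seq, w):
    """Return one alpha value per non-overlapping window of length w.

    The genome is first "circularized" by appending its first half.
    alpha is 1 if the window has more Gs, -1 if more Cs, 0 if equal.
    """
    seq = seq.upper()
    extended = seq + seq[: len(seq) // 2]
    alphas = []
    for start in range(0, len(extended) - w + 1, w):
        window = extended[start:start + w]
        g, c = window.count("G"), window.count("C")
        alphas.append((g > c) - (g < c))
    return alphas


def best_split(alphas, n_windows):
    """Slide a pair of adjacent half-genome windows along the alpha values.

    n_windows is the number of windows covering one full genome length.
    Returns (start_index, score) for the placement where the two halves
    differ most. The raw score is an assumption of this sketch.
    """
    half = n_windows // 2
    best_i, best_score = 0, -1
    for i in range(len(alphas) - 2 * half + 1):
        first = sum(alphas[i:i + half])
        second = sum(alphas[i + half:i + 2 * half])
        score = abs(first - second)
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score


# Toy genome: a G-rich half followed by a C-rich half.
toy = "G" * 40 + "C" * 40
alphas = window_alphas(toy, 10)
print(alphas)                 # [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]
print(best_split(alphas, 8))  # (0, 8)
```

Appending half of the genome lets the pair of windows start at any position in the original sequence without running off the end, which is why the circularization step comes first.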
Fig 2.
This figure shows the distribution of SkewI values for the 12 bacterial genera with the greatest number of fully sequenced genomes.
Table 1.
Average SkewI values for the 12 bacterial genera with the largest number of complete genomes.
The threshold was set at 2 standard deviations below the mean.
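As a rough illustration of this rule, the threshold for a genus can be computed from its SkewI values as the mean minus two standard deviations. The function name `skewi_threshold` and the input values below are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the caption does not say which was used.

```python
import statistics


def skewi_threshold(skewi_values):
    """Return mean - 2 * standard deviation of a genus's SkewI values.

    Uses the sample standard deviation (an assumption of this sketch).
    """
    mean = statistics.mean(skewi_values)
    sd = statistics.stdev(skewi_values)
    return mean - 2 * sd


# Hypothetical SkewI values, for illustration only.
example_values = [0.8, 0.9, 1.0]
print(skewi_threshold(example_values))
```

Genomes whose SkewI falls below the returned value would be the ones flagged for closer inspection.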
Fig 3.
Escherichia skew index values.
A) SkewI for all 934 Escherichia genomes. The threshold (vertical black line) is at 0.749. B) GC-skew plots for Escherichia coli O121 strain RM8352 and Escherichia coli M8. E. coli O121 has an unusually low SkewI of 0.275, while E. coli M8 has a SkewI of 0.877, which is typical for this genus. C) Initial alignment between the two E. coli genomes revealed a large inversion. Alignment of the assembly reads revealed locations with no read coverage (red diamonds) in E. coli O121 at both ends of the inversion. D) Flipping the inversion in strain RM8352 produced a much more consistent alignment between the E. coli genomes (dot plot) and restored the GC-skew plot to a more normal appearance (shown along the y axis).
Fig 4.
A) SkewI for all 934 Burkholderia genomes. The threshold (vertical black line) is at 0.715. B) SkewI colored by chromosome. C) GC-skew plots for all three chromosomes of Burkholderia contaminans strains MS14 (left) and SK875 (right). D) Alignments between MS14 and SK875 chromosomes 1 and 2. MS14 is shown on the y axis of each plot. E) Cross-chromosome alignments between MS14 and SK875 chromosomes 1 and 2 reveal that a 1.7 Mbp region of MS14 chromosome 1 actually belongs to chromosome 2. Similar matches in MS14 chromosome 2 suggest two regions that belong in chromosome 1. F) We rearranged and inverted the sequences of MS14 chromosomes 1 and 2 based on the alignments and GC-skew plots. G) The final alignment of the MS14 chromosomes with those of B. contaminans SK875.
Fig 5.
Mycobacterium skew index values.
A) SkewI for 236 Mycobacterium genomes from 12 Mycobacterium species, all of which have multiple strains available in RefSeq. The threshold (vertical line) is at 0.413. B) SkewI colored by species. C) GC content (%) compared to SkewI, where each dot represents a different genome, colored by species.
Fig 6.
SkewIT sensitivity to misassemblies.
To evaluate the sensitivity of the SkewIT method for detecting misassemblies, we first randomly selected 10 genomes from these species: Bacillus thuringiensis, Salmonella enterica, Staphylococcus aureus, Escherichia coli, and Pseudomonas aeruginosa. A) The SkewI threshold for each species. For each genome and each of 12 values of k ranging from 0 to 30, we simulated 100 misassembled genomes by moving a random subsequence of length k% of the full genome length to another random location. B) The average percentage of the misassembled genomes that had SkewI values below the threshold.
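The simulated misassembly described in this caption, moving a random subsequence of k% of the genome to another random location, might look like the following minimal sketch. The function name `simulate_misassembly` is hypothetical, and the sequence is treated as linear (the moved segment never wraps around the origin), which is a simplifying assumption rather than a detail given in the caption.

```python
import random


def simulate_misassembly(genome, k_percent, rng):
    """Move a random segment of length k% of the genome to a random position.

    The genome is treated as a linear string (an assumption of this sketch).
    With k_percent = 0 the genome is returned unchanged. The random insertion
    point can occasionally coincide with the original position.
    """
    seg_len = int(len(genome) * k_percent / 100)
    if seg_len == 0:
        return genome
    start = rng.randrange(0, len(genome) - seg_len + 1)
    segment = genome[start:start + seg_len]
    remainder = genome[:start] + genome[start + seg_len:]
    insert_at = rng.randrange(0, len(remainder) + 1)
    return remainder[:insert_at] + segment + remainder[insert_at:]


# Toy usage: 100 simulated misassemblies of a short placeholder sequence.
rng = random.Random(0)
toy_genome = "ACGT" * 25
simulated = [simulate_misassembly(toy_genome, 10, rng) for _ in range(100)]
```

Each simulated genome keeps the original length and base composition, so any change in SkewI comes only from the rearrangement.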