Interpreting and de-noising genetically engineered barcodes in a DNA virus

doi:10.1371/journal.pcbi.1010131

Fig 1.

Outline of computational workflow for selecting stock barcodes.

Linker sequences are trimmed with Cutadapt (colors represent distinct barcodes). Clustering is performed using the Starcode message-passing algorithm, with the "centroid" of each cluster defined as its representative barcode (dotted circles represent clusters). We determined empirically which edit distance to use by examining two criteria: 1) how well different edit distances recalled the correct input barcode sequences from various amounts of plasmid controls, each containing a barcode of known sequence, and 2) the “shoulder cutoff” on the plot of the abundance distribution of called clusters from our barcoded plasmid library. To define the "true" barcodes of our virus library, we used a Levenshtein distance of L = 3 because it performed slightly better on our defined barcode plasmid controls than the default Starcode Levenshtein distance of L = 2. Final "true" barcodes were then assigned by omitting the lowest-abundance barcodes (those at the bottom of the ranking that together account for a cumulative 1% of all counts), on the assumption that the lowest-abundance reads were most likely to be sequencing artifacts.
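
The cumulative-count cutoff described in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function name `apply_cumulative_cutoff`, the `keep_fraction` parameter, and the choice to keep the barcode that crosses the threshold are all assumptions made for the example.

```python
def apply_cumulative_cutoff(counts, keep_fraction=0.99):
    """Keep the most abundant barcodes that together reach keep_fraction of all counts.

    counts: dict mapping barcode (or centroid) sequence -> read count.
    Boundary handling is an assumption: the barcode that crosses the
    threshold is kept, and everything ranked below it is dropped.
    """
    total = sum(counts.values())
    kept = {}
    running = 0
    # Walk from most to least abundant; ties are broken by sequence for determinism.
    for barcode, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= keep_fraction * total:
            break
        kept[barcode] = n
        running += n
    return kept


# Hypothetical toy counts, not data from the study.
toy_counts = {"AAAC": 600, "CCGT": 395, "AAAT": 3, "CCGA": 2}
kept = apply_cumulative_cutoff(toy_counts, keep_fraction=0.99)
```

In this toy input the two abundant barcodes already cover 99.5% of the counts, so the two low-count sequences are dropped.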

Fig 2.

Illumina sequencing reads from single-plasmid controls.

The control spiked-in barcode is shown as purple bars; erroneous barcodes are shown as gray bars. Shown are the 15 highest-count barcode sequences tallied across the four 1-Plasmid controls. Raw counts are shown with no clustering applied, with the x-axis square root-transformed to highlight low values.

Table 1.

Single-plasmid controls preparation.

Table 2.

Percent of erroneous barcodes in Illumina sequencing reads of single-plasmid controls: original (raw), after per-base quality filtering at each of two thresholds, and after clustering with different Levenshtein distances.

Table 3.

10-plasmid controls preparation.

Fig 3.

Illumina sequencing reads from 10-plasmid controls using different clustering distances.

The y-axis depicts the barcode sequence; the x-axis shows the square root-transformed percentage of total read counts. The colored bars represent the control barcodes. Gray bars represent the most common erroneous barcodes within a library. The plots compare the raw percentages (no clustering) with clustering using Starcode’s message-passing algorithm, and L = 1, L = 2, and L = 3 distance parameters. Here we show the 10-plasmid controls 10P-A, 10P-D and 10P-G. Additional 10-plasmid controls are shown in S4 Fig.

Fig 4.

Linearity plots of 10-plasmid controls with L3 clustering parameter.

The log10-transformed x-axis shows the copy number of plasmid inputs; the log10-transformed y-axis shows L3-clustered read counts. Linear regression trendlines are plotted in gray, with corresponding R² values. Linearity of the 10-plasmid control series 10P-A and 10P-G is shown (the linearity of additional 10-plasmid controls is shown in S6 Fig).
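
A linearity check of this kind amounts to an ordinary least-squares fit on log10-transformed values. The sketch below shows one way to compute the slope, intercept, and R² using only the standard library. The function name `loglog_fit` and the input values are hypothetical, and the source does not state which software was used for the regression.

```python
import math


def loglog_fit(copies, reads):
    """Least-squares fit of log10(reads) against log10(copies).

    Returns (slope, intercept, r_squared). Inputs must be positive,
    because zero counts have no log10.
    """
    xs = [math.log10(c) for c in copies]
    ys = [math.log10(r) for r in reads]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical, perfectly proportional inputs (not data from the study).
slope, intercept, r_squared = loglog_fit([10, 100, 1000, 10000],
                                         [50, 500, 5000, 50000])
```

A slope near 1 with R² near 1 on the log-log scale indicates that clustered read counts scale proportionally with input copy number.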

Fig 5.

Comparison of barcode counts between raw and different clustering distances.

This figure shows that the number of barcodes called decreases with increasing clustering distance; clustering with L = 3 substantially decreases the number of barcodes called relative to the raw sequencing reads. Note that the highest-count cluster is ranked 1. The figure is cut off to include only the 10,000 most abundant barcodes, to focus on the “elbow” where the number of barcodes called displays a steep drop-off.

Fig 6.

Centroid counts as percentage of cluster counts.

Each gray dot represents the percentage of counts in a cluster (using L3 clustering) from the plasmid library that comes from its centroid. The centroids are ordered by the size of their cluster. The highest-count cluster is ranked 1. A smoothed curve (in blue) is fitted to the dot plot. This plot shows that the majority of cluster counts come from the defined centroid.
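
The quantity plotted here can be sketched as follows. The record layout (`centroid`, `centroid_count`, `cluster_count`) is an assumption made for illustration and does not correspond to a specific Starcode output format.

```python
def centroid_share(clusters):
    """Rank clusters by size and report the centroid's share of each cluster.

    clusters: list of dicts with the hypothetical keys 'centroid',
    'centroid_count' (reads matching the centroid exactly) and
    'cluster_count' (reads in the whole cluster).
    Returns a list of (rank, centroid, percent) with rank 1 = largest cluster.
    """
    ranked = sorted(clusters, key=lambda c: -c["cluster_count"])
    return [
        (rank, c["centroid"], 100.0 * c["centroid_count"] / c["cluster_count"])
        for rank, c in enumerate(ranked, start=1)
    ]


# Hypothetical toy clusters, not data from the study.
toy_clusters = [
    {"centroid": "CCGTCCGTCCGT", "centroid_count": 40, "cluster_count": 50},
    {"centroid": "AAACAAACAAAC", "centroid_count": 90, "cluster_count": 100},
]
shares = centroid_share(toy_clusters)
```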

Fig 7.

Barcode length distributions for 90%, 99%, and 99.9% sequencing reads cutoffs.

Applying a cutoff for barcode abundance eliminates most recall of longer barcodes that were not intended in the original design for a 12-nucleotide barcode library. These barcodes are derived from the L3 clustering. The y-axis is square root-transformed so low values are more visible.

Fig 8.

Distributions of the barcode pairwise distances within the plasmid library (panel A, shown in blue) and 5,000 randomly generated 12mers (panel B, shown in yellow).

The figure represents the pairwise Levenshtein distances between centroids in the plasmid library after applying an L3 clustering distance and a 99% reads cutoff. The 5,000 randomly generated 12mers are not clustered and have no cutoff applied. Data show that the proportion of possible barcode sequence space covered is similarly sparse in both cases.
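
A comparison of this kind can be reproduced in outline with a plain dynamic-programming Levenshtein distance and a histogram over all pairs. The sketch below is illustrative only: the helper names, the seed, and the uniform base composition of the random 12mers are assumptions, and a pure-Python loop over all pairs of 5,000 sequences (about 12.5 million comparisons) is slow, so a smaller `n` is advisable when trying it out.

```python
import random
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Edit distance allowing substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def pairwise_distance_histogram(barcodes):
    """Count how many barcode pairs fall at each Levenshtein distance."""
    return Counter(levenshtein(a, b) for a, b in combinations(barcodes, 2))


def random_kmers(n, k=12, seed=0):
    """Generate n random k-mers with uniform base composition (an assumption)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(k)) for _ in range(n)]


# Small run for illustration; the figure uses 5,000 random 12mers.
hist = pairwise_distance_histogram(random_kmers(200))
```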

Fig 9.

Technical replicates of the sequencing of the virus library.

Using Illumina sequencing, one replicate of the virus library was sequenced in one run and three additional technical replicates were sequenced in a separate run. Thus, the four technical replicates were sequenced in two independent runs. The UpSet plot shows the number of barcodes that intersect amongst the four virus library technical replicates (lower panel), as well as the total counts of those barcodes in blue (upper panel). The libraries were clustered using L3 distance and a 99% reads cutoff was applied. The four replicates show a large degree of overlap in clustered barcodes.
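
The numbers behind an UpSet plot are the sizes of the exclusive intersections: each barcode is assigned to exactly one combination of replicates, namely the set of replicates in which it appears. A minimal sketch follows, assuming each replicate is available as a barcode-to-count dictionary. The function name and the choice to sum counts across the replicates containing the barcode are assumptions, since the caption does not specify how the upper-panel totals were computed.

```python
from collections import defaultdict


def exclusive_intersections(replicates):
    """Group barcodes by the exact set of replicates that contain them.

    replicates: dict mapping replicate name -> {barcode: count}.
    Returns {membership tuple: (number of barcodes, total counts)}, where
    total counts are summed over the replicates in the membership (an assumption).
    """
    names = sorted(replicates)
    summary = defaultdict(lambda: [0, 0])
    all_barcodes = set().union(*(replicates[n].keys() for n in names))
    for barcode in all_barcodes:
        members = tuple(n for n in names if barcode in replicates[n])
        summary[members][0] += 1
        summary[members][1] += sum(replicates[n][barcode] for n in members)
    return {members: tuple(stats) for members, stats in summary.items()}


# Hypothetical toy replicates, not data from the study.
toy = {
    "rep1": {"AAAC": 5, "CCGT": 2},
    "rep2": {"AAAC": 7, "GGTA": 1},
}
result = exclusive_intersections(toy)
```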

Fig 10.

Overlap of clustered barcodes from the plasmid library, ligated virus genomes, and the virus library.

A. An L3 clustering distance and a 99% cumulative count cutoff were applied to the three libraries. In the lower panel, the UpSet plot shows the number of barcodes from the virus library that intersect with barcodes from the plasmid library and/or the ligated virus genomes. Data show that the vast majority of barcodes from the virus library are present in the plasmid library and the ligated virus genomes. In the top panel, the UpSet plot represents the total read counts associated with barcodes in the virus library that intersect with the plasmid library and the ligated virus genomes. Data show that the barcodes present in all three libraries account for the overwhelming majority of counts in the virus library. B. Overlap of clustered barcodes that are shorter than 12 nucleotides from the plasmid library, ligated virus genomes, and the virus library. An L3 clustering distance and a 99% cumulative count cutoff were applied to the three libraries. In the lower panel, the UpSet plot shows the number of barcodes from the virus library that intersect with barcodes from the plasmid library and/or the ligated virus genomes. Data show that the majority of short (<12 nt) barcodes in the virus library are also present in the plasmid library and ligated virus genomes. In the top panel, the UpSet plot represents the total read counts associated with shorter barcodes in the virus library that intersect with the plasmid library and the ligated virus genomes. Data show that the shorter barcodes that overlap in all three libraries account for the majority of shorter barcodes in the virus library, suggesting that they did not arise during the course of infection.

Fig 11.

Abundance of clustered barcodes in the plasmid and virus libraries.

For each barcode, its log10 percentage of counts in the plasmid library is plotted on the x-axis, and its log10 percentage of counts in the virus library is plotted on the y-axis. A trendline with slope 1 through the origin is plotted. Data show a correlation: the more abundant barcodes in the plasmid library generally give rise to the more abundant barcodes in the virus stock.
