Fig 1.
Illustration of the aim of this work.
While existing methods for analyzing TIS data primarily categorize genes as either essential or non-essential based on transposon insertion density, our goal is to establish a method that quantifies the fitness effect of losing a non-essential gene on a more gradual scale.
Fig 2.
Transposon insertions near the start and end of a gene are less likely to generate a gene knockout.
a) To identify a possible dependence of the read counts on the position of a transposon insertion within a gene, we split the open reading frame of each gene into 20 equally sized segments. Each segment therefore covers 5% of the coding region of a gene. For every segment we calculated the average read count per insertion site to obtain a profile along the coding region. b) Profile of the read counts against the position within the coding sequence, averaged over all non-essential genes. The profile shows that insertions near the start and stop codons of a gene have a slightly higher read count than insertions in the central 80% of the coding region. This effect is visibly stronger for genes that are annotated as essential (c), indicating that insertions near the gene edges do not cause a full gene knockout. d) The read count per insertion averaged over the central 80% of the coding region for essential and non-essential genes. While essential genes typically have a lower average read count per insertion site than non-essential genes, their distributions still overlap. e) Histogram of the largest span that appears free of insertions, expressed as a fraction of the gene’s length. The distribution for essential genes has a tail towards longer insertion-free spans (up to the full length of the coding region), while the spans are typically shorter for non-essential genes.
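The segment profile described in panel (a) can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name `segment_profile`, the half-open gene coordinates, and the `(position, read_count)` input format are assumptions, and strand orientation is ignored.

```python
from typing import List, Optional, Sequence, Tuple


def segment_profile(
    start: int,
    end: int,
    insertions: Sequence[Tuple[int, int]],
    n_segments: int = 20,
) -> List[Optional[float]]:
    """Mean read count per insertion site in each segment of one gene.

    The coding region [start, end) is split into n_segments equal parts
    (assumed coordinates); insertions are (position, read_count) pairs.
    Segments without any insertion are reported as None.
    """
    length = end - start
    read_sums = [0] * n_segments
    site_counts = [0] * n_segments
    for pos, reads in insertions:
        if not (start <= pos < end):
            continue  # ignore insertions outside the coding region
        seg = (pos - start) * n_segments // length
        read_sums[seg] += reads
        site_counts[seg] += 1
    return [
        read_sums[k] / site_counts[k] if site_counts[k] else None
        for k in range(n_segments)
    ]
```

Averaging these per-gene profiles over a set of genes, skipping empty segments, would give a profile of the kind shown in panels (b) and (c).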
Fig 3.
Correcting for centromere bias to determine the expected transposon insertion rate of genes.
a) The goal of calculating the expected insertion rate is to estimate the number of insertion sites that produce no reads because the mutants were lost during the sampling of the population due to low abundance. b) The empirical insertion rate depends on the distance of a gene to the centromere. To visualize this bias we determined the number of transposons that mapped within a distance rc from the chromosome centromere for different values of rc, as done previously by [31]. The plots of the cumulative insertions for the two halves of each individual chromosome are shown as grey lines, the averaged curve is shown as a blue line. The non-zero intercept of a linear fit of the portion of the average curve for distances rc > 300 kb demonstrates the existence of the centromere bias in our dataset. c) The averaged cumulative plot for distances rc < 400 kb was fitted with a power-law function and a 4th-order polynomial. While the power-law function does approximate the shape of the curve, it systematically under- and overestimates different portions of the curve. Overall, the 4th-order polynomial better approximates the curve. d) The empirical insertion rate as approximated by the fitted power-law and polynomial functions. The plot shows that while the approximation by the polynomial function is better for rc < 200 kb, the polynomial starts to oscillate for larger distances. e) The difference between the expected (E(X)) and observed (O(X)) transposon insertion rates for essential and non-essential genes. While the average difference is close to zero for non-essential genes, it becomes positive for essential genes.
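The cumulative counts plotted in panel (b) can be illustrated with a short sketch. It assumes insertion and centromere positions are given as integer coordinates on one chromosome and that "within a distance rc" is inclusive; the function name `cumulative_insertions` is hypothetical, and the curve fitting of panels (c) and (d) is not shown.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def cumulative_insertions(
    positions: Sequence[int],
    centromere: int,
    distances: Sequence[int],
) -> Dict[str, List[int]]:
    """Number of insertions within each distance r_c of the centromere.

    The two chromosome halves are counted separately, giving one curve per
    arm. A distance of exactly r_c counts as "within" (an assumption).
    """
    left = sorted(centromere - p for p in positions if p < centromere)
    right = sorted(p - centromere for p in positions if p >= centromere)
    return {
        "left": [bisect_right(left, r) for r in distances],
        "right": [bisect_right(right, r) for r in distances],
    }
```

Averaging such per-arm curves over all chromosomes would correspond to the blue line in panel (b).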
Fig 4.
The steps of the procedure to calculate fitness from transposon insertion data.
a) First, the expected number of insertion events is calculated for each gene, using the global insertion profile to correct for the centromere bias. The expected number of insertions is then compared to the observed number of insertions to determine the number of sites that have an insertion but produce no reads. These sites are included as sites that have zero read counts. b) After adding the zero read count sites, outliers are removed using the range between the 5th and 95th percentiles of the data. c) The mean and variance of the read counts at different insertion sites are used to determine the average and uncertainty, respectively, of the fitness of a gene deletion mutant. To provide a robust estimate for the variance, information is shared between genes by fitting the global mean-variance relationship with an overdispersed Poisson model. Under the assumption that this mean-variance relationship is a property of the dataset, the resulting fit allows us to determine the variance of a gene from its mean read count.
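Steps (a) and (b) can be sketched for a single gene as below. This is an illustration under stated assumptions rather than the procedure's actual implementation: the function name `pad_and_trim`, rounding the expected number of sites to an integer, and the inclusive percentile definition are all choices made here. The mean-variance fit of step (c) is not shown.

```python
from statistics import quantiles
from typing import List, Sequence


def pad_and_trim(read_counts: Sequence[int], expected_sites: float) -> List[int]:
    """Add zero-read sites, then keep values between the 5th and 95th percentiles.

    expected_sites is the expected number of insertion sites for the gene
    (assumed to be supplied by the centromere-bias correction).
    """
    # Sites expected to exist but not observed are added as zero read counts.
    missing = max(0, round(expected_sites) - len(read_counts))
    padded = list(read_counts) + [0] * missing
    if len(padded) < 2:
        return padded  # too few sites to define percentiles
    # With n=20, the first and last cut points are the 5th and 95th percentiles.
    cuts = quantiles(padded, n=20, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [r for r in padded if low <= r <= high]
```

For example, a gene with observed read counts `[10, 20, 30, 40, 1000]` and eight expected sites gains three zero-read sites and loses the value 1000 as an outlier.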
Fig 5.
Overview of the experimental procedure to determine the reproducibility of the fitness values across different replicate experiments.
Biological replicates B1-B4 are different clones of a single wild-type strain transformed with plasmid pBK549. For two of the biological replicates (B1 and B2), the extracted genomic DNA was sampled and sequenced multiple times, yielding the technical replicates B1_T1-B1_T6 and B2_T1-B2_T3.
Fig 6.
Fitness estimates from SATAY are reproducible across replicate experiments.
a) The estimated fitness effect of gene disruptions for all genes of technical replicate 1 plotted against its estimated value in technical replicate 2. The identity line (red dashed line) is shown as a reference for perfect correlation between the two replicates. b) The fitness distributions of technical replicate 1 (top) and technical replicate 2 (bottom). Essential genes are represented by gray transparent bars. c) Same plot as in (a), but for two biological replicates. d) The fitness distributions of biological replicate 1 (top) and biological replicate 2 (bottom). Annotations are the same as in (b).
Fig 7.
The fitness values obtained from SATAY datasets only weakly correlate with the values reported by other studies.
a) Plot of the fitness values of gene deletion mutants reported by [50] against the fitness values obtained from a SATAY dataset in this study. The top and side panels show the densities of the corresponding distributions of fitness effects (DFEs). The identity line is shown for reference (dashed red line). b) Plot of the fitness values of gene deletion mutants generated using SATAY in a bem3Δ genetic background against the fitness values of the same gene deletion mutants generated in a wild-type (WT) genetic background (both from this study). Positive and negative genetic interactors of BEM3, as annotated by [18], are shown as green and red datapoints, respectively.
Fig 8.
Increasing sequencing depth does not improve the accuracy of the fitness estimates.
a) The effect of sequencing depth on the accuracy of the fitness estimates was determined by pooling the experimental data from the six technical replicates B1_T1-B1_T6 and randomly sampling a subset of n reads without replacement from this pooled dataset. b) The number of observed independent transposon insertions (red) and the average and median read count per insertion site (blue) as a function of the number of sampled reads. c) The read count distribution for varying levels of the sequencing depth. The distribution has been cut off at a maximum value of 20 × 10^3 reads/transposon. d) The distribution of the relative standard error of the fitness estimates across different genes for different levels of sequencing depth.
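The pooling and subsampling of panel (a) can be sketched as follows, assuming each replicate is represented as a mapping from insertion site to read count. The function names and the seed parameter are hypothetical. Expanding the pool into one entry per read is simple but memory-hungry, so this is only meant for illustration on small inputs.

```python
import random
from collections import Counter
from typing import Dict, Iterable


def pool_replicates(replicates: Iterable[Dict[int, int]]) -> Counter:
    """Sum the read counts per insertion site over all replicates."""
    pooled: Counter = Counter()
    for rep in replicates:
        pooled.update(rep)  # Counter.update adds counts for a mapping
    return pooled


def subsample_reads(pooled: Dict[int, int], n_reads: int, seed: int = 0) -> Counter:
    """Draw n_reads individual reads without replacement from the pool."""
    # One list entry per read, so every read is equally likely to be drawn.
    pool = [site for site, count in pooled.items() for _ in range(count)]
    rng = random.Random(seed)
    return Counter(rng.sample(pool, n_reads))
```

The number of keys in the returned counter is the number of observed insertion sites at that depth, which is the quantity shown in red in panel (b).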
Fig 9.
The relative contributions of different noise sources during SATAY experiments.
a) Schematic representation of the different steps in the SATAY procedure at which different replicates were split. b) Procedure for comparing the difference in the insertion sites (ΔP) and difference in read counts (ΔR). If two replicates have non-zero read counts mapping to a position in the genome within two base pairs of each other (indicated by the grey box), the insertions are matched and their read difference is calculated (left panel). If one replicate has an insertion at a genomic location but no insertion is found within two base pairs of this location in the other replicate, this location is recorded as a difference in position (right panel). c) Plot of the differences in insertion sites and read counts between the different replicates. Replicates split before sequencing (step 3) show the smallest variation in both insertion positions and read counts. Replicates split before PCR amplification (step 2) have more variation in their read counts than samples split before sequencing. Splitting replicates before library expansion (step 1) creates the largest amount of variation in insertion sites between the replicates, but their read count difference at matched sites is similar to that of replicates split before PCR amplification.
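The matching rule of panel (b) can be illustrated with a small sketch. It assumes each replicate is a mapping from genomic position to a non-zero read count on a single chromosome, that the two-base-pair window is inclusive, and that each insertion is matched at most once (a greedy one-to-one pairing over sorted positions). The function name `compare_replicates` and the return format are hypothetical.

```python
from typing import Dict, List, Tuple


def compare_replicates(
    rep_a: Dict[int, int],
    rep_b: Dict[int, int],
    window: int = 2,
) -> Tuple[List[Tuple[int, int, int]], List[Tuple[str, int]]]:
    """Match insertions between two replicates within `window` base pairs.

    Returns (read_diffs, unmatched): read_diffs holds (pos_a, pos_b,
    reads_a - reads_b) for matched sites; unmatched holds ("a" or "b",
    position) for insertions without a partner in the other replicate.
    """
    pos_a, pos_b = sorted(rep_a), sorted(rep_b)
    read_diffs: List[Tuple[int, int, int]] = []
    unmatched: List[Tuple[str, int]] = []
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        a, b = pos_a[i], pos_b[j]
        if abs(a - b) <= window:
            read_diffs.append((a, b, rep_a[a] - rep_b[b]))
            i += 1
            j += 1
        elif a < b:
            unmatched.append(("a", a))  # no partner close enough in rep_b
            i += 1
        else:
            unmatched.append(("b", b))  # no partner close enough in rep_a
            j += 1
    unmatched.extend(("a", p) for p in pos_a[i:])
    unmatched.extend(("b", p) for p in pos_b[j:])
    return read_diffs, unmatched
```

The matched read differences correspond to ΔR and the unmatched positions to ΔP; how these are aggregated into the values plotted in panel (c) is not specified here.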