Abstract
The identification of essential genes in Transposon Directed Insertion Site Sequencing (TraDIS) data relies on the assumption that transposon insertions occur randomly in non-essential regions, leaving essential genes largely insertion-free. While intragenic insertion-free sequences have been considered as a reliable indicator for gene essentiality, so far, no exact probability distribution for these sequences has been proposed. Further, many methods require setting thresholds or parameter values a priori without providing any statistical basis, limiting the comparability of results. Here, we introduce Consecutive Non-Insertion Sites (ConNIS), a novel method for gene essentiality determination. ConNIS provides an analytic solution for the probability of observing insertion-free sequences within genes of given length and considers variation in insertion density across the genome. Based on an extensive simulation study and different real-world scenarios, ConNIS was found to be superior to prevalent state-of-the-art methods, particularly when libraries had only a low or medium insertion density. In addition, our results showed that the precision of existing methods can be improved by incorporating a simple weighting factor for the genome-wide insertion density. To set methodically embedded parameter and threshold values of TraDIS methods a subsample-based instability criterion was developed. Application of this criterion in real and synthetic data settings demonstrated its effectiveness in selecting well-suited parameter/threshold values across methods. An R package and an interactive web application are provided to facilitate application and reproducibility.
Author summary
Identifying essential genes in bacteria is key to understanding their ability to survive, which can, for example, be applied to the development of new treatments. One way to identify these genes is by creating libraries where small DNA fragments (“insertions”) are randomly placed in the genome: essential genes tend to remain insertion-free because insertions disrupt their function. The challenge is to determine whether a (long) uninterrupted sequence is due to chance or because the gene is truly essential. Here, we present Consecutive Non-Insertion Sites (ConNIS), a statistical method that calculates the probability of such insertion-free sequences. Extensive comparisons on simulated and real datasets show that ConNIS outperforms existing methods, especially when a library is rather sparse in terms of the total number of insertion sites. Since many analysis methods rely on parameter values that have to be set before the analysis and can heavily influence the final results, we also propose a data-driven approach to set these values, making results more comparable across studies. Our methods are freely available as an R package and all results are presented in a web app.
Citation: Hanke M, Harten T, Foraita R (2026) ConNIS and labeling instability: New statistical methods for improving the detection of essential genes in TraDIS libraries. PLoS Comput Biol 22(3): e1013428. https://doi.org/10.1371/journal.pcbi.1013428
Editor: Jinyan Li, Shenzhen University of Advanced Technology, CHINA
Received: August 12, 2025; Accepted: February 12, 2026; Published: March 6, 2026
Copyright: © 2026 Hanke et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data and code used for running experiments, model fitting, and plotting is available on a GitHub repository at https://github.com/bips-hb/ConNIS_results. We have also used Zenodo to assign a DOI to the repository in a zip format: https://doi.org/10.5281/zenodo.16790977 Additional real-world results are available under https://zenodo.org/records/18538450. All results can be interactively explored under https://connis.bips.eu. The new methods are made available as R package under https://github.com/bips-hb/ConNIS.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Determination of genes essential for the growth and survival of bacteria has been of major interest in genetic research as it provides a deeper understanding of lifestyle and adaptation [1–3]. While site-directed mutagenesis approaches determine essential genes accurately, such methods are laborious and time-consuming when performed globally. Consequently, whole genome analyses have only been conducted for well-known model organisms such as Escherichia coli, i.e., the Keio library [4]. In the last decade, wider availability of high-throughput sequencing methods initiated a shift from single-gene to whole-genome analysis, resulting in the development of transposon insertion sequencing (TIS) methods. Employing transposons that are randomly inserted into the genome enables researchers to generate large mutant libraries and to characterize them by the location of insertion sites (IS) via high-throughput sequencing. Transposon directed insertion site sequencing (TraDIS) is a widely applied TIS method [5–8] and has been established for the determination of essential genes in various scientific set-ups [9–15].
A key challenge in TIS studies resides in the statistical analysis, which typically aims to maximize the detection of true positives (essential genes) while minimizing the number of false positives (non-essential genes incorrectly identified as ‘essential’). Although there are multiple software suites and packages, not every method embedded therein will be equally suitable for the analysis of the obtained data set. For example, sliding window approaches [16,17] and Hidden Markov Models [18,19] have been proposed for the analysis of high-density libraries which regularly originate from mariner-transposon-based mutagenesis [20,21]. However, many TIS studies utilize Tn5-based transposons, which have different underlying assumptions and constraints in terms of the data generating process: Unlike mariner transposons, Tn5 insertions do not depend on the presence of specific motifs and can theoretically occur in any non-essential region of the genome [6,22]. Nevertheless, Tn5-based libraries reported so far tend to be less dense than mariner-based libraries. Consequently, observing larger genomic regions lacking IS just by chance becomes more likely in Tn5-based libraries. Furthermore, the set of detected IS across the genome rarely displays a uniform distribution of gene-wise insertion densities. Reasons may be transposon-driven preferences for GC- or AT-rich regions [23–26] and genomic hot- or coldspots, i.e., genomic regions of notably higher or lower insertion densities, respectively [24,27–29].
So far, a couple of Tn5-based statistical methods for identifying essential genes have been proposed. Burger et al. [30] suggest estimating the probability of observing several IS within a gene of a given length based on a binomial distribution using the genome-wide insertion density as success probability. The Tn5Gaps method of the Transit package [31] uses a Gumbel distribution to approximate the probabilities of observed IS-free gaps along the genome. Essentiality is determined by the largest gap within or partially overlapping a gene. Since both methods rely on p-values derived from thousands of genes, the authors recommend correcting for multiple testing. However, they did not evaluate how different correction approaches might affect the identification of essential genes. Alternatively, the Bio-TraDIS software package [32] avoids the multiple testing problem by heuristically leveraging an often observed bimodal distribution of gene-wise insertion densities. Combining an exponential distribution (for essential genes) with a gamma distribution (for non-essential genes), genes are labeled as ‘essential’, ‘non-essential’, or ‘ambiguous’ based on an a priori set likelihood ratio threshold. In practice, a clear distinction between the two distributions is not always guaranteed [6], and the threshold values are usually set arbitrarily. The recently proposed Bayesian method InsDens calculates the posterior probability of a gene being essential [33] and the authors suggest to use Bayesian decision theory to set a posterior probability threshold. Although this method offers a clear interpretation, it requires choosing an a priori probability distribution parameter, too, which can influence the outcome.
Some but rather limited comparative studies of TIS methods are available. Based on high-density library data, Larivière et al. [20] draw a comparison between the bimodal approach of Bio-TraDIS, the Tn5Gaps method and a custom modification of the bimodal approach [11]. However, only two threshold values were applied and no performance analysis under different controlled data-generating processes was described. While Nlebedim et al. [33] used four different parameter combinations to generate synthetic data, the only method applied in the analysis was their method InsDens. The additional analysis of three real-world datasets using Bio-TraDIS suffers from the application of only one and rather low threshold value. Similarly, Ghomi et al. [34] proposed to use the non-parametric clustering algorithm embedded in the DBSCAN R package for the sole reason that it allows omitting the heuristic setting of threshold values required by Bio-TraDIS. Again, only a single and rather low threshold value for comparison using Bio-TraDIS was applied. However, the DBSCAN clustering algorithm itself requires setting two a priori parameter values but the authors did not report how different values might affect labeling performance, nor did they provide guidance on how to choose appropriate values.
The widespread use of TIS, particularly Tn5 statistical analysis methods, contrasts with the lack of systematic reviews of these methods, especially when considering different data-generating processes. Furthermore, a transparent, comprehensive statistical method for setting threshold or parameter values in TIS methods is missing. As a consequence, publications often only justify the choice of methods and parameters by citing prior studies that used similar approaches and parameters. In addition, most studies lack sensitivity analyses for their threshold and parameter values. A common practice is truncating the 5’- and/or 3’ ends of genes by several base pairs or up to 20%, to align with the assumption that the gene ends are generally non-essential [6,13,35–38], yet this approach is never investigated in sensitivity analyses.
At this scientific stage, we provide the following contributions to the statistical detection of essential genes. First, we introduce Consecutive Non-Insertion Sites (ConNIS), a novel method that determines gene essentiality based on insertion-free sequences within genes. ConNIS provides an analytical solution for the probability of observing the longest insertion-free sequence within a gene, based on its length and the number of IS under the assumption of being non-essential. Second, we performed an extensive simulation study with 160 parameter combinations mimicking different data-generating processes. Using these synthetic datasets, four additional semi-synthetic datasets and three real datasets, ConNIS demonstrated its superiority over five state-of-the-art Tn5 analysis methods, especially in settings with low- and medium-dense libraries. In this context, we propose to use a weighting factor when applying genome-wide insertion density values to better represent genomic regions with low insertion densities. This modification also improved three competing methods by reducing the number of false positives without losing too many true positives in many settings. Third, we provide for the first time a data-driven instability criterion for selecting thresholds and parameter values in TIS methods, thereby making the results from different studies and methods comparable and more transparent. Applications of this approach to real and synthetic datasets clearly demonstrate its suitability for setting parameter values for all methods considered. An in-depth analysis of biological functions for selected genes illustrates the ability of ConNIS to reliably detect essential genes even among short genes that are often excluded from statistical analyses because competing methods cannot distinguish signal from noise in this length range. ConNIS and the instability criterion for all competing methods have been made available as an R package. 
To further explore our results, we provide an interactive web app and publicly available code.
Materials and methods
Consecutive Non-Insertion Sites (ConNIS)
The transcription of an essential gene is, by definition, vital for an organism’s survival and its hindrance due to transposon insertion will result in the mutant’s removal from the population. Consequently, IS are expected to be exclusively detected in the non-essential genome, which comprises genes, intergenic regions and smaller fractions of essential genes [7]. Based on the assumption that a Tn5 transposon can occur at any position in the non-essential genome, we propose ConNIS, a method that classifies a gene as ‘essential’ or ‘non-essential’ by analyzing its largest insertion-free sequence in terms of base pairs. Therefore, we derive a novel probability distribution (see S1 File) that we used to determine the probability of observing an insertion-free sequence in a gene of given length and number of IS.
Let $b$ denote the length of a genome in terms of base pairs that contains $p$ genes with corresponding lengths $b_1, \dots, b_p$. We define $\theta = h/b$ as the genome-wide insertion density, where $h$ is the genome-wide number of observed IS. For a gene $j \in \{1, \dots, p\}$, let $h_j$ be the expected number of IS within gene $j$, i.e., $\theta \cdot b_j$ rounded to the nearest integer, under the assumption that gene $j$ is not essential. Furthermore, let $l_j$ be the length of an observed consecutive sequence of non-insertions of gene $j$ under the assumption of uniformly distributed IS (Fig 1A).
Fig 1. A: For a gene j, determine its length and its longest insertion-free sequence. B: Set a weight w for the genome-wide insertion density θ reflecting rather low-density regions of the genome. C: ConNIS: The probability of observing the longest insertion-free sequence within gene j due to random chance.
Based on Theorem 1 (see Section 1 in S1 File), the probability mass function of this sequence length is available in closed form. The expression is given in S1 File.
We next consider that IS are often distributed non-uniformly across the genome (see Fig B in S2 File). Other approaches use the genome-wide insertion density θ to estimate expected insertions per gene. However, under the assumption of non-essentiality this can inflate false positives in regions with a lower-than-average IS density, where missing insertions are likely due to chance. ConNIS corrects for this by introducing a weight factor w to θ that adjusts for low-density regions (Fig 1B), a bias not handled by normalization methods focused only on insertion counts.
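As a rough illustration of the weighting idea, the following Python sketch computes the rounded expected number of IS in a gene from a weighted genome-wide density. It assumes that θ is the number of observed IS divided by the genome length and that the weight simply multiplies θ. The function name, the genome length of 4.6 Mbp and the weight 0.3 are illustrative choices, not values taken from the study.

```python
def expected_insertions(gene_length: int, genome_length: int,
                        n_insertions: int, weight: float = 1.0) -> int:
    """Rounded expected number of IS in a gene, assuming it is non-essential.

    Hypothetical helper: theta = n_insertions / genome_length is the
    genome-wide insertion density and `weight` scales it down to mimic
    low-density regions (weight = 1.0 means no adjustment).
    """
    if not 0.0 < weight <= 1.0:
        raise ValueError("weight must lie in (0, 1]")
    theta = n_insertions / genome_length
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(weight * theta * gene_length)


# Illustrative numbers only: a 1,000 bp gene in a 4.6 Mbp genome with 102,000 IS.
unweighted = expected_insertions(1000, 4_600_000, 102_000)            # 22
weighted = expected_insertions(1000, 4_600_000, 102_000, weight=0.3)  # 7
```

With fewer expected IS, a long insertion-free stretch becomes less surprising, which is how a weight below 1 would reduce false positives in sparsely hit regions.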
Having observed a longest gap of size $l_j$ in gene j, ConNIS is then defined as the probability of observing an insertion-free consecutive sequence of at least length $l_j$ in gene j, i.e., $P(L_j \geq l_j)$, where $L_j$ denotes the corresponding random sequence length under the distribution above. Given a significance level $\alpha$, we declare gene j to be essential if this probability falls below $\alpha$. To control the global type I error when applying ConNIS, we suggest using either the Bonferroni(-Holm) method [39,40] to control the family-wise error rate (FWER) or the Benjamini-Hochberg method [41] to control the false discovery rate (FDR).
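The question ConNIS answers can also be approximated by simulation, which may serve as a sanity check. The sketch below is not the analytic solution of Theorem 1: it places a given number of IS uniformly at random (without replacement) in a gene and estimates how often the longest insertion-free run is at least as long as the observed one. All function and parameter names are hypothetical, and the analytic method may define the sequence differently.

```python
import random


def longest_gap(gene_length, positions):
    """Longest run of insertion-free positions among 0..gene_length-1."""
    longest, previous = 0, -1
    for pos in sorted(positions):
        longest = max(longest, pos - previous - 1)
        previous = pos
    # Also count the run between the last insertion and the gene end.
    return max(longest, gene_length - previous - 1)


def simulate_gap_tail(gene_length, n_insertions, observed_gap,
                      n_sim=2000, seed=1):
    """Monte Carlo estimate of P(longest gap >= observed_gap).

    Assumes a non-essential gene in which the IS fall uniformly at random
    on distinct base pairs.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        positions = rng.sample(range(gene_length), n_insertions)
        if longest_gap(gene_length, positions) >= observed_gap:
            hits += 1
    return hits / n_sim


# Illustrative call: a 1,000 bp gene with 20 expected IS and a 300 bp gap.
estimate = simulate_gap_tail(1000, 20, 300)
```

A small estimate suggests that the observed gap is unlikely to arise by chance in a non-essential gene. The simulation is far slower than a closed-form expression and is only meant to make the tested hypothesis concrete.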
Competing state-of-the-art methods with proposed weighting strategy.
We compared ConNIS with five popular state-of-the-art Tn5 analysis methods for determining essential genes:
- the Binomial distribution approach in the TSAS 2.0 package [30],
- the approach of fitting a bimodal distribution based on gene-wise insertion densities included in the Bio-TraDIS package [32] (referred to as Exp. vs. Gamma method throughout this paper),
- the InsDens method [33],
- the Tn5Gaps method of the TRANSIT package [31] and
- the Geometric distribution.
Although the geometric distribution has not been published as a stand-alone method, we included it as a competitor because it is the limiting distribution of ConNIS (see Theorem 2 in S1 File) and has been part of an analysis pipeline for determining the probability of insertion-free regions in the genome [42].
The Binomial and Geometric methods use the genome-wide insertion density θ as success probability, and the Tn5Gaps method uses it as a location parameter. However, for the reason outlined above, this naive use of θ can increase the number of false positives within genomic regions with a relatively low insertion density compared to the rest of the genome. To address this potential pitfall, we introduce a weight w to adjust θ when applying these methods, as we do in ConNIS. We then use these modified methods, as well as the original versions, for comparison. See Sect 2 in S1 File for methodological details. Further, InsDens requires several prior hyperparameters. In line with the authors’ claim, our tests on selected simulation settings showed minimal impact from these choices [33]. Thus, we used the default settings of the R package insdens for all simulations and real data analyses (see https://github.com/Kevin-walters/insdens, commit 286f114).
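To make the role of the weight concrete, the following Python sketch shows one plausible reading of the weighted Binomial and Geometric approaches. It is an assumption that the Binomial method evaluates the lower tail of the IS count in a gene and that the Geometric method evaluates the probability of a run of non-insertions of at least a given length; the exact definitions are in S1 File and the cited packages. Function names are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_lower_tail(gene_length, observed_is, theta, weight=1.0):
    """P(X <= observed_is) for X ~ Binomial(gene_length, weight * theta).

    Computed in log space so that long genes do not overflow.
    Assumes 0 < weight * theta < 1.
    """
    p = weight * theta
    total = 0.0
    for k in range(observed_is + 1):
        log_pmf = (lgamma(gene_length + 1) - lgamma(k + 1)
                   - lgamma(gene_length - k + 1)
                   + k * log(p) + (gene_length - k) * log1p(-p))
        total += exp(log_pmf)
    return total


def geometric_gap_tail(gap_length, theta, weight=1.0):
    """P(at least gap_length consecutive non-insertions) = (1 - w*theta)^gap."""
    return (1.0 - weight * theta) ** gap_length


# A weight below 1 makes a 100 bp gap less surprising (larger probability).
unweighted = geometric_gap_tail(100, 0.02)
weighted = geometric_gap_tail(100, 0.02, weight=0.5)
```

In both functions a smaller weight lowers the assumed success probability, so sparse insertions or long gaps yield larger probabilities and fewer genes are called essential.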
A labeling instability criterion for tuning parameter selection.
TIS methods often require a priori set parameter or threshold values that will influence the final number of genes labeled ‘essential’ and therefore the methods’ performances in terms of correct classification. The setting of an ‘appropriate’ parameter/threshold value in a given data scenario can be interpreted as a tuning problem. In this context, our data-driven tuning approach selects a parameter or threshold value for a TIS method from a set of candidate values.
We consider observed IS as realizations of an unknown probability distribution across the genome. This is comparable with a repeated TIS experiment which yields different IS positions in each realization, particularly in non-essential genomic regions. As a result, gene-wise IS metrics, such as the longest insertion-free sequence or the gene-wise IS density, would vary between experiments, potentially altering the set of genes classified as essential. The main idea of our selection criterion is to leverage these variations by quantifying the average variation of gene labeling based on m subsamples for a given tuning value. Inspired by stability selection approaches in linear regression and graph estimation problems [43–45], a ‘good’ tuning value should give rather stable results in terms of gene classification, i.e., the gene labeling should be less sensitive to the random occurrence of IS across the genome. In the following, we detail the procedure for selecting a suitable weight value w using ConNIS as an example. A transfer to other TIS methods or to threshold-based filters in the data pre-processing steps is straightforward.
Let $w_1, \dots, w_K$ be an ordered sequence of candidate weights. Assume further that m subsamples, each of the same size, are drawn without replacement from the set of h observed IS, and that the expected number of IS per gene is recomputed for each subsample and weight (Fig 2A).
Fig 2. A: Drawing m subsamples of the h original observed IS. B: Calculation of the instability values for all weights based on the estimated variances of a Bernoulli variable. C: Selecting the weight with the lowest instability. D: Application of ConNIS using the selected weight, followed by a multiple testing correction to identify putative essential genes.
For all candidate weights $w$, genes are labeled as ‘essential’ or ‘non-essential’ for a given significance level $\alpha$ within each of the m subsamples using ConNIS. By modeling the gene labeling as a Bernoulli process (‘essential’ or ‘non-essential’), we estimate the probability of labeling a gene j as ‘essential’ (see Fig 2B) by its relative labeling frequency, $\hat{\pi}_j(w)$, i.e., the number of subsamples in which gene j is labeled ‘essential’ divided by m.
We can then define the instability criterion over all genes for a given weight $w$ as the normalized sum of the estimated Bernoulli variances, where $\hat{\pi}_j(w)\,(1 - \hat{\pi}_j(w))$ is the Bernoulli variance of gene j and the normalization is based on the total number of genes that have been labeled at least once as ‘essential’ in the m subsamples. This normalization factor ensures that the criterion is bounded. Its lower bound of zero indicates complete consistency in gene labeling for each gene across subsamples, whereas its upper bound reflects total instability, equivalent to randomly assigning labels by flipping a fair coin for each gene in each subsample.
After calculating the instability for all weights, we have a sequence of instability values and then select the weight $w^{*}$ that minimizes the instability of labeling (see Fig 2C). Finally, ConNIS is applied to the original data with $w^{*}$ (Fig 2D).
Depending on the range of w and the number of observed IS, it is possible that very small weight values may lead to instability values approaching or even reaching zero with (nearly) all genes being labeled as ‘non-essential’. Following other stability approaches [43–45], these values are excluded from the set of candidate tuning values because they provide no useful information. For our instability criterion, we propose to omit all weights smaller than the smallest weight that maximizes the instability function to ensure that only the most informative weights are considered.
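The selection procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in the ConNIS R package: the instability is taken as the unscaled mean Bernoulli variance over genes labeled ‘essential’ at least once (so its maximum is 0.25 in this sketch; the published criterion may use a different scaling), and ‘the function’ in the exclusion rule is read as the instability curve. All names are hypothetical, and gene labels are assumed to come from running any TIS method on each subsample.

```python
import random


def draw_subsamples(insertion_sites, m, size, seed=0):
    """Draw m subsamples of `size` IS each, without replacement within a subsample."""
    rng = random.Random(seed)
    return [sorted(rng.sample(insertion_sites, size)) for _ in range(m)]


def labeling_instability(labels):
    """Mean Bernoulli variance of the gene labels across subsamples.

    labels: list of m lists (one per subsample) of booleans, True = 'essential'.
    The sum of variances is divided by the number of genes labeled
    'essential' in at least one subsample.
    """
    m = len(labels)
    n_genes = len(labels[0])
    freqs = [sum(sub[j] for sub in labels) / m for j in range(n_genes)]
    ever_essential = sum(1 for f in freqs if f > 0)
    if ever_essential == 0:
        return 0.0
    return sum(f * (1 - f) for f in freqs) / ever_essential


def select_weight(weights, instabilities):
    """Pick the weight with minimal instability among informative weights.

    weights must be sorted in ascending order. Weights smaller than the
    smallest weight attaining the maximum instability are discarded first.
    """
    start = instabilities.index(max(instabilities))
    candidates = list(zip(instabilities[start:], weights[start:]))
    return min(candidates)[1]


# Two subsamples, three genes: gene 0 always essential, gene 1 unstable.
example = labeling_instability([[True, False, False], [True, True, False]])
chosen = select_weight([0.1, 0.2, 0.3, 0.4, 0.5], [0.0, 0.05, 0.2, 0.1, 0.15])
```

In the usage lines, the weight 0.1 has zero instability only because hardly any gene is labeled, so it is discarded by the exclusion rule and 0.4 is selected instead.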
Results
Comparison of ConNIS with state-of-the-art methods
To evaluate the performance of ConNIS and its competitors we applied all methods to synthetic, semi-synthetic and real-world data. A detailed explanation of our simulation schemes for generating synthetic data that covers different assumptions about the data generating process can be found in S3 File. For the application to real-world data we used three publicly available Tn5 libraries of different organisms and insertion densities. The semi-synthetic datasets were generated from a high-density Tn5 library by randomly deleting IS. Note that the performance of all methods depends on the chosen values of their respective parameters and thresholds.
Since the number of essential genes is small compared to the number of non-essential genes, we used the Matthews correlation coefficient (MCC), a metric suitable for imbalanced data [46–48], as the main performance measure. The MCC equals 1 when the method perfectly labels all genes as ‘essential’ or ‘non-essential’ (perfect agreement), 0 when the labeling is completely random, and –1 when there is perfect disagreement between the true and predicted labels. For comparison, the MCC is plotted against the number of genes labeled ‘essential’, which can be controlled by the methods’ different parameter and threshold values. In addition, the precision-recall curve (PRC) is shown to investigate two desirable, yet occasionally conflicting, objectives: selecting as many true positives as possible (recall) while avoiding an inflated number of false positives (precision).
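For reference, the three performance measures can be computed from the confusion matrix as in the following Python sketch. The function names are hypothetical, and returning 0 when a denominator is zero is a common convention assumed here rather than something stated in the text.

```python
from math import sqrt


def confusion_counts(truth, predicted):
    """Return (tp, fp, tn, fn) for boolean labels, True = 'essential'."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def precision(tp, fp):
    """Share of genes labeled 'essential' that are truly essential."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of truly essential genes that are labeled 'essential'."""
    return tp / (tp + fn) if tp + fn else 0.0


truth = [True, True, False, False]
perfect = mcc(*confusion_counts(truth, truth))                         # 1.0
inverted = mcc(*confusion_counts(truth, [not t for t in truth]))       # -1.0
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it remains informative when essential genes are a small minority.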
In all applications, the weight of Binomial, ConNIS, Geometric and Tn5Gaps was varied over a grid of values, with $w = 1$ being the original, unweighted version. For Exp. vs. Gamma, we varied the log-likelihood ratio threshold over a set of values covering the range commonly reported in the literature. The posterior probability threshold of InsDens was likewise varied over a set of values. Furthermore, we truncated genes by excluding the distal ends by one of three fixed amounts and applied Bonferroni(-Holm) and Benjamini-Hochberg procedures for multiple testing correction.
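The two multiple testing corrections differ in how strictly they treat the ordered p-values. The following Python sketch implements the standard Holm step-down and Benjamini-Hochberg step-up decision rules for a list of gene-wise p-values. Function names and the example p-values are illustrative only.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure (controls the FWER)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (n - rank).
        if p_values[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            cutoff = rank  # largest rank meeting its threshold
    reject = [False] * n
    for i in order[:cutoff]:
        reject[i] = True
    return reject


p = [0.5, 0.001, 0.2, 0.015, 0.029]
holm = holm_reject(p)               # [False, True, False, False, False]
bh = benjamini_hochberg_reject(p)   # [False, True, False, True, True]
```

In this example, Holm declares one gene essential while Benjamini-Hochberg declares three, which illustrates why the choice of correction can change the number of genes labeled ‘essential’.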
Synthetic data settings.
We present three illustrative synthetic data settings offering an overview of the methods’ performance. The results of all 160 different settings and additional classification metrics can be interactively explored at https://connis.bips.eu, supporting our findings.
Synthetic data example 1 (SDE1) had IS randomly distributed along the genome in a sinusoidal shape. ‘Essential’ genes were defined to contain insertion-free sequences of at least a fixed proportion of the gene’s length. This represented scenarios where essential genes can contain IS relatively far from their distal ends. Fig 3A shows ConNIS clearly outperforming the other methods with regard to the MCC. The PRC demonstrates that ConNIS effectively enhanced the identification of true essential genes without substantially inflating false positives by maintaining high precision. Notably, all methods showed their peak performance when the number of genes labeled ‘essential’ was about the number of true essential genes (dashed orange line). The plot also highlights the beneficial effect of applying a weight $w < 1$ to Binomial and Geometric (the points indicate the average performance if no weight is applied, i.e., $w = 1$).
Fig 3. The plots show the LOESS-smoothed curves with confidence intervals for the MCC and PRC in three synthetic data settings. At the vertical dotted line, the number of genes labeled ‘essential’ matches the number of true essential genes. Dots on the curves indicate the average performance without weights or with the least stringent threshold (for Exp. vs. Gamma and InsDens). In all settings, both ends of each gene were trimmed by the same amount.
Synthetic data example 2 (SDE2) mimicked scenarios where only a low insertion density can be achieved, e.g., due to bottleneck effects by environmental pressure. Therefore, IS were randomly distributed along the genome in a sinusoidal shape, and the essential genes had insertion-free sequences of at least a fixed proportion of their length. All methods suffered from the sparse library (Fig 3C). ConNIS achieved the highest MCC value if the number of genes labeled ‘essential’ was close to the number of true essential genes. Binomial and Tn5Gaps could only achieve mediocre values at best. For Exp. vs. Gamma, InsDens and Tn5Gaps, the range of the number of genes labeled ‘essential’ never contained the number of true essential genes. However, the first two could achieve MCC values that were only slightly worse than those of ConNIS. With respect to the PRC, all methods induced false positives because larger non-insertion sequences occur by chance more often than in denser libraries. ConNIS tended to have a rather high precision, while Exp. vs. Gamma and InsDens achieved rather high recall values, but at the price of an inflation of false positives.
Synthetic data example 3 (SDE3) covered scenarios with so-called ‘cold spots’ along the genome, which have a much lower chance of containing IS. ‘Essential’ genes were defined to contain insertion-free sequences of at least a fixed proportion of their length. In non-essential sections of the genome, each base pair had the same probability of containing one of the IS, yet in 25 randomly placed sections of fixed size these probabilities were lowered by a factor of 10. ConNIS clearly achieved the best MCC values and PRC performance (Fig 3D). However, all methods tended to overestimate the number of essential genes (indicated by the rather low precision values) due to the higher chance of false positives in cold spots.
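A cold-spot scenario of this kind can be mimicked with weighted sampling without replacement. The Python sketch below is a simplified stand-in for the simulation scheme in S3 File, not a reproduction of it: it ignores essential genes, uses the Efraimidis-Spirakis key method to draw distinct positions, and all sizes (genome length, number of IS, cold-spot coordinates) are hypothetical.

```python
import heapq
import random


def simulate_insertions(genome_length, n_insertions, cold_spots,
                        factor=10.0, seed=0):
    """Draw distinct IS positions with reduced sampling weight in cold spots.

    cold_spots: list of (start, end) half-open intervals in base pairs.
    Positions inside a cold spot get weight 1/factor, all others weight 1.
    """
    rng = random.Random(seed)
    weight = [1.0] * genome_length
    for start, end in cold_spots:
        for pos in range(start, min(end, genome_length)):
            weight[pos] = 1.0 / factor
    # Efraimidis-Spirakis: give each position the key u**(1/weight) with
    # u ~ Uniform(0, 1) and keep the n_insertions positions with largest keys.
    keyed = ((rng.random() ** (1.0 / weight[pos]), pos)
             for pos in range(genome_length))
    chosen = heapq.nlargest(n_insertions, keyed)
    return sorted(pos for _, pos in chosen)


# Illustrative sizes only: 20,000 bp genome, 2,000 IS, two 2,000 bp cold spots.
spots = [(5000, 7000), (12000, 14000)]
sites = simulate_insertions(20_000, 2_000, spots)
in_cold = sum(1 for s in sites if any(a <= s < b for a, b in spots))
cold_density = in_cold / 4_000
warm_density = (len(sites) - in_cold) / 16_000
```

Comparing `cold_density` with `warm_density` shows how strongly a single genome-wide density can overstate the expected number of IS for genes located in cold spots.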
Real-world data.
In the first example, we applied all methods to an E. coli BW25113 strain library comprising approximately 102,000 IS at time point T0 [49]. As ground truth, we used the results of the single-knockout study by Baba et al. [4], which is often considered the gold standard. Fig 4A shows ConNIS outperforming the other methods by reaching MCC values up to 0.65. InsDens and Exp. vs. Gamma labeled too many genes as ‘essential’ (at least 575) even for their strictest thresholds. However, the thresholds of Exp. vs. Gamma had only a marginal influence on the number of genes labeled ‘essential’, which partly resembles the results of the simulation study. The PRCs reveal that all methods achieved mediocre precision values at best, with ConNIS having relatively stable precision values for rising recall values.
Fig 4. The vertical dotted line shows the number of true essential genes. Dots indicate the performance of the original methods ($w = 1$). A: E. coli BW25113 strain with approximately 102,000 IS [49]. B: E. coli K-12 MG1655 strain with approximately 390,000 IS [50]. C: Salmonella enterica serovar Typhimurium 14028S strain with approximately 186,000 IS [37]. Note that in applications A and B, most of the results of Exp. vs. Gamma are covered by those of InsDens.
A high-density E. coli K-12 MG1655 library characterized by approximately 390,000 IS [50] was used as the second example. As truth, we used the gene essentiality classification from the Profiling of E. coli Chromosome (PEC) database [51]. Here, results were consistent with the simulation study, i.e., all methods benefited from the high number of IS and performed best when the number of selected genes approached the number of true essential genes (Fig 4B). Weighting ($w < 1$) was highly beneficial for all methods, as the original versions ($w = 1$, indicated by the points) had high recall values at the cost of low precision values.
In the third example, all methods were applied to a Salmonella enterica serovar Typhimurium 14028S library comprising approximately 186,000 IS [37]. Following Nlebedim et al. [33], we used as truth the combined set of essential genes provided by Baba et al. [4] and Porwollik et al. [52]. In this scenario, all methods showed at best mediocre performance. The removal of IS with low read counts improved the performances of the methods slightly and might be a sign of the presence of spurious IS [31,53] (we tried minimum read count thresholds with values 1, 2, 3, 5 and 10). ConNIS achieved the best MCC and precision values. Exp. vs. Gamma and especially InsDens performed very poorly (Fig 4C), with the latter labeling far too many genes as ‘essential’, resulting in negative MCC values. Binomial, ConNIS, and Tn5Gaps benefited from a fairly low weight w, which seemed to reduce the number of false positives compared to the original versions ($w = 1$).
Semi-synthetic data settings.
To investigate the influence of the number of observed IS on the methods’ performances, we generated semi-synthetic data by drawing IS subsamples of four different sizes from a very high-density Tn5 library [11]. The Keio library [4] was used as a reference for true gene essentiality. In low- and medium-density libraries, ConNIS clearly outperformed the other methods (see Fig 5A; for medium-sized libraries see Fig A in S2 File). Similar to the synthetic and real-world data settings, ConNIS showed its best performance in terms of MCC when the number of selected genes was about the number of true essential genes. In the case of the rather high-density library, all methods were on par (Fig 5B).
Subsamples were generated by randomly drawing IS from a very high-density library to generate a low- (A) and high-density (B) library of E. coli BW25113 [11]. The Keio library [4] was used as reference for ‘true’ gene essentiality. The vertical dotted line indicates where the number of genes labeled ‘essential’ corresponded to the number of true essential genes.
Application of gene labeling instability criterion for tuning parameter selection
The performance of the gene labeling instability criterion for tuning parameter selection was investigated using the three previously described real-world datasets and three randomly chosen dataset examples from the simulation study. We applied the instability criterion to select the tuning parameter of each method. For each setting, subsamples were drawn without replacement, with each subsample containing a fixed proportion of randomly picked IS from the original data. Genes were truncated at each distal end. We used the MCC to evaluate the performance of the labeling instability criterion and compared it to MCC values for the optimal parameters (those that produce the highest MCC) as well as for parameters used in earlier studies, such as unweighted versions or heuristic choices.
The results in Table 1 demonstrate that the application of the gene labeling instability criterion for determining a weight for ConNIS is highly beneficial. For the real-world datasets E. coli BW25113 and Salmonella enterica serovar Typhimurium 14028S, as well as the synthetic dataset 2, the application of the instability criterion successfully identified the ‘optimal’ weights based on the corresponding MCC values (Table 1). In these examples, ConNIS also achieved the highest MCC with its selected w compared to all other methods. Furthermore, for synthetic datasets 1 and 3, the MCCs obtained by the instability criterion were close to the highest possible MCCs in these settings (see Table 1). Only for the MG1655 strain data was our tuning approach less successful for ConNIS (see Table 1).
The instability criterion also successfully tuned the other methods, resulting in many cases where the best possible MCC was achieved. For Exp. vs. Gamma, thresholds were selected in all six settings that resulted in MCC values close or even equal to those obtained with an optimal threshold value. In comparison to
thresholds used in recent studies [11,20], our instability approach gave similar or better MCC values, although the range of possible MCC values was rather small. Applying labeling instability to the posterior probability threshold r in InsDens yielded favorable results in five settings and achieved the highest possible MCC in three of them. It was also able to select a good choice between a very strict (
) and a very relaxed value (
) which have been used before [33]. Only for the Salmonella enterica serovar Typhimurium 14028S real-world dataset did the selected r yield a poor MCC value. However, since InsDens had already shown a weak performance in this setting (Fig 4C), this result is not surprising. For Binomial, Geometric and Tn5Gaps, applying the instability approach was also beneficial compared to the unweighted (i.e., original) versions of the methods.
Biological relevance
To highlight biological relevance beyond global performance metrics, we analyzed genes where ConNIS systematically disagrees with the other methods. Here, we focus on ‘major discrepancies’, i.e., genes called ‘essential’ by ConNIS but ‘non-essential’ by four to five comparator methods, or vice versa. Across the three libraries (E. coli BW25113, E. coli MG1655 and S. Typhimurium 14028S), this affects 59 genes in total (15, 26 and 18 genes, respectively). Of these, 44 genes are called ‘essential’ by ConNIS but ‘non-essential’ by the comparator methods, whereas 15 genes show the opposite case (see S4-S6 Files). Overall, the analysis shows that ConNIS agrees well with experimental gold-standard essentiality sets while providing specific gains in low-insertion regimes and for short genes, which have often been excluded a priori from the analysis because established methods lack the power to detect them. For all methods, the threshold/parameter values were set by our instability criterion.
In the first group, ConNIS-specific essential calls have a median length of 328 bp (interquartile range 131 to 477 bp; minimum gene length 74 bp). For example, ftsL (365 bp), ffs (113 bp), argU (76 bp), and folK (479 bp) were correctly identified as essential by ConNIS. ftsL encodes a cytoplasmic membrane protein whose essentiality manifests in a rapid block of cell division upon mutation [54]. The ffs gene product, together with Ffh, forms the well-known signal recognition particle (SRP) in E. coli, a multifunctional ribonucleoprotein complex fundamental for membrane protein targeting. As described by Peterson et al. [55], both the Ffh protein and the ffs-encoded 4.5S RNA are essential for cell viability and correct localization of proteins to the cytoplasmic membrane. Another RNA gene correctly identified by ConNIS is argU. Loss-of-function mutations have been reported to cause DNA replication defects [56] that manifest as inhibition of cell growth [57]. The essentiality of folK has been critically analyzed by Goodall et al. [11] who, in contrast to the Keio library [4] and the PEC database [51], classified the gene as ‘conditionally essential’ and not as ‘essential’. ConNIS, however, also called folK essential. Interestingly, Goodall et al. [11], the Keio library [4], the PEC database [51] and ConNIS are all correct; the explanation highlights how strongly the definition of essentiality depends on the chosen growth conditions. Goodall et al. [11] determined essentiality from a library obtained directly from LB agar plates and conditional essentiality from a library obtained after subsequent growth for 5–6 generations in liquid LB medium. In contrast, Wetmore et al. [49], whose data were used in this study, grew their mutant library on LB plates followed by growth in liquid LB medium (as in Goodall et al. [11], to an OD of 1.0). Consequently, ConNIS correctly identifies folK as essential for the Wetmore et al. [49] mutant library.
The disagreement pattern between ConNIS and its competitors is consistent with the known limitation of density/count-based methods on short or low-insertion genes, which are often excluded or down-weighted. Thus, the results highlight that ConNIS retains statistical power even for short genes. Yet, in the case of nusB, one of the four Nus factor encoding genes of E. coli, ConNIS wrongly assigns ‘essentiality’, while Bubunenko et al. [58] have shown that NusB, despite being important for cell growth, is not essential. A closer inspection revealed that the incorrect assignment stems from the fact that the analyzed library carries only one insertion at the far 3’-end of nusB. Consequently, the relatively long insertion-free gap yields a relatively low p-value.
The second group comprises genes that ConNIS called ‘non-essential’ but that the other methods classified as ‘essential’. These genes are typically much longer (median length
bp, interquartile range 1132 to 1659 bp). Two examples of genes correctly identified as non-essential by ConNIS are ptsI and ybcK. As shown by Wu et al. [59] and Wu et al. [60], viable loss-of-function mutants of ptsI and ybcK, respectively, can be recovered. On the other hand, ConNIS classified pssA as non-essential, whereas the encoded phosphatidylserine synthase (PssA) is known to be essential for viability in various pathogenic bacteria including E. coli [61]. The wrong assignment can be explained by the combination of three factors: first, only one insertion was observed close to the middle of the gene, making the observed gap nearly as small as possible for a single insertion site. Second, a rather low number of expected insertion sites was used due to the low weighting factor of
, increasing the probability of observing larger insertion-free gaps under the null model. Third, the applied Bonferroni-Holm correction is relatively conservative and can render even small p-values non-significant when thousands of genes are examined.
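The third factor can be made concrete with a small sketch of the Holm step-down adjustment [40]. The function name and the example numbers (4000 genes, a raw p-value of 2e-5) are hypothetical and chosen only for illustration.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1
        adj = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity of the adjusted values along the sorted order
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical example: one small p-value among 4000 tested genes
pvals = [2e-5] + [0.5] * 3999
adjusted = holm_adjust(pvals)
```

In this example the smallest raw p-value of 2e-5 is multiplied by 4000, giving an adjusted value of about 0.08, which would not be significant at a 0.05 level.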
Discussion and conclusion
In this work, we addressed three main challenges inherent in statistical analysis in TraDIS studies. The first challenge arises from the fact that in Tn5 datasets every base pair of the genome serves as a potential insertion site, while reported insertion densities often remain far below saturation levels. Considering this, ConNIS gives an analytic solution for the probability of observing an insertion-free sequence within a gene of a given length and number of insertion sites. The second challenge is the frequently observed non-uniform distribution of IS across the genome. Neglecting this factor can lead to an increased number of (nearly) insertion-free genes being incorrectly labeled as essential in regions with relatively low insertion densities. Addressing non-uniformity, ConNIS contains a weighting parameter that increases the precision by making it more difficult to label genes as ‘essential’ in low-density regions. We extended this idea to three state-of-the-art methods to improve their precision. The third challenge lies in the fact that many TIS methods rely on a priori set threshold or parameter values, which can substantially influence labeling performance. However, an ‘objective’ criterion for setting these values has been lacking, often resulting in arbitrarily chosen values. By introducing the concept of gene labeling instability based on subsamples of observed IS, we proposed a data-driven approach to select appropriate parameters and threshold values.
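To illustrate the kind of quantity involved in the first two challenges, the following sketch computes the exact probability that a gene contains at least one insertion-free run of a given length when a fixed number of insertion sites is placed uniformly at random within the gene. This is a textbook inclusion-exclusion calculation under that uniformity assumption and is not claimed to be the ConNIS formula, which is given in the Methods and S1 File. The function name, the gene length of 1000 bp, the insertion counts and the halving of the expected number of IS to mimic a weight below one are all hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_gap_at_least(gene_len: int, n_is: int, gap: int) -> Fraction:
    """P(at least one run of >= gap consecutive insertion-free sites) when
    n_is insertion sites are placed uniformly among gene_len positions."""
    if gap <= 0:
        return Fraction(1)
    if gap > gene_len - n_is:
        return Fraction(0)
    # Count placements in which all n_is + 1 gaps are shorter than `gap`,
    # using inclusion-exclusion over the gaps forced to have length >= gap.
    no_long_gap = 0
    for j in range(n_is + 2):
        remaining = gene_len - j * gap
        if remaining < n_is:
            break
        no_long_gap += (-1) ** j * comb(n_is + 1, j) * comb(remaining, n_is)
    return 1 - Fraction(no_long_gap, comb(gene_len, n_is))

# Hypothetical gene of 1000 bp with an observed insertion-free run of 300 bp
p_unweighted = prob_gap_at_least(1000, 20, 300)  # 20 expected IS
p_weighted = prob_gap_at_least(1000, 10, 300)    # 10 expected IS (weight 0.5)
```

With fewer expected insertion sites the same gap becomes more probable under the null model, so the weighted probability is larger than the unweighted one and the gene is less readily labeled ‘essential’.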
An extensive simulation study and applications to three real-world datasets and four semi-synthetic datasets were conducted to compare the performance of ConNIS to multiple state-of-the-art Tn5 analysis methods. In most settings, ConNIS outperformed these methods or was at least on par with the best of them. Unlike its competitors, ConNIS usually showed robust performance for (arbitrarily) chosen truncation and filter values. The results also confirm our idea of weighting the genome-wide insertion density when applied to existing methods: it could reduce the number of false positives without sacrificing too many true positives. Applying our proposed gene labeling instability criterion for tuning parameter selection in various real-world and multiple synthetic data scenarios demonstrated its potential to select favorable weight and threshold values for all methods. By inspecting major discrepancies between the classification results of ConNIS and its competitors, we showed that ConNIS was able to correctly classify even very short genes, thereby avoiding the standard practice of dismissing such genes from the analysis a priori.
Given that ConNIS demonstrated superior performance, especially in low and medium insertion density settings, its application is expected to improve the precision of results in experimental settings characterized by high selective pressure or observation of bottleneck effects. While we have investigated ConNIS’ ability to identify essential genes, we anticipate that its application might be similarly beneficial for the determination of conditionally essential genes, for example by comparing gene-wise ConNIS scores between conditions and by defining quasi-essential genes based on differences in these scores between time points or conditions. This would broaden the scope of ConNIS to settings where a non-binary characterization such as relative gene fitness is desirable. As a first step, we provide a proof-of-concept (see S7 File), where we illustrate how ConNIS in combination with the instability approach yields a continuous gene-wise essentiality evidence score and how this score can be used to classify quasi-essential genes and fitness-like effects. This framework could be extended in future work by explicitly modeling the loss of insertion sites over time or across conditions [27]. Further, the weighting approach could be improved by incorporating multiple weighting values to target different genomic regions more effectively.
Our gene labeling instability criterion was originally developed for selecting threshold and parameter values of Tn5 analysis methods, but it may also be applicable to other TIS methods that employ alternative transposons, such as the popular mariner transposon. It might also serve as a criterion in pre-processing steps like quality filters or trimming of distal gene ends. Last but not least, our work showed the crucial role of the underlying data-generating process on the performance of all methods. Future work could expand the range of scenarios considered, helping researchers choose the most appropriate method for analyzing their data. In this context, a systematic re-analysis of publicly available TraDIS datasets could raise the confidence in essential gene prediction and allow for further hypothesis generation. As a first step, we provide a curated multi-study resource comprising eight publicly available Tn5-based TraDIS/Tn-Seq datasets, including transparent per-study processing scripts and the resulting ConNIS essential-gene predictions, via Zenodo (DOI: https://doi.org/10.5281/zenodo.18538449).
Supporting information
S1 File. Proofs and methodological extensions.
PDF file with proofs for ConNIS and formal definition of the extension of existing methods.
https://doi.org/10.1371/journal.pcbi.1013428.s001
(PDF)
S2 File. Additional plots.
PDF file with plots of additional analysis results.
https://doi.org/10.1371/journal.pcbi.1013428.s002
(PDF)
S3 File. Detailed description of the generation of synthetic data.
https://doi.org/10.1371/journal.pcbi.1013428.s003
(PDF)
S4 File. Gene classifications for E. coli BW25113.
https://doi.org/10.1371/journal.pcbi.1013428.s004
(CSV)
S5 File. Gene classifications for E. coli MG1655.
https://doi.org/10.1371/journal.pcbi.1013428.s005
(CSV)
S6 File. Gene classifications for S. Typhimurium 14028S.
https://doi.org/10.1371/journal.pcbi.1013428.s006
(CSV)
S7 File. Proof-of-concept for gene fitness and quasi-essentiality.
An R Cookbook.
https://doi.org/10.1371/journal.pcbi.1013428.s007
(PDF)
Acknowledgments
The authors would like to thank Ian Henderson, Emily Goodall and Ash Robinson for providing the list of insertion sites of their high-density library of the E. coli BW25113 strain.
References
- 1. Gerdes S, Edwards R, Kubal M, Fonstein M, Stevens R, Osterman A. Essential genes on metabolic maps. Curr Opin Biotechnol. 2006;17(5):448–56. pmid:16978855
- 2. Zhang Z, Ren Q. Why are essential genes essential? - The essentiality of Saccharomyces genes. Microb Cell. 2015;2(8):280–7. pmid:28357303
- 3. Shang W, Wang F, Fan G, Wang H. Key elements for designing and performing a CRISPR/Cas9-based genetic screen. J Genet Genomics. 2017;44(9):439–49. pmid:28967615
- 4. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, et al. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol Syst Biol. 2006;2:2006.0008. pmid:16738554
- 5. Langridge GC, Phan M-D, Turner DJ, Perkins TT, Parts L, Haase J, et al. Simultaneous assay of every Salmonella Typhi gene using one million transposon mutants. Genome Res. 2009;19(12):2308–16. pmid:19826075
- 6. Chao MC, Abel S, Davis BM, Waldor MK. The design and analysis of transposon insertion sequencing experiments. Nat Rev Microbiol. 2016;14(2):119–28. pmid:26775926
- 7. Cain AK, Barquist L, Goodman AL, Paulsen IT, Parkhill J, van Opijnen T. A decade of advances in transposon-insertion sequencing. Nat Rev Genet. 2020;21(9):526–40. pmid:32533119
- 8. Liang Y-T, Luo H, Lin Y, Gao F. Recent advances in the characterization of essential genes and development of a database of essential genes. Imeta. 2024;3(1):e157. pmid:38868518
- 9. Christen B, Abeliuk E, Collier JM, Kalogeraki VS, Passarelli B, Coller JA, et al. The essential genome of a bacterium. Mol Syst Biol. 2011;7:528. pmid:21878915
- 10. Rubin BE, Wetmore KM, Price MN, Diamond S, Shultzaberger RK, Lowe LC, et al. The essential gene set of a photosynthetic organism. Proc Natl Acad Sci U S A. 2015;112(48):E6634-43. pmid:26508635
- 11. Goodall ECA, Robinson A, Johnston IG, Jabbari S, Turner KA, Cunningham AF. The Essential Genome of Escherichia coli K-12. mBio. 2018;9(1).
- 12. Sternon J-F, Godessart P, Gonçalves de Freitas R, Van der Henst M, Poncin K, Francis N, et al. Transposon Sequencing of Brucella abortus Uncovers Essential Genes for Growth In Vitro and Inside Macrophages. Infect Immun. 2018;86(8):e00312-18. pmid:29844240
- 13. Poulsen BE, Yang R, Clatworthy AE, White T, Osmulski SJ, Li L, et al. Defining the core essential genome of Pseudomonas aeruginosa. Proc Natl Acad Sci U S A. 2019;116(20):10072–80. pmid:31036669
- 14. Luo H, Lin Y, Liu T, Lai F-L, Zhang C-T, Gao F, et al. DEG 15, an update of the Database of Essential Genes that includes built-in analysis tools. Nucleic Acids Res. 2021;49(D1):D677–86. pmid:33095861
- 15. Rivas-Marin E, Moyano-Palazuelo D, Henriques V, Merino E, Devos DP. Essential gene complement of Planctopirus limnophila from the bacterial phylum Planctomycetes. Nat Commun. 2023;14(1):7224. pmid:37940686
- 16. Pritchard JR, Chao MC, Abel S, Davis BM, Baranowski C, Zhang YJ, et al. ARTIST: high-resolution genome-wide assessment of fitness using transposon-insertion sequencing. PLoS Genet. 2014;10(11):e1004782. pmid:25375795
- 17. Zhang YJ, Ioerger TR, Huttenhower C, Long JE, Sassetti CM, Sacchettini JC, et al. Global assessment of genomic regions required for growth in Mycobacterium tuberculosis. PLoS Pathog. 2012;8(9):e1002946. pmid:23028335
- 18. Chao MC, Pritchard JR, Zhang YJ, Rubin EJ, Livny J, Davis BM, et al. High-resolution definition of the Vibrio cholerae essential gene set with hidden Markov model-based analyses of transposon-insertion sequencing data. Nucleic Acids Res. 2013;41(19):9033–48. pmid:23901011
- 19. DeJesus MA, Ioerger TR. A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. BMC Bioinformatics. 2013;14:303. pmid:24103077
- 20. Larivière D, Wickham L, Keiler K, Nekrutenko A, Galaxy Team. Reproducible and accessible analysis of transposon insertion sequencing in Galaxy for qualitative essentiality analyses. BMC Microbiol. 2021;21(1):168. pmid:34090324
- 21. Ioerger TR. Analysis of gene essentiality from TnSeq data using Transit. Essential genes and genomes. Springer US; 2021. p. 391–421.
- 22. Kwon YM, Ricke SC, Mandal RK. Transposon sequencing: methods and expanding applications. Appl Microbiol Biotechnol. 2016;100(1):31–43. pmid:26476650
- 23. Zhang H, Lu T, Liu S, Yang J, Sun G, Cheng T, et al. Comprehensive understanding of Tn5 insertion preference improves transcription regulatory element identification. NAR Genom Bioinform. 2021;3(4):lqab094. pmid:34729473
- 24. van Opijnen T, Levin HL. Transposon Insertion Sequencing, a Global Measure of Gene Function. Annual Review of Genetics. 2020;54(1):337–65.
- 25. Lluch-Senar M, Delgado J, Chen W-H, Lloréns-Rico V, O’Reilly FJ, Wodke JA, et al. Defining a minimal cell: essentiality of small ORFs and ncRNAs in a genome-reduced bacterium. Mol Syst Biol. 2015;11(1):780. pmid:25609650
- 26. Green B, Bouchier C, Fairhead C, Craig NL, Cormack BP. Insertion site preference of Mu, Tn5, and Tn7 transposons. Mob DNA. 2012;3(1):3. pmid:22313799
- 27. Mahmutovic A, Abel Zur Wiesch P, Abel S. Selection or drift: the population biology underlying transposon insertion sequencing experiments. Comput Struct Biotechnol J. 2020;18:791–804. pmid:32280434
- 28. Kimura S, Hubbard TP, Davis BM, Waldor MK. The nucleoid binding protein H-NS biases genome-wide transposon insertion landscapes. mBio. 2016;7(4).
- 29. Manna D, Porwollik S, McClelland M, Tan R, Higgins NP. Microarray analysis of Mu transposition in Salmonella enterica, serovar Typhimurium: transposon exclusion by high-density DNA binding proteins. Mol Microbiol. 2007;66(2):315–28. pmid:17850262
- 30. Burger BT, Imam S, Scarborough MJ, Noguera DR, Donohue TJ. Combining genome-scale experimental and computational methods to identify essential genes in Rhodobacter sphaeroides. mSystems. 2017;2(3).
- 31. DeJesus MA, Ambadipudi C, Baker R, Sassetti C, Ioerger TR. TRANSIT - A Software Tool for Himar1 TnSeq Analysis. PLOS Computational Biology. 2015;11(10):e1004401.
- 32. Barquist L, Mayho M, Cummins C, Cain AK, Boinett CJ, Page AJ, et al. The TraDIS toolkit: sequencing and analysis for dense transposon mutant libraries. Bioinformatics. 2016;32(7):1109–11. pmid:26794317
- 33. Nlebedim VU, Chaudhuri RR, Walters K. Probabilistic identification of bacterial essential genes via insertion density using TraDIS data with Tn5 libraries. Bioinformatics. 2021;37(23):4343–9. pmid:34255819
- 34. Ghomi A, Jung JJ, Langridge GC, Cain AK, Boinett CJ, Abd El Ghany M. High-throughput transposon mutagenesis in the family Enterobacteriaceae reveals core essential genes and rapid turnover of essentiality. mBio. 2024;15(10).
- 35. Zhang C, Phillips APR, Wipfler RL, Olsen GJ, Whitaker RJ. The essential genome of the crenarchaeal model Sulfolobus islandicus. Nat Commun. 2018;9(1):4908. pmid:30464174
- 36. Jana B, Cain AK, Doerrler WT, Boinett CJ, Fookes MC, Parkhill J, et al. The secondary resistome of multidrug-resistant Klebsiella pneumoniae. Sci Rep. 2017;7:42483. pmid:28198411
- 37. Mandal RK, Kwon YM. Global screening of Salmonella enterica serovar Typhimurium genes for desiccation survival. Frontiers in Microbiology. 2017;8.
- 38. Sarmiento F, Mrázek J, Whitman WB. Genome-scale analysis of gene function in the hydrogenotrophic methanogenic archaeon Methanococcus maripaludis. Proc Natl Acad Sci U S A. 2013;110(12):4726–31. pmid:23487778
- 39. Dunn OJ. Multiple Comparisons among Means. Journal of the American Statistical Association. 1961;56(293):52–64.
- 40. Holm S. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics. 1979;6(2):65–70.
- 41. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1995;57(1):289–300.
- 42. Goodall ECA, Azevedo Antunes C, Möller J, Sangal V, Torres VVL, Gray J, et al. A multiomic approach to defining the essential genome of the globally important pathogen Corynebacterium diphtheriae. PLoS Genet. 2023;19(4):e1010737. pmid:37099600
- 43. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2010;72(4):417–73.
- 44. Liu H, Roeder K, Wasserman L. Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models. Adv Neural Inf Process Syst. 2010;24(2):1432–40. pmid:25152607
- 45. Müller CL, Bonneau R, Kurtz Z. Generalized stability approach for regularized graphical models. 2016.
- 46. Chicco D, Jurman G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Mining. 2023;16(4).
- 47. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. pmid:31898477
- 48. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405(2):442–51. pmid:1180967
- 49. Wetmore KM, Price MN, Waters RJ, Lamson JS, He J, Hoover CA. Rapid quantification of mutant fitness in diverse bacteria by sequencing randomly bar-coded transposons. mBio. 2015;6(3).
- 50. Ma Y, Pirolo M, Jana B, Mebus VH, Guardabassi L. The intrinsic macrolide resistome of Escherichia coli. Antimicrob Agents Chemother. 2024;68(8):e0045224. pmid:38940570
- 51. Yamazaki Y, Niki H, Kato J-i. Profiling of Escherichia coli chromosome database. In: Osterman AL, Gerdes SY, editors. Microbial gene essentiality: protocols and bioinformatics. Totowa, NJ: Humana Press; 2008. p. 385–9.
- 52. Porwollik S, Santiviago CA, Cheng P, Long F, Desai P, Fredlund J, et al. Defined single-gene and multi-gene deletion mutant collections in Salmonella enterica sv Typhimurium. PLoS One. 2014;9(7):e99820. pmid:25007190
- 53. Bai J, Dai Y, Farinha A, Tang AY, Syal S, Vargas-Cuebas G, et al. Essential Gene Analysis in Acinetobacter baumannii by High-Density Transposon Mutagenesis and CRISPR Interference. J Bacteriol. 2021;203(12):e0056520. pmid:33782056
- 54. Guzman LM, Barondess JJ, Beckwith J. FtsL, an essential cytoplasmic membrane protein involved in cell division in Escherichia coli. J Bacteriol. 1992;174(23):7716–28. pmid:1332942
- 55. Peterson JM, Phillips GJ. Characterization of conserved bases in 4.5S RNA of Escherichia coli by construction of new F’ factors. J Bacteriol. 2008;190(23):7709–18. pmid:18805981
- 56. Slagter-Jäger JG, Puzis L, Gutgsell NS, Belfort M, Jain C. Functional defects in transfer RNAs lead to the accumulation of ribosomal RNA precursors. RNA. 2007;13(4):597–605. pmid:17293391
- 57. Sakamoto K, Ishimaru S, Kobayashi T, Walker JR, Yokoyama S. The Escherichia coli argU10(Ts) phenotype is caused by a reduction in the cellular level of the argU tRNA for the rare codons AGA and AGG. J Bacteriol. 2004;186(17):5899–905. pmid:15317795
- 58. Bubunenko M, Baker T, Court DL. Essentiality of ribosomal and transcription antitermination proteins analyzed by systematic gene replacement in Escherichia coli. J Bacteriol. 2007;189(7):2844–53. pmid:17277072
- 59. Wu X, Lv X, Lu J, Yu S, Jin Y, Hu J, et al. The role of the ptsI gene on AI-2 internalization and pathogenesis of avian pathogenic Escherichia coli. Microb Pathog. 2017;113:321–9. pmid:29111323
- 60. Wu T, Liu J, Li M, Zhang G, Liu L, Li X, et al. Improvement of sabinene tolerance of Escherichia coli using adaptive laboratory evolution and omics technologies. Biotechnol Biofuels. 2020;13:79. pmid:32346395
- 61. Lee E, Cho G, Kim J. Structural basis for membrane association and catalysis by phosphatidylserine synthase in Escherichia coli. Sci Adv. 2024;10(51):eadq4624. pmid:39693441