A Statistical Framework for Improving Genomic Annotations of Prokaryotic Essential Genes

Large-scale systematic analysis of gene essentiality is an important step closer toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model–based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. Through iteratively optimizing parameters, we inferred the actual insertion events occurred and described each gene’s essentiality on probability measure. Evaluated by the definite mapping of essential gene profile in Escherichia coli, our model significantly improved the accuracy of original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation level TM datasets. To test our model’s broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, among all three cases that our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. Our contribution is timely considering the rapidly increasing essential gene sets. A Webserver has been set up to provide convenient access to this tool. All results and source codes are available for download upon publication at http://research.cchmc.org/essentialgene/.


Introduction
Large-scale systematic analysis of gene essentiality is an important step closer toward unraveling the complex relationship between genotypes and phenotypes [1]. However, such analysis cannot be accomplished without unbiased and accurate annotations of essential genes.
Whole-genome knockout experiments produce the most accurate essential gene annotations, but they often take a consortium of labs many years to complete in even one organism. It is not surprising that single-gene knockout results are currently only available in a handful of well-studied model organisms, such as Escherichia coli [2] and Saccharomyces cerevisiae [3]. Computational methods for predicting essential genes, e.g., homology mapping [4,5,6], constraint-based methods [7,8,9,10] and supervised learning [11,12,13,14,15], are often useful in reducing the cost and labor. However, they have limited applicability to understudied species. For example, supervised learning depends on a partial list of known essential genes which is often unavailable for understudied species. Constraint-based methods are often limited to metabolic and signaling pathways and require prior knowledge of pathways. Homology mapping works best when the model and target organisms are closely related. Otherwise, the prediction coverage is often low and it ignores the unique physiology of the subject species as people have found that orthologs do not necessarily have the same degree of essentiality [16].
Therefore, to interrogate the essential genes in understudied species, whole-genome transposon mutagenesis (TM) followed by sequence-based identification of insertion sites is often the most practical and the most frequently used experimental approach [17]. Using this approach, essential genes for a variety of understudied bacterial species, such as Pseudomonas aeruginosa and Francisella tularensis novicida, have been identified, greatly increasing the insights into the essential processes necessary for growth of these bacteria under defined conditions. As of May 2012, we have found at least 24 genomic scale TM studies in 19 bacterial species from literature (Table S1), all of which except two were generated within the past ten years. For the past three years, we have seen at least nine such new datasets from different organisms. As researchers are starting to analyze conditional essential genes, that is, essential genes under virtually unlimited growth conditions, we expect the number of available TM datasets will rapidly increase in the future years.
As a result of this rapid increase, in current genomic databases, e.g., [18,19,20], many of the annotations of microbial essential genes are derived directly or indirectly from TM experiments. However, unlike single-gene knockouts, TM has intrinsic biases (see below). Without correcting these biases, the TM assignments will contain hundreds of mis-annotated essential and non-essential genes in each organism, which has caused substantial confusion among the genetics community.
Transposons are segments of DNA that can move (transpose) from one location in a genome to another [21,22]. The locations in which a transposon can move depends on the sequence that the transposase recognizes and cleaves, although the recognition sequence for some transposons is unclear or has yet to be determined. TM results in disruption of the region of the genome where the transposon is inserted. If an insertion within a predicted ORF allows the resulting strain to form a colony on appropriate solidified media, it is unlikely that ORF is essential for viability under those conditions (Fig. S1). Therefore, TM identifies essential genes using a ''negative'' approach, i.e., identifying many regions that are not essential and presuming that everything else is essential.
Due to the random nature of transposon insertion events, there are a number of factors that may create systematic biases in TM experiments. For example, it is inevitable that some genes, especially shorter ones, will be missed simply by chance [23,24,25]. This will create false positive errors in which nonessential genes are determined as essential by TM. On the other hand, the insertion may take place at any part of a gene, such as the extreme ends, which may not fully disrupt the function of the gene product [23,24,25,26,27,28,29,30]. This will create false negative errors in which essential genes are determined as nonessential by TM. These biases from TM experiments have introduced substantial errors in essential gene annotations in current genomic databases [19]. In order to render the large-scale integrative and comparative analysis of essential genes possible, these biases must be quantitatively assessed and corrected.
Several models were previously developed to tackle the biases in TM. For example, Lamichhane et al. used a Bayesian framework to estimate essential genes in Mycobacterium tuberculosis with a subsaturation level of TM [28,31]. Jacobs et al. used a neutral base pair model to reach an estimation of 307 essential genes in P. aeruginosa PAO1 with a saturation level of TM [29]. However, the key limitation of the existing models is that they only take into account the insertions in viable mutants and ignore those that disrupt essential genes because these mutants die and, as such, are not observable. Therefore, the number of real insertions is underestimated to equal the observed insertions. Also, most of the existing models assume a gene's probability of being inserted only depends on its length and one insertion is sufficient to disrupt a gene's function. However, the TM results often show ''hot'' and ''cold'' spots in the genome for transposon insertions. Furthermore, it may often require multiple insertions to disrupt a gene's functions, depending on the site of the insertion.
Given these caveats, we developed a novel Poisson model based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. Briefly, the statistical framework works as follows: We first quantitatively assessed the effects of potential factors that may affect the accuracy of TM results, such as gene length and relative insertion positions, and subsequently incorporated relevant factors into the framework. Through iteratively optimizing parameters, we finalized the model and inferred the actual insertion events occurred in each gene given the observed insertion information. Finally we described each gene's essentiality on probability measure, and provided corrections towards possible biases in the TM assigned annotations.
We took advantage of the definitive mapping of essential genes in E. coli MG1655 strains determined by single gene knockout experiments (PEC set) [2] to identify the errors in the essential gene annotations produced by TM experiments (Gerdes set) [23] by comparing their assignments. Although the single-gene knockout experiments are not completely error free, they have been considered the least error-prone [2]. We also realized that the essential genes uniquely identified by TM may have biological significance as they may represent genes essential for fitness as suggested by Gerdes et al. [32]; however, since our focus was on those essential for survival, we still referred to them as ''errors''. Also note, our model is not dependent on the single-gene knockout results in the target organism (see the Discussion section). The PEC dataset was only used for assessing the errors in TM dataset and evaluating our model's performance.

Assessing Overall Error Rates in the TM Annotations of Essential Genes
Using the PEC set as a reference, we assessed the overall error rates in the Gerdes dataset. We examined the subset of genes appearing in both datasets. This subset contains 3833 genes in total, including 615 TM-assigned essential genes (TmEs) and 3218 TM-assigned non-essential genes (TmNs), and covers 90% (3833 out of 4291) of all genes in the TM dataset.
According to the intersection with the PEC set, the Gerdes dataset can be divided into four mutually exclusive fractions ( Table 1): true essential by TM (TETmE, essential assigned by TM and PEC), false essential by TM (FETmE; essential assigned by TM but non-essential in PEC), false non-essential by TM (FNTmN; non-essential by TM but essential in PEC) and true non-essential by TM (TNTmN; non-essential by both TM and PEC). In this report, we defined the TM essential error rate (TmER) as the proportion of the FETmEs over the total TmEs; likewise, the TM non-essential error rate (TmNR) was defined as the proportion of the FNTmNs over the total TmNs. Among the 615 TmEs, 186 were TETmEs, yielding a TmER of 70%. Similarly, the TmNR was 2.3%. Typically, in TM experiments the TmNR is low, but the TmER is relatively high.

Assessing the Effects of Four Factors on the Accuracy of TM Experiments
Previous studies suggested the accuracy of TM experiments may be affected by four main factors: gene length, insertions in the distal regions, number of insertions per gene and polar effects [23,29]. To quantitatively assess each factor's influence on the accuracy of TM results, we examined the association of each factor with the FETmEs and FNTmNs, respectively ( Fig. 1).
(1) Gene length: potentially causing false positive errors. In TM experiments, the genes that have never been detected with transposon hits will be assigned as essential (Fig.  S1). The average detectable insertion density is about 1 per 400 bp in TM experiments under saturation levels [23,24,25]. This suggests relatively short genes (e.g. #300 bp) would easily be missed simply by chance, and thus will be incorrectly labeled as essential.
To quantitatively assess its influence, we compared the length of essential genes in the Gerdes and PEC sets, using the total genes as a control (Fig. 1A). The student t-test shows that the average length of essential genes in the Gerdes set (730 bp) is significantly shorter than that in the PEC set (1,003 bp) (P-value ,1E-11). It is also significantly shorter than total genes (982 bp) (P-value ,1E-25) while the difference between the PEC set and total genes is considered not significant (P-value .0.05).
(2) Insertions in the 59-and 39-ends of genes: potentially causing false negative errors. In TM experiments, sometimes insertions occurring at the extreme ends of a gene's coding sequence may not sufficiently disrupt its function [23,24,25,26,27,28,29,30]. In these cases, essential genes may be mistakenly assigned as non-essential genes, creating a false negative error. We compared the distributions of the position of insertions within the ORFs between FNTmNs and TNTmNs. As expected, the FNTmNs have a higher percentage of transposon insertions in the 39-and 59-ends than TNTmNs (Fig. 1B). To assess the significance of this difference, we simulated pure random insertion experiments within the coding sequences (see Methods). The P-values showed that in the 20%-most of the 39-end and the 5%-most of the 59-end regions, the FNTmNs have significantly more insertions than TNTmNs (Table S2). We named it the ''25% extreme ends'' rule and used it later in the model. Occasionally, a few insertions in a gene are insufficient to completely disrupt its function, especially when the target gene is relatively long [23,24,25,26,29,33]. We plotted the distribution of the number of insertions per gene for both FNTmNs and TNTmNs (Fig. 1C). The histogram showed that about 75% of the FNTmNs that were mistakenly assigned to be non-essential by TM only harbored a single insertion. The average insertion number per gene among FNTmNs was 1.56, significantly smaller than TNTmNs (4.13) (P-value #1E-26).
(4) Polar effects: potentially causing false positive errors. In TM experiments, polar effects can occur when a transposon inserts in a dispensable gene and prevents the transcription of its downstream essential genes in the same operon [23,24,25]. Because the insertion in this dispensable gene actually disrupts the entire downstream essential genes and causes death of these mutants, this dispensable gene will be incorrectly labeled as essential, causing a false positive error.
To detect the number of FETmEs caused by polar effects, we examined all TmEs within each of 2,665 operons that were inferred experimentally or computationally (see Methods). If there exists a TETmE that resides downstream of a FETmE, this FETmE is considered to be potentially caused by a polar effect. Among the 429 FETmEs, only 46 of them may be caused by polar effects. The small number is likely due to the fact that many TM experiments have been designed to prevent polar effects, typically by designing transposons with a strong or regulatable promoter downstream of the transposase but still within the transposon.

Developing the Statistical Framework to Correct TM Errors
We then developed a statistical framework that is capable of incorporating potential factors that have strong associations with FETmE and FNTmN assignments. This model not only estimates the overall error rates, but also assigns a score to indicate the probability that an individual gene is essential given the TM assignments. We incorporated three of the above four potential factors into our model. Since polar effects are only responsible for a relatively few number of FETmEs, we chose not to include this factor into our model. The general idea of this model is illustrated in Fig. 2.
This model requires two assumptions: (A) Transposons insert randomly and independently into the coding region of a gene; and (B) Each transposon has the same ability to disrupt a gene's function, although this ability may vary at different regions of a gene.
Assumption (A) does not require transposon insertions to be uniformly distributed along the entire genome because of the insertion ''hot'' and ''cold'' spots observed in microbial genomes and thus provides a more realistic approximation of the process of transposon insertions.
If we assume that transposons insert into a gene independently and the insertions occur at a constant rate during mutagenesis, then this process can be characterized by a Poisson distribution [34,35]. The probability that there are k insertions occurring within a gene with length L can be expressed as: Here r is the local insertion density on the DNA fragment, estimated by counting the number of insertions within a 30 kblong region flanking the coding sequence. If k = 0, this equation describes the probability that this ORF is missed in TM experiments.
Based on assumption (B), we defined two parameters P 1 and P 2 both in the range of (0, 1) representing the probability that an individual transposon insertion disrupts a gene's function when the insertion occurs at the 25% extreme ends or in the middle of a gene, respectively. We assumed the same probability (P 1 ) for the 5%-most of 59-end and 20%-most of 39-end to disrupt a gene's function.
Under these two assumptions, we can calculate the probability of being essential for each gene given the TM assignments.
(A) If it is a TM assigned essential gene (TmE), we have: (B) Similarly, if it is a TM assigned non-essential gene (TmN), we have: Here E is a binary variable and E = 1 if this gene is essential; otherwise, it is non-essential. n real represents the real number of insertions occurring in this gene during the TM experiment and n obs represents the observed number of insertions in the TM dataset. Here n obs = 0 if the gene is assigned as essential by TM, and n obs .0 if it is assigned as non-essential. n obs can be further separated into two parts n 3 0 or5 0 ends and n middle to represent the observed insertions occurring at the 25% extreme ends or in the middle of a gene, respectively. In the transposon insertion process, if an insertion hits a true essential gene and disrupts its function, the inserted mutant will die and this insertion will not be observable in the TM dataset; therefore the real insertion number n real should be greater or equal to the observed insertion number n obs . While for a non-essential gene, no matter whether an insertion disrupts its function or not, it will not die. Thus the real insertion number n real always equals to the observed insertion number n obs . After the derivation of equations, we then used an iterative procedure [36,37,38] to estimate the values of unknown parameters that determine Eqs. (1) and (2) (details see Methods).

Validating the Model in E. coli TM Dataset
The Gerdes dataset contains 615 TmEs and 3218 TmNs. Using the above algorithm, we found the converged P 1 = 0.942, P 2 = 0.984 and the overall essential rate r ess = 12.84%. Since our model assigned each individual gene a score to indicate its probability of being essential, we ranked these genes in the TmEs and TmNs separately. Among the 615 TmEs, the expected number of essential genes ( P n i P(E~1)) we estimated was 480. Using the expected number of essential genes as the cutoff, the top 480 genes were named predicted essential genes by our model among the TM-assigned essential genes (PETmEs) and the remaining 135 genes were predicted non-essential genes by our model among the TM-assigned essential genes (PNTmEs). Similarly, among the 3218 TmNs, the expected number of essential genes we estimated was 12. Using this cutoff, the top (1) Part A: It never had any insertion and was missed by all transposons by chance. This means we do not have useful information to infer what this gene could be, and it is completely blind for us. For any blind gene, we can only try our best guess and assume that the chance of that gene to be essential (Pr (E~1Dn real §n obs~0 )) is equal to the overall essential gene rate (Pr(overall essential)), and that a gene to be non-essential is equal to Pr (E~0Dn real §n obs~0 ) = 1-Pr (E~1Dn real §n obs~0 ). (2) Part B: It actually had insertions, but all inserted mutations died. (Pr (E~1,n real w0Dn real §n obs~0 )) This means that this gene is truly essential. In this way, we can now split the TM assigned essential genes into two parts, TETmE and FETmE. Similarly, if in the TM experiment, a gene is observed to have insertions, meaning it is TM nonessential, what could it really be? There are also two possibilities: (1) Part C: All these observed insertions are ineffective, and did not interrupt the gene function. This means again we are blind about this gene. So it has a certain chance to be essential (Pr (E~1,allinsertionsineffectiveDn real~nobs w0)), and also has a certain chance to be nonessential (1-Pr (E~1,allinsertionsineffectiveDn real~nobs w0)). (2) Part D: There was at least one effective insertion, and it did interrupt the gene function. (Pr (E~1,atleastoneinsertioneffectiveDn real~nobs w0)~0). This means this gene is truly non-essential. doi:10.1371/journal.pone.0058178.g002 12 genes were named predicted essential genes by our model among the TM-assigned non-essential genes (PETmNs) and the remaining ones were predicted non-essential genes among the TM-assigned non-essential genes (PNTmNs).
To assess the accuracy of our predictions, we compared our results with the PEC dataset ( Table 2). Among the 480 PETmEs, 176 (or 37%) were true essential, significantly higher than that in the original TmEs (186/615 = 30%) (P-value = 0.013, Fisher's exact test). Remarkably, among the 135 PNTmEs that we filtered out, only 10 (or 7.4%) were true essential, significantly lower than that of the original TmEs, i.e., 30% (P-value ,1E-8). On the other hand, among the 12 PETmNs, 5 (or 42%) were true essential, significantly higher than that in the original TmNs (2.3%) (P-value ,1E-6). These results strongly indicated that our model successfully enhanced the accuracy of the original TM assignments.
Our results also showed a positive correlation between the confidence score and the enrichment of essential genes (Fig. 3). In other words, the higher confidence scores we chose as the cutoff, the higher percentage of true essential genes our predictions contained, indicating that our score system is in agreement with the distribution of essential genes. For instance, the top 100 PETmEs contained 50 true essential genes, this ratio is significantly higher than that in the original TmEs (30%) with P-value ,1E-4.

Testing the Model's Robustness in Analyzing Subsaturation Level TM Datasets
Compared with a saturated TM experiment, an unsaturated TM experiment generally contains a higher TmER, because genes are more likely to be missed by transposon insertions and thus incorrectly assigned as essential.
To test our model's applicability to unsaturated TM datasets, we randomly removed 10%, 30% and 50% of the total insertions from the Gerdes dataset to simulate the effects of different subsaturation levels of TM experiments. We then applied our model on these subsaturated datasets.
The results suggested a strong robustness in analyzing subsaturation level TM datasets (Fig. 3). As shown in Fig. 4, the lower curve (dashed line) showed the p-values of the Fisher's exact test to examine whether the true essential rate in PNTmEs is significantly lower than that in the original TmEs set. Similarly, the upper curve (solid line) showed the p-values of the Fisher's exact test to examine whether the true essential rate in our PETmNs is significantly higher than that in the original TmNs set. For each of the three (10%, 30% and 50%) random experiments, we repeated 100 times to obtain the error bars. The results showed that under each of these subsaturation conditions, our model still significantly improved TM results.

Testing the Model's Applicability to P. aeruginosa by Allelic Exchange Experiments
The ultimate test is to see if our model is applicable to other microorganisms. A set of essential genes has been determined by TM to a saturated level in another c-Proteobacteria, P. aeruginosa PAO1 (Jacobs dataset) [29]. This TM dataset contains 678 putative essential genes and 4892 non-essential genes. If we assume the probability that an individual transposon insertion disrupts a gene's function, is the same across different species, i.e., P 1 = 0.942 and P 2 = 0.984 as those in E. coli, then the overall essential rate r ess we estimated for PAO1 is 10.1% and the expected numbers of PETmEs and PETmNs are 540 and 15, respectively.
Because a whole-genome single gene knockout dataset is not yet available in this organism, we chose to pursue allelic exchange experiments to validate our predictions in PAO1.
In order to make sure our experimental procedure can correctly identify essential genes, we first tested it on PA4238 as positive control. PA4238 is a subunit of RNA polymerase, which is the   target of the microbicidal antibiotic, Rifampin [39]. The essentiality of PA4238 is confirmed by independent gene knockout efforts [24,29]. As expected, PA4238 was determined by the allelic exchange experiments as essential. According to our model, PA4238 has a rank 46 out of 678, higher than the expected number of essential genes among TmEs, i.e., 540; therefore it was predicted as essential by our model. In this case, the results from the TM assignment, our model and the allelic exchange experiment are consistent ( Table 3).
We then selected six ORFs from the PAO1 TM dataset to further test our model: PA0723, PA2954 and PA2143 were selected from the TmEs; and PA3746, PA4260 and PA0985 were selected from the TmNs. The main purpose to conduct allelic exchange experiments for validations was to demonstrate that TM indeed contains errors and our method is capable of correcting some of these errors. To serve this purpose, we chose the genes that are most likely to be TM errors as our testing cases based on the following considerations: First, the testing cases among TmEs should be lower ranked (close to or below the cutoff of 540) while the testing cases among TmNs should be higher ranked (close to or above the cutoff 15). The higher-ranked genes among TmEs and lower-ranked genes among TmNs are mostly agreements between TM and our model and most likely to be correct assignments; therefore, they would not be interesting testing cases. Second, among these lower-ranked TmEs and higher-ranked TmNs, we challenged ourselves by selecting those more difficult cases, i.e., we either cannot find an ortholog in E. coli or the assignments based on orthologs are contradictory to TM assignments. Third, we chose our testing cases to be on both sides of the cutoffs to test the robustness of our cutoff setting. Finally, among the short list of genes that met the above three criteria, we nailed down to six based on experimental feasibility.
Among the three TmEs (PA2954, PA0723 and PA2143), PA0723 was a true essential gene but PA2954 and PA2143 turned out to be non-essential. According to our model, PA0723 was ranked 414 out of 678. Because the expected number of essential genes among TmEs is 540, it was correctly predicted to be essential. In contrast, PA2954 and PA2143 were given a rank of 588 and 663, respectively. They were both predicted as false positive error (i.e., non-essential) genes because their ranks were lower than 540. Our model was correct in all three cases.
Among the three TmNs (PA3746, PA4260 and PA0985), PA0985 was a true non-essential gene but PA3746 and PA4260 were tested to be essential. Our model assigned PA3746 a rank of 8 out of 4892. Because the rank was higher than 15, the expected number of essential genes among the TmNs, PA3746 was predicted to be essential. In contrast, PA0985 and PA4260 were assigned a rank of 113 and 103, respectively; they were predicted to be non-essential. Our model was correct in two out of the three cases.
Overall, among the seven genes tested by allelic exchange experiments, our model agreed with the experimental validations in six of them. Remarkably, among all three cases that directly contradicted the TM assignments, allelic exchange experiments supported our predictions. Details of allelic exchange experiments are provided in the Sup (Fig. S2 and Table S3).

Examples of Identified TM Annotation Errors Confirmed by Literature
Among the TM annotation errors in PAO1 identified by our model, a number of them have been confirmed by literature.

Below are three examples:
Hfq (PA4944) is a RNA-binding protein in PAO1 and involved in the Bacterial RNA degradation pathway. It is identified as an essential gene by TM experiments [29], but assigned as nonessential by our model (ranked 572 out of the 678 TmEs). A recent study [40] investigated the effect of Hfq on virulence and stress response of PAO1 and compared the growth rate of a wild-type strain and a Dhfq strain. The Dhfq strain showed a reduced growth rate compared with the wild-type strain, which indicates that this gene is essential for fitness but not for survival and should be referred as a non-essential gene here.
FpvI (PA2387) is a RNA polymerase sigma factor and also a TM-assigned essential gene in PAO1. Our model predicted it as non-essential by assigning it a rank of 582 out of the 678 TmEs. Beare et.al studied its role in siderophore-mediated cell signaling [41]. They found that FpvI is required for the synthesis of ferripyoverdine receptor FpvA which is part of the signaling pathway regulating pyoverdine (a form of siderophore) production. In the experiment, they constructed a DfpvI strain and found that this mutant produced much lower amounts of FpvA than the wildtype stain, which suggests that FpvI is required for normal amounts of FpvA production but not for survival.
PcrG (PA1705) is a regulator in type III secretion system (TTSS) which enables the bacteria to translocate virulence effectors directly into the cytosol of host cells. It ranks 559 out of the 678 TmEs by our model and is predicted as a non-essential gene, contradicting the TM assignment. To further understand its mechanism, Sundin's group constructed an in-frame deletion mutant of PcrG and cultured the mutant in LB medium. They found the PcrG mutant can up-regulate the expression of exoenzyme S (ExoS) which has been identified as effectors targeted into host cells by the TTSS of PAO1 [42]. This experiment also proved that PcrG is a non-essential gene.

Testing the Model's Broad Applicability to Other Organisms
Francisella tularensis is a Gram-negative, intracellular pathogen that causes the disease tularemia. F. tularensis novicida is commonly used as surrogate in virulence studies particularly virulent toward mice and is also able to cause tularemia-like disease in rodents. Genome-scale TM experiments have been performed on its subspecies F. tularensis novicida [25]. This TM library consists of 16,508 unique insertions covering 1,434 out of the 1,767 genes, an average of .9 insertions per gene, which achieves the highest coverage among the available TM experiments of bacterial species. We applied our model on this TM library and estimated the overall essential rate is 14.5%. Among the 333 TmEs, we predicted 251 genes as essential labeled them as PETmEs. Among the 1434 TmNs, we predicted 5 as essential and labeled as PETmNs. The prediction results have been made available on the Web server described in the next section. The TM annotations from other microorganisms will be processed and deposited in our Web server once the results become available.

Developing a Web Server that Provides Convenient Access to Our Method
To make our model readily available to the research community, we developed a convenient and user-friendly Web service named EGTEC (The Essential Gene TM Annotation Error Corrector) hosted at: http://research.cchmc.org/ essentialgene/. The EGTEC interface is implemented using PHP-HTML and comprises of two main functions (Fig. 5): (1) A query tool for a quick exploration of the corrected TM sample results in bacteria, which currently includes E. coli, P. aeruginosa and F. tularensis novicida. Once a query is received, the tool will check for the gene's presence in the database and displays the matching records for the input, including the original TM annotations and the EGTEC's prediction as well as the results from single-gene knockout experiments if available. The tool will accept the input query gene's name in the following formats: gene symbol/name, GenBank accession number or Entrez Gene ID. (2) An uploading tool for submitting new TM experimental data by users. The EGTEC accepts user-generated TM experimental results and makes corrections on these results. The input should contain the following information: (a) Each ORF's name, (b) Start and end position of this ORF and (c) The transposon insert position in this ORF. A sample input file is provided along with the uploading box. When finished, the corrected TM annotations including the predicted probability of each ORF being essential will be sent to the user through email.

Discussion
The intrinsic biases in TM experiments motivated us to develop a statistical framework to systematically filter out errors (both FETmEs and FNTmNs) and thus improved the accuracy in TMdetermined essential gene annotations. This model is significant in four ways: First, our model significantly enhances the accuracy of the original TM assignments. In the E. coli TmEs, our PNTmEs, i.e., the false essential genes filtered out by our method, had a significantly lower true essential rate than that in the original TmEs. In contrast, in the TmNs, our PETmNs, i.e., the predicted essential genes from TmNs, had a significantly higher true essential rate than that in the original TmNs. The confidence scores generated by our model were shown to have positive correlations with the enrichment of true essential genes.
Being able to assign PETmNs demonstrates the advantage of our approach over previous studies by recovering true essential genes from TmNs. Most of the existing models, e.g., in [28], only focused on removing false essential genes from TmEs, but are incapable of recovering false non-essential genes, although relatively few, from TmNs. In E. coli, among the 12 false nonessential genes we recovered, five are true essential, significantly higher than original TmNs ( Table 2).
Second, because our model adopts simple but more realistic assumptions, it is applicable across multiple microorganisms. It is very important to note that our method is not dependent on single gene knockout results to make predictions. The single gene knockout results in E. coli were only used for assessing the TM errors and evaluating the performance of our predictions. In P. aeruginosa PAO1 where single gene knockout results are unavailable, we demonstrated that our method is remarkably accurate based on literature and the six allelic exchange experiments. Among the seven chosen genes for an experimental test, six of them were assigned correctly by our model with an overall accuracy of 86%. These results clearly demonstrated our method's reliability and applicability to organisms that do not have single gene knockout results. We further demonstrated our model's broad applicability to F. tularensis novicida, where single gene knockout results are also unavailable. Therefore, we believe, if granted full access to the information of transposon insertion positions in the bacterial genomes, our model can be readily applicable to all 24 available TM datasets listed in Table S1. Third, our model displays robustness in analyzing unsaturation level TM datasets and resisting with experimental errors as demonstrated in the simulated unsaturation level TM datasets. This is potentially useful in significantly reducing the time and costs currently associated with TM experiments. Finally, our model is flexible and able to incorporate more potential factors. For example, we assigned different weights to the insertions based on the positions where they were inserted in the genes. In the future, we will consider other factors that might affect the accuracy of TM experiments [33].
To benchmark our method's performance against different types of computational methods, we compared their prediction performance in Table S4. We only included the studies conducted in E. coli in order to make an objective comparison. It is not surprising that supervised methods that rely on gold standard datasets often outperform non-supervised methods that do not rely on gold standard datasets. The main problem of supervised learning is that it is not applicable to understudied species where very few essential genes are already known; therefore, its superior performance in sensitivity, specificity and precision is not meaningful to the problem we intend to solve. Although the constraint-based method outperforms our model in precision, its sensitivity and specificity are rather poor. This reflects its limitation of being dependent on a priori knowledge on pathways and reactions. Homology mapping has a similar level of performance as our model, but it requires a closely related model organism, and the accuracy of assignment is highly dependent on the distance between the model organism and the target organism. In this case, homology mapping also performs well mainly because single gene knockout results are available in a well-studied and closely-related organism, Acinetobacter baylyi [43]. It is unrealistic to expect such a model organism is always available for an understudied species.
Our approach represents a significant advancement over Gerdes approach to filter TM errors. First, the cutoff settings in Gerdes approach appeared to be somewhat arbitrary and lack of statistical rigorousness. For example, all genes longer than 240 bp and free from inserts were assigned as essential, and genes with at least one insertion were designated as non-essential unless they are relatively long (.900 bp). In contrast, all of our parameters were determined through the iterative learning process. Second, Gerdes annotations were not strictly based on TM data but also involved manual inspection that requires prior knowledge on E. coli physiology. Therefore, Gerdes approach cannot be easily extended to an understudied organism and expected to achieve the same performance. In contrast, our approach only relies on the TM data, i.e., transposon insertion positions in the bacterial genome. Even without using prior knowledge to improve our assignments, our model still significantly outperformed Gerdes approach in precision and specificity (Table S4).
In other organisms, although the performance metrics cannot be easily assessed because definite mappings of essential genes are not yet available in those species, our approach exhibits several advantages: Our model is capable of estimating the probability score of being essential for individual genes, rather than only estimating an overall essential rate for the whole genome as in Jacobs et al.'s method. In addition, Jacobs et al. used a multinomial distribution which cannot incorporate the difference between ''hot'' and ''cold'' spots. Furthermore, their method is not applicable for correcting false negative errors, i.e., detecting true essential genes among TmNs. In Lamichhane et al.'s study, their method was applicable to a subsaturation level of TM. Since a gold-standard dataset is not yet available in M. tuberculosis, we cannot compare our subsaturation results with theirs. In addition, they also have the same issues as in Jacobs et al.'s method by disregarding false negative errors and differences in insertion density. By taking into account these features, our model is expected to be more realistic and accurate.
Our effort is timely given the large number of existing annotations of essential genes and rapidly increased TM datasets. The need for such a system to assess the TM errors becomes even more urgent as researchers start to explore conditional essentials, which could hold the key to understanding gene essentiality. For the past three years, we have seen numerous new bacterial essential gene sets published and the majority of them were generated by TM approach [44,45,46,47,48,49,50,51,52,53] ( Table S1). We expect growth of essential gene datasets to greatly accelerate. Therefore our contribution is not only valuable but also timely. As long as TM experiments remain the dominant method to determine essential gens in prokaryotes, our method is expected to remain useful.
A logical next step is to investigate our model's applicability to eukaryotes. However, because eukaryotes have much more complex genome structure than prokaryotes, e.g., introns, the mechanism for eukaryotic essential genes is expected to be quite different from the prokaryotic essential genes. For the same reason, transposons may also likely work differently in eukaryotes than in prokaryotes. A thorough investigation on these aspects should be performed before we apply our model to eukaryotic genomes.
In summary, we have developed a promising tool that is crucial for large-scale data mining on essential genes. The analysis enabled by this tool will eventually lead to a better understanding on the mechanistic basis underlying gene essentiality.

Data Sources
In this study, we tested our model on three different prokaryotic organisms: E. coli, P. aeruginosa PAO1 and F. tularensis novicida. We chose to study these three organisms because their genomic sequences and the TM datasets are publicly available.
The genomic and protein sequences of E. coli MG1655 were downloaded from the Comprehensive Microbial Resource (CMR) at http://cmr.jcvi.org/. It contained 4,289 protein coding genes in total.
The E. coli TM dataset was downloaded from [23]. Of the total 4291 protein-coding genes in this dataset, 3,311 ORFs have observed transposon insertions and 649 ORFs have not. The remaining 331 genes were excluded from analysis because no reliable PCR data can be obtained for the corresponding region of the E. coli chromosome for the technical reasons.
The P. aeruginosa PAO1 TM dataset was downloaded from [29]. This dataset contained 4,892 ORFs that have observed transposon insertions and 678 ORFs that have not.
The F. tularensis novicida TM dataset was downloaded from [25]. This datasets contain 333 putative essential genes and 1,767 putative non-essential genes.
The E. coli operon dataset was downloaded from Regulon DB version 6.4 at http://regulondb.ccg.unam.mx/. This dataset contained 2,665 operons that inferred experimentally or computationally in the E. coli genome.

Simulation of Random Insertions
Assuming a uniform distribution for the transposon insertions within an ORF, we conducted a simulation experiment for random insertions. In the Gerdes set, we observed 12,966 total transposon insertions inside ORFs, excluding intergenic insertions. We randomly generated 12,966 insertions along the genome, excluding intergenic regions. For each random insertion, we recorded its relative position inside the ORF. Repeating this simulation 1,000 times, we obtained the distribution of the positions of simulated insertions. We then evaluated the empirical insertion distribution of TmNs (both TNTmNs and FNTmNs) by calculating the probability P for the randomized insertions appearing in a certain position for an equal or greater number of times than in the real experiment. We concluded that there was a significant difference if the P-values #0.01 (Table S2).

Derivation of Equations
Based on the probability theory, Eq. (1) can be rewritten into two parts (Fig. 2): Pr (E~1Dn real §n obs~0 ) Pr (E~1,n real~0 Dn real §n obs~0 ) ðPartAÞ z Pr (E~1,n real w0Dn real §n obs~0 ) ðPartBÞ Part A: Pr (E~1,n real~0 Dn real §n obs~0 ) Pr (E~1Dn real~0 ,n real §n obs~0 ) Pr (n real~0 Dn real §n obs~0 ) where Pr (E~1Dn real~0 ,n real §n obs~0 )~Pr (E~1Dn real~0 ) is the probability that the gene is essential given the actual insertion number equal to 0. This means it never had any insertion and was missed by all transposons by chance. Thus, we cannot further infer the information about this gene, we assume its probability of being essential should equal to the overall percentage of essential genes in the genome. We thus have: Pr (E~1Dn real~0 )~Pr (overall essential): Since the transposon insertion process follows a Poisson distribution, we get: Pr (n real~0 Dn real §n obs~0 )~P r (n real~0 ,n real §n obs~0 ) Pr (n real §n obs~0 ) Part B: Pr (E~1,n real w0Dn real §n obs~0 ) X mw0 ½Pr (E~1jn real~m ,n real §n obs~0 ) Pr (n real~m jn real §n obs~0 ) where Pr (E~1Dn real~m , n real §n obs~0 ), mw0 is the probability that this gene is essential given that the real insertion number is m (m .0), and none of these insertions survived. Since we didn't observe any insertion, all m insertions should be effective; otherwise, mutants with ineffective insertions should have survived. In this case, the target gene is always essential, as all insertions interrupted its function and caused the mutant's death. Thus, we have: Pr (E~1Dn real~m ,n real §n obs~0 )~1 Pr (n real~m Dn real §n obs~0 )~Poisson(m; rL) Similarly, Eq. (2) can also be rewritten into two parts (Fig. 2): Pr (E~1Dn real~nobs w0) ~Pr (E~1, all insertions ineffective Dn real~nobs w0) ðPartCÞ z Pr (E~1, at least one insertion effectiveDn real~nobs w0)ðPartDÞ Part C: Since all the insertions have not disrupted this gene's function and we cannot further infer the information about this gene, we assume its probability of being essential should equal to the overall percentage of essential genes in the genome, thus, Pr (E~1, all insertions ineffectiveDn real~nobs w0) Pr (overallessential)(1{P 1 ) n 3 0 or5 0 ends (1{P 2 ) n middle Part D: Since we observed effective insertions in the target gene and the mutant with this gene disrupted still survived, the target gene cannot be essential.

Estimating Parameters Using an Iterative Procedure
We iteratively estimated the unknown parameters P 1 (the probability that an individual insertion disrupts a gene's function when it occurs within the 25% extreme ends of a gene), P 2 (the probability that an individual insertion disrupts a gene's function when it occurs in the middle of a gene) and r ess~P r (OverallEssential) (the true essential rate in the genome) as follows: Step 1. Based on the model assumptions, we defined the empirical estimators for P 1 and P 2 by calculating the number of corresponding insertions (denominator) and their effectiveness (numerator) using the definition as follows: Here S Ess ð Þ and S Non{ess ð Þdenote the set of essential and non-essential genes at the current step, respectively. n i middle is the observed insert number within the middle regions of the i th gene; n i 3 0 or5 0 ends is the observed insert number at the 39 or 59ends and L i is the length of that gene.
Step 2. UsingP P 1 ,P P 2 to estimate r ess : In this step, first, we calculated the expected number of essential genes in TmEs and TmNs respectively, by the following equation: Then r ess can be estimated by: Substituting (4) into (5), we can get: Step 3. Using the current r ess andP P 1 ,P P 2 , to assign an essential probability score to each individual gene in the TM dataset.
Step 4. Ranking these essential probability scores in TmEs set and using the expected number of essential genes in that dataset as the cutoff, the top genes are considered as PETmEs. Similarly, ranking the essential probability scores in TmNs dataset and using the expected number of essentials in that dataset as the cutoff, the top genes are considered as the PETmNs, so our filtered essential dataset should be the combination of the PETmEs and PETmNs.
Step 5. Updating the current essential dataset S Ess ð Þ using this filtered essential dataset and the remaining genes to update the current non-essential dataset S Non{ess ð Þ .
Step 6. Going back to the Step 1 until the result converges. In our model, the initial P 1 and P 2 were randomly assigned between 0 and 1, and updated during the iteration process until converge.

Experimental Validation in P. aeruginosa PAO1
To verify our predictions, we conducted allelic exchange experiments on the seven chosen genes: PA4238, PA0723, PA0985, PA2954, PA2143, PA3746 and PA4260. First, using genomic DNA from strain PAO1 as template and standard PCR techniques, we cloned the PCR fragments containing the desired genes and 1 kb of flanking DNA into pUC19. We next inserted a non-polar aacC1 (Gm R ) cassette from pUCGM [54] into the target gene in the same transcriptional orientation to ensure that there was no polar effects on the downstream genes. The inserted fragments were then subcloned into the gene replacement vector, pEX100T, and the resultant constructs were conjugated into PAO1 with selection for Gm R . Transconjugates were resolved on LB agar containing 7% sucrose as a counter-selectable marker. Isogenic mutants were confirmed initially by PCR and then Southern blot. Complementation was both via single copy using the unique attB locus as well as via stable plasmids (details in Text S1).