The authors have declared that no competing interests exist.
Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at
Annotating the protein-coding genes in a newly sequenced prokaryotic genome is a critical part of describing their biological function. Relative to eukaryotic genomes, prokaryotic genomes are small and structurally simple, with 90% of their DNA typically devoted to protein-coding genes. Current computational gene finding tools are therefore able to achieve close to 99% sensitivity to known genes using species-specific gene models. Though highly sensitive at finding known genes, all current prokaryotic gene finders also predict large numbers of additional genes, which are labelled as “hypothetical protein” in GenBank and other annotation databases. Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them represent false positives. Additionally, all current gene finding tools must be trained specifically for each genome as a preliminary step in order to achieve high sensitivity. This requirement limits their ability to detect genes in fragmented sequences commonly seen in metagenomic samples. We took a data-driven approach to prokaryotic gene finding, relying on the large and diverse collection of already-sequenced genomes. By training a single, universal model of bacterial genes on protein sequences from many different species, we were able to match the sensitivity of current gene finders while reducing the overall number of gene predictions. Our model does not need to be refit on any new genome. Balrog (Bacterial Annotation by Learned Representation of Genes) represents a fundamentally different yet effective method for prokaryotic gene finding.
One of the most important steps after sequencing and assembling a microbial genome is the annotation of its protein-coding genes. Methods for finding protein-coding genes within a prokaryotic genome are highly sensitive, and thus have seen little change over the past decade. Widely used prokaryotic gene finders include various iterations of Glimmer [
The lack of recent advances in
In line with evaluation metrics used by other gene finders, if a program can find nearly all true positive genes while predicting fewer genes overall, it is reasonable to assume this is primarily due to a reduction in false positive predictions [
Currently available gene finders were developed in the late 1990’s and 2000’s, when relatively few prokaryotic genomes were available. Today, tens of thousands of diverse bacterial genomes from across the prokaryotic tree of life have been sequenced and annotated. We hypothesized that it should therefore be feasible to build a data-driven gene finder by training a machine learning model on a large, diverse collection of high-quality prokaryotic genomes. The program could then be applied, without any further re-training or adjustment, to find genes in any prokaryotic species. Balrog was developed with this strategy in mind. In the experiments below, we show that Balrog, when trained on all high-quality prokaryotic genomes available today, matches the sensitivity of current state-of-the-art gene finders while reducing the total number of hypothetical gene predictions. By integrating protein-coding gene predictions from Balrog, standard prokaryotic annotation and analysis pipelines such as NCBI PGAP (Prokaryotic Genome Annotation Pipeline) [
We compared the performance of Balrog, Prodigal, and Glimmer3 by running each tool with default settings on a test set of 30 bacteria and 5 archaea that were not included in the Balrog training set. Following the conventions established in multiple previous studies, we considered a protein-coding gene to be known if it was annotated with a name not including “hypothetical” or “putative.” In standard annotation pipelines, proteins are labeled hypothetical if they have no significant match to known protein sequences and are not otherwise covered by a standard naming rule [
GC % | genes # | Balrog 3′ matches # | % | extra # | Prodigal 3′ matches # | % | extra # | Glimmer3 3′ matches # | % | extra #
---|---|---|---|---|---|---|---|---|---|---
Bacteria | | | | | | | | | |
30 | 1570 | 99.3 | 1557 | 99.2 | 302 | 99.3 | 367 | ||||
31 | 1486 | 99.3 | 1475 | 99.3 | 248 | 1473 | 99.1 | 279 | |||
33 | 2359 | 2265 | 96.0 | 2255 | 95.6 | 557 | 96.1 | 715 | |||
34 | 2419 | 2397 | 99.1 | 2401 | 99.3 | 554 | 99.3 | 648 | |||
34 | 1360 | 98.2 | 98.2 | 220 | 1332 | 97.9 | 257 | ||||
37 | 1630 | 1607 | 98.6 | 98.7 | 281 | 98.7 | 333 | ||||
38 | 3873 | 99.0 | 3829 | 98.9 | 970 | 3831 | 98.9 | 1134 | |||
40 | 1496 | 1484 | 99.2 | 99.3 | 373 | 1485 | 99.3 | 422 | |||
41 | 1608 | 99.3 | 1596 | 99.3 | 441 | 1594 | 99.1 | 543 | |||
42 | 1897 | 1883 | 99.3 | 99.5 | 921 | 1884 | 99.3 | 1027 | |||
45 | 1882 | 1804 | 95.9 | 1809 | 96.1 | 604 | 96.2 | 783 | |||
46 | 2137 | 2107 | 98.6 | 99.0 | 595 | 2114 | 98.9 | 696 | |||
46 | 885 | 99.9 | 883 | 99.8 | 826 | 879 | 99.3 | 840 | |||
49 | 2299 | 2227 | 96.9 | 2233 | 97.1 | 808 | 97.3 | 1134 | |||
49 | 2850 | 2769 | 97.2 | 2754 | 96.6 | 929 | 97.2 | 1103 | |||
49 | 1998 | 1941 | 97.1 | 1932 | 96.7 | 375 | 97.2 | 533 | |||
50 | 2178 | 2152 | 98.8 | 98.9 | 492 | 2134 | 98.0 | 679 | |||
50 | 4031 | 3947 | 97.9 | 98.1 | 1868 | 3953 | 98.1 | 2423 | |||
51 | 3128 | 3061 | 97.9 | 3064 | 98.0 | 796 | 98.0 | 1585 | |||
52 | 3529 | 97.8 | 914 | 3408 | 96.6 | 3368 | 95.4 | 1110 | |||
52 | 2322 | 97.9 | 2268 | 97.7 | 698 | 2268 | 97.7 | 1165 | |||
54 | 1780 | 98.5 | 1752 | 98.4 | 348 | 1746 | 98.1 | 489 | |||
56 | 1382 | 1373 | 99.3 | 99.6 | 354 | 1373 | 99.3 | 362 | |||
58 | 2499 | 2393 | 95.8 | 95.9 | 724 | 95.9 | 908 | ||||
60 | 2196 | 2155 | 98.1 | 98.6 | 608 | 2160 | 98.4 | 742 | |||
62 | 2889 | 2849 | 98.6 | 2853 | 98.8 | 619 | 98.8 | 858 | |||
63 | 2612 | 2564 | 98.2 | 98.3 | 730 | 2562 | 98.1 | 847 | |||
65 | 2498 | 2451 | 98.1 | 98.7 | 1176 | 2447 | 98.0 | 1540 | |||
65 | 1022 | 997 | 97.6 | 98.6 | 260 | 1000 | 97.8 | 286 | |||
73 | 4880 | 4778 | 97.9 | 98.8 | 3887 | 4728 | 96.9 | 4789 | |||
Average (bacteria) | 49 | 2289 | 2248 | 98.2 | 98.3 | 747 | 2245 | 98.1 | 949
Archaea | |||||||||||
36 | 1710 | 1678 | 98.1 | 1682 | 98.4 | 517 | 98.7 | 570 | |||
39 | 621 | 618 | 99.5 | 100.0 | 720 | 100.0 | 778 | ||||
46 | 2757 | 2567 | 93.1 | 2545 | 92.3 | 1123 | 93.6 | 1999 | |||
50 | 1390 | 1372 | 98.7 | 1370 | 98.6 | 446 | 99.0 | 581 | |||
61 | 2047 | 2001 | 97.8 | 98.5 | 731 | 2015 | 98.4 | 884 | |||
Average (archaea) | 46 | 1705 | 1661 | 97.4 | 1663 | 97.6 | 691 | 97.9 | 949
“genes” refers to all protein-coding genes in the NCBI annotation where the description does not contain “hypothetical” or “putative.” Genes with descriptions containing “hypoth” or “etical” are also excluded to catch the most common misspellings of hypothetical.
“3′ matches” counts the number of genes with stop sites exactly matching between the annotation and prediction on the same strand. “extra” counts the number of genes predicted by each program that do not share strand and stop site with an annotated non-hypothetical gene. The lowest number of extra genes and the highest number of 3′ matches are bolded for each organism.
All three tools achieved similar sensitivity on the bacterial genomes in the test set. On average, Balrog found 2 fewer non-hypothetical genes than Prodigal (2,248 vs. 2,250) and 3 more than Glimmer3 (2,248 vs. 2,245), a difference of less than 0.1% in sensitivity. Balrog predicted the fewest genes overall, reducing the number of “extra” gene predictions by 11% vs. Prodigal (664 vs. 747) and 30% vs. Glimmer3 (664 vs. 949).
Balrog predicted more genes than Prodigal for only one bacterial genome,
On the five genomes in the archaea test set, we observed more pronounced differences in the number of extra gene predictions. Glimmer3 found the most known genes, averaging 1670, versus 1663 for Prodigal and 1661 for Balrog. However, Balrog predicted the fewest genes overall, 18% fewer extra genes than Prodigal and 40% fewer than Glimmer3.
Similar results were observed when the gene model was trained on a set excluding organisms sharing a family, rather than a genus, with any organism in the test set. On average, the gene model achieved sensitivity of 98.12% with family excluded vs. 98.15% with genus excluded (2247 vs. 2248 genes) in bacteria and 97.50% vs. 97.44% (1662 vs. 1661) in archaea. The family-excluded model predicted on average 25 more extra genes than the genus-excluded model in bacteria (689 vs. 664) and 32 more in archaea (597 vs. 565).
In selecting genomes on which to train our gene model, we aimed to cover as much microbial diversity as possible while limiting sequence redundancy. As a whole, currently available prokaryotic genomes are biased toward clinically relevant organisms. Many low-abundance environmental species may be absent from public databases, whereas organisms important to human disease may have full genomes for hundreds of closely related strains [
From this set of high-quality complete genomes with gene annotations, 29 bacterial and five archaeal species were randomly selected to serve as a test set.
From all genomes, we extracted amino-acid sequences from annotated non-hypothetical genes. All genes with a description containing “hypothetical” or “putative” were removed from analysis, as many of these are not true genes but instead are the predictions of other gene finding programs. Additionally, genes with descriptions containing “hypoth” or “etical” were excluded in an effort to catch the most common misspellings of hypothetical. All non-hypothetical gene sequences were translated in all five alternative reading frames, and from these translations we extracted open reading frames (ORFs) longer than 100 amino acids to use as training examples of non-protein sequence.
We extracted amino-acid shingles (overlapping subsequences) of length 100, overlapping by 50, in the 3′ to 5′ direction from all protein and non-protein sequences. These were used as positive and negative gene examples, respectively. In total, ≈27 gigabases (9 billion amino acids) of translated gene and non-gene sequence was generated to train the gene model.
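The shingling step above can be sketched as follows (the function name `shingles` and the toy sequence are illustrative, not Balrog's actual code):

```python
def shingles(seq, length=100, step=50):
    """Overlapping fixed-length windows ('shingles') of a protein sequence.

    A window of `length` residues starts every `step` residues, so
    consecutive shingles overlap by length - step (here 50) amino acids.
    """
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]

# A 200-residue protein yields shingles starting at positions 0, 50, and 100.
windows = shingles("M" * 200)
print(len(windows), len(windows[0]))  # 3 100
```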
A temporal convolutional network (TCN) was trained using the methods and open source Python framework of Bai et al. [
A temporal convolutional network (TCN) with 2 hidden layers and a convolutional kernel size of 2. The number of connections increases exponentially as hidden layers are added, enabling a wide receptive field. Notice that the output of a TCN is the same length as the input. Balrog’s TCN used 8 hidden layers, a convolutional kernel size of 8, a dilation factor of 2, and 32 *
During inference, we use the output from the pre-trained TCN to predict a single score for an ORF of any given length. To predict a single probability between 0 and 1, we combine all output scores from the TCN according to
Our gene model TCN used 8 hidden layers, 32 *
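A quick back-of-envelope check shows why this architecture sees far beyond its input window. The sketch below assumes one dilated convolution per layer with dilation 2^i at layer i; residual blocks with two convolutions per layer, as in Bai et al., would roughly double the growth:

```python
def tcn_receptive_field(layers, kernel_size, dilation_base):
    """Receptive field of stacked dilated causal convolutions.

    Layer i uses dilation dilation_base**i; a convolution with kernel k and
    dilation d widens the receptive field by (k - 1) * d positions.
    """
    rf = 1
    for i in range(layers):
        rf += (kernel_size - 1) * dilation_base ** i
    return rf

# With 8 layers, kernel size 8, and dilation factor 2 (the gene-model
# hyperparameters reported above), the receptive field spans 1786 positions,
# comfortably wider than the 100-residue training shingles.
print(tcn_receptive_field(8, 8, 2))  # 1786
```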
Though not the main focus of this work, a good start site model provides a boost in accuracy for a prokaryotic gene finder. In bacteria, the initiation of translation is usually marked by a ribosome binding site (RBS), which manifests as a conserved 5-6 bp sequence just upstream of the start codon of a protein-coding gene. Experimentally-validated start sites are not available for the vast majority of bacterial genes, so we made the assumption (also used in previous methods [
Similar to the gene model, we trained a TCN on the positive and negative examples of gene start sites. A slightly smaller model was used due to the reduced complexity and length of the start site sequence data. Our start site model used 5 layers with 25 *
A powerful gene sequence model is necessary for finding genes, but additional features such as open reading frame (ORF) length can also be taken into account. In particular, longer ORFs are more likely to be protein-coding genes, by the simple argument that a long stretch of DNA without stop codons is less likely, in random DNA sequence, than a short stretch. Balrog begins by identifying and translating all ORFs longer than a user-specified minimum. Its task is to determine for each of these ORFs whether it represents a protein-coding gene.
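A minimal version of this ORF enumeration step might look as follows. This sketch considers only ATG starts on both strands; Balrog's own implementation may also accept alternative start codons such as GTG and TTG:

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"

def revcomp(dna):
    """Reverse complement of a DNA string (ACGT alphabet)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs(dna, min_len=90):
    """Yield (start, end, strand) for candidate ORFs of at least min_len nt.

    An ORF here runs from an ATG through the next in-frame stop codon;
    coordinates are relative to the scanned strand.
    """
    for strand, seq in (("+", dna), ("-", revcomp(dna))):
        for frame in range(3):
            i = frame
            while i + 3 <= len(seq):
                if seq[i:i + 3] == START:
                    # Scan forward in-frame for the first stop codon.
                    for j in range(i, len(seq) - 2, 3):
                        if seq[j:j + 3] in STOPS:
                            if j + 3 - i >= min_len:
                                yield (i, j + 3, strand)
                            break
                i += 3

dna = "ATG" + "AAA" * 30 + "TAA"  # 96 nt: one clean ORF
print(list(orfs(dna, min_len=90)))  # [(0, 96, '+')]
```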
We also developed an optional kmer-based filter, using amino-acid sequences of length 10, which runs before the gene model to positively identify genes. This filtering procedure simply identifies all amino-acid 10-mers found in annotated non-hypothetical genes from the training data set and flags any ORF containing at least two of these 10-mers as a true protein. This initial step finds many common prokaryotic genes with a very high specificity and near-zero false positive rate.
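The filter described above can be sketched in a few lines (function names and the toy reference protein are illustrative, not Balrog's actual code):

```python
def build_kmer_set(known_proteins, k=10):
    """All amino-acid k-mers seen in annotated non-hypothetical proteins."""
    kmers = set()
    for prot in known_proteins:
        for i in range(len(prot) - k + 1):
            kmers.add(prot[i:i + k])
    return kmers

def passes_filter(orf_protein, kmers, k=10, min_hits=2):
    """Flag an ORF as a likely true gene if its translation shares at
    least min_hits k-mers with the reference set."""
    hits = sum(orf_protein[i:i + k] in kmers
               for i in range(len(orf_protein) - k + 1))
    return hits >= min_hits

reference = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]
kmers = build_kmer_set(reference)
print(passes_filter("MKTAYIAKQRQISFVK", kmers))  # True: shares several 10-mers
print(passes_filter("AAAAAAAAAAAA", kmers))      # False: no shared 10-mers
```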
Next, ORFs are scored by the pre-trained temporal convolutional network in the 3’ to 5’ direction. The region surrounding each potential start site of each ORF is then scored by the start site model. A directed acyclic graph is constructed for each contig, with nodes representing all possible ORFs. Edges are added between compatible ORFs overlapping by less than a user-specified maximum. To avoid creating a graph with O(
The global maximum score of the directed acyclic graph is computed by finding the longest weighted path through the graph as shown in
A directed acyclic graph with nodes representing open reading frames (ORFs) and edges representing possible connections. Each edge is weighted by the ORF score at the tip of the arrow minus any penalty for overlap. ORFs that overlap by too much are not connected. In this example, the maximum score is achieved by following the bolded path connecting 0-2-3. ORF 1 is not included because it is mutually exclusive with ORF 0 and results in a lower score due to overlap with ORF 2.
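Because the graph is acyclic, the longest weighted path can be found in a single dynamic-programming pass over nodes in topological (left-to-right genomic) order. A simplified sketch, with edge weights reduced to the downstream ORF score minus an overlap penalty (Balrog's real edge scores also fold in start-site, codon-usage, and length features):

```python
def best_path(num_nodes, edges, node_score):
    """Longest weighted path through a DAG of candidate ORFs.

    `edges` maps u -> list of (v, overlap_penalty); each edge is weighted
    by the score of the downstream ORF minus its overlap penalty. Nodes
    are assumed numbered in topological order.
    """
    best = {u: node_score[u] for u in range(num_nodes)}
    back = {u: None for u in range(num_nodes)}
    for u in range(num_nodes):
        for v, penalty in edges.get(u, []):
            cand = best[u] + node_score[v] - penalty
            if cand > best[v]:
                best[v], back[v] = cand, u
    # Trace the path back from the best-scoring endpoint.
    end = max(best, key=best.get)
    path = [end]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return path[::-1], best[end]

# Toy graph echoing the figure: ORFs 0 and 1 overlap heavily (no edge),
# ORF 1 pays an overlap penalty against ORF 2, so the path 0-2-3 wins.
scores = {0: 5.0, 1: 4.0, 2: 6.0, 3: 3.0}
edges = {0: [(2, 0.0)], 1: [(2, 3.0)], 2: [(3, 0.0)]}
print(best_path(4, edges, scores))  # ([0, 2, 3], 14.0)
```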
To benchmark gene finding performance, Glimmer3 and Prodigal were run with default settings and allowed to train on each genome in the test set.
In the spirit of building a data-driven model, nearly all parameters were optimized with respect to the data rather than being hand-tuned. Ten genomes were randomly selected from the training data set to use for optimization of weights used in the scoring function for genome graph construction.
The score for each ORF node was calculated as a linear combination of features including the gene model score, start site model score, start site codon usage, and the length of the ORF. Additionally, final scores for edges between nodes are penalized by the length and direction of overlap, if any, between the connected ORFs. Depending on the type of overlap, per-base penalties are multiplied by the length of the overlap and subtracted from the edge connection score. Different penalties are learned for divergent overlap (5’ to 5’), convergent overlap (3’ to 3’), and unidirectional overlap (3’ to 5’ or 5’ to 3’).
This scoring system was used to combine features so the linear weights can be learned with respect to the data to maximize gene finding sensitivity. Optimization of all weights with respect to gene sensitivity was accomplished using a tree-structured Parzen estimator [
Our gene model is tuned to maximize sensitivity to known genes without regard to the total number of predictions. In order to keep down the number of false positive predictions, users may optionally run a post-processing step with MMseqs2 [
A diagram showing all steps from genomic sequence in to gene predictions out. Green circles represent input and output data. White squares represent intermediate data. Blue squares represent processes. Yellow cylinders represent databases and pretrained models.
Balrog demonstrates that a data-driven approach to gene finding with minimal hand-tuned heuristics can match or outperform current state-of-the-art gene finders. By training a single gene model on nearly all available high-quality prokaryotic gene data, Balrog matches the sensitivity of widely used gene finders while predicting fewer genes overall. Balrog also requires no retraining or fine-tuning on any new genome.
Balrog predicted consistently fewer genes than both Prodigal and Glimmer3 on both the bacterial and archaeal genome test sets. The sensitivity of all three gene finders was nearly identical on average, and likely well within the range of noise in our sample, though Prodigal appears to achieve higher sensitivity than both Balrog and Glimmer3 on high-GC% genomes. A stronger bias against short ORFs, similar to Prodigal’s penalty on ORFs shorter than 250 bp, may help Balrog perform better in genomes with particularly high GC content. However, incorporating a bias against small genes may provide higher specificity at the cost of sensitivity to small genes. Heuristics used by current gene finders, including default minimum ORF lengths of 90 bp for Prodigal and 110 bp for Glimmer3, have led to a blind spot around functionally important small prokaryotic proteins [
Our test set deliberately represented a near-worst-case scenario for Balrog, where no organism from the same genus was used to train the model. On organisms closely related to those in the large and diverse training set, we expect Balrog may perform better as a result of overfitting. Overfitting of a gene model in this context is a complex issue. Simply memorizing and aligning to all known genes can be thought of as the ultimate overfit model, yet that strategy would likely prove effective at finding conserved bacterial genes. Finding prokaryotic genes is not a standard machine learning task where memorization inevitably leads to higher generalization error. Conserved amino-acid sequences in prokaryotic genes may represent functionally important protein motifs and memorization of short amino-acid sequences as indicators of protein coding sequence may prove useful in finding genes even in novel organisms. Still, we attempted to be as fair as possible to competing gene finders by removing all organisms with a shared genus. We felt this should provide a conservative estimate of the true generalization error of our model to relatively distant genomes.
An alternative approach to training a universal protein model could use protein clusters to capture diversity in protein sequences with less redundancy than our whole-genome approach. However, we wanted our final evaluation metric to be as fair as possible to all gene finders and reflective of a real-world situation where a newly sequenced prokaryote would likely contain many proteins from many different clusters.
Balrog in its current form is relatively slow. While tools like Prodigal and GeneMarkS-2 may analyze a genome in a matter of seconds, Balrog may take minutes per genome. This is due to a wide range of factors including the complexity of the gene model and the optional gene filtering step with MMseqs2. Optimization of run time represents a possible future improvement for Balrog.
Balrog was designed primarily to find genes without much regard for identifying the exact location of their translation initiation site (TIS). TIS identification is a challenging problem with relatively little available ground-truth data. A reasonably accurate start site predictor helps to guide a gene finder, so Balrog does include a small TIS model, but accurate start site prediction was not a primary focus of this work. Further complicating the issue, nearly all available start site locations are based solely on predictions of previous gene finders. Demonstrating true improvement in start site prediction would require comparing Balrog to other gene finders on a large ground-truth data set which is simply not currently available. Incorporating TIS models used by Prodigal or GeneMark may enable improvement in start site identification in the future.
Full organism names and accession numbers of all genomes used in the gene finder comparison in
(CSV)
Full organism names and accession numbers of all genomes used to train the gene model.
(CSV)
Full organism names and accession numbers of all genomes used in the protein kmer and MMseqs2 filtering steps.
(CSV)
We would like to thank Christopher Pockrandt for helping distribute the C++ version of Balrog, Jennifer Lu and Alaina Shumate for helping brainstorm cool program names, everyone on the Center for Computational Biology Slack channel for voting on said cool names, Martin Steinegger for helpful conversations and creating MMseqs2, @genexa_ch for providing via Twitter a small set of diverse GTDB genomes on which the kmer filter and MMseqs2 are run, and everyone in the S. Salzberg and M. Pertea labs.
Dear Mr. Sommer,
Thank you very much for submitting your manuscript "Balrog: A universal protein model for prokaryotic gene prediction" for consideration at PLOS Computational Biology.
As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.
We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.
When you are ready to resubmit, please upload the following:
[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).
Important additional instructions are given below your reviewer comments.
Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.
Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Christos A. Ouzounis
Associate Editor
PLOS Computational Biology
William Noble
Deputy Editor
PLOS Computational Biology
***********************
Reviewer's Responses to Questions
Reviewer #1: I am glad to see a new tool developed using deep learning to predict proteins from prokaryote genomes, and it sounds like a nice tool that improves accuracy and can be used easily, without training for specific taxonomic units like other tools, e.g. Prodigal and Prokka. I would like to test it myself, but failed to use the web server provided by the authors, and even failed to upload my sequences. I suppose the model file is large, hindering distribution of the tool. A standalone version would be much more helpful.
The writing is good, but I still have other concerns.
1) As we know, prokaryote genome sequences are largely biased toward sequencing of some pathogens. Thus the data set for training is not balanced.
2) For prokaryote genomes, the difference in gene numbers within the same species, that is, between different populations/strains, is large because of HGT or other reasons, resulting in quite different accessory genomes. Why did the authors select proteins for training by first picking genomes and then determining the proteins? It looks like the authors should select all prokaryote genomes of high quality and then curate a pangenome to cluster these proteins for training the model. Another option is that the authors could extract high-quality protein sequences of prokaryotes from known databases, e.g. UniProt.
3) I am not sure about the rule of identifying non-hypothetical genes based only on a description containing “hypothetical” or “putative”. This is really coarse.
4) The two figures are too simple to express clearly what was done by the authors.
Reviewer #2: The authors developed a method for gene prediction in prokaryotes, Balrog, which is based on deep convolutional neural networks (CNNs) and was trained on 3290 genomes and tested on 36. To focus the test results on non-trivial cases, no genomes from the same genus as any of the test genomes were allowed in the training set. The method employs recent technological developments in using CNNs for sequence modeling (Bai, Kolter, Koltun, 2018). First, a CNN is trained to predict for every position of a translated amino acid sequence whether translation is in the right frame or not. This is the heart of the method. Second, a CNN for predicting translation initiation sites is trained on the 32 nucleotide long sequences around each start site of the non-hypothetical proteins in the training set. Third, to avoid making contradicting ORF calls (e.g. strongly overlapping ones), the longest weighted path through a directed acyclic graph is computed, in which nodes represent possible ORFs and nodes are connected by edges if the ORFs do not overlap too much.
Balrog achieves very similar sensitivity as the gold standard tools Prodigal and Glimmer3, and it has 11% and 30% fewer likely false predictions than Prodigal and Glimmer3, respectively. Balrog takes 5-10 minutes to process a typical bacterial genome on a GPU, whereas Prodigal takes a few seconds at most on a single CPU core.
The results are a bit disappointing considering the big advances that deep learning has afforded in many bioinformatic applications. However, the study is interesting for two reasons. First, if the slight improvements hold true with an unbiased benchmark, they would be a worthwhile improvement of prediction accuracy. Second, the study demonstrates how to use state-of-the-art deep learning methods for the task of gene prediction.
Major points:
1) It is unclear to what degree the training set is biased by the fact that many gene annotations in the training genomes were themselves produced by bioinformatic prediction tools. Since Glimmer3 and Prodigal have been the standard tools for gene prediction since 1998 and 2010, respectively, it is likely that most of the 'extra' genes annotated as hypothetical were actually predicted by Glimmer3 or Prodigal. It is therefore not surprising at all that Glimmer3 and Prodigal would find more such 'extra' genes than a tool such as Balrog using a very different methodology.
The authors need to construct a benchmark that can correct for such biases or at least estimate them. One option could be to test on genomes that have been annotated using experimental data such as RNA-seq, CAGE-seq or the like.
2) It would be important to get more information on how much this very highly parameterized method can generalize beyond the genus. The benchmark should therefore be repeated with training sequences from which all genomes from the same family / order of any of the test genomes have been excluded.
Minor points:
3) The Methods do not mention what dilation sizes were used in the gene model CNN. d = 2^i ?
4) To train the start site model, negative training examples were taken to be the start site codons after the annotated start site of the positive training ORFs. Isn't that quite risky since start sites are notoriously hard to annotate and might be frequently wrong? Wouldn't it be better to use start codons within the negative ORF training examples?
5) Whereas it is stressed in the abstract that Prodigal and Glimmer3 need to be pretrained on each genome to achieve optimal results, the Methods section does not mention if such pretraining was employed.
6) Please explain why 'Efficiently training a temporal convolutional network requires sequences of the same length.'
7) Why is Balrog so slow? I count 20 * 8 * 32 * 8 = 41920 parameters. Since predictions can be done in parallel on the GPU, that should take a few seconds, not minutes, for the few ten thousand translated ORFs longer than 60 codons. Could it be that each convolutional filter is not only computed once per input window of length k, as it should, but 100-k+1 times (where 100 is the length of the sequences used for training)?
8) Line 13: Delete '32 ∗ L hidden units per layer, and'
9) Please comment in the discussion on why you did not use a transformer architecture.
**********
Large-scale datasets should be made available via a public repository as described in the
Reviewer #1: Yes
Reviewer #2: Yes
**********
PLOS authors have the option to publish the peer review history of their article (
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Reviewer #1: No
Reviewer #2: No
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool,
Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here:
To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see
Submitted filename:
Dear Mr. Sommer,
We are pleased to inform you that your manuscript 'Balrog: A universal protein model for prokaryotic gene prediction' has been provisionally accepted for publication in PLOS Computational Biology.
Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.
Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards,
Christos A. Ouzounis
Associate Editor
PLOS Computational Biology
William Noble
Deputy Editor
PLOS Computational Biology
***********************************************************
Reviewer's Responses to Questions
Reviewer #1: No further comment.
Reviewer #2: The authors have addressed all reviewer comments satisfactorily. I particularly appreciate providing open-source C++ code that can run on CPUs.
**********
Large-scale datasets should be made available via a public repository as described in the
Reviewer #1: Yes
Reviewer #2: Yes
**********
PLOS authors have the option to publish the peer review history of their article (
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Reviewer #1: No
Reviewer #2: No
PCOMPBIOL-D-20-01618R1
Balrog: A universal protein model for prokaryotic gene prediction
Dear Dr Sommer,
I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!
With kind regards,
Alice Ellingham
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom