Theoretical Prediction and Experimental Verification of Protein-Coding Genes in Plant Pathogen Genome Agrobacterium tumefaciens Strain C58

Agrobacterium tumefaciens strain C58 is a Gram-negative soil bacterium capable of inducing tumors (crown galls) on many dicotyledonous plants. The genome of A. tumefaciens strain C58 was re-annotated based on the Z-curve method. First, all the ‘hypothetical genes’ were re-identified, and 29 originally annotated ‘hypothetical genes’ were recognized to be non-coding open reading frames (ORFs). Theoretical evidence obtained from principal component analysis, clusters of orthologous groups of proteins occupation, and average length distribution showed that these non-coding ORFs were highly unlikely to encode proteins. Results from the reverse transcription-polymerase chain reaction (RT-PCR) experiments on three different growth stages of A. tumefaciens C58 confirmed that 23 (79%) of the identified non-coding ORFs have no transcripts in these growth stages. In addition, using theoretical prediction, 19 potential protein-coding genes were predicted to be new protein-coding genes. Fifteen (79%) of these genes were verified with RT-PCR experiments. The RT-PCR experimental results confirmed the reliability of our theoretical prediction, indicating that false-positive prediction and missing genes always exist in the annotation of A. tumefaciens C58 genome. The improved annotation will serve as a valuable resource for the research of the lifestyle, metabolism, and pathogenicity of A. tumefaciens C58. The re-annotation of A. tumefaciens C58 can be obtained from http://211.69.128.148/Atum/.


Introduction
Agrobacterium tumefaciens is a Gram-negative bacterium belonging to the Rhizobiaceae family. As ubiquitous soil microorganisms, most A. tumefaciens members are ideal vectors for plant gene transfer. The products of a series of virulence (vir) genes export the single-stranded transferred DNA (T-DNA) of the tumor-inducing (Ti) plasmid to plant cells, and the T-DNA can integrate into the plant genome randomly [1,2]. Moreover, most of the gene sequences in the T-DNA can be replaced, making A. tumefaciens an essential tool for plant transgenic research. The genome of A. tumefaciens C58 was sequenced in 2001 by Washington University and Cereon Genomics [3,4]. Because A. tumefaciens C58 is a powerful transgenic tool, detailed genomic study of this strain could lead to a directed refinement of plant transformation. The genome of A. tumefaciens C58 is approximately 5.67 Mb and is composed of four replicons, i.e., one circular chromosome, one linear chromosome, and two plasmids, namely, pTiC58 and pAtC58. GenBank accession numbers for the four replicons are AE007869 to AE007872. Shortly after the publication of the A. tumefaciens C58 genome, the Comprehensive Microbial Resource of The Institute for Genomic Research automatically re-annotated it (http://cmr.jcvi.org/) and identified more than 1,000 additional suspicious protein-coding genes. The Reference Sequence (RefSeq) collection of the National Center for Biotechnology Information (NCBI) aims to provide a comprehensive, integrated, non-redundant, and well-annotated set of sequences, including genomic DNA, transcripts, and proteins [5]. The four A. tumefaciens C58 replicons were processed by the RefSeq pipeline and assigned a project number (ID: 57865). The annotations of A. tumefaciens C58 in the aforementioned public databases differ considerably, indicating that its genome annotation is far from satisfactory.
Considering that most of the protein-coding genes annotated with gene-finding programs have not been verified experimentally, annotations of sequenced genomes always contain false-positive and false-negative predictions, especially in GC-rich genomes [6][7][8][9][10][11][12][13][14][15]. A false-positive prediction is an open reading frame (ORF) incorrectly predicted to be a protein-coding gene (most are short ORFs with no functional information), whereas a false-negative prediction is a true protein-coding gene missed in the genome annotation. Current gene-finding programs perform relatively well in low GC content genomes, but their accuracy drops considerably in high GC content genomes because these genomes contain fewer overall stop codons and more spurious ORFs. False-positive prediction is a very serious problem in high GC content genomes. Given that A. tumefaciens C58 has a relatively high overall GC content (59.1%), its annotation may contain false-positive and false-negative ORFs. Klüsener et al. performed proteomic and transcriptomic analyses of phosphatidylcholine (PC)-deficient and wild-type A. tumefaciens and observed that the loss of PC can alter the expression of approximately 13% of the genes [16]. Other proteomic studies predicted that approximately 3,000 cytosolic proteins and 400 membrane proteins can be expressed within an isoelectric point (pI) range of 4 to 7 and a molecular weight range of 10 kDa to 150 kDa. However, the proteomic experiments detected only approximately 1,500 proteins under these conditions [17,18].
In the current analysis, all the A. tumefaciens C58 'hypothetical genes' in the RefSeq annotation were re-identified, and 29 of them were recognized as non-coding ORFs by an algorithm based on the Z-curve method [19,20]. Evidence obtained from principal component analysis (PCA), clusters of orthologous groups of proteins (COG) occupation, and average length distribution showed that the identified non-coding ORFs were highly unlikely to encode proteins. Reverse transcription-polymerase chain reaction (RT-PCR) experiments confirmed that 23 (79%) of these ORFs were not expressed in three important bacterial growth stages. In addition, 19 potential new protein-coding genes were predicted by two ab initio gene-finding programs and our algorithm. All the potential new protein-coding genes were tested using RT-PCR, and 15 (79%) of these genes were confirmed. Although missing genes are not the most serious problem in bacterial gene annotation, the current analysis confirmed that some protein-coding genes are still missed in the annotation. The improved annotation provides valuable information for the genomic analysis of A. tumefaciens C58.

Data collection
The sequence and annotation of the A. tumefaciens C58 genome were downloaded from NCBI RefSeq because it provides a comprehensive and relatively precise annotation [5]. The ~5.67 Mb genome of A. tumefaciens C58 contains four replicons, namely, a circular chromosome (2,841,580 bp, NC_003062), a linear chromosome (2,075,577 bp, NC_003063), and two plasmids, pAtC58 (542,868 bp, NC_003064) and pTiC58 (214,233 bp, NC_003065). The four replicons are annotated with 2,765, 1,851, 542, and 197 protein-coding genes, respectively.
The annotated protein-coding genes can be classified into two groups. The first group contains genes with confirmed functions, which were used as the training dataset. The second group includes 'hypothetical genes' whose coding status had not been determined and which were re-identified in the current analysis. A total of 2,987 function-confirmed genes in the two chromosomes were used as the training dataset. The coding status of 1,071, 558, 214, and 59 'hypothetical ORFs' in the circular chromosome, linear chromosome, pAtC58, and pTiC58, respectively, was re-annotated. Furthermore, potential new protein-coding genes that were not found in the RefSeq annotation were predicted by two ab initio gene-finding programs, i.e., the Prokaryotic dynamic programming gene-finding algorithm (Prodigal) [21] and FgenesB (http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb).

Identification of non-coding ORFs from annotated hypothetical genes
The method adopted in this study is based on the Z-curve of the DNA sequence, which has been applied successfully to find genes in bacterial and archaeal genomes [19,20]. In the present analysis, 21 variables were adopted: 9 phase-dependent single-nucleotide variables and 12 phase-independent dinucleotide variables. Details about these variables and the identification process are given in Methods S1.
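To make the 21-variable description concrete, the following is a minimal sketch of how such a feature vector could be computed. It assumes the standard Z-curve transform (purine/pyrimidine, amino/keto, and weak/strong hydrogen-bond contrasts) applied to base frequencies at each codon position (9 variables) and to joint frequencies of overlapping dinucleotides grouped by their first base (12 variables). The function names and the exact normalization are illustrative assumptions; the definitions actually used in this study are those in Methods S1.

```python
from collections import Counter

BASES = "ACGT"

def _xyz(freq):
    """Z-curve transform of four base frequencies into (x, y, z)."""
    a, c, g, t = (freq.get(b, 0.0) for b in BASES)
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return [x, y, z]

def zcurve_features(orf):
    """Return 21 Z-curve variables for an in-frame ORF (assumed >= 3 bases)."""
    seq = orf.upper()
    features = []
    # 9 phase-dependent variables: one (x, y, z) triple per codon position.
    for phase in range(3):
        column = seq[phase::3]
        counts = Counter(column)
        freq = {b: counts[b] / len(column) for b in BASES}
        features += _xyz(freq)
    # 12 phase-independent variables: for each first base X, transform the
    # joint frequencies of the overlapping dinucleotides XA, XC, XG, XT.
    # (Assumption: joint rather than conditional frequencies.)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    for first in BASES:
        freq = {second: pairs[first + second] / total for second in BASES}
        features += _xyz(freq)
    return features
```

Each ORF (or shuffled sequence) then becomes one point in a 21-dimensional space, which is the representation assumed by the discrimination step.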

Method for identifying new functional genes
In the current analysis, two ab initio gene-finding programs were used to identify new protein-coding genes not found in the RefSeq annotation. Prodigal is a recently developed microbial gene-finding program with high speed, a low false-positive rate, and high accuracy in locating translation initiation sites (TISs) [21]. FgenesB is another accurate ab initio prokaryotic gene prediction program, based on Markov chain models of coding regions and of translation start and termination sites. This program also includes a simplified prediction of operons based only on the distances between predicted genes. By combining the predictions of the two ab initio programs with their Z scores from the Z-curve method, new protein-coding genes not annotated in RefSeq were identified in the A. tumefaciens C58 genome.
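The combination step can be pictured as a simple filter over gene coordinates. The sketch below is only an illustration under stated assumptions: genes are matched between programs by strand and 3' end (so that different predicted start sites still count as the same gene), a candidate is "new" if no RefSeq gene shares that 3' end, and a candidate is kept only if its Z score exceeds a cutoff. The `Gene` type, the `z_score` callable, and the cutoff of 0 are hypothetical; the paper does not specify these details here.

```python
from typing import Callable, List, NamedTuple

class Gene(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

def stop_key(gene: Gene):
    """Identify a gene by strand and 3' end, ignoring the predicted start."""
    return (gene.strand, gene.end if gene.strand == "+" else gene.start)

def candidate_new_genes(
    prodigal: List[Gene],
    fgenesb: List[Gene],
    refseq: List[Gene],
    z_score: Callable[[Gene], float],  # hypothetical Z-curve scoring function
    cutoff: float = 0.0,               # assumed decision threshold
) -> List[Gene]:
    """Keep Prodigal genes also found by FgenesB, absent from RefSeq,
    and scored as coding by the Z-curve discriminant."""
    fgenesb_keys = {stop_key(g) for g in fgenesb}
    refseq_keys = {stop_key(g) for g in refseq}
    return [
        g for g in prodigal
        if stop_key(g) in fgenesb_keys
        and stop_key(g) not in refseq_keys
        and z_score(g) > cutoff
    ]
```

Matching on the 3' end rather than on exact coordinates is a design choice that tolerates disagreement about the translation initiation site, which is where gene finders most often differ.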

Strain cultivation and nucleic acid isolation
A. tumefaciens C58 was inoculated into 100 mL Luria-Bertani broth (peptone, 10 g/L; yeast extract, 5 g/L; NaCl, 10 g/L) and incubated at 28 °C overnight with shaking at 180 rpm. The bacterial cells were sampled at the early log, late log, and stationary stages, at OD600 values of 0.3, 0.6, and 0.9, respectively. Total DNA was purified from late-log-stage cells with phenol-chloroform-isoamyl alcohol (25:24:1, v/v/v) and precipitated with ethanol. After being washed twice with 70% ethanol, the total DNA was dissolved in 100 µL sterilized H2O [22]. For the three growth stages, total RNA was extracted using Trizol (Invitrogen) and treated with DNase I (Takara) to remove genomic DNA contamination. Then, cDNA was synthesized from the total RNA by reverse transcription with the First Strand cDNA Synthesis Kit (Fermentas). The total DNA, total RNA, and cDNA were used for PCR analysis.

RT-PCR and sequence validation
The 50 µL PCR mixture contained 5 µL 10× PCR Buffer, 0.2 mM dNTP (Takara), 0.02 mM primers, 1 µL total DNA, total RNA, or cDNA, 1 µL Taq DNA polymerase (Takara), and nuclease-free water. The samples were incubated with the following program: 95 °C for 5 min; 30 cycles of 95 °C for 45 s, annealing for 45 s, and 72 °C for 1 min; and a final extension at 72 °C for 10 min. The PCR primers for the chosen sequences were designed with Primer Premier 5.0 software (Premier Biosoft International, Palo Alto, CA). The PCR conditions for every primer set were optimized with the DNA sample in repeated PCR experiments at annealing temperatures 2 °C to 10 °C lower than the predicted annealing temperature until a single amplified band was obtained. The 16S rRNA gene and the recA gene encoding recombination protein A were used as positive controls for multi-copy and single-copy genes from A. tumefaciens strain C58, respectively. Meanwhile, the translation initiation factor gene IF-2 (named PC-3) was used as a known positive control gene in the verification experiments of protein-coding genes. Each of the PCR products was purified using a PCR product purification kit (SBS Genetech Co., Ltd., Shanghai, China). The purified DNA fragments were ligated into pGEM-T and then transformed into competent cells of Escherichia coli DH5α, as described for the pGEM®-T Vector Systems (Promega, Madison, WI, USA). The positive clones were sequenced by Beijing Sunbiotech Co., Ltd. (Beijing, China).

Table 1. Sensitivity, specificity, and accuracy of more than 10-fold cross-validation tests for A. tumefaciens C58. Columns: Species, Sensitivity (%), Specificity (%), Accuracy (%).

Identification of non-coding 'hypothetical ORFs'
First, 1,071, 558, 214, and 59 'hypothetical ORFs' in the circular chromosome, linear chromosome, pAtC58, and pTiC58 plasmids, respectively, were re-identified using the Z-curve method [19,20]. After the putative and 'hypothetical genes' in the annotation file were excluded, 2,987 function-confirmed genes in the four replicons were used to determine the discrimination parameters. The 2,987 genes were randomly divided into two almost equal parts. The first part served as a training set for the calculation of the Fisher coefficients, whereas the other served as a test set for the assessment of algorithm accuracy. Both the training and test sets should include positive and negative samples. In the genome of A. tumefaciens C58, 88.6% of the whole DNA sequence is coding sequence, making the preparation of an appropriate set of negative samples quite difficult. Therefore, each of the 2,987 protein-coding genes was randomly shuffled 100,000 times, so that it was transformed into a random sequence. The shuffled sequences then served as negative samples. Sensitivity $S_n$ and specificity $S_p$ were used to evaluate the algorithm and were defined as $S_n = TP/(TP + FN)$ and $S_p = TN/(TN + FP)$, where TP, TN, FP, and FN are the fractions of true-positive, true-negative, false-positive, and false-negative predictions, respectively. Accuracy was defined as the average of $S_n$ and $S_p$. After ten-fold cross-validation tests were performed, the mean sensitivity, specificity, and standard deviation were calculated and are listed in Table 1. The 'hypothetical genes' were re-identified using the final Fisher coefficients and the criterion for deciding coding/non-coding. Because the negative samples differ in each discrimination process, the recognized non-coding ORFs also differ slightly. The process was therefore repeated 100 times, and only the ORFs identified as non-coding in every run were adopted. Consequently, 29 'hypothetical genes' in the four replicons of A. tumefaciens C58 were identified as non-coding ORFs, which are listed in Table 2.
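The discrimination procedure described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: it shuffles a sequence to create a negative sample, derives Fisher coefficients from the pooled within-class scatter, and reports sensitivity, specificity, and their average. The decision threshold (midpoint of the projected class means), the small ridge term, and all function names are assumptions made for the example; the criterion used in the study is described in Methods S1.

```python
import random

def shuffled_negative(seq, rng):
    """Shuffle a coding sequence: composition is kept, codon structure is lost."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def _solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fisher_discriminant(pos, neg, ridge=1e-9):
    """Return (weights, threshold); a vector u is called coding if w.u > threshold."""
    d = len(pos[0])
    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    m_pos, m_neg = mean(pos), mean(neg)
    # Pooled within-class scatter matrix, with a tiny ridge for stability.
    scatter = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    for rows, m in ((pos, m_pos), (neg, m_neg)):
        for r in rows:
            diff = [r[j] - m[j] for j in range(d)]
            for i in range(d):
                for j in range(d):
                    scatter[i][j] += diff[i] * diff[j]
    w = _solve(scatter, [m_pos[j] - m_neg[j] for j in range(d)])
    dot = lambda u: sum(wi * ui for wi, ui in zip(w, u))
    threshold = 0.5 * (dot(m_pos) + dot(m_neg))  # assumed: midpoint of projected means
    return w, threshold

def evaluate(w, threshold, pos, neg):
    """Return (sensitivity, specificity, accuracy) on labeled feature vectors."""
    score = lambda u: sum(wi * ui for wi, ui in zip(w, u))
    sn = sum(score(u) > threshold for u in pos) / len(pos)   # TP / (TP + FN)
    sp = sum(score(u) <= threshold for u in neg) / len(neg)  # TN / (TN + FP)
    return sn, sp, (sn + sp) / 2
```

In a workflow like the one described, the coefficients would be fitted on the training half, `evaluate` applied to the test half, and the whole procedure repeated with freshly shuffled negatives, keeping only ORFs called non-coding in every repetition (a set intersection over runs).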
Theoretical and experimental evidence of the recognized ORFs as non-coding ORFs
In the current annotation of bacterial genomes, false-positive predicted genes always exist, i.e., some randomly occurring ORFs are recognized as protein-coding genes, most of which are relatively short 'hypothetical ORFs' [6][7][8][9][10][11]. The difference between protein-coding genes and the identified non-coding ORFs can be viewed intuitively using principal component analysis (PCA) [23]. Figure 1 shows the distribution of points spanned by the first two principal components on the principal plane for A. tumefaciens C58. The coding and non-coding sequences are represented by open circles and triangles, respectively. The first and second principal axes account for 33.4% and 14.9% of the total inertia of the 21-dimensional space, respectively. These two axes separate the coding and non-coding sequences into two scarcely overlapping clusters. The recognized non-coding ORFs, represented by filled stars, are distributed far from the core of the function-known genes and close to the random sequences. This implies that the ORFs listed in Table 2 are highly unlikely to encode proteins. The clusters of orthologous groups of proteins (COG) functional category has been added to most of the curated archaeal and bacterial genome annotations. Each COG is a group of three or more proteins that are inferred to be orthologs. Computational analysis showed that approximately 70% of prokaryotic proteins are highly conserved and contain ancient conserved regions shared by homologs from distantly related species [24]. Therefore, an ORF that has a COG code is highly likely to be a protein-coding gene. Table 3 shows that approximately 96.8% of the function-known genes have COG codes, whereas only 3.8% of the recognized non-coding ORFs were assigned COG codes. In addition, the average length of the recognized non-coding ORFs (105.8 aa) was much shorter than that of the function-known genes (357.8 aa). All the above theoretical evidence supports the view that the recognized non-coding ORFs in A. tumefaciens C58 are very unlikely to encode proteins.
To test our theoretical prediction, all 29 identified non-coding ORFs were examined experimentally. Information on the 29 non-coding ORFs and the designed primers is listed in Table 4, and the RT-PCR results are shown in Figure 2. PCR with total DNA as the template confirmed that the reagents and primers worked well and that all the ORFs could be amplified precisely (Figure S1). Water was used as a template to detect whether the reagents were contaminated, and it yielded negative results (Figure S1). The PCR products from the total RNA samples showed no DNA contamination (Figure S2). After the reagents and the RNA had been validated, the cDNA reverse transcribed from the RNA was used as the template for PCR. In the RT-PCR with cDNA templates, the 16S rRNA and recA genes were used as positive controls for multi-copy and single-copy genes, respectively (Figure 2). For the early log stage, NC-7 (210 bp), NC-8 (109 bp), NAt-6 (202 bp), and NAt-7 (194 bp) were successfully amplified, whereas NAt-4 (374 bp) and NAt-8 (262 bp) were amplified for the late log stage (Figures 2A and B). For the stationary stage, none of the non-coding ORFs yielded an amplification product (Figure 2C). In total, the cDNAs of 23 (79%) of the 29 tested non-coding ORFs were not amplified, confirming that most predicted non-coding ORFs were not expressed in the three important stages of bacterial growth. DNA sequencing results confirmed that the PCR products were the correct target gene sequences (data not shown). The RT-PCR results verified that the theoretical prediction of non-coding ORFs was very reliable.
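The 23 of 29 (79%) figure follows directly from the amplification results reported above, as the short tally below shows. The variable names are illustrative; the ORF labels and stage assignments are taken from the text.

```python
# ORFs whose cDNA was amplified at each growth stage, as reported in the text.
amplified = {
    "early log": ["NC-7", "NC-8", "NAt-6", "NAt-7"],
    "late log": ["NAt-4", "NAt-8"],
    "stationary": [],
}
tested = 29

# An ORF counts as expressed if it was amplified in at least one stage.
expressed = set().union(*amplified.values())
silent = tested - len(expressed)

print(f"{silent}/{tested} = {silent / tested:.0%}")  # prints "23/29 = 79%"
```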
Theoretical prediction and experimental validation of newly predicted protein-coding genes not annotated in NCBI RefSeq
The ab initio gene-finding programs used in the NCBI RefSeq annotation pipeline include Glimmer [25], GeneMark [26], and the recently developed Prodigal [21]. However, the Prodigal results have not been incorporated into the RefSeq annotation. FgenesB is another accurate gene prediction program that has not been used in the RefSeq annotation. Therefore, Prodigal and FgenesB were used to find potential new protein-coding genes in the A. tumefaciens C58 genome. Prodigal and FgenesB both predicted 19 potential new protein-coding genes. A BLAST search was performed to find potential functions for these genes, and 13 of them had high sequence similarities with function-known genes in public databases. However, Table 5 shows that in most cases the query sequences aligned to only part of their BLAST hits. The other 6 were predicted to be 'hypothetical genes' with sequence lengths similar to those of their BLAST hits (Table 5).
All 19 predicted protein-coding genes underwent experimental verification. The 16S rRNA gene, the recA gene, and the PC-3 gene encoding translation initiation factor IF-2 were selected as the multi-copy, single-copy, and known protein-coding positive controls, respectively. Information on the predicted protein-coding genes and the three positive controls, together with the designed primers, is listed in Table 6. The RT-PCR results with the cDNA templates are shown in Figure 3, and the control PCR results with total DNA and total RNA are shown in Figure S3 and Figure S4. All the potential protein-coding genes could be amplified precisely from the total DNA templates (Figure S3), and the PCR products from the total RNA samples showed no DNA contamination (Figure S4). As shown in Figure 3, 15 (79%) of the 19 predicted genes were successfully amplified in the three A. tumefaciens C58 growth stages, confirming that they are truly protein-coding genes although they are not annotated in NCBI RefSeq. The DNA sequencing results confirmed that the PCR products were the correct target gene sequences (data not shown). The RT-PCR results verified that the theoretical prediction of novel protein-coding genes was also very reliable.

Discussion
A. tumefaciens C58 was the first Agrobacterium strain to have its genome sequenced. Therefore, precise gene annotation for this bacterium is important for microbiological research and plant genetic modification. Because most ORFs are currently identified by gene-finding programs and not verified experimentally, many false-positive and several false-negative ORFs always exist in bacterial genome annotations, especially in GC-rich genomes [6][7][8][9][10][11][12][13][14][15]. Bacterial gene annotation can still be considerably improved, although it has developed continuously over the past decade. The A. tumefaciens C58 genome has a relatively high GC content and thus contains fewer overall stop codons and more spurious ORFs.
Many rigorous constraints are imposed on true protein-coding genes but not on randomly occurring ORFs. In the generally accepted codon usage prototype, the first, second, and third codon positions are occupied by a purine, a non-guanine base, and any base, respectively [27][28][29]. The first, second, and third codon positions have been suggested to be associated with the biosynthetic pathway, the hydrophobicity pattern, and the α-helix or β-strand forming potential of the coded amino acids, respectively [27][28][29]. However, false ORFs are not subject to such coding constraints. The different codon usage patterns of protein-coding genes and spurious ORFs form the basis of the current algorithm. Figure 1 shows that most of the spurious ORFs are distributed far from the core of the function-known genes, indicating that they do not follow the general codon usage pattern. Most of the recognized non-coding ORFs were confirmed to have no transcripts in the three important bacterial growth stages. Although the RT-PCR results under the three tested conditions cannot ensure that these ORFs are never expressed under any conditions, the theoretical evidence obtained from the PCA, COG occupation, and average length distribution provides more compelling evidence. Therefore, these ORFs are highly unlikely to be protein-coding genes. In addition, 15 (79%) of the 19 newly predicted protein-coding genes were confirmed by RT-PCR. Although the most important problem in bacterial genome annotation is false-positive prediction, the experimental results confirmed that missing genes still exist in the A. tumefaciens C58 annotation. We also noticed that most of the 'novel genes' aligned to only a fraction of the function-known genes in the public databases, although the sequence identities were high (Table 5). The detailed functions of these genes should be further investigated. The improved annotation of A. tumefaciens C58 will provide more accurate information for research on this important plant pathogen genome. The re-annotation of the A. tumefaciens C58 genome can be downloaded from http://211.69.128.148/Atum/. Nucleotide sequence data of the 15 RT-PCR-confirmed new genes are available in the Third Party Annotation Section of the DDBJ/EMBL/GenBank databases under the accession numbers TPA: BK008582-BK008596.

Supporting Information
Methods S1 Identification of non-coding ORFs from annotated hypothetical genes. (DOC)
Figure S1 The PCR results of 29 DNA fragments re-annotated as non-coding ORFs. The expected products of PCR used total DNA as template were all obtained with the right sizes,
Figure S4 The PCR results with RNA of 19 DNA fragments re-annotated as potential protein-coding genes. (A) The PCR with RNA of early log phase as templates. (B) The PCR with RNA of late log phase as templates. (C) The PCR with RNA of stationary phase as templates. When the total RNAs were used as templates in the PCR, no amplification band was produced. (TIF)