Genome Sequence of Candidatus Nitrososphaera evergladensis from Group I.1b Enriched from Everglades Soil Reveals Novel Genomic Features of the Ammonia-Oxidizing Archaea

The activity of ammonia-oxidizing archaea (AOA) leads to the loss of nitrogen from soil, pollution of water sources and elevated emissions of greenhouse gases. To date, eight AOA genomes are available in public databases: seven are from group I.1a of the Thaumarchaeota and only one, isolated from hot springs, is from group I.1b. Many soils are dominated by AOA from group I.1b, but the genomes of soil representatives of this group have not been sequenced and functionally characterized. The lack of knowledge of the metabolic pathways of soil AOA presents a critical gap in understanding their role in biogeochemical cycles. Here, we describe the first complete genome of the soil archaeon Candidatus Nitrososphaera evergladensis, which was reconstructed from metagenomic sequencing of a highly enriched culture obtained from an agricultural soil. The AOA enrichment was sequenced on high-throughput next-generation sequencing platforms from Pacific Biosciences and Ion Torrent. The de novo assembly of sequences resulted in one 2.95 Mb contig. Annotation of the reconstructed genome revealed many similarities in basic metabolism with the other sequenced AOA. Ca. N. evergladensis belongs to group I.1b and shares only 40% whole-genome homology with its closest sequenced relative, Ca. N. gargensis. Detailed analysis of the genome revealed coding sequences that are completely absent from group I.1a. These unique sequences code for proteins involved in the control of DNA integrity, transporters, two-component systems and a versatile CRISPR defense system. Notably, genomes from group I.1b have more gene duplications than genomes from group I.1a. We suggest that the presence of these unique genes and gene duplications may be associated with the environmental versatility of this group.

A phylogenetic tree of the amoA, amoB, amoC, and amoX subunits of ammonia monooxygenase from ammonia-oxidizing archaea. Amino-acid sequences of amo subunits of AOA were randomly selected from the National Center for Biotechnology Information databases. The multiple sequence alignment of the amino-acid sequences was used for building maximum-likelihood trees. The branching patterns are denoted by their respective bootstrap values (100 iterations). Topology is colored by the metabolic group (blue represents marine group I.1a, green represents group I.1b, red is ThAOA).

Figure labels: 4H+ + ADP + Pi → ATP; regulates activity of AMO.

Figure S9. A phylogenetic tree of archaeal pelota gene homologs. Amino-acid sequences of pelota were randomly selected from the National Center for Biotechnology Information databases. The multiple sequence alignment of the amino-acid sequences was used for building maximum-likelihood trees.

Table S1. Protein coding sequences of central carbon, nitrogen, lipid metabolism and genes involved in the stress response of the archaeon

File S2

Quantitative PCR assays
Quantitative PCR (qPCR) for archaeal was performed in triplicate in the Mx3000P real-time PCR Thermal Cycler. Results were expressed as relative abundance, in log10 of gene copies per nanogram of DNA.

Trimming and sequence filtering of PacBio sequences
Raw data from PacBio were initially processed with BLASR from SMRT Analysis portal 1.4 (http://www.pacbiodevnet.com/SMRT-Analysis/Algorithms/BLASR) to find the highest-scoring local alignments among reads. The sequences were also filtered by length before assembly, using 8859 nucleotides as the cutoff value. Together, the two filtering steps resulted in an average read size of 8602.5 (6964 reads, N50=9684) with a read quality of 0.858. This final filtering by size was crucial to obtain the present genome from the assembly with Celera. A high number of short reads dramatically increases computing requirements and may also result in a lower-quality assembly due to the excess of erroneous reads. The initial sequencing report data with BLASR filtering are available in Table B.
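The length filter and the N50 statistic reported above can be illustrated with a minimal sketch. It assumes the cutoff acts as a minimum read length, which the text does not state explicitly; the function names and the read lengths are hypothetical and are not taken from SMRT Analysis or from this study.

```python
from statistics import mean


def filter_by_length(read_lengths, min_length):
    """Keep only reads that are at least min_length nucleotides long."""
    return [n for n in read_lengths if n >= min_length]


def n50(read_lengths):
    """Largest length L such that reads of length >= L hold at least half of all bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical read lengths, for illustration only.
lengths = [1200, 3000, 9500, 10200, 8900, 400, 12000]
kept = filter_by_length(lengths, min_length=8859)
print(len(kept), mean(kept), n50(kept))
```

Removing the short reads raises both the mean length and the N50 of the retained set, which is the effect the size filter is meant to achieve before assembly.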

Sequence assembly
To check for errors in the present genome assembly, we compared the results of different de novo assembly tools and sequencing technologies. A detailed comparison of the assembly results obtained with the different methods and technologies is given in Table C.

IonTorrent
The trimmed/filtered IonTorrent reads were assembled using the de novo genomic assembly tools Mira 3.9 (Chevreux et al., 1999) and IDBA-UD (Peng et al., 2012). The IDBA-UD algorithm is based on the de Bruijn graph approach and is designed for assembling reads from single-cell or metagenomic sequencing technologies with uneven sequencing depths. Mira takes advantage of additional available information, such as low-confidence regions, quality values or repetitive-region tags, to improve the assembly procedure. For both de novo assemblers, we used parameters optimized for the present reads: non-uniform read distribution, accurate assembly options and no trace information.
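As a rough illustration of the de Bruijn graph approach mentioned above, the sketch below builds the graph for a single k-mer size from two toy reads. It is not the IDBA-UD implementation and omits everything that makes that assembler specific, such as error correction and the handling of uneven sequencing depth; the function name and the reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads, for illustration only.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Following the edges from AC through CG, GT and TC to CA spells out ACGTCA, the sequence that both reads were drawn from. A contig corresponds to such an unbranched path.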

PacBio
After filtering/trimming the PacBio reads, we used the Celera tool from the SMRT portal for assembly (http://www.pacbiodevnet.com/SMRT-Analysis/Software/SMRT-Pipe). Celera is a tool for scalable genome assembly of PacBio long reads. The default settings for PacBio reads were used in the Celera assembly run. In addition, we used the MIRA assembler as an alternative method in order to compare its contigs with the Celera results. The default options for PacBio reads were used in the MIRA assembler.

Genome finishing and scaffolding
The final PacBio contig (present genome) was also processed with Quiver (Chin et al., 2013), a highly accurate consensus and variant caller that can generate 99.99% accurate consensus sequences using local realignment and the full range of quality scores associated with PacBio reads. We obtained a consensus concordance of 99.9945% for the present genome.
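To put the reported concordance in more familiar terms, the following sketch converts it to a Phred-scaled quality value and to an expected number of discordant positions. It rests on two assumptions that the text does not confirm: that the reported 99.9945 is a percentage, and that it can be treated as a per-base accuracy over the 2.95 Mb contig. The function names are hypothetical.

```python
import math


def phred_qv(accuracy):
    """Phred-scaled quality for a per-base accuracy given as a fraction."""
    return -10 * math.log10(1 - accuracy)


def expected_errors(accuracy, genome_length):
    """Expected number of discordant positions at the given per-base accuracy."""
    return (1 - accuracy) * genome_length


concordance = 99.9945 / 100   # assumption: the reported value is a percentage
genome_length = 2_950_000     # 2.95 Mb contig, as stated in the abstract

print(round(phred_qv(concordance), 1))
print(round(expected_errors(concordance, genome_length)))
```

Under these assumptions the concordance corresponds to roughly Q42.6, or on the order of 160 discordant positions across the contig.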

Assembly Verification
After the assembly procedure, we compared the contigs generated by both sequencing technologies (PacBio and Ion Torrent). We used the Vista (Frazer et al., 2004) and Mauve (Darling et al., 2004) genomic analysis tools to align the final contig generated by Celera to all contigs generated by Mira. The Vista results show 99% conserved nucleotides between the Celera (PacBio) and Mira (IonTorrent) contigs, and all assemblies showed similar GC content (Table C).
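The two quantities compared in this verification step, GC content and the share of conserved nucleotides, can be sketched as follows. This is a simplified illustration for an already aligned pair of sequences and is not how Vista or Mauve compute their statistics; the function names and the sequences are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous A/C/G/T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy sequences, for illustration only.
print(gc_content("ACGTGGCC"))
print(percent_identity("ACGT-ACGT", "ACGTTACCT"))
```

Ignoring gap columns is a design choice of this sketch; counting gaps as mismatches would give a stricter measure.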

Two-component systems annotation
We used the Conserved Domain Search tool from NCBI (Gibney, Baxevanis, 2011

Phylogenetic analyses
The nucleotide and amino acid sequences for phylogenetic reconstruction were obtained from NCBI databases (Gibney, Baxevanis, 2011; Benson et al., 2012). The selected nucleotide sequences of 16S rRNA and predicted amino acid sequences of AMO were aligned using the multiple sequence alignment tool MUSCLE 3.8.31 (Edgar, 2004). The multiple sequence alignment of 16S rRNA was filtered using Gblocks to select conserved sites and remove poorly aligned regions (Talavera, Castresana, 2004).
Conserved blocks were selected using the following criteria: a minimum block length of 8 sites, at most 4 contiguous nonconserved positions, and conservation in at least 50% of the aligned sequences.
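The three criteria can be made concrete with a deliberately simplified sketch. It reads them as a minimum block length of 8 columns, at most 4 contiguous nonconserved columns inside a block, and a column counting as conserved when one residue occurs in at least 50% of the sequences. This is an assumption-laden illustration, not the actual block selection procedure of the filtering tool, which applies additional rules; all function names are hypothetical.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=0.5):
    """Flag columns where one residue (gaps excluded) occurs in >= min_fraction of sequences."""
    n = len(alignment)
    flags = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        flags.append(top >= min_fraction * n)
    return flags


def select_blocks(flags, min_block=8, max_nonconserved=4):
    """Return half-open (start, end) column ranges that satisfy the block criteria."""
    blocks = []
    start = last = None  # first and most recent conserved column of the open block
    for i, conserved in enumerate(flags):
        if conserved:
            if start is None:
                start = i
            last = i
        elif start is not None and i - last > max_nonconserved:
            # The nonconserved run is too long: close the block at its last conserved column.
            if last - start + 1 >= min_block:
                blocks.append((start, last + 1))
            start = last = None
    if start is not None and last - start + 1 >= min_block:
        blocks.append((start, last + 1))
    return blocks


# Toy conservation pattern, for illustration only.
flags = [True] * 5 + [False] * 2 + [True] * 4 + [False] * 6 + [True] * 3
print(select_blocks(flags))
```

In this toy pattern the short run of two nonconserved columns stays inside the first block, the run of six ends it, and the final three conserved columns are discarded because they fall short of the minimum block length.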
Maximum likelihood trees were built using the phylogenetic tree reconstruction tool PhyML 3.0 (Criscuolo, 2011). For the 16S sequences, we used the default parameters and 100 bootstraps; for the amino acid sequences of AMO, we used the JTT model, 1000 bootstraps, and the best of the NNI and SPR search operations. The phylogenetic trees were visualized and exported using the Archaeopteryx 0.9809 tool (Han, Zmasek, 2009).
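The bootstrap values on the trees come from resampling alignment columns with replacement and rebuilding the tree for each replicate; the support of a branch is the share of replicate trees that contain it. The sketch below shows only the resampling step, under the assumption of a standard nonparametric bootstrap. The tree building itself is left to the phylogenetics software, and the function names and toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate n_replicates pseudo-alignments from one seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment, for illustration only.
alignment = ["ACGTAC", "ACGTTC", "TCGAAC"]
replicates = bootstrap_replicates(alignment, n_replicates=100)
print(len(replicates), replicates[0])
```

Every sequence in a replicate is built from the same column picks, so each resampled column remains a genuine column of the original alignment.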