Insight into Dominant Cellulolytic Bacteria from Two Biogas Digesters and Their Glycoside Hydrolase Genes

Diverse cellulolytic bacteria are essential for maintaining high lignocellulose degradation ability in biogas digesters. However, little was known about functional genes and gene clusters of dominant cellulolytic bacteria in biogas digesters. This is the foundation to understand lignocellulose degradation mechanisms of biogas digesters and apply these gene resource for optimizing biofuel production. A combination of metagenomic and 16S rRNA gene clone library methods was used to investigate the dominant cellulolytic bacteria and their glycoside hydrolase (GH) genes in two biogas digesters. The 16S rRNA gene analysis revealed that the dominant cellulolytic bacteria were strains closely related to Clostridium straminisolvens and an uncultured cellulolytic bacterium designated BG-1. To recover GH genes from cellulolytic bacteria in general, and BG-1 in particular, a refined assembly approach developed in this study was used to assemble GH genes from metagenomic reads; 163 GH-containing contigs ≥ 1 kb in length were obtained. Six recovered GH5 genes that were expressed in E. coli demonstrated multiple lignocellulase activities and one had high mannanase activity (1255 U/mg). Eleven fosmid clones harboring the recovered GH-containing contigs were sequenced and assembled into 10 fosmid contigs. The composition of GH genes in the 163 assembled metagenomic contigs and 10 fosmid contigs indicated that diverse GHs and lignocellulose degradation mechanisms were present in the biogas digesters. In particular, a small portion of BG-1 genome information was recovered by PhyloPythiaS analysis. The lignocellulase gene clusters in BG-1 suggested that it might use a possible novel lignocellulose degradation mechanism to efficiently degrade lignocellulose. Dominant cellulolytic bacteria of biogas digester possess diverse GH genes, not only in sequences but also in their functions, which may be applied for production of biofuel in the future.


Introduction
Biogas digesters can efficiently degrade lignocellulose and have long been used in the treatment of animal manure and agriculture residues [1]. Biogas production from lignocellulose is a complex process that includes four main steps: hydrolysis, acidogenesis, acetogenesis, and methanogenesis [2]. Previous studies have revealed that the hydrolysis step, in which lignocellulose is decomposed into soluble sugar by cellulolytic bacteria, is the rate-limiting step in biogas production [3]. In previous studies, the cellulolytic bacteria Clostridium thermocellum, Clostridium stercorarium, Clostridium cellulolyticum, and Bacteroides cellulosolvens were detected in various biogas digesters by culture-independent methods (i.e., 16S rRNA gene-based methods), suggesting that these microorganisms could be the dominant cellulolytic bacteria in many biogas digesters [4][5][6][7][8][9]. In addition to these microorganisms, uncultured cellulolytic bacteria from Clostridia group 4 were identified as dominant cellulolytic bacteria in a landfill-derived biogas digester through DNA-stable isotope probing (DNA-SIP) and fluorescence in situ hybridization (FISH) techniques [10]. Recently, a meta-analysis of microbial diversity in anaerobic digesters further confirmed that species of the genera Acetivibrio and Clostridium are the most abundant cellulolytic bacteria in anaerobic digesters [11]. However, due to the absence of corresponding isolates, the glycoside hydrolase (GH) genes and lignocellulose degradation mechanisms of several dominant cellulolytic bacteria are completely unknown.
In previous investigations in which metagenomic methods were used to characterize the microbial communities of biogas digesters, many metagenomic reads or assembled metagenomic contigs were assigned to C. thermocellum, demonstrating that bacteria similar to C. thermocellum were the dominant cellulolytic bacteria in these biogas digesters [12][13][14]. For example, in one metagenomic study of a production-scale biogas plant fermenter, more than 3.37% of the assembled contigs were binned to the genome of C. thermocellum [12]. However, due to the complex microbial composition and low coverage of the obtained metagenomic reads, as well as the low homology of the GH genes [15], full-length GH genes were rarely recovered from the short metagenomic reads [12]. Moreover, it was impossible to recover genome information from uncultured cellulolytic bacteria in the biogas digesters, because most of the short metagenomic reads and assembled contigs could not be assigned to known species [12].
In parallel, functional screenings of lignocellulase genes from metagenomic libraries of biogas digesters have also been performed in the past few years [16][17][18][19]. In one study, as many as 973 positive fosmid clones harboring lignocellulase genes were screened out from the metagenomic library of biogas digester Z7 [19]. Based on these results, lignocellulase genes or gene clusters that have potential uses in industry may be obtained through functional screening methods [16]. However, as seen in other studies [20,21], some lignocellulase genes, including those from dominant cellulolytic bacteria, may be overlooked by functional screening, while other rare lignocellulase genes may be recovered, because positive clone selection is based on whether the GH genes are expressed in a host system (usually Escherichia coli). Even if all GH genes from all cellulolytic bacteria are recovered from the fosmid clones, there still remains the problem of linking these GH genes or gene clusters to the cellulolytic bacteria from which they were derived, as many GH genes are novel and are often assigned to uncultured bacteria [22]. Recently, methods such as PhyloPythiaS analysis, have been used to analyze diverse metagenomic samples through assigning metagenomic sequences from novel genera or higher-level clades to the proper taxonomy with limited reference data, [23][24][25].
In this study, 16S rRNA gene-based clone library methods and metagenomic methods were used to obtain insight into the dominant cellulolytic bacteria and their associated GH genes from two biogas digesters fed with pig manure and rice straw. A refined assembly approach developed in this study was utilized to assemble the dominant GH genes obtained from the biogas digesters.
Using a sequence-based method, fosmid clones harboring either GH genes or the 16S rRNA gene of a dominant cellulolytic bacterium were obtained and sequenced. Using these methods, contigs from dominant cellulolytic bacteria were recovered, and information regarding the genome of the cellulolytic bacteria in the biogas digesters was revealed. Additionally, six recovered GH5 genes were expressed in E. coli and confirmed to have diverse lignocellulase activities.

Materials and Methods
Setup of two laboratory-scale biogas digesters and metagenomic DNA preparation Two laboratory-scale 2.5-L biogas digesters (Z7 and Z8) were established with same slurry from a biogas plant fed with pig manure and operated at mesophilic temperature, located in Qianwei Village, Chongming County, Shanghai, China [7]. The two digesters were operated anaerobically with the same fed-batch fermentation method at 40°C. Every 7 days, 600 mL of biogas slurry were discharged from each biogas digester; after the discharge, 120 g of pig manure and 300 mL of slurry from the biogas plant, brought to a final volume of 600 mL with water, were fed into each biogas digester. An additional 3 g of rice straw were fed only to digester Z7 [16]. After the two biogas digesters had been running for 1.5 years, the microbial communities were stable as monitored by 16S rRNA gene PCR-DGGE (data not shown). The slurry samples were collected separately and subjected to DNA extraction as described previously [16]. The carbon and nitrogen content of the slurry and methane yield of the two biogas digesters were also measured (S1 File).
Metagenomic sequencing, sequence analysis, and function annotation DNA samples extracted from Z7 and Z8 were subjected to 454 pyrosequencing using the GS FLX sequencing system. Annotation of protein-coding reads was performed with BLASTX [26] against the NCBI-NR (http://www.ncbi.nlm.nih.gov/protein), KEGG [27], and STRING--COG [28] protein databases. A cutoff e-value of 10 -2 was used to assign environmental gene tags (EGTs) [29] from the BLASTX results [30]. Functional profiles of the KEGG and COG functional categories were calculated from the number of EGTs assigned to the specific KO or COG. 16S rRNA genes were predicted from the metagenomic data using BLASTN against the RDPII database [31]. For each 16S rRNA gene match, reads with e-values 10 -5 , match lengths ! 65 nucleotides, and similarities ! 80% identity were retrieved for further analysis [30]. Reads encoding GH, carbohydrate-binding modules (CBM), and cellulosomal genes were determined according to their best hits from the CAZy [32] and NCBI databases (S1 File). MEGAN [33] was used to analyze the NCBI taxonomies of the metagenomic reads (S1 File). The 454 pyrosequencing reads from the two samples were deposited in the GenBank SRA database under GenomeProject ID #50503. Additionally, the Newbler software version 2.6 (Roche/ 454 Life Sciences) was used with default parameters (Minimum overlap length of 40 bp and Minimum overlap identity of 90%) to assemble the reads, and the assembled contigs ! 1 kb were deposited in NCBI database under accession numbers KJ797191-KJ801155.
To compare phylogenetic profiles between the PCR amplicons and the metagenomic predictions in Z7 or Z8, sequences from both datasets were aligned with NAST at Greengenes [34], imported into the ARB software [35], and then inserted into the reference tree of the Greengenes database with the lanemaskPH filter using the parsimony insertion algorithm. From the resulting phylogenetic tree, groups of phyla were identified and the abundances of included sequences were calculated for the PCR amplicons and the metagenome, respectively.
Recovery of GH family genes from two biogas digesters by a refined assembly approach A refined assembly approach was used to recover dominant lignocellulase genes from short metagenomic reads obtained from the Z7 and Z8 samples using protein sequences from several GH families (GH1, GH3, GH5, GH9, GH10, and GH11) retrieved from the CAZy database as the reference database (S1 Fig) [32]. All of the 454 reads were queried against the reference database with BLASTX (e-value of 1e-2 and bacterial codon table-Q 11). The resulting high-scoring segment pair (HSP) hits were then subjected to assembly using the Geneious Pro Assembler (Biomatters, Auckland, New Zealand). To recover more genome information from neighboring genes, we also use the assembled GH-containing contigs to mega-BLAST the metagenomic reads and/or the contigs assembled from the total metagenomic reads by Newbler software (S1 Fig).
Only contigs ! 1 kb in length were taken into account for further analysis. To predict the open reading frames (ORFs) and determine functional roles of the genes, FGENESB (http://linux1. softberry.com/berry.phtml) and BLASTX were used in combination to annotate the contigs. All contigs ! 1 kb in length were blasted (BLASTX) against known GH family protein sequences from the CAZy and Pfam databases. Further manual inspection was conducted to verify the assembly quality and annotation results to obtain the final list of recovered GH-containing contigs.
To test the accuracy of the assembled GH-containing contigs, 50 pairs of contig-specific primers were designed from 50 randomly selected assembled GH-containing contigs and used to amplify the corresponding PCR fragments from the same DNA sample used to generate the metagenomic data (S1 File; S1 Table). The 50 resultant PCR fragments were then sequenced and compared to sequences in the corresponding 50 assembled GH-containing contigs. The Gen-Bank accession numbers of the recovered GH-containing contigs are KJ797028-KJ797190.
Screening of positive fosmid clones harboring GH-containing contigs from the Z7 fosmid library Contig-specific primers were also used to screen positive fosmid clones from the Z7 fosmid library (S1 Table). To obtain more information regarding the genome of the dominant bacterium BG-1 through PhyloPythiaS analysis [24], two additional fosmid clones containing BG-1 16S rRNA genes were obtained by PCR screening (S1 Table). The two fosmid clones harboring the BG-1 16S rRNA gene and the 11 selected positive fosmid clones harboring GH genes were sequenced and assembled into fosmid contigs using the Newbler software (S1 File). The fosmid contigs were annotated in the same method as the GH-containing contigs and deposited in GenBank under accession numbers KJ797017-KJ797027. The depth of contigs in the metagenomic data was calculated to weight the distribution of these contigs and the methods are described in S1 File [36].
Expression of 6 recovered GH5 family genes in E. coli Six recovered GH5 genes (Cel1-Cel6) with full-length sequence were expressed in E. coli (S2 and S3 Tables) and the purified enzymes were subjected to testing for lignocellulase activity on the substrates carboxymethyl-cellulose (CMC, Sigma, St. Louis, MO, USA), locust bean gum (Sigma, St. Louis, MO, USA), xylan from beechwood (Sigma, St. Louis, MO, USA), and pnitrophenyl-D-cellobioside (pNPC, Sigma, St. Louis, MO, USA). The six GH5 genes were assigned to different GH5 subfamilies according to phylogenetic analysis based on their closest GH5 gene relatives. Gene sequences, gene expression, enzymatic analyses, phylogenetic analysis, and signal peptide prediction details for the GH5 genes are described in S1 File.

16S rRNA gene based analysis reveals the microbial diversity in biogas digesters
The 16S rRNA genes amplified from DNA isolated from biogas digesters Z7 and Z8 were sequenced and subjected to phylogenetic analysis and diversity estimation. Using a similarity threshold of !99% for operational taxonomic unit (OTU) determination, there were 225 and 151 bacterial OTUs, and 14 and 8 archaeal OTUs observed for Z7 and Z8, respectively (S4 Table; S2 File and S3 File). The Good coverage index indicated that more than 84% of bacterial diversity and 95% of archaeal diversity was sampled from the two communities (S4 Table).
Firmicutes was the largest bacterial phylum (72.77% of Z7 and 48.68% of Z8), followed by the Bacteroidetes (20.77% of Z7 and 37.42% of Z8) and the Proteobacteria (4.26% of Z7 and 6.79% of Z8) in the 16S rRNA clone libraries (S5 Table; S2a Fig). Of the 339 bacterial OTUs identified from Z7 and Z8, 216 OTUs were singletons, indicating that most bacteria in the biogas digesters were present in low abundance (S2 File). Only 13 OTUs had ! 97% identity to cultured bacteria (i.e., type strains), though most bacterial OTUs matched sequences previously identified in microbial communities related to anaerobic fermentation systems (S2 File).
Twenty-six dominant bacterial OTUs represented by more than 10 sequences in clone libraries from both biogas digesters are summarized in Table 1. None of the dominant bacterial OTUs shared more than 97% identity with any cultured bacterial type strains. However, OTU158 and OTU98, which were dominant in Z7, were related to the cellulolytic strain Clostridium straminisolvens (T) CSK1 [37] with 94.76% and 92.29% identities, respectively. The C. straminisolvens (T) CSK1 16S rRNA gene shares 96.2% identity with that of the cellulolytic bacterium, C. thermocellum. The C. thermocellum-like bacteria, OTU158, OTU98, C. straminisolvens, and C. thermocellum clustered together in the phylogenetic tree (S3 Fig). OTU48 (hereafter named BG-1) shared 99.93% identity with one uncultured cellulolytic bacterium whose function was previously identified by FISH and DNA-SIP analyses [10]. BG-1 was taxonomically assigned to Clostridia group 4, and determined to be distinctively different from the C. thermocellum-like bacteria [10]. Further phylogenetic analysis of the BG-1 16S rRNA sequence indicated that BG-1 reveals a new order in the class Clostridia (S3 Fig). Besides, it is difficult to determine the functions of the other dominant OTUs due to the lack of close relatives with known functional information, or in some cases, the lack of any close phylogenetic relatives.
The phylogenetic group distributions were compared via 16S rRNA gene clone libraries and 16S rRNA gene reads extracted from the metagenome, indicating that the phylogenetic diversities of 16S rRNA clone libraries and metagenomic 16S rRNA gene datasets in Z7 or Z8 had consistent patterns (S5 Table; S2 Fig). When comparing them on the species level, most of the bacterial PCR-based 16S rRNA gene sequences (93.6%) had counterparts in the metagenomic 16S rRNA gene reads, with a minimum alignment length of 60 nucleotides and 98-100% sequence identity, and vice versa (85.6%). In all, the consistency of the phylogenetic distribution which was revealed by 16S rRNA gene analysis derived from PCR amplicons and 454 reads of the metagenome gene suggested that most of the microbial diversity in Z7 or Z8 was included in the metagenomic data (S5 Table;   Short metagenomic reads further reveal C. thermocellum-like bacteria were dominant in Z7 and Z8 A total of 519 015 and 668 860 reads were generated from the pyrosequencing of Z7 and Z8, respectively (S6 Table). Here, 3965 contigs ! 1 kb in length (average size = 1742 bp), summing up to 6.91 Mbp, were assembled from 174 221 metagenomic reads by the Newbler software; the largest contig was 22 059 bp. As only a small portion of the short metagenomic reads were assembled (14.67% of all Z7 and Z8 metagenomic reads), the short metagenomic reads were used in BLAST annotation, with hits from the NR, COG/KEGG, and Ribosomal Database Project (RDP) databases. In total, 68% of the metagenomic reads were assigned to NCBI taxonomy at various levels (S6 Table), and 40% of the reads were assigned at the family level, a reliable level as it was demonstrated by MEGAN analysis (S4a Fig) [33]. Almost all of the EGTs belonged to bacteria or archaea, only about 0.6% were assigned to eukaryotes, including fungi. A total of 44 phyla were identified, and Firmicutes, Bacteroidetes, Euryarchaeota and Proteobacteria were the dominant phyla in both samples (S4b Fig), which was consistent with the dominant phyla in 16S rRNA gene analysis (S5 Table; Table). Three cellulolytic bacterial species (C. thermocellum, C. cellulolyticum, and Caldicellulosiruptorsaccharolyticus) were dominant, and many metagenomic reads were assigned to the C. thermocellum genome (S7 Table).
Diverse GH genes and other related genes involved in lignocellulose degradation were revealed through annotation of short metagenomic reads COG and KO categories related to GH and associated proteins involved in the hydrolysis of carbohydrates such as starch, cellulose, and hemicellulose, were analyzed in Z7 and Z8. There were 3850 EGTs in Z7 that were assigned to 59 GH families and 3703 EGTs in Z8 assigned to 62 GH families, representing a rich diversity of putative lignocellulase and other GH genes (Fig  1a; S8 Table). Among them, the EGTs of lignocellulase genes were distributed mainly among the GH1, GH3, GH5, GH9, GH10, and GH11 families (Fig 1a; S8 Table). Carbohydrate-binding modules appended with GHs play a critical role in lignocellulose degradation [38], and 603 EGTs assigned to 16 CBM families were identified (S9 Table). CBM6 and CBM36, which often link to the xylanase catalytic module and have an affinity for xylan, were dominant in the two metagenomes. In addition, CBM4, CBM6, and CBM9, which have affinity for amorphous cellulose, were also abundant. Similarly, diverse EGTs associated with cellulosome-associated regions-such as cohesin and dockerin domains-were also identified, confirming that C. thermocellum-like bacteria were dominant in the biogas digesters (S10 Table).
Recovery of dominant GH-containing contigs using the refined assembly approach and validation of gene assemblies by PCR amplification and sequencing The refined assembly approach developed in this study was used to recover dominant lignocellulase genes (GH1, GH3, GH5, GH9, GH10, and GH11) from the two biogas digesters. A total of 44,477 metagenomic reads were HSP hits and were retrieved from the metagenomic data. Using the refined assembly approach, these metagenomic reads were assembled into 7438 contigs; 219 contigs had lengths ! 1 kb (S11 Table) and among these, 163 contigs contained at least one catalytic module belonging to a GH family (Fig 1b; S12 Table). Moreover, a comparison of GH-containing contigs assembled by the refined approach with the corresponding contigs assembled by the Newbler software revealed that the refined approach resulted in recovery of larger contigs (S12 Table). In total, 181 catalytic modules belonging to more than 32 GH families were recovered, including: 17 GH3 genes, 8 GH5 genes, 11 GH9 genes, 6 GH10 genes, and 2 GH11 genes, as well as many GH genes related to diverse hemicellulases, such as α-arabinofuranosidases (GH43, GH51, and unclassified arabinofuranosidases), α-glucuronidases (GH67 and GH115) and α-xylosidases (GH31, GH43, and unclassified α-xylosidase) (Fig 1b; S12 Table). However, no GH1-containing contigs ! 1 kb in length were obtained.
To determine whether the assembled GH-containing contigs were present in the biogas digesters, 50 pairs of contig-specific primers were designed from 50 randomly selected GH- containing contigs and used to amplify corresponding fragments from the metagenomic DNA sample isolated from Z7 as the template (S1 and S12 Tables). PCR fragments of the predicted sizes were obtained for all 50 contigs (100%), and their sequences were nearly identical to their corresponding target regions on the 50 assembled contigs (! 95% sequencing identity). These results confirmed that the recovered GH genes in the contigs assembled from metagenomic reads represented authentic genes in the metagenome of the biogas digesters.
Sequencing and analysis of fosmid clones screened from the Z7 fosmid library provide insights into GHs of dominant cellulolytic bacteria To obtain more genomic information of the dominant cellulolytic bacteria, several contig-specific primers for the GH-containing contigs and the BG-1 16S rRNA gene were used to perform sequence-based screening of a previously constructed fosmid library [16]. After screening, 11 fosmid clones harboring GH-containing contigs (mainly GH5 and GH9 genes) and two fosmid clones harboring BG-1 16S rRNA genes were subjected to sequencing.
Of the 13 sequenced fosmid clones, only one clone was not fully recovered (Fosmid 255O19), due to assembly errors resulting from repeated cohesin domain sequences ( Table 2). Assembly of the 13 fosmid clones resulted in 11 fosmid contigs and 0.45 Mbp of DNA sequence ( Table 2). Fosmid contig FC3, 51 kb in length, assembled from two fosmid clones harboring BG-1 16S rRNA genes, contained no GH genes ( Table 2). The gene structures of the 11 fosmid contigs are shown in Fig 2 and S5 Fig. Several GH family genes related to cellulose or hemicellulose degradation formed GH gene clusters in some fosmid contigs, including FC1 and FC2. Various AraC, LacI, BglG, or GntR family-like transcriptional regulators were detected in clusters containing GH family genes in the fosmid contigs, suggesting that these negative regulators might regulate GH gene expression [39,40]. Homologues of transporters, such as ABC transporters and tonB-dependent receptors, which may be used in lignocellulose degradation processes [21], were linked with lignocellulase genes in FC1, FC2, and FC4 (Fig 2; S5 Fig)

BG-1 genomic information recovered by PhyloPythiaS analysis
Of the 10 GH-containing fosmid contigs, PhyloPythiaS analysis using FC3 as the sample-specific model, supported the assignment of FC1 and FC2 to BG-1, while the other fosmid contigs were assigned either to the order Clostridia or to unknown bacteria (Table 2) [24]. In addition, another 214 contigs ! 1 kb in length assembled from Z7 and Z8 metagenomic reads by the Newbler software, were also assigned to BG-1 (S13 Table). The depth of 3 fosmid contigs (FC1, FC2 and FC3) and most other contigs assigned to BG-1 was similar. Based on these analyses,~0.53 Mbp of BG-1 genomic sequences were obtained, suggesting that BG-1 possesses at least 21 genes related to lignocellulose degradation. Specifically, there were 12 different lignocellulase genes, including GH3, GH4, GH10, GH39, GH43, as well as genes encoding α-L-rhamnosidase and α-arabinofuranosidase genes in FC2 (Fig 2b). In addition, FC1 possesses six GH genes, including two GH2 genes, one GH8 gene, two GH5 genes, and one GH9 gene (Fig 2a). According to the characterized proteins in the CAZy database, these lignocellulases are believed to degrade different components of the plant cell wall. No genes encoding cellulosomal components were detected in the GH-containing contigs or fosmid contigs assigned to BG-1, further testifying that BG-1 differs from the C. thermocellumlike bacteria.
In addition to the lignocellulose degradation potential, the molecular basis for the transmembrane import of oligosaccharides into the cytoplasm of BG-1 was emphasized by identification of four predicted ABC-type transporters within the FC1 gene cluster. Significantly, there were three extracellular solute-binding proteins (SBP family 1; gene 37, gene 50, and gene 60) among the ABC-type transporters in FC1, which are predicted to bind oligosaccharides and facilitate interactions with proximal permeases in the bacterial cell membrane [41]. Moreover, one ABC transporter (gene 11 to gene 13) predicted to transport sugar was also present in FC2. Additionally, several bacterial transmembrane sensor proteins, including two lacI family-like regulators (gene 35 and gene 49), one GntR family-like regulator (gene 59), and one methyl-accepting chemotaxis protein (MCP, gene 34), were present in FC1 [42]. One predicted operon comprised gene 35-gene 47, which might be co-expressed in BG-1 (Fig 2a). One BglG-like regulator linked with GH genes, and two MCP signaling domains (gene 6 and gene 14), were present in FC2 (Fig 2b).

Expression, purification, and characterization of six GH5 genes in E. coli
The GH5 gene family is one of the largest families in the CAZy database and~80% of GH5 genes can be classified into 51 distinct subfamilies [43]. GH5 was also one of the most dominant GH families in the biogas digesters [32]. Six full-length GH5 genes were recovered from the fosmid contigs; three of these (Cel2, Cel5 and Cel6) belonged to BG-1. Phylogenetic analysis indicated that three of the genes (Cel2, Cel4, and Cel6) could be assigned to subfamily GH5_4, one (Cel5) to subfamily GH5_10, while the remaining two genes (Cel1 and Cel3) could not be assigned to any subfamily (Table 3; S6 Fig). Among the six genes, Cel2, Cel4, Cel5, and Cel6 were predicted to have signal peptides.
The six GH5 genes were expressed in E. coli BL21 and the purified proteins were tested for enzymatic activity on CMC, locust bean gum, xylan (beechwood), and pNPC. The results indicated that most of the GH5 proteins have a wide substrate range (Table 3). Remarkably, Cel5 showed activity in all tests, including endoglucanase, xylanase, mannanase, and exocellulase activities. Cel1, one of the unclassified GH5 proteins, exhibited mannanase activity of 1255 U/mg protein, which is one of the highest mannanase activities ever reported in the BRENDA database [44]. Another GH5 unclassified protein, Cel3, exhibited weak exocellulase activity (Table 3).

Discussion
Although more than half dominant OTUs listed in Table 1 were shared by Z7 and Z8, their distributions in these two reactors were significantly different. For example, OTU48 (BG-1) was the dominant cellulolytic species in Z7 and Z8, its distribution was much more abundant in Z7 than Z8 (Table 1). Besides, metagenomic taxonomy analysis by MEGAN also suggested the abundance of cellulolytic species was significantly higher in Z7 than Z8 (S7 Table). Because the additional rice straw feeding in Z7 was the exclusive different operating condition for the two biogas digesters, feeding rice straw might contribute to the enrichment of cellulolytic species in Z7. However, more parallel experiments were necessary to support this hypothesis. In addition, MEGAN analysis also revealed the domain methangen, Methanoculleus marisnigri, was much more abundant in Z8 than Z7 (S7 Table). Though the chemical analysis of the slurry in these two biogas digesters were similar, the methane yield of Z8 was much lower than that of Z7 (S14 Table). These results suggested that the relative abundances between cellulolytic bacteria, methanogens and other functional groups might be important to achieve high methane yield, which would be investigated in the future.
Using the refined assembly approach developed in this study, 163 GH-containing contigs were successfully recovered from the short reads of the low-coverage metagenomic data (S12 Table). The 50 assembled GH-containing contigs selected at random were verified to be correct, suggesting that the refined approach can be used in the recovery of lignocellulase genes or other functional genes from similar short read metagenomic datasets generated with secondgeneration sequencing technology, such as Solexa technology. Though only gene sequences predicted as dominant lignocellulase by short 454 reads from the GH1, GH3, GH5, GH9, GH10, and GH11 families in the CAZy database were used as reference genes, 122 untargeted GH-containing contigs were recovered in addition to the 41 targeted contigs containing GH3, GH5, GH9, GH10, or GH11 family genes.This may be due to GH genes are ambiguous nonhomologous [15] or the parameters in the refined assembly approach used to retrieve metagenomic GH reads were loose. However, no GH1 genes were identified in the GH-containing contigs and fosmid contigs, implying that the GH1 family genes may have been ambiguously annotated and were not associated with the dominant cellulolytic bacteria [15]. Diverse GH genes and other related genes involved in lignocellulose degradation (CBM and cellulosomal genes) were recovered in the GH-containing contigs and fosmid contigs, suggesting that cellulolytic bacteria harboring these GH genes can efficiently degrade lignocellulose. The free enzyme system, the cellulosome system, and the predicted SusC (tonB)/SusD-mediated mechanism are the three known mechanisms used by microorganisms to efficiently degrade lignocellulose [20,21,38]. Because no processive cellulase genes (mainly distributed in the GH6, GH7, and GH48 families) were found in the biogas digesters, the free enzyme system might not have been present [38]. Cellulosomal genes were linked with GH genes in several recovered GH-containing contigs and fosmid contigs, and a tonB-dependent receptor gene and a GH9 gene were clustered in FC5, implying that the cellulosome system common in C. thermocellum-like bacteria as well as the SusC/SusD-mediated mechanism are present in the biogas digesters. Theoretically, the metagenomic data is a random sample of the entire microbial community; thus metagenomic reads containing GH gene sequences from the dominant cellulolytic bacteria should be abundant and easily assembled into large contigs.
BG-1 was first identified as an uncultured cellulolytic bacterium through DNA-SIP and FISH observations of a landfill leachate-derived mesophilic anaerobic digester in France [10], and has since been identified in other biogas digesters [45,46]. The analysis of 16S rRNA genes obtained from the clone libraries and metagenomic reads from our two biogas digesters indicated that BG-1 was one of the dominant cellulolytic bacteria (Table 1). However, as no genome information is available for BG-1 or its closest phylogenetic relatives (S3 Fig), and more than 70% of the metagenomic reads could not be taxonomically classified at the species level (S4a Fig), it was impossible to evaluate the abundance of BG-1 in our biogas digesters or obtain genomic information on BG-1 using MEGAN analysis.
Using PhyloPythiaS analysis,~0.53 Mbp of BG-1 genomic sequences were obtained ( Table 2; S13 Table). Based on the assembled genome sequences, some aspects of the lignocellulose degradation mechanism used by BG-1 could be deduced. Though BG-1 possesses diverse GH genes and may specialize in lignocellulose degradation, no genes encoding cellulosomal components or SusC/SusD were assigned to BG-1 by PhyloPythiaS analysis. Based on the gene structures of FC1 and FC2, and using FC1 as example, we propose here a possible novel lignocellulose degradation mechanism that might be used by BG-1 (Fig 3) [39,40]. At first, all GH genes are expressed at low-levels and small amounts of the two GH5 proteins, Cel5 and Cel6 (Table 3), are secreted into the surrounding environment. When BG-1 is exposed to lignocellulose, it is hydrolyzed into oligosaccharides by the secreted GH5 proteins. Next, the PAS-fold Nearest neighbor represented nearest protein sequence for each GH5 genes. 2 All activity assays were performed at pH7.4 and 50°C. The substrates used to determine activities of endoglucanase, mannanase, xylanase and exocellulase were CMC, locust bean gum, xylan (beechwood) and pNPC. The unit of enzyme activity is U/mg protein. 3 Activity was not detected. 4 Activity was less than 1 U/mg protein. 5 Cel5 and Cel6 were also mentioned as man1 and en2 in elsewhere (Yan et al., 2013), but were not expressed before.
doi:10.1371/journal.pone.0129921.t003 Insight into Dominant Cellulolytic Bacteria in Two Biogas Digesters domain and MCP signaling domain [42] detect rising levels of oligosaccharides released by GH5s from the extracellular environment and modulate BG-1 towards a lignocellulose-present environment [47]. The oligosaccharides are imported into cytoplasm by ABC transporters and hydrolyzed into simple sugars by other GH proteins in the cytoplasm, to be used for bacterial cell growth. Some of the simple sugars bind to the sugar-binding domain of the negative regulators, such as the lacI and GntR family-like regulators. The regulator-sugar complex is then released from the promoter region in the GH gene cluster, resulting in high expression of the GHs and ABC transporters identified in FC1. High expression of the GH proteins and ABC transporters results in efficient hydrolysis and utilization of lignocellulose by BG-1. As several genes are located in the same operon(such as gene 35-gene 47), these genes might be regulated by the same negative regulators and therefore co-transcribed [39,40]. When the expression levels of negative regulators overcome the intracellular sugar levels (i.e., when the lignocellulose in the vicinity of BG-1 is consumed), transcription of the GH gene cluster is severely down-regulated and returns to the low-level expression mode. As BG-1 was the predominant cellulolytic bacterium identified, this newly proposed mechanism suggests a possible alternative lignocellulose degradation mechanism and might be one of the dominant lignocellulose degradation mechanisms in these two biogas digesters. However, this possible mechanism is mainly speculative derived from the part genomic information of BG-1, and more experimental verifications are required in future studies. Two further dominant OTUs in Z7 (OTU158 and OTU98) were closely related to the cellulolytic bacterium, C. straminisolvens (T) CSK1, a C. thermocellum-like bacterium (Table 1). PCR-based 16S rRNA gene analysis identified only five singleton OTUs in Z7 that were closely related to C. thermocellum, with shared identities of 86-89%. However, MEGAN analysis indicated that C. thermocellum was the dominant cellulolytic bacterium, and many metagenomic reads were assigned to the C. thermocellum genome (S7 Table). As the genome of C. straminisolvens (T) has not yet been available in our MEGAN analysis, and C. straminisolvens (T) is phylogenetically similar to C. thermocellum, we speculate that MEGAN might have assigned most reads from OTU158 and OTU98 to the C. thermocellum genome. This speculation is supported by the fact that the compositions of reads from Z7 and Z8 that were assigned to C. thermocellum (3.54% and 0.36%, respectively) based on MEGAN analysis were consistent with the compositions of OTU158 and OTU98 (6.81% and 0.33%, respectively) in Z7 and Z8 16S rRNA gene clone libraries (Table 1; S7 Table).
Many different activities relevant to lignocellulose degradation have been identified in the GH5 family [43]. The six GH5 family genes assembled, expressed, and characterized in this study exhibited diverse lignocellulase activities. Although Cel2 and Cel4 were predicted to be members of subfamily GH5_4, which has been reported to exhibit only three different activities (endo-β-1, 4-glucanases, licheninases, and xylanases), we determined that Cel2 and Cel4 exhibited both mannanase activity and xylanase activity. In addition, Cel4 demonstrated exocellulase activity (Table 3). Cel5 was a member of subfamily GH5_10, a subfamily which has been reported to possess only mannanase activity [43]. However, Cel5 demonstrated not only mannanase activity, but also activities toward xylan and pNPC. This suggests that GH5_10 may be a subfamily with polyspecific activity (Table 3). Based on enzymatic analysis of the six GH5 proteins, the activity spectra of subfamilies GH5_4 and GH5_10 have been expanded. In future studies, more substrates should be used in the characterization of novel GH5 proteins. As a phylogenetic clade with five or more GH5 family genes in the public protein databases obtained from different microorganisms can be defined as a new subfamily of the GH5 family [43]. Cel1 and Cel3 and their clades can be seen as two new GH5 subfamily candidates (S6 Fig). Therefore, the enzymatic analysis of the six GH5 proteins has expanded the activity spectrum of subfamilies GH5_4 and GH5_10, and provided two new GH5 subfamily candidates with characterized proteins, further demonstrating that the recovered GH-containing contigs and fosmid contigs were active in the investigated biogas digesters.

Conclusion
This study revealed that the dominant cellulolytic bacteria in the biogas digesters utilized multiple lignocellulose degradation mechanisms, including using diverse lignocellulases to hydrolyze lignocellulose, which may be the key to maintain high biogas yields from biogas digesters. Besides, some of the lignocellulases may be applied for second generation biofuel production in future.
Supporting Information S1 Fig. The refinery assembly approach used to assemble GH-containing contigs from metagenomic reads. Targeted reads were retrieved from the biogas metagenome by BlastX of the reference GH protein sequences against the metagenome data, and then subjected to Geneious assembly. The resulted contigs were further extended from their ends by megaBlast to the metagenomic reads or the contigs from 454 Newbler assembly of all metagenomic reads, to obtain the contigs as long as possible. Manual effort was performed at the final step to inspect the assembly quality and get the final list of GH-containing contigs.  Table. Primers used for verification of GH-containing contigs assembled from the metagenomic short reads by the refined assembly method and screening of fosmid clones harboring these GH-containing or BG-1 16S rRNA gene.  Table. Metagenomic reads encoding cellulosomal genes in Z7 and Z8 metagenomes. (DOCX) S11 Table. Numbers of reads or contigs in the GH-containing contigs recovering process with the refinery assembly approach in this study. (DOCX) S12 Table. The 163 GH-containing contigs assembled from the metagenomic data of biogas digesters with the refined assembly approach. (DOCX) S13 Table. The 214 contigs assembled with Newbler software from the metagenomic reads of Z7 and Z8 assigned to BG-1 by PhyloPythiaS analysis. (DOCX) S14 Table. Chemical analysis of slurry and biogas production data of the two biogas digesters. (DOCX)