Functional metagenomics reveals novel β-galactosidases not predictable from gene sequences

The techniques of metagenomics have allowed researchers to access the genomic potential of uncultivated microbes, but there remain significant barriers to determination of gene function based on DNA sequence alone. Functional metagenomics, in which DNA is cloned and expressed in surrogate hosts, can overcome these barriers, and make important contributions to the discovery of novel enzymes. In this study, a soil metagenomic library carried in an IncP cosmid was used for functional complementation for β-galactosidase activity in both Sinorhizobium meliloti (α-Proteobacteria) and Escherichia coli (γ-Proteobacteria) backgrounds. One β-galactosidase, encoded by six overlapping clones that were selected in both hosts, was identified as a member of glycoside hydrolase family 2. We could not identify ORFs obviously encoding possible β-galactosidases in 19 other sequenced clones that were only able to complement S. meliloti. Based on low sequence identity to other known glycoside hydrolases, yet not β-galactosidases, three of these ORFs were examined further. Biochemical analysis confirmed that all three encoded β-galactosidase activity. Lac36W_ORF11 and Lac161_ORF7 had conserved domains, but lacked similarities to known glycoside hydrolases. Lac161_ORF10 had neither conserved domains nor similarity to known glycoside hydrolases. Bioinformatic and structural modeling implied that Lac161_ORF10 protein represented a novel enzyme family with a five-bladed propeller glycoside hydrolase domain. By discovering founding members of three novel β-galactosidase families, we have reinforced the value of functional metagenomics for isolating novel genes that could not have been predicted from DNA sequence analysis alone.


Introduction
pJC8 was used as a negative control. Lac+ cosmid DNA was prepared and analyzed by EcoRI-BamHI-HindIII digestion as described previously.

To further verify the Lac+ phenotype conferred by the complementing clones, we constructed a S.
Ontario) pressure cell at an internal cell pressure of 1.6 × 10⁸ Pa. His-tagged proteins were purified from Laboratories). Purified proteins were dialyzed twice at 4°C against 50 mM potassium phosphate and 10 mM were incubated at 37°C (Lac161_ORF10), 42°C (Lac36W_ORF11), and 50°C (Lac161_ORF3) for 30 min and terminated with Tris-HCl (pH 7) to a final concentration of 1 M. Aliquots of glucose oxidase/peroxidase reagent (125 µl) were added to each well and left to develop at 37°C for 30 min. Absorbance was measured at 450 nm and compared with a standard glucose curve to determine the amount of glucose released. All reactions were performed in triplicate.
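
The conversion from absorbance at 450 nm to released glucose can be sketched as a linear standard curve followed by averaging of triplicate wells. This is a minimal illustration only: it assumes the standard curve is linear over the working range, and every number and function name below is a hypothetical placeholder, not data or code from this study.

```python
# Sketch: convert A450 readings to glucose via a linear standard curve.
# All values are hypothetical placeholders, not measurements from this study.

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def glucose_from_absorbance(a450, slope, intercept):
    """Invert the standard curve: glucose = (A450 - intercept) / slope."""
    return (a450 - intercept) / slope


def mean_glucose(replicate_a450, slope, intercept):
    """Average the glucose estimates of replicate wells (e.g. a triplicate)."""
    values = [glucose_from_absorbance(a, slope, intercept) for a in replicate_a450]
    return sum(values) / len(values)


# Hypothetical glucose standards (nmol per well) and their A450 readings.
STANDARD_GLUCOSE = [0.0, 5.0, 10.0, 20.0, 40.0]
STANDARD_A450 = [0.05, 0.15, 0.25, 0.45, 0.85]

slope, intercept = fit_line(STANDARD_GLUCOSE, STANDARD_A450)

# Hypothetical triplicate readings for one enzyme reaction.
released = mean_glucose([0.44, 0.45, 0.46], slope, intercept)
print(f"slope={slope:.4f} A450/nmol, intercept={intercept:.4f}")
print(f"glucose released: {released:.2f} nmol per well")
```

Readings that fall outside the range of the standards should be diluted and re-measured rather than extrapolated.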

Bioinformatic analysis
Illumina raw sequence data were assembled as described previously (Lam et al. 2014). Open reading frames were annotated using MetaGeneMark (Zhu et al. 2010). Functions of proteins were predicted by BLAST Table S4. For comparison, and to estimate a background level of protein abundance using a housekeeping gene, all metagenomes were also searched for metagenomic homologs of the rpoB protein using HMMer (http://hmmer.org/) as implemented in MetAnnotate (Petrenko et al. 2015). Metagenomes possessing fewer than 100 rpoB hits were discarded, as these datasets were too small to yield meaningful results.
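
The filtering and normalization step described above can be sketched as follows. The threshold of 100 rpoB hits comes from the text; the normalization as a simple ratio of query hits to rpoB hits is an assumption made for illustration, and the metagenome names and counts are hypothetical.

```python
# Sketch: discard small metagenomes and normalize homolog counts by rpoB hits.
# Dataset names and counts are hypothetical; the ratio formula is an assumption.

MIN_RPOB_HITS = 100  # metagenomes with fewer rpoB hits are discarded (from the text)


def normalize_by_rpob(query_hits, rpob_hits, min_rpob=MIN_RPOB_HITS):
    """Return {metagenome: query_hits / rpoB_hits} for sufficiently large datasets."""
    normalized = {}
    for name, rpob in rpob_hits.items():
        if rpob < min_rpob:
            continue  # too few housekeeping-gene hits to give a meaningful ratio
        normalized[name] = query_hits.get(name, 0) / rpob
    return normalized


# Hypothetical hit counts for one query protein in three metagenomes.
query_hits = {"soil_A": 12, "gut_B": 300, "aquatic_C": 3}
rpob_hits = {"soil_A": 1200, "gut_B": 1500, "aquatic_C": 40}

relative_abundance = normalize_by_rpob(query_hits, rpob_hits)
for name, value in sorted(relative_abundance.items()):
    print(f"{name}: {value:.3f} hits per rpoB hit")
```

Dividing by a single-copy housekeeping gene makes datasets of different sequencing depth roughly comparable, which is why very small datasets are removed before the ratio is taken.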

Cosmid clones expressing β-galactosidase genes were screened in metagenomic library 12AC (Cheng et al. 2014). Functional β-galactosidase enzymes hydrolyse lactose (galactose-β-1,4-glucose) to galactose and glucose, facilitating the growth of bacterial hosts (lac) on M9 minimal media when lactose is used as the sole cosmids were transferred from E. coli HB101 (SmR TcR) to DH5α (lacZYA) via electroporation. We obtained screening in an effort to expand the range of recovered β-galactosidase-encoding clones. S. meliloti strain RmF728 is a derivative of the well-studied Rm1021 that has been modified to carry a genomic deletion that removes the lactose metabolism genes (Charles and Finan 1991) (Table S2).

Examination of the annotated ORFs of the other 13 Lac+ cosmids from S. meliloti (Table 2) also did not suggest any candidate that resembled known β-galactosidases. Based on a protein sequence comparison to the CAZy database, which showed low-level similarity to proteins carrying known GHs (but not β-galactosidases),
resulting affinity-purified proteins were assayed for β-galactosidase activity. Here, we report the biochemical properties of gene products from these three ORFs and confirm their activities on lactose as substrate.
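
Kinetic constants such as the Km and kcat/Km values reported for these enzymes are commonly estimated from initial rates measured at several lactose concentrations. The sketch below uses a Hanes-Woolf linearization ([S]/v plotted against [S]) as one simple way to do this; it is not necessarily the fitting method used in this study. The rate data are synthetic, generated from hypothetical Km and Vmax values, and all names are illustrative.

```python
# Sketch: estimate Km and Vmax by Hanes-Woolf linearization of Michaelis-Menten data.
#   v = Vmax * S / (Km + S)   =>   S / v = S / Vmax + Km / Vmax
# The data are synthetic (hypothetical Km and Vmax), not results from this study.

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hanes_woolf(substrate_mM, rates):
    """Return (Km, Vmax) from substrate concentrations and initial rates."""
    ratios = [s / v for s, v in zip(substrate_mM, rates)]
    slope, intercept = fit_line(substrate_mM, ratios)
    vmax = 1.0 / slope          # slope is 1 / Vmax
    km = intercept * vmax       # intercept is Km / Vmax
    return km, vmax


def catalytic_efficiency(vmax, enzyme_conc, km):
    """kcat / Km, with kcat = Vmax / [E] (units follow the inputs)."""
    return (vmax / enzyme_conc) / km


# Synthetic, noise-free rates from hypothetical constants.
HYPOTHETICAL_KM = 1.5    # mM
HYPOTHETICAL_VMAX = 2.0  # arbitrary rate units
lactose_mM = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
rates = [HYPOTHETICAL_VMAX * s / (HYPOTHETICAL_KM + s) for s in lactose_mM]

km, vmax = hanes_woolf(lactose_mM, rates)
print(f"Km = {km:.3f} mM, Vmax = {vmax:.3f}")
```

With real, noisy measurements a direct nonlinear fit of the Michaelis-Menten equation is usually preferred, because linearizations distort the error structure.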

Biochemical characterization of Lac36W_ORF11
Protein sequence searches of the 33 predicted ORFs against the CAZy database suggested that Lac36W_ORF11 (GenBank accession AGW45517) showed sequence similarity to the protein ERE_21070 of Eubacterium rectale M104/1 (GenBank accession CBK94002), which has three domains:

Because there was no similarity to any known GH domain or carbohydrate-binding module (CBM), we proposed that Lac36W_ORF11 (GenBank accession AGW45517) is a new β-galactosidase, possibly with other functions.

The nature of the promoter for this predicted operon and the basis for its function in S. meliloti, but not E. coli, are not known. Lac36W_ORF14 is predicted to encode a transcriptional regulator (LysR-like), but whether it has a role in regulation of the operon is unknown. Additionally, there were no ORFs encoding homologs of known transporters in the cloned 34-kb DNA. Thus, how lactose is taken up and how the predicted operon is regulated remain unknown.

Cosmid Lac161 complemented the Lac− phenotype of S. meliloti RmF728 but could not complement E.
The Heat_2 and Cytochrom_C domains might be involved in intracellular transport and electron transfer. In addition, Lac161_ORF7 was homologous to several proteins annotated as probable glycoside hydrolases, such as HVO_B0215 (GenBank accession ADE01485.1; CBM16, CAZy) of Haloferax volcanii DS2. Further sequence alignment analysis did not show any similarity to known GHs or CBMs. To determine whether the gene product exhibited any GH activity, Lac161_ORF7 was cloned and expressed. Purified ORF7 protein was able to hydrolyze lactose with a Km of 1.8 mM, which is the lowest of the three β-galactosidases studied in this work (Table 3). In addition, the Km value of Lac161_ORF7 is similar to the reported Km (2.0) of E. coli LacZ (Wallenfels and Malhotra 1961). The ORF7 protein was most active at the same pH of 6.0 as Lac36W_ORF11 (Table 3; Fig. 2A and 2C), but the highest activity of Lac161_ORF7 was observed at 50°C (Fig. 2D). In addition, Lac161_ORF7 had the highest kcat/Km among the β-galactosidases identified in this study.

(Table 3). The optimal pH and temperature of β-galactosidase activity were 6.5 and 37°C, respectively (Fig. 2E and 3F). In order to further investigate the range of substrate specificity, four other disaccharides were tested as substrates. When sucrose (glucose-β-1,2-fructose) was added, no glucose was released, suggesting that Lac161_ORF10 was not a β-fructofuranosidase (or invertase, GH32). Additionally, the ORF10 protein was unable to catalyze hydrolysis of xyloside (xylose-β-1,4-xylose, often associated with GH43), maltose (glucose-α-1,4-glucoside, often associated with GH65), and cellobiose (glucose-β-1,4-glucoside, often associated with GH1). Sequence analysis and activity assays therefore suggested that Lac161_ORF10 (GenBank

(Table S3), and all possessed this domain over the aligning region (E < 0.01). According to CDTree, Lac161_ORF10 represented a highly distinct branch within the GH_J sequence cluster (Fig. 3A), which provided some explanation for the observed weak similarity to existing CDD domains.

Proteins within the GH_J superfamily, including GH32 and GH68, all possess a five-bladed propeller fold, and share a funnel-shaped active site typically composed of a catalytic nucleophile (e.g., Asp) and proton donor (e.g., Glu) acting as the general acid/base, as well as an RDP motif (Lammens et al. 2009) involved in stabilizing the transition state (Fig. 3B). Our analysis suggests that Lac161_ORF10 also shared some of these characteristics.
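
As a rough first pass, candidate RDP-like motifs can be located in a protein sequence with a simple pattern scan. The sketch below is illustrative only: it is a plain string search, not the alignment-based analysis described in this work. The relaxed pattern `[RK]DP`, which also admits Lys in place of Arg, is an assumption made for illustration, and the sequence is a made-up toy string rather than Lac161_ORF10.

```python
# Sketch: report 1-based positions of RDP-like motifs in a protein sequence.
# The pattern and the toy sequence are illustrative assumptions only.
import re

MOTIF_PATTERN = re.compile(r"[RK]DP")  # Arg or Lys, then Asp, then Pro


def find_motifs(sequence, pattern=MOTIF_PATTERN):
    """Return a list of (1-based start position, matched motif) tuples."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence.upper())]


toy_sequence = "MSTRDPLLAGKDPWW"  # hypothetical sequence, not a real protein
for position, motif in find_motifs(toy_sequence):
    print(f"{motif} at residues {position}-{position + len(motif) - 1}")
```

A hit from such a scan only flags a candidate; whether the residues occupy the active site has to be judged from the alignment and the structural model.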

Interestingly, both Tmari_1232 and Lac161_ORF10 are members of the Pfam DUF377 family, further supporting the model. We then analyzed potential active sites using two separate methods: a sequence-based and a structure-based approach. According to the CDD sequence alignment, Lac161_ORF10 possesses a KDP motif (residues 196-198) that aligns to the active site RDP motif in the reference 1y9m structure (Fig. 3C). Ligand-binding sites were also predicted in the structural model using 3dLigandSite (Wass et al. 2010). This revealed a predicted cluster of eight residues, including the previously identified D-197 residue, as forming the putative active site (Fig. 3B). However, alternate alignments and putative active sites from those reported above are

We were interested in the distribution of sequences similar to the newly described β-galactosidase sequences throughout different metagenomes. To address this, we performed protein homology searches with these sequences against collections of aquatic, human gut and soil metagenomic databases, and normalized the results using the abundance of the housekeeping gene rpoB (Figure 4). Homologs of each of the three genes are represented in all three habitats. However, the relative abundance of Lac36W_ORF11 homologs in human gut is by far the greatest.

Lac36W_ORF11 is also high in soil, but not as high as in human gut. Although of lower relative abundance overall, Lac161_ORF10 is also more abundant in human gut than in soil or aquatic metagenomes. Lac161_ORF7 exhibits a quite different profile, being extremely rare in human gut, present at low levels in aquatic metagenomes, and present at higher levels in soil. It will be of interest to determine whether these homologs are also functional β-galactosidases.

Host influence on screening
on the expression of genes of interest and presence of accessory components required for the enzyme activity in the surrogate hosts (Martinez et al. 2004; Taupp et al. 2011). Multi-host systems have been developed to soil library (12AC) for the ability to complement β-galactosidase mutants resulted in a greater number of distinct clones using S. meliloti than the most widely used E. coli. In addition, three novel β-galactosidase genes were identified only in S. meliloti. These data emphasize that the development of multi-host systems is indispensable for functional screening.

Metagenomics provides unprecedented access to the genomic potential of uncultivated microbial communities. Despite enormous progress resulting from developments in high-throughput sequencing, the potential for novel enzyme discovery remains highest using a functional metagenomics approach, in which genes are isolated based on their function rather than by DNA sequence similarity to already known genes.

Using such an approach, we have discovered genes encoding novel types of lactose-hydrolyzing enzymes. The enzymes encoded by these genes were biochemically similar to known enzymes, although they would not have been easily predicted by their sequences without knowing that they were carried on a segment of DNA that encoded β-galactosidase activity. These results demonstrate the importance of sequence-agnostic functional screens for the discovery of enzymes of novel origin, and suggest that further implementation of this strategy