Curli Functional Amyloid Systems Are Phylogenetically Widespread and Display Large Diversity in Operon and Protein Structure

Escherichia coli and a few other members of the Enterobacteriales can produce functional amyloids known as curli. These extracellular fibrils are involved in biofilm formation and studies have shown that they may act as virulence factors during infections. It is not known whether curli fibrils are restricted to the Enterobacteriales or if they are phylogenetically widespread. The growing number of genome-sequenced bacteria spanning many phylogenetic groups allows a reliable bioinformatic investigation of the phylogenetic diversity of the curli system. Here we show that the curli system is phylogenetically much more widespread than initially assumed, spanning at least four phyla. Curli fibrils may consequently be encountered frequently in environmental as well as pathogenic biofilms, which was supported by identification of curli genes in public metagenomes from a diverse range of habitats. Identification and comparison of curli subunit (CsgA/B) homologs show that these proteins allow a high degree of freedom in their primary protein structure, although a modular structure of tightly spaced repeat regions containing conserved glutamine, asparagine and glycine residues has to be preserved. In addition, a high degree of variability within the operon structure of curli subunits between bacterial taxa suggests that the curli fibrils might have evolved to fulfill specific functions. Variations in the genetic organization of curli genes are also seen among different bacterial genera. This suggests that some genera may utilize alternative regulatory pathways for curli expression. Comparison of phylogenetic trees of Csg proteins and the 16S rRNA genes of the corresponding bacteria showed remarkably similar overall topography, suggesting that horizontal gene transfer is a minor player in the spreading of the curli system.


Introduction
Non-pathogenic as well as human and animal pathogenic Escherichia coli isolates and Salmonella enterica serovars are able to produce functional bacterial amyloids (FuBA) collectively referred to as curli [1][2][3]. FuBA are defined as bacterial protein polymers with a fibrillar structure in which the protein monomers fold as bsheets stacked perpendicular to the fibril axis [4,5].
Curli fibrils are involved in bacterial attachment to surfaces, cell aggregation and are an important part of the extracellular matrix required for the formation of mature biofilms [6][7][8][9]. Curli fibrils are also considered important virulence factors as they interact with a wide range of host proteins, which are proposed to facilitate bacterial dissemination through the host. These include extracellular matrix proteins [10][11][12] and contact-phase proteins [13][14][15][16]. Curli are recognized by Toll-like receptors, leading to the activation of the innate immune system [17,18]. Curli are therefore considered pathogen-associated molecular patterns (PAMPs).
A highly regulated pathway involving two divergently expressed operons is required for curli biogenesis. The csgBAC operon encodes the major curli subunit, CsgA, and its homolog CsgB, which after translocation to the cell surface acts as a nucleus for polymerization of soluble CsgA [3,19]. It also encodes CsgC, which is required for correct assembly of the mature curli fimbriae [20,21]. The csgDEFG operon encodes CsgD, a transcriptional activator of the csgBAC operon, together with CsgE and CsgF, which act as chaperones and are required for effective curli assembly. Finally, it encodes CsgG, a helical outer-membrane macromolecular exporter, important for secretion of the curli subunits through the outer-membrane [3,19,20,[22][23][24], which works in concert with CsgE to facilitate CsgA secretion [25].
Curli homologs have previously been identified within some, but far from all genera of the Enterobacteriales using either purification (Escherichia and Salmonella) or PCR-based methods targeting the csgA or csgD genes (Escherichia/Shigella, Salmonella, Citrobacter and Enterobacter) [12,26,27]. Curli fibrils may, however, be phylogenetically much more widespread and could consequently be important virulence factors for other pathogens.
The growing number of genome-sequenced bacteria spanning many phylogenetic taxa allows a reliable determination of the phylogenetic and structural diversity of genes and related proteins by bioinformatic tools [28,29].
In this study, homologs of the curli associated Csg proteins were found to be phylogenetically much more widespread than initially assumed, spanning at least four bacterial phyla. A high degree of variability within the operon structure between bacterial taxa suggests that the curli fibrils might have evolved to fulfill specific functions and may consequently be encountered frequently in environmental as well as pathogenic biofilms.

Curli Genes are Phylogenetically Widespread
Homologous curli systems were initially identified within the refseq protein database using PSI-Blast searches with Csg proteins from E. coli str. K-12 substr. MG1655 as query sequences. The hits were manually curated and additional homologs were identified by examination of the genetic neighborhood of the hits. The identified Csg homologs showed a high degree of variability in terms of primary structure, and sequence identity of the CsgA homologs were only 9-33% for the non-Enterobacteriales compared to CsgA from the E. coli strain.
Purely sequence-based methods, such as Blast searches, may not be able to detect evolutionarily related protein sequences if these have been subjected to intensive recombination and fast evolution [30]. The low sequence identity between Csg homologs of phylogenetically closely related species suggests that divergence could provide a problem when searching for curli homologs.
Profile hidden Markov models (HMMs) are much better at detecting remote homology between proteins than Blast searches [30]. The curated Csg homologs were therefore used to generate HMMs of the Csg proteins. As CsgA and CsgB are internal homologs and these proteins are highly variable, a combined CsgA/B HMM were constructed based solely on the curli repeat regions situated as described by Barnhart and Chapman [31]. The HMM models were able to identify additional curli homologs. The additional hits were curated as described earlier and the expanded Csg protein database were used to generate improved HMM models ( Table 1). The iterative process was repeated until no additional homologs could be identified.
The curated HMMs in general performed very well ( Table 1). The CsgA/B repeat model was able to detect 93% of all CsgA/B proteins in our homolog database. The remaining 7% were CsgA/ B proteins, which contained no or only one repeat region. These proteins are therefore assumed to be non-functional. The few hits on non-curli proteins represented four very large Shewanella adhesion proteins, which interestingly contained regularly spaced curli-like repeats as well as a single archaeal protein, which also had regularly, spaced curli-like repeats. The curated CsgA/B repeat model was able to detect more CsgA/B homologs than the curlin repeat Pfam model (PF7012), although the latter did not hit false positives (Table 1). However, a big advantage of the curated model is that it is able to correctly assign the location of the repeat regions in line with those described by Bernhart and Chapman [31]. The curated CsgC, CsgE, CsgF and CsgH models were all very sensitive and highly specific. The CsgE model performed better that the current Pfam-B model (PF10627), whereas the CsgF model had a similar performance as Pfam-B model (PF10614). The curated CsgG HMM was much more specific than the current Pfam model (PF3783), although it still included 41% false positives. The non-specific CsgG hits are likely paralogs, and indicate that CsgG is part of a larger protein family. The CsgD HMM has a good sensitivity, but cannot be used to probe curli systems as this protein is part of a very large protein family.
The combination of HMM searches and manual examination of the surrounding gene neighborhoods allowed identification of all previously known curli systems. These, however, represent only a tiny part of the phylogenetic curli diversity ( Figure 1 and Table  S1). The majority of the curli systems were found within the Proteobacteria. Surprisingly, several curli systems were also found within Bacteroidetes and within a single Firmicutes and Thermodesulfobacteria strain. Within the Proteobacteria most curli systems were found within the Alpha-and Gammaproteobacteria, although two systems were found in the Betaproteobacteria, both within the order Burkholderiales, and a single system were found within the order Desulfovibionales of the Deltaproteobacteria. This shows that the genes coding for curli systems are phylogenetically much more widespread than previously appreciated. It should be noted that only some of the families within the previously mentioned phyla contain curli homologs. Csg homologs are for example not present within any of the genome sequenced Klebsiella and Yersinia strains, although these bacteria are both members of the Enterobacteriales and phylogenetically closely related to E. coli.

Conservation and Organization of csg Genes
Conservation of curli genes shows some interesting variations among bacterial classes ( Figure 1). Homologs to all csg genes can be found within every orders of the Gammaproteobacteria with the exception of csgC, which is only found within the Enterobacteriales and Aeromonadales. The Delta-and Betaproteobacteria lack homologs to csgC and csgD, while the Alphaproteobacteria all lack csgC, csgD and csgE. In addition, approximately half of the genera within the Alphaproteobacteria also lack csgE and csgF. It is interesting to notice that many of the Alphaproteobacteria carry an additional gene (csgH), which is always situated next to csgA/B and does not show similarity to any other genes in the refseq database. This gene might fulfill the role of the missing csgE and csgF genes. The single Firmicutes lacks csgC, csgD and csgE, whereas the lone Thermosulfobacteria only lacks csgC. Finally, all genera of the Bacteroidetes lack csgC and csgD, with the exception of Spirosoma, which have a csgD homolog. It is only possible to distinguish CsgA and CsgB homologs within the Gammaproteobacteria. Interestingly, members of some genera contain up to six csgA/B homologs, as is the case for Spirosoma and Rhodopseudomonas. These additional copies are not similar and might be used to modulate the physiochemical properties of the final amyloid fibrils.  1 The size and location of the curli repeat region assigned by this model does not agree with the curli repeat regions described by [31].
This is supported by the fact that the additional copies are often situated in additional curli operons and may therefore be regulated independently of the main curli operon. An examination of the genetic organization of the homologous csg genes within the Gammaproteobacteria shows a high degree of reorganization compared to the strict organization known for members of the Enterobacteriales (Figure 1). The organization of the csg genes into the divergently expressed csgDEFG and csgBAC, described for E. coli and Salmonella, is only seen within the Enterobacteriales and the Vibrionales. The two operons are conserved in the Aeromonadales and Oceanospirillales, but here they are oriented in the same direction. In the Pseudomonadales and the Shewanella genus of the Alteromonadales the operons are divergently oriented, but the csgD homologs are separated from the csgDEFG operon and seem to be transcribed alone. Within Alteromonas and Pseudoalteromonas of the Alteromonadales all csg genes, apart from csgD, form a single operon. csgD is divergently oriented and located downstream the csgBAEFG operon.  [49]. The number of strains containing curli systems within each genus is indicated next to the taxonomic units. Note that these numbers are highly influenced by the number of sequenced strains within each phylogenetic group and therefore do not reflect the prevalence of curli systems within these groups. Genera highlighted in red represent genera where curli systems have been previously described. Organization of the curli operons is illustrated for each genus. doi:10.1371/journal.pone.0051274.g001 CsgA Homologs are Highly Variable whereas CsgB Structure is Conserved CsgA homologs from different phylogenetic groups show an astonishing variation in size and organization of repeat regions ( Figure 2 and Figure 3). Whereas CsgA homologs from the Enterobacteriales are relatively short (,152 amino acid residues) and contain only 4-5 complete repeat motifs, CsgA homologs from other bacterial orders are up to 529 amino acid residues in length and may contain up to 22 repeat motifs. This variation may have huge impact on polymerization, structure and stability of the resulting FuBA. It is interesting to notice that homologs of the nucleator protein, CsgB, do not show the same variation in size, although this protein is an internal homolog of CsgA (Figure 2 and Figure 3). The repeat motifs are generally well aligned end-to-end inside the CsgA homologs. For some genera, such as Halomonas, repeat regions superimpose due to the presence of two minimalistic repeat regions (X 6 QXGX 2 NX 10 ) inside the individual repeats. Such overlapping repeats may contribute to an increased stability and rigidity of the resulting FuBA.
Comparison of CsgA and CsgB repeat region consensus sequences from Gammaproteobacteria shows that the CsgB repeat regions are much better conserved then the CsgA repeats ( Figure 3). Two minimalistic repeat regions are common for CsgB but not for CsgA. This suggests, that the nucleator function of CsgB impose more structural constraints. There is also significant variation of the repeat region consensus sequence among genera. The highly conserved serine residue of the Enterobacterial CsgA repeats is for example not seen for any other genera.

Evolution of Curli Systems
A comparison of phylogenetic trees based on functional genes or protein sequences with those of 16S rRNA gene sequences for the corresponding bacteria can be used to track the evolutionary history and mechanisms of gene transfer of the functional genes. Phylogenetic trees based on the CsgDEFGH proteins sequence and the 16S rRNA gene of the corresponding bacteria show remarkably similar overall topography (Figure 4 and Figures S1, S2, S3, S4). Csg homologs from all genera localized in narrow genus-specific clusters, with the exception of Shewanella, which separated into two distant clusters. This denotes that horizontal gene transfer only play a minor role in the spreading of the curli system, and suggests that the curli systems might have evolved from a common ancestor.

Curli Systems within Metagenomes
Although the number of genome-sequenced bacteria has increased intensively over the last years, there is still a strong bias towards clinically relevant and cultivable bacterial strains. The CsgE, CsgF and CsgH HMMs were therefore used to identify curli homologs within 10 large metagenomes from a diverse range of habitats (Table 2). These HMMs were selected due to their high sensitivity and specificity and because they cover the phylogeny of the curli system well ( Table 1). The hits were aligned with the previously identified Csg homologs in order to construct phylogenetic trees ( Figure 5 and Figures S5 and S6). All metagenome hits fell within the known phylogenetic diversity of the genomesequenced bacteria. This indicates that the HMM models based on the genome-sequenced bacteria cover the phylogenetic diversity of the curli system. Some of the metagenome hits formed well-defined clusters in the phylogenetic tree and might therefore represent novel phylogenetic clades. These clusters were mainly seen within the Alpha-and Betaproteobacteria.

Curli System are Found in Bacteria from Diverse Habitats
The curli genes are phylogenetically much more widespread than initially appreciated. Curli-like FuBA may consequently be encountered frequently in environmental and pathogenic biofilms and they might have additional functions that are yet to be discovered. It is for example interesting that members of the Rhizobiales contain curli genes. These bacteria form complex symbiotic relationships with legume roots in the form of nitrogenfixing nodules. As E. coli and Salmonella curli have been shown to promote binding to plant tissue, curli fimbriae could conceivably play a role in this symbiosis [32,33]. Besides being found in many soil and enteric bacteria, curli genes were also found within many genera of marine bacteria, including Alteromonas, Pseudoalteromonas, Shewanella, Halomonas, and Vibrio and Oceanicola.

Curli as a Virulence Factor
Curli fimbriae are known virulence factors in E. coli and Salmonella pathogenicity [31]. The phylogenetic diversity of the curli system documented in this study arise the question whether this is also the case for other potential pathogens containing curli genes. For example, curli genes were identified in Aeromonas veronii, which is an emerging pathogen [34] that may cause diseases ranging from serious wound infections to septicemia in humans [35]. Although they are mostly isolated from cases of gastroenteritis that can range from a mild self-limiting diarrhea to cholera-like illness and dysentery [36], it is still poorly understood how A. veronii cause infections, but various adhesins may be involved [34]. Curlilike amyloids might mediate binding to and internalization into host cells as is seen for E. coli. Another example is Cronobacter turicensis, a food-borne pathogen associated with acute infections in neonates, which may lead to clinical outcomes such as meningitis, sepsis and necrotizing enterocolitis [37]. This bacterium is capable of forming biofilms that may contribute to pathogenicity [38]. An obvious experiment will be to investigate by genetic knock-out methods whether curli-like amyloid contributes to the biofilm development and virulence by C. turicensis.

Alternative Regulation and Biogenesis
Large flexibility is observed in the organization of csg genes across both genera and orders. For some genera the genes are divided into two operons, as seen for E. coli. In others they form a single operon. Genome rearrangements as seen for the csg operons are not uncommon in bacterial evolution [39,40]. Although operon rearrangement or disruption may result in loss of gene function, this does not have to be the case. The evolution of tryptophan synthesis (trp) operons is a clear-cut example of this. Whereas the six trp genes have evolved into a single transcriptional unit in E. coli, a separation into three transcriptional units is seen for P. aeruginosa [41]. Splitting of the trp operon does not affect physiological function, but it leads to alternative regulation, as each individual transcriptional unit requires its own transcription factors [41]. The overall variation in csg gene organization can therefore not be used to determine operon functionality. It does, however, imply that transcriptional regulation of curli biogenesis may vary between bacteria of different orders. The organization of csg into a single operon, as seen for Pseudoalteromonas, indicates that all genes are co-transcribed. Regulation of the individual Csg proteins is therefore required to take place during or after translation. For all Pseudomonas species except P. stutzeri, no csgD homologs are found in the vicinity of the csg operons. It might therefore be suspected that different transcriptional regulators control the expression of these operons. For many of the Alphaproteobacteria, Csg homologs could only be identified for the major and minor curlin subunits. This implies that alternative proteins must fulfill the role of the other Csg proteins in order to allow the bacteria to express functional curli fimbriae. This also suggests the existence of different pathways to amyloid biogenesis. Interestingly, the csg operon in these Alphaproteobacteria contains an additional gene, which we term csgH. The CsgH protein does not have any homologs of known function. A detailed study will therefore be required to determine the role of this protein.

Variation in the Number of Repeat Units
Whereas CsgA homologs from the Enterobacteriales are relatively short and contain only 4-5 repeat units, homologs from other bacterial orders can be much larger and display up to 22 repeat units (Figure 3). This could have profound effect on the amyloidogenicity of the proteins, the process of CsgA secretion and amyloid formation as well as on the stability and morphology of the resulting fimbriae. The size of CsgG, which constitutes the outer membrane macromolecular exporter, through with the curli subunits are secreted, is unaffected by the huge differences in size of CsgA homologs. This suggests that CsgA and CsgB are secreted as natively unfolded proteins.
Expansion of the oligopeptide repeat domain (ORD) within the mammalian prion protein (PrP) is associated with dominant, inherited prion diseases [42]. The effect of repeat expansion on amyloid formation has been investigated in details using chimeric versions of the yeast prion protein Sup35p, in which the ORD have been replaced by wildtype or expanded ORDs from PrP [43,44]. In vitro studies showed that expansion of repeat regions increase the aggregation propensity and the kinetics of fibril formation, the latter to such a degree that no lag phase can be observed. The repeat expansion showed no detectable effect on fibril stability, suggesting that a similar amyloid core was formed. The repeat expansion furthermore had a clear effect on the morphology of the formed fibrils. The wildtype construct produced fibrils that were long and straight, whereas the expanded construct resulted in curvy and clumped fibrils [44]. The results of the in vitro studies were in line with observations previously made on the phenotypic effects of similar mutations in yeast cells [43]. From the prion data we might expect that expansion of repeat units in CsgA homologs results in more aggregation prone proteins, which form amyloids with comparable stability. Mutation of gatekeeper residues, i.e. residues that reduce amyloidogenicity, in the E. coli CsgA repeats results in CsgA monomers, which in vitro forms amyloid fibrils in the absence of a nucleator [45]. Consequently, it is possible that some species are able to express fimbriae in the absence of a nucleator protein, although this might reduce control of the aggregation process. An alternative could be that gatekeeper residues within individual repeat units are used to dampen the amyloid propensity, thereby allowing the fibrillation to be controlled by CsgB.

Functional FuBA Systems or Pseudogenes?
The presence of curli gene homologs within the genome of a bacterium does not necessarily imply that the bacterium is able to express functional curli fimbriae. The genes may simply be pseudogenes, which have been passed on through vertical gene transfer. However, theoretical considerations suggest that the curli systems are functional. Although the CsgA homologs have been subjected to many mutation and recombination events, as judged by the low sequence identity between homologs and the large variation in the number of repeats, these do not seem to affect the highly organized modular structure of repeat units lined up within the CsgA homologs. Recombination events sometime occur within the repeats. Yet this does not lead to disruption of the repeat motif. The positive selection for repeat motifs implies that the proteins are functional. In addition, the reorganization of the csg operons seldom results in disrupted genes. Gene disruption should be much more common if the csg homologs were pseudogenes as there is no selective pressure acting against their disruption. We therefore suspect that the homologous curli systems are functional.

Multiple FuBA Operons suggest Multiple FuBA Systems
Curli homolog systems are found within Pseudomonas putida F1 and P. fluorescens Pf0-1. These strains also contain an operon coding for another type of FuBA, namely the Fap fimbriae [46]. This opens up for the possibility that some bacteria are able to Figure 2. Organization of Curli Repeat Motifs. Minimalistic curli repeats (X 6 QXGX 2 NX 10 ) are shown with arrows. Yellow arrows represent repeats within CsgA homologs, blue arrows repeats within CsgB homologs, and red arrows represent repeats in homologs, which cannot be reliably classified as either CsgA or CsgB homologs. doi:10.1371/journal.pone.0051274.g002 express several FuBA depending on the environmental conditions. These FuBA may have different functions. One could be used for initial adhesion or host interaction, whereas the other could play a structural role in the mature biofilm. This is supported by the fact that E. coli curli mainly are expressed during the stationary phase [19], whereas Fap fimbriae in Pseudomonas seem to have the highest expression during the exponential growth phase (Dueholm et al., unpublished results).

Concluding Remarks
This study clearly demonstrates that only the tip of the iceberg have been investigated in respect to curli functional amyloids. The phylogenetic diversity of curli systems implies that these proteinaceous extracellular polymeric substances (EPS) are common in many biofilms and should be considered of equal importance as polysaccharides and extracellular DNA. The variability in operon and amyloid subunit structure suggests that curli systems have evolved to fulfill specific functions for individual species, however  pure culture studies are required to elucidate the exact function of each curli system.

Identification of Homologous Curli Systems
Csg homologs were initially identified by PSI-Blast searches (default settings, blosum45 scoring matrix, E-value,1) against the refseq database using Csg proteins from E. coli str. K-12 substr. MG1655 as query sequences [47]. The hits were manually curated based on overall proteins structure and gene location relative to that of related csg gene homologs. HMMs were made for each of the Csg protein families using hmmbuild of the HMMER 3.0 package and non-redundant versions of the curated Csg protein datasets aligned using ClustalW (gap opening cost (GOC) = 15 and gap extension cost (GEC) = 1). The structural flexibility and low sequence similarity outside the repeat regions of CsgA and CsgB made confident sequence alignment of these proteins impossible. A combined CsgA/B HMM was therefore made in the same way based solely in the repeat regions. The hmmsearch command of Figure 5. Curli Systems within Metagenomes. CsgF homologs were identified within 10 large metagenomes covering a diverse range of habitats, see Table 2, using the curated CsgF HMM. The hits were aligned with the CsgF homologs identified within refseq and phylogenetic trees were estimated using distance matrix. doi:10.1371/journal.pone.0051274.g005 the HMMER 3.0 package was used together with the HMMs to search for additional Csg proteins within the refseq database. The hits were curated and included in the Csg homolog database (Table S1) and the expanded datasets were used to generate improved HMMs. This process was repeated until no further homologs could be identified. Identification of Csg homologs within the metagenome databases was done using a similar approach.

Identity and Curli Repeat Identification
Curli repeats were identified by motif search in CLC DNA workbench 5.7.1 (CLC Bio, Aarhus, Denmark) using a java regular expression of the minimalistic curli repeat (X 6 QXGX 2 NX 10 ) described by Chapman et al. [31]. All repeat regions from bacteria within the same genus were aligned (ClustalW, GOC = 50 and GEC = 1) in order to determine repeat region consensus sequences.

Phylogenetic Analysis
16S rRNA gene sequences were obtained for bacterial strains containing homologous curli systems from the NCBI nonredundant nucleotide database or the Silva 16S rRNA database (http://www.arb-silva.de/). The 16S rRNA gene sequences were aligned using the SINA v. 1.2.9 aligner (http://www.arb-silva.de/ aligner/) and imported to the ARB software [48]. The aligned 16S rRNA genes were used to calculate phylogenetic trees based on the neighbour-joining and maximum-parsimony methods provided in the software using the default setups. The two methods resulted in trees with similar overall topology. Homolog Csg protein sequences were aligned (ClustalW, GOC = 15 and GEC = 1) and imported into ARB. Phylogenetic trees were similarly calculated based on the neighbour-joining and maximum-parsimony. These trees also showed similar overall topology.

Supporting Information
HMMs S1 Curated Hidden Markov Models for Curli Repeats and CsgC-H. (ZIP) Figure S1 Phylogenetic Tree Based on the CsgD Protein Sequences. Trees based on aligned protein data were estimated using distance matrix and maximum likelihood and resulted in congruent tree topologies. The distance matrix tree is shown. (EPS) Figure S2 Phylogenetic Tree Based on the CsgE Protein Sequences. Trees based on aligned protein data were estimated using distance matrix and maximum likelihood and resulted in congruent tree topologies. The distance matrix tree is shown. (EPS) Figure S3 Phylogenetic Tree Based on the CsgG Protein Sequences. Trees based on aligned protein data were estimated using distance matrix and maximum likelihood and resulted in congruent tree topologies. The distance matrix tree is shown. (EPS) Figure S4 Phylogenetic Tree Based on the CsgH Protein Sequences. Trees based on aligned protein data were estimated using distance matrix and maximum likelihood and resulted in congruent tree topologies. The distance matrix tree is shown. (EPS) Figure S5 CsgE Homologs within Metagenomes. CsgE homologs were identified within 10 large metagenomes covering a diverse range of habitats, see Table 2, using the curated CsgE HMM. The hits were aligned with the CsgE homologs identified within refseq and a phylogenetic tree was estimated using distance matrix. (EPS) Figure S6 CsgH Homologs within Metagenomes. CsgH homologs were identified within 10 large metagenomes covering a diverse range of habitats, see Table 2, using the curated CsgH HMM. The hits were aligned with the CsgH homologs identified within refseq and a phylogenetic tree was estimated using distance matrix. (EPS)