Comprehensive analysis of chromosomal mobile genetic elements in the gut microbiome reveals phylum-level niche-adaptive gene pools

Mobile genetic elements (MGEs) drive extensive horizontal transfer in the gut microbiome. This transfer could benefit human health by conferring new metabolic capabilities to commensal microbes, or it could threaten human health by spreading antibiotic resistance genes to pathogens. Despite their biological importance and medical relevance, MGEs from the gut microbiome have not been systematically characterized. Here, we present a comprehensive analysis of chromosomal MGEs in the gut microbiome using a method that enables the identification of the mobilizable unit of MGEs. We curated a database of 5,219 putative MGEs encompassing seven MGE classes called ImmeDB. We observed that many MGEs carry genes that could confer an adaptive advantage to the gut environment including gene families involved in antibiotic resistance, bile salt detoxification, mucus degradation, capsular polysaccharide biosynthesis, polysaccharide utilization, and sporulation. We find that antibiotic resistance genes are more likely to be spread by conjugation via integrative conjugative elements or integrative mobilizable elements than transduction via prophages. Horizontal transfer of MGEs is extensive within phyla but rare across phyla, supporting phylum level niche-adaptive gene pools in the gut microbiome. ImmeDB will be a valuable resource for future studies on the gut microbiome and MGE communities.


Introduction
Horizontal gene transfer (HGT), the transfer of genes between organisms by means other than vertical transmission, allows for the rapid dissemination of genetic innovations between bacteria [1]. Ecology is an important factor shaping HGT, and the human gut in particular is a

Detection of putative insertions in reference genomes
A set of reference genomes of human gut/fecal isolates was curated from the NCBI (07-01-2018) based on metadata available from BioSamples, PATRIC, as well as literature searches.
The quality of the genomes were assessed with CheckM [22]. Genomes with less than 90% completeness or more than 10% contamination were removed. Our target database included 5,103 genomes representing 1,896 taxa (834 species taxa). 274 stool metagenomes from the Human Microbiome Project (HMP) [23] were downloaded from the NCBI. We used Mash [24] to calculate the minhash distance between each genome and each metagenomic sample with the default sketch size of s = 1000 and k = 21. Metagenomic reads from HMP samples were aligned to each genome with more than 2 matching-hashes shared with the reads separately with bwa (version 0.7.12) [25]. To find genomic regions that differ in terms of insertions/deletions between strains in the individual samples and the reference genomes, we used split reads and information from pair-end reads from the alignment (Fig 1A). First, we identified putative deletion junctions using split reads, which we defined as reads that align to two distinct portions of a genome. Split reads were initially identified as those reads having multiple hits in the SAM output from bwa. If a split read alignment starts at one genomic location in the reference and then "jumps" to aligning to a distant site downstream in the same strand, it may indicate a potential deletion in the strain of bacteria from the metagenomic sample compared to the reference genome. For each putative deletion junction, we confirmed the presence of the junction by determining if paired-end reads flanked the junction. We considered a deletion junction to be valid if the read pairs flanking the junction were aligned in the correct orientation, and the distance between the pairs minus the junction size is within the range of +/-2 times the standard deviation of the mean library insert size. Regions with more than ten split reads and more than ten read pairs supporting the deletion were considered as putative insertions in the reference genome. We selected for regions ranging in size between 1kbp and 150kbps to reduce the number of spurious results. The SRID method aimed to identify as many candidate MGEs as possible, especially in low-abundance species in metagenomic sequencing data. Therefore, there are no stringent requirements for filtering based on read or mapping quality at this stage. The putative insertions identified in the reference genomes were verified to be MGEs via additional stringent filtering steps detailed later in the methods. The code used to implement the SRID method, human gut related genome assembly accession numbers and HMP SRA accession numbers used in this study are available from GitHub (https://github. com/XiaofangJ/SRID) and the S2 File.

MGE signature detection
Genes from the genomes with putative insertions identified were predicted with Prodigal (version 2.6.3) [26]. Protein sequences were functionally annotated with interproscan (version 5.30-69.0) using the default settings [27]. Then, we used the interproscan annotations to identify serine and tyrosine integrases as well as group II intron proteins from all genomes. Prophage-related genes were identified by searching for genes with phage-related or phagespecific Pfam signatures. The list of phage signatures used were a combination of those identified in phage_finder [28] as well as a list of manually curated terms (S2 File). Serine integrases were identified as genes annotated with one of the genes involved in mobilization and conjugation of MGEs, we used ConjScan via a Galaxy web server (https://galaxy.pasteur.fr/) [29]. We identified transposases using blastp against the ISFinder database with an e-value 1-e10 [17]. The best hit for each protein was used to annotate the family of transposases.

Identification and classification of MGEs
MGEs are typically identified by two methods: the presence of MGE signatures [16,30], and evidence of phylogenetic incongruence such highly-identical copies of same element present in multiple species [2]. We used both methods in our study. Putative insertions were annotated as an ICE if they contained complete conjugation and relaxase modules and an integrase or transposase at the boundary of the element. Putative MGEs were annotated as prophages if there is an integrase or transposase at the boundary of the element and more than five genes were annotated with prophage-related Pfams. Putative MGEs were annotates as IMEs when no VirB4 is present and when an integrase is found in the vicinity of a relaxase and did not contain genes involved in conjugation. Putative MGEs were annotated as transposons if they contained transposase and were not previously annotated as an IME. We limited the size of IMEs to 30kb and transposons to 10kb to decrease the number of false positives. Putative MGEs were annotated as group II introns if the element was less than 10kb, contained a protein with the TIGR04416 signature, and did not contain a gene annotated as transposase. The remaining putative MGEs were then divided into two groups based on their sizes: unclassed genomic islands (>10kb), and islets (<10kb). To eliminate spurious MGEs, we only report genomic islands that contain an integrase or transposase, or those that are related to prophage/ ICEs, and islets that exist in more than two species with identity higher than 98%. We confirmed that the SRID method accurately identified known MGEs and their boundaries using both synteny and comparative genomics for CTnDOT and Tn916/Tn1549. To ensure the identified boundary of identified MGEs is correct, we manually inspected the elements using additional methods, including comparative genomics and visualization of alignments of the reads supporting the insertion ( Figure B in S1 File). After classification and verification, we identified 5,219 MGEs in 1,199 genomes (S2-S4 Files). The MGEs identified with the SRID method are limited to chromosomal MGEs. Thus, plasmids and extrachromosomal prophages were not characterized in this study.

Clustering each class of MGEs
Pairwise alignment of elements from the same class of MGEs was performed with nucmer (version 3.1) [31]. Elements with more than 50 percent of the sequence aligned to each other are grouped in the same cluster. For ICEs, we additionally require that elements in the same cluster should have the same types of integrase, relaxase and conjugation modules. For IMEs, we required that each cluster has the same types of integrases and relaxases for all elements. For transposons, the same cluster should have the same type and number of IS genes. If a transposon is a "nested" or composite transposon, the family names of all IS contained within were used to annotate the transposon.

Construction of phylogenetic trees
To build phylogenetic trees of ICE and prophage integrases, we selected a representative integrase sequence from each cluster. Representative sequences were chosen as examples with the highest identity to their homologs in the same cluster. We performed alignment of each group of sequences with mafft(v7.123b) [32] (parameter "-maxiterate 1000"). We used trimal (version 1.4.rev15) [33] to remove region with gaps representing more than 20% of the total alignments (parameter "-gt 0.8"). RAxML(version 8.2.10) [34] was used to build phylogenetic trees from the alignments using the LG substitution matrix and a gamma model of rate heterogeneity (parameter "-m PROTGAMMALGF"). Phylogenetic trees were plotted with the R package phytools [35].

Functional enrichment analysis of cargo genes
Cargo genes were identified by excluding genes involved in transposition and transfer from all genes on MGEs. To understand the function of cargo genes, we performed enrichment analysis based on gene ontology (GO), antibiotic resistance (Resfam), and protein families (Pfam). The enrichments were performed with all genes present in the genomes as background reference. We used hmmer(version 3.1b2) [36] to search Resfam [37] core database to annotate antibiotic resistant gene. The "-cut_ga" parameters were used to set the threshold. The best hits to each gene from the Resfam database were used to annotate antibiotic resistant genes. GO terms and Pfam signature of the same genes sets were extracted from interproscan result. R package GOStat [38] was used for GO enrichment analysis for GO and Pfam. The R package clusterProfiler [39] was used for the enrichment analysis of cargo genes based on Resfam and Pfam signatures. P-value of 0.05 were used as cutoff for all enrichment analysis. Samples size, p-value, and exact test statistics of these tests are provided in S4 Table. Results

Prevalence of MGEs in species of the gut microbiome
We systematically identified MGEs from species of the human gut microbiome using mapping information from metagenomic reads from the Human Microbiome Project (HMP) [23]. MGEs are actively inserted and deleted from genomes, causing differences between strains of bacteria. We identified cases where the reference genome of a bacterial strain differed from strains in the individual samples from the HMP. To find the sequences responsible for these differences, we mapped HMP metagenomic reads to available gut-associated bacterial reference genomes and identified genomic regions flanked by split reads and discordantly-aligned paired-end reads ( Fig 1A). These regions potentially are recent insertions of active MGEs. The MGEs identified with the SRID method are limited to chromosomal MGEs. Thus, plasmids and extrachromosomal prophages were not characterized in this study. By searching for MGE-specific gene signatures, we manually verified and classified these MGEs (See Fig 1B and Methods). We identified 5,219 putative MGEs from gut microbiome representatives of 101 strains of Actinobacteria (7 species), 283 strains of Bacteroidetes (95 species), 214 strains of Firmicutes (149 species), 574 strains of Proteobacteria (13 species), and 27 strains of Verrucomicrobia (3 species) (S2 Table). Then, we classified the identified MGEs based on their transfer and transposition mechanisms into seven classes: ICEs, prophages, IMEs, group II introns, transposons, unclassified islets, and unclassified genomic islands ( Fig 1C). Most of the MGEs identified (4113/5219) were from the phyla Bacteroidetes and Firmicutes because these two phyla tend to dominate the gut microbiome of healthy adults [23] (Figure A in S1 File).
Different strains of the same species often share identical or nearly-identical MGEs. To eliminate this redundancy, we collapsed MGEs into clusters based on overall nucleotide identity ( Fig 1C). Phylum-level differences in the diversity of MGEs were revealed. For example, Bacteroidetes had more diversity of ICEs than Firmicutes (53 vs. 31 respectively), while Firmicutes had more diversity of prophages than Bacteroidetes (71 vs. 31 respectively).
Most of our understanding of genomic islands such as ICEs comes from pathogens and a small number of tractable model species. In gut species, the knowledge of ICEs is limited to a few families of ICEs such as CTnDOT and Tn916/Tn1549. In this study, we successfully identified exact copies CTnDOT with the correct boundary in multiple species (ICE cluster 1), and several Tn916/Tn1549 family elements using the SRID method. Due to limited knowledge of genomic islands in gut microbes, most genomic islands described in this study are novel.

Diversity of MGE modules in gut microbiota
Although it has been known that ecology is important in shaping the diversity of MGEs in the gut microbiome, this study is the first to systematically characterize the putative mechanisms of transposition and transfer for MGEs of the gut microbiome [2]. We annotated the genes in MGEs involved in their transposition and transfer, and then classified the elements into groups based on these annotations (S2 Table).
There are four major protein families responsible for transposition of gut MGEs: serine integrases, tyrosine integrases, transposases, and group II intron proteins conferring reverse transcriptase and endonuclease activity. Serine and tyrosine integrases are the most prevalent protein families responsible for transposition in ICEs, IMEs, and prophages. In the gut microbiome MGE clusters we identified, we found 482 MGEs with tyrosine integrases (64 from ICEs, 285 from IMEs and 133 from prophages) and 237 MGEs with serine integrases (21 from ICEs, 184 from IMEs and 32 from prophages). Interestingly, while tyrosine integrases are found in several phyla, serine integrases of ICEs and prophages were exclusively found in the phylum Firmicutes. In IMEs, most serine integrases were identified in Firmicutes (171), but 18 clusters of serine integrases were found in Bacteroidetes, Verrucomicrobia and Actinobacteria (12,5 and 1 respectively). Unlike prophages and ICEs, 12 of 197 clusters of IMEs use serine integrases to transpose in Bacteroidetes. This implies that although integration via serine integrases occurs in Bacteroidetes, it occurs much less frequently than integration via tyrosine integrases. No ICEs and IMEs with transposase were identified in our study. Twelve prophage clusters were found with transposase from IS families: IS30, IS256, and IS110.All transposons we identified utilized DDE/DED/DDD-transposase. We identified 23 families of transposase. Most of the transposase clusters we identified are present in insertion sequences. Thirty-nine clusters (164 copies) of transposons are composite transposons flanked by two different insertion sequences families.
ICEs and IMEs encode relaxases (MOB) to initiate DNA mobilization and transfer. We used the CONJscan-T4SSscan server to classify relaxases identified in MGEs [40]. Six types of relaxase were identified in ICEs and IMEs. In ICEs, MOB T was identified only in Firmicutes, MOB V was identified only in Bacteroidetes, and MOB P1 was identified in Firmicutes, Bacteroidetes, and Actinobacteria. IMEs have a more diverse reservoir of relaxases. Besides the three types of relaxase found in ICEs, we also identified IMEs with MOB P3 , MOB B , and MOB Q type relaxases.
ICEs are capable of conjugation via mating pair formation systems. Six types of mating pair formation systems for conjugation have been described [40]. We found three types of mating pair formation system: typeB, typeFA, and typeFATA, in ICEs from the gut microbiome. Consistent with previous findings, type FA systems were identified in 6 ICE clusters from Firmicutes, type B systems were identified in 53 ICE clusters from Bacteroidetes, and type FATA systems were identified in 25 Firmicutes ICE clusters and one Actinobacteria ICE cluster [41].

MGEs carry niche-adaptive genes
Although fundamentally selfish, MGEs often carry genes other than those necessary for their transposition and transfer, sometimes referred to as cargo genes [42]. We found that smaller elements like transposons generally carry zero or only a few cargo genes. Genetic islands like ICEs and IMEs often carry numerous cargo genes (median cargo genes 40 and 11 respectively). One example is an ICE found in Bacteroides fragilis 3_1_12 (NZ_EQ973213.1:1794893-1944846) which carries 143 cargo genes. We performed functional annotation on the cargo genes, and enrichment analysis using gene ontology (GO), Pfam, and Resfam [37,43,44] (S4 Table). Several classes of enriched genes are well-known to be associated with the maintenance of MGEs such as restriction-modification systems (GO:0009307,PF07669) and toxin-antitoxin pairs (PF02452, PF07927,PF15738) (S4 Table). Many other gene families carried by MGEs may confer an adaptive advantage to colonize the gut.

Antibiotic resistance genes
Many classes of antibiotics consumed orally are incompletely absorbed in the small intestine, and therefore proceed to the large intestine where they can kill the resident microbes [45]. Therefore, genes that confer antibiotic resistance can be adaptive to the gut environment. In total, we identified 246 antibiotic resistance genes encompassing 19 distinct classes carried by MGEs. Classes of MGEs varied in their carriage of antibiotic resistance genes. Of 14,632 prophage cargo genes, only 2 were found to be antibiotic resistance genes. The carriage rate of antibiotic resistance genes normalized by total cargo genes in prophages is more than thirty times lower than that identified in ICEs (140/20269) and IMEs (66/9302) (S1 Table). This suggests that conjugation via ICE/IME may be more important than transduction in the spread of antibiotic resistance genes, consistent with previous findings [46,47]. GO analysis revealed that cargo genes involved in antibiotic resistance are enriched in both Bacteroidetes (GO:0015307; GO:0017001;GO:0008800") and Firmicutes (GO:0015307;GO:0016999;GO:0008811). Resfam enrichment analysis also supported this, as RF0135 (tetracycline resistance ribosomal protection protein), and RF0067 (Emr 23S ribosomal RNA methyltransferase) were enriched. One example of an MGE responsible for the transmission of antibiotic resistance is the ICE CTnDOT, the spread of which dramatically increased the prevalence of tetracycline-resistant Bacteroidetes species [48]. CTnDOT-like ICEs were clustered in ICE1. Elements in this cluster typically confer resistance to tetracycline via the tetQ antibiotic resistance gene (Fig 2A). In addition, ICE1 elements have multiple sites where antibiotic resistance genes can be inserted or substituted. We characterized 5 insertions of antibiotic resistance genes into ICE1 (Fig 2A). Insertion sites 1, 2, and 5 are between operons; therefore they do not interrupt the function of crucial genes. We observed one insertion and two substitutions of antibiotic resistance genes around the tetQ operon, suggesting that this site is likely a "hotspot" for insertions and substitutions of antibiotic resistance genes. Our analysis reveals the surprising extent to which MGEs in species of the gut microbiome contribute to the phenomenon of antibiotic resistance and that the insertion of antibiotic resistance genes into MGEs is an active and ongoing process.

Bile salt hydrolase and bile transporters
Bile acids are found in high concentrations in the human intestines [49] and can be toxic to bacteria [50]. Therefore, gut microbes have developed strategies to deal with bile acids by actively pumping bile acids out of the cell, or via deconjugation, which is hypothesized to diminish the toxicity of bile acids [49,50]. The high identity of archaeal and bacterial bile salt hydrolases strongly suggests the horizontal transfer of this gene [51]. A sodium bile acid symporter family (PF01758), which could help to pump bile acids out of the cell, was found to be enriched in the cargo genes of MGEs. Furthermore, 18 examples of bile salt hydrolases were identified as cargo genes of MGEs (S3 Table). Thus, MGEs carry genes that help microbes to overcome a specific challenge of colonizing the human gut.

Glycoside hydrolases for mucus utilization
The colon is lined with a layer of mucus composed of the glycoprotein MUC2 [52]. The glycans that decorate MUC2 have a core structure composed of galactose, N-acetylglucosamine, Nacetylgalactosamine, with terminal residues of fucose and sialic acid [53]. These specific glycans are a major energy source for members of the gut microbiota [54]. Therefore, it may benefit members of the gut microbiota to degrade these specific glycans [55]. We identified 43 glycoside hydrolases capable of degrading mucosal glycans carried by MGEs from the categories: sialidases (GH33), fucosidases (GH95), α-N-acetylgalactosaminidases (GH109), β-galactosidases (GH35) and β-hexosaminidase (GH20) [54,56] (S3 Table). Thus, MGEs carry genes to unlock a key energy source available to gut microbes.

Polysaccharide utilization loci
Gut Bacteroidetes can utilize a wide variety of polysaccharides via the products of polysaccharide utilization loci, which collectively make up large proportions of Bacteroidetes genomes [57]. Each polysaccharide utilization locus contains a copy of the gene SusC, a sugar transporter, and SusD, a glycan binding protein [58]. Due to the wide range of polysaccharides available to gut microbes, it is hypothesized that the possession of a large repertoire of polysaccharide utilization loci confers an adaptive advantage in Bacteroidetes [57]. We found 47 polysaccharide utilization loci containing both SusC and SusD carried by MGEs suggesting that the ability to degrade complex polysaccharides may be readily transferred between members of the gut microbiota (S3 Table).

Capsular polysaccharide biosynthesis loci
Many bacterial species produce capsules, an extracellular structure made up of polysaccharides [59]. However, gut Bacteroidetes species have a large repertoire of capsular polysaccharide biosynthesis loci (up to 8) compared to other bacterial species and even Bacteroidetes from other sites such as the mouth [60]. Furthermore, capsular polysaccharide biosynthesis loci have been reported to be the most polymorphic region of Bacteroides genomes [61,62]. Multiple capsular polysaccharide biosynthesis loci are necessary to competitively colonize the gut, and are therefore considered to be gut adaptive genes in gut Bacteroidetes [63]. Capsular polysaccharide biosynthesis loci are large and complex; many contain upwards of 20 genes [59]. We found 22 complete or fragmented capsular polysaccharide biosynthesis loci containing at least 8 genes carried by MGEs (S3 Table). For example, almost identical copies of ICE9 containing a capsular polysaccharide biosynthesis locus were found in two species, B. stercoris and B. sp. UW. The same capsular polysaccharide biosynthesis locus was also found in B. vulgatus, but the ICE9 copy was slightly divergent. Two other copies of ICE9 likely containing an orthologous capsular polysaccharide biosynthesis locus were found in B. fragilis, B. cellulosilyticus, B. vulgatus, Parabacteroides merdae and B. sp. 9_1_42FAA (Fig 2B; S3 Table). Additionally, many GO-terms related to capsular polysaccharide biosynthesis are enriched in Bacteroidetes MGEs including GO:0033037, GO:0015774, GO:0008653, GO:0009103, GO:0045226, GO:0046379 and GO:0015159 The transfer of large segments of capsular polysaccharide biosynthesis loci by MGEs may help to explain the incredible diversity of capsular polysaccharide biosynthesis loci observed in the genomes of gut Bacteroidetes [64].

Sporulation
The gut is an anaerobic environment colonized by many classes of strictly anaerobic organisms [65,66]. However, to transmit between hosts, gut microbes must be exposed to oxygen. Recent work has shown that many more gut microbes form spores than previously thought, likely enabling transmission between hosts [67]. In Firmicutes, 30 genes involved in sporulation (GO:0030435) were found to be enriched in MGEs. In addition, PF08769 (Sporulation initiation factor Spo0A C terminal) and PF04026 (SpoVG) were also enriched in our Pfam analysis. One example is GI153, a genetic island from Faecalibacterium prausnitzii A2-165, which contains a series of spore formation-related genes in an operon: SpoVAC, SpoVAD, spoVAEb, gpr (spore protease), and spoIIP. Another example is GI175, a genetic island derived from a degenerate prophage in Roseburia intestinalis L1-82. In one operon of GI175, there are three genes: SpoVAEb, SpoVAD, and one unknown gene with Cro/ C1-type HTH DNA-binding domain. SpoVAC, SpoVAD, spoVAEb homologs were previously found to be carried by a Tn1546-like ICE and conferred heat resistance to spores in the model spore forming organism Bacillus subtilis [68]. Thus, MGEs may help to transfer genes involved in sporulation between gut microbiota which may prove adaptive for colonizing new hosts.

Summary of cargo genes
Many additional gene families were found to be enriched in MGEs that could plausibly be niche adaptive including: histidine sensor kinases, and genes involved in vitamin B biosynthesis (S4 Table). Notably, MGEs from Firmicutes and Bacteroidetes have different types of genes enriched reflecting the differences in physiology between the phyla. Antibiotic resistance genes and genes involved in the detoxification of bile acids are enriched in MGEs from both phyla. Glycoside hydrolases for mucus utilization, and capsular polysaccharide biosynthesis loci are enriched only in MGEs from Bacteroidetes, while genes for sporulation are enriched in MGEs only from Firmicutes. Overall, the transfer of niche adaptive genes by MGEs likely has a large impact on the fitness of species of the gut microbiome.

Host ranges and evolution of MGEs
Although MGEs readily transfer between species, there has not been a systematic analysis of the host range of MGEs in the gut microbiome. The host ranges of different classes of MGEs is variable, and even within a class, different elements have variable host ranges. Understanding the host range of gut MGEs is of particular importance because gut MGEs carry many cargo genes, and the host range of the MGE defines how widely these cargo genes can be distributed. For example, the gut microbiome is a reservoir of antibiotic resistance genes, and many antibiotic resistance genes are located within MGEs [69]. Therefore, it is important to understand the probability of the transfer of MGEs with antibiotic resistance genes from commensals to pathogens [69].
First, we studied the host range of MGEs from the same cluster. MGEs in the same cluster that exist in at least two species generally represent recent horizontal transfer. Some MGE clusters are present in a wide range of species indicative of active horizontal transfer. One example is the ICE1 cluster, a representative of the CTnDOT-like ICEs, which is found in 35 species of Bacteroidetes from the genera: Bacteroides, Parabacteroides, Allistipes, and Paraprevotella ( Figure A in S1 File). The entirety of the 49kb element is found at more than 99 percent nucleotide identical to 10 Bacteroides, Parabacteroides, and Allistipes species, indicative of very recent horizontal transfer. This cluster also includes other CTnDOT-like elements with more variability such as CTnERL, which has an additional insertion of an IME conferring erythromycin resistance [70]. Another example is the Firmicutes ICE cluster ICE10, which is found in 10 species of the families Lachnospiraceae. This ICE10 cluster belongs to Tn916/Tn1549 family of ICEs, some members of which carry the medically-important VanB gene conferring resistance to vancomycin [71]. We found no examples of ICEs from the same cluster present in multiple phyla. Clusters of prophage, IMEs, group II introns, and transposons were also found in many species but were again limited to a single phylum. Our results support that although the recent horizontal transfer of MGEs is common within phyla, cross-phyla horizontal transfer is rare, as we did not observe any cross-phyla horizontal transfer events for elements of the same cluster.
Here we generated phylogenetic trees of tyrosine and serine integrases from ICEs and prophages identified to study the evolutionary history of the recombination module of MGEs. To contrast the phylogeny of the tyrosine and serine integrases with host species lineages we plotted tanglegrams (Fig 3 and Fig 4). The phylogeny of both serine and tyrosine integrases is incongruent with the host species lineages which is indicative of extensive past horizontal transfer of ICEs and prophages between species of the gut microbiome.
The tyrosine integrases can be divided into two clades: the first is associated with the phylum Bacteroidetes, the second clade is associated with the phyla Proteobacteria, Actinobacteria, Firmicutes and Verrucomicrobia (Fig 3). Tyrosine integrases from Bacteroidetes show no evidence of close inter-phyla transfer but ancient transfers of tyrosine integrases between the phyla Proteobacteria, Actinobacteria, and Firmicutes likely occurred several times during evolution. Serine integrases from ICEs and prophages were only found in the phylum Firmicutes. Therefore, we found no evidence of inter-phyla transfer for ICEs and prophages with serine integrases suggesting a phylum-level restriction in host range.
We also examined whether integrases derived from ICEs and prophages segregated into clades based on element type. Previous studies on the phylogenetic relationships of integrases from ICEs and prophages did not find strong evidence of intermingling between ICE and prophage integrases [72,73]. In our phylogeny of the tyrosine integrases, ICE and prophage integrases are extensively intermingled, suggesting that ICEs and prophages have exchanged integrases multiple times over the course of evolution. Moreover, in our phylogeny of serine integrases, ICE22, ICE64 and ICE65 appear in a branch containing mostly prophages, suggesting that the integrase may have originated from a prophage integrase.
In summary, although ancient cross-phyla horizontal transfers did occur during the evolution of MGEs, we did not observe recent cross-phyla horizontal transfer of MGEs. Therefore, the gene pools that are shared within the gut microbiome are likely limited to the phylumlevel.

Modular evolution of gut MGEs
Genes in MGEs are typically organized in functionally related modules which can be readily exchanged between MGEs. Type of modules found in MGEs include conjugation, integration, regulation, and adaptation. Deletion, acquisition, and exchanges of these modules can lead to immobilization, adaptation, and shifts in insertion specificity and host ranges of MGEs [13]. Here, we detail examples of each of these types of events.
Many unclassified genetic islands are likely remnants of ICEs or prophages due to the presence of only a subset of genes necessary for autonomous transfer. In many cases, the integrases have been lost while other genes for conjugation or capsid formation are maintained. One example is GI73, which appears to have formed when a CTnDOT-like element lost its Comprehensive analysis of chromosomal mobile genetic elements in the gut microbiome conjugation and mobilization modules to a large deletion (Fig 5A). We also observed many examples of the acquisition of new modules by insertions. CTnDOT-like elements have obtained adaptive modules via insertions of a group II intron together with the antibiotic resistance gene ErmF, an IME containing multiple antibiotic resistance genes including: ANT6, tetX, and ErmF [70,74], and other unidentified insertions containing many antibiotic resistance genes (Fig 2A; S5 File; S5 Table). Other examples are GI90, where ICE7 (CTnBST) inserted into a CTnDOT-like element (Fig 5A), and GI46, a genomic island formed when two types of ICEs (ICE43 and ICE56) inserted in tandem (Fig 5B). We observed that the exchange of recombination modules is common. Integrases have frequently been exchanged between ICEs and prophages during the evolution of MGEs (Figs 3 and 4). Exchanges also occur in the same class of MGE. For example, we observed that two clusters of ICEs, ICE15 and ICE16, share nearly identical sequences and the same typeFA conjugation module, but have different integrases: ICE15 has a tyrosine integrase while ICE16 has a serine integrase (Fig 5C). Overall, the modular nature of MGEs enables the formation of new mosaic elements, leading to the diversification of MGEs, and increasing the dynamics of the gene pools in the gut microbiome.

Discussion
In this study, we systematically characterized MGEs from the gut microbiome using a novel method to identify the mobilizable unit of active MGEs. We dramatically expanded the number of annotated MGEs from gut microbial species by identifying 5,219 putative MGEs. The MGEs we identified allow for the understanding of several fundamental questions about the role of MGEs and their importance to the evolution of species of the gut microbiome.

Comparison of SRID and other MGE detection methods
Two approaches are commonly used in the identification of MGEs. The first approach was to search for genomic loci enriched with mobile gene signatures and was used in tools such as ICEfinder [75], Conjscan [40], Phaser [16]. However, this approach cannot precisely determine the exact boundaries of MGEs, which is crucial to identify the functional units of MGEs. In addition, these tools are generally designed to identify one class of MGEs, not systematically identify all classes of MGEs. The second approach is done via comparative genomics. It requires two or more completely sequenced genomes of different strains or closely related species. In the metagenome field, metagenomic samples containing a species often far outnumber the number isolate genomes for that species. Therefore, SRID can leverage the thousands of metagenomic samples available and only requires one reference genome.
A limitation of the SRID is that it requires a match between the metagenomic data and the reference genome. The presence of a different strain of the same species as the reference genomes in the metagenomic samples is required. It also requires this strain to be relatively high abundance. This study used the HMP metagenomic sequencing data, where the samples were from healthy donors. In the stool of a healthy individual, Proteobacteria are of low abundance [76]. Therefore, the number of MGEs detected in the Proteobacteria is at least partially limited by the sequencing depth. Another limitation is that SRID alone cannot distinguish insertions due to MGEs or other random events. It often needs to use the MGE signatures to verify and classify the MGEs. Given no available tools for GIs/Islets gene signature identification, we require the presence of identical sequences present in two different species to confidently classify sequences as GI/Islets. There are only a few species of Proteobacteria that reach the necessary threshold of coverage. As a result, it is likely we are unable to identify some GI/ Islets in these Proteobacteria species.

Implications for future gut metagenomic analysis
The database of MGEs we have curated will be a valuable resource for future studies on the gut microbiome, especially with the increasing importance of taxonomic profiling, strain-level variation detection, and pangenome analyses. Many metagenomic workflows for taxonomic profiling use marker genes or k-mers "unique" to a specific species, where uniqueness is constrained by the available reference genomes [77][78][79]. These marker gene should exclude MGEs, as the potential horizontal transfer of these elements invalidates their "unique" speciesspecific associations. Strain-level variation analyses that based on single nucleotide polymorphisms (SNPs) or copy number variation should also exclude SNPs from MGEs [80][81][82][83]. In Comprehensive analysis of chromosomal mobile genetic elements in the gut microbiome pangenome analysis, it is beneficial to distinguish the accessory genes unique to an individual species and the mobilome shared among multiple species. To address the problems posed by MGEs to metagenomic workflows, an approach common in eukaryotic genomics, repeat masking, can be applied [84,85]. The database of curated MGEs identified in this study can be used to mask gut microbiome reference genomes before metagenomic workflows such as species-level classification, strain-level detection, and pangenome analyses are performed.

Host ranges of MGEs and the spread of antibiotic resistance genes
In the United States alone, more than 23,000 people die each year from antibiotic-resistant infections [86]. Tracking antibiotic resistance is one of the key actions to fight the spread of antibiotic resistance. The human digestive tract is a major reservoir of antibiotic resistance genes and likely serves as a hub for the horizontal transfer of antibiotic resistance genes from commensals to pathogens [4,5,69]. MGEs play a significant role in the spread of antibiotic resistance genes, and we found that many MGEs in the gut microbiome contain antibiotic resistance genes. This study helps to define the host range of MGEs in the gut microbiome. Our results suggest that HGT occurs mostly within a phylum, and inter-phyla HGT is rare. These results underscore the risk posed by transfer of antibiotic resistance genes like the vancomycin-resistance conferring gene VanB between commensal Firmicutes and pathogenic Firmicutes, such as Enterococcus faecalis [87]. Due to the fact our analysis only focused on chromosomal MGEs, antibiotic resistance spread by plasmids is not included in this study. Further studies on plasmids will provide more insight in the spread of antibiotic resistance. Overall, our study advances the understanding of the host range of MGEs which is of critical importance to understand gene flow networks in the gut.
This study underestimates the extent of host range because only MGEs in sequenced genomes were detected. As more bacterial genomes are sequenced, the extent of host range of MGEs will be refined. The scope of our research is chromosomal MGEs. Future studies using a combination of molecular and computational approaches will be beneficial to further understand the rate and extent of horizontal gene transfer by MGEs, including plasmids and extrachromosomal phages.

Niche-adaptive genes in the communal gene pool
The mammalian gut is a unique ecological niche vastly different from other environments due to the presence of IgA, antimicrobial peptides, bile acids, as well as specific polysaccharides available for utilization in the intestinal mucus. The microbes that inhabit the gut must develop mechanisms to cope with these challenges. We observed that MGEs transfer genes to help address the unique challenges of colonizing the human gut. MGEs influence the spread of gut adaptive genes in three ways. First, the spread of MGEs drives the expansion and diversification of protein families such as those involved in polysaccharide utilization, capsular polysaccharide biosynthesis, and sensing and responding to the environment [9]. Second, MGEs transfer successful innovations for colonizing the gut among distantly-related species from the same niche, such as bile salt hydrolases. Third, MGEs allow for the amplification and transfer of genes that are adaptive only under specific conditions, such as antibiotic resistance genes, and sporulation-related genes.
Cargo genes transferred by MGEs can have wide-ranging effects on the biology of the gut microbiome. They potentially involved in bacterial symbioses, sensing and responding to environmental stimuli, and metabolic versatility. The enriched classes of cargo genes we identified in this study are attractive targets for future studies to understand the underlying biology of the gut microbiome.

Opportunities to use MGEs to engineer gut microbes
Tools for genome editing only exist for a very limited number of species of the gut microbiome despite the exceptional basic and translational opportunities afforded by engineering gut species. Many of the tools for editing the genomes of species were originally derived from MGEs. For instance, the NBU system used to modify some Bacteroides species was originally derived from an IME [88], and the TargeTron system was originally derived from a group II intron [89]. The novel examples of MGEs identified in this study could be used to edit genomes from the gut microbiome, especially in currently intractable species such as Faecalibacterium prausnitzii. Unlike phages, whose cargo genes are limited by the capsid size, many novel ICEs and IMEs carry hundreds of genes that can confer selective advantages for the host, and are excellent candidate vectors for large genetic loci. Overall, the MGEs identified in this study could have translational applications for genome editing of species from the gut microbiome.
Supporting information S1 File. Supplementary information.