Pathway-based and phylogenetically adjusted quantification of metabolic interaction between microbial species

Microbial community members exhibit various forms of interactions. Taking advantage of the increasing availability of microbiome data, many computational approaches have been developed to infer bacterial interactions from the co-occurrence of microbes across diverse microbial communities. Additionally, the introduction of genome-scale metabolic models have also enabled the inference of metabolic interactions, such as competition and cooperation, between bacterial species. By nature, phylogenetically similar microbial species are likely to share common functional profiles or biological pathways due to their genomics similarity. Without properly factoring out the phylogenetic relationship, any estimation of the competition and cooperation based on functional/pathway profiles may bias downstream applications. To address these challenges, we developed a novel approach for estimating the competition and complementarity indices for a pair of microbial species, adjusted by their phylogenetic distance. An automated pipeline, PhyloMint, was implemented to construct competition and complementarity indices from genome scale metabolic models derived from microbial genomes. Application of our pipeline to 2,815 human-gut bacteria showed high correlation between phylogenetic distance and metabolic competition/cooperation indices among bacteria. Using a discretization approach, we were able to detect pairs of bacterial species with cooperation scores significantly higher than the average pairs of bacterial species with similar phylogenetic distances. A network community analysis of high metabolic cooperation but low competition reveals distinct modules of bacterial interactions. Our results suggest that niche differentiation plays a dominant role in microbial interactions, while habitat filtering also plays a role among certain clades of bacterial species. Author summary Microbial communities, also known as microbiomes, are formed through the interactions of various microbial species. Utilizing genomic sequencing, it is possible to infer the compositional make-up of communities as well as predict their metabolic interactions. However, because some species are more similarly related to each other, while others are more distantly related, one cannot directly compare metabolic relationships without first accounting for their phylogenetic relatedness. Here we developed a computational pipeline which predicts complimentary and competitive metabolic relationships between bacterial species, while normalizing for their phylogenetic relatedness. Our results show that phylogenetic distances are correlated with metabolic interactions, and factoring out such relationships can help better understand microbial interactions which drive community formation.

Recent advances in microbiome research have accelerated the study of the composition 2 and function of microbial communities associated with different environments and hosts. 3 Studies have shown the association of microbial communities with human health and 4 diseases including type 2 diabetes (1), and efficacy of treatment including 5 immunotherapy to cancers (2). To reveal the mechanisms behind the microbiome-host 6 interactions, it is important to understand how microbial species form communities and 7 how the microbial communities interact with the host to mediate various biological 8 processes (3). 9 Studying the principles underlying the structure and composition of microbial 10 communities is of long-standing interest to microbial ecologists. The dynamics which 11 govern microbial community assembly have been extensively debated, and it is disputed 12 upon as to what extent the role of neutral or deterministic dynamics plays in microbial 13 interactions (4; 5). Some studies support the neutral hypothesis, which assumes that 14 community structure is determined by random processes (6). Other theories suggest 15 that community assembly dynamics are govern by deterministic processes such as 16 habitat filtering and niche differentiation (7; 8). While many studies focus on species 17 abundances for studying community assembly, Bruke et al. (9) showed that the key 18 level at which to address the community assembly may not be species, but rather the 19 functional level of genes. Both niche and neutral processes are likely to affect the 20 assembly of complex microbial communities. 21 Some studies have shown that microbial communities tend to be more 22 phylogenetically clustered than expected by chance, harboring groups of closely related 23 taxa that exhibit microscale differences in genomic diversity (10; 11; 12). One study of 24 marine bacterial communities at various locations reported that local communities are 25 phylogenetically different from each other and they tend to be phylogenetically 26 clustered (12). However, some microbial communities have also shown the opposite 27 patterns, in which taxa are less clustered and are less related than expected by chance 28 (13; 14). Together, these studies have explored the relationship between functional 29 distances/metabolic overlap with phylogenetic relatedness, and they have given rise to 30 competing theories of 'habitat-filtering' and 'niche differentiation': habitat filtering 31 suggests that dominant species exhibit similar functional traits, whereas niche 32 differentiation says that phylogenetically similar species are unable to co-exist due to 33 similar traits and resource overlap (3). Nevertheless, methods have been developed for 34 inference of bacterial interaction network based on the assumption that phylogenetically 35 related species tend to co-exists. For example, Lo et al. (15) developed phylogenetic 36 graphical lasso approach for bacterial community detection, based on the assumption 37 that phylogenetically correlated microbial species are more likely to interact to each 38 other. 39 The study of microbial interactions and the dynamics which govern such interactions 40 are important in providing insights to community assembly and ultimately processes 41 which influence host health and disease. Insights into community cooperation and 42 competition may also uncover symbiotic and antagonistic relationships and can be used 43 to provide prospective candidates for probiotics. Leveraging the increasing availability 44 of microbiome datasets, novel statistical and computational methods have been 45 developed to infer bacterial interaction networks from co-occurrence information.  We first evaluated the effect of using MAGs for genome-scale metabolic 110 reconstructions when genomes are incomplete. We tested the robustness of GENRE 111 using simulated incomplete genomes by removing genes from complete genomes and 112 evaluating the resulting GENREs. We simulated incomplete genomes with 50, 100, · · · , 113 500 genes randomly removed from each genome, respectively, and for each setting we We applied our pipeline to analyze 2,815 human gut related MAGs and computed their 126 pairwise competition and complementarity scores (about 8M directed pairs). As shown 127 in Figure 2A, we see a positive relationship between the metabolic complementarity of 128 bacterial species and their phylogenetic distances. In contrast, we see in Figure 2B there 129 is a negative relationship between metabolic competition of bacterial species and To explore community assembly dynamics, we constructed a directed graph of bacterial 167 species where bacteria are the nodes and a directed edge is added between two bacteria 168 if they have a high metabolic complementarity (Z-score > 2.698) and low metabolic 169 competition (Z-score < −1.000); here we relaxed the Z-score of competition indicies to 170 -1.000 in-order to focus our analysis towards species pairs with greater complementarity 171 while still constraining the analysis to a degree of low competition observed between 172 species pairs. Using Infomap (36) to analyze the network, we were able to identify two 173 main community modules (Figure 4). The larger community module (shown on the 174 right in Figure 4) was populated with many multi-layer sub-modules, which featured  interactions (20; 22; 34). Together these observed relationships seem to support the 209 theory of niche differentiation, where functional overlap discourages phylogenetically 210 related species from co-existing. However, by taking into consideration the phylogenetic 211 distance between pairs to identify metabolic outliers, we were able to identify significant 212 intra-genus cooperation in several distinct taxa. The intra-genus modules may suggest 213 that while most bacteria interactions display niche differentiation characteristics, some 214 taxa exhibit habitat filtering. Notably, Bifidobacterium species were shown to form 215 distinct community modules which suggest significant intra-genus cooperation compared 216 to other taxa. These results support recent findings that suggest strains of 217 Bifidobacterium spp. in infants have different nutrient profiles to support colonization of 218 other specific Bifidobacterium species (39). The observation of both habitat filtering 219 and niche differentiation characteristics suggests that in some cases both contribute to 220 the dynamics of community assembly. 221 We note a few limitations of our approach. First, metabolic complementarity and GENREs are dependent on a variety of variables (e.g. the reconstruction tool and the 224 genome completeness) that can have a significant impact on predicted metabolic 225 interactions. Second, seed sets used to calculate the metabolic interaction indices do not 226 represent required metabolites for growth, but rather represent a baseline of metabolites 227 that in theory enable a given bacterium to produce any metabolite in their predicted 228 metabolic network. As such, seed sets may influence the overestimation or 229 underestimation of metabolic interactions between bacterial species. However, by 230 integrating phylogenetic distances to normalize metabolic interaction indices we believe 231 that our approach provides a more accurate prediction of metabolic interactions in 232 comparison to other similar methods. Additionally, low abundant microbial species 233 within microbiomes are not always well represented within metagenomic samples but 234 may play key roles within a metabolic network. While we acknowledge that validation 235 of this method remains difficult, the non-independent nature of comparative metricies 236 between organisms due to shared ancestry provides a logical explanation as to the 237 necessity to account for such confounding effects.

238
By decoupling phylogenetic distances between complementarity and competition 239 indicies, we provide a method to explore statistically significant cooperating/competing 240 species pairs within microbbiomes to better understand community assembly dynamics. 241 Additionally, competition networks can be used to identify highly competitive species 242 pairs, which may be useful for suggesting beneficial probiotic candidates. A future 243 research direction is to integrate phylogenetically-corrected cooperation and competition 244 scores with co-occurrence information to better address the challenges of identifying 245 bacterial interactions through mechanistic insight.

247
Genome sequences of human-gut bacteria 248 To assemble the human-gut associated reference genomes, we collected genomes from 249 two recent studies (40; 41). Bacterial genomes reported in (41) were compiled from two 250 sources: a total of 617 genomes obtained from the human microbiome project (HMP) 251 (42), and 737 whole genome-sequenced bacterial isolates, representing the Human 252 Gastrointestinal Bacteria Culture Collection (HBC). These 737 bacterial genomes were 253 assembled by culturing and purifying bacterial isolates of 20 fecal samples originating 254 from different individuals (41). The bacterial genomes reported in (40) were generated 255 and classified from a total of 92,143 metagenome assembled genomes (MAGs), among 256 which a total of 1,952 binned genomes were characterized as non-overlapping with 257 bacterial genomes reported. We were able to retrieve 612 out of 617 RefSeq sequences 258 using the reported RefSeq IDs. We only included genomes with > 80% completeness 259 and < 5% contamination (via CheckM (43)). Our final dataset for this study contains a 260 total of 2,815 genomes/MAGs. Taxonomic annotation of these genomes/MAGs was 261 done using GTDB-toolkit's least common ancestors approach (44). To compute pairwise evolutionary distances between gut bacteria, we first inferred a 273 phylogeny covering all participating genomes using FastTree (46). A total of 120 274 bacterial marker genes were used to infer these phylogeny. The 120 marker genes used 275 are ubiquitous among bacterial species and are shown to occur as single copies and less 276 susceptible to horizontal gene transfer (47). Amino acid sequence of protein coding genes 277 were searched using HMMER3 (48) against a 120 HMM model database of marker genes 278 received from Pfam (49) and TIGRfam databases (50). Sequences extracted from each 279 HMM model were individually aligned using hmmalign, which were later concatenated 280 to form the final alignment. Poorly aligned regions were removed from the concatenated 281 alignment and a final phylogeny was inferred using FastTree under WAG + GAMMA 282 models.

Species interaction indexes 284
To estimate potential metabolic cooperation and competition between bacterial species, 285 we need to know their nutritional profiles, which however are unavailable for most of the 286 gut bacteria. Similar to the approach reported in (30; 51), we use the compound seed 287 set of each species as a proxy for its nutritional profile: the seed set of a metabolic 288 network is defined as the minimal subset of the compounds that cannot be synthesized 289 from other compounds in the network (due to lack of the corresponding enzymes, and 290 hence are exogenously acquired) but their existence permits the production of all other 291 compounds in the network. 292 We implemented a pipeline for computing metabolic interaction indices from genome 293 sequences. Our pipeline uses 1) CarveMe for building genome-scale metabolic models 294 from genome sequences, b) NetworkX (52) to identity seed compounds, and c) our own 295 implementation (in Python) of the approaches for computing metabolic competition and 296 their SCC size, where the confidence level (C) is denoted as: The confidence level is representative of the confidence that a given compound belongs 303 to the seed set. A threshold of C ≥ 0.2 was used to select compounds to be regarded as 304 compounds part of a given 'seed set' of a given organism as specified by (51).

312
Complementarity Index (M I Complementarity ) is calculated as the fraction of A's seed set 313 that is found within B's metabolic network but not part of B's seed set, normalized by 314 the number of A's seed set in B's entire metabolic network (30; 37). M I Complementarity 315 represents the potential for A's to utilize the potential metabolic output of B.
We note that the competition and complementarity indices are asymmetric.

318
Phylogenetic normalization and outlier detection 319 Pairwise metabolic complementarity and competition indices between species pairs are 320 plotted against their predicted phylogenetic distance. While methods of outlier 321 detection for continuous data exists, local peaks and troughs of indices relative to 322 phylogenetic distance make it difficult to identify local outliers. Thus, we utilize a 323 binning approach to limit outlier detection to localized values. Both metabolic 324 complementarity and competition indexes use a two-step binning process to bin pairwise 325 observations, first by using a fixed phylogenetic distance interval of 0.01, followed by 326 merging bins which are smaller than a prespecified size. Here we used the first bin size 327 as the reference. Bins were merged with the closest preceding bin satisfying our 328 minimum bin size threshold. To identify metabolic complementarity and cooperation 329 outliers within each phylogenetic distance bin, we calculate the Z-score within each bin 330 respectively. Tukey's method for outlier detection (equivalent to a Z-score threshold 331 ±2.698) (35) was utilized to identify significant outliers.

332
Network construction and community detection 333 To build a metabolic complementarity/competition network, species pairs are 334 represented as nodes within the network. Identified significant outliers were used to 335 construct a network of gut bacteria, in which for any pair of species A and B, a directed 336 edge is added between A and B (from A to B), if A and B have significantly high modules within our dataset. Infomap is a random walk based approach for community 341 detection, and it provides a user friendly interface for visualization and exploration of 342 the network and community structure (https://www.mapequation.org/navigator).

343
Data and software availability 344