Discrete mutations in colorectal cancer correlate with defined microbial communities in the tumor microenvironment

Variation in the gut microbiome has been linked to colorectal cancer (CRC), as well as to host genetics. However, we do not know whether genetic mutations in CRC tumors interact with the structure and composition of the microbial communities surrounding the tumors, and if so, whether changes in the microbiome can be used as a predictor for tumor mutational status. Here, we characterized the association between CRC tumor mutational landscape and its proximal microbial communities by performing whole-exome sequencing and microbiome profiling in tumors and normal colorectal tissue samples from the same patient. We find a significant association between loss-of-function mutations in relevant tumor genes and pathways and shifts in the abundances of specific sets of bacterial taxa. In addition, by constructing a risk index classifier from these sets of microbes, we accurately predict the existence of loss-of-function mutations in cancer-related genes and pathways, including MAPK and Wnt signaling, solely based on the composition of the microbiota. These results can serve as a starting point for understanding the interactions between host genetic alterations and proximal microbial communities in CRC, as well as for the development of individualized microbiota-targeted therapies.


Introduction
The human gut is host to approximately a thousand different microbial species consisting of both commensal and potentially pathogenic members 1 . In the context of colorectal cancer (CRC), it is clear that bacteria in the microbiome play a role in human cell signaling 2-11 ; for example, in the case of CRC tumors that are host to the bacterium Fusobacterium nucleatum , the microbial genome encodes a virulence factor, FadA , that can activate the βcatenin pathway 12 . In addition, several attempts have been made to predict CRC status using the microbiome as a biomarker [13][14][15][16] . It has been shown that focusing on a single bacterial species, F. nucleatum , it is possible to predict some clinically relevant features of the tumor present 17 . However, as only a minority of CRCs are host to F. nucleatum , this is a somewhat limited application 18 . Other specific microbes have been linked to CRC, including Escherichia coli harboring polyketide synthase (pks) islands, as reported by one group 19,20 and enterotoxigenic Bacteroides Fragilis (ETBF) by another [21][22][23] . The mechanism of action of these associations is still under investigation with F. nucleatum being the most clearly developed 12 .
In healthy individuals, host genetic variation can affect the composition of the microbiome [24][25][26][27][28][29] , and the associated human genetic variants are enriched in cancerrelated genes and pathways 25 . However, it is still unknown whether somatic mutations in host cells can affect the composition of the microbiome that directly interacts with host tissues. Here, we aim to find (i) whether variation in somatic mutational profiles in CRC tumors is associated with variation in the microbiome; (ii) which host genes and bacterial taxa drive the association; (iii) how these patterns can shed light on the molecular mechanisms controlling hostmicrobiome interaction in the tumor microenvironment; and (iv) whether this correlation can be used to construct a microbiomebased predictor of genes and pathways mutated in the tumor.

Changes in the microbiome reflect tumor stage.
We performed wholeexome sequencing on a set of 88 samples, comprised of 44 pairs of tumor (adenocarcinomas) and normal colon tissue sample from the same patient, with previously characterized tissueassociated microbiomes 2 . The mutations in each of the tumors' proteincoding regions were identified relative to the patientmatched normal sample and annotated as either synonymous, nonsynonymous, or lossoffunction (LoF) mutations (Supplementary Figs. 12,and Supplementary Tables 12). The mutations were collapsed by gene as well as by pathways using both Kyoto Encyclopedia of Genes and Genomes (KEGG) and pathway interaction database (PID) annotations [30][31][32][33] .
We first investigated the relationship between microbial communities and tumor stage (Fig. 1). We hypothesize that the structure and composition of the associated microbiome can be affected by relevant physiological and anatomical differences between the tumors at different stages that would provide different microenvironmental niches for microbes. We identified the changes in the microbial communities surrounding each tumor as a function of stage by grouping the tumors into low stage (stages 12) and high stage (stages 34) classes and applied linear discriminant analysis (LDA) effect size (LEfSe) to the raw operational taxonomic unit (OTU) tables corresponding to these tumors (Supplementary Tables 34) 34 . The set of taxon abundances was transformed to generate a single value representing a risk index classifier for membership in the lowstage or highstage group (Fig. 1a; see Methods). To ascertain the error associated with these risk indices, a leaveoneout (LOO) crossvalidation approach was applied. We also used the LOO results to generate receiver operating characteristic (ROC) curves and to calculate the area under the curve (AUC; see Fig. 1b). In addition, we performed a permutation test to assess the method's robustness (Supplementary Table 4). Using this approach, we demonstrate that the changes in abundances of 31 microbial taxa can be used to generate a classifier that distinguishes between lowstage and highstage tumors at a fixed specificity of 80% and an accuracy of 77.5% (P = 0.02 by MannWhitney U test, and P = 0.007 by a permutation test; Supplementary Table  4).The resulting changes seen in our analysis of the microbial communities that vary by tumor stage were similar to those found in previous studies, including one using a Chinese patient cohort 4,35 . In both cases, there were significant changes among several taxa within the phylum Bacteroidetes , including Porphyromonadaceae , Paludibacter , and Cyclobacteriaceae ( Fig. 1 and Supplementary Table 4).
Tumor mutations correlate with consistent changes in the proximal microbiome.
Next, we attempted to use a similar approach to classify tumors based on mutational profiles. We initially focused on individual genes that harbor lossoffunction (LoF) mutations, as those, we predicted, would be the most likely to have a physiologically relevant interaction with the surrounding microbiome. A prevalence filter was applied to include only those mutations that were present in at least 10 or more patients at the gene level. The raw OTU table was collapsed to the level of genus for the analysis. A visualization of the correlations between genelevel mutational status and the associated microbial abundances revealed differing patterns of abundances that suggests an interaction between the 11 most prevalent LoF tumor mutations and the microbiome ( Supplementary Fig. 3). We hypothesized that the presence of mutationspecific patterns of microbial abundances could be statistically described by prediction of tumor LoF mutations in individual genes using the microbiome. For each of eleven genes that passed prevalence filtering cutoff, we identified the associated microbial taxa ( Fig. 2a and Supplementary Tables 56), generated risk indices for each patient (Fig. 2bc), and plotted the mean differences in abundances for a subset of microbial taxa interacting with each mutation (Fig. 2d). We found that we are able to use microbiome composition profiles to predict the existence of tumor LoF mutations in the human genes APC , ANKRD36C , CTBP2 , KMT2C , and ZNF717 (Qvalue = 0.0011, 0.0011, 0.019, 0.019, and 0.055, respectively, by permutation test after False Discovery Rate (FDR) correction for multiple tests with a Q value threshold of 0.10; Fig. 2). The risk indices for each mutation were generated using sets of microbial taxa that ranges from 22 ( ZNF717 ) to 53 ( ANKRD36C ) taxa (Supplementary Table 5). The taxa that showed the most dramatic differences in abundance when comparing tumors with and without mutations are shown in Fig. 2d. For example, the abundance of Christensenellaceae is relatively lower in tumors with APC mutations, but relatively higher in tumors with ZNF717 mutations.
Next, we applied our interaction prediction approach, as described above, to the pathwaylevel mutational data (see Methods). Following visualization of the pathway level abundances (Supplementary Figs. 45) and applying the model, we found that each of the 21 KEGG pathways passing prevalence filter were able to be significantly predicted with a fixed specificity of 80% and an accuracy up to 86% (Qvalues < 0.02 by permutation test after FDR correction; Fig. 3ad Table 7). The taxon abundances that were specifically associated (direct or inverse correlations) with each of the LoF mutations in the genes and pathways can be found in Supplementary Tables 811 and Supplementary Fig. 10. In general, the number of taxa within each of the sets used to generate the risk indices was lower than that used for the genelevel analyses (average of 37 taxa per geneassociated set compared to 7 taxa per set associated with mutations in KEGG or PID pathways). When comparing results using the genelevel interactions and the pathway level interactions, for instance looking at mutations in APC (Fig. 2) and comparing them to mutations in the KEGGdefined Wnt signaling pathway and the PIDdefined Canonical Wnt signaling pathway (Fig. 3), the interactions at the pathway level are more statistically significant (AUC for APC = 0.81, KEGG = 0.88, PID = 0.90). This trend is consistent and can be visualized as a density histogram of interaction prediction accuracies ( Supplementary Fig. 11).

Predicted microbiome interaction network affected by tumor mutational profile
Lastly, we assessed the correlations between taxa among tumors with and without LoF mutations ( Fig. 4; see methods). We found striking differences in structure of the network comparing tumors with and without a Lof mutation in APC the correlations between taxa (Fig.  4a). For example, in tumors with mutations in APC , the abundance of Christensenellaceae is positively correlated with Rhodocyclaceae and negatively correlated with Pedobacter . In tumors lacking LoF mutations in APC , these correlations are lost and Christensenellaceae is instead negatively correlated with Saprospiraceae and Gemm 1 . We also assessed the network of correlations across tumors with mutations in PID pathways (Fig. 4B). This analysis highlighted that some pathwaylevel mutations show a shared set of correlations between taxa, while others appear independent. Several of the taxa that can be used to predict LoF mutations in p75(NTR) signaling share correlations among each other as well as with taxa associated with mutations in PDGFRbeta signaling and direct p53 effectors.

Discussion
The link between colorectal cancer and the gut microbiome has been highlighted by a large number of recent studies [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18] , with several hypotheses as to the causal role of microbes in the disease 9,12,36,37 . Since cancer is a genetic disease caused by mutations in host DNA, it is of interest to study the microbiome in the context of tumor mutational profiles, especially given recent studies showing an impact of host genetics on the microbiome [24][25][26][27][28][29] . Here, we jointly analyzed tumor coding mutational profile and the taxonomic composition of the proximal microbiome. We found that the composition of the proximal microbiome is correlated with mutations in tumor DNA, and that this correlation can be used to predict mutated genes and pathways solely based on the microbiome.
We performed quality control of the data and stringent filtering at every step ( e.g. , requiring 30x coverage at a site in both the tumor and matched normal sample to call a mutation; see methods). While these requirements are likely to increase the frequency of false negatives (true mutations that simply do not meet our criteria), this rigorous strategy is appropriate as a means of increasing the biological relevance of our findings. Of note, when comparing the common LoF mutations found in our dataset to those found in colorectal tumors sampled as part of The Cancer Genome Atlas (TCGA) project, we find several commonalities, including a high frequency of LoF mutations in APC, as well as numerous missense mutations in KRAS , NRAS, and TP53 , as expected (Supplementary Table 1) 38 . In general, the numbers of mutations across our sample set were also in line with those identified at part of the TCGA (Supplementary Table  2

) 38
The association of microbial taxa with tumor stage (Fig. 1) mirrors recent results, including a study of a Chinese population 4,35 . This concordance is relevant as it indicates that the microbial communities appear to be consistent even when comparing geographically distinct patient cohorts 39,40 . One of the predictive taxa, Porphyromonadaceae , has been shown to be altered in mouse models of CRC in other studies as well 7,14 . A study on the link between dysbiosis and colitisinduced colorectal cancer also showed similar results 41 . For instance, the bacterial genus Paludibacter was found to be associated with risk of developing tumors in a mouse model 41 . We find that Paludibacter is significantly associated with lowstage tumors, again, supporting the hypothesis that these bacteria as associated with cancer risk and may be contributing to early stage inflammation 41 . Conversely, we found that the genus Coprococcus is associated with highstage tumors and not low stage tumors. Members of this genus are known to generate butyrate and propionate, which in this context can act as antiinflammatory short chain fatty acids 42 . Although our results are correlational and cannot point to causal effects, these findings suggest that driving inflammation may play a role in early stage cancer, while generating nutrients at the cost of suppressing inflammation may be more beneficial to the tumor in later stages.
Genelevel mutation data, visualized in Supplementary Fig. 3, show intriguing patterns of microbial abundances that are associated with the tumors harboring different mutations. For instance, as reflected in the differing patterns within each gene (rows) in the heatmap, Aerococcus and Dorea are both show higher scaled abundances within tumors harboring mutations in ZNF717 , CTBP2 , and APC , relative to tumors with LoF mutations in ANKRD36C and KMT2C . This highlights the different patterns in the microbiome that can be found when assessing genetically heterogeneous sets of tumors; as Dorea has been found to be increased in tumor microbiomes by several different groups, whereas our work highlights some potential genetic interactions that explain cases wherein Dorea is not increased at the tumor site 3,[5][6][7][8] . Thus, incorporating genetic profiles in studies of the microbiome in CRC may be beneficial and uncover patterns that are dependant on specific tumor mutations.
Although it may be difficult to ascertain the biological mechanism behind the predicted interactions among mutated genes and microbial taxa (shown in Fig. 2), it is possible to generate hypotheses based on what is already known in the relevant literature. For example, ANKRD36C encodes a protein that may have a role in ion transport in epithelial cells 43 . Additionally, we find that LoF mutations in APC correlate with changes in 25 different microbial taxa, including an increase in the abundance of the genus Finegoldia . This genus has been identified in previous studies of colon adenomas and also harbors species that act as opportunistic pathogens at sites where the epithelium has been damaged 6,44,45 . In addition, Capnocytophaga has been identified as a potential biomarker for lung cancer 46 . Interestingly, changes in the abundance of Christensenellaceae are predictive of mutations in both APC and ZNF717 . A recent study in twins has identified Christensenellaceae as a taxon that is highly driven by host genetics 26 . We find that mutations in ZNF717 , a transcription factor commonly altered in gastric, hepatocellular, and cervical cancers 47-49 , are associated with Verrucomicrobiaceae and Akkermansia , which are both known to increase in conjunction with colitis 50 . Alphaproteobacteria are significant contributors to our ability to predict mutations in CTBP2 , a repressor of transcription known to interact with the ARF tumor suppressor 51 . Changes in this bacterial taxon's abundance has also been found to be associated with prostate cancer, however a mechanism of action was not explored 52 . We also show that mutations in KMT2C , a gene commonly comutated along with KRAS, could be predicted, in part, using the abundance of Ruminococcus 53 . These bacteria have been previously implicated in inflammatory bowel disorders and colorectal cancer by multiple groups 8,54-56 .
Similar results were also evident when aggregating the mutations into KEGG and PID pathways (Fig. 3, Supplementary Figs. 45; see Methods) [30][31][32][33] . As an example, we find that the abundance of microbes that predict KEGG pathways form two distinct clusters, and that the genus Escherichia has a higher scaled abundance in tumors with mutations in the KEGG pathways in cluster 1 relative to those in cluster 2 ( Supplementary Fig. 4). Cluster 1 contains adherens junctions, which are partially responsible for maintaining the intestinal barrier and interestingly, a disruption of the intestinal barrier in mice using cyclophosphamide was shown to cause a loss of adherens junction function and a concomitant increase in bacterial translocation into the intestinal tissue, including species of Escherichia 57 . When examining the heatmap with LoF mutation collapsed into PID pathways ( Supplementary Fig. 5), we again find differences in scaled microbial abundances between the tumors as a function of which pathways are mutated. For instance, we find lower abundance of Pseudomonas in tumors with LoF mutations in the pathways 'regulation of nuclear βcatenin signaling and target gene transcription', 'degradation of βcatenin', 'presenilin action in Notch and Wnt signaling', and 'canonical Wnt signaling pathways'. Recent studies have shown that Pseudomonas strains that express the LecB gene can lead to degradation of βcatenin, providing hypothetical support for the concept that this genus may play a somewhat protective role in CRC by suppressing the Wnt signaling pathway 58 . The mechanism that might explain this phenomenon is still unclear but may have to do with alterations in appropriate cell surface adhesion molecules for the LecB protein or a change in the content of the cellular microenvironment 58,59 .
Many of the interactions identified here between bacterial taxa and mutations in PID pathways have been demonstrated experimentally in the literature. For example, in human oral cancer cells, it was shown that bacteria of interest were able to activate EGFR through the generation of hydrogen peroxide 60 . In addition, the correlation between ErbB1 downstream signaling and increase in the abundance of Corynebacterium has been demonstrated mechanistically in a model of atopic dermatitis, whereby EGFR inhibition results in dysbiosis (the appearance of Corynebacterium species) and inflammation 61 . Specific depletion of Corynebacterium ablates the inflammatory response 61 . Moreover, our finding that the abundance of Fusobacterium is depleted in tumors with LoF mutations in the PDGFRbeta pathway, which may be explained by the dependence of several pathogenic strains of bacteria for functionally intact PDGFR signaling for adherence to intestinal epithelium 62 . In addition, p75(NTR) signaling has been shown to operate as a tumor suppressor by mediating apoptosis in response to hypoxic conditions and reactive oxygen species [63][64][65][66] . Alterations in this pathway have also been shown to be useful as a biomarker for esophageal cancer 67,68 .
Our study has several caveats. First, our study only shows correlations, and we cannot directly assess causal effects. Thus, we do not know whether the microbiome is altered before or after the appearance of specific mutations. Nevertheless, many of the predicted interactions described above have been previously tested, albeit across a wide variety of experimental systems and disease states, typically in isolation, for biological relevance and mechanism of action. We expect that future studies will more comprehensively test the causality of interactions by utilizing model organisms and cell culture techniques, where the effect of individual mutations can be assessed. Additionally, we have only profiled the taxonomic composition of the microbiome, and thus cannot detect interactions that are dependent on microbial genes or functions. Similarly, using wholeexome sequencing does not allow us to include noncoding mutations and larger tumor structural variants and chromosomal abnormalities. This can be alleviated by the use of metagenomic shotgun sequencing to profile the microbiome, as well as wholegenome sequencing to assess tumor mutations. Moreover, the study sample was relatively small (n = 88 samples from 44 patients). Nevertheless, the sample size was sufficient to detect significant patterns. Additional studies that use large tumor samples would be useful in validating our results and identifying further associations.
In summary, we present a strong association between tumor genetic profiles and the proximal microbiome, and identify tumor genes and pathways that correlate with specific microbial taxa. We also show that the microbiome can be used as a predictor of tumor mutated genes and pathways, and suggest potential mechanisms driving the interaction between the tumor and its microbiota. Our proofofprinciple analysis can provide a starting point for the development of diagnostics that utilize microbiome profiles to ascertain CRC tumor mutational profiles, facilitating personalized treatments.

Patient inclusion and DNA extraction
88 tissue samples from 44 individuals were used, with one tumor and one normal sample from each individual. These deidentified samples were obtained from the University of Minnesota Biological Materials Procurement Network (Bionet), a facility that archives research samples from patients who have provided written, informed consent. These samples were previously utilized and are described in detail in a previous study 69 . To reiterate these points, all research conformed to the Helsinki Declaration and was approved by the University of Minnesota Institutional Review Board, protocol 1310E44403. Tissue pairs were resected concurrently, rinsed with sterile water, flash frozen in liquid nitrogen, and characterized by staff pathologists. The criteria for selection were limited to the availability of patientmatched normal and tumor tissue specimens. Additional patient metadata are provided in the indicated work 69 .

Microbiome characterization
The microbiome data used in the study was generated previously and is described exhaustively in 69 . Briefly, microbial DNA was extracted from patientmatched normal and tumor tissue samples using sonication for lysis and the AllPrep nucleic acid extraction kit (Qiagen, Valencia, CA). The V5V6 regions of the 16S rRNA gene were PCR amplified with the addition of barcodes for multiplexing using the forward and reverse primer sets V5F and V6R from Cai, et al. 70 . The barcoded amplicons were pooled and Illumina adapters were ligated to the reads. A single lane on an Illumina MiSeq instrument was used (250 cycles, pairedend) to generate 16S rRNA gene sequences. The sequencing resulted in approximately 10.7 million total reads passing quality filtering in total, with a mean value of 121,470 quality reads per sample. The forward and reverse read pairs were merged using the USEARCH v7 program 'fastq_mergepairs', allowing stagger, with no mismatches allowed 71 . OTUs were picked using the closedreference picking script in QIIME v1.7.0 using the Greengenes database (August 2013 release) 72-74 . The similarity threshold was set at 97%, reverseread matching was enabled, and referencebased chimera calling was disabled.

Exome sequence data generation
Genomic DNA samples were quantified using a fluorometric assay, the QuantiT PicoGreen dsDNA Assay Kit (Life Technologies, Grand Island, NY) . Samples were considered passing quality control (QC) if they contained greater than 300 nanograms (ng) of DNA and display an A260:280 ratio above 1.7. Full workflow details for library preparation are outlined in the Nextera Rapid Capture Enrichment Protocol Guide (Illumina, Inc., San Diego, CA) . In brief, libraries for Illumina nextgeneration sequencing were generated using Nextera library creation reagents (Illumina, Inc., San Diego, CA) . A total of 50 ng of genomic DNA per sample were used as input for the library preparation. The DNA was tagmented (simultaneously tagged and fragmented) using Nextera transposome based fragmentation and transposition as part of the Nextera Rapid Capture Enrichment kit (Illumina, Inc., San Diego, CA). This process added Nextera adapters with complementarity to PCR primers containing sequences that allow addition of Illumina flow cell adapters and dualindexed barcodes. The tagmented DNA was amplified using dual indexed barcoded primers. The amplified and indexed samples were pooled (8 samples per pool) and quantified to ensure appropriate DNA concentrations and fragment sizes using the fluorometric PicoGreen assay and the Bioanalyzer HighSensitivity DNA Chip (Agilent Technologies, Santa Clara, CA). Libraries were considered to pass QC as long as they contained more than 500 ng of DNA and had an average peak size between 200 1000 base pairs. For hybridization and sequence capture, 500 nanograms of amplified library was hybridized to biotinylated oligonucleotide probes complementary to regions of interest at 58°C for 24 hours. Libraryprobe hybrids were captured using streptavidincoated magnetic beads and subjected to multiple washing steps to remove nonspecifically bound material. The washed and eluted library was subjected to a second hybridization and capture to further enrich target sequences. The captured material was then amplified using 12 cycles of PCR. The captured, amplified libraries underwent QC using a Bioanalyzer, and fluorometric PicoGreen assay. Libraries were considered to pass QC as long as they contained a DNA concentration greater than 10 nM and had an average size between 300 400 base pairs. Libraries were hybridized to a paired end flow cell at a concentration of 10 pM and individual fragments were clonally amplified by bridge amplification on the Illumina cBot (Illumina, Inc., San Diego, CA) . Eleven lanes on an Illumina HiSeq 2000 (Illumina, Inc., San Diego, CA) were required to generate the desired sequences. Once clustering was complete, the flow cell was loaded on the HiSeq 2000 and sequenced using Illumina's SBS chemistry at 100 bp per read. Upon completion of read 1, base pair index reads were performed to uniquely identify clustered libraries. Finally, the library fragments were resynthesized in the reverse direction and sequenced from the opposite end of the read 1 fragment, thus producing the paired end read 2. Full workflow details are outlined in Illumina's cBot User Guide and HiSeq 2000 User Guides. Base call (.bcl) files for each cycle of sequencing were generated by Illumina Real Time Analysis (RTA) software. The base call files and run folders were then exported to servers maintained at the Minnesota Supercomputing Institute. Primary analysis and demultiplexing was performed using Illumina's CASAVA software 1.8.2. The end result of the CASAVA workflow was demultiplexed FASTQ files that were utilized in subsequent analysis for read QC, mapping, and mutation calling.

Exome data analysis
The exome sequence data contained approximately 4.2 billion reads in total following adapter removal and quality filtering, inclusive of forward and reverse reads, with a mean value of 47.8 million highquality reads per sample. The raw reads were assessed using FastQC v0.11.2 and the Nextera adapters removed using cutadapt v1.8.1 75,76 . Simultaneously, cutadapt was used to trim reads at bases with quality scores less than 20. Reads shorter than 40 bases were excluded. The trimmed and filtered read pairs were aligned and mapped to the human reference genome (hg19) using bwa v0.7.10 resulting in a bam file for each patient sample 77 . These files were further processed to sort the reads, add read groups, correct the matepair information, and mark and remove PCR duplicates using picard tools v1.133 and samtools v0.1.18 78,79 . Tumorspecific mutations were identified using FreeBayes v0.9.1424gc292036 80 . Following these steps, 94.0% of the remaining read pairs mapped to the reference genome, hg19. Specifically, SNPs only were assessed and a minimum coverage at each identified mutation position of more than 30X was required in both the patient normal and tumor samples. These mutations were filtered to only include those that were within proteincoding regions and compiled into a single vcf file. This vcf file was assessed using SNPeffect v4.1 K (2015090) in order to predict the potential impact of each of the mutations 81 . Based on these results, the mutations were grouped into three categories: (1) total mutations (2) nonsynonymous mutations and (3) loss of function (LoF) mutations. The total mutations group is selfexplanatory. The nonsynonymous mutations included all the mutations in the total mutations group that were nonsilent. The LoF group only included those mutations that resulted in a premature stop codon, a loss of a stop codon, or a frameshift mutation.

Joint analysis of microbiome and exome data
Taxa that differentiated patients with or without LoF mutation were identified using LEfSe 34 . All the taxa with a LDA score (log 10) > 2 were included in the calculation of the risk indices, built to predict the presence or absence of a LoF mutation based on the OTU table collapsed at genus level. To build the risk index, the relative abundances (arcsine square root transformed) of the taxa associated with the LoF mutation (based on the LEfSe output) were summed and the relative abundances of the taxa associated with no mutation (based on the LEfSe output) were summed. The use of the unweighted sum in the risk index, rather than relying on the regression coefficients from LDA, is a simple way to control the degree of flexibility of the model when training on small sample sizes. More detail is described in a previous publication 82 . Then the difference between these two sums was calculated, thereby obtaining a risk index. This procedure was repeated 44 times to obtain a risk index for each patient.
A leaveoneout procedure (also described above) was conducted to evaluate the taxa that differentiated patients with or without LoF mutation in the heldout patient, based on the LEfSe output of n1 patients. In detail, the taxa that differentiated patients with or without LoF mutation were identified using LEfSe in the n1 dataset. The relative abundances of the taxa associated with the LoF mutation (based on the LEfSe output of the n1 dataset) were summed and the relative abundances of the taxa associated with no mutation (based on the LEfSe output of the n1 dataset) were summed and were used to build the risk index in the heldout patient. In detail, the difference between these two sums was calculated to obtain the risk index of the heldout patient. This procedure was repeated 44 times, to produce a risk index in each of the heldout patients, based on the difference between the sum of the taxa associated with the absence of LoF mutation minus the sum of the taxa associated with the presence of the LoF mutation found in each of the n1 datasets. The significance of the difference in risk indexes between the patients with LoF mutation and patients with LoF mutation for each gene was assessed using a MannWhitney U test and a permutation test, in which we permuted the labels for a given gene 999 times, each time deriving new heldout predictions of the risk indexes for each subject for that gene. Then the observed difference in means between the patients with LoF mutation and patients with LoF mutation risk index predictions using the method on the actual LoF mutation labels to the differences observed in the permutations to obtain an empirical Pvalue was compared. The resulting Pvalues were corrected using the false discovery rate (FDR) correction for multiple hypothesis tests.
Receiving Operating Characteristic (ROC) curves were plotted and the area under the curve (AUC) values computed on a dataset containing 10 sets of predictions and corresponding labels obtained from 10fold crossvalidation using ROCR package in R 83 . A risk index threshold was also obtained that best predicts the presence or absence of LoF mutation with a leaveoneout crossvalidation on the risk index. Each heldout sample was treated as a new patient on whom the optimal risk index cutoff was tested and subsequently refined to separate patients who had a LoF mutation and patient who did not have a LoF.
Correlation analysis was performed using SparCC on a reduced OTU table containing significant taxa identified using the above prediction methods collapsed to the genus level 84 .
As this work is an extension of a previous study of the CRCassociated microbiome, each of the patients in this project have associated clinical data 69 . We used a linear model to determine the extent to which any of these factors may correlate with mutation load. These included patient sex, tumor stage, patient age, patient body mass index (BMI), and microsatellite instability (MSI) status. None of these factors, alone or in combination, were found to significantly impact the mutational data, though it bears noting that MSI status was only available for a subset (13 out of 44) of the patients.   1 | Correlation between the microbial community at a tumor that differentiates between tumor stage. a , Lowstage (stages 12) and highstage (stages 34) tumors can be differentiated using a risk index classifier generated from microbial abundance data (yaxis). The central black bar indicates the median, and the thin black bars represent the 25th and 75th percentiles. b , A receiver operating characteristic (ROC) curve was generated using a 10fold crossvalidation (blue dotted lines). The average of the 10fold crossvalidation curves is represented as a thick black line. c , Differences in the mean abundances of a subset of the taxa predicted to interact differentially with highstage and lowstage tumors. This subset represents those taxa that had a mean difference in abundance of greater than 0.1%, proportionally. a subset of taxa chosen as they were identified as discriminatory in each leaveoneout iteration (columns) that were found significantly associated with prevalent LoF mutations (rows). Scaled abundances are from the patients with the indicated mutations. b , LoF mutations in each of the indicated genes can be predicted using a risk index as a classifier (yaxis). The central black bar indicates the median, and the thin black bars represent the 25th and 75th percentiles. c , ROC curves were generated for each of the indicated mutations using a 10fold crossvalidation (blue dotted lines). The average of the 10fold crossvalidation curves is represented as a thick black line. d , Differences in the mean abundances of a subset of the taxa predicted to interact differentially with tumors with a LoF mutation relative to those without the indicated mutation. This subset represents those taxa that had a mean difference in abundance of greater than 0.1%, proportionally. a subset of taxa (columns) that are found significantly associated with KEGG pathways harboring LoF mutations (rows). Scaled abundances are from the patients with mutations in the indicated pathways. b , LoF mutations in each of the indicated pathways can be predicted using a risk index as a classifier (yaxis). The central black bar indicates the median, and the thin black bars represent the 2nd and 4th quartiles. c , ROC curves were generated for each of the indicated pathways using a 10fold crossvalidation (blue dotted lines). The average of the 10fold crossvalidation curves is represented as a thick black line. d , Differences in the mean abundances of a subset of the taxa predicted to interact differentially with tumors harboring mutations in the indicated pathways relative to those without a mutation. This subset represents those taxa that had a mean difference in abundance of greater than 0.1%, proportionally. e f , Identically structured visualizations as in a d , but for PID pathway data rather than the KEGG pathways. analysis was run simultaneously for all taxa identified by LEfSe when predicting PID pathways. There are interactions (dashed edges) between the taxa (grey nodes) associated with mutations across sets of PID pathways (green nodes). The solid edges indicate SparCC Rvalues (red for direct and blue for inverse correlations). The grey taxon nodes are scaled to the average abundance of the taxa in the associated tumor set. Edge color indicates the direction of the interaction, red for negative and blue for positive.