Host Genes Related to Paneth Cells and Xenobiotic Metabolism Are Associated with Shifts in Human Ileum-Associated Microbial Composition

The aim of this study was to integrate human clinical, genotype, mRNA microarray and 16 S rRNA sequence data collected on 84 subjects with ileal Crohn’s disease, ulcerative colitis or control patients without inflammatory bowel diseases in order to interrogate how host-microbial interactions are perturbed in inflammatory bowel diseases (IBD). Ex-vivo ileal mucosal biopsies were collected from the disease unaffected proximal margin of the ileum resected from patients who were undergoing initial intestinal surgery. Both RNA and DNA were extracted from the mucosal biopsy samples. Patients were genotyped for the three major NOD2 variants (Leufs1007, R702W, and G908R) and the ATG16L1T300A variant. Whole human genome mRNA expression profiles were generated using Agilent microarrays. Microbial composition profiles were determined by 454 pyrosequencing of the V3–V5 hypervariable region of the bacterial 16 S rRNA gene. The results of permutation based multivariate analysis of variance and covariance (MANCOVA) support the hypothesis that host mucosal Paneth cell and xenobiotic metabolism genes play an important role in host microbial interactions.


Introduction
Inflammatory bowel diseases are complex genetic disorders resulting from the interplay of genetic and environmental factors [1][2][3]. Crohn's diseases (CD) and ulcerative colitis (UC) represent the two major inflammatory bowel diseases (IBD) phenotypes and are distinguished by different patterns of disease location. The inflammation in CD patients may be located anywhere along the gastrointestinal tract, but in the majority (80%) of CD patients, the terminal ileum is involved. In UC, the inflammation is confined to the colon. Because there is evidence that isolated Crohn's colitis are associated with genetic factors that are distinct from ileal CD, and the overlap between genetic factors associated with UC and isolated Crohn's colitis, we have focused our attention on the ileal CD subphenotype as a relatively homogenous category that is distinct from isolated colitis (CD or UC) and non-IBD controls [4][5][6].
Single nucleotide polymorphisms in the NOD2 gene and the ATG16L1 gene have been linked to alterations in innate host immunity, particularly Paneth cell function and with ileal CD phenotype [7][8][9][10][11][12][13][14]. We previously reported that increased CD3D mRNA expression in disease affected ileum resected from 18 ileal CD patients was associated with NOD2 genotype [15]. We also observed alterations in mRNA gene expression in the disease unaffected proximal margin of resected ileum from 19 ileal CD patients compared to 9 control non-IBD patients, regardless of NOD2 genotype [15]. The microarray dataset has recently been further expanded to include 47 ileal CD, 27 UC and 25 non-IBD control subjects (total = 99).
Culture-independent microbiological technologies coupled with high-throughput DNAsequencing have uncovered alterations in human intestine-associated microbial compositions (''dysbiosis'') in IBD patients compared with controls [16][17][18][19][20][21][22][23][24][25]. Ileal CD phenotype has been also associated with shifts in intestinal and fecal microbial composition, particularly reduced relative frequency of Faecalibacterium prausnitzii [19], [20], [23]. In addition to disease phenotype, exploratory analyses have also associated NOD2 genotype to intestinal associated microbial composition [22]. We have recently completed 16 S rRNA sequence analysis on an independent set of disease-unaffected ileal biopsies collected of 52 ileal CD, 58 colitis and 60 control patients without IBD undergoing initial surgical resection [24]. Of the 170 subjects with microbial composition data and 99 subjects with mRNA expression profiles, 84 subjects had paired microarray and microbial datasets. We report here the results of permutation based MANCOVA of these paired mRNA expression and microbial profiles in 34 ileal CD, 27 UC and 23 non-IBD control patients.

Patient Characteristics
As shown in Table 1, 35% of ileal CD patients harbored at least one NOD2 risk allele (NOD2R) compared to 13% of nonIBD control patients, consistent with previous studies [1][2][3]. Only one ileal CD patient was homozygous for the ATG16L1 nonrisk allele. The patients were predominantly Caucasian. The median age of surgery was lower in ileal CD patients than nonIBD control patients. Thirty percent of colitis patients had a concomitant C. difficile infection, consistent with the increased incidence of this infection noted previously in IBD patients [26], [27]. Thirty to fifty percent of IBD patients and none of the non-IBD control patients were treated with 5-aminosalicylic acid (5-ASA), steroids, immunomodulators or an anti-TNFa agent. All of the subjects received intravenous antibiotics within one hour of incision [28].

Comparison of Ileal Mucosal Expression Profiles between Ileal CD, UC and non-IBD Control Subjects
Normalization and pre-processing of the data to filter out undetectable gene-probes resulted in a total of 26,765 geneprobes. Because this number of variables still greatly exceeded the sample size, we sought to further reduce the number of input microarray variables. We reasoned that genes that were differentially expressed between the three disease phenotypes were most likely to be involved in altering host-microbial interactions. Twoclass unpaired SAM analysis was used to identify genes differentially expressed (fold change .1.5, FDR ,0.05) between ileal CD and Control samples, between UC and Control samples, and between CD and UC samples [29]. The results indicate significant differences in gene expression patterns between all three disease phenotypes (see Table S1) [30], [31]. By taking the union of the candidate genes identified by the three two-class comparisons, the dimensions of the normalized microarray data was reduced from 26,765 to a 2,979 gene-probe set (see Fig. 1).
Hierarchical clustering of the 2,979 gene-probe set was then carried out by using 1-correlation dissimilarities and Ward linkage as previously described [32], [33]. The number of clusters was chosen to be 43, based on inspection of the R2 plot (see Table S2). Hierarchical clustering of the original 26,765 gene-probe set was also carried out using the same algorithms to 265 clusters. This number of clusters was again chosen based on inspection of the R2 plot. In all but four (clusters #14, 20, 31, 36) of the 43 clusters, greater than 40% of the gene-probes were concentrated in two of the 265 clusters obtained by clustering the original 26,765 geneprobe set, indicating that using SAM to reduce the number of probes did not appear to bias clustering. We reasoned that genes, which were highly correlated with each other, would be linked by common biological pathways. Ingenuity Pathway Analysis (IPA) canonical pathways were associated (P,0.01 and at least 4 gene probes) in 12 of 43 clusters (see Table S3). In addition, direct inspection of cluster #24 revealed that this cluster included a number of genes expressed in Paneth cells, such as the adefensins.

Permutational based MANCOVA with Stepwise Variable Selection and Gene Cluster Centroids as Independent Variables
For 84 of 99 ileal mucosal samples with microarray profiles, 454 pyrosequencing of the V3, V4, and V5 (V3-V5) hypervariable regions of the 16 S rRNA gene was completed using primers adopted by the Human Microbiome Project [34], [35]. A vector consisting of the relative frequencies of six phyla/subphyla categories (excluding the seventh ''other Taxa'' category):1.) Actinobacteria, 2.) Bacteroidetes, 3.) Firmicutes. Clostridium Group IV, 4.) Firmicutes. Clostridium Group XIVa, 5.) Firmicutes. Bacilli, 6.) Proteobacteria, was used as the dependent variables. The 43 microarray cluster medians were used as cluster centroids [36], [37], in addition to disease phenotype and the other 12 input variables in the analysis. Using the stepwise variable selection method, permutation based MANCOVA selected disease phenotype, a Paneth cell gene enriched cluster, two xenobiotic metabolism gene enriched clusters, and NOD2 genotype as the independent variable set (see Table 2). Gene-probes included in these three clusters are listed in Table 3. We obtained similar results using the cluster mean or first principle component [36], [38] as the cluster centroids (data not shown). We also obtained similar results when parallel Sanger and 454 V1-V3 16 S sequence datasets were used (data not shown).
To examine correlations between gene transcripts and bacterial taxa at a more granular level, we selected individual gene transcripts within these microarray clusters and individual bacterial genera. The selected transcripts included the alpha defensins, (DEFA5 and DEFA6), which have exhibited altered regulation in ileal Crohn's disease [10], [39], [40], and included cellular detoxification genes, which have exhibited altered regulation in ulcerative colitis [41]. The bacteria genera selected included the Faecalibacterium and Shigella/Escherichia genera, because the relative frequency of Faecalibacterium prausnizii has been reported to be reduced and that of Escherichia coli have been reported to be increased in patients with ileal CD. Bacterial genera, previously selected as ulcerative colitis related were also included in this analysis [42].
As shown in Figure 2, a positive correlation (P,0.05) between the relative frequency of the Faecalibacterium genus (Firmicutes Phylum, Clostridium GroupIV) and mRNA expression levels of Paneth cell genes, including DEFA5 and DEFA6, was observed in ileal CD patients but not in non-IBD controls or UC patients. Negative correlations (P,0.05) were observed between the relative frequency of the Bacteroidetes genus and Parabacteroides genus (Bacteroidetes phylum) and mRNA expression levels of the REG genes in ileal CD patients, but not in UC or non-IBD control patients. Furthermore, negative correlations (P,0.05) were observed between the relative frequency of the Parabacteroides genus (Bacteroidetes phylum) and mRNA expression levels of  cellular detoxification genes in non-IBD control and ileal CD patients but not in UC patients. The highest correlation (out of .500 comparisons) was observed between Faecalibacterium and DEFA6 (r = 0.59, P = 0.00024, FDR = 0.057) in ileal CD patients.

Discussion
In this exploratory study, we report the results of the analysis integrating human ileal mucosal microarray expression profiles with microbiota profiles. Since the number of genes and bacterial taxa greatly exceed the number of samples, we sought to shrink these high dimensional datasets by grouping the bacteria taxa into broad phyla/subphyla categories, and by selecting potentially disease relevant transcripts by SAM followed by clustering using 1correlation dissimilarity measure. We were gratified to observe that a number of the resulting clusters could be linked to canonical pathways by IPA. Furthermore inspection of one of the clusters revealed that it included a number of genes that were expressed in Table 3. Gene list for the Paneth cell gene enriched cluster and the xenobiotic metabolism gene enriched clusters A and B.

Cluster Function Gene list
Paneth Cell (cluster 24) ENPP7, DEFA5, TM4SF20, RGN, MDK, REG3A, BCMO1, BAI2, GPR172B, CA9, ANGPTL4, ASAH2, CEL, NPC1L1, SERPINB5, SERPINA1, NPNT, VNN1,  DDO, PRSS2, PLA2G2A, PRSS1, SLC2A12, CCK, CDKN1C, UNC5CL, FBXO2, KLK12, SIGLEC15, CLCA1, RHBG, CCL25, AZGP1, LCT, DEFA6, GCNT1,  SLC16A4, UNC93A, LOC100128979, WNT11, VNN1, PEAR1, LOC643201, ITLN2, REG4, LOC100240735,  The genes selected for further correlation analyses are bolded. See Table S3   Paneth cells. We therefore selected gene transcripts within these three microarray clusters for further analysis. These genes included the alpha defensins and members of the regenerating gene (REG) family. The alpha defensins are antimicrobial peptides that are secreted by Paneth cells. Manipulation of alpha defensin expression in experimental animals has been shown to alter gut microbial composition [43]. On the other hand, monoassociation of a Bacteroidetes species with germ free animals was shown to alter regulation of Paneth cell gene expression [44]. Altered expression of alpha defensins have been associated with the ileal Crohn's disease phenotype. The REG genes belong to the calcium (C-type) dependent lectin superfamily and have been noted to be upregulated in CD and UC intestinal tissues [45], [46]. Expression levels of cellular detoxification genes, which are target genes for the transcription factor pregnane X receptor (PXR and also termed NR1I2), have been previously noted to be down-regulated in the colons of UC patients [41]. The expression levels of these genes were correlated with bacterial genera that had been previously reported to be disease associated [42]. At a threshold of P#0.05, exploratory analyses revealed potential correlations between transcript levels of specific genes with individual bacterial genera (e.g. Faecalibacterium, Bacteroidetes and Parabacteroides), that were modulated by disease phenotype. By honing in on these promising correlations identified by exploratory studies, we hope to be able to further confirm these observations in an expanded set of samples. Co-linearity between input variables may occur, despite our efforts to shrink the dimensions of the datasets. This may account for why C. difficile was not selected in this subset of the microbial dataset. Alternatively C. difficile may not have been selected because paired microarray and microbial data have been collected on a smaller number of subjects thus far. While the use of immunomodulators and anti-TNFa biologics were included as covariates in the MANOVA [24], we cannot completely exclude the potential confounding effects of these drugs on the microbial composition and mucosal gene expression. Nevertheless our results demonstrate that integrating paired expression profiles and microbial data can lead to the discovery of biologically meaningful host-microbial interactions in inflammatory bowel diseases. We anticipate that as we expand the sample set, other associations will be detected.

Patients and Acquisition of Macroscopically Diseaseunaffected Proximal Margin Ileal Tissue Samples
This study was approved by the Institutional Review Boards of Washington University-St. Louis and Stony Brook University. Ileal CD patients undergoing initial ileocolic resection, UC patients undergoing initial total colectomy and Control non-IBD patients undergoing either right hemicolectomy or total colectomy were prospectively enrolled in a consecutive fashion by the Washington University Digestive Diseases Research Core Center Tissue Procurement Facility to donate surgically resected tissue samples between April 2005 and February 2010. Patients who were unwilling or unable to give informed consent were excluded. Clinical information and patient samples were stripped of all identifying information and assigned a patient code and sample code. The de-identified patients were genotyped for the three major NOD2 and ATG16L1 genotypes and phenotyped as previously described [15]. All of the patients received antibiotics within one hour of incision. Ex-vivo biopsies were obtained of the disease unaffected proximal margin of the surgical resection specimens as previously described. RNA and DNA were extracted from the biopsy samples as previously described [15].

Human Ileal Mucosal Expression Profiles
The test RNA and a common reference ileal RNA were labeled and the resulting probes were hybridized to Agilent Whole Human Genome Arrays (Agilent No. G4410A) as previously described [15]. The pre-processing, filtering and normalization of the array data was conducted using the R package LIMMA [47], [48]. Probes with all Genepix flags less than 250 were treated as absent and removed from the dataset. There were technical duplicates on three samples and the log2 ratios for these three samples were averaged prior to analysis. Genes that were differentially expressed between ileal CD vs. Control, UC vs. Control, and ileal CD vs. UC were selected by conducting three two-class unpaired comparisons using SAM, with a cutoff of change .1.5 fold and false discovery rate (FDR) ,0.05 [29]. The hierarchical clustering was carried out by using 1-r dissimilarity measurement and Ward linkage as previously described [32], [34]. The cluster number was decided based on inspection of the coefficient of determination (R2) plot [49], [50]. The biological significance of these clusters was assessed by using Ingenuity Pathway Analysis (IPA) software [51]. To select cluster-enriched IPA canonical pathways, we lowered the threshold of the p-value from 0.05 to 0.001, and included only pathways that included $4 genes in the cluster. The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE24287 (http://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc = GSE24287).

Assessment of Ileal-associated Microbial Composition
The V3-V5 region was targeted by using barcoded primers 357F  and 907R (59CCGTCAATTCMTTTRAGT) and were identical to the primers used by the Human Microbiome Project to characterize the microbiota in healthy human subjects. All sequences were screened for fidelity to a 16 S rRNA bacterial covariance model (CM) based on secondary structure using the Infernal software package [52] and were checked for chimerism with ChimeraSlayer [53] http://microbiomeutil.sourceforge.net/#A_CS). Potentially chimeric sequences and sequences lacking high fidelity to the CM were removed from subsequent analysis. Genera level taxonomic calls were produced by the RDP Classifier [53], which performs naïve Bayesian taxonomic classification versus a training set. This project used the code and training set provided by RDP (Version 2.1, http://sourceforge.net/projects/rdpclassifier/) April 6, 2010 respectively. The sequences were also classified into seven phyla/ subphyla categories. The seven categories were 1) Actinobacteria, 2) Bacteroidetes, 3) Firmicutes.Clostridium Group IV, 4) Firmicutes. Clostridium Group XIVa, 5.) Firmicutes. Bacillus, 6.) Proteobacteria, and 7.) Other taxa. The subdivisions of the Firmicutes phyla were based on concordance between the RDP classifier and the Greengenes 16 S rRNA phylogenetic schema [54][55][56]. The Clostridium GroupIV and Clostridium Group XIVa taxa are subsets of the Lachnospiriciae taxon [22], [57]. The sequence screening, classification, final binning and enumeration operations described were performed within a python based analysis pipeline created for this project [24]. Assembled Sanger sequences were deposited in GenBank accession HQ739096-HQ821395. 454 V1-V3 and V3-V5 sequences were deposited in the Sequence Read Archive accession SRX021348-SRX021368, SRX037800-SRX037802. Clinical and genotyping data can be accessed through the dbGAP authorized access system. Request access to: phs000255. The study accession is SRP002479 ''Effect of Crohn's disease risk alleles on enteric microbiota''. In order to request access to any of the individual-level datasets within the controlledaccess portions of the database, the Principal Investigator (PI) and the Signing Official (SO) at the investigator's institution will need to co-sign a request for data access, which will be reviewed by an NIH Data Access Committee at the appropriate NIH Institute or Center (https://dbgap.ncbi.nlm.nih.gov/aa/wga. cgi?page = login).

Statistical Analysis
In order to investigate the relationship between gene expression and bacteria composition, permutational MANOVA with stepwise variable selection was performed for a vector including six bacteria taxa, which served as the dependent variable [58]. Because the dependent variable is a vector of compositions, the centered log ratio transformation was used on the bacterial proportions [59]. The cluster medians [36], [37] were chosen to represent the cluster centroids and included as input variables along with clinical information (patients phenotype, age, race, smoking, BMI, gender, C._difficile, 5-ASA, steroids, immunomodulator, TNF) and genotypes (NOD2 and ATG16L1). Table S1 A. Gene-probes upregulated in CD compared to Control (n = 502). B. Gene-probes downregulated in CD compared to Control (n = 594). C. Gene-probes upregulated in UC compared to Control (n = 371).

(DOCX)
Table S3 Clusters obtained by after dimension reduction using SAM. The clusters are listed along with the percentage of variation explained by the 1 st PC of the cluster, which is related to the compactness of the cluster. The bolded clusters are the clusters in which .40% of the gene-probes were concentrated in two of the 265 clusters obtained without prior dimension reduction. The clusters were considered enriched for genes that were differentially expressed in CD vs. Control, UC vs. Control or CD vs. UC if $50% of the genes with a correlation of $0.75% to the 1 st PC of the cluster demonstrated a significant fold change (see Supplementary Table 1). Clusters were considered enriched for genes in an IPA canonical pathway if P,0.01 and at least 4 genes were in the pathway. In addition we listed our interpretation of the biological significance of the pathway. (DOCX)