Metagenomic Predictions: From Microbiome to Complex Health and Environmental Phenotypes in Humans and Cattle

Mammals have a large cohort of endo- and ecto-symbiotic microorganisms (the microbiome) that potentially influence host phenotypes. There have been numerous exploratory studies of these symbiotic organisms in humans and other animals, often with the aim of relating the microbiome to a complex phenotype such as body mass index (BMI) or disease state. Here, we describe an efficient methodology for predicting complex traits from quantitative microbiome profiles. The method was demonstrated by predicting inflammatory bowel disease (IBD) status and BMI from human microbiome data, and enteric greenhouse gas production from dairy cattle rumen microbiome profiles. The method uses unassembled massively parallel sequencing (MPS) data to form metagenomic relationship matrices (analogous to genomic relationship matrices used in genomic predictions) to predict IBD, BMI and methane production phenotypes with useful accuracies (r = 0.423, 0.422 and 0.466 respectively). Our results show that microbiome profiles derived from MPS can be used to predict complex phenotypes of the host. Although the number of biological replicates used here limits the accuracy that can be achieved, preliminary results suggest this approach may surpass current prediction accuracies that are based on the host genome. This is especially likely for traits that are largely influenced by the gut microbiota, for example digestive tract disorders or metabolic functions such as enteric methane production in cattle.


Running Metagenomic Predictions
A script that converts a metagenomic profile matrix and phenotypes into files that can be run in ASReml [1] has been supplied. As ASReml is not a free product, we have also implemented the method using the free R package rrBLUP [2]. Both scripts and some small example data are supplied in File S2.
The provided scripts were designed for a Linux operating system. Every system is administered differently, so you should contact your system administrator to ensure you have access to the required files and programs.
The script is coded in the R statistical language [3], which is free software available for many platforms. The script requires a space-delimited count matrix (Figure S1a) and a space-delimited phenotype table (Figure S1b) that contains the reference population's phenotypes. The sample names in the count matrix must match the sample names used in the phenotype table exactly. The rrBLUP script requires a third variable, which is the name of the file to which the output will be written (e.g. Output.txt).
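Because mismatched sample names are an easy mistake to make, a quick check before running the scripts may help. The following Python sketch is not part of File S2. The function name `check_sample_names` and the example sample names are hypothetical, and it assumes the sample names have already been read from the two input files.

```python
def check_sample_names(matrix_samples, phenotype_samples):
    """Check phenotype names against the count-matrix names.

    Every phenotype sample must appear in the count matrix (exact,
    case-sensitive match). Samples present only in the count matrix are
    returned, in matrix order, as the samples left to be predicted.
    """
    matrix_set = set(matrix_samples)
    missing = sorted(s for s in set(phenotype_samples) if s not in matrix_set)
    if missing:
        raise ValueError(
            "phenotype samples not found in count matrix: " + ", ".join(missing)
        )
    phenotype_set = set(phenotype_samples)
    return [s for s in matrix_samples if s not in phenotype_set]


# Hypothetical names: cow3 has no phenotype, so it is a sample to predict.
to_predict = check_sample_names(["cow1", "cow2", "cow3"], ["cow1", "cow2"])
```

A name such as `Cow1` in the phenotype table would not match `cow1` in the count matrix, and the check raises an error rather than silently dropping the sample.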
Once R is installed, the scripts can be opened and run one line at a time, or they can be run using a command.
To run the rrBLUP version, which is called MetagenomicPredictions.R, place the script in the current working directory. Then run one of:

Bacterial Cells in the Bovine Rumen
This calculation is based on the reported number of bacteria per mL of rumen fluid [4], the proportion of bacteria adhered to the plant material, and the rumen volume, and it compares the average cell weight [5] with the average cow weight. The average weight of total rumen contents from eight lactating Holstein cows was 106 kg (measured by removing the total rumen contents through a fistula). The average empty weight of the same animals was 521 kg. We used an estimate that the rumen contained 77% rumen fluid (81.6 L) by weight [6]. Studies have suggested that 70% of rumen microbes are adhered to the fibre component of rumen contents [7]. Based on these numbers, we have tried to estimate the number of bacterial cells in the bovine rumen compared with the animal's own mammalian cells.
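The arithmetic behind this estimate can be sketched as follows. The rumen content weight, fluid fraction and adhered fraction are taken from the text. The bacterial density per mL is reported in [4] but not reproduced in this section, so the value used below is a hypothetical placeholder. The fluid density of 1 kg/L is an assumption that reproduces the 81.6 L quoted above.

```python
def estimate_rumen_bacteria(contents_kg, fluid_fraction, cells_per_ml_fluid,
                            fraction_in_fluid, fluid_density_kg_per_l=1.0):
    """Return (fluid volume in litres, estimated total bacterial cells).

    Cells counted per mL of fluid are scaled up by the share of microbes
    that are free in the fluid rather than attached to fibre.
    """
    fluid_l = contents_kg * fluid_fraction / fluid_density_kg_per_l
    cells_in_fluid = fluid_l * 1000.0 * cells_per_ml_fluid  # 1000 mL per L
    total_cells = cells_in_fluid / fraction_in_fluid
    return fluid_l, total_cells


# Values from the text: 106 kg contents, 77% fluid, 30% of microbes in fluid.
# PLACEHOLDER: 1e10 cells/mL is illustrative only, not the value from [4].
fluid_l, total_cells = estimate_rumen_bacteria(
    contents_kg=106.0,
    fluid_fraction=0.77,
    cells_per_ml_fluid=1e10,
    fraction_in_fluid=0.30,
)
print(f"rumen fluid: {fluid_l:.1f} L")  # prints "rumen fluid: 81.6 L"
print(f"estimated bacterial cells: {total_cells:.2e}")
```

Comparing this count with the animal's own cells additionally requires the average cell weight from [5] and the 521 kg empty cow weight, which are left out of the sketch because the cell weight is not given here.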
Only 30% of rumen microbes are in the fluid; 70% are firmly attached to the fibre component [7].
Figure S1. Examples of metagenomic prediction files. Both input files are space delimited, and the names in the phenotype file match the names in the profile matrix exactly. a) Metagenome profile matrix, which contains counts of the number of reads that align to each contig from each sample. b) Phenotypes of the reference population, which will be used to predict the unknown samples. All samples in the phenotype file should be in the metagenome profile matrix, but not all samples in the metagenome profile matrix need be in the phenotype file (i.e. the validation animals are left out of the phenotype file). c) Example of phenotypes predicted by the rrBLUP method; the file to which these values are written is determined by the third variable given to the script. d) Example of the .as file that is generated by the ASReml method. This file and the others generated are then used to run ASReml.
Figure S4. Prediction by contig. Correlations between predicted and measured methane production. The x axis shows contigs ordered from most significant to least significant in a linear model (methane ~ contig abundance), with metagenomic profiles from both bovFT and bovGMC used in the model. The resulting equation was then applied to the bovFCE dataset to predict methane production. The blue dots indicate the correlation coefficient between methane predicted using that one contig and actual methane. The black line is the sum of the predictions from the x most significant contigs. Predictions have an average negative correlation with real methane production. This may be an artefact of the methane-mitigating diet effects.
Figure S5. Basic methodological procedure of performing metagenomic predictions. R refers to the R statistical language [3]. BWA is a short read alignment program [9].