Vikodak - A Modular Framework for Inferring Functional Potential of Microbial Communities from 16S Metagenomic Datasets

Background The overall metabolic/functional potential of any given environmental niche is a function of the sum total of genes/proteins/enzymes that are encoded and expressed by various interacting microbes residing in that niche. Consequently, prior (collated) information pertaining to genes, enzymes encoded by the resident microbes can aid in indirectly (re)constructing/ inferring the metabolic/ functional potential of a given microbial community (given its taxonomic abundance profile). In this study, we present Vikodak—a multi-modular package that is based on the above assumption and automates inferring and/ or comparing the functional characteristics of an environment using taxonomic abundance generated from one or more environmental sample datasets. With the underlying assumptions of co-metabolism and independent contributions of different microbes in a community, a concerted effort has been made to accommodate microbial co-existence patterns in various modules incorporated in Vikodak. Results Validation experiments on over 1400 metagenomic samples have confirmed the utility of Vikodak in (a) deciphering enzyme abundance profiles of any KEGG metabolic pathway, (b) functional resolution of distinct metagenomic environments, (c) inferring patterns of functional interaction between resident microbes, and (d) automating statistical comparison of functional features of studied microbiomes. Novel features incorporated in Vikodak also facilitate automatic removal of false positives and spurious functional predictions. Conclusions With novel provisions for comprehensive functional analysis, inclusion of microbial co-existence pattern based algorithms, automated inter-environment comparisons; in-depth analysis of individual metabolic pathways and greater flexibilities at the user end, Vikodak is expected to be an important value addition to the family of existing tools for 16S based function prediction. Availability and Implementation A web implementation of Vikodak can be publicly accessed at: http://metagenomics.atc.tcs.com/vikodak. This web service is freely available for all categories of users (academic as well as commercial).


Section A: Global Mapper Work-flow
Both sub-modules of Global Mapper, i.e. Co-metabolism and Independent contributions, take as input, (16S amplicon sequencing derived) raw taxonomic abundance data (in tab-delimited format) corresponding to various microbes in the studied environmental sample(s). For ease of understanding, a sample input file is provided in Supplementary File 2 (section A). Both sub-modules of Global Mapper process this input file using the following sequence of steps -

STEP 1 -Normalization of Abundance Data
Abundance data normalization is done with respect to two aspects a) Sample Size Normalization b) 16S Copy Number Normalization Sample size normalization (optional) ensures uniform scaling of the data, removing biases arising due to different sample sizes. For correcting the input abundance profiles for 16S copy number of various taxa, Vikodak uses a manually compiled database containing 16S copy numbers of microbial species). This database was compiled using information obtained from the RDP classifier distribution as well as from the "all.rnt' file available for download at the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.rnt.tar.gz).

STEP 2 -Generation of Enzyme Abundance Profile(s)
This stage computes the abundance profiles for enzymes present in various samples of the studied environment(s). Given the differences in the underlying assumptions, the two sub-modules of Global Mapper deduce these profiles differently as described below -Co-Metabolism sub-module: Co-metabolism sub-module of Global Mapper assumes that microbes in an environment exist as a consortium and individual microbes contribute to a common enyzme pool. The microbial community represented by individual (normalized) abundance data profiles is therefore mapped against the pre-computed Vikodak's back-end database (please refer to main paper) to obtain enzyme profile(s) of each sample.
This profile enlists all enzymes (along with their copy numbers) present in various taxa constituting the studied microbial community. In order to Supplementary File 2 (section B) depicts the format of an 'Enzyme Abundance Profile' generated at the end of this step. It may be noted that the cometabolism sub-module further processes information present in the Enzyme Abundance Profile to compute (and provide) information pertaining to the abundances of six enzyme classes (viz. oxido-reductases, ligases, lyases, hydrolases, tranferases, and isomerases) present in the studied environment.
Independent contributions sub-module: Independent contributions sub-module assumes that each microbe in a community has its own independent pool of enzymes with no co-metabolic associations between various microbes. Consequently, in contrast to generation of a single 'Enzyme Abundance Profile' for a given environment (as in the co-metabolism sub-module of Global Mapper), independent Enzyme Abundance Profiles are generated for individual taxa present in each individual sample of an environment. Therefore, S number of matrices are generated (where S is the number of samples in the microbial abundance data submitted as the input), each representing the individual Enzyme Abundance Data for various microbes of a sample.

STEP 3 -Computation of Metabolic Pathways
The objective of this step is to use Enzyme Abundance Profile(s) for computing the relative abundance of metabolic pathways at all three levels of KEGG hierarchy. The methodology followed by the two sub-modules of Global Mapper during this computation is described below - -4 -Once the functional profile for individual taxa of each sample is deduced, the computation of functional contribution of any microbial taxon in the overall environment can easily be performed using "mean or median or relative contribution value" of the sample-wise functional contributions of the given taxon. It may be noted that, "relative contribution" is calculated using the following mathematical expression: Wherein, Rc f = Relative contribution of f th feature, amongst total F features A = Abundance value of the f th feature in i th sample (total number of samples being N) The features in the above expression refer to the row-names in inferred functions abundance data for individual samples, and may pertain to Step

-Inferring Core Pathways
Both sub-modules (Co-metabolism and Independent contributions) of Global Mapper provide end-users an option for identifying 'core-pathways' present in a given environment. From the Pathway Abundance Profiles (at level 3 of the KEGG hierarchy), a bootstrapping mechanism is used for -5 -identifying and reporting the subset of pathways having a minimum abundance of 0.1% in at least 80% of the samples as 'core-pathways'. The bootstrapping approach iteratively picks (at random) 70% of the samples and identifies core-pathways in that set of samples. This process is iterated 1000 times. Pathways reported as core in at least 70% of the iterations are tagged as 'core pathways' with a boot-strap score of 70 or above.
Supplementary File 2 (section F) represents an example of a core-pathway result file generated using either of the two sub-modules of Global Mapper.

Section B: Noise and Error Reduction Provisions
Pathway Exclusion Cut-off (PEC): A major limitation of the current methods for function prediction is the inability to account for the constituents of a metabolic pathway. In other words, a metabolic pathway might be manifested by the joint expression of over 30 genes/ enzymes, but mere expression/ presence of 1-5 genes/ enzymes might not lead to the expression of the associated pathway. It is thus crucial to define a parameter for filtering pathways based on the proportion of various pathway associated genes/ enzymes expressed by the microbiota. Considering the need for such a parameter, Pathway Exclusion Cut-off (PEC) value has been defined in both algorithms of Global Mapper. PEC value is defined as the minimum percentage of genes/ enzymes belonging to any metabolic pathway that must be expressed by a given microbiota for tagging that pathway as being expressed by the microbiome. For example, a PEC value of 30 would mean that a given microbe/ microbiome must express at least 30% of genes/ enzymes belonging to any metabolic pathway for considering that pathway as being expressed. Both modules of Global Mapper have been developed in such a way that apart from providing the raw hits (where raw hits refer to the most inclusive criteria wherein even the expression of a single enzyme/ gene is considered for presence of a metabolic pathway) they provide the KEGG pathway (all three level) expression profiles for the microbial abundance data at various PEC values (30-90) as well.