
The authors have declared that no competing interests exist.

Conceived and designed the experiments: GW ES. Performed the experiments: GW ES YZ. Analyzed the data: PSLR BS ED WDS. Contributed reagents/materials/analysis tools: GW ES YZ. Wrote the paper: PSLR WDS. Designed statistical methods: PSLR WDS. Designed software used in the analysis: PSLR BS ED WDS. Sequenced samples: GW ES YZ. Taxonomic-based analysis of sequences: YZ.

Human microbiome research characterizes the microbial content of samples from human habitats to learn how interactions between bacteria and their host might impact human health. In this work, a novel parametric statistical inference method based on object-oriented data analysis (OODA) is proposed for analyzing Human Microbiome Project (HMP) data. OODA is an emerging area of statistical inference where the goal is to apply statistical methods to objects such as functions, images, and graphs or trees. The data objects that pertain to this work are taxonomic trees of bacteria built from analysis of 16S rRNA gene sequences (e.g., using the Ribosomal Database Project, RDP); there is one such object for each biological sample analyzed. Our goal is to model and formally compare a set of trees. The contribution of our work is threefold: first, a weighted tree structure to analyze RDP data is introduced; second, using a probability measure to model a set of taxonomic trees, we introduce an approximate maximum likelihood estimation (MLE) procedure for estimating model parameters and we derive likelihood ratio test (LRT) statistics for comparing the distributions of two metagenomic populations; and third, the Jumpstart HMP data are analyzed using the proposed model, providing novel insights and future directions of analysis.

The Human Microbiome Project (HMP)

Microbiome samples are collected from patient body sites by swabbing (e.g., skin, nasal, oral) or bulk collection (e.g., saliva, stool). These samples contain the entire bacterial community (i.e., the microbiome) as well as other organisms (e.g., human cells, viruses, fungi). Samples are processed to isolate the genomic content (i.e., all DNA from the entire bacterial microbiome, all patient DNA, all viral DNA, etc.) within that sample, and prepared for state-of-the-art ‘next generation’ sequencing. To characterize the microbial community structure, 16S rRNA genes are sequenced using the high-throughput 454 FLX Titanium sequencing platform (Roche). The sequences are analyzed using either a phylogenetic or taxonomic approach

With the goal of enumerating the content and abundances of the microbial communities in 18 (female) or 15 (male) body habitats of 300 healthy adults, an average of 7,000 16S rRNA sequences were produced per individual per body-site sample. These sequence data sets provide the opportunity to estimate microbial diversity with high resolution, but statistical tools and strategies for analyzing the microbial communities are needed to take full advantage of the data density.

In recent years, several tools have been developed to compare human microbiome communities using either phylogenetic or taxonomic classification of metagenomic sequences. Current strategies are based primarily on exploratory cluster analysis, phylogenetic inferences, biological diversity indices, bootstrap or resampling methods, and application of univariate and non-parametric statistics to different subsets of the data

Tools currently being used to analyze HMP data for limited numbers of sequence reads include UniFrac

In this work a novel parametric statistical inference method based on object-oriented data analysis (OODA) for analyzing HMP data is proposed. OODA is an emerging area of statistical inference where the goal is to apply statistical methods to objects such as functions

Seq. ID | Kingdom | Phylum | Class | Order | Family | Genus |

F51YIRY01BC31 | Bacteria:0.99 | Bacteroidetes:0.99 | Bacteroidia:0.9 | Bacteroidales:0.99 | Prevotellaceae:0.99 | Prevotella:0.99 |

F51YIRY01DFQI | Bacteria:0.99 | Firmicutes:0.53 | Clostridia:0.53 | Clostridiales:0.53 | Veillonellaceae:0.53 | Megasphaera:0.52 |

F51YIRY01CLKP | Bacteria:0.99 | Firmicutes:0.96 | Bacilli:0.91 | Lactobacillales:0.90 | Enterococcaceae:0.44 | Pilibacter:0.41 |

Of primary interest to HMP investigators is the estimation of the core microbiota from a set of samples. Determining a core microbiota aims at finding the organisms (or functions) selected in the host environment, and at studying its correlation with changes in human health. By defining a unimodal probability measure, we are able to compute a central taxonomic tree, the maximum likelihood tree, which provides a new, alternative definition of the core microbiome for a set of RDP tree samples. Although this paper focuses on 454 sequencing of 16S rRNA genes with the reads mapped to taxonomic (classification) assignments, the methods are equally applicable to shotgun sequencing data with functional profiling of the microbial community.

Subjects involved in the study provided written informed consent for screening, enrollment and specimen collection. The protocol was reviewed and approved by the Institutional Review Board at Washington University in St. Louis. The data were analyzed without personal identifiers. Research was conducted according to the principles expressed in the Declaration of Helsinki. This manuscript adheres to the HMP data release policy (see

The MLE tree of all samples is denoted by MLE (dot in black) in the MDS plot. Individual taxonomic trees are denoted by

Sample individual taxonomic trees shown in

Human microbiome data analyzed in this paper for illustration purposes are from samples of 24 subjects (male and female), 18–40 years old, from two geographic regions of the US: Houston, TX and St. Louis, MO. These samples were collected as part of study HMP: 16S rRNA 454 Clinical Production Pilot (Project ID: 48335) (see

In Figure (a), a pairwise distance matrix was generated using Euclidean distance, and multidimensional scaling was used to display the distribution of these 48 trees, showing that the V1–V3 (red) and V3–V5 (blue) samples overlap. In Figure (b), the MLE tree for the 48 trees is illustrated. In Figures (c) and (d), the MLE trees for the V1–V3 and V3–V5 regions are shown, respectively.

Body Habitats | P-value |

anterior-nares | 0.15 |

attached-gingivae | <E-03 |

buccal-mucosa | 0.02 |

hard-palate | 0.12 |

l-retroauricular-crease | 0.47 |

mid-vagina | 0.23 |

palatine-tonsils | 0.03 |

posterior-fornix | 0.22 |

r-retroauricular-crease | 0.53 |

saliva | 0.02 |

stool | 0.26 |

subgingival-plaque | 0.12 |

supragingival-plaque | 0.12 |

throat | 0.10 |

tongue-dorsum | 0.05 |

vaginal-introitus | 0.20 |

Building a taxonomic tree by adding RDP confidences has several important properties. First, the resulting tree is consistent with the RDP classification of each sequence, in which branches closer to the root have higher values than branches closer to the leaves. Second, the approach provides a linear approximation of the overall confidence of a branch in a sample, which allows us to identify the tree branches with the highest overall confidence in each sample. Third, as stated above, for any given branch the sum of the confidence values provides a measure of taxon abundance weighted by the confidence of the RDP taxon assignment. One drawback of this approach is that trees built from a larger number of sequence reads tend to have branches with larger weights, which would bias the analysis when modeling a set of trees, e.g., in the computation of the MLE tree. To avoid this issue, in this work we normalize all samples to a common number of sequence reads.
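The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_weighted_tree` and `rescale_to_common_depth` are hypothetical, the three reads are the rows of the example RDP table, and the rescaling step is only one possible reading of "normalize to a common number of reads" (the text does not say whether weights are rescaled or reads are subsampled).

```python
from collections import defaultdict

# Three RDP assignments from the example table: (taxon, confidence), kingdom -> genus.
RDP_READS = [
    [("Bacteria", 0.99), ("Bacteroidetes", 0.99), ("Bacteroidia", 0.9),
     ("Bacteroidales", 0.99), ("Prevotellaceae", 0.99), ("Prevotella", 0.99)],
    [("Bacteria", 0.99), ("Firmicutes", 0.53), ("Clostridia", 0.53),
     ("Clostridiales", 0.53), ("Veillonellaceae", 0.53), ("Megasphaera", 0.52)],
    [("Bacteria", 0.99), ("Firmicutes", 0.96), ("Bacilli", 0.91),
     ("Lactobacillales", 0.90), ("Enterococcaceae", 0.44), ("Pilibacter", 0.41)],
]


def build_weighted_tree(reads):
    """Return {path: weight}, where path is the tuple of taxa from the kingdom
    down to a branch and weight is the sum of RDP confidences on that branch."""
    tree = defaultdict(float)
    for lineage in reads:
        path = ()
        for taxon, confidence in lineage:
            path += (taxon,)
            tree[path] += confidence
    return dict(tree)


def rescale_to_common_depth(tree, n_reads, common_reads):
    """ASSUMPTION: normalize by rescaling every weight by common_reads / n_reads.
    Subsampling reads to a common depth would be an alternative."""
    factor = common_reads / n_reads
    return {path: weight * factor for path, weight in tree.items()}


tree = build_weighted_tree(RDP_READS)
scaled = rescale_to_common_depth(tree, n_reads=len(RDP_READS), common_reads=6)
```

Keying each branch by its full path from the kingdom keeps identically named taxa under different parents distinct, and makes trees from different samples directly comparable branch by branch.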

In Figure (a), a pairwise distance matrix was generated using Euclidean distance, and multidimensional scaling was used to display the distribution of these 48 trees, showing that the stool (red) and saliva (red) samples do not overlap. In Figure (b), the MLE tree for the combined tree samples is illustrated. In Figures (c) and (d), the MLE trees for the stool and saliva samples are shown, respectively.

A unimodal probability model for graph-valued random objects has been derived and applied previously to several types of graphs (cluster trees, digraphs, and classification and regression trees)

p-value | anterior-nares | attached-gingivae | buccal-mucosa | hard-palate | l-retroauricular-crease | mid-vagina | palatine-tonsils | posterior-fornix | r-retroauricular-crease | saliva | stool | subgingival-plaque | supragingival-plaque | throat | tongue-dorsum | vaginal-introitus |

anterior-nares | 1 | |||||||||||||||

attached-gingivae | * | 1 | ||||||||||||||

buccal-mucosa | * | 0.10 | 1 | |||||||||||||

hard-palate | * | 0.04 | 0.07 | 1 | ||||||||||||

l-retroauricular-crease | * | * | * | * | 1 | |||||||||||

mid-vagina | * | * | * | * | * | 1 | ||||||||||

palatine-tonsils | * | 0.01 | * | 0.27 | * | * | 1 | |||||||||

posterior-fornix | * | * | * | * | * | 0.60 | * | 1 | ||||||||

r-retroauricular-crease | * | * | * | * | 0.46 | * | * | * | 1 | |||||||

saliva | * | * | * | * | * | * | * | * | * | 1 | ||||||

stool | * | * | * | * | * | * | * | * | * | * | 1 | |||||

subgingival-plaque | * | * | * | * | * | * | * | * | * | * | * | 1 | ||||

supragingival-plaque | * | * | * | * | * | * | * | * | * | * | * | 0.05 | 1 | |||

throat | * | * | * | 0.01 | * | * | 0.09 | * | * | * | * | * | * | 1 | ||

tongue-dorsum | * | * | * | * | * | * | * | * | * | * | * | * | * | 0.01 | 1 | |

vaginal-introitus | * | * | * | * | * | 0.47 | * | 0.11 | * | * | * | * | * | * | * | 1 |

Two broad strategies exist for defining a suitable distance metric

In general, any finite graph defined on a set of labeled vertices or nodes can be uniquely characterized by mapping it into the space of matrices through the vertex-adjacency matrix
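As a rough illustration of the vertex-adjacency mapping, the sketch below embeds two small weighted trees into matrices over a shared, ordered vertex set and compares them with the Euclidean (Frobenius) distance used for the MDS figures. The `{path: weight}` tree representation, the empty tuple used as an artificial root, and all function names are assumptions made for this example.

```python
import math


def vertex_index(trees):
    """Assign a fixed position to every vertex (path prefix) seen in any tree."""
    vertices = set()
    for tree in trees:
        for path in tree:
            for k in range(len(path) + 1):  # include the root () and all ancestors
                vertices.add(path[:k])
    ordered = sorted(vertices, key=lambda p: (len(p), p))
    return {vertex: i for i, vertex in enumerate(ordered)}


def adjacency_matrix(tree, index):
    """Weighted vertex-adjacency matrix: entry [parent][child] holds the branch weight."""
    n = len(index)
    matrix = [[0.0] * n for _ in range(n)]
    for path, weight in tree.items():
        matrix[index[path[:-1]]][index[path]] = weight
    return matrix


def euclidean_distance(a, b):
    """Frobenius norm of the difference between two adjacency matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


tree_1 = {("Bacteria",): 3.0, ("Bacteria", "Firmicutes"): 2.0}
tree_2 = {("Bacteria",): 3.0, ("Bacteria", "Bacteroidetes"): 1.0}

index = vertex_index([tree_1, tree_2])
distance = euclidean_distance(adjacency_matrix(tree_1, index),
                              adjacency_matrix(tree_2, index))
```

Because both trees are indexed against the union of their vertices, a branch absent from one tree simply contributes a zero entry, so trees with different topologies remain comparable.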

The space of RDP trees is continuous and constrained. In fact, by construction of the RDP tree, the edge weights (the sums of confidence levels) are monotonically non-increasing as we travel from the root, the vertex at the kingdom level, to the leaves, the vertices at the genus level. This means that

To estimate

Solving the above equations for

Note that solving the minimization problem in (7) with respect to

We are interested in assessing whether the distributions

We apply the HMP taxonomic tree OODA methods developed here to existing HMP data from 24 subjects (see the Methods section, under HMP Data Description and Data Structure, and references therein for a complete description). An R package has been developed containing the implementations of the visualization and methods proposed above

For a given set of HMP stool samples we show in

To illustrate differences within a body site but across variable regions of the 16S rRNA gene, stool samples for 24 subjects were sequenced at variable regions V1–V3 and V3–V5, mapped to the RDP database, and a taxonomic tree estimated for each sample.

To illustrate differences across body habitats, stool and saliva samples for 24 subjects were sequenced, mapped to the RDP database, and a taxonomic tree estimated for each sample. In

In

In addition, for purpose of comparison, we apply Analysis of Similarity (ANOSIM)
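For readers unfamiliar with ANOSIM, the following is a minimal sketch of the standard statistic, R = (mean between-group rank − mean within-group rank) / (M/2) with M = n(n−1)/2 pairwise distances, together with a permutation p-value. It is not the software used in the paper; the function names, the toy one-dimensional data, and the choice of 199 permutations are assumptions made for illustration.

```python
import random
from itertools import combinations


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist, labels):
    """ANOSIM R statistic from a square distance matrix and group labels."""
    pairs = list(combinations(range(len(labels)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


def anosim(dist, labels, n_perm=199, seed=0):
    """Return (R, permutation p-value); group labels are shuffled n_perm times."""
    rng = random.Random(seed)
    observed = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: two well-separated groups of three samples on a line.
points = [0, 1, 2, 10, 11, 12]
labels = ["stool"] * 3 + ["saliva"] * 3
dist = [[abs(a - b) for b in points] for a in points]
r_stat, p_value = anosim(dist, labels)
```

In this toy case every within-group distance is smaller than every between-group distance, so R reaches its maximum of 1. ANOSIM uses only the ranks of the distances, which is what makes it a non-parametric counterpart to the LRT.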

We propose a novel parametric statistical inference method for analyzing HMP data, which are naturally represented in the form of taxonomic trees. Using methods from object-oriented data analysis (OODA), we applied classical statistical methods for inference and hypothesis testing to the analysis of HMP RDP data. In particular, we applied a unimodal probability model that depends on a dispersion parameter and a central mode tree. We introduce an approximate MLE procedure for estimating model parameters and we derive LRT statistics for comparing the distributions of two metagenomic populations.

Within the framework of representing HMP data by taxonomic trees there are currently two basic approaches for defining (estimating) the core: First, a consensus tree can be built by combining common branches from the samples and removing unusual samples, i.e., the intersection tree. This approach defines the core as the set of organisms that are present in a particular body site in all or in a vast majority of individuals
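The intersection-tree idea can be illustrated with a short sketch that keeps only the branches present in at least a chosen fraction of samples. The `{path: weight}` tree representation, the function name `core_branches`, and the `min_prevalence` threshold are hypothetical; the step of removing unusual samples is not modeled here.

```python
def core_branches(trees, min_prevalence=1.0):
    """Return the set of branches (paths) with positive weight in at least
    min_prevalence of the trees; 1.0 gives the strict intersection tree."""
    counts = {}
    for tree in trees:
        for path, weight in tree.items():
            if weight > 0:
                counts[path] = counts.get(path, 0) + 1
    n_trees = len(trees)
    return {path for path, count in counts.items()
            if count / n_trees >= min_prevalence}


samples = [
    {("Bacteria",): 3.0, ("Bacteria", "Firmicutes"): 2.0},
    {("Bacteria",): 3.0, ("Bacteria", "Firmicutes"): 1.0,
     ("Bacteria", "Bacteroidetes"): 2.0},
    {("Bacteria",): 3.0, ("Bacteria", "Bacteroidetes"): 3.0},
]

strict_core = core_branches(samples)        # present in every sample
majority_core = core_branches(samples, 0.6)  # present in at least 60% of samples
```

Note that this definition uses presence or absence only and discards the branch weights, whereas the MLE tree discussed in this paper is estimated from the weights themselves.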

Our approach is based on the assumption that a unimodal model fits the set of tree samples, which might not always be valid

Applying the LRT statistics to real HMP data from 24 subjects allowed us to test for differences between core microbiomes across body habitats and across variable regions within the same body site. When comparing the results of the LRT test with those obtained by ANOSIM