Metagenomic Data Utilization and Analysis (MEDUSA) and Construction of a Global Gut Microbial Gene Catalogue

doi:10.1371/journal.pcbi.1003706

Figure 1.

The MEDUSA pipeline and its application to 4 gut metagenome datasets.

(a) Overview of the MEDUSA pipeline and its functions. Input data are in fastq format and can be compressed in various ways. MEDUSA counts reads aligning to a reference catalogue and outputs count files that can be annotated and analyzed. (b) The alignment function is implemented using Linux pipes, which reduces file I/O substantially and integrates quality control, filtering and alignment to a database into one step. (c) Data statistics of the human gut samples analyzed in this study. Most reads (>90%) pass the quality control step, and few samples have substantial contamination with human DNA. Overall, a larger proportion of reads aligns to the gene catalogue than to the genome catalogue. (d) Percentage of reads aligning to the gene and genome catalogues for each study. The processing time and the number of reads are also shown for each sequencing run; processing time scales linearly with the number of reads.


Figure 2.

Taxonomic analysis of the gut metagenome.

(a) Genus abundance of each sample, ordered by increasing Bacteroides relative abundance. The studied samples show a continuous gradient of increasing Bacteroides relative abundance. The 20 most abundant genera are shown; the remaining annotated reads are grouped as "other". (b) Boxplots showing the relative abundance of Bacteroides, Prevotella and Ruminococcus. Prevotella abundance is low in most samples, but a few samples have a high Prevotella abundance. (c) The Shannon diversity index of the species abundance shows that the Swedish and MetaHIT samples have a higher diversity than the Chinese and American samples. (d) Pan and core species with a relative abundance above 10⁻⁴ in the subjects (repeated samples from the same subject excluded). The core percentage means that a species was present in at least that percentage of the subjects.
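The Shannon diversity index in panel (c) can be illustrated with a minimal sketch. The function name `shannon_index` and the use of the natural logarithm are assumptions made for illustration; the caption does not state which logarithm base or software was used.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over species with p_i > 0.

    `counts` holds one sample's species abundances (read counts or
    relative abundances). The natural logarithm is an assumption here.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:  # species with zero abundance contribute nothing
            p = c / total
            h -= p * math.log(p)
    return h


# A perfectly even community of four species versus a skewed one.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]
print(shannon_index(even_sample), shannon_index(skewed_sample))
```

An even community gives the maximum value for its species count (ln 4 for four species), while a community dominated by one species gives a value close to zero.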


Figure 3.

Gene catalogue construction and abundance.

(a) The Venn diagram shows how the 11 659 115 genes were shared among the 4 studies, based on the merge of the 4 non-redundant gene catalogues. A core of 488 482 genes was found in all studies, whereas a large proportion of the genes was unique to each study. (b) Relative abundance of genes, grouped according to how they are shared in the Venn diagram. The shared genes are also the most abundant, followed by the genes unique to each study. Each field in the Venn diagram is denoted by the first letter of the study.


Figure 4.

Gene richness and pan and core genes.

(a) The number of genes detected in each sample using 11 million reads, shown as a smoothed histogram. European samples have a higher gene richness than the Chinese and American samples. (b) The number of genes as a function of the number of samples. The definitions of the cores are the same as in Figure 2. The size of the core50% is 283 705 genes. (c) The number of core genes as a function of the inclusion criterion (percentage of the population having the gene).
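The relationship in panel (c) between the inclusion criterion and the core size can be sketched as follows, taking a core at X% to be the set of genes present in at least X% of the subjects. The function name `core_sizes` and the toy gene identifiers are hypothetical, and the sketch assumes presence has already been decided per subject (for example by an abundance threshold).

```python
from collections import Counter


def core_sizes(subject_genes, percents):
    """Return (pan size, {percent: core size}).

    `subject_genes` is a list with one collection of gene ids per subject.
    A gene belongs to the core at X% if it occurs in at least X% of subjects.
    """
    n_subjects = len(subject_genes)
    prevalence = Counter()
    for genes in subject_genes:
        prevalence.update(set(genes))  # count each gene once per subject

    sizes = {}
    for pct in percents:
        # Integer comparison of count/n >= pct/100 avoids float rounding.
        sizes[pct] = sum(
            1 for count in prevalence.values() if count * 100 >= pct * n_subjects
        )
    return len(prevalence), sizes


# Toy data: four subjects, four genes (identifiers are made up).
subjects = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g1", "g4"}, {"g1"}]
pan, cores = core_sizes(subjects, [25, 50, 100])
print(pan, cores)
```

The core size can only stay the same or shrink as the inclusion criterion becomes stricter, and the pan size corresponds to the set of genes seen in at least one subject.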
