Conceived, designed and directed the research project: CW. Drafted the manuscript: CW XY. Revised the manuscript: CW LX YL. Developed and implemented the algorithm, collected datasets, and devised the gene counting model: XY. Participated in discussion: LX. Conceived and designed the research work: YL.
The authors have declared that no competing interests exist.
Estimating the number of genes in the human genome has long been an important problem in computational biology. With the newer view of the human as a super-organism, it is also of interest to estimate the number of genes in this human super-organism.
We present an estimate of the number of genes in the human gut bacterial community, the largest microbial community within the human super-organism. We obtained 552,700 unique genes from 202 complete human gut bacterial genomes. We then built a novel gene counting model that estimates the total number of genes by combining culture-independent sequence data with those complete genomes. 16S rRNA sequences were used to construct a three-level tree, and a different counting method was introduced for each of the three levels: strain-to-species, species-to-genus, and genus-and-up. The model estimates the total number of genes at about 9,000,000 after genes with a sequence identity of 97% or higher are merged.
By combining the complete genomes currently available with culture-independent sequencing data, we built a model to estimate the number of genes in the human gut bacterial community. The total number of genes is estimated to be about 9 million. Although this number is large, we believe it is an underestimate. This work is an initial step toward solving the gene counting problem for the human super-organism, which is likely to remain an open problem in the near future.
The list of genomes used in this paper can be found in the supplementary table.
Estimating the number of genes in the human genome has been one of the most fundamental problems in computational biology. The number of genes estimated for the human genome dropped from more than 100,000
On the other hand, the human body contains more than the human genome. Microbes are ubiquitous in and on the human body, for example in the lung, on the skin and in the oral cavity. A new concept is to consider the human as a super-organism that includes the microbes living in or on the human body as well
With the progress of molecular biotechnology and the accumulation of data, our understanding of genes and disease is undergoing a revolution: diseases are associated not only with genes in the human genome but also with genomes from the environment around and inside the human body. A notable example of what changes in such environmental genomes can cause is the human gut bacterial community. Changes in the human gut bacterial community are associated with obesity
Based on the diversity of gut microbes and the average number of genes in a microbial genome, the number of genes in the human gut microbiota was roughly estimated to be 100 times greater than that of the human genome
In this paper, we present a model to estimate the number of protein-coding genes in the human gut bacterial community, the largest microbial community in or on the human body. We estimate both the overall number of genes in the human gut and the number of core genes, which are conserved across multiple microbes.
While updated and further accumulated genomic and metagenomic data may yield a more reliable and accurate estimate, the way genes are defined for counting can change the result by an order of magnitude. Therefore, a clear definition of genes should be given before any gene counting begins. In the upcoming section, we will define the terms we use in the paper,
In this section, we define the following six terms used in this paper: genes, orthologs, paralogs, core genome, pan-genome and genome combination. Their meanings may differ from those used elsewhere in the literature.
The term “gene” in this paper stands for a “protein-coding” gene. This excludes other functional genes, such as RNA genes.
Orthologs are genes found in different species that originated from a common ancestor and thus often have similar functions. Here, we call two genes from different genomes “orthologs” if their sequence similarity reaches a certain threshold.
Paralogs are genes generated by gene duplication within the same genome; they do not always have the same function. In this paper, two genes from the same genome are called “paralogs” if their similarity reaches a certain threshold.
The core genome is the set of genes common to every selected genome. The term “core genome” is usually applied to two or more genomes; we extend its usage to a single genome. The core genome of a single genome can be viewed as its set of non-duplicated genes, or the “de-paralogged” genome.
The pan-genome is the whole set of genes in a number of genomes, including core genes, which are shared by all genomes; partially shared genes, which are found in some genomes but absent from others; and strain-specific genes. The concept of “pan-genome” is typically used at the genus or species level; we extend it to higher taxa. We also allow a pan-genome of a single genome, which denotes the set of non-redundant genes of that genome.
The term “combination” of genomes is used in our counting of total genes and core genes. Two similarity cutoffs are set in a combination: the paralog threshold and the ortholog threshold. The threshold acts as a resolution: at a threshold of 0.90, for example, two genes that share a similarity of 0.95 cannot be told apart. As a result, after the combination, genes whose similarity is above the predetermined threshold are recognized as the same gene. The union of a combination is the pan-genome, while the intersection is the core genome. Accordingly, a higher similarity threshold (finer resolution) yields a larger pan-genome, and a lower similarity threshold (coarser resolution) yields a larger core genome.
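The combination step can be illustrated with a minimal sketch. It assumes that genes are merged transitively (single linkage through a union-find structure), which is an assumption of this example and not a detail stated in the text; the function name `combine_genomes`, its input layout and the toy data are likewise hypothetical.

```python
def combine_genomes(genes, similarities, paralog_cutoff, ortholog_cutoff):
    """Merge genes at the given thresholds; return (pan_size, core_size).

    genes:        dict mapping gene id -> genome id
    similarities: iterable of (gene_a, gene_b, similarity) tuples
    """
    parent = {g: g for g in genes}

    def find(g):
        # Walk to the cluster representative, shortening the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, sim in similarities:
        # Same genome -> paralog threshold; different genomes -> ortholog threshold.
        cutoff = paralog_cutoff if genes[a] == genes[b] else ortholog_cutoff
        if sim >= cutoff:
            parent[find(a)] = find(b)

    # For every merged gene, record which genomes contribute to it.
    clusters = {}
    for g, genome in genes.items():
        clusters.setdefault(find(g), set()).add(genome)

    all_genomes = set(genes.values())
    pan_size = len(clusters)  # union of the combination
    core_size = sum(1 for members in clusters.values() if members == all_genomes)
    return pan_size, core_size


# Toy data (made up for illustration): two genomes, A and B.
toy_genes = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
toy_sims = [("a1", "a2", 0.98), ("a1", "b1", 0.95), ("a3", "b2", 0.99)]

print(combine_genomes(toy_genes, toy_sims, 0.97, 0.97))  # (3, 1)
print(combine_genomes(toy_genes, toy_sims, 0.90, 0.90))  # (2, 2)
```

On this toy input, lowering the threshold from 0.97 to 0.90 shrinks the pan-genome from three genes to two and enlarges the core genome from one gene to two, which is the trade-off described above.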
Two hundred and two human gut bacterial genomes were selected (see
Each node denotes the total gene number after genome combination, averaged over 30 random samplings. The sample size ranges from ten to two hundred genomes, in steps of ten. The circle, triangle, diamond, cross, square and asterisk markers correspond to thresholds of 1.00, 0.97, 0.90, 0.80, 0.70 and 0.60, respectively. Paralogs and orthologs used the same threshold in each combination. Accessions and information for the two hundred and two genomes used in the figure can be found in
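The sampling procedure behind such a curve can be sketched as follows. The sketch assumes that every genome has already been reduced to a set of merged-gene identifiers at one fixed threshold, so that the size of a combination is simply the size of a set union; that representation, the function name `pan_genome_curve` and the toy genomes are assumptions of this example.

```python
import random


def pan_genome_curve(genomes, sample_sizes, repeats=30, seed=0):
    """Average pan-genome size for each sample size.

    genomes:      list of sets of merged-gene identifiers, one set per genome
    sample_sizes: iterable of sample sizes, each between 1 and len(genomes)
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    curve = []
    for k in sample_sizes:
        total = 0
        for _ in range(repeats):
            sample = rng.sample(genomes, k)      # k genomes, without replacement
            total += len(set().union(*sample))   # size of their combined gene set
        curve.append((k, total / repeats))
    return curve


# Toy data (made up): 20 genomes of 5 genes each, with overlapping gene ids.
toy_genomes = [set(range(start, start + 5)) for start in range(0, 40, 2)]
toy_curve = pan_genome_curve(toy_genomes, range(5, 21, 5))
for size, mean_genes in toy_curve:
    print(size, mean_genes)
```

For the figure described above, the corresponding call would use `range(10, 201, 10)` as the sample sizes and the default of 30 repeats.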
Thirty nine
Figure A shows the pan-genome of 39
The coefficients for the curve fit functions generated using least squares method at different thresholds for 39
Category | Similarity | Avg Genes | Coefficients* |
 | 0.97 | 4756 | |
 | 0.9 | 4707 | |
 | 0.8 | 4674 | |
 | 0.6 | 4567 | |
 | 0.3 | 4393 | |
 | 0.97 | 3524 | |
 | 0.9 | 3501 | |
 | 0.8 | 3483 | |
 | 0.6 | 3387 | |
 | 0.3 | 2685 | |
This table provides information for the 39
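As an illustration of a least squares curve fit of this kind, the sketch below assumes a power-law form n = kappa * N^gamma for the pan-genome size n after combining N genomes. This form is often used for pan-genome growth curves, but it is an assumption of this example and not necessarily the function fitted in this study. The fit is an ordinary least squares regression on the log-transformed values; the function name `fit_power_law` and the synthetic data are hypothetical.

```python
import math


def fit_power_law(genome_counts, pan_sizes):
    """Fit pan_size = kappa * genome_count ** gamma by least squares in log-log space."""
    xs = [math.log(n) for n in genome_counts]
    ys = [math.log(s) for s in pan_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Closed-form slope and intercept of a simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    kappa = math.exp(mean_y - gamma * mean_x)
    return kappa, gamma


# Synthetic data generated from a known power law, not measured values.
counts = [1, 2, 4, 8, 16]
sizes = [5000 * n ** 0.5 for n in counts]
kappa, gamma = fit_power_law(counts, sizes)
print(round(kappa, 3), round(gamma, 3))
```

Fitting in log-log space minimizes squared errors of the logarithms rather than of the raw gene counts; a fit on the raw values would need an iterative nonlinear solver.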
In contrast to genome within one species, the 24
Figure A shows how the gut bacterial community is visualized as a bamboo. B gives an example how the genus
The two requisite elements of our counting are an inventory of the citizens that inhabit our guts and a selection of model genomes fit to represent all of these citizens. The Human Microbiome Project provides hundreds of sequenced bacterial genomes, from which we can pick our models whenever the number of sequenced genomes within a species or genus is large enough (we modeled from 39
The composition of a bacterial community is usually studied with culture-based methods. However, the majority of gut citizens are “unculturable”. High-throughput sequencing technologies and culture-independent methods enable us to sequence microbes that cannot be cultured in the lab. In particular, direct sequencing of 16S rRNA genes provides a culture-independent way to detect the otherwise inaccessible majority.
In our study, the composition of human gut bacteria was obtained by searching RDP's browser
The pie chart shows the distribution of gut bacteria obtained by searching RDP browser. Others are:
Bacteria from 8 of these 13 phyla were spotted in human gut microbiota in 2005
The distance matrices for gut bacteria were downloaded from the RDP browser at the genus level. The distance matrix for each genus was then used to cluster its 16S rRNA sequences into groups (species) at a cutoff of 0.02, which is a commonly used threshold for species
Although our search of the RDP browser returned 40,180 sequences, those marked “unclassified” were excluded from our analysis because our trees were built from strains in the same genus.
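Grouping the 16S rRNA sequences of one genus into species-level clusters at a distance cutoff of 0.02 can be sketched as below. The linkage rule is not named in the text; this example assumes furthest-neighbor (complete) linkage, under which every pair of sequences inside a cluster is within the cutoff. The function name `cluster_by_cutoff` and the distance values are hypothetical.

```python
def cluster_by_cutoff(dist, cutoff=0.02):
    """Agglomerative clustering with complete linkage, stopped at `cutoff`.

    dist: symmetric square matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of sequence indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: distance between the two most distant members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break  # the closest pair of clusters is already too far apart
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy matrix (made up): sequences 0 and 1 are close, 2 is near 1 only, 3 is distant.
toy_dist = [
    [0.00, 0.01, 0.03, 0.10],
    [0.01, 0.00, 0.015, 0.10],
    [0.03, 0.015, 0.00, 0.10],
    [0.10, 0.10, 0.10, 0.00],
]
species = cluster_by_cutoff(toy_dist, cutoff=0.02)
print(species)       # [[0, 1], [2], [3]]
print(len(species))  # 3
```

Sequence 2 stays in its own cluster because its distance to sequence 0 (0.03) exceeds the cutoff; a nearest-neighbor (single-linkage) rule would have merged it with sequences 0 and 1, so the choice of linkage affects the species count.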
We tested the fidelity of our strain-to-species model and species-to-genus model in the 39 E. coli strains and 24
16S rRNA sequences of 164 genera were downloaded from the RDP browser. More than 94% of the 164 genera have fewer than 30 species, and more than 84% of the 826 species have fewer than 40 strains.
The final step of our counting is to apply our models to every strain in every species with the model validated in the 39
Similarity | 0.97 | 0.90 | 0.80 | 0.60 | 0.30 |
Genes | 8,988,806 | 6,533,896 | 5,799,165 | 4,071,772 | 2,932,368 |
The strain-to-species and species-to-genus gene counting steps used the same similarity threshold in each estimate: 0.97, 0.90, 0.80, 0.60 or 0.30. Details of the gene counting model can be found in the gene counting model part of the Results section.
Core genome denotes the set of genes which can be found in every genome, at a certain level of similarity. We studied the core genome for genus
Each genome represents a different species. Accessions and information for the ten genomes can be found in
Core genes are much more common at the species level than at the genus level, as indicated by our core-genome analysis of seven
126 uncorrected distance matrices (DNADist format) were downloaded from the RDP website
For each genus, a percent identity cutoff of 0.98
The 202 genomes in
Protein-coding sequences were downloaded from the NCBI FTP site if they belonged to complete genomes, or extracted from the NCBI annotation files if they came from “draft assembly” genomes. NCBI annotations for bacteria were produced by the Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP), which predicts genes using a combination of GeneMark and Glimmer. Lacking evidence that would confidently identify a gene, such as the existence of RNA or similarity to known proteins, we evaluated the genes provided by NCBI by their lengths.
This figure shows the distribution of gene lengths in the 202 gut bacterial genomes, as annotated by NCBI. The genes used in this study were annotated by NCBI as protein-coding sequences. The y-axis shows the number of genes of a given length.
The genes of the 202 genomes were compared all-against-all with WU-Blast 2.0 blastn to generate tabular output, a tab-delimited text file. The WU-Blast options were -w 14, -wink 8, -e 10, -Q 9, -R 8. The tabular output was then used to compute the similarity percentage (or identity percentage) between each query and subject sequence by dividing the number of matches by the total length of the query sequence (Nmatch/Nall). The extracted files, which contain the query and subject gene accessions and their similarity percentage, were then used in the genome combination for the overall gene and core gene analyses.
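The similarity calculation (Nmatch/Nall) can be sketched as follows. The input layout used here, four tab-separated fields holding the query accession, subject accession, number of matches and query length, is a simplified, hypothetical stand-in and not the actual WU-Blast tabular format. Keeping the best value per query and subject pair and skipping self-hits are also assumptions of this example.

```python
def load_similarities(lines):
    """Return {(query, subject): similarity}, with similarity = Nmatch / Nall.

    Each input line holds four tab-separated fields (hypothetical layout):
    query accession, subject accession, number of matches, query length.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        query, subject, n_match, query_length = line.split("\t")
        if query == subject:
            continue  # a gene compared with itself carries no information
        similarity = int(n_match) / int(query_length)
        key = (query, subject)
        # Keep the highest similarity when a pair is reported more than once.
        best[key] = max(similarity, best.get(key, 0.0))
    return best


# Toy records (made up for illustration).
toy_lines = [
    "g1\tg2\t970\t1000",
    "g1\tg2\t900\t1000",
    "g1\tg1\t1000\t1000",
    "g2\tg1\t970\t1200",
]
toy_similarities = load_similarities(toy_lines)
print(toy_similarities[("g1", "g2")])  # 0.97
```

Because the denominator is the query length, the measure is not symmetric: in the toy records, g1 against g2 gives 0.97, while g2 against g1 gives a lower value because g2 is longer.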
Estimating the number of genes in or on the human body is an interesting problem with a broad audience. This paper has two highlights:
We present a novel model to estimate the number of genes in the human gut by combining culture-independent sequencing data with the complete microbial genomes currently available.
The total number of genes in the human gut bacterial community is estimated to be 8,988,806, and we believe this is an underestimate.
In a similar analysis, Tettelin et al. showed that the number of genes added to a pan-genome by a new strain varies widely between species. For some species, the number of genes added by each new strain tends to a positive constant, which implies that a microbial species has an infinite number of genes
There are two potential problems in our estimation, since we used all 16S rRNA sequences from the RDP database that met a certain quality criterion. First, whether 16S rRNA is capable of grouping strains into species is still debatable. Second, the 16S rRNA sequences may come from different individuals. Therefore, the estimated number of genes reported in this paper may pool genes from many individuals. In our model, the number of genes in an individual human gut can be estimated by combining the metagenomic data of that individual with the corresponding complete genomes. In fact, the number of unique 16S rRNA sequences in an individual is still unclear. In a recent study, Turnbaugh et al. sequenced 10,000 V6 regions of 16S rRNAs for each of 154 individuals and found little overlap between the sampled fecal communities. The estimation model we present in this paper is relatively simple and is only an initial step toward solving this problem. Therefore, estimating the total number of genes in an individual human gut is still an open problem.
Genomes used in this paper. Figure*2: the numbers in this column identify the figures, in the paper or the supplementary material, in which the genome was used. Taking the third genome as an example, “1,2,8” indicates that NC_002655 was used in the pan-genome analysis for 202 genomes, the pan-genome analysis for 39 E. coli and the core genome analysis for 26 E. coli. Genome*1: the genus name of each genome has been abbreviated to a single capital letter. 1–39: Escherichia; 40–86: Clostridium; 87–97: Bacteroides; 98–107: Campylobacter; 108–110: Providencia; 111: Proteus; 112–116: Listeria; 117: Rickettsia; 118: Salmonella; 119–151: Salmonella enterica subsp. enterica serovar; 152: Salmonella enterica subsp; 153–155: Collinsella; 156: Victivallis; 157–160: Ruminococcus; 161–168: Bacillus; 169–175: Bifidobacterium; 176–183: Shigella; 184–185: Dorea; 186–187: Streptococcus; 188–197: Vibrio; 198–199: Tropheryma; 200: Mitsuokella; 201–203: Lactobacillus; 204: Methanosphaera; 205–207: Eubacterium; 208–209: Parabacteroides; 210–211: Yersinia; 212: Methanobrevibacter; 213: Klebsiella; 214: Akkermansia; 215: Actinomyces; 216: Faecalibacterium; 217: Enterobacter; 218: Anaerostipes; 219: Peptostreptococcus; 220: Coprococcus; 221: Alistipes; 222: Anaerotruncus; 223: Anaerofustis; 224: Roseburia.
(0.03 MB XLS)
Core genome for four species. Figures A, B, C and D show the core genome sizes for 7 C. perfringens, 8 C. difficile, 9 C. botulinum and 26 E. coli genomes, respectively. Accessions and information for the genomes used in this analysis can be found in supplementary
(0.35 MB TIF)
We thank the RDP staff for generating the 16S rRNA distance matrices. We thank Dr. Yanni Sun and Dr. Liping Zhao for their helpful comments. We thank Jie Du, Qiang Zhou, and Jie Pan for their input during the initiation of this project.