Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads
(a) The probability distribution of the predicted copy number of a given KO in a given dataset, across all simulated 101-bp datasets, using the top gene protocol when the strain from which the reads originated is absent from the database. Only KOs with true copy numbers 1 to 4 are shown. The curve for copy number 0 represents false positive KO predictions. The smaller peaks visible in some curves (e.g., the two extra peaks in the blue “1 copy” curve) were found to be due to stretches of intergenic reads that mismapped to KO genes in the database; they likely reflect genomic misannotations or pseudogenes. (b) The average recall, across all simulated 101-bp datasets, for identifying reads originating from each KO, with KOs ranked from highest to lowest average recall. 95% confidence intervals are shown in green. Recall is calculated for the case where the strain from which the read originated is absent from the database.
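As a rough illustration of the quantity plotted in panel (b), the Python sketch below computes per-KO recall in each dataset, averages it across datasets, attaches a 95% confidence interval, and ranks KOs from highest to lowest average recall. The normal-approximation interval, the function names, and the toy KO identifiers and recall values are assumptions made for illustration only; the caption does not state how the intervals were actually computed.

```python
from math import sqrt
from statistics import mean, stdev


def recall(true_positives, total_reads):
    """Fraction of reads originating from a KO that were assigned to that KO."""
    return true_positives / total_reads if total_reads else 0.0


def mean_recall_with_ci(recalls, z=1.96):
    """Return (mean, lower, upper) for one KO across datasets.

    Assumption: a normal-approximation 95% interval on the mean,
    clipped to [0, 1]. The source may have used a different method.
    """
    m = mean(recalls)
    if len(recalls) < 2:
        # A single dataset gives no spread to estimate an interval from.
        return m, m, m
    half_width = z * stdev(recalls) / sqrt(len(recalls))
    return m, max(0.0, m - half_width), min(1.0, m + half_width)


def rank_kos(per_ko_recalls):
    """Rank KOs from highest to lowest average recall."""
    summary = {ko: mean_recall_with_ci(r) for ko, r in per_ko_recalls.items()}
    return sorted(summary.items(), key=lambda item: item[1][0], reverse=True)


# Hypothetical per-dataset recall values for two KOs (not from the source).
toy = {
    "K00002": [0.5, 0.4, 0.6],
    "K00001": [0.9, 0.8, 1.0],
}
ranked = rank_kos(toy)
for ko, (avg, low, high) in ranked:
    print(f"{ko}: mean recall {avg:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Clipping the interval to [0, 1] keeps it inside the valid range for recall; with few datasets per KO, a t-based or bootstrap interval would be a reasonable alternative.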