Overstating the issue

Posted by djstates on 17 Feb 2011 at 19:26 GMT

Using a primate specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and have found 492 to be contaminated with human sequence.

The lay press has focused on these statistics and implied that 18% of the database entries are contaminated and therefore useless. In fact, even in those genome assemblies that do contain some human sequence, the vast majority of the assembled sequence is correct and of the stated species origin. Further, the authors state that the human contamination is often flanked by N's and located in poorly assembled parts of the genome.

DNA sequence analysis is a highly sensitive assay, and only a single molecule is needed to produce a sequence read. Most sequencing facilities work with a range of DNA sources, and the issue of cross-contamination has been recognized for decades. In fact, this article could equally be viewed as a celebration of the high precision of genome sequence analysis. The fact that only a few hundred moderate-size contaminating contigs were identified in the many gigabases of genome sequence that the community has analyzed and deposited in public repositories indicates that the absolute level of contamination is extremely low.

Indeed, roughly 82% of the genomes analyzed (2,257 of 2,749) were free of detectable contamination. Each of these contamination-free genomes contains millions to billions of nucleotides of data. The fact that they were assembled in an environment filled with human DNA and yet avoided any detectable contamination is a testament to the care taken in the contributing laboratories.
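For readers who want to check the headline percentage, here is a minimal sketch of the arithmetic. It uses only the two counts quoted from the abstract above; the variable names are my own and purely illustrative.

```python
# Counts quoted from the article abstract (databases, not nucleotides).
databases_screened = 2749
databases_with_human_sequence = 492

# Share of databases in which any human sequence was detected.
contaminated_pct = 100 * databases_with_human_sequence / databases_screened
# Share of databases in which none was detected.
clean_pct = 100 - contaminated_pct

print(f"some human sequence detected: {contaminated_pct:.1f}%")
print(f"no contamination detected:    {clean_pct:.1f}%")
```

Note that these percentages count databases, not bases. A database with a single contaminating contig is counted the same as one that is heavily contaminated, so the figure says nothing about what fraction of the deposited sequence is actually human.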

