Reader Comments
Paralogous sequences
Posted by AdamEyreWalker on 26 Oct 2010 at 16:36 GMT
It has been pointed out to us that the excess of coincident SNPs could be due to paralogous sequences that have been incorrectly assembled on to the same location. Substitutions between such paralogous sequences would appear to be SNPs. If the same mis-assembly error occurred in both humans and chimpanzees (and macaques) and the substitutions occurred before the species split, then an excess of coincident SNPs would be generated. Musumeci et al. [1] have recently estimated that ~8.3% of all human single SNPs in dbSNP may be artifacts due to this problem.
First, we note that the pattern of coincident SNPs is not consistent with the mis-assembly hypothesis. Under this hypothesis we would expect an excess of transition coincident SNPs, since transitions dominate the process of mutation and substitution; instead we observe a stronger excess of transversions, and in particular of AT/AT coincident SNPs.
However, to explore the issue further we performed two analyses. In the first we repeated the analysis of Musumeci et al. [1] on our coincident SNPs. They BLASTed human SNPs from dbSNP against the human genome and considered cases in which the SNP mapped to two or more locations, where a successful match was defined as one in which at least 20% of the full-length SNP sequence had at least 90% identity. The SNP was considered to be potentially artifactual if the two bases involved in the SNP were found in the two different mapped locations at the site of the putative SNP. We repeated this analysis using coincident SNPs and found that of our 11571 coincident SNPs, 9611 mapped to a unique location; 233 had multiple matches, but the matched locations did not contain the nucleotides involved in the SNP; and 269 had multiple matches to the reference genome with the nucleotides involved in the SNP found at the site of the SNP in the two locations; 95 SNPs did not match the reference. We therefore estimate that at most 2.3% of coincident SNPs are due to known duplicated sequences. This analysis suggests that known paralogy can only explain a very small fraction of our coincident SNPs.
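The classification rule described above can be sketched roughly as follows. This is only an illustration of the logic, not the pipeline that was actually run: the `Hit` record, the `classify` function and the category labels are hypothetical names, and the sketch assumes each BLAST hit has already been reduced to an aligned fraction, an identity and the base found at the SNP position.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """One BLAST hit of a SNP's flanking sequence (hypothetical record)."""
    aligned_fraction: float  # fraction of the full-length SNP sequence aligned
    identity: float          # fractional identity over the aligned part
    base_at_snp: str         # base found at the SNP position in this location


def classify(alleles: Tuple[str, str], hits: List[Hit],
             min_fraction: float = 0.20, min_identity: float = 0.90) -> str:
    """Assign a SNP to one of the four categories reported in the text."""
    # Keep only hits that satisfy the match definition (>=20% length, >=90% identity).
    good = [h for h in hits
            if h.aligned_fraction >= min_fraction and h.identity >= min_identity]
    if not good:
        return "no_match"
    if len(good) == 1:
        return "unique"
    # Potentially artifactual: both SNP alleles are seen across the mapped locations.
    bases = {h.base_at_snp for h in good}
    if set(alleles) <= bases:
        return "multiple_with_alleles"
    return "multiple_without_alleles"


# Upper bound on the share of coincident SNPs explained by known paralogy,
# using the counts quoted in the text.
total_coincident = 11571
multiple_with_alleles = 269
upper_bound_percent = 100 * multiple_with_alleles / total_coincident
```

For example, `classify(("A", "G"), [Hit(1.0, 1.0, "A"), Hit(0.5, 0.95, "G")])` falls in the `"multiple_with_alleles"` category, the only one that counts towards the upper bound.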
However, this analysis does not allow us to assess the impact of duplicated regions that are yet to be identified. We therefore considered whether the minor allele frequency (MAF) of the human SNP in the coincident SNP was greater than that of randomly selected human SNPs from dbSNP. If a disproportionate number of coincident SNPs are due to mis-assembly then they should have a higher MAF, because an artifactual SNP generated by a substitution between two duplicated regions should have a MAF of 50%. We were able to obtain the MAF for 7801 of our coincident SNPs; these have a mean MAF of 0.274 (95% CI 0.270, 0.277). The same number of randomly chosen SNPs have a mean MAF of 0.271 (95% CI 0.267, 0.274); the difference is not significant (t-test, p=0.241). Again there is no evidence that paralogy is contributing significantly to the excess of coincident SNPs.
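As a rough consistency check on the comparison above, the test can be approximated from the summary statistics alone. The sketch below is not the t-test that was run on the per-SNP data: it assumes the intervals are symmetric normal 95% CIs (with 0.277 as the upper limit for the coincident SNPs), recovers each standard error from the CI width, and uses a normal approximation, which should be close to a t-test with roughly 7800 SNPs per group. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(mean_a, ci_a, mean_b, ci_b, level=0.95):
    """Approximate two-sided p-value for a difference in means from reported CIs."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    # Standard error of each mean, recovered from the half-width of its CI.
    se_a = (ci_a[1] - ci_a[0]) / (2 * z_crit)
    se_b = (ci_b[1] - ci_b[0]) / (2 * z_crit)
    z = (mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)
    return 2 * (1 - std_normal.cdf(abs(z)))


# Coincident SNPs versus randomly chosen dbSNP SNPs, values as quoted in the text.
p_value = approx_two_sided_p(0.274, (0.270, 0.277), 0.271, (0.267, 0.274))
```

Because the means and interval limits are only reported to three decimal places, the result should land near, but not exactly on, the reported p=0.241; either way it is well above conventional significance thresholds.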
Acknowledgements: We are very grateful to Richard Durbin, Ewan Birney, Peter Keightley, Philip Johnson and Ines Hellman for helpful discussion.
1. Musumeci L, Arthur JW, Cheung FS, Hoque A, Lippman S, et al. (2010) Single nucleotide differences (SNDs) in the dbSNP database may lead to errors in genotyping and haplotyping studies. Hum Mutat 31: 67-73.
RE: Paralogous sequences
AdamEyreWalker replied to AdamEyreWalker on 12 Nov 2010 at 09:53 GMT
There is a slight mistake in the report above; I incorrectly stated that 95 coincident SNPs didn't match the reference human genome, whereas in fact this number was 1363. Therefore at most 2.6% of the coincident SNPs that match the reference genome are due to paralogy. Our conclusions remain unaffected.
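The corrected figure can be checked with a short calculation. It assumes the denominator is the number of coincident SNPs that match the reference genome, i.e. the 11571 total minus the 1363 without a match, and that the numerator is the 269 SNPs with multiple matches carrying both alleles; the variable names are illustrative only.

```python
total_coincident = 11571        # all coincident SNPs
no_match = 1363                 # corrected count of SNPs not matching the reference
multiple_with_alleles = 269     # multiple matches that carry both SNP alleles

# Restrict the denominator to SNPs that match the reference genome.
matching_reference = total_coincident - no_match
corrected_upper_bound_percent = 100 * multiple_with_alleles / matching_reference
```

Rounded to one decimal place this gives the 2.6% quoted in the correction, compared with 2.3% when all 11571 coincident SNPs are used as the denominator.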