Reader Comments

Invitation to comment on the implications of the study in light of the NIH’s modified policy on free access to aggregate genotype data

Posted by PLOS_Genetics on 03 Sep 2008 at 10:10 GMT

PLoS Genetics invites a community discussion on the implications of this study with respect to forensic medicine and public access to genomewide association study (GWAS) data. In response to the findings, the NIH has modified its policy on free access to aggregate genotype data. The major concern is that it may be possible to determine, directly from the observed average allele frequency at hundreds of thousands of loci, that an individual, whose whole genome genotype profile is known, participated in a study. The new NIH policy can be read here: The Wellcome Trust Case Control Consortium in England and the Broad Institute of MIT and Harvard in Boston have also agreed to remove aggregate GWAS data from open access, although individual researchers retain the right to apply for permission to work with the data. We invite opinion on the inferences that can be drawn from the science presented in this study, the impact of data restriction on the conduct and analysis of GWAS, and positive suggestions regarding how to protect patient identities while retaining community access to genomic data of all types.

Peter Visscher, Associate Editor; Greg Gibson, Section Editor

Forensic Science application

nrbiocom replied to PLOS_Genetics on 05 Sep 2008 at 06:13 GMT

To the authors,

We have read your article Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays with interest. However, at the risk of a cliché, we suggest that just because something is possible doesn’t mean we should do it. We have no doubt that you have developed a technique allowing individual differentiation from complex and disparate mixtures using SNPs. However, in suggesting the transfer of this technology to forensic science, you’ve failed to consider some unique aspects of forensic work.

In forensic work, one must consider not only whether something is present, but also its relevance to the crime event. For biological material, knowledge of the physiological origin of a sample contributes greatly to an understanding of its relevance. For example, the determination of a stain as blood, semen, or saliva is of specific value in determining its significance to the alleged incident. For small traces of DNA, especially those found in mixtures, this determination is not possible, commensurately reducing the significance of the finding. The smaller the DNA trace, and the greater the number of contributors to the sample, the more difficult it is to relate the contribution to the crime event. These challenges already exist for low copy number (“LCN”) DNA samples analyzed by STR typing. Whether the profile is inadvertently of low copy number (e.g. a minor contributor to a visible stain), purposely low copy number (e.g. sampled for contact DNA), or whether extra PCR cycles are employed, the same interpretational constraints exist. The forensic community is currently wrestling with how to understand the import of various types of LCN testing; the SNP testing you suggest only exacerbates these issues.

A second problem regarding forensic samples is the dynamics of PCR amplification. Obviously, PCR amplification is fundamental for forensic use, as minute samples are the rule. In your experimental samples you employ almost one µg of DNA, about one thousand times the amount typically tested in forensic analysis. Further, that one ng or so of DNA from a crime scene sample is then amplified at specific locations using PCR. It is well known that amplified mixtures do not exactly reflect the true composition of the original sample. Preferential amplification among alleles at a locus, differential amplification between loci, and differential amplification between various components of a mixture are known characteristics of the system. Further, the PCR process typically fails to proportionally amplify DNA from any individual contributing less than 10% of the total DNA to a sample. The SNP analysis of large un-amplified samples you have developed does not solve the differential amplification problem and would be eclipsed by it. While an independent double check on mixture interpretation would indeed be welcome, a system that requires a large amount of non-amplified DNA could not take advantage of the approach that you advocate.

We do appreciate the privacy concerns raised about the ability to detect individuals whose profiles are already known among compilations of published databases.

Thank you for your consideration of our comments.

Norah Rudin, Ph.D., Forensic DNA Consultant
Keith Inman, M. Crim, Senior Forensic Scientist, Forensic Analytical Sciences, Inc.; Assistant Professor, Criminal Justice Administration, California State University, East Bay

RE: Forensic Science application

davcraig replied to nrbiocom on 07 Sep 2008 at 09:32 GMT

Drs. Rudin and Inman,

Thank you for your insightful post. You raise many excellent questions that highlight some interesting areas of further research. However, I would guess that you would be surprised at the robustness of high-density genotyping technology. At the core of your questions are concerns about the amount of DNA used for the high-density SNP genotyping microarrays. Within this study, we did not explore the performance of these arrays with lower amounts of starting DNA. Largely, we focused on challenging a central premise of forensics (whether criminal or medical): that aggregating or pooling samples makes the individual contributors anonymous. However, both in our experience and in other papers (on whole-genome amplification), it is possible to obtain >95% call rates with only a few nanograms of material.

More importantly, it is also intriguing to note that our paper suggests that only 10,000 to 50,000 SNPs are needed to resolve highly complex pools. Thus, if one did not use a whole-genome amplification scheme, one could focus on those SNPs that did amplify (2-10%). Using data when only 2-10% of STRs amplify is problematic when only 13 STRs are used. We have anecdotally been informed that starting with DNA from as few as 30 cells one still obtains usable genotype data for over 10,000 SNPs. Testing whether that is actually the case will require a future study. Moreover, some of the arrays used in this study contain mitochondrial SNPs. We excluded them from the study, though in practice they would be very powerful due to their increased copy number per cell.

Your post also mentions the concern of allele-specific bias of PCR. We first note that the Illumina arrays do not use PCR for amplification, but rather a proprietary phi-29 amplification scheme. Indeed, our method works well on two entirely different amplification approaches. Notwithstanding, there are definitely artifacts of amplification. However, we argue that this again is a limitation of using a few STRs rather than thousands of SNPs. With STRs one relies heavily on a small number of high-information-content markers, and these types of artifacts are very damaging to interpretation. But our approach uses thousands of SNPs. Any specific biases (and there are many) are largely just noise in the overall calculation. Indeed, Figure 2C shows that noise can get quite high (10%), while the ability to resolve is still quite good. The reason is that we rely on cumulative shifts of a large number of markers, rather than so heavily on just a few markers.

Your points about the source of DNA are well taken and, indeed, are a limitation. Mitochondrial SNPs and Y-chromosome SNPs on these platforms have some ability to answer the question of source of DNA. Saliva is known to have certain bacterial populations and one could see markers specific to detecting their presence incorporated within a forensic microarray. There is more work to be done in this area and our paper is only a first step challenging a central premise against the utility of SNPs.

David W. Craig

RE: RE: Forensic Science application

nrbiocom replied to davcraig on 09 Sep 2008 at 02:27 GMT

Dear Dr. Craig,

Thank you for the further information and response to our comments. We look forward to seeing your work develop.

Best regards,

Norah Rudin and Keith Inman

RE: Invitation to comment on the implications of the study in light of the NIH’s modified policy on free access to aggregate genotype data

dbalding replied to PLOS_Genetics on 21 Sep 2008 at 11:43 GMT

The changes of data release policy by the NIH, WT and other bodies are unwelcome, and as far as I can see aren't justified by information in the public domain. Information that has been released by the NIH is misleading and relevant information has not been released.

The PLoS Genetics paper by Homer et al., cited in an NIH e-mail to corresponding authors of GWAS papers, does not deal with the issue of determining, from summary genotype data, whether an individual has participated in a GWAS, except for a brief mention in closing the discussion. Appropriately, the authors' press release issued on Aug 27 (see link below) does not mention GWAS. The paper is about distinguishing individuals contributing DNA to a mixture in a forensic identification setting.

The distinction is important because of the reference population. Essentially the method proposed in the paper takes a DNA profile from a single individual, and two sets of allele frequencies, and returns a difference in "distance" between the query individual's profile and the two sets of allele frequencies. Standardising this statistic and assuming normality, they derive a test of the null hypothesis that the query individual is closer genetically to the population from which one set of allele frequencies has been obtained, than to the population from which the second set was obtained. IF the individuals underlying the two sets of allele frequencies have been drawn from the same population THEN the query individual will not be expected to be significantly closer to one set of allele frequencies than the other, unless this individual is included among those from whom one set of allele frequencies has been obtained. If the populations are (perhaps subtly) different, however, the test will only establish that the query individual is genetically closer to one population than the other, without necessarily being one of the individuals sampled.
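For readers who want to see the shape of such a statistic, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reproduction of the authors' code: the function name `membership_statistic`, the coding of genotypes as allele dosages (0, 0.5 or 1), and the choice of zero as the null mean are all assumptions made for this example.

```python
import math
import statistics

def membership_statistic(query, ref_freq, mix_freq):
    """Standardised difference in distance (hypothetical helper).

    query    : per-SNP allele dosage of the query individual (0, 0.5 or 1)
    ref_freq : per-SNP allele frequency in the reference set
    mix_freq : per-SNP allele frequency in the mixture (or study cohort)

    A positive value means the query is, on average, closer to the
    mixture frequencies than to the reference frequencies.
    """
    # Per-SNP difference in distance: distance to reference minus distance to mixture.
    d = [abs(y - r) - abs(y - m) for y, r, m in zip(query, ref_freq, mix_freq)]
    # Standardise the mean difference across SNPs (null mean assumed to be 0).
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
```

Under the normality assumption described above, the returned value would be compared with a standard normal distribution. Swapping the reference and mixture arguments simply flips its sign.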

In the forensic identification setting, the two sets of allele frequencies are (1) from a reference set of individuals, and (2) a mixture of DNA from a number of individuals, such as those involved in a mass-disaster. In the forensic setting, the notion of a general population reference database is a familiar one, and often will not be problematic. However, as Homer et al note, the individuals underlying the reference database must be similar genetically to the individuals in the mixture, otherwise the method will be prone to false positives: anyone genetically resembling the individuals in the mixture more than the individuals in the database could be wrongly attributed to the mixture. As an extreme example, consider an African query individual and a mixture that does not include DNA from this individual but does include other Africans of similar ancestry, as well as Asians, compared with an Asian reference database. The query individual is likely to be wrongly included as a contributor to the mixture. In the artificial simulations of the paper, it was easy for the authors to ensure that this problem did not occur. Although Homer et al. don't say in the paper how their reference database was obtained, I understand from personal communication with corresponding author David Craig that the individuals in the mixture and those generating the reference set were all drawn from the same population.
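The false-positive risk from a mismatched reference can be illustrated with a small simulation. This is a toy sketch only: the two populations, their allele frequencies (drawn independently, a deliberately exaggerated divergence), the sample sizes and all function names are assumptions for illustration and do not come from the paper. The query individual is drawn from population A but is not part of the mixture.

```python
import math
import random
import statistics

def draw_genotype(p, rng):
    """Allele dosage (0, 0.5 or 1) for a SNP with allele frequency p."""
    return ((rng.random() < p) + (rng.random() < p)) / 2

def sample_frequencies(pop_freqs, n_people, rng):
    """Observed per-SNP allele frequencies in a random sample of n_people."""
    return [statistics.fmean([draw_genotype(p, rng) for _ in range(n_people)])
            for p in pop_freqs]

def t_statistic(query, ref_freq, mix_freq):
    """Standardised mean of |query - ref| - |query - mix| across SNPs."""
    d = [abs(y - r) - abs(y - m) for y, r, m in zip(query, ref_freq, mix_freq)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

rng = random.Random(1)
n_snps = 1000
pop_a = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
pop_b = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]  # unrelated to pop_a (exaggerated)

query = [draw_genotype(p, rng) for p in pop_a]       # from A, NOT in the mixture
mixture = sample_frequencies(pop_a, 30, rng)         # 30 other people from A
matched_ref = sample_frequencies(pop_a, 30, rng)     # reference drawn from A
mismatched_ref = sample_frequencies(pop_b, 30, rng)  # reference drawn from B

t_matched = t_statistic(query, matched_ref, mixture)
t_mismatched = t_statistic(query, mismatched_ref, mixture)
print(f"matched reference:    T = {t_matched:.2f}")
print(f"mismatched reference: T = {t_mismatched:.2f}")
```

With a matched reference the statistic should stay near zero, while the mismatched reference should push it strongly positive even though the query never contributed to the mixture. Real populations differ far more subtly than in this toy setup, so the size of the effect here says nothing about its size in practice.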

Translating to the GWA setting, instead of a forensic mixture we have the allele frequencies from a set of individuals participating in a GWA. The NIH Background Fact sheet of August 28 says that

"... method that allows the detection of a single person’s SNP profile in a mixture of 1,000 or more individual DNA samples"

and goes on to say

"... the inquirer would first need to already have a highly-dense genomic profile (currently at least 10,000 SNPs) from an individual. Then this SNP profile would need to be statistically compared against the study dataset to measure how similar or different it is".

These claims are misleading, in my opinion, because they neglect the crucial role of the reference population frequencies. Consider a malicious user of the method, who obtains a genome-wide genotype of a query individual and wants to compare it with the allele frequencies from a GWAS. They cannot (as these quotes suggest) make any inferences about membership in the study without a suitable reference database. Of course (s)he could use for this purpose other GWAS typed on the same genotyping platform, but how would (s)he ensure that the query individual is not genetically more similar to the individuals in one GWAS than the other, without actually being included in either of the studies? The problem seems insurmountable in practice, in part due to subtle differences in genotype calling and recruitment in different studies. Artificial simulations can get good results, but making the scheme work reliably in practice seems unrealistic, and the remote threat that it might pose seems no justification for the action that has been taken.

In correspondence with people involved in the NIH and WT actions I have been assured that tests have been performed with good results, but there seems to be no information in the public domain about these tests. Curiously, nobody I contacted seemed able to answer the question of "which reference database?", even though this is crucial to the method.

Restricting access to data will impede science. Although scientists will still be able to access data eventually, in practice obtaining the relevant bureaucratic approvals can take months and researchers may be deterred from trying. If I am wrong in my opinion expressed above, and if there is a case for taking this action, I submit that the reasons should be presented in the public domain - so far the justifications that have been released, as far as I am aware, are inadequate.

RE: RE: Invitation to comment on the implications of the study in light of the NIH’s modified policy on free access to aggregate genotype data

davcraig replied to dbalding on 23 Sep 2008 at 07:55 GMT

Dear Dr. Balding,

In my two separate experiences, it took a few weeks to obtain individual genotype data from dbGAP. I make this remark because it is not our goal to slow scientific progress. Considering that contributing to dbGAP is tied to NIH funding of GWAS, and that for many studies protecting identity is a high priority, the question of identifiability of subjects is an important one to address now rather than later.

Your additional point about the difficulty of finding reference populations is quite easy to address. Since most GWA case-control studies release data for two matched cohorts (cases and controls), simply assign one cohort as the reference and the other as the mixture per the formula in the paper (noting that you now have a two-sided test). Thus, the concerns you raise are largely taken into account, since presumably the researchers would have made reasonable attempts to match the cohorts, addressed recruitment concerns, genotyped using the same platform, etc. This straightforward approach for addressing the concerns you raise about 'neglecting a reference population' underlies why healthy debate on the matter is needed: it is more difficult than one would anticipate to fully mask identity in summary-level data. I would contend that until we better understand when identity is preserved, it would be better not to have the data posted directly to the web.
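The case-control variant can be sketched as a toy simulation. Everything here is assumed for illustration: cases and controls are drawn from one simulated population, the cohorts are kept tiny (10 people each) so the example runs quickly, and the function names and the form of the statistic are hypothetical rather than taken from the paper. Controls play the role of the reference and cases the role of the mixture, so membership in either cohort shows up as a shift in one direction or the other.

```python
import math
import random
import statistics

def draw_person(pop_freqs, rng):
    """Allele dosages (0, 0.5 or 1) for one simulated person."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in pop_freqs]

def cohort_frequencies(cohort):
    """Per-SNP allele frequency summary of a cohort (list of persons)."""
    return [statistics.fmean(snp) for snp in zip(*cohort)]

def t_statistic(query, ref_freq, mix_freq):
    """Standardised mean of |query - ref| - |query - mix| across SNPs."""
    d = [abs(y - r) - abs(y - m) for y, r, m in zip(query, ref_freq, mix_freq)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

rng = random.Random(7)
n_snps, cohort_size = 4000, 10  # tiny cohorts, chosen only for speed
pop = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]

cases = [draw_person(pop, rng) for _ in range(cohort_size)]
controls = [draw_person(pop, rng) for _ in range(cohort_size)]
outsider = draw_person(pop, rng)  # same population, in neither cohort

case_freq = cohort_frequencies(cases)        # treated as the "mixture"
control_freq = cohort_frequencies(controls)  # treated as the "reference"

t_case = t_statistic(cases[0], control_freq, case_freq)
t_control = t_statistic(controls[0], control_freq, case_freq)
t_outsider = t_statistic(outsider, control_freq, case_freq)
print(f"case member:    T = {t_case:.2f}")
print(f"control member: T = {t_control:.2f}")
print(f"outsider:       T = {t_outsider:.2f}")
```

In this idealised setting a case member should score strongly positive, a control member strongly negative, and an outsider near zero, which is why the test becomes two-sided. The simulation assumes perfectly matched cohorts, so it does not address how well real cohorts are matched.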