Empirical Validation of Pooled Whole Genome Population Re-Sequencing in Drosophila melanogaster

The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in individual strain as a reference. We also estimated allele frequency of the SNPs using “pooled” data and compared them with “true” frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing of patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time consuming, or expensive.


Introduction
Efficient assessment of presence and frequencies of singlenucleotide polymorphisms (SNP) in populations is vital to answering key problems in genetics and population biology. For instance, inference of demographic history, identification of causative loci affecting a trait of interest, discovery of cancercausing mutations in mixed pools of cells, or the search for evidence of natural selection in the genome all require knowledge of the frequency spectra in groups of individuals or cells. However, individually sequencing dozens of individuals from each population is often more costly and labor intensive. Multiplexing techniques allow a more efficient use of sequencing resources but still require a large number of individual DNA extractions, manipulations of reagents, barcoding oligos, PCR reactions, and sequencing library constructions. The same applies to overlapping pools of non-indexed samples [1]. In contrast, pooling individuals prior to DNA extraction and sequencing the pooled DNA without barcodes can generate an inexpensive and efficient assessment of allele frequencies genome-wide.
The empirical accuracy of pooled re-sequencing has been assessed in small genomes or small genomic regions re-sequenced to 8006 coverage [2][3][4][5][6][7][8][9]. In prokaryotes, pooling has been applied to clonal salmonella populations [4]. In eukaryotes, pooled Reduced-Representation Libraries (RRL) that cover a smaller portion of the whole genome have been re-sequenced for cattle [3] and domesticated pig [8]. Pooling has also been tested with individual human genes [6,9] and applied to disease studies where the genomic regions of interest in affected patients were resequenced in pooled samples to detect disease related variants [2,5,7].
While genome-wide pooled re-sequencing has been applied to Drosophila melanogaster [10][11][12], Arabidopsis lyrata [13], and Anopheles gambiae [14], the accuracy of allele frequency estimation was not assessed in these studies. There are two reasons why pooled sequencing of whole eukaryotic genomes might result in less accurate frequency estimates than smaller genomes or small genomic regions. First, lower coverage per chromosome and increased genetic complexity may increase mismapping around structural variations. Second, unequal contributions of DNA from each individual may systematically bias allele frequency estimates [15].
We generated a series of pooled re-sequencing libraries of D. melanogaster and show here that pooled re-sequencing provides a highly accurate assessment of allele frequencies genome-wide. When a sufficient number of chromosomes are pooled, the resulting allele frequency estimates are not biased by unequal DNA contributions and can be well approximated by a simple binomial distribution that depends only on read depth and allele frequency. Our results imply that pooled whole genome population sequencing should be easily applicable to any organism whose genome can be sequenced to ,50-1006 coverage. Currently, this limits the technique to organisms with small to moderate genome sizes. However, the continued increases in sequencing throughput will make the pooled re-sequencing approach relevant for organisms with larger genomes such as humans and many crop plants.

Drosophila Strains
All isogenic fly strains came from the Drosophila Genetic Reference Panel (DGRP) [16] and were maintained in our laboratory. We picked 112 out of the total 192 strains included in the DGRP as follows: we picked 22 DGRP strains that were also independently sequenced by the Drosophila Population Genomics Project DPGP at UC Davis (http://www.dpgp.org/) to make library A (Figure 1; Supporting Information S1). Another 42 DGRP strains were randomly picked to make libraries B1 and B2, and a final 50 DGRP strains (mutually exclusive from the 42) went into library B4 (Figure 1; Supporting Information S1).

True Frequency Estimation
True DGRP source population allele frequencies were obtained from the 162 sequenced DGRP strains. Only sites with no residual within strain heterozygosity within the 162 DGRP strains were used to assess the accuracy of pooled resequencing.

Library Construction
Library A (SRR353364.1) was constructed from a pool of 220 flies (10 females per strain) with DNA extracted using the Qiagen Gentra Puregen tissue kit. Library A was sequenced on a single lane of Illumina GAIIx to a total depth of 106with 100 bp pairedend reads. Libraries B1, B2 and B4 (SRR353365.1) were constructed from pools of male flies (1 male per strain) with DNA extracted via isopropanol precipitation [17]. Libraries B1, B2 and B4 were sequenced on a single lane of Illumina HighSeq 2000 to a total depth of 606 (206 each). Reads were 93 bp paired-end after accounting for barcode length.
Reads from libraries B1 and B2 were merged to ''construct'' B3 library ( Figure 1). Library B3 has therefore the same number of strains as B1 and B2 (i.e. 42 strains) but twice as many flies pooled and double the read depth coverage of B1/B2. Finally reads from libraries B1 and B2 were merged independently with reads from library B4 to construct two 92 strains pooled libraries: B5 and B6 respectively ( Figure 1).

Mapping/SNP Calling
All reads were mapped to the D. melanogaster reference genome release 5.33 using default parameters. No read trimming was done. Read pairs where one read was unmapped were discarded. Final analysis was restricted to positions covered to a minimum of 106 in all libraries. Initial alignments for both DPGP and pooled libraries were carried out with BWA (version 0.5.9) post-processed with samtools (version 0.1.18) [18]. Downstream base quality score recalibration, indel realignment, and SNP discovery [19] were carried out in GATK (version 1.2 for DPGP and 1.4 for pooled libraries) [20]. Library A was also separately mapped with MAQ (version 0.7.1) [21]. Allele frequency estimates were calculated as the percentage of reads carrying the non-reference allele at a DGRP identified polymorphic site. Novel SNP discovery in library B6 was carried out using GATK Unified Genotyper command using default parameters.

Accuracy estimates
Two values were computed as accuracy measures of allele frequency estimates from pooled libraries. Concordance correlation was computed using the epiR package (http://epicentre. massey.ac.nz) version 0.9-32. Relative error were computed according to the formula Error = ((freq pool 2freq DGRP )/ freq DGRP ) ' 2.

Binomial Simulations
Simulations were carried out in R (version 2.13.0) [22]. For each library, 1,000 SNPs were randomly selected (all SNPs were binned by frequency into 20 bins in 5% increments, and 50 SNPs were selected randomly from each bin per run) to obtain 1 observed concordance correlation and 1 relative error value. Binomial simulation based on observed read depth distribution from real sequencing data was then run on the set of 1,000 SNPs to obtain binomial concordance correlation and relative error values. This was repeated 100 times to obtain observed and simulated concordance and error rates.

Results
We designed and constructed a series of pooled resequencing libraries (SRA046699.1) from varying numbers of isogenic D. melanogaster strains from the DGRP panel (Table 1 and Figure 1) and compared the accuracy of allele frequency estimates from our pooled libraries to simulated expectations. We investigated the effects of mapping strategies, read depth, unequal DNA contribution, and reproducibility of the technique with regards to the accuracy of population allele frequency estimation from pooled sequencing. Finally, we estimated the efficiency of novel SNP discovery from pooled libraries.

Allele frequency estimation: sources of error
Mapping tools. We first tested if the choice of mapping tools affects the accuracy of allele frequency estimation. We began by creating pooled library A from 22 DGRP strains. We remapped library A in 3 ways -MAQ, BWA/Samtools, and BWA/ Samtools+GATK. We looked at DGRP SNP positions that were also covered to . = 106 read depth in library A and compared DGRP allele frequency estimates (from all 162 strains) to those from library A. To account for bias caused by a proportionally larger number of low frequency SNPs, we randomly selected 1000 SNPs in each of 100 runs, with evenly distributed SNP allele frequencies in each 1000 SNP set, for analysis and simulation of expected concordance with DGRP allele frequencies given pool coverage. The results obtained were roughly comparable across  Table 2). BWA+GATK were used for all subsequent analyses.
Read depth. The observed concordance correlations of library A allele frequency estimates as compared to 'true' DGRP source population allele frequencies, regardless of mapping tool, were .10% lower than binomially simulated values ( Table 2). We needed higher coverage libraries to explore the possibility that increasing coverage further will place observed correlations within expectations.
We followed up with a series of B libraries (Fig. 1). These libraries were made from different subsets of DGRP strains that mimic independent samples from the same source population as DPGP strains, but with true genotypes unknown, mimicking real experiments where samples would be collected from populations with unknown true allele frequencies. As in library A, we used all 162 DGRP genotypes for estimates of true population allele frequencies. Libraries B1 and B2 (Table 1 and Figure 1) were independent collections of flies pooled from the same 42 DGRP strains, individually processed during DNA extraction and library making and sequenced to 206 coverage. We placed minimum (.106) read depth filters as with library A, and observed concordance correlation values of (0.906-0.934) and (0.911-0.936) for B1 and B2 respectively. These genome-wide correlations were just below the binomial correlations of (0.933-0.952) and (0.930-0.950). This was a marked improvement from library A where we simulated binomial values between (0.931-0.952) but observed (0.822-0.867) with 106 read depth requirements (Table 2). However, these libraries, being pools of twice as many strains as library A, were not directly comparable to tease out the effects of coverage. We thus combined libraries B1 and B2 to 'construct' library B3 so as to compare libraries B3 to B1 and B2 as libraries with the same number of pooled strains, but with library B3 containing twice as many reads as libraries B1/B2 ( Table 1). The correlation between actual and observed allele frequencies for library B3 (0.921-0.939) was ,2% lower than binomial (0.935-0.954). The accuracy of allele frequency estimation in library B1 and B2 were also ,2% lower than binomial. Thus, the two-fold change in coverage between B3 and B1 and B2 did not substantially change the accuracy of allele frequency estimates. Library B4 that was pooled independently from 50 male flies (comparable to libraries B1 and B2) each representing a different DGRP strain not used in the making of B1 and B2 (Table 1 and Figure 1) and sequenced to a similar read depth (206), was similarly ,2% lower than binomial (Table 2).
Unequal DNA contribution. We explored the possibility that unequal DNA contribution from pooled strains leading to over-representation of SNPs present in strains with larger flies or flies generating more DNA for some other reason, was the larger source of noise. Library A was suitable for this analysis because 10 flies were used from each strain, and should amplify any effects from variance in DNA material. The 22 strains used in library A were fully sequenced by DPGP. Strain RAL-301 had significantly fewer unique SNPs compared to all other strains, partially due to its low coverage in DPGP data, and was dropped from the analysis. Using polymorphic position identified by DGRP, we identified approximately 250 'private SNPs' unique to each of the remaining 21 strains in library A ( Figure 2). SNPs were identified as unique to a strain if a DGRP SNP position was covered to .46 in all 21 strains, was found fixed in just 1 strain while being absent in all others, and covered to .106 in library A. 50 SNPs were randomly picked from each strain and their average frequency as estimated through library A calculated. This was repeated 10 times to obtain mean pool frequency and 2 standard deviation values. The average frequencies of these singleton SNPs within the pooled library A were treated as good approximations for the relative DNA contributions of each strain. We found that there was a gradation in DNA representation from each strain (Fig. 2), suggesting that unequal DNA contribution was a significant issue in our pool. However, the distribution of the relative DNA contributions from each strain is gradual without a clear divide between over-represented and under-represented strains. This implies that the variation is likely a by-product of sampling and not due to PCR jackpotting and therefore should be easily remedied by pooling more strains together.
To test this prediction, we analyzed the B libraries that vary in the number of strains used. Previously, the use of singleton SNPs to estimate DNA contributions was possible in library A because all 22 strains pooled were individually sequenced by DPGP. However, we were not able to perform the same analysis for the B libraries because whole genome sequences for some of the strains were not available. Thus, in order to test the effects of the number of input strains on allele frequency estimates, we contrasted the expected and observed accuracy of allele frequency estimates from libraries B1-B6 (Table 2). Libraries B1 and B2 contain 42 strains sequenced to ,206 coverage, libraries B3 and B4 contain 42 and 50 strains respectively sequenced to ,406 coverage and libraries B5 and B6 contain 92 strains sequenced to ,406 coverage. The accuracy of allele frequency estimates of libraries B1-4 are ,1-3% lower than expected. However, the accuracy of allele frequency estimates of libraries B5-6, (0.932-0.954) and (0.934-0.955) respectively, largely overlap with binomial values (0.944-0.962) and (0.945-0.963) (Figure 3). Thus, the number of strains used to make sequencing pools appears to be a major source of error in allele frequency estimation.

Allele frequency estimation: reproducibility
Libraries B1 and B2 were designed specifically to estimate the effect of technical and biological error on pooled frequency estimates. As biological replicates, they allowed us to estimate the compounded error introduced into final allele frequency estimates from unavoidable technical and biological error including variability in fly rearing, pooling, DNA extraction, PCR reactions, and sequencing. We expected that if there were no external sources of error from DNA extraction and library making protocol, B1 and B2 should behave as two independent binomial samples from the same source population. Simulations place expected correlation of B1 to B2 at (0.903-0.909), while observed correlation coefficient was 0.898. The correlation between B1 and B2 allele frequency estimates is closer to binomial expectations than the correlations between B1/B2 with the DGRP and their respective binomial expectations. As the only differences between libraries B1 and B2 were the flies used (different flies from the same strains), this suggests that compounded error from experimental components such as library making, pooling, and mapping is small relative to error from unequal DNA contribution from different flies.

SNP discovery
In addition to allele frequency estimation, novel SNP discovery from pooled libraries is an important application for population sequencing. In real life experiments where it is usually impossible to sample the full set of variants in a population and false negative rates are inherently high, it is more important that the SNPs that are called are real. To assess false positive rates of SNP calling, we compared our SNP calls from library B6 to the 162 sequenced genotypes from DGRP database, sorted by allele frequency (Fig. 4). As expected, false positive rate is higher for rare alleles but quickly falls to 7% for alleles present at .5% and around 1% for those present at .10% in the population. False negative rates, also included, might be exaggerated in this dataset as fewer SNPs are actually present in the pool of 92 strains than there are in the source population of 162 strains. We thus also calculated false negative rates using the 86 overlapping strains between DGRP and our pool, and observed improvements in high frequency alleles.

Discussion
We resequenced a series of pooled libraries made from largely isogenic, individually sequenced D. melanogaster strains. Allele frequency estimates derived from the pooled libraries were compared to corresponding estimates derived from individually sequenced strains (exact) or the source population (estimate). While there was significant variance in relative DNA contribution from the smallest pool of 22 strains, and association between pooled allele frequency estimates and true allele frequencies was lower than expected under a binomial model, this discrepancy disappeared as an increasing number of strains were pooled. Given sufficient number of strains pooled, we found that pooled sequencing provides an accurate estimate of population allele frequency with the error well approximated by binomial sampling. We focused on allele frequencies of known SNPs because our goal was to verify that pooled libraries are true representations of the sampled population. However, pooled libraries are also effective means of calling new SNPs without the redundancy of sequencing multiple individuals [15]. We found that SNP discovery had low false positive rates and was effective for common variants in the population.
Pooled libraries can also be used to estimate frequency of transposable element insertions [23, Fiston-Lavier et al personal communications], used in the estimation of Hp, HS, and Fst [11,15], and combined with likelihood methods for more powerful estimation of population genetics parameters such as nucleotide diversity and allele frequency [24][25][26][27], which are important for species where nucleotide diversity is low (such as humans) and more affected by high-throughput sequencing errors [28][29][30][31]. Pooled sequencing data can also be used in detecting selective sweeps [32], and could yield limited information on linkage as haplotype information is retained on the order of read lengths and the distance covered by each paired end reads. As statistical methods are constantly refined to deal with the complexities of pooled data [6,[33][34][35][36][37][38], our power to analyze such data will increase.
Pooled libraries are a means of efficiently sampling genetic variation in any group of polymorphic chromosomes. Possible applications of pooled re-sequencing include complex tissue samples from diseased cells, populations under artificial selection, samples of wild caught individuals, and endosymbionts or mitochondrial genomes [39]. In essence any dataset where sequencing individuals is impossible, difficult, time consuming, or just expensive are prime candidates for the use of pooled sequencing. More importantly, given our result that pooling approximates binomial sampling given a sufficient number of pooled chromosomes, it is easy to estimate the required coverage for the desired power of each analysis.
This method of estimating required samples or coverage for an experiment is theoretically applicable to any species, tissue type, or chromosome with known genomic/molecule size and a reasonable reference sequence. Even though most linkage information is lost, population allele frequencies can be sampled over a larger number of individuals at a fraction of the cost of individual sequencing. Given current sequencing power (which is likely to increase further in the near future) pooling would be easily applicable to any organism with euchromatic genome sizes less than ,500 Mb, including many model and non-model organisms. The continued increases in sequencing throughput will make pooled sequencing relevant for organisms with larger genomes such as humans and many crop plants. We believe that pooling is a very powerful and cost-effective technique for detecting unusual genomic patterns in populations on genome-wide scales.

Supporting Information
Supporting Information S1 Strains used in libraries pooled. (DOC)