A Comprehensive Map of Mobile Element Insertion Polymorphisms in Humans

As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.


Mobile elements: significance and current catalogs
Retrotransposons are endogenous genomic sequences that copy and paste into locations throughout host genomes [1][2][3]. Most mobile elements annotated in the human reference genome are remnants of ancient retrotransposition events and are no longer capable of active retrotransposition. However, a fraction of mobile elements remain active and contribute to variation between individuals in the human population. These active elements belong almost exclusively to the Alu, L1, and SVA families of non-LTR retrotransposons [4].
The Alu family is the most common mobile element in primate genomes, with more than 1.1 million copies in Homo sapiens [5][6][7]. The sequence of a full-length Alu element is 300 bp long. Alu elements are classified into a range of sub-families which have different propensities for retrotransposition, and are identified according to sequence alterations. Several AluY sub-families are currently active and are responsible for the bulk of mobile element insertion variation in Homo sapiens. The human reference genome contains over 140,000 annotated AluY elements. After Alus, L1 insertions are the next most prevalent family of mobile elements. There are over 500,000 L1 elements annotated in Homo sapiens. A full-length L1 is a sequence of roughly 6 kb in length and the most active L1 sub-family in the human lineage is L1HS [8,9]. There are a little more than 1,500 L1HS annotated elements in the human reference. A third family of mobile element are SVA retrotransposons [10]. SVAs are hybrid elements of SINE, VNTR and Alu components that range in size up to several Kb, with more than 3,600 annotated SVA elements in the human reference genome. SVA elements are thought to be the youngest family of retrotransposons in primates [11]. Other less common classes of mobile elements, such as DNA transposons, and endogenous retroviruses are not the focus in this study.
Mobile element insertions (MEI) are known to generate significant structural variation within Homo sapiens [12,13] and have diverse functional impacts [14][15][16]. In vitro experiments identified key features of Alu [17] and L1 [18] elements responsible for retrotransposon activity. The identification of MEI variant loci in humans initially began with disease-causing insertion events (e.g. hemophilia [19], breast cancer [20]). Experimental approaches were based upon library screening and small-scale PCR based display assays [21]. These approaches have been augmented by comparisons of the NCBI and the HuRef genomes [22,23], large scale fosmid-end sequences [24], and targeted sequencing of element-specific PCR products [25][26][27][28]. The dbRIP database of MEI polymorphisms [29] currently contains 2,691 polymorphic loci, enabling early estimates for the total number of segregating events [25] and per-generation mutation rates [23].
MEI polymorphisms can be detected either as insertions or as deletions in samples relative to the reference genome. Mechanistically, however, both types of observations are due to retrotransposon insertion; precise excisions of mobile elements are essentially non-existent [1]. Therefore MEI detected as deletions are, in fact, retrotransposon insertions in the reference DNA and can be verified as such by comparison with ancestral genomes. Detection and genotyping properties of MEI detected as insertions (''non-reference MEI'') and as deletions (''reference MEI'') are substantially different. We present their respective properties separately before combining the two detection modes into a unified MEI analysis. The deletion detection methods and properties of the full set of 1000GP deletions have been extensively described in the 1000GP CNV companion paper [30]. This allows us to focus on specific properties of the reference MEI subset of those deletions. Effective computational algorithms using second-generation sequencing data exist for identifying deletions [27,31,32], and have been used to find MEI in particular [33]. Detecting non-reference MEI directly as insertions from whole genome shotgun sequence data poses a more challenging problem, owing to the inherent difficulties associated with accurate mapping of sequenced reads derived from highly repetitive regions of the genome. Only recently have methods been developed for the purpose of non-reference MEI detection from second-generation whole genome shotgun data including published studies of L1 element insertions [34] and of Alu insertions [35]. These studies adopted similar computational approaches to one of our insertion detection methods (the read pair method, see Materials and Methods) and have different detection properties (Text S2 Comparisons, Figures S8, S9, S10).
Relative to previous studies, we present a broad analysis of MEI variation in the human population; with more variant loci detected, from the three major mobile element families, using multiple detection methods, each with comprehensive experimental validation ( Table 1). The present study represents the combined efforts of the MEI sub-group of the 1000 Genomes Project and has been prepared as a companion to previous 1000GP pilot publications [30,36]. The MEI analyzed in this study were included the 1000GP variant call release of July 2010 (ftp://ftp-trace.ncbi.nih.gov/1000genomes /ftp/pilot_data /release /2010_07), also provided as Table S1). The specific purpose here is to provide a more detailed description of the methods, validation experiments, and properties of the 1000GP catalog of MEI events, and to extend the analysis by adding genotype information, population allele frequencies, and population specific mutation rates.

Datasets analyzed
We analyzed two whole-genome datasets produced by the 1000GP, the low coverage pilot dataset consisting of 179 individuals sequenced to ,1-3X coverage and the trio pilot dataset consisting of two family trios sequenced to high, ,15-40X coverage (Table S2, Figure S4). These datasets included samples from three continental population groups, 60 samples of European origin (CEU), 59 African (YRI), and 60 Asian samples from Japan and China (CHBJPT). The two pilot datasets were produced and analyzed for complementary purposes. The trio dataset was used for assessing detection methods in high coverage samples and for the purpose of finding candidate de novo insertions in the trio children. The high coverage dataset was used to assess population properties of MEI. Both datasets contributed to the overall catalog of events.

Detection of non-reference mobile element insertions
We developed two complementary methods for the detection of non-reference MEI, a read-pair constraint (RP) method applied to Illumina paired-end short read data, and a split-read (SR) method applied to the longer read data from Roche/454 pyrosequencing (Materials and Methods: non-reference MEI detection). Figure 1a and 1b shows the respective detection signatures and examples of event displays. Candidate MEI events were formed as clusters of supporting fragments. A limitation specific to RP detection arises from annotated elements within a characteristic read pair fragment length of candidate MEI (Figure 1a). Read pairs spanning from a uniquely mapped anchor into an annotated mobile element with a fragment length consistent with the given library fragment length distribution ( Figure S5) are characteristic of the reference allele and are not evidence for non-reference MEI. These ''background'' read pairs occasionally have fragment lengths on the extreme tails of the library distribution and can potentially be misclassified as evidence for non-reference MEI. For this reason RP detection criteria required at least two supporting fragments spanning into the insertions from both sides of the insertion. We also masked insertion positions within a fragment length around each annotated element of the corresponding family from RP detection in order to achieve a low false detection rate. The SR method was not dependent on the fragment length distribution in the 454 data so these additional detection criteria were not required.
We applied the two methods to both 1000GP pilot datasets (Table 1) separately, yielding a total of 5,370 distinct genomic MEI loci, 33% of which were found by both SR and RP methods

Author Summary
We embarked on this study to explore the 1000 Genomes Project (1000GP) pilot dataset as a substrate for Mobile Element Insertion (MEI) discovery and analysis. MEI is already well known as a significant component of genetic variation in the human population. However the full extent and effects of MEI can only be assessed by accurate detection in large whole-genome sequencing efforts such as the 1000GP. In this study we identified 7,380 distinct genomic locations of variant MEI and carried out rigorous validation experiments that confirmed the high accuracy of the detected events. We were able to measure the frequency of each variant in three continental population groups and found that inherited MEI variants propagate through populations in much the same way as single nucleotide polymorphisms, except that MEI are more strongly suppressed in protein coding parts of the genome. We also found evidence that the MEI mutation rate has not been constant over human population history, rather that different populations appear to have different characteristic MEI mutation rates.
( Figure 1c). The overall level of detection overlap between SR and RP methods is limited by detection sensitivity and specificity (see below) and the number of samples sequenced by both 454 and by Illumina read pairs.

Detection of reference MEI
In addition to the 5,370 non-reference MEI, we identified 2,010 reference MEI detected as deletions of mobile elements in samples. The reference MEI events were selected from the full release set of 1000GP pilot deletions (n = 22025) [30,36] based on matching deletion coordinates to RepeatMasker 3.27 Alu, L1, and SVA annotations [6], and the requirement that the mobile element is absent in the chimpanzee genome [37] (6x pan Trogodytes-2.1 assembly) at the corresponding positions in hg18 (Materials and Methods: Reference MEI selection). Figure 1e shows an example event display of an AluYb8 reference MEI, detected as a deletion in the trio pilot data. All but one of the reference MEI were found by one or more of the RP or SR deletion detection algorithms that were part of the released 1000GP deletion call set [30,[38][39][40][41][42] with a small overlapping contribution from algorithms based on assembly or read depth methods [43,44] (Figure 1d, Table S3).

Combined MEI catalog
The complete set of 7,310 MEI calls is simply the combined set of reference and non-reference MEI over both pilot datasets (summarized in Table 1, complete list in Table S1). Insertions occurring at the same locus from different call sets were merged using a 100 bp window for matching positions, choosing the SR insertion coordinate when available to represent the merged event. Similarly for reference MEI, deletion merging was accomplished among the 23 separate 1000GP call sets using a precision-aware algorithm described in detail in the 1000GP SV companion paper [30]. The full catalog of MEI loci appear to be distributed randomly across the genome (Figure 2b) with a characteristic spacing of 0.4 Mb between MEI loci, except for an apparent MEI hotspot in the HLA region of chromosome 6 where 19 MEI loci are clustered in a 1 Mb region (8 times the genomic average density for MEI, Figure S11). Accurate read mapping in the HLA region is complicated by a high density of variation [36], however, we see no evidence of falsely detected MEI here. The balance between reference and non-reference MEI, proportions of RP and SR detected loci, the fraction of previously identified MEI loci, and the validation rate are all consistent with genomic averages; only the density of MEI is significantly increased.
The genomic proportions of the three mobile element families are 8562% Alu, 1262% L1, and 2.561% SVA (Figure 2b) for both reference and non-reference MEI. Most non-reference MEI loci were detected from the low coverage pilot data (Figure 2c) while the reference MEI were more evenly distributed between the low coverage and trio pilot data ( Figure 2d). As described in the 1000GP main pilot paper [36], more than 80% of the nonreference MEI were newly identified loci not detected by previous studies [23][24][25][26][27][28]34,35,45]. However, in the mean time, several published studies have produced new lists of non-reference MEI loci including L1 insertions [34] and Alu insertions [28,35]. Half of the non-reference MEI loci from this study have not yet been reported elsewhere (Figure 2e, Figure S8). Table 1 of the 1000GP paper lists 5,371 MEI, two of these events were subsequently merged into one to form the present count of 5,370 MEI detected as insertions. For reference MEI, we find that 76% of our events matched deletion coordinates listed in the dbVAR (28 January 2011) structural variation database or a deletion identified in the HuRef genome [22,46], leaving 24% of the reference MEI unreported prior to 1000GP publications.
The 1000GP catalog of MEI variant sites includes all 7,310 detected loci, including those matching MEI from other publications. Further comparisons among the recent MEI studies are provided in Text S2.

Detection specificity and sensitivity
We benchmarked each of the four non-reference MEI call sets (separate SR and RP call sets for the low coverage and trio pilot datasets) to assess detection sensitivity and specificity. As MEI are currently not suitable for microarray validation due to their highly repetitive sequence, all validations were done by locus-specific PCR. 200 loci were randomly selected from each of the four insertion call sets. Using an automated pipeline [32], primer design was possible for 746 loci (Table S4). In addition to the randomly selected loci, other candidate loci were selected for validation experiments in order to confirm SVA insertions (n = 7), to test potential de novo insertions from the pilot 2 trio (n = 1), and gene-interrupting events (n = 86 attempted), as well as for algorithm training and testing purposes (n = 386). These additional PCR results (Table S4a) were not used to assess false detection rates, except for the special case of SVA insertions, which were under-represented in the random loci selection since SVA insertions are relatively rare. All candidate loci with successful primer design were tested on two different population genetic panels (Materials and Methods: Validation) one with DNA of 25 individuals from the low coverage pilot, and one with DNA from all samples of the trio pilot dataset. In addition to other human samples from populations not represented by the pilot datasets, DNA of a chimpanzee was also included on the panel to confirm that the identified insertion is indeed human-specific. An example of typical results for a low coverage locus is shown in Figure 3a. Through additional primer design for loci with inconclusive results and PCRs using a primer residing within the 39 end of a retrotransposon, in particular within SVA elements, more than 98% of the tested candidate loci were successfully genotyped. The validation experiments revealed overall insertion false discovery rates for each dataset of less than 5% (Table 1)   %) with 95% confidence intervals. This is not entirely unexpected as polymorphic Alu insertions tend to be low divergence full-length AluY elements, unlike L1 or SVA insertions which tend to be truncated and may be accompanied by adjacent transduced genomic DNA sequences. Although the SR and RP detection methods are very different, the overall detection specificities were remarkably consistent.
Following the validation of non-reference MEI, we assessed detection sensitivity. The primary challenge here was to find suitable gold standard non-reference MEI that should be present in our samples from which to assess sensitivity. We estimated Red and blue reads bracketing annotated elements are characteristic of mapping artifacts that we removed from insertion detection by masking out regions within a fragment length of an annotated element of the same family as the insertion. b) Signature for SR-based insertion detection. Split-mapped 454 reads span into the element sequence. The SR event display shows split reads spanning into an Alu insertion from the 59 (blue) or the 39(red) sides. The vertical green line marks the insertion site. Fully mapped 454 reads are shown in gray. Gray reads that span the breakpoint correspond to the reference allele. Note that the mapping quality increases with the length of the split-mapped segment. The red and blue segments overlap by roughly 15 bp in the target site duplication region that brackets the MEI insertion. c) Overlap between non-reference MEI detected by RP and by SR. d) Overlap between detection methods for reference MEI. Of the 23 1000GP deletion call sets, 11 were RP and 4 were SR. Also shown are the relative proportions of events detected by assembly (yellow) and by read depth (gray) both of which had nearly 100% overlap with RP and SR calls. e) RP signature for reference MEI detection. Read pairs with abnormally long mapped fragment lengths (in green) span over an AluYb8 annotation. The event display shows RP evidence for a homozygous reference MEI in chromosome 22 from the trio dataset. The yellow line at the top marks homologous regions from the chimpanzee assembly, with a gap at the precise location of the variant MEI. doi:10.1371/journal.pgen.1002236.g001 sensitivity in three different ways, as a consistency check. First, we estimated sensitivity by using the high quality non-reference MEI from HuRef [23] as a gold standard and found that 74% of the 650 Alu, L1, or SVA insertions in HuRef matched MEI insertion loci in our catalog (Table S5). This represents a lower limit for insertion detection sensitivity since not all MEI in the HuRef genome are necessarily present in the 1000GP pilot samples. Next we looked at the overlapping insertion detection between the RP and SR methods in the trio children samples (Figure 3c, Figure  S6), which were the samples sequenced to the highest depth for both Illumina and 454 data. Based on the detected loci overlap (see Materials and Methods: Detection sensitivity), we estimate 67%63% and 70%67% sensitivities respectively for RP and SR insertion detection in the trio children (Table S6), with a combined SR+RP detection sensitivity exceeding 90% in the CEU trio child (see Materials and Methods, Eq. 4) with high coverage data from both 454 and Illumina reads.
A third approach to estimate for the non-reference MEI detection sensitivity is based on the validation PCR genotypes in the low coverage dataset. Since the PCR loci were selected as random subsets for each RP and SR call set independently, the validated sites selected from SR events can be used as a gold standard to assess RP detection sensitivity, and vice-versa. Detection sensitivity as a function of allele frequency (Figure 3d) was estimated for each method from PCR genotypes at those loci randomly selected for validation of the complementary method. PCR genotypes provided the allele frequency estimate on the abscissa. Statistical errors at high allele frequency are large because the limited number of tested MEI loci at higher allele frequencies. Detection sensitivity of the RP method saturates close to 70% at high coverage and the SR method sensitivity exceeds 70% at high coverage ( Figure S6). The corresponding trend is apparent in Figure 3d. The combined detection sensitivity approaches 90% for common alleles (Materials and Methods, Eq. 4). However, since relatively few of the low coverage samples were sequenced with 454, a realistic estimate for the detection sensitivity to common MEI insertions is between 70% and 80%. This is consistent with 75% derived from the HuRef gold standard comparison and the sensitivity estimate from the trio pilot overlaps. Equivalent estimates for Alu, L1, and SVA specific sensitivities for common MEI alleles are 75%610%, 50%610%, and 50%620% respectively (Table S9). Regarding reference MEI detected as deletions, the overall validation rate from PCR and local assembly for the MEI component of deletions was 96%. This does not imply that the remaining 4% were false, only that the released set of deletions contained reference MEI detected by two high specificity algorithms with characteristic false detection rates less than 10%. These algorithms did not require additional validation evidence in the 1000GP release. A rough estimate for the false detection rate for the MEI component of deletions is therefore 0.4%. The number of algorithms supporting a given call is another indicator of call quality. The average number of separate deletion calls (out of a maximum of 23 call sets) supporting events in the MEI subset was 7.8 while the average over all other deletions was 2.3 ( Figure  S2). The high validation rate and high consensus among detection algorithms indicate that this subset of deletions is relatively free of detection artifact. The practical limitation on the specificity of these events as reference MEI is the subsequent MEI selection criteria. Only a small fraction the 2,010 selected events were ambiguous in terms of matching coordinates to an annotated mobile element with corresponding gap in the chimpanzee genome assembly (e.g. Figure S3, bottom panel). The 1000GP CNV paper identified 2029 reference MEI variants using the BreakSeq algorithm. Overlap between the respective lists is 89%. We estimate 10% as an upper limit on the false discovery rate for reference MEI.
Detection sensitivity for reference MEI was estimated from the fractions of gold standard reference MEI identified by Xing et al. from HuRef [22,23,46], and reference MEI identified by Mills et al. [4,47] from 1000GP samples NA12878 and NA12156 matched to any of our 2,010 reference MEI (Table S5). In each case the fraction of those MEI deletions found in this study exceeded 90%. This level of detection sensitivity is considerably higher than the bulk deletion detection sensitivity reported in the SV companion paper [30], indicating that the RP and SR deletion detection methods developed for the 1000GP were particularly well suited for reference MEI detection. , and dataset (LCP: low coverage pilot, TP: trio pilot). The false detection rate for Alu elements is uniformly ,3% while the false detection rates for L1s and SVA element insertions approach 30%, with large error bars (95% confidence intervals) arising from relatively low statistics. c) Non-reference MEI detection overlap from trio samples NA12878 and NA19240. This level of overlap between two independent methods using independent sequence data corresponds to a detection sensitivity of roughly 70% for each algorithm and a combined detection sensitivity of 90% in these samples. d) Non-reference MEI detection sensitivity as a function of allele frequency in the low coverage dataset. PCR results for loci randomly selected from one method were used as a gold standard for the complementary method, and vice versa. PCR also provides an estimate of the allele frequency based on the 25 low coverage samples used for validation experiments. RP (blue) and SR (red) and the combined (black) detection sensitivities rise with frequency. One standard deviation confidence intervals are shown as shaded bars for the RP and SR algorithm, with black error bars for the combined RP+SR detection efficiency. doi:10.1371/journal.pgen.1002236.g003 MEI properties, assembly, and sub-family classification We characterized each detected MEI event (Table S1) by the insertion position, which algorithm(s) detected the event, number of fragments supporting the insertion and reference alleles, insertion length ( Figure S12), element family, bracketing homology (Figure 4a), and assembled sequence. MEI have a characteristic ''target site duplication'' region of homology bracketing the insertion. The target site duplication length distributions for the MEI detected by different methods, as well as for different element families, peaked at 15 bp with a standard deviation of 7 bp (Figure 4a). The full insertion sequence from reference MEI is readily extracted from the reference, but non-reference MEI require local de novo assembly to reconstruct the inserted sequence. For this we used 454 data to reconstruct 1,105 Alu insertions (Tables S1 and S7) from our event list based on the PHRAP assembly program [48]. We then used BLAT [49] to map assembled contigs back to the build 36.3 human reference to identify the boundaries of the inserted sequence. The inserted sequence was then mapped back to the RepeatMasker mobile element sequences using the RepeatMasker web server (http:// www.repeatmasker.org) to identify the sub-family ( Figure 4b). The accuracy of Alu sub-family classification was assessed by comparison to matched 359 Alu insertions from dbRIP [29] and nine fully sequenced Alu insertions from PCR validation experiments. 272 of the assembled Alu sub-family classes were identical (74%). The most active Alu sub-families are AluYa5 and AluYb8. AluY sub-families account for essentially all Alu variation. The relative proportions among Alu sub-families are consistent among reference and non-reference MEI, as well as consistent with the Alu sub-families observed in HuRef [23]. The Alu subfamily breakdown differs from that reported by Hormozdiari et. al. [35] who identified more than 10% of their set of insertions from AluJ or AluS sub-families. The authors of that study point out that these 'older' Alu events could arise from mechanisms other than retrotransposon insertions.

Genotyping
Genotyping of non-reference MEI (Materials and Methods: Genotyping) was based on counts of fragments supporting the reference allele and fragments supporting the insertion allele at each locus for each sample. Heterozygous MEI sites are identified by roughly equal amounts of reference and alternate allele supporting fragments spanning an insertion locus, while homozygous sites should have all fragments supporting one or the other allele. For reference MEI, we used genotypes produced by the Genome STRiP package [39], which was developed for 1000GP deletion genotyping [30,39] and incorporates Beagle [50] imputation based on linkage with local SNPs. Both genotyping methods provide phred-scaled [51] genotype quality (GQ) metrics at each site that reflect confidence in the given call based on supporting evidence, GQ = 0 to a total lack of genotype evidence and GQ = 10 indicating that the genotype should be 90% accurate. The GQ metric depends on the number of fragments found to support the MEI and non-MEI alleles for a given locus and sample (Text S2: Genotyping methods). As in most issues of sensitivity vs. specificity, there is a trade-off between high genotype efficiency and genotyping accuracy. The drop-off in genotyping efficiency vs. GQ threshold is more severe for non-reference MEI ( Figure S13). For subsequent genotype-based analysis of nonreference MEI sites and samples we required GQ$7, which corresponds to roughly 40% genotyping efficiency in the low coverage pilot data. For reference MEI we required GQ$10, which corresponds to an efficiency of 80%. Genotyping efficiency improves with increased sample read coverage ( Figure S13, bottom panel), particularly for non-reference MEI.
Genotyping accuracy for non-reference MEI is assessed by direct comparison to PCR validation genotypes in the same samples, and by testing for Mendelian errors in the trios and violations of Hardy-Weinberg Equilibrium in the low coverage data (Text S2 Genotyping tests, Figures S13 and S14). Validation genotypes are listed in Table S4 (also as the ''VG'' field of the released MEI insertion genotyped VCF files). Genotype contingency tables for the low coverage data (Table 3) show an 87% agreement between sequenced genotypes and PCR genotypes for sites with GQ$7. Genotyping accuracy improves with increasing GQ threshold ( Figure S13) but never exceeds 90% in the low coverage data. Non-reference MEI genotyping performance for high coverage trio data ( Table 3, Table S8) was considerably better than for the low coverage data. However, for population analyses we used only low coverage data in order to minimize the potential for coverage biases. The accuracy of GenomeSTRiP genotypes (for reference MEI events) with GQ$10 was estimated at 99% in the full 1000GP deletion call set [30,36,39].

Population segregation of MEI
We estimated MEI allele frequencies from the count of high quality (GQ$7 non-reference and GQ$10 for reference MEI) genotyped insertion alleles for each MEI locus. Allele frequencies were estimated from loci with at least 25 high quality genotypes for each continental population group. The two MEI detection modes (i.e. reference and non-reference insertions) have very different allele frequency spectra (Figure 5a-5c). Since the non-reference MEI and reference MEI components have very different powers of detection and genotyping, the two components were corrected separately (Materials and Methods: Allele frequency spectra) before being combined into the full MEI spectrum (Figure 5d-5f). We estimated correction factors for each population group, each element type, and each detection mode (Table S9). Non-reference MEI correction factors are larger than reference MEI factors because of the lower detection sensitivity and genotyping efficiency.
The allele count spectra were compared to the standard neutral model [52][53][54], h/i, where h is an MEI diversity parameter and i is the allele count in a fixed number of samples. The value of h is fit from the MEI allele count spectrum for each population group and the fitted model is the gray dotted line appearing in Figure 5d-5f. Only allele count bins in the range 0.15,frequency,0.95 were used in the fit (bins marked with error bars in Figure 5d-5f) to avoid regions of poor detection sensitivity. The corresponding gray dashed lines superimposed on Figures 5a-5c also represent the neutral model expectation, modified to account for the respective ascertainment conditions, (h/2N) for reference MEI, (h/i)(2N2i)/ (2N) for non-reference MEI, where N = 25 is the number of samples in the spectra. These ascertainment condition expressions are based on the assumption that the reference genome represents a random sample from the given population, which is admittedly simplistic but nevertheless explains much of the difference between the allele spectra of reference and non-reference MEI. A coalescent simulation (Text S2 Coalescent, Figure S17) for MEI variation also shows this behavior using standard population history parameters [55]. Fitted values of the diversity parameter h for each of three population groups and each element family are listed in Table 4, along with rough estimates for the corresponding MEI mutation rates based on the neutral model (m = h/(4?N e )) with an effective population size N e of 10,000 [56,57]. Confidence intervals for m and h (Table 4) take into account Poisson noise and uncertainties in the correction factors, but do not reflect the degree to which the model assumptions are valid.
All three element families have been combined into the allele count spectra shown in Figure 5, although the Alu family is the dominant component. Allele frequency spectra for different element families have similar shapes (Figure 6a). We know from SNP studies that the shape of the allele frequency spectrum is modulated by demographic history, and that this shape is characteristically different for European, African, and Asian populations [56,57]. When compared to SNP allele frequency spectra from the same datasets (Figure 6b), the MEI and SNP frequency spectra show similar trends among the corresponding populations. Among the three population groups, the CHBJPT spectrum shows relatively few low frequency allele loci. This was also apparent in comparison with the neutral model (Figure 5e).
We also analyzed population differentiation by applying principal component analysis to the matrix of allele counts across the low coverage pilot samples and loci (Figures S15 and S16). Some structure is immediately apparent in the matrix of allele counts, e.g. increased heterozygosity in the YRI samples, but PCA reveals population specific patterns of MEI that result in tight clusters of samples according to geographic origin ( Figure 6c); again similar to population patterns for SNPs [58], CNVs [59] and deletions [30].

Functional properties
As few as 39 of the 5,370 non-reference MEI loci were located in exonic sequence, mostly in untranslated regions, and only 3 were found in coding exons ( Table 2). These numbers are much lower than expected from random placement (Materials and Methods: Functional calculation), indicating strong selection against MEI disrupting gene function. The suppression factor for an MEI to occur in a coding region compared to the genomeaveraged rate is 46x, a much stronger suppression than is observed for coding SNPs (Table S10, suppression factor = 3.9x), and is similar to SNPs that cause the loss of a stop codon (42x, derived from Table 2 of [36]). Two of the MEI interrupting coding regions were PCR-validated. These two MEI appear to be of little functional consequence: ZNF404 is a member of a highly paralogous zinc finger gene family and C14orf166B is a predicted gene without functional annotation. These findings suggest very strong negative selection against MEI interrupting coding regions. Although it is obvious from first principles that insertions in functional regions should be deleterious, the observed suppression factor in a large catalog of MEI in populations quantifies the effect.

Number of ME polymorphisms between pairs of individuals
The high-coverage trio data allows for the most precise estimates of the total number of MEI variants between pairs of individuals because of the high detection sensitivity. The number of pair-wise variant loci is calculated as the presence or absence of an insertion at a given locus, combining reference and nonreference MEI. We selected the two trio children (NA12878 and NA19240) for comparison between CEU and YRI individuals and the trio parents for comparison of individuals within the CEU and the YRI population groups. After corrections for detection sensitivity and false detection (Text S2 and Table S6), we found 2,0346120 MEI variant loci between the African and the European trio children, 1,4426120 between the YRI parents, and 6636140 MEI between the CEU parents. The pair-wise event numbers scale linearly with coalescent time derived from SNPs ( Figure 6d) in these samples (Text S2: Coalescent [60][61][62][63][64]).

Search for de novo MEI
Previous estimates for the de novo mobile element insertion rate and our own estimate of the MEI mutation rate are one event per 20 births in the human population [23]. Accordingly, we did not expect to find de novo insertions in our sample of two trio children. Among all MEI events detected in the trio offspring against the reference (1,778 in NA12878 and 1,971 in NA19240), we did identify a single de novo candidate insertion in NA12878, not detected in either parent or in any other sample (Table S6, de Novo). A subsequent PCR validation experiment revealed that this insertion was, in fact, present in one of the trio parents, but not detected from the sequence data. All in all, our study found no direct evidence for de novo MEI events in the two trio samples.

MEI heterozygosity and mutation rates
MEI genotyping allows us to estimate MEI heterozygosity within each sample. We define heterozygosity as the count of heterozygous loci across the individual's genome. In a manner similar to the allele frequency analysis, heterozygosity is corrected for detection and genotyping efficiencies (Materials and Methods: Heterozygosity) such that it represents the true number of heterozygous loci in the sample. Heterozygosity, p, and the diversity parameter, h, fit from the allele count spectrum, are related population metrics that depend on the MEI mutation rate, m MEI , and demographic history. In the neutral model (under mutation-drift equilibrium in the limit of infinite segregating sites and a constant effective population size, N e ) the two metrics should be approximately equal [65]: Deviations can be interpreted as evidence for selection pressure, changing demographic parameters, or possibly as changes in the mutation rates. These metrics were originally developed as a framework for SNP analysis but can also be applied to MEI variants. It is this property of heterozygosity that we wish to exploit. A comparison MEI and SNP heterozygosity within the same samples allows a direct comparison of the corresponding mutation rates, because the impact of long-term demography (here simplified in terms of N e ) is identical for both variant types. Consequently, the MEI mutation rate can be estimated as: Given constant mutation rates we would expect proportionality between p SNP and p MEI in samples from different population groups, however a scatter plot of p MEI vs. p SNP over the low coverage pilot samples (Figure 7a) shows some deviation. Heterozygosity for the Asian sample group is systematically elevated above the proportionality line (dashed line). Also shown on the scatter plot is a grey region corresponding to SNP and MEI differences between the human and chimpanzee reference genomes [37,66]. The MEI insertion rate is known to be roughly 2.5 times higher in the human than in the chimpanzee lineage [66], however, the time dependence of the MEI mutation rate during human evolution is not yet known. For this, we reexpressed the SNP and MEI heterozygosities for each sample in terms of m MEI vs. coalescent time (Figure 7b) based on equation (2), a constant SNP mutation rate (m SNP ,1.8610 28 mutations per site  per generation [67]), and the coalescent time estimated from the SNP heterozygosity. Characteristic MEI mutation rates for each population group were derived from Eq. (2) with ,p MEI . and ,p SNP . averaged over the samples in the group. Values of m MEI for each population and each element family are compared to m MEI derived from h fitting (Figure 7c) and are listed in Table 4 with 95% statistical confidence intervals. Confidence intervals from the allele frequency fits (error bars in Figure 7c) are larger than statistical errors from the averaged heterozygosities over samples (error bars within the circles on Figure 7c) because each sample provided independent observations for the average heterozygosity, whereas in the allele frequency spectra fits all samples were combined. Both estimates are subject to systematic errors that may arise from the detection, genotyping, and correction procedures. To test for systematic biases in m MEI we re-processed both allele frequency spectra and heterozygosity estimates over a range of genotype selection thresholds (Text S2: Stability, Figure S18) and found consistent trends in m MEI among the population groups and element families, although the overall scales of the mutation rates are uncertain to 20%. Values of the element specific mutation rates in Table 4 and Figure 7c are consistent with previous reports [23,25,68]. In summary, careful error analysis led us to believe that the differences in the mutation rate observed between the different population sample groups are likely to result from biological processes, rather than measurement or analytical artifacts.

Discussion
Common MEI polymorphisms in the human population MEI alleles propagate within population groups much like other predominantly neutral polymorphisms. MEI allele frequency spectra from the low coverage samples are in general agreement with expectations from the standard neutral model for allele drift in a population. The major differences in allele frequency spectra between non-reference and reference MEI (Figure 5a-5c) are explained by the ascertainment condition that the derived MEI allele occurs in a given sample (the reference) and are in agreement with expectations based on a coalescent simulation of MEI population drift ( Figure S17). MEI allele frequency spectra among the three population groups exhibits a similar trend to SNPs (Figure 6b), although the MEI spectrum in the Asian samples is a poor fit to the h/i form (x 2 /d.f.,2 from Table 4) with an excess of high frequency alleles and a deficit at low frequency (Figure 5e). MEI allele frequencies were based on MEI detected and genotyped across three element families (Alu, L1, and SVA), from both non-reference and reference MEI, and multiple detection methods (RP and SR), each with characteristic detection sensitivities and false detection rates. Corrections for these effects, as well as genotyping efficiencies, were included in the allele frequency spectra.
Measurements of MEI heterozygosity offer a more direct method to estimate MEI insertion rates. Like the allele frequency spectrum, heterozygosity is dependent on accurate genotyping and includes corrections for efficiency losses, but in this case the corrections were made on a per sample basis, which is more specific since sample coverage is the dominant limitation for detection and genotyping power ( Figure S6). The heterozygosity measurement also has an advantage in that each sample is an independent estimate of the population average ,p MEI . and ,p SNP .. The heterozygosity measurements revealed evidence for differential MEI mutation rates among the three population groups. The probability that the Asian population samples have the same MEI mutation rate as the other two population groups is very low (paired t-test p-value,10 26 ). We tested the stability of this result by varying the genotype selection criteria across a range of threshold ( Figure S18) and found that the differential MEI rate effect is indeed stable. Sequence coverage in the 1000GP lowcoverage pilot data was roughly the same for all three continental population groups (Table S2), so we do not expect coverage differences to generate significant systematic biases in these population comparisons.
The question remains whether the differential MEI mutation rate between populations is driven by a shared increase of m MEI within Homo sapiens, as suggested by Figure 7b, or simply by varying insertion rates among different populations. The pilot data is consistent with either interpretation, so data from more populations (more than 30 population groups from five continents are planned for the full 1000GP) will be needed to discriminate between the two hypotheses.
Based on the global values for the diversity parameters h MEI and p MEI (Table 4), and the neutral model, our rough estimate of the total number of MEI segregating sites in the human population with allele frequency.10% is 4500, and 9000 for frequency.1%, with 20% uncertainty arising from parameter estimates. Counting only those sites with a sufficient number of genotypes to measure allele frequency, our dataset contains more than half of the segregating human MEI sites with frequency.10%.

Significance
This study of the 1000GP pilot datasets is a sizable step toward a complete population-based catalog of common human MEI polymorphisms, made possible by targeting both non-reference and reference MEI events in the human genome. We identified 7,380 polymorphic mobile element insertions from the Alu, L1, and SVA families. Based on experimental validation of random subsets of loci we estimate that the false discovery rate in this study is less than 5%. Detection power for common alleles (allele frequency.10%) varies between non-reference MEI (70%-80%) and reference MEI (.90%). We were also able to assemble the inserted sequence for more than 1,000 non-reference Alu MEI and found consistent proportions of Alu sub-families in comparison to MEI identified in HuRef.
This comprehensive variant discovery and genotyping effort allowed us to directly compare the segregation properties of different variant types from the same dataset. Our analysis revealed that, to a first approximation, the evolution of MEI variants is similar to SNPs and consistent with neutral models [52,53], except in exonic regions where they are subject to negative selection on the scale that acts against SNP variants resulting in stop codon loss. An intriguing finding from our data, however, is the detection of signals suggesting a recent increase in MEI rates in humans.

Non-reference MEI detection
Both the SR and the RP methods were based on identification of non-reference MEI as clusters of mapped DNA fragments in which one end mapped to the consensus sequence of a mobile element while the other end was uniquely mapped to the reference genome in a location inconsistent with a known mobile element location in the reference (Figure 1a-1b). The RP method required at least two MEI supporting fragments across both the 59 and 39 insertion breakpoints for each candidate MEI from the pooled datasets (the low coverage and trio pilot data were pooled separately). The SR method required only one MEI supporting fragment across either the 59 or 39 breakpoints for candidate events. We used 52 consensus element sequences from Repbase [69] (www.girinst.org, version 14.03, Table S11) to identify reads mapping to mobile elements. The RP method used Mosaik [70] (bioinformatics.bc.edu/marthlab/Mosaik, version 0.9.1176) for read pair mapping of Illumina paired-end data to the NCBI36 human reference (build 36.3) and the Spanner [40] program to identify non-reference MEI by clustering supporting fragments [40,71,72]. The SR method also used Mosaik to align 454 data, for full read mapping as well as for split-read mapping. We used extensive simulation experiments [73] to optimize detection methods, algorithm parameters, and post-process MEI candidate event selection filters (further details are provided in Text S2).

Reference MEI selection
The 2,010 reference MEI events are a subset of the 1000GP pilot release of 22,025 deletions [30]. 95% of the MEI sites detected as deletions were found by more than one algorithm but the dominant mapping algorithms were Mosaik, and Maq [74], with detection algorithms Spanner, Pindel [41], BreakDancer [38], and GenomeSTRiP [39]. Two selection criteria ensure that a given deletion corresponds to a true variant MEI inserted in the reference genome: 1) The deletion coordinates match to an annotated Alu, L1, or SVA element [6] in the hg18 reference, defined as .50% reciprocal overlap and the start and end coordinates both match within a window of 20 bp for Alus, or 200 bp for L1s and SVAs. 2) At least 75% of the deleted region corresponds to a gap in the chimpanzee genome assembly [37].

MEI event matching between algorithms and studies
Non-reference MEI detected by the SR and RP methods were merged according to a 100 bp matching window around the leftmost insertion coordinates. To assess call set intersections between this study and other published lists of non-reference MEI, we used a matching window of 200 bp around each insertion position. We adopted the 'leftmost' coordinate convention ( Figure  S1), in keeping with 1000GP call sets, whereas other studies used rightmost or unclear coordinate conventions. The respective scales of the matching windows were dictated by the characteristic position resolutions of the algorithms ( Figure S7, Figure S10), which varied considerably from study to study. Redundant loci from recent publications were not counted multiple times in Figure 2e. To identify matching reference MEI to other studies we required at least 50% reciprocal overlap between the starting and ending NCBI36 deletion coordinates.

Calculations of sample sequence coverage
For SR detection the relevant coverage statistic is 454 base coverage, counts of aligned reads covering a given base, averaged across the accessible genome. For RP detection the driving coverage statistic is Illumina read-pair spanning coverage, counts of fragments in which the non-sequenced segment of the fragment between the reads cover a given base, averaged across the genome (Table S2).

Validation methods
The four non-reference MEI event lists (Table 1) were submitted to the 1000 Genomes Structural Variation subgroup for validation experiments to assess false detection rates. 200 loci from each list were randomly selected for primer design and subsequent PCR validation. Primers were designed as described previously [32,36] to span across the insertion breakpoint and to guarantee unique mapping to build 36.3. In addition to the estimation of the false detection rates, validation genotypes were derived from gel-band size comparison for each sample and site tested by PCR. We also used the validation data to estimate detection sensitivity based on the overlap of events called between the two independent sequence data platforms and algorithms.
For loci with ambiguous PCR results, no amplification, or amplification of only the empty insertions site, a second primer pair was designed. For the primer design, 600 bp of flanking sequence on either side of the insertion site was retrieved from genome.ucsc.edu using Galaxy. Alu elements within the flanking sequence were masked to ''N'' using RepeatMasker (repeatmasker.org). Primers were designed with BatchPrimer3 v2.0 in the flanking sequence, leaving at least 100 bp before and after the predicted insertion site. Next, all primers were tested with BLAT to determine the number of matches in the human genome. If one primer of a primer pair matched several times and the other primer was unique, a virtual PCR was performed. Primer combinations with one predicted PCR product were tested on our panel. Otherwise primers were designed manually (if possible) after repeat-masking the flanking sequence with the complete repeat library.
In addition, for L1 and SVA loci without unambiguous PCR amplification, primers were designed, placing one primer within the 39 end of the mobile element sequence [75]. The primers were designed to match the consensus sequences of the youngest L1 and SVA sub-families. All PCR primers were ordered from Sigma Aldrich, Inc. (St. Louis, MO). All LSU-designed PCR primer sequences used in this project can be found at http://batzerlab.lsu. edu.
DNA samples for PCR verification. A subset of 25 DNA samples from the low coverage pilot samples and all six trio samples were used in PCR validations (Table S4). Each DNA panel also included a population out-group sample, an individual of South American origin (NA17310, Coriell) for low coverage pilot, and an individual of Asian origin (NA17081, Coriell) for the trio pilot. Additional control DNA samples on both panels included human cell line DNA, (HeLa; ATCC CCL-2) as well as ''Pop80'', a locally pooled DNA sample from different individuals of diverse geographic origins (Asia, Africa, South American, and European). This sample serves as a diagnostic tool because amplification of an empty site alone in all samples (including Pop80) strongly indicates that the putative insertion is absent (false positive). In contrast, the presence of an MEI in a single study subject, while absent in Pop80, points toward a potential de novo retrotransposon insertion, or at least an insertion with a low allele frequency (AF). Chimpanzee DNA (NS06006, Coriell) was also included on each panel, representing the presumptive pre-insertion site for each event (empty site) as another PCR control.
In addition to the subset of 25 individuals used for the low coverage pilot validations, four more DNA samples from the low coverage pilot dataset were obtained for subsequent experiments. DNA samples NA12872, NA12814, NA12815 and NA12044 (CEPH/Utah USA) were purchased from the Coriell Institute for Medical Research. All 35 samples (25+6+4) were used for PCR validations associated with MEI events detected specifically in exons.
Full-length L1 and SVA elements typically exceed the limitations of standard DNA Taq polymerase in PCR. For L1 insertions, LA-Taq DNA polymerase (Takara Bio USA, Clontech Laboratories, Inc. Mountain View, CA) was used in the PCR reactions according to the manufacturer's instructions to enhance the yield of long PCR templates (2-10 kb). SVA elements are particularly GC-rich and difficult to amplify in PCR if full-length even with special long-template polymerases. In order to evaluate presence/absence of these insertions using PCR, we performed a PCR using one primer residing within the SVA insertion in conjunction with an external primer (forward or reverse, depending on the orientation of the predicted insertion). To determine the genotype and presence of the insertion, two separate PCR reactions were required in these instances. A PCR using primers flanking the MEI amplified a PCR product if the MEI was ''absent.'' A separate PCR with internal primers detected the MEI ''present'' site. In addition, this approach was also used for some L1 loci to confirm the presence/absence of the insertion and to minimize the chance of false non-detection.
PCR experiments were carried out in three different laboratories yielding similar success rates. At EMBL, PCRs were preformed using 10 ng of NA12878 genomic DNA (Coriell) in 20 ml volumes in a C1000 thermocycler (BioRad). Two different enzymes, iProof High Fidelity DNA Polymerase (Biorad) and Hotstart Taq (Qiagen) were used, with comparable results. PCR conditions for iProof were: 98uC for 1 min, followed by 5 cycles of 98uC for 10 s, 68uC for 20 s and 72uC for 4 min and 30 cycles of 98uC for 10 s, 66uC for 20 s and 72uC for 4.5 min, followed by a final cycle of 72uC for 5 min. PCR conditions for HotStart Taq were: 94uC for 15 min, followed by 5 cycles of 94uC for 30 s, 60uC for 30 s and 72uC for 3 min and 30 cycles of 94uC for 30 s, 56uC for 30 s and 72uC for 3.5 min, followed by a final cycle of 72uC for 5 min. PCR products were analyzed on a 1% agarose gel stained with Sybr Safe Dye (Invitrogen) and a 100 bp ladder and 1 kb ladder (NEB).
PCR reactions at Louisiana State University were performed under the following conditions: initial denaturation at 94uC for 90 sec, followed by 32 cycles of denaturation at 94uC for 20 sec, annealing at 61uC for primers designed by pipeline or 57uC for other primer design for 20 sec, and extension at 72uC for 30 to 90 sec depending on the predicted PCR amplicon size. PCRs were terminated with a final extension at 72uC for 3 min. When LA-Taq DNA polymerase was used to amplify L1 insertions, the extension step of each cycle was carried out at 68u for 8 min, 30 sec, followed by a final extension step at 68u for 10 minutes at the end of the run. 20 ml of each PCR product were sizefractionated in a horizontal gel chamber on a 2% (Alu and SVA) or 1% (L1) agarose gel containing 0.1 mg/ml ethidium bromide for 60 minutes at 175 V or 1 hour/45 min at 150 V, respectively. UV-fluorescence was used to visualize the DNA fragments and images were saved using a BioRad ChemiDoc XRS imaging system (Hercules, CA).
An outcome from the validation experiments on the 86 geneinterupting MEI was a high false detection rate for candidate Alu insertions in close proximity to 7SLRNA annotations. Subsequently we reclassified all 22 Alu insertion candidates within 200 bp of a 7SLRNA as invalidated (Table S1).

Detection sensitivity
The two non-reference MEI detection methods use independent DNA libraries. So the overlap between the RP and SR are governed by the respective detection sensitivities, statistically akin to the Lincoln-Peterson method [76] used in ecological studies to estimate the size of a population based on two random capture and recapture samplings. This estimate assumes that the two algorithms are sensitive to the same type of events and that the difference between the event lists is a sampling issue. The expression for the detection respective detection sensitivities (e RP and e SR ) depends on the false detection rates (f RP and f SR ) provided by the validation experiments, the counts of loci detected by each method (n RP and n SR ), and the count of loci detected by both methods (n RP . SR ): Given detection sensitivities e RP and e SR from independent datasets and methods, the combined detection sensitivity (RP+SR) becomes: for samples in which both types of data were available (e.g. trio samples NA12878 and NA19240).

Genotyping methods
For reference MEI we used available genotypes calculated by GenomeSTRiP [39] for the 1000GP deletion call set. Genome-STRiP results were not readily available for non-reference MEI so we developed a simple Bayesian framework to estimate the posterior probability for each possible genotype. The posterior genotype probability is: where N ALT and N REF are the counts of fragments supporting the alternate and reference alleles respectively; g is the genotype (i.e. homozygous reference allele, heterozygous, homozygous insertion allele); P(g) is the prior probability for the genotype g (a flat prior was used, P(g) = 1/3); p g is the expected fraction of insertion fragments given a genotype g (i.e. p g = 0.5 for heterozygous insertions, p g ,0 for homozygous reference, and p g ,1 for homozygous insertions); P bin (N ALT ,N ALT +N REF ,p g ) is the binomial probability that N ALT +N REF fragments will fluctuate to N ALT , given an expected fraction p g . The called genotype for a given site is the genotype with the maximum posterior probability. The Bayesian framework also provides genotype likelihoods, which are used to construct genotyping quality metric (GQ) for each site and sample. The GQ value adopts the ''phred'' quality convention: Where P(g|N ALT ,N REF ) is the posterior probability for the called genotype from Eq. (5). GQ is highly dependent on the total number of supporting fragments (reference plus insertion). A selection of sites at GQ = 7 should correspond to roughly to 80% genotyping accuracy and corresponds to sites with 2 or more supporting fragments.

Allele frequency spectra
MEI loci with at least 25 genotyped samples per population (50 samples for the combined population spectra) were included in allele frequency spectra. Sites of GQ$7 non-reference MEI and of GQ$10 reference MEI were included. For loci with more than 25 genotyped samples, a random subset of 25 was used for the allele count spectra ( Figure 5). For the allele frequency spectra (Figure 6a-6b) we projected down to 25 samples according to the hypergeometric distribution [56,57] which smooths the spectrum while retaining all available information from loci with more than 25 genotyped samples. Hypergeometric projection was not used to build the allele count spectra used for fitting purposes because it introduces correlation among allele count bins. We constructed the allele count spectra for MEI events detected as insertions and those detected as deletions separately to account for the distinct ascertainment conditions before combining them into the aggregate spectrum. The combined spectrum includes corrections for respective detection and genotyping efficiencies: where n REF (i) and n NREF (i) are the counts of genotyped loci for reference (e.g. Figure 5b) and non-reference MEI (Figure 5a) at allele count i, K REF and K NREF are scaling factors for each detection mode (not dependent on i), and n MEI (i) is the net count of MEI variant loci at a given allele count i (Figure 5c-d). The correction factors depend on the detection sensitivity (e DET ) and genotyping efficiency (e GEN ) as K = (e DET ?e GEN ) 21 . Genotyping efficiency is simply the fraction of detected sites with 25 more genotyped samples (Table S9). Detection efficiency is described above (Detection specificity and sensitivity, Figure 3d). SNP allele frequency spectra (Figure 6b) were based on the 1000GP release VCF files (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ ftp/pilot_data/release/2010_07/) with no corrections. SNP allele frequency spectra were projected down to 50 samples using the hypergeometric distribution [56,57].

Functional calculation of suppression factor
Only non-reference MEI with insertion position confidence intervals entirely within annotated regions (Gene, UTR, CDS) were counted. No MEI that were subsequently invalidated were counted. Relative to random placement across the genome the MEI suppression or boost factor is defined as: where N tot is the total number of MEI loci, L tot = 2.85610 9 bp is the length of the accessible genome, L obs is the size of the region (1 MB or the sum of coding regions) where the number of observed MEI is N obs . The null model for MEI placement results in a binomially distributed N obs , which is generally not far from what we observe, except in the case of functional regions (suppressed) and HLA (hotspot). For the calculation of MEI inserted in CDS regions, only non-reference MEI were considered, since an embedded reference MEI precludes annotation as a coding sequence.

Heterozygosity
MEI and SNP heterozygosity for each sample were calculated from the counts of genotyped heterozygous sites. For MEI, the total numbers of genomic heterozygous sites were estimated with corrections for genotyping efficiency and detection sensitivity. The genotyping efficiency for a given sample is the fraction of detected loci with high quality (GQ$7 non-reference, of GQ$10 reference MEI) genotypes. There is also a sample specific correction for genotyping bias against heterozygotes at sites with limited fragment coverage: where the sum is over genotyped loci passing the GQ threshold for the given sample, N loci is the count of such sites, NF is the count of supporting fragments (both reference and insertion allele) at the site and binopdf is the binomial probability density function that a heterozgygous site will randomly produce only reference supporting fragments. The K HET correction was applied only to the nonreference MEI component because, for reference MEI detected as deletions, GenomeSTRiP used not just supporting fragment information for genotype likelihoods, but also used Beagle to impute missing data from linkage with local SNP haplotypes to identify heterozygous deletions. For each sample (s) the number of heterozygous MEI in the genome is estimated as: where p MEI (s) is the heterozygosity for sample s, p REF (s) and p NREF (s) are the raw counts of heterozygous sites for reference and non-reference MEI, e DET is the detection sensitivity, and e GEN (s) is the fraction of detected sites genotyped in the given sample ( Figure  S13, Table S9). SNP heterozygosity is derived from the raw counts of heterozygous sites. All values of heterozygosity are in units of heterozygous sites per genome, and the length of the genome is considered to be the accessible genome (2.85 Gb) [36]. The SNP heterozygosity values are transformed to rough estimates of the corresponding coalescent time (Figure 7b) [77]: where, m SNP = 1.8610 28 mutations per site per generation, and T GEN ,25 y is the average time between generations. Figure S1 Insertion coordinate convention. (EPS)    Figure S15 Genotype Matrix of low coverage samples. Each element in the matrix corresponds to a sample and a locus at which the genotype is color coded. Sample populations are labeled across the top, separated by green lines. The chromosome order for the MEI loci is labeled on the right side, with non-reference MEI (''insertions'') and reference MEI (''deletions'') grouped separately. This matrix was input to Principal Component Analysis for plotted in the main text Figure 6c ( Figure S16d). (EPS) Figure S16 Principal Component Analysis population clustering for PCR genotypes, MEI ins, MEI del, combined. A matrix of genotypes for each site and sample was input to a PCA and the resulting first two components are plotted against each other. The sum of insertion alleles is the value in the matrix elements. For elements corresponding to sites and samples without genotypes, the global average genotype value was used. a) Genotypes from PCR validation for the low coverage pilot. b) Genotypes from low coverage non-reference MEI only. c) Genotypes from reference MEI only. d) Genotypes from samples with both non-reference and reference MEI. Population clusters become tighter as more MEI insertion information is added to PCA. (EPS) Figure S17 Coalescent simulation allele frequency spectra for the combined CEU, YRI, CHB and JPT population groups. AF is binned in units of 0.1. The lowest bin (0-0.1) is not plotted to allow the spectra at higher AF to be compared. The normalizations for MEI detected as insertions (red) and deletions (green) are set to that the two components sum to the total unbiased MEI AFS (blue). (EPS) Figure S18 MEI insertion rate vs. coalescent time for increasing MEI site selection thresholds. The estimated MEI insertion rates (main text Eq.2) for each sample is plotted vs. the coalescent time derived from SNP heterozygosity. Panel a) is the same as Fig. 7b from the main text and corresponds to genotyped sites with GQ$7, which also corresponds to sites with at least two supporting fragments. As more supporting fragments are required b) NF$3, c) NF$5, d) NF$7, the numbers of genotyped sites decrease, but the trend between populations in the MEI insertion rates remains.

(EPS)
Table S1 Combined MEI event list (external Excel file). Genomic coordinates with confidence intervals are listed for each of the 7380 MEI loci. Each event is characterized by an element type (ELEMENT = Alu, L1, or SVA), element STRAND (+ or 2), detection (DET = DEL or INS for non-reference and reference MEI respectively), event ID, estimated insertion length (LEN), detection algorithm (ALG), validation status (VAL), validation method (VALMETH = PCR, ASM for assembly, 7SLRNA should be discarded due to proximity to annotated 7SLRNA element), population (POP = CEU, YRI, CHB, or JPT), allele frequency in three major groups (AF), number of genotyped samples in the three groups, number of insertion alleles in the three groups, previous study ID's (DBVARID, DBRIPID, PUBID), TSD length, number of insertion-supporting fragments from the 59 side (NALT5), from the 39 side (NALT3), the 1000 Genomes CALL SET name, quality value (Q), gene/exon/UTR/CDS interrupted (GENE), sub-family, and inserted sequence when available, and a list of all samples in which the alternate allele was detected (ALTSAMPLES). Note: 71 events identified by the VAL field as invalidated or in close proximity to a 7SLRNA loci are marked in yellow and were not included in the counts of interrupted genes, exons, UTRs, or CDS regions. (XLSX)    [23] found by this study is a lower limit to the detection sensitivity to common MEI alleles. a) MEI insertion detection sensitivity. b) MEI deletion sensitivity. b) MEI deletion sensitivity based on loci detected in the same samples from Mills et al. [47]. (XLSX)

Table S6
Trios (external Excel file). a) Overlap between RP and SR in the same trio samples (NA12878 and NA19240) can be used to estimate detection sensitivity. Columns RP and SR are the counts of all loci for the two samples broken down by element type. RP-only and SR-only count loci where only one method found the insertion. RP+SR is the count of loci deleted by both methods. The detection sensitivity estimates (e RP , e SR , and e) with corresponding statistical 1-sigma errors are derived from the overlaps. The combined detected efficiency is based on the union of the two independent methods. b) Counts of MEI site differences between two individuals. The trio samples were used for this because of the relatively high coverage and corresponding sensitivity to low frequency alleles. Corrections to the counts compensate for less-than-perfect detection sensitivity and false detections. The trio children from two populations (CEU and YRI) have the most differences (20346120) while the CEU parents have the fewest (6636120). The YRI parents' count of sites is between the other pairs. These differences are plotted vs. the corresponding coalescent time in Figure 6d (main text). c) De novo insertion hunt. Any MEI appearing in the children of the family trios but not in the parent would be a de novo MEI insertion. Six candidates from NA12878 (a) and 15 from NA19240 (b). All but one de novo candidate occurred at a site not found in any of the other samples. This site was PCR tested and identified in NA12892 (mother). (XLSX)