Conceived and designed the experiments: D Kural, C Stewart, MP Strömberg, GT Marth, MA Batzer, MK Konkel, JO Korbel. Performed the experiments: JA Walker, MK Konkel, AM Stütz, AE Urban, F Grubert, HYK Lam, JO Korbel. Analyzed the data: C Stewart, D Kural, MP Strömberg, MK Konkel, M Busby. Contributed reagents/materials/analysis tools: W-P Lee, MA Batzer, JA Walker, MK Konkel, AR Indap, E Garrison, D Kural, C Huff, J Xing, MP Snyder, LB Jorde. Wrote the paper: C Stewart, GT Marth. Developed the methods: MP Strömberg. Helped prepare the manuscript: MP Strömberg, M Busby, MA Batzer.
¶ Membership of the 1000 Genomes Project is listed in Text S1.
The authors have declared that no competing interests exist.
As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes of genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations, detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
We embarked on this study to explore the 1000 Genomes Project (1000GP) pilot dataset as a substrate for Mobile Element Insertion (MEI) discovery and analysis. MEI is already well known as a significant component of genetic variation in the human population. However, the full extent and effects of MEI can only be assessed by accurate detection in large whole-genome sequencing efforts such as the 1000GP. In this study we identified 7,380 distinct genomic locations of variant MEI and carried out rigorous validation experiments that confirmed the high accuracy of the detected events. We were able to measure the frequency of each variant in three continental population groups and found that inherited MEI variants propagate through populations in much the same way as single nucleotide polymorphisms, except that MEI are more strongly suppressed in protein coding parts of the genome. We also found evidence that the MEI mutation rate has not been constant over human population history; rather, different populations appear to have different characteristic MEI mutation rates.
Retrotransposons are endogenous genomic sequences that copy and paste into locations throughout host genomes
The Alu family is the most common mobile element in primate genomes, with more than 1.1 million copies in
Mobile element insertions (MEI) are known to generate significant structural variation within
MEI polymorphisms can be detected either as insertions or as deletions in samples relative to the reference genome. Mechanistically, however, both types of observations are due to retrotransposon insertion; precise excisions of mobile elements are essentially non-existent
Relative to previous studies, we present a broad analysis of MEI variation in the human population, with more variant loci detected, from the three major mobile element families, using multiple detection methods, each with comprehensive experimental validation (
 | Non-reference MEI | | | | | Reference MEI | | |
Detection method | Illumina RP | | 454 SR | | RP+SR | Combined deletion detection algorithms | | |
Dataset | Low Cov | Trio | Low Cov | Trio | Total | Low Cov | Trio | Total |
Number of samples | 138 | 6 | 22 | 2 | 156 | 169 | 6 | 175 |
Coverage per sample | 2.2x | 16.4x | 2.0x | 7.6x | 3.0x | 3x | 25x | 3.9x |
Alu insertions | 2882 | 1786 | 2420 | 1284 | 4500 | 1689 | 1420 | 1730 |
L1 insertions | 345 | 192 | 396 | 172 | 792 | 193 | 170 | 206 |
SVA insertions | 49 | 35 | 17 | 7 | 79 | 70 | 65 | 74 |
Loci PCR tested | 193 | 186 | 182 | 185 | 746 | - | - | - |
Loci validated | 183 | 182 | 173 | 174 | 712 | 1873 | 1615 | 1927 |
FDR (%) | 5.2±1.6 | 2.2±1.1 | 4.4±1.6 | 5.5±1.1 | 4.5±0.8 | - | - | - |
Number of samples, average read coverage, detected loci, and validation results are shown. Non-reference MEI false detection rates (FDR) were based on validation results at randomly selected loci. In addition to PCR validation, reference MEI were also tested for validation as deletions by local assembly. The FDR for reference MEI, including the additional MEI selection criteria, is estimated to be <10%.
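The false detection rates in the table can be approximated directly from the PCR counts. The sketch below assumes a simple binomial standard error, which is an assumption (the error model is not stated here); it reproduces the two Illumina RP columns, while the other columns may include corrections that are not modeled. The function name `false_detection_rate` is hypothetical.

```python
import math


def false_detection_rate(tested: int, validated: int) -> tuple[float, float]:
    """Return (FDR, standard error) as fractions for a PCR-tested call set."""
    if tested <= 0 or not 0 <= validated <= tested:
        raise ValueError("need tested > 0 and 0 <= validated <= tested")
    fdr = (tested - validated) / tested
    # Binomial standard error: an assumed error model, not taken from the text.
    se = math.sqrt(fdr * (1.0 - fdr) / tested)
    return fdr, se


# (loci PCR tested, loci validated) for the two Illumina RP columns of the table.
call_sets = {"RP low coverage": (193, 183), "RP trio": (186, 182)}

for name, (tested, validated) in call_sets.items():
    fdr, se = false_detection_rate(tested, validated)
    print(f"{name}: FDR = {100 * fdr:.1f}% +/- {100 * se:.1f}%")
```

For these two columns the printed values agree with the tabulated 5.2±1.6 and 2.2±1.1.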
We analyzed two whole-genome datasets produced by the 1000GP, the low coverage pilot dataset consisting of 179 individuals sequenced to ∼1–3X coverage and the trio pilot dataset consisting of two family trios sequenced to high, ∼15–40X coverage (
We developed two complementary methods for the detection of non-reference MEI, a read-pair constraint (RP) method applied to Illumina paired-end short read data, and a split-read (SR) method applied to the longer read data from Roche/454 pyrosequencing (
a) RP signature for non-reference MEI detection. The RP signature consists of Illumina read pairs spanning into the element from each side of the insertion. The RP event display shows a heterozygous Alu insertion allele on chromosome 22 from the trio pilot dataset. Fragment mapping quality is shown on the vertical scale. Horizontal grey lines show read pairs uniquely mapped at both ends with a mapped fragment length consistent with the sequence library; the blue and red lines are read pairs spanning into an Alu sequence from the 5′ and 3′ ends. The green vertical line is the position of the insertion. Thick black lines near the top show annotated Alu positions. Red and blue reads bracketing annotated elements are characteristic of mapping artifacts that we removed from insertion detection by masking out regions within a fragment length of an annotated element of the same family as the insertion. b) Signature for SR-based insertion detection. Split-mapped 454 reads span into the element sequence. The SR event display shows split reads spanning into an Alu insertion from the 5′ (blue) or the 3′ (red) sides. The vertical green line marks the insertion site. Fully mapped 454 reads are shown in gray. Gray reads that span the breakpoint correspond to the reference allele. Note that the mapping quality increases with the length of the split-mapped segment. The red and blue segments overlap by roughly 15 bp in the target site duplication region that brackets the MEI insertion. c) Overlap between non-reference MEI detected by RP and by SR. d) Overlap between detection methods for reference MEI. Of the 23 1000GP deletion call sets, 11 were RP and 4 were SR. Also shown are the relative proportions of events detected by assembly (yellow) and by read depth (gray), both of which had nearly 100% overlap with RP and SR calls. e) RP signature for reference MEI detection. Read pairs with abnormally long mapped fragment lengths (in green) span over an AluYb8 annotation. 
The event display shows RP evidence for a homozygous reference MEI in chromosome 22 from the trio dataset. The yellow line at the top marks homologous regions from the chimpanzee assembly, with a gap at the precise location of the variant MEI.
We applied the two methods to both 1000GP pilot datasets (
In addition to the 5,370 non-reference MEI, we identified 2,010 reference MEI detected as deletions of mobile elements in samples. The reference MEI events were selected from the full release set of 1000GP pilot deletions (n = 22025)
The complete set of 7,380 MEI calls is simply the combined set of reference and non-reference MEI over both pilot datasets (summarized in
a) MEI genomic distribution. Circos plot with non-reference MEI represented in blue and reference MEI in red. The outermost ring of chromosomes show the cytoband structure. The outer histogram displays counts of Alu polymorphisms in bins of 5 Mbp, the middle ring L1 polymorphisms in bins of 10 Mbp, and the innermost ring SVA polymorphisms in bins of 20 Mbp. The radial scale of the site counts is the same for each element type. b) MEI family breakdown. Non-reference MEI (blue) and reference MEI (red). c) Venn diagram of non-reference MEI from each pilot dataset. Most of the loci were detected from the low coverage dataset (dark grey). d) Venn diagram of reference MEI from each pilot dataset. e) Venn diagram of non-reference MEI from this study and other studies
The genomic proportions of the three mobile element families are 85±2% Alu, 12±2% L1, and 2.5±1% SVA (
The 1000GP catalog of MEI variant sites includes all 7,380 detected loci, including those matching MEI from other publications. Further comparisons among the recent MEI studies are provided in
We benchmarked each of the four non-reference MEI call sets (separate SR and RP call sets for the low coverage and trio pilot datasets) to assess detection sensitivity and specificity. As MEI are currently not suitable for microarray validation due to their highly repetitive sequence, all validations were done by locus-specific PCR. 200 loci were randomly selected from each of the four insertion call sets. Using an automated pipeline
All candidate loci with successful primer design were tested on two different population genetic panels (
a) Example of PCR gel chromatograph validation results. At this site, three of the 25 low coverage samples show two bands characteristic of heterozygous insertions. Two additional test samples (Pop80 and HeLa) also show the insertion allele. b) False detection rate estimates based on PCR experiments at random sites, broken down by element type (Alu, L1, SVA), algorithm (RP & SR), and dataset (LCP: low coverage pilot, TP: trio pilot). The false detection rate for Alu elements is uniformly <3% while the false detection rates for L1s and SVA element insertions approach 30%, with large error bars (95% confidence intervals) arising from relatively low statistics. c) Non-reference MEI detection overlap from trio samples NA12878 and NA19240. This level of overlap between two independent methods using independent sequence data corresponds to a detection sensitivity of roughly 70% for each algorithm and a combined detection sensitivity of 90% in these samples. d) Non-reference MEI detection sensitivity as a function of allele frequency in the low coverage dataset. PCR results for loci randomly selected from one method were used as a gold standard for the complementary method, and vice versa. PCR also provides an estimate of the allele frequency based on the 25 low coverage samples used for validation experiments. RP (blue) and SR (red) and the combined (black) detection sensitivities rise with frequency. One standard deviation confidence intervals are shown as shaded bars for the RP and SR algorithm, with black error bars for the combined RP+SR detection efficiency.
Following the validation of non-reference MEI, we assessed detection sensitivity. The primary challenge here was to find suitable gold standard non-reference MEI that should be present in our samples from which to assess sensitivity. We estimated sensitivity in three different ways, as a consistency check. First, we estimated sensitivity by using the high quality non-reference MEI from HuRef
A third approach to estimating the non-reference MEI detection sensitivity is based on the validation PCR genotypes in the low coverage dataset. Since the PCR loci were selected as random subsets for each RP and SR call set independently, the validated sites selected from SR events can be used as a gold standard to assess RP detection sensitivity, and vice-versa. Detection sensitivity as a function of allele frequency (
Regarding reference MEI detected as deletions, the overall validation rate from PCR and local assembly for the MEI component of deletions was 96%. This does not imply that the remaining 4% were false, only that the released set of deletions contained reference MEI detected by two high specificity algorithms with characteristic false detection rates less than 10%. These algorithms did not require additional validation evidence in the 1000GP release. A rough estimate for the false detection rate for the MEI component of deletions is therefore 0.4% (the 4% of calls without validation evidence multiplied by a false detection rate of at most 10%). The number of algorithms supporting a given call is another indicator of call quality. The average number of separate deletion calls (out of a maximum of 23 call sets) supporting events in the MEI subset was 7.8 while the average over all other deletions was 2.3 (
Detection sensitivity for reference MEI was estimated from the fractions of gold standard reference MEI identified by Xing et al. from HuRef
We characterized each detected MEI event (
a) Length of target site duplications bracketing the MEI sites. Different detection modes (top) and different element families (lower plot) exhibit similar distributions of target site duplication lengths. b) Alu sub-family breakdown of 1,105 assembled Alu non-reference insertions. Also shown are the Alu breakdowns from reference MEI (ref) from this study, as well as variant Alus found in the HuRef genome by Xing et al. AluYa5 is the most frequent polymorphic Alu sub-family.
Genotyping of non-reference MEI (
Genotyping accuracy for non-reference MEI is assessed by direct comparison to PCR validation genotypes in the same samples, and by testing for Mendelian errors in the trios and violations of Hardy-Weinberg Equilibrium in the low coverage data (
We estimated MEI allele frequencies from the count of high quality (GQ≥7 for non-reference and GQ≥10 for reference MEI) genotyped insertion alleles for each MEI locus. Allele frequencies were estimated from loci with at least 25 high quality genotypes for each continental population group. The two MEI detection modes (i.e. reference and non-reference insertions) have very different allele frequency spectra (
a–c) Uncorrected allele count spectra. Non-reference MEI (blue) and reference MEI (red): a) CEU, b) YRI, c) CHBJPT. Loci with 25 or more genotyped samples were included. A random subset of 25 samples was selected for any locus with more than 25 genotyped samples. Gray dashed lines are based on neutral model fits from the full MEI spectra, modified to account for the respective ascertainment conditions,
The allele count spectra were compared to the standard neutral model
All three element families have been combined into the allele count spectra shown in
a) Element family breakdown of the combined population allele frequency spectra. L1 and SVA are scaled up to allow comparison with the Alu spectrum. b) MEI and SNP allele frequency spectra across three population groups. The corresponding allele frequency spectra of SNPs relative to the ancestral genome from the 1000 Genomes low coverage pilot project are superimposed as dotted lines. The SNP spectra are scaled down by a factor of 500 for this comparison. c) Principal component analysis of MEI genotypes. CEU: blue; YRI: red; CHB: cyan; JPT: green. The first and second principal components are plotted. d) Total number of MEI between trio samples versus coalescent time based on SNP differences between the sample pairs.
We also analyzed population differentiation by applying principal component analysis to the matrix of allele counts across the low coverage pilot samples and loci (
Only 39 of the 5,370 non-reference MEI loci were located in exonic sequence, mostly in untranslated regions, and only 3 were found in coding exons (
 | Gene | UTR | CDS | Total |
Alu | 1438 | 32 | 2 | 4499 |
L1 | 249 | 4 | 0 | 792 |
SVA | 31 | 0 | 1 | 79 |
All MEI | 1718 | 36 | 3 | 5370 |
Expected | 2020 | 105 | 137 | - |
Suppression factor | 1.2 | 2.9 | 45.7 | |
Detected events subsequently invalidated by PCR are not counted. Expected counts of insertions were calculated according to random placement across the genome. The p-value that the observed number of CDS interrupting MEI is consistent with random placement is <10−50.
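The last row of the table can be checked with a short calculation. The sketch below assumes that the row of totals (1718, 36, 3) holds the observed counts, that the next row (2020, 105, 137) holds the counts expected under random placement, and that the final row is the ratio of expected to observed; this reading is an assumption that happens to reproduce the tabulated values, and `suppression_factor` is a hypothetical helper name.

```python
# Observed non-reference MEI counts per annotation class (totals row of the table).
observed = {"Gene": 1718, "UTR": 36, "CDS": 3}
# Counts expected under random placement across the genome (following row).
expected = {"Gene": 2020, "UTR": 105, "CDS": 137}


def suppression_factor(n_expected: float, n_observed: float) -> float:
    """Ratio of expected to observed insertions; values above 1 indicate depletion."""
    if n_observed <= 0:
        raise ValueError("observed count must be positive")
    return n_expected / n_observed


factors = {
    region: suppression_factor(expected[region], observed[region])
    for region in observed
}

for region, factor in factors.items():
    print(f"{region}: {factor:.1f}")
```

Under this reading, coding sequence shows by far the strongest depletion, consistent with the final row of the table.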
Low coverage pilot | Trio pilot | ||||||||
PCR genotypes | PCR genotypes | ||||||||
Sequenced genotypes | 0/0 | 0/1 | 1/1 | Sequenced genotypes | 0/0 | 0/1 | 1/1 | ||
0/0 | 2773 | 188 | 5 | 0/0 | 901 | 5 | 0 | ||
0/1 | 18 | 913 | 217 | 0/1 | 2 | 671 | 54 | ||
1/1 | 1 | 140 | 372 | 1/1 | 0 | 10 | 144 |
Low coverage pilot samples; trio pilot samples. Genotypes are listed in “VCF” convention: 0/0 homozygous reference, 0/1 heterozygous MEI, 1/1 homozygous MEI. For the low coverage validation, 23 samples at 333 sites were tested, while for the trio data all 6 samples were tested at 332 sites. The agreement for the low coverage data is 88.7% with 58% of the sites genotyped with GQ≥7. Genotype agreement for the trio pilot data was 96% with 90% genotyping efficiency.
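As an illustration of how the agreement and efficiency figures relate to the table, the sketch below recomputes them for the trio pilot half of the matrix. It assumes that agreement is the fraction of entries on the diagonal and that genotyping efficiency is the fraction of tested sample-site pairs that received a genotype; both definitions, and the function names, are assumptions rather than statements from the text.

```python
# Trio pilot genotype matrix: rows are sequenced genotypes, columns are PCR
# genotypes, both in the order 0/0, 0/1, 1/1.
trio = [
    [901, 5, 0],
    [2, 671, 54],
    [0, 10, 144],
]


def concordance(matrix: list[list[int]]) -> float:
    """Fraction of genotype pairs where sequencing and PCR agree."""
    total = sum(sum(row) for row in matrix)
    agree = sum(matrix[i][i] for i in range(len(matrix)))
    return agree / total


def genotyping_efficiency(matrix: list[list[int]], n_samples: int, n_sites: int) -> float:
    """Fraction of tested sample-site pairs with a sequenced genotype (assumed definition)."""
    return sum(sum(row) for row in matrix) / (n_samples * n_sites)


print(f"agreement:  {concordance(trio):.2f}")
print(f"efficiency: {genotyping_efficiency(trio, n_samples=6, n_sites=332):.2f}")
```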
The high-coverage trio data allows for the most precise estimates of the total number of MEI variants between pairs of individuals because of the high detection sensitivity. The number of pair-wise variant loci is calculated as the presence or absence of an insertion at a given locus, combining reference and non-reference MEI. We selected the two trio children (NA12878 and NA19240) for comparison between CEU and YRI individuals and the trio parents for comparison of individuals within the CEU and the YRI population groups. After corrections for detection sensitivity and false detection (
Previous estimates for the
MEI genotyping allows us to estimate MEI heterozygosity within each sample. We define heterozygosity as the count of heterozygous loci across the individual's genome. In a manner similar to the allele frequency analysis, heterozygosity is corrected for detection and genotyping efficiencies (
a) MEI
population | element | Θ [95% CI] | μ(θ) [95% CI] | χ2 | d. f. | Π [95% CI] | μ(π) [95% CI] |
all | MEI | 1860 [1540–2170] | 0.0464 [0.0384–0.0543] | 75.4 | 78 | 2160 [2130–2200] | 0.0499 [0.0490–0.0507] |
CEU | MEI | 1700 [1360–2040] | 0.0425 [0.0339–0.0510] | 52.3 | 39 | 2040 [2020–2070] | 0.0493 [0.0487–0.0499] |
YRI | MEI | 2240 [1690–2790] | 0.0559 [0.0421–0.0697] | 39.9 | 39 | 2480 [2430–2530] | 0.0488 [0.0478–0.0499] |
CHBJPT | MEI | 1550 [1220–1870] | 0.0387 [0.0306–0.0468] | 70.5 | 39 | 2030 [2000–2060] | 0.0533 [0.0525–0.0541] |
all | ALU | 1570 [1310–1830] | 0.0392 [0.0326–0.0457] | 83.9 | 78 | 1880 [1840–1910] | 0.0432 [0.0424–0.0439] |
CEU | ALU | 1440 [1150–1720] | 0.0359 [0.0289–0.0430] | 55.4 | 39 | 1770 [1750–1800] | 0.0428 [0.0422–0.0434] |
YRI | ALU | 1830 [1390–2270] | 0.0458 [0.0348–0.0569] | 43.4 | 39 | 2150 [2100–2200] | 0.0423 [0.0414–0.0433] |
CHBJPT | ALU | 1300 [1020–1570] | 0.0324 [0.0256–0.0391] | 86.5 | 39 | 1750 [1720–1780] | 0.046 [0.0453–0.0468] |
all | L1 | 224 [120–329] | 0.0056 [0.0030–0.0082] | 51.9 | 71 | 264 [257–270] | 0.0061 [0.0059–0.0062] |
CEU | L1 | 223 [100–346] | 0.0056 [0.0025–0.0086] | 49.6 | 38 | 243 [234–252] | 0.0059 [0.0057–0.0061] |
YRI | L1 | 326 [118–535] | 0.0082 [0.0029–0.0134] | 59.6 | 39 | 303 [292–314] | 0.006 [0.0057–0.0062] |
CHBJPT | L1 | 166 [70–262] | 0.0041 [0.0018–0.0066] | 49.7 | 39 | 251 [243–258] | 0.0066 [0.0064–0.0068] |
all | SVA | 80 [48–113] | 0.002 [0.0012–0.0028] | 15.4 | 39 | 55 | 0.0013 [0.0012–0.0014] |
CEU | SVA | 38 | 0.001 [0.0004–0.0014] | 10.4 | 27 | 51 | 0.0012 [0.0011–0.0013] |
YRI | SVA | 64 [26–101] | 0.0016 [0.0006–0.0025] | 11.2 | 24 | 61 | 0.0012 [0.0011–0.0013] |
CHBJPT | SVA | 46 | 0.0012 [0.0005–0.0018] | 12.5 | 27 | 55 | 0.0014 [0.0013–0.0015] |
MEI diversity parameter
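The Θ and μ(θ) columns of the table are linked by the standard neutral-model identity θ = 4·Ne·μ for a diploid population. The sketch below applies that conversion assuming an effective population size Ne of 10,000. That value is an assumption, chosen because it approximately reproduces the tabulated μ(θ) column; it is not stated in this text, and `insertion_rate` is a hypothetical helper name.

```python
# Assumed effective population size; not stated in the text.
EFFECTIVE_POPULATION_SIZE = 10_000


def insertion_rate(theta: float, n_e: int = EFFECTIVE_POPULATION_SIZE) -> float:
    """Convert a diversity parameter theta to a mutation rate via theta = 4 * Ne * mu."""
    if n_e <= 0:
        raise ValueError("effective population size must be positive")
    return theta / (4 * n_e)


# Theta estimates for all MEI families combined, taken from the table.
theta_by_population = {"all": 1860, "CEU": 1700, "YRI": 2240, "CHBJPT": 1550}

for population, theta in theta_by_population.items():
    print(f"{population}: mu = {insertion_rate(theta):.4f}")
```

Small differences from the tabulated values in the last digit are expected, since the Θ values in the table are themselves rounded.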
MEI alleles propagate within population groups much like other predominantly neutral polymorphisms. MEI allele frequency spectra from the low coverage samples are in general agreement with expectations from the standard neutral model for allele drift in a population. The major differences in allele frequency spectra between non-reference and reference MEI (
MEI allele frequencies were based on MEI detected and genotyped across three element families (Alu, L1, and SVA), from both non-reference and reference MEI, and multiple detection methods (RP and SR), each with characteristic detection sensitivities and false detection rates. Corrections for these effects, as well as genotyping efficiencies, were included in the allele frequency spectra.
Measurements of MEI heterozygosity offer a more direct method to estimate MEI insertion rates. Like the allele frequency spectrum, heterozygosity is dependent on accurate genotyping and includes corrections for efficiency losses, but in this case the corrections were made on a per sample basis, which is more specific since sample coverage is the dominant limitation for detection and genotyping power (
The question remains whether the differential MEI mutation rate between populations is driven by a shared increase of
Based on the global values for the diversity parameters
This study of the 1000GP pilot datasets is a sizable step toward a complete population-based catalog of common human MEI polymorphisms, made possible by targeting both non-reference and reference MEI events in the human genome. We identified 7,380 polymorphic mobile element insertions from the Alu, L1, and SVA families. Based on experimental validation of random subsets of loci we estimate that the false discovery rate in this study is less than 5%. Detection power for common alleles (allele frequency>10%) varies between non-reference MEI (70%–80%) and reference MEI (>90%). We were also able to assemble the inserted sequence for more than 1,000 non-reference Alu MEI and found consistent proportions of Alu sub-families in comparison to MEI identified in HuRef.
This comprehensive variant discovery and genotyping effort allowed us to directly compare the segregation properties of different variant types from the same dataset. Our analysis revealed that, to a first approximation, the evolution of MEI variants is similar to SNPs and consistent with neutral models
Both the SR and the RP methods were based on identification of non-reference MEI as clusters of mapped DNA fragments in which one end mapped to the consensus sequence of a mobile element while the other end was uniquely mapped to the reference genome in a location inconsistent with a known mobile element location in the reference (
The 2,010 reference MEI events are a subset of the 1000GP pilot release of 22,025 deletions
The deletion coordinates match an annotated Alu, L1, or SVA element
At least 75% of the deleted region corresponds to a gap in the chimpanzee genome assembly
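The two selection criteria above can be sketched as a simple filter. The record layout (`Deletion`, `element_family`, `chimp_gap_fraction`) is hypothetical, the example records are invented for illustration, and how the annotation match and the chimpanzee gap overlap are computed upstream is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

MEI_FAMILIES = {"Alu", "L1", "SVA"}


@dataclass
class Deletion:
    """Hypothetical deletion record with precomputed annotation features."""
    deletion_id: str
    # Family of the annotated element whose coordinates match the deletion, if any.
    element_family: Optional[str]
    # Fraction of the deleted region that falls in a chimpanzee assembly gap.
    chimp_gap_fraction: float


def is_reference_mei(deletion: Deletion, min_gap_fraction: float = 0.75) -> bool:
    """Apply both criteria: annotated element match and chimpanzee gap overlap."""
    return (
        deletion.element_family in MEI_FAMILIES
        and deletion.chimp_gap_fraction >= min_gap_fraction
    )


# Invented example records, for illustration only.
deletions = [
    Deletion("del_1", "Alu", 0.98),   # passes both criteria
    Deletion("del_2", "L1", 0.40),    # element match, but too little gap overlap
    Deletion("del_3", None, 0.90),    # no matching mobile element annotation
]

reference_mei = [d.deletion_id for d in deletions if is_reference_mei(d)]
print(reference_mei)
```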
Non-reference MEI detected by the SR and RP methods were merged according to a 100 bp matching window around the leftmost insertion coordinates. To assess call set intersections between this study and other published lists of non-reference MEI, we used a matching window of 200 bp around each insertion position. We adopted the ‘leftmost’ coordinate convention (
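A minimal sketch of window-based merging of RP and SR calls on a single chromosome follows. It assumes that two calls match when their leftmost coordinates differ by at most the window size, that each SR call is merged with at most one RP call, and that the smaller coordinate is reported for a merged call; these choices and the function name `merge_calls` are assumptions, not the authors' implementation.

```python
import bisect


def merge_calls(rp_positions, sr_positions, window=100):
    """Merge RP and SR insertion coordinates that lie within `window` bp of each other.

    Returns a sorted list of (position, source) tuples, where source is
    "RP", "SR", or "RP+SR".
    """
    rp = sorted(rp_positions)
    sr = sorted(sr_positions)
    used = set()  # indices of SR calls already merged with an RP call
    merged = []
    for pos in rp:
        # First SR call that could fall inside [pos - window, pos + window].
        i = bisect.bisect_left(sr, pos - window)
        best = None
        while i < len(sr) and sr[i] <= pos + window:
            if i not in used and (best is None or abs(sr[i] - pos) < abs(sr[best] - pos)):
                best = i
            i += 1
        if best is None:
            merged.append((pos, "RP"))
        else:
            used.add(best)
            merged.append((min(pos, sr[best]), "RP+SR"))
    merged.extend((p, "SR") for j, p in enumerate(sr) if j not in used)
    return sorted(merged)


# Invented coordinates, for illustration only.
calls = merge_calls([1000, 5000, 9000], [1040, 7000, 9150])
print(calls)
```

The same function with `window=200` would correspond to the looser matching used for comparison with other published call sets.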
For SR detection the relevant coverage statistic is 454 base coverage: the count of aligned reads covering a given base, averaged across the accessible genome. For RP detection the driving coverage statistic is Illumina read-pair spanning coverage: the count of fragments in which the non-sequenced segment of the fragment between the reads covers a given base, averaged across the genome (
The four non-reference MEI event lists (
For loci with ambiguous PCR results, no amplification, or amplification of only the empty insertion site, a second primer pair was designed. For the primer design, 600 bp of flanking sequence on either side of the insertion site was retrieved from genome.ucsc.edu using Galaxy. Alu elements within the flanking sequence were masked to “N” using RepeatMasker (repeatmasker.org). Primers were designed with BatchPrimer3 v2.0 in the flanking sequence, leaving at least 100 bp before and after the predicted insertion site. Next, all primers were tested with BLAT to determine the number of matches in the human genome. If one primer of a primer pair matched several times and the other primer was unique, a virtual PCR was performed. Primer combinations with one predicted PCR product were tested on our panel. Otherwise primers were designed manually (if possible) after repeat-masking the flanking sequence with the complete repeat library.
In addition, for L1 and SVA loci without unambiguous PCR amplification, primers were designed, placing one primer within the 3′ end of the mobile element sequence
A subset of 25 DNA samples from the low coverage pilot samples and all six trio samples were used in PCR validations (
In addition to the subset of 25 individuals used for the low coverage pilot validations, four more DNA samples from the low coverage pilot dataset were obtained for subsequent experiments. DNA samples NA12872, NA12814, NA12815 and NA12044 (CEPH/Utah USA) were purchased from the Coriell Institute for Medical Research. All 35 samples (25+6+4) were used for PCR validations associated with MEI events detected specifically in exons.
PCR amplifications were performed in 25 µl reactions in a 96-well format using either a Perkin Elmer GeneAmp 9700 or a BioRad i-cycler thermo-cycler. Each reaction contained 15–50 ng of template DNA; 200 nM of each oligonucleotide primer; 1.5 mM MgCl2, 1× PCR buffer (50 mM KCl; 10 mM TrisHCl, pH 8.3); 0.2 mM dNTPs; and 1–2 U
Full-length L1 and SVA elements typically exceed the limitations of standard DNA
PCR experiments were carried out in three different laboratories yielding similar success rates. At EMBL, PCRs were performed using 10 ng of NA12878 genomic DNA (Coriell) in 20 µl volumes in a C1000 thermocycler (BioRad). Two different enzymes, iProof High Fidelity DNA Polymerase (BioRad) and Hotstart Taq (Qiagen), were used, with comparable results. PCR conditions for iProof were: 98°C for 1 min, followed by 5 cycles of 98°C for 10 s, 68°C for 20 s and 72°C for 4 min and 30 cycles of 98°C for 10 s, 66°C for 20 s and 72°C for 4.5 min, followed by a final cycle of 72°C for 5 min. PCR conditions for HotStart Taq were: 94°C for 15 min, followed by 5 cycles of 94°C for 30 s, 60°C for 30 s and 72°C for 3 min and 30 cycles of 94°C for 30 s, 56°C for 30 s and 72°C for 3.5 min, followed by a final cycle of 72°C for 5 min. PCR products were analyzed on a 1% agarose gel stained with Sybr Safe Dye (Invitrogen) and a 100 bp ladder and 1 kb ladder (NEB).
PCR reactions at Louisiana State University were performed under the following conditions: initial denaturation at 94°C for 90 sec, followed by 32 cycles of denaturation at 94°C for 20 sec, annealing at 61°C for primers designed by pipeline or 57°C for other primer design for 20 sec, and extension at 72°C for 30 to 90 sec depending on the predicted PCR amplicon size. PCRs were terminated with a final extension at 72°C for 3 min. When LA-
An outcome from the validation experiments on the 86 gene-interrupting MEI was a high false detection rate for candidate Alu insertions in close proximity to 7SLRNA annotations. Subsequently we reclassified all 22 Alu insertion candidates within 200 bp of a 7SLRNA as invalidated (
The two non-reference MEI detection methods use independent DNA libraries, so the overlap between the RP and SR call sets is governed by the respective detection sensitivities, statistically akin to the Lincoln-Petersen method
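The capture-recapture reasoning can be sketched as follows. The sensitivity of one method is estimated as the fraction of the other method's calls that it recovers, and the combined sensitivity follows if the two methods miss events independently. The counts in the example are invented to illustrate sensitivities near 70%, and the function name and the independence assumption are not taken from the text.

```python
def lincoln_petersen(n_rp: int, n_sr: int, n_both: int) -> dict:
    """Estimate per-method and combined sensitivity from two independent call sets.

    n_rp, n_sr: number of true events detected by each method.
    n_both: number of events detected by both methods.
    """
    if n_both <= 0 or n_both > min(n_rp, n_sr):
        raise ValueError("need 0 < n_both <= min(n_rp, n_sr)")
    sens_rp = n_both / n_sr  # fraction of SR calls also found by RP
    sens_sr = n_both / n_rp  # fraction of RP calls also found by SR
    return {
        "sensitivity_rp": sens_rp,
        "sensitivity_sr": sens_sr,
        # An event is missed overall only if both methods miss it (independence assumed).
        "sensitivity_combined": 1.0 - (1.0 - sens_rp) * (1.0 - sens_sr),
        # Lincoln-Petersen estimate of the total number of events present.
        "estimated_total": n_rp * n_sr / n_both,
    }


# Invented counts, for illustration only.
estimate = lincoln_petersen(n_rp=700, n_sr=700, n_both=490)
print(estimate)
```

With two methods at roughly 70% sensitivity each, the combined sensitivity under independence is about 91%, in line with the roughly 90% quoted for the trio samples.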
For reference MEI we used available genotypes calculated by GenomeSTRiP
MEI loci with at least 25 genotyped samples per population (50 samples for the combined population spectra) were included in allele frequency spectra. Sites of GQ≥7 non-reference MEI and of GQ≥10 reference MEI were included. For loci with more than 25 genotyped samples, a random subset of 25 was used for the allele count spectra (
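The inclusion and down-sampling rule can be sketched as below. The input layout (one list per locus holding each sample's insertion allele count, with `None` for genotypes that fail the quality threshold) and the function names are hypothetical, and excluding loci with no insertion allele in the subset is an assumption.

```python
import random
from collections import Counter


def allele_count(genotypes, n_samples=25, rng=None):
    """Insertion allele count over exactly n_samples high-quality genotypes.

    genotypes: per-sample allele counts (0, 1, or 2), or None if low quality.
    Returns None if fewer than n_samples genotypes pass.
    """
    called = [g for g in genotypes if g is not None]
    if len(called) < n_samples:
        return None
    rng = rng or random.Random(0)
    # Down-sample to a random subset when more than n_samples are available.
    subset = rng.sample(called, n_samples) if len(called) > n_samples else called
    return sum(subset)


def allele_count_spectrum(loci, n_samples=25, seed=0):
    """Histogram of allele counts across loci: {allele count: number of loci}."""
    rng = random.Random(seed)
    spectrum = Counter()
    for genotypes in loci:
        count = allele_count(genotypes, n_samples, rng)
        if count:  # skips excluded loci (None) and loci with no insertion allele (0)
            spectrum[count] += 1
    return spectrum


# Invented genotypes, for illustration only.
loci = [
    [1] + [0] * 24,         # singleton among 25 genotyped samples
    [2] * 25,               # insertion allele in every chromosome
    [1] * 10 + [None] * 20, # too few high-quality genotypes, excluded
    [1] * 30,               # 30 genotyped samples, down-sampled to 25
]
spectrum = allele_count_spectrum(loci)
print(dict(spectrum))
```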
Only non-reference MEI with insertion position confidence intervals entirely within annotated regions (Gene, UTR, CDS) were counted. No MEI that were subsequently invalidated were counted. Relative to random placement across the genome the MEI suppression or boost factor is defined as:
MEI and SNP heterozygosity for each sample were calculated from the counts of genotyped heterozygous sites. For MEI, the total numbers of genomic heterozygous sites were estimated with corrections for genotyping efficiency and detection sensitivity. The genotyping efficiency for a given sample is the fraction of detected loci with high quality (GQ≥7 non-reference, or GQ≥10 reference MEI) genotypes. There is also a sample specific correction for genotyping bias against heterozygotes at sites with limited fragment coverage:
The SNP heterozygosity values are transformed to rough estimates of the corresponding coalescent time (
Insertion coordinate convention.
(EPS)
Number of deletion call sets supporting reference MEI locus. The average number of deletion call sets supporting MEI events is about eight (blue) while for all deletions in the 1000GP release (gray dashed line) the average number of calls was about three. The peak at ten call sets for Alu MEI deletions corresponds to the eight Illumina RP based call sets (BC, Wash U, WTSI, for both pilots, Broad for pilot 1 and U.Wash for pilot 2) and two SR call sets (Pindel for both pilots).
(EPS)
UCSC browser display of reference MEI. (top) The deletion (red track with 1000GP deletion id's P1_M_061510_12_213 for low coverage pilot and P2_M_061510_12_22 from the trio pilot) matches to the annotated AluYg6 element at chr12:8516855–8517156, present in the NCBI36 reference sequence but missing in the sequenced sample. The black RepeatMasker track shows that the AluYg6 element matches the deletion start and end coordinates. The green tracks indicate the extent of the chimpanzee assembly, which does not include the AluYg6 element. The blue DGV tracks show that this particular deletion has been previously identified by several experiments with various degrees of position resolution. (bottom) Example of questionable reference MEI. The blue track at the top marks a detected deletion (id P2_M_061510_3_301) at chromosome 3, 60,660,331 bp that overlaps >50% with a short annotated L1HS element, but the start and end coordinates do not match precisely. The chimpanzee genome (in yellow) has a gap in the region, but the edges do not align precisely. This deletion was included in the count of 2,010 reference MEI, but adds to the level of uncertainty.
(TIFF)
1000 Genomes Project pilot sample breakdown. a) Venn diagram of pilot samples by sequencing platform (Illumina and 454 only). The bulk of the samples were sequenced by Illumina. The circle areas are only roughly proportional to the number of samples contained. b) Venn diagram of samples used for MEI detection (left) and genotyping (right). MEI detected as insertions (red) and deletions (blue) have different signatures and algorithms resulting in the difference between the samples used.
(TIFF)
Illumina paired end fragment length distributions. Left) Low coverage pilot fragment length distributions for a random selection of 20 lanes of Illumina read pair data. Most libraries have a median fragment length from 100 to 300 bp with a wide variety of shapes. Right) Trio pilot fragment length distributions for 130 lanes of Illumina read pair data for NA12878. Five libraries are shown in different colors with different characteristic shapes. The small peak visible in orange at 550 bp is shifted by 300 bp from the main peak. This small peak arises from reference Alu insertions of length 300 bp. This small Alu peak occurs for all libraries in both pilots.
(EPS)
MEI insertion sensitivity vs. coverage for the two methods. Coverage for the RP method is quantified as “span” coverage on the blue scale. Span coverage is calculated based on the fragment gap between the reads at the end of the fragment where RP detection is sensitive to large structural variations. The SR algorithm sensitivity depends on read coverage (red scale at the top) because the insertion can be detected anywhere within a given read (except within 20 bp of the ends). The detection sensitivity at maximum coverage is determined by the trio overlap calculations from
(EPS)
Non-reference MEI insertion breakpoint resolution. (top) the position residual between matched RP to SR insertions. (bottom) 1000GP loci vs. dbRIP. The dbRIP hg18 coordinates were shifted by TSD such that both lists adopt the ‘leftmost’ coordinate convention.
(EPS)
Venn diagrams of MEI insertion overlap with recent studies. (top) L1 overlap with Ewing and Kazazian
(EPS)
Genomic distance to nearest element of the same family. (top) Non-reference MEI. 1000GP and HuRef distributions are plotted as well as L1 distances for Ewing and Kazazian
(EPS)
Insertion position resolution comparison. Non-reference MEI were matched to dbRIP using a 200 bp window.
(EPS)
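The window matching described in the caption above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes positions on a single chromosome, interprets the 200 bp window as an absolute distance of at most 200 bp (the caption does not say whether the window is one-sided or centered), and uses a hypothetical function name and toy coordinates.

```python
import bisect

def nearest_within(pos, sorted_ref, window=200):
    """Return the reference position closest to `pos`, or None if the
    closest one is more than `window` bp away (assumed interpretation)."""
    i = bisect.bisect_left(sorted_ref, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_ref[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda r: abs(r - pos), default=None)
    if best is None or abs(best - pos) > window:
        return None
    return best

# Hypothetical toy coordinates on one chromosome.
dbrip_positions = sorted([1000, 5000])
called_positions = [900, 1150, 3000, 5300]

matches = {p: nearest_within(p, dbrip_positions) for p in called_positions}
# Position residual (called minus reference) for the matched loci only.
residuals = {p: p - m for p, m in matches.items() if m is not None}
```

The residuals computed this way are the quantity one would histogram to compare position resolution between call sets.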
Number of MEI per 1 Mb binned regions across the genome. (top) Dotted gray line is a simple Poisson model for MEI distributed uniformly across the accessible genome (2.85 Gb). The red arrow points to a significant hotspot on chromosome 6 at position 33 Mb in the HLA region, where 19 MEI were detected in a 1 Mb region. (bottom) MEI density profile across chromosome 6 showing the spike in the HLA region at 33 Mb.
(EPS)
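As a rough illustration of the uniform Poisson model mentioned in the caption above, the sketch below computes the mean number of MEI per 1 Mb bin and the probability that a single bin holds 19 or more. It assumes that all 7,380 catalogued loci are spread over the 2.85 Gb accessible genome; the caption does not state which MEI count the model curve uses, so that rate is an assumption, and the function names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson with mean lam (log-space)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_tail(k_min, lam, n_terms=200):
    """Probability of k_min or more events, summed upward to avoid cancellation."""
    return sum(poisson_pmf(k, lam) for k in range(k_min, k_min + n_terms))

# Assumed inputs: total loci in the catalog and the accessible genome size.
N_MEI = 7380
ACCESSIBLE_MB = 2850  # 2.85 Gb expressed as a number of 1 Mb bins

lam = N_MEI / ACCESSIBLE_MB            # mean MEI per 1 Mb bin
p_hotspot = poisson_tail(19, lam)      # chance that one given bin holds >= 19
expected_bins = p_hotspot * ACCESSIBLE_MB  # expected number of such bins

print(f"mean per bin: {lam:.2f}")
print(f"P(>=19 in one bin): {p_hotspot:.2e}")
print(f"expected bins with >=19 genome-wide: {expected_bins:.2e}")
```

Under these assumptions a bin with 19 MEI lies far out in the tail of the uniform model, which is the sense in which the HLA region stands out as a hotspot.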
MEI insertion length. a) Comparison of insertion lengths with 617 dbRIP assembled MEI insertions that match 1000 Genomes MEI using a 200 bp window around insertion position. b) MEI insertion length residual distribution. c) The insertion length from MEI deletions (red) is the number of reference nucleotides in the deleted region (the annotated mobile element plus one copy of the TSD and any carry-over sequence). Sharp peaks at 300 bp and 6000 bp are the Alu and L1 insertions respectively. The insertion length for MEI detected as insertions (blue) is estimated from the span of the mapping coordinates within the mobile element. This estimate does not take into account any inserted sequence that is not part of the mobile element such as the TSD, poly-A tail, or carry-over sequence.
(EPS)
Genotyping efficiency. top) Fraction of MEI sites surviving genotype quality thresholds in low coverage data for non-reference MEI (blue steps, GQ≥7) and for reference MEI (red, GQ≥10). Also shown is genotype accuracy based on validation experiments for non-reference MEI (dashed with grey 95% confidence interval). bottom) Sample-by-sample fraction of MEI sites surviving the genotype quality threshold vs. coverage in low coverage samples. Non-reference MEI (crosses) show a genotyping efficiency approaching 60% at 4 fragments/base spanning coverage, while reference MEI (circles) genotyping efficiency is nearly flat at 80%. Samples from the three population groups show the same trends. Coverage here is calculated as spanning coverage, most relevant for RP detection.
(EPS)
Hardy-Weinberg Equilibrium (HWE) test. Proportions of each genotype as a function of allele frequency for each population group (blue: CEU, red: YRI, green: CHB+JPT). The proportions expected from HWE are plotted as gray dashed lines for comparison.
(EPS)
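The expected Hardy-Weinberg proportions referred to in the caption above follow from the insertion-allele frequency p as (1−p)², 2p(1−p), and p². A minimal sketch of the comparison at one locus is given below; the genotype counts are hypothetical and the function names are illustrative, not taken from the study.

```python
def hwe_proportions(p):
    """Expected genotype proportions for insertion-allele frequency p."""
    q = 1.0 - p
    return {"0/0": q * q, "0/1": 2.0 * p * q, "1/1": p * p}

def observed_proportions(n_ref, n_het, n_ins):
    """Allele frequency and observed genotype proportions from genotype counts."""
    n = n_ref + n_het + n_ins
    p = (n_het + 2 * n_ins) / (2 * n)
    return p, {"0/0": n_ref / n, "0/1": n_het / n, "1/1": n_ins / n}

# Hypothetical genotype counts at a single locus in one population group.
p, observed = observed_proportions(n_ref=30, n_het=20, n_ins=10)
expected = hwe_proportions(p)

for gt in ("0/0", "0/1", "1/1"):
    print(gt, f"observed={observed[gt]:.3f}", f"expected={expected[gt]:.3f}")
```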
Genotype matrix of low coverage samples. Each element in the matrix corresponds to a sample and a locus at which the genotype is color coded. Sample populations are labeled across the top, separated by green lines. The chromosome order for the MEI loci is labeled on the right side, with non-reference MEI (“insertions”) and reference MEI (“deletions”) grouped separately. This matrix was the input to the Principal Component Analysis plotted in the main text
(EPS)
Principal Component Analysis population clustering for PCR genotypes, MEI ins, MEI del, combined. A matrix of genotypes for each site and sample was input to a PCA and the resulting first two components are plotted against each other. The sum of insertion alleles is the value in the matrix elements. For elements corresponding to sites and samples without genotypes, the global average genotype value was used. a) Genotypes from PCR validation for the low coverage pilot. b) Genotypes from low coverage non-reference MEI only. c) Genotypes from reference MEI only. d) Genotypes from samples with both non-reference and reference MEI. Population clusters become tighter as more MEI insertion information is added to PCA.
(EPS)
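The matrix preparation described in the caption above (insertion-allele counts as matrix values, with the global average substituted for missing genotypes) can be sketched as follows. This covers only the encoding and fill step that precedes the PCA, under the assumption that genotypes arrive as VCF-style strings; the function names and the toy input are hypothetical.

```python
def encode_genotype(gt):
    """Map a VCF-style genotype to its count of insertion alleles.
    Anything other than 0/0, 0/1, or 1/1 is treated as missing (None)."""
    return {"0/0": 0, "0/1": 1, "1/1": 2}.get(gt)

def build_matrix(genotypes):
    """Encode genotypes and replace missing entries with the global mean."""
    coded = [[encode_genotype(gt) for gt in row] for row in genotypes]
    known = [v for row in coded for v in row if v is not None]
    global_mean = sum(known) / len(known)
    filled = [[global_mean if v is None else float(v) for v in row]
              for row in coded]
    return filled, global_mean

# Hypothetical toy input: rows are loci, columns are samples, "./." is missing.
toy_genotypes = [
    ["0/0", "0/1", "./."],
    ["1/1", "./.", "0/1"],
]
matrix, mean_value = build_matrix(toy_genotypes)
print(matrix, mean_value)
```

Filling with a single global value keeps every site and sample in the analysis without favoring any one population, at the cost of pulling poorly genotyped samples toward the center of the plot.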
Coalescent simulation allele frequency spectra for the combined CEU, YRI, CHB and JPT population groups. AF is binned in units of 0.1. The lowest bin (0–0.1) is not plotted to allow the spectra at higher AF to be compared. The normalizations for MEI detected as insertions (red) and deletions (green) are set so that the two components sum to the total unbiased MEI AFS (blue).
(EPS)
MEI insertion rate vs. coalescent time for increasing MEI site selection thresholds. The estimated MEI insertion rate (main text Eq. 2) for each sample is plotted vs. the coalescent time derived from SNP heterozygosity. Panel a) is the same as
(EPS)
Combined MEI event list (external Excel file). Genomic coordinates with confidence intervals are listed for each of the 7,380 MEI loci. Each event is characterized by an element type (ELEMENT = Alu, L1, or SVA), element STRAND (+ or −), detection (DET = DEL or INS for reference and non-reference MEI respectively), event ID, estimated insertion length (LEN), detection algorithm (ALG), validation status (VAL), validation method (VALMETH = PCR, ASM for assembly, or 7SLRNA for events that should be discarded because of proximity to an annotated 7SLRNA element), population (POP = CEU, YRI, CHB, or JPT), allele frequency in three major groups (AF), number of genotyped samples in the three groups, number of insertion alleles in the three groups, previous study IDs (DBVARID, DBRIPID, PUBID), TSD length, number of insertion-supporting fragments from the 5′ side (NALT5) and from the 3′ side (NALT3), the 1000 Genomes CALL SET name, quality value (Q), gene/exon/UTR/CDS interrupted (GENE), sub-family, inserted sequence when available, and a list of all samples in which the alternate allele was detected (ALTSAMPLES). Note: 71 events identified by the VAL field as invalidated or in close proximity to a 7SLRNA locus are marked in yellow and were not included in the counts of interrupted genes, exons, UTRs, or CDS regions.
(XLSX)
Samples with corresponding sequence coverage
(XLSX)
Reference MEI detection method breakdown. (external Excel file)
(XLSX)
Validation genotypes for non-reference MEI datasets (external Excel file). Complete genotyping information for all samples tested at the 746 sites used for false detection rate estimates and for genotyping assessment. a) Additional validation results for non-reference MEI loci (external Excel file) Genome coordinates for 267 additional validation PCR experiments carried out at Yale, EMBL, and LSU. These experiments were done as preliminary tests (EMBL, Yale, LSU-PRELIM) and for testing specific loci (SVA,
(XLSX)
MEI sensitivity based on comparison to gold standard events. (external Excel file) The fraction of HuRef MEI
(XLSX)
Trios (external Excel file). a) Overlap between RP and SR in the same trio samples (NA12878 and NA19240) can be used to estimate detection sensitivity. Columns RP and SR are the counts of all loci for the two samples broken down by element type. RP-only and SR-only count loci where only one method found the insertion. RP+SR is the count of loci detected by both methods. The detection sensitivity estimates (εRP, εSR, and ε) with corresponding statistical 1-sigma errors are derived from the overlaps. The combined detection efficiency is based on the union of the two independent methods. b) Counts of MEI site differences between two individuals. The trio samples were used for this because of the relatively high coverage and corresponding sensitivity to low frequency alleles. Corrections to the counts compensate for less-than-perfect detection sensitivity and false detections. The trio children from two populations (CEU and YRI) have the most differences (2034±120) while the CEU parents have the fewest (663±120). The YRI parents' count of sites is between the other pairs. These differences are plotted vs. the corresponding coalescent time in
(XLSX)
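One common way to turn the overlap counts in part a) into sensitivity estimates is the two-method (capture-recapture style) calculation sketched below. It assumes the two methods detect loci independently and uses simple binomial 1-sigma errors; the exact formulas used for the table are not reproduced here, and the counts and function name are hypothetical.

```python
import math

def overlap_sensitivity(n_rp_only, n_sr_only, n_both):
    """Estimate per-method and combined detection sensitivity from overlaps,
    assuming the RP and SR methods detect loci independently."""
    n_rp = n_rp_only + n_both   # all loci found by RP
    n_sr = n_sr_only + n_both   # all loci found by SR
    eff_rp = n_both / n_sr      # fraction of SR loci that RP also found
    eff_sr = n_both / n_rp      # fraction of RP loci that SR also found
    # A locus is missed by the union only if both methods miss it.
    eff_union = 1.0 - (1.0 - eff_rp) * (1.0 - eff_sr)
    # Binomial 1-sigma errors on the per-method estimates (an assumption).
    err_rp = math.sqrt(eff_rp * (1.0 - eff_rp) / n_sr)
    err_sr = math.sqrt(eff_sr * (1.0 - eff_sr) / n_rp)
    return {"eff_rp": eff_rp, "err_rp": err_rp,
            "eff_sr": eff_sr, "err_sr": err_sr,
            "eff_union": eff_union}

# Hypothetical counts for one element type in one trio sample.
result = overlap_sensitivity(n_rp_only=100, n_sr_only=300, n_both=600)
print(result)
```

With these toy counts the union estimate agrees with the ratio of detected loci (1,000) to the implied true total (700 × 900 / 600 = 1,050), which is a useful consistency check on the independence assumption.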
Sub-family breakdown (external Excel file). Fragments from 1,105 of the Alu insertions were assembled into contigs spanning the Alu element to allow subfamily identification. The subfamilies are compared with those from the reference MEI detected as deletions and to the Venter MEI.
(XLSX)
Non-reference MEI genotyping validation (external Excel file). Genotype contingency table for non-reference MEI vs. genotypes from PCR validation experiments. “0/0” are homozygous reference, “0/1” are heterozygous insertions, and “1/1” are homozygous insertions (VCF file genotype label convention). Counts in each box are the numbers of sites and samples with the corresponding combination of genotype from sequencing and PCR. The overall genotyping accuracy is the fraction of counts on the diagonal, while the genotyping efficiency is the number of genotyped site and sample combinations divided by sites×samples for the given pilot dataset. Only genotypes with Q≥7 are included. The low coverage (a) accuracy is 87% and the efficiency is 57%. The trio pilot (b) accuracy is 95.7% and the genotyping efficiency is 89.9%. The improved genotyping performance for the trio pilot is a consequence of higher coverage.
(XLSX)
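The two summary numbers defined in the caption above can be computed from a contingency table as in the sketch below. The table values, site and sample counts, and function names are hypothetical and chosen only to show the arithmetic; they are not the values from the study.

```python
def genotyping_accuracy(table):
    """Fraction of counts on the diagonal of a {(seq_gt, pcr_gt): count} table."""
    total = sum(table.values())
    agree = sum(n for (seq_gt, pcr_gt), n in table.items() if seq_gt == pcr_gt)
    return agree / total

def genotyping_efficiency(n_genotyped, n_sites, n_samples):
    """Genotyped site-sample combinations divided by all possible combinations."""
    return n_genotyped / (n_sites * n_samples)

# Hypothetical contingency table: (sequencing genotype, PCR genotype) -> count.
toy_table = {
    ("0/0", "0/0"): 50, ("0/0", "0/1"): 4,  ("0/0", "1/1"): 0,
    ("0/1", "0/0"): 3,  ("0/1", "0/1"): 30, ("0/1", "1/1"): 2,
    ("1/1", "0/0"): 0,  ("1/1", "0/1"): 1,  ("1/1", "1/1"): 10,
}

accuracy = genotyping_accuracy(toy_table)
efficiency = genotyping_efficiency(n_genotyped=100, n_sites=20, n_samples=10)
print(f"accuracy={accuracy:.2f} efficiency={efficiency:.2f}")
```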
MEI genotyping corrections. (external Excel file). a) Detection sensitivity. b) Genotyping efficiency with correction factors used in constructing the allele frequency spectra for each population and element type. c) Heterozygosity counts and correction factors for each sample and element family.
(XLSX)
Loss of Function variants (external Excel file). Counts of insertions occurring within genes, UTR, and CDS regions annotated from Gencode version 3b. This table is partially shown as
(XLSX)
Mobile element consensus sequences (external Excel file). Repbase element names and sequences for each of the elements added to the reference genome for MEI insertion detection.
(XLSX)
The 1000 Genomes Project Consortium.
(DOC)
Supporting Methods.
(DOCX)
We thank M. E. Hurles, R. E. Mills, A. R. Quinlan, J. H. Chuang, and S. Sherry for valuable discussions and R. E. Handsaker for providing deletion genotype likelihoods.