Wrote the paper: MM LP JB. Conceived and designed the study: JB, FC. Discussed the results: JB, MM, AJ, MP, FC, LP. Contributed in writing the the software: MM, AJ, MP, LP. Performed the validation and tuning: MM, MP. Analyzed the HapMap3 data: MM, AJ. Laid the groundwork for the study: The Genographic Consortium.
‡ MM completed this work during an internship at IBM T. J. Watson Research Center.
¶ Membership of the Genographic Consortium is provided in the Acknowledgments.
The authors have declared that no competing interests exist.
Recombination is one of the main forces shaping genome diversity, but the information it generates is often overlooked. A recombination event creates a junction between two parental sequences that may be transmitted to the subsequent generations. Just like mutations, these junctions carry evidence of the shared past of the sequences. We present the IRiS algorithm, which detects past recombination events from extant sequences and specifies the place of each recombination and which are the recombinants sequences. We have validated and calibrated IRiS for the human genome using coalescent simulations replicating standard human demographic history and a variable recombination rate model, and we have fine-tuned IRiS parameters to simultaneously optimize for false discovery rate, sensitivity, and accuracy in placing the recombination events in the sequence. Newer recombinations overwrite traces of past ones and our results indicate more recent recombinations are detected by IRiS with greater sensitivity. IRiS analysis of the MS32 region, previously studied using sperm typing, showed good concordance with estimated recombination rates. We also applied IRiS to haplotypes for 18 X-chromosome regions in HapMap Phase 3 populations. Recombination events detected for each individual were recoded as binary allelic states and combined into recotypes. Principal component analysis and multidimensional scaling based on recotypes reproduced the relationships between the eleven HapMap Phase III populations that can be expected from known human population history, thus further validating IRiS. We believe that our new method will contribute to the study of the distribution of recombination events across the genomes and, for the first time, it will allow the use of recombination as genetic marker to study human genetic variation.
Recombination brings together DNA sequences that can be very distantly related, and, thus, quite different from each other. This is often cited as a main hurdle for using recombining regions (that is, most of the genome) to reconstruct sequence phylogeny. We have turned this argument around: chromosomes carrying a similar change in sequence pattern are likely to be descendants of the same recombination event, and thus, related. We have devised an algorithm that detects such changes in sequence patterns and identifies the descendants of a recombination event. After some fine-tuning, we have applied it to sequence data in several human populations and have found that recombination events recapitulate the history of these populations. This opens the possibility of adding recombination to the current allele-based analysis of population structure and history. Our method also provides a tool for the genomic analysis of recombination, both because it pinpoints recombination events rather than just estimating recombination rates, and because, being biased towards more recent events, it can offer a glimpse of the fast evolution of recombination.
Recombination has been lately the focus of much attention. Specifically, much effort has concentrated in trying to understand the extensive variation of the recombination process seen in humans and to unravel the basic mechanisms underlying this variation
In the present study, we have developed a method that allows studying recombination from a new perspective: using the presence or absence of the trace of a particular recombination event in a specific sequence as a genetic marker. Actually, this idea was first proposed by Sir Ronald A. Fisher
Many different methods have been developed to detect recombination; they are implemented in a number of computer programs listed at
We have developed a method that is specifically aimed at detecting past recombination events from a set of extant human haplotypes and that is able to tell which are the sequences carrying the information of these events. The method called IRiS (Identifying Recombination in Sequences) performs an extensive screening for recombination within a large amount of markers and sequences in short computing time. IRiS is based on a combinatoric algorithm
In this paper, we calibrate and validate our method using extensive simulations, which will both give us a proxy for the efficiency of our method and also help us to understand which recombinations are preferentially detected. The simulations used incorporate a model that mimics human demography and variable recombination rates including the presence of hotspots, allowing us to evaluate the performance of IRiS within these regions. Moreover, we compare its performance to known cases of recombination observed by sperm typing or estimated by linkage disequilibrium based methods. Finally, we apply IRiS to reconstruct the recombination history of several gene-free regions on the X chromosome from the HapMap3 dataset
The applications of the method can be extended to other fields such as basic genetics, recombination dynamics, and the analysis of structure of the genome. IRiS provides in fact a novel tool to understand the past of recombinant genomes.
IRiS is based on the algorithm described in
(A) Input dataset of 10 sequences and 83 SNPs. Colors on sequences represent similar patterns of SNPs, and a change of color along a sequence represents the signal of past recombination events. (B) Recoded matrix. The patterns of SNPs within a column of grain size
Since this algorithm divides the haplotypes into SNP patterns of size
The analyzed dataset is the one shown in
Finally, using the same approach, we can potentially aggregate detections of multiple runs performed with different grain sizes and also we can run the algorithm in both forward and reverse directions. This aggregation improved significantly the performance of the method (see next section) by increasing the sensitivity and reducing the false discovery rate. Moreover, it allowed a much more precise inference of the breakpoint position since maximum intervals become much narrower (
Each line represents the distribution of detections for particular recombination events. The dataset corresponds to one COSI simulation. Only those recombinations reaching the threshold will be considered as true events. The pick of each distribution will locate the breakpoint position for each particular recombination event along the sequence. The optimal method (grains 20, 10 and 5 forward and reverse and a threshold of 42) creates narrower maximal intervals in the detection distributions than when only using grain 10.
We sought the best parameter set for IRiS in human sequences by calibrating the program using sequences produced by the coalescent simulator COSI
For each parameter set false discovery rate, sensitivity and accuracy of placement (defined as the empirical 90th percentile of the distribution of the distances between actual and inferred recombination location, in terms of number of SNPs) were averaged over 100 COSI simulations. We first evaluated the impact of varying grain size (5, 10, 15, 20, and 30) and threshold defined as the percentage of algorithm runs in which a particular recombination event has to be detected to be considered as true. Next, we assessed the improvement in results gained by running the algorithm in both forward and reverse directions (
Results showed that both grain size and the threshold affect the false discovery rate. It decreases with increasing grain size, and also varies with threshold, reaching the lowest at thresholds of 60% (
Since grain size provides a tradeoff between false discovery rate and accuracy of placement, we evaluated combinations of different grain sizes in order to obtain a method that combined an acceptable false discovery rate with high accuracy. For all the methods, we ran the algorithm in both forward and reverse directions (
Z scores were calculated over mean values for 100 simulations of false discovery rate, sensitivity and 90th percentile of the distance between the inferred breakpoint to the true position. Different colored lines represent different methods, the numbers on the legend inform on the grain size used and whether they combine more than one grain size. All methods are run using a sliding window and forward and reverse. Different thresholds are represented along the X axes. Threshold is defined as number of detections to be considered as true divided by the number of runs of the algorithm.
The optimal method combined runs with grain sizes 20, 10 and 5 with a threshold of 42. However, other methods had only slightly lower aggregate z-scores. With the optimal method, false discovery rate was 5.79% (below 10% in 90% of the simulations) and sensitivity was 21.15%. The median distance to the breakpoint was 1.59 SNPs: most of the inferred breakpoint locations are not more than 2 SNPs away from their true position.
In order to provide a more robust validation and avoid overfitting, we performed 1000 coalescent simulations and estimated false discovery rate, sensitivity and accuracy for the optimal method (
Since each recombination event is defined by a set of descendant sequences and an interval, there is no straightforward definition of true negatives and that is why only false discovery rate and sensitivity were evaluated. In order to estimate the performance of the method in the absence of recombination, we performed 100 blank COSI simulations, with no recombination. The mean number of recombinations detected was less than one per simulation (0.84) and the median turned out to be zero.
Finally we assessed whether some matches between COSI recombinations and IRiS calls were chance events by scoring 100 IRiS outputs against 100 random COSI outputs (
We assessed which was the distribution of the age of the recombination events detected by IRiS to estimate the time-frame of the events that our method was able to detect. From all recombinations detected in 500 different simulations, 90% of them occurred between present and 3,205 generations ago. Moreover, the median age of the recombination events detected per simulation was 663 generations (around 13,000 years).
We also studied the effect of the age of the recombination events on false discovery rate and sensitivity. When looking across parameter sets, the methods with lowest false discovery rates tended to detect most recent recombination events as average (
Results plotted are the averaged between 100 simulations. The black curve depicts how sensitivity of IRiS varies with the age of the recombination events (in bins of 500 generations) and follows the left axis. The two gray curves represent the number of recombination events generated by COSI and detected by IRiS and follow the right axis.
The number of recombination events affects the performance of IRiS, since we found that false discovery rate correlated with the number of recombinations (r = 0.412; p<10−6). A much higher linear correlation was found for sensitivity (r = 0.734; p<10−6), which was even higher if fitted to a logarithmic curve (r = 0.925; p<10−6), meaning that IRiS reached a plateau beyond which even if the number of recombinations generated increased, IRiS did not detect them. Interestingly, when comparing the number of recombinations detected by IRiS across the simulated datasets with the mean recombination rates estimated by LDhat
We compared IRiS performance against linkage disequilibrium based estimates of recombination rates in a region where direct sperm-typing rates were also available. A region of chromosome 1 near the MS32 minisatellite contains some recombination hotspots that were both observed through sperm typing and inferred with different statistical methods
(A) Inferred recombination rates based on sperm typing information; figure adapted from the figures in
A study on the capacity of the optimal method to detect recent recombination events was performed through
The plot shows the number of times
Results showed that IRiS performs much better in detecting recombination events that occur between parental sequences that differ in a certain number of SNPs and have those differences close to the breakpoint location. If the recombinant sequence is unique, then there is less room for confusion and IRiS performs even better. Finally, we see that within the hotspot region, even if the recombinant sequence is unique and the two parental sequences are different, the sensitivity is not optimal. That could be due to the fact that since nearly all sequences are recombinant, IRiS detected different recombination events across runs, and many of them did not reach the necessary threshold to be counted as true. Actually, if all events are considered regardless of the number of times they have been detected, the sensitivity increases to 100% (
Gene conversion, recurrent mutation and genotyping errors were not modelled in the COSI simulations. We used
We performed 12,000
Results show that, first, gene conversion will not have an impact on IRiS results since the majority of gene conversion events involve a very small number of SNPs
Since IRiS uses haplotype data, we evaluated its robustness to phasing errors using similar
In order to study the impact of phasing errors further, we selected 18 regions from the X chromosome from HapMap phase 3 project (see
Results show that when sequences are phased using PHASE, no differences in the number of recombinations can be found between males and females (Wilcoxon test; p = 0.992 ); even when grouping the number of recombinations detected in bins of 5 SNPs along the sequence (Wilcoxon test; p = 0.795). Conversely, when phasing with fastPHASE, the number of recombinations detected in the female sample is significantly higher in females both when calculating it per region (Wilocxon test; p<10−6) and in bins of 5 SNPs (Wilcoxon test; p = 0.000292).
Regarding the post-processing in which recombinations occurring in the two homolog chromosomes of the same individual are removed, there is a difference between the three datasets. In the male dataset (in which all recombinations that may be removed are not phasing errors), 5.59% of the recombinations are removed. In the female dataset phased with PHASE, 3.37% of the sequences are removed, whereas for the fastPHASE phased dataset 9.70% of the sequences are removed. This may be indicating that some of the sequences that are removed in the fastPHASE dataset are indeed phasing errors.
A search for human X-chromosome regions harboring more than 80 SNPs and not containing known genes, copy number variants or segmental duplications (see
Values were calculated for each of the populations based both on haplotypes and recotypes for the 18 regions. Values of recombination diversity have been multiplied by 100 to make them comparable.
We next analyzed the geographical structure of recombination events by means of two different statistical analyses: Principal Component Analysis (PCA) and Multidimensional Scaling (MDS). Results of the PCA analysis for component 1 and 2 can be seen in
Only recombinations present in at least in two individuals were taken for the analysis. The first component explained 18.03% of the variance and the second component 14.53%.
MDS was based on a recombination distance matrix among the eleven populations and showed very similar results as for the PCA (
The geographical structure found by these analyses is consistent with that produced with allele frequencies in classical markers, short tandem repeats
We present here a novel method that is able to extensively detect and place historical recombination events. We have calibrated and validated it specifically for human variation through simulations, but it could be easily adapted to other species provided that basic models for demography and recombination exist. An optimal parameter set has been defined based on three performance criteria: false discovery rate, sensitivity, and accuracy in placing the breakpoints. Given that different parameters had contrasting effects (a decrease in false discovery rate is often accompanied by a decrease in sensitivity, for instance), a compromise was sought and the best parameter set was the one that combined grain sizes of 20, 10 and 5 run forward and reverse with a threshold of 42. It should be noted, though, that some applications may require a different balance; for instance, greater accuracy may be needed, and the parameter set can be adapted to accommodate these different requirements.
Recombination rate varies across the genome
We have seen that sensitivity is higher for more recent recombination events both in those events generated in the last 500 generations (∼10,000 years) in coalescent simulations, or in those generated instantly
Several factors such as gene conversion, recurrent mutation, genotyping errors, phasing errors, or SNP ascertainment may affect the performance of IRiS and in this study we have extensively evaluated the impact of them all. First, we have shown that gene conversion has higher impact as more SNPs are involved in the process. Since gene conversion typically involves very small regions
When applying our method to population genetics, using a set of regions in the X chromosome in HapMap Phase III data, we found that the ascertainment of the SNPs present in HapMap resulted in similar nucleotide diversities in all populations, whereas the recombination diversity values were not affected by it. Higher recombination diversity was found in African populations. If nucleotide diversity were different across populations, this could be attributed to a higher capacity of IRiS to detect recombination within more diverse populations. However, since we have seen that the nucleotide diversity in HapMap Phase III is similar across populations, we can conclude that African populations contain indeed more recombination events than the other populations, as expected from their larger long-term effective population size. This is a further validation of the method.
Moreover, the ASW (Afro-Americans) is the population with the highest recombinational diversity. This could be explained by it being an admixed population and because some of the recombination events may be clearer since the two ancestral sequences that recombined may come from very different populations. It has to be stressed that this admixed population has been included in our validation procedure and thus, we have already taken into account the possible effects on IRiS performance due to admixture. The possible use of IRiS to detect and analyze past admixture deserve future attention.
Two different statistical methods that plot the relationship between populations based on their shared recombination events as detected by IRiS have produced results that are strikingly similar to the a priori expectations based on what is known about human population history. This can be taken both as a validation of our method, and as a pilot project for the applicability of IRiS to human population data. We believe that the use of recombination events in conjunction with the standard methods based on SNP and haplotype frequencies
In conclusion, we have presented a method that is able to extensively retrieve historical recombination events from a set of extant human haplotypes and point out which are the sequences that contain the information on those events. We also have shown a potential application of the method in population genetics: the use of recombination as a genetic marker that can complement present methods. Finally, we believe that this method will have a whole set of applications beyond population genetics in fields such as the study the recombination dynamics and the recombinational differences between populations, or the study of the mechanisms that have given rise to our recombinant genomes.
Simulations were performed using the coalescent simulator named COSI (Shaffner et al. 2005) with the Best-Fitting Model parameters which simulate datasets that closely resemble human data. We took a sequence length of 250,000 bp and a sample size of 50 sequences per population for all populations except the African population, for which the sample size was 60.
In the first phase of the calibration, we evaluated IRiS performance over 100 random simulations having more than 80 SNPs. The number of simulations was then increased to 1000 in order to establish more robust results. For each simulation, the evaluation process was as follows:
First, the COSI best fit model was run; the information of all recombination events created, their descendant sequences, and the age and exact position of the breakpoint was saved.
Second, the SNP ascertainment was performed; SNPs with MAF lower than 10% and non-tagSNPs were removed from the sample. TagSNPs were selected using Haploview with pairwise option (Barrett et al., 2005) and
Third, recombinations that took place between identical parental sequences in at least one side of the breakpoint were removed, since they are impossible to detect as recombinants and will be not be considered further (note that those events will not be used to compute sensitivity values).
Fourth, IRiS was run on the dataset created by COSI. Each detected recombination is defined as a set of recombinant sequences and a breakpoint interval defined by IRiS, (maximum of the distribution). Detection is considered correct if there exists a recombination event within the COSI simulation that has exactly the same descendant sequences and it occurred within the interval defined by IRiS. We allow IRiS to detect a subset of the descendants, of a particular recombination event created by COSI, if there has been a younger recombination that masks the trace of the older one in some of the descendant sequences.
Finally, false discovery rate, sensitivity and the 90th percentile for the distance from the inferred to the real location are computed.
All correlations performed on the results of the simulations were calculated by means of the SPSS software (version 15.0).
The study of the correlation between IRiS results and hotspots inferred from recombination events detected through sperm typing was performed based on HapMap Phase 2 data (
Inference of the recombination rates was performed by means of the coalescence-based algorithm implemented in the LDhat package
The IRiS proportion of recombination was calculated based on the threshold interval normalized by the length of the interval in bp. Each recombination contributed equally to all SNPs in the inferred breakpoint interval a value inversely proportional to the length of the interval in bp. To check sensitivity to linkage disequilibrium, raw IRiS detections were added up across all the algorithm runs. Note that in this case each inferred breakpoint interval will be the length of the network in which the recombination was detected.
In
We performed the same analysis several times varying the process of selection of the two parental haplotypes. First, they were chosen at random, second, they were only taken if they were different in at least one position within the breakpoint region (defined as a 10 SNP distance of the breakpoint), third, they were taken if they were different in at least one position on each side of the breakpoint within the region, and finally, we would only consider for the analysis those events that created recombinant sequences that was unique within the breakpoint region
We considered that two recombinations to be the putative product of a phasing error if there appeared two recombination events in the two homologs of an individual at a distance of less than 6 SNPs.
The mergepats parameter is implemented in IRiS when defining groups of SNP patterns of size g (which in the optimal method will be 5, 10 and 20). If mergepats is activated, patterns that differ in one SNP will be considered as the same. This is performed hierarchically first by taking the most frequent pattern and merging it with all patterns that are at edit distance one of it. Then the second most frequent pattern will be merged with all patterns being at edit distance of one, and so on. In this way, we avoid merging all patters of a sample into a single one.
The whole X chromosome was screened in order to find the optimal regions for our analysis. Regions at least 50 Kb distant from known genes, copy number variants and segmental duplication, and containing at least 80 SNPs genotyped in the 11 populations from the HapMap Phase 3 release #1 (
SNP genotypes for the X chromosome were obtained from the HapMap website (
To avoid phasing errors, only males were selected for recombination analysis; heterozygote positions, which are expected to be erroneous, were considered as missing values (only 0.02% of the positions were heterozygous). Individuals with >5% missing genotypes (22 in total) were discarded; the rest of the missing values were imputed using fastPHASE
To obtain the equivalent number of sequences from females we followed the same procedure as males and selected 537 sequences from the same populations in order to match the male dataset. We phased and imputed the missing positions using two different softwares: PHASE
Nucleotide diversity was calculated using DnaSP
For the MDS analysis, the recombinational distance (DAB) between populations A and B was computed as:
where RAB was the number of recombinations shared between population A and B, RA was the number of recombinations occurring in individuals in population A, and RB the number of recombinations having occurred in individuals of population B. Only those recombinations found in at least two different populations were considered.
Mean values taken from the analysis of 100 simulations with different IRiS settings: grain sizes (5, 10, 15, 20 and 30), different thresholds, defined as number of detections to be considered as true divided by the grain size or the double of the grain size in the cases in which the algorithm is run in two directions. For each setting the algorithm could be run only on the forward direction (F) or in both directions (FR). Figure S1A False discovery rate (%). Figure S1B Sensitivity (%). Figure S1C 90% confidence interval of the distance (measured in number of SNPs) between the inferred breakpoint position and the real location. Figure S1D, median age of the detected recombinations.
(0.68 MB DOC)
Mean values taken from the analysis of 100 simulations with different IRiS settings that combine different grain sizes (indicated with different colors), different thresholds (defined as number of detections to be considered as true divided by the sum of the different grain size and multiplied by two since the algorithm is run in the two directions). All settings included running the algorithm in the two possible senses. Figure S2A False discovery rate (%). Figure S2B Sensitivity (%).Figure S2C. 90th percentile distance from the breakpoint location measured in number of SNPs.
(0.53 MB DOC)
Plot showing the relationship between the false discovery rate and the number of COSI simulations under a scenario in which IRiS is given a different dataset than the one used to compare it with the COSI results.
(0.02 MB DOC)
Each dot represents mean values of false discovery rate and median age of the detected recombinations taken from the analysis of 100 simulations with different IRiS settings that combine different grain sizes (indicated with different colors) and different thresholds. All settings included running the algorithm in the two possible senses.
(0.29 MB DOC)
Plot showing values of the number of times in silico recombination events were detected by IRiS run with no threshold depending on the breakpoint location along the sequence. Different colors indicate different ways to produce the recombinant sequence, from light gray to black: “random” indicates that parental haplotypes were taken at random, “1dif near bkp” indicates that parental sequences had to be different near the breakpoint region (plus minus 10 SNPs), “2 dif near bkp” indicates that parental sequences had to be different near the breakpoint regions at both sides of the breakpoint, and “unique” indicates that the parental sequences had to be different near the breakpoint region and the recombinant sequence had to be unique within the breakpoint region. Below, the recombination rate estimated by LDhat is shown, following the right axis.
(0.15 MB DOC)
MDS 2D plot based on a recombinational distance matrix. The stress is 0.081 which is below the 0.16 stress obtained with 1% probability with random data sets (citation: Sturrock K, Rocha J (2000) A Multidimensional Scaling Stress Evaluation Table. Field Methods 12: 49-60).
(0.03 MB DOC)
Evaluation of IRiS with the optimal parameters for different SNP ascertainments. SNP selection process is explained in the
(0.04 MB DOC)
Percentage values on the number of times each of the simulated event is either not detected, detected as 1 recombination or as 2 recombinations. The percentage values are calculated over 1,000 in silico simulations.
(0.03 MB DOC)
Number of recombinations detected in each of the 18 regions in the male dataset, female dataset and female dataset when removing putative phasing errors. Females were phased using both PHASE and fastPHASE without using male phase information.
(0.05 MB DOC)
The main characteristics of 18 X-chromosome regions. From left to right: start position and end position in base pairs (based on NCBI Build 36 assembly), length of each in base pairs, number of SNPs (N SNPs), number of haplotypes (N haplo), recombination rate calculated by means of Ldhat, Number of recombinations detected, number of recotypes, average number of recombinations detected by IRiS per Kb.
(0.06 MB DOC)
We thank Hafid Laayouni, Martin Sikora, David Comas, Oscar Lao and especially, Tomàs Marquès for their valuable comments along the study. We also thank Angel Carreño and Txema Heredia for their help and Arcadi Navarro for his comments and for coining the word recotype. Finally, we thank the anonymous reviewers of PLoS Computational Biology, specially the second one for his comments and suggestions which helped to improve the present manuscript.
The Genographic Consortium includes: Janet S. Ziegle (Applied Biosystems, Foster City, California, United States); Li Jin & Shilin Li (Fudan University, Shanghai, China); Pandikumar Swamikrishnan (IBM, Somers, New York, United States); Daniel E. Platt & Ajay K. Royyuru (IBM, Yorktown Heights, New York, United States); Lluis Quintana-Murci (Institut Pasteur, Paris, France); R. John Mitchell (La Trobe University, Melbourne, Victoria, Australia); Danielle A. Badro, Marc Haber & Pierre A. Zalloua (Lebanese American University, Chouran, Beirut, Lebanon); Syama Adhikarla, ArunKumar GaneshPrasad, Ramasamy Pitchappan, Kavitha Valampuri John & Arun Varatharajan Santhakumari (Madurai Kamaraj University, Madurai, Tamil Nadu, India); Christoff J. Erasmus, Angela Hobbs & Himla Soodyall (National Health Laboratory Service, Johannesburg, South Africa); Elena Balanovska & Oleg Balanovsky (Research Centre for Medical Genetics, Russian Academy of Medical Sciences, Moscow, Russia); Chris Tyler-Smith (The Wellcome Trust Sanger Institute, Hinxton, United Kingdom); Daniela R. Lacerda & Fabrício R. Santos (Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil); Pedro Paulo Vieira (Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil); David Comas & Begoña Martínez-Cruz (Universitat Pompeu Fabra, Barcelona, Spain); Christina J. Adler, Alan Cooper, Clio S. I. Der Sarkissian & Wolfgang Haak (University of Adelaide, South Australia, Australia); Matthew E. Kaplan & Nirav C. Merchant (University of Arizona, Tucson, Arizona, United States); Colin Renfrew (University of Cambridge, Cambridge, United Kingdom); Andrew C. Clarke & Elizabeth A. Matisoo-Smith (University of Otago, Dunedin, New Zealand); Matthew C. Dulik, Jill B. Gaieski, Amanda C. Owings & Theodore G. Schurr (University of Pennsylvania, Philadelphia, Pennsylvania, United States); David F. Soria Hernanz & R. Spencer Wells (National Geographic Society, Washington, District of Columbia, United States).