The Great Migration and African-American Genomic Diversity

We present a comprehensive assessment of genomic diversity in the African-American population by studying three genotyped cohorts comprising 3,726 African-Americans from across the United States that provide a representative description of the population across all US states and socioeconomic status. An estimated 82.1% of ancestors to African-Americans lived in Africa prior to the advent of transatlantic travel, 16.7% in Europe, and 1.2% in the Americas, with increased African ancestry in the southern United States compared to the North and West. Combining demographic models of ancestry and those of relatedness suggests that admixture occurred predominantly in the South prior to the Civil War and that ancestry-biased migration is responsible for regional differences in ancestry. We find that recent migrations also caused a strong increase in genetic relatedness among geographically distant African-Americans. Long-range relatedness among African-Americans and between African-Americans and European-Americans thus track north- and west-bound migration routes followed during the Great Migration of the twentieth century. By contrast, short-range relatedness patterns suggest comparable mobility of ∼15–16km per generation for African-Americans and European-Americans, as estimated using a novel analytical model of isolation-by-distance.

Introduction 1 remained bleak. Better opportunities in the North (Northeast and Midwest) and West led locus derives ancestry from two distinct lineages. We used RFMix (11) together with 1000 48 Genomes panels from Africa, Europe, and Asia to identify the most likely continental location 49 of the pre-1492 ancestors that contributed genetic material at each locus for individuals in the 50 cohorts ( Fig. 1D, Fig. S14, and Methods). The overall proportion of African ancestry (   Time estimates point to admixture occurring when most ancestors to present-day  Americans lived in the South. Regional differences in ancestry are therefore unlikely to be 94 caused by differences in recent admixture rates, and the large influx of migrants from the South 95 would have strongly attenuated any earlier differences. An alternate explanation for regional 96 differences in ancestry proportions is ancestry-biased migrations, if individuals with higher Eu-97 ropean ancestry were more likely to migrate to the North and West during the Great Migration.

98
To validate the ancestry-biased migration model, we compared ancestry proportions of HRS 99 individuals according to their region of birth, residence, and migration status. females.
The US Census includes a separate category for Hispanic/non-Hispanic ethnicity. In HRS, 146 32 African-Americans have self-identified as Hispanics (of which only 10 are within the con-147 tiguous US). Genetic ancestry within this group is distinct from the bulk of the non-Hispanic

148
African-American population in at least two ways: elevated Native American ancestry and a 149 higher genetic similarity to southern European populations (Figs. S16 and S17). The correla-  To compare these relatedness patterns with recent migration data, we used the 20th cen-170 tury US census data and a simple coalescent model to estimate the expected relatedness be-  African-Americans across the US are more related to European-Americans from the South than 178 to those from the North or West. In addition, European-Americans from the South tend to be 179 more related to African-Americans in the North than to those in the South. This increased re-180 latedness with increased distance is unusual in population genetics, but is easily explained: The ancestry-biased migration is also a relatedness-biased migration. The reduced relatedness be-182 tween northern European-Americans and African-Americans may also be reinforced by recent 183 European migration, because the new migrants were more likely to settle in the North but were 184 less likely to be related to African-Americans. Despite the unusual long-range relatedness patterns, identity-by-descent decays with distance 187 within African-American communities in the South, reflecting isolation-by-distance (Fig. S5).

188
To understand how migrations affect isolation-by-distance and identity-by-descent, we intro-189 duce a novel, simple model taking into account a diploid population density n and spatial diffu- where r min,max = D/l min,max , and K α (x) is the modified Bessel function of the second kind 198 (see Methods).

199
Using IBD segments longer than 18 cM, we estimate population density n AFR = 1.9 km −2 200 and diffusion constant D AFR = 63.5 km 2 /generation for African-Americans across the US, and 201 n EUR = 7.6 km −2 and D EUR = 59.6 km 2 /generation for European-Americans (Fig. 3E) We performed comparisons with data from 23andMe (12) and from 97 individuals of African 354 ancestry from the southwest USA (ASW) from the 1000 Genomes Project 1 (28). The 23andMe 355 cohort includes many African-American individuals and has been the subject of a detailed pop-356 ulation genetic analysis (12), and the ASW cohort has been a reference African-American pop-357 ulation in recent studies. However, these two cohorts were not meant to be representative of the 358 US population. The 23andMe database has a complex ascertainment scheme, which may cause 359 biases in ancestry and socioeconomic status. In particular, biases in regional representation and 360 a small amount of survey response errors might lead to a lower European ancestry proportion.

361
These possible biases are described in detail in (12). Similarly, the ASW cohort was assem-362 bled from duos and trios with at least one Oklahoma resident, but with no attempt to reach 363 geographic or demographic representativeness. For comparisons with the 23andMe study, we 364 used the global ancestry proportions reported in (12) typed on Illumina Human 1M-Duo). All genotyped individuals were either cases or controls in their respective nested case-control studies. We converted the lung cancer dataset from human genome assembly hg18 to hg19 using the LiftOver utility from the UCSC Genome Bioinfor-376 matics Group and merged the four separate SCCS datasets into one using PLINK 1.9 (29).

377
During the merge process, we removed markers to which more than one name was assigned 378 at the same position along a chromosome; removed markers with missing genotype calls; cor-379 rected unambiguous strand misassignments and removed ambiguous strand (mis)assignments; 380 removed multi-allelic markers; and, finally, filtered the data for missing calls (30) Table S5).

IBD inference 417
We used GERMLINE (32) (arguments -err hom 1 -haploid -bits 32 -w extend) 418 to infer IBD tracts of length 3 cM or larger shared between individuals from the HRS, SCCS, 419 and ASW cohorts. GERMLINE is prone to false positive IBD assignment (see, e.g., (33)). We 420 used a heuristic filtering scheme to identify and filter regions with large excess of IBD. We first 421 count the number of overlapping IBD segments at each genomic position across all individuals.

422
A chromosomal region is then marked as "forbidden" if the total number of IBD segments over-423 lapping it is larger than 25,000, since the background IBD has an approximate depth of 15,000.

424
Next, two forbidden regions will be merged as one if they are less than 0.1 cM apart. IBD seg-425 ments that overlap these forbidden regions are excluded from the downstream analysis unless 426 they extend outside the forbidden regions by at least 3 cM. In that case, we presume that there 427 is sufficient evidence in the non-forbidden regions, and the segments are therefore kept. After American status within each region, we form two sparse relatedness matrices: L which contains 437 the total IBD length shared between each pair of individuals, and N which contains the total 438 number of shared IBD segments between each pair of individuals. We have to emphasize that 439 the diagonal elements of L and N , which represent self-IBD, are zero by definition. 440 We next remove the contributions of closely related individuals from these matrices as fol- where 455 • i and j represent individuals as indexed in L;

456
• the primed sum runs over relevant pairs (i, j) such that i < j, (R(i), S(i)) = (R 1 , S 1 )  Americans separately. We assume a generation time of 30 years, thereby taking census years 486 1900, 1910, and 1920 as generation 3; 1930, 1940, and 1950 as generation 2; and 1960, 1970, 487 and 1980 as generation 1. For each race group, we construct a matrix whose elements m ij are 488 the number of migrations for each generation from region i to region j; this matrix is highly 489 asymmetric because of asymmetric migrations.
We now construct a heuristic census-based measure of relatedness between regions. Let us i→j as the proportion of individuals at generations g − 1 in region j with ancestors at 492 generations g in region i. In other words, the (i, j) element of the matrix P (g) is where g ∈ {1, 2, 3} denotes the generation time of the ancestral population, and m

497
We construct a three-generation transition matrix as: This definition forP ij takes into account all possible migration routes starting at region i and 499 ending at region j that could have taken place in the span of the three generations.

500
To estimate genetic relatedness between different geographic regions, we further make the 501 extremely coarse assumption that population sizes were constant before 1910, and that popula-502 tions were randomly mating. These assumptions allow us to model expected relatedness within 503 regions using coalescent theory before the massive 20th century migrations. Neither assumption 504 is expected to hold, but the resulting relatedness metric remains informative as long as census 505 population size correlate with the expected time to the most recent common ancestor for a given 506 pair of lineages in a region.

507
GivenP k,i as the probability of a sampled individual from region i having an ancestor from 508 region k, we define the census relatedness metric between regions i and j as 509 where N k is the census population size of region k. Population size matters, because in larger 510 populations, it is less likely that a given pair of individuals share a common ancestor. If N k is 511 large enough, the number of common ancestors at each generation is inversely proportional to 512 N k , and therefore the expected recent shared ancestry is approximately inversely proportional 513 to N k . Thus, I ij is proportional to the probability of two individuals from regions i and j having 514 ancestors from (any) region k times the probability that these ancestors have a recent common 515 ancestor within region k. UnlikeP which is a directional metric, I is non-directional and 516 symmetric and can be directly compared with genetic relatedness matrix L in Eq.
(2), which 517 was estimated using IBD data. The regional relatedness patterns derived usingP and I are 518 shown in Fig. S4 West North Central), forming a 3 × 3 matrix from the census data to be compared with the 524 corresponding matrix from IBD data. To quantify the correlation between these two matrices, 525 we use the Mantel test as follows. Given that all the elements in each of these two matrices are 526 independent, we perform 9! possible permutations on the elements of the matrix derived from 527 the census data and calculate the Pearson correlation coefficient between the original IBD matrix 528 and the permuted census matrix. We then accept or reject each permutation based on whether 529 the calculated correlation coefficient is lower or higher than the correlation coefficient between  by performing a random subset of 10 7 permutations out of a total of 12! ones.

534
We also perform this test while using the region of birth of HRS individuals as their location.

535
Given the average year of birth (1939.8) and the birth year distribution (Table S2) in HRS, we 536 only take for consistency generation 3 from the census data (see definition above) and write 537P = P (3) as our overall directional relatedness matrix (compare with Eq. (4) above). We then 538 proceed as before to calculate the non-directional (symmetric) relatedness I. Given the new 539 census-based prediction (using only g = 3 above) and the IBD relatedness pattern (using the 540 region of birth), we perform a Mantel test as before in order to find the correlation between the 541 data and our prediction. with random mating within islands and migrations between the islands. We will consider a 546 limiting example of a continuous population below. 547 We are first interested in the probability that a genomic segment of given length, stretching across a specific locus, is shared identical-by-descent between two randomly selected individu-549 als living on different islands. For identity-by-descent to occur, we need two events to happen: 550 first, lineages at that locus must have coexisted on one unknown island at some point in the past.

551
Second, these two geographically coexisting lineages must also have coalesced further in the 552 past.
To derive the probability of coexistence, we first want to estimate the expected position of a 562 lineage given its position in the past. Concretely, let x 0 be the current location of an individual 563 at t = 0. We would like to find Φ(x, t|x 0 ), the probability that an individual's lineage is on 564 island x at t generations ago, given that it is currently on island x 0 .

565
By construction, the probability Φ(x, t|x 0 ) takes into account contributions from all possible 566 space-time paths that start at x 0 and end at x at time t. For instance, a possible path is to arrive 567 at x at t/2 and stay at that position until t, whereas another path is to arrive at x at t/3, leave x 568 at the next step for a series of random walks to finally arrive at x again at t.

569
Consider a region of area ∆A i that encompasses a deme with haploid population 2n(x i , t)∆A i ,

570
where n(x i , t) is the effective diploid population density at position x i and time t in the past.

571
The probability of that two lineages in ∆A i coalesce in a given generation is This expression does not consider the possibility of multiple coalescent events and is therefore 573 appropriate only for a number of generations that is much less than the population size. More-574 over, the discrete probability of two lineages having coexisted on the deme at x i at time t in the 575 past, given that they are a distance R apart (at x 0 and x 0 + R) at present (at t = 0), is Therefore, the total probability of having a common ancestor t generations ago in the discrete 577 model is To go from a discrete random walk to the continuous limit, we set Φ(x, t|x 0 ) → ϕ(x, t|x 0 )∆A, 579 where ϕ(x, t) is now a continuous probability density. Thus, in this limit, i.e. with i ∆A i . . . → 580 d 2 x . . ., we get The continuous limit of a random walk process is the diffusion model. In this model, the 582 probability density ϕ(x, t) of finding a lineage at an infinitesimal area d 2 x centered around x at 583 generation t in the past obeys the two-dimensional partial differential equation where the diffusion coefficient D(x) encompasses the information related, in the discrete model,

585
to probabilities of taking a step to an adjacent island or staying on the same island 3 . Solving for ϕ(x, t|x 0 ) amounts to solving equation (11) with initial condition ϕ(x, t = 0) = δ(x − x 0 ) 587 where δ(x) is the (two-dimensional) Dirac delta function.

588
For simplicity, we consider random walks with uniform probability of transitioning to any with p(l|t) = (2t) 2 l exp(−2tl) the probability density of an IBD segment of length l (in units 603 of Morgans) spanning the locus shared by the two randomly chosen individuals whose lineages 604 coalesce t generations ago. Performing the integrals above leads to the following closed form 605 solution for the expected fraction of the genome shared as a function of spatial separation  We can use this estimator to find the shared fraction per chromosome; that is, for each chro-

615
Access to the exact location of clinics at which the SCCS cohort was sampled allows us to 616 investigate the relation between IBD relatedness and spatial distance. Having inferred possible 617 IBD segments using GERMLINE, we calculate, for each pair of individuals from SCCS, the 618 total length of shared IBD and the distance between the clinics in which they were sampled. We 619 make the underlying assumption that each individual lives close to the clinic at which he or she 620 was sampled. Each pair is then placed, based on the distance between the two individuals, into the total number of points in the bin) and assign it to a distance equal to the midpoint of the bin 625 (e.g., for the length bin [1,101), the assigned distance is 51 km). The result is shown in Fig. S5.

626
Apart from the expected decay of relatedness with distance, we also notice the presence of 627 a constant "background" IBD. This background IBD is larger for smaller IBD segments. As In the Wright-Fisher model, given the shared locus ζ, the probability of having a MCRA g 654 generation ago is where N is the (constant) effective haploid population size. Therefore, given the length l of 656 an IBD tract (in units of Morgans), we use (16) and (17) where we have assumed that the haploid population size N 1 in the last step. and 91 CEU individuals. We extracted the intersecting set of SNPs between our merged dataset 666 and the three reference populations mentioned above, which we used as the input to RFMix.

667
RFMix assigned continental ancestry of each marker in each sample to either CHS, YRI, and 668 CEU, which we interpret as Native American/Asian, African, or European. ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ omni_haplotypes/ composed of samples chosen from 91 CEU, 50 CHS, and 96 YRI, ensuring that the individual 682 from whom the genotypes were copied was not used in the reference panel. We inferred 74.96% 683 African, 24.95% European, and 0.09% Native American ancestry.

684
For the three-population admixture model, we simulated a sample of 100 haploid chromo-685 somes with 80.9% YRI, 18.2% TSI, and 0.91% JPT ancestry, using the same method described 686 above. In this case, the inferred proportions were 80.9% African, 18.2% European, and 0.94% 687 Native American. These results are consistent with previous estimates of false assignment using 688 a similar pipeline (11). 689 We also considered whether the amount of Native American ancestry in real samples cor-  African-Americans with self-reported Hispanic ethnicity (32 additional individuals), and one 706 additional African-American who was listed as "White, non-Hispanic" in HRS Tracker but as 707 "African-American" in dbGaP. All individuals were kept in the other cohorts.

708
Optimization was performed for 6 models in each cohort: 2 two-population models, and 4 709 three-population models. admixture. Finally, the xpp_pxx_xpx model has a founding population of Europeans and 743 Native Americans, followed by a pulse of African admixture, followed by a pulse of European 744 admixture. These histories are shown in Fig. S10. The best-fit model for the SCCS and HRS is 745 the pxp_xpx_xpx (see Fig. S11 and S10).

746
In addition to the three global ancestry proportions, model pxp_xpx, has two free time

Confidence intervals for timing estimates 754
Confidence intervals for all parameter values were obtained via bootstrap (Table 3). For each 755 model, we generated 100 bootstrap populations by resampling individuals with replacement. 756 We performed parameter inference for each bootstrap population, and computed the 95% con-757 fidence interval of the resulting distribution of parameters. These confidence intervals account 758 for the finite number of individuals in the sample. However, they do not account for biases re-759 sulting from population structure or model mis-specification. Because of the large sample size, 760 these biases are likely more important than the uncertainty measured by the bootstrap.     To find the spatial distance between HRS individuals, we use the ZIP Code Tabulation Area

SCCS
The data we received from SCCS contains latitude and longitude coordinates of clinics partic-787 ipating in the study. To convert the coordinates into specific locations (e.g., ZIP codes, states, 788 and census regions), we used the Nominatim service from OpenStreetMap 6 to perform reverse 789 geocoding of the coordinates. Specifically, we used the OpenStreetMap API provided through 790 MapQuest 7 due to its unlimited usage policy.  Americans, or more generally the presence of population structure that persisted throughout 860 US history. By contrast, African-Americans have more long IBD, on average. Since long seg-861 ments represent recent history, we attribute this difference between short and long segments   Figure 19: Distribution of IBD sharing for African-American (blue) and European-American (red) individuals using IBD tracts belonging to different length bins.
to the a much greater reduction in effective population size in African-Americans compared to 863 European-Americans since the arrival into the Americas.

864
IBD sharing for European-Americans tends to be mostly via shorter (and, therefore, older)

865
IBD segments compared to that for African-Americans. Hence, for bins containing longer IBD 866 segments, the peak corresponding to European-American IBD sharing moves to the left faster 867 compared to the respective peak for African-Americans.  between European-Americans across US census regions is similarly displayed in Fig. S23.

893
Relatedness between African-Americans and European-Americans is shown in Fig. S24.