Conceived and designed the experiments: JvE RJH. Analyzed the data: JvE. Contributed reagents/materials/analysis tools: JvE RJH. Wrote the paper: JvE RJH.

The authors have declared that no competing interests exist.

The study of the prehistoric origins and dispersal routes of domesticated plants is often based on the analysis of either archaeobotanical or genetic data. As more data become available, spatially explicit models of crop dispersal can be used to combine different types of evidence.

We present a model in which a crop disperses through a landscape that is represented by a conductance matrix. From this matrix, we derive least-cost distances from the geographical origin of the crop and use these to predict the age of archaeological crop remains and the heterozygosity of crop populations. We use measures of the overlap and divergence of dispersal trajectories to predict genetic similarity between crop populations. The conductance matrix is constructed from environmental variables using a number of parameters. Model parameters are determined with multiple-criteria optimization, simultaneously fitting the archaeobotanical and genetic data. The consilience reached by the model is the extent to which it converges around solutions optimal for both archaeobotanical and genetic data. We apply the modelling approach to the dispersal of maize in the Americas.

The approach makes possible the integrative inference of crop dispersal processes, while controlling model complexity and computational requirements.

Understanding the domestication and the subsequent dispersal of cultivated plants is fundamental to our comprehension of the rise of early agricultural societies

Research on plant domestication and crop dispersal is a multidisciplinary effort with the principal contributions coming from archaeology and molecular biology

In spite of this increase in the availability and quality of data, conflicting views persist regarding the evolutionary and geographical trajectories of crops. The issue does not seem to be data availability alone. For instance, for Asian rice (

Spatially explicit models of dispersal are not new to archaeology. Models of diffusion have been used to represent the spread of Neolithic innovations, including pottery, copper metallurgy, cultivated maize

In contrast, current methods used for the geographical analysis of crop genetic diversity are generally not spatially explicit. A common approach to determine the geographical origin of a crop is to locate the genetically closest wild progenitor population

In the broader field of phylogeography, the need for further integration of genetics and geography is increasingly recognized

We present a novel, integrated approach, which combines different elements from existing archaeological and genetic geospatial models. Our approach has at its core a landscape model that represents the ease of movement through geographical space. Given the geographical origin of a crop, we derive from the landscape model different distance measures that can be quantitatively related to (1) the radiocarbon dates for the first appearance of the crop in the archaeological record, (2) heterozygosity of (contemporary) crop samples, and (3) genetic distances between these samples. The measures are all based on (randomized) shortest path metrics that can be obtained without stochastic simulation. This provides greater computational speed, allowing for the evaluation of alternative locations of crop origins and of different variables of potential influence on crop dispersal.

We apply the approach to maize dispersal in the Americas. Although there is still debate about the exact geographical origin of maize, it is thought to lie in a limited area in southern Mexico, where its closest wild relatives occur naturally

The initial dispersal of crops is affected by several geographical factors such as the location of water bodies, environmental barriers, the suitability of environments to grow the crop, and prehistoric human population density. To model the relative influence of different factors, we use conductance matrices derived from gridded geographic data. In the conductance matrix, each grid cell is represented by a row with values indicating the conductance or relative ease of crop dispersal and gene flow to other cells on the grid. A grid with n cells produces an n×n conductance matrix. Generally, we connect pairs of spatially adjacent cells, which receive non-zero values in the conductance matrix, while unconnected pairs of cells receive a zero. Spatially non-adjacent cells could be connected in the conductance matrix to represent long-distance ‘leap-frog’ movements. Here, to keep the model simple and following a number of existing models in archaeology

What we call a “conductance matrix” is called a “weighted adjacency matrix” in graph theory. In this context, however, we prefer the term “conductance matrix” to avoid confusion, as adjacency between nodes in the graph derived from the grid does not necessarily imply

Using conductance matrices has a number of advantages. In geospatial analysis, least-cost distances are generally calculated from a cost or friction grid

Conductance values are determined from the values of the two grid cells that are connected, using different functions. Simple functions, such as the average, or functions that require parameters can be used. The conductance values need to be corrected for (1) differences between diagonal and non-diagonal connections between cells if cells are connected in more than four directions and (2) distance distortions, specifically the decreasing W-E distance between cell centres on a longitude-latitude grid when moving from the equator towards the poles. Both issues are addressed by dividing conductance values by the distances between the cell centres.
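The distance correction described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R script used for the analysis. The function names, the use of the average as the connecting function, and the haversine formula with a mean Earth radius of 6371 km are assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def corrected_conductance(value_a, value_b, centre_a, centre_b):
    """Average of the two cell values, divided by the distance between cell centres.

    centre_a and centre_b are (longitude, latitude) tuples in degrees.
    """
    raw = (value_a + value_b) / 2.0
    return raw / great_circle_km(centre_a[0], centre_a[1], centre_b[0], centre_b[1])


# Three connections between land cells (value 1) on a 0.5 degree grid.
west_east_equator = corrected_conductance(1, 1, (0.0, 0.0), (0.5, 0.0))
west_east_60n = corrected_conductance(1, 1, (0.0, 60.0), (0.5, 60.0))
diagonal_equator = corrected_conductance(1, 1, (0.0, 0.0), (0.5, 0.5))
```

In this sketch, a single division by distance handles both corrections: the diagonal connection receives a lower conductance than the W-E connection at the same latitude, and the W-E connection at 60°N receives roughly twice the conductance of the one at the equator because the cell centres are about half as far apart.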

We refer to the final conductance matrix used to calculate the distance metrics as the

Given a certain geographical origin of crop dispersal and a landscape model, we can predict the movement of crops in geographical space and, consequently, the age of archaeobotanical crop remains, heterozygosity, and genetic distances between crop populations. The following sections discuss the construction of predictor variables from the landscape model and the model-fitting procedure. The modelling approach requires measures of goodness-of-fit between the landscape model on the one hand and the archaeobotanical and genetic data on the other. To derive these measures, we use and adapt elements of existing modelling approaches in both archaeology and geographical genetics.
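As an illustration of how a least-cost distance from the origin can be obtained from a conductance grid, the following Python sketch applies Dijkstra's algorithm to a small grid with eight-directional connections. It is not the implementation used in the analysis. The planar grid with unit cell size, the averaging of cell values, and the function name `least_cost_distances` are assumptions made to keep the example short.

```python
import heapq
import math


def least_cost_distances(cell_values, origin):
    """Accumulated least-cost distance from origin (row, col) to every cell.

    Edge conductance = mean of the two cell values / distance between centres.
    Edge cost = 1 / conductance. Connections with zero conductance are skipped.
    """
    n_rows, n_cols = len(cell_values), len(cell_values[0])
    cost = [[math.inf] * n_cols for _ in range(n_rows)]
    cost[origin[0]][origin[1]] = 0.0
    queue = [(0.0, origin)]
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if d > cost[r][c]:
            continue  # stale queue entry
        for dr, dc in steps:
            rr, cc = r + dr, c + dc
            if not (0 <= rr < n_rows and 0 <= cc < n_cols):
                continue
            mean_value = (cell_values[r][c] + cell_values[rr][cc]) / 2.0
            if mean_value <= 0:
                continue  # unconnected pair of cells
            conductance = mean_value / math.hypot(dr, dc)
            new_d = d + 1.0 / conductance
            if new_d < cost[rr][cc]:
                cost[rr][cc] = new_d
                heapq.heappush(queue, (new_d, (rr, cc)))
    return cost


# Uniform 3 x 3 landscape, origin in the top-left cell.
uniform = least_cost_distances([[1, 1, 1], [1, 1, 1], [1, 1, 1]], (0, 0))
# One row with a poorly conductive cell in the middle.
barrier = least_cost_distances([[1.0, 0.1, 1.0]], (0, 0))
```

The resulting cost surface is the predictor that would then be regressed against crop remain age or heterozygosity. On a longitude-latitude grid, the planar distance used here would be replaced by the distance between cell centres.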

We use a variant of existing spatial diffusion models in archaeology for the post-domestication diffusion of crops ^{2} of quantile regression, R^{1}, as the goodness-of-fit

Dispersal is expected to leave a mark on the diversity within and between populations. During the expansion of humans out of Africa and their spread across the world, small groups repeatedly split off to occupy new areas, taking with them only a portion of the alleles of their original population. As a result, human populations show a regular decline in heterozygosity from Africa to the southern tip of South America

As for crop remain ages, the least-cost distance from the origin of the crop should therefore be a good predictor for heterozygosity levels. Here, for simplicity, we determine the cost distance to predict heterozygosity from the same landscape model as we use for crop remain ages. This assumes that the loss of diversity due to genetic bottlenecks through each area is proportional to the time it took to cross these areas. This may not be realistic. The intensity of genetic drift is related to population size, which may change over time and among agricultural systems. Also, the number of bottlenecks over a given distance may differ. See below, under

Selection, introgression from wild populations, as well as recent founder effects and subsequent hybridization may all confound the spatial pattern of heterozygosity. However, as long as the pattern is mainly due to the initial wave of dispersal and not to subsequent long-distance gene flow events or introgression from wild relatives, the net effect of these subsequent demographic events will be to

During dispersal, the genetic divergence between populations is due to the progressive isolation of populations as their trajectories split. The earlier trajectories split, the more genetic divergence is to be expected. On the other hand, populations that share a large part of their trajectory will undergo a common loss of alleles (alleles which may continue on pathways in other directions from the origin) and have a higher degree of common ‘surfing’ alleles, which have emerged at intermediate locations

Ramachandran et al. predicted genetic distances between human populations with distances along dispersal routes out of Africa through waypoints

During crop dispersal, the first varieties to reach a new place are more likely to be taken further than varieties that arrive there later. Hence, the dispersal route of individual alleles will be close to the shortest (least-cost) path from the origin location to the location of the sampled population. However, there will also be sideward gene flow along the expansion front that will canalize genes towards parallel paths, bringing in a random element. Random walks can be modelled with analytical methods, using the analogy with electrical current, to avoid repeated simulations to determine probabilities

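For a given origin, destination, conductance matrix, and a value for θ, we calculate the probability of passage through each connection between cells. The resulting matrix P_{a} represents the stochastic trajectory from the origin to point a. The probability that two different trajectories (P_{a} and P_{b}) coincide in connections between cells can be calculated by multiplying the two matrices element by element:

$$P_{joint} = P_{a} \circ P_{b} \tag{1}$$

P_{joint} is a matrix with the probabilities of joint passage for each cell connection. Likewise, to determine to what extent connections between cells are part of the divergent part of the trajectories, we calculate the probability that the most probable trajectory crosses a cell connection and the least probable trajectory fails to do so. If this probability exceeds the probability that the least probable trajectory crosses the cell, there is enough asymmetry between the trajectories to consider the cell connection as part of the divergent part of the trajectory (max and min are taken element by element):

$$P_{disjunct} = \max\left(\max(P_{a}, P_{b}) \circ \left(1 - \min(P_{a}, P_{b})\right) - \min(P_{a}, P_{b}),\ 0\right) \tag{2}$$

Path overlap and divergence are then obtained by multiplying P_{joint} and P_{disjunct} with the resistance matrix, R:

$$\mathrm{overlap} = \sum_{i}\sum_{j} P_{joint,ij}\, R_{ij} \tag{3}$$

$$\mathrm{divergence} = \sum_{i}\sum_{j} P_{disjunct,ij}\, R_{ij} \tag{4}$$

(Here, we determine R as the reciprocal of the conductance matrix of the landscape model, but see below under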

Both populations are from South America (point locations A and B). The origin of the crop is in Mexico. A. Probability of passage from origin to location a. B. Probability of passage from origin to location b. C. Overlap of the trajectories. D. Divergence of the trajectories.
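The overlap and divergence calculation can be sketched for a toy network as follows. The Python code is illustrative only and is not taken from the analysis script. Because the divergence formula is only described verbally in the text, the expression used here (probability that the more probable trajectory passes and the less probable one does not, minus the probability that the less probable one passes, floored at zero) is one reading of that description. The dictionary-based representation of connections and the function name are assumptions.

```python
def path_overlap_divergence(p_a, p_b, resistance):
    """Return (overlap, divergence) of two stochastic trajectories.

    p_a, p_b: dicts mapping a connection (from_cell, to_cell) to the
              probability that the trajectory to a or b passes through it.
    resistance: dict mapping each connection to its resistance
                (reciprocal of conductance).
    """
    overlap = 0.0
    divergence = 0.0
    for edge, r in resistance.items():
        pa = p_a.get(edge, 0.0)
        pb = p_b.get(edge, 0.0)
        high, low = max(pa, pb), min(pa, pb)
        joint = pa * pb  # both trajectories use this connection
        # Assumed reading of the verbal rule: count the connection as divergent
        # only if "high passes and low does not" exceeds "low passes".
        disjunct = max(high * (1.0 - low) - low, 0.0)
        overlap += joint * r
        divergence += disjunct * r
    return overlap, divergence


# Toy example: a shared trunk from the origin to a fork, then two branches.
resistance = {("origin", "fork"): 2.0, ("fork", "a"): 3.0, ("fork", "b"): 4.0}
p_a = {("origin", "fork"): 1.0, ("fork", "a"): 1.0}
p_b = {("origin", "fork"): 1.0, ("fork", "b"): 1.0}
overlap, divergence = path_overlap_divergence(p_a, p_b, resistance)
```

In this deterministic example the overlap equals the resistance of the shared trunk and the divergence equals the summed resistance of the two branches. With randomized shortest paths, the passage probabilities would lie between 0 and 1 and spread over many connections.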

Using the computational strategies outlined above, we can derive from the landscape model different distance measures that relate to crop remain age, heterozygosity, and genetic distances. We use multiple-criteria optimization to evaluate how well our model can explain the archaeobotanical

Multiple goodness-of-fit values are determined by regression of the predictors against the archaeobotanical and genetic data. A genetic algorithm optimizes two or more goodness-of-fit measures through an iterative search of the best parameter values and origin coordinates. The outcome of this optimization is a Pareto front of solutions. Pareto solutions are those for which improvement in one goodness-of-fit dimension can only occur with the worsening of at least one other goodness-of-fit dimension. The shape of the Pareto front gives a good indication of the degree of convergence or conflict between the two datasets, given the model structure. A pointed, convex front (seen from the cloud of possible solutions) is evidence for convergence around the same solution. Inspecting the parameters and origin coordinates of the different solutions can provide insights into the source of the conflict and thus help in improving the model structure.
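The definition of Pareto solutions given above can be made concrete with a short sketch. This Python example only filters a finished set of candidate solutions and does not reproduce the genetic algorithm itself. The fit values are hypothetical and chosen for illustration, and the function names are assumptions.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on every criterion
    and strictly better on at least one (higher goodness-of-fit is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(solutions):
    """Keep the solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(other, s) for other in solutions)]


# Hypothetical (archaeobotanical fit, genetic fit) pairs.
candidates = [(0.43, 0.05), (0.30, 0.12), (0.18, 0.17), (0.25, 0.10), (0.40, 0.04)]
front = pareto_front(candidates)
```

Here the last two candidates are dropped because another solution fits both datasets at least as well. The three remaining solutions trade archaeobotanical fit against genetic fit, which is the kind of conflict the shape of the front is meant to reveal.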

The data analysis was done in

We modelled maize dispersal and diversity with a simple landscape model. We used a grid of 0.5 by 0.5 degree resolution, covering the study area. Grid cells were connected in eight directions (queen's case) to form conductance matrices. The area of origin was modelled as a single cell, which could be anywhere on land. The landscape model includes only information about the shape of the landmass to keep our example as simple as possible, but additional variables could be added (see

The conductance of between-cell connections on land was set to 1. The conductance of major water bodies was modelled with a decay function and a weight relative to the conductance of the landmass (p_{1}), following _{2}, the conductance half-value distance). Conductance over water bodies was calculated as

We limited the analysis to macrobotanical remains of maize, which at the moment is the only type of archaeological remains of this crop for which there is a somewhat complete coverage of the Americas. Radiocarbon dates were derived from

We used genetic data from

We first optimized the landscape model with (1) the age of the archaeobotanical crop remains and (2) the heterozygosity of contemporary maize samples. For a number of the Pareto solutions obtained, we then evaluated the goodness-of-fit with (3) the genetic distances. We choose this setup in two rounds to reduce computation time and to test the performance of our new path overlap and divergence metrics independently. The path overlap and divergence metrics should predict the genetic distances well if the modelling approach is coherent.

In the first round, the goodness-of-fit was determined with quantile regression, setting τ to 0.8 for both radiocarbon age and heterozygosity. Since the archaeobotanical data were highly unequally spread with an especially high density of observations in Colorado, New Mexico, and Arizona, we weighted each observation by 1/number of observations within a radius of 100 km from that observation (including the observation itself). Hence, an observation with one neighbour within 100 km distance received the weight 0.5, while an isolated observation was weighted as 1.
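The weighting scheme and a quantile-regression goodness-of-fit can be sketched as follows. The Python code is illustrative and is not the analysis script. The haversine distance with a 6371 km Earth radius is an assumption, as is the reading of R^{1} as one minus the ratio of the weighted check-function loss of the model to that of the best constant (the measure proposed by Koenker and Machado). The fitted values are taken as given, so the regression itself is not performed here.

```python
import math


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def density_weights(points, radius_km=100.0):
    """Weight = 1 / number of observations within radius_km, itself included."""
    return [1.0 / sum(1 for q in points if haversine_km(p, q) <= radius_km)
            for p in points]


def check_loss(residual, tau):
    """Quantile-regression check function."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def pseudo_r1(observed, fitted, weights, tau=0.8):
    """Assumed form of R^1: 1 - weighted loss of model / weighted loss of best constant."""
    model = sum(w * check_loss(y - f, tau) for y, f, w in zip(observed, fitted, weights))
    # The weighted check loss is piecewise linear in the constant,
    # so its minimum is attained at one of the observed values.
    null = min(sum(w * check_loss(y - c, tau) for y, w in zip(observed, weights))
               for c in observed)
    return 1.0 - model / null


# Two nearby observations and one isolated observation (hypothetical coordinates).
weights = density_weights([(-108.0, 37.0), (-108.1, 37.0), (-70.0, -15.0)])
```

The two nearby observations each receive a weight of 0.5 and the isolated one a weight of 1, matching the example in the text.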

We optimized with a population of 200 during 60 generations. Further improvements after 60 generations were minimal and solutions showed a regular pattern. We selected for further analysis a subset of nine representative Pareto solutions. We calculated path overlap and divergence based on these solutions for various values of θ. We evaluated the correspondence between path overlap/divergence and genetic distances with linear permutational regression with 999 permutations

We obtained a set of Pareto solutions that were overall similar. We provide the full set of obtained solutions as Supplementary Information. The R^{1} values for the fit with heterozygosity were low for all solutions, indicating that heterozygosity may be influenced by recent population bottlenecks and selection. The solutions with a higher fit for heterozygosity suggest a more northern origin of maize than those that correspond better to the archaeobotanical data. A strong source of tension between the archaeobotanical and genetic data is the conductance of water bodies, with the genetic data suggesting that water bodies are less conductive than what would be expected from the pace of dispersal according to the archaeobotanical data. The solutions with a good archaeological fit have higher half-value distances (p_{2}) than those with a good genetic fit. Hence, the main effect of the high weight given to water bodies in the latter solutions is to increase the conductance along the coast.

A. Model fits at the Pareto front, with selected solutions labeled from A to I. B. Inferred locations of crop origin, with colors and labels corresponding to those of

Solution | Origin longitude | Origin latitude | Water bodies: relative weight (p_{1}) | Water bodies: conductance decay half-value distance (km) (p_{2}) | Goodness of fit (R^{1}): archaeobotanical crop remain ages | Goodness of fit (R^{1}): heterozygosity of contemporary maize samples |
A | −84.29 | 14.22 | 0.23 | 5310 | 0.43 | 0.05 |
B | −84.34 | 10.31 | 0.18 | 5991 | 0.42 | 0.05 |
C | −87.81 | 13.17 | 2.03 | 3065 | 0.38 | 0.08 |
D | −87.08 | 12.63 | 2.39 | 3018 | 0.37 | 0.08 |
E | −89.11 | 13.37 | 2.12 | 3093 | 0.39 | 0.08 |
F | −99.94 | 17.83 | 6.10 | 2631 | 0.29 | 0.12 |
G | −101.58 | 19.41 | 7.51 | 2569 | 0.27 | 0.14 |
H | −102.06 | 19.91 | 8.39 | 2519 | 0.26 | 0.15 |
I | −101.17 | 19.17 | 16.17 | 2347 | 0.18 | 0.17 |

Visualizing and comparing the different solutions provides some additional information. In

Routes determined with randomized shortest paths (θ = 0.2) and logarithmically scaled. A. Routes of dispersal from origin (circle) to six locations (squares) according to solution A. B. Routes of dispersal from origin (circle) to six locations (squares) according to solution I.

A. Locations of the archaeobotanical observations with modelled isochrons (oldest quintile of macrobotanical remains) for solution A. B. Relation between the age of crop remains and the least-cost distance from the crop origin according to solution A. The colors of the observations correspond to Figure 4A. The line indicates the highest quintile (τ = 0.8) predicted by the model. C. Locations of the archaeobotanical observations with modelled isochrons (oldest quintile of macrobotanical remains) for solution I. D. Relation between the age of crop remains and the least-cost distance from the crop origin according to solution I. The colors of the observations correspond to Figure 4C. The line indicates the highest quintile (τ = 0.8) predicted by the model.

A. Locations of the genetic observations with isolines indicating modelled heterozygosity values (highest quintile) for solution A. B. Relation between heterozygosity and the least-cost distance for solution A. The colours of the observations correspond to

We used least-squares regression to determine the degree to which our dispersal model explains the variation in genetic distances, for different values of θ. An R^{2} of 0.36 to 0.38 was found for solutions F, G, H, and I. These are also the solutions with the highest fit for heterozygosity (^{2} = 0.16. All solutions provided significant predictors of genetic distance (p<0.001). The two variables had the expected sign (negative for path overlap, positive for divergence) in all cases, except for path divergence in solutions A and B with θ values of 1.5 and 2. In these cases, the contribution of path divergence was insignificant (0.02<p<0.26).

θ | A | B | C | D | E | F | G | H | I |
0.01 | 0.19 | 0.16 | 0.16 | 0.16 | 0.18 | 0.26 | 0.35 | 0.38 | 0.38 |
0.1 | 0.23 | 0.15 | 0.20 | 0.20 | 0.22 | 0.36 | 0.36 | 0.36 | 0.31 |
0.5 | 0.27 | 0.15 | 0.28 | 0.27 | 0.29 | 0.36 | 0.34 | 0.35 | 0.25 |
1 | 0.15 | 0.12 | 0.28 | 0.27 | 0.30 | 0.32 | 0.34 | 0.33 | 0.22 |
1.5 | 0.06 | 0.05 | 0.23 | 0.24 | 0.25 | 0.32 | 0.33 | 0.30 | 0.22 |
2 | 0.02 | 0.04 | 0.11 | 0.18 | 0.11 | 0.33 | 0.32 | 0.28 | 0.22 |

With a simple model for maize dispersal, we obtained a set of solutions with similar geographical origins. Multi-criteria assessment revealed the conflict between the archaeological and genetic datasets and gave clues regarding the possible causes for this tension. Different solutions can be inspected visually and compared. This allows us to formulate next steps to iteratively improve the model in a focused way.

One possible explanation of the observed conflict is that, along the coast, maize was transported over large distances, each movement causing a single genetic bottleneck, while over land, spread was more continuous and movements were shorter and alleles were lost each time seeds passed from hand to hand. Hence, differences between the genetic outcomes of dispersal over water versus land may partly explain the observed conflict between the two datasets. However, it seems more likely that the bottleneck observed in North American maize is related to geographical factors that are not taken into account here. The samples with low heterozygosity include Northern flint varieties. Northern flints are known to constitute a genetically very separate group

The results also demonstrate a fairly good predictivity of path overlap and non-overlap metrics for genetic distances. The best of the obtained goodness-of-fit values are reasonable when considering the simplicity of our model and values obtained in similar studies

The results show that the speed of dispersal does not correspond linearly to the loss of heterozygosity. Other factors of influence not included in the current model certainly have importance, like climatic factors in the case of the Northern flints. However, adding more variables to the model will affect dispersal speed and the loss of heterozygosity equally and will not break their linear relationship. The discrepancy can only be resolved by relaxing the assumption that genetic drift was constant in time and determining (archaeological) “travel time” and (genetic) “travel cost” separately.

In a next modelling iteration, the model should be expanded to make this possible. Shortest paths from the origin in the landscape model should be used to predict crop remain age. Also, the trajectories from the origin to the genetic samples should be determined on the basis of the landscape model. However, a separate genetic conductance matrix should then be used to calculate the cost of these routes in order to predict heterozygosity and genetic distances. For the path overlap and divergence metrics, this means that R (resistance) in formulae 3 and 4 is not determined as the reciprocal of the landscape model, but as the reciprocal of this genetic conductance matrix. Predictors for heterozygosity can also be determined with this same matrix. The parameters to construct the genetic conductance matrix are added to the multiple-criteria optimization. Otherwise, the approach remains the same. This extension of the approach would be a logical next iteration in our modelling exercise.

Next modelling iterations should also make use of additional information to refine the landscape model. For example, ecophysiological crop models

Our results make clear that next modelling steps should ideally incorporate the genetic distances directly into model fitting. Genetic distances are often easier to obtain than reliable heterozygosity values and the accuracy of heterozygosity values is often limited by small sample sizes. For inbreeding crops, heterozygosity values are not available or meaningful and only genetic distances can be used. As the repeated computation of path overlap and divergence is very time-consuming in the current implementation, reducing computation time is a priority. Parallel computing approaches can be used at several levels. On the other hand, it has been found that relatively coarse grids still produce relatively accurate results

Choosing appropriate values for τ and θ is an important issue. Values for τ should be determined taking into account the magnitude of error or bias in the data. Even so, the best value of τ is difficult to determine beforehand. The best value of θ is also difficult to determine beforehand, although reference values may become available if our approach is applied to various crops and dispersal processes. Sensitivity analyses could be applied to determine the influence of these two parameters.

We used least-squares regression to quantify the variance of the genetic distances the model was able to explain. As with crop remain age and heterozygosity, bias reduction in genetic distances could be achieved using quantile regression (giving emphasis to short genetic distances by setting τ to a low value). However, this would be under the assumption that the bias is mainly due to local divergence, not to posterior long-distance gene flow or introgression from wild relatives. For that reason, the quantile regression approach has limitations when working with contemporary genetic data. Bias reduction may also be achieved in other ways, for instance, by selectively removing outliers and introgressed samples. Also, if genetic data for archaeological crop remains become available, it should become possible to obtain a clearer genetic signal of the first wave of dispersal, which would then help to distinguish it from the changes that occurred after this first wave (local divergence, foreign introductions, and hybridization). For maize, long-distance gene flow after the first wave of dispersal seems to be due to relatively recent (colonial and post-colonial) migration and trade

Above we have modelled the geographical crop origin as a point location, but a crop origin may also be modelled as an area with a certain extent to assess the possibility of a ‘protracted’ domestication process, in which a crop evolves during a long period, perhaps several thousands of years, in an extended region, before spreading to other areas

The main methodological innovations in our approach are (1) the use of parameterized landscape conductance matrices to construct landscape models, (2) the use of quantile regression to reduce noise in radiocarbon dates of crop remains and heterozygosity values, (3) the use of multi-criteria optimization for simultaneous model assessment, and (4) the introduction of path overlap and divergence as measures to predict genetic distances. An important strength of our approach is that parameterization can be done in a fully automated way. There is no need to force routes through certain waypoints, to determine landscape conductance

Models with more variables and parameters can be evaluated with the presented methods and should lead to an increase of one or more goodness-of-fit values while not deteriorating the other values. The possibility to incrementally move from a simple initial model to more complex models, as comprehension of the processes studied increases, is crucial to successful modelling. It provides for a modelling approach that is driven by an understanding of the dominant processes supported by the data, while avoiding unnecessary details and computational effort. As the resulting models would be based on a representation of the underlying geographical processes, they could be used to predict levels of biodiversity in unsampled locations and lead to applications in genetic resources management.

R script. Script in R which replicates the complete analysis presented in the manuscript. Instructions to run this script: (1) Put all the documents in a single folder (this becomes your working directory). (2) Install the necessary packages in R (see first part of the script). (3) Define the working directory in the script (setwd("C:/…/"), as indicated in the script). (4) Run the script in R.

(0.02 MB TXT)

Sea mask. Geo-data in ASCII format: location of water bodies and land, 0.5×0.5 degree resolution.

(0.06 MB TXT)

Maize archaeobotanical database. Excel file with radiocarbon data and coordinates used in the analysis.

(0.14 MB XLS)

Maize SSR data. Excel file with the genetic (SSR) data used in the analysis, from ref. 22. The IDs were corrected to match IDs of file 5.

(3.48 MB XLS)

Plant samples maize SSR data. Excel file with further information on the samples used in the SSR analysis of File 4, from ref. 22. The IDs were corrected to match IDs of file 4.

(0.13 MB XLS)

Pareto solutions. Complete set of Pareto solutions generated by the analysis.

(0.02 MB CSV)

We thank Martin Maechler for helping with sparse matrices in R and Marco Saerens for providing Matlab code implementing randomized shortest paths. We thank all archaeologists who have generously supplied us with maize radiocarbon dates and site coordinates.