
The authors have declared that no competing interests exist.

Conceived and designed the experiments: CJP SJG CVC DJM JZ ALT JSM. Performed the experiments: CJP SJG ALT. Analyzed the data: CJP SJG ALT. Contributed reagents/materials/analysis tools: SJG CVC JWW DJM JAP JSM. Wrote the paper: CJP SJG ALT CVC DJM JAP JZ JSM. Developed the software used: CJP ALT.

We present a gridded 8 km-resolution data product of the estimated composition of tree taxa at the time of Euro-American settlement of the northeastern United States and the statistical methodology used to produce the product from trees recorded by land surveyors. Composition is defined as the proportion of stems larger than approximately 20 cm diameter at breast height for 22 tree taxa, generally at the genus level. The data come from settlement-era public survey records that are transcribed and then aggregated spatially, giving count data. The domain is divided into two regions, eastern (Maine to Ohio) and midwestern (Indiana to Minnesota). Public Land Survey point data in the midwestern region (ca. 0.8-km resolution) are aggregated to a regular 8 km grid, while data in the eastern region, from Town Proprietor Surveys, are aggregated at the township level in irregularly-shaped local administrative units. The product is based on a Bayesian statistical model fit to the count data that estimates composition on the 8 km grid across the entire domain. The statistical model is designed to handle data from both the regular grid and the irregularly-shaped townships and allows us to estimate composition at locations with no data and to smooth over noise caused by limited counts in locations with data. Critically, the model also allows us to quantify uncertainty in our composition estimates, making the product suitable for applications employing data assimilation. We expect this data product to be useful for understanding the state of vegetation in the northeastern United States prior to large-scale Euro-American settlement. In addition to specific regional questions, the data product can also serve as a baseline against which to investigate how forests and ecosystems change after intensive settlement. The data product is being made available at the NIS data portal as version 1.0.

Historical datasets provide critical context to understand forest ecology. They allow researchers to define ‘baseline’ conditions for conservation management, to understand ecosystem processes at decadal and centennial scales, to track forest responses to shifting climates, and, particularly in regions with widespread land use change, to understand the extent to which forests after conversion and regeneration differ from the original forest cover.

Euro-American settlement and subsequent land use change occurred in a time-transient fashion across North America and were accompanied by land surveys needed to demarcate land for land tenure and use. Various systems were used by surveyors to locate legal boundary markers, usually by recording and marking trees adjacent to survey markers. These data provide vegetation information that can be mapped and used quantitatively to represent the period of settlement. Early surveys (from 1620 until 1825) in the northeastern United States provide spatially-aggregated data at the township level [2] and no information about the locations of individual trees; we refer to these as the Town Proprietor Survey (TPS). Later surveys after the establishment of the U.S. Public Land Survey System (PLS) by the General Land Office (GLO) provide point-level data along a regular grid, with one-half mile (800 m) spacing, for Ohio and westward during the period 1785 to 1907 [

Logging, agriculture, and land abandonment have left an indelible mark on forests in the northeastern United States [

In

The raw data were obtained from land division survey records collated and digitized from across the northeastern U.S. by a number of researchers (

Locations are grid cells in the midwestern portion of the domain and townships in the eastern portion. In addition to locations without data being indicated in white, grid cells completely covered in water are white (e.g., a few locations in the northwestern portion of the domain in the states of Minnesota and Wisconsin).

Note that surveys occurred over a period of more than 200 years as European colonists (before U.S. independence) and the United States settled what is now the northeastern and midwestern United States. Our estimates are for the period of settlement represented by the survey data and therefore are time-transgressive; they do not represent any single point in time across the domain, but rather the state of the landscape at the time just prior to widespread Euro-American settlement and land use [

Extensive details on the upper Midwest (Minnesota, Wisconsin, Michigan) data and processing steps are available [

The aggregation into taxonomic groups is primarily at the genus level but is at the species level in some cases of monospecific genera. We model the following 22 taxa plus an “other hardwood” category: Atlantic white cedar (

Diameters are only recorded in the PLS data. Although surveyors avoided using small trees, there was no consistent lower diameter limit. The PLS data generally represent trees greater than 8 inches (ca. 20 cm) diameter at breast height (dbh), but with some trees as small as 1 inch dbh (smaller trees were much more common in far northern Minnesota). TPS data have no information about dbh, but the trees were large enough to blaze and are presumed to be relatively large trees useful for marking property boundaries.

There are approximately 860,000 trees in the midwestern subdomain and 420,000 trees in the eastern subdomain. In the midwestern subdomain, oak is the most common taxon and pine the second most common, while in the eastern subdomain oak is the most common and beech the second most common.

Our domain is a rectangle covering all of the states using a metric Albers (Great Lakes and St. Lawrence) projection (PROJ4: EPSG:3175), with the rectangle split into 8 km cells, arranged in a 296 by 180 grid of cells, with the centroid of the cell in the southwest corner located at (-71000 m, 58000 m). For the midwestern subdomain we use the western-most 146 by 180 grid of cells when fitting the statistical models. For the eastern subdomain we use the eastern-most 180 by 180 grid of cells and then omit 23 rows of cells in the north and 17 rows of cells in the south, as these grid cells are outside of the states containing data.
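To make the grid definition concrete, the following sketch (illustrative Python with hypothetical helper names; the coordinates and dimensions are those stated above) maps a projected Albers (EPSG:3175) coordinate to its 8 km cell and back to the cell centroid:

```python
# Sketch of the 8 km grid described above (hypothetical helper names).
CELL = 8000              # cell size in meters
X0, Y0 = -71000, 58000   # centroid of the southwest-corner cell
NX, NY = 296, 180        # grid dimensions (east-west, north-south)

def cell_index(x, y):
    """Return (column, row) of the cell containing projected point (x, y)."""
    col = round((x - X0) / CELL)   # nearest centroid = containing cell
    row = round((y - Y0) / CELL)
    assert 0 <= col < NX and 0 <= row < NY, "point falls outside the grid"
    return col, row

def cell_centroid(col, row):
    """Return the centroid coordinates of cell (col, row)."""
    return X0 + col * CELL, Y0 + row * CELL

# The southwest cell maps back to its stated centroid:
print(cell_index(-71000, 58000))   # (0, 0)
print(cell_centroid(1, 1))         # (-63000, 66000)
```

This is only index bookkeeping; projecting longitude/latitude into the Albers coordinate system would be done separately with standard GIS tools.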

We fit a Bayesian statistical model to the data, with two primary goals:

To estimate composition on a regular grid across the entire domain, filling gaps where no data are available, and

To quantify uncertainty in composition at all locations. Even in grid cells and townships with data, we wish to quantify uncertainty because the empirical proportions represent estimates of the true proportions that could be calculated using the full population of all the trees in a grid cell or township.

At a high level, the Bayesian statistical model estimates composition across the domain, even in locations with sparse or no data, by combining the raw composition data with the assumption that composition varies in a smooth spatial fashion across the domain. The information in the data is quantified by the data model, also known as the likelihood. The assumption of smoothness is built into the model by representing the true unknown spatially-varying composition using a statistical spatial process representation that induces smoothing of estimates across nearby locations. This spatial process representation is a form of prior distribution and is a function of model parameters called hyperparameters that determine the correlation structure of the process and are also estimated based on the data.

The result of fitting the Bayesian model via Markov chain Monte Carlo (MCMC) is a set of representative samples from the posterior distribution for the composition in the 23 taxonomic groupings at each of the grid cells. These samples are the data product (described further below).

In the remainder of this section we provide the technical specification of the model and of the computations involved in fitting the model.

We start by describing the basic model for those states for which we have raw data on the 8 km grid; a later subsection describes the extension of the model to the township-level data of the eastern subdomain.

The statistical model treats the observations as coming from a multinomial distribution with a (latent) vector of proportions for each grid cell,

$$y_i \sim \textrm{Multinomial}\big(n_i;\ \theta_1(s_i), \ldots, \theta_P(s_i)\big),$$

where $y_i$ is the vector of counts for the $P$ taxa in the $i$th cell, $n_i$ is the number of trees counted in the cell, and $(\theta_1(s_i), \ldots, \theta_P(s_i))$ is the vector of unknown proportions for those taxa at that cell. Note that we use a standard multinomial distribution without overdispersion, because the set of trees in the dataset is roughly uniformly sampled across the cells or townships [

The proportions, $\theta_p(s_i)$, $p = 1, \ldots, P$, are determined by latent spatial processes, one per taxon, with $g_p = \{g_p(s_1), \ldots, g_p(s_m)\}$ denoting the vector of values of the $p$th process, $g_p(\cdot)$, at the $m$ grid cells. Larger values of $g_p(s_i)$ correspond to larger proportions $\theta_p(s_i)$ of taxon $p$ at location $s_i$.

The critical component of the statistical model is the representation of $g_p(\cdot)$, the latent spatial surface for each taxon, which induces smoothing of the composition estimates across nearby locations.

In the next section, we consider two spatial models to define the structure of the $g_p(\cdot)$ processes.

MRF models represent the neighborhood information by working directly with the precision matrix (the inverse of the covariance matrix) of the values of the spatial process, so calculation of the prior density of $g_p$ is computationally simple. The difficulty lies in efficiently sampling $g_p$ within the MCMC given the non-Gaussian multinomial likelihood; the latent variable representation helps to alleviate this problem. Next we describe the two alternative spatial models that we considered; a later section describes the held-out data comparisons used to choose between them.

Our first model is a standard conditional autoregressive (CAR) model; technical details can be found in the literature on Markov random field models. The model is specified through a precision matrix with diagonal elements, $Q_{ii}$, equal to the number of neighbors for the $i$th area, and off-diagonal elements $Q_{ik} = -1$ (the negative of a weight of one) when areas $i$ and $k$ are neighbors and $Q_{ik} = 0$ when they are not. This gives the following (improper) prior for the values of $g_p(s_i)$ collected as a vector across all of the grid cells,

$$\pi(g_p) \propto \exp\!\left( -\frac{1}{2\sigma_p^2}\, g_p^{\top} Q\, g_p \right) = \exp\!\left( -\frac{1}{2\sigma_p^2} \sum_{\{i,k\}:\, A_{ik} = 1} \big(g_p(s_i) - g_p(s_k)\big)^2 \right),$$

where the sum is over unordered pairs of areas and $A_{ik} = 1$ if locations $i$ and $k$ are neighbors, $A_{ik} = 0$ otherwise.

We refer to this as the CAR model.
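As a concrete illustration (a Python sketch, not the paper's R implementation), the precision matrix just described can be assembled for a small grid; note that each row of $Q$ sums to zero, which is why the intrinsic CAR prior is improper:

```python
import numpy as np

def car_precision(nx, ny):
    """Build the intrinsic CAR precision matrix Q for an nx-by-ny grid with
    cardinal (rook) neighbors: Q[i,i] = number of neighbors, Q[i,k] = -1 for
    neighboring cells, 0 otherwise."""
    m = nx * ny
    Q = np.zeros((m, m))
    for i in range(m):
        r, c = divmod(i, nx)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < ny and 0 <= cc < nx:
                Q[i, rr * nx + cc] = -1
                Q[i, i] += 1
    return Q

Q = car_precision(3, 3)
# Interior cell has 4 neighbors, a corner cell has 2:
print(Q[4, 4], Q[0, 0])  # 4.0 2.0
# Every row sums to zero, so Q is singular and the prior is improper:
print(Q.sum(axis=1))
```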

Gaussian processes (GP) are also standard models for spatial processes [

Gaussian processes are generally constructed using one of a number of correlation functions that define how the strength of correlation between the values of the process at two locations decays as a function of the distance between the locations. We consider Gaussian processes in the commonly-used Matérn class, using the following parameterization of the Matérn correlation function,

$$R(d) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{8\nu}\, d}{\rho} \right)^{\!\nu} K_{\nu}\!\left( \frac{\sqrt{8\nu}\, d}{\rho} \right),$$

where $d$ is the distance between the two locations, $\rho$ is the spatial range parameter (approximately the distance at which the correlation drops to 0.1), $\nu$ is the smoothness parameter, and $K_\nu$ is a modified Bessel function of the second kind.

The approach of Lindgren and co-authors approximates a Gaussian process in the Matérn class (with smoothness fixed at $\nu = 1$) by a Gaussian Markov random field on the grid, with a sparse precision matrix whose entries are functions of a single parameter, $\kappa^2$. The entries corresponding to cardinal neighbors are $-2(\kappa^2 + 4)$, the entries for diagonal neighbors are 2, the entries for second-order cardinal neighbors are 1, and the diagonal entries are $4 + (\kappa^2 + 4)^2$, all up to a common constant of proportionality.
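The second-order stencil arises from convolving the first-order stencil (center $\kappa^2 + 4$, cardinal neighbors $-1$) with itself; the small Python sketch below (an illustration under the Lindgren et al. regular-grid discretization, not the paper's code) verifies the entries numerically:

```python
import numpy as np

def spde_stencil(kappa2):
    """Precision stencil for the second-order SPDE/GMRF approximation on a
    regular grid, formed by convolving the first-order 3x3 stencil
    (center kappa^2 + 4, cardinal neighbors -1) with itself."""
    first = np.zeros((3, 3))
    first[1, 1] = kappa2 + 4
    first[0, 1] = first[2, 1] = first[1, 0] = first[1, 2] = -1.0
    out = np.zeros((5, 5))
    for i in range(3):          # 2-D self-convolution of the 3x3 stencil
        for j in range(3):
            out[i:i + 3, j:j + 3] += first[i, j] * first
    return out

s = spde_stencil(kappa2=1.0)
# center: 4 + (k^2+4)^2; cardinal: -2(k^2+4); diagonal: 2; two-away: 1
print(s[2, 2], s[2, 1], s[1, 1], s[2, 0])  # 29.0 -10.0 2.0 1.0
```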

The primary difference between the CAR and Lindgren models is that the Lindgren model provides an additional degree of freedom by estimating $\kappa$ (equivalently, the spatial range $\rho$), which controls how quickly correlation decays with distance; the correlation structure of the CAR model is fixed, with only the overall scale estimated.

To ensure that the $\sigma_p^2$ parameter is mathematically equivalent between the two models, we reparameterize, producing our second model.

We refer to this model as the SPDE model, as the Lindgren approach is derived by discretizing a stochastic partial differential equation (SPDE) whose solution is a Gaussian process with Matérn covariance.

The ICAR specification contains a single set of hyperparameters, $\{\sigma_p\}$; we give each $\sigma_p$ parameter a uniform prior, with upper bound of 1000. For the SPDE model we also have hyperparameters $\{\mu_p\}$, which we give flat, non-informative priors (truncated at ±10), and $\{\rho_p\}$, which we give uniform priors on the interval (0.1, exp(5)). These various hyperparameters are unknown parameters that control the spatial structure of the two spatial models and are estimated from both the data and the prior distributions just specified, following the Bayesian approach.

It is well-known that devising an effective MCMC algorithm for models with latent Gaussian process(es) and a non-Gaussian likelihood is difficult. We therefore use a latent variable representation of the multinomial likelihood, described next, that yields closed-form conditional distributions and thereby improves the sampling of each $g_p$.

Suppose that compositional counts are available at a number of locations. At location $s_i$, a sample of $n_i$ observations is collected, and each observation (i.e., each tree) can be classified into one of $P$ categories. Let $y_{ij}$ denote the response variable indicating the category of the $j$th observation at location $s_i$. Let $y_{ij}$ be associated with latent variables $W_{ij1}, \ldots, W_{ijP}$ such that $y_{ij} = \operatorname{argmax}_p W_{ijp}$, i.e., the observed category is the one with the largest of the latent $W_{ijp}$ values. Consider the following example with two locations that are neighbors and $P = 2$ categories. The category of the $j$th tree at location $s_i$ is determined by the latent variables $W_{ij1}$ and $W_{ij2}$, governed by the latent process values $g_1(s_i)$ and $g_2(s_i)$, respectively. Suppose that $g_1(s_i) > g_2(s_i)$ for a given location: taxon 1 is then more likely to be observed than taxon 2 at that location, so the difference between $g_1(s_i)$ and $g_2(s_i)$ explains the relative abundance of the two taxa within a location, while the difference between $g_p(s_1)$ and $g_p(s_2)$ explains the difference in the abundance of taxon $p$ between the two neighboring locations.
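The argmax link is easy to verify by simulation. This sketch (Python rather than the paper's R) draws latent normals for many trees at one location and confirms that the taxon with the larger process value is observed more often:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_taxa(g, n_trees):
    """Simulate taxon labels at one location: each tree gets latent
    W_p ~ N(g_p, 1) for every taxon p, and the observed taxon is argmax_p W_p."""
    W = rng.normal(loc=g, scale=1.0, size=(n_trees, len(g)))
    return W.argmax(axis=1)

# Two taxa, with taxon 0 having the larger latent process value:
g = np.array([1.0, 0.0])
labels = simulate_taxa(g, n_trees=100_000)
prop0 = (labels == 0).mean()
print(round(prop0, 2))  # close to Phi((1 - 0) / sqrt(2)) ~ 0.76
```

For two taxa the implied proportion has a closed form, $\Phi\big((g_1 - g_2)/\sqrt{2}\big)$, since the difference of two independent unit-variance normals has variance 2.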

We developed an extension of the model described in previous sections to account for data at a different aggregation than our core 8 km grid. This extension introduces a new set of latent variables, one per tree, that indicate the grid cells in which the trees are located and that can be sampled within the MCMC as additional unknown parameters. Specifically, $c_{tj}$ is the latent “membership” variable for tree $j$ in township $t$, indicating the grid cell containing the tree. The prior distribution for $c_{tj}$ is a discrete distribution that puts mass, $w_{ti}$, on grid cell $i$ in proportion to the area of overlap between township $t$ and cell $i$; for grid cells that do not overlap the township, the corresponding $w_{t1}, \ldots, w_{tm}$ values are zero.

Using the latent variable representation, we have that $W_{tjp} \sim \mathrm{N}(g_p(s_{c_{tj}}), 1)$ for tree $j$ in township $t$. Within the MCMC we sample the $\{c_{tj}\}$, which provides a “soft” (i.e., probabilistic) assignment of trees to grid cells that respects both the known township in which the trees occurred and the uncertainty in which grid cells the trees occurred.

Note that this prior represents the location of each tree in a township as being independent of the other trees; this is somewhat unrealistic because it does not represent our knowledge that the trees in a township would be distributed somewhat regularly across the area of the township because the witness trees were used to indicate property boundaries.

The latent variable representation allows us to sample the $W_{ijp}$ variables in closed form (their conditional distributions are truncated normal) and to draw the entire vector of latent process values for each taxon, $g_p$, as a single sample that respects the spatial dependence structure for each taxon.

While the latent variable representation provides great advantages in the MCMC sampling for each $g_p$ compared to joint Metropolis updates or updating each location individually, there is still strong dependence between the hyperparameters, $\alpha_p$, and the process values, $g_p$ (and between $g_p$ and the $W_{ijp}$ variables). To address the first, we developed a “cross-level” joint updating strategy in which we propose the hyperparameters ($\alpha_p \in \{\sigma_p, (\sigma_p, \rho_p)\}$) via a Metropolis-style random walk and then, given the proposed value, sample $g_p$ from its full conditional distribution given $W_p$, where $W_p = \{W_{ijp}\}$ is the vector of all $W_{ijp}$ values for taxon $p$ across the trees and cells $s_i$. This is equivalent to sampling from the marginalized (with respect to $g_p$) distribution of $\alpha_p$ conditional on $W_p$. For these various joint samples of hyperparameters and $g_p$, we use adaptive Metropolis sampling [

The full description of the MCMC sampling steps is provided in the Appendix, along with the calculation of the proportions, $\theta_p(s_i)$, from the latent variables $W_{ijp}$ and the process values $g_p(s_i)$.

The model is implemented in R [

We compared the CAR and SPDE models by holding out data from the fitting process and assessing the fit of the model on the held-out data. We used two experiments with held-out data:

The first experiment used a subregion containing most of Minnesota and a small amount of western Wisconsin, defined to be the cells whose x-coordinate was less than 300,000 m (this defines a north-south line that approximately goes through Duluth, Minnesota) and hereafter referred to as the “Minnesota subregion”. We chose this subregion for evaluation because of its high data density, allowing us to experiment with the effects of increasing data sparsity on model performance. We held out all data from 95% of the cells in this Minnesota subregion, with cells selected at random. This was meant to assess the ability of the model to interpolate from a sparse set of cells/townships and mimics the limited data in Illinois and Indiana.

We held out 5% of the trees from all of the trees in the dataset for the midwestern subdomain (leaving aside the held-out Minnesota subregion cells). This was meant to assess the ability of the model to estimate the composition in cells in which data were available.

Finally, in a separate sensitivity analysis we instead left out 80% of the cells in the Minnesota subregion at random. This variation on the first experiment above was meant to indicate whether our model comparison conclusions would be robust as the digitization process for Illinois and Indiana progresses and provides us with increasingly dense data.

There has been extensive work in the statistical literature on good metrics to use to compare the predictive ability of models; these metrics are referred to as scoring rules. A general conclusion from this work is that predictive distributions should maximize sharpness subject to calibration. That is, the predictive distribution should be as narrow as possible while being calibrated such that the observations are consistent with the distribution [

Following the suggestions in this literature, we use the following metrics. Define $y_i = \{y_{i1}, \ldots, y_{iP}\}$ as the counts of held-out trees in cell $i$ by taxon, where $n_i$ is the total count of held-out trees in the cell, while $y_{ijp}$ is an indicator variable taking value either 0 or 1 depending on whether the $j$th of the $n_i$ held-out trees in cell $i$ is of taxon $p$.

Brier score: the mean squared discrepancy between the predicted proportions and the outcome indicators for the held-out trees, $\frac{1}{\sum_i n_i} \sum_i \sum_{j=1}^{n_i} \sum_{p=1}^{P} \left( y_{ijp} - \theta_p(s_i) \right)^2$.

Log predictive density: This metric takes the log of the probability density of the held-out observations under the fitted model, $\sum_i \log f(y_i \mid n_i, \theta(s_i))$, where $f$ is the multinomial probability mass function; we report the negative log density, for which smaller values are better.

While in principle, this metric should be optimal [

(Experiment 1 only) Weighted mean absolute error (MAE) and weighted root mean square prediction error (RMSPE), comparing the predicted proportions, $\theta_p(s_i)$, to the empirical proportions, $y_{ip}/n_i$, of the held-out data, with cells weighted by the number of held-out trees.

(Experiment 1 only) Coverage and length of 95% prediction intervals for the empirical proportions, $y_{ip}/n_i$. We considered only cells with at least 50 trees to focus our assessment on cases where empirical proportions were reasonably certain and avoid being strongly influenced by predictive inference for cells where observational variability dominates.

Note that all of the metrics except coverage and interval length can be applied to individual posterior samples and therefore allow us to estimate the posterior probability that one model has a lower (better) value of the metric than the other model by simply calculating the proportion of samples for which the model has a lower value of the metric. Also note that in addition to allowing comparison between models the MAE and RMSPE metrics allow one to assess absolute performance of each model in predicting composition.
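As an illustration of how such metrics are computed from predictions and held-out counts, consider the following sketch (Python; the exact weighting conventions in the paper may differ, so treat these definitions as assumptions):

```python
import numpy as np

def brier(theta, y_ind):
    """Mean multicategory Brier score over held-out trees.
    theta: (n_trees, P) predicted proportions for each tree's cell.
    y_ind: (n_trees, P) one-hot indicators of each tree's observed taxon."""
    return ((y_ind - theta) ** 2).sum(axis=1).mean()

def weighted_rmspe(theta_cell, y_counts):
    """RMSPE between predicted and empirical cell proportions, weighting
    cells by their number of held-out trees (an assumed convention)."""
    n = y_counts.sum(axis=1, keepdims=True)
    emp = y_counts / n
    w = n / n.sum()
    return np.sqrt((w * (theta_cell - emp) ** 2).sum() / theta_cell.shape[1])

# Tiny example: one cell, two taxa, 4 held-out trees (3 of taxon 0)
theta = np.array([[0.75, 0.25]] * 4)
y_ind = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
print(brier(theta, y_ind))  # 0.375
print(weighted_rmspe(np.array([[0.75, 0.25]]), np.array([[3, 1]])))  # 0.0
```

Because these metrics are simple functions of the predicted proportions, they can be evaluated either per posterior sample (giving a posterior distribution over the metric) or once on the posterior mean predictions, matching the two columns reported in the tables below.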

In our initial exploratory fitting, we noticed that the SPDE model produced boundary effects in the predicted composition near the edges of the convex hull of the observations. To attempt to alleviate this, we added a buffer zone with a width of six grid cells around our entire original domain, but note that the boundary effects were still evident even after inclusion of the buffer. For the model comparison, we included this buffer for both the SPDE and CAR models.

We ran each model for 150,000 iterations. After discarding 25,000 iterations for burn-in, we retained a posterior sample of 250 subsampled iterations—we use a subsample instead of the full 125,000 post-burn-in iterations to reduce post-processing computations and storage needs.
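The retention step is simple index bookkeeping; a sketch with the counts stated above (the even spacing of the subsample is our illustrative assumption — any representative subsample would do):

```python
# Retain 250 posterior samples from 150,000 MCMC iterations after
# discarding the first 25,000 as burn-in.
n_iter, burn, keep = 150_000, 25_000, 250
post = range(burn + 1, n_iter + 1)           # 125,000 post-burn-in iterations
stride = len(post) // keep                   # 500
retained = list(post)[stride - 1::stride]    # every 500th iteration
print(len(retained), retained[0], retained[-1])  # 250 25500 150000
```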

Here we summarize the results of our analyses that inform the choice between the CAR and SPDE models.

For Experiment 1 (full cells held out), in which data from 95% of the cells in the Minnesota subregion were held out of the fitting process, the CAR model outperforms the SPDE model based on the posterior distribution over the predictive metric values (see the two tables below).

| Metric | Posterior mean of metric: CAR | Posterior mean of metric: SPDE | Posterior Prob. CAR < SPDE | Metric of posterior mean predictions: CAR | Metric of posterior mean predictions: SPDE |
|---|---|---|---|---|---|
| Brier | 0.819 | 0.844 | 0.98 | 0.738 | 0.733 |
| Negative Log Density | 466325 | 510383 | 1.00 | 394003 | 394554 |
| Mean Absolute Error | 0.0364 | 0.0383 | 0.98 | 0.0275 | 0.0269 |
| Root Mean Square Error | 0.0897 | 0.0960 | 0.97 | 0.0647 | 0.0627 |

Smaller values are better for all metrics.

| | CAR model | SPDE model |
|---|---|---|
| Coverage | 0.977 | 0.978 |
| Mean Interval Length | 0.129 | 0.142 |
| Median Interval Length | 0.037 | 0.033 |

Coverage values near 0.95 are optimal, while shorter intervals are better.

The results for the variation on Experiment 1 in which the proportion of cells that are held out decreases from 95% to 80% show that the SPDE model generally outperforms the CAR model, but again the differences from a practical perspective, based on mean absolute error, are limited (see the two tables below).

| Score | Posterior mean of score: CAR | Posterior mean of score: SPDE | Posterior Prob. CAR < SPDE | Score of posterior mean predictions: CAR | Score of posterior mean predictions: SPDE |
|---|---|---|---|---|---|
| Brier | 0.773 | 0.765 | 0.10 | 0.710 | 0.710 |
| Negative Log Density | 355928 | 353987 | 0.25 | 311525 | 311902 |
| Mean Absolute Error | 0.0309 | 0.0296 | 0.10 | 0.0226 | 0.0223 |
| Root Mean Square Error | 0.0763 | 0.0739 | 0.02 | 0.0533 | 0.0530 |

Smaller values are better for all metrics.

| | CAR model | SPDE model |
|---|---|---|
| Coverage | 0.981 | 0.972 |
| Mean Interval Length | 0.112 | 0.103 |
| Median Interval Length | 0.028 | 0.022 |

Coverage values near 0.95 are optimal, while shorter intervals are better.

In Experiment 2 (individual trees held out), we have evidence (posterior probability of 0.93) that the SPDE model is better based on the Brier score, but the Brier score values for the two models are numerically almost the same (0.662 vs. 0.661; see the table below).

| Metric | Posterior mean of metric: CAR | Posterior mean of metric: SPDE | Posterior Prob. CAR < SPDE | Metric of posterior mean predictions: CAR | Metric of posterior mean predictions: SPDE |
|---|---|---|---|---|---|
| Brier | 0.662 | 0.661 | 0.07 | 0.657 | 0.657 |
| Negative Log Density | 51757 | 51626 | 0.01 | 50705 | 50736 |

Smaller values are better for all metrics.

The differences between models are not consistent across the various comparisons, so there is not a clear choice. In our final data product we use the CAR model, for three reasons. First, the CAR model has modestly better performance when data are sparse, as is still the case for Illinois and Indiana. Second, the model is simpler and easier to explain, and computations can be done more quickly. Third, predictions from the SPDE model showed boundary effects, with some taxa showing non-negligible posterior mean values at the edges of the domain, well away from where the taxa were present in the empirical data. This included non-negligible values within (but near the edge of) the convex hull of locations with data.

The final data product is a dataset that contains 250 posterior samples of the proportions of each of the 23 tree taxa at each grid cell in the states in our domain of the northeastern United States.

For this final data product, we ran the model using the CAR specification with all of the data (including the data held out in the model comparison analyses) for 150,000 iterations, with the same burn-in and subsampling details as described above.

Maps of estimated composition for the full domain for several taxa of substantive interest illustrate the results, contrasting the raw data proportions, the posterior means, and posterior standard deviations as pointwise estimates of uncertainty (

Empirical proportions from raw data (column 1), predictions in the form of posterior means (column 2) and uncertainty estimates in the form of posterior standard deviations—representing standard errors of prediction (column 3). In raw data plots, white indicates no data.

The data product is publicly available at the NIS Data Portal under the CC BY 4.0 license as version 1.0 as of February 2016 [

In the parts of the modeled region with spatially complete data (in particular, Minnesota, Wisconsin, and Michigan), the statistical estimates of forest composition closely match the patterns apparent in the raw data (

A key advance of this work over prior reconstructions of settlement-era vegetation lies in the estimates of uncertainty across the spatial domain. These estimates of uncertainty include the sampling uncertainty within grid cells (as do the within-grid cell estimates of uncertainty available from the raw proportions), but, because this is a spatial model, predictions and their associated uncertainty estimates are also informed by the information content of nearby cells. The maps of standard errors across species (

The exploration of alternative approaches to spatial modeling of composition showed similar results for the SPDE and CAR models, both in terms of prediction accuracy and performance of prediction intervals. Small differences among the various metrics of goodness of fit favored each model in turn, but applied users of the models would find little pragmatic difference affecting scientific inference. Ultimately, we slightly favor the CAR model, because it avoids the boundary effects apparent in the SPDE model at the edges of the domain.

The models presented here estimate only the relative abundance of tree taxa, which does not directly tell us about tree density or other aspects of vegetation structure. This becomes a particular limitation for interpreting vegetation where trees become sparse at the prairie-forest transition from northern Minnesota through southern Illinois [

Define $N$ to be the diagonal matrix whose $i$th diagonal element, $N_{ii}$, is the number of trees in the $i$th grid cell; for grid cells with no data, $N_{ii} = 0$. For the township data, at each iteration, based on the current values of the grid cell membership variables, $\{c_{tj}\}$, trees are aggregated into grid cells and the calculations above can then be carried out.

The conditional distribution for $W_{ijp}$ given the other unknowns in the model and the data is as follows. Let $\mathrm{TN}(\mu, \sigma^2, a, b)$ denote the truncated normal distribution with mean parameter $\mu$ and variance parameter $\sigma^2$, truncated below at $a$ and above at $b$. If tree $j$ in cell $i$ is of taxon $p$ (i.e., $y_{ij} = p$), then $W_{ijp} \sim \mathrm{TN}(g_p(s_i), 1, \max_{q \neq p} W_{ijq}, \infty)$; otherwise $W_{ijp} \sim \mathrm{TN}(g_p(s_i), 1, -\infty, W_{ij y_{ij}})$.

The conditional distribution of $g_p$ given the latent variables is multivariate normal,

$$g_p \mid W_p \sim \mathcal{N}\!\left( \left(Q_p + N\right)^{-1} v_p,\; \left(Q_p + N\right)^{-1} \right),$$

where $Q_p$ is the prior precision matrix of $g_p$ (scaled by $1/\sigma_p^2$) and the $i$th element of $v_p$ is the sum of the $W_{ijp}$ values over the trees in cell $i$. Letting $\alpha_p = \log \sigma_p$ for the CAR model and $\alpha_p = (\log \sigma_p, \log \rho_p)$ for the SPDE model, we sample $\{\alpha_p, g_p\}$ jointly, proposing $\alpha_p$ as a random walk and, conditional on the proposed value of $\alpha_p$, sampling $g_p$ from the distribution just above. The joint proposal is accepted or rejected as a standard Metropolis-Hastings proposal, with adaptation of the proposal (co)variance. The proposal distribution for $\alpha_p$ is a normal distribution (bivariate for $\alpha_p = (\log \sigma_p, \log \rho_p)$).

For the township-level data, for a given tree $j$ in township $t$, the conditional distribution of the membership variable $c_{tj}$ is a discrete distribution with probabilities proportional to $\{w_{t1} L_{tj1}, \ldots, w_{tm} L_{tjm}\}$, produced by multiplying the prior weights by a likelihood contribution, $L_{tji}$, where $L_{tji}$ is the density of the latent $W_{tj1}, \ldots, W_{tjP}$ values for the given tree under the condition that $c_{tj} = i$. The likelihood contribution quantifies how consistent the latent $W_{tj}$ values for a tree are with the process values $g_1(s_i), \ldots, g_P(s_i)$ in each candidate grid cell.
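A single update of a tree's membership variable can be sketched as follows (plain Python with hypothetical argument names; the Gaussian form of the likelihood contribution follows from the latent variable model):

```python
import math, random

def sample_membership(prior_w, g_by_cell, W_tree):
    """Sample the grid-cell membership for one tree.
    prior_w[i]: prior weight of cell i (zero for cells off the township).
    g_by_cell[i][p]: latent process value for taxon p in cell i.
    W_tree[p]: the tree's current latent variables W_1..W_P.
    The likelihood contribution for cell i is the product of N(W_p; g_p, 1)
    densities over taxa p."""
    probs = []
    for w_i, g_i in zip(prior_w, g_by_cell):
        if w_i == 0.0:
            probs.append(0.0)
            continue
        log_l = sum(-0.5 * (W - g) ** 2 - 0.5 * math.log(2 * math.pi)
                    for W, g in zip(W_tree, g_i))
        probs.append(w_i * math.exp(log_l))
    total = sum(probs)
    probs = [p / total for p in probs]          # normalize
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):               # draw from the discrete dist.
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

# A tree whose latent values match cell 0's process values is most likely
# assigned to cell 0; cell 2 (zero prior weight) is impossible:
i, probs = sample_membership([0.5, 0.5, 0.0],
                             [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]],
                             [1.0, -1.0])
print(probs[0] > probs[1], probs[2] == 0.0)  # True True
```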

In the latent variable representation, $\theta_p(s_i)$ is not sampled directly; it is a deterministic function of the process values $g_1(s_i), \ldots, g_P(s_i)$ through the latent $W_{ijp}$ variables. The quantity is $\theta_p(s_i) = \Pr(W_p > W_q\ \forall\, q \neq p)$ for independent $W_q \sim \mathrm{N}(g_q(s_i), 1)$, which we evaluate numerically for each posterior sample of the process values.
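This probability can be approximated by Monte Carlo, as in the following sketch (Python; the number of draws is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def theta_from_g(g, n_draws=200_000):
    """Monte Carlo estimate of theta_p = P(W_p > W_q for all q != p)
    with independent W_p ~ N(g_p, 1)."""
    W = rng.normal(loc=g, scale=1.0, size=(n_draws, len(g)))
    winners = W.argmax(axis=1)
    return np.bincount(winners, minlength=len(g)) / n_draws

theta = theta_from_g(np.array([0.5, 0.5, -1.0]))
print(theta.sum())          # proportions sum to 1
print(theta[0] > theta[2])  # larger g implies a larger proportion
```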

The authors are deeply indebted to all of the researchers over the years who have preserved, collected, and digitized survey records, in particular John Burk, Jim Dyer, Peter Marks, Robert McIntosh, Ed Schools, Ted Sickley, Ronald Stuckey, and the Ohio Biological Survey. We thank Madeline Ruid, Benjamin Seliger, Morgan Ripp and Daniel Handel for processing of the southern Michigan data. Indiana and Illinois data were made possible through the hard work of many Notre Dame undergraduates in the McLachlan lab.