
KP, MF, and HM conceived and designed the experiments. KP performed the experiments. KP, MF, and HM analyzed the data. KP, MF, and HM contributed reagents/materials/analysis tools. KP, MF, and HM wrote the paper.

The authors have declared that no competing interests exist.

Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with a large number of occurrences of common genera, our methods give orderings whose correlation with geochronologic ages is 0.95.

Seriation, the task of temporal ordering of fossil occurrences by numerical methods, and correlation, the task of determining temporal equivalence, are fundamental problems in paleontology. With the increasing use of large databases of fossil occurrences in paleontological research, the need is increasing for seriation methods that can be used on data with limited or disparate age information. This paper describes a simple probabilistic model of site ordering and taxon occurrences. As there can be several parameter settings that have about equally good fit with the data, the authors use the Bayesian approach and Markov chain Monte Carlo methods to obtain a sample of parameter values describing the data. As an example, the method is applied to a dataset on Cenozoic mammals. The orderings produced by the method agree well with the orderings of the sites with known geochronologic ages.

Seriation, the task of temporal ordering of fossil occurrences by numerical methods, and correlation, the task of determining temporal equivalence, are fundamental problems in paleontology. Fossils have been used for both tasks since the very beginnings of modern paleontology [

In the past several decades, both seriation and correlation of fossil occurrences by numerical methods have in fact become practically feasible alternatives to conventional biostratigraphy. The computational solutions that have been developed for correlation and seriation have much in common, but the implementations differ depending on the purpose and the nature of the data (e.g., the CHRONOS [

Here we are explicitly concerned with the task of seriation, for which methods based on several distinct approaches are available. These include the graph-theoretical unitary associations method by Guex et al. [

A fossil site (a collection of fossil remains collected from some location, typically in a sedimentary deposit) may be loosely regarded as a snapshot of the set of taxa that lived at a certain location at approximately the same time. Sites and their taxa may be described as an occurrence matrix, i.e., a 0–1 matrix, where the rows correspond to sites and the columns correspond to taxa: a one in entry (i,j) means that taxon j has been found at site i. The snapshot may capture a smaller or larger proportion of the taxa that were actually present, a smaller or larger area, and a shorter or longer time interval, and it may be biased in different ways. It is therefore clear that the ones and zeros in such a matrix are not all equal. Some presences will be weakly founded on single specimens, others on hundreds or thousands of specimens from many sites. Similarly, many absences will be nothing more than missing data, whereas absences in well-sampled sites may carry more meaning. These facts virtually call out for a probabilistic approach to the analysis of paleontological presence-absence data.

Here we describe a straightforward probabilistic model that contains parameters for the origination and extinction of taxa, for the ordering of the sites, and for the probabilities of errors (wrong zeros and wrong ones). Given the ordering of the sites, the origination and extinction parameters for a taxon specify the interval in which the taxon is assumed to be present. Any occurrence (a one in the matrix) of the taxon outside this interval is considered to be a false occurrence, and any nonoccurrence (a zero in the matrix) within this interval is considered to be a false absence. Given the parameters, the likelihood of the data depends on the number of false and true ones and zeros. The task we consider is to find parameter vectors that yield high likelihood, i.e., have a small number of false ones and zeros.
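As an illustrative sketch, the likelihood computation just described can be written in a few lines of Python. The names X, order, intervals, p0 (false-zero probability), and p1 (false-one probability) are our notation for the quantities described above, not identifiers from the authors' implementation.

```python
import numpy as np

def log_likelihood(X, order, intervals, p0, p1):
    """Log-likelihood of a 0-1 site-by-taxon matrix X: rows are reordered
    by `order`, taxon m is assumed present exactly within
    intervals[m] = (first, last) (inclusive row positions in the ordered
    matrix), zeros inside the interval are false zeros (probability p0),
    and ones outside it are false ones (probability p1)."""
    Y = X[order]                      # rows in assumed temporal order
    ll = 0.0
    for m, (a, b) in enumerate(intervals):
        col = Y[:, m]
        inside = col[a:b + 1]
        outside = np.concatenate([col[:a], col[b + 1:]])
        # inside the lifetime: ones are true (1 - p0), zeros are false zeros (p0)
        ll += inside.sum() * np.log(1 - p0) + (len(inside) - inside.sum()) * np.log(p0)
        # outside the lifetime: ones are false ones (p1), zeros are true (1 - p1)
        ll += outside.sum() * np.log(p1) + (len(outside) - outside.sum()) * np.log(1 - p1)
    return ll
```

A parameter vector whose intervals cover all observed ones yields a higher likelihood than one that turns observations into false ones.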

In more detail, our probabilistic model is as follows. Given a dataset with N sites and M taxa, the parameters of the model are the ordering π of the sites, the origination and extinction points o_{m} and e_{m} of each taxon m, and the probabilities of false zeros and false ones.

Denoting by θ the whole parameter vector (π, {o_{m}}, {e_{m}}, p_{0}, p_{1}), we can write the likelihood P(X | θ) of the observed data matrix X given the parameters.

We could, in principle, find a parameter vector θ that maximizes the likelihood of the data (maximum likelihood solution). This parameter vector would give a total order for the fossil sites, implying a probability of zero or one for a site pre-dating another. However, we know that the data contain pairs of sites from the same time periods. We are interested in finding pairs of sites for which the seriation is uncertain, i.e., the probability of one site pre-dating another is close to one-half.

Therefore, we work in the Bayesian framework, and find a sample of parameter vectors where the probability of a vector is proportional to its posterior probability. To this end, we use the Markov chain Monte Carlo (MCMC) method [

MCMC methods yield samples from the posterior distribution of the parameters, and this makes it possible to study the space of the parameters in many different ways. For example, we can determine, for each pair of sites, the probability that one precedes the other. We can also estimate the probabilities of false zeros and false ones, and find, for a particular observation in the data, the probability that it is a false zero or a false one.
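For illustration, the pairwise precedence probabilities can be estimated from a sample of orderings as follows (a Python sketch; the function name and the sample layout, one position vector per MCMC sample, are our own choices):

```python
import numpy as np

def pair_order_probabilities(sampled_orders):
    """Given MCMC samples of site orderings (each a vector giving the
    temporal position of every site), estimate for each pair (i, j) the
    posterior probability that site i pre-dates site j."""
    samples = np.asarray(sampled_orders)     # shape (T, N): position of each site
    T, N = samples.shape
    P = np.zeros((N, N))
    for pos in samples:
        P += pos[:, None] < pos[None, :]     # 1 where site i precedes site j
    return P / T
```

An entry near one-half flags a pair of sites whose relative order the data cannot resolve.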

A further useful property of the model is that it is easy to incorporate additional information. For example, the model allows us to freeze the ordering of certain sites. That is, if we know that site i pre-dates site j, we can restrict the sampling to orderings in which site i appears before site j.

We first ran the experiment on synthetically generated data, with known “true” ordering and probabilities of false zeros and ones for varying numbers of sites and taxa, shown in

Results for Artificially Generated Datasets

The results on synthetic data show that the method quite accurately determines the parameters of the model: the expected values of the parameters are close to the values used to generate the data.

In MCMC simulations, different runs can converge to separate regions in the parameter space. This is indeed what happens with the datasets on genera of Cenozoic large land mammals. We ran 100 MCMC chains over the datasets, computed the variance of the negative log-likelihood within the best chain, and then included in our analysis all chains whose expected negative log-likelihood was within one sigma of the best chain.

The results are summarized in

Results on the Large Mammal Dataset

The probability that a site pre-dates another site, shown for all pairs of sites.

Black denotes probability one, and white denotes probability zero. For most pairs, the probability is close to zero or one, but some blocks of observations have many different orderings with high probability.

The pair-order matrices for all 100 chains are shown on our Web site (

For the dataset specified by _{t}_{s}_{m}_{m}

The sites have been ordered by _{m}_{t}_{s}

We further verified the detection of false zeros and ones by preparing two datasets, based on data parametrized by _{t}_{s}

We also tested a model where each taxon has its own probabilities of false zeros and false ones.

We have described a probabilistic model for paleontological data and shown that MCMC methods can be used to obtain samples from the posterior distribution of the parameters. The parameters of the model have a natural interpretation, and the hard orderings enable us to incorporate existing prior knowledge of the ordering in a natural way.

The task of finding the optimal ordering, or knowing for certain that a given ordering is optimal, is a very difficult problem. MCMC methods have the advantage of being able to explore various parts of the parameter space, but the issue of guaranteeing convergence of the sampling is always present in these methods. We have solved the problem of convergence by sampling 100 chains in parallel, and taking into account only the chains having the best log-likelihood. We have also checked that the pair-order matrices predicted by these best chains are consistent with each other. This way, we can state with reasonable confidence that our results are indeed an accurate description of the posterior distribution of the model. We also tested the method by adding false zeros and ones to the data randomly, and checking that they were identified correctly.

The results show that for generated data the method is able to reconstruct orderings and locate outliers with excellent accuracy. For the data on large late Cenozoic mammals, the results indicate a high level of agreement with existing orderings and correctly capture the basic feature of paleontological data that false absences are likely to be common and false presences rare.

For the past 40 years the main stratigraphic framework for the study of the Cenozoic land mammals from Europe has been the MN system [

The structure just described is also evident in the patterns of

Most genus occurrences considered by the model to be false are either genuine outliers in time and/or space or actual data errors. For example, of the ten cases at the head of the list for the _{t}_{s}

Some apparent false occurrences reflect the biology of the animal in question. For example, the genus

In [

Like Alroy's [_{i}

One major difference between Alroy's model and ours is that we use MCMC methods to obtain a sample of the possible parameter values instead of looking for the maximum likelihood solution. This provides additional information about the robustness of the estimates. In particular, we tested our model by randomly adding false zeros and ones, and found that they were identified correctly.

Formally, the data are given as a 0–1 matrix X, where X_{nm} = 1 if taxon m has been observed at site n, and X_{nm} = 0 otherwise.

First, we assume that the sites appear in some temporal order, denoted by a permutation π of the sites.

We further assume that there exists an ordering for all pairs of sites, i.e., for all sites i and j, either π(i) < π(j) or π(j) < π(i).

We could take all permutations to be a priori equally likely. Often, however, the mutual order of some sites is known in advance. We therefore assume that we are given a set of N_{H} pairs of sites {(h_{i1}, h_{i2})}_{i∈{1, … , N_{H}}}, the order of which is known, i.e., π(h_{i1}) < π(h_{i2}) for all i. We denote by Π_{H} the set of permutations that satisfy these hard orderings.

We assume that a priori all permutations in Π_{H} are equally likely, i.e., the prior probability of π is uniform over Π_{H}.

One should note that without hard orderings, i.e., when N_{H} = 0, the direction of time is not identifiable: an ordering and its reversal fit the data equally well.

One of the most basic properties of taxa is that they originate and later go extinct. Therefore, for each taxon m we introduce two parameters, an origination point o_{m} and an extinction point e_{m} with o_{m} ≤ e_{m}, and assume that taxon m is extant exactly at the ordered positions t with o_{m} ≤ t ≤ e_{m}.

If our observations were perfect, i.e., we would find samples of a taxon if and only if it was extant (there would, e.g., be no Lazarus events), the time-ordered observation matrix Y, defined by Y_{π(n)m} = X_{nm}, would satisfy Y_{tm} = 1 if and only if o_{m} ≤ t ≤ e_{m}.

We account for the imperfect observations by introducing two probabilities: the probability of a false zero, p_{0} = P(Y_{tm} = 0 | o_{m} ≤ t ≤ e_{m}), and the probability of a false one, p_{1} = P(Y_{tm} = 1 | t < o_{m} or t > e_{m}). Assuming that the entries of the matrix are independent given the parameters, the likelihood of the data is

P(X | θ) = ∏_{(t,m): o_{m} ≤ t ≤ e_{m}} (1 − p_{0})^{Y_{tm}} p_{0}^{1 − Y_{tm}} × ∏_{(t,m): t < o_{m} or t > e_{m}} p_{1}^{Y_{tm}} (1 − p_{1})^{1 − Y_{tm}},

where Y_{tm} = X_{π^{−1}(t)m}. Thus the likelihood depends on the parameters only through the numbers of true and false ones and zeros.

Parameters of Our Model, with Prior Distributions

We used a dataset of European late Cenozoic large land mammals derived from the NOW database (

For each locality, we calculated a database age as the mean of the minimum and maximum ages given in the original downloaded file. By MN age, we refer to the mean of the temporal boundaries of MN units according to the correlations given in [

We selected further data subsets as follows. First we selected the genera that occurred in at least C_{t} sites, and then the sites with at least C_{s} occurrences of these genera; the subsets are identified below by the threshold pair (C_{t}, C_{s}).

We use the hard orderings of the sites, given by the MN reference sites,

π(Paulhiac) < π(Montaigu-le-Blin) < π(Laugnac) < π(Wintershof-West) < π(La Romieu) < π(Pontlevoy) < π(Sansan) < π(La Grive M) < π(Can Llobateres I) < π(Masía del Barbo) < π(Crevillente 2) < π(Los Mansuetos) < π(Arquillo) < π(Perpignan) < π(Villafranca d'Asti (Arondelli)) < π(Saint Vallier)

Notice that not all reference sites appear in all of the datasets; the hard orderings are applied only to the reference sites present in each dataset.

Given the likelihood of Equation 6 and the prior of Equation 5, we can obtain the posterior distribution P(θ | X) by the Bayes rule.

We are interested in computing various expectation values over the parameter distribution. If we knew the posterior distribution, we could compute the expectations by integrating the quantity of interest against it.

However, analytic integration of the posterior distribution is infeasible. Instead of solving the integral of Equation 9 directly, we use numerical integration, namely the MCMC method.

The MCMC algorithm allows us to draw samples from the posterior distribution without actually solving the Bayes equation. The algorithm gives us a sequence of samples θ^{1}, …, θ^{T} from the posterior, from which the expectations can be estimated as sample averages.

The Markov chain in the name of the MCMC algorithm comes from the fact that each posterior sample θ^{t+1} is a stochastic function of the previous posterior sample θ^{t} and the data, producing a chain θ^{1}, θ^{2}, …, θ^{T}. The consecutive samples in the chain are not independent. If the chain is too short (T is small), the sample averages can be poor estimates of the corresponding expectations.
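For illustration, a posterior expectation can be approximated from such a chain by a sample average, after discarding a burn-in prefix and thinning the remaining samples to reduce autocorrelation (a minimal Python sketch; the function name and arguments are ours):

```python
import numpy as np

def mcmc_expectation(samples, f, burn_in=0, thin=1):
    """Approximate a posterior expectation E[f(theta)] by averaging f over
    the MCMC samples, discarding the first `burn_in` samples and then
    keeping only every `thin`-th sample."""
    kept = samples[burn_in::thin]
    return np.mean([f(theta) for theta in kept], axis=0)
```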

We first initialize each chain with random values, as follows:

The initial permutation is drawn from the prior, π^{1} ~ U(Π_{H}).

The initial intervals are set to the smallest intervals that produce no false ones given π^{1}.

The probabilities of false ones and zeros are initialized to p_{1}^{1} = 0.01 and p_{0}^{1} = 0.3, respectively.
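The second initialization step, choosing the smallest intervals that produce no false ones, can be sketched as follows (Python; the function name and the handling of all-zero columns are our assumptions):

```python
import numpy as np

def initial_intervals(X, order):
    """Initialize each taxon's lifetime as the smallest interval (in the
    ordered matrix) containing all of its occurrences, so that the initial
    state has no false ones."""
    Y = np.asarray(X)[order]
    intervals = []
    for col in Y.T:
        ones = np.flatnonzero(col)
        if len(ones):
            intervals.append((ones[0], ones[-1]))
        else:
            # degenerate choice for a taxon with no occurrences at all
            intervals.append((0, 0))
    return intervals
```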

After the initialization, we run the chain for a burn-in period of T_{B} iterations and discard the samples θ^{1}, …, θ^{T_{B}}; this gives the chain time to reach the region of high posterior probability.

In MCMC methods, the question of convergence always arises. The parameter space may have areas of large probability mass that are separate in the sense that it is very unlikely for a chain to jump from one of these regions to another. A chain may end up in only one such region, resulting in inaccurate expectations, because the integration then effectively takes only a small subset of the posterior mass into account. Indeed, efficient sampling of the full parameter space is in the general case a very difficult problem, and finding the maximum likelihood solution for these types of problems is typically NP-hard [

We proceed in two steps: first, we run 100 chains in parallel and compute the expected log-likelihood of each chain; second, we retain for the analysis only the chains whose expected log-likelihood is within one sigma of that of the best chain.

If the predictions given by the chains having high log-likelihood are consistent with each other, we can conclude that the chains have converged well and that the results are reliable. If, however, the predictions given by the chains differed, we would conclude that the chains have converged to separate regions in the parameter space; the spread between chains would also give an estimate of the error due to poor convergence.
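The chain-selection rule described above can be sketched as follows (Python; the function name is ours, and the one-sigma criterion follows the description in the text):

```python
import numpy as np

def select_chains(logliks_per_chain):
    """Keep only the chains whose mean log-likelihood lies within one
    standard deviation (measured within the best chain) of the best
    chain's mean log-likelihood."""
    means = np.array([np.mean(ll) for ll in logliks_per_chain])
    best = np.argmax(means)
    sigma = np.std(logliks_per_chain[best])
    return [k for k in range(len(means)) if means[best] - means[k] <= sigma]
```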

Specifically, assume that we analyze K chains.

If the expectations computed from the individual chains are denoted by f_{k}, k = 1, …, K, we combine them into the average f̄ = (1/K) Σ_{k} f_{k}, and attach to the combined result a variance estimated from the spread of the per-chain expectations, σ^{2}(f̄) = (1/(K(K − 1))) Σ_{k} (f_{k} − f̄)^{2}.

We use the expectations of the pair-order probabilities, compared via their Hellinger divergences d^{2}, to measure the similarity of the chains (see below).

The actual sampling rules are given below. In our implementation run of one chain over the dataset with _{t}_{s}

In this section, we describe the details of the sampling methods we have used. We use Y to denote the time-ordered data matrix, Y_{π(i)j} = X_{ij}, and π^{−1} to denote the inverse permutation, defined by π^{−1}(π(i)) = i.

The permutation π is the most difficult parameter to sample efficiently. To compensate for this difficulty, we have constructed four sampling rules for the permutation, which we iterate five times for each MCMC step.

The first sampling method consists of moving site

TOYOUNGER(

Let

For

Let π(

Let π(

Let _{m}_{m}_{m}

Let _{m}_{m}_{m}

TOOLDER(

Let

For

Let π(

Let π(

Let _{m}_{m}_{m}

Let _{m}_{m}_{m}

The actual MCMC step consists of moving a site

MOVEONESITE(

If

TOYOUNGER(

Else if

TOOLDER(

The sample is then taken by first selecting a random pair of indices,

The second part of the sampling of π consists of selecting an interval of positions [i, j] and reversing the corresponding sites, i.e., mapping π^{−1}(i), …, π^{−1}(j) to π^{−1}(j), …, π^{−1}(i).

REVERSE1(

Let π^{−1}(^{− 1}(

Let _{m}_{m}_{m}

Let _{m}_{m}_{m}

Swap _{m}_{m}_{m}_{m}

The sample is then taken by first selecting a random interval
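As an illustration of such a reversal update, a symmetric interval-reversal proposal with the standard Metropolis acceptance step can be sketched in Python (the names reverse_move and log_post are ours, and the details of the acceptance rule used in the actual implementation may differ):

```python
import numpy as np

def reverse_move(order, log_post, rng):
    """One Metropolis step of a REVERSE-type sampling rule: propose
    reversing a randomly chosen interval of the current ordering and
    accept with the usual Metropolis probability.  The move is symmetric,
    so no proposal correction is needed.  `log_post` maps an ordering to
    its (unnormalized) log posterior."""
    order = np.asarray(order)
    i, j = sorted(rng.choice(len(order), size=2, replace=False))
    proposal = order.copy()
    proposal[i:j + 1] = proposal[i:j + 1][::-1]   # reverse the interval
    if np.log(rng.random()) < log_post(proposal) - log_post(order):
        return proposal
    return order
```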

The third sampling rule for π, REVERSE2(

The fourth sampling rule consists of swapping neighboring sites, i.e.,

REVERSE1(

After sampling for the permutation, we proceed to sample the parameters

To sample the interval parameters o_{m} and e_{m} of each taxon m, we draw new endpoint values from their conditional distribution given the current ordering and error probabilities; given π, p_{0}, and p_{1}, the taxa are independent, so each pair (o_{m}, e_{m}) can be updated separately.

Summarizing, one sampling iteration consists of one round of sampling the permutation π (the four rules above, iterated five times), followed by sampling of the interval parameters and the error probabilities.
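As one concrete possibility for the error-probability updates, if conjugate Beta priors are assumed (an assumption of this sketch, not necessarily the priors listed in the parameter table), the conditional posterior given the counts of erroneous and correct entries is again a Beta distribution, so a Gibbs step suffices:

```python
import numpy as np

def sample_error_prob(n_false, n_true, alpha=1.0, beta=1.0, rng=None):
    """Gibbs update for an error probability (p_0 or p_1) under an assumed
    conjugate Beta(alpha, beta) prior: with n_false erroneous and n_true
    correct entries under the current ordering and intervals, the
    conditional posterior is Beta(alpha + n_false, beta + n_true)."""
    rng = rng if rng is not None else np.random.default_rng()
    return rng.beta(alpha + n_false, beta + n_true)
```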

We can visualize an MCMC chain by its pair-order matrix. To measure the similarity of two chains, we use the Hellinger distance between their pair-order matrices; d^{2}(P_{1}, P_{2}) ∈ [0, 1], and it is equal to zero only if the pair-order matrices are equal. The average Hellinger distance between the pair-order matrices of the eight chains used in the analysis of the dataset with _{t}_{s}
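An average squared Hellinger distance between two pair-order matrices, treating each entry as a Bernoulli probability, can be computed as follows (a Python sketch consistent with the properties stated above; the function name is ours):

```python
import numpy as np

def hellinger2(P1, P2):
    """Average squared Hellinger distance between two pair-order matrices,
    treating each entry as a Bernoulli probability.  The result lies in
    [0, 1] and is zero only when the matrices are equal."""
    P1, P2 = np.asarray(P1, float), np.asarray(P2, float)
    h2 = 1.0 - (np.sqrt(P1 * P2) + np.sqrt((1 - P1) * (1 - P2)))
    return float(np.mean(h2))
```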

We thank three anonymous referees for detailed and constructive comments that improved the manuscript significantly.

Markov chain Monte Carlo

Mammal Neogene