The authors have declared that no competing interests exist.
Conceived and designed the experiments: TL UW DG MB CW. Performed the experiments: TL MB. Analyzed the data: TL UW DG MB CW. Contributed reagents/materials/analysis tools: RM KP. Wrote the paper: TL DG UW MB CW RM KP.
Networks are rarely completely observed and prediction of unobserved edges is an important problem, especially in disease spread modeling where networks are used to represent the pattern of contacts. We focus on a partially observed cattle movement network in the U.S. and present a method for scaling up to a full network based on Bayesian inference, with the aim of informing epidemic disease spread models in the United States. The observed network is a 10% state-stratified sample of Interstate Certificates of Veterinary Inspection (ICVIs), which are required for interstate movement. The sample describes approximately 20,000 movements from 47 of the contiguous states, with origins and destinations aggregated at the county level. We address how to scale up the 10% sample and predict unobserved intrastate movements based on observed movement distances. Edge prediction based on a distance kernel is not straightforward because the probability of movement does not always decline monotonically with distance due to underlying industry infrastructure. Hence, we propose a spatially explicit model where the probability of movement depends on distance, number of premises per county and historical imports of animals. Our model performs well in recapturing overall metrics of the observed network at the node level (U.S. counties), including degree centrality and betweenness, and performs better than randomized networks. Kernel generated movement networks also recapture observed global network metrics, including network size, transitivity, reciprocity, and assortativity, better than randomized networks. In addition, predicted movements are similar to observed movements when aggregated at the state level (a broader geographic level relevant for policy) and are concentrated around states where key infrastructures, such as feedlots, are common.
We conclude that the method generally performs well in predicting both coarse geographical patterns and network structure and is a promising method to generate full networks that incorporate the uncertainty of sampled and unobserved contacts.
Network analysis is an important technique for extracting epidemiologically relevant information from complex systems. For livestock diseases, animal movement networks have received particular attention because they may serve as a proxy for contact networks for disease spread
In this study we focus on the network of cattle movements in the United States. While considered an important mechanism for disease transmission, the extent of cattle movements in the U.S. is not well characterized, making any surveillance, prediction and control for animal diseases extremely challenging
In this paper we present a novel Bayesian kernel approach to address all three issues: (i) 10% sampling, (ii) sampling only interstate movements, and (iii) source-sink dynamics in the U.S. cattle industry. Our aim is to parameterize a spatially explicit probabilistic model for individual movements that may be used for prediction of the whole network structure. Therefore, performance of the model is evaluated by comparing a set of network statistics to the observed network (as given by the ICVI reports) as well as randomized networks. As such, we are fitting the model at a low level (i.e. individual movements) and subsequently evaluating the model performance at a higher level (node-level and global network properties). This paper is structured such that we first introduce the data used for the analysis. We then introduce the kernel and present how parameters are estimated in a Bayesian framework using Markov Chain Monte Carlo (MCMC) simulation. Finally, the model performance is evaluated by comparing networks generated from the posterior predictive distribution of the fitted kernel model with the observed data as well as with randomized networks (
This analysis uses three different data sets. ICVIs provide data on interstate animal movement. Data from the National Agricultural Statistics Service (NASS) describes the current distribution of cattle premises, and a separate NASS survey provides historical measures of cattle flows at the state level.
ICVIs are official documents required for most interstate cattle movements, with the exception of animals going directly to slaughter. In general, ICVIs list the origin and destination addresses for the cattle shipment, number of cattle in the shipment, purpose of shipment, and breed of cattle in the shipment. ICVIs are generally stored as paper documents by the individual states. Characterizing cattle movements requires digitizing a large number of paper documents, and sampling is necessary to make data collection feasible. We requested that all states send a 10% sample of their calendar year 2009 cattle ICVIs that originated in their state by taking a systematic sample of every tenth cattle ICVI. We specifically requested origin ICVIs to avoid duplication because copies of ICVIs are maintained by both the sending and receiving states.
We obtained calendar year 2009 ICVIs from 48 states, with the exceptions being New Jersey (did not participate) and Alaska (no ICVIs to report). We excluded Hawaii from the analysis because its contact pattern with other parts of the U.S. is expected to depend on different underlying processes. In general, we successfully obtained a 10% systematic sample of 2009 export ICVIs, but approximations of this sampling design were implemented in Kentucky, Missouri and Vermont to accommodate time and budget constraints.
We created a database of the ICVIs including: origin and destination address; dates the animals were inspected, shipped, and the ICVI was received at the state veterinarian’s office; the purpose of the shipment; whether the shipment was beef or dairy cattle; the number of animals; and the breeds, age, and gender distributions of the cattle in the shipment. In all, this database contains 19,170 interstate shipment records from 2,433 counties. We classified shipments as beef or dairy using shipment purpose data on the ICVI. If the production type was not present on the ICVI, a classification tree analysis was used to classify the shipment as beef or dairy (Buhnerkempe, unpublished). We aggregated all address information for the origin and destination to the county level and focus on networks with counties as nodes and movements between counties as edges, using the county centroids to calculate distances (
Our model adjusted the probability of movements between counties by the number of premises as reported by the most recent (2007) NASS census of U.S. agriculture. We used data reporting the number of beef and dairy cattle premises per county and define premises as a general term for any type of operation where cattle are traded as a commodity according to the NASS definition: any establishment from which $1,000 or more of agricultural products were sold or would normally be sold during the year (NASS:
We also used historical summaries of the number of cattle moved into each U.S. state from other states (inflow) to incorporate national-scale cattle flow patterns. We obtained interstate inflow data from 1988–2009 NASS reports of the total number of cattle imported into each state. The inflows have no information on the states of origin. Historical summaries are available at
Here, we describe a novel method based on a Bayesian kernel approach presented in
We are interested in the joint probability of the total number of movements (
We assume no biases in observing intrastate vs. interstate movements and the probability of a movement from county
To quantify the width and shape of the spatial kernel, we use two-dimensional measures of variance and kurtosis, respectively, as defined by
This distribution has some benefits in that it may take the form of some well known distributions as special cases, such as the normal distribution (
Through
Description | Source for estimation and comments |
Estimated state level parameters | |
State ( | Estimated jointly, conditional on all data as well as hierarchical parameters for |
Hierarchical parameters | |
Mean ( | Estimated in the analysis and allows for borrowing strength between state level parameters of |
Fixed parameters | |
Mean number of animals/year received from interstate into state s. | Given by NASS data. |
Distance between counties | Given by NASS data. |
Number of farms in county | Given by NASS data. |
Inflow attraction of county | |
In a Bayesian framework, we usually know something about the system, and we incorporate this knowledge to construct a vague prior. Because we implement a hierarchical Bayesian model for the kernel parameters, we do not need to specify priors for parameters of the different states separately. However, we need to specify the hyperpriors.
We define the hierarchical prior for kurtosis
The conjugate prior for the variance of the normal distribution is the scaled inverse chi square distribution. When specifying the hyperprior of
We express the hierarchical prior
The prior for
We analyzed beef and dairy movements separately using the above framework. We separated the two due to the potentially different movement drivers underlying the two production types. Technically, the Bayesian analyses were performed with MCMC, using Metropolis-Hastings updates for
For each production type (beef and dairy), we ran ten replicates of the MCMC simulation, each with 250,000 iterations. For each simulation, the first 50,000 iterations were discarded as burn-in, and the chains were analyzed to ensure that they converged to the same area of high posterior density. Our posterior was given by combining the results of the ten chains. Inference based on MCMC involves repeatedly drawing random numbers from the posterior distribution. These are then used to parameterize the model when generating networks. For further details on MCMC, see
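The burn-in and pooling steps of this paragraph can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `pool_chains` and `chain_medians`, the list-of-lists layout for chains, and the toy Gaussian draws standing in for real Metropolis-Hastings output are all assumptions. Comparing per-chain medians is only one crude way to check that chains agree; the text does not say which diagnostic was used.

```python
import random
from statistics import median

def pool_chains(chains, burn_in):
    """Drop the first `burn_in` draws of every chain and concatenate the rest."""
    pooled = []
    for chain in chains:
        if len(chain) <= burn_in:
            raise ValueError("chain is shorter than the burn-in period")
        pooled.extend(chain[burn_in:])
    return pooled

def chain_medians(chains, burn_in):
    """Post-burn-in median of each chain (a rough agreement check, assumed here)."""
    return [median(chain[burn_in:]) for chain in chains]

# Toy stand-in for sampler output: ten short chains of 2,500 draws each.
# The study itself ran 250,000 iterations per chain and discarded 50,000.
rng = random.Random(1)
toy_chains = [[rng.gauss(0.0, 1.0) for _ in range(2500)] for _ in range(10)]

posterior_draws = pool_chains(toy_chains, burn_in=500)
per_chain = chain_medians(toy_chains, burn_in=500)
```

Draws taken from `posterior_draws` would then play the role of the parameter sets used to generate networks.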
There are several ways to validate models in a Bayesian framework. Here, we employ a commonly used method where the observed data and posterior predictive distribution are compared by appropriate summary statistics
Because the validation necessarily compares samples of interstate county-county links (observed and generated) we cannot make comparisons about the presence or weight of individual county links. However, we can make direct comparison between links aggregated to the state-to-state level to evaluate the precision of our model at a large geographic scale. In addition, the summary of cattle movements at the state scale has been previously reported
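Aggregating county-level links to the state-to-state level can be sketched as below. The sketch assumes that counties are identified by five-digit FIPS codes stored as strings, whose first two digits identify the state; the function names and the example movements are hypothetical and not taken from the ICVI database.

```python
from collections import Counter

def state_of(county_fips):
    """Return the two-digit state prefix of a five-digit county FIPS code."""
    return county_fips[:2]

def aggregate_to_states(movements):
    """Count movements per (origin state, destination state) pair."""
    counts = Counter()
    for origin, destination in movements:
        counts[(state_of(origin), state_of(destination))] += 1
    return counts

# Hypothetical county-to-county movements as (origin FIPS, destination FIPS).
movements = [
    ("19153", "31055"),
    ("19169", "31109"),
    ("48113", "40109"),
    ("19153", "31055"),
]
state_flows = aggregate_to_states(movements)
```

The same function can be applied to observed, kernel generated and randomized movement lists so that the three can be compared on identical state pairs.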
In order to compare observed and kernel model generated data to an appropriate null, we also generated randomized networks for comparison. For each state we generated the same number of outgoing movements as the number of observed movements (as given by the ICVI data) for that state. For each movement, the origin county was picked randomly within the state and the destination was picked randomly from all other counties.
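The randomization procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: "all other counties" is read here as every county except the origin (if only counties in other states were intended, the destination pool would need to be filtered accordingly), and the function name, argument layout and toy data are hypothetical.

```python
import random

def randomized_network(counties_by_state, observed_out_counts, seed=None):
    """Generate random movements with the observed number of outgoing
    movements per state.

    counties_by_state: dict mapping state -> list of county identifiers
    observed_out_counts: dict mapping state -> number of observed movements
    """
    rng = random.Random(seed)
    all_counties = [c for counties in counties_by_state.values() for c in counties]
    if len(all_counties) < 2:
        raise ValueError("need at least two counties to draw a destination")
    movements = []
    for state, n_moves in observed_out_counts.items():
        for _ in range(n_moves):
            # Origin: uniform over counties within the sending state.
            origin = rng.choice(counties_by_state[state])
            # Destination: uniform over all counties except the origin (assumption).
            destination = rng.choice(all_counties)
            while destination == origin:
                destination = rng.choice(all_counties)
            movements.append((origin, destination))
    return movements

# Toy example with two states and made-up county labels.
counties_by_state = {"A": ["A1", "A2"], "B": ["B1", "B2", "B3"]}
observed_out_counts = {"A": 5, "B": 3}
random_moves = randomized_network(counties_by_state, observed_out_counts, seed=42)
```

Because origins and destinations are drawn uniformly, such a network ignores distance, premises counts and inflow, which is what makes it a useful null for the kernel model.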
Our main interest does not lie in the parameter estimates themselves, but rather in how well the method performs in predicting the network structure. Hence, we focus on a general description of the estimates, and marginal posteriors of parameters are presented in the supplementary material. The estimated movement kernels were generally leptokurtic, with 93.9% of the estimated marginal densities of kurtosis higher than two (i.e. the kurtosis of a normal distribution) and 87.3% larger than 3.33 (i.e. the kurtosis of an exponential distribution). The results, however, revealed very diverse kurtosis estimates. For dairy movements, the lowest median kurtosis was estimated for Massachusetts at 1.42 [1.33, 34.1] (numbers in brackets indicate the 95% central credibility interval of estimated kurtosis) and the highest for Texas at 1.28×10⁵ [4.52×10³, 1.65×10⁶]. The corresponding values for beef movements were found for Mississippi with 1.41 [1.39, 1.46] and Iowa with 7.03×10⁶ [2.14×10⁶, 6.77×10⁸] (
The lowest kernel variance for dairy movements was estimated for Massachusetts with median 5.81×10⁴ [4.06×10⁴, 4.69×10⁵] km² and the highest for Texas with 1.64×10⁹ [1.07×10⁸, 3.90×10¹⁰] km². The corresponding values for beef movements were found for Connecticut with 1.13×10⁴ [2.54×10³, 4.50×10⁵] km² and Kansas with 2.40×10¹⁰ [1.54×10⁹, 1.88×10¹¹] km² (
While the main focus of this study is not to compare the dairy and beef industries, modeling the production types separately illustrated heterogeneity in shipment characteristics between beef and dairy production. Using 95% probability as the level at which we consider there to be strong support for differences, five states (Connecticut, Michigan, Minnesota, New Mexico and New York) showed strong support that more dairy than beef movements originated in that state, while 32 states showed strong support that more beef than dairy movements originated in that state (
The results for the total number of movements per state,
To validate the Bayesian kernel model prediction against the data using network properties we generated a comparable 10% sample of interstate movements from full kernel generated networks (section 2.3). Overall, generated networks from the Bayesian kernel model have network statistics that are similar to the observed data and different from randomized networks (
The observed movements are from a systematic 10% sample of ICVIs from each state and the generated movements are 10% of interstate movements sampled from a single realization out of 1000 kernel generated networks. Darker shading represents the number of cattle premises per county.
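Thinning a full generated network down to a sample comparable with the ICVI data can be sketched as below. The sketch assumes a simple random sample of 10% of the interstate movements in one generated network; the text does not state whether the generated sample was drawn per state or overall, and the function name, the `state_of` helper and the toy movements are hypothetical.

```python
import random

def sample_interstate(movements, state_of, fraction=0.1, seed=None):
    """Keep only interstate movements, then draw a random fraction of them."""
    rng = random.Random(seed)
    interstate = [(o, d) for o, d in movements if state_of(o) != state_of(d)]
    n_keep = round(fraction * len(interstate))
    return rng.sample(interstate, n_keep)

# Toy network: county labels whose first character is the state (assumption).
first_char = lambda county: county[0]
full_network = (
    [("A1", "B1")] * 20      # interstate A -> B
    + [("B2", "A2")] * 10    # interstate B -> A
    + [("A1", "A2")] * 10    # intrastate, never observable through ICVIs
)
generated_sample = sample_interstate(full_network, first_char, fraction=0.1, seed=7)
```

Only the thinned, interstate-only list is compared with the observed data; the intrastate movements remain part of the full predicted network.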
The betweenness score is a count of the number of shortest paths between any two nodes in a network (
Statistic | Observed value | Kernel mean | Kernel standard deviation | Randomized mean | Randomized standard deviation |
Number of active nodes (counties) | 2407 | 2718.44 | 12.8 | 3108 | 0.655 |
Diameter | 12 | 16.56 | 1.79 | 11.22 | 0.704 |
Reciprocity | 0.029 | 0.028 | 0.001 | 0.001 | 0.0002 |
Transitivity | 0.049 | 0.035 | 0.001 | 0.005 | 0.0002 |
Mean In/Out Degree | 7.72 | 6.84 | 0.06 | 6.19 | 0.001 |
Max In Degree | 396 | 184.97 | 9.73 | 16.79 | 1.20 |
Max Out Degree | 242 | 75.67 | 6.46 | 40.88 | 2.66 |
Mean Betweenness | 5539 | 6185 | 226 | 10914 | 75.9 |
Max Betweenness | 673608 | 320257 | 83928 | 98022 | 12406 |
Assortativity | 0.204 | 0.190 | 0.016 | −0.294 | 0.016 |
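A few of the statistics in the table can be computed from a list of directed movements as sketched below. The definitions used are common ones and are assumptions here, since the text does not state how each statistic was computed or which software was used: repeated movements between the same pair of counties are collapsed into one directed edge, degree counts distinct neighbours, and reciprocity is the fraction of directed edges whose reverse edge is also present. The function name and toy edges are hypothetical.

```python
def network_summary(movements):
    """Summarize a non-empty list of (origin, destination) movements
    that contains no self-loops."""
    edges = set(movements)  # collapse repeated movements into unique directed edges
    nodes = {node for edge in edges for node in edge}
    in_degree, out_degree = {}, {}
    for origin, destination in edges:
        out_degree[origin] = out_degree.get(origin, 0) + 1
        in_degree[destination] = in_degree.get(destination, 0) + 1
    reciprocated = sum(1 for o, d in edges if (d, o) in edges)
    return {
        "active_nodes": len(nodes),
        "max_in_degree": max(in_degree.values()),
        "max_out_degree": max(out_degree.values()),
        "reciprocity": reciprocated / len(edges),
    }

# Toy movements between four made-up counties; ("a", "b") occurs twice.
toy_movements = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "c"), ("a", "b")]
summary = network_summary(toy_movements)
```

Applying the same function to the observed sample and to each generated or randomized sample gives directly comparable values.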
The kernel generated networks generally performed better than their randomized counterparts. The in-degree and betweenness distributions (
The kernel generated movements continued to match the ICVI data much better than their randomized counterparts when comparing movements aggregated to the state level. The kernel generated state-to-state level movements had a high correlation with observed data (
The heavy line in the boxplots represents the median value, the box area represents the 25th and 75th percentile of the data and the whiskers represent the maximum and minimum values.
Modeling processes that are influenced by livestock movement, such as disease spread, requires confident estimates of how animal shipment patterns connect the players in the system. Undersampling and incompletely observed data are common problems facing data-driven efforts, even in the most well-characterized systems, such as the United Kingdom
Our sample of 10% of cattle shipments that crossed state lines represents the best characterization of cattle movement across the diverse industry and geographic extent of the U.S. cattle industry to date. In order to scale up to the complete network, we developed a Bayesian kernel model based on some simple assumptions about the underlying process and fitted the model to this incomplete data. The model was structured so that the kernel parameters (width,
The kernel model generated a network of movements that was comparable to the observed data. Notably, the kernel model was fit to characteristics of individual cattle movements and county characteristics and predicted both node-centric and global network properties. Within the Bayesian framework, this also allowed us to evaluate the accuracy and quantify the error in the kernel model’s performance. Node level network centrality distributions were comparable over most of the range of the centrality values (in-degree & out-degree;
We postulate that we could not capture this process in our model because it is structured by unobserved characteristics that occur at a scale smaller than our nodal unit (county). For example, the kernel model does not include any information about the types of premises in a county and the presence of certain types of cattle premises, such as livestock auctions or feedlots, may predispose a county to attract more incoming edges or generate more outgoing edges than expected based on a count of premises alone. A kernel generated shipment will have a probability of terminating in a county,
Comparing global properties of the kernel generated networks of interstate movement also produced a similarly close match to the observed network and out-performed randomized networks in most cases. The kernel generated networks had low reciprocity that closely matched the observed value (
The deviation between the kernel generated and observed networks found at low degrees (i.e. counties that send and do not receive or
At coarse spatial scales, geographic patterns generated by the kernel model were more similar to the ICVI sample than those generated by randomization (
The aim of the kernel model is to describe a complex process by a set of parameters that captures essential aspects of the observed contact structure. By doing this within a Bayesian framework, we acknowledge the importance of uncertainty in these parameters and include this when predicting from the model. Future contact patterns may then be predicted based on the assumptions of similar underlying processes. However, as with any data-driven modeling, there are several limitations imposed by the data. Foremost, the data represents a one-year snapshot of a large and fluid industry. We are confident in our ability to explain patterns from 2009, but if there are large scale differences in the contact pattern between years, we might do less well in predicting cattle movement in other (future) years. However, we are encouraged because a comparison of the observed 2009 ICVI data to a coarse grain analysis of interstate cattle movement from 2001 showed that the 2009 ICVI network captured similar patterns of coarse nation-wide animal flow
An additional assumption is that cattle movements are not influenced by state boundaries, such that the total number of movements (hence, including intrastate movements) may be estimated jointly with the width and shape of the kernel parameterized by interstate movements. This is a difficult assumption to evaluate because a comprehensive measure of cattle movements within states is challenging to obtain. We therefore have to consider that this assumption cannot currently be verified. To address this issue in modeling the spread of infectious disease, any disease-spread model should include sensitivity analysis to address the uncertainty in predicted intrastate movements.
While the estimated network statistics are generally similar to the observed, we have highlighted some potentially important deviations and assumptions that can be used to guide future developments of the kernel approach. The most apparent differences relate to the very high aggregation in network centrality, represented by a few very highly connected nodes that the kernel model fails to reproduce. This is likely to be a result of more complex production structures, where premises of some types have particularly high probability of contact. This may be an important feature for more realistic modeling
The ultimate goal in developing a model that can address under-sampled and missing data is to use the model predictions of cattle movement as a basis for disease-spread models. Our technique extends previous approaches to address sampling of network data by taking a unique focus on a characteristic of sampled edges, without having to sample how node characteristics are involved in the network. Previous approaches to evaluate the effect of sampling network data have relied on knowledge of the characteristics of nodes to fill in missing edges
Previous techniques have been concerned with under sampling and are therefore conservative with regard to the network structure
Data were provided by the U.S. Department of Agriculture, Animal and Plant Health Inspection Service, Veterinary Services. We also acknowledge the National Institute for Mathematical and Biological Synthesis for supporting and hosting the Modeling Bovine Tuberculosis working group, where the initial ideas for collecting and using ICVI data were developed. We also thank the state veterinarians and staff, whose cooperation and effort made the data collection possible.