Using machine learning as a surrogate model for agent-based simulations

doi:10.1371/journal.pone.0263150

Table 1.

Summary of methods implemented in our study.

Table 2.

The ten parameters used in the Linked Lives ABM surrogate model generation process, with descriptions, default values and lower and upper bounds used when generating simulation output data.

Fig 1.

Performance of the nine machine-learning methods trained on simulation outputs from 200, 400, 800 and 1600 runs.

The spider plots compare speed and accuracy across all nine methods for the 200, 400, 800 and 1600 run scenarios in plots (a), (b), (c) and (d) respectively. For each method, the total computational runtime on an 8-core i7 CPU and the mean-squared error (MSE) on the test set are shown (both in log scale, reversed, and mapped to the [0, 1] interval to represent relative speed and accuracy, respectively). Neural networks were the strongest overall performers; gradient-boosted trees also performed well, and non-linear SVM performed increasingly well as the number of runs grew. The high accuracy of the neural network models comes at a significant cost in speed. Gradient-boosted trees and non-linear SVM were consistently fast but less accurate overall.
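The scaling described in this caption (log scale, reversed, mapped to [0, 1]) can be sketched as follows. This is a minimal illustration assuming a min-max normalisation of the base-10 logarithm; the caption does not state the exact formula, and the function name `relative_score` and the sample runtimes are hypothetical, not values from the study.

```python
import numpy as np

def relative_score(values):
    """Map positive values (runtime or MSE) to [0, 1]; the smallest value scores 1."""
    logs = np.log10(np.asarray(values, dtype=float))
    span = logs.max() - logs.min()
    if span == 0.0:
        # All methods tie, so give every method the top score.
        return np.ones_like(logs)
    # Min-max scale the logs to [0, 1], then reverse so lower is better.
    return 1.0 - (logs - logs.min()) / span

# Hypothetical runtimes in seconds for three methods (illustrative only).
runtimes = [1.0, 10.0, 100.0]
speed = relative_score(runtimes)
print(speed)
```

Applying the same function to the test-set MSE values would give the relative accuracy axis under the same assumption.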

Fig 2.

Sample results on the 800-run simulation scenario.

Diagrams of the neural network architecture in full detail in (a) and in simplified schematic form in (b). In the 800-run scenario, the network with 10 hidden layers pictured here performed the best in a brief comparison between networks with varying numbers of hidden layers. (c) Loss of a 15000-round training run of the simple neural network. (d) Comparison plot produced after training the neural network on the simulation data.

Fig 3.

Output of the GP emulator run, performed using the 400-run simulation data set.

(a) Graphs of the main effects of each of the 10 input parameters on the final output of interest, in this case social care cost per person per year. The graphs demonstrate that the emulator was unable to fit a model to the simulation results, as each successive emulator run produced very different results and estimates of the main effects. (b) Numerical outputs of the emulator. The emulator estimates total output variance at 5.41 billion, a clear indication that the emulator is not able to fit the simulation data.

Fig 4.

PCA variable contribution maps and scree plots for the 400- and 1600-sample datasets.

The scree plots show the percent variance contributed by each component; the point where the gradient changes sharply defines the number of significant components, i.e. the components to be retained in the analysis. The gradient change at component 6 of the 400-sample dataset contrasts with the steep gradient change at component 1 of the 1600-sample dataset. The variable contribution map for the 400-sample dataset shows the variables beginning to cluster; however, very little separates the components' contributions to variance, with less than 2% difference between the first and last components (as can be seen in the 400-sample scree plot). The variable contribution map for the 1600-sample dataset shows the variables converging into a single component (component 1) that contributes 90.3% of the variance. Here PCA is unable to discriminate usefully between the variables, although it identifies eight parameters (on the first component) as significantly explaining the variance in the ABM social care cost per capita.
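The percent variance values that a scree plot displays can be computed as in the sketch below. It assumes PCA on standardised columns via singular value decomposition; the study's actual preprocessing and software are not stated in this caption, and the function name `explained_variance_percent` and the random data are hypothetical stand-ins for the simulation datasets.

```python
import numpy as np

def explained_variance_percent(X):
    """Percent of total variance explained by each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardise each column so that PCA works on the correlation structure.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values are proportional to the component variances,
    # and NumPy returns singular values in descending order.
    s = np.linalg.svd(Z, compute_uv=False)
    var = s ** 2
    return 100.0 * var / var.sum()

# Synthetic stand-in: 400 samples of 10 independent parameters (not study data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 10))
pct = explained_variance_percent(X)
print(np.round(pct, 1))
```

Plotting `pct` against the component index gives the scree plot; a sharp drop after the first value would correspond to the single dominant component described for the 1600-sample dataset.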

Fig 5.

Predicted value (x axis) versus actual value (y axis) for the 200-run scenario, across all the methods implemented in our comparative study.

The dotted line represents the y = x line.
