Deep reinforcement learning for optimal experimental design in biology

The field of optimal experimental design uses mathematical techniques to determine, for a given experimental setup, the experiments that are maximally informative. Here we apply a technique from artificial intelligence—reinforcement learning—to the optimal experimental design task of maximizing confidence in estimates of model parameter values. We show that a reinforcement learning approach performs favourably in comparison with a one-step-ahead optimisation algorithm and a model predictive controller for the inference of bacterial growth parameters in a simulated chemostat. Further, we demonstrate the ability of reinforcement learning to train over a distribution of parameters, indicating that this approach is robust to parametric uncertainty.

To test the potential of using reinforcement learning for OED, we first applied the Fitted Q-learning (FQ) algorithm to a simple non-linear system with Monod dynamics to investigate whether the agent is capable of optimising the D-optimality score with the given information. We compare the performance of the FQ agent with a one-step-ahead optimiser (OSAO). In this system we have one state variable, x, and no measurement noise, so that the output Y = x. There is one control variable, u, which is set by the OSAO or the RL controller. The dynamics of the system are given by a simple Monod relationship between dx/dt and u,

dx/dt = p_1 u x / (p_2 + u),

where p_1 and p_2 are the parameters to be estimated. In the following, p_1 = p_2 = 1. Both the RL and OSAO implementations start from the initial condition x_0 = 1, u_0 = 0.5 and can choose u in the range 0 ≤ u ≤ 0.1. The FQ agent works in a discrete action space and therefore has ten equally spaced discrete actions to choose from (distributed uniformly between 0 and 0.1). Fig AA shows the experimental input profiles chosen by the FQ agent and OSAO, which are similar. These consist of inputs at the highest level available and a value about halfway between the maximum and minimum values. The FQ agent selects input values that straddle the optimum found by the OSAO, likely due to the discrete nature of its action space. Fig AB shows the system trajectories of both controllers; as expected, these are very similar. Fig AC shows the performance of the FQ agent as it was trained for 500 episodes, compared to the performance of the trajectory found by the OSAO. Their performance is similar; by the end of training, the FQ agent performs slightly better than the OSAO.
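The setup above can be sketched in code. The following is a minimal, illustrative implementation (not the paper's code) of the Monod system with the ten discrete actions and a D-optimality score computed from forward sensitivities; the function names and the Euler integration scheme are assumptions.

```python
# Sketch of the Monod system and the D-optimality reward the FQ agent optimises.
# All names (monod_rhs, simulate, d_optimality) are illustrative, not from the paper.
import numpy as np

P1, P2 = 1.0, 1.0                    # true parameter values used in the paper
ACTIONS = np.linspace(0.0, 0.1, 10)  # ten equally spaced discrete inputs in [0, 0.1]

def monod_rhs(x, u, p1=P1, p2=P2):
    # dx/dt = p1 * u / (p2 + u) * x  (Monod relationship between dx/dt and u)
    return p1 * u / (p2 + u) * x

def simulate(x0, inputs, dt=0.1, steps_per_input=10):
    # Forward-Euler integration, propagating the sensitivities dx/dp1 and dx/dp2
    # alongside the state (forward sensitivity equations).
    x, s1, s2 = x0, 0.0, 0.0
    sens = []
    for u in inputs:
        for _ in range(steps_per_input):
            mu = u / (P2 + u)
            ds1 = mu * x + P1 * mu * s1                     # d/dt (dx/dp1)
            ds2 = -P1 * u / (P2 + u) ** 2 * x + P1 * mu * s2  # d/dt (dx/dp2)
            x += dt * monod_rhs(x, u)
            s1 += dt * ds1
            s2 += dt * ds2
        sens.append([s1, s2])
    return x, np.array(sens)

def d_optimality(sens):
    # D-optimality score: log-determinant of the Fisher information S^T S
    fim = sens.T @ sens
    return np.log(np.linalg.det(fim))

x_final, S = simulate(1.0, ACTIONS)  # e.g. apply each discrete action once
score = d_optimality(S)
```

The FQ agent would use such a score (or its increment per interval) as the reward when choosing among the ten actions.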

RT3D performance for different parameter samples
As a comparison to the other methods, the error in independent parameter estimates was found (see Methods for details), where the best performing repeat from Fig 4A was used as the RT3D controller. Here the true system parameters were sampled from the uniform distribution before each repeat, to assess the ability of the experimental designs to fit the parameters of different parametrisations across the distribution. To remove the dependence of the OSAO and MPC on knowledge of the system parameters, they were initialised with the centre of the parameter distribution. Because the system takes a different parametrisation for each experiment, we discard the determinant of the covariance matrix of parameters and the optimality score, as both will be highly dependent on the samples taken, and focus on the normalised MSE of the resulting inferred parameters. Independent parameter samples were taken for each method. These results are shown in Table A, which reveals that, in comparison to the previous section, the performance of the rational design, OSAO and MPC has drastically declined. This is expected, because they are ignorant of the changing parametrisations and have no mechanism to adapt their experimental design depending on where in the parameter distribution the true system happens to be. In contrast, the RT3D controller has maintained its performance better, with a smaller increase in the total parameter error. This is further verification that the RT3D controller can adapt the experiment online and that this leads to more informative experiments. The normalised squared error for each parameter sample was plotted on the log scale for each experimental design (Fig D), ordered from low to high error for each design. This shows that the error of the RT3D controller is consistently below that of the other methods for parameters sampled from the distribution, and also highlights that the average error is dominated by a few pathological parameter samples.
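The evaluation loop described above can be sketched as follows. This is an illustrative outline only: the bounds, the number of repeats, and the stand-in for the fitted estimate are assumptions, and the real pipeline would replace the stand-in with the actual inference routine.

```python
# Sketch of the evaluation: sample true parameters from a uniform distribution
# before each repeat, infer them, and report the normalised squared error per
# sample, ordered from low to high as in Fig D.
import numpy as np

rng = np.random.default_rng(0)
lower = np.array([0.5, 0.5])   # illustrative parameter bounds
upper = np.array([1.5, 1.5])

def normalised_sq_error(theta_hat, theta_true):
    # squared error scaled by the true value, summed over parameters
    return np.sum(((theta_hat - theta_true) / theta_true) ** 2)

errors = []
for _ in range(30):            # illustrative number of independent repeats
    theta_true = rng.uniform(lower, upper)
    # stand-in for a fitted estimate; the real code would run parameter inference
    theta_hat = theta_true * (1 + 0.05 * rng.standard_normal(2))
    errors.append(normalised_sq_error(theta_hat, theta_true))

# order from low to high error and inspect on a log scale, as in Fig D
log_errors = np.log10(np.sort(errors))
```

Sorting before plotting makes it easy to see both the typical error and the few pathological samples that dominate the mean.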

RT3D can design optimal experiments on a model of gene transcription
We implemented the reinforcement learning OED algorithm with the goal of inferring values of the 'intrinsic' parameters of a genetic construct. We distinguish between intrinsic parameters, which govern the gene's induction behaviour, and the physiological parameters, which are growth-rate-dependent and reflect the state of the host cell. The model equations [1] are: where X_rna and X_prot are the concentrations of RNA and protein, α, K_r, K_t, K_rt, δ, β and K_M are intrinsic parameters, V, g, P_a, G, R_f and λ are growth-rate-dependent parameters, and η and ξ are fixed.
Nominal parameter values are provided in Table C. As the control input we choose the transcription factor copy number, u. The measurement outputs are the abundances of mRNA and protein, X_rna and X_prot. We omit K_r from the set of intrinsic parameters to be inferred because it has previously been found to be practically unidentifiable from these output channels [1].
Experiments were designed using rational (human) design, MPC, and the RT3D controller. Each experiment lasts 600 minutes and is divided into six 100-minute sample-and-hold intervals, each of which is assigned an input value u. The MPC and RT3D methods were allowed to choose each u continuously from within the range 0 < u < 1000. The RT3D controller outperformed the rational design and performed equivalently to the MPC in terms of the optimality score (Table C).
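The sample-and-hold input structure described here can be sketched directly. The particular input values below are illustrative placeholders, not designs from the paper.

```python
# Sketch of the sample-and-hold experiment: 600 minutes split into six
# 100-minute intervals, each held at a chosen input u in (0, 1000).
import numpy as np

N_INTERVALS, INTERVAL_MIN = 6, 100
U_MIN, U_MAX = 0.0, 1000.0

def input_profile(us, t):
    # piecewise-constant input: u(t) takes us[k] on [k*100, (k+1)*100)
    k = min(int(t // INTERVAL_MIN), N_INTERVALS - 1)
    return us[k]

# illustrative input choices, clipped to the admissible range
us = np.clip([50, 900, 300, 700, 120, 1000], U_MIN, U_MAX)
ts = np.arange(0, N_INTERVALS * INTERVAL_MIN, 1.0)
profile = np.array([input_profile(us, t) for t in ts])
```

An MPC or RL controller would pick each entry of `us` at the start of its interval, using the observations gathered so far.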
The covariance of 30 independent parameter estimates was calculated. The resulting logarithm of the determinant of the covariance matrix of the parameters is reported in Table D, from which we see that there is good agreement between the optimality scores and the covariance in the parameter estimates.
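The uncertainty metric used here is straightforward to compute. The sketch below uses synthetic estimates purely for illustration; the array shape (30 repeats, six parameters) matches the section, but the values are not from the paper.

```python
# Sketch of the metric in this section: the log-determinant of the covariance
# matrix of repeated, independent parameter estimates (smaller = more certain).
import numpy as np

rng = np.random.default_rng(1)
# 30 hypothetical estimates of 6 intrinsic parameters (rows = repeats)
estimates = rng.normal(loc=1.0, scale=0.1, size=(30, 6))

cov = np.cov(estimates, rowvar=False)  # 6 x 6 covariance across repeats
sign, logdet = np.linalg.slogdet(cov)  # slogdet avoids under/overflow in det
```

Using `slogdet` rather than `log(det(...))` is the numerically safer choice when the covariance determinant is very small, as it typically is for well-designed experiments.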
The experimental inputs and resulting trajectories for the three experimental designs are shown in Fig EA-C. The designs from the OSAO and the RL optimisers show similar input profiles, which is expected because they share the same objective. Fig ED shows the training performance of 3 independent RL controllers, of which the best performing one was used to design the experiment in Fig EC. We see that the resulting performance at the end of training is consistently high compared with the rational design and OSAO, and approximately the same as the MPC. These results confirm that our approach of reinforcement learning for OED can produce D-optimal experimental designs for reducing uncertainty in the inference of parameter values for a complex nonlinear biological model.

Table C. Nominal values, ranges and units of the intrinsic parameters.

Intrinsic parameter   Value          Minimum        Maximum        Unit
α                     20             1              30             min^-1
K_t                   5 × 10^5       2 × 10^3       1 × 10^6       AU
K_rt                  1.09 × 10^9    4.02 × 10^5    5.93 × 10^10   AU
δ                     2.57 × 10^-4   7.7 × 10^-5    7.7 × 10^-4    µm^-3 min^-1
β                     4.0            1              10

Finally, we tested the ability of the RT3D agent to learn over a parameter distribution. A uniform distribution defined by the ranges in Table C was sampled for each episode. The training performance of ten agents is shown in S Fig E, where we can see that the agents have learned to increase the average optimality over the training time and that the performance is consistent between the different repeats. The parameter inference ability of the best of these repeats was compared to an MPC controller which designed an experiment using parameters in the centre of the distribution. As expected the results in Table E