Machine-learning a virus assembly fitness landscape

Realistic evolutionary fitness landscapes are notoriously difficult to construct. A recent cutting-edge model of virus assembly consists of a dodecahedral capsid with $12$ corresponding packaging signals in three affinity bands. This whole genome/phenotype space consisting of $3^{12}$ genomes has been explored via computationally expensive stochastic assembly models, giving a fitness landscape in terms of the assembly efficiency. Using latest machine-learning techniques by establishing a neural network, we show that the intensive computation can be short-circuited in a matter of minutes to astounding accuracy.


Introduction
Two facts about simple viruses have been known for a long time. Firstly, that genetic economy leads to the use of symmetry, such that virus capsids are mostly icosahedral or helical. Secondly, packaging signals, that is secondary structure features in the viral RNA, are often required for encapsidation in viruses with single-stranded genomes. Examples are the Origin of Assembly (OAS) sequence in Tobacco Mosaic Virus (TMV), the psi element in HIV (Human Immunodeficiency Virus) and the TR sequence (Translational Repressor) in MS2 (Male Specific 2 bacteriophage). This is an evolutionary advantage, as it ensures vRNA-specific encapsidation and can increase assembly efficiency through a cooperative role of the RNA, which acts as a nucleation site.
More recently, it has been shown that taken together, these two facts suggest that there could be more than one packaging signal, with multiple signals in fact dispersed throughout the genome [1,2]. This is because the capsid is symmetric, and the packaging signal mechanism functions via interaction between viral RNA and the coat protein (CP). In several cases, this RNA-CP interaction leads to a conformational change in the CP, which only then makes it assembly competent (e.g. TMV and MS2 [3]. The picture that emerges is then that there are multiple packaging signals (PS) that recruit CP onto a growing capsid. This reduces the phase space that CP has to search in order to assemble a capsid, resulting in vastly increased assembly efficiency. The details of such a mechanism were found in MS2 and STNV (Satellite Tobacco  a1111111111  a1111111111  a1111111111  a1111111111  a1111111111 Necrosis Virus) as model systems. Once the details of this mechanism were understood using biochemistry, structural biology [4], bioinformatics [5], biophysics and graph theory [6] in these model systems, related mechanisms could be found in clinically relevant viruses such as Hepatitis C virus (HCV), Hepatitis B virus (HBV) and Human Parechovirus. These packaging signals are secondary structure features of the viral genomes where a stemloop in the singlestranded RNA presents a common recognition motif that can bind to CP (see Fig 1A). The viral genome thus has multiple layers of constraints, by having to code for genes as well as the PS instruction manual. This set of packaging signals can also be repurposed and optimised for the assembly of virus-like particles, which do not share the same genetic constraints as the virus, and could be used e.g. for vaccines, drug delivery or as an anti-viral strategy [7].
An equilibrium model of how to assemble a simplified example of an icosahedral virus, a dodecahedron built from 12 pentagonal faces, was considered in [8] using an ODE (ordinary differential equations) model. More recently, the multiple dispersed packaging signal paradigm has sparked renewed interested in such a dodecahedral model [9]. The assembly reaction kinetics was modelled via a set of discrete reactions in a stochastic simulation paradigm based on the Gillespie algorithm [10].
In this model 12 PSs can bind CP, as well as dissociate again, reflecting reversible/equilibrium kinetics. Bound CP can then bind other bound CP, gradually building up a capsid (see Fig 1B and 1C). The PSs here have three different bands of binding affinity: weak, medium and strong. These choices correspond to binding energies of 4/8/12 kcal/mol respectively, Fig 1. A The nucleotide sequence of a virus determines the gene products; however, in addition to this information content the RNA also explores a configuration space of secondary structures. Viruses appear to have evolved to use such motifs to help recruit coat protein with a conserved common recognition motif during assembly. The stability and binding affinity of these packaging signals gives a distinctive profile for viral assembly, which is the phenotype relevant to assembly. Assembly efficiency is the fitness of this phenotype, or at least the contribution to the overall fitness that is determined by aspects of assembly. B The genomes in the model consist of twelve packaging signals (PS) that can take weak, medium and strong binding affinities. They successively recruit twelve pentagonal coat proteins, which together form the dodecahedral virion in this model. C The stochastic simulation algorithm models several possible reactions. Firstly, packaging signals can bind coat proteins (and fall apart again), and secondly, two coat proteins that have been recruited by packaging signals can bind to each other. The fitness landscape was computed for 2000 virions for each possible genome, making the computations very intensive.
https://doi.org/10.1371/journal.pone.0250227.g001 based on the TR sequence in MS2 which has approximately 12kcal/mol. The binding energy between CPs is much lower, at approximately 2 kcal/mol. This modulation of affinity affects the assembly kinetics, e.g. by providing a nucleation point that starts assembly, or allowing for error-correction via weaker binding elsewhere. The thermodynamics of PS binding and of the number of CP bonds formed then translates into assembly efficiency. This in turn is taken as a proxy for fitness (all other things being equal)-or at least the contribution to the fitness that results from assembly considerations [11][12][13].
In [14] the whole space of these 3 12 genomes (or rather, phenotype profiles) has been explored. The assembly efficiency there is given by the number of capsids that have correctly assembled out of a possible total of 2, 000. This efficiency provides a fitness landscape on the 12-dimensional genome space. This is an interesting model that is tractable, in contrast with many other biological systems, as it has a small number of degrees of freedom and is dominated by the symmetry of the capsid. This tractability also allows for the consideration of viral evolution. For instance, mutation of the PS strengths leads to the emergence of a set of related genomes that form a 'quasispecies' [15,16]. One can thus investigate the effect of evolutionary pressures, e.g. those exerted by standard drugs or a novel type of drug that targets packaging signals [9].
This model thus captures many interesting aspects of viral genetics, geometry and assembly. A more realistic model would have more CP building blocks and PSs, e.g. around 60 for MS2 (i.e. one full orbit of the icosahedral group). But the computation time for even these simple genomes and the assembly kinetics that provide the fitness landscape are already considerable. Even other simplified models, e.g. reduced orbits on symmetry axes given by e.g. an icosahedron consisting of 20 triangles with 20 PSs, a rhombic triacontahedron consisting of 30 rhombuses with 30 PSs, or a finer gradation of binding affinity bands are already computationally out of reach. For experimental approaches to measure local fitness values please see for instance [17], though note that in this reference this is limited to only 48 data points, whereas the fitness space of the simple above model is already 10; 000 times that.

Results/Discussion
This data set is a perfect example of data that is amenable to a machine-learning approach, since it associates a vector input with a number output. We therefore train a neural network to predict the fitness landscape. The network is trained on a subset of the whole genome space, and validated on the remainder of the data. This proof-of-principle shows that it is very fast for a neural network to learn the inherent patterns within the large degeneracy of the detailed stochastic modelling to predict assembly efficiency fitness for unseen genomes to extremely high accuracy (c.f. the paper [18] which has been published after submission of this manuscript that applies machine learning to the related problem of finding high-level behaviour within the large degeneracies of protein folding). The danger is that some subtleties of the stochastic modelling are lost, but allowing for computation times many orders of magnitude faster. More likely, however, since the data is obtained from a Monte Carlo simulation, and many works in the literature immediately go to an ODE approximation and miss these details anyway, the ML is actually indifferent to this. In fact, the discontinuities from the stochastic method, which the ODE smoothing would not capture, is perfectly adapted to the neural network and machine learning classifiers, which are much better adapted to partitioning fitness space in more subtle ways. Many different neural network architectures were tested and all led to very similar results. This supports the point that there is something reasonably simple underlying the data set that can be learned by any reasonable neural network. This approach could thus in future be used to tackle more realistic models such as the ones mentioned above. Stochastic simulations could be used to partially explore these larger genome spaces, calculating assembly fitness in order to provide a training set for a neural network. The rest of the fitness landscape can then be predicted by the artificial intelligence; it is also possible to only compute this fitness if necessary, e.g. when a new genome arises through mutation in a quasispecies model, such that such computation may only be necessary 'precedurally'. By that we mean that they are only calculated locally when required during the computation, e.g. when evolutionary dynamics starts exploring a certain range in fitness space, and not calculating the entire fitness space before beginning the simulation.
The assembly process discussed here is ultimately a problem of geometry and thermodynamics. So it would with some modification also apply to the assembly of other icosahedral particles such as virus-like particles (VLPs) for drug delivery or as vaccines, as well as to carbon fullerene assembly, which are very attractive fields for biomedical and nanoscience applications. For instance, virus-like particles could present viral epitopes in order to act as vaccines. In order to find the most efficient assembly pathway for such VLPs, an analogue of the above fitness landscape could be constructed in order to solve the resulting optimisation problem and to give industry suggestions which parts of the fitness landscape to explore deeper experimentally.

Methods
From a purely mathematical point of view, we have the following problem. Let (weak, medium, strong) be denoted respectively by (1, 2, 3). The input is a vector v in a 12-dimensional vector space over 3 , the field of three elements. The output is an integer (which we treat as a real number) between 0 and 2000, which we can normalise into � 2 [0, 1] � by dividing by 2000. The algorithm used by [14] is thus a map f : v 2 12 3 À !� 2 ½0; 1� :

Computational aspects of the simulation
The map f is a computationally intensive one with individual genome run times between 20 minutes and 12 hours, and cumulative run time of 3-4 weeks on the N8 Polaris high performance computing research cluster, Intel 2.6 GHz Sandy Bridge E5-2670 processors, with a total of 5, 312 cores, with a mix of 4 and 16Gb of RAM (https://n8hpc.org.uk/facilities/) [9]. The fluctuations in numbers of assembled capsids tend to be in the tenth of a percent range (i.e. ±20 virions), however when initially running the code with 75 repeats of each run for certain genomes the standard error was very small (±0.001%) [9]. Due the amount of time and cluster resources it already took, each point on the landscape is the result of only 2 simulation runs. While not ideal, cluster computation was limited; the simulation allows to capture the more generic features whilst the above cross-validation suggests that the error is small [9]. Nevertheless a brute-force simulation has been performed on the 3 12 = 531, 441 possible input values and the efficiency value extracted. This gives us a database of some half a million known cases of the form (2).

Machine-learning the dataset
Such a problem is perfectly adapted to supervised machine-learning: we know many input values and wish to train some artificial intelligence to associate the input with the known output on some small subset, and use it to predict the output for unseen input [19]. The advantage of this approach is that often approximate results can be attained at reduction in computation time by many orders of magnitude. The paradigm of using machine-learning in algebraic geometry and more general classes of problems in pure mathematics was proposed in [20][21][22] to satisfying accuracy, and it is a similar philosophy that we will adopt here. Let us first try the following specific procedure: • Take the full data D of the form (2), of size 3 12 ; • Establish the neural network, a 3-layer perceptron In the above, L means a linear-layer, S, a sigmoid layer and S, a summation layer. In particular, the first linear layer L 20 is a fully connected layer taking the 12-vector v to 20 neurons by simply the linear function y = wx + b. This is then fed into an element-wise sigmoid layer s x ð Þ ¼ 1 þ e À x ð Þ À 1 of 20 neurons, followed again by a linear layer, which is then summed to the real number � as the output. The schematic of is shown in Fig 2. We have taken this neural network only to illustrate the power of our methodology and have not optimised the hyper-parameters such as 20, nor the network architecture or the choice of the type of neurons. • Now split D into a training set T of 30,000 random samples; the validation set will be the complement V = D\T. Note that the training data is only about 5.6% of the total data.
• We train with T and validate on V.
• As a further check, we create a "fake" validation setṼ which has the same inputs as that of V but with output randomly assigned from the set of correct outputs.
On an ordinary laptop (Intel Core m5-6Y57 CPU, 1.10GHz, 2 Cores, 4 Logical Processors, with 8Gb of RAM), the training took about 45 seconds, and the prediction about 10 seconds. The algorithm is implemented on Mathematica [23] and is expected to run even faster on the Python package Tensorflow [24]. A python Jupyter notebook is attached along with the Mathematica notebook as S1 and S2 Files. In other words, the entire computation took under 1 minute with an ordinary notebook as opposed to the many hours it took on a supercomputer. As mentioned above, a number of similar neural networks were all able to perform this supervised learning task in a comparably short time, meaning there was little motivation to optimise neural network architecture or hyperparameters. However, this reinforces the point that there is intrinsic structure in the costly simulation dataset that any reasonable neural networks finds easy to learn. We also tried various combinations of other standard machine-learning algorithms such as decision trees and support vector machines, and empirically find that our particular neural network approach above seems to out-perform all of them.
We present the result in Fig 3. In part A, we present a plot of the predicted � on the horizontal versus the actual � on the vertical. There are 501, 441 points. One can see that they cluster near the desired y = x line, which would mean perfect prediction (note the axis ranges). To give some precise measures, the best fit line is y = −0.0262122 + 1.02519x with F-statistic 2.45082 × 10 6 and p-value less than 10 À 10 6 . The R-squared value is 0.830151. The errors themselves (fit-residuals) give a mean and standard deviation of (6.88 ± 0.02) × 10 −16 , showing that the residuals are unbiased around 0. To double check, we plot the same result for the fake validation setṼ in part B. It is obvious that the distribution is much less structured and essentially randomly occupies a square. The fit here is y = 0.815233 − 0.000947474x i.e. practically a constant, with a poor F-statistic of 0.450401 and a poor p-value of 0.502145. The R-squared value is 8.98217 × 10 -7 . This is very re-assuring for less than 6% of seen data and total computation time of less than 1 minute on an ordinary laptop.
To give an idea of the prediction, for the (1, 1, . . ., 1) vector, the net predicts � = 0.87069, or 1741. The original value is 200, but that is a singular outlier in the whole data set, which we would not expect the neural network to be able to reproduce. For the (2, 2, . . ., 2) vector, the net predicts � = 0.834721, or 1669; the correct value is 1745. For the (3, 3, . . ., 3) vector, the net predicts � = 0.673568, or 1347; the correct value is 1309.
We will use R-squared, a real number between 0 and 1, as a measure of accuracy of the machine-learning; the closer it is to 1, the better the fit (for a good reference on machine-learning and goodness of fit measure, cf. e.g. [25]). Our 30, 000 training set was only to illustrate the technique in detail. In general, we need to perform cross-validation by splitting the dataset. We split the data into a fraction x of random samples for training and validate on the complement 1 − x, done for training set from 30, 000 to 500, 000, in steps of 30, 000. The R-squared value is computed for each case as a measure of precision. Moreover, for each x, we repeat the random sampling 10 times, for which we get the error bars. The plot of the R-squared (with error bars) against the increase of percentage x of the size of the training set con-stitutes a learning curve which illustrates how the neural network responds to the data; this is shown in

PLOS ONE
As a comparison, one might imagine that since there is an underlying pattern being machine-learnt, a simple regression might suffice. That is, could one fit a hyperplane f ðx i Þ ¼ a 0 þ P 12 i¼1 a i v i to the data? We perform this over the entire dataset, and find that the best multilinear regression obtains only R 2 = 0.575428. Introducing non-linearity and more parameters, such as fitting f ðx i Þ ¼ a 0 þ P 12 i¼1 a i v i þ P 12 i¼1 b i v 2 i does not do much better, at R 2 = 0.665484. The inherent complexity (non-linearity) of the problem is therefore best captured by our neural network approach.