Neural networks for self-adjusting mutation rate estimation when the recombination rate is unknown

Estimating the mutation rate, or equivalently the effective population size, is a common task in population genetics. If recombination is low or high, optimal linear estimation methods are known and well understood. For intermediate recombination rates, the calculation of optimal estimators is more challenging. As an alternative to model-based estimation, neural networks and other machine learning tools could help to develop good estimators in these more involved scenarios. However, if no benchmark is available, it is difficult to assess how well suited these tools are for different applications in population genetics. Here we investigate feedforward neural networks for estimating the mutation rate from the site frequency spectrum and compare their performance with that of model-based estimators. For this we use the model-based estimators introduced by Fu, Futschik et al., and Watterson, which minimize the variance or mean squared error under no and free recombination. We find that neural networks reproduce these estimators if provided with the appropriate features and training sets. Remarkably, when the model-based estimators are used to adjust the weights of the training data, a single hidden layer suffices to obtain one estimator that performs almost as well as the model-based estimators for low and high recombination rates, while at the same time providing a superior estimation method for intermediate recombination rates. We apply the method to simulated data based on the human chromosome 2 recombination map, highlighting its robustness in a realistic setting where local recombination rates vary and/or are unknown.

Feature importance of the linear neural network via SHAP. Feature importance of the linear neural network for ρ = 0 (A), ρ = 35 (B), ρ = 1000 (C), and variable ρ (D). The SHAP values for ρ = 0 refer to the linear neural network trained for ρ = 0, and so on. The SHAP values were calculated for 500 SFS. Each dot represents the SHAP value of one feature for a particular prediction, i.e. one SFS entry of one of the 500 SFS. The color indicates whether the particular SFS entry had a high or low value, and the position on the x-axis shows its effect, i.e., the extent to which it increased or decreased the θ estimate.
Artificial neural networks tend to have complex architectures, which can impair the comprehension of the underlying learning process. However, it is often of interest to understand what the neural network has learned. Depending on the ML method, there are different ways to obtain such insight. In our case, we determined the feature importance, i.e. a score for each feature representing its effect on the model. One tool to do so is SHAP (SHapley Additive exPlanations), a game-theoretic approach to explain the output of machine learning models [1]. Here, the SHAP values describe the responsibility of an SFS entry for a change in the θ estimate: the higher the absolute SHAP value, the greater the impact.
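As an illustration of how such SHAP values can be obtained, the following is a minimal sketch assuming a TensorFlow/Keras model; the random stand-in data and untrained placeholder model are ours, not the paper's, and the exact shap API may vary between versions:

```python
import numpy as np
import shap  # SHapley Additive exPlanations [1]
import tensorflow as tf

n = 40  # sample size; the SFS has n - 1 = 39 entries
rng = np.random.default_rng(0)

# Stand-in data and model (placeholders): in the paper, X would hold
# msprime-simulated SFS and the model would be the trained linear network.
X = rng.poisson(5.0, size=(1000, n - 1)).astype("float32")
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n - 1,)),
    tf.keras.layers.Dense(1),
])

explainer = shap.DeepExplainer(model, X[:100])  # background distribution
shap_values = explainer.shap_values(X[:500])

# shap may return a list with one array per model output; take the first.
if isinstance(shap_values, list):
    shap_values = shap_values[0]

# Beeswarm plot analogous to Fig A: one dot per SFS entry per prediction,
# colored by the entry's value; the x-position is its effect on the estimate.
shap.summary_plot(shap_values, X[:500],
                  feature_names=[f"S_{i}" for i in range(1, n)])
```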
In Fig A the feature importance of the linear neural network is displayed. In general, the first entry of the SFS, S_1, has the highest impact on the estimation, as its SHAP values are the most pronounced. For ρ = 0 the importance of S_2, S_3, . . . decreases steadily and increases again for the last SFS entries. For ρ = 35 and ρ = 1000, the feature importance also decreases initially, but is quasi-constant for the middle and later S_i. For variable recombination rates (D), we observe a mix of the feature importance for ρ = 0 and for ρ = 35 and ρ = 1000. The pattern from A can still be seen, i.e. the impact of the last SFS entries increases again and high feature values reduce the θ estimate, but to a lesser extent. Subplots A-D in Fig A have in common that they assign a higher importance to the first SFS entries than to the others. This is not surprising, as it means that the estimator puts more weight on mutation events in the more recent past, just as Fu's estimator does.

Considering the weights of the linear neural network, which are by nature comparable to the coefficients of the model-based estimators, we can also extract information about the signs of the coefficients from Fig A. For ρ = 0, the weights of the linear neural network corresponding to S_26, . . . , S_39 are negative. This can also be observed in subplot A of Fig A, as higher values for these SFS entries lead to a decreased θ estimate. These negative weights are not surprising, as half of the coefficients of the optimal model-based estimator, Fu's estimator, are negative for large enough θ, see Fig K. For ρ = 1000 the linear neural network has no negative weights, which can also be observed in Fig A. This is reasonable as well, since Watterson's estimator has constant positive coefficients and is the optimal estimator for high recombination rates.

Comparing the corresponding SHAP values of the adaptive neural network with Fig A, no major differences are noticeable. This shows that the adaptive neural network distributes the importance between the SFS entries in much the same way as the linear neural network, at least when trained for fixed recombination rates. Fig C shows the feature importance for the adaptive neural network trained for variable recombination rates. SHAP values were calculated as before for ρ = 0, ρ = 35, ρ = 1000, and variable ρ. This lets us observe how the adaptive network adapts to the different recombination scenarios. Again, the first entries of the SFS are the most important, especially S_1, since their SHAP values are the most pronounced. In addition, the adaptive neural network adapts to different recombination scenarios, as can be seen, for example, from the different patterns for S_31 to S_39 between subplot A and subplots B/C in Fig C. For these features, a high value results in a decreased θ estimate for ρ = 0, but in an increased estimate for ρ = 35 or ρ = 1000.

Feature importance of the adaptive neural network via SHAP (Fig C). The SHAP values were calculated for 500 SFS. Each dot represents the SHAP value of one feature for a particular prediction, i.e. one SFS entry of one of the 500 SFS. The color indicates whether the particular SFS entry had a high or low value, and the position on the x-axis shows its effect, i.e., the extent to which it increased or decreased the θ estimate.
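The sign comparison between the linear network's weights and the model-based coefficients described above can be read off directly from the fitted model. A minimal sketch, assuming a Keras implementation (the framework and the untrained placeholder model are assumptions, not taken from the paper):

```python
import tensorflow as tf

n = 40
# Placeholder linear network: a single Dense layer mapping the 39 SFS
# entries to a theta estimate, standing in for the trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n - 1,)),
    tf.keras.layers.Dense(1),
])

# With no hidden layer, the kernel of the single Dense layer plays the
# role of the coefficient vector of a linear model-based estimator.
kernel, bias = model.layers[0].get_weights()
coefficients = kernel.ravel()  # shape (39,): one coefficient per S_i

# Inspect the signs, e.g. to check whether the weights for S_26..S_39
# are negative (as for Fu's estimator with large enough theta at rho = 0).
for i, c in enumerate(coefficients, start=1):
    print(f"S_{i}: coefficient {c:+.4f}")
```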

B Neural Network Training Procedure
We considered two artificial neural networks: a linear dense feedforward neural network (no hidden layer) and a dense feedforward neural network with one hidden layer and an adaptive loss function. All neural networks take the site frequency spectrum S as input and are trained for θ chosen uniformly in (0, 100). The neural networks for fixed recombination rates were trained on 2·10⁵ simulations, those for variable recombination rates on 6·10⁵ simulations. For the linear neural network, the simulations were split 80%/20% between training and validation. Since the adaptive training procedure resembles ensemble training, in the sense that the process involves training multiple neural networks, a second validation set is required for evaluation between the training steps. Therefore, an additional 20% is set aside from the training set as a second validation set, and the larger validation set is used to decide whether another update of the adaptive loss function is needed. The test set is simulated separately and is usually of size 4.9·10⁵. Hyperparameters were selected via a grid search: ReLU is used as activation function, the Adam optimizer [2] with learning rate 0.001, 200 hidden nodes, and a batch size of 64.
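A minimal sketch of this training setup, assuming TensorFlow/Keras (the paper does not specify the framework) and using random stand-in data in place of the msprime-simulated SFS:

```python
import numpy as np
import tensorflow as tf

n = 40  # sample size; the input SFS has n - 1 = 39 entries

# Placeholder training data: in the paper, X holds 2*10^5 (fixed rho) or
# 6*10^5 (variable rho) simulated SFS and y the true theta, drawn
# uniformly from (0, 100).
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(200_000, n - 1)).astype("float32")
y = rng.uniform(0.0, 100.0, size=200_000).astype("float32")

# Linear network: no hidden layer.
linear = tf.keras.Sequential([
    tf.keras.Input(shape=(n - 1,)),
    tf.keras.layers.Dense(1),
])

# One-hidden-layer network used in the adaptive procedure:
# 200 ReLU nodes, as selected by the grid search.
adaptive = tf.keras.Sequential([
    tf.keras.Input(shape=(n - 1,)),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(1),
])

for model in (linear, adaptive):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse")
    model.fit(X, y,
              batch_size=64,
              epochs=100,
              validation_split=0.2,  # 80%/20% training/validation split
              callbacks=[tf.keras.callbacks.EarlyStopping(
                  patience=5, restore_best_weights=True)])
```

In the adaptive procedure, the per-sample weights of the loss would additionally be updated between such training runs according to the reweighing rule (Eq 4), e.g. via the sample_weight argument of fit; the exact rule is not reproduced in this sketch.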
Preventing overfitting is always a challenging problem in ML. The risk of overfitting can be reduced not only during the training process, for example via regularization methods, but also through architectural design choices. First of all, the adaptive neural network has a lower model capacity than more sophisticated ML methods, making an overfit less likely. In addition, the adaptive training process is similar to ensemble learning, with a termination condition (Eq 4) preventing the model from fitting the training data too well. Furthermore, each of the neural networks in the adaptive training procedure was trained using the regularization method early stopping. Training, validation and test errors were monitored, and we observed no tendency of the model to overfit. For the generalization outside the training range, see also the corresponding figure.

Fig E shows the estimates of Watterson's estimator, the iterative Futschik estimator, and the adaptive neural network for the block recombination map (Fig D). An extract not displaying the whole chromosome is shown in Fig F, and for completeness the corresponding section of the recombination map in Fig G. Figures E and F illustrate how the estimators adapt to the two recombination levels. Obviously, Watterson's estimator struggles in the parts without recombination. The iterative Futschik estimator is not particularly noticeable, in part because for high recombination rates the errors are generally much lower than for ρ = 0 and thus less visible on this scale (see e.g. Fig 1). Of the estimators considered, the adaptive neural network (uniform) performs best.

Comparison of estimators for ρ = 0: the adaptive neural network, the iterated minimal MSE estimator (Futschik (iter)), Watterson's estimator, and a modification of Watterson's estimator (Watterson (mod)) by Futschik and Gach [3]. For this plot, we used msprime to generate a total of 4.9·10⁵ independent simulations with sample size n = 40, recombination rate ρ = 0, and 49 different mutation rates θ ∈ (0, 40]. The adaptive neural network has been trained on a data set with ρ = 0. The modified version of Watterson's estimator by Futschik and Gach [3] is of the form