On convolutional neural networks for selection inference: Revealing the effect of preprocessing on model learning and the capacity to discover novel patterns

A central challenge in population genetics is the detection of genomic footprints of selection. As machine learning tools, including convolutional neural networks (CNNs), have grown more sophisticated and been applied more broadly, they offer a logical next step for increasing our power to learn and detect such patterns; indeed, CNNs trained on simulated genome sequences have recently been shown to be highly effective at this task. Unlike previous approaches, which rely upon human-crafted summary statistics, these methods can be applied directly to raw genomic data, allowing them to potentially learn new signatures that, if well understood, could improve the current theory surrounding selective sweeps. Toward this end, we examine a representative CNN from the literature, paring it down to the minimal complexity needed to maintain comparable performance; this low-complexity CNN allows us to directly interpret the learned evolutionary signatures. We then validate these patterns in more complex models using metrics that evaluate feature importance. Our findings reveal that preprocessing steps, which determine how the population genetic data are presented to the model, play a central role in the learned prediction method. This results in models that mimic previously defined summary statistics; in one case, the summary statistic itself achieves similarly high accuracy. For evolutionary processes that are less well understood than selective sweeps, we hope this provides an initial framework for using CNNs in ways that go beyond simply achieving high classification performance. Instead, we propose that CNNs might be useful as tools for learning novel patterns that can translate to easy-to-implement summary statistics available to a wider community of researchers.


Table A Model accuracy for decreasing levels of complexity, for the single-population demographic model with selection coefficient s = 0.01. Each model is trained on 80,000 training simulations, with 20,000 simulations for testing and validation. Accuracy is calculated on a balanced testing set, and red values indicate substantial losses in accuracy. mini-CNN is one of two models with the lowest complexity that maintains high performance: one convolutional layer with a single 2x1 kernel, followed by either ReLU or MaxPooling, with a single 1-unit dense layer followed by a sigmoid function. We chose the model with ReLU rather than MaxPooling for its slight favorability with respect to interpretation of the model. We note that a 1x2 kernel does not perform as well (84.3% accuracy), indicating that row-to-row differences are a more salient signal than column-to-column differences, as would be expected. In addition, a single dense layer followed by a sigmoid, replicating a logistic regression model on the pixels of the input image, also does not perform as well (77.6% accuracy).
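The mini-CNN described above is small enough to write out explicitly. The numpy sketch below (the weights, image size, and kernel values are illustrative stand-ins, not the trained parameters) shows the full forward pass: one 2x1 convolution, ReLU, and a single dense unit with sigmoid output. With a difference kernel such as [1, -1], the convolution responds precisely to row-to-row differences between adjacent haplotypes.

```python
import numpy as np

def mini_cnn_forward(image, kernel, dense_w, dense_b):
    """Forward pass of a mini-CNN-style model: one 2x1 convolution
    (valid padding), ReLU, then a single dense unit with sigmoid."""
    # 2x1 convolution: each output pixel combines two vertically adjacent
    # pixels, so it is sensitive to row-to-row (haplotype-to-haplotype)
    # differences in the image.
    conv = kernel[0] * image[:-1, :] + kernel[1] * image[1:, :]
    act = np.maximum(conv, 0.0)                      # ReLU
    logit = float(act.ravel() @ dense_w + dense_b)   # 1-unit dense layer
    return 1.0 / (1.0 + np.exp(-logit))              # sigmoid probability

# Toy example with illustrative (untrained) weights.
rng = np.random.default_rng(0)
img = rng.integers(0, 2, size=(8, 8)).astype(float)  # binary haplotype image
kernel = np.array([1.0, -1.0])                       # a row-difference kernel
w = rng.normal(size=(7 * 8,)) * 0.1                  # (H-1)*W dense weights
p = mini_cnn_forward(img, kernel, w, 0.0)
```

Note that with this kernel, an image of identical rows (complete haplotype homozygosity, as after a hard sweep) produces an all-zero activation map, so the dense layer sees no signal from those regions; this is the structure that makes the model interpretable.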

Table B Model accuracy for decreasing levels of complexity, for the single-population demographic model with selection coefficient s = 0.01, trained without row-sorting. Each model is trained on 80,000 training simulations, with 20,000 simulations for testing and validation. Accuracy is calculated on a balanced testing set, and red values indicate substantial losses in accuracy. In the absence of row sorting, four 3x3 kernels appear to be necessary to maintain performance.

Table C Model accuracy for all CNN models and summary statistics, across all demographic models and types of pre-processing. Performance values are shown for multiple methods that were trained and tested on the same demographic model and selection coefficient. The performance values were found by computing the accuracy of the trained model or statistic on a balanced, held-out set. Here, (ZP) denotes that the images were standardized to a fixed width using zero padding, and (tr) indicates that the images were standardized to a fixed width using trimming. In all other cases where standardization was required, an image resizing algorithm was used. Under zero padding, the CNN therefore learns that finding variation beyond a certain point is a signature of neutrality. We note that when the selection coefficient is low (s = 0.005; second and fourth rows), this paves the way for higher classification accuracy when compared with the accuracy on the same scenario under image resizing. We urge caution in generalizing this result, however, without analyzing performance on a wider range of simulations across a range of mutation rates. It is intriguing that the salient features seem to extend further down the image in the case of soft vs neutral, indicating that the model finds useful information from more haplotypes. We are hesitant to generalize these results, however, as our accuracies at the Hard/Soft task are quite low (Imagene: 60.7%, mini-CNN: 59.7%, Garud's H2/H1: 63.7%) and we have not optimized either our simulations or our model architectures for this task.
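The three width-standardization schemes compared in Table C can be sketched in a few lines of numpy. The `resize_cols` function is a nearest-neighbor stand-in for the interpolating image-resize routine used in practice, kept dependency-free here for illustration:

```python
import numpy as np

def zero_pad(mat, width):
    """Pad the haplotype matrix with zero columns up to a fixed width;
    images already at or above the width are cut to it."""
    pad = width - mat.shape[1]
    return np.pad(mat, ((0, 0), (0, pad))) if pad > 0 else mat[:, :width]

def trim(mat, width):
    """Keep only the first `width` columns (segregating sites)."""
    return mat[:, :width]

def resize_cols(mat, width):
    """Nearest-neighbor column resampling to a fixed width (a simple
    stand-in for an interpolating image-resize algorithm)."""
    idx = np.linspace(0, mat.shape[1] - 1, width).round().astype(int)
    return mat[:, idx]
```

The design difference matters: zero padding writes the number of segregating sites directly into the image (variation simply stops where the padding begins), whereas resizing discards that information; this is the source of the artifact discussed alongside Figure K.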

Hard Sweep vs Neutral; Hard Sweep vs Soft Sweep (figure panel titles; the accompanying discussion appears in the captions of Figure O and Figure Q below)

Figure A
Figure A Comparison of Imagene with mini-CNN. Imagene, similar to other CNNs for detecting selection, contains multiple convolution layers and kernels. mini-CNN contains a single convolution layer with a single 2x1 kernel, followed by ReLU activation and a single dense layer.

Figure B
Figure B SHAP explanations for Imagene predictions without row-sorting. Visualization of Imagene with SHAP explanations. From left to right are examples of neutral and sweep processed images, SHAP values for the two image examples, and average SHAP values across 1000 neutral and sweep images. A negative SHAP value (blue) indicates that the pixel of interest contributes toward a prediction of neutral, while a positive SHAP value (red) indicates that the pixel of interest contributes toward a prediction of sweep. Without row-sorting, it is difficult to identify any particular patterns of interest to the model.
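Computing the SHAP values shown here requires the `shap` library and a trained model; as a dependency-free sketch of the same idea (per-pixel attribution of a model's prediction), the occlusion map below scores each pixel by how much zeroing it changes the model's output. The `predict` argument is a placeholder for any trained classifier:

```python
import numpy as np

def occlusion_map(predict, image):
    """Score each pixel by the change in model output when that pixel is
    zeroed out. Positive scores mean the pixel pushed the prediction up
    (toward 'sweep' in a sweep-vs-neutral classifier); this is a crude
    but assumption-free analogue of a SHAP attribution map."""
    base = predict(image)
    scores = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            occluded = image.copy()
            occluded[i, j] = 0.0            # knock out one pixel
            scores[i, j] = base - predict(occluded)
    return scores
```

Unlike SHAP, this scores each pixel in isolation and ignores interactions between pixels, so it is only an illustration of the attribution idea, not a substitute for the SHAP analysis in the figure.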

Figure C
Figure C Performance correlation for summary statistics and CNN approaches under image resizing. Spearman correlation matrices are shown for the single-population demographic model, as well as the three-population demographic model with sweeps in YRI and in CEU. Left matrices are calculated for sweep simulations with selection coefficient s = 0.01, and right matrices are calculated for selection coefficient s = 0.005. CNN methods (Imagene, mini-CNN, and DeepSet) are run on pre-processed images using image resizing.
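A Spearman correlation is simply the Pearson correlation of ranks, so for tie-free scores it can be computed directly in numpy (the sketch below does not average tied ranks, unlike `scipy.stats.spearmanr`, so it is only exact when all values are distinct):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for tie-free inputs: rank both
    vectors, then compute the Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x (0-based)
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y (0-based)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Applying this to each pair of methods' per-simulation scores yields a matrix like those in the figure: values near 1 indicate two methods rank the same simulations as sweep-like, even if their raw outputs differ.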

Figure D
Figure D Visualization of model performances across all demographic models for selection coefficient of 0.01. Each sub-plot corresponds to a different demographic model, the x-axis denotes the model type, and the y-axis corresponds to the accuracy of the model on a balanced, held-out set. In all cases where image standardization was required, an image resizing algorithm was used. Error bars correspond to 95% confidence intervals for the accuracy on the test set.

Figure E
Figure E Visualization of model performances across all demographic models for selection coefficient of 0.005. Each sub-plot corresponds to a different demographic model, the x-axis denotes the model type, and the y-axis corresponds to the accuracy of the model on a balanced, held-out set. In all cases where image standardization was required, an image resizing algorithm was used. Error bars correspond to 95% confidence intervals for the accuracy on the test set.

Figure F Figure G Figure H Figure I Figure J
Figure F Simulations of CEU and YRI under the three-population demographic model match the site frequency spectrum of 1000 Genomes populations. On the left is the full folded site frequency spectrum (SFS) for each simulated dataset and real dataset, and on the right is the same figure, zoomed into the low minor allele frequencies for easier visualization. The SFS curves for the simulated populations closely match the observed SFS curves for each of the two populations, an indication that these simulations do a decent job of capturing the overall sequence diversity of these populations.
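For reference, the folded SFS plotted here can be computed from a haplotype matrix in a few lines; this sketch assumes a (haplotypes x sites) 0/1 matrix, as in the images used throughout:

```python
import numpy as np

def folded_sfs(haps):
    """Folded site frequency spectrum from a (haplotypes x sites) 0/1
    matrix: the number of sites at each minor-allele count 1..n//2,
    where n is the number of haplotypes."""
    n = haps.shape[0]
    derived = haps.sum(axis=0)                 # derived-allele count per site
    minor = np.minimum(derived, n - derived)   # fold the spectrum
    # bincount over minor-allele counts; drop the invariant (count 0) class
    return np.bincount(minor, minlength=n // 2 + 1)[1:]
```

Folding is used because, without an outgroup, ancestral and derived alleles cannot be distinguished, so only the minor-allele frequency is identifiable.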

Figure K Figure L Figure M Figure N
Figure K Performance correlation for summary statistics and CNN methods under zero-padding. Spearman correlation matrices are shown for the same models as in Figure C. CNN methods are run on pre-processed images using zero-padding. In addition, we include a summary statistic "Ncols" that counts the number of columns in the image after removal of sites with minor allele frequency below 1%; the correlation of Ncols with the CNN approaches illustrates the potential artifacts introduced by zero-padding.
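The Ncols statistic is deliberately trivial, which is what makes its correlation with the CNNs telling. A sketch, again assuming a (haplotypes x sites) 0/1 matrix:

```python
import numpy as np

def ncols(haps, maf_threshold=0.01):
    """Number of sites (image columns) remaining after removing sites
    with minor allele frequency below the threshold."""
    n = haps.shape[0]
    freq = haps.sum(axis=0) / n            # derived-allele frequency per site
    maf = np.minimum(freq, 1.0 - freq)     # minor-allele frequency
    return int((maf >= maf_threshold).sum())
```

Under zero-padding, a CNN can recover Ncols directly from where the padded (all-zero) region of the image begins, so high correlation between Ncols and the CNNs suggests the models are partly reading off image width rather than haplotype structure.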

Figure O
Figure O Visualizations of mini-CNN dense layer, comparing Hard vs Neutral and Hard vs Soft sweep classification. For the same soft sweep simulations referenced in Figure N, and our single-population hard sweeps with s = 0.01, we trained mini-CNN to classify between hard and soft sweeps. Looking at the dense layer when classifying Soft vs Hard, it is intriguing to see a pattern of dark pixels, corresponding to a classification of Hard, that is slightly further down the image than what we see when we classify Neutral vs Hard. The interpretation here would be that an absence of row-to-row differences in this lower stripe is the most salient signal to mini-CNN for a classification of Hard over Soft. This could make sense given that a soft sweep results in a less robust signal of haplotype homozygosity. We are hesitant to generalize these results, however, as our accuracies at the Hard/Soft task are quite low (Imagene: 60.7%, mini-CNN: 59.7%, Garud's H2/H1: 63.7%) and we have not optimized either our simulations or our model architectures for this task.
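The Garud's H2/H1 statistic used as the baseline here is built from haplotype frequencies; a short sketch, taking haplotypes as hashable sequences (e.g. strings):

```python
from collections import Counter

def garud_h(haplotypes):
    """Garud's haplotype homozygosity statistics from a list of
    haplotypes. H1 is the sum of squared haplotype frequencies; H2 is
    H1 minus the squared frequency of the most common haplotype.
    H2/H1 is low after a hard sweep (one dominant haplotype) and
    higher after a soft sweep (several common haplotypes)."""
    n = len(haplotypes)
    freqs = sorted((c / n for c in Counter(haplotypes).values()),
                   reverse=True)
    h1 = sum(p * p for p in freqs)
    h2 = h1 - freqs[0] ** 2
    return h1, h2, h2 / h1
```

For example, a sample dominated by one haplotype at 80% frequency gives a small H2/H1, while the same homozygosity split between two haplotypes at 40% each gives a much larger ratio, which is why H2/H1 separates hard from soft sweeps.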
Figure P Sample Complexity Analysis. Analysis of performance with increasing training set size. Results are shown for the single-population and three-population demographic models for selection coefficient s = 0.01, with the image-width standardization approaches denoted. Accuracy across a balanced test set is calculated across a range of training set sizes, for a range of CNN methods and summary statistics. Horizontal dotted red lines indicate the training set size used in the paper; we see in particular that for the more computationally intensive simulations (3 populations with forward simulator SLiM), training sets larger than 18,000 (with 2,000 held out for testing and validation) do not offer dramatic performance increases.

Figure Q
Figure Q Visualization of model performances across all demographic models and training strategies, for selection coefficient of 0.01. Each sub-plot corresponds to a different demographic model, the x-axis denotes the model type, the y-axis corresponds to the accuracy of the model on the balanced, held-out set, and the color of the bar corresponds to the training strategy. We refer to an epoch as a single training pass through the training dataset. The "Best of 10" training strategy trains the model 10 times for 2 epochs each, then selects the best performing model from the 10 trainings using the validation set. The "Early Stopping" strategy trains the model for 100 epochs, and utilizes early stopping to stop training whenever the validation accuracy fails to increase after 2 epochs of training. The "Simulation on the Fly" strategy follows the strategy used in Torada et al. [1], which trains the model for a single epoch, meaning that each training simulation is used only once. This approach may be viewed as simulating samples on the fly where the total number of simulated samples was decided upon prior to training. The early stopping and simulation on the fly trainings were restarted if the model failed to improve in accuracy after the first epoch. Across all the demographic models and selection coefficients, the ranges of accuracy differences between the early stopping and best of 10 strategies were (−0.0599, 0.0445), (−0.0725, −0.0019), and (−0.097, 0.0137) for Imagene, mini-CNN, and DeepSet, respectively. The same ranges of accuracy differences between the simulation on the fly and best of 10 strategies were (−0.1595, 0.0100), (−0.1956, 0.0155), and (−0.0490, −0.0078). We note that this implementation of "Simulation on the Fly" may be at a disadvantage in this comparison because the additional epochs of the other training strategies result in more updates to the model during training; while it is possible to continue simulating novel data for simulation on the fly, our sample complexity analyses (see Figure P) indicate that we would likely see diminishing returns and would not expect this to result in much higher accuracy than we see with the other training strategies.
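The early-stopping rule described above (stop once validation accuracy fails to improve for 2 consecutive epochs) reduces to a small patience counter. A framework-agnostic sketch, where `train_one_epoch` is a placeholder for a function that trains for one epoch and returns validation accuracy:

```python
def train_with_early_stopping(train_one_epoch, max_epochs=100, patience=2):
    """Run up to max_epochs, stopping once validation accuracy has
    failed to improve for `patience` consecutive epochs. Returns the
    best validation accuracy seen and the number of epochs run."""
    best, stale, epochs = -1.0, 0, 0
    for _ in range(max_epochs):
        val_acc = train_one_epoch()
        epochs += 1
        if val_acc > best:
            best, stale = val_acc, 0   # improvement: reset patience
        else:
            stale += 1                 # no improvement this epoch
            if stale >= patience:
                break
    return best, epochs
```

The "Best of 10" strategy is the complementary loop: call a fixed 2-epoch training 10 times from fresh initializations and keep the run with the highest validation accuracy.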

Figure R
Figure R Visualization of model performances across all demographic models and training strategies, for selection coefficient of 0.005. Each sub-plot corresponds to a different demographic model, the x-axis denotes the model type, the y-axis corresponds to the accuracy of the model on the balanced, held-out set, and the color of the bar corresponds to the training strategy. The training strategies, restart rule, and accuracy-difference ranges are as described in the caption of Figure Q.
Model accuracy for Imagene, mini-CNN, DeepSet, and Garud's H1 methods on the single-population model with varying selection coefficient and number of haplotypes. The performance values were found by computing the accuracy of the trained model or statistic on a balanced, held-out set of size 10,000. For image standardization with the ML models, an image resizing algorithm was used. The images with 128 haplotypes were resized to a height and width of 128x128. Due to GPU memory constraints, the images with 1000 haplotypes were resized to 200x200. The top performing method is bolded in each column.