MimicrEE2: Genome-wide forward simulations of Evolve and Resequencing studies

Evolve and Resequencing (E&R) studies allow us to monitor adaptation at the genomic level. By sequencing evolving populations at regular time intervals, E&R studies promise to shed light on some of the major open questions in evolutionary biology such as the repeatability of evolution and the molecular basis of adaptation. However, data interpretation, statistical analysis and the experimental design of E&R studies increasingly require simulations of evolving populations, a task that is difficult to accomplish with existing tools, which may i) be too slow, ii) require substantial reformatting of data, iii) not support an adaptive scenario of interest or iv) not sufficiently capture the biology of the used model organism. Therefore we developed MimicrEE2, a multi-threaded Java program for genome-wide forward simulations of evolving populations. MimicrEE2 enables the convenient usage of available genomic resources, supports biological particulars of model organism frequently used in E&R studies and offers a wide range of different adaptive models (selective sweeps, polygenic adaptation, epistasis). Due to its user-friendly and efficient design MimicrEE2 will facilitate simulations of E&R studies even for small labs with limited bioinformatics expertise or computational resources. Additionally, the scripts provided for executing MimicrEE2 on a computer cluster permit the coverage even of a large parameter space. MimicrEE2 runs on any computer with Java installed. It is distributed under the GPLv3 license at https://sourceforge.net/projects/mimicree2/.


Introduction
The Evolve and Resequencing (E&R) approach is a powerful tool for studying adaptation at a genome-wide scale [1,2].The advent of next generation sequencing (NGS) made it feasible to study genomic changes occurring in populations subject to any form of artificial or natural selection [1,2].Usually allele frequency changes are monitored, for example by sequencing pools of populations (Pool-Seq [3]), but it is also feasible to study changes of the haplotype structure [4].Monitoring the genomic response to selection ultimately promises to shed light on some major open questions in evolutionary biology such as the genetic basis of complex traits [5], the distribution of fitness effects [6], the mode of adaptation (e.g.polygenic adaptation vs. selective sweeps [7,8]) and the repeatability of evolution [9].
However E&R studies increasingly require genome-wide forward simulations of adapting populations, where especially three key challenges stand out: First, E&R studies come at a considerable cost, both in terms of money and time.For example studying adaptation in Drosophila for 60 generations may take up to two years and require several thousand Dollars of sequencing cost [2,10].In case of a suboptimal experimental design it may be impossible to answer the pertinent research questions and the invested resources may have been wasted.It is therefore of considerable interest to evaluate the power of an experimental design before embarking on a costly E&R study.Computer simulations may, for example, help to identify the optimal number of replicates, generations of selection and starting haplotypes [11,12,13].
Second, many different test statistics for identifying selected loci in E&R studies have been suggested [5,9,14,15,16].Computer simulations can help to identify the strength and weaknesses of these different statistics [2,17].Simulations have, for example, shown that time-series based test statistics could increase the power to identify selected loci in E&R studies [14,15].Additionally, the validation of novel methodological approaches, for example the reconstruction of haplotypes from E&R data [4], requires computer simulations.
Finally several E&R studies have found an unexpected genomic response to selection.For example, Kosheleva and Desai [8] observed that experimentally evolving yeast populations responded to selection at the genomic level for about 240 generations but no further response was found during the following 720 generations.The authors suggested that a quantitative trait model involving many loci of small effect in combination with diminishing returns epistasis could generate this pattern.Computer simulations will allow to test whether a proposed model, such as this, could account for the observed data.
In summary, it is likely that computer simulations will be an integral part of future E&R studies, be it in study design or interpretation of the data.To aid researchers in these tasks we developed MimicrEE2 (mimicry of experimental evolution) a tool for fast genome-wide forward simulations of evolving populations.

Design and implementation
MimicrEE2 is a user-friendly tool for individual based forward simulations of evolving populations.It uses a discrete time model and allows simulating haploid as well as diploid organism.As an important feature MimicrEE2 enables usage of available genomic resources such as haplotypes or recombination maps.MimicrEE2 supports three different models of selection (Fig 1 ): • w-mode (w is a widely used symbol for fitness): fitness of individuals is directly computed from the selection coefficients of SNPs • qt-mode: a quantitative trait is simulated and phenotypically extreme individuals are truncated • qff-mode: a quantitative trait is mapped to fitness using a fitness function For several reasons MimicrEE2 is especially suitable for genome-wide forward simulations of E&R studies.First, it supports biological particularities of model organism commonly used in E&R studies.MimicrEE2 allows to simulate haploid and diploid organism, different forms of reproduction (e.g.males and females in Drosophila, different ratios between males and hermaphrodites in Caenorhabditis), variable rates of self-fertilization, clonal evolution, hemizygous sex chromosomes and sex specific recombination maps (e.g.Drosophila males do not recombine).Second, MimicrEE2 supports many different models of adaptation, like classic selected loci, complex epistasis between pairs of loci (fitness may be provided for all combinations of genotypes), selection on a quantitative trait, truncating selection, stabilizing selection, diminishing returns epistasis, disruptive selection, directional selection and adaptation to a moving optimum.Third, MimicrEE2 is comparatively user friendly (no programming skills are necessary) and enables the convenient use of available genomic resources such as haploytpe data, recombination maps and known positions of causative loci.Fourth, Mimi-crEE2 supports multi-threading and we provide scripts that allow running MimicrEE2 on computer clusters (Apache Spark spark.apache.org).This allows simulating even powerful experimental designs (e.g.large population size and many replicates) in a time effective manner.Finally the output of MimicrEE2 (sync, fasta) is compatible with many downstream tools frequently used for analyzing E&R data, such as tools for identifying selected loci (PoPoola-tion2, poolSeq, CLEAR, BBGP [18,17,15,14]), reconstructing haplotypes [4] and simulating reads (e.g.ART [19]).
MimicrEE2 is implemented in Java and does not require installation of any libraries or tools, hence it is platform independent and runs on any computer with Java installed (v8 or higher; tested with macOS and Linux).To install MimicrEE2 it is solely necessary to download the java archive file (jar; see manual https://sourceforge.net/p/mimicree2/wiki/Manual/).As input MimicrEE2 requires haplotypes for a population and the recombination rate.Haplotypes need to be provided as nucleotides (A,T,C,G), which simplifies conversion from commonly used file formats such as vcf files.An arbitrary number of chromosomes may be provided.MimcrEE2 converts the haplotypes into a bitarray (0,1), therefore solely biallelic SNPs may be used.A recombination rate may be provided for arbitrary sized windows.For each window the mean number of cross overs is computed using Haldane's mapping function [20] and the number of cross over events is drawn from a Poisson distribution.A random position within the window is picked for each cross over event.At each generation MimicrEE2 performs the following steps in the given order, where details may vary among the modes (Fig 1): i) truncating selection is performed (if applicable), ii) mate pairs are formed, iii) gametes are generated based on cross over events and random assortment of chromosomes iv) mutations are introduced into the gametes (optional) v) zygotes are formed and a novel population is generated vi) migrants are introduced into the population (optional) vii) the genotype, phenotype and fitness of the individuals is computed and viii) the output is stored (optional).For details see the manual (https://sourceforge.net/p/mimicree2/wiki/Manual/).

Results
MimicrEE2 can be used to i) evaluate the power of different designs of E&R studies, ii) assess the performance of diverse statistical approaches and iii) predict the genomic response to selection under a given model/hypothesis.
To illustrate the utility of MimicrEE2 we tested whether it is feasible to identify loci contributing to starvation resistance in Drosophila with a simple truncating selection experiment.Unraveling the genetic basis of complex traits, such as starvation resistance, is considered to be a key challenge for biology in the 21 st century [21,22].This example also serves to illustrate a major advantage of MimicrEE2, i.e. the user-friendly design which enables convenient usage of available genomic resources.We used 205 haplotypes from the DGRP lines (Drosophila Genome Reference Panel; Freeze 2.0) [23], the recombination rate of D. melanogaster [24] and introduced beneficial alleles into four genes known to confer starvation resistance in Drosophila [25].We used a female to male ratio of 50:50, a hemizygous X-chromosome in males, no recombination in males and simulated truncating selection for 10 replicates and 40 generations, with 80% of the most starvation resistant individuals surviving truncation.Finally we identified the selected loci with PoPoolation2 (cmh-test [18]), directly using the output of MimicrEE2 as input in PoPoolation2.We found that with the simulated experimental design the four targets of selection yield distinct peaks that may be readily identified (Fig 2A).All data and instructions necessary to reproduce this experiment can be found at https://sourceforge. net/p/mimicree2/wiki/BioExample/.
The performance of different experimental designs or test statistics may be evaluated using diagnostic tools such as receiver-operating characteristic (ROC) plots [26] that contrast the true-positive rate (TPR) with the false-positive rate (FPR).To illustrate this we simulated another truncating selection experiment using a slightly more complex architecture of the quantitative trait (50 causative loci).We used several different truncating selection regimes (95%, 80%, 60%, 40%, 20%, 5% of individuals surviving truncation) and repeated the experiment ten times to obtain estimates for the error bars (in total 100 simulations for each condition: 10 experiments with 10 replicates).The population size was 205 and the number of generations 40.Based on the resulting ROC curves we found that truncating selection retaining 95% of the phenotypic most pronounced individuals yielded the highest power to identify the causative variants (S1A Fig) .Similarly it is possible to evaluate the suitability of different test statistics for identifying the causative loci (S1B Fig).
Simulations also enable predicting the genomic response under different adaptive models such as stabilizing selection or diminishing returns epistasis (S2 and S3 Figs).Comparison of the observed genomic response to the simulated one enables to assess whether a proposed adaptive scenario could account for the observed data.For example, in a recent work Mimi-crEE2 was used to test whether selection on a polygenic trait could explain an observed pattern

Fig Simulation of truncating selection for starvation resistance in D. melanogaster.
A) Manhattan-plot showing the significance (cmh-test) of allele frequency differences between the founder and evolved populations, which were subject to truncating selection for 40 generations (10 replicates).Four loci in genes known to contribute to starvation resistance were picked as targets of selection (big black dots, gene names in italics) and an effect size was assigned to each (in brackets).A hemizygous X-chromosome was simulated in males.B) We used a different recombination rate for females (red) and males (blue).C) Nucleotide diversity of the 205 DGRP haplotypes used as founder population.

Comparison to other tools
In contrast to its predecessor MimicrEE [11], MimicrEE2 implements several novel features, most notably support for a quantitative trait model, sex, sex chromosomes, migration and de novo mutations (Table 1).
Several other tools for genome-wide forward simulations have been developed such as forqs [28], quantiNemo [29], SLiM2 [30] and FFPopSim [31].To evaluate the performance with E&R data, we performed with each tool the same truncating selection experiment for starvation resistance as described above (Table 1; S1 Text).In cases where truncating selection was not supported we performed neutral simulations (quantiNemo, MimicrEE).As output we requested the allele frequencies.We found that MimicrEE2 was fast and requires little memory (Table 1).Note that we used 8 cores for MimicrEE and MimicrEE2 while we only used a single core for the other tools (multi-threading is not supported by other tools; Table 1).On a per-CPU basis SLiM2 was faster but we consider the actual execution time (i.e. the waiting time) as more important benchmark.Interestingly forqs performance increased when haplotypes instead of allele frequencies are requested as output (Table 1).
SLiM2 [30] is notable as its specifically developed programming language Eidos allows to simulate a wide range of different evolutionary scenarios including spatial models and gene Table 1.Comparison of tools for genome-wide forward simulations of evolving populations.MimicrEE1 (mim1) [11], MimicrEE2 (mim2, this study), forqs [28], quan-tiNemo (qNemo) [29], SLiM2 [30], FFPopSim [31]. 3fitness values may be provided for all combinations of genotypes at pairs of loci 4 the effect size of QTLs may vary, not the position and shape of the fitness function 5 in brackets: haplotypes are requested as output 6 features are referring to the haploid_highd class which allows genome-wide simulations (>20 loci) https://doi.org/10.1371/journal.pcbi.1006413.t001 environment interactions (Table 1).However, using some features may require substantial programming skills in Eidos.This raises the difficult question as to which extent a feature is actually supported if it largely needs to be implemented by the user.For example usage of a variable recombination map requires implementing a file parser.We thus opted to indicate "intermediate support" for features that could in principle be simulated but will require substantial coding.The same holds for FFPopSim which requires programming skills in Python (Table 1).We conclude that of the existing tools MimicrEE2 is best suited for simulating E&R studies as it i) is the fastest tool ii) requires little memory iii) supports the convenient usage of available genomic resources iv) directly supports a wide range of adaptive models and v) is compatible with downstream tools for the analysis of E&R data (Table 1).

Validation
The main evolutionary forces that shape the frequency of SNPs in E&R studies are genetic drift, selection and recombination.Here, we validated the correct implementation of these forces in MimicrEE2.To test if genetic drift was correctly modeled we simulated 10,000 unlinked loci with a starting allele frequency of 0.5 in a population of size N = 250.We performed neutral simulations for 50 generations.We computed the expected allele frequencies using the binomial formula and a Markov Chain model [32].We found that the obtained allele frequency distribution closely follows the theoretical expectations (Fig 3; Chi-squared test; χ 2 = 482.64;df = 500; p = 0.70).
Next we tested whether selection was correctly modeled.We simulated codominant loci (h = 0.5) having a selective advantage of s = 0.1 and a starting allele frequency of 0.1.We used 50 replicates, a population size of N = 10.000 and performed forward simulations for 200 generations.Theoretical expectations were derived using the equation w, where p t is the allele frequency of the next generation, p t−1 the allele frequency of the previous generation and W AA , W Aa , W aa the fitness of the genotypes [32].At each 10 th generation we compared the obtained allele frequencies to the theoretically expected ones and did not find any significant deviations (Fig 3B; twenty t-tests; p > 0.07).
To test whether selection on quantitative traits was correctly implemented we relied on the breeder's equation (R = h 2 S [33]), which permits to calculate the response to selection (R) based on the selected individuals (S) and the heritability of a trait (h 2 ).We simulated 10 QTLs with starting allele frequency 0.5 in a population of N = 1000.The heritability was h 2 = 0.5 and 100 simulations were performed for each of the following fractions of selected individuals: 0.2, 0.4, 0.6 and 0.8; The expected response to selection agrees with the observed one (Fig 3C ; four χ 2 tests, df = 1, p > 0.67).Finally we tested whether recombination was modeled correctly by tracing the decay of linkage disequilibrium (LD) between two loci for 100 generations.The alleles at these loci were initially completely linked (D = 0.25) and at a frequency of 0.5.We used a recombination rate between the loci of r = 0.05 and a population size of N = 1000.Theoretical expectations were calculated using the equation D t = D 0 (1 − c) t , where D t is the LD at the given generation (t), D 0 the LD at the starting population and c the recombination rate [33].At each 10 th generation we compared the observed to the expected LD and did not find any significant deviation (Fig 3D; ten t-tests p > 0.09).
To ensure correct behavior of the components of MimicrEE2 we implemented more than 200 unit tests (JUnit 4.12 junit.org/junit4/)that may be executed by the user.More details and further validations can be found at https://sourceforge.net/p/mimicree2/wiki/Home/ #validation.
Haller and Samuel Neuenschwander for helpful advice.We thank all members of the Institute of Population Genetics for feedback and support.

Fig 1 .
Fig 1. Flow diagrams showing the order of events occurring at each generation during simulations with MimicrEE2.A separate diagram is shown for each model of selection (i.e.mode's) supported by MimicrEE2.A) At the w-mode the fitness of each individual is directly computed from the selection coefficients of the SNPs present in the genome.The mating success of individual scales with fitness.B) With the qt-mode, MimicrEE2 first computes the phenotypic values for each individual based on the effect sizes of the SNPs and some environmental variance.Then it performs truncating selection, where the individuals with the most pronounced phenotypic values are culled.C) During the qff-mode, MimicrEE2 computes the phenotypic values of a quantitative trait and maps these values to fitness using a fitness function (e.g.: a Gaussian fitness function for stabilizing selection).D) Events occurring during clonal evolution using the w-mode as example.Most importantly, clones do not mate but generate identical copies of themselves (with the exception of de novo mutations).In the flow diagram's yellow indicates migrants and the width of the circles indicates the population size.Optional events are shown in square brackets.https://doi.org/10.1371/journal.pcbi.1006413.g001