Pairwise hybrid incompatibilities dominate allopatric speciation for a simple biophysical model of development

Understanding the origin of species is as Darwin called it “that mystery of mysteries”. Yet, how the processes of evolution give rise to non-interbreeding species is still not well understood. In an empirical search for a genetic basis, transcription factor DNA binding has been identiﬁed as an important factor in the development of reproductive isolation. Computational and theoretical models based on the biophysics of transcription factor DNA binding have provided a mechanistic basis of such incompatibilities between allopatrically evolving populations. However, gene transcription by such binding events occurs embedded within gene regulatory networks, so the importance of pair-wise interactions compared to higher-order interactions in speciation remains an open question. Theoretical arguments suggest that higher-order incompatibilities should arise more easily. Here, we show using simulations based on a simple biophysical genotype phenotype map of spatial patterning in development, that biophysics provides a stronger constraint, leading to pair-wise incompatibilities arising more quickly and being more numerous than higher-order incompatibilities. Further, we ﬁnd for small, drift dominated, populations that the growth of incompatibilities is largely determined by sequence entropy constraints alone; small populations give rise to incompatibilities more rapidly as the common ancestor is more likely to be slightly maladapted. This is also seen in models based solely on transcription factor DNA binding, showing that such simple models have considerable explanative power. We suggest the balance between sequence entropy and ﬁtness may play a universal role in the growth of incompatibilities in complex gene regulatory systems.


Introduction 1
The detailed genetic mechanisms by which non-interbreeding that the hybrid binding energies change diffusively (Khatri and 48 Goldstein 2015a,b). However, real gene regulatory systems are 49 more complex than a single TF binding to DNA, so again the 50 question arises do these predictions hold for more complex gene 51 regulatory systems with more realistic fitness landscapes? 52 Although there has been much progress in understanding 53 evolution in terms of selection, mutation and genetic drift, the 54 majority of this work has been reliant on phenomenological fit-55 ness landscapes, which encompass in a heuristic manner smooth-56 ness, epistasis and neutrality (Higgs and Derrida 1992;Kauffman 57 and Levin 1987). In recent years, the question of the structure In this paper, we will use a slightly modified version of the 73 spatial patterning model in Khatri et al. (2009), which has explicit 74 sequence representation of each loci, to examine the growth of 75 Dobzhansky-Muller incompatibilities in allopatry as a function 76 of population size and under stabilising selection in each lineage. 77 Our results show that smaller populations develop incompati-78 bilities more quickly and in a manner mostly predicted based 79 solely on simple models of transcription factor DNA binding, 80 showing the power of these simple approaches (Khatri and Gold-

92
Genotype-Phenotype map 93 The genotype-phenotype map we use is a slight modification of denoted by E and protein-protein denoted by δE. More specif-106 ically, gene regulation is controlled by two non-overlapping 107 binding sites, the promoter P and an adjacent binding site B, 108 together with two protein species, the morphogen M and RNA Figure 1 An overview of the genotype-phenotype map. The gene regulatory module has input a morphogen gradient [M](x) across a 1-dimensional embryo of length L and outputs a transcription factor TF(x). Gene regulation of TF using a morphogen and RNAP (R) is controlled in a bottom-up manner, by binding to its regulatory region consisting of a promoter P and adjacent binding site B; E represents binding free energies of proteins to one of the two binding sites of the regulatory region of the transcription factor T, δE are protein-protein free energies to aid in co-operative binding of paired protein complexes. Each energy is calculated by the number of mismatches ρ (Hamming distance), shown in red, between relevant binary sequences, together with mismatch energies pd and pp for protein-DNA and protein-protein energies respectively. Transcription of T is controlled by the probability of RNAP being bound to the P, p RP .
[M](x, α) as a function of the position of embryonic cells, x, 1 and a fixed concentration [R] of RNAP, in each cell, we follow 2 Shea and Ackers (1985) to calculate the TF concentration profile 3 [TF](x), which assumes the steady state concentration profile 4 is simply proportional to the probability of RNAP being bound 5 to the promoter: [TF](x) ∝ p RP (G, R, M(x, α)). The proportion-6 ality constant is given by the ratio of the rate of transcription 7 and translation to the rate of degradation of TF, which is not 8 important in our study, since we are only interested in the shape 9 or contrast of [TF](x) that can be achieved. 11 We use a kinetic Monte Carlo scheme to simulate a Wright-Fisher 12 evolutionary process for the genome G and α on two indepen-13 dent lineages, as detailed in Khatri and Goldstein (2015b). The 14 rate of fixation of one-step mutants are calculated based on 15 Kimura's probability of fixation (Kimura 1962

Monte Carlo Scheme for speciation simulations
where, where W * is related to the threshold for inviability and κ F is 26 the strength of selection for the trait represented by the spatial 27 patterning process; when the hybrid's log-fitness drops below 28 F * = κ F log(W * ) an incompatibility or DMI arises. We choose 29 W * = 0.2 to give a reasonable number of incompatibilities that 30 arise in a simulation, where typically the maximum of W ≈ 0.6.

31
Note that although here the exact form of the fitness is slightly 32 different to the one used in Khatri et al. (2009), the qualitative 33 behaviour is the same (Supporting Information).

34
The speciation simulations consist of two replicate simula-35 tions starting with the same common ancestor and with the 36 same fitness function. We draw the common ancestor from the 37 equilibrium distribution for G and α. Our genome is composed of 4 loci: 1) RNAP, (whose sequence 3) Regulatory region of TF (g T = [t P , t B ]) and 4) the morphogen 49 gradient steepness α. Hybrids between the two lines are con-50 structed by independent reassortment of these loci and assuming 51 complete linkage within each loci. We define a hybrid genotype 1 by a 4 digit string where each digit corresponds to one of the loci 2 defined above and takes one of two values, which correspond 3 to the allele from the 1st line or 2nd line; for example, the hy-4 brid rMTa corresponds to R locus having an allele from the 1st 5 lineage, M locus with the allele from the 2nd lineage, T locus 6 from the 2nd lineage and α locus the allele from the 1st lineage.

7
Note that the underlying sequence of each hybrid changes as 8 different substitutions are accepted in each lineage; the notation 9 only refers to alleles fixed at any point in time. As α is a continu-  1 The total number of n−point DMIs is (2 n − 2)( L n ), as there are ( L n ) combinations of n loci amongst L total loci and then considering a binary choice of alleles across both lines, there are a total of 2 n allelic combinations or states, 2 of which are the fit allelic combinations where all alleles come from one lineage or the other giving 2 n − 2. For example, between each pair of loci there are 2 2 − 2 = 2 mismatching combinations of alleles (e.g. rM and Rm) that could give DMIs and ( L 2 ) = L(L − 1)/2 = 6 pairwise interactions. A similar argument would give a total of 24 3-point DMIs as there are 2 3 − 2 = 6 mismatching combinations of alleles at 3 loci (e.g., excluding rmt and RMT) and ( L 3 ) = 4 3-point interactions and similarly, 14( L 4 ) = 14 for 4-point interactions. In total, the max number of DMIs is I max = ∑ L n=2 (2 n − 2)( L n ) = 3 L + 1 − 2 L+1 , which for L = 4 loci is I max = 50.

53
Evolutionary properties of genotype-phenotype map on each 54 lineage 55 The properties of this genotype-phenotype map have been pre-56 viously explored (Khatri et al. 2009). An important property 57 of this genotype-phenotype map is that only a single mecha-58 nism of patterning is found, which is that RNAP (R) binds with  Khatri et al.
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/123265 doi: bioRxiv preprint first posted online Apr. 2, 2017; As discussed in the model section the DMIs shown in Fig.4 will 2 have contributions from many different fundamental incompati-3 bility types, which can be 2-point, 3-point and 4-point in nature.

4
Using the method described above to decompose DMIs into fun-5 damental types, we plot the total number of each type of DMI 6 versus divergence time in Fig.5, where the panels correspond 7 to different scaled effective population sizes from 2Nκ F = 0.1 8 to 2Nκ F = 100. We see clearly that pair-wise DMIs are domin-9 inant at all population sizes and divergence times, though the 10 difference is diminished at larger population sizes. These results

11
show that contrary to the prediction of Orr, that higher order 12 DMIs should be easier to evolve, higher order DMIs evolve more 13 slowly and are in smaller number compared to pair-wise DMIs.
14 As mentioned in the introduction the Orr model also predicts 15 that n-point DMIs should increase as ∼ t n . Here, we find that which has the asymptotic form of I(t) ∼ t γ for t τ and t T  Table 1. We see that the total number of DMIs and 2-point 27 DMIs have a power law exponent close to γ = 2, which is con-  pect P I (t) ∼ (µt) K * and so given that at least n substitutions are 46 needed for a n-point incompatibility, we would expect K * ≥ n.
It is possible the inconsistency here could be resolved by more

52
At larger population sizes (2Nκ F ≥ 50), Fig.5 Table 2 Table of values of the parameters characterising the sub-diffusive growth of DMIs for large scaled population sizes; β = 1 corresponds to normal diffusive motion, β < 1 to subdiffusion and β > 1 super-diffusion, while K * corresponds roughly to the number of substitutions required to reach the invaible region. Fig.6 we have plotted the number of 2-point we will see the analysis of these DMIs will not be so clear.

23
Examining Fig.6, we see at small population sizes, 2Nκ F ≤ 1, 24 that all the DMIs grow approximately quadratically at short 25 times with a saturating form at long times, as also seen in Fig.5.

26
In addition, we see that when population sizes are small, the quence length, as phenotypes coded by longer sequences evolve 10 more quickly, so it is possible these two effects could confound 11 each other. Here for example, E MB has a stronger selective con-12 straint compared to δE RM , but a longer sequence length, as each 13 DNA-protein interaction interface has 10 binary digits versus 14 each protein-protein interaction interface that has 5. Also it is 15 not clear how a single inviability threshold F * effectively maps 16 to these pair-wise incompatibilities, complicating the picture 17 further.

18
However, as the scaled population size increases, we see 19 that the time for I mt incompatibilities to arise sharply increases, 20 while the time for I rm increases less rapidly and I rt even less 21 rapidly. This is consistent with the simple model of transcription 22 factor DNA binding described in Khatri and Goldstein (2015b) 23 and as observed with the hybrid DMIs in Fig.4, as E MB , which 24 contributes most to I mt is under the greatest selection pressure  There is still very little understood about the underlying genetic 12 basis that gives rise to reproductive isolation between lineages.

13
Gene expression divergence is thought to be a strong determi- sizes), which is as predicted by Orr's framework (Orr 1995), the 37 underlying reason is very different in these models and arises as 38 the common ancestor is likely to be close to the inviable region 39 that gives non-functional binding (Khatri and Goldstein 2015b). slower speciation rate for animals with large ranges or popula-56 tion sizes (Mayr 1970(Mayr , 1954Rubinoff and Rubinoff 1971;Cooper 57 and Penny 1997). In addition, there is more direct evidence from 58 the net rates of diversification (Coyne and Orr 2004)  sizes has a characteristic negative curvature on a log-log plot, 1 predicted theoretically by Khatri and Goldstein (2015a), indi-2 cating that hybrid traits randomly diffuse, a simple model of 3 diffusion does not fit the simulation data well; instead a model 4 of sub-diffusion, that would arise if there are a number of kinetic 5 traps giving a broad distribution of substitution times, does 6 fit the data well. This is consistent with the finding that the 7 genotype-phenotype map has a rough fitness landscape, which 8 is only revealed at sufficiently large population sizes (Khatri et al. 9 2009). These predictions can be tested empirically by more de-  There is an inherent simplicity with our gene regulatory mod-53 ule for spatial patterning, which requires only two proteins to 54 bind to a regulatory region to turn on transcription; a key direc-