On the Recombination Rate Estimation in the Presence of Population Substructure

As recombination events are not uniformly distributed along the human genome, the estimation of fine-scale recombination maps, e.g., by the HapMap Project, has been one of the major research endeavors over the last couple of years. For simulation studies, these estimates provide realistic reference scenarios to design future studies and to develop novel methodology. To achieve a feasible framework for the estimation of such recombination maps, existing methodology uses sample probabilities for a two-locus model with recombination, with recent advances allowing for computationally fast implementations. In this work, we extend the existing theoretical framework for recombination rate estimation to the presence of population substructure. We show under which assumptions the existing methodology can still be applied. We illustrate our extension of the methodology with an extensive simulation study.

[...] for fixed $N$. Let $0 \le r_N \le \tfrac{1}{2}$ denote the recombination fraction. Therefore, we get the term $\theta^N_{ij;kl,mn}$ = [...] as the probability of drawing an $A_iB_j$ gamete from an $A_kB_l/A_mB_n$ genotype. The Markov chain $Z^N$ in $E_N := K^N_1 \times \cdots \times K^N_\Gamma$ is defined as follows. Here we have [...]. The equations for the corresponding steps in the life cycle can be written as $P^*_{\alpha ij} = \sum_{k,l,m,n} \theta^N_{ij;kl,mn} P_{\alpha kl} P_{\alpha mn}$ (A.4) and [...]. The last step is formulated as [...]. As mentioned, this Markov chain is a combination of the two-locus Markov chain in [1] and the one-locus migration model in [2].
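The gametic step (A.4) can be sketched numerically for a single subpopulation. The explicit form of theta used below is an assumption: it is the common textbook choice in which each parental gamete is transmitted with probability (1 - r)/2 and each recombinant gamete with probability r/2, and it is not taken from the text above. The function names and the 2 x 2 frequency table are hypothetical and serve only as illustration.

```python
from itertools import product


def theta(i, j, k, l, m, n, r):
    """Probability that an A_k B_l / A_m B_n genotype yields an A_i B_j gamete.

    Assumed textbook form (not taken from the source): each parental gamete
    is transmitted with probability (1 - r) / 2, each recombinant gamete
    with probability r / 2.
    """
    def delta(a, b):
        return 1.0 if a == b else 0.0

    parental = delta(i, k) * delta(j, l) + delta(i, m) * delta(j, n)
    recombinant = delta(i, k) * delta(j, n) + delta(i, m) * delta(j, l)
    return 0.5 * (1.0 - r) * parental + 0.5 * r * recombinant


def gametic_step(P, r):
    """Apply (A.4): P*[i][j] = sum over k,l,m,n of theta * P[k][l] * P[m][n]."""
    na, nb = len(P), len(P[0])
    P_star = [[0.0] * nb for _ in range(na)]
    for i, j in product(range(na), range(nb)):
        P_star[i][j] = sum(
            theta(i, j, k, l, m, n, r) * P[k][l] * P[m][n]
            for k, l, m, n in product(range(na), range(nb), range(na), range(nb))
        )
    return P_star


def disequilibrium(P):
    """Linkage disequilibrium D = P_00 - p_0 * q_0 for a 2 x 2 table."""
    p0 = P[0][0] + P[0][1]  # marginal frequency of the first allele at locus A
    q0 = P[0][0] + P[1][0]  # marginal frequency of the first allele at locus B
    return P[0][0] - p0 * q0


# Hypothetical gamete frequencies in one subpopulation (rows: A, columns: B).
P = [[0.4, 0.1], [0.2, 0.3]]
r = 0.1
P_star = gametic_step(P, r)
```

Under this assumed theta, the step reduces to $P^*_{ij} = (1 - r) P_{ij} + r\, p_i q_j$, where $p_i$ and $q_j$ are the marginal allele frequencies, so the marginals are unchanged and the linkage disequilibrium shrinks by the factor $1 - r$.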

B Diffusion approximation
The following statements are pure restatements of the results in [3], chapter 10.
Introduce two compact, convex subsets $K$ and $H$ of $\mathbb{R}^m$ and $\mathbb{R}^n$, respectively, with nonempty interior. Assume that $0 \in H$. In the following, we abbreviate "strongly continuous contraction semigroup" as SCCSG.
Lemma A.1. Let $c : K \times \mathbb{R}^n \to \mathbb{R}^n$ be of class $C^2$ and such that the solution $Y(t, x, y)$ of the differential equation [...]. Then there exists a compact set $E$ [...] such that $(x, Y(t, x, y)) \in E$ for all $t \ge 0$, and the formula [...] defines a SCCSG on $C(E)$ (with sup norm). The generator $B$ of $S(t)$ has $C^2(E) = \{f|_E : f \in C^2(\mathbb{R}^m \times \mathbb{R}^n)\}$ as a core and the form [...]. Finally, [...].
Lemma A.2. Given $\delta_\infty > 0$, let $c : K \times \mathbb{R}^n \to \mathbb{R}^n$ be continuous, such that the solution $Y(k, x, y)$ of the difference equation [...], which exists for all $(k, x, y) \in \mathbb{Z}_+ \times K \times H$, satisfies [...]. Then there exists a compact set $E$, with $K \times H \subset E \subset K \times \mathbb{R}^n$, such that $(x, y) \in E$ implies $(x, Y(k, x, y)) \in E$ for $k = 0, 1, \ldots$, and the formula [...], where $V$ is a Poisson process with parameter $\delta_\infty^{-1}$, defines a SCCSG $S(t)$ on $C(E)$. The generator $B$ of $S(t)$ is the bounded linear operator [...].
Theorem A.3. [...] Let each of the functions $a : K \times \mathbb{R}^n \to \mathbb{R}^m \otimes \mathbb{R}^m$, $b : K \times \mathbb{R}^n \to \mathbb{R}^m$ and $c : K \times \mathbb{R}^n \to \mathbb{R}^n$ be continuous and suppose that, for $i, j = 1, \ldots, m$ and $l = 1, \ldots, n$, [...], and assume that the closure of $\{(f, Gf) : f \in C^2(K)\}$ is single-valued and generates a Feller semigroup $U(t)$ on $C(K)$, corresponding to a diffusion process $X$ in $K$. Suppose further that $c$ satisfies the conditions of Lemma A.1 if $\delta_\infty = 0$ or of Lemma A.2 if $\delta_\infty > 0$. Then the following conclusions hold: [...]
[...] it is sufficient to check [...] (1), for all $i, j = 1, \ldots, m$, (A.14) for (A.9) to be satisfied. Additionally, it is sufficient to check [...] to obtain (A.10).

C Diffusion limit
We define the state space for the diffusion process $X$ [...]. The generator for $X$ is given by [...]. With the help of Theorem A.3, we can observe the following facts. [...] which follows from the properties of the multinomial distribution. The aim is now to bring the parameters of the underlying mechanisms into play, so that we can use the scaling behavior we assumed.
We obtain [...]. Here we used the fact that $\sum_{k,l} P_{\alpha kl} = 1$. As the next step, we resolve the expression $P^{***}_{\alpha ij}$. Recalling (A.1) and (A.2), multiplication directly implies [...]. The fact that $c$ satisfies the conditions of Lemma A.2 can be deduced with the same argument as in [2] (p. 111).
[...] implies (A.12). Ethier argued in [4] and [5] that the closure of the generator (A.17) generates a Feller semigroup. Hence we can apply Theorem A.3.

2.) See 1.) and conclusion b) of Theorem A.3.
Note that we can use the more convenient form for the state space [...] and express the generator as [...]. The latter representation is often used in the context of population genetics.

D Proof of Lemma 1
Fix an $N \in \mathbb{N}$, denote the relative frequencies for $Z^N$ by $P_{\alpha ij}$, and fix $\alpha, i, j$.
[...] $\sum_{i,j} P^{**}_{\alpha ij} = 1$ together with the parameter assumption implies $0 < P^{***}_{\alpha ij} < 1$ for all $(P_{\alpha ij})_{\alpha,i,j} \in E_N$. By the properties of the multinomial distribution of $(P_{\alpha ij})_{\alpha,i,j}$ in (A.7), we observe that $Z^N$ is aperiodic and irreducible. This proves the statement.
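The argument can be illustrated on a much smaller chain. The sketch below is a toy stand-in and not the chain $Z^N$ of the paper: it assumes a one-locus, two-allele Wright-Fisher model with a hypothetical symmetric mutation rate `u`, so that the sampling probability lies strictly between 0 and 1, as $P^{***}_{\alpha ij}$ does in the proof. With binomial (two-category multinomial) sampling, every one-step transition probability is then positive, which gives irreducibility (every state is reachable in one step) and aperiodicity (every state has a positive self-loop).

```python
from math import comb


def transition_matrix(M, u):
    """Toy Wright-Fisher chain on {0, ..., M} copies of allele A.

    M (number of sampled gametes) and u (symmetric mutation rate) are
    hypothetical illustration parameters, not quantities from the paper.
    """
    rows = []
    for k in range(M + 1):
        p = k / M
        # After mutation the sampling probability is strictly inside (0, 1)
        # whenever 0 < u < 1, mirroring 0 < P*** < 1 in the proof.
        p_mut = p * (1.0 - u) + (1.0 - p) * u
        rows.append(
            [comb(M, j) * p_mut**j * (1.0 - p_mut) ** (M - j) for j in range(M + 1)]
        )
    return rows


T = transition_matrix(M=6, u=0.05)

# Every entry positive: one-step reachability of every state from every state.
all_positive = all(entry > 0.0 for row in T for entry in row)
```

Without mutation (`u = 0`), the states 0 and `M` would be absorbing and the positivity check would fail, which is why the interior condition on the sampling probabilities matters.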

E Proof of Lemma 3
The proof of Lemma 3 uses the objects from the derivations and statements of Theorem A.3. For every $N \ge 1$, $\mu_N \circ \Phi_N^{-1} \in \mathcal{P}(K)$. Since $K$ is compact, we can choose a subsequence of $(\mu_N \circ \Phi_N^{-1})_N$ which converges to a weak limit point $\mu_0 \in \mathcal{P}(K)$. Rename the subsequence to $(\mu_N \circ \Phi_N^{-1})_N$. Introduce $\pi_N : C(K) \to C(E_N)$ by $(\pi_N f)(z) = f(\Phi_N(z))$, since $K$ is compact. Choose $T > 0$, $f \in C(K)$ and $0 \le t \le T$; then we can write [...]. From this we can conclude [...] for $f \in C(K)$. Since $C(K)$ is separating and the stationary distribution $\mu$ is unique, we know that every weak limit point of $\mu_N \circ \Phi_N^{-1}$ coincides with $\mu$, and therefore the sequence converges weakly to $\mu$. This shows the convergence of the discrete stationary distributions to the stationary distribution of the diffusion process. We also know that $Y^N \Rightarrow 0$ as $N \to \infty$, which means that the deviations from the global frequencies within the subpopulations vanish. This immediately implies the statement of the lemma by the form of $P$.