Figure 1.
Bad MCMC mixing for cases of double genotype sharing.
MaCH and similar approaches implement a Markov chain Monte Carlo scheme in which, in each iteration, the individual genotype resolutions are updated one by one by mapping the genotypes. If two individuals have identical marker genotypes over a long stretch of markers, the hidden Markov model will give the other individual a probability approaching 1. When no reference haplotypes are provided, all haplotype data are initialized randomly. In this series of panels, individuals A and B are initialized differently (a). In panel (b), A is updated. With high probability, the existing (random) haplotype resolution from B is copied. When B is updated (c), A is sampled with high probability, replicating the original random data for B. In iteration 2, A is updated again (d), but again B is sampled with high probability. Since any haplotype resolution for A will match the genotypes for B, there is no pressure to identify a better resolution. The two individuals form a local feedback loop with no true mixing in the Markov chain. Our modified algorithm lowers the probability of sampling from a mirror individual (as in the pair A and B), thus allowing haplotypes from other individuals in the dataset to influence the final resolution. Similar cases can also arise with groups of more than two individuals. Those are also handled successfully by our remedy.
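The effect of down-weighting a mirror individual can be illustrated with a toy calculation. The sketch below is not MaCH's hidden Markov model and not necessarily the modification used in this work: it scores each candidate individual by a simple per-marker genotype match, and the names `copy_probabilities`, `mismatch`, and `mirror_penalty`, as well as the example genotypes, are hypothetical and chosen only for illustration.

```python
from math import prod


def copy_probabilities(target, genotypes, mismatch=0.01, mirror_penalty=1.0):
    """Toy probabilities of copying from each other individual.

    Assumption: a crude stand-in for the HMM. Each candidate is weighted
    by a per-marker genotype match score, and a candidate whose genotypes
    are identical to the target's (a "mirror") has its weight multiplied
    by the hypothetical factor mirror_penalty.
    """
    target_gt = genotypes[target]
    weights = {}
    for name, gt in genotypes.items():
        if name == target:
            continue  # an individual never copies from itself
        w = prod((1.0 - mismatch) if a == b else mismatch
                 for a, b in zip(target_gt, gt))
        if gt == target_gt:
            w *= mirror_penalty  # down-weight the mirror individual
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Genotypes coded as 0/1/2 copies of an allele; A and B are identical.
genotypes = {
    "A": [0, 1, 2, 1, 1, 0, 2, 1, 1, 1],
    "B": [0, 1, 2, 1, 1, 0, 2, 1, 1, 1],
    "C": [0, 1, 2, 1, 0, 0, 2, 1, 2, 1],
    "D": [1, 1, 2, 0, 1, 0, 2, 1, 1, 0],
}

original = copy_probabilities("A", genotypes)
modified = copy_probabilities("A", genotypes, mirror_penalty=1e-4)
```

Without a penalty, nearly all of the probability mass for A falls on B, which mirrors the feedback loop in panels (b) to (d). With the penalty, the imperfectly matching individuals C and D receive a non-negligible share.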
Table 1.
Comparison between original and modified MaCH.
Figure 2.
Comparison of switch error locations.
Switch errors for all markers for CEU trio parents on chromosome 21, plotted in order left to right, top to bottom ( markers). For each marker, red intensity indicates the switch error rate across all 30 parents using the original MaCH 1.0.17 algorithm, while green intensity indicates the error rate using our proposed modification. Hence, yellow indicates regions where errors are shared. The bad chain mixing we describe for the original algorithm manifests as contiguous (horizontal) blocks of repeated switch errors, while the error rate using the modified algorithm is 50% lower in total. The errors from the modified algorithm are more evenly distributed, and several of their locations coincide with errors from the original method. This figure also shows that even when the overall haplotype quality, measured by error rate, is acceptable, some regions can still be heavily affected; paradoxically, those are the regions where multiple individuals share both haplotypes identical by descent.
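The color encoding described in this caption can be sketched as a mapping from two per-marker error rates to one RGB pixel. This is an illustration only: the function name `error_rates_to_rgb`, the 8-bit scaling, and the black background for error-free markers are assumptions, not details taken from the figure.

```python
def error_rates_to_rgb(rate_original, rate_modified):
    """Map two switch error rates in [0, 1] to an (R, G, B) pixel.

    Assumption: red encodes the original algorithm, green the modified
    one, and the blue channel is unused, so errors shared by both
    methods appear yellow and error-free markers appear black.
    """
    for rate in (rate_original, rate_modified):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("error rates must lie in [0, 1]")
    return (round(255 * rate_original), round(255 * rate_modified), 0)


only_original = error_rates_to_rgb(1.0, 0.0)  # pure red
only_modified = error_rates_to_rgb(0.0, 1.0)  # pure green
shared = error_rates_to_rgb(1.0, 1.0)         # yellow
```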