Using Object Oriented Bayesian Networks to Model Linkage, Linkage Disequilibrium and Mutations between STR Markers

In a number of applications there is a need to determine the most likely pedigree for a group of persons based on genetic markers. Adequate models are needed to reach this goal. The markers used to perform the statistical calculations can be linked and there may also be linkage disequilibrium (LD) in the population. The purpose of this paper is to present a graphical Bayesian Network framework to deal with such data. Potential LD is normally ignored and it is important to verify that the resulting calculations are not biased. Even if linkage does not influence results for regular paternity cases, it may have substantial impact on likelihood ratios involving other, more extended pedigrees. Models for LD influence likelihoods for all pedigrees to some degree and an initial estimate of the impact of ignoring LD and/or linkage is desirable, going beyond mere rules of thumb based on marker distance. Furthermore, we show how one can readily include a mutation model in the Bayesian Network; extending other programs or formulas to include such models may require considerable amounts of work and will in many case not be practical. As an example, we consider the two STR markers vWa and D12S391. We estimate probabilities for population haplotypes to account for LD using a method based on data from trios, while an estimate for the degree of linkage is taken from the literature. The results show that accounting for haplotype frequencies is unnecessary in most cases for this specific pair of markers. When doing calculations on regular paternity cases, the markers can be considered statistically independent. In more complex cases of disputed relatedness, for instance cases involving siblings or so-called deficient cases, or when small differences in the LR matter, independence should not be assumed. (The networks are freely available at http://arken.umb.no/~dakl/BayesianNetworks.)


Introduction
There are several areas of applications motivating this paper. The general problem is to determine the most likely pedigree and in this paper we discuss models to achieve this goal. It is well known that linkage analysis performed to locate disease mutations may be misguided if the pedigree is incorrectly specified as will be the case if for instance false paternities are not detected. Similarly, association analyses frequently assume that all individuals are unrelated and again deviations from this assumption may affect conclusions. In forensic cases, for instance paternity cases or identification following disasters, establishing the most likely pedigree is the main objective. Traditionally forensic applications have been based on unlinked markers in linkage equilibrium. For some applications however, these assumptions have been questioned [1,2,3] Furthermore, the conventional markers used in forensics may not have sufficient power to resolve some cases, e.g. family relationships involving more distant relations than siblings [4,5,6]. It is therefore an urgent need to consider methods and practical implementations for more general markers and this is the main objective for this paper.
The evidence is conventionally summarized by the LR (likelihood ratio) [7]. The LR is the probability of the data given one hypothesis (for instance that a specific man is the father) divided by the probability conditioned on an alternative hypotheses (for instance that some unknown man is the father). A large value of the LR results in a man being declared to be a father. In immigration cases, LR calculations can be decisive when decisions are made on granting immigration. It follows that biased LR calculations resulting from unwarranted assumptions may have serious consequences. As far as we know, methods and implementations accounting for linkage, linkage disequilibrium and mutation have not previously been presented.
A forensic example involving two short tandem repeat (STR) loci, D12S391 and vWa, will serve as a motivating case. These markers are located on chromosome 12 only 6.3 Mb apart, but the genetic distance has been estimated to be as large as 10.8 cM [3]. Following the introduction of D12S391 to the new European forensic standard set [8], questions has been raised as to whether the markers can be considered statistically independent when assessing the evidence in specific cases. In addition, studies have been performed to determine whether the physical proximity of the markers has caused linkage disequilibrium (LD) and whether this should be taken into account [1,3]. Moreover, Phillips et al. recently published an overview of the commercial STR kits describing several pairs of markers separated by less than 50 cM [9]. Commonly used software for likelihood ratio calculations, such as Familias [10] and DNAView [11] do not consider linkage or linkage disequilibrium in statistical calculations. Although programs exist which model linkage, they are often more complicated to use and it may be necessary to navigate a command line user-interface, e.g. Merlin [12,13]. In addition, to our knowledge, there is no complete model which simultaneously handles linkage, LD and mutations.
Object Oriented Bayesian Networks (OOBN) may provide an alternative solution with an appealing graphical interface. The object-oriented approach also provides a simple user-interface, hiding the complexities within the objects (nodes) [14]. In the model, the nodes contain sub-structures such as states, conditional probability tables and so forth. The nodes are connected to other nodes and the interplay is governed by probabilities within each node.. Several studies have already shown the advantages of using OOBN in forensic contexts [15,16,17,18,19,20]. Taroni et al. [15] offers a thorough introduction to the basic methodology. We used the freeware GeNIe (http://genie.sis.pitt.edu) to create the Bayesian networks. One alternative is the commercially available Hugin (http://www.hugin.com).
In this paper we model linkage, linkage disequilibrium, and mutations in a single Bayesian network (BN), freely available at http://arken.umb.no/,dakl/BayesianNetworks/. We present networks for some basic relationships, but the model can easily be extended to other pedigrees as well. In addition to previous investigations, this provides an alternative approach to the study of LD between D12S391 and vWa, but also more generally when studying pairs of linked STR markers. In contrast to other studies, which often measures the disequilibrium, or association of alleles, in terms of an r 2 value or a p-value depending on a sample size, our intention was to investigate the effects of LD on actual cases.

Materials and Methods
In order to model linkage disequilibrium (LD), haplotype probabilities must be estimated. A simplified model was constructed (Tillmar et al. [21]), based on a Dirichlet distribution, providing non-zero probability estimates also for unseen haplotypes. Specifically, a diallelic haplotype probability f ij was estimated with f ij~( c ij zlp i q j )=(Czl) where c ij is the observed count of the haplotype among C unrelated individuals, p i and q j are the allele frequencies of the two alleles, and l is a constant, set to 1 in the computations below. Further, to incorporate this into a Bayesian Network (BN) the haplotype probabilities were used to construct conditional allele probabilities, i.e. based on what allele is observed at the first locus we estimated the conditional probability of observing each allele at the second locus.
In order to obtain haplotype counts, we used data from regular trio paternity cases. When the parenthood is established and no mutations are present, the phase, i.e., the haplotypes can be deduced for the child using a simple algorithm. There are, however, ambiguous cases where the haplotypes cannot be determined for the child, e.g. when the parents and the child are all heterozygous for the same alleles. Out of 450 selected trios, 6 where discarded due to more than one possible haplotype configuration. As these ambiguous cases constitute only 1.3% of the total cases, it was not considered to bias the calculations enough to influence the conclusions. Notice that the phased haplotypes for the father and mother, based on the child's genotypes, are generally unknown since recombination might have occurred. Although reasonable estimates of the parents' haplotypes can be obtained, e.g. through the EM-algorithm or Gibbs sampling (PHASE by Stephens et al. and IMPUTE2 by Howie et al., [22,23]), we found that haplotype probabilities computed this way did not differ much from those based on the children and therefore used the latter for simplicity (data not shown). Moreover, it is well known that the LD as measured by D declines with (1recombination rate) per generation and hence,one generation will only have a minor impact on the disequilibrium.

Data
A selection of 444 unrelated Norwegian trios were used to estimate allele and haplotype probabilities at the STR loci D12S391 and vWa (using only the genotypes from the children). Table 1 describes the allele frequencies; in total 8 different alleles were observed at vWa and 16 different alleles at D12S391. To estimate haplotype probabilities, the number of observations for each haplotype was first counted (using the data from the children). In total, 100 different haplotypes were observed out of 128 possible. Haplotype probabilities were then estimated as described above. (Tables S1 and S2 provide further details on the observed haplotype frequencies and the estimated haplotype probabilities). To calculate the conditional probability of each D12S391 allele given a specific vWa allele, each column in Table  S1, containing the observed haplotype probabilities, is normalized to 1. Table 2 describes the calculated conditional allele probabilities. Conditioning rather on D12S391 would of course lead to the same results.

Network
A simple Bayesian network describing a paternity case is illustrated in Fig. 1, the network is more or less self-explanatory and presents the given problem in a intuitive way. It is worth pointing out that as more parameters (i.e. recombination rates, LD and mutations), markers and more distant relationships are considered, the network grows in complexity and can become visually incomprehensible. This can be counteracted by rearranging the most relevant network nodes in a simpler way, hiding the complexity from the user. The networks created in this study use a simple naming convention, based on few abbreviations, but larger networks might require shorter node names. All networks are freely available at http://arken.umb.no/,dakl/ BayesianNetworks/. In addition we provide a short user manual as well as a software to generate the networks based on your own data.
Two different scenarios were considered; a regular paternity case, Fig. 2 and a case of disputed siblingship, Fig. 3. For each network the user can vary recombination rate, decide whether to use conditional allele probabilities, based on Table 2, or allele frequencies, see Table 1. In the paternity network the user can decide whether to instantiate the mother's genotypes (trio) or to leave them unknown (duo), i.e. to use the allele frequencies. In the sibling network the hypotheses compare whether the two persons are unrelated or full siblings. (A separate network was also constructed for a halfsibling case when the siblings are known to share the same mother, see Fig. S1.) The parents' genotypes can be instantiated if available, otherwise allele frequencies will be used. The network in Fig. 1 is in principle equal to the one described by Taroni et al. for a paternity case [15]. The main differences lie in the existence of a Recombination node as well as an LD node. The Recombination node describes the probability for a cross-over to occur, i.e. the recombination rate. Also for each possible inheritance of a D12S391 allele, the P/M nodes transmit whether the Paternal or Maternal vWa allele have been passed on. The LD node is also connected to each possible inheritance of a D12S391 allele. If the LD node is instantiated to Yes, conditional allele probabilities will be used. The Mutation nodes contain a transition matrix. In this study a simple mutation model was used, where each transition has an equal probability of occurring, i.e. m/ (n21), where m is the mutation rate and n is the number of alleles. Mutation rates for each locus were obtained from a local database. The Child Paternal Allele (CPA) nodes are subject to the Hypothesis node (Either Is Father or Are Siblings depending on the network), with states Yes and No. The Hypothesis node will display the posterior probabilities for the given relationships. The tables for the CPA nodes are based on the Alleged father given that he is the father and the allele frequencies if he is not the father. Also, if the LD node is set to Yes, conditional allele probabilities for the D12S391 allele will be used. (Please see user manual for a more complete description.)

Results
The networks were tested on a selection of real cases where the likelihood ratio (LR), assuming marker independence, had already been calculated using the software Familias [10]. In addition an attempt was made to create a worst-case-scenario (WCS) regarding linkage disequilibrium, i.e.,selecting the haplotypes where the observed haplotype frequencies deviated maximally from the expected haplotype frequencies, see Table S1 and S2. The genotypes used in the WCS include rare alleles and as a consequence also often unobserved haplotypes. Table 3 describes the results from the likelihood ratio calculations. Each case was investigated using three different methods. The method denoted M1 in Table 3 is equivalent to the most commonly used approach in forensic laboratories, where the markers vWa and D12S391 are considered to be independent, i.e. recombination rate of 50%, and allele frequencies are utilized. In the two remaining methods, denoted M2 and M3 in Table 3, a recombination rate of 9% was used in accordance with previous studies by Budowle et al. [3]. In addition, the decision of whether to use conditional allele probabilities were evaluated, using in M2 allele frequencies (Table 1) and in M3 conditional allele probabilities (Table 2). Quotients between the LR values obtained using each method are To further test the method, we also created a network where instead of using data from D12S391 and vWa we used data from two other closely located markers, D5S818 and CSF1PO (Table 4).  A recombination rate of 0.3 was used, close to the value obtained using any of the mapping functions. The results reveal that, even when comparing two markers accepted to be in LE, discrepancies can be detected. Future studies should be conducted involving markers known to be in LD. Our network can of course be extended to include more linked markers in LD.

Discussion
We have demonstrated the application of Object Oriented Bayesian Networks in modeling linkage, linkage disequilibrium and mutations in cases of disputed genetic relatedness. As an example, we present data from a pair of STR markers, vWA and D12S391, recently studied with regards to possible linkage disequilibrium. Two different networks were created to investigate a selection of actual cases as well as fictional, see Worst Case Scenarios in Table 3. The small differences in calculated LRs between method M1 (not considering linkage and LD) and the Comparison are due to the use of slightly different allele frequency databases, where the Comparison LR has been calculated using a Norwegian population database utilized in routine casework. However, it is notable that the differences between the results using any of M1, M2 (10% recombination rate and LD not considered) or M3 (10% recombination rate and LD is considered) are in many cases comparable to the differences between M1 and the Comparison methods. Consequently, the differences between method M2 and M3, allele frequencies versus conditional allele probabilities, can perhaps be considered as merely a small bias in the estimation of allele frequencies.
Since linkage has previously been measured between vWa and D12S391, the most important concern of this paper is to evaluate the effect of using conditional allele probabilities as measured by the quotient between the LR values obtained using methods M3 and M2, see Table 3. None of the real cases display a quotient LR M2 /LR M3 larger than 2, and for most of the cases the quotient is close to 1. Also, the Worst Case Scenarios do not display quotients larger than 4. We should of course always expect some differences since no data will indicate exact linkage equilibrium (Table 4). Whereas our study has only included a small selection of real cases, we are aware that larger studies considering hundreds of cases should be conducted and also that our results, concerning possible LD between vWA and D12S391, are partly anecdotal. A recent paper by Gill et al. provides further evidence and discussion on the matter [2].
Haplotype frequencies are generally hard to estimate as genotype data does not normally indicate which chromosome, i.e. paternal or maternal, each allele is located on. New methods, such as mass-sequencing provide means to determine each chromosomal setup, but given current forensic casework, using STR markers, one might instead rely on the massive amount of available data from families (trios mainly) where the haplotypes from the children can, in most cases, be unambiguously determined, as long as the possibility of mutation is disregarded. In our study we used 444 phased unrelated children, i.e. 888 haplotypes, to determine the observed as well as expected haplotype frequencies. We observed 100 of a total of 128 possible haplotypes. An important consideration is if this is enough material for a reliable estimation of population haplotype frequencies? In particular, can we reliably estimate the probability of observing a haplotype that has not been observed in the database? The same dilemma exists when previously unseen or new alleles are observed in regular genotyping, but for haplotypes one may use allele frequencies to construct a reasonable guess at a probability. Our formula contains a parameter l which loosely corresponds to the pseudo-counts often used in the estimation of population allele frequencies. Although a value for l might be estimated for data, we have simply used l = 1. This gives the initial estimates, constructed as products of allele frequencies, the same weight as a single haplotype observation, leading to fairly small estimates of conditional probabilities for unobserved haplotypes.

Conclusions
An imminent practical concern for forensic laboratories using closely located STR markers, such as the pair studied in this paper, is how computations should be performed with such data. One issue is whether linkage must be taken into account. Though statistical calculations in regular paternity cases is not affected by linkage and disputed paternities make up the majority of cases for most labs, we believe that in sibling cases and other more extended relationships, linkage should be taken into account. We recommend that forensic labs perform sensitivity calculations and/or simulations to investigate the effect of recombination rate, especially in kinship analyses and deficient paternity cases. The recently released software FamLink provides features to perform such analyses [24]. In addition to STR markers, our model can easily be extended to accommodate SNP data. In fact, the networks available at our repository are able to handle diallelic markers, but to process high throughput data a more automated system is needed. The other major concern, besides recombination, is whether to use conditional allele probabilities, i.e. to account for linkage disequilibrium. All calculations are affected by the use of such probabilities, even standard paternity and match probability calculations. The effect on the marker pair vWA/D12S391 is, according to our results, reasonably small. In addition, the marker pair D5S818/CSF1PO displays equal deviation from expectation, further corroborating results in previous studies. Moreover, our implementation heavily depends on the estimates of conditional allele probabilities, which are currently fairly uncertain. We have illustrated how estimates can be generated based on data from trios, but clearly much larger datasets are needed to reduce the uncertainty. Furthermore, other models to approach the problem with unseen haplotypes should be considered.
Nevertheless, this paper demonstrates how software implementing Object Oriented Bayesian Networks can be used to assemble and code models reasonably quickly, and how these models can subsequently be used to explore complex questions about the interplay between genetic phenomena such as linkage, LD, and mutations. The models can then in fact be used for relevant computations in actual cases. We present Bayesian networks for two basic relationships, available at http://arken.umb.no/,dakl/ BayesianNetworks/, which can be used as prototypes for investigations of linkage and linkage disequilibrium for pairs of closely located STR markers. Figure S1 Bayesian network describing a sibling case, where the children are known to share the same mother. The nodes P/M tell whether the vWa paternal or maternal allele is inherited. The P/ M node connected to the D12S391 allele also contains the recombination frequency. The LD node is connected to the paternal and maternal allele nodes and decides whether or not to use conditional allele frequencies. Furthermore, the node Are Siblings? contains the different hypotheses.

Supporting Information
(DOC)