JL contributed algorithm and experiment design, implementation and running of the experiments, and writing of the paper. ZLB organized the data and aided in the writing and experiment design. CK helped make the public Web tool version of the algorithm available. GX provided HLA typing of private data. BDW organized the data. MC provided HLA typing of private data, and aided in the writing. PJG presented the need for statistical HLA refinement. DH contributed algorithm and experiment design, and contributed to the writing of the paper.

A tool based on our approach will be made freely available for research purposes at

High-resolution HLA typing plays a central role in many areas of immunology, such as in identifying immunogenetic risk factors for disease, in studying how the genomes of pathogens evolve in response to immune selection pressures, and also in vaccine design, where identification of HLA-restricted epitopes may be used to guide the selection of vaccine immunogens. Perhaps one of the most immediate applications is in direct medical decisions concerning the matching of stem cell transplant donors to unrelated recipients. However, high-resolution HLA typing is frequently unavailable due to its high cost or the inability to re-type historical data. In this paper, we introduce and evaluate a method for statistical, in silico refinement of ambiguous and/or low-resolution HLA data. Our method, which requires an independent, high-resolution training data set drawn from the same population as the data to be refined, uses linkage disequilibrium in HLA haplotypes as well as four-digit allele frequency data to probabilistically refine HLA typings. Central to our approach is the use of haplotype inference. We introduce new methodology to this area, improving upon the Expectation-Maximization (EM)-based approaches currently used within the HLA community. Our improvements are achieved by using a parsimonious parameterization for haplotype distributions and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner. We also show how to augment our method in order to incorporate ethnicity information (as HLA allele distributions vary widely according to race/ethnicity as well as geographic area), and demonstrate the potential utility of this experimentally. A tool based on our approach is freely available for research purposes at

At the core of the human adaptive immune response is the train-to-kill mechanism in which specialized immune cells are sensitized to recognize small peptides from foreign sources (e.g., from HIV or bacteria). Following this sensitization, these immune cells are then activated to kill other cells which display this same peptide (and which contain this same foreign peptide). However, in order for sensitization and killing to occur, the foreign peptide must be “paired up” with one of the infected person's other specialized immune molecules—an HLA molecule. The way in which peptides interact with these HLA molecules defines if and how an immune response will be generated. There is a huge repertoire of such HLA molecules, with almost no two people having the same set. Furthermore, a person's HLA type can determine their susceptibility to disease, or the success of a transplant, for example. However, obtaining high quality HLA data for patients is often difficult because of the great cost and specialized laboratories required, or because the data are historical and cannot be retyped with modern methods. Therefore, we introduce a statistical model which can make use of existing high-quality HLA data, to infer higher-quality HLA data from lower-quality data.

The Major Histocompatibility Complex (MHC), located on the short arm of chromosome 6, encodes the Human Leukocyte Antigen (HLA) class I and II genes, whose protein products play an essential role in the adaptive immune response. The HLA class I and class II proteins bind antigenic, pathogen-derived peptides (called

High-resolution HLA typing (meaning the determination of the specific HLA alleles which an individual expresses at each of the class I and/or class II loci) is an essential tool for basic as well as clinical immunology research. For example, HLA typing has been used to identify immunogenetic risk factors for human diseases

Historically, HLA typing was performed using low-resolution, antibody-based serological tests. However, higher-resolution HLA typing is now achievable using more modern, molecular (DNA-based) methods. Molecular methods for HLA typing include hybridization with sequence-specific oligonucleotide probes (SSOP), PCR amplification with sequence-specific primers (PCR-SSP), and more recently, DNA sequence-based methods. Generally, DNA sequence-based methods involve locus-specific PCR amplification of exons 2 and 3 (for HLA Class I genes), or exon 2 only (for HLA class II), followed by “bulk” DNA sequencing of the amplified product (i.e., sequencing of products derived from both HLA haplotypes). Sequencing is restricted to exons 2 and/or 3 because these regions are the major determinants of HLA peptide-binding specificity and thus contain enough information to discriminate between most allele combinations. If an individual is heterozygous (i.e., possesses two different alleles) at any locus, direct sequencing of an amplified PCR product will yield nucleotide mixtures at positions in which the two alleles differ in sequence. Consequently, there are two reasons why modern sequence-based typing methods may yield ambiguous typing results: first, if the differences between the two alleles are located outside the genotyped region (in most cases, exons 2 and/or 3), and secondly, if two or more allele combinations yield the exact same pattern of heterozygous nucleotide mixtures when combined into a “bulk” sequence.

Because of the great (and ever-increasing) number of HLA alleles (and thus growing list of ambiguous combinations), unambiguous HLA typing is costly, laborious, and limited to laboratories specializing in this work. For the purposes of scientific research, HLA types are not always unambiguously determined; rather, they are only determined up to some “resolution” (i.e., level of ambiguity). Additionally, because the number of HLA alleles is constantly increasing, sequence-based, SSOP and SSP based typing results, which depend on the list of known alleles, require constant re-interpretation in light of newly discovered alleles. This re-interpretation can result in more ambiguity than originally thought

The practical consequence of these issues is that there is a large incongruence between the high-resolution HLA typing required for scientific investigations and the HLA data that is widely available. As such, any method which can help to increase resolution of HLA data,

Our improvements are achieved by using a parsimonious parameterization, and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner. We also show how to augment our method in order to make use of data arising from different ethnic backgrounds, and show the potential use of this experimentally. Our method is evaluated using data from various sources, and from various ethnicities, as described in the Experimental section. Additionally, an implementation of our method is available for community-wide use.

HLA nomenclature is closely tied to the levels of possible HLA ambiguity. Each HLA allele is assigned a letter (or letters) that designates the locus (e.g., A, B, and C for class I; DRA, DRB1, DRB2-9, DQA1, DQB1, DPA1, and DPB1 for class II). This letter is followed by a sequence of numbers, such as A*0301 for one allele at the A locus. The first two digits describe the allele type; in most cases the first two digits correspond to the historical serological antigen groupings.

The third and fourth digits are used to designate the allele subtypes, wherein alleles are assigned numbers from 01–99 roughly according to their order of discovery. A minimum of four digits thus uniquely defines any allele: by definition, any two alleles which differ in their four-digit number, differ by at least one amino acid. For example, A*0301 and A*0302 do not encode the same protein sequence. Because two-digit names are exhausted after 99 alleles, there are a few oddities in the nomenclature. For example, A*02 and A*92 belong to the same two-digit class as do B*15 and B*95

Assuming that HLA resolution beyond four digits is ignored, there are still various levels of ambiguity that can arise from molecular (DNA)-based HLA typing methods. For example, rather than knowing unambiguously which two A alleles a person has, one may instead know only a list of possibilities; for example, A*0301-A*3001 or A*0320-A*3001 or A*0326-A*3001. Such

Although related but different HLA alleles (for example, those alleles which share the same first two digits) sometimes share immunogenic properties, higher resolution data allows for more precise and informative downstream use (e.g.,

The input to our statistical HLA refinement method consists of two data sets. The first is the data of interest, which have not been typed unambiguously to four-digit resolution, but for which we would like to increase the resolution as much as possible. The second is a set of training data consisting of four-digit resolution HLA types for individual people, drawn from a population that is the same as (or, in practice, as similar as possible to) the population of interest for which we wish to refine HLA types. First we train our model on the training data. Then we apply this trained model to our limited-resolution data of interest. For example, if a patient in our data set of interest was typed ambiguously at the A locus as having either (1) A*0243, A*0101, or (2) A*0243, A*0122, then our statistical model assigns a probability to each of these two possibilities. More generally, our model assigns a probability to any number of possibilities (not just two), and over many loci. To date, we have used our method, without computational difficulty, to refine up to four loci with 20–130 alleles at each locus, on data sets with up to half a million possible haplotypes.

To be precise about what kind of HLA typing ambiguities our approach can tackle, we emphasize that in principle, our approach can handle any kind of ambiguity, so long as that ambiguity has been resolved in the training data set, and so long as the ambiguity can be defined as an allele or set of alleles, taking on some number of clearly defined possibilities. Two common ambiguities that are of interest to researchers are i)

At the core of our HLA typing refinement model is the ability to infer and predict haplotype structure of HLA alleles across multiple loci (from unphased data, since this is the data that is widely available). If certain alleles tend to be inherited together because of linkage disequilibrium between them, then clearly this information can help us to disambiguate HLA types—and far more so than using only the most common allele at any particular locus. We derive a method for disambiguating HLA types from this haplotype model.

Existing methods for haplotype modeling fall into three main categories:

The haplotype modeling part of our approach is most closely related to the EM-based maximum-likelihood methods, although it differs in several crucial respects. To our knowledge, all implementations of EM-based maximum likelihood haplotype models use a full (unconstrained) joint probability distribution over all haplotypes (i.e., over all possible alleles, at all possible loci) with the exception of the partition-ligation algorithms noted below. Furthermore, because they are maximum-likelihood based, they do not smooth the parameter estimates, thereby allowing for unstable (i.e., high variance) estimates of rare haplotypes. Together, these two issues make existing methods difficult to scale to a large number of loci or to a large number of alleles per locus. This scalability problem is widely known (e.g.,

We note that within the HLA community, even recently, haplotype inference seems to be exclusively performed with the most basic EM-based algorithm of Excoffier and Slatkin, and Hawley and Kidd

There are two pieces of work which tackle the allele refinement problem using haplotype information: that of Gourraud et al. in the HLA domain

Thus, issues of scalability and the specific nature of haplotypic patterns are substantially different between these two domains. With respect to methodology, Jung et al. perform imputation in a sub-optimal way. First, they apply an EM-based haplotype inference algorithm (

One other study touches on statistical HLA refinement

Before explaining our model in detail, we first explain the standard EM-based model and training algorithm used for haplotype inference _{1}, _{2}, and _{3}, with ^{i}

In this case, there would be

Given the current parameter estimates (for

Given the distribution over haplotypes/hidden states for each observed genotype, compute the maximum likelihood parameter estimates (in this case, the multinomial parameters). This is the M-step, where the parameters are maximized with respect to the

Note that in both of these steps, it is assumed that the probability of an individual's genotype data having a particular phasing is the product of the probability of each of the two haplotypes defined by the phasing. Thus this approach assumes Hardy-Weinberg equilibrium (HWE).
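The E-step and M-step described above can be sketched in a few lines of Python. This is a minimal illustration of the standard multinomial EM under HWE, not the authors' implementation: the function names `phasings` and `em_haplotype_frequencies`, the uniform initialization, and the fixed iteration count are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

def phasings(genotype):
    """All unique haplotype pairs consistent with an unphased genotype.

    genotype: tuple of (allele, allele) pairs, one pair per locus.
    """
    first, rest = genotype[0], genotype[1:]
    pairs = set()
    # Fix the orientation of the first locus; flip (or not) each other locus.
    for flips in product([False, True], repeat=len(rest)):
        h1, h2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            h1.append(b if flip else a)
            h2.append(a if flip else b)
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Maximum-likelihood haplotype frequencies from unphased genotypes."""
    candidates = [phasings(g) for g in genotypes]
    haplotypes = sorted({h for c in candidates for pair in c for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # assumed uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for c in candidates:
            # E-step: under HWE, a phasing's probability is the product of
            # its two haplotype probabilities (then renormalized per person).
            weights = [freq[h1] * freq[h2] for h1, h2 in c]
            total = sum(weights)
            for (h1, h2), w in zip(c, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: multinomial ML estimate from the expected counts.
        n_chromosomes = 2.0 * len(genotypes)
        freq = {h: counts[h] / n_chromosomes for h in haplotypes}
    return freq
```

On a toy data set in which A1 always co-occurs with B1 among homozygotes, the double heterozygote A1/A2, B1/B2 is phased almost entirely as A1-B1 / A2-B2, which is the linkage-disequilibrium effect the refinement relies on.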

As mentioned earlier, there are two main problems with this modeling approach. The first is that the number of parameters,

First, we describe a model for $p(l_1, l_2, l_3)$ that uses far fewer parameters than the full table. Using the chain rule of probability, we can write

$$p(l_1, l_2, l_3) = p(l_1)\, p(l_2 \mid l_1)\, p(l_3 \mid l_1, l_2). \qquad (2)$$

Equation 2 does not introduce any conditional independencies. If we were to use a (conditional) probability table for each of these three local distributions, then this model would capture exactly the same information as Equation 1 and would not reduce the number of parameters. However, instead of using conditional probability tables, we use softmax regression functions (also known as multilogit regression) ^{th} allele, conditioned on the alleles at the other two loci, _{1}, _{2}, we have

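Similarly, the softmax regression function for $p(l_2 \mid l_1)$ conditions only on $l_1$, and that for $p(l_1)$, trivially, has no conditioning variables. Here each $l_i$ is represented as a binary vector of length $L^i$ (where $L^i$ denotes the number of alleles at locus $i$) that is all zeros except for the position corresponding to the observed allele, which contains a one. Correspondingly, the parameter vectors are augmented in length to match this dimensionality. Thus, in this binary representation, the length of each parameter vector for the third locus is $L^1+L^2+1$, and the total number of scalar parameters required to represent $p(l_1, l_2, l_3)$ would be $L^3(L^1+L^2+1)+L^2(L^1+1)+L^1(1)$. Note that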

This softmax-based model can be easily extended, by direct analogy, to more than three loci, and far more efficiently than can the multinomial-based model. We note that the additive nature of the softmax regression functions leads to the property that similar haplotypes have similar joint probabilities. Coalescent priors used in some Bayesian approaches also have this property, whereas full tables do not.
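To make the difference in model size concrete, the following sketch counts the entries of a full haplotype table and the scalar parameters of the chained softmax model (one weight per one-hot feature of the preceding loci, plus a bias, per allele). The function names are illustrative; the allele counts in the usage note are those of the private North American European data.

```python
from math import prod

def full_table_entries(n_alleles):
    """Number of entries in an unconstrained joint table over haplotypes."""
    return prod(n_alleles)

def softmax_chain_parameters(n_alleles):
    """Scalar parameters of the chain-rule softmax model.

    Locus i has one weight vector per allele; each vector has one entry per
    one-hot feature of all preceding loci, plus one bias term.
    """
    total = 0
    n_features = 1  # the bias term
    for n in n_alleles:
        total += n * n_features
        n_features += n  # this locus becomes an input for later loci
    return total
```

With 81, 129, and 48 alleles at the A, B, and C loci, `full_table_entries([81, 129, 48])` gives 501,552 whereas `softmax_chain_parameters([81, 129, 48])` gives 20,787.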

We use the EM algorithm to train our model—that is, to choose good settings of the softmax parameters (^{j}^{j}^{j}

Formally, let ^{d}^{th} person in our data set. For example, if we have data for three loci, HLA-A, HLA-B, and HLA-C, then we would have unphased data for each chromosome, for each locus, ^{number of loci−1} possible unique phase states, _{i}^{d}

For the E-step, we compute _{i}^{d}^{th} genotype to be in phase state 2 is given by

For the M-step, we use the E-step posteriors to compute the parameter estimates. As mentioned, we use MAP parameter estimates, which are generally more stable. For the prior distribution of each parameter, we use a zero-centered Gaussian distribution. The use of this parameter prior is sometimes referred to as L2 regularization. The regularization parameters $(\lambda_1, \lambda_2, \lambda_3)$, which are (inversely) related to the variance of the Gaussian prior, are set empirically using a hold-out set. Because this MAP estimation problem is embedded inside an M-step, the regularization parameters are theoretically not independent (except for $\lambda_1$, because it does not depend on the phasing of the data), and hence must be adjusted jointly. We describe how we do so in the experimental section.
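As an illustration of a MAP M-step for one local distribution, the sketch below fits an L2-penalized softmax regression to examples weighted by E-step posteriors, using plain gradient ascent. It is a sketch under stated assumptions rather than the authors' optimizer: the names `map_fit` and `penalized_log_likelihood`, the learning rate, the step count, and the choice to penalize the bias weight are all hypothetical.

```python
import math

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def softmax_probs(weights, features):
    """weights: one weight vector per allele; features: input vector (bias last)."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, features)) for w in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def penalized_log_likelihood(weights, data, lam):
    """data: list of (features, target_allele_index, posterior_weight)."""
    ll = sum(r * math.log(softmax_probs(weights, x)[k]) for x, k, r in data)
    return ll - lam * sum(w_j ** 2 for w in weights for w_j in w)

def map_fit(data, n_classes, n_features, lam, lr=0.5, n_steps=500):
    """Gradient ascent on the weighted, L2-penalized log likelihood."""
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(n_steps):
        grad = [[-2.0 * lam * w_j for w_j in w] for w in weights]
        for x, k, r in data:
            p = softmax_probs(weights, x)
            for c in range(n_classes):
                coeff = r * ((1.0 if c == k else 0.0) - p[c])
                for j, x_j in enumerate(x):
                    grad[c][j] += coeff * x_j
        weights = [[w_j + lr * g_j / len(data) for w_j, g_j in zip(w, g)]
                   for w, g in zip(weights, grad)]
    return weights
```

A larger `lam` pulls the fitted conditional probabilities toward uniform, which is the smoothing effect that stabilizes estimates for rare haplotypes.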

The use of other parameter priors is possible. One commonly used alternative is the Laplacian prior or, equivalently, L1 regularization. In experiments not reported here, we have found L2 and L1 regularization to provide comparable performance on our task.

By iterating between the E-step and the M-step from some chosen parameter initialization (or, some posterior initialization), we are guaranteed to locally maximize the log posterior of the data, _{j}

We note that one can smooth/regularize the parameters of the multinomial table using a Dirichlet prior. This smoothing has the effect of adding pseudo-counts to the observed counts of the data when computing the ML estimate during the M-step. In our experiments, we compare our model against both the traditional multinomial haplotype model and a Dirichlet regularized multinomial model.
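The pseudo-count view of Dirichlet smoothing can be sketched as follows. The even split of an equivalent sample size across all possible haplotypes (a symmetric prior) and the function name are assumptions made for this example.

```python
def smoothed_frequencies(expected_counts, all_haplotypes, equivalent_sample_size):
    """M-step for the multinomial model with pseudo-counts added.

    expected_counts: dict mapping haplotype -> expected count from the E-step.
    all_haplotypes: every haplotype the table must cover, seen or unseen.
    """
    # Assumed symmetric prior: the same pseudo-count for every haplotype.
    pseudo = equivalent_sample_size / len(all_haplotypes)
    total = sum(expected_counts.get(h, 0.0) for h in all_haplotypes)
    total += equivalent_sample_size
    return {h: (expected_counts.get(h, 0.0) + pseudo) / total
            for h in all_haplotypes}
```

Any positive equivalent sample size gives unseen haplotypes a nonzero probability, so the test log likelihood stays finite.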

The ML (and L2-regularized MAP) softmax regression parameter estimation problem within a single M-step is a convex problem, and hence not subject to local minima. In contrast,

As with the traditional algorithm used in the HLA community, our EM algorithm assumes random mating. In the discussion, we propose one way to remove this assumption.

As discussed, we first train our model using the EM algorithm on a data set consisting of four-digit resolution HLA data from a population similar to that of our data of interest. We then use the model to probabilistically refine our lower-resolution data set, refining each person's HLA type independently of the others. To do so, we exhaustively write out a list of all possible unique four-digit phasings that are consistent with each person's observed genotype data: we first write out all possible (mixed-resolution) phases, and then expand each of these to all possible four-digit phases. For example, if one person's observed genotype in the data set of interest was

Expanding Equation 5, for example, we then obtain,

Similarly, we expand each of Equations 6–8 to obtain an additional $N^{2}$, $N^{3}$, and $N^{4}$ possible four-digit phasings, where $N^{k}$ denotes the number of four-digit phasings obtained from the $k$-th mixed-resolution phase. The total number of possible four-digit phasings consistent with this person's observed genotype is thus $N^{1}+N^{2}+N^{3}+N^{4}$. Alternatively, if our data set of interest contains genotype-ambiguity (in the form of possible pairs of alleles), then we expand the data in all possible ways consistent with those pairs.
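The two-stage expansion (mixed-resolution phases first, then four-digit phases) can be sketched as below. The helper names and the prefix-matching rule for two-digit types are assumptions for illustration; in particular, prefix matching ignores nomenclature oddities such as A*92 belonging to the A*02 group.

```python
from itertools import product

def expand_allele(allele, known_four_digit):
    """Four-digit alleles consistent with an observed allele such as 'A*03'."""
    digits = allele.split("*")[1]
    if len(digits) >= 4:
        return [allele]  # already at four-digit resolution
    return sorted(a for a in known_four_digit if a.startswith(allele))

def four_digit_phasings(genotype, known_four_digit):
    """genotype: list of unphased (allele, allele) pairs, one per locus."""
    first, rest = genotype[0], genotype[1:]
    results = set()
    # Stage 1: enumerate mixed-resolution phases (first locus held fixed).
    for flips in product([False, True], repeat=len(rest)):
        hap1 = [first[0]] + [b if f else a for (a, b), f in zip(rest, flips)]
        hap2 = [first[1]] + [a if f else b for (a, b), f in zip(rest, flips)]
        # Stage 2: expand every allele of each haplotype to four digits.
        options1 = [expand_allele(a, known_four_digit) for a in hap1]
        options2 = [expand_allele(a, known_four_digit) for a in hap2]
        for h1 in product(*options1):
            for h2 in product(*options2):
                results.add(tuple(sorted((h1, h2))))
    return sorted(results)
```

For a person typed as A*03/A*3001 and B*0702/B*0801, with A*0301 and A*0302 as the known A*03 alleles, this yields two mixed-resolution phases and four four-digit phasings in total.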

If our desired endpoint is a statistical estimate of phased four-digit data, then we need only compute and renormalize the likelihood of each member of the list (to get the posterior probability of each pair of four-digit haplotypes). However, usually we are interested in a probability distribution over the possible four-digit genotypes. To obtain this distribution, we sum the posterior probabilities of those members of the list that are consistent with each observed genotype. For example,
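A minimal sketch of the renormalization and summation steps follows. The function `genotype_posterior` and the callable `hap_prob` (standing in for any trained haplotype model) are hypothetical names; the factor of two for heterozygous haplotype pairs is the usual HWE convention for unordered pairs and is an assumption about how the likelihood is written.

```python
from collections import defaultdict

def genotype_posterior(phasings, hap_prob):
    """Posterior over four-digit genotypes given candidate phasings.

    phasings: list of (hap1, hap2); each haplotype is a tuple of alleles.
    hap_prob: function returning the model probability of one haplotype.
    """
    likelihoods = [
        hap_prob(h1) * hap_prob(h2) * (1.0 if h1 == h2 else 2.0)
        for h1, h2 in phasings
    ]
    total = sum(likelihoods)
    posterior = defaultdict(float)
    for (h1, h2), like in zip(phasings, likelihoods):
        # Dropping the phase: keep only the unordered allele pair per locus.
        genotype = tuple(tuple(sorted(pair)) for pair in zip(h1, h2))
        posterior[genotype] += like / total
    return dict(posterior)
```

Phasings that differ only in how the alleles are arranged across the two chromosomes map to the same genotype key, so their posterior probabilities are added together.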

Because haplotype patterns are often population (ethnicity)-specific, a natural approach is to use separate models for each population, when the populations are known. For example, if the low-resolution data of interest pertained primarily to individuals of European descent, then one would train a model using data from a European population. Or, if the low-resolution data consisted of both European and Amerindian populations, then one would train a model on European and Amerindian populations separately, and then refine the data of interest using the appropriate model.

Nonetheless, it is likely that some haplotype patterns are population-specific whereas others are not, or far less so. Consequently, it would be useful to combine data across populations, so that as much data as possible is available for parameter estimation. The challenge of course is how to combine data when appropriate, to maintain population-specific training data when appropriate, and to make good choices automatically. One way to achieve this goal is to augment the feature space (which so far consists of binary encodings of HLA alleles) with population features. We can, for example, include a one-hot encoding of the population labels in our features. Alternatively or in addition, we can add features that correspond to conjunctions of the one-hot encodings of allele and population label. Whereas the first type of augmentation, which we refer to as
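The two kinds of feature augmentation described in this paragraph can be illustrated with a small encoder. The mode names `"population"`, `"conjunction"`, and `"both"` are hypothetical labels chosen for this sketch, as is the layout of the feature vector.

```python
def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode(allele_index, n_alleles, pop_index, n_pops, mode="both"):
    """Feature vector for one conditioning allele plus population features.

    mode: "none", "population" (append the one-hot population label),
    "conjunction" (append allele-by-population products), or "both".
    """
    allele = one_hot(allele_index, n_alleles)
    pop = one_hot(pop_index, n_pops)
    features = list(allele)
    if mode in ("population", "both"):
        features += pop
    if mode in ("conjunction", "both"):
        # One feature per (population, allele) pair; exactly one is active.
        features += [a * p for p in pop for a in allele]
    return features + [1.0]  # bias term last
```

The shared allele features let all populations contribute to one weight, while the conjunction features allow a population-specific correction wherever the data support one.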

The idea of leveraging information across multiple populations is closely related to some of our previous work on epitope prediction in which we show how to leverage information across HLA alleles

One could imagine using a mixed-resolution data set of interest (which contains some four-digit HLA types) as its own training data since EM naturally handles incomplete data. If the data that are missing four-digit resolution information are

To assess the statistical significance of the difference in performance between two models (e.g., softmax compared to multinomial), in terms of either the number of correct MAP predictions or the test log likelihood, we used a non-parametric, permutation-based, paired test, wherein the null hypothesis is that the average of the pairwise difference in scores is zero.

Suppose the test set contains _{1} and _{2}, we do the following:

Compute the average difference between paired scores, in each algorithm,

For permutation,

For each datum,

Compute the average difference between paired, permuted scores,

Then the two-sided p-value for method _{1} being statistically different from _{2} is given by the proportion of times that the average difference observed on permutated data matched, or exceeded that observed on the real data. Formally,

where
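The paired permutation test outlined in these steps can be sketched as follows. Randomly swapping the two methods' scores for a datum is implemented as a sign flip of the paired difference; the function name, the fixed seed, and the default of 10,000 permutations are choices made for this example.

```python
import random

def paired_permutation_test(scores_1, scores_2, n_permutations=10000, seed=0):
    """Two-sided p-value for the mean paired difference being zero.

    scores_1, scores_2: per-datum scores of the two methods (for example the
    log probability of the true allele, or 1/0 for a correct MAP prediction).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_1, scores_2)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Swapping a pair of scores is the same as negating its difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / len(diffs)) >= observed:
            hits += 1
    return hits / n_permutations
```

The returned value is the proportion of permutations whose average difference matched or exceeded the observed one, so its resolution is limited by `n_permutations`.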

We used data sets from two main sources, and denote the number of individuals in each by N. The first data set is a collection of private data derived from a large collection of disease cohorts and controls that were all typed in the laboratory of Mary Carrington. This data set comprises data from four populations, across three loci, as summarized in

| ethnicity | N | # unique A alleles | # unique B alleles | # unique C alleles |
|---|---|---|---|---|
| North American European | 7526 | 81 | 129 | 48 |
| North American African | 3545 | 60 | 106 | 42 |
| Asian | 1318 | 43 | 76 | 30 |
| North American Hispanic | 881 | 47 | 106 | 35 |

Class I genotyping: Genomic DNA was amplified using locus-specific primers flanking exons 2 and 3. The PCR products were blotted on nylon membranes and hybridized with a panel of sequence-specific oligonucleotide (SSO) probes (see

The second data set was taken from the publicly available dbMHC database (

| ethnicity | N | # unique A alleles | # unique B alleles | # unique C alleles | # unique DRB1 alleles |
|---|---|---|---|---|---|
| Irish | 1000 | 26 | 49 | 23 | 33 |
| North American Asian | 393 | 34 | 66 | 24 | NA |
| North American European | 287 | 28 | 48 | 21 | NA |
| North American Black | 251 | 28 | 49 | 23 | NA |
| North American Hispanic | 240 | 35 | 62 | 25 | NA |
| North American Amerindian | 229 | 27 | 55 | 22 | NA |
| All except Irish | 1400 | 48 | 102 | 31 | NA |

In order to evaluate our model, and also to compare how it performs to a multinomial-based model, we use data sets consisting of four-digit resolution HLA data from individuals. Then we synthetically mask the known four-digit allele designation for some loci and some individuals, at random. In this way, ground truth is available for quantitative assessment. Specifically, we use the following set-up:

Start with a four-digit HLA resolution data set, $D$.

Randomly partition $D$ into 80% for training ($D_{\text{train}}$) and 20% for testing ($D_{\text{test}}$).

To learn good settings of the regularization parameters, randomly partition $D_{\text{train}}$ into 80% for a regularization training set ($R_{\text{train}}$) and 20% for a regularization hold-out set ($R_{\text{hold}}$). Train a model on $R_{\text{train}}$ for each value of the regularization parameters, and then test its performance on $R_{\text{hold}}$. Select the regularization parameters which perform best.

Using the best regularization parameters, train the model on $D_{\text{train}}$, and then test its performance on $D_{\text{test}}$.
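The nested 80/20 partitioning used for selecting the regularization parameters can be sketched as below. The callables `fit` and `score` are hypothetical placeholders for training a model and scoring it on held-out data; they are not part of any published interface.

```python
import random

def split(items, train_fraction, rng):
    """Shuffle a copy of the items and cut it into two parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def select_regularization(data, candidates, fit, score, seed=0):
    """Pick the best regularization value on a hold-out set, then test it.

    fit(train_data, lam) -> model; score(model, held_out) -> higher is better.
    """
    rng = random.Random(seed)
    d_train, d_test = split(data, 0.8, rng)     # outer split: train / test
    r_train, r_hold = split(d_train, 0.8, rng)  # inner split: fit / hold-out
    best = max(candidates, key=lambda lam: score(fit(r_train, lam), r_hold))
    # Retrain on all of the training data with the selected value.
    return best, score(fit(d_train, best), d_test)
```

Keeping the test portion out of the inner loop means the reported test score is not biased by the search over regularization values.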

To test the performance as mentioned above, we randomly mask 30% of the four-digit HLA types (on an individual and independent allele basis) in the test/hold-out set. That is, we truncate the last two digits of their four-digit designation. We then use our HLA refinement to obtain a probability distribution for all four-digit HLA types which are consistent with the masked values. Then we assess the prediction in two ways. One, we take the four-digit type with the highest probability as the single, best answer, and then count how many of these are correct. We refer to this criterion as the

Although we mask the HLA types at random, this is likely not the same process that is responsible for the true, observed, experimental process that results in masking. Nonetheless, we feel that it is a reasonable proxy, because it focuses on how well haplotype patterns have been learned, how strong these patterns are, and how much they can be used to refine HLA data, which is the question of interest. Additionally, we measure performance under a 100% masking, and also a locus-by-locus masking, for broader testing of the performance of our model.
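The masking step and the two evaluation criteria (number of correct MAP predictions and geometric mean probability of the true allele) can be sketched as follows. Masking each allele independently with a fixed probability is one reading of the set-up and is an assumption here, as are the function names.

```python
import math
import random

def mask_alleles(genotypes, fraction, rng):
    """Truncate a random subset of four-digit alleles to two digits.

    genotypes: list of per-person lists of alleles such as 'A*0301'.
    Returns the masked copy and the (person, position) indices masked.
    """
    masked, positions = [], []
    for i, person in enumerate(genotypes):
        row = []
        for j, allele in enumerate(person):
            if rng.random() < fraction:
                locus, digits = allele.split("*")
                row.append(locus + "*" + digits[:2])
                positions.append((i, j))
            else:
                row.append(allele)
        masked.append(row)
    return masked, positions

def evaluate(predictions, truths):
    """predictions: list of dicts (allele -> probability); truths: true alleles."""
    n_correct = sum(max(p, key=p.get) == t for p, t in zip(predictions, truths))
    probs = [p.get(t, 0.0) for p, t in zip(predictions, truths)]
    if min(probs) == 0.0:
        return n_correct, 0.0  # one impossible truth zeroes the geometric mean
    mean_log = sum(math.log(q) for q in probs) / len(probs)
    return n_correct, math.exp(mean_log)
```

The geometric mean probability is the exponentiated average test log likelihood, which is why a single zero-probability prediction is so costly for an unsmoothed model.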

In addition to experimenting with our softmax-based model, and the multinomial (with and without regularization), we also compare performance to a baseline model of allele marginals. In this baseline model, the probability over four-digit HLA types is proportional to the frequency of that allele in the training set, regardless of the HLA data at other loci. This model, by construction, cannot capture haplotype structure. As we shall see, this model does not perform well.
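The allele-marginal baseline is simple enough to state in a few lines. The function name and the prefix-matching rule for two-digit types are assumptions made for this sketch.

```python
from collections import Counter

def allele_marginal_refinement(training_alleles, masked_allele):
    """Distribution over four-digit alleles consistent with a masked allele.

    training_alleles: four-digit alleles pooled from the training set.
    masked_allele: a two-digit type such as 'A*03'.
    Each candidate's probability is proportional to its training frequency;
    alleles at other loci are ignored.
    """
    counts = Counter(a for a in training_alleles if a.startswith(masked_allele))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}
```

Because every person with the same masked type receives the same distribution, this baseline always predicts the most common subtype regardless of the rest of the haplotype.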

For the softmax-based model, we first learned the best value for _{1} (i.e., for the first locus) since it is independent of the others. Then, fixing the value of _{1} at its best value, we set all other _{i}_{i}

Lastly, to determine if there is a statistically significant difference between our methods (in terms of either test log likelihood or number of correct MAP predictions), we use a permutation-based, non-parametric, paired test in which the null hypothesis is that the average of the pairwise difference in scores is zero. Because 10,000 permutations were used, the smallest p-value that could be obtained was

Because the objective function we use, the penalized likelihood, is not convex, our parameter estimation and hence HLA refinement can be sensitive to the initial parameter setting. (Note that by parameters, we mean ^{j}^{j}^{j}_{i}_{i}

When training our softmax-based model, the geometric mean probability across the five initializations was always 0.5255. (A larger geometric mean probability is better.) In all five runs, 262 of the 306 masked alleles were correctly predicted, indicating little sensitivity to parameter initialization. Similarly, for the regularized multinomial-based model, the geometric mean probability across the five initializations was always 0.4180. In all five runs, 262 of the 306 masked alleles were correctly predicted, again indicating little sensitivity. For the unregularized multinomial-based model, the geometric mean probabilities across the five initializations were 0.0077, 0.0117, 0.0126, 0.0092, and 0.0105. Of the 306 masked alleles, 260, 265, 260, 266, and 262 were correctly predicted across the five runs, indicating a far greater sensitivity to initial parameters.

The geometric mean probability was best for the softmax-based model, followed by the regularized multinomial, followed by the unregularized multinomial model (which does poorly due to its inability to make stable estimates for the huge number of parameters it requires). This is a pattern we shall see throughout our experiments.

The sensitivity we see here will allow us to gauge how important observed differences are in the remainder of the experiments, where we always initialize the parameters to be all zero. Of course, when deploying this method in a real setting, it would be wise to try several parameter initializations, and then to choose the one that yields the highest likelihoods on hold-out data. Also note that, for the unregularized multinomial model, we regularize it with an equivalent sample size of 1×10^{−16} so that negative infinities do not appear when haplotypes not seen in the training sample appear in the test set.

Next we used our large, private data set to measure the refinement performance of the various models we have discussed. We trained and tested within each ethnic population separately. The results are summarized in

Each set of grouped bars represents the four different modeling approaches. From darkest to lightest: softmax, regularized multinomial, unregularized multinomial, allele marginal. The numbers of masked alleles in the European, African, Asian, and Hispanic data sets were 2669, 1287, 477, and 306, respectively.

The softmax model has the best performance overall and can correctly resolve a substantial number of ambiguous alleles. In terms of both criteria, the softmax model is significantly better than the other methods (see ^{−4}), because the allele marginals are naturally regularized due to the small number of parameters.

| Method 1 | Method 2 | log likelihood p-value | # correct MAP p-value |
|---|---|---|---|
| softmax | regularized mult. | p = 10^{−4} | p = 2.8×10^{−3} |
| softmax | non-reg. mult. | p = 10^{−4} | p = 8×10^{−4} |
| softmax | allele marginals | p = 10^{−4} | p = 10^{−4} |
| regularized mult. | non-reg. mult. | p = 10^{−4} | p = 0.51 |
| non-reg. mult. | allele marginals | p = 10^{−4} | p = 10^{−4} |

Denotes the method that performed better (except for the last row, where the allele marginals perform better than the unregularized multinomial on the log likelihood, but worse on the number of correct MAP predictions).

In realistic settings where our algorithm will be deployed, it is likely that the data set of interest is not drawn from exactly the same distribution as the training data. To get a sense of how robust our approach is to deviations from this idealized setting, we have performed several experiments more closely mimicking a realistic setting. In particular, we evaluated our refinement accuracy when the training and test distributions were drawn from different populations.

First, we split the dbMHC Irish data set (HLA-A, HLA-B, HLA-C alleles) into 80% training data and 20% test data, and masked 30% of the test alleles to two digits. Then we trained a model using the training data, and tested on the test data. Next, we used the model we had previously trained on the ‘private North American European’ data to predict the same masked Irish alleles. Of the 200 people in the Irish test set, one person carried an allele never observed in the European data (B*2409, which is actually a null allele, B*2409N, that the typing of the private data was not capable of detecting). After removal of this person, we compared the performance when using the dbMHC Irish data set itself for training with that obtained when using our much broader private European data set for training. The resulting geometric mean probabilities on the test set were 0.8851 when training with the dbMHC Irish data, and 0.8891 with the private European data. This difference was not significant (p = 0.44).

Next, we used the model trained on the private Asian data to predict a 30% masking of 279 dbMHC Canton Chinese individuals

Asian data set itself. Because this dbMHC data set was not large enough to partition into a training and test set, we were not able to measure accuracy achieved when training on itself. This is true for the next three dbMHC data sets as well, in which we perform similar experimentation.

Next, we used a model trained on the private North American African data set to predict masked alleles in 251 dbMHC African American individuals, of whom five carried alleles not observed in the training data (A*6804, B*1502, B*1515, B*5802). After removal of these individuals, 321/373 = 86% of masked alleles were correctly predicted, which is lower than the 90% accuracy achieved when testing on the private North American African data itself. Results were comparable when we first removed individuals from Africa from the training data (leaving only US-based individuals of African descent).

Next, we used a model trained on the private North American European data set (containing 776 individuals) to predict masked alleles in 287 dbMHC North American European individuals, of whom three carried alleles not observed in the training data (B*1802, B*4408, B*5202). After removal of these individuals, 478/510 = 94% of masked alleles were correctly predicted, roughly equal to the 95% accuracy achieved when testing on the private North American European data set itself.

Finally, we used a model trained on the private North American Hispanic data set to predict masked alleles in 240 dbMHC North American Hispanic individuals, of whom 13 carried alleles not observed in the training data (A*0212, A*0213, A*2422, A*2608, A*3401, A*6805, B*5105, B*3509, B*4406). After removal of these individuals, 344/400 = 86% of masked alleles were correctly predicted, comparable to the accuracy achieved when testing on the private North American Hispanic data set itself.
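The filtering step used in these cross-data-set experiments (setting aside test individuals who carry an allele never seen in training) can be sketched as below. The record layout, the function name, and the toy alleles are assumptions made for illustration only; the final lines simply recompute the percentages from the counts reported above.

```python
def split_by_coverage(individuals, training_alleles):
    """Separate individuals whose alleles were all seen in training from the rest."""
    covered, uncovered = [], []
    for person in individuals:
        if all(a in training_alleles for a in person["alleles"]):
            covered.append(person)
        else:
            uncovered.append(person)
    return covered, uncovered


# Toy data (hypothetical): p2 carries A*6804, which is absent from training.
training_alleles = {"A*0201", "A*0301", "B*0702", "B*0801"}
test_individuals = [
    {"id": "p1", "alleles": ["A*0201", "A*0301", "B*0702", "B*0801"]},
    {"id": "p2", "alleles": ["A*0201", "A*6804", "B*0702", "B*0801"]},
]
covered, uncovered = split_by_coverage(test_individuals, training_alleles)

# Accuracy figures quoted in the text, recomputed from the reported counts.
reported = {
    "African American": (321, 373),
    "European": (478, 510),
    "Hispanic": (344, 400),
}
percent = {name: round(100 * c / n) for name, (c, n) in reported.items()}
```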

Based on this small set of experiments, we believe it may often be feasible to use our broadly defined ethnic categories for resolving ambiguity in other, independently created data sets falling into the same broad category, or falling into a much more specific sub-category. Of course, this may not generally be true, and in particular, it may be less true for African-derived data. Additionally, a user of a trained model might have access to some high-resolution data for their population of interest, and could thus check how well the trained model works on that subset of their data (by synthetically masking it) before using the model to resolve ambiguity in their low-resolution data.
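Such a check by synthetic masking can be sketched as follows. This is a minimal illustration, assuming allele names of the form 'B*2409' and assuming that a fixed fraction of alleles is chosen uniformly at random for masking; the function names are hypothetical.

```python
import random


def mask_to_two_digits(allele):
    """Truncate a four-digit allele name such as 'B*2409' to 'B*24'."""
    locus, digits = allele.split("*")
    return f"{locus}*{digits[:2]}"


def mask_alleles(alleles, fraction, rng):
    """Mask a random `fraction` of alleles; return the masked list and masked indices."""
    n_mask = round(fraction * len(alleles))
    idx = set(rng.sample(range(len(alleles)), n_mask))
    masked = [mask_to_two_digits(a) if i in idx else a
              for i, a in enumerate(alleles)]
    return masked, sorted(idx)


# Hypothetical high-resolution typings available to the user.
high_res = ["A*0201", "A*0301", "B*0702", "B*0801", "C*0701",
            "C*0702", "A*1101", "B*4402", "B*3501", "C*0401"]
masked, masked_idx = mask_alleles(high_res, 0.3, random.Random(0))
```

The trained model would then be asked to predict the alleles at `masked_idx`, and its predictions compared with the held-back four-digit values in `high_res`.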

Note that there are two statistical

To determine whether the availability of more training data may lead to improved refinements, we examined the sensitivity of performance to the size of the training set. For the European and the African private data sets, we iteratively halved the sample size of training data, where the largest available training data set sizes were, respectively, 6020 and 2836. The results shown in

Top row shows the geometric mean probabilities; the bottom row shows the percentage of correct MAP predictions.
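The halving procedure can be sketched as follows. This is an illustration only: the minimum size of 100, the use of integer division, and the choice of nested (rather than independently drawn) subsets are assumptions, not details of our experiments.

```python
import random


def halving_schedule(n_max, n_min=1):
    """Training set sizes obtained by repeatedly halving n_max, down to n_min."""
    if n_min < 1:
        raise ValueError("n_min must be at least 1")
    sizes = []
    n = n_max
    while n >= n_min:
        sizes.append(n)
        n //= 2  # integer halving (an assumption)
    return sizes


def nested_subsets(items, sizes, rng):
    """Draw nested random subsets so each smaller set is contained in the larger."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}


european_sizes = halving_schedule(6020, n_min=100)
african_sizes = halving_schedule(2836, n_min=100)
subsets = nested_subsets(range(2836), african_sizes, random.Random(0))
```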

To determine whether leveraging information across populations is useful, we compared our leveraged population models to those built separately on each population. We did so on data from dbMHC, which contains a diverse set of populations. (We excluded the Irish population because this population is extremely homogeneous relative to the others.) Recall that we introduced two types of leveraging features: simple and conjunctive. We used our softmax model both with the simple features alone, and with both the simple and the conjunctive features, as shown in

Abbreviations are: SSC = softmax+simple+conjunctive, SS = softmax+simple, S = softmax, RM = regularized multinomial, M = non-regularized multinomial, AM = allele marginals, S = separate. The number of masked alleles in the test set was 514. For all methods, except ‘separate’, a single model was trained on data from all ethnicities.

The performance of the population-augmented models is significantly better than that of the softmax model on test log likelihood (e.g.,

Because we use softmax regression functions in our haplotype model, the order in which we apply the chain rule (Equation 2) to our loci will have an effect on predictive accuracy. We examined the sensitivity of performance to variable ordering on three loci (A,B,C) using the European and Hispanic data sets. The results are shown in

Top two rows are for the Hispanic data set; bottom two rows are for the European data set. Within each of these, the top row is the geometric mean probabilities, and the bottom row shows the percent correct MAP predictions. The number of masked alleles, respectively, in the Hispanic 30% and loci masks (A,B,C), was 306 and 354. The number of masked alleles, respectively, in the European 30% and loci masks (A,B,C), was 2669 and 3012.
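To make the notion of variable ordering concrete, the sketch below enumerates the six orderings of the three loci and writes out the chain-rule factorization implied by each. It only illustrates the structure of the factorization; the function names are hypothetical, and in the model described here each conditional factor would be a softmax regression.

```python
from itertools import permutations


def chain_rule_factors(order):
    """For one locus ordering, pair each locus with the loci it is conditioned on."""
    return [(locus, tuple(order[:i])) for i, locus in enumerate(order)]


def format_factorization(order):
    """Render the factorization, e.g. 'p(A) p(B|A) p(C|A,B)'."""
    terms = []
    for locus, parents in chain_rule_factors(order):
        if parents:
            terms.append(f"p({locus}|{','.join(parents)})")
        else:
            terms.append(f"p({locus})")
    return " ".join(terms)


orderings = list(permutations(("A", "B", "C")))
factorizations = [format_factorization(o) for o in orderings]
```

With exact conditional probability tables every ordering yields the same joint distribution; the orderings can differ in practice only because each conditional is fit with a restricted parametric form.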

For the ‘30% mask’ experiments, no statistically significant (

Note that it is possible to use a parsimonious model which is not dependent upon variable ordering (a so-called ‘undirected’ model _{i}_{j}_{k}

In some domains, the ability to predict certain loci is of greater importance than others. For example, in HIV research, the ability to predict B alleles is often paramount (e.g.,

Each set of grouped bars, from darkest to lightest, represents, respectively, A-masking, B-masking, and C-masking. The number of masked alleles in each masking was 3012, 1418, 528, and 354, respectively, for the European, African, Asian, and Hispanic test sets.

Finally, in some instances, only low-resolution data (i.e., two-digit resolution) is available. Consequently, we investigated the prediction accuracy of our algorithm in this situation—that is, when 100% of the alleles were masked to two-digit. The results for the private African, Asian, and Hispanic data sets are shown in

Each set of grouped bars represents, from darkest to lightest, respectively, 100% mask with softmax model, 100% mask with allele marginals model, 30% mask with softmax model, 30% mask with allele marginals model. The number of masked alleles for the 100% mask was 4254, 1429, and 1062, and for the 30% mask, 1287, 477, and 306, in the African, Asian, and Hispanic test sets, respectively.

In order to gauge how much haplotype information is being used in this context, we compare the results to those from the allele marginal model. In all cases, the softmax model performs significantly better than the allele marginal model (p = 10^{−4} for all three population comparisons on the test log likelihood). Thus, a large amount of haplotype information is being used by our model in this 100% masking context, and prediction of four-digit alleles from strictly two-digit data is feasible. For comparison,
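The baseline used in this comparison can be sketched as follows, assuming that the allele marginal model amounts to renormalizing the training frequencies of the four-digit alleles consistent with the observed two-digit typing, ignoring the other loci. The function names and toy counts are hypothetical.

```python
from collections import Counter


def fit_allele_marginals(training_alleles):
    """Relative frequency of each four-digit allele in the training data."""
    counts = Counter(training_alleles)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def refine_with_marginals(two_digit, marginals):
    """Posterior over four-digit alleles consistent with a two-digit typing."""
    candidates = {a: p for a, p in marginals.items() if a.startswith(two_digit)}
    z = sum(candidates.values())
    if z == 0:
        return {}  # no consistent allele was seen in training
    return {a: p / z for a, p in candidates.items()}


# Toy training data (hypothetical counts).
training = ["A*0201"] * 6 + ["A*0205"] * 2 + ["A*0301"] * 2
marginals = fit_allele_marginals(training)
posterior = refine_with_marginals("A*02", marginals)
map_allele = max(posterior, key=posterior.get)
```

A haplotype-based model differs from this baseline in that the posterior for one locus also depends on the alleles observed at the other loci.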

We compared our methods on data with four loci, spanning the HLA-A, -B, -C and -DRB1 loci. The four-loci data available to us, with the largest sample size, was the Irish set in dbMHC. As shown in

Abbreviations are: S = softmax, RM = regularized multinomial, M = non-regularized multinomial, AM = allele marginals. A total of 468 alleles were masked in the test set.

We have introduced a method for statistical refinement of low- or intermediate-resolution HLA data when a full-resolution training data set from a similar population is available. In doing so, we have also improved upon the EM-based approach to haplotype estimation by using a more parsimonious parameterization of the haplotype distribution. Experimentally, we show both that it is feasible to use statistical approaches for HLA refinement, and that our method outperforms the standard multinomial-based models used throughout the HLA community for haplotype estimation. Our HLA refinement method helps to mitigate the limiting factor of cost in HLA typing today, and allows low- or intermediate-resolution data, or historical data, to be statistically refined when it cannot be refined by assay. A tool based on our approach is available for research purposes at

Although there is widespread caution about the use of assigned, or self-defined ethnicity labels

Because our modeling approach assumes that the training and testing populations are drawn from the same distribution, one should take care when trying to use this approach for case-control studies where cases and controls are thought to be drawn from different distributions. One may also be wary of using this approach in the domain of transplantation, for similar reasons (patients requiring transplants likely make up a specific sub-population). However, since HLA ambiguity resolution is applied in the area of transplants to potential donors in a registry, rather than to the patients themselves (who are routinely typed at high resolution), application in this domain should not be problematic.

As with the traditional algorithm used in the HLA community, our EM algorithm assumes HWE. One could make a small change to our model which would allow us to circumvent making such an assumption. In the models discussed so far, the probability of data in a particular phasing is defined as follows. If a haplotype, _{A}_{1},_{B}_{1,}_{C}_{1,}_{A}_{2},_{B}_{2,}_{C}_{2} into two sets: _{A}_{1},_{B}_{1},_{C}_{1}) and _{A}_{2},_{B}_{2},_{C}_{2}) are specified by a

Future work in probabilistic HLA refinement may involve comparing EM-based approaches to full Bayesian approaches. Also, an interesting, though perhaps computationally difficult, avenue to pursue would be the use of HLA DNA sequences to better model rare haplotypes, or the use of SNP data to directly predict HLA types.

We wish to acknowledge the following cohorts and investigators from whom samples used in this study (for HLA typing) were derived: the International HIV Controllers Study, the Multicenter AIDS Cohort Study, the Multicenter Hemophilia Cohort Study, the Washington and New York Men's Cohort Study, the San Francisco City Clinic Cohort, the AIDS Linked to Intravenous Experience, the Swiss HIV Cohort, the Urban Health Study, the NIH Focal Segmental Glomerulosclerosis Genetic Study, Hepatitis C Antiviral Long-term Treatment against Cirrhosis, National Cancer Institute Surveillance Epidemiology and End Results Non-Hodgkin Lymphoma Case-Control Study, Woman Interagency Health Study, Classic Kaposi Sarcoma Case-Control Study I and II, Genetic Modifiers Study, Nairobi CTL Cohort, Grace John-Stewart, Stephen O'Brien, and Thomas O'Brien. We thank John Hansen for useful discussion, David Ross for code which contributed to the softmax regression, Mark Schmidt for use of his L1 code, Giuseppe Cardillo for his publicly available implementation of