High-accuracy HLA type inference from whole-genome sequencing data

Extensive hyperpolymorphism and sequence similarity between the HLA genes make HLA type inference from whole-genome sequencing data a challenging problem. We address these by representing sequences from over 10,000 known alleles in a reference graph structure, enabling accurate read mapping. HLA*PRG, our algorithm, outperforms existing methods by a wide margin and for the first time consistently achieves the accuracy of gold-standard reference methods with one error across 158 alleles tested.


1
Wellcome Trust Centre for Human Genetics, University of Oxford 2 UCSF, Neurology Department, San Francisco, USA. 3 Inserm unit 1064 ATIP-Avenir team 6, University of Nantes -Nantes University Hospitals, France 4 Li Ka Shing Centre for Health Information and Discovery, University of Oxford * Corresponding author dilthey@well.ox.ac.uk E E E Extensive hyperpolymorphism and sequence similarity between the HLA genes xtensive hyperpolymorphism and sequence similarity between the HLA genes xtensive hyperpolymorphism and sequence similarity between the HLA genes xtensive hyperpolymorphism and sequence similarity between the HLA genes make make make make HLA type HLA type HLA type HLA type inference from inference from inference from inference from w w w whole hole hole hole----genome sequencing data a challenging problem. We address these by genome sequencing data a challenging problem. We address these by genome sequencing data a challenging problem. We address these by genome sequencing data a challenging problem. We address these by representing sequenc representing sequenc representing sequenc representing sequences from es from es from es from over 10,000 over 10,000 over 10,000 over 10,000 known alleles in a reference graph structure known alleles in a reference graph structure known alleles in a reference graph structure known alleles in a reference graph structure, enabling , enabling , enabling , enabling accurate read mapping accurate read mapping accurate read mapping accurate read mapping. . .
. HLA*PRG, our algorithm, HLA*PRG, our algorithm, HLA*PRG, our algorithm, HLA*PRG, our algorithm, outperforms existing methods by a wide margin and outperforms existing methods by a wide margin and outperforms existing methods by a wide margin and outperforms existing methods by a wide margin and for the first time for the first time for the first time for the first time consistently consistently consistently consistently achieves the accuracy of gold achieves the accuracy of gold achieves the accuracy of gold achieves the accuracy of gold----standard standard standard standard reference reference reference reference methods with methods with methods with methods with one one one one error error error error across 158 alleles tested across 158 alleles tested across 158 alleles tested across 158 alleles tested. . . .
Genetic variation at HLA loci, both classical and non-classical, is associated with many medical phenotypes including risk of autoimmune 1-3 and infectious 4 disease, adverse drug reactions 5, 6 , success of tissue and organ transplants 7 , and, via epitope presentation preferences, the success of cancer immunotherapy 8 . The current gold standard for high resolution typing of HLA alleles, sequence-based typing (SBT), uses Sanger sequencing or targeted amplification of the HLA genes followed by nextgeneration sequencing and can require extensive manual curation, thus making high throughput application of the method expensive and challenging. With the growth of high throughput genomic technologies, methods for inferring HLA genotype have been developed that use SNP genotyping 9-12 , exome and whole-genome sequencing 13-17 . These approaches offer high throughput, but, to date, are either limited to a subset of HLA loci 17 or do not achieve the same degree of accuracy as SBT.
Multiple factors influence accuracy, including the sheer sequence and structural diversity of the region, the presence of multiple paralogous genes (including pseudogenes) and rare, but important, gene conversion events that generate mosaic allelic structures.
To address these challenges, we have previously introduced structures to represent known genomic variation called population reference graphs (PRGs) and demonstrated their value in characterising variation across the MHC and particularly within the HLA Class II gene region 18 . Briefly, a PRG is a directed graph in which alternative alleles, insertions and deletions are represented as alternative paths through the graph, and in which orthologous and identical regions are collapsed locally to model potential recombination. Although expensive computationally, reads likely to arise from the region can be identified and mapped directly to the graph structure, thus enabling the assessment of evidence for the presence of each stretch of sequence along a path. The pairs of paths with greatest joint support can therefore be identified and assigned as the diploid genotype for an individual. Previously, we demonstrated that a prototype of this approach can identify the nucleotide-level variants at classical HLA alleles with high accuracy. However, we did not address the problem of inferring the allele present at the gene level 18 .
Allelic variants at HLA genes can be typed 19, 20 at different degrees of resolution; low resolution ("2digit") types specify serological activity; intermediate resolution ("4-digit") HLA types specify the complete primary sequence of the HLA proteins and high-resolution ("6-digit") types determine the full exonic sequence including synonymous variants. Higher levels of resolution include non-coding variation. SBT is typically carried out at 6-digit "G" resolution, in which only the sequences of the exons encoding the peptide binding groove are considered: exons 2 and 3 for HLA class I genes and exon 2 for HLA class II genes. In most applications of typing, a set of 6-8 loci are typed (Class I: HLA-A, -B, -C, Class II: HLA-DQA1, -DQB1, -DRB1, -DRA1 and -DPB1), though there exist over 30 HLA genes and pseudogenes. Supplementary Figures 1 and 2 demonstrate the high degree of sequence similarity and its non-random spatial structure between alleles in certain groups of loci.
We set out to modify the existing PRG approach to provide accurate HLA typing at 6-digit "G" resolution using high coverage whole-genome sequencing data, such as is being generated by largescale genomics projects. Details of the approach, including source data, graph-construction, readmapping and HLA typing are given in the Supplementary Note and a schematic is shown in Figure. 1. Briefly, we build a gene-specific PRG comprising 46 (mostly HLA) genes and pseudogenes, 720 genomic and 10,500 coding (exon-only) alleles from IMGT/HLA 21 (Supplementary Table 1). Each gene is embedded in a stretch of surrounding reference sequence, but we don't attempt to model the full intergenic sequence. Reads are mapped to the PRG and the pair of alleles with the highest joint likelihood is identified and reported with an associated quality measure for each individual allele (integrating over the distribution of posterior probabilities, see Supplementary Note). The software to carry out these steps, HLA*PRG, is freely available.
To assess the accuracy of the method, we used two data sets with available high coverage sequencing data and independent SBT-based HLA type information (Table 1). First, we analysed NA12878, NA12891 and NA12892 from the Illumina Platinum Genomes Project, sequenced to 50 -55x with a PCR-free 2 x 100bp protocol. We correctly infer all 36 HLA alleles. Second, we analysed 11 samples from the 1000 Genomes Project, sequenced to 28 -68x with a PCR-free 2 x 250bp protocol. Initial analysis identified three discrepancies (Supplementary Note), though on re-typing these individuals two of three were the result of initial errors in the validation data. The remaining inconsistency, (HLA-DRB1*16:02:01 incorrectly typed as HLA-DRB1*16:23) is likely caused by HLA-DRB5 sequences incorrectly aligned to HLA-DRB1 within IMGT (IMGT/HLA currently don't provide genomic sequences for HLA-DRB5 and the representation of this gene in the PRG therefore remains incomplete).
We compare the performance of HLA*PRG with PHLAT 14 and HLAreporter 13 , two state-of-the-art algorithms that support HLA class I and class II. For the Platinum samples, we find that PHLAT also correctly infers all 36 alleles, whereas HLAreporter only reports 16 alleles (of which 14 are correct). For the 1000 Genomes Samples, we find that HLA*PRG outperforms both programs by a wide margin. Mean accuracy at 4-digit resolution across all loci is 75% for PHLAT and 80% for HLAreporter, and HLAreporter achieves a call rate of only 38%. To assess to what extent HLA*PRG depends on the availability of whole-genome data, we also apply it to whole-exome sequencing data of a cohort of HapMap samples. Results are varied and accuracies consistently lower across all loci (ranging from 79% for HLA-C to 98% for HLA-DQB1, Supplementary Table 2). To assess sensitivity of HLA*PRG to whole-genome sequencing depth, we subsampled the NA12878 data from the Platinum and 1000 Genomes projects to average coverages of 40x, 30x and 20x in triplicates. We find that performance is stable (all alleles correctly predicted) down to 20x for the Platinum data and down to 30x for the 1000 Genomes data (Supplementary Table 3). To assess whether HLA*PRG could be applied to additional HLA loci beyond the set of 6 genes validated here, we used it to genotype a set of 12 additional HLA genes and pseudogenes in the Illumina Platinum data (Supplementary Table 4). Across the 72 alleles inferred, we find one trio inconsistency at the pseudogene HLA-K, which is driven by an allele called with low confidence.
Our current implementation is optimised for accuracy rather than computational efficiency. Analysing the NA12878 Platinum data (55x depth) takes 11 hours (clock time, AMD Opteron 6174 2.2GHz), and 17 hours for the NA12878 1000 Genomes data (63x coverage). We provide a detailed runtime (including CPU time) and memory analysis in the Supplement (Supplementary Note). Achieving improvements in computational efficiency is ongoing work. Future versions might make use of linear sequence alignments to seed graph alignment and also leverage population haplotype frequencies 22, 23 .
In conclusion, we find that HLA*PRG infers HLA types at accuracies comparable to current gold standard typing technologies (two errors in the original reference data compared to one from HLA*PRG at 4-digit / 6-digit resolution), provided that high-quality (PCR-free protocol, read length of at least 100bp, coverage of at least 30x) whole-genome sequencing data are used as input. HLA*PRG will enable researchers to augment population-scale whole-genome sequencing data with reliable HLA type information and contribute to characterizing HLA signals in important medical phenotypes.    Finescale structure of the PRG input sequences. Exons, introns and UTRs are embedded in regional haplotypes (padding sequence). Exon sequences typically outnumber intron sequences. The red line indicates the region covered by IMGT genomic sequences. c c c c For each gene represented in the PRG, multiple sequence alignments representing up to 3 sources of sequence data are merged for PRG construction: exonic sequences, genomic (UTR, exons, introns) sequences, regional haplotypes ("xMHC Ref."). Using alleles present in both the current and the next-higher-level MSA (identifiers printed in red), the merging algorithm determines consensus boundaries (blue bars) to connect the MSAs of different input sequence types. For each segment so-defined, we use the MSA corresponding to the highest-resolution input sequence type (sequence characters therefore ignored are printed in grey). d d d d The PRG corresponding to the input sequences shown in c, and a seed-and-extend alignment of a sequencing read to the PRG. PRG nodes are represented by boxes and edges by labelled arrows. The four blue markers correspond to the consensus MSA boundaries shown in c. The aligned sequence of the read is displayed below the PRG, and the alignment path (the sequence of edges and nodes traversed in the PRG) is highlighted. The red component of the alignment path corresponds to the exact-match "seed" component of the alignment (spanning a graph-encoded gap), whereas the orange components correspond to the "extend" component of the alignment (where mismatches are allowed).

PGF (T*01:01) A C T A G C A _ _ A C G T A C T _ _ C C T A T G A MANN (T*01:04) T T T _ T T A _ _ A T C C G C T A C C C T A T G A xMHC
Ref.