Using de novo assembly to identify structural variation of eight complex immune system gene regions

Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.


Introduction
The capability of the human immune system to respond to environmental pathogens results from its substantial diversity and variability, both among individuals within a population and among cells within a single host. Key components of the innate and adaptive immune system, including the human leukocyte antigen (HLA), immunoglobulins (IG), T cell receptors (TCR) and killer-cell immunoglobulin-like receptors (KIR), have evolved exceptional complexity in their genomic loci, featuring numerous highly similar genes interspersed with pseudogenes and repetitive elements. Variation in genes encoding some of these components have wellestablished associations with infectious, immune-mediated, and other disease traits. The major histocompatibility complex (MHC) encoding HLA is so far the best-studied example, with hundreds of associations now known across multiple classes of disease [1,2] including infections [3,4]. In some cases the underlying functional mechanisms have also been identified [5,6]. However, despite the clearly important role of immunoglobulins (IG), TCR and KIR [7][8][9], the underlying complexity of these genomic regions has so far prevented a full analysis of their contribution to human disease.
Three challenges must be overcome to make these regions accessible to future studies. First, key aspects of adaptive immunity are driven by somatic recombination and hypermutation of TCR and IG genes in immune cells. Consequently, DNA from non-recombining cell populations is needed to access germline genetic variation in these regions; these are not targeted in current surveys of haplotype variation based on lymphoblastoid cell lines or whole blood [10][11][12]. Second, extensive paralogy makes these regions intractable to short-read sequencing approaches [13], although analyses based on known immunogenetic sequences can be achieved [14]. Approaches using more costly long-read sequencing must therefore be employed [15,16], though even these methods are not always sufficient to solve the most complex regions [17]. Third, even if these technical challenges can be dealt with, the high diversity observed at these regions presents further difficulties for methods that identify, catalogue, and genotype the underlying variation. Solving these challenges would in principle enable the development of large haplotype variation reference panels at these loci, complementing existing immunogenetic variation databases [18] and opening them to analysis in large disease association studies.
Motivated by these challenges, here we utilize genomic data from a single individual (HV31) to assemble eight regions that encode key components of the human adaptive and innate immune response. To achieve this, we use DNA extracted from CD14 + monocytes, which do not undergo systematic somatic recombination. We develop a pipeline that exploits PacBio HiFi long-read sequencing, Bionano optical mapping, and short-read sequencing data to produce high-quality de novo assemblies of these regions. We then use additional long-read and short-read datasets to assess assembly accuracy and to call heterozygous variations using computational approaches based on read alignment and the copy number distribution of short k-mers (i.e. short DNA fragments of fixed length k). We find that HV31 carries substantial structural differences between haplotypes and in comparison to the GRCh38 reference sequence, including multiple large variants that affect core immune system genes but are not accurately called by current methods, and we investigate several of these in detail. Lastly, we analyze four gaps in the GRCh38 reference sequence at the immunoglobulin κ and T cell receptor γ regions, that are fully or partially filled in our assembly.

Immune system loci display a spectrum of complexity in the human reference sequence
We focused on eight genome regions that encode components of the human immune system, namely those encoding the HLA, immunoglobulins (IGH, IGL, IGK), T cell receptors (TRA, TRB, TRD, TRG), and the killer-cell immunoglobulin-like receptors (KIR) ( Table 1). Regions were defined based on NCBI RefSeq locus definitions [19] (except HLA and KIR which were based on previously published gene ranges [20,21]), plus an additional 1Mb flanking sequence added to both sides (see Methods). In the IGK region we additionally expanded the range to include a~1 Mb heterochromatin gap present in GRCh38. The expanded regions range from 2-6 Mb in length and vary considerably in terms of repetitive structure and haplotype diversity ( Table 1). We noted the least reference sequence complexity in the T cell receptor α, δ and γ regions (which contain < 2% repeat sequence and no listed alternate haplotypes), but greater complexity in other regions. In particular, the regions encoding immunoglobulin subunits contain the highest levels of duplication; previous analyses [22,23] have demonstrated significant structural diversity among known haplotypes in these regions. GRCh38 also contains dozens of alternative haplotype sequences at the HLA and KIR regions, and four gaps in the IGK and TRB regions (Table 1). Comparison to the earlier GRCh37 assembly and the presence of fix patches highlights that these regions are likely to be challenging to assemble.

Assembling immune system regions with long-read, short-read and optical mapping data
We assessed whether the eight selected regions can be accurately assembled de novo using data from a single individual identified here as HV31. HV31 was recruited as a healthy volunteer and identified as having European ancestry. To facilitate accurate assembling of these complex regions, we generated data from multiple complementary platforms. Specifically, we performed PacBio Sequel II circular consensus sequencing (obtaining 12.3× genome coverage bỹ 12 kb HiFi reads), MGI short-read sequencing (56.8×) and Bionano Saphyr Direct Label and Stain (DLS) optical mapping (152.7× coverage by imaged molecules). In addition, long-read and short-read sequencing data from PacBio continuous long read (CLR; 35×), Oxford Nanopore Technologies (ONT) PromethION (63×), 10x Genomics linked-reads (40.2×), Illumina Novaseq PCR-free (44.2×), MGI single-tube long fragment read (stLFR) (51.3×) and MGI CoolMPS (56.9×) platforms were also generated from the same blood sample (S1 Table). To minimize the impact of cell-specific events including V(D)J recombination and somatic hypermutation and enable accurate assembly of the germline genome, the long-read data used for our assembly below were collected from CD14 + monocytes isolated from peripheral blood mononuclear cells (PBMCs) with antibody-conjugated beads (see Methods). Bionano imaging data, as well as several of our short-read datasets, were generated directly from PBMCs (S1 Table). We note that coverage by PBMC-derived reads drops significantly around T cell receptor genes (S1 Fig), consistent with an effect of V(D)J recombination in T cells which are the predominant cell type in PBMC. Datasets generated in this study are further detailed in S1 Table and have been deposited with the European Genome-Phenome Archive (Data Availability).
To generate an accurate representation of the eight regions in the HV31 genome, we developed an assembly pipeline consisting of four stages (Fig 1A). This approach deals with heterozygosity by producing a consensus assembly of each region, and a list of heterozygous  Fig 1B and S1 Dataset) that jointly describe the HV31 genome. We describe the steps of this assembly process and a comparison to other approaches below. Initial assembly. We used the Canu assembler [25] applied to HiFi reads to produce a draft whole-genome assembly. We aligned the resulting contigs to GRCh38 and extracted all contigs that overlap with the predefined regions of interest, hereafter referred to as local contigs, for further processing. Local contigs were highly fragmented (Figs 1C and S2), reflecting the unusual genomic complexity in these regions. The assembly also contained multiple shorter contigs (referred to as "haplotigs" below) aligning to the same location as longer contigs in some regions, which either represent assembly errors or genuine differences between haplotypes (S2 Fig). Heterozygous SVs on the unassembled haplotype that are larger than 1 kb in size are shown as orange diamonds or red triangles. Note that the assembled scaffolds (gray) are often larger than the originally selected immune system regions (blue) defined in Table 1. (C) Contig/scaffold continuity (NG50, y axis) for local contigs (gray) and finished HV31 assembly scaffolds (red) in each region (x axis). NG50 is defined as the length of the longest contig/scaffold that, along with longer contigs/scaffolds, covers 50% percent of each locus, as determined by alignment to GRCh38. The size of the selected region on the GRCh38 reference is also shown. To ensure comparable results, for each contig/scaffold, only the length within region boundaries is taken into NG50 calculation. (D) The estimated number of errors per megabase in each region, before and after assembly polishing. Error rates are estimated using a modified version of the Merqury algorithm [24]  Scaffolding. We next used the local contigs with Bionano optical imaging data to produce longer continuous scaffolds. Imaged DNA molecules had an observed mean length of 149 kb, substantially longer than reads from other datasets involved in this study (S3 Fig). We assembled these molecules using the proprietary Bionano Access software. As expected, the resulting contigs tended to be substantially longer than those in the draft whole-genome assembly (S4 Fig). We used the Bionano Solve algorithm to align the local contigs to the Bionano-assembled contigs and implemented a modified version of the BiSCoT algorithm [26] (Methods) to order and orient the local contigs accordingly. This process also removes or merges in haplotigs that can be effectively aligned to the hybrid scaffolds. Finally, we confirmed that the remaining haplotigs represented substantial duplication of scaffolded contigs using a k-mer based method (Methods), and removed these from downstream analysis. The scaffolds generated by this process fully covered six of the eight regions with a single scaffold, while the IGL and IGK regions were assembled with two scaffolds each (Fig 1B and 1C).
Gap filling and polishing. We further improved the assembly quality by carrying out a gap-closing step (which fills in nucleotide information for missing bases between adjacent contigs in a scaffold) using TGS-GapCloser [27] applied to local HiFi reads, resulting in the closing of seven gaps. We also implemented a polishing step using Pilon [28] applied to local HiFi and MGI reads, correcting erroneous bases in the assembly that likely originate from sequencing errors. To avoid bias due to read selection, for both processes we selected relevant reads using a double-alignment process that first aligns all reads to the initial whole-genome assembly, and then realigns the subset of reads mapping to local contigs to the fully scaffolded assembly (referred to as locally aligned reads below; see Methods). This process left six gaps (i.e. sequences of 'N' bases) in the HV31 scaffolds ( Fig 1B); these lie outside regions aligning to core immune system genes but could potentially be improved with additional processing.
Structural variant calling. We used the available long-read data to call heterozygous SVs using the HV31 assembly as reference ( Fig 1B). In brief, SVs were called separately from locally aligned HiFi, CLR and ONT long reads using PBSV (for HiFi and CLR) and Sniffles (for HiFi, CLR and ONT). A computational approach based on unique k-mers [29] was used to refine read alignment before variant calling (see Methods). Across the eight regions, 1,366 SVs were reported by PBSV or Sniffles, 491 of which were jointly supported by two or more dataset-software combinations (S5 Fig and S1 Dataset; including 179 >100 bp and 23 >1 kb in length), as reported by SVanalyzer [30], which analyzes the sequence information of each variant and identifies groups of compatible variants. As these numbers indicate, we observed considerable discrepancy between the individual SV calling approaches (S6 Fig), reflecting the difficulty of aligning reads and calling SVs in paralogous regions. As a comparison point, we also created a dataset of SV calls based on 10x Genomics linked-read sequencing data using the Long Ranger pipeline (S2 Dataset). We compare these structural variant calling results with the HV31 assembly further below. We also note that six gaps remain in the HV31 assembly (Fig 1B), which lie outside regions aligning to core immune system genes but could potentially be improved with additional processing.
We refer to the polished assembly scaffolds and SV dataset generated by these steps as "the HV31 assembly" hereafter; the assembly is summarized in Fig 1 and compared to the GRCh38 reference in Fig 2 and to other published assemblies in S7 Fig. Benchmarking information of the assembly pipeline is available as S3 Dataset. As we describe below, the per-base error rate of these assembled regions is on the order of 5-50 errors per Mb (Fig 1D), which is of a similar magnitude to recently published whole-genome assemblies based on HiFi data [17,31], although some structural errors do remain. The HV31 assembly is also comparable in coverage and continuity to other assemblies (S7 Fig), including to recently published high-quality assemblies of the homologous CHM13 cell line [29,32].

Structural variation revealed by comparison to GRCh38
We used k-mer sharing plots (i.e. "dot plots" [33]; S8 Fig) to compare HV31 to the GRCh38 reference sequence (Fig 2). Each point in these plots represents a short sequence of length k (here k = 50) that is shared by both the reference sequence and the HV31 assembly; the observed pattern of points therefore provides a visualization of similarities and differences between the two assemblies. This comparison further suggests that the HV31 assembly is relatively complete for the eight regions, without apparent missing sequence (apart from the six gaps mentioned above) or chimera sequences. HV31 contains two scaffold breaks at the IGK and IGL loci; both are located near long (� 100 kb) SDs that are highly identical (� 99%) and indicate that this type of SD remains challenging for current assembly methods. In contrast, genomic loci with higher proportions of shorter, low-similarity SDs such as the HLA and KIR were completely resolved in the HV31 assembly.
Close inspection of these plots (S4 Dataset) reveals many large (� 1 kb) SVs that differ between GRCh38 and the HV31 primary assembly. To systematically characterize these SVs, we aligned the assembly to GRCh38 and applied Assemblytics [34]. Assemblytics reported 145 SVs, 55 of which were � 1 kb in size ( Fig 1B and S5 Dataset). The majority (65.5%) of the reported SVs involved expansions or contractions of repeat elements, while the rest were insertions or deletions of unique sequences. The KIR region harbors the highest number (29) of SVs, followed by IGH (28) and IGK (24) regions.

Validation of assembly accuracy using unassembled sequencing reads
Given that HV31 differs structurally from GRCh38, an important question is how the structure of our assembly of these regions can be confirmed (or conversely how any remaining errors can be identified) without reliance on a reference sequence. Motivated by previous work [24,35], we adopted an approach based on computing the multiplicity of each assembly k-mer in a validation dataset, which we here take as the set of sequence reads including PacBio HiFi reads and all short reads from MGI, 10x and Illumina platforms (S1 Table), which were chosen due to their relatively low error rates (S9 Fig). This dataset has over 150× coverage of k-mers appearing in both copies of the genome (Fig 3A), and is sufficiently high-coverage that heterozygous k-mers, and k-mers in higher repeat numbers, can be separated from homozygous kmers ( Fig 3A and 3B). Under the assumption that sequence reads are approximately uniform across the genome, the multiplicity of each k-mer in the validation dataset should be proportional to the its copy number in the HV31 assembly [36]; any discrepancies therefore indicate heterozygous variation or assembly errors.
To leverage this, we plotted validation multiplicity in comparison to scaffold multiplicity across all regions (S10 Fig; illustrated for part of the IGH region in Fig 3C and 3D). The assembly and validation dataset are in generally good agreement at most informative k-mers, (i.e. relatively few large regions show validation k-mer multiplicity systematically incompatible with assembly multiplicity), including across several repetitive regions with assembly copy number systematically greater than 1. We used a similar approach to estimate base-level errors, which we identified as clusters of k-mers with validation multiplicity < 5 (using k = 22 as recommended previously [24]; this approach is similar to but more stringent in practice than the previously published Merqury method [24]; S2 Table and Methods). These errors are relatively sparse; on average across regions the error rate was 18.1 per 1 Mb sequence (based on 3.16×107 k-mers, of which 8786 (0.0278%) were deemed erroneous; Fig 1D; improved from 85.3 per Mb prior to polishing). However, a number of locations show larger discrepancies between assembly and validation data (numbered regions in Panel B in S10 Fig; S3 Table). We examined these in detail and found that many reflect heterozygous structural variants (illustrated in Panel A in S11 Fig for a heterozygous deletion, and in Fig 3D for a complex heterozygous duplication that we discuss further below) as well as the aforementioned assembly gaps. However, a small subset of these locations indicate possible structural errors in our assembly. These include a~30 kb duplication in the HLA region that we confirmed is incorrectly assembled in a single copy (i.e. "collapsed") in HV31 (Panel B in S11 Fig), as well as three relatively extensive stretches of elevated multiplicity in the IGK and IGL regions where we were unable to fully confirm the assembly structure using the k-mer approach (S10 Fig).
We also implemented a comparison to contigs de novo assembled from optical mapping data, which suggested that several of these regions were correctly assembled (Methods). In general, this analysis indicates that our assembly of HV31 is substantially accurate apart from three repeat-rich segments in the IGK, IGL and HLA regions that may still contain errors. Comparing the HV31 assembly with GRCh38. Main panels show dot plots indicating locations of shared k-mers (k = 50) on the GRCh38 reference (x axis) and on the HV31 assembly (y axis) for each locus studied. Multiple scaffolds are separated with orange horizontal dashed lines. Plots are annotated as follows: Gene, core genes of each locus. SegDup, segmental duplications defined as sequence fragments that are � 1 kb in length and � 90% identical to another fragment (segmental duplications with identity � 99% are highlighted in orange; reference gaps are shown in gray). SV, structural variants detected in the HV31 assembly relative to GRCh38. Structural variants larger than 1 kb are highlighted in dark red. https://doi.org/10.1371/journal.pcbi.1009254.g002 We note two issues that impede validation by short k-mers. First, accurate measurement of k-mer multiplicity in highly repetitive regions is challenging; in our assembly this is particularly relevant to three relatively extensive stretches of elevated multiplicity in the IGK and IGL regions where we were unable to fully confirm the assembly structure using the k-mer approach (S10 Fig). Secondly, there is a trend towards a drop in coverage of the validation kmers in regions encoding T cell receptors (S10 Fig); as discussed above this is due to the use of DNA from PBMCs for some platforms (S1 Fig). An alternative approach based on coverage of Vertical lines show locations of distribution peaks; text indicates interpretation of k-mers near these peaks (het., heterozygous; hom., homozygous; CN denotes assumed true copy number of the k-mer in the diploid HV31 genome). (C) k-mer multiplicity (y axis) plotted against k-mer position for a repeat-rich sequence fragment in the IGH region of the HV31 assembly. Green, blue and purple colors denote the k-mer copy number in the HV31 assembly scaffolds. Non-specific k-mers that are also found outside the IGH region are colored gray. (D) validation k-mer multiplicity normalized against assembly copy number for the same sequence fragment as in (C). To make the y axis center on 1, values are further normalized by dividing by the peak multiplicity of unique homozygous k-mers as shown in panel A. The region of discrepancy indicates a complex heterozygous duplication around IGHV3-30 that we discuss further below. In

HV31 contains diverse complex immune system structural variants
As detailed above, a nontrivial number of large structural variants exist between GRCh38 and HV31 as well as between the two haplotypes of HV31. To assess the impact of these SVs on core genes, we used an alignment process to identify the best-matching allelic variant of each immunoglobulin and T cell receptor variable gene segment, and each HLA and KIR gene within the relevant IMGT or IPD database (Fig 4). Relative to GRCh38, the HV31 scaffolds contain both insertions and deletions of gene sequence in the IGH, IGK, IGL and TRB regions. It also contains allelic variation in all regions except TRG. We also noted a small number of genes that differ from the best matching IMGT allele, and may represent novel sequence. We note that HV31 gene content also differs substantially from the GRCh37 assembly [22].
In interpreting these results, some care must be taken because of the consensus nature of the HV31 scaffolds, which do not necessarily represent a single haplotype at each locus. To elucidate underlying genetic variation, we investigated the genetic basis of the observed copy number changes in detail, focusing on the IGH and TRB regions and described in the following sections.

A tandem repeat within a 45kb CNV involving IGHV1-69 and IGHV2-70
Variation in the copy number of IGHV1-69 and IGHV2-70 genes has previously been reported [22]. Both genes are present in two copies in GRCh38. In the HV31 scaffold, we found only one copy of IGHV1-69 and IGHV2-70 remaining, as the result of a 45 kb copy number contraction relative to GRCh38 (Figs 4 and 5A; variant IGH_b_29 in S1 Dataset). The earlier GRCh37 reference genome shared a similar haplotype in the IGH region, with only one copy of IGHV1-69 and IGHV2-70 genes. This haplotype has been suggested to be more common worldwide than the GRCh38 haplotype [14,22] and comparison to validation k-mers indicates it is homozygous in HV31 (S10 Fig). We noted that this CNV appears to be effectively callable by aligning reads to GRCh38, e. g. manifesting as a coverage gap in aligned PacBio HiFi reads ( Fig 5B); a co-located deletion was also called by the 10x pipeline, though the endpoints and length appeared inaccurate (S2 Dataset).
Within this 45 kb CNV, we also noticed a 2.66 kb cluster of tandem repeats with a 59-mer motif (Figs 5A and S12) that was not correctly assembled in either GRCh37 or GRCh38 (see GenBank: AC245369.4). Similar repeat clusters have also been reported for CHM1 (from which the GRCh38 sequence for IGH region was derived) and NA19240 samples, though the copy numbers of the 59-mer motif varied [15].

A compound heterozygous CNV involving IGHV3-30
A second prominent feature of the IGH region is the loss of one copy of the IGHV3-30 gene. GRCh38 carries two copies of the IGHV3-30 gene, namely IGHV3-30 and IGHV3-33 [14]. We find that IGHV3-33 is removed in the HV31 assembly, together with IGHV4-31 ( Fig 5C). However, inspection of Bionano contigs covering this region revealed a further unusual feature ( Fig 5D): one of the two contigs covering this region contains a corresponding three-fold copy number expansion. To confirm this, we inspected validation k-mer multiplicity in the surrounding area, and observed elevated multiplicity compatible with a 3-fold expansion on the unassembled haplotype ( Fig 3D). As further confirmation we also searched for reads spanning this region and observed a CLR read consistent with the expanded haplotype (S13 Fig). These results therefore indicate that the unassembled haplotype carries three copies of the region Pseudogenes are not shown. V genes in each region are arranged according to their relative order on the positive-sense strand. Allelic variants refer to genes where the best-matching HV31 allele differs from the GRCh38 allele. Insertions refer to genes in the HV31 assembly that cannot be matched to a GRCh38 gene. Alleles with identical sequences, such as TRBV6-2 � 01 and TRBV6-3 � 01, are not distinguished. Alleles that carry additional SNPs compared to the best-matching reference allele are marked with stars. The sequence fragment between IGK proximal and distal clusters that remains not fully resolved is denoted as a red line. surrounding IGHV3-30, such that HV31 carries both a contraction and expansion of this region, with an overall copy number change that would not be evident from coverage analysis of reads aligned to the reference sequence. This observation is also compatible with previous work [22] which reports this region as a hotspot for SVs, with the diploid copy number of IGHV3-30 and related genes ranging from zero to six. This CNV was not called accurately by any of the SV calling methods we employed (S1, S2 and S5 Datasets). Collapsed repeats (i.e. repeats incorrectly assembled in a single copy; red arrows) in GRCh38 were resolved in the HV31 assembly inside a 45 kb copy number contraction (i.e. reduction in copy number of a repeat unit relative to GRCh38, shaded in grey). (B) Pileup of HiFi reads aligned to GRCh38. The 45 kb CNV was evident from the uncovered central region. The resolved repeats were shown as alignment breakpoints (orange) or insertions (purple) within the aligned segments (gray). Nearby short deletions (red) were also evident. (C) k-mer sharing plot (k = 50) showing complex structural variations found between IGHV3-21 and IGHV3-43, including a 25 kb copy number contraction (i.e. reduction in copy number relative to GRCh38 of a repeat unit; blue) and an 80 kb complex duplication event (orange). (D) Alignment patterns of HV31 assembly and two Bionano contigs covering the region shown in (C). The four rows show the HV31 assembly (green horizontal bars) and two Bionano contigs (blue horizontal bars). Vertical/diagonal grey lines between HV31 and the Bionano contigs show alignments of DLE-1 recognition markers; alignments are computed by the Bionano Solve algorithm based on inter-marker distances. The relevant marker positions are indicated by vertical lines inside the contigs (colored dark blue for aligned markers and yellow for markers that were not aligned). The alignment patterns indicate that Bionano contig 2 contains an expansion consistent with the triplication of the region highlighted in blue. Black numbers indicate the approximate repeat units and relevant inter-marker distances are annotated above the contigs. An 80 kb complex duplication involving multiple IGHV genes HV31 carries additional copies of IGHV1-38, IGHV3-43, IGHV4-38 and IGHV3-38 genes compared to GRCh38, that are contained in a~80 kb duplication with complex structure (Fig  4 and Fig 5C). Inspection of k-mer multiplicity data implies this duplication is homozygous (S10 Fig). However, this duplication was not called by any of the methods we used to call SVs (S1, S2 and S5 Datasets). We interpret this as resulting from difficulty in aligning the two sequences; consistent with this, we observed that the HiFi reads in this region displayed suspicious alignment patterns when mapped to GRCh38, which were improved when mapped to the HV31 assembly (S14 Fig).

Large insertions incorporating novel TRBV genes
In the TRB region, we detected a~11 kb homozygous insertion near TRBV6-2 and another 19 kb insertion near TRBV5-7 (Fig 6A). Both are supported by Bionano contigs (Fig 6B), and both insertions incorporated sequence fragments that are not found in GRCh38, with limited homology to adjacent sequences (Fig 6A). Comparison to k-mer validation again implies both insertions are homozygous. Assemblytics identified duplications at both locations but with inaccurate length and sequence content (S5 Dataset). The HV31 scaffold was consistent with an alternative contig for the TRB locus included in GRCh38 (RefSeq NG001333.2; S15 Fig). By comparing NG001333.2 with GRCh38, we confirmed that the 11 kb insertion introduced TRBV4-3 and TRBV6-2 genes and a TRBV3-2 pseudogene, while the 19 kb insertion introduced the TRBV6-9, TRBV7-8 and TRBV5-8 genes (Figs 6 and S15).

Reference gaps amidst complex segmental duplications resolved in the HV31 assembly
In addition to the complexities engendered by structural variation, gaps in the reference genome constitute another potential impediment to the analysis of genetic variation. Large gaps typically arise due to highly repetitive sequence that is challenging to assemble (e.g. heterochromatin regions that often consist of megabase-scale tandem satellite repeats), and their functional significance remains largely unexplored [37].
We were therefore interested to note that the HV31 assembly closes three large gaps in the GRCh38 assembly, and partly closes a fourth. Three of these gaps lie in the IGK region (S16 Fig), while the fourth lies within a 400 kb region of high-identity segmental duplications [38] 1Mb downstream of TRB genes (S17 Fig). We here focus on the largest such gap, a~1 Mb gap annotated as heterochromatin in GRCh38, located between the distal cluster of IGK genes and the centromere of chromosome 2 (S18 Fig). Examination of this gap revealed a~650 kb sequence assembled as an array of approximately 115 imperfect tandem copies of 6 kb repeat units (Panel A in S19 Fig). Most of the repeat units contain a 22-bp signature sequence (TTCGATTCCATTTGATGATTCCAT), indicating that the heterochromatin sequence belongs to the human satellite HSat2B family [37].
Notably, the assembled heterochromatin sequence also contains a nonrepetitive sequence fragment within the assembled heterochromatin sequence (Panel B in S18 Fig). This 32 kb fragment appears unique to the region, sharing no significant homology with either the rest of the heterochromatin region nor any part of GRCh38. The heterochromatin sequence does not contain the recognition motif of DLE-1 (CTTAAG) used in generating optical maps, and we were therefore unable to directly confirm the arrangement using Bionano contigs, though a marker corresponding to the 32 kb unique sequence could be identified (Panel C in S18 Fig).
We compared the HV31 assembly with the recently reported T2T CHM13 assembly [32] (GenBank: GCA_009914755.2), where the IGK region is fully reconstructed in one contig. The corresponding heterochromatin sequence in the T2T CHM13 assembly is consistent with the HV31 assembly in terms of total length and repeat unit sequences, though the specific order and orientation of these repeat units differ (S19 Fig). This is of interest because it potentially reflects the structural variability in this heterochromatin region. A similar 32 kb unique sequence is also found in the T2T CHM13 assembly, though at a different location (Panel B in S19 Fig). In addition, we found this 32 kb fragment, along with 76.8 kb flanking sequences, was over 90% identical to a 108.8 kb unplaced sequence (GenBank: AP023554.1; Panel D in S18 Fig) assembled recently from individuals of Japanese ethnicity [39]. Similar 'islands' of unique sequence amid heterochromatin regions have previously been suggested for chromosome Y [40] and chromosome 21 [41].
Eleven heterochromatin gaps remain in GRCh38, with estimated sizes ranging from 20 kb to 30 Mb, and only the largest heterochromatin gap in chromosome X has been resolved in the form of a de novo assembly so far [29]. The resolved heterochromatin sequence for chromosome 2 in the HV31 assembly may provide insight into other problematic heterochromatin regions.

Discussion
Genetic regions encoding the human immune response are among the most important for medical science, harboring determinants of both infectious and immune-mediated diseases. Despite this, the complex structure and diverse nature of many of these loci has hindered assessment of their contribution to health in large-scale genetic studies. Resolving this will likely require the creation of reference haplotype variation datasets linked to genome function.
The greatest progress in this direction has been made for inference of HLA alleles [42][43][44][45] and to a lesser extent KIR alleles [46] based on large databases of known gene sequences. More recent approaches have used targeted long-read sequencing to characterize immunoglobulin variation [15]. However, much functional genetic variation affecting core immune system genes is still to be discovered and made accessible to larger studies. The decreasing cost and improving performance of long-read sequencing technologies suggests a possible route to this through the de novo assembly of representative individual genomes, although computational, cost, and analytic challenges remain in practice.
In this study, we contribute to this program by assembling eight of the most complex immune system regions in a single healthy European individual identified as HV31, who was recruited as part of a larger study of genomic and transcriptomic variation in immune cell types. To do this, we based our assembly on DNA extracted from CD14 + monocytes, allowing us to assess the germline haplotype configurations in the immunoglobulin and T cell receptor regions. We exploited accurate PacBio HiFi data for the initial assembly, and used optical mapping data to scaffold the assembled contigs, followed by additional gap closing and polishing steps (Fig 1). Alongside this, we also developed a set of methods to validate assembly correctness using available datasets. Although some structural errors do remain-and might be resolvable with future work based on the breadth of data we have generated here-our results suggest we have produced an essentially correct representation of the regions reported here, with some caveats that we have noted. Our assembly thus adds to growing numbers of reference-quality assemblies that can be utilized in these regions [15,23,47,48] and to the catalogue of known alleles at these loci. The sequencing data and finished assemblies generated in this study have also been deposited at the European Genome-Phenome Archive (Data Availability) for future use.
An ideal description of a diploid sample would involve two fully assembled contigs for every somatic chromosome, each representing one of the two inherited haplotypes. This approach is sought by several emerging experimental and computational approaches [48-51]. Generating a phased assembly inevitably involves a trade-off among cost, phasing accuracy and assembly continuity, which is further complicated by the presence of large high-identity duplications [51]. Given the limited depth of HiFi reads in our data (S1 Table), we instead chose to represent the HV31 genome as a set of consensus (i.e. mixed-haplotype [52]) scaffolds supplemented with a list of heterozygous variants (S1 Dataset). This is a pragmatic approach which also simplifies downstream analysis as the assembly can be directly used in place of a reference genome without additional preparation.
Fully assembling regions containing long, complex repeat structures from shotgun sequencing remains a challenging problem; any approach must somehow distinguish reads coming from different repeat copies but identify those that truly overlap-while allowing for sequencing errors. Exemplifying this challenge, our assembly of the IGK region fills a heterochromatin gap in GRCh38 that largely consists of high copy-number repeats but also contains unique sequence (S18 Fig). Direct confirmation of the assembled structure of IGK is difficult using our data and may require further experimental methods to achieve, but our assembly of this region is compatible with other recently reported assemblies. The fact that most of the repetitive regions within the loci studied here are correctly assembled is of interest in itself, as it implies that the repeats are old enough (or divergent enough) to be effectively distinguished.
Given the high diversity observed in these regions it is unsurprising that HV31 carries structural variants which make it differ from current genome builds and from other previously reported samples. Some of the variation we have identified in HV31 has previously been associated with phenotypes-for example, copy number polymorphism of IGHV1-69 is known to strongly correlate with the prevalence of this gene in expressed antibody repertoires, which is preferentially used in antibodies against certain influenza strains and the HIV-1 virus [7]. However, we have also highlighted extensive variant haplotypes that have not previously been reported. The degree of variation observed in this single sample indicates that much haplotype variation of immunoglobulins and T cell receptor regions remains to be discovered. It is also notable that many of the more complex variants we have identified are not accurately called by the variant calling methods we employed (S1, S2 and S5 Datasets). but this is also an area of active development [17,30,53]. The costs associated with current long-read platforms remain high, but if this can be overcome (e.g. through further improvements in throughput or targeted sequencing approaches [15]) then these methods will become applicable to larger samples, enabling a full catalogue of haplotype variation in these important regions to be generated.

Ethics statement
HV31 was recruited as a healthy volunteer under approval by the Oxfordshire Research Ethics Committee (COREC reference 06/Q1605/55). The donor provided written informed consent for the use of their blood in research.

Software
Bioinformatics software and algorithms used in this study are summarized in S4 Table. DNA extraction, sequencing and optical mapping Blood sample of a healthy female donor of European ancestry identified as HV31 was used in this study. For PacBio HiFi and CLR sequencing, Oxford Nanopore sequencing, 10x linkedread sequencing and MGI stLFR linked-read sequencing, the DNA was extracted from CD14 + monocytes isolated from PBMC using CD14 antibody-conjugated beads (Miltenyi Biotec). DNA extraction was performed with QIAGEN MagAttract HMW DNA kit following manufacturer's instructions, with slight adaptations. In brief, 220 μl Buffer ATL and 20 μl Proteinase K were added to a suspension of 1×10 6 CD14 + monocytes. The mixture was incubated overnight at 56˚C, shaking at 900 rpm to lyse the cells. The lysate was then processed according to manufacturer's instructions for the purification of high-molecular-weight genomic DNA from tissue.
For MGI standard short-read, MGI coolMPS and Illumina PCR-free sequencing, the DNA was extracted from PBMC with NEB Monarch Genomic DNA Purification Kit (T3010) following manufacturer's instructions. Library preparation, sequencing and optical mapping were performed following the instructions of the respective platform providers.

Region definition
Eight genomic regions encoding key components of the human immune system, including HLA, IG, TCR and KIR were selected for investigation (Table 1). Each region was defined as a core range in GRCh38 that contained genes related to immune system components, with additional flanking sequences added to both sides. The core range were typically selected based on the respective reference sequences in the NCBI RefSeq database [19]. As exceptions, the core range of the HLA region was defined as the genomic range from GABBR1 to KIFC1 genes [20], and the KIR region was defined as the genomic range from KIR3DL3 to KIR3DL2 genes [21]. The flanking sequence was typically 1 Mb on either side. As exceptions, the telomeric flanking sequence in the IGH region was limited to 164 kb by the length of chromosome 14. In addition, we expanded the centromeric flanking sequence in the IGK region by 0.67 Mb to bridge a 1 Mb heterochromatin gap present in GRCh38.

GRCh38 annotations
The GRCh38 segmental duplication, alternative haplotypes and fix patches annotations were downloaded from the UCSC Table Brower [54] based on the genomicSuperDups, altSeqLiftO-verPsl and fixSeqLiftOverPsl datasets, respectively.

Hybrid scaffolding and haplotig removal
Hybrid scaffolding was performed using Bionano Solve, a proprietary software provided by Bionano Genomics (https://bionanogenomics.com/), with default parameters. We used a custom script based on BiSCoT [26] to improve the contiguity and quality of the resulting scaffolds. Specifically, we merged adjacent contigs in a scaffold if they overlap with each other, as inferred from shared enzymatic labelling sites or sequence alignment. If the two adjacent contigs were expected to be non-overlapping, they were joined with a gap (i.e. a sequence of "N" bases) between them, the size of which was estimated based on the distance of nearest labelling sites. In addition, we incorporated shorter contigs into longer ones if the shorter contig represented a subsequence of the longer contig, and aligned better with the Bionano genome maps.
After scaffolding, we removed duplicated contigs or scaffolds that presumably represent alternative haplotypes ('haplotigs') using a custom k-mer based method. In brief, we listed all unique 22-mers for each contig or scaffold and compare these sets of 22-mers in a pairwise manner. If a shorter contig had more than 80% of unique 22-mers shared with a longer contig, then the former was considered as a haplotig and removed from the assembly.

Read mapping
Sequencing reads from each locus of interest were required for various purposes including gap closing, polishing, error rate estimation and assembly validation based on alignment coverage and patterns. In order minimize reference bias, we first mapped the reads from each sequencing dataset using minimap2, and then extracted reads that mapped to contigs that represent each locus of interest [48]. The extracted reads were again mapped with minimap2 to the scaffolded or finalized assembly as appropriate for specific applications.
A unique k-mer anchoring method [29] was used to improve the mapping of long reads in repetitive regions. In brief, given a set of locus-specific reads and a corresponding reference sequence, we first defined a set of anchoring k-mers for each locus of interest. Only k-mers that appeared to be unique in both short read sequencing datasets (31 � multiplicity � 231) and the reference sequence (copy number = 1; no occurrence outside the locus) were selected as anchoring k-mers. Then, we mapped the reads to the reference with minimap2 using parameters -n 50 -r 10000, which enabled the output of up to 50 alignments for each read, with gap sizes up to 10 kb in each alignment. An optimal alignment for each read were was then selected based on the number of bases shared with the reference that were part of an anchoring k-mer. These selected alignments were pooled into a new BAM file, after filtering out alignments that were shorter than 5 kb. The resulting BAM file were used for polishing and reference-free alignment validation.

Gap closing and polishing
Gap closing was performed using TGS-GapCloser v1.0.1 [27] with PacBio HiFi reads. Sequencing reads were first mapped to the whole genome assembly produced by Canu, which enabled locus-specific read extraction. The extracted reads were used as input for TGS-Gap-Closer, which was executed using the following parameters: -ne-tgstype pb-g_check. Polishing was performed using Pilon [28] with HiFi reads and MGI paired-end short reads extracted in a similar manner. The default parameters were used. For clarity, the finalized scaffolds were displayed and coordinated based on the relative order and orientations of the corresponding sequence in GRCh38 in visualization steps.

Error rate estimation and reference-free assembly validation
Jellyfish [57] was used to count the multiplicity of each k-mer (k = 22 or 31) from a pooled FASTQ dataset of PacBio HiFi, MGI standard short-read, MGI CoolMPS, MGI stLFR linked read, 10x Linked-Read and Illumina PCR-free sequencing platforms (S1 Table), with the following parameters: jellyfish count -m <k> -s 30G -min-qual-char "?" -C. The accumulated sequencing depth of the pooled FASTQ dataset was 262×. In each read, k-mers that include bases with base quality < 20 were excluded. For error rate estimation, k-mers (k = 22) in the HV31 assembly with multiplicity < 5 were classified as erroneous k-mers, and clustered by their positions in the assembly, allowing a maximum of k-1 correct k-mers between two adjacent erroneous k-mers in each cluster. The number of erroneous k-mer clusters per Mb assembled sequence was used as an indicator of the error rate of the HV31 assembly.
For reference-free assembly validation, we define the normalized multiplicity (N) of each kmer (k = 31) in the HV31 assembly as N = M / (C × D), where M is the multiplicity of that kmer in the validation dataset, C is the copy number of that k-mer in the HV31 assembly, and D is the mode multiplicity of unique homozygous k-mers in the validation dataset, as estimated from the k-mer multiplicity histogram (Fig 3A). The normalized k-mer coverage was visualized against the position of the k-mer, along with the normalized coverage of ONT reads aligned to the assembly using the k-mer anchoring method. Regions where the normalized kmer coverage or normalized ONT coverage deviated from 1 were labelled and inspected for potential assembly errors (S3 Table).

Allelic variant detection
Reference variant sequences of IGHV, IGKV, IGLV, TRAV, TRDV, TRBG and TRGV genes were downloaded from the IMGT reference directory [60] (http://www.imgt.org/download/ V-QUEST/IMGTV-QUESTreference_directory.zip). Reference variant sequences of HLA genes were downloaded from the IPD-IMGT/HLA database [61] (ftp://ftp.ebi.ac.uk/pub/ databases/ipd/imgt/hla/fasta/hla_gen.fasta). Reference variant sequences of KIR genes were downloaded from the IPD-KIR database [62] (ftp://ftp.ebi.ac.uk/pub/databases/ipd/kir/fasta/ KIR_gen.fasta). The reference gene variant sequences were mapped to GRCh38 or the HV31 assembly using minimap2 with the following parameters: -a -w1 -f1e-9. We extracted subsequences in regions where at least one reference gene was mapped, with 20 bp flanking sequence at either side. These query sequence fragments were submitted to NCBI IgBLAST [63] (for IGHV, IGKV, IGLV, TRAV, TRDV, TRBV and TRGV genes) or NCBI BLAST+ [64] (for HLA and KIR genes) to search for matching sequences in the relevant databases, with default parameters. The top hit variant with the highest match score returned by NCBI IgBLAST or NCBI BLAST+ were assigned to each query fragment. Query fragments shorter than the top hit variant were considered to represent partial alignment and discarded.

S1 Fig. Depth of coverage of sequencing reads across platforms and regions.
For each sequencing platform (colored lines, with platforms as in S1 Table), the plot shows the depth of coverage of reads aligned to GRCh38 across the eight selected regions (Table 1). Depths are normalized by the average depth across each region for each dataset. Areas with apparent systematic lower depth in the TRA, TRB and TRG regions are highlighted with black triangles. Datasets generated with DNA from CD14+ monocyte and PMBC are denoted with solid and dashed lines, respectively. Decrease of sequencing depths is not identified in immunoglobulin regions, presumably due to the relatively low fraction (5-15%) of B cells in PBMC. Despite γδ T cells being rarer than α/β T cells, the coverage drop around T cell receptor γ genes can be explained by the fact that the γ locus is known to undergo rearrangement in most α/β T cells [65], and the drop around T cell receptor δ genes can be explained by the fact that any rearrangement at the α locus leads to the loss of the δ locus. [66] (PNG)

S2 Fig. Whole-genome assembly contigs aligned to GRCh38 in each of the 8 selected loci.
Each row represents one local contig extracted from the draft whole-genome assembly ( Fig  1A) by alignment to GRCh38 using minimap2 (Methods). The x axis reflects GRCh38 coordinates across each of the selected regions (Table 1) show the estimated per-base error rate of each dataset as estimated using GenomeScope [35]. (PNG) S10 Fig. Assembly validation based on k-mer multiplicity. (A) For each k-mer (k = 31) that appears in the HV31 assembly, the multiplicity of that k-mer in the validation dataset (y axis) is plotted against the position of that k-mer (x axis), colored by the copy number of that k-mer in the HV31 assembly as shown in the legend. (B) the normalized k-mer multiplicity (y axis), defined as ratio of validation k-mer multiplicity to assembly k-mer multiplicity computed using the values shown in (A), plotted against the position of that k-mer (x axis). Values are further normalized by dividing by the peak multiplicity of unique homozygous k-mers as shown in Fig 3A, such that these kmers are expected to lie near y = 1. k-mers that found both inside and outside the given regions are considered noninformative and are shown in gray. Orange lines show ONT read coverage depth normalized to the genome-wide average coverage depth (63×). k-mer sharing plot (k = 50) comparing GRCh38 (x axis) with the HV31 assembly (y axis). The 2.56 Mb scaffold and the 1.97 Mb scaffold in the HV31 assembly are shown in blue and green, respectively. Coverage of ONT reads aligned to GRCh38 is displayed above, and the proximal and distal clusters are annotated. Gaps in GRCh38 are shaded in gray. Novel sequence junctions in the HV31 assembly are annotated with red arrows. Sequence fragments of which extra copies were introduced in the HV31 assembly to fill in the gaps between IGK proximal and distal gene clusters in GRCh38 are highlighted in yellow; corresponding read coverage peaks confirm increased genome multiplicity of these fragments. (B) Alignment of Bionano contigs (blue) to the 2.56 Mb scaffold in the HV31 assembly (green). DLE-1 labels and their alignments are denoted by colored lines within and between scaffolds) as described in Fig 5D legend; note that all gray alignment lines connect the HV31 scaffold to each of the Bionano contigs (no between-Bionano alignments are shown). The approximate sequence region that maps to the GRCh38 gaps between IGK proximal and distal gene clusters is shaded in gray. k-mer sharing plot (k = 50) comparing the HV31 assembly (y axis) with the T2T CHM13 assembly (x axis) in the IGK region. The 2.56 Mb scaffold and the 1.97 Mb scaffold in the HV31 assembly are shown in blue and green, respectively. (B) k-mer sharing plot as in panel (A), zoomed in to show details of the heterochromatin region. The assemblies contain similar sequence in the heterochromatin region, but with some differences including in the position of the unique sequence (highlighted with a red arrow) as noted in main text. (PNG) S1 Table. Summary of sequencing and genome mapping datasets. Table shows the name, cell type, genome coverage estimated by alignments to GRCh38, and read length statistics for each dataset generated in this study. (PDF) S2 Table. Comparison of original and modified Merqury error rate estimation results. Table shows per-base error rate estimates for local scaffolds and the finished HV31 assembly (Fig 1A) using the Merqury algorithm, and the modified algorithm described in Methods and Fig 1D legend. As described in Methods, the key difference is that we estimate based on clusters of kmers with low validation coverage and this produces slightly higher estimates in practice than the Merqury method. (PDF) S3 Table. Potentially problematic regions in the HV31 assembly as identified from k-mer multiplicity. Table shows details of locations where validation k-mers show discrepancy from expectation (as numbered in S10 Fig) and our conclusion about the region. For each region, the table lists the relevant assembled region (Table 1), number as shown in S10 Fig, type of evidence we have inspected to provide additional evidence, and our conclusion about the cause of discrepancy. (PDF) S4 Table. Detail of bioinformatics tools used in this study. (PDF) S1 Dataset. List of structural variants reported by SVanalyzer. Table shows structural variants detected by applying pbsv and sniffles to the long-read sequence reads aligned to the HV31 assembly, as output by SVanalyzer and described in main text and methods. Columns reflect the region, scaffold, position and identifier of the SV, the assembly and alternate alleles, and the genotypes assigned by each combination of long-read dataset and SV calling method. (TSV) S2 Dataset. Structural variants detected by the 10x Genomics Long Ranger pipeline. Table reflects SVs identified by 10× Long Range pipelines applied to 10× reads aligned to the GRCh38 reference sequence in the eight selected regions (Table 1). Columns show the region, GRCh38 chromosome, position and identifier of the SV, followed by the GRCh38 allele and the alternate allele. The last column reflects detailed information including SV type (deletion or insertion) and length, as output by the Long Ranger pipeline. (TSV) S3 Dataset. Benchmarking information of the assembly pipeline. Table provides a summary of computational resources used for each step in our assembly pipeline (Fig 1A). Columns reflect an identifier for the pipeline step, bioinformatics tool used, the total number of processes run, the average time taken per job, the number of cores allocated to each job and the average memory used per job (computed as the maximum over the lifetime of the job). (XLSX) S4 Dataset. k-mer sharing plots (k = 50) comparing the HV31 assembly with GRCh38 in each of the eight regions of interest. Plots provide further detail of regional k-mer sharing plots shown in Fig 2, with details as described in Fig 2 legend and main text. In the IGK and IGL regions, colors reflect the distinct HV31 scaffolds. (ZIP) S5 Dataset. Structural variants between GRCh38 and the HV31 assembly detected by Assemblytics. Table shows the output of Assemblytics applied to GRCh38 and HV31 in each region as described in main text and Methods. Columns show the region, chromosome and position in GRCh38, SV identifier, the identified type of the SV, details of the SV length and the matching coordinates in HV31. (TSV)