Blood group typing from whole-genome sequencing data

Many questions can be explored thanks to whole-genome data. The aim of this study was to overcome their main limits, software availability and database accuracy, and estimate the feasibility of red blood cell (RBC) antigen typing from whole-genome sequencing (WGS) data. We analyzed whole-genome data from 79 individuals for HLA-DRB1 and 9 RBC antigens. Whole-genome sequencing data was analyzed with software allowing phasing of variable positions to define alleles or haplotypes and validated for HLA typing from next-generation sequencing data. A dedicated database was set up with 1648 variable positions analyzed in KEL (KEL), ACKR1 (FY), SLC14A1 (JK), ACHE (YT), ART4 (DO), AQP1 (CO), CD44 (IN), SLC4A1 (DI) and ICAM4 (LW). Whole-genome sequencing typing was compared to that previously obtained by amplicon-based monoallelic sequencing and by SNaPshot analysis. Whole-genome sequencing data were also explored for other alleles. Our results showed 93% of concordance for blood group polymorphisms and 91% for HLA-DRB1. Incorrect typing and unresolved results confirm that WGS should be considered reliable with read depths strictly above 15x. Our results supported that RBC antigen typing from WGS is feasible but requires improvements in read depth for SNV polymorphisms typing accuracy. We also showed the potential for WGS in screening donors with rare blood antigens, such as weak JK alleles. The development of WGS analysis in immunogenetics laboratories would offer personalized care in the management of RBC disorders.


Introduction
Whole-genome data has become more accessible thanks to techniques being made easier, the availability of sequencing machines or contractors, and the release of public data. Only a small part of these entire genomes are exploited beyond the scope of their initial purposes.
Amplicon-based next-generation sequencing (NGS) assays have in many ways laid the groundwork for whole-genome analyses as they require equivalent reagents, equipment and experimental skills. Much software for amplicon-based NGS has been developed, validated and certified in clinical fields. More particularly, most immunogenetics labs are equipped with amplicon-based NGS for HLA typing and some have also developed and validated such techniques for human platelet and RBC antigens [1][2][3]. Many questions can be explored in various fields thanks to WGS resources and their integrative investigation; first in population genetics where such data may improve understanding of natural selection, local adaptation, demographic history and early human migration [4,5]. Then in evolutionary genetics where it can address more fundamental issues such as gene evolution and functional investigation [4,5] thanks to haplotype reconstruction or the localization of new variants. Finally, these rapidly evolving techniques have now made their entry into analysis on an individual scale, for example in forensics and for clinical purposes [3,6]. However many issues raised by WGS handling limit the implementation of these techniques both in Research and Clinical laboratories working within regulatory approved frameworks e.g. Council of Europe (CE): software availability, database accuracy and editing, coverage and read depth quality indicators [5,7]. Thus, many whole-genome experiments designed for one scientific purpose are not used for any further analyses.
The aim of this study was to overcome these limitations and to estimate the feasibility of RBC antigen typing from WGS data. We analyzed whole-genome data from 79 individuals from Central Asia [8] for the highly polymorphic HLA-DRB1 gene and for 9 blood group antigen.
The same samples had previously been typed for HLA-DRB1 by amplicon-based monoallelic sequencing and for blood group bi-allelic polymorphisms using SNaPshot analysis [9].
WGS data was analyzed with software validated for HLA typing from NGS data [10]. This software relies on an allele alignments database; whereas the HLA system has a very convenient database and consensus on allele naming [11] with monthly updates, genetic polymorphism of RBC antigens are provided in Portable Document Format (pdf) and need to be converted. Most importantly, the software used in this study allows phasing of variable positions to define alleles or haplotypes. In a second analysis, the software was set up to search WGS data for new alleles. Indeed, previous investigations for blood groups but also for specific anthropogenic analyses revealed that this cohort presented a singular genetic mosaic of components from various geographic regions of Eurasian ancestry [12].

DNA samples
Seventy-nine samples were used in this study formerly analyzed for anthropogenic markers and described in [12]. All samples were obtained from unrelated male Afghan volunteers after obtaining written informed consent. The study protocol was registered by the Ministere de l'Enseignement Superieur et de la Recherche in France (committee 208C06, decision AC-2008-232). Institutional review board Ministere de l'Enseignement Superieur et de la Recherche in France committee 208C06, (decision AC-2008-232) specifically approved this study.

Blood group genotyping by SNaPshot analysis
Samples were analyzed for main RBC antigens and results have been previously published [9].  [13]. Determination of blood group antigens, other than those of the ABO, RH and MNS systems, depends mainly on the presence of one or more SNPs in the coding sequence. Fourteen SNPs were analyzed corresponding to bi-allelic polymorphism (KEL p.Thr 193Met

HLA-DRB1 typing by monoallelic sequencing
HLA-DRB1 was typed by monoallelic sequencing using Protrans HLA SBT S3 (Protrans) according to manufacturer's instructions. This kit relies on locus specific amplification followed by monoallelic Sanger sequencing.

Whole-genome NGS library preparation and data acquisition
Detailed description of WGS procedure is given in [8]. DNA samples were sonicated using a Covaris S220 Ultrasonicator to yield fragments with a median fragment length of 300 bps according to the manufacturer's recommendations. Low molecular weight DNA (<300 bps) enrichment from all samples was performed using AMPure XP beads (NEB). The library was prepared using the TruSeq Nano DNA LT kit (Illumina) according to the manufacturer's recommendations. Library size and quality was confirmed with Fragment Analyzer (Advanced Analytical) and quantitative PCR (Biorad S1000; CFX96 Real Time System). Paired-end sequencing (2x150 bps) was performed on the Illumina NovaSeq 6000 System (Illumina) following the manufacturer's recommendations.

Whole genome data analysis
Pre-alignment processing. Demultiplexing of runs was performed in BaseSpace (www. illumina.com/BaseSpaceApps). Prior to analysis, quality and adapter trimming was performed by Trim Galore (Babraham Bioinformatics http://www.bioinformatics.babraham.ac.uk/ projects/trim_galsore/) on all fastq files from all runs. Low quality bases with a Phred score below 20 (Q20) were removed from the 3 prime end of the reads followed by the removal of any Illumina adapter contamination (minimum adapter match of 3 with an allowed matching error rate of 0.1). Reads of less than 40 after quality and adapter trimming were removed and only properly paired-end read data were retained and analyzed.
Sequencing data quality assessment. Sequencing performance relies mainly on genome coverage and read depth [14]. WGS data quality was assessed by the quantity of reads obtained per sample. Mean read depth of genome was estimated for each sample by the total number of reads X read size [150 bps] / genome size [2,867,437,753 bps]. Mean read depth for each gene was also estimated by the number of reads mapped X read size [150 bp] / gene size.
Statistical analyses. Statistical analyses were performed with GRAPH PAD Prism 5 software (California USA, www.graphpad.com). Number of reads are presented as mean and range [min, max]. Differences among number of reads according to typing gene status were tested using Kruskal-Wallis one-way ANOVA for three values and Mann Whitney test for two values. Threshold for significance (alpha) was set at 0.05.
Blood group typing and HLA-DRB1 allelic assignment. PolyPheMe software (Xegen, France) was used to perform all typing from WGS data. WGS data were directly aligned to each gene as reference, no human genome was used for read mapping. Alignments were generated by PolyPheMe software with a Bowtie tool [15,16].
Genetic polymorphisms of RBC antigens described in [3] and International Society of Blood Transfusion (ISBT; http://www.isbtweb.org) were used for genetic alignment construction. Reference alleles were generated for KEL, ACKR1 (FY), SLC14A1 (JK), ACHE (YT), ART4 (DO), AQP1 (CO), CD44 (IN), SLC4A1 (DI) and ICAM4 (LW) genes. The other blood group database for which updates were stopped in 2017 was not used for this study [17,18]. 1648 variable positions (68 for KEL, 228 for FY, 904 for SLC14A1, 4 for ACHE, 21 for ART4, 387 for CO, 5 for IN, 20 for DI and 11 for LW were analyzed with PolyPheMe v1.2 on WGS data. All positions analyzed and their corresponding alleles are given in S1 Table. A minimum threshold was defined at 5 reads per position analyzed. The PolyPheMe software can phase heterozygous positions and identify haplotypes when reads overlap. WGS analysis validation was based on a comparison with the 14 positions described by SNaPshot assays.
In a second phase, potential new alleles were estimated by previously unidentified combinations of known polymorphisms but also by polymorphisms unmapped in the ISBT database. For these unreported alleles, WGS data was re-analyzed and polymorphisms were taken into consideration if they had a minimum threshold of 10 reads per position combined with a minimum of 5 occurrences.
HLA-DRB1 was typed at second field resolution with specific parameters for HLA systems previously described [10] using the IMGT 3.39.0 database [11] as reference. This analysis used allele typing according to polymorphisms described in the database. A second analysis was performed on WGS data to find potential new polymorphisms.

Sequencing data quality
Sequencing data are available at http://www.ncbi.nlm.nih.gov/bioproject/662371. Genome sequencing displayed a mean of 34 Gb .The mean read depth of the genome, estimated for each sample by the total number of reads X read size / genome size, was 11.8x [5.5x-18.4x]. Mean read depth for each gene, estimated by the number of reads mapped X read size [150 pb] / gene size, are given in S2 Table. Blood group analyses Blood group genotyping analyzed by SNaPshot are given in Table 1. Most analyses focused on one SNP leading to bi-allelic results, except for KEL, DO and FY systems for which 2 or 3 SNPs were analyzed. Most antigens displayed low or no allelic diversity except for DO (p. Asn265Asp), FY (p.Gly42Asp), JK (p.Asp280Asn) and YT (p.His353Asn).
Group typing based on WGS analysis was performed targeting all of the variable positions described in S1 Table. Sixty-three alleles out of 1035 described by SNaPshot could not be resolved (6.1%) by WGS analysis. For all genes analyzed, typing resolution was associated with the number of reads mapped on their genetic sequence (S3 Table).

PLOS ONE
98.7% of concordance was observed for DI (p.Pro854Leu) with 1 incorrect typing for a heterozygous sample (DI � 02); 2 samples remained unresolved.
WGS data analysis also revealed polymorphisms that were unmapped in the ISBT database. A total of 267 previously unidentified polymorphisms covered with a minimum depth of 10x and observed in a minimum of 5 samples were found (S5 Table). Among these, 5 SNPs were in exonic regions but none led to amino-acid changes. Two SNPs in the DO gene were observed in 18 and 21 samples, 2 SNPs in IN were observed in respectively 37 and 41 samples and, in the JK gene, one SNP was observed in 37 individuals (S6 Table).

HLA-DRB1 analyses
Thirty-four HLA-DRB1 alleles were defined at maximum resolution by amplicon-based monoallelic sequencing, 5 samples could not be analyzed (Table 3).
Ninety-one percent of WGS-based HLA-DRB1 typing, i.e. 135 out of 148 alleles, showed an exact match with typing defined by monoallelic sequencing at second field resolution. Most discordances were due to insufficient coverage and low read numbers leading to differences in 3 rd and 4 th digits; two samples (counting for 4 alleles) could not be typed.
No novel polymorphism could be detected in HLA-DRB1 during the second analysis of the WGS data.

Discussion
In this study we explored diploid markers in WGS data generated for Y-chromosome analysis from 79 individuals [8]. Analyses were performed with Polypheme software validated for HLA typing from NGS data [10] and set up for RBC analysis. HLA-DRB1 gene and 9 blood group antigens were typed (KEL, ACKR1 (FY), SLC14A1 (JK), ACHE (YT), ART4 (DO), AQP1 (CO), CD44 (IN), SLC4A1 (DI) and ICAM4 (LW)) according to standard nomenclature (IMGT 3.39.0 database [11], ISBT (http://www.isbtweb.org) and RBC antigens [3]). Whereas targeted strategies, such as PCR followed by sequencing or SnaPshot, circumvent specificity issues of genes with structural changes and hybrids such as RHCE/RHD and GPA/GPB; their analysis from WGS data have requires specific bioinformatic approaches including CNV (copy number variation) analysis. Therefore, such systems were not included in this study.
Our results showed that blood group typing deduced from WGS were correct at 99.5% compared to SNaPshot analysis (967 SNP correctly identified out of 972 typed); 93% when taking into account ambiguous typing. In a clinical or research context however, ambiguous RBC results need to be reanalyzed. HLA-DRB1 typing from WGS showed 91% of concordance with those obtained by amplicon-based monoallelic sequencing. These performances on RBC antigens were similar to those presented in a former study on WGS from donor data [3] which included the typing of highly complex genes such as MNS, RHD/RHCE and ABO systems.
WGS data quality is assessed by the estimation of read depth. A former study conducted on WGS data established a minimum of 15x for RBC antigen typing in the clinical field [3,14]. Here, mean read depth of the genome was estimated at 11.8x [5.5x-18.4x] and read depth for each gene reached higher values. For each gene, typing resolution was significantly associated with the number of reads mapped on its sequence and ambiguous and incorrect typing showed low numbers of reads corresponding to the missing allele and read depth equal to or below 15x. Our study thus confirms that RBC typing from WGS should be considered reliable with read depths strictly above 15x. To reach this goal, genome sequencing of one human (3Gb) should be analyzed with at least 45 Gb of data, here mean data was 34 Gb .
In our study, WGS data analysis allowed refined typing, identification of both potential new alleles and haplotypes as PolyPheMe software used here allowed phasing of polymorphisms subject to sufficient coverage and variable positions. We were able to type the JK � 01W.01 allele [19] and the FY � 02 allele associated with c.298G>A [20]. The weak JK allele may present a risk of hemolytic transfusion reactions [21] as it has been shown that among

PLOS ONE
samples screened as JK:-1,-2, a fraction was JK:1 WK [22]. JK � 01W.01 has been reported in Caucasian, Asian and Chinese individuals [19] but there is a lack of description of this allele among different populations. Given the frequency found here, our results strongly support the need of a better description of this allele, particularly in Asia. Serological typing is the gold standard for blood group analysis but in particular situations molecular analysis can provide valuable information. In hematology laboratories, molecular biology based on sequence analysis was superseded by ready-to-use closed systems mainly based on SNPs analysis and validated for clinical purposes. Whereas unthinkable for routine patient care, some situations would gain from WGS such as screening donors for rare blood antigens and the management of RBC disorders. In this regards, our results showing rare and potential new alleles are particularly relevant in diseases such as Sickle Cell disease for example, where allo-immunization is a major complication [23]. Research of minor alleles and their potential role in allo-immunization in these patients would be a major advance in personalized medicine.
In a second analysis, WGS data were screened for new polymorphisms. 262 new variable positions in intronic regions and 5 polymorphisms in exons were identified, none led to nonsynonymous mutations. An insight of their frequencies in populations described as being related to the Afghan population would contribute to refining their origins [12].
Molecular testing for the HLA system has been integrated in immunogenetics laboratories for a long time and evolves according to new technologies. Amplicon-based NGS is suitable for donor HLA typing, with robust and certified protocols, high throughput and highly resolutive typing results. These protocols can be performed with methods requiring several days and are also suitable for patients, for whom typing results are rarely impatiently awaited.
Immunogenetics laboratories are thus quite prepared to integrate WGS in their pipeline and use it to analyze other immune markers. Patients with auto-immune diseases, solid organ and HSC transplantation, or inflammatory diseases would benefit from personalized care with specific typing of non-classical HLA, FC receptors, KIR or LILRs [24][25][26][27]. In conclusion, the implementation of WGS can serve many purposes, from anthropogenic integrative studies to handling specific diseases in clinical fields.
Supporting information S1