A machine-learning method for biobank-scale genetic prediction of blood group antigens

  • Kati Hyvärinen ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    kati.hyvarinen@bloodservice.fi

    Affiliation Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland

  • Katri Haimila,

    Roles Conceptualization, Data curation, Investigation, Resources, Validation, Writing – review & editing

    Affiliation Blood Group Unit, Finnish Red Cross Blood Service, Vantaa, Finland

  • Camous Moslemi,

    Roles Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Validation, Writing – review & editing

    Affiliations Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark, Department of Clinical Immunology, Aarhus University Hospital, Aarhus, Denmark

  • Blood Service Biobank,

    Roles Data curation, Resources

    Affiliation Finnish Red Cross Blood Service, Vantaa, Finland

  • Martin L. Olsson,

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliations Department of Laboratory Medicine, Lund University, Lund, Sweden, Department of Clinical Immunology and Transfusion Medicine, Office for Medical Services, Region Skåne, Sweden

  • Sisse R. Ostrowski,

    Roles Supervision, Writing – review & editing

    Affiliations Department of Clinical Immunology, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark, Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark

  • Ole B. Pedersen,

    Roles Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    Affiliations Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark, Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark

  • Christian Erikstrup,

    Roles Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation Department of Clinical Immunology, Aarhus University Hospital, Skejby, Denmark

  • Jukka Partanen,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Writing – review & editing

    Affiliation Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland

  • Jarmo Ritari

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland

Abstract

A key element for successful blood transfusion is compatibility of the patient and donor red blood cell (RBC) antigens. Precise antigen matching reduces the risk for immunization and other adverse transfusion outcomes. RBC antigens are encoded by specific genes, which allows developing computational methods for determining antigens from genomic data. We describe here a classification method for determining RBC antigens from genotyping array data. Random forest models for 39 RBC antigens in 14 blood group systems and for human platelet antigen (HPA)-1 were trained and tested using genotype and RBC antigen and HPA-1 typing data available for 1,192 blood donors in the Finnish Blood Service Biobank. The algorithm and models were further evaluated using a validation cohort of 111,667 Danish blood donors. In the Finnish test data set, the median (interquartile range [IQR]) balanced accuracy for 39 models was 99.9 (98.9–100)%. We were able to replicate 34 out of 39 Finnish models in the Danish cohort and the median (IQR) balanced accuracy for classifications was 97.1 (90.1–99.4)%. When applying models trained with the Danish cohort, the median (IQR) balanced accuracy for the 40 Danish models in the Danish test data set was 99.3 (95.1–99.8)%. The RBC antigen and HPA-1 prediction models demonstrated high overall accuracies suitable for probabilistic determination of blood groups and HPA-1 at biobank-scale. Furthermore, population-specific training cohort increased the accuracies of the models. This stand-alone and freely available method is applicable for research and screening for antigen-negative blood donors.

Author summary

Blood transfusion is one of the most common clinical procedures in the hospitals and the key element for safe transfusion is compatibility between the recipient and donor red blood cell antigens. Precise antigen matching reduces the risk for sensitization and other adverse transfusion outcomes. Here we describe a stand-alone and freely available random forest classification method and models for determining red blood cell and human platelet antigens from array technology-based genotyping data. We investigate the performance of models trained with Finnish blood donor biobank data and further validate the method with a large Danish cohort. The results demonstrate high overall accuracy, and the method is suitable for biobank-scale research and screening of antigen-negative donors. The implementation is possible in the local computing environment without sensitive data uploads and requires only a moderate level of bioinformatic skills.

Introduction

Blood transfusion is a life-saving procedure performed widely in treating various medical conditions. Despite routine practices, the safety of transfusions remains a major concern [1]. Exposure to foreign RBC antigens may result in alloantibody formation and hemolytic transfusion reactions. Additionally, sensitization to non-self RBC antigens and human platelet antigens (HPAs) can also occur via pregnancy and cause fetal morbidity and mortality [2,3]. The current general practice of matching the recipient and blood donor for ABO and RhD antigens is inadequate to prevent sensitization to other antigens. Extended matching could reduce the risk of alloimmunization and adverse events, which are especially pronounced among patients receiving regular transfusions [4,5].

Blood group typing of blood donors has been conventionally performed by serotyping and is still the main method used in blood centers. To overcome limitations regarding low throughput and lack of valid reagents for all clinically relevant antigens, numerous DNA-based genotyping and sequencing methods have emerged within the last decades [6–10]. This development has been enabled by the accumulating knowledge about the genetic basis of the blood groups [11,12] and the rapid evolution of molecular methodology. However, the systematically extended blood group typing of blood donors and, even more so, the recipients, remains sparse. Economic feasibility has been a major restraint to the progress. The development of genotyping array technologies has promoted high-throughput and cost-effective genetic studies in many fields and, in 2020, Gleadall et al. [13] introduced a microarray platform for RBC antigen, human leukocyte antigens (HLA), and HPA typing for precision matching of blood.

While accurate blood group typing is obligatory for safe transfusions, an initial screening for potential donors could be achieved using less stringent procedures. In the last decade, the development of machine learning approaches for high-dimensional data has provided new opportunities for exploitation of expanding genetic data. In 2015, Giollo et al. [14] presented BOOGIE, an RBC antigen predictor based on Boolean rules and the k-nearest neighbor (k-NN) algorithm. Decision tree-based methods, including bootstrap aggregation [15] and random forest [16], have been utilized for imputation of HLA alleles [17,18] and killer cell immunoglobulin-like receptor (KIR) copy number [19] and gene content [20]. To our knowledge, these methods have not yet been applied to RBC antigen and HPA screening. The analysis of high-dimensional data with computational performance suitable for large-scale analyses may be implemented using the “RANdom forest GEneRator” (ranger) R package [21]. The execution is feasible in the local computing environment and sensitive data uploads are not required.

Here we describe a stand-alone and freely available random forest classification method and models for determining RBC antigens and HPA-1 from array technology-based genotyping data. We investigate the performance of models trained with Finnish blood donor biobank data and further validate the method with a Danish cohort. Our results suggest that the method is applicable for biobank-scale probabilistic determination of RBC antigens and HPA-1, and could facilitate research and screening for antigen-negative blood donors.

Results

Evaluation of the Finnish classification models

An overview of the study design is depicted in Fig 1. In the Finnish cohort, the genotype data was accessible for 1,192 blood donors and the RBC antigen typing data was available for 39 antigens representing 15 blood group systems. The blood group typing frequency varied greatly depending on RBC antigen/phenotype, being at the lowest 5% for HPA-1b and at the highest 100% for A, B, AB, O, K, D, C, c, E, and e (Table 1).

Fig 1. Study design.

Random forest classification models were generated using the Finnish reference data set (n = 1,192). Allele dosages of genes determining RBC antigens/phenotypes and HPA-1 were combined with the antigen typing data. The data set was divided randomly into train and test data sets. Random forest modelling was executed in the training data set (n = 596) and the important variables were selected using permutation. The models were evaluated in the test data set (n = 596) for prediction accuracy and errors. The final models were fitted using the full data set and both models and the method were validated in the Danish cohort (n = 111,667).

https://doi.org/10.1371/journal.pcbi.1011977.g001

Table 1. Blood group/HPA-1 antigen typing information of the Finnish and Danish cohorts.

https://doi.org/10.1371/journal.pcbi.1011977.t001

After data partitioning, the number of study subjects in the test data set was 596. The median (interquartile range [IQR]) balanced accuracy for 39 models was 99.9 (98.9–100)% in the test data set and accuracy metrics for all models are presented in Table A in S1 Text. The models for antigen/phenotype positivity of AB, B, A1, A2, Ytb, Coa, Doa, Dob, Fya, HPA-1b, K, Kpa, Ula, Jka, Lua, S, and s reached balanced accuracy of 100%. For other models, the balanced accuracy was ≥98.0%, except 83.3% for Lsa, 94.0% for Leb, 95.0% for HPA-1a, and 96.0% for hrS. Accuracy metrics for the train and full data sets are presented in Tables B and C in S1 Text, respectively. Fig 2 illustrates the distributions of accuracy metrics over all antigens shared by different data sets. The number of false negative plus false positive (FN + FP) samples out of all samples was low, ranging from 0 to 1% in all models, except 2% for hrS. Detailed confusion matrices for the Finnish test, train, and full data sets are presented in Figs A–C in S1 Text, respectively. The median (IQR) prediction error, determined as misclassification frequency obtained from out-of-bag data, of the Finnish models was 1.6 × 10⁻³ (1.9 × 10⁻⁴–7.0 × 10⁻³) (Table 2). Receiver operating characteristic (ROC) and precision-recall curves for combined test data predictions are presented in Fig D in S1 Text. The area under ROC curve was 99.9% with confidence interval 99.8–99.9%.

Fig 2. Summary of prediction accuracy metrics.

Distributions of accuracy metrics over all antigens shared by different test data sets. (A) Finnish random forest models evaluated in the Finnish test data set. (B) Finnish gradient boosting models evaluated in the Finnish test data set. (C) Finnish random forest models evaluated in the Danish full data set. (D) Danish random forest models evaluated in the Danish test data set. NPV, negative predictive value; PPV, positive predictive value.

https://doi.org/10.1371/journal.pcbi.1011977.g002

Table 2. Characteristics of the Finnish classification models.

https://doi.org/10.1371/journal.pcbi.1011977.t002

The distributions of posterior probabilities (PP) in the test data set are depicted in Fig E in S1 Text. The samples having PP >0.5 were classified as antigen positive and ≤0.5 as antigen negative. The majority of the PPs were close to 1 for the antigen typing positive samples and close to 0 for the antigen typing negative samples. The Coa-negative samples (only two samples in the test data set) were classified correctly but the PPs were closer to 0.5 than to 0. One of the three Lsa-positive samples was misclassified and the PPs for the other two were closer to 0.5 than to 1 (specificity 66.7%). The spectrum of PP distribution with some misclassifications was observed for Cob, Leb, M, N, C, Cw, D, and hrS. Figs F and G in S1 Text depict the distributions of PPs in the train and full data sets.
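The decision rule described above can be illustrated with a minimal Python sketch. It is not taken from the authors' R scripts: the sample names, the PP values, and the 0.1 margin used to flag calls near the threshold are hypothetical and serve only to show how a PP >0.5 cut-off separates positive from negative calls.

```python
THRESHOLD = 0.5  # from the text: PP > 0.5 is positive, PP <= 0.5 is negative


def classify_by_posterior(pp, threshold=THRESHOLD):
    """Return the antigen call for one posterior probability."""
    return "positive" if pp > threshold else "negative"


def is_near_threshold(pp, threshold=THRESHOLD, margin=0.1):
    """Flag calls whose PP lies within `margin` of the threshold.

    The margin is a hypothetical value chosen for illustration only.
    """
    return abs(pp - threshold) < margin


# Hypothetical posterior probabilities for four samples.
samples = {"s1": 0.98, "s2": 0.50, "s3": 0.03, "s4": 0.56}

calls = {name: classify_by_posterior(pp) for name, pp in samples.items()}
uncertain = [name for name, pp in samples.items() if is_near_threshold(pp)]
```

In a screening setting, samples flagged as near the threshold would be natural candidates for confirmatory typing, although the source does not prescribe such a rule.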

Evaluation of prediction using gradient boosting method

We investigated the impact of machine learning algorithm selection using the gradient boosting method. Combined accuracy metrics over all antigens shared by different data sets for the Finnish test data set for random forest and gradient boosting methods are presented in Fig 2A and 2B, respectively. Table D in S1 Text presents the detailed accuracy metrics, Fig H in S1 Text the confusion matrices, and Fig I in S1 Text the distributions of PPs in the Finnish test data set. The overall performance of gradient boosting was slightly lower than that of our random forest classification approach.

Validation of the Finnish random forest classification models in the Danish cohort

The Danish validation cohort had genotype and phenotype data for 34 out of the 39 Finnish classification models. Antigen/phenotype typing data varied from 433 for A2 to ~111,000 for A, AB, B, O, and D (Table 1). Due to missing Finnish model variables in the Danish genotype data, the Danish allele dosage data was harmonized using mean imputation before applying the Finnish models.

The median (IQR) balanced accuracy for classifications was 97.1 (90.1–99.4)% and all the evaluation metrics are presented in Table E in S1 Text. The balanced accuracies were >98.0% for 14 models including antigen/phenotype positivity of A, AB, B, O, Ytb, Doa, Dob, HPA-1a, Jka, Lea, S, s, E, and e. Models for antigen/phenotype positivity of A1, Cob, Fya, Fyb, HPA-1b, K, Kpa, Lua, M, N, and Cw had balanced accuracy ranging from 91.6 to 98.0%. Six models, A2, Coa, Jkb, Leb, D, and C had balanced accuracy ranging from 64.6 to 89.4%. The Finnish models for LWb, P1, and c failed classification in the Danish cohort. Fig 2C illustrates the distributions of accuracy metrics over all antigens shared by different data sets for Finnish random forest models evaluated in the Danish full data set.

Validation of the random forest classification model algorithm in the Danish cohort

The RBC antigen/phenotype and HPA-1 typing and genotype data available for the Danish cohort enabled implementation of 40 Danish classification models representing 15 blood group systems. Due to missing genotypes (approximately 5%), missing allele dosage values were imputed separately for train and test data sets using mean values.

Median (IQR) balanced accuracy for the 40 Danish models in the Danish test data set was 99.3 (95.1–99.8)%. The evaluation metrics for test data set are available in Table F in S1 Text and for the train and full data sets in Tables G and H in S1 Text, respectively. More than half (23/40) of the Danish models reached balanced accuracy of ≥99.0% including models for antigen/phenotype positivity of A, AB, B, O, Yta, Ytb, Doa, Dob, Fya, HPA-1a, HPA-1b, Jka, Jkb, M, N, S, s, C, c, D, E, Lea, and Knb. Balanced accuracies for A1, Cob, Fyb, K, Kpa, Lua, Cw, e, and P1 models ranged from 94.4 to 98.1%, and for A2, Coa, k, Kpb, Lub, Vel, and Leb from 70.0 to 89.3%. The Danish model for Kna failed classification because the number of Kna-negative samples in the test data set was too low. Confusion matrices for the Danish models in the Danish train, test and full data sets depict the distribution of true negative (TN), FN, true positive (TP), and FP samples and are illustrated in Figs J–L in S1 Text, respectively. The median (IQR) prediction error of the Danish models was 2.3 × 10⁻³ (9.3 × 10⁻⁴–7.1 × 10⁻³)% (Table I in S1 Text). The distributions of accuracy metrics over all antigens shared by different data sets for Danish random forest models evaluated in the Danish test data set are illustrated in Fig 2D.

Comparison of the Finnish and Danish random forest classification models

Assembly of the balanced accuracies for Finnish and Danish models in the Finnish and Danish full data sets is presented in Table 3. When analyzing the shared 33 models, the Finnish models predicted the blood groups of the Finnish cohort more accurately than the blood groups of the Danish cohort (median [IQR] balanced accuracy 99.9 [98.8–100]% vs. 97.1 [91.6–99.5]%, p = 1.15e-06). The Danish models performed better than the Finnish models in the blood group classification of the Danish cohort (median [IQR] balanced accuracy 99.5 [96.5–99.8]% vs. 97.1 [91.6–99.5]%, p = 0.006).

Table 3. Balanced accuracies for the Finnish and Danish models in full data sets.

https://doi.org/10.1371/journal.pcbi.1011977.t003

The number of genetic variants available for the Finnish random forest modelling ranged from 35 to 688 depending on the blood group/HPA system and number of the important variables selected by the classifier for the final models ranged from 12 to 214 (Table 2). In the Danish genotyping data set, the number of variants varied from 42 to 766 and the final models utilized 20–743 variants (Table I in S1 Text).

Discussion

Our study introduces random forest classification models for predicting RBC antigens/phenotypes and HPA-1 from array-based genotyping data. The method and models were generated utilizing blood group typing data from Finnish blood donors and further validated using a large Danish blood donor cohort. The results demonstrate high overall accuracy, and the method is suitable for biobank-scale screening and analysis of HPA-1 and RBC antigens.

Blood transfusion is one of the most common clinical procedures in the hospitals and the key element for safe transfusion is compatibility between the recipient and donor RBC antigens [1]. Although transfusion-related severe outcomes are rare, the prominent risk of sensitization and further alloimmunization affects especially patients dependent on recurrent transfusions [4,5]. Extended blood group typing has proven to be beneficial by reducing the incidence of alloantibody formation [22,23]. Additionally, studies have shown that the extended genotyping of blood donors markedly increases the number of suitable donors for immunized recipients [13] and enhances the supply of antigen-negative blood [6].

At present, preventive matching strategies are implemented only for specific patient groups and, despite the obvious advantages of the extended genotyping of donors, the procedure has not been considered feasible for all blood donors. Over the last decades, the genotyping of different populations has expanded widely. Using machine learning approaches to screen blood donor and research biobank genotyping data may provide a cost-effective solution for enlarging the pool of antigen-negative blood donors. Our random forest classification method infers RBC antigens and HPA-1 from genotype-imputed microarray data. The R package ranger performed fast and handled the dimensionality of input data without problems [21]. The obtained results demonstrated high balanced accuracies both in the Finnish discovery cohort (median 99.8% for the 39 Finnish models) and in the Danish validation cohort (median 99.3% for the 40 Danish models) (Table 3). The performance was not affected by nearly a 100-fold size difference between the Finnish and the Danish cohorts (~1,200 vs. ~111,000, respectively).

Rh and MNS blood group system antigens have been challenging to determine by sequencing due to complex genetic variation and gene rearrangements [12,24]. We observed reduced balanced accuracy in the Finnish model for hrS (93.3%) and the Danish model for Cw (95.3%). However, the other Rh and MNS antigen models, including clinically significant E, e, C, c, S, and s, performed accurately. The balanced accuracies for clinically significant antigens in other systems, including K, Jka, Jkb, Fya, and Fyb, ranged from 95.6% to 100% (Table 3).

The BOOGIE method for prediction of RBC antigens was published in 2015 [14]. It builds on 1-NN algorithm and implementation requires genotype sequencing data and curated haplotype tables for the RBC antigen phenotypes. When compared, the Finnish models for ABO and RhD performed better than the BOOGIE method (median balanced accuracy for the Finnish ABO models 100% vs. BOOGIE ABO accuracy 94.2%; balanced accuracy for the Finnish RhD model 98.8% vs. BOOGIE RhD accuracy 94.2%). The observed differences in accuracies could be explained by the potentially limited haplotype tables utilized by BOOGIE. Additionally, the reported results of BOOGIE are based on low number of samples.

When applying the Finnish models to the Danish cohort, the observed decrease in balanced accuracies was expected because of the evident genetic, genotyping, and imputation differences between the Finnish and the Danish cohorts (Table 3). The Finnish cohort was imputed using a population-specific imputation reference panel, with no missingness per individual. In contrast, the Danish cohort was imputed using the North European reference sequence panel, resulting in an average missingness of 5%. As random forest is not able to handle missing input data and the important variables of the Finnish models were not fully present in the Danish data, we were obliged to use mean imputation for missing variant dosage data. This approach also introduces errors to the data, which may partly explain the reduced accuracy. The better performance of the Danish models in the Danish cohort underlined the benefit of the population-specific training cohort.

Tree-based ensemble methods such as random forest offer robust performance with low risk of overfitting, and gradient boosting [25] may increase accuracy by modelling residuals. However, our XGBoost test controlling overfitting via cross-validation did not result in equal performance when compared to our random forest approach (Fig 2), suggesting that our model makes efficient use of available genetic data. Neural network models are an attractive option for imputing missing data but parameter tuning benefits from large training data [26]. However, if the input data is relatively small, using specialized tools designed for imputing missing genotypes independently of input data size prior to modelling is a superior option to neural networks requiring large data sets. The most effective imputation of input genotypes is achieved using a population-specific reference panel [27]. In our case, we did not have a large training set at our disposal and had only limited data from rare antigen types available. Therefore, we adopted a modelling approach able to efficiently handle this kind of data.

The Finnish genotyping data had only one variant in the RHD region. Nonetheless, the Finnish model for RhD performed with sufficient balanced accuracy in the Finnish cohort (98.8%). Our method combines RHD and RHCE region variants for the modelling and the high linkage disequilibrium may have supported the classification (Table A in S1 Text). However, the Finnish model for RhD worked poorly in the Danish cohort (78.4%), which may be attributed to the mean imputation of missing values.

The present modelling method is restricted to the RBC antigen typing data available for the training and test data sets, which can be considered as a major limitation because the data for some RBC antigens are scarce. RBC antigens have demonstrated significant diversity among populations and rare blood group variants may not be discovered without substantially large typing numbers. The Danish model for Kna failed because of a lack of Kna-negative samples in the test data set and we were not able to create Finnish models for e.g., Vel, k, Kpb, Lub, and LWa. It would be beneficial to validate the present method and models in non-European populations to enable systematic blood group studies in biobanks of different ethnic origins and phenotype data content.

To our surprise, B3GALNT1 on chromosome 3 supported the prediction of P1 antigen status in the P1PK system, even if this system is known to be governed by A4GALT on chromosome 22. B3GALNT1 normally governs expression of the P and other antigens in the GLOB system [28]. Thus, our data may suggest an unknown but intriguing role of the glycosyltransferase encoded by B3GALNT1 in the synthesis of P1 antigen. This deserves further investigation beyond the scope of this study.

In the future, comprehensive donor and recipient typing and precision matching are likely to increase. A recent publication by van Sambeeck et al. [29] demonstrated the feasibility of preventive matching for all genotyped recipients and donors. Our method is suitable for initial screening for antigen-negative donors at biobank-scale, presenting a cost-effective solution for the extended blood group and HPA-1 typing. Additionally, successful prediction of polygenic blood groups may facilitate the research of disease associations in large biobanks.

Scripts for random forest modelling and for applying the tested 39 Finnish models are freely available on GitHub. The implementation is possible in the local computing environment without sensitive data uploads and requires only a moderate level of bioinformatic skills.

Study subjects and methods

Ethics statement

The Finnish study cohort consists of 1,192 blood donors belonging to the Blood Service Biobank, Helsinki, Finland (https://www.veripalvelu.fi/en/biobank/). Genotype and blood group phenotype data were obtained from the Blood Service Biobank. The study (biobank decision 002–2018) conforms to the principles of the Finnish Biobank Act (688/2012) and the participants have given written informed consent to the Blood Service Biobank.

The Danish validation cohort consists of 111,667 participants of the Danish Blood Donor Study (DBDS) Genomic Cohort expanding on the Danish blood bank system [30,31]. The genetic studies in DBDS have been approved by the Danish Data Protection Agency (P-2019-99) and the Scientific Ethical Committee system (NVK-1700407).

Genotyping and genotype imputation

The genotyping and genotype imputation of the Finnish cohort were originally performed as part of the FinnGen project (https://www.finngen.fi/en). Biobank samples were genotyped using FinnGen ThermoFisher Axiom custom array v2 (Thermo Fisher Scientific, Santa Clara, CA, USA) and imputed using the population-specific Sisu v3 imputation reference panel with Beagle 4.1. Detailed description of the procedures is available at https://finngen.gitbook.io/documentation/v/r4/methods/genotype-imputation and the marker content of the custom array v2 is downloadable at https://www.finngen.fi/en/researchers/genotyping. The phased genotypes were filtered for the imputation INFO-score >0.6 and were in VCF format.

In the Danish cohort, the genotyping was performed using Illumina’s Infinium Global Screening Array and imputed using the deCODE genetics’ (Reykjavik, Iceland) North European reference sequence panel. Unphased genotypes were filtered for the imputation INFO-score >0.75, minor allele frequency >0.01, Hardy–Weinberg equilibrium P-values <1 × 10⁻⁴, and samples for missingness per individual <3%.

RBC antigen and HPA typing

The RBC antigen and HPA-1 phenotypic information for the Finnish and Danish cohorts is presented in Table 1. The availability of the phenotype data varied in a wide range depending on the antigen due to the different testing criteria practices. In the Finnish cohort, RBC antigen and HPA-1 typing was performed at the FRCBS Blood Group Unit by routine methods and the results were obtained using validated serological and genotyping techniques.

The sources for RBC antigen and HPA-1 typing results were the Danish electronic blood bank systems and the typing was performed using serological methods, except for Vel-status, which was determined using polymerase chain reaction technique.

Classification random forest models

Fig 1 presents an overview of the study design. RBC antigen and HPA-1 coding genes and the genetic regions used in the models are presented in Table J in S1 Text. The input and output of the model fitting are presented in Fig 3. The input of the model is imputed genotype data in chromosomal variant call format (VCF) and antigen data in text format (e.g., Kpa+/Kpa-). The outputs are the models and associated information in R Data Serialization (RDS) format and accuracy statistics and important variables in figure and text format. The genomic regions of the genes encoding the target antigens are extracted from the VCF data, converted to PLINK format, and further into allele dosages. The models for the antigens were generated separately using the same hyperparameters. Only antigens having at least four cases in each respective typing data class were included, resulting altogether in 39 models. For the Finnish reference data set, SNVs in RBC antigen and HPA-1 coding genetic regions ± 2,000 bp flanking regions were utilized in dosage format. Table 2 presents the number of SNVs available for each model. Only samples having full dosage data were used. The genetic and antigen typing information were combined into a single full data set and divided randomly 1:1 into train and test data sets.
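The inclusion rule and the data partitioning described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' R implementation: the function names, the seed, and the donor identifiers are hypothetical, and a simple unstratified shuffle is assumed because the text states only that the data were divided randomly 1:1.

```python
import random
from collections import Counter

MIN_PER_CLASS = 4  # from the text: at least four cases in each typing class


def has_enough_cases(labels, min_per_class=MIN_PER_CLASS):
    """Check that both typing classes are present with enough samples."""
    counts = Counter(labels)
    return len(counts) == 2 and min(counts.values()) >= min_per_class


def split_half(sample_ids, seed=1):
    """Shuffle sample identifiers and split them 1:1 into train and test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical identifiers for a cohort of the same size as the Finnish one.
donor_ids = [f"donor{i}" for i in range(1192)]
train_ids, test_ids = split_half(donor_ids)
```

With 1,192 samples, this split yields 596 samples in each partition, matching the sizes reported for the Finnish train and test data sets.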

Fig 3. Random forest model fitting.

Input data for model fitting include target genotype and phenotype data and gene-phenotype data provided in the GitHub repository https://github.com/FRCBS/Blood_group_prediction. Outputs of the classification are models for the target antigens and related accuracy information.

https://doi.org/10.1371/journal.pcbi.1011977.g003

R v4.3.0 environment [32] was used for the implementation of analyses. Classification random forest models were created using the R package ranger v0.13.1 [21]. The number of trees was 2,000 and the split criterion was based on node impurity measured by the Gini index. Class weights were applied due to unbalanced outcome classes. The number of variables to possibly split at each node (mtry) was the number of SNVs divided by 2, and the variable importance was determined by permutation. Feature selection was based on variable importance >0 and the model was re-fitted using these important SNVs only. The number of important variables and prediction errors for each antigen model are presented in Table 2. Prediction error was determined as misclassification frequency obtained from out-of-bag data and prediction on the test set. The important variables and their importance values for the Finnish models are listed in S1 Data. The full data set was used in fitting the final models.
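The idea behind permutation-based feature selection can be sketched in plain Python. This is a simplified, model-agnostic illustration and not the ranger implementation, which computes permutation importance from out-of-bag data within the forest. The toy dosage matrix, the stand-in predictor, the number of repeats, and the integer rounding of mtry are all assumptions made for the example.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=1):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def select_important(importances):
    """Keep feature indices with importance > 0, as in the text."""
    return [j for j, imp in enumerate(importances) if imp > 0]


# Toy allele dosages (0, 1, 2) for three SNVs; only SNV 0 drives the label.
rows = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1],
        [1, 1, 1], [2, 0, 0], [0, 0, 2], [2, 2, 1]]
labels = [int(r[0] > 0) for r in rows]


def toy_predict(row):
    """Stand-in for a fitted classifier (hypothetical)."""
    return int(row[0] > 0)


importances = permutation_importance(toy_predict, rows, labels)
selected = select_important(importances)
mtry = len(rows[0]) // 2  # number of SNVs divided by 2; flooring is assumed
```

Shuffling a column the predictor ignores leaves accuracy unchanged, so its importance is zero and it is dropped, whereas shuffling the informative column lowers accuracy and keeps it in the re-fitted model.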

Evaluation of prediction using gradient boosting method

To compare our random forest with feature selection approach to another tree-based classification algorithm, we fitted a binary logistic eXtreme Gradient Boosting (XGBoost) [25] model implemented by the R library xgboost v1.7.6.1 [33] to the Finnish training set data and evaluated its performance in the independent test set. To minimize overfitting of the XGBoost model, we performed 100 random data partitionings (2/3 train, 1/3 test) within the training set and selected the optimal number of boosting rounds based on minimum negative log-likelihood from each iteration. Within the iterations, we used an early_stopping_rounds parameter value of 4, indicating that the training with a validation set stops if the performance does not improve for four rounds. The final number of boosting rounds applied to the model fitted on the full training data was an average over the 100 iterations.
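The round-selection logic described above can be sketched independently of any boosting library. The following Python example is a minimal illustration, not the xgboost internals: the validation loss values are hypothetical, and rounding the averaged round count to an integer is an assumption, since the text states only that an average over the iterations was used.

```python
def best_round_with_early_stopping(val_losses, patience=4):
    """Return the 1-based round with the lowest validation loss.

    Scanning stops once `patience` consecutive rounds bring no improvement,
    so later rounds are never evaluated.
    """
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            break
    return best_round


def average_rounds(best_rounds):
    """Average the per-partition optima; integer rounding is an assumption."""
    return round(sum(best_rounds) / len(best_rounds))


# Hypothetical validation negative log-likelihoods for one data partition.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.49, 0.10]
best = best_round_with_early_stopping(losses, patience=4)

# Hypothetical optima from three partitions (the study used 100).
final_rounds = average_rounds([3, 5, 4])
```

In this toy series the loss stops improving after round 3, so training halts at round 7 and the late drop at round 8 is never reached, which is the trade-off that early stopping accepts in exchange for limiting overfitting.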

Model evaluation metrics

The model accuracy was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and balanced accuracy. The data was wrangled using the tidyverse v1.3.1 package[34] and the evaluation metrics were derived using caret v6.0–92[35]. For each model, the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) were determined. Sensitivity was defined as TP / (TP + FN), specificity as TN / (TN + FP), PPV as TP / (TP + FP), and NPV as TN / (TN + FN)[36]. Balanced accuracy accounts for imbalanced classification and was defined as (sensitivity + specificity) / 2. ROC and precision-recall curves were generated using the ROCR v1.0–11 package.
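The definitions above translate directly into a few lines of Python. This is a minimal sketch of the formulas only, not of the caret implementation used in the study. The function name and the example counts are hypothetical, and denominators of zero are not handled.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for an antigen-positive minority class.
metrics = evaluation_metrics(tp=90, tn=880, fp=20, fn=10)
```

With these counts the plain accuracy would be 970 / 1000 = 0.97, while the balanced accuracy is about 0.939, because the lower sensitivity in the smaller positive class carries equal weight.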

Validation of the Finnish models and the random forest method for generating the models

The models obtained using the Finnish data set were applied to the Danish cohort. The input and output of the application of the Finnish models are presented in Fig 4. The implementation required imputed genotype data in chromosomal variant call (or PLINK) format, together with the provided Finnish prediction models and gene information in RDS format. The outputs were prediction results and associated information (RDS), and combined posterior probability results in text format. The Danish allele dosage data was harmonized in variant naming and allele orientation for compatibility with the Finnish models, and the dosage data for the missing important variables was imputed using mean values.
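One common way to harmonize allele orientation, shown here as a hedged Python sketch, is to replace a dosage d with 2 - d when the reference and alternate alleles are swapped relative to the model. The article does not describe how the harmonization was implemented, so the data layout, the function name, and the variant identifiers below are assumptions. Strand flips are not handled, and variants absent from the target data are skipped so that they can be imputed afterwards.

```python
def harmonize_dosages(target, model_alleles):
    """Align target dosages with the allele orientation expected by a model.

    target:        {variant_id: (ref, alt, [dosages of alt allele])}
    model_alleles: {variant_id: (ref, alt)} as used when the model was fitted
    """
    harmonized = {}
    for variant_id, (model_ref, model_alt) in model_alleles.items():
        if variant_id not in target:
            continue  # missing variant, left for imputation
        ref, alt, dosages = target[variant_id]
        if (ref, alt) == (model_ref, model_alt):
            harmonized[variant_id] = list(dosages)
        elif (ref, alt) == (model_alt, model_ref):
            # Swapped orientation: count the other allele instead.
            harmonized[variant_id] = [2.0 - d for d in dosages]
        else:
            raise ValueError(f"Allele mismatch for {variant_id}")
    return harmonized


# Hypothetical variants; identifiers are assumed to be matched by name already.
model_alleles = {"chr1_100": ("A", "G"), "chr1_200": ("C", "T"), "chr1_300": ("G", "A")}
target = {
    "chr1_100": ("A", "G", [0.0, 1.0, 2.0]),
    "chr1_200": ("T", "C", [0.0, 0.5, 2.0]),  # swapped relative to the model
}
aligned = harmonize_dosages(target, model_alleles)
```

Here `chr1_300` is absent from the target data and would be filled in by the mean imputation step.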

Fig 4. Application of the Finnish models.

Input data for application include target genotype data and Finnish model file provided in the GitHub repository https://github.com/FRCBS/Blood_group_prediction. Outputs of the classification are prediction results for the antigens.

https://doi.org/10.1371/journal.pcbi.1011977.g004

The model-generating method was further validated by fitting the models on the Danish data set to create models specific to the Danish cohort. In the Danish data set, the percentage of missing genotypes was on average 5%, varying with the genetic region of the blood group/HPA system. Missing allele dosage values were imputed separately for the train and test data sets using mean values before the random forest classification step. Characteristics of the Danish models are presented in Table I in S1 Text. The important variables and their importance values for the Danish models are listed in S2 Data. The evaluation metrics for both prediction and modelling were defined as described in the “Model evaluation metrics” section.
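Mean imputation applied separately to each data set can be sketched as follows. The sketch assumes that "mean values" refers to the per-variant mean of the observed dosages within the same data set, which is an interpretation rather than a detail stated in the article. The function name and the small example matrices are hypothetical.

```python
from statistics import mean


def impute_with_column_means(rows):
    """Replace missing dosages (None) with the mean of the observed values
    in the same column of the same data set. Columns with no observed
    values are left as None."""
    n_cols = len(rows[0])
    column_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        column_means.append(mean(observed) if observed else None)
    return [
        [column_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Rows are samples, columns are variants; values are allele dosages.
train = [[0.0, 2.0], [None, 1.0], [2.0, None]]
test = [[1.0, None], [None, 0.0]]

# Each set is imputed on its own, so no test information enters the train set.
train_imputed = impute_with_column_means(train)
test_imputed = impute_with_column_means(test)
```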

The significance of the variation in balanced accuracies was analyzed using the Mann-Whitney-Wilcoxon test implemented in R v3.6.1.
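For readers unfamiliar with the test, the Mann-Whitney U statistic behind it can be computed by comparing every pair of values across the two groups. The Python sketch below shows only the statistic and omits the p-value, which in R is typically obtained with `wilcox.test`. The function name and the balanced accuracy values are hypothetical and do not come from the study.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b): U_a counts the pairs in which the value from
    group_a exceeds the value from group_b, with ties counted as 0.5."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical balanced accuracies for two sets of models.
u_a, u_b = mann_whitney_u([0.99, 0.98, 0.97], [0.95, 0.96, 0.98])
```

A U value close to 0 or to the product of the group sizes indicates that one group tends to have higher balanced accuracies than the other.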

Supporting information

S1 Text. Document contains supplementary tables and figures.

Table A. Accuracy metrics for the Finnish random forest models in the Finnish test data set. Table B. Accuracy metrics for the Finnish random forest models in the Finnish train data set. Table C. Accuracy metrics for the Finnish random forest models in the Finnish full data set. Table D. Accuracy metrics for the Finnish gradient boosting models in the Finnish test data set. Table E. Accuracy metrics for the Finnish random forest models in the Danish full data set. Table F. Accuracy metrics for the Danish random forest models in the Danish test data set. Table G. Accuracy metrics for the Danish random forest models in the Danish train data set. Table H. Accuracy metrics for the Danish random forest models in the Danish full data set. Table I. Characteristics of the Danish random forest classification models. Table J. Blood group/HPA-1 genes and genetic regions. Fig A. Confusion matrices for the Finnish random forest models in the Finnish test data set. Fig B. Confusion matrices for the Finnish random forest models in the Finnish train data set. Fig C. Confusion matrices for the Finnish random forest models in the Finnish full data set. Fig D. Receiver operating characteristic and precision-recall curves for the Finnish random forest models in the Finnish test data set. Fig E. Posterior probability boxplots for the Finnish random forest models in the Finnish test data set. Fig F. Posterior probability boxplots for the Finnish random forest models in the Finnish train data set. Fig G. Posterior probability boxplots for the Finnish random forest models in the Finnish full data set. Fig H. Confusion matrices for the Finnish gradient boosting models in the Finnish test data set. Fig I. Posterior probability boxplots for the Finnish gradient boosting models in the Finnish test data set. Fig J. Confusion matrices for the Danish random forest models in the Danish train data set. Fig K. Confusion matrices for the Danish random forest models in the Danish test data set. Fig L. Confusion matrices for the Danish random forest models in the Danish full data set.

https://doi.org/10.1371/journal.pcbi.1011977.s001

(PDF)

S1 Data. Document contains list of important variables for the Finnish random forest models.

https://doi.org/10.1371/journal.pcbi.1011977.s002

(XLSX)

S2 Data. Document contains list of important variables for the Danish random forest models.

https://doi.org/10.1371/journal.pcbi.1011977.s003

(XLSX)

Acknowledgments

We want to thank Dr. Satu Pastila and Ms. Ritva Toivanen at the FRCBS for their collaboration on the blood group typing data. We are also grateful to Ms. Birgitta Rantala, Mr. Petteri Vaskin, Ms. Katariina Karjalainen, Ms. Nina Nikiforow, Ms. Jonna Clancy, Dr. Mikko Arvas and Dr. Tiina Wahlfors at the Blood Service Biobank for their help in handling the data and samples, and to Dr. Jaana Mättö and the personnel at the FRCBS Blood Group Unit for blood group typing analyses. From Denmark, we wish to thank the Danish blood donors and deCODE Genetics for genotyping the Danish cohort.

References

  1. Goel R, Tobian AAR, Shaz BH. Noninfectious transfusion-associated adverse events and their mitigation strategies. Blood. 2019 Apr 25;133(17):1831–1839. pmid:30808635
  2. Hendrickson JE, Delaney M. Hemolytic Disease of the Fetus and Newborn: Modern Practice and Future Investigations. Transfus Med Rev. 2016 Oct;30(4):159–64. pmid:27397673
  3. Bussel JB, Vander Haar EL, Berkowitz RL. New developments in fetal and neonatal alloimmune thrombocytopenia. Am J Obstet Gynecol. 2021 Aug;225(2):120–127. pmid:33839095
  4. Hendrickson JE, Tormey CA, Shaz BH. Red blood cell alloimmunization mitigation strategies. Transfus Med Rev. 2014 Jul;28(3):137–44. pmid:24928468
  5. Evers D, Middelburg RA, de Haas M, Zalpuri S, de Vooght KMK, van de Kerkhof D, et al. Red-blood-cell alloimmunisation in relation to antigens’ exposure and their immunogenicity: a cohort study. Lancet Haematol. 2016 Jun 1;3(6):e284–92. pmid:27264038
  6. Flegel WA, Gottschall JL, Denomme GA. Implementing mass-scale red cell genotyping at a blood center. Transfusion. 2015 Nov 1;55(11):2610–5. pmid:26094790
  7. Cone Sullivan JK, Gleadall N, Lane WJ. Blood Group Genotyping. Clin Lab Med. 2022 Dec;42(4):645–68. pmid:36368788
  8. Lane WJ, Westhoff CM, Gleadall NS, Aguad M, Smeland-Wagman R, Vege S, et al. Automated typing of red blood cell and platelet antigens: a whole-genome sequencing study. Lancet Haematol. 2018 Jun 1;5(6):e241–51. pmid:29780001
  9. Veldhuisen B, Van Der Schoot CE, De Haas M. Blood group genotyping: From patient to high-throughput donor screening. Vox Sang. 2009 Oct;97(3):198–206. pmid:19548962
  10. Hyun J, Oh S, Hong YJ, Park KU. Prediction of various blood group systems using Korean whole-genome sequencing data. PLoS One. 2022 Jun 1;17(6). pmid:35657818
  11. Möller M, Jöud M, Storry JR, Olsson ML. Erythrogene: a database for in-depth analysis of the extensive variation in 36 blood group systems in the 1000 Genomes Project. Blood Adv. 2016;1(3):240–9. pmid:29296939
  12. ISBT Blood Group Allele Tables [Internet]. Available from: https://www.isbtweb.org/isbt-working-parties/rcibgt/blood-group-allele-tables.html#blood group allele tables
  13. Gleadall NS, Veldhuisen B, Gollub J, Butterworth AS, Ord J, Penkett CJ, et al. Development and validation of a universal blood donor genotyping platform: A multinational prospective study. Blood Adv. 2020 Aug 11;4(15):3495–506. pmid:32750130
  14. Giollo M, Minervini G, Scalzotto M, Leonardi E, Ferrari C, Tosatto SCE. BOOGIE: Predicting blood groups from high throughput sequencing data. PLoS One. 2015;10(4):1–15. pmid:25893845
  15. Breiman L. Bagging predictors. Mach Learn. 1996;26(2):123–40.
  16. Breiman L. Random forests. Mach Learn. 2001 Oct;45(1):5–32.
  17. Zheng X, Shen J, Cox C, Wakefield JC, Ehm MG, Nelson MR, et al. HIBAG—HLA genotype imputation with attribute bagging. Pharmacogenomics J. 2014;14(2):192–200. pmid:23712092
  18. Ritari J, Hyvärinen K, Clancy J, FinnGen, Partanen J, Koskela S. Increasing accuracy of HLA imputation by a population-specific reference panel in a FinnGen biobank cohort. NAR Genom Bioinform. 2020 May 6;2(2):lqaa030.
  19. Vukcevic D, Traherne JA, Næss S, Ellinghaus E, Kamatani Y, Dilthey A, et al. Imputation of KIR Types from SNP Variation Data. Am J Hum Genet. 2015;97(4):593–607. pmid:26430804
  20. Ritari J, Hyvärinen K, Partanen J, Koskela S. KIR gene content imputation from single-nucleotide polymorphisms in the Finnish population. PeerJ. 2022 Jan 7;10(e12692). pmid:35036093
  21. Wright MN, Ziegler A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J Stat Softw. 2017;77(1):1–17.
  22. Lasalle-Williams M, Nuss R, Le T, Cole L, Hassell K, Murphy JR, et al. Extended red blood cell antigen matching for transfusions in sickle cell disease: a review of a 14-year experience from a single center (CME). Transfusion. 2011 Aug;51(8):1732–9. pmid:21332724
  23. Schonewille H, Honohan Á, Van Der Watering LMG, Hudig F, Te Boekhorst PA, Koopman-Van Gemert AWMM, et al. Incidence of alloantibody formation after ABO-D or extended matched red blood cell transfusions: a randomized trial (MATCH study). Transfusion. 2016 Feb 1;56(2):311–20.
  24. Zhang Z, An HH, Vege S, Hu T, Zhang S, Mosbruger T, et al. Accurate long-read sequencing allows assembly of the duplicated RHD and RHCE genes harboring variants relevant to blood transfusion. Am J Hum Genet. 2022 Jan 6;109(1):180–91. pmid:34968422
  25. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery; 2016. p. 785–94. (KDD ‘16). Available from: https://doi.org/10.1145/2939672.2939785
  26. Alwosheel A, van Cranenburgh S, Chorus CG. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. J Choice Model. 2018;28:167–82.
  27. Kurki MI, Karjalainen J, Palta P, Sipilä TP, Kristiansson K, Donner KM, et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023 Jan;613(7944):508–518. pmid:36653562
  28. Reid ME, Lomas-Francis C, Olsson ML. The Blood Group Antigen FactsBook. 3rd ed. Boston: Elsevier; 2012. 745 p.
  29. van Sambeeck JHJ, van der Schoot CE, van Dijk NM, Schonewille H, Janssen MP. Extended red blood cell matching for all transfusion recipients is feasible. Transfus Med. 2022 Jun 1;32(3):221–8. pmid:34845765
  30. Pedersen OB, Erikstrup C, Kotzé SR, Sørensen E, Petersen MS, Grau K, et al. The Danish Blood Donor Study: A large, prospective cohort and biobank for medical research. Vox Sang. 2012 Apr;102(3):271. pmid:21967299
  31. Hansen TF, Banasik K, Erikstrup C, Pedersen OB, Westergaard D, Chmura PJ, et al. DBDS Genomic Cohort, a prospective and comprehensive resource for integrative and temporal analysis of genetic, environmental and lifestyle factors affecting health of blood donors. BMJ Open. 2019 Jun 1;9(6).
  32. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2023.
  33. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. xgboost: Extreme Gradient Boosting. [Internet]. 2023 [cited 2024 Jan 23]. Available from: https://CRAN.R-project.org/package=xgboost.
  34. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4(43):1686.
  35. Kuhn M. Building Predictive Models in R Using the caret Package. J Stat Softw. 2008;28(5):1–26.
  36. Monaghan TF, Rahman SN, Agudelo CW, Wein AJ, Lazar JM, Everaert K, et al. Foundational Statistical Principles in Medical Research: Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value. Medicina. 2021 May;57(5). pmid:34065637