Fig 1.
Summary of HLA predictor performance.
(A) Venn diagram of the alpha and beta chains significantly associated with an HLA allele according to Fisher’s exact test in our three cohorts. Overlaps correspond to sequences associated with the same HLA allele in different datasets. (B) Numbers of HLA-associated TCRs by CD4+ or CD8+ phenotype, showing concordance between the phenotype and the MHC class of the HLA allele. (C-D) Correlation between the precision of the classifier and (C) the number of associations found and (D) the frequency of the HLA allele in the cohort. Dots are color-coded by HLA locus. (E) Receiver Operating Characteristic (ROC) curve of the classifier for the A*02:01 allele in the validation dataset, using each chain separately or both together. (F) Comparison of the area under the ROC curve (AUC) for each tested allele when training the model with the beta chain only (x axis) vs. with beta + alpha chains (y axis). Performance is best when using both chains for 26 out of 47 alleles. (G) Performance metrics of the HLA classifier on the validation dataset. Specificity = TN/(TN+FP), Sensitivity = TP/(TP+FN), Accuracy = (TP+TN)/(TP+TN+FP+FN), Precision = TP/(TP+FP), where T/F stand for true/false and P/N for positives/negatives.
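The metric definitions in panel G can be written out as a short function. This is a minimal sketch, not the authors' code: the function name `classification_metrics` and the choice to return NaN for an empty denominator are assumptions made for illustration.

```python
def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero (illustrative choice)."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Compute the four panel-G metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives for one HLA allele.
    """
    return {
        "specificity": _ratio(tn, tn + fp),
        "sensitivity": _ratio(tp, tp + fn),
        # Parentheses matter: the whole sum TP+TN is divided by the total.
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these made-up counts, accuracy is 85/100 and precision is 40/50; the numbers carry no relation to the reported results.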
Fig 2.
(A) Comparison of AUCs between the basic classifier (based on exact matches with HLA-associated TCRs) and a classifier in which single amino acid variants are also counted as matches. (B) Comparison of AUCs between the basic classifier, in which the weight of each HLA-associated TCR is learned using logistic regression, and a simpler classifier that counts the number of matches (without weights, i.e. all weights = 1), as used in [21].
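The classifier variants compared in this figure differ only in how matches are found and how they are combined into a score. The sketch below illustrates that difference under stated assumptions: TCRs are represented by CDR3 strings alone, a "single amino acid variant" is taken to mean one substitution at equal length, and the weights and bias are placeholders rather than values fitted by logistic regression. All function names are hypothetical.

```python
import math


def within_one_substitution(a, b):
    """True if a and b have equal length and differ at no more than one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def matched_tcrs(repertoire, associated, allow_variants=False):
    """Return the HLA-associated TCRs that have a match in the repertoire."""
    repertoire = set(repertoire)
    if not allow_variants:
        # Basic classifier: exact matches only.
        return {t for t in associated if t in repertoire}
    # Variant classifier: a single substitution also counts as a match.
    return {t for t in associated
            if any(within_one_substitution(t, r) for r in repertoire)}


def count_score(repertoire, associated, allow_variants=False):
    """Unweighted score: number of matched TCRs (all weights = 1)."""
    return len(matched_tcrs(repertoire, associated, allow_variants))


def weighted_score(repertoire, weights, bias=0.0, allow_variants=False):
    """Logistic score: sigmoid of the bias plus the weights of matched TCRs."""
    hits = matched_tcrs(repertoire, weights, allow_variants)
    logit = bias + sum(weights[t] for t in hits)
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder weights; in the paper these are learned by logistic regression.
weights = {"CASSLGQYF": 2.0, "CASSPGTYF": -0.5}
```

Calling `count_score(["CASSLGAYF"], weights)` finds no exact match, whereas the same call with `allow_variants=True` counts one, because the query differs from `CASSLGQYF` by a single substitution.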
Fig 3.
Features of the HLA-related sequences.
(A) CDR3 length distribution of A*02:01-associated TCRs, compared to all TCRs. (B) Comparison of V gene usage frequencies between the same two subsets. Some V genes are preferentially used in A*02:01-associated TCRs (TRBV10, TRBV19, TRBV29), while others are underrepresented (TRBV15, TRAV24, TRAV21). (C) Whole-repertoire frequencies of two representative differentially used genes, TRBV19 and TRBV15, in A*02:01-positive vs. A*02:01-negative individuals, showing small but statistically significant differences. (D) Network analysis of TCRβ chains associated with different HLA-A, -B and -C alleles (TCRα chains were excluded since they formed negligible networks). Each node represents an amino acid CDR3 + V gene clonotype. Edges connect clonotypes that have the same length and V gene and at most one amino acid mismatch. Each color represents the HLA allele to which the TCR is responsive. Shadow colors correspond to the epitopes to which those TCRs are reported to respond in the VDJdb database [37].
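The edge rule used for the network in panel D (same V gene, same CDR3 length, at most one amino acid mismatch) can be sketched as follows. This is an illustrative implementation, not the authors' pipeline; the function name `build_edges` and the `(v_gene, cdr3)` tuple representation are assumptions, and the example sequences are made up.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(clonotypes):
    """Return sorted edges between clonotypes given as (v_gene, cdr3) tuples.

    Two clonotypes are connected when they share the V gene and CDR3
    length and their CDR3 sequences differ at no more than one position.
    """
    # Group by (V gene, CDR3 length) so only comparable pairs are tested.
    groups = defaultdict(list)
    for v_gene, cdr3 in set(clonotypes):
        groups[(v_gene, len(cdr3))].append((v_gene, cdr3))

    edges = []
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            mismatches = sum(x != y for x, y in zip(a[1], b[1]))
            if mismatches <= 1:
                edges.append((a, b))
    return sorted(edges)


# Hypothetical clonotypes: only the first two satisfy the edge rule.
example = [
    ("TRBV19", "CASSIRSSYEQYF"),
    ("TRBV19", "CASSIRSAYEQYF"),  # one substitution, same V gene and length
    ("TRBV15", "CASSIRSSYEQYF"),  # same CDR3 but different V gene
    ("TRBV19", "CASSIRSSYEQY"),   # different length
]
edges = build_edges(example)
```

Grouping before comparing keeps the pairwise step restricted to clonotypes that could be connected at all, which matters when the number of nodes is large.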
Fig 4.
(a) Number of TCRs found to be significantly associated with the HLA alleles A*02:01, DQA1*01:02, DPB1*04:01 and DPB1*02:01 after recursively applying our HLA predictor to two untyped datasets [41, 42]. In this procedure, donors are typed iteratively in groups of 100 patients. The HLA-associated sequences found in the newly typed subcohort are used to infer the HLA type of the next 100 patients. (b) Comparison of the AUC on the validation dataset when predicting with the sequences inferred in the first round from our initial typed cohort (blue bar), the ones learned exclusively from the untyped cohorts after 14 iterations (green bar), or the combination of both (orange bar). Performance is highest when using all the available HLA-associated TCRs, supporting the quality of the information acquired through this iterative process.
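The iterative procedure described in panel (a) can be outlined as a loop over batches of donors. The skeleton below is only a sketch under assumptions: `predict` and `learn` are hypothetical callables standing in for the HLA predictor and the association test, and the choice to re-learn associations from all donors typed so far and merge them with the initial set is an assumption, since the caption does not specify it.

```python
def iterative_typing(untyped_donors, initial_associations, predict, learn,
                     batch_size=100):
    """Type donors batch by batch, updating the associated TCR set each round.

    predict(donor, associations) -> predicted HLA status for one donor.
    learn(typed) -> set of TCRs associated with the allele, given a list of
                    (donor, predicted_status) pairs.
    Both callables are placeholders for illustration.
    """
    associations = set(initial_associations)
    typed = []
    for start in range(0, len(untyped_donors), batch_size):
        batch = untyped_donors[start:start + batch_size]
        # Type the current batch with the associations known so far.
        typed.extend((donor, predict(donor, associations)) for donor in batch)
        # Assumption: re-learn from every donor typed so far and keep
        # the initial associations as well.
        associations = set(initial_associations) | set(learn(typed))
    return typed, associations


# Toy stand-ins: a donor is positive if it carries any associated TCR, and
# every TCR seen in a positive donor is treated as associated.
def toy_predict(donor, associations):
    return bool(donor["tcrs"] & associations)


def toy_learn(typed):
    found = set()
    for donor, positive in typed:
        if positive:
            found |= donor["tcrs"]
    return found
```

In the toy setting, each round can enlarge the associated set, so donors that share no TCR with the initial set may still be typed as positive in a later batch.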