GraphMHC: Neoantigen prediction model applying the graph neural network to molecular structure

Neoantigens are tumor-derived peptides and are biomarkers that can predict prognosis related to immune checkpoint inhibition by estimating their binding to major histocompatibility complex (MHC) proteins. Although deep neural networks have been primarily used for these prediction models, it is difficult to interpret the models reported thus far as accurately representing the interactions between biomolecules. In this study, we propose the GraphMHC model, which utilizes a graph neural network model applied to molecular structure to simulate the binding between MHC proteins and peptide sequences. Amino acid sequences sourced from the immune epitope database (IEDB) undergo conversion into molecular structures. Subsequently, atomic intrinsic informations and inter-atomic connections are extracted and structured as a graph representation. Stacked graph attention and convolution layers comprise the GraphMHC network which classifies bindings. The prediction results from the test set using the GraphMHC model showed a high performance with an area under the receiver operating characteristic curve of 92.2% (91.9-92.5%), surpassing a baseline model. Moreover, by applying the GraphMHC model to melanoma patient data from The Cancer Genome Atlas project, we found a borderline difference (0.061) in overall survival and a significant difference in stromal score between the high and low neoantigen load groups. This distinction was not present in the baseline model. This study presents the first feature-intrinsic method based on biochemical molecular structure for modeling the binding between MHC protein sequences and neoantigen candidate peptide sequences. This model can provide highly accurate responsibility information that can predict the prognosis of immune checkpoint inhibitors to cancer patients who want to apply it.


Introduction
Cancer is the leading cause of death in a large number of countries worldwide [1] and the number one cause of death in South Korea [2].Cancer mainly results from somatic mutations [3].Cancer immunotherapy basically works by encouraging tumor cells to recognize the MHC protein non-self foreign, leading to activation of the cytotoxic T cell receptor (TCR) and CD8 coreceptor.However, the mechanism by which this occurs acts as an immune checkpoint, suppressing overactivation of the immune system, with proteins such as the programed cell death protein 1 (PD-1) and cytotoxic T lymphocyte antigen-4 (CTLA-4) the main actors in these pathways.Third-generation anticancer drugs block or inhibit this immune checkpoint [4].Despite these efforts, only a small number of subjects respond well to immune treatment [5], and it is limited because of the high expense [6].Neoantigens, or neoplasmic antigens, are tumor-specific antigenic determinants or epitopes that consist of 9-mer-long peptides that are cleaved by proteasome internal organelles and act as biomarkers predicting immune checkpoint inhibition [7,8], which can be estimated by predicting the binding potential of major histocompatibility complex (MHC) proteins to candidate peptides associated with somatic mutations.
MHC proteins bind with peptides via non-covalent hydrogen bonds.The half maximal inhibitory concentration (IC 50 ) between MHC and isolated peptides can be experimentally determined [9], and the results of binding attempts can be found in the Immune Epitope Database (IEDB) [10].

MHC-peptide binding models based on deep neural networks dealing with amino acid sequences
Deep neural network models that utilize binding information between MHC and peptides from IEDB have been published.NetMHCpan-4.0[11] and 4.1 [12] researches are representative of attempts to model the bond between MHC and peptide by constructing a deep neural network using known IEDB.According to NetMHCpan-1.0[13], because the amino acids of MHC are long, only 34 amino acids corresponding to polymorphic residues with high mutation frequency are used, and they are used for input by concatenating with the peptide sequence.Although it is provided as a predictive value by default on the IEDB website and is widely used in clinical practice, there are limitations in terms of structure and accuracy for interaction modeling.
Several models using more advanced modern deep neural networks have also been reported.MHCAttnNet uses a recurrent neural network (RNN) for string processing [14], DeepNeo uses a convolutional neural network (CNN) [15], DeepImmuno used CNN, graph convolutional networks (GCN), which dealt with the relationships between amino acids, not between the constituent atoms [16].
Since these models are merely attempts to connect one-dimensional vectors or make them into two-dimensional matrices, they have limitations in simulating multidimensional interactions between biopolymers.

Graph neural networks for modeling molecular structures
Graph neural networks (GNN) [17], graph convolutional networks (GCN) [18], and graph attention networks (GAT) [19] are useful for graph classification as well as node classification.They are effective in expressing molecular structures and intrinsic attributes.To represent feature matrices and adjacency matrices derived from the graph structure, operations such as convolution and attention are stacked, and sigmoid or softmax functions are applied for classification.
Unlike sequence-based neural networks or array-based neural networks, these models use not only features but also connection information between nodes to enhance accuracy by extracting more information [20,21].

Graph neural networks that model SMILES-based interaction or affinity
Meanwhile, studies using the Simplified Molecular Input Line Entry System (SMILES) [22], an expression for molecular structure in the interaction between polymers in an adjacent academic field, have been reported as input to deep neural networks.Looking at each field, it corresponds to the protein-protein interaction (PPI), the drug-target affinity (DTA), and the drug-drug interaction (DDI).In terms of interaction methods, they can be grouped into two categories: concatenation [23][24][25][26][27][28][29] or coattention [30,31].These graph-graph interaction can be combination between embedded vectors extracted from each graph rather than direct combination between the nodes constituting the two graphs.There are limitations to model the interaction between all components of the graph.

GraphMHC: A model that predicts neoantigens using a graph neural network using molecular structure
This study proposes the GraphMHC model, which uses data from the IEDB and the GNN model to illustrate the binding between MHC proteins and peptide sequences via the molecular structure.The determination of the feature extracted from combining an MHC protein with a peptide as a neoantigen is performed through multi-layered graph convolution when using this method.This model is a method of extracting feature vectors from both graphs, rather than connecting individual feature vectors extracted from each of the two graphs, so comprehensive and interactive feature extraction is possible.The GraphMHC model was validated against the baseline model [12], and applied to melanoma patient data from the Cancer Genome Atlas (TCGA) project.Data were divided into low and high neoantigen load groups for comparison of the clinical differences.

Materials and methods
In this section, the methods used to build and validate the neoantigen prediction model GraphMHC are presented.

GraphMHC: A neoantigen prediction model based on the GNN using MHC class I from the IEDB
The processes for building GraphMHC, which is a neoantigen prediction model based on GNNs, are presented in detail in this section.The overall pipeline can be seen in Fig 1.
The Immune Epitope Database (IEDB) [10], a binding affinity data set for peptides and HLA types, was used to construct the model for predicting neoantigens.The MHC class I dataset was used for neoantigen prediction.The use of the MHC class II dataset is described separately in the section describing extra-validation.
Referring to previous studies, a result with a binding affinity of 500 nM or less was determined to be a neoantigen [32][33][34].
Conversion from the molecular structure of amino acids to dataset of interatomic graph structures.In the IEDB dataset, 157,325 rows pertaining to humans were utilized.These rows were then transformed into amino acids that constitute the MHC protein, employing the conversion data from NetMHCpan-4.1 [12].Out of these, 157,084 were converted, excluding types for which conversion information was unavailable.Classification was based on binding affinity, with binding assumed for IC 50 � 500 nM, and non-binding for IC 50 > 500 nM.The dataset was split, with 80%, or 125 667, for training and 20%, or 31 417, for testing.The conversion process from database to dataset is as follows.
First, human leukocyte antigens (HLA) from the IEDB are converted to MHC amino acid sequences.Second, MHC and peptide sequences are converted using SMILES [22] structures using the RDKit 2022.03.2 library.The expressions used are given in S1 Table .Third, join the two SMILES strings with non-bond notation together (.).Fourth, convert to molecular structure using RDKit.Hydrogen atoms that were omitted from the symbol must be expressed at this point.The molecular structure is expressed in S2 Fig. Fifth, convert to graph structure using the RDKit library to encode vectors and matrices.The feature information of the graph representation is shown in Table 1.Each feature is encoded via one-hot encoding and constructed as a sparse matrix.The graph structure is expressed in S2 Fig using NetworkX 2.8.4 library with Kamada Kawai layout [35].Characteristics of graph representations describing bound and unbounded data are shown in the S2 Table .Sixth, convert graph dataset using the PyTorch Geometric (PyG) [36] 2.1.0library.
Architectural design of the GNN GraphMHC for graph classification.The layer-bylayer architecture of GraphMHC, the GNN model for MHC-peptide binding, is presented in a schematic diagram provided in Fig 2 .In this model, graph attention [37] was used as graph convolution, where the attention factor is multiplied to assign importance to nodes.Another noteworthy element of this model is the stacking of conventional one-dimensional convolution layers after graph convolutions.This procedure enables the re-extraction of a given feature vector multiple times for classification purposes.Another highlight is that skip connections are connected to pass weights between these convolutional layers.This prevents vanishing of weight passing and contributes to more precise tuning.The stacking steps of the model are subdivided as follows.
First, stack four layers of graph attention [38] using PyTorch Geometric (PyG) [36] 2.1.0library.Second, for graph classification, the readout layer is the mean value of the node feature vectors.Third, two layers of 1-dimensional convolution are stacked with the PyTorch [39] 1.11.0+cu113framework.At this time, add a skip connection [40] between layers.Fourth, classify via sigmoid in fully-connected layer.
Model training and evaluation.ADAM [41] was used as the optimizer for learning.All hyperparameters of each layer in these networks are described in Fig 2 , and the batch size is set to 64 and the number of epochs is set to 100.The reason why the number of attention heads was set to 8 is because if it is lower than this, there is a performance degradation, and if it is higher than this, there is a computational load.For tensor calculation during the deep neural network model training, massive parallel processing was performed using RTX 3090 24GB graphic processing units (GPU).Scikit-learn [42] 1.1.2was used to evaluate the classification model, and the 95% AUC-ROC confidence interval was obtained through the DeLong method [43] in MedCalc [44] 20.106.95% confidence intervals for other metrics were calculated directly.
For comparison with models comprising fewer layers, two cases were constructed; one of which involved subtracting two of four GNN layers, while the other involved the subtraction of two CNN layers.

Neoantigen load predictions on next generation sequencing (NGS) data from The Cancer Genome Atlas database
The constructed model was applied to next-generation sequencing (NGS) data from actual cancer patients.Data were sourced from The Cancer Genome Atlas (TCGA) [45].Skin cutaneous melanoma (SKCM) data was used because of the effective response that this type of cancer has to immunotherapy [46].Data were primarily downloaded from the Broad Institute Firehose website.
Sequence information for normal samples, which is not disclosed by TCGA-SKCM, is required for HLA typing and was thus obtained from previous studies [47].The HLA types were reported to be Optitype [48] and were converted to 34-mer amino acids using NetMHCpan-4.1 [12].
The procedure for translating the mutation into a 9-mer peptide and identifying it as a neoantigen is as follows.
First, convert from mutation annotation format (MAF) to variant calling format (VCF) using maf2vcf [49].Second, annotate as Variant Effect Predictor (VEP) [50].At this time, the sequence information for the GRCh38 reference genome is used.Third, select only missense variants.Fourth, convert to amino acids with R customProDB [51].The chromosome nomenclature must be consistent for this process.Fifth, only the region in which the actual mutation appears is truncated to a length of 9-mer [52].Sixth, convert the MHC protein sequence and peptide sequence together to form a graph structure.Seventh, predict the binding affinities using GraphMHC.
Thus, data from a total 310 subjects were used for predicting neoantigen load by combining 9-mer peptides that are considered neoantigen candidates with HLA type information.For more details of the selection process, see S3 Fig Candidates are then predicted neoantigens or not using GraphMHC.
Comparison of groups according to high and low neoantigen load.Two groups were formed around the median, with high and low neoantigen loads.Comparison was performed using survival analysis and immunity score analysis.Clinical information on survival was also obtained from the Firehose website.This study uses overall survival as the analysis target, for which Lifelines 0.27.0 was used.
Estimates of stromal cell score, immune cell score, and tumor purity were obtained from the ESTIMATE website and utilized [53].Data were calculated from the TCGA expression data.The stromal score refers to the number of stroma cells within tumor tissue [54,55], and was used because stromal cells have been reported to be involved in tumor growth and disease progression.The immune score indicates the infiltration of immune cells into tumor tissues and is an immunological biomarker for prognosis prediction and therapeutic response [56].The ESTIMATE score relates to tumor purity and is a combination of the stromal and the immune scores.Scipy 1.9.3 was used for the comparison test.

Extra-validation of the model using MHC class II from the IEDB
Although neoantigens are related to MHC class I and CD8 + , MHC class II and CD4 + have also been reported complementary, with less variation between patients [57,58].Several model studies have investigated this concept [12,59], and a model with the same architecture as the proposed model was trained using MHC class II data from the IEDB for extra-validation in this study.The same classification threshold of 500 nM was used.Data with reduced similarity were selected, and training and test sets comprising 85 708 and 21 427 data points, respectively, were used after pre-processing.

Baseline models for comparison
The model NetMHCpan-4.1 [12] was used as the baseline.Classification using IEDB data and comparison between groups using TCGA-SKCM data were applied equally.NetMHCIIpan-4.0 was used for extra-validation [12].In order to convert the obtained binding affinity to a value between 0 and 1, studies [11,60,61] such as those involving NetMHCpan-4.0used the expression 1 − log(�)/log(50000), whereas an equation 1/(1 + exp(�)) similar to the sigmoid was used for the same comparison in this study.

Results
In this section, the GraphMHC model, which is proposed for neoantigen prediction based on GNNs, is validated using intra-validation, inter-validation, clinical application, and extra-validation.The datasets and models used are clearly shown in Table 2. Matplotlib 3.5.3was used to plot the ROC and PR curves, and Seaborn 0.12.0 was used to create violin plots.

Intra-validation: Combining graph convolution with convolution improves prediction accuracy
The comparison results obtained for the different models according to the layer configuration of the neural network are shown in the center columns of Table 3.The model consisting of 4-layer GNNs and 2-layer CNNs (GraphMHC) demonstrates the highest performance.A model consisting of a 2-layer GNN and a 2-layer CNN has higher performance than a model consisting of only a 4-layer GNN, because feature extraction is performed effectively in the CNN layers.Additionally, it was found that a 4-layer GNN contributed to performance improvement compared to a 4-layer GNN.This is due to the repeated aggregation and updates from neighboring nodes.

Inter-validation: Graph-based model shows better sensitivity than stringbased model
The results of comparing GraphMHC with the baseline model, NetMHCpan-4.1, in the rightmost column of the Table 3 indicate that GraphMHC perform well in terms of AUC-ROC and sensitivity and badly for specificity and F 1 -score.These results indicate a high rate for true positives and a low rate for false negatives, suggesting that it can be used as a meaningful indicator as a predictive biomarker for fatal diseases such as cancer.

Clinical applications: Groups divided by the graph-based model are clinically discriminated
The most common method of using the median as the threshold value for dividing groups was inherited in this study [62][63][64].Examination of the 5-year survival in Fig 3A indicates that the differences between groups divided using GraphMHC are borderline (p=0.061).On the other hand, the results obtained using NetMHCpan-4.1 were not significant in any of the observation periods, aligning with the results of previous studies in which no significance has been reported (p=0.567) for 10-year survival under this method [64].Comparison of the biomarker scores in two groups in Fig 3B show that significant differences were obtained for stromal scores when using GraphMHC, but not for other cases.

Extra-validation: GraphMHC model shows the best performance for most metrics in MHC class II data
The following are the evaluation results of the test set of 21 427 samples obtained from the MHC class II datasets in the IEDB.Comparison of GraphMHC and the corresponding baseline model NetMHCIIpan-4.0, applied to MHC class II data in Table 4, confirms GraphMHC showed higher performance in terms of AUC-ROC and sensitivity, as indicated by the results of inter-validation.Other scores showed similar or slightly lower values.The F 1 -score was the same as the value derived using the baseline model, and the specificity obtained was low.

Implications of the neoantigen prediction study using the GNN
This study is the first to predict binding by modeling the biochemical molecular structure using information that describes MHC protein sequences and peptide sequences as candidate neoantigens.It is noteworthy that feature extraction and binding modeling are possible only when the inherent structural information from the sequence data is used and no additional external information is included.In other words, no separate interaction mechanism is required; the structure and connection information between nodes and edges in the graph structure itself are aggregated and extracted through layered automatic feature extraction, and the minute attractive and repulsive forces that occur between complex and diverse atoms are simulated when using this method.As a similar research case, a protein folding model using GNN with reduced parameters from AlphaFold 2 [65] has been reported to simulate the interaction between MHC and peptides [66], indicating the possibility of using GNN for the structural prediction of MHC-peptide binding.Comparison of high and low neoantigen load groups using TCGA-SKCM data.A Comparison of overall survival in high and low groups according to the median number of loaded neoantigens.The left is the result of grouping by GraphMHC while the right is obtained using NetMHCpan-4.1.Examination of the 5-year survival indicates that the differences between groups divided using GraphMHC are borderline.B Comparison of biomarker scores in high and low groups according to the median number of loaded neoantigens.The upper row is the result of grouping by GraphMHC while the lower row was obtained using NetMHCpan-4.1.
Comparison of the stromal scores in two groups show that significant differences were obtained when using GraphMHC. https://doi.org/10.1371/journal.pone.0291223.g003 In addition, the differences in the research results obtained in previous studies are not limited to the performance of the model.Application of the proposed model, GraphMHC, to clinical data suggested that it has better discrimination compared to those observed previously.These results suggest that the GraphMHC model can be used as a biomarker for the cancer immune response.

Potential application suggested from research results
This research model can be used in biological experiments or for medical prediction or prevention.One case in which binding affinity was evaluated using NetMHCpan in in vitro immunoprecipitation and liquid chromatography-tandem mass spectrometry experiments (LC-MS/MS) [67] and one in which the CD8 + epitope was predicted and validated using NetMHC in vivo can be cited as close examples [68].In terms of screening, it is possible that accurate information about responsibility can be obtained via sequencing for patients who are to receive immune checkpoint inhibitors, enabling patient-specific treatment.This model can also be used as a biomarker to divide groups and compare between two groups to predict response to cancer immunotherapy [15,69].Even more noteworthy is its possible use in preventing cancer by using it in the development of peptide vaccines that can activate T cells [70][71][72].On the other hand, the results of this study could also be used to predict similar amino acid bindings such as the SARS-CoV-2 (COVID-19) antigen [16,73].Furthermore, the application of this model could be considered not only for amino acids, but also for modeling ligand or drug binding that can be represented by SMILES.From these various perspectives, this study can be considered a pioneering breakthrough in precision medicine research.

Limitations of the study
Despite the original suggestions and potential applications, this study also has some limitations.The information in the IEDB concerning the binding affinity between MHCs and peptides, including HLA-type polymorphisms is incomplete, although the experimental data are steadily accumulating.Inevitable uncertainties are also associated with the conversion and transformation of data via several different methods.In this regard, attempts to call variants more accurately using deep learning have been reported [74].In addition, there is no guarantee that the conversion of the 34-mer amino acid sequence corresponding to the polymorphic residues in the HLA types referenced to by NetMHCpan is absolute.In terms of the model itself, the internal structure of the GNN model means that it occupies more memory than the string-based neural network model, indicating that a server-side service would be useful.Beyond the neoantigen load, several other points require consideration when expanding the research of immunotherapy.Even in cases where large numbers of neoantigens are the prognosis is often poor.Therefore, not only the foreignness of a tumor as an antigenic mutation, but also the tumor sensitivity to mutations that are exported from the inside to the outside of a cancer cell should be considered and different machine learning methods applied [15,75].It is also necessary to consider TCR binding [76], for which other machine learning methods are being developed [77][78][79].

Research topics not included in this study
Since this study aimed to extract intrinsic information from a given amino acid sequence, the research methodology of using the extracted dataset with additional data was not included.This because the methodologies used in studies reporting in this field are not as yet verifiable because of the small number or the artificial nature of datasets used.For example, one model that used molecular information about amino acids as well as binding affinity, Neopepsee [63], reached high prediction accuracy with an AUC of 0.981 by applying a support vector machine (SVM) model that combined the binding from the IEDB with parameters as immunogenicity, sequence similarity, and amino acid pairwise contact potentials [80].The study was limited by the small sample size, with only 311 positive epitopes and 14,633 mutant negative peptides included, and the IMMA2 dataset used to calculate the amino acid pairwise contact potentials in this study was composed of only 558 immunogenic and 527 non-immunogenic peptide values [78], Another example, NetMHCpan-4.0, is a multilayer perceptron model that outputs binding affinities and eluted ligands from mass spectrometry, reached approximately 0.98 [11].This study has a total of 85,217 entries, but the results are limited because the negative entries were artificially generated.Other models derived from it, such as MHCflurry [81] and DeepHLApan [82], have similar methods.

Further studies on precision oncology
In this study, only the genomic approach was presented using the graph neural network for cancer treatment, but a metabolic network approach and a gene regulatory network approach are also possible, and several such attempts have already been reported.These directions are outlined for further research on cancer-related networks.
In current precision oncology studies, the interaction effects within the tumor microenvironment, such as stromal cells and immune cells surrounding the tumor cells, are also being considered.In other words, in tumor-centric research, interaction between players will be treated as a time-dependent study in the future.While normal cells become proliferative through the cell cycle when there is a mitogenic growth signal, cancer cells overcome the antitumor defense by receiving oxygen from the blood vessel supply.This represents an important characteristic of early and midstage [83].Tumor cells receive and symbiotically utilize glucose and lactic acid, so suppressing this supply can lead to the death of tumor cells [84].If normalization of angiogenesis accompanies this, the effect of immunotherapy can be expected to increase [85,86].A metabolic network-based approach is useful for understanding the tumor microenvironment, and in a related study, a graph neural network was used to estimate the flux between cells [87].
After the study of induced pluripotent stem cells (iPS), which enabled dedifferentiation by regulating four genes [88,89], an extended study revealed that two regulators, BCL11A and HDAC1/2, have been identified in the gene regulatory network for reprogramming cancer cells [88,89].The proposed model in this study only addressed the classification problem of graphs, but it can also be applied to link prediction [90] or community detection [91] problems in gene regulatory networks.

Conclusion
The GraphMHC model predicts neoantigens by converting MHC protein and peptide binding to graphic structure using the intrinsic features of the sequences themselves.
The GraphMHC model, which is based on the GNN, showed high accuracy with a low false negative rate for predicting neoantigens as compared to the baseline model.A significant difference was also observed when using the GraphMHC model to divide data into two groups for testing clinical discrimination.The GraphMHC model can thus be considered suitable for predicting the response prognosis of immune checkpoint inhibition.

Fig 1 .
Fig 1. Overall pipeline for modeling neoantigen prediction.HLA types and 9-mer peptides from the IEDB are converted into molecular structures, and the GraphMHC model is used to classify probable neoantigens from the extracted atoms and interatomic features.https://doi.org/10.1371/journal.pone.0291223.g001

Fig 3 .
Fig 3.Comparison of high and low neoantigen load groups using TCGA-SKCM data.A Comparison of overall survival in high and low groups according to the median number of loaded neoantigens.The left is the result of grouping by GraphMHC while the right is obtained using NetMHCpan-4.1.Examination of the 5-year survival indicates that the differences between groups divided using GraphMHC are borderline.B Comparison of biomarker scores in high and low groups according to the median number of loaded neoantigens.The upper row is the result of grouping by GraphMHC while the lower row was obtained using NetMHCpan-4.1.Comparison of the stromal scores in two groups show that significant differences were obtained when using GraphMHC.