CAMAP: Artificial neural networks unveil the role of codon arrangement in modulating MHC-I peptides presentation

MHC-I associated peptides (MAPs) play a central role in the elimination of virus-infected and neoplastic cells by CD8 T cells. However, accurately predicting the MAP repertoire remains difficult, because only a fraction of the transcriptome generates MAPs. In this study, we investigated whether codon arrangement (usage and placement) regulates MAP biogenesis. We developed an artificial neural network called Codon Arrangement MAP Predictor (CAMAP), predicting MAP presentation solely from mRNA sequences flanking the MAP-coding codons (MCCs), while excluding the MCC per se. CAMAP predictions were significantly more accurate when using original codon sequences than shuffled codon sequences which reflect amino acid usage. Furthermore, predictions were independent of mRNA expression and MAP binding affinity to MHC-I molecules and applied to several cell types and species. Combining MAP ligand scores, transcript expression level and CAMAP scores was particularly useful to increase MAP prediction accuracy. Using an in vitro assay, we showed that varying the synonymous codons in the regions flanking the MCCs (without changing the amino acid sequence) resulted in significant modulation of MAP presentation at the cell surface. Taken together, our results demonstrate the role of codon arrangement in the regulation of MAP presentation and support integration of both translational and post-translational events in predictive algorithms to ameliorate modeling of the immunopeptidome.


Introduction
In jawed vertebrates, virtually all nucleated cells present at their surface major histocompatibility complex class-I (MHC-I) associated peptides (MAPs), collectively referred to as the immunopeptidome [1,2]. MAPs play a central role in shaping the adaptive immune system, as they orchestrate the development, survival and activation of CD8 T cells [3]. Moreover, recognition of abnormal MAPs is essential to the elimination of virus-infected and neoplastic cells [4]. Therefore, systems-level understanding of MAP biogenesis and molecular composition remains a central issue in immunobiology [5,6].
The generation of the immunopeptidome can be conceptualized in two main events: (a) the generation of MAP candidates (i.e. peptides of appropriate length for MHC-I presentation) through protein degradation, and (b) a subsequent filtering step through the binding of MAP candidates to the available MHC-I molecules. Rules that regulate the second event have been well characterized using artificial neural networks (ANN) and weighted matrix approaches [7,8]. However, accurately predicting which peptides will ultimately reach MHC-I molecules following a multistep processing in the cytosol and endoplasmic reticulum remains an open question [6]. Most efforts at modeling MAP generation have focused on post-translational events and their regulation by the amino acid sequence of MAPs and of directly adjacent residues (typically 10-mers at the N-and C-termini). While the consideration of preferential sites of proteasome cleavage has proven useful to enrich for MAP candidates [9], it remains insufficient for MAP prediction, due to prohibitive false discovery rates [10][11][12].
A large body of evidence suggests that a substantial portion of MAPs are produced co-translationally [13][14][15], deriving from defective ribosomal products (DRiPs), that is, polypeptides that fail to achieve a stable conformation during translation and are consequently rapidly degraded. This concept was initially supported by two observations: (i) viral MAPs can be detected within minutes after viral infection, much earlier than their associated proteins halflife [16], and (ii) MAP presentation correlates more closely with translation rate than with overall protein abundance [17,18]. In addition, while all proteins contain peptides that are predicted to bind MHC-I molecules, mass spectrometry analyses have revealed that the immunopeptidome is not a random excerpt of the transcriptome or the proteome [1,19]. Indeed, proteogenomic analyses of 25,270 MAPs isolated from B lymphocytes of 18 individuals showed that 41% of expressed protein-coding genes generated no MAPs [19]. These authors also provided compelling evidence that the presentation of MAPs cannot be explained solely by their affinity to MHC-I alleles and their transcript expression levels, while ruling out low mass spectrometry sensitivity as an explanation for the non-presentation of the strong binders. Because (i) MAPs appear to preferentially derive from DRiPs and (ii) codon usage influences both precision and efficiency of protein synthesis [20,21], we hypothesized that codon usage in the vicinity of MAP-coding codons (MCCs) might significantly contribute to the regulation of MAP biogenesis. We developed an artificial neural network called Codon Arrangement MAP Predictor (CAMAP), trained to identify MCCs flanking regions. We then used CAMAP to uncover key codon features that characterize mRNA sequences encoding for MAPs (i.e. source) when compared to sequences that do not (i.e. non-source).

Dataset description
We analyzed a previously published dataset consisting of MAPs presented on B lymphoblastoid cell line (B-LCL) by a total of 33 MHC-I alleles from 18 subjects [19,22]. Because we were searching for features that influence MAP generation and not the binding of MAP to MHC-I molecules, we elected to analyze the MCC flanking sequences only and excluded the MCCs per se from our positive (hits) and negative (decoys) sequences ( Fig 1A). To facilitate data analysis and interpretation, we restricted our hit dataset to MAPs with a length of 9 amino acids, for a total of 19,656 9-mer MAPs (which represents 78% of MAPs in this dataset). We next created a decoy dataset from transcripts that generated no MAPs, by randomly selecting 98,290 9-mers from these transcripts. Finally, we used pyGeno [23] to extract MCCs flanking regions corresponding to both hit and decoy MAPs, which constituted our final dataset for CAMAP. Of note, each sequence in the final dataset is unique and derives from the canonical reading frame.
In addition, in order to investigate the relative importance of codon vs. amino acid usage in MAP biogenesis, we generated a dataset of shuffled sequences (for both positive and negative datasets) in which original codon sequences were randomly replaced by synonymous codons according to their usage frequency in the human transcriptome ( Fig 1B). This transformation was performed to ensure that both neural networks received the same number of parameters as input, preventing the introduction of a favorable bias for the codon network. The random shuffling causes any codon-specific feature to be shared among synonyms, thereby causing the shuffled codon distribution to reflect the amino acid usage (see Materials and Methods for more details). Indeed, codon distributions in the shuffled datasets more closely reflected those of their corresponding amino acid than in the original dataset (S1 Fig), with 92% of codons in the shuffled dataset showing a strong correlation (R 2 > 0.95) with the amino acid distribution, compared to only 69% in the original dataset (p < 2x10 -16 , S2 Fig). It also eliminates differences in codon usage between the hit and decoy datasets. Importantly, this shuffling does not affect the resulting amino acid sequence thereby preserving all potential amino acid-related motifs. Distributions of each codon in the original VS shuffled datasets and comparison to its corresponding amino acid can be found in S3-S20 Figs.

CAMAP links codon usage to MAP presentation
To assess the importance of codon usage in MAP biogenesis, we reasoned that if codons bear important information that is operative at the translational rather than the post-translational level, then: (i) CAMAP trained to identify MCCs flanking regions should consistently perform better when trained on original codon sequences than on shuffled codon sequences (reflecting amino acid sequences), and (ii) synonymous codons should have different effects on the prediction. To test these hypotheses, CAMAP received as inputs MCCs flanking regions from hit and decoy sequences from either the original or shuffled datasets. It was then trained to predict the probability that individual input sequences were MCCs flanking regions (i.e. hit) rather than sequences from the negative dataset (i.e. decoy, S21A Fig).
We compared CAMAP performance when predicting MAP presentation from original codon sequences, versus shuffled sequences representing amino acid arrangement. To evaluate the robustness of our approach, 12 different CAMAPs were trained in parallel, with different train-validation-test splits of the dataset. Our results show that predictions were consistently better when CAMAP received the original codons rather than the shuffled sequences (Fig 2A). CAMAPs receiving information from both pre-MCCs and post-MCCs sequences (i.e. whole MCC flanking context) also performed better than when receiving only pre-or post-MCCs context (Figs 2A and S21B and S21C), suggesting that pre-and post-MCCs context are not redundant. Indeed, we found a weak correlation between the prediction scores of CAMAPs trained only with pre-or post-MCCs sequences (S22 Fig). In addition, CAMAPs receiving longer sequences performed better than those receiving shorter sequences (Fig 2B). Because Transcripts expressed in B-LCL from 18 subjects were considered as source or non-source transcripts depending on their match with at least one MAP. Because we were searching for features that might influence MAP generation and not the binding of MAP to MHC-I, we focused our attention on mRNA sequences adjacent to the nine MCCs (i.e. up to 162 nucleotides on each side of MCCs). (B) Creation of the shuffled dataset. Codons were randomly replaced by a synonymous codon according to their respective frequencies (i.e. codon usage) in the transcriptome. The random shuffling causes any codon-specific feature to be shared among synonyms, thereby causing the shuffled codon distribution to reflect the amino acid usage. It also eliminates the differences in synonymous codon usage between the hit and decoy datasets. Importantly, both the original sequence and its shuffled version translate into the same amino acids.
https://doi.org/10.1371/journal.pcbi.1009482.g001 sequences located far upstream and downstream of the MCCs (i.e. in ranges exceeding the direct influence of proteases) are informative regarding MAP presentation, it supports the existence of factors unrelated to protein degradation modulating MAP presentation.

CAMAP predictions are independent of MAP binding affinity or transcript expression level
Both MAP binding affinity to the MHC-I molecule and the level of gene expression are predictive of MAP presentation [19]. Because codon usage has been shown to be different in highly expressed genes, we wanted to verify whether the codon-specific rules captured by CAMAP were associated with potential biases in our positive dataset, which is enriched in highly expressed genes. We first show that there is no correlation between gene expression levels and CAMAP scores in both the positive and negative datasets (R < 0.1, Fig 3A). This was true for both average expression levels across our samples (Fig 3A), and for samples individually (see S23 Fig). Secondly, we trained CAMAP networks using a decoy dataset that mirrored the positive dataset gene expression level (S24A Fig) and showed similar results: CAMAP trained on original codon sequences performed better than CAMAP trained on shuffled sequences (S24B Fig). These results show that the codon-specific rules captured by CAMAP trained on original sequences are independent of gene expression levels.
While CAMAP does not receive any information about the MCC per se, we stipulated that the presence of MHC-I binding motifs in the MCCs in the positive dataset might be associated with biases in the MAP-flanking regions, which could also influence CAMAP training. Therefore, to evaluate the presence of this potential bias, we first evaluated the correlation between CAMAP scores and MAPs binding affinity. Again, our result showed no correlation between CAMAP scores and MAP binding affinity, both when considering the minimal binding affinity of each MAP to the MHC-I alleles contained in our dataset (Fig 3B) or when considering each allele individually (S25-S26 Figs). Secondly, we trained CAMAP networks using a decoy dataset that mirrored the positive dataset MAP binding affinities (S27A Fig). Again, CAMAPs trained on original codon sequence performed better than CAMAPs trained on shuffled sequences (S27B Fig). These results show that codon-specific rules captured by CAMAP We next evaluated the possibility of biases associated with many MAPs originating from conserved regions (e.g., found in multiple domains of the same domain family such as zinc fingers or kinases). We first evaluated MAPs that could originate from different transcripts within the transcriptome (i.e. transcripts with sufficient expression levels detected by RNA sequencing) as they are likely to represent conserved regions in the genome. We therefore trained CAMAPs with datasets excluding entries encoding for MAPs that had >3 or >10 possible origins and compared their performance with that of CAMAPs trained without excluding these MAPs. Our results show that whatever the dataset used, CAMAP trained with original sequences always significantly outperformed CAMAP trained with shuffled sequences (S30 Fig). Taken together, these results suggest that the codonspecific rules captured by CAMAP are independent of potential homologies in the hit dataset, as they do not appear to influence CAMAP performance.
Non-source transcripts have been previously associated with higher GC content [19]. To evaluate the possibility of biases associated with GC-content, we evaluate the correlation between GC-content within MAP-flanking sequences (using a context size of 162 nucleotides) and CAMAP score. We found a negative correlation between an example's CAMAP score and its GC-content (R = -0.4, p < 1 x 10 −16 , S31A Fig). This could be related to the fact that lower GC-content is associated with lower mRNA stability [24], which could lead to the formation of DRiPs. We then trained CAMAP networks using a decoy dataset that mirrored the positive dataset GC content (S32A Fig). Again, CAMAPs trained on original codon sequence Previous studies have shown that codon bias plays a significant role in translation efficiency (TE) by co-adaptation to the tRNA pool [25,26]. This effect has been quantified in the tRNA Adaptation Index (tAI), which is a metric for translation efficiency [27]. While hit and decoy datasets showed a similar distribution of mean tAI of the MAP-flanking sequences, we evaluated the possibility of a bias in translation efficiency in our datasets. Results show a weak correlation between mean tAI and CAMAP scores (R = 0.1767 and 0.2600 for hits and decoys, respectively), suggesting that the signal captured by CAMAP could be in part attributable to codon usage impact on translation efficiency (S31B Fig).
We evaluated whether CAMAP would still be predictive for negative examples (decoys) derived from source transcripts. While most decoys derived from source transcripts showed a CAMAP score above 0.5 and are thus classified as source, they still had a lower CAMAP score compared to hits from source transcripts (p < 1x10 -16 , S33 Fig).
Finally, we evaluated how different shuffling methods affected CAMAP performance, to better understand the role of codon arrangement (i.e. usage and position) in CAMAP's predictions. The shuffling method used above (Figs 1 and 2) generates shuffled sequences by replacing each codon by one of its synonymous codons (including itself) according to the codon usage within the transcriptome. This shuffling method, which we designated as transcriptome shuffling (S34A Fig), eliminates any synonymous codon usage and GC content differences between the hit and decoy datasets. To evaluate the importance of codon positions in the MAP-flanking sequences, we generated a second shuffled dataset in which synonymous codons present within a given example (i.e. MAP-flanking sequences pre-and post-MCC context) are swapped with one another (S34B Fig). This shuffling method, which we call codon swap shuffling, preserves both the global codon usage, GC-content and amino acid sequences for each example within the hit and decoy datasets. Finally, we generated a third shuffled dataset in which only the third nucleotide of codons are swapped within each example, while preserving amino acid sequences and GC content (third nucleotide shuffling, S34C Fig). Here, the global codon usage is completely different from normal codon usage in humans, as the frequency of each synonymous codon is not taken into account during shuffling. The performance of CAMAP networks pre-trained on original (non-shuffled) datasets was then evaluated on each shuffled dataset.
As shown in S35 Fig, the codon swap shuffling only slightly decreased CAMAP predictions compared to the original dataset (mean AUC from 0.689 for original to 0.687 for codon swap shuffling, respectively, p = 2.266 x 10 −4 ). This could be due to the limited number of synonymous codons available for swapping within a single example. This also suggests that the overall codon composition of the MAP-flanking sequences has more impact on CAMAP predictions than the codons' position within the MAP-flanking sequences. The third nucleotide shuffling also decreased CAMAP performances (mean AUC of 0.676) but remained above that of the transcriptome shuffled dataset (mean AUC of 0.647). These results suggest that the GC content has a significant impact on CAMAP performances, as the codon swap and third nucleotide shuffling both preserves the GC content within each example, while the transcriptome shuffling does not. This is also supported by the negative correlation between GC content and CAMAP scores, as shown in S31A Fig. Taken together, these results support a role for codon arrangements, and especially for codon usage in the regulation of MAP biogenesis which cannot be reduced to codon's impact on gene expression levels, GC content or translation efficiency.

Validation of CAMAP predictions on human and mouse datasets
We next validated our CAMAP trained on 9-mer MAPs derived from B-LCL using 5 datasets derived from different human and mouse cell types. All the validation datasets were described through proteogenomic analyses similarly to our B-LCL training datasets. However, all the validation datasets included MAPs of 8-11 mers, in contrast with the training dataset that contained only 9-mer MAPs. The validation datasets consisted of (i) our B-LCL dataset, this time including all peptide lengths [19,22], (ii) a dataset of human peripheral blood mononucleated cells (PBMCs) [28], (iii) a dataset of B-lymphoblastoid cells expressing unique HLA alleles (B721.221 [11]), (iv) murine colon carcinoma cell line (CT26) and (v) a murine lymphoma cell line (EL4, [28,29]). For all datasets, we created hit and decoy datasets of original and shuffled sequences using the same approach described above but including MAPs of 8-11 amino acids. Notably, CAMAPs trained on human sequences encoding 9-mers MAPs from one human cell type (i.e. B-LCL) could also predict presentation of 8-11 mers MAPs in other human cell types (Fig 4), as well as from mouse cell lines, albeit with lower performances (Fig 4). Here again, CAMAPs trained on original sequences consistently outperformed CAMAPs trained on shuffled sequences (Fig 4). These results show that the codon-specific rules derived by CAMAPs to predict MAP presentation are valid across different cell types, and can even be applied to another species, albeit with slightly lower performances. These results support a role for codons in the modulation of MAP presentation.
The lower performances of CAMAP trained with shuffled sequences (representing amino acid distribution) suggests that amino acids in MAP-flanking sequences are less informative than codons regarding MAP presentation. We formally quantified this difference in information using the Kullback-Leibler (KL) divergence (see Materials and Methods for more details). Most codons (47/61, 77%) showed greater KL divergence in the original dataset than the shuffled dataset, indicating that codon distributions contained more information with regards to MAP presentation than amino acid distributions (S36 Fig). These results suggest that codons in MAP-flanking regions play a role that is non-redundant with amino acids in MAP biogenesis.

Codons closest to the MCC are most influential on CAMAP predictions
We wondered whether some regions were more influential on MAP presentation than others. To address this question, we retrieved the model preferences for each codon at each position. The preferences correspond to the prediction score of our best model (trained with original codon sequences for a context size of 162 nucleotides) when a single codon at a single position is provided as input (all other positions being set at [0,0] coordinates in the embedding space, see Materials and Methods for more details). The model's preferences are therefore a measure of each individual codon's propensity to increase or decrease the model's output probability as a function of its position relative to the MCCs. A value of 0.5 denotes a neutral preference, while negative and positive preferences correspond to values below and above 0.5, respectively. Preferences were obtained by feeding CAMAP sequences in which all codon values were masked, except for a single position that received a non-null codon label.
Interestingly, while codons closest to the MCCs were the most influential on CAMAP scores, some synonymous codons showed opposite effects, demonstrating that codon usage does not entirely recapitulate amino acid usage (Figs 5A, 5B and S37). The use of embeddings to encode codons has the advantage of arranging them into a semantic space, wherein codons with similar influences are positioned close to each other. Interestingly, most synonymous codons did not form clusters, with a notable exception being proline codons (Fig 5C). This finding indicates that for some codons, their effect on CAMAP prediction score may be closer to that of a non-synonymous codon than to that of one of its synonyms.
As expected, a significant part of the signal captured by CAMAP still originates from amino acids. As shown in Fig 2, the accuracy (AUC) of CAMAP trained with shuffled synonymous codons is closer to the AUC of CAMAP trained with original codons than it is to 0.5 (random score). In addition, we find that CAMAP has a higher preference for all codons encoding arginine (R) and lysine (K), indicating a strong tryptic-like specificity at the N-terminus (S37 Fig).

Combination of binding affinity, expression levels and CAMAP scores increases MAP prediction accuracy
We next compared MAP prediction capacities of CAMAP scores to that of MAP predicted ligand score (ranks as predicted by NetMHCpan4.0) and mRNA transcript expression levels. We used ligand scores as predicted by NetMHCpan4.0, which was shown to possess the best predictive capacities for naturally processed peptides compared to other predictive algorithms [30]. Because MAP binding to the MHC molecule is essential for its presentation at the cell surface, we elected to only compare hits and decoys encoding potential binders, i.e. with a ligand score of �1% for at least one allele in the B-LCL dataset. Using a linear regression model, we compared the predictive capacity of each single parameter using Matthews correlation coefficient, which measures the quality of binary classifications [31]. Of note, only the predictions on the test set were used to evaluate the Matthew correlation coefficient in our different models, and the cutoff value for a prediction to be considered positive was set at 0.5.
Because only potential binders were analyzed here, the mRNA expression levels had the highest predictive capacity, then followed by ligand scores (second) and CAMAP scores (third, Fig 6A). As expected due to the multiplicative relationship between MAP ligand score and expression levels in predicting naturally processed MAPs [11], combining both variables greatly increased prediction performances (Fig 6B). Furthermore, we show that including CAMAP scores to the regression model further increased predictive performances (Fig 6B). We next computed how many predicted peptides would need to be tested to capture 1, 5, 10 or 50% of hits in the B-LCL dataset. Results presented in Table 1 show that using only NetMHC-pan4.0 ligand scores (ranks) leads to a very high false positive rate (FPR) at 76.0% when targeting the top 1%. Adding the expression levels greatly increased prediction accuracy and decreased the FPR to 39.1% for the top 1% hits. When adding CAMAP scores as a third variable, the number of peptides needed to capture 1% of hits greatly decreased, resulting in a very low FPR at 8.9%. Similar trends were observed when targeting 5 or 10% of hits, although with higher FPR (see Table 1). Only when targeting 50% of hits does the performance decrease when adding the CAMAP score. Similarly, adding CAMAP scores to expression levels and ligand scores also ameliorated prediction accuracies for the two other human datasets introduced above (B721.221 and PBMCs, see S2 Table). We hypothesized that the variation in performance observed for the 50% case was due to the limitations of logistic regression, rather than to the data per se. We thus additionally assessed the value of each prediction method using a multi-layer perceptron (MLP) classifier. Results show that for all of them, adding CAMAP score to the input improves prediction accuracy (Table 1). Similar results were obtained for the two other human datasets (S3 Table). Interestingly, using a non-linear classifier such as an MLP ameliorated predictions for both the model receiving ligand score and transcript expression, and the model receiving ligand score, transcript expression and CAMAP scores. This suggests that the relationship between binding score, transcript expression and CAMAP score is a complex one. Therefore, implying the most efficient way to combine the three is through the use of a non-linear classifier. These results show that combining CAMAP scores with the MAP's ligand score (ranks) and its corresponding transcript expression level significantly improves prediction of MAP and facilitates identification of relevant epitopes through more accurate predictions.  Table 1. Number of peptides needed to capture 1%, 5%, 10% or 50% of epitopes detected by mass spectrometry. The lower the number of peptides needed to capture the respective number of epitopes, the better the performance of the prediction model. This is also illustrated by the percentage of false identification (false positive rate, FPR) reported here. Peptides were rank ordered according to regression scores, for a total of 512,423 unique peptides and 8,807 hits. Of note, only the maximal transcript expression was used for peptides with multiple potential origins.

Codon usage can modulate MAP presentation
To evaluate whether changing the codon arrangement in a MAP-coding sequence might directly lead to modulation of MAP presentation, we generated three variants of the chicken ovalbumin (OVA) protein containing the model MAP SIINFEKL [32]. One construct encoded the wild type OVA (OVA-WT). For the other two constructs, we used CAMAP (trained on original human B-LCL sequences; Fig 2) to generate two OVA variants in silico, both encoding for the same OVA protein but using different synonymous codons: one predicted to enhance SIINFEKL presentation (OVA-EP), the other predicted to reduce it (OVA-RP). Accordingly, the respective CAMAP scores for OVA-RP, OVA-WT and OVA-EP were: 0.03, 0.65, and 0.96 ( Fig 7A). All variants encoded the same amino acid sequence but used different synonymous codons. Notably, the sole difference between the three constructs were the 162 nucleotides flanking each side of the SIINFEKL-coding codons (i.e. the RNA sequences coding for OVA 202-256 and OVA 265-319 , S1 Table and S38 Fig).
Because codon usage affects translation efficiency, theoretically leading to DRiP formation through premature translation arrest [20,21], we expected the variable regions of our construct to affect both translation rates and SIINFEKL presentation in our variants. Therefore, each construct also coded for two other proteins, eGFP and Ametrine, placed upstream and downstream of the OVA coding sequence, respectively (Fig 7A). While the Ametrine fluorescence intensity reflected the translation rate of the whole construct, the ratio of Ametrine/eGFP fluorescence intensity was informative regarding the translation efficiency of the whole construct. Indeed, efficient translation of the full-length construct should produce equivalent quantities of Ametrine and eGFP proteins, while inefficient/interrupted translation of the construct (i.e. leading to DRiP formation) should decrease the Ametrine/eGFP ratio (Fig 7B). The three protein coding sequences were separated with P2A self-cleaving peptides [33], therefore allowing the co-synthesis of three separate proteins, controlled by the doxycycline-inducible Tet-On promoter. Importantly, the three proteins were tightly co-expressed because of the presence of only one start codon at the 5' end of the GFP protein, as shown by the very high correlation between eGFP and Ametrine fluorescence for each construct (R>0.97, see S39 Fig). As we assumed that CAMAP scores reflected the probability of DRiP generation leading to increased MAP presentation, we expected the OVA-RP construct to show both reduced SIINFEKL presentation and enhanced translation efficiency compared to the OVA-EP and OVA-WT constructs. However, as both the OVA-EP and OVA-WT have CAMAP scores above the neutral threshold of 0.5 and closer to one another (0.98 and 0.65, respectively) compared to the OVA-RP construct (0.03), we expected OVA-EP and OVA-WT to behave more similarly.
We then used a SIINFEKL-H2-K b specific T-cell activation assay [34] to measure SIIN-FEKL presentation at the cell surface following doxycycline induction. Results for the T-cell activation assay were normalized by both the Ametrine mean fluorescence intensity and the percentage of transduced (eGFP+) cells in each specific sample, so that any difference in T-cell activation observed between our constructs could only be ascribed to synonymous codon variants in the SIINFEKL-flanking OVA codons. Two main findings emerged from our analyses. First, in accordance with CAMAP predictions, variation in codon usage led to a 2.3-fold difference in SIINFEKL presentation between the OVA-EP and OVA-RP variants, with OVA-WT in between (Fig 7C). Second, translation efficiency (Ametrine/eGFP ratio) was higher with OVA-RP than with OVA-EP or OVA-WT, while OVA-EP showed similar translation efficiency compared to OVA-WT ( Fig 7D). Hence, synonymous codon variations led to slightly divergent outcomes in OVA-EP and OVA-RP: they modulated the levels of SIINFEKL presentation in both constructs, but enhanced translation efficiency could only be detected for OVA-RP. These data show that codon arrangement can modulate MAP presentation strength without any changes in the amino acid sequence and support a role for translation efficiency and DRiP formation in the modulation of MAP presentation.

Discussion
Our analyses of large datasets using artificial neural networks and other bioinformatics approaches provide compelling evidence that codon usage regulates MAP biogenesis via both short-and long-range effects. While most MAP predictive approaches focus on MAP sequences per se, CAMAP's novelty is that it only receives the MAP-flanking mRNA sequences as input, and no information on the MAP itself, thereby providing completely independent information for MAP prediction.
The better prediction accuracy of CAMAPs trained with original codons rather than with shuffled synonyms supports the role of codon usage in modulating MAP biogenesis (Fig 2). A large body of evidence suggests that a significant proportion of MAPs derive from DRiPs produced during mRNA-to-protein translation [13][14][15]. Because codons influence both precision and efficiency of protein synthesis [20,21], it is likely that codon features in the mRNA sequences flanking MCCs might contribute to the regulation of MAP biogenesis through the generation of DRiPs. The functional link between codon arrangement and MAP biogenesis was illustrated by our in vitro analyses of SIINFEKL biogenesis, in which we were able to modulate SIINFEKL presentation solely by substituting synonymous codons in mRNA regions flanking SIINFEKL codons, without changing the protein sequence (Fig 7). While the experimental data derives from a single model thus limiting the interpretability of our results, this points nonetheless to an interesting mechanism that could be exploited to enhance antigenic presentation in peptide-based immunotherapy (i.e. dendritic cells modified to express a specific MAP). The full extent of the contribution of CAMAP's learned codon features to DRiPs generation will need to be assessed in subsequent work.
Previous works have also demonstrated the role of transcript expression levels in the regulation of MAP biogenesis, with a higher transcript expression being associated with a higher likelihood of MAP presentation [11,19]. While codon usage is different in highly expressed genes, our results showed that the codon-specific signal captured by CAMAP is independent of transcript expression levels, thereby providing complementary and non-overlapping information regarding MAP presentation. Interestingly, CAMAP preferences were more influential for codons located close to the MCCs (Fig 5), which is consistent with the influence of adjacent amino acids on proteasome cleavage. However, the better performance of CAMAP trained with longer context size also points toward a long-range impact of codon usage on MAP presentation.
Further analyses will be needed to assess the full extent of codon arrangement's impact on both classic MAPs (i.e. derived from canonical reading frames of coding sequences) and cryptic MAPs (i.e. derived from non-canonical reading frames and non-coding sequences) [35,36], as well as the potential contribution of codons in non-coding regions (e.g. 5'-or 3'-UTRs) on the regulation of MAP presentation. However, our results show that the integration of CAMAP scores to the two best predictive factors for naturally processed MAPs led to a significant increase in prediction accuracy. Indeed, our MLP classifier showed that combining CAMAP scores to transcript expression levels and MAP ligand scores (ranks as predicted by NetMHCpan4.0), led to a lower FPR compared to a model combining only transcript expression levels and MAP ligand scores. Although predictions were not as accurate for the two other human datasets, adding CAMAP scores always resulted in improved prediction accuracy. Our results therefore support the combined use of ligand scores, transcript expression levels and CAMAP scores in MAP predictive algorithms. These results have important practical implications for cancer immunotherapy and peptide-based vaccines, where discovery of suitable target antigens remains a formidable challenge to this day [37,38].

Dataset generation
We analyzed a previously published dataset consisting of MAPs presented on B lymphocytes by a total of 33 MHC-I alleles from 18 subjects [19,22]. Since this dataset was assembled using older versions of MHC-I binding prediction algorithms (i.e. using a combination of NetMHC3.4 for common alleles and NetMHCcons1.1 for rare alleles), we verified that the majority of MAPs in this dataset would also be predicted as binders using more recent algorithms (i.e. a rank � 2.0% using NetMHC4.0 or NetMHCpan4.0). We found an overlap of >92% between these methods (see S40 Fig), thereby validating this dataset for further analysis. In addition, we reasoned that a transcript should be considered as a genuine positive or negative regarding MAP biogenesis only if it was expressed in the cells. We therefore excluded from the dataset all transcripts with very low expression (<1 st percentile in terms of FPKM).
To facilitate data analysis and interpretation, we only included transcripts coding for MAPs with a length of 9 amino acids, for a total of 19,656 9-mer MAPs (which represents 78% of MAPs in this dataset). We then used pyGeno [23] to extract the mRNA sequences of transcripts coding for these 9-mer MAPs, which constituted our source-transcripts (Fig 1A). We next created a negative (non-source) dataset from transcripts that generated no MAPs. Importantly, transcripts that encoded for MAPs of any length (i.e. 8 to 11-mer) were excluded from the negative dataset. We then randomly selected 98,290 non-MAP 9-mers from this negative dataset, and extracted their coding sequences using pyGeno. Of note, both positive and negative datasets were derived from the canonical reading frame of non-redundant transcripts.
We analyzed only the MAP context and excluded the MCCs per se from our positive (hits) and negative (decoys) sequences (Fig 1A). We limited our analyses of flanking sequences to 162 nucleotides (54 codons) on each side of MCCs, because longer lengths would entail the exclusion of >25% of transcripts (S41 Fig).

Creation of the shuffled synonymous codon dataset
To create the shuffled synonymous codon dataset, each sequence was re-encoded by replacing each codon with itself or with a random synonym according to the human transcriptome usage frequencies. These frequencies were calculated using the annotations provided by Ensembl for the human reference genome GRCh37.75. Thus, all codon-specific features differing between the positive and negative datasets were removed from the shuffled datasets. Because codons were replaced by their synonymous codons, the shuffled sequences directly reflected amino acid usage in the positive and negative datasets.

CAMAP architecture, sequence encoding and training
The first (input) layer received either MCCs flanking regions from the hit dataset or sequences of the same length contained in the decoy dataset (Fig 1A). The second layer (S21A Fig) was a codon embedding layer similar to that introduced for a neural language model [39]. Embedding is a technique used in natural language processing to encode discrete words, and has been shown to greatly improve performances [40]. With this technique, the user defines a fixed number of dimensions in which words should be encoded. When the training starts, each word receives a random vector-valued position (its embedding coordinates) in that space. The network then iteratively adjusts the words' embedding vectors during the training phase and arranges them in a way that optimizes the classification task. Notably, embeddings have been shown to represent semantic spaces in which words of similar meanings are arranged close to each other [40]. In the present work, we treated codons as words: each codon received a set of random 2D coordinates that were subsequently optimized during training. The third (output) layer delivered the probability that the input sequence was a MCCs flanking region (rather than a sequence from the negative dataset). Therefore, CAMAP only has two levels of processing, the embedding layer and the output layer. Finally, the output layer has no bias parameter, a feature that enabled us to easily compute the preferences.
CAMAPs were trained on sequences resulting from the concatenation of pre-and post-MCCs regions. Before presenting sequences to our CAMAPs, we associated each codon to a unique number ranging from 1 to 64 (we reserved 0 to indicate a null value) and used this encoding to transform every sequence into a vector of integers representing codons. Neural networks were first developed using the Python package Mariana [41] [https://www.github. com/tariqdaouda/Mariana] and reimplemented identically for this work using pyTorch (results presented herein derive from the pyTorch version). An Embedding layer was used to associate each label superior to 0 to a set of 2D trainable parameters; the 0 label represents a null (masking) embedding fixed at coordinates (0,0). As an output layer, we used a Softmax layer with two outputs (positive / negative). Because negative sequences are more numerous than positive ones, we used an oversampling strategy during training. At each epoch, CAMAPs were randomly presented with the same number of positive and negative sequences. All CAMAPs in this work share the same architecture (S21A Fig), number of parameters and hyper-parameter values: learning rate: 0.001; mini-batch size: 64; embedding dimensions: 2; linear output without offset on the embedding layer; Softmax non-linearity without offset on the output layer.
For each condition (e.g. context size), the positive and negative datasets were randomly divided into three non-redundant subsets: (i) the training subsets containing 60% of the positive and negative examples, (ii) the validation and (iii) the test subsets each containing 20% of the positive and negative examples. Examples were assigned through a sequence redundancy removal algorithm, thereby ensuring that no example was assigned to multiple subsets. We used an early stopping strategy on validation sets to prevent over-fitting and reported average performances computed on test sets. We trained 12 CAMAPs for each combination of conditions, each one using a different random split of train/validation/test sets. To mask sequences either before or after the MCCs, we masked either half with null value.

Kullback-Leibler divergence
The Kullback-Leibler (KL) divergence computes how well a given distribution is approximated by another distribution. Its value can be either positive or 0, a null value indicating that the two distributions are identical. Accordingly, a higher KL divergence for codon distributions vs. amino acid distributions would indicate that codon variations are not entirely accounted for by amino acid variations. KL divergence is not a metric, as it is neither symmetric nor does it satisfy the triangle inequality. It is nevertheless an accurate and most common way of comparing two probability distributions.
We defined the probability of having codon c at position i as a function of the number of occurrences of c at position i, divided by the total number of occurrences of that same codon: Here Q is a probability, N is a number of occurrences, c is a codon, y is a class (positive or negative), s indicates if codons have been shuffled (true or false), i is a position in sequence. For the remainder of the text we will use the following abbreviations: We then used the KL divergence to compute how well P c distributions approximate D c distributions and PS c distributions approximate DS c distributions.
The KL divergence was defined as: QðiÞ � We performed this calculation for both the original and the shuffled dataset, which we then compared together. If codons and amino acid distributions were equivalent, KL divergence between hits and decoys would be the same for both original and shuffled sequences, and codons would cluster along the diagonal.

Extracting CAMAP's preferences
Preferences are a way of interpreting CAMAP networks that naturally derives from CAMAP's mathematical formalism. The validity of the preference method for CAMAP has been demonstrated when we used it to define the EP and RP constructs. By following this method, we were able to precisely engineer codon sequences that maximized or minimized the predictions of a CAMAP network. CAMAP only has two processing layers: an embedding layer and an output layer with no latent processing in between. Moreover, the output layer has no bias. In these very specific conditions, the manifold is the embedding layer. The word embedding layer encodes the importance of codons, while the weights of the output layer encode the importance of the position.
The output is thus computed as: Where E is the vector of embeddings for a sequence and W is the weight matrix of the output layer. When masking all codons but one, we obtain the importance (I) of that codon (c) at that position (x): Where Wx are the weights of the output layer for position x. The preference is just this quantity (I c,x ) normalized to a probability (using a Softmax function) for ease of interpretation.

Predicting MAP presentation with logistic regressions and MLP
The prediction capacity of CAMAP, NetMHCpan-4.0 ligand score and transcription expression (TPM) was tested in different combinations of those parameters (Ligand Score + Expression, Ligand score + Expression + CAMAP score) using the LogisticRegressionCV or MLPClassifier functions from the python package sklearn (sklearn.linear_model and sklearn. neural_network, v0.24.1). In each case, the dataset containing hits and decoy sequences was split into train and test datasets with a ratio of 0.7 to 0.3, respectively. Values for CAMAP score, Ligand Score and TPM were each scaled to a range of 0-1 in the train set using Min-MaxScaler from sklearn.preprocessing and the same scaling model was applied to the test set afterwards. Regression analysis was performed using LogisticRegressionCV with a 10x crossvalidation using the lbfgs solver with 1000 iterations. MLP with one hidden layer of 1000 neurons was trained using MLPClassifier for 4000 iterations using the adam solver with early stopping. Classes were balanced prior to training for both models. Matthew correlation coefficients were calculated using matthews_corrcoef from sklearn.metrics. When a peptide had multiple sources (multiple transcripts or genes), only the maximum transcript expression is used.

In vitro assay-inducible translation reporter (iTR)-OVA construct design
An inducible translation reporter was generated by flanking the truncated chicken ovalbumin (OVA) cDNA (amino acids 144-386) with EGFP-P2A (in 5') and P2A-Ametrine (in 3') cDNA sequences. MCCs flanking contexts for the EP and RP construct were synthesized as gBlocks (purchased from Integrated DNA Technologies). The fragments were amplified by PCR and joined by Gibson assembly under a doxycycline-inducible Tet-ON promoter in a pCW backbone. Synthetic variants of the OVA coding sequence were generated in silico by varying synonymous codon usage in the MAP context regions (i.e. 162 nucleotides pre-and post-MCCs). Importantly, the amino acid sequence was preserved between the different variants; only nucleotide sequences in the MAP context (162 nucleotides on either side) differed. The sequences with the highest (EP) and the lowest (RP) prediction scores were selected for further in vitro validation and swapped into the iTR-OVA plasmid by Gibson assembly [42]. OVA-EP and OVA-RP sequences can be found in S1 Table. Important features of our inducible translation reporter construct and T cell activation assay were: (i) No changes in amino acid sequence between the three variants: only co-translational events can differ between the three variants, post-translational events being equivalent for the three constructs; (ii) Only one start codon, at the beginning of the eGFP coding sequence: this is important for the translation reporter aspect of our construct (i.e. Ametrine/ eGFP ratio), to ensure that translation can only start at the 5'-end of the whole construct, and not at the beginning of the OVA or Ametrine coding sequences; (iii) Separation of the three proteins using P2A peptide: allows the inducible synthesis of three separate proteins in a highly correlated manner (as demonstrated in S39 Fig); also, the degradation of one protein will be independent from the others. As we hypothesized that codon usage might lead to DRiP formation, we did not want the degradation of OVA-derived polypeptide to induce degradation of attached eGFP or Ametrine, which would affect our translation reporter assay (Ametrine/ eGFP ratio); (iv) Because transcript expression level impacts MAP presentation, we normalized T-cell activation results by both the number of transduced cells present in the samples (% of eGFP+ cells) and the Ametrine mean fluorescence intensity of eGFP+ cells (representing whole construct expression level). Because of these four features, any difference between the three constructs could be ascribed solely to synonymous codon variants in the SIINFEKLflanking OVA codons.

T-cell activation assay
Raw-K b OVA-EP, OVA-RP and OVA-WT cells were plated at a density of 250,000 cells/well in 24 well-plates 24h prior to doxycycline treatment (1 mg/ml). After the corresponding treatment duration, cells were harvested and fixed using PFA 1% for 10 minutes at room temperature and washed using DMEM 10% FBS. Raw-K b were then co-cultured (37˚C, 5% CO 2 ) in triplicates with the CD8 T cell hybridoma cell line B3Z cells at a 3:2 ratio for 16h (7.5 x 10 5 B3Z and 5 x 10 5 Raw-K b ) in 96 well-plates. Cells were lysed for 20 minutes at room temperature using 50 μl/well of lysis solution (25mM Tris-Base, 0.2 mM CDTA, 10% glycerol, 0.5% Triton X-100, 0.3mM DTT; pH 7.8). 170 μl/well CPRG buffer was added (0.15mM chlorophenol redβ-d-galactopyranoside (Roche), 50mM Na 2 HPO 4 •7H 2 0, 35mM NaH 2 PO 4 •H 2 0, 9mM KCl, 0.9mM MgSO 4 •7H 2 O). β-galactosidase activity was measured at 575 nm using SpectraMax 190 Microplate Reader (Molecular Devices). In parallel, cells were analyzed by flow cytometry using a BD FACS CantoII for eGFP and Ametrine fluorescence.  Table. Number of peptides needed to capture 1%, 5, 10 and 50% of epitopes detected by mass spectrometry in B721.221 and PBMC cell lines using a logistic regression model. The lower the number of peptides needed to capture the respective number of epitopes, the better the performance of the prediction model. This is also illustrated by the percentage of false identification (false positive rate, FPR) reported here. Peptides were rank ordered according to regression scores. Of note, only the maximal transcript expression was used for peptides with multiple potential origins. (DOCX) S3 Table. Number of peptides needed to capture 1%, 5, 10 and 50% of epitopes detected by mass spectrometry in B721.221 and PBMC cell lines using a multi-layer perceptron (MLP). The lower the number of peptides needed to capture the respective number of epitopes, the better the performance of the prediction model. This is also illustrated by the percentage of false identification (false positive rate, FPR) reported here. The non-linear classifier led to better predictions when using ligand score and transcript expression, with or without CAMAP scores compared to logistic regression model. Peptides were rank ordered according to regression scores. Of note, only the maximal transcript expression was used for peptides with multiple potential origins. (DOCX)