Polymorphic sites preferentially avoid co-evolving residues in MHC class I proteins

Major histocompatibility complex class I (MHC-I) molecules are critical to adaptive immune defence mechanisms in vertebrate species and are encoded by highly polymorphic genes. Polymorphic sites are located close to the ligand-binding groove and entail MHC-I alleles with distinct binding specificities. Some efforts have been made to investigate the relationship between polymorphism and protein stability. However, less is known about the relationship between polymorphism and MHC-I co-evolutionary constraints. Using Direct Coupling Analysis (DCA) we found that co-evolution analysis accurately pinpoints structural contacts, although the protein family is restricted to vertebrates and comprises less than five hundred species, and that the co-evolutionary signal is mainly driven by inter-species changes, and not intra-species polymorphism. Moreover, we show that polymorphic sites in human preferentially avoid co-evolving residues, as well as residues involved in protein stability. These results suggest that sites displaying high polymorphism may have been selected during vertebrates’ evolution to avoid co-evolutionary constraints and thereby maximize their mutability.


Introduction
Major Histocompatibility Complex class I proteins (MHC-I), also referred to as Human Leukocyte Antigen class I (HLA-I) in human, are expressed on the surface of cells. MHC-I proteins form a complex with either 'self' ligands derived from the endogenous proteins or 'foreign' ligands (non-self) derived from invading pathogens or somatic alterations in cancer cells. Upon presentation of non-self ligands from inside the cytoplasm, the complex can be recognized by CD8 T-cells [1]. MHC-I proteins show a very high degree of polymorphism especially around the peptide-binding groove and tens of thousands of different alleles are reported in databases like PFAM [2] or IMGT/HLA [3]. Moreover, striking differences in binding specificity are observed between different alleles. Several evolutionary events contributed to MHC-I diversity in vertebrates. Duplication events occurred during the evolution of jawed vertebrate, which led to MHC-I polygenicity in many species [4,5]. Following the gene duplication events, the different gene copies diverged through separate evolutionary processes, which allowed some MHC-I genes to gain different functions, while others became dysfunctional or lost [6]. Consequently, the number of MHC-I loci differs between vertebrate species [7]. These duplication events produced 6 MHC-I genes in human all located on chromosome 6. Three of them (HLA-A, HLA-B and HLA-C) are broadly expressed in most cell types and are the main contributors to class I antigen presentation. The high level of allelic diversity of the MHC-I in vertebrate population is likely due to strong selection because of the exposure of vertebrate populations to various infections across the world [8] [9]. In particular, the polygenicity and polymorphism entails the immune system of each individual with the ability to present at the cell surface a wide range of peptides from foreign pathogens.
Despite their high polymorphism, MHC-I alleles share the same three-dimensional fold across vertebrates. In particular, the peptide-binding groove is composed of two almost parallel alpha helices and one beta sheet. This conserved structure across all MHC-I alleles suggests that they undergo molecular constraints. Molecular constraints can be predicted using stability models that investigate the impact of a mutation on the structure (e.g. alanine scanning) [10] or conservation [11]. Recent studies have also demonstrated that simultaneously evolving sites (also called co-evolving sites) can reveal structural contacts [12] folding intermediate [13], allosteric communication, core protein sites [14], or functionally important sites [15]. Several models are available in the literature to predict co-evolving sites. Most of the models evaluate a score to assess if a pair of sites simultaneously evolves regardless of the other residues. Some of these models use statistical formalisms such as Mutual Information [16], Statistical Coupling Analysis [17] or Coev [14,18] when others use combinatorial formalism [19,20]. The only model that investigates co-evolving residues in the light of global alignment is Direct Coupling Analysis (DCA) [12], also introduced in the EVfold suite [21]. This phylogeny-free method was shown to accurately identify sites in contact in protein structures, and because of this, DCA has been used to help predicting protein structures [21] [22][23] [24].
In this work, we study the co-evolving constraints on MHC-I across vertebrates' species using DCA. Despite the low number of species (<500), we observed that DCA could accurately predict structural contacts directly from MHC-I protein sequence alignment. We then investigated the relationship between polymorphism and co-evolution constraints. Our work reveals that polymorphism within human does not contribute much to the observed co-evolution signal. Moreover polymorphic sites show little overlap with both co-evolving sites across vertebrates and sites predicted to be most important in protein structural stability. We further extended the DCA algorithmic framework to incorporate multiple MHC-I ligands per allele and observed the same uncoupling between co-evolving and polymorphic residues. These results suggest that polymorphic residues in MHC-I molecules preferentially avoid sites displaying strong stability or co-evolutionary constraints.

Co-evolution among MHC-I residues
To investigate co-evolutionary constraints among MHC-I residues we retrieved all MHC-I protein sequences from the PFAM v30 database (PF00129) [2]. This domain family covers the MHC-I domains alpha1 and alpha2 (179 amino acid) and is present in 445 organisms [2]. We excluded from the dataset 117 sequences from 14 bacterial and viral species (see Materials and Methods). We ended up with 40'739 sequences, including 20'256 sequences from human MHC-I alleles where the MHC-I polymorphism has been most studied (Fig 1). We then applied DCA on the whole PFAM alignment. Considering pairs of residues that are distant along the protein sequence (more than 4 residues apart), we observed a very strong enrichment of structural contacts among pairs of residues with high DCA scores (Fig 2A). For instance, among the top 44 DCA predictions (25% of MHC-I PFAM domain length), 31 correspond to pairs of residues less than 8Å apart in crystal structures (see Fig 2A and

Co-evolutionary predictions and species predominance
To assess the contribution of the 20'256 human sequences to the co-evolution predictions, we led two additional experiments: one where the co-evolving scores based on DCA are evaluated using solely the 20'256 human sequences ( Fig 2C) and another where the co-evolving scores are evaluated by excluding the human sequences from the analysis (Fig 2D). These experiments revealed that the top predictions of DCA applied to human sequences did not highlight pairs of residues close in protein structures ( Fig 2C). Reversely, when excluding all human sequences DCA predictions of co-evolving sites remained almost unchanged and still pinpointed mainly pairs of sites in the structural proximity (Fig 2D). Similar results are obtained using a threshold of 5Å to define the contact map (S2 Fig). Moreover when removing the sequences from species with more than 500 MHC-I sequences (Homo sapiens (Human); Macaca mulatta (Rhesus macaque); Macaca fascicularis (Crabeating macaque) (Cynomolgus monkey); Acrocephalus schoenobaenus (sedge warbler); Parus major (Great tit); Macaca nemestrina (Pig-tailed macaque); Bos taurus (Bovine); Sus scrofa (Pig), we still observed that many of the top co-evolving sites are in structural proximity (S3 Fig). Altogether these experiments suggest that the co-evolution signal captured by DCA reflects molecular constraints in the course of vertebrate evolution, and not constraints on polymorphic sites within one species. This is in line with the low weight on human sequences due to their high homology in DCA within the full alignment (see Fig 1). Nevertheless, the lack of structurally meaningful correlations when considering only human sequences suggest that little co-evolution is observed among them, although polymorphic sites are contacting each other in the MHC-I binding site, and therefore could potentially display some level of correlation reflecting structural constraints.

Polymorphism and co-evolving sites
To further investigate the relationship between polymorphism and co-evolving sites, we measured conservation in human using information content (see Materials and Methods) to derive a polymorphism score for each site. A position with a minimal score is rarely mutated in human MHC-I alleles whereas a position with a high score is highly mutated. We then used Enrichment Analysis (see Materials and Methods) to determine the overlap (or absence thereof) between sites displaying strong co-evolutionary constraints across vertebrates as measured by DCA and polymorphic sites in human population. DCA scores were established for  Polymorphic sites preferentially avoid co-evolving residues in MHC class I proteins each site based on the highest DCA values with any other site more than 4 amino apart in the sequence, and sites where ranked based on these scores (x-axis in Fig 3A, lower panel) to compute the enrichment (or absence thereof) in polymorphic sites among sites with highest DCA scores. Using a threshold of 0.01 on the information content to define polymorphic sites, our analysis showed that pairs of sites with the highest DCA score mainly comprise sites that are non-polymorphic in human ( The advantage of the enrichment approach is that is does not require fixing a threshold on the DCA scores. We further note that the cloud of points for DCA values lower than 0.08 in Fig 3A was expected since the majority of DCA values obtained from any alignment are significantly bigger than zero. However, as observed in previous studies, only the top ranking pairs give meaningful information about structural contacts. This is the reason why we used enrichment analysis in this work, as opposed to correlation coefficient whose value would be dominated by the low DCA scores, which cannot be interpreted in terms of biologically meaningful co-evolutionary constraints.

Polymorphism and stability
We then investigated the relationship between polymorphism and predicted importance for structural stability. Stability score of each site was evaluated using FOLD-X AlaScan software [10,27] using the X-ray structure of HLA-A02:01 in complex with a 9-mer ligand (PDB: Polymorphic sites preferentially avoid co-evolving residues in MHC class I proteins 2BNR). Sites with different stability values were then used in the same enrichment analysis as before to compare with polymorphic sites. Here as well, we observed that polymorphic sites tend to be distinct from sites predicted to play a role in protein stability (Fig 3B, P = 0.04). This observation holds when considering other alleles and their corresponding pdb structures to evaluate stability score of each residue (Table 1). We further investigated the relationship between polymorphism and the number of structural contacts made by each residue (Materials and Methods). As expected from the stability analysis (Fig 3B), residues making many contacts tend on average to be enriched in non-polymorphic sites (Fig 3C), although the enrichment did not pass the 0.05 threshold for significance. In general, the fact that polymorphic sites that do not lead to dysfunctional proteins, such as those in MHC proteins, are less implicated in protein stability has been documented in many previous studies [28][29][30][31][32]. However, to our knowledge, our work is the first to perform such analysis specifically on MHC proteins.
To assess whether co-evolving pairs of residues may simply reflect sites involved in protein stability, we investigated the relationship between DCA scores and either stability or number of contacts. We observed a very poor correlation between DCA scores and stability scores These results show that amino acid correlation patterns are not simply recapitulating the importance of residues for protein stability and could highlight distinct constraints that cannot be captured by stability predictions or number of structural contacts.

Co-evolving constraints in the presence of peptide ligands
MHC-I molecules are known to interact with many peptides and the presence of a peptide is required for MHC-I folding. To explore the effect of the presence of peptide ligands on DCA Table 1. Enrichment of non-polymorphic sites with respect to stability evaluated in different structures. The pvalues of enrichment analysis (column 3, also see Fig 3B) of non-polymorphic residues among sites contributing most to protein stability using different pdb structures (column 2) of MHC-I alleles (column 1) is shown below. For each pdb structure, we merged the peptide and the MHC-I allele on the same chain and ran AlaScan to measure the stability scores. predictions, we built an expanded version of DCA, called DCApeptides, that can take as input several peptide ligands for each protein sequence. The set of peptides interacting with a given protein are used to compute the single and paired frequencies used in DCA, as described in Materials and Methods. Although major efforts have been invested in the field to experimentally characterize the MHC-I binding specificity repertoire in human and mice [33][34][35][36], the vast majority of MHC-I molecules do not have experimental ligands. To fill this gap, we selected 100'000 random 9-mer peptides from several organisms and evaluated the predicted binding affinity of MHC-I sequences to each of these peptides using NetMHCpan3.0 [37] (see Materials and Methods). For each MHC-I sequence we then selected the top 2% of the peptides, following the cut-off currently suggested by the authors of NetMHCpan [37]. These predicted ligands were included in the co-evolution calculations using the DCApeptides algorithm. Overall, results did not change much and we still observed the decoupling between coevolving and polymorphic sites (Fig 4). However, it should be noted that these are predicted ligands and the signal captured by DCApeptides reflects at best what is implicitly modelled in the predictor and not necessarily the real inter-molecular constraints.

DCAPeptides for inter-molecular contact predictions
To further explore the DCApeptides algorithm in the case of experimental ligands, we restricted the study to human MHC-I alleles having experimental ligands in IEDB [36] (see Materials and Methods). The number of such alleles is much smaller (156) and, as expected, we did not observe good structural contact predictions (Fig 5A). However, when restricting the analysis to inter-molecular pairs, we observed that the top 4 inter-molecular DCA pairs mapped accurately to existing structural contacts ( Fig 5B). Moreover, these 4 pairs of sites involved residues P2 and P9 in the MHC-I ligands, which are known to be the main specificity determining residues (socalled anchor residues). Overall, our results indicate that DCApeptides predictions are stronger among MHC-I residues then between MHC-I residues and their ligands. However, DCA predictions among MHC-I residues do not pinpoint structural contacts (as in Fig 2C), while DCA predictions between MHC-I residues and their ligands revealed known interactions. We further extended our benchmarking of the DCApeptides algorithm to the human PDZ protein domains, which are also known to interact with several ligands (in our dataset, these ligands came from a phage display experiment [38], see Materials and Methods). Here as well, we observed stronger correlation among the PDZ domain residues (S7A Fig). Some of the DCA predictions mapped to known structural contacts (15/27). More interestingly, when focusing only on correlations between PDZ residues and their ligands, we saw that DCApeptides could accurately predict some of the contacting pairs of residues. In particular, the top 2 predictions involved both position -2 in the PDZ ligands (S7B Fig), which is known to be the main specificity determining position for PDZ ligands [39]. Altogether, our results suggest that, when focusing on domains with available ligands from one species, intra-molecular DCApeptides predictions are not able to identify residues in structural proximity (likely because of the much lower number of sequences imposed by the constraint of having experimental ligands available), but inter-molecular DCApeptides predictions can accurately pinpoint structural contacts.

Discussion
Co-evolution analyses have been widely used in biological studies, focusing mainly on co-evolution across species [14,40]. To our knowledge, our work is the first co-evolution analysis of a protein family that displays at the same time high variability between species and high polymorphism within species. As MHC-I polymorphism is known to be functionally important to entail different alleles with a wide range of binding specificities, our observation that polymorphic sites tend on average to show less co-evolutionary constraints may reflect the importance of preserving high mutability of these sites. It is also interesting to note that the decoupling between polymorphic sites and co-evolving sites was even stronger than between polymorphic sites and sites involved in protein stability (Fig 3), suggesting that co-evolution constraints captured by DCA may be especially detrimental for polymorphic sites.
To predict co-evolving sites within MHC-I molecules, we used the DCA model introduced in [12,23], [22]. DCA demonstrated its statistical power on protein domains for which many homolog sequences are available (typically >10'000 sequences, ideally spanning both eukaryotes and prokaryotes) [22]. This study demonstrates that DCA predictions are highly enriched in structural contacts in MHC-I protein family, although the number of species is restricted to 445 (Fig 1). As in all DCA analyses, we focused here on sites that are distant in the sequence (i.e., more than 4 amino acids apart), which ensures that predictions of structural contacts are not simply resulting from sequence proximity. As such our work suggests that polymorphic sites tend to show less co-evolutionary constraints with sites distant in the primary sequence. Importantly, polymorphic sites have similar numbers of structural contacts with residues distant in the sequence (S8 Fig) as other residues, and therefore the observations made in this study could not simply be explained by the absence of such contacts.
The co-evolution signal detected in our analysis likely comes from the presence of divergent vertebrate species in the dataset, since very similar predictions were obtained by excluding the 20'256 human sequences in the datasets (Fig 2C and 2D), or by excluding species with more than 500 sequences in the dataset (S3 Fig). We anticipate that the fast evolutionary dynamic of MHC-I proteins may contribute to generating a stronger co-evolutionary pattern compared to other protein families, which could explain why we were able to detect it, although the MHC-I family is restricted to vertebrates.
DCA does not consider the actual phylogeny and takes only the alignment of sequences as input [14,18]. However, MHC-I evolution is difficult to characterize especially because it was subject to several duplication events along vertebrate evolution. Moreover the rate of evolution and the role of MHC-I in the immune system differ from one vertebrate species to another [41][42][43]) making it even more challenging to use available phylogenetic-dependent methods to predict co-evolving constrained sites since these models assume a homogeneous rate of substitutions across species evolution.
Ligands binding to MHC-I molecules play a role in MHC-I binding stability, which is why we included the ligands in stability predictions based on HLA-A02:01 structure. In vivo, MHC-I molecules are known to interact with tens of thousands of different peptides [33,44] and their specificity cannot be summarized with one single peptide. This is the reason why we extended the DCA framework to consider multiple ligands per protein in the alignment (Fig  4). Unfortunately, due to the scarcity of experimentally determined MHC-I ligands in most species except for human and mouse, the co-evolution analysis could not be carried out only with experimental ligands for all alleles included in our dataset. We therefore used for each allele 2'000 predicted ligands corresponding to the top 2% of a set of 100'000 peptides randomly selected from different proteomes [37]. As such, it is likely that the inter-molecular coevolutionary signal observed in Fig 4 only captures the signal that is present in the NetMHCpan predictor, and may therefore not capture signals coming from more distant species that are not included in the training set of this algorithm. Nevertheless, he fact that the decoupling between polymorphic and co-evolving sites was observed both without and with ligands suggests that our results do not depend significantly on the presence of ligands in our analyses.
Our extension of the DCA algorithm to consider multiple ligands of the same protein further enabled us to analyse inter-molecular co-evolution for both MHC-I and PDZ proteins with experimentally determined ligands. Remarkably, in both cases, the inter-molecular predictions pinpointed structural contacts, whereas the intra-molecular predictions did not (for the majority of them, at least). Similar results were recently reported in a study of Antibody-antigen interactions [45], where maximum-entropy models such as DCA could help predicting affinity between antigens and antibodies, but not structural contacts within antibodies. We anticipate that our extension of DCA (available at: https://github.com/GfellerLab/DCApeptides) will contribute to future analyses of the differences between inter-and intra-molecular amino acid coevolution patterns.

Conclusion
MHC-I molecules have emerged recently in life history and are mainly restricted to vertebrate species. Despite the limited number of species that contain MHC-I genes, we observed that co-evolution constraints identified by statistical methods such as DCA accurately predicted several structural contacts. Moreover, we found that the co-evolution signal was dominated by inter-species amino acid changes and was not due to the variations between alleles within the same species (e.g., human). To our knowledge, this work is the first co-evolution analysis of a protein family that displays at the same time high variability between species and high polymorphism within species. Finally, our results suggest that MHC-I polymorphic sites, in addition to providing distinct binding specificities, preferentially avoid residues that show either high amino acid co-evolution patterns or play an important role in protein stability.

MHC-I domain alignment
In this study, we analysed the PFAM domain family named Histocompatibility antigen, domains alpha 1 and 2 of class I with the identifier PF00129. In PFAM v30 the domain family was composed of a total of 40'856 protein sequences [2]. We removed 117 bacterial and viral sequences from the dataset and kept only vertebrate MHC-I for a total of 40'739 sequences. The human sequences constitute 49.7% of the family followed by the Rhesus macaque sequences that represent 4.9% of the family (Fig 1). We filtered highly gapped columns (>70%), and the final alignment corresponds to positions 2 to 179 in HLA-A02:01 allele (residue following the numbering in the crystal structures such as PDB:2BNR chain A).
We further collected the most frequent human alleles in the allele frequency database [46] for USA NMDP European Caucasian population (comprising a total of 1,242,890 individuals). 331 alleles had a frequency exceeding 0.00001 (97 HLA-A, 181 HLA-B and 55 HLA-C alleles).

Direct coupling analysis
We used Direct Coupling Analysis (DCA) model [12] for the intra-molecular analysis of coevolving sites within MHC-I domain family alignment. DCA uses as input the frequency f i (A) of amino acid A in column i, the frequency f j (B) of amino acid B in column j, and the joint frequency count f ij (A,B) for pairs of amino acid A and B in columns i and j within a protein alignment, for all pairs of position i and j. These frequencies are computed including reweighting of sequences with >80% sequence identity and pseudo counts equal to the effective number of sequences after reweighting, as described in [12]. The sum of weights displayed in Fig 1 for each clade corresponds to the sum of 'm a ' values, where m a represents to the weight of sequence a (see Morcos et al. [12]), and can be interpreted as the effective number of sequences in this clade. Julia's version of PlmDCA [26] [25] was run on the same alignment with default parameters. The algorithm starts by removing the duplicate sequences. Once these sequences were removed PlmDCA analysed 22954 sequences, with an effective number of sequences Meff equal to 173.44.

Mapping DCA prediction on contact maps
As a reference structure for MHC-I domain, we used the structure of HLA-A02:01 in complex with a canonical 9-mer peptide (PDB: 2BNR; [47]). We consider that two sites are close in the structure if the distance between any of the heavy atoms is smaller or equal to 8Å, as suggested by the authors of the original DCA study [12], and built the contact map (grey dots in Fig 2). Similar contact maps were built using cut-off of 5Å in S2 2. Compute the precision (i.e., true positives divided by the total number of DCA predictions) for numbers of predictions ranging from 1 to 900.

DCA scores: From pairs to sites
DCA provides a score for every pair of sites. To reflect whether a site is under a co-evolutionary constraint we first ranked the scores in a decreasing order. We iteratively attributed individual score for each site as follow: 1. At the beginning none of the sites has an individual score (I). Given a site s, I s = 0.
2. Remove the first pair p composed of sites s 1 and s 2 on the top of the sorted list where p s1s2 is the pair score.
3. Check if s 1 has an individual score. If it has an individual score then go to step 4. If not, attribute an individual score to s 1 such that I s1 = p s1s2 .
4. Check if s 2 has an individual score. If it has an individual score then go to step 5. If not, attribute an individual score to s 2 such that I s2 = p s1s2 .
5. Re-iterate from 2 to 4 until all pairs of site from the list are considered.

Entropy and polymorphism
For human sequences in the PFAM alignment, we used one minus the Shannon entropy (i.e., 1 þ P 20 stands for the frequency of amino acid A at position i) to measure the polymorphism score at each position [48]. This score has a minimal value of zero when all amino acid frequencies in a site are equal and a maximal score of one when only one perfectly conserved amino acid is found at a given position. We omitted the gaps from the entropy measure. The polymorphism analysis was also performed using only the most frequent human MHC-I sequences (331 alleles, see before). To this end the human alleles were aligned with MUSCLE [49] and amino-acid to compute the Shannon entropy were weighted by the allele frequency in the USA NMDP European Caucasian population.

Stability score
To evaluate the structural stability impact of each residue, the AlaScan function of the FOLD-X software [10,27] was used to calculate the energy contribution of each residue. The structures were first repaired using RepairPDB function. The stability score of each site was measured using a reformatted pdb structure of 2BNR [47] where MHC-I residues from position 1 to 179 and the ligand were merged on chain A.

Number of contacts
The number of contacts of each site was measured using the pdb structure 2BNR (HLA-A02:01 allele in chain A and the ligand). For a given site, the number of contacts is the number of residues that are maximum 5Å distant from this site in the crystallized structure.

Enrichment analysis
Enrichment Analysis was used to investigate the relationship between polymorphic sites and sites displaying strong co-evolution constraints as estimated by DCA. A site was considered to be non-polymorphic in human alleles when its polymorphism score was lower than a threshold of 0.01 (see S4 Fig for results with other thresholds). To compute enrichment curves, sites were ranked based on their DCA score (x-axis in lower panels of Fig 3). Whenever a non-polymorphic site is encountered along the ranking (yellow bars), the enrichment curve goes up. Whenever a polymorphic sites is found the enrichment curve goes down. The same enrichment analysis was also applied to investigate the relationship between polymorphic sites involved in structural stability or sites displaying many contacts in the crystal structure of HLA-A02:01. For the enrichment analysis and p-value calculations, we use a weighted version of the Kolmogorov-Smirnov statistic with exponent measure equal to 1, as in all standard enrichment analyses [50].

Extension of DCA to consider multiple ligands
To model the existence of multiple (predicted) ligands for each MHC-I protein, the amino acid frequencies f i and f j for all sites and joint frequencies f ij for all pairs of sites (i.e. including both sites in the MHC and sites in the ligands) were computed. Following the nomenclature used in [12] the point frequency for position i in ligand is computed as: where L a i;n stands for the i th amino acid in the n th ligand of protein a, and N a stands for the number of ligands of a and M stands for the number of MHC-I sequences. The joint frequency between position i in the protein and position j in the ligand is computed as: Where A a i stands for the i th amino acid in protein a. Finally, the joint frequency between two ligand positions (i and j) is computed as: The sequence reweighting (m a ) corresponds to the number of sequences with more than 80% sequence identity to protein a, and was computed considering only the MHC-I sequence identity. This implies that each ligand has a weight equal to the weight of its protein (1=m a ) divided by the number of ligands of this protein (N a ), in order to ensure proper normalization. The same pseudo-count l ¼ M eff ¼ P M a¼1 1=m a was applied as in the standard DCA. In the case of 9-mer MHC-I ligands, this resulted in a total alignment of 178+9 = 187 positions, where the first 178 positions are characterized by a single amino acid at each position, while the last 9 positions are characterized by a distribution of amino acids for each MHC-I and each position in the ligands. All the rest of the DCA algorithm remains the same (inversion of the (187 Ã 20) x (187 Ã 20) covariance matrix and estimation of the Direct Information scores). The script to run these calculations can be accessed at: https://github.com/GfellerLab/DCApeptides.

Prediction of MHC-I ligands
To explore the impact of MHC-I ligands on the enrichment analysis of Fig 3A, we attempted to run DCApeptides on the full alignment, including multiple peptide ligands for each MHC-I protein.
Since the MHC-I ligand repertoire for the vast majority of MHC-I alleles in different species is still not experimentally available, we generated 100'000 random 9-mer peptides from 7 proteomes (Anguilla anguilla, Bos taurus (Bovine); Gallus gallus; Homo sapiens (Human); Larimichthys crocea; Mus musculus (mouse); Tinamus Guttatus) and predicted the binding affinity of MHC-I alleles to each of these peptides using NetMHCpan3.0 [37]. We then selected the top 2% predictions for each MHC-I allele in our alignment and computed the co-evolution patterns including these ligands based on DCApeptides (see above). Only MHC-I sequences without gaps at binding site positions used in NetMHCpan3.0 were considered (27,373 MHC-I sequences in total).

Experimental MHC-I and PDZ ligands
Experimental MHC-I ligands were retrieved from IEDB [36]. In total 156 human MHC-I alleles had experimental ligands (annotated as "Positive-High", "Positive-Intermediate", "Positive-Low" or "Positive"). Only 9-mers were considered and these ligands were used with DCApeptides. X-ray structure of HLA-A02:01 (PDB:2BNR) in complex with a 9-mer peptide was used to compute the contact maps of Fig 5. Experimental PDZ ligands were retrieved from a large phage display screen performed for 54 human PDZ domains [38]. All ligands were aligned at their C-terminus. The contact map in S7