Abstract
Identifying protective antigens (PAs), i.e., targets for bacterial vaccines, is challenging as conducting in-vivo tests at the proteome scale is impractical. Reverse Vaccinology (RV) aids in narrowing down the pool of candidates through computational screening of proteomes. Within RV, one prominent approach is to train Machine Learning (ML) models to classify PAs. These models can be used to predict unseen protein sequences and assist researchers in selecting promising candidates. Traditionally, proteins are fed into these models as vectors of biological and physico-chemical descriptors derived from their residue sequences. However, this method relies on multiple third-party software packages, which may be unreliable, difficult to use, or no longer maintained. Furthermore, selecting descriptors is susceptible to biases. Hence, Protein Sequence Embeddings (PSEs)—high-dimensional vectorial representations of protein sequences obtained from pretrained deep neural networks—have emerged as an alternative to descriptors, offering data-driven feature extraction and a streamlined computational pipeline. We introduce PSEs as a descriptor-free representation of protein sequences for ML in RV. We conducted a thorough comparison of PSE-based and descriptor-based pipelines for PA classification across 10 bacterial species evaluated independently. Our results show that the PSE-based pipeline, which leverages the FAIR ESM-2 protein language model, outperformed the descriptor-based pipeline in 9 out of 10 species, with a mean Area Under the Receiver Operating Characteristics curve (AUROC) of 0.875 versus 0.855. Additionally, it achieved superior performance on the iBPA benchmark (0.86 AUROC vs. 0.82) compared to other methods in the literature. Lastly, we applied the pipeline to rank unseen proteomes based on protective potential to guide candidate selection for pre-clinical testing. 
Compared to the standard RV practice of ranking candidates according to their biological descriptors, our approach reduces the number of pre-clinical tests needed to identify PAs by up to 83% on average.
Citation: Podda M, Savojardo C, Luigi Martelli P, Casadio R, Sîrbu A, Priami C, et al. (2025) A descriptor-free machine learning framework to improve antigen discovery for bacterial pathogens. PLoS One 20(6): e0323895. https://doi.org/10.1371/journal.pone.0323895
Editor: Rajesh Kumar Pathak, Chung-Ang University, KOREA, REPUBLIC OF
Received: September 23, 2024; Accepted: April 15, 2025; Published: June 5, 2025
Copyright: © 2025 Podda et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data used in this study are available from the Zenodo repository (DOI: 10.5281/zenodo.15065206).
Funding: This study was funded by GSK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: This research was commissioned by GSK. AB is employed by GSK. This does not alter our adherence to PLOS ONE policies on sharing data and materials. The authors declare no other financial and non-financial relationships and activities and no other conflicts of interest.
Introduction
The development of sub-unit protein vaccines against bacterial pathogens begins with identifying potential candidates for subsequent pre-clinical trials, where their ability to elicit a protective immune response in animal models is evaluated [1]. Proteins that successfully pass pre-clinical trials are termed protective antigens (PAs). This early phase is crucial for the entire vaccine development process, since a strong protective potential significantly increases the likelihood of successful optimization and downstream development, even if challenges in stability or formulation arise – challenges that later stages are specifically designed to address.
A significant challenge in pre-clinical testing is the limited capacity of laboratories to conduct in-vivo experiments. Due to the time- and cost-intensive nature of pre-clinical tests, researchers are compelled to refine lists of promising candidates to avoid exceeding a predefined resource budget.
The capacity for pre-clinical testing varies based on several site-specific factors. These factors are difficult to summarize, but capacity can be estimated at a few hundred proteins per laboratory at most. A retrospective analysis of large bacterial PA discovery projects (Table 1) reveals that the median number of proteins tested by pre-clinical laboratories is 230. This number covers only a small fraction of the typical bacterial proteome, which usually comprises thousands of proteins. For instance, the proteome of Neisseria meningitidis (strain MC58) includes approximately 2000 proteins, yet even the most extensive PA discovery project to date [2] managed to test only 350 of them (18%). Moreover, the current trend leans towards testing fewer candidates against a broader range of bacterial strains to ensure comprehensive coverage of circulating variants. As a result, the number of unique candidates progressing to the pre-clinical phase has decreased compared to the past.
In addition to the limitations imposed by pre-clinical capacities, identifying PAs is inherently challenging due to their small proportion within a bacterial proteome. This stark disparity is evident in PA discovery projects that have ambitiously attempted to test a large number of proteins (e.g., Neisseria meningitidis and Streptococcus agalactiae in Table 1). Consequently, discovering novel PAs can be considered as a “needle in a haystack” problem, further complicated by the stringent resource constraints dictated by pre-clinical capacities.
In response to these major challenges, Reverse Vaccinology (RV) [9] has emerged as a leading bioinformatics pipeline for identifying candidates from bacterial proteomes, with the goal of advancing to pre-clinical testing only proteins with strong vaccine potential. Compared to traditional vaccine development, RV offers significant ethical advantages by reducing reliance on animal testing and expediting the identification of viable candidates, in adherence to the 3R principles of ethical research – Replacement, Reduction, and Refinement [64]. Furthermore, RV enables faster and more targeted vaccine design, ensuring rapid responses to emerging infectious diseases. This not only aligns with the principles of ethical research but also enhances global health preparedness by streamlining the early stages of vaccine development.
Central to RV is the understanding that not all proteins are equally suitable as vaccine candidates, and that it is possible to identify promising ones by analyzing their residue sequences. To carry out the analysis, the wealth of information contained in the protein sequences is routinely distilled into a collection of biological or physico-chemical descriptors extracted with various bioinformatics tools.
Biological descriptors include, but are not limited to, the probability of being surface exposed, the likelihood of being an adhesin, and the predicted number of transmembrane domains. Physico-chemical descriptors, on the other hand, measure local residue properties such as counts, weight, and charge, as well as properties at the sequence level. Once these descriptors are extracted, they are used to identify promising candidates, either directly by filtering out proteins that fail to meet certain criteria (for instance, proteins with a low probability of being surface exposed), or indirectly by employing Machine Learning (ML) models trained to predict the likelihood of being PAs from the protein sequences represented as vectors of descriptors. Over the past two decades, experimental and data scientists from around the world have meticulously refined their unique “recipe” of descriptors for representing protein sequences (refer to [10] for an extensive review). Accordingly, a remarkable stream of bioinformatics tools for descriptor extraction have been developed [11–15].
Recent advancements in ML, particularly in the sub-field of Deep Learning [16], have introduced an innovative method for representing biological sequences, known as Protein Sequence Embeddings (PSEs). A PSE is a high-dimensional representation of a protein sequence, learned by training deep neural networks with self-supervision on large protein databases [17]. Typically, a PSE consists of one distinct embedding vector for each amino acid in the sequence. These vectors are contextualized, meaning that their values are shaped by the local and global context of the sequence in which the corresponding amino acid is situated. This allows for capturing the specific roles and interactions of each residue within its unique sequence context. The patterns encoded in PSEs are highly general, enabling their adaptation to learn a variety of predictive tasks, ultimately bypassing the necessity of a comprehensive understanding of the underlying physical or biological mechanisms [18–20].
Compared to conventional descriptors, PSEs are appealing because they facilitate the automatic extraction of features guided solely by the downstream predictive task, without requiring substantial domain expertise. Furthermore, they streamline the computational pipeline for converting protein sequences to vectors, merely requiring the weights of a trained neural network to calculate the embeddings. In contrast, extracting descriptors requires a multitude of bioinformatics software which can occasionally be defective (due to coding errors), challenging to acquire (for instance, due to non-permissive licensing), or become obsolete (i.e., their development or debugging is discontinued). However, despite their potential, PSEs have not been adopted by the RV community to date.
With these motivations, we developed a descriptor-free ML pipeline that takes raw FASTA sequences as input, and returns their probability of being PAs as output. The main component of the pipeline is a ML classifier that learns to assign high probability to known PAs and low probability to non-PAs. The crucial difference between the proposed pipeline and previous approaches is representing protein sequences as PSEs rather than as descriptor vectors. We compared the performance of the proposed pipeline with a similar descriptor-based pipeline using a dataset of protein sequences labeled as PAs or non-PAs. We independently tested the generalization capabilities of both methods using a Leave-One-Bacteria-Out (LOBO) validation strategy, which consists of holding out the proteins of one bacterial species for testing, mimicking a scenario where a novel vaccine is sought for a previously unstudied bacterium. Our experimental results reveal that the proposed PSE-based pipeline outperformed the descriptor-based pipeline for 9 out of 10 species, with an average Area Under the Receiver Operating Characteristics curve (AUROC) of 0.876 against 0.856. For a comprehensive comparison with other approaches in the literature, we also evaluated our pipeline on the iBPA benchmark (refer to [21] for further information), where it achieved the highest performance on all four metrics, with an AUROC of 0.86 (previously 0.82). Lastly, we show an application of the pipeline to improve the selection of candidates for pre-clinical testing. Specifically, we ranked the candidates in descending order according to the probability of being PAs assigned by our trained PSE-based pipeline. We compared this ranking with an RV-based ranking based on the values of biological descriptors.
In repeated simulations using different bacterial proteomes, we show that testing candidates in the order proposed by our method leads to the re-discovery of known PAs with 83% fewer pre-clinical tests on average. A visual high-level summary of this work is provided in Fig 1.
This work is structured in three stages. In the first stage (green box) we created a suitable dataset of PA and non-PA sequences. In the second stage (orange box) we selected a predictive pipeline to classify the sequences. In the third stage (purple box) we assessed the performances of the pipeline with three different experiments.
The paper is structured as follows. The Background section introduces background concepts and discusses related work, while the Materials and methods section presents the computational methods used in the study. The Experiments section describes the experiments performed, while the Results section presents the results. Finally, the Discussion section discusses the significance of the results, the current limitations of the proposed approach, and the future research it might foster.
Background
Descriptors
As hinted in the introduction, descriptors are numerical values that quantify several characteristics (either biological or physico-chemical) of a protein sequence. Below, we review the two major categories of descriptors.
Biological descriptors.
Biological descriptors of amino acid sequences refer to attributes that relate to the biological function of the protein, such as motifs and patterns, post-translational modifications, and subcellular localization. Among biological descriptors, as claimed in the pioneering publication of RV [9], surface exposure—i.e., accessibility to antibodies—is the cornerstone criterion to be met by any vaccine candidate, either viral or bacterial. The tenet that vaccine antigens are outer proteins has been substantiated experimentally [8,22], and in a recent review, a statistical over-representation of antigens among extracellular proteins has been documented [21]. Additional evidence of the strong association between antigenicity and surface exposure can be found in several other works [2,23,24]. Accordingly, all the different flavors of RV available so far have subcellular localization as the primary filter [10]. Finally, in [25], surface exposure was identified (through a feature selection process among hundreds) as the driving feature to discriminate between antigens and non-antigens. Other commonly considered biological descriptors in RV include the presence of epitopes, the probability of being an adhesin, the number of transmembrane helices, and different immunogenicity scores. Nevertheless, besides surface exposure, there is no strong consensus about which precise set of biological descriptors makes a protein a “good” antigen. This might be due to the fact that the biological processes underlying our immunological response are still not completely understood as of today.
Physico-chemical descriptors.
Physico-chemical descriptors are a class of descriptors that encapsulate both global and local measurements, capturing a wide array of physical and chemical properties inherent in the sequences. One subset of these descriptors, known as compositional descriptors, quantifies specific attributes such as the hydrophobicity, charge, and molecular weight associated with individual amino acids within the sequence [26]. In addition to these, there are other commonly used descriptors that provide a more global perspective. These measure how the compositional descriptors are distributed throughout the sequence or how they change along the sequence. Auto-correlation [27] is a prime example of such a descriptor, providing a measure of the correlation between values at different points in the sequence. To capture more complex patterns that go beyond those expressed by single amino acids, these descriptors are often extended to consider di- and tri-peptides of the sequence. This makes it possible to describe patterns and properties that emerge from the interactions and relationships between adjacent amino acids, providing a more comprehensive representation of the sequence’s physico-chemical properties.
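For concreteness, the simplest compositional descriptor, the amino acid composition, can be computed in a few lines of Python (a minimal sketch; packages such as ProPy compute this alongside hundreds of related descriptors):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Fraction of each standard amino acid in the sequence:
    the simplest compositional descriptor."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}

comp = aa_composition("MKKLLA")  # toy sequence for illustration
print(comp["K"])  # 2 of the 6 residues are lysine
```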
Protein sequence embeddings.
Protein sequence embeddings are a type of representation for proteins that embodies the biological information inherent in the sequence. They are inspired by techniques in natural language processing, particularly word embeddings [28], which represent words as vectors in a way that captures semantic relationships between them.
Protein sequence embeddings are derived from protein language models (PLMs). These are essentially deep neural networks trained on vast collections of protein sequences by minimizing a self-supervised objective known as masked language modeling [17]. In simple terms, the model is tasked with predicting amino acids in the sequence that have been intentionally hidden or “masked,” forcing the model to learn from the context in which the missing amino acid appears. This approach is similar to the original training of word embeddings, where a word is predicted based on its context within a sentence. As a result of this process, each amino acid in the sequence is represented as a high-dimensional vector.
Each entry in the vector can be thought of as a feature that captures some aspect of the protein’s biological or physico-chemical properties. The exact interpretation of these features can be complex, as they are learned by the model and not directly tied to specific, predefined characteristics. However, the key is that proteins (or amino acids) that are similar in some biological sense will have similar embeddings. PSEs have been shown to encode various hallmarks of biological sequences, ranging from simple physico-chemical properties to more complex ones such as remote homology [29]. Thus, they have been used to learn a variety of sequence-based predictive tasks [18–20], although their use for antigen prediction has not been explored yet.
Related work
In the past years, several RV methods to identify potential PAs out of bacterial proteomes have been proposed. They can be roughly categorized into filtering approaches and ML approaches.
Filtering approaches such as NERVE [30], Vaxign [31], Jenner-predict [32], VacSol [33], work by applying filters to the candidates until a viable subset is found. These filters are typically based on applying a cut-off value to biological or physico-chemical descriptors. Multiple filters are applied sequentially, in no particular order among themselves. For example, one could decide to exclude (from an initial set of candidates) protein sequences with predicted probability of being an adhesin below 0.5. On the remaining set, one could further decide to exclude those with predicted number of transmembrane domains below 2, and so on.
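The sequential filtering scheme described above can be sketched as follows (the descriptor names and cut-off values mirror the illustrative example in the text and are not taken from any specific tool):

```python
# Sketch of sequential descriptor-based filtering in the spirit of NERVE or
# Vaxign. Descriptor names and cut-off values are illustrative assumptions.

def filter_candidates(candidates, filters):
    """Apply each (descriptor, predicate) pair in turn, shrinking the pool."""
    pool = list(candidates)
    for descriptor, keep in filters:
        pool = [c for c in pool if keep(c[descriptor])]
    return pool

# Toy candidate pool: each protein is a dict of precomputed descriptor values.
proteome = [
    {"id": "P1", "adhesin_prob": 0.8, "tm_domains": 0},
    {"id": "P2", "adhesin_prob": 0.3, "tm_domains": 3},
    {"id": "P3", "adhesin_prob": 0.7, "tm_domains": 4},
]

# As in the example above: first exclude candidates with adhesin probability
# below 0.5, then exclude those with fewer than 2 transmembrane domains.
filters = [
    ("adhesin_prob", lambda p: p >= 0.5),
    ("tm_domains", lambda n: n >= 2),
]

selected = filter_candidates(proteome, filters)
print([c["id"] for c in selected])  # ['P3']
```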
Traditional ML approaches such as VaxiJen [34], the methods by Heinson [35] and Bowman [25], and VaxignML [21], as well as deep learning-based approaches such as Vaxi-DL [36] and Vaxign-DL [37], learn a model that classifies protein sequences as PAs or non-PAs from a labeled dataset of protein sequences. Crucially, in contrast with filtering approaches, these methods do not discard candidates, but rather assign a score (generally a probability) that indicates the likelihood of being PAs. Typically, trained ML models are applied to the entire proteome of a bacterial species for which a vaccine is sought. The proteins can be ranked according to the probability assigned by the model, indicating which ones are more likely to be protective.
In this work, we set our focus on this second category. To the best of our knowledge, using PSEs to represent sequences in this applied context has not yet been explored by the RV community, which usually relies on representing protein sequences as vectors of descriptors.
Materials and methods
Data collection and preprocessing
We collected 708 unique PA sequences with a corresponding UniProt entry from the Protegen [38] and VaxiJen [34] databases. We removed homologs by clustering the sequences with mmseqs-cluster [39] (sequence identity > 0.3, coverage > 0.5), taking one representative sequence for each cluster found. This narrowed down the number of PA sequences to 458 (comprising 82 different bacterial species).
While Protegen and VaxiJen only contain PA sequences, ground truth non-PA sequences are not available for this predictive task. Therefore, we artificially created non-PA samples following previous works [21,35]. The process to add a protein sequence to the set of non-PA sequences is exemplified in Fig 2. For a given PA sequence, we randomly sampled from UniProt one candidate non-PA sequence of the same bacterial species. We rejected the candidate if found to be similar to any PA sequence in the set of antigenic sequences or to non-PA sequences already accepted (sequence identity > 0.3, coverage > 0.5, e-value < 0.001). Otherwise, we accepted the candidate.
Given a reference PA sequence, a non-PA candidate sequence is drawn at random from the reference’s proteome in UniProt. The two sequences are checked for similarity: if they are similar, the candidate is rejected and a new one is drawn. If not, the candidate is tested for similarity against all previously accepted non-PAs: if it is similar to any of them, it is rejected and a new one is drawn. Otherwise, the candidate is added to the set of non-PA sequences. This process is repeated for each PA sequence available.
We finally assigned to each PA sequence a positive label (1), and to each non-PA sequence a negative label (0). In total, the constructed dataset contains 916 labeled sequences.
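The rejection-sampling procedure can be sketched in Python as follows (the similarity check is a crude stand-in for the alignment-based criteria used in the study, and the toy sequences are invented for illustration):

```python
import random

def is_similar(a, b):
    """Stand-in for the alignment-based similarity check of the study
    (sequence identity > 0.3, coverage > 0.5, e-value < 0.001). Here we
    crudely flag sequences that share their first 10 residues."""
    return a[:10] == b[:10]

def sample_non_pas(pa_seqs, proteomes, rng):
    """Rejection-sample one non-PA per PA from the same species' proteome."""
    non_pas = []
    for pa, proteome in zip(pa_seqs, proteomes):
        while True:
            cand = rng.choice(proteome)
            if any(is_similar(cand, p) for p in pa_seqs):
                continue  # similar to a known PA: reject and redraw
            if any(is_similar(cand, n) for n in non_pas):
                continue  # similar to an accepted non-PA: reject and redraw
            non_pas.append(cand)
            break
    return non_pas

# Toy proteome containing one PA-like sequence and one dissimilar decoy.
pa = "MKKLLPAADE" + "A" * 20
decoy = "MSTNPKPQRK" + "G" * 20
negatives = sample_non_pas([pa], [[pa, decoy]], random.Random(0))
print(negatives == [decoy])  # the PA-like candidate is always rejected
```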
Computational pipeline
The core object of this study is a predictive pipeline that accepts protein sequences in FASTA format as input and yields a probability value for each protein sequence, indicating their likelihood of being a PA. Irrespective of the numerical representation of protein sequences (be it PSEs or descriptors), the pipeline can be deconstructed into three distinct sub-modules. The initial module, referred to as the Extraction module, processes protein sequences in FASTA format and converts them into numerical vectors with d components (d = 1280 for the PSE-based representation, d = 1560 for the descriptor-based one). The subsequent module, known as the Preprocessing module, applies various transformations to the vectors, such as scaling and optional feature selection. The final module, termed the Classification module, accepts the preprocessed vectors and inputs them into a calibrated classifier that generates the ultimate probabilistic predictions. Importantly, the exact configuration of the Preprocessing and Classification modules (e.g., the amount of feature selection, the specific classifier) is chosen during model selection to maximize performance. The pipeline is depicted in Fig 3. Notice that, once trained, the pipeline can be applied to predict the proteins of any bacterial species without retraining, regardless of whether the species is rare or under-studied.
The pipeline proposed in this study is composed of three modules, which vary depending on how protein sequences are represented (blue path: PSE-based; red path: descriptor-based). Green boxes represent fixed operations that are applied once to the sequences, while red boxes are tuned during the experiments, meaning that their configuration is chosen during model selection to maximize performances. Orange boxes indicate numerical quantities, either vectors or scalars.
In the remainder of this section, we discuss the specific modules in detail in relation to the protein representation of choice (PSE- or descriptor-based).
Extraction module.
The extraction module of the descriptor-based pipeline consists of a stack of bioinformatics software which are used to extract both biological and physico-chemical descriptors. For a given input amino acid sequence, we collected d = 1560 descriptors from 6 different bioinformatics software:
- 6 biological descriptors pertaining subcellular localization obtained from the PSortB software package [11];
- 1 biological descriptor for the probability of being an adhesin, obtained from the SPAAN software package [12];
- 1 biological descriptor for the probability of containing signal peptides, obtained from the SignalP 4.1 software package [40];
- 4 biological descriptors obtained from the TMHMM 2.0 software package [13], regarding the number and composition of transmembrane helices;
- 1 biological descriptor corresponding to the immunogenicity score provided by the IEDB software package [14];
- 1547 physico-chemical descriptors obtained from the ProPy software package [41], including amino acid composition, auto-correlation, and quasi ordered sequence numbers.
The Extraction module of the PSE-based pipeline utilizes a publicly accessible, pre-trained PLM from Facebook Artificial Intelligence Research (FAIR), known as ESM-2 (https://github.com/facebookresearch/esm). This model is a deep transformer architecture comprising 33 hidden layers, pretrained on the UniRef-50 dataset using a masked language modeling approach. After a preliminary comparison with other models such as ProteinBERT [60] and ESM-1, ESM-2 emerged as the best candidate to serve as Extraction module. Our finding is aligned with previous results in the literature, where ESM-2 outperformed several single-sequence protein language models across a range of structure prediction tasks [55,59,62,63]. We fed the FASTA sequences to the model and collected the output from the 33rd layer, which is the final hidden layer of the network. Given a single sequence, the output from this layer is a matrix of dimension L × d, where L denotes the sequence length and d = 1280 denotes the embedding dimension (i.e., there is one embedding for each amino acid in the sequence). By averaging the amino acid embeddings across the length dimension, we obtained a single PSE of size d = 1280 representing the entire sequence. S1 Fig shows the different processes that allow obtaining a vector of descriptors (left) and a PSE vector (right) from an example residue sequence.
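The pooling step that turns the per-residue matrix into a single PSE is a plain mean over the length dimension, as this sketch illustrates (the commented ESM-2 calls show where the L × d matrix would come from in the public esm package; treat the exact API as an assumption to be checked against its README):

```python
def mean_pool(per_residue_embeddings):
    """Collapse an L x d matrix (one d-dim embedding per residue) into a
    single d-dim protein-level embedding by averaging over the L residues."""
    L = len(per_residue_embeddings)
    d = len(per_residue_embeddings[0])
    return [sum(row[j] for row in per_residue_embeddings) / L for j in range(d)]

# With the public esm package, the L x d matrix would come from something like:
#   model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
#   out = model(tokens, repr_layers=[33])
#   reps = out["representations"][33]   # shape (1, L, 1280)
# (calls shown for orientation only; check the esm README for the exact API)

toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # L = 3 residues, d = 2
print(mean_pool(toy))  # [3.0, 4.0]
```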
Preprocessing module.
The first step of the Preprocessing module is a feature scaling step that transforms the features into a suitable range to avoid slow convergence of the downstream classifiers. During model selection, we opted between standard scaling, which independently centers each vector component to zero mean and unit variance, and min-max scaling, which independently rescales each component into the [0, 1] range. This step is exclusively applied to the descriptor vectors, as PSEs are already scaled appropriately.
The second step is a feature selection step accomplished with Principal Component Analysis (PCA). PCA is a technique that converts a set of potentially correlated features into a set of uncorrelated principal components via an orthogonal projection. It can be employed for dimensionality reduction by selecting the top-k principal components based on the original variance they preserve. During model selection, we determined whether to apply PCA and, if so, the number of components to retain. Alternatively, all features could be retained if PCA was not applied.
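As an illustration of the scaling step, min-max scaling can be implemented in a few lines (a didactic sketch; in practice both scaling variants and PCA are available off the shelf, e.g., in scikit-learn):

```python
def minmax_scale(X):
    """Independently rescale each vector component into [0, 1].
    (Standard scaling would instead subtract the per-component mean and
    divide by the standard deviation.)"""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(x - l) / (h - l) if h > l else 0.0
             for x, l, h in zip(row, lo, hi)] for row in X]

# Three 2-dimensional feature vectors, scaled column-wise.
X = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
print(minmax_scale(X))  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```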
Classification module.
The Classification module consists of a classifier chosen along with its hyper-parameters during model selection from a pool of 5 alternatives:
- Logistic Regression (LR) [42] uses the logistic function to model the probability of the positive class. The coefficients of the logistic regression model are estimated by maximum likelihood.
- Random Forest (RF) [43] is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes of the individual trees.
- eXtreme Gradient Boosting (XGB) [44] is a gradient boosting framework that uses a linear model solver and a sequence of decision trees as base models. At each iteration, the current model is trained to reduce the error committed by the previous.
- Support Vector Machine (SVM) [45] is a maximum margin classifier that operates by finding the hyperplane which maximizes the distance between the two classes. It can use the kernel trick to transform the input space, enabling it to handle non-linear classification boundaries.
- Multi-Layer Perceptron (MLP) [46] is a feedforward neural network with a single hidden layer and ReLU activations.
Classifier calibration
While this work deals with a probabilistic binary classification task, not all the classifiers described above return proper probabilities. For example, the SVM predicts a score which is related to the distance of the input from the margin. To ensure that the classifiers output proper probabilities, we applied calibration to the models after training. Specifically, model calibration tries to enforce the following property on the outputs of a classifier:

P(Y = 1 | S = s) = s,

where P indicates probability, Y is a discrete random variable which ranges in the set of possible labels, and S is a continuous random variable ranging in the interval [0, 1] that represents the classifier’s output score. In words, calibration consists of adjusting the classifier’s outputs so that they match the observed class frequencies. In this work, calibration is performed using Platt’s Scaling [47]. Basically, the method produces calibrated probabilities p̂ with the following logistic regression model:

p̂ = 1 / (1 + exp(A·s + B)),

where s is the uncalibrated score and the parameters A and B are estimated with maximum likelihood.
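A didactic sketch of Platt scaling, fitting the two sigmoid parameters by plain gradient descent on the negative log-likelihood (Platt's original method uses a Newton-type optimizer and slightly regularized targets, and the sign convention of A here is flipped with respect to some presentations):

```python
import math

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Fit sigmoid(A*s + B) to (score, label) pairs by minimizing the
    negative log-likelihood with plain gradient descent."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(A * s + B)))
            gA += (p - y) * s  # gradient of the NLL w.r.t. A
            gB += (p - y)      # gradient of the NLL w.r.t. B
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def platt_apply(A, B, s):
    """Map an uncalibrated score s to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(A * s + B)))

# Toy uncalibrated scores (e.g., raw SVM margins) with their true labels.
scores = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
A, B = platt_fit(scores, labels)
print(platt_apply(A, B, 1.5))  # a proper probability in (0, 1)
```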
Model selection
During the evaluation process, we used randomized hyperparameter search [48] to optimize the pipeline steps. This included the scaling method (applicable only for the descriptor-based pipeline), the decision to apply feature selection and its extent, the choice of classifier and its hyper-parameters. Randomized hyper-parameter search entails sampling hyper-parameter combinations from predefined distributions and assessing the model performance with each combination. In this study, each hyper-parameter combination was evaluated using 5-fold Cross-Validation (CV). The combination that produced the best average negative log-likelihood (NLL) across the 5 validation folds was chosen. We used NLL to select the best pipeline since it has been shown to consistently drive the selection towards well-calibrated classifiers [49]. The resulting optimized pipeline was then trained on the complete training set and subsequently evaluated on a separate test set to obtain an unbiased performance estimate. The full table of the hyperparameters that were optimized during model selection is shown in S1 Table. Notice that in this study we do not take into account class imbalance, meaning that PAs and non-PAs are in the same proportion within the training set. The rationale of this choice is two-fold. On the one hand, previous studies in PA classification found that performances were not significantly impacted when the dataset is artificially imbalanced [21]. On the other hand, dealing with imbalanced data with over-sampling or under-sampling is known to negatively affect model calibration [61].
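The selection loop, i.e., sampling configurations and scoring each by mean validation NLL over 5 folds, can be sketched as follows (the constant-probability toy model and configuration sampler are purely illustrative):

```python
import math
import random

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
                for p, y in zip(probs, labels)) / len(labels)

def random_search(sample_cfg, fit_predict, X, y, n_iter=10, k=5, seed=0):
    """Sample n_iter configurations; keep the one with the lowest mean
    validation NLL across k cross-validation folds."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    best_cfg, best_nll = None, float("inf")
    for _ in range(n_iter):
        cfg = sample_cfg(rng)
        fold_nlls = []
        for i in range(k):
            val = folds[i]
            trn = [j for f in folds[:i] + folds[i + 1:] for j in f]
            probs = fit_predict(cfg, [X[j] for j in trn], [y[j] for j in trn],
                                [X[j] for j in val])
            fold_nlls.append(nll(probs, [y[j] for j in val]))
        mean_nll = sum(fold_nlls) / k
        if mean_nll < best_nll:
            best_cfg, best_nll = cfg, mean_nll
    return best_cfg, best_nll

# Toy use: the "model" predicts a constant probability taken from its config.
X, y = [[0]] * 10, [1] * 10
cfg_sampler = lambda rng: {"p": rng.choice([0.5, 0.9])}
const_model = lambda cfg, Xtr, ytr, Xval: [cfg["p"]] * len(Xval)
best_cfg, best = random_search(cfg_sampler, const_model, X, y)
print(best_cfg, best)
```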
Experiments
In this section, we provide the details of the three experiments carried out in this work. For each section describing an experiment, the first part describes the goal of the experiments and how the comparison with competing methods was structured, while the second part is devoted to describing the metrics used to measure performance. The data and code used in this study, as well as the experimental results, are available in the accompanying GitHub repository: https://github.com/marcopodda/dfpac. This includes all protein sequences used in this work, together with their vectorial representations (descriptors and PSE). Further information to reproduce the experiments can be found in the repository.
PSE-based vs. descriptor-based model evaluation
The initial experiment involved a comparative analysis of the descriptor-based pipeline and the PSE-based pipeline. The objective was to ascertain which data representation – descriptors or PSEs – offers superior predictive accuracy in identifying protective antigens. It is important to remark that we are not evaluating a single classifier, but the entire pipeline, which also includes the model selection of the classifier. We employ a Leave-One-Bacteria-Out (LOBO) evaluation scheme: for a given bacterial species, all protein sequences in the labeled dataset belonging to that species are allocated to the test set, while the remaining sequences are assigned to the model selection set. This approach tests the model’s ability to generalize to unseen bacterial species, aligning with methodologies used in prior studies [21,36,37] and with the typical RV stage of vaccine research and development projects in the pharmaceutical industry. We performed LOBO evaluations on 10 bacterial species (4 Gram-positive, 6 Gram-negative): Actinobacillus pleuropneumoniae, Campylobacter jejuni, Chlamydia muridarum, Escherichia coli, Mycobacterium tuberculosis, Neisseria meningitidis, Staphylococcus aureus, Streptococcus pneumoniae, Streptococcus pyogenes, and Yersinia pestis. Fig 4 shows the workflow of the chosen LOBO evaluation procedure.
NLL: negative log-likelihood; AUROC: Area Under the Receiver Operating Characteristics curve; AUPR: Area Under the Precision-Recall curve; WF1: Weighted F1 score; MCC: Matthews Correlation Coefficient.
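The LOBO split described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual API: the `records` structure and function name are hypothetical stand-ins for the labeled dataset.

```python
# Sketch of the Leave-One-Bacteria-Out (LOBO) splitting scheme.
# `records` is a hypothetical list of (species, features, label)
# tuples standing in for the labeled dataset.
from collections import defaultdict

def lobo_splits(records):
    """Yield (held_out_species, train_set, test_set) triples.

    In each split, every sequence of the held-out species goes to the
    test set; all remaining sequences form the model selection set.
    """
    by_species = defaultdict(list)
    for species, x, y in records:
        by_species[species].append((x, y))
    for held_out in by_species:
        test = by_species[held_out]
        train = [item for sp, items in by_species.items()
                 if sp != held_out for item in items]
        yield held_out, train, test
```

The key property enforced is that the held-out species never contributes sequences to training or model selection.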
Metrics. For each bacterial species tested, we used the trained model to predict its (unseen) protein sequences. Following previous works such as [21,36,37], we employed the following metrics for the comparison: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall curve (AUPR), Weighted F1 (WF1), and Matthews Correlation Coefficient (MCC). To smooth out the effect of randomness, the training procedure was repeated 10 times, each time using a different fixed seed for random number generation. We report the mean of each performance metric across these 10 trials. Once a random seed is fixed during a trial, the two pipelines (PSE-based and descriptor-based) can be compared fairly, since they use the same data splits and undergo an identical model selection.
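As a sketch of the per-trial metric computation using scikit-learn (an illustration of the four metrics, not the exact evaluation code; the 0.5 threshold for the label-based metrics is an assumption):

```python
# Compute the four comparison metrics for one trial. `y_true` are the
# held-out labels, `y_prob` the predicted probabilities of the
# positive (PA) class; label-based metrics use a 0.5 threshold.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, matthews_corrcoef)

def evaluate_trial(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPR": average_precision_score(y_true, y_prob),
        "WF1": f1_score(y_true, y_pred, average="weighted"),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }
```

In the full procedure, this function would be called once per trial and per held-out species, and the resulting dictionaries averaged over the 10 seeds.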
Benchmark evaluation
In this experiment, we compare the PSE-based pipeline to methods from the literature on the iBPA benchmark [21]. The original iBPA benchmark includes 249 proteins, but we were only able to recover 243 upon request to the authors. To ensure a fair comparison, we trained the PSE-based pipeline using the same dataset as the reference work. Essentially, the training dataset for this task consisted of 397 antigens and 397 non-antigens; we excluded 3 proteins from the original dataset due to unclear or non-bacterial species annotations. During training, the PSE-based pipeline was optimized with the same hyper-parameter tuning procedure as in the LOBO evaluation. However, we fixed the classifier to an SVM, since in the LOBO evaluation it performed better than the alternatives 71% of the time on the validation set. We compared against 8 different methods from the literature, reusing the results from the work of [37]. We remark that the PSE-based pipeline was trained with 394 antigens out of the original 397 (99.2%) and tested on 243 proteins out of the original 249 (97.5%). We therefore consider the comparison as fair as possible.
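For illustration, a probability-calibrated SVM of the kind fixed in this experiment can be assembled in scikit-learn as below. The hyper-parameter values are placeholders, not those selected during tuning, and the pipeline is a sketch rather than the repository's implementation.

```python
# Illustrative SVM setup with calibrated probability outputs
# (sklearn's probability=True fits a sigmoid calibrator via internal
# cross-validation, following Platt scaling). Hyper-parameter values
# here are examples, not the ones chosen during model selection.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def make_svm(C=1.0, gamma="scale"):
    return make_pipeline(
        StandardScaler(),
        SVC(C=C, gamma=gamma, kernel="rbf", probability=True),
    )
```

The `predict_proba` output of such a pipeline is what the later candidate selection experiment uses as a likelihood of being a PA.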
Candidate selection simulation
In this experiment, we simulate a situation where a researcher, provided with a bacterial proteome, must determine the order in which candidate antigens should proceed to pre-clinical tests (assuming a sequential testing protocol). The optimal approach would prioritize actual antigens for testing, thereby saving pre-clinical resources by minimizing tests on non-antigens. We compare two distinct strategies for establishing this testing sequence:
- one that ranks the candidates according to their biological descriptors, similar to the filtering approach used in RV, which we term the RV-based approach,
- and another that exploits the output probability given by the PSE-based pipeline trained during the LOBO evaluation to rank candidates, which we call the likelihood-based approach.
We downloaded the entire proteomes of the 10 test species mentioned above from UniProt. These proteomes were enriched by adding, for a given species, all PA sequences found in our dataset (taking care to remove duplicate sequences). Since surface exposure is the gold-standard criterion that all candidates must satisfy in order to be eligible, the comparison was restricted to protein sequences whose subcellular localization was predicted as “extracellular space” or “outer membrane” by the BUSCA software package [50]. After filtering for surface exposure, we assigned a positive label (1) to known PAs and a negative label (0) to the remaining proteins. Note that this approach is conservative: the proteins we deem negative are actually untested, and therefore it is unknown whether they are PAs or not.
The RV filtering strategy was implemented by arranging the protein sequences using 4 descriptors as sorting keys: probability of having signal peptides (from high to low), probability of being an adhesin (from high to low), number of transmembrane helices (from large to small), and immunogenicity score (from high to low). The order of precedence among the 4 keys was randomly determined in each trial (since there is no universal preference criterion among them). Conversely, the likelihood-based strategy was implemented by sorting the protein sequences in descending order according to their likelihood of being PAs. More precisely, given a bacterial species, the likelihoods were assigned by the corresponding PSE-based pipeline trained during the LOBO evaluation (meaning that the protein sequences of the bacterial species under study were never seen during training).
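The RV-based ranking with randomized key precedence can be sketched as follows; the descriptor field names are hypothetical, not the actual schema used in the repository.

```python
# Sketch of the RV-based ranking: lexicographic sort on the four
# descriptors, with the precedence among keys shuffled per trial.
# Field names are illustrative. All four keys sort "best first",
# i.e., higher descriptor values come earlier in the ranking.
import random

KEYS = ["signal_peptide_prob", "adhesin_prob",
        "n_tm_helices", "immunogenicity"]

def rv_rank(proteins, seed=0):
    """Sort protein dicts lexicographically on the four descriptors,
    with a per-trial random precedence among the sorting keys."""
    rng = random.Random(seed)
    keys = KEYS[:]
    rng.shuffle(keys)
    # Negate values so that Python's ascending sort yields high-to-low.
    return sorted(proteins, key=lambda p: tuple(-p[k] for k in keys))
```

Note that a protein that dominates another on all four descriptors is ranked ahead of it regardless of the drawn key precedence; the random precedence only matters when descriptors disagree.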
Metrics.
To compare performances in this experiment, we devised a novel metric designed to reward rankings that place known PAs at the top. Given a bacterial proteome $S$ composed of $m$ protein sequences, let $y \in \{0,1\}^m$ be a vector that labels the sequences in $S$ as known PAs (1) or non-PAs (0). Let $\pi^*$ be the “ideal” permutation of $y$ that rearranges all known PAs first, and $y^* = \pi^*(y)$ be the label vector rearranged accordingly. Similarly, let $\pi$ be a permutation to evaluate, and $y^\pi = \pi(y)$ be the label vector rearranged accordingly. In practice, the permutation $\pi$ is obtained by arranging the sequences in $S$ according to some precomputed ranking (be it RV-based or likelihood-based). Finally, let $C$ be a cumulative sum function defined over vectors as follows:

$C(v)_i = \sum_{j=1}^{i} v_j, \quad i = 1, \ldots, m.$

Conceptually, the vector returned by $C$ is a histogram where the $i$-th bar represents the running sum up to the $i$-th element of the input vector. With this information, we can define the normalized Antigen Discovery Rate (nADR) as:

$\mathrm{nADR}(\pi) = \frac{\sum_{i=1}^{m} C(y^\pi)_i}{\sum_{i=1}^{m} C(y^*)_i},$

where the $i$ subscript denotes the $i$-th component of the vector returned by $C$. In practice, the nADR first calculates the area of the two histograms (by summing up the histogram bins) and then computes the ratio of the two areas obtained. Intuitively, it quantifies how closely the evaluated ranking matches the ideal ranking, with a higher score indicating a better match, akin to information retrieval metrics like the normalized Discounted Cumulative Gain [51]. A simple illustration of how the metric is computed is shown in Fig 5. Note that the nADR has many desirable characteristics: it ranges in the interval $(0,1]$, since $\sum_i C(y^\pi)_i \leq \sum_i C(y^*)_i$ holds trivially for any $\pi$; it is independent of $m$ since it is normalized; and it equals 1 if and only if $y^\pi = y^*$ (i.e., when the evaluated ranking is optimal). It is not defined when the denominator is 0, i.e., when the proteome does not contain known PAs, in which case we simply force it to be 0. In our experiments, the nADR is computed 10 times with different random seeds for each held-out bacterial species. We report the mean of these trials.
We employ a toy proteome composed of $m = 5$ protein sequences, of which 3 are known PAs. In the picture, $C$ transforms the input vector into a histogram of running sums, while the summation adds up the histogram bins.
To further characterize the quality of the rankings, during the experiments we also monitor an index that we term the First Hit Index (FHI), defined as:

$\mathrm{FHI}(\pi) = \min\{\, i : (y^\pi)_i = 1 \,\},$

which stores the position of the first known antigen in ranking order. Intuitively, it can be interpreted as the minimum number of pre-clinical tests needed to recall a known antigen. Referring back to Fig 5, we have that $\mathrm{FHI} = 2$.
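Both indices follow directly from their definitions. The sketch below assumes `labels` is the 0/1 label vector already rearranged according to the ranking under evaluation (the $y^\pi$ of the text).

```python
# nADR and FHI computed from a ranked 0/1 label vector.
import numpy as np

def nadr(labels):
    """Normalized Antigen Discovery Rate of a ranked label vector."""
    labels = np.asarray(labels)
    if labels.sum() == 0:             # no known PAs: metric undefined,
        return 0.0                     # forced to 0 as in the text
    ideal = np.sort(labels)[::-1]      # ideal ranking: all known PAs first
    area = np.cumsum(labels).sum()     # "area" of the running-sum histogram
    ideal_area = np.cumsum(ideal).sum()
    return float(area / ideal_area)

def fhi(labels):
    """First Hit Index: 1-based position of the first known PA
    (assumes at least one PA is present in the ranking)."""
    return int(np.argmax(np.asarray(labels) == 1)) + 1
```

For instance, for a toy ranking `[0, 1, 1, 0, 1]` (m = 5, three known PAs), the ideal arrangement is `[1, 1, 1, 0, 0]`, giving nADR = 8/12 ≈ 0.67 and FHI = 2.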
Fold enrichment analysis
To further characterize the results of the previous experiment, we conducted a fold enrichment analysis of known PAs within the 90th percentile of the average (across 10 different random seeds) ranking produced by the likelihood-based strategy. In simpler terms, we checked the statistical over-representation of known PAs within the high-probability subset of the proteome. The fold enrichment is defined as the ratio between the observed number of PAs in the 90th percentile of the ranking and the number expected across the whole proteome. On the same subset, we computed the recall metric, defined as the percentage of known PAs found in the 90th percentile. To assess statistical significance, we also computed the p-values (at the conventional 0.05 cutoff) associated with the enrichment using a hypergeometric test.
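The enrichment statistics can be sketched with SciPy's hypergeometric distribution. The function below is a minimal illustration of the definitions above; the argument values in the usage example are made up, not figures from this study.

```python
# Fold enrichment, recall, and hypergeometric p-value for finding
# `n_pa_top` known PAs among the `n_top` top-ranked proteins of a
# proteome with `n_total` proteins, `n_pa_total` of them known PAs.
from scipy.stats import hypergeom

def fold_enrichment(n_total, n_pa_total, n_top, n_pa_top):
    expected = n_pa_total * n_top / n_total   # expected PAs in the subset
    fold = n_pa_top / expected
    recall = n_pa_top / n_pa_total
    # P(X >= n_pa_top) when drawing n_top proteins without replacement
    p_value = hypergeom.sf(n_pa_top - 1, n_total, n_pa_total, n_top)
    return fold, recall, p_value
```

For example, finding 5 of 10 known PAs among the top 10 of a 100-protein proteome yields a 5-fold enrichment with 50% recall, at a p-value far below 0.05.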
Results
Here, we present the results of the experiments detailed previously. Each sub-section refers to an experiment, following the same order in which they are described in the Experiments section.
PSEs outperform descriptors in LOBO evaluation
Table 2 summarizes the results of the evaluation as described in the Experiments section. In 7 out of 10 species (A. pleuropneumoniae, C. jejuni, C. muridarum, E. coli, M. tuberculosis, N. meningitidis, S. pneumoniae), the PSE-based pipeline outperforms the descriptor-based pipeline across all measured metrics. In two cases (S. aureus and Y. pestis), the PSE-based pipeline performs better on 3 metrics out of 4. Only in one case (S. pyogenes) does the descriptor-based pipeline perform better than the PSE-based one (in 3 metrics out of 4). The lower generalization of S. pyogenes PSEs relative to descriptors might be caused by an underrepresentation of its sequence characteristics or distributional properties among the sequences used to pretrain the ESM-2 model. More precisely, it could be that the S. pyogenes sequences, while antigenic, differ from all other sequences (perhaps due to peculiarities of the sequencing process). In that case, this variability may have been overfitted by the PSEs, while descriptors generalize better. However, this is merely a hypothesis that needs to be tested in further studies. Indeed, the gap between PSEs and descriptors, while in favor of the latter, is still narrower than in the other cases. Also, the MCC metric, which measures the correlation between the predictions and the actual antigenicity, is slightly in favor of PSEs.
Averaging performances across species (last row of the table), we found that for all metrics, the PSE-based pipeline performed better than the descriptor-based one. This result provides clear evidence that PSEs are a more effective means of representing proteins for RV tasks.
PSE-based pipeline outperforms competitors in iBPA benchmark
Table 3 reports the results of the benchmark evaluation on the iBPA dataset, as detailed in the Benchmark evaluation section. The PSE-based pipeline outperformed all the competitors in all the benchmark metrics. Notably, we report a 6.75% improvement in WF1 and an 11.8% improvement in MCC with respect to the previous best model. Summing up, while the previous experiment demonstrated that the PSE-based pipeline proposed in this study generalizes better than descriptor-based pipelines in a LOBO setting, this result shows that it also generalizes better than all other methods in the literature on unseen proteins, regardless of the bacterial species.
Likelihood-based ranking outperforms RV-based ranking in simulated candidate selection
Table 4 shows the results of the simulated candidate selection for pre-clinical tests, where we compare the proposed likelihood-based approach against the RV-based approach described in the Candidate selection simulation section. In all the 10 test species under study, the likelihood-based strategy significantly improves on the nADR metric with respect to the baseline, by 49% on average. By inspecting the FHI, we can also observe that in 8 cases out of 10 (i.e., excluding M. tuberculosis and Y. pestis), the likelihood-based strategy leads to the re-discovery of the first known PA within at most 4 pre-clinical trials on average, while the RV-based strategy, in the same cases, requires at least 6 and at most 41. Even in the most difficult cases of Y. pestis (resp. M. tuberculosis), our strategy requires 18 (resp. 12) pre-clinical tests to re-discover the first known PA, while the RV-based strategy requires 125 (resp. 38). Overall, our approach saves (on average across the 10 species) up to 83% of pre-clinical tests with respect to the RV-based strategy.
The evolution of the simulated pre-clinical tests is shown in Fig 6, where the x-axis indicates the number of pre-clinical trials (assuming they are performed sequentially) and the y-axis plots the cumulative distribution of known PAs re-discovered with the likelihood-based strategy (green line) versus the RV-based strategy (gray line). In all the species under study, the rate at which PAs are re-discovered with the likelihood-based approach is consistently above that of the RV-based approach, indicating that, to re-discover the same number of known PAs, the proposed method requires a smaller number of pre-clinical tests.
The plots display the number of pre-clinical trials (x-axis) in relation to the cumulative distribution of known PAs (y-axis) re-discovered with the proposed likelihood-based strategy (green line), compared to the RV-based strategy (gray line). Both strategies are described in the Candidate selection simulation section. Shading around the lines identifies a 95% confidence interval around the mean of the cumulative distribution. In all cases, the green line is above the gray line, indicating that, to re-discover the same number of known PAs, our method requires a smaller number of pre-clinical tests.
High-probability subset of likelihood-based ranking is enriched with known PAs
The fold enrichment analysis, presented in Table 5, shows a statistically significant over-representation of known PAs within the 90th percentile of the ranking (i.e., the subset containing high-probability PAs according to the likelihood-based strategy) compared to the expectation under random sampling. The result is consistent across the 10 species evaluated, with p-values below the 0.05 cutoff. Remarkably, the proposed method recalls 36% of known PAs within the 90th percentile of the ranking (on average across the 10 species). These results suggest that the PSE-based pipeline assigns known PAs to the top positions of the ranking at a higher rate than a random model, which conforms to our expectations.
Discussion
Since its inception two decades ago, research in RV has focused on narrowing down the number of potentially good antigens to be tested in animals, searching for a signature of biological descriptors in the amino acid sequences. This extensive search has confirmed surface exposure as the unchallenged winner: descriptor-based ML attempts have found surface exposure to be the major driver of discrimination between antigens and non-antigens, relegating other descriptors to marginal contributions. As a result, many predictive tools are currently available to reliably assess the subcellular localization of proteins. However, past antigen discovery projects have clearly shown that cellular localization is a necessary but not sufficient condition to establish whether a protein is a vaccine antigen.
With this motivation, the present study diverges from the longstanding paradigm according to which there exists a “recipe” of biological descriptors that, in addition to surface exposure, can be used to identify good antigens. Instead, we explored the emerging route of sequence embeddings, which has recently achieved breakthroughs in the field of biological sequence analysis. With our approach, the protein primary structure is used directly as input to the ML model, which learns a hidden discriminatory rule based on patterns of residue co-occurrences without imposing any a priori knowledge (in the form of precomputed biological descriptors). Our experiments validate the hypothesis that being descriptor-agnostic can improve predictive performance.
A second major result presented in this paper is that ranking candidate antigens from top to bottom using a trained PSE-based model makes better use of available resources (in this case, the budget imposed by pre-clinical capacities) than strategies commonly used in RV. In a sense, we shift the role of RV from “sieve” to a sort of “recommender”. Besides leading to a faster discovery of novel antigens, as demonstrated experimentally, this new perspective broadens the horizon by encouraging the testing of not-yet-studied proteins that the model nonetheless ranks in the top positions.
On a final note, we are also aware that the proposed approach has limitations. Firstly, both the descriptor-based pipeline and the candidate selection simulation rely on the accuracy of subcellular localization prediction methods, which, although generally reliable, are not flawless. Secondly, the class labels of animal-tested PAs derive from experimenter-defined thresholds on laboratory read-outs that have a native numerical scale (such as ELISA titers, human serum bactericidal assay titers, or p-values in statistical analyses of challenge studies), which could generate noisy data. Thirdly, current PSE models cannot handle long protein sequences; in this study, we truncated sequences at 1022 residues (with complete coverage for approximately 96% of the sequences in our dataset). In future work, we would like to address this issue once methodologies to embed long sequences become mature [56–58]. Lastly, PSEs are less interpretable than descriptors, because their features are abstract and hierarchical and do not map directly onto biological insights.
All these aspects considered, the results presented in this study represent a relevant step towards improving the antigen discovery process, in terms of reduced animal testing, an increased hit ratio of good antigens, and a simplified computational discovery pipeline.
Supporting information
S1 Fig. Different processing of residue sequences.
The figure shows how a residue sequence is processed to become a vector of descriptors (left) or a PSE (right), with the corresponding final dimensions (1560 for descriptors, 1280 for PSEs).
https://doi.org/10.1371/journal.pone.0323895.s001
(TIFF)
S1 Table. Table of hyperparameters.
The table shows the different hyperparameters that were optimized during model selection.
https://doi.org/10.1371/journal.pone.0323895.s002
(XLSX)
References
- 1. Rappuoli R, Hanon E. Sustainable vaccine development: a vaccine manufacturer’s perspective. Curr Opin Immunol. 2018;53:111–8. pmid:29751212
- 2. Pizza M, Scarlato V, Masignani V, Giuliani MM, Aricò B, Comanducci M, et al. Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science. 2000;287(5459):1816–20. pmid:10710308
- 3. Maione D, Margarit I, Rinaudo CD, Masignani V, Mora M, Scarselli M, et al. Identification of a universal Group B streptococcus vaccine by multiple genome screen. Science. 2005;309(5731):148–50. pmid:15994562
- 4. de Alwis R, Liang L, Taghavian O, Werner E, The HC, Thu TNH, et al. The identification of novel immunogenic antigens as potential Shigella vaccine components. Genome Med. 2021;13(1):8. pmid:33451348
- 5. Moriel DG, Bertoldi I, Spagnuolo A, Marchi S, Rosini R, Nesta B, et al. Identification of protective and broadly conserved vaccine antigens from the genome of extraintestinal pathogenic Escherichia coli. Proc Natl Acad Sci U S A. 2010;107(20):9072–7. pmid:20439758
- 6. Wizemann TM, Heinrichs JH, Adamou JE, Erwin AL, Kunsch C, Choi GH, et al. Use of a whole genome approach to identify vaccine molecules affording protection against Streptococcus pneumoniae infection. Infect Immun. 2001;69(3):1593–8. pmid:11179332
- 7. Sanduja P, Gupta M, Somani VK, Yadav V, Dua M, Hanski E, et al. Cross-serotype protection against group A Streptococcal infections induced by immunization with SPy_2191. Nat Commun. 2020;11(1):3545. pmid:32669564
- 8. Bensi G, Mora M, Tuscano G, Biagini M, Chiarot E, Bombaci M, et al. Multi high-throughput approach for highly selective identification of vaccine candidates: the Group A Streptococcus case. Mol Cell Proteomics. 2012;11(6):M111.015693. pmid:22286755
- 9. Rappuoli R. Reverse vaccinology. Curr Opin Microbiol. 2000;3(5):445-50.
- 10. Dalsass M, Brozzi A, Medini D, Rappuoli R. Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery. Front Immunol. 2019;10:113. pmid:30837982
- 11. Yu NY, Wagner JR, Laird MR, Melli G, Rey S, Lo R, et al. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics. 2010;26(13):1608–15. pmid:20472543
- 12. Sachdeva G, Kumar K, Jain P, Ramachandran S. SPAAN: a software program for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics. 2005;21(4):483–91. pmid:15374866
- 13. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305(3):567–80. pmid:11152613
- 14. Fleri W, Paul S, Dhanda SK, Mahajan S, Xu X, Peters B, et al. The immune epitope database and analysis resource in epitope discovery and synthetic vaccine design. Front Immunol. 2017;8:278. pmid:28352270
- 15. Teufel F, Almagro Armenteros JJ, Johansen AR, Gíslason MH, Pihl SI, Tsirigos KD, et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol. 2022;40(7):1023–5. pmid:34980915
- 16. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA, USA: MIT Press; 2016.
- 17. Devlin J, Chang M, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv, preprint, 2019. https://arxiv.org/abs/1810.04805
- 18. Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings for machine learning. Bioinformatics. 2018;34(15):2642–8. pmid:29584811
- 19. Hu S, Ma R, Wang H. An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences. PLoS One. 2019;14(11):e0225317. pmid:31725778
- 20. Bepler T, Berger B. Learning protein sequence embeddings using information from structure. CoRR. 2019;abs(1902.08661).
- 21. Ong E, Wang H, Wong MU, Seetharaman M, Valdez N, He Y. Vaxign-ML: supervised machine learning reverse vaccinology model for improved prediction of bacterial protective antigens. Bioinformatics. 2020;36(10):3185–91. pmid:32096826
- 22. Biagini M, Bagnoli F, Norais N. Surface and exoproteomes of gram-positive pathogens for vaccine discovery. Curr Top Microbiol Immunol. 2017;404:309–37. pmid:28204975
- 23. Barocchi MA, Censini S, Rappuoli R. Vaccines in the era of genomics: the pneumococcal challenge. Vaccine. 2007;25(16):2963–73. pmid:17324490
- 24. De Groot AS, Rappuoli R. Genome-derived vaccines. Expert Rev Vaccines. 2004;3(1):59–76. pmid:14761244
- 25. Bowman BN, McAdam PR, Vivona S, Zhang JX, Luong T, Belew RK, et al. Improving reverse vaccinology with a machine learning approach. Vaccine. 2011;29(45):8156–64. pmid:21864619
- 26. Gromiha MM. Protein sequence analysis. In: Protein Bioinformatics. Elsevier; 2010, pp. 29–62. https://doi.org/10.1016/b978-8-1312-2297-3.50002-3
- 27. Hollas B. An analysis of the autocorrelation descriptor for molecules. J Math Chem. 2003;33(2):91–101.
- 28. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in neural information processing systems, vol. 26. Curran Associates, Inc.; 2013.
- 29. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. pmid:33876751
- 30. Vivona S, Bernante F, Filippini F. NERVE: new enhanced reverse vaccinology environment. BMC Biotechnol. 2006;6:35. pmid:16848907
- 31. He Y, Xiang Z, Mobley HLT. Vaxign: the first web-based vaccine design program for reverse vaccinology and applications for vaccine development. J Biomed Biotechnol. 2010;2010:297505. pmid:20671958
- 32. Jaiswal V, Chanumolu SK, Gupta A, Chauhan RS, Rout C. Jenner-predict server: prediction of protein vaccine candidates (PVCs) in bacteria based on host-pathogen interactions. BMC Bioinformatics. 2013;14:211. pmid:23815072
- 33. Rizwan M, Naz A, Ahmad J, Naz K, Obaid A, Parveen T, et al. VacSol: a high throughput in silico pipeline to predict potential therapeutic targets in prokaryotic pathogens using subtractive reverse vaccinology. BMC Bioinformatics. 2017;18(1):106. pmid:28193166
- 34. Zaharieva N, Dimitrov I, Flower DR, Doytchinova I. VaxiJen dataset of bacterial immunogens: an update. Curr Comput Aided Drug Des. 2019;15(5):398–400. pmid:30887928
- 35. Heinson AI, Ewing RM, Holloway JW, Woelk CH, Niranjan M. An evaluation of different classification algorithms for protein sequence-based reverse vaccinology prediction. PLoS One. 2019;14(12):e0226256. pmid:31834914
- 36. Rawal K, Sinha R, Nath SK, Preeti P, Kumari P, Gupta S, et al. Vaxi-DL: a web-based deep learning server to identify potential vaccine candidates. Comput Biol Med. 2022;145:105401. pmid:35381451
- 37. Zhang Y, Huffman A, Johnson J, He Y. Vaxign-DL: a deep learning-based method for vaccine design and its evaluation. bioRxiv, preprint, 2023. 2023.11.29.569096. pmid:38076796
- 38. Yang B, Sayers S, Xiang Z, He Y. Protegen: a web-based protective antigen database and analysis system. Nucleic Acids Res. 2011;39(Database issue):D1073-8. pmid:20959289
- 39. Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35(16):2856–8. pmid:30615063
- 40. Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011;8(10):785–6. pmid:21959131
- 41. Cao D-S, Xu Q-S, Liang Y-Z. propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics. 2013;29(7):960–2. pmid:23426256
- 42. Hosmer D, Lemeshow S, Sturdivant R. Applied logistic regression. John Wiley & Sons; 2013.
- 43. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
- 44. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: SIGKDD Proc. ACM Press; 2016, pp. 785–94.
- 45. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- 46. Bishop CM. Neural networks for pattern recognition. Oxford University Press; 1995.
- 47. Platt J. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Advances in large margin classifiers. MIT Press; 2000, pp. 61–74.
- 48. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281–305.
- 49. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning. PMLR; 2017, pp. 1321–30.
- 50. Savojardo C, Martelli PL, Fariselli P, Profiti G, Casadio R. BUSCA: an integrative web server to predict subcellular localization of proteins. Nucleic Acids Res. 2018;46(W1):W459–66. pmid:29718411
- 51. Wang Y, Wang L, Li Y, He D, Liu T. A theoretical analysis of NDCG type ranking measures. In: Shalev-Shwartz S, Steinwart I, editors. Conference on learning theory, vol. 30; 2013, pp. 25–54.
- 52. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019;47(D1):D339–43. pmid:30357391
- 53. Xiang Z, He Y. Vaxign: a web-based vaccine target design program for reverse vaccinology. Procedia Vaccinol. 2009;1(1):23–9.
- 54. Rahman MS, Rahman MK, Saha S, Kaykobad M, Rahman MS. Antigenic: an improved prediction model of protective antigens. Artif Intell Med. 2019;94:28–41. pmid:30871681
- 55. Tolloso M, Galfrè S, Pavone A, Podda M, Sîrbu A, Priami C. How much do DNA and protein deep embeddings preserve biological information? In: International Conference on Computational Methods in Systems Biology; 2024, pp. 209–25.
- 56. Xiong Y, Zeng Z, Chakraborty R, Tan M, Fung G, Li Y, et al. Nyströmformer: a Nyström-based algorithm for approximating self-attention. Proc AAAI Conf Artif Intell. 2021;35(16):14138–48. pmid:34745767
- 57. Beltagy I, Peters M, Cohan A. Longformer: the long-document transformer. CoRR. 2020;abs(2004.05150).
- 58. Martins P, Marinho Z, Martins A. ∞-former: infinite memory transformer. CoRR. 2021;abs(2109.00301).
- 59. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. pmid:36927031
- 60. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–10. pmid:35020807
- 61. Carriero A, Luijken K, de Hond A, Moons K, van Calster B, van Smeden M. The harms of class imbalance corrections for machine learning based prediction models: a simulation study. Stat Med. 2025;44(3–4).
- 62. Capela J, Zimmermann-Kogadeeva M, van Dijk ADJ, de Ridder D, Dias O, Rocha M. Comparative assessment of protein large language models for enzyme commission number prediction. BMC Bioinformatics. 2025;26(1):68. pmid:40016653
- 63. Mall R, Kaushik R, Martinez ZA, Thomson MW, Castiglione F. Benchmarking protein language models for protein crystallization. Sci Rep. 2025;15(1):2381. pmid:39827171
- 64. Russell W, Burch RL, Hume CW. The principles of humane experimental technique. Universities Federation for Animal Welfare; 1992.