Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics

Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing from bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models’ performances using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of peptides with highly diverse N- and C-termini generalized 76% better than the termini-restricted ones. Moreover, expanding the training set’s size by adding peptides from the multienzymatic digestion of samples from several species with five proteases led to a 2–3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10000 matching and overlapping peptides across mAb samples digested with six different proteases, achieving 100% sequence coverage for 8 of the 10 polypeptide chains. We anticipate that the MEMs’ demonstrated improvements to de novo analysis will positively impact several applications, such as the analysis of samples of high complexity or unknown nature, and the peptidomics field.


Introduction
Bottom-up mass spectrometry-based proteomics (MS) focuses on the sensitive identification and quantification of peptides and, thereby, proteins in arbitrarily complex samples [1,2]. In the standard workflow, peptides are first produced through the proteolysis of proteins with the enzyme trypsin. In the following step, the generated peptides are separated by liquid chromatography and measured by tandem mass spectrometry (LC-MS/MS). Finally, the peptide-spectrum matches (PSMs), i.e., the assignments of peptide sequences to individual MS spectra, are produced using comprehensive reference protein sequence databases [3].
Some of MS's remarkable applications are in the field of infection medicine proteomics, where it is employed to characterize the molecular mechanisms behind invasive bacterial diseases [4][5][6], model host-pathogen interactions [7][8][9][10][11][12][13], and investigate systemic proteome changes [14][15][16][17][18]. The use of the trypsin protease is justified by its efficiency, stability, and specificity, as it cleaves only at the C-terminal side of the basic residues arginine and lysine [19]. However, its applicability is limited by the amino acid composition of the target proteins and the pH of the digestion solution [20,21]. Proteases other than trypsin, such as elastase, Glu-C, Asp-N, pepsin, and ProAlanase, are employed to achieve different cleavage patterns or to work in various pH ranges [22][23][24][25]. Despite the increasing maturity of bottom-up MS, peptide identification is restricted to the sequences included in a reference database. Consequently, it is not possible to study proteins derived from organisms that are unsequenced or extinct, from environmental samples, or from microbiomes. Other examples involve therapeutic monoclonal antibodies, i.e., immune system proteins composed of heavy (HC) and light (LC) chains containing conserved and variable regions. The latter region is typically not contained in the traditional sequence databases for either chain [24,26,27]. To overcome this limitation, de novo MS peptide sequencing is intended to extract partial or complete sequence information directly from the collected MS spectra. In this strategy, the identities and positions of the amino acids are determined by the differences in mass of a series of consecutive fragments, for example, fragment ions of type b and y. To this end, programs have been created which implement algorithms based on graph theory, hidden Markov models, and linear and dynamic programming, such as PEAKS [28], NovoHMM [29], Lutefisk [30], Sherenga [31], pNOVO [32,33], and PepNovo [34], among others.
As in other fields of proteomics [35], the application of deep learning represented a performance breakthrough in de novo MS peptide sequencing, as in the case of DeepNovo [27]. Deep learning algorithms attempt to simulate the behavior of the human brain, albeit by using many connected layers of neurons, which allows them to learn multiple levels of representation of high-dimensional data [35][36][37][38]. This key aspect has translated into revolutionary advances in many research fields, such as image processing [39], speech recognition [40], and natural language processing [37]. In the supervised learning flavor, a model learns to make predictions based on labeled training data. Here, features like the amount of data and its diversity directly impact the resulting model's generalizability, i.e., its ability to react to new data and make accurate predictions. Therefore, generalizability is central to the success of a model and its further implementation [36,38]. The DeepNovo software outperformed other state-of-the-art methods at the level of amino acids and peptides. It combines convolutional and recurrent neural networks with local dynamic programming to learn the characteristics of tandem mass spectra, fragment ions, and sequence patterns of peptides. A later version (DeepNovoV2) added an order-invariant network architecture (T-Net) and a sinusoidal m/z positional embedding [41], and exceeds its predecessor by at least 13% at the peptide level [42].
It has been reported that the generation and analysis of overlapping peptides through multienzymatic digestion is an efficient procedure for tandem MS de novo protein sequencing [24,25,33,43]. Here, the same sample of the target protein is digested independently with a set of proteases with different cleavage patterns. Consequently, the generated peptides overlap and can be used to reconstruct the primary structure of the protein of interest. This approach can even resolve some of the challenges encountered in conventional strategies, which depend on the cloning/sequencing of coding mRNAs [43][44][45]. Given these facts, integrating the DeepNovo deep learning architecture to handle multienzymatic MS samples can be game-changing for the de novo protein sequencing field. Accomplishing this requires generalized models. In this context, we refer to de novo sequencing models capable of successfully decoding the MS spectra of peptides with varied N- and C-termini. Previous DeepNovo studies reported models trained exclusively on a compendium of tryptic peptides, referred to in this manuscript as trypsin-SEMs (trypsin Single Enzyme Models) [27,42]. This fact leaves the door open to questions related to the generalizability of the trypsin-SEMs. Firstly, it is uncertain whether these models have extended applicability to other MS datasets, i.e., whether they achieve high accuracy on samples generated using proteases with cleavage specificities different from the one employed to produce the model's training set. In like manner, it is unclear how the training set's composition impacts the resulting model's generalizability. Similarly, the characteristics of the target spectra that facilitate peptide sequencing remain unexplored.
In the present work, we studied the requirements for building generic DeepNovo models for the de novo MS sequencing task. For that purpose, we analyzed how the peptide composition and size of the training set affect the resulting model's generalizability. The efficiency of these de novo sequencing models was assessed on two highly sequence-diverse test sets by calculating the recall at the peptide level, i.e., the fraction of actual peptide sequences that were entirely correctly predicted [27,42]. The data consistently showed that using a collection of peptides with a wide variety of N- and C-terminal amino acids led to models that were 76% more generalizable than the termini-restricted ones. Furthermore, DeepNovo models kept improving in the de novo peptide MS sequencing task as we continued extending the training set data with the multienzyme digestion of samples from various species. We further demonstrated the relevance of these multienzyme deep learning (MEM) models by de novo sequencing the heavy and light monomeric chains of five commercial monoclonal antibodies (mAbs). MEMs fully sequenced 8 of 10 target proteins, extracting over 10000 confirming and overlapping peptides from mAb MS samples digested with six different proteases. We consider that MEMs, combined with other mass spectrometric techniques, will help in the de novo analysis of MS samples of higher complexity, such as mixtures of mAbs.
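As a minimal sketch of the peptide-level recall described above, the function below counts a peptide as recovered only when the predicted sequence equals the reference sequence over its full length. The function name, the use of plain string equality, and the toy sequences are illustrative assumptions, not the evaluation code used in the study (for instance, no special handling of the isobaric residues Ile/Leu is attempted here).

```python
def peptide_recall(true_peptides, predicted_peptides):
    """Fraction of reference peptides whose full sequence was predicted exactly.

    Assumption: one prediction per reference spectrum, in the same order;
    a missing prediction is represented by None.
    """
    if len(true_peptides) != len(predicted_peptides):
        raise ValueError("one prediction per reference peptide is expected")
    if not true_peptides:
        return 0.0
    hits = sum(t == p for t, p in zip(true_peptides, predicted_peptides))
    return hits / len(true_peptides)


# Toy example (hypothetical sequences): two of four peptides are fully correct.
truth = ["PEPTIDEK", "LSSPATLNSR", "AVDLLK", "GFYPSDIAVEW"]
preds = ["PEPTIDEK", "LSSPATINSR", "AVDLLK", None]
print(peptide_recall(truth, preds))  # 0.5
```

A single wrong residue (the second peptide) makes the whole peptide count as a miss, which is what makes this metric stricter than amino-acid-level recall.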

Results and discussions
To integrate DeepNovo into the de novo protein sequencing pipeline, we need deep learning models capable of performing de novo sequencing on MS spectra of samples digested with numerous proteases. Therefore, it is first necessary to determine the basis for building such generic models. For that purpose, we explored the effect of the training set composition on the resulting model generalizability, following the workflow in Fig 1. We initially created five peptide datasets by digesting Detroit 562 cell line samples with five proteases: trypsin, chymotrypsin, elastase, gluc, and pepsin (see Material and Methods section for LC-MS/MS and spectra annotation details). In each dataset, 21492 peptides were randomly selected and split into training (90%), validation (5%), and test (5%) sets (see De novo model generation and evaluation section for details). We then systematically built multiple models from the training sets' data. In order to assess all models' generalizability, it was essential to evaluate their performance on a dataset composed of highly variable peptides in terms of amino acid composition and peptide length distribution. For that reason, we constructed the Detroit test set by merging all five test sets.

Fig 1. We started with three sample cohorts: Detroit 562 cells, 5 commercially available antibodies, and a large collection of samples from different species. The samples were aliquoted and digested using five enzymes, measured using LC-MS/MS, and analyzed using traditional database searches with multiple search engines. All data were also analyzed using the published DeepNovo deep learning model. Several DeepNovo models were created (see text for details) and evaluated in three ways. The internal validation evaluated the model performance on data generated with the same enzyme(s) as the model was trained with. The external validation evaluated the model performance using data generated with enzyme(s) different from the model creation data. We finally assessed each model's performance in de novo sequencing five full-length antibodies.
https://doi.org/10.1371/journal.pcbi.1010457.g001 PLOS COMPUTATIONAL BIOLOGY

Here, we used the recall at the peptide level as a quantitative and comparative metric of all trained models' capability for successfully de novo sequencing peptides with varied N- and C-termini. Following that logic, we used the peptide recall on the complete test sets as a metric for the generalizability assessment (global peptide recall). Similarly, we calculated the peptide recall on the Detroit set's components, i.e., the local peptide recall. In addition, given that the protease employed during the sample preparation has a direct effect on the termini variability of the resulting peptide collection, we calculated the number of unique trimers at the N-terminus (Tn) and C-terminus (Tc) for the training sets of all the models generated in this study. Using trimers leads to unique and non-overlapping N- and C-terminal fragments, as the minimum peptide length for the DDA search was set to 6. Likewise, Tn and Tc values increase with the model size. Overall, Tn and Tc are quantitative metrics for the extent of the training sets' variability at each peptide terminus. Higher values of Tn and Tc represent higher variability in the peptide dataset at the N- and C-termini, respectively. Moreover, selecting trimers allowed us to measure the termini variability as a function of the model's size. We also introduced the diversity factor (DF), defined as log(Tn/Tc), as a measure of the variability balance between the training set's N- and C-termini. DF values near zero represent models with a better balance between the number of trimers at each terminus. Similarly, positive and negative DF values indicate a larger proportion of Tn and Tc, respectively. The S1 Table includes all generated models' diversity attributes and performance on the Detroit test set.
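The termini-diversity metrics Tn, Tc, and DF = log(Tn/Tc) can be sketched in a few lines. The base of the logarithm is not stated in the text, so the natural logarithm used below is an assumption (it can be swapped through the `log` parameter); the function name and the toy peptides are likewise hypothetical.

```python
import math


def termini_diversity(peptides, k=3, log=math.log):
    """Return (Tn, Tc, DF) for a collection of peptide sequences.

    Tn / Tc: number of unique k-mers at the N- / C-terminus (k=3 -> trimers).
    DF: log(Tn / Tc); the log base is an assumption (natural log by default).
    Peptides shorter than 2*k are skipped so the two k-mers never overlap.
    """
    usable = [p for p in peptides if len(p) >= 2 * k]
    tn = len({p[:k] for p in usable})
    tc = len({p[-k:] for p in usable})
    if tn == 0 or tc == 0:
        raise ValueError("no peptides long enough to extract terminal k-mers")
    return tn, tc, log(tn / tc)


# Toy, trypsin-like set: diverse N-termini, C-termini ending in K or R.
toy = ["ACDEFGLSK", "LMNPQGLSK", "TVWYAGLSK", "GHILMNVPR"]
tn, tc, df = termini_diversity(toy)
print(tn, tc, round(df, 3))  # 4 2 0.693
```

A positive DF, as in this toy set, reflects more variability at the N-terminus than at the C-terminus, which matches the behavior the text describes for specific proteases such as trypsin.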

Nonspecific enzymes training datasets yield more generalized models
We built the first round of models from the five individual enzyme datasets and identified them as Single Enzyme Models (SEMs). Fig 2 displays the characteristics and performance of all five SEMs on the Detroit test set. Two findings are worth mentioning regarding SEMs: 1) Using less specific proteases for peptide generation leads to training sets that are more balanced between the N- and C-termini (Fig 2A). In contrast to pepsin, the trypsin protease has a highly specific cleavage pattern that generates a training set with high Tn and low Tc values, as the peptides end in either arginine or lysine. This observation is supported by the DF values for SEMs, i.e., pepsin (0.52) < chymotrypsin (0.76) < elastase (0.79) < Glu-C (0.88) < trypsin (1.50); 2) Models' generalizability correlates inversely with DF values (Fig 2B). When comparing performance on the Detroit test set at the peptide level, data shows that SEMs built with less specific enzymes, specifically pepsin-SEM, chymotrypsin-SEM, and elastase-SEM, outperform by 14-46% those generated from proteases with more specific cleavage patterns, such as gluc-SEM and trypsin-SEM. These differences in SEMs' generalizability are explained when considering their local peptide recall on the Detroit set's components (Fig 2C). We found that the most contributing factor was related to the models' performance on inter-enzyme datasets, i.e., where the proteases for generating the training and test sets differed. An illustrative example arises when considering the local peptide recall on the chymotrypsin, elastase, and gluc peptide datasets, for which the performance of the pepsin-SEM was approximately 46-86% higher than that of the trypsin-SEM. In addition, all SEMs performed best when there was a match between the protease employed to generate the SEM's training set and the Detroit set's portion. In these cases, local peptide recall ranged from 0.46 to 0.69. These results are comparable to previous DeepNovo works where only trypsin was used [27,42]. Here, local peptide recall values show that less specific SEMs outperformed the highly cleavage pattern-specific ones by 6-48%. These results suggest that SEMs generated from the digestion with trypsin and gluc are more biased at the spectra decoding stage, especially when proposing the peptides' C-terminal amino acids.
Inspired by the results of the first round, we then decided to test whether it was possible to modulate models' generalizability as a function of their training set's diversity factor. For that purpose, we built new models distributed in two categories: 11 monoterminal (MoTMs) and 12 multiterminal (MuTMs) models. In MoTMs, the training sets were restricted to peptides that share one specific amino acid at one of the terminal positions. Given the MS data available, we built AlaN, GlyN, GluC, IleC, IleN, PheC, ArgC, LysC, ThrN, SerN, and ValN MoTMs. The models' nomenclature is composed of the amino acid three-letter code followed by the terminus type; for example, in the ThrN and PheC MoTMs' training sets, all peptides have a Thr or Phe amino acid at the N- or C-terminus, respectively. Contrary to MoTMs regarding the diversity factor feature, MuTMs prioritized maximum variability at both termini by selecting peptides from all SEMs' training sets. Furthermore, for a fair comparison with the previous SEMs' global and local peptide recall results, MoTMs and MuTMs were built with the same number of spectra as SEMs. Considering SEMs as reference, three new groups are distinguishable regarding the distributions of Tn and Tc values (Fig 3A). Two groups belong to MoTMs, which have low Tn and Tc values for the N-terminus and C-terminus restricted MoTMs, respectively. The third group belongs to the MuTMs, containing high values for both Tn and Tc parameters.
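The monoterminal naming scheme lends itself to a small illustration. The sketch below parses a model name such as ThrN or PheC and keeps only the peptides carrying that residue at the indicated terminus. The helper names and the toy peptide pool are hypothetical, and the subsampling step used to match the number of spectra in the SEMs is not reproduced here.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def parse_motm_name(name):
    """Split a name like 'ThrN' into (one-letter residue, terminus)."""
    residue, terminus = name[:3], name[3:]
    if residue not in THREE_TO_ONE or terminus not in ("N", "C"):
        raise ValueError(f"unrecognized monoterminal model name: {name!r}")
    return THREE_TO_ONE[residue], terminus


def select_monoterminal(peptides, name):
    """Keep peptides whose N- or C-terminal residue matches the model name."""
    residue, terminus = parse_motm_name(name)
    if terminus == "N":
        return [p for p in peptides if p.startswith(residue)]
    return [p for p in peptides if p.endswith(residue)]


# Hypothetical peptide pool drawn from several enzyme datasets.
pool = ["TPEPTIDEK", "AVDLLF", "TGGSVR", "ELVISF"]
print(select_monoterminal(pool, "ThrN"))  # ['TPEPTIDEK', 'TGGSVR']
print(select_monoterminal(pool, "PheC"))  # ['AVDLLF', 'ELVISF']
```

By construction, such a training set has a single residue at the restricted terminus, which is what drives the low Tn or Tc values reported for these models.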

the other half of the models were better than the chymotrypsin-SEM one. In contrast, 10 of 11 MoTMs were worse than the trypsin-SEM at generalizing. On the other hand, the models' performance on the Detroit test set's components shows how MuTMs cluster together, as they exhibited more uniform local peptide recall values across all sample types (Fig 3C). On the contrary, the MoTMs' performance depended on the overlap in cleavage rules between the sample and the model's training set. For example, the ArgC-MoTM performed best on the trypsin sample. However, local peptide recall values dropped 57-90% in the remaining sample types. Similar behavior in performance was observed in other MoTMs, such as GluC-MoTM and PheC-MoTM. These observations suggest that, with the same amount of training data, it is possible to design more generalizable models by maximizing and balancing the training set's Tn and Tc values.

Large MEMs perform best
Since all SEMs perform best on data types similar to the model's training set, we then decided to build 26 new models by mixing all possible combinations of the five Single Enzyme Models' training sets, i.e., multienzyme models (MEMs) from the combination of 2 (10 MEMs), 3 (10 MEMs), 4 (5 MEMs), and 5 (1 MEM) SEM-datasets. Here, the MEM composed of all five Detroit 562 peptide datasets was called the Kilo MEM. Data shows that appending one or more different peptide datasets to any existing SEM dataset yields growth in Tn, Tc, and generalizability parameters for the resulting MEM (Fig 4). As expected, the increase in Tn and Tc values was more noticeable when the merged datasets did not share the same cleavage rules, as in the chymotrypsin-gluc and trypsin-elastase-gluc dataset combinations (Fig 4A). Furthermore, generalizability and diversity factor values suggest that MEMs generalize better and are more termini-balanced as we increase the number of peptide datasets (Fig 4B). An illustrative example of MEMs' rising performance is shown in Fig 4C, where we displayed the path to generating the Kilo MEM from the pepsin-SEM. Two observations are worth mentioning: 1) new datasets contributed positively to the resulting MEMs' generalization, and 2) the formed MEM always performed better than its predecessor models. The Kilo MEM not only doubled the termini variability of the peptide dataset but also produced an increase of 38% in diversity factor with respect to all SEMs. As a result, the Kilo MEM outperforms the SEMs by a factor of 1.8-2.4.
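The count of 26 MEMs follows directly from enumerating every combination of two or more of the five enzyme datasets. The sketch below reproduces that enumeration with the standard library; the `merge_training_sets` helper and the idea of representing each training set as a list of peptides are illustrative assumptions rather than the study's data-handling code.

```python
from itertools import combinations

ENZYMES = ["trypsin", "chymotrypsin", "elastase", "gluc", "pepsin"]


def multienzyme_combinations(enzymes):
    """All combinations of 2..n enzyme datasets (one MEM per combination)."""
    return [
        combo
        for size in range(2, len(enzymes) + 1)
        for combo in combinations(enzymes, size)
    ]


def merge_training_sets(datasets, combo):
    """Concatenate the per-enzyme training sets named in `combo`.

    `datasets` maps an enzyme name to a list of training peptides
    (a hypothetical in-memory representation).
    """
    merged = []
    for enzyme in combo:
        merged.extend(datasets[enzyme])
    return merged


combos = multienzyme_combinations(ENZYMES)
per_size = {n: sum(len(c) == n for c in combos) for n in range(2, 6)}
print(len(combos), per_size)  # 26 {2: 10, 3: 10, 4: 5, 5: 1}
```

The single five-enzyme combination corresponds to the model the text calls the Kilo MEM.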
The results of the SEMs and MEMs demonstrated that features such as the training set's size and peptide sequence variability significantly impact the resulting model's generalizability. At this point, we hypothesized that expanding sequence variability by creating a training set that includes peptides from different species would lead to a more generic model than the Kilo MEM. To test this, we generated an external dataset, called here Giga, by digesting samples from various species, such as Saccharomyces cerevisiae, Escherichia coli, Equus caballus, Streptococcus pyogenes, and Mus musculus, with trypsin, chymotrypsin, elastase, and gluc proteases. We followed the same protocol for sample injection, MS detection, and database search (see Material and methods). After spectra annotation, the Giga dataset was ten times larger than the Detroit 562 dataset. We then trained the Giga MEM, applied it to the Detroit test set, and compared the results with those of the Kilo MEM. Data shows that the Giga MEM generalized 29.4% better than the Kilo MEM, outperforming it by 24-41% on all of the sample types composing the Detroit test set (Fig 5). In the same way, the Giga MEM generalizes 2.1-3.0 times better than the SEMs.
The Giga dataset was also used as an external test set. Specifically, we tested the generalizability of the 5 SEMs and 26 MEMs. Interestingly, generalizability values on the Giga test set supported our previous findings on the best conditions to build more generic models (S1 Text). Here, it is crucial to mention the pepsin-SEM results. In the Detroit test set case, the largest portion of de novo sequenced spectra corresponded to peptides generated with the same protease as the SEM's training set. However, pepsin was not part of the multienzyme protocol for generating the Giga external peptide test set. Despite that, the pepsin-SEM performed best among all SEMs. Overall, generalizability results on the Detroit and Giga test sets suggest that, like other deep learning architectures, DeepNovo kept improving in the de novo peptide MS sequencing task as we fed the model with extensive and highly diverse peptide MS data.

Fragment ion distribution impacts MS de novo peptide sequencing
After establishing the criteria for building generalizable models, we further explored how the peptide composition impacts the ability to de novo sequence its spectrum correctly. In this respect, we studied the Kilo MEM results on the Giga test set (Fig 6). Initially, we evaluated the effect of the peptide length distribution on the overall deep learning model's performance by tracking the peptide recall as we varied the maximum peptide length (Fig 6A). We observed that performance decreased as we included longer peptides in the test set. Data shows that the probability of correctly de novo MS sequencing 6-residue peptides was 86.1% and fell quickly to 40% when considering peptides of up to 14 residues.
Moreover, this performance decay differed for all components of the Giga set, suggesting that the identity of the peptides also impacts their chance of being MS sequenced. To explain these differences across the four datasets, we calculated the peptide length distribution (Fig 6B). Data shows that 75% of the data in the elastase dataset are peptides of length 12 or shorter, explaining why it was easier to de novo MS sequence elastase data than chymotrypsin and gluc data. For the latter, 75% of the data were peptides of length 13 or longer.
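The length-dependence analysis can be illustrated by recomputing the peptide recall on the subset of peptides up to a given maximum length. This is a minimal sketch under the same exact-match assumption used for peptide recall; the function name and toy sequences are hypothetical and do not correspond to the data behind Fig 6A.

```python
def recall_by_max_length(true_peptides, predicted_peptides, max_lengths):
    """Peptide recall restricted to reference peptides of length <= max_len.

    Returns {max_len: recall}; recall is None when no peptide qualifies.
    """
    result = {}
    for max_len in max_lengths:
        pairs = [
            (t, p)
            for t, p in zip(true_peptides, predicted_peptides)
            if len(t) <= max_len
        ]
        if pairs:
            result[max_len] = sum(t == p for t, p in pairs) / len(pairs)
        else:
            result[max_len] = None
    return result


# Toy example: the two longest peptides are predicted incorrectly.
truth = ["AVDLLK", "PEPTIDEK", "LSSPATLNSR", "GFYPSDIAVEWESNGQPENNYK"]
preds = ["AVDLLK", "PEPTIDEK", "LSSPATINSR", "GFYPSDIAVEWESNGQ"]
curve = recall_by_max_length(truth, preds, [6, 8, 10, 30])
print(curve)
```

In this toy case the recall stays at 1.0 up to length 8 and drops once the longer, mispredicted peptides enter the subset, mirroring the qualitative trend described above.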
Since the peptide length distributions could not explain performance differences related to the trypsin sample, we further calculated the singly-charged b- and y-ion recall for all peptide spectra composing the Giga test set, i.e., the proportion of the fragment ions found experimentally over the total number theoretically expected. Here the ion recall is a quantitative metric of the ability of a particular peptide to produce b/y-ions under specific experimental conditions [24,25,46,47]. For the fragment ion extraction, the m/z tolerance was 15 ppm. We also calculated the peptide recall as a function of the minimum values for the b/y-ion recall pairs. The b/y-ion recall grid shows that the probability of correctly de novo MS sequencing a peptide increases with its capacity to produce either b- or y-ions (Fig 6C). Data shows that the global peptide recall on the Giga test set was higher than 70% when peptides produced at least 80% and 60% of the expected b- and/or y-ion fragments. These results suggest that the de novo MS sequencing performance on a specific sample type is bound to its b/y-ion recall distributions. Fig 6D shows that the order of the y-ion recall distributions fits the peptide recall behavior for all sample types. It is worth mentioning that the tryptic peptides had the highest proportion of the expected singly-charged y-ions compared to the other sample types, explaining their remarkable performance across a wide range of peptide lengths (Fig 6A), i.e., 55% of the annotated spectra had at least 60% of the expected y-ions. For these peptides, y-ion fragments bear a charged residue, like arginine or lysine, and are more abundant and produce more intense peaks under the HCD fragmentation method [48,49]. On the contrary, the peptides from the digestion with gluc had a low proportion of y- and b-ions (Fig 6E). Furthermore, elastase b/y-ion recall distributions are consistent with a high proportion of short peptides.
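To make the ion recall concrete, the sketch below computes the theoretical singly-charged b- and y-ion m/z values of a peptide from standard monoisotopic residue masses and reports the fraction matched by observed peaks within a 15 ppm tolerance. It is a simplified illustration rather than the extraction procedure used in the study: modifications are ignored (cysteine is treated as unmodified, although the samples were alkylated), the tolerance is taken relative to the theoretical m/z, and the function names and the simulated peak list are assumptions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def theoretical_by_ions(peptide):
    """Singly-charged b1..b(n-1) and y1..y(n-1) m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def ion_recall(theoretical_mz, observed_mz, tol_ppm=15.0):
    """Fraction of theoretical ions with an observed peak within tol_ppm."""
    if not theoretical_mz:
        return 0.0
    found = sum(
        any(abs(obs - mz) / mz * 1e6 <= tol_ppm for obs in observed_mz)
        for mz in theoretical_mz
    )
    return found / len(theoretical_mz)


# Simulated spectrum: all y-ions and the first three b-ions, each shifted
# by 5 ppm, plus one unrelated peak.
b_ions, y_ions = theoretical_by_ions("PEPTIDEK")
observed = [mz * (1 + 5e-6) for mz in y_ions]
observed += [mz * (1 - 5e-6) for mz in b_ions[:3]]
observed.append(500.0)
print(ion_recall(b_ions, observed), ion_recall(y_ions, observed))
```

For this simulated spectrum the y-ion recall is 1.0 while the b-ion recall is 3/7, the kind of imbalance the text reports for tryptic peptides.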

MEMs for full-length de novo sequencing of antibodies
Once we established the requirements for building generalizable models and how the quality of the input spectra impacts the subsequent de novo MS peptide sequencing process, we tested the efficiency of using the MEMs in the de novo protein sequencing pipeline. For this effort, we selected a challenging system of biological interest: the complete sequencing of monoclonal antibodies (mAbs). We aimed to fully de novo MS sequence the heavy (HC) and light (LC) chains of five commercial mAbs: Erbitux, Herceptin, Prolia, Silulite, and Xolair. We digested each mAb sample with six proteases: trypsin, chymotrypsin, elastase, gluc, pepsin, and aspn. It is worth mentioning that the latter enzyme was not part of the models' generation protocol. In addition, we created the Giga+ MEM by combining the training sets of the Kilo and Giga MEMs. We considered eight models (5 SEMs + 3 MEMs) for comparison purposes. For analyzing the results, we initially calculated the relative coverage for the entire variable space, i.e., the models x samples x chains matrix (Fig 7A, S2 Text, S1 Table). This way, we gained insight into the model performance across all sample types and into which enzymes facilitate the de novo sequencing of the HC and LC subunits. In addition, we examined the length distribution of the sequence-matching peptides for all sample types (Fig 7B). These plots provide information about the decoding power of the models. They also show the capacity of the different proteases to produce easily detectable peptides from the de novo MS sequencing perspective. Here, we initially discussed the impact of using different proteases for the de novo sequencing of monoclonal antibodies, specifically, how it affects the ability of all models to achieve high protein coverage and generate a high number of different easy-to-decode peptides. We then

Regarding the sample types, the data shows that working with the chymotrypsin and elastase proteases had many benefits related to good protein coverage (Fig 7A) and the extraction of a large number of matching peptides (Fig 7B). Data shows that digesting the samples with these proteases yields better individual protein coverage: in 75% of cases, the relative sequence coverage was at least 0.80 and 0.75 for chymotrypsin and elastase, respectively. Additionally, the total number of peptides extracted was 2 to 8 times greater than for the rest of the proteases (S2 Text). It is worth noting that these were the only enzymes where, for lengths between 6 and 9, all of the considered deep learning models identified more than 100 peptides. These observations suggest that working with the chymotrypsin and elastase proteases leads to large numbers of peptides that are readily sequenceable by de novo MS. As expected, the gluc and aspn digested samples had the lowest matching peptide extraction values, yielding the worst individual protein coverages. These proteases produced long peptides with low b- and y-ion recalls, making them more difficult to de novo sequence.
When comparing the performance of the deep learning models, the Giga and Giga+ MEMs were evidently superior in terms of protein coverage and the number of matching peptides extracted. For the Giga+ MEM, the median value of protein coverage was 0.96 after considering all mAbs and sample types. Moreover, it extracted 10367 unique and confirming peptides, an amount 2-2.8 times greater than the Kilo MEM and all SEMs (S2 Text). Interestingly, and based on the same parameters, the pepsin SEM was the best among the five SEMs. These findings supported our previous statements about the necessary criteria for building generalizable models. It is worth noting that, for the combined sample results, the Giga+ MEM sequenced all light chains and the heavy subunits of 3 of the 5 mAbs, i.e., Herceptin, Silulite, and Xolair. The remaining proteins had coverage of at least 0.97. It is essential to consider that, in mAbs, the HC subunit can bear glycans in its constant region [50,51]. In some cases, such as for Erbitux, glycans are also found in the HC variable region [52].
As the overlapping of peptides is necessary for the assembly of protein sequences, we also decided to go deeper into the analysis of the mAbs' de novo results and introduce the confident positional score (CS). For a residue at position i of the protein sequence, it is defined as C_i = log2(f_i + 1). Here f_i is the positional frequency for position i, i.e., the number of de novo sequenced matching peptides covering position i in the protein sequence (Fig 7C). Higher consecutive CS values represent regions with more evidence in the de novo protein sequencing process, which is especially important for the mAbs' HC and LC variable regions, for which the sequences are unknown. In contrast, sequence regions with no detected peptides have a zero positional frequency and, therefore, a zero CS value. After combining all sample types, the Giga+ MEM obtained a positional frequency greater than ten for 90.7% of the amino acids comprising the study mAbs. Moreover, this parameter value increased to 50 or more for 45.7% of said amino acids. Similarly, there were no confirming peptides for only 0.03% of residues. Furthermore, for the mAbs' variable regions, the median positional frequency was 45 and 51 for the HC and LC subunits, respectively (S3 Text). For the five HC subunits, data shows that CS values decreased by up to 30% in the regions surrounding the glycans, likely because of a steric effect, as these bulky species prevent efficient digestion. In the case of the Erbitux mAb, the regions with zero CS values matched the glycan locations for the HC constant and variable domains (Fig 8), suggesting that glycan removal should be incorporated into the sample preparation to guarantee the complete MS sequencing of mAbs. Given the coverage and positional frequency results, the findings discussed here set a precedent for using multienzymatic deep learning models as an alternative for sequencing proteins from their multienzymatic digestion.
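The positional frequency and the confident positional score C_i = log2(f_i + 1) can be sketched as follows, together with the relative coverage (the fraction of residues supported by at least one peptide). The sketch assumes that matching peptides are located by exact substring search against a known reference sequence and that every occurrence of a peptide is counted; the function names, the toy protein, and the toy peptides are hypothetical.

```python
import math


def positional_frequency(protein, peptides):
    """f_i: number of matching peptides covering each protein position i."""
    freq = [0] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:                      # count every occurrence
            for i in range(start, start + len(pep)):
                freq[i] += 1
            start = protein.find(pep, start + 1)
    return freq


def confident_positional_scores(freq):
    """C_i = log2(f_i + 1); zero where no peptide covers the position."""
    return [math.log2(f + 1) for f in freq]


def relative_coverage(freq):
    """Fraction of positions covered by at least one matching peptide."""
    return sum(f > 0 for f in freq) / len(freq) if freq else 0.0


# Toy example: three overlapping peptides on a 14-residue sequence.
protein = "ACDEFGHIKLMNPQ"
peptides = ["ACDEFG", "DEFGHIK", "GHIKLM"]
freq = positional_frequency(protein, peptides)
scores = confident_positional_scores(freq)
print(freq)      # [1, 1, 2, 2, 2, 3, 2, 2, 2, 1, 1, 0, 0, 0]
print(relative_coverage(freq))
```

The last three residues are covered by no peptide, so their score is zero, while the residue shared by all three peptides reaches log2(4) = 2.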
In future studies, it may be interesting to explore using the multienzyme de novo sequencing protocol in conjunction with other complementary MS techniques like Top-down to sequence mixtures of mAbs. Similarly, the positional frequency concept opens room for developing new MS-based protein assembly methods.

Conclusions
We proposed the use of MEMs to improve the de novo sequencing of peptides and proteins from DDA-MS data. Toward that aim, the effects of the properties of the training and test sets on the de novo sequencing process were explored. On the one hand, the data suggest that variability at both terminals, among the peptides which make up the training set, affects eventual generalizability. Consequently, the use of multiple proteases is recommended to generate more robust models. In the same vein, since DeepNovoV2 learns characteristics of spectra and sequences, an increase in the number of data points also improves the resultant model performance. These claims are supported by the peptide recall results for the test sets and the number of peptides extracted from the samples produced by multienzymatic digestion of commercial antibodies. On the other hand, it was discovered that the models' de novo sequencing capacity is limited by the identity of the peptides and experimental conditions, which have direct consequences on their ability to produce ionic fragments of interest. This result explains why peptide recall fell with an increase in length of peptides, as well as the differences found among the samples from trypsin, chymotrypsin, elastase, and gluc. Finally, the findings described here will assist in other areas of peptidomics, the creation of Data-Independent-Acquisition libraries, and the sequencing of complex mixtures of monoclonal antibodies.

Sample preparation for mass spectrometry
Commercial antibodies. For sample preparation for mass spectrometry of commercial antibodies, 10 μg of each (Xolair, Novartis; Herceptin, Roche; SiLuLite, Sigma MSQC4 Universal Antibody Standard; Prolia, Amgen; and Erbitux, Merck) was denatured with 8 M urea/100 mM ammonium bicarbonate, the disulfide bonds were reduced with 5 mM Tris (2-carboxyethyl) phosphine hydrochloride (TCEP) for 60 min at 37˚C, 800 rpm, and alkylated with 10 mM iodoacetamide for 30 min in the dark at room temperature. The samples were diluted to a urea concentration <1.5 M with 100 mM ammonium bicarbonate. The antibodies were digested separately with 1 μg of trypsin, Promega; chymotrypsin, Promega; LysC/trypsin, Promega; elastase, Promega; GluC, Promega; or AspN, Promega for 18 h at 37˚C, 800 rpm. The digested samples were acidified with 10% formic acid to a pH of 3.0. The peptides were purified and desalted using SOLAμ reverse phase extraction plates (Thermo Scientific) according to the manufacturer's instructions. Peptides were dried in a speedvac and reconstituted in 2% acetonitrile, 0.2% formic acid prior to mass spectrometric analyses.
Detroit 562 cell line. Briefly, ca. 5 million cultured mammalian epithelial cells (Detroit 562 cell line) were kindly provided by Sounak Chowdhury. Suspension cells were first centrifuged at 5000 × g, 4˚C for 10 min, followed by aspiration of the supernatant and one wash with cold 1X PBS. The remaining cell pellets were then resuspended in 1 ml of lysis working solution, composed of 1X RIPA lysis and extraction buffer, ThermoFisher, and 1X protease/phosphatase inhibitor cocktail, ThermoFisher. After a 15 min incubation on ice, the cell lysates were precipitated with trichloroacetic acid (TCA), washed with 3X acetone, and dried in a speedvac. Completely dried protein extracts were reconstituted in 100 mM ammonium bicarbonate buffer, and the protein concentration was measured by BCA assay, ThermoFisher. For each reaction, 10 μg of cell lysate proteins was aliquoted, with 10 experimental replicates for each enzyme and 5 enzymes in total. Reduction, alkylation, enzyme digestion, and acidification were performed as described above, except that C18 spin columns were used for purification and desalting after the digestion. The exception was the pepsin-digested group, for which LysC/trypsin was introduced for a 1 h pre-digestion prior to a 1 h pepsin digestion.

Liquid chromatography tandem mass spectrometry
The peptides of the digested commercial antibodies were analyzed on a Q Exactive HF-X mass spectrometer (Thermo Scientific) connected to an EASY-nLC 1200 ultra-high-performance liquid chromatography system (Thermo Scientific). The peptides were loaded onto an Acclaim PepMap 100 (75 μm x 2 cm) C18 (3 μm, 100 Å) pre-column and separated on an EASY-Spray column (Thermo Scientific; ID 75 μm x 50 cm, column temperature 45˚C) operated at a constant pressure of 800 bar. A linear gradient from 3 to 38% of 80% acetonitrile in aqueous 0.1% formic acid was run for 120 min at a flow rate of 350 nl/min. One full MS scan (resolution 120,000 @ 200 m/z; mass range 350-1650 m/z) was followed by MS/MS scans (resolution 15,000 @ 200 m/z) of the 15 most abundant ion signals. The isolation window for the precursor ions was 1.3 m/z, and they were fragmented using higher-energy collisional dissociation (HCD) at a normalized collision energy of 28. Charge state screening was enabled, and precursor ions with unknown charge states, a charge state of 1, or a charge state over 6 were rejected. Data were additionally collected for non-tryptic digestions as above, but including peptides with a charge state of 1. The dynamic exclusion window was 10 s. The automatic gain control was set to 3e6 and 1e5 for MS and MS/MS, with ion accumulation times of 45 ms and 30 ms, respectively.

Computational analyses
Spectra annotation. A Snakemake [53] workflow was created for the DDA search. All DDA raw files were initially converted to Mascot generic format (MGF) with the ThermoRawFileParser software. The Ursgal package [54,55] was used as an interface for searching the spectra against data's  [63]. Moreover, a threshold of 1% peptide FDR was set for decisive candidate inclusion.
De novo model generation and evaluation. The process of creating a model involves three steps: 1) establishing the training, validation, and test sets; 2) creating the input files for DeepNovoV2 [42]; and 3) training the model. Starting from the DDA search results for the Detroit 562 system, we extracted an equal number of annotated scans from each of the five protease datasets to allow a fair quantitative comparison of the resulting models. Accordingly, limited by the dataset with the lowest number of unique peptides (the GluC dataset), we randomly selected 21492 peptides. These were then randomly divided into training, validation, and test sets in proportions of 90%, 5%, and 5%, respectively. For the second step, a Snakemake workflow was created to extract the selected spectra and generate the feature and MGF files. Finally, the models were trained for 20 epochs [27,42]. For the feature extraction process, we considered a total of 12 ion types: a, b, y, a(2+), b(2+), y(2+), a-H2O, b-H2O, y-H2O, a-NH3, b-NH3, and y-NH3. The maximum peptide mass and length were set to 4000 Da and 30 residues, respectively. These models were called SEMs.
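The balancing and splitting steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the fixed random seed, and the placeholder peptide identifiers are hypothetical, and the split is done at the level of unique peptides so that no peptide appears in more than one set. It is not the Snakemake workflow used in this study.

```python
import random


def balance_datasets(datasets, seed=0):
    """Downsample every protease dataset to the size of the smallest one.

    `datasets` maps a protease name to a list of peptide identifiers.
    """
    n_min = min(len(set(peps)) for peps in datasets.values())
    rng = random.Random(seed)
    return {
        name: rng.sample(sorted(set(peps)), n_min)
        for name, peps in datasets.items()
    }


def split_peptides(peptides, fractions=(0.90, 0.05, 0.05), seed=0):
    """Randomly split unique peptides into training, validation, and test sets."""
    unique = sorted(set(peptides))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = int(len(unique) * fractions[0])
    n_valid = int(len(unique) * fractions[1])
    train = unique[:n_train]
    valid = unique[n_train:n_train + n_valid]
    test = unique[n_train + n_valid:]  # remainder, so nothing is dropped
    return train, valid, test


# Placeholder identifiers standing in for peptide sequences (hypothetical).
datasets = {
    "gluc": [f"G{i}" for i in range(21492)],
    "trypsin": [f"T{i}" for i in range(30000)],
}
balanced = balance_datasets(datasets)
train, valid, test = split_peptides(balanced["gluc"])
```

With 21492 peptides, integer truncation gives 19342 training and 1074 validation peptides, and the remaining 1076 go to the test set; the exact rounding rule used in the study is not stated, so these counts are only indicative.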
The initial models were evaluated through full cross-validation, i.e., each model was applied to every test set, to obtain a perspective on the performance for each test set as well as overall. The same modifications employed in the database search were considered for all of the de novo searches in this research. Additionally, the maximum deviation of the precursor mass was set to 15 ppm. Peptide recall was used as the measure of model quality.
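The two quantities used in this evaluation, the precursor mass deviation in ppm and the peptide recall, can be written out as a short Python sketch. It assumes that peptide recall is the fraction of reference peptides whose full sequence is predicted correctly, and that leucine and isoleucine are treated as equivalent because they are isobaric; the function names and toy sequences are hypothetical and are not taken from DeepNovoV2.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Relative mass deviation in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(observed_mass, theoretical_mass, max_ppm=15.0):
    """True if the precursor mass deviation is within +/- max_ppm."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= max_ppm


def normalize(sequence):
    """Assumption: I and L are isobaric, so they are treated as the same residue."""
    return sequence.replace("I", "L")


def peptide_recall(predicted, reference):
    """Fraction of reference peptides whose full sequence was predicted correctly.

    `predicted` may contain None for spectra without any prediction.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        p is not None and normalize(p) == normalize(r)
        for p, r in zip(predicted, reference)
    )
    return hits / len(reference)


# Toy example (hypothetical sequences, not data from the study).
reference = ["PEPTIDEK", "LVQPGGSLR", "ASGFTFSR", "DTYIHWVR"]
predicted = ["PEPTIDEK", "IVQPGGSLR", "ASGFTSFR", None]
recall = peptide_recall(predicted, reference)
```

In this toy case, two of the four reference peptides are recovered in full (the second only after I/L normalization), so the recall is 0.5; a single swapped residue pair, as in the third peptide, counts as a miss.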
Structural modeling. The Fc and Fab domains of each antibody were modeled de novo and separately with AlphaFold2 [64,65], using MMseqs2 [66] to generate the multiple sequence alignment and a homo-oligomer state of 1:1. For each selected model, the sidechains and the disulfide bridges were adjusted and relaxed using the Rosetta relax protocol [67]. The loops in the hinge region were then re-modeled and characterized using the DaReUS-Loop web server [68]. Finally, the full-length structure was relaxed, and all disulfide bridges (specifically in the hinge region) were adjusted using the Rosetta relax protocol. The monoclonal antibodies were visualized with the UCSF Chimera software [69].