False discovery rate estimation and heterobifunctional cross-linkers

False discovery rate (FDR) estimation is a cornerstone of proteomics that has recently been adapted to cross-linking/mass spectrometry. Here we demonstrate that heterobifunctional cross-linkers, while theoretically different from homobifunctional cross-linkers, need not be considered separately in practice. We develop and then evaluate the impact of applying a correct FDR formula for use of heterobifunctional cross-linkers and conclude that there are minimal practical advantages. Hence a single formula can be applied to data generated from the many different non-cleavable cross-linkers.


Introduction
Cross-linking mass-spectrometry (CLMS) has become an increasingly popular tool for analyzing protein structures, protein networks and protein dynamics [1][2][3][4]. Recently the question of what is the correct error estimation to use with CLMS has been addressed with the help of a target-decoy database approach [5], based on previous work for cross-linked [6,7] and linear peptides [8][9][10][11]. This approach to estimating a false discovery rate (FDR) of cross-links is based on the assumption that the cross-linker used is homobifunctional, i.e. have the same reactive group on either end. However, heterobifunctional cross-linkers are also used in the field, for example 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide hydrochloride (EDC) [12][13][14][15] or succinimidyl 4,4'-azipentanoate (SDA) [15][16][17][18][19][20][21]. It is unclear how far these cross-linker choices affect FDR estimation as they do link different amino acids and consequently one has to consider different search spaces for each site of the cross-linker. Here, we provide some theoretical insights on extending the target-decoy approach to FDR estimation when using heterobifunctional cross-linkers, and assess whether it is necessary to use a different formula for FDR estimation. Note that these considerations are for non-cleavable cross-linkers. While MScleavable cross-linkers with independent identification of both peptides could be treated the same way, by taking the two identifications as one combined identification, they are currently handled differently for FDR estimation [22,23].

Results and discussion
Currently, the most commonly used cross-linkers are non-directional, e.g. when looking at a mass-spectrum of a cross-linked peptide, there is no means to distinguish a cross-link that was formed as peptide A linked to peptide B, than from a cross-link formed as peptide B linked to peptide A. But the most commonly used formula [24][25][26][27][28] FDR % TD À DD TT ð1Þ is actually for directional cross-links [5]. Here TT is the number of observed target-target matches (both cross-linked peptides come from the target database), TD is the number of observed target-decoy matches (one linked site comes from the target database and one from the decoy database) and DD stands for the number of decoy-decoy matches (both peptide matches are from the decoy database). A correct formula for the more commonly used nondirectional cross-linker (e.g. BS 3 or DSS) would be [5]: This formula requires knowledge of the number of possible target-decoy pairs in the initial search database (TD DB ). However, the error made by using formula 1 approaches zero relatively fast with increasing database size. Therefore in practical terms the directional formula is also applicable to data of non-directional cross-linkers such as BS 3 or DSS.
Directionality (or the lack of it) is not the only property of a cross-linker. Cross-linkers can also be homobifunctional or heterobifunctional. For homobifunctional cross-linkers, any peptide in the database that can react with one side of the cross-linker, can also react with the other side. For heterobifunctional cross-linker that is not the case, which has consequences for constructing the target and decoy search space. It leads to distinct databases (set of peptides or residue pairs) for each side of the cross-linker. The formulas used previously, assume a homobifunctional cross-linker.
A set of considerations (see supporting information S1 File) leads us to an FDR estimation formula for non-directional, heterobifunctional cross-linkers: Besides the observed target-target (TT), target-decoy and decoy-target (TD), and decoydecoy matches, it needs a set of parameters describing the search database ( Table 1). As formula 2 can be simplified to formula 1 in all practical terms we wondered how big an error would occur when also using the much simpler formula for directional, homobifunctional cross-linkers (formula 1), in place of formula 3. The error appears once matches with two decoy peptides are encountered. Before then, one arrives at the same FDR value with formula 3 and 1. Up to this point we have a linear problem (Fig 1a), as we can use the decoys only to model the hits with one wrongly identified partner, and overlook any match to two wrongly identified partners. Statistically, these will be rare, however they are not modeled until a significant number of decoy-decoy matches are encountered.
The situation changes once matches with two decoys are encountered. Here we start modeling how likely we have hits with two wrongly matched partners. The random space for a nondirectional heterobifunctional cross-linker is somewhere between the directional and nondirectional spaces for the homobifunctional cross-linker (Fig 1b). In fact the larger the non-overlap is between the two sites of the cross-linker-and therefore the smaller T ab and D ab are-the closer it behaves like a directional, homobifunctional cross-linker and the simplification of formula 1 applies.
The error made when using formula 1 for heterobifunctional cross-linkers is smaller than the error made when using formula 1 for non-directional homo-bifunctional cross-linkers (Fig 2). Already, at 200 entries (i.e. peptide, linkable residues or proteins, depending of what level the FDR should be estimated on [5]) in the database, even for a 100% overlap between both sides of the cross-linker (effectively resulting in a directional homobifunctional crosslinker) the error of FDR estimation incurred by using formula 1 instead of formula 3 should not exceed 1%. For example when cross-linking human serum albumin (HSA Uniprot: P02768), which has 585 residues in the active form, of which 129 are Lysine, Serine, Threonine or Tyrosine and the protein amino terminus, with SDA, the maximal error resulting from using formula 1 should be less than 0.2% from the estimated FDR-i.e. 5% would be <5.01% (Table 2). This error is usually smaller than the actual resolution of the FDR estimation [5]. Considering EDC in a second example: there is a 100% non-overlap between both sides of the cross-linker (Lysine, Serine, Threonine, Tyrosine, and the protein amino terminus on one side and Glutamic acid, Aspartic acid, and the protein carboxy terminus on the other side). An FDR calculation using formula 1 would result in the same estimate as using formula 3. At the level of peptides, the situation would look slightly different. Taking HSA cross-linked with EDC and a tryptic digest with four missed cleavages would result in 23 peptides exclusively for one side (T a ), 31 peptides for the other side (T b ) and 329 peptides (T ab ) that could be linked to either side of the cross-linker. This would lead to a maximal error of around 0.45% (i.e. 5% would become 5.023%).
In conclusion, from a theoretical point of view formula 3 is to be used for FDR estimations when working with heterobifunctional cross-linkers. However, for all practical purposes, the simpler formula 1 gives an approximation with an error smaller than the resolution of FDR estimation.