Fig 1.
Characteristics of linear motif classes in the ELM database.
(A) Histogram of observed linear motif class lengths. The total number of classes is 172. (B) Histogram of allowed amino acids (ei) at each motif position. The total number of positions is 1028. (C) Histogram for the number of instances within a linear motif class. Empty bars: known instances from the ELM database. Black bars: potential unique instances calculated from the corresponding regular expression. The total number of classes is 172.
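As a rough illustration of panels B and C, the number of allowed amino acids per position and the resulting count of potential unique instances can be read from a regular expression. The sketch below is not the authors' procedure: it assumes fixed-length expressions built only from single letters, bracket classes, and the wildcard `.`, and it assumes a wildcard admits all 20 standard amino acids. The names `motif_structure` and `potential_instances` are hypothetical, and the example expression is the one used in Fig 3.

```python
import math
import re

ALPHABET_SIZE = 20  # assumption: a wildcard position admits all 20 standard amino acids


def motif_structure(regex):
    """Return the number of allowed amino acids at each position of a simple motif regex."""
    tokens = re.findall(r"\[[A-Z]+\]|[A-Z]|\.", regex)
    if "".join(tokens) != regex:
        # Quantifiers, negated classes, anchors, etc. are outside this sketch.
        raise ValueError(f"unsupported regular expression syntax: {regex!r}")
    structure = []
    for token in tokens:
        if token == ".":
            structure.append(ALPHABET_SIZE)
        else:
            structure.append(len(set(token.strip("[]"))))
    return structure


def potential_instances(regex):
    """Count distinct sequences matching the regex: the product over all positions."""
    return math.prod(motif_structure(regex))


print(motif_structure("[LI].C.[DE]"))      # [2, 20, 1, 20, 2]
print(potential_instances("[LI].C.[DE]"))  # 1600
```

Real ELM expressions often contain variable-length gaps and other syntax that this sketch rejects rather than guesses at.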
Fig 2.
Workflow from the raw data for motif classes in the ELM database to the calculation of the potential number of motif classes.
Fig 3.
Measurement of the distance in sequence space between a pair of linear motif classes.
We illustrate the calculation for the regular expressions [LI].C.[DE] and [FI].W. Because the two regular expressions differ in length, there are three possible alignments, all of them with hanging ends that belong to the longer regular expression. The second alignment does not match any pair of fixed positions and is therefore uninformative about the distance in sequence space between the two motifs. The first and third alignments match two pairs of fixed positions each. For each of them, we count the number of motif-discriminating positions, where no amino acid can match both regular expressions. The result is one for the first alignment and two for the third alignment. We take the minimum of these two values. Thus, the distance in sequence space between these two linear motif classes is at least one motif-discriminating position.
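The calculation described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes regular expressions made only of single letters, bracket classes, and the wildcard `.`, and it considers only alignments in which the shorter expression lies entirely within the longer one. The function names `parse_motif`, `alignment_counts`, and `motif_distance` are hypothetical.

```python
import re


def parse_motif(regex):
    """Split a simple motif regex into positions: a set of residues, or None for a wildcard."""
    tokens = re.findall(r"\[[A-Z]+\]|[A-Z]|\.", regex)
    if "".join(tokens) != regex:
        # Quantifiers, negated classes, anchors, etc. are outside this sketch.
        raise ValueError(f"unsupported regular expression syntax: {regex!r}")
    return [None if t == "." else frozenset(t.strip("[]")) for t in tokens]


def alignment_counts(regex_a, regex_b):
    """For each alignment, count motif-discriminating positions (None if no fixed pair aligns)."""
    a, b = parse_motif(regex_a), parse_motif(regex_b)
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    counts = []
    for offset in range(len(longer) - len(shorter) + 1):
        # Keep only aligned pairs in which both positions are fixed (non-wildcard).
        fixed_pairs = [
            (p, q)
            for p, q in zip(longer[offset:], shorter)
            if p is not None and q is not None
        ]
        if not fixed_pairs:
            counts.append(None)  # uninformative alignment
        else:
            # A position discriminates when no amino acid satisfies both classes.
            counts.append(sum(1 for p, q in fixed_pairs if not (p & q)))
    return counts


def motif_distance(regex_a, regex_b):
    """Minimum number of motif-discriminating positions over the informative alignments."""
    informative = [c for c in alignment_counts(regex_a, regex_b) if c is not None]
    return min(informative) if informative else None


print(alignment_counts("[LI].C.[DE]", "[FI].W"))  # [1, None, 2]
print(motif_distance("[LI].C.[DE]", "[FI].W"))    # 1
```

In the first alignment, [LI] and [FI] share isoleucine, so only the C/W pair discriminates. In the third, both C/[FI] and [DE]/W are disjoint, which gives two.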
Fig 4.
Number of potential linear motif classes as deduced from the ELM database.
(Left) Number of motif-discriminating positions for all possible pairs of linear motif classes in the database. The total number of pairs is 14706, that is, all unordered pairs of the 172 classes. (Right) Number of potential linear motif classes for different numbers of motif-discriminating positions.
Table 1.
The number of potential linear motif classes with the structure (2,20,1,20,2) depends on the number of motif-discriminating positions k required to differentiate two classes.
Fig 5.
Number of potential linear motif classes as a function of protein alphabet size.
(Left) Number of potential linear motif classes for different numbers of motif-discriminating positions, as a function of alphabet size. Black: 0 positions. White: 1 position. Red: 2 positions. Blue: 3 positions. Green: 4 positions. The dashed vertical line highlights the results for an alphabet size of 33 amino acids. (Right) Quotient of the number of potential linear motif classes for alphabet sizes of 33 and 20, as a function of the number of motif-discriminating positions.
Fig 6.
Maximal occupancy of the protein sequence space by linear motif classes as a function of the number of motif-discriminating positions and protein alphabet size.
Maximal occupancy of the protein sequence space for different numbers of motif-discriminating positions.
Table 2.
Number of motifs and instances from different sources.
(a) Manually curated linear motif classes in the ELM database [3]. (b) Calculated from (f) and (h). (c) This work, Fig 4. (d) This work, Fig 5. (e) Manually curated linear motif instances in the ELM database [3]. (f) Estimated using the ANCHOR algorithm for sequence-insensitive motif detection [9]. (g) Calculated from (a) and (e). (h) Estimated by performing sequence searches using regular expressions and applying empirical filters to the results [21].