Table 1.
Molecular descriptors.
Figure 1.
The surface shows how the RMSE of the predicted binding free energies of the test set, calculated via equation 1, varies with the number of features used in the rate constant models. This surface corresponds to scheme 2 in Table 4. The two selected rate constant models, which use two features each, correspond to the RMSE minimum.
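Equation 1 is not reproduced in this caption. As a rough illustration only, the sketch below assumes it is the standard thermodynamic relation between the rate constants and the binding free energy, ΔG = RT ln(k_off / k_on). The function names, the units (kcal/mol) and the default temperature are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

# Gas constant in kcal/(mol K); the unit choice is an assumption for this sketch.
R_KCAL = 1.987204e-3


def binding_free_energy(k_on, k_off, temperature=298.15):
    """Binding free energy from rate constants, assuming dG = RT ln(K_d).

    K_d = k_off / k_on, with k_on in M^-1 s^-1 and k_off in s^-1.
    """
    k_on = np.asarray(k_on, dtype=float)
    k_off = np.asarray(k_off, dtype=float)
    return R_KCAL * temperature * np.log(k_off / k_on)


def rmse(predicted, observed):
    """Root-mean-square error between two equally sized arrays."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


# Hypothetical complex with k_on = 1e6 M^-1 s^-1 and k_off = 1e-3 s^-1 (K_d = 1 nM).
example_dg = float(binding_free_energy(1e6, 1e-3))
```

Under these assumptions, each point on the surface would be the `rmse` between the free energies derived from one pair of rate constant models and the experimental values for the test set.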
Table 2.
Significant correlations between association rates and molecular descriptors.
Table 3.
Significant correlations between association rates and molecular descriptors for the validated set.
Figure 2.
A Venn diagram showing the four combinations of training, model selection and validation sets.
Rectangles correspond to all 137 complexes in the binding affinity benchmark [12]. The left circle corresponds to the 44 complexes for which kinetic data could be found. The right circle corresponds to the set of 57 complexes with high confidence affinities. These are the complexes for which similar affinities have been determined in multiple experimental setups, as previously determined [13]. The intersection of these sets contains 27 complexes.
Figure 3.
The two rate constant models, applied to all the complexes for which kinetic data are available (with outlier 2OZA omitted from models c and d). Complexes in the intersection with the high confidence interactions are shown as circles, with the remainder shown as triangles. Points are coloured according to binding affinity. The combined predictions, applied to the validation set, are also shown. These correspond to the set of high confidence affinities for which the rate constants are not known.
Table 4.
Results for training, model selection and validation.
Table 5.
Results for training, model selection and validation (2OZA omitted).
Table 6.
Selected models.
Figure 4.
A flowchart of the feature selection algorithm.
The algorithm can be divided into two parts. In the first, a set of descriptor subsets, T, is constructed by iterating over the set of descriptor subsets kept in the previous iteration, S. In the first iteration, S contains only the empty set. For each member s of S, new descriptor subsets are created by combining s with each descriptor not already in s. These are collected into T and, in the second part of the algorithm, evaluated by their 5-fold cross-validated RMSE. The 20 best-performing subsets are kept for the next iteration, and the one with the lowest RMSE is stored for later model selection and validation. If the lowest RMSE in the current iteration, cb, is higher than the lowest RMSE found in all previous iterations, gb, the speculative round counter, sr, is incremented; otherwise it is reset to 0. The algorithm terminates after 10 consecutive speculative rounds.
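The search described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not say which regression model is fitted, so ordinary least squares is assumed. The function names `cv_rmse` and `select_features`, the fixed fold seed, and the early exit once no descriptors remain are also choices made for the example. The names `S`, `T`, `cb`, `gb` and `sr` follow the caption.

```python
import numpy as np


def cv_rmse(X, y, cols, n_folds=5, seed=0):
    """5-fold cross-validated RMSE of a least-squares fit on the chosen columns.

    The linear model is an assumption; the caption only specifies the metric.
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    squared_error = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Design matrices with an intercept column.
        A_train = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, cols]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        squared_error += np.sum((A_test @ coef - y[test_idx]) ** 2)
    return float(np.sqrt(squared_error / n))


def select_features(X, y, keep=20, max_speculative=10):
    """Return the best (rmse, descriptor indices) pair found in each iteration."""
    n_descriptors = X.shape[1]
    S = [frozenset()]   # subsets kept from the previous iteration
    gb = float("inf")   # lowest RMSE over all previous iterations
    sr = 0              # speculative round counter
    stored = []         # best subset of each iteration, for later model selection
    while sr < max_speculative:
        # Part 1: extend every kept subset with each descriptor it lacks.
        T = {s | {d} for s in S for d in range(n_descriptors) if d not in s}
        if not T:
            break       # every descriptor is already in use (added safeguard)
        # Part 2: rank the candidates by cross-validated RMSE.
        scored = sorted((cv_rmse(X, y, sorted(t)), sorted(t)) for t in T)
        cb = scored[0][0]
        stored.append(scored[0])
        sr = sr + 1 if cb > gb else 0
        gb = min(gb, cb)
        S = [frozenset(cols) for _, cols in scored[:keep]]
    return stored


# Synthetic usage example: only descriptors 0 and 3 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 6))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + 0.1 * rng.normal(size=60)
stored_demo = select_features(X_demo, y_demo)
best_rmse, best_cols = min(stored_demo)
```

Keeping the 20 best subsets rather than only the single best makes this a beam search. The speculative rounds let the search continue past a temporary rise in RMSE before it stops.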