Characterizing Changes in the Rate of Protein-Protein Dissociation upon Interface Mutation Using Hotspot Energy and Organization
Figure 8
Specialized feature selection models and descriptor-data region networks.
Feature selection models based on a genetic algorithm (GA-FS) are run on different data regions of the off-rate dataset, and both linear (Linear Regression) and non-linear (SVM regression) models are investigated. For each data region, the GA-FS is run 50 times, with each run searching for an optimal feature set of size 5. The initial features available to the population are the 110 molecular descriptors and the 16 hotspot descriptors generated by RFspot_KFC2. An inner cross-validation loop serves as the scoring function that drives the feature selection, whereas an outer cross-validation loop is used to assess the prediction accuracy of the model. (A) and (B) show the importance of the most frequently selected features for each data region. The features shown are those that are part of the final model for at least one data region in more than 50% of the GA-FS runs; the color bar displays this percentage. The features on the y-axis are ordered as follows: coarse-grain potentials, atomic-based potentials, physics-based energy terms and hotspot descriptors. (C) and (D) are the descriptor-data region networks for (A) and (B), respectively. Circular nodes represent data regions and square nodes represent features; therefore, edges are present only between circular and square nodes. An edge is drawn if the feature is in the final model for the given data region in more than 50% of the GA-FS runs (dotted edge), in 70–90% of the GA-FS runs (normal edge), or in more than 90% of the GA-FS runs (bold edge). Nodes are colored as follows: coarse-grain potentials (blue), atomic-based potentials (yellow), physics-based energy terms (green), hotspot descriptors (pink) and data regions (gray). The descriptor-data region networks reveal descriptors that are highly specific to certain classes of off-rate mutations. Conversely, as in the case of the GA-FS (SVM) data region network, a cluster of broadly predictive hotspot descriptors is also visible. (E) Mean Pearson correlation coefficient (PCC) of the optimal models found by the GA-FS runs for each data region.
For comparison, PCC results on the data regions are also shown for RFSpot_KFC2Off-Rate+Mol. Note that the latter model is trained on all 713 off-rate mutations; its predictions are separated into data regions after prediction, and the PCC is then computed for each region. This effectively compares the predictions of specialized models with those of a one-fits-all model. Although we find no evidence that specialized models generally perform better than a one-fits-all model, certain subsets of mutations, such as those at the rim regions, show notable improvements when a specialized model is employed.
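The edge rules used for the descriptor-data region networks in (C) and (D) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy feature names (`hs_energy`, `pot_x`, `pot_y`, `pot_z`) and the toy runs are hypothetical, and the handling of the exact 70% and 90% boundaries is an assumption, since the caption does not specify it. The toy run sets are also smaller than the size-5 feature sets described in the caption.

```python
from collections import Counter


def selection_frequency(runs):
    """Fraction of GA-FS runs whose final feature set contains each feature."""
    counts = Counter(feature for run in runs for feature in set(run))
    return {feature: count / len(runs) for feature, count in counts.items()}


def edge_style(freq):
    """Map a selection frequency to an edge style.

    Assumed boundary handling: >90% bold, 70-90% normal, >50% dotted.
    """
    if freq > 0.9:
        return "bold"
    if freq >= 0.7:
        return "normal"
    if freq > 0.5:
        return "dotted"
    return None  # feature selected too rarely: no edge


def build_network(runs_by_region):
    """Return (region, feature, style) edges of the bipartite network."""
    edges = []
    for region, runs in runs_by_region.items():
        for feature, freq in sorted(selection_frequency(runs).items()):
            style = edge_style(freq)
            if style is not None:
                edges.append((region, feature, style))
    return edges


# Hypothetical toy data: 10 runs for a single data region.
runs_by_region = {
    "rim": (
        [{"hs_energy", "pot_x", "pot_y"}] * 6
        + [{"hs_energy", "pot_x", "pot_z"}] * 2
        + [{"hs_energy", "pot_z"}] * 2
    )
}

for edge in build_network(runs_by_region):
    print(edge)
```

In this toy example, `hs_energy` is selected in all 10 runs (bold edge), `pot_x` in 8 (normal edge) and `pot_y` in 6 (dotted edge), while `pot_z` is selected in only 4 runs and therefore receives no edge.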