Cell-Type-Specific Predictive Network Yields Novel Insights into Mouse Embryonic Stem Cell Self-Renewal and Cell Fate
Networks trained using the same mESC gold standard but different feature sets showed markedly different evidence of overfitting. We generated networks using four different feature sets: a minimalist library of 16 datasets composed largely of non-cell-type-specific data from molecular interaction databases; our mESC-specific compendium of 164 datasets restricted to mESC data and a small amount of data not specific to any cell type; a superset compendium composed of all mESC training data plus an additional 646 non-tissue-specific mouse microarrays; and a negative control compendium containing all datasets except those with mESC data. Using machine learning metrics, we found that the network trained on a small amount of non-tissue-specific data achieved the lowest ROC curve AUCs and showed the least overfitting. The mESC-specific network achieved a higher AUC and showed evidence of minimal overfitting. The superset and negative control networks had the highest AUCs but also showed extreme overfitting, with a difference of greater than 0.1 between training and test set AUCs. Bootstrapping followed by out-of-bag averaging largely corrected for overfitting in the mESC-specific, superset, and negative control networks; however, network content varied dramatically. B. Overfitting in Networks with Randomly Generated Gold Standards. Networks trained on randomly generated gold standards performed better than random according to standard machine learning metrics, but 4-fold cross-validation revealed that these networks showed evidence of overfitting that could be corrected using regularization and bagging techniques. C. Evaluating Network Differences Using Positive Gold Standard Posteriors. A scatterplot of superset versus mESC-only network posteriors for positive gold standard edges (those with a prior of 1) illustrates that while the correlation is relatively high (Pearson correlation r = 0.6592), there is also a broad range of disparity between the two networks.
A scatterplot of negative control versus mESC-only network posteriors shows lower correlation between the two networks (Pearson correlation r = 0.2311) and reveals the subset of the training gold standard supported by non-mESC data.
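
The two quantities this caption relies on, the gap between training and test set AUCs and the Pearson correlation between two networks' posteriors, can be illustrated with a minimal sketch. This is not the study's pipeline. The function names (`roc_auc`, `overfitting_gap`, `pearson_r`) are hypothetical, and the labels, scores, and posteriors are invented toy values chosen only to make the arithmetic visible. The only value taken from the caption is the 0.1 AUC difference used to flag extreme overfitting.

```python
from math import sqrt

def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def overfitting_gap(train_labels, train_scores, test_labels, test_scores):
    """Training AUC minus test AUC; a large positive gap suggests overfitting."""
    return roc_auc(train_labels, train_scores) - roc_auc(test_labels, test_scores)

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of posteriors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Toy gold standard labels (1 = positive edge, 0 = negative edge); illustrative only.
train_labels = [1, 1, 1, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # perfectly separated on training data
test_labels = [1, 1, 1, 0, 0, 0]
test_scores = [0.9, 0.4, 0.6, 0.5, 0.2, 0.7]    # noisier on held-out data

gap = overfitting_gap(train_labels, train_scores, test_labels, test_scores)
extreme = gap > 0.1  # threshold taken from the caption

# Toy posteriors for the same edges in two hypothetical networks.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

On the toy data the training AUC is 1.0 and the test AUC is 6/9, so the gap exceeds the 0.1 threshold. Posteriors from two real networks would be passed to `pearson_r` as parallel lists over the same set of gold standard edges.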