Rapid runtime learning by curating small datasets of high-quality items obtained from memory

We propose the “runtime learning” hypothesis, which states that people quickly learn to perform unfamiliar tasks as the tasks arise by mentally training on task-relevant instances of concepts stored in memory. To make learning rapid, the hypothesis claims that only a few class instances are used, but these instances are especially valuable for training. The paper motivates the hypothesis by describing related ideas from the cognitive science and machine learning literatures. Using computer simulation, we show that deep neural networks (DNNs) can learn effectively from small, curated training sets, and that valuable training items tend to lie toward the centers of data item clusters in an abstract feature space. In a series of three behavioral experiments, we show that people can also learn effectively from small, curated training sets. Critically, we find that participant reaction times and fitted drift rates are best accounted for by the confidences of DNNs trained on small datasets of highly valuable items. We conclude that the runtime learning hypothesis is a novel conjecture about the relationship between learning and memory with the potential for explaining a wide variety of cognitive phenomena.

S3 Appendix: Modeling fitted drift rates with network confidences

The assumption we made in the main body of the paper, namely that only the drift rate changes from exemplar to exemplar, seems like a safe one, but we can eliminate it and more directly estimate the drift for individual exemplars by fitting an appropriate model to participant reaction time data. In consideration of the relatively small number of data points we have for each individual exemplar, we use a simple model based on the shifted Wald distribution. The Wald distribution, also (somewhat misleadingly) called the inverse-Gaussian distribution, describes the distribution of times at which a stochastic process with positive drift γ crosses a defined threshold α. The shifted Wald distribution simply adds an offset θ to the start of the process. The probability density function of the shifted Wald distribution is

f(x | γ, α, θ) = α / √(2π(x − θ)³) · exp(−(α − γ(x − θ))² / (2(x − θ)))

for x > θ, where x denotes a reaction time.
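The density above can be written directly in code. The following is a minimal NumPy sketch (not the authors' implementation); the parameter names follow the text, with θ acting as a non-decision shift below which the density is zero.

```python
import numpy as np

def shifted_wald_pdf(x, gamma, alpha, theta):
    """Density of the shifted Wald distribution.

    gamma: drift rate of the underlying diffusion process
    alpha: threshold the process must cross
    theta: shift (offset to the start of the process); density is 0 for x <= theta
    """
    x = np.asarray(x, dtype=float)
    t = x - theta
    out = np.zeros_like(t)
    ok = t > 0  # density is defined only above the shift
    tv = t[ok]
    out[ok] = (alpha / np.sqrt(2.0 * np.pi * tv**3)
               * np.exp(-(alpha - gamma * tv)**2 / (2.0 * tv)))
    return out
```

Because the Wald distribution is the first-passage-time distribution of a drifting process, the density integrates to one over x > θ, which provides a quick numerical sanity check.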
Anders et al. [1] recommend the shifted Wald distribution as a simple, interpretable cognitive process model of response times that can be used in situations with relatively few data points.
While it most directly applies to decisions with only a single possible response, such as go/no-go tasks, they also note that it can be applied to more complicated tasks to produce a relatively abstract aggregate description of the process producing the entire reaction time distribution. The entropy of neural network responses is a similarly aggregate measure of drift, so this is useful even if, as Matzke and Wagenmakers [2] found, the fitted shifted Wald drift only imperfectly corresponds to the drifts of the individual components for each possible decision in a more complicated model.
Thus, we can abstractly interpret our task as having only a single outcome that represents any decision being made, without any other change in our earlier logic concerning the interpretation of the parameters. Therefore, for each exemplar, we performed maximum likelihood estimation using the Broyden-Fletcher-Goldfarb-Shanno optimization algorithm to fit a shifted Wald distribution to the human reaction times (normalized as before) induced by the exemplar, where the main parameter of interest is the drift rate γ for each exemplar. We then found the Spearman's rank correlation coefficient between the drift for an exemplar and the mean normalized confidence, as well as the individual confidences, induced by the same exemplar in one hundred neural networks, set up as before and trained on each of the subsets, for both MNIST and Devanagari. Here, we expect a good fit to produce a negative correlation, as low entropy should correspond to a high drift rate.
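The fitting step can be sketched as follows. This is a hypothetical SciPy illustration, not the authors' code: per-exemplar maximum-likelihood estimation of (γ, α, θ) via BFGS, followed by a Spearman rank correlation against network confidences (the array names in the usage comments are placeholders).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def fit_shifted_wald(rts):
    """MLE of (gamma, alpha, theta) for one exemplar's reaction times."""
    rts = np.asarray(rts, dtype=float)

    def neg_log_lik(params):
        gamma, alpha, theta = params
        t = rts - theta
        if gamma <= 0 or alpha <= 0 or np.any(t <= 0):
            return np.inf  # outside the parameter space / support: reject
        log_pdf = (np.log(alpha) - 0.5 * np.log(2.0 * np.pi * t**3)
                   - (alpha - gamma * t)**2 / (2.0 * t))
        return -log_pdf.sum()

    # Start with theta safely below the fastest observed reaction time.
    x0 = np.array([1.0, 1.0, 0.5 * rts.min()])
    return minimize(neg_log_lik, x0, method='BFGS').x  # (gamma, alpha, theta)

# Correlating drift with confidence (hypothetical per-exemplar arrays):
# drifts = np.array([fit_shifted_wald(rts)[0] for rts in rts_per_exemplar])
# rho, p = spearmanr(drifts, mean_confidence_per_exemplar)
```

Under the expectation stated above, a good training set should yield a negative rho. In practice a bounded optimizer (e.g. L-BFGS-B) can be more robust than plain BFGS near the θ boundary, but the infinite penalty keeps the line search in the feasible region.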
As seen in Fig A (right graph), for both MNIST and Devanagari, we found a moderate but highly significant negative correlation for networks trained on the good set, lesser (and less significant) negative correlations for the random sets and the full dataset, and a positive correlation for networks trained on the bad set. The correlations produced by the good set were also significantly different from the others. Thus, based on the direction, magnitude, and significance of the correlations, networks trained on the good training set again provide the best account of participant reaction times.

Fig A: Spearman rank correlation coefficients between networks' confidences and fitted drift values for MNIST (left) and Devanagari (right), based on training one hundred randomly initialized networks on each training set. The left, blue bar is the correlation between the mean network confidence for each exemplar and the fitted drift for the same exemplar; the right, orange bar is the mean of the correlations between each individual network's confidence and the fitted drift, with error bars representing standard errors of the means. Asterisks above bars indicate the level at which the correlation was significantly different from zero; asterisks on brackets indicate the level at which pairs of sets of correlations differ from each other: ***: p < 0.001; **: p < 0.01; *: p < 0.05; n.s.: p > 0.05.