Bayesian validation of grammar productions for the language of thought

Probabilistic proposals of the Language of Thought (LoT) can explain learning across different domains as statistical inference over a compositionally structured hypothesis space. While frameworks may differ on how a LoT is implemented computationally, they all share the property of being built from a set of atomic symbols and rules by which these symbols can be combined. In this work we propose an extra validation step for the set of atomic productions defined by the experimenter. It starts by expanding the LoT grammar defined for the cognitive domain with a broader set of arbitrary productions, and then uses Bayesian inference to prune productions based on the experimental data. The researcher can then check that the resulting grammar still matches the intuitive grammar chosen for the domain. We test this method on the language of geometry, a specific LoT model of geometrical sequence learning. Finally, although the geometrical LoT is not a universal (i.e., Turing-complete) language, we show an empirical relation between a sequence's probability and its complexity that is consistent with the theoretical relationship for universal languages described by Levin's Coding Theorem.


Introduction
It was not only difficult for him to understand that the generic term dog embraced so many unlike specimens of differing sizes and different forms; he was disturbed by the fact that a dog at three-fourteen (seen in profile) should have the same name as the dog at three-fifteen (seen from the front). (. . .) With no effort he had learned English, French, Portuguese and Latin. I suspect, however, that he was not very capable of thought. To think is to forget differences, generalize, make abstractions. In the teeming world of Funes, there were only details, almost immediate in their presence. [1]

The theory of computation, through Levin's Coding Theorem [24], exposes a remarkable relationship between the Kolmogorov complexity of a sequence and its universal probability, two notions widely used in algorithmic information theory. Although both metrics are non-computable and defined over a universal prefix Turing machine, we can apply both ideas to non-universal Turing machines, in the same way that the complexity measure used in MDL can be computed for specific, non-universal languages.
In this work, we examine the extent to which this theoretical prediction for infinite sequences holds empirically for a specific LoT, the language of geometry. Although the inverse logarithmic relationship between both metrics is proved for universal languages in the Coding Theorem, testing this same property for a particular non-universal language shows that the language shares some interesting properties of general languages. This constitutes a first step towards a formal link between probability and complexity modeling frameworks for LoTs.

Bayesian inference for LoT's productions
The Bayesian analysis of the LoT models concept learning as Bayesian inference over a grammatically structured hypothesis space [25]. Each LoT proposal is usually formalized by a context-free grammar G that defines the valid functions or programs that can be generated, as in any other programming language. A program is a derivation tree of G that needs to be interpreted or executed according to a given semantics in order to obtain an actual description of the concept in the cognitive task at hand. Each concept is thus represented by any of the programs that describe it, and a Bayesian inference process is defined to infer, from the observed data, the distribution over the valid programs in G that describe the concepts.
As explained above, our aim is to derive the productions of G from the data, instead of merely conjecturing them using a priori knowledge about the task. Prior work on LoTs has fit the probabilities of productions in a context-free grammar using Bayesian inference; however, the focus has been on integrating out the production probabilities to better predict the data, without changing the grammar definition [23]. Here, we study whether the inference process can tell us which productions can be safely pruned from the grammar. We introduce a generic method that can be used on any grammar to select and test the proper set of productions. Instead of using a fixed grammar and adjusting the probabilities of the productions to predict the data, we use Bayesian inference to rule out productions whose probability falls below a certain threshold. This allows the researcher to validate the adequacy of the productions she has chosen for the grammar, or even to define a grammar broad enough to express different regularities and let the method select the best subset for the observed data.
To infer the probability of each production from the observed data, we add a vector of probabilities θ, with one component per production, converting the context-free grammar G into a probabilistic context-free grammar (PCFG) [26].
Let D = (d_1, d_2, . . ., d_n) denote the list of concepts produced by the subjects in an experiment; each d_i is the concept produced by a subject in one trial. Then P(θ | D), the posterior probability of the production weights given the observed data, can be calculated by marginalizing over the possible programs that compute D:

P(θ | D) = Σ_Prog P(θ, Prog | D),   (1)

where each Prog = (p_1, p_2, . . ., p_n) is a possible set of programs such that each p_i computes the corresponding concept d_i.
We can use Bayesian inference to learn the corresponding programs Prog and the vector θ jointly, applying Bayes' rule in the following way:

P(Prog, θ | D) ∝ P(D | Prog) P(Prog | θ) P(θ).   (2)

Sampling the set of programs from P(Prog | θ) imposes an inductive bias, which is needed to handle uncertainty under sparse data. Here we use a standard prior for programs, common in the LoT literature, that introduces a syntactic complexity bias favoring shorter programs [25,27]. Intuitively, the probability of sampling a program is proportional to the product of the probabilities of the production rules used to generate it, and therefore inversely related to the size of its derivation tree. Formally, it is defined as:

P(Prog | θ) = Π_{i=1..n} P(p_i | θ),  with  P(p_i | θ) = Π_r θ_r^{f_r(p_i)},

where P(p_i | θ) is the probability of the program p_i in the grammar and f_r(p_i) is the number of occurrences of the production r in program p_i. In (2), P(θ) is a Dirichlet prior over the productions of the grammar. By writing P(θ) we are abusing notation for simplicity; the proper term would be P(θ | α), a Dirichlet prior with concentration hyper-parameter α ∈ R^ℓ, where ℓ is the number of productions in the grammar. This hierarchical Dirichlet prior has sometimes been replaced with a uniform prior on productions, as it shows no significant differences in prediction results [15,17]. However, here we use the Dirichlet prior so that the production probabilities can be inferred from this more flexible model.
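As an illustration, the program prior above can be sketched in a few lines of Python. The production names and weights below are hypothetical, not taken from any concrete grammar in this work:

```python
from collections import Counter
from math import prod

def program_prior(production_counts, theta):
    """P(p | theta): the product over productions r of theta_r ** f_r(p),
    where f_r(p) is the number of times production r is used in program p."""
    return prod(theta[r] ** n for r, n in production_counts.items())

# Hypothetical program using production "+1" three times and "REP" once:
theta = {"+1": 0.4, "REP": 0.2, "CONCAT": 0.4}
counts = Counter({"+1": 3, "REP": 1})
prior = program_prior(counts, theta)  # 0.4**3 * 0.2 = 0.0128
```

A program with a larger derivation tree uses more productions, so its prior shrinks multiplicatively, which is exactly the complexity bias described above.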
The likelihood function is straightforward. It uses no free parameter to account for perception errors in the observations, which forces only programs that compute the exact concept to be taken into account, and it can be easily calculated as follows:

P(D | Prog) = Π_{i=1..n} P(d_i | p_i),

where P(d_i | p_i) = 1 if the program p_i computes d_i, and 0 otherwise. Calculating P(θ | D) directly is, however, not tractable, since it requires summing over all possible combinations of programs Prog for each possible value of θ. To this aim, we used a Gibbs sampling [28] algorithm for PCFGs via Markov Chain Monte Carlo (MCMC), similar to the one proposed in [29], which alternates in each step of the chain between the two conditional distributions:

P(Prog | θ, D) ∝ P(D | Prog) P(Prog | θ),
P(θ | Prog, D) = P_D(θ | f(Prog) + α).

Here, P_D is the Dirichlet distribution whose concentration vector α is updated by counting the occurrences of the corresponding productions over all programs p_i ∈ Prog.
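A minimal sketch of one such Gibbs sweep, assuming the candidate programs for each concept have been enumerated in advance (as done later in this paper) and representing each program only by its production counts. Names and data structures here are illustrative, not the actual implementation:

```python
import random
from collections import Counter

def prod_weight(counts, theta):
    """P(p | theta) up to normalization: product of theta_r ** f_r(p)."""
    w = 1.0
    for r, n in counts.items():
        w *= theta[r] ** n
    return w

def gibbs_step(programs_per_concept, theta, alpha, rng):
    """One alternation between the two conditionals described above.

    programs_per_concept: entry i lists the production-count dicts of all
        candidate programs that compute concept d_i (precomputed).
    theta: current production probabilities; alpha: Dirichlet concentrations.
    """
    # Step 1: resample each program from P(p_i | d_i, theta), a multinomial
    # over the candidates, weighted by their prior.
    chosen = [
        rng.choices(cands, weights=[prod_weight(c, theta) for c in cands])[0]
        for cands in programs_per_concept
    ]
    # Step 2: resample theta | Prog ~ Dirichlet(alpha + production counts),
    # drawn here as normalized Gamma variates.
    counts = Counter()
    for c in chosen:
        counts.update(c)
    gammas = {r: rng.gammavariate(alpha[r] + counts[r], 1.0) for r in alpha}
    z = sum(gammas.values())
    return chosen, {r: g / z for r, g in gammas.items()}
```

Iterating this step yields samples of (Prog, θ) from the joint posterior, from which the per-production probabilities reported later can be averaged.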
In the next section, we apply this method to a specific LoT. We add to the grammar a new set of ad-hoc productions that can explain regularities but are unrelated to the cognitive task. Intuitively, these ad-hoc productions should not be part of the human LoT repertoire, yet all of them can be used in many possible programs to express each concept.
So far, probabilistic LoT approaches have been successful at modeling concept learning from few examples [13,14]. However, this does not mean that Bayesian models can infer the syntax of the model's grammar from sparse data; here we test this hypothesis. If the method is effective, it should assign low probability to the ad-hoc productions and instead favor the original set of productions selected by the researchers for the cognitive task. This would provide additional empirical evidence not only for the adequacy of the original productions chosen for the selected LoT but, more importantly, for the usefulness of Bayesian inference in validating the set of productions involved in different LoTs.

The language of geometry: Geo
The language of geometry, Geo [16], is a probabilistic generator of sequences of movements on a regular octagon like the one in Fig 1. It has been used to model human sequence predictions in adults, preschoolers, and adult members of an indigenous group in the Amazon. As in other LoT domains, different models have been proposed for similar spatial sequence domains, like the one in [17]. Although both successfully model the sequences in their experiments, they propose different grammars (in particular, [16] contains productions for expressing symmetry reflections). This difference can be explained by the particularities of each experiment. On the one hand, [16] categorized the sequences into 12 groups based on their complexity, displayed them on an octagon, and evaluated the ability of a diverse population to extrapolate them. On the other hand, [17] categorized the sequences into 4 groups, displayed them on a heptagon, and evaluated the ability of adults not just to predict how a sequence continues, but to transfer the knowledge learned from a sequence across auditory and visual domains. Although the domains are not identical, the differences between the grammars strengthen the need for automatic methods to test and validate multiple grammars for the same domain in the LoT community.
The production rules of the grammar Geo were selected based on previous claims of the universality of certain human geometrical knowledge [30-32], such as spatial notions [33,34] and the detection of symmetries [35,36].
With these production rules, sequences are described by concatenating or repeating sequences of movements on the octagon. The original set of productions is shown in Table 1 and -besides the concatenation and repetition operators- includes the following families of atomic geometrical transition productions: anticlockwise movements, staying at the same location, clockwise movements, and symmetry movements.
The language supports not just a simple n-times repetition of a block of productions, but also two more complex productions in the repetition family: repeating with a change of starting point after each cycle, and repeating with a change applied to the resulting sequence after each cycle. More details about the formal syntax and semantics can be found in [16], though they are not needed here.
Each program p generated by the grammar describes a mapping S → S+, for S = {0, . . ., 7}. Here, S+ represents the set of all (non-empty) finite sequences over the alphabet S, which can be understood as finite sequences of points on the octagon. These programs must then be executed or interpreted from a starting point in order to obtain the resulting sequence of points. Let p = [+1, +1] be a program; then p(0) is the result of executing p starting from point 0 (that is, the sequence 1, 2) and p(4) is the result of executing the same program starting from point 4 on the octagon (the sequence 5, 6).
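The execution of such a program from a starting point can be sketched as follows. This is a simplified interpreter covering only the rotation productions (not symmetries or repetitions), written by us for illustration:

```python
def run_geo(program, start):
    """Execute a list of atomic rotation moves on the octagon {0, ..., 7}.

    program: list of signed steps, e.g. [+1, +1].
    Returns the sequence of visited points, each reduced mod 8.
    """
    points, current = [], start
    for step in program:
        current = (current + step) % 8
        points.append(current)
    return points

# The example from the text: p = [+1, +1]
assert run_geo([+1, +1], 0) == [1, 2]  # p(0) -> sequence 1, 2
assert run_geo([+1, +1], 4) == [5, 6]  # p(4) -> sequence 5, 6
```

Note that the same program yields different sequences depending on the starting point, which is why programs denote mappings S → S+ rather than fixed sequences.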
Each sequence can be described by many different programs: from a simple concatenation of atomic productions to more compressed forms using repetitions. For example, to move through the whole octagon clockwise one point at a time starting from point 0, one can either concatenate eight +1 productions or, more compactly, repeat the +1 production eight times.

Geo's original experiment
To infer the productions from the observed data, we used the original data from the experiment in [16]. In the experiment, volunteers were exposed to a series of spatial sequences defined on an octagon and were asked to predict future locations. The sequences were selected according to their MDL in the language of geometry so that each sequence could be easily described with few productions.
Participants. The data used in this work comes, unless otherwise stated, from Experiment 1, in which participants were 23 French adults (12 female, mean age = 26.6, age range = 20-46) with college-level education. Data from Experiment 2 is used later, when comparing adults' and children's results. In the latter, participants were 24 preschoolers (minimal age = 5.33, max = 6.29, mean = 5.83 ± 0.05).
Procedure. On each trial, the first two points of the sequence were flashed sequentially on the octagon and the subject had to click on the next location. If the subject selected the correct location, she was asked to continue with the next point until the eight points of the sequence were completed. If there was an error at any point, the mistake was corrected, the sequence was flashed again from the first point up to the corrected point, and the subject was asked to predict the next location. Each d_i ∈ S^8 in our dataset D is thus the sequence of eight positions clicked in each subject's trial. The detailed procedure can be found in the cited work.

Extending Geo's grammar

We now expand the original set of productions in Geo with a new set of productions that can also express regularities but are not related to any geometrical intuition, in order to test our Bayesian inference model. Table 2 shows the new set of productions, which includes instructions like moving to the point whose label is the square of the current location's label, or using the current point location i to select the i-th digit of a well-known number like π or Chaitin's constant (calculated for a particular universal Turing machine and programs up to 84 bits long [37]). All digits are reduced modulo 8 to obtain a valid point for the next position. For example, PI(0) returns the first digit of π, that is, PI(0) = 3 mod 8 = 3; and PI(1) = 1.
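For illustration, two of these ad-hoc productions can be sketched as follows. The function names are ours, and the digits of π are hard-coded just far enough for the example:

```python
PI_DIGITS = [3, 1, 4, 1, 5, 9, 2, 6]  # leading digits of pi, one per octagon point

def pi_production(i):
    """Ad-hoc production: the i-th digit of pi, reduced mod 8 to a valid point."""
    return PI_DIGITS[i] % 8

def square_production(i):
    """Ad-hoc production: the square of the current point's label, mod 8."""
    return (i * i) % 8

assert pi_production(0) == 3  # PI(0) = 3 mod 8 = 3, as in the text
assert pi_production(1) == 1  # PI(1) = 1
```

Productions like these can generate perfectly regular sequences, yet they have no geometrical motivation, which is exactly what makes them useful as controls.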

Inference results for Geo
To let the MCMC converge faster (and to later compare the concepts' probabilities with their corresponding MDL values), we generated all the programs that explain each of the observed sequences from the experiment. In this way, we are able to sample from the exact distribution P(p_i | d_i, θ) by sampling from a multinomial distribution over all the possible programs p_i that compute d_i, where each p_i has probability of occurrence proportional to P(p_i | θ).
To give an idea of the expressiveness of the grammar in generating different programs for a sequence, and of the cost of computing them, it is worth mentioning that there are more than 159 million programs that compute the 292 unique sequences generated by the subjects in the experiment, and that each sequence is computed by an average of 546,713 programs (min = 10,749, max = 5,500,026, σ = 693,618). Fig 2 shows the inferred θ for the observed sequences from subjects, with a unit concentration parameter for the Dirichlet prior, α = (1, . . ., 1). Each bar shows the mean probability and the standard error of each atomic production after 50 steps of the MCMC, leaving the first 10 steps out as burn-in.
Although 50 steps might seem low for an MCMC algorithm to converge, our method calculated P(p_i | d_i, θ) exactly in order to speed up convergence and to be able to later compare the probability with the complexity from the original MDL model. In Fig 3, we show an example trace of four MCMC runs for θ_+0, which corresponds to the atomic production +0 but is representative of the behavior of all θ_i (see S1 Fig for the full set of productions). Fig 2 shows a remarkable difference between the probability of the productions originally chosen on geometrical intuitions and that of the ad-hoc productions. The plot also shows that each clockwise production has almost the same probability as its corresponding anticlockwise production, and a similar relation holds between the horizontal and vertical symmetries (H and V) and the symmetries around the diagonal axes (A and B). This is important because the original experiment was designed to balance such behavior; the inferred grammar reflects this. Fig 4 shows the same inferred θ grouped by production family. Grouping stresses the low probability of all the ad-hoc productions, but also shows an important difference between REP and the rest of the productions, particularly the simple concatenation of productions (CONCAT). This indicates that the language of geometry is capable of reusing simpler structures that capture geometrical meaning to explain the observed data, a key aspect of a successful LoT model.
We then ran the same inference method using observed sequences from other experiments, but only with the original grammar productions (i.e., setting aside the ad-hoc productions). We compared the result of inference over our previously analyzed sequences generated by adults with sequences generated by children (Experiment 2 from [16]) and with the expected sequences for an ideal learner. Fig 5 shows the probability inferred for each atomic production in each population. The figure shows that different populations can converge to different probabilities and thus to different LoTs. In particular, the ideal learner uses more repetition productions relative to simple concatenations than adults do, and adults in turn use more repetitions than children. This could mean that the ideal learner is capable of reproducing the sequences by recursively embedding smaller programs, whereas adults and, even more so, children have problems understanding or learning the smaller concept that explains all the sequences from the experiments, which is consistent with the results from the MDL model in [16].
It is worth mentioning that in [16] the complete grammar of the language of geometry could explain adults' behavior but had problems reproducing children's patterns for some sequences. However, the authors also showed that penalizing the rotational symmetry (P) could adequately explain children's behavior. In Fig 5, we see that the mean value of (P) for children is 0.06, whereas for adults it is 0.05 (a two-sample t-test gives t = -12.6, p = 10^-19). This is not necessarily contradictory: the model for children in [16] was used to predict the next symbol of a sequence after seeing its prefix, adding a penalization to extensions that use the rotational symmetry in the minimal program of each sequence. The Bayesian model in this work, on the other hand, explains the observed sequences produced by children by considering the probability of a sequence summed over all the possible programs that can generate it, not just those of minimal size. Thus, a production like (P) that is not part of the minimal program of a sequence is not necessarily less probable when the entire distribution of programs for that sequence is considered.

Coding Theorem
For each phenomenon there can always be an extremely large, possibly infinite, number of explanations. In a LoT model, this space is constrained by the grammar G that defines the valid hypotheses in the language. Still, one has to define how a hypothesis is chosen among all possibilities. Following Occam's razor, one should choose the simplest hypothesis amongst all the possible ones that explain a phenomenon. In cognitive science, the MDL framework has been widely used to model such bias in human cognition, and in the language of geometry in particular [16]. The MDL framework is based on the ideas of information theory [38], Kolmogorov complexity [39] and Solomonoff induction [40].
Occam's razor was formalized by Solomonoff [40] in his theory of universal inductive inference, which proposes a universal prediction method that successfully approximates any distribution μ based on previous observations, with the only assumption that μ is computable. In short, Solomonoff's theory uses all programs (in the form of prefix Turing machines) that can describe previous observations of a sequence to calculate the probability of the next symbols in an optimal fashion, giving more weight to shorter programs. Intuitively, simple theories with low complexity have higher probability than theories with higher complexity. Formally, this relationship is described by the Coding Theorem [24], which closes the gap between the concepts of Kolmogorov complexity and probability theory. However, LoT models that define a probability distribution over their hypotheses do not usually compare it with a complexity measure of the hypotheses like the ones used in MDL, nor the other way around.
In what follows we formalize the Coding Theorem (for more information, see [41]) and test it experimentally. To the best of our knowledge, this is the first attempt to validate these ideas for a particular (non-universal) language. The reader should note that we are not validating the theorem itself, as it has already been proved for universal Turing machines. Rather, we are testing whether the inverse logarithmic relationship between probability and complexity still holds when both are defined for a specific non-universal language.

The formal statement
Let M be a prefix Turing machine -by prefix we mean that if M(x) is defined, then M is undefined for every proper extension of x. Let P_M(x) be the probability that the machine M computes output x when the input is filled up with the results of fair coin tosses, and let K_M(x) be the Kolmogorov complexity of x relative to M, defined as the length of the shortest program that outputs x when executed on M. The Coding Theorem states that for every string x we have, up to an additive constant,

-log2 P_U(x) = K_U(x),

whenever U is a universal prefix Turing machine -by universal we mean a machine capable of simulating every other Turing machine; it can be understood as the underlying (Turing-complete) programming language of choice. It is important to remark that neither P_U nor K_U is computable, which means that such mappings cannot be obtained through effective means. However, for specific (non-universal) machines M, one can indeed compute both P_M and K_M.

Testing the Coding Theorem for Geo
Despite the fact that P_M and K_M are defined over a Turing machine M, the reader should note that a LoT is usually formalized not as a Turing machine but as a programming language, with its own syntax of valid programs and a semantics of execution that stipulates how to compute a concept from a program. However, one can understand a programming language as defining an equivalent (not necessarily universal) Turing machine, and a LoT as defining its equivalent (not necessarily universal) Turing machine G. In short, machines and languages are interchangeable in this context: they both specify the programs/terms, which are symbolic objects that, in turn, describe semantic objects, namely, strings.

The Kolmogorov complexity relative to Geo. In [16], the Minimal Description Length was used to model the combination of productions of the language of geometry into concepts by defining a Kolmogorov complexity relative to the language of geometry, which we denote K_Geo. K_Geo(x) is the minimal size of an expression in the grammar of Geo that describes x. The formal definition of 'size' can be found in the cited work; in short, each atomic production adds a fixed cost of 2 units; using any of the repetition productions to iterate a list of other productions n times adds the cost of the list plus ⌊log(n)⌋; and joining two lists with a concatenation costs the sum of the costs of both lists.
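The cost rules just described can be sketched in a few lines. We read log as base-2 here, which is an assumption on our part; the cited work fixes the exact definition:

```python
from math import floor, log2

def atomic_cost(_production):
    """Each atomic production costs a fixed 2 units."""
    return 2

def repeat_cost(body_cost, n):
    """Repeating a block n times costs the block plus floor(log(n))."""
    return body_cost + floor(log2(n))

def concat_cost(left_cost, right_cost):
    """Concatenating two lists costs the sum of their costs."""
    return left_cost + right_cost

# Eight clockwise steps: plain concatenation vs. an 8-fold repetition
plain = 8 * atomic_cost("+1")                   # 8 * 2 = 16 units
compressed = repeat_cost(atomic_cost("+1"), 8)  # 2 + floor(log2(8)) = 5 units
assert compressed < plain
```

This makes concrete why repetitions compress descriptions: the minimal expression for a regular sequence is the one exploiting its repetitive structure.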
The probability relative to Geo. On the other hand, with the Bayesian model specified in this work, we can define P(x | Geo, θ), the probability of a string x relative to Geo and its vector of production probabilities θ.
For the sake of simplicity, we write P_Geo(x) to denote P(x | Geo, θ) when θ is the vector of probabilities inferred from the observed adult sequences of the experiment.
Here, we calculate both P_Geo(x) and K_Geo(x) in an exact way (note that Geo, seen as a programming language, is not Turing-complete). In this section, we show an experimental relation between these measures which is consistent with the Coding Theorem. We stress, once more, that the theorem does not predict that this relationship should hold for a specific non-universal Turing machine.
To calculate P_Geo(x) we are not interested in the normalization factor of P(x | prog)P(prog | θ), because we are only measuring the relationship between P_Geo and K_Geo in terms of the Coding Theorem. Note, however, that calculating P_Geo(x) requires enumerating all the programs that compute each sequence, as in our previous experiment. To make this tractable, we calculated P_Geo(x) for 10,000 unique random sequences for each of the possible sequence lengths in the experiment (i.e., up to eight). When the length of the sequence did not allow 10,000 unique combinations, we used all the possible sequences of that length.
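Computing the (unnormalized) P_Geo(x) from such an exhaustive enumeration can be sketched as follows; the data layout is hypothetical, with each program again represented only by its production counts:

```python
def geo_probability(sequence, all_programs, theta):
    """Unnormalized P_Geo(x): sum of the prior weights of every program
    that computes the sequence x.

    all_programs: dict mapping a sequence (as a tuple of points) to the
        list of production-count dicts of the programs that compute it,
        assumed precomputed by exhaustive enumeration as in the text.
    theta: production probabilities inferred from the adult data.
    """
    total = 0.0
    for counts in all_programs.get(tuple(sequence), []):
        weight = 1.0
        for r, n in counts.items():
            weight *= theta[r] ** n
        total += weight
    return total
```

Summing over all programs, rather than taking only the minimal one, is the key difference between this probability and the MDL complexity it is compared against.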
Coding Theorem results

Fig 6 shows the mean probability P_Geo(x) over all sequences x with the same value of K_Geo(x) and length between 4 and 8 (|x| ∈ [4, 8]), for all generated sequences x. The data is plotted with a logarithmic scale on the x-axis, illustrating the inverse logarithmic relationship between K_Geo(x) and P_Geo(x). The fit is very good, with R² = .99, R² = .94, R² = .97, R² = .99 and R² = .98 for Fig 6A, 6B, 6C, 6D and 6E, respectively.
This relationship between the complexity K_Geo and the probability P_Geo, defined for finite sequences in the language of geometry, matches the theoretical prediction for infinite sequences in universal languages described by the Coding Theorem. At the same time, it captures the Occam's razor intuition that the simpler sequences one can produce or explain in this language are also the more probable ones.

Discussion
We have presented a Bayesian inference method to select the set of productions of a LoT and tested its effectiveness in the domain of a geometrical cognition task. We have shown that the method is able to distinguish between arbitrary ad-hoc productions and productions intuitively selected to mimic human abilities in that domain.
The proposal to use Bayesian models tied to PCFG grammars in a LoT is not new. However, previous work has not used the inferred probabilities to gain insight into the grammar definition in order to modify it. Instead, it has usually integrated out the production probabilities to better predict the data, and even found that hierarchical priors over grammar productions show no significant differences in prediction results compared to uniform priors [15,17].
We believe that inferring production probabilities can help prove the adequacy of the grammar, and can further lead to a formal mechanism for selecting the correct set of productions when it is not clear what a proper set should be. Researchers could use a much broader set of productions than what might seem intuitive or relevant for the domain and let the hierarchical Bayesian inference framework select the best subset.
Selecting a broader set of productions still leaves some arbitrary decisions to be made. However, it can help to build a more robust methodology that -combined with other ideas like testing grammars with different productions for the same task [23]-could provide more evidence of the adequacy of the proposed LoT. Having a principled method for defining grammars in LoTs is a crucial aspect for their success because slightly different grammars can lead to different results, as has been shown in [23].
The experimental data used in this work was designed in [16] to understand how humans encode visuo-spatial sequences as structured expressions. As future research, we plan to run a specific experiment to test these ideas in a broader range of domains. Additionally, data from more domains is needed to determine whether this method could also be used to establish whether different people use different LoT productions, as suggested by Fig 5. Finally, we showed an empirical relation between the complexity of a sequence in a minimal description length (MDL) model and the probability of the same sequence in a Bayesian inference model, consistent with the theoretical relationship described by the Coding Theorem. This opens an opportunity to bridge the gap between two approaches that some authors had described as complementary [42].