Some authors are part of a science popularization project named "El Gato y La Caja" (

We present the results of a gamified arithmetic application for mobile devices that allowed us to collect a vast amount of data on simple arithmetic operations. Our results confirm and replicate, on a large sample, six of the main principles derived from a long tradition of investigation: size effect, tie effect, size-tie interaction effect, five-effect, RTs and error rates correlation effect, and most common error effect. Our dataset allowed us to perform a robust analysis of order effects for each individual problem, for which there is controversy both in experimental findings and in the predictions of theoretical models. For addition problems, the order effect was dominated by a max-then-min structure (e.g. 7+4 is easier than 4+7). This result is predicted by models in which additions are performed as a translation starting from the first addend, with a distance given by the second addend. In multiplication, we observed a dominance of two effects: (1) a max-then-min pattern that can be accounted for by the fact that it is easier to perform fewer additions of the largest number (e.g. 8x3 is easier to compute as 8+8+8 than as 3+3+…+3) and (2) a phonological effect by which problems for which there is a rhyme (e.g. "seis por cuatro es veinticuatro", i.e. "six times four is twenty-four") are performed faster. Above and beyond these results, our study bears an important practical conclusion, as a proof of concept: participants can be motivated to perform substantial arithmetic training simply by presenting it in a gamified format.

Smartphones, with more than 3 billion users across the world [

Here we investigate the capacity of this experimental approach to inquire about the cognitive mechanisms of arithmetic. Our aims are twofold: first, to show that the main observations derived from a long tradition of laboratory-based experimentation in arithmetic can be replicated with a rapid, massive strategy of data acquisition; second, to inquire whether big data in arithmetic may help us understand order effects that could not be reliably observed before in laboratory settings.

To these aims, we developed Moravec, an Android OS app that allows users to perform different types of operations, from 1-digit additions to 4-digit squares. The name honors Hans Moravec's paradox, which states that high-level reasoning, including arithmetic with large numbers, requires very little computation, whereas low-level sensory and motor tasks often require vast computational resources. One key aspect of this investigation was to gamify the app (structure of difficulty levels, flow and graphic interface) to engage participants, who play at their own pace and without receiving any monetary compensation.

Two weeks after the app release, and without any major dissemination effort, a total of 513 subjects had solved more than 90,000 problems. These numbers are one order of magnitude higher than those reached in previous experiments in the area, and are constantly growing because users can participate anytime and anywhere, without the need for an investigator to be present.

The app presents different arithmetic operations which include:

Additions: two 1-digit numbers (e.g. 5 + 9) and two 2-digit numbers (e.g. 34 + 49).

Multiplications of two numbers: 1-digit by 1-, 2-, 3-, and 4-digit numbers (e.g. 9x4, 86x8, 806x2, 3214x8).

Squares: numbers between 11 and 9999 (e.g. 21^{2}, 872^{2}, 7002^{2}).

These are all the atomic operations required to square a 4-digit number, using the algorithm described by the expert calculator Arthur Benjamin [

(a) Initial screen with the logo and four buttons: Play (“Jugar”), Practice (“Práctica”), Tutorial and Statistics (“Estadísticas”). (b) One of the many tutorial screens explaining the algorithms, with a link to a YouTube tutorial video. (c) One of the statistics shown. It plots RTs as a function of the trial number for 2-digit square operations. (d) In the “Practice” section the user can choose which operation to train. For each operation three difficulty levels can be chosen: initial, medium and advanced. In advanced practice, problems are shown for 5 seconds and then disappear, while in the other cases they remain visible until the user presses Enter (see (e)).

In the Play section, users go through 150 levels, starting with simple calculations and moving on to higher levels with more difficult problems, the most difficult task being to square 4-digit numbers. Each level presents 20 problems, and 15 correct answers are required to advance to the next level. If an answer is correct but takes too much time (the exact limit depends on the level), the problem is not counted.
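The advancement rule described above can be sketched as follows. This is only an illustration, not the app's actual code: the `Trial` record and the `time_limit_s` parameter are hypothetical names, and the concrete time limits per level are not given in the text.

```python
from dataclasses import dataclass

PROBLEMS_PER_LEVEL = 20  # problems presented in each level (from the text)
REQUIRED_CORRECT = 15    # correct answers needed to advance (from the text)


@dataclass
class Trial:
    correct: bool           # whether the typed answer was right
    response_time_s: float  # seconds from problem onset to Enter


def level_passed(trials, time_limit_s):
    """Return True if enough answers were both correct and within the time limit.

    time_limit_s is a hypothetical per-level limit; correct answers that take
    longer than this limit are not counted.
    """
    counted = sum(
        1 for t in trials if t.correct and t.response_time_s <= time_limit_s
    )
    return counted >= REQUIRED_CORRECT
```

For example, a level with 14 fast correct answers and one slow correct answer would not be passed under this rule, because the slow answer is discarded.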

The Tutorial section explains the algorithms (e.g. x^{2} = (x-a)*(x+a)+a^{2}). There is a link to tutorial videos (a YouTube playlist called “Tutorial Moravec”) created specifically for the app and presented by author AR, who is an expert mental calculator.
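As a worked illustration of the identity x^{2} = (x-a)*(x+a)+a^{2}, the sketch below squares a number by shifting it to a nearby round number. The identity holds for any a; the rule used here to pick a (distance to the nearest multiple of the leading power of ten) is an assumption made for the example, not necessarily the procedure taught in the tutorial.

```python
def nearest_round_offset(x):
    """Distance from x to the nearest multiple of 10**(digits - 1).

    Assumed heuristic for choosing a, e.g. 872 -> 900, so a = 28.
    """
    base = 10 ** (len(str(x)) - 1)
    nearest = round(x / base) * base
    return abs(x - nearest)


def square_by_rounding(x, a):
    """Square x using x^2 = (x - a) * (x + a) + a^2."""
    return (x - a) * (x + a) + a * a


if __name__ == "__main__":
    for x in (21, 872, 7002):
        a = nearest_round_offset(x)
        print(x, a, square_by_rounding(x, a))
```

For 21 the offset is a = 1, so the computation reduces to 20 x 22 + 1. The squaring of a itself can be handled recursively with the same identity, which is why the simpler operations in the app serve as building blocks.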

In the Statistics section (

In the “Practice” section the user can choose one operation to train, regardless of the default flow of the game (

Levels in the “Play” mode are organized according to standard gaming recommendations [

To generate an engaging experience, difficulty progresses as a series of increases and decreases. The difficulty in each level is shown according to a prediction based on the number of mental computations and memory load required by each problem (magenta). This correlates as expected with the measured RTs (green) after the experiment.

One remarkable characteristic in this study is that we reached substantial amounts of data with a very modest diffusion strategy. Moravec was released with only two posts in the personal Facebook pages of two of the authors (AR and FZ) and a third post in a private Facebook group, with 150 members, created by some of the authors as a science popularization project.

We also created a Facebook group in which, during the weeks of the experiment, we posted news and encouraged users to continue using the app. We identified the users with the best scores and published their results to motivate other users.

All subjects were informed that their data would be collected via the internet through their mobile devices and used for research. Consent was obtained before playing the game's first level: a pop-up window was launched and participants had to accept that their data would be used for research. This consent procedure and all the experiments described in this paper were reviewed and approved by the ethics committee: “Comité de Ética del Centro de Educación Médica e Investigaciones Clínicas “Norberto Quirno” (CEMIC)” qualified by the Department of Health and Human Services (HHS, USA): IRb00001745—IORG 0001315. The raw data and the R code developed for the mixed model analysis are available for download at

When the app is opened for the first time, an optional form is presented to the user, requesting personal information such as birthdate, gender and education level. Of the 513 total users, 448 completed the form: 291 were male (65%), ages ranged from 12 to 65 years (M = 26.1, SD = 8.7), 367 subjects had completed high school (82%) and 125 had a university degree (28%). Argentineans represented 88% of the subjects, while the other 12% were from Mexico, Colombia, the US and other countries.

We show in

For each trial, we define the response time as the time elapsed between the visual presentation of the problem and the press of the Enter key. Errors, response times that exceeded 4 standard deviations of the mean (calculated per operation type) and problems in which participants erased a digit were not included in this analysis. We considered for this analysis only 1-digit by 1-digit operations (additions and multiplications).
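The exclusion criteria above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the dictionary keys (`operation`, `rt_ms`, `correct`, `erased_digit`) are hypothetical, the mean and sample standard deviation are computed over all trials of each operation type, and the 4 SD criterion is applied symmetrically around the mean. The original analysis was done in R and may differ in these details.

```python
from collections import defaultdict
from statistics import mean, stdev


def filter_trials(trials, n_sd=4.0):
    """Keep correct trials without erasures whose RT lies within n_sd SDs.

    The mean and SD are computed separately for each operation type.
    """
    rts_by_op = defaultdict(list)
    for t in trials:
        rts_by_op[t["operation"]].append(t["rt_ms"])

    # Per-operation mean and sample SD (SD is 0 if only one trial exists).
    stats = {
        op: (mean(rts), stdev(rts) if len(rts) > 1 else 0.0)
        for op, rts in rts_by_op.items()
    }

    kept = []
    for t in trials:
        if not t["correct"] or t["erased_digit"]:
            continue  # errors and corrected answers are excluded
        m, sd = stats[t["operation"]]
        if sd > 0 and abs(t["rt_ms"] - m) > n_sd * sd:
            continue  # RT outlier for this operation type
        kept.append(t)
    return kept
```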

The most widely and consistent reported effects which can account for the variance in RTs in one digit additions and multiplications are [

The

The

The

The

An

All the effects that increase RTs also result in an increase in error rates. In addition, there is a specific effect on the pattern of errors in multiplications: wrong responses are more likely to be the correct answers to problems that are nearby in the table. For instance, 6 x 8 = 48 is more likely to be answered as 42 (6 x 7) or 54 (6 x 9) than as 49 or 47, i.e., table-distance errors are more common than numerical-distance errors.
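To make the distinction concrete, the sketch below flags a wrong answer as a table-related error when it equals the product of a neighboring problem. Defining "nearby" as changing one operand by exactly one step is an assumption made for illustration; the function name is hypothetical.

```python
def is_table_error(a, b, answer):
    """True if a wrong answer is the product of a neighboring table entry.

    Neighbors are problems in which exactly one operand differs by 1,
    e.g. 6 x 7 and 6 x 9 for the problem 6 x 8.
    """
    if answer == a * b:
        return False  # correct answers are not errors
    neighbors = {(a + d) * b for d in (-1, 1)} | {a * (b + d) for d in (-1, 1)}
    return answer in neighbors
```

Under this definition, 42 and 54 count as table errors for 6 x 8, whereas 47 and 49 are numerically close but do not belong to the neighboring table entries.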

After substantial theoretical debate, one of the models that provides a better explanation of the data for multiplications is the connectionist model of retrieval introduced by Verguts and Fias in 2005 [

Two recent articles provide further support for the IN model [

It has also been shown, however, that above and beyond neighborhood-answer consistency, neighborhood-magnitude consistency also plays a role [

We analyzed the data collected during the first two weeks after release. We used R [

Inter-subject means and standard errors of RTs are shown as a function of the product of the operands for tie and non-tie problems. Linear and logarithmic regression results are included.

ADDITION | intercept [ms] | std error | t-value | slope [ms/u] | slope std error | t-value |
---|---|---|---|---|---|---|
| | 2060 | 36 | 58 | 23.7 | 0.4 | 56.4 |
| | 1868 | 32 | 58 | 7.2 | 0.6 | 13.2 |
| | 2406 | 55 | 44 | 35.0 | 0.6 | 61.2 |
| | 2139 | 45 | 47 | 10.5 | 0.7 | 15.8 |

To perform this linear mixed effects analysis of the relationship between RTs and size, we entered as fixed effects the product of the operands and their equality (tie or non-tie problems), together with their interaction term. As random effects, we included by-subject random intercepts and slopes. Visual inspection of residual plots did not reveal any obvious deviations from homoscedasticity or normality. P-values were obtained by likelihood ratio tests of the full model with the effect in question against the model without the effect in question. We obtained p < 2.2e-16 for the three effects (size, tie and size x tie).
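The likelihood ratio test used to obtain these p-values can be sketched as follows. The function takes the maximized log-likelihoods of the two nested models, which would come from the mixed-model software (not shown here), and assumes that the models differ by a single parameter, so the statistic is compared against a chi-square distribution with 1 degree of freedom. The function name is hypothetical.

```python
import math


def likelihood_ratio_test_1df(loglik_full, loglik_reduced):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic and its p-value under a chi-square
    distribution with 1 degree of freedom.
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 df, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A statistic of about 3.84 corresponds to p = 0.05 under this distribution; effects tested with more than one additional parameter would need the chi-square distribution with the matching degrees of freedom.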

In addition to prod, the product of the operands, we analyzed and compared eight other potential regressors. The full set is: prod, the sum of the operands, the squared sum, these three regressors on a log scale, the square roots of the product and of the sum, and the min operand. Each regressor was entered separately as the only independent variable predicting RTs, i.e., we ran a new regression for each regressor. Models were evaluated through the Akaike Information Criterion (AIC), which is a measure of the quality of each model relative to a null model that assumes that RTs are constant and do not depend on the problem. The preferred model is the one with the minimum AIC value. Values for each model are shown in
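The comparison of regressors can be illustrated with a simplified sketch. It fits an ordinary least-squares line for each regressor and ranks the fits by the Gaussian AIC, n ln(RSS/n) + 2k, up to an additive constant shared by all models. This is a deliberate simplification: the analysis reported here used mixed models in R, and the function names below are hypothetical.

```python
import math

# The nine candidate size regressors named in the text.
REGRESSORS = {
    "prod": lambda a, b: a * b,
    "sum": lambda a, b: a + b,
    "sum^2": lambda a, b: (a + b) ** 2,
    "log(prod)": lambda a, b: math.log(a * b),
    "log(sum)": lambda a, b: math.log(a + b),
    "log(sum^2)": lambda a, b: math.log((a + b) ** 2),
    "sqrt(prod)": lambda a, b: math.sqrt(a * b),
    "sqrt(sum)": lambda a, b: math.sqrt(a + b),
    "min": lambda a, b: min(a, b),
}


def ols_aic(xs, ys):
    """AIC (up to a constant) of a least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    k = 3  # intercept, slope and residual variance
    return n * math.log(rss / n) + 2 * k


def rank_regressors(problems, rts):
    """Return (name, AIC) pairs sorted from best (lowest AIC) to worst."""
    scores = {
        name: ols_aic([f(a, b) for a, b in problems], rts)
        for name, f in REGRESSORS.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

Because log(sum^2) is just twice log(sum), the two regressors produce the same fit and therefore the same AIC, which is consistent with the identical values reported for them in the table.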

Bold letters indicate best fit, i.e. minimum AIC value. Observe that in the tie cases, the min, sqrt(prod) and sum regressors represent the same condition.

prod | Sum | sum^2 | log(prod) | log(sum) | log(sum^2) | sqrt(prod) | sqrt(sum) | Min | ||
---|---|---|---|---|---|---|---|---|---|---|

MULTIPLICATION | nonties | 314488 | 314518 | 314563 | 314705 | 314705 | 314265 | 314581 | 314687 | |

ties | 52903 | 52862 | 52903 | 52862 | 528843 | 52862 | ||||

ADDITION | nonties | 279525 | 279558 | 279607 | 279386 | 279905 | 279905 | 279676 | 279649 | |

ties | 34505 | 34503 | 34508 | 34508 | 34508 | 34496 |

In

We found significant order effects in both addition (mean = 55 ms, SE = 15 ms, t-value = 3.7) and the times table (mean = 45 ms, SE = 19 ms, t-value = 2.4). Likelihood ratio tests comparing mixed linear models with and without the fixed effect Op1>Op2 showed in both cases that answers are faster when the maximum operand appears first (e.g. 7 x 4 is answered faster than 4 x 7; p = 0.02 for the times table and p = 0.0002 for the addition table). A reversed order effect for the times table had been observed before in mainland Chinese speakers, which seems to be driven by language biases, since they learn the table only in the min-max order [

As predicted by standard models, all the main effects that result in an increase in RT also result in an increase in errors. Just for visualization purposes, we present in

The diameter of each circle represents the magnitude of the measure. Larger numbers are more difficult, resulting in longer RTs and larger error rates, with the exception of the equal-number multiplications and additions, and multiplications by 5, highlighted in the figure.

ADDITION | intercept [%] | std error | t-value | slope [%/u] | slope std error | t-value |
---|---|---|---|---|---|---|
| | 1.6 | 0.3 | 5.6 | 0.071 | 0.007 | 9.5 |
| | 1.2 | 0.6 | 2.1 | 0.051 | 0.013 | 3.9 |
| | 1.56 | 0.46 | 3.4 | 0.211 | 0.008 | 24.6 |
| | 2.83 | 0.65 | 4.3 | 0.06 | 0.01 | 4.8 |

Also consistent with current models and with lab-based observations (see a review in [

Our results confirm theoretical predictions [

Of all one-digit multiplications, only four form a rhymed verse (in Spanish) when pronounced in the proper order. These are the problems in which the last digit of the result equals the second operand and the result is larger than 20: 7 x 5 = 35, 9 x 5 = 45, 6 x 4 = 24 and 6 x 8 = 48 (note that 3 x 5 = 15 and 6 x 2 = 12 do not rhyme). Hence, a working hypothesis is that these specific problems will 1) show the largest order effect and 2) show an effect such that the problem presented in the rhyme order is performed faster.
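The selection rule for the rhyming problems can be checked by enumeration. The sketch below applies the two stated conditions (last digit of the result equal to the second operand, result larger than 20) and, as an added assumption, excludes ties such as 5 x 5 and 6 x 6, which satisfy the digit rule but have no alternative operand order and so cannot show an order effect.

```python
def rhyme_candidates():
    """List non-tie one-digit problems a x b whose result ends in b and exceeds 20."""
    found = []
    for a in range(1, 10):
        for b in range(1, 10):
            product = a * b
            if a != b and product > 20 and product % 10 == b:
                found.append((a, b, product))
    return found


if __name__ == "__main__":
    for a, b, product in rhyme_candidates():
        print(f"{a} x {b} = {product}")
```

The enumeration yields exactly the four problems listed above: 6 x 4, 6 x 8, 7 x 5 and 9 x 5.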

In our study we were in a good position to examine this hypothesis, because we had substantial data that could allow us to inquire the order effect for each individual problem (

Here we showed the validity of a research program in arithmetic based on large scale data sampling through mobile devices. Our results confirm, on a large sample, six of the main principles derived in a long tradition of investigation: size effect, tie effect, size-tie interaction effect, five-effect, RTs and error rates correlation effect, and most common error effects.

Our large dataset allowed us to perform an analysis of order effects for each individual problem. This was an aspect of the model for which there were controversial findings. Current theoretical models [

Here we had an ideal dataset to examine a prediction of this model, namely, that problems that can be pronounced in a rhymed verse will 1) show the largest order effect and 2) the direction of the effect will be such that problems presented in the rhyme order will be performed faster. Results confirmed these two predictions for the four problems that are phrased in a rhyme.

We also observed an order effect for multiplication that showed a max-then-min pattern. One possible explanation is that calculation is easier when performing fewer additions of the largest number (e.g. 8 x 3 is easier to compute as 8 + 8 + 8 than as 3 + 3 +… + 3). In Spanish the linguistic expression for 8 x 3 is "8 por 3", which translates to 8 + 8 + 8. This expression, which is similar to "8 multiplied by 3", is the opposite of the English expression, where "8 times 3" literally means 3 + 3 +… + 3.

To confirm that 1) "8 por 3" is interpreted as 8 + 8 + 8 and 2) the operation is easier when performing fewer additions of the largest number, we ran two sets of polls on Twitter using our account (@ElGatoyLaCaja). In one, we asked participants for the literal interpretation of A x B, and in the other we asked how they organize this operation mentally.

In the first query we asked “What do you think ‘6 por 3’ means? We are not asking how you solve it, but what you think it literally means”. Two choices were given: “6 + 6 + 6” and “3 + 3 + 3 + 3 + 3 + 3”. A total of 915 people participated: 80% answered “6 + 6 + 6” and 20% the other option (p < 2e-78, according to a two-tailed binomial test, the null hypothesis being that both answers have equal probability). We repeated the same poll some hours later, asking for the literal interpretation of “3 por 6” instead of “6 por 3”. The results were consistent with the previous poll: 1080 people participated, 55% answered “3 + 3 + 3 + 3 + 3 + 3” and 45% the other option (p = 0.001, according to a two-tailed binomial test). This confirms that “A por B” is mostly interpreted as A repeated B times.
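The two-tailed binomial test against equal probabilities can be sketched with exact integer arithmetic. The vote counts used in the example are reconstructed from the rounded percentages (for instance, 55% of 1080 is taken as 594 votes), so they are an assumption and the resulting p-values are only approximate reproductions of the reported ones.

```python
from math import comb


def two_tailed_binomial_p(successes, n):
    """Exact two-tailed p-value for the null hypothesis p = 0.5.

    Because the null distribution is symmetric, the p-value is twice the
    probability of a result at least as extreme as the observed majority.
    """
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


if __name__ == "__main__":
    # Assumed counts, reconstructed from the rounded percentages in the text.
    print(two_tailed_binomial_p(732, 915))   # "6 por 3" poll
    print(two_tailed_binomial_p(594, 1080))  # "3 por 6" poll
```

Python integers have arbitrary precision, so the tail sum is exact and only the final division is rounded to a float.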

Then we studied how participants organize these problems mentally, independently of the literal interpretation of the question. In one poll we asked: “When you compute 7 x 4, what do you imagine:”. Three answers were offered: “7 + 7 + 7 + 7”, “4 + 4 + 4 + 4 + 4 + 4 + 4” and “It is indistinct”. A total of 1008 subjects participated: 73% chose the first option, 3% the second, and 24% the third (p < 9e-175, according to a two-tailed binomial test, neglecting the answers “It is indistinct”). We repeated the same poll hours later, replacing “7 x 4” by “4 x 7” in the question. A total of 1045 subjects participated: 68% chose the first option (i.e. “7 + 7 + 7 + 7”), 10% the second, and 27% the third (p < 3e-76, according to a two-tailed binomial test).

Results were consistent with our hypothesis: in Spanish, A x B is interpreted as A summed B times and, independently of the presented order, most people preferentially organize this operation as the max operand repeated a number of times given by the min operand. This suggests that, for Spanish speakers, a multiplication presented in the max-min order matches the more natural and easier mental representation of the product used during the learning process. Future comparative studies should explore the hypothesis that this effect may be reversed in phrasal structures (as in English and French) in which 8 times 3 is interpreted as 3 + 3 +… + 3 (8 times).

Above and beyond these results, our study bears an important practical conclusion, as a proof of concept: participants can be motivated to perform substantial arithmetic training simply by presenting it in a gamified format. Many similar initiatives have not been as successful in achieving such engagement, so we can ask, with the hope of improving future initiatives, what made this experience successful. The space of games is very high dimensional and complex, and hence no individual study can reach definite conclusions. However, we mention some aspects that in our experience turned out to be important: 1) an engaging gameflow, with difficulty levels carefully designed according to gaming best practices (as explained in Section 1); 2) the long-term goal of squaring large numbers, a classical operation for expert mental calculators that, because of its apparent difficulty, produces a hedonic feeling when achieved; and 3) the ‘collaborative science’ slogan used to launch the app, describing it as a new form of commons-based peer production of scientific data (from the feedback obtained, many participants were much more engaged knowing that this project was a way to contribute to the construction of scientific knowledge), together with the frequent posts about performance, which generated constant exchange between game programmers/designers and users as well as among users.

We would like to thank