Does mathematics training lead to better logical thinking and reasoning? A cross-sectional assessment from students to professors

  • Clio Cresswell,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Writing – original draft, Writing – review & editing

    Affiliation School of Mathematics and Statistics, The University of Sydney, Sydney, Australia

  • Craig P. Speelman

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    c.speelman@ecu.edu.au

    Affiliation School of Arts and Humanities, Edith Cowan University, Joondalup, Australia

Abstract

Mathematics is often promoted as endowing those who study it with transferable skills such as an ability to think logically and critically or to have improved investigative skills, resourcefulness and creativity in problem solving. However, there is scant evidence to back up such claims. This project tested participants with increasing levels of mathematics training on 11 well-studied rational and logical reasoning tasks aggregated from various psychological studies. These tasks, which included the Cognitive Reflection Test and the Wason Selection Task, are of particular interest because they have typically and reliably eluded participants in all studies, and because performance on them has been uncorrelated with general intelligence, education levels and other demographic information. The results of this study revealed that, in general, the greater the mathematics training of the participant, the more tasks were completed correctly, and that performance on some tasks was also associated with performance on others not traditionally associated with them. A ceiling effect also emerged. The work is discussed from the viewpoint of adding to the platform from which to approach the greater, and more scientifically elusive, question: are any skills associated with mathematics training innate, or do they arise from skills transfer?

Introduction

Mathematics is often promoted as endowing those who study it with a number of broad thinking skills, such as an ability to think logically, analytically, critically and abstractly, and a capacity to weigh evidence with impartiality. This view of mathematics as providing transferable skills can be found across educational institutions, governments and corporations worldwide, and it is material to the place of mathematics in curricula.

Consider the UK government’s commissioned inquiry into mathematics education, “Making Mathematics Count”, which asserts that “mathematical training disciplines the mind, develops logical and critical reasoning, and develops analytical and problem-solving skills to a high degree” [1 p11]. The Australian Mathematical Sciences Institute states very broadly in its policy document “Vision for a Maths Nation” that “Not only is mathematics the enabling discipline, it has a vital productive role planning and protecting our well-being” (emphasis in original) [2]. In Canada, British Columbia’s new 2016 K-9 curriculum expressly mentions as part of its “Goals and Rationale”: “The Mathematics program of study is designed to develop deep mathematical understanding and fluency, logical reasoning, analytical thought, and creative thinking.” [3]. Universities, too, often make such specific claims with respect to their teaching programs. “Mathematics and statistics will help you to think logically and clearly, and apply a range of problem-solving strategies” is claimed by the School of Mathematical Sciences at Monash University, Australia [4]. The School of Mathematics and Statistics at The University of Sydney, Australia, directly attributes to particular course objectives and outcomes skills that include “enhance your problem-solving skills” in first year [5], “develop logical thinking” in second year (a statement in fact drafted by the lead author) [6], and “be fluent in analysing and constructing logical arguments” in third year [7]. The University of Cambridge’s Faculty of Mathematics, UK, provides a dedicated document, “Transferable Skills in the Mathematical Tripos”, as part of its undergraduate mathematics course information, which again lists “analytic ability; creativity; initiative; logical and methodical reasoning; persistence” [8].

In contrast, psychological research, which has been empirically investigating the concept of transferability of skills since the early 1900s, points in quite the opposite direction: reasoning skills appear to be highly domain specific [9]. Support for claims that studying mathematics engenders more than specific mathematics knowledge is therefore highly pertinent. And yet it is largely absent. The 2014 Centre for Curriculum Redesign (CCR) four-part paper “Mathematics for the 21st Century: What Should Students Learn?” concludes in its fourth paper, titled “Does mathematics education enhance higher-order thinking skills?”, with a call to action: “… there is not sufficient evidence to conclude that mathematics enhances higher order cognitive functions. The CCR calls for a much stronger cognitive psychology and neuroscience research base to be developed on the effects of studying mathematics” [10].

Inglis and Simpson [11], raising this very issue, examined the performance of first-year undergraduate students from a high-ranking UK university mathematics department on the “Four Cards Problem” thinking task, also known as the Wason Selection Task. It is stated as follows.

Each of the following cards have a letter on one side and a number on the other.

Here is a rule: “if a card has a D on one side, then it has a 3 on the other”. Your task is to select all those cards, but only those cards, which you would have to turn over in order to find out whether the rule is true or false. Which cards would you select?

This task involves understanding conditional inference, namely understanding the rule “If P then Q” and, with this, deducing the answer as “P and not Q”, or “D and 7”. Such logical deduction indeed presents as a good candidate for testing a potential ability of the mathematically trained. The task has also been substantially investigated in the psychology of reasoning [12 p8], revealing across a wide range of publications that only around 10% of the general population reach the correct result. The predominant mistake is to pick “D and 3”; in the original study by Wason [13], this combination was reportedly picked by 65% of people. This poor success rate, along with a standard mistake, has fuelled interest in the task as well as attempts to understand why it occurs. A prevailing theory is the so-named matching bias effect: the tendency to concentrate disproportionately on items specifically mentioned in the situation, as opposed to reasoning according to logical rules.
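To make the deduction concrete, the following is a minimal sketch in Python, assuming the classic visible faces D, K, 3 and 7 (the exact faces are an assumption here, since the original card figure is not reproduced above). A card must be turned over exactly when some possible hidden face could falsify the rule.

```python
# Sketch of the Four Cards logic under the rule
# "if a card has a D on one side, then it has a 3 on the other".
LETTERS = ["D", "K"]   # possible letter faces
NUMBERS = ["3", "7"]   # possible number faces

def violates(letter, number):
    # The rule is violated only by a card pairing D with a non-3.
    return letter == "D" and number != "3"

def must_turn(visible):
    # Turn the card iff at least one hidden face would violate the rule.
    if visible in LETTERS:          # hidden side is a number
        return any(violates(visible, n) for n in NUMBERS)
    return any(violates(l, visible) for l in LETTERS)  # hidden side is a letter

print([card for card in ["D", "K", "3", "7"] if must_turn(card)])
# -> ['D', '7'], i.e. "P and not Q"
```

The K card is irrelevant because the rule says nothing about non-D cards, and the 3 card cannot falsify the rule whatever is on its other side.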

Inglis and Simpson’s results singled out mathematically trained individuals with respect to this task. The participants were under time constraint, and 13% of the first-year undergraduate mathematics students sampled reached the correct response, compared to 4% of the non-mathematics (arts) students who were included. Of note also was that 24% of the mathematics students, as opposed to 45% of the non-mathematics students, chose the standard mistake. The study indeed unveiled that mathematically trained individuals were significantly less affected by the matching bias effect on this problem than individuals without mathematics training. However, the achievement of the mathematically trained group was still far from masterful, and their propensity for non-standard mistakes compared with non-mathematically trained people is suggestive. Mathematical training appears to engender a different thinking style, but it remains unclear what the difference is.

Inglis, Simpson and colleagues followed up their results with a number of studies concentrated on conditional inference in general [14, 15]. A justification for this single investigatory pathway is that if transfer of knowledge is present, itself something subtle to test for, a key consideration should be the generalisation of learning rather than the application of skills learned in one context to another (where experimenter bias in the choice of contexts is more likely to be an issue). For this they typically used sixteen “if P then Q” comprehension tasks, and their samples across a number of studies have included 16-year-old pre-university mathematics students (from England and Cyprus), mathematics honours students in their first year of undergraduate university study, third-year university mathematics students, and associated control groups. The studies have included controls for general intelligence and thinking disposition prior to training, as well as follow-ups of up to two years to address the issue of causation. The conclusive thinking pattern that has emerged is a tendency of the mathematical groups towards a greater likelihood of rejecting the invalid “denial of the antecedent” and “affirmation of the consequent” inferences. But alongside this, and this was validated by a second separate study, the English mathematics group actually became less likely to endorse the valid modus tollens inference. So again, mathematical training appears to engender a different thinking style, but there are subtleties, and it remains unclear what the exact difference is.

This project was designed to broaden the search on the notion that mathematics training leads to increased reasoning skills. We focused on a range of reasoning problems considered in psychological research to be particularly insightful into decision making, critical thinking and logical deduction, distinguished by the fact that the general population typically struggles to answer them correctly. An Australian sample adds diversity to the current enquiries, which have been European-focussed. Furthermore, in an effort to identify the impact of mathematics training through a possible gradation effect, individuals at different levels of mathematics training were tested.

The study

Well-studied thinking tasks from a variety of psychological studies were chosen. Their descriptions, associated success rates and other pertinent details follow. All were chosen because the correct answer typically eludes participants in favour of a standard mistake.

The three-item Cognitive Reflection Test (CRT) was used as introduced by Frederick [16]. This test was devised in line with the theory that there are two general types of cognitive activity: one that operates quickly and without reflection, and another that requires not only conscious thought and effort, but also an ability to reflect on one’s own cognition, including a step of suppressing the first type of response in order to reach the answer. The three items in the test each invite an incorrect “gut” response, and further cognitive skill is deemed necessary to reach the correct answer (although see [17] for evidence that correct responses can result from “intuition”, which could be related to intelligence [18]).

Lily pads

In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?

Widgets

If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?

Bat and ball

A bat and a ball cost $1.10 in total. The bat costs a dollar more than the ball. How much does the ball cost?

The solutions are: 47 days for the Lily Pads problem, 5 minutes for the Widgets problem and 5 cents for the Bat and Ball problem. The intuitive, but wrong, answers are considered to be 24 days, 100 minutes and 10 cents, respectively. These wrong answers are attributed to participants becoming so focused on the numbers that they ignore the exponential growth pattern in the Lily Pads problem, merely complete a pattern in numbers in the Widgets problem, and neglect the relationship “more than” in the Bat and Ball problem [19]. The original study by Frederick [16] provides a composite measure of performance on these three items, with only 17% of those studied (n = 3428) reaching the perfect score. The CRT has since been studied extensively [19–21]. Research using the CRT tends not to report performance on the individual items of the test, but rather a composite measure of performance. Attridge and Inglis [22] used the CRT as a test of the thinking disposition of mathematics students, as one way of disentangling filtering according to prior thinking styles from transference of knowledge in successful problem solving. They repeat-tested 16-year-old pre-university mathematics students and English literature students without mathematics subjects at a one-year interval and found no difference between the groups.
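As a quick check of the arithmetic behind the least transparent of these solutions, the Bat and Ball problem reduces to a one-variable equation, with x the cost of the ball in dollars:

```latex
x + (x + 1) = 1.10 \;\Longrightarrow\; 2x = 0.10 \;\Longrightarrow\; x = 0.05
```

The ball therefore costs 5 cents and the bat $1.05; the intuitive answer of 10 cents would make the bat only 90 cents more than the ball.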

Three problems were included that test the ability to reason about probability. All three were originally discussed by Kahneman and Tversky [23], with the typically poor performance on them explained by participants relying not on probability knowledge, but on a short-cut method of thinking known as the representativeness heuristic. In the late 1980s, Richard Nisbett and colleagues showed that graduate-level training in statistics, while not revealing any improvement in logical reasoning, did correlate with higher-quality statistical answers [24]. Their studies led in particular to the conclusion that comprehension of what is known as the law of large numbers did show improvement with training. The first of our next three problems targeted this law directly.

Hospitals

A certain town is served by two hospitals. In the larger hospital, about 45 babies are born each day, and in the smaller hospital, about 15 babies are born each day. As you know, about 50 percent of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50 percent, sometimes lower. For a period of one year, each hospital recorded the number of days on which more than 60 percent of the babies born were boys. Which hospital do you think recorded more such days? (Circle one letter.)

  1. (a). the larger hospital
  2. (b). the smaller hospital
  3. (c). about the same (that is, within 5 percent of each other)

Kahneman and Tversky [23] reported that, of 50 participants, 12 chose (a), 10 chose (b), and 28 chose (c). The correct answer is (b), for the reason that small samples are more likely to exhibit extreme events than large samples from the same population. The larger the sample, the more likely it will exhibit characteristics of the parent population, such as the proportion of boys to girls. However, people tend to discount or be unaware of this feature of sampling statistics, which Kahneman and Tversky refer to as the law of large numbers. Instead, according to Kahneman and Tversky, people tend to adhere to a fallacious law of small numbers, where even small samples are expected to exhibit properties of the parent population, as illustrated by the proportion of participants choosing the answer (c) in their 1972 study. Such thinking reflects use of the representativeness heuristic, whereby someone will judge the likelihood of an uncertain event based on how similar it is to characteristics of the parent population of events.
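The sampling effect at work here is easy to demonstrate numerically. Below is a minimal simulation sketch in Python; it is illustrative only and not an analysis from the paper (the birth counts follow the problem statement, while the number of simulated years is arbitrary).

```python
# Estimate the average number of days per year on which more than
# 60% of births at a hospital are boys, for a given daily birth count.
import random

def mean_extreme_days(n_births, days=365, trials=500):
    """Average number of days/year with more than 60% boys."""
    total = 0
    for _ in range(trials):
        for _ in range(days):
            boys = sum(random.random() < 0.5 for _ in range(n_births))
            if boys > 0.6 * n_births:
                total += 1
    return total / trials

print(mean_extreme_days(45))  # larger hospital: roughly 25 such days
print(mean_extreme_days(15))  # smaller hospital: roughly 55 such days
```

The smaller hospital records roughly twice as many extreme days, because small samples deviate from the population proportion more often.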

Birth order

All families of six children in a city were surveyed. In 72 families the exact order of births of boys and girls was GBGBBG.

  1. (a). What is your estimate of the number of families surveyed in which the exact order of births was BGBBBB?
  2. (b). In the same survey set, which, if any, of the following two sequences would be more likely: BBBGGG or GBBGBG?

All of the events listed in the problem have an equal probability, so the correct answer to (a) is 72, and to (b) is “neither is more likely”. Kahneman and Tversky [23] reported that 75 of 92 participants judged the sequence in (a) as less likely than the given sequence. A similar number (unspecified by Kahneman and Tversky, but the statistical effect was reported to be of the same order as in (a)) judged that GBBGBG was the more likely sequence. Again, Kahneman and Tversky suggested that these results reflected use of the representativeness heuristic. In the context of this problem, the heuristic takes the following form: some birth orders appear less patterned than others, and since less patterned sequences are associated with the randomness of birth order, they are judged more likely.
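The equal likelihood follows directly from independence: any specific ordering of six births, each boy or girl with probability one half, has the same probability,

```latex
P(\text{GBGBBG}) = P(\text{BGBBBB}) = P(\text{BBBGGG}) = P(\text{GBBGBG})
= \left(\tfrac{1}{2}\right)^{6} = \tfrac{1}{64}.
```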

Coin tosses

In a sequence of coin tosses (the coin is fair) which of the following outcomes would be most likely (circle one letter):

  1. (a). H T H T H T H T
  2. (b). H H H H T T T T
  3. (c). T T H H T T H H
  4. (d). H T T H T H H T
  5. (e). all of the above are equally likely

The correct answer in this problem is (e). Kahneman and Tversky [23] reported that participants tend to choose less patterned looking sequences (e.g., H T T H T H H T) as more likely than more systematic looking sequences (e.g., H T H T H T H T). This reasoning again reflects the representativeness heuristic.

Three further questions from the literature were included to test problem solving skill.

Two drivers

Two drivers set out on a 100-mile race that is marked off into two 50-mile sections. Driver A travels at exactly 50 miles per hour during the entire race. Driver B travels at exactly 45 mph during the first half of the race (up to the 50-mile marker) and travels at exactly 55 mph during the last half of the race (up to the finish line). Which of the two drivers would win the race? (Circle one letter.)

  1. (a). Driver A would win the race
  2. (b). Driver B would win the race
  3. (c). the two drivers would arrive at the same time (within a few seconds of one another)

This problem was developed by Pelham and Neter [25]. The correct answer is (a), which can be determined by calculating the driving time for each driver using time = distance/velocity. Pelham and Neter argue, however, that (c) is intuitively appealing, on the basis that both drivers appear to have the same overall average speed. Pelham and Neter reported that 67% of their sample gave this incorrect response to the problem, and a further 13% selected (b).
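The calculation itself is short:

```latex
t_A = \frac{100}{50} = 2\ \text{h}, \qquad
t_B = \frac{50}{45} + \frac{50}{55} = \frac{10}{9} + \frac{10}{11}
    = \frac{200}{99} \approx 2.02\ \text{h}.
```

Driver B loses more time in the slow half than is recovered in the fast half (the harmonic mean of 45 and 55 mph is 49.5 mph, below 50 mph), so Driver A wins.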

Petrol station

Imagine that you are driving along the road and you notice that your car is running low on petrol. You see two petrol stations next to each other, both advertising their petrol prices. Station A’s price is 65c/litre; Station B’s price is 60c/litre. Station A’s sign also announces: “5c/litre discount for cash!” Station B’s sign announces “5c/litre surcharge for credit cards.” All other factors being equal (for example, cleanliness of the stations, number of cars waiting at each etc), to which station would you choose to go, and why?

This problem was adapted from one described by Galotti [26], and is inspired by research reported by Thaler [27]. According to Thaler’s research, most people prefer Station A, even though both stations are offering the same deal: 60c/litre for cash, and 65c/litre for credit. Tversky and Kahneman [28] explain this preference by invoking the concept of framing effects. In the context of this problem, such an effect would involve viewing the outcomes as changes from some initial point. The initial point frames the problem, and provides a context for viewing the outcome. Thus, depending on the starting point, outcomes in this problem can be viewed as either a gain (in Station A, you gain a discount if you use cash) or a loss (in Station B, you are charged more (a loss) for using credit). Given that people are apparently more concerned about a loss than a gain [29], the loss associated with Station B makes it the less attractive option, and hence the preference for Station A. The correct answer, though, is that the stations are offering the same deal and so no station should be preferred.

And finally, a question described by Stanovich [30, 31] as testing our predisposition for cognitive operations that require the least computational effort.

Jack looking at Anne

Jack is looking at Anne, but Anne is looking at George. Jack is married, but George is not. Is a married person looking at an unmarried person? (Circle one letter.)

  1. (a). Yes
  2. (b). No
  3. (c). Cannot be determined

Stanovich reported that over 80% of people choose the “lazy” answer (c). The correct answer, however, is (a), which follows from considering both cases: if Anne is married, then she (married) is looking at the unmarried George; if Anne is unmarried, then the married Jack is looking at her. Either way, a married person is looking at an unmarried person.

The above questions survey, in a clear problem solving setting, an ability to engage advanced cognitive processing in order to critically evaluate and possibly override initial gut reasoning, an ability to reason about probability within the framework of the law of large numbers and the relationship between randomness and patterning, an ability to isolate salient features of a problem and, with the last question in particular, an ability to map logical relations. It might be hypothesised, in line with the knowledge base provided by mathematical training and the associated claims of broad and enhanced problem-solving abilities, that participants with greater degrees of such training would outperform others on these questions. This hypothesis was investigated in this study. In addition, given that no previous study on this issue has examined the variety of problems used here, we also undertook an exploratory analysis to investigate whether there exist any associations between the problems in terms of their likelihood of correct solution. Similarities between problems might indicate which problem solving domains could be susceptible to the effects of mathematics training.

Method

Design

A questionnaire was constructed containing the problems described in the previous sections plus the Four Cards Problem as tested by Inglis and Simpson [11] for comparison. The order of the problems was as follows: 1) Lily Pads; 2) Hospitals; 3) Widgets; 4) Four Cards; 5) Bat and Ball; 6) Birth Order; 7) Petrol Station; 8) Coin Tosses; 9) Two Drivers; 10) Jack looking at Anne. It was administered to five groups distinguished by level of mathematics training, drawn from a high-ranking Australian university, where the teaching year is separated into two semesters and where being a successful university applicant requires having been highly ranked against peers in terms of intellectual achievement:

  1. Introductory—First year, second semester, university students with weak high school mathematical results, only enrolled in the current unit as a compulsory component for their chosen degree, a unit not enabling any future mathematical pathway, a typical student may be enrolled in a Biology or Geography major;
  2. Standard—First year, second semester, university students with fair to good high school mathematical results, enrolled in the current mathematics unit as a compulsory component for their chosen degree with the possibility of including some further mathematical units in their degree pathway, a typical student may be enrolled in an IT or Computer Science major;
  3. Advanced1—First year, second semester, university mathematics students with very strong interest as well as background in mathematics, all higher year mathematical units are included as possible future pathway, a typical student may be enrolled in a Mathematics or Physics major;
  4. Advanced2—Second year, second semester, university mathematics students with strong interest as well as background in mathematics, typically a direct follow on from the previously mentioned Advanced1 cohort;
  5. Academic—Research academics in the mathematical sciences.

Participants

123 first-year university students volunteered during “help on demand” tutorial times containing up to 30 students. These are course-allocated times that are supervised yet self-directed by students; this minimised disruption and discouraged coercion. 44 second-year university students completed the questionnaire during a weekly one-hour time slot dedicated to putting the latest mathematical concepts into practice with the lecturer (where, by contrast with what occurs in tutorial times, the lecturer does most of the work and all enrolled students are invited). All these university students completed the questionnaire in normal classroom conditions; they were not placed under strict examination conditions. The lead author walked around to prevent discussion and coercion, and there was minimal disruption. 30 research academics responded to local advertising and answered the questionnaire in their workplace while supervised.

Procedure

The questionnaires were voluntary, anonymous and confidential. Participants were free to withdraw from the study at any time and without any penalty; no participant took this option. The questionnaires gathered demographic information including age, level of education attained, current qualification pursued, name of last qualification and years since obtaining it, and an option for research academics to note their current speciality. Each problem task was placed on a separate page. Participants were not placed under time constraint but, while supervised, were asked to write their start and finish times on the front page of the survey to record approximate completion times. Speed of completion was not incentivised. Participants were not allowed to use calculators. A final “Comments Page” gave the option for feedback, including, specifically, whether participants had previously seen any of the questions. Questionnaires were administered in person and supervised to avoid collusion or consulting of external sources.

The responses were coded four ways: A) correct; B) standard error (the errors discussed above in The Study); C) other error; D) left blank.

The ethical aspects of the study were approved by the Human Research Ethics Committee of the University of Sydney, protocol number [2016/647].

Results

The first analysis examined the total number of correct responses provided by the participants as a function of group. Scores ranged from 1 to 11 out of a total possible of 11 (Problem 6 had 2 parts) (Fig 1). An ANOVA of this data indicated a significant effect of group (F(4, 192) = 20.426, p < .001, partial η2 = .299). Pairwise comparisons using Tukey’s HSD test indicated that the Introductory group performed significantly worse than the Advanced1, Advanced2 and Academic groups. There were no significant differences between the Advanced1, Advanced2 and Academic groups.
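To illustrate the form of this analysis, below is a minimal one-way ANOVA sketch in Python using scipy; the per-group score lists are hypothetical stand-ins, not the study’s data.

```python
# Illustrative one-way ANOVA on total-correct scores per group.
# The scores below are made up; the paper's real data gave
# F(4, 192) = 20.426, p < .001.
from scipy.stats import f_oneway

introductory = [3, 4, 5, 6, 4, 5, 7]
standard     = [5, 6, 7, 6, 8, 5, 7]
advanced1    = [8, 9, 10, 9, 11, 8, 10]
advanced2    = [9, 10, 10, 11, 9, 10, 8]
academic     = [10, 9, 11, 10, 11, 9, 10]

F, p = f_oneway(introductory, standard, advanced1, advanced2, academic)
print(f"F = {F:.3f}, p = {p:.5f}")
```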

Fig 1. Mean number of correct responses (numbcorr) for each group.

Error bars are one standard error of the mean.

https://doi.org/10.1371/journal.pone.0236153.g001

Overall solution time, while recorded manually and approximately, was positively correlated with group, such that the more training someone had received, the longer were these solution times (r(180) = 0.247, p = .001). However, as can be seen in Fig 2, this relationship is not strong.

Fig 2. Box and whisker diagrams depicting mean approximate response time (minutes) for each group.

https://doi.org/10.1371/journal.pone.0236153.g002

A series of chi-squared analyses, and their Bayesian equivalents, were performed on each problem, to determine whether the distribution of response types differed as a function of group. To minimise the number of cells in which expected values in some of these analyses were less than 5, the Standard Error, Other Error and Blank response categories were collapsed into one category (Incorrect Response). For three of the questions, the expected values of some cells did fall below 5, and this was due to most people getting the problem wrong (Four Cards), or most people correctly responding to the problem (Bat and Ball, Coin Tosses). In these cases, the pattern of results was so clear that a statistical analysis was barely required. Significant chi-squared results were examined further with pairwise posthoc comparisons (see Table 1).
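The form of these analyses can be sketched as follows, using scipy with hypothetical correct/incorrect counts per group (the study’s actual counts underlie Table 1); collapsing the error subtypes into a single Incorrect category is what keeps expected cell counts workable.

```python
# Illustrative chi-squared test of response type by group for one
# problem. The counts are made up for demonstration purposes.
from scipy.stats import chi2_contingency

#            Correct  Incorrect
counts = [[  5,       40],   # Introductory
          [  8,       35],   # Standard
          [ 10,       30],   # Advanced1
          [ 17,       27],   # Advanced2
          [ 11,       19]]   # Academic

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```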

Table 1. Proportion of participants selecting particular response types for all problems as a function of group.

https://doi.org/10.1371/journal.pone.0236153.t001

Four cards

The three groups with the least amount of training in mathematics were far less likely than the other groups to give the correct solution (χ2 (4) = 31.06, p < .001; BF10 = 45,045) (Table 1). People in the two most advanced groups (Advanced2 and Academic) were more likely to solve the card problem correctly, although still fewer than half of the people in these groups did so. Further, these people were less likely to give the standard incorrect solution, so that most who were incorrect suggested some more cognitively elaborate answer, such as turning over all cards. The proportions of people in the Advanced2 and Academic groups (39% and 37%) who solved the problem correctly far exceeded the typical proportion observed with this problem (10%). Of note, also, is the relatively high proportion of those in the higher training groups who, when they made an error, did not make the standard error, a result similar to the one reported by Inglis and Simpson [11].

The cognitive reflection test

In the Lily Pads problem, although most people in the Standard, Advanced1, Advanced2 and Academic groups were likely to select the correct solution, it was also the case that the less training someone had received in mathematics, the more likely they were to select an incorrect solution (χ2 (4) = 27.28, p < .001; BF10 = 15,554), with the standard incorrect answer being the next most prevalent response for the two lower ability mathematics groups (Table 1).

Performance on the Widgets problem was similar to performance on the Lily Pads problem in that most people in the Standard, Advanced1, Advanced2 and Academic groups were likely to select the correct solution, but the less training someone had received in mathematics, the more likely they were to select an incorrect solution (χ2 (4) = 23.76, p < .001; BF10 = 516) (Table 1). As with the Lily Pads and Widgets problems, people in the Standard, Advanced1, Advanced2 and Academic groups were highly likely to solve the Bat and Ball problem (χ2 (4) = 35.37, p < .001; BF10 = 208,667). Errors were more likely from the least mathematically trained people (Introductory, Standard) than from the other groups (Table 1).

To compare performance on the CRT with previously published results, performance on the three problems (Lily Pads, Widgets, Bat and Ball) was combined. The number of people in each condition who solved 0, 1, 2, or 3 problems correctly is presented in Table 2. The Introductory group was evenly distributed amongst the four categories, with 26% solving all three problems correctly. Around 70% of each of the remaining groups solved all three problems correctly, which is vastly superior to the 17% reported by Frederick [16].

Table 2. Performance on the Cognitive Reflection Test as a function of group.

https://doi.org/10.1371/journal.pone.0236153.t002

Hospitals

Responses to the Hospitals problem were almost universally split between correct and standard errors in the Standard, Advanced1, Advanced2 and Academic groups. Although this pattern of responses was also evident in the Introductory group, this group also exhibited more non-standard errors and non-responses than the other groups. However, the differences between the groups were not significant (χ2 (4) = 4.93, p = .295; BF10 = .068) (Table 1). Nonetheless, the performance of all groups exceeds the 20% correct response rate reported by Kahneman and Tversky [23].

Birth order

The two versions of the Birth Order problem showed similar results, with correct responses being more likely in the groups with more training (i.e., Advanced1, Advanced2 and Academic), and responses being shared amongst the various categories in the Introductory and Standard groups (part (a): χ2 (4) = 24.54, p < .001; BF10 = 1,303; part (b): χ2 (4) = 25.77, p < .001; BF10 = 2,970) (Table 1). Nonetheless, performance on both versions of the problem in this study was significantly better than the 82% error rate reported by Kahneman and Tversky [23].

Coin tosses

The Coin Tosses problem was performed well by all groups, with very few people in any condition committing errors. There were no obvious differences between the groups (χ2 (4) = 3.70, p = .448; BF10 = .160) (Table 1). Kahneman and Tversky [23] reported that people tend to make errors on this type of problem by choosing less patterned looking sequences, but they did not report relative proportions of people making errors versus giving correct responses. Clearly the sample in this study did not perform like those in Kahneman and Tversky’s study.

Two drivers

Responses on the Two Drivers problem were clearly distinguished by a high chance of error in the Introductory and Standard groups (over 80%), and a fairly good chance of being correct in the Advanced1, Advanced2 and Academic groups (χ2 (4) = 46.16, p < .001; BF10 = 1.32 x 108) (Table 1). Academics were the standout performers on this problem, although over a quarter of this group produced an incorrect response. Thus, the first two groups performed similarly to the participants in the Pelham and Neter [25] study, 80% of whom gave an incorrect response.

Petrol station

Responses on the Petrol Station problem were marked by good performance by the Academic group (73% providing a correct response), with just over half of each of the other groups correctly solving the problem. This difference was not significant (χ2 (4) = 4.68, p = .322; BF10 = .059) (Table 1). Errors were fairly evenly balanced between standard and other, except for the Academic group, who were more likely to provide a creative answer if they made an error. Thaler [27] reported that most people get this problem wrong. In this study, however, on average, most people got this problem correct, although this average was boosted by the Academic group.

Jack looking at Anne

Responses on the Jack looking at Anne problem generally were standard errors, except for the Advanced2 and Academic groups, which were evenly split between standard errors and correct responses (χ2 (4) = 18.03, p = .001; BF10 = 46) (Table 1). Thus, apart from these two groups, the error rate in this study was similar to that reported by Stanovich [30], where 80% of participants were incorrect.

A series of logistic regression analyses were performed in order to examine whether the likelihood of solving a particular problem correctly could be predicted on the basis of whether other problems were solved correctly. Each analysis involved selecting performance (correct or error) on one problem as the outcome variable, and performance on the other problems as predictor variables. Training (amount of training) was also included as a predictor variable in each analysis. A further logistic regression was performed with training as the outcome variable, and performance on all of the problems as predictor variables. The results of these analyses are summarised in Table 3. There were three multi-variable relationships observed in these analyses, which can be interpreted as the likelihood of solving one problem in each group being associated with solving the others in the set. These sets were: (1) Lily Pads, Widgets and Petrol Station; (2) Hospitals, Four Cards and Two Drivers; (3) Birth Order and Coin Tosses. Training also featured in each of these sets, moderating the relationships as per the results presented above for each problem.
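To make the structure of these analyses concrete, here is a minimal sketch of one such logistic regression in Python using statsmodels; the variable names and 0/1 data are hypothetical stand-ins, and the paper’s actual coding may differ.

```python
# Sketch: predict correctness on one problem (1 = correct, 0 = error)
# from correctness on other problems plus training level.
# All data below are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "lily_pads":      [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    "widgets":        [1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "petrol_station": [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "training":       [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2],
})

X = sm.add_constant(df[["widgets", "petrol_station", "training"]])
y = df["lily_pads"]                  # outcome: one problem's correctness

model = sm.Logit(y, X).fit(disp=0)   # suppress optimizer output
print(model.params)                  # coefficient for each predictor
```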

Table 3. Logistic regression analyses, using performance on all problems (predictor variables), except one (outcome variable), to predict performance on the outcome variable.

https://doi.org/10.1371/journal.pone.0236153.t003

The final “Comments Page” revealed that participants overwhelmingly enjoyed the questions. Any analysis of previous exposure to the tasks proved impossible, as there was little to no alignment on participants’ degree of recall, if any, or even on perceptions of what exposure entailed. For example, some participants confused being exposed to the particular tasks with being habitually exposed to puzzles, or even mathematics problems, more broadly.

Discussion

In general, the amount of mathematics training a group had received predicted their performance on the overall set of problems. The greater the training, the more problems were answered correctly, and the slower the recorded response times. There was not an obvious difference between the Advanced1, Advanced2 and Academic groups on either of these measures; however, there were clear differences between these groups and the Introductory and Standard groups, with the former exhibiting clearly superior accuracy. While time records were taken only approximately, so as to avoid adding time pressure as a variable, the fact that the Advanced1, Advanced2 and Academic groups recorded more time in their consideration of the problems may suggest that a “pause and consider” approach to such problems is a characteristic of the advanced groups. This is in line with what was suggested by an eye-movement tracking study of mathematically trained students attempting the Four Cards Problem, where participants who had not chosen the standard error had spent longer considering the card linked to the matching bias effect [14]. It is important to note, however, that longer response times may reflect cognitive processes other than deliberation [32].

Performance on some problems was associated with performance on other problems. That is, if someone correctly answered a problem in one of these sets, they were also highly likely to correctly answer the other problems in the set. These sets were: (1) Lily Pads, Widgets and Petrol Station; (2) Hospitals, Four Cards and Two Drivers; (3) Birth Order and Coin Tosses. This differs from how these problems have typically been clustered a priori in the research literature: (I) Lily Pads, Widgets and Bat and Ball (CRT); (II) Hospitals and Two Drivers (explained below); (III) Hospitals, Birth Order and Coin Tosses (representativeness heuristic); (IV) Birth Order and Coin Tosses (probability theory). Consideration of these problem groupings follows.

Correctly answering all three problems in (I) entailed not being distracted by particular pieces of information in the problems so as to stay focused on uncovering the real underlying relationships. The Lily Pads and Widgets problems can mislead if attention is over-focused on the numbers, and conversely, the Petrol Station problem can mislead if there is too much focus on the idea of a discount. While the Lily Pads and Widgets problems are traditionally paired with the Bat and Ball problem in the CRT, it may be that performance on the Bat and Ball problem did not appear as part of this set due to an added level of difficulty. With the problems in (I), avoiding being distracted by certain parts of the questions at the expense of others leads almost directly to the correct answer. With the Bat and Ball problem, however, further steps in mathematical reasoning are still needed: one must find two numbers that sum to one given value while differing by another.

With the problems in (II), it is of interest that the Two Drivers problem was created specifically to be paired with the Hospitals problem to test for motivation in problem solving [23]. Within this framework, more transparent versions of these problems were successfully devised to manipulate difficulty. The Two Drivers problem was amended to have Driver B travelling at exactly 5 mph during the first half of the race and at exactly 95 mph during the last half. The Hospitals problem was amended so that the smaller hospital would have “only 2” babies born each day, and so that, for a period of one year, the hospitals recorded the number of days on which all of the babies born were boys. Could the association in (II) be pointing to how participants overcome initial fictitious mathematical rules? Maybe they reframe the question in simpler terms to see the pattern. The Four Cards Problem also elicited a high number of incorrect answers where, associated with mathematical training, the standard incorrect solution was avoided in favour of more cognitively elaborate ones. Indeed, a gradation effect appeared across the groups whereby the standard error of the “D and 3” cards becomes “D only” (Table 4). Adrian Simpson and Derrick Watson found a comparable result across their two groups [14 p61]. This could again indicate that, having avoided an initial fictitious rule of simply concentrating on items directly found in the question, participants then seek to reframe the question to unearth the logical rule to be deduced. An added level of difficulty with this question may be why participants become trapped in a false answer. The eye-movement tracking study mentioned above supports this theory.

Table 4. Error distribution on the Four Cards Problem as a function of group.

https://doi.org/10.1371/journal.pone.0236153.t004

The problems in (III) fit naturally together as part of basic probability theory, a topic participants would have assimilated, or not, as part of various education curricula. While the equal likelihood of all possible outcomes with respect to a coin toss may be culturally assimilated, the same may not be as straightforward for birth gender outcomes where such assumptions could be swayed by biological hypothesis or folk wisdom [33]. The gradation of the results in terms of mathematical training does not support this possibility.

The effect of training on performance accuracy was more obvious in some problems compared to others, and to some extent, this was related to the type of problem. For instance, most of the problems in which performance was related to training (Four Cards, CRT [Lily Pads, Widgets, Bat and Ball], Two Drivers, Jack looking at Anne) could be classed as relying on logical and/or critical thinking. The one exception was the Birth Order problems, which are probability related.

In contrast, two of the three problems in which training did not appear to have much impact on performance (Hospitals and Coin Tosses) require domain-specific knowledge. The Hospitals problem requires a degree of knowledge about sampling statistics, a topic of quite distinct flavour that not all mathematically trained individuals gain familiarity with. On the other hand, all groups performing well on the Coin Tosses problem is in line with a level of familiarity with the basic probability originally presented at high school. While the questioning of patterning as negatively correlated with randomness is similar to that appearing in the Birth Order problem, this aspect is arguably more concealed in the Birth Order problem. These results and problem grouping (III) could be pointing to an area for improvement in teaching, where the small gap in knowledge required to go from answering the Coin Tosses problem correctly to achieving similarly on the Birth Order problem could be easily addressed. A more formal introduction to sampling statistics in mathematical training could potentially bridge this gap, and could further extend towards improvement on the Hospitals problem.

The other problem where performance was unrelated to training, the Petrol Station problem, cannot be characterised similarly. It is more of a logical/critical thinking type problem, where there remains some suggestion that training may have impacted performance, as the Academic group seemed to perform better than the rest of the sample. An alternate interpretation of this result is therefore that this problem should not be isolated but grouped with the other problems where performance is affected by training.

Although several aspects of the data suggest mathematics training improves the chances that someone will solve problems of the sort examined here, differences in the performance of participants in the Advanced1, Advanced2 and Academic groups were not obvious. This is despite the fact that large differences exist in the amount of training across these three groups: the first two were undergraduate students, while the Academic group all had PhDs and many were experienced academic staff. One interpretation of this result is that current mathematics training can only take someone so far in terms of improving their abilities with these problems. There is a point of demarcation to consider in terms of mathematical knowledge between the Advanced1, Advanced2 and Academic groups as compared to the Introductory and Standard groups. In Australia, students are able to drop mathematical study at ages 15–16 years, or to choose between a number of increasingly involved levels of mathematics. For the university in this study, students are filtered upon entry into mathematics courses according to their current knowledge status. All our groups involved students who had opted for post-compulsory mathematics at high school. And since our testing occurred in second semester, some of the mathematical knowledge shortfalls present upon arrival had been bridged in first semester: students must pass a first semester course to be allowed entry into the second semester course. A breakdown of the mathematics background of each group is as follows:

  1. The Introductory group’s mathematics high school syllabus studied prior to first semester course entry covered: Functions, Trigonometric Functions, Calculus (Introduction to Differentiation, Applications of the Derivative, Antiderivatives, Areas and the Definite Integral), Financial Mathematics, Statistical Analysis. The Introductory group then explored concepts in mathematical modelling with emphasis on the importance of calculus in their first semester of mathematical studies.
  2. The Standard group’s mathematics high school syllabus studied prior to first semester course entry covered: Functions, Trigonometric Functions, Calculus (Rates of Change, Integration including the method of substitution, trigonometric identities and inverse trigonometric functions, Areas and Volumes of solids of revolution, some differential equations), Combinatorics, Proof (with particular focus on Proof by Mathematical Induction), Vectors (with application to projectile motion), Statistical Analysis. In first semester their mathematical studies then covered a number of topics the Advanced1 group studied prior to gaining entrance at university; further details on this are given below.
  3. The Advanced1 group’s mathematics high school syllabus studied prior to first semester course entry covered: the same course content the Standard group covered at high school plus extra topics on Proof (develop rigorous mathematical arguments and proofs, specifically in the context of number and algebra and further develop Proof by Mathematical Induction), Vectors (3 dimensional vectors, vector equations of lines), Complex Numbers, Calculus (Further Integration techniques with partial fractions and integration by parts), Mechanics (Application of Calculus to Mechanics with simple harmonic motion, modelling motion without and with resistance, projectiles and resisted motion). The Standard group cover these topics in their first semester university studies in mathematics with the exclusion of further concepts of Proof or Mechanics. In first semester the Advanced1 group have built on their knowledge with an emphasis on both theoretical and foundational aspects, as well as developing the skill of applying mathematical theory to solve practical problems. Theoretical topics include a host of theorems relevant to the study of Calculus.

In summary, at the point of our study, the Advanced1 group had more knowledge of, and practice with, rigorous mathematical arguments and proofs in the context of number and algebra, and more in-depth experience with Proofs by Induction, but the bulk of their extra knowledge rests with a much deeper knowledge of Calculus. They have had longer experience with a variety of integration techniques, and have worked with a variety of applications of calculus to solve practical problems, including a large section on mechanics at high school. In first semester at university there has been a greater focus on theoretical topics, including a host of theorems and associated proofs relevant to the topics studied. Compared to the Introductory and Standard groups, the Advanced1 group have only widened the mathematics knowledge gap since their choice of post-compulsory mathematics at high school. The Advanced2 group come directly from an Advanced1 cohort, and the Academic group would have reached the Advanced1 group’s proficiency as part of their employment. So, do specific reasoning skills result from this level of abstract reasoning? Our findings suggest this should certainly be an area of investigation, and it links interestingly with other research. In studying one of the thinking tasks in particular (the Four Cards Problem), and its context of conditional inference more specifically, Inglis and Simpson [15] found a clear difference between undergraduates in mathematics and undergraduates in other university disciplines, yet also showed a lack of development over first-year university studies on conditional inference measures. A follow-up study by Attridge and Inglis [22] then zeroed in on post-compulsory high school mathematical training and found that students with such training did develop their conditional reasoning to a greater extent than their control group over the course of a year, despite having received no explicit tuition in conditional logic. The development, though, whilst demonstrated as not being the result of a domain-general change in cognitive capacity or thinking disposition, and as most likely associated with the domain-specific study of mathematics, revealed a complex pattern of endorsing more of some inferences and less of others. The study here focused on a much broader problem set associated with logical and critical thinking, and it too is suggestive of a more complex picture of how mathematics training may be contributing to problem solving styles. A more intricate picture of the impact of mathematical training on problem solving techniques is emerging and requires consideration.

There is also a final interpretation to consider: that people in the Advanced1, Advanced2 and Academic groups did not gain anything from their mathematics training in terms of their ability to solve these problems. Instead, given studies denying any correlation of many of these problems with what is currently measured as intelligence [30], they might simply have been people of a particular intelligence or thinking disposition to start with, able to use that intelligence not only to solve these problems but also to survive the challenges of their mathematics training.

That the CRT has traditionally been used as a measure of baseline thinking disposition, and that performance on it has been found to be immutable across groups tested, is of particular interest since our results show a clear possible training effect on these questions. The CRT is tied to a willingness to engage in effortful thinking, which presents as an ability suitable for training. It is beyond the scope of this study, but a thorough review of CRT testing could suggest a broader appreciation of, and a better framework for understanding, thinking disposition, ability and potential ability.

Mathematical training appears associated with certain thinking skills, but there are clearly some subtleties that need to be extricated. The thinking tasks here add to the foundational results, the aim being a firmer platform on which to eventually base more targeted and illustrative inquiry. If thinking skills can be fostered, could first year university mathematics teaching be improved so that all students in that cohort reach the Advanced1 group’s level of reasoning? Do university mathematics courses become purely about domain-specific knowledge from this point on? Intensive training has been shown to impact the brain and cognition across a number of domains, from music [34], to video gaming [35], to Braille reading [36]. The hypothesis that mathematics, with its highly specific practice, fits within this list remains legitimate, but simply uncharted. With our current level of understanding it is worth appreciating the careful wording of the NYU Courant Institute on ‘Why Study Math?’, where there is no assumption of causation: “Mathematicians need to have good reasoning ability in order to identify, analyze, and apply basic logical principles to technical problems.” [37].

Limitations

One possible limitation of the current study is that the problems may have been too easy for the more advanced people, and so we observed a ceiling effect (i.e., some people obtained 100% correct on all problems). This was most obvious in the Advanced1, Advanced2 and Academic groups. It is possible that participants in these groups had developed logical and critical thinking skills throughout their mathematical training that were sufficient to cope with most of the problems used in this study, and so this would support the contention that training in mathematics leads to the development of logical and critical thinking skills useful in a range of domains. Another interpretation is that participants in these groups already possessed the necessary thinking skills for solving the problems in this study, which is why they are able to cope with the material in the advanced units they were enrolled in, or complete a PhD in mathematics and hold down an academic position in a mathematics department. This would then suggest that training in mathematics had no effect on abstract thinking skills—people in this study possessed them to varying extents prior to their studies. This issue might be settled in a future study that used a greater number of problems of varying difficulties to maximise the chances of finding a difference between the three groups with the most amount of training. Alternatively, a longitudinal study that followed people through their mathematics training could determine whether their logical and critical thinking abilities changed throughout their course.

A further limitation of the study may be that several of the reasoning biases examined in this study were measured by only one problem each (i.e., Four Cards Problem, Two Drivers, Petrol Station, Jack looking at Anne). A more reliable measure of these biases could be achieved by including more problems that tap into these biases. This would, however, increase the time required of participants during data collection, and in the context of this study, would mean a different mode of testing would likely be required.

Conclusion

Broad sweeping intuitive claims of the transferable skills endowed by a study of mathematics require evidence. Our study uniquely covers a wide range of participants, from those with limited mathematics training through to research academics in the mathematical sciences. It furthermore considered performance on 11 well-studied thinking tasks that typically elude participants in psychological studies and on which results have been uncorrelated with general intelligence, education levels and other demographic information [15, 16, 30]. We identified different performances on these tasks by groups at different levels of mathematical training, including on the CRT, which has developed into a method of measuring baseline thinking disposition. We identified different distributions of types of errors for the mathematically trained. We furthermore identified a performance threshold, existing at first year university, for those with high-level mathematics training. This study thus provides insight into possible changes and adjustments to mathematics courses so that they fulfil their advertised goal of delivering improved rational and logical reasoning to a greater number of students.

It is central to any education program to have a clear grasp of the nature of what it delivers and how, but arguably especially so for the core discipline that is mathematics. In 2014 the Office of The Chief Scientist of Australia released a report, “Australia’s STEM workforce: a survey of employers”, where transferable skills attributed to mathematics were also among those employers deemed most valuable [38]. A better understanding of what mathematics delivers in this space is an opportunity to truly capitalise on this historical, culture-crossing subject.

Acknowledgments

The authors would like to thank Jacqui Ramagge for her proof reading and input, as well as support towards data collection.

References

  1. Smith A. Making mathematics count: The report of Professor Adrian Smith’s inquiry into post-14 mathematics education. London: The Stationery Office; 2004.
  2. AMSI. Vision for a Maths Nation. 2015. http://amsi.org.au/publications/a-vision-for-a-maths-nation/
  3. British Columbia [Internet]. Mathematics; Goals and Rationale. 2016 [cited 2019 Dec 5]. https://curriculum.gov.bc.ca/curriculum/mathematics/core/goals-and-rationale
  4. Monash University [Internet]. Mathematical Sciences. 2019 [cited 2019 Jul 30]. https://www.monash.edu/science/schools/mathematical-sciences/current.
  5. The University of Sydney [Internet]. MATH1014. 2017 [cited 2019 Dec 5]. http://www.maths.usyd.edu.au/u/UG/TU/YR1ADMIN/r/MATH1014.pdf.
  6. The University of Sydney [Internet]. MATH2965. 2016 [cited 2016 Dec 12]. http://www.maths.usyd.edu.au/u/UG/IM/MATH2965/
  7. The University of Sydney [Internet]. MATH3066. 2017 [cited 2017 Dec 8]. http://www.maths.usyd.edu.au/u/UG/SM/MATH3066/r/2017info3066.pdf.
  8. Cambridge University [Internet]. Mathematical Tripos. 2019 [cited 2019 Jul 30]. https://www.maths.cam.ac.uk/undergrad/course/transferable_skills.
  9. Speelman CP, Kirsner K. Beyond the learning curve: The construction of mind. Oxford: Oxford University Press; 2005.
  10. Fadel C. Mathematics for the 21st Century: What Should Students Learn? Boston, Massachusetts: Center for Curriculum Redesign; 2014.
  11. Inglis M, Simpson A. Heuristic biases in mathematical reasoning. In: Chick HL, Vincent JL, editors. Proceedings of the 29th Conference of the International Group for the Psychology of Mathematics Education. Melbourne: PME; 2005. p. 177–84.
  12. Manktelow KI. Reasoning and Thinking. UK: Psychology Press; 1999.
  13. Wason P. Reasoning about a rule. Quarterly Journal of Experimental Psychology. 1968; 20: 273–81. pmid:5683766
  14. Inglis M, Attridge N. Does mathematical study develop logical thinking? Testing the theory of formal discipline. London: World Scientific Publishing Europe Ltd; 2016.
  15. Inglis M, Simpson A. Conditional inference and advanced mathematical study: further evidence. Educational Studies in Mathematics. 2009; 72: 185–98.
  16. Frederick S. Cognitive reflection and decision making. Journal of Economic Perspectives. 2005; 19: 25–42.
  17. Bago B, De Neys W. The smart System 1: Evidence for the intuitive nature of correct responding on the bat-and-ball problem. Thinking & Reasoning. 2019; 25(3): 257–99.
  18. Thompson VA, Pennycook G, Trippas D, Evans JStBT. Do smart people have better intuitions? Journal of Experimental Psychology: General. 2018; 147(7): 945–61. pmid:29975089
  19. Alós-Ferrer C, Garagnani M, Hügelschäfer S. Cognitive reflection, decision biases, and response times. Frontiers in Psychology. 2016; 7: 1402. pmid:27713710
  20. Toplak ME, West RF, Stanovich KE. The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory & Cognition. 2011; 39: 1275–89.
  21. West RF, Meserve RJ, Stanovich KE. Cognitive sophistication does not attenuate the bias blind spot. Journal of Personality and Social Psychology. 2012; 103: 506–19. pmid:22663351
  22. Attridge N, Inglis M. Advanced mathematical study and the development of conditional reasoning skills. PLOS ONE. 2013; 8: e69399. https://doi.org/10.1371/journal.pone.0069399 pmid:23869241
  23. Kahneman D, Tversky A. Subjective probability: A judgment of representativeness. Cognitive Psychology. 1972; 3: 430–54.
  24. Nisbett RE. Can reasoning be taught? In: Callan E, Grotzer T, Kagan J, Nisbett RE, Perkins DN, Shulman LS, editors. Education and a Civic Society: Teaching Evidence-based Decision Making. Cambridge, MA: American Academy of Arts & Sciences; 2009.
  25. Pelham BW, Neter E. The effect of motivation of judgment depends on the difficulty of the judgment. Journal of Personality and Social Psychology. 1995; 68(4): 581–94.
  26. Galotti KM. Cognitive psychology in and out of the laboratory. Belmont, CA: Brooks/Cole; 1994.
  27. Thaler RH. Toward a positive theory of consumer choice. Journal of Economic Behavior and Organization. 1980; 1: 39–60.
  28. Tversky A, Kahneman D. The framing of decisions and the psychology of choice. Science. 1981; 211: 453–8. pmid:7455683
  29. Kahneman D, Tversky A. Prospect theory: An analysis of decisions under risk. Econometrica. 1979; 47: 263–91.
  30. Stanovich KE. Rational and irrational thought: The thinking that IQ tests miss. Scientific American Special Issue. 2015; 23(5s): 3–112.
  31. Toplak ME, Stanovich KE. The domain specificity and generality of disjunctive reasoning: searching for a generalizable critical thinking skill. Journal of Educational Psychology. 2002; 94(1): 197–209.
  32. Evans AM, Dillon KD, Rand DG. Fast but not intuitive, slow but not reflective: Decision conflict drives reaction times in social dilemmas. Journal of Experimental Psychology: General. 2015; 144(5): 951–66. pmid:26413891
  33. Almiñana C, Caballero I, Heath PR, Maleki-Dizaji S, Parrilla I, Cuello C, et al. The battle of the sexes starts in the oviduct: modulation of oviductal transcriptome by X and Y-bearing spermatozoa. BMC Genomics. 2014; 15: 293. pmid:24886317
  34. Rodrigues AC, Loureiro MA, Caramelli P. Musical training, neuroplasticity and cognition. Dementia & Neuropsychologia. 2010; 4: 277–86.
  35. Pallavicini F, Ferrari A, Mantovani F. Video games for well-being: a systematic review on the application of computer games for cognitive and emotional training in the adult population. Frontiers in Psychology. 2018; 9: 2127. pmid:30464753
  36. Hannan CK. Review of research: neuroscience and the impact of brain plasticity on Braille reading. Journal of Visual Impairment & Blindness. 2006; 11(7): 397–413.
  37. NYU [Internet]. Why Study Math? 2019 [cited 2019 Jul 30]. https://math.nyu.edu/dynamic/undergrad/overview/why-study-math/
  38. Office of The Chief Scientist. Australia’s STEM workforce: a survey of employers. Barton ACT: Deloitte Access Economics; 2014.