Predicting risk of dyslexia with an online gamified test

Dyslexia is a specific learning disorder related to school failure. Detection is both crucial and challenging, especially in languages with transparent orthographies, such as Spanish. To make detecting dyslexia easier, we designed an online gamified test and a predictive machine learning model. In a study with more than 3,600 participants, our model correctly detected over 80% of the participants with dyslexia. To check the robustness of the method, we tested it on a new data set with over 1,300 participants, using age-customized tests in a different environment (a tablet instead of a desktop computer), and reached a recall of over 78% for the class with dyslexia for children 12 years old or older. Our work shows that dyslexia can be screened using a machine learning approach. An online screening tool in Spanish based on our methods has already been used by more than 200,000 people.


Introduction
More than 10% of the world population has a specific learning disability of neurobiological origin called dyslexia. According to the International Dyslexia Association, "dyslexia is characterized by difficulties with accurate and/or fluent word recognition and by poor spelling and decoding abilities. These difficulties typically result from a deficit in the phonological component of language that is often unexpected in relation to other cognitive abilities and the provision of effective classroom instruction" [1]. If someone knows they have dyslexia, they can learn coping strategies to overcome its negative effects [2,3]. However, when people with dyslexia are not provided with appropriate support, they often fail in school: 35% drop out of school, and it is estimated that less than 2% of people with dyslexia will complete a four-year college degree [4].
Detecting dyslexia is especially difficult in languages with transparent orthographies, such as Spanish. In such languages, the correspondence between grapheme (letter) and phoneme (sound) is more consistent than in languages with deep orthographies, such as English, where people with dyslexia struggle more in learning how to read [5,6] and thus dyslexia is easier to detect. In languages with transparent orthographies, dyslexia is called a "hidden disability" because its manifestations are less severe and therefore harder to diagnose [5,6]. As a result, Spanish speakers primarily learn that they might have dyslexia through school failure, which is often too late for effective intervention. Current methods of diagnosis and screening require professionals to collect performance measures related to reading and writing via a lengthy in-person test [7][8][9], measuring, e.g., reading speed (words per minute), reading errors, writing errors, word reading, reading fluency or text comprehension.
While machine learning techniques are broadly used in medical diagnosis [10], in the case of dyslexia they have only been used in combination with eye-tracking measures [11,12]. The aim of this study is to determine whether people with and without dyslexia can be screened by applying machine learning to the interaction measures collected while they answer gamified linguistic questions in an online test, which is easier to administer.

Method
We designed 32 linguistic exercises appropriate for inclusion in a web-based gamified test and conducted a study with 4,333 participants (469 with a professional dyslexia diagnosis). Using a within-subject experimental design, we collected numerous performance measures during test completion.

Content Design
We designed the gamified exercises using two methods. First, some exercises were based on an empirical analysis of a corpus of errors written in Spanish by people with dyslexia [13], because errors reflect the specific difficulties that characterize dyslexia [14,15]. We annotated the mistakes with general linguistic characteristics as well as with phonetic and visual information [13]. We then analyzed the mistakes and extracted statistical patterns to later use in the creation of the test questions. Examples of these patterns are the most frequent linguistic and visual features shared by the errors, which are phonetically and visually motivated. For instance, the most frequent errors involve letters for which the one-to-one correspondence between graphemes (letters) and phonemes (sounds) is not maintained, such as <b, v>, <g, j>, <c, z>, <c, s> and <r>, and the letter <h>, which, in most cases, does not have a phonetic realization in Spanish. Another example of this phonological motivation is that mistakes involving vowel substitutions take place between phonemes that share one or two phonetic features, with lip rounding being the feature most frequently involved in errors (<a, e>). On the other hand, the visual motivation is demonstrated by the fact that 46.91% of the erroneous letters are mirror letters, e.g., 'p' and 'q' or 'n' and 'u' [13].
Second, we designed test exercises to target specific cognitive processes, different types of knowledge, and difficulties entailed in reading [16][17][18]. Each exercise addresses three or more of the dyslexia-related indicators shown in Table 1, which cover different types of Language Skills, Working Memory, Executive Functions and Perceptual Processes.
The language of the exercises was reviewed by five speech therapists from Spain, Chile and Argentina, to guarantee that the Spanish variant presented in the exercises was neutral. To ensure that the pronunciation was correct, the voices in the exercises were recorded by a professional voice actress. Likewise, to ensure that the difficulty level was appropriate, each question was reviewed by the speech therapists. See Figure 1 for an example of the exercise layout.
Exercises 1-21 involved auditory and visual discrimination and categorization of different linguistic elements: phonemes (sounds), graphemes (letters), syllables, words and non-words. As the level increases, the elements are harder to distinguish, because they are phonetically and orthographically more similar. The questions of the test were presented in increasing order of complexity and were intended for children seven years old or older.
Previous work [14,17,18] has shown that people with dyslexia have difficulty recognizing their own reading and spelling errors, including insertion, deletion, substitution or transposition of letters and syllables, as well as detecting syntactic and semantic errors in sentences, that is, errors in the structure or in the meaning of the sentence. Hence, exercises 22-29 focus on correcting words and sentences by fixing the types of errors found in texts written by people with dyslexia. For instance, the user is asked to re-order the letters in the common error *'seite' to form the correct word 'siete' ('seven') (see Figure 1). These exercises target lexical knowledge, word identification, reading comprehension, and other linguistic skills such as phonological, syntactic and semantic awareness.
The final exercises (30 to 32) target sequential visual and auditory working memory by asking the player to write sequences of letters in a specific order, as well as words and non-words.
The test was implemented in HTML5, CSS and JavaScript, with a PHP back-end server and a MySQL database.

Participants
Children and adults with dyslexia were recruited through a public call to dyslexia centers and dyslexia associations; the inclusion criteria specified that participants should have a dyslexia diagnosis performed by a registered professional. Participants without dyslexia were recruited through schools and limited to children and adults whose school records had never shown language problems. Determining accurate ground truth in dyslexia diagnosis is difficult precisely because many people go undiagnosed and we do not know the accuracy of the professional diagnoses. All the participants' native language was Spanish.

Dependent measures
To quantify task performance, we collected the following dependent measures for each question: (i) number of Clicks; (ii) number of correct answers (Hits); (iii) number of incorrect answers (Misses); (iv) Score, defined as the sum of Hits per set of exercises; (v) Accuracy, defined as the number of Hits divided by the number of Clicks; (vi) Missrate, defined as the number of Misses divided by the number of Clicks.
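The six measures can be sketched in a few lines of Python. This is only an illustration of the definitions above: the `QuestionLog` container, the `measures` helper and the choice to return 0.0 when a question has no clicks are assumptions made for the example, not part of the test implementation.

```python
from dataclasses import dataclass


@dataclass
class QuestionLog:
    """Raw interaction counts for one question (hypothetical container)."""
    clicks: int
    hits: int
    misses: int


def measures(log, exercise_set):
    """Return the six dependent measures for one question.

    `exercise_set` is the list of QuestionLog objects that belong to the
    same set of exercises; it is only needed to compute the Score.
    """
    # Assumption: a question with no clicks gets 0.0 for both ratios.
    accuracy = log.hits / log.clicks if log.clicks else 0.0
    missrate = log.misses / log.clicks if log.clicks else 0.0
    return {
        "Clicks": log.clicks,
        "Hits": log.hits,
        "Misses": log.misses,
        "Score": sum(q.hits for q in exercise_set),  # sum of Hits per set
        "Accuracy": accuracy,                        # Hits / Clicks
        "Missrate": missrate,                        # Misses / Clicks
    }


q1 = QuestionLog(clicks=10, hits=7, misses=3)
q2 = QuestionLog(clicks=5, hits=2, misses=3)
print(measures(q1, [q1, q2]))
```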

Compliance and Ethics Statements
Interested organizations responded to our public calls, and those that we verified met the participation requirements (age, mother tongue, technical requirements, and a dyslexia diagnosis for the experimental group) were included. Overall, 103 organizations from Argentina, Chile, Colombia, Spain, and the USA participated in the study: 3 universities, 60 schools (both primary and secondary), 22 specialized centers that support people with dyslexia, and 18 non-profit organizations composed of 4 foundations and 14 dyslexia associations in Hispanic countries. Most organizations included participants both with and without dyslexia. In each organization, one or more supervisors were trained to administer the study protocol.

Procedure
This study was approved by the Carnegie Mellon University Institutional Review Board (IRB). First, participants gave their written online consent. If the participant was under age, we gathered consent from both the participant and their parents. Then, the participant (or the supervisor, if the participant was under age) filled out a demographic questionnaire, including the date of their dyslexia diagnosis (if any). Next, they were given instructions on how to perform the tasks and completed the study: they played the gamified test for 15 minutes without interruption. Supervisors could not help participants complete the test. For schools, parental consent was obtained in advance and the study was supervised by the school counselor or the therapist. All participants and supervisors were volunteers.

Data sets
We had 4,333 participants with an age range of 7 to 70 years old. Of them, 469 had diagnosed dyslexia, 294 suspected that they had dyslexia but could not present a definitive diagnosis, and 3,570 did not have dyslexia. We decided not to use the middle group, as it could add noise to the solution (our predictive model later suggested that this was the right choice, as only 63% of them were predicted to have dyslexia). We were left with 4,039 participants, of whom 11.6% had dyslexia. The data for each participant consisted of a total of 196 features: age, gender, whether the participant is bilingual, the school marks in the Spanish subject during the current academic year, and 32 x 6 = 192 dependent measures derived from their interaction while playing the 32 questions of the test (the 6 measures per question presented previously).
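The counts in this paragraph can be checked with a short script. The feature names below (`spanish_mark`, `q1_Clicks`, and so on) are hypothetical labels chosen for the example; only the counts come from the text.

```python
TOTAL = 4333
DIAGNOSED, SUSPECTED, NO_DYSLEXIA = 469, 294, 3570

# The three groups account for every participant.
assert DIAGNOSED + SUSPECTED + NO_DYSLEXIA == TOTAL

# Dropping the "suspected" group leaves the participants used for modeling.
kept = TOTAL - SUSPECTED
prevalence_pct = round(100 * DIAGNOSED / kept, 1)

# Four demographic features plus six measures for each of the 32 questions.
DEMOGRAPHIC = ["age", "gender", "bilingual", "spanish_mark"]  # hypothetical names
MEASURES = ["Clicks", "Hits", "Misses", "Score", "Accuracy", "Missrate"]
feature_names = DEMOGRAPHIC + [
    f"q{q}_{m}" for q in range(1, 33) for m in MEASURES
]

print(kept, prevalence_pct, len(feature_names))
```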
As the overall data set (A) was imbalanced (the dyslexia class is much smaller), we also created a balanced data set (B) with 936 participants by doing a stratified sampling in gender and age over the subset of people without dyslexia. In addition, as people with dyslexia cope better with their problems as they age, we built age-bounded versions of data set B: {B19, B16, B14, B12, B10, B8}, where Bn implies that all participants with age less than or equal to n are considered. In Table 2, we show the characteristics of both data sets, and in Figure 2 we show a two-dimensional projection of the All data set using t-distributed Stochastic Neighbor Embedding (t-SNE) [19], which gives an idea of the complexity of our classification problem.
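One way to build such a balanced sample and its age-bounded subsets is sketched below. The exact sampling procedure is not specified in the text, so this version makes an assumption: for each (gender, age) stratum it draws as many participants without dyslexia as there are participants with dyslexia in that stratum. The dictionary keys and function names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(participants, seed=0):
    """Undersample participants without dyslexia, stratified by gender and age."""
    rng = random.Random(seed)
    cases = [p for p in participants if p["dyslexia"]]

    controls_by_stratum = defaultdict(list)
    for p in participants:
        if not p["dyslexia"]:
            controls_by_stratum[(p["gender"], p["age"])].append(p)

    cases_per_stratum = defaultdict(int)
    for p in cases:
        cases_per_stratum[(p["gender"], p["age"])] += 1

    sampled = []
    for stratum, n_cases in cases_per_stratum.items():
        pool = controls_by_stratum.get(stratum, [])
        # Take fewer controls when the stratum does not contain enough of them.
        sampled.extend(rng.sample(pool, min(n_cases, len(pool))))
    return cases + sampled


def age_bounded(participants, n):
    """Subset Bn: every participant whose age is less than or equal to n."""
    return [p for p in participants if p["age"] <= n]


# Toy data: 3 participants with dyslexia and 12 without.
toy = (
    [{"age": 8, "gender": "f", "dyslexia": True}] * 2
    + [{"age": 10, "gender": "m", "dyslexia": True}]
    + [{"age": 8, "gender": "f", "dyslexia": False}] * 5
    + [{"age": 10, "gender": "m", "dyslexia": False}] * 3
    + [{"age": 12, "gender": "m", "dyslexia": False}] * 4
)
b = balanced_sample(toy)
print(len(b), len(age_bounded(b, 8)))
```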

Predictive model
For the predictive model, we used Random Forests [20] due to their non-linearity and good level of interpretability, as this technique is based on decision trees. We used the standard parameters with 200 iterations and unlimited height, obtaining models with a few hundred trees. In the case of the All data set, we used weighted attributes to balance the data, because a trivial classifier (one that predicts that nobody has dyslexia) would already obtain an accuracy of 88.4%, since only 11.6% of the participants have dyslexia.
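The baseline figure follows directly from the class counts. The sketch below also shows one common way to weight an imbalanced data set, namely inverse-frequency class weights; this is an assumed illustration of the general idea, not necessarily the weighting scheme used for the models in this paper.

```python
n_dyslexia, n_without = 469, 3570
n_total = n_dyslexia + n_without

# A classifier that always answers "no dyslexia" is right for every
# participant without dyslexia and wrong for everyone else.
baseline_accuracy = n_without / n_total

# Assumed weighting: each class gets weight N / (2 * n_class), so that both
# classes contribute the same total weight during training.
weights = {
    "dyslexia": n_total / (2 * n_dyslexia),
    "no_dyslexia": n_total / (2 * n_without),
}

print(round(100 * baseline_accuracy, 1), weights)
```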
For the evaluation we used 10-fold cross-validation, obtaining the accuracy shown in the last column of Table 2 and in Figure 3. From the table we can see that the results for data set B have much better accuracy, which justifies doing the age analysis on that data set. In the figure, we also include the predictive power (percentage of accuracy for 100 people), which shows that about 700 participants are enough to reach the accuracy plateau. This implies that doubling the size of B10, or multiplying the size of B8 by four, would significantly improve the accuracy for younger children.
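The evaluation loop can be sketched without any machine learning library. Two assumptions are made here: the folds are stratified by class (the text only says 10-fold), and the model is passed in as a hypothetical `train_fn` that returns a prediction function. A majority-class stand-in replaces the Random Forest so that the example stays self-contained.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with similar class proportions in each."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # deal indices round-robin over the folds
    return folds


def cross_val_accuracy(X, y, train_fn, k=10, seed=0):
    """Mean accuracy over k held-out folds (assumes every fold is non-empty)."""
    accuracies = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_majority(X_train, y_train):
    """Stand-in model: always predicts the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


y = [0] * 90 + [1] * 10
X = [[0.0] for _ in y]
print(cross_val_accuracy(X, y, train_majority))
```

Replacing `train_majority` with a function that fits a real classifier on the training fold gives the cross-validated accuracy of that classifier.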

Deploying the model
In practice, what is important is not the accuracy but two error rates whose impact differs: the rate of false negatives (the fraction of people with dyslexia who are predicted not to have it, which is the complement of the recall, or sensitivity, of the dyslexia class) and the rate of false positives (the fraction of people without dyslexia who are predicted to have it, which is the complement of the specificity). Indeed, missing a child that may have dyslexia is much worse than sending a child without dyslexia to a specialist. We show the relation between the recall and the error rate in Figure 4 for the model of data set B, where we can see that beyond a recall of 81% the error rate grows much faster. If the whole population took the test and the rate of both errors were similar, false positives would be 9 times more frequent than false negatives, assuming a 10% prevalence. However, in practice, only people who have learning problems take the test, and we estimate that they are about 20% of the population [21]. In this case, not only the rates but also the numbers of false negatives and false positives would be similar. Hence, we decided to set the threshold of the model where both types of errors have the same rate.
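The effect of prevalence on the number of errors can be made concrete with a small calculation. It takes dyslexia as the positive class, so a false negative is a person with dyslexia predicted as not having it and a false positive is a person without dyslexia predicted as having it. The population size and the 20% error rate are arbitrary values chosen for the example.

```python
def error_counts(population, prevalence, fn_rate, fp_rate):
    """Expected numbers of false negatives and false positives."""
    with_dyslexia = population * prevalence
    without_dyslexia = population - with_dyslexia
    false_negatives = fn_rate * with_dyslexia    # missed people with dyslexia
    false_positives = fp_rate * without_dyslexia  # people flagged by mistake
    return false_negatives, false_positives


# Whole population takes the test: 10% prevalence, equal error rates.
fn_all, fp_all = error_counts(1000, 0.10, fn_rate=0.2, fp_rate=0.2)

# Only people with learning problems take the test (about 20% of the
# population), so roughly half of the test takers have dyslexia.
fn_sub, fp_sub = error_counts(1000, 0.50, fn_rate=0.2, fp_rate=0.2)

print(fn_all, fp_all, fn_sub, fp_sub)
```

With equal rates, the two counts differ by a factor of nine at 10% prevalence and coincide when half of the test takers have dyslexia, which is the motivation for balancing the two rates.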
Figure 5 shows the absolute difference between the rates of the two types of errors for the model of data set B as the threshold varies. The minimum is reached at an approximate threshold of 0.485, which gives a recall of 79.3%.
We then adjusted the decision threshold to 0.485 instead of 0.5, which is the standard value used in Random Forests, to reach a higher recall (sensitivity) in the dyslexia class.
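The threshold search described above can be sketched as follows. The assumptions are that the model outputs a score in [0, 1] for the dyslexia class, that a participant is flagged when the score is greater than or equal to the threshold, and that candidate thresholds come from a fixed grid. The scores and labels are toy values, not study data.

```python
def error_rates(scores, labels, threshold):
    """Return (false negative rate, false positive rate) at a threshold.

    labels: 1 for dyslexia, 0 for no dyslexia.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn / positives, fp / negatives


def balanced_threshold(scores, labels, candidates):
    """Pick the candidate where the two error rates are closest."""
    def gap(threshold):
        fn_rate, fp_rate = error_rates(scores, labels, threshold)
        return abs(fn_rate - fp_rate)
    return min(candidates, key=gap)


scores = [0.9, 0.8, 0.55, 0.4, 0.1, 0.3, 0.45, 0.7]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
best = balanced_threshold(scores, labels, [0.3, 0.4, 0.5, 0.6, 0.7])
fn_rate, fp_rate = error_rates(scores, labels, best)
print(best, 1 - fn_rate, fp_rate)  # threshold, recall, false positive rate
```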
Our estimate that the people who take the test are about 20% of the population appears realistic: 51% of the people taking the test are predicted to be at risk of dyslexia, which implies a prevalence of 0.51 x 20% = 10.2%. In Table 3, we show the precision and recall results per class, while in Figure 6 we show the precision-recall graph for the dyslexia class.

Table 3. Classifier precision and recall per class for a threshold of 0.485.

Discussion
In this section, we explore the impact of the different features and discuss the model's limitations.

Feature analysis
To analyze which were the best features in our models, we used standard information gain in decision trees. For example, for model B, the two most important features were gender and the performance in Spanish classes, which makes sense given that dyslexia is more salient in males and people with dyslexia fail at school. The next 36 best features were related to some question in the test. However, as questions are atomic, we need to aggregate all the features per question. Table 4 shows the relative importance of the questions (relative to the top one), where we also aggregated all the demographic features. Notice that the most relevant questions are initially almost in order, as the test was designed to have the most discriminating questions first (Q4 is an exception). However, all questions discriminate, and using only the top features does not give good results. For example, using the top 9 questions plus the demographic variables, the accuracy obtained is just 67.5%. In Table 5, we aggregate features by type, where we can see that, as expected, mistakes are more important than successes, although the difference is small.
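Aggregating per-feature importances into per-question importances can be sketched as below. The feature naming scheme (`q4_Misses`), the grouping of every other feature under "Demographic", the scaling so that the top group equals 100, and the importance values themselves are all assumptions made for the example.

```python
import re
from collections import defaultdict


def aggregate_by_question(importances):
    """Sum feature importances per question and scale the top group to 100."""
    totals = defaultdict(float)
    for name, value in importances.items():
        match = re.match(r"q(\d+)_", name)
        group = f"Q{match.group(1)}" if match else "Demographic"
        totals[group] += value
    top = max(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {group: 100.0 * value / top for group, value in ranked}


toy_importances = {
    "gender": 0.20,
    "spanish_mark": 0.10,
    "q1_Misses": 0.30,
    "q1_Hits": 0.30,
    "q2_Clicks": 0.15,
}
relative = aggregate_by_question(toy_importances)
print(relative)
```

The same function can aggregate by feature type instead of by question if the grouping key is taken from the part of the name after the underscore.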

Limitations
Our machine learning model trained on human-computer interaction data is able to classify people as having dyslexia or not with high sensitivity, and using this type of data to screen dyslexia is novel. However, it indirectly considers measures that have previously been used in traditional diagnoses. Indeed, paper-based tests use reading and writing performance measures such as reading speed, spelling errors, and text comprehension [7][8][9], and the measures gathered with our online test indirectly capture the same kind of performance when the participant is exposed to the linguistic questions.
Nevertheless, the results of this online test should be taken as screening only and cannot serve as a diagnosis, for at least three reasons. First, our online test does not take into consideration other factors such as the intelligence quotient of the participant. In professional practice, intelligence tests such as the WISC [22] are normally administered in order to diagnose dyslexia and exclude other possible causes of the deficiencies in phonological skills.
Second, our test does not discriminate other conditions. It is increasingly recognized that dyslexia co-occurs with other disorders [23]. For instance, dyslexia is often co-morbid with dyscalculia [24] and attention deficit hyperactivity disorder (ADHD) [25]. Notably, 40% of the people with dyslexia have dyscalculia [26], and from 18 to 42% of the population with dyslexia also have ADHD [25]. Also, there are other language disorders, such as specific language impairment (SLI), that require professional assessment. These comorbidities make professional diagnosis a more challenging task and, in practice, dyslexia is sometimes misdiagnosed as ADHD and vice versa [27,28]. In our approach we took as ground truth the current dyslexia diagnosis assessed by a professional; however, that ground truth could vary depending on the professional assessment. Furthermore, other factors, such as fatigue and concentration, can play a role.
Finally, our test cannot report different degrees of dyslexia and does not consider the personal history of the user, which can also play a role in dyslexia diagnosis.
Hence, these results need to be taken as preliminary, since the model was trained on a relatively small data set, and further experiments with more participants under other conditions (e.g., mobile devices and tablets) need to be carried out to confirm that the results are fully reliable. Results with a larger data set would also help in tuning the threshold for better models.

Conclusions
The approach presented in this article shows that dyslexia can be screened in a language with shallow orthography, such as Spanish, using machine learning in combination with measures derived from a gamified online test. However, in practice the results of this approach should be taken as a screening test, never as a dyslexia diagnosis, since there are other factors, such as intelligence quotient and dyslexia comorbidities, that need professional oversight.
This approach to screening dyslexia is easy to take on the Web, since it does not require special equipment. So far, our test has been deployed as an open-access online tool that has already been used more than 200,000 times in Spanish-speaking countries. Since estimates of the prevalence of dyslexia are much higher than the actual diagnosed population, we believe this method has the potential to make a significant social impact. Similar methods could lead to earlier detection of dyslexia and prevent children from being diagnosed with dyslexia only after they fail in school.

Fig 2. This visualization is a projection of a higher-dimensional space of the All data set. Yellow data points are participants without dyslexia while green data points are participants with dyslexia.

Fig 3. Accuracy for the age-based data sets and the corresponding predictive power.

Fig 4. False negatives (y) versus dyslexia class recall (x), both in percentages, when the model threshold varies.

Fig 5. Balancing false positives and false negatives (percentage difference in y versus model threshold value in x).

Fig 6. Precision and recall curve for the dyslexia class, varying the model threshold.

Table 1. Cognitive indicators used in the creation of test exercises.

Table 2. Characteristics of the data sets.

Table 4. Relative question importance based on feature analysis.

Table 5. Relative importance by feature type aggregation.