Brain training improves recovery after stroke but waiting list improves equally: A multicenter randomized controlled trial of a computer-based cognitive flexibility training

Background Brain training is currently widely used in an attempt to improve cognitive functioning. Computer-based training can be performed at home and could therefore be an effective add-on to available rehabilitation programs aimed at improving cognitive functioning. Several studies have reported cognitive improvements after computer training, but most lacked proper active and passive control conditions. Objective Our aim was to investigate whether computer-based cognitive flexibility training improves executive functioning after stroke. We also conducted within-group analyses similar to those used in previous studies, to assess inferences about transfer effects when comparisons to proper control groups are missing. Methods We conducted a randomized controlled, double blind trial. Adults (30–80 years old) who had suffered a stroke within the last 5 years were assigned to either an intervention group (n = 38), active control group (i.e., mock training; n = 35), or waiting list control group (n = 24). The intervention and mock training consisted of 58 half-hour sessions within a 12-week period. Cognitive functioning was assessed using several paper-and-pencil and computerized neuropsychological tasks before the training, immediately after training, and 4 weeks after training completion. Results and conclusions Both training groups improved on training tasks, and all groups improved on several transfer tasks (three executive functioning tasks, attention, reasoning, and psychomotor speed). Improvements remained 4 weeks after training completion. However, the amount of improvement in executive and general cognitive functioning in the intervention group was similar to that of both control groups (active control and waiting list). Therefore, this improvement was likely due to training-unspecific effects. Our results stress the importance to include both active and passive control conditions in the study design and analyses. Results from studies without proper control conditions should be interpreted with care.


Objective
Our aim was to investigate whether computer-based cognitive flexibility training improves executive functioning after stroke. We also conducted within-group analyses similar to those used in previous studies, to assess inferences about transfer effects when comparisons to proper control groups are missing.

Methods
We conducted a randomized controlled, double blind trial. Adults (30-80 years old) who had suffered a stroke within the last 5 years were assigned to either an intervention group (n = 38), active control group (i.e., mock training; n = 35), or waiting list control group (n = 24). The intervention and mock training consisted of 58 half-hour sessions within a 12-week period. Cognitive functioning was assessed using several paper-and-pencil and computerized PLOS

Introduction
Approximately 60% of stroke survivors show cognitive impairments which often persist in the chronic phase after stroke [1,2]. Executive impairments, in particular, have a large impact on everyday life and may predict poor cognitive recovery after stroke [3,4], making efforts to improve these functions highly relevant. Computer-based training approaches may complement existing rehabilitation programs. They have the advantage that they can be exercised at home, thus facilitating intense and repeated practice, a key element for restitution-based rehabilitation. The aim of restitution training is not to master compensational strategies, but restoration of impaired functions through stimulation. It is very difficult, if not impossible, to distinguish whether improvements in cognitive functioning are due to restitution of the function or to the use of implicitly learned strategies. Still, it is important to evaluate whether mere retraining of cognitive functions can result in improved cognitive functioning, because restitution-based therapy has been effective in the domains of motor function, language, and vision (e.g., [5,6]). So far, the evidence for the effectiveness of computer-based training in improving executive functioning after acquired brain injury is inconclusive. Several studies have reported improvements in tasks similar to the training (near transfer effects) as well as improvements in tasks that differ from the training (far transfer effects). Most of these studies, however, suffered from methodological limitations [7]. First, most studies lacked a control condition or included only passive (non-treated) control conditions. In studies that included active control (i.e. mock training) groups, computer-based training failed to outperform mock training [8,9]. Without an active control condition, positive training effects may well result from nonspecific elements such as spontaneous recovery, test-retest effects, or the Hawthorne effect (i.e., the effect of merely participating in a scientific study, entailing expectancy, personal attention, motivation, et cetera; [10,11]). Without a waiting list control condition, improvement in both training groups could be due either to both programs being effective or to nonspecific elements. Thus, both an active control condition and a waiting list control condition should be included to control for all training-unspecific effects. Second, most studies did not adjust statistically for multiple testing despite including a large number of outcome measures, and are, therefore, prone to type 1 error (i.e., report of positive results where there are none). However, there are some studies with healthy individuals that did correct for multiple testing (e.g., [12]). Third, the relation between study outcomes and training task progression was often not investigated, so it remains unclear to what extent functional improvements were related to the training. Furthermore, training duration was generally short (median = 15.6 hours), and sample sizes were relatively small (median = 16; [7]).
A further methodological issue is that previous training studies may not have targeted relevant and/or process-pure cognitive functions or their underlying neural mechanisms. Three major components of executive functions have been discerned: (1) control of one's behavior, including inhibition of strong but inappropriate responses, (2) mental set shifting (i.e., changing from one set of task rules to another), and (3) information updating [13]. Training studies that did reveal reliable far transfer in healthy elderly commonly involved rapid task switching as a key ingredient of training [14], thereby targeting mental set shifting, which is at the core of executive functioning.
The aim of the current study was to test the hypothesis that three months of computerbased, commercially available, cognitive flexibility training improves executive functioning after stroke, while accounting for the above-reviewed methodological issues. We additionally conducted within-group analyses similar to those used in previous studies, to assess inferences about transfer effects when comparisons to proper control groups are missing. Participants trained five times per week half an hour for 12 weeks, which we expected to suffice to trigger restitution-based recovery of executive functions. The intervention training included rapid task switching. Difficulty of tasks was adapted individually to the performance of participants [14]. An active control group (i.e. mock training) and a waiting list group were included to control for nonspecific effects. We expected that the cognitive flexibility training would result in more pronounced transfer effects on executive functioning compared to the mock training, and that the performance of the waiting list group would not change over time.

Materials and methods
A detailed description of the design, training tasks, and outcome measures of this study has been published previously [15].

Participants
Participants were recruited from six Dutch rehabilitation centers and patient societies (April 2013-March 2015; last follow-up measurement in November 2015). They were included when they had had a stroke 3 months to 5 years ago, were between 30 and 80 years old, and (had) received rehabilitation therapy as inpatient or outpatient. Participants were required to have cognitive impairments after stroke (as testified by medical records), with cognitive complaints still present at study entry. Finally, participants were required to be able to work with the computer and have daily access to a computer with Internet connection.
Exclusion criteria were presence of neurodegenerative disease; epilepsy; serious psychiatric illness; any disease other than stroke that results in severe cognitive impairments; drug or alcohol dependency; severe color blindness, aphasia, neglect, or computer fear; disabling vision or auditory problems; and diagnosed learning disability. Furthermore, participants who were not mentally or physically fit enough to be able to complete 12 weeks of training were excluded. Finally, those who were not able to understand the training instructions or who could not execute the training due to any other unforeseen reason, after instructions or after the first training week, were excluded.
A priori sample size calculations were based on our aim to detect at least large group effects [16] between two groups with one outcome measure. With a power of .80 and an alpha of .05 (one-tailed), this effect would be revealed in univariate analyses with a minimal sample size of 20 per group. As a switch training in healthy elderly resulted in an effect size d = 0.40 [17] we aimed to maximally include 138 participants, resulting in 3 x 40 while taking into account a 15% attrition rate. Inclusion would stop at this number or when the recruitment period was over. A schematic overview of the participant flow can be found in

Experimental design
The study was a prospective multicenter, double-blind, randomized controlled study (RCT). Participants were assigned randomly to the intervention group, the active control group, or to the waiting list group by randomization software (Minimpy; [18]). This assured minimal differences between groups in time since stroke (post-acute versus chronic), level of computer experience (<0.5, 0.5-4, >4 hours), age (30-49, 50-59, 60-69, 70-80), education (primary or lower secondary, middle secondary, higher, university), cognitive screening scores (Telephone Interview for Cognitive Status, TICS [19]: <33, 33-37, >37), and sex (male, female). Participants were not informed that one of the training programs was a mock training. Instead they were told that the study aimed to compare two types of computer-based cognitive training programs. The groups were coded by the research coordinator such that the assessors were blind to which training condition the participant was assigned. The waiting list group was added during the course of the study (see S1 File). At that moment, 25 participants were included in the intervention group and 28 in the active control group.
Computer tasks were administered online at baseline (T0), after 6 weeks of training or waiting (T1), after training completion or after 12 weeks of waiting (T2), and (for the training groups) 4 weeks after training completion (T3). Conventional neuropsychological tasks tapping several cognitive domains were administered at T0 and T2. In addition, brain MRI scans were obtained at T0 and T2 in a subset of the sample and several questionnaires were administered at all time-points, results of which will be presented elsewhere.
The study was approved by the ethical review board of the University of Amsterdam (i.e., Commissie Ethiek voor de Afdeling Psychologie; approved December 2012) and by the medical ethical review board of the VU University Medical Center, Amsterdam (i.e., the Medisch Ethische Toetsingscommissie Vrije Universiteit Medisch Centrum; approved July 2013, amendment approved May 2014). The study is registered before study commencement as Training Project Amsterdam Seniors and Stroke (TAPASS) with the Central Committee on Research Involving Human Subjects Register NL4468502913 (www.toetsingonline.nl). Additionally, to fulfill the World Health Organization Registry criteria it was also registered with the Netherlands National Trial Register NTR5174. The authors confirm that all ongoing and related trials for this intervention are registered.

Procedures
Whenever participants indicated that they wanted to participate in the study, they were asked to sign an online informed consent form and complete an online screening questionnaire and a cognitive screening by phone (TICS) to assess inclusion and exclusion criteria. After randomization (T0), participants gave written informed consent and performed neuropsychological and computer tasks administered by two junior psychologists trained and supervised by a neuropsychologist at the University of Amsterdam. The assessor of neuropsychological tasks was blind to the training allocation of the participant. After task administration, he/ she registered which training condition they thought (i.e., guessed) the participant was in, to check whether they were truly blind. Note that the person administrating the computer tasks and training instructions was not blind to training allocation. In addition, participants performed several computer tasks online at home (see [15] for a detailed description).
Participants in the training groups trained at home five times per week during half an hour for 12 weeks. The training program consisted of 58 training sessions (29 hours of training). A daily log was completed before and after each training session. A trained student contacted the participant by phone to ask about their training progression and an automatic training reminder was sent by e-mail whenever participants did not train for two days that week. The waiting list group received care as usual which most often meant they did not receive any treatment; they were not contacted by phone. After 12 weeks of either training or waiting (T2), participants came back to the university to perform the same tasks as before (T0). After completing these tasks, the waiting list group started the intervention training.
The online tasks were also administered after 6 weeks of waiting or training (T1). Four weeks after training completion (T3), the online tasks were repeated among training groups to measure long-term effects of the training.

Intervention materials
Both computer training programs were carried out at home via a preexisting brain training website, www.braingymmer.com. The training tasks (games) were professionally programmed to be stimulating. Based on a pilot study, we adjusted elements such as the time allowed to complete a task to stroke survivors and to the elderly population. Individual feedback was given after each task based on a three-star rating scale and at the end of each session. During the workout, the next task was automatically selected and presented to the participant. Participants were thus not free to navigate the website and select their own tasks. They were, however, able to select the level of difficulty where a higher level became available as soon as one out of three stars was achieved. The training duration of 12 weeks with five half-hour sessions per week was based on previous training studies from which we concluded that a minimum training period of three months was required. Moreover, we asked stroke survivors and healthy elderly what they thought was an achievable time investment in addition to their everyday life activities.
Cognitive flexibility training. The intervention training consisted of nine tasks in the cognitive domains of working memory, attention, and reasoning. To stimulate cognitive flexibility, tasks from the same cognitive domain were never performed successively. In the first week, each task was performed for 10 minutes in order to get to know the task (i.e., three tasks per session). After that, sessions included 10 tasks of 3 minutes each to challenge cognitive flexibility. The degree of difficulty of the task was adapted to the participant's performance; they were asked to continue to the next level as soon as they received two or three out of three stars. All tasks consisted of 20 levels.
Mock training. The mock training consisted of four tasks that were not expected to train executive functions because they did not involve updating, set shifting, or inhibition. Each session consisted of three tasks. Thus, participants only switched to the next task after 10 minutes. The tasks were not adaptive; participants were asked to train at the same level for one or two weeks before they were allowed to go to the next level. The instructors asked participants not to train beyond level nine. However, some participants disregarded this and trained at higher levels anyway.

Outcome measures
Primary outcomes. For the primary analyses, executive functioning was measured with several neuropsychological tasks and one computer task (see Table 1 for outcome measures used per task). These included the number-letter switching condition of the Trail Making Test (TMT) from the Delis-Kaplan Executive Function System (D-Kefs; [20]), category fluency [21] and letter fluency tasks [22], an online version of the Tower of London (ToL, based on [23]), and Letter-Number Sequencing (LNS, Wechsler Adult Intelligence Scale III-NL; [24]). Raw scores were corrected for demographics at baseline assessment based on norm scores where available. None of the outcome measures was used in the training program, thus all measured transfer effects.
Secondary outcomes. Objective cognitive functioning was assessed for the following domains: cognitive flexibility, attention, verbal memory, working memory, reasoning, psychomotor speed, and inhibition. Most domains consisted of multiple tasks (see Table 1 for tasks

Cognitive domain Task
Outcome measure

Primary outcome measures
Executive functioning -D-Kefs TMT number-letter switching condition [20] Age-corrected z-scores of time to completion -Category fluency [21] Number of words mentioned within one minute (average of two categories) -Letter fluency [22] Education-corrected z-score of number of words mentioned within one minute (three different starting letters) -Tower of London* [23] Number of moves-minimal required moves to solve the ToL ‡. N. B. Maximal possible moves score per item was 20 and unsolved items were scored with 20 -Letter-Number Sequencing [24] Age-corrected z-scores of total number of correct items and outcome measures used per domain). The operation span task was not used for analyses, because online task presentation became unreliable due to changes in the Flash plugin in the most used internet browser. All scores were recoded in such a way that higher scores represent better performance. For details about data preparation of the primary and secondary outcomes, see S1 File. Training performance. For both experimental groups, training performance was assessed based on task level and highest score obtained per level. Per task, the obtained high-scores for each level were converted into a percentage of the maximal possible score and added up to a total task score. The intervention training consisted of three domains and the domain score was based on the average task score of the tasks within that domain. Finally, a total score was calculated by taking the average of the three domain scores (i.e., the average score of all tasks). The mock training consisted of four tasks and the total score was based on the average of these tasks. Scores could range from 0 to 2000 (i.e., maximum 100 per level).

Statistical analysis
Primary analyses of transfer effects were performed with one repeated-measures MANOVA. The dependent variables were time to completion on the switch condition of the D-Kefs TMT, number of words mentioned during the category fluency and the letter fluency tasks, square root transformation of the number of steps of the ToL, and score on LNS. The independent variable was group (intervention, mock training, and waiting list control group). Time-points in this model were before (T0) and after training or waiting period (T2). Post-hoc univariate ANOVAs were performed when the time effect of the MANOVA was significant. P-values that do not survive Bonferroni-Holm correction are reported.
Secondary analyses were performed in a similar way, i.e. by a single repeated-measures MANOVA, with inhibition and composite scores of cognitive flexibility, attention, verbal memory, working memory, reasoning, and psychomotor speed as dependent variables. A composite score was calculated per domain by calculating the average z-score based on the mean and standard deviation of all participants at T0. However, if norm scores were available, demographically-corrected z-scores were used instead. Both primary and secondary analyses were repeated with age, education, and time since stroke as covariates to explore the influence of these variables on training effects. We performed these analyses because they were originally planned. Due to absent correlations between the covariates and the outcome measures and a lack of statistical power, the outcomes are not statistically reliable (results can be found in S1 File).
Whenever the training resulted in a significant improvement of the dependent variables that were additionally measured at T1 (after 6 weeks of training) and T3 (follow-up 4 weeks after training completion), the time-points T1 and T3 were added to the model. This was done to determine whether the training was already effective after 6 weeks of training, and to assess whether training effects would persist after the training.
Blinding for training assignment of assessors and participants was checked with a binomial test. Training performance differences between groups were analyzed with Mann-Whitney tests because the training scores were not normally distributed. The dependent variable was the difference score, which was calculated by subtracting the average training score after the first time each task was performed (10 minutes per task; at T0) from the final average training score (at T2). The independent variable was group (intervention and mock training). The relation between training improvement and performance change on outcome measures was examined with Pearson's correlation (statistically tested one-tailed).
Exploratory Student paired t-tests were performed in both the training groups and the waiting list group with all outcome measures at T0 and T2 as dependent variables in order to compare our results with previous training studies. In addition, Bayesian independent samples ttest was performed with JASP (Version 0.7.5.5; Computer software) to explore whether the evidence was in favor of H0 (data are from the same group) or H1 (data come from two groups; training groups 6 ¼ waiting list group). Finally, to examine whether the training had an effect on overall cognition, a repeated-measures ANOVA was conducted with a composite score of all outcome measures as the dependent variable and group (intervention, mock training, and waiting list control group) as independent variable. Norm-corrected z-scores were used where available. For the remaining tasks, z-scores were calculated based on the mean and standard deviation at T0.
All analyses were performed as intention-to-treat analysis, including all participants who started the study. Additionally, analyses were rerun as per-protocol analysis. In these latter analyses, only participants who completed the training according to protocol (e.g., completed at least 50 sessions) and who performed the tasks at T2, were included in the analyses.
Outliers in the (transformed) raw data were detected by Grubbs' Extreme Studentized Deviation test [31] and were replaced with the closest value of the other participants. Missing values that were due to the participant (e.g., too tired to complete the task, or incidentally incapable of performing the task altogether) were substituted with the lowest observation of the group at that time-point for that task, or for the D-Kefs TMT with z = -3 as this is the lowest score possible for that task. Data that were missing but not due to the participant (e.g., caused by technical problems) were substituted based on the last observation carried forward. In cases where the baseline score was missing, last observation carried backwards was used, such that the data from the closest time-point after baseline was used. If both time-points were missing, the average of the group was used. Note that in this way substituted data were conservative. The results, therefore, more likely reflect an underestimation than an overestimation. Multivariate outliers were only replaced if an explanation could be found for the extreme values. Analyses were run with and without outliers. All reported results are without univariate outliers and with multivariate outliers, but if results differed, both analyses were reported.
Normality was checked with Shapiro-Wilk test and by evaluating skewness and kurtosis. SPSS version 19 (IBM; Armonck, USA) or a later version was used. P-values of .05 or lower (two-tailed if not mentioned otherwise) were considered significant.

Pre-training
Out of 223 potential participants who were screened, 97 passed all in-and exclusion criteria and were included in the final analyses (see Fig 1 for participant flowchart including drop-out reasons). Prior to training, the three groups did not differ in age, educational level, sex, time since stroke, or baseline cognitive functioning except for attention (see Table 2 for scores and statistics). The intervention group had significantly higher baseline scores on the attention composite (p < .01) and reported significantly higher levels of fatigue (p < .01) than the active control group, but not waiting list group.
The blinding for training assignment was confirmed. Assessors of the neuropsychological tasks did not guess the training condition significantly better than chance at baseline (T0: 30%; p = .30) and after training or waiting (T2: 42%; p = .11). The active control group was not informed about the existence of a mock training. More than half (66%) of the active control group thought they had received an intervention training compared to 89% of the intervention group. Moreover, the training groups did not significantly differ with respect to motivation during training, perceived difficulty of or interest in the training, number of workouts completed, or drop-out rate. This suggests that the active control group did perceive their training similarly to the intervention group.

Training tasks
Thirty-six participants started the intervention training and 33 started the mock training. The average number of training sessions completed was 48.4, which equals 24.2 hours. This did not differ significantly between the intervention training group and the active control group (t(71) = .44, p = .66). From the participants who completed the post-training assessment (T2; n intervention = 29, n active control = 30), the average number of sessions completed was 56.8 (i.e., 28.4 hours). Both training groups improved on the training tasks (see S1 Fig) and improvement per task did not differ significantly between groups (Mann-Whitney U = 628.5, n intervention = 36, n active control = 33, p = .68).
The intervention training was intended to be more adaptive than the mock training. However, the degree of adaptiveness was compromised in 83% of the active control group and 17% of the intervention group participants. First, even though the active control group had been  Computer-based cognitive training after stroke instructed to stay below level 10, 83% of the active control participants continued at levels higher than nine. They did so in on average 10% of their training time compared to 14% in the intervention group, which was allowed to train until level 20. Thus, the mock training was more adaptive than originally planned, which may mitigate the difference between intervention and mock training. However, because the intervention training included rapid task switches and tasks were focused on executive functions it was still believed to be superior to the mock training. Second, in the intervention group, five participants (17%) were slightly less challenged in the last weeks of the training because they reached the highest level and score possible on one of the nine tasks. Though not likely, this may have yielded a small ceiling in training effects.  Table 3). Univariate repeated-measures ANOVAs revealed improvements over time (T2-T0) for attention (p < .001, ɳ p 2 = .30), reasoning (p = .03, ɳ p 2 = .05) and psychomotor speed (p < .001, ɳ p 2 = .30), but the intervention training did not result in larger improvement compared to either of the control groups. The time effect for reasoning did not survive adjustment for multiple comparisons. Results were similar when analyses were repeated with univariate outliers, except for the time effect of reasoning which obtained trend-wise significance (p = .06, ɳ p 2 = .04). As with the primary analyses, results did not change when analyses were run without participants for whom adaptiveness of the training was compromised. Relation between improvement on training task and outcome measures. The improvement on training tasks did not correlate significantly with improvement on primary transfer tasks (r ranging from -.13 to .31). The only significant correlation was between the attention domain score of the intervention training and the improvement on the ToL (r(34) = .31, p = .03). For the secondary outcome measures the correlations were also low to moderate (r ranging from -.19 to .45). Significant correlations in the intervention group were found between improvements in attention training tasks and reasoning composite (r(34) = .31, p = .03); memory training tasks and attention composite (r(34) = .33, p = .02); and improvement of reasoning training tasks correlated with improvements in the attention composite, reasoning composite, and psychomotor speed composite (r = .35, p = .02; r = .32, p = .03; r = .28, p = .05; respectively). Mock training improvement only correlated significantly with attention composite improvement (r(31) = .45, p < .01).

Transfer effect of training
Planned explorative analyses. We performed a repeated-measures ANOVA to examine the training and waiting effect on overall cognition based on a composite score of all outcome measures, and obtained results similar to those above. There was a significant time effect (F(1, 94) = 20.8, p < .001, ɳ p 2 = .18), indicating that the performance of all three groups improved. However, the group Ã time interaction effect was not significant (F(2, 94) = 0.74, p = .48, ɳ p 2 = .02, see Fig 2), thus there was no group difference in improvement. Because the average overall z-score was close to 0, indicating participants were not severely impaired, we reran the analyses (post-hoc) including only participants who were impaired (i.e., z-score lower than -1.65) on at least one of the outcome measures at baseline, again obtaining similar results as for the primary and secondary analyses mentioned before. Comparison with statistical methods from previous studies. Previous studies typically only performed analyses in the intervention group without direct statistical comparison with a control group or without correcting for multiple testing. To investigate how our results compare to these from previous studies, we also performed within-group analyses (even though we consider these analyses less appropriate, because between-group analyses are required to control for training-unspecific effects). Paired Student t-tests of the 26 outcome measures in the intervention group revealed 10 significant improvements (and 2 trends towards improvement), 3 of which would survive adjustment for multiple comparisons (see Table 4). The waiting list group improved significantly on four outcome measures, none of which would survive multiple comparison adjustment. Thus, without direct statistical comparison of all three groups within the same analyses we would have concluded that brain training results in cognitive improvement.
However, as mentioned above, based on repeated-measures MANOVAs when the three groups are compared directly, these group differences are not significant. Lack of statistical power is an unlikely explanation for the absence of group differences, because univariate Computer-based cognitive training after stroke Table 4

Intervention group (n = 38)
Waiting list group (n = 24)  ANOVAs did also not show significant differences between the three groups (results not reported but available upon request), not even when both training groups were pooled and compared with the waiting list group. Results from Bayesian analyses were also in favor of the H0 (i.e., evidence that the data was more likely to originate from one group, see S1 Table) for the outcome measures that showed no group differences based on the ANOVAs. Per-protocol analyses. The main repeated-measures MANOVAs were repeated for participants who completed the training according to the protocol. Compared to the intention-totreat analyses, 20 participants were excluded (18 drop-outs, 2 completed < 50 training sessions). Thus, analyses were based on 77 participants (n intervention = 28, n active control = 29, n waiting list = 20). At baseline (T0), participants who did not follow the protocol had significantly lower memory scores (t(95) = 2.70, p < .01), D-Kefs TMT scores (t(23.7) = 2.21, p = .04), and there was a trend for lower attention scores (t(23.7) = 1.88, p = .06) than participants who followed the protocol. They did not differ significantly on the other baseline variables nor on the rehabilitation received during the study.
Results of the training did not differ from the intention-to-treat analyses. Participants improved on training tasks, but there were no transfer effects of the training to the primary outcomes (F(10,136) = 0.70, p = .73, ɳ p 2 = .05) nor to the secondary outcomes (F(14,132) = 0.73, p = .74, ɳ p 2 = .07).

Follow-up.
As there was no training effect, T1 was not evaluated. Results for the training groups at follow-up were analyzed with a repeated-measures MANOVA with T0 and T3 as time-points, followed by the same analysis with T2 and T3 as time-points. Long-term training effects could not be compared to a passive control group, because the waiting list group did not perform T3 after waiting.

Discussion
The aim of the current study was to evaluate whether a commercially available computer-based brain training improves executive functioning after stroke. We included training elements that had been reported in previous studies to enhance the effects of training. With this training program, we found that participants improved on training tasks and on several outcome tasks, with improvements persisting even 4 weeks after training completion. However, when groups were compared, the intervention group did not improve significantly more than the active and waiting list control groups. This implies that the transfer effects of training were small and may be explained largely by variables unrelated to the training, such as the Hawthorne effect.
Our second aim was to evaluate whether studies without proper control groups can draw conclusions about transfer effects. The improvements over time we found do corroborate most findings of previous brain training studies for patients with acquired brain injury [8,9,[32][33][34][35][36][37][38][39]. Therefore, without comparing the result of the intervention group to those of the waiting list and active control group, we would also have-unjustifiably-concluded that the intervention training was effective. The current study shows that a positive within-group effect (i.e., time effect) does not necessarily imply an effect of training because the time effect may not differ significantly from the one seen in groups that did not train. Our results underscore the importance of including control conditions in the study design and analyses.
For the main executive functioning outcomes, the time effect disappeared after correcting for education. This also replicates previous results [8]. The time effect of the secondary outcome measures was not affected by these corrections. In the current study, education did not correlate strongly with the outcome measures and the effect of each education level differed per outcome measure, which suggests that education resulted in non-systematic variation. It is therefore unlikely that education accounted for the training effect and it is more likely that the time effect disappeared due to covariation with non-systematic, noise-like patterns.
With the same outcome measures as used in the current study, two studies did report transfer effects of training to executive functioning, verbal memory [38], and to attention [34]. Compared with our study, their participants were more severely impaired, suggesting that training programs may be more effective for these patients. A limitation is that an active control group was not included, so their results may be training-unspecific. Another limitation is that only a small number of tasks per cognitive domain was used. It is important, however, to use multiple outcome measures per cognitive domain because an improvement in a small selection of outcome measures taxing the same function is less convincing than improvement on several outcome measures within a function [40]. With the use of composite scores, as we did in our study, only training effects that consistently improve a given domain will be revealed, not merely the skills for one particular task. This would be strong evidence that the training indeed improved the underlying cognitive functioning measured by that set of tasks.
There were several limitations in our study. First, our sample consisted of stroke patients who were relatively high functioning with an average z-score of -0.6. Only 63% of the participants had scores that suggest impairment (z < -1.65) at least in one outcome measure. This may have been due to a selection bias, because participants needed to be able to use a computer and endure the workload of this study. Nevertheless, the training results did not change when we only included these more severely impaired patients.
Second, similar to what was found in two large studies with healthy elderly adults [41,42], the effect size of improvement on outcome measures was small, on average ɳ p 2 = .20 (which corresponds to d = 0.29). To reveal these effects with a power of 0.8 and alpha of .05 (one sided), 444 participants (3x 148) would be needed. However, an intervention with such a small effect size is hardly clinically relevant. Third, the adaptiveness of the training was somewhat compromised in both training groups. As a result, the active control training was more challenging than we had planned, and five intervention group participants were slightly less stimulated because they reached the ceiling in one of the nine training tasks during the last phase of their training. This may explain the absence of difference in transfer effects between the intervention and the mock training. However, results without these participants did not differ from those including them, thus rendering the compromised adaptiveness of the training an unlikely candidate to explain the absence of training effects.
Fourth, we expected that training five times per week during half an hour per day for a total of 29 hours would be sufficient to induce restitution-based recovery. Possibly, this still was not sufficient and more time or a higher frequency of training is needed [42,43]. Even though participants were coached during the entire training period and could contact us whenever they ran into problems, they might have benefited from additional face-to-face contact and support [44]. It is also possible that the training was not sufficiently tailored to improve cognitive functioning in patients.
Fifth, a general limitation of executive functioning tasks pertains to learning effects during repeated performance. Yet, no parallel versions were available for the tasks that improved over time. Training effects may thus have been absorbed by these retest effects.
Our study was designed to maximize the likelihood of uncovering training effects on executive functions. Despite our efforts, we did not find any evidence in favor of the notion that computer-based training helps to improve executive functioning. The efforts included the following: Our study had sufficient statistical power to reveal clinically relevant effect sizes. The training included important aspects of a training program that are thought to be essential, and used professionally programmed tasks. These were stimulating and tapped into three cognitive domains. The gaming platform was tailored to our research and allowed control over many aspects of the training, including (forced) task order. The training included rapid task switches and was spread over a long period with a high frequency per week. Participants were coached and motivated to train. Finally, our replication of time effects found by previous studies, and all time effects showing improvements rather than deterioration, indicate that the lack of training transfer is not due to measurement errors.
We recommend that future studies adjust their training to the specific needs of each patient and provide more face-to-face coaching, especially at the beginning of the training. It is important to control essential training elements (e.g., level selection should be done by the program and not by the participant) in order to ensure compliance to study protocol. Finally, studies should include proper control conditions, multiple outcome measures per cognitive domain, and large sample sizes.
Our study showed that a high-potential cognitive flexibility training did not convincingly result in improvements on several cognitive outcome measures. We did find improvements over time, but improvements were similar in all three groups, including in the waiting list group. The lack of training results in this study does not imply that computer training programs in general are ineffective. However, caution is needed when interpreting positive results of previous training studies, because most of them did not include adequate control groups. Based on our results, we cannot recommend general brain training programs that are now commercially available when performed five times per week during three months. A training protocol as used in our study is unlikely to yield clinically relevant benefits for the rehabilitation of cognitive impairments after stroke. It remains possible, however, that training programs with a much higher training dose, or programs tailored to the specific needs of individual patients, will be effective and clinically relevant.