Practice Effects on Story Memory and List Learning Tests in the Neuropsychological Assessment of Older Adults

  • Brandon E. Gavett,

    Affiliation Department of Psychology, University of Colorado Colorado Springs, Colorado Springs, Colorado, United States of America

  • Ashita S. Gurnani,

    Affiliation Department of Psychology, University of Colorado Colorado Springs, Colorado Springs, Colorado, United States of America

  • Jessica L. Saurman,

    Affiliation Department of Psychology, University of Colorado Colorado Springs, Colorado Springs, Colorado, United States of America

  • Kimberly R. Chapman,

    Affiliation Alzheimer's Disease Center, Boston University School of Medicine, Boston, Massachusetts, United States of America

  • Eric G. Steinberg,

    Affiliation Alzheimer's Disease Center, Boston University School of Medicine, Boston, Massachusetts, United States of America

  • Brett Martin,

    Affiliation Boston University School of Public Health, Boston, Massachusetts, United States of America

  • Christine E. Chaisson,

    Affiliation Boston University School of Public Health, Boston, Massachusetts, United States of America

  • Jesse Mez,

    Affiliation Alzheimer's Disease Center, Boston University School of Medicine, Boston, Massachusetts, United States of America

  • Yorghos Tripodis,

    Affiliation Boston University School of Public Health, Boston, Massachusetts, United States of America

  • Robert A. Stern

    Affiliation Alzheimer's Disease Center, Boston University School of Medicine, Boston, Massachusetts, United States of America

Abstract

Two of the most commonly used methods to assess memory functioning in studies of cognitive aging and dementia are story memory and list learning tests. We hypothesized that the most commonly used story memory test, Wechsler's Logical Memory, would generate more pronounced practice effects than a well validated but less common list learning test, the Neuropsychological Assessment Battery (NAB) List Learning test. Two hundred eighty-seven older adults, ages 51 to 100 at baseline, completed both tests as part of a larger neuropsychological test battery on an annual basis. Up to five years of recall scores from participants who were diagnosed as cognitively normal (n = 96) or with mild cognitive impairment (MCI; n = 72) or Alzheimer's disease (AD; n = 121) at their most recent visit were analyzed with linear mixed effects regression to examine the interaction between the type of test and the number of times exposed to the test. Other variables, including age at baseline, sex, education, race, time (years) since baseline, and clinical diagnosis, were also entered as fixed effects predictor variables. The results indicated that both tests produced significant practice effects in controls and MCI participants; in contrast, participants with AD declined or remained stable. However, for the delayed (but not the immediate) recall condition, Logical Memory generated more pronounced practice effects than NAB List Learning (b = 0.16, p < .01 for controls). These differential practice effects were moderated by clinical diagnosis, such that controls and MCI participants, but not participants with AD, improved more on Logical Memory delayed recall than on NAB List Learning delayed recall over five annual assessments. Because the Logical Memory test is ubiquitous in cognitive aging and neurodegenerative disease research, its tendency to produce marked practice effects, especially on the delayed recall condition, suggests a threat to its validity as a measure of new learning, an essential construct for dementia diagnosis.

Introduction

Alzheimer’s disease (AD) is a neurodegenerative disease characterized by early and progressive decline in episodic memory due to medial temporal lobe pathology [1,2]. Episodic memory refers to the ability to learn and recall personal experiences, whereas semantic memory is a more stable representation of factual knowledge [3]. Recent research suggests that AD pathology begins to accumulate decades before clinical symptoms become apparent [4]. When the earliest clinical symptoms of AD appear, they typically involve isolated episodic memory deficits that do not affect functional independence; in this case, a diagnosis of mild cognitive impairment (MCI) is appropriate. Even as the clinical presentation of AD progresses from MCI to dementia, recall of highly rehearsed material and memories from long ago are more likely to remain intact compared to newly learned information and recent events, which are likely to be rapidly forgotten [5]. As such, cognitive tests that evaluate an individual's ability to learn new information, as opposed to one's ability to retrieve more stable remote memories, are more sensitive to AD, especially in its early stages [6]. This speaks to the importance of using cognitive assessment measures that are novel to the examinee when measuring new learning.

Two of the most popular methods for measuring new learning in the verbal/auditory modality are the list learning and story memory paradigms [7]. These methods are commonly used in the assessment of older adults, especially those with MCI and AD, to identify episodic memory impairment and track changes in new learning over time [8–12]. Within the past half century, numerous verbal list learning tests have been developed, such as the California Verbal Learning Test (CVLT) [13], Hopkins Verbal Learning Test (HVLT) [8], Word List Recall test from the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) [14], and Rey Auditory Verbal Learning Test (RAVLT) [15]. Most list learning tests provide recall scores for each learning trial, as well as a sum of recall across immediate learning trials; one or more delayed recall scores at various intervals (e.g., short and long delay); and a yes/no recognition score. The list learning paradigm, in its various formats, has been shown to distinguish between healthy aging, MCI, and AD with good sensitivity and specificity [6,16–20]. Consistent with these findings, AD pathology has been shown to be associated with poorer performance on list learning tests in comparison to pathology-free individuals [21].

One list learning test, the Neuropsychological Assessment Battery (NAB) List Learning test [22], uses three learning trials of 12 words that can be organized into three semantic categories. This allows for the assessment of semantic clustering along with immediate free recall, short- and long-delayed free recall, recognition, intrusions, and repetitions. It has been shown to have classification accuracies similar to or better than other list learning measures and utility for predicting the clinical course and diagnostic outcomes of cognitively normal, MCI, and AD individuals [10,11].

In comparison to the variability among the multitude of list learning tests available to neuropsychologists, there are fewer tests of story memory available, with one test that predominates: the Logical Memory (LM) subtest from the various Wechsler Memory Scale (WMS) editions [23–26]. On this test, examinees are read one or two stories and are asked to recall them immediately and again after a delay. Wechsler's Logical Memory subtest has been shown to effectively distinguish between AD, MCI, and healthy controls [27,28] and has been consistently used as a measure of episodic memory in large-scale longitudinal studies [12,29–31]. A number of studies have compared story memory with list learning and found that list learning tests possess better sensitivity for distinguishing between healthy controls, MCI, and AD, and for predicting the rate of conversion from MCI to AD [6,32–34]. In addition, list learning tests that require active organization strategies to cluster the stimuli into meaningful categories (e.g., CVLT-II, NAB List Learning) are more susceptible to executive functioning deficits than tests whose stimuli are presented in a logically organized fashion, such as Logical Memory [35]. As such, a logically organized story may be easier to encode than an unorganized list of words, especially for individuals with executive functioning difficulties.

Serial assessment is often used to monitor disease progression in the course of neurodegenerative conditions such as AD. Data from multiple evaluation points can assist with differential diagnosis, as criteria for AD and other dementias require a decline in cognition from a previous ability level [36]. In theory, normative data can be used to interpret an examinee’s performance in relation to age-matched peers; however, normative data are often collected at only one time point, making it difficult to justify their use in the context of serial assessments [37–41]. To accurately interpret the meaning of a change in score between two or more time points, clinicians must be aware of critical factors that can impact scores on subsequent assessments, such as regression to the mean, the reliability of the measure, practice effects, and maturation effects (e.g., aging) [39]. Because of the complexity of interpreting change on serial assessments, several guidelines and methods have been proposed, such as various reliable change models and standardized regression-based methods [42–44]. Despite the importance of this issue, there is still a dearth of research examining performance trends across serial assessments.

Practice effects are conceptualized as the amount of improvement expected to occur with repeated exposure to a test [43]. Age, education level, disease status, and the characteristics of the test itself can all influence the magnitude of practice effects [39,45]. Practice effects can occur differentially over multiple testing points and have been described as having the most impact on the first two retests [46]. Some research has found that practice effects within the memory domain (as measured by both a list learning and story recall task) may be present between baseline and second testing, followed by a sharp decline on subsequent evaluations in normal individuals who eventually convert to mild cognitive impairment or AD [47]. However, other research has suggested that the absence of practice effects may be a good indicator of preclinical dementia [48,49]. Because of the emphasis on longitudinal assessment in dementia research (e.g., the Alzheimer's Disease Centers' Uniform Data Set, the Alzheimer's Disease Neuroimaging Initiative), it is essential to properly characterize the expected practice effects on various instruments that are commonly used in dementia assessment and to understand how clinical diagnosis (e.g., control, MCI, AD) affects scores obtained via serial assessment.

Although the Logical Memory subtest has been shown to be a valid marker of medial temporal lobe dysfunction [30,50] and has adequate inter-rater reliability [51], its test-retest reliability and associated practice effects are an important limitation to its longitudinal application [39]. The Logical Memory subtest has been shown to produce considerable practice effects, even with the use of alternate forms [52]. For instance, Gavett et al. [39] examined reliable change on several neuropsychological tests over multiple visits and found large practice effects (0.84 points per year for immediate recall and 1.10 points per year for delayed recall) for the Logical Memory subtest that substantially outweighed associated maturation effects.

One method used to attenuate practice effects is the use of alternate forms. Alternate forms are variations on a test that have been determined to be relatively equal through the process of test development and norming [53]. Historically, it has been more common for alternate forms of a test to be developed for list learning tests than story memory tests. For example, alternate forms are available for the NAB List Learning Test, CVLT-II, RAVLT, and HVLT-R, among others. In contrast, alternate forms are not typically available for the Wechsler Logical Memory test. Practice effects have been observed using the alternate form of the CVLT-II (Cohen's d range = -0.01 to 0.18), but were less pronounced than when the standard form was used at the first and second testing (Cohen's d range = 0.27 to 0.61) [54]. Significant practice effects on the CVLT were detected but controlled for using a dual baseline procedure in a sample of HIV-positive participants [55]. Although there is evidence for practice effects on list learning tasks, one study [56] found that a list learning task was one of the least susceptible to these effects compared to other tasks in a comprehensive battery in those with and without brain injury across 20 closely spaced assessments.

Because of the regularity with which older adult research participants undergo memory assessment, repeated exposure to the same Logical Memory narrative may cause considerable practice effects that reduce the validity of the test as a measure of new learning. Consequently, reductions to the validity of the Logical Memory story may have unintended consequences for application of inclusion and exclusion criteria for AD research projects. The goal of the current study is to compare the practice effects produced on the two episodic memory paradigms reviewed above (list learning and story memory) in a sample of older adults diagnosed as cognitively normal, MCI, or AD. The list learning test used in the current study is from the NAB, whereas the story memory test used in the current study is the WMS-R version of Logical Memory (Story A only). Because the content of the Logical Memory test is organized logically, and because it does not employ alternate forms, we hypothesized that this test would produce larger practice effects than the NAB List Learning Test, which is unorganized and uses alternate forms. Further, we hypothesized a dose-response relationship between diagnosis and practice effects, such that, on both memory tests, controls would exhibit the most pronounced practice effects, followed by those with MCI; participants with AD were expected to exhibit the least pronounced practice effects.

Materials and Methods

Participants and Procedure

The Boston University Medical Campus Institutional Review Board approved this study. Participants were volunteers in the longitudinal research registry of the Boston University (BU) Alzheimer's Disease Center (ADC), which is one of the 34 past and present ADCs nationwide funded by the National Institute on Aging. The registry uses clinician referrals, community outreach (e.g., lectures, presentations), and word-of-mouth to recruit cognitively healthy individuals as well as those with MCI and AD dementia. All participants provided written consent to have their data used for research purposes, in accordance with the Declaration of Helsinki. Participants completed annual study visits, which include a comprehensive neuropsychological assessment battery along with a detailed neurological examination and gathering of social, medical, and family history. A more detailed description of the registry has been published previously [57].

For the current study, we began with archival data collected from January 03, 2005 to February 15, 2016. The initial data set was made up of 689 participants who participated in a total of 3703 study visits. At each study visit, participants were assigned a clinical diagnosis by a consensus team of experts made up of neurologists, neuropsychologists, psychiatrists, nurse practitioners, and research assistants. All clinical diagnoses were made according to commonly accepted criteria, with one exception. The Petersen criteria [58,59] were used to diagnose MCI; however, individuals with no self or informant complaint, but with objective cognitive impairment, were also considered to have MCI for the current study. The National Institute of Neurological and Communicative Disorders and Stroke—Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) criteria were used to diagnose participants with AD [60].

Because the BU ADC had been enrolling participants prior to the establishment of the current data collection procedures, some participants had previous exposure to neuropsychological testing. Therefore, we excluded participants whose baseline visit occurred prior to initiation of the current protocol, so that each participant's baseline visit corresponded to their first known exposure to the two memory tests used in this study. We sought to include participants who—at their most recent study visit—were given a consensus diagnosis of control, MCI, or AD (either Possible or Probable). Therefore, we excluded participants who, at their last study visit, were either diagnosed with a non-AD etiology for dementia or who exhibited subtle cognitive difficulties that were not sufficient to meet criteria for MCI. We also excluded individuals whose primary language was not English in order to eliminate any potential language confounds. The vast majority of participants described themselves as either Black/African American or White/Caucasian. Therefore, we excluded four Asian participants so that analyses about race would be restricted to the two majority groups in this sample. The flowchart in Fig 1 depicts the application of these inclusion and exclusion criteria to generate the sample that was used for all subsequent analyses. Finally, because few participants completed more than five annual study visits, we restricted our analyses to the first five visits.

Fig 1. Participant Flow.

Flowchart illustrating application of inclusion and exclusion criteria for the current study.

Neuropsychological Measures

For the current study, two different measures of verbal episodic memory were administered to participants at each study visit. These two measures were used to test the hypothesis that story memory tests are more susceptible to practice effects than list learning tests. The story memory test used was Logical Memory Story A from the Wechsler Memory Scales—Revised (WMS-R) [24]. The list learning instrument was the NAB List Learning Test [22].

WMS-R Logical Memory Story A.

In the Logical Memory test, participants are read a logically organized story and asked to recall the story immediately after its presentation (Immediate Recall). Approximately 20 minutes later, the participants are again asked to recall the story from memory (Delayed Recall). The version used in this study uses only one story (Story A) read once to participants at each study visit. This procedure is based on those used across all ADCs following the National Alzheimer’s Coordinating Center’s Uniform Data Set [31]. Possible scores for both Logical Memory Immediate and Delayed Recall trials range from 0 to 25, with higher scores reflecting more details recalled.

NAB List Learning.

In the NAB List Learning Test, participants are read a list of 12 words that can be organized into three different semantic categories and asked to recall as many of those words as possible. The list of words is repeated a second and third time, with recall trials immediately following each presentation. The Immediate Recall total score is the sum of all words recalled across the three learning trials. After the three learning trials, a second list of words is presented that has partial overlap with the categories embedded within the first list of words. After recall of this distractor list, participants are asked to spontaneously recall the words from the first list (Short Delay Free Recall). Later, after a delay of approximately 12 minutes, participants are again prompted to spontaneously recall the first list of words (Long Delay Free Recall). Finally, a yes/no paradigm is used to evaluate recognition memory. For the current study, we use the Immediate Recall (range = 0–36) and Long Delay Free Recall (range = 0–12) scores as the primary scores from the NAB List Learning Test, as these are most compatible with the Immediate and Delayed Recall scores generated by Logical Memory. For both NAB List Learning scores, higher values reflect more words recalled. Importantly, the NAB List Learning Test has an alternate form; in the current study, participants alternated between the two forms at each study visit as a method for attenuating practice effects.
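The scoring rule and form alternation described above can be illustrated with a minimal sketch. The function names are hypothetical, and the choice of which form is given at the first visit is an assumption; the text states only that participants alternated between the two forms.

```python
def nab_immediate_recall(trial_scores):
    """Sum words recalled across the three learning trials (range 0-36)."""
    if len(trial_scores) != 3:
        raise ValueError("expected exactly three learning trials")
    if any(not 0 <= t <= 12 for t in trial_scores):
        raise ValueError("each trial score must be between 0 and 12")
    return sum(trial_scores)


def form_for_visit(visit_number):
    """Alternate between the two forms across visits.

    Assumption: Form 1 is used at odd-numbered visits. The source does not
    say which form was administered first.
    """
    return 1 if visit_number % 2 == 1 else 2


# Hypothetical participant: 5, 7, and 9 words recalled on trials 1-3.
total = nab_immediate_recall([5, 7, 9])
forms = [form_for_visit(v) for v in range(1, 6)]
```

Under this assumed ordering, a participant would see the same form at visits 1, 3, and 5, so exposure to identical word lists would occur every two years rather than every year.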

Data Analysis

To test the hypothesis that story memory causes more pronounced practice effects than list learning, we used linear mixed effects regression. Linear mixed effects models provide a flexible framework for evaluating model fit, estimating population parameters, and accounting for the hierarchical nature of longitudinal data. We performed two separate linear mixed effects models, one for the immediate recall condition (Logical Memory Immediate Recall and NAB List Learning Immediate Recall) and one for the delayed recall condition (Logical Memory Delayed Recall and NAB List Learning Long Delay Free Recall). In each model, fixed effects predictor variables included age at baseline (centered), sex (male vs. female), years of education (centered), race (White/Caucasian vs. Black/African American), time since baseline visit (years), visit number (1–5), test (Logical Memory vs. NAB List Learning), and diagnostic group (control, MCI, or AD). We also modeled interaction effects between visit number, test, and diagnostic group in order to determine if the latter two variables are associated with differential rates of change (practice effects) over time.

Because the Logical Memory and NAB List Learning Test raw scores are scaled differently, we converted them to z-scores to promote direct comparisons. We derived the z-scores for each test using the means and standard deviations of the control participants at baseline as the standard for comparison.
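As a rough illustration of this standardization step, the sketch below converts raw scores to z-scores against a reference group. The raw scores are made-up values, not study data, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def standardize(scores, reference):
    """Convert raw scores to z-scores using a reference group's mean and SD."""
    ref_mean = mean(reference)
    ref_sd = stdev(reference)  # sample SD (n - 1); an assumption, see lead-in
    return [(s - ref_mean) / ref_sd for s in scores]


# Hypothetical baseline raw scores for control participants (not study data).
control_baseline_lm = [12, 14, 15, 16, 18]   # Logical Memory, range 0-25
control_baseline_nab = [20, 22, 24, 26, 28]  # NAB Immediate Recall, range 0-36

# Scores on differently scaled tests become comparable once both are
# expressed relative to the same reference group.
lm_z = standardize([15, 10], control_baseline_lm)
nab_z = standardize([24, 16], control_baseline_nab)
```

With this approach, a z-score of 0 on either test means "equal to the average control participant at baseline," which is what allows the two tests to be entered into a single model.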

In addition to the fixed effects described above, we added random intercept and slope terms to the regression models, which allow for inter-individual variability in baseline performance (intercept) and rate of change over time (slope). We analyzed the data using linear mixed effects modeling with restricted maximum likelihood estimation in R software version 3.2.4 [61] and its lme4 package version 1.1–10 [62]. To perform hypothesis tests on the fixed effects parameter estimates, we used the Satterthwaite approximation [63] to estimate the appropriate degrees of freedom, as implemented in R's lmerTest package version 2.0–29 [64].
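One way to write down a model of the kind described in this section is sketched below. The notation is ours, not the authors': i indexes participants, j indexes observations within a participant, and the boldface term collects the dummy-coded diagnostic groups. Attaching the random slope to visit number is an assumption; the text says only that the slope captures rate of change over time.

```latex
\begin{aligned}
z_{ij} ={}& \beta_0 + \beta_1\,\mathrm{Age}_i + \beta_2\,\mathrm{Sex}_i
          + \beta_3\,\mathrm{Edu}_i + \beta_4\,\mathrm{Race}_i
          + \beta_5\,\mathrm{Time}_{ij} \\
        &+ \beta_6\,\mathrm{Visit}_{ij} + \beta_7\,\mathrm{Test}_{ij}
          + \boldsymbol{\beta}_8^{\top}\mathbf{Dx}_i \\
        &+ (\text{two- and three-way interactions among }
            \mathrm{Visit},\ \mathrm{Test},\ \mathbf{Dx}) \\
        &+ u_{0i} + u_{1i}\,\mathrm{Visit}_{ij} + \varepsilon_{ij}
\end{aligned}
```

Here u_{0i} and u_{1i} are the participant-level random intercept and slope, and the residual term is the within-participant error. In this formulation the coefficient on visit number represents the practice effect, while the coefficient on time since baseline represents maturation.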

Results

Participant age at baseline ranged from 51 to 100. Years of education ranged from 3 to 21. The median number of study visits completed was 4. A more detailed breakdown of the baseline demographic characteristics of the current sample is presented in Table 1. A graph depicting the clinical diagnosis that was assigned to participants at each study visit is shown in Fig 2. To ensure that there were no between-groups differences in interval from baseline to any of the five follow-ups, linear mixed effects regression was used to examine the effect of the interaction between group and visit number on the duration of the assessment interval. There were no significant differences in the test-retest interval for any of the visits when comparing the control group to the MCI (b = 0.009, SE = 0.029) and AD (b = 0.037, SE = 0.026) groups. The immediate recall data, shown as a function of test, visit number, and diagnosis, are plotted in Fig 3. The model for immediate recall yielded random effects standard deviations of 0.71 for the intercept, 0.03 for the slope, and 0.74 for the residual. The fixed effects parameter estimates for the immediate recall data are presented in Table 2.

Fig 2. Clinical Diagnosis by Visit.

Heatmap depicting the clinical diagnosis assigned to each participant at each study visit. The left panel represents participants whose most recent diagnosis was Control. The middle panel represents participants whose most recent diagnosis was MCI. The right panel represents participants whose most recent diagnosis was AD. Colors reflect the diagnosis made at a given visit, which does not always correspond to the most recent diagnosis. Participants' most recent visit may have occurred beyond the 5 visits used in the current study.

Fig 3. Immediate Recall Practice Effects.

Standardized test scores on the immediate recall condition as a function of visit number, test, and clinical diagnosis. Error bars represent 95% confidence intervals and account for within-subjects variability.

Table 1. Baseline Demographic Characteristics of the Current Sample.

Table 2. Results of the Linear Mixed Effects Model for Immediate Recall.

As can be seen in Table 2, there were a number of main effects and interactions influencing immediate recall performance. Younger age, female sex, and higher education were associated with better immediate recall scores. The effect of time since baseline visit was not a significant predictor of immediate recall performance. As expected, participants diagnosed with MCI and AD recalled substantially less than controls. In addition, noticeable practice effects were found, such that each visit was associated with a 0.39 standard deviation increase in the immediate recall performance of controls. Although the learning slope in MCI participants was significantly lower than in controls, it was nevertheless sizeable in effect (0.26 standard deviations per year). The AD group, on the other hand, showed a decline (-0.06 standard deviations per year) in performance over time. Finally, there was no main effect of test (Logical Memory vs. NAB List Learning) on overall immediate recall, nor did test interact with visit number or diagnosis to suggest any differential practice effects for Logical Memory vs. NAB List Learning on immediate recall (see Fig 3).

The delayed recall data, shown as a function of test, visit number, and diagnosis, are plotted in Fig 4. The model for delayed recall yielded random effects standard deviations of 0.7 for the intercept, 0.08 for the slope, and 0.69 for the residual. The fixed effects parameter estimates for the delayed recall data are presented in Table 3.

Fig 4. Delayed Recall Practice Effects.

Standardized test scores on the delayed recall condition as a function of visit number, test, and clinical diagnosis. Error bars represent 95% confidence intervals and account for within-subjects variability.

Table 3. Results of the Linear Mixed Effects Model for Delayed Recall.

The data in Table 3 reveal a number of main effects and interactions influencing delayed recall performance. Similar to the results for the immediate recall condition, younger age, female sex, and higher education were all associated with better delayed recall scores. In contrast, maturation effects (time since baseline), which were not a significant predictor of immediate recall scores, did have an influence on delayed recall scores, producing a decline of approximately 0.21 standard deviations per year. As above, noticeable practice effects were found; however, the delayed recall results differed from the immediate recall results in that the test variable (Logical Memory vs. NAB) was associated with different patterns of performance across visits. The three-way interaction between visit, test, and diagnosis reveals that controls and AD participants differ significantly when comparing the practice effects produced by each test, whereas MCI participants did not differ from controls. When controlling for all other covariates, the data reveal learning slopes in controls of 0.47 for Logical Memory and 0.31 for the NAB List Learning Test; learning slopes in MCI participants of 0.30 for Logical Memory and 0.25 for the NAB List Learning Test; and learning slopes in AD participants of 0.15 for Logical Memory and 0.13 for the NAB List Learning Test (Fig 4).
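The arithmetic behind these slopes can be laid out in a short sketch. It uses only the values reported in this paragraph; the "net change" calculation is our illustration and assumes visits are spaced about one year apart, so that the per-visit practice slope and the per-year maturation estimate can simply be added.

```python
# Delayed recall learning slopes (SD units per visit), as reported in the text.
slopes = {
    "control": {"logical_memory": 0.47, "nab_list_learning": 0.31},
    "mci": {"logical_memory": 0.30, "nab_list_learning": 0.25},
    "ad": {"logical_memory": 0.15, "nab_list_learning": 0.13},
}

# Maturation effect on delayed recall (SD units per year), as reported.
maturation_per_year = -0.21

# Difference between tests within each group (Logical Memory minus NAB).
gaps = {
    group: round(s["logical_memory"] - s["nab_list_learning"], 2)
    for group, s in slopes.items()
}

# Illustrative net change per annual visit: practice slope plus maturation.
# Assumes one year between visits; this is not a quantity the authors report.
net_change = {
    group: {test: round(b + maturation_per_year, 2) for test, b in s.items()}
    for group, s in slopes.items()
}
```

The control gap of 0.16 matches the interaction estimate quoted in the abstract. Under the one-year assumption, the net change is positive for controls and MCI participants on both tests and slightly negative for AD participants.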

As can be seen in Table 1, most of the participants in the AD group were already suffering from cognitive impairment at their baseline visit. Therefore, it may be possible that floor effects on the two memory tests interfered with our ability to observe either practice effects or more pronounced cognitive decline in this group. To further explore this possibility, we restricted our analyses to the AD group only and divided this sample into two subsamples based on a median split of MMSE scores at baseline. We then examined trends in Immediate and Delayed recall on Logical Memory and NAB List Learning as a function of baseline MMSE, using mixed effects regression models similar to those described above. The median baseline MMSE value was 23; the low and high baseline MMSE groups were therefore composed of AD participants with baseline MMSE scores of ≤ 23 (n = 66) and > 23 (n = 55), respectively. On the Delayed Recall trial, there was a significant interaction between visit number and baseline MMSE group, b = -0.107, SE = 0.051, such that AD participants with MMSE scores of > 23 at baseline showed more rapid decline on Delayed Recall than AD participants with MMSE scores of ≤ 23 at baseline. Although the interaction term for Immediate Recall was nearly identical (b = -0.109), this estimate was less precise (SE = 0.073) and therefore did not achieve statistical significance.
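To see why two nearly identical estimates lead to different conclusions, the ratio of each estimate to its standard error can be compared. The sketch below uses a normal approximation for the p-value, which is a simplification: the study's tests relied on Satterthwaite-approximated degrees of freedom, so the exact p-values would differ somewhat.

```python
import math


def wald_ratio(estimate, std_error):
    """Ratio of a parameter estimate to its standard error."""
    return estimate / std_error


def two_sided_p_normal(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Interaction estimates and standard errors as reported in the text.
delayed_z = wald_ratio(-0.107, 0.051)
immediate_z = wald_ratio(-0.109, 0.073)

delayed_p = two_sided_p_normal(delayed_z)
immediate_p = two_sided_p_normal(immediate_z)
```

The delayed recall ratio is roughly -2.1 and the immediate recall ratio roughly -1.5, so only the former exceeds the conventional 1.96 threshold even though the point estimates are almost the same.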

Discussion

Story recall and list learning are two of the most common methods used in clinical research to measure verbal episodic memory [7]. In particular, the Wechsler Logical Memory test is frequently used to determine eligibility for research studies related to AD and other dementias. Users of the test should keep in mind that the story's main character would be 71 years old [65] at the time this study was conducted, and because her story has changed relatively little since its inception, it is likely that many patients and research participants have had some exposure to her plight. The ubiquity with which this test is administered to older adults, and the role that this test has in determining diagnosis, study eligibility, and monitoring treatment outcomes [29], all highlight the need to understand the practice effects that occur on this test. A previous study examining longitudinal outcomes in a national sample of cognitively healthy older adults showed that Logical Memory Immediate and Delayed recall produced substantial practice effects with repeated exposure to the story [39]. Therefore, the goals of the current study were twofold: 1) to extend the previous findings to individuals with MCI and AD and 2) to directly compare practice effects on Logical Memory to practice effects on the NAB List Learning Test. We hypothesized that, due to its logically organized structure [35] and lack of alternate forms, Logical Memory would produce more pronounced practice effects than the NAB List Learning Test across all groups (control, MCI, AD).

The results of this study only partially supported our hypotheses. Although practice effects were pronounced—in controls and MCI patients—on both immediate and delayed recall conditions, the format of the test (story vs. list) only served to moderate practice effects in the delayed recall condition. For immediate recall, linear rates of change across visits did not differ based on the test administered. In the delayed recall condition, however, the hypothesized difference in practice effects based on test was observed. In controls, there was a large practice effects discrepancy between tests; the practice effects discrepancy in the MCI group was smaller but not significantly different from controls. In contrast, the practice effects discrepancy in the AD group was found to significantly differ from controls, suggesting that diagnosis moderates the effect of test on learning slope with repeated exposure to a stimulus. In particular, individuals with better memory are more susceptible to exhibiting practice effects over time on delayed recall of Logical Memory compared to NAB List Learning, whereas individuals with memory impairment due to AD do not perform differently over time on the two delayed recall tasks.

One possible factor that may have influenced these findings is the potential for floor effects to have obscured relevant trends in the recall performance of individuals in the AD group, most of whom entered the study having already experienced cognitive decline. Analysis of recall trends in the AD group only, using baseline MMSE grouping as a predictor variable, was performed to examine the potential influence of floor effects on the most impaired subset of our sample. The results indicated that these floor effects may have prevented us from observing even more rapid decline in the AD sample than was possible given the difficulty of the two memory tests used here. Had our sample of AD participants been less impaired at baseline, the primary analyses may have shown more decline, rather than seeming stability, in recall scores across visits. A less impaired sample at baseline would have likely magnified the observed differences in practice effects between the AD group and the other two groups.
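The attenuating influence of a floor can be illustrated with a minimal sketch. The numbers below are hypothetical and are not drawn from the study data, and the ordinary least squares slope is a stand-in for the mixed effects models used in the analyses. The example only shows that when true ability keeps falling but observed scores cannot drop below zero, a linear slope fitted to the observed scores understates the true rate of decline.

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical participant: five annual visits, latent recall ability
# falling by 1.5 points per visit from a baseline of 3 points.
visits = [0, 1, 2, 3, 4]
latent_scores = [3.0 - 1.5 * v for v in visits]

# The test cannot record a score below zero (the floor), so the
# observed score is the latent score truncated at zero.
observed_scores = [max(0.0, s) for s in latent_scores]

latent_slope = ols_slope(visits, latent_scores)
observed_slope = ols_slope(visits, observed_scores)

print(f"latent slope:   {latent_slope:.2f} points per visit")
print(f"observed slope: {observed_slope:.2f} points per visit")
```

In this constructed case the observed slope is half as steep as the latent slope, which is the sense in which floor effects can make declining performance look closer to stability.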

Because our model included age at baseline, number of visits, and time (years) since baseline as predictors, the current results demonstrate that the practice effects that are elicited by repeated exposure to tests of verbal episodic memory are—for controls and patients with MCI—more powerful than the decline in episodic memory that occurs as part of the aging process. These findings are consistent with previous work showing that tests of both episodic and semantic memory are affected more strongly by repeated exposure to a stimulus (i.e., practice effects), whereas tests of attention and executive functioning are affected more strongly by maturation (i.e., aging) [39]. This pattern was not observed in participants with AD, however, which indicates that the underlying disease process exerts a more powerful influence over test results than repeated exposure to a test. Such a conclusion is consistent with previous data suggesting that the absence of practice effects may be an important clinical marker for underlying pathology [48,49].
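To make this reasoning concrete, the fixed effects structure can be sketched in simplified form. The notation below is an illustrative assumption: the symbols, the single random intercept, and the omission of the diagnosis interactions and remaining covariates are simplifications introduced here, not the fitted specification.

```latex
y_{ij} = \beta_0 + \beta_1 E_{ij} + \beta_2 T_{ij} + \beta_3 \,(E_{ij} \times T_{ij})
       + \beta_4 t_{ij} + \beta_5 A_i + \dots + u_i + \varepsilon_{ij}
```

Here $y_{ij}$ is the recall score of participant $i$ at visit $j$, $E_{ij}$ is the number of prior exposures to the test, $T_{ij}$ indicates the test (1 = Logical Memory, 0 = NAB List Learning), $t_{ij}$ is the time in years since baseline, $A_i$ is age at baseline, $u_i$ is a participant-level random intercept, and $\varepsilon_{ij}$ is residual error. Under this sketch, the practice slope is $\beta_1$ for NAB List Learning and $\beta_1 + \beta_3$ for Logical Memory. With annual visits, each additional visit adds one exposure and roughly one year, so the expected net change per visit is approximately $\beta_1 + \beta_3 T_{ij} + \beta_4$; practice outweighs time-related decline whenever this sum is positive.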

These results conflict with other research suggesting that practice effects level off after two to three exposures to a test [46]. Although we modeled our data using only a linear trend, visual inspection of the data presented in Figs 3 and 4 reveals a continued effect of practice beyond the second and third visits. Some experts have recommended employing a prebaseline assessment period, whereby participants are exposed to the tests prior to the baseline visit, in order to minimize practice effects (e.g., [55,66]). Although our study did not seek to examine the effectiveness of this recommendation per se, our findings of continued practice effects beyond the first two to three visits suggest that, at least for episodic memory tests, it is unlikely that practice effects can be eliminated entirely.

Because control participants (and, to a lesser extent, those with MCI) exhibited such pronounced practice effects, there is a convincing argument to be made that all episodic memory measures should be normed not just with cross-sectional data, but with longitudinal data as well. Given the importance of serial neuropsychological assessment in older adults for differential diagnosis, treatment monitoring, evaluation of clinical trial efficacy, and so forth, relying on cross-sectional normative data alone to interpret results obtained via serial assessment is not advised. At the very least, data on test-retest reliability and practice effects are essential for interpreting reliable change [42,44].
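As an illustration of why both quantities matter, one common formulation of reliable change (the Jacobson and Truax index with a mean practice adjustment) can be sketched as follows. This is a minimal sketch, not the method used in this study; the function name and every numeric value are hypothetical, and real applications would take the practice gain, standard deviation, and reliability from longitudinal normative data.

```python
import math


def practice_adjusted_rci(baseline, retest, mean_practice_gain,
                          sd_baseline, test_retest_r):
    """Practice-adjusted reliable change index (illustrative).

    The observed change is compared with the change expected from
    practice alone, scaled by the standard error of the difference.
    """
    # Standard error of measurement from the baseline SD and reliability.
    sem = sd_baseline * math.sqrt(1.0 - test_retest_r)
    # Standard error of the difference between two measurements.
    se_diff = math.sqrt(2.0) * sem
    return (retest - baseline - mean_practice_gain) / se_diff


# Hypothetical normative values: controls gain 2 points on retest,
# baseline SD is 3 points, and test-retest reliability is .75.
norms = dict(mean_practice_gain=2.0, sd_baseline=3.0, test_retest_r=0.75)

# A 2-point gain matches the expected practice effect exactly.
rci_improved = practice_adjusted_rci(baseline=10, retest=12, **norms)

# An unchanged raw score falls short of the expected practice effect.
rci_unchanged = practice_adjusted_rci(baseline=10, retest=10, **norms)

print(f"RCI for a 2-point gain:   {rci_improved:.2f}")
print(f"RCI for an unchanged score: {rci_unchanged:.2f}")
```

Under these assumed norms, an unchanged raw score yields a negative index: without practice-adjusted longitudinal norms, stability on retest could be mistaken for preserved functioning when it actually reflects a failure to benefit from prior exposure.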

Due to the retrospective nature of this study, one important confound remains unaccounted for. As discussed above, the NAB List Learning Test used alternate forms, whereas the Logical Memory test did not. Therefore, the results of this study cannot disentangle the paradigmatic differences between story memory and list learning from the use of alternate forms. It is possible that these results simply reflect the relative differences between using and not using alternate forms of a test. On the other hand, the inherent organization of the Logical Memory task may make it more amenable to practice effects. Further research is needed to differentiate the role of alternate forms in attenuating practice effects using the same learning paradigm. We can only conclude from the current study that the Wechsler Logical Memory test, which lacks alternate forms, produced more pronounced practice effects on delayed recall in controls and MCI participants than did the alternate forms of the NAB List Learning Test. It is also important to note that the retrospective nature of the current study only allowed for the analysis of two different memory tests. As such, these results may not apply to other tests of episodic memory.

Other limitations are as follows: The sample was primarily recruited through convenience methods in a geographically restricted area. Similarly, the sample was highly educated and lacking in racial diversity. Therefore, the external validity of the study may be limited. Because only controls and participants with MCI and AD were included, the results cannot be generalized outside of those diagnostic groups. Moreover, the MCI group in the current sample included participants who demonstrated objective cognitive impairment, but not all of these participants presented with a self- or informant complaint of cognitive difficulties. Therefore, the current results may not generalize to participants diagnosed with MCI using the requirement of a cognitive complaint [59]. Further, the diagnostic groups were defined based on clinical diagnosis without pathological confirmation. Another limitation of the study is that not all participants completed five study visits, either due to attrition or because participants enrolled less than four years ago. Therefore, the follow-up data have been obtained from a highly selected sample and may also lack external validity. In addition, the results may be biased by the fact that the Logical Memory and NAB List Learning Tests were used for clinical diagnosis; this potential bias is likely to affect the analyses related to the three different diagnostic groups (control, MCI, AD), but is not likely to produce confounds related to the study's primary results, which show that Logical Memory is associated with more pronounced practice effects on delayed recall than NAB List Learning. Finally, the current results are exclusively based on group data and may not be applicable to individual-level decision-making. Future research should seek to determine the diagnostic accuracy of practice effects data across multiple visits for clinical decision-making at the individual level.

Author Contributions

  1. Conceptualization: BEG RAS.
  2. Data curation: BEG KRC EGS BM CEC JM YT RAS.
  3. Formal analysis: BEG.
  4. Funding acquisition: CEC RAS.
  5. Investigation: KRC EGS.
  6. Methodology: BEG JM YT RAS.
  7. Project administration: CEC RAS.
  8. Resources: BEG CEC YT RAS.
  9. Supervision: BEG EGS CEC JM YT RAS.
  10. Validation: BEG CEC JM YT RAS.
  11. Visualization: BEG.
  12. Writing – original draft: BEG ASG JLS KRC.
  13. Writing – review & editing: BEG ASG JLS KRC EGS BM CEC JM YT RAS.


References

  1. Mattson MP. Pathways towards and away from Alzheimer’s disease. Nature. 2004;430: 631–9. pmid:15295589
  2. Salmon DP. Disorders of memory in Alzheimer’s disease. Handbook of neuropsychology, vol 2: Memory and its disorders. Amsterdam: Elsevier; 2000. pp. 155–195.
  3. Budson AE, Price BH. Memory dysfunction. The New England Journal of Medicine. 2005;352: 692–699. pmid:15716563
  4. Sperling RA, Aisen PS, Beckett LA, Bennett DA, Craft S, Fagan AM, et al. Toward defining the preclinical stages of Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & Dementia. 2011;7: 280–292. pmid:21514248
  5. Hodges JR, Patterson K. Is semantic memory consistently impaired early in the course of Alzheimer’s disease? Neuroanatomical and diagnostic implications. Neuropsychologia. 1995;33: 441–459. pmid:7617154
  6. Rabin LA, Paré N, Saykin AJ, Brown MJ, Wishart HA, Flashman LA, et al. Differential memory test sensitivity for diagnosing amnestic mild cognitive impairment and predicting conversion to Alzheimer’s disease. Neuropsychology, Development, and Cognition Section B, Aging, Neuropsychology and Cognition. 2009;16: 357–376. pmid:19353345
  7. Rabin LA, Paolillo E, Barr WB. Stability in test-usage practices of clinical neuropsychologists in the United States and Canada over a 10-year period: A follow-up survey of INS and NAN Members. Archives of Clinical Neuropsychology. 2016;31: 206–230. pmid:26984127
  8. Brandt J, Benedict RH. Hopkins Verbal Learning Test-Revised. Professional Manual. Lutz, FL: Psychological Assessment Resources; 2001.
  9. Fleisher AS, Sowell BB, Taylor C, Gamst AC, Petersen RC, Thal LJ. Clinical predictors of progression to Alzheimer disease in amnestic mild cognitive impairment. Neurology. 2007;68: 1588–1595. pmid:17287448
  10. Gavett BE, Poon SJ, Ozonoff A, Jefferson AL, Nair AK, Green RC, et al. Diagnostic utility of the NAB List Learning test in Alzheimer’s disease and amnestic mild cognitive impairment. Journal of the International Neuropsychological Society. 2009;15: 121–129. pmid:19128535
  11. Gavett BE, Ozonoff A, Doktor V, Palmisano J, Nair AK, Green RC, et al. Predicting cognitive decline and conversion to Alzheimer’s disease in older adults using the NAB List Learning test. Journal of the International Neuropsychological Society. 2010;16: 651–660. pmid:20374677
  12. Johnson DK, Storandt M, Balota DA. Discourse analysis of logical memory recall in normal aging and in dementia of the Alzheimer type. Neuropsychology. 2003;17: 82–92. pmid:12597076
  13. Delis DC, Kramer JH, Kaplan E, Ober BA. California Verbal Learning Test–2nd Ed. (CVLT-II) Manual. San Antonio, TX: The Psychological Corporation; 2000.
  14. Morris JC, Heyman A, Mohs RC, Hughes JP, Belle G van, Fillenbaum G, et al. The Consortium to Establish a Registry for Alzheimer’s Disease (CERAD). Part I. Clinical and neuropsychological assessment of Alzheimer’s disease. Neurology. 1989;39: 1159–1165. pmid:2771064
  15. Rey A. L’examen clinique en psychologie. Presses universitaires de France; 1964. p. 222.
  16. Estévez-González A, Kulisevsky J, Boltes A, Otermín P, García-Sánchez C. Rey verbal learning test is a useful tool for differential diagnosis in the preclinical phase of Alzheimer’s disease: comparison with mild cognitive impairment and normal aging. International Journal of Geriatric Psychiatry. 2003;18: 1021–1028. pmid:14618554
  17. Kaltreider LB, Cicerello AR, Lacritz LH, Weiner MF, Honig LS, Rosenberg RN, et al. Comparison of the CERAD and CVLT List-Learning tasks in Alzheimer’s disease. Clinical Neuropsychologist. 2000;14: 269–274. pmid:11262701
  18. Karrasch M, Sinervä E, Grönholm P, Rinne J, Laine M. CERAD test performances in amnestic mild cognitive impairment and Alzheimer’s disease. Acta Neurologica Scandinavica. 2005;111: 172–179. pmid:15691286
  19. Kuslansky G. Detecting dementia with the Hopkins Verbal Learning Test and the Mini-Mental State Examination. Archives of Clinical Neuropsychology. 2004;19: 89–104. pmid:14670382
  20. Lacritz LH, Cullum CM, Weiner MF, Rosenberg RN. Comparison of the Hopkins Verbal Learning Test-Revised to the California Verbal Learning Test in Alzheimer’s disease. Applied Neuropsychology. 2001;8: 180–184. pmid:11686654
  21. Gurnani AS, Gavett BE. The differential effects of Alzheimer’s disease and Lewy body pathology on cognitive performance: A meta-analysis. Neuropsychology Review. in press
  22. Stern RA, White T. Neuropsychological Assessment Battery. Lutz, FL: Psychological Assessment Resources; 2003.
  23. Wechsler D. A standardized memory scale for clinical use. The Journal of Psychology. 1945;19: 87–95.
  24. Wechsler D. WMS-R: Wechsler Memory Scale-Revised. Psychological Corporation; 1987.
  25. Wechsler D. Wechsler Memory Scale–Third Edition. San Antonio, TX: The Psychological Corporation; 1997.
  26. Wechsler D. Wechsler Memory Scale–Fourth Edition. San Antonio, TX: The Psychological Corporation; 2009.
  27. Rubin EH, Storandt M, Miller JP, Kinscherf DA, Grant EA, Morris JC, et al. A prospective study of cognitive function and onset of dementia in cognitively healthy elders. Archives of Neurology. 1998;55: 395–401. pmid:9520014
  28. Storandt M, Hill RD. Very mild senile dementia of the Alzheimer type. II. Psychometric test performance. Archives of Neurology. 1989;46: 383–386. pmid:2705897
  29. Chapman KR, Bing-Canar H, Alosco ML, Steinberg EG, Martin B, Chaisson C, et al. Mini Mental State Examination and Logical Memory scores for entry into Alzheimer’s disease trials. Alzheimer’s Research & Therapy. 2016;8: 9. pmid:26899835
  30. Libon DJ, Preis SR, Beiser AS, Devine S, Seshadri S, Wolf PA, et al. Verbal memory and brain aging: an exploratory analysis of the role of error responses in the Framingham Study. American Journal of Alzheimer’s Disease and Other Dementias. 2015;30: 622–628. pmid:25788434
  31. Weintraub S, Salmon D, Mercaldo N, Ferris S, Graff-Radford NR, Chui H, et al. The Alzheimer’s Disease Centers’ Uniform Data Set (UDS): The neuropsychologic test battery. Alzheimer Disease and Associated Disorders. 2009;23: 91–101. pmid:19474567
  32. Jager CA de, Hogervorst E, Combrinck M, Budge MM. Sensitivity and specificity of neuropsychological tests for mild cognitive impairment, vascular cognitive impairment, and Alzheimer’s disease. Psychological Medicine. 2003;33: 1039–1050. pmid:12946088
  33. Kavé G, Heinik J. Neuropsychological evaluation of mild cognitive impairment: three case reports. The Clinical Neuropsychologist. 2004;18: 362–372. pmid:15739808
  34. Silva D, Guerreiro M, Maroco J, Santana I, Rodrigues A, Bravo Marques J, et al. Comparison of four verbal memory tests for the diagnosis and predictive value of mild cognitive impairment. Dementia and Geriatric Cognitive Disorders Extra. 2012;2: 120–131. pmid:22590473
  35. Tremont G, Halpert S, Javorsky DJ, Stern RA. Differential impact of executive dysfunction on verbal list learning and story recall. The Clinical Neuropsychologist. 2000;14: 295–302. pmid:11262704
  36. McKhann GM, Knopman DS, Chertkow H, Hyman BT, Jack CR, Kawas CH, et al. The diagnosis of dementia due to Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & Dementia. 2011;7: 263–269. pmid:21514250
  37. Bläsi S, Zehnder AE, Berres M, Taylor KI, Spiegel R, Monsch AU. Norms for change in episodic memory as a prerequisite for the diagnosis of mild cognitive impairment (MCI). Neuropsychology. 2009;23: 189–200. pmid:19254092
  38. De Santi S, Pirraglia E, Barr W, Babb J, Williams S, Rogers K, et al. Robust and conventional neuropsychological norms: diagnosis and prediction of age-related cognitive decline. Neuropsychology. 2008;22: 469–484. pmid:18590359
  39. Gavett BE, Ashendorf L, Gurnani AS. Reliable change on neuropsychological tests in the Uniform Data Set. Journal of the International Neuropsychological Society. 2015;21: 558–567. pmid:26234918
  40. Holtzer R, Goldin Y, Zimmerman M, Katz M, Buschke H, Lipton RB. Robust norms for selected neuropsychological tests in older adults. Archives of Clinical Neuropsychology. 2008;23: 531–41. pmid:18572380
  41. Pedraza O, Lucas JA, Smith GE, Petersen RC, Graff-Radford NR, Ivnik RJ. Robust and expanded norms for the Dementia Rating Scale. Archives of Clinical Neuropsychology. 2010;25: 347–58. pmid:20427376
  42. Duff K. Evidence-based indicators of neuropsychological change in the individual patient: Relevant concepts and methods. Archives of Clinical Neuropsychology. 2012;27: 248–261. pmid:22382384
  43. Heilbronner RL, Sweet JJ, Attix DK, Krull KR, Henry GK, Hart RP. Official position of the American Academy of Clinical Neuropsychology on serial neuropsychological assessments: The utility and challenges of repeat test administrations in clinical and forensic contexts. The Clinical Neuropsychologist. 2010;24: 1267–1278. pmid:21108148
  44. Hinton-Bayre AD. Deriving reliable change statistics from test-retest normative data: Comparison of models and mathematical expressions. Archives of Clinical Neuropsychology. 2010;25: 244–256. pmid:20197293
  45. McCaffrey RJ, Duff K, Westervelt HJ. Practitioner’s Guide to Evaluating Change with Neuropsychological Assessment Instruments. New York: Springer; 2000.
  46. Ivnik RJ, Smith GE, Petersen RC, Boeve BF, Kokmen E, Tangalos EG. Diagnostic accuracy of four approaches to interpreting neuropsychological test data. Neuropsychology. 2000;14: 163–77. pmid:10791857
  47. Machulda MM, Pankratz VS, Christianson TJ, Ivnik RJ, Mielke MM, Roberts RO, et al. Practice effects and longitudinal cognitive change in normal aging vs. incident mild cognitive impairment and dementia in the Mayo Clinic Study of Aging. The Clinical Neuropsychologist. 2013;27: 1247–64. pmid:24041121
  48. Duff K, Lyketsos CG, Beglinger LJ, Chelune G, Moser DJ, Arndt S, et al. Practice effects predict cognitive outcome in amnestic mild cognitive impairment. American Journal of Geriatric Psychiatry. 2011;19: 932–939. pmid:22024617
  49. Hassenstab J, Ruvolo D, Jasielec M, Xiong C, Grant E, Morris JC. Absence of practice effects in preclinical Alzheimer’s disease. Neuropsychology. 2015;29: 940–948. pmid:26011114
  50. Walhovd KB, Fjell AM, Brewer J, McEvoy LK, Fennema-Notestine C, Hagler DJ, et al. Combining MR imaging, positron-emission tomography, and CSF biomarkers in the diagnosis and prognosis of Alzheimer disease. American Journal of Neuroradiology. 2010;31: 347–354. pmid:20075088
  51. Sullivan K. Estimates of interrater reliability for the Logical Memory subtest of the Wechsler Memory Scale-Revised. Journal of Clinical and Experimental Neuropsychology. 1996;18: 707–712. pmid:8941855
  52. Dikmen SS, Heaton RK, Grant I, Temkin NR. Test-retest reliability and practice effects of expanded Halstead-Reitan Neuropsychological Test Battery. Journal of the International Neuropsychological Society. 1999;5: 346–356. pmid:10349297
  53. Benedict RHB. Effects of using same- versus alternate-form memory tests during short-interval repeated assessments in multiple sclerosis. Journal of the International Neuropsychological Society. 2005;11: 727–36. pmid:16248908
  54. Woods SP, Delis DC, Scott JC, Kramer JH, Holdnack JA. The California Verbal Learning Test–second edition: Test-retest reliability, practice effects, and reliable change indices for the standard and alternate forms. Archives of Clinical Neuropsychology. 2006;21: 413–20. pmid:16843636
  55. Duff K, Westervelt HJ, McCaffrey RJ, Haase RF. Practice effects, test-retest stability, and dual baseline assessments with the California Verbal Learning Test in an HIV sample. Archives of Clinical Neuropsychology. 2001;16: 461–76. pmid:14590160
  56. Wilson BA, Watson PC, Baddeley AD, Emslie H, Evans JJ. Improvement or simply practice? The effects of twenty repeated assessments on people with and without brain injury. Journal of the International Neuropsychological Society. 2000;6: 469–79. pmid:10902416
  57. Jefferson AL, Wong S, Bolen E, Ozonoff A, Green RC, Stern RA. Cognitive correlates of HVOT performance differ between individuals with mild cognitive impairment and normal controls. Archives of Clinical Neuropsychology. 2006;21: 405–412. pmid:16893623
  58. Petersen RC. Mild cognitive impairment as a diagnostic entity. Journal of Internal Medicine. 2004;256: 183–194. pmid:15324362
  59. Winblad B, Palmer K, Kivipelto M, Jelic V, Fratiglioni L, Wahlund LO, et al. Mild cognitive impairment–beyond controversies, towards a consensus: report of the International Working Group on Mild Cognitive Impairment. Journal of Internal Medicine. 2004;256: 240–246. pmid:15324367
  60. McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer’s disease: Report of the NINCDS-ADRDA Work Group* under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology. 1984;34: 939–944. pmid:6610841
  61. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2016. Available:
  62. Bates D, Maechler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software. 2015;67: 1–48.
  63. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics. 1946;2: 110–114. pmid:20287815
  64. Kuznetsova A, Bruun Brockhoff P, Haubo Bojesen Christensen R. lmerTest: Tests in Linear Mixed Effects Models [Internet]. 2015. Available:
  65. Kent P. The evolution of the Wechsler Memory Scale: A selective review. Applied Neuropsychology: Adult. 2013;20: 277–291. pmid:23445503
  66. Goldberg TE, Harvey PD, Wesnes KA, Snyder PJ, Schneider LS. Practice effects due to serial cognitive assessment: Implications for preclinical Alzheimer’s disease randomized controlled trials. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring. 2015;1: 103–111. pmid:27239497