Reliability and validity of the UK Biobank cognitive tests

UK Biobank is a health resource with data from over 500,000 adults. The cognitive assessment in UK Biobank is brief and bespoke, and is administered without supervision on a touchscreen computer. Psychometric information on the UK Biobank cognitive tests is limited. Despite the non-standard nature of these tests and the limited psychometric information, the UK Biobank cognitive data have been used in numerous scientific publications. The present study examined the validity and short-term test-retest reliability of the UK Biobank cognitive tests. A sample of 160 participants (mean age = 62.59, SD = 10.24) was recruited; they completed the UK Biobank cognitive assessment and a range of well-validated cognitive tests (‘reference tests’). Fifty-two participants returned 4 weeks later to repeat the UK Biobank tests. Correlations were calculated between UK Biobank tests and reference tests. Two measures of general cognitive ability were created by entering scores on the UK Biobank cognitive tests, and scores on the reference tests, respectively, into separate principal component analyses and saving scores on the first principal component. Four-week test-retest correlations were calculated for UK Biobank tests. UK Biobank cognitive tests showed a range of correlations with their respective reference tests, i.e. those tests that are thought to assess the same underlying cognitive ability (mean Pearson r = 0.53, range = 0.22 to 0.83, p ≤ .005). The measure of general cognitive ability based on the UK Biobank cognitive tests correlated at r = 0.83 (p < .001) with a measure of general cognitive ability created using the reference tests. Four-week test-retest reliability of the UK Biobank tests was moderate-to-high (mean Pearson r = 0.55, range = 0.40 to 0.89, p ≤ .003). Despite the brief, non-standard nature of the UK Biobank cognitive tests, some tests showed substantial concurrent validity and test-retest reliability. These psychometric results provide currently-lacking information on the validity of the UK Biobank cognitive tests.


Introduction
UK Biobank is a large prospective cohort study that was designed to investigate the health of middle-aged and older adults residing in the UK (https://www.ukbiobank.ac.uk/) [1]. stability of these tests. No one has examined the test-retest reliability of the newer UK Biobank tests introduced since baseline. One of the most replicated findings in psychological research is that scores on tests of cognitive function are positively correlated and that a measure of general cognitive ability (g) can be extracted from scores on a diverse set of cognitive tests [4][5][6][7]. One way to examine the validity of the UK Biobank cognitive assessment would be to confirm that these brief and unsupervised cognitive tests are positively correlated. Principal component analysis (PCA) is often used to examine the inter-correlational structure of cognitive tests. A composite score, which is created by saving scores based on the first unrotated principal component, typically accounts for about 40% of the variance in a wide range of different cognitive tests [8,9]. Using UK Biobank baseline data, Lyall et al. [3] entered scores on the UK Biobank Reaction Time, Pairs Matching, Fluid Intelligence and Numeric Memory tests into a PCA model. The first unrotated principal component accounted for 40% of the variance and the individual test scores all loaded at ≥ 0.49 on this component [3], confirming that the scores were positively correlated and that a g component was present in the original UK Biobank tests. The correlational structure of the newer UK Biobank tests introduced since baseline has not been investigated.
The aim of the current study was to expand on the work carried out by Lyall et al. [3] and investigate aspects of the reliability and validity of the UK Biobank cognitive tests; both the original baseline tests and newer tests introduced since baseline. In the current study, we recruited an independent sample of participants who had not taken part in the UK Biobank study. These participants were administered the enhanced UK Biobank cognitive assessment that was given during the imaging study and included all baseline tests as well as more detailed tests introduced since baseline (S1 File, S1 Table). Participants also completed a battery of well-validated, standard cognitive tests; hereinafter we will call these 'reference tests'. For each of the UK Biobank tests, we chose a reference test that we judged was assessing the same underlying cognitive domain. Participants also completed brief screening tests of cognitive impairment and measures of subjective memory complaints, which are often applied in studies of normal and pathological ageing to measure global cognitive functioning (hereinafter we will call these 'general' tests). Approximately four weeks after the baseline assessment, a subsample of participants returned and repeated the enhanced UK Biobank cognitive assessment.
To investigate the reliability and validity of the UK Biobank tests, three sets of analyses were carried out. First, the concurrent validity of the UK Biobank tests was investigated by correlating scores on the UK Biobank tests with scores on the reference tests. We predicted that correlations between UK Biobank tests and reference tests that assessed a similar cognitive domain would be higher than those between UK Biobank tests and reference tests that assessed different cognitive domains. Second, we investigated whether a g component was present in the correlations among the unsupervised UK Biobank cognitive tests, and tested whether any such g component correlated highly with a g component created using the reference tests, which were administered by a trained tester under standardised conditions. We predicted that the correlation between a g component created using the UK Biobank tests and a g component created using the reference tests would be high (e.g., r > 0.7). Third, we characterised the short-term test-retest stability of the UK Biobank tests by correlating the cognitive test scores from time 1 and time 2 (approximately 4 weeks apart).

Participants
A sample of participants who had not taken part in the UK Biobank study was recruited. Participants were identified through the University of Edinburgh Volunteer Panel and Join Dementia Research (https://www.joindementiaresearch.nihr.ac.uk/home?login). Both are databases of volunteers who are interested in taking part in research. Potential participants who were aged 40 to 80 years old and who were able to travel to the Psychology Department at the University of Edinburgh were contacted and invited to take part in this study. The age range of 40 to 80 years was used in the current study because this is approximately the age range of UK Biobank participants across the various data collection points to date (http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21003). The UK Biobank baseline study aimed to recruit participants aged 40 to 70 years (mean age = 56.53, SD = 8.10). These participants have become older as they have been followed up. The age range of UK Biobank participants at the UK Biobank imaging study, which uses the same cognitive assessment as in the current study, was 44 to 82 years (mean age = 62.59 years, SD = 10.24).
UK Biobank participants and people with a diagnosis of dementia or mild cognitive impairment were not eligible for this study. A total of 160 participants were recruited. Written informed consent was obtained for all participants. This study received ethical approval from the University of Edinburgh Psychology Research Ethics Committee (reference number 2-1718/3).

Materials
UK Biobank cognitive test battery. UK Biobank provided us with a stand-alone version of the UK Biobank cognitive test battery that was administered at the UK Biobank imaging study. To make the testing session as similar as possible to the UK Biobank clinic study sessions, the present study used the same touchscreen monitor and computer setup as is used at the UK Biobank clinics' imaging sessions. The UK Biobank cognitive assessment was designed to be fully-automated and the tests were administered unsupervised. During the UK Biobank clinic assessment, there are UK Biobank staff present in the clinic while participants are completing the cognitive assessment; however, participants are expected to sit and work through the cognitive assessment independently. All instructions for the UK Biobank cognitive tests are presented onscreen. In the present study, participants were given brief oral instructions that they were going to complete some tasks on the computer on their own, and that they were to follow the instructions on the screen. To emulate the UK Biobank testing conditions, one author (CF-R) was present in the room while participants completed the UK Biobank tests, but participants were left to work through the tests on their own. The tests administered as part of the UK Biobank cognitive assessment are listed in Table 1 and a detailed description of each test's contents, administration, and scoring is provided in the S1 File.
General tests. A cognitive screening test and a subjective memory questionnaire (see Table 1) were included in the current study to investigate the correlations between the UK Biobank cognitive tests and frequently-used measures of global cognitive function. Detailed descriptions of these tests are provided in S1 File.
Reference tests. To be able to test the concurrent validity of each UK Biobank test, a battery of standard neuropsychological tests was administered. For each UK Biobank test, we selected one or more well-validated, standard cognitive tests that resemble the UK Biobank test in terms of the underlying cognitive domain thought to be assessed, and in the actual content of the task ('reference tests'). The reference tests were all administered under standardised conditions, one-to-one, face-to-face, by a trained tester, strictly following the administration instructions. The reference tests chosen for each UK Biobank test are shown in Table 1. A detailed description of each of the reference tests' contents, administration, and scoring is provided in S1 File. No reference test was chosen for the UKB Fluid IQ test because no test was identified by the authors as a suitable comparator.
Demographic and health questionnaire. Information on age, sex, and education was collected. Each participant's age was calculated from their date of birth. To measure education, participants were asked, "How many years of full-time education have you completed?". General health was assessed by asking participants, "In general, would you say your health is excellent, very good, good, fair, or poor?", and "Compared to one year ago, how would you rate your health in general now?". For the second question, participants selected from the following answers: much better now than one year ago; somewhat better now than one year ago; about the same; somewhat worse now than one year ago; much worse now than one year ago.
UK Biobank cognitive assessment questionnaire. The UK Biobank cognitive test battery was designed to be administered unsupervised, and, although there were staff members present in the UK Biobank clinic during test administration, participants were expected to work through the cognitive assessment independently without a tester observing. Because no tester was there to help participants understand the test instructions, the onscreen instructions for each test must be clear. To assess this, after completing the UK Biobank cognitive assessment participants were asked, "Did you generally find that the instructions for all the tasks were clear?". Participants answered either yes or no. Next, participants were shown screenshots of each UK Biobank cognitive test and were asked "Were the instructions for this test clear?". Participants answered either yes or no. The UKB Numeric Memory task was designed to assess backward digit span, which involves participants remembering a sequence of digits and then mentally reversing them. However, for this task, all the numbers in the to-be-remembered sequence were presented on the screen at once. This meant that some individuals were able to get the correct answer by reading the number sequence from right-to-left without reversing the digits in their mind. These individuals are actually performing a forward digit span, which is an easier task. To identify the number of participants who completed this task forwards or backwards, participants were asked how they completed this task. The possible options were: read from left-to-right and reversed the digits in your mind; read from right-to-left and did not need to reverse digits in your mind; a mixture of both; something different.

Procedure
Study visits took place in the Psychology Department at the University of Edinburgh at a time mutually agreed by the participant and the tester. Appointments were available in the morning, afternoon, or evening on both weekdays and weekends, to suit the participant's schedule. All assessments were administered by the same psychology-graduate tester (CF-R) in a quiet room, one-to-one, free of distractions. After reading the information sheet and signing the consent form, participants were administered the demographic and health questionnaire, and then the self-rated memory questionnaire. The testing session took approximately 2.5 to 3 hours to complete. To limit any effects of fatigue on test performance, the test order was counter-balanced. Individuals with even participant ID numbers completed the UK Biobank tests before completing the M-ACE and reference tests. Individuals with odd participant ID numbers completed the M-ACE and reference tests first and then completed the UK Biobank tests. The UK Biobank questionnaire was administered immediately after completing the UK Biobank cognitive assessment. Approximately half-way through the session, participants were given a short, approximately 15-20 minute break (with refreshments), again to try to limit any effects of fatigue. The test order for participants with odd and even ID numbers is shown in S1 File (S2 Table).
We aimed to recruit a subsample of 50 participants to come back approximately four weeks after the first visit and repeat the UK Biobank cognitive assessment for a second time. Participants who indicated on the consent form that they would be willing to return for a second study visit, and who were able to arrange an appointment four weeks (± 1 week) after the first assessment, were invited back to complete the UK Biobank cognitive assessment again. Individuals who agreed to return for a second visit were administered the UK Biobank questionnaire after completing the UK Biobank tests for a second time.

Statistical analyses
All analyses were performed in R version 3.5.2. To examine the association between UK Biobank cognitive tests, general tests, and reference tests with basic demographic characteristics, correlations were calculated between all cognitive tests and age, sex, years of education, and self-reported general health. For all correlational analyses reported in this paper, both Pearson r and Spearman rho correlations were calculated. Point-biserial correlations were calculated for correlations with sex and UKB Prospective Memory, as these are binary variables. Concurrent validities of the UK Biobank tests were calculated by correlating the UK Biobank cognitive tests with the general tests and reference tests. Partial correlations, adjusting for age, were also calculated to determine whether the sizes of the associations between the UK Biobank cognitive tests and the general and reference tests remained after controlling for age. The correlation between scores on WMS-IV Designs I and II was high (r = 0.65, p < .001); therefore, a total score was created by summing the scores on Designs I and Designs II (WMS-IV Designs Total; max score = 240). Similarly, the correlation between WMS-IV VPA I and II was high (r = 0.89, p < .001); therefore, a total VPA score was created by summing the scores on VPA I and VPA II (WMS-IV VPA Total; maximum score = 70). Methods to adjust for multiple testing were not used here. This study was interested in the size of the associations between different cognitive tests, and was less interested in whether these associations were statistically significant.
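The correlational analyses described above (Pearson r, Spearman rho, and age-adjusted partial correlations) can be illustrated with a minimal sketch. The authors worked in R; the Python below is only an illustration using the standard library. The function names and all numbers are hypothetical and are not study data. The partial correlation uses the standard first-order formula for two variables with one covariate (here, age).

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def partial_corr(x, y, z):
    """Correlation between x and y after adjusting both for covariate z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Made-up scores for eight hypothetical participants (NOT study data).
age = [45, 52, 58, 63, 67, 71, 74, 79]
ukb_symbol_digit = [24, 22, 23, 19, 18, 17, 15, 12]
sdmt = [58, 55, 51, 50, 44, 46, 40, 35]

print("Pearson r:", round(pearson(ukb_symbol_digit, sdmt), 2))
print("Spearman rho:", round(spearman(ukb_symbol_digit, sdmt), 2))
print("Age-adjusted r:", round(partial_corr(ukb_symbol_digit, sdmt, age), 2))
```

A point-biserial correlation, as used for the binary variables, is the same Pearson calculation with one variable coded 0/1.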
To investigate whether a measure of general cognitive ability created using UK Biobank cognitive tests was highly correlated with a measure of general cognitive ability created using some of the well-validated reference tests, three measures of general cognitive ability were created using the following combinations of tests: 1) all UK Biobank tests included in the enhanced assessment administered at the imaging study; 2) the UK Biobank baseline tests; and 3) a selection of the reference tests. Some of the cognitive tests have multiple parts that are highly correlated (e.g., TMT part A and TMT part B; DLRT Simple RT and DLRT Choice RT). Only one score from each cognitive test was used to create each measure of general cognitive ability to ensure that highly correlated parts of tests did not overly influence the general cognitive ability score. For DLRT, the Choice RT part was chosen as this is a more cognitively challenging task than Simple RT. For TMT and UKB TMT, part B was used because TMT is thought to be a test of executive function (specifically switching ability), and part B is thought to assess this switching ability, whereas part A is often thought to assess processing speed.
Each of the measures of general cognitive ability was created by entering cognitive test scores into a PCA, checking the eigenvalues and scree plots, and saving the scores on the first unrotated principal component. Before cognitive test scores were entered into the PCA, test distributions (S1 File, S1 Fig) were inspected and, where possible, scores with non-normal distributions were transformed. The specific transformations performed are described below. To reduce the influence of any outliers, test scores were winsorized to 3 SD.
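The procedure of winsorizing scores, running a PCA, and saving scores on the first unrotated principal component can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' R code: the function names and toy data are hypothetical, the PCA is assumed to be run on the correlation matrix of standardized scores, and the leading component is found by power iteration rather than a full eigendecomposition, so it returns only the first component and does not produce a scree plot.

```python
from math import sqrt
from statistics import mean, stdev

def winsorize_3sd(values, k=3.0):
    """Clamp values lying more than k sample SDs from the mean to that limit."""
    m, s = mean(values), stdev(values)
    lo, hi = m - k * s, m + k * s
    return [min(max(v, lo), hi) for v in values]

def standardize(values):
    """Convert scores to z-scores (mean 0, sample SD 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def first_principal_component(columns, iterations=500):
    """Return (weights, proportion of variance, component scores).

    `columns` holds one list of scores per test. The sign of the component
    is arbitrary. Power iteration assumes the starting vector is not
    orthogonal to the leading eigenvector.
    """
    z = [standardize(winsorize_3sd(c)) for c in columns]
    n, p = len(z[0]), len(z)
    # Correlation matrix of the standardized tests.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    w = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        v = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Rayleigh quotient gives the eigenvalue of the leading component.
    eigenvalue = sum(w[i] * sum(corr[i][j] * w[j] for j in range(p))
                     for i in range(p))
    scores = [sum(w[j] * z[j][i] for j in range(p)) for i in range(n)]
    return w, eigenvalue / p, scores

# Made-up scores for eight hypothetical participants on three tests.
toy_tests = [
    [12, 15, 14, 18, 20, 17, 22, 25],
    [5, 6, 8, 7, 9, 10, 9, 12],
    [20, 18, 24, 22, 27, 25, 30, 29],
]
weights, variance_explained, g_scores = first_principal_component(toy_tests)
print("Proportion of variance on first component:", round(variance_explained, 2))
```

Dividing the leading eigenvalue by the number of tests gives the proportion of total variance accounted for, which is the quantity reported for each g measure.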

General cognitive ability-Using 11 reference tests
Scores on the following reference tests were entered into a PCA: TMT part B (log-transformed), SDMT, WMS-IV Designs Total, WAIS-IV Digit Span Total (created by summing scores on WAIS-IV Digit Span Forward and Digit Span Backward), D-KEFS Tower Test, DLRT Choice RT (log-transformed), NIH Toolbox Picture Vocabulary, NART (scores were reversed and log-transformed), PPVT (scores were reversed and log-transformed), WMS-IV VPA Total, and COGNITO Matrices score. This measure of general cognitive ability was designed to reflect a g component created using well-validated and comprehensive cognitive tests that have been viewed as the 'gold standard' cognitive measures. As the RMBM Appointments test is brief and contains only 2 items, this test was not included in this comprehensive measure of general cognitive ability. Eigenvalues and scree plot (S1 File, S2 Fig) indicated two components. These two components accounted for 56% of the variance in the 11 reference cognitive tests. Test loadings (unrotated and rotated using oblique rotation) are shown in S1 File (S3 Table). Inspection of the rotated loadings suggests that the first component reflects processing speed. The tests which load most highly on this component include TMT part B (-0.83), SDMT (0.82), WMS-IV Designs Total (0.73), and DLRT Choice RT (-0.71). Non-speeded verbal tests load highly on the second component. The loadings for the NART, NIH Toolbox Picture Vocabulary, and PPVT were 0.91, 0.90, and 0.84, respectively. Scores on the first unrotated principal component, which accounted for 35% of the total variance, were saved and used as a measure of general cognitive ability (g:reference-11).

General cognitive ability-Using 11 UK Biobank cognitive tests
Scores on the following tests were entered into a PCA: UKB Pairs Matching (log (x+1) transformed), UKB RT (log-transformed), UKB Prospective Memory, UKB Fluid IQ, UKB Numeric Memory, UKB TMT part B (log-transformed), UKB Symbol Digit, UKB Picture Vocabulary, UKB Paired Associate Learning (UKB PAL), UKB Tower Test, and UKB Matrices. Eigenvalues and scree plot (S1 File, S3 Fig) indicated two components. These two components accounted for 46% of the variance in the 11 UK Biobank cognitive tests. Test loadings (unrotated and rotated using oblique rotation) are shown in S1 File (S4 Table). Like the results from the PCA for the reference tests, the first rotated component from the UKB tests appears to reflect processing speed, whereas the second component reflects non-speeded and verbal abilities. Examining the rotated loadings, tests that load highly on the first component include UKB TMT part B (-0.82), UKB Symbol Digit (0.80), and UKB Tower Test (0.71). UKB Picture Vocabulary loads highly on the second component (0.91). UKB PAL (0.48) and UKB Fluid IQ (0.45), two verbal tests, also load moderately highly on the second component. Scores on the first unrotated principal component, which accounted for 34% of the total variance, were saved and used as a measure of general cognitive ability (g:UKB-11).
The results of the PCA of 11 reference tests and the PCA of 11 UK Biobank tests were generally similar, with the first rotated component reflecting processing speed and the second reflecting non-speeded, verbal ability. The first unrotated principal component (i.e., g) also accounted for a similar proportion of the variance in the test scores (35% and 34% for the reference and the UK Biobank tests, respectively). Because many of the tests used to create these g components were speeded tests, the measures of general cognitive ability we have created largely measure speeded/fluid cognitive abilities. We also note that only one vocabulary test was used to create g:UKB-11, whereas three were used in the creation of g:reference-11. This discrepancy is likely to be an important reason why the loading on the first unrotated principal component for UKB Picture Vocabulary (0.19) is lower than for NIH Toolbox Picture Vocabulary (0.51).
General cognitive ability-Using 5 UK Biobank cognitive tests. Scores on the following tests were entered into a PCA: UKB Pairs Matching (log (x+1) transformed), UKB RT (log-transformed), UKB Prospective Memory, UKB Fluid IQ, and UKB Numeric Memory. Eigenvalues and scree plot (S1 File, S4 Fig) indicated one component. This component accounted for 38% of the total variance in the 5 tests. The test loadings are reported in S1 File (S5 Table). Scores on this unrotated principal component were saved and used as a measure of general cognitive ability (g:UKB-5).
Correlations between the three measures of general cognitive ability were calculated. The correlations and age-adjusted correlations between g:reference-11 and each of the UK Biobank cognitive tests were calculated. We also calculated the correlations and age-adjusted correlations between g:UKB-11 and g:UKB-5 with each of the general and reference tests.
To investigate whether participants thought the instructions for the UK Biobank cognitive tests were clear, the number and percentage of participants who answered 'no' to "Did you generally find that the instructions for all the tasks were clear?" were calculated. Next, the number and percentage of participants who reported 'no' when asked whether the instructions for each individual UK Biobank test were clear were calculated.
The number and percentage of participants who reported carrying out a forward digit span, a backward digit span, a mixture of both, or something else when completing UKB Numeric Memory were calculated. Between-group analysis of variance (ANOVA) was used to determine whether mean performance on the UKB Numeric Memory test differed by technique reported.
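As an illustration of the between-group ANOVA on UKB Numeric Memory scores, a minimal sketch of the F statistic is given here. It is written in Python with the standard library only, whereas the authors used R. The group labels follow the four response options, but the function name and all scores are made up for illustration. The sketch computes F and its degrees of freedom; obtaining a p-value additionally requires the F distribution, which the standard library does not provide.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way between-group ANOVA.

    `groups` is a list of lists, one list of scores per group.
    """
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((s - m) ** 2
                    for g, m in zip(groups, group_means) for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up Numeric Memory scores by self-reported technique (NOT study data).
scores_by_technique = {
    "backward (reversed mentally)": [6, 7, 6, 8, 7],
    "forward (read right-to-left)": [7, 8, 8, 9, 7],
    "mixture of both": [6, 7, 8, 7],
    "something different": [5, 7, 6],
}
f_stat, df_b, df_w = one_way_anova_f(list(scores_by_technique.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```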
To measure the short-term stability of the UK Biobank tests, Pearson and Spearman test-retest correlations were calculated between scores on the UK Biobank tests at Time 1 and Time 2.

Results
Participant characteristics are reported in S1 File (S6 Table). A total of 160 participants (mean age = 62.59, SD = 10.24) completed the full assessment at Time 1. Of these, 52 participants (mean age = 61.69, SD = 9.70) returned and repeated the UK Biobank tests at Time 2. The mean time to repeat was 28.88 days (SD = 2.02, range = 26 to 36). The sample used here was relatively highly educated (mean years of full-time education = 16.19, SD = 2.73), and most participants reported their health to be very good (n = 85; 53.1%) or excellent (n = 36; 22.5%). Extended descriptive statistics (n, mean, SD and range) for each of the cognitive tests administered in this study are reported in S1 File (S7 Table).
For all correlations reported throughout this report, both Pearson correlations and Spearman rank-order correlations were calculated. These correlations tended to be very similar; therefore, only the Pearson correlations are reported in the main text. Spearman correlations are reported in the supplementary materials. All correlations carried out for this study, and their exact p-values, are reported in S1 Table (Pearson correlations; S8 Table) and S2 Table (Spearman rank-order correlations; S9 Table).

Correlations between cognitive test scores and demographic and health variables
The Pearson correlations between each of the UK Biobank tests and age are shown in Table 2. Note that UKB RT, UKB TMT part A, and UKB TMT part B measure response times. On these tests, higher scores indicate that the participants took longer to complete the tests and therefore higher scores reflect poorer performance. The score for UKB Pairs Matching is the number of errors made matching all of the cards and, therefore, a higher score also reflects poorer performance. For all other UK Biobank tests, higher scores indicate better performance. All UK Biobank tests correlated significantly with age. In all but one, older individuals performed more poorly (absolute r = 0.16 to 0.60, p ≤ .040). The exception was UKB Picture Vocabulary, on which older participants performed better than younger participants (r = 0.18, p = .022). The strongest age associations were seen for tests measuring processing speed. Older adults tended to have lower scores on UKB Symbol Digit (r = -0.60, p < .001), and were slower on UKB TMT part A (r = 0.58, p < .001) and UKB TMT part B (r = 0.57, p < .001). UKB Pairs Matching (r = 0.34), UKB Tower Test (r = -0.45), and UKB Matrices (r = -0.47) also had absolute correlations of > 0.3 (for all, p < .001) with age.
The Pearson and Spearman rank-order correlations between all cognitive tests and age, sex, years of education, and general health are shown in S1 File (S10 and S11 Tables). Male participants had lower scores than female participants on UKB PAL (r = -0.28, p < .001), but higher scores on the UKB Tower Test (r = 0.19, p = .018) and UKB Matrices (r = 0.17, p = .034). Individuals with more years of education were quicker on UKB TMT parts A (r = -0.18, p = .024) and B (r = -0.21, p = .007), and scored higher on UKB Picture Vocabulary (r = 0.29, p < .001) and UKB Matrices (r = 0.32, p < .001). None of the UK Biobank tests were associated with general health.

Associations with general tests
The Pearson correlations between the UK Biobank tests and the general tests are reported in Table 2. The Spearman rank-order correlations are reported in S1 File (S12 Table). Reporting poorer self-rated memory was associated with more errors on UKB Pairs Matching (r = 0.18, p Except for UKB RT, higher scores on the M-ACE were associated with better performance on all UK Biobank cognitive tests, with absolute effect sizes of 0.2 or higher. The M-ACE was most strongly correlated with performance on UKB PAL (r = 0.47, p < .001). UKB Picture Vocabulary (r = 0.38), UKB Fluid IQ (r = 0.35), UKB Numeric Memory (r = 0.34), and UKB Pairs Matching (r = -0.31) also correlated moderately with the M-ACE (for all, p < .001).

Associations with reference tests
The Pearson correlations between the UK Biobank cognitive tests and the reference tests are reported in Table 2. The Spearman rank-order correlations are reported in S1 File (S12 Table). The correlations highlighted in bold in Table 2 (and S1 File, S12 Table) reflect the correlations between the UK Biobank test and the chosen reference test(s) which was judged to be assessing the same cognitive capability or domain. In going through each UK Biobank cognitive test's results below we first describe the correlation with the respective reference test(s), and then we highlight correlations with 'non-reference' tests that have absolute effect sizes greater than 0.3.

UKB Pairs Matching.
Better performance on UKB Pairs Matching was associated with higher scores on WMS-IV Designs Total, the reference test for UKB Pairs Matching (r = -0.33, p < .001). Better performance on the UKB Pairs Matching test was also moderately associated with better performance on D-KEFS Tower Test (r = -0.40, p < .001) and COGNITO Matrices (r = -0.38, p < .001).
UKB RT. The DLRT Simple and Choice RT were chosen as reference tests for UKB RT. Slower response on UKB RT was associated with slower responses on DLRT Simple RT (r = 0.52, p < .001) and DLRT Choice RT (r = 0.43, p < .001). The SDMT, another measure of processing speed, also correlated moderately with UKB RT such that individuals who had quicker responses on UKB RT scored higher, and were therefore quicker, on the SDMT (r = -0.35, p < .001).
UKB Prospective Memory. There was a small, positive correlation between UKB Prospective Memory and the chosen reference test, RMBM Appointments (r = 0.22, p = .005). All other reference tests, except NART and DLRT Simple RT, had stronger correlations with UKB Prospective Memory than that reported between UKB Prospective Memory and RMBM Appointments. Correctly answering the UKB Prospective Memory test on the first attempt was most strongly associated with higher scores on COGNITO Matrices
Partial correlations adjusting for age. The age-adjusted Pearson and Spearman correlations between the UK Biobank tests and the general and reference tests are reported in S1 File (S13 Table for the age-adjusted Pearson correlations; S14 Table for the age-adjusted Spearman correlations). After controlling for age, the correlations between self-rated memory and the UK Biobank tests were no longer significant, with one exception: the correlation between self-rated memory and UKB TMT part A remained significant, though reduced in size (r = 0.31; age-adjusted r = 0.17). All correlations between M-ACE and the UK Biobank tests were smaller when adjusting for age, though most remained significant; the exception was the correlation between M-ACE and UKB Picture Vocabulary, which became stronger (r = 0.38; age-adjusted r = 0.43). The correlations between UKB Tower Test (r = 0.20; age-adjusted r = 0.12) and UKB Matrices (r = 0.22; age-adjusted r = 0.14) with the M-ACE were no longer significant when adjusting for age.
Generally, the age-adjusted correlations between the UK Biobank tests and the reference tests tended to be smaller than the raw correlations, though the difference between the raw correlations and the age-adjusted correlations was small. The largest differences were seen for the correlations between the following UK Biobank tests and their respective reference tests: UKB TMT part B (r = 0.66; age-adjusted r = 0.55), UKB Symbol Digit (r = 0.64; age-adjusted r = 0.45), UKB Pairs Matching (r = -0.33; age-adjusted r = -0.19), and UKB TMT part A (r = 0.44; age-adjusted r = 0.24). For all other correlations between UK Biobank cognitive tests and the reference tests, the change in the strength of the correlation between the raw correlations and the age-adjusted correlations was ≤ 0.07.

Measures of general cognitive ability
The Pearson correlation between a measure of general cognitive ability created using 11 well-validated reference tests (g:reference-11) and a measure created using 11 UK Biobank tests (g:UKB-11) was r = 0.83 (p < .001; age-adjusted r = 0.79, p < .001). The correlation was similar when re-run using a measure of general cognitive ability that was created excluding scores on the COGNITO Matrices and NIH Toolbox Picture Vocabulary tests, which share items with UKB Matrices and UKB Picture Vocabulary (see S1 File). The correlation between g:reference-11 and a measure of general cognitive ability created using the five UK Biobank baseline tests (g:UKB-5) was r = 0.74 (p < .001; age-adjusted r = 0.69, p < .001).
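The general cognitive ability measures were created by saving scores on the first principal component of each test battery. The sketch below illustrates that idea on simulated data only. It is an assumption-laden illustration, not the authors' procedure: the function names, the simulated loadings (0.8) and the use of power iteration to obtain the first component of the correlation matrix are all choices made here for a dependency-free example.

```python
import math
import random

def standardize(values):
    """Convert raw scores to z-scores using the sample SD."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Pearson correlation computed from z-scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def first_component(rows, iterations=200):
    """Return (weights, scores) for the first principal component.

    rows: one list of test scores per participant.
    Power iteration on the correlation matrix is used here for simplicity.
    """
    cols = [standardize(list(c)) for c in zip(*rows)]
    k, n = len(cols), len(rows)
    corr = [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    w = [1 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(w[j] * cols[j][p] for j in range(k)) for p in range(n)]
    return w, scores

def simulate_battery(latent, n_tests, rng):
    """Hypothetical battery: each test = 0.8 * latent ability + noise."""
    return [[0.8 * g + 0.6 * rng.gauss(0, 1) for _ in range(n_tests)]
            for g in latent]

rng = random.Random(0)
latent_g = [rng.gauss(0, 1) for _ in range(200)]   # simulated participants
battery_a = simulate_battery(latent_g, 5, rng)     # stands in for one battery
battery_b = simulate_battery(latent_g, 5, rng)     # stands in for another

weights_a, g_a = first_component(battery_a)
weights_b, g_b = first_component(battery_b)
r_between = pearson(g_a, g_b)
print(round(r_between, 2))
```

In real data, tests on which lower scores indicate better performance (for example, reaction times) would load with the opposite sign, so the direction of the component should be checked before interpreting it.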
Correlations between g:reference-11 and UK Biobank tests. Pearson correlations and age-adjusted Pearson correlations between g:reference-11 and each of the UK Biobank tests are reported in Table 3 (Spearman rank-order correlations are reported in S1 File, S16 Table). All UK Biobank cognitive tests correlated with this measure of general cognitive ability, such that higher scores on g:reference-11 were associated with better performance on the UK Biobank cognitive tests (for all, p < .001). UKB RT had the weakest correlation with g:reference-11 (r = -0.29, p < .001), whereas UKB TMT part B had the strongest correlation (r = -0.62, p < .001). Other UK Biobank tests which correlated positively with general cognitive ability at > 0.5 were UKB Matrices (r = 0.58), UKB Fluid IQ (r = 0.55), UKB Numeric Memory (r = 0.55), UKB Symbol Digit (r = 0.54), and UKB Tower Test (r = 0.52) (for all, p < .001). The age-adjusted Pearson correlations between g:reference-11 and UK Biobank tests (Table 3) tended to be weaker than the raw correlations, except for the correlation between g:reference-11 and UKB Picture Vocabulary, which became stronger (raw r = 0.43, p < .001; age-adjusted r = 0.58, p < .001).
Correlations between g:UKB-11 and g:UKB-5 with the general and reference tests. Pearson correlations and age-adjusted Pearson correlations between general cognitive ability created using the UK Biobank tests (g:UKB-11 and g:UKB-5) and the general tests and reference tests are shown in Table 4 (Spearman rank-order correlations are reported in S1 File, S17 Table). Higher scores on g:UKB-11 were associated with better performance on all the general and reference tests, except for self-rated memory and NART which were not significantly associated with g:UKB-11. Higher g:UKB-11 score was most strongly related to better performance on SDMT (r = 0.68), TMT part B (r = -0.64), and WMS-IV Designs Total (r = 0.63) (for all, p < .001).
When adjusting for age, some of the associations between g:UKB-11 and the reference tests reduced in strength (e.g., correlations with tests of speed, executive function, and reasoning), whereas others became stronger (e.g., correlations with vocabulary tests, RMBM Appointments, and WAIS-IV Digit Span). However, when adjusting for age, all tests except self-rated memory were associated with g:UKB-11 such that a better g:UKB-11 score was associated with better test scores on the general and reference tests. Whereas there was no association between g:UKB-11 and the NART when calculating raw correlations (r = 0.10, p > .05), there was a moderate and positive association between g:UKB-11 and the NART when adjusting for age (age-adjusted r = 0.35, p < .001).
Higher scores on g:UKB-5 were also associated with better performance on the general and reference tests, except the NART, which was not associated with g:UKB-5. Again, the association between g:UKB-5 and the NART became significant when adjusting for age (r = 0.31, p < .001). Generally, the correlations seen between g:UKB-5 and the reference tests were lower than those seen between g:UKB-11 and the reference tests.

UK Biobank questionnaire
Clear test instructions. The number and percentage of participants who thought the UK Biobank tests were unclear is reported in S1 File (S18 Table). A total of 8 (5.5%) participants reported that they thought the UK Biobank test instructions in general were unclear. Nearly one quarter of participants (n = 35, 24.1%) reported that they thought the instructions for the UKB Tower Test were not clear. Participants generally thought the instructions for UKB RT, UKB Picture Vocabulary and UKB Matrices were clear. Only 3 (2.1%), 2 (1.4%), and 1 (0.7%) participants, respectively, reported that the instructions for these tests were not clear.
UKB Numeric Memory technique. Of the 141 individuals who were asked about the technique used to complete UKB Numeric Memory, only 20 (14.2%) reported that they performed the test as a backward digit span (e.g., read from left-to-right and reversed the digits in their mind). Most (n = 102; 72.3%) performed a forward digit span (e.g., read from right-to-left and did not reverse the digits in their mind). The remaining participants (n = 19; 13.5%) reported using a mixture of both techniques. Participants who did a backward digit span (mean = 6.70, SD = 1.08) had a slightly lower mean score on UKB Numeric Memory than those who did a forward digit span (mean = 6.92, SD = 1.24) or those who did a mixture of both (mean = 7.11, SD = 1.25). A between-group ANOVA did not reveal any differences in UKB Numeric Memory scores by technique used (F(2, 136) = 0.518, p = .602).
The test-retest reliabilities (Pearson and Spearman rank-order correlations) for each UK Biobank test are reported in Table 5. This table also contains some of the test-retest correlations reported elsewhere for some of the reference tests. UKB Pairs Matching had the lowest Pearson test-retest correlation (r 12 = 0.41, p = .003). Test-retest reliability was high for UKB Picture Vocabulary (r 12 = 0.89, p < .001) and UKB TMT part B (r 12 = 0.78, p < .001). Test-retest correlations for all other UK Biobank tests were moderate (r 12 = 0.43 to 0.61). The test-retest reliabilities found here for the UK Biobank tests tended to be lower than those reported elsewhere for the reference tests. For example, the test-retest reliability for WMS-IV VPA I and VPA II was r 12 = 0.79 and r 12 = 0.81, respectively [10], whereas the test-retest correlation
For UKB Prospective Memory, the value is n (percentage) agreement for whether participants gave the same response (i.e., correct or incorrect on first attempt) at Time 1 and Time 2.
f Test-retest interval mean = 23 days (range 14 to 84 days), n = 244 [10].
g Period-free reliability. Participants completed the DLRT Simple RT and DLRT Choice RT twice immediately one after the other, n = 20 [17].
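The two kinds of test-retest statistic mentioned above (a Pearson correlation between Time 1 and Time 2 scores, and n (percentage) agreement for a one-item test scored correct or incorrect) can be sketched as follows. This is an illustrative example only: the function names and all scores are hypothetical and are not data from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between paired Time 1 and Time 2 scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def agreement(time1, time2):
    """Return (n, percentage) of participants giving the same response twice."""
    same = sum(1 for a, b in zip(time1, time2) if a == b)
    return same, 100 * same / len(time1)

# Hypothetical continuous scores for six participants at two sessions.
scores_t1 = [12, 15, 9, 20, 17, 11]
scores_t2 = [13, 14, 10, 19, 18, 13]
r12 = pearson(scores_t1, scores_t2)

# Hypothetical one-item test: 1 = correct on first attempt, 0 = not.
item_t1 = [1, 1, 0, 1, 0, 1]
item_t2 = [1, 0, 0, 1, 1, 1]
n_same, pct_same = agreement(item_t1, item_t2)

print(round(r12, 2), n_same, round(pct_same, 1))
```

A Pearson correlation is insensitive to a uniform shift in scores, so a practice effect that raises everyone's Time 2 score would need to be checked separately, for example by comparing mean scores across sessions.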

Discussion
Using a sample of 160 middle-aged and older adults, this study investigated the concurrent validity and test-retest reliability of the UK Biobank cognitive tests. This study had three main findings: 1) generally, the UK Biobank tests correlated moderately-to-strongly with well-validated, standard tests designed to assess the same cognitive domain; 2) a measure of general cognitive ability can be created using all of the UK Biobank tests, as well as using only the five UK Biobank baseline tests, and these measures of general cognitive ability are highly correlated with a measure of general cognitive ability created using a battery of standard cognitive measures; 3) most of the UK Biobank tests showed moderate-to-high test-retest reliability, but these reliabilities tended to be lower than those reported elsewhere for the reference tests.

Concurrent validity
Despite the brief and non-standard nature of the UK Biobank cognitive tests, they tended to correlate moderately-to-strongly with well-validated cognitive tests that were designed to assess the same cognitive domain or specific ability. The UK Biobank cognitive tests mostly showed modest to good concurrent validity. Below, we summarise the findings from the concurrent validity analysis; however, when interpreting the concurrent validities, it is important to be aware that the degree of similarity between each of the UK Biobank tests and the chosen reference tests varies. Whereas some of the reference tests use the same items as the UK Biobank tests (e.g., NIH Toolbox Picture Vocabulary and UKB Picture Vocabulary), others are different versions of the same test (e.g., SDMT and UKB Symbol Digit), and others still are different tests that are thought to assess the same underlying cognitive ability (e.g., WMS-IV Designs and UKB Pairs Matching). Thus, some reference tests used here were better 'matches' for the UK Biobank tests than others, and readers should bear this in mind when interpreting the association between each UK Biobank test and its reference test.
The UK Biobank Picture Vocabulary test showed especially good concurrent validity. This test correlated very highly (r = 0.83) with the original version of this test, the NIH Toolbox Picture Vocabulary test, and also with another picture vocabulary test, the PPVT (r = 0.74). In a validation study of the NIH Toolbox [13], the correlation between the NIH Toolbox Picture Vocabulary test and the PPVT was r = 0.78, which is very similar to the correlation found here between the UK Biobank version of this test and the PPVT (r = 0.74). In addition, the UKB Picture Vocabulary test was also found to correlate highly (r = 0.75) with the NART, which is often used as an estimate of crystallised cognitive ability [14,15]. The results from this study suggest that the UK Biobank Picture Vocabulary is a valid measure of crystallised ability and may be used as an estimate of premorbid cognitive functioning.
The UKB TMT part B and UKB Symbol Digit tests, which both correlated at greater than 0.6 with the original, paper-and-pencil versions of these tests [11,16], also showed good concurrent validity. In addition to correlating highly with their reference tests, UKB TMT part A and UKB Symbol Digit also correlated positively with a number of other non-reference tests that also have a speeded component (e.g., DLRT Choice RT) providing additional support that these tests are assessing processing speed. Other UK Biobank tests which showed reasonably good concurrent validity (i.e., correlated relatively highly with the chosen reference test) include the UKB RT, UKB Numeric Memory, UKB TMT part A, UKB PAL, and UKB Matrices. Of note, the UKB RT score, which is created from a mean of only 4 trials, correlated at 0.52 with DLRT Simple RT and at 0.43 with DLRT Choice RT, which are more detailed tests of reaction time created from a mean of 20 trials and 40 trials, respectively [17]. Therefore, despite the brief nature of the UKB RT test, it appears to have relatively good concurrent validity.
UKB Pairs Matching had only a moderate correlation (r = -0.33) with WMS-IV Designs Total, the chosen reference test. The differences between UKB Pairs Matching and WMS-IV Designs may account for this lower correlation. Better performance on UKB Pairs Matching had stronger associations with better performance on D-KEFS Tower Test and COGNITO Matrices than it did with the chosen reference test. D-KEFS Tower Test and COGNITO Matrices are both visuospatial reasoning tests.
UKB Prospective Memory did not correlate highly with the chosen reference test (r with RMBM Appointments = 0.22). Reasons for this low correlation could be that both UKB Prospective Memory and RMBM Appointments are very brief, 1-2 item tests, and a high proportion of participants scored full marks on these tests. For the one-item UKB Prospective Memory test, 69% of participants answered the question correctly on the first attempt. For RMBM Appointments, 59% of participants scored 4/4. Therefore, these tests had limited variance in the relatively healthy sample used here. Correctly answering the UKB Prospective Memory test correlated moderately with other memory tests (e.g., WMS-IV Designs and WMS-IV VPA), as well as tests of executive function (D-KEFS Tower Test) and reasoning (COGNITO Matrices).
Whereas the UKB Tower Test had a moderate positive correlation with the reference test, D-KEFS Tower Test (r = 0.40), it had larger correlations with WMS-IV Designs Total, SDMT, and TMT part B. Like the UKB Tower Test, TMT part B is thought to measure executive function. The correlation with SDMT may reflect the fact that the UKB Tower Test was a timed test (participants were tasked with completing as many Tower trials as possible in 3 minutes) and may therefore be measuring processing speed as well as executive function. The WMS-IV Designs is a measure of visuospatial memory. The UKB Tower Test requires participants to mentally move the hoops on the pegs; therefore, it is likely to also be measuring visuospatial abilities.
We did not include a reference test for UKB Fluid IQ, which was designed to assess fluid ability. A previous study using UK Biobank baseline data [18] found that scores on the UKB Fluid IQ test showed mean values that remained relatively stable between the ages of 40 and 60 years and therefore did not show the age-related decline across the adult lifespan that is the hallmark of fluid ability [2]. Hagenaars et al. [18] suggested that, because of the relative stability in middle-age, UKB Fluid IQ may in fact be measuring a more crystallised ability. In the present study, however, we found that UKB Fluid IQ was negatively correlated with age and that it correlated most strongly (r ≥ 0.38) with tests of working memory (WAIS-IV Digit Span) and non-verbal reasoning (COGNITO Matrices), which are tests thought to assess more fluid abilities [2]. Therefore, this test may be more fluid than was suggested by Hagenaars et al. [18], although it did also have moderate correlations with the three standard vocabulary tests.
The UKB PAL test exhibited negative skew (S1 File, S1 Fig), suggesting that most participants found this test quite easy. Despite the negative skew, scores on the UKB PAL test were found to correlate moderately with the M-ACE (r = 0.47), a brief assessment of global cognitive functioning that is designed to identify individuals who may have cognitive impairment [19]. The UKB PAL test may therefore be useful for identifying individuals in UK Biobank who may have a cognitive impairment.
In addition to correlating relatively highly with the chosen reference tests, most UK Biobank cognitive tests also had positive correlations with many non-reference tests, and they loaded strongly on the general cognitive component. When we write about tests correlating because they both assess the same 'cognitive domain' or 'underlying cognitive ability', it might also be in part or in whole because they both assess general cognitive ability (g). It is an error not to acknowledge this, as Schmidt [20] discusses in detail. However, mindful of the fact that there is variance beyond g that is accounted for at the level of cognitive domains and specific abilities [9], and the fact that readers will wish to know how the largely-undocumented UK Biobank tests relate to better-validated tests, we think the references we have made to domains and specific abilities are appropriate.

General cognitive ability
In the present study, we examined whether measures of general cognitive ability created using the brief, bespoke UK Biobank tests that were administered unsupervised correlated strongly with a measure of general cognitive ability created using well-validated tests administered under standardised conditions. The correlations between general cognitive ability created using well-validated tests and general cognitive ability created using the UK Biobank tests were high (r = 0.83 for a measure created using all 11 UK Biobank tests; r = 0.74 for the 5 baseline UK Biobank tests). The correlations reported here were lower than those reported in one study [5] that found that three measures of general cognitive ability created using three entirely different cognitive test batteries correlated nearly perfectly (r ≥ 0.99). However, the correlations found here are in line with another study [6] that compared five different measures of general cognitive ability and found that they correlated at r ≥ 0.77. This suggests that, despite the brief and non-standard nature of the UK Biobank cognitive assessment, a measure of general cognitive ability can be created using these tests. UKB TMT part B, UKB Matrices, UKB Numeric Memory, and UKB Fluid IQ all correlated at ≥ 0.55 (in absolute value) with the general measure of cognitive ability created using the standardised tests, suggesting these UK Biobank tests load strongly on general cognitive ability.

Reliability
Test-retest correlations for UKB Picture Vocabulary (r 12 = 0.89) and UKB TMT part B (r 12 = 0.78) were high, and comparable to those reported for other, well-validated, measures of picture vocabulary (NIH Toolbox Picture Vocabulary intraclass correlation = 0.81; 95% CI 0.73 to 0.87 [13]; and PPVT r 12 = 0.94 [21]) and for the original paper-and-pencil version of the TMT part B (r 12 = 0.89 [22]). Therefore, UKB Picture Vocabulary and UKB TMT part B show good stability. Good short-term stability of cognitive tests is especially important when examining longitudinal change. Low stability means that any differences in scores over time may not be due to real change in test performance, but due to error of measurement.
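The link between low stability and error of measurement can be made concrete with the classical test theory standard error of measurement, SEM = SD × sqrt(1 − reliability). The sketch below is illustrative only and is not an analysis from the study: it assumes scores standardised to SD = 1, treats the test-retest correlation as the reliability estimate, and assumes equal score variance at both time points. The function names are hypothetical.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM for a test with the given SD and reliability."""
    return sd * math.sqrt(1 - reliability)

def sd_of_difference_from_error(sd: float, reliability: float) -> float:
    """SD of Time 2 minus Time 1 differences expected from measurement error alone."""
    return math.sqrt(2) * standard_error_of_measurement(sd, reliability)

# Assumed SD of 1 (standardised scores); reliabilities are the highest and
# lowest test-retest correlations reported for the UK Biobank tests.
sem_high = standard_error_of_measurement(sd=1.0, reliability=0.89)
sem_low = standard_error_of_measurement(sd=1.0, reliability=0.41)
diff_high = sd_of_difference_from_error(sd=1.0, reliability=0.89)
diff_low = sd_of_difference_from_error(sd=1.0, reliability=0.41)

print(round(sem_high, 2), round(sem_low, 2))
print(round(diff_high, 2), round(diff_low, 2))
```

Under these assumptions, a test with a reliability of 0.41 produces retest differences from error alone that are more than twice as variable as those of a test with a reliability of 0.89, which is why apparent change on low-stability tests should be interpreted cautiously.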
Generally, the test-retest reliability for most of the UK Biobank tests was substantial. UKB RT, UKB Fluid IQ, UKB Numeric Memory, and UKB Symbol Digit had test-retest correlations of greater than 0.5. However, mean performance on the UKB Fluid IQ, UKB RT, and UKB Symbol Digit was found to be significantly higher at Time 2, compared to Time 1, suggesting these tests may be most prone to repeat testing effects. UKB Pairs Matching, UKB Prospective Memory, UKB TMT part A, UKB PAL, UKB Tower Test, and UKB Matrices had modest test-retest correlations (i.e., between 0.4 and 0.5). Although the test-retest correlations for the UK Biobank tests were found to be adequate, they tended to be lower than those reported previously for the reference tests, suggesting that the UK Biobank tests are less stable across time than well-validated tests administered under standardised conditions. The relative brevity of some of the UK Biobank tests might contribute to the lower reliability. However, UKB Matrices uses the same 15 items as COGNITO Matrices, and both are administered via a computer, yet the test-retest correlation for UKB Matrices was r 12 = 0.44, whereas the test-retest correlation reported for the COGNITO Matrices test was higher, at r 12 = 0.70 [23]. It is not clear why the UK Biobank tests have lower test-retest reliability than other measures of cognitive function.
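The suggestion above that test brevity may lower reliability can be illustrated with the Spearman-Brown prophecy formula, which predicts the reliability of a test whose length is changed by a factor k. This is a sketch under a strong assumption (that added or removed items are parallel to the existing ones) and it was not part of the study's analyses. The function name and the input reliability of 0.50 are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical test with a reliability of 0.50.
doubled = spearman_brown(reliability=0.50, length_factor=2.0)   # twice as long
halved = spearman_brown(reliability=0.50, length_factor=0.5)    # half as long

print(round(doubled, 2), round(halved, 2))
```

The formula speaks only to test length. It cannot account for a difference in reliability between two tests that use the same items, as in the UKB Matrices and COGNITO Matrices comparison.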
Using the UK Biobank baseline and repeat data, Lyall et al. [3] investigated the stability of UKB Pairs Matching, UKB RT and UKB Fluid IQ. Like the current study, Lyall et al. [3] found that UKB Pairs Matching had the lowest test-retest reliability. However, the test-retest reliability for UKB Pairs Matching was substantially larger in the current study (r 12 = 0.41) than was reported using UK Biobank data in Lyall et al. [3,18], who reported the test-retest reliability of the UKB Pairs Matching test to be r 12 = 0.19 [3]. The lower test-retest reliability reported in Lyall et al. [3] might be because they used a test-retest interval of over 4 years, which is much longer than the four-week test-retest interval used in the current study. The test-retest correlation reported in Lyall et al. may therefore in part reflect cognitive change over time, in addition to test stability. Despite the differences in the test-retest interval, the current study and the study by Lyall et al. [3] found very similar stability estimates for UKB RT (Lyall et al. r 12 = 0.54 [3]; present study r 12 = 0.55) and UKB Fluid IQ (Lyall et al. r 12 = 0.65 [3]; present study r 12 = 0.61), suggesting these tests do show relatively good stability.

Other psychometric considerations in some UK Biobank cognitive tests
The UKB Numeric Memory test was designed as a backward digit span task to assess working memory, the ability to temporarily store information in short-term memory long enough to manipulate it [15]. Backward digit span tasks require individuals to both remember a sequence of numbers and mentally reverse them, and this differs from a forward digit span task where participants are only required to remember a sequence of digits [15]. Despite the fact that the UKB Numeric Memory test was designed to assess backward digit span, we found that only 14.2% of the sample tested in the current study reported performing a backward digit span. All other participants reported that they either carried out a forward digit span (72.3%), or they used a mixture of both techniques (13.5%). This means that, for the majority of participants, this test is not assessing the type of mental performance that it was intended to assess.
This study also found that nearly one-quarter of participants reported that they thought the test instructions for the UKB Tower Test were unclear. Given that the UK Biobank tests are administered unsupervised, and participants are expected to sit at a computer in a UK Biobank clinic and work through these tests independently, it is important that the test instructions are clear and the participant knows exactly what to do before starting the test proper. The UKB Tower Test had several pages of instructions. The length of the test instructions might be an important contributor to why participants reported that the test instructions for UKB Tower Test were unclear. Other UK Biobank tests with lengthy instructions, including UKB Symbol Digit (9.7%) and UKB TMT (8.3%) also had higher percentages of participants reporting that the test instructions for these tests were not clear, whereas tests with relatively short instructions, such as UKB Matrices (0.7%) and UKB Picture Vocabulary (1.4%) tended to have very few participants reporting that they thought the instructions were not clear. All of these tests, however, had practice examples which should allow participants to see what is involved before starting the task proper, even if they did not fully understand the test instructions before starting the practice trials.

Advantages and limitations
The main advantage of this study is that the fully-automated UK Biobank cognitive assessment was compared to a large number of well-validated, standard cognitive tests that were administered under standardised conditions. This meant that the brief and non-standard UK Biobank cognitive tests were compared to what many would consider to be the 'gold standard' measures of cognitive ability. For the current study, UK Biobank provided us with a standalone version of the UK Biobank cognitive assessment that is currently being administered at the UK Biobank imaging study. UK Biobank also provided us with a UK Biobank button box to be used for the UKB RT test and with details about the computing equipment used at the UK Biobank clinic assessments, which enabled us to very closely mimic the UK Biobank clinic cognitive assessment.
There are some limitations to the current study. The sample size is relatively small, especially for the test-retest sample. The testing conditions in the current study were not identical to the testing conditions used during the UK Biobank clinic assessments. In the current study, participants were assessed individually in a quiet room, free of distraction. The UK Biobank assessment centre could be busy and sometimes noisy (CF-R and IJD both spent a day at one UK Biobank testing centre during people's imaging visits). In the current study, the UK Biobank tests were administered in a more usual and standardised psychological testing environment. It is not clear whether the reliability and validity reported in the current study would differ if the UK Biobank tests had been administered in a busy and sometimes noisy environment such as that seen when the authors visited the UK Biobank testing centre. In addition to the cognitive assessment administered at the UK Biobank assessment centre, UK Biobank have also collected cognitive data using web-based assessments. For the web-based assessment, participants were sent a link via email and were asked to complete the cognitive tests at home. The testing conditions of the web-based assessment, therefore, were even less controlled than at the UK Biobank assessment centre. We do not know whether the results of the current study would generalise to the UK Biobank web-based assessment.
This study only examined some aspects of the validity and reliability of the UK Biobank tests. We did not examine, for example, their internal consistency or predictive validity for other 'real-world' outcomes. Another limitation is that the sample used in the current study was relatively highly educated. The mean years of full-time education in the current sample was 16.19 years. However, UK Biobank participants, especially repeat study samples, were also highly educated. At baseline, 18% of participants reported having a college or university degree. Data collection for the imaging study is still ongoing; however, almost half (48%) of participants who have attended so far report having a college or university degree (http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=6138). Because the samples used here (and in UK Biobank) mostly consist of relatively highly educated individuals, it is likely that the range of cognitive test scores found here is not representative of the range of cognitive test scores that would be identified in the entire population. Therefore, the correlations reported here between the UK Biobank tests and the reference tests may be attenuated, compared to those that would be found in a sample more representative of the general population.
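The attenuation argument above can be illustrated with the standard correction for direct range restriction (often called Thorndike's Case II). This is a sketch only, and no such correction was applied in the study: the function name is hypothetical, the ratio of population SD to sample SD (1.25) is an invented value, and the formula assumes restriction acted directly on one of the two correlated variables, which may not hold for selection on education.

```python
import math

def correct_for_range_restriction(r_restricted: float, sd_ratio: float) -> float:
    """Estimate the unrestricted correlation from a range-restricted one.

    sd_ratio: unrestricted SD divided by restricted SD (values > 1 mean the
    sample is less variable than the population).
    """
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(1 - r_restricted ** 2 + (r_restricted * sd_ratio) ** 2)
    return numerator / denominator

# 0.53 is the mean validity correlation reported in this study;
# the SD ratio of 1.25 is purely hypothetical.
estimated = correct_for_range_restriction(r_restricted=0.53, sd_ratio=1.25)
unchanged = correct_for_range_restriction(r_restricted=0.53, sd_ratio=1.0)

print(round(estimated, 2), round(unchanged, 2))
```

With an SD ratio of 1 the correlation is returned unchanged, and larger ratios yield larger estimated correlations, which is the direction of the attenuation described above.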

Conclusions
This study examined the concurrent validity and test-retest reliability of the enhanced UK Biobank cognitive assessment that is currently being administered to UK Biobank participants attending the UK Biobank imaging study. The UK Biobank cognitive tests are administered using a fully-automated touch-screen assessment, and participants complete these tests unsupervised. The tests in UK Biobank tend to be short. They were created specifically for UK Biobank, or were adapted for use in a fully-automated assessment. The present study found that they showed a range of concurrent validity coefficients with well-validated, standard tests of cognitive ability, and most tests tended to have moderate-to-good test-retest reliability. UK Biobank is one of the largest and most detailed health resources available worldwide. This paper provides currently-lacking information on the psychometric properties of the UK Biobank cognitive tests. Researchers wishing to use the UK Biobank cognitive data should consider analysing cognitive test data from those tests which have been found here to have both moderate-to-high concurrent validity and short-term stability.
Supporting information
S1 File. Supplementary materials for Reliability and validity of the UK Biobank cognitive tests. (PDF)
S1 Table. Pearson correlations (below the diagonal) and age-adjusted Pearson correlations (above the diagonal) between all cognitive tests and demographic variables. (XLSX)
S2 Table. Spearman rank-order correlations (below the diagonal) and age-adjusted Spearman rank-order correlations (above the diagonal) between all cognitive tests and demographic variables.