Agreement between Computerized and Human Assessment of Performance on the Ruff Figural Fluency Test

The Ruff Figural Fluency Test (RFFT) is a sensitive test for nonverbal fluency suitable for all age groups. However, assessment of performance on the RFFT is time-consuming and may be affected by interrater differences. Therefore, we developed computer software specifically designed to analyze performance on the RFFT by automated pattern recognition. The aim of this study was to compare assessment by the new software with conventional assessment by human raters. The software was developed using data from the Lifelines Cohort Study and validated in an independent cohort of the Prevention of Renal and Vascular End Stage Disease (PREVEND) study. The total study population included 1,761 persons: 54% men; mean age (SD), 58 (10) years. All RFFT protocols were assessed by the new software and two independent human raters (criterion standard). The mean number of unique designs (SD) was 81 (29) and the median number of perseverative errors (interquartile range) was 9 (4 to 16). The intraclass correlation coefficient (ICC) between the computerized and human assessment was 0.994 (95%CI, 0.988 to 0.996; p<0.001) and 0.991 (95%CI, 0.990 to 0.991; p<0.001) for the number of unique designs and perseverative errors, respectively. The mean difference (SD) between the computerized and human assessment was -1.42 (2.78) and +0.02 (1.94) points for the number of unique designs and perseverative errors, respectively. This was comparable to the agreement between two independent human assessments: ICC, 0.995 (0.994 to 0.995; p<0.001) and 0.985 (0.982 to 0.988; p<0.001), and mean difference (SD), -0.44 (2.98) and +0.56 (2.36) points for the number of unique designs and perseverative errors, respectively. We conclude that the agreement between the computerized and human assessment was very high and comparable to the agreement between two independent human assessments. Therefore, the software is an accurate tool for the assessment of performance on the RFFT.



Introduction
Cognitive decline is a common chronic condition in old age. Worldwide, an estimated 36 million people live with dementia, and this number is expected to double every twenty years to approximately 115 million in 2050 [1,2]. It is generally believed that dementia is the result of a long-term pathological process that spans at least two to three decades. This is supported by the recent finding that cognitive decline is already evident at the age of 45 years [3]. Therefore, cognitive decline is an important outcome in life course epidemiology and prospective cohort studies. However, few cognitive tests are sensitive to cognitive changes across the life span, from young adulthood to old age.
The Ruff Figural Fluency Test (RFFT) is a sensitive cognitive test for changes in nonverbal fluency suitable for all age groups [4][5][6]. The test measures the ability to draw as many unique designs as possible within a set time period. Performance on the RFFT is associated with various biological characteristics, such as frontal gray matter volume in Alzheimer's disease and right frontal delta magnitude on quantitative electroencephalography [7,8]. The test provides insight into many different cognitive abilities, ranging from initiation and planning to divergent reasoning and mental flexibility [4,5]. These characteristics, together with the limited time required to administer the test, make the RFFT a useful outcome measure for cognitive function. Therefore, the RFFT was introduced as a cognitive test in the Lifelines Cohort Study, which included 167,729 participants from the general population [9]. However, the assessment of performance on the RFFT is time-consuming, as the number of unique designs can be large and some designs can be complex and highly similar. These characteristics of the RFFT probably also increase the chance of errors and differences between raters. Moreover, assessment of a neuropsychological test such as the RFFT requires expert knowledge, and human raters have to be trained and supervised by a qualified neuropsychologist. This can be a burden on resources. To overcome these problems and to improve the usability of the RFFT in large sample studies, we developed a dedicated software program specifically designed to analyze performance on the RFFT by automated pattern recognition.
The aim of this study was to compare assessment of performance on the RFFT by the new software program with conventional assessment by human raters. The total study population included 1,761 community-dwelling persons aged 40-87 years. All RFFT protocols were assessed by the new software and two independent human raters.

Study population
The study population included 1,761 participants of the fifth survey of the Prevention of Renal and Vascular ENd-stage Disease study (PREVEND) who performed the RFFT. The PREVEND study was initiated in 1997 in the city of Groningen, the Netherlands, and designed to investigate prospectively the natural course of (micro)albuminuria and its relation to renal and cardiovascular disease in the general population [10,11]. The fifth survey of PREVEND was performed from 2009 to 2012.

Ethics statement
The PREVEND study has been approved by the Medical Ethical Committee (METc) of the University Medical Center Groningen, Groningen, the Netherlands, and was conducted in accordance with the guidelines of the Declaration of Helsinki. Written informed consent was obtained from all participants. The authors MFE, MEAVE and GJI were involved with the collection of the data and had access to identifying information. The data were anonymized prior to analysis.

Ruff Figural Fluency Test
As described previously [6], the Ruff Figural Fluency Test (RFFT) [4,5] is a measure of nonverbal fluency consisting of five parts [5,12]. All parts (1 to 5) consist of 35 five-dot patterns arranged in seven rows and five columns on an 8.5 x 11" sheet of paper. However, the stimulus pattern differs between the parts (Fig 1). In part 1, the five-dot pattern forms a regular pentagon. In parts 2 and 3, the five-dot pattern of part 1 is repeated but includes various distractors: diamonds in part 2, and lines in part 3. In parts 4 and 5, the five-dot pattern is a variation of the pattern of part 1 and these parts do not contain distracting elements. In each part, the task is to draw as many unique designs as possible within one minute by connecting the dots in a different pattern. Repetitions of designs are scored as perseverative errors. Performance on the RFFT is expressed as the total number of unique designs (the sum of all five parts) and the total number of perseverative errors [5,12].
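The scoring of a completed protocol, summing unique designs and perseverative errors over the five parts, can be sketched as follows. This is an illustrative Python sketch, not the study's software; the representation of each drawn design by an identifier, and the assumption that repetitions are counted within each part (each part being administered and scored separately), are ours.

```python
from collections import Counter

def score_rfft(parts):
    """Tally RFFT performance from the designs drawn in each of the five parts.

    `parts` is a list of five lists; each inner list holds an identifier for
    every design the respondent drew in that part. Assumption: repetitions
    are counted within each part, since each part is scored separately.
    """
    unique_total = 0
    perseverative_total = 0
    for drawn in parts:
        counts = Counter(drawn)
        unique_total += len(counts)                             # distinct designs
        perseverative_total += sum(n - 1 for n in counts.values())  # repetitions
    return unique_total, perseverative_total
```

For example, a respondent who repeats one design once in part 1 and one design twice in part 3 accrues three perseverative errors in total, while each repeated design still counts only once toward the unique-design total.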

Human assessment
Performance on the RFFT was analyzed independently by two trained raters (referred to as rater 1 and rater 2). The analysis was repeated by a third independent rater (rater 3) if the number of unique designs or perseverative errors as analyzed by the first two raters differed by more than two points in one part or more than four points in total. For each participant, the RFFT scores of the two most concordant raters were then averaged. All raters were undergraduate students aged 18 to 22 years. RFFT protocols were analyzed by different subsets of raters. The human assessment was defined as the criterion standard.
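The decision rule and averaging step above can be sketched in Python. This is an illustrative sketch, not the procedure's actual implementation; in particular, the interpretation of "most concordant" as the pair of raters whose totals differ least is our assumption.

```python
from itertools import combinations

def needs_third_rater(parts_a, parts_b):
    """Decide whether a third rater is needed for one score type.

    `parts_a` and `parts_b` are the per-part scores (e.g., unique designs
    in parts 1 to 5) assigned by raters 1 and 2.
    """
    differs_per_part = any(abs(a - b) > 2 for a, b in zip(parts_a, parts_b))
    differs_in_total = abs(sum(parts_a) - sum(parts_b)) > 4
    return differs_per_part or differs_in_total

def final_score(totals):
    """Average the two most concordant of two or three rater totals."""
    if len(totals) == 2:
        return sum(totals) / 2
    # Assumption: "most concordant" means the pair of raters whose totals
    # have the smallest absolute difference.
    a, b = min(combinations(totals, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2
```

For instance, two raters who differ by three points in a single part trigger a third assessment even when their totals are close.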

Computerized assessment
All RFFT protocols were scanned in color with a Kodak i620 scanner at a resolution of 300 dots per inch and saved in portable network graphics (PNG) format. Subsequently, the RFFT protocols in PNG format were analyzed by the specifically designed software for the computerized assessment.
The software was developed using data from the Lifelines Cohort Study [9]. Lifelines is a multidisciplinary prospective population-based cohort study examining in a three-generation design the health and health-related behaviors of 167,729 persons living in the north of the Netherlands. It employs a broad range of investigative procedures in assessing the biomedical, socio-demographic, behavioral, physical and psychological factors which may contribute to health and disease of the general population, with a special focus on multimorbidity and complex genetics.
Development of the software was based on the principle that in each cell of the standard RFFT protocol, no more than ten different connections can be drawn between any two dots of the five-dot pattern (S1 File). These connections can be combined into 1023 different designs; each correct design is a combination of one or more connections. Therefore, the first step of the software was to identify all correct, or true, connections in a cell. This was done by a set of algorithms that performed a series of subsequent tasks for each cell of the standard RFFT protocol:
1. identifying the active dots in each cell (Fig 2);
2. identifying all candidate connections between the active dots;
3. assigning all red pixels of the design drawn by the respondent to candidate connections;
4. checking whether the red pixels assigned to a specific candidate connection actually form a line that is compatible with that connection. If not, the candidate connection is rejected; if so, it undergoes the next check;
5. checking whether the candidate connection is a false-positive error. If so, the candidate connection is rejected; if not, it is accepted as a true connection (Fig 2).
After performing tasks 1 to 5, the software combined the true connections identified in a cell into one design and calculated a design identifier (design ID). The design IDs, each of which corresponded to exactly one of the 1023 correct designs, were used to count the number of unique designs and perseverative errors. Further details on the software can be found in the supporting information (S1 File).
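The design-ID scheme lends itself to a bitmask encoding: with ten possible connections per cell, every correct design is a nonempty subset of connections, giving exactly 2^10 - 1 = 1023 distinct IDs. The following Python sketch is illustrative; the dot numbering and connection ordering are our assumptions, not the software's actual encoding.

```python
from itertools import combinations

# The ten possible connections between the five dots of a cell, in a fixed
# order. Dot labels 0-4 are illustrative, not the software's numbering.
CONNECTIONS = list(combinations(range(5), 2))

def design_id(true_connections):
    """Encode a set of accepted (true) connections as a design ID.

    Each design sets one bit per connection, so the IDs 1..1023 map
    one-to-one onto the 1023 possible correct designs.
    """
    mask = 0
    for pair in true_connections:
        mask |= 1 << CONNECTIONS.index(tuple(sorted(pair)))
    return mask
```

Counting unique designs then reduces to counting distinct IDs within a part, and perseverative errors to counting repeated IDs.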

Statistical analysis
Normally distributed data are presented as mean and standard deviation (SD) and non-normally distributed data as median and interquartile range (IQR). Differences in continuous data were tested with the unpaired t-test or, if appropriate, the Mann-Whitney U-test. Differences in proportions were tested with the Chi-squared test. Agreement between two human assessments, as well as between the computerized and human assessment, was analyzed with the two-way mixed, absolute agreement, single measures intraclass correlation coefficient (ICC) and Lin's concordance correlation coefficient. In addition, 95% limits of agreement were calculated by the Bland-Altman method. In all analyses, the level of statistical significance was set at 0.05. Lin's concordance correlation coefficients were calculated with Stata Statistical Software Release 13 (StataCorp LP, College Station, TX, USA). All other statistical analyses were done with IBM SPSS Statistics 22.0 (IBM, Armonk, NY, USA).
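The two agreement statistics can be sketched as follows. This is an illustrative Python implementation using numpy, not the Stata or SPSS routines used in the study.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two sets of scores."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()              # population (1/n) variances
    cov = ((x - mx) * (y - my)).mean()     # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def bland_altman_limits(x, y):
    """Mean difference and 95% limits of agreement (mean +/- 1.96 SD)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sd = d.std(ddof=1)                     # sample SD of the differences
    return d.mean(), d.mean() - 1.96 * sd, d.mean() + 1.96 * sd
```

Lin's coefficient penalizes both poor correlation and systematic shifts in location or scale, which is why it is reported alongside the ICC as a measure of absolute agreement.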

Agreement between human assessments
Unique designs. For the number of unique designs, the intraclass correlation coefficient between two human assessments (different raters) was 0.995 (95%CI, 0.994 to 0.995; p<0.001) (Fig 3, left panel); Lin's concordance correlation coefficient was 0.995 (95%CI, 0.994 to 0.995; p<0.001). The mean difference (SD) between two human assessments (different raters) was -0.44 (2.98). This was not dependent on the average result of the assessments (Fig 4, left panel). The 95% limits of agreement were -6.30 and +5.39.

Agreement between computerized and human assessment
Unique designs. For the number of unique designs, the intraclass correlation coefficient between the computerized and human assessment was 0.994 (95%CI, 0.988 to 0.996; p<0.001) (Fig 3, right panel); Lin's concordance correlation coefficient was 0.994 (95%CI, 0.993 to 0.994; p<0.001). The mean difference (SD) between the computerized and human assessment was -1.41 (2.78). This was dependent on the average result of the computerized and human assessment (Fig 4, right panel). The number of unique designs was somewhat higher in the computerized assessment than in the human assessment in persons with a low performance and somewhat lower in persons with a high performance. The 95% limits of agreement were -6.87 and +4.03. There was one clear outlier in the comparison between the computerized and human assessment (Fig 3, right panel; Fig 4, right panel). According to the computerized assessment, the number of unique designs was 26 for this person, but according to the human assessment, it was 72 (difference, -46 points). Visual inspection of the original RFFT protocol revealed that this person did not strictly follow the instructions when performing the RFFT. The lines that were drawn did not merely connect the dots of the five-dot pattern but extended a few millimeters beyond the dots and crossed them (S1 Fig). According to the computerized assessment, there were 49 violations of the procedure by this person. Most of these violations were assessed as unique designs by the human raters.
Perseverative errors. For the number of perseverative errors, the intraclass correlation coefficient between the computerized and human assessment was 0.991 (95%CI, 0.990 to 0.991; p<0.001) (Fig 5, right panel); Lin's concordance correlation coefficient was also 0.991 (95%CI, 0.990 to 0.991; p<0.001). The mean difference (SD) between the computerized and human assessment was +0.02 (1.94). This was not dependent on the average result of the computerized and human assessment (Fig 6, right panel). The 95% limits of agreement were -3.80 and +3.82.

Discussion
The RFFT is a nonverbal fluency test that is sensitive to changes in cognitive function in young as well as old persons [4][5][6]. Therefore, the RFFT is a useful tool for life course studies of cognitive disorders, as it is generally assumed that changes in cognitive function begin at a relatively young age [3]. However, conventional assessment of performance on the RFFT by human raters can be time-consuming and may be impractical in large sample studies. Therefore, we developed a dedicated software program for computerized assessment of performance on the RFFT, to be able to assess the performance of a large number of persons in a relatively short time. In this study, we report that the agreement between the computerized and human assessment was very high and comparable to the agreement between two independent human assessments. This makes the software program well suited for the assessment of performance on the RFFT in other large sample studies.
The agreement between two independent human assessments of performance on the RFFT, or interrater reliability, was investigated in only a few studies that included relatively small or highly specific populations. In a study by Berning et al., 143 RFFT protocols were assessed by 30 pairs of raters [14]. They found an ICC of 0.93 for unique designs and 0.74 for perseverative errors. In a study by Sands, 50 RFFT protocols of patients with mixed neurological disease were assessed by two raters [15]. Sands found an ICC of 0.99 for unique designs and 0.99 for perseverative errors. Finally, in a study by Ross et al., 90 RFFT protocols of healthy young persons, undergraduate students recruited from introductory psychology courses, were each assessed by seven raters [16]. They found an ICC of 0.95 for unique designs and 0.86 for perseverative errors. Thus, in all three studies, the agreement between independent human raters was high to very high for the number of unique designs and moderate to high for the number of perseverative errors. For the number of unique designs, these results were confirmed in our study, as we also found a very high agreement between human raters for this measure. For the number of perseverative errors, however, we also found a very high agreement. This difference between the other studies and our study can probably be explained by our much larger study population. Furthermore, there was a difference in source population. Whereas the other studies included persons from highly specific source populations, such as patients with neurological disease or students, our study included community-dwelling persons ranging in age from 40 years to 75 years or older, and ranging in educational level from primary school to university.
Therefore, it can be concluded that the RFFT has a high to very high interrater reliability and that this finding is not limited to specific study samples but can be generalized to the adult population.
The assessment of RFFT protocols can be difficult and time-consuming because persons who undergo the test may draw complicated designs or a large number of highly similar designs. This is particularly the case in study samples that include young and highly educated people [6]. As a consequence, assessment of performance on the RFFT can be challenging and requires fine visual discrimination and sustained attention to detail [14]. Not surprisingly, the accuracy of the assessment is dependent on the rater's scoring experience as well as on the rater's own performance on the RFFT [14]. Although the effects of these two factors are probably small, their impact can be substantial in large scale studies or over a large number of clinical cases [14]. To avoid these sources of error and to be able to assess a large number of RFFT protocols as part of the Lifelines Cohort Study [12], we developed dedicated software for the computerized assessment of the RFFT. In this study, the software performed well and we found a high agreement between the computerized and human assessment. For the number of perseverative errors, the agreement between the computerized and human assessment was even somewhat higher than the agreement between two independent human raters. Thus, the software that we developed for computerized assessment of performance on the RFFT is an accurate tool that can reduce the time and manpower needed for the assessment of RFFT protocols in large scale studies.
The mean difference between the computerized and human assessment of performance on the RFFT was quite small and comparable to the mean difference between two independent human assessments. Nevertheless, in 5% of the participants, the difference was more than five unique designs or more than three perseverative errors, and in a smaller percentage of participants, these differences were even much higher. Although this probably is of minor importance in large scale studies, we think that these differences can be relevant for the assessment of individual patients in clinical practice. However, similar differences were found between independent human assessments. In our study, such differences were partly explained by differences in the interpretation of the scoring rules as specified in the professional manual [5], and not so much by overlooking erroneous designs or perseverative errors. Due to the time pressure that is part of the RFFT, participants may work hastily and draw lines that are not straight but curved, or lines that do not completely connect the dots of the five-dot pattern. Such imprecise designs may be assessed differently by different raters. Here, computerized assessment probably is more consistent and reproducible than human assessment, although clearly, the accuracy of computerized assessment is also dependent on adherence to the test instructions.
The importance of adherence to the test instructions was further underlined by the finding that the number of unique designs was slightly underestimated in the computerized assessment as compared with the human assessment. It has been our experience that participants who work fast and draw curved lines, or lines that do not completely connect the dots, are mostly people with high performance on the RFFT. Human raters are probably more liberal than the computer software and more often accept such hasty designs as correct. Although it can be debated whether human assessment was too liberal or the computer software too strict, it is likely that the agreement between human and computerized assessment can be further improved by strict adherence to the test instructions and, if necessary, repeated feedback during administration of the RFFT. We also recommend that a statement along the lines of "when drawing a design, make sure each line extends all the way to the dot" be included in the standard instructions for the RFFT.
A potential limitation of our study is the recruitment and training of the human raters. Although we recruited young and highly educated people as raters, and all raters received training and supervision, the raters in our study were not professional neuropsychologists or psychometrists. Possibly, the agreement between professional neuropsychologists and psychometrists is higher than the agreement between the human assessments reported in this study. It may also be higher than the agreement between the computerized and human assessments. On the other hand, young and highly educated people generally have the best performance on the RFFT [6], and the raters in this study gained extensive experience in the assessment of RFFT protocols. Both rater fluency and performance have a positive effect on the assessment accuracy of RFFT protocols [12]. The main strength of our study is its study population, which included a large number of community-dwelling persons who varied widely in age and educational level. As a result, the agreement between the computerized and human assessments could be studied across a wide performance range, which is important for the generalizability of our findings.

Conclusion
In this study, the agreement between two independent human assessments, or interrater reliability, of performance on the RFFT was very high. We also found that the agreement between the computerized and human assessment was very high and comparable to the agreement between the human assessments. Thus, in large scale studies, performance on the RFFT can be accurately assessed by the software application specifically designed for this task.