Standardized on-road tests assessing fitness-to-drive in people with cognitive impairments: A systematic review

Objective: The on-road assessment is considered the gold standard for assessing fitness-to-drive because of its ecological validity. Yet existing instruments are heterogeneous and little is known about their psychometric properties. This study identified existing on-road assessment instruments and extracted data on their psychometric properties and usability in clinical settings.

Method: A systematic review identified studies evaluating standardized on-road evaluation instruments adapted for people with cognitive impairment. Published articles were searched on PubMed, CINAHL, PsycINFO, Web of Science, and ScienceDirect. Study quality and the level of evidence were assessed using the COSMIN checklist. The collected data were synthesized using a narrative approach. Usability was subjectively assessed for each instrument by extracting information on acceptability, access, cost, and training.

Results: The review identified 18 studies published between 1994 and 2016 that investigated 12 different on-road evaluation instruments: the Performance-Based Driving Evaluation, the Washington University Road Test, the New Haven, the Test Ride for Practical Fitness to Drive, the Rhode Island Road Test, the Sum of Manoeuvres Score, the Performance Analysis of Driving Ability, the Composite Driving Assessment Scale, the Nottingham Neurological Driving Assessment, the Driving Observation Schedule, the Record of Driving Errors, and the Western University's On-road Assessment. Participants were mainly male (64%), between 48 and 80 years old, and had a broad variety of cognitive disorders. Most instruments showed reasonable psychometric values for internal consistency, criterion validity, and reliability. However, the level of evidence was too poor to support any of the instruments, given the low number of studies for each.

Conclusion: Despite the social and health consequences of decisions taken using these instruments, little is known about the value of a single evaluation and the ability of instruments to identify expected changes. None of the identified on-road evaluation instruments seems currently adapted for clinical settings targeting rehabilitation and occupational priorities rather than road safety alone.

Study registration: PROSPERO registration number CRD42018103276.

fitness-to-drive [20]. Furthermore, general practitioners mention cognitive disorders as the first reason to justify their decision to withdraw the driver's license in 64% of cases [21], without always documenting the true effects of disorders on driving skills. Different means are available to assess fitness-to-drive: off-road tests of prerequisites (physical, perceptual, and cognitive), simulator-based assessments, and on-road assessments. In the presence of cognitive impairments, neuropsychological tests are not considered a good predictor of fitness-to-drive [22]. It is thus advisable not to base decision-making solely on off-road testing, but also to conduct an assessment of on-road driving performance in case of doubt [16].
For this purpose, driving simulators can be used. This technology standardizes the procedure and controls a number of variables (e.g., traffic density and user behavior) [23]. It also avoids exposure to risky situations in traffic [24]. However, it is costly, and its acceptability to older adults is questionable (simulator sickness and users' lack of familiarity with the technology) [24,25]. The use of the simulator is also questionable in cases of visual-perceptual disorders because of the two-dimensional representation of driving [18]. Finally, the ecological validity of simulators is questionable [26].
In terms of ecological validity, the on-road assessment is therefore considered the gold standard [27]. This assessment usually includes a closed course (e.g., in a parking lot) that allows the operational level to be safely assessed before entering traffic [18]. Depending on the construction of the instrument, the open road evaluation then allows investigation of the three levels of skills and control from Michon's model [28]. However, many variables influence driving, including weather conditions, other road users' behavior, and road conditions. The inability to control some of these variables makes it difficult to standardize this type of evaluation [16]. The psychometric and clinimetric values of on-road assessment methods therefore need to be accounted for. To our knowledge, there are no systematic reviews investigating the added value of on-road assessments in the clinical decision-making process for recommending driving cessation. Before starting the review, systematic review protocols on this subject were searched in the PROSPERO database. No ongoing studies were found.

Method
The aim of this systematic review is to identify and describe psychometric and clinimetric values for existing standardized on-road tests adapted for people with cognitive impairment due to acquired brain injury, dementia, or age-related disorders. The review also aims to describe costs, training requirements, accessibility, and usability of each instrument for future practical implementation. A protocol was recorded on PROSPERO (registration number CRD42018103276).

Selection criteria
Articles meeting the following criteria were included: (a) assessments used with people with suspected or objective cognitive impairment related to acquired brain injury, dementia, or age; (b) on-road assessment; (c) standardized instruments; (d) original articles including a form of validation of the assessment (identifiable in the title or abstract); (e) articles written in English or French.
Excluded were articles comprising (a) a simulator-based assessment; (b) a first driving license test; (c) the use of highly specialized equipment (high costs, low reproducibility, etc.).
(inception-January 2018). Grey literature has not been explored, as this study focuses on available on-road assessments.
A pilot phase was carried out on PubMed to refine the search equation until it could identify already known articles [29]. Three categories of keywords were defined (assessment, driving, and cognitive impairment). To keep the search conservative, the category "cognitive disorders" was not systematically used. No additional limits were used. The same literature search was repeated at the end of the data extraction phase (January 2019) on PubMed, CINAHL, and Web of Science. The full search is available at the following link: http://doi.org/10.5281/zenodo.3687014.
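The combination of keyword categories described above can be sketched as a Boolean query builder. This is a minimal illustration only: the terms below are hypothetical placeholders and not the published search equation (which is available on Zenodo), and the `optional` argument is an assumed way of representing a category that was not systematically applied.

```python
def build_query(categories, optional=()):
    """Join terms with OR inside a category and AND between categories."""
    blocks = []
    for name, terms in categories.items():
        if name in optional:
            continue  # category left out to keep the search broad
        blocks.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return " AND ".join(blocks)


# Hypothetical placeholder terms; NOT the authors' actual search equation.
CATEGORIES = {
    "assessment": ["assessment", "evaluation"],
    "driving": ["on-road", "fitness to drive"],
    "cognitive impairment": ["dementia", "brain injury"],
}

# Broad variant: the cognitive category is skipped.
broad_query = build_query(CATEGORIES, optional=("cognitive impairment",))
# Narrow variant: all three categories are required.
full_query = build_query(CATEGORIES)
```

Dropping a category removes one AND constraint, so the broad variant can only return the same records or more, which is the sense in which the search errs on the side of inclusion.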
References of the selected articles were consulted (search backward) and the articles citing those selected were screened (search forward) for additional studies [30]. Authors of the selected articles were contacted for additional papers.

Study selection
Titles were screened jointly by two reviewers (LV and DB) against the inclusion criteria. The abstracts were then screened independently by the same two reviewers. Following the removal of duplicates, the full-text articles were assessed for eligibility against the selection criteria. Any disagreements that arose between the reviewers at each stage of the study selection process were resolved through discussion or with a third reviewer (PV). Reasons for exclusion of full-text articles were documented.

Data extraction
Data extraction forms were created by two authors (LV and DB) and validated by a third author (PV). The extracted data include characteristics of the selected studies, characteristics of each identified on-road assessment, their psychometric and clinimetric properties, and characteristics for implementation.
The type of extracted data was defined by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) manual [31] and included authors and year of publication, participant characteristics (number, age, gender, health condition, eligibility criteria), context, description of the test (materials, evaluators, and process), and participant scores. The characteristics of the tests include: the name of the test and citation of the studies, the target population, the distance and duration of the course, the design of the road (closed and/or open road and difficulty), the items (number, categories, and description), the rating system, the availability of a cut-off score, changes made between several versions, and the available versions (language) [31][32][33]. The psychometric and clinimetric properties of the identified on-road assessments were extracted according to the COSMIN taxonomy [34]. Finally, the implementation characteristics were cost, accessibility (i.e., how to obtain the instruments), prerequisites (e.g., training required), and acceptability [31]. The latter represents the users' perspective on the relevance of the content and context of the test [35].
Data were extracted by two independent reviewers (LV and DB) using the standardized data extraction forms after a pilot phase on four articles. Any disagreements that arose between the reviewers during the data extraction process have been resolved through discussion or with a third reviewer (PV). Authors of selected articles have been contacted to request missing or additional data for clarification, when required.

Assessment of the risk of bias and the quality of evidence
Assessments of the risk of bias and the quality of evidence in the selected studies were conducted using the COSMIN checklist. Although this tool was originally developed for systematic reviews of Patient-Reported Outcome Measures (PROM), it can be used for Clinician-Reported Outcome Measures (ClinROM) [31,36]. The evaluation of the quality of the studies was carried out in three stages: (a) assessment of the risk of bias by psychometric property by article; (b) assessment of the risk of bias by psychometric property by instrument (if more than one article); and (c) assessment of the quality of evidence by psychometric property by instrument, using a Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach adapted by COSMIN. These three steps were carried out jointly by two authors (LV and DB). A third author (PV) reviewed these evaluations.

Data analysis
It was anticipated that the heterogeneity of the identified on-road assessments would make it difficult to group the data for a meta-analysis. A narrative synthesis supported by tables was therefore carried out [37][38][39].

Assessment of publication bias
Assessing the risk of publication bias in studies on measurement properties is complicated because of a lack of registry for such studies [31]. Thus, the assessment of the risk of publication bias was not carried out.

Results
The literature search identified 5,463 records (Fig 1). Following the titles and abstracts screening and the removal of duplicates, 64 full-text articles were retrieved. Following additional search strategies (search backward, search forward, and emails to authors) 28 full-text articles were added. Thus, 92 full-text articles were assessed for eligibility and 18 met selection criteria. No additional studies were selected after the last literature search in January 2019. The study selection process is presented in the PRISMA flow diagram in Fig 1 [40].
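The full-text counts reported above can be checked with simple arithmetic. The sketch below uses only the figures given in the text; the function name and dictionary keys are illustrative, and the number of excluded full texts is derived here rather than quoted from the article.

```python
def full_text_flow(from_databases, from_other_sources, included):
    """Summarize the full-text stage of a study selection process."""
    assessed = from_databases + from_other_sources
    return {
        "assessed": assessed,
        "included": included,
        "excluded": assessed - included,  # derived, not reported in the text
    }


# Figures from the text: 64 full texts from the database search,
# 28 from backward/forward searching and author contact, 18 included.
flow = full_text_flow(from_databases=64, from_other_sources=28, included=18)
```

With these inputs, 92 full-text articles are assessed, matching the reported total.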
All but two evaluation instruments have been developed for English-speaking countries. The articles were published between 1994 and 2016. The target populations are mainly people with dementia (5 of 12) and people aged 60 and over with different levels of cognitive functioning (4 of 12). One assessment tool was intended for people who have had a stroke, one for people with multiple sclerosis, and one for people with cognitive disorders of various etiologies. All tests take place on open roads and six of them start with a closed course. The route is predefined for ten on-road assessments and two take place in the participants' ecological environment (DOS and CDAS). The test lasts between 45 and 60 minutes except for the DOS and the CDAS (31 minutes and 4 hours filmed over two weeks, respectively) [47,54]. This information is not available for two tests. The distance covered during the tests averaged 23.5 kilometers (SD = 10.5; range 9.6-40) for nine of them; distance data were missing for the other three. A cut-off score is available for two instruments: It allows a dichotomization for the SMS [49] and a

PLOS ONE
On-road fitness-to-drive evaluation
(Table 2), with the exception of the acceptability of two on-road assessments [41,54], required contact with the authors. Data for five instruments are not available in the absence of a response from the authors. Two instruments are available on the Internet (WURT and NNDA), one by contact with the author (TRIP), one after training (P-Drive), and one in a journal (CDAS). Access has not yet been specified for the remaining instruments. Three on-road assessments are free of charge (WURT, CDAS, and NNDA). The price has not been established for the TRIP and RODE, and is 800€ for the P-Drive (training included). Online training is being developed for another on-road test (RODE). Finally, training as a driving rehabilitation specialist is recommended for five on-road assessments (TRIP, CDAS, NNDA, RODE, and UWO) and training as an occupational therapist for one (WURT).
The characteristics of the selected studies are presented in Table 3. Participants were on average 70.8 years old (SD = 8.7; range 48-80.2), and the average proportion of men was 64.28% (SD = 14.5; range 40-87%). The average study sample size is 66 (SD = 50; range 6-205). A dual control vehicle was used in nine studies, without specifying the type of gearbox. A dual control vehicle with automatic transmission was used in five studies. Participants were given a choice between their private vehicle and a dual control vehicle in one study. Three studies opted for the use of participants' vehicles. Information is missing in two studies.
Table 4 lists the psychometric properties investigated in the included articles: structural validity, internal consistency, inter-rater reliability, criterion validity, and construct validity. Properties related to reassessment were poorly investigated: responsiveness was never assessed and test-retest reliability only twice. The risk of bias for these psychometric properties was assessed using the COSMIN checklist, and the quality of the evidence was assessed in detail using the GRADE approach. Details of these assessments are available on Zenodo (http://doi.org/10.5281/zenodo.3687014) under Appendix 2 (Risk of bias) and Appendix 3 (Summary of Findings).
Fig 2 illustrates the quality of the evidence of psychometric properties by assessment (colors) in a synthetic way, as well as the evaluation of the aggregated results by psychometric property against the criteria for good measurement properties defined in the COSMIN checklist (+, -, ±, ?) [38]. The aggregate results are sufficient for 18 psychometric properties, undetermined for eight, insufficient for three, and inconsistent for one. The inconsistency of the criterion validity for the P-Drive can be explained by the fact that the three studies exploring criterion validity include different target populations and different expertise levels of the evaluators. No quality of evidence of a psychometric property was rated as high (A), four were rated as moderate (B), five as low (C), and 21 as very low (D). This low quality is partly explained by the sample size (-1 to the quality of the evidence if 100 ≥ n ≥ 50 and -2 if n < 50) and by the indirectness. The latter refers to a different target population than the one of this systematic review. Indeed, most tests, except the P-Drive, have a relatively narrow target population.
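The sample-size rule quoted above can be illustrated with a short sketch. It models only the imprecision penalty stated in the text (-1 level for 50 ≤ n ≤ 100, -2 levels for n < 50); the other GRADE factors (risk of bias, inconsistency, indirectness) are deliberately left out, and the function names and the example sample sizes are assumptions made for illustration.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def imprecision_penalty(n):
    """Levels to subtract for sample size, following the rule in the text."""
    if n < 50:
        return 2
    if n <= 100:
        return 1
    return 0


def downgrade(start_level, penalty):
    """Move down the quality scale, stopping at 'very low'."""
    idx = min(LEVELS.index(start_level) + penalty, len(LEVELS) - 1)
    return LEVELS[idx]


# Tallies reported in the text: both describe the same set of ratings.
aggregate_results = {"sufficient": 18, "undetermined": 8,
                     "insufficient": 3, "inconsistent": 1}
evidence_quality = {"high": 0, "moderate": 4, "low": 5, "very low": 21}
```

For example, evidence that starts at "high" and rests on a sample of 66 participants (the average study size reported in this review) would drop to "moderate" on imprecision alone, before any other factor is considered.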

Discussion
In this systematic review, 12 on-road tests were identified to assess fitness-to-drive in people with cognitive impairment. These on-road tests are heterogeneous with regard to their components and few have been the subject of several validation studies. Moreover, no test really stands out from the others in terms of the quality of the evidence, particularly because of limited sample sizes. This limitation affects driving research in general: Because of the high costs in this field, sample sizes are limited and statistical power suffers [58].
As health conditions are potentially progressive, it is important not to decide on fitness-to-drive on the basis of a single assessment [20]. Health conditions can be transitory, episodic, or permanent. To carry out this reassessment, the test used should have good psychometric properties in terms of test-retest reliability, responsiveness, measurement stability over time, and change detection [59]. Among the tests identified in this systematic review, only test-retest reliability was assessed, and only in two on-road tests (WURT and SMS) [42,48].
On-road assessment allows for exploring the three dimensions of Michon's model: strategic, tactical, and operational [28]. Tactical and operational dimensions are systematically explored during an on-road assessment as they relate to vehicle maneuvering. However, depending on the construction of the test, the strategic dimension is not necessarily investigated. Only UWO, CDAS, and DOS explore the strategic dimension through a route-planning task [47,54,56,57]. It seems important to investigate all dimensions of driving and not only those related to vehicle maneuvers. Indeed, the strategic dimension influences driving safely by answering the questions of when, where, with whom, and how, among others.
Combined with reassessment, the assessment of fitness-to-drive could allow the identification of self-regulatory behaviors. These represent changes in driving behaviors (e.g., highway avoidance) that promote safe driving in cases of functional limitations, discomfort, or lack of self-confidence [60]. Self-regulation may happen in tactical, strategic, or life-goal aspects (e.g., choice of place to live in relation to one's occupation or purchase of another vehicle). Thus, the identification of self-regulatory behaviors depends on the dimensions investigated by the on-road test (tactical and strategic according to Michon's model). Since the tactical dimension is systematically explored, the identification of self-regulatory strategies at this level is accessible. This makes it possible to identify the adoption of behaviors such as avoiding distracting elements when driving (e.g., radio), or a better gap acceptance [60]. Since only three on-road tests assess the strategic dimension, work is needed to facilitate the identification of self-regulatory strategies at this level.
A cut-off score is available for two on-road tests (P-Drive and SMS) [49][50][51]. However, only the P-Drive allows for trichotomization of the evaluation, i.e., the classification of the people assessed into three categories (fit / doubtful as to their fitness-to-drive / unfit): Two studies were conducted and showed two different cut-off scores allowing dichotomization; it was suggested to use them as limits defining the gray zone, thus allowing a trichotomization [50]. This gray zone is important as it is a gateway for interventions [32,33]. These may be aimed at maintaining driving performance and anticipating or managing a mobility transition following driving cessation [17].
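The idea of using two dichotomizing cut-off scores as the limits of a gray zone can be sketched as follows. This is an illustrative sketch under stated assumptions: the cut-off values are hypothetical (they are not the published P-Drive cut-offs), higher scores are assumed to indicate better driving performance, and scores falling exactly on a cut-off are assumed to belong to the gray zone.

```python
def trichotomize(score, lower_cutoff, upper_cutoff):
    """Classify a score as 'unfit', 'doubtful', or 'fit'.

    Assumes higher scores mean better performance and that scores equal
    to either cut-off fall inside the gray ('doubtful') zone.
    """
    if lower_cutoff > upper_cutoff:
        raise ValueError("lower_cutoff must not exceed upper_cutoff")
    if score < lower_cutoff:
        return "unfit"
    if score > upper_cutoff:
        return "fit"
    return "doubtful"


# Hypothetical cut-offs chosen for illustration only.
LOWER, UPPER = 60, 75
results = [trichotomize(s, LOWER, UPPER) for s in (52, 60, 68, 75, 90)]
```

When both cut-offs coincide, the gray zone shrinks to a single score and the classification is effectively the dichotomization that each study reported on its own.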
Several studies have allowed participants to use their private vehicles [47,52,54] or have given them a choice [53], suggesting some ecological validity. In addition, two tests are conducted in an ecological environment (CDAS and DOS) [47,54]. According to the Person-Environment-Occupation-Performance Model (PEOP), occupational performance emerges from the interaction between the person, their environment, and the occupation [61]. The use of standardized roads and the instructors' professional vehicle ignores the potential influence of the ecological environment on performance. In this context, the ecological validity of the assessment is questionable [54]. Indeed, familiarity with the environment while driving reduces cognitive load, particularly at the attention level, as suggested in a simulator study [62]. Familiarity also promotes spatial orientation [63]. In addition, the use of the instructors' professional vehicle results in a reduction of performance [64]. However, familiarity with the road increases distractibility while driving [65] as well as reaction time [66]. In addition, risk perception and ability to respect speed regulation are reduced [67,68]. All ages combined, the majority of accidents occur on familiar roads [69]. However, these results are from studies that are not limited to participants with cognitive impairment. Studies suggest that tactical self-regulation is influenced by the level of familiarity with the environment in which the evaluation takes place. This could raise safety issues for evaluations that do not cover unknown environments.
Furthermore, the choice of the vehicle could have an influence on safety during the test. In the event of a serious error, the absence of dual control could impede the evaluator's ability to intervene quickly [33]. In this context, beginning the evaluation on a closed course could allow an assessment of the operational dimension of driving beforehand. This ensures sufficient control over safety before entering traffic [28]. However, this component was not identified in all the tests [44,45,48,49,53], including those conducted in an ecological setting [47,54].
The involvement of two people, a driving instructor to ensure safety and usually an occupational therapist for the scoring, was very common. As confirmed by other studies, only three instruments, the CDAS, the New Haven, and the RIRT relied on a single person [43,46,47]. However, the added value of having a second evaluator was never properly studied. The shortage of qualified occupational therapists in some countries, such as Switzerland [70], could question this practice. Resources could be optimized by having occupational therapists focus on activity analysis, community mobility, and mobility transition [71]. Given the almost perfect inter-rater reliability between occupational therapists and instructors identified in two studies [42,57], the latter appear to represent a potential way to address the shortage of therapists in this domain. According to Nilsen [72], another resource to consider in the implementation of an evaluation instrument is the accessibility to appropriate training. One training course is available for the P-Drive in Scandinavia and another is under development for the RODE. The WURT does not require any specific training, but simply to follow the protocol.
Even though systematic review protocols were searched before our study, a systematic review concerning the reliability and validity of on-road driving tests has recently been published [73]. Its authors identified 21 on-road tests. This difference can be explained by the different eligibility criteria: They included studies not restricted to people with cognitive impairment, and studies in which the validation process could not be identified in the title or abstract. The authors also found that validation studies mainly focused on inter-rater reliability, and that measurement error and responsiveness were not investigated. Yet these psychometric properties are of great importance, as explained previously and as also mentioned by Sawada et al. [73].
Given the limitations of on-road tests concerning their psychometric properties and the variability of their components, these tests are not as promising as expected and are costly in terms of financial and human resources. However, it is necessary to have reliable, valid, and specific instruments in order to support decision making concerning the withdrawal of the driving licence, its retention, or its restriction [74]. A less expensive and potentially effective alternative would be the use of cognitive tests to assess fitness-to-drive [75]. A systematic review exploring the relationship between cognitive tests and on-road driving performance in people with dementia has been conducted [75]. It appears that composite batteries of cognitive tests are more appropriate to predict on-road driving performance than cognitive tests focused on a single cognitive ability. However, these composite batteries are not sufficiently validated: Cut-off scores enabling trichotomization would be useful. One specific test, the Useful Field Of View (UFOV©), appears to be a potential predictor of fitness-to-drive [76]. This instrument assesses cognitive domains (selective attention, divided attention, and processing speed) in three tasks [76]. However, there are many different versions of this instrument and some authors have modified it for their studies [77]. For this reason, the interpretation of the results is more complex than it appears. In addition, the UFOV© possibly measures visual functions that are not directly related to driving. Indeed, in addition to measuring processing speed and different types of attention, it involves visual functions such as visual acuity. The latter is not a predictor of fitness-to-drive [77]. As visual acuity decreases with age, the UFOV© score decreases and this does not necessarily reflect a poorer driving performance. Thus, it seems important to have age-related normative values for UFOV© [78].
Another alternative would be the use of simulators: They allow the evaluation of complex behaviors in a controlled environment when assessing these behaviors during an on-road test might not be safe, practical, or ethical [79]. They must be immersive, sufficiently challenging, and have complex scenarios in order to emulate real-world performance accurately and consistently [79]. Technological advances have made it possible to make simulators much more dynamic by improving visual quality, for example, but also by being able to control traffic, include other road users, and modify the driving environment in real time according to the behavior of the person being assessed [79]. Wynne et al. stress the importance of the choice of simulator, as simulators are widely used in practice and research and, depending on the simulator, can lead to biased interpretation. There are few validation studies on simulators. The authors suggest that the simulator does not systematically represent a valid assessment of driving performance [79]. Each simulation setting is unique and must be validated, which can require a substantial financial investment: Simulators are therefore much more expensive than they seem.
Finally, according to Wynne et al. [79], the perception of risk in simulators is also reduced: The engagement in risky situations is higher and compliance with traffic regulations is reduced. Driving performance in the simulator is influenced by this absence of stress, which can be present during an on-road test. In addition, the simulator does not allow the evaluation of situations where familiarity with the environment could have an impact on on-road driving performance [79].

What does this study add
• The choice of components (ecological environment, private vehicle, evaluation of the strategic dimension) when constructing the instrument is of great importance, as these components influence performance and safety during the evaluation.
• This study provides information on implementability characteristics in addition to the psychometric properties of the instruments.
• This systematic review focused specifically on on-road tests for people with cognitive impairment, with a comparison of their psychometric properties and components.
• The influence of familiarity must be explored in order to guide the development of on-road tests.

Recommendations for practice and research
In conclusion, none of the methods for assessing fitness-to-drive appears to be ideal and none of them alone appears to be sufficient. Thus, these different tests could be combined in order to inform the decision-making process regarding the withdrawal, restriction, or retention of the driving licence. The choice of an evaluation instrument remains an important concern: It is necessary to use valid instruments. In sum, off-road tests can be used to identify situations requiring further assessment. On-road or simulator-based performance tests could support decision making in ambiguous situations. In this regard, Sawada et al. consider WURT, P-Drive, and TRIP as potential gold standards [73]. In the present study, WURT, P-Drive, TRIP, RIRT, SMS, and New Haven seem to present the best results in terms of summarized results and quality of the evidence, the others being set aside because of their weak results or quality of the evidence. Of these instruments, the P-Drive seems to stand out from the others: As mentioned previously, it allows trichotomization, training is available, its target population is larger than that of the other instruments, and it is the most studied instrument among the articles selected in this systematic review. Nevertheless, limitations remain: The strategic dimension is not explored, nor are responsiveness, measurement error, and test-retest reliability. For use in languages other than Swedish or English, a transcultural adaptation would be necessary to make this assessment (and the training) accessible. Finally, its implementation is limited by the shortage of occupational therapists. Specific interdisciplinary regional training programs for occupational therapists and driving instructors could facilitate the implementation of improved methods for evaluating fitness-to-drive in people with cognitive impairment.
There seems to be an association between environmental familiarity and driving performance. However, more studies on this subject are needed to determine the relevance of conducting a driving assessment in an ecological environment at the expense of road standardization and vice versa. In addition, this association should be explored in people with cognitive impairments. In this context, there is also a need for studies to evaluate the responsiveness of evaluation instruments to known changes in health conditions, and to develop methods to distinguish driving lapses and errors due to health conditions from those due to other causes. A possible solution could be to develop instruments that rely on at least three separate driving phases: evaluation, intervention, and re-evaluation.
Finally, it is relevant to provide evidence of the added value or necessity of the involvement of occupational therapists during the on-road assessment, as that has become the norm in many countries.

Limitations of the study
As some tests were initially developed on simulators (e.g. P-Drive), it is possible that some validation studies were not selected due to the defined selection criteria. Some data concerning the psychometric properties may therefore be missing. In addition, different steps of the systematic review were carried out jointly for educational purposes, which may limit the quality of the methodology.
As the choice of information sources is limited to five databases, it is possible that some on-road assessments available from institutional or government reports have not been included. The same applies to tests developed in doctoral theses, for example. Finally, test-retest and inter-rater reliability were not differentiated when assessing the risk of bias using the COSMIN checklist. The readability of the results may thus be affected.

Conclusion
This systematic review identified 12 on-road evaluation instruments adapted for people with cognitive impairment. When compared with recommendations from the scientific literature, these instruments do not comply with scientific standards for medical diagnostic procedures. Following a single-step evaluation procedure, risks of falsely recommending driving cessation, which might lead to important health consequences, could still be present. This is particularly the case for most evaluation methods that do not provide the opportunity to express uncertainty in the results. Trichotomization is of great importance as it favors interventions that could help maintain driving, or recommend a transition to anticipate driving cessation and facilitate mobility transition.