Method-oriented systematic review on the simple scale for acceptance measurement in advanced transport telematics

Acceptance is intuitively a precondition for the adoption and use of technology. In this systematic review, we examine the academic literature on the “simple scale for acceptance measurement” provided by Van der Laan, Heino, and de Waard (1997). This measure is increasingly applied in research on mobility systems without having been thoroughly analysed. This article aims to provide such a critical analysis. We identified 437 unique references in three aggregated databases and included 128 articles (N = 6,058 participants) that empirically applied the scale. The typical study focused on a mobility system using a within-subjects design in a driving simulator in Europe. Based on quality indicators of transparent study aim, group allocation procedure, variable definitions, sample characteristics, (statistical) control of confounders, reproducibility, and reporting of incomplete data and test performance, many of the 128 articles exhibited room for improvement (44% scored below .50; range 0 to 1). Twenty-eight studies (22%) reported reliability coefficients, providing evidence that the scale and its sub-scales produce reliable results (median Cronbach’s α > .83). Missing data from the majority of studies limits this conclusion. Only 2 out of 10 factor analyses replicated the proposed two-dimensional structure, calling the use of these sub-scales into question. Correlation results provide evidence for convergent validity of acceptance, usefulness, and satisfying, albeit with limited confidence, since only 14 studies with a median sample size of N = 40 reported correlation coefficients. With these results, the scale might be a valuable addition to technology attitude research. Firstly, we recommend thorough testing for a better understanding of acceptance, usefulness, and satisfying. Secondly, we suggest reporting scale results more transparently and rigorously to enable meta-analyses in the future. The study protocol is available at the Open Science Framework (https://osf.io/j782c/).


Introduction
The "simple scale for acceptance measurement" from colleagues [1] (hereafter referred to as the Simple Scale) has been widely applied in transportation research. Researchers used the scale for subjective assessments of bicycles [2], helicopters [3], and other applications [4-13] in studies from North America [14,15] or Australia [16]. It was created as "a simple, standard tool for the assessment of acceptance that can be used by the majority of researchers and that allows a comparison of impact of new devices with other systems" [1]. However, aside from the original publication [1], no article has systematically investigated the Simple Scale regarding reliability, validity, and application contexts. Debates about the Simple Scale address the level of data the scale produces, ranging from "relative (ordinal) levels of rater acceptance" [17] to Likert-type interval-level data [18,19]. Some authors argue that acceptance includes additional facets beyond the dimensions usefulness and satisfying produced by the Simple Scale [20], e.g. perceived ease of use from the technology acceptance model [21] or perceived behavioural control from the theory of planned behaviour [22]. Others [23] see the Simple Scale, with its limitation of being only two-dimensional, merely as a starting point in designing a standardised measure for acceptance. While the scale might be intuitively useful and easy to use, its psychometric characteristics remain unclear.
The purpose of this paper is to understand how the Simple Scale is applied, how reliable and valid it is, and what results can be expected when it is used. These four questions are answered by a method-oriented systematic review of articles that empirically applied the Simple Scale in the various contexts listed above. As a result, researchers in transportation science are better informed about the strengths and weaknesses of the Simple Scale, which improves their work; they can interpret their results against the background of various other applications; and they gain insights into what to expect when they apply the scale. These are the main contributions of this method-oriented systematic review on the Simple Scale for acceptance measurement.

The Simple Scale
The original authors define acceptance of a technical system as "direct attitudes towards that system. Attitudes are here defined as predispositions to respond, or tendencies in terms of 'approach/avoidance' or 'favourable/unfavourable'" toward the system [1]. Accordingly, they used nine item pairs spanning a 5-point scale in the format of a semantic differential taken from colleagues' [24] catalogue of opinion measures (e.g., useful-useless, bad-good, or nice-annoying).
Having tested the measure in six studies and having calculated simultaneous component analyses with varimax rotation between samples (N = 291), the authors [1] identified two sub-scales: usefulness (items 1, 3, 5, 7, and 9) and satisfying (items 2, 4, 6, and 8). These exhibited reliability coefficients (Cronbach's α) in the range of .73 to .91 for usefulness and .81 to .94 for satisfying. The instructions on how best to apply the measure consist of seven steps [1]. The authors suggest (1) an instruction before technology use, (2) an instruction after technology use, (3) coding six items from +2 to -2 and three mirrored items from -2 to +2, (4) performing reliability analysis on both sub-scales, (5) calculating means for each item if reliability is sufficient (Cronbach's α > .65), (6) calculating means for both sub-scales usefulness and satisfying, and (7) calculating difference scores between the pre- and post-measures for both sub-scales [1].
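As a rough sketch, steps (3) to (6) of this procedure can be expressed in a few lines of Python. This is only an illustration under stated assumptions, not the original authors' implementation; in particular, the positions of the three mirrored items (`mirrored`) are placeholders, since the item order is not specified here.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a participants x items matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of sum scores)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def score_simple_scale(raw, mirrored=(2, 4, 6)):
    """Recode responses so every item runs from +2 to -2, then form the
    two sub-scale means. `mirrored` holds 0-based positions of the three
    mirrored items (placeholder values, not taken from the source)."""
    data = np.asarray(raw, dtype=float).copy()
    data[:, list(mirrored)] *= -1                       # step (3): flip mirrored items
    usefulness = data[:, [0, 2, 4, 6, 8]].mean(axis=1)  # items 1, 3, 5, 7, 9
    satisfying = data[:, [1, 3, 5, 7]].mean(axis=1)     # items 2, 4, 6, 8
    return usefulness, satisfying
```

Following step (4), one would only proceed to the sub-scale means if `cronbach_alpha` exceeds the authors' threshold of .65 for each sub-scale.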
The remainder of this article evaluates the Simple Scale and with it the success in developing "a simple, standard tool for the assessment of acceptance that can be used by the majority of researchers and that allows a comparison of impact of new devices with other systems" [1]. Since the Simple Scale has been used increasingly in recent years [25][26][27][28], such a systematic evaluation is necessary to understand its psychometric characteristics and to guide authors in further applications. Thus, this paper supports researchers in the field of transportation science interested in subjective evaluations of a system.

Research questions
We planned and designed the systematic review in accordance with the PROSPERO and AMSTAR guidelines for quality enhancement of systematic reviews [29,30]. It is registered in the Open Science Framework (link: https://osf.io/j782c/). The PRISMA guideline can be found in the S1 File. We did not formulate any restrictions on people, interventions, comparisons, outcomes, or study designs (PICOS). Since this is a method-oriented review, we were primarily interested in the performance of the scale for acceptance measurement. In accordance with other method-oriented systematic reviews [31], we formulated the following research questions:
• Q1: How do researchers apply the scale?
Comparing the contexts and research questions being investigated, together with the (descriptive or inferential) statistics used to answer them, provides insights into the use of this semantic differential.
• Q2: How reliable is the scale?
Comparing Cronbach's alphas across studies gives an indication of the scale's reliability. Additionally, factor extractions and model fit indices in exploratory and confirmatory factor analyses act as parameters to assess whether the scale produces the proposed two-factor structure.
• Q3: How valid is the scale?
Comparing the studies' findings regarding correlates provides a measure of the discriminant and convergent validity of the scale.
• Q4: What are mean results for acceptance measures?
Given sufficiently homogeneous scale applications, the weighted average and the distribution of effects give an indication of expected outcomes for the respective application context.

Literature overview
We conducted a systematic literature search for studies empirically applying the Simple Scale in May 2018. We searched the following databases:
• EBSCOhost (all databases included),
• Web of Science (Science and Social Science Citation Index), and
• Google Scholar,
using the identical search term: A simple procedure for the assessment of acceptance of advanced transport telematics.
In every database, this yielded one search result, namely the original research paper [1]. We marked the option to show all articles that cited this study and exported the resulting lists of citations to a blank EndNote library. With this procedure, we retrieved 559 citations. In successive steps, we reduced this population by removing duplicates, screening the titles and abstracts, and reading the downloaded full texts. All empirical applications of the Simple Scale, regardless of geographical region, were eligible for inclusion (i.e., all translations of the items), as long as the article was written in English or German. We excluded modifications of the scale's items (e.g., replacing "assisting-worthless" with "ugly-attractive" [32] or "nice-annoying" with "not nice-nice" [33]) but included modifications of the scale's range (e.g., 1 to 5 instead of the original +2 to -2 [34]). In the last step, we screened the reference lists of eligible articles to identify further results not listed in the three aggregated databases. We thus added 13 studies to our final population (N = 437 without duplicates). Fig 1 presents the PRISMA flow diagram of our literature search. Even with the support of our university librarian, we were unable to retrieve full texts for ten citations marked in the S2 File. After reading all retrieved full texts, 247 articles remained eligible for inclusion. We included all peer-reviewed articles in the analysis. Additionally, we included all conference proceedings and doctorate theses not already included as journal articles with a quality score ≥ .25 (see below). This led to the inclusion of 128 articles for analysis.

Coding
We coded all 247 empirical applications of the Simple Scale according to the first section of our coding manual presented in the S1 Appendix. It provided metadata of the articles including author names, year of publication, title of the study, geographical setting (country of data collection; if not reported, country of the first author's affiliation), institutional link, article type (peer-reviewed journal, conference proceedings, doctorate or graduate theses, reports, and books or book chapters), and journal name in the case of peer-reviewed journal publications. We coded the included 128 articles according to the remaining sections of the coding manual. Its second section covered the studies' designs and contents, namely the domain of study, study design (e.g., within- or between-subjects and longitudinal or cross-sectional data collection), research questions, methods, study outcomes, sample size and characteristics (gender and age), and (experimental) conditions. The third section included specifications of the Simple Scale applications, namely reports of the scale's level (e.g., ordinal, interval, or Likert) and range, presentation of scale results (numbers in text or table, bar charts, figures, aggregated or item-wise, or a two-dimensional diagram), factor loadings on each sub-scale, and medians, means, standard deviations, and reliability coefficients of both sub-scales and the entire scale. The fourth section dealt with relationships of the scale with itself and other constructs and included variables of the analysed model, correlates of the Simple Scale, and other statistics. The last section dealt with miscellaneous aspects such as translation and adaptations of the scale, comments, and the team members extracting the data.
The first author coded all included articles. Four team members gave support in coding and discussions. In contrast to other methodological reviews [31], we did not apply independent coding. The resources needed to double-code all 128 articles on 40 categories (at least 5,120 cells in the spreadsheet) would vastly outweigh the benefits of independent coding, particularly since most codes in all sections consisted of copying and pasting content without assessment or decisions. Only the code 'domain of study' involved category formulation and allocation. This was done in a team meeting with four team members, all of whom had prior experience with the method.

Risk of bias quality appraisal
In order to estimate the risk of bias, we assessed the quality of the 128 included articles using eight items from colleagues [35] covering the aspects of study aim, group allocation procedure, variable definitions, sample characteristics, (statistical) control of confounders, incomplete data, reproducibility, and test performance reporting. Each item provided a score between 0 (criterion not met) and 1 (criterion met). The items and corresponding codes are presented in the S2 Appendix. Each article was independently coded by the first author and one of three other researchers. A set of 30 articles was used as training material. After those were assessed independently, all four coders met to discuss interpretations of the questions and applications of the criteria. After aligning the approaches, the remaining articles were coded independently. This dataset formed the basis for the calculation of Cohen's kappa as a measure of interrater reliability. Conflicts after completing the quality appraisal were resolved in three meetings between the researchers. We calculated an overall quality score for each article by averaging the answers of all applicable items [35]. The overall quality score ranged between 0 (low quality) and 1 (high quality). We used a t-test for independent samples to compare quality scores between articles with one group and articles with more than one group (i.e., with between-subjects conditions). We calculated an ANOVA to compare quality scores between the article types "peer-reviewed journal article", "conference proceeding", and "doctorate thesis". For all analyses, we used α = .05 as the significance level. Lastly, we analysed the difficulty (i.e., relative frequency of "criterion met") and item-scale correlations of the items.
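The two quantities at the heart of this step, interrater agreement and the overall quality score, can be sketched as follows. This is a minimal illustration only; the actual appraisal used the eight items from [35].

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two raters coding the same items:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # chance agreement from each rater's marginal category frequencies
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def quality_score(item_ratings):
    """Overall quality score: mean of all applicable items (each 0 to 1);
    non-applicable items are passed as None and skipped."""
    applicable = [r for r in item_ratings if r is not None]
    return sum(applicable) / len(applicable)
```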

Statistical analyses
We calculated descriptive statistics of the articles' metadata, i.e., country and context of origin and year and type of publication, as well as other features such as sample characteristics, scale range, or presentation of scale results. From these analyses, we could derive typical Simple Scale applications suitable for answering our first research question. For participants' age, we estimated mean ages from categories by assuming an equal distribution of individual ages within the categories. Because of incomplete reporting in the study population, we can only partially answer research questions Q2-Q4, using descriptive analyses and a narrative synthesis instead of the planned meta-analytic procedures.
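The age estimation from categories can be sketched as follows: assuming a uniform distribution of ages within each category, a category's contribution to the mean reduces to its midpoint weighted by its count. The input format here is a hypothetical one chosen for illustration.

```python
def mean_from_age_categories(categories):
    """Estimate a mean age from grouped counts.
    categories: iterable of ((low, high), count) pairs; assuming a
    uniform age distribution within each category, each category
    contributes its midpoint weighted by its count."""
    total = sum(count for _, count in categories)
    weighted = sum((low + high) / 2 * count
                   for (low, high), count in categories)
    return weighted / total
```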

Literature review
We identified 437 unique references. Of those, 247 applied the Simple Scale empirically: 90 peer-reviewed journal publications, 84 conference proceedings, 32 doctorate or graduate theses, 25 reports, and 16 books or book chapters. An EndNote library with all references can be found in the S2 File next to a spreadsheet with codes for section A of the coding manual for the 247 articles (S1 Dataset). We included peer-reviewed journal articles, conference proceedings, and doctorate theses (N = 128) in further analyses.
The combined sample size of the 128 studies was N = 6,058 (range 3 to 387; median 32). Note that in some cases the same dataset had been used for more than one publication (e.g., N = 72 in [36,37]), and that some articles and theses used more than one sample in more than one study (e.g., [38][39][40][41][42]). Of all studies reporting gender distribution (112 articles; N = 5,462), 57% of participants were male. The mean age of participants was M = 37.15 years in the 100 studies (N = 4,289 participants) reporting means. The estimated mean age of participants was M = 37.62 years in the 14 studies (N = 546 participants) reporting only age categories. The remaining studies with N = 1,223 participants did not report age in a way that allowed estimating a mean.

Quality appraisal
Cohen's kappas between the first author and the other three researchers were .53, .54, and .71 before, and 1, 1, and 1 after conflict resolution, respectively. The largest discrepancies in appraisal were in items 3, 5, and 7 with 59%, 61%, and 71% initial agreement, respectively. Item statistics are displayed in Table 1. Codes for each article and item can be found in the S2 Dataset. The quality appraisal tool had a reliability of Cronbach's α = .47, suggesting that these items do not form a narrow, one-dimensional construct of quality. This is as expected, since we aimed to include different facets of quality not contingent on one another.
Overall, quality scores were low (M = .55, SD = .17; scale range 0 to 1). Sixty-five articles (51%) received a score above .50. The majority of studies at least partially reported their aims (item 1) and their procedures for reproduction (item 7). Surprisingly, only a minority of studies defined the constructs they used to fulfil their stated aim (item 3) or reported test performance indicators such as Cronbach's α (item 8). Low quality scores mean that a study is more difficult to interpret and reproduce because important information is missing.
Coding of item 6 was difficult, since an absence of documentation of protocol violations and missing data might also be the result of there being no protocol violations and no missing data. However, this would mean that the majority of studies had no missing data whatsoever, which is a far-fetched assumption for empirical attitude research. Removing item 6 from the overall quality score calculation resulted in slightly improved quality scores across articles (M = .58, SD = .17), with 69 articles (54%) receiving a quality score above .50.

Applications of the Simple Scale (Q 1 )
Our first research question addressed the application of the scale regarding the studies' metadata. The 128 articles were published by 313 different authors. For peer-reviewed journal articles, journals with the most publications were Transportation Research: Part F (23), Accident Analysis & Prevention (14), Applied Ergonomics (7), and Human Factors (6). The 128 articles spanned 22 years of research with a focus on the recent years (57% of articles published since 2014). Most applications of the scale emerged from technical and engineering departments of research institutions focusing on transportation. We identified 17 topics of focus in the included articles with driver assistance systems (45), automated driving (21), intelligent speed adaptation (14), vehicle safety systems (11), and electric vehicles (11) being the most frequent.
Geographically, the 128 studies were conducted in 15 different countries with 75% of studies emerging from Germany (40), The Netherlands (30), USA (18), and the UK (14). Most studies collected data within subjects (80), some between subjects (15), and the remaining studies within and between subjects (38). Seven studies additionally used a longitudinal design over multiple weeks or months. The vast majority of studies used a (driving) simulator (77), field trials (39), or both (3). The remaining studies relied on online questionnaires or in-lab mock-up equipment other than simulators.
Most publications did not test theoretical models with variables explaining certain outcomes such as system use or acceptance [43,44]. Instead, the typical application of the Simple Scale consisted in its loose connection to a paper otherwise concerned with the technical aspects of a new system in transportation. Here, speed, lateral offset, absolute driver torque, steering wheel angle, glance duration, or reaction times were assessed to evaluate the system's performance. It seemed the Simple Scale was an add-on to enhance technical arguments with a subjective assessment from the users. This is exemplified by colleagues [8] who, after explicating technical aspects and tests at length, stated "[i]n addition, subjective evaluations were conducted to check for system acceptance". Articles centring on acceptance such as [9] ("[t]he core of this work is an extensive SEM analysis on the factors driving smart charging acceptance") were the exception. Consequently, application, reporting, and presentation of the scale's results varied and were in many cases incomplete. Seventeen studies used a different scale range than the original -2 to 2 (e.g., 1 to 5, 1 to 7, or -50 to 50), and 18 studies did not report the scaling, leaving 93 studies (73%) reportedly using the Simple Scale in its original range. Twenty-one studies erroneously reported that the semantic differential consisted of Likert scales. Some studies adapted the items to form an actual Likert scale measuring (dis-)agreement [34].
Only eight studies reported factor analyses to test the two-dimensional structure of the Simple Scale, and only 28 studies reported reliability coefficients as a measure for scale accuracy (see next section). Nonetheless, 78 studies formed means corresponding to the two sub-scales usefulness and satisfying without reporting whether data structure and scale characteristics allow for this procedure. Six studies reported the scale's or sub-scales' medians.
The majority of studies (73) reported descriptive statistics of the Simple Scale as numbers in tables or text. The remaining studies used illustrations such as a two-dimensional diagram with the two sub-scales usefulness and satisfying as dimensions (21), bar charts (18), or other figures (11). Three studies used plain text, and the remaining three studies did not report descriptive statistics from the Simple Scale.
Fourteen studies reported relationships between the Simple Scale and other constructs resulting in 70 estimates. We used these for answering the third research question below. Table 2 presents all 128 articles with the information listed above. A spreadsheet with codes for all 128 articles can be found in the S3 Dataset.

Reliability (Q 2 )
The second research question addressed the reliability of the scale. The original authors [1] argued that Cronbach's α > .65 suffices for the sub-scales' reliabilities. However, recent articles argue for increased lower and upper limits of reliability whilst criticising Cronbach's α as a measure that overestimates reliability if its assumptions are violated [144][145][146]. Based on these debates, we consider values of Cronbach's α ≥ .80 as acceptable measures of reliability for established scales.
Based on the median coefficients and the weighted means, the Simple Scale and its two sub-scales usefulness and satisfying can be seen as reliable. However, missing data from 100 studies limits the certainty of these results considerably.
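The weighted means referred to here can be computed as sample-size-weighted averages of the reported coefficients; a minimal sketch with hypothetical values:

```python
def weighted_mean(coefficients, sample_sizes):
    """Sample-size-weighted mean, e.g. of Cronbach's alpha values
    reported across studies with different N."""
    total_n = sum(sample_sizes)
    return sum(c * n for c, n in zip(coefficients, sample_sizes)) / total_n
```

For example, `weighted_mean([0.8, 0.9], [10, 30])` weights the second (larger) study three times as heavily as the first.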
Only eight studies with N = 869 participants calculated a total of ten exploratory factor and principal component analyses. Two of these analyses yielded the proposed two factors usefulness and satisfying [9,13]. Three factor analyses resulted in only one factor named acceptance [36,

Validity (Q 3 )
The third research question addressed validity via correlations with closely related constructs. Fourteen studies (N = 1,360) reported correlations of the Simple Scale and its two sub-scales with other constructs.
Notes to Table 2: e. Standard deviation estimate calculated from categories. f. The reported SD = 11.3 for the age group 65-75 years lies outside the mathematically possible range; thus we refrained from estimating the total SD. (https://doi.org/10.1371/journal.pone.0248107.t002)
Table 3. Reliability coefficients of the Simple Scale and its two sub-scales usefulness and satisfying across 28 studies.
The entire scale correlated with other measures of usefulness, e.g. perceived usefulness from the TAM (r = .88) and performance expectancy from the UTAUT (r = .86) [20], as well as with other measures of satisfaction, e.g. comfort (r = .71) and enjoyment (r = .38) [87]. These results indicate convergent validity of the Simple Scale [147].
A limitation of these correlations was the median sample size of N = 40 in the 14 studies. Such a low N reduces test power considerably. We refer to colleagues who have demonstrated the effect of small sample sizes on the informative value of correlation analyses [148,149].

Acceptance scores (Q 4 )
The fourth research question addressed the values of the Simple Scale across studies within homogeneous application scenarios. In total, 111 studies (N = 5,046) reported 432 means for the sub-scale usefulness, 430 means for the sub-scale satisfying, and 34 means of the entire Simple Scale. Means presented in figures were estimated by the authors using lines in MS PowerPoint. Only 261 means of the scale and sub-scales (29%) were accompanied by corresponding standard deviations, a necessary condition for estimating standard errors when aggregating results across studies. Lastly, application contexts varied, introducing critical heterogeneity into the data. Driver assistance systems were the most frequently researched topic. However, even within this study population, the applied technologies varied between haptic steering guidance, fatigue monitoring, congestion assistants, and forward collision warnings. These issues, the lack of standard deviations to estimate standard errors and the heterogeneity of application contexts, inhibit any sensible calculation of aggregate scores of the Simple Scale.
The only tendency we could deduce from this database was generally larger usefulness than satisfying scores. Means for both sub-scales were reported in 424 instances across 97 studies (N = 4,095). In 318 of these cases (77%), the mean for usefulness was higher than the mean for satisfying across 15 different research topics.

Discussion
This systematic literature review assessed applications of the "simple procedure for the assessment of acceptance of advanced transport telematics" [1], a nine-item semantic differential scale measuring acceptance with the two sub-scales usefulness and satisfying, whose popularity is increasing and whose systematic evaluation has been pending. In sum, 128 publications with N = 6,058 participants provided results of the scale. In this section, we discuss findings about the scale, followed by a reflection on how the scale was applied and how results were reported.

Scale
Our most important finding questions the two-factor structure of the Simple Scale. Only two out of ten factor and principal component analyses were able to replicate both sub-scales. The combined sample size of these analyses (N = 869 in eight studies) outnumbered the original authors' [1] own sample threefold, producing more convincing results. Instead, the Simple Scale might produce a single acceptance score with high internal consistency (median Cronbach's α = .90). Reported correlation coefficients between the two sub-scales were high (r ≥ .55; four studies with N = 329 participants), suggesting a close relationship between usefulness and satisfying. This might explain why the two-factor structure was not replicated in the majority of factor and principal component analyses included in this review.
We thus recommend that researchers who apply the scale calculate exploratory, or better confirmatory, factor analyses with correlated factors and report their factor loadings together with model fit indices before using usefulness and satisfying scores. We refer to references [150][151][152] for more information on these procedures. Research on safety equipment in mobility and on other emerging technologies with the potential to disrupt markets relies on valid results. Objectivity (i.e., transparent and clear reporting) and reliability (i.e., checking test performance) are necessary to provide valid results and should thus be considered paramount in all fields of research.
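As a minimal sketch of the exploratory side of this recommendation, a principal component analysis of the item correlation matrix (one of the analysis types counted in this review) can be computed in a few lines of numpy; dedicated packages or structural equation modelling software would be needed for confirmatory factor analyses with fit indices.

```python
import numpy as np

def pca_loadings(data, n_components=2):
    """Principal component loadings from the item correlation matrix:
    eigenvectors of the correlation matrix scaled by the square roots
    of their eigenvalues."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)         # eigenvalues ascending
    top = np.argsort(eigvals)[::-1][:n_components]  # largest components first
    return eigvecs[:, top] * np.sqrt(eigvals[top])
```

For the Simple Scale, a clean two-factor solution would show the usefulness items loading on one component and the satisfying items on the other.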
As a second major finding, we identified the tendency of the Simple Scale to produce higher means for usefulness than for satisfying in 77% of cases (97 studies with N = 4,095 participants). A first explanation for this finding is that the researched systems are indeed more useful than satisfying. This might particularly be the case for systems that interfere with the (driving) decisions of participants to increase safety. Such systems might understandably be rated as more useful than satisfying. However, the tendency was observed across 15 different research topics. Thus, an alternative explanation points towards a possible method effect of the Simple Scale itself. Here, participants might be inclined to respond more affirmatively to the five items for usefulness than to the four items for satisfying because of the items' wording. A method effect would explain the finding of higher usefulness than satisfying scores across research topics. However, without the possibility to meta-analyse, both explanations remain plausible and the result can only be seen as a tendency.

Applications and reporting
Reporting of scale results was so limited that it was not possible to assess the scale using meta-analytic procedures. For example, only 26% of reported means were accompanied by standard deviations, and only 22% of studies reported reliability coefficients. This was surprising, since the original authors [1] themselves instructed researchers applying their scale to calculate Cronbach's α as a measure of scale performance. We found that only half of the included studies (52%) received a quality score above .50 (scale range 0 to 1), using the reporting of aims, sample characteristics, variable definitions, test performance, reproducibility, and missing data as indicators of study quality. These findings are worrying and need contextualising.
We identified that the Simple Scale is typically applied in papers predominantly concerned with the technical aspects of a new system in transportation. Understandably, technical aspects (e.g., lateral offset, glance duration, or reaction times) are paramount for the systems' performance and evaluation, particularly in the engineering and transportation research departments where most publications on the Simple Scale emerged. Ideally, subjective assessments using psychometric scales are applied and reported with as much rigour and conscientiousness as their objective technical counterparts. We thus urge researchers to critically reflect on their use of subjective measures and to report as extensively on the scales' performance and results as journal guidelines allow. Only then is it possible to assess method effects and data structure using meta-analytical procedures.