On the quality of quantitative instruments to measure digital competence in higher education: A systematic mapping study

In this study, we report on a Systematic Mapping Study (SMS) of how the quality of the quantitative instruments used to measure digital competencies in higher education is assured. 73 primary studies were selected from the literature published in the last 10 years in order to 1) characterize the literature, 2) evaluate the reporting practice of quality assessments, and 3) analyze which variables explain such reporting practices. The results indicate that most of the studies focused on medium to large samples of European university students attending social science programs. Ad hoc, self-reported questionnaires measuring various digital competence areas were the most commonly used method for data collection. The studies were mostly published in low-tier journals. 36% of the studies did not report any quality assessment, while less than 50% covered both groups of reliability and validity assessments at the same time. In general, the studies had a moderate to high depth of evidence on the assessments performed. We found that studies in which several areas of digital competence were measured were more likely to report quality assessments. In addition, we estimate that the probability of finding studies with acceptable or good reporting practices increases over time.


Introduction
In a world governed by Information and Communication Technologies (ICTs) [1], most of the essential processes of modern society are automated in one way or another [2]. For this reason, it is essential that professionals have sufficient ICT skills and competencies, adapting to the demands of the modern working world [3,4]. Higher education institutions play an essential role in this context [5] by integrating different strategies to provide these digital competencies to their educational community [6] and by building coherent evaluation processes and instruments. The latter is particularly important not only for diagnosing the educational community, but also for verifying the extent to which an intervention program has been effective. The act of evaluating digital competencies is, from a theoretical perspective, a measurement task. As a consequence, it largely depends on the quality of the employed instrument. Regardless of the structure of the instrument, its quality is given by the degree to which it fulfills two psychometric properties: reliability and validity [7]. As recent reviews showed [8,9], the topic of digital competence in higher education comprises a large and fertile body of studies. This literature has been characterized on several occasions from different perspectives, ranging from concept use [10,11] to organizational infrastructures, strategic leadership, and teaching practices [12].
However, the evaluation process itself and, more specifically, the quality of the employed instruments are two important topics that have not been extensively addressed in the past. So far, there is still uncertainty about how and to what extent studies ensure that the instruments used are adequate for measuring digital competencies in higher education. In our opinion, characterizing the current literature on these issues is relevant for both researchers and practitioners. In the first case, researchers are provided with an overview of 1) the main features of the studies and 2) the trends in reporting quality assessments. As a consequence, the current literature is assessed and some important research opportunities to explore in the near future are identified. Additionally, such a characterization would help us to reflect on what we have done well and what we have not when conducting or reporting on quality assessments. In the case of practitioners, since the literature is assessed according to how the quality of the employed instruments has been assured, they are provided with a list of top-quality studies that serve as a good starting point for reusing previous experiences.
In order to shed light on these issues, in this paper we conducted a systematic mapping study [13]. More specifically, we aim at 1) characterizing the literature demographically and methodologically, 2) describing how and to what extent studies assured the quality of the employed instruments, and 3) identifying which studies' features explain certain reporting practices. We specifically focused on literature using quantitative instruments in the form of questionnaires. The rationale behind this choice, as noted by [11], is that these data collection methods are among the most commonly employed by researchers in this field. Therefore, it is expected that the results obtained can characterize the vast majority of published studies on this topic.

Digital competence

The concept of digital competence has been defined in several ways. For instance, [14] define it as comprising "(1) technical skills to use digital technologies, (2) the ability to use digital technologies in a meaningful way for working, studying and in everyday life, (3) the ability to evaluate digital technologies critically, and (4) motivation to participate and commit in the digital culture." (p. 655).
Another definition of digital competence can be realized by approaching this concept through the frameworks defined in both the scientific context [8] and that of government policies [15]. A notable example in the latter case is the Digital Competence Framework for Citizens (DigComp) created by the European Commission. This framework has gone through three major versions. In the first one, DigComp 1.0 [16], 21 digital competencies (dimension 2), described in terms of knowledge, skills and attitudes (dimension 4), were considered. Moreover, these competencies were organized into 5 areas (dimension 1): Information, Communication, Content-creation, Safety and Problem-solving. To assess how competent a citizen is, DigComp 1.0 proposes three proficiency levels (dimension 3): A (foundation level), B (intermediate level) and C (advanced level). The framework also includes examples of the application of digital competencies for different purposes (dimension 5).
The main contribution of the second version, DigComp 2.0 [17], was the redefinition of some concepts and terms (dimensions 1 and 2). However, it maintained the five areas of digital competencies of the first version. Finally, the third version, named DigComp 2.1 [18], established 8 proficiency levels (dimension 3) instead of the 3 defined by DigComp 1.0. These levels were numbered consecutively from 1 to 8 with the following distribution: Foundation (1 and 2), Intermediate (3 and 4), Advanced (5 and 6) and Highly specialized (7 and 8). It is important to note that DigComp 2.1 did not include an update to dimension 4, related to knowledge, skills and attitudes. Instead, the authors focused on dimension 5, showing examples of use of the framework in the employment and learning contexts.
From the above-mentioned definitions, it is easy to conclude that measuring the degree or level of digital competence involves at least three components: knowledge, skills and attitudes. In other words, the development of an instrument for measuring digital competencies should include assessment items related to these three components.
Existing literature on the subject includes several works related to our research topic and type (secondary study). Table 1 summarizes the current related studies. These studies were selected from the systematic search explained in Sec. 4, but considering only secondary studies published in the period 2016-2020.
As seen in the table, only five studies addressed the topic of digital competencies evaluation: [8,11,19-21]. However, [19] did not focus on higher education, and the analyzed studies are only from Latin America. Similarly, [8] did not focus on higher education and is not a systematic review. In the case of [11], the authors included some relevant factors for the evaluation process, such as the method of data collection (instrument) and the study area of the participants. However, it is not clear which works actually evaluated digital competencies. Regarding these two factors, the authors concluded that most of the studies use mixed methods or surveys for measuring, and are based on populations from different knowledge areas. An important limitation of this study is that it dates back to 2018, so more recent contributions are not present in the review. The study conducted by [20] found that programs and actions developed by higher education institutions (HEIs) leave out the development of competencies in content creation and safety. To identify this gap, the authors used the DigComp 2.1 framework [18] as a reference. Finally, in the meta-analysis developed by [21], the authors found that, in the field of higher education in Latin America, the proportion of students and teachers with digital skills is moderate (64%), with no notable differences between the two types of populations.
Regardless of the progress achieved by the above-mentioned studies, some important aspects of the process of evaluating digital competencies in higher education remain unexplored. This is the case for the quality of the instruments employed for conducting such an evaluation.

Quality of quantitative instruments
When developing a quantitative instrument, it is important to assess its quality [7,28,29]. This is a process that mainly depends on assessing its reliability and validity [29]. From a psychometric perspective, the first property indicates whether the instrument provides the same (or similar) results under similar conditions or inputs, while the second one indicates whether it measures what it is supposed to measure [28]. Several methods to conduct these assessments have been proposed in the past [7] and have been classified into different categories. For instance, the detailed review provided by [30] identified four types of both reliability and validity assessments, as shown in Table 2.
Ideally, a study developing or administering an instrument should present enough details about these eight assessment types. However, this is not always possible because of research limitations (e.g., time, lack of another instrument to compare with, access to the participants). In any case, it is important to conduct an assessment of at least one of these types in order to guarantee a suitable degree of consistency and accuracy for the instrument [28]. Even if the instrument has been proposed and validated in a previous study, it is good practice to check its reliability and validity [28].

Related work on quality evaluation of quantitative instruments
Critically evaluating the quality of quantitative instruments is not a new research topic and has been developed for quite some time in various areas of knowledge, such as education, psychology and health. In what follows, we will review some of the most important reported experiences, emphasizing the conclusions related to the quality of the instruments considered.
In [31], the authors focused on evaluating the quality of the methods used in high-quality trials of continuing medical education. Of the 136 studies selected, only 34.6% reported reliability or validity assessments. In the same context of medical education, in [32] the authors reviewed the instruments and questionnaires used for peer review published up to May 2010. In line with the results of the previous study [31], it was found that most of the questionnaires did not provide sufficient psychometric data.
Smokeless tobacco dependence measures were the focus of the review conducted in [33]. Based on the 4 selected studies, the authors concluded that the instruments analyzed have limitations in terms of reliability and validity. The same difficulties were detected in the critical synthesis developed in [34] on the so-called Implicit Relational Assessment Procedure, a computer-based psychological measure. From 31 studies published before March 2013, the authors concluded that although there is growing evidence of validity in the studies, they lack sufficient reliability to ensure replicability.
In a more extensive work including 53 studies published in the period 1995-2012, [35] reviewed studies that administered the Parenting Style and Dimensions Questionnaire instrument. In this case, the authors highlight that only a few studies involved complex reliability and validity assessments.
The apathy scales validated in generic and specific neurodegenerative disease populations were the focus of the review conducted in [36]. Across the 16 studies analyzed, the authors found a great heterogeneity of results. More specifically, the methodological quality of the studies ranged from poor to excellent.
In an educational context, [37] analyzed the validity and reliability of the Objective Structured Clinical Examination (OSCE) with nursing students. By reviewing 19 papers published up to April 2016, the authors concluded that validity and reliability were adequate in most studies. However, considering that one of the selection criteria in the search conducted by the authors was precisely to include psychometric assessments, this result was somewhat expected. A more objective conclusion is obtained if the 14 studies excluded by the authors for not meeting this criterion are taken into account. In this sense, the 19 studies represent approximately 58% of the relevant studies.

Table 2. Types of reliability and validity assessments, their definitions, and typical assessment methods (adapted from [30]).

Reliability
• Internal consistency: the extent to which all items in the scale measure the same attribute. Typically assessed through Cronbach's alpha.
• Stability: the extent to which repeated administrations of the scale to the same sample yield consistent results over time (i.e., test-retest reliability).
• Equivalence: the extent to which parallel administration of the same scale shows consistent results. Assessed through the use of the scale by different administrators at the same time (i.e., inter-rater reliability) or by administering two parallel forms of the same scale to the same sample successively (i.e., alternative-form reliability).
• Scalability: the extent to which individual items in the scale measure the latent trait that is being measured, and do so distinctly from other items in the scale. Assessed through Mokken scaling.

Validity
• Face validity: the extent to which the scale is understandable and perceived as relevant by the subjects, to ensure their cooperation and motivation. Not tested using statistical procedures; subjects, experts or the researcher may be involved in the consideration of whether a scale appears to be relevant.
• Content validity: the extent to which the scale adequately samples all possible questions that exist. Assessed through critical review by an expert panel for clarity and completeness, comparison with the literature, or both.
• Criterion validity: the extent to which the scale aligns with criterion measures that have been established as valid. Assessed as concurrent validity (information about the criterion is available at the time the test is administered) or predictive validity (information about the criterion measure is obtained after the test has been administered).
• Construct validity: the extent to which the scale correlates with the construct under investigation. Typically assessed through factor analysis (exploratory or confirmatory).
A great heterogeneity of results was also observed by [38] in their evaluation of the replicability, comparability and validity of quality assessment tools for urban green spaces. This work was based on 15 primary studies published up to July 2019.
In [39] the authors summarized the instruments used to measure constructs of marital quality by analyzing 91 primary studies. As the authors indicate, most of the instruments reported include sufficient exploratory evidence of construct validity, but without explicitly defining the construct under study.
In the context of nursing education, the review developed by [40] aimed to determine how valid and reliable simulated patient scenarios are. Relying on 17 studies found in the Cumulative Index to Nursing and Allied Health Literature, the authors conclude that academics are inconsistent in developing both reliable and valid simulated scenarios.
Similarly, in [41] a comprehensive review of the literature was conducted on instruments measuring the competencies of special educators. In total, 20 instruments reported by 29 studies were characterized. The authors found that only 11 instruments (i.e., 55%) have evidence of reliability and validity assessment.
What these experiences tell us is that there are serious problems related to the quality of quantitative instruments. This is an issue that affects several areas of knowledge, including the education sciences. However, the current state of the instruments used to measure digital competencies in higher education remains unknown. Thus, the results of our work will allow us to verify, among other things, to what extent this field is affected by this issue.

Methodology
This paper follows the methodology described by [13] for conducting systematic mapping studies, which is a type of secondary study [42]. It is also a correlational study, since we aim at analyzing the association between the variables under study.
According to the selected guide, mapping is achieved through three main steps: planning, conducting, and reporting. The following sections describe how the first two steps were developed, while the third one is fulfilled by writing this paper.
The selection and data extraction processes were carried out in parallel by two authors. In order to evaluate the concordance of the results of these processes, we relied on Cohen's kappa (κ). However, it is important to clarify that although we could have used other more sophisticated indicators [43], we consider Cohen's kappa to be sufficient for our purposes. Our decision is in line with previous research such as [44], where Cohen's kappa was employed for similar aims.
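As an illustration of this procedure, both the observed agreement (p0) and Cohen's kappa can be computed directly from the two coders' paired decisions. The following minimal sketch uses hypothetical screening decisions and scikit-learn's cohen_kappa_score:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical include/exclude decisions of two coders over the same studies.
coder_a = ["include", "exclude", "include", "include", "exclude", "exclude"]
coder_b = ["include", "exclude", "include", "exclude", "exclude", "exclude"]

# Observed agreement (p0): proportion of identical decisions.
p0 = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
# Cohen's kappa: agreement corrected for chance.
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"p0 = {p0:.3f}, kappa = {kappa:.3f}")
```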

Research questions
The main goal of this research is to provide an overview of how the literature related to the evaluation of digital competencies in higher education reports on the quality of the employed instruments (i.e., questionnaires). We consider the literature published from January 2010 to July 2020, which is a time frame commonly used in literature reviews in the field [9,12,45].
More specifically, we were interested in answering the following research questions:
• RQ1. What are the main demographic and methodological features of the studies evaluating digital competencies in higher education?
• RQ2. How do the studies report on the quality assessments of the employed instruments?
• RQ2.1. What types of assessments and what specific methods are most often reported?
• RQ2.2. How comprehensive and how deep are these quality reports?
• RQ2.3. What studies achieve the best balance between coverage and depth in their reports?
• RQ3. What studies' characteristics are more likely associated with certain reporting practices?
• RQ3.1. How have the studies' reporting practices evolved over the years?

Search
To design the search formula for finding the relevant studies, we proceeded as follows. First, we used the PICO tool [46] to identify the relevant terms according to the population under study, intervention method, comparison group, and outcomes. Second, careful readings of similar reviews such as those of [11,21] helped complement the results of applying the PICO tool. As a result, a search formula combining these terms was defined. This formula was used to search three relevant databases: Scopus, Web of Science, and ERIC (Education Resources Information Center). These databases cover a great part of the scientific literature on the Education Sciences and are widely used in review studies related to this field [11,12].
The results obtained from applying the above formula are shown in Table 3. It is worth noting that the search was conducted in July 2020.

Selection of the studies
During the selection, we followed several steps, which are summarized in Fig 1. We considered the following inclusion and exclusion criteria:

Inclusion criteria:
• Studies measuring digital competencies quantitatively in the context of higher education
• Studies published in the period from January 2010 to July 2020
• Studies published as journal articles

Exclusion criteria:
• Studies published in conference proceedings or book chapters

The selection process was conducted by two authors independently, from the Screening step up to the Eligibility step (Fig 1). In the first case, the observed agreement and the chance-corrected agreement were p0 = 0.928 and κ = 0.856 (p-value < 0.001), respectively. As for the Eligibility step, these values were p0 = 0.973 and κ = 0.938 (p-value < 0.001), respectively. In the case of the 36 studies rejected after reading their full texts, the particular reasons were as follows: 23 due to the exclusive use of qualitative data collection techniques, and 13 for not specifically assessing digital competencies.

Data extraction
Data extraction was conducted based on the template depicted in Table 4. As can be observed, 10 variables related to both demographic and methodological features were considered. Regarding reliability and validity assessments, 4 different types were included for each group (8 in total) [30]. For each type, we recorded the extent to which the study reported on the assessment, that is: Not mentioned, Only mentioned, Referenced (a previous study), or Details are provided. These levels were mapped to numerical values (0, 1, 2 and 3, respectively) in order to compute specific indicators for characterizing the studies' reporting practices. Such indicators are described in the next section.
As in the selection process, data extraction was carried out by two authors in order to mitigate personal biases, especially when evaluating studies. In this case, the observed agreement and the chance-corrected agreement were p0 = 0.965 and κ = 0.904 (p-value < 0.001), respectively.

Analysis and classification
Analysis and classification of the studies were carried out after data extraction. The obtained results were tabulated and visually summarized as shown in Sec. 5. A complete list of the reviewed studies and their corresponding classification according to the data extraction template is available at https://osf.io/me36k.

In order to characterize the reporting practice of the studies, we defined three indicators, which are computed from the extracted data and measure three different features of the studies. The first one, which we have called External Coverage, represents how exhaustive the study is in conducting both reliability and validity assessments of any kind. Here, three cases are possible: 1) the study has not reported on any group of assessments, 2) the study reported on a single group (reliability or validity), and 3) the study reported on both groups. These three cases were labeled and numerically coded as None = 0.0, Moderate = 0.5 and High = 1.0, respectively.
Similarly, the second indicator, which we named Internal Coverage, measures how comprehensive the study is within the group or groups of assessments it reported on. We have defined this indicator as the proportion between the number of assessment types conducted by the study and the number of possible assessments within the reported group or groups. Given that each group (reliability or validity) has 4 types of assessments, the total number of possible assessments is 4 if the study only reports on a single group, and 8 if it reports on both groups at the same time. In turn, this indicator ranges from 0 (conducting no assessment types at all) to 1 (conducting all assessment types within the group or groups), with values in between corresponding to different degrees of coverage.
The third indicator, Reporting Depth, measures how deeply the study reports on the conducted assessments, regardless of its external or internal coverage. It is computed as the normalized average of the reporting levels achieved by the study in the types of assessment that were conducted. These reporting levels are the ones defined in Table 4, together with their numerical codes. For example, a study conducting only one type of assessment with a reporting level of Referenced = 2 will have a Reporting Depth of 2/(1 × 3) ≈ 0.667. Note that we divided by 1 × 3 because only one (1) type of assessment was conducted, while the maximum value a reporting level may achieve is 3 (corresponding to Details are provided). Therefore, this indicator ranges from 0 (no depth) to 1 (maximum depth). Of course, the latter case corresponds to the studies providing details on all of the conducted assessments.

From these indicators it is possible not only to rank the studies, but also to characterize their reporting practices. Note that this is necessary in order to answer research questions RQ2.3, RQ3, and RQ3.1. Here, three different approaches from multi-criteria decision analysis can be adopted [47]: 1) the full aggregation approach, 2) the outranking approach, or 3) the goal, aspiration or reference-level approach. Since, in our context, it is possible to know with certainty both the ideal and the anti-ideal studies according to the three indicators defined above, the third approach was adopted. Specifically, we applied the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) [48], which assumes that the best alternative (study) is the one with the shortest distance to the ideal alternative and the furthest distance from the anti-ideal alternative [47]. Note that the ideal study is the one with External Coverage = Internal Coverage = Reporting Depth = 1. Contrarily, the anti-ideal study is the one with External Coverage = Internal Coverage = Reporting Depth = 0. Following the steps of the TOPSIS method [47], we computed the Relative Closeness coefficient for each study using the Euclidean distance and the same weights for the three indicators. This coefficient ranges from 0 to 1, where a value approaching 1 means that the study is close to the ideal study, while a value approaching 0 means the opposite. Conceptually, this ratio provides a good insight into how satisfactory the quality assessment reporting process of the study was.
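To make these computations concrete, the following minimal sketch (an illustrative encoding of the definitions above, not the scripts used in the study) derives the three indicators from the 0-3 reporting levels of Table 4 and then computes the Relative Closeness. Since the three indicators already lie on a common [0, 1] scale and the weights are equal, the TOPSIS computation reduces to distances from the ideal point (1, 1, 1) and the anti-ideal point (0, 0, 0):

```python
import numpy as np

# Reporting levels per assessment type (Table 4):
# 0 = Not mentioned, 1 = Only mentioned, 2 = Referenced, 3 = Details are provided.
def indicators(reliability, validity):
    """External Coverage, Internal Coverage and Reporting Depth of a study,
    given its four reliability and four validity reporting levels."""
    groups = [np.asarray(g) for g in (reliability, validity) if max(g) > 0]
    external = {0: 0.0, 1: 0.5, 2: 1.0}[len(groups)]   # None / Moderate / High
    if not groups:
        return external, 0.0, 0.0
    levels = np.concatenate(groups)
    conducted = levels[levels > 0]                 # assessment types actually reported
    internal = conducted.size / (4 * len(groups))  # proportion of possible types
    depth = conducted.mean() / 3                   # normalized average reporting level
    return external, internal, depth

def relative_closeness(ec, ic, rd):
    """TOPSIS relative closeness: ideal study (1, 1, 1), anti-ideal study (0, 0, 0),
    equal weights, Euclidean distance."""
    x = np.array([ec, ic, rd])
    d_ideal = np.linalg.norm(x - 1.0)   # distance to the ideal study
    d_anti = np.linalg.norm(x)          # distance to the anti-ideal study
    return d_anti / (d_ideal + d_anti)

# Worked example from the text: a single assessment type reported at level Referenced.
ec, ic, rd = indicators([2, 0, 0, 0], [0, 0, 0, 0])
print(ec, ic, round(rd, 3))                      # 0.5 0.25 0.667
print(round(relative_closeness(ec, ic, rd), 3))  # 0.475
```

Under these definitions, the lowest Relative Closeness attainable by a study reporting at least one assessment is approximately 0.367 (External Coverage = 0.5, Internal Coverage = 0.25, Reporting Depth = 1/3), which explains the interval [0.367, 1.000] used below.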
As noted above, Relative Closeness is, by definition, a continuous measure. Although this feature allows for a convenient ordering of the studies (to identify those with the best practices), it is not entirely adequate for characterizing them. It is more appropriate here to split the range of possible values of this measure into specific intervals associated with meaningful labels. Several alternatives can be adopted (e.g., partitioning according to percentile ranges). For the sake of simplicity, we considered three classes (levels) for defining the variable Reporting Practice as follows: None if r_i = 0, Acceptable if 0 < r_i ≤ 0.683, and Good if r_i > 0.683, where r_i is the Relative Closeness of study i and 0.683 is the midpoint of the interval [0.367, 1.000]. We have considered this particular interval because 0.367 and 1.000 are indeed the minimum and maximum values, respectively, that a study reporting at least one assessment can achieve according to our definition of Relative Closeness. As a summary, Fig 2 illustrates how the proposed indicators contribute toward obtaining both the ranking of the studies and the characterization of their reporting practice. Notice that this process is achieved through the TOPSIS method.
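This classification is a direct thresholding of the Relative Closeness; a sketch consistent with the definition above:

```python
def reporting_practice(rc, midpoint=0.683):
    """Map a study's Relative Closeness to its Reporting Practice level.
    0.683 is the midpoint of [0.367, 1.000], the range attainable by
    studies reporting at least one assessment."""
    if rc == 0:
        return "None"
    return "Good" if rc > midpoint else "Acceptable"
```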
In addition to the above, we considered it fair to evaluate the studies from the point of view of their dissemination in the literature. To this end, we defined an indicator that we call the Dissemination Index, which quantifies the degree to which a study has been cited by other works. Although this indicator does not accurately capture the real use of the instrument proposed by the study, it does give us an estimate of its impact on the scientific community. Formally, we calculated this index as the number of citations of the study divided by the number of years since its publication. In this way, we seek to mitigate the possible advantage in the number of citations that studies published in earlier years may have over those published more recently. This index is a non-negative continuous magnitude with a minimum value of 0; a value close to 0 indicates that the study had low dissemination. The number of citations for each study was obtained from Google Scholar (scholar.google.com), while the number of years was calculated with respect to the year 2021. We selected Google Scholar for two reasons: first, this database has broad coverage (not only of the scientific literature, but also of the so-called gray literature) [49]; second, it serves as an independent reference that mitigates the bias of considering citations exclusively from the databases used in our systematic mapping study.
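As a brief sketch (with hypothetical citation counts), the index and the tertile-based dissemination levels used in Sec. 5 can be obtained as follows:

```python
import pandas as pd

# Hypothetical Google Scholar citation counts and publication years.
df = pd.DataFrame({"citations": [12, 0, 97, 5, 31],
                   "year": [2015, 2019, 2012, 2018, 2013]})

# Dissemination Index: citations per year elapsed since publication (base year 2021).
df["dissemination"] = df["citations"] / (2021 - df["year"])

# Tertiles T1-T3 label studies with low, medium and high dissemination.
df["level"] = pd.qcut(df["dissemination"], 3, labels=["T1", "T2", "T3"])
```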
Finally, it is worth mentioning that, in order to find out which demographic and methodological features explain the studies' reporting practices (RQ3), we proceeded with an association analysis. Specifically, we conducted a Pearson's Chi-squared test for checking whether each of the demographic and methodological variables is significantly associated with Reporting Practice. We supplemented this analysis by calculating Cramér's V, which is an effect size measure for the Chi-squared test [50]. Although other indicators (e.g., V²) could be equally effective in characterizing the strength of association, we decided on this measure because it is intuitive and common in the context of association analysis of nominal categorical variables [50]. Further, for the significant associations (p-values below 0.05), we proceeded with a post hoc analysis based on the standardized residuals of the Pearson's Chi-squared test [51]. To address RQ3.1, we relied on an ordinal logistic regression model describing Reporting Practice as a function of the variable Year. We selected this regression model because of the nature of the variables involved.
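The following sketch illustrates this analysis pipeline on a hypothetical contingency table, using SciPy's chi2_contingency. Note that the residuals computed here are Pearson residuals; the post hoc procedure in [51] may additionally adjust them by the row and column margins:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: Measured dimension (rows: Single, Several)
# versus Reporting Practice (columns: None, Acceptable, Good).
table = np.array([[10,  6,  4],
                  [ 8, 20, 25]])
chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V: effect size for the Chi-squared test of association.
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Post hoc: Pearson residuals; cells with |residual| > 1.96 deviate
# significantly from what independence would predict.
residuals = (table - expected) / np.sqrt(expected)

print(f"chi2 = {chi2:.3f}, p = {p:.4f}, V = {cramers_v:.3f}")
print(residuals.round(2))
```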

Validity assessment
As suggested by [13], the validity of a systematic mapping study should be assessed by analyzing the main threats to the following validity types:
• Descriptive validity. The main threat here is that subjective studies have less descriptive validity than quantitative ones. However, our study is based on counting data using a well-structured template (see Table 4). Consequently, we consider this threat to be under control.
• Theoretical validity. Study selection and data extraction are important sources of threats affecting this validity type. However, we employed snowballing (backward and forward) in order to mitigate the possibility of not including relevant studies. Additionally, the authors reviewed each other's steps in order to control for bias when selecting studies and extracting data. Regarding the quality of the sample of studies, we consider it to be high, since it comes from databases with great coverage of high-quality venues.
• Generalizability. To assess this type of validity, we have to consider the internal and external generalizability of the results and the methodology. In the first case, the results of this research can be generalized in the context of digital competence evaluations in higher education, so internal generalizability is guaranteed. External generalizability is not guaranteed, since digital competencies are not only measured in the context of higher education. Additionally, we have focused on quantitative instruments only, so our results describe a specific group of studies. With respect to the methodology, since systematic mapping studies and association analysis are general research methods, both internal and external generalizability are guaranteed.
• Interpretive validity. In this case, threats may exist because the authors have worked together in previous research. It is therefore possible that similar judgments arose when selecting and analyzing the primary studies.
• Reproducibility. We consider that reproducibility is guaranteed, since enough details are provided in this paper by following a systematic guide like the one proposed by [13].

Results
In this section, we present the results obtained from our analysis and classification of the studies. We organized them according to the research questions defined in Sec. 4.1.

Reporting practices of quality assessments (RQ2)
In this section, we answer several questions about how the studies reported on the quality of the employed instruments. To this end, Fig 4 summarizes the main results we obtained.

What types of assessments and what specific methods are most often reported? (RQ2.1). From Fig 4a, it is possible to observe that Internal Consistency is by far the most reported type of reliability assessment. About 50% of the studies provided enough details or referenced another study. Consistent with this result, Fig 4b shows that 86% of the studies conducting reliability assessments relied on Cronbach's alpha, which is indeed a typical method for measuring internal consistency [30].

A different situation occurs with the validity assessments (Fig 4c). In this case, the studies are more evenly distributed. However, it is interesting to see that although Content validity is the most frequently reported assessment, about half of the studies reporting on it provided no more detail than just mentioning that they did it. Fig 4d is more specific on how the assessments were conducted. Expert judgment, a typical method for content validity, is present in 36% of the studies conducting validity assessments. Interestingly, Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) are present in 25% of the studies; they are used for conducting construct validity assessments.
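For reference, Cronbach's alpha estimates internal consistency from the item variances and the variance of the total score. A minimal sketch with hypothetical Likert-type responses:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point self-assessment responses (4 respondents x 3 items).
scores = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]
print(round(cronbach_alpha(scores), 3))
```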

How comprehensive and how deep are these quality reports? (RQ2.2). To characterize the reporting practices of the studies, we relied on the indicators described in Sec. 4.5.

From Fig 4e, we can see that, in the case of External Coverage, 49% of the studies present high coverage (reporting reliability and validity at the same time). However, 36% did not report on any group at all, while 15% did so moderately (that is, on a single group). Regarding the other two indicators, the box plots in Fig 4f show that while more than 50% of the studies achieve extreme values of Reporting Depth (0 or 1), the maximum value of Internal Coverage is 0.75. In both cases, we see that at least 25% of the studies (1st quartile) are equal to the minimum value of the 2nd quartile, which is indeed 0. This is consistent with the 36% of studies with no External Coverage, as shown in Fig 4e. In fact, having no External Coverage at all implies no Internal Coverage and no Reporting Depth. So, the 36% of studies in this situation correspond to those without any quality reporting practice at all.

What studies achieve the best balance between coverage and depth in their reports? (RQ2.3). Table 5 shows the studies arranged in descending order according to their Relative Closeness (the indicator we defined in Sec. 4.5). Note that, in the first column, we have included the ordinal variable Reporting Practice, which categorizes the studies into three groups: None, Acceptable, and Good. Studies belonging to the first group, which are also the ones with Relative Closeness = 0, were omitted from Table 5 due to space limitations; however, they can be accessed in the resource provided in Sec. 4.5.

To complement this analysis, the rightmost column of Table 5 contains the Dissemination Index of the studies. It is easy to see that this characteristic classifies the studies differently than Relative Closeness. In order to facilitate the identification of studies with high dissemination, we divided the set of studies into tertiles: the first (T1) contains the studies with a Dissemination Index in the interval [0, 2.5], the second (T2) those with values in (2.5, 7.33], and the third (T3) those with values in (7.33, 64.7]. Of course, those belonging to T3 are the most widely disseminated studies of all. From these intervals, it is noteworthy that the distribution of the studies is not normal (i.e., it is highly skewed to the right), indicating that most of the studies have low dissemination rates compared to the study with the highest value (64.7). The 14 studies belonging to the latter category are marked with * in Table 5. Note that these studies have a Dissemination Index greater than or equal to 8.0 and would be, according to our analysis, those achieving an adequate balance between quality and dissemination. To obtain an overview of how the studies are distributed according to these two characteristics, Table 6 summarizes the number of studies for each combination of levels of Reporting Practice and Level of Dissemination. The results show a generally homogeneous distribution among the combinations of levels, although it is remarkable that the lowest number of studies (5) corresponds to those with Good reporting practices and Low levels of dissemination. A Chi-squared test confirms that there are no significant differences between these groups of studies (χ2 = 2.897, df = 4, p-value = 0.575).

What studies' characteristics are more likely associated with certain reporting practices? (RQ3). Table 7 shows the results of the association analysis between Reporting Practice and the studies' demographic and methodological variables. Note that we also included Cramér's V for quantifying, in the range [0, 1], the strength of the association between variables: the larger this indicator, the stronger the association. As Table 7 shows, only Measured dimension resulted in a significant association with Reporting Practice (p-value < 0.05). Cramér's V indicates a low strength for this association (V = 0.303).

In order to find out which dimensions account for this significant association, we proceeded with a post hoc analysis of the standardized residuals [51]. Table 8 summarizes the results of this analysis. We observe that only one pair of dimensions was significantly associated (Several and None). This negative association means that studies focused on measuring several areas of digital competence are more likely to report on quality assessments.

How have the studies' reporting practices evolved over the years? (RQ3.1). Finally, we addressed this question by estimating an ordered logistic regression model [99]. The model resulted in a significant coefficient for Year (= 0.279) with a standard error of 1.373e-04. Brant's test for checking the proportional odds assumption [100] gave χ2 = 0.040 and a probability (p-value) equal to 0.850; therefore, the relationship between each pair of outcome groups can be assumed to be the same under the single coefficient estimate. In order to better understand the effect of this coefficient, Fig 5 shows the predicted probabilities of each level of Reporting Practice as a function of the year. This plot shows a clear pattern in the evolution of the studies' reporting practices. Note that the dominant level was None (no reporting at all) up to 2017. From this year up to the present (July 2020), Acceptable and Good reporting practices become more likely to appear in the literature. We may also observe a decreasing trend over time in the predicted probabilities related to the practice of not reporting quality assessments (None). On the contrary, the predicted probabilities related to the practices of providing acceptable and good reports increase over the years. In the specific case of Acceptable reporting practices, the probabilities become steady from 2017 to the present.
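As an illustrative sketch of this estimation (with hypothetical data; statsmodels' OrderedModel is one available proportional-odds implementation, while Brant's test is not part of statsmodels and must be applied separately, e.g., via R's brant package):

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical per-study data: publication year and ordinal Reporting Practice.
df = pd.DataFrame({
    "year": [2011, 2013, 2015, 2016, 2017, 2018, 2019, 2019, 2020, 2020],
    "practice": ["None", "None", "None", "Acceptable", "None",
                 "Acceptable", "Good", "Acceptable", "Good", "Good"],
})
df["practice"] = pd.Categorical(df["practice"],
                                categories=["None", "Acceptable", "Good"],
                                ordered=True)

# Proportional-odds (ordered logit) model of Reporting Practice on Year.
model = OrderedModel(df["practice"], df[["year"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())

# Predicted probabilities of each level per year, as plotted in Fig 5.
probabilities = result.predict(df[["year"]])
```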

Discussion and conclusion
From the results obtained, we observe that the number of studies evaluating digital competence in higher education has grown from 2010 to the present. This increasing behavior has also been reported by similar studies such as [11,27]. Another important pattern arising from our study is the predominance of studies coming from Europe and based on undergraduate students from Social Sciences programs. This is somewhat expected if we consider that evaluating digital competence in higher education is a research topic related to this discipline in this geographical area. These results suggest that researchers have been focused on studying what is perhaps the population closest to them: undergraduate students from Social Sciences. Other higher education actors from different disciplines, continents or with different roles (e.g., academic staff) have been studied less or not at all. This is consistent with the results reported in other similar studies [101], and this gap constitutes a clear opportunity for further research studying those underrepresented populations.

As a positive aspect, we observed that most of the studies have been based on medium or large samples (involving more than 100 participants). Additionally, our results revealed that ad hoc questionnaires measuring several dimensions of digital competencies were preferred by researchers. More specifically, the employed instruments mostly relied on the self-assessment of the participants. Although measuring several dimensions is a positive aspect of the studies, the high presence of self-assessment, ad hoc instruments is, from our viewpoint, a weakness in this field. In the case of ad hoc instruments, it may be a symptom of the lack of consensus on what digital competencies are in the context of higher education [11], or perhaps a consequence of divergent research purposes. In any case, this undermines the reuse of existing instruments, which indisputably affects the opportunity to improve and validate them on a large scale. As a related consequence, the reported results (even for populations with similar characteristics) become increasingly heterogeneous as time passes, making it difficult to draw general and precise conclusions from them.
With respect to the quality of the venues where the studies were published, different results were observed for the SJR and JCR indicators. While most studies appear in journals recognized by the SJR indicator, this is not the case for the JCR indicator. These differences are explained in part by the large coverage of SJR compared to that of JCR [102]. Since JCR is a more exclusive indicator, the results suggest that, under this indicator, most of the studies were published in low-tier journals.
In regard to the reported quality assessments, the obtained results are somewhat consistent with the use of questionnaires as data collection methods. It is for this reason that assessments of internal consistency (through Cronbach's alpha), face validity (through pilot studies), content validity (through Expert judgment) and construct validity (through factor analysis) abound. Taking into account that, in this list, only internal consistency is a type of reliability assessment, it is clear that most studies report more types of validity than reliability assessments.
From the proposed indicators for appraising the studies' reporting practice, we observed that less than half of the studies conducted both reliability and validity assessments at the same time (External Coverage). In addition, the proportion of assessment types conducted within each group (Internal Coverage) was lower than 50%, which indicates poor coverage. Regarding the depth of the evidence supporting the quality assessments (Reporting Depth), the results showed that more than half of the studies provided good levels. Overall, these results indicate that serious issues exist when conducting and reporting quality assessments of the employed instruments. It seems, however, that this problem is not specific to digital competence evaluation in higher education, but a more general one in the Educational Sciences. Studies such as that of [32] in the context of medical education and, more recently, that of [41] in special education identified similar problems. Interestingly, we found no association between the degree of dissemination of the studies and their quality. This result indicates that studies with low quality in their reporting practices have statistically the same dissemination, in terms of citations, as those with acceptable or good quality.
The statistical analysis aimed at identifying which studies' features could help understand certain reporting practices revealed just a few insights. We found only one significant, negative association: between measuring several digital competence areas at the same time and not reporting any quality assessment at all. This suggests that studies devoted to measuring several areas of digital competence are more likely to report on quality assessments of the employed instruments. Although partly explained by the lack of sufficient data, the absence of significant associations for the other variables is a warning sign. For instance, we might expect studies published in top-tier venues to be more likely to exhibit good reporting practices. Along the same lines, we would like to see more concern for conducting, and hence reporting, quality assessments among studies using ad hoc, self-assessment instruments. However, no evidence was found supporting these beliefs. The good news is that, according to our estimates, the tendency to report more and better quality evaluations of the instruments used to measure digital competencies in higher education is growing over time. Although slight, this growth is significant.
In summary, it is clear that, based on the collected evidence, we cannot trust all of the instruments published to date. The list of studies we provided, which ranks them according to their degree of coverage and depth in reporting quality assessments, is expected to help researchers and practitioners identify relevant instruments to advance the field.

Implications
The results obtained by our study have important implications from both practical and research perspectives. First, our demographic and methodological characterization of the studies implies that practitioners and policymakers have a rich body of prior experience that can be taken into account when measuring digital competencies in higher education settings. Similarly, researchers have clear opportunities for future research. The evaluation of less studied populations (e.g., university professors from regions such as South America, North America or Africa) or validating existing instruments through comparative studies, are two examples of these opportunities.
Second, the fact that more than half of the studies do not conduct reliability and validity assessments at the same time, and those that do, cover less than half of the criteria within each category, implies that the selection of an existing instrument to measure digital competencies in higher education should be done with care. From a research perspective, this result implies that more emphasis should be placed on ensuring that the instruments used meet adequate levels of reliability and validity. Research opportunities exist in this context, especially related to the development of validation studies that are based on proposed instruments with little evidence of quality assessments.
Third, the fact that the quality and dissemination of the studies do not correlate is a warning sign regarding the selection of the most appropriate instruments in higher education. This implies that practitioners, as well as researchers, should not be guided purely by the number of citations of a study, but also by the quality of the instrument in terms of the psychometric properties that we have considered in this paper. Along the same lines, our results imply that the selection of instruments based on demographic or methodological similarities is not reliable. Therefore, emphasis should be placed on the specific evidence reported by the studies.
Fourth, the evolution of the studies toward the inclusion of a greater number of quality assessments implies that, in the near future, it should be easier to find better-validated instruments to measure digital competencies in higher education. This augurs a favorable scenario for practitioners and researchers. However, much remains to be done. The great heterogeneity of approaches for evidencing the quality of a quantitative instrument is a clear indication of the absence of a "standard" in this field of research. Future research could focus on proposing, for each measurement scenario, which methods should be applied to ensure the reliability and validity of the instrument under consideration.
In a more general context, the fact that academia is placing increasing interest in measuring digital competencies is a good sign that awareness is growing of the importance of understanding and monitoring the achievement of the UN Sustainable Development Goals (SDGs) [103]. As mentioned in [104], ICTs are considered key catalysts for the achievement of the 17 SDGs. In this context, our results contribute to raising awareness of the importance of correctly measuring digital competencies in higher education, a key step in knowing how far we have progressed and what still needs to be done.

Limitations
Regardless of the relevance of the results obtained, this research has important limitations. First, we based our results on a sample limited to journal articles published in the last 10 years. Therefore, we do not know the extent to which these results also apply to studies published in other venues and on other dates. Similarly, our study is limited by the variables used for the characterization of the studies. Therefore, there is a possibility that other variables (not included here) not only better describe the studies, but also more appropriately explain the presence of certain reporting practices.
Another important limitation is that we have evaluated the studies according to the practices and levels (depth) with which they report quality assessments. Therefore, in no way do our results indicate which instruments are more appropriate for measuring digital competence in higher education. We are aware that this is a much more complex task which depends on several factors including context. Therefore, in the future, it will be necessary to develop more research to answer these and other related scientific questions. In this sense, an important question is, "How do qualitative data collection techniques guarantee the quality of measurement?" Our future research will focus on providing answers to these issues.