Fig 1.
Excerpt of the survey that was set up for the classification task.
The annotators were told to assign only one category per given artifact. If an artifact was a compound noun, its nested entities, such as adjectives or second nouns that further describe the term, were provided for tagging as well. In this question, ‘CO2 fixation’ is an example of a two-term artifact and ‘groundwater’ an example of a one-term artifact.
Fig 2.
Frequency with which each category was assigned to the given phrases and terms, with and without QUALITY correction.
Fig 3.
Fleiss’ Kappa values for the individual information categories (with QUALITY correction): (a) for all artifacts; (b) for artifacts with one and two terms.
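As an illustration of the statistic reported in this figure, the following is a minimal sketch of Fleiss' kappa, both overall and per category, using the standard formulas from Fleiss (1971). It is not the authors' implementation. The function names and the input layout (one row per artifact, one column per category, each cell holding the number of annotators who chose that category) are assumptions made for this example, and every artifact is assumed to have the same number of ratings.

```python
from typing import List, Optional


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Overall Fleiss' kappa for an items x categories table of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumption: same number of ratings for every item
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item, averaged over all items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance (undefined if all ratings fall in one category).
    p_e = sum(pj * pj for pj in p)
    return (p_bar - p_e) / (1 - p_e)


def fleiss_kappa_per_category(counts: List[List[int]]) -> List[Optional[float]]:
    """Kappa for each category; None where a category was never or always chosen."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    result: List[Optional[float]] = []
    for j in range(len(counts[0])):
        column = [row[j] for row in counts]
        p_j = sum(column) / (n_items * n_raters)
        if p_j in (0.0, 1.0):
            result.append(None)  # kappa is undefined for this category
            continue
        # Number of rater pairs that disagree on membership in category j.
        disagreement = sum(c * (n_raters - c) for c in column)
        denominator = n_items * n_raters * (n_raters - 1) * p_j * (1 - p_j)
        result.append(1 - disagreement / denominator)
    return result
```

A category-wise value below a chosen threshold (0.4 is the one used elsewhere in this article) would then flag a category with low agreement.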
Table 1.
Annotators’ agreement with QUALITY correction, overall and for artifacts with one term, two terms, and three or more terms.
Fig 4.
Frequency of category mentions and inter-rater agreement with QUALITY correction.
Table 2.
Metadata standards in the (life) sciences obtained from re3data [57] and the RDA metadata standards catalog [58].
The number in brackets denotes the number of repositories supporting the standard (provided in re3data).
Table 3.
Comparison of metadata standards and information categories.
The categories are sorted by the frequency of their occurrence determined in the previous question analysis. The asterisk denotes the categories with an agreement of less than 0.4.
Table 4.
Metadata schemes and formats offered by selected data repositories in their OAI-PMH interfaces.
Table 5.
The date stamps used for each metadata standard and their descriptions obtained from the standard’s website.
Table 6.
Total number of datasets parsed per data repository and metadata standards and schemata.
The numbers in brackets denote the number of datasets used for the analysis. All datasets were harvested and parsed in May 2019.
Fig 5.
Timelines for all repositories presenting the number of datasets per metadata standard and schema offered.
For several repositories, the timelines for the different metadata standards and schemata are almost identical and overlap. Apparently, when a new metadata standard or schema was introduced, publication dates were adopted from the existing metadata structures. Figshare’s timeline for RDF was computed separately because the data were too large to be processed together with the other metadata files.
Fig 6.
Metadata field usage in all data repositories evaluated.
The graphics display the percentage of metadata fields used per data repository and its best matching standard with respect to the information categories.
Table 7.
Comparison of data repositories and their best matching standard with the information categories.
The categories are sorted by the frequency of their occurrence determined in the question analysis. The asterisk denotes the categories with an agreement of less than 0.4.
Table 8.
Five most common keywords and their frequencies in the metadata field dc:subject.
The last row denotes the number of files with an empty dc:subject field.
Table 9.
Filter strategies used per data repository to select 10,000 datasets.
The number in brackets denotes the total number of available datasets (OAI-DC standard) at the time of download (October/November 2019).
Table 10.
NLP analysis: Number of datasets with named entities (out of 10,000 processed files in a reduced OAI-DC structure) per repository.
Each file contains a subset of the original metadata, namely, dc:title, dc:description, dc:subject and dc:date.
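To make the reduced OAI-DC structure concrete, the following is a minimal sketch of how a record could be cut down to these four fields. It is an illustration under stated assumptions, not the authors' pipeline. The function name `reduce_oai_dc` and the sample record are hypothetical, and only the Dublin Core element namespace is taken from the standard.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace of the Dublin Core element set used inside oai_dc records.
DC = "http://purl.org/dc/elements/1.1/"
KEPT_FIELDS = ("title", "description", "subject", "date")


def reduce_oai_dc(xml_text: str) -> Dict[str, List[str]]:
    """Keep only dc:title, dc:description, dc:subject and dc:date values."""
    root = ET.fromstring(xml_text)
    reduced: Dict[str, List[str]] = {}
    for field in KEPT_FIELDS:
        values = [(el.text or "").strip() for el in root.iter(f"{{{DC}}}{field}")]
        # Drop empty elements so a missing and an empty field look the same.
        reduced[f"dc:{field}"] = [v for v in values if v]
    return reduced


# Hypothetical record, invented for this example only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example dataset</dc:title>
  <dc:creator>Doe, J.</dc:creator>
  <dc:subject></dc:subject>
  <dc:date>2019-05-01</dc:date>
</oai_dc:dc>"""

record = reduce_oai_dc(SAMPLE)
```

In this sketch, a record whose `dc:subject` list comes back empty would count toward the files with an empty dc:subject field.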