Completeness of Digital Accessible Knowledge (DAK) about terrestrial mammals in the Iberian Peninsula

The advent of online data aggregator infrastructures has facilitated the accumulation of Digital Accessible Knowledge (DAK) about biodiversity. Despite the vast amount of freely available data records, their usefulness for research depends on the completeness of each body of data regarding its spatial, temporal and taxonomic coverage. In this paper, we assess the completeness of DAK about terrestrial mammals distributed across the Iberian Peninsula. We compiled a dataset with all records about mammals occurring in the Iberian Peninsula available in the Global Biodiversity Information Facility and in the national atlases from Portugal and Spain. After cleaning the dataset of errors as well as records lacking collection dates or not determined to species level, we assigned all occurrences to a 10-km grid. We assessed inventory completeness by calculating the ratio between observed and expected richness (based on the Chao2 richness index) in each grid cell and classified cells as well-sampled or under-sampled. We evaluated survey coverage of well-sampled cells along four environmental gradients and temporal coverage. Out of 796,283 retrieved records, quality issues led us to remove 616,141 records unfit for this use. The main reason for discarding records was missing collection dates. Only 25.95% of cells contained enough records to robustly estimate completeness. The completeness of DAK about terrestrial mammals from the Iberian Peninsula was low, and spatially and temporally biased. Out of 5,874 cells holding data, only 620 (9.95%) were classified as well-sampled. Moreover, well-sampled cells were geographically aggregated and reached inventory completeness over the same temporal range. Despite the increasing availability of DAK, its usefulness is still compromised by quality issues and gaps in data. Future work should therefore focus on increasing data quality, in addition to mobilizing unpublished data.


Introduction
The mobilization via the Internet of a vast amount of biodiversity data offers new opportunities for basic research and evidence-based decision-making in conservation [1][2][3]. Biodiversity data exchange infrastructures such as the Global Biodiversity Information Facility (GBIF) facilitate access to massive amounts of primary biodiversity data records (PBR) which become here. The dataset was checked, cleaned and filtered in several steps. First, we excluded records from the islands and the Spanish cities in North Africa, as well as records of domestic and invasive species (e.g., raccoon and coypu). Second, we removed all records that lacked a collection date or that had not been determined to species level. We also removed records whose coordinates fell outside the boundaries of the Iberian Peninsula. Third, following Sousa-Baena et al. [4], we kept only the unique combinations of 1) scientific name, 2) latitude and longitude, and 3) collection date.
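The filtering steps above can be illustrated with a minimal Python sketch. It is not the code used in the study (the analyses were run in R); the record fields, the `is_species_level` heuristic and the `in_study_area` predicate are assumptions introduced only for illustration, and the exclusion of islands, domestic species and invasive species is left out.

```python
def is_species_level(name):
    # Heuristic assumption: a species-level name has at least two words
    # (genus + specific epithet). Real data would need a taxonomic check.
    return len(name.split()) >= 2


def clean_records(records, in_study_area):
    """Drop undated, undetermined, out-of-area and duplicated records.

    records: iterable of dicts with hypothetical keys
             'species', 'lat', 'lon', 'date'.
    in_study_area: hypothetical predicate (lat, lon) -> bool standing in
                   for a point-in-polygon test against the Iberian Peninsula.
    """
    seen = set()
    kept = []
    for rec in records:
        if not rec.get("date"):
            continue  # no collection date
        if not is_species_level(rec.get("species") or ""):
            continue  # not determined to species level
        if not in_study_area(rec["lat"], rec["lon"]):
            continue  # coordinates outside the study area
        # Unique combination of name, coordinates and collection date.
        key = (rec["species"], rec["lat"], rec["lon"], rec["date"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

In practice the area test would use the actual peninsula boundary rather than a simple predicate; the sketch only shows the order of the filters.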
We obtained from the Spanish Society of Mammalogists (SECEM) the raw data of the Spanish national atlas updated until 2016 [27], and downloaded the data of the Atlas of Portugal [28]. We repeated the GBIF cleaning process on these records (hereinafter "atlas data"). The information about collection date in the Atlas of Portugal was of very low resolution (i.e., records dated before or after 2000). Thus, we decided to use the records from the Portuguese atlas for the spatial gap assessment but to omit them when addressing the temporal gaps.
While filtering the GBIF data records, we detected a remarkable peak of 179,757 (30.9%) observations in 2007. All of them belonged to one single dataset (ARM dataset) shared by the National Inventory of Terrestrial Species of Spain, which compiled information about terrestrial species in the country harvested from the Spanish national atlases of birds, amphibians, and mammals. These observations qualified as duplicates after merging the data in the Spanish atlas, even though the information level was different (more precise and exhaustive in the atlas data than in the version shared through GBIF). We removed those records as superseded by the atlas data which were also more up to date than the GBIF data.
Before merging both datasets, we assigned all records to the 10 km reference Universal Transverse Mercator (UTM) grid, corresponding to the lowest spatial resolution of the data we had, tagging each record with its cell's ID. Joining both datasets may have resulted in the creation of duplicate records (same observation record uploaded twice), as the national atlases were built by contributions from different authors or institutions (e.g., natural history collections, research projects, opportunistic observations) who could also have independently uploaded their records to GBIF. To address this, we generated a marker for each record by combining scientific name, collection year and cell's ID. Then, we used these markers to extract the unique records in the atlas dataset. We added the unique records to the GBIF dataset, thus building the full dataset.
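The marker-based merge can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code; the `Record` type and its field names are hypothetical, and it assumes that "unique" atlas records are those whose marker is absent from the GBIF data and has not already been added.

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical record layout used only for this illustration.
    species: str
    year: int
    cell_id: str
    source: str


def marker(rec):
    # Marker combining scientific name, collection year and grid cell ID.
    return (rec.species, rec.year, rec.cell_id)


def merge_datasets(gbif, atlas):
    """Append to the GBIF records the atlas records with a new marker."""
    seen = {marker(r) for r in gbif}
    merged = list(gbif)
    for rec in atlas:
        key = marker(rec)
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged
```

Because the marker uses the collection year and the 10-km cell rather than exact dates and coordinates, two genuinely different observations of the same species in the same cell and year collapse into one record under this scheme.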
We performed a factor correspondence analysis on the full dataset in order to observe taxonomical patterns or biases that could affect the results as a consequence of the difference in sampling methods [29]. A bimodal pattern was observed (Fig A in S1 Appendix), roughly corresponding to taxon groups of different size and sampling method: large- and medium-sized mammals, usually detected by indirect methods such as footprints, camera trapping, or scats [31], and small mammals sampled through analysis of owl pellets or traps [32]. Accordingly, we segmented the data into two taxonomic groups: non-small mammals comprising the orders Lagomorpha, Artiodactyla, Carnivora, Chiroptera and the family Erinaceidae of the order Eulipotyphla, and small mammals including the order Rodentia and the family Soricidae of the order Eulipotyphla.
The C_c index ranged from zero (low completeness) to one (high completeness). Cells with low sample levels can exhibit high but artificial completeness values [4]. Therefore, we established a minimum number of records per cell to calculate the estimators. This minimum was established by assessing the relationship between C_c and number of records, following Sousa-Baena et al. [4]. For the generation of the species accumulation curves (SAC), we used the function 'specaccum' (method = 'exact') in the R package vegan [34], and then we obtained the slope of the final 10% of the SAC. Values close to zero (flat slopes) indicated high completeness (saturation of the curve almost reached) whereas higher values (steeper slopes) indicated low completeness [15]. We plotted the completeness values based on C_c vs. C_m to test whether they were correlated. As the analysis showed high correlation for the full dataset (r Spearman: -0.60, p value < 0.001, Fig B in S1 Appendix) we chose to use C_c. Finally, we classified the cells as well-sampled or under-sampled according to two criteria, strict (high-threshold) and lax (low-threshold), based on thresholds used in previous studies but adapted to data availability [17]. Well-sampled cells for the strict criterion required having more than 50 records and a value of C_c equal to or greater than 0.8, while the lax criterion identified well-sampled cells as having more than 25 records and a value of C_c equal to or greater than 0.7. The following analyses were performed based on well-sampled cells according to the lax criterion.
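As an illustration of the completeness ratio and the two classification criteria, the following Python sketch computes observed richness divided by a Chao2 estimate for one cell. It is a sketch under stated assumptions, not the study's R implementation: it uses the bias-corrected form of Chao2, and it treats each element of `samples` as one sampling unit (for instance, the set of species recorded on one collection date), neither of which is specified in the text above.

```python
from collections import Counter


def chao2(samples):
    """Bias-corrected Chao2 estimate from incidence data.

    samples: list of sets of species names, one set per sampling unit
             (the definition of a sampling unit is an assumption here).
    """
    m = len(samples)
    if m == 0:
        return 0.0
    incidence = Counter(sp for unit in samples for sp in set(unit))
    s_obs = len(incidence)
    q1 = sum(1 for n in incidence.values() if n == 1)  # uniques
    q2 = sum(1 for n in incidence.values() if n == 2)  # duplicates
    return s_obs + ((m - 1) / m) * q1 * (q1 - 1) / (2 * (q2 + 1))


def completeness(samples):
    """Ratio between observed and expected (Chao2) richness."""
    expected = chao2(samples)
    observed = len({sp for unit in samples for sp in unit})
    return observed / expected if expected else 0.0


def classify(n_records, c_value, min_records=25, min_c=0.7):
    # Defaults follow the lax criterion as worded above:
    # more than 25 records and completeness of at least 0.7.
    if n_records > min_records and c_value >= min_c:
        return "well-sampled"
    return "under-sampled"
```

The strict criterion corresponds to calling `classify(n, c, min_records=50, min_c=0.8)`. The record-count threshold matters because a cell with very few sampling units can return a ratio close to one simply because there are too few uniques to inflate the estimate.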

Environmental coverage
We evaluated the coverage of well-sampled cells along environmental gradients [17]. We downloaded bioclimatic variables at 2.5-minute resolution (approximately five km resolution) from the WorldClim database [35] and land cover data from the 2006 Corine Land Cover (CLC) database at 100-meter resolution from the European Environmental Agency [36].
Each climatic variable was averaged over each cell. For land cover uses, we first reclassified CLC categories into five new categories. CLC classifies land cover uses into 44 classes, grouped in a three-level hierarchy [36]. The 'urban', 'crops' and 'wetlands' categories consisted of all land use categories grouped under classes 1, 2 and 4, respectively, of level 1 of the CLC nomenclature ("major categories"). Class 3 in the CLC nomenclature ('Forest and semi-natural areas') was divided into two categories: 'forest' included class 3.1 (Forests), and 'scrubland' included classes 3.2 and 3.3 (Shrub and herbaceous vegetation association and Open spaces with little or no vegetation, respectively). We then summarised the percentage of coverage of each land cover type in each cell.
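The reclassification described above can be written as a small lookup on the CLC class code. The Python sketch below is illustrative only: it assumes CLC classes are given as three-digit codes (e.g., 311), that class 5 (water bodies) is left unassigned, and that percentages are computed over all pixels in the cell, none of which is stated explicitly in the text.

```python
from collections import Counter


def reclassify(clc_code):
    """Map a three-digit CLC code to one of the five study categories."""
    code = str(clc_code)
    if code.startswith("1"):
        return "urban"
    if code.startswith("2"):
        return "crops"
    if code.startswith("4"):
        return "wetlands"
    if code.startswith("31"):
        return "forest"
    if code.startswith("32") or code.startswith("33"):
        return "scrubland"
    return None  # e.g., class 5 (water bodies): assumed unassigned


def cover_percentages(pixel_codes):
    """Percentage of a cell covered by each category.

    pixel_codes: CLC codes of all pixels falling inside one grid cell.
    """
    total = len(pixel_codes)
    if total == 0:
        return {}
    counts = Counter(reclassify(code) for code in pixel_codes)
    return {
        category: 100.0 * n / total
        for category, n in counts.items()
        if category is not None
    }
```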
We conducted a Variance Inflation Factor (VIF) analysis to discard correlated variables (VIF > 5), and correspondingly retained Annual Mean Temperature (AT), Annual Precipitation (AP), forest coverage (FC), and crops coverage (CC). Then we performed Kolmogorov-Smirnov (K-S) goodness-of-fit tests to compare the frequency distribution of well-sampled (lax criterion) and all sampled cells against the background cells (i.e. all territory) across the environmental gradients, following Clifford et al. [37]. Low values of the D statistic indicated that well-sampled cells reached a high level of survey coverage spanning all of the background environmental gradients [17].
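To make the D statistic concrete, the following Python sketch computes the maximum absolute difference between the empirical cumulative distributions of two samples, for example the annual mean temperature of well-sampled cells versus that of all background cells. It is a plain two-sample formulation written for illustration; the exact test configuration used in the study is not specified here, and the function name is hypothetical.

```python
from bisect import bisect_right


def ks_d_statistic(sample, background):
    """Two-sample Kolmogorov-Smirnov D: max |F_sample(x) - F_background(x)|."""
    a = sorted(sample)
    b = sorted(background)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

A value of D near zero means the well-sampled cells span the gradient in the same proportions as the background, whereas a value near one means they occupy only a narrow part of it.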

Temporal coverage
For each cell, we calculated (1) the median of collection years, (2) the time since the last records were collected and (3) the interquartile range of collection years. Furthermore, we correlated C_c of well-sampled cells (lax criterion) with the median collection year to determine whether the high level of completeness was acquired from recent or historical surveys [14]. We also assessed whether well-sampled cells were spatially aggregated with regard to the decades in which completeness was reached, using Moran's I test. Following Stropp [14], we calculated C_c for each cumulative decadal period, starting from 1900 to 1910, 1900 to 1920, . . ., until the last period 1900 to 2020. We assumed that a variation of 5% (upwards or downwards) of the final value of C_c was enough to conclude that the inventory completeness had been reached.
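One way to read the 5% rule is sketched below in Python. It assumes that completeness is considered reached at the earliest cumulative period from which all later values stay within 5% (relative) of the final value; this interpretation, the function name and the input layout are assumptions, not a description of the study's code.

```python
def period_completeness_reached(cumulative, tolerance=0.05):
    """Return the end year of the period in which completeness was reached.

    cumulative: chronologically ordered list of (end_year, completeness)
                pairs, one per cumulative period (1900-1910, 1900-1920, ...).
    """
    if not cumulative:
        return None
    final = cumulative[-1][1]
    lower = final * (1 - tolerance)
    upper = final * (1 + tolerance)
    reached = None
    # Walk backwards while values remain inside the tolerance band.
    for end_year, value in reversed(cumulative):
        if lower <= value <= upper:
            reached = end_year
        else:
            break
    return reached
```

For a cell whose cumulative values are 0.0, 0.2, 0.5, 0.68, 0.72 and 0.70 for periods ending in 1910 through 1960, the band around the final value is 0.665 to 0.735, so the sketch reports the period ending in 1940.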
All analyses and data management were performed in R [38], and for the final graph modifications we used Inkscape [39]. See List A in S1 Appendix for references of the R packages employed in the study.

Inventory completeness
We retrieved 582,720 records from GBIF. We discarded 25,244 records from species not targeted for the study. Among the remaining records, 284,427 (48.81%) had quality issues such as lacking a collection date or species determination, or having wrong coordinates (Fig 1). Another large exclusion of data corresponded to the removal of the dataset from the National Inventory of Terrestrial Species of Spain that contained 179,757 records that had already been fed to GBIF at a different resolution. From the initial spatial coverage of 5,444 cells, after the filtering process, only 1,486 cells contained GBIF-mediated data (23.85% of the territory). The final number of GBIF records fit for this purpose was 93,292 (16% of the original dataset). Similarly, the data from the atlases yielded 213,563 records distributed in the Iberian Peninsula, but data quality issues precluded the use of 126,713 of them (59.33%, Fig 1).
Finally, the combination of both datasets resulted in a dataset with 179,767 records of 89 species distributed in 5,874 cells out of the 6,232 cells covering the Iberian Peninsula. Although the spatial coverage of mammals was high (Fig 2A), the number of records per species in each cell was low (Fig 2B). For example, 33% of the cells contained one record per species (Fig 2B). Concerning taxonomic groups, the analysis showed that non-small mammals contained 97,501 records of 55 species distributed in 5,850 cells, while small mammals (Rodentia and Soricidae) contained 82,266 records of 34 species found in 3,785 cells (Table 1).
The inventory completeness in the Iberian Peninsula was low for the full dataset. The mean value for C_c was 0.62 for all cells having at least 25 occurrences, the frequency threshold we allowed for a cell to be included in the calculations. Based on the lax criterion (n ≥ 25, C_c ≥ 0.7), 9.95% of the cells (620) were classified as well-sampled while the strict criterion (n ≥ 50, C_c ≥ 0.8) left 262 well-sampled cells (4.20%). We found that most of the well-sampled cells were located within Spain, particularly at both ends of the Pyrenees, on the Mediterranean coast and the midwest of Spain (Fig 2A).

Environmental and temporal coverage
The environmental coverage of well-sampled cells (lax criterion) was significantly low for all the selected variables (Fig 3). For the full dataset, the K-S D statistic was high, ranging from 0.6 to 1 (mean 0.76) in well-sampled cells. Similarly, poor environmental coverage of the well-sampled cells resulted for non-small mammals and small mammals (Rodentia and Soricidae) separately, with D values also ranging from 0.6 to 1 (mean 0.76 and 0.89 respectively, Fig 3).
The temporal coverage of the complete dataset was 183 years, spanning from 1835 to 2018 (Fig 4). However, the inventory records were scarce from the early 1800s to the mid-1900s, with only 651 before 1960 for the whole Iberian Peninsula. Over the following decades records accumulated, peaking in 2006. There was a taxonomical difference in the accumulation pattern. Over the full dataset, even though we found that the inventory completeness was negatively correlated with the median year in which occurrences were recorded (Spearman's r = -0.06, p-value = 0.02), the effect was very small, and its significance was likely due to the large sample size. However, the separate taxonomic groups showed no significant correlation between the median collection year and the inventory completeness at α = 0.05. Nearby well-sampled cells (lax criterion) usually reached inventory completeness at roughly the same time (Moran's I: 0.13, p < 0.001, Fig 5A). The pattern held after splitting the dataset by taxonomic groups (Fig 5B and 5C).

Quality of DAK of terrestrial mammals in the Iberian Peninsula
Our goal was to characterize the availability of Digital Accessible Knowledge about terrestrial mammals distributed in the Iberian Peninsula, which could become suitable for further distributional, biogeographical or ecological studies requiring spatial and temporal representativeness. First, a coarse search in GBIF yielded more than half a million occurrence data records from Spain and Portugal. Successive quality, suitability and reliability checks filtered out the majority of these data, and only 16.01% were deemed suitable for representativeness analysis on both the space and time dimensions. On the other hand, raw data collected from the national atlases yielded less than half as many records, of which 22.23% were fit for our purpose. Although the leakage rate [40] was thus higher in the case of GBIF, overall, we observed comparable retention of records in both datasets as well as similar factors leading to attrition. The main reason for discarding records was data quality. While having partial data still allows records to be used for specific purposes, we needed complete records in our analysis. The lack of collection dates was the largest cause of leakage from GBIF and Spanish atlas data (Fig 1). The Portuguese atlas data had fewer data quality issues, but had a much larger proportion of records with inadequate distribution data that were removed first (Fig 1), and the collection dates were merely specified as before or after the year 2000.
The quality of data is of capital importance [41]. Imprecise data can suggest inaccurate diversity patterns [42] or biased species distribution models [43]. Errors such as misspelled species, or occurrences mismatching the expected territory for a record (e.g., political divisions) can be relatively easy to track and correct [44]. However, other errors such as wrong dates can easily go unnoticed and in many cases cannot be corrected although they could be detected (for example, misrepresenting unknown day of the month or month of the year [45], an anomalous concentration of records on the first day of each month or year). In our case, the coincidence of the publication date of the Spanish national atlas and a single batch of 179,757 observation records from 2007 found in the GBIF dataset led us to discover that those records had been assigned the atlas' publication year rather than the actual occurrence dates. We were thus able to remove these GBIF records from the dataset, substituting them with the raw data. However, this type of detailed intervention may be difficult to accomplish when working with very large datasets, where these time-consuming checks for errors and inconsistencies [46] can easily become cost-ineffective. Thus, it is of utmost importance that data publishers provide accurate data and comprehensive metadata to inform data users about the true limitations of the data [40].

Inventory completeness
Studies addressing inventory completeness are becoming more frequent [4,5,13,15,17,18,40,47-49]. Overall, these assessments tend to conclude that inventory completeness is spatially biased. We found that to be also the case for mammals in the Iberian Peninsula. Our analysis showed that DAK completeness was low, and spatially and temporally biased. Despite the relatively high proportion of territory cells holding data at the chosen spatial resolution, the number of records per cell was low (Fig 2B). The lax criterion excluded 90.05% of the cells from the full dataset. Well-sampled cells tended to be geographically aggregated, particularly in the western and eastern edges of the Pyrenees, the Mediterranean coast and one spot in the midwest of Spain (Fig 2A). Several causes might account for the comparatively high level of inventory in those areas, such as local atlases, research projects, or citizen science. When splitting the dataset by taxonomic groups, the spatial distribution of well-sampled cells for both groups did not substantially differ from the full dataset (Fig 2). DAK completeness for non-small mammals seemed to be higher than that for small mammals in the Iberian Peninsula (e.g., 531 well-sampled cells compared to 205 using the lax criterion). Still, the overall completeness was very low.
Although both GBIF and the atlases yielded comparable numbers of records, as expected the spatial coverage provided by the atlas data was higher [50] (see Fig C in S1 Appendix). Grid-based biological atlases are often generated by compiling existing data, aggregating them to a reference grid that is reported as a single, generalized georeferencing for the data, and then seeking to fill in the cells lacking in data [51]. Consequently, atlases commonly represent trade-offs between their spatial coverage (which is generally high, and largely based on expert judgement) and the sampling effort (which is generally low). Also, the "unknown recurrence" (not reporting data considered as redundant) is common while constructing atlases [52]. All of this results in presence-only datasets with a low number of records and species per cell that may be assigned a low-resolution coordinate set (corresponding to the centerpoints of each cell in the grid). In our case, over the full dataset, almost half of the cells had equal numbers of records and observed species: that is, each species was reported as a single observation in the cell, probably leading to the underestimation of well-sampled cells [52].

Survey and temporal coverage
Although the spatial coverage of mammals was high (i.e., most of the cells contained data, Fig 2), the low and spatially biased inventory completeness led to low, and spatially biased, survey coverage over all datasets (Figs 2A, 2B, 2C and 3).
The information about mammals in the Iberian Peninsula has been accumulating since the early 19th century, although most records that were made in, or converted into, digital form were only contributed in recent decades (Fig 4). We observed that records peaked at the start of the 21st century, but the inventorying activity decreased thereafter. One possible cause for the recent decrease in the number of records might be the lag between data collection and data mobilization [53], although other causes are also possible, such as an actual reduction in field campaigns or data collection, or the processing of the backlog of natural history collections data [54]. The temporal pattern of inventorying was similar among taxonomic groups, even though they peaked at different times (Fig 4). The high inventorying of small mammals (Rodentia and Soricidae) during the later years of the 20th century on the western side of the Pyrenees (Figs 2G and 4) was driven by the publication of the extensive regional atlas for the province of Navarra [55]. Similarly, the publication of the Spanish national atlas resulted in a distinct increase in the number of records (Fig 4).

Low completeness or low data sharing?
The results in this paper stress the necessity of mobilizing the data buried in personal databases, museum catalogs or research centers. The three countries in the Iberian Peninsula have been active in data sharing since GBIF was launched [45], with Spain leading the way with more than three-quarters of the peninsula's data published as of September 2018 (https://www.gbif.org/occurrence/search?country=ES). Nonetheless, we are aware that many data may still be locked away [56,57]. Our results show not only a map of the inventory completeness of the mammals but also the areas where biodiversity data have been made digitally accessible. As the concentration areas tend to coincide with the political boundaries of a few, selected Spanish administrative divisions, it is quite likely that data also exist in other regions within Spain but have not yet been mobilized through digital platforms and, thus, are not part of the DAK. However, unless the scientific community and administrations incorporate the publication of data as a routine step in their workflow [58], access to data will continue to rely on the will of data holders to invest their resources in doing so.
Designing, surveying, and managing data have associated costs that can run high. Specifically, sampling mammals is a highly demanding task [31,32]. Mammal species usually show elusive behavior, or have nocturnal habits that influence their detectability and increase the efforts necessary to document them. The effort involved in collecting data could be wasted if the full potential of the resulting data is not realized by properly converting them into DAK. We believe that releasing all the compiled information is a desirable and highly valuable step forward for research and conservation. As pointed out in several studies, data gathered in databases can be relevant for designing new and more efficient sampling protocols [59] even if the source of information is biased or scarce [11].

Conclusion
Incomplete inventories limit the information yield, particularly when trying to determine the true spatial distribution of species, as absences are difficult to confirm. Obtaining a better understanding of species' distributional and temporal patterns requires filling the existing information gaps, both positive and negative. While some gaps will surely be impossible to fill due to lack of surveys, much information from actual surveys may still remain to be discovered, debugged and published. Such data would allow us to fill historical gaps and to design better surveys to fill contemporary gaps. However, the volume of data by itself is not always a surrogate of increased knowledge [5,40]. We call for significant efforts to be made to increase the quality of the shared data, thus enlarging their fitness-for-use spectrum. The present and future of biodiversity research and conservation rely on freely available data. Good and sufficient data availability may facilitate better allocation of limited resources for research. With the release of those data we might be better prepared to understand changes in the distributional patterns of species, and to answer questions when dealing with the biodiversity crisis we are facing.

Supporting information
S1 Appendix. Supporting information on methods. (PDF)