rAvis: An R-Package for Downloading Information Stored in Proyecto AVIS, a Citizen Science Bird Project

Citizen science projects store an enormous amount of information about species distribution, diversity and characteristics. Researchers are now beginning to make use of this rich collection of data. However, access to these databases is not always straightforward. Apart from the largest and international projects, citizen science repositories often lack specific Application Programming Interfaces (APIs) to connect them to the scientific environments. Thus, it is necessary to develop simple routines to allow researchers to take advantage of the information collected by smaller citizen science projects, for instance, programming specific packages to connect them to popular scientific environments (like R). Here, we present rAvis, an R-package to connect R-users with Proyecto AVIS (http://proyectoavis.com), a Spanish citizen science project with more than 82,000 bird observation records. We develop several functions to explore the database, to plot the geographic distribution of the species occurrences, and to generate personal queries to the database about species occurrences (number of individuals, distribution, etc.) and birdwatcher observations (number of species recorded by each collaborator, UTMs visited, etc.). This new R-package will allow scientists to access this database and to exploit the information generated by Spanish birdwatchers over the last 40 years.


Introduction
During the past several decades, developers have focused their attention on constructing web repositories to store and share biological information. On the one hand, there are online repositories with information generated by scientists, like specimens collected for museums and herbariums, fossil records or genetic data (e.g. GBIF: http://gbif.org, NOW: http://helsinki.fi/science/now/, GenBank: http://ncbi.nlm.nih.gov/genbank/). On the other hand, there are web sites that store biological information collected by non-scientists, or so-called 'citizen science'.
Citizen science has proven to be an appropriate method to provide researchers with valuable information [1][2][3], and is increasingly used as an adequate way to sample species occurrences and distributions [4], to collect data to investigate urban ecology [3,5,6], or to collect data on bird biology, ecology and diversity [7][8][9]. In our case, data stored in Proyecto AVIS, our citizen science project to collect data from amateur Spanish ornithologists, show the same general patterns described by scientists based on their own samples and field experiments. Power law distributions of species/area [10] and species/abundance [11] have been detected (Figure 1), suggesting that the data stored in Proyecto AVIS have similar properties to the data collected by scientists.
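As a hedged illustration of how a power-law pattern of this kind can be checked, the sketch below fits a species/area relationship of the form S = c·A^z by ordinary least squares on log-transformed values. The data are synthetic and generated only for this example, and all names (`area`, `species`, `c_true`, `z_true`) are assumptions; this is not the analysis behind Figure 1.

```r
# Synthetic species/area data (illustrative only, not Proyecto AVIS records)
set.seed(42)
c_true <- 12     # hypothetical multiplicative constant
z_true <- 0.25   # hypothetical power-law exponent
area    <- 10^seq(1, 5, length.out = 50)                     # areas, arbitrary units
species <- c_true * area^z_true * exp(rnorm(50, sd = 0.05))  # multiplicative noise

# A power law S = c * A^z is a straight line on log-log axes:
#   log(S) = log(c) + z * log(A)
fit   <- lm(log(species) ~ log(area))
z_hat <- unname(coef(fit)[2])        # estimated exponent (slope)
c_hat <- unname(exp(coef(fit)[1]))   # estimated constant (back-transformed intercept)

cat(sprintf("estimated z = %.3f, estimated c = %.2f\n", z_hat, c_hat))
```

A roughly linear log-log plot and a stable slope are consistent with a power law, although a formal test would compare it against alternative distributions.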
One of the main characteristics of the citizen science databases is that they are huge. For instance, birdwatchers' observations stored in the eBird database reached 100,000,000 observations and over 10,000 species (http://ebird.org). As a result, there are terabytes of information about species occurrences (latitude, longitude, altitude, time, habitat, diet, alleles, etc.) stored in online databases that follow different formats and standards of data storage [12], and the challenge now is developing easy strategies to use this information for research [13].
Currently, there are ongoing projects to generate tools to standardize the information stored in those databases (e.g. http://ecodataretriever.org) and to develop R-packages to connect online biological databases to the R-environment (http://ropensci.org/). As a consequence, large international databases are now being made available through R using packages like rebird [14], rfishbase [15], rgbif [16] or rvertnet [17] (connecting R with eBird, Fishbase, GBIF and VertNet databases, respectively). All of these new data exponentially increase our capabilities to answer questions about species conservation, global change, macroecology and biogeography.
R is an open source and collaborative framework (http://www.r-project.org/), and is one of the most used environments for analyzing biological data and for developing scientific software [18]. Many young scientists are becoming advanced R-users (but see [19]). Thus, R is becoming a standard environment for developing easy-to-use (and re-use) functions and for sharing them with the academic community. For all of these reasons, we decided to build an R package to directly download the information stored in Proyecto AVIS from the R environment, in order to promote the use of the data stored in this database within the growing scientific R-community.

Proyecto AVIS
Each citizen science project stores singular and, consequently, important information [6,9,20,21,22]. Proyecto AVIS (http://proyectoavis.com) is a citizen science project born in August 2005 with the idea of collecting the data stored in the field notebooks of amateur Spanish ornithologists and sharing them with both other amateur ornithologists and the scientific community. More than one hundred collaborators, including several NGOs, have been actively participating in the project by uploading their bird observations. Overall, the database contains records spanning 40 years and stores 82,503 records, totalling 4,739,171 individuals from 413 species, which represent 90% of the total number of species recorded in Spain. In addition, it contains information from 1,717 different UTMs (squares of 10×10 km), representing 30% of the Spanish territory (query to the database: November 2013).
The Proyecto AVIS database and web page were built using open source software (MySQL, Perl, Apache) and free GIS layers. Proyecto AVIS requires five mandatory fields for each bird observation: 'species', 'number of individuals', 'observation period', 'date' and 'UTM 10×10 km square', plus several optional fields that include variables like 'hour', 'sex', 'age' or 'habitat'. To standardize the taxonomy, the bird species list follows the Bird List of Spain from SEO/BirdLife [23]. Bird occurrences in the Proyecto AVIS database are georeferenced using the projected UTM 10×10 km square system and the MGRS labelling convention (Military Grid Reference System). The UTM/MGRS is the standard system for mapping species occurrences in Spain and is the system used by the Spanish bird atlases [24,25]. To help users identify the UTMs in which they recorded the species, the web application includes an easy-to-use tool to georeference the observations based on a Google Maps™ routine.
The Proyecto AVIS web page (http://proyectoavis.com) includes several user-friendly tools for exploring the database, like summaries of the bird observations or graphics of the species records throughout the year, and it allows registered users to download detailed information about the species observations to Excel files. However, although the database is already available on the Internet, its use for research has not been properly exploited. Proyecto AVIS lacks a specific package to connect the web repository with the R-environment, and we believe that this fact has prevented scientists from using Proyecto AVIS information.
Description of the package
rAvis exclusively contains R code, which maximizes its portability across platforms, and it works in Unix-like and Windows operating systems. The rAvis functions have been optimized following the standard criteria for software quality [26,27] and they are accessible through GitHub (https://github.com/javigzz/rAvis). Bugs can be reported using GitHub: https://github.com/javigzz/rAvis/issues. rAvis is freely available on the Comprehensive R Archive Network, CRAN (http://cran.r-project.org/), and complete information about rAvis, its functions and their parameters is available in the package help. rAvis uses functions from other R-packages to get and plot the data stored in Proyecto AVIS. Namely, the R-libraries stringr [28], XML [29], tools [30], RCurl [31], scrapeR [32] and gdata [33] are used to download the bird observations; maptools [34], raster [35] and rgdal [36] to plot the GIS files; and, finally, scales [37] is used to plot bird occurrences with transparency.
Exploring Proyecto AVIS. We developed several functions to explore the database in an easy and visual way and other functions to download the selected information (see Table 1 and run the example). First, avisSpeciesSummary allows users to download a table with a summary of the records stored in Proyecto AVIS aggregated by species: number of observations of each species, number of individuals recorded, number of different UTMs (10×10 km) with observations, and number of birdwatchers that recorded the species. Second, avisContributorsSummary returns a table with a general summary of the records stored in the database. Finally, avisHasSpecies checks whether a species name exists in Proyecto AVIS, and avisMapSpecies then allows users to explore the distribution of the observations of the species by setting the name of the species and selecting the type of map: administrative boundaries ('admin') or physical map ('phys') (Figure 2).
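The exploration workflow described above (check that a species exists, then map it) can be sketched as follows. This is a minimal illustration, not code from the package: the wrapper `map_if_present` and its `has_species`/`map_species` arguments are hypothetical, and it assumes that avisHasSpecies takes a species name and returns a logical value and that avisMapSpecies takes the species name followed by the map type. Both assumptions should be checked against the package help.

```r
# Sketch only: assumes the rAvis package is installed. The wrapper and its
# argument names are hypothetical; only avisHasSpecies and avisMapSpecies
# come from rAvis, and their call shapes here are assumptions.
map_if_present <- function(name, maptype = "admin",
                           has_species = rAvis::avisHasSpecies,
                           map_species = rAvis::avisMapSpecies) {
  # The text names 'admin' and 'phys' as the two available map types
  maptype <- match.arg(maptype, c("admin", "phys"))
  if (!isTRUE(has_species(name))) {
    message("Species not found in Proyecto AVIS: ", name)
    return(invisible(FALSE))
  }
  map_species(name, maptype)   # draws the occurrence map
  invisible(TRUE)
}

# Example call (needs rAvis and an internet connection):
# map_if_present("Pica pica", maptype = "phys")
```

Passing the two rAvis functions as default arguments keeps the wrapper usable without a network connection when stand-in functions are supplied, which is convenient for trying out the control flow.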
For constructing the plots we used free GIS layers. We downloaded the Spanish administrative map from http://www.diva-gis.org/, the Spanish UTM map from the Spanish government online map repository http://bscw.rediris.es/pub/bscw.cgi/524254?client_size=13666580, and the Spanish physical map from http://www.openstreetmap.org/ using the R-library OpenStreetMap [38].
Advanced queries to Proyecto AVIS. We constructed two main functions to set flexible queries about the species occurrences and the birdwatcher observations: avisQuerySpecies and avisQueryContributor, respectively. These functions download the information stored in Proyecto AVIS, and are intended to be tuned by the users in relation to their specific objectives. Also, we programmed avisQuery as a flexible function to pass any argument allowed in the Proyecto AVIS database. We decided not to predefine queries or to pre-process the data because this would narrow the possibilities for research [12]. Instead, we allow the users to set their own queries to Proyecto AVIS. Arguments include taxonomic levels, like species, family, order; individual characteristics, like age, sex, breeding status; temporal filters, like year and month; or environmental filters, like habitat. Moreover, we added a UTM-latlong conversion to all queries. Thus, the position of the observations is given in two different formats: projected UTMs 10×10 km and geographic coordinates WGS84 (common latitude-longitude coordinates, which are not available in the current web application from Proyecto AVIS). We did not program more specific graphics or statistical analyses because we understand that the purpose of this package is to obtain the biological information stored in Proyecto AVIS and not to re-program statistical algorithms that are already available in other R-packages. We assume that R-users would employ different R-packages for calculating their own statistics and constructing their own plots (see the example).
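A user-defined query of the kind described above might be assembled as in the following sketch. The helpers `make_filters` and `query_species` are hypothetical, and the code assumes that avisQuerySpecies accepts a species name plus a named list of filters through an `args` argument, with filter names matching those mentioned in the text (for example year, month or habitat). These assumptions should be verified in the package help before use.

```r
# Sketch only: assumes rAvis is installed and that avisQuerySpecies()
# takes a species name and a named list of filters via `args`.
# make_filters() and query_species() are hypothetical helpers.
make_filters <- function(...) {
  filters <- list(...)
  # Drop unset (NULL) filters so only the requested ones are sent
  Filter(Negate(is.null), filters)
}

query_species <- function(name, ..., query_fun = rAvis::avisQuerySpecies) {
  query_fun(name, args = make_filters(...))
}

# Example call (needs rAvis and an internet connection):
# obs <- query_species("Bubo bubo", year = 2012, month = NULL)
# str(obs)   # inspect the returned columns rather than assuming their names
```

Inspecting the result with `str` before any analysis is a cautious choice here, since the column names of the returned table (including those holding the UTM and latitude-longitude positions) are not specified in the text.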

Conclusions
We have programmed rAvis, an R-package designed to help researchers explore and download the information stored in Proyecto AVIS. Thus, biogeographers, macroecologists and ornithologists working in spatial ecology or time series, in addition to researchers working on citizen science, can easily take advantage of the unique data stored in this database for their own research.