Cyber-taxonomy of name usage has focused primarily on producing authoritative lists of names or cross-linking names and data across disparate databases. A feature missing from much of this work is the recording and analysis of the context in which a name was used, context which can be critical for understanding not only what name an author used, but also to which currently recognized species they actually refer. An experiment on recording contextual information associated with name usage was conducted for the fiddler crabs (genus Uca). Data from approximately one quarter of all publications that mention fiddler crabs, including 95% of those published prior to 1924 and 67% of those published prior to 1976, have so far been recorded in a database. Approaches and difficulties in recording and analyzing the context of name use are discussed. These results are not meant to be a full solution, but rather to highlight problems which have not been previously investigated and which may act as a springboard for broader approaches and discussion. Some data on the accessibility of the literature, including in particular electronic forms of publication, are also presented. The resulting data have been integrated for general browsing into the website http://www.fiddlercrab.info; the raw data and code used to construct the website are available at https://github.com/msrosenberg/fiddlercrab.info.
Citation: Rosenberg MS (2014) Contextual Cross-Referencing of Species Names for Fiddler Crabs (Genus Uca): An Experiment in Cyber-Taxonomy. PLoS ONE 9(7): e101704. https://doi.org/10.1371/journal.pone.0101704
Editor: Robert Guralnick, University of Colorado, United States of America
Received: January 15, 2014; Accepted: June 10, 2014; Published: July 8, 2014
Copyright: © 2014 Michael S. Rosenberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author has no support or funding to report.
Competing interests: The author has declared that no competing interests exist.
There are numerous projects focused on making literature on taxonomic names more accessible and useful. Taxonomy databases such as the World Register of Marine Species (WoRMS) and the Integrated Taxonomic Information System (ITIS) are focused primarily on providing authoritative lists of names, along with synonymy. The Biodiversity Heritage Library (BHL) is digitizing and providing open access to millions of pages of taxonomic literature. Projects such as BioNames attempt to link across major resources, including databases of texts, taxonomic names, and phylogenetic trees. The Global Biodiversity Information Facility (GBIF) tracks and links museum specimens with names and collection locations. Many cybertaxonomy projects have automated extraction of taxonomic names from electronically available literature at their core. The potential usefulness of these approaches and resources is huge. However, one area in which these projects explicitly fail is context. While they can discover/recognize that a specific name appears in a particular publication, they do not (and arguably, cannot) determine the context in which the name was used, and sometimes this context is extremely important for understanding both what the author meant and the currently recognized species to which they were actually referring. For example, an automated search of Hoffmann and Kingsley might discover that both publications use the species name Gelasimus marionis Desmarest, 1823. Cross-referencing this name against WoRMS would indicate that today this name is recognized as a junior synonym for the fiddler crab Uca vocans (Linnaeus, 1758). What it fails to discover is that Hoffmann was referring to a species found in Madagascar, while Kingsley was referring to a species found in the Philippines.
For the last 40 years, it has been recognized that what used to be called Uca vocans consists of a complex of closely related species; Uca vocans sensu stricto is found throughout parts of the western Pacific Ocean, while the species found in the Indian Ocean (including Madagascar) to which Hoffmann refers is today known as Uca hesperiae Crane, 1975. Because this is not simply an issue of synonymy and priority, without understanding the context in which the name was used it is difficult for automated approaches both to correctly identify the species that Hoffmann studied and to recognize that these two papers refer to different species as we understand them today.
In addition, most of the literature-based projects are preferentially focused (for good reasons) on taxonomic journals and papers. However, for greatest usefulness and coverage it will eventually be critical to include all literature, not just taxonomic literature, in these endeavors. The goal here is to resolve taxonomic uncertainty (accurately identify the correct species) not just in systematics, but in experimental studies as well. Without inclusion of taxonomic usage in experimental studies, we run the risk of not recognizing experimental variation due to phylogenetic variation, of biasing systematic reviews and meta-analyses through incorrect species designation, and of generally making comparative analyses more difficult. For example, the most widely studied fiddler crab in experimental work has likely been Uca pugilator (Bosc, 1802) [7; personal observation], a species with a geographic range that used to be thought to include the entire Atlantic coast of the United States, including the Gulf of Mexico, from Massachusetts through Texas. Based in part on the recognition of minor color morphs with variance in physiological response to experimental conditions, in 1974 U. pugilator was split into two species: the traditional form, located on the Atlantic coast from Massachusetts through northwestern Florida, and a new species, U. panacea Novak and Salmon, 1974, which overlaps with U. pugilator in northwestern Florida but extends west to Texas. Thus, experimental studies on “U. pugilator” which predate the recognition of U. panacea (or which are unaware of the taxonomic change) may or may not be recognized as the correct species, depending on where the specimens were collected (one of the primary biological supply companies which provide fiddler crabs for experimental studies is located right at the sympatric zone, further complicating the issue). Vernberg and Costlow reported metabolic differences in U. pugilator from Florida versus those from North Carolina and New York; the importance and interpretation of this variation change if it turns out that the Florida specimens are a different species. Generally, taxonomists focus on prior taxonomic literature and thus tend not to revise or comment upon taxonomic names found in experimental studies.
In an effort to resolve these types of problems, I conducted an experiment in cyber-taxonomy focused on identifying the context of name use. Throughout, the term “context” is used in a similar, although slightly broader, sense to that of concept taxonomy. It needs to be stated up front that this effort was an experiment and not meant to serve as a general approach to solving these issues. I was not trying to invent a system that would generally solve the problem; instead, my goal was to test an approach in recording, resolving, and parsing context. This report is intended to highlight issues which have not generally been discussed in the literature in the hope that it may help guide others interested in finding better approaches and solutions to these sorts of problems.
For this study I focused on the genus Uca, the fiddler crabs. It is of a relatively manageable size (102 extant species are currently recognized), with extensive literature, and a history of occasionally complex systematic confusion. Prior to this project a database with approximately 2,500 known references to the genus had already been constructed, with over half of the publications already collected in either paper or electronic form, allowing a solid starting point for working from the literature. Additionally, a long-standing website on the genus (http://www.fiddlercrab.info) provides a useful, established platform (>33,000 hits over the last year) for releasing the experimental cyber-taxonomy results which make up the focus of this study.
Materials and Methods
Although other contextual schemes were considered, it was eventually determined that taxonomic names were primarily used in one of four contexts: (1) reference to a specimen, (2) reference to a geographic location, (3) reference to a literature citation, and (4) without context. Because many publications were not available in electronic form and because context could not clearly be computationally determined, all data were recorded manually in a spreadsheet.
For each reference to a fiddler crab appearing in a publication, the following information was recorded (Fig. 1): (1) a unique key identifying the publication (generally a combination of author and year), which allows cross-referencing of the publication; (2) a key to identify each unique name used in the publication; this was later expanded to allow the distinct context of each name to be identified (or ignored when necessary) when the name was used in multiple contexts. This was necessary since citing authors often do not apply their use of a name to the entirety of contexts in which it was used in the original publication (see below); (3) the scientific name as used in the publication, with the exact spelling and capitalization preserved; (4) when applicable, the common name associated with the scientific name. A few publications used only common names, but were otherwise important for context or history (while sometimes interesting in their own right, common names are not the focus of this study); (5) where in the publication the scientific name occurs or is applied (e.g., page numbers, figures, plates); (6) the context of how the name was used (multiple columns of the spreadsheet, described in detail below); (7) either the correct species name as we recognize it today or an indication that the species should be determined through computational cross-referencing (see below); (8) general notes on the publication or the specific use of the name in that publication (for example, if a name was used as a type description).
The columns represent: (1) a unique key to identify a publication; (2) a numeric key for separating different names used in a single publication and in different contexts; (3) the exact name as used in the publication; (4) where in the publication the name occurs or is applied; (5) the context of the use; (6–7) additional information on the context, with details depending on the type of context (described in text); (8) the “actual species”: either the accepted species (as we now understand it) which the author was referring to or an equals sign (for citation contexts) indicating the accepted species should be computationally determined; (9) notes on the name usage. A period generally indicates no data (columns could not be left blank). Two additional columns of data were also recorded: the common name(s) used in the publication and notes on the publication in general. These columns were rarely used and were left out of the figure to save space. Specific records indicated with letters in circles are discussed further in the text.
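As a rough illustration of the nine-column layout described above, one spreadsheet row could be modelled as follows. This is a minimal sketch, not the structure used in the actual database: the class name `NameRecord`, the field names, the `Context` labels, the `parse_row` helper, the decimal key `3.2`, and the publication key `Hess1865` are all assumptions introduced for the example. Only the period (no data) and equals sign (resolve computationally) conventions are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence

MISSING = "."   # the text uses a period for "no data"
COMPUTE = "="   # the text uses an equals sign for "determine computationally"


class Context(Enum):
    # Labels are hypothetical; the text names only the four kinds of context.
    SPECIMEN = "specimen"
    LOCATION = "location"
    CITATION = "citation"
    NONE = "none"


@dataclass(frozen=True)
class NameRecord:
    pub_key: str         # column 1: publication key, e.g. author plus year
    name_key: str        # column 2: kept as text so 1.1 and 1.10 stay distinct
    name_as_used: str    # column 3: exact spelling and capitalization
    where: str           # column 4: pages, figures, plates
    context: Context     # column 5: type of context
    detail_1: str        # column 6: location, or cited publication key
    detail_2: str        # column 7: specimen number, or cited name key
    actual_species: str  # column 8: accepted species, or "="
    notes: str = MISSING # column 9: notes on the name usage

    def needs_resolution(self) -> bool:
        """True when the accepted species must be found by cross-referencing."""
        return self.actual_species == COMPUTE


def parse_row(fields: Sequence[str]) -> NameRecord:
    """Build a record from one nine-column row of text fields."""
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    pub, key, name, where, ctx, d1, d2, actual, notes = fields
    return NameRecord(pub, key, name, where, Context(ctx), d1, d2, actual, notes)


# Illustrative row only: a citation context pointing at name 2 of a cited work.
row = ["Macnae1966", "3.2", "Uca bellator", MISSING,
       "citation", "Hess1865", "2", COMPUTE, MISSING]
record = parse_row(row)
```

Storing the name key as text rather than as a number is a deliberate choice in this sketch, since a floating-point key would silently merge `1.1` and `1.10`.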
Most of the records in the database contain species level names, but data on specific discussions of generic and subgeneric names (even absent of species) were also recorded, since these are both generally important and of taxonomic interest. All spelling variants, including typographical errors, were maintained in the primary records to allow and demonstrate the degree of variation found in the literature. A separate table was constructed to allow matching of spelling and typographic synonymy to the accepted spelling. For example, the species name coarctata has also been recorded in the literature as coarctatus, coartatus, and corctata. The first of these is a deliberate variant based on taxonomic gender-matching rules with a genus (Gelasimus) of alternate gender; the latter two are mistakes due to either typographical errors or confusion by authors.
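The separate variant table can be pictured as a simple lookup from each recorded spelling to its accepted form. The sketch below is illustrative only: the dictionary name `SPELLING_VARIANTS` and the function `accepted_spelling` are hypothetical, and the entries are limited to the coarctata example given in the text.

```python
# Maps a spelling as recorded in the literature to the accepted spelling.
# Only the variants mentioned in the text are included here.
SPELLING_VARIANTS = {
    "coarctatus": "coarctata",  # deliberate gender-matched variant
    "coartatus": "coarctata",   # typographical error or author confusion
    "corctata": "coarctata",    # typographical error or author confusion
}


def accepted_spelling(epithet: str) -> str:
    """Return the accepted spelling, or the input unchanged if it is not a known variant."""
    return SPELLING_VARIANTS.get(epithet, epithet)
```

Because the primary records keep the original spelling, a lookup of this kind would be applied only when names are compared or grouped, leaving the recorded variants intact.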
The specific contextual data recorded depended on the type of context. For both specimen and location contexts (which in the end were largely treated identically), the geographic location associated with the specimen/location was recorded. When available for specimen contexts (which was rare, particularly for older publications), museum lot or specimen numbers were often recorded as well. The specimen context was generally reserved for explicit taxonomic studies and museum depositions. Experimental studies which used a species as study subjects but did not otherwise keep the specimens at the completion of the experiment were recorded as locations (based on where the specimens were obtained).
For citation contexts, the cited work was recorded (based on the publication key, Fig. 1, column 1), as well as the key indicating the name in the cited work to which the citation applies (Fig. 1, column 3). This key could be recorded as either a general citation to the use of that name, or to a specific contextual use in the original work. For example, an author (of work A) might use the name Uca pugilator and generally cite U. pugilator in an earlier work (work B); in this case, one can apply the citation to all contexts of U. pugilator in work B. In another case, an author (of work C) might use U. pugilator, but specifically cite only part of a previous publication (work D) (for example by using the phrase “in part” in a taxonomic context). In this case, the citation needs to specify only the relevant cited contexts. This distinction was recorded by using a combination of integer and decimal keys. Each unique name was given a different integer base as the key (starting with 1). When multiple contexts appeared in a paper, each context was given an additional decimal code to the key (1.1, 1.2, 1.3, etc.). A citing paper could be coded either with just the integer portion (referring to all contexts with that integer base) or with the full decimal key (referring only to that specific context). When a citing paper referred to multiple, but not all, contexts, independent citation entries were made for each cited context. In rare cases, citations to a publication were general and not specific to any internal context; these were coded by reserving the key zero for such citations. When the cited publication had not been added to the name database (whether due to lack of access or because it had yet to be recorded), cross-referencing to the specific context could not be determined and a period was used as a placeholder to indicate the missing data.
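The integer and decimal key scheme can be sketched as a small matching function. This is an illustration under stated assumptions, not the code behind the website: the function name `cited_contexts` and the representation of keys as text are choices made for the example.

```python
from typing import List

MISSING = "."   # cited work not yet in the name database
GENERAL = "0"   # general citation, not tied to any internal context


def cited_contexts(recorded_keys: List[str], cited_key: str) -> List[str]:
    """Return the keys in the cited work that a citation key refers to.

    recorded_keys: every name/context key recorded for the cited publication,
                   as text (e.g. "1", "2", "3.1", "3.2").
    cited_key:     the key written in the citing record.
    """
    # Placeholder and general citations point at no specific context.
    if cited_key in (MISSING, GENERAL):
        return []
    # A full decimal key refers to exactly one context.
    if "." in cited_key:
        return [k for k in recorded_keys if k == cited_key]
    # A bare integer refers to every context sharing that integer base.
    return [k for k in recorded_keys if k.split(".")[0] == cited_key]


# Hypothetical cited work with three names, the third used in two contexts.
keys = ["1", "2", "3.1", "3.2"]
all_contexts = cited_contexts(keys, "3")    # both contexts of name 3
one_context = cited_contexts(keys, "3.2")   # only the second context
```

A citing paper that refers to several but not all contexts would, following the text, carry one citation record per cited context, so each call resolves a single key.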
In Figure 1, Macnae uses Uca bellator (Adams and White, 1848) in two contexts: in the first (record A), he refers to a species found in Eastern Queensland, Australia; in the second (record B), he applies the name to a citation, specifically to name #2 found in Hess (record C): Gelasimus signatus Hess, 1865. If a later author applied a name to all of Macnae's uses of U. bellator, we would record that citation as Macnae1966 | 3 (referring to both