The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text

Figure 2

Precision and recall for separate S800 categories.

Because seven of the eight categories in the S800 corpus are taxonomic (the eighth is not), the corpus can provide insight into which types of species are hard to identify in text and which are easy. Plotting precision and recall separately for each of the seven taxonomic categories, for both the LINNAEUS and the SPECIES tagger, shows little difference between the taggers but large differences between categories. Both methods are considerably worse at tagging names of viruses than at tagging names of cellular organisms, and bacterial and fungal species, for which Linnaean nomenclature is primarily used, are the easiest to identify in text.
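The per-category breakdown described above can be illustrated with a minimal sketch. It is not the evaluation code used for the figure. It assumes that each document belongs to exactly one category, that a mention is represented as a hypothetical (document ID, start offset, end offset) tuple, and that a predicted mention counts as correct only if it matches a gold-standard mention exactly. The function name, the data layout, and the toy values are all illustrative.

```python
def per_category_precision_recall(gold, predicted, doc_category):
    """Compute (precision, recall) per category.

    gold, predicted: sets of (doc_id, start, end) mention tuples (assumed layout).
    doc_category: dict mapping each doc_id to its single category (assumption).
    """
    scores = {}
    for category in sorted(set(doc_category.values())):
        # Restrict both mention sets to documents of this category.
        gold_c = {m for m in gold if doc_category[m[0]] == category}
        pred_c = {m for m in predicted if doc_category[m[0]] == category}
        # Exact-match true positives: mentions present in both sets.
        true_positives = len(gold_c & pred_c)
        precision = true_positives / len(pred_c) if pred_c else 0.0
        recall = true_positives / len(gold_c) if gold_c else 0.0
        scores[category] = (precision, recall)
    return scores


# Toy data, invented purely for illustration.
doc_category = {"d1": "bacteria", "d2": "bacteria", "d3": "virus", "d4": "virus"}
gold = {
    ("d1", 0, 16), ("d1", 40, 52), ("d2", 5, 20),
    ("d3", 10, 13), ("d3", 30, 45), ("d4", 0, 9), ("d4", 22, 31),
}
predicted = {
    ("d1", 0, 16), ("d1", 40, 52), ("d2", 5, 20), ("d2", 60, 70),
    ("d3", 10, 13), ("d4", 50, 58),
}

scores = per_category_precision_recall(gold, predicted, doc_category)
for category, (precision, recall) in scores.items():
    print(f"{category}: precision={precision:.2f} recall={recall:.2f}")
```

Running the same function on the output of two taggers and comparing the resulting dictionaries gives the kind of tagger-by-category comparison that the figure summarizes.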

