Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases

doi:10.1371/journal.pone.0107510

Figure 1.

Sample species-level algorithm performance (as true vs. false hits returned) with increasing threshold settings.

Performance of five dynamic algorithm variants using the CAAB expert misspellings species dataset at a range of thresholds. Horizontal alignment and scales are chosen to emphasize general similarities in response patterns between otherwise different algorithms. Curves for LD and DLD lie under that for MDLD where not visible.

Figure 2.

Overall schematic of an optimized algorithm for comparing taxonomic names as described above and implemented in Taxamatch.

Figure 3.

Example search result from the author's Taxamatch-enabled IRMNG data search as at May 2013.

Sample result screen from a Taxamatch-enabled search via the current (2013) implementation of the IRMNG database web search interface, using as input the misspelled name Halymenia dilitata Zanardini, an error for Halymenia dilatata (genus exact match, ED 1 near match in species epithet). Note the additional return in this case of the false hit Halymenia digitata J. Agardh at the same edit distance but with a poorer match on authority (0.17 vs. 0.88; species ordering is alphabetic). Multiple near-match genera are also returned, even where no near-match species is currently held in those genera, since on occasion the desired target species may be missing from the reference database while the target genus is present.
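
To illustrate why both epithets come back at the same edit distance, the following is a minimal sketch of plain Levenshtein distance (LD) applied to the epithets named in this caption. It is not the Taxamatch implementation, which uses a modified measure (MDLD) together with further filtering; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions.

    Illustrative only; Taxamatch itself is described as using a modified
    variant (MDLD), which this sketch does not attempt to reproduce.
    """
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


query = "dilitata"                      # misspelled epithet from the caption
candidates = ["dilatata", "digitata"]   # intended target and false hit
for name in candidates:
    print(name, levenshtein(query, name))
# dilatata 1
# digitata 1
```

Because edit distance alone cannot separate the two candidates, a secondary signal such as the authority similarity quoted in the caption is needed to rank them.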

Figure 4.

Species level precision:recall curves for all algorithms tested, as binomial names, means of three datasets (data from Table 1).

Performance of five dynamic algorithm variants, three phonetic algorithms and two Taxamatch variants, using the three available misspellings datasets for species, at a range of thresholds in the case of the dynamic algorithms. Performance of Taxamatch and of the Taxamatch ‘no shaping’ variant is highlighted in blue and red circles, respectively. Data values closest to 1,1 (upper right corner) indicate the best performing setting (maximum effectiveness) for a given algorithm. (Note that at this scale, curves for certain variants, e.g. LD and DLD, lie behind others, e.g. MDLD, in some places; similarly for bigrams versus trigrams.)

Figure 5.

Genus level precision:recall curves for all algorithms tested, means of four datasets (data from Table 1).

Performance of five dynamic algorithm variants, three phonetic algorithms and two Taxamatch variants, using the four available misspellings datasets for genus names only, at a range of thresholds in the case of the dynamic algorithms. Performance of Taxamatch and of the Taxamatch ‘no shaping’ variant is highlighted in blue and red circles, respectively. As in previous Figures, data values closest to 1,1 (upper right corner) indicate the best performing setting (maximum effectiveness) for a given algorithm.

Figure 6.

Genus-level precision values for Taxamatch (both variants) as a function of genus length.

Data shown are from the four genus-only datasets combined. Superimposed columns indicate distribution of target genus names in the reference database as a function of genus length. The two most frequent lengths of genera in the reference (IRMNG) database at this time (n = 62,228 and 64,062 for genus lengths 10 and 11 characters, respectively) together comprise 29.7% of all genus names in the database, excluding known misspellings, nomina nuda, later usages and virus genera.

Table 1.

Overall algorithm performance as recall, precision and effectiveness (F₁ measure) – mean values of all datasets tested (3 for species, 4 for genera).
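
For readers unfamiliar with the three measures named in this caption, the sketch below computes recall, precision and F₁ from hit counts using their standard definitions. The function name and the counts in the usage example are hypothetical and are not taken from Table 1.

```python
def precision_recall_f1(true_hits: int, false_hits: int, missed: int):
    """Return (precision, recall, F1) from raw counts.

    true_hits  : correct targets returned by the algorithm
    false_hits : incorrect names returned
    missed     : correct targets that were not returned
    """
    returned = true_hits + false_hits
    relevant = true_hits + missed
    precision = true_hits / returned if returned else 0.0
    recall = true_hits / relevant if relevant else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not data from the paper).
p, r, f1 = precision_recall_f1(true_hits=90, false_hits=10, missed=10)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Since F₁ is a harmonic mean, an algorithm with near-complete recall but reduced precision can score below one that balances the two.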

Table 2.

Sample execution times by algorithm: selected species tests, tested against all IRMNG species (1.67 million names) in May 2013.

Table 3.

Sample execution times by algorithm: selected genus tests, tested against all IRMNG genera (465 k names) in May 2013.

Figure 7.

Species-level recall by error type for each algorithm tested in the present study.

Values for the dynamic/variable threshold algorithms are derived from their ‘best’, i.e. most effective settings, for all species data pooled (data from Table 4).

Table 4.

Algorithm recall (on 0–1 scale) by error type for all species data pooled (n = 2,859); dynamic algorithms used at their ‘best’ settings (maximum F₁ value) as determined from Table 1.

Figure 8.

Species-level effectiveness (F₁) for all algorithms at all settings, disaggregated by dataset.

Note variation in F₁ value with varying threshold setting for each of the dynamic algorithms, with peak at setting 0.85 (bigrams), 0.80 (trigrams) and ED 2 for the LD, DLD and MDLD tests.
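
As a rough illustration of what a threshold setting means for an n-gram test, the sketch below scores two strings by the Dice coefficient over character bigrams and keeps candidates at or above a chosen threshold. This is an assumption made for illustration: the exact n-gram similarity formula used in the study is not stated in this caption and may differ, and the names `ngrams`, `dice_similarity` and `near_matches` are hypothetical.

```python
from collections import Counter


def ngrams(s: str, n: int = 2) -> Counter:
    """Multiset of character n-grams in s (case-insensitive)."""
    s = s.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets, in the range 0..1.

    Assumed measure for illustration; not necessarily the one used
    in the study.
    """
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2 * shared / total


def near_matches(query: str, names: list[str], threshold: float, n: int = 2):
    """Return names whose similarity to query is at or above threshold."""
    return [name for name in names
            if dice_similarity(query, name, n) >= threshold]


score = dice_similarity("dilitata", "dilatata")
print(round(score, 3))
```

Raising the threshold returns fewer candidates (higher precision, lower recall), which is the trade-off that produces a peak in F₁ at an intermediate setting.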

Figure 9.

Genus-level effectiveness (F₁) for all algorithms at all settings, disaggregated by dataset.

Note variation in F₁ value with varying threshold setting for each of the dynamic algorithms, with peak at setting 0.80 (bigrams), 0.75 (trigrams) and ED 1 for the LD, DLD and MDLD tests. Taxamatch values are slightly depressed by this metric compared with some of the ‘best’ dynamic algorithms on account of sacrificing some precision for close to 100% recall (cf. Figure 5).

Table 5.

Summary of current Taxamatch-enabled taxonomic data systems known to the author as at August 2013.
