TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature

doi:10.1371/journal.pone.0305475

Table 1.

Corpus document selection query to PubMed with wheat breeding, genetics and species criteria.

Fig 1.

Example of annotation of single-word trait and wheat species in a text excerpt from the TaeC corpus.

The WTO class associated with the earliness term in the text is plant precocity (WTO_0000100). The NCBI taxonomy class associated with the wheat term in the text is Triticum aestivum (TaxID_4565).

Fig 2.

Example of annotation of complex trait mention in a text excerpt from the TaeC corpus.

Fig 3.

Screenshot of the AlvisAE annotation editor used to manually annotate TaeC with many different annotations.

Table 2.

Statistics of the TaeC corpus: number of annotated entities and classes per type.

Table 3.

Examples of Wheat Trait and Phenotype Ontology (WTO) labels and the corresponding text mentions in publications.

Table 4.

File.txt contains the text of the document.

File.a1 contains the named entity annotations: the internal identifier of each named entity, its type (Trait or Species), its position, and its text form. File.a2 contains the semantic annotations: the internal identifier of each semantic annotation, the name of the reference (NCBI taxonomy or WTO), the identifier of the named entity as declared in File.a1, and the external identifier of the class in the reference.
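The file layout described above can be illustrated with a minimal Python sketch. The exact line syntax is an assumption here: the sketch supposes a tab-separated standoff layout (`T1<TAB>Trait 4 13<TAB>earliness` in File.a1 and `N1<TAB>WTO Annotation:T1 Referent:WTO_0000100` in File.a2), which may differ from the files actually distributed with TaeC. The sample sentence, the names `Entity`, `Link`, `parse_a1`, and `parse_a2`, and the `NCBI_Taxonomy` reference label are hypothetical; only the class identifiers are taken from the Fig 1 caption.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    ident: str   # internal identifier of the named entity, e.g. "T1"
    etype: str   # entity type, e.g. "Trait" or "Species"
    start: int   # character offset where the mention starts
    end: int     # character offset just past the mention
    text: str    # surface form of the mention


@dataclass
class Link:
    ident: str      # internal identifier of the semantic annotation
    reference: str  # name of the reference, e.g. "WTO"
    entity_id: str  # identifier of the entity declared in File.a1
    class_id: str   # external identifier of the class in the reference


def parse_a1(content: str) -> dict:
    """Parse assumed File.a1 lines: ID<TAB>TYPE START END<TAB>TEXT."""
    entities = {}
    for line in content.splitlines():
        if not line.strip():
            continue
        ident, type_span, text = line.split("\t")
        etype, start, end = type_span.split(" ")
        entities[ident] = Entity(ident, etype, int(start), int(end), text)
    return entities


def parse_a2(content: str) -> list:
    """Parse assumed File.a2 lines: ID<TAB>REF Annotation:Tn Referent:CLASS."""
    links = []
    for line in content.splitlines():
        if not line.strip():
            continue
        ident, body = line.split("\t")
        reference, annotation, referent = body.split(" ")
        links.append(Link(ident, reference,
                          annotation.split(":", 1)[1],
                          referent.split(":", 1)[1]))
    return links


# Made-up sample data, for illustration only.
txt = "The earliness of wheat was measured."
a1 = "T1\tTrait 4 13\tearliness\nT2\tSpecies 17 22\twheat\n"
a2 = ("N1\tWTO Annotation:T1 Referent:WTO_0000100\n"
      "N2\tNCBI_Taxonomy Annotation:T2 Referent:4565\n")

entities = parse_a1(a1)
links = parse_a2(a2)
for link in links:
    ent = entities[link.entity_id]
    print(ent.text, ent.etype, link.reference, link.class_id)
```

Keeping entities in a dictionary keyed by their internal identifier makes it straightforward to join each File.a2 annotation back to the mention it normalizes.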

Table 5.

Performance of the rule-based methods AlvisTaxa and ToMap on TaeC for the named entity recognition and named entity linking tasks on the species, trait, and phenotype types.

Strict and relaxed match measures of the entity span and of the class are shown in columns (1) and (2), respectively.

Table 6.

Performance evaluation of RoBERTa and BioBERTa machine learning methods for the recognition of phenotype, trait, and species entities.

Performance is measured by precision, recall, micro-F1 measure and Jaccard index as a relaxed measure. The best performance of the two methods is shown in bold.
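To make the difference between strict and relaxed scoring concrete, here is a small Python sketch. It is an assumption about how such measures are commonly computed, not the evaluation code used for TaeC: strict matching counts a prediction as correct only if its offsets and type equal a gold annotation, while the Jaccard index scores the character overlap between two spans. The function names and the example spans are hypothetical.

```python
def span_jaccard(a, b):
    """Jaccard index of two character spans given as (start, end) pairs."""
    (a_start, a_end), (b_start, b_end) = a, b
    inter = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union else 0.0


def strict_prf(gold, pred):
    """Precision, recall and F1 with exact (start, end, type) matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # predictions identical to a gold annotation
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up annotations: the second prediction overshoots the gold span.
gold = [(4, 13, "Trait"), (17, 22, "Species")]
pred = [(4, 13, "Trait"), (17, 29, "Species")]

print(strict_prf(gold, pred))               # strict: only the first span counts
print(span_jaccard((17, 22), (17, 29)))     # relaxed: partial credit for overlap
```

In this example the strict measure treats the overshooting prediction as a plain error, whereas the Jaccard index credits it with 5 overlapping characters out of a 12-character union.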

Table 7.

Performance evaluation of RoBERTa and BioBERTa methods for the recognition of characteristics and species entities.

Performance is measured by precision, recall, micro-F1 measure and Jaccard index as a relaxed measure. The best performance of the two methods is shown in bold.

Table 8.

Performance evaluation of the BioSyn and C-Norm NEL methods on entity linking, i.e. the prediction of the phenotype and trait class from entities.

The performance is measured by precision, recall, micro-F1 measure and Wang similarity as a relaxed measure. The best performance of the two methods is shown in bold.

Table 9.

Performance of the combined NER and NEL methods RoBERTa + C-Norm on phenotype and trait entities.

Performance is measured by micro-F1 measure and Wang similarity as a relaxed measure.

Table 10.

Example names and descriptions of cultivars from the literature, illustrating the complexity of their annotation.
