Figure 1.
Complexity of Type IV secretion system (T4SS) architecture and nomenclature.
(A) Model of the VirB/VirD P-T4SS encoded on the pTi plasmid of Agrobacterium tumefaciens. LPS = lipopolysaccharide, OM = outer membrane, M = murein layer, IM = inner membrane, C = cytoplasm. (B) Description of the VirB/VirD proteins. (C) Diversity encompassed by the major groups of T4SSs. P, P-T4SS: top = Rickettsia prowazekii (rvh) [31],[45], bottom = Helicobacter pylori (cag pathogenicity island, cag-PAI) [46]. Genes with homology to vir genes are colored accordingly. cag-PAI genes colored gray are not known to form the T4SS scaffold, while genes colored white are involved in T4SS function but have no clear homology to vir genes. F, F-T4SS: top = Escherichia coli (tra/trb of F plasmid), bottom = Neisseria gonorrhoeae (tra/trb of gonococcal genetic island). Capital letters depict tra genes while lowercase letters depict trb genes, with remaining genes given their full names. I, I-T4SS: top = tra/trb of the IncI plasmid R64, bottom = Legionella pneumophila (dot/icm) [47]. Capital letters depict icm and tra genes while lowercase letters depict dot and trb genes. GI, GI-T4SS: top = Haemophilus influenzae (tfc), bottom = Salmonella enterica Typhi (tfc). NOTE: Genes of F-, I- and GI-T4SSs with homology to vir genes are colored accordingly.
Table 1.
Corpus statistics for T4SS concepts: Bacteria, Cellular Component (Cell. Comp.), Biological Process (Bio. Process.), and Molecular Function (Molecular.Fn.).
Table 2.
Statistics for dictionaries extracted from domain-specific resources for each of the entity classes.
Table 3.
Entity recognition across classes, contrasting dictionary-based, dictionary-based with corpus enrichment, and machine learning strategies.
Table 4.
Number of terms in each entity class (Bacteria, Cellular Component, Biological Process, and Molecular Function) for T4SS, near-miss, and general documents.
Table 5.
Typological breakdown of entity mention variability into typographical, morphological, syntactic, reduction, and abbreviation classes.
Figure 2.
Comparison of the effect of normalization.
Classes of entity mention variability (Typographical, Morphological, Syntactic, Reduction, and Abbreviation) are compared across entity classes (Bacteria, Cellular Component, Biological Process, and Molecular Function). The graph shows the percentage reduction in unique strings contributed by each class of normalization process.
Table 6.
Impact of normalizing entity mentions, expressed as the reduction in the number of unique strings, broken down by entity class.