Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

doi:10.1371/journal.pcbi.1000605

Figure 1.

Enzyme superfamilies and their constituent functional families examined in this analysis.

Families analyzed in this work are shown organized by the superfamilies to which they belong. Names of superfamilies and families are from the SFLD. E.C. numbers are included where available. Dashes (—) are used for those families for which a full E.C. number has yet to be assigned. Each family is designated by a specific color and these mappings are also used in Figure 3 and Video S1. The number of sequences in each family that were analyzed from each database is listed; the total number of sequences analyzed from each database is also given.

Figure 2.

The misannotation analysis protocol.

Annotations determined to be incorrect are labelled with one of the following codes, depending on the type of misannotation: ‘No Superfamily Association’ (NSA); ‘Missing Functionally important Residue(s)’ (MFR); ‘Superfamily Association only’ (SFA); and ‘Below Trusted Cutoff’ (BTC). See Methods for a more detailed discussion of these definitions.

Figure 3.

Percent misannotation in the families and superfamilies tested.

The results are organized by superfamily: Panel A: enolase, B: crotonase, C: vicinal oxygen chelate, D: terpene cyclase, E: haloacid dehalogenase and F: amidohydrolase. Each panel depicts the percent misannotation for the superfamily in four plots, corresponding to the databases investigated. In each plot, the black bar denotes the average percent misannotation for that superfamily in that database. The percent misannotation for each family within the superfamily is given by a colored circle. The size of the circle provides an estimate of the number of sequences evaluated for that family (scaling in legend). An X through an open circle means that no sequences annotated with that function were retrieved from that database. The order of the families depicted for each superfamily is arbitrary but is consistent across all four plots. The colors of the family circles correspond to those used in Figure 1, which provides a mapping between these family colors and their gold standard functions.

Figure 4.

The change in misannotation over time in the NR database for the 37 families investigated.

Sequences are plotted by the year when they were originally deposited in the database (x-axis). The number of sequences (left y-axis, bar graph) found to be correctly annotated is shown in green. The number of sequences found to be misannotated is shown in red. The bars for each year represent only the sequences deposited into the database in that year. The fraction (right y-axis, line plot) of sequences deposited each year into the NR database that were misannotated is given by the open nodes, connected by the black line to aid in visualizing the overall trend. This fraction represents the number of sequences in the 37 test families predicted to be misannotated divided by the total number of sequences deposited each year from the test set, i.e. the sum of the sequences depicted in the red and green bars for each year.
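
The per-year fraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the record layout (deposit year plus a misannotation flag), and the example records are all assumptions made for the sketch.

```python
from collections import defaultdict


def misannotation_fraction_by_year(records):
    """Return {year: fraction misannotated} for (deposit_year, is_misannotated) pairs.

    The fraction is misannotated / (misannotated + correct), i.e. the red bar
    divided by the sum of the red and green bars for that year.
    """
    counts = defaultdict(lambda: {"correct": 0, "misannotated": 0})
    for year, is_misannotated in records:
        key = "misannotated" if is_misannotated else "correct"
        counts[year][key] += 1

    fractions = {}
    for year in sorted(counts):
        red = counts[year]["misannotated"]
        green = counts[year]["correct"]
        fractions[year] = red / (red + green)
    return fractions


# Made-up records for illustration only; these are not data from the study.
example_records = [
    (2000, False), (2000, True),
    (2001, True), (2001, True), (2001, True), (2001, False),
]
fractions = misannotation_fraction_by_year(example_records)
```

Only sequences deposited in a given year contribute to that year's fraction, so the values are not cumulative.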

Figure 5.

Distribution of major types of misannotation found in the NR database.

Classification of misannotated sequences follows the steps of the protocol given in Figure 2: ‘No Superfamily Association’ (NSA); ‘Missing Functionally important Residue(s)’ (MFR); ‘Superfamily Association only’ (SFA); and ‘Below Trusted Cutoff’ (BTC), as described in Methods. The codes were grouped into two sets that specify whether the misannotation is associated with overprediction or with other types of errors (e.g., missing a required residue).

Table 1.

Examples of predicted misannotations in the NR database.

Figure 6.

Network view of a misannotated sequence.

The protein similarity network shows clustering of sequences from an all-by-all BLAST analysis of a subgroup of the enolase superfamily. Light grey nodes (circles): unknown function; dark grey nodes: sequences annotated in the SFLD but not examined in this analysis; colored nodes: sequences colored by SFLD annotation (as designated in Figure 1, enolase superfamily). Squares represent proteins that have been experimentally characterized, and colored circles represent those in which residues known to be important for function and other characteristics for that specific family are conserved. Edges (lines) show BLAST connections between sequences that have an E-value of 10⁻⁵⁰ or better. Edge lengths reflect similarity: sequences in tightly clustered groups are more similar to each other than sequences with few and distant connections. The sequence annotated in GenBank as a mandelate racemase (gi|17987990, yellow dot) clusters with fuconate dehydratases (red cluster), suggesting that it should be annotated as a fuconate dehydratase instead of as a mandelate racemase. The blue cluster containing two characterized mandelate racemases is not close to the fuconate dehydratase cluster, providing further evidence that this sequence is not a mandelate racemase.
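
The thresholding step behind such a network can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the pipeline used for the figure: the hit format `(query_id, subject_id, e_value)`, the function names, the sequence identifiers, and the E-values are all hypothetical. Only the 10⁻⁵⁰ cutoff comes from the caption.

```python
from collections import Counter, defaultdict

E_VALUE_CUTOFF = 1e-50  # edges are kept only for E-values at least this good


def build_network(blast_hits, cutoff=E_VALUE_CUTOFF):
    """Build an undirected adjacency map from (query_id, subject_id, e_value) hits."""
    adjacency = defaultdict(set)
    for query_id, subject_id, e_value in blast_hits:
        # Skip self-hits and hits weaker than the cutoff (a larger E-value is worse).
        if query_id != subject_id and e_value <= cutoff:
            adjacency[query_id].add(subject_id)
            adjacency[subject_id].add(query_id)
    return dict(adjacency)


def neighbor_label_counts(network, node, labels):
    """Count the family labels of a node's direct neighbors that have a label."""
    return Counter(labels[n] for n in network.get(node, ()) if n in labels)


# Hypothetical identifiers and E-values, for illustration only.
hits = [
    ("query_seq", "fucD_1", 1e-80),
    ("query_seq", "fucD_2", 1e-65),
    ("fucD_1", "fucD_2", 1e-90),
    ("query_seq", "MR_1", 1e-20),  # too weak: no edge is drawn
    ("MR_1", "MR_2", 1e-70),
]
labels = {
    "fucD_1": "fuconate dehydratase",
    "fucD_2": "fuconate dehydratase",
    "MR_1": "mandelate racemase",
    "MR_2": "mandelate racemase",
}
network = build_network(hits)
counts = neighbor_label_counts(network, "query_seq", labels)
```

In this toy input, the query's only neighbors above the cutoff carry the fuconate dehydratase label, which mirrors the kind of evidence the figure presents. Counting neighbor labels is a simplification chosen for the sketch; the figure itself relies on visual clustering in the network layout.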
