Finding Sequences for over 270 Orphan Enzymes

doi:10.1371/journal.pone.0097250

Figure 1.

Orphan enzymes break the link between sequence and function.

Amino acid sequence information connects knowledge about protein function in major databases (blue) and sequence-based predictive tools (green), enabling many critical tasks in contemporary biology (outer ring). The absence of sequence data for orphan enzymes that have been experimentally characterized (yellow) disconnects knowledge about those enzyme activities, their associated motifs, domains, and other sequence-linked traits from this family of databases and sequence-based predictive tools. This valuable information remains “trapped” in the literature, unavailable for genome annotation, for prediction, and for guiding hypothesis formation in bench biology.


Figure 2.

Three chokepoints generate orphan enzymes.

Starting from an enzyme activity that has been characterized in the lab (gray), there are three major chokepoints at which the activity can fail to be linked to sequence data, generating an orphan enzyme (yellow) rather than a sequenced enzyme with an associated EC number (purple). In the laboratory stage, an enzyme may not be sequenced because of issues such as the complexity of the sequencing process for that enzyme, a loss of researcher interest, or inadequate funds to pursue sequencing. When an enzyme has been sequenced, the sequence data may not be deposited in GenBank and other major sequence databases, even though they appear in a scientific publication. Finally, errors in depositing the sequence with those databases can prevent connection of the sequence data with the enzyme activity (see Figure 5).
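The three chokepoints can be read as an ordered series of checks. The following is a minimal sketch of that reading, not a tool from the study; the `EnzymeRecord` fields, the `Outcome` names, and the `classify` function are all hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for the three chokepoints and the successful path.
    NOT_SEQUENCED = "laboratory stage: enzyme was never sequenced"
    NOT_DEPOSITED = "deposition stage: sequence published but not deposited"
    NOT_LINKED = "annotation stage: deposited sequence not connected to the activity"
    SEQUENCED_ENZYME = "sequenced enzyme with an associated EC number"


@dataclass
class EnzymeRecord:
    # Hypothetical flags describing how far an enzyme got along the path.
    sequenced: bool
    deposited: bool
    linked_to_activity: bool


def classify(record: EnzymeRecord) -> Outcome:
    """Return the first chokepoint at which the record fails, if any."""
    if not record.sequenced:
        return Outcome.NOT_SEQUENCED
    if not record.deposited:
        return Outcome.NOT_DEPOSITED
    if not record.linked_to_activity:
        return Outcome.NOT_LINKED
    return Outcome.SEQUENCED_ENZYME
```

Under this sketch, every outcome other than `SEQUENCED_ENZYME` corresponds to an orphan enzyme, and the checks are ordered because a later stage cannot be reached without passing the earlier ones.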


Figure 3.

Putative orphan enzymes were evaluated via the literature and databases to find sequences or identification information.

Each putative orphan was evaluated via a multistep process relying on sequence databases, the literature, and patent databases. Each evaluation began by collecting all names for the enzyme activity. The BRENDA and MetaCyc databases, which link enzyme data to EC numbers, were then examined. At this and all subsequent steps, sequence data were collected when found. Documents were then collected, including texts cited in BRENDA and MetaCyc, texts found via PubMed search, and patents from the U.S. Patent and Trademark Office. Identification information (inset box) was collected from each publication. When available, peptide sequence data were collected and used in BLAST searches to attempt to identify the full protein sequence. When possible, identification information was used to predict candidate sequences for subsequent testing in the laboratory.
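The evaluation order described above can be outlined as a small bookkeeping routine. This is an illustrative sketch only: it performs no database, PubMed, patent, or BLAST queries, and the `Source` class, its fields, and `evaluate_orphan` are hypothetical names, not part of the study's tooling. The caller is assumed to have already gathered what each database entry or document contains.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    # One database entry, paper, or patent (hypothetical structure).
    label: str
    sequences: list = field(default_factory=list)       # full sequences found
    peptides: list = field(default_factory=list)        # peptide fragments found
    identification: dict = field(default_factory=dict)  # other identifying details


def evaluate_orphan(names, sources):
    """Pool findings from all sources and suggest the next step."""
    result = {
        "names": sorted(set(names)),
        "sequences": [],
        "peptides": [],
        "identification": {},
    }
    # Sequence data are collected at every step, whatever the source.
    for source in sources:
        result["sequences"].extend(source.sequences)
        result["peptides"].extend(source.peptides)
        result["identification"].update(source.identification)

    if result["sequences"]:
        result["next_step"] = "sequence found"
    elif result["peptides"]:
        result["next_step"] = "search peptides with BLAST"
    elif result["identification"]:
        result["next_step"] = "predict candidate sequences for laboratory testing"
    else:
        result["next_step"] = "unresolved"
    return result
```

A caller would list database entries first and documents afterwards, mirroring the order in the caption; the suggested next step depends only on the best kind of evidence found.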


Table 1.

Sequences were identified for 275 putative orphan enzymes, most frequently by fixing database errors.


Figure 4.

The end-to-end process of resolving an orphan enzyme may include literature searches, database searches, and laboratory work.

Beginning with a putative orphan enzyme (POE) (yellow), an investigator can maximize the likelihood of finding sequence data while minimizing effort by following a few steps. An immediate search of the OEP database will indicate whether the orphan is already recognized as such and give the researcher access to any data about that orphan enzyme that others have already collected, including whether it has been resolved (in which case the link between sequence and activity may simply not have been propagated to major sequence databases yet). The next steps are to carry out a literature and database evaluation of the orphan and then, potentially, to follow that with laboratory identification. It may be helpful to submit information about the orphan enzyme, including the fact that it exists as well as any supporting identification information, to the OEP web site at the two marked points in the process (OEP symbols). This makes the information available to others in the research community who may be able to help identify a sequence for the enzyme activity.


Figure 5.

Orphan creation can be avoided by ensuring each enzyme has an accession number and an EC number.

Starting with an enzyme with new sequence data (yellow and linked box), there is a well-defined set of steps to follow to avoid generating new orphan enzymes. Following this process ensures that each enzyme will be properly linked to a GenBank accession number and, whenever possible, an EC number. It also helps prevent two errors: assigning an incorrect EC number, and failing to assign an EC number when one that fits the enzyme activity is already available. Since the process of requesting and generating a new EC number can take 2+ months, papers may need to be submitted before an EC number exists, with the new EC number added in proof or via addenda (blue box). We recommend as a courtesy that researchers also update UniProt (blue asterisk) whenever the GenBank record for an enzyme is updated. Sequences and requests can be submitted to both GenBank and UniProt regardless of whether the data involved have been published in the peer-reviewed literature. Once an enzyme has both an associated accession number and an EC number, these should be used in all future annotations, publications, and database submissions related to the enzyme.
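The steps above amount to a short checklist. The sketch below is one hedged reading of it, assuming only what the caption states; the function name, its parameters, and the wording of the returned actions are hypothetical, and the identifiers passed in the usage example are placeholders rather than real records.

```python
def deposition_actions(accession=None, ec_number=None, existing_ec_fits=False):
    """List the outstanding steps for an enzyme with new sequence data.

    accession        -- GenBank accession number, or None if not yet deposited
    ec_number        -- EC number already assigned, or None
    existing_ec_fits -- True if an available EC number fits the activity
    """
    actions = []
    if accession is None:
        actions.append("submit the sequence to GenBank")
    if ec_number is None:
        if existing_ec_fits:
            actions.append("assign the existing EC number")
        else:
            # A new EC number can take 2+ months, so it may arrive after submission.
            actions.append("request a new EC number; add it in proof or via addenda")
    if actions:
        # Recommended as a courtesy whenever the GenBank record changes.
        actions.append("update UniProt")
    return actions


# Placeholder identifiers, for illustration only.
print(deposition_actions(accession="ACCESSION_PLACEHOLDER", ec_number=None))
```

An empty list means the enzyme already has both identifiers, which should then be used in all later annotations, publications, and database submissions.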
