Toll-Like Receptor Signaling in Vertebrates: Testing the Integration of Protein, Complex, and Pathway Data in the Protein Ontology Framework

The Protein Ontology (PRO) provides terms for and supports annotation of species-specific protein complexes in an ontology framework that relates them both to their components and to species-independent families of complexes. Comprehensive curation of experimentally known forms and annotations thereof is expected to expose discrepancies, differences, and gaps in our knowledge. We have annotated the early events of innate immune signaling mediated by Toll-Like Receptor 3 and 4 complexes in human, mouse, and chicken. The resulting ontology and annotation data set has allowed us to identify species-specific gaps in experimental data and possible functional differences between species, and to employ inferred structural and functional relationships to suggest plausible resolutions of these discrepancies and gaps.


Introduction
Diverse electronic databases now play central roles in storing, integrating, and analyzing information relevant to human biology. UniProt maintains definitive catalogs of the properties of human proteins and those of model organisms widely used in biomedical research [1]. Model organism databases like the Mouse Genome Database generate comprehensive catalogs of genes, functional RNAs and other genome features as well as heritable phenotypes, and curate phenotype annotations including associations of model systems with human diseases [2]. Biological pathway resources like the Reactome Knowledgebase [3] record the molecular details of processes within the human organism. These processes, decomposed into reactions, yield a network of molecular transformations that is an extended version of a classic metabolic map. Pathways identify routes connecting proteins and small molecules within the map.
Reactome and other pathway resources are rich sources of complex information curated by experts and stored in data structures developed to meet the needs of their core user communities. This richness and specialization, however, is also a limitation. The unique organization of each resource makes it difficult to integrate and analyze data across resources. Biomedical ontologies provide tools that can address these problems. These ontologies provide rigorous, unambiguous descriptions of biological objects and of the relationships among them using standardized and well-understood formats. Ontology structures enable the development of powerful computational tools that can reliably integrate and, through both rational and statistical methods, analyze the large, diverse sets of experimental data curated by independent groups of experts and stored in independent electronic databases. Within the OBO Foundry model, ontologies have been developed to describe orthogonal features of biology, but to a common standard to ensure interoperability [4]. Such ontologies link diverse structural and functional annotations into a single, coherent logical frame. Reasoning tools can identify discrepancies in represented data and suggest plausible attributes for entities that have not been experimentally studied.
GO, the Gene Ontology [5,6], provides structured controlled vocabularies of biological terms that describe the molecular functions of gene products, their roles in biological processes, and their organization into cellular components. PRO, the Protein Ontology [7][8][9][10], captures the gene products themselves, including evolutionary families of proteins and, within each family, canonical and modified forms of proteins ("proteoforms"), the complexes they form, and their relationships. These PRO annotations link canonical species-independent forms of these entities to species-specific forms and variants.
In this work, we propose that PRO can aid the integration of disparate data and enable biologically sound inferences. As a proof-of-concept, we analyzed innate immune signaling data from different organisms (human, chicken and mouse) and sources (Reactome and Center for Computational Immunology). We studied whether Reactome's annotations for human and chicken proteins and complexes involved in innate immune signaling [11] can be imported into formal annotations of proteins and complexes in PRO in a way that supports inferences of complex formation, subcellular localization, and roles in biological processes for corresponding mouse proteins catalogued by the Center for Computational Immunology [12].
The innate immune systems of humans and mice have both been extensively characterized, so this exercise has allowed us to test the reliability of annotations in one species for predicting complex formation, subcellular location, and function in the other, and to identify true differences in the signaling processes between the two species. Where experimental data exist only for one species, we have asked whether the PRO evolutionary family framework supports plausible inferences to fill gaps.

The innate immune signaling system
The innate immune system is an evolutionarily ancient signaling mechanism that provides an initial defense against invading microorganisms. Pattern recognition receptors (PRRs) expressed either on the cell surface or in the cytoplasm recognize microbe-associated molecular patterns (MAMPs) [13][14][15]. MAMP binding to a PRR triggers a signaling cascade that can result in the production of cytokines and other molecules that mediate inflammation. Several well-conserved PRR families have been identified [16]. Of these, the TLR family is the best characterized in terms of known ligands and downstream signaling pathways [17][18][19][20]. The first member of the Toll gene family was identified in Drosophila and shown to play a role in embryonic dorsal-ventral patterning [21]. The Drosophila Toll gene family was later shown to be critical for anti-fungal and antibacterial responses [22,23]. Homologs of the Drosophila Toll protein have been identified in many other species.

TLR Protein Family
The TLR protein family (PRO PR_000001096) contains six subfamilies with distinct ligand specificities and signaling properties [12,24,25]. Despite a wide range of ligands, TLRs share common structural features: a large extracellular domain (ECD), a transmembrane domain and a cytoplasmic Toll/Interleukin 1 receptor (TIR) domain. The ECD in turn consists of a varying number of leucine-rich repeats (LRR) and is responsible for MAMP recognition. The TIR domain (Pfam PF01582) interacts with downstream proteins when the ECD is activated by MAMP binding. Phylogenetic analysis of ECDs suggests that these sequences have evolved relatively rapidly in a process driven by the positive selection imposed by changing microorganisms, while TIR domains have evolved more slowly under purifying selection. TIR domains appear to have co-evolved with the intracellular adaptor molecules with which they interact [12].
For this study, we have focused on initial steps of the signaling cascades initiated by interactions of the well-studied TLR3 and TLR4 receptors with their ligands (Fig 1). These receptors share common steps in the signaling cascade but are distinct in complex composition and the initial steps of signaling. TLR3 is associated with endosomal membranes and is implicated in the recognition of intracellular viral dsRNA. TLR4 is associated with the plasma membrane and is predominantly activated by extracellular lipopolysaccharide (LPS) derived from bacteria. Both TLR3 and TLR4 utilize TIR-domain-containing adapter-inducing interferon-β (TRIF) to signal from the endosomal compartment. TRIF-mediated signaling is essential for IFN regulatory factor (IRF)-dependent production of type I IFN. While TLR3 signals exclusively through adaptor TRIF, TLR4 can also utilize myeloid differentiation primary response 88 protein (MyD88) from its plasma membrane location. The MyD88-dependent pathway is shared by all TLR receptors except TLR3, leading to production of proinflammatory cytokines.

Methods
PRO captures continuant properties of proteins and protein complexes such as the covalent modifications that differentiate the modified forms of a protein from one another and the identities and numbers of copies of the components of a protein complex [7,8]. To describe the roles of proteins and complexes in the biological transformations that make up a pathway, however, it is also necessary to capture their occurrent properties: the molecular functions that these proteins exercise, the biological processes in which they participate, and the subcellular locations that they may occupy.
Previous work within the Reactome project [11] and under the auspices of the Center for Computational Immunology [12,26] has yielded catalogs of human, mouse, and chicken proteins and complexes involved in TLR signaling. Reactome annotations have also associated functions and subcellular locations with these proteins. PRO terms have been generated for entries in these catalogs and they have been cross-referenced to entries in Reactome, to the canonical forms of proteins in UniProt, and to entries for small molecules in CHEBI [27]. Annotated reactions and associated input and output physical entities are compiled in the supporting information associated with this paper (S1 Table); PRO, Reactome, UniProt and CHEBI terms for physical entities are shown in Tables 1 and 2.
Both Reactome and the Center for Computational Immunology provided tab-delimited files of complexes, components and functional annotations which were used as the starting point to create PRO terms for complexes. A PRO curator reviewed the evidence for the complexes and their components and also aligned the equivalent complexes between human, mouse and chicken. The PRO curator also i) mapped the complexes to the most appropriate GO protein complex term as a parent, or created PRO complex terms to link the complexes when needed; and ii) added PRO terms for all the complex components when these were not in the ontology. The details of the curation protocol used in this work are available online [28]. Any discrepancy was reported back to the groups, re-evaluated, and resolved. The final content in PRO for the TLR set has been agreed between the different parties.
To annotate GO molecular function, biological process, and cellular location properties of these proteins and complexes, we have also used relations from the OBO Foundry Relation Ontology (RO) [29,30] within the PRO framework [7][8][9].

Function
To annotate functions of instances of proteins and complexes, we link PRO terms for these entities to GO terms for molecular functions with the RO relation has_function (Fig 2; stanzas are in PAF format as described previously [7], and phrases that capture function, process, and location annotations are highlighted in red). For example:

Entity1 PRO has_function GO:####### GO
where Entity1 is a protein or complex annotated in PRO and GO:####### is a molecular function term defined in GO. This assertion is incorporated into the PRO PAF entry for a modified CD14 isoform as shown in Fig 2A. Instances of complexes are annotated in the same way as individual proteins. For example: IRF7-P:IRF7-P complex (human) (PR:000027086) has_function sequence-specific DNA binding transcription factor activity (GO:0003700).
Here and below, complexes are named by listing their constituent proteins separated by colons [31].
In addition, the PRO framework enables representation of the molecular functions of components that have distinct roles within a complex, by creation of a term for the subtype of protein that is part of such a complex:

Complex1 PRO has_component Protein1 PRO AND Protein1 PRO has_function GO:####### GO
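The has_function pattern, including the chained has_component form for components of a complex, can be sketched as a small list of triples. This is only an illustration of the assertion shapes described above, not PRO's actual storage format; the `Annotation` type and both helper functions are hypothetical names, and `Complex1`, `Protein1`, and `GO:#######` are the placeholders used in the text rather than real identifiers.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    subject: str   # PRO term for a protein or complex
    relation: str  # RO relation, e.g. has_function or has_component
    obj: str       # GO term or PRO term


def functions_of(annotations, entity):
    """Return GO terms directly linked to `entity` by has_function."""
    return [a.obj for a in annotations
            if a.subject == entity and a.relation == "has_function"]


def component_functions(annotations, complex_id):
    """Map each component of `complex_id` to its has_function terms."""
    components = [a.obj for a in annotations
                  if a.subject == complex_id and a.relation == "has_component"]
    return {c: functions_of(annotations, c) for c in components}


annotations = [
    # Taken from the example in the text.
    Annotation("PR:000027086", "has_function", "GO:0003700"),
    # Placeholders from the text, not real identifiers.
    Annotation("Complex1", "has_component", "Protein1"),
    Annotation("Protein1", "has_function", "GO:#######"),
]

print(functions_of(annotations, "PR:000027086"))
print(component_functions(annotations, "Complex1"))
```

Keeping the function on the component term, rather than on the complex, is what lets a query distinguish the role of one subunit from the role of the assembled complex.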

Process
To annotate the involvement of instances of TLR proteins and complexes in signaling processes, PRO terms for entities are associated with GO biological process terms with the RO relation participates_in. For example (Fig 2B), traf6:ticam1:activated TLR3 complex (human) (PR:000037344) participates_in MyD88-independent toll-like receptor signaling pathway (GO:0002756).

Location
Cellular localization is annotated by relating the PRO term for a physical entity to a GO cellular component term. For example (Fig 2C), IRF7 unphosphorylated 1 (PR:000037791) located_in cytoplasm (GO:0005737). Similarly, IRF7-P:IRF7-P complex (human) (PR:000027086) located_in nucleoplasm (GO:0005654).
While this is an ontological assertion about a cellular entity rather than about a protein type, inclusion of this assertion allows the ontology to be queried to identify the cellular compartment or compartments in which a process occurs.
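The kind of query this enables, finding the compartments in which a process occurs, can be sketched by joining participates_in and located_in assertions on their shared subject. The first three triples below come from the examples in the text; the fourth is a placeholder added only so that the join returns something, and is not an actual PRO annotation. The function name is hypothetical.

```python
triples = [
    ("PR:000037344", "participates_in", "GO:0002756"),
    ("PR:000037791", "located_in", "GO:0005737"),
    ("PR:000027086", "located_in", "GO:0005654"),
    # Placeholder for illustration only, NOT an actual PRO annotation.
    ("PR:000037344", "located_in", "GO:EXAMPLE_COMPARTMENT"),
]


def compartments_for_process(triples, process):
    """Return compartments of every entity that participates in `process`."""
    participants = {s for s, r, o in triples
                    if r == "participates_in" and o == process}
    return {o for s, r, o in triples
            if r == "located_in" and s in participants}


print(compartments_for_process(triples, "GO:0002756"))
```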
We then use the RO relations has_component and has_part, already implemented in PRO, to form triples that relate macromolecular complexes to their component proteins and to relate proteins with their domains, respectively. For example:

ticam1:viral dsRNA:TLR3 complex (mouse) (PR:000037308) has_component PR:Q80UF7 {cardinality = "2"} ! TIR

The PRO terms and annotations related to this paper have been collected in a separate set of TLR-specific files, available via FTP [32]. All terms and annotations are also part of PRO release 43 and later.
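A has_component assertion with a cardinality modifier, like the one in the example above, could be read with a short parser. This is a minimal sketch that assumes the layout seen in that single example (relation, target identifier, optional `{cardinality = "N"}`, optional `!` comment); it is not a full parser for the files distributed by PRO, and `parse_relation` is a hypothetical helper.

```python
import re

# Layout assumed from the single example in the text.
PATTERN = re.compile(
    r'^(?P<relation>has_component|has_part)\s+'
    r'(?P<target>\S+)'
    r'(?:\s*\{cardinality\s*=\s*"(?P<cardinality>\d+)"\})?'
    r'(?:\s*!\s*(?P<comment>.*))?$'
)


def parse_relation(line):
    """Split a relation line into relation, target, cardinality, comment."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized relation line: {line!r}")
    cardinality = match.group("cardinality")
    return {
        "relation": match.group("relation"),
        "target": match.group("target"),
        "cardinality": int(cardinality) if cardinality else None,
        "comment": match.group("comment"),
    }


parsed = parse_relation('has_component PR:Q80UF7 {cardinality = "2"}')
print(parsed)
```

Returning the cardinality as an integer makes it straightforward to compare the number of copies of a component across the human, mouse, and chicken versions of a complex.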
The organization and content of the PRO annotation file (PAF) have been described previously [7]. Briefly, the PAF shows the annotation of PRO entities using GO or other ontologies, and adopts the format of the GO annotation file with some modifications. The PAF annotations connect PRO terms to terms from these ontologies and include the corresponding relation. Additional columns account for sequence coordinate specifications, such as the range of the sequence (for cleaved forms) or sites of covalently modified residue(s). In addition to the qualifiers used by GO (like NOT), the PAF introduces the qualifiers increased and decreased, along with a column to indicate what the object of comparison is. PAF documentation is available [33].
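As a rough illustration of how such a tab-delimited annotation file might be consumed, the sketch below reads rows and checks that a comparative qualifier is accompanied by an object of comparison. The column names (`pro_id`, `relation`, `ontology_id`, `qualifier`, `relative_to`) and the validation rule are assumptions made for this example; the actual PAF column layout is defined in the PAF documentation [33] and may differ.

```python
import csv
import io

# Qualifiers that the text says the PAF adds to those used by GO.
COMPARATIVE = {"increased", "decreased"}


def read_annotations(text):
    """Parse tab-delimited rows (with a header line) into dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Assumed rule: a comparative qualifier needs a comparison object.
        if row["qualifier"] in COMPARATIVE and not row["relative_to"]:
            raise ValueError(f"{row['pro_id']}: qualifier needs relative_to")
        rows.append(row)
    return rows


# Hypothetical header; the two data rows reuse annotations from the text.
sample = (
    "pro_id\trelation\tontology_id\tqualifier\trelative_to\n"
    "PR:000027086\thas_function\tGO:0003700\t\t\n"
    "PR:000027086\tlocated_in\tGO:0005654\t\t\n"
)

rows = read_annotations(sample)
print(len(rows), rows[0]["relation"])
```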

Results and Discussion
Here we describe strategies to integrate PRO annotations for complexes [8] with functional annotations derived from pathway databases like Reactome and other resources, focusing on the initial steps of the TLR3 and TLR4 signaling pathways in human, mouse, and chicken.
TLR3 and TLR4 together represent key signaling strategies used by Toll receptors to initiate reactions of innate immunity. TLR4 is unique in that upon activation it recruits adaptor molecules for both MyD88-dependent and MyD88-independent signaling. TLR3 specifically uses the TRIF-signaling pathway but without the use of TRAM (Fig 1) [34]. All other TLRs activate MyD88-dependent signaling only.
Experimental studies of chicken, mouse and human systems have established that in all three species the TLR3-mediated signaling pathway is triggered by recognition of viral dsRNA and the TLR4-mediated signaling pathway is triggered by recognition of bacteria-derived LPS [35,36]. The initial pathway steps in which a TLR receptor binds its ligand and then interacts via its cytosolic domain with its first downstream target have been annotated (S1 Table). Many of the annotations of individual proteins and complexes (Tables 1 and 2) are based on experimental observations; the rest are inferences based on relationships between the experimentally characterized proteins and their uncharacterized but structurally similar orthologues. In the course of this work, 603 new PRO terms were created: 20 for families, 64 for genes, 110 for organism-specific forms of genes, 69 for covalent modifications, 42 for organism-specific covalent modifications, 48 for GO complexes, 109 for PRO complexes, 119 for organism-specific PRO complexes, and 50 with Reactome cross-references.
In vertebrates the sensing of LPS involves transfer of LPS monomers to CD14 mediated by the LPS-binding protein (LBP). CD14 in turn commonly delivers the LPS to a complex of myeloid differentiation protein-2 (MD2) and TLR4 which transduces the signal through the recruitment of adaptor proteins to the TIR domain of TLR4 [34,37]. There are three versions of the LPS:CD14 complex, namely GPI-anchored CD14:LPS, soluble CD14:LPS and transmembrane CD14:LPS (Fig 3). Each of these complexes features a distinct form of CD14. Mammals express two of these forms, soluble and GPI-anchored, whereas in birds only a complex with the transmembrane version of CD14 has been identified to date [38].
Downstream signaling complexes such as the MD2:TLR4 complex show another interesting difference between taxa. Whereas the mammalian version participates in both the MyD88-dependent and MyD88-independent signaling pathways, the chicken version may only be able to participate in the MyD88-dependent pathway [34,39]. This functional difference is captured in PRO annotations as shown in Table 3, illustrating the use of PRO annotations as a tool for making discoveries.
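Once process annotations are recorded per species, a difference of this kind can be surfaced with a simple set comparison. The sketch below encodes only what the preceding paragraph states about the MD2:TLR4 complex, using pathway labels rather than identifiers; it does not reproduce Table 3, and `species_gaps` is a hypothetical helper.

```python
# Pathways in which the MD2:TLR4 complex participates, per the text above.
participates_in = {
    "human": {"MyD88-dependent", "MyD88-independent"},
    "mouse": {"MyD88-dependent", "MyD88-independent"},
    "chicken": {"MyD88-dependent"},
}


def species_gaps(by_species):
    """Return, per species, processes annotated elsewhere but not there."""
    all_processes = set().union(*by_species.values())
    return {species: all_processes - processes
            for species, processes in by_species.items()
            if all_processes - processes}


print(species_gaps(participates_in))
```

A reported gap is only a prompt for curation: it may reflect a true functional difference or simply missing experimental data.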
Comparison of the PRO annotations for mouse and human CD14 complexes identified a potentially significant gap in our understanding of CD14 function (Table 1; Fig 1). In well-studied mouse and human systems, CD14 binds LPS and brings it into close proximity to the TLR4:MD2 complex, allowing the recognition of LPS by MD2 and TLR4. Data from mouse cells, however, suggest that CD14 may be dispensable for the downstream events [40][41][42], while data from human cells suggest that CD14 is translocated to the endosomal compartment in association with the TLR4 receptor complex [43][44][45], thus arguing that CD14 may be required for downstream TLR4 signaling events.
Although ligand binding and transfer by CD14 has been extensively studied by mutagenesis and epitope mapping of blocking antibodies in both human and animal models [40,41,[46][47][48][49], the molecular mechanism behind CD14 interaction with the receptor complex remains elusive. Mechanisms for ligand-induced endocytosis of CD14 and control of endosomal trafficking of the TLR receptor complex likewise remain unclear.
Further, we found that mouse complexes containing MyD88 protein are represented in two forms, containing alternatively spliced long and short isoforms of MyD88, MyD88l and MyD88s (Tables 1 and 2). The long or canonical form of MyD88 protein is a bipartite domain adaptor molecule composed of an amino-terminal death domain and a carboxyl-terminal TIR domain. MyD88l bridges interleukin-1 receptor-associated kinase 4 (IRAK4) to the TIR domain of the receptor signaling complex. The short form MyD88s lacks the region between the death domain and the TIR domain. MyD88s is also recruited to the TIR domain of the TLR4 receptor complex, but it blocks NFkappaB induction because it fails to activate IRAK4 in mouse cells [50,51]. In contrast, although human cells have been reported to express MyD88s, only TLR4 complexes involving the canonical long form of MyD88 have been observed. This difference is consistent with the lack of evidence that LPS-induced activity of MyD88s inhibits the MyD88-mediated TLR4 pathway in human cells.
A key feature of the work described in this paper is that it involves the annotation of specific instances of physical entities: the collections of molecules that, in particular cells, occupy a subcellular location or exhibit a function. Work now underway on development of a formal ontology for these classes and relationships will enable us to use these annotations as the basis for assertions to support automated reasoning. While the expert manual annotation process does not scale well, it does provide a large body of validated data that will provide a rigorous test of automated reasoning tools.

Conclusion
We have described an annotation process that integrates PRO ontology terms for protein complexes with GO terms for molecular function, biological process, and cellular component. The resulting annotations are explicitly tagged to indicate their basis in experimental data or in manually verified inferences based on sequence similarity among proteins. The results highlight similarities and differences between signaling processes mediated by two members of the TLR family, TLR3 and TLR4, and among three vertebrate species, human, mouse, and chicken. This annotation strategy is readily extended to the large data sets in pathway databases like Reactome and with the continued development of ontologies and reasoning tools should allow these resources to be mined efficiently and reliably, to discover putative novel functional relationships among proteins and protein complexes and to critically assess their plausibility.
Supporting Information S1