Conserved motifs in nuclear genes encoding predicted mitochondrial proteins in Trypanosoma cruzi

Trypanosoma cruzi, the protozoan parasite that causes Chagas’ disease, exhibits peculiar biological features. Among them, the presence of a unique mitochondrion is remarkable. Even though the mitochondrial DNA constitutes up to 25% of total cellular DNA, the structure and function of the mitochondrion depend on the expression of the nuclear genome. As in other eukaryotes, specific peptide signals have been proposed to drive the mitochondrial localization of a subset of trypanosomatid proteins. However, some mitochondrial proteins encoded in the nuclear genome lack a peptide signal. In other eukaryotes, alternative protein targeting to subcellular organelles via mRNA localization has also been recognized, and specific mRNA localization towards the mitochondria has been described. To search for mitochondrial localization signals in T. cruzi, we developed a strategy to build a comprehensive database of nuclear genes encoding predicted mitochondrial proteins (MiNT) in the TriTryps (T. cruzi, T. brucei and L. major). We found that approximately 15% of their nuclear genome encodes mitochondrial products. In T. cruzi, the MiNT database comprises 1438 genes, and a conserved peptide signal, M(L/F)R(R/S)SS, named TryM-TaPe, is found in 60% of these genes, suggesting that the canonical mRNA guidance mechanism is present. In addition, the search for compositional signals in the transcripts of T. cruzi MiNT genes produced a list of candidates, among which a conserved non-translated element represented by the consensus sequence DARRVSG is worth noting. Taking into account its reported interaction with the T. brucei TRRM3 protein, which is enriched in the mitochondrial membrane fraction, we suggest a putative zip code role for this element. Overall, we provide an inventory of the mitochondrial proteins in T. cruzi and give evidence for the existence of both peptide and mRNA signals specific to nuclear-encoded mitochondrial proteins.


Databases
This work was performed using data from the TriTrypDB 33 available at Tritrypdb.org [24].

UTR sequences
To determine the boundaries and sequences of the UTRs of the transcripts in T. cruzi, RNA-seq data from Smircich et al. [25] and the UTRme software [26] were used.

Ortholog genes
TriTrypDB tools were used for large-scale ortholog identification when data for the compared organisms were available. When this approach was not possible, a reciprocal best BLAST hit search was performed [27] using in-house bash scripts developed for this purpose.
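The reciprocal best hit criterion can be illustrated with a short sketch. It is not the in-house bash code used in this work; it assumes that the BLAST results of each direction have already been parsed into (query, subject, bit score) tuples, and the function names and gene identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query_id, subject_id, bit_score), assumed pre-parsed


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep, for every query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Dict[str, str]:
    """Return pairs (a -> b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if backward.get(b) == a}


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    tc_vs_tb = [("TcA", "TbX", 200.0), ("TcA", "TbY", 150.0), ("TcB", "TbY", 90.0)]
    tb_vs_tc = [("TbX", "TcA", 210.0), ("TbY", "TcA", 140.0)]
    print(reciprocal_best_hits(tc_vs_tb, tb_vs_tc))
```

Ties and E-value or coverage filters are ignored here; a production script would normally apply them before selecting the best hit.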

Gene ontology
Annotation of Gene Ontology (GO) terms was performed using the online tool DAVID (Database for Annotation, Visualization and Integrated Discovery, v6.7) [28]. Different background databases were employed depending on the analysis performed, as indicated in each case. This tool was also used for GO term enrichment analysis. As reported [29], an enrichment score higher than 1.3 was accepted as meaningful. A variant of Fisher's exact test (EASE score) was used for p-value calculation. For each ontological category, the FDR (false discovery rate) was controlled with the Benjamini method [30], and only values lower than 0.05 were considered acceptable. Alternatively, TriTrypDB tools were used to analyze GO term enrichment of the gene lists using the whole genome as background. In this case, the p-value cutoff was set at 0.01 and the search was limited to GO Slim terms.
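To make the two statistical steps concrete, the following sketch computes a one-sided hypergeometric (Fisher-type) enrichment p-value, a conservative variant in which one gene is removed from the list hits (the way the EASE score is commonly described), and a Benjamini-Hochberg adjustment. This is an approximation under stated assumptions: the exact contingency table used by DAVID may differ, and all function names are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(k: int, list_size: int, term_size: int, background_size: int) -> float:
    """P(X >= k) when list_size genes are drawn from a background of
    background_size genes, term_size of which carry the GO term."""
    total = comb(background_size, list_size)
    upper = min(list_size, term_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def ease_like_score(k: int, list_size: int, term_size: int, background_size: int) -> float:
    """Conservative variant: same test after removing one gene from the list hits."""
    return hypergeom_upper_tail(max(k - 1, 0), list_size, term_size, background_size)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term would then be retained when its adjusted value falls below the 0.05 threshold stated above.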

Compositional and structural analysis
Both CDS and UTR GC content were computed with the geecee tool (EMBOSS package). GC content at the third codon position (GC3) of the CDSs was estimated using the INCA software (INteractive Codon usage Analysis) [31]. This tool also allowed the analysis of codon usage bias, using the MELP measure to quantify synonymous codon usage (MILC: Measure Independent of Length and Composition; MELP: MILC-based Expression Level Predictor) [32]. The RNAfold algorithm (Vienna RNA v2.0 package) was used to infer the minimum free energy (MFE) structure and the thermodynamic stability of whole transcripts or of partial regions of them [33]. MFE was computed separately for the untranslated regions and the coding sequence.
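For reference, GC and GC3 content can be computed as in the minimal sketch below. It is not the geecee or INCA implementation: it counts every complete codon, whereas dedicated codon usage tools may exclude some codons (for example stop codons) from GC3. The function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C among the canonical bases of a sequence."""
    canonical = [base for base in seq.upper() if base in "ACGTU"]
    if not canonical:
        return 0.0
    return sum(base in "GC" for base in canonical) / len(canonical)


def gc3_content(cds: str) -> float:
    """GC fraction at third codon positions; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)


if __name__ == "__main__":
    toy_cds = "ATGGCCTAA"  # codons ATG, GCC, TAA (toy example)
    print(gc_content(toy_cds), gc3_content(toy_cds))
```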

Conserved sequence signals
The search for conserved sequence signals was done using the MEME suite tools [34]. Discriminative regular expression motifs in specific data subsets were identified using the DREME tool [35], applying the gene complement of the query subset as negative control. The motifs retrieved were compared against the Ray motif database using the TOMTOM tool [34] from the same package. The MEME tool was used to generate consensus regular expressions when comparing motifs from different searches.
When studying signal peptides, three different approaches were used. The first was performed using the MEME and FIMO tools [36], both from the MEME suite, with a threshold of 0.001. In addition, the TargetP [16] and PredSL [37] signal peptide predictors were used with the default parameters, selecting non-plant sequence.
For the motif search analysis, we chose 5' and 3'UTR lengths corresponding to the 60th percentile of the length distribution (S1 Table). For the 3'UTR, this gives 350 nt (3'UTR 350), while for the 5'UTR the length corresponds to 100 nt. To compare and control the results obtained for both UTRs, it is desirable that both regions have equal length, so the 5'UTR was defined as the region starting at -100 nt from the AUG and extending to +250 nt downstream from it (5'UTR 350).
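The window definition above can be sketched as follows. The nearest-rank percentile, the 0-based coordinate convention and the function names are assumptions made for illustration; in particular, whether the +250 nt is counted from the A of the AUG is our reading of the definition.

```python
import math
from typing import List, Tuple


def nearest_rank_percentile(lengths: List[int], percentile: float) -> int:
    """Nearest-rank percentile of a non-empty list of UTR lengths."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def utr_windows(
    transcript: str,
    cds_start: int,   # 0-based index of the A of the AUG (assumed convention)
    cds_end: int,     # 0-based index one past the stop codon (assumed convention)
    upstream: int = 100,
    downstream: int = 250,
    three_prime: int = 350,
) -> Tuple[str, str]:
    """Return the 5'UTR 350 window (-100 to +250 around the AUG) and the 3'UTR 350 window."""
    five = transcript[max(0, cds_start - upstream): cds_start + downstream]
    three = transcript[cds_end: cds_end + three_prime]
    return five, three
```

Transcripts with UTRs shorter than the window simply yield shorter sequences here; whether such cases were padded, truncated or excluded is not specified in the text.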

Construction of MiNT: A database of nuclear genes encoding mitochondrial proteins in T. cruzi
In order to generate a database that comprehensively contains the nuclear transcripts that encode mitochondrial proteins (MiNT) in T. cruzi, an inclusive strategy was followed. First, all the genes whose products were annotated as "mitochondrial" in TriTrypDB 33 were used to form a base set which, surprisingly, consisted of only 145 genes. As this strategy yielded meager results, several complementary approaches were applied (Fig 1). Using TriTrypDB tools, a search by ontology terms (Cellular component = "Mitochondrion"; GO:0005739) was done, obtaining 216 genes. As expected, most of these genes were already included in the initial dataset, yielding only a small increase (total 283 genes). Therefore, to further expand the T. cruzi database, we used the text-based TriTrypDB search tool, which allows the exploration of the enclosed author notes. We reasoned that this search could help to detect genes encoding mitochondrial products whose function, localization or even complete sequence have not yet been updated. Using this strategy, and after a manual revision of the results, 474 genes, including 193 new and 281 already present in the database, were identified. The inclusion of these new genes is supported by the enriched terms associated with cell respiration and the Krebs cycle found for this subset with the Functional Annotation Clustering tool (DAVID). Since, in the case of T. brucei, at least 1065 mitochondrial proteins encoded by the nuclear genome were predicted [15], we considered that the current number of genes in the MiNT database was still too low.
Following the above-described search approaches in T. brucei, we identified 195 genes annotated as mitochondrial products, 1256 genes under the GO term (GO:0005739) and 1431 genes based on text search, for a total of 1501 out of the 11703 nuclear genes (13%) encoding mitochondrial products (S2 Table). This database not only includes all the genes that encode mitochondrial products as observed by fluorescent protein tagging (109) [38] and most of the genes (1049/1061) proposed by Xiaobai Zhang et al. [15], but also adds 452 more genes. It was not surprising to find that our strategy yielded a higher number of nuclear genes encoding mitochondrial products in T. brucei than in T. cruzi, since the annotation of the T. brucei genome is much more complete than that of T. cruzi.
As a final approach, we searched for the T. cruzi orthologs of the genes identified in T. brucei with our strategy. After manual revision, 1346 genes were found to have an ortholog in T. cruzi. With this strategy, 994 new genes were added to the T. cruzi database. Thus, we completed the database, named MiNT, which, after a final manual curation, contains 14% of the nuclear genes (1438 out of 10597), encoding mitochondrial proteins (S3 Table).
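The inclusive strategy amounts to taking the union of the gene sets retrieved by each search while keeping track of the supporting evidence. The sketch below illustrates this bookkeeping with hypothetical gene identifiers and source labels; it does not reproduce the manual curation steps.

```python
from typing import Dict, Iterable, Set


def merge_evidence(sources: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map every gene ID to the set of searches that retrieved it."""
    evidence: Dict[str, Set[str]] = {}
    for source_name, gene_ids in sources.items():
        for gene_id in gene_ids:
            evidence.setdefault(gene_id, set()).add(source_name)
    return evidence


def percent_of_genome(n_selected: int, n_nuclear_genes: int) -> float:
    """Share of the nuclear gene complement represented by the selected genes."""
    return 100.0 * n_selected / n_nuclear_genes


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    searches = {
        "annotation": ["g1", "g2"],
        "GO:0005739": ["g2", "g3"],
        "text": ["g3", "g4"],
        "T. brucei orthologs": ["g4", "g5"],
    }
    candidates = merge_evidence(searches)
    print(len(candidates), "candidate genes")
    # Figures reported in the text: 1438 of 10597 genes in T. cruzi.
    print(round(percent_of_genome(1438, 10597)), "%")
```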
The same strategy was also applied to define the mitochondrial proteins encoded in the nuclear genome of L. major. As in T. cruzi, the searches by annotation, GO:0005739 or text (175, 313 and 467 genes, respectively) yielded a low total of only 477 genes. As before, we added to this group the L. major orthologs of T. brucei and T. cruzi MiNT genes. From T. brucei MiNT, 1433 ortholog genes in L. major were identified, increasing the number of nuclear genes encoding mitochondrial proteins to 1490. In addition, when using T. cruzi MiNT, 1271 orthologs in L. major were found, yielding a MiNT database of 1558 genes (S1 Fig and S4 Table).
As expected, considering the number of genes encoding hypothetical proteins in T. cruzi, a great number of the MiNT genes (36%, 519/1438) are annotated as such. Nonetheless, 97% of them (501/519) have an ortholog gene either in T. brucei or in L. major, strongly suggesting a shared functional role in these organisms.
In brief, following a strict inclusion strategy to avoid false positives, we found that around 15% of the TriTryps' nuclear genome encodes mitochondrial proteins (14%, 13% and 17% for T. cruzi, T. brucei and L. major respectively).

Compositional and structural analysis of T. cruzi MiNT
In order to study the compositional and structural features of the nuclear genes encoding mitochondrial products in T. cruzi, the transcripts belonging to MiNT were split into the CDS and UTRs and both were independently analyzed.
For the CDSs, no significant differences were found when comparing length (S2 Fig) or GC content (Fig 2A). However, both GC3 and MELP (Fig 2B and 2C) were significantly higher in MiNT transcripts than in the rest of the transcriptome, suggesting high protein production. Another difference between MiNT CDSs and the CDSs of the rest of the transcriptome (No-MiNT) is found at the structural level. Indeed, in accordance with the prokaryotic origin of the mitochondrion [39], less structured CDSs are predicted by the MFE per base analysis (Fig 2D) for the nuclear-derived transcripts encoding mitochondrial proteins.
We also compared MiNT and No-MiNT UTRs in epimastigotes (Fig 3). While no differences were observed when comparing the length of the 5'UTRs, MiNT 3'UTRs are significantly shorter than those of No-MiNT genes (Fig 3A). Regarding GC content, both the 5' and 3'UTRs of MiNT present a lower GC content than those of No-MiNT genes (Fig 3B). Finally, considering the minimum free energy, the predicted secondary structures of both the 5' and 3'UTRs were found to be less stable in MiNT than in No-MiNT (Fig 3C). Similar results were obtained using the UTRs from the metacyclic trypomastigote stage (S3 Fig). The distinctiveness observed for the coding and not for the regulatory regions of the nuclear genes encoding mitochondrial proteins may be explained by the more stringent requirements that govern their functionality. Indeed, the codon usage is highly non-random with respect to both GC3 and MELP. Overall, the compositional and structural analysis of the nuclear-encoded mitochondrial genes of T. cruzi revealed both values characteristic of high expression (higher GC3 and MELP) and traces of prokaryotic origin (lower structural complexity at CDSs and UTRs) when compared with No-MiNT.
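As an illustration of the MFE per base measure, the sketch below parses plain-text RNAfold output (optional FASTA header, sequence line, then a dot-bracket line ending with the energy in parentheses) and divides each energy by the sequence length. Normalizing by length is our reading of "MFE per base", and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the trailing energy field of an RNAfold structure line, e.g. "( -1.20)".
_ENERGY = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*\)\s*$")


def mfe_per_base(rnafold_output: str) -> Dict[str, float]:
    """Return {sequence name: MFE (kcal/mol) divided by sequence length}."""
    results: Dict[str, float] = {}
    name: Optional[str] = None
    sequence: Optional[str] = None
    unnamed = 0
    for raw in rnafold_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip() or None
            sequence = None
            continue
        match = _ENERGY.search(line)
        if match and sequence:
            if name is None:  # output without FASTA headers
                unnamed += 1
                name = f"seq{unnamed}"
            results[name] = float(match.group(1)) / len(sequence)
            name, sequence = None, None
        else:
            sequence = line
    return results
```

Values closer to zero indicate less stable (less structured) predicted folds, which is the direction reported for MiNT CDSs.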

Expression analysis of MiNT genes along the life cycle of T. cruzi
Expression evidence has been reported for 99% of MiNT genes (1427/1438), whether in microarray data [40], expressed sequence tags (EST) or RNA-seq data [25,41].
First, the microarray data published by Minning et al. [40] were used to compare MiNT gene expression across the life cycle of the parasite (Fig 4A). This analysis revealed that MiNT genes show higher expression than the rest of the genome, with higher levels in the replicative stages and lower levels in both trypomastigote forms. Similar results were obtained using the RNA-seq data from Smircich et al. [25] (Fig 4B). As a complementary approach, the ribosome footprinting data available for the vector parasite stages (epimastigote / metacyclic trypomastigote) [25] were analyzed (Fig 4C). Higher ribosome occupancy for MiNT than for the rest of the genes was revealed in replicative epimastigotes. No such effect was observed at the metacyclic trypomastigote stage. Indeed, the translation slowdown that is characteristic of this infective stage is also clearly observed for MiNT.

Search for an amino-terminal localization peptide signal in mitochondrial proteins encoded by nuclear genes in T. cruzi
While the presence of a signal peptide targeting proteins to their final localization is not mandatory, at least 70% of the mitochondrial proteins in yeast have been shown to carry a peptide that is responsible for their subcellular localization [42]. Certain loosely defined characteristics, such as an amphipathic character and the presence of at least two basic amino acids, have been proposed for the mitochondrial signal peptide [43]. Nonetheless, no conserved consensus sequence has been reported for this signal.
Aiming to define a consensus sequence to identify those proteins whose localization could be directed by a signal encoded in the amino acid sequence in T. cruzi, we first analyzed several experimentally tested signal peptides. Thirty-five previously reported signals [44] were used as input to define a preliminary consensus sequence (Fig 5A). It was then submitted to FIMO analysis using the mitochondrial annotated proteins as the target database. MEME analysis found a consensus mitochondrial targeting sequence, named TryM-TaPe (Trypanosomal Mitochondrial Targeting Peptide) (Fig 5B).
We extended the search for TryM-TaPe to the complete MiNT database and found that 865 out of the 1438 proteins encoded in the MiNT database (60%) contained this conserved sequence, a number that is consistent with reports in other eukaryotes [42]. Though we cannot rule out the presence of other peptide signals, the existence and representation of TryM-TaPe validates the reliability of the T. cruzi MiNT database.
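A simplified way to screen protein sequences for the TryM-TaPe consensus M(L/F)R(R/S)SS is an exact regular expression match near the amino terminus, as sketched below. This is a stand-in rather than the FIMO procedure used here, which scores a position weight matrix and can accept degenerate matches; the 30-residue window and the function names are assumptions.

```python
import re
from typing import Dict

# Exact-match version of the consensus M(L/F)R(R/S)SS.
TRYM_TAPE = re.compile(r"M[LF]R[RS]SS")


def has_trym_tape(protein: str, n_terminal_window: int = 30) -> bool:
    """True if the consensus occurs within the first n_terminal_window residues."""
    return TRYM_TAPE.search(protein.upper()[:n_terminal_window]) is not None


def fraction_with_motif(proteins: Dict[str, str], n_terminal_window: int = 30) -> float:
    """Fraction of proteins carrying the consensus; 0.0 for an empty input."""
    if not proteins:
        return 0.0
    hits = sum(has_trym_tape(seq, n_terminal_window) for seq in proteins.values())
    return hits / len(proteins)


if __name__ == "__main__":
    toy = {"p1": "MLRRSSLLAAA", "p2": "MFRSSSVKT", "p3": "MKTAYIAKQR"}  # toy sequences
    print(fraction_with_motif(toy))
```

Because exact matching is stricter than a matrix-based scan, the fraction obtained this way would be expected to be lower than the 60% reported with FIMO.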

Search for mRNA localization signals in nuclear genes encoding mitochondrial proteins in T. cruzi
The absence of a peptide signal in nuclear-encoded mitochondrial proteins may be compensated by the presence of signals in the transcript that mediate the approach of mRNAs to the mitochondria. To facilitate the search for these transcript localization signals, a reliable set of proteins without a mitochondrial targeting sequence (MTS) is desirable. For this purpose, we first identified all proteins in MiNT with an MTS (MiNT-MTS dataset) and then obtained those without an MTS (MiNT-NoMTS dataset). Since TryM-TaPe may not be the only signal peptide that could mediate transcript or protein localization to the mitochondrial surroundings, two common MTS predictors (TargetP and PredSL) were also used. As shown above, the FIMO search led to 865 proteins carrying TryM-TaPe, while TargetP and PredSL predicted that 660 and 521 proteins encoded by MiNT transcripts, respectively, carry an MTS (Fig 6). We decided to include in MiNT-MTS those genes predicted by at least two of the three methods (620), while the remaining 818 genes correspond to MiNT-NoMTS.
In order to investigate the presence of sequence elements enriched in MiNT-NoMTS with respect to MiNT-MTS, we analyzed the UTRs of those transcripts. Considering that nucleotide motifs may also complement the function of the signal peptide, the reciprocal search was also carried out. We used the DREME tool [35] on the UTR 350 (as described in Materials and Methods) to search for enriched motifs within each database and the TOMTOM tool [45] to search for putative interacting RBPs. As the database used in TOMTOM [46] includes L. major and T. brucei gambiense proteins, their T. cruzi orthologs were searched for.
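The "at least two of three methods" rule is a simple majority vote over the predictor outputs. A minimal sketch, with hypothetical gene identifiers and function names, could look like this:

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def consensus_calls(predictions: Dict[str, Iterable[str]], min_votes: int = 2) -> Set[str]:
    """Genes predicted to carry an MTS by at least min_votes methods."""
    votes = Counter(gene for genes in predictions.values() for gene in set(genes))
    return {gene for gene, n in votes.items() if n >= min_votes}


def split_by_mts(
    all_genes: Iterable[str], predictions: Dict[str, Iterable[str]], min_votes: int = 2
) -> Tuple[Set[str], Set[str]]:
    """Return (MTS set, NoMTS set) as a partition of all_genes."""
    universe = set(all_genes)
    with_mts = consensus_calls(predictions, min_votes) & universe
    return with_mts, universe - with_mts


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    calls = {
        "FIMO": {"g1", "g2", "g3"},
        "TargetP": {"g2", "g3"},
        "PredSL": {"g3", "g4"},
    }
    mts, no_mts = split_by_mts(["g1", "g2", "g3", "g4", "g5"], calls)
    print(sorted(mts), sorted(no_mts))
```

By construction the two sets partition the database, which matches the reported split of 620 and 818 genes out of 1438.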
The analysis of MiNT-NoMTS yielded 21 and 20 motifs in the 5'UTR 350 and 3'UTR 350, respectively, and several interacting proteins were predicted (S5 Table and Table 1). Common trans-acting factors, such as PABP (polyadenylate-binding protein), a general factor involved in different steps of mRNA metabolism, and DRBD12, known to destabilize the widespread ARE-containing target genes [47], were found to interact with motifs in both the 3'UTR 350 and 5'UTR 350. In addition, motifs recognized by DRBD3 (one) and TRRM3 (three) were also found. In T. brucei, DRBD3 was found to be associated with mRNAs encoding membrane proteins, playing a role in mRNA stabilization, splicing, translation and transport [48]. Interestingly, TRRM3 is found within the mitochondrion or associated with its membrane in T. brucei procyclic forms [23]. As usually found in mitochondrial proteins, TRRM3 is [...]
Fig 6. [...] TargetP [16] and PredSL [37] predictors, with the default parameters selecting non-plant sequence, and FIMO using TryM-TaPe were used on the T. cruzi MiNT database.
For MiNT-MTS, we found 12 overrepresented sequences in the 5'UTR 350 and 12 in the 3'UTR 350 (S6 Table). As expected, and validating our approach, the sequence encoding TryM-TaPe was found. TOMTOM analysis allowed us to associate four motifs of the 3'UTR 350 with RNA binding proteins (Table 2). Two of them are recognized by the general factor PABP and two others by two isoforms of the double RNA binding domain protein, DRBD3 and DRBD9. Remarkably, one motif is recognized by TRRM3. As mentioned above, this is not surprising, since subcellular localization signals may act in collaboration with peptide signals. Thus, this finding reinforces the proposed role. All the sequences in the 3'UTR 350 found to be associated with TRRM3 (Motifs 11, 12, 13 and 15) were used to obtain a consensus recognition motif (DARRVSG), which, in turn, can also be recognized by TRRM3 according to the TOMTOM algorithm (p-value 8.10e-03). It is worth noting that the RNA binding domains of all the putative interactors presented here have a high identity to the respective T. cruzi orthologs, suggesting that they could recognize the same motifs (see alignments in S7 Table).
Although the relevance of the motifs found will require further study, it is tempting to propose that TRRM3 and its cognate recognition motif play a role as a zip code transporting specific mRNAs to the mitochondrial surroundings.
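The degenerate consensus DARRVSG uses IUPAC nucleotide codes (D = A/G/U, R = A/G, V = A/C/G, S = C/G). The sketch below expands such a consensus into a regular expression and reports all, possibly overlapping, occurrences in a UTR. It performs exact consensus matching only and is not a substitute for the matrix-based comparison done by TOMTOM; the function names and the toy sequence are hypothetical.

```python
import re
from typing import List

# IUPAC nucleotide codes expressed over the RNA alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "[AG]", "Y": "[CU]", "S": "[CG]", "W": "[AU]",
    "K": "[GU]", "M": "[AC]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}


def iupac_to_regex(consensus: str) -> "re.Pattern[str]":
    """Expand an IUPAC consensus into a compiled regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus.upper()))


def find_motif(utr: str, consensus: str = "DARRVSG") -> List[int]:
    """0-based start positions of all (overlapping) consensus matches in a UTR."""
    rna = utr.upper().replace("T", "U")
    pattern = iupac_to_regex(consensus).pattern
    # A lookahead keeps overlapping occurrences that a plain scan would skip.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", rna)]


if __name__ == "__main__":
    print(iupac_to_regex("DARRVSG").pattern)
    print(find_motif("CCGAAGACGTT"))  # toy 3'UTR fragment
```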

Conclusions
Aiming to identify conserved signals among the nuclear genes encoding mitochondrial proteins in T. cruzi, we searched for the genes annotated as such in TriTrypDB. Despite its availability since 2005 [50] and the many efforts towards its improvement, completion and annotation since then, we obtained only meager results. Thus, we undertook the task of obtaining a comprehensive list of nuclear genes encoding mitochondrial proteins, which would not only serve as the target dataset for the aim of this work but also constitute in itself a contribution to the current knowledge of the T. cruzi genome. Following an in silico strategy, a wide inventory of the nuclear genes encoding mitochondrial proteins in the TriTryps, MiNT, was obtained (1438, 1501 and 1558 genes for T. cruzi, T. brucei and L. major, respectively). The search for enriched motifs in T. cruzi MiNT allowed the identification of a list of conserved signals. Signals involved in different metabolic steps were identified. For the well-known mitochondrial localization peptides, we could establish a consensus motif, here named TryM-TaPe, M(L/F)R(R/S)SS, present in 60% of the T. cruzi MiNT database. In addition, a putative mitochondrial localization role is proposed here for the nucleic element DARRVSG, which may be recognized by the conserved TRRM3 protein, which is enriched in the mitochondrial membrane fraction in T. brucei. While work is in progress to analyze the role of this element, its actual interaction with TRRM3 and the function of this RBP, these findings suggest that, in addition to the canonical peptide localization signal, mRNA localization could be guided to the mitochondrial surroundings via zip code nucleic signals present in the UTRs in T. cruzi.

[...] Multiple comparisons amongst groups were performed by Dunn's multiple comparison test and differences were seen by comparing mean ranks. (TIF)
S1 Table. T. cruzi UTR length and statistical analysis. Lengths obtained for both 5' and 3'UTRs in the epimastigote and trypomastigote stages, from the data of Smircich et al. [25] using the UTRme tool [26]. The descriptive statistics for each group are also included. (XLSX)

We also thank several colleagues who have provided critical insight into this study during scientific meetings.