Deciphering RNA Regulatory Elements Involved in the Developmental and Environmental Gene Regulation of Trypanosoma brucei

Trypanosoma brucei is a vector-borne parasite with intricate life cycle that can cause serious diseases in humans and animals. This pathogen relies on fine regulation of gene expression to respond and adapt to variable environments, with implications in transmission and infectivity. However, the involved regulatory elements and their mechanisms of actions are largely unknown. Here, benefiting from a new graph-based approach for finding functional regulatory elements in RNA (GRAFFER), we have predicted 88 new RNA regulatory elements that are potentially involved in the gene regulatory network of T. brucei. We show that many of these newly predicted elements are responsive to both transcriptomic and proteomic changes during the life cycle of the parasite. Moreover, we found that 11 of predicted elements strikingly resemble previously identified regulatory elements for the parasite. Additionally, comparison with previously predicted motifs on T. brucei suggested the superior performance of our approach based on the current limited knowledge of regulatory elements in T. brucei.

The obtained results from this analysis demonstrated the power of our graph-based approach in identification of functional RREs based on co-expression graphs.

Application of GRAFFER to Cell cycle transcriptome
For the T. brucei cell cycle co-expression graph, we extracted expression profiles from (6) and considered genes that showed at least a 1.5 fold change in one cell cycle stage compared with early G1 phase. This dataset comprised of four cell states, monitoring gene expression as T.
brucei cells move through cell cycle (Early G1, Late G1, S phase, and G2/M phase). Performing the same steps as our previous attempt, we applied GRAFFER on the constructed co-expression graph from this dataset. In this case, our approach identified five significant motifs (S12.a Fig   and S5 Table). The low number of significant motifs was anticipated because of the low number of samples in the dataset. Comparison of the predicted motifs with experimentally established motifs revealed that one of our motifs matched a well-studied RRE in trypanosomatids. This experimentally validated RRE is involved in cell cycle regulation in trypanosomatid organisms (7). Importantly, genes harboring each of these experimental and computational motifs were significantly upregulated in the late G1 cell cycle phase (S12.b Fig).

T. brucei 3′-UTR sequences
The 3′-UTR sequences were downloaded from TriTrypDB v.5, considering lengths reported in (8). In cases of alternative poly-adenylation, the median length was selected. In cases that gene did not have an identified 3′-UTR length, 400nt (the median 3′-UTR length of T. brucei genes) downstream of the translational stop codon was selected. Preliminary analysis of 3′-UTR lengths revealed that although the median length is 400nt, some transcripts can have very long 3′-UTRs (S13.a Fig). Recent discoveries suggested that alternative poly-adenylation site selection can have regulatory impact on the expression level of transcripts in different organisms (9). For transcripts with alternative 3′-UTRs, the longer isoforms potentially have more binding sites for RNA-binding proteins and/or miRNAs. In general, the outcome of having more regulatory regions is that isoforms with shorter 3′-UTRs have elevated expression levels compare with the longer isoforms of the same transcript (10). In support to the regulatory role of alternative polyadenylation site selection, the 3′-UTR length of at least one transcript in T. brucei is reported to be developmentally regulated (11). Moreover, alternative trans-splicing (which can lead to variation in 3′-UTR lengths) plays a role in the developmental regulation of some T. brucei genes (12).
Previous studies on T. brucei suggested that poly-adenylation site selection in this organism is linked to the selection of the downstream 3′-splice-acceptor site (13). Considering both dependency on splice-acceptor-site selection and the error in sequencing that may occur because of the low complexity of 3′-UTR regions, the existence of minor variations on detected polyadenylation sites was anticipated. To test the possibility that gene expression is regulated by alternative poly-adenylation site selection, we first examined the agreement between two published studies on poly-adenylation sites of T. brucei transcripts (8, 14). Considering each study independently, we defined poly-adenylation regions by considering ±50nt around each detected poly-adenylation site. If two adjacent poly-adenylation sites had overlapping regions, relevant regions were merged and the new region was defined as the union of both. Thus, two poly-adenylation sites in different regions would be at least 100nt far from each other, shown schematically in S13.b Fig. By applying this selection criterion, we tolerated false negative results to reduce false positives. This analysis revealed that for many genes in T. brucei, there are at least two poly-adenylation regions supported by two independent studies (S13.c Fig). Next, we examined the agreement of 3′-UTR length variation for transcripts with at least two polyadenylation regions in both studies. Considering standard deviation of 3′-UTR length variation obtained from each article, we observed a moderate but significant agreement for 3′-UTR length variation between the two studies (S13. for the random distribution of some non-functional motifs, leading to a bias in our motif prediction approach. To take these issues into account, we restricted the maximum 3′-UTR length for each transcript to 1000nt (i.e., the first 1000nt of 3′-UTR regions were considered for motif prediction). We found that replacing considered 3′-UTR lengths with the defined lengths by Siegel et al. (8) has no effect on the significance state of 88 predicted motifs, with only one exception (S15 Fig). It is likely that by considering the whole 3′-UTR lengths instead of the truncated version, the approach will predict more motifs that may not be biologically relevant.

RNAcompete
RNAcompete is a single-cycle competition based approach whereby 240,000 different sequences compete to bind to a single RBP (3,15

Comparison with previous studies
To evaluate the performance of our graph-based approach, we compared the GRAFFER results with three other genome-wide studies conducted on T. brucei (16)(17)(18). It is important to note that RREs are not extensively characterized in T. brucei; which forced us to compare the results of each study with a limited set of previously known RREs. Therefore, some of the novel RREs predicted by these approaches may be valid, but not discovered yet. Two of these studies applied the FIRE program (2) in different contexts to predict RREs. FIRE is an information theory-based approach that seeks informative RREs from clusters of co-expressed genes. An independent experimental study showed that the predicted motifs for human are of high quality (19). The third study applied an alignment-free approach, which benefits from simultaneous consideration of four closely related Trypanosomatid species: T. brucei, T. cruzi, T. vivax and T. congolence.
In the first genome wide analysis of T. brucei genes, the lack of genome-wide experiments available at the time caused the authors to predict "function-specific" RREs by clustering genes according to their function (16). This analysis led to the identification of 21 RREs in the 3′-UTRs of T. brucei genes. Considering the same criteria as applied for the GRAFFER motifs, four out of the 21 predicted motifs showed significant similarity with only four different RNAcompete motifs (S4 Table). Predictions did not match with other experimentally-derived motifs.
In the second genome-wide analysis of T. brucei genes, whole genome microarray data was available; therefore, the authors employed a sophisticated approach for direct integration of transcriptome measurements obtained from three independent studies (17). Importantly, two of the transcriptome datasets used in the study are also used for predictions of RREs in our approach. Clustering of the co-expression network and application of FIRE algorithm in this case had led to the prediction of 14 RREs. Comparison with RNAcompete results revealed that three of the 14 predicted motifs showed significant similarity with only three different RNAcompete motifs (S4 Table). Predictions did not match with other experimentally-derived motifs.
In the third genome-wide analysis of T. brucei genes, a novel algorithm (COSMOS) was developed on the assumption that orthologous genes in close organisms tend to have a similar set of RREs (18). Application of COSMOS on four closely related Trypanosomatid organisms revealed 222 linear and 166 structural motifs that are conserved among these four organisms.
Comparison with RNAcompete results revealed that nine of the 388 predicted motifs had significant similarity with nine different RNAcompete motifs (S4 Table). However, considering the GRAFFER and COSMOS motifs that matched to the same RNAcompete motif, in all cases the GRAFFER motifs showed higher selectivity (higher enrichment) than the COSMOS motifs.
It should be pointed that COSMOS was also able to identify three further well-studied motifs.
One of them is a structural motif that could not be predicted in the current implementation of GRAFFER algorithm (GRAFFER only searches for linear motifs). The other two are cell cyclerelated motifs. GRAFFER successfully discovered one of these motifs from the transcriptome data of cell cycle progression (see above). However, the second motif is related to a set of transcripts with subtle variations in their expression, as mentioned in (6). In our motif prediction pipeline, we constructed co-expression graphs by focusing on highly variable genes. Therefore, we most probably missed this motif because we did not have its cognate targets in the coexpression graph. We used g:profiler web server for enrichment analysis (24). In our analysis, we only considered categories with between 50 upto 1500 annotated genes. Each module was analyzed independently and enriched terms with Benjamini corrected p-value less than 0.01 were selected. To avoid redundant GO- hly similar to each other because of the existence of experimental replicates and/or conserved RNA binding domains). The bold blue frame indicates cases in which the GRAFFER motif was enriched (two tailed hypergeometric, p-value <0.01) among the RNA targets of the RBP as reported in (3); and the bold black frame indicates not enriched among the target RNAs.

S6 Fig. Comparison of GRAFFER
we set two criteria: 1 human miRNA. We used g:profiler web server for this analysis (24); 2) The 5′-extermity of the miRNA should match to the reverse complement of the pr 7 represent the binding sites for 10 human miRNAs. match with miRNAs, but also they can represent the binding site of RBPs (highlighted with the box). Comparison of an experimentally established RRE (UAUUUUUU) that is involved in developmental T. brucei . As shown, both developmental responses. a) Transcripts targeted by the experimentally-derived or GBM_TB_17304 were selected and then tested for a using Mann-Expression data were extracted from (20). b) Proteome T. brucei were analyzed using Mann-Whitney Protein expression data were extracted from Gunasekera et al.   b.

S13 Fig. Characteristics of T. brucei 3′-UTRs
(a) 3′-UTR length variation of T. brucei genes according to Siegel et al (8). In cases where a gene has alternative poly-adenylation sites, the 3′-UTR length is defined as the median length; (b) schematic representation of the defined poly-adenylation sites. Upward arrows represent the location of detected poly-adenylation sites for a gene. Each region is defined as 50nt before and after the detected poly-adenylation site. If the distance between two poly-adenylation sites was less than 100nt, the two corresponding regions were merged together. (c) Number of polyadenylation sites and regions determined in two independent studies. As shown, many genes have more than one determined poly-adenylation region. (d) Correlation of two studies for 3′-UTR length variation of genes with more than one poly-adenylation region. The Y-axis and Xaxis indicate the standard division of 3′-UTR lengths according to Siegel et al. For each condition, genes wee sorted according to their expression value. Sorted genes were then divided into 30 different bins. The enrichment of genes with long 3′-UTRs in each bin was examined using Fisher's exact test. Yellow color shows over-representation of genes in the corresponding bin. Similarly, blue represents under-representation of these genes in the cognate bin. The figure is pseudo-colored, only statistically significant bins are colored (Bonferroni corrected p-value < 0.05). Highlighted conditions on the left show overall significant up-or downregulation using Mann-Whitney rank sum statistics. Blue backgrounds indicate downregulation and orange backgrounds represent upregulation of genes with long 3′-UTRs.  Z−score truncated UTR length Z−score Real UTR length