Functional Genome Annotation by Combined Analysis across Microarray Studies of Trypanosoma brucei

Background Functional annotation of trypanosomatid genomes has been a daunting task due to the low similarity of their genes with annotated genes of other organisms. Three recent studies have provided gene expression profiles in several different conditions and life stages for one of the main disease-causing trypanosomatids, Trypanosoma brucei. These data can be used to study the gene functions and regulatory mechanisms in this organism. Methodology/Principal Findings Combining the data from three different microarray studies of T. brucei, we show that functional linkages among T. brucei genes can be identified based on gene coexpression, leading to a powerful approach for gene function prediction. These predictions can be further improved by considering the expression profiles of orthologous genes from other trypanosomatids. Furthermore, gene expression profiles can be used to discover potential regulatory elements within 3′ untranslated regions. Conclusions/Significance These results suggest that although trypanosomatids do not regulate genes at transcription level, trypanosomatid genes with related functions are coregulated post-transcriptionally via modulation of mRNA stability, implying the presence of complex regulatory networks in these organisms. Our analysis highlights the demand for a thorough transcript profiling of T. brucei genome in parallel with other trypanosomatid genomes, which can provide a powerful means to improve their functional annotation.

Combined analysis of microarray data for T. brucei

Construction of the coexpression networks
Expression values in each microarray experiment, expressed as log ratios of the signal from experimental cDNA to the signal from reference cDNA, were normalized to have a mean of 0.0 and a standard deviation of 1.0. This was done by calculating the average (μ) and standard deviation (σ) for each experiment, and transforming each value by subtracting the average and dividing by the standard deviation: x=(x′-μ)/σ, where x′ is the original value and x is the transformed (normalized) value. Given two genes α and β from S (S is the set of all genes with associated expression profiles) and their normalized expression values across different experiments of the experiment set E, the coexpression value of α and β can be calculated as the Pearson correlation coefficient of X E α and X E β , where X E α represents the measurements for α in the set E, and X E β represents the measurements for β in the set E, as shown in the figure below: The precision of a network in finding functional interactions is calculated by comparing the network to gold standard positive and negative sets. The gold standard positive set I consists of all gene pairs that share at least one function according to KEGG pathway database: where F i and F j represent the set of functions for genes i and j according to KEGG. The gold standard negative set I′ includes all gene pairs that do not share any function, given that each gene has at least one annotation in KEGG pathway database: The term "tbr01100" (Metabolic pathways) was ignored in all analyses.
The limitations and incompleteness of both I and I′ need to be noted: not all T. brucei genes with known functions are represented in KEGG; therefore, I is far from complete. Furthermore, the annotations for genes that are present in KEGG may not be complete, meaning that two genes may actually share a pathway, but this information is missing from KEGG; therefore, I′ may contain some gene pairs that should actually belong to I but are mistakenly assumed as negatives.
The positive predictive value (PPV, also referred to as precision) of the network G E θ is defined as:

Identification of conserved coexpression linkages among genes
We identified 5300 orthologs of T. brucei genes in the closely related organism Leishmania infantum based on reciprocal best BLAST-P hits with e-values <1×10 -6 . The set of T. brucei genes whose L. infantum orthologs could be unambiguously identified is referred to as S′. The experiment set E′ for L. infantum was obtained from three different studies [4,5,6]. The conserved coexpression network G E,E′ θ,θ′ is the set of all gene pairs that are coexpressed according to both experiment sets E=E KQJ (for T. brucei) and E′ (for L. infantum):

5.
Find the values for θ and θ′ using the new I and I′
Restore I and I′

Supplementary Methods
The G x was found to have a precision of 0.48 and S′ coverage of 0.113 which are very close to the values for the conserved coexpression network that is reported in the paper, implying that the procedure used to find the best values for θ and θ′ did not over-train them.

Network-based prediction of gene function
We evaluated the association of each gene with each KEGG pathway using a hypergeometric-based method: Assume that N is the set of nodes in the network G, C α ⊂N is the set of nodes that are connected to the node α (excluding the node α itself), and M⊂N is the set of nodes that have the particular function f M according to KEGG, again excluding the node α itself: The null hypothesis H 0 is that C α is independent of M. To evaluate this hypothesis, we assume a hypergeometric distribution for |M∩C α |: where x obs =|M∩C α |≤x≤min(|M|,|C α |) and "hypergeo" is the hypergeometric distribution function. If H 0 is rejected, the node α is considered associated with M and, thus, with function f M . Since node α itself is not included in the calculation of the probability value, there is no Supplementary Methods need to cross-validate this procedure, as it naturally resembles a leave-one-out cross-validated procedure.
We evaluated the performance of this procedure for each network and each pathway separately. The p-value cutoff for rejecting the null hypothesis was selected to be ≤0.05 and to result in a PPV≥0.80, meaning that at least 80% of predictions are correct.

Identification of potential regulatory motifs in UTRs
T. brucei genes were clustered based on the normalized values of the experiment set E KQJ .
We used different clustering approaches: Iclust [7] uses an information-based strategy to cluster the genes into a predefined number of clusters. By default, this number is √ |S|, where S is the set of all T. brucei genes with available expression profiles. Alternatively, we used the standard k-means algorithm with either an initial set of 100 means or an initial set of 30 means. The algorithm converged to 82 and 19 clusters, respectively. Gene clusters along with either complete or truncated 3′ UTR sequences were submitted to FIRE [8] with default parameters. The truncated sequences contained the first 1000bp from the 5′ end of each 3′ UTR. Prior to identification of potential regulatory elements, FIRE removes homologous sequences. In the paper, we only discuss the results of running FIRE on the set of 19 clusters and the truncated sequences; the complete set of results can be found at http://webpages.mcgill.ca/staff/Group2/rsalav/web/Suppl/20100109/index.htm.