Identification of upstream transcription factor binding sites in orthologous genes using mixed Student’s t-test statistics

Background Transcription factor (TF) regulates the transcription of DNA to messenger RNA by binding to upstream sequence motifs. Identifying the locations of known motifs in whole genomes is computationally intensive. Methodology/Principal findings This study presents a computational tool, named “Grit”, for screening TF-binding sites (TFBS) by coordinating transcription factors to their promoter sequences in orthologous genes. This tool employs a newly developed mixed Student’s t-test statistical method that detects high-scoring binding sites utilizing conservation information among species. The program performs sequence scanning at a rate of 3.2 Mbp/s on a quad-core Amazon server and has been benchmarked by the well-established ChIP-Seq datasets, putting Grit amongst the top-ranked TFBS predictors. It significantly outperforms the well-known transcription factor motif scanning tools, Pscan (4.8%) and FIMO (17.8%), in analyzing well-documented ChIP-Atlas human genome Chip-Seq datasets. Significance Grit is a good alternative to current available motif scanning tools.


Introduction
A DNA sequence motif is a short conserved pattern that can be coordinated by regulator proteins, such as transcription factors (TFs). DNA sequence motifs represent functionally important regions of the genome and are one of the basic units of molecular evolution that are usually conserved among species [1]. Locating these motifs in the genome and understanding their function is fundamental in building molecular models for biological processes such as human diseases [2,3]. Researchers often face the task of identification of putative binding sites for TFs in whole genomes, termed "motif scanning" [4]. Over the past several decades, many computational pipelines have been described that utilize position weight matrices (PWM) for this task.
MAST searches DNA motifs against a database composed of short sequences and assigns a score to each target sequence assuming that every motif occurs once [5]. MCAST uses a hidden Markov model (HMM) to scan DNA sequences for regions comprising one or more of the given motifs [6], whereas SWAN utilizes a log likelihood ratio (LLR) scoring system built by training a two-state HMM on the background sequences [1]. FIMO computes a LLR score for each position in a DNA sequence motif and converts this score to a q-value using dynamic programming methods [7]. TRAP introduces a physical binding model to predict the relative binding affinity of a transcription factor for a given sequence [8]. PWMScan scans sequence motifs using Bowtie [9] or "matrix_scan", which employs a conventional search algorithm [10]. The Python-based program Motif scraper searches motifs specified as a text string using IUPAC degenerate bases [11]. Several tools such as Toucan [12], OTFBS [13], and CREME [14] count all matches in the target and control sequences and apply binomial statistics for over-representation. Other tools such as Clover [15], PAP [16], oPOSSUM [17], Pscan [18], and TFM_Explorer [19] scan sequence sets from co-regulated or co-expressed genes with TF motifs, and assess motifs that are significantly over-or under-represented, to identify common regulators of the sequence sets. WeederH was designed for discovering conserved TFBS and distal regulatory modules in sequences from orthologous genes [20]. MatrixREDUCE predicts functional transcription factor binding through alignment-free and affinity-based analyses of orthologous promoter sequences in closely related species [21]. Table 1 summarizes motif scanning parameters of 18 currently available tools.
To overcome the shortcomings of currently available tools as described below, a novel motif-scanning algorithm "Grit" was developed that identifies genome-wide upstream TFBS from a known collection of PWMs for promoters of orthologous genes without the need of sequence alignment. This study addressed the question of finding significant associations between TFs and orthologous gene sets by introducing a new computational framework that uses mixed Student's t-test statistics and adopts new ways of constructing promoter sequence sets of orthologous genes. Its application to the human genome has yielded fruitful results, demonstrating its desirability as a motif scanning tool.

Building putative promoter sets for orthologous genes
PWMs for TFs were obtained from the Jaspar database release 2020, referred here as JASPAR-2020 [22], which comprises a collection of TF motifs for human species. The Ensembl Biomart web tool release 100 [23] was used to extract the putative promoter sequences 2 kb upstream of the transcript, for all genes in 294 genomes (S1 Table). The promoter set for orthologous genes used for scanning of TFBS was built by first comparing the cDNA sequence from the target genome (TG, human) with the cDNA sequence from the other 293 reference genomes (RGs, genomes other than human) to identify the orthologous gene clusters, and consequently, put the 2 kb upstream sequence of the orthologous genes together. The BLASTN parameter was "-word_size 11 -reward 2 -penalty -3 -gapopen 5 -gapextend 2 -evalue 1e-6", and the BEST-to-BEST approach based on the E-value (mutual best hit) was used to define orthologous gene pairs between the two species. This promoter set was referred to as the "2K-set" and was available from the Grit website, the promoter sequence for the TG in this set was referred to as the "TPS". A random background promoter sequences set was randomly selected from the 2K-set and named as "BPS".

Statistical identification of TFBS in a target genome
First, we obtained a statistical score based on a component of HMM (Eq 1). The implementation of this raw score (RS) represented the ideals of existing statistical approaches [15]. The RS calculation represents repeated averaging of LLRs. RS represented the LLR for a motif being present at one particular location in a sequence, where w was the width of the motif, L denoted the location being considered, L k was the nucleotide at position k within this location, p(L k ) is the background probability of observing nucleotide L k estimated from the frequency of L k in that sequence, and q(k, L k ) is the probability of observing nucleotide L k estimated from the frequency of the K th location in the motif. The RS for a motif present in a sequence with length l was the natural logarithm of the average of LRs taken over all locations of s, where M s was the number of locations in the sequence calculated as l-w + 1. Statistically significant TFBS in the target genome are identified by a mixed Student's t-test.

Theory of mixed Student's t-test
We tested the significance of RS of a gene in the TG for a given motif assuming that the RSs for the sequences in the 2K-set for this gene were normally distributed. We propose a new statistical approach that is an extension of the Student's t-test, namely, the "mixed Student's t-test". A slightly varied statistical approach from the canonical Student's t-test was proposed-giving a background set (bkg) and a testing set (obs), we determine whether one value (one) from the obs is significantly different from the mean of the values in bkg, where one, obs and bkg are RSs for TPS, 2K-set, and BPS, respectively. The mixed Student's t-test statistic can be calculated by combining the one-sample Student's t-test and independent two-sample Student's t-test. The calculation of the t-statistics (t') and degree of freedom (df) of the mixed Student's t-test were presented as Eqs 2 and 3, respectively. The p-values can be estimated by the "cdflib" function [24].
The coefficient of conserved variation (CCV, Eq 4) and standard difference (SD, Eq 5) are calculated, which indicate the degree of conservation of the TFBS among species and the altitude of difference in RS scores between the TG and the RGs, respectively.
CCV ¼ j � X obs j ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n obs i¼1 ðobs i À oneÞ 2 =n obs q ð4Þ

Development of the Grit software
Utilizing the mixed Student's t-test statistics, we developed a tool, called Grit, for screening TFBS by coordinating TFs to their promoter sequences in orthologous genes. The tool takes JASPAR-2020 (specified by the -m option), 2K-set (-i option), and BPS (-b option) as it's input. Running the tool with default options (-n 10 -z 200 -s 1 -t 0.05 -p 0) will produce a result file (-o option) containing the predicted TFBS. There are three major steps built into the program: 1) calculate the RS for each PWMs presented in each promoter set for orthologous genes using Eqs 1 and 2) calculate the p-values for the significance of RS of each genes for each given PWMs using the mixed Student's t-test statistics; and 3) perform multiple testing correction for all p-values using the FDR method [25], and retain the TFBS with FDR � threshold defined by the -t option. The source code has been deposited in GitHub and is available under academic free license.

Performance assessment methods of the Grit software
The ReMap datasets were obtained from the ReMap website release 2020, referred to as ReMap-2020 [26], and ChIP-Atlas website release 2021, referred to as Atlas-2021 [27]. True positives (TP) were defined as predicted binding sites overlapping 80% with experiment-supported binding sites from ReMap or the ChIP-Atlas ChIP-Seq datasets. False positives (FP) were defined as predicted binding sites that did not overlap with experiment-supported binding sites, and false negatives (FN) were defined as experiment-supported binding sites that were not identified. ], as reported by Sand et al. [28] and Jayaram et al. [29] for all of the datasets analyzed. All assessments of the six tools were performed on Amazon EC2 computation services in parallel. For software such as PWMScan and Clover, where the local version was not available or too slow to analyze all PWMs, a random subset of the transcription factors (35 TFs) was analyzed. The Sn, PPV, and ACCg values for each tool were compared by paired Student's t-test.

Mixed student's t-test with simulated datasets
Two normally distributed datasets were generated and used as bkg (mean = −10, SD = 5, gray) and obs (mean = −2, SD = 7, dark green) datasets. Three values from the obs (located at 25, 50, and 75 percentiles, red) were tested for their significance, and produced p-values of 1.0, 0.03, and 1E-25, respectively ( Fig 1A). Three representative genes from the human genome were used for testing: purple for a gene at the 75% quantile of CCV and 25% quantile of SD, dark green for a gene at the 50% quantile of CCV and 75% quantile of SD, and red for a gene at the 25% quantile of CCV and 75% quantile of SD, all of which produced p-values less than 1E-6 ( Fig 1B). The p-values, CCV, and SD for each entity (one) in obs were calculated. As shown in Fig 1C, the p-values decreased with increasing CCV and SD, indicated that the mixed Student's t-test prefers TFBS either having higher CCV or having higher SD, or both. We also noted that the mixed t-test behaved as a one-sample t-test when distributions of values in obs and bkg were the same, or a two-sample t-test when observation (one) was located at the mean of the obs (S1 Text).

Prediction of TFBS in human genome using Grit
A schematic of the pipeline used in this study was indicated in Fig 1D, which included: (1) blast TG with RGs; (2) build the 2K-set for homolog genes using the BEST-to-BEST approach; (3) run Grit using Jaspar-2020 and the 2K-set; and 4) assess the performance of Grit using public ChIP-Seq datasets. The promoter set contained 2 kb length sequences for a putative promoter region of 35,342 homologues gene clusters. To estimate the accuracy of this promoter set, we compared it with experimentally validated human promoters available in the EPD database containing 29,598 human gene promoter sequences [30]. The TPS contained promoter sequences for 93.2% of these genes showing post alignment with the TPS of the EPD sequences with an E-value < 1E -6 .
Grit was used for identification of TFBS in the human genome by applying it to the 2K-set datasets. The Grit run took 22 h and identified 7.57 million significant TFBS for 537 TFs (FDR � 0.05). A target gene was assigned a TF if the gene was found in at least one TFBS. Grit prediction results were assessed with six publicly available motif scanning tools designed for high throughput analysis using the 829 ReMap-2020 datasets (S2 Table) obtained from the ReMap database [26]. The results were shown in Fig 2. FIMO and Swan consistently achieved higher Sn but lower PPV for ChIP-Seq datasets as compared with other tools (p-value � 0.05). The average Sn of Grit is lower than FIMO but the average PPV of Grit is the highest among all competitors. As results, Grit attained the highest average ACCg, followed by FIMO, Swan,

PLOS COMPUTATIONAL BIOLOGY
Identification of transcription factor binding sites using mixed Student's t-test Pscan, and PWMScan, Clover had the lowest ACCg. It is noticed that Grit outperformed FIMO 29% as evaluated by ACCg (p-value � 0.05). The number of predicted targets for FIMO and Swan was strikingly high, covering approximately 80% of human genes on average, whereas the number of predicted targets for Grit, Pscan, and PWMScan were significantly smaller, with Clover producing the lowest targets (p-value � 0.05).

Performance of Grit using ChIP-Atlas datasets
Additionally, performance of the six scanners was evaluated using 111 high-quality Atlas-2021 target gene sets (S3 Table) collected from experimentally validated human ChIP-Atlas data with literature support [27]. Table 2 listed a random subset of the assessment results. The Sn values of FIMO were higher than those of Grit (33.0%, p-value � 0.05), whereas the PPV values for Grit were higher than those of FIMO (2.15 fold, p-value � 0.05). The ACCg values for Grit were higher than those of FIMO (17.8% on average, p-value � 0.05), indicating that Grit performed better than FIMO for Atlas-2021. Furthermore, the Grit method slightly outperformed

PLOS COMPUTATIONAL BIOLOGY
Identification of transcription factor binding sites using mixed Student's t-test the Pscan method (4.8% on average, p-value � 0.05). Analysis using JASPAR-2020, ReMap-2020, and Atlas-2021 datasets identified Grit, Pscan, and FIMO as the best tools for identifying TFBS (complete prediction results for all the tools have been provided on the Grit website), ranking them based on ACCg in the order Grit > Pscan > FIMO > Swan > PWMScan > Clover.

Differences among Grit and other prediction tools
The prediction results of Grit and five other tools were compared. There were 38.9% TFBS in ChIP-Atlas datasets that were not identified by the other five prediction tools; 32.8% of TFBS were identified by both Grit and the other tools; 11.5% of TFBS were identified by the other tools but not by Grit; and 16.8% of TFBS were identified by Grit but not by the other tools. A total of 2.9% best TFBS identified by Grit for the same gene did not overlap with those identified by the other tools. A comparison of the numbers of TFBS identified by Grit and by the other five tools showed that each tool produced dramatically different prediction results ( Fig  3A and 3B). To show the unique features between Grit and the other tools, we investigated the distributions of CCV and SD for Grit TFBS and Grit specific TFBS (TFBS detected by Grit but did not by other tools, Grit-other, the "-" symbol means subtracting). The results indicated

PLOS COMPUTATIONAL BIOLOGY
Identification of transcription factor binding sites using mixed Student's t-test that Grit − FIMO and Grit − Swan had significant higher CCV values, while Grit − Clover, Grit − Pscan, and Grit − PWMScan had significantly higher SD values, than Grit TFBS (pvalue � 0.05, Fig 3C and 3D).

Comparative genomics is required for TFBS prediction
Identification of TFBS is essential for understanding how TFs regulate gene expression, ultimately controlling processes such as cell cycle progression, stress response, or stem cell

PLOS COMPUTATIONAL BIOLOGY
Identification of transcription factor binding sites using mixed Student's t-test differentiation into adult tissues [31][32][33]. A typical computational issue is deciding, giving a PWM, if a nucleotide sequence contains a valid instance of the TFBS modeled by the PWM itself [4]. Reliable predictions on a single sequence are nearly impossible without further filtering because of the redundant information available on promoter sequences [18]. The activities of functionally important TFs are highly conserved among both closely related and distant species, thereby causing frequent occurrence of their binding sites in orthologous genes [1]. A gene can be compared with its orthologs by analyzing sequence conservation in evolutionarily preserved transcribed regions, which enables the identification of orthologous gene sets, and TFBS can be predicted from the promoter sequences of these genes [21]. Although the predicted TFBS require further wet-lab experiment validation, with the increasing availability of PWMs, this in silicon approach has gained wide popularity [29]. These functionally important binding sites in closely related species can be identified by promoter sequence alignment and phylogenetic printing methods [16,[34][35][36]. However, promoters of orthologous genes in distantly related species are always poorly conserved, and identification of TF binding sites in these sequences is difficult [1,20]. This study used cross-species comparison to build co-regulated orthologous gene sets, without the need for non-coding sequence alignment. Therefore, this approach is well suited for comparative genomics across large evolutionary divergences, when existing alignment-based methods are not feasible. The rationale is that the promoters of most of the genes targeted by the same TF(s) should contain significantly higher scores for TFBS than some suitably computed numbers obtained from a collection of unrelated genes or a random background model.

Mixed Student's t-test is useful in discovering TFBS
By counting the number of matches and mismatches in target and control sequences, overrepresented motif analysis was performed using hypergeometric distribution [14,37]. A more intricate procedure, accounting for sequences with zero, one, two, or three or more matches in the target and control sets has been reported [31]. Several studies have suggested counting all matches in the target and control sequences, and proposed two different binomial formulas for assessing motif over-representation [12,14]. Notably, the widely used Pscan program calculates an RS similar to Clover's z-test to analyze over-or under-representation of TFBS. The p-value is computed by counting the number of times a random dataset yields a score higher than the input sequence set. Our tool "Grit" calculates an RS similar to Clover and Pscan, RS is the average exponent of the standard motif matrix score and is proportional to the factor's total equilibrium occupancy of the TFBS in sequence in a simple thermodynamic model [38][39][40]. Note that RS is a function of the length of the promoter sequence S, and if S is extended to include nucleotides that do not coordinate the motif, RS would decrease. Given sets of equallength target and control sequences, it is possible to test for significance by ranking the RSs from both sets and performing statistical analyses. The newly developed mixed Student's t-test was performed for TG and RGs for sites where the TFBS were expected to be conserved. Additionally, we considered the possibility of motif variation among species with highly diverged in RGs, such as pig or cattle, because of significant changes in the binding scores of TF among orthologous genes in the 2K-set of reference species. However, in cases of sufficiently large numbers of RGs, the binding affinity scores should show a normal distribution. The statistical analysis prefers to detect TFBS either conserved among species (high CCV) or having significant RS differences between the target and control sequences (high SD), or both. In contrast to the statistical test implemented in other tools, which produce a "whole" p-value for the gene set but fail to tell whether a specific sequence has certain TFBS or not, the mixed Student's t-test is not only able to utilize the information from comparative genomics, but also produces a theoretical p-value for an individual sequence of interest.

Single-and multi-species prediction tools
FIMO, Swan, and PWMScan were designed to not only identify potential matches to a motif, but also for potential matches that are greater than expected by chance, considering the genomic background [1,7,10]. All were designed for TFBS prediction in a single-species and produced a large number of TFBS as expected. Compared with these tools, Grit identified significantly smaller numbers of binding sites, which highlights the major differences between these tools. Grit has been designed to predict TFBS based on PWMs, and these sites were either highly conserved or had high RS among the promoter sequences. With the added condition that the TFBS were required to be highly conserved among species, which was not a criterion for single-species scanners, the final lists produced were relatively small, have a higher CCV, and were thus likely to be more suitable for further experimental validation.
Clover and Pscan were designed for multi-species TFBS scanning [15,18]. Similar to the Clover algorithm, Grit computed an RS for each input sequence, representing the average likelihood of each TFBS to a promoter. Regulatory regions of higher eukaryotes often contain multiple binding sites for the same transcription factor, with weaker "shadow" copies of the motif also present [41]. Therefore, considering the average score of multiple matches per sequence will likely aid in the discovery of functional motifs. Another issue is the definition of a "background" suitable for assessing the significance of the results obtained. In Clover, this is performed by shuffling the columns of the motif, or by building random sequence sets of the same size and length of the sequence set investigated [15]. However, the algorithm implemented in Clover is computationally intensive, taking 15 days to process 25 PWMs for the human genome. Similar to Pscan, Grit treats the input sequences as a sample taken from a "universe" composed of all promoter sequences available for the species investigated, several subsamples are taken from the universe, with a default size = 200 and n = 10, and used as the background. For each promoter set of orthologous genes, the RSs obtained from the input sequence set can be compared with the RSs on the subsets randomly taken from the whole genome promoter set, and the p-value can be produced by the mixed Student's t-test.

Availability and future directions
Grit is a good alternative to current available motif scanning tools and is publicly available at http://www.thua45.cn/grit under an academic free license. Further directions will be development of algorithms like gene-set enrichment analysis, to analyze transcriptome data.