PpTFDB: A pigeonpea transcription factor database for exploring functional genomics in legumes

Pigeonpea (Cajanus cajan L.), a diploid legume crop, is a member of the tribe Phaseoleae. This tribe is descended from the millettioid (tropical) clade of the subfamily Papilionoideae, which includes many important legume crop species such as soybean (Glycine max), mung bean (Vigna radiata), cowpea (Vigna unguiculata), and common bean (Phaseolus vulgaris). Pigeonpea plays a major role in food and nutritional security, being a rich source of proteins, minerals and vitamins. We have developed a comprehensive Pigeonpea Transcription Factors Database (PpTFDB) that encompasses information about 1829 putative transcription factors (TFs) grouped into 55 TF families. PpTFDB provides comprehensive information about each identified TF, including chromosomal location, protein physicochemical properties, sequence data, protein functional annotation, simple sequence repeats (SSRs) with primers derived from their motifs, orthology with related legume crops, and gene ontology (GO) assignment. PpTFDB (http://14.139.229.199/PpTFDB/Home.aspx) is a freely available and user-friendly web resource that lets users retrieve information on individual members of a TF family through a set of query interfaces, including search by TF ID or by protein functional annotation. In addition, users can browse the data by TF categories and by GO categories. PpTFDB will serve as a central resource for researchers as well as breeders who are working towards the improvement of legume crops.


Introduction
Pigeonpea [Cajanus cajan (L.) Millspaugh], a diploid legume crop (2n = 2x = 22), is a member of the tribe Phaseoleae with an estimated genome size of 858 Mbp. It is the main source of proteins, minerals and vitamins for more than a billion people in the developing world. In addition to its value for human nutrition, its leaves, seeds and pod husks are used as animal feed. Pigeonpea is unique among the legume crops because it is a woody shrub, and its stem and branches are used by the rural population for firewood, fencing, thatch and basket making [1,2]. Cultivation of pigeonpea (C. cajan) is therefore beneficial from economic, health and environmental perspectives. It is known for its high capacity to fix atmospheric nitrogen with the help of symbiotic nitrogen-fixing bacteria (Bradyrhizobium). This nitrogen fixation ability reduces the need for synthetic fertilizers and hence reduces a cause of water pollution [3]. According to the Agricultural and Processed Food Products Export Development Authority (APEDA), 2,51,644.32 MT of pulses were exported by India to different countries of the world, amounting to Rs. 1,603.22 crores, during the year 2015-16. The other major exporting countries are Pakistan, Algeria, Sri Lanka, Turkey and United Arab Emirates [4].
Pigeonpea is tolerant to various biotic and abiotic stresses, including many strains of sterility mosaic, drought and salinity, and has therefore drawn the interest of the plant research community in examining its biology. Plant stress responses are often regulated by multiple signalling pathways that activate gene transcription and associated downstream mechanisms [5,6]. The cis-regulatory elements recognized by transcription factors (TFs) are functional elements located in the promoter regions of genes; they determine the spatial and temporal transcriptional activity of a gene during various biological processes [7].
In this study, we performed a genome-wide sequence analysis of pigeonpea to identify TFs and, using various computational analyses, developed a comprehensive database named the Pigeonpea Transcription Factors Database (PpTFDB). For the development of PpTFDB we used our in-house data, published in 2011 as the first draft of the pigeonpea genome sequence [1]. TFs are an essential part of the transcription machinery, and the identification, characterization and expression analysis of TF families is one of the major areas of research. TFs are involved in the control of gene expression in all living organisms; they regulate gene expression by binding to specific cis-regulatory sequences in the promoters of their target genes [8]. The control of gene expression in plants, as in other living organisms, is essential for the regulation of biological processes such as development, differentiation and response to various environmental signals [9][10][11]. TF databases are available for many plant species whose data are in the public domain. However, no such transcription factor database is available for pigeonpea. Therefore, the objective of the present study was to construct a comprehensive Pigeonpea Transcription Factor Database (PpTFDB) that will serve as a central resource for researchers of the legume community.

Identification of transcription factors
In the pigeonpea genome sequence we predicted 47891 protein-coding genes, along with their CDS, with the FGENESH program, using Glycine max (soybean) as the reference for gene prediction. The whole genome sequence of pigeonpea was downloaded from https://www.ncbi.nlm.nih.gov/bioproject/258132. The complete set of TF sequences was downloaded from the Plant Transcription Factor Database [12], and HMM profiles were created for each TF family using the HMMER program [13]. The HMM profiles were then searched against the pigeonpea proteome data using the HMMER program with the default E-value. The raw alignment data file was manually inspected to ensure reliability. A total of 1829 putative TFs were identified and classified into 55 TF families (S1 Table). The complete set of sequence data for each TF family, including amino acid, CDS and genomic DNA sequences, is available to users and can be downloaded from PpTFDB for further analysis.
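The family assignment step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the HMMER search was run with tabular output (the `--tblout` layout), that each HMM is named after its TF family, and that a protein hitting several profiles is assigned to the family with the lowest E-value. The protein IDs, the sample rows and the `1e-5` cutoff are hypothetical; the paper only states that the default E-value was used.

```python
from collections import defaultdict

# Hypothetical excerpt in HMMER3 `--tblout` layout: whitespace-separated,
# 18 fixed fields followed by a free-text description.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  E-value  score  bias  exp reg clu  ov env dom rep inc description
Cc_prot_00012  -  MYB          -  1.2e-35  120.4  0.1  3.4e-20   70.2  0.0  2.1  2  0  0  2  2  2  2  hypothetical protein
Cc_prot_00012  -  MYB_related  -  4.0e-12   45.0  0.0  1.0e-08   34.1  0.0  1.3  1  0  0  1  1  1  1  hypothetical protein
Cc_prot_00459  -  WRKY         -  7.7e-50  165.9  1.3  9.0e-50  165.7  1.3  1.0  1  0  0  1  1  1  1  hypothetical protein
Cc_prot_01877  -  bHLH         -  0.5        8.2  0.0  0.9        7.4  0.0  1.2  1  0  0  1  1  0  0  hypothetical protein
"""


def parse_tblout(text):
    """Return one dict per hit line, skipping comments and blank lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append({
            "protein": fields[0],        # target (protein) name
            "family": fields[2],         # query (HMM profile) name
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits


def assign_families(hits, max_evalue=1e-5):
    """Keep the best-scoring family per protein (assumed tie-break rule)."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["protein"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["protein"]] = hit
    families = defaultdict(list)
    for protein, hit in best.items():
        families[hit["family"]].append(protein)
    return dict(families)


if __name__ == "__main__":
    for family, members in assign_families(parse_tblout(SAMPLE_TBLOUT)).items():
        print(family, len(members), members)
```

Because the paper reports manual inspection of the raw alignments, an automated filter like this would at most pre-sort candidates for that review.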

Identification of SSR and orthologous groups
The CDS sequences of the putative TFs were scanned for simple sequence repeats (SSRs) using the MISA tool (http://pgrc.ipk-gatersleben.de/misa/). Primers for the identified SSRs were designed with the BatchPrimer3 online tool (http://probes.pw.usda.gov/batchprimer3/) using the following parameters: primer length of 20-25 bp; PCR product size of 100-250 bp, with optimum of 280 bp; GC content of 40-60%, with optimum of 50%. To find orthologues of each identified putative TF, the protein sequences were compared with the protein BLAST [16] program, using default parameters, against the protein sequences of several legume crops: soybean (G. max), mung bean (Vigna radiata), adzuki bean (Vigna angularis), common bean (Phaseolus vulgaris) and barrel medic (Medicago truncatula). Homology with >80% similarity was taken as the threshold for selecting an orthologue (S2 Table).
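To make the SSR scan concrete, the sketch below finds perfect microsatellites in a CDS with regular expressions. It is a simplified stand-in for MISA, not a reimplementation of it: the minimum repeat counts per motif length are an assumption (they follow commonly used MISA settings, which the paper does not report), compound SSRs are not handled, and the example sequence is invented.

```python
import re

# Minimum number of repeats per motif length (assumed thresholds; the
# paper does not state the values passed to MISA).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (motif, repeat_count, start, end) tuples, 1-based inclusive."""
    seq = sequence.upper()
    found = []
    for unit_len in sorted(min_repeats):
        n = min_repeats[unit_len]
        # A motif of unit_len bases followed by at least n-1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AAA" or "ATAT"); those belong to the shorter class.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            repeats = len(m.group(0)) // unit_len
            found.append((motif, repeats, m.start() + 1, m.end()))
    return sorted(found, key=lambda ssr: ssr[2])


if __name__ == "__main__":
    # Invented CDS fragment containing an (AT)7 and a (GAA)5 repeat.
    cds = "GGCC" + "AT" * 7 + "CGTACG" + "GAA" * 5 + "TTGC"
    for motif, repeats, start, end in find_ssrs(cds):
        print(f"({motif}){repeats} at {start}-{end}")
```

The reported coordinates are what a primer design step would need in order to place primers in the flanking sequence on either side of each repeat.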

Database construction and implementation
The Pigeonpea Transcription Factor Database (PpTFDB) was designed using a Three-Level Schema Architecture (Fig 1). All data tables were stored in MSSQL Server 2008 in a relational manner to support custom searches and easy retrieval of data. A diagrammatic representation of the data tables incorporated in PpTFDB (the database schema) is shown in Fig 2. Each TF family is introduced by a brief description and a hyperlink to the relevant literature. Hyperlinks to external databases such as Pfam (http://pfam.sanger.ac.uk/), SMART (http://smart.embl-heidelberg.de/), PrositeProfiles (http://prosite.expasy.org/), SUPERFAMILY (http://supfam.cs.bris.ac.uk/), Panther (http://www.pantherdb.org/panther/) and Gene Ontology (GO) (http://amigo.geneontology.org/) enable the user to go to the respective database and obtain comprehensive information about a candidate TF.
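A relational layout of this kind can be illustrated with a small, self-contained sketch. The actual PpTFDB schema is the one shown in Fig 2 and runs on MSSQL Server 2008; the version below uses Python's built-in sqlite3 module purely as a stand-in, and every table name, column name, TF ID and accession in it is hypothetical.

```python
import sqlite3

# Hypothetical three-table layout: families, TFs, and per-TF annotations
# pointing at external databases (Pfam, SMART, Panther, ...).
SCHEMA = """
CREATE TABLE tf_family (
    family_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE tf (
    tf_id     TEXT PRIMARY KEY,
    family_id INTEGER NOT NULL REFERENCES tf_family(family_id),
    contig    TEXT,
    start_pos INTEGER,
    end_pos   INTEGER
);
CREATE TABLE annotation (
    tf_id     TEXT NOT NULL REFERENCES tf(tf_id),
    source    TEXT NOT NULL,
    accession TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO tf_family VALUES (1, 'WRKY', 'placeholder description')"
    )
    conn.executemany(
        "INSERT INTO tf VALUES (?, ?, ?, ?, ?)",
        [
            ("PpTF_demo_0001", 1, "contig_demo_1", 100, 1300),
            ("PpTF_demo_0002", 1, "contig_demo_7", 5200, 6100),
        ],
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        [
            ("PpTF_demo_0001", "Pfam", "PF_DEMO_1"),
            ("PpTF_demo_0002", "Pfam", "PF_DEMO_1"),
            ("PpTF_demo_0002", "SMART", "SM_DEMO_9"),
        ],
    )
    return conn


def tfs_by_annotation(conn, source, accession):
    """List (tf_id, family name) pairs carrying a given annotation ID."""
    return conn.execute(
        "SELECT tf.tf_id, tf_family.name FROM annotation "
        "JOIN tf ON tf.tf_id = annotation.tf_id "
        "JOIN tf_family ON tf_family.family_id = tf.family_id "
        "WHERE annotation.source = ? AND annotation.accession = ? "
        "ORDER BY tf.tf_id",
        (source, accession),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(tfs_by_annotation(conn, "Pfam", "PF_DEMO_1"))
```

Keeping annotations in their own table means a search by any annotation source uses the same join, and parameterized queries keep user-entered IDs out of the SQL text.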

Database search criteria
Under the Search tab of PpTFDB, the "search by TF ID" option lets the user look up a particular transcription factor by its assigned TF ID. After the user enters the TF ID and clicks the search button, the database returns the complete details of the TF, including chromosomal location, physical properties, annotation details, sequence information, orthologue details and SSR details with three pairs of primers designed for each SSR. Similarly, the user can search the database by entering a Pfam ID, SMART ID, InterProScan ID, ProSiteProfiles ID, SUPERFAMILY ID or Panther ID of interest (S1 Fig). After entering the function ID and searching, users are redirected to a page listing the matching TFs with annotation information and the hyperlinked fields accession id, detail information, SSR details and orthologs. Clicking on a particular accession id opens the related database page, which has detailed information about the family. The "detail information" link leads to a page containing contig position, length, TF family, physico-chemical properties and sequence information. The "SSR details" link leads to a page describing the SSRs predicted in the particular TF, with a link to the primer details that lists three pairs of primers. The "orthologs" link provides the percentage orthology with other legume crops.
Browse by TF categories. This browse option lets the user explore the transcription factors available in PpTFDB according to their TF family. In total, 1829 TFs were predicted and categorized into 55 TF families. This web page lists the 55 predicted TF families and the number of TFs present in each family. For detailed information about a particular TF family, the user clicks the 'Get details' button, which redirects to a separate web page. This page includes information about the TF family, a PubMed link to the literature, the list of TFs available under this category and the hyperlinked fields detail information, annotation details, SSR details and orthologs. These hyperlinked fields provide various details about each TF, such as the name of the contig in which the TF is present, start and end positions, length, orientation, physical properties, protein, CDS and genomic sequences, functional annotation predicted by InterProScan, predicted SSR information with three pairs of primers, and percentage orthology with other legume crops.
Browse by GO categories. This browse option enables the user to explore transcription factors in PpTFDB according to their assigned GO category. The 1829 predicted TFs were submitted to the BLAST2GO program to assign their respective GO categories, namely cellular component, biological process and molecular function. This web page shows the number of TFs assigned to each GO category, together with a hyperlinked GO id field through which the sequences are made available in FASTA format for further analysis (S3 Fig).
The pigeonpea genome has been sequenced by two different groups, Singh et al. [1] and Varshney et al. [18], for the same genotype, ICPL 87119, known as Asha. For developing the Pigeonpea Transcription Factor Database (PpTFDB), we used the in-house pigeonpea data sequenced by Singh et al. [1] because it is available in the form of contigs assigned to specific chromosomes. The difference between the number of transcription factors listed for each family in the Plant Transcription Factor Database (http://planttfdb.cbi.pku.edu.cn/index.php?sp=Cca) and in our PpTFDB is due to differences in genome coverage. Singh et al. [1] sequenced about 60% of the estimated 858 Mb pigeonpea genome, whereas Varshney et al. [18] covered 72.7% of the estimated genome size. Hence, the variation between the number of TFs identified in the present study and the number present in the Plant Transcription Factor Database might be due to only 60-72% of the genome sequence being available in the public domain. Once the whole genome sequence is generated and made available, the present database will be enriched with the updated information. The sequenced plant genome data available in the public domain enable researchers to carry out high-throughput research in comparative genomics and transcriptomic data analysis [19], gene expression analysis [20], functional genomics [21], proteomics [22] and database development [23]. Such databases are very helpful to biologists for the functional validation of genes identified in silico.
Transcription factors (TFs) play a major role in controlling processes such as responses to biotic and abiotic stresses, development, differentiation, metabolism and defense responses to pathogens. TFs also play roles in plant innate immunity by regulating genes related to pathogen-associated molecular pattern-triggered immunity, effector-triggered immunity, hormone signaling pathways and phytoalexin synthesis [24,25]. Recently, structure-based approaches to TF-binding site prediction have gained substantial interest because the rapidly growing structural database of TF-DNA complexes can provide much more information for predicting TF binding sites than sequence-based approaches. Most structure-based approaches rely on a model built from solved TF-DNA complexes and on a scoring function that evaluates the binding affinity between a DNA subsequence and a transcription factor [26]. Integrating genomic information with the knowledge obtained from functional and structural studies will facilitate a better understanding of gene regulation in plants, supporting the development of new varieties with agronomically important traits and the regulation of plant defense mechanisms.
We believe that PpTFDB will be beneficial for researchers and plant breeders who are working on the improvement of legume crops and on genome-wide studies of TF families. The database is user friendly and also gives researchers the option to freely download the entire data set used to build it.

Conclusion
PpTFDB is a user-friendly web interface that makes a range of information about pigeonpea TFs available in the public domain. This database will reduce the effort researchers and breeders spend extracting genomic information about pigeonpea TF families. The availability of comprehensive information in the database, including individual or family-wise TFs, protein functional annotation and gene ontology annotation of the predicted TFs, is expected to help prioritize the functional analysis of TFs of interest. We believe that the information about pigeonpea TFs available in the database will support basic and applied research. The database will be updated on a regular basis as updated versions of the data become available. Further, additional information related to the pigeonpea TFs, as well as gene expression data, including expression patterns in different cultivars, and genomic variations, will be integrated into the database in the near future.