A Genome-Wide Survey on Basic Helix-Loop-Helix Transcription Factors in Giant Panda

The giant panda (Ailuropoda melanoleuca) is a critically endangered mammalian species. Studying the functions of regulatory proteins involved in developmental processes would help explain the specific behavior of the giant panda. The basic helix-loop-helix (bHLH) proteins play essential roles in a wide range of developmental processes in higher organisms. bHLH family members have been identified in over 20 organisms, including fruit fly, zebrafish, mouse and human. The present study identified 107 bHLH family members encoded in the giant panda genome. Phylogenetic analyses revealed that they belong to 44 bHLH families, with 46, 25, 15, 4, 11 and 3 members in groups A, B, C, D, E and F, respectively, while the remaining 3 members were classified as "orphans". Compared to mouse, no giant panda counterparts were found for seven bHLH proteins, namely Beta3a, Mesp2, Sclerax, S-Myc, Hes5 (or Hes6), EBF4 and Orphan 1. These results provide useful background information for future studies on the structure and function of bHLH proteins in the regulation of giant panda development.


Introduction
The basic helix-loop-helix (bHLH) proteins form a large superfamily of transcription factors that play crucial roles in a wide range of developmental processes including neurogenesis, myogenesis, hematopoiesis, sex determination and gut development. The bHLH domain is approximately 60 amino acids long and comprises a DNA-binding basic region (b) and two helices separated by a variable loop region (HLH) [1]. The HLH domain promotes dimerization, allowing the formation of homodimeric or heterodimeric complexes between different bHLH family members. The two basic domains which are brought together through dimerization bind specific hexanucleotide sequences.
Over the past two decades, the functions of animal bHLH family members have been well characterized, mainly through studies in model organisms including the nematode (Caenorhabditis elegans), fruit fly (Drosophila melanogaster) and mouse (Mus musculus). Animal bHLHs are classified into 45 families based on their different functions in the regulation of gene expression. In addition, they are divided into 6 groups according to the target DNA elements they bind and their own structural characteristics. Group A consists of 22 families, which mainly regulate neurogenesis, myogenesis and mesoderm formation. Group B consists of 12 families, which mainly regulate cell proliferation and differentiation, sterol metabolism and adipocyte formation, and the expression of glucose-responsive genes. Group C has 7 families, which are responsible for the regulation of midline and tracheal development and circadian rhythms, and for the activation of gene transcription in response to environmental toxins. Group D has only 1 family, which forms inactive heterodimers with group A bHLH proteins. Group E has 2 families, which regulate processes such as embryonic segmentation, somitogenesis and organogenesis. Group F also has 1 family, which regulates processes such as head development and the formation of olfactory sensory neurons (reviewed in [2]).
The giant panda, Ailuropoda melanoleuca, is a critically endangered mammal confined to six isolated mountain ranges of Southwestern China [13]. As one of the most primitive carnivores, the giant panda not only has a unique diet, but also has highly specialized reproductive behavior and low fertility [14], all of which suggest that its regulatory mechanisms of growth and development differ considerably from those of other mammals. However, very little is known about the structure and function of regulatory genes in the growth and development of the giant panda [15,16]. Because bHLH proteins are of great importance in the regulation of organismal development, in this study we made an exhaustive effort to obtain the complete list of bHLH family members encoded in the giant panda genome. As a result, 107 bHLH family members were identified. Phylogenetic analyses with their mouse bHLH homologues revealed that the 107 giant panda bHLH members belong to 44 bHLH families, with 46, 25, 15, 4, 11 and 3 members in groups A, B, C, D, E and F, respectively, while 3 members were classified as "orphans". The present study provides useful background information for future studies on the structure and function of bHLH proteins in the regulation of giant panda development.

Blast Searches
The sets of 45 representative bHLH domains and 114 mouse bHLH motifs were taken from the additional files of previous reports [4] and [17], respectively. Each sequence in both sets was used as a query in tblastn searches against the giant panda genome sequences, which were accessed through the hyperlink provided on GenBank's MapView webpage (http://www.ncbi.nlm.nih.gov/mapview/). The expect value (E) was set at 10 in order to obtain all bHLH-related sequences. The subject sequences obtained were examined manually in three steps. First, only one sequence was kept among those sharing the same contig number, reading frame and coding regions. Second, missing amino acids were added at the corresponding sites with the help of the EditSeq program (version 5.01) of the DNAStar package. Third, introns within the bHLH motifs were located using the NetGene2 online application (http://www.cbs.dtu.dk/services/NetGene2/). Sequence accession numbers of giant panda bHLH proteins were obtained by using the amino acid sequence of each identified bHLH motif in blastp searches against the giant panda protein sequence databases, which were also accessed through the hyperlink on GenBank's MapView webpage.
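The redundancy step described above (keeping one sequence per contig, reading frame and coding region) can be sketched in Python. This is only an illustration and not the manual procedure used in this study: the `Hit` record, its field names and the example values are hypothetical, and the hits are assumed to have been parsed already from the tblastn output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One tblastn hit; the field names are hypothetical, chosen for illustration."""
    query: str     # name of the query bHLH motif
    contig: str    # subject contig identifier
    frame: int     # reading frame reported by tblastn
    start: int     # start of the aligned region on the contig
    end: int       # end of the aligned region on the contig
    evalue: float  # expect value of the hit


def deduplicate(hits):
    """Keep one hit per (contig, frame, start, end), preferring the lowest E-value."""
    best = {}
    for hit in hits:
        key = (hit.contig, hit.frame, hit.start, hit.end)
        if key not in best or hit.evalue < best[key].evalue:
            best[key] = hit
    return list(best.values())


# Made-up hits: two different queries land on the same region of contig_A.
hits = [
    Hit("Mash1", "contig_A", 1, 100, 280, 1e-20),
    Hit("Mash2", "contig_A", 1, 100, 280, 1e-12),
    Hit("Mash1", "contig_B", -2, 500, 680, 3e-5),
]
unique_hits = deduplicate(hits)
```

The sketch treats two hits as redundant only when their coordinates are identical. Hits that merely overlap would need an additional rule, which the text does not specify.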

Sequence Alignment
All sequences that had been improved by the above methods were aligned using the ClustalW program embedded in MEGA4 [18] with default settings. Each sequence was manually examined for its amino acid residues at the 19 conserved sites [19]. Sequences with fewer than nine variations were regarded as potential giant panda bHLH members. Sequences with fewer than ten conserved amino acids were discarded, and the remaining sequences were aligned again using ClustalW. The aligned giant panda bHLH motifs were shaded in GeneDoc Multiple Sequence Alignment Editor and Shading Utility (Version 2.6.02) [20] and copied to a rich text file for further annotation.
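The conserved-site check, which was done by hand in this study, could be automated along the following lines. This is a minimal sketch under stated assumptions: `TOY_SITES` is a made-up table of four sites and is not the real list of 19 conserved sites from [19]; the default cutoff of ten conserved residues follows the discard rule stated above.

```python
def count_conserved(motif, conserved_sites):
    """Count the conserved sites at which the motif carries an accepted residue.

    conserved_sites maps a 1-based alignment position to a string of accepted
    residues (one-letter amino acid codes).
    """
    count = 0
    for position, accepted in conserved_sites.items():
        if position <= len(motif) and motif[position - 1] in accepted:
            count += 1
    return count


def passes_filter(motif, conserved_sites, min_conserved=10):
    """Return True if the motif has at least min_conserved conserved residues."""
    return count_conserved(motif, conserved_sites) >= min_conserved


# Hypothetical table of conserved sites, for illustration only.
TOY_SITES = {1: "RK", 2: "RK", 5: "E", 8: "LIVM"}

good = count_conserved("RRAAERAL", TOY_SITES)  # all four toy sites match
poor = count_conserved("GRAAQRAD", TOY_SITES)  # only position 2 matches
```

With the real table of 19 sites, `passes_filter(motif, sites)` would apply the cutoff of ten conserved residues directly.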

Phylogenetic Analyses
Phylogenetic analyses of all the identified giant panda bHLH members were carried out in two steps. First, all the giant panda bHLH motif sequences obtained were used to build a neighbor-joining (NJ) distance tree with the 114 mouse bHLH motif sequences using PAUP 4.0 Beta 10 [21], based on a step matrix constructed from the Dayhoff PAM 250 distance matrix by R. K. Kuzoff (http://paup.csit.fsu.edu/nfiles.html). Then, each giant panda bHLH motif sequence was used to conduct in-group phylogenetic analyses [9] with mouse bHLH motif sequences. That is, the amino acid sequence of each giant panda bHLH motif was used to construct NJ, maximum parsimony (MP) and maximum likelihood (ML) phylogenetic trees with the mouse bHLH family members of the corresponding group. The NJ trees were bootstrapped with 1,000 replicates to provide information about their statistical reliability. MP analysis was performed using heuristic searches and bootstrapped with 100 replicates. ML trees were constructed using TreePuzzle 5.2 [22] with the quartet-puzzling tree-search procedure and 25,000 puzzling steps. The substitution model was set to Jones-Taylor-Thornton [23]. Other parameters were set to default values.
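For readers unfamiliar with bootstrapping, the resampling step behind the reported support values can be sketched as follows. This is a generic illustration in Python, not the PAUP implementation used here: tree construction is not shown, and the sequence names and toy alignment are made up.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap pseudo-replicate)."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_support(clade_found):
    """Percentage of replicate trees that contain the clade of interest."""
    return 100.0 * sum(clade_found) / len(clade_found)


# Toy alignment with made-up sequences of equal length.
toy_alignment = {"seq1": "RNERER", "seq2": "RNARER", "seq3": "KNARDR"}
replicate = bootstrap_alignment(toy_alignment, random.Random(0))

# If the clade appeared in 3 of 4 replicate trees, its support would be 75.
support = bootstrap_support([True, True, False, True])
```

In a full analysis, a tree would be built from each pseudo-replicate, and the support for a clade would be the percentage of those trees that recover it.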

Giant Panda bHLH Family Members
The tblastn searches, sequence alignment, and examination of the 19 conserved amino acid sites revealed that 107 bHLH genes are encoded in the giant panda genome. The names of all 107 giant panda bHLH members are listed in Table 1. Each identified giant panda bHLH (GpbHLH) gene was named according to the nomenclature used for mouse bHLH sequences. The alignment of all 107 GpbHLH motifs is shown in Figure S1, and the phylogenetic tree constructed using the amino acid sequences of the 107 GpbHLH motifs and 114 mouse bHLH motifs is shown in Figure S2. Figures S1 and S2 together show that there are 46, 25, 15, 4, 11 and 3 members in groups A, B, C, D, E and F, respectively. An additional 3 members were classified as "orphans". No gene encoding a member of the Delilah family was found in the giant panda genome. In Figure S1, the two most conserved sites are located at positions 23 and 59 of the bHLH motif. Eight other sites are also conserved, as indicated with asterisks at the top of Figure S1 (amino acid sequences of all 107 giant panda bHLH motifs are available in File S1).

Identification of Orthologous Families
Ortholog identification involves considerable uncertainty because there is no absolute criterion for deciding whether two genes are orthologous [17]. In our previous studies [9,10], in-group phylogenetic analysis was adopted to identify homologues for unknown sequences that would form a monophyletic clade among themselves, using a stricter criterion based on that of Ledent et al. [17,24]: if a single unknown giant panda bHLH forms a monophyletic clade with a bHLH of known family in phylogenetic trees constructed with different methods, and all the bootstrap values exceed 50, the known member is regarded as a homologue of the unknown sequence. As an example, Figure S3 shows the NJ, MP and ML phylogenetic trees constructed with one GpbHLH member (GpAsh1) and eight group A bHLH members from mouse. In all three trees, GpAsh1 formed a monophyletic clade with mouse Mash1, with bootstrap values ranging from 92 to 100. Therefore, GpAsh1 was considered an ortholog of mouse Mash1. Similar in-group phylogenetic analyses were conducted for each of the identified GpbHLH members, using Figure S2 to select appropriately related mouse bHLH members for the analysis. The bootstrap values of all constructed NJ, MP and ML trees are listed in Table 1; the corresponding trees are not shown. Table 1 shows that the orthology of GpbHLH members with mouse can be divided into the following categories.
Firstly, among the 107 GpbHLH members, 83 have all bootstrap values over 50 (55 ≤ bootstrap values ≤ 100) in the constructed NJ, MP and ML trees. We have sufficient confidence to define the orthology of these GpbHLH motifs to their corresponding mouse bHLH orthologs.
Secondly, 4 GpbHLH members, namely GpTCF4, GpNDF1, GpUSF2 and GpEBF1, formed monophyletic clades with bootstrap values over 50 in NJ and ML trees. Although they also formed monophyletic clades in MP trees, their bootstrap values there ranged from 21 to 45. Therefore, the orthology of these 4 GpbHLH members has been defined according to the statistical support from NJ and ML trees. A further 10 GpbHLH members, namely GpMist1, GpAHR2, GpTwist, GpDHand, GpARNT1, GpSREBP1, GpId1, GpHerp2 and GpOrphan3, formed monophyletic clades with bootstrap values ranging from 50 to 100 in NJ and MP trees, but did not form a monophyletic group with any single bHLH sequence in ML trees (marked with n/m* or n/m in Table 1). For these members, we have defined their orthology according to the statistical support from NJ and MP trees.
Thirdly, 2 GpbHLH members, namely GpPod1 and GpHen2, formed monophyletic clades in NJ and MP trees with bootstrap values ranging from 20 to 79 but did not form a monophyletic group in the ML tree. Another 4 GpbHLH members, namely GpTF12, GpMITF, GpDec2 and GpEBF3, formed monophyletic clades with bootstrap values ranging from 72 to 82 in the NJ tree, but did not form monophyletic clades in MP and ML trees. Although these 6 GpbHLH members did not have sufficient bootstrap support, we defined orthologs for them because each has one or two bootstrap values supporting its orthology to the corresponding mouse ortholog. This phylogenetic divergence of bHLH motif sequences between giant panda and mouse may reflect the quite different circumstances in which these two mammals have evolved.
Finally, 4 GpbHLH sequences did not form a monophyletic clade with most of the mouse bHLH motif sequences in all constructed phylogenetic trees. They are GpBeta3b, GpMesp1, GpHen1 and GpBmal2. For these, whole protein sequences were used to conduct in-group phylogenetic analyses with the whole sequences of the corresponding mouse bHLH proteins in order to define their orthology (marked with "a" in Table 1).
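The categories above amount to counting how many of the three tree-building methods support a clade. A minimal Python sketch of that decision rule follows. The function names and example bootstrap values are hypothetical, `None` stands for "no monophyletic clade" (n/m in Table 1), and treating a value of exactly 50 as unsupported is an assumption based on the phrase "exceed 50".

```python
def supporting_methods(nj, mp, ml, threshold=50):
    """List the methods whose tree gives a clade with bootstrap above the threshold.

    A value of None means the sequences did not form a monophyletic clade.
    """
    values = {"NJ": nj, "MP": mp, "ML": ml}
    return [name for name, value in values.items()
            if value is not None and value > threshold]


def support_category(nj, mp, ml, threshold=50):
    """Summarize the level of support for a proposed ortholog pair."""
    n_supported = len(supporting_methods(nj, mp, ml, threshold))
    if n_supported == 3:
        return "supported by all three methods"
    if n_supported == 2:
        return "supported by two methods"
    if n_supported == 1:
        return "supported by one method"
    return "unsupported: analyze whole protein sequences"


# Made-up bootstrap values, not taken from Table 1.
examples = [
    support_category(92, 100, 95),
    support_category(80, 30, 70),
    support_category(75, None, None),
    support_category(None, None, None),
]
```

The sketch does not distinguish which two methods agree, whereas the text treats NJ plus ML and NJ plus MP as separate cases; `supporting_methods` returns the method names for that purpose.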

Protein Sequences and Genomic Coding Regions of Giant Panda bHLH Genes
Protein sequence accession numbers of all the identified GpbHLH motifs are listed in Table 1. For 95 GpbHLH motifs, protein sequence accession numbers were found in the 'Non-RefSeq protein' database (shown as 'XP' plus number). For 9 GpbHLH motifs, accession numbers were found only in the 'Ab initio protein' database, in which all protein sequences were predicted from their corresponding genomic sequences (shown as 'hmm' plus number). These are GpAsh3b, GpAsh3c, GpTal1, GpSim2, GpNPAS3, GpId1, GpId4, GpDec2 and Hes7. For 3 GpbHLH protein sequences, namely GpKA1, GpMist1 and GpOrphan4, no accession numbers were found in any protein database. Table 1 shows that, among the 104 bHLH protein sequences deposited in giant panda databases, 58 were annotated in full agreement with our analytical result (the name in the "annotation in GenBank" column matches that in the "gene name" column), 33 were annotated differently from our result (the names in the two columns differ), and 13 were merely predicted proteins (indicated as "hypothetical protein"). Therefore, our work not only newly identified these 13 protein sequences as bHLH family members but also provided additional information for further investigation of the 33 differently annotated bHLH protein sequences.
The coding regions and intron analysis for the 107 giant panda bHLH motifs are listed in Table 2.
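The counts reported in this paper can be cross-checked with a few lines of Python. The numbers below are copied from the text; the dictionary keys are only descriptive labels chosen for this sketch.

```python
# Members per higher-order group, plus the orphans.
group_counts = {"A": 46, "B": 25, "C": 15, "D": 4, "E": 11, "F": 3, "orphan": 3}

# Members per orthology category (the labels are descriptive, not from Table 1).
orthology_counts = {
    "NJ, MP and ML": 83,
    "NJ and ML": 4,
    "NJ and MP": 10,
    "NJ and MP, weak support": 2,
    "NJ only": 4,
    "whole protein sequences": 4,
}

# Source of the protein accession number.
accession_counts = {"XP": 95, "hmm": 9, "not found": 3}

# GenBank annotation status of the deposited sequences.
annotation_counts = {"same name": 58, "different name": 33, "hypothetical protein": 13}

total_by_group = sum(group_counts.values())            # 107
total_by_orthology = sum(orthology_counts.values())    # 107
total_by_accession = sum(accession_counts.values())    # 107
deposited = accession_counts["XP"] + accession_counts["hmm"]  # 104
annotated = sum(annotation_counts.values())            # 104
```

All three breakdowns of the 107 members agree, and the 104 deposited sequences match the sum of the three annotation classes.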

The Giant Panda bHLH Repertoire
Compared to the 114 bHLH family members of mouse, the giant panda has one member fewer in each of 7 bHLH families, namely Beta3, Mesp, Paraxis, Myc, Hes, COE and Orphan. The missing bHLH family members are Beta3a, Mesp2, Sclerax, S-Myc, Hes5 (or Hes6), EBF4 and Orphan 1, respectively. Based on the available data, it is difficult to say whether the giant panda really lacks these bHLH genes. At present, bHLH family members have been identified and classified in three mammalian species (human, mouse and rat) [4,7]. Since human differs from mouse and rat in the members of only 2 bHLH families, i.e. Myc and H/E(spl), it is hard to believe that the giant panda differs in 7 bHLH families. Moreover, of the 7 family members missing in giant panda, zebrafish and chicken lack only one (S-Myc) and two (S-Myc and EBF4), respectively [11,12]. Therefore, additional bHLH members may well be found once a new and higher-quality version of the giant panda genome sequence is released. Nevertheless, given that very little information is available on bHLH genes and their functions among bear species, our data provide useful background information for further studies on the regulatory functions of bHLH proteins in the giant panda and other bear species.

Supporting Information
Figure S1 Alignment of 107 giant panda bHLH family members. Designation of basic, helix 1, loop and helix 2 follows Ferre-D'Amare et al. [25]. The family names and high-order groups have been organized according to Table 1 of Ledent et al. [24]. Highly conserved sites are indicated with asterisks on the top. The first five amino acids of NPAS1 were not available due to incompleteness of the correspondent genomic contig sequences. (TIF)
Figure S2 Phylogenetic relationship of 107 giant panda and 114 mouse bHLH members. The tree was constructed with neighbor-joining algorithm with OsRa (the rice bHLH motif sequence of R family) as outgroup. For simplicity, branch lengths of the tree are not proportional to distances between sequences, and bootstrap values less than 50 are not shown. The higher-order group labels are in accordance with Ledent et al. [24]. (TIF)
Figure S3 In-group phylogenetic analyses of GpAsh1. (a), (b) and (c) are NJ, MP and ML trees constructed with one giant panda bHLH member (GpAsh1) and nine group A bHLH members from mouse, respectively. In all trees, OsRa was used as the outgroup. (TIF)
File S1 Amino acid sequences of 107 giant panda bHLH motifs. The giant panda bHLH family members are arranged as those in Tables 1 and 2, in which their family assignment, protein and coding region information can be found accordingly. (DOC)