Mammalian Comparative Sequence Analysis of the Agrp Locus

Agouti-related protein encodes a neuropeptide that stimulates food intake. Agrp expression in the brain is restricted to neurons in the arcuate nucleus of the hypothalamus and is elevated by states of negative energy balance. The molecular mechanisms underlying Agrp regulation, however, remain poorly defined. Using a combination of transgenic and comparative sequence analysis, we have previously identified a 760 bp conserved region upstream of Agrp which contains STAT binding elements that participate in Agrp transcriptional regulation. In this study, we attempt to improve the specificity for detecting conserved elements in this region by comparing genomic sequences from 10 mammalian species. Our analysis reveals a symmetrical organization of conserved sequences upstream of Agrp, which cluster into two inverted repeat elements. Conserved sequences within these elements suggest a role for homeodomain proteins in the regulation of Agrp and provide additional targets for functional evaluation.


INTRODUCTION
AGRP is an orexigenic, hypothalamic peptide whose role in energy homeostasis has been conserved during vertebrate evolution. In a wide range of species, including mammals [1,2], birds [3,4], and fish [5], neuronal Agrp expression is restricted to a discrete population of neurons that sense the levels of peripheral energy stores, and is dramatically elevated by deficits in energy balance.
AGRP neurons directly receive information about energy stores from leptin, an adipocyte-derived hormone that circulates at levels proportional to fat mass. Diminished levels of circulating leptin correspond to increased Agrp mRNA levels [6]. Leptin receptor occupancy on AGRP neurons results in the phosphorylation of two transcription factors, FoxO1 [7] and STAT3 [8], inducing the cytoplasmic localization of FoxO1 and the nuclear localization of STAT3. A recent study by Kitamura and colleagues [7] implicates a direct, reciprocal role for both STAT3 and FoxO1 in Agrp regulation, where STAT3 represses and FoxO1 stimulates Agrp transcription. However, conserved STAT binding sites in the Agrp promoter region do not function as simple repressor elements and, paradoxically, are required for fasting induced stimulation of Agrp transcription [9], suggesting a more complex mechanism of regulation involving additional cofactors. Support for this also stems from the observation that peripheral energy signals other than leptin regulate Agrp expression. For example, AGRP neurons directly mediate the orexigenic effects of glucocorticoids [10,11] and ghrelin [12]. Central administration of either elevates Agrp mRNA levels [13,14], and intact glucocorticoid signaling is required for fasting induced increases in Agrp expression [14]. Neither STAT3 nor FoxO1 is a known target of glucocorticoid or ghrelin signaling.
Cross-species comparative sequence analysis has facilitated the detection of functional elements that participate in transcriptional regulation [15]. Previously, we identified mouse BAC clones that recapitulate Agrp expression in transgenic mice and that contain regions of high sequence conservation between mouse and human, including a 760 bp region located immediately upstream of Agrp [16]. While we could identify candidate binding elements in this region based on biological inference [9], the level of resolution provided by mouse/human sequence comparison did not permit us to predict putative binding elements based solely on sequence conservation.
The inclusion of sequences from multiple, divergent species sharing a commonly derived phenotype improves the resolution of comparative sequence analysis. The conserved nature of Agrp expression in vertebrates suggests that sequence comparisons of disparate vertebrate species provide an appropriate evolutionary scope for identifying Agrp regulatory elements. However, Agrp genomic sequences from distantly related vertebrate species have diverged to the extent that regional conservation is no longer detectable using traditional alignment methodologies [9]. Here, we describe a comparative sequence analysis of an Agrp genomic region from ten mammalian species, representing several different orders and all three subclasses of mammalia. Our analysis reveals a symmetrical organization of conserved sequences upstream of Agrp, which cluster into two inverted repeat elements (IREs). The proximity of the elements to Agrp and the nearly perfect evolutionary conservation of their specific constituent sequences in all ten mammalian species and in chickens suggest a role in Agrp regulation. In addition, the resolution of the conserved elements provided by this approach allows general predictions concerning putative trans-regulatory factors.

Matrix comparisons of Agrp genomic regions
Genomic sequence surrounding the Agrp locus for chimpanzee (Pan troglodytes), human (Homo sapiens), dog (Canis familiaris), mouse (Mus musculus), and rat (Rattus norvegicus) was obtained from publicly available genome assemblies (chimpanzee assembly CHIMP1, human assembly 35, dog assembly CanFam1.0, mouse assembly m33, rat assembly RGSC 3.4, respectively) on the UCSC genome browser (http://genome.ucsc.edu). Genomic sequence for other species, including chicken (Gallus gallus), cat (Felis catus), cow (Bos taurus), tenrec (Echinops telfairi), platypus (Ornithorhynchus anatinus), and opossum (Monodelphis domestica), was obtained from the trace sequence archives at NCBI (http://www.ncbi.nlm.nih.gov) and assembled using SeqMan (DNAStar, Inc., Madison, WI). Our assembly of chicken genomic sequence at the Agrp locus differed from the publicly available genome assembly (Chicken v1.0), since additional trace sequence was available. Notably, the chicken sequence corresponding to the STAT site and inverted repeat element is not present in the public assembly. Our assembly of the chicken Agrp locus is provided as supplementary material (Text S1). Matrix comparisons were done for each species pair using the dot plot feature in the MegAlign software package (DNAStar, Inc., Madison, WI). We empirically determined an appropriate visualization threshold for dot plots after testing several window sizes and sequence identity levels. The threshold of 70% sequence identity over 30 bp provides specificity to detect relatively small, highly conserved sequences while excluding the majority of background signal.
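The thresholded matrix comparison described above can be illustrated with a minimal sketch. This is not the MegAlign implementation; the function names `reverse_complement` and `dot_plot_hits` are hypothetical, and the brute-force scan is assumed to be adequate only for short regions of a few kilobases. Reverse-strand similarity is detected by comparing one sequence against the reverse complement of the other.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def dot_plot_hits(seq_x, seq_y, window=30, min_identity=0.70):
    """List (i, j) window starts where ungapped windows meet the identity threshold.

    A hit means seq_x[i:i+window] and seq_y[j:j+window] match at
    at least min_identity of their positions.
    """
    seq_x, seq_y = seq_x.upper(), seq_y.upper()
    hits = []
    for i in range(len(seq_x) - window + 1):
        win_x = seq_x[i:i + window]
        for j in range(len(seq_y) - window + 1):
            win_y = seq_y[j:j + window]
            matches = sum(a == b for a, b in zip(win_x, win_y))
            if matches / window >= min_identity:
                hits.append((i, j))
    return hits


# Forward-strand hits, then reverse-strand hits (coordinates on the
# reverse-complemented second sequence).
forward = dot_plot_hits("A" * 30, "C" * 9 + "A" * 21)
reverse = dot_plot_hits("A" * 30, reverse_complement("T" * 30))
```

Raising `window` or `min_identity` trades sensitivity for specificity, which is the trade-off explored when choosing the 70% over 30 bp threshold.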

Measurement of evolutionary constraint
Approximately 1.8 kb of genomic sequence from the Agrp locus of 10 mammalian species, including the transcription unit and the proximal conserved region, was aligned using the multi-Lagan alignment tool (http://lagan.stanford.edu) [17]. All sequences used in this analysis are provided as supplementary material (Text S2). Because the function of putative Agrp regulatory sequences has been previously interrogated using mouse transgenic models [9], we chose to use mouse sequence (corresponding to mouse Chromosome 8:104864049-104863970 from the mouse genome Build 34) as a reference for analysis. To maintain consistency between mouse genome assembly and our alignment coordinates, the multi-species alignment was compressed such that the mouse sequence is ungapped. We measured evolutionary constraint at each nucleotide position using Genomic Evolutionary Rate Profiling (GERP), the methodology of which is described in detail by Cooper et al. [18]. Briefly, GERP estimates an expected neutral substitution rate for a group of aligned species, and quantifies deviation from the estimated neutral rate for each column in a species alignment. The sum of deviation scores in consecutively constrained alignment columns provides a metric of evolutionary constraint, termed a rejected substitution (RS) score. Therefore, the RS score associated with a constrained region is impacted by both the magnitude of deviation from the estimated neutral rates at each column position and the number of consecutively constrained columns. Since some functional regions contain unconstrained nucleotides, GERP allows for RS scores to be merged across unconstrained alignment columns. In this analysis, RS scores were merged across a single unconstrained column. The neutral rate estimate for species in our alignment, based on previous estimates from other genomic loci [18,19], is 2.75 substitutions/site.
The maximum false positive score, based on ungapped alignment segments from permuted versions of the data set, was 13.6 ± 0.8 (n = 10). Table 1 summarizes the RS scores for this region ranking above the maximum false positive estimates for permuted data sets [18], all of which are independent of sequence annotation. The results of GERP analysis were visualized using the Java-based graphical user interface, Application for Browsing Constraints (ABC) [20].
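The merging of RS scores across a single unconstrained column can be sketched as follows. This is an illustrative reading of the description above, not the GERP program itself. It assumes that each alignment column already has a deviation score (expected minus observed substitutions), that a column counts as constrained when its deviation is positive, and that the deviation of a bridged unconstrained column is included in the sum. The name `rs_segments` and its parameters are hypothetical.

```python
def rs_segments(deviations, max_gap=1):
    """Group constrained columns into segments and sum their deviations.

    deviations: per-column deviation scores (assumed: expected - observed).
    max_gap: number of consecutive unconstrained columns a segment may bridge.
    Returns a list of (start, end, rs_score) with end exclusive.
    """
    segments = []
    start = None  # first constrained column of the open segment
    last = None   # most recent constrained column seen
    for idx, dev in enumerate(deviations):
        if dev <= 0:
            continue  # unconstrained column (assumption: deviation <= 0)
        if start is not None and idx - last - 1 > max_gap:
            # Gap too wide to bridge: close the open segment.
            segments.append((start, last + 1, sum(deviations[start:last + 1])))
            start = None
        if start is None:
            start = idx
        last = idx
    if start is not None:
        segments.append((start, last + 1, sum(deviations[start:last + 1])))
    return segments


example = [1.0, 2.0, -0.5, 1.5, -1.0, -1.0, 3.0]
merged = rs_segments(example, max_gap=1)    # bridges the column at index 2
unmerged = rs_segments(example, max_gap=0)  # no bridging
```

Segments whose score exceeds the maximum false positive score from permuted alignments would then be retained as constrained elements.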
In order to calculate constraint scores for Agrp exons, the STAT sites, and the IREs, we modified the concept of an RS score, such that the score for constraint represents the sum of constrained alignment columns in these regions while ignoring unconstrained alignment columns. In this way, the constraint score reflects the level of sequence conservation within a particular region without penalizing unconstrained positions (Table 2).
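The modified constraint score for an annotated region can be sketched in the same terms. As before, the per-column deviation scores and the rule that a positive deviation marks a constrained column are assumptions. The per-base-pair average shown here divides by the full region length, which is one plausible normalization and not necessarily the one used for Table 2. The function name `region_constraint_score` is hypothetical.

```python
def region_constraint_score(deviations, start, end):
    """Sum only the constrained columns in deviations[start:end].

    Unconstrained columns (assumed: deviation <= 0) contribute nothing,
    so they do not penalize the region. Returns (total, per_bp_average).
    """
    region = deviations[start:end]
    if not region:
        raise ValueError("region is empty")
    total = sum(dev for dev in region if dev > 0)
    return total, total / len(region)


total, per_bp = region_constraint_score([1.0, -2.0, 3.0, 0.5], 0, 4)
```

Normalizing by length allows regions of different sizes, such as a short STAT site and a longer coding exon, to be compared on the same scale.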

RESULTS
Using a matrix analysis that did not rely on global sequence alignment, we searched for short, nearly identical sequences (at least 70% sequence identity over 30 bp) surrounding Agrp in pairwise comparisons of vertebrate species. Analysis of distantly related mammalian species, such as human and platypus (Figure 1A), identified two highly conserved regions, each extending ~40 bp. Notably, the detection threshold of these regions, which we termed inverted repeat elements (IREs), [...] (Figure 1A). IREs were not detected using matrix analysis when pairwise comparisons were extended to include nonmammalian vertebrate species (data not shown). As shown in Figure 1B, the IREs are located immediately upstream of Agrp, each one flanking a conserved STAT binding element [7,9]. Moreover, the IREs show striking similarity to one another when aligned in a reverse complementary manner (Figure 1A). The sequence similarity between the IREs and their symmetrical orientation with respect to the STAT binding elements suggest that the genomic region upstream of Agrp underwent a duplication prior to mammalian radiation, and that portions of the duplicated region have been subjected to purifying selection such that specific sequences (those corresponding to the STAT elements and IREs) are maintained in two closely related copies.
To quantify evolutionary constraint for different conserved regions at the Agrp locus, we aligned 1.8 kb of Agrp genomic sequence from 10 mammalian species (human, chimpanzee, mouse, rat, cow, dog, cat, tenrec, opossum, and platypus) with multi-Lagan. We then used GERP [18] to estimate evolutionary constraint across the sequence alignment. Figure 2A shows constraint variation across the Agrp locus and compares the position of constrained sequences with annotated elements. Based on this approach, the level of evolutionary constraint for sequences overlapping the IREs and the STAT binding sites is extremely high relative to other genomic regions [18] and well above false positive estimates (Table 1 and Methods). In fact, using the average constraint score per base pair as a metric, the IREs and STAT binding sites are more conserved than Agrp coding sequences in this alignment of species (Table 2). In contrast, putative FoxO1 binding sites, located adjacent to the STAT sites, are not well conserved (Figure 2B).
A remarkable aspect of the IREs, in addition to their evolutionary constraint, is their organization into palindromic sequence blocks. As illustrated in Figure 3, the distal IRE contains three conserved palindromic motifs, constituting 31 of 39 constrained nucleotide positions. The proximal IRE also contains a conserved palindromic motif and a direct repeat motif, constituting 18 of 39 constrained nucleotide positions, as well as conserved half sites for palindromic motifs identified in the distal IRE (Figure 3). Since palindromes commonly serve as recognition sequences for transcription factors, the clustering of conserved palindromic sequences suggests that the IREs may function as composite binding modules for a complex of transcription factors.
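In the DNA sense used here, a palindromic motif is a sequence that reads the same as its own reverse complement. A minimal sketch of how such motifs could be located in a sequence is given below. The function names are hypothetical, the scan is restricted to one fixed motif length per call, and perfect palindromes are assumed, so imperfect or spaced palindromes would not be reported.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def is_palindromic(seq):
    """True if the sequence equals its own reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)


def find_palindromes(seq, length=6):
    """Return (start, motif) for every perfect palindrome of the given length."""
    seq = seq.upper()
    found = []
    for start in range(len(seq) - length + 1):
        motif = seq[start:start + length]
        if is_palindromic(motif):
            found.append((start, motif))
    return found


palindromes = find_palindromes("ACGAATTCAA", length=6)
```

Only even-length sequences can satisfy this definition, because the central base of an odd-length sequence would have to be its own complement.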
We could not confidently align orthologous sequence from nonmammalian species, such as chicken and zebrafish, using multi-Lagan. Instead, we scanned genomic regions upstream of Agrp in these species, using matrix analysis, to detect motifs similar to those conserved in mammalian IREs. As shown in Figure 4, we identified a region located ~2.5 kb upstream of the first Agrp coding exon in chicken that demonstrates similarity to the proximal IRE of mammals and contains some of the conserved motifs common to both mammalian IREs. In particular, the chicken IRE-like region includes an 18 bp segment which is perfectly conserved with the proximal platypus IRE (Figure 4B), a striking observation considering the extensive evolutionary distance between these species (~300 million years). Moreover, a consensus STAT site is located 60 bp upstream of the chicken IRE-like region, indicating that the positional relationship of these elements is also preserved (Figure 4B). We were unable to identify a second region in chicken genomic sequences corresponding to the distal IRE and STAT binding element, and we were also unable to identify similar elements near Agrp in sequenced teleost fish species.

DISCUSSION
In this study, we leveraged the recent availability of sequence information from different species and new methodologies in sequence comparison to improve the ability to detect conserved sequences at the Agrp locus. The resolution provided by comparing 10 mammalian sequences with GERP reveals a previously unappreciated structure to the region upstream of Agrp. Conservation of this genomic structure in mammals indicates that a duplication event occurred prior to mammalian speciation in which the highly conserved sequences have been subsequently maintained in two homologous copies. Nevertheless, orthologous IREs, even from distantly related species such as human and platypus, are more similar than paralogous IREs, suggesting that paralogous IREs have acquired different functions that are preserved in mammals.
Figure 1. (A) Matrix comparison of human (X-axis) and platypus (Y-axis) genomic sequences at the Agrp locus with a stringency threshold set at 70% identity over a 30 bp window reveals conservation in Agrp coding exons (shaded areas) and in the upstream region, representative of similar comparisons between distantly related mammalian species. For reference, the position of Agrp exonic sequence is shown on either axis, with coding sequence in red and non-coding sequence in grey. Sequence similarity between forward and reverse strands is plotted in black and red, respectively. (B) Shows the genomic locus containing Agrp and neighboring genes, MKIAA1930 and Atp6v0d1. The blue boxes indicate areas of mouse/human sequence conservation located within a functionally defined Agrp regulatory region [16]. The relative position of the conserved IREs (in yellow) and the STAT binding sites (in red) upstream of Agrp are illustrated below. Sequence similarity and the symmetrical structure of the elements reveal an ancient duplication within the Agrp region, the axis of which is indicated by a vertical, dashed line. doi:10.1371/journal.pone.0000702.g001
Agrp is also expressed prominently in the adrenal gland [1,21], where it is thought to regulate steroidogenesis [22]. In both the arcuate nucleus and the adrenal gland, Agrp expression levels are elevated by food deprivation and leptin deficiency [1,21], suggesting a common mechanism of regulation. However, the transcript lengths in these expression sites differ. In mice and humans, hypothalamic transcripts have a longer 5′ untranslated region than adrenal transcripts, indicating alternative transcriptional start sites [1]. Notably, the difference in transcript length, ~300 bp, is similar to the distance between the IREs. It is therefore plausible to speculate that different IREs may regulate Agrp expression in different expression sites.
We also identified a region upstream of chicken Agrp containing both a STAT site and an IRE-like element, which are strikingly similar to their mammalian counterparts in spacing and composition. The chicken STAT site is perfectly conserved with mammalian STAT sites; the IRE-like element, while less conserved, retains some of the features of mammalian IREs, including the TAAT motifs which display the highest degree of constraint in mammals (Figure 3). If the IREs function as composite binding modules for multiple transcription factors, then the chicken IRE-like element might contain a subset of the binding modules found in mammalian IREs as well as distinct modules. An analogous comparison of orthologous regions in other avian species may reveal a pattern of conserved sequences within these modules that is different from mammals, reflecting differences in Agrp regulation, and could potentially provide a molecular perspective for the evolutionary development of pathways regulating Agrp transcription.
Figure 2. [...] Table 1. Coordinates refer to the position in mouse genomic sequence relative to the first position of the Agrp translational start site. While GERP analysis is independent of sequence annotation, constrained sequences overlap with the IREs (yellow), putative STAT and FoxO1 binding elements (red and green, respectively), and Agrp coding regions (red). (B) Sequence alignments indicating the relative conservation of putative FoxO1 and STAT binding sites, with consensus sites shown below the alignments. doi:10.1371/journal.pone.0000702.g002
Figure 3. Mammalian alignment of the inverted repeat element. Sequences comprising the inverted repeat elements from 10 mammalian species were aligned using the multi-Lagan tool. To demonstrate their sequence homology, distal IRE and proximal IRE are shown in a forward and reverse orientation, respectively. Both IREs contain multiple conserved sequence motifs, including palindromic (PI-PIII) and direct (DRI) repeat sequences (shaded in grey), which likely function as transcription factor binding modules. Putative homeodomain transcription factor binding motifs are outlined in red. doi:10.1371/journal.pone.0000702.g003
The TAAT sequences comprising the core regions of conserved palindromes and direct repeats in both IREs are canonical binding sites for homeodomain (HD) transcription factors, which are generally involved in the regulation of development and cell fate but also function post-developmentally to modulate cell type specific gene expression. While HD transcription factors represent a large family of proteins with over 100 members, most display restricted expression patterns that are predictive of function. Several HD transcription factors are regionally expressed in the hypothalamus.
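Candidate homeodomain core sites of this kind can be located with a simple motif scan. The sketch below reports TAAT on the forward strand and its reverse complement, ATTA, which corresponds to TAAT on the opposite strand. The function name `find_taat_cores` is hypothetical, and a core match alone is assumed to be only a starting point, since flanking bases also influence homeodomain binding specificity.

```python
import re


def find_taat_cores(seq):
    """Return 0-based start positions of TAAT cores on each strand.

    'forward' lists TAAT matches in the given sequence; 'reverse' lists
    ATTA matches, which are TAAT read on the complementary strand.
    Lookahead patterns are used so that overlapping matches are reported.
    """
    seq = seq.upper()
    forward = [m.start() for m in re.finditer(r"(?=TAAT)", seq)]
    reverse = [m.start() for m in re.finditer(r"(?=ATTA)", seq)]
    return {"forward": forward, "reverse": reverse}


cores = find_taat_cores("GGTAATTACC")
```

In this example the palindrome TAATTA contains one core on each strand, the arrangement that would be expected for motifs bound by a pair of homeodomain proteins.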
HD transcription factors have been previously implicated in the regulation of neuropeptide gene expression, and conserved HD binding sites have been identified in the regulatory regions of other hypothalamic neuropeptide genes, including Pomc [23], Orexin [24], and Gonadotropin Releasing Hormone (GnRH) [25]. In particular, the transcriptional regulation of GnRH has been extensively studied, due in part to an immortalized cell line which has retained important characteristics of endogenous GnRH neurons. Multiple HD transcription factors cooperatively bind to the sequences upstream of GnRH, regulating not only cell type specific expression but also dynamic, physiologic responses by serving as scaffolds for additional cofactors [26,27]. The studies of GnRH regulation offer an appealing framework for extending the model of Agrp regulation proposed by Kitamura and colleagues to integrate observations regarding input from other signals and the organization of conserved sequences identified in this analysis.
Within the 760 bp region upstream of Agrp defined by mouse/human sequence conservation, the IREs and STAT binding sites comprise the majority of non-coding sequence conserved among mammals. Only two other non-coding sequences, which are both ~15 bp in length, meet the threshold for constraint set by GERP (Table 1). One of these overlaps with a putative TATA box and likely binds the core transcriptional apparatus. The second, which is located between the putative TATA box and the proximal IRE, barely meets the significance threshold and does not contain obvious candidate binding sites.
In contrast to the STAT binding sites, the putative FoxO1 binding sites identified by Kitamura and colleagues, which are located between the IREs and STAT binding sites, are not well conserved (Figure 2B). Nevertheless, injection of an adenovirus encoding a constitutively active FoxO1 in mice increases Agrp expression and stimulates food intake, and chromatin immunoprecipitation studies confirm that FoxO1 binds to this region. These observations suggest that (1) FoxO1 is capable of recognizing non-canonical FoxO1 binding sites in some species, (2) FoxO1 regulates Agrp expression only in a subset of mammalian species, or (3) FoxO1 associates with Agrp upstream regions through a mechanism that does not involve a direct interaction with the putative FoxO1 binding sites. Notably, direct interactions between forkhead and HD transcription factors have been previously described and implicated as a general mechanism for post-developmental regulation of gene expression [28,29].
Like AGRP, NPY is a hypothalamic neuropeptide that stimulates food intake when administered in the CNS [30]. Agrp and Npy are expressed by the same population of neurons in the arcuate nucleus of the hypothalamus and their expression covaries in response to a number of stimuli [6], suggesting that similar cis-regulatory modules may control Agrp and Npy expression. In neuronal cell lines, HD transcription factors bind to TAAT motifs located within a regulatory region identified upstream of Npy [31]. Comparing the organization of conserved non-coding sequences at these two loci might reveal both common and disparate mechanisms of transcriptional regulation. Given the functional similarity between Agrp and Npy, the extent to which their regulatory mechanisms overlap may define distinct roles for these neuropeptides in energy homeostasis.
From a medical standpoint, the significance of understanding Agrp transcriptional regulation stems from the observation that common obesity is characterized by a blunted central response to peripheral energy signals, a phenomenon referred to as leptin resistance [32]. The mechanisms underlying leptin resistance remain poorly understood, but clearly result in the dysregulation of neuropeptide genes, including Agrp [33]. The transcriptional regulation of Agrp is likely to involve the integration of information from several inputs. The nature of conserved sequences upstream of Agrp suggests a model whereby homeodomain proteins participate in the process of integrating these signals. The IREs identified in this analysis provide genomic targets for evaluating the role of specific transcription factors in the regulation of Agrp expression.

SUPPORTING INFORMATION
Text S1 A FASTA file containing 6144 bp of genomic sequence corresponding to the chicken Agrp locus.