Diversification of the insulin-like growth factor 1 gene in mammals

Insulin-like growth factor 1 (IGF1), a small, secreted peptide growth factor, is involved in a variety of physiological and patho-physiological processes, including somatic growth, tissue repair, and metabolism of carbohydrates, proteins, and lipids. IGF1 gene expression appears to be controlled by several different signaling cascades in the few species in which it has been evaluated, with growth hormone playing a major role by activating a pathway involving the Stat5b transcription factor. Here, genes encoding IGF1 have been evaluated in 25 different mammalian species representing 15 different orders and ranging over ~180 million years of evolutionary diversification. Parts of the IGF1 gene have been fairly well conserved. Like rat Igf1 and human IGF1, 21 of 23 other genes are composed of 6 exons and 5 introns, and all 23 also contain recognizable tandem promoters, each with a unique leader exon. Exon and intron lengths are similar in most species, and DNA sequence conservation is moderately high in orthologous exons and proximal promoter regions. In contrast, putative growth hormone-activated Stat5b-binding enhancers found in analogous locations in rodent Igf1 and in human IGF1 loci, have undergone substantial variation in other mammals, and a processed retro-transposed IGF1 pseudogene is found in the sloth locus, but not in other mammalian genomes. Taken together, the fairly high level of organizational and nucleotide sequence similarity in the IGF1 gene among these 25 species supports the contention that some common regulatory pathways had existed prior to the beginning of mammalian speciation.


Introduction
Insulin-like growth factor 1 (IGF1) is a 70-residue, secreted protein that along with IGF2 and insulin comprises a conserved protein family found in most mammalian species and in many other vertebrates [1][2][3][4]. IGF1 plays a central role in pre- and post-natal growth in human children and in juveniles of other mammals as a key mediator of the actions of growth hormone (GH) [5][6][7][8][9], and also is involved in control of intermediary metabolism, in tissue repair, and in disease pathogenesis throughout life [10][11][12][13].
Limited analyses suggest that IGF1 genes are conserved among mammals [1,2,14]. Two gene promoters have been shown to control IGF1 gene expression in the few mammals in which it has been studied experimentally [2,15-20]. In these species, IGF1 genes are [...] from the translation of the many rat Igf1 mRNAs [2] (Fig 1C). Although these molecules differ in the NH2-terminal portions of their signal peptides, and in the COOH-terminal parts of their extension peptides or E domains, they all encode the identical 70-amino acid, biologically active, and secreted mature IGF1 protein [2,26].
Consistent with its physiological role in somatic growth, GH is a critical regulator of IGF1 gene expression in mammals [5,18,27-31]. GH, acting through its trans-membrane receptor and the intracellular tyrosine protein kinase, Jak2 [32][33][34][35], acutely activates the Stat5b transcription factor [36], which then binds to multiple transcriptional enhancers that are found in chromatin throughout the Igf1 locus, leading to stimulation of the two Igf1 promoters in rat liver [37,38], and presumably in other organs and tissues. In humans, a similar pathway has been identified in limited studies in cultured cells, although to date fewer GH-inducible and Stat5b-binding putative enhancers have been functionally detected in the human IGF1 locus than in the rat [39]. Since inactivating mutations in the human STAT5B gene have been characterized that are associated with growth failure and IGF1 deficiency [40][41][42] and genetic loss of Stat5b leads to growth deficiency in mice [43,44], it seems likely that this molecular pathway has been conserved between rodents and humans.
The present studies were initiated in order to understand the breadth and depth of both conservation and variation in IGF1/Igf1 genes in mammals as a means of gaining insight into key aspects of gene regulation as it has evolved during speciation. Using the information found within publicly available databases, IGF1 loci and genes have been analyzed in 25 mammalian species representing 15 orders and spanning ~180 million years (Myr) of evolutionary diversification. The results demonstrate substantial conservation in coding regions of exons and in overall exon, intron, and proximal gene promoter topology, leading to the idea that common paradigms governing IGF1 gene regulation were present at the onset of the mammalian radiation.

Genome database searches
Mammalian genomic databases were accessed using the Ensembl Genome Browser (www.ensembl.org) [21]. Searches were conducted using BlastN or BlastP under normal sensitivity, with rat Igf1 gene DNA segments (Rattus norvegicus, genome assembly Rnor_6.0), or protein sequences (from the National Center for Biotechnology Information Protein database) as initial queries, respectively. Additional searches were performed in Ensembl with DNA sequences from other mammalian species as queries to follow up and verify initial results. In addition, when relevant information could not be found in the Ensembl Genome Browser, both genomic DNA and gene expression data files were searched using the Sequence Read Archive of the National Center for Biotechnology Information (SRA NCBI; www.ncbi.nlm.nih.gov/sra/), and for human IGF1, RNA data from the Portal for the Genotype-Tissue Expression Project (GTEx V7, https://www.gtexportal.org/home/

Nomenclature and experimental strategy
Naming conventions adopted here include the term 'Igf1' for rodent genes and mRNAs, 'IGF1' for human, other primate, and all other mammalian genes and transcripts, and 'IGF1' for all proteins. A preliminary examination of mammalian IGF1/Igf1 genes and mRNAs within Ensembl revealed that most assignments were incomplete when compared with rat Igf1 or human IGF1, which limited the value of the data for comparative analyses. A major experimental goal was therefore to map all genes as thoroughly as possible. An iterative strategy was developed, in which homology searches were first conducted with segments of the rat Igf1 gene, followed by secondary searches using either components of human IGF1 or other genes that were evolutionarily more similar to specific target species, with a final follow-up using the resources of the SRA NCBI to identify IGF1 gene segments not detected in Ensembl. As shown below, the results revealed substantially higher levels of gene complexity than described in the data curated by Ensembl.
IGF1 genes in mammals
IGF1 appears to be a 6-exon, 5-intron gene in 23 of 25 mammalian genomes studied here (Table 1), and the presumptive overall structure resembles that of rat Igf1 or human IGF1 in the vast majority (Table 1). Exceptions include guinea pig, in which an exon 5 was not identified, and wallaby, in which no exon 3 was identified, and in which intron 5 was not found (the latter differences may reflect both poor quality DNA sequence data and the fact that in the wallaby genome the IGF1 locus has not yet been mapped to a single continuous DNA segment) (Table 1). There was reasonable congruence in the lengths of different gene components among the 23 species recognized to have 6 exons and 5 introns (Table 1), and total gene sizes ranged from 71,136 bp in the microbat to 119,762 bp in the opossum (Table 1). For 20 of these genes, lengths were within ±10% of rat Igf1 at 79,281 bp. Nearly all of the variation in the outliers (megabat, microbat, opossum, Tasmanian devil) could be attributed to introns 5 and/or 3, the largest IGF1 introns, which were measured at ~60% and ~33% longer, respectively, than the mean in opossum and Tasmanian devil, and were ~20% shorter in megabat and microbat (Table 1).
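The ±10% comparison with rat Igf1 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it uses the three gene lengths quoted in the text, and the function names and the 10% default are assumptions introduced here, not part of the original analysis.

```python
# Rat Igf1 gene length as reported in the text (bp).
RAT_IGF1_BP = 79_281


def percent_difference(length_bp: int, reference_bp: int = RAT_IGF1_BP) -> float:
    """Signed difference from the reference length, as a percentage."""
    return 100.0 * (length_bp - reference_bp) / reference_bp


def within_tolerance(length_bp: int,
                     reference_bp: int = RAT_IGF1_BP,
                     tolerance_pct: float = 10.0) -> bool:
    """True if the gene length lies within +/- tolerance_pct of the reference."""
    return abs(percent_difference(length_bp, reference_bp)) <= tolerance_pct


# Only the two extreme gene sizes quoted in the text are used here.
gene_lengths_bp = {"microbat": 71_136, "opossum": 119_762}

for species, length in gene_lengths_bp.items():
    diff = percent_difference(length)
    status = "within" if within_tolerance(length) else "outside"
    print(f"{species}: {diff:+.1f}% vs rat ({status} the 10% window)")
```

Both extremes fall outside the window, in line with microbat and opossum being listed among the outliers.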
DNA conservation was generally extensive for exons 1-4 among the species studied, with overall nucleotide sequence identities with different rat exons ranging from a high of 94-98% for mouse, to a low of 82-92% for platypus (Table 2). The lengths of these exons also were conserved among the genomes analyzed (Table 1). In contrast, exons 5 and 6 were more variable, with exon 5 being of different lengths because of its involvement in alternative splicing in most of the 25 species analyzed, as illustrated for rat Igf1 (Fig 1B). The total length of exon 6 was comparable in 22 of 25 species based on mapping of locations of DNA homology with rat Igf1 or human IGF1 (Table 1; exceptions are megabat, platypus, wallaby), although the overall extent of nucleotide sequence identity with the corresponding rat exon was fairly limited (Table 2). DNA similarity was nearly as high for both Igf1 promoters as for exons 1-4, ranging from 85-92% for promoter 1 and 75-98% for promoter 2, although as observed for exons 5 and 6, the length of sequence homology was variable among different species (Table 3), perhaps indicating that diversification of proximal promoter regulatory elements has occurred during mammalian speciation (but see below).
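For readers unfamiliar with how a nucleotide identity percentage is derived from an alignment, the following minimal sketch may help. It assumes two pre-aligned sequences of equal length and skips gapped columns, which is one common convention and not necessarily the one applied by the BLAST searches used here. The sequences are hypothetical and are not taken from any IGF1 gene.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns containing a gap are not scored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical 10-nt aligned segments differing at a single position.
reference_segment = "ACGTACGTAC"
query_segment = "ACGTTCGTAC"
print(f"{percent_identity(reference_segment, query_segment):.1f}% identity")
```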

Conservation in IGF1 protein sequences among mammals
The 70-residue secreted IGF1 molecule is encoded within six different types of protein precursors (Fig 1C) that differ at their NH2- and COOH-termini through a combination of mechanisms that produce many classes of Igf1 mRNAs in the rat (Fig 1B). Mature IGF1 in turn is divided into four domains, termed B, C, A, and D, with the first three being analogous to the B and A chains of insulin, and the C-peptide of pro-insulin (Fig 2) [45]. Among the mammals studied, 70-amino acid IGF1 was not identical to the rat protein in any species (Table 4), although a single conservative substitution was seen in the mouse (Ser 69 to Ala, Fig 2). In all other mammals, at least three differences with rat IGF1 were found, with the most prevalent alterations being Pro 20 to Asp, Ile 35 to Ser, and Thr 67 to Ala (Fig 2). IGF1 was identical in 12 species: squirrel, guinea pig, cow, dolphin, pig, human, chimpanzee, gibbon, orangutan, cat, dog, and megabat; moreover, rabbit, sheep, tree shrew, microbat, and armadillo IGF1 each varied by a single residue from this group (Fig 2). The only other mammals with more divergent IGF1 molecules were the three marsupials (opossum, Tasmanian devil, wallaby) and the one monotreme (platypus), whose genome sequences predicted mature IGF1 proteins with eight differences from rat IGF1, and five from other species, although wallaby IGF1 was incomplete (Fig 2). In rodents and in humans, there are two different IGF1 signal peptides. In both species a common COOH-terminal 27-residue segment is encoded by exon 3, and unique NH2-terminal fragments by exon 1 (21 amino acids) or exon 2 (5 amino acids, Fig 1C). In all 25 mammals, a 47-49 residue signal peptide derived from exons 1 and 3 could be identified, and was fairly well conserved, although in all species except mouse (1 difference), there were at least 4 substitutions compared with rat (Fig 3A, left panel, and 3B).
The most divergent signal peptides were found in platypus and in the marsupials, in which either 12 or 13 changes were noted from rat, although DNA data for the COOH-terminal part of the wallaby signal peptide were not available in any database (Fig 3). It is unknown how these alterations might affect any aspect of signal peptide function. The NH2-terminal part of the signal peptide encoded by exon 2 consists of 5 amino acids in the 18 mammals in which it was detected (Table 4), but only in guinea pig was the predicted sequence identical to the rat. In the other 5 species with an identified exon 2, no open reading frame was found. However, even without an in-frame methionine codon in exon 2 in rabbit, squirrel, opossum, platypus, and Tasmanian devil, the same open reading frame would be maintained in IGF1/Igf1 transcripts containing exon 2 because of the presence of another methionine at the beginning of the common signal peptide encoded in exon 3 (Fig 2B).
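Substitutions such as "Pro 20 to Asp" come from a position-by-position comparison of two aligned protein sequences. The sketch below shows one way such a list could be generated. It assumes ungapped sequences of equal length, and the six-residue peptides are hypothetical placeholders rather than real IGF1 sequence.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def list_substitutions(reference: str, query: str) -> list[str]:
    """Describe each mismatch as '<ref residue> <position> to <query residue>'."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for position, (ref, qry) in enumerate(zip(reference.upper(), query.upper()), start=1):
        if ref != qry:
            changes.append(f"{THREE_LETTER[ref]} {position} to {THREE_LETTER[qry]}")
    return changes


# Hypothetical peptides; positions are 1-based, as in the text.
reference_peptide = "MKPIST"
query_peptide = "MKDSST"
print(list_substitutions(reference_peptide, query_peptide))
```

The length of the returned list gives the number of differences between the two sequences.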
At the COOH-terminal end of the IGF1 protein precursor, the common E region (16 amino acids) and EA peptide (19 residues) were similar to the same parts of rat IGF1, with only 2 to 4 substitutions being observed in 16 species (Fig 4A). However, no EA peptide could be detected in microbat or wallaby, and sequence divergence was greater than 20% in elephant, opossum, and Tasmanian devil (Fig 4A). The EB segment, which is encoded by exon 5, and the EC region, encoded by exons 5 plus 6, were less conserved than other parts of IGF1 precursor proteins among the different mammals evaluated (Figs 4B and 5). The EB peptide ranged from 39 to 83 amino acids in length (Fig 4B), and EC from 25 to 56 residues (Fig 5). In all species studied, both segments are highly enriched in basic amino acids (Figs 4B and 5). Although the reasons for the extensive diversification of these regions compared with other parts of the predicted IGF1 precursor [2,26] are unknown, it has been postulated that this variability may be secondary to insertion of a transposon from the mammalian interspersed repetitive-b family into the genome at the IGF1 exon 5 site of a common mammalian ancestor [46]. Of note, there was no EB segment detected in guinea pig or tree shrew, and no EC region in rabbit, guinea pig, or wallaby.
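Enrichment in basic amino acids can be expressed as the fraction of basic residues in a peptide. The following is a minimal sketch under stated assumptions: lysine, arginine, and histidine are counted as basic (whether to include histidine is a matter of convention), and the example peptide is hypothetical, not an actual E peptide sequence.

```python
# Assumption: K, R and H are treated as basic; pass a different set to change this.
DEFAULT_BASIC = frozenset("KRH")


def basic_fraction(peptide: str, basic_residues: frozenset = DEFAULT_BASIC) -> float:
    """Fraction of residues in the peptide that belong to basic_residues."""
    residues = peptide.upper()
    if not residues:
        raise ValueError("peptide must not be empty")
    return sum(residue in basic_residues for residue in residues) / len(residues)


# Hypothetical 8-residue peptide containing five basic residues.
example_peptide = "RKKRSAGH"
print(f"{100 * basic_fraction(example_peptide):.1f}% basic residues")
```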

Insights into IGF1 gene regulation
Limited studies during the past three decades have identified parts of human IGF1 and rat Igf1 gene promoters that are important for their basal activity [17,47-50], and also have characterized portions of promoter 1 that could mediate actions of the hepatic-enriched transcription factors, HNF-1, HNF-3, and C/EBPα and β on IGF1 gene transcription in the liver [51][52][53].
Other DNA elements also have been mapped in human IGF1 and rat Igf1 promoter 1 that have been found to serve as response elements for hormones that activate cAMP through the transcription factor, C/EBPδ [54,55]. Presented in Table 3 and depicted in Fig 6 are results of analyses comparing the two rat Igf1 promoters with their orthologous regions in 24 other mammalian species. In nearly all species, nucleotide conservation was high in the most proximal parts of each promoter, and overall DNA sequence identity ranged from 85% to 92%, depending on the specific genome (Table 3). Moreover, some of the DNA elements described above that have been mapped in proximal human IGF1 or rat Igf1 promoter 1 or in the noncoding region of exon 1 were highly conserved in the other species. Of particular note are binding sites for C/EBPδ and for HNF-3 located in distal exon 1 and found in nearly all species (Fig 6). This contrasts with the more 5' HNF-1 site in promoter 1 that was detected only in rat, human, and other primates (Fig 6).
The most important physiological activator of IGF1/Igf1 is GH [56]. GH stimulates Igf1 gene transcription in rats via interactions of up to seven inducible Stat5b binding elements that are located throughout the locus, being found in far distal 5' flanking DNA and in introns,  but not near either Igf1 promoter [37,38]. These elements have been shown in vivo in rat liver to bind Stat5b and several transcriptional co-factors, including p300, RNA polymerase II, and the mediator complex, to undergo reversible histone modifications [37], and at least in cell culture experiments, to physically interact with Igf1 promoters in a GH-regulated way [57]. These elements thus appear to be bona fide transcriptional enhancers [58,59]. Five of these segments also have been shown to be conserved and to be present in analogous regions in human IGF1 [39], and in several other non-human primates [14].
The same seven elements identified near the rat Igf1 gene could be variably detected in the genomes of other mammals, and tended to be located within respective IGF1 loci at genome coordinates analogous to those mapped for rat Igf1 (Tables 5 and 6, Figs 7 and 8). Comparison of the DNA sequences of these elements revealed varying levels of similarity with the corresponding rat regions, ranging from 84% to 96% identity in all seven segments in mouse, including full conservation of the 9-nucleotide pair canonical Stat5b binding sites, and near identity in DNA spacing between paired elements, to low-level DNA sequence similarity in just a single segment (homologue of rat [R] 8-9) in opossum and Tasmanian devil, to no elements in platypus (Tables 5 and 6 and Fig 8). Except for platypus, all other mammals had at [...] detected in the equivalent of R13 in eleven species, or in the homologue of R53 in rabbit, orangutan, and armadillo. In addition, in the elephant homologue of R8-9, a single nucleotide modification has occurred within the more 5' Stat5b binding sequence, changing it from 5'-TTC TTA GAA-3' to 5'-TTC TTA GTA-3' (Table 5, Fig 7), and presumably rendering it incapable of binding Stat5b [60][61][62]. A similar inactivating change was found within the 3' Stat5b 9-base pair homologue of R60-61 in the cat [5'-TTC ACA GAC-3' (Table 5)]. Also, in the armadillo equivalents of R34-35 and R58-59, in the pig homologue of R34-35, and in the guinea pig equivalent of R60-61, the more 3' Stat5b site was not detected (Table 5). Other single, double, or triple nucleotide modifications were found, particularly in cow (R13), in elephant (R53), in fourteen species in R58-59, and in sixteen species in R60-61 (Table 5), but these all are observed in authentic Stat5b binding elements [60-62].
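The effect of the elephant and cat substitutions can be illustrated by testing each 9-base pair sequence against a consensus pattern. The sketch below assumes the commonly cited Stat5 consensus 5'-TTCNNNGAA-3'; this pattern is not stated in the text, and a strict match is a simplification, since some variant sites are reported above to occur in authentic Stat5b binding elements. Function names are illustrative.

```python
import re

# Assumed consensus: TTC, any three nucleotides, GAA (9 bp in total).
STAT5B_CONSENSUS = re.compile(r"TTC[ACGT]{3}GAA")
# Lookahead form so that overlapping sites are also reported during scanning.
STAT5B_SCAN = re.compile(r"(?=(TTC[ACGT]{3}GAA))")


def is_consensus_site(site: str) -> bool:
    """True if a 9-bp site (spaces allowed) matches the assumed consensus."""
    cleaned = site.upper().replace(" ", "")
    return STAT5B_CONSENSUS.fullmatch(cleaned) is not None


def find_consensus_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, site) for every consensus match in a sequence."""
    cleaned = sequence.upper().replace(" ", "")
    return [(m.start(), m.group(1)) for m in STAT5B_SCAN.finditer(cleaned)]


# The three 9-bp sequences quoted in the text.
for label, site in [
    ("rat R8-9, 5' site", "TTC TTA GAA"),
    ("elephant R8-9, 5' site", "TTC TTA GTA"),
    ("cat R60-61, 3' site", "TTC ACA GAC"),
]:
    print(f"{label}: {site} -> consensus match: {is_consensus_site(site)}")
```

Under this assumption the rat site matches while the elephant and cat variants do not, consistent with the interpretation that these changes are likely to be inactivating.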
To date the mechanisms responsible for the patterns of alternative splicing of IGF1/Igf1 transcripts in different cell types are unknown. Examination of human IGF1 mRNAs in GTEx has revealed marked variation in steady-state levels of transcripts containing just exon 5 (40 to 65%), exon 6 (40 to 55%), or both exons 5 and 6 (2 to 10% of all mRNAs) in the 37 different organs and tissues in the database. Variation is also observed in the fraction of mRNAs containing these different exon combinations in mouse and rat tissues in the SRA NCBI.
The sloth IGF1 locus contains a processed pseudogene
Initial screening of the sloth genome with rat Igf1 revealed two sets of DNA sequences, each with similar levels of identity with rat exons 3 and 4 (87% and 86%, respectively). Two of these segments mapped ~50 kb apart in the sloth genome (Table 1), and the other two were adjacent to one another, and were located immediately 5' to the beginning of sloth IGF1 exon 6 (Fig 9A). Further analysis revealed that contiguous with and 5' to the alternative exon 3 were 52 base pairs that were identical with the 3' part of sloth IGF1 exon 2, and included the 5 codon open reading frame. Collectively, this 391-nucleotide pair genomic segment was over 98% identical to sloth IGF1 exons 2-4. Conceptual translation of an mRNA predicted from these DNA sequences revealed marked similarity with the sloth IGF1 protein precursor, with only four mismatches in the common signal peptide, one in mature 70-residue IGF1, and none in the common E peptide, for an overall amino acid identity of nearly 96% with the authentic sloth IGF1 precursor (Fig 9B). It seems likely that this represents a processed mRNA that was retrotransposed as a DNA copy back into the sloth IGF1 locus [63]. Alternatively, it is possible that these results reflect some inaccuracies within the incomplete sloth genome. Since this putative pseudogene maps to the 5' end of sloth IGF1 exon 6, it is possible that the postulated mRNA from which this DNA segment derives also included exon 6, although against this argument it should be noted that there is no duplication of authentic sloth IGF1 exon 6 in the genome. IGF1 pseudogenes have not been recognized in mammals to date, yet there is a precedent with the gene for the related peptide insulin in mice and rats. The Ins1 gene appears to have been derived from Ins2 by an analogous mechanism, although in this case, one of two introns was

Discussion
Public biological databases represent rich sources of information about genes from various organisms, and in-depth analysis of these data can be the impetus for the development of new hypotheses on evolutionary aspects of gene structure, function, or regulation. This report focuses on the molecular genetics of IGF1, as seen through the lens of 25 mammalian species. These genomes were chosen because they represent 15 different orders and cover ~180 million years of evolutionary diversification, although a different cohort of the approximately 100 different mammalian species whose DNA sequences are available might have yielded an analogous data set. It is generally thought that IGF1, a 70-amino acid single-chain secreted protein, plays a central role in regulating somatic growth during childhood in humans and in juveniles of other mammals, and functions as both a mediator of the actions of GH [5][6][7][8][9], and as a readout for the environmental inputs that affect overall health [65]. IGF1 also is involved in tissue repair and in metabolic regulation throughout life [10][11][12][13]. In humans, rats, and mice, the protein precursor of mature IGF1 is derived from the translation of multiple classes of mRNAs that result from transcription from two distinct promoters using several different initiation sites, and alternative RNA splicing [2,19] (see Fig 1B). The genomic analyses presented here suggest that similar mechanisms are active in at least 20 other mammalian species from 11 additional orders. The genomes of these species all encode single-copy IGF1 genes that share structural features with human IGF1 and rat Igf1, namely two promoters, and six exons subdivided by five introns, including a central large intron of ~47 to ~80 kb separating exons 3 and 4 (Figs 7 and 8, Tables 1-3).
Nearly all of these IGF1 genes also appear capable of being transcribed and processed into many classes of IGF1/Igf1 mRNAs similar to those found in humans and rats [15,16,19,22-25], of being translated into the same types of IGF1 precursor proteins, and of being processed into a highly similar mature IGF1 peptide (Fig 2, Table 4). Regarding the other 2 species with apparently different IGF1 genes, in guinea pig, a homologue of exon 5 was not identified, and in wallaby, the DNA quality of the IGF1 locus was poor in all databases that were searched (Tables 1-3). Thus, it is likely that in the majority of mammals, IGF1 is a 2-promoter, 6-exon, and 5-intron gene. There also is fairly extensive DNA sequence identity in the proximal parts of the two IGF1 promoters among most of the 25 mammalian species evaluated here. This includes conservation of transcription factor binding sites for HNF-1, C/EBPα/β, HNF-3, and C/EBPδ in promoter 1 in most species (Fig 6). These data suggest that common regulatory mechanisms may control some aspects of IGF1 gene expression in the majority of mammals.
[Figure legend, partial: ... Tables 5 and 6). The percentage of nucleotide identity with different parts of rat Igf1 is indicated within each gene and locus (black for exons, red for putative Stat5b binding elements). Other abbreviations are as follows: R2-3, R8-9, R13, R34-35, R53-54, R57-59, R60-61-78-nomenclature for rat Stat5b sites (see [37]).]
A more surprising result of analysis of many mammalian IGF1 genes is the apparent divergence of putative GH-regulated Stat5b binding enhancer elements in IGF1 loci (Figs 7 and 8, Tables 5 and 6). Since GH plays a critical role in the molecular physiology of IGF1 in several mammalian species [20,31,36,37], and since in previous studies, conserved GH-inducible Stat5b-binding enhancer elements were identified in analogous locations in the IGF1/Igf1 loci of humans, several non-human primates, rats, and mice [14,37,39,66], it was assumed that similar DNA segments would be shared among other mammals. With results from additional species, it now appears that the number of recognizable putative GH-responsive Stat5b-binding elements varies considerably (Tables 5 and 6, Figs 7, 8 and 9). Since no potential Stat5b-binding domains were detected in the platypus IGF1 locus, and only a single element with a pair of Stat5b sites was found in opossum, Tasmanian devil, and wallaby (Tables 5 and 6, Fig 8), these results suggest that during mammalian speciation, particularly of marsupials and monotremes, either other mechanisms have arisen to govern GH-activated IGF1 gene transcription (e.g., Stat5b-binding sequences located elsewhere in the loci), or alternatively, GH does not stimulate IGF1 gene expression in these species. Since it is now possible to construct chimeric cell lines containing IGF1 loci from different species and test for GH-regulated transcription [39,57], these different hypotheses may be examined directly.
Ensembl, the SRA NCBI, and other publicly available genomic databases contain a wealth of information about different genes from a wide range of animal species, yet much of these data have not been analyzed fully or even characterized yet. For most of the IGF1 genes and loci examined here, the information found within Ensembl was either incompletely or incorrectly annotated, possibly because the analyses were limited in scope or were not reviewed by anyone with expertise in the molecular biology of this gene. Similarly, the data in the SRA have not been evaluated in any detail, and all human RNA samples in GTEx are derived from postmortem tissues and organs. It seems likely that situations similar to those seen with IGF1/Igf1 exist for other genes, raising the possibility that there are many opportunities to gain new insights into gene conservation or variation during mammalian and vertebrate evolution. For example, there may be other species besides the sloth in which an IGF1 RNA copy was retro-transposed as DNA back into the IGF1 locus (Fig 9) or even elsewhere in the genome.
As described in detail in humans [67], it is likely that most other mammalian genomes contain several million DNA sequence polymorphisms, and that at least some of these modifications have the potential to alter gene expression based on their locations within enhancers, promoters, or other regulatory components [68]. For example, single nucleotide polymorphisms (SNPs) have been identified in human IGF1 promoter 1 [69], in at least three Stat5b binding elements [39], and in other parts of the locus (see the database on human variation in GTEx), although to date none of these changes have been studied to determine if they alter IGF1 gene expression or regulation. Similar data on genomic variability have not been mapped for other mammals, primarily because of the small number of genomes that have been sequenced. It seems likely that analogous differences will be found, including SNPs, copy-number variations, and DNA insertion-deletions that may play roles in population fitness or other adaptations to changing environments. It also seems plausible that certain polymorphic variants may be present in several closely related mammalian species, giving rise to the hypothesis that they have contributed to organismal fitness of common ancestors.
The complicated role of IGF1 in both normal physiology and in disease is potentially reflected in its complex structure and patterns of expression. The fairly high conservation of the IGF1 gene and protein among the species studied here supports the idea that analogous transcriptional and other regulatory pathways have been present since the onset of mammalian speciation [70,71], ideas that now can be tested in experimental systems with the expectation that they will lead to new insights into the comparative biology of IGF1 and GH signaling or action.