A Preliminary Analysis of the Immunoglobulin Genes in the African Elephant (Loxodonta africana)

The genomic organization of the IgH (Immunoglobulin heavy chain), Igκ (Immunoglobulin kappa chain), and Igλ (Immunoglobulin lambda chain) loci in the African elephant (Loxodonta africana) was annotated using available genome data. The elephant IgH locus on scaffold 57 spans over 2,974 kb, and consists of at least 112 VH gene segments, 87 DH gene segments (the largest number in mammals examined so far), six JH gene segments, a single μ, a δ remnant, and eight γ genes (α and ε genes are missing, most likely due to sequence gaps). The Igκ locus, found on three scaffolds (202, 50 and 86), contains a total of 153 Vκ gene segments, three Jκ segments, and a single Cκ gene. Two different transcriptional orientations were determined for these Vκ gene segments. In contrast, the Igλ locus on scaffold 68 includes 15 Vλ gene segments, all with the same transcriptional polarity as the downstream Jλ-Cλ cluster. These data suggest that the elephant immunoglobulin gene repertoire is highly diverse and complex. Our results provide insights into the immunoglobulin genes in a placental mammal that is evolutionarily distant from humans, mice, and domestic animals.


Introduction
The elephant is the largest terrestrial placental mammal alive today. It belongs to the order Proboscidea and the family Elephantidae, which contains only two extant species: the Asian elephant (Elephas maximus) and the African elephant (Loxodonta africana). The three lineages of this family, Loxodonta, Elephas, and Mammuthus, are thought to have originated 4-6 million years ago. Whereas some species of the first two lineages are still alive today, the last representative of the Mammuthus lineage, the woolly mammoth (Mammuthus primigenius), became extinct very recently (about 3.7 thousand years ago) [1]. Phylogenetic analyses suggest that the elephant is most closely related to living mammals of Trichechus (such as the West Indian Manatee, Trichechus manatus) and Procavia (such as the Rock Hyrax, Procavia capensis) [2].
Immunoglobulins are the antigen-recognition molecules of B cells of jawed vertebrates, and usually consist of two identical heavy (H) and two identical light (L) chains. In some exceptional cases, such as shark IgNAR and selected subclasses of camelid IgGs, only heavy chains are used [22][23][24]. Variable regions at the N-terminus of the H/L chains are encoded by VH/VL, DH, and JH/JL genes and determine the antigen binding site and antibody specificity. In contrast, constant regions at the C-terminus of the H/L chains are encoded by IGHC/Cκ or Cλ genes and are responsible for the immunoglobulin classes and functional activities [25,26].
In the mammals studied so far, the single immunoglobulin heavy chain locus and the λ and κ light chain loci are commonly organized in a "translocon" pattern [27,28]. In the heavy chain locus, multiple VH, DH, and JH gene segments are followed by consecutive μ, δ, γ, ε, and α genes [29]. In the λ light chain locus, a cluster of Vλ gene segments is followed by multiple sets of clustered Jλ gene segments, each linked to a single Cλ gene. In the κ light chain locus, by contrast, the cluster of Vκ gene segments is followed by a cluster of Jκ gene segments, and then by a single Cκ gene [30].

The elephant genome sequence
The genome sequence of the African elephant (Loxodonta africana), produced by the Broad Institute by whole-genome shotgun sequencing, can be obtained from the Ensembl database (http://www.ensembl.org). The assembly used, loxAfr3, was sequenced to 7× coverage (July 2009). The elephant immunoglobulin gene sequences were retrieved from the UCSC genome browser (http://genome.ucsc.edu/).

Identification of the elephant Ig genes
Human immunoglobulin gene sequences were used as queries to search for the elephant genome scaffolds that contained immunoglobulin genes. A conventional TBLASTN approach was used to identify constant region genes of the elephant immunoglobulins. FUZZNUC, an online tool (http://embossgui.sourceforge.net/demo/fuzznuc.html), was used to find adjacent recombination signal sequences (RSSs) for identification of variable, diversity, and joining gene segments. Five or more mismatched bases were allowed to cover all genes. The locations of the annotated elephant gene sequences on the elephant genome are shown in Table S1 (S1-1 to S1-5).
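The RSS search described above can be illustrated with a minimal Python sketch. It is not the FUZZNUC implementation and not necessarily the exact patterns used in this study: it assumes the canonical consensus heptamer (CACAGTG) and nonamer (ACAAAAACC), scans only the forward strand, and the function names, the default spacer length, and the mismatch threshold are hypothetical choices made for illustration.

```python
# Illustrative sketch only: approximate search for RSS-like sites.
# Consensus motifs are the canonical ones; thresholds are assumptions.
HEPTAMER = "CACAGTG"    # canonical RSS heptamer consensus
NONAMER = "ACAAAAACC"   # canonical RSS nonamer consensus


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_rss(seq, spacer=23, max_mismatch=5):
    """Return (start, heptamer, nonamer, n_mismatch) for each RSS-like site.

    A site is a heptamer, then `spacer` arbitrary bases, then a nonamer,
    with at most `max_mismatch` total differences from the two consensus
    motifs. Only the forward strand is scanned.
    """
    seq = seq.upper()
    site_len = len(HEPTAMER) + spacer + len(NONAMER)
    hits = []
    for i in range(len(seq) - site_len + 1):
        hep = seq[i:i + len(HEPTAMER)]
        non_start = i + len(HEPTAMER) + spacer
        non = seq[non_start:non_start + len(NONAMER)]
        n = mismatches(hep, HEPTAMER) + mismatches(non, NONAMER)
        if n <= max_mismatch:
            hits.append((i, hep, non, n))
    return hits


# Toy sequence with one perfect 12-bp-spacer RSS starting at index 3.
toy = "TTT" + HEPTAMER + "N" * 12 + NONAMER + "GG"
perfect_hits = find_rss(toy, spacer=12, max_mismatch=0)
```

Sites in the opposite orientation (nonamer, spacer, heptamer) could be found by running the same scan on the reverse complement. A permissive mismatch threshold returns many spurious sites, so candidate hits would still need to be checked against the adjacent coding sequence.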

Sequence alignments
Sequences were edited and compared using the DNAstar program. Multiple sequences were aligned using the Clustal W algorithm in Clustal X software, and the alignments were exported from BioEdit with conservation displayed by plotting residues identical to a reference sequence as dots.

Dot matrix analysis
A dot matrix analysis (window size 30 bp and mismatch limit 9 bp) was used for comparing two sequences to identify a possible alignment of characters between the sequences.
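As a rough illustration of the windowed comparison described above, the following Python sketch marks a dot wherever two windows differ at no more than a given number of positions. It is a naive implementation written for clarity, not the program used in this study; the function name and return type are hypothetical, and the defaults simply mirror the stated window size (30 bp) and mismatch limit (9 bp).

```python
def dot_matrix(seq_a, seq_b, window=30, max_mismatch=9):
    """Return the set of (i, j) positions where the windows starting at
    seq_a[i] and seq_b[j] differ at no more than `max_mismatch` sites."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            win_b = seq_b[j:j + window]
            diff = sum(x != y for x, y in zip(win_a, win_b))
            if diff <= max_mismatch:
                dots.add((i, j))
    return dots


# Self-comparison of a tandem pentameric repeat: dots fall on parallel
# diagonals spaced one repeat unit (5 bp) apart.
repeat = "GAGCT" * 8
self_dots = dot_matrix(repeat, repeat, window=10, max_mismatch=0)
```

When a sequence is compared against itself, tandem repeats appear as a block of parallel diagonals. The running time grows with the product of the two sequence lengths and the window size, so this form is only practical for short regions.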

Phylogenetic analysis
Phylogenetic studies were carried out using MrBayes3.1 and viewed with the TreeView package. All the trees were obtained with 1 million generations for the chains, a sample frequency of 100, and a burn-in of 2,500 (ngen = 1000000; samplefreq = 100; burnin = 2500). The site-by-site rate variation was set to a gamma distribution (rates = gamma) for all the Bayesian trees, and a General Time-Reversible (GTR) (nst = 6) model of substitution was chosen. The sequences from other species used in phylogenetic analyses are presented in Table S2 (S2-1 and S2-2).
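The settings listed above map onto a short MrBayes command block. The Python sketch below only assembles such a block as text; it assumes that the burn-in value was passed to the sumt command and that all other options were left at their defaults, neither of which is stated in the text. The function names are hypothetical.

```python
def mrbayes_block(ngen=1_000_000, samplefreq=100, burnin=2_500):
    """Build a MrBayes command block for a GTR + gamma analysis.

    Assumption: the burn-in is applied when summarising trees (sumt).
    """
    lines = [
        "begin mrbayes;",
        "  lset nst=6 rates=gamma;",
        f"  mcmc ngen={ngen} samplefreq={samplefreq};",
        f"  sumt burnin={burnin};",
        "end;",
    ]
    return "\n".join(lines)


def burnin_fraction(ngen, samplefreq, burnin):
    """Approximate share of sampled trees discarded as burn-in
    (the initial sample is ignored for simplicity)."""
    return burnin / (ngen / samplefreq)


block = mrbayes_block()
fraction = burnin_fraction(1_000_000, 100, 2_500)
```

With these values, roughly 10,000 trees are sampled per run and about a quarter of them are discarded as burn-in.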

Definition of the VH/VL gene families
In mammals, germline VH and VL gene segments can be grouped into families based on their nucleotide sequence similarity [49]. The established criteria are that members of the same family share more than 80% nucleotide similarity, those with less than 70% similarity are put into different families, and those possessing between 70% and 80% similarity are inspected on a case-by-case basis [50]. In our analysis, we placed VH and VL segments having similarity greater than 70% into the same family.
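The 70% rule can be sketched in Python as follows. The text does not say how borderline or chained cases were resolved, so the single-linkage grouping used here is an assumption, as are the function names and the requirement that the input sequences are already aligned to equal length.

```python
def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length
    sequences; columns where both sequences have a gap are ignored."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def group_families(aligned, threshold=70.0):
    """Single-linkage grouping (an assumption): two segments share a
    family when a chain of pairwise identities above `threshold`
    connects them. `aligned` maps segment name -> aligned sequence."""
    names = list(aligned)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if percent_identity(aligned[a], aligned[b]) > threshold:
                parent[find(a)] = find(b)

    families = {}
    for n in names:
        families.setdefault(find(n), []).append(n)
    return sorted(families.values())


# Toy aligned sequences with hypothetical names.
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "TTTTTTTTTT"}
toy_families = group_families(toy)
```

Segments that fall below the threshold against every other segment end up in a family of their own, which corresponds to the unassigned segments reported in the Results.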

Elephant immunoglobulin heavy chain genes
IgH locus. The public elephant genome assembly used in this study was loxAfr3, which is an assembly of the genome of the African elephant (Loxodonta africana), sequenced to 7× coverage. The high genome coverage of this assembly confers a high reliability on the gene analysis. BLAST searching localized the elephant IgH locus to genomic scaffold 57. It spans approximately 2,974 kb from the 5′-most VH segment (VH2-112p) to the 3′-most γ gene (Fig. 1). A single μ and eight γ genes were identified in this scaffold. Neither ε nor α genes could be found, most likely due to sequence gaps.
Constant region genes. Like other mammalian species, the elephant μ gene contains four CH and two transmembrane exons. A sequence comparison of μ genes among thirteen vertebrate species demonstrated that the critical amino acids for immunoglobulin folding, Cysteine (C) and Tryptophan (W) [51], were highly conserved in elephants (Figure S1). In addition, the elephant IgM constant region showed the highest amino acid sequence identity to human (63.8%), and the least to echidna (50.8%).
Most mammals also express a δ gene, which is always situated immediately downstream of μ, and the distance between μ and δ usually does not exceed 7 kb. A BLAST search against the elephant whole genome using both DNA and amino acid sequences of the δ genes of other mammalian species showed no intact δ gene. However, approximately 10 kb downstream of the elephant μ (no sequence gaps for 90 kb downstream), we identified a short fragment encoding a polypeptide (Figure S2) homologous to the IgD CH3 domain of other mammals. This was done by a thorough examination of amino acid sequences encoded by the DNA sequences between μ and γ1 (based on translation of all reading frames of both sense and anti-sense sequences). An alignment of the elephant IgD remnant and the IgD CH3 domains of several mammalian species is presented in Figure S2. This indicates that the gene has been highly mutated and pseudogenized in the elephant.
In addition to the eight γ genes (γ1 to γ8) in scaffold 57 (Fig. 1), an additional γ gene (tentatively named γ9) was identified in scaffold 495 (data not shown), which spans 77 kb. Scaffold 495 is not assembled together with scaffold 57; therefore, γ9 could potentially be either an additional subclass-encoding gene or an allelic variant. The identification of multiple IgG subclass-encoding genes is in accordance with a previous report, which indicated that there were at least five subclasses of IgG in African elephant sera [20]. Sequence analysis showed no additional Ig genes in genomic scaffold 495, except for the γ9 gene. The greatest variation among mammalian IgG subclasses is usually concentrated in their hinge regions [52][53][54]. However, because no elephant IgG cDNA sequences have been reported, it is very difficult to accurately assess the hinge regions of the elephant IgG heavy chains. The hinge region is usually encoded on a separate exon that could not be identified in the elephant due to the low level of conservation and the absence of cDNA sequences. An amino acid alignment of the nine elephant IgG subclasses is presented in Fig. 2. The first exons (CH1) of γ1 and γ2 are both missing because of gaps. The CH3 exon of γ3 is pseudogenized because of a premature stop codon (marked with a star in Fig. 2) and a frameshift mutation (marked with shadowing in Fig. 2) caused by nucleotide (adenine) insertions at positions 148 and 158, respectively. To clarify the relationship among γ chains from mammalian species, a phylogenetic tree of IgG CH2 and CH3 exons was constructed and is shown in Figure S3. The elephant γ genes form a distinct cluster. This is consistent with a previous analysis, which showed that the divergence of IgG subclasses occurred after speciation [52].
Dot matrix analysis of the elephant IgH locus showed that there are switch regions upstream of the μ gene and six γ genes (γ1, γ4, γ5, γ6, γ7, and γ8), as in humans and mice [55,56]. The switch regions of γ2, γ3, and γ9 could not be identified, most likely due to sequence gaps. Structurally, the switch regions, as in other species, are all composed of pentameric repeats (GGGCT and GAGCT).
The elephant Sμ region shows substantial nucleotide similarity with those of human, mouse, and pig (Fig. 3). The six elephant Sγ regions are similar to each other, but share little sequence similarity with human and mouse Sγ (Fig. 4 and data not shown).
VH gene segments. A total of 112 VH segments were identified in the elephant IgH locus. Fifty-one of them appear to be potentially functional, because they have leader exons, normal open reading frames (ORFs), downstream RSSs, and V gene domains (framework regions (FRs) and complementarity-determining regions (CDRs)). The remaining 61 segments contain either in-frame stop codons or frameshifts, and are thus designated as pseudogenes. In addition, there are 17 partial segments of about 200 bp in length, which are regarded as truncated VH sequences. There are gaps larger than 10 kb in the elephant genome among the VH gene segments (Fig. 1), suggesting that there might be more VH segments. To examine the relationships among the elephant germline VH segments, pseudogenes as well as functional genes were used to construct a phylogenetic tree (Fig. 5). The seven identified VH gene families (1, 2, 3, 4, 5, 6, and 7) were confirmed to be homologous with the corresponding human VH gene families. The elephant VH4 family contains the most members (72 VH segments), and can be further divided into three groups (Fig. 5). We chose representative VH sequences from elephant and other mammals, covering almost all VH families identified, to construct phylogenetic trees (Fig. 6). The elephant VH genes clearly fall into the three previously known VH clans.
DH gene segments. In the elephant IgH locus, 87 DH segments were identified and are presented in Figure S4 (S4-1 to S4-10). It should be noted that there might be more DH segments because of the existence of sequence gaps. Except for DH76, which has a 10 bp spacer, all the DH segments are flanked by characteristic heptamers and nonamers separated by 12-bp spacers. The potential coding regions of DH segments are 10-37 bp in length (Figure S4, S4-1 to S4-10). It has been suggested that coding regions of DH segments of humans can be described by the characteristics of their amino acids [57]. Inspection showed that a great number of polar/hydrophobic amino acids or stop codons occur widely in elephant DH coding regions (data not shown). In humans and mice, the germline DH segments can be classified into families based on the extent of sequence similarity [58,59]. Analysis of nucleotide similarity in the coding regions and flanking RSSs indicated that the 87 elephant DH segments could be divided into seven families. Members within the same family share at least 70% nucleotide identity (data not shown), while some members in a family have completely identical sequences (these are shadowed in Fig. 7). We present the sequence alignment of the seven families in
JH gene segments. There were six germline JH gene segments found in the elephant IgH locus (Fig. 8). All the JH segments had conserved nucleotide sequences at the 3′ end. JH1 was pseudogenized by replacement of a Tryptophan (W) residue by a stop codon.

Elephant immunoglobulin light chains
κ chain. Immunoglobulin κ chain genes of the elephant were identified on three scaffolds: 202, 50, and 86. A schematic diagram is shown in Fig. 9. Of the 153 germline Vκ segments from the three scaffolds, 53 were regarded as potentially functional genes and 100 as pseudogenes. Based on sequence similarity analysis, 142 of the Vκ segments can be assigned to eight families (Vκ1 to Vκ8) (Table S3), which contain 2, 31, 2, 102, 1, 1, 2, and 1 members, respectively. The remaining 11 Vκ pseudogenes could not be assigned to any family because they share less than 70% nucleotide similarity with any other Vκ gene segment. A phylogenetic tree of the elephant Vκ functional genes is shown in Fig. 10. The six elephant Vκ families (Vκ1 to Vκ6) correspond to the six human Vκ gene families. In addition, scaffold 86 includes 24 Vκ segments showing the same transcriptional orientation as the Jκ and Cκ, and 18 Vκ segments showing a reverse transcriptional direction. Three Jκ segments and one Cκ gene on scaffold 86 are displayed in Figure S5. In addition, Vκ segments located on scaffolds 202 and 50 also possess two different transcriptional directions.
λ chain. Scaffold 68 was determined to contain the elephant λ light chain gene complex (Fig. 9). Sequence analysis revealed that 12 of the elephant Vλ gene segments belonged to six families (Fig. 11), which were homologous with the human Vλ 1, 3, 4, 7, 9 and 10 families. The remaining three Vλ pseudogenes could not be assigned to any family because they share less than 70% nucleotide similarity with any other Vλ gene segment. The three elephant Vλ families consist of seven members. In contrast to Vκ, all the Vλ segments possess an identical transcriptional polarity to the downstream Jλ segments. In addition, only Vλ3-3 and Vλ3-7 are identified as potentially functional genes. At the 3′ end of the locus, three constant region genes are organized in tandem, where both Cλ2 and Cλ3 are preceded by a Jλ. The Jλ segment before Cλ1 is missing because of a sequence gap. The three Cλ genes show approximately 90% amino acid identity. The sequences of two Jλ segments and three Cλ genes are presented in Figure S5.

Discussion
In this study, we have made a preliminary analysis of the immunoglobulin genes in the elephant using the recently released elephant genome, revealing that the elephant IgH locus conforms to the "translocon" pattern. Compared with the human IgH locus, which occupies a 1.25 Mb region [60], the elephant IgH locus appears to span a larger genomic region (approximately 3 Mb).
We translated the nucleotide sequences between the μ and γ1 genes in all three reading frames in both the positive and negative directions. BLAST searches of the nucleotide and corresponding amino acid sequences against the NCBI database identified only the IgD-CH3 remnant.
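The six-frame translation step can be sketched in Python as below. This is an illustrative reconstruction, not the tool used in this study: it assumes the standard genetic code, reports codons containing ambiguous bases as X, and uses hypothetical function names and frame labels. The subsequent database search is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Translate a DNA sequence in three frames on each strand.

    Keys are '+1', '+2', '+3' for the given strand and '-1', '-2', '-3'
    for its reverse complement; '*' marks a stop codon.
    """
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame("ATGTGGTAA")
```

Each translated frame can then be split at stop codons, and the resulting peptide stretches compared against known IgD constant domains.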
With the exception of marsupials [61,62], most placental and even monotreme mammals studied so far have been shown to have multiple IgG subclasses encoded by independent sets of exons [63]. The elephant genome contains nine IgG genes, although it is not known whether all of them are functional. This number is larger than that in any other placental mammal so far examined (ranging from 1 to 7) [43][64][65][66][67][68][69], providing another remarkable example of IgH chain constant region diversity in mammals.
Our analysis also suggested a high degree of complexity in the elephant IgVH locus. At least 112 VH segments constitute the elephant germline VH repertoire. According to the number of VH gene families, the placental mammals studied so far can be divided into two groups. The group with multiple gene families includes mice (16 families), humans (seven families), and horses (seven families). The group with few gene families or a single gene family includes dogs (three families), rabbits (one family), cattle (one family), camels (one family), and swine (one family) [70][71][72][73][74][75][76][77][78]. The elephant, having seven VH gene families, belongs to the first group. The mammalian VH families can be further classified into three clans: I, II, and III, which have co-existed in the genome for more than 400 Myr [79]. Similar to those of humans, the elephant VH families also conform to three clans: families 1, 5, and 7 form clan I, families 2, 4, and 6 form clan II, and family 3 forms clan III. The largest group of elephant VH genes is the VH4 family of clan II. It has been demonstrated that the unique VH family identified in cattle belongs to clan II [75,77]. In sheep, most VH genes are also categorized into clan II [80]. Based on a recent report, clan II also appears to be the largest group in the horse [41], indicating that herbivorous animals may prefer to use the clan II VH genes.
Close attention should also be paid to the elephant DH locus, where at least 87 germline DH segments could be mapped to a 450-kb DNA region, the largest number in mammals examined so far. The presence of more DH segments may greatly increase the Ig diversity generated through DNA rearrangement. The size of the elephant DH coding regions ranges from 10 to 37 bp, similar to that of human (11 to 37 bp) [57]. Further inspection revealed that the elephant DH segments, translated in three reading frames, were abundant in polar/hydrophobic amino acids, which differs from dog [78], horse [41], mouse [81], rabbit [82], and chicken [83], which show preferences for neutral (polar/hydrophilic) amino acids.
For the light chain genes, elephant Vκ germline genes are more abundant than Vλ (53 functional Vκ genes vs. 2 functional Vλ genes). Different mammalian species possess different ratios of Vκ and Vλ. In humans, roughly 60% of the variable light chain repertoire is κ (40 functional Vκ genes vs. 30 functional Vλ genes). The germline Vκ genes of mice are dominant, accounting for as much as 95% or more [84]. It has been proposed that the preferential use of light chain isotypes at the protein level may be correlated with the overall number of V gene segments [84]. It is thus possible that the κ chain predominates over the λ chain at the protein level in elephants.
Interestingly, a great number of pseudogenes exist in the elephant VH (61/112), Vκ (100/153), and Vλ (13/15) loci. In some species, the base-pair changes could be inferred using an existing pseudogene or germline gene as a template, and therefore pseudogenes in the V loci constitute a potential donor pool for gene conversion to generate immunoglobulin diversity [85][86][87][88]. A great number of V pseudogenes may contribute to the immunoglobulin diversity in elephants.
The study of the structure and organization of the immunoglobulin gene loci is vital to the understanding of the nature of antibody molecules. This study provides information for comparative studies of mammalian Ig genes, as well as data for further studies of the elephant immunoglobulin genes.

Figure S5 The alignment of amino acid sequences of J and C genes from elephant IgL chains. A, alignment of the deduced amino acid sequences of the three elephant Jκ gene segments. B, alignment of the amino acid sequences of the Cκ proteins from several mammalian species. C, alignment of the deduced amino acid sequences of the two elephant Jλ gene segments. D, alignment of the deduced amino acid sequences of three elephant Cλ genes and several mammalian species Cλ genes. Amino acid residues that are identical to the top counterpart in every panel are shown as dots; gaps and missing data are indicated by hyphens. (TIF)