Identification and Comparative Analysis of the Protocadherin Cluster in a Reptile, the Green Anole Lizard

Background
The vertebrate protocadherins are a subfamily of cell adhesion molecules that are predominantly expressed in the nervous system and are believed to play an important role in establishing the complex neural network during animal development. Genes encoding these molecules are organized into a cluster in the genome. Comparative analysis of the protocadherin subcluster organization and gene arrangements in different vertebrates has provided interesting insights into the history of vertebrate genome evolution. Among tetrapods, protocadherin clusters have been fully characterized only in mammals. In this study, we report the identification and comparative analysis of the protocadherin cluster in a reptile, the green anole lizard (Anolis carolinensis).
Methodology/Principal Findings
We show that the anole protocadherin cluster spans over a megabase and encodes a total of 71 genes. The number of genes in the anole protocadherin cluster is significantly higher than that in the coelacanth (49 genes) and mammalian (54–59 genes) clusters. The anole protocadherin genes are organized into four subclusters: the δ, α, β and γ. This subcluster organization is identical to that of the coelacanth protocadherin cluster, but differs from the mammalian clusters, which lack the δ subcluster. The gene number expansion in the anole protocadherin cluster is largely due to extensive gene duplication in the γb subgroup. Similar to coelacanth and elephant shark protocadherin genes, the anole protocadherin genes have experienced a low frequency of gene conversion.
Conclusions/Significance
Our results suggest that, similar to the protocadherin clusters in other vertebrates, the evolution of the anole protocadherin cluster is driven mainly by lineage-specific gene duplications and degeneration. Our analysis also shows that loss of the protocadherin δ subcluster in the mammalian lineage occurred after the divergence of mammals and reptiles. We present a model for the evolutionary history of the protocadherin cluster in tetrapods.


Introduction
Since their discovery about a decade ago [1,2], the vertebrate protocadherin cluster genes have received considerable attention due to their unusual genomic organization and potential role in specifying the remarkable diversity of the neural network. The clustered protocadherin genes are predominantly expressed in neurons and their protein products are highly enriched in synaptic junctions and axons [1,3-5]. Single neuron RT-PCR experiments have demonstrated that individual neurons, even of the same kind, express an overlapping but distinct subset of protocadherin cluster genes [6-8]. Thus the combinatorial expression of protocadherins in individual neurons might provide a profound molecular code for specifying neuron-neuron connections in the developing nervous system [9-11]. Indeed, ablation of protocadherin α and γ subclusters in mice causes defects in axonal projection of olfactory sensory neurons to the olfactory bulb [12] or drastic impairment in synaptic formation and extensive loss of interneurons in the spinal cord [5,13]. In mammals, the protocadherin cluster genes are organized into three closely-related subclusters, namely the α, β and γ subclusters, each of which contains 15 to 22 homologous "variable" exons that are arranged in tandem [2]. Each variable exon, measuring about 2.4 kb, is transcribed from an independent promoter and encodes an extracellular domain (comprising six calcium-binding ectodomain repeats), a transmembrane domain and a short segment of the intracellular domain. In addition to the variable exons, the 3′ end of each of the α and γ (but not β) subclusters contains three "constant" exons, which are spliced to individual variable exons in their respective subclusters. These constant exons encode the major part of the intracellular domain. Thus, the protocadherin proteins produced by each of the α and γ subclusters comprise a homologous but distinct extracellular domain, and an identical cytoplasmic domain.
The extracellular domain is presumably responsible for providing diverse signals for specifying cell-cell interaction through homophilic or heterophilic interaction [14,15] or by interaction with other cell surface molecules [16,17], whereas the cytoplasmic domain is likely to mediate a common intracellular process for implementing the cell interaction signal [18,19]. The protein products encoded by the β subcluster genes, which lack the constant exons, contain only the diverse extracellular domain, and lack the common cytoplasmic domain [2].
The protocadherin cluster represents one of the most evolutionarily dynamic gene loci in vertebrate genomes. Comparative analysis of its subcluster organization and paralog arrangements has provided useful information regarding the dynamic nature of vertebrate genomes [20,21]. To date, the genomic organization of the protocadherin cluster has been characterized in several vertebrate lineages, including mammals [2,22-25], chicken (the α subcluster only) [26], coelacanth [20], teleost fishes [27-29] and a cartilaginous fish, the elephant shark [21]. While the protocadherin cluster in mammals is organized into the α, β and γ subclusters with 54 to 59 genes, the coelacanth cluster possesses an additional single-gene subcluster, the δ subcluster, at the 5′ end and consists of a total of 49 genes [20]. Teleost fishes such as fugu and zebrafish contain two unlinked protocadherin clusters, Pcdh1 and Pcdh2, due to a fish-specific genome duplication event. Both clusters lack the β subcluster. In addition, the fugu Pcdh1 cluster has lost the γ subcluster, thus containing only the δ and α subclusters. In contrast, the zebrafish Pcdh1 cluster has retained the δ, α and γ subclusters whereas the Pcdh2 cluster has lost the δ subcluster and retained only the α and γ subclusters [27-29]. The duplicate protocadherin clusters in fugu and zebrafish contain at least 77 and 107 genes, respectively. The elephant shark possesses three unique protocadherin subclusters in addition to the δ subcluster. These subclusters are designated as the ε, μ and ν subclusters. They have no orthologs in bony vertebrates [21]. The different subcluster complement in bony vertebrates and cartilaginous fishes suggests that the common ancestor of jawed vertebrates contained at least seven protocadherin subclusters (α, β, γ, δ, ε, μ and ν), of which the α, β and γ subclusters have been lost in the cartilaginous lineage, whereas the ε, μ and ν subclusters have been lost in bony vertebrates.
The δ subcluster has been retained in elephant shark, teleost fishes, coelacanth, amphibians and birds [21], but lost in mammals. In addition to the loss of complete subclusters, the variable exons in each protocadherin subcluster (except the δ) have experienced repeated lineage-specific gene duplication and degeneration. For instance, while most human protocadherin cluster genes have a clearly-defined one-to-one ortholog in other mammals, only a few genes in the human and coelacanth clusters exhibit individual orthologous relationships, suggesting that the variable exons in each of the human and coelacanth clusters have experienced repeated lineage-specific gene duplication and degeneration [20,22,24]. Given the potential role of protocadherins in specifying the neural network, it is plausible that the high frequency of gene turnover in the protocadherin cluster might have played a key role in the adaptive evolution of the central nervous system in vertebrates. Among tetrapods, only mammalian protocadherin clusters have been fully characterized to date. Here, we report the identification and analysis of the protocadherin cluster in a reptile, the green anole lizard (Anolis carolinensis). The protocadherin cluster in anole, which represents an intermediate taxon between the coelacanth and mammals, fills a critical gap in the evolutionary history of the protocadherin cluster in tetrapods.

Results and Discussion
Anole protocadherin cluster consists of 71 genes, organized into δ, α, β and γ subclusters
To identify the protocadherin cluster sequence in the anole genome, we first performed a TBLASTN search against the draft anole genome (Broad Institute AnoCar 1.0) using amino acid sequences of mammalian protocadherin constant exons as queries. This led to the identification of a single scaffold (Scaffold_147, 2,899,420 bp) containing the entire protocadherin cluster. Inspection of this scaffold showed that the sequence corresponding to the anole protocadherin cluster represents a high-quality assembly region interrupted by 24 gaps. We subsequently filled 18 of these gaps by PCR amplification from genomic DNA, resulting in seven contigs spanning ~1 Mb. Annotation of this gene cluster by GENSCAN and homology comparisons identified 71 protocadherin variable exons and three subsets of constant exons (Fig. 1). We confirmed the splicing sites of the variable and constant exons by RT-PCR using cDNA from anole brain. In addition to the 71 intact variable exons, we were also able to identify 14 pseudogenes. Interestingly, half of these pseudogenes contain a single-nucleotide insertion or deletion (Fig. 1). The presence of protocadherin pseudogenes at various stages of degeneration indicates that the protocadherin cluster has continued to experience gene losses in the anole lineage (see below). In addition to the protocadherin genes, we identified 19 non-protocadherin genes upstream and five non-protocadherin genes downstream of the protocadherin cluster. The synteny of these genes flanking the protocadherin cluster is almost totally conserved in the human protocadherin cluster locus (Table 1). This indicates that, in contrast to the protocadherin cluster, its flanking regions are highly stable in reptiles and mammals. The protocadherin clusters in human and mouse contain two non-protocadherin genes (Slc25a2 and Taf7) located between the β and γ subclusters [2,22].
However, these genes are not present either in the anole protocadherin cluster or in the protocadherin clusters of non-tetrapod vertebrates. Thus we conclude that these genes were inserted into the protocadherin cluster in the mammalian lineage after it diverged from reptiles.
To determine the subcluster organization of anole protocadherin genes, we first performed phylogenetic analysis of the three subsets of constant exons from the anole protocadherin cluster together with constant exon sequences of protocadherin α, γ, δ, μ and ν subclusters from other representative vertebrates. The phylogenetic analysis shows that the three subsets of constant exons in the anole protocadherin cluster represent the δ, α and γ subclusters (Fig. 2). Since the protocadherin β subcluster lacks constant exons, the identity of this subcluster can only be inferred by the phylogenetic analysis of its variable exons. We therefore performed phylogenetic analysis of the variable exon sequences. This analysis showed that the 15 genes immediately downstream of the anole protocadherin α subcluster belong to the β subcluster (see below). Taken together, our results indicate that the anole protocadherin cluster consists of 71 protocadherin genes, which are organized into four subclusters: the δ (one gene), α (17 genes), β (15 genes) and γ (38 genes) (Fig. 1). The subcluster organization of the anole protocadherin cluster is therefore identical to that of the coelacanth cluster, but differs from the mammalian protocadherin cluster, which lacks the δ subcluster at the 5′ end. Notably, the total number of genes in the anole protocadherin cluster (71 genes) is significantly higher than that in the coelacanth (49 genes) and mammalian (54-59 genes) clusters.
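The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below is only an illustrative tally: the dictionary keys and helper names are labels chosen here, not identifiers from the study, and the numbers are those stated in the text.

```python
# Subcluster gene counts for the anole cluster, as reported in the text.
# The spelled-out keys are labels chosen for this sketch.
ANOLE_SUBCLUSTERS = {"delta": 1, "alpha": 17, "beta": 15, "gamma": 38}

# Totals reported in the text for the comparison clusters.
COELACANTH_TOTAL = 49
MAMMAL_RANGE = (54, 59)


def cluster_total(subclusters):
    """Sum the variable-exon (gene) counts over all subclusters."""
    return sum(subclusters.values())


def contigs_after_gap_filling(gaps_total, gaps_filled):
    """A linear region with k unfilled gaps is split into k + 1 contigs."""
    return (gaps_total - gaps_filled) + 1


anole_total = cluster_total(ANOLE_SUBCLUSTERS)
print(anole_total)                        # 71
print(anole_total - COELACANTH_TOTAL)     # 22 more genes than coelacanth
print(anole_total - MAMMAL_RANGE[1])      # 12 more than the largest mammalian cluster
print(contigs_after_gap_filling(24, 18))  # 7
```

The last line reproduces the seven contigs mentioned above: 24 gaps with 18 filled leaves six gaps, which separate seven contiguous stretches.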

Anole protocadherin genes have experienced a low frequency of gene conversion
It has been documented that protocadherin genes in teleost fishes and mammals have experienced repeated gene conversion events during evolution [27,29]. In contrast, protocadherin genes in coelacanth and elephant shark have experienced only limited gene conversion events [21,27]. To investigate whether the anole protocadherin genes have undergone gene conversion, we estimated the total number of synonymous substitutions per codon (dS) of the anole protocadherin genes in the four major paralog subgroups: Acα1-15, Acβ1-15, Acγa1-10 and Acγb4-23. We used the synonymous substitution rate as a measure of the frequency of gene conversion because purifying selection for protein function does not act on synonymous sites. If the ECD5 and ECD6 domains of the anole genes had experienced gene conversion, the synonymous substitution rate for these domains should be considerably lower than that for the ECD1 to ECD4 domains. However, as shown in Table 2, the synonymous substitution rates in the Acα1-15, Acβ1-15 and Acγb4-23 subgroups are highly similar among the six ectodomains. The ratios between the most and the least divergent ectodomains in these paralog subgroups range from 2. 25

Phylogenetic relationships of anole and other vertebrate protocadherin cluster genes
Previous studies have shown that most mammalian protocadherin genes (e.g., >72% in human and >67% in mouse) have clearly-defined one-to-one interspecies orthologous relationships [22,24,30]. However, few such orthologous relationships can be found between individual mammalian, coelacanth or teleost protocadherin genes. Instead, some of the mammalian protocadherin genes are orthologous to coelacanth and teleost protocadherin genes only as paralog subgroups [20,27-29]. This type of phylogenetic relationship implies that subsequent to the divergence of vertebrate lineages, the variable exons of protocadherin clusters have undergone extensive gene turnover and the paralog complement in each of the current vertebrate protocadherin clusters is a result of multiple repeated lineage-specific gene duplication/degeneration events. To trace the evolutionary history of protocadherin genes in tetrapods, we performed phylogenetic analysis using individual variable exon sequences of anole, coelacanth and human protocadherin clusters. Coelacanth, which is the closest living relative of tetrapods whose protocadherin cluster has been characterized, was chosen as the outgroup. Our results show that the anole α subcluster consists of two divergent subgroups of protocadherin genes, the Acα1-15 and the Acαc1-2. While Acαc1 and Acαc2 are clearly the anole orthologs of human Hsαc1 and Hsαc2, respectively, the anole Acα1-15 genes form a paralog subgroup of their own that is orthologous to the human paralog subgroup comprising the Hsα1-13 genes (Fig. 3). This phylogeny suggests that individual variable exons in each of the Acα1-15 and Hsα1-13 paralog subgroups are derived from a single ancestral protocadherin paralog in each of the anole and human α subclusters through multiple rounds of lineage-specific gene duplications, and the anole and human ancestral paralogs evolved from a single gene that existed in the common ancestor of reptiles and mammals.
The relationships of the anole protocadherin genes to the coelacanth α subcluster, however, appear to be more complex. While it is clear that the last gene at the 3′ end of the coelacanth subcluster (Lmα21) is an ortholog of anole Acαc2 and human Hsαc2 (also located at the 3′ end of their respective subclusters), the coelacanth counterparts of anole Acαc1 and human Hsαc1 seem to have expanded into a paralog subgroup that contains six genes (Lmα16-19) (Fig. 3; also see Fig. S1 for a higher resolution phylogenetic tree for this class of protocadherin genes). It appears that the coelacanth protocadherin genes closest to the anole Acα1-15 and human Hsα1-13 paralog subgroups are Lmα14 and its closely related paralog subgroup Lmα11-13. Apparently, there is no equivalent to coelacanth Lmα2-10 in the anole and human α subclusters, suggesting that orthologs for these coelacanth genes have been lost in reptiles and mammals (Fig. 3). These results suggest that the paralog subgroup complement of the anole protocadherin α subcluster is highly similar to that of the human α subcluster, but considerably divergent from that of the coelacanth protocadherin α subcluster.
The genomic organization of the protocadherin β subcluster is relatively simple, containing only a single paralog subgroup and lacking the constant region [2,30]. The protocadherin β subcluster has been identified only in mammalian and coelacanth protocadherin clusters, but not in fugu, zebrafish and elephant shark clusters, suggesting that it is specific to lobe-finned fishes and tetrapods. Our phylogenetic analysis shows that the first 15 protocadherin genes immediately downstream of the anole α subcluster, as a paralog subgroup, are orthologous to the human and coelacanth protocadherin β subcluster genes, indicating that this subset of anole protocadherin genes belongs to the β subcluster (Fig. 3). The absence of one-to-one orthologous relationships between individual anole, human and coelacanth protocadherin β genes suggests that these genes were derived from multiple, independent lineage-specific gene duplication events in their respective subclusters. Thus, the evolution of protocadherin β subclusters is driven exclusively by lineage-specific variable exon duplication and degeneration. Notably, the gene number of the anole β subcluster (15 genes) is comparable to that of the human β subcluster (16 genes), but is significantly higher than that of the coelacanth β subcluster (4 genes). The expansion of β subcluster genes in reptiles and mammals might have given rise to a larger molecular repertoire to mediate a more diverse and/or complex cell-cell interaction network. However, as protocadherin molecules are highly homologous, and apparently redundant [31], whether the differential gene numbers of the β subcluster could indeed affect the degree of complexity of the protocadherin β-mediated neuron-neuron interaction remains to be determined. It is noteworthy that the overall gene content in the vertebrate protocadherin clusters does not seem to be correlated with their respective brain complexity. For example, while the anole, fugu and zebrafish protocadherin clusters contain 71, >77 and >107 genes, respectively [27-29], only 53 protocadherin genes are present in the human protocadherin cluster.
The mammalian protocadherin γ subcluster contains three divergent paralog subgroups, the γa, γb and γc, which in human consist of 12 (Hsγa1-12), seven (Hsγb1-7) and three (Hsγc3-5) genes, respectively. The coelacanth protocadherin γ subcluster also contains three major paralog subgroups. However, while it is clear that the last five genes (Lmγ20-24) at the 3′ end of the coelacanth subcluster belong to the γc subgroup, the other two coelacanth paralog subgroups, which consist of Lmγ1,3,4,6,7,9,11-16,19 and Lmγ2,5,8,10,17,18, respectively, do not seem to be directly related to any of the mammalian γa and γb subgroups [20]. The anole protocadherin γ subcluster comprises 38 genes and represents the largest γ subcluster identified to date. Our phylogenetic analysis shows that the anole γ subcluster genes also segregate into three paralog subgroups, which clearly belong to the γa (Acγa1-10), γb (Acγb1-23) and γc (Acγc3-7) subgroups, respectively (Fig. 3). Similar to the mammalian γa and γb subgroup genes [2,22], the anole Acγa1-10 and Acγb1-23 genes are interspersed in the cluster (Fig. 1). This type of gene arrangement implies that some of the paralogs in the γa and γb subgroups might have been duplicated simultaneously as a contiguous syntenic block at some stage during evolution. Interestingly, our phylogenetic analysis shows that the coelacanth subgroup Lmγ1,3,4,6,7,9,11-16,19 is more closely related to the mammalian and anole γa subgroups, whereas the Lmγ2,5,8,10,17,18 subgroup is orthologous to the mammalian and anole γb subgroups [20]. Similar to their mammalian and anole counterparts, genes in these two coelacanth protocadherin subgroups also exhibit an interspersed distribution pattern, which seems to be a unique feature of the γ subcluster genes.
Interestingly, no paralog subgroups analogous to γa and γb were observed in fugu and zebrafish γ subclusters [27-29], suggesting that the γa and γb subgroups are likely to be unique to tetrapods and coelacanth.
In contrast to protocadherin genes that undergo repeated gene duplication and degeneration, the mammalian protocadherin cluster contains a subset of "ancient" genes that are less prone to gene duplication. These genes are referred to as the "c-type" protocadherin genes, which include the last two genes (αc1-2) at the 3′ end of the α subcluster and the last three genes (γc3-5) in the γ subcluster [2,22]. Despite being located in different subclusters, these genes are phylogenetically more closely related to each other than to other protocadherin genes in their respective subclusters [2,22,24]. The anole protocadherin cluster contains seven such c-type genes: two (Acαc1 and Acαc2) located in the α subcluster and five (Acγc3-7) in the γ subcluster (Fig. 1). As shown above, the Acαc1 and Acαc2 genes in the anole α subcluster are clearly orthologous to human Hsαc1 and Hsαc2, respectively, indicating that unlike other protocadherin genes in the subcluster, αc1 and αc2 seem never to have experienced gene duplication or degeneration since the divergence of reptiles and mammals. Expression studies in mammals have shown that while other protocadherin genes in the α subcluster are only expressed by a selected subset of neurons, αc1 and αc2 seem to be expressed by every neuron [7,32], suggesting that they might play a key role in establishing the neural network. In the anole protocadherin γ subcluster, while Acγc3 and Acγc5 are clearly orthologous to human Hsγc3 and Hsγc5, and coelacanth Lmγ20 and Lmγ23, respectively, the Acγc4 and Acγc6,7 seem to have no direct orthologs in human. Instead, the anole Acγc4 and Acγc7 are orthologous to coelacanth Lmγ22 and Lmγ24, respectively (Fig. 3, S1). No direct interspecies orthologs for anole Acγc6, human Hsγc2 and Lmγ21 were found in this analysis.
Lack of direct evidence of recent gene duplication in this protocadherin subgroup suggests that the ancient protocadherin γ subcluster might have contained more c-type paralogs than any of the γ subclusters in modern-day vertebrates, and that subsequent to the divergence of vertebrates, differential gene loss, rather than gene duplication, has played a major role in the evolution of these c-type genes in the γ subcluster.
Consistent with the results of the phylogenetic analysis of constant exons of the δ subcluster (Fig. 2), phylogenetic analysis of the variable exons also showed that the single protocadherin gene in the anole δ subcluster is a direct ortholog of the coelacanth δ subcluster gene (Fig. 3). Thus, the protocadherin δ subcluster seems to be present in all non-mammalian vertebrate lineages, including reptiles, birds, amphibians, coelacanth, teleosts and cartilaginous fishes [21]. Unlike the protocadherin genes in their neighboring subclusters, none of the protocadherin δ subcluster genes seems to have undergone gene duplication. Such a stable state during evolution suggests that the protocadherin δ subcluster gene might play a critical role in establishing the neural network connections specific to non-mammalian vertebrates. The effect of the loss of this subcluster in mammals is unclear.

A model for the evolution of protocadherin cluster genes in tetrapods
Based on the inferred phylogenetic relationships of anole, human and coelacanth protocadherin cluster genes, we propose a model for the evolution of protocadherin clusters in tetrapods (Fig. 4). In this model, we propose that repeated gene duplications and degenerations have played a predominant role in the evolution of protocadherin clusters in tetrapods. How these highly homologous and apparently redundant protocadherin paralogs affect the development and complexity of the nervous system is currently unknown. In addition, our model suggests that paralog subgroup degeneration seems to have played an important role at the early stage of tetrapod evolution (e.g. during the transition from lobe-finned fishes to tetrapods), but not during the transition from reptiles to mammals. Moreover, our phylogenetic analysis supports the view that differential gene losses rather than gene duplication played a predominant role in the evolution of protocadherin γc genes. Given the potential role of protocadherin genes in establishing the neural network, we speculate that the rapid gene turnover of protocadherin paralogs might have contributed to the adaptive evolution of the central nervous system in different tetrapod lineages. Thus, a future challenge will be to investigate how these different complements of protocadherin genes have contributed to the complexity of the nervous system in different vertebrate lineages.

Materials and Methods
Identification and annotation of the green anole lizard protocadherin cluster
A draft assembly of the anole genome sequences based on 6.8x coverage sequences has been generated by the Broad Institute (Broad Institute AnoCar 1.0). We identified the genomic sequence of the anole protocadherin cluster by TBLASTN search of the draft assembly that is made available on the University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) using the amino acid sequences of mammalian protocadherin constant exons as queries. The nucleotide sequence of Scaffold_147 (2,899,420 bp), which contains the protocadherin cluster gene sequences, was retrieved from the UCSC Genome Browser. Sequencing gaps in the protocadherin cluster region were filled by PCR using anole genomic DNA as template. We could fill 18 of the 24 gaps in the anole protocadherin cluster. The sequences corresponding to these gap regions have been submitted to GenBank under accession numbers GQ485616-GQ485633. The remaining gaps were not amplifiable by PCR due to a high content of repetitive DNA. The annotated anole protocadherin cluster sequences have been submitted to GenBank as Third Party Annotation (accession numbers: BK006912-BK006917). Variable and constant exons of the anole protocadherin cluster and the coding exons of non-protocadherin genes flanking the anole protocadherin cluster were annotated based on GENSCAN prediction (http://genes.mit.edu/GENSCAN.html) and homology to known protein sequences in the public database (TBLASTN and BLASTX, http://blast.ncbi.nlm.nih.gov). The intron/exon splicing sites of the constant regions and the splicing sites between constant and selected variable exons in the anole protocadherin δ, α and γ subclusters were confirmed by RT-PCR using cDNA prepared from anole total brain RNA.

Synonymous substitution analysis
Synonymous substitution rates were estimated using the CODEML program in the PAML package [33]. The amino acid sequences were aligned by ClustalX, and the nucleotide sequence alignments were generated with the RevTrans program [34], using the amino acid sequence alignment as template. The synonymous substitution rate was calculated as the average of synonymous substitutions per codon (dS) for each branch in the gene tree of protocadherin subgroups.
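The idea of building a nucleotide alignment from an amino acid alignment can be illustrated with a minimal sketch. This is not the RevTrans implementation: the function name `thread_codons` and the toy sequences are hypothetical, and the sketch assumes each coding sequence contains exactly three nucleotides per aligned residue with no stop codon.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Project a gapped protein row onto its ungapped coding sequence.

    Each residue is replaced by its codon and each protein gap by three
    nucleotide gaps, so codon boundaries are preserved in the result.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence length must be 3 x number of residues")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)          # one protein gap spans one codon
        else:
            out.append(cds[pos:pos + 3])  # copy the next codon unchanged
            pos += 3
    return "".join(out)


# Toy example (hypothetical sequences, not protocadherin data).
protein_alignment = {"seqA": "MK-L", "seqB": "MKAL"}
coding_sequences = {"seqA": "ATGAAACTG", "seqB": "ATGAAGGCTCTG"}

codon_alignment = {
    name: thread_codons(row, coding_sequences[name])
    for name, row in protein_alignment.items()
}
print(codon_alignment["seqA"])  # ATGAAA---CTG
print(codon_alignment["seqB"])  # ATGAAGGCTCTG
```

Keeping gaps in multiples of three is what allows a codon-based program to compare homologous codons column by column when estimating dS.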

Phylogenetic analysis
The coelacanth protocadherin cluster was assembled from BAC sequences in GenBank (accession numbers: AC150238, AC250248, and AC150308-AC150310) [20]. The human protocadherin cluster sequences were retrieved from the human genome database at the UCSC Genome Browser (http://genome.ucsc.edu). The amino acid sequences of the constant exons (see Fig. 2) or the ectodomains 1-3 (EC1-3) (see Fig. 3) of the protocadherin cluster genes from various species were aligned using ClustalW [35] as implemented in the BioEdit sequence alignment editor [36] under default parameters. Only the extracellular EC1-3 sequences were used for the phylogenetic analysis because this region is less prone to gene conversion-mediated sequence homogenization, which, to some extent, would mask the phylogenetic signals [27,29]. ModelGenerator [37] was used to deduce the best-suited amino acid substitution model for the alignments. Maximum likelihood trees were generated using PhyML [38] and displayed using NJplot (http://pbil.univ-lyon1.fr/software/njplot.html). The robustness of the tree was determined using 100 bootstrap replicates. All the trees were unrooted.

Figure S1 Phylogenetic analysis of c-type protocadherin and the protocadherin δ subcluster genes. Found at: doi:10.1371/journal.pone.0007614.s001 (0.65 MB PDF)
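The bootstrap assessment mentioned under Phylogenetic analysis rests on resampling alignment columns with replacement. The sketch below shows only that resampling step, under stated assumptions: the function names, the fixed seed and the toy alignment are hypothetical, and tree inference for each replicate (done with PhyML in the study) is not reproduced here.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: columns drawn with replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column indices are applied to every sequence so that
    # each resampled column remains a valid alignment column.
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate n_replicates resampled alignments (100 mirrors the text)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy amino acid alignment (hypothetical sequences and names).
toy_alignment = {
    "gene1": "MKTLLVA",
    "gene2": "MKSLLVA",
    "gene3": "MRTLIVG",
}

replicates = bootstrap_replicates(toy_alignment)
print(len(replicates))                   # 100
print(len(replicates[0]["gene1"]))       # 7, same width as the input
```

A tree would then be inferred from each replicate, and the support for a branch is the fraction of replicate trees in which that branch appears.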