Full-Length Minor Ampullate Spidroin Gene Sequence

Spider silk includes seven protein-based fibers and glue-like substances produced by glands in the spider's abdomen. Minor ampullate silk is used to make the auxiliary spiral of the orb-web and also for wrapping prey, has a high tensile strength and does not supercontract in water. So far, only partial cDNA sequences have been obtained for minor ampullate spidroins (MiSps). Here we describe the first MiSp full-length gene sequence from the spider species Araneus ventricosus, obtained using a multidimensional PCR approach. Comparative analysis of the sequence reveals regulatory elements, as well as unique spidroin gene and protein architecture including the presence of an unusually large intron. The spliced full-length transcript of the MiSp gene is 5440 bp in size and encodes 1766 amino acid residues organized into conserved nonrepetitive N- and C-terminal domains and a central predominantly repetitive region composed of four units that are iterated in a non-regular manner. The repeats are more conserved within A. ventricosus MiSp than when compared to repeats from homologous proteins, and are interrupted by two nonrepetitive spacer regions, which have 100% identity even at the nucleotide level.


Introduction
Orb-weaving spiders of the superfamily Araneoidea use specialized abdominal glands to manufacture up to seven different types of protein-based silks or glues, each of which has evolved specific biological functions and mechanical properties [1]. The major ampullate spidroins (MaSps) make up the dragline silk, which is used as a safety line and for the frame of the web. The fiber displays both high tensile strength and extensibility, and is one of nature's best performing materials [2,3]. The minor ampullate silk is used for prey wrapping and for the auxiliary spiral that further stabilizes the web, and is similar to dragline silk in tensile strength but has lower elasticity and irreversibly deforms when stretched [3-6]. In contrast to dragline silk, minor ampullate silk does not supercontract when hydrated [7,8]. Flagelliform (Flag) silk, or capture silk, is highly elastic and coated with glue, and forms the extremely extensible capture spiral of the orb web [2,9]. Aciniform silk is used for wrapping and immobilizing prey, building sperm webs, for web decorations, and also as an egg case liner [1,10,11]. Cylindriform silk, also referred to as tubuliform silk, is only secreted by female spiders during the reproductive season and forms the tough outer shell of the egg case, which is sufficiently robust to protect eggs from a variety of threats, such as predator and parasitoid invasion, temperature fluctuations or aqueous environments [12-14]. Pyriform silk forms an attachment disc that lashes together the joints of the web and attaches the dragline to substrates, and is used for prey capture and locomotion [15,16]. Web glues, secreted from aggregate glands, are used by the spider to coat the spiral threads to help capture prey, and are one of the most effective biological glues known [17,18].
Spider silk proteins (spidroins) are large and defined by their unique repetitive segments as well as non-repetitive N- and C-terminal regions [11,15,19-22]. Some spidroins contain short, simple repeat units, whereas others are composed of longer, more complicated repeats [11,15,21,22]. To date, most published spidroin gene or cDNA sequences are not full-length because the 5′ end of the complete message is missing, probably because cloning methods are biased towards amplification of the 3′ regions of mRNAs. Furthermore, hurdles associated with cloning and sequencing long stretches of repetitive DNA or large transcripts make it difficult to obtain full-length gene or cDNA sequences [23-26]. However, a limited number of full-length genomic DNA/cDNA silk sequences have been reported (Table 1). The full-length genomic DNA sequences for MaSp1 and MaSp2 from Latrodectus hesperus [21], and two full-length cDNA sequences, cylindriform spidroin (CySp)1 and CySp2 from Argiope bruennichi, have been described [27]. The limited availability of full-length genomic DNA sequences presents a major obstacle for studies of spidroin gene architecture and regulation, as well as for phylogenetic analyses.
Minor ampullate silk may be particularly interesting for biomedical applications since it is strong and does not supercontract in water. Only partial minor ampullate spidroin (MiSp) sequences have been reported to date. Partial MiSp cDNA sequences and genomic restriction mapping from Nephila clavipes have detected no introns in the repetitive regions, but identified a part of the repetitive region and the C-terminal region [4]. In this study, we describe the first full-length gene sequence for MiSp, including its flanking non-coding regions, obtained by screening a fosmid genomic library from the orb-weaver Araneus ventricosus by a multidimensional PCR method.

Fosmid genomic library construction and screening
A. ventricosus individuals were collected in Shanghai, China, frozen in liquid nitrogen, and stored at −80 °C. No specific permits were required for the described field studies; the location from which the spiders were collected is not privately owned or protected in any way, and the spiders collected are not endangered or protected. High molecular weight genomic DNA (HMW-gDNA) was isolated by an improved cetyltrimethylammonium bromide (CTAB) method from muscle tissues dissected from the cephalothoraxes of ~20 adult spiders. The HMW-gDNA was sheared mechanically into approximately 40 kb fragments using a 100 µL pipette tip. Subsequently, blunt-ended, 5′-phosphorylated DNA was generated using the End-Repair Enzyme Mix (CopyControl™ Fosmid Library Production Kit, Epicentre). Fragments ranging from 25-40 kb were excised from separation gels (avoiding dye staining), purified, and ligated into the vector pCC2FOS™ (Epicentre). The resulting vectors with HMW-gDNA inserts were packaged using MaxPlax™ Lambda Packaging Extracts and transfected into EPI300™-T1R phage T1-resistant E. coli cells (Epicentre), and the titers of the phage particles were determined. All operations were performed as described for the CopyControl™ Fosmid Library Production Kit (Epicentre). From the resulting A. ventricosus random genomic Fosmid library, approximately 3.9 × 10^5 E. coli colonies were picked and arrayed into 1026 384-well culture plates (1 clone/well). Each 384-well culture plate was replicated and the original stock plates, containing 8% glycerol, were stored at −80 °C.
For efficient screening of the Fosmid genomic library, a modified method based on 4-dimensional PCR [28] was used. The 1026 384-well culture plates were divided into about 25 sets, each containing about 40 culture plates. Each set of 40 plates was transferred into 40 384-well plates (1 clone/well) which were filled with 60 µL of LB medium containing 12.5 µg/mL chloramphenicol. For each set, the 40 384-well culture plates were arrayed as 5 rows × 8 columns in numerical order. The admixtures from every column and row were combined into primary pools. Derived from the primary pools, superpools contained column admixtures plus row admixtures. Secondary pools, which included column admixtures and row admixtures from the 40 384-well culture plates, were also constructed. The recombinant pCC2FOS™ vector was extracted from cell cultures combined from one set of 384-well culture plates and used for PCR screening. The superpools were used to determine whether the recombinants from one set of 40 384-well culture plates contained positive clones. The primary pools were used to identify which plate contained one or more positive clones, while the secondary pools were used to determine the exact positive clones. Screening primers targeting MiSp were obtained by designing a pair of degenerate PCR primers based on two conserved polypeptides in the MiSp and MaSp C-terminal domains (SRISSA and NIGQVD). After optimization of reaction conditions, a ~200 bp fragment was obtained, from which exact sequences for primers were derived. The primers (GMiSp-CF: 5′-TTACTCAGGTGTCCTTGG-3′ and GMiSp-CR: 5′-ATTGGCTTACTGCATTCT-3′) targeted a 162 bp portion of the MiSp 3′ non-repetitive region.
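The row/column pooling logic for one set of 40 plates can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the 0-based plate indices and the row-major plate numbering are all hypothetical, and the PCR readout is simulated by set membership.

```python
from itertools import product

ROWS, COLS = 5, 8  # one set: 40 plates arrayed as 5 rows x 8 columns

def plate_position(plate_index):
    """Map a 0-based plate index within a set to its (row, column) in the grid.
    Row-major numbering is an assumption made for this sketch."""
    return divmod(plate_index, COLS)

def pool_results(positive_plates):
    """Simulate PCR on row and column pools: a pool scores positive
    if it contains at least one positive plate."""
    rows = {plate_position(p)[0] for p in positive_plates}
    cols = {plate_position(p)[1] for p in positive_plates}
    return rows, cols

def candidate_plates(positive_rows, positive_cols):
    """Plates lying at an intersection of a positive row and a positive column."""
    return sorted(r * COLS + c for r, c in product(positive_rows, positive_cols))

# One positive plate is located with 5 + 8 = 13 pool PCRs instead of 40.
print(candidate_plates(*pool_results({19})))
# Two positive plates in different rows and columns give four candidates,
# so the intersections need to be confirmed individually.
print(candidate_plates(*pool_results({3, 12})))
```

The same intersection idea applies one level down, where the rows and columns of wells within a positive plate are pooled to locate the exact clone.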

Full-length gene sequence
The pCC2FOS™ Forward Sequencing Primer (5′-GTACAACGACACCTAGAC-3′) and Reverse Sequencing Primer (5′-CAGGAAACAGCCTAGGAA-3′) were designed for Fosmid end-sequencing. About 800 bp of the 5′ and 3′ termini of the inserted DNA were sequenced and used to identify clones that contained full-length MiSp genes. The Chinese National Human Genome Center (CHGC) in Shanghai was then commissioned to sequence the full-length gene from one positive clone using a shotgun method, and the clone was sequenced and assembled to ~7× coverage. The positive plasmid was sheared randomly into 1.6-4 kb fragments and subcloned into pSMART. Subclones in two 96-well plates (192 subcloned fragments) were sequenced in both directions with an ABI 3730xl DNA sequencer, yielding 384 sequenced fragments that were assembled using phredPhrap. Gaps were closed by sequencing restriction fragments. Finally, the full-length sequence was verified by PCR amplification, restriction enzyme digestion and sequencing.
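As a rough plausibility check on the reported depth, expected coverage can be estimated as c = N × L / G (number of reads times mean read length, divided by target length). The sketch below is illustrative only: the mean read length is not stated in the text and is back-calculated here, and the target length uses the approximate 33 kb insert size while ignoring the vector backbone.

```python
def mean_coverage(n_reads, mean_read_len, target_len):
    """Expected sequencing depth c = N * L / G."""
    return n_reads * mean_read_len / target_len

def read_length_for(coverage, n_reads, target_len):
    """Mean read length needed to reach a given depth."""
    return coverage * target_len / n_reads

N_READS = 2 * 192      # 192 subclones, each sequenced in both directions
INSERT_LEN = 33_000    # approximate insert size (bp); vector backbone ignored

# Mean usable read length implied by ~7x coverage (an inferred value, not reported)
print(round(read_length_for(7, N_READS, INSERT_LEN)))
```

Under these assumptions, a mean usable read length of roughly 600 bp would account for about 7× coverage, which is consistent with the number of sequenced fragments.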

Intron identification
Fgenesh 2.6 was first used to predict the intron splice sites, which were then verified experimentally by reverse-transcription PCR (RT-PCR). Minor ampullate glands were dissected from euthanized A. ventricosus individuals and flash frozen in liquid nitrogen. Total RNA was extracted using the standard TRIzol™ method according to the manufacturer's protocol. Four primers were designed based on the full-length MiSp gene sequence (Table 2). Repetitive DNA sequences are not suitable for designing primers, so the reverse primers (including the reverse-transcription primer) annealed to the nonrepetitive spacer region. Because the nonrepetitive N-terminal domain is too far from the intron, sense primers annealed to the upstream repetitive region. cDNA from A. ventricosus was synthesized by nested reverse-transcription PCR in the following manner.
Step 1, the Out-F (RT primer) was used for reverse transcription with mRNA as template, generating single-stranded cDNA fragments.
Step 2, with the cDNA as template, primers Out-S and Out-F (RT primer) were used to amplify double-stranded DNA fragments.
Step 3, in order to improve specificity, the nested primers In-S and In-F were used to amplify double-stranded DNA fragments with the double-stranded DNA generated in step 2 as template. Agarose gel electrophoresis of the products from steps 2 and 3 showed no clear bands for step 2, but two bands for step 3. These two DNA fragments were sequenced, showing that one is amplified from the first spacer and the other from the second spacer (see Figure 1). Sequencing reactions were performed at Beijing Genomics Institute, China.

MiSp full-length gene and protein sequence
Two positive clones were obtained from an A. ventricosus Fosmid genomic library by screening with 4-dimensional PCR targeting the MiSp C-terminal region. Analysis of the 5′ and 3′ terminal ~800 bp sequences of the inserts revealed identical sequences in both inserts and that the positive clones contained the full-length MiSp gene (data not shown). The complete ~33 kb sequence for one of the clones, called F29-0811, was determined (GenBank accession no. JX513956) and found to encompass the full-length coding sequence for MiSp as well as 6647 bp upstream of its start codon and 14937 bp downstream of its stop codon. After removal of the single intron (see below), the full-length transcript of the MiSp gene is 5440 bp and encodes 1766 amino acid residues. The corresponding MiSp is organized into non-repetitive N- and C-terminal domains and a predominantly repetitive region in between (Figure 1).
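The reported lengths can be cross-checked with simple arithmetic. The sketch below assumes that the intron (5628 bp, as reported in the discussion of gene architecture) lies between the start and stop codons and that the flanking distances are measured from the outer edges of those codons; both are assumptions of this illustration rather than statements from the text.

```python
UPSTREAM = 6647      # bp upstream of the start codon
DOWNSTREAM = 14937   # bp downstream of the stop codon
TRANSCRIPT = 5440    # spliced full-length transcript (bp)
RESIDUES = 1766      # encoded amino acid residues
INTRON = 5628        # intron length (bp), reported elsewhere in the paper

orf_len = 3 * (RESIDUES + 1)        # coding codons plus one stop codon
utr_len = TRANSCRIPT - orf_len      # transcript sequence outside the ORF
clone_len = UPSTREAM + orf_len + INTRON + DOWNSTREAM

print(orf_len, utr_len, clone_len)
```

The open reading frame accounts for 5301 bp of the 5440 bp transcript, leaving 139 bp of untranslated sequence, and the summed locus length of about 32.5 kb is in line with the approximately 33 kb clone.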
MiSp sequences from cDNA obtained by RT-PCR of mRNA from the minor ampullate glands of A. ventricosus reveal that the MiSp gene presented herein is transcribed and that the transcript is spliced at the predicted borders between the exons and the intron. Furthermore, the fact that we obtained two distinct cDNA sizes (370 and 239 bp) supports the conclusion that there is only one MiSp gene, containing two spacers, in A. ventricosus. If there are additional A. ventricosus MiSp genes, they are either very similar in spacer sequence and spacer distance to the one now described, or their spacer regions are so different that the primers did not anneal to their transcripts. Since the partial N. clavipes MiSp sequences were reported [4], several partial MiSps and MiSp-like proteins have been identified [8,22,33-35], but some of these sequences have not been assigned to either MiSp1 or MiSp2 since clear and discriminating classification criteria are lacking. The A. ventricosus sequence presented here is also difficult to assign according to Colgin and Lewis' criteria and is therefore referred to as MiSp.

MiSp repetitive region
Recurring overall patterns can be observed in the A. ventricosus MiSp repetitive region, but it lacks the higher-order organization seen for L. hesperus MaSp1 [21]. Generally, a complete repetitive unit is composed of GGXGGY-(GX)_a(A)_b(GX)_c-GGAGGYGGGX-(GX)_d, where a = 4-13, b = 3-5, c = 0-3 and d = 1-10. More precisely, the repetitive region of the A. ventricosus MiSp is dominated by the motifs GGX, GX, poly-A (3-5 consecutive alanines) and GGGX. These motifs are organized into four types of ensemble (ordered) repeat units: GGX-GGX-GX, (GX)_n-oligoA-(GX)_n, GGX-GGX-GGGX and (GX)_n (Figure 1). The repetitive region is divided by the two spacers (see below) into three parts, repetitive regions I, II and III. In repetitive region I the poly-A motif occurs frequently, while it occurs only three times in repetitive region II. Repetitive region III is the longest region, with poly-A occurring eight times. In contrast, the poly-A motifs of MaSps are in general longer and highly repeated [21]. The poly-A motif is not even the most abundant motif in A. ventricosus MiSp. However, (GA)_n occurs frequently in MiSp, whereas no (GA)_n motif is present in L. hesperus MaSp1 and it occurs rarely in MaSp2 [21]. It is commonly believed that poly-A as well as (GA)_n motifs form β-sheet assemblies in the silk [26,36,37], so it is possible that they to some extent fulfill the same function.
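The generalized repeat unit can be written as a regular expression, which makes the allowed ranges for a, b, c and d explicit. The sketch below is a hedged illustration: it reads the fourth block as the contiguous GGA-GGY-GGGX ensemble, takes Y to mean tyrosine and X to mean any residue (looser than the residues actually observed), and uses an invented toy sequence rather than real MiSp data.

```python
import re

# Generalized MiSp repeat unit as described in the text (assumptions in the lead-in)
UNIT = re.compile(
    r"GG[A-Z]GGY"        # GGX-GGY
    r"(?:G[A-Z]){4,13}"  # (GX)_a, a = 4-13
    r"A{3,5}"            # (A)_b,  b = 3-5
    r"(?:G[A-Z]){0,3}"   # (GX)_c, c = 0-3
    r"GGAGGYGGG[A-Z]"    # GGA-GGY-GGGX
    r"(?:G[A-Z]){1,10}"  # (GX)_d, d = 1-10
)

# Invented toy units, not taken from the MiSp sequence
toy_unit = "GGAGGY" + "GA" * 4 + "AAAA" + "GS" + "GGAGGYGGGY" + "GAGQ"
too_short = "GGAGGY" + "GA" * 2 + "AAAA" + "GS" + "GGAGGYGGGY" + "GAGQ"

print(bool(UNIT.fullmatch(toy_unit)))   # satisfies all four ranges
print(bool(UNIT.fullmatch(too_short)))  # only two GX repeats before poly-A
```

Because GA repeats directly abut the poly-A run, the boundaries between (GX)_a and (A)_b are not unique in a real sequence, so a pattern like this is better suited to flagging candidate units than to counting motifs exactly.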
The most common repeat is GX (X = A, Q, I, V, E, S, and D) (Figure 1). The distribution of the seven residues representing X among the repeats is uneven, with GA being by far the most common variant. The second most common motif is GGX (X = A, S, V, E, Y). The second glycine is sometimes replaced by alanine or serine, and the distribution of the five X residues is uneven. The third motif, which has previously been described only in Araneus diadematus fibroin (ADF)-1 [38], is GGGX, where X is always tyrosine or alanine. Notably, in A. ventricosus MiSp, repetitive motifs containing proline are absent.
Glycine (G) and alanine (A) are by far the most abundant amino acid residues in MaSp, MiSp and Flag spidroins [4,8,20-22,33-35]. Glycine, alanine, serine and tyrosine are the most common amino acids in the predicted A. ventricosus MiSp, constituting more than 78% of the entire spidroin, and glycine and alanine alone constitute more than 64% (Figure S1). Because the first two codon positions for alanine and glycine are guanine or cytosine, the MiSp gene is overall guanine/cytosine-rich (58%). The third positions of the glycine, alanine, serine and tyrosine codons in A. ventricosus MiSp are biased towards adenine and uracil (Figure S2), in accordance with the amino acid compositions and codon usage of MaSps from L. hesperus [21] and other silk proteins [9,19,39,40].
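The link between glycine/alanine content, overall GC content and third-position bias can be made concrete with a small calculation. This is a minimal sketch on an invented DNA-alphabet toy sequence (T stands in for uracil); the function names are hypothetical and no real MiSp sequence is used.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def third_position_counts(cds):
    """Count the bases found at the third position of each complete codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i + 2] for i in range(0, usable, 3))

# Gly-Ala-Gly-Ala-Gly-Ala, using only A or T at the third codon position
toy_cds = "GGTGCAGGAGCTGGTGCA"

print(round(gc_fraction(toy_cds), 3))
print(third_position_counts(toy_cds))
```

Even when every third position is A or T, the GGN and GCN codons keep the toy sequence at two-thirds GC, which illustrates how a glycine- and alanine-rich gene can be GC-rich overall while still showing A/U-biased wobble positions.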

Two identical nonrepetitive spacer regions
The repetitive region of A. ventricosus MiSp is interrupted by two serine-rich spacer regions (Figure 1). The 126-residue spacer regions are identical even at the nucleotide level and show no identity to any proteins except other spacer regions in MiSp and Flag. Serine and alanine, 17% and 20% respectively, are the most abundant residues in this region, while the glycine content (6%) is much lower in the spacers than in the repetitive regions (>36%) (Figure S1). Interestingly, the spacer regions are similar in both length and amino acid composition to the N-terminal domain although they are apparently non-homologous (Figure S1). The spacer region is not internally repetitive, with the exception of the sequence AAASS, which is present as a single tandem repeat. In the MiSp spacers characterized so far, the AFAQ, GLD and SA-rich motifs are highly conserved (Figure 4).
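Identity at the nucleotide level is a stricter criterion than identity at the protein level, because synonymous codons can differ while encoding the same residues. The following sketch illustrates the distinction with invented nine-base sequences and a deliberately partial codon table; none of the sequences come from the MiSp spacers.

```python
def percent_identity(a, b):
    """Percent identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

# Small subset of the standard genetic code, enough for the toy example
CODONS = {"GCA": "A", "GCT": "A", "TCT": "S", "TCC": "S"}

def translate(dna):
    """Translate a DNA string whose codons are all present in CODONS."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))

copy_1 = "GCAGCATCT"  # Ala-Ala-Ser
copy_2 = "GCTGCATCC"  # Ala-Ala-Ser with two synonymous substitutions

print(percent_identity(translate(copy_1), translate(copy_2)))  # protein level
print(round(percent_identity(copy_1, copy_2), 1))              # nucleotide level
```

In the toy case the two copies are identical as proteins but only about 78% identical as DNA, so two spacers that match base for base have not accumulated even silent differences.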

Non-repetitive N-and C-terminal regions
In A. ventricosus MiSp, the non-repetitive N-terminal region (Figure 1) is composed of a predicted signal peptide followed by a domain of about 130 residues. The latter is homologous to the five-helix bundle structure described for the Euprosthenops australis MaSp N-terminal domain [41], as judged by sequence alignments (Figure S3). The SignalP 4.0 program predicted the presence of a signal peptide starting at Met1 and ending with a signal peptidase cleavage site [42] with a probability score of 0.807 (Figure S3). Signal peptides were not predicted following any of the other Met residues in the N-terminal region. This supports the conclusion that the first Met represents the true translational start site of A. ventricosus MiSp, and that it is a secretory protein.
The A. ventricosus N-terminal domain sequence was analyzed by PSI-Pred v3.0 [43], which predicted five α-helices in essentially the same regions as for other N-terminal domains (Figure S3). In the A. ventricosus MiSp N-terminal domain, two cysteines are found in locations corresponding to helices 1 and 4, respectively. The locations of these Cys correspond well with those of conserved Cys in helices 1 and 4 from TuSp, CySp, MaSp2, and MiSp (Figure S3). Judging from the crystal and NMR structures of the E. australis N-terminal domain [41,44], the two cysteines in the A. ventricosus MiSp N-terminal domain are localized so that they can form an intramolecular disulphide. Likewise, two cysteines present in the Deinopis spinosa MaSp2 N-terminal domain in locations corresponding to helices 1 and 2 (Figure S3) are situated so that they can form an intramolecular disulphide.
Phylogenetic analysis based on multiple sequence alignments of N-terminal domains (Figure S3), with the predicted signal peptides removed, shows that A. ventricosus MiSp clusters according to glandular origin (data not shown), as previously described for other spidroins [9,25].
As for the N-terminal domain, the A. ventricosus MiSp C-terminal domain is homologous to previously described C-terminal domains from other spidroins and species (Figure S4). In the MiSp C-terminal domain, alanine, glycine, serine and valine are common (about 64%), and the glycine content (about 11%) is less than half of that in the repetitive region but about three times that in the N-terminal region (Figure S1). Notably, cysteine is absent in the A. ventricosus MiSp C-terminal domain, in contrast to most other known such domains, which form disulphide-linked homodimers [39,45]. The C-terminal domain is evolutionarily conserved, but to a lesser extent than the N-terminal domain (Figures S3 and S4). Notably, the Flag C-terminal domain is divergent, to the extent that it is difficult to align it with other such domains (Figure S4).

Overall MiSp properties
The MiSp characterized herein and previously described MaSps have glycine- and alanine-rich motifs that occur in ensemble repeats, but the repeat organization (repetitiveness) and similarity of repeat copies (homogenization) differ. In L. hesperus MaSps, four types of ensemble repeats are strung together to form higher-level repeat units that can be tandemly arrayed up to twenty times, and the iterations share high identity at both the amino acid and nucleotide level [21]. A. ventricosus MiSp shows more sequence and length variation among its ensemble repeats than L. hesperus MaSps. The modular architectures of A. ventricosus MiSp and L. hesperus MaSps likely reflect concerted evolution within a single gene, as has been implicated in maintaining similarity among Flag (440 residues) ensemble repeats and the long repeats of tubuliform spidroin (TuSp1, 200 residues), aciniform spidroin (AcSp1, 200 residues) and pyriform spidroin (PySp, >200 residues) [9,11,15,16,40,46]. Hydropathy profiles predicted according to the method of Kyte and Doolittle [47] show an alternating profile for essentially the entire A. ventricosus MiSp, where the regions of hydrophobicity correspond to the poly-A motifs and hydrophilic regions correspond to the glycine-rich regions (Figure 5). A closer inspection, however, indicates that the spacers are mainly hydrophilic while the C-terminal domain is overall mainly hydrophobic. The latter likely reflects the ability of the C-terminal domain to dimerize through hydrophobic interactions [48]. The N-terminal domain shows pronounced hydrophilicity in the region between residues 40 and 60, in agreement with its ability to promote spidroin solubility at neutral pH [41].
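A Kyte-Doolittle profile is a sliding-window average of per-residue hydropathy values. The sketch below uses the published Kyte-Doolittle scale; the window size of 9 is an assumption (the text does not state the window used), and the example sequences are invented.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy for every full window along the sequence."""
    scores = [KD[res] for res in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Invented segments: a poly-A stretch versus a glycine-rich stretch
poly_a = hydropathy_profile("AAAAA", window=5)[0]
gly_rich = hydropathy_profile("GGSGG", window=5)[0]
print(poly_a > 0, gly_rich < 0)
```

With this scale a poly-A window scores positive and a glycine-rich window scores slightly negative, which is the alternation described for the repetitive region.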

Variable spidroin gene architectures
The two MiSp spacer regions are identical even at the nucleotide level, a phenomenon not previously observed for any spidroin gene. Moreover, in contrast to the long MiSp spacers, the spacers of Nephila and Nephilengys Flag are only 27 residues (Figure 4B) and the spacers of Araneus and Argiope Flag are even shorter, just 9 residues (Figure 4C). Nephila and Nephilengys Flag spacers contain the highly conserved motifs EDLDIT and GPITI-SEEL, while the short Araneus and Argiope Flag spacers are very rich in valine but lack longer conserved motifs (Figure 4B and 4C). Not only do the amino acid sequences differ between MiSp and Flag spacer regions, but their predicted secondary structures are also entirely different. While the MiSp spacers are predicted to contain mainly long stretches of α-helical structure, both the 27- and 9-residue Flag spacers are predicted to contain β-strands only (Figure 4).
The A. ventricosus MiSp gene is the first characterized spider silk gene with a single intron (Figure 6). In contrast, the previously described MiSp cDNA fragments and genomic DNA restriction mapping of N. clavipes show no detectable introns [4]. In line with this, L. hesperus MaSp1/2 full-length genomic DNA sequences, and L. geometricus MaSp1 and Nephila MaSp2 fragments, reveal that these spider silk genes are composed of a single large exon [21,24]. The TuSp1, AcSp and PySp genes are also suggested to contain single exons [40,49]. On the other hand, the A. trifasciata MaSp2 gene and the Flag gene contain multiple introns that are almost identical within the respective gene (Figure S5) [20,24]. Thus, known spider silk genes differ in exon-intron architecture, even between species for the same silk gene (e.g., MaSp2). Once an intron invades a gene, it can be rapidly propagated throughout the gene, a phenomenon that appears to be common in silk genes [20,40,50]. However, this is not the case for the A. ventricosus MiSp gene. In general, the introns within a spider silk gene show a high degree of similarity, but between different silk genes the introns are more diverse (Figure S5). In line with this, the A. ventricosus MiSp intron differs from the Flag and MaSp2 introns both in terms of size and sequence (Figure 6 and Figure S5).
Many eukaryotic genes are intronless [51], but proteins encoded by single exons are strongly biased towards smaller sizes (less than 1000 amino acids) [52,53]. Spider silk genes are atypical in this respect: the intronless L. hesperus MaSp1 and MaSp2 genes code for 3129 and 3779 amino acid residues, respectively [21], while the intron-containing gene for A. ventricosus MiSp codes for 1766 amino acids. Also, the MiSp intron is large, 5628 bp, compared to introns in other silk genes (496-2620 bp, Figure 6 and Figure S5). Generally, intron length is negatively correlated with expression level [54-56], but minor ampullate silk genes are likely highly expressed.