Massively Parallel Amplicon Sequencing Reveals Isotype-Specific Variability of Antimicrobial Peptide Transcripts in Mytilus galloprovincialis

Background Effective innate responses against potential pathogens are essential in the living world and possibly contributed to the evolutionary success of invertebrates. Taken together, antimicrobial peptide (AMP) precursors of defensin, mytilin, myticin and mytimycin can represent about 40% of the hemocyte transcriptome in mussels injected with viral-like and bacterial preparations, and unique profiles of myticin C variants are expressed in single mussels. Based on amplicon pyrosequencing, we have ascertained and compared the natural and Vibrio-induced diversity of AMP transcripts in mussel hemocytes from three European regions. Methodology/Principal Findings Hemolymph was collected from mussels farmed in the coastal regions of Palavas (France), Vigo (Spain) and Venice (Italy). To represent the AMP families known in M. galloprovincialis, nine transcript sequences have been selected, amplified from hemocyte RNA and subjected to pyrosequencing. Hemolymph from farmed (offshore) and wild (lagoon) Venice mussels, both injected with 107 Vibrio cells, were similarly processed. Amplicon pyrosequencing emphasized the AMP transcript diversity, with Single Nucleotide Changes (SNC) minimal for mytilin B/C and maximal for arthropod-like defensin and myticin C. Ratio of non-synonymous vs. synonymous changes also greatly differed between AMP isotypes. Overall, each amplicon revealed similar levels of nucleotidic variation across geographical regions, with two main sequence patterns confirmed for mytimycin and no substantial changes after immunostimulation. Conclusions/Significance Barcoding and bidirectional pyrosequencing allowed us to map and compare the transcript diversity of known mussel AMPs. Though most of the genuine cds variation was common to the analyzed samples we could estimate from 9 to 106 peptide variants in hemolymph pools representing 100 mussels, depending on the AMP isoform and sampling site. In this study, no prevailing SNC patterns related to geographical origin or Vibrio injection emerged. Whether or not the contact with potential pathogens can increase the amount of AMP transcript variants in mussels requires additional study.


Introduction
Mytilus species (Phylum Mollusca, Class Bivalvia) are intertidal filter-feeders distributed worldwide, anchored to hard substrates in dense communities and widely used as bio-sensors of coastal pollution. Mussel populations of the northern and southern hemisphere probably separated 0.54-1.31 million years ago, far after the trans-Arctic expansion towards North America, and before the divergence between the Atlantic and Mediterranean ecotypes [1]. M. galloprovincialis hybridizes with M. edulis in southwest England and the Mediterranean mussel is now reported in Eastern Asia, California, Chile and Western Australia [1][2].
At different latitudes, mussels face tidal and seasonal fluctuations, changeable pollutant loads and also the surrounding bioma with behavioral changes [3], metabolic adjustments [4] and a variety of defense reactions [5][6][7]. With the exception of a few metazoan parasites which also somewhat affect the Mytilus species [8][9], mussels seem refractory to diseases and could instead influence the prevalence of pathogens such as Perkinsus spp. and Betanodavirus (Nodaviridae) in other bivalves and fishes, respectively [10]. Like other invertebrates, bivalve molluscs rely on ancient and rapid defenses to fight potential pathogens, and gene-encoded antimicrobial peptides (AMPs) are major humoral components of their immune system.
Host defense peptides are present in virtually all living organisms, with more than 30 AMPs expressed in humans and about 200 peptides identified in insects (approximately 1500 molecules very diverse in sequence and secondary structures are reported in specific databases). [11][12][13]. Among other structural features, a conserved c-core motif originated from the bidirectional orientation of specific aminoacid residues including an invariant cysteine array has indicated the evolutionary relatedness of cysteine-stabilized a-b (CS-ab) AMPs, kinocidins, invertebrate toxins and snake venoms: such unifying structure provides an interesting hypothesis for context-specific action modes, from the perturbation of negatively charged cell membranes and ion channels to the immunoregulatory functions [14].
In the continuous fight with competitors, predators and pathogens, the evolutionary diversification of AMP types and gene families likely occurred through events of gene duplication, shuffling of functional elements and selection for variation at positions adjacent, or integral to, the conserved structural motifs [13,15]. In Crassostrea gigas, combined mechanisms of sequence diversification (e.g. recombination, parallel homoplasic mutations, indel events) and directional selection have been suggested to explain the remarkable gene multiplicity and variable copy number of defensins and proline-rich peptides, whereas the marked transcript diversity of Cg-bpi, a bactericidal permeability protein, has been mainly referred to the allelic polymorphism of one single gene [16]. Microsatellite-mediated mosaics of sequence elements, low-transcription fidelity and transcript editing support the evidence of about 50 polymorphic genes, and an extraordinary diverse set of Sp185/333 proteins expressed in response to pathogens by the purple sea urchin [17][18]. Worthy of note, the copy number polymorphism of a and b defensin genes with proportional peptide levels in neutrophylic granulocytes, has been related to the individual risk of infection in humans [19][20].
Tens of different AMPs or AMP families have been discovered in marine invertebrates [21]. In the mussels M. galloprovincialis and M. edulis, four different groups of CS-ab AMPs with multiple isoforms have been discovered and classified according to their primary sequence and secondary structure: defensins reported as MGD1 and MGD2, mytilin A, B, C, D and E, myticin A, B and C, mytimycin, the only strictly antifungal peptide with an EF-hand like domain [22][23][24][25]. These AMPs share small size (3.7-4.5 kDa, except mytimycin of 6.2 kDa), positive charge and amphiphilic behavior. Their precursors (pre-pro-peptides) consist of an Nterminal signal peptide, a central mature peptide and a C-terminal extension. Each family is characterized by a cysteine array of 8 (12 in mytimycin) cysteines engaged in intramolecular disulfide bonds.
A broad spectrum of activity, often complementary and not strictly antibacterial, was reported for the mussel defensins, mytilins and myticins [26][27] whereas mytimycin, a 6.2 kDa peptide isolated from normal and immunostimulated mussels, selectively inhibited Neurospora and Fusarium growth [22]. Whether purified in sufficient amounts from cellular fractions or obtained in stable conformations by chemical synthesis or recombinant system, pure peptides are essential to investigate the antibiotic power of the different mussel AMPs.
In situ hybridization and immunolocalization assays performed on mussel hemocytes demonstrated a partially overlapping expression of defensins and mytilins [28]. AMP expression and stored peptides have been observed in several tissues and developmental stages [28][29] to indicate that cells other than hemocytes can produce and release AMPs, a phenomenon well known from frog skin [30] and the male reproductive system of rats [31]. Overall, mussel AMPs display rather complex expression patterns, dependent on developmental stage, seasonality and immunostimulation [32][33][34]. In M. galloprovincialis, massive EST sequencing confirmed the abundance and transcript diversity of AMPs and other key players of the innate immunity [35]. The AMP precursors represented 26-43% of the hemocyte transcripts in mussels injected with viral-like and bacterial preparations; in particular, 74 precursor and 25 mature peptide variants of myticin C were detected in a sample of only 100 mussels, with unique profiles of transcript variants in single mussels and less common alleles differing at single nucleotide positions from the two most common ones [29,36].
The myticin C variation is also remarkable compared to mytilin B, as one mussel can produce 2-10 different mytilin B transcripts but silent substitutions restrict the peptide variants to only a few [37]. In spite of the abundance of other AMPs, just one singleton plus 4 similar sequences denote mytimycin in Mytibase, interactive catalogue including 18788 expressed sequence tags (ESTs) of M. galloprovincialis [25]. Sequencing and Southern blot data indicate one gene copy per genome for defensin MGD2, mytilin B and myticin C [38,36]. Two gene copies or allelic polymorphism could explain the simultaneous presence of two length variants of the mytimycin gene per mussel [39]. The gene copy number of the mussel AMPs need verification since partial gene sequences covering the coding sequence (cds) are only available for MGD1, mytilin B, myticin C and mytimycin [29,[35][36][37][38][39]. Two 3D structures have been established by NMR spectrometry, defensin MGD1 [40] and mytilin B [41].
Thus far no Mytilus genome has been sequenced and, compared to better known model organisms, a limited number of genes have been investigated: for instance those concerning defensin and mytilins [38], heat shock proteins [42], metallothioneins [43] and apoptotic caspases [44]. Also gene-centered studies take advantage of the massive production of ESTs which currently contributes to the identification of molecules and pathways underlying the mussel response to various natural and experimental conditions [45][46][47][48]. Among the 67,726 ESTs and 4680 aminoacid sequences publicly available for the Mytilus genus, about 29 and 32%, respectively, refer to M. galloprovincialis (www.ncbi.nlm.nih.gov, June 2011). Recently, new ESTs have been produced from the digestive gland, foot, gill and mantle of M. galloprovincialis by advanced sequencing [49]. The so called 'pyrosequencing' was the first alternative to the use of chain-terminating inhibitors [50] and it has radically increased the sequencing power as well as the resolution of lowabundance variants [51][52][53]. Adequate read coverage can assure reliable quantification of single nucleotide changes (SNC) when seeking critical mutations or sequence polymorphisms.
Based on 454 pyrosequencing, we have thoroughly studied the sequence diversity of 9 different AMP precursors expressed in hemocytes of mussels (M. galloprovincialis) farmed in three European regions. Similarly, we have analyzed and compared mussels farmed offshore or living inside the Venice Lagoon (Italy), before and after injection with live Vibrio cells.

Results
In Table 1 we summarized the main features of nine CS-ab AMPs expressed in M. galloprovincialis.
Following appropriate primer design, we amplified the related transcript sequences from hemolymph pools representing groups of 100 mussels farmed in south France (Pa), northwest Spain (Vi) and northeast Italy (Ve) or native from the industrial canals of the Lagoon of Venice (Ve nc). In addition, groups of 40 offshorefarmed (Ve ft) and lagoon-native (Ve nt) mussels were injected with 10 7 live Vibrio splendidus cells and similarly processed ( Table 2). The resulting 78 PCR products (13 amplicons66 samples) were then purified, quantified, diluted to the appropriate concentration (,7*10 9 molecules/ml) to compose two equimolecular pools for the emulsion PCR (cDNA concentration was 2.3 and 2.5 ng/ml, respectively) and bidirectional sequencing.
Overall, massively parallel sequencing produced 359,867 output reads and more than 73 Mbases with good quality scores, for a total of 304,621 trimmed reads with 226 bp average length. We discarded about 15% of reads per sample (88% of them shorter than 70 bp) which possibly originated in the PCR amplification or sequencing reaction (short sequences are not expected to bias the amplicon coverage nor the accuracy of SNCs detection).
Total or partial overlapping of the forward and reverse reads allowed the complete coverage of the 13 reference sequences. Hence, 97.5% of the good quality reads correctly mapped against the 9 selected AMPs and could be attributed to the 6 original samples. With the exception of MytM in the sample Ve ft, at least 1034 reads mapped on each AMP precursor transcript (range 1034-10814, File S3). Figure 1 shows relevant differences in the average base coverage calculated per AMP precursor transcript in each sample (33786, the average read depth per AMP).
Separately for each AMP and sample, we subsequently grouped the reads having the same length and 100% identity in two cluster types: (1) equal to the original transcript and (2) with at least 1 SNC. The last ones were employed for SNC detection and related analysis. Only the SNCs covered 306, and representing at least 3% of the reads mapping a given AMP, were considered genuine and counted per AMP precursor (Table 3). Based on the SNC counts, the average value of SNC per base calculated for the coding sequence was 0.18. Considering all AMPs together, as much as 134 SNCs were common to the 6 samples and represented 86% of the genuine cds variation (File S4). The SNC frequency values (cds) indicated MytC and MGD1 as the most polymorphic AMP transcripts, opposite to those of mytilins B and C. The fraction of non-synonymous changes ranged from null (MytlC) to 25 (MytC) whereas the related v values highlighted the opposite cases of MytlC (null) and MytlD (9-15).
All SNC frequency values were used to test statistically the sequence diversity of the 9 selected AMPs (one-way ANOVA, a = 0.001, followed by Tukey's Honestly Significant Difference test, a = 0.05). The null hypothesis (equal SNC frequency between AMPs) was rejected and, according to the Tukey's HSD test we classified the 9 cases from the least changeable mytilins, to the most polymorphic myticins and defensins ( Figure 2).
Common and exclusive SNCs are reported in Figure 3 per geographical region: the common changes represent the majority, with 68, 74 and 78% in the Vi, Ve and Pa samples, respectively. Similar percentages of common SNCs were found in mussels farmed offshore or living wild in the industrial canals of the Venice lagoon (Ve versus Ve nc), injected or not with 10 7 live V. splendidus (Ve versus Ve ft, Ve nc versus Ve nt). Details on the immune stimulation and related host response are reported elsewhere [34,35].
Finally, we analyzed the nucleotidic and aminoacidic substitution patterns of each AMP precursor, separately in the samples from Palavas, Vigo and Venice. The percentage of read clusters differing at least 1 SNC ranged from 56% (MytlD) to 100% (MytB and MGDt). The read clusters covered at least 36 were virtually translated into amino acids, and redundancy due to silent substitutions was removed. As reported in Table 4 we could estimate a number of AMP transcript variants ranging from 9 (Pa MytM) to nearly 100 (Ve MytB, Vi MGD1, Ve MGD1 and Ve MGDt).
The facts emerging from 454 pyrosequencing of the Mytimycin and Myticin C amplicons are reported here below in more detail as instructive examples.
The two gene sequences publicly available for the MytM of M. galloprovincialis (FJ804479.1, FJ804478.1) denote 3 exons and 2 introns, and differ only in the length of intron 2; the short and long version of it occurring simultaneously in single mussels [39]. The amplicon designed in the present work covered 429/456 bp, i.e. 94% of the cds.
The MytM pyrosequencing yielded 6645 aligned reads (33, 30 and 35% of them differing from the reference sequence in the Pa, Vi and Ve samples, respectively).
A nucleotide change in position 58 (Thymine in the place of Cytosine) was detected in all reads and was not considered as SNC because it could represent an error occurred during Sanger  (27,199 and 31 of high quality), respectively. In total, 5690 high quality sequences (86%) were translated in amino acids and produced 9-49 expected peptide variants ( Table 4). Irrespective of the geographical origin, we could consistently identify the two most abundant MytM types: a consensus very similar to the original sequence (MytM_1, MGC05878) and a second one (MytM_2) similar to the sequence MytM-P recently described [39]. Jointly, MytM_1 and MytM_2 represent the 87, 78 and 76% of the MytM reads in the samples Pa, Vi and Ve, respectively (Vi MytM display higher sequence variability than the other two samples).
MytM_2 Partial gene sequences including the cds have been reported for MytC and denote 3 exons and 2 introns, with the mature peptide entirely located in exon 2 [29,36].
The MytC pyrosequencing yielded 22,119 aligned reads fully covering the cds (71%, 88% and 79% of them differing from the original sequence in the Pa, Vi and Ve samples, respectively). The Pa, Vi and Ve MytC reads could be grouped in 844, 777 and 468 clusters or singletons (823, 752 and 466 of high quality), respectively. In total, 18113 high quality reads (82%) were translated in amino acids and produced 41-93 expected peptide variants ( Table 4) without evidence of prevailing variation patterns.
Despite the remarkable number of sequence variants, 98.8% of the MytC peptide clusters retained the typical cysteine array.

Discussion
This study intended to assess, by high-throughput amplicon sequencing, the natural variability of nine AMP precursor sequences found expressed in the Mediterranean mussel. For this purpose, we sampled mussels from farming sites subjected to common European regulations and sanitary controls in three producer countries.
Thirteen sequences almost covering the AMP cds (File S1) have been successfully amplified from hemocyte RNA samples representing mussels farmed in France, Spain and Italy, as well as native mussels of the Venice lagoon area, before and after Vibrio injection, for coupled comparison.
Sample preparation is a decisive step of the sequencing workflow, due to the difficulty in preparing a well-balanced unique amplicon pool for the emulsion PCR and subsequent pyrosequencing. For this purpose, we measured and equalized the concentrations of each amplicon with great attention before pooling. Although the two sequencing half plates produced similar read numbers per sample, a general variability of coverage depth between AMP amplicons was finally evident (Table 3, Figure 1). Nevertheless, we obtained at least 5006 amplicon coverage with one only exception (MytM in the Ve ft sample).
Stringent criteria were then used to identify genuine SNCs, removing false positives without losing substantial information,  and we could retain most of the output reads (82% and 86% to exemplify Myticin C and Mytimycin, respectively). At first look, the output reads of each AMP appeared very diverse, 88% with at least one SNC on average (100% for MGDt, no matter from which sample). The analysis of SNC frequency per base enabled us to rank the selected AMPs on the basis of the transcript variability ( Figure 4). Despite the invariance of the cysteine array, each AMP showed typical levels of diversity irrespective of the geographical origin, with a majority of common SNCs present in all samples (86%, i.e. 134 SNCs) and the Vi sample showing the greatest number of SNCs (292 in total). Moreover, we did not see evidence of increased AMP sequence diversity in farmed and native mussels injected with a high dose of live Vibrio cells. Compared to the samples prepared from 100 mussels (Pa, Vi, Ve, Ve nc), those prepared from 40 mussels (Ve ft, Ve nt) showed fewer SNCs (211%); a fact indicating that the sample size can limit the amount of detectable sequence variants. The ratio of non-synonymous vs. synonymous changes (v) substantiates the evolutionary diversification of the mussel AMP isotypes and suggests the functional advantage of transcript variability for most of the analyzed AMPs (v values higher than 1, indicative of positively selected residues, were frequently detected). However, the v values did not reflect precisely the classification based on SNC frequencies and some AMP (e.g. MytlD) may have been subjected to higher evolutionary pressure than others.
The virtual translation of the transcript consensuses resulting from read clustering allowed the identification of AMP isotypes with a relatively low number of SNCs and high sequence diversity according to different SNC combinations (e.g. MytB and MGDt). In the case of mytimycin, we confirmed two major sequence types previously described [39] with no evidence of additional variants. For the remaining AMPs, it was not possible to identify specific patterns of variation (amino acid changes combined together without scheme).
In conclusion, the sequence data reported in this study further emphasize the sequence diversity of mussel AMP precursors. Redundant expression of diverse AMPs with a broad range of action could be regarded as a strategy to reinforce the host response against invaders (foes trying in their turn to escape detection and the host reactions) while the immune system also has to maintain the organism homeostasis with appropriate responses to commensal microbes (friends) and to danger signals released by damaged host cells [54]. On the other hand, environmental factors act as selective force only if they change the distribution of host genotypes (affecting only some genotypes, not all), thus influencing the immune system evolution of the host in the context of its lifehistory and population traits [55].
The isotype diversity levels found in this study might result from events occurring at DNA level as well as post-transcriptional changes such as deaminase-mediated cytidine to uridine transitions [14,[56][57] Table 3. doi:10.1371/journal.pone.0026680.g002 extension strategies applied to genomic DNA could identify active and remnant gene copies of each AMP isotype, and reveal the mechanisms underlying the observed sequence variation.

Materials and Methods
Sampling sites, treatment and RNA extraction Adult mussels (Mytilus galloprovincialis) with a shell length of 6-8 cm and mixed sex were obtained from commercial shellfish stocks near Palavas (Pa, Mediterranean Sea, France; 43u31949 N, 03u54953 E), Ria de Vigo (Vi, Atlantic Ocean, Spain; 42u14932 N, 08u48926 E) and off-shore Venice (Ve, North Adriatic Sea, Italy; 45u18929.8 N, 12u21932.0 E). In addition, we collected wild mussels from the industrial canals of Porto Marghera (Lagoon of Venice, Italy; 45u27933.5 N, 12u15941 E). More than 100 animals per group were sampled.
According to the EU Directive 91/492, mussels cultivated in waters classified A (e.g. J mile off-shore in the Adriatic Sea) can be marketed without depuration and are assumed not to contain potential pathogens nor biotoxins. Due to heavy mixed pollution, shellfishing was prohibited since 1996 in the area from the industrial district (P. Marghera) to the Venice town, though the overall shellfish quality can be improved by 2 month-depuration in type A waters. Mussels farmed offshore or living in the industrial canals (Venice lagoon area) were acclimatized for one week in sea water collected at flood tide (32%, 22uC) and fed with Isochrisis galbana. Following shell notching, 0.1 ml of exponentially growing bacteria (10 7 V. splendidus LGP32 cells) were injected into the posterior adductor muscle (samples Ve ft and Ve nt).
Hemolymph (1 ml per animal) was withdrawn from the posterior adductor muscle with a syringe containing 0.2 ml of Alsever solution (27 mM sodium citrate, 2.6 mM citric acid, 114 mM glucose and 72 mM NaCl in distilled water) adjusted at pH 7.4, and used to compose pools, each representative of 10 animals.
Haemocytes were pelleted by 15 min centrifugation at 800 xg (4uC), carefully resuspended in 200 ml of TRIZOL reagent (Invitrogen, Carlsbad, USA) and stored at 280uC until use. Total RNA was isolated according to the manufacturer's instructions and resuspended in RNAse-free water. A further purification step with LiCl 2 M was applied to remove possible contaminants. RNA concentration was measured by UV-spectrometry (ND1000, NanoDrop Technologies, Wilmington, USA) and the RNA integrity was verified by microcapillary electrophoresis (RNA 6000 Nano LabChip, Agilent Technologies, Palo Alto, USA).
Finally, equal quantities of each RNA pool (N = 10) were mixed together to compose a unique pool per sample (N = 100 mussels for samples Pa, Vi, Ve and Ve nc; N = 40 for samples Ve ft and Ve nt). cDNA cDNA was synthesized from 1 mg of total RNA using SuperScript II enzyme and oligo(dT) 18 primers (Invitrogen), following the manufacturer's instruction. To increase cDNA yield, the reaction was extended for a second hour, adding 0.5 ml of enzyme. cDNA was then purified with MinElute PCR Purification Kit (Qiagen, Hilden, Germany).
All available ESTs for each selected AMP precursor were aligned using ClustalW [58] and primers were designed on conserved regions flanking the cds, whenever possible, with Primer3 [59]. Due to the pyrosequencing limit of about 250 bp, read length in forward and reverse direction, the maximum length of the PCR products was set at 440 bp (File S1). Degenerated primers have been designed to consider the whole ESTs variability (MytlC, MGD1 and MGDt). Two partially overlapping amplicons were designed for longer cds (MytC, MytlB, and MytlC) or in cases where high sequence variability made the definition of a single primer pair difficult (MytA). Thereby, 13 amplicons were designed in total. Amplicon's specificity was tested using the BlastX algorithm [60].
A tagged sequencing strategy with 59nucleotide barcodes was implemented to facilitate the parallel processing of multiple samples [61][62]. Briefly, the forward and reverse PCR primers were modified by 59-addition of 39 unique 5-mer barcodes (File S2). Barcodes enable the identification of the 454 reads corresponding to specific AMP, amplicon and sample so that PCR amplicons derived from multiple reactions could be combined for the sequencing run. To reduce the likelihood of misidentification, barcodes were designed not to contain homopolymers and to differ each one by at least two bases according to Roche Life Science protocols. Finally, 19-mer sequences corresponding to either the 454 Roche A Adaptor (for forward primers) or B Adaptor (for reverse primers) were fused to each PCR primer (Fusion primer).
Fusion primers were designed in two sets of 39 (13 amplicons63 samples), with each primer pair having a unique barcode.

PCR amplification
The PCR amplifications of 78 amplicons were carried out individually in a PCR volume of 50 ml with 20 ng of cDNA template, 16 Phusion HF buffer, 0.2 mM dNTPs, 1 U HF Phusion DNA polymerase, 1.5 ml DMSO and 0.2 mM of both forward and reverse primers. Amplification was performed in a Mastercycler Gradient Thermal Cycler (Eppendorf, Hamburg, Germany) programmed as follows: 98uC for 30 s followed by 35 cycles of 98uC for 10 s, 60-65uC for 20 s, 72uC for 30 s and a final extension step at 72uC for 5 min.
The resulting amplification products were run on a 2% agarose gel and visualized by SYBR Gold staining (Invitrogen) using UV light transillumination (Gel Doc XR System, Bio-Rad, Hercules, USA). Unspecific small products and primer-dimers were removed using the Agencourt AMPure system (AmPure PCR Purification kit, Brea, USA) and amplicons integrity was confirmed with Agilent 2100 Bioanalyzer (DNA-1000 chip). Good quality amplicons were finally used to compose an equimolecular pool; the number of molecules of each amplicon was calculated with the following formula: Molecules=ml~C|NA=(bp w |10 9 |bp) C: sample concentration (ng/ml) N A : Avogadro constant

Data analysis
Tag sequence were used as keys to part the unprocessed 454 reads into the 6 different samples by means of the GS Amplicon Variant Analyzer Software (AVAST, Roche Life Sciences). Reads in raw format were trimmed using quality score (limit 0.05) and minimum length equal to 100 bp. Subsequently, the output reads were aligned to a backbone consisting of the 9 AMP original transcripts, as obtained from the 13 reference sequences, with CLC Genomic Workbench version 4.6 (CLC Bio, Katrinrbjerg, Denmark). The total number of sequenced bases divided by the length of the amplified transcript provided the average base coverage per AMP.
The reads of each mapping (AMP isotype) showing the same length and 100% similarity were clustered together. Single nucleotide changes (SNCs) were detected considering all the aligned reads of each mapping. Non-specific and low quality matches are ignored during the process and SNCs were considered genuine only when covered at least 306, with a minimum frequency of 3%, and setting the quality level of the changed base and surrounding bases to at least 20 and 15, respectively. SNCs located in the same codon were merged. The expected amino acid changes in the precursor and cds peptide sequences were deduced by virtual translation, and the ratio (v) between non-synonymous and synonymous changes was also computed. SNC frequency per base in the cds region was calculated for each AMP and sample with the following formula: To assess possible differences in levels of sequence diversity between the AMP precursors amplified from each of the 6 samples, data were analysed with 1-way ANOVA (a = 0.001). The null hypothesis predicted that all AMPs had the same variation rate in all samples. If not, Tukey's Honestly Significant Difference test (HSD, a = 0.05) could then discriminate different AMP groups. Genuine SNCs were mapped on the 9 AMP sequences using CLC Genomic Workbench. The cds, signal peptide, mature peptide with the cysteine array and C-terminal regions were systematically localized.
Once the correct sequence reading frame was defined the sequences covered at least 36 were virtually translated to investigate the overall patterns of variation for each AMP transcript precursor in the Pa, Vi and Ve samples. Leftover low quality read ends were manually trimmed and redundancy was removed by using Jalview [64].

Supporting Information
File S1 Number of reads per AMP and sample.