De novo transcriptome analysis and microsatellite marker development for population genetic study of a serious insect pest, Rhopalosiphum padi (L.) (Hemiptera: Aphididae)

The bird cherry-oat aphid, Rhopalosiphum padi (L.), is one of the most abundant aphid pests of cereals and has a global distribution. Next-generation sequencing (NGS) is a rapid and efficient method for developing molecular markers. However, transcriptomic and genomic resources of R. padi have not been investigated. In this study, we used transcriptome information obtained by RNA-Seq to develop polymorphic microsatellites for investigating population genetics in this species. The transcriptome of R. padi was sequenced on an Illumina HiSeq 2000 platform. A total of 114.4 million raw reads with a GC content of 40.03% was generated. The raw reads were cleaned and assembled into 29,467 unigenes with an N50 length of 1,580 bp. Using several public databases, 82.47% of these unigenes were annotated. Of the annotated unigenes, 8,022 were assigned to COG pathways, 9,895 were assigned to GO pathways, and 14,586 were mapped to 257 KEGG pathways. A total of 7,936 potential microsatellites were identified in 5,564 unigenes, 60 of which were selected randomly and amplified using specific primer pairs. Fourteen loci were found to be polymorphic in the four R. padi populations. The transcriptomic data presented herein will facilitate gene discovery, gene analyses, and development of molecular markers for future studies of R. padi and other closely related aphid species.


Introduction
The bird cherry-oat aphid, Rhopalosiphum padi (L.), is a notorious insect pest that devastates wheat crops globally [1][2][3][4]. This species can reduce both the yield and quality of wheat by sucking sap and transmitting barley yellow dwarf virus (BYDV), which leads to serious economic damage to wheat production [1]. In the past few years, due to global climate change, farming systems, wheat varieties, anthropogenic effects, and other factors, the damage to wheat caused Insect materials, RNA extraction, cDNA library construction, and Illumina sequencing A apterous parthenogenetic female of R. padi was collected from a wheat field at Northwest A&F University, Yangling, Shaanxi, China in July 2013 (108˚05'E, 34˚17'N) to set a clone (parthenogenetic line) on seedlings of wheat (Triticum aestivum) cultivar "Xiaoyan 22" at 24 ± 1˚C, 40% RH, and a 16:8 h (L:D) photoperiod. Ten aphids were randomly selected from the clone for RNA extraction. The total RNA of R. padi was isolated using TRIzol reagent (Tiangen Biotech, Beijing, China) with minor modifications at the recovery step, in which RNase-free filter columns (Sangon Biotech, Shanghai, China) were used. RNA quantity and quality were assessed by gel electrophoresis and spectrophotometry, respectively. Ribosomal RNA (rRNA) was depleted from RNA samples using the Ribo-Zero™ rRNA Removal Kit (Human/Mouse/Rat) (Epicentre, Madison, USA) following the manufacturer's instructions. Around 20 μg purified RNA samples were sent to Beijing Genomics Institute (BGI) (Shenzhen, China) for cDNA library construction. Briefly, poly-T oligo-attached magnetic beads (Illumina, San Diego, CA, USA) were used to isolate poly (A) RNA from total RNA. Then a Super-Script Double-Stranded cDNA Synthesis kit (Invitrogen, Camarillo, CA) was employed for double-stranded cDNA synthesis with random hexamer (N6) primers (Illumina, San Diego, CA, USA). To normalize cDNA, the frequency of abundant cDNA species was reduced using Trimmer-2 cDNA Normalization Kit (Evrogen, Moscow, Russia). The T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase were used for end-repair and phosphorylation of synthesized cDNA. These repaired cDNA fragments were 3' adenylated using Klenow Exo-(Illumina, San Diego, CA, USA). Illumina paired-end adapters were ligated to the ends of these 3'-adenylated cDNA fragments. The products of this ligation reaction were electrophoresed on a 2% (w/v) Tris-acetate-EDTA-agarose gel for downstream enrichment with different sizes. cDNA fragments of 200 (± 25 bp) were excised from the gel. Fifteen cycles of PCR amplification were performed with PCR primers (PE 1.0 and PE 2.0) and Phusion DNA Polymerase to enrich the quantity of purified cDNA template. The majority of the amplified fragments in the Illumina library were about 200 bp in size. Four samples were sequenced per lane on an Illumina HiSeq 2000 platform (Illumina Inc., San Diego, CA, USA). Paired-end sequencing was used to sequence both ends and the cDNA library was deep-sequenced generating four gigabytes of data and a total of 114,428,314 raw reads. After cleaning and trimming, a total 108,340,100 clean reads were used for assembly and analysis.

De novo assembly and analysis of Illumina reads
To ensure the quality requirement for de novo transcriptome sequencing, a stringent filtering process was carried out. Initially, Illumina reads that passed the Failed-Chastity filter (Illumina) based on a setting of "failed-chastity less than or equal to 1" with a chastity threshold of 0.6, were reserved on the first 25 cycles. Then, all reads with adaptor contamination and lowquality reads with ambiguous sequences "N" were discarded. Finally, we ruled out reads with more than 10% Q < 20 bases. The cleaned reads were assembled de novo using SOAPdenovo2 and contigs with length less than 200 bp were discarded due to a low annotation rate [48]. The paired-end Illumina reads were first combined to produce longer fragments (i.e., contigs) and then mapped back to the contigs. The paired-end reads and contigs were assembled to form longer sequences that originated from the same transcript, with N indicating unknown bases (i.e., scaffolds). The paired-end reads were used for gap filling of the scaffolds to obtain unigenes with the least Ns that could not be extended at either end. All unigenes assembled were compared with the non-redundant protein database (nr) of the National Center for Biotechnology Information (NCBI), non-redundant nucleotide sequence (nt) database (NCBI) (http://www.ncbi.nlm.nih.gov), UniProt/Swiss-Prot (http://www.expasy.ch/sprot), and the Clusters of Orthologous Groups (COG) database (http://www.ncbi.nlm.nih.gov/COG/) using BLASTx [49] with E-values of less than 1e -5 and 1e -10 , respectively. The COG functional classification was used to analyze the gene composition, as well as predict and classify possible functions of transcriptome sequences. With the nr annotation, Blast2GO was used for gene ontology (GO) annotation of the unigenes according to molecular function, biological process and cellular component ontologies (http://www.geneontology.org). The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database and the online KEGG Automatic Annotation Server (http://www.genome.jp/tools/kaas/) were used to map unigenes to pathways applying an E-value threshold of 1e -5 . During the determination of the sequence direction of unigenes, the priority order of nr, Swiss-Prot, KEGG and COG was followed when the alignment results of four databases conflicted with each other.

SSR mining and primer design
The MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de.sci-hub.org/misa/) Perl script was employed to identify microsatellites in the unigenes. In this study, cDNA-based SSRs (cSSRs) were considered to contain motifs with two to six nucleotides and a minimum of four contiguous repeat units. The criterion of no fewer than six repeat units for di-, five for tri-to tetra-, and four for penta-to hexa-nucleotide repeats was adopted. Based on the MISA results, repeat motifs were set randomly and searched in sequences longer than 200 bp. The frequency of cSSRs refers to kilobase pairs of cDNA sequences containing one SSR. The output was then parsed by the Primer3-2.3.4 program (http://sourceforge.net/projects/primer3/files/primer3/2.

PCR amplification and validation of selected SSRs
To evaluate the amplification efficacy, specificity, and polymorphisms of all selected SSRs, we collected R. padi samples from Lanzhou in the Gansu Province (the sample was coded as GSL; the coordinates are 103˚41 0 E and 36˚05 0 N), Xianyang in the Shaanxi Province (SAX; 108˚05 0 E, 34˚17 0 N); Wuhan in the Hubei Province (HUW; 112˚47 0 E, 32˚08 0 N), and Baicheng in the Jilin Province (JLB; 122˚52 0 E, 45˚39 0 N). Apterous parthenogenetic individuals from the four regions were brought to the laboratory in individual tubes. Twelve individuals from each region were used to set up 12 respective clones (parthenogenetic lines) on seedlings of wheat cultivar 'Xiaoyan 22' at 24±1˚C, 40% RH, and a 16:8 h (L: D) photoperiod. A total of 48 clones were set up for the four regions. One individual was randomly taken from the first generation of each clone for DNA extraction. Genomic DNA was extracted from each individual using the EasyPure Genomic DNA Kit (TransGen Biotech, BeijingChina) according to the bench protocol for animal tissues. DNA was eluted in 30 μL of ultrapure water and stored at -20˚C. Three primers-a forward primer with an M13 (-21) at the 5' end, a reverse primer, and an FAM fluorscent dye-labeled M13 (-21) primer [50] were used for PCR amplification of the microsatellite loci. PCR was performed in a total volume of 25 μL, comprising 12.5 μL of 2× Taq Master Mix (containing 0.05 U/μL Taq DNA Polymerase, 2× Taq PCR Buffer, 3 mM MgCl 2 , and 400 μM dNTP mix) (CoWin Biotech, Beijing, China), 0.5 μL of each forward primer (10 μM), 2 μL of each reverse primer (10 μM), 2 μL of M13 primer (10 μM), and 1.5 μL of genomic DNA (10-30 ng/μL). PCR amplification involved denaturation at 95˚C for 2 min, followed by 30 amplification cycles consisting of 95˚C for 20 s, 20 s at the primer-specific annealing temperature (Table 1), and 72˚C for 20 s. This was followed by eight cycles of 95˚C for 30 s, 53˚C for 45 s, and 72˚C for 45 s, and a final step at 72˚C for 10 min. To examine the length of the amplified PCR products, an ABI3730XL automated DNA sequencer (Applied Biosystems, Foster City, CA, USA) was used, and all genotypes were called with GeneMapper v4.0 software (Applied Biosystems, Foster City, CA, USA).

Statistical analysis
Population genetic parameters for polymorphic loci, such as the observed heterozygosity (Ho), the expected heterozygosity (He), and number of alleles (Na) were calculated using FSTAT v2.9.3.2 software [51]. The Excel Microsatellite Toolkit v3.1 program (MS Tools) [52] was utilized to calculate the polymorphism information content (PIC) of each SSR; the inbreeding index (Fis) was assessed using Popgene v1.3.1 software [53]; Hardy-Weinberg equilibrium (HWE) and P values of HWE were evaluated using GENEPOP v4.0 software [54]. The statistical power of microsatellite markers for detecting genetic differentiation at various Fst (fixation index) levels was evaluated using POWSIM v4.1 software [55]. Based on an effective population size of Ne = 1000, simulations were carried out with Fst values of 0.0001, 0.0025, 0.005, 0.0075, 0.01, 0.0125, 0.015, 0.0175, and 0.02, and 1000 replicates [55]. Chi-squared tests and Fisher's exact tests were used to analyze the allele number and frequency at randomly selected microsatellite loci, and the power of the analysis was indicated by the proportion of tests that were significant at P < 0.05 [8].

Functional annotation of the unigenes
A total of 9,895 unigenes with BLAST matches to known proteins were annotated with 76,184 GO functions based on sequence similarity, with an average of 7.7 GO annotations per unigene (Fig 3). The three main GO annotation categories were biological process (42,447, 55.72%), cellular component (21,425,28.12%), and molecular function (12,312,16.16%). The annotations could be further subdivided into 59 subcategories (Fig 3). For sequences that initially sorted to the biological process classification, cellular process, single-organism process, metabolic process, and biological regulation were among the most represented matches. The major subcategories for the cellular component classification were cell and cell part, whereas in the molecular function classification the major subcategories were binding and catalytic activity.

Validation of microsatellite markers
A total of 60 microsatellite loci were selected at random for verification of their utility in various R. padi populations. Twenty-four of these loci were amplified successfully and yielded bands of the correct sizes; these loci were used to analyze the genetic diversity of R. padi (Table 1). Fourteen loci (RP06, RP08, RP13, RP14, RP22, RP23, RP24, RP30, RP31, RP42, RP43, RP45, RP48 and RP60) were polymorphic in four different geographical populations of R. padi ( Table 3). The number of alleles (Na) per locus ranged from 2 to 9, with an average of 3.98. The observed heterozygosity (Ho) value was 0.417-1.000, whereas the expected heterozygosity (He) value was 0.409-0.823. The the polymorphism information content (PIC) of each locus ranged from 0.221 to 0.765. The inbreeding index (Fis) values ranged from -1.000 to 0.367, most of which were negative values indicating the heterozygous excess in the four R.     Table 3). Power calculations using a chi-squared test ( Fig 5A) and Fisher's exact test (Fig 5B) showed that Fst (fixation index) values as low as 0.0062 could be detected with more than 80% probability. Therefore, the 14 microsatellite loci provided sufficient statistical power to detect population differentiation.

Transcriptome analysis
In this study, the complete R. padi transcriptome was sequenced on the Illumina HiSeq 2000 platform, yielding a total of 9,750,609,000 bp with 114.43 million clean reads. These sequences also produced longer unigenes (mean = 990 bp) than those assembled in Ipomoea batatas (765 bp) [57], Eucalyptus grandis (197 bp) [58], Acropora millepora (440 bp) [59], Bemisia tabaci (266 bp) [60], and Dialeurodes citri (539 bp) [61]. Furthermore, the N50 length of the unigenes was 1,580 bp, longer than that of Ipomoea batatas (765 bp) [57], D. citri (632 bp) [61], and Grapevine phylloxera (936 bp) [62]. This result indicated that the R. padi transcriptome sequences were of high quality, which was likely due to the improved transcript construction and scaffolding and low heterozygosity of the new paired-end sequencing technology. For functional annotation, we utilized several complementary approaches to annotate the assembled sequences using several public databases. About 82.47% of the unigenes had orthologs or homologs in these databases and were assigned at least one functional annotation, which is higher than previous reports of other species using the same sequencing platform [39,60,61,63]. The high level of annotations is due to the availability of complete functional information in all of the public databases, the high mean length of unigenes [39], and the availability of an aphid genome database (http://www.aphidbase.com/aphidbase/). A total of 17.53% unigenes were unmapped in any of the databases, possibly due to the short sequence reads generated [61], the presence of non-coding transcripts among the unigenes [31], and/or the incompleteness of the public sequence databases [30]. In COG and GO functional classification, a large proportion of unigenes (27.22 and 33.58%) were assigned to a wide range of COG and GO classifications (Figs 2 and 3), indicating that the transcriptome data included a wide diversity of transcripts. In the KEGG pathway analysis, a high proportion of unigenes were mapped to metabolic pathways, the RNA transport pathway, and regulation of the actin cytoskeleton pathway (S2 Table). In addition, several pathways related to pesticide resistance -such as ABC transporters, drug metabolism-cytochrome P450, and metabolism of xenobiotics by cytochrome P450-were identified. These annotated unigenes will facilitate more indepth investigations of population genetics and functional genomics of R. padi and other closely related aphid species.

Microsatellite loci characterization
The major traditional methods for microsatellite loci development are the hybrid capture method [64], loci selection from available genetic/genomic information [65], and loci transferable from closely related species [28,66]. Compared with traditional methods, de novo transcriptome sequencing technology is a rapid, cost-effective, and reliable tool that enables microsatellite markers to be developed directly from transcriptome sequences, particularly for non-model species [33,42,67]. Among the 29,467 assembled unigenes, 5,564 (18.89%) possessed 7,936 potential microsatellite loci. This is higher than the values for other insect pests, such as Bombyx mori [68], Tomicus yunnanensi [69], Bactrocera dorsalis [70], and Phenacoccus solenopsis [71]. Six types of microsatellite loci repeat type were identified among the unigenes; the most common were trinucleotide (45.75%) and mononucleotide (28.86%) repeats, in agreement with the results for D. aeneus [45], Timema cristinae [72], and B. dorsalis [70]. Wang et al. (2012) found that trinucleotide microsatellite loci were abundant in the transcriptome data of Tetrao tetrix, and predicted that tri-nucleotides can remain in coding regions without causing reading frame shifts [73]. Therefore, the abundance and frequency of the various microsatellite loci repeat types were related to the size of the transcriptome database, the microsatellite loci search software used, and the parameter criteria [74]. A/T, AAT/ATT, and AT/TA were the most abundant SSR motif types in the R. padi transcriptome database. A/T homopolymers are also more abundant than C/G homopolymers in Schistosoma mansoni [75], Tenebrio molitor [76], and P. solenopsis [71]. Tóth et al. (2000) examined the abundance of microsatellites with repeated unit lengths of 1-6 base pairs in several eukaryotic taxonomic groups [77]. They found poly (A/T) tracts are more abundant in each taxon than poly(C/G) sequences and the plausible explanation for the higher proportion of A/T-rich SSRs is the poly-A tails of retroposed sequences and processed pseudogenes. In 154 non-model eukaryote species, the previous reports found that the GC/CG motif was rare, and that the GC/CG was absent in several eukaryote species [78][79][80][81][82][83]. Indeed, the CG/CG motif was detected at a low frequency (0.89%) in this study. This phenomenon cannot be explained only by the low C/G content of the genome and thus may represent a genuine pattern [78,84,85].

Microsatellite loci development and validation
Among 60 randomly selected potential microsatellite markers, 24 loci (40%) were amplified successfully, and 14 loci exhibited polymorphisms in four R. padi populations. The low amplification rate may be caused by the special structure at the primer (s) location or between the primers, for example, the presence of large intron between primers, or unrecognized splice sites disrupting primer positions. The chimeric primers and assembly errors also could result in failed amplification. However, the other microsatellites obtained from our transcriptome data can provide a larger pool for mining more polymorphic loci. SSR polymorphisms are positively correlated with the number of motif repeats [70]. In the transcriptome database, the number of motif repeats of most SSRs (68.18%) ranged from 4 to 9, and only 9.23% of SSRs had more than 15 repeats. Hence, SSRs in the transcriptome were less polymorphic than genomic SSRs, but possessed potential polymorphisms [30,86]. Among the four R. padi populations, 14 loci had fewer alleles (Na) than genomic SSRs, but similar Ho, He, and Fis values [8]. The PIC is an important parameter for microsatellite polymorphisms [87]. The average PIC values of 14 loci in four R. padi populations were 0.334-0.680, suggesting that 14 were moderately or highly polymorphic. A chi-squared test and Fisher's exact test confirmed that 14 loci had strong statistical power to detect low Fst levels, while amplification at various annealing temperatures using the remaining 36 primer pairs failed.

Conclusions
To our knowledge, this is the first report of the assembly and characterization of the transcriptome of R. padi using the Illumina HiSeq 2000 platform. A total of 2,9467 unigenes were generated and 7,936 EST-SSRs were identified, which will facilitate development of molecular markers for R. padi. Sixty of these loci were selected randomly, and 24 were amplified successfully and validated experimentally in four R. padi populations. Our results will enable development of microsatellite markers and population genetic studies of R. padi.
Supporting information S1