Skip to main content
Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Combining Next-Generation Sequencing and Microarray Technology into a Transcriptomics Approach for the Non-Model Organism Chironomus riparius

  • Marino Marinković,

    Affiliations Microarray Department and Integrative Bioinformatics Unit, Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Amsterdam, The Netherlands, Department of Aquatic Ecology and Ecotoxicology, Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam, The Netherlands

  • Wim C. de Leeuw,

    Affiliations Microarray Department and Integrative Bioinformatics Unit, Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Amsterdam, The Netherlands, Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands

  • Mark de Jong,

    Affiliations Microarray Department and Integrative Bioinformatics Unit, Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Amsterdam, The Netherlands, Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands

  • Michiel H. S. Kraak,

    Affiliation Department of Aquatic Ecology and Ecotoxicology, Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam, The Netherlands

  • Wim Admiraal,

    Affiliation Department of Aquatic Ecology and Ecotoxicology, Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam, The Netherlands

  • Timo M. Breit,

    Affiliations Microarray Department and Integrative Bioinformatics Unit, Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Amsterdam, The Netherlands, Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands

  • Martijs J. Jonker

    Affiliations Microarray Department and Integrative Bioinformatics Unit, Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Amsterdam, The Netherlands, Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands


Whole-transcriptome gene-expression analyses are commonly performed in species that have a sequenced genome and for which microarrays are commercially available. To do such analyses in species with no or limited genome data, i.e. non-model organisms, necessary transcriptomics resources, i.e. an annotated transcriptome and a validated gene-expression microarray, must first be developed. The aim of the present study was to establish an advanced approach for developing transcriptomics resources for non-model organisms by combining next-generation sequencing (NGS) and microarray technology. We applied our approach to the non-biting midge Chironomus riparius, an ecologically relevant species that is widely used in sediment ecotoxicity testing. We sampled extensively covering all C. riparius developmental stages as well as toxicant exposed larvae and obtained from a normalized cDNA library 1.5 M NGS reads totalling 501 Mbp. Using the NGS data we developed transcriptomics resources in several steps. First, we designed 844 k probes directly on the NGS reads, as well as 76 k probes targeting expressed sequence tags of related species. These probes were tested for their affinity to C. riparius DNA and mRNA, by performing two biological experiments with a 1 M probe-selection microarray that contained the entire probe-library. Subsequently, the 1.5 M NGS reads were assembled into 23,709 isotigs and 135,082 singletons, which were associated to ∼55 k, respectively, ∼61 k gene ontology terms and which corresponded together to 22,593 unique protein accessions. An algorithm was developed that took the assembly and the probe affinities to DNA and mRNA into account, what resulted in 59 k highly-reliable probes that targeted uniquely 95% of the isotigs and 18% of the singletons. Concluding, our approach allowed the development of high-quality transcriptomics resources for C. riparius, and is applicable to any non-model organism. It is expected, that these resources will advance ecotoxicity testing with C. riparius as whole-transcriptome gene-expression analysis are now possible with this species.


Microarray technology has, 17 years after its introduction [1], become a well-established tool for whole-transcriptome gene-expression analyses [2]. Although under pressure due to the on-going developments in next-generation sequencing (NGS) [3], this technology is anticipated to have a viable future for the coming decade given its current low cost, relatively limited data-handling burden, as well as accepted pre-processing and data analyses methods. In contrast to NGS, microarray technology allows comprehensive though cost-effective transcriptome analyses, making it thus possible to test various experimental conditions and to use many replicates, the latter being a necessity to account for biological variability [4]. The option of purchasing completely custom-made microarrays containing up to 4.2 million spots each with a different oligonucleotide ( allows for flexible experimentation including the design and the use of huge probe libraries. Over the years, many ground-braking microarray experiments have been performed with respect to unravelling cellular mechanisms, as well as the discovery of predictive/diagnostic biomarkers [5].

Until recently, microarray studies were mainly restricted to several traditional genome-sequenced model species [6]. This meant that in domains that rely on non-model, i.e. non sequenced, organisms, such as ecology and ecotoxicology, microarray technology was only of limited use [6]. The introduction of NGS and in particular medium-long (300–500bp) read pyrosequencing [7] changed this [8], as it became feasible to develop microarrays for any species of interest [9], [10]. In general, the approach to develop transcriptomics resources for non-model organisms is as follows. NGS reads are generated from mRNA and, by lack of reference genome, de novo assembled with one or several transcriptome assemblers. Subsequently, microarray probe libraries are designed that target the assembled sequences (contigs/isotigs) and, depending on the microarray format, all or a selection of the un-assembled reads (singletons) [9][14]. This approach has several drawbacks, as the microarray design strongly relies on the transcriptome assembly which varies depending on the assembler used [15], [16] and which can result in modified sequences due to the partial assembly of NGS reads [17], the insertion of bases to fill gaps and the merging of NGS reads that do not belong to the same transcript. Moreover, as there is no biological confirmation of the obtained NGS reads in this approach, probes can be developed against sequences that do not target the intended organism, as they are the result of sequencing errors [18] and/or contamination [19], [20]. The eventual microarray design will therefore include many probes that recognize sequences not present in the target organism. Finally, as this approach relies solely on in-silico methods, a fraction of the probes may also perform badly in actual microarray experiments.

Given the above discussed drawbacks, the aim of the present study was to establish an advanced approach for developing transcriptomics resources for non-model organisms, consisting of an annotated transcriptome and a high-quality microarray. To achieve this, we formulated the following strategy (Figure 1):

  • Generate NGS data: Perform a NGS experiment on a normalized cDNA library obtained from a broad range of biological samples of the non-model organism.
  • Design probe library: Design up to one million probes targeting all original NGS reads, as well as, expressed sequence tags (ESTs) of related species.
  • Assemble the NGS reads. Assemble the NGS reads into a transcriptome for downstream probe selection and functional annotation.
  • Select probes with targets in the genome: Conduct an array-based comparative genomic hybridization (aCGH) experiment with a probe-selection microarray that contains all the designed probes, to select probes that hybridize well with the genomic DNA (gDNA) of the non-model organism.
  • Select probes for standard mRNA analysis: Conduct an array-based gene-expression (aGE) experiment with the probe-selection microarray to further select probes according to their 3′ location in mRNA, their signal-intensity and their ability to uniquely interrogate the assembled transcripts. The final probe library targets all non-model organism transcripts that can be uniquely targeted while keeping the number of probes to a minimum.
  • Finalize transcriptomics resources: Functionally annotate all transcripts using the annotation tool Blast2GO® [21] and determine which transcripts are targeted by the high-quality GE microarray.

As microarrays are gaining importance in ecotoxicology [22], [23], we applied this approach to a non-model organism commonly used in ecotoxicology for which till date no microarray has been developed and for which very limited sequence data can be found in public repositories, even though the transcriptome of this non-model species has recently been published [24]. The non-biting midge Chironomus riparius (Insecta: Diptera) is a member of the Chironomidae family, which are the most widely distributed and often most abundant insects in freshwater ecosystems [25]. Consequently, chironomids are routinely used to evaluate and monitor the biological quality of rivers and lakes [26], [27]. Their larvae settle in the sediment, where they remain until they emerge as adults. The short life-cycle and ease of rearing have also made C. riparius a commonly used species in sediment ecotoxicology [28], [29] with currently four standardized OECD guidelines being available for acute and chronic toxicity tests [30][33]. Assessing effects on life cycle endpoints has proven to be an effective method for deriving effect concentrations for environmental risk assessment [34]. However, life cycle effects are not supported by a mechanistic insight in the toxicants mode of action nor the physiological changes that occur in toxicant exposed midges. Studies that measure effects at lower levels of biological organization, such as the transcriptome, are therefore required [35], [36].

Results and Discussion

Generation of Transcriptome Next-generation Sequencing Data

Non-model organisms have no or limited genomics data. Due to the size and complexity of the genome, sequencing the transcriptome is a practical alternative to obtain genomics data for such species [8]. The obvious drawback is that only sequences from genes that are expressed in the sequenced samples will become available. Therefore broad sampling, as well as normalization of the pooled sample, i.e. reducing the frequency of highly abundant transcripts, is highly recommended (Figure 1). For C. riparius we included all life cycle stages, i.e. egg ropes, all four larval stages, pupae and male and female adults. Considering C. riparius use in sediment ecotoxicology [28], [29], we also included larvae exposed to different concentrations of several model toxicants (Table 1, extended version in Table S1). These specimens were used to synthesize a normalised cDNA library that yielded 1,549,146 NGS reads with a total length of 500,673,325 bp which is in line with the specification of the 454 NGS platform and which is more than ten times the amount of reads and base pairs previously reported by Nair et al. [24] who obtained, respectively, 138,091 NGS reads and 49,774,676 bp by pyrosequencing toxicant exposed 4th instar larvae.

Figure 1. Strategy to obtain non-model organism transcriptomics resources.

NGS: Next-generation sequencing; ESTs: Expressed Sequence Tags; aCGH: array-based Comparative Genomic Hybridization; GE: Gene Expression; GO: Gene Ontology; EC: Enzyme Commission numbers. * adapted from

Generation of Microarray Probe Library

In our microarray approach, we started by designing a huge probe-library directly on all adapter trimmed NGS reads that were longer than 60 bp and that did not contain unknown bases in their sequence, using the in-house developed NGS array designer ( Designing probes against NGS reads, instead of the assembled contigs/isotigs ensured that the designed probes did not target sequences that might have been modified during the assembly process. The probe-library was subsequently extended with probes designed against publically available ESTs of closely related species. These probes could be an enrichment as they might target conserved sequences that were not present in the sequenced sample. For C. riparius, this resulted, after testing for cross-hybridization, in 919,821 probes. 843,837 NGS read designed probes targeted almost al NGS reads, while 75,984 probes were designed against ESTs belonging to the genus Chironomus and the closely related dipteran species Anopheles gambiae, Anopheles darlingi, Anopheles funestus, Aedes aegypti and Culex quinquefasciatus.

Assembly of Transcriptome NGS Data

For the downstream probe selection procedure, as well as the annotation of the non-model species transcriptome, the NGS reads had to be assembled. For C. riparius we used Newbler (v2.5.3.) which is a de-facto standard assembler for NGS reads generated by pyrosequencing [37] and which has been shown to perform best assembling such NGS reads de novo into a transcriptome [15]. An overview of sequencing and assembly statistics is given in Table 2. From the ∼1,5 million trimmed NGS reads, 87.2% was fully or partially assembled, 8.8% could not be assembled and was labelled singleton, while the remaining 4.1% was discarded as the NGS reads did not meet the required quality standards. While the large number of singletons undoubtedly contained fragments of rare transcripts [38], [39], we suspected that a substantial portion was the result of NGS errors, artefacts of cDNA library preparation and/or contaminants from other sources. Especially the latter option seems plausible, since entire larvae including their gut flora were used for the cDNA library preparation. To keep track of possible differences between the two transcript sets, we kept the isotigs and singletons separated during our entire study.

Table 2. C. riparius transcriptome sequencing and assembly statistics.

Since the Newbler assembler takes alternative splicing into account, the C. riparius transcriptome assembly consisted of ∼27 k contigs (“exons”) that were incorporated into ∼23 k isotigs (“transcripts”, average length 1,369 bp), which in turn were grouped into ∼18 k isogroups (“genes”). 66.8% of the isotigs consisted of a single contig, while the others contained up to 13 contigs. Isotigs that shared contigs were grouped together into isogroups, which resulted in 85.8% with one, 10.5% with two, and 3.7% with three or more isotigs. Correcting for 26 contigs that were not translated by the Newbler assembler into isotigs, we defined the C. riparius transcriptome as the total of 23,709 isotigs and 135,082 singletons.

aCGH Experiment to Select Relevant Microarray Probes using gDNA

To select the most reliable and biological relevant probes from the previously designed probe-library, a probe-selection microarray was developed that contained the probe-library and ample negative-control probes that did not recognize any sequence in GenBank. Performing an aCGH experiment allowed subsequently the identification of probes that hybridized with the gDNA of the non-model organism. Assuming a limited chance on random homology, the aCGH experiment was expected to considerably clean-up the NGS data, eliminating NGS reads that originated from contamination or technological errors, while simultaneously selecting relevant probes for the final non-model species GE microarray. It is recognized, that his procedure would also select against probes targeting exon spanning sequences, however, due to the large number of probes this was not considered a problem.

For C. riparius, we developed a 1 M probe-selection microarray that contained the 919,821 probes from the probe-library and 40.000 negative control probes. This probe-selection microarray was used in an aCGH experiment to analyse the gDNA of C. riparius as well as the gDNA of A. gambiae that served as a positive control. It turned out that there was a correlation between the GC-content of the negative control probes and their signal-intensity, i.e. above a GC-content of 45% a steady increase in signal intensity was observed (Figure S1). This prompted us to limit the CG-content for all probes to a maximum of 50%, what resulted in 816,270 C. riparius NGS-read probes, 42,636 dipteran-specific probes, and 19.000 negative-control probes.

As expected, the NGS-read probes showed stronger signal intensities after hybridisation with C. riparius than A. gambiae gDNA (Figure 2A). The opposite was true for the probes designed against A. gambiae ESTs. In fact, the A. gambiae aCGH signal-intensities of the NGS-read probes were in the range of that of the negative controls, indicating a substantial genomic difference between the malaria mosquito A. gambiae and the non-biting midge C. riparius. To select probes that hybridized well to the C. riparius gDNA, we compared for all probes the log2-ratio (C. riparius/A. gambiae) of the aGCH signal intensities with the summed log2 (C. riparius*A. gambiae) signal intensities in a MA-plot (Figure 2B). The negative- and positive (A. gambiae) control probes behaved as expected, both with low C. riparius aGCH signal intensities and only for the positive control probes high signal intensities for the A. gambiae gDNA. The distributions of the probes designed against the other dipteran species, with the exception of the Chironomus spp, corresponded to the pattern observed for the A. gambiae designed probes (Figure S2).

Figure 2. Array-based comparative genomic hybridization (aCGH) experiment.

(A) Box- and-whisker plot summarizing the obtained log2 signal intensity distributions for the four indicated probes collections, with the light grey boxes representing the C. riparius aCGH signal and the dark grey the aCGH A. gambiae signal. (B) MA-plot of the aCGH data. The dots with the different shades of grey represent the entire probe-library (with a GC-content below 50%). The three defined signal-intensity parameters are indicated by the dashed blue line and the captions I, II, III. The three categories containing the selected probes are indicated by different shades of grey and the letters A, B and C. The red dots are the negative control probes and the green dots the positive control (A. gambiae EST) probes.

Based on the distributions of the control probes (Figure 2B), we defined signal-intensity parameters that allowed a conservative selection of C. riparius gDNA specific probes: parameter I was a C. riparius log2 signal of 10, separating probes with a strong signal when hybridized with C. riparius gDNA; parameter II was a C. riparius log2 signal of 8, separating probes with an intermediate signal when hybridized with C. riparius gDNA; and parameter III was a log2 C. riparius/A. gambiae signal ratio of 1, separating probes with a higher signal to C. riparius than to A. gambiae gDNA. Using these parameters we defined categories to select probes: Category A contained probes above parameter I and III, which are probes that gave a strong and specific C. riparius gDNA signal (217,878); Category B contained probes between parameter I and II and above III, which are probes that gave an intermediate and specific C. riparius gDNA signal (206,608); and Category C contained probes above I and below III, which are probes that gave a strong and non-specific C. riparius gDNA signal (42,379). These categories contained a total of 466,865 reliable probes. The validity of the selection parameters was confirmed by applying the same signal-intensity parameters to the negative and positive control probes, what resulted in the selection of only 2.7% of the negative control probes (Figure 2B) and 86.6% of the positive control probes, with 2,860, 1,428 and 165 positive control probes in the categories A, B, and C, respectively.

Hence, with this approach we were able to reduce our initial pool of probes with 49.6%. However, despite this reduction in probes, still 98.2% isotigs and 60.2% singletons were targeted. This effectively means that the aCGH experiment selected against bad-performing probes, rather than bad NGS reads. A bad-performing probe can be caused by low affinity to the target due to small NGS errors, exon spanning target sequences, reads belonging to other species and sequence-specific microarray technology anomalies.

aGE Experiment to Further Select 3′ Located Probes using mRNA

After the DNA check by the aGCH experiment, we used mRNA derived from the non-model organism in the probe-selection procedure. Given the relative small average size of the isotigs and the many singletons, it was fair to assume that many transcripts in our example represented incomplete mRNA sequences, something that will often be the case in de novo assembled non-model organisms transcriptomes. As the standard aGE protocol uses linearly amplified cDNA, i.e. the mRNA is amplified using oligo-dT primers from the 3′-side, only 3′ located probes will be useful for the final aGE microarray. Identifying probes against the most 3′ located target sequences in the mRNA was possible by combining a standard linear amplification protocol that yielded 3′-end biased labelled material, with a modified linear amplification protocol that yielded highly 3′ restricted material due to the incorporation of dideoxynucleotides during the cDNA synthesizing step (Figure 3A). For this experiment, the same probe-selection microarray was used as in the aCGH experiment.

Figure 3. Array-based gene expression (aGE) experiment.

(A) Schematic representation of the two mRNA linear amplification protocols. The coloured bar represents the mRNA with the 3′ polyA tail indicated by the stretch of A’s. The arrows represent the amplified cDNA products obtained for the regular procedure and the modified procedure, with the length of the arrows indicating the length of the synthesized cDNA’s. (B) MA-plot of the aGE data. The light grey dots represent all aCGH selected probes. The three coloured regions are expected to contain probes targeting transcripts at the 3′ side (blue), probes targeting the middle of the transcripts (red) and probes targeting the 5′side as well as probes with no target transcripts (green). (C) Density plot where the relative position of the three probe populations on the isotigs is demonstrated. The colours of the lines correspond to the colours used in panels A and B. The black line represents a random selection of probes that covers, as expected, the isotigs evenly over the entire length.

For C. riparius, two aGE samples were synthesized from the same mRNA pool that was pyrosequenced, using the regular and modified linear amplification protocols (Figure 3A). Comparing the two samples on the 1 M probe-selection microarray resulted in three populations of probes. First, probes that showed a good intensity signal in both procedures: these probes recognized targets located 3′ in mRNA that was expressed at a high enough level to be detected by microarray technology. Second, probes that performed well in the regular protocol, but not in the 3′ restricted protocol: these probes recognized targets located more 5′ in mRNA that was expressed at a sufficient level. Third, probes that performed badly in both protocols: these probes recognized targets either located too much 5′ in mRNA to be amplified by either protocol, targets from genes that were expressed below the detection level of microarray technology, or targets from genes that were not present in the tested transcriptome, i.e. NGS contamination or erroneous sequences. This approach worked quite well and revealed that the anticipated three probe populations could be identified (Figure 3B). To demonstrate that these populations indeed represented 5′, middle and 3′ located mRNA target sequences, we determined the relative position of their targets on the assembled isotigs (Figure 3C). Compared to a set of randomly chosen probes, the distributions showed a strong preference for the expected location. Hence, there were likely probes targeting 5′, middle and 3′ located sequences for each mRNA in this NGS data set. Since the different probe populations could not clearly be separated, the aGE data was used to rank the probes based on their signal strengths and thus the target location on the NGS-defined transcripts.

To achieve our goal and design a cost-effective 3′-primed gene-expression microarray, we aimed to keep the number of reliable and biologically relevant probes to a minimum, while making sure that as many as possible transcripts were uniquely targeted. For this, a selection algorithm was defined that was executed for both isotig and singleton sets. This algorithm started by categorizing the probes for isotigs using blastn: probes were defined having good matches to isotigs, if the bit score was >80. Probes with no good match to any isotig were discarded. Then, probes were selected only if they were unique to one isogroup. Before the final selection, probes were evaluated to contain only stretches of 7 identical nucleotides or less, as well as 5 subsequent di-nucleotides or less. The selected probes were ranked according to their aGE signal intensity. Starting with the highest-ranked probe, probes were selected for each isogroup until all isotigs within each isogroup were targeted by at least one probe. The same algorithm was then applied to the singleton set.

Applying this algorithm to our C. riparius aGE data, we obtained a final probe library that consisted of 59,409 validated probes uniquely targeting 22,507 isotigs, (corresponding to 17,403 isogroups) and 23,782 singletons.

Finalization of the Transcriptomics Resources

To allow down-stream analyses and interpretation of gene-expression studies, functional annotation of the transcriptome was needed. Functional annotation of the isotigs and singletons was performed independently, starting with a blastx search of both sets against the GenBank non-redundant (nr) protein database followed by a Blast2GO® analyses identifying relevant Gene Ontology (GO) terms [40] and unique enzyme commission (EC) numbers.

For C. riparius, we used a lenient blastx e-value threshold of 1*e−3 and found matches to homologous proteins for 71.0% of the isotigs and 17.9% of the singletons (Table S2). Screening those blastx hits for identical protein accessions, showed that 50.0% of all isotigs and 8.9% of the singletons corresponded to unique protein accessions numbers. From the blastx results we identified a total of 22,593 unique protein accessions of which only 6% were found in both transcript sets. The fact that the isotigs and singletons almost equally contributed to this total set of unique protein accessions indicated that the assembly was exhaustive and thus successful, as well as that biologically relevant information was indeed present in the singletons. The number of unique protein accessions obtained in the present study is higher than the 9,512 unique protein accessions previously reported by Nair et al. [24], most probably because they pyrosequenced less deep and because their biological sample was less diverse as it merely consisted of 4th instar C. riparius larvae. Since the genome sequence of C. riparius is not elucidated we cannot determine the exact coverage of the transcriptome. However, considering that the sequenced genomes of the closely related mosquitos A. aegypti, C. quinquefasciatus and A. gambiae are predicted to have 16,789, 18,883, respectively, 13,133 transcripts [41], it seems likely that we covered a substantial part of the C. riparius transcriptome.

Distributing all the best blastx hits over taxonomic groups (Figure 4A), showed, as expected, that the far majority (92.0%) of isotig-matched proteins belonged to known insect proteins, and especially to the dipterans A. aegypti (29.1%), C. quinquefasciatus (18.2%) and A. gambiae (12.9%). This is in concordance with the distribution reported by Nair et al. [24]. For the singleton-matched proteins the distribution was different in that 49.8% related to insect proteins and 45.0% to other eukaryotic species. Only small fractions of the isotigs and singletons matched to prokaryote proteins (0.7% and 3,9%), which indicates that the chironomid gut flora did not substantially contaminate the NGS data. Since 0.7% of the isotigs and 21.4% of the singletons matched human proteins, we performed a blastn (nucleotide) search against the human genome and transcriptome data available at NCBI, to further estimate the potential human contamination of the NGS data. We detected that 3.5% of NGS reads showed a strong similarity to human sequences over the entire read length and that 94.0% of these ‘human’ NGS reads remained unassembled, i.e. became singletons. As these sequences could also represent conserved genomic sequences, we chose not to remove them from our NGS data.

Figure 4. Taxonomic distribution of the best blastx hits matching C. riparius transcripts.

Distribution of the best blastx hits that were matched to the isotigs (black) and the singletons (light grey) according to their taxonomic origin. (A) All transcripts (isotigs n = 16,824; singletons n = 24,129) that were matched to a BLASTX hit. (B) Transcripts (isotigs n = 16,537; singletons n = 4,7539) that were matched to a BLASTX hit and that are targeted by the final aGE microarray.

Assigning functional categories to the C. riparius transcriptome, we were able to annotate 11,895 (50.2%) isotigs and 12,662 (9.4%) singletons with ∼55 k and ∼61 k GO terms, respectively (Table 3 and Table S3). The distribution over the various GO terms was remarkably similar for the isotigs and singletons (Figure 5) and may suggest that the observed patterns at least partially depended on the abundance of certain GO-terms. To visualize the interaction of the annotated transcripts, we assigned with Blast2GO®, 688 unique EC numbers to 2,973 isotigs and 3,611 singletons (Table 3 and Table S3) and found 126 pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [42].

Figure 5. Gene Ontology (GO) terms obtained for C. riparius transcripts.

The data represents the distribution of the annotated isotigs (black) and the annotated singletons (light grey) over the various level-2 GO terms. Each bar represent the number of annotated transcripts associated with the specified level-2 GO term as a percentage of the total number of annotated transcripts belonging to the higher-ranked GO category, i.e. cellular component (isotigs n = 8,380; singletons n = 9,277), molecular function (isotigs n = 10,663; singletons n = 11,359) and biological process (isotigs n = 6,249; singletons n = 7,343).

The final GE microarray (Gene-Expression Omnibus accession numbers GPL15611) targeted 22,507 isotigs and 23,782 singletons. Of these targeted transcripts, 73.5% isotigs and 20.0% singletons had a blastx hit, of which 71.4% and 88.3% matched to a unique protein accessions. Importantly, our selected probes targeted 94.9% of the isotigs and 17.6% of the singletons, which covered 98.6% and 34.9% of the unique protein accessions, respectively. The taxonomic distribution of the targeted isotigs was almost the same of that of the entire set of isotigs with 92.7% matching insect proteins. However, for the singletons the percentage that matched insect proteins increased substantially from 49.8% to 82.2% (Figure 4B). As aimed for, the percentage of ‘human contamination’ was substantially reduced in both transcript sets.

Thus by combining NGS and microarray technology we succeeded in designing an annotated high-quality microarray suited for whole-transcriptome gene-expression analysis in a non-model organism. For C. riparius, we selected from a 925 k probe library 59 k highly-reliable probes that have been proven to perform well in both aCGH and aGE experiments. While we designed probes directly on the NGS reads, our approach is also suitable for validating probe libraries designed against assembled transcripts. Concluding, we now have valuable C. riparius transcriptomics resources, i.e. an annotated transcriptome and a 135 K 3′-primed gene-expression microarray, that can advance ecotoxicity testing with C. riparius as whole-transcriptome gene-expression analysis are now possible with this species.

Materials and Methods

Test Organism, Culturing Conditions and Sample Selection

The Chironomus riparius specimens originated from the University of Amsterdam’s in-house laboratory culture and were maintained on artificial sediment at 20±1°C, 65% humidity and a 16∶8 h light: dark photoperiod [28], [29]. The genetic fidelity of this C. riparius laboratory culture was previously confirmed by mitochondrial cytochrome oxidase I (COI) gene sequencing [43] The sample list, including all life cycle stages and toxicant-exposed larvae, is provided in Table S1. Each sample was immediately snap frozen in liquid nitrogen and stored at –80°C until processing.

RNA Isolation, cDNA Library Construction and Next-generation Sequencing

The frozen samples were pooled and homogenization in liquid nitrogen. Total RNA was extracted using the RNeasy Mini kit (Qiagen) with an on-column DNase (Qiagen) digestion to remove traces of genomic DNA. RNA integrity was examined using a RNA 6000 Nano chip on a 2100 Bioanalyzer (Agilent Technologies), while the RNA yield was determined on a NanoDrop ND-1000 UV-VIS spectrophotometer (Thermo Fisher Scientific). An aliquot of the total RNA sample was send to GATC Biotech (Konstanz, Germany) where a normalized cDNA library was prepared and sequenced using Titanium chemistry on a GS FLX Instrument (Roche Diagnostics) according to manufacturer’s protocol. Detailed information is presented in Appendix S1.

De novo Transcriptome Assembly and Functional Annotation

NGS reads were, after removal of adapter sequences, assembled with Newbler v2.5.3. (Roche) in the de novo mode using default assembly parameters. The obtained isogroups and isotigs, as well as, the remaining singletons were renamed in the format of “CripIG000001”, “CripIT000001”, “CripSI000001” with “Crip” standing for C. riparius, “IG” for isogroup, “IT” for isotig, “SI” for singleton, and “000001” for an arbitrary assigned number. The C. riparius transcripts, i.e. isotigs and singletons, were functionally annotated using the Blast2GO® suite [21], [44]. Detailed information of the functional annotation procedure is presented in Appendix S1.

Array-based Comparative Genomic Hybridization (aCGH)

The designed 1×1 M probe-selection microarray was obtained from Agilent Technologies and hybridized with C. riparius and A. gambiae genomic DNA (gDNA). gDNA was extracted from 30 pooled fertilized C. riparius egg ropes, respectively, 30 pooled unfed A. gambiae adults. To keep contamination of the C. riparius gDNA sample to an absolute minimum the midges were allowed to deposit the egg ropes in petri dishes filled with clean water. Egg ropes not older than 30 minutes were collected, stringently rinsed with clean water and immediately flash frozen in liquid nitrogen. After pooling, gDNA was extracted using a CTAB DNA extraction method that included a RNAse A (Sigma-Aldrich) digestion step for removal of residual RNA [45]. gDNA quality and quantity were determined with gel electrophoresis (0.5% agarose in TAE buffer) and NanoDrop ND-1000 measurements. 200 ng DNA was amplified and labelled by strand displacement amplification. Concentrations of amplified products were measured on the NanoDrop ND-1000 and qualified on the BioAnalyzer with the DNA 1000 Kit (Agilent Technologies). Yield and CyDye incorporation of the final labelled products were measured with the NanoDrop ND-1000 in the Microarray Measurement Mode. The 1 M probe-selection microarray was subsequently hybridized with 10 µg of Cy3 labelled C. riparius gDNA and 10 µg Cy5 labelled A. gambiae gDNA according to the Oligonucleotide Array-Based CGH for Genomic DNA Analysis manual (Agilent Technologies version 6.3. The microarray was scanned in an ozone-free room on an Agilent G2505CA scanner at 3 µm resolution and the data was extracted with Feature Extraction version (Protocol CGH_107_Sep09). The log2 transformed median signals were analysed in R ( Detailed information is presented in Appendix S1.

Array-based Gene-expression (aGE)

The pooled C. riparius RNA sample, of which an aliquot was pyrosequenced, was used to identify probes corresponding to the 3′-end of C. riparius transcripts. 200 ng RNA was taken as input for both a regular, as well as, a modified linear RNA amplification. The regular RNA amplification was conducted with the Agilent Low RNA Input Linear Amplification Kit (Agilent Technologies) according to manufacturer’s recommendations. The modified reaction was conducted using the same kit, however, the first step where the mRNA is primed with an oligo (d)T-T7 primer and converted into double-stranded cDNA with the M-MLV reverse transcriptase, was modified by the addition of dideoxynucleotides (ddNTP) in the deoxynucleotide mix. This modification was expected to yield highly 3′-biased cDNA as incorporation of a ddNTP would prematurely terminate cDNA elongation. The modified amplification procedure is described in detail in the Appendix S1. Amplified RNA was checked for quality and quantity using Bioanalyzer and NanoDrop measurements. Yield and CyDye incorporation of the final labelled products were measured with the NanoDrop ND-1000 in the Microarray Measurement Mode. The 1 M probe-selection microarray was subsequently hybridized with 2.5 µg Cy3 labelled regularly amplified RNA and 2.5 µg Cy5 labelled alternatively amplified RNA according to the Two-Colour Microarray-Based Gene-Expression Analysis manual (Agilent Technologies version 6.5). The microarray was scanned in an ozone-free room on an Agilent G2505CA scanner at 3 µm resolution. The data was extracted with Feature Extraction version (Protocol GE2_107_Sep09). The log2 transformed median signals were analysed in R.

Data Deposition

The C. riparius NGS sequence reads were submitted to NCBI Sequence Read Archive ( under accession number SRX147945. The assembled C. riparius isotigs have been submitted to NCBI Transcriptome Shotgun Assembly Sequence Database ( and can be accessed through the GenBank accession numbers KA174710-KA198345. Complete raw microarray data and their MIAME compliant metadata have been deposited at NCBI Gene-Expression Omnibus ( under accession numbers GPL15610 (1 M microarray) and GPL15611 (135 K microarray).

Supporting Information

Figure S1.

aCGH signal intensities of the control probes according to GC-content. Box- and-whisker plot showing the log2 A. gambiae signal intensity distributions of the positive control probes (light grey) and the negative control probes (dark grey) for 20 GC-bins, each bin corresponding to a GC-content increase of 5%.


Figure S2.

MA-plots of the aCGH experiment obtained for the various dipteran probe collections. The grey dots represent the entire probe library, except the negative control probes. The black are the probes targeting the ESTs of respectively (A) Chironomus spp., (B) Anopeheles darlingi, (C) Anopheles funestus, (D), Aedes aegypti and (E) Culex quinquefasciatus.


Table S1.

Detailed C. riparius sample list.


Table S2.

blastx results using different e-value cut-offs.


Table S3.

The C. riparius transcriptome annotation.


Appendix S1.

Additional information to the Materials and Methods.



We would like to thank Dr. Merijn R. Kant and Dr. Floyd A. Wittink for the fruitful discussions and tips during the assembly plus annotation process and microarray experiments, respectively. Dr. Niels Verhulst for providing us with Anopheles gambiae samples and Ing. Jurgo Verkooijen for laboratory assistance.

Author Contributions

Conceived and designed the experiments: MJ MJJ MM TMB WA WCL. Performed the experiments: MJ MM. Analyzed the data: MJJ MM WCL. Contributed reagents/materials/analysis tools: MHSK MM. Wrote the paper: MHSK MJJ MM TMB WA.


  1. 1. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470.
  2. 2. MAQC Consortium (2006) The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24: 1151–1161.
  3. 3. Shendure J (2008) The beginning of the end for microarrays? Nat Methods 5: 585–587.
  4. 4. Hanssen KD, Wu Z, Irizarry RA, Leek JT (2011) Sequencing technology does not eliminate biological variability. Nat Biotechnol 29: 572–573.
  5. 5. van ‘t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, et al. (2002) Gene expression profiling predicts outcome of breast cancer. Nature 415: 530–536.
  6. 6. Neumann NF, Galvez F (2002) DNA microarrays and toxicogenomics: applications for ecotoxicology? Biotechnol Adv. 20: 391–419.
  7. 7. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.
  8. 8. Ekblom R, Galindo J (2011) Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity 107: 1–15.
  9. 9. Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, et al. (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 17: 1636–1647.
  10. 10. Garcia-Reyero N, Griffitt RJ, Liu L, Kroll KJ, Farmerie WG, et al. (2008) Construction of a robust microarray from a non-model species largemouth bass, Micropterus salmoides (Lacèpede), using pyrosequencing technology. J Fish Biol. 72: 2354–2376.
  11. 11. Bellin D, Ferrarini A, Chimento A, Kaiser O, Levenkova N, et al. (2009) Combining next-generation pyrosequencing with microarray for large scale expression analysis in non-model species. BMC Genomics 10: 555.
  12. 12. Gong P, Pirooznia M, Guan X, Perkins EJ (2010) Design, validation and annotation of transcriptome-wide oligonucleotide probes for the oligochaete annelid Eisenia fetida. PloS ONE 5: e14266.
  13. 13. Milan M, Coppe A, Reinhardt R, Cancela LM, Leite RB, et al. (2011) Transcriptome sequencing and microarray development for the Manila clam, Ruditapes philippinarum: genomic tools for environmental monitoring. BMC Genomics 12: 234.
  14. 14. Bass C, Hebsgaard MB, Hughes J (2012) Genomic resources for the brown planthopper, Nilaparvata lugens: Transcriptome pyrosequencing and microarray design. Insect Sci. 19: 1–12.
  15. 15. Kumar S, Blaxter ML (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11: 571.
  16. 16. Mundry M, Bornberg-Bauer E, Sammeth M, Feulner PGD (2012) Evaluating characteristics of de novo assembly software on 454 transcriptome data: a simulation approach. PloS ONE 7: e31410.
  17. 17. Miller JR, Koren S, Sutton G (2011) Assembly algorithms for next-generation sequencing data. Genomics. 95: 315–327.
  18. 18. Gilles A, Meglécz E, Pech N, Ferreira S, Malausa T, et al. (2011) Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 12: 245.
  19. 19. Longo MS, O'Neill MJ, O'Neill RJ (2011) Abundant human DNA contamination identified in non-primate genome databases. PLoS ONE 6: e16410.
  20. 20. Schmieder R, Edwards R (2011) Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS one 6: e17288.
  21. 21. Götz S, Garcia-Gómez JM, Terol J, Williams TD, Nagaraj SH, et al. (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 36: 3420–3435.
  22. 22. Lettieri T (2006) Recent applications of DNA microarray technology to toxicology and ecotoxicology. Environ Health Perspect. 114: 4–9.
  23. 23. van Straalen NM, Feder ME (2012) Ecological and evolutionary functional genomics - How can it contribute to the risk assessment of chemicals? Environ Sci Technol. 46: 3–9.
  24. 24. Nair PMG, Park SY, Choi J (2011) Analyses of expressed sequence tags from Chironomus riparius using pyrosequencing: molecular ecotoxicology perspective. Environ. Health Toxicol. 26: e2011010.
  25. 25. Armitage PD, Cranston PS, Pinder LCV (1995) The Chironomidae: biology and ecology of non-biting midges. Chapman and Hall, London, UK.
  26. 26. Armitage PD, Moss O, Wright JF, Furse MT (1983) The performance of a new biological water quality score system based on macroinvertebrates over a wide range of unpolluted running-water sites. Water Res. 17: 333–347.
  27. 27. Gabriels W, Lock K, de Pauw N, Goethals PLM (2010) Multimetric Macroinvertebrate Index Flanders (MMIF) for biological assessment of rivers and lakes in Flanders (Belgium). Limnologica 40: 199–207.
  28. 28. León Paumen M, Borgman E, Kraak MHS, van Gestel CAM, Admiraal W (2008) Life cycle responses of the midge Chironomus riparius to polycyclic aromatic compound exposure. Environ Pollut. 152: 225–232.
  29. 29. Marinković M, Verweij RA, Nummerdor GA, Jonker MJ, Kraak MHS, Admiraal W (2011) Life cycle responses of the midge Chironomus riparius to compounds with different modes of action. Envrion Sci Technol. 45: 1645–1651.
  30. 30. OECD (2004a) Guideline for Testing of Chemicals No. 218: Sediment-water chironomid toxicity using spiked sediment. Organization for Economic Co-operation and Development, Paris, France.
  31. 31. OECD (2004b) Guideline for Testing of Chemicals No. 219: Sediment-water chironomid toxicity using spiked water. Organization for Economic Co-operation and Development, Paris, France.
  32. 32. OECD (2010) Guideline for Testing of Chemicals No. 233: Sediment-Water chironomid life-cycle toxicity test using spiked water or spiked sediment. Organization for Economic Co-operation and Development, Paris, France.
  33. 33. OECD (2011) Guideline for Testing of Chemicals No. 235: Chironomus sp., acute immobilisation test. Organization for Economic Co-operation and Development, Paris, France.
  34. 34. Traas TP, van Leeuwen CJ (2007) Ecotoxicological Effects. In: van Leeuwen CJ, Vermeire TG, editors. Risk assessment of chemicals: An introduction, 2nd ed. Springer: Dordrecht, the Netherlands. 281–347.
  35. 35. Steiner G, Suter L, Boess F, Gasser R, de Vera MC, et al. (2004) Discriminating different classes of toxicants by transcript profiling. Environ Health Perspect. 112: 1236–1248.
  36. 36. Swain S, Wren JF, Stürzenbaum SR, Kille P, Morgan J, et al. (2010) Linking toxicant physiological mode of action with induced gene expression changes in Caenorhabditis elegans. BMC Syst Biol. 4: 32.
  37. 37. Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12: 671–682.
  38. 38. Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, et al. (2009) Sequencing and de novo analysis of a coral larval transcriptome using 454GSFlx. BMC Genomcis 10: 219.
  39. 39. Ewen-Campen B, Shaner N, Panfilio KA, Suzuki Y, Roth S, et al. (2011) The maternal and early embryonic transcriptome of the milkweed bug Oncopeltus fasciatus. BMC Genomics 12: 61.
  40. 40. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
  41. 41. Severson DW, Behura SK (2012) Mosquito genomics: progress and challenges. Annu Rev Entomol. 57: 143–66.
  42. 42. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, et al. (1999) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27: 29–34.
  43. 43. Boonstra M, Asselman M, Maas P, Verweij RA, Breeuwer JAJ, et al. (2009) Persistence of chironomids in metal polluted environments. Proc Neth Entomol Soc Meet. 20: 9–15.
  44. 44. Conesa A, Götz S, Garcia-Gómez JM, Terol J, Talon M, et al. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–3676.
  45. 45. Brunner J, Wittink FRA, Jonker MJ, de Jong M, Breit TM, et al. (2010) The core genome of the anaerobic oral pathogenic bacterium Porphyromonas gingivalis. BMC Microbiol 10: 252.