An Updated Insight into the Sialotranscriptome of Triatoma infestans: Developmental Stage and Geographic Variations

Background Triatoma infestans is the main vector of Chagas disease in South America. As in all hematophagous arthropods, its saliva contains a complex cocktail that assists blood feeding by preventing platelet aggregation and blood clotting and promoting vasodilation. These salivary components can be immunologically recognized by their vector's hosts and targeted with antibodies that might disrupt blood feeding. These antibodies can be used to detect vector exposure using immunoassays. Antibodies may also contribute to the fast evolution of the salivary cocktail. Methodology Salivary gland cDNA libraries from nymphal and adult T. infestans of breeding colonies originating from different locations (Argentina, Chile, Peru and Bolivia), and cDNA libraries originating from F1 populations of Bolivia, were sequenced using Illumina technology. Coding sequences (CDS) were extracted from the assembled reads, the numbers of reads mapped to these CDS, sequences were functionally annotated and polymorphisms determined. Main findings/Significance Over five thousand CDS, mostly full length or near full length, were publicly deposited on GenBank. Transcripts that were over 10-fold overexpressed from different geographical regions, or from different developmental stages were identified. Polymorphisms were mapped to derived coding sequences, and found to vary between developmental instars and geographic origin of the biological material. This expanded sialome database from T. infestans should be of assistance in future proteomic work attempting to identify salivary proteins that might be used as epidemiological markers of vector exposure, or proteins of pharmacological interest.


Introduction
Chagas disease is endemic to Latin America [1,2] and is caused by the protozoan parasite Trypanosoma cruzi, which is transmitted to humans by triatomine vectors [3]. Although there are 140 extant species of triatomine bugs, a relatively small number are implicated as human vectors, related to their adaptation to colonize human dwellings. Among these limited numbers of species, Triatoma infestans is recognized as an important vector in South America, being responsible for half of the disease transmission to humans. It historically covered a large geographical range, including Argentina, Chile, Brazil, Paraguay, Bolivia and Peru [4].
When attempting to feed, blood sucking animals inject saliva into their vertebrate hosts' skin to counteract their hemostasis and inflammatory reactions that might otherwise stop blood flow. In particular, anti-platelet and anti-clotting inhibitors, vasodilators and anesthetics are known to occur in these animals saliva as well as in T. infestans [5,6]. Probably because of their hosts' immune pressure against salivary proteins, genes coding for salivary polypeptides in blood sucking arthropods are at a very fast pace of evolution, demonstrated for a set of salivary coding genes from the mosquito, Anopheles gambiae, which showed indications of positive selection [7]. Related to the fast pace of salivary protein evolution, host immune response to vectors can be quite specific and serve as an epidemiological marker of vector exposure [8][9][10][11][12][13][14].
This has also been considered for a T. infestans salivary antigen that might serve as an epidemiological marker of chicken exposure to this insect [15,16]. Its recombinant form, rTiSP14.6, was very effective in detecting differences in infestation levels of T. infestans in Bolivian households by analyzing IgG levels against the corresponding salivary protein using chicken sera [17]. IgM antibodies of chicken sera also reacted with rTiSP14. 6, but compared to IgG immune responses of chickens, no differences were detectable in the overall antibody reactions to either crude saliva or rTiSP14.6 from sera originating from animals at low or high T. infestans infested households [15]. The saliva composition of hematophagous arthropods does not only differ between populations of the same species as analyzed for sand flies [18] and triatomines [19,20], but also between developmental stages [21,22]. Furthermore, the immune response of T. infestans-exposed guinea pigs varies according to the developmental stage (nymphs or adults) and the geographical origin of the colonies [23].
In order to develop an appropriate T. infestans exposure marker, in particular a salivary antigen that will be recognized by sera of triatomine host species exposed to any developmental stage or strain of T. infestans, we aim in this study to use an RNAseq strategy to determine developmental stage and geographical variations in the sialome (from the Greek sialo = saliva) of T. infestans that could eventually be used to design specific immunological markers of vector exposure. Additionally we aim to extend the sialome database of T. infestans that could be used for further functional determination of the identified salivary proteins.

Ethics statement
All experimental exposures of animals to triatomines carried out in the Czech Republic were in accordance with the Animal Protection Law of the Czech Republic (117, Act No. 246/1992 Sb) and with the approval of the Academy of Science of the Czech Republic (protocol approval no. 172/2010) which complies with the regulations of the European Directive 2010/63/EU on the protection of animals used for scientific purposes in Europe.

Triatoma infestans and haplotyping
All T. infestans strains (n = 22) originating from Argentina, Bolivia, Chile and Peru were reared in the insectary at an air temperature of 2861uC, a relative humidity of 60-70% and with a 12 h/12 h light/dark cycle. Supplemental Table S1 summarizes the origin and the collection date of the different T. infestans strains from the natural settings in South America. For colony maintenance triatomines were regularly fed on guinea pigs or rabbits.
Complete sequences of ITS-1, 5.8S and ITS-2 comprising the entire rDNA intergenic region of the different T. infestans strains were analysed and for each T. infestans strain an adult and either a 4 th or 5 th nymphal stage were examined. One or two legs of each bug were used for DNA extraction using methods as previously described [24][25][26] and the primers designed for the flanking regions of the intergenic region derived from [27] and [4]. Amplification procedures and thermal cycler conditions were carried out in a Mastercycle ep Gradient (Eppendorf, Germany) using 30 cycles of 30 s at 94uC, 30 s at 50-55uC and 1 min at 72uC following 30 s at 94uC and 7 min at 72uC. Primers and nucleotides were removed from PCR products by purification using the Ultra Clean PCR Clean-up DNA Purification System (MoBio, USA) according to the manufacturer's protocol. Amplified DNA was resuspended in 50 ml of 10 mM TE buffer (pH 7.6) and the final DNA concentration was determined by measuring the absorbance at 260 nm and 280 nm. Sequencing of the complete intergenic region was performed on both DNA strands by the dideoxy chain-termination method using the Taq dyeterminator chemistry kit for ABI 3130 and ABI 3700 (Applied Biosystems, USA) and PCR primers. Given the importance of the recent discovery of a pseudogenic 5.8S+ITS-2 sequence, named as ''ps(5.8S+ITS-2)'', that is widely distributed in triatomines of North, Central and South America [28], it was assured that no double sequencing signal was present in the chromatograms in order to confirm that variable positions in the intergenic region were not due to an underlying paralogous sequence.

Author Summary
Triatoma infestans is the main vector of Chagas disease in South America. As in all hematophagous arthropods, its saliva contains a complex cocktail that assists blood feeding by preventing platelet aggregation and blood clotting and promoting vasodilation. These salivary components can be immunologically recognized by their hosts and targeted with antibodies that might disrupt blood feeding. The respective antibodies can be used to detect vector exposure using immunoassays. On the other hand, antibodies may also contribute to the fast evolution of the salivary cocktail. In this work, we attempted to identify variations in the salivary proteins of T. infestans using Illumina technology that allowed identification of over five thousand proteins based on over 300 million sequences obtained from ten salivary gland libraries. This expanded sialome database from T. infestans should be of assistance in future work attempting to identify salivary proteins that might be used as epidemiological markers of vector exposure, or proteins of pharmacological interest.
Triatomine exposure experiments, salivary gland dissections and total RNA isolations All exposure experiments were performed with starved triatomines of 14 out of 22 T. infestans strains. These triatomines had starved for two weeks following moulting. Guinea pigs were obtained from our in-house breeding of the animal facility for the triatomine feeding experiments and rabbits from Velaz, Ú nětice, Czech Republic. As summarized in Supplemental Table S1 seven out of the 14 T. infestans strains were fed on guinea pigs and the other 7 strains were fed on rabbits. A total of 1876 triatomines were fed once until a full blood meal was completed (Supplemental Table  S2); feeding times depended on the developmental stage and were about 45-60 min using guinea pigs and 15-30 min using rabbits.
Triatomines were dissected in sterile DEPC/PBS buffer and the salivary glands (SG) transferred into lysis buffer RA1 of the NucleoSpin RNA and RNA XS kits (Macherey-Nagel, Germany). The salivary glands of 2 individuals of each developmental stage and strain of T. infestans were dissected after 6 h, 12 h, and 24 h after feeding followed by additional dissections every 2 days after the triatomine blood meal until each developmental stage moulted into the next stage (see Supplemental Table S2 for development times). Adults (1 female and 1 male per dissection time point) were dissected until female T. infestans started to lay eggs (about 12 days after a blood meal) and SGs of starved T. infestans of all instars and adults were additionally prepared. A total of 1,624 nymphs and 252 adults were dissected and the glands were stored at 270uC. No normalization of the RNA was made, since the pools had equal representation of tissues from their instar developmental time.
From these insects, a total of 10 pooled total RNA samples were prepared for cDNA library constructions and next generation sequencing following the NucleoSpin RNA and RNA XS kits (Macherey-Nagel) manufacturer's instructions (Supplemental Table S1 and Table 1). In particular, 4 total RNA samples contained SG RNA of adults from colony T. infestans strains of Chile (Chile-A, 18 bugs dissected), Peru (Peru-A, 36 bugs dissected), Argentina (Arg-A,18 bugs dissected) and Bolivia (BolCol-A, 108 bugs dissected) and 4 other RNA samples were prepared from nymphal SG RNA of also colony strains from Chile (Chile-N, 116 bugs dissected), Peru (Peru-N, 232 bugs dissected), Argentina (Arg-N, 116 bugs dissected) and Bolivia (BolCol-N, 696 bugs dissected). The further 2 RNA samples were set up of either adult SG RNA from the Bolivian T. infestans strain collected in 2012 (BolNat-A, 72 bugs dissected) or of nymphal SG RNA from field collected T. infestans (BolNat-N, 464 bugs dissected). Each RNA sample contained total RNA from all SGs of all developmental stages of the different T. infestans strains dissected 6 h after a blood meal until the triatomines moulted (covering the entire life cycle) and/or started to lay eggs as in the case of female T. infestans. The total RNA concentration of all 10 samples was measured using the NanoDrop spectrophotometer (Thermo Fisher Scientific, USA), and the RNA of each sample was precipitated with ethanol using a final concentration of 0.3 M Na-acetate, pH 5.2. Precipitated RNA was overlaid with 80% ethanol and stored at 270uC. cDNA library constructions and next generation sequencing All 10 SG RNA samples were sent to the Genomic Sciences Laboratory of the North Carolina State University, Raleigh, NC, USA, for Illumina cDNA library constructions and sequencing. Prior to cDNA library constructions, the RNA quality and concentration were checked with the Agilent Bioanalyzer 2100 using an Agilent RNA 6000 Nano Chip (Agilent Technologies, USA) and 0.5 mg of each cDNA library was used to prepare Illumina cDNA sequencing libraries. Poly-A mRNA was purified using the oligo-dT beads provided in the NEBNExt Poly(A) mRNA Magnetic Isolation Module (New England Biolabs (NEB), USA). Libraries were prepared using the NEBNext Ultra Directional RNA Library Prep Kit (NEB) for Illumina and indexed with the NEBNext Mulitplex Oligos for Illumina (NEB). The Poly-A mRNA was chemically fragmented and primed with random oligos for first strand cDNA synthesis with a heating step of 94uC for 5 min. First strand cDNA synthesis was performed with an incubation time of 10 min at 25uC, 50 min at 42uC, and 15 min at 70uC. Second strand cDNA synthesis was carried out with dUTPs to preserve strand orientation information. The sample was purified, end repaired and dA-tailed prior to adaptor ligation. Illumina Multiplex Adaptors were ligated, the ligation reaction was purified according to the protocol for a 500-700 bp insert, and a PCR amplification of 15 cycles was performed. The PCR product was purified and the cDNA libraries were checked for quality and concentration with the Agilent Bioanalyzer 2100 using a High Sensitivity DNA Chip

Bioinformatic analysis
Bioinformatic analyses were conducted following methods described previously [37,38]. Briefly, the fastq files were trimmed of low quality reads (,10, rejecting those that have an average , 20) and concatenated for single-ended assembly using the Abyss [39] and Soapdenovo Trans (http://arxiv.org/ftp/arxiv/papers/ 1305/1305.6760.pdf) assemblers using k parameters from 21-91 in 10 fold increments. The combined fasta files were further assembled using a iterative blast and cap3 pipeline as previously described [40]. Coding sequences were extracted based on the existence of a signal peptide in the longer open reading frame (ORF) and by similarities to other proteins found in the Refseq invertebrate database from the National Center for Biotechnology Information (NCBI) as well as proteins from Hemiptera deposited at NCBI's GenBank and from Swissprot. CDS were automatically annotated based on a program written by JMCR that searched a vocabulary of ,300 words on the matches of the Swissp and Refseq databases, as well as the CDD, KOG and PFAM databases. This automatic annotation was further refined by manual annotation when needed. Transposable elements were discovered by RPSBlast of the transcripts against a PSIBLASTmade database derived from the clusterization at 90% identity on 50% of the larger sequence length of all larger open reading frames available in the Repbase database [41]. An e value of 1e-15 was considered as a threshold call. Reads for each library were mapped on the deducted coding sequences using blastn with a word size of 25, 1 gap allowed and 96% identity or better required. Up to five matches were allowed if the scores were the same as the largest score. A X 2 test was performed for each CDS to detect statistically significant differences between the number of reads in paired comparisons. The results of these tests are mapped to the hyperlinked excel sheet presented as a supplemental file S1. The normalized ratio of the reads were calculated as r16R2/ [R16(r2+1)] and r26R1/[R26(r1+1)] where r1 and r2 are reads for libraries 1 and 2, and R1 and R2 are total number of reads from libraries 1 and 2 mapped to all CDS. One was added to the number of reads in the denominator to avoid division by zero. Sequence alignments were done with the ClustalX software package [42]. Phylogenetic analysis and statistical neighbor-joining bootstrap tests of the phylogenies were done with the Mega 6.0 package [32]. For visualization of synonymous and non-synonymous sites within coding sequences, the tool BWA aln [43] was used to map the reads to the coding sequences (CDS), producing SAI files that were joined by BWA samse which was converted to BAM format and sorted. The samtools package [44] was used to do the mpileup of the reads (samtools mpileup) and the bcftools program from the same package was used to make the final variant call format (VCF) file containing the single nucleotide polymorphic sites (SNP), which were only taken if the CDS coverage depth was at least 20 and the quality was 13 or better (default). Determination of whether the SNP lead to a synonymous or non-synonymous codon change was achieved by a program written in Visual Basic by JMCR, the results of which are mapped into the excel spreadsheet shown as supplemental file S1. Heat maps were made with the gplots and heatmap.2 programs [45] from the R package [46] using average normalized data (row values were divided by the row average).

Data accession
The raw RNAseq data were submitted to the Sequence Read Archive (SRA) of the NCBI under bioproject PRJNA238208. The raw data file accession numbers can be found in Table 1. Extracted coding sequences were submitted to the Transcriptome Shotgun Annotation (TSA) portal of the NCBI under accessions GBBI01000001-GBBI01005114.

Selection of T. infestans strains for next generation sequencing
To detect the most genetic variations in the transcriptome of different T. infestans populations from South America, especially between the Bolivian T. infestans strains either kept for several years in our laboratory or collected recently in the field (F1 generation), the haplotypes of 22 T. infestans strains (Supplemental table S1) were identified. T. infestans strains with different haplotypes were selected for Illumina sequencing.
A total of 44 sequences of the ITS-1, 5.8S and ITS-2 were obtained from sylvatic, peridomestic and domestic T. infestans specimens. The sequence alignment revealed the existence of only two combined haplotypes (CH) named T.inf-CH1A and T.inf-CH2A (Table 2). Their length and AT content were of 1304 bp and 1310 bp and 67.41% and 67.56%, respectively. The sequences of the haplotypes T.inf-CH1A and T.inf-CH2A were identical to those found in two different ITS1-5.8-ITS2 CH regions previously reported for T. infestans [4]. The 5.8S gene had a length of 155 bp and an AT content of 41.94% and was identical in all specimens analysed. The relation between geographic origins and genetic characteristics of the populations, whether sylvatic, peridomestic or domestic, suggests that CH1A is present only in Bolivia and Peru and well-established among peridomestic, domestic and sylvatic samples, while CH2A is only found in domestic and peridomestic habitats of Chile and Argentina and in one locality from Bolivia (Table 3). In the 481 bp-long alignment, including the two ITS-2 haplotypes, representing the samples of the T. infestans analysed and the eleven ITS-2 sequences from GenBank, 13 variable positions appeared (2.70%) of which, 3 were substitutions (0.62%), including 2 parsimony-informative positions (P-info) and 1 singleton site, and 10 (2.05%) gapped positions. The analysis of variable positions among all ITS-2 sequences, revealed the existence of only 8 different haplotypes for T. infestans, although there is only the complete sequence of five of them (Table 3). In the 703 bp-long alignment including the only one ITS-1 haplotype representing the samples of the T. infestans analysed and the six ITS-1 sequences from GenBank, 36 variable positions appeared (5.12%) of which, 2 were substitutions (0.28%), including 1 P-info position and 1 singleton site, and 34 (4.84%) gapped positions. The main genetic variation was observed in the minisatellite region between positions 179 and 217 in the alignment ( Table 3). As in the case of the ITS-2, six different ITS-1 haplotypes have been described for T. infestans, although three of them are partial or incomplete sequences. Following the above analysis, and having in consideration the determination of developmental stage and strain specific variations in the T. infestans salivary gland transcriptome, we selected 14 of the 22 available strains of T. infestans (Supplemental Table S1) as follows (strain numbers refer to strain named in the first column of supplemental table S1): All Peruvian, domestic and peridomestic (strains 4 and 5), Chilean, 1 domestic strain (6), and Argentinian, 1 peridomestic strain (30); from the more numerous Bolivian strains  Table 3. Polymorphic sites allowing differences between ITS-2 and ITS1 rDNA haplotypes of T. infestans samples analysed from Bolivia, Peru, Chile and Argentina and those available in GenBank.  Overall characteristics of the sialome of T. infestans

ITS
From the fourteen T. infestans strains selected for next generation sequencing, ten Illumina libraries were prepared and sequenced discriminating nymphal (5 libraries) and adult T. infestans transcripts (5 libraries) of different triatomine populations from Chile, Peru, Argentina and Bolivia, including two Bolivian libraries, one from colony strains and another from insects recently collected in the field (F1 generation) (Supplemental table S1 and Table 1). A total of over 395 million sequences, summing over 109 billion nucleotides, were recovered from the 10 sequenced libraries, ranging from 38 to 48 million sequences, or reads, per library ( Table 1). The reads had an average length of 277 bases and a median of 300 bases after clipping low quality (,10) bases. The assembly allowed the extraction of 11,188 coding sequences (CDS) having an average deducted protein length of 398 amino acids. These CDS mapped a total of 213,756,622 reads from all 10 libraries, or about half of the total reads (Table 4). Following an automated classification, the CDS were classified into putative Secreted, Housekeeping, Transposable element, Viral and Unknown (Table 4). Although the 2,228 CDS of the Secreted class were only 20% of the total 11,188, they accrued 49% of the reads. On average these CDS were assembled each from 46 thousand reads. On the other hand, the Housekeeping class represented 65% of all CDS, collected 49% of the reads, and was assembled with an average of 14 thousand reads per CDS. Transposable element-coding sequences were represented by 5% of the CDS, a larger value than found on previous sialotranscriptomes, which is near 1%. Among the viral sequences identified, the assembly recovered the 5,388 nucleotides of the full length CDS for the nonstructural protein precursor of the Triatoma virus [47], which was assembled from over 171 thousand reads. Other viral fragments were additionally found, some of which may derive from retrotransposable elements.
A description of the sialome of T. infestans was disclosed in 2008 based on 1,534 EST sequences sequenced by Sanger methodology [6]. From this work, 167 protein sequences have been deposited to GenBank, including many lipocalin sequences, which are abundant in triatomine sialomes [48], salivary enzymes such as apyrase, and other triatomine-specific proteins. The current update has identified over 1 order of magnitude additional sequences that appear as full length, or near full length, 5,114 of which have been deposited to GenBank. This submission expands the T. infestans sialome database and should help in the identification of pharmacologically active proteins as well as markers of vector exposure. Presently we will describe highlights of this expanded transcriptome, identify and describe stage specific and geographic specific protein variants, and attempt a measure of the polymorphism of the T. infestans sialome. This work has no biological replicates in the comparisons described below, a fact considered in this discussion.

Highlights of the expanded T. infestans sialotranscriptome
Compared to classical Sanger-derived transcriptomes with a few thousand sequenced EST's, next-generation Illumina libraries derived from hundreds of millions of sequences provides for an unprecedented depth of transcript coverage allowing detection of poorly expressed mRNA as well as their quantitation as hundreds, or millions of reads accrue to each assembled contig. For example, the most expressed CDS in this sialome is for the peptide trialysin, which accrued over 13 million reads and deriving an RPKM = 77,477 when all reads from all libraries are considered. Supplemental file S2 presents a classified version of supplemental file S1 that has the different functional classes sorted by their overall RPKM, allowing for fast identification of well-expressed CDS, including putative secreted proteins. In the following, we present some remarkable findings in this sialotranscriptome.
Enzymes. Serine proteases have been described earlier in T. infestans salivary homogenates [49], and one such enzyme has been found in a previous sialotranscriptome (gi|149689028) [6]. The current sialotranscriptome identified this enzyme coded by TiSigP-124791, with an RPKM = 1,243 plus two other full length enzymes that have an RPKM.700, plus several others with lower expression. TiSigP-124931 is 76% identical with a previously described Panstrongylus megistus salivary trypsin [50]. A salivary metalloprotease was found well expressed (RPKM = 715), being 92% identical to a protein deducted from the T. matogrossensis sialotranscriptome [51]. An asparaginyl peptidase and a Cathepsin L were also found well expressed, with RPKM.790. Apyrases are enzymes that hydrolyze ATP and ADP, mediators of platelet and neutrophil aggregation, and were found to be members of the 59 nucleotidase family in T. infestans [52]. The current sialotranscriptome identified additional well expressed transcripts being only 66% identical to the described salivary apyrase of T. infestans, suggesting that this salivary activity derives from a family of related proteins suggestive of gene duplications, a common motif in salivary genes of blood sucking arthropods [5]. Inositol phosphate phosphatases are also commonly found in the sialotranscriptomes of blood sucking Hemiptera [48]; a few additional representatives of this family were found in the current sialotranscriptome. Protease inhibitors. The Kazal family of protease inhibitors is found expanded and abundantly transcribed in the sialotranscriptome of bloodsucking Hemiptera [48]. Several novel full length representatives were found in the current transcriptome, nine of which have RPKM.650. A pacifastin, 66% similar to a Rhodnius prolixus homologue was identified, with a RPKM = 869.
Lipocalin superfamily. The lipocalin family of proteins is extremely expanded in triatomines where it exerts anti-clotting functions as well as antagonism of biogenic amines, leukotrienes and other inflammatory mediators and carriers of NO in Rhodnius [48,53]. Since many of the differentially transcribed CDS described further below belong to this superfamily, we aligned the sequences discovered in this transcriptome with lipocalins described from previous triatomine sialotranscriptomes, generating a phylogenetic tree with at least 13 different clades containing T. infestans proteins. Of these 13 clades, only 3 have at least one protein that has been studied, including those named Triabin (an anti-clotting protein) [54,55], Pallidipin (an anti-platelet protein that binds thromboxane A2) [56,57] and the antigen Procalin [58]. Supplemental figure S1 displays this phylogenetic tree and indicates near the CDS name those that were previously characterized, as well as those that are differentially expressed in nymphs, adults or particular geographic populations. The tree also shows that over 40 novel lipocalin sequences were found in this T. infestans sialotranscriptome.
Salivary odorant binding proteins. Two members of the odorant binding protein family have been found expressed in the sialotranscriptome of T. infestans (gi|149689106 and gi|149689110). Eight additional members were found in the current transcriptome, underlining the gene expansion of this family within T. infestans. Two of these novel transcripts are well expressed, with RPKM.1000.
Antigen 5 family. Eight full length or near full length novel members of this family were identified, three of which with RPKM.1000. Members of this family expressed in the salivary glands of Dipetalogaster maxima were shown to possess superoxide dismutase activity [59].
Antimicrobial peptides and lysozyme. Twenty transcripts coding for defensins, diptericin, attacin and lysozyme were found, only one of which has been previously identified. Antimicrobial peptides are commonly found in sialotranscriptomes of blood sucking arthropods [5].
Deorphanized and novel putative secreted proteins.
Supplemental file S2 identifies 19 CDS similar to previously identified orphan Hemiptera proteins, five of which being well expressed (RPKM.1000), and over 1,000 additional CDS coding for putative secreted proteins not producing a good match to any known protein; however, only 13 accrue a RPKM.500.

Differentially expressed transcription
To obtain a general view of the differential transcriptome expression among the 10 libraries, we produced a hierarchical clustering-based heat map of average normalized RPKM values for the contigs with overall RPKM equal or larger than 20, totaling 4,207 CDS (Fig. 1). Notice that following normalization, the average row value is equal to one, with transcripts expressing above average having values above one and vice versa. The resulting clusterization is complex and shows that only a minority of the CDS have 4 or more units above average (represented by white or red color in the graph). Some of these transcript differences will be further highlighted in the following sections.

Differentially expressed transcripts according to developmental stage
Nymphal specific proteins. We combined the number of reads per CDS from all nymphal and all adult libraries allowing detecting nymphal enriched and adult enriched sequences, independent of geographic origin. A total of 180 CDS were significantly 10 fold or more expressed in nymphal than in adult libraries ( Table 5, supplemental file S3). Interestingly, among the most nymphal-specific CDS are 43 coding for cuticle proteins, including the homolog of Acyrthosiphon pisum cuticular protein CPG12-like precursor, assembled from 1,153 reads from nymphal libraries and zero from adult libraries. Nymphal cuticle proteins are different from the hardier adult cuticle and the genes coding for these CDS are developmentally regulated [60]. Similarly the sequence coding for juvenile hormone (JH) acid methyltransferase was assembled from 1,053 reads from nymphal libraries, and zero from adult libraries. This enzyme is crucial for nymphal developmental processes and is thus expected to be enriched in nymphs. Two sequences coding for JH binding proteins are also overexpressed in nymphs. Three sequences coding for cytochrome P450 enzymes were overexpressed in nymphal libraries and may be related to hormonal regulation or cuticle metabolism. A gene coding for chorion peroxidase may also be related to cuticle metabolism. These CDS probably originate from contaminating tissues during dissection, but the high number of sequences (near half billion) allowed for their assembly.
Among the products coding for secreted proteins overexpressed in nymphal libraries, 8 lipocalins were found, with an average RPKM over 2,500, and being 40 to 80 times more expressed in nymphs. This family of proteins is associated with agonist scavenging (kratagonist) as well as anti-clotting activities [48]. Two members of the antigen 5 family, some of which have been described to have superoxide dismutase activity [59] were also found with increased nymphal expression, with average RPKM of 1,340 ( Table 5).
The interested reader can sort the provided Excel spreadsheet (supplemental file S1) in the desired comparison columns to obtain the names and sequences identified above and in the following sections of this manuscript.
Adult specific proteins. When adult overexpressed CDS were investigated, only 29 were found 106more expressed than in the combined nymphal libraries ( Table 6, supplemental file S3). These included two coding for vitellogenin, an oocyte storage protein produced by adult females [61], assembled most probably from contaminating fat body. These two CDS were 235 and 607 times more expressed in adult libraries. Another adult-specific marker found was a transcript coding for the doublesex transcription factor, involved in adult sex determination [62]; it was assembled from 248 reads from the adult libraries and 5 from the nymphal libraries, having an RPKM of 2.8 related to the adult libraries. Among the secreted products, a member of the GGY family, related to antimicrobial peptides in C. elegans [63] and homologous to Rhodnius prolixus salivary proteins named for their Gly-Gly-Tyr repeats [64] was found 106 overexpressed in adult libraries (Table 6). Transcripts coding for two lipocalins were 13 and 10.6 times overexpressed in adults, and their average RPKM was over 10,000.
It has been previously reported in R. prolixus that members of the nitrophorin family are progressively expressed from the first instar nymph to the last and in adults [65]. This developmental control of gene expression is suggestive of epigenetic regulation [66][67][68] involving DNA-modifying as well as histone-modifying enzymes. We have identified 42 transcripts in the T. infestans sialotranscriptome that might be involved in epigenetic regulation, including several full length histone-lysine-N-methyltrasferases, histone deacetylase complex members and sirtuins. All have similar expression among nymphal and adult libraries, with RPKM varying between 2 and 470. Identification of these gene products can help the design of RNAi experiments to evaluate epigenetic regulation of gene expression in T. infestans.

Differentially expressed transcripts according to geographical origin
For these analyses, nymphal and adult reads from each geographical area were combined and compared with the sum of reads from the other libraries, to identify differentially expressed CDS specific to each region. We report here only those CDS that are significantly expressed 106 or more according to these comparisons.
Argentinian overexpressed transcripts. Twenty four transcripts were found 10 fold or more overexpressed in Argentinian bugs (Table 7, supplemental file S3). Six lipocalins with high expression levels include Ti-133850 (RPKM = 8,622, other libraries RPKM = 16.5), one Kazal-domain containing peptide and 4 novel putative secreted proteins constituting gene products coding for candidate secreted proteins overexpressed in Argentinian Triatoma (Table 7). Transposable elements were also identified. One highly expressed contig of unknown class (Ti-123232, RPKM = 13,318) codes for a protein that is 36-46% identical to three Ixodes scapularis proteins and has no other significant matches to the NR database; the remainder libraries account for a small RPKM = 1.65. Interestingly, the Argentinian population also differentially expresses a member of the cytochrome c oxidase polypeptide IV (RPKM = 38.4, others = 0.17), which is a nuclearly encoded mitochondrial gene. A second isoform of this enzyme is also found in the database, but it is not differentially expressed among the populations (TiSigP-123830), and has a higher RPKM in the Argentinian population of 378, while all other populations it is 405.
Bolivian colony overexpressed transcripts. Only 10 transcripts were found 10 fold or more overexpressed in the Bolivian colony, all with relatively low transcription levels, the maximum RPKM being 20 (Table 8, supplemental file S3).
Among these a viral transcript was found best matching a structural protein precursor of Drosophila A virus, with 26% identity and 45% similarity over 442 amino acids. It has an RPKM of 7.1 in the Bolivian library and only 0.0097 in the combined remaining libraries.
Bolivian F1 overexpressed transcripts. On the other hand, when the Bolivian F1 population was analyzed for overexpressed transcripts, 30 were found, including 15 lipocalins, several of which appear well expressed (Table 9, supplemental file S3). Among these, Ti-130968 has a RPKM of 2,888 from the Bolivian F1 reads while all other reads amount to a value of only 5.26. Similarly, TiSigP-123883 has a high expression value (RPKM = 18,053) deriving from the Bolivian F1 library but only 59.3 from the remainder libraries.
Peruvian overexpressed transcripts. Thirty nine transcripts were found 10 fold or more overexpressed in the Peruvian library, including three highly expressed lipocalins and 10 other putative secreted peptides of low expression (Table 11, supplemental file S3). Of note are several viral transcripts of moderate to high RPKM values (ranging from 38-3,810), but significantly higher expressed in Peruvian bugs, being so over 1,000 fold overexpressed. This set includes Ti-133027 and Ti-126254 which are 26-36% identical and 41-52% similar to the Drosophila A virus mentioned above, but are not the same contigs described above in the Bolivian colony section. Ti-130944 and Ti-141814 both best match the Sugarcane yellow leaf virus TNA-directed RNA polymerase, being 30% identical and 46% similar over a stretch of 413 amino acids. Some of the eight transcripts of the unknown class produce matches to viral proteins, albeit with an e value higher than 1e-15 and could also derive from viral products.

Salivary gland polymorphism
Single nucleotide polymorphisms were identified from each of the 10 libraries and mapped into the derived CDS. A rich text format (RTF) file is provided for each CDS showing the polymorphisms, hyperlinked to the Excel supplemental file S1. Of the 11,188 deducted CDS, one or more polymorphic sites were identified in 5,391 when all data was combined (Table 12). The classes Viral, Transposable element, Unknown and secreted had the highest percentages of both synonymous (S) and nonsynonymous (NS) polymorphisms, and also displayed high NS/S  ratios, although the highest ratio derived from the protein synthesis category (Table 12). Sixty three members of the lipocalin family had on average 2.5 NS polymorphisms per 100 codons while the secreted class rate was 0.83 S/100 codons (results not shown). This high rate of NS to S rate suggests that these genes are under positive selection, as has been suggested for a subset of salivary genes of the mosquito Anopheles gambiae [7]. The degree of polymorphism was also compared among the ten sequenced libraries using the 5,391 polymorphic CDS mentioned above (Fig. 2). Results indicate that polymorphisms are lower in the Peruvian and Bolivian colony libraries, but somewhat surprisingly the nymphal Argentinian library has significant less polymorphism than the adult counterpart, as is the case of the Bolivian F1 library. The Chilean strain that was collected in 1979 has higher polymorphism than the Peruvian strains collected 30 years later, or the recently collected (2012) Bolivian strains. The observed differences do not appear to derive from CDS coverage depth differences between libraries, because the average RPKM for each library were not significantly different, the minimum being 49.164.7 (average 6 SE) for the Bolivian adult colony and the maximum being 60.568.51 for the Peruvian adult library, considering the 5,391 CDS. The differences here observed between nymphs and adults as well as from different geographical areas are congruent to previous T. infestans salivary immunological studies [23]. The polymorphism differences between long colonized and more recently colonized insects indicate that colonization time not necessarily leads to reduced polymorphism, which may be maintained by the immune pressure created by using and reusing live animals for colony maintenance.

Conclusions
Discovery-driven research, as opposed to conventional hypothesis-driven work, should provide or enlarge the knowledge platform to allow further development of hypothesis-driven research. This study enlarges the sialome database for T. infestans and at the same time provides information on differentially expressed transcripts based on developmental stage and geographical origin. As a result, 5,114 sequences are now publicly  available from GenBank and should be of value in assisting proteomic experiments attempting to determine proteins that might be used to immunologically identify T. infestans exposure, as proposed in [23], as well as identification of pharmacologically active salivary proteins of T. infestans. Polymorphisms were mapped to individual coding sequences and its differential distribution was identified among geographical strains and developmental stages. This should be of help in developing specific markers of exposure, as well as to avoid extreme polymorphic proteins. Several novel viral and transposon sequences were discovered, which may stimulate virologists or scientists interested in transposable elements to describe new insect viruses or transposable elements. The awkward finding of a second cytochrome oxidase IV gene product in the Argentian population may represent a second horizontal transfer of this gene from a bacterial or mitochondrial genome and remains to be investigated. Despite the lack of biological replicates in our study, the observed increased sequences associated to cuticle proteins and JH metabolism in nymphs, as well as the increased vitellogenin and doublesex transcription factor in adults which are expected from the known insect physiology, serve to validate the broader differences in transcript abundance found among the different libraries despite lack of replicates, but these should be verified in future follow up work.    Supporting Information Figure S1 Bootstrapped phylogenetic tree of triatomine lipocalins. The sequences discovered in this work are represented by Tior TiSigp-followed by the contig number. Other sequences were derived from GenBank and are indicated by the first three letters of their genus name, followed by the first three letters of their species name followed by their gi| accession number. Triatoma infestans sequences found in this study are marked with a red symbol. Those from GenBank are marked with a blue symbol. The tree was built from a ClustalX alignment using the NJ algorithm from the Mega package following 1,000 bootstrap iterations. The numbers at the branches represent the bootstrap percentage support when larger than 50. (RTF) Table S1 Origin of the T. infestans strains from South America, total SG RNA samples prepared for next generation sequencing and haplotypes obtained for the complete intergenic rDNA region (ITS-1, 5.8S, ITS-2). (DOC)