Long read single cell RNA sequencing reveals the isoform diversity of Plasmodium vivax transcripts

Plasmodium vivax infections often consist of heterogenous populations of parasites at different developmental stages and with distinct transcriptional profiles, which complicates gene expression analyses. The advent of single cell RNA sequencing (scRNA-seq) enabled disentangling this complexity and has provided robust and stage-specific characterization of Plasmodium gene expression. However, scRNA-seq information is typically derived from the end of each mRNA molecule (usually the 3’-end) and therefore fails to capture the diversity in transcript isoforms documented in bulk RNA-seq data. Here, we describe the sequencing of scRNA-seq libraries using Pacific Biosciences (PacBio) chemistry to characterize full-length Plasmodium vivax transcripts from single cell parasites. Our results show that many P. vivax genes are transcribed into multiple isoforms, primarily through variations in untranslated region (UTR) length or splicing, and that the expression of many isoforms is developmentally regulated. Our findings demonstrate that long read sequencing can be used to characterize mRNA molecules at the single cell level and provides an additional resource to better understand the regulation of gene expression throughout the Plasmodium life cycle.

Introduction 74% of all reads to the P. vivax P01 genome sequence [36]. While full-length mRNAs are captured and converted into cDNA in the 10X droplets, only the 3'-end of each transcript is sequenced due to the cDNA fragmentation occurring during library preparation (Fig 1). All 3'-end scRNA-seq reads should therefore derive from the last~300 bp of the transcripts (see e.g., S1 Fig). However, only 27-33% of the reads mapped within annotated P. vivax genes and 47-52% mapped more than 500 bp away from any annotated gene S1 Table). These results are consistent with previous analyses [24] and highlight that many genes and/or their 3'-UTRs remain incompletely annotated in the P. vivax genome. After removing PCR duplicates and stringent quality control filters, we obtained information about 949 and 1,807 single cell blood-stage transcriptomes, each characterized by more than 5,000 unique reads ( Table 1). To characterize which blood stages were present in each infection, we analyzed these transcriptomes using principal component analysis and showed that several asexual and sexual developmental stages were present in both samples, although with some differences in their relative proportions (S2 Fig).
We also prepared two scRNA-seq libraries from salivary gland sporozoites dissected from Anopheles stephensi and Anopheles freeborni, respectively, fed on Saimiri monkeys infected  with P. vivax parasites. Out of the 63,039,836 and 79,688,059 reads generated from each library, 3-7% mapped to the P. vivax genome sequence, similar to the numbers obtained previously for P. berghei sporozoites [21]. 25-26% of those reads mapped within annotated P. vivax genes and 58-59% mapped more than 500 bp away from any gene (S1 Table). After removing PCR duplicates and stringent quality control filters, we obtained information about 2,609 and 2,363 single cell transcriptomes, each characterized by more than 250 unique reads (Tables 1  and S1). We used a lower threshold for the sporozoite analysis due to the lower number of P. vivax reads recovered as a result of the overwhelming presence of mosquito RNA (but the cutoff is comparable to those previously used for analyzing sporozoite scRNA-seq data [21]). In contrast to the heterogeneity of the blood stage parasites, these salivary gland sporozoites formed a relatively homogeneous population (S2 Fig).

Long read sequencing of scRNA-seq libraries provides full-length transcript information
Since only the 3'-end of each transcript is usually sequenced in scRNA-seq experiments, it is sometimes difficult to assign the scRNA-seq reads to a specific gene, or to identify signals derived from different gene isoforms [24]. We therefore used cDNA from the same four 10X scRNA-seq libraries before fragmentation to generate full-length isoform sequences using the PacBio chemistry (see Fig 1 and Material and Methods for details).
From each blood sample, we generated 3,390,270 and 3,570,838 circular consensus sequences (CCS) that each derived from at least 10 passes of sequencing of an individual cDNA molecule. 50-64% of these sequences carried a 10X barcode matching one of the cells characterized by Illumina sequencing, and >99% of those sequences mapped unambiguously to the P. vivax genome (Tables 1 and S1). After removal of PCR duplicates, we ended up with 1,688,745 and 1,523,456 unique reads (each derived from a unique mRNA molecule) for further analysis ( Table 1). While the Illumina and PacBio library preparations diverged after cDNA amplification, the same barcoded cDNA molecules were used for both experiments (Fig 1) and the numbers of reads obtained with each technology for each individual cell were highly correlated (S3 Fig).
For the sporozoite samples, we generated 2,578,778 and 1,935,640 circular consensus sequences. 5-8% of these sequences carried a 10X barcode matching one of the cells characterized by Illumina sequencing, and 50-65% of those sequences mapped unambiguously to the P. vivax genome. After removal of PCR duplicates and additional QCs, we ended up with 106,531 and 40,977 unique reads respectively for further analysis (Tables 1 and S1). Similar to the pattern observed for the blood stage samples, the numbers of reads obtained by Illumina and PacBio sequencing for each sporozoite were highly correlated (S3 Fig) although, due to the lower sequencing depth, the overall number of cells with PacBio information was limited to 2,449 and 1,670.
We then summarized these mapped reads into P. vivax transcripts and, to avoid including sequences that may represent partially degraded molecules or technical artefacts, we only considered predicted transcripts represented by more than 10 PacBio unique reads in both samples of the same type (e.g., in both blood-stage samples). Overall, we identified a total of 9,982 transcripts from blood stage parasites and 784 from sporozoites (Tables 1 and S2), with an average transcript length of 2,432 bp from blood stages and 2,054 bp from sporozoites and ranging from 100 to 20,506 bp and from 197 to 6,809 bp, respectively. (Note that the transcript prediction algorithm occasionally leads to artefacts when transcripts in the same orientation overlap, which seems to be the case for the 20 kb transcript).

Most transcripts encode proteins corresponding to the genome annotation
Out of the 9,982 blood stage transcripts, 8,550 transcripts were predicted to encode more than 100 amino acids and 5,368 of those (63%) were predicted to encode a full-length protein (including a start and stop codon) (S1 Table). The median length of these full-length protein sequences was 291 amino acids, significantly smaller than the length of the protein sequences annotated in the latest version of the P01 P. vivax genome [36] (median of 401 amino acids) (S4 Fig). This discrepancy is possibly due to the lower 10X reverse transcriptase processivity that may hamper synthesis of extended cDNA molecules and prevent recovery of transcripts from long genes (note that most full-length transcripts obtained from PacBio sequencing match the annotated reference, see below). Of the 784 sporozoite transcripts, 563 transcripts were predicted to encode more than 100 amino acids and 344 of those (60%) were predicted to encode a full-length protein (including a start and stop codon), with a median length of 267 amino acids (S4 Fig and S1 Table).
We then compared the protein sequences predicted from these full-length transcripts to all P. vivax protein sequences annotated in the P01 genome (see Most transcripts were predicted to encode protein-coding sequences highly similar to those annotated in the genome: 4,553 of the blood-stage transcripts (84.8%) and 254 of the sporozoite transcripts (73.8%) were identical to an annotated P. vivax protein-coding sequence for >90% of their length (S6 Fig, see e.g., Fig 2A). Additionally, 378 blood-stage (7.1%) and 23 sporozoite transcripts (6.7%) partially matched known protein-coding sequences and were identical over 50-90% of their sequence. While some of these transcripts could represent instances where the current gene annotation is possibly incorrect (see e.g., Fig 2B), in 253 of these cases (63%) another isoform in our dataset matched the entire annotated protein-coding sequence, suggesting that these differences represent incomplete annotations of transcript isoforms rather than incorrect annotations (see also below). One example is the cytochrome b5-like heme/steroid binding protein (PVP01_0716500) that is transcribed as annotated in the genome by blood-stage parasites but, due to an alternative start site, is transcribed into a shorter mRNA by sporozoites, resulting in a shorter predicted protein (Figs 2C and S7). Finally, 436 blood-stage (8.1%) and 67 sporozoite transcripts (19.5%) aligned over less than 50% of their length to annotated protein sequences but those transcripts were very short (less than 200 amino acids) and/or had low support from the PacBio reads and likely represented artifacts or fragmented transcripts. The complete list of predicted transcripts, their sequence and their identity to reference transcripts is presented in S2 Table.

Most P. vivax transcripts have extended UTRs that often contain introns
Even for the transcripts highly similar to the annotated protein-coding genes, the PacBio sequences add novel information by providing a detailed description of the 5'-and 3'-UTRs which, despite recent efforts [28], remain incompletely annotated in P. vivax. Consistent with previous reports [11,15], our data showed extensive UTRs in many genes (Fig 3). We observed that 5'-and 3'-UTRs were roughly similar in length in transcripts expressed by blood-stage parasites, while 5'-UTRs were, on average, slightly longer than 3'-UTRs in sporozoite transcripts (745 vs 731 bp in blood stages [p-value = 0.3149] and 762 vs 650 bp in sporozoites [pvalue = 0.0076]). To evaluate how this improved characterization of UTRs would affect future analyses of scRNA-seq data, we compared the percentage of scRNA-seq reads mapped to the currently annotated P. vivax genes with the percentage mapped to annotations supplemented by our predicted isoforms (available in S1 Data). While only 27-33% of the scRNA-seq reads mapped within previously annotated genes for the blood stage parasites and 25-26% for the sporozoites, including the PacBio predictions raises these figures to 69-77% and 65-70%, containing two annotated P. vivax genes, the ubiquitin fusion degradation protein 1 (PVP01_1330900) and an ATP synthase-associated protein (PVP01_1331000). The blue horizontal bars at the bottom shows the annotations for these genes in plasmoDB v54, while the top panel shows the PacBio reads mapping to this locus (each red horizontal line is a unique read mapped to the positive strand with the grey lines indicating spliced introns). Note that, while the PacBio reads support a shorter 3'-UTR than annotated for PVP01_133100, the predicted protein coding sequences are identical to the ones annotated. (B) Example of protein coding transcript differing from the current gene annotation. The figure shows 5 kb of chromosome 13 surrounding the rhoptry-associated protein 1 (RAP1, PVP01_1338500). The PacBio reads (mapped to the negative strand and displayed in blue) support the presence of an unannotated intron (red box), leading to additional predicted coding sequences upstream of this intron, and a different protein than annotated in the genome (thick blue bars at the bottom). (C) Example of two isoforms with different predicted protein coding sequences. The top panel shows that blood-stage parasites express a transcript for cytochrome b5-like heme/ steroid binding protein (PVP01_0716500) identical to the annotated protein-coding sequence (although with a shorter 5'-UTR). The middle panel shows that P. vivax sporozoites express this gene from a different start site (red box) resulting in a shorter transcript and a different predicted protein. (Note also the presence of an unannotated, and alternatively spliced, intron in the 3'UTR).
While UTRs have not been extensively studied in Plasmodium, in many eukaryotes they contain important regulatory elements which often affect mRNA stability [37][38][39][40][41]. While the length of the 5'-UTR does not appear to be associated with the level of the transcript expression determined using the short reads (p = 0.315), the 3'-UTR length was negatively correlated with expression level (p = 2.8x10 -6 ), although with a very low coefficient of correlation (Spearman's Rho = -0.0798) (S8 Fig). In an attempt to identify regulatory elements in the UTRs, we compared the abundance of all possible 5-mers in the 5'-UTRs, 3'-UTRs and promoter regions but failed to detect significant enrichment (S9 Fig). More in-depth analyses will be required to understand the function of these UTRs but the data presented here will provide a solid foundation to implement these studies.
Interestingly, out of the 5,368 full-length blood stage protein-coding transcripts, 1,072 (20%) had at least one intron in the UTRs: 419 transcripts had at least one intron in the 5'-UTR, 568 in the 3'-UTR and 89 had introns in both UTRs (see e.g., S10 Fig). Of these, 322 transcripts had more than one intron in their UTRs. Of the 344 full-length sporozoite transcripts, 26 had introns in the UTRs, 12 had at least one intron in the 5'-UTR, 11 in the 3'-UTR and 3 had introns in both UTRs. Five of these transcripts had more than one intron in the 5'or 3'-UTR. Remarkably, in blood-stage parasites the levels of expression of the transcripts with an intron in the UTR were, on average, 3-fold higher than those of genes with no UTR intron (Wilcoxon rank test, p-value < 2.2x10 -16 ), suggesting that presence of introns in the UTR may be associated with increased mRNA stability in P. vivax. Furthermore, genes with introns in their coding sequences were also three times more likely to have introns in the UTR (χ 2 = 206.78, p-value < 2.2e-16), which may suggest that UTR splicing is mechanistically associated with splicing of coding sequences, possibly due to a more effective recruitment of the splicing machinery at those transcripts. To evaluate whether the presence of UTR introns was more frequent, at a specific developmental stage, we assigned each gene to the blood stage where it was most abundantly expressed. The proportion of genes with introns in their UTR was significantly different among stages (χ 2 = 26.141, p-value = 8.9 x 10 −6 ), with 11% and 18% of the genes most expressed in early and late trophozoites having an UTR intron, respectively, compared to only 3% and 8% of the genes expressed in female gametocytes and schizonts (note that genes with multiple isoforms were excluded from this analysis as it is difficult to robustly quantify the relative expression of isoforms). Finally, we tested whether genes containing introns in their UTRs disproportionally belonged to a specific pathway but failed to detect any gene ontology enrichment (FDR < = 0.1, S3 Table).

Transcript isoforms are common in P. vivax and can be expressed in a stage-specific manner
The 5,368 full-length protein coding transcripts derived from blood-stage parasites were transcribed from 2,869 genes: 1,687 genes (59%) were transcribed into a single isoform, while 1,182 (41%) showed evidence of multiple isoforms (and out of those, 719 genes were expressed in two isoforms, 247 in three and 216 were transcribed in four or more isoforms). Most isoforms (n = 819, 70%) encoded the same protein sequence and differed only in their UTR length: 1,024 (87%) genes with isoforms differed in their 5'-UTR length and 1,008 (85%) differed in the 3'-UTR length, with 450 genes with isoforms differing in their UTR introns ( Table 2, see also S10 Fig for an example). 363 genes showed evidence of isoforms that were predicted to encode for different proteins, due to an alternate protein coding start (299), alternative end (311), and/or exon skipping (67) ( Table 2).
In sporozoites, the 344 full-length protein coding transcripts were transcribed from 238 genes: 211 genes (89%) were transcribed into a single isoform, while 27 (11%) showed evidence of multiple isoforms (20 genes were expressed in two isoforms and three in three isoforms and four in four or more). All but eight isoforms were predicted to encode the same protein. Five of these genes had an alternate coding start and six had an alternate coding stop resulting in the change in coding sequence, and all had an exon skip or truncation ( Table 2).
Since the full-length transcript data derived from molecules characterized by scRNA-seq, these data provide a unique opportunity to preliminarily examine whether different isoforms were expressed at different stages of the parasite development. Using the 10X cell barcodes, we determined the developmental age of each cell using pseudotime (determined from the Illumina scRNA-seq data). We only considered in this analysis 636 genes that had two (or more) isoforms expressed in at least 50 individual cells each. 123 (19%) of these genes showed evidence of expressing isoforms according to the parasite development (S4 Table). These stagespecific isoforms included 62 and 35 genes with differences in 5'-or 3'-UTR length (see e.g., Figs 4 or S11A) and/or changes in coding sequences (n = 40) (S11B Fig). These findings suggest that P. vivax utilizes alternative start or termination of transcription as a means of transcriptional or translational regulation between stages.

Conclusion
Long read PacBio sequencing coupled with short read Illumina sequencing of single cell RNAseq libraries provide a robust method to detect and characterize full-length transcripts and to identify mRNA isoforms expressed at different times. These simple modifications of existing laboratory protocols for generating 10XGenomics scRNA-seq can be easily applied to any eukaryotic cells and could be invaluable to examine variations in gene expression in organisms that are difficult to study in the laboratory and display a complex life cycle. This study also yielded novel insights on the variations and regulation of gene expression in P. vivax, which provides a solid framework to improve our understanding of P. vivax biology and regulation. This study notably highlighted the presence of extensive and incompletely annotated UTRs, and of ubiquitous UTR introns, that were sometimes expressed in a stage-specific manner. In particular, the observation that different stages used different start sites for transcribing the

Ethics statement
All animal procedures were conducted in accordance with the National Institutes of Health (NIH) guidelines and regulations [42], under approved protocols by the National Institute of Allergy and Infectious Diseases (NIAID) Animal Care and Use Committee (ACUC) (Animal study NIAID LMVR15). Animals were purchased from NIH-approved sources and transported and housed according to Guide for the Care and Use of Laboratory Animals [42].

Animal studies and sample collection
We infected two splenectomized Saimiri boliviensis monkeys with the Chesson strain of P. vivax using parasitized erythrocytes from cryopreserved stocks. Once they developed a parasitemia >0.1%, we collected 1 mL of blood from the femoral vein of each monkey after anesthesia with 10 mg/kg of ketamine and processed the blood samples on MACS LS columns as previously described [24]. Note that this enrichment procedure, that relies on the paramagnetic properties of hemozoin generated by maturing blood-stage parasites, fails to capture ring stage parasites. Two blood samples from two additional Saimiri boliviensis monkeys coinfected with a NIH-1993 clone [24] and the Chesson P. vivax strain were used for membrane feeding of Anopheles stephensi and Anopheles freeborni. Salivary glands sporozoites were collected from each feeding at 21 days post-feed: 50 female mosquitoes were anesthetized on ice and their salivary glands dissected in PBS under a stereomicroscope. The salivary glands were transferred to a low-retention tube (Protein LoBind Tube; Eppendorf) containing PBS, homogenized with a disposable pestle, spun down, washed, resuspended, and quantified.

10X single cell RNA-sequencing library preparation and sequencing
An estimated 3,000 infected red blood cells or sporozoites from each sample were loaded onto 10X Chromium controller to prepare scRNA-seq libraries according to the manufacturer's instructions. We then generated, from each library, 57-75 million paired-end reads using an Illumina NovaSeq.
In addition, an aliquot of the cDNA prior to fragmentation was amplified by eight additional cycles of PCR before preparation of a PacBio library using the SMRTBell Express kit 2.0. We then generated 196-328 million reads from each library using a PacBio Sequel II.

Short read analysis and single cell characterization
Following Illumina sequencing, the short reads were processed as described previously [24]. Briefly, we mapped all reads to the P01 P. vivax genome [36] using Hisat2 [43] with the default parameters except for a maximum intron length of 5,000 bp. We then removed PCR duplicates by identifying reads with identical barcode, unique transcript identifier (UMI) and mapping coordinates. We assigned each unique read to i) a cell based on its barcode and ii) a 500 bp window based on its genomic position. Only cells defined by more than 5,000 unique reads in blood stages and 250 unique reads in sporozoites were further analyzed. Count matrixes and principal component analysis was performed using in-house scripts.

Long read analysis
The entire analytical pipeline for processing and analyzing the long-read data is described in S5 Fig. Briefly, the raw PacBio reads were first collapsed into circular consensus sequencing (CCS) using smrtanalysis [44] and only CCS supported by more than 10 passes were considered for further analysis. We then compared the 10X barcodes of each CCS reads with those obtained after Illumina sequencing and kept all reads matching the barcodes of one of the cells characterized by more than 5,000 unique Illumina reads. To account for the higher error rate of PacBio sequencing we allowed up to one nucleotide mismatch in the barcode sequence. After trimming the 10X adapters, 10X barcodes and the polyA tails using custom scripts, we mapped all CCS reads to the P. vivax P01 genome using minimap2 [45] using the cdna parameters and a k of 14. PCR duplicates were removed as described above. Only reads mapped to the P01 genome over 50% of their length were kept for further analysis.

Transcript and protein identification
Transcript prediction was performed separately for each sample utilizing Stringtie2 [46,47]. Mapping files were divided by forward and reverse reads, and into single and multiple exon reads. Each of the four files was run separately using Stringtie2 long read default parameters. Only transcripts supported by more than 10 reads in each sample were considered for further processing (separately for the blood-stage and sporozoite samples). Transcript predictions were then compared and collapsed across samples into a single gtf for the blood-stage and sporozoite samples using gffcompare and custom scripts. We then used Transdecoder [48] to identify putative protein coding sequences from the predicted transcript sequences. Finally, we compared the predicted protein sequences to those annotated in the most current P. vivax P01 genome (v54) using BlastP and custom scripts.

Stage-specific transcript analysis
To identify isoforms expressed in a stage-specific manner, we first assigned each PacBio read to a specific isoform with BLAT, using all predicted stringtie transcripts (including non-coding and incomplete transcripts) and considering the match with the greatest overall identity (and only considering reads aligned with >90% identity as aligned). We then identified the pseudotime of the cell expressing this transcript by matching the GEMS barcode to the Illumina data. For each gene that had two (or more) isoforms, each expressed in more than 50 cells, we then tested for qualitative differences in isoform expression associated with the parasite development by comparing the pseudotime ranks of the cells expressing the first isoform with the pseudotime ranks of the cells expressing the second isoform using a Kolmogorov-Smirnov test. Each panel shows the PacBio reads mapped to a selected locus and split in four groups according to the stage of the parasites they derived from: early trophozoites, late trophozoites, schizonts and female gametocytes (from top to bottom). (A) Early trophozoites express the Ham 1-like protein (PVP01_0316500) from a more upstream TSS than the other stages and the resulting transcripts have a longer 5'-UTR containing five introns (red box). (B) The isoforms expressed from suppressor of kinetochore protein 1 (PVP01_1105000) result into different predicted protein coding sequences: some transcripts expressed exclusively in early trophozoites retain the third intron (red box) leading to a different open reading frame (blue bars at the bottom). (TIF) S1