High Throughput Sequencing of Extracellular RNA from Human Plasma

The presence and relative stability of extracellular RNAs (exRNAs) in biofluids has led to an emerging recognition of their promise as ‘liquid biopsies’ for diseases. Most prior studies on discovery of exRNAs as disease-specific biomarkers have focused on microRNAs (miRNAs) using technologies such as qRT-PCR and microarrays. The recent application of next-generation sequencing to discovery of exRNA biomarkers has revealed the presence of potential novel miRNAs as well as other RNA species such as tRNAs, snoRNAs, piRNAs and lncRNAs in biofluids. At the same time, the use of RNA sequencing for biofluids poses unique challenges, including low amounts of input RNAs, the presence of exRNAs in different compartments with varying degrees of vulnerability to isolation techniques, and the high abundance of specific RNA species (thereby limiting the sensitivity of detection of less abundant species). Moreover, discovery in human diseases often relies on archival biospecimens of varying age and limiting amounts of samples. In this study, we have tested RNA isolation methods to optimize profiling exRNAs by RNA sequencing in individuals without any known diseases. Our findings are consistent with other recent studies that detect microRNAs and ribosomal RNAs as the major exRNA species in plasma. Similar to other recent studies, we found that the landscape of biofluid microRNA transcriptome is dominated by several abundant microRNAs that appear to comprise conserved extracellular miRNAs. There is reasonable correlation of sets of conserved miRNAs across biological replicates, and even across other data sets obtained at different investigative sites. Conversely, the detection of less abundant miRNAs is far more dependent on the exact methodology of RNA isolation and profiling. This study highlights the challenges in detecting and quantifying less abundant plasma miRNAs in health and disease using RNA sequencing platforms.


Introduction
Extracellular RNAs (exRNA) have recently been identified as novel biomarkers and potential cell-cell communicators in plasma and other body fluids. To date, the most abundantly studied class of these exRNAs are microRNAs (miRNA), which have been implicated in a wide range of diseases including heart failure [1,2], cancer [3,4], and multiple sclerosis [5,6]. miRNAs a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 are small, non-coding RNAs that have the ability to regulate whole networks of genes through transcriptional and translational regulation. In the circulation they are commonly found enclosed in extracellular vesicles (EVs) [7], bound to lipoproteins [8], or complexed with Argonaute-2 [9]. Furthermore, in vitro studies have shown that numerous cell types are capable of releasing miRNA in these complexes. exRNA profiling may therefore reflect cellular content and identify disease specific variations in expression. More recently, the application of next generation sequencing to exRNA discovery has led to the recognition that various other species of RNAs are also present in biofluids [10,11]. While not as extensively studied as miRNAs, emerging data suggests a role for these other species of RNA as prognostic biomarkers in disease [12], and as possible mediators of disease pathogenesis.
While exRNA offers an exciting new branch of functional biomarker research, the field is still in its infancy and faces numerous technical challenges. RNA concentrations in biofluids are significantly lower than in tissue and standard methodologies developed for tissue RNA extraction and quantification are not always appropriate. Secondly, the RNAs may be present in different compartments, each of which may have different susceptibilities and vulnerabilities to RNA isolation techniques, thereby creating an unintended bias in RNA profiling based on the exact isolation method used [13,14]. Furthermore, samples available to researchers for study have often not been collected for the purpose of exRNA analysis and have been archived for significant lengths of time. Finally, little is known about the normal physiology of exRNAs and what level of variation exists between healthy donors. These factors all need to be taken into consideration in the development of standard protocols for the study of exRNA. Recent studies have compared different commercially available RNA isolation kits in their relative efficiencies in isolating RNA from plasma [13] or EVs derived from plasma [11]. Based partly on these results, and partly on our preliminary experiments, we chose the miRCURY RNA Isolation Kit for Biofluids (Exiqon) as the standard kit in this study to address other variabilities in RNA isolation with small RNA sequencing as the ultimate output.
Current technologies available for the quantification of exRNA include qRT-PCR, RNA microarray, and RNA sequencing (RNAseq). While qRT-PCR and RNA microarrays utilize specific primers or probes, and as such can only detect known RNA sequences, high throughput RNAseq has the ability to detect novel transcripts across a broad dynamic range and offers a potentially sensitive means to characterize and quantify exRNA. However, in addition to the biases introduced by the technique [11,15], the landscape of exRNAs in biofluids such as plasma appear to be dominated by certain species of RNA, such as ribosomal RNA fragments or particular miRNAs: this leads to a skewed distribution of exRNA species which limits the sensitivity of detection of less abundant species. This study aims to address some of the current caveats for RNAseq, specifically by optimizing aspects of exRNA extraction from plasma for the quantification of miRNA by RNAseq.
Plasma Isolation 40 mL of blood was obtained by venipuncture using a 21G needle and collected in K 2 EDTAcontaining tubes. Blood was immediately processed by centrifugation at 1,000 x g for 10 min at room temperature to separate plasma. The isolated plasma was immediately frozen at -80˚C in 1 mL aliquots until use. Samples were frozen for approximately 1 week prior to extraction, except where noted. Plasma was thawed at room temperature and centrifuged at 2,000 x g prior to RNA extraction to remove any remaining platelet or debris contamination.

RNA Extraction and Treatment
RNA was extracted from 1 mL of plasma using the miRCURY RNA Isolation Kit for Biofluids (Exiqon) according to manufacturer's protocol except for noted modifications. 1 μL of 20 mg/ mL glycogen (Roche) was added to plasma prior to RNA extraction. All samples were treated with T4 polynucleotide kinase (New England Biolabs) to facilitate 5' hydroxyl terminus phosphate labelling and allow greater binding of adaptors during library preparation, and RNA sample volume was reduced to 10 μL in a standard lyophilizer at room temperature. RNA concentration was determined by Quant-iT RiboGreen RNA Assay Kit.
Prior to RNA extraction, a subset of plasma samples were treated with Proteinase K (PK; 100 μg/mL in 0.5% SDS solution) for 30 min at 50˚C. Samples were treated in one of three groups: 1) no PK treatment; 2) PK treatment prior to addition of guanidinium thiocyanate (GITC) containing lysis buffer (from miRCURY RNA Isolation Kit); 3) PK treatment following the addition of GITC-containing lysis buffer. Where indicated, ribodepletion was performed using the Ribo-Zero Magnetic Gold kit (Epicentre) according to manufacturer's protocol. A schematic summarizing plasma sample treatments is shown in Fig 1. For samples that were analyzed by digital droplet PCR (ddPCR), ethanol precipitation was performed to remove any excess SDS solution that could inhibit the PCR reaction. Briefly, 0.3 M NaCl, and 3x volumes of 100% ethanol were added to the solution and the RNA precipitated at -80˚C overnight. Samples were then pelleted by centrifugation, washed with 80% ethanol, and re-suspended in 10 μL RNAse-free water.

RNA Library Preparation
Each sequencing library was constructed from 2 ng of isolated and treated plasma RNA. All libraries were uniquely bar-coded with index primer for multiplexing into sequencing lanes. The small RNA libraries were prepared and amplified using the NEBNext small RNA Library Prep Set (New England BioLabs, Ipswitch, MA, USA) following manufacturer instruction. The amplified libraries were resolved on a 10% Novex TBE gel (Life technologies) for size selection and the 140 to 160 nucleotide bands that correspond to adapter-ligated constructs derived from the 21 to 40 nucleotide RNA fragments were excise and recovered in DNA elution buffer. The average size distribution of each library was determined using Agilent Bioanalyzer with High Sensitivity Chip Kit (Agilent, Santa Clara, CA, USA) and quantified on ABI 7900HT Fast RT-PCR instrument using the KAPA Library Quantification kit according to the manufacture's protocol (Kapa Biosystems, Woburn, MA, USA). Each library was adjusted to final concentration of 2 nM, pooled, and sequenced on an Illumina HiSeq 2000 or MiSeq sequencer for single read 50 cycles at the Center for Cancer Computational Biology at Dana-Farber Cancer Institute.
The processed sequences were filtered for small RNAs greater than 16 nucleotides in length. The sequences were then aligned, quantified and annotated using sRNABench 1.0 pipeline [16]. Briefly, the pipeline implemented hierarchical sequence mapping strategy that first mapped and remove spike-in library, contaminants, and rRNA before sequentially mapped to known mature miRNA, tRNA, snoRNA and piRNA onto the human genome sequence (hg19) using Bowtie2 [17] with parameters that allow for 1 mismatch in seed alignment (-N 1), try two set of seeds (-R 2), and set the length of seed substrings to be 16 (-L 16). Mapped small RNA species was quantified to read counts and normalized to RPM as described in sRNA-Bench. Detected species were mapped to mature miRNA only and not precursor miRNA. Reads derived from microRNA with multiple copies in the genome were summed together, and read counts from sample duplicates were aggregated by mean for where it is applicable. All statistical analysis was performed using R version 3.2. To avoid introducing inherent variability by normalization methods for miRNA [18,19], only read counts were used for Spearman correlation analysis. All fastq files had been deposited into the exRNA Atlas (http://exrna. org/resources/data/) as part of Extracellular RNA Communication Consortium (ERCC).

Digital Droplet PCR
RNA was diluted 1:10 in RNAse-free water and reverse transcribed with the Universal cDNA synthesis kit II (Exiqon) according to manufacturer's protocol. cDNA was diluted 1:10 and loaded into PCR reactions with EvaGreen supermix (Bio-Rad) and pre-designed primers for miR-30d, miR-150, and miR-122 (Exiqon) according to manufacturer's protocol (Exiqon). Droplet formation was carried out using a QX200 droplet generator and droplets were transferred to a 96-well plate for PCR. No template reactions were used as a negative control. Endpoint amplification was conducted with the following steps: 95˚C for 5 min, 40 cycles of 95˚C for 30 s, 58˚C for 30 s, 4˚C for 5 min and 90˚C for 5 min at a ramp-rate of 2.5˚C/s for all steps. Droplets were then read with the QX200 droplet detector (Bio-Rad) and analysis conducted with Quantsoft 1.7. Results are presented as expression relative to the no PK treatment counts.

Proteinase K treatment after lysis achieved highest yield of microRNA
To determine the impact of proteinase K (PK) treatment at different stages of RNA extraction affecting miRNA detection, we extracted RNA from plasma treated with a) no PK, b) PK treatment before GITC (Pre-GITC), and c) PK treatment in GITC (PK in GITC and PK in GITC #2) that was drawn from a single subject. The no PK, PK pre-GITC, and PK in GITC samples were from the same blood draw, whereas the PK in GITC #2 was from a separate draw. Total RNA yield from these samples (1 mL starting volume; measured by RiboGreen RNA Assay kit) was a) 22.1 ± 1.3 ng, b) 23.0 ± 3.2 ng, and c) 67.0 ± 6.0 ng (mean ± SEM). Libraries were constructed using NEBNext small RNA Library Prep Set and run on the Illumina HiSeq2000 platform. The number of initial reads post-filtering varies from 5.5 to 8.4 million for each sample, with more than 50% of the reads being ribosomal RNA (rRNA) ( Table 1). The rRNA reads were first filtered and the remaining reads were mapped against the human genome (hg19) and 2578 annotated mature miRNA sequences. We observed 43.3%, 27.9% and 32.1% of reads mapped to human genome for no-PK, Pre-GITC, and PK-in-GITC samples respectively; however, the percentages of miRNA reads were the lowest for the no-PK sample at 3.7%.
Despite the large variation in the total number of miRNA reads, all three samples yielded more than 346 mature miRNA with PK-in-GITC sample yielding the most diverse result (n = 428). Overall we identified reads mapping to 593 mature miRNA across samples with detectable counts (count>0, S1 Table), but only 210 out of 593 miRNA were shared among them with modest level of correlation (r>0.43, Spearman Correlation, Fig 2), suggesting strong miRNA profile variability as a result of different PK treatment. If only miRNA with >10 reads were considered, the number reduced to 97 shared and 198 total mature miRNA. The degree of correlation also showed general increases between all protocols (r>0.67), with Pre-GITC and PK-in-GITC samples showing the highest correlation (r = 0.75).
Lastly we checked if commonly observed miRNA species were expressed at higher levels and thus more resistant to technical variation. miRNA that were observed in all samples showed significantly higher expression compared to those observed in two or less samples (P<2.09e-05, Wilcoxon Rank Sum Test). Similarly, miRNAs uniquely observed in only one sample generally have lower expression, particularly among those uniquely identified from the PK in GITC sample. We repeated PK-in-GITC treatment on a different plasma sample (PKin-GITC #2) derived from the same subject, and observed similar level of percentage mapped reads (31.3%, Table 1). Taken together, this suggests PK in GITC resulted in higher RNA yield as well as higher miRNA sequencing sensitivity.
To further validate these findings we performed ddPCR on repeated samples of no PK vs PK in GITC from the original single healthy donor. From the RNAseq results, we selected a high abundant (miR-122), mid-abundant (miR-30d), and low abundant miRNA (miR-150) species to measure (S1 Fig). Interestingly, the greatest increase in species count with PK in GITC treatment was seen in the low abundant miR-150 both with RNAseq (PK in GITC vs No PK: 40 fold change) and with ddPCR (PK in GITC vs No PK: 1.13 fold change). A modest increase was seen in miR-30d: 2.9 fold by RNAseq and 1.09 fold by ddPCR. Finally, the high abundant miR-122 showed a 1.4 fold increase by RNAseq and a 0.81 fold decrease by ddPCR.
This suggests that PK treatment increases availability and yield of low abundant miRNA, whereas the high abundance species may be at a saturated detection level that is not affected so greatly.

The Effect of Ribodepletion on miRNA Detection
Due to the high percentage of rRNA reads observed, the effect of ribodepletion on sample reads was tested. Plasma samples from the same healthy individual, frozen for either 1 week (fresh) or 1 year (archived) were extracted with PK pre-treatment and ribodepletion was performed. As expected, ribodepleted fresh samples showed decreases in the percentage of rRNA reads, ranging from 15 to 25%, while the percentages of miRNA reads and detectable mature miRNA species both showed increases (Table 1) when compared to non-ribodepleted samples. However, ribodepletion of the archived samples did not result in a decrease of rRNA reads. Instead, the ribodepletion step appeared to significantly reduce the percentages of miRNA reads, with the number of detected miRNA species decreased to less than a hundred in both samples ( Table 1). The degree of expression correlation also weakened substantially among the commonly identified miRNA species (r = 0.43 for RD1, r = 0.27 for RD2, Spearman's Test, S2 Fig). * RNA isolated with PK in GITC. Sample a and b are duplicates from the same blood draw.
** mapping rate after removal of rRNA sequences.
To further validate this observation, we sequenced both ribodepleted and non-treated archived plasma samples derived from three cardiovascular patients (subjects Pt1 to Pt3) that were archived for 3-4 years. The result concurred with previous observation. Ribodepleted archived samples consistently yield dramatically lower percentage of miRNA mapping reads and lower number of miRNA species compare to non-treated samples ( Table 1, S2 Table). Paradoxically, the percentages of ribosomal reads were higher among the ribodepleted samples. As most samples from clinical settings would have been frozen for extended period of time, ribodepletion was not incorporated for rest of this study despite the observed improvement in fresh plasma samples.

Variability of miRNA Expression Between Samples
To assess technical reproducibility of miRNA profiling, we examined duplicated miRNA profiles derived from the plasma of three healthy subjects between the ages of 20 to 40 (raw miRNA read counts in S3 Table). We obtained greater than 12 million reads per sample, but again with more than 60% being ribosomal RNA in subject 2 and 3. While samples derived from subject 4 exhibited lower percentage of rRNA reads, these samples contained high level of adaptor dimers and mapped poorly to the human genome (4.5% of total raw reads). An average of 326 known miRNAs were quantified, with subject 4 samples showing the lowest number of miRNA diversity.
To determine sequencing reproducibility, we compared quantified miRNA species between sample duplicates. We considered all quantifiable miRNA from individual run with read count of greater than 10 to minimize sampling fluctuation. The overlaps between biological duplicates were modest, ranging from 54% to 63% of all quantified miRNA across duplicates (Fig 3). We then determined if the overlapping miRNA species were more highly expressed and thus likely more resistant to sampling fluctuation by comparing the expression level of commonly observed miRNA (n = 62) across all runs to others. As expected, the expression of commonly observed miRNA species was significantly higher (P<0.001).
To examine whether commonly detected miRNA expression is generally reproducible, we first determined the correlation of these miRNA expression among duplicates. Intra-sample correlations were significant across all samples (P<1e-29, Spearman correlation, Fig 3) with r-values from 0.82 to 0.92, indicating strong expression reproducibility. We then performed the same analysis across all sample pairs to assess miRNA expression concordance across different control samples. The percentages of common miRNA across different control samples were similar to biological duplicates, ranging from 54% to 63% of all quantified miRNA. While the concordances across samples were lower than biological duplicates, the expression levels remain highly correlated with r>0.71 (Fig 3). Notably, samples 4a and 4b, which are the ones with the lowest sequence depth and number of mature miRNA species detected, showed the lowest degree of concordance with other samples. Taken together, plasma miRNA derived from health subjects showed highly consistent expression level across runs and individuals.

Comparison with Published Data
The expression consistency of plasma miRNA across different individuals suggests there maybe sets of plasma miRNA species that are more conserved in their expression in healthy subjects. Here, we used 142 miRNA (S4 Table) that were expressed in all three healthy subjects and their duplicates as our set of conserved plasma miRNA. Of these, the top 20 expressed miRNA and their read counts in the duplicated samples are shown in Table 2. These include known disease associated miRNA species such as miR-451a [20,21], miR-1246 [22], miR-423 [23], and miR-148a [24,25] (Table 3). Overall, these miRNA were significantly highly expressed (P<1.89e-13, Wilcoxon Rank Sum Test) compared to the remaining detected species across all samples.
In order to assess whether these conserved miRNAs exhibit similar expression patterns in plasma derived from normal subjects and can be detected in other dataset, we compared our sequencing data against plasma miRNA sequence data previously published by William et al [80] that was generated with different RNA isolation and library preparation protocols. Briefly, the study sequenced microRNA from plasma collected from mothers, fathers, umbilical cords, and biopsy samples of the corresponding placentas as well as non-pregnant control females in good health. First, we evaluated the expression level of our conserved miRNAs in relation to rest of the detected miRNA in each sample type. While a few of these conserved miRNAs were not observed in all samples, the observed miRNAs were significantly highly expressed compared to rest of the miRNA in all sample types (Fig 4A), including placentas, suggesting these conserved miRNAs that we found were highly expressed in different plasma types and tissue type.
Next we evaluated if these conserved miRNA expression patterns were consistent between the plasma samples from the two datasets. Correlations across datasets for plasma samples were strong (Fig 4B) for plasma samples collected from mothers, fathers, and non-pregnant controls, with most of the paired comparisons with r > 0.6 (Spearman Correlation). Plasma

sRNA Content in Plasma by Transcriptome Sequencing
We examined the number of reads that were mapped to other class of small RNAs in addition to rRNA and miRNA. Overall, among the total 558,000 reads of small RNA sequences generated from the three healthy subjects, there were 67% miRNA, 23% tRNA, 5% small nucleolar RNA (snoRNA), and 5% piwi-interacting RNA (piRNA) reads mapped (Fig 5). Despite significant amounts of variability in the number of initial and post-processed reads across individual samples, miRNA and tRNA reads were consistently the most and the second most abundant small RNA class identified. Out of the 356 distinct tRNA that were identified, more than half (238 out of 355) were overlapping among the control subjects. The number of overlapping tRNA remained close to half (168 out 355) even after restricting to ones with at least 10 reads (S3 Fig).

Discussion
We began this study with the intention to establish a robust small RNA extraction procedure for peripheral blood plasma miRNA quantification that can be reliably utilized for biomarker discovery from clinical samples. While settling on a single kit based on other studies, our goal was to examine the effects of modifications of RNA extraction procedures for assessing plasma miRNA expression profiles using NGS as our quantification assay. The experiment was designed to select the optimal RNA extraction procedure on plasma samples derived from a single healthy individual, taking consideration of sample storage duration, and then extended to multiple individuals as well as published data to evaluate consistency. We first evaluated the impact of PK treatment, which has been reported to increase RNA yield [81], on extracting RNA from plasma. We found that PK treatment following addition of GITC-based denaturing buffer to plasma markedly increases RNA yield, the number of mapped miRNA reads, and improved detection of low-abundance miRNA species. It is known that miRNAs in the circulation are relatively sequestered: associated with EVs, bound to lipoprotein, or complexed with RNA binding proteins. At the same time, RNAses in plasma are relatively more resistant to PK digestion compared to other RNA binding proteins. We therefore think that the timing of addition of PK to the sample is of great importance: PK added directly to the sample may destroy the protective RNA binding proteins more readily than the RNAses, thereby allowing for degradation of exRNAs and lower yields. In the presence of denaturing buffer, the RNAses are inactive (while the PK may be still be active), and the improved RNA yield is presumably due to the dissociation of exRNAs from their RNA binding partners. This may also explain why the low-abundance miRNA are more readily detected following PK in GITC treatment, as confirmed by our ddPCR experiment. We therefore recommend that for discovery purposes PK in GITC treatment is the optimal condition as it allows for detection of a greater number of miRNA species. This may not be necessary for validation studies or where high abundant miRNA is being assessed.
One of the most surprising outcomes from our sequencing data was the consistently high percentage of ribosomal RNA reads that occupies large proportion of sequence reads. This made it more challenging to achieve a high depth of coverage for profiling miRNA expression. We tested the ability of ribodepletion, which is standardly performed for RNAseq of tissues samples, to remove rRNA from extracted plasma RNA derived from fresh peripheral blood plasma. The result was initially highly encouraging with both substantial decrease in the percentage of rRNA and increase in the percentage of miRNA reads. However, the same plasma sample that were frozen for one year yielded poor sequencing results after ribodepletion, with no reduction in the percentage of rRNA and substantial decrease in the percentage of miRNA reads. As most clinical samples would have been archived for extended period of time, it is important to note that fresh and archived samples should be handled differently. The use of ribodepletion, while potentially beneficial in fresh samples, may not be suitable for the study of archived samples. Other techniques, such as blocking of highly abundant miRNAs for plasma RNA sequencing, have been described but not adapted to multiplexing and could offer another avenue of increasing the detection of less abundant species of exRNAs.
Our comparisons of miRNA expression profiles showed strong correlation of miRNA species for both sample replicates and between samples from different healthy donors, supporting the reproducibility of NGS as a methodology for plasma miRNA quantification. In particular, despite the low RNA concentration within peripheral blood plasma that may introduce strong sampling bias, large numbers of abundantly expressed miRNA species were observed across samples that we examined in this study with similar profiles. We further investigated if these conserved plasma miRNA that we observed exhibit similar patterns from healthy individuals in published study and found that these profiles remain strongly correlated with our results despite differences in sample acquisition, RNA extraction, and sequence library preparation protocols. While this may suggest the presence of a stable landscape of extracellular miRNA in healthy plasma, further studies with larger cohorts are required to confirm this observation.
As the scope of this study is limited to optimize only small aspects of exRNA extraction process from plasma for small RNAseq quantification, there remain arrays of challenges in generating reproducible and reliable data. For instance, T4 polynucleotide kinase treatment during sequencing library generation likely facilitated extracellular DNA nucleotide for sequencing [82] to result in DNA sequence contamination in our data. Also, it has been reported that RNA ligases have strong sequence specific biases [83] and can distort small RNA expression profile considerably. While pooled and randomized adapter strategies have been proposed to reduce ligation bias, the technology would need to be further evaluated. The dominance of ribosomal RNA sequence in plasma remains a challenge that constrained our ability to obtain desirable sequencing depth to consistently detect less abundant small RNA species. Overall, this study optimizes exRNA extraction process for plasma small RNA sequencing and further demonstrates NGS platform as a reliable methodology for plasma extracellular biomarker discovery.