The authors have declared that no competing interests exist.
Conceived and designed the experiments: NN EAT EAP JJ. Performed the experiments: NN JMC JMK BN BWE XJG. Analyzed the data: NN AB ZS YWA DJS KRK KJT. Contributed reagents/materials/analysis tools: EAP EAT ZS YWA XJG. Wrote the manuscript: NN EAT EAP ZS YWA.
Advantages of RNA-Seq over array based platforms are quantitative gene expression and discovery of expressed single nucleotide variants (eSNVs) and fusion transcripts from a single platform, but the sensitivity for each of these characteristics is unknown. We measured gene expression in a set of manually degraded RNAs, nine pairs of matched fresh-frozen, and FFPE RNA isolated from breast tumor with the hybridization based, NanoString nCounter (226 gene panel) and with whole transcriptome RNA-Seq using RiboZeroGold ScriptSeq V2 library preparation kits. We performed correlation analyses of gene expression between samples and across platforms. We then specifically assessed whole transcriptome expression of lincRNA and discovery of eSNVs and fusion transcripts in the FFPE RNA-Seq data. For gene expression in the manually degraded samples, we observed Pearson correlations of >0.94 and >0.80 with NanoString and ScriptSeq protocols, respectively. Gene expression data for matched fresh-frozen and FFPE samples yielded mean Pearson correlations of 0.874 and 0.783 for NanoString (226 genes) and ScriptSeq whole transcriptome protocols respectively, p<2x10-16. Specifically for lincRNAs, we observed superb Pearson correlation (0.988) between matched fresh-frozen and FFPE pairs. FFPE samples across NanoString and RNA-Seq platforms gave a mean Pearson correlation of 0.838. In FFPE libraries, we detected 53.4% of high confidence SNVs and 24% of high confidence fusion transcripts. Sensitivity of fusion transcript detection was not overcome by an increase in depth of sequencing up to 3-fold (increase from ~56 to ~159 million reads). Both NanoString and ScriptSeq RNA-Seq technologies yield reliable gene expression data for degraded and FFPE material. The high degree of correlation between NanoString and RNA-Seq platforms suggests discovery based whole transcriptome studies from FFPE material will produce reliable expression data. The RiboZeroGold ScriptSeq protocol performed particularly well for lincRNA expression from FFPE libraries, but detection of eSNV and fusion transcripts was less sensitive.
The wealth of clinical data such as patient outcome and disease-free survival from large clinical trials is an invaluable tool in the successful application of genomic technologies. More than one billion samples are preserved with formalin fixed paraffin embedding (FFPE) in hospitals and tissues banks across the world [
Formalin induces chemical modification by crosslinking between nucleic acids and proteins, limiting the reverse transcription of mRNA into cDNA, a key step in the process of library preparation for transcriptome sequencing. Further challenges originate from the enrichment procedure of mRNA from the total RNA fraction and subsequent 3’ amplification bias when starting material is fragmented. There are three potential solutions.
First, formalin induced chemical modifications are partially reversible [
The goal of this study was to first assess gene expression using degraded material with the hybridization-based, NanoString nCounter platform (thus avoiding 3’ amplification bias) and RiboZeroGold / ScriptSeq library preparation (Epicentre) (a rRNA depletion method of mRNA enrichment, designed to avoid 3’ amplification bias of degraded material), followed by Illumina sequencing. Our experimental design compared manually degraded RNA against the same sample in its original undegraded condition and matched fresh-frozen / FFPE pairs. A high degree of correlation of gene expression between degraded and undegraded sample sets with these genomic applications would indicate the potential of archival material for discovery of gene expression patterns relevant to breast cancer. In addition, use of whole transcriptome RNA-Seq is often described as advantageous over hybridization-based platforms such as the NanoString nCounter or microarrays because next generation sequencing based technologies allow not only quantitation of gene expression, but also simultaneous discovery of genomic features such as single nucleotide variants (SNVs) and fusion transcripts. Hence, we also assessed the capability of the RiboZeroGold / ScriptSeq protocol for discovery of expressed SNVs and fusion transcripts within FFPE material.
Gene expression was assessed in two sample sets. First, we took high quality RNA from a breast cancer cell line (MDA-MBA-436) and the Universal Human Reference RNA (Stratagene, La Holla, CA) and manually degraded aliquots of each by heat and physical shearing. This resulted in a total of 10 RNA samples with RNA integrity (RIN) values ranging 1.2-10, described in
MDA-MBA-436 | Undegraded | 10 |
MDA-MBA-436 | Medium (by heat) | 6.8 |
MDA-MBA-436 | Medium (by shearing) | 6.1 |
MDA-MBA-436 | High (by heat) | 2.2 |
MDA-MBA-436 | High (by shearing) | 1.2 |
UHRR | Undegraded | 8.1 |
UHRR | Medium (by heat) | 4.7 |
UHRR | Medium (by shearing) | 5.2 |
UHRR | High (by heat) | 1.8 |
UHRR | High (by shearing | 1.7 |
BRB123 | 1.83 | 2.3 | 8.1 |
BRB144 | 1.50 | 2.1 | 7.7 |
BRB147 | 1.42 | 2.2 | 7.0 |
BRB157 | 1.42 | 2.1 | 7.3 |
BRB212 | 1.00 | 2.0 | 7.7 |
BRB215 | 1.00 | 2.2 | 8.4 |
BRB248 | 0.83 | 2.2 | 7.8 |
BRB277 | 0.42 | 1.8 | 6.9 |
MCJBCR-028 | 4.33 | 2.3 | 8.9 |
FastQ files are available at
All breast tumor samples were collected between 2008 and 2012 according to a protocol that was approved by the Mayo Clinic Institutional Review Board with written informed consent, and were de-identified for this work.
RNA extraction from the breast cancer cell line, MDA-MBA-436 was performed with the miRNeasy Kit (Qiagen, Inc, CA). The UHRR RNA was purchased directly from Stratagene. Both samples were treated with DNase (Qiagen, Inc, CA). Each sample was divided into aliquots, with one aliquot left undegraded, and each remaining aliquot degraded under one of the following conditions: 1) 37°C for 48 hours; 2) 65°C overnight; 3) shearing by CovarisTM (Woburn, MA) under target base pair peak conditions for 1500bp (10 second then 15 second bursts) and 4) shearing by CovarisTM under target base pair peak conditions for 1500, 800 and 200bp (10 second then 15 second bursts). The 1500, 800 and 200bp sheared aliquots were then combined to provide a range of different fragment sizes. Agilent Bioanalyzer profiles are shown in
RNA extraction from fresh-frozen material was performed with the Qiagen miRNeasy kit, including on-column treatment with DNase. RNA extraction from FFPE material was performed on 6x10µm sections with the Qiagen AllPrep FFPE kit, including on-column treatment with DNase. All RNA samples were assessed for quality using the RNA 6000 Nano assay on the 2100 Bioanalyzer (Agilent Tehcnologies, Inc) and for quantity by Nanodrop 2000c (Thermo Scientific). Agilent Bioanalyzer profiles are shown in
Gene expression on the NanoString platform was assessed with the NanoString Cancer Gene expression panel of 226 genes known to be differentially expressed in cancer. 200ng of each total RNA sample was prepared as per the manufacturer’s instructions. Gene expression was quantified on the NanoString nCounterTM and raw counts were generated with nSolverTM. Raw counts per gene were extracted using the nSolver software, (Nanostring). Raw gene counts were read into R (2.15.0) and each dataset (Fresh-frozen RNA and FFPE RNA) was separately quantile normalized via the Bioconductor (2.11) package “aroma.light”. Log2-transformed distributions, before and after quantile normalization, are shown in
rRNA depletion was performed on 200-400ng of each total RNA sample with the RiboZeroGold (Illumina, San Diego, CA) as per the manufacturer’s instructions. The entire rRNA-depleted fraction (ranging 4-22ng) was used as input for library preparation using the ScriptSeq V2 library preparation kit (Illumina, San Diego, CA) as per the manufacturer’s instructions with one exception: For the degraded and FFPE samples, the cDNA synthesis step at 37°C was increased from 10 minutes to 90 minutes. Each library was indexed with Illumina compatible barcodes to allow multiplexing.
In addition to ScriptSeq libraries, for the RNA isolated from fresh-frozen tumor, we also made TruSeq V2 libraries as per the manufacturer’s instructions (Illumina, Inc, San Diego, CA).
All libraries were validated and quantified with the Bioanalyzer DNA 1000 assay (Agilent Technologies, Inc, CA) and further quantified with the Qubit DNA Broad Range assay (Life Technologies, Carlsbad, CA).
10µL of each library were diluted to a concentration of 10nM. Equal volumes of each 10nM library were then pooled for subsequent 51 base paired-end sequencing on an Illumina HiSeq 2000 (Illumina). For the manually degraded ScriptSeq V2 and the fresh-frozen tumor TruSeq V2 libraries, sequencing was performed with two samples per lane. For the ScriptSeq fresh-frozen and FFPE libraries, sequencing was performed with three samples per lane.
Sequence alignment and gene level expression quantification: To account for higher coverage in TruSeq samples (only two per lane, verses three per lane in the ScriptSeq), and to make fair comparisons among the samples with different sequence depths, we randomly selected ~56 million reads (the smallest number of reads observed in a single sample) from each other sample. Sequence alignment and quantification of gene and exon level expression was carried out using our internally developed RNA-Seq analytical pipeline, MAPRSeq 1.2. Briefly, the paired end reads were aligned to the human genome build 37.1 using TopHat (2.0.6). HTSeq (0.5.3p9) was used to perform gene counting while BEDTools (2.16.2) was used to count the reads mapping to individual exons according to RefSeq gene annotations (Feb 2009, GRCh37/hg19) with 23,498 genes.
Reads not uniquely mapped were discarded. Duplicate reads were marked with Picard and removed prior to SNV calling with GATK. eSNV inclusion criteria were a quality score ≥30, total read depth ≥10 and a read depth of ≥2 for the alternate allele.
Fusion transcripts were detected as previously described in Asmann et al, [
Raw gene counts were read into R (2.15.0) and each RNA sample set (manually degraded, undegraded, fresh-frozen and FFPE) for each platform (Nanostring and Illumina RNASeq), and each RNASeq protocol (ScriptSeq and TruSeq), was separately quantile normalized via the Bioconductor (2.11) package “aroma.light”. Log2 transformed distributions, before and after quantile normalization, are shown in
Cross-platform agreement was assessed using both Spearman’s and Pearson’s correlation coefficients on a per-sample basis.
All correlation p-values were calculated via Fisher’s z-transformation.
RNA extraction from 6x10µM FFPE sections yielded a mean of 7.99µg per sample (ranging 3.96-15.03µg). RNA extraction from 10x10uM sections of fresh-frozen tumor yielded a mean of 7.75ug per sample (ranging 1.8-22.1ug). RIN scores and FFPE sample age at the time of RNA extraction are shown in
Raw gene counts for 226 genes quantified by NanoString were log2-transformed, and normalized by quantile normalization. For each high quality undegraded RNA isolated from cell lines (MDA-MB-436 and UHRR), we compared these values against the same sample that had been manually degraded to different extents with either heat or shearing. Excellent correlation was observed for both samples regardless of the state of degradation. Medium degradation (RIN scores 4.8-6.2) yielded Pearson correlations ranging 0.996 to 0.999. High degradation (RIN scores 1.2-2.2) yielded Pearson correlations ranging 0.993-0.998 (
Nanostring correlation for (A) undegraded RNA against the same manually degraded sample (RIN=2.0); (B) log2 fold change between two high quality RNAs and the same two samples when manually degraded (RIN 2.0); (C) nine matched fresh-frozen and FFPE pairs. ScriptSeq correlation for (D) undegraded RNA against the same manually degraded sample (RIN=2.0); (E) log2 fold change between two high quality RNAs and the same two samples when manually degraded (RIN 2.0); (F) nine matched fresh-frozen and FFPE pairs.
We also performed correlation of the log2 fold changes between MDA-MB-436 and UHRR in their undegraded state verses the log2 fold changes in each state of degradation to test if the differences in gene expression between the two samples would also be detected when the samples were degraded. Again, high correlation was observed with Pearson correlation r= 0.943-0.960 and r=0.925-0.949 for medium and high states of degradation respectively (illustrated in
To test technical reproducibility on the NanoString we performed two technical replicates of the same FFPE sample. Pearson correlation for technical replicates was 0.996 (
Log2 gene expression for FFPE technical replicate with (A) NanoString platform and (B) RiboZeroGold ScriptSeq; (C) Gene expression correlation of nine FFPE samples across nanoString and RiboZeroGold ScriptSeq protocols. Gene expression correlation of nine fresh-frozen (FROZ) RNA samples across (D) nanoString and ScriptSeq, (E) nanoString and TruSeq and (F) RiboZeroGold ScriptSeq and TruSeq.
In the second test set of nine matched pairs of fresh-frozen and FFPE RNAs, normalized log2 values for all 226 genes were compared by Pearson correlation and log rank correlation (Spearman). The NanoString platform showed excellent correlation between all pairs (
NanoString FROZ vs FFPE | ScriptSeq FROZ vs FFPE | |
---|---|---|
BRB123 | 0.926 (0.950) | 0.790 (0.943) |
BRB144 | 0.955 (0.965) | 0.807 (0.957) |
BRB147 | 0.728 (0.917) | 0.802 (0.945) |
BRB157 | 0.807 (0.930) | 0.793 (0.932) |
BRB212 | 0.972 (0.962) | 0.830 (0.963) |
BRB215 | 0.818 (0.976) | 0.598 (0.961) |
BRB248 | 0.942 (0.976) | 0.830 (0.972) |
BRB277 | 0.973 (0.970) | 0.825 (0.971) |
MCJBCR028 | 0.747 (0.941) | 0.773 (0.933) |
mean | 0.874 (0.954) | 0.783 (0.953) |
Spearman correlation in parentheses
In this set of experiments, we assessed a protocol for whole transcriptome sequencing of manually degraded and FFPE RNA samples. The RiboZeroGold / ScriptSeq protocol is designed to avoid 3’ bias due to fragmented or degraded material. We measured the gene body coverage by plotting read depth from
We also performed correlation of the log2 fold changes between MDA-MB-436 and UHRR in their undegraded state verses the log2 fold changes in each state of degradation to test if the differences in gene expression between the two different samples would also be detected when the samples were degraded. Again, high correlation was observed with Pearson correlation r= 0.864-0.868 and r=0.826-0.862 for medium and high states of degradation respectively (
We next tested reproducibility by performing a technical replicate for a single FFPE sample with the RiboZeroGold ScriptSeq protocol. Pearson correlation for FFPE technical replicates was extremely high, r=0.998 (
In this experiment, we performed comparisons of the same samples across different platforms. Additionally for this experiment we included data generated by polyA pull-down (TruSeq V2) for the nine RNA samples isolated from fresh-frozen tumor only. Correlation coefficients for all comparisons are shown in
NanoString FFPE vs ScriptSeq FFPE | NanoString FROZ vs ScriptSeq FROZ | NanoString FROZ vs TruSeq FROZ | ScriptSeq FROZ vs TruSeq FROZ | |
---|---|---|---|---|
BRB123 | 0.921 (0.771) | 0.721 (0.740) | 0.881 (0.829) | 0.702 (0.931) |
BRB144 | 0.924 (0.707) | 0.746 (0.707) | 0.710 (0.779) | 0.579 (0.920) |
BRB147 | 0.468 (0.727) | 0.372 (0.734) | 0.780 (0.822) | 0.66 (0.920) |
BRB157 | 0.869 (0.751) | 0.805 (0.728) | 0.880 (0.820) | 0.682 (0.915) |
BRB212 | 0.919 (0.711) | 0.664 (0.714) | 0.903 (0.811) | 0.651 (0.929) |
BRB215 | 0.744 (0.767) | 0.634 (0.762) | 0.610 (0.830) | 0.8212 (0.942) |
BRB248 | 0.889 (0.721) | 0.679 (0.724) | 0.922 (0.800) | 0.686 (0.926) |
BRB277 | 0.887 (0.717) | 0.827 (0.730) | 0.870 (0.794) | 0.655 (0.934) |
MCJBCR028 | 0.923 (0.731) | 0.525 (0.743) | 0.558 (0.801) | 0.686 (0.942) |
mean | 0.838 (0.734) | 0.664 (0.731) | 0.790 (0.809) | 0.680 (0.929) |
Spearman correlation in parentheses
Both ScriptSeq and TruSeq protocols behaved similarly when compared against 226 genes analyzed on the NanoString platform, mean correlations r= 0.664 and 0.790, respectively (
Correlative studies of gene expression with FFPE material have previously been demonstrated with a number of different platforms such as NanoString, microarrays and quantitative PCR [
We compared sequencing statistics for the same nine fresh-frozen tumor samples using the TruSeq (polyA pull-down) and RiboZeroGold (rRNA depletion) ScriptSeq protocols and between matched fresh-frozen and FFPE samples for the RiboZeroGold ScriptSeq protocol.
This resulted in two observations illustrated in
(A) % reads mapped to genome, genes and exon junctions, B) Venn diagram of fusion transcript detection, (C) Number SNV calls, (D) Correlation of FFPE library insert size with sensitivity for single nucleotide variant detection.
We also observed a difference between fresh-frozen and FFPE material that was not due to protocol. The total number of reads mapped to the genome for nine FFPE samples was 79.2%, similar to that of fresh-frozen material with either protocol. When comparing the same RiboZeroGold ScriptSeq protocol across matched fresh-frozen and FFPE samples, we observed a significantly lower percentage of reads mapping to genes (fresh-frozen = 50.6%, SD ± 3.95 and FFPE = 24.8%, SD ±2.8) and to exon junctions (fresh-frozen = 10.29%, SD ±1.09 and FFPE = 3.98%, SD ±0.58), in FFPE samples, demonstrating a marked loss of exonic sequence in FFPE material.
Comparison of TruSeq verses RiboZeroGold ScriptSeq libraries from the same fresh-frozen tumor allowed us to examine differences in gene expression that were protocol specific. Firstly, we identified outliers by assessing log2 fold change between the same nine samples. This identified 116 genes (
We also aligned RNA-Seq data from both TruSeq and RiboZeroGold ScriptSeq libraries against 5,749 lincRNAs from the Havana group (http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/) and performed correlation analyses of the nine fresh-frozen and FFPE pairs, (
% reads mapped to lincRNA |
lincRNA correlation of expression |
||||
---|---|---|---|---|---|
ScriptSeq FFPE library | ScriptSeq FROZ library | TruSeq FROZ library | ScriptSeq FROZ vs ScriptSeq FFPE | TruSeq FROZ vs ScriptSeq FROZ | |
BRB123 | 5.74% | 3.68% | 1.66% | 0.997 (0.858) | 0.465 (0.77) |
BRB144 | 4.93% | 3.80% | 1.30% | 0.99 (0.881) | 0.361 (0.716) |
BRB147 | 6.32% | 3.23% | 1.10% | 0.98 (0.835) | 0.376 (0.677) |
BRB157 | 5.80% | 3.72% | 1.14% | 0.969 (0.818) | 0.36 (0.641) |
BRB212 | 4.70% | 3.44% | 1.26% | 0.996 (0.896) | 0.366 (0.74) |
BRB215 | 4.44% | 2.63% | 1.39% | 0.989 (0.855) | 0.382 (0.716) |
BRB248 | 5.94% | 4.89% | 1.85% | 0.992 (0.879) | 0.355 (0.737) |
BRB277 | 5.55% | 3.49% | 1.42% | 0.995 (0.874) | 0.486 (0.717) |
MCJBCR028 | 5.84% | 3.37% | 1.67% | 0.982 (0.849) | 0.477 (0.807) |
mean | 5.47% | 3.58% | 1.42% | 0.988 (0.861) | 0.403 (0.725) |
Spearman correlation in parentheses
The percentage of reads mapping to lincRNA from ScriptSeq FFPE libraries ranged from 4.44-6.32% (mean 5.47%), better than that achieved in the matched fresh-frozen libraries using the same ScriptSeq protocol (range 2.63-4.89%, mean 3.58%) and markedly better than that achieved using the TruSeq protocol (range 1.10-1.85%, mean 1.42%). This observation was mirrored when comparing the percent of total lincRNAs present in each library (ScriptSeq FFPE mean % of total lincRNA = 57.57%, compared with 55.80% and 42.73% for matched fresh-frozen libraries with ScriptSeq and TruSeq protocols respectively).
We observed excellent correlation of lincRNA expression between the nine matched FFPE and fresh-frozen pairs under the RiboZeroGold ScriptSeq protocol (Pearson correlation ranging 0.969-0.997, mean correlation r=0.989). Correlation between libraries prepared from fresh-frozen material with the ScriptSeq and TruSeq protocols was relatively poor, (Pearson correlation ranging 0.355-0.477, mean correlation 0.403). We further examined the lincRNA outliers (>1.5*IQR) between ScriptSeq and TruSeq protocols (
We applied the snowshoes-FTD, fusion transcript detection [
Overlap between protocols and samples are shown by Venn diagram in
We next asked the question, could the sensitivity to detect fusion transcripts in FFPE libraries be improved by increased depth of coverage? Of the 19 fusions identified in both TruSeq and ScriptSeq libraries from fresh-frozen material, only 2/19 were detected in FFPE material when using all available reads, in this case ~159 million (~ three-fold increase). We also note that for exactly the same FFPE library (BRB-277), 4/19 fusion transcripts present in the matched fresh-frozen libraries prepared with TruSeq and ScriptSeq were not identified in the FFPE library, despite increasing read depth to 159 million reads. These data suggest that an increase in read depth, at least up to ~3-fold, does not compensate for FFPE specific exonic loss.
The library preparation methods used in this study employ different techniques for mRNA enrichment and cDNA synthesis, as demonstrated by a decrease of ~27% of reads mapped to RefSeq genes and exon junctions between TruSeq and ScriptSeq protocols in the same high quality RNA samples (
As a measure of data quality, we calculated Transition / transversion (Ti/Tv) ratios for known and novel SNV’s and mean quality scores for the alternate allele. Ti/TV ratios have been previously reported as ~3.0 for SNVs inside of exons, and ~2.0 elsewhere in the genome, [
The mean quality score for alternate alleles was similar across all samples sets and protocols, ranging 83.3-86.9 in the TruSeq fresh-frozen samples, 87.9-89.5 in the ScriptSeq fresh-frozen samples and 85.7-86.6 in the ScriptSeq FFPE samples. Mean and median quality scores for alternate alleles and coverage for alternate alleles are shown in detail in
To assess the sensitivity of SNV detection in our FFPE samples, we defined high confidence SNV calls as those that overlapped between TruSeq and ScriptSeq libraries in the same RNA samples isolated from fresh-frozen tumor. We defined sensitivity of SNV detection in FFPE libraries as the fraction of SNVs detected as a proportion of the high confidence SNV calls observed in the matched fresh-frozen RNA libraries. Sensitivity to detect SNV in FFPE material ranged from 39-61% and was correlated with library insert size, R2 = 0.73 (
Comparison of fresh-frozen and FFPE libraries made with the same RiboZeroGold ScriptSeq protocol, (
Discovering the unique genomic architecture within tumors will be essential to the goal of personalized medicine for breast cancer. Next generation sequencing approaches applied to large clinical samples with detailed phenotypic data are potentially invaluable but sample preservation by formalin-fixation has hindered this application. In this study we assessed the feasibility of RNA sequencing of formalin-fixed material for gene expression including lincRNA expression, discovery of single nucleotide variants and discovery of fusion transcripts.
RNA sequencing of FFPE material is not possible with methods that require mRNA enrichment by polyA pull-down due to sample degradation. Alternative protocols and platforms are available, but differentiation between artifact due to FFPE and artifact due to different protocols and platforms is equivocal. In this study we break down these differences. We used both a hybridization based platform (NanoString nCounter) and a rRNA depletion protocol for transcriptome sequencing (RiboZeroGold ScriptSeq library preparation with Illumina sequencing) in a set of manually degraded RNAs and a set of nine matched fresh-frozen and FFPE pairs. Correlative analyses of matched fresh-frozen and FFPE pairs on the same platform and for the same samples (both fresh-frozen and FFPE) across platforms demonstrates 1) gene expression data from RNA isolated from FFPE breast tumor is viable; 2) Both the NanoString platform and RiboZeroGold ScriptSeq whole transcriptome sequencing protocols generate reliable gene expression data for degraded and FFPE RNA.
Our initial study design systematically assessed gene expression as a function of RNA integrity, by correlation of a set of manually degraded RNA samples against undegraded, high quality aliquots of the same sample. Both platforms performed well with no systematic difference between degraded and undegraded states. RNA expression study designs depend on accurate estimation of fold change in expression between different samples. By using two different samples, one from a breast cancer cell line and a universal human reference RNA (a pool of ten different cell lines), we were also able to correlate log2 fold change between samples in their undegraded and degraded states, further demonstrating the robustness of both the NanoString nCounter platform and the RiboZeroGold ScriptSeq protocol.
Correlative analyses with manually degraded samples were encouraging for both platforms. However, it is unknown if these artificial changes are representative of the biological changes that we seek in degraded material. Certainly, they do not account for the cross-linking caused by formalin fixation. We next moved to assess three critical issues for gene expression studies in FFPE samples. Firstly, we used technical replicates of RNA isolated from FFPE to assess reproducibility with the NanoString and ScriptSeq protocols, which yielded Pearson correlation >0.99 for both platforms. Secondly, for study designs using a discovery sample from FFPE material, we need to know that the genes that appear to be associated with disease outcome, are not systematic artifacts of FFPE material. This issue was assessed for both platforms by correlation of gene expression in matched pairs of fresh-frozen and FFPE RNA. In agreement with Reis et al (2011), we observed a high degree of correlation between fresh-frozen and FFPE pairs with the NanoString platform, Pearson r=0.874. Using the RiboZeroGold/ScriptSeq protocol, fresh-frozen and FFPE pairs showed slightly lower correlation, (Pearson r=0.783), although significantly better than that reported with quantitative PCR, (Pearson r=0.53), [
The third gene expression issue we addressed relates to orthogonal platform validation. Genome-wide studies in discovery samples will generate a high number of disease or disease associated genes, requiring validation and further replication samples. For this reason, it is important to identify genome-wide and replication platforms which achieve high correlation for FFPE material. The two protocols we assessed in this study are very different. The RNA-Seq protocol is based on rRNA depletion and PCR amplification of the cDNA, whereas the NanoString protocol is based on hybridization of selected probes to total RNA, with no cDNA synthesis and no PCR amplification. Despite these differences, these platforms showed excellent correlation for nine FFPE samples, Pearson r=0.838. We believe this combination, ScriptSeq then NanoString is ideally suited for discovery and analytical validation, respectively.
We next moved to assess the performance of a rRNA depletion protocol for whole transcriptome sequencing of FFPE RNA. By generation of libraries from the same nine RNAs isolated from fresh-frozen tumor, with both TruSeq and ScriptSeq complete kits, we were able to separate those characteristics due to protocol from those due to FFPE. Comparison of sequencing statistics between RNA-Seq protocols revealed a marked (~50%) decrease in the percentage of reads mapping to RefSeq genes and exon junctions in FFPE material, that was not observed in the matched frozen and manually degraded libraries prepared with the same RiboZeroGold ScriptSeq protocol. The exonic loss from RiboZeroGold ScriptSeq FFPE libraries was also reflected in the lower number of fusion transcripts detected, although a significant number (22.2% of total detected fusions) were still identified.
When comparing the same fresh-frozen samples with TruSeq and RiboZeroGold ScriptSeq protocols, the TruSeq libraries generated a higher percentage of reads mapping to genes, exon junctions and hence a higher number of fusion transcripts. 86.7% of total fusion transcripts identified were present in the TruSeq libraries compared to 68.9% in the ScriptSeq libraries for the same high quality RNA sample. To make these comparisons, we extracted ~56 million reads from each sample (the number of reads observed across all samples ranged from 56 million to 361 million), such that the starting number of total reads was the same. Application of our fusion pipeline to the fully available reads for each sample suggest that an increase in read depth, at least up to ~3-fold would not compensate for these differences.
However, on the other side of the coin, the RiboZeroGold ScriptSeq protocol is suitable for low starting amounts of FFPE material and also generated a higher proportion of reads mapping to non-coding genes and genes lacking a polyA tail, an aspect which may prove an important characteristic of the breast cancer genome. This was demonstrated both in our own study where we observed superb correlation of lincRNA expression between matched fresh-frozen and FFPE pairs (mean Pearson r = 0.988) and a higher proportion of reads mapping to lincRNAs with the RiboZeroGold ScriptSeq protocol, and independently by Sinicropi et al, [
Finally, we assessed the quality and sensitivity of SNV detection in our FFPE libraries compared to matched fresh-frozen pairs, and accounting for difference between TruSeq and ScriptSeq protocols. Transition:transversion (Ti/Tv) ratios of the RiboZeroGold ScriptSeq FFPE libraries were within the range reported from DNA sequencing studies [
FFPE samples detected on average 52.2% of SNVs that were detected with high confidence in matched fresh-frozen samples. Lower sensitivity correlated with smaller library insert size, r2 = 0.73, a parameter that could be used to predict performance of SNV detection specifically for each FFPE sample. Performance of gene expression from FFPE material has previously been correlated with time of fixation rather than age of sample [
In conclusion, we observed a high correlation of gene expression between fresh-frozen and FFPE RNA with both hybridization based and next generation sequencing platforms. The high degree of cross-platform correlation suggests the NanoString nCounterTM will provide an excellent platform for analytical validation of gene expression generated from FFPE whole transcriptome datasets. Our data suggest that RiboZeroGold ScriptSeq libraries are also suitable for eSNV and fusion transcript detection with some reduction in sensitivity, the proportion of which may be estimated by library insert size. Library preparation protocols that do not rely on polyA pull-down for mRNA enrichment may allow detection of non-coding genes and genes that lack polyA tails that may be relevant to cancer genomics.
(DOCX)
(DOCX)
(DOCX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)