Testing of library preparation methods for transcriptome sequencing of real life glioblastoma and brain tissue specimens: A comparative study with special focus on long non-coding RNAs

Current progress in the field of next-generation transcriptome sequencing have contributed significantly to the study of various malignancies including glioblastoma multiforme (GBM). Differential sequencing of transcriptomes of patients and non-tumor controls has a potential to reveal novel transcripts with significant role in GBM. One such candidate group of molecules are long non-coding RNAs (lncRNAs) which have been proved to be involved in processes such as carcinogenesis, epigenetic modifications and resistance to various therapeutic approaches. To maximize the value of transcriptome sequencing, a proper protocol for library preparation from tissue-derived RNA needs to be found which would produce high quality transcriptome sequencing data and increase the number of detected lncRNAs. It is important to mention that success of library preparation is determined by the quality of input RNA, which is in case of real-life tissue specimens very often altered in comparison to high quality RNA commonly used by manufacturers for development of library preparation chemistry. In the present study, we used GBM and non-tumor brain tissue specimens and compared three different commercial library preparation kits, namely NEXTflex Rapid Directional qRNA-Seq Kit (Bioo Scientific), SENSE Total RNA-Seq Library Prep Kit (Lexogen) and NEBNext Ultra II Directional RNA Library Prep Kit for Illumina (NEB). Libraries generated using SENSE kit were characterized by the most normal distribution of normalized average GC content, the least amount of over-represented sequences and the percentage of ribosomal RNA reads (0.3–1.5%) and highest numbers of uniquely mapped reads and reads aligning to coding regions. However, NEBNext kit performed better having relatively low duplication rates, even transcript coverage and the highest number of hits in Ensembl database for every biotype of our interest including lncRNAs. Our results indicate that out of three approaches the NEBNext library preparation kit was most suitable for the study of lncRNAs via transcriptome sequencing. This was further confirmed by highly consistent data reached in an independent validation on an expanded cohort.


Introduction
The rise of next-generation RNA sequencing (NGS or RNASeq) of transcriptome has largely accelerated genomic and epigenetic research and allowed the discovery of new RNA species which could harness a potential to be useful in the study of particular diseases and the development of new therapeutic strategies [1,2]. However, as is usually the case during implementation of new methods, there are issues that need to be tackled with in order to maximize the accuracy and the value of gathered experimental data. One such issue is the selection of a proper approach to molecular library preparation and sequencing with regard to the molecular targets of interest [3][4][5].
Our aim is to study transcripts called long non-coding RNAs (lncRNAs), which do not code for protein and, instead, perform as fully functional regulatory elements which have been implicated in many biological processes including transcriptional regulation of neighbouring or distant genes [6] and histone modification [7], and those related to malignancies such as gliomas [8,9]. Many lncRNAs are antisense to coding genes, intergenic or intronic [10] and, thus, suitable library prepration protocol needs to reflect this fact and retain strand-specific information. This can be achieved via incorporation of deoxyuridine triphosphates (dUTPs) during the synthesis of the second strand which is later cleaved using uracil-DNA glycosylase [11,12] or via starter/stopper binding sites determining the insert size and modulating the activity of reverse transcriptase [13]. Another important aspect which needs to be looked at is the target molecule recovery of NGS after comparing processed and mapped reads to an annotated database such as Ensembl [14]. The goal is to achieve the highest numbers of target molecules captured via NGS; both mRNAs and lncRNAs in the case of our current studies. Each method of library preparation may produce significantly different numbers based on the applied technology and steps of the procedure. This particular aspect could render some of the library preparation methods not suitable for the study of lncRNAs since they by default principle are somehow limited or biased towards other RNA species, for example genes from coding regions of the genome. Last but not least, other factors such as the sequencing depth, eveness of transcript coverage and the duplicate rates need to be also considered in order to find the most suitable approach which would produce sequencing data with sufficient sequencing depth, even transcript coverage from start to end [15] and low rates of artificial duplicates introduced during the construction and amplification of libraries [16]. Otherwise, sequencing of libraries prepared with improper approach would result in the permanent loss of valuable data about lowly expressed lncRNAs.
We used 3 library preparation kits to generate 9 libraries from glioblastoma (GBM) and non-tumor (NonT) samples sequenced in Core Facility Genomics (CFG) at CEITEC MU and compared their performance on the basis of abovementioned factors. The results generated with the best performing kit were then independently validated in an expanded cohort of patient samples in Vienna Biocenter (VB) Core Facility NGS Unit.

RNA quantification and quality control
The concentration, purity and RNA integrity number (RIN) of total RNA samples are listed in Table 1. Pure RNA is characterized by A 260 /A 280 ratio being approximately 2.00 which is the

Descriptive parameters of molecular libraries
The concentrations, modal lengths and calculated molarities of 9 CFG libraries and 8 VB libraries are listed in Table 2. The concentrations of libraries were highly variable which was dependent on the input amount of rRNA-depleted RNA and the number of PCR cycles which was set according to the manufacturers' protocols. The modal lengths of each triad of libraries differed depending on the employed method of generating fragments from input rRNAdepleted RNA and its integrity. While the modal length of NEXTflex (NF) libraries was~390 bp, the application of the SENSE kit (LX) and NEBNext kit (NB and VB) resulted in lower numbers.

Quality control of raw sequencing reads and evaluation of GC content
The acquired NGS datasets were downsampled to 2.5 million reads, and they underwent quality control (QC) performed with FastQC. Percentage of duplicate reads ranged from 17.8% to 73.7% with the top three values belonging to all three libraries prepared from sample NonT_1. The average GC content in CFG libraries ranged from 47% to 54% with the top two values belonging to NB libraries G2_NB (53%) and N1_NB (54%). As for the VB libraries, the percentage of duplicate reads ranged from 25.0% to 51.6% and the average GC content was between 46% and 50% (Table 3). GC content was also measured across the whole length of each sequence in a given dataset, normalized per sequence (GC count per read) and plotted. A roughly normal distribution of GC content is typical for normal random libraries. Libraries most closely resembling the normal distribution were LX libraries G1_LX and G2_LX (Fig 2).
Adapter content in all libraries determined via FastQC was significantly low (� 0.1%). The amount of over-represented sequences was evaluated in every library again with FastQC which may indicate problems with ribosomal depletion or other bias. Out of CFG libraries, the highest presence of these sequences was found in N1_NB (29.25%), followed by G2_NB (5.32%), N1_NF (4.47%) and G1_NB (3.18%). No over-represented sequences were present in the rest of the libraries (Fig 3A). The highest amount of over-represented sequences was found in N1_VB (11.48%) which corresponds to CFG library N1_NB. Other VB libraries contained less than 5% of over-represented sequences. Interestingly, G1_VB and G5_VB contained predominantly one over-represented sequence ( Fig 3B).

Alignment and mapping statistics
Reads from downsampled datasets were aligned to the reference genome using STAR [17] with the resulting numbers of overall uniquely mapped reads from CFG libraries ranging between 46.4% to 83.7% of all mapped reads with the top two highest values belonging to LX libraries. Moreover, between 21.8% and 61.7% of all reads were those that overlap one and only one gene ( Fig 4A). Uniquely mapped reads from VB libraries comprised 60.6% and 81.0% of all mapped reads and the percentage of reads that overlapped only one gene ranged from 60.6% to 81.0% of all reads ( Fig 4B).
The overall percentage of bases in primary alignments that align to particular regions in the reference genome calculated using Picard [18] was between 80 and 90% for CFG libraries and 90% for VB libraries. While the SENSE kit seemed to prefer reads whose bases predominantly aligned to coding regions (>30%), the lowest numbers of bases in such primary alignments belonged to libraries N1_NF and N1_NB (10.7% and 12.2%, respectively). On the contrary, the numbers of base reads mapped to untranslated regions (UTR) were on average higher for NF and NB libraries. The number of bases aligning to intronic and intergenic regions ranged from 9.5 to 21.8% and 3.0 to 6.1%, respectively, with no major differences between the three groups of libraries. The top three highest numbers of bases aligning to ribosomal regions belonged to libraries G2_NB (11.1%), G2_NF (14.0%) and N1_NB (14.7%) ( Fig 5A). The numbers of base reads from VB libraries mapped to coding regions ranged between 15.0% and 26.9% and those that were aligned to UTR regions ranged between 29.7% and 45.4%. The number of base reads mapped to intronic and intergenic regions ranged from 14.3 to 26.4% and from 4.3 to 7.6%, respectively. Finally, the number of bases mapped to ribosomal regions ranged from 2.0 to 7.9% ( Fig 5B). Uniquely mapped reads were also compared with Ensembl database using the Ensembl automatic annotation system [14,19] and transcripts were classified into several biotypes which comprise larger groups. The most populous biotypes which are relevant to our current study are plotted in Fig 6, namely "protein coding", "lincRNA", "antisense RNA" and "sense intronic". While the first is the major biotype of larger group of protein coding transcripts, the rest comprises group of lncRNA. Upon comparison of all three library preparation kits, we found that the highest number of hits for each biotype of interest belonged to NB libraries, while NF libraries came second and LX libraries third. This hinted at the lesser variability of library fragments generated with SENSE kit.

Transcript coverage
Normalized gene coverage was calculated for all groups of libraries using Picard. Coverage of NF libraries seemed relatively even with notable peak in coverage of 3' terminal parts of genes (~90-100%) (Fig 7A). Coverage of LX libraries seemed more irregular hinting again at a lesser variability of fragments generated with this particular kit (Fig 7B). Relatively same even coverage was noted for NB and VB libraries again with the exception of the 3' terminal parts of genes where a spike in coverage could be observed in most of the libraries (Fig 7C and 7D).

Efficiency of rRNA depletion and duplicate rates
Picard determined the percentage of rRNA ranging from as low as 0.3% to 17.9% with the three lowest values belonging to LX libraries G1_LX (0.3%), G2_LX (1.3%) and N1_LX (1.5%), and the highest value belonging to NF library G2_NF (17.9%). Significantly higher percentage of rRNA was also observed in NB libraries G2_NB (12.4%) and N1_NB (16.6%). The percentage of rRNA reads in VB libraries ranged between 2.2 and 8.8% and, thus, was even lower than in NB libraries. The duplicate rates for CFG libraries determined with Picard ranged from 25.6% to 91.5% with those belonging to LX libraries (55.1%, 59.3% and 91.5%) being significantly higher among each group of three libraries prepared from the same RNA sample. The duplicate rates for VB libraries ranged from 27.6% to 53.8% (Table 4).
The 2D density scatter plots and histograms of the distribution of reads per kilobase (RPK) values per gene corresponding to each library generated using dupRadar [20] are depicted in Figs 8 and 9. Density scatter plots are relations of duplicate percentage to expression in reads/ kbp. While NF and NB libraries showed relatively low duplicate rates, significantly higher numbers were observed in LX libraries, which was consistent with calculations from Picard. Moreover, duplication rates of NB libraries were comparable to those of VB libraries, as was expected. No apparent skewed distributions of RPK values per gene with unusual amount of lowly expressed genes were observed.

Summary score
The most important results of comparative study are summarized in Table 5. The most suitable library preparation kit for NGS of lncRNAs reached the highest total scoring.

Discussion
The fact that the prognosis of patients with GBM remains very poor to this day despite the available therapy poses a serious problem. Therefore, finding both new therapeutic targets and prognostic, predictive or diagnostic biomarkers is of utmost importance for making progress. One such candidate is the class of non-coding transcripts called lncRNAs which has been shown to regulate various biological processes related to GBM biology, including angiogenesis, epigenetic silencing, and alkyl agents and ionizing radiation resistance. In order to fully understand their value, deeper understanding of lncRNAs in GBM biology is vital. One of the approaches to finding potential novel biomarkers is the differential sequencing of transcriptomes of GBM patients and healthy controls which would reveal lncRNAs with significantly different expression between the two cohorts. This could indicate their importance in the biology of GBM where they could function as oncogenic transcripts or tumor suppresors. Thus, choosing the best approach for generating molecular libraries of sufficient quality suitable for the study of lncRNAs is fundamental.
The aim of our study was to compare three molecular library preparation kits NF, LX and NB for the sequencing of patient transcriptomes and evaluate their performance based on chosen parameters with the focus on lncRNAs. Three RNA samples with varying integrity have been chosen in order to see how well the library preparation methods would deal with this particular issue. The first stark difference observed before alignment of reads to the reference genome was the distribution of normalized average GC content which would reveal whether generated libraries were random or biased. This bias could indicate a specific contamination by over-represented sequences [21], most likely rRNA since the adapter content in all libraries was � 0.1%. The most uneven distribution of GC content was observed in all three NF libraries whereas LX libraries showed roughly normal distribution. This would also later correlate with the percentage of rRNA reads which was generally high for NF libraries and the lowest for LX libraries. NB libraries showed relatively even distribution with the noticeable exception of N1_NB which again showed biased GC content distribution with unusually high normalized average GC content at 60%. The common denominator was RNA sample NonT-1 from which the most biased library from every group was prepared. Regarding the rRNA content in libraries, the SENSE kit would perform significantly better than the other two as the percentage of rRNA reads in LX libraries was 1.5% and lower. This could render the combination of ribosomal depletion with Ribocop and library preparation with the SENSE kit ideal for whole transcriptome sequencing. However, it did not necessarily mean that this particular kit would be  suitable for the study of lncRNAs and further evidence was needed in order to confirm or disprove this claim. After evaluating normalized transcript coverage, almost all libraries showed a significantly higher and biased coverage of the 3' end of transcripts. This highlights the importance of high Testing of library preparation methods for transcriptome sequencing integrity of RNA samples. If the sample quality is poor, polyadenylated transcripts are partially degraded which could result in truncated species with polyA tail still intact. These smaller transcripts might produce cDNA fragments which may, in turn, become overexpressed during the PCR step of the library preparation procedure which could potentially result in deeper coverage of the 3' end of transcripts and introduce the 5' to 3' bias [22]. While all libraries showed normal increase in coverage after 5' end, the remaining portion of the graph hints at different levels of variability in library complexity. NF libraries were characterized with relatively even coverage throughout the genes. This was true more so for NB libraries which showed the most even coverage out of all three groups. This was in sheer contrast with LX libraries which were characterized by irregular coverage resulting in spiky graph curve. However, this was expected since the procedure for this particular approach involves non-random fragmentation which results in lesser variability of fragments generated with the SENSE kit.
Moreover, this was also confirmed upon comparing the aligned reads with Ensembl database which revealed another notable difference between the three methods. The number of hits for each Ensembl biotype of our interest was significantly higher for NB libraries compared to the other two groups, and actually the lowest for LX libraries. This together with the fact that the SENSE kit seemed to prefer coding sequences judging by the significantly higher number of base reads from LX libraries aligning to coding regions would ultimately render the SENSE kit the least suitable for the study of lncRNAs. Another notable difference between these three approaches, which would further affirm this notion, were the percentages of duplicate reads and especially their relation to the expression in reads/kbp. The highest duplication rates were observed in LX libraries with duplicate reads in N1_LX as high as 91.5%. The other two kits performed markedly better which was confirmed by the results both from Picard and dupRadar.
With everything taken into account, the SENSE kit performed well compared to the other two in regard to the normalized GC content and the amount of reads aligning to ribosomal regions but was outperformed by NEBNext in other aspects of our study. Especially important was the discovery that libraries prepared with NEBNext kit had most hits in Ensembl database in every category of our interest including both mRNAs and lncRNAs. This put together with relatively low duplication rates and even transcript coverage made this kit the most applicable for the study of lncRNAs.
To independently validate our results, an expanded cohort of 8 RNA samples with varying levels of integrity was brought to another NGS facility where it underwent ribosomal depletion using RiboZero and library preparation with the NEBNext kit. When comparing resulting VB libraries to NB libraries, the distribution of normalized average GC content was observed to be similar but in case of VB libraries with less pronounced abnormal peaks indicating specific contamination with rRNA. In actuality, this was confirmed using Picard on post-alignment Table 5. Summary scores of chosen parameters of all three tested library preparation kits. Scores were assigned in four categories, specifically the percentage of rRNA reads and duplicate reads and the number of hits in Ensembl database for the groups of mRNAs and lncRNAs. The lowest score in each category is denoted by "-" and the highest by "++". Based on these parameters, the most suitable kit for preparation of libraries for sequencing of lncRNAs, NEBNext, would be chosen and later independently tested in another NGS facility. Testing of library preparation methods for transcriptome sequencing data since the resulting percentages of rRNA reads from VB libraries were lower than those from corresponding libraries G2_NB and N1_NB. The amount of over-represented sequences and that of top over-represented sequence was observed also to be lower for VB libraries compared to NB libraries. After alignment, we noticed slight improvement in mapping statistics for VB libraries, mainly the higher numbers of uniquely mapped reads and reads which overlap only one gene and lower multimapping rates compared to NB libraries. The normalized transcript coverage for VB libraries was similarly even throughout the genes with the exception of 3' end where a much stronger bias resulting in increased depth of sequencing could be observed. On the other hand, duplication rates in VB libraries were still relatively low and comparable to those in NB libraries. All in all, these independent observations confirmed what we had previously seen with NB libraries.

Conclusions
Looking for new therapeutic targets and clinical biomarkers is vital in case of malignancies with highly unfavorable prognosis such as GBM. Since the discovery of long non-coding transcripts and their multiple regulatory functions, new array of methods including NGS gave rise to a whole new field of genomics and epigentics which focuses on these molecules and their implication in the context of pathophysiological processes. Considering our findings, we conclude that choosing the most suitable approach to library construction for transcriptome sequencing with the aim of studying lncRNAs is highly important in order to retain valuable data on lowly expressed RNA species that could otherwise be permanently lost. Out of the three library preparation kits we tested, NEBNext Ultra II Directional RNA Library Prep for Illumina kit performed best with regard to relatively even transcript coverage, low rates of PCR duplicates and the highest number of hits for biotypes of interest in Ensembl database. We also confirmed the consistency of NEBNext's performance with independent validation on an expanded cohort of patient samples. Therefore, we suggest this approach for lncRNAs sequencing to be used in the studies based on analysis of GBM and brain-tissue specimens.

Patient samples
Samples used in this analysis were part of a larger cohort included in a study focused on sequencing transcriptomes of GBM patients and healthy controls with subsequent bioinformatic and statistical evaluation of the acquired data with the aim of finding putative diagnostic, prognostic or predictive biomarkers and therapeutic targets. Fresh-frozen samples of primary GBM were collected periodically from several Czech neurosurgical departments and independently confirmed by two histopathologists. Non-tumor brain tissues obtained from non-dominant anterior temporal cortexes resected during surgery for intractable pharmacoresistant epilepsy served as healthy controls.

Isolation and purification of RNA
Tissue samples were mechanically homogenized with ceramic beads and total RNA enriched for small RNA species was then isolated and purified using mirVana miRNA Isolation Kit (Invitrogen) which is based on phenol-chloroform extraction and purification in a filter cartridge. Isolated and purified total RNA was then quantified using both NanoDrop 2000 Spectrophotometer (Thermo Scientific) and diluted to approximately 250 ng/μl for downstream analyses. Diluted RNA was then quantified using Qubit 2.0 Fluorometer in combination with Qubit RNA BR Assay Kit (both Invitrogen) for a more precise measurement. In total, three NGS library preparation kits were used to generate molecular libraries from rRNA-depleted RNA samples and compared, namely NEXTflex Rapid Directional qRNA-Seq Kit (Bioo Scientific), SENSE Total RNA-Seq Library Prep Kit (Lexogen) and NEBNext Ultr II Directional RNA Library Prep Kit for Illumina (NEB). All procedures were performed according to the protocols suggested by the manufacturers except VB libraries, which were prepared with NEBNext kit according to the adjusted protocol with minor deviations, which were a result of previous optimization done by the VBCF. Specifically, the time of RNA fragmentation was shortened to 7 minutes for all samples regardless of RIN and the PCR step of library preparation was performed in a real-time set-up with EvaGreen fluorescent dye (Biotium) mixed in to the PCR reaction mix, which, in turn, allowed halting PCR reactions with individual libraries at the proper time in order to avoid overcycling. Samples were multiplexed using suitable molecular barcodes and resulting cDNA pools were processed according to the NextSeq System Denature and Dilute Libraries Guide [23]. The minimum requirements and brief characterization of individual library preparation kits are summarized in Table 6. Table 6. Minimum requirements and brief characterization of three tested library preparation kits. The minimum requirements according to manufacturers' protocols are the quality of pre-depletion RNA and the amount of input rRNA-depleted RNA. The quality of RNA is based either on RNA integrity number (RIN) or the source of RNA, for example formalin-fixed paraffin-embedded (FFPE) tissue blocks. Preceding depletion of rRNA is required for all library prepration procedures.

Quantification and quality control of molecular libraries
Firstly, the concentration of molecular libraries was measured using Qubit 2.0 Fluorometer in combination with Qubit dsDNA HS Assay Kit (both Invitrogen). Secondly, libraries were analyzed on Agilent 2200 TapeStation in combination with Agilent High Sensitivity D1000 ScreenTape System (both Agilent Technologies) according to the manufacturer's protocol. VB libraries were analyzed on Fragment Analyzer using High Sensitivity NGS Fragment Analysis Kit (both Advanced Analytical). The molarity of individual libraries was calculated using determined concentrations and modal lengths using free online weight-to-moles conversion calculator for nucleic acids [24].

Next-generation sequencing
Single-read sequencing of CFG libraries with a read length of 75 was performed with NextSeq 500 Sequencing System using NextSeq 500/550 High Output v2 kit (75 cycles) (both Illumina). Paired-end sequencing of VB libraries with a read length of 50 was performed with HiSeq 2500 System using HiSeq SBS Kit V4 kit (50 cycles) (both Illumina). PhiX Control v3 (Illumina) was added at 1% to all pools as an internal control before the sequencing.

Processing of acquired sequencing data
Subsequent processing of acquired sequencing data was performed with tools in the environment of R-based Bioconductor software. Firstly, the pre-alignment QC of acquired sequencing data contained in FASTQ files was done using FastQC [21]. The sequences were then trimmed for adaptors with the command-line tool cutadapt [25] and all reads were aligned to a reference human genome using STAR [17]. The QC of aligned reads contained in resulting Binary Alignment Map (BAM) files was done with Picard [18]. The same tool was used for determining the percentages of rRNA, mRNA and duplicates. All generated numerical and graphical output was gathered in cohesive reports and exported via MultiQC [26]. Additionally, dupRadar was used for assessment of the distribution of RPK values per gene and relation of duplication rates to expression (RPK) and generating histograms and 2D density scatter plots [20]. Aligned uniquely mapped reads were also compared with Ensembl database (version 93) in order to identify the percentage of distinct molecular biotypes [19].

Ethics approval and consent to participate
This study was approved by the local Ethical Committee University Hospital Brno (Brno, Czech Republic) and written informed consent was obtained from each patient entering the study.