Circulating miRNAs, isomiRs and small RNA clusters in human plasma and breast milk

Circulating small RNAs, including miRNAs but also isomiRs and other RNA species, have the potential to be used as non-invasive biomarkers for communicable and non-communicable diseases. This study aims to characterize and compare small RNA profiles in human biofluids. For this purpose, RNA was extracted from plasma and breast milk samples from 15 healthy postpartum mothers. Small RNA libraries were prepared with the NEBNext® small RNA library preparation kit and sequenced in an Illumina HiSeq2000 platform. miRNAs, isomiRs and clusters of small RNAs were annotated using seqBuster/seqCluster framework in 5 plasma and 10 milk samples that passed the initial quality control. The RNA yield was 81 ng/mL [standard deviation (SD): 41] and 3985 ng/mL (SD: 3767) for plasma and breast milk, respectively. Mean number of good quality reads was 4.04 million (M) (40.01% of the reads) in plasma and 12.5M (89.6%) in breast milk. One thousand one hundred eighty two miRNAs, 12,084 isomiRs and 1,053 small RNA clusters that included piwi-interfering RNAs (piRNAs), tRNAs, small nucleolar RNAs (snoRNA) and small nuclear RNAs (snRNAs) were detected. Samples grouped by biofluid, with 308 miRNAs, 1,790 isomiRs and 778 small RNA clusters differentially detected. In summary, plasma and milk showed a different small RNA profile. In both, miRNAs, piRNAs, tRNAs, snRNAs, and snoRNAs were identified, confirming the presence of non-miRNA species in plasma, and describing them for the first time in milk.

Introduction preterm delivery (<37 weeks of gestation). Basic demographics of the women and data about gestation and delivery were collected. Peripheral blood was collected at 0-48 h post-partum for 15 participants; and breast milk was collected at 48-72 h post-partum for 10 of the participants. Participants signed an informed consent and protocols were approved by the Hospital del Mar and Hospital Clinic (Barcelona, Spain) ethical committees.
Collection and pre-processing of samples Plasma. Twelve mL of peripheral blood were collected in VACUETTE1 EDTA Tubes (K3EDTA vacutainers) (Greiner Bio-One, Cat No.: 456038). EDTA blood tubes were gently mixed by inverting 10 times and stored at 4˚C upright until centrifugation. Plasma separation was performed by centrifugation of blood samples in horizontal rotor for 10 minutes at 2000 x g at 4˚C. A second centrifugation at 16,000 x g for 15 minutes at 4˚C, was performed to eliminate any cell debris. Plasma was transferred to a new RNase-free cryotube avoiding the collection of any cell debris and stored at -80˚C.
Plasma haemolysis was evaluated by spectrophotometry using a NanoDrop ND-2000 equipment (Thermo Fisher Scientific). Absorbance at 414 nm of 0.2 was set as cut-off haemolysis as described before [27].
Milk. Fifteen mL of breast milk were extracted using an electric pump (Medela, Cat No.:008.0176) following manufacturer's instructions and transferred to a 100 mL sterile RNase-free recipient. Milk was kept in ice and vortexed gently before centrifugation at 2000 x g for 10 minutes at 4˚C to remove milk fat globules, cells, and large debris. Immediately after carefully transferring, supernatant was carefully transferred into new 2 mL RNase-free tubes, a second centrifugation at 16,000 x g for 15 minutes was performed at 4˚C to remove residual fat, cell debris and the casein fraction. Skim milk, avoiding fat or cells, was transferred into 2.5 mL cryotubes and stored at -80˚C.

RNA extraction, quantification and quality control
Small RNAs were extracted using the miRNeasy Serum/Plasma kit (Qiagen, Cat No.: 217184) with minor modifications. RNA extraction was performed from at least 3 mL of plasma and milk. One mL of biofluid was filtered per each column and washed two times with ethanol 80% before elution. Each sample was measured for RNA quantity using Fluorometric quantification (Qubit™ 2.0 Fluorometer, Thermo Fisher Scientific) with the Qubit miRNA assay (Thermo Fisher Scientific) and for integrity and quality using two different methods: spectrophotometer (NanoDrop™ 1000, Thermo Scientific) and electrophoresis and flowcytometry (Agilent RNA 6000 pico chip and Agilent 2100 Bioanalyzer, Agilent Technologies).

Library preparation and sequencing
Libraries were prepared using the NEBNext1 Small RNA Library Prep Set for Illumina1 (Multiplex Compatible) (NEB, Cat No.: E7330) following manufacturer instructions and using 150ng of RNA obtained from plasma samples and 500 ng from milk. Library size selection was done using acrylamide gels. Quality and concentration of cDNA libraries was checked, and finally groups of 13-20 samples were pooled at the same concentration before sequencing in an Illumina HiSeq2000 Sequencing System (50 nt, single read). removal, reads with the following features were removed: 1) Reads <18nt, 2) Mean PHRED scores < 30, and 3) Low complexity reads based on mean score of the read. Then, QCed reads were mapped to miRNAs, and miRNA complexity was estimated as the number of miRNA genes that are observed as a function of the number of miRNA reads.
Identification and quantification of small RNAs. Sequences fulfilling QC were used as the input for the seqBuster/seqCluster tool that retrieves three matrices of counts: i) miRNA, ii) isomiRs, and iii) small RNA clusters [25,28]. In order to detect miRNAs and isomiRs, reads were mapped to hairpins released from miRBasev21 using miraligner allowing one mismatch [28]. Sequences mapping ambiguously to several positions of the genome were filtered out. Only isomiRs with 3' or 5' trimming or 3' addition modifications are kept in the result files. isomiRs with one nucleotide substitution are not reported due to the difficulty of linking them unambiguously to a particular miRNA. IsomiR naming followed the miRTop project's recommendations (http://mirtop.github.io/). IsomiR names can be merged with isomiR sequences using a GFF3 format file provided on GEO. In parallel, sequences were mapped to hs37d5 using bowtie [29], and hotspots (sets of overlapping sequences according to their position in the genome) were detected. Hotspots sharing any sequence were grouped in primary clusters. Then, a recursively heuristic algorithm based on reduction and cluster correction was applied to the ones sharing over 60% of the sequences. These were annotated to different RNA species using miRBase v20 [30], refGene (RefSeq genes), wgRna (CD and H/ACA Box snoRNAs and miRNAs from Weber and Griffiths-Jones), rmsk (Repeating elements created using Repeat-Masker) and tRNA from University of California Santa Cruz (UCSC) genome browser (hg19) [31], and piR_hg19_v1.0 from the regulatoryrna database (www.regulatoryrna.org) [32]. Clusters can be considered as unique units of transcription, regardless of their annotation to one or multiple positions in the genome (irrespective of their genomic origin).
Raw FASTQ files, quality controlled unnormalized counts, metadata and a GFF3 format file with isomiR annotation can be downloaded from Gene Expression Omnibus (GEO): GSE107524.
Normalization and differential expression. Normalization and differential expression was performed with the R package DESeq2 v.1.10.1 (R version 3.3.2) [33]. DESeq2 performs an internal normalization where geometric mean is calculated for each gene across all samples (scaling factor method) The impact of main technical and demographic/biological variables on normalized counts was explored through Principal Component Analysis (PCA) and dendrograms implemented within the FactoMineR R package [34]. Differential expression was assessed with negative binomial generalized linear models adjusting for multiple testing with the False Discovery Rate (FDR) method [35]. All models considered biofluid as the independent variable and were adjusted for subject as a fixed effect. In addition, a sensitivity analysis considering only paired samples was performed and results did not change substantially.
Milk samples presented systematically higher sequencing depth than plasma samples. As sensitivity analyses, we repeated the miRNA differential analyses with different filtering approaches. Our main analysis used DESeq2 to filter out features having little chance of showing significant evidence as they contain outliers or low mean normalized counts. Second, to avoid comparison of miRNAs not detected due to the lower sequencing depth in plasma samples, an initial filtering of miRNAs present in <80% of the samples and with <10 reads was done. Finally, ten random milk and plasma subsamples of 0.2M miRNA reads each (the minimum in plasma samples) were obtained and analysed. P values of each of the 10 differential analysis were combined using simes test [36], from mppa v.1.0 (eprint arXiv:1408.3845) in R (3.3.2).

Study population and sample collection
Fifteen healthy volunteer mothers were enrolled in the study during 24-72h post-delivery. Main characteristics of the mothers are found in Table 1. Mean age of the participants was 32.9 years. Forty percent of them were primiparous and only one of them smoked during pregnancy. One third of the participants were born in Spain, one third in Asia and the rest were from Morocco and South America. Plasma samples were collected for all 15 women, while milk was available for 10 of them.

RNA extraction and quality control
The average amount of RNA extracted from plasma and milk samples was 81.14 ng/mL (standard deviation, SD 40.92) and 3,984.85 ng/mL (SD 3,766.97), respectively (S1 Table). In the Bioanalyzer plot, all biofluids showed a peak between 25 to 200 nucleotides which corresponded to the expected size for small RNAs (S2 Fig). No peaks were detected at 18S and 28S in plasma, whereas a peak was observed in half of the milk samples, suggesting potential contamination with cellular RNA.

Quality control
The mean number of reads obtained for each biofluid was similar: 13.96 million (M) in plasma and 13.87 M in milk. However, the average proportion of good quality reads was 89.58% in milk, and only 40.01% in plasma ( Table 2 and S2 Table). Three point ninety seven percent and 35.72% of the raw reads mapped to miRNAs in plasma and milk, respectively. Only samples with >2M of good quality reads of which >0.2 M mapped to miRNAs were considered for further analysis (10 milk and 5 plasma samples).  Table). miRNA complexity was quite heterogeneous among samples, even among samples from the same biofluid, and tended to be higher in milk than in plasma (S3 Fig). Samples clustered by biofluid and not by subject (Fig 2). Data did not cluster by technical factors (time to  (Table 3 and S4 Table). Among the 112 miRNAs found in >80% of the samples and with >10 counts, 87 of them showed statistically significant differences between biofluids (66 milk > plasma). Seventy-four out of the 87 miRNAs overlapped with the ones detected in the main analysis (S7 Fig).  To further confirm the results given the different sequencing depth of each biofluid, 10 random subsamples of 0.2 M miRNA reads were obtained for each plasma and milk sample. The total number of unique miRNA in the 10 milk subsamples was 534 (115 common to all milk samples) and in the 10 plasma subsamples was 715 (100 common to all plasma samples). Samples were again classified by biofluid (S8 Fig). Two-hundred and fourteen miRNAs exhibited different levels depending on the biofluid [72 (33.64%) milk > plasma]. One hundred and eighty six overlapped with the results of the main analysis (S7 Fig, S4 Table).

Small RNA clusters
In total, one thousand and fifty three small RNA clusters were identified in the 10 milk and 5 plasma samples. One thousand and three of them were found in milk with >1 count (930 with mean abundance of >10 RPM), and 819 in plasma (648 with mean abundance of >10 RPM). Only for descriptive purposes, clusters with multiple annotations were classified to one single RNA class following this prioritization: miRNA, rRNA, tRNA, snoRNA, small nuclear RNA (snRNA), vault RNAs (VTRNA), long non-coding RNA (lncRNA), piRNA, gene, and repeats. The number and proportion of clusters in each RNA class by biofluid can be found in Table 5. In both biofluids, more than 30% of the clusters identified were annotated to genes, and around 20% to miRNAs. Milk samples showed a higher proportion of reads mapping to miR-NAs (46.76%) and tRNAs (30.91%), whereas plasma had a higher proportion of reads in clusters represented by piRNAs (32.03%), genes (6.25%), and repeats (40.57%) ( Table 5). miRNA clusters in plasma corresponded to 18.20% of the total number of reads. Around 6% and 3% of the reads remained not annotated in plasma and milk, respectively.

Discussion
In this study, we identified several species of small RNAs, including miRNAs, piRNAs, tRNAs, snRNAs and snoRNAs in both plasma and breast milk. Their profiles were biofluid specific. As the identification of circulating small RNAs depends on the quality of the biological material, the quantification method and the bioinformatic pipeline, we discuss bellow the technical aspects followed during this study that can potentially affect the results obtained.

RNA extraction and sequencing
The RNA extraction and sequencing yielded data of enough quality for all milk samples and 5 out of the 15 plasmas. Forty % of reads in plasma and 89.58% in milk passed the quality control step, with only 3.97% and 35.72% mapping to miRNAs, respectively. RNA input for plasma and milk small RNA libraries were 150 ng and 500 ng, respectively, which respond to the different RNA quantity obtained for each biofluid. RNA yield is thus a key factor for the quality of sequencing data. In plasma, the lack of 18S and 28S peaks in the RNA patterns obtained with the Bioanalyzer suggested the absence of RNA cell contamination. Indeed, plasma samples were centrifuged twice in <3.5 h after blood collection and the spectrophotometer haemolysis values were below 0.2, except for one sample. Given the low RNA yield obtained per mL of plasma (around 40 ng/mL), the addition of carriers during RNA extraction would be recommended [37].
In contrast, milk samples presented some signal at 18S and 28S, independently of whether samples were centrifuged one or two times before storing. Milk samples also showed systematically higher RNA levels by mL of biofluid than plasma samples (around 50-times higher levels). Given the higher amount of RNA obtained in milk and the presence of ribosomal RNA, we suspect contamination with the cellular fraction during skim milk sample separation. However, other studies reported similar RNA yields for skim milk (around 4000 ng/mL) as ours, and higher amounts in the milk fat and cellular fractions [38]. Contamination with cellular RNA is one of the main issues in the use of circulating free small RNAs as biomarkers of disease [39]. Therefore, all the findings shown here, as well as in other publications, should be interpreted considering this potential limitation.

Small RNA bioinformatic analysis
Currently, the analysis of miRNA sequencing data is quite straight forward. In contrast, differences of one or few nucleotides in isomiRs with respect to reference miRNAs and frequent multicopy non-miRNA small RNAs in the genome impose several limitations to their bioinformatics analysis. To quantify small RNAs, we used SeqCluster, a tool that groups RNAs in clusters based on their sequence similarity. RNA clusters are defined as unique transcriptions units with potentially the same molecular function given their similar sequence, regardless of their genomic location. For instance, tRNAs which are present in different copies in the genome are clustered together. The reference databases and approach used in this study differ from the pipeline followed in two of the most recent publications on circulating small RNAs: Freedman et al. 2016 (ExceRpt, based on sRNABench) [19] and Yeri et al. 2017 (sRNABench) [21]. In sRNABench, reads are first mapped to miRNA, and remaining reads are mapped to other small RNAs using ENSEMBL 75. For isomiRs characterization, 3' additions and 3' and 5' trimming editions were considered in the analysis, as there are strong evidences of their processing [40][41][42]. In contrast, isomiRs with substitutions were discarded due to difficulty of mapping them unambiguously.

Small RNAs in plasma
Ten out of the 824 miRNAs detected in plasma represented >70% of total number of reads mapping to miRNAs. Other authors have also reported few miRNAs accounting for around 50% of the total number of sequencing reads [19,43]. In agreement with Yeri et al. [21] hsa-miR-486-5p was the most abundant miRNA in plasma (representing approximately 40% of the miRNA reads). A bias towards this miRNA in samples prepared with the Illumina kit has been reported in plasma-derived exosomes [20]. Ten and 8 of the 10 top miRNAs identified in this study were detected among the top 50 positions in Freedman et al. [19] and in Yeri et al. [21], respectively. The most abundant miRNA in Freedman et al was hsa-miR-451a, while in our sample set this miRNA was found in the 5 th position. According to public data deposited in the miRmine database [44], hsa-miR-486-5p and hsa-miR-451a are the two most abundant miRNAs in plasma, and are produced by red blood cells [39]. Hsa-miR-122, which is highly abundant in plasma, is mostly expressed in liver and enters easily into circulation, pointing towards its potential as biomarker for several liver pathologies [45].
In agreement with Freedman et al. [19] and Yeri et al. [21] we also found piRNAs, tRNAs, snoRNAs, snRNAs, VTRNAs and lncRNAs in plasma. The abundance of miRNAs, tRNAs, snoRNAs, snRNAs, and VTRNAs was in a similar range among previous studies. In contrast, there were strong discrepancies in the piRNA levels. Interestingly, a piRNA cluster mapping to several piRNAs and RXRB gene accounted for 37.94% of the reads in the plasma samples analysed in this study. After visual inspection of the genomic region, we realized that this cluster also overlaps with Y RNA genes (RNAY4/RNAY4P10). Indeed, Yeri et al. (2017) described a high proportion of sequences mapping to Y RNA in plasma (>60%) and a much lower abundance of piRNAs [21] Y RNAs participate in chromosomal DNA replication and are involved in RNA stability and response to stress [46]. Circulating Y RNAs of 25-33 nt have been found in blood, within vesicles or as cell-free ribonucleoprotein (RNP) complexes [47][48][49]. Their function in biofluids, if any, is unknown and it cannot be discarded whether they are degradation products.
piRNAs were thought to be specific to germ cells, where they play essential roles in fertility and regeneration [50], but they have been described in plasma [19,21], urine, and saliva [21] and in healthy and cancer somatic cells [51,52]. In particular, Keam et al showed that Hiwi2 protein, one of the three human Piwi proteins, was ubiquitously expressed in somatic cells. The sequencing of the RNA co-immunoprecipitated with Hiwi2 protein identified a wide range of small RNA sequences of 18 to 34 nt that mapped to processed tRNA fragments, known piRNAs and coding genes, among others. The authors suggest that, in somatic cells, the Piwi-piRNA pathway participates in the regulation of protein translation [51]. Of note, the way how piRNAs are identified, through sequencing of Piwi bound RNAs, might introduce some noise in the piRNA sequence pool in reference databases.
Differences in small RNA abundance with previous publications [19,21] can be explained by biological and technical factors. The quantification of small RNAs was performed using different NGS platforms, for which some bias to specific small RNAs have been described before [20]. Moreover, these studies used different bioinformatics pipelines. The small sample size of our dataset (5 plasma samples) in combination with the low and highly heterogeneous expression pattern of small RNAs in plasma with only a few of them detected in all samples might have biased our findings. Finally, our study included a highly specific population of post-partum mothers, while Freedman et al. and Yeri et al. enrolled participants from the general population and male athletes, respectively.
In the present study focusing on skim milk, 10 miRNAs accounted for >70% of the reads mapped to miRNAs. Hsa-miR-148a-3p represented around 30% of the reads. This miRNA was previously reported to be an abundant miRNA in human milk fat [53] and milk exosomes/microvesicles [56,58]. Of notice, this miRNA was only found in studies using NGS, but not those applying qPCR or microarrays. Hsa-miR-146b, hsa-miR-200c and hsa-30a-5p, among top miRNAs expressed in our sample set, were also reported to be highly abundant in milk fat, skim milk or milk exosomes/microvesicles [53,55,56,58]. Moreover, some of them were also abundant in the milk cellular fractions [55]. A direct comparison of this study versus previous publications is subjected to the differences in starting biological material, lactation period, and quantification methods.
Besides miRNAs, we detected other small RNAs, being tRNAs one of the most abundant class. tRNAs have been found before in several biofluids [19,21], including milk exosomes [56,58]. tRNA fragments seem to have distinct expression patterns and regulatory functions [59]. As far as we know, this is the first study describing the presence of piRNAs, snRNAs, snoRNAs and VTRNAs in milk. Whether breast milk tRNAs and other small RNAs can reach child circulation and modulate child physiology is of great interest and requires further investigation.

Differential detection of RNAs in both biofluids
As expected, we observed a clear separation between biofluids in terms of their small RNA profile [21,60]. Previous studies have shown that the miRNA profile observed in each biofluid recapitulates the miRNA profile of the main cell types releasing small RNAs in those biofluids (blood cells in plasma [39] and mammary epithelial cells in milk [55]).
Besides miRNAs, several small RNA clusters showed a different expression pattern between biofluids. As an example, levels of VTRNA2-1 were higher in plasma than in milk. VTRNA2-1 transcript is a putative tumour suppressor and modulator of innate immunity, which responds to environmental stressors during development through loss of imprinting [61]. HOTAIR, also detected at higher levels in plasma, is a long non-coding RNA that represses transcription of HOXD genes in trans. HOTAIR has been implicated in cancer [62]. We detected >10,000 iso-miRs which beard one or several editions (3' addition, 3' trimming or 5' trimming): 4,760 in plasma and 10,541 in milk, processed from 707 and 875 miRNAs, respectively. Differentially isomiRs found in milk and plasma matched with the differentially expressed miRNAs detected in these two biofluids. Some studies suggest that specific isoforms of the same miRNA are the ones that have an effect on disease [63]. Further investigation is needed to corroborate whether isomiRs affected by specific diseases can be found in circulation, however our study shows a myriad of isomiRs which could be potentially used as biomarkers.
Although sequencing depth of milk and plasma samples was similar, the proportion of good quality reads mapping to miRNA was much higher in milk than in plasma. This could bias the results of the differential analysis (i.e. some miRNAs expressed at mid or low levels in milk might not have been detected in plasma due to the lower sequencing depth). DESeq2 normalizes using the geometric median of the ratio between the gene of each sample and the average among samples, but when there are important differences in library size, normalization methods might not to be sufficient [64]. To confirm the biofluid-specific findings, we applied a subsampling approach in the miRNAs analysis. This approach consists in a random selection of the same read number in all the samples of the study. To control for a potential selection bias, we performed ten random subsamplings. Approximately 60% (186 out of 308) of the miRNAs detected in the main differential analysis were confirmed in the subsampling analysis. miRNAs only detected in the main analysis and not in the subsampling analysis could be false positives resulting from the different depth of sequencing between biofluids.

Summary
Plasma and milk samples showed a completely different small RNA profile. Several types of small RNA species were detected in both biofluids, including miRNAs, tRNAs, snRNAs, snoR-NAs, lncRNAs, and piRNAs. The presence of different species of small RNAs in biofluids, besides miRNAs, opens the door to explore them as potential biomarkers of disease or as mediators of lactation health effects.