Discovery of Novel MicroRNAs in Female Reproductive Tract Using Next Generation Sequencing

MicroRNAs (miRNAs) are small non-coding RNAs that mediate post-transcriptional gene silencing. Over 700 human miRNAs have currently been identified, many of which are mutated or de-regulated in diseases. Here we report the identification of novel miRNAs through deep sequencing the small RNAome (<30 nt) of over 100 tissues or cell lines derived from human female reproductive organs in both normal and disease states. These specimens include ovarian epithelium and ovarian cancer, endometrium and endometriomas, and uterine myometrium and uterine smooth muscle tumors. Sequence reads not aligning with known miRNAs were each mapped to the genome to extract flanking sequences. These extended sequence regions were folded in silico to identify RNA hairpins. Sequences demonstrating the ability to form a stem loop structure with low minimum free energy (<−25 kcal) and predicted Drosha and Dicer cut sites yielding a mature miRNA sequence matching the actual sequence were considered putative novel miRNAs. Additional confidence was achieved when putative novel hairpins assembled a collection of sequences highly similar to the putative mature miRNA but with heterogeneous 3′-ends. A confirmed novel miRNA fulfilled these criteria and had its “star” sequence in our collection. We found 7 distinct confirmed novel miRNAs, and 51 additional novel miRNAs that represented highly confident predictions but without detectable star sequences. Our novel miRNAs were detectable in multiple samples, but expressed at low levels and not specific to any one tissue or cell type. To date, this study represents the largest set of samples analyzed together to identify novel miRNAs.


Introduction
MicroRNAs (miRNAs) are short (,22-nucleotide), singlestranded, non-coding RNAs that modulate gene expression. Through their binding to the 39-UTR (untranslated region) of target mRNAs, miRNAs trigger either the degradation of the mRNA transcript or the inhibition of protein translation. miRNAs are initially transcribed as primary microRNAs (pri-miRNAs) and then undergo two processing steps. The first step is the generation, within the nucleus, of stem-loop precursors (pre-miRNAs ,70 nt) by the enzyme Drosha. In the second step, the pre-miRNAs are exported into the cytoplasm and processed by the enzyme Dicer into a double stranded RNA duplex with two nucleotide 39-overhangs, subsequently releasing the 17-25 nucleotide long mature miRNA. miRNAs are essential for normal mammalian development and help regulate genes and processes involved in cell growth, differentiation, and apoptosis [1]. Alterations in miRNA expression have been observed in a variety of human cancers [2,3,4] including ovarian cancer [5,6,7]. miRNAs also appear deregulated in other diseases affecting organs of the reproductive system such as uterine leiomyomas and endometriosis [8,9,10].
The number of miRNAs confidently identified in the human genome is currently over 700 (703 in miRBase v13.0). Though the actual number of miRNAs is not known, some in silico studies suggest as many as tens of thousands of miRNAs exist [11]. miRNAs have been traditionally discovered using experimental approaches such as cloning and Sanger sequencing [12]. However, the recent introduction of deep sequencing technology, enabling the simultaneous sequencing of up to millions of DNA or RNA molecules, has provided another option for the discovery of novel miRNAs that may have eluded previous efforts [13]. Previous studies using computational methods combined with high throughput experimental data-such as deep sequencing or tiling expression arrays-have successfully identified novel miRNAs [14,15,16,17].
To date, we have exhaustively sequenced the small RNAome of over 100 human samples derived from various organs of the female reproductive system in both diseased and normal states, including ovarian samples (both normal epithelium and ovarian cancer), endometrial samples (from both healthy non-endometriosis and endometriosis patients), and uterine samples (both normal myometrium and benign and malignant uterine tumors). Studies on the functional roles of known miRNAs in the diseased states of these various systems are currently ongoing and either have been [18,19] or will be described in other papers. However, the exceptional volume of sequence data generated from this work provided us a unique opportunity to mine for novel miRNAs that have eluded previous cloning and standard sequencing efforts. In the present study, we focused on novel miRNA discovery and have confidently identified both mature and star sequences for 7 previously unknown miRNAs using our deep sequencing data. We have also identified nearly 100 additional putative novel miRNAs with mid-to-high confidence which await additional confirmation.

Results
Using Next Generation Sequencing, we dissected the small RNAome of 103 human specimens obtained or derived from various tissues from female reproductive system-related organs. Samples profiled are listed in Table 1. These specimens included short-term cultures of normal ovarian surface epithelium (NOSE), primary ovarian cancers and established cell lines (including serous, clear cell, and endometrioid histotypes), endometrium from normal and endometriosis patients, cyst wall from endometriomas, and uterine myometrium and uterine smooth muscle tumors (leiomyomas and leiomyosarcomas subtypes). Our sequencing efforts uncovered a large number of small RNA sequences that were unlike any known human miRNAs. We hypothesized that some of these unique sequences were novel human miRNAs and further examined them using a computational method for miRNA discovery that was developed by our group to integrate several published criteria for miRNA prediction [13,20,21].
A schematic for our sequence analysis workflow is shown in Figure 1. In all, there were over 300 million valid sequence reads obtained from the 103 samples (''valid'' sequence reads passing our quality control filters described in Methods), of which about 216 million mapped to known mature miRNAs, and an additional 15.6 million mapped to a known miRNA hairpin sequence (these sequences presumably representing the byproducts of miRNA processing). The remaining ,69 million sequence reads not aligning with a known miRNA precursor (representing potential novel miRNAs) were mapped to the entire human genome. Those reads that mapped exactly to the genome were used to extract 200bp of genomic sequence flanking either side and folded as RNA using the Vienna RNA secondary structure prediction and comparison package [22]. The resulting putative novel hairpins were then filtered for single stem loop hairpins with the putative mature miRNA read sequence mapping to one side of the stem and satisfying the Ambros criteria for hairpin structures [20]. Valid novel miRNA hairpins whose putative mature miRNA sequence was identified in at least one of our samples were considered to represent putative novel microRNAs. In all, we identified 14,731 sequence reads representing 132 distinct putative novel miRNAs by virtue of their ability to form miRNA-like single hairpins. Here, we used the naming convention ''hsa-bcm-miR-X'' for each novel miRNAs, where ''bcm'' symbolizes ''Baylor College of Medicine'' and ''X'' is a unique identifier. Of the 132 novel miRNA, 20 mapped to multiple genomic loci; in this case, we used the mirBase convention of ''-1,-2,-3, etc.'' endings to the miRNA root name (e.g. ''hsa-bcm-miR-15-1'', ''hsa-bcm-miR-15-2'', and ''hsa-bcm-miR-15-3'').
In addition to a name referring to the miRNA mature sequence, each novel miRNA genomic mapping received a unique hairpin identifier referring to the associated predicted hairpin structure. We used our putative novel hairpin sequences as precursors for alignment of all reads that did not map to a known miRNA precursor. In the next step of our analysis, we screened for putative novel hairpins that captured a collection of sequences (strong signal) mapping to a specific region of 15-25 nt within the reference hairpin on one side of the stem. Scattered sequence mappings across the full length of hairpin, a strong signal in a limited region that mapped to the loop, and hairpins mapping to known tRNA or ribosomal RNA regions were rejected. The strong signal was expected to contain short 17-25 nt sequences that exhibited a stable 59-end and significant length heterogeneity at the 39-end. The sequence with the highest copy number was considered the putative novel mature miRNA. The other related sequences were hypothesized to have been generated through 'imperfect' Dicer processing events reported by Morin et al. [23] and Reid et al. [24]. Additionally, predicted Drosha and Dicer cut sites must have been able to yield a mature miRNA sequence that matched to the actual miRNA sequence read. If a putative novel miRNA fulfilled all of the above, we denoted the novel miRNA as representing a ''high confidence'' prediction. In all, we identified 5,773 sequence reads representing 58 distinct high confidence, novel miRNAs.
For confirmation of these high confidence novel miRNAs, we needed to detect the star sequence in addition to the mature miRNA sequence in our sequence collection (though presumably at a much lower frequency since star sequences are degraded and usually occur at significantly lower levels). We defined the potential star sequence as the sequence base pairing to the potential mature sequence on the novel hairpin, correcting for the 2-nt 39 overhangs which are known to be typical for Dicer processing [14,23]. We denoted high-confidence novel miRNAs as ''confirmed'' if the star sequence was independently detected along with the mature sequence, as the novel miRNA sequence in question was then shown to fulfill all of the criteria that could be applied to known miRNAs. In all, we identified 1334 sequence reads representing seven distinct confirmed novel miRNAs. The complete list of confirmed and/or high confidence novel miRNAs is provided in Table 2. The inability of our sequencing efforts to discover the star sequence for the 51 high confidence (but not ''confirmed'') candidates does not necessarily eliminate them as representing novel miRNAs. Star sequences (passenger strands) are usually degraded after Dicer processing of the hairpin and therefore are present in much lower abundance than the active miRNA (guide strand). In particular, when the sequence copy numbers of a mature miRNA are relatively low, an even lower abundance of the miRNA star form may make it undetectable. In fact, for many of the low abundance known miRNAs that we identified (including hsa-miR-1301, hsa-miR-183, hsa-miR-203, and hsa-miR-371), we were unable to detect their representative star sequence (data not shown). For the seven novel miRNAs confirmed by virtue of star sequence identification, the predicted secondary structures of the precursor hairpins are represented in Figure 2. For most of these hairpins (with the possible exception of the hairpin formed by ''hsa-bcm-miR-8''), the small 39 overhangs, which are known to be typical for Dicer processing, were evident.
We also identified an additional 46 putative novel miRNAs (representing 2,715 sequence reads) that were flagged as ''potential,'' pending additional confirmation. These candidates had weak predicted Dicer and Drosha processing sites in the predicted hairpin structure, yet at the same time, these structures did show stable 59 ends and variable 39-ends. One of these potential candidates (''hsa-bcm-miR-49'') fulfilled all the criteria for the candidates at the high confidence level, and also mapped to a known snRNA. Since snoRNA ACA45 was found to be processed into a miRNA [25], hsa-bcm-miR-49 might represent another snRNA that can be processed into a miRNA. The complete set of confirmed, high confidence, and potential novel miRNA candidates, as well as the rest of the candidates out of the original list of 132 putative novel miRNAs that failed our additional stringency criteria, are provided as Supporting Data File S1. This file includes the information on the samples in which each miRNA was identified. The hairpin structure predictions for each of the 132 putative novel miRNAs is provided as Supporting Data File S2. Figure 3A shows the relative abundance levels of both our confirmed and high confidence novel miRNAs across the samples profiled. When compared to the set of known miRNAs from our samples, our novel miRNAs were at much lower abundance levels. Even the most abundant novel miRNAs were at levels lower than most of the known miRNAs and several logs lower than what was observed for the let-7 family. Most of our confirmed and high confidence novel miRNAs were detected in multiple samples with various sequence copy numbers ( Figure 3B as well as Table 2). Over 60% of the putative miRNAs were detected in more than a single sample. The novel miRNAs did not appear to be preferentially expressed in one reproductive tissue type versus another ( Figure 3B). Many miRNAs were detectable in multiple reproductive tissue types. Statistically, we were not able to identify novel miRNAs as being specific to a diseased tissue as compared to the corresponding normal tissue type. Using an alternative method from deep sequencing, we were able to confirm expression by qPCR of one ''high confidence'' novel hsa-bcm-miR-15 ( Figure 3C), which was the seventh most abundant novel miRNA in our list (and the most abundant miRNA detected in cell lines for which we had RNA).

Discussion
The discovery of miRNAs has revealed a previously unanticipated layer of regulation that integrates the transcriptome (the complete set of RNAs expressed in a cell) with the proteome (complete set of proteins expressed). The ,700 miRNAs uncovered from the human genome so far (as cataloged in miRBase) are predicted to target and repress over 60% of the protein coding genes [26]. Genome-wide miRNA predictions estimate that the true number of miRNAs may be anywhere from 10-100 times more than the current numbers [11]. Indeed, largescale cloning and deep sequencing efforts in the last few years have led to a doubling of the number of human miRNAs from ,470 (miRBase v8) to ,700 (miRBase v13) and the miRNA targeted transcriptomes from .30% to .60% protein coding genes [26,27], though the number of discovered miRNAs still falls short of the upper predicted limits of tens of thousands.
We used Next Generation Sequencing (NGS) technology to identify novel miRNAs in the female reproductive tract of women. In addition to our 7 confirmed novel miRNAs, many of the 51 high confidence novel miRNAs could also be eventually confirmed. The only feature that distinguishes a high confidence novel miRNA from a confirmed miRNA by our criteria is experimental evidence for both the miRNA and its star sequence within the collection of NGS sequences from a given samples. The novel miRNAs that we discovered occur generally in low abundance (,1000 copies in 2-10 million sequences per sequencing lane for a sample). It is therefore, highly likely that deeper sequencing of the tissues we sampled or other new tissues where they may occur in greater abundance may reveal a miRNA star sequence for the majority of high confidence miRNAs in which case they can be converted to confirmed status.
Previous deep sequencing efforts to identify novel miRNAs have typically relied upon data from a single human sample [14,16,17,23] while one recent study used tiling array data from 14 human cell lines [15]. To date, this study represents the largest set of samples analyzed together by deep sequencing to identify novel miRNAs. The large collection of tissue and cell line samples used in this study allowed us to capture a large number of novel miRNA candidates detectable in some samples but not others. In fact, only one novel candidate (''hsa-bcm-miR-265'') was detected in more than a third of the samples, with about 40% of the 141 putative miRNAs being detected in only a single sample. Arguably, our approach has allowed us to uncover many more novel miRNAs than what might be uncovered in a single sample. For instance, the recent miRDeep study [14] uncovered 10 novel miRNA candidates in the HeLa cell line, while we uncovered 104 potential candidates (seven confirmed, 51 high confidence, and 46 potential) across all of our 103 samples. The novel miRNAs that we identified were typically low abundance as compared to most of the known miRNAs, explaining how they might have eluded previous efforts of detection.
Since sequencing is becoming faster, cheaper, and deeper, we anticipate the number of miRNAs to possibly approach the numbers estimated by some whole genome prediction algorithms. If this happens, it is possible that the miRNA-regulated transcriptomes include most of the genes in our genome, making them widespread agents of gene silencing. From the data that we have generated, it is clear that the most abundant miRNAs in mammalian systems have likely been found. It is possible that the lower abundance novel miRNAs uncovered here are present in higher abundance in as yet unanalyzed tissues or cell types. It is also possible that there are major miRNAs (i.e., the majority of known miRNAs) that have a strong influence on gene silencing and minor miRNAs (i.e., the majority of novel miRNAs) that have a subtle effect on fine tuning gene expression. In any event, only when the complete miRNAome is determined can we understand the full impact of miRNAs on gene expression on a genome-wide scale. Our novel miRNA detection study has added significantly to the work to complete the miRNAome.

Ethics Statement
Patients undergoing elective gynecologic surgery at Ben Taub General Hospital or St. Luke's Episcopal Hospital were approached prior to surgery for participation. After written informed consent, patients underwent scheduled surgical procedures. All tissues collected for this study were collected under Baylor College of Medicine Institutional Review Board (IRB) approval and IRB approval from each individual hospital.

Small RNA Mapping
Sequence reads with Solexa 39 adapter (the read length being 36 nt) were picked for miRNA mapping (the same adaptor being used for each sequencing run). Each sequence read was passed through a number of quality control filters. Reads which did not pass the Illumina chastity and no-calls filter were removed. Reads with copy number less than 4, length less than 10 nt, or more than 10 consecutive, repetitive nucleotides were removed. Reads matching the E. coli genome were removed using WU BLAST [28]. The remaining reads (which we termed ''valid'' sequence reads) were compared with known mature miRNA hairpins (miRBase 13.0) [29]. Reads were mapped as either exact match or loose match (loose match only for reads without an exact match). For loose match (which would account for miRNAs being subjected to non-templated nucleotide changes and RNA editing [24]), up to three mismatches were allowed in a single alignment (by our experience, allowing four mismatches yielded diminishing returns decreasing the cost benefit of engaging all sequence reads that pertain to a specific miRNA versus the increase in alignment times); for loose matches, we used a custom Smith-Waterman local alignment algorithm, where our gap penalty was 23, match score was 2, and mismatch penalty was 21, with a cutoff of 1.46. The reads that did not align to any known miRNA were passed to our novel miRNA discovery platform as described below.

Novel miRNA Discovery
Our basic approach for novel miRNA discovery has been described previously [13]. Briefly, each sequence which passed our quality control filters was first mapped on the reference genome sequence (hg18) plus 200 bases of flanking the sequence on either side were extracted to find the putative hairpin. This extracted sequence was then folded using RNAfold of the Vienna RNA folding package [22], in order to determine secondary RNA structure. After folding of the long (,425 nt) sequence (and confirmation that it contains a miRNA-like hairpin with the sequenced tag in the proper place) the putative precursor sequence was trimmed down to include only the hairpin bases (60-150 nt) and refolded to confirm that the structure is maintained in both the larger and shorter contexts. To determine if a structure forms a plausible miRNA hairpin, we applied the three Ambros criteria [20]: 1) the mature putative miRNA sequence must rest on one side of a single hairpin; 2) the putative miRNA sequence must bind relatively tightly (i.e. at least 16 bound bases in the first 22 or fewer bases of the miRNA) within the hairpin stem containing no large or energetically unfavorable loops (i.e. a single loop with more than 6 bases in the arm of the hairpin); and 3) the putative hairpin must have a miRNA-appropriate energy (free energy below 225 kcal/mol). We recognized small RNA sequences as representing putative novel miRNA if all the above were met.
We carefully curated the set of putative novel miRNAs and divided them into four different categories: ''not likely,'' ''potential,'' ''high confidence,'' and ''confirmed.'' Candidates were flagged as ''not likely'' if any of the following was determined: the mature sequence did not map clearly within a specific region of 15-25 nt of the predicted hairpin (e.g. were scattered evenly across the full length of the hairpin), or fell within the hairpin loop, or mapped to known tRNAs or rRNAs. (Though the minimum length of a miRNA is thought to be 17 nt, we used a lower limit of 15 for the mapping region, to account for the fact that microRNAs exhibit significant length heterogeneity at the 39 end and to capture all isomiRNAs [23,24].) Candidates were categorized as ''high confidence'' if they passed all of the above criteria, and in addition formed a hairpin with predicted Drosha and Dicer cut sites that were able to yield a mature miRNA sequence matching the actual miRNA sequence read, as well as not mapping to known snoRNAs or snRNAs. Candidates were categorized as ''potential'' if they had weak predicted processing sites in the hairpin, yet the hairpin showed a stable 59 end (candidates representing snoRNAs and snRNAs that otherwise fulfilled the ''high confidence'' criteria were also categorized as ''potential''). Candidates were categorized as ''confirmed'' if they met all of the criteria of the ''high confidence'' candidates, and in addition to both having a predicted star sequence that was detected in our deep sequencing pool and forming a hairpin with a stable 59 end and a variable 39 end.
Reverse Transcription (RT) of Mature MicroRNA from Total RNA Total RNA isolated from OVTOKO and HEK293 cell lines was reverse transcribed using the TaqManH MicroRNA Reverse Transcription Kit from Applied Biosystems (Part Number 4366596) following the manufacturers suggested protocol and specific RT stem-loop primers for mature novel_hsa-bcm-miR-15 sequence and for U6 for an internal control.

qPCR of cDNA Products
PCR products were amplified from each cDNA sample using the TaqMan MicroRNA Assay together with the TaqManH Universal PCR Master Mix. Following the manufacturer's recommended reaction conditions, using the Applied Biosystems Veriti system. After 40 cycles of the recommended cycle conditions, data was collected from the machine and analyzed using the Applied Biosystem qPCR software. Using the comparative ggCT method, we used the HEK293 as the reference sample for OVTOKO levels of expression and small RNA U6 as the endogenous control to normalize the expression levels of the novel_ hsa-bcm-miR-15 target.

Supporting Information
Supporting Data File S1 The set of 132 putative novel miRNAs with sequence copy counts for each sample.