Analysis of Antisense Expression by Whole Genome Tiling Microarrays and siRNAs Suggests Mis-Annotation of Arabidopsis Orphan Protein-Coding Genes

Background MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20–22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. Principal Findings We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis ‘orphan’ hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the “ancient” (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for “new” rapidly-evolving MIRNA genes. Conclusions Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other kingdoms, which can provide insight into antisense transcription, miRNA evolution, and post-transcriptional gene regulation.

Computational and experimental efforts in plants have explored long non-coding RNAs (ncRNA), RNA species with limited or no capacity to encode proteins [16]. Teramoto et al. first identified a CR20 gene repressed by cytokinin, stress and/or developmental conditions in cucumber and a homolog in Arabidopsis that encodes no long ORFs [17]. A tyrosine kinase-like gene was found to have an antisense transcript, ATH132404, which does not appear to encode any protein [18]. Other ncRNAs were discovered by DNA library screening such as ZCF83, RXF6, and RXW18, in which ZCF83 is antisense to a helix-loop-helix gene [19]. In 1999 and 2000, At4 and its homolog, AtIPS1, from IPS1/ Mt4 family were shown to be induced by phosphate (P i ) deprivation [20,21]. AtIPS1 inhibits the activity of P i starvationinduced miR399 by a mechanism termed 'target mimicry' of base pairing without RISC cleavage [22]. Computational searches and experimental validation of expressed sequence tags (EST) have been the main focus for discovery of ncRNAs, from which hundreds of sequences have been identified [23][24][25][26][27][28][29]. Interestingly, miR162a and miR869 primary transcripts were originally described as ncRNAs, demonstrating the efficacy of finding miRNA-like sequences by characterizing ncRNAs.
Antisense transcription is a pervasive but poorly understood phenomenon associated with RNA interference and miRNAs in plants and animals. The function of miRNAs, their relationship to antisense transcripts, their subcellular pools, and the precise mechanisms by which these processes suppress gene expression remain elusive and controversial [30][31][32][33]. A homologue (ARS2) of the Arabidopsis zinc finger-containing protein SERRATE that functions in pre-mRNA splicing and miRNA processing has recently been shown to be a component of the nuclear RNA capbinding complex in mice and to mediate both antiviral defense and developmental patterning in Drosophila [34,35], establishing that RNAi-and miRNA-dependent processes are deeply conserved between plants and animals. Prokaryotes and simple eukaryotes have ncRNAs and antisense transcripts [36][37][38][39], but ncRNAs increasingly dominate the genomes of multicellular organisms as their complexity increases, in contrast to proteincoding genes [40][41][42][43], providing a plausible explanation for the ''C-value paradox.'' It is estimated that 40% of all transcription units in human and mouse genomes exist in cis-antisense coexpressed pairs [44,45] and there is correlative evidence for a regulatory function of antisense in animals and plants [46,47]. The 59UTRs and first exons of genes with overlapping antisense transcripts are significantly longer than the genomic average, and a similar size distribution is observed for genes silenced by CpG island methylation in human cancer, supporting a role for antisense transcripts in regulation [48]. Recent results show that human genes are regulated transcriptionally by promoterassociated and terminator-associated antisense RNAs [49][50][51][52][53]. Studies of plant development and environmental stress responses have converged on the roles of ncRNAs and their metabolism as primary regulators of gene action, but it is still under debate to what extent those antisense transcripts are associated with siRNAs that couple exogenous signals to gene regulation [46,[54][55][56][57][58][59][60][61][62][63][64]. Understanding the origins of antisense ncRNAs may lead to new insights into fundamental processes such as tissue-specific and developmental gene regulation, chromatin dynamics, dsRNA biogenesis and processing, and genome evolution.
Plant miRNAs have high levels of complementarity to their target mRNAs, which greatly facilitates homology-based computational methods for MIRNA gene and target discovery in plants [65][66][67][68][69][70][71][72][73][74][75][76]. Nonetheless many recently discovered miRNAs and miRNA-associated smRNAs were instead uncovered functionally by deep sequencing of smRNA libraries [77][78][79][80][81]. We hypothesize that antisense transcription detected in plant whole genome tiling array transcriptome [82][83][84] and deeply sequenced smRNA datasets [85,86] can serve as a biomarker to discover miRNAs and ncRNAs. Here we characterize MIRNA gene transitivity (antisense siRNAs mapping to miRNA hairpins) for Arabidopsis, C. elegans and rice and the topology (exon-intron signal-to-noise ratios) of strand-specific signals for annotated protein-coding genes. We identified several Arabidopsis MIRNA gene homologues and hundreds of potentially mis-annotated ncRNAs mapping to the antisense strand of annotated 'unknown' orphan proteincoding genes. A Support Vector Machine (SVM) was employed to analyze the importance of smRNA abundance features and sense/ antisense topology as predictors of miRNA target sites and MIRNA genes. Our results suggest the utility of modeling whole genome tiling array transcriptome datasets for gene discovery and genome annotation.

Materials and Methods
Perl [87] was used to extract, examine and manipulate data; scripts are available upon request. Before analysis of microarray data, those C-rich probes found previously to be affected by a sample amplification artifact [88] were removed. Other key programs were Unafold [89], SOAP (Short Oligonucleotide Alignment Program) [90], BLAST (Basic Local Alignment Search Tool) [91], and SVM for Matlab [92]. Unafold is based on dynamic programming principles of over-calculating the solution of thermodynamic free energy as a quicker solution than absolute calculation. Unafold's web-based portal is located at http:// dinamelt.bioinfo.rpi.edu/quikfold.php. SOAP maps short probes from whole genome tiling arrays onto large databases of genomic DNA sequence more quickly than BLAST, but has a tradeoff of accuracy for speed. SOAP is not designed to work in conjunction with the Massively Parallel Signature Sequence (MPSS) smRNA databases [85], which is mostly comprised of 17 nt sequences where SOAP only performs at 50% accuracy. SVM for Matlab (MathWorks, Natick, Massachusetts; http://www.mathworks. com/products/matlab/) creates a prediction model to discriminate between training sets using supervised learning methods that require a set of features and a class label. Support vector machine (SVM) is an algorithm that learns by example to assign labels to samples [93]. SVMs have been successfully applied in various biological problem domains, particularly classification problems. Its popularity is due to its high generalization performance, sound mathematical foundation, and ease of use. The classification can be done on binary class data or multi-class data. Our dataset is of binary class. That is, the class labels are either +1 (target genes), or 21 (paralogs). SVM learns a prediction model from training samples. The model is used to predict an unseen sample's class label. The prediction model can be a decision line in twodimensional data, a plane in three-dimensional data, or a hyperplane in higher-dimensional data. If a sample, or equivalently a point, lies above or on the decision plane it is predicted as '+1' (target gene), or otherwise as '21' (paralog). SVM constructs a decision plane which lies furthest from the samples of both classes. That is why SVM classifier is called the maximum-margin hyperplane and this is the most distinguishing characteristic compared to other classification algorithms.
There are three main tasks in the workflow of establishing a classification model using an SVM that could have the ability to predict miRNA target binding sites: 1) identify useful attributes (features) for prediction and encode them into a dataset; 2) learn an SVM classification model; and 3) evaluate its performance. We identified and used three attributes to encode a candidate miRNA target gene: RNA transcript abundance from whole genome tiling microarrays; novel smRNA counts from deep sequencing of smRNA libraries; and free energy of binding of miRNAs to their cognate target genes.

Expression data
The expression data is a novel feature for our classification system. We explored the antisense transcripts in relation to the miRNA targets by extracting the expression data for Arabidopsis miRNA target genes from existing high-resolution (25-36 b.p. probe size) whole-genome tiling micro-array datasets [82,83]. All microarray data is MIAME compliant and has been extracted from NCBI Gene Expression Omnibus raw data series GSE601, 605, 636-639, and 2247. Graphical representation of the sense and antisense tiling array signals for all Arabidopsis and rice genomic sequences and AGI annotations can be viewed at www. systemix.org. The expression values for the sequences relative to the location in the genome were retrieved from the expression files using mapping coordinates built from the Arabidopsis Small RNA Project (ASRP) and Sanger mirBase databases [94,95]. The results are compiled from five different biological samples, three of which correspond to different parts of the plant (flower, leaves, roots), and the other two from independent suspension-cultured dedifferentiated callus lines. Expression data corresponded to 800 base pairs (b.p.) surrounding the miRNA binding site on the miRNA target gene or MIRNA gene. These values were then normalized and were added at each position on all target genes to assess average effects.
The control (null) set for the miRNA target genes was created using homology-based bioinformatics searches to identify appropriate paralog ''pseudo-target'' genes from Arabidopsis. Paralogs are evolutionarily related to a miRNA target gene because of their high sequence homology except for the apparent lack of a remnant miRNA binding site. Paralogs were determined for the cognate genes by BLAST searches and manual inspection and assignment of ''mock'' miRNA binding site coordinates, and a file containing theoretical ''binding site'' locations was built. Using a similar approach as for the validated miRNA target sites, expression datasets were created with Perl scripts.
The tiling expression data [14] for the test sets (86 validated Arabidopsis miRNA targets representing 25 of 27 miRNA families; all MIRNA genes) and control (125 paralogs representing 16 families; Datafile S2) were normalized such that summation of expression values for a sample becomes 1 before they were applied to SVM. Expression signals at the locations ranging from 800 b.p. downstream and upstream relative to the miRNA binding site in target genes or miRNA* for MIRNA genes was stored into its own feature number at 25 b.p. resolution for the sense and antisense strand, resulting in 130 individual features for the general feature of expression levels.

smRNA Counts
The second expression feature implemented for the SVM was smRNA counts mapping to the gene, based on previous results showing a statistically significant association of smRNAs with miRNA target genes [14]. In our analysis, the smRNA feature represented the number of expressed distinct signatures and their normalized abundance (transcripts per quarter-million reads, TPQ) obtained from MPSS and deep pyrosequencing datasets of different tissues and genotypes affected in smRNA metabolism, a conservative and quantitative method [85]. The list of potential miRNA target genes and MIRNA genes was processed by a bulk query of the MPSS [85] web portal (http://mpss.udel.edu). The normalized (TPQ) data was summed for unique reads (found only once in the genome) from multiple libraries. This treatment allows gene-by-gene comparisons of smRNA abundances. The MPSS dataset comprised four independent samples (FLR, inflorescence; RDR, rdr2 mutant; two seedling libraries SD1 and SD2) that were used as separate features. The final feature was more qualitative than MPSS-based features: the sum of all normalized (TPQ) nextgeneration (454 pyrosequencing) smRNA datasets compiled from several different groups [78,79,94,96,97]. All Arabidopsis MIRNA genes that were listed as validated in ver. 11.0 of mirBase [95] were analyzed. The smRNA features were incorporated along side the expression levels. If there were no smRNA counts associated with a gene, the features were set to zeroes.
Antisense smRNAs with and without 59 triphosphate moieties cloned from C. elegans somatic tissues [98] were BLASTed against mirBase stem-loops to map their topologies relative to mature miRNAs, miRNA*, and the hairpin loop.

Thermodynamic Energy of Binding
The established method of quantifying miRNA complementarity to the target gene is accounted for in the thermodynamic energy of binding feature [99]. Target gene sequences were extracted and connected by a series of seven uridines to the matching miRNA sequence in order to create a ''pseudo-hairpin'' for reproducible folding. This string was analyzed by batch query of the UNAfold algorithm [89]. The ratio of the calculated free energy of miRNA: target to the free energy if the miRNA had perfect complementarity to the target gene (percent minimum free energy) was the final feature.

Building a classification model
After building the dataset with the features consisting of expression data, smRNA counts and energy values for each of the possible target genes and the paralogs, ten-fold cross validation analysis was performed. The SVM was supplied training values for genes, actual target genes were given a positive class label (+1) and paralogs were given a negative class label (21). The SVM developed a model based on the labels relative to the features and created a discriminant method using a linear kernel with default parameters to predict plausible target genes from paralogs.

Results
Evidence from smRNA and whole genome tiling array datasets for miRNA-associated transitivity In order to interpret antisense transcripts more broadly in a functional context, the datasets were qualified by characterization of signals associated with miRNAs [14] and ribosomal genes, the latter which serve as controls by virtue of being deeply conserved and highly expressed (see Supplemental Text File S1). Availability of whole genome tiling array and smRNA datasets for rice [84,86] allow us to test the hypothesis that transitivity associated with miRNAs (antisense transcription leading to production of siRNAs that flank miRNA target sites or MIRNA loci) is broadly conserved in plants. Because there are few validated miRNA target genes in rice [100][101][102], we mapped and quantified unique rice smRNAs from deep sequencing datasets [86] to MIRNA hairpins available in miRBase [95] as a function of relative position to the mature miRNA and miRNA* and compared the topology to that of Arabidopsis siRNAs mapping to MIRNA hairpins ( Figure 1; Datafile S1). Similar to our previous results [14] in Arabidopsis, abundant rice sense and antisense smRNAs were found for MIRNA genes ( Figure 1B) with an apparent bias for 59 upstream (relative to the sense strand) of the miRNA*. These data suggest activity of the miRNA (or miRNA*) binding to miRNA* (or miRNA) sites which triggers transitivity (spreading of siRNAs) in both directions on both strands. Recent results from C. elegans [98] have shown that two classes (26 n.t. 59-monophosphate and 22 n.t. 59-triphosphate species) of antisense siRNAs are produced against many transcripts in a two-step amplification by RNAdependent RNA polymerases RRF-3 and RRF-1, respectively, in conjunction with DICER, Argonautes and other specificity determinants. We mapped 59-monophosphate (primary) and 59triphosphate (secondary) antisense siRNAs to C. elegans miRNA hairpins and show in Supplemental Figure S1 (Datafile S6) that the most common primary and secondary siRNA map positions are the miRNA* and loop positions, respectively, similar to the observed topology of rice and Arabidopsis siRNAs in MIRNA genes ( Figure 1).
We further mapped Arabidopsis whole genome tiling array sense and antisense transcript signals to 93 ''ancient'' MIRNA genes (those with at least one homolog in other distant plant species (27 families) and compared average normalized signal topology with 68 recently evolved ''new'' MIRNA genes (64 families) [77][78][79] by adding signals at each position of the data ( Figure 2; Datafile S2). Ancient MIRNA genes had more abundant transcript signals on both sense and antisense strands, especially on the region of 200 n.t. upstream and downstream (relative to sense strand) of the miRNA* position (normalized expression . = 2.0, Figure 2A, arrows), whereas ''new'' MIRNA transcripts are not clearly evident above noise except for a peak signal precisely at the miRNA* position (normalized expression = ,1.2, Figure 2B, arrow). It is apparent that the ancient MIRNA genes have a 'pingpong-like' expression topology (downstream sense, upstream antisense; Figure 2A, arrows) similar to that previously described for miRNA target mRNAs [14]. In order to extend the analysis to rice tiling array data, we analyzed whole genome tiling array signals for Arabidopsis and rice that had perfect matches to mature miRNAs, miRNAs*, siRNAs (17 nt reads from MPSS data [86]), and to probes mapping to other regions of the cognate hairpin.
The results are shown in Table 1. Relative to the previously established signal cutoff of log 2 .0.73 based on background signals from probes for both strands of promoters of ,4,600 verified Arabidopsis genes [82], it is apparent that Arabidopsis MIRNA hairpin expression was low for most probes. Consistent with Figure 2A (upstream of miRNA* site), there was significantly more sense and antisense signals associated with miRNAs than elsewhere in the hairpins (Table 1; Datafile S3). For rice whole tiling array data there was higher signal associated with sense strand of miRNAs and antisense strand of miRNA* (Table 1), consistent with Arabidopsis data (Figure 2A), but the differences compared to other regions of the hairpin were not statistically significant and the rice tiling array data were not considered further. Supplemental Figures S2, S3, Supplemental Table S1, and Datafiles S4 and S5 document the quality of tiling array data by analyzing signal to noise ratios of ribosomal genes.

Assessment of Arabidopsis whole genome tiling array antisense signals
To characterize the quality of sense and antisense whole genome tiling array transcript signals for Arabidopsis, we endeavored to better understand the 'topology' of gene tiling array signals by calculating genome-wide the average exonic and intronic signal strengths for tiling array probes mapping uniquely to the sense strand of all 27,344 annotated protein-coding genes in TAIR9 and 50,090 rice genes (TIGR6.1, http://rice.plantbiology. msu.edu/, see Datafiles S4 and S5) that had tiling array [84] probe matches to exons. We divided average exon sense strand probe signals by the corresponding average antisense exonic signals to obtain a ratio of sense expression/antisense expression for each gene. We then sorted the genes (Datafiles S4 and S5) from lowest to highest based on this ratio, hypothesizing that highly expressed protein-coding genes (such as ribosomal protein genes) would be at the bottom of the ranking (sense expression .. antisense). Similarly we hypothesized that those validated miRNA target genes previously shown to produce abundant smRNAs and antisense transcripts in Arabidopsis [10,[12][13][14]103] would be towards the top of the ranked genes because they produce more antisense transcripts (that in turn spawn antisense smRNAs). Table 2 presents results of an analysis of Arabidopsis ribosomal genes ( Table 2, row 1) and miRNA target genes that produce abundant antisense siRNAs in Arabidopsis (row 2) as a function of ranking in the genome list for antisense exonic expression. As predicted from results of Supplemental Figure S2, there were ten times as many Arabidopsis ribosomal protein-coding genes in the bottom half of the antisense expression-ranked genome as the top half (Table 2; Datafile S4), but only 2.7 times as many rice , with miRNA* position as ''0''. Sense strand is colored red and antisense blue. Note the abundant antisense signals mapping at or upstream to miRNA* sites (small arrow), and downstream sense signals for ancient MIRNA genes (large arrowhead) similar to miRNA target genes previously described [14]. See Datafile S2 for details. doi:10.1371/journal.pone.0010710.g002 ribosomal protein-coding genes (Supplemental Table S1, row 1; Datafile S5). For those select miRNA target genes in Arabidopsis known to produce antisense smRNAs, there were 3.3 times as many genes in the top half of the antisense transcript abundance genome list, which was statistically significant. No such relationship was observed for the validated and predicted homologues in rice (Supplemental Table S1, row 2).
Arabidopsis antisense whole genome tiling array signals suggest some 'unknown' predicted proteins may be mis-annotated non-coding RNAs, including new MIRNA homologues Rice and Arabidopsis genomes have been recently re-annotated based on multiple gene-finding and annotation algorithms that attribute confidence scores to exons based on different types of experimental and computational evidence [104,105]. Taking advantage of the new features of the TAIR9 Arabidopsis genome release, we analyzed our genome-wide lists of Arabidopsis genes ranked as a function of relative antisense strand tiling array expression, focusing on annotated protein-coding genes defined as expression-confidence classes (star rankings) for Arabidopsis 'unknown' genes and 'unknown' expressed or 'hypothetical' genes (no expression data available) for rice. The results are summarized in Table 2 and Supplemental Table S1, respectively. For Arabidopsis it is evident that hundreds of 'unknown' genes with low expression confidence rankings (zero or one star) and those producing antisense smRNAs are significantly more abundant (about three-fold) in the top half of the genome ranked as a function of low sense/antisense exon signal ratio ( Table 2). For those unknown Arabidopsis genes with high confidence expression data (four or five stars), the abundance ranking based on antisense expression is actually fewer in the upper versus lower halves of the genome, consistent with results for highly expressed ribosomal protein-coding genes ( Table 2). The trend for more 'unknown' (including 'expressed') genes in the upper half of the rice genome ranked on antisense tiling array expression was barely discernable (Supplemental Table S1) and not statistically significant in the context of analogous Arabidopsis genes. We interpret this finding as consistent with the low quality antisense tiling array expression data for rice. A recent report described the antisense strand expression of some rice hypothetical genes [106].
The potential significance of results in Table 2 is that 'orphan' Arabidopsis genes predicted by gene-finding algorithms and having relatively abundant antisense strand expression may actually be ncRNA genes. We computationally tested this hypothesis five ways, and in addition found supporting expression evidence from ESTs (see below). The binomial distribution (upper vs. lower halves of the ranked transcriptome ratio of gene sense exon/antisense exon signals) of predicted unknown protein-coding genes that overlap recently published antisense ncRNAs [58] showed a similar pattern of two-to four-fold enrichment for TAIR9 expression confidence classes less than four stars ( Table 2, rows ''with as-TU''; Datafile S4). Two additional independent tests comprised the bionomial distributions for two exclusive sets comprising 1,044 predicted 'unknown' ORFs [27](Y. Xiao and C.D. Town, personal communication). Table 2 shows that for these two subsets of unknown genes (rows ''sORFs'' and ''qRT-PCR verified''), similar to the genome-wide general pattern and the pattern for ncRNAs mapping to unknown genes, there were significantly more genes in the upper half of the ranked sense/ antisense transcriptome for TAIR9 lower expression confidence rankings (star rating), especially for the qRT-PCR predictions (weighted mean of upper/lower ratio for zero-three star confidence = 3.9, data not shown). As predicted by the working hypothesis that antisense transcripts are processed into smRNAs, genes that produce antisense siRNAs [107] were over-represented in the upper binomial distribution for exon antisense signal abundance, with strong evidence for over-representation of unknown protein-coding genes with independent ncRNAs mapping to them [58] or encoding unknown qRT-PCR tested genes (Y. Xiao and C.D. Town, personal communication) compared to all protein-coding genes ( Table 2, compared bottom two rows).
A fourth computational test of the hypothesis was meta-analysis for congruence across the datasets, specifically whether 105 unknown genes recently predicted as small ORFs [27] or predicted and tested by conventional means (Y. Xiao and C.D. Town, personal communication) and having independent quantitative and qualitative evidence of strong antisense expression from tiling array experiments [58,82,83] clustered as a function of gene model quality. Results in Supplemental Figure S4 for 105 genes show a positive correlation (r = 0.61) between expression quality (TAIR9 star ranking) and measured abundance of antisense transcripts (sense exon/antisense exon signal ratio , unity), and an inverse correlation (r = 0.83) for expression quality rankings between recently predicted small unknown ORFs [27] and previously predicted unknown genes found by conventional algorithms (Y. Xiao and C.D. Town, personal communication). The latter class of genes is the subject of targeted expression studies by qRT-PCR and therefore is better represented in the two-to five-star expression rank classes.
A final computational test was to BLAST the Arabidopsis genome with known MIRNA hairpins to search for homologues, reasoning that some antisense transcripts may encode MIRNA homologues. Results are shown in Table 3 for candidate MIRNA gene homologues identified as mapping to the antisense strand of predicted protein coding genes and producing some antisense smRNAs. Consistent with our hypothesis, all of the MIRNA homologues were found in the upper half of the antisense expression-ranked genome list or on a previously described antisense non-coding RNA. For miRNA targets, vast majority of miR846 predicted and validated target genes (jacalin/lectin-like, which have extended homology to miR846 hairpin [78]) were also  : Small open reading frames (sORFs) were from [27]. f : Genes with antisense transcript units were from [58]. g : Genes with antisense transcripts verified by quantitative RT-PCR were from Y. Xiao and C.D. Town, personal communication. h : Unknown genes with different confidence ratings were from TAIR9 (http://www.arabidopsis.org). Zero rating means no expression data. One star rating means there is weak EST data, and/or another type of low quality functional evidence. Higher (2-5 star) rankings derive from qualitative meta-analysis of full-length cDNAs, proteomics, moncot and dicot cross-species sequence alignments, and genomic conservation. found in the top half of the list for higher expression of antisense transcription (Table 3). Three additional functional evidences supporting the results are that one predicted miRNA target, AT1G07650, which has a miR404 hairpin homologue on the antisense strand for its 39-UTR, is up-regulated in the miRNA metabolism mutants hst1-15 and hyl1-2 [94,97]. Another example is AT4G03050/AOP3, which has elevated expression in the hen1-1 miRNA methyltransferase mutant and produces unique smRNAs sequenced from immuno-precipitated AGO4 complexes [96]. Two hairpin-containing genes, At1G55045, At5g26262, and a 7SLlike signal recognition particle ncRNA (Ath-383) [29] on the sense strand of At2g31141 were uncovered by manual inspection of the predicted small ORF unknown genes for production of smRNAs predominantly from one strand ( Table 3). At1g55045 and At5g26262 antisense transcripts can fold into stable hairpins (Supplemental Figure S5A; data not shown), produce moderately abundant phased 23-24 n.t. antisense siRNAs (Supplemental Figure S5B; data not shown), and the genes are methylated by DNA maintenance and de novo establishment methyltransferases MET1 [108] and DRM1-2/DRM2-2/CMT3-11 [109] respectively. At5g26262 has significant homology (E,0.03) to a rice transposon ORSgTETNOOT00686 (http://plantrepeats. plantbiology.msu.edu/search.html) and seven other intergenic loci in the genome (data not shown). Because of the lack of a candidate miRNA/miRNA* duplex that maps to the At1g55045 hairpin, this foldback does not meet the criteria of a miRNA [110] but may be a case of an evolving or devolving MIRNA-or TAS-like locus subject to transitivity [78,111,112] and processive cleavage by DICER-LIKE3/4 complexes.
Sense/Antisense tiling array transcriptome topology and smRNA abundance as expression-based features of Support Vector Machines (SVM) for MIRNA gene/target prediction Machine learning algorithms for MIRNA gene and target site prediction utilize sequence complementarity as the primary feature, which is tractable in plants due to extensive homology between miRNAs and their targets but of limited use in animals [75]. Because plant and animal miRNA pathways share mechanisms and components with RNA interference and post-transcriptional gene silencing, we hypothesized that antisense expression-based topological features may be a useful predictor of miRNA targets and MIRNA genes. A molecular mechanism has been established [80,113,114] for the observed abundant downstream sense strand tiling array signal of miRNA target genes in which 39 exonuclease degradation of the upstream cleavage product by EXORIBONUCLEASE4 (XRN4) is postulated. However the mechanism resulting in upstream antisense signal for target genes [14] and MIRNA genes (Figure 2A) has yet to be elucidated. The downstream-sense/upstream antisense Arabidopsis whole genome tiling expression data for validated miRNA target genes versus non-target paralogs was adopted as a key expression feature, along with smRNA abundances and thermodynamic energy of binding to implement an SVM for prediction of miRNA targets and genes. The normalized tiling array expression signals from 800 base pairs upstream and downstream on the sense and antisense strand, respectively, of miRNA binding sites [14] was extracted. This resulted in 130 values associated with each of the genes in the training set. The next feature consisted of smRNA counts and resulted in five additional features for the SVM. Four of the features correspond to the MPSS data from four different Arabidopsis samples: flowers, RNA-dependent RNA Polymerase 2 (rdr2) mutant, and two seedling libraries [79]. The fifth smRNA feature corresponded to the sum of all normalized (TPQ) unique smRNA reads mapping to loci from many pyrosequencing experiments [78,79,94,96,97] (Datafile S2). The last feature was that of sequence complementarity represented mathematically as relative thermodynamic stability. The most stable combinations of miRNA and the target gene or MIRNA genes were normalized to percent minimum free energy which works well for plant miRNAs because plants possess near perfect complementarity between miRNA and target genes [65,115].
In machine learning such as SVM where the goal is to classify samples, the ''Gold Standard'' refers to a set of data that can be used to train the prediction model and to test predictions. Our dataset was based on validated miRNA targets including the previously documented cases of transitivity (i.e. PPR and AGO1 genes [10,13]). We used various measures of SVM classification performance to evaluate the individual features, assigning validated Arabidopsis target genes the value of unity and the negative control paralogs [14] the value of negative one. The dataset was then analyzed through tenfold cross validation. The various combinations of features were analyzed to evaluate the importance of each in identifying correctly the validated miRNA target genes.
Accuracy, Specificity, Sensitivity, and Precision. The accuracy was calculated for the ten-fold cross validation performed on the dataset. Table 4 lists the results for each feature alone and in combination for this statistic. The smRNA counts and tiling array expression topology alone and in combination were weakly predictive (,60-70%), but not robust compared to the biological standard of thermodynamic stability (97%; Table 4). These results suggest the expression features under study may be useful features, but additional specificity determinants must be identified to strengthen an SVM for miRNA target gene prediction based primarily on expression. Table 4 also reports further statistical analyses of SVM specificity, sensitivity and precision. Specificity refers to how well a classification test can identify the negative cases, namely the probability to classify a gene as 21 if the target gene is a paralog with no miRNA binding site. All three features performed well for specificity 96-100%). The sensitivity of SVM is an evaluation of the test to predict the targets (+1 class). smRNAs were a weakly sensitive feature, and expression topology was insensitive as a predictor of miRNA targets ( Table 4). The Positive Predictive Value, or the precision, addresses the evaluation of the machine. The number represents the probability that if the SVM predicted the gene to be a target, how likely is it a bona fide target gene. This test is the reverse of the previous two; the sensitivity and specificity test the machine in the respect of if the actual label is known, how likely is it to identify it correctly. smRNAs were a fairly good (87%) feature of SVM performance, but expression topology was not (Table 4).
Further SVM testing on MIRNA genes. MIRNA genes are transcribed by RNAPol II [7,8,116] and therefore polyadenylated MIRNA precursor gene transcripts should be detected in the whole Table 4. Accuracy, Sensitivity, Precision and Specificity of an expression-based Support Vector Machine for miRNA target gene prediction trained on 86 Arabidopsis miRNA target genes and 125 non-target paralogs. genome tiling microarray datasets and evidence ( Figure 2) supports this model. Analysis of smRNA abundances and map positions on the antisense strand of MIRNA genes (which are generated by transitive processes) established that MIRNA genes, viz. at the complementary ''miRNA*'' position of the foldback, produce phased antisense siRNAs in a process similar to the working model of miRNA target genes that produce siRNAs [14]( Figure 1A). Using the analogous features of the miRNA target genes and paralogs, the SVM was implemented on the miRNA* dataset in order to examine its utility for predicting MIRNA genes, since they are transcribed similarly and have complementarity at the miRNA* position to mature miRNAs (and thus homology to miRNA target genes). To facilitate SVM evaluation, the miRNA* were labeled as +1, assuming the sequences would exhibit properties of miRNA targets. Table 5 displays results of the SVM evaluations. Using normalized expression topology, energy of binding, and the sum of sense and antisense smRNA reads mapping to regions of the hairpin other than mature miRNA or miRNA*, the SVM produced results nearly as predictive (81% versus 97%) as the Gold Standard training set of miRNA target genes (compare Tables 4 and 5). The comparison of ancient versus new MIRNA gene predictions by the SVM was consistent with the expression topologies; the ancient miRNA*s display the ''pingpong'' topology ( Figure 2A) analogous to that seen in the miRNA target genes [14] and produced the better result (84%) from the SVM analysis (Table 5).

Discussion
Antisense transcripts detected by whole genome tiling arrays are real Our analyses [14] and those of others [117,118] establish by multiple independent criteria that Arabidopsis antisense transcripts are real and of biological significance. Results from different whole genome tiling array technologies and platforms have shown congruence (e.g. Supplemental Table S2) for many antisense transcripts [58,82,83]. We have shown that MIRNA genes from both Arabidopsis and rice produce antisense smRNAs 'spreading' from the miRNA and miRNA* sites. Consistent with Figure 2A (upstream of miRNA* site), there was significantly more sense and antisense signals associated with miRNAs than elsewhere in the hairpins (Table 1; Datafile S3). Functional evidence of the antisense transcripts is seen by statistically significant overrepresentation in the upper half of the Arabidopsis transcriptome ranked on exon sense/antisense signal abundance for all genes including well-annotated protein-coding genes that produce antisense siRNAs ( Table 2, last row). An additional evidence is strong over-representation in the upper half binomial distribution of unknown protein-coding genes with independent ncRNAs mapping to them [58] (Table 2, ''with as-TU'' rows). The breadth of extant antisense EST coverage (Supplemental Table S2) which includes genes with high confidence TAIR9 expression rankings (four and five star) is prima facie evidence that antisense transcripts identified by whole genome tiling arrays are biologically significant and support our computational evidence that a significant number of unknown predicted protein-coding genes are actually ncRNAs. The extant Arabidopsis tiling array data quality is high, but that of rice is not (Supplemental Figures S2, S3). This situation is likely due to the lower number of probes with perfect matches between the Oryza sativa var. indica reference genome (from which the tiling array probes were designed) and the MIRNA hairpin sequences in miRBase which have been sequenced predominantly from japonica (Nipponbare) varieties. A recent report describes antisense transcripts and siRNAs associated with hypothetical genes in rice [106], consistent with our results (Supplemental Table S1; Datafile S5).

Expression-based computation as a means to ncRNA discovery and genome annotation
In this study we approach the broad question of applying computation to deep experimental expression datasets to develop methods for gene discovery, focusing on miRNAs, ncRNAs, and antisense transcripts. Because these RNA classes span eukaryotic kingdoms where the molecular processes are deeply conserved but the molecules themselves are not, genomic analysis of antisense expression patterns in plants may reveal associations (a la 'a smoking gun') that can provide insight into animal miRNA and ncRNAs, where complementarity is less conserved. Analogous approaches for miRNA target genes classified according to the promoter features of the cognate MIRNA genes have been described [119,120]. We show that the phenomenon of spreading/transitivity of smRNAs associated with miRNA target genes and MIRNA genes in Arabidopsis is conserved in rice (Figure 1). Similar processes occur on worm MIRNA gene transcripts (Supplemental Figure S1), likely mediated by multiple interactions between RNA-dependent RNA polymerases RRF-1 and RRF-3 and associated with DICER and Argonautes NRDE-3 and ERGO-1 [98]. It is interesting to note that those MIRNA genes (e.g. miR158, miR159/319, miR164, miR167, miR168, miR172) whose transcripts accumulate in post-transcriptional processing mutants [80,114,116] also produce abundant smRNAs (Datafile S2) [14]. Xue et al. also observed antisense smRNAs associated with several rice miRNAs and miRNA*s [121] and noted an example (miR55) from C. elegans. Taken together with reports of a novel RNA-dependent RNA polymerase in Drosophila [122] and functional antisense miRNAs in Drosophila and mouse [123][124][125], our data in rice, C. elegans, and Arabidopsis [14] lend credence to the notion that similarly complex transitivity mechanisms operate on plant and animal miRNAs. We show that antisense transcription signals for MIRNA and protein-coding genes are detectable by whole genome tiling arrays (Figure 2; Supplemental Figures S2, S3; Table 1), providing evidence of the molecular mechanism of smRNA production. However, the low abundance of the antisense signals requires high quality microarray data ( Figure 2; Supplemental Figures S2, S3) that are not yet available for rice (Supplemental Figure S2, Supplemental Table S1). Deep sequencing of mRNAs and epigenetic marks on DNA reveal hidden facets of RNA processing, chromatin remodeling, and gene regulation, but the method is expensive. Computational analysis of smRNA datasets, which are less costly on a molar basis to generate, in conjunction with inexpensive high resolution custom tiling microarrays can provide a more integrated view of gene expression, especially in genomes with limited annotation. We incorporated expression data, smRNA counts, and thermodynamic energy of binding as features for a Support Vector Machine to build a model for prediction of miRNA target sites in Arabidopsis. Using a dataset based on validated Arabidopsis miRNA targets, the machine was internally tested based on accuracy, precision, sensitive and specificity. The results were modestly supportive of a predictive value for smRNA counts over tiling array expression signals, suggesting both these features have potential utility as filters in miRNA prediction methods over thermodynamic stability alone. The performance of the SVM was further tested with an external dataset: MIRNA genes. Although the miRNAs* corresponding to ancient miRNAs produced more supportive results, the newly discovered miRNAs* were also fairly predicted (Table 5). Mathematically modeling the 'downstream sense, upstream antisense' tiling array signal to describe more precisely the transitive activity and reduce the dimensionality might improve the performance of the SVM. The extreme case is seen when considering only those probes with perfect matches to miRNA hairpin domains (Table 1), which decreases noise and feature dimensions of the machine. Improvements such as utilizing a non-linear kernel and optimizing the ''slack'' parameter may improve the power of the SVM. The machine can be further developed with datasets of predicted miRNA targets [65][66][67][68][69][70], as well as candidate MIRNA genes and targets from purely computational methods. Thousands of predicted miRNAs in Arabidopsis and rice have no functional evidence to support their being expressed or having bona fide targets, and therefore represent a large investment to qualify by wet lab methods. The SVM could facilitate prioritizing those that have a greater likelihood of being real based on collective expression topologies. However, our results that qualify the rice tiling array expression data as low-quality (Table 1, Supplemental Figure S2, Supplemental Table S1) limit the potential of an expression-based SVM for rice until a high-quality tiling expression dataset is available. Optimizing the SVM features based on biology (e.g. sizes of the siRNAs, ''phasing'') are other candidate features for adoption. Two new miRNA families (miR2118, miR2775) conserved in monocots (and Phaseolus for miR2118 [126]) were recently discovered by analysis of rice phased siRNAs produced from TASi-like target ncRNAs [127].
Gene models suffer from errors in reading frame, exon border definition, and exon identification. It is estimated that 13% of the Arabidopsis proteome is incomplete due to approximately equal numbers of missing and incorrect gene models [128,129], suggesting that there is ample scope for gene discovery even in well-annotated genomes. Whole-genome tiling arrays have utility for characterizing alternative splicing [130]. Tiling array expression and TAIR9 confidence rankings are useful metrics for ncRNA discovery (Supplemental Figure S3, Supplemental Table S2). Our computational results begin to address expression 'topology,' the relationship between RNA expression signals and gene structure. We show, in the case of exons versus introns, that expression topology is a valuable metric for interrogating genome annotation. Arabidopsis signals show excellent congruence with exon/intron annotations in all five samples (from two different technology platforms) with only a slight bias of expression signal toward the 39 end of the gene and minimal signal in the 39-UTRs (Supplemental Figure S2; Datafile S4). This aspect of gene expression topology can be developed further by calculating an integral for each separate functional domain of a gene (promoter, 59 UTR, exons, introns, 39 UTR) and modeling expression topology to identify outliers that could facilitate gene discovery and genome annotation. Several groups have recently published Arabidopsis whole genome tiling array transcriptome studies on stress responses [58,119,120,131] and note changes in 59 and 39 UTR and MIRNA gene expression. The existence of promoter-associated antisense transcripts in animals that regulate transcriptional activation and repression by RNAi-associated processes [49,50,[132][133][134][135][136][137] suggest that tiling array interrogation of promoter-associated RNAs can indentify similar classes of ncRNAs in plants.
Applying machine learning algorithms, we could identify associations between miRNA target genes or MIRNA genes and smRNAs (Tables 4, 5) and between protein-coding genes and ncRNAs (Supplemental Table S2) that fit a model of transitivity based on their antisense expression topology. Our methods reveal the potential of expression-based machine learning and unsupervised association to discover new miRNAs, target genes and ncRNAs based on expression features such as strand bias for production of phased siRNAs (Supplemental Figure S5). Contingent upon generation and availability of high quality datasets, whole genome tiling array transcriptomes and deep smRNA datasets such as for rice [85,86] and other species will be suitable subjects for further computational methods testing and analysis. Expression-based determinants have potential applications for ncRNA discovery in other kingdoms and species where miRNA binding site free energies are lower, especially in transitive processes which are poorly understood.
The method presented is equally applicable to transcriptome data generated from ultra high-throughput sequencing (UHTS) approaches. Conceptually, short read data can be represented in a format similar to tiling array data (genomic location versus read frequency instead of array signal). The added benefit of UHTS is that signals for any genomic coordinate are potentially generated rather than fixed a priori with predetermined probes in tiling array experiments. Therefore, UHTS data is conceptually equivalent to tiling array data with probes derived from every nucleotide of the genome-once the transcriptome data is converted to genomic coordinate versus signal representation, the methods described can follow without change.
The presented approach is applicable to animal systems. C. elegans and D. melanogaster are likely to be the best candidates to directly apply this approach, because their genomes are well annotated with whole genome tiling array and EST data [138][139][140]. Moreover, their gene structures and intron-exon sizes are comparable to the model plants Arabidopsis and rice. However, higher levels of transcriptional complexity in animals with prevalence of alternate splicing and overlapping antisense transcripts need to be properly accounted for. Genomes of higher mammals pose additional difficulties due to the presence of short exons separated by large intronic regions and lowcomplexity transposon-related sequences in their gene structures which spawn smRNAs.

The biological significance of antisense ncRNAs
The congruence of antisense tiling array signals to the exonic regions [14] manifest in antisense S/N ratios ..1 for most annotated genes (Supplemental Figure S3, Datafile S4) is remarkable and strongly suggests that the majority of antisense transcription occurs predominantly on spliced mRNAs and is dependent on the activity of RNA-dependent RNA polymerases. A recent report suggests nuclear RNA distorts transcriptome microarray results, consistent with our inference that cytoplasmic RNA is a major source of antisense RNAs [141]. However, deep sequencing of smRNAs from Arabidopsis does uncover a small percentage that map to intron-exon junctions and introns, suggesting that precursor-mRNAs, or more likely DNA in the nucleus is also a source of antisense transcription [78].
Expression of ncRNAs is commonly regulated by stress and environmental stimuli, and many different ncRNAs accumulate at specific developmental stages or in specific cell types, or even within specific subcellular domains, suggesting important and tightly controlled biological roles [1,16,25,56]. New miRNAs continue to be discovered by deep sequencing and are expressed at very low levels or only in a few tissues or at particular times during development. It is speculated that the antisense miRNA* signal we observe ( Figure 2) is due to hybridization of pre-miRNA transcripts, but other interpretations are possible such as spurious labeling of abundant miRNA and miRNA* species, or hybridization of miRNA target mRNAs, or mismatch hybridization of homologues. Analysis of available whole tiling array data RNAi knockdown mutants of the exosome [116] and other miRNA metabolism and RNA processing mutants such as xrn4/abh1 double mutant, hyponastic leaves1, argonaute1, nonsense-mediated decay effectors upf1/upf3, RNA-dependent RNA polymerase2 (rdr2), DNA methylation triple mutant drm1/drm2/cmt3, and serrate [114,[142][143][144][145] should be informative, especially when combined with machine learning to find other affected loci.
There remains the important question of the biological significance of antisense transcription as it relates to our findings and the myriad examples found across eukaryotic phyla. It has been postulated that evolution of MIRNA genes includes an early stage when antisense transcription is triggered by long perfect dsRNA of an inverted repeat or transposon-related repeat [77,78,111,112]. Our finding of a long inverted repeat with abundant 21 and 24 n.t. siRNAs mapping within antisense ncRNA (Group4327) highly homologous to MIRNA843 hairpin (Table 3) yet not conserved for mature miR843 (data not shown) suggests a different evolutionary origin of MIRNA843 than postulated by Fahlgren et al. [77] who noted weak foldback homology with a protein-coding gene (At3g48340) not targeted by miR843. Recent findings in pollen and female gametes and their accessory cells [146,147], endosperm [148][149][150] and gametes of mouse [151] and Drosophila [123,152] show endogenous siRNAs are formed from cis and trans antisense transcripts and function in epigenetic regulation of germ line gene expression and cell fate and may serve a 'memory' role to mediate RNA-and DNA level silencing of transposons during vertical transmission to the next generation. A link between genomic imprinting and RNA silencing in plants has come from studies of PolIVb/V-dependent siRNA accumulation in the maternal gametophyte and developing seed: expression of siRNAs in endosperm is specifically from maternal chromosomes [150]. Newly discovered gypsy and copialike retroelements can transpose in hybrid met1/wild type epigenomes and in mutants of the chromatin-remodeling ATPase decrease in dna methylation1 (ddm1); subsequent movements are suppressed by RNA-directed DNA methylation that requires Pol IVb,/V and the histone methyltransferase KRYPTONITE (KYP). These results establish that epigenetic control of retrotransposons extends beyond transcriptional suppression [153,154]. The transposon-and TAS-like hairpins we describe (Table 3, Supplemental Figure S5) which produce phased siRNAs may be cases of post-transcriptional antisense regulation of relevance to proteincoding gene regulation or miRNA evolution [155]. We speculate that similar mechanisms may affect MIRNA genes and miRNA targets that produce smRNAs and are subject to DNA methylation [5,156,157]. Figure S1 C. elegans primary (59 mono-) and secondary (59-triphosphorylated) antisense siRNAs [98] that map to various positions of miRNA hairpins. Primary siRNAs map predominantly to miRNA* positions, and secondary siRNAs map predomi-nantly to loop regions, similar to results seen in Arabidopsis and rice ( Figure 1). See Datafile S6 for details. Found at: doi:10.1371/journal.pone.0010710.s001 (0.60 MB TIF) Figure S2 Comparison of Arabidopsis and rice sense strand signal profiles for highly conserved domains of eight select ribosomal genes, from whole genome tiling arrays. Signal to noise (S/N) ratios were calculated from the arithmetic means of probe signals mapping to exons divided by intron probe signals. For Arabidopsis, signal line colors indicate RNA samples from T87 callus cultures (blue) [82]; flowers (green); root (magenta); lightgrown leaves (brown); and suspension cells (tan) [83]. Exons are denoted below the plot as green boxes on the Watson (upper) or Crick (lower) strands (x-axis). Note the trend for increasing signal strengths towards the 39 end of the gene (arrows) including 39 UTRs (ovals), especially for rice data, consistent with degradome studies [80,114]. Found at: doi:10.1371/journal.pone.0010710.s002 (1.32 MB TIF) Figure S3 Antisense strand signal profiles from Arabidopsis whole tiling arrays for eight ribosomal genes of Supplemental Figure S2. Signal to noise (S/N) ratios were calculated from the arithmetic means of probe signals mapping to exons divided by intron probe signals. Signal line colors indicate RNA samples from T87 callus cultures (blue) [82]; flowers (green); root (magenta); light-grown leaves (brown); and suspension cells (tan) [83]. Exons are denoted below the plot as green boxes on the Watson (upper) or Crick (lower) strands (x-axis). Note the antisense signals are largely congruent with exons, suggesting that antisense transcription occurs on mature mRNAs. Found at: doi:10.1371/journal.pone.0010710.s003 (0.85 MB TIF) Figure S4 Meta-analysis of two exclusive sets of ''unknown'' annotated proteins (filled diamonds [27] and open circles [Y. Xiao and C.D. Town, personal communication]) plotted as functions of TAIR9 expression quality (y axis) and ratio of sense exon/ antisense exon expression (data from [82,83]). All genes have independent evidence of antisense expression [58]. The average expression ratios for all 105 genes correlated positively (r = 0.61) as a function of expression rating class, whereas there was an inverse correlation (r = 0.83) between expression rating classes and numbers of genes with sense/antisense expression ratios ,1. Found at: doi:10.1371/journal.pone.0010710.s004 (0.72 MB TIF) Figure S5 (A) Hairpin-containing secondary structure corresponding to phased antisense siRNAs mapping to predicted small ORF At1g55045. Base-pair probabilities from RNAfold [89] are shown as heat map. (B) Phased siRNAs [94] to At1g55045 hairpin mapped with pssRNAMiner [158], P,6e-5 (random hypergeometric distribution). Antisense strand is labeled (-). Approximately 20% of all known smRNAs mapping to this locus are phased.