The small RNA locus map for Chlamydomonas reinhardtii

Small (s)RNAs play crucial roles in the regulation of gene expression and genome stability across eukaryotes where they direct epigenetic modifications, post-transcriptional gene silencing, and defense against both endogenous and exogenous viruses. It is known that Chlamydomonas reinhardtii, a well-studied unicellular green algae species, possesses sRNA-based mechanisms that are distinct from those of land plants. However, definition of sRNA loci and further systematic classification is not yet available for this or any other algae. Here, using data-driven machine learning approaches including Multiple Correspondence Analysis (MCA) and clustering, we have generated a comprehensively annotated and classified sRNA locus map for C. reinhardtii. This map shows some common characteristics with higher plants and animals, but it also reveals distinct features. These results are consistent with the idea that there was diversification in sRNA mechanisms after the evolutionary divergence of algae from higher plant lineages.


Introduction
Small (s)RNAs in many organisms are involved in regulation of gene expression at the transcriptional (epigenetic marks) and post-transcriptional (RNA degradation/translational repression) levels, as well as in host defenses against viruses [1]. In eukaryotes, their double stranded or highly structured RNA precursors are processed by Dicer-like (DCL) endonucleases into short 20-25nt RNA duplexes that are bound by Argonaute (AGO) proteins. One of the sRNA strands guides an AGO-containing effector complex toward complementary DNA/ RNA to mediate the various mechanisms of RNA silencing [2]. Based on their origin and biogenesis, sRNAs can be classified in diverse classes including small-interfering (si)RNAs, micro (mi)RNA, and piwi-interacting (pi)RNAs [3], but there are many different sRNA subtypes [4] and diffuse boundaries between sRNA classes, as demonstrated for Arabidopsis thaliana [5]. Consequently, development and improvement of comprehensive classification methods is required for understanding of sRNA-based regulation networks.
Due to its small genome, vegetative/sexual reproduction, fast growth, motility and capacity to use acetate as carbon source, the photosynthetic green alga Chlamydomonas reinhardtii

Preparation and sequencing of sRNA libraries
Chlamydomonas cells were grown in liquid TAP media with constant shaking at 25 degree celcius under continuous illumination until cultures reached saturation. Total RNA was extracted from cell pellets with TRIzol reagent (ThermoFisher) by following a protocol previously described [7]. sRNA libraries were prepared directly from total RNAs by using the TruSeq v2 RNA Sample Preparation Kit (Illumina) following the manufacturer instructions, and then they were further sequenced on a HiSeq 2000 sequencer. Sequencing data were preprocessed using the ADDAPTS pipeline and tracking system [5,22]. After 3' adaptor removal, all sequences <15 nt in length were discarded, and the remaining sequences were aligned against the Chlamydonomas genome using the bowtie alignment program tolerating zero mismatches [23,24]. Only sequences with at least one perfect match were included in further analyses. The Chlamydomonas reference genome and transcriptome used were Phytozome v5.0 and version 281, respectively. sRNA locus map 145 sRNA libraries consisting of 54 replicate groups were used as the basis for analysis. 142 libraries were internally generated laboratory datasets [7,10] while 3 were from [25]. A locus map for Chlamydomonas sRNAs was produced using the Bioconductor (www.bioconductor. org) package segmentSeq [26]. This package uses a heuristic approach based on sRNA densities to establish an initial locus map which is then refined using Bayesian methods to take into account the separate replicate groups. Sequences that aligned to the genome more than 200 times were excluded from the segmentation. Any gap of greater than 100nt with no reads was sufficient to split a locus. The quality of the segmentation was analysed using a series of diagnostic plots (S1-S4 Figs). The locus map was formed from all loci with a false discovery rate (FDR) of less than 0.05.

Locus map annotation
The locus map was annotated with intrinsic locus characteristics and publicly available annotations using functions mainly run in the R programming language [27]. The full annotated locus map can be found in S1 File.

Loci sizes
Chlamydomonas loci were classified into five discrete size classes using class divisions of 100, 400, 1500 and 3000 (corresponding to log 4.6, 6.0, 7.3 and 8.0 respectively) nucleotide width. Class divisions were determined through a combination of the gradient changes seen in the locus size distribution (S5 Fig) balanced with the need to capture biologically relevant size classes. The 100nt cut-off was used to capture loci which might be driven by just one sRNA. The 100-400nt class captures loci which are roughly the same size as an average intron or exon in the Chlamydomonas genome [23]. At 1500nt there is a marked change in the graph and then the 3000nt division captures the long tail of very large loci.

Predominant 5' nucleotide
To investigate whether there was a predominance for particular 5' nucleotides of the sRNAs originating from each locus, each of the four nucleotides were tested as to whether levels differed significantly from the normal ratio across all loci assuming a binomial distribution [28]. The Benjamini-Hochberg procedure was used to minimise the FDR [29].

Repetitiveness
Repetitiveness is a measure of the extent to which small RNAs that align to a given location may also align to other genomic locations. We assessed this at each locus using the following equation: where x i is the number of times the ith small RNA within the locus is sequenced and m i is the number of genomic locations to which that small RNA aligns. This gave a score between 1 and 0 (1 being highly repetitive, 0 not at all repetitive) which was divided into three groups (low R < 0.6, median 0.6 < R < 0.9, and high with R > 0.9) corresponding to the peaks of the distribution shown in S6 Fig.

sRNA strand bias
Strand bias ratios were calculated from loci in wild type samples with more than five reads. Confidence intervals for strand bias at each locus were calculated assuming a binomial distribution and using a modified Jeffreys interval [28]. Loci were then classified as having a strong (< 0.2 or > 0.8), medium (between 0.2 and 0.4 or between 0.6 and 0. 8)

Phasing
The detection of phasiRNA production was undertaken using PhaseTank, a perl based tool which searches for regions of a minimum of four 21nt sRNAs and computes a phasing score for each one [30]. Overlap between sRNA loci and phased regions were then used to annotate loci as phased or not. The phasing score was reported as part of the annotation in S1 File ranging between 0 (no phasing) and 313. This allowed the identification of loci overlapping with a medium phasing region (0< score <60) or with a high phasing region (score > 60) (S8 Fig).

sRNA size ratios
Since both 20nt and 21nt long sRNAs were shown to be predominant sRNA classes in Chlamydomonas [7] we calculated the ratio between them. Furthermore, there are potential physiological roles proposed for larger [31] and smaller [32] sRNAs in Arabidopsis. We therefore assigned each locus according its predominant sRNA population namely using 20nt and 21nt sizes as thresholds.

Expression type
Loci were classified as to how ubiquitous their expression was. 10 wild type replicate groups were identified and loci present in more than 5 defined as common (expression:common in Fig 5), between 1 and 5 as inbetween and only 1 as specific.

Mutant, strain and developmental stage annotation
DCL3 and AGO3 dependence was calculated by determining loci present in at least one wild type replicate group but not in any of the mutant libraries. We also determined loci found specifically in one of the three used strains (CC4350, CC1883, and J3). Presence of loci specifically in libraries of either vegetative or zygotic developmental stages was also calculated.

Genome annotations
Overlap with various genome features was calculated including predictions for genes, 5' and 3' UTRs, exons and coding sequences obtained from the Phytozome genomics portal (phytozome.jgi.doe.gov) using the most recent Chlamydomonas genome annotation (v5.5) [23]. Promoter regions were calculated as the 500 bp flanking each gene. Transposon locations were established by processing the Chlamydomonas repeat masker file to remove any sequence not explicitly identified as known transposons. Transposons were then classified (S1 Table) according to the unified system proposed by Wicker et al. and using the extensive literature concerning transposon identities [33]. Predictions of miRNAs, inverted repeats (IRs) and tandem repeats (TRs) were sourced from internal lab data with the IRF and TRF algorithms used to identify IRs and TRs respectively [10].

DNA methylation
Bisulphite sequencing data generated was processed using yama (https://github.com/tjh48/ YAMA) with the heuristics based functionality of segmentSeq used to determine loci enriched in CG, CHH or CHG methylation. sRNA Loci were then probed for overlap with these methylation enriched loci.

Multiple correspondance analysis
MCA was used to cluster the loci according to their annotations using the CRAN (cran.r-project.org) package FactoMineR with the HCPC function adapted to enable K-means clustering [34]. Some annotations were used as supplementary where they were not predictive of the clustering but their correlations were calculated (S2 Table). These supplementary annotations can provide cross-validation of the clustering outputs. The number of dimensions and clusters to select was determined by integrating information from a number of analyses consistent with that applied by Hardcastle et al. [5]. Dimensional reduction techniques like MCA are designed to concentrate the variance explained in the lower dimensions. Thus, at a certain cut-off, higher dimensions can be excluded as not being particularly significant for explaining overall variation. The graphical elbow-method is a common means to do this by considering the % of variation explained for each dimension, with a clear elbow displayed suggesting 6-10 dimensions would be appropriate (S9 Fig). To calculate the stability of the clustering we re-ran the clustering multiple times using random sub-samples derived by random sampling of the loci with replacement to create datasets of the same size as the original. For each iteration of the sub-sampling and clustering we calculated the proportion of sampled loci which retained the same clustering. This enabled us to create box-plots showing the proportion of loci preserving their original clustering across multiple iterations, thus providing an indication of stability. This analysis was repeated for different combinations of dimensions (1-10) and clusters (2-10) (S10 Fig). This analysis showed that clustering was generally stable between seven and ten dimensions for two, three or six clusters. Based on the combination of these two analyses we determined seven dimensions to be appropriate.
To determine whether two, three or six clusters would be most appropriate, we calculated the gap statistic for seven dimensions [35]. Two, three and six clusters all showed large differences between the observed and expected sum-of-squares within the clusters compared to the cluster means (Wss) (S11 Fig). Computing the normalised mutual information (NMI) compared the clustering to annotation feature overlap showed a large increase in NMI for six clusters (S12 and S13 Figs). Thus six clusters was selected for primary analyses.
A cluster hierarchy was also generated by carrying out clustering for two, three and six clusters each time using seven dimensions which allowed the determination of how loci clustered together for lower values of k.

sRNAs characteristics in Chlamydomonas
In order to construct a complete locus map we first obtained a comprehensive collection of 145 sRNA libraries encompassing 54 replicate groups (S3 Table). To capture a maximum diversity of sRNA the libraries represent a wide range of conditions, strains and stages of the life cycle. After initial trimming and filtering (see Methods), the libraries contained a total of 22.3 million non-redundant (336 million redundant) sRNA reads mapping to the Chlamydomonas reference assembly genome [23].
To obtain a general overview of the sRNA population within our datasets, we first computed the same intrinsic key determinants found to be informative in the sRNA locus classification of Arabidopsis [5]: locus size, 5' nucleotide, repetitiveness of genomic mapping locations and abundance of individual sRNA species. To avoid a bias towards non-representative conditions, for this first part of the analysis we selected only libraries made from wild type strains.
The size distribution of sRNAs ( Fig 1A) is dominated by 21nt species comprising 28% of all sRNAs, in line with previous studies [7,8,10,11,36] and unlike Arabidopsis and other land plants in which the 24nt sRNA fraction predominates ( Fig 1A). Also, the 21nt species are the only class exhibiting counts with more than 1 million copies per sRNA species which makes up about 20% within that class (Fig 1B). Many reads (43% redundant and 70% non-redundant) map uniquely to the reference genome, which is comparable to Arabidopsis (52% and 80%), albeit more evenly distributed across size range as shown in Fig 1C. This difference is most likely due to the absence of a RdDM pathway in Chlamydomonas.
Studies have shown the importance of 5' nucleotide for AGO binding specificity in Chlamydomonas [36] and Arabidopsis [37]. For Chlamydomonas, and in line with previous reports, we found that most sRNAs have adenine (A) and uracil (U) as 5' nucleotide. As in Arabidopsis, however, the 5'-end nucleotide varies greatly for different size classes (Fig 1D). For the predominant 21nt fraction, there was a preference for A and U (26% and 53%) with a higher proportion of 5' A sRNAs in the larger fractions up to 27nt. Overall, we found general agreement of our data with other datasets and we concluded that they are suitable for subsequent locus map generation and classification.

Defining a comprehensive small RNA locus map
To generate a locus map with these datasets we used the R package SegmentSeq. It employs a heuristic approach based on sRNA densities to derive an initial locus map which is then refined using Bayesian methods to take into account replicate groups [26]. The locus map based on a false discovery rate of less than 0.05 (FDR) had 6164 loci (S1 Fig, S4 Table) and covering 4.1% (4.57Mb) of the reference genome (110Mb). All loci were allocated a unique ID consisting of the CRSL prefix (acronym of Chlamydomonas Reinhardtii SRNA Locus) combined with a successive number starting with CRSL000010. While

Small RNA loci annotation
To gain insight into potential function of individual loci, we annotated them based on intrinsic loci features, such as locus size and repetitiveness, and based on features associated with the sRNAs, such as sRNA size, 5'-nucleotide, strand bias and phasing pattern (S2 Table and Methods). In addition, extrinsic annotation features of sRNA loci included sRNA expression in specific conditions, genotype or overlap with genomic features (e.g. genes, transposons, methylation level). In a strong validation of the locus map, all 42 miRNAs previously identified in Chlamydomonas appear as defined sRNA loci [10]. An example is depicted in Fig 2 showing the loci CRSL0041450 which represents miR1157 and miR1157 � as part of the 22th intron of the gene Cre12ġ537671.

Multiple correspondence analysis
Having assigned intrinsic and extrinsic features to each locus (a detailed breakdown of intrinsic and extrinsic features are listed in S2 Table), we used the MCA function from the FactoMi-neR R package [34] to search for underlying patterns and feature associations. Following dimensional reduction with MCA, k-means clustering was used to group loci according to the annotation patterns (see Methods). We hypothesised that such grouping might identify distinct sRNA types and therefore potentially reveal distinct biogenesis or effector pathways.
To optimise the number of clusters and dimensions to be used we followed an approach used in a similar analysis of Arabidopsis [5] (see Methods). We first evaluated the stability of the clustering using different combinations of clusters and dimensions through random subsampling (S10 Fig). We also calculated the additional variance explained by inclusion of an additional dimension (S9 Fig). Taken together, this analysis indicated that seven dimensions is the optimal number. In addition to the stability tests, computation of the gap statistic [35] and the normalised mutual information (NMI) between the clusters and annotation features both suggested six clusters to be optimal (S11-S13 Figs). The presence of robust clustering for two and three clusters was noted and used to generate a hierarchy plot to demonstrate how loci from clusters grouped together for lower values of k (Fig 4).
The resulting six clusters, referred to as locus class (LC) 1-6, have relatively similar sizes and their association with different annotation features is shown in Fig 5 (also see S15 Fig and S6 Table). All loci with corresponding locus classifications are listed in S4 Table. Associations with annotation features not used predictively in the MCA and K-means clustering (S2 Table) were also calculated and enable some cross-validation of the clustering (S15 Fig). Genome browser views of paragons for each of the six clusters are provided in Fig 2 for LC3 and S16-S20 Figs for LC1, 2, 4, 5 and 6 respectively. The cluster hierarchy (see Fig 4) suggests that the primary division in clusters is between LC1-2 and LC3-6. LC1-2 have a high levels of repetitiveness and a stronger association with genomic methylated regions than LC3-6.
As observed in Arabidopsis [38], the repetitiveness of loci in the genome may correlate with different sRNA silencing mechanisms. LC1 is more associated with transposons (specifically retrotransposons) than LC2. S16 Fig illustrates the overlap of LC1 loci with transposons, low mappability (which corresponds to high repetitiveness) and absence of overlapping genes. Loci were split up mostly due to varying coverage.
LC2 is associated with genic regions, exemplified by a LC2 paragon (CRSL003890) (S17 Fig). Interestingly, a CRTOC1 transposon is superimposed on both the loci and the gene. LC2 contains the largest loci of all classes. Indeed, the shown sRNA locus is 6kb long, which is well outside the normal locus size range.
All miRNA containing loci are in LC3 along with the majority of DCL3-and AGO3-dependent loci. Of the AGO3-dependent loci, 73% have a predominance for U at the 5' end, consistent with AGO3's strong U preference [36]. LC3 loci also demonstrated common expression across wild type libraries and enrichment for 21 nt sRNAs. Interestingly, LC3 contained 130 out of a total of 149 loci that exhibit evidence of phasing.
LC4 sRNAs were similarly represented in most wild type libraries suggesting constitutive or housekeeping roles. However, unlike LC3, there was a lack of sRNA size specificity with 95% having a bias for sRNAs larger or smaller than the modal 20 and 21 nt sRNAs. DNA transposons are associated with 11% of LC5 for which there is a strong bias for sRNAs with a C at the 5'-end potentially indicative of different AGO protein association [39]. There was also a much higher level of LC5 loci specifically expressed during the zygote stage (5%) perhaps indicating roles in silencing DNA transposons at specific points in the life-cycle. LC6 had typically smaller loci (average size 386 nt) as well as most loci (85%) having an enrichment for sRNAs shorter than 20 bp.
Chromosome tracks (Fig 6), demonstrate distinct genomic location patterns for the different LC. LC1-2 are concentrated at the centromere along with higher levels of DNA methylation and a concentration of retrotransposons whereas LC3-6 meanwhile are more evenly spread along the chromosome arms. These patterns are a validation of the locus clusters as chromosomal location was not included as a feature in the MCA.

PLOS ONE
The small RNA locus map for Chlamydomonas reinhardtii

PLOS ONE
The small RNA locus map for Chlamydomonas reinhardtii

Discussion
Chlamydomonas has a silencing machinery more complex than might be expected for a unicellular organism. This complexity precludes a comprehensive characterisation of sRNArelated pathways simply by investigating individual sRNA loci in detail or by studying individual genome-wide features. To address this complexity in this study we followed an approach that allows the data-driven identification of distinct types of sRNA loci [5,26]. Importantly, the MCA uses a wide range of features as inputs to enable the robust identification of clusters, which could not have been derived by using individual features. Inspection of the feature associations for each cluster enables the validation of the clustering, as well as the dissection of important and unimportant features.
By reporting the first comprehensive map of sRNA loci in a unicellular organism we demonstrate the multi-applicability of our pipeline, which was previously used for Arabidopsis [5,26], for locus map generation, annotation and clustering. We are mindful that, as with all genome-wide analyses, it is inevitable that there will be some level of noise in the locus map. There is also a possibility that some of the gene-associated loci may correspond to RNA breakdown products rather than bona fide sRNA loci. Yet, our results corroborate (i) the overall size bias for 21nt sRNAs, (ii) the lack of enrichment in the 24nt fraction associated with the RdDM in higher plants, and (iii) the bias for U and A at the 5'-end of sRNAs [7,8,10,36]. Importantly, our analyses groups known Chlamydomonas miRNAs into the same cluster, LC3. Together, these results indicate the robustness of our approach and the overall validity of our findings.
The characteristics of LC3 loci indicate that this cluster includes canonical miRNAs along with other sRNAs, all produced by DCL3-mediated cleavage of precursors and then bound by the AGO3 effector protein. As these non-miRNAs have most, but not all, features corresponding to bona fide miRNAs, we propose that they derive from immature miRNA precursors, which will evolve to either become canonical pre-miRNAs or are on a pathway for potential elimination from the genome. Meanwhile, the computational detection of phasing in some LC3 loci provides further evidence for the possibility of there being phasiRNAs in Chlamydomonas despite its lack of conserved RDRs [19]. If these are in fact true phased loci, this raises the question of whether the Chlamydomonas genome might contain an as yet uncharacterised non-canonical RDR or posses a distinct phasiRNA biogenesis pathway.
LC4 loci are typically independent of DCL3 and AGO3 and, among other peculiarities, they lack bias for U/A at the 5'-end and they are variable in size. The variability in size could indicate imprecise processing by DCL1/2 possibly due to their lack of PAZ domain, which is thought to confer measuring specificity [40]. However, there is also a strong possibility, given the proncounced strand bias and association with genic regions, that the lack of size-specificity of these loci is in fact due to them deriving from RNA breakdown products. In this scenario, a key benefit of the approach taken here is that such loci are clustered together in one locus class, preventing them from undermining the overall results. Further analyses of sRNA species from dcl1/2 and ago1/2 mutants will provide key insight on this matter.
It is likely that LC1-2 are also processed by DCL1/2 because they have a high 21nt bias. In that scenario a PAZ domain-like measuring function may be performed by other dsRNAbinding proteins, as with DGCR8 in the microprocessor complex of animals [41]. An RNAbinding protein DUS16, similarly partners with DCL3 for the proper processing of miRNAs [42,43].
Our findings also raise questions about DNA methylation and genomic defense from transposons in Chlamydomonas. In Arabidopsis, the RdDM pathway is well characterised [44] and, in the Arabidopsis map, a subset of loci show a combination of RDR2 dependence, bias for 24nt sRNAs (the size class which directs RdDM), and overlap with methylated DNA regions including transposons (the primary targets for RdDM pathway) [5]. The crucial RdDM machinery has not been identified in Chlamydomonas and, in this study consistent with previous findings, no enrichment in the 24nt size fraction was found in any of the sRNA types. However, the presence of loci overlapping with methylated retrotransposons (LC1) and with methylated genic regions (LC2) suggests a possible role of sRNAs during establishment and/or maintenance of methylation states in Chlamydomonas genome. If this connection between sRNA and methylation does exist, then there is a possibility that these loci may represent a distinct form of RdDM in Chlamydomonas that is highly divergent from that of higher plants. Moreover, LC5 loci, with their zygotic-specific expression and DNA transposon overlap, could possibly represent a transposon silencing system activated specifically during the zygotic stage.
Our data-driven approach to identify and classify sRNA loci is intended primarily for the purpose of hypothesis generation, giving possible insights into the biosynthesis and function of sRNAs in Chlamydomonas and, potentially, in other unicellular eukaryotes. The results have allowed the identification of a number of areas for further exploration, as discussed above. Overall, when compared to the previously presented Arabidopsis locus map [5], the results indicate both similarities with higher plants (e.g. LC3) as well as diversification. While these findings relate to just two species, they add to a growing body of evidence suggesting that Chlamydomonas possesses distinct sRNA pathways compared to land plants, with some studies suggesting that they may in fact show more animal-like features [9][10][11][12]. Alternatively, these distinct sRNA pathways could have evolved independently. Further studies with mutant strains will enable deeper characterisation aiming to elucidate the functional significance of sRNAs in the unicellular algae Chlamydomonas as well as the evolution of RNA silencing pathways in diverse linages.