A genome-wide analysis of the RNA-guided silencing pathway in coffee reveals insights into its regulatory mechanisms

microRNAs (miRNAs) are derived from self-complementary hairpin structures, while small-interfering RNAs (siRNAs) are derived from double-stranded RNA (dsRNA) or hairpin precursors. The core mechanism of sRNA production involves DICER-like (DCL) in processing the smallRNAs (sRNAs) and ARGONAUTE (AGO) as effectors of silencing, and siRNA biogenesis also involves action of RNA-Dependent RNA Polymerase (RDR), Pol IV and Pol V in biogenesis. Several other proteins interact with the core proteins to guide sRNA biogenesis, action, and turnover. We aimed to unravel the components and functions of the RNA-guided silencing pathway in a non-model plant species of worldwide economic relevance. The sRNA-guided silencing complex members have been identified in the Coffea canephora genome, and they have been characterized at the structural, functional, and evolutionary levels by computational analyses. Eleven AGO proteins, nine DCL proteins (which include a DCL1-like protein that was not previously annotated), and eight RDR proteins were identified. Another 48 proteins implicated in smallRNA (sRNA) pathways were also identified. Furthermore, we identified 235 miRNA precursors and 317 mature miRNAs from 113 MIR families, and we characterized ccp-MIR156, ccp-MIR172, and ccp-MIR390. Target prediction and gene ontology analyses of 2239 putative targets showed that significant pathways in coffee are targeted by miRNAs. We provide evidence of the expansion of the loci related to sRNA pathways, insights into the activities of these proteins by domain and catalytic site analyses, and gene expression analysis. The number of MIR loci and their targeted pathways highlight the importance of miRNAs in coffee. We identified several roles of sRNAs in C. canephora, which offers substantial insight into better understanding the transcriptional and post-transcriptional regulation of this major crop.

Along with the core mechanism of sRNA production described above, using DCL in processing and AGOs as effectors, and additional participation of the RDR, Pol IV and Pol V in siRNA biogenesis, several other proteins interact with these core proteins to guide sRNA biogenesis, action, and turnover. These proteins have been recently reviewed [17,19]. For instance, RECEPTOR FOR ACTIVATED C KINASE 1 (RACK1) and C-TERMINAL DOMAIN PHOSPHATASE-LIKE 1 (CPL1) interact with SE and have been implicated in pri-miRNA processing [29,30]. Due to their recent emergence, the sRNA silencing pathways have not been fully elucidated, and knowledge of these pathways is constantly evolving. More recently, the protein REGULATOR OF CBF GENE EXPRESSION 3 (RCF3) has been described as a cofactor affecting miRNA biogenesis in specific plant tissues by interacting with CPL1 and CPL2 [31].
Aiming to expand the knowledge from model plants, the silencing complex has been identified in native and cultivated species, including rice (Oryza sativa) [32], common bean (Phaseolus vulgaris) [33], sorghum (Sorghum bicolor), and soybean (Glycine max) [34]. In Coffea arabica and Coffea canephora, the main economically important species of coffee, one of the most important crops in the world and the second most traded global commodity, MIR families have been identified based on Expressed Sequence Tags (EST), Genome Survey Sequences (GSS), and other transcript-based analyses [35][36][37][38].
With the release of the C. canephora genome, miRNAs were also identified [39]. However, the number of miRNAs was significantly underestimated. Moreover, the genes implicated in the generation and function of the miRNAs and siRNAs have not been described in coffee plants.
In this work, we present a thorough analysis of the identification and characterization of the small RNA-guided silencing complex in the C. canephora genome. Eleven AGO proteins; nine DCL-like proteins, including a previously unannotated DCL1; eight RDR proteins; and 48 other proteins implicated in the sRNA pathways, including HYL1, HST, HEN1, SE, and TGH, were identified. Furthermore, we conducted a conserved domain, catalytic site, and phylogenetic analysis to characterize the main proteins of the silencing pathway and validated their expression using RNA-seq libraries. We also identified 235 miRNA precursors producing 317 mature miRNAs belonging to 113 MIR families. We structurally and evolutionarily characterized and identified the putative targets of the MIR families MIR156, MIR172, and MIR390. A total of 2239 putative C. canephora miRNA targets were identified, and gene ontology analyses showed that significant pathways were targeted by miRNAs, demonstrating the importance of miRNAs in C. canephora.
The identification and analysis of the sRNA silencing pathways in C. canephora not only provide insights into the species but also provide a basis for further study of C. canephora and C. arabica regarding sRNA biogenesis and activity. The comprehension of these pathways in such an important crop provides insights into the species for further use of genetic engineering technologies available for crop breeding.

miRNA and protein prediction datasets
The C. canephora genome data and genome features were accessed and downloaded from The Coffee Genome Hub [39]. Mature plant miRNA sequences and precursor miRNA sequences were downloaded from miRBase version 21. For protein prediction, Arabidopsis (Arabidopsis thaliana) ortholog sequences were retrieved from the nucleotide and protein databases at the NCBI (National Center for Biotechnology Information).

Prediction of genes and proteins involved in the sRNA pathway in C. canephora
Putative proteins involved in the sRNA pathways were identified and selected by mining C. canephora sequences in the Coffee Genome Hub, an integrated web-based database, using the Basic Local Alignment Search Tool (BLAST) algorithm BLASTp with protein sequences from Arabidopsis as queries to search previously annotated protein-coding genes. The resulting protein sequences were retrieved for further analysis.

Prediction of mature miRNAs and their precursors (pre-miRNAs)
To search for putative conserved miRNAs and their precursors, we applied an adapted algorithm previously described by de Souza Gomes et al. (2011) to the genome and transcriptome databases of C. canephora [40]. First, the genome and transcriptome sequences of C. canephora were searched using BLASTN to identify putative hairpin-like structures. The retrieved sequences were E-inverted (EMBOSS tool) using the maximum repeat parameters of 336 nucleotides and a threshold value of 25. Then, several filters were applied based on the thermodynamics and structural characteristics of known miRNAs. These filters included a GC content (guanine and cytosine) between 20% and 65%, Minimum Free Energy (MFE), homology with known mature miRNAs, homology to repetitive regions in RepeatMasker 4.0.2 [41], and homology to non-coding RNAs, such as rRNA, snRNA, SL RNA, SRP, tRNA, and RNase P, deposited in the Rfam microRNA Registry version 11.0 [42].
The sequences of pre-miRNAs identified in C. canephora were characterized according to their structures and thermodynamic parameters. The assessed parameters included the MFE, Adjusted Minimum Free Energy (AMFE), Minimum Free Energy Index (MFEI), size, A content, U content, C content, G content, GC and AU contents, GC ratio, AU ratio, Minimum Free Energy of the thermodynamic ensemble (MFEE), Ensemble Diversity (Diversity), and frequency of the MFE structure in the ensemble (Frequency). The adjusted MFE (AMFE) was determined to be a sequence of 100 nt, and the MFEI was determined using the equation MFEI = [(AMFE) X 100]/(G% + C%)] [43,44]. The secondary structures of pre-miRNA, diversity, MFE, frequency ensemble, and MFE were predicted using RNA-fold software (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi). The GC content and other structural properties were defined using Perl scripts.

Analyses of the sRNA pathway proteins and miRNA precursors
The protein families, domains, and active sites were analyzed using PFAM (version 27.0, available at http://pfam.sanger.ac.uk) and the Conserved Domains Database (CDD; http://www. ncbi.nlm.nih.gov/cdd/). The protein sequences from C. canephora and their orthologs from different species were used to perform multiple sequence alignments using ClustalX 2.0 based on the default settings (available at http://www.clustal.org/clustal2/; [45]). The homologs and the C. canephora pre-miRNAs were aligned using ClustalX 2.0 based on the following alignment parameters: a gap opening of 22.50 and a gap extension of 0.83. They were also aligned in RNAalifold (http://rna.tbi.univie.ac.at/cgi-bin/RNAalifold.cgi). Phylogenetic trees were inferred using the neighbor-joining method, and sequence divergence was estimated using the Jones-Taylor-Thornton model for proteins [46] and Kimura's (1980) two-parameter model for pre-miRNAs [47]. Statistical reliabilities of the internal branches were assessed using 2000 bootstrap replicates for proteins and 5000 bootstrap replicates for pre-miRNAs with values greater than 30 above the branches. Molecular phylogenetic analyses were conducted using MEGA 5 software [48]. The catalytic domains of ARGONAUTE and DICER-like proteins were aligned using Clustal Omega. Pictures highlighting the catalytic residues were generated from the alignment. Multiple Em for Motif Elicitation (MEME) (Version 4.11.2) [49] was then used to find RDR-like catalytic motifs.

RNA-seq analysis
RNA-seq libraries were downloaded from the SRA (https://www.ncbi.nlm.nih.gov/sra/?term= ERP003741) for the three leaf stages (young, expanded, and old) and stems of the C. canephora samples.
For CcDCL1 prediction, the RNA-seq libraries were assembled using Trinity [50]. BLASTN was run against the assembled data using AtDCL1 as a query. The six retrieved sequences were re-assembled using CAP3 [51], and two novel contigs were formed. The protein sequence of the largest contig was predicted using GenScan (http://genes.mit.edu/GENSCAN.html).
For expression validation, the transcriptome in different tissues was assembled using the alignment of the RNA-seq reads against the C. canephora genome with the software TopHat2. The subsequent identification of new genes and alternative splicing analysis were completed with the Cufflinks package. After alignment, possible coding sequences were extracted and identified with the Trans Decoder algorithm and subjected to homology analysis with BLAST. After selecting the proteins involved in the sRNA pathways, differential expression analysis was conducted with the CuffDiff software. The results were visualized and plotted using several packages of the statistical environment R, including the cummeRbund package.

Prediction of C. canephora miRNA target genes
To search for putative target genes of the predicted miRNAs in C. canephora, transcript (CDS +UTR) sequences were retrieved from the Coffee Genome Hub (http://coffee-genome.org/ download) and from RNA-seq libraries (transcript-predicted) of two tissue types: leaves and stem. C. canephora miRNA target genes were predicted using the webtool psRNATarget [52]. To avoid false-positive predictions for the miRNA target genes, we used a stringent cutoff threshold for a maximum expectation of 2.0. The other parameters were based on default settings, which included a length for complementarity scoring (hspsize) of 20 bp, top number of target genes for each small RNA of 200, target accessibility, maximum energy to unpair the target site (UPE) of 25, flanking length around the target site for target accessibility analysis of 17 bp upstream/13 bp downstream, and a range of the central mismatch leading to translational inhibition of 9-11 nt.
Using the RNA-seq sequences, BLAST2GO was run with the resulting predicted targets for each of the miRNAs MIR156, MIR172, and MIR390. BLAST2GO began with a BLASTP search against SwissProt, followed by mapping and annotation.
GO classes of the miRNA targets were classified and grouped using the web tool SEA (Singular Enrichment Analysis) from agriGO (http://bioinfo.cau.edu.cn/agriGO/index.php) [53]. The input was the target genomic IDs, which were compared against all of the IDs of the Coffee Genome Hub.

sRNAs pathways proteins prediction and validation
The proteins involved in the miRNA pathways were identified by BLASTP in the Coffee Genome using Arabidopsis orthologs as queries. The components of the miRNA pathway, HYL1, SE, DDL and TGH [7,[9][10][11], were identified, and one copy of each of these proteins was identified in the C. canephora genome ( Table 1). Two core proteins of the sRNA pathways, HEN1 and HST, were also identified. One putative CcHEN1 and one CcHST protein were identified (Table 1). In addition, we also identified at least 48 proteins in the C. canephora genome associated with the sRNA pathways described in the literature (S1 Table).
The core proteins of the sRNA pathways-DCL-like, AGO-like, and RDR-like-were identified and characterized as described below. The C. canephora protein name, locus position, length, and identity with their respective orthologs from Arabidopsis are presented in Table 2.
The annotated protein-coding sequences identified from the BLASTP of the DCL-like search in the Coffee Genome Hub were retrieved, and conserved domain analysis revealed that nine of these sequences contained DCL-like conserved domains ( Table 3). Two of the sequences (Cc02_14900 and Cc02_14910) that are sequential in chromosome 2 presented complementary domains of a DCL protein. Then, the genomic region comprising both contigs was retrieved, and the resulting protein was predicted using GenScan (http://genes.mit.edu/ GENSCAN.html) and used for further analyses.
Multiple alignments with ortholog DCLs from other angiosperm species and phylogenetic analyses were performed to assign the coffee DCLs and to determine the evolutionary relationship among species. One DCL3, one DCL4, and six DCL2s were assigned. No DCL1 was found using this approach, then we identified one putative CcDCL1 from RNA-seq libraries. Conserved domain analysis (Table 3) of the resulting sequence confirmed a DCL protein, and BLASTP at the NCBI database matched DCL1 proteins with 99% coverage and an E-value of 0. The sequence was then searched by tBLASTN in the Coffee Genome Hub and aligned with a genomic sequence in chromosome 0, an arbitrary pseudochromosome created with all of the unmapped sequences from the 11 chromosomes [39] (S1 Fig). Therefore, although present in the genome assembly, the CcDCL1 was not previously annotated as a protein-coding gene on the Coffee Genome Hub.
The new phylogenetic analysis, including the putative CcDCL1, generated a tree in which the CcDCL clustered similarly to their respective orthologs from other species (Fig 1). In total, nine DCL-like proteins were found in the C. canephora genome ( Table 2) and were distributed in four distinct clades in the phylogenetic tree (Fig 1); the clades matched the four paralogous DCL-like proteins described in Arabidopsis [57].
The DCL proteins have six domains types, DExD-helicase (DExDc), Helicase-C (HELICc), Duf283, PAZ, RNAse III (RIBOc), and double-stranded RNA-binding (dsRB), although some of these may not be present [58]. Conserved domain analysis (Table 3) revealed that the CcDCL1-like and CcDCL4-like proteins contain DExD, Helicase-C, Dicer-dimer, PAZ, two RNAse III (RIBOc), and two dsRB (DSRM) domains. The CcDCL3-like, CcDCL2.1-like, and DCL4-like proteins contain no DSRM domains. The CcDCL2 proteins have five more paralogs, which appear to be partial sequences lacking the N-terminal domains (DExD, Helicase-C, and DUF283). These sequences also lack one (CcDCL2.3, CcDCL2.4, and CcDCL 2.6) or two (CcDCL2.5) DSRM domains. The shortest CcDCL2-like protein, CcDCL2.3, also lacks a PAZ domain. We also analyzed the conservation of the RNase III catalytic sites of CcDCL-like proteins in the two RNase III domains (RIBOc I and II): glutamate (E), aspartate (D), glutamate (D), aspartate (E) (EDDE) [59]. CcDCL1, CcDCL2.1, CcDCL3, and CcDCL4 contain these conserved catalytic residues (Fig 2).  [54]. A BLASTP search using AtAGO as a query in the Coffee Genome Hub resulted in 12 C. canephora protein-coding sequences, which were retrieved and subjected to Conserved Domain analysis to confirm the presence of the conserved domains of ARGONAUTE proteins (N-terminal, PAZ, ArgoMid, and PIWI). Two of the sequences (Cc04_g10830 and Cc04_g10840) that were found sequentially in Chromosome 4 presented as partial sequences, one containing a PIWI domain (Cc04_g10830) and the other containing a PAZ (Cc04_g10840) domain. The genomic sequence comprising both contigs was retrieved, and the protein product was predicted using GenScan (http://genes.mit.edu/GENSCAN.html). BLASTP and Conserved Domain analysis confirmed an AGO protein that was considered for further analyses. Therefore, in total, eleven putative AGO proteins comprising seven homologs were found in C. canephora (Table 2).
Conserved domain analysis confirmed the presence of the N-terminal, PAZ, and PIWI domains in all sequences but showed an only variable presence of ArgoMid (Table 4). AGO1 proteins have an additional glycine-rich region at the N-terminus (Gly-rich_Ago1), which was present in one putative AGO sequence. To further determine the evolutionary conservation and assign the AGO-like proteins found in C. canephora, we compared the sequences to orthologs from other angiosperm species on a phylogenetic tree. The eleven AGO proteins were assigned and found to cluster with their closest orthologs from other species; the C. canephora AGO proteins also similarly grouped into three major phylogenetic clades [17,61]: one AGO1, one AGO5, and two AGO10s in Clade I; two AGO2s and one AGO7 in Clade II; and three AGO4s in Clade III (Fig 3). One AGO16 was also identified, which grouped with the AGO4s in Clade III. A similar pattern has been found in rice, maize, Arabidopsis, soybean, sorghum, and other species, indicating the conservation of small RNA functions in higher plants [34].
To investigate whether CcAGOs possess conserved catalytic residues and could potentially act as the slicer component of RISC, we aligned the PIWI domains of all of the CcAGOs and searched for the Asp-Asp-His (DDH) catalytic triad in CcAGOs and for a residue corresponding to the conserved H798 residue of AtAGO1 [62]. Four proteins (CcAGO1, CcAGO5, CcAGO7, and CcAGO10.1) possessed the conserved DDH/H798 residues (Table 5). In four CcAGOs, the DDH catalytic motif was conserved, but the H798 was replaced by a serine (CcAGO16), proline (CcAGO4. 2   In C. canephora, eight putative RDR proteins were found after BLASTP on the Coffee Genome Hub. Conserved domain analysis confirmed the presence of the RNA-dependent RNA polymerase (RdRP) domain, and Multiple Em for Motif Elicitation (MEME) (Version 4.11.2) [49] analysis revealed that six coffee RDR proteins possess a DLDGD motif and two possess a DFDGD motif (Fig 4). Multiple alignments with orthologs sequences and phylogenetic tree analysis were also performed to assign the coffee RDR proteins and to determine the  [46]. The bootstrap consensus tree inferred from 2000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (2000 replicates) are shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the JTT matrix-based method and are in the units of the number of amino acid substitutions per site [48]. The analysis involved 33 amino acid sequences. All positions containing gaps and missing data were eliminated. There were a total of 286 positions in the final dataset.   RNA-guided silencing pathway in coffee evolutionary relationship with the other angiosperm species. Four RDRs corresponded to RDR1, one to RDR2, one to RDR6, and two to RDR3 (Fig 5). The name, locus position, length, and identity of the CcRDR proteins with their respective orthologs from Arabidopsis are presented in Table 2.
In Arabidopsis, the six RDR proteins are divided into four families: RDR1, RDR2, RDR3 (RDR3a and RDR3b), and RDR6 [63]. RDR1, RDR2, and RDR6 function in the formation of dsRNA from ssRNA sequences, which are processed into several types of siRNAs targeting specific endogenous loci [64]. Among the six Arabidopsis RDR genes, AtRDR1, AtRDR2, and AtRDR6 are involved in processes such as viral resistance, chromatin silencing, and Post-Translational Gene Silencing (PTGS) [65]. The function of the RDR3 genes remains unknown, but the presence of at least one copy of the RDR3 gene in several plant genomes and other organisms suggests that these proteins may have functional significance [66].
In the phylogenetic tree, two main clades are observed, one consisting of RDR1, RDR2, and RDR6 and the other consisting of RDR3. This observation is consistent with the division of the two clades predicted based on their catalytic motifs (Fig 5). Although we found two RDR3 genes in C. canephora, similarly to tomato (SlRDR3a and SlRDR3b), the two CcRDR3 genes grouped with SlRDR3a (Fig 5).
To confirm the expression of the main RNA-silencing components, we searched the RNAseq data of Coffea canephora publicly available in the Sequence Read Archive (SRA) of the NCBI (https://www.ncbi.nlm.nih.gov/sra/?term=ERP003741). Sequencing data of leaves collected at different development stages (young, expanded, and old) and stem tissues were analyzed to determine the expression profile of the sRNA silencing components identified in coffee, including CcAGO, CcDCL, CcRDR, CcHYL1, CcSE, CcDDL, CcTG, CcHEN1, and Phylogenetic tree showing relationships between the paralogous and orthologs proteins of the AGO family. The evolutionary history was inferred using the Neighbor-Joining method [46]. The bootstrap consensus tree inferred from 2000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (2000 replicates) are shown next to the branches. The evolutionary distances were computed using the JTT matrix-based method and are in the units of the number of amino acid substitutions per site [48]. The analysis involved 55 amino acid sequences. All positions containing gaps and missing data were eliminated. There were a total of 333 positions in the final dataset. https://doi.org/10.1371/journal.pone.0176333.g003

RNA-guided silencing pathway in coffee
CcHST. The heatmap showed expression in all the tested tissues (Fig 6). However, Cufflinks analysis assigned three loci annotated as DCL2 in the coffee genome (Cc02_g14900, Cc02_g14910, and Cc02_g14920 -herein referred to as DCL2.2 and DCL2.3) as isoforms of the same genetic locus; therefore, these were not included in the heatmap (S3 Fig). Furthermore, CcAGO4.1 was not expressed in any of the tissues.

miRNAs and miRNA target prediction
Homology-based miRNA search was conducted by comparing plant miRNAs deposited in the miRBase database version 21 against the coffee genome. After applying filters to retrieve miRNA precursors, a total of 235 precursors and 317 mature miRNAs were identified and characterized, belonging to 113 MIR families (S2 Table). The mature miRNAs were found in both the 3' and 5' arms of the precursor, with sizes ranging from 19 to 25 nt, most of which were 21 nt (S2 Table). The preferred first 5' nucleotide was Uracil (U). The location of the pre-miRNAs in the genome was determined, including the chromosome, start and end point, strand position, and genic/intergenic position (S2 Table). MIR genes were observed in all chromosomes, and chromosome 2 contained the highest number of MIR genes (36 genes). A total of 38 precursors were found either in antiparallel clusters or clustered with a maximum distance of 10 kb between the two miRNAs, but most were widespread throughout the chromosomes. A total of 193 precursors were identified in the intergenic regions, and the other 43 precursors were found within genes (S2 Table). The precursor sizes varied from 68 to 338 nt, and the AU (Adenine+Uracil) content ranged from 41% to 69% (S3 Table). The thermodynamic aspects of the precursors-Minimal Free Energy (MFE), adjusted MFE (AMFE), MFE index (MFEI), Minimal Free Energy of the thermodynamic ensemble (MFEE), Ensemble Diversity (Diversity) and frequency of the MFE structure in the ensemble (Frequency)-were measured (S3 Table). The MFE ranged from -21.9 to -97.5 kcal mol -1 , with a mean of -56.4 kcal mol -1 ; the AMFE ranged from -21.4 to -59.6 kcal mol -1 , with a mean of -36.46 kcal mol -1 ; and the MFEI varied from 0.7 to 1.7, with a mean of 0.88.
We chose some of the highly conserved MIR families-MIR156, MIR172, and MIR390 -for further characterization. We analyzed the conservation of their sequences and structure as well as their phylogenetic distributions. For each of these MIR families, multiple sequence alignment and secondary structure prediction were performed to verify the primary and secondary conservation relative to other plant species orthologs (Figs 7-9). These MIR families presented high conservation between their primary and secondary structures and their orthologs (Figs   Fig 5. Phylogenetic tree of RDR proteins identified in C. canephora. Phylogenetic tree showing relationships between the paralogous and orthologs proteins of the RDR family. The evolutionary history was inferred using the Neighbor-Joining method [46]. The bootstrap consensus tree inferred from 2000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (2000 replicates) are shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the JTT matrix-based method and are in the units of the number of amino acid substitutions per site [48]. The analysis involved 33 amino acid sequences. All positions containing gaps and missing data were eliminated. There were a total of 312 positions in the final dataset.
We also identified potential miRNA target genes using psRNATarget [67] based on the C. canephora genome. In total, 2239 genes were identified as potential targets of the miRNAs, many of which were targeted by more than one miRNA (S4 Table).
To classify and group the Gene Ontology (GO) classes of the miRNA targets, the web tool SEA (Singular Enrichment Analysis) from agriGO (http://bioinfo.cau.edu.cn/agriGO/index. php) was used [53]. A total of 1356 GO terms were annotated for the target genes in C. canephora, and these were summarized in 57 main terms. The genes belonging to the 25 overrepresented terms among the three GO categories, namely the biological process (Fig 10A), molecular function (Fig 10B), and cellular component (Fig 10C) categories, are presented.
We further identified the putative targets of ccp-MIR156, ccp-MIR172, and ccp-MIR390 in the RNA-seq libraries of stem and leaf tissues. The complete list of the targets assigned to these miRNAs is presented in S5 Table. Discussion Duplication events and domain and catalytic site configurations reveal insights into the sRNA pathway core members in C. canephora Duplication of DCL2 has been observed in several species [56,68,69]. The largest of the six CcDCL2 members, CcDCL2.1, is located on chromosome 9 and is missing its DsRB (DSRM) domain. DCL2 usually contains only one DsRB (DSRM) domain, but in the four tomato DCL2s, only one member (SlDCL2d) possesses a DsRB (DSRM) domain [55]. The shortest CcDCL2 identified, CcDCL2.5 (354 aa), is located on chromosome 6, along with CcDCL2.6 (762 aa). Both of these proteins are truncated. Similar findings were observed for CcDCL2.2, CcDCL2.3, and CcDCL2.4, which are located sequentially on chromosome 2 and are also incomplete according to the current version of the genome annotation.
Expression analyses demonstrated that at least four DCL2-like genes are active in coffee (Fig 6 and S3 Fig), including the only complete sequence, CcDCL2.1. The other two DCL2 genes that are expressed are DCL2.4 (Cc02_g14930) and DCL2.6 (Cc06_g19980) (Fig 6). In addition to that, a total of seven isoforms were assigned to the same locus (Cc02_g14900) (S3 Fig). This might indicate misannotation of the three DCL2 assigned to the sequential loci at Chromosome 2 (Cc02_g14900, Cc02_g14910 and Cc02_g14920), which are probably exons of a unique gene. Finally, DCL2.5 (Cc06_g19770), which is the most incomplete DCL2 annotated in the genome, is not expressed in either tissue and could not be confirmed. Although it remains unclear how many DCL-like proteins are present and where on the genome their complete sequence can be found, an expansion of the DCL-like proteins appears to have occurred in C. canephora through the duplication of the DCL2-like family.
DCL-like proteins might contain the characteristic catalytic residues of RNase III domaincontaining proteins [59]. The RNase III domains bind dsRNA and are responsible for cleavage and processing; therefore, they are essential to sRNA generation [58]. Only the incomplete CcDCL2 (CcDCL2.2-CcDCL2.6) proteins did not present the conserved residues (EDDE-Glu-Asp-Asp-Glu) in one or both RNAse III domains, reinforcing the need for further investigation into these short CcDCL2-like proteins.
The presence of CcAGO10, CcAGO2, and CcAGO4 paralogs indicates the occurrence of duplication events in the C. canephora genome. Gene duplication is one possible reason for the expansion of AGO proteins. The expansion of the AGO family in flowering plants suggests functional diversification of the AGO proteins [61]. vinifera, tcc-Theobroma cacao, ptc-Populus trichocarpa, aly-Arabidopsis lyrata, sly-Solanum lycopersicum. The evolutionary history was inferred using the Neighbor-Joining method [46]. The bootstrap consensus tree inferred from 5000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (5000 replicates) are shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Kimura 2-parameter method [3] and are in the units of the number of base substitutions per site [47]. The analysis involved 23 nucleotide sequences. All positions containing gaps and missing data were eliminated. There were a total of 68 positions in the final dataset.
https://doi.org/10.1371/journal.pone.0176333.g007 PIWI domains contain the three conserved metal-chelating residue motif aspartate, aspartate, histidine (DDH). The DDH motif functions as a catalytic triad. A conserved histidine found at position 798 of AtAGO1 is also important for the catalytic function of the AGO proteins [62]. The four CcAGO proteins that possess the DDH/H motif (CcAGO1, CcAGO5, CcAGO7, and CcAGO10.1) potentially act as the slicer of RISC (Table 5). CcAGO2.1 and CcAGO2.2 showed a third aspartate residue instead of histidine, which was also observed in SlAGO2 [55], AtAGO2 and AtAGO3 [56]; GmAGO3a and SbAGO2 [34]; and OsAGO2 and OsAGO3 [56]. The absence of catalytic amino acids could prevent the processing of target RNA by cleavage; therefore, accessory factors for mediating mRNA turnover could be required [56]. However, the presence of a third aspartate in the triad restores the catalytic activity to function as slicer components of the silencing effector complexes in Arabidopsis and rice AGO2 and AGO3 [56].
CcAGO4.1 presented neither of the residues required for catalytic activity, which could represent either functionalization or loss of function. CcAGO4.1 expression was not found in the RNA-seq libraries, corroborating the hypothesis that this protein is not active due to a lack of effective catalytic residues. However, AGO4 proteins can function either dependent on or independent of their catalytic activity [70]. The expression of CcAGO4.2 and CcAGO4.3 indicates that Transcriptional Gene Silencing (TGS) guided by RNA is upregulated in coffee because AGO4 has been implicated in RNA-Directed DNA Methylation (RdDM) [71].
Interestingly, four RDR1 genes were found in C. canephora, all of which were located sequentially on chromosome 11 (Table 2). RDR1 is involved in plant defenses against biotic and abiotic components [17,73]. Most enriched GO terms in C. canephora belong to the defense response class [39]. It was also observed that the C. canephora genome includes several species-specific gene family expansions, including the defense-related genes [39]; this could also be the case for the RDR1 genes. ccp-Coffea canephora, ath-Arabidopsis thaliana, cme-Cucumis melo, gma-Glycine max, lus-Linum usitatissimum, mtr-Medicago truncatula, vvi-Vitis vinifera, bra-Brassica rapa, stu-Solanum tuberosum, nta-Nicotiana tabacum, aly-Arabidopsis lyrata, mdm-Malus domestica. The evolutionary history was inferred using the Neighbor-Joining method [46]. The bootstrap consensus tree inferred from 5000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (5000 replicates) are shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Kimura 2-parameter method and are in the units of the number of base substitutions per site [47]. The analysis involved 28 nucleotide sequences. All positions containing gaps and missing data were eliminated. There were a total of 46 positions in the final dataset. https://doi.org/10.1371/journal.pone.0176333.g008 RNA-guided silencing pathway in coffee The C. canephora genome possesses several conserved and nonconserved MIR loci that target major cellular processes Using a robust pipeline, we were able to significantly enrich the number of predicted miRNAs annotated in Coffea spp [35][36][37][38][39]. We identified 235 precursors and 317 mature sequences, whereas previous analyses of the coffee genome identified only 92 precursors [39]. The precursors belonged to 113 MIR families, representing a considerable increase relative to the 33 families originally described in the coffee genome report [39]. Our stringent and robust pipeline predicted sequences that were real miRNA precursors and identified more paralogous loci for several families already described.
Some of the most conserved families in plants, MIR156, MIR172, and MIR390 [43], have been identified in several species [33,43,[75][76][77] and play central roles in plant development and stress responses. For instance, miR156 targets SQUAMOSA PROMOTER BINDING PROTEIN-LIKE (SPL) transcription factor family members, and miR156-SPL networks define an essential regulatory module that controls phase transitions, leaf trichome development, male fertility, embryonic patterning, and anthocyanin biosynthesis [78][79][80][81][82]. In the C. canephora genome, miR156 has 24 putative targets (S4 Table). Based on the transcriptomes of the stem and leaf tissue samples, we found that miR156 potentially targets SPL-6 and SPL-12 in both tissues (S5 Table). In total, 15 putative targets were identified in the stems and leaves, some of which were identified either in both tissues or in only one (S5 Table).
The MIR172 family consists of five precursors and ten mature miRNAs (S2 Table). This highly conserved family is found in several species and is related to the regulation of flowering time and floral organ identity by targeting APETALA2-like transcription factors in Arabidopsis [83,84]. miR172 acts downstream of miR156 to regulate phase transition [84], as an increase in miR156 levels corresponds to lower expression of miR172 and vice versa in several species [84][85][86][87]. In the C. canephora genome, 118 putative targets for miR172 were identified (S4 Table). Based on the transcriptome data, a total of 66 putative targets were identified, including AP2 in stem tissue (S5 Table). miR390 is involved in the regulation of development and the response to several stresses [88][89][90][91]. Among its targets, miR390 regulates the Auxin Response Factor (ARF) by mediating non-protein coding Trans-Acting siRNA locus 3 (TAS3) generation in an AGO7-dependent manner [92]. miR390 also targets Leucine-Rich Repeat Receptor-like kinases (LRK) and regulates a LRK protein in Oryza sativa in response to cadmium stress [91]. In the C. canephora genome, 11 putative targets were identified (S4 Table). Four putative targets were identified in ccp-Coffea canephora, aly-Arabidopsis lyrata, ath-Arabidopsis thaliana, bna-Brassica napus, gma-Glycine max, ptc-Populus trichocarpa, sly-Solanum lycopersicum, mdm-Malus domestica, tcc-Theobroma cacao, lus-Linum usitatissimum. The evolutionary history was inferred using the Neighbor-Joining method [46]. The optimal tree with the sum of branch length = 1.87754489 is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (5000 replicates) are shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Kimura 2-parameter method and are in the units of the number of base substitutions per site [47]. The analysis involved 22 nucleotide sequences. All positions containing gaps and missing data were eliminated. There were a total of 65 positions in the final dataset.
The ccp-MIR156, ccp-MIR172, and ccp-MIR390 members were highly conserved in their primary and secondary structures relative to their respective orthologs from other species and relative to their distributions within the phylogenetic tree in a clade of Eudicotyledons, consistent with plant phylogeny (Figs 7-9) [93].
The GO terms of the putative C. canephora miRNA targets were categorized and compared with the GO terms of the whole genome as background (Fig 10). In total, 1356 GO terms were assigned to the putative targets, including a total of 14975 GO terms annotated to the genome. The main overrepresented subcategories belonging to the 'Biological Process' category were 'cellular process' and 'metabolic process'. In the 'Cellular Component' category, the main overrepresented terms were 'cell part' and 'cell'. In the 'Molecular Function' category, the main overrepresented terms were 'catalytic activity' and 'binding'. Interestingly, the main categories of the potential targets were also the main categories annotated for the genome (green bars- Fig 10). Therefore, one can infer that miRNAs in C. canephora target major cellular processes.
Considering the importance of this pioneering work, we elucidated several aspects of sRNAs in C. canephora, which offers a significant step towards a better understanding of the transcriptional and post-transcriptional regulation of this major crop. An understanding of the sRNA pathways in coffee provides insights for plant breeding through genetic engineering technology.