Expression and Regulation of PIWIL-Proteins and PIWI-Interacting RNAs in Rheumatoid Arthritis

Objective The PIWIL (P-element induced wimpy testis like protein) subfamily of argonaute proteins is essential for Piwi-interacting RNA (piRNA) biogenesis and their function to silence transposons during germ-line development. Here we explored their presence and regulation in rheumatoid arthritis (RA). Methods The expression of PIWIL genes in RA and osteoarthritis (OA) synovial tissues and synovial fibroblasts (SF) was analysed by Real-time PCR, immunofluorescence and Western blot. The expression of piRNAs was quantified by next generation small RNA sequencing (NGS). The regulation of PIWI/piRNAs, proliferation and methylation of LINE-1 after silencing of PIWIL genes were studied. Results PIWIL2 and 4 mRNA were similarly expressed in synovial tissues and SF from RA and OA patients. However, on the protein level only PIWIL4 was strongly expressed in SF. Using NGS up to 300 piRNAs were identified in all SF without significant differences in expression levels between RA and OASF. Of interest, the analysis of the co-expression of the detected piRNAs revealed a less tightly regulated pattern of piRNA-823, -4153 and -16659 expression in RASF. In RASF and OASF, stimulation with TNFα+IL1β/TLR-ligands further significantly increased the expression levels of PIWIL2 and 4 mRNA and piRNA-16659 was significantly (4-fold) induced upon Poly(I:C) stimulation. Silencing of PIWIL2/4 neither affect LINE-1 methylation/expression nor proliferation of RASF. Conclusion We detected a new class of small regulatory RNAs (piRNAs) and their specific binding partners (PIWIL2/4) in synovial fibroblasts. The differential regulation of co-expression of piRNAs in RASF and the induction of piRNA/Piwi-proteins by innate immune stimulators suggest a role in inflammatory processes.


Introduction
The analyses reported in this document are part of the stats-edgeR project. The aim is to find features that are differentially expressed between OA and RA. The statistical analysis process includes data normalization, graphical exploration of raw and normalized data, test for differential expression for each feature between the conditions, raw p-value adjustment and export of lists of features having a significant differential expression between the conditions. The analysis is performed using the R software [R Core Team, 2014], Bioconductor [Gentleman, 2004] packages including edgeR [Robinson, 2010] and the SARTools package developed at PF2 -Institut Pasteur. Normalization and differential analysis are carried out according to the edgeR model and package. This report comes with additional tab-delimited text files that contain lists of differentially expressed features.
For more details about the edgeR methodology, please refer to its related publications [Robinson, 2007, 2008, 2010and McCarthy, 2012.

Description of raw data
The count data files and associated biological conditions are listed in the following table.
After loading the data we first have a look at the raw data Looking at the summary of the count table provides a basic description of these raw counts (min and max values, median, etc). Figure 1 shows the total number of mapped reads for each sample. Reads that map on multiple locations on the transcriptome are counted more than once, as far as they are mapped on less than 50 different loci. We expect total read counts to be similar within conditions, they may be different across conditions. Total counts sometimes vary widely between replicates. This may happen for several reasons, including: different rRNA contamination levels between samples (even between biological replicates); slight differences between library concentrations, since they may be difficult to measure with high precision.      It may happen that one or a few features capture a high proportion of reads (up to 20% or more). This phenomenon should not influence the normalization process. The edgeR normalization has proved to be robust to this situation [Dillies, 2012]. Anyway, we expect these high count features to be the same across replicates. They are not necessarily the same across conditions. Figure 4 and table 4 illustrate the possible presence of such high count features in the data set.
(counts + 1) log 2 Figure 4: Percentage of reads associated with the sequence having the highest count (provided in each box on the graph) for each sample.
We may wish to assess the similarity between samples across conditions. A pairwise scatter plot is produced (figure 5) to show how replicates and samples from different biological conditions are similar or different ( are used instead of raw count values). Moreover, as the Pearson correlation has been shown not to be relevant to measure the similarity between replicates, the SERE statistic has been proposed as a similarity index between RNA-Seq samples [Schulze, 2012]. It measures whether the variability between samples is random Poisson variability or higher. Pairwise SERE values are printed in the lower triangle of the pairwise scatter plot. The value of the SERE statistic is: 0 when samples are identical (no variability at all: this may happen in the case of a sample duplication); 1 for technical replicates (technical variability follows a Poisson distribution); greater than 1 for biological replicates and samples from different biological conditions (biological variability is higher than technical one, data are over-dispersed with respect to Poisson). The higher the SERE value, the lower the similarity. It is expected to be lower between biological replicates than between samples of different biological conditions. Hence, the SERE statistic can be used to detect inversions between samples. (counts + 1) log 2

Filtering low counts
edgeR suggests to filter features with null or low counts because they do not supply much information. For this project, 162 features (31.46%) have been removed from the analysis because they did not satisfy the following condition: having at least 1 counts-per-million in at least 9 samples. Seite 9 von 15 Statistical report 11.2.2016 file:///G:/RUZ_Exp_Rheuma/Fellows/Astrid%20Jüngel/2-Anderes/PIWIpiRNA%20sh...

Variability within the experiment: data exploration
The main variability within the experiment is expected to come from biological differences between the samples. This can be checked in two ways. The first one is to perform a hierarchical clustering of the whole sample set. This is performed after a transformation of the count data as moderated log-counts-per-million. Figure 6 shows the dendrogram obtained from CPM data. An euclidean distance is computed between samples, and the dendrogram is built upon the Ward criterion. We expect this dendrogram to group replicates and separate biological conditions.

Normalization
Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by edgeR is called Trimmed Mean of M-values (TMM) but it is also possible to use the RLE (DESeq) or upperquartile normalizations. It relies on the hypothesis that most features are not differentially expressed. edgeR computes a factor for each sample. These normalization factors apply to the total number of counts and cannot be used to normalize read counts in a direct manner. Indeed, normalization factors are used to normalize total counts. These in turn are used to normalize read counts according to a total count normalization: if is the total number of reads of the sample and its normalization factor, is the normalized total number of reads. Then, let with the mean of the s. Finally, the normalized counts of the sample are defined as where is the gene index.
Boxplots are often used to assess the quality of the normalization process, as they show how distributions are globally affected during this process. We expect normalization to stabilize distributions across samples. Figure 8 shows boxplots of raw (left) and normalized (right) data respectively.
x ij s j i 6 Differential analysis 6.1 Modelization edgeR aims at fitting one linear model per feature. For this project, the design used is ~ group and the goal is to estimate the models' coefficients which can be interpreted as . These coefficients will then be tested to get p-values and adjusted p-values.

Dispersions estimation
The edgeR model assumes that the count data follow a negative binomial distribution which is a robust alternative to the Poisson law when data are over-dispersed (the variance is higher than the mean). The first step of the statistical procedure is to estimate the dispersion of the data. Figure 9: Dispersion estimates. Figure 9 shows the result of the dispersion estimation step. The x-and y-axes represent the mean count value and the estimated dispersion respectively. Black dots represent empirical dispersion estimates for each feature (from the observed count values). The blue curve shows the relationship between the means of the counts and the dispersions modeled with splines. The red segment represents the common dispersion.

Statistical test for differential expression
Once the dispersion estimation and the model fitting have been done, edgeR can perform the statistical testing. Figure 10 shows the distributions of raw p-values computed by the statistical test for the comparison(s) done. This distribution is expected to be a mixture of a uniform distribution on and a peak around 0 corresponding to the differentially expressed features.

Final results
A p-value adjustment is performed to take into account multiple testing and control the false positive rate to a chosen level . For this analysis, a BH p-value adjustment was performed [Benjamini, 1995 and2001] and the level of controlled false positive rate was set to 0.05. Figure 11 represents the MA-plot of the data for the comparisons done, where differentially expressed features are highlighted in red. A MA-plot represents the log ratio of differential expression as a function of the mean intensity for each feature. Triangles correspond to features having a too   Figure 12 shows the volcano plots for the comparisons performed and differentially expressed features are still highlighted in red. A volcano plot represents the log of the adjusted P value as a function of the log ratio of differential expression. Full results as well as lists of differentially expressed features are provided in the following text files which can be easily read in a spreadsheet. For each comparison: TestVsRef.complete.txt contains results for all the features; TestVsRef.up.txt contains results for up-regulated features. Features are ordered from the most significant adjusted p-value to the less significant one; TestVsRef.down.txt contains results for down-regulated features. Features are ordered from the most significant adjusted p-value to the less significant one.
These files contain the following columns: Id: unique feature identifier; sampleName: raw counts per sample; norm.sampleName: rounded normalized counts per sample; baseMean: base mean over all samples; OA and RA: means (rounded) of normalized counts of the biological conditions; FoldChange: fold change of expression, calculated as ; log2FoldChange: as estimated by the GLM model. It reflects the differential expression between Test and Ref and can be interpreted as . If this value is: around 0: the feature expression is similar in both conditions; positive: the feature is up-regulated ( ); negative: the feature is down-regulated ( ); pvalue: raw p-value from the statistical test; padj: adjusted p-value on which the cut-off is applied; tagwise.dispersion: dispersion parameter estimated from feature counts (i.e. black dots on figure 9); trended.dispersion: dispersion parameter estimated with splines (i.e. blue curve on figure 9).