Extracting information from RNA SHAPE data: Kalman filtering approach

doi:10.1371/journal.pone.0207029

Fig 1.

The mean-dependence in the standard deviation of SHAPE measurements.

Data from 5 SHAPE replicates obtained on the cucumber mosaic virus RNA3 sequence (experiments performed on data from infected plant cell lysates) [51]. For each nucleotide, the mean value of the 5 measurements was calculated and plotted against their standard deviation on a log-log plot. A linear fit is overlaid in red. Note that negative reactivity values were not included, as they are incompatible with the log-log plot.
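The mean versus standard deviation fit described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `loglog_fit`, the use of the sample standard deviation (`ddof=1`), and the choice to drop nucleotides whose mean is non-positive or whose standard deviation is zero are all assumptions.

```python
import numpy as np

def loglog_fit(replicates):
    """Fit log10(std) = slope * log10(mean) + intercept across nucleotides.

    replicates: array of shape (n_replicates, n_nucleotides).
    """
    replicates = np.asarray(replicates, dtype=float)
    mean = replicates.mean(axis=0)
    std = replicates.std(axis=0, ddof=1)  # sample standard deviation (assumed)
    # Assumption: points that cannot be placed on a log-log plot are dropped.
    keep = (mean > 0) & (std > 0)
    slope, intercept = np.polyfit(np.log10(mean[keep]), np.log10(std[keep]), 1)
    return slope, intercept
```

A slope close to 1 would indicate that the standard deviation scales roughly in proportion to the mean, which is the behavior expected under multiplicative noise.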

Fig 2.

A conceptual representation of measurement combination methods under the log-normal noise model for three SHAPE replicates.

Replicates are first transformed into the log domain. The log-average and KF (K) profiles are then computed. The resulting profiles are transformed back to the data domain.

Fig 3.

Log domain standard deviation values of measurements coming from real SHAPE data.

Standard deviation values were calculated for each nucleotide on log measurements. (a) Histogram of standard deviation values. (b) Empirical CDF of standard deviation values. The shaded regions correspond to our definition of low, medium, and high noise regimes. All non-positive measurements were removed from the initial set of data. Nucleotides with a single positive measurement were excluded so that a total of 3723 data points were considered.
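The filtering and summary steps in this caption can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the thresholds that separate the low, medium, and high noise regimes are not given here, so they are not reproduced.

```python
import numpy as np

def log_domain_std(measurements):
    """Per-nucleotide standard deviation of log measurements.

    measurements: array of shape (n_replicates, n_nucleotides).
    Non-positive values are ignored; nucleotides with fewer than two
    positive measurements are excluded from the result.
    """
    m = np.asarray(measurements, dtype=float)
    positive = m > 0
    # Replace non-positive entries by 1 before the log to avoid warnings,
    # then mask them out as NaN.
    logs = np.where(positive, np.log(np.where(positive, m, 1.0)), np.nan)
    keep = positive.sum(axis=0) >= 2
    return np.nanstd(logs[:, keep], axis=0, ddof=1)

def empirical_cdf(values):
    """Return sorted values and the fraction of points at or below each."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y
```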

Table 1.

Summary of RNA sequences with SHAPE profiles included in the database.

Fig 4.

Comparison of log-averaging and Kalman filtering using (a) N = 3 and (b) N = 10 simulated replicates under the log-normal noise model.

The vertical axis represents the data domain ground truth reactivity, s_m. The horizontal axis represents the log domain standard deviation of the simulated measurements. Nucleotides were binned based on s_m and the standard deviation values. The left panel shows RMS errors calculated between ground truth and log-averaged reactivities for all nucleotides in a bin. The right panel shows RMS errors calculated between ground truth and Kalman filtered reactivities for all nucleotides in a bin. Error calculations were carried out in the log domain and the ground truth values were the log reactivities. See Methods for RMS calculation details.
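A binned log domain RMS error of the kind described in this caption can be sketched as below. The exact calculation is given in the paper's Methods and is not reproduced here, so this is an assumption-laden illustration: the function name, the argument name `sigma` for the log domain standard deviation, and the bin edges are hypothetical.

```python
import numpy as np

def binned_log_rms(truth, estimate, sigma, s_edges, sigma_edges):
    """RMS of log-domain errors within 2D bins of (ground truth, sigma).

    truth, estimate: positive data domain reactivities, one per nucleotide.
    sigma: log domain standard deviation per nucleotide.
    Returns an array of shape (len(s_edges) - 1, len(sigma_edges) - 1);
    bins with no nucleotides are NaN.
    """
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    sq_err = (np.log(estimate) - np.log(truth)) ** 2
    bins = [s_edges, sigma_edges]
    sums, _, _ = np.histogram2d(truth, sigma, bins=bins, weights=sq_err)
    counts, _, _ = np.histogram2d(truth, sigma, bins=bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.sqrt(sums / counts)
```

Calling the function once with log-averaged estimates and once with Kalman filtered estimates would give the two grids that such a side-by-side comparison needs.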

Fig 5.

Comparison of the log-average and Kalman filter approaches using N = 2 to N = 10 replicates simulated at (a) low, (b) medium, and (c) high noise levels under the log-normal noise model.

RMS errors were calculated between ground truth and log-averaged reactivities (solid line) and between ground truth and Kalman filtered reactivities (dotted line) over the entire set of nucleotides. Error calculations were carried out in the log domain and the ground truth values were the log reactivities. See Methods for RMS calculation details. In the low noise regime, only a negligible difference between the log-averaging and Kalman filtering approaches is observed. However, in the higher noise regimes, the Kalman filtering approach better recovers the ground truth. This advantage becomes marginal once the replicate count is increased beyond 4. Note that errors increase with increasing noise levels.

Fig 6.

KF results as the prior mean and standard deviation are varied for N = 3 replicates simulated at (a) low, (b) medium, and (c) high noise levels under the log-normal noise model.

The horizontal axis represents an increase in the prior standard deviation, σ_m,0. The vertical axis represents the offset, μ_offset, which was added to the ground truth log reactivity to define the prior mean. The value of each bin is the RMS error calculated over all nucleotides in our database between the ground truth and Kalman filtered reactivities. Error calculations were carried out in the log domain and the ground truth values were the log reactivities. See Methods for RMS calculation details.

Fig 7.

Kalman filtering results using an inaccurate (biased) prior improve with increased uncertainty in the prior.

RMS errors were calculated over all nucleotides in our database. Error calculations were carried out in the log domain and the ground truth values were the log reactivities. See Methods for RMS calculation details. The prior used in the KF was biased by adding the offset μ_offset = 3 to the ideal prior mean. As the standard deviation of the prior, σ_m,0, was increased, the filter's performance improved despite the mean offset. On the other hand, when the standard deviation was close to 0, the filter was dominated by a narrow, biased prior and produced poor results.
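The trend in this caption follows from how a static scalar Kalman filter weights its prior. The sketch below assumes N replicates with equal, known log domain measurement standard deviation; under that assumption the posterior mean is a precision-weighted average, and a prior offset shifts the estimate by the offset times the prior's weight. The function names and the `meas_std` parameter are hypothetical, and the values used for `meas_std` and N are illustrative only.

```python
def prior_weight(prior_std, meas_std, n_replicates):
    """Weight given to the prior mean in the posterior mean (between 0 and 1)."""
    prior_precision = 1.0 / prior_std ** 2
    data_precision = n_replicates / meas_std ** 2
    return prior_precision / (prior_precision + data_precision)

def bias_from_offset(offset, prior_std, meas_std, n_replicates):
    """Expected log domain shift of the estimate caused by a biased prior mean."""
    return offset * prior_weight(prior_std, meas_std, n_replicates)

# Illustrative values: offset of 3 (as in the caption), assumed meas_std and N.
for prior_std in (0.1, 1.0, 10.0):
    shift = bias_from_offset(3.0, prior_std, meas_std=1.0, n_replicates=3)
    print(f"prior_std={prior_std}: expected shift={shift:.4f}")
```

As the prior standard deviation grows, the prior weight tends to zero, so the offset has less and less influence and the estimate approaches the log-average of the replicates.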

Fig 8.

Kalman filtering using an accurate (unbiased) prior performs comparably to log-averaging when the prior uncertainty is increased.

RMS errors were calculated over all nucleotides in our database. Error calculations were carried out in the log domain and the ground truth values were the log reactivities. See Methods for RMS calculation details. The prior mean was fixed to the ideal value. Its standard deviation, σ_m,0, was then increased. As the standard deviation increased, the performance of Kalman filtering became more comparable to that of log-averaging.

Fig 9.

RNAstructure results for profiles calculated using different processing methods.

Three replicates were simulated at (a) low, (b) medium, and (c) high noise regimes. MCC differences are plotted relative to the baseline calculated in SET0. An MCC difference of 0 indicates that when the processed profile was used as input to the RNAstructure software, the resulting predicted structure had the same accuracy as the one predicted using the ground truth profile as input. A positive MCC difference indicates that when the processed profile was input to the RNAstructure software, the resulting predicted structure was less accurate than the one predicted using the ground truth profile as input. Note that the scale of the MCC differences varies between noise regimes. RNAs are ordered by length. See Table 1 of Methods for corresponding sequence names and lengths. Error bars represent standard errors over 10 repeated runs of replicate simulations.
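The MCC difference described in this caption can be sketched as follows. This is an illustrative sketch rather than the paper's procedure: it assumes structures are given as sets of base-pair index tuples and that true negatives are counted over all possible index pairs, which is one common convention and may differ from the one used in the paper. The function names are hypothetical.

```python
import math

def mcc(predicted_pairs, reference_pairs, seq_len):
    """Matthews correlation coefficient between two sets of base pairs."""
    pred = {tuple(sorted(p)) for p in predicted_pairs}
    ref = {tuple(sorted(p)) for p in reference_pairs}
    tp = len(pred & ref)
    fp = len(pred - ref)
    fn = len(ref - pred)
    # Assumption: every unordered pair of positions is a candidate pair.
    tn = seq_len * (seq_len - 1) // 2 - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_difference(baseline_mcc, processed_mcc):
    """Positive when the processed profile yields a less accurate structure."""
    return baseline_mcc - processed_mcc
```

The sign convention in `mcc_difference` is chosen to match the caption, where a positive difference means the processed profile led to a less accurate prediction than the ground truth profile.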

Fig 10.

Comparison of mean-dependence in the standard deviation of (a) real and (b) simulated SHAPE measurements.

For each nucleotide, the mean value of the 5 measurements (real and simulated) was calculated and plotted against their standard deviation on a log-log plot. A linear fit is overlaid in red for each. The left panel is a recreation of Fig 1 for comparison. The right panel consists of data coming from simulated replicates for the same RNA. The ground truth reactivity used in the replicate simulation was the average measurement per nucleotide coming from the real replicates. For the simulated replicates, noise levels were between σ_min = 0 and σ_max = 1.5. Note that negative reactivity values in the real data are not included, as they are incompatible with the log-log plot.
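Replicate simulation under a log-normal noise model can be sketched as below. This is a minimal sketch under stated assumptions: each nucleotide is given its own log domain noise level drawn uniformly between σ_min and σ_max, and noise is zero-mean Gaussian in the log domain. The caption gives only the range of noise levels, so the uniform draw, the function name, and the parameter names are hypothetical.

```python
import numpy as np

def simulate_replicates(ground_truth, n_replicates, sigma_min, sigma_max, rng):
    """Simulate replicates with multiplicative log-normal noise.

    ground_truth: positive data domain reactivities, one per nucleotide.
    Returns (replicates, sigma), where replicates has shape
    (n_replicates, n_nucleotides) and sigma holds the per-nucleotide
    log domain noise level that was drawn.
    """
    truth = np.asarray(ground_truth, dtype=float)
    # Assumption: one noise level per nucleotide, drawn uniformly.
    sigma = rng.uniform(sigma_min, sigma_max, size=truth.size)
    noise = rng.normal(0.0, sigma, size=(n_replicates, truth.size))
    return truth * np.exp(noise), sigma

rng = np.random.default_rng(0)
replicates, sigma = simulate_replicates([0.1, 1.0, 2.0], 5, 0.0, 1.5, rng)
```

Because the noise is additive in the log domain, it is multiplicative in the data domain, so the simulated standard deviation grows with the ground truth reactivity.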
