A new method for detecting signal regions in ordered sequences of real numbers, and application to viral genomic data

doi:10.1371/journal.pone.0195763

Fig 1.

Example of a simulated and recaptured signal.

The upper plot gives simulated x_i for a ‘gene’ of length 500, with a signal region of length 100 in the middle. The background x_i are simulated as independently normally distributed with mean zero and variance 1; the signal positions are independently normally distributed with mean −1 and variance 1. The lower plot gives the c_i for this same simulation: the normalised cumulative walk. In both panels the simulated signal region is the highlighted region (from positions 201 to 300 inclusive). In the lower plot the maximum Z descent is marked in red. The location of the maximal descent corresponds well to the simulated signal.
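
The simulation described in this caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the standardisation used to build the walk, and the use of the raw drop in c_i as the descent score are all assumptions, since this section does not define Z or the exact normalisation.

```python
import random
from itertools import accumulate
from statistics import mean, pstdev


def simulate(length=500, start=200, width=100, shift=-1.0, seed=0):
    """Background ~ N(0, 1); signal positions ~ N(shift, 1).

    With the defaults the signal occupies 0-based indices 200..299,
    i.e. positions 201 to 300 in 1-based numbering.
    """
    rng = random.Random(seed)
    return [rng.gauss(shift if start <= i < start + width else 0.0, 1.0)
            for i in range(length)]


def cumulative_walk(x):
    """Cumulative sum of standardised values, starting from c_0 = 0.

    Standardising by the sample mean and standard deviation is an
    assumed normalisation; it makes the walk return to zero at the end.
    """
    m, s = mean(x), pstdev(x)
    return [0.0] + list(accumulate((v - m) / s for v in x))


def max_descent(c):
    """Largest drop c[a] - c[b] with a <= b; returns (drop, a, b)."""
    best = (0.0, 0, 0)
    peak_idx = 0
    for j, v in enumerate(c):
        if v > c[peak_idx]:
            peak_idx = j          # new running maximum of the walk
        drop = c[peak_idx] - v
        if drop > best[0]:
            best = (drop, peak_idx, j)
    return best


x = simulate()
c = cumulative_walk(x)
drop, a, b = max_descent(c)
print(f"largest descent {drop:.1f} from walk index {a} to {b}")
```

In this sketch `c[a]` is the peak and `c[b]` the trough, so the candidate region covers positions a+1 to b in 1-based numbering.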

Fig 2.

Example of a simulated and recaptured signal.

The example is exactly as in the previous figure, but here some statistics from many random permutations of the x_i are shown. For illustration, 1000 such permutations are presented, and these are represented along the horizontal axis. For clarity these are sorted by their maximum Z. The upper plot gives the max Z for each run (blue dots). The lower plot gives the length of the signal found. In both, the red dashed line gives the corresponding ‘real’ value (from the simulated data in original order). The real max Z is much higher than any Z from the permutations: we say 0 out of 1000 bootstrap Z were higher. The plateau in each panel is discussed in the main text.
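
The permutation comparison described in this caption can be sketched as below. It is a hedged illustration only: `standardise`, `max_drop` and `permutation_count` are hypothetical names, and the largest drop of a standardised cumulative walk stands in for the maximum Z, whose definition is not given in this section.

```python
import random
from itertools import accumulate
from statistics import mean, pstdev


def standardise(x):
    """Centre and scale the values (assumed normalisation)."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


def max_drop(z):
    """Largest descent of the cumulative walk of z, starting at 0."""
    peak = best = 0.0
    for c in accumulate(z):
        peak = max(peak, c)
        best = max(best, peak - c)
    return best


def permutation_count(x, n_perm=1000, seed=0):
    """Count permutations whose score is higher than the observed one.

    The mean and standard deviation do not change under permutation,
    so the values are standardised once and then shuffled in place.
    """
    rng = random.Random(seed)
    z = standardise(x)
    observed = max_drop(z)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(z)
        if max_drop(z) > observed:
            exceed += 1
    return observed, exceed


# Simulated data laid out as in Fig 1: signal at 0-based indices 200..299.
rng = random.Random(1)
x = [rng.gauss(-1.0 if 200 <= i < 300 else 0.0, 1.0) for i in range(500)]
observed, exceed = permutation_count(x)
print(f"{exceed} out of 1000 permuted scores were higher than {observed:.1f}")
```

The count returned here corresponds to the "0 out of 1000" style of statement in the caption; how the paper converts such a count into a significance level is not stated in this section.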

Fig 3.

Effect of varying strength of simulated signal: Summary statistics.

For a range of different simulated signal strengths 2000 simulated sequences were generated and the algorithm applied to attempt to recapture the signal. This figure gives some summary statistics from these. The left column corresponds to setting the required significance level to 5% and the right to 1%. In each subplot the individual simulations are shown as blue dots and their mean for a given value of signal strength shown in red. To show the density more clearly the dots have been dithered slightly: horizontally only in the upper two plots and vertically also in the lower plots. The upper panels give the proportion of the signal (the middle 100 of 500 positions) that was recaptured by the algorithm. The middle panels give the proportion of the 400 positions of background that is determined to be signal by the algorithm. The lowest panels give the number of separate pieces that are detected.
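
The three summary statistics plotted in this figure can be computed from a per-position detection mask, as in the following sketch. The function name `summarise` and the boolean-mask representation are assumptions made for illustration; the default signal location matches the simulated set-up (the middle 100 of 500 positions).

```python
def summarise(detected, signal_start=200, signal_width=100):
    """Return (proportion of signal recaptured,
               proportion of background flagged as signal,
               number of separate detected pieces).

    detected: list of bools, one per sequence position (0-based).
    """
    n = len(detected)
    in_signal = [signal_start <= i < signal_start + signal_width
                 for i in range(n)]
    hits = sum(d and s for d, s in zip(detected, in_signal))
    false_hits = sum(d and not s for d, s in zip(detected, in_signal))
    # A piece starts wherever a detected position follows a
    # non-detected one (or sits at the very start of the sequence).
    pieces = sum(1 for i, d in enumerate(detected)
                 if d and (i == 0 or not detected[i - 1]))
    return hits / signal_width, false_hits / (n - signal_width), pieces


# Example: one region overlapping half the signal, one spurious region.
mask = [False] * 500
for i in range(190, 250):
    mask[i] = True
for i in range(400, 410):
    mask[i] = True
print(summarise(mask))  # (0.5, 0.05, 2)
```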

Fig 4.

Effect of varying strength of simulated signal: Sample outputs.

These are graphical representations of the regions found for simulated and recaptured signals. In the interests of space and clarity, only a random set of 20 individual outputs is shown for each value of strength shown: they are stacked vertically with no special ordering applied. If a region is detected it is shown as a horizontal bar: red if significant at the 1% level, pink if only at the 5% level.

Fig 5.

Effect of varying the width of the signal and length of the sequence.

In both panels the colour of the region represents the mean proportion of signal recovered. Points were sampled along lines of fixed width or length to find the locations of the contours. In both panels the red line shows the transect that corresponds to the default set-up above (signal width 100, sequence length 500). The left panel shows the signal width varied from 50 to 450, with sequence length fixed at 500. The signal is placed in the centre of the sequence. Each contour is fairly horizontal in the range 100 to 300 and curves up at the ends. The curve up at the left shows that shorter signals are marginally harder to detect, in that they must be stronger for the same success rate. The steep curve up at the right indicates that the signals are very hard to detect when they constitute the majority of the sequence. The right panel shows the sequence length varied from 150 to 1000 with signal width fixed at 100. The same effect is seen again: if the sequence is not much longer than the signal, then the signal is relatively hard to detect. The contours are almost horizontal towards the right of the figure. Once the sequence is several times the signal length there is little further change.

Fig 6.

Effect of varying signal density.

In the left panel the colour of the region represents the mean proportion of signal recovered. Points were sampled along lines of fixed density to find the locations of the contours. The panel on the right shows an extrapolation obtained by taking the results for density one and extending the contours in proportion to inverse density. In both panels, the red line shows the transect that corresponds to the default set-up above (density 1).

Fig 7.

Finding two signals in same sequence, varying strengths.

In both panels the colour of the region represents the mean proportion of signal recovered. The left panel shows the proportion of total signal found when there were two signals within the same sequence (spaced as 100 background, 100 signal one, 100 background, 100 signal two, 100 background). The points were sampled adaptively along lines of fixed signal ratio to find the locations of the contours (only half of the figure was generated in this way; the results were then reflected across the diagonal, in the interests of computational speed). The right panel shows a similar output for two signals in two separate sequences of length 500. As these computations for the two sequences are independent, this was generated using the results for a single signal of length 100 and extrapolating.
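
The two-signal layout described in this caption (100 background, 100 signal one, 100 background, 100 signal two, 100 background) could be simulated along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, and "strength" is taken to be the size of the negative shift in the mean with unit variance, by analogy with the single-signal example in Fig 1.

```python
import random


def simulate_two_signals(strength1, strength2, block=100, seed=0):
    """Five consecutive blocks: background, signal 1, background,
    signal 2, background. Each value is N(mean, 1), where the mean is
    0 for background and -strength for a signal block (assumed sign
    convention)."""
    rng = random.Random(seed)
    block_means = [0.0, -strength1, 0.0, -strength2, 0.0]
    return [rng.gauss(m, 1.0) for m in block_means for _ in range(block)]


x = simulate_two_signals(1.0, 0.5)
print(len(x))  # 500
```

With the default block size, signal one occupies 0-based indices 100..199 and signal two occupies 300..399.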

Fig 8.

Finding two signals in same sequence, varying strength of one signal.

Here we simulate two signals in the same sequence with the same arrangement as in the previous figure. The first signal’s strength is varied; the second signal’s strength is kept fixed at 0.5. The solid curve in the upper plot is the mean proportion recovered of signal 1, and the solid curve in the lower plot is that of signal 2 (dots for computed values, line segments to join). The grey dashed and dotted curves are for comparison with a single signal, either in a sequence of total length 500 (dashed) or 400 (dotted), with the same strength as signal 1 (upper plot) or signal 2 (lower plot). The proportion of the first signal found is still sigmoidal as a function of strength, but sits under the curves for a single signal of equal strength (in a sequence of length either 400 or 500). When the first signal strength is zero, the proportion of the second signal found matches that of a single signal of strength 0.5 in a sequence of length 500. When the first signal is strong, the proportion of the second signal found is comparable to that of a single signal of strength 0.5 in a sequence of length 400. The first signal's interference with the second is clearest when the first signal is at intermediate strength.

Fig 9.

Regions in virus data.

In each pair, the upper plot gives the normalised MPD (x_i) with ignored values marked in grey. The lower plot gives the cumulative normalised values (c_i). The coloured bars in each indicate significant regions found. Red indicates p < 0.01 and orange p < 0.05. Details of locations and order in which the regions were found within each gene are given in Table 1. Note that the cumulative plot is strictly only correct for the first region found. For subsequent regions the cumulative values considered were renormalised after the previous region(s) were removed. For clarity all regions are marked on the original cumulative plot.

Table 1.

Summary of regions found.
