A new method for detecting signal regions in ordered sequences of real numbers, and application to viral genomic data

We present a fast, robust and parsimonious approach to detecting signals in an ordered sequence of numbers. Our motivation is in seeking a suitable method to take a sequence of scores corresponding to properties of positions in virus genomes, and find outlying regions of low scores. Suitable statistical methods without using complex models or making many assumptions are surprisingly lacking. We resolve this by developing a method that detects regions of low score within sequences of real numbers. The method makes no assumptions a priori about the length of such a region; it gives the explicit location of the region and scores it statistically. It does not use detailed mechanistic models so the method is fast and will be useful in a wide range of applications. We present our approach in detail, and test it on simulated sequences. We show that it is robust to a wide range of signal morphologies, and that it is able to capture multiple signals in the same sequence. Finally we apply it to viral genomic data to identify regions of evolutionary conservation within influenza and rotavirus.


Introduction
This document gives a worked example of the application of the methods described in the main text. In the first section, we apply our method of finding maximum Z to a simple simulated dataset. In the second section we show how the statistical testing works for this example. In the final section, we use the same example to illustrate the methods discussed in the main text for improving the speed of computation.

Finding max Z
Our method is for finding low regions in an ordered sequence of numbers x 1 , . . . , x n . Here we illustrate this with a specific sequence of 24 numbers: x 1 , x 2 . . . , x 24 . These are first normalised by shifting by a constant to have mean zero and rescaling them by a (positive) constant to have variance one:x 1 ,x 2 . . .x n . Finally the partial sums of thex i are formed to give the c i . These are all given numerically in Table 1. These are also shown graphically in Figure 1. Note that plotting the c i is often enough to reveal to the eye where low regions are likely to be, as it does here.
Next we find Z: As this toy example is small, it does not take long to search over all possible i and j, but computational speed improvements are discussed below. For this example, Z is maximised with i = 8 and j = 16. Note that this means that the positions of the original data included are 9 to 16 inclusive, discovered by checking carefully which x i are the difference between c 8 and c 16 . For this pair, Z = 3.24.

Statistical Testing
As described in main text, to determine if this 'real' Z is statistically significant, permute the original datapoints and find the new Z and do this repeatedly. As the normalisation step is the same no matter what order the x i are in, it is slightly more efficient to start with thex i and permute those with no further need to normalise. One example is given in Figure 3. In this case, the start and end point of the run giving maximal Z are consecutive, corresponding to a single point in the 'region' of interest. This example has Z = 2.73. In fact for any permutation, this will be a lower bound on the maximal Z as this same scoring possibility will be present (it is the largest single downwards step, and we we can read off Table 1 that this corresponds tô x 11 ). Indeed for this dataset, this often comes out as the maximal Z for a random permutation but not always (this is typical for a small dataset). Computing the Z for 1000 random permutations of the data, we find Z < 3.24 in all but 17 runs. So we would say this is significant at 5% level (but not the 1% level). Hence our method would propose x 9 to x 16 comprise an interesting region, unlikely to have occurred by chance. (Indeed, this happened to be exactly what was input in generating this example.) To complete the method, remove x 9 to x 16 and repeat the process for the remaining dataset. This is not shown here (as it runs through much the same) and in this the second proposed region is not statistically significant. So we end with the region 8-16 as the only significant one found.

Computational speed improvements
For finding the maximal Z ij , one could naïvely compute Z ij for all possible pairs of i, j with i < j. With n datapoints, this is n(n − 1)/2 pairs for which to compute Z ij . For the toy example with  Figure 4: Potential start and end locations. This is as Figure 1 lower panel, but with the potential starts marked in blue and potential ends marked in red (with some points marked both). The lines in grey connect each of the start-end pairs that occur in the right order (start first) -these are not all easy to see, but there are 55 of them.
n = 24, this 276 pairs to check, which is not at all unreasonable, and indeed for n of hundreds or thousands, this does not take long to compute. The real computational cost comes in when bootstrapping using permuted datasets, as done above, where we need to repeat the whole process many times, say 1000 or 10,000 times. In the main text, some suggestions are given for approaches that will decrease the amount of computational work required by limiting the set of pairs i, j that need be checked: we illustrate them here with the example above. The first relatively easy speed-up is to note that the best pair i, j must have the property that both the first step and the last step in the sequence c i , c i+1 , . . . , c j must be steps downwards. If not, if one of them is neutral or upwards, then trim it off: the resulting sequence will be shorter (the denominator √ j − i decreases), and have greater or equal height drop (the numerator c i − c j does not decrease), hence the corresponding Z ij will be larger, hence we did not have the maximal Z ij originally.
The steps 'downwards' will correspond exactly to the values of i wherex i < 0, so these i give the ends, and subtracting one from each gives the starts (given more formally in the main text). These are then a shortlist for the possible 'starts' for i and 'ends' for j: call these sets s 1 , s 2 , . . . , s N and e 1 , e 2 , . . . , e N respectively. Note that s u + 1 = e u . Pairing up each of the starts and ends that occur in the right order now, we have (N + 1)N/2 pairs (as s u can be paired with e v for u ≤ v).
Typically N ≈ n/2 so the number of pairs needed to check will have reduced by a factor of four (from n 2 /2 to n 2 /8 to leading order). For the example here, actually only 10 of the 24x i are negative (skewed distribution, which can happen when there is a signal to be found) so we go from 276 pairs to only 55. The potential starts and ends for this example are shown in Figure  4. Note as the bootstrapped data are permutations of thex i , the number of pairs to compare will be the same for all of these (55 for this example).
It is possible to go further with this approach and limit in which combination these potential starts and ends may be paired up. This is described formally in the main text, but is slightly harder to visualise, so a specific example is picked out here. Focussing on the last potential end point (at j = 22), does it make sense to check the Z value for it paired with each of the start points? No: for all the start points apart from 21, they would achieve a greater Z with j = 16 as the end point. This is because j = 16 is both earlier and lower, giving a greater Z. This is illustrated in Figure 5. Indeed for each end point, we can identify whether there is another end point that 'beats' it in this way, and if more than one choose that which is closest, and then exclude all pairings between the original end point and any starts earlier than the other end point. The function f defined in the main text encapsulates this: f (v) gives the index of the end This extra method is actually not very helpful for our example here, as the endpoints are in almost decreasing order, and this will be at least partially the case if there is a real signal there to detect. In fact here, f (v) = 0 for all v except for the last, where f (10) = 9, which corresponds to the case illustrated in Figure 5. So indeed only these nine pairs are excluded and the number of pairs to check has merely gone down from 55 to 46.
But the real benefits from this method is not in calculating max Z for the real data, but for the bootstrapped permutations as the endpoint heights are not typically in such an order. Returning for example to the permutation in Figure 3, the same method here reduces the pairs from 55 to 31. Repeating over many permutations, the average number of pairs is around 28.
For this short example, the extra complication of computing f may not seem to be worth the relatively modest reduction in pairs that need their Z ij computed. For clarity here then, two additional points are worth noting. Firstly, as we scale up n, the length of the input dataset, the benefit scales disproportionately: rather than the number of pairs to check growing as n 2 , it appears to be growing rather more slowly. For example, for a sequence of independent identically distributed uniform random samples of length n = 1000, pairing all i and j gives 499,500 pairs to check. Limiting to potential starts and ends as described reduces this to around 125,000 pairs. The last step above of limiting which starts and ends can be paired together dramatically reduces this to about around 10,000 (based on numerical simulation). Secondly, the time taken to compute the f also scales less than quadratically with n using the iterative method given in the main text (and sample code), where later evaluations of f reuse previous values iteratively.
In summary, these additional computational speed-ups are not crucial for small datasets (say n in the order of tens), but these methods are offered and available in sample code, and are increasingly worth the complication with larger datasets.