Piecewise Polynomial Representations of Genomic Tracks

Genomic data from microarray and sequencing projects consist of associations of measured values to chromosomal coordinates. These associations can be thought of as functions in one dimension and can thus be stored, analyzed, and interpreted as piecewise-polynomial curves. We present a general framework for building piecewise polynomial representations of genome-scale signals and illustrate some of its applications via examples. We show that piecewise constant segmentation, a typical step in copy-number analyses, can be carried out within this framework for both array and (DNA) sequencing data, offering advantages over existing methods in each case. Higher-order polynomial curves can be used, for example, to detect trends and/or discontinuities in transcription levels from RNA-seq data. We give a concrete application of piecewise linear functions to diagnose and quantify alignment quality at exon borders (splice sites). Our software (source and object code) for building piecewise polynomial models is available at http://sourceforge.net/projects/locsmoc/.


A Supplementary methods
In this section, we describe the smoothing process in more detail. We consider as input a piecewise constant curve. Such a curve is characterized by a set of positions called knots (a starting position, an end position, and intermediate locations where the value of the curve changes discontinuously) and the value (height) of the curve on each constant section. We will refer to the intervals between knots as sections or segments. We begin by defining a new piecewise polynomial curve (PPC), which we initially set equal to the input data. It therefore has the same number of knots as the original and all its sections are constant. The rest of the procedure runs in passes. For the first pass, we mark all knots as potential targets for processing.
In each pass, we visit flagged knots in a random order and, on each one, attempt to perform two operations. The preferred operation is to remove the knot and replace its neighboring sections by a single segment. This option is implemented if and only if the resulting section does not excessively deviate from the original curve. If this change occurs, we flag neighboring knots for processing in the next pass.
The alternative operation is to adjust the segments neighboring a knot without removing the knot itself. This is useful, for example, for removing discontinuities in the curve. As in the previous case, this option is taken if and only if the result does not differ too much from the original curve. If it is taken, we flag the neighboring knots as well as the knot itself for processing in the next pass.
If neither operation succeeds, the knot is marked as unchangeable. After a pass completes, the procedure restarts using the currently flagged set of knots and ends when the set is empty. As the number of flagged knots usually decreases in each pass, the process finishes after a relatively small number of iterations.
The above algorithm constitutes a sub-part of the whole program. Indeed, we repeat this process several times. In the first round, we attempt to construct a new curve that is piecewise constant, like the original. In the second and third rounds, we allow the curve segments to be linear and quadratic, respectively (the number of rounds actually performed depends on the maximal polynomial order specified by the user). We adopted such a layered approach primarily to ensure that quadratic fits are not attempted before all linear possibilities have been exhausted.
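The pass-based procedure described above can be sketched as follows. The `try_remove` and `try_adjust` callbacks are hypothetical stand-ins for the polynomial-fitting operations of Sections A.2 and A.3; in the real program, the acceptance criteria (Section A.5) and the adjustment threshold ensure that these operations eventually fail, so the loop terminates.

```python
import random

def neighbors(k, active):
    """Nearest remaining knots on either side of k, if any."""
    left = max((a for a in active if a < k), default=None)
    right = min((a for a in active if a > k), default=None)
    return {n for n in (left, right) if n is not None}

def smooth_passes(knots, try_remove, try_adjust, seed=0):
    """Skeleton of the pass-based smoothing loop.

    knots: interior knot positions.  try_remove(k) / try_adjust(k):
    hypothetical callbacks standing in for the fitting operations of
    Sections A.2/A.3, returning True when the operation is accepted.
    """
    rng = random.Random(seed)
    active = set(knots)       # knots still present in the smoothed curve
    flagged = set(knots)      # first pass: every knot is a target
    while flagged:
        order = sorted(flagged)
        rng.shuffle(order)    # flagged knots are visited in random order
        nxt = set()
        for k in order:
            if k not in active:
                continue
            nbrs = neighbors(k, active)
            if try_remove(k):         # preferred operation: drop the knot
                active.discard(k)
                nxt |= nbrs           # neighbors get reprocessed next pass
            elif try_adjust(k):       # fallback: adjust adjacent segments
                nxt |= nbrs | {k}
            # otherwise the knot is left as unchangeable for this round
        flagged = nxt & active
    return sorted(active)
```

With callbacks that, say, only ever remove even-numbered knots, the loop removes those in the first pass, revisits their flagged neighbors in the second, and then stops.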
The remainder of this section introduces some notation, derives formulae for computing the polynomial fits, and describes the criteria used in these procedures.

A.1 Notation
Consider a piecewise constant curve divided into N sections. Denote the starting positions and values of the sections as x_i and b_i, respectively, with i = 0, 1, 2, ..., N − 1. Also, let x_N stand for the curve end point. Without loss of generality, we choose a coordinate system with x_0 = 0 and thus obtain x_N = L, which we call the 'length' of the curve. We also denote the set of starting indices as I = {0, 1, 2, ..., N − 1}.
For each pair of neighboring knots (i, i + 1), or equivalently for each curve section i, we define

∆_{in} = (x_{i+1}^n − x_i^n) / n.   (1)

The simple sum over the set I is

∆_n = Σ_{i∈I} ∆_{in}.   (2)

We also consider the weighted sum of the ∆_{in}, with weights b_i,

Γ_n = Σ_{i∈I} b_i ∆_{in}.   (3)

In this notation, the area under each section i is b_i ∆_{i1} and the total area under several sections is Γ_1. We will be interested in polynomials of order at most equal to two, i.e.

f(x) = a_0 + a_1 x + a_2 x^2.   (4)
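Under our reading of these definitions (we take ∆_{in} = (x_{i+1}^n − x_i^n)/n, so that b_i ∆_{i1} is exactly the area under section i), the sums can be computed directly; a minimal sketch with our own function names:

```python
def delta(x, i, n):
    # Delta_{i,n} = (x_{i+1}^n - x_i^n) / n; for n = 1 this is the width
    # of section i, so b[i] * delta(x, i, 1) is the area under it
    return (x[i + 1] ** n - x[i] ** n) / n

def delta_sum(x, n):
    # simple (unweighted) sum of Delta_{i,n} over all sections
    return sum(delta(x, i, n) for i in range(len(x) - 1))

def gamma(x, b, n):
    # Gamma_n = sum_i b_i Delta_{i,n}; Gamma_1 is the total area under
    # the piecewise constant curve
    return sum(b[i] * delta(x, i, n) for i in range(len(b)))
```

For example, a curve with sections [0, 1) at height 30 and [1, 3) at height 30 has Γ_1 = 30·1 + 30·2 = 90.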
The squared error between this polynomial and the piecewise constant curve is defined as

E = Σ_{i∈I} ∫_{x_i}^{x_{i+1}} (a_0 + a_1 x + a_2 x^2 − b_i)^2 dx.   (5)

For future convenience, the expanded terms are organized to separate the higher-order coefficients a_1 and a_2; in particular, the cross terms involving the data reduce to the weighted sums Γ_1, Γ_2, and Γ_3.

A.2 Removing knots
Every non-boundary knot of a piecewise polynomial curve has one polynomial section on its immediate left and another on its immediate right, each of which can be composed of one or more sections of the original curve. We now discuss strategies for merging two neighboring sections together. Following the notation from Section A.1, we define a set I and corresponding x_i for the starting positions of the piecewise constant sections within the new interval. We then define a polynomial

f(x) = a_0 + a_1 x + a_2 x^2   (6)

and set the coordinate system so the leftmost position of the interval is x = 0 and the rightmost position is x = L. We require that the polynomial preserve the existing boundary conditions at x = 0 and x = L.
Thus we impose

f(0) = F,   f(L) = G,   (7)

where F and G are constants fixed by the two endpoints of the segments that we want to join together. Enforcing these boundary conditions is an important step in our procedure because it ensures that when a continuous curve with a large number of segments is processed, the outcome is also continuous. Further details of the polynomial fitting depend on the orders of the previous polynomials on the two sections. At first, we attempt to use a polynomial of order equal to the maximum of the orders on the old sections. If this is not possible (see criteria in Section A.5), we attempt to use a polynomial with an order incremented by one, but not greater than two. If this fails as well, we give up on knot removal. This procedure guarantees that the smoothed curve is less "overfitted" than the original and, in particular, avoids joining two piecewise constant sections by a quadratic polynomial. Some of the situations that may arise and their smoothings are shown in Figure 1.

To order 1
When the required polynomial order is one (Figures 1b and 1c), we use equation (6) and immediately set a_2 = 0. The boundary conditions then become a set of two linear equations in two unknowns with the solution given by

a_0 = F,   a_1 = (G − F) / L.   (8)
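A minimal sketch of the order-1 merge (a_0 = F, a_1 = (G − F)/L), together with an exact evaluator of the squared error (5) that is useful for checking candidate fits against the original data (function names are ours):

```python
def fit_linear_endpoints(F, G, L):
    # the unique line with f(0) = F and f(L) = G
    return (F, (G - F) / L, 0.0)

def squared_error(coeffs, x, b):
    """Exact squared error between f(x) = a0 + a1*x + a2*x^2 and a
    piecewise constant curve (section starts x[0..N], values b),
    integrated section by section."""
    a0, a1, a2 = coeffs
    total = 0.0
    for i in range(len(b)):
        c0, c1, c2 = a0 - b[i], a1, a2     # residual coefficients
        # (c0 + c1*x + c2*x^2)^2 expanded into monomials of degree 0..4
        mono = [c0 * c0, 2 * c0 * c1, c1 * c1 + 2 * c0 * c2,
                2 * c1 * c2, c2 * c2]
        total += sum(m * (x[i + 1] ** (k + 1) - x[i] ** (k + 1)) / (k + 1)
                     for k, m in enumerate(mono))
    return total
```

For instance, the line through (0, 0) and (2, 2) fitted to sections of heights 0.5 and 1.5 on [0, 1) and [1, 2) has squared error 2 · ∫(x − 0.5)^2 dx over a unit interval, i.e. 1/6.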

To order 2
For quadratic fits (Figure 1d), we consider equation (6) but allow all three parameters a_0, a_1, and a_2 to be nonzero. To determine the solution uniquely, we minimize the error (5) with respect to a_2. This involves setting a derivative to zero,

∂E/∂a_2 = 0.   (9)

Together with the boundary conditions, this leads to the solution

a_0 = F,   a_1 = (G − F)/L − a_2 L,   a_2 = (30/L^5) [Γ_3 − L Γ_2 + L^3 (F + G)/12].   (10)

Importantly, this operation is not attempted in cases where the two segments being joined are both of order 0.
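The same constrained fit can also be computed without a closed form: once the boundary conditions are substituted in, the error is a quadratic function of a_2 alone, so the exact minimizer can be read off a parabola through three samples of the error. A sketch under that parametrization (our own construction, not the program's code):

```python
def fit_quadratic_endpoints(F, G, L, x, b):
    """Order-2 fit with f(0) = F and f(L) = G enforced; the remaining
    freedom a2 is chosen to minimize the squared error (5)."""
    def gauss(g, lo, hi):
        # 3-point Gauss-Legendre quadrature: exact for polynomials of
        # degree <= 5, hence exact for the squared residual (degree 4)
        m, h = (lo + hi) / 2, (hi - lo) / 2
        r = (3 / 5) ** 0.5
        return h * (5 * g(m - r * h) + 8 * g(m) + 5 * g(m + r * h)) / 9

    def err(a2):
        a1 = (G - F) / L - a2 * L      # keeps f(L) = G given f(0) = F
        f = lambda t: F + a1 * t + a2 * t * t
        return sum(gauss(lambda t: (f(t) - b[i]) ** 2, x[i], x[i + 1])
                   for i in range(len(b)))

    e = [err(a2) for a2 in (-1.0, 0.0, 1.0)]
    # vertex of the parabola through (-1, e[0]), (0, e[1]), (1, e[2])
    a2 = (e[0] - e[2]) / (2 * (e[0] - 2 * e[1] + e[2]))
    return (F, (G - F) / L - a2 * L, a2)
```

By construction the returned coefficients satisfy f(0) = F and f(L) = G exactly.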

To order 0
The case with order 0 (Figure 1a) is slightly different because it implies setting a_1 = 0 and a_2 = 0 in (6). We are thus free to adjust only one parameter, a_0, and it becomes impossible to simultaneously enforce both the left and right boundary conditions. We thus abandon both continuity conditions and resort to minimization of the squared error. This involves setting

dE/da_0 = 0,   (11)

which gives

a_0 = Γ_1 / L.   (12)

Note that in this equation, the right-hand side is defined in terms of the weighted sum (3) over the, perhaps numerous, piecewise constant sections of the original curve that fall within the interval of interest. However, because of linearity and the fact that this computation is invoked only when the left and right sections of the smoothed curve are themselves constants, Γ_1 can be equivalently computed using the two constant sections from the smoothed representation.
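A sketch of the order-0 merge; the result is simply the length-weighted mean of the section values (the function name is ours):

```python
def fit_constant(x, b):
    """Order-0 fit: with both boundary conditions dropped, least squares
    gives the mean level a0 = Gamma_1 / L (total area over total length).

    x: section start positions plus the end point; b: section values.
    """
    L = x[-1] - x[0]
    gamma1 = sum(b[i] * (x[i + 1] - x[i]) for i in range(len(b)))  # area
    return gamma1 / L
```

For sections of heights 6 and 3 on [0, 1) and [1, 3), the area is 6 + 6 = 12 over length 3, giving a_0 = 4.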

A.3 Adjusting knots
If a knot cannot be removed entirely, we attempt to adjust the polynomials on its left and right in order to remove a discontinuity at the knot and minimize the error between the smoothed and the original curves. Some examples of such smoothings are shown in Figure 2. In all cases, we only consider adjusting coefficients of existing polynomials and never consider increasing the order of a polynomial. We describe the left and right neighborhoods using I and I', respectively. Generally, we use unprimed and primed variables to describe quantities on the left and on the right. We denote the polynomial equations for the two segments as

f(x) = a_0 + a_1 x + a_2 x^2,   (13)
f'(x') = a'_0 + a'_1 x' + a'_2 x'^2.   (14)

Importantly, we use different coordinate systems on the two regions, and set the origin in each as the left-most position (Section A.1) in that section; the left segment thus spans [0, L] and the right segment spans [0, L']. We impose boundary conditions through the constraints

f(0) = F,   f'(L') = G,   (15)

and ask for continuity at the knot position by requiring

f(L) = f'(0).   (16)

Orders (0,0)
If the existing segments on the left and right are both of order zero and there is a discontinuity at the knot, it is impossible to perform a boundary-preserving smoothing without changing the orders of the segments. This situation is thus left unmodified.

Orders (1,0), (2,0), (0,1), and (0,2)

If the right segment is constant and the left segment is linear or quadratic (cases (1,0) and (2,0)), the only way to remove a discontinuity is to adjust the end level of the left polynomial (Figure 2a). The smoothing problem is then reduced to one of the type discussed in Section A.2. That is, the left segment is set using (8) or (10) in the linear and quadratic cases, respectively. The right segment is unchanged. When the left segment is constant (cases (0,1) and (0,2)), the situation is almost equivalent and formulae (8) and (10) are again suitable, except that they are applied on the right segment.

Orders (1,1)
If both left and right segments are linear, they may be adjusted to make the endpoint of the left segment match the start point of the right segment (Figure 2b). In this case we have a_2 = 0 and a'_2 = 0 and the boundary and continuity conditions are

a_0 = F,   a'_0 + a'_1 L' = G,   a_0 + a_1 L = a'_0.   (17)

There is one free parameter, so we set out to minimize the total squared error over the two segments,

E_tot = E + E'.   (18)

Differentiating with respect to a'_0, we set

dE_tot/da'_0 = 0.   (19)

Using this together with

a_1 = (a'_0 − F)/L,   a'_1 = (G − a'_0)/L',   (20)

which follow from (17), we obtain the solution

a'_0 = 3 [Γ_2/L + Γ'_1 − Γ'_2/L' − (F L + G L')/6] / (L + L').   (21)
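A numerically equivalent way to carry out the (1,1) adjustment is to treat the common knot value c = a'_0 as the single free parameter and exploit the fact that the total error is quadratic in c; the sketch below (our own parametrization, not the program's code) recovers the exact minimizer from three error samples:

```python
def adjust_knot_11(F, G, L, Lp, left, right):
    """Orders (1,1) adjustment: both segments linear, outer boundary
    values F and G fixed, continuity enforced at the shared knot.
    left/right are (x, b) descriptions of the original piecewise
    constant data on each side, each in local coordinates from 0.
    Returns the optimal knot value c; the slopes then follow as
    (c - F)/L and (G - c)/Lp."""
    def gauss(g, lo, hi):
        # 3-point Gauss-Legendre: exact for the degree-2 integrand here
        m, h = (lo + hi) / 2, (hi - lo) / 2
        r = (3 / 5) ** 0.5
        return h * (5 * g(m - r * h) + 8 * g(m) + 5 * g(m + r * h)) / 9

    def err(c):
        fl = lambda t: F + (c - F) / L * t       # f(0) = F,  f(L)  = c
        fr = lambda t: c + (G - c) / Lp * t      # f'(0) = c, f'(L') = G
        e = 0.0
        for (xs, bs), f in ((left, fl), (right, fr)):
            for i in range(len(bs)):
                e += gauss(lambda t: (f(t) - bs[i]) ** 2, xs[i], xs[i + 1])
        return e

    e = [err(c) for c in (-1.0, 0.0, 1.0)]
    return (e[0] - e[2]) / (2 * (e[0] - 2 * e[1] + e[2]))  # parabola vertex
```

As a check: with F = G = 0, unit lengths, and data constant at 1 on both sides, the total error 2(c^2/3 − c + 1) is minimized at c = 3/2, which agrees with (21).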

Orders (1,2)
In this case, the boundary and continuity conditions are

a_0 = F,   a'_0 + a'_1 L' + a'_2 L'^2 = G,   a_0 + a_1 L = a'_0.   (22)

We treat a'_2 and a'_0 as independent variables. The other unknowns become dependent variables, so we have

a_1 = (a'_0 − F)/L,   a'_1 = (G − a'_0 − a'_2 L'^2)/L'.   (23)

Minimizing the total error with respect to the independent variables leads us to the system of linear equations

∂E_tot/∂a'_0 = 0,   ∂E_tot/∂a'_2 = 0.   (24)

As the system of equations is fairly complicated, we do not write the solution explicitly. Rather, in practice we construct matrices with the appropriate values and solve the system numerically.
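For these multi-parameter cases, the small systems can be assembled and solved numerically. One way to do this (a sketch, not the program's actual matrix construction) is to recover the gradient and Hessian of the quadratic total error by finite differences, which are exact for quadratics, and solve the resulting 2x2 system by Cramer's rule:

```python
def minimize_quadratic2(E, h=1.0):
    """Exact minimizer of a two-variable quadratic E(u, v): solve
    H [u, v]^T = -g, with Hessian H and gradient g (at the origin)
    recovered from central differences, which are exact here."""
    gu = (E(h, 0) - E(-h, 0)) / (2 * h)
    gv = (E(0, h) - E(0, -h)) / (2 * h)
    Huu = (E(h, 0) - 2 * E(0, 0) + E(-h, 0)) / h ** 2
    Hvv = (E(0, h) - 2 * E(0, 0) + E(0, -h)) / h ** 2
    Huv = (E(h, h) - E(h, -h) - E(-h, h) + E(-h, -h)) / (4 * h ** 2)
    det = Huu * Hvv - Huv * Huv
    u = (-gu * Hvv + gv * Huv) / det   # Cramer's rule for H x = -g
    v = (-gv * Huu + gu * Huv) / det
    return u, v
```

For example, E(u, v) = (u − 1)^2 + (v + 2)^2 + uv has its minimum at (8/3, −10/3), which the routine recovers exactly.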

Orders (2,1)
This is closely related to the previous case. The boundary and continuity conditions are

a_0 = F,   a'_0 + a'_1 L' = G,   a_0 + a_1 L + a_2 L^2 = a'_0.

We treat a_2 and a'_0 as independent variables. We then have

a_1 = (a'_0 − F − a_2 L^2)/L,   a'_1 = (G − a'_0)/L'.

Minimizing the total error, we obtain a system of linear equations in the four unknowns a_1, a_2, a'_0, and a'_1 (the two stationarity conditions together with the two dependence relations above). As in the previous case, we solve for the unknown variables numerically.

Orders (2,2)
The boundary and continuity conditions become

a_0 = F,   a'_0 + a'_1 L' + a'_2 L'^2 = G,   a_0 + a_1 L + a_2 L^2 = a'_0.

We select a'_0, a_2, and a'_2 to represent the three degrees of freedom. We then find

a_1 = (a'_0 − F − a_2 L^2)/L

and

a'_1 = (G − a'_0 − a'_2 L'^2)/L'.

Minimizing the total error with respect to the three independent variables, we obtain the system of equations ∂E_tot/∂a'_0 = ∂E_tot/∂a_2 = ∂E_tot/∂a'_2 = 0. As in the previous cases, we solve this system numerically.

A.4 Moving knots
The knot removal and adjustment strategies above do not guarantee that the formed PPC is an optimal description of the original data. In order to fine-tune the smoothed PPC, we implement an optional knot-moving stage. When enabled, we consider adjusting not only the vertical position of a knot but also its location along the curve (Figure 3). We compute the error between the original data and the two segments neighboring the knot. We repeat the calculation after moving the knot location to the left by one segment of the original PPC, then by two, and so on up to a user-specified maximum. We also repeat the calculation moving the knot toward the right. Finally, we set the knot at the position that minimizes the error.
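The knot-moving scan can be sketched as follows; `error_at(i)` is a hypothetical callback that refits the two neighboring segments with the knot at candidate position i and returns the squared error against the original data:

```python
def move_knot(pos_index, knot_positions, error_at, max_shift=10):
    """Knot-moving stage: try shifting the knot left and right by up to
    max_shift sections of the original PPC and return the index that
    minimizes the refit error of the two neighboring segments."""
    lo = max(1, pos_index - max_shift)                   # keep boundary
    hi = min(len(knot_positions) - 2, pos_index + max_shift)  # knots fixed
    return min(range(lo, hi + 1), key=error_at)
```

The boundary knots (first and last positions) are excluded from the candidate range, consistent with the fact that they cannot be removed or moved.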

A.5 Acceptance criteria
The above strategies provide suggestions for replacing one description of the piecewise polynomial curve by another one. Before implementing the removal or smoothing actions, however, the suggestions need to be checked to ensure that the resulting curve does not deviate too much from the input data. To this end, we consider the relevant interval I of the original piecewise constant curve and compute the areas A_i under each of its piecewise constant sections. We also compute the areas Ã_i under the suggested polynomial segment over the matching regions. We then use these areas to reach a decision about the suggested operation using one of several methods. The method and an associated numerical parameter p are user-specified.

A.5.1 Poisson
In the Poisson acceptance scheme, we compare the original and proposed areas pairwise. We accept the suggested segment if all pairs satisfy

|Ã_i − A_i| < p √(Ã_i).

The parameter p (default 1) is useful for changing the aggressiveness of the smoothing. The default value of 1 is suitable when the A_i are Poisson random variables, and higher values can be used to achieve greater compression. Values lower than 1 can be useful when the processed signal encodes Poisson random variables which have been normalized (e.g. to represent a probability scale).
It is worth stressing that the above test condition is applied on areas under segments. The consequence of this can be understood via an example. Consider two sections of an input signal with lengths 1 and 2, respectively, both with value 30. Such sections are frequently encountered in sequencing coverage data. Consider a proposed piecewise-flat segment with value 26 covering these two sections. The areas compared for the shorter section will be 30 (= 30 · 1) and 26 (= 26 · 1), which pass the acceptance test for p = 1 (|26 − 30| < √26). For the longer section, the areas compared will be 60 (= 30 · 2) and 52 (= 26 · 2), which fail the acceptance test (|52 − 60| > √52). In a signal with heterogeneous section lengths, this means that knot removal and smoothing operations occur less frequently around long sections.
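The Poisson test, in the form |Ã_i − A_i| < p √(Ã_i) suggested by this worked example, can be sketched as:

```python
from math import sqrt

def poisson_accept(orig_areas, new_areas, p=1.0):
    """Poisson acceptance test: every paired (A_i, Atilde_i) must satisfy
    |Atilde_i - A_i| < p * sqrt(Atilde_i).  This is our reading of the
    criterion, matching the worked example in the text."""
    return all(abs(an - a) < p * sqrt(an)
               for a, an in zip(orig_areas, new_areas))
```

On the example above, the short-section pair (30, 26) passes while the long-section pair (60, 52) fails, so the suggested segment is rejected.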

A.5.2 Percent
For the percentage-based acceptance scheme, we check that all area pairs satisfy

|Ã_i − A_i| < p A_i.

In this method, the parameter p (default 0.1) is a threshold for the relative difference in areas.
Compared to the Poisson method, this criterion gives equal treatment to sections of all lengths.
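A sketch of the percentage criterion, assuming the relative difference is taken with respect to the original area A_i:

```python
def percent_accept(orig_areas, new_areas, p=0.1):
    """Percentage acceptance test: every pair must satisfy
    |Atilde_i - A_i| < p * A_i (relative to the original area, which is
    our assumption about the normalization)."""
    return all(abs(an - a) < p * a
               for a, an in zip(orig_areas, new_areas))
```

With the default p = 0.1, an original area of 100 tolerates proposed areas in (90, 110) regardless of the section length, which is what makes this criterion length-neutral.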

A.5.3 t-test
The t-test acceptance scheme is designed for use with piecewise-constant smoothings and works quite differently from the previous schemes. When a new segment is proposed, for example by joining two PPC segments together, we do not verify that the original data is consistent with the new segment. Instead, we compute a t-test-like score which determines whether the two PPC segments are sufficiently different to warrant keeping them separated. In practice, we first assign weights to the original areas according to the lengths of the original segments. Then, we compute the means µ, µ' and standard deviations σ, σ' of the data on the left and right PPC segments, and define the t-test-like score as

z = |µ − µ'| / √(σ^2/n + σ'^2/n'),

where n and n' are the numbers of original sections in the two segments. To determine whether a value of z is significant, we permute the areas N = (L + L')^2 times (at most 10/p times) and repeat the above calculation to obtain a reference distribution of scores. This distribution is used to assign a p-value to z. A knot removal operation is accepted if the p-value is above a threshold set by the parameter p (default 0.01), and rejected otherwise. This test becomes active when both left and right segments contain at least two sections of the original data. If either segment contains fewer than two sections, a knot removal operation is automatically accepted.
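A sketch of the permutation scheme. The Welch-style form of the score is an assumption on our part (the text does not pin down the exact formula), the length-based weighting of the areas is omitted for brevity, and the permutation count follows our reading of N = (L + L')^2 capped at 10/p:

```python
import random
from math import sqrt

def permutation_t_accept(left, right, p=0.01, seed=0):
    """t-test acceptance scheme (sketch): accept the knot removal when
    the permutation p-value of a Welch-style score exceeds p, i.e. the
    two segments are not significantly different."""
    if len(left) < 2 or len(right) < 2:
        return True                    # too few sections: auto-accept
    rng = random.Random(seed)

    def score(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
        vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
        # tiny constant guards against zero variance on both sides
        return abs(ma - mb) / sqrt(va / len(a) + vb / len(b) + 1e-300)

    z = score(left, right)
    pooled = list(left) + list(right)
    n_perm = min((len(left) + len(right)) ** 2, int(10 / p))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if score(pooled[:len(left)], pooled[len(left):]) >= z:
            hits += 1
    return hits / n_perm > p           # p-value above threshold: accept
```

Two segments with identical means give z = 0, every permuted score ties or exceeds it, and the removal is accepted.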

Subdivision into windows
Very long curves can take a large amount of memory and a long time to process. To improve efficiency, we note that curve sections with values of zero are effectively immutable, as changes to such segments are rejected by the Poisson and percentage acceptance criteria (Section A.5); the t-test criterion is an exception. Thus, we divide curves into windows separated by long zero-level pieces and then process each of these windows independently (the minimal length of the zero-level run can be set by the user). The windows are patched together at the end.
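The window subdivision step can be sketched as follows, with the curve given as (length, value) run-length pairs and the default zero-run threshold of 64 taken from the -z option:

```python
def split_into_windows(sections, min_zero_run=64):
    """Split a piecewise constant curve (list of (length, value) pairs)
    into windows separated by zero-valued runs of at least
    min_zero_run positions, so that each window can be smoothed
    independently and the results patched back together."""
    windows, current = [], []
    for length, value in sections:
        if value == 0 and length >= min_zero_run:
            if current:                 # a long zero run closes a window
                windows.append(current)
            current = []
        else:
            current.append((length, value))
    if current:
        windows.append(current)
    return windows
```

Short zero runs stay inside their window; only runs at or above the threshold act as separators.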

Boundary knots
In practice, piecewise polynomial curves are defined over compact intervals, and this implies that two knots define the curve boundaries. (Implicit boundaries also appear when a long curve is split into windows as described above.) In contrast to non-boundary knots, boundary knots have only one neighboring curve segment instead of two. Although these knots cannot be removed, they can be smoothed. The procedures defined in the previous sections, however, need to be modified to account for the boundaries. Because we deal with curves with discontinuities, we allow for the possibility of a discontinuity at the boundary knots as well. For example, while the starting value of the first segment might be one value, we can in principle fix the left boundary condition to a different value. This is convenient when splitting a long signal into windows and smoothing the windows independently.

Smoothing accuracy
A knot adjustment suggestion (Section A.3) changes the start and end values of two neighboring segments (Figure 2). When a knot is adjusted, its neighbors are then flagged for adjusting as well. If these knots are smoothed, then the knot we started with is flagged again, and so on. The adjustments normally become very small after a few iterations. We thus monitor the value of the coefficient a_0 (see Section A.3) before and after each knot adjustment, and reject the adjustment if the change is smaller than a threshold value. This threshold is absolute (i.e. not relative to the magnitude of the signal) and is set by default to 0.01.

A.7 Comparison to other methods
We compare the outputs generated by our PPC-generating algorithm to other smoothing methods in Table 1. One important method missing from the comparison is adaptive spline modeling, i.e. smoothing splines with dynamically determined knots. This is actually the method most closely related to our approach, except that it uses Bayesian fitting criteria, maximally smooth polynomials, and a Monte Carlo engine for optimizing knot placement. Unfortunately, the implementation we found [6] did not compile and run in our environment. We would expect it to give similar results to our approach, but its more sophisticated criteria would make it less suitable for processing very long signals with many domain changes. Segmented regression is produced with the segmented package in R. It is fairly successful, but includes an unnecessary, very steep segment on the right. A disadvantage of the algorithm is that it requires defining an initial set of knots and sometimes aborts with no output if this set is not close to optimal.

B Supplementary tables and figures
In all cases, linear and quadratic smoothing is performed with the -c option off (Section C). This means that two piecewise constant segments are never smoothed into constant sections.

Figure 7: Low-resolution CGH signals with aberrations. Dots are probe intensities; horizontal lines indicate true copy-number levels (green) and segments output by CBS (red) and the median of 5 PPC runs (blue). This is a situation in which, due to large overall chromosome variance, CBS is unable to detect distinct segments while the bottom-up PPC approach can.

Table 3: Classifier performance on array data. Array data for the synthetic array signals were segmented using PPCs and CBS. For the PPC method, we prepared a segmentation from a single smoothing run (1 pass) as well as a median of five passes (5 pass). We assigned each PPC and CBS segment a label of CN abnormality if more than half of the width overlapped with a true CN event (in this calculation, signals with multiple events were pooled together). The dataset was divided into a part for training (85%) and testing (15%) and SVMs were trained as described in Section B.

Table caption (label truncated): this table complements Table 2 in the main text. We considered five previously published datasets [7] and three sets of signals of the type described in the main text. For each dataset, we performed segmentation using PPC and three other methods: CBS [5], GLAD [2] and HMM [1]. Values in the tables are proportions of times the output from a method had lower mean squared error than a segmentation produced via PPC. Instances where PPC was better more often (proportions < 0.5) are highlighted in bold. The results show that CBS and GLAD often give better segmentation than PPC, but that PPC is nonetheless better in a non-negligible fraction of cases.
Results for datasets labeled 9 events (described in the main text; an example is in Figure 8) suggest there exist classes of signals, not represented in the other datasets, for which PPC gives considerably better results than any of the other methods. We thus conclude that PPC is complementary to existing methods. As our algorithm is stochastic, all proportions are approximate, with an estimated error of about 0.05 (differences between values here and in Table 2 of the main text are helpful in understanding variation between runs). As is common with stochastic algorithms, performance improves when averaging results over multiple runs/passes.

Table caption (label truncated): this table complements Table 2 in the main text. For the short signals with many CN events, the proportion of times CBS segmentation gives lower error compared to PPC is lower than 50%. This observation is robust to changing the p-value used in CBS.

Figure caption (label truncated): in the bottom-up approach, the running times are roughly equivalent for signals that are flat (with noise) or consist of a square wave (with noise). For the same signals, the top-down approach exhibits a shift in the running times while simultaneously missing many more events in the square wave pattern. All measurements were performed with default settings. The diagonal line with unit slope shows linear scaling.

B.2.2 Sequencing data
We tested a number of PPC segmentation strategies by varying the aggressiveness of smoothing via the parameter p (Poisson method, p set to 4.5, 5, and 5.5; other values not shown). For comparison, we computed mean coverage in fixed windows (50, 75, and 100 bp; other values not shown), converted these means to log-ratios, and treated them using circular binary segmentation (CBS) [5]. Since neither the PPC curve smoothing nor CBS segmentation explicitly calls copy number changes, we created classifiers using support vector machines (SVM). The sample was divided into a part for training (8500 chromosomes) and a part for testing (1500 chromosomes). We classified segments output by each approach as copy-number abnormal if more than half of their widths overlapped with the true abnormal region. For the PPC-based segmentations, we trained the SVMs using four features: segment width, segment mean, and the means of the segments immediately up- and down-stream. The need for features other than the segment mean can be understood from Figure 10, which shows that PPC segmentation sometimes produces very short segments above and below the main trend. The classifier must be trained with such information to ignore these local fluctuations. For the CBS-based segmentations, we used only two features, segment width and segment mean, as we found this produced better performance. The classification process was repeated 40 times in cross-validation (changing the sets of training and test chromosomes), recomputing the smoothing of all curves 4 times to investigate the effect of stochasticity in our algorithm. To summarize the performance of each classifier, we computed the number of bases called as copy-number normal and abnormal and compared these to the true counts (Table 5). All approaches exhibit high sensitivity (TP/(TP + FN) > 90%) and specificity (TN/(TN + FP) > 99%). The CBS strategy produces best results when the width of the fixed windows is equal to the read length, i.e. the autocorrelation length of the signal (75 bp). The PPC strategies achieve comparable performance and sometimes show higher true positive rates together with lower false negative rates. We stress this does not arise because the control method misses regions of copy number change entirely. The reference method is in fact quite reliable, but is limited in detecting breakpoints by the fixed window length (the example in Figure 10 was picked to demonstrate this point).
Another measure of comparison is the number of best true positives (BTP). This is the maximal number of positions that could be called as copy-number abnormal given a perfect classifier; it thus provides information about detection ability that is independent of the specific SVM classifier used in these experiments. The measure is at a maximum for the CBS approach when using 75bp windows. Working with dynamically defined PPC segments can give higher BTP rates, and can be improved beyond the results shown at the expense of using shorter but more numerous segments, which can take more time to classify.

C.1 Command line parameters
The smoothing program, locsmoc, takes some required and some optional parameters/flags (Table 6).

Option  Default  Description

-f      [required]  Input/output files. The first instance of -f is interpreted as the input file. Input files can be in gzip format (gz extension) or text format (any other extension). The second instance is used as the output destination. Inputs/outputs can be directories, in which case all files contained are processed sequentially.

-z      64       Zeros. A value of zero forces the whole curve to be smoothed at once. Positive values allow the curve to be split into windows separated by runs with zero value.

--seed  0        Seed for random number generation. When set to 0, the random number generator is initialized using the current date as a seed. Otherwise, the specified value is used as the seed.

-g      NA       Force compressed output using gzip.

-c      NA       Consider smoothing into piecewise constant pieces (required when -o=0).

-h      NA       Input/output files have header rows.

-t      1        Number of threads/processors used.

C.2 Examples
The options and flags provide a lot of flexibility. Some of the possibilities are reviewed below.

Input/Output options
To smooth one file using all the default settings, use

    java -jar locsmoc.jar -f infile.txt -f outfile.txt

If the input file contains a header row, use

    java -jar locsmoc.jar -h -f infile.txt -f outfile.txt

The software can manipulate gzip files, so the following generates a compressed output file:

    java -jar locsmoc.jar -f infile.txt -f outfile.gz

It is also possible to smooth multiple files at once. This can be done by passing more than two input/output file names:

    java -jar locsmoc.jar -f input1.txt -f output1.txt -f input2.txt -f output2.txt

Alternatively, it is possible to specify input/output directories:

    java -jar locsmoc.jar -f inputdir -f outputdir

This produces output files with the same names as the input files. The output is compressed whenever the input is. To force compressed output, set the -g flag:

    java -jar locsmoc.jar -g -f inputdir -f outputdir

Smoothing options
To change the default settings, use the parameters/flags from Table 6.

C.3 Incorporation in an analysis pipeline
In this section we give a minimalist example of how to use the software to build PPC models for sequencing data. We begin with an alignment file and aim to compute PPCs for the coverage tracks. The example assumes we are working in a UNIX-like environment and have R installed with the package Rsamtools [3]. We want to compute the coverage track for an alignment in a file aln.bam. We can do this in R, so we issue the following commands within the R environment:

    > library("Rsamtools")
    > aln = readGappedAlignment("aln.bam")
    > alncov = coverage(aln)
    > alncov.chr1 = alncov$chr1

The first command loads the samtools library for R. The second loads the small alignment file we created earlier. The third computes the coverage track and the fourth extracts only the coverage track for chromosome 1. (We assume the chromosomes in the alignment are labelled "chr1," "chr2," etc. If they are not, the last command should be modified accordingly.) The object alncov.chr1 now holds a run-length encoded signal. We can check this by viewing the object using

    > alncov.chr1

We want to put this signal in a tabular format and save it to disk. We do this as follows:

    > alncov.chr1.table = cbind(runLength(alncov.chr1), runValue(alncov.chr1))
    > write.table(alncov.chr1.table, file="aln.chr1.txt", quote=F, row.names=F, col.names=F, sep="\t")

We want to compress the text file to save disk space. Then, we want to create two PPC models for the coverage track: one in terms of piecewise flat and the other in terms of piecewise linear segments.
We issue the following commands:

    > system("gzip aln.chr1.txt")
    > system("java -jar [path-to-locsmoc.jar] --method poisson -p 2 -o 0 -c --move 10 -f aln.chr1.txt.gz -f aln.chr1.o0p2m10.txt.gz")
    > system("java -jar [path-to-locsmoc.jar] --method poisson -p 2 -o 1 --move 10 -f aln.chr1.txt.gz -f aln.chr1.o1p2m10.txt.gz")

These are actually commands that can be executed on the command line, but we wrap them with system() here to perform the full analysis without switching back and forth between the command line and R. The outputs of the last two commands are files ending with p2m10.txt.gz. We can load them into R as normal tables:

    > alncov.chr1.o0p2m10 = read.delim("aln.chr1.o0p2m10.txt.gz", sep="\t", header=F)
    > alncov.chr1.o1p2m10 = read.delim("aln.chr1.o1p2m10.txt.gz", sep="\t", header=F)

The first of these objects will be a table with two columns: the segment length and the segment level (a_0). The second object will be a table with three columns: the segment length, level (a_0), and slope (a_1).
Information about the R session used in this example:

C.4 Dependencies
The software locsmoc makes use of the following third-party features: