The authors have declared that no competing interests exist.

Conceived and designed the experiments: TBH MDE DKG. Performed the experiments: TBH MDE. Analyzed the data: TBH MDE. Contributed reagents/materials/analysis tools: TBH MDE. Wrote the paper: TBH MDE DKG.

We show that existing RNA-seq, DNase-seq, and ChIP-seq data exhibit overdispersed per-base read count distributions that do not match the assumptions of existing computational methods. To compensate for this overdispersion we introduce Fixseq, a nonparametric and universal method for processing per-base sequencing read count data.

High-throughput DNA sequencing has been adapted to measure diverse biological state information including RNA expression, chromatin accessibility, and transcription factor binding to the genome. The accurate inference of biological mechanism from sequence counts requires a model of how sequence counts are distributed. We show that presently used sequence count distribution models are typically inaccurate and present a new method called Fixseq.

High-throughput sequencing is used in a variety of molecular counting assays.

Although a myriad of specialized methods exist for analyzing read count data, it is frequently assumed (implicitly or explicitly) that read counts are generated according to a Poisson distribution with a local mean. The assumption is introduced explicitly by using the Poisson density directly, and implicitly by relying on binned per-base counts in ranking and statistical testing.
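Under a Poisson model the per-base variance equals the mean, so the variance-to-mean (dispersion) index offers a quick empirical check of this assumption. A minimal sketch on synthetic counts (the log-normal rates stand in for a real experiment's heterogeneity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdispersed synthetic counts: per-base Poisson rates drawn from a
# log-normal give a log-normal Poisson count vector.
rates = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
overdispersed = rng.poisson(rates)

# A pure Poisson control with the same overall mean.
control = rng.poisson(rates.mean(), size=100_000)

# Under a Poisson model the dispersion index var/mean equals 1.
print("overdispersed:", overdispersed.var() / overdispersed.mean())
print("control:     ", control.var() / control.mean())
```

The overdispersed vector shows a dispersion index well above one, while the Poisson control stays near one.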

We introduce a general and asymptotically correct preprocessing technique called Fixseq.

The normalization strategy of de-duplication is prevalent in multiple ChIP-seq peak callers.

Existing RNA-seq normalization techniques work at a higher conceptual level than Fixseq.

In contrast to most of these existing strategies, Fixseq operates directly on per-base counts and makes no parametric assumptions about their distribution.

We examined the distributions of per-base mapped read counts in all ChIP-seq and RNA-seq runs for the human embryonic stem cell type (H1-hESC or ES cells) in the ENCODE project, together with a set of K562 cell line DNase-seq experiments.

Differences in log-likelihood per base between the fitted model and the empirical distribution, also interpretable as the log-difference between observed and fitted counts. This error metric represents the error incurred when calculating p-values or significance tests under the corresponding distributional assumption. Three assay types are shown in each panel, analyzed by three models: (a) Poisson. (b) Negative binomial. (c) Log-normal Poisson. A model that fit the data would have points along the zero line.

The distribution described as log-concave is the statistical model used in Fixseq.

The left panel shows uncorrected counts, and the right shows counts after correction. Poisson-distributed counts would follow a straight line; all experiments show significant deviation from linearity that is corrected by Fixseq.

We quantified the degree of overdispersion with respect to a distribution by comparing per-base empirical log-likelihoods against the per-base maximum likelihood fits for the Poisson, negative binomial, and log-normal Poisson, where in the last case the per-base Poisson rates are assumed to be drawn from a log-normal distribution. For the negative binomial and log-normal Poisson, maximum likelihood fits were found via numerical optimization with randomized restarts.
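As an illustration on synthetic counts (not the ENCODE data; the variable names and single optimizer start here are ours), the Poisson and negative binomial maximum likelihood fits can be compared directly on mean per-base log-likelihood:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
# Overdispersed synthetic counts (log-normal Poisson, as in the text).
counts = rng.poisson(rng.lognormal(0.0, 1.0, size=50_000))

# Poisson MLE: the rate is the sample mean.
ll_pois = stats.poisson.logpmf(counts, counts.mean()).mean()

# Negative binomial MLE: numerically optimize over (r, p).
def nb_negll(params):
    r, p = params
    return -stats.nbinom.logpmf(counts, r, p).mean()

res = optimize.minimize(nb_negll, x0=[1.0, 0.5],
                        bounds=[(1e-3, None), (1e-6, 1 - 1e-6)])
ll_nb = -res.fun

print(f"per-base log-likelihood  Poisson: {ll_pois:.4f}  NB: {ll_nb:.4f}")
# The negative binomial, which can model overdispersion, fits better.
```

On overdispersed data the negative binomial attains a noticeably higher per-base log-likelihood than the Poisson, mirroring the comparison described above.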

The deviation from Poisson is consistent across experiment and assay type, as shown in the left column of the preceding figure.

The wide variation in overdispersion level and type suggests that any single parametric approach is unlikely to be effective for all assay types. Instead of attempting to model each assay type with a separate parametric family, we will use nonparametric distributions that are flexible enough to fit all observed assay types well.

While we have already seen that the Poisson assumption fails for most of the assay types we consider, it is not feasible or necessary to modify every analysis algorithm to use overdispersed distributions. Instead, for the class of inference algorithms which implicitly or explicitly assume Poisson counts and independent bases, such as most ChIP-seq callers, DNase-binding identifiers, and RNA-seq exon read counting methods, we can construct improved datasets with transformed counts that correct for overdispersion.

The two major nonparametric approaches to data transformation are quantile normalization, which matches input samples to a reference sample via their quantiles, and distribution matching, which fits a distribution to both the input and reference and constructs a mapping function between them.

Quantile normalization, a popular approach in the microarray literature, cannot easily be adapted to sequencing data due to the large number of bases with equal counts. To rank-normalize our observed counts to a Poisson, we would have to break ties between bases with equal read counts arbitrarily, which could lead to spurious inferences and force some bases with non-zero counts to be discarded.

Instead of breaking ties, we employ a different approach to distribution mapping: given a distribution fitted to the observed counts and a reference Poisson, we construct a continuous mapping between them via the probability integral transform.
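In symbols (with $F$ the CDF fitted to the observed counts and $G$ the reference Poisson CDF; the notation here is ours), distribution matching applies the transform

$$ x' = G^{-1}\big(F(x)\big), $$

so that if $x \sim F$ then $x' \sim G$ exactly, with no arbitrary tie-breaking required once $F$ is given a continuous extension.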

Our approach transforms the non-Poisson, curved count histogram in the left column into the straight line characteristic of Poisson counts in the right column.

De-duplication, or removal of all but one read at each base position, has gained adoption in the ChIP-seq analysis literature as an effective way of reducing noise and improving replicate consistency.

The heuristic of de-duplication can be derived as a distribution-mapping data transformation by assuming that the read counts arise from a degenerate count distribution, where the number of bases with non-zero reads is drawn from a binomial, and the number of reads at non-zero bases is drawn from a uniform noise component over a fixed range of counts.

De-duplication works well in practice by drastically reducing the error and additional variance from overdispersion, despite assuming that the data follow a degenerate distribution.
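As a data transformation, de-duplication simply caps each base's count at one; a one-line sketch:

```python
import numpy as np

# Per-base read counts before and after de-duplication: the
# transformation keeps at most one read per base position.
counts = np.array([0, 1, 3, 7, 0, 2])
dedup = np.minimum(counts, 1)
print(dedup)  # [0 1 1 1 0 1]
```

All information in the count magnitudes is discarded; only the presence or absence of reads at each base survives the transformation.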

Preprocessing by de-duplication does not continue to reduce per-base errors as sequencing depth increases.

Therefore, while de-duplication may be effective at lower sequencing depths, it relies upon a limited heuristic justification and will not remain effective as sequencing depths increase. On the other hand, Fixseq is asymptotically correct and continues to reduce per-base errors as sequencing depth increases.

We compared three methods of count preprocessing: original (raw) counts, removal of all duplicates (de-duplication), and our novel preprocessing technique (Fixseq).

We evaluate our model on the ability to identify transcription factor binding sites based upon DNase-seq counts on the ENCODE human K562 DNase-seq data using two different methods: an unsupervised task using the CENTIPEDE binding site caller, and a supervised classification task.

In the unsupervised task, Fixseq preprocessing improved binding site identification relative to raw counts and de-duplication.

Boxplots depicting AUC improvement across multiple factors (boxes above zero represent improvement due to Fixseq).

We tested Fixseq on ENCODE ChIP-seq experiments by measuring consistency across replicates.

We evaluate quantile-quantile correlation for replicate consistency, as this allows us to evaluate the distribution of q-values generated by each method without pairing binding sites explicitly. Quantile-quantile (QQ) correlations are an effective means of detecting not only whether we call similar numbers of binding sites across replicates, but also whether our ChIP-seq call confidence is consistent across replicates. We computed QQ correlations across all analyzed ENCODE ChIP-seq experiments.
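A quantile-quantile correlation can be computed without pairing sites by correlating matched quantiles of the two replicates' score distributions. A sketch with synthetic q-values (the function name and quantile grid are our own choices):

```python
import numpy as np

def qq_correlation(a, b, n_quantiles=1000):
    """Pearson correlation between matched quantiles of two samples."""
    qs = np.linspace(0, 1, n_quantiles)
    return np.corrcoef(np.quantile(a, qs), np.quantile(b, qs))[0, 1]

rng = np.random.default_rng(2)
rep1 = rng.beta(0.5, 5.0, size=2000)  # synthetic q-values, replicate 1
rep2 = rng.beta(0.5, 5.0, size=1500)  # replicate 2, different depth
print(f"QQ correlation: {qq_correlation(rep1, rep2):.3f}")
```

Because the quantile grid interpolates each sample, the two replicates need not contain the same number of binding site calls.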

Subfigure (a) shows that Fixseq improves the quantile-quantile correlation between replicates.

An alternative measure of ChIP-seq experiment quality is the number and size of overlapping sites across replicates. Fixseq improves replicate consistency under this measure as well.

We ran Fixseq on ENCODE RNA-seq experiments.

Replicate consistency was measured in two ways: Spearman's rank correlation and the number of false positive differential expression events called by DEseq.

The rank correlations between replicates were computed for each preprocessing method.

Higher rank correlations across exon expression measurements between replicates indicate greater data quality and reproducibility. Fixseq increases these rank correlations relative to raw counts and de-duplication.

Our DEseq false-positive counts are summarized in the table below.

| | Caltech Align 75 | Caltech Splice 75 | Caltech Align 75x2 | Caltech Splice 75x2 | CSHL polyA- | CSHL polyA+ |
| Original | 2903 | 1454 | 5719 | 2748 | 8403 | 6955 |
| De-duplication | 2559 | 1033 | 4951 | 2027 | 6640 | 5253 |
| Fixseq | | | | | | |

We have shown that per-base count overdispersion is a widespread and consistent phenomenon in high-throughput sequencing experiments. While correcting for exon-level overdispersion has been studied in RNA-seq, per-base methods and corresponding tools for ChIP-seq and DNase-seq have largely been unexplored outside of aggressive count truncation methods particular to individual algorithms. One reason for the slow adoption of overdispersed models has been the empirical success of the de-duplication heuristic as a preprocessing scheme. However, we show that de-duplication assumes the data arise from a degenerate distribution, and that the performance of de-duplication will degrade as sequencing depth increases.

Fixseq addresses both of these issues.

The Fixseq approach is described in detail below.

Our count preprocessing method, Fixseq, consists of three components:

Parameter inference for a novel class of distributions called log-concave Poisson distributions.

A probability integral transform method to map counts generated under log-concave Poisson to a Poisson distribution.

Rounding techniques to adapt datasets to methods that utilize only integral counts.

In the case that the algorithm downstream of our method is able to take weighted counts, Fixseq can supply its transformed, non-integral counts directly and the rounding step is unnecessary.

The challenge of constructing a universal preprocessor is finding a class of count distributions that is flexible enough to model a variety of assay types while remaining non-degenerate. We achieve this goal by letting the per-base rates of a Poisson distribution be drawn from a nonparametric class of distributions called log-concave. Log-concave distributions are those whose log-density is a concave function.

The log-concave family includes a large set of unimodal distributions, such as most of the exponential family (including common cases such as the normal, the gamma with shape parameter greater than one, and the Dirichlet).

In sequencing experiments, log-concave Poisson families are capable of modeling zero-inflation, as well as mixtures induced by copy number variation, by placing mass at low Poisson rates.

Our algorithmic contribution is inference for compound log-concave distributions, in which a latent log-concave distribution generates the Poisson rates along the genome. Inference for latent log-concave distributions does not follow directly from recent results in log-concave density estimation because of the ambiguity of parameters in the latent space.

The full model is as follows: per-base counts $n_i$ are drawn as $n_i \sim \mathrm{Poisson}(e^{g_i})$, where the latent log-rates $g_i$ are drawn from a density proportional to $e^{\varphi(g)}$ for a concave function $\varphi$.

Note that the two exponential operators above are intentional: $e^{g_i}$ maps the unconstrained latent variable to a positive Poisson rate, while $e^{\varphi(g)}$ makes the latent density log-concave by construction.

The form of this model naturally suggests an expectation-maximization strategy, which has been shown to be effective for clustering tasks. In our setting, however, EM converges slowly and offers no guarantee of reaching a global optimum.

Instead we propose an inference technique based upon accelerated proximal gradient descent. The marginal likelihood for a count $n$ can be written as:

$$ p(n) = \frac{\int \mathrm{Pois}(n \mid e^{g})\, e^{\varphi(g)}\, dg}{\int e^{\varphi(g)}\, dg} $$

The bottom term normalizes the log-concave distribution. Approximating each integral with a sum over a fixed grid of quadrature points yields a finite-dimensional objective in the values of $\varphi$ at those points.


Both the objective function and constraints are concave, and therefore we can use accelerated gradient descent to quickly find the global optimum.

Our gradient is taken with respect to the value of $\varphi$ at each quadrature point.

This gradient has a straightforward interpretation: the first term is the posterior distribution of the latent log-rates given the observed counts, and the second is the current latent distribution itself; the gradient vanishes when the two agree.

The projection operator maps each gradient step back onto the set of concave functions $\varphi$, keeping the iterates feasible.

The inference algorithm is guaranteed to converge to a global optimum of the quadrature approximation, which in turn converges to the true global optimum as the number of quadrature points increases. If there are sufficiently many quadrature points, Fixseq will converge to the log-concave distribution closest to the data-generating distribution in the KL-divergence sense.

When compared to the naive expectation-maximization-based method, our algorithm converges more quickly, substantially reducing the average runtime on our DNase datasets.

Once we fit a log-concave distribution to the observed counts, we can construct the transformation to a reference Poisson.

Throughout this section, we will use the continuous extension of the Poisson PDF and CDF, together with the analogous extensions for the log-concave compound distributions. The Poisson extensions replace the factorial with a Gamma function:

$$ f(x; \lambda) = \frac{\lambda^{x} e^{-\lambda}}{\Gamma(x+1)}, \qquad F(x; \lambda) = \frac{\Gamma(x+1, \lambda)}{\Gamma(x+1)}, $$

where $\Gamma(\cdot, \cdot)$ is the upper incomplete gamma function; the compound densities are extended in the same manner.

Given the continuous extensions, we can apply the probability integral transform directly to map observed counts toward the reference Poisson.
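A minimal sketch of such a transform (the function names are ours; a negative binomial with illustrative, unfitted parameters stands in for the fitted compound distribution, and we use its discrete CDF rather than a continuous extension for brevity):

```python
import numpy as np
from scipy import optimize, special, stats

def poisson_cdf_cont(x, lam):
    """Continuous extension of the Poisson CDF: the regularized
    upper incomplete gamma function Q(x + 1, lam)."""
    return special.gammaincc(x + 1.0, lam)

def poisson_cdf_inv(u, lam):
    """Numerically invert the continuous Poisson CDF."""
    return optimize.brentq(
        lambda x: poisson_cdf_cont(x, lam) - u, -1.0 + 1e-9, 1e6)

def pit_transform(n, fitted_cdf, target_lam):
    """Push a count through the fitted CDF, then pull it back through
    the continuous Poisson CDF (probability integral transform)."""
    u = float(np.clip(fitted_cdf(n), 1e-6, 1.0 - 1e-9))
    return poisson_cdf_inv(u, target_lam)

# Stand-in for a fitted overdispersed distribution: a negative
# binomial with illustrative (not fitted) parameters r=0.5, p=0.2.
fitted = lambda n: stats.nbinom.cdf(n, 0.5, 0.2)
for n in [0, 2, 10]:
    print(n, "->", round(pit_transform(n, fitted, target_lam=2.0), 2))
```

The transform is monotone in the original counts, since both CDFs are monotone, so the ordering of bases is preserved while overdispersed counts are compressed toward the Poisson reference.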

Our preprocessing function is the composition of the fitted compound CDF with the inverse of the continuous Poisson CDF.

Examples of the resulting transformation functions are shown in the accompanying figures.

Finally, for downstream algorithms that require integer counts, the transformed values must be rounded.

While some algorithms, such as CENTIPEDE, can accept non-integral weighted counts directly, many others require integer counts.

The straightforward count flooring scheme, where each transformed count is rounded down to the nearest integer, is simple but discards every count's fractional part.

We also propose a more sophisticated randomized rounding scheme, where we take each transformed count and round it up with probability equal to its fractional part, and down otherwise, so that expected counts are preserved.
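One standard way to realize such a randomized rounding, which preserves the expected count (a sketch; details of the scheme used here may differ):

```python
import numpy as np

def randomized_round(x, rng):
    """Round each value up with probability equal to its fractional
    part, down otherwise, so the expected count is preserved."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(3)
x = np.full(100_000, 2.3)
rounded = randomized_round(x, rng)
print(rounded.mean())         # ~2.3: expectation preserved
print(sorted(set(rounded)))   # values land only on 2.0 and 3.0
```

Unlike flooring, this scheme is unbiased: averaged over the randomness, the rounded counts agree with the transformed counts.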

We compared these schemes on DNase-seq data, where the unsupervised classifier, CENTIPEDE, was capable of accepting weighted counts, allowing us to compare the rounding schemes to the direct weighting scheme using the same comparison method as our DNase-seq results. The results are shown in the figure below.

All rounding schemes outperform baseline methods (bottom left), but only randomized rounding approaches the performance of the weighted counts (top right).
