Bayesian modelling of high-throughput sequencing assays with malacoda

NGS studies have uncovered an ever-growing catalog of human variation while leaving an enormous gap between observed variation and experimental characterization of variant function. High-throughput screens powered by NGS have greatly increased the rate of variant functionalization, but the development of comprehensive statistical methods to analyze screen data has lagged. In the massively parallel reporter assay (MPRA), short barcodes are counted by sequencing DNA libraries transfected into cells and the cell’s output RNA in order to simultaneously measure the shifts in transcription induced by thousands of genetic variants. These counts present many statistical challenges, including overdispersion, depth dependence, and uncertain DNA concentrations. So far, the statistical methods used have been rudimentary, employing transformations on count level data and disregarding experimental and technical structure while failing to quantify uncertainty in the statistical model. We have developed an extensive framework for the analysis of NGS functionalization screens available as an R package called malacoda (available from github.com/andrewGhazi/malacoda). Our software implements a probabilistic, fully Bayesian model of screen data. The model uses the negative binomial distribution with gamma priors to model sequencing counts while accounting for effects from input library preparation and sequencing depth. The method leverages the high-throughput nature of the assay to estimate the priors empirically. External annotations such as ENCODE data or DeepSea predictions can also be incorporated to obtain more informative priors–a transformative capability for data integration. The package also includes quality control and utility functions, including automated barcode counting and visualization methods. To validate our method, we analyzed several datasets using malacoda and alternative MPRA analysis methods. These data include experiments from the literature, simulated assays, and primary MPRA data. We also used luciferase assays to experimentally validate several hits from our primary data, as well as variants for which the various methods disagree and variants detectable only with the aid of external annotations.

2 23 We have developed an extensive framework for the analysis of NGS functionalization 24 screens available as an R package called malacoda (available from 25 github.com/andrewGhazi/malacoda). Our software implements a probabilistic, fully Bayesian 26 model of screen data. The model uses the negative binomial distribution with gamma priors to 27 model sequencing counts while accounting for effects from input library preparation and 28 sequencing depth. The method leverages the high-throughput nature of the assay to estimate the 29 priors empirically. External annotations such as ENCODE data or DeepSea predictions can also 30 be incorporated to obtain more informative priors -a transformative capability for data 31 integration. The package also includes quality control and utility functions, including automated 32 barcode counting and visualization methods. 33 To validate our method, we analyzed several datasets datasets using malacoda and 34 alternative MPRA analysis methods. These data include experiments from the literature, 35 simulated assays, and primary MPRA data. We also used luciferase assays to experimentally 36 validate the strongest hits from our primary data, as well as variants for which the various 37 methods disagree and variants detectable only with the aid of external annotations.

54
The advent of next generation sequencing (NGS) has generated an explosion of observed 55 genetic variation in humans. Variants with unclear effects greatly outnumber those with obvious, 56 severe impact; the 1000 Genomes Project [1] has estimated that a typical human genome has 57 roughly 150 protein-truncating variants, 11,000 peptide-sequence altering variants, and 500,000   Genome-wide databases like ENCODE and computational predictors like DeepSea contain real 100 information about variant effects, but the method for incorporating this information into a 101 statistical framework for experimental analysis of variants has been unclear. 102 We hypothesized that a Bayesian approach to high throughput NGS screens such as  Bayesian statistical framework that addresses the gaps in the prior approaches while providing 112 novel methods for incorporating prior information. The malacoda method centers on MPRA but 113 also has potential extension to a broad array of NGS-based high-throughput screens. We 6 114 establish the superior performance of malacoda on MPRA compared to alternatives using 115 simulation studies. Then, we apply the method to previously published findings to make new 116 biological discoveries that we explore in the paper. We also apply malacoda to primary MPRA 117 studies that we performed. The results demonstrate that using malacoda we can discover 118 biologically important findings that were missed by prior approaches. We have made the   The negative binomial distribution is a natural choice for modelling NGS count data 152 given its ability to accurately fit overdispersed observations frequently seen in sequencing data 153 [12]. Briefly, the observed dispersion in NGS count data usually exceeds that expected from 154 simpler binomial or Poisson models. We chose gamma distributions as priors for several reasons.

155
They have the appropriate [0,∞) support, and for a non-negative random variable whose In order to incorporate external knowledge, the malacoda method also allows users to 195 provide arbitrary annotations to supplement the analysis. Figure 2C contrasts the marginal prior

247
Simulation Studies 248 We evaluated our simulation results in three ways. First, we focused on the accuracy of 249 transcription shift estimates. Figure 3A shows the results of analyzing one simulated dataset,  Figure 3B shows the ROC curves 13 269 by method for a randomly chosen simulation with ten barcodes per allele, 5% truly functional 270 variants, and 3000 variants. Figure 3C shows that across all simulations with these tend to disagree with one another but agree with malacoda, that would imply that malacoda is 280 working well across the cases where others fail. Indeed, Figure 4 shows that the other methods 281 tend to correlate with malacoda better than the other alternatives. The one exception is when 282 applied to our dataset, mpralm tends to agree best with the t-test method. Given that linear 283 models underlie both mpralm and the t-test method, it seems plausible that they would 284 sometimes show similar results.  The number of luciferase reporter assays we performed was not enough to overcome the 292 amount of noise inherent to light intensity-based measurements, thus we did not have enough 293 data to clearly demonstrate that any of the MPRA analysis methods outperform the others in 294 terms of correlation with luciferase results. However, the results show that the various methods 295 are consistent with MPRA-based estimates Figure 3D, providing further evidence that MPRA 296 results are biologically realistic. 297 We closely inspected a particular biological discovery to demonstrate malacoda's ability 298 to identify low-signal variants. for an assay that takes weeks to perform.  Therefore, the statistical methods used for these data should be as efficient as possible, 358 accounting for all sources of variation and quantifying the resulting uncertainty. Our software, 359 malacoda, provides an end-to-end framework for the probabilistic analysis of MPRA data.

360
Through our well-documented, easy-to-use R package, users can perform sequencing error 361 correction and data pre-processing before executing a fully Bayesian analysis in as little as two 362 lines of code. When informative annotations on variant function are available, malacoda is 363 capable of taking full advantage through a conditional prior estimation process. We hope that 364 this work may act as a stepping stone towards further integrative, probabilistic analysis in the 365 field of high-throughput genomics.