
RNA velocity unraveled

  • Gennady Gorin,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America

  • Meichen Fang,

    Roles Data curation, Formal analysis, Investigation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America

  • Tara Chari,

    Roles Data curation, Formal analysis, Investigation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America

  • Lior Pachter

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    lpachter@caltech.edu

    Affiliations Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America, Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, California, United States of America

Abstract

We perform a thorough analysis of RNA velocity methods, with a view towards understanding the suitability of the various assumptions underlying popular implementations. In addition to providing a self-contained exposition of the underlying mathematics, we undertake simulations and perform controlled experiments on biological datasets to assess workflow sensitivity to parameter choices and underlying biology. Finally, we argue for a more rigorous approach to RNA velocity, and present a framework for Markovian analysis that points to directions for improvement and mitigation of current problems.

Author summary

Single-cell sequencing data are snapshots of biological processes, making it challenging to infer dynamic relationships between cell types. RNA velocity attempts to bypass this challenge by treating the unspliced RNA content as a proxy for spliced RNA content in the near future, and using this “extrapolation” to build directional relationships. However, the method, as implemented in several software packages, is not yet reliable enough to be actionable, in part due to the large number of arbitrary, user-set hyperparameters, as well as fundamental incompatibilities between the biophysics of transcription in the living cell and the models used throughout the velocity workflows. In this study, we review these issues, and use existing results from the fields of stochastic modeling and fluorescence transcriptomics to develop an alternative theoretical framework. We show that our framework can facilitate the development and inference of physically consistent models for sequencing data, as well as the unification of single-cell analyses to self-consistently treat variation due to cell type dynamics and identities, the stochasticity inherent to single-molecule processes, and the uncertainty introduced by sequencing experiments.

Introduction

Background

The method of RNA velocity [1] aims to infer directed differentiation trajectories from snapshot single-cell transcriptomic data. Although we cannot observe the transcription rate, we can count molecules of spliced and unspliced mRNA. The unspliced mRNA content is a leading indicator of spliced mRNA, meaning that it is a predictor of the spliced mRNA content in the cell’s near future. This causal relationship can be usefully exploited to identify directions of differentiation pathways without prior information about cell type relationships: “depletion” of nascent RNA suggests the gene is downregulated, whereas “accumulation” suggests it is upregulated. This qualitative premise has profound implications for the analysis of single-cell RNA sequencing (scRNA-seq) data. The experimentally observed transcriptome is a snapshot of a biological process. By carefully combining snapshot data with a causal model, it is for the first time possible to reconstruct the dynamics and direction of this process without prior knowledge or dedicated experiments.

The bioinformatics field has recognized this potential, widely adopting the method and generating numerous variations on the theme. The roots of the theoretical approach date to 2011 [2], but the two most popular implementations for scRNA-seq were released in 2017–2018: velocyto by La Manno et al. [1], which introduced the method, and scVelo by Bergen et al. [3], which extended it to fit a more sophisticated dynamical model. Aside from these packages, numerous auxiliary methods have been developed, including protaccel [4] for incorporating newly available protein data, MultiVelo [5] and Chromatin Velocity [6] for incorporating chromatin accessibility, VeTra [7], CellPath [8], Cytopath [9], CellRank [10], and Revelio [11] for investigating coarse-grained global trends, scRegulosity [12] for identifying local trends, Velo-Predictor [13] for incorporating machine learning, dyngen [14] and VeloSim [15] for simulation, and VeloViz [16] and evo-velocity [17] for constructing velocity-inspired visualizations. This profusion of computational extensions has been accompanied by a much smaller volume of analytical work, including discussions of potential extensions and pitfalls [18–21], as well as theoretical studies based on optimal transport [22, 23] and stochastic differential equations [24]. However, at their core, these auxiliary methods are built on top of the theory and code base from velocyto or scVelo.

These two most popular software implementations emphasize usability and integration with standard visualization methods. The typical user-facing workflows, with internal logic abstracted away, are shown in Fig 1: a set of reads is converted to cell × gene matrices derived from spliced and unspliced mRNA molecule measurements, the matrices are processed to generate phase plots describing a dynamical transcription process, and finally the transcriptional dynamics are fit, extrapolated, and displayed in a low-dimensional embedding.

Fig 1. A summary of the user-facing workflow of a typical RNA velocity workflow.

Initial processing of sequencing reads produces spliced and unspliced counts for every cell, across all genes. Inference procedures, implemented in velocyto and scVelo, fit a model of transcription, and predict cell-level velocities. The final embedding of cells and smoothed velocities is displayed in the top two principal component dimensions. Visualizations adapted from [25, 26]; dataset from [1]. The DNA and RNA illustrations are derived from the DNA Twemoji by Twitter, Inc., used under CC-BY 4.0.

https://doi.org/10.1371/journal.pcbi.1010492.g001

Despite the popularity of RNA velocity [13, 27] and increasingly sophisticated attempts to combine it with more traditional methods for trajectory inference [8, 10], there has been little comprehensive investigation of the modeling assumptions that underlie the seemingly simple workflow, with the sole dedicated critique to date largely focusing on the embedding process [28]. This is an impediment to applying, interpreting, and refining the methods, as problems arise even in the simplest cases. Consider, for example, the result displayed in Fig 1, where the outputs of the two most popular RNA velocity programs applied to human embryonic forebrain data generated by La Manno et al. [1] (“forebrain data”) are qualitatively different. The inferred directions in the example should recapitulate a known differentiation trajectory from radial glia to mature neurons. However, scVelo, which “generalizes” velocyto, fails to identify the trajectory, and even reverses it, suggesting totally different causal relationships between cell types. This type of problematic result has been reported elsewhere (Figs 2–3 of [3], Fig 2 of [21], Fig 4B of [5], Fig 5A of [9], Fig 5 of [10], and Fig 3 of [29]).

Motivated by such discrepancies, we wondered whether either velocyto or scVelo are reliable for standard use in applications where ground truth may be unknown. An examination of their theoretical foundations, and those of related methods, revealed that they are largely informal. Even the term “RNA velocity” is not precisely defined, and is used for the following distinct concepts:

  • A generic method to infer trajectories and their direction using relative unspliced and spliced mRNA abundances by leveraging the causal relationship between the two RNA species, which is the interpretation in [24].
  • A set of tools implementing this method or parts of it, as in an “RNA velocity workflow implemented in kallisto|bustools,” which is the interpretation in [30].
  • A gene- and cell-specific quantity under a continuous model of transcription, as in “the RNA velocity of a cell is $\frac{ds_t}{dt} = \beta u - \gamma s$”, which is the interpretation in [18, 27].
  • A gene- and cell-specific quantity under a probabilistic model of transcription, as in “the RNA velocity of a cell is $\frac{d}{dt}\mathbb{E}[S_t] = \beta u - \gamma s$”, which is the interpretation in [4].
  • A gene-specific average quantity, as in “the total RNA velocity of a gene is $\sum_i(\beta u_i - \gamma s_i)$”, which is the interpretation in [12, 27].
  • A cell-specific vector composed of gene-specific velocity components, as in “the vector RNA velocity of a cell is $\beta_j u_{ij} - \gamma_j s_{ij}$”, which is the interpretation in [7, 9, 27].
  • The cell-specific linear or nonlinear embedding of a cell-specific vector in a low-dimensional space, which is the interpretation in [9].
  • A local property, such as curvature, of a theorized cell landscape computed either from an embedding or a set of velocities, which is the interpretation in [22, 29].

These discrepancies and, more broadly, the limitations of current theory, stem from historical differences between sub-fields, which have calcified over the past twenty years of single-cell biology. On the one hand, fluorescence transcriptomics methods, including single-molecule fluorescence in situ hybridization and live-cell MS2 tagging, which target small, well-defined systems with a narrow set of probes [31–33], have motivated the development of interpretable stochastic models of biological variation [34, 35]. On the other hand, “sequence census” methods [36], such as scRNA-seq, provide genome-wide quantification of RNA, but the associated challenges of exploratory, high-dimensional data analysis have not, for the most part, been addressed with mechanistic models. Instead, descriptive summaries, such as graph representations and low-dimensional embeddings, are the methods of choice [37]. Nevertheless, descriptive analyses, even if ad hoc, can still facilitate biological discovery: RNA velocity has been used to produce plausible trajectories [38–46], and our simulations show that it can recapitulate key information about differentiation trajectories in best-case scenarios (Fig A in S1 Text). These results highlight the potential of RNA velocity, and motivated us to review its assumptions, understand its current failure modes, and to solidify its foundations.

Towards this end, we found it helpful to contrast the sub-fields of fluorescence transcriptomics and sequencing, which have analogous goals, albeit disparate origins that have led to analytical methods with distinct philosophies and mathematical foundations. The sub-fields have, at times, interacted. Fluorescence transcriptomics can now quantify thousands of genes at a time, and this scale of data is now occasionally presented using visual summaries popular for RNA sequencing data, such as principal component analysis (PCA) [47], Uniform Manifold Approximation and Projection (UMAP) [48], and t-distributed stochastic neighbor embedding (t-SNE) [49, 50]. Conversely, the commercial introduction of scRNA-seq protocols with unique molecular identifiers (UMIs) has spurred the adoption of theoretical results from fluorescence transcriptomics for sequence census analysis [51–55]. Sequencing studies frequently use count distribution models that arise from stochastic processes, such as the negative binomial distribution, albeit without explicit derivations or claims about the data-generating mechanism [51, 56, 57]. These connections highlight the promise of mechanistic gene expression models: in principle, parameters can be fit to sequencing data to produce a physically interpretable, genome-scale model of transcriptional regulation in living cells, and some steps have been taken in this direction over the past decade [52–55, 58, 59].

RNA velocity methods are products of the sequence census paradigm: they draw heavily on low-dimensional embeddings and graphs derived from the raw data. Their current limitations stem from viewing biology through the lens of signal processing, where noise is something to be eliminated or smoothed out. We posit that it is more appropriate to view the data through the lens of quantitative fluorescence transcriptomics, in which noise is a biophysical phenomenon in its own right. Through this lens, modeling that decomposes variation into single-molecule (intrinsic) and cell-to-cell (extrinsic) [60] components, in addition to technical noise [61], is key. Beyond this conceptual issue, we find that an assessment of the impact of hyper-parameterized, heuristic data pre-processing and visualization in current RNA velocity workflows is useful for developing more reliable analyses.

Goals and findings

To fully describe what RNA velocity does, why it may fail, and how it can be improved, requires work on several fronts:

In the section “Workflow and implementations,” we describe an idealized “standard” RNA velocity workflow. We introduce the biophysical foundations presented in the original publication, outline the methodological choices implemented in the software packages, and enumerate the tunable hyperparameters left to the user.

In the section “Logic and methodology,” we probe the logic of the assumptions made in the workflow and describe potential failure points. This analysis revisits the outline through complementary critical lenses, adapted to the mechanistic and phenomenological steps. To characterize its biological coherence, we compare the concrete and implicit biophysical models to those standard in the field of fluorescence transcriptomics, and discuss the implications of assumptions that do not appear to be backed by a biophysical or mathematical argument. To characterize its stability, we test the quantitative effects of tuning hyperparameters and using different software implementations on real datasets.

Our findings on RNA velocity have implications for other scRNA-seq analyses. On one hand, the theory behind RNA velocity is not sufficiently robust. The models disagree with known biophysics: they do not recapitulate bursty production [62], and place needlessly restrictive constraints on regulatory trends. They are also internally inconsistent, as they do not preserve cell identities: genes are fit independently, so the same cells’ placement along putative trajectories differs between genes. Furthermore, the embedding processes are ad hoc and heavily rely on error cancellation, apparently discarding much of the data in the process. These problems are intrinsic, and derived methods inherit them.

Fortunately, better options, inspired by fluorescence transcriptomics models, are available. In order to develop a meaningful foundation for RNA velocity, we formalize its stochastic model and describe an inferential procedure that can be internally coherent and consistent with transcriptional biophysics. Furthermore, by examining the assumptions underpinning RNA velocity and reframing them in terms of stochastic dynamics, we find that the velocyto and scVelo procedures naturally emerge as approximations to our solutions. Our approach, presented in the section “Prospects and solutions,” provides an alternative to current trajectory inference methods: instead of using physically uninterpretable adjacency metrics and fitting a narrow set of topologies, it is relatively straightforward to solve many combinations of transient or stationary topologies and apply standard Bayesian methods to identify the best fit. Conceptually, instead of “denoising” data, our approach proposes fitting the molecule distributions and preserving the uncertainty inherent in noisy biological and experimental processes.

Workflow and implementations

We begin with a conceptual overview of an idealized RNA velocity workflow, with a description of implementation-specific choices. We focus on datasets with cell barcodes and UMIs, such as those generated by the 10x Genomics Chromium platform [63], as they provide the most natural comparison to discrete stochastic models later in the discussion (“Occupation measures provide a theoretical framework for scRNA-seq” under “Prospects and solutions”). We summarize the workflow in Fig 2, giving particular attention to the parameter choices required at each step. To clarify the information transfer in the process, we report the manipulations performed and the variables defined in a single run of the processing workflow in Fig B in S1 Text (as used to generate Fig 4 of [1]).

Fig 2. An RNA velocity workflow, beginning with read processing and ending with two-dimensional projection, and the parameters that must be specified by the user.

https://doi.org/10.1371/journal.pcbi.1010492.g002

Pre-processing

RNA velocity analysis begins by processing raw sequencing data to distinguish spliced and unspliced molecules. This is a genomic alignment problem. For example, reads aligning to intronic references are assigned to unspliced molecules, whereas reads spanning exon-exon splice junctions are assigned to spliced molecules. Data from reads associated with a single UMI are combined to generate a label of “spliced,” “unspliced,” or “ambiguous” for each molecule. “Ambiguous” molecules are omitted from downstream analysis, so the assignments are effectively binary.
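To make the combination rule concrete, the following is a minimal sketch of one plausible per-UMI labeling scheme; the `label_umi` function and its evidence encoding are hypothetical, and the actual rules differ between velocyto, STARsolo, and kallisto|bustools.

```python
# A minimal sketch of one plausible per-UMI labeling rule. The function and
# its evidence encoding are hypothetical; real pipelines differ in how they
# resolve multi-mapping reads and partially intronic alignments.
def label_umi(read_evidence):
    """Combine per-read evidence for a single UMI into one molecule label.

    read_evidence: one set per read, containing "intronic" if the read
    overlaps an intron and/or "junction" if it spans an exon-exon junction.
    """
    saw_intronic = any("intronic" in ev for ev in read_evidence)
    saw_junction = any("junction" in ev for ev in read_evidence)
    if saw_intronic and not saw_junction:
        return "unspliced"
    if saw_junction and not saw_intronic:
        return "spliced"
    return "ambiguous"  # conflicting or purely exonic evidence

print(label_umi([{"intronic"}, set()]))         # unspliced
print(label_umi([{"junction"}]))                # spliced
print(label_umi([{"intronic"}, {"junction"}]))  # ambiguous
```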

Until recently, traditional alignment and UMI counting software, such as Cell Ranger from 10x Genomics, discarded intronic information [63]. The same was true of pseudoalignment methods, as they identify transcript classes consisting of annotated, and presumably terminal, isoforms [64]. The explicit quantification of transient intron-containing molecules appears to have been introduced in the velocyto command-line interface [1]. Since then, existing workflows have added functionality for unspliced transcript quantification [27]. In particular, alignment can be performed via STARsolo [65] and dropEst [66], whereas pseudoalignment can be performed via kallisto|bustools [30] or salmon [27]. Benchmarking has shown discrepancies between the outputs of these workflows [27, 30], apparently due to differences in filtering, thresholding, and counting ambiguous reads. However, there is currently little principled reason to prefer one program’s results to another, as quantification rules largely follow velocyto, and assume a two-species model is sufficient.

Count processing

The raw count data are processed to smooth out noise contributions that can skew the downstream analysis. This step is generally combined with the standard quality control techniques for scRNA-seq [37]. First, cells with extremely low expression are filtered out. Then, a subset of several thousand genes with the highest expression and variation are selected. The counts are normalized by the number of cell UMIs to counteract technical and cell size effects. At this point, the PCA projection is computed from log-transformed spliced RNA counts. Finally, the normalized counts are smoothed out by nearest-neighbor pooling. To accomplish this, the algorithm computes the k nearest cell neighbors in a PCA space for each cell, then replaces the abundance with the neighbors’ average. This step is crucial, as it produces the cyclic or near-cyclic “phase portraits” used in the inference procedure.

The implementation specifics vary even between the two most popular packages, the Python versions of velocyto and scVelo. For example, there appears to be no consensus on the appropriate k or neighborhood definition for imputation. The original publication reports k between 5 and 550, calculated using Euclidean distance in 5–19 top PC dimensions [1]. By default, scVelo uses k = 30 in the top 30 PC dimensions [3].
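For concreteness, the sketch below implements the normalization and pooling step described above. The `pool_counts` helper is our own construction; the choices of 30 principal components and k = 30 follow the scVelo defaults, and real workflows add cell filtering and gene selection.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def pool_counts(S, U, n_pcs=30, k=30):
    """Normalize and kNN-pool spliced (S) and unspliced (U) cell x gene counts."""
    # Normalize by total counts per cell to counteract technical and size effects.
    size = S.sum(axis=1, keepdims=True)
    Sn, Un = S / size, U / size
    # Neighborhoods are defined in PCA space of log-transformed spliced counts.
    pcs = PCA(n_components=n_pcs).fit_transform(np.log1p(Sn))
    idx = NearestNeighbors(n_neighbors=k).fit(pcs).kneighbors(pcs)[1]
    # Replace each cell's abundances with the average over its neighborhood.
    return Sn[idx].mean(axis=1), Un[idx].mean(axis=1)

rng = np.random.default_rng(0)
S = rng.poisson(5.0, size=(200, 50))
U = rng.poisson(1.0, size=(200, 50))
S_smooth, U_smooth = pool_counts(S, U)  # inputs to the phase portraits
```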

Inference

The normalized and smoothed count matrices are fit to a biophysical model of transcription. The model structure for a single gene is outlined in Fig 3a. α(t) is a transcription rate, which has pulse-like behavior over the course of the trajectory. The constant parameters are β, the splicing rate, and γ, the degradation rate. Driving by α(t) induces continuous trajectories μu(t) and μs(t), which informally represent instantaneous averages, μ, of the unspliced, u, and spliced, s, species, governed by the following ordinary differential equations (ODEs):

$$\frac{d\mu_u(t)}{dt} = \alpha(t) - \beta\,\mu_u(t), \qquad \frac{d\mu_s(t)}{dt} = \beta\,\mu_u(t) - \gamma\,\mu_s(t) \tag{1}$$

Fig 3.

a. The continuous model of transcription, splicing, and degradation used for RNA velocity analysis. b. Plots of α(t), μu(t), and μs(t) over time t and the corresponding governing equations for the system. Dashed lines indicate time of switching event. c. Outline of the common phase portrait representation, with both steady state and dynamical models denoted. Adapted from [1]. The DNA and RNA illustrations are derived from the DNA Twemoji by Twitter, Inc., used under CC-BY 4.0.

https://doi.org/10.1371/journal.pcbi.1010492.g003

The qualitative behaviors of these functions are shown in Fig 3b. By fitting smoothed count data for a single gene, now interpreted as samples from a dynamical phase portrait governed by Eq 1 (Fig 3c), it is possible to estimate the ratio γ/β. Finally, with this ratio in hand, the velocity vi may be computed for each cell i:

$$v_i = \beta u_i - \gamma s_i, \qquad \Delta s_i = v_i\,\Delta t \tag{2}$$

where si and ui are cell-specific counts, Δt is an arbitrary small time increment, and Δsi is the change in spliced counts achieved over that increment.
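As a concrete illustration, the following minimal sketch integrates Eq 1 for a piecewise-constant α(t) and applies the extrapolation of Eq 2; all parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Piecewise-constant transcription rate: induction followed by repression.
alpha_on, alpha_off, t_switch = 10.0, 0.0, 5.0
beta, gamma = 1.0, 0.5  # splicing and degradation rates (illustrative)

def rhs(t, y):
    mu_u, mu_s = y
    a = alpha_on if t < t_switch else alpha_off
    return [a - beta * mu_u, beta * mu_u - gamma * mu_s]  # Eq 1

sol = solve_ivp(rhs, (0.0, 15.0), [0.0, 0.0], dense_output=True, max_step=0.01)

# Velocity and extrapolation for a "cell" observed at time t_obs:
t_obs, dt = 3.0, 0.1
u, s = sol.sol(t_obs)
v = beta * u - gamma * s  # Eq 2: the RNA velocity of the cell
s_future = s + v * dt     # extrapolated spliced abundance after dt
```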

The popular packages differ on the appropriate way to fit the rate parameters. The velocyto procedure presupposes that the system reaches equilibria at the low- and high-expression states of α(t), and approximates them by the extreme quantiles of the phase plots. By computing the slope of a linear fit to these quantiles, it obtains the parameter γ/β (Fig 3c). On the other hand, scVelo relaxes the assumption of equilibrium and implements a “dynamical” model, which fits the solution of Eq 1 to the entire phase portrait to obtain γ and β separately. This methodological difference corresponds to conceptual differences in the interpretation of imputed data. In velocyto, imputation appears to be an ad hoc procedure for filtering technical effects, in line with the usual usage [67, 68]. On the other hand, in scVelo, the imputed data are called “moments” and treated as identical to the instantaneous averages μu(t) and μs(t) of the process. In addition, scVelo offers a “stochastic” model, which posits pooled second moments are equivalent to the instantaneous second moments (e.g., the average of s² over neighbors is equated with the instantaneous second moment $\mathbb{E}[S_t^2]$).
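To make the velocyto-style steady-state fit concrete, the sketch below regresses u on s using only cells in the extreme quantiles of s; the quantile cutoff, offset handling, and synthetic data are all illustrative.

```python
import numpy as np

def fit_gamma_over_beta(u, s, q=0.05):
    """Estimate gamma/beta as the slope of u vs. s over the extreme quantiles."""
    lo, hi = np.quantile(s, [q, 1.0 - q])
    mask = (s <= lo) | (s >= hi)  # presumed steady-state cells
    A = np.column_stack([s[mask], np.ones(mask.sum())])
    slope, offset = np.linalg.lstsq(A, u[mask], rcond=None)[0]
    return slope, offset

rng = np.random.default_rng(1)
s = rng.gamma(2.0, 5.0, size=1000)
u = 0.4 * s + rng.normal(0.0, 0.5, size=1000)  # true slope 0.4
print(fit_gamma_over_beta(u, s))               # slope estimate near 0.4
```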

The genes are analyzed independently, generating a velocity vij for each cell i and gene j. As the velocyto procedure cannot separately fit βj and γj, its velocities have different units for different genes. On the other hand, the scVelo procedure does separately fit the rate parameters, albeit by assigning a latent time tij to each cell, distinct for each gene’s fit.

Embedding

Low-dimensional representations are generated using one of the conventional algorithms, such as PCA, t-SNE, or UMAP. These algorithms can be conceptualized as functions that map from a high-dimensional vector si to a low-dimensional vector E(si). The original publication offers two methods to convert a cell’s velocity vector vi to a low-dimensional representation [1].

If the embedding is deterministic (e.g., E is PCA on log-transformed counts), one can define a source point E(si), compute a destination point E(si + viΔt) = E(si + Δsi), and take the difference of these two low-dimensional vectors to obtain a local vector displacement:

$$\delta_i = E(s_i + \Delta s_i) - E(s_i) \tag{3}$$

This displacement is then interpreted as a scalar multiple of the cell-specific embedded velocity.

If the embedding is non-deterministic, one can apply an ad hoc nonlinear procedure. This procedure essentially computes an expected embedded vector by weighting the directions to k embedding neighbors; neighbors that align with Δsi are considered likely destinations for cell state transitions in the near future:

$$\hat{\delta}_i = \sum_{i' \in \mathcal{N}_i} w\!\left(\Delta s_i,\, s_{i'} - s_i\right) \frac{E(s_{i'}) - E(s_i)}{\lVert E(s_{i'}) - E(s_i) \rVert} \tag{4}$$

where w is a composition of the softmax operator (with a tunable kernel width parameter) with a measure of concordance between the arguments. Once an average direction is computed, it undergoes a set of corrections, e.g., to remove bias toward high-density regions in the embedded space. Finally, the cell-specific embedded vectors are aggregated to find the average direction over a region of the low-dimensional projection.
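A minimal sketch of the rule in Eq 4 follows; the cosine-similarity concordance measure and kernel width shown here are illustrative stand-ins for the several variants implemented in the packages.

```python
import numpy as np

def embed_velocity(i, S, E, ds, neighbors, kernel_width=0.1):
    """Eq 4 sketch: softmax-weighted average of directions to embedding neighbors.

    S: cell x gene counts; E: cell x 2 embedding; ds: gene-space displacement
    of cell i; neighbors: indices of cell i's embedding neighbors.
    """
    d_high = S[neighbors] - S[i]  # high-dimensional directions to neighbors
    cos = (d_high @ ds) / (
        np.linalg.norm(d_high, axis=1) * np.linalg.norm(ds) + 1e-12)
    w = np.exp(cos / kernel_width)
    w /= w.sum()                  # softmax over concordance scores
    d_low = E[neighbors] - E[i]   # embedded directions to the same neighbors
    d_low /= np.linalg.norm(d_low, axis=1, keepdims=True) + 1e-12
    return w @ d_low              # expected embedded velocity direction

rng = np.random.default_rng(2)
S = rng.poisson(5.0, size=(100, 20)).astype(float)
E = rng.normal(size=(100, 2))
arrow = embed_velocity(0, S, E, ds=rng.normal(size=20),
                       neighbors=np.arange(1, 31))
```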

The packages almost exclusively use the nonlinear embedding procedure. There is no consensus on the appropriate choice of embedding, number of neighbors, or measure of concordance. PCA, t-SNE, and UMAP have been used to generate low-dimensional visualizations [1, 3]. The original publication uses k between 5 and 300 and applies square-root or logarithmic transformations prior to computing the Pearson correlation between the velocity and neighbor directions [1]. In contrast, scVelo uses a recursive neighbor search by averaging over neighbors and neighbors of neighbors (with k = 30), and implements several variants of cosine similarity [3]. An optional step adjusts the embedded velocities by subtracting a randomized control; this correction is usually omitted in demonstrations of velocyto and implemented but apparently undocumented in scVelo.

As demonstrated in Fig 2, the linear PCA embedding is the simplest dimensionality reduction technique; it consists of a projection and requires fewer parameter choices than other methods. However, it is only consistently used in Revelio [11]. The velocyto package does not appear to have a native implementation of this procedure, although it is briefly demonstrated in the original article (Fig 2d and SN2 Figs 8–9 of [1]). On the other hand, scVelo does implement the PCA velocity projection, but disclaims the results of using it as unrepresentative of the high-dimensional dynamics.

Logic and methodology

To understand the implications of the choices implemented in various RNA velocity workflows, we examined the procedures from a biophysics perspective, with a view towards understanding the mechanistic and statistical meaning of methods implemented. In this section, we broadly discuss potential challenges, problematic assumptions, and contradictory results. In the following section, we draw on lessons learned and propose a modeling approach of our own.

Pre-processing

As outlined in “Pre-processing” under “Workflow and implementations,” several workflows are available for converting raw reads to molecule counts. These workflows largely follow the logic set out in the original implementation [1]; however, as pointed out by Soneson et al. [27], they produce different outputs from the same data. We reproduced their analysis on a broader selection of datasets (as reported in “Data availability”) in Fig C in S1 Text, according to the procedure outlined under “Pre-processing concordance.” The performance was broadly consistent with the previous benchmarking and the description under “Pre-processing” in the previous section: the methods agreed on the definition of “spliced” molecules, but different rules for the assignment of “unspliced” molecules led to discrepancies in counts. These discrepancies were particularly pronounced when comparing datasets one gene at a time, likely due to noise in tens of thousands of low-expressed genes (ρ by cell in Fig C in S1 Text; cf. lower triangle of Fig S10 in [27]).

However, a simple comparison between the software outputs obscures a far more fundamental challenge: the binary classification of transcripts as either spliced or unspliced is necessarily incomplete. The average human transcript has 4–7 introns [69], and a combinatorial number of potential transient and terminal isoforms. The vast majority of genes are alternatively spliced [70–72].

We can consider the hypothetical example of a nascent transcript with the structure E1I1E2I2E3, where Ii are introns and Ei are exons, as shown in Fig 4. If we place all UMIs with intronic reads into the “unspliced” category, we conflate the parent and intermediate transcripts. On the other hand, if we place all UMIs with splice junctions into the “spliced” category, we conflate the intermediate and terminal transcripts. Adding more complexity, some isoforms may retain introns through alternative splicing mechanisms; for example, the intermediate transcripts may be exported, translated, and degraded alongside the terminal isoform.

Fig 4. A two-intron mRNA species may not have well-defined “unspliced” and “spliced” forms.

https://doi.org/10.1371/journal.pcbi.1010492.g004

The binary model is not large enough to include the diversity of possible splicing dynamics, but approximately holds under fairly restrictive conditions: the predominance of a single terminal isoform, as well as the existence of a single rate-limiting step in the splicing process. Previous work reports that minor isoforms are non-negligible [70, 72], differential isoform expression is physiologically significant [72–74], and intron retention in particular is implicated in regulation and pathology [75–78]. Splicing rate data are more challenging to obtain, but targeted experiments [79], genome-wide imaging [80], and our preliminary mechanistic investigations [81] suggest that the selection and removal of individual introns are stochastic, but the overall splicing process has rather complex kinetics, not reducible to a single step.

RNA velocity biophysics

We will first inspect the complexity obscured by the simple schema given in Fig 3a. The velocity manuscripts use several distinct models for the transcription rate α(t). Furthermore, the amounts of the molecular species $U_t$ and $S_t$ (previously denoted informally by u and s) have incompatible interpretations. The following models make fundamentally different claims about the data-generating process and imply fundamentally different inference procedures.

  1. α(t) is piecewise constant over a finite time horizon; u and s are discrete (SN2 pp. 2–3, Fig 1a-b of [1]).
  2. α(t) is continuous and periodic; u and s are discrete (SN2 pp. 2–3, Fig 1e of [1]).
  3. α, β, and γ all smoothly vary over a finite time horizon according to an undisclosed function, with α exhibiting pulse behavior; u and s are discrete (SN2 Fig 5 of [1]).
  4. α, β, and γ all smoothly vary over a finite time horizon according to an arbitrary function; u and s are continuous (Fig 3 of [21]).
  5. α(t) is piecewise constant over a finite time horizon; u and s are continuous (Fig 1 of [1], Methods of [3]).
  6. α(t) is piecewise constant over a finite time horizon; u and s are continuous-valued but may contain discontinuities (Methods of [3]).
  7. α is constant; u and s are continuous (Fig 1b and SN1 pp. 1–2 of [1]). This formulation yields the reaction rate equation, and cannot produce the bimodal phase plots of interest.
  8. α is constant; u and s are discrete (SN1 pp. 2–3 of [1]). This is the stochastic extension [82] of the previous model, and cannot produce the bimodal phase plots of interest, as explicitly shown on page 3 of SN1 in [1].

These discrepancies make a comprehensive analysis challenging. Models 7–8 do not contain differentiation dynamics. Certain models are contrived; models 3–4 propose transcription rate variation without motivating the specific form, and model 6 introduces nonphysical discontinuities. Model 2 alludes to limit cycles in stochastic systems under periodic driving, an intriguing phenomenon in its own right [83, 84], but not otherwise explored in the scVelo and velocyto publications. For the rest of this report, we focus on the discrete formulation (model 1) and its continuous analog (model 5).

For the discrete formulation, the RNA velocity v should be interpreted as the time derivative of the expectation of a random variable St that tracks the number of spliced RNA, conditional on the current state (Section A in S1 Text):

$$v(u, s) = \frac{d}{dt}\,\mathbb{E}\!\left[S_t \mid U_t = u,\, S_t = s\right] = \beta u - \gamma s \tag{5}$$

For the continuous formulation, it should be interpreted as the time derivative of the deterministic variable st that tracks the amount of spliced RNA, initialized at the current state:

$$v(u, s) = \frac{d s_t}{dt}\bigg|_{(u_t,\, s_t) = (u,\, s)} = \beta u - \gamma s \tag{6}$$

These formulations happen to be mathematically identical, which creates ambiguity. Nevertheless, both are legitimate, if narrow, statements about the near future of a process initialized at a state with u unspliced and s spliced molecules. The questions that arise immediately before, and immediately after, the velocity computation procedure, are (1) what generative model should be fit to obtain β and γ and (2) even with a v, how much use can one make of it?
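The identity can be checked numerically: the sketch below estimates the discrete-model velocity of Eq 5 by Gillespie simulation from a fixed initial state and compares it with the closed form βu − γs; all rates and the initial state are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, gamma = 5.0, 1.0, 0.5
u0, s0, dt, n_runs = 8, 12, 0.05, 100_000

def ssa(u, s, t_end):
    """Simulate transcription, splicing, and degradation until t_end."""
    t = 0.0
    while True:
        rates = np.array([alpha, beta * u, gamma * s])
        total = rates.sum()
        t += rng.exponential(1.0 / total)
        if t > t_end:
            return u, s
        r = rng.choice(3, p=rates / total)
        if r == 0:
            u += 1          # transcription
        elif r == 1:
            u -= 1; s += 1  # splicing
        else:
            s -= 1          # degradation

s_mean = np.mean([ssa(u0, s0, dt)[1] for _ in range(n_runs)])
v_hat = (s_mean - s0) / dt
print(v_hat, beta * u0 - gamma * s0)  # both approximately 2.0
```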

Model definition

Continuous, deterministic models are fundamentally inappropriate in the low-copy number regime, which is predominant across the transcriptome [85–87]. Although continuous equations such as Eq 1 can represent the evolution of moments, they are insufficient for inference, as fitting average mRNA abundance amounts to invoking the central limit theorem for a very small, strictly positive quantity [88–92]. A comprehensive understanding of the stochastic noise model is necessary prior to making such approximations. Therefore, simulation methods that use a continuous model are immediately suspect [15]. We describe implementation-specific concerns in the subsections “Count processing” and “Inference” under “Logic and methodology.”

The motivation behind a pulse model of transcriptional regulation is obscure. Although dynamic processes certainly have transiently expressed genes [93–96], it is far from clear that this model applies across the transcriptome, to thousands of potentially irrelevant genes. Indeed, it is not even coherent with genes showcased in the original report (Fig 4d and Extended Data Fig 8b of [1]): only ELAVL4 appears to show a symmetric pulse of expression. Finally, even when this model does apply, the assumption of constant splicing and degradation rates across the entire lineage is a potentially severe source of error, with no simple way to diagnose it [21].

Most problematic is that even the discrete model is incoherent with known mammalian transcriptional dynamics. If we suppose induction and repression periods are relatively long, as for a stationary, terminal, or unregulated cell population, we arrive at genome-wide constitutive transcription, in which the rate of RNA production is constant. This contradicts numerous sources that suggest transcriptional activity varies with time even in stationary cells [62, 90, 97–103], and is effectively described by a telegraph model that stochastically switches between active and inactive states [104, 105].
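For illustration, the sketch below simulates the telegraph model and reports the mRNA Fano factor, which is well above the value of 1 implied by constitutive (Poisson) transcription; all rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
k_on, k_off, k_tx, k_deg = 0.2, 1.0, 20.0, 1.0  # illustrative rates

def telegraph_ssa(t_end=2_000.0, sample_every=5.0):
    """Sample the mRNA count of a stochastically switching gene over time."""
    t, gene_on, m = 0.0, 0, 0
    samples, next_sample = [], 10.0  # short burn-in before sampling
    while t < t_end:
        rates = np.array([k_on * (1 - gene_on), k_off * gene_on,
                          k_tx * gene_on, k_deg * m])
        total = rates.sum()
        t += rng.exponential(1.0 / total)
        while next_sample < min(t, t_end):
            samples.append(m)       # record the pre-event state
            next_sample += sample_every
        r = rng.choice(4, p=rates / total)
        if r == 0:
            gene_on = 1   # activation
        elif r == 1:
            gene_on = 0   # inactivation
        elif r == 2:
            m += 1        # transcription while active
        else:
            m -= 1        # degradation
    return np.array(samples)

m = telegraph_ssa()
print(m.var() / m.mean())  # Fano factor well above the Poisson value of 1
```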

Thus, we must impose basic consistency criteria. Using the models outlined under “RNA velocity biophysics” requires the assumption that stationary, homogeneous cell populations are Gaussian or Poisson-distributed. This assumption contradicts at least thirty years of evidence for widespread bursty transcription [98, 105]. We have obtained the answer to question (1) in the previous subsection: the model must be coherent with known biophysics, and provide a robust way to identify cases when its assumptions fail.

Count processing

Some standard properties of constitutive systems appear to at least qualitatively motivate gene filtering. Only genes with spliced–unspliced Pearson correlation above 0.05 are used for fitting parameters (as on p. 4 of SN2 in [1]); if the correlation is below this threshold, the gene is removed from the procedure and presumed stationary. This is valid for the constitutive model, but inappropriate for broader model classes: for example, bursty transcription yields strictly positive correlations, making this statistic ineffective for identifying dynamics [81, 106].

Normalization relative to the cell’s molecular count is a standard feature of sequencing workflows [37, 107, 108], but reduces interpretability. Normalization converts absolute discrete count data to a proportion of the total cellular counts, ostensibly to account for the compositional nature of read-count data [109]. Several recent studies strongly discourage normalization of UMI-based counts [110, 111], although this perspective is not universal [112, 113]. It is clear that continuous-valued normalized data are incompatible with discrete mechanistic models. Moreover, the suitability of continuous models (such as Eq 1) is never explicitly justified, but merely assumed. Since normalization nonlinearly transforms the molecule distributions [68, 111] and introduces a coupling even between independent genes, the precise interpretation of single-gene ODE models is unclear.

Nearest-neighbor averaging is used to smooth the data after normalization. Though it effaces much of the stochastic noise to give an “averaged” trajectory, it introduces distortions of unknown magnitude. As discussed in the subsection “Inference” under “Workflow and implementations,” the imputation step does not have a consistent interpretation. The original report [1] defines it as “kNN pooling” in the manuscript and “imputation” in the documentation (and Fig 17 of SN2), placing the emphasis on denoising. On the other hand, scVelo interprets the local average as an estimate of the expectations μu(t), μs(t). Neither approach appears to be justified by previous studies or benchmarking on ground truth, and both are circular as the neighborhood is computed based on the observed counts. A probabilistic analysis in Section B in S1 Text formalizes more deep-seated issues with using model-agnostic point estimates to “correct” data. Although these claims may hold and simply require more theoretical work to prove, our simulations in “Count processing” under “Prospects and solutions” strongly suggest they are invalid even in the best-case scenario: the phase portraits are smoothed out, but fail to capture the underlying dynamics in a way coherent with those claims.

To illustrate these problems, we performed a simple test of self-consistency, illustrated in Fig 5. We reprocessed the forebrain dataset (Fig 4 of [1]) using the velocyto workflow, varying k, and investigated its effect on the appearance of the phase plots and the inferred parameters. As the neighborhood size was increased, the phase plot was distorted, with no apparent “optimal” choice of k.
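The sketch below reproduces the spirit of this control on synthetic data for a single gene; here, pooling is performed directly on the spliced counts rather than in PCA space, and all values are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
s = rng.gamma(2.0, 5.0, 2_000).astype(float)
u = rng.poisson(0.4 * s).astype(float)  # synthetic unspliced counts

for k in [1, 10, 50, 200]:
    # Pool each cell with its k nearest neighbors (here, in s alone).
    idx = NearestNeighbors(n_neighbors=k).fit(s[:, None]).kneighbors(s[:, None])[1]
    s_k, u_k = s[idx].mean(axis=1), u[idx].mean(axis=1)
    # velocyto-style fit on the extreme quantiles of the pooled data.
    lo, hi = np.quantile(s_k, [0.05, 0.95])
    mask = (s_k <= lo) | (s_k >= hi)
    A = np.column_stack([s_k[mask], np.ones(mask.sum())])
    slope, q = np.linalg.lstsq(A, u_k[mask], rcond=None)[0]
    frac_up = np.mean(u_k > slope * s_k + q)  # fraction of "upregulated" cells
    print(f"k={k:4d}  gamma/beta={slope:.3f}  upregulated={frac_up:.3f}")
```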

Fig 5. Distortions in data and instabilities in the inferred γ/β values introduced by the imputation procedure on the forebrain data from [1].

Column 1: Raw data (points: spliced and unspliced counts with added jitter; color: cell type, as in Fig 1; line: best fit line u = γs/β + q, estimated from the entire dataset). Columns 2–4: Normalized and imputed data under various values of k (points: spliced and unspliced counts; color: cell type, as in Fig 1; line: best linear fit u = γs/β + q, estimated from extreme quantiles). Column 5: Inferred values of γ/β (red, left axis) and inferred fraction of upregulated cells, defined as $M^{-1}\sum_i \mathbb{1}\!\left[u_i - (\gamma s_i/\beta + q) > 0\right]$ for M cells (blue, right axis).

https://doi.org/10.1371/journal.pcbi.1010492.g005

Inference

Broadly speaking, velocyto-like moment estimates for γ/β are legitimate if the system has time to equilibrate (as outlined under “The ‘deterministic’ velocyto model as a special case”). However, moment-based estimation underperforms maximum likelihood estimation in general. The two approaches are in concordance under the highly restrictive assumptions of error normality and homoscedasticity. These assumptions are routinely violated in the low-copy number regime [88].

Regression on top and bottom quantiles inherits all of the issues of regression on the entire dataset, but compounds them by discarding a large fraction of data. Extremal quantile regression is otherwise a well-developed method [114–117], but it is generally applied to processes with nontrivial tail effects. The filtering criterion used for the quantile computation is ad hoc, and not amenable to theoretical investigation. The order statistics of discrete distributions are notoriously challenging to compute [118–120], and even the simplest Poisson case exhibits complex trends [121]. In other words, the extrema themselves may be affected by noise, introducing more uncertainty into inference. Although the original article does perform some validation (SN2, Sec. 3 of [1]), it focuses on cell-specific velocities rather than parameter values, and only provides relative performance metrics rather than actual comparisons to simulated ground truth.

Even without testing the inference procedures against simulations, we can characterize their performance in terms of internal controls. As we demonstrate in Fig 5, the inferred γ/β values were unstable under varying k: the velocyto parameter inference procedure was highly sensitive to a user-defined neighborhood hyperparameter. On the other hand, using a simple ratio of the means (as in the first column and the k = 0 case in the fifth column of Fig 5) produced biases [1].

Interestingly, the fraction of cells predicted to be upregulated is qualitatively more stable, suggesting that the inference step is best understood as an ad hoc binary classifier, rather than a quantitative descriptor of system state. Given the stability of this classifier, as well as our preliminary discussion of similar results in the context of validating protaccel [4], we used this binary classifier as a benchmark in the subsections named “Embedding” under “Logic and methodology” and “Prospects and solutions.”

Regression of the piecewise deterministic “dynamical” model in scVelo asserts that the imputed counts have normally distributed noise with equal residual variance for the spliced and unspliced species, once again implausible in the low-copy number regime. More fundamentally, it fails to preserve gene-gene coherence. If a cell is predicted to lie at the beginning of a trajectory for one gene, this estimate does not inform fitting for any other gene. The “dynamical” model appears to address this discrepancy in a post hoc fashion. First, the algorithm identifies putative “root cells,” which are themselves computed from the velocity embedding. Then, the disparate gene-specific times are aggregated into one by computing a quantile near the median. This procedure presupposes that the velocity graph is self-consistent and physically meaningful, and that the point estimate of process time is sufficient, but does not mathematically prove these points or test them by simulation.

Embedding

After inference and evaluation of Δs for every cell and gene, the array is converted to an embedding-specific representation. In the single-cell sequencing field, low-dimensional projections are more than a visualization method: they are ubiquitous tools for inference and discovery. Transcriptomics workflows convert large data arrays to human-parsable visuals; these visuals are then used to explore gene expression and validate cell type relationships, under the assumption that they represent the underlying data well enough to draw conclusions. However, the embedding procedures involve several distortive steps, which should be recognized and questioned.

For such visuals, the goal is to recapitulate local and global cell-cell relationships. However, accurately representing desired properties such as pairwise relationships between many points is inherently difficult, requiring dimensions several orders of magnitude larger than two to faithfully represent the data [122]. Thus distortion of cell-cell relationships is naturally induced in two-dimensional embeddings, and grows with the number of cells M [122, 123]. Both linear (PCA) and nonlinear (t-SNE/UMAP) methods exhibit these distortions, and warp existing cell-cell relationships or suggest new ones not present in the underlying data [122, 124]. Tuning algorithm parameters can slightly improve some distortion metrics, though often at the expense of others [125]. Essentially, nonlinear embeddings utilize sensitive hyperparameters that can be tuned, but do not provide well-defined criteria for an “optimal” choice [124, 125]. Using visualizations for discovery thus risks confirmation bias.

The velocity algorithms present a particularly natural criterion for quantifying the embedding distortion. The nonlinear embedding procedure generates weights for vectors defined with reference to embedding neighbors. Therefore, we can reasonably investigate the effect of the embedding on the neighborhood definitions. In other words, if the velocity arrows quantify the probability of transitioning to a set of cells, what relationship does this set have to the set of neighbors in the pre-embedded data?

This relationship is conventionally [124] quantified by the Jaccard distance, defined by the following formula:

$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{7}$$

which reports the normalized overlap between the original sets of cell neighbors (A) and embedded cell sets (B). This dissimilarity metric ranges from 0 to 1, where 1 (or 100%) denotes completely non-overlapping sets. We applied standard steps of dimensionality reduction and smoothing to the forebrain dataset (Fig 4 of [1]) and computed their effect on the neighborhoods (taken to be k = 150 for consistency with the velocity embedding process). We report the Jaccard distance distributions in Fig 6, and observe the gradual degradation of neighborhoods. On average, moving from the ambient high-dimensional space to a two-dimensional representation induced a dJ of 70–75%. Therefore, cell embedding substantially distorts precisely the local structure relevant to velocity embedding.
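The computation behind Fig 6 can be sketched as follows: compare each cell’s k nearest neighbors in the ambient space with those in a two-dimensional PCA and report the Jaccard distances of Eq 7. The choice k = 150 follows the text; the data here are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def knn_sets(X, k):
    """Return each point's set of k nearest neighbors (excluding itself)."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1]
    return [set(row[1:]) for row in idx]

rng = np.random.default_rng(6)
X = rng.normal(size=(1_000, 50))          # ambient "expression" space
E = PCA(n_components=2).fit_transform(X)  # two-dimensional embedding
k = 150
ambient, embedded = knn_sets(X, k), knn_sets(E, k)
d_J = [1 - len(a & b) / len(a | b) for a, b in zip(ambient, embedded)]  # Eq 7
print(np.mean(d_J))                       # average neighborhood distortion
```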

Fig 6. Normalization followed by two rounds of dimensionality reduction introduce distortions in the local neighborhoods.

a.–d. Histograms of Jaccard distances between intermediate embeddings. e. Empirical cumulative distribution functions of Jaccard distances between intermediate embeddings, as well as the overall distortion (Ambient vs. PCA 2 and Ambient vs. tSNE 2). The palette used is derived from dutchmasters by EdwinTh.

https://doi.org/10.1371/journal.pcbi.1010492.g006

The two-dimensional arrows in Fig 1 combine three sources of error: the intrinsic information loss of low-dimensional projections, the instabilities in upstream processing and inference, and any additional error incurred by the nonlinear procedure outlined in “Embedding” under “Workflow and implementations.” The softmax kernel-based procedure exhibits an inherent tension that merits closer inspection. On one hand, it is explicitly designed [1] to mitigate error incurred by cell- and gene-specific noise by performing several steps of pooling and smoothing. On the other hand, it is technically questionable: if we assume differentiation processes are largely governed by a small set of “marker genes,” pooling them with thousands of non-marker genes amounts to hoping that variation in an orthogonal data-generating process cancels out well enough to recapture latent dynamics. Certain processes may involve the modulation of large sets of genes, e.g., if expression overall increases over the transcriptome. However, the velocity workflows are intrinsically unable to identify such trends, as they use normalized data. As we demonstrate later, a model with no latent dynamics at all (Fig F in S1 Text) can generate apparent signal in the embedded space, illustrating the dangers of relying on error cancellation. When multiple data-generating processes are present, naïve aggregation risks obscuring rather than revealing signal.

Aside from this high-level inconsistency, other problems emerge upon investigating the embedding procedure closer, even prior to performing any numerical controls. The nonlinear embedding approach introduced by La Manno et al. (Eq 4) is highly hyperparametrized, not motivated by any previous theory, has no physical interpretation, and does not appear to have been formally validated against linear or simulated ground truth. Just as with the cell embedding, the procedure is dependent on an arbitrary number of nearest neighbors and velocity transformation functions, with no clearly optimal choices. These hyperparameters can be tuned to correct for such instabilities, potentially resulting in overfitting to a pre-determined hypothesis. Since the procedure has no physical basis, potential false discoveries are challenging to diagnose. Furthermore, it reduces the limited biophysical interpretability of the result, particularly because the relationship between cell state graphs and the underlying physical process is obscure and subject to distortions (Section C in S1 Text). The velocity derivation is model-informed and, as discussed under “The ‘deterministic’ velocyto model as a special case” and “The ‘dynamical’ scVelo model as a special case” in the next section, can be informally viewed as an approximation under several strong assumptions about the process biophysics. The embedding, on the other hand, is ad hoc and can only degrade the information content.

A final theoretical point remains before we can begin quantitatively validating the embeddings: as suggested by Eq 2, and discussed under “Inference” in the previous section, the velocyto gene-specific vj have different units. Therefore, the aggregation in Eq 4 is questionable. The standard velocyto workflow assumes that the splicing rates are close enough to neglect differences, which appears to contradict other results reported in the same paper (Extended Data Fig 2f of [1]).

To bypass this limitation in a self-consistent way, we implemented a “Boolean” or binary measure of velocity, as motivated by validation in the original manuscript (Sec. 3 in SN2 of [1]), introduced in the context of validating protaccel [4], and implied by resampling β values from a uniform distribution in an investigation of latent landscapes (p. 3 in Supplementary Methods of [29]). Essentially, instead of computing transition probabilities based on the velocity values, we computed them based on signs, bypassing the unit inconsistency. The algorithm used to produce this embedding is described under “Velocity embedding.”
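A minimal sketch of the sign-based concordance underlying this Boolean variant follows; the agreement fraction used here is one plausible choice of concordance measure, not the packages’ exact rule.

```python
import numpy as np

def boolean_concordance(ds_i, d_neighbors):
    """Score neighbors by the fraction of genes whose sign of change agrees.

    ds_i: gene-space velocity displacement of cell i (length n_genes);
    d_neighbors: k x n_genes displacements toward embedding neighbors.
    """
    return (np.sign(ds_i) == np.sign(d_neighbors)).mean(axis=1)

rng = np.random.default_rng(7)
scores = boolean_concordance(rng.normal(size=30), rng.normal(size=(15, 30)))
w = np.exp(scores / 0.1)
w /= w.sum()  # softmax weights, used in place of w in Eq 4
```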

The Boolean procedure offers a natural internal benchmark. If this approach largely recapitulates findings from the standard methods, the embedding process serves as an information bottleneck: the inference procedure performs as well as a binary classifier, and the complexities of the dynamics are effaced by embedding. We used this approach as a trivial baseline and compared it to the standard suite of variance-stabilizing transformations implemented in velocyto. In addition, we tested the effects of neighborhood sizes, in the vein of the stability analysis performed in the original manuscript (Sec. 11 in SN2 of [1]). In Fig D in S1 Text, we plot the distributions of angle deviations between a linear baseline, obtained by projecting the extrapolated cell state and computing E(si + Δsi) − E(si) using PCA, and the nonlinear velocity embedding. This control has not previously been investigated in any detail, but seems key to the claim that the nonlinear velocity embedding is meaningful: intuitively, we expect it to recapitulate the simplest baseline, at least on average. To avoid any confusion, we reiterate that the linear embedding is given by Eq 3, and not the identity nonlinear embedding implemented in velocyto (i.e., ϱ(x) = x in SN1, pp. 9–10 of [1]).

The angle deviations in arrow directions were all severely biased relative to the linear baseline. The different normalization methods were distortive to approximately the same degree. The performance of the Boolean embedding, which discards nearly all of the quantitative information, was nearly identical to the built-in methods, which suggests that the choice of normalization methods is a red herring: quantitative velocity magnitudes have little effect on the embedding quality. This is consistent with previous investigations (cf. Fig S52 in [4]). On the other hand, the neighborhood sizes did not appear to matter much, at least over the modest range explored here (in contrast to Sec. 11 in SN2 of [1]). Therefore, the directions reported in embeddings were unrepresentative of the actual velocity magnitudes in high-dimensional space, as well as severely distorted relative to the linear projection. These discrepancies are a potential cause for concern. Observing the qualitative similarity of Figs 2d and 2h in the original report [1], the reasonable performance of the linear extrapolation in t-SNE in its supplement (SN2 Fig 9a of [1]), as well as the cell cycle dynamics explored with the linear embedding in Revelio [11], a casual reading of the RNA velocity literature would suppose these embeddings to be largely interchangeable.

Finally, we visualized the aggregated velocity vectors in Fig 7 to assess the local and global structures. This visualization served as both an internal and an external control. The internal control demonstrated the local structure and the stability of the velocity-specific methods, i.e., the actual directions of the arrows on the grid. We compared the conventional nonlinear projection to the Boolean method, as well as the linear embedding. The external control concerned the global structure, which can be analyzed in light of known physiological relationships: radial glia differentiate through neuroblasts into neurons [1]. If this global relationship is not captured by the embedding, the inferred trajectories are a priori physically uninterpretable, in a way that is particularly challenging to diagnose.

Fig 7. Performance of cell and velocity embeddings on the forebrain data.

Top: PCA embedding with linear baseline and nonlinear aggregated velocity directions. Bottom: UMAP and t-SNE embeddings with nonlinear velocity projections. The palette used is derived from dutchmasters by EdwinTh.

https://doi.org/10.1371/journal.pcbi.1010492.g007

In the PCA embedding, the global structure was retained and the arrows were fairly robust, even when the non-quantitative Boolean method was used. However, the various projection options suggested drastically different relationships between the cell types, with PCA presenting more continuous representations of cell relationships faithful to ground truth, and UMAP and t-SNE presenting more local images, with distinct and discrete clusters of cells. Clearly, if the relationship between progenitor and descendant is lost, the velocity workflow cannot infer it. The t-SNE and UMAP parameters can be adjusted by the user; however, adding a new set of tuning steps and optimizations provides an opportunity for confirmation bias to overrule the data.

Summary

The standard RNA velocity framework presupposes that the evolution of every gene’s transcriptional activity throughout a differentiation or cycling process can be described by a continuous model with a single upregulation event and a single downregulation event. It proceeds to normalize and smooth the data until the rough edges of single-molecule noise are filed off, then fits the continuous model assuming Gaussian residuals.

In the process, the stochastic dynamics that predominate in the low-copy number regime, and that characterize nearly all of mammalian transcription, are lost and cannot be recovered. Although parameters can be fit, they are distorted to an unknown extent, due to a combination of data transformation, suboptimal inference, and unit incompatibilities. The gene-specific components of velocity are underspecified due to their direct dependence on the imputation neighborhood and splicing timescale. In scVelo, parameters are estimated under a highly restrictive model, yet applied to make broad claims about complex topologies. In velocyto, only the sign of the velocity is physically interpretable. It can still be used to calculate low-dimensional directions, and this binary velocity embedding is seemingly as good as any other, suggesting that other methods fail to fully utilize valuable information. However, the embedding process itself is not based on biophysics, and is not guaranteed to be stable or robust. Fortunately, the natural match between stochastic models and UMI-aided molecule counting offers hope for quantitative and interpretable RNA velocity.

Prospects and solutions

Is there no balm in Gilead? Given the foundational issues we have raised, how can the RNA velocity framework be reformulated to provide meaningful, biophysically interpretable insights? We propose that discrete Markov modeling can directly and naturally address the fundamental issues. In particular, transient and stationary physiological models can be defined and solved via the chemical master equation (CME), which describes the time evolution of a discrete stochastic process. Since the “noise” is the data of interest, in such an approach smoothing is not required. Rather, technical and extrinsic noise sources can be treated as stochastic processes in their own right, and explicit modeling of them can improve the understanding of batch and heterogeneity effects. Finally, within this framework, parameters can be inferred using standard and well-developed statistical machinery.

Pre-processing

The diversity of potential intermediate and terminal transcripts suggests that simplistic splicing models are inadequate for physiologically faithful descriptions of transcription dynamics. What is needed is a treatment of the types of transcripts listed in “Pre-processing” under “Logic and methodology” as distinct species. This approach immediately leads to several significant challenges, relating to quantification, biophysics, and identifiability.

Transient, low-abundance intermediate transcripts are substantially less characterized than coding isoforms. Some data are available from fluorescence transcriptomics with intron-targeted probes [47], but such imaging is impractical on a genome-wide scale. Unfortunately, the references and computational infrastructure necessary to identify intermediate transcripts do not yet exist.

Even if intermediate isoforms could be perfectly quantified, single-cell RNA-seq data do not generally contain enough information to identify the order of intron splicing. The problem of splicing network inference has been examined; however, experimental approaches [126, 127] are challenging to scale, whereas computational approaches [128] do not generally have enough information to resolve ambiguities.

Furthermore, even with complete annotations and a well-characterized splicing graph at hand, large-scale short-read sequencing cannot fully resolve transcripts. This limitation gives rise to a challenging inference problem. For example, if transcripts A = E1I1E2I2E3 and B = E1E2I2E3 are indistinguishable whenever only the 3’ end of each molecule is sequenced, it is necessary to fit parameters through the random variable XA + XB, i.e., from aggregated data. The functional form of this random variable’s distribution is not yet analytically tractable.

We have described a preliminary method that can partially bypass these problems [81]. Sequencing “long” reads, which at this time is possible with technologies such as Oxford Nanopore [72], or sequencing of “full-length” libraries produced with methods such as Smart-seq3 [129], enhances identifiability and facilitates the construction of new annotations based on presence or absence of intron sequences. Finally, even though additional data are required to specify entire splicing networks, sequencing data are sufficient to constrain parts of these networks; for example, if two transcripts differ by one intron, the longer one cannot possibly be generated from the shorter.

Defining more species leads to inferential challenges in downstream analysis. Even if sequencing data are available, their relationship to the biological counts is nontrivial: some intermediate transcripts may not be observable using certain technologies because they do not contain sequences necessary to initiate reverse priming, whereas others may be over-represented in the data because they contain many such sequences. In a preliminary investigation [130], which adopts the binary categories of “spliced” and “unspliced” defined in the original RNA velocity publication, we found that unspliced molecules originating from long genes are overrepresented in short-read sequencing datasets. This suggests that multiple priming occurs at intronic poly(A) sequences. To “regress out” this effect, a simple length-based proxy for the number of poly(A) stretches can be used, but a more granular description would require a sequence-based kinetic model for each intermediate transcript’s capture rate.

Occupation measures provide a theoretical framework for scRNA-seq

The data simulated in the exposition of RNA velocity [1] come from a particular set of what are called Markov chain occupation measures. As an illustration of what this means, we consider the simplest, univariate model of transcription, a classical birth-death process:

$$\varnothing \xrightarrow{\alpha} \mathcal{X} \xrightarrow{\beta} \varnothing \qquad (8)$$

where α is a constant transcription rate and β is a constant efflux rate. Depending on the system, β may have various biophysical interpretations, such as splicing, degradation, or export from the nucleus [81, 106].

Formally, the exact solution to this system is given by the chemical master equation (CME), an infinite series of coupled ordinary differential equations that describe the flux of probability between microstates x, which specify the integer abundance of $\mathcal{X}$, defined on $\mathbb{N}_0$:

$$\frac{dP(x;t)}{dt} = \alpha\left[P(x-1;t) - P(x;t)\right] + \beta\left[(x+1)\,P(x+1;t) - x\,P(x;t)\right] \qquad (9)$$

with the convention P(−1;t) ≔ 0.

This equation encodes a full characterization of the system: transcription is zeroth-order, efflux is first-order, and the dynamics are memoryless, i.e., depend only on the state at t. We define the quantity y, namely the solution to the underlying reaction rate equation that governs the average of the copy number distribution:

$$y(t) = y_0 e^{-\beta t} + \frac{\alpha}{\beta}\left(1 - e^{-\beta t}\right) \qquad (10)$$

where y0 is the average at time t = 0. To simplify the analysis, we assume that the initial condition is Poisson-distributed. Per classical results [82], the distribution of counts P(x;t) is described by a Poisson law for all t, and converges to Poisson(α/β) as t → ∞. Quantitatively, the time-dependent distribution is given by P(x;t) ∼ Poisson(y(t)):

$$P(x;t) = \frac{y(t)^x e^{-y(t)}}{x!} \qquad (11)$$
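
As a concrete illustration, the CME in Eq 9 can be integrated numerically on a truncated state space and compared against the Poisson solution of Eqs 10–11. The following Python sketch (our own minimal check, not taken from any velocity package; all parameter values are illustrative) verifies the agreement.

# A minimal sketch (not from the paper): integrate a truncated version of the
# CME in Eq 9 and compare with the Poisson solution of Eqs 10-11.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import poisson

alpha, beta = 10.0, 1.0      # illustrative transcription and efflux rates
y0 = 2.0                     # Poisson mean of the initial condition
n_max = 60                   # state-space truncation
x = np.arange(n_max + 1)

def cme_rhs(t, p):
    # dP(x)/dt = alpha*(P(x-1) - P(x)) + beta*((x+1)*P(x+1) - x*P(x))
    dp = -(alpha + beta * x) * p
    dp[1:] += alpha * p[:-1]
    dp[:-1] += beta * x[1:] * p[1:]
    return dp

t_final = 2.0
sol = solve_ivp(cme_rhs, (0, t_final), poisson.pmf(x, y0), rtol=1e-8, atol=1e-10)
y_t = y0 * np.exp(-beta * t_final) + (alpha / beta) * (1 - np.exp(-beta * t_final))
print(np.abs(sol.y[:, -1] - poisson.pmf(x, y_t)).max())   # ~1e-8: still Poisson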

However, this is not the correct model class for distributions observed in scRNA-seq datasets. To appreciate these subtleties, we delve into and interrogate assumptions that underpin the use of such distributions.

The sequencing process does not “know” anything about the transcriptional dynamics and their time t. This stands in contrast to transcriptomics performed in vitro, with a physically meaningful experiment start time. For example, in many standard protocols, a stimulus is applied to the cells at time t = 0, and populations of cells are chemically fixed and profiled at subsequent time points, potentially up to a nominal equilibrium state [2, 34, 88, 131]. However, if there is no experimentally imposed timescale, and we adopt the standard assumption that cell dynamics are mutually independent, the process time decouples from experiment time. Although cells are sampled simultaneously, their process times t are draws from a random variable that must be defined.

Formalizing this framework requires introducing the notion of occupation measures. Considering a single cell, we designate its process time t as a latent process time (Fig 8a), the mechanistic analogue to the phenomenological pseudotime. In brief, “pseudotime” conventionally denotes a one-dimensional coordinate along a putative cell trajectory, which parametrizes a principal curve in a space based on RNA counts [37] (Fig 8b and 8c). On the other hand, the process time is a real time coordinate, which governs the “clock” of the stochastic process. Broadly speaking, this is the physical quantity which the “latent time” discussed in the exposition of scVelo attempts to approximate. The difference between pseudotimes and process times is fundamental. The Markov chain process time is physically interpretable as the progress of a process that induces the observations in expression space. Conversely, the expression pseudotime is purely phenomenological (Fig 8c), and we are unaware of any trajectory inference methods that explicitly parameterize the underlying stochastic model using the CME; instead, all available implementations appear to use isotropic or continuous noise models [37, 108, 132–140]. As we emphasize in “Model definition” under “Logic and methodology,” these models are inappropriate for low-abundance molecular species.

Fig 8. Markov Chain process time versus expression pseudotime.

a. Simulated gene expression for 2000 cells over 4 states (states A, B, C, and D) with a bifurcation at C/D showing spliced counts of a single gene at the sampled process times. The abbreviation a.u. denotes arbitrary units. b. Ordering of all cells by expression pseudotime coordinate, calculated as the Euclidean distance between each cell and the root cell (cell at time t = 0) and scaled to between 0 and 1. c. The sampled cells colored by the calculated pseudotime value.

https://doi.org/10.1371/journal.pcbi.1010492.g008

By construction, cell trajectories are observed at random times t (Fig 8a). This requires introducing a sampling distribution f(t), which describes the probability of observing a cell at a particular underlying process time. Therefore, the probability of observing x molecules of $\mathcal{X}$ in the constitutive case takes the following form:

$$P(x) = \int P(x;t)\, f(t)\, dt \qquad (12)$$

i.e., the expectation of P(x;t) under the sampling law. P(x) is called the occupation measure of the process, and reports the probability that a trajectory is observed to be in state x, a slight generalization of the usual definition [141–143].

Next, we must encode the assumption that cell observations are desynchronized from the sequencing process and each other. This assumption leads us to a choice consistent with the previous reports [1, 21], namely df = T−1dt, where [0, T] is the process time interval observable by the sequencing process. This constrains the probability of observing state x to be the actual fraction of time the system spends in that state. Then, we take T → ∞, yielding (13) which is a statement of the ergodic theorem [144]. Under mild conditions, this theorem guarantees that samples from unsynchronized trajectories converge to the same distribution as the far more tractable ensembles of synchronized trajectories.
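
The occupation measure of Eq 12 is straightforward to compute by quadrature. The following sketch (our own illustration, assuming the uniform sampling law and the Poisson initial condition above) verifies the ergodic limit of Eq 13 numerically: as T grows, the occupation measure converges to Poisson(α/β).

# A minimal sketch (uniform sampling law, Poisson(y0) initial condition):
# compute the occupation measure of Eq 12 by quadrature and verify the
# ergodic limit of Eq 13 as the observation window T grows.
import numpy as np
from scipy.stats import poisson

alpha, beta, y0 = 10.0, 1.0, 2.0    # illustrative values
x = np.arange(61)

def occupation_measure(T, n_quad=2000):
    t = (np.arange(n_quad) + 0.5) * T / n_quad    # midpoint quadrature nodes
    y = y0 * np.exp(-beta * t) + (alpha / beta) * (1 - np.exp(-beta * t))
    return poisson.pmf(x[:, None], y[None, :]).mean(axis=1)

stationary = poisson.pmf(x, alpha / beta)
for T in (1.0, 10.0, 100.0):
    err = np.abs(occupation_measure(T) - stationary).max()
    print(f"T = {T:6.1f}: max deviation from Poisson(alpha/beta) = {err:.2e}")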

With this discussion, we have clarified that the application of stationary distributions $\lim_{t\to\infty} P(x;t)$ to describe ostensibly unsynchronized cells naturally emerges from assumptions about biophysics and the nature of the sampling process. However, these assumptions may be violated; for example, RNA velocity describes molecules sampled from a transient process. This distinction is key: limits such as $\lim_{t\to\infty} P(x;t)$ may not even exist, and we expect to capture only a portion of the trajectory. A rigorous probabilistic model must treat the occupation measure directly, as it remains valid without those assumptions. Formally, this amounts to relaxing the assumption of desynchronization: the sequencing process is time-localized to a particular interval of the underlying biological process.

To stay consistent, we continue using the sampling law df = T−1dt on [0, T]; sampling laws with infinite support are also valid, so long as they decay rapidly enough to be integrable. As scRNA-seq data are atemporal, this time coordinate is unitless and cannot be assigned a scale without prior information, so T can be defined arbitrarily without loss of generality.

The occupation measure of the birth-death process takes the following form:

$$P(x) = \frac{1}{T}\int_0^T \frac{y(t)^x e^{-y(t)}}{x!}\, dt \qquad (14)$$

This integral can be solved exactly; however, this solution does not easily generalize to more complex systems. Instead, we can consider the probability-generating function (PGF), which also takes a remarkably simple form:

$$G(z;t) = \mathbb{E}\left[z^{X(t)}\right] = e^{(z-1)\,y(t)} \qquad (15)$$

By linearity, the generating function H(z) of the occupation measure is the expectation of the generating function G(z;t) of the original process with respect to the sampling measure f. From standard properties of the birth-death process, this yields:

$$H(z) = \frac{1}{T}\int_0^T e^{(z-1)\,y(t)}\, dt = \frac{e^{(z-1)\,\alpha/\beta}}{\beta T}\left[\operatorname{Ei}(u_0) - \operatorname{Ei}\left(u_0 e^{-\beta T}\right)\right], \qquad u_0 \coloneqq (z-1)\left(y_0 - \alpha/\beta\right) \qquad (16)$$

where Ei is the exponential integral [145]. This is the solution to the system of interest. Although straightforward to evaluate, it does not appear to belong to any well-known parametric family.
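
Even without special functions, H(z) can be evaluated by quadrature on the unit circle and inverted with a fast Fourier transform, a generic route from PGFs to probabilities. The following sketch (our own illustration; parameters arbitrary) checks this inversion against the direct Poisson-mixture quadrature of Eq 14.

# A minimal sketch: evaluate H(z) on the unit circle by quadrature, invert it
# with an FFT to recover the occupation measure, and compare against the
# direct Poisson-mixture quadrature of Eq 14. Parameters are illustrative.
import numpy as np
from scipy.stats import poisson

alpha, beta, y0, T, K = 10.0, 1.0, 2.0, 5.0, 128
t = (np.arange(4000) + 0.5) * T / 4000        # midpoint quadrature nodes
y = y0 * np.exp(-beta * t) + (alpha / beta) * (1 - np.exp(-beta * t))

z = np.exp(2j * np.pi * np.arange(K) / K)     # PGF arguments on the unit circle
H = np.exp((z[:, None] - 1) * y[None, :]).mean(axis=1)   # H(z) = E_f[G(z;t)]
pmf_pgf = np.fft.fft(H).real / K              # inverse transform of the PGF

pmf_direct = poisson.pmf(np.arange(K)[:, None], y[None, :]).mean(axis=1)
print(np.abs(pmf_pgf - pmf_direct).max())     # agreement to quadrature error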

Modular extensions to broader classes of biological phenomena

Extending this approach to more complicated biological models requires positing a hypothesis about the dynamics of transcription and the mRNA life-cycle, formalizing it as a CME, then solving that CME. This is the “forward” problem of statistical inference, which is typically intractable. However, in the current subsection, we summarize the model components that can be assembled to produce solvable systems, and outline potential challenges.

The canonical velocity model.

To begin, we would like to fully recapitulate the model introduced and simulated in the original publication [1]. This model has the following structure:

$$\varnothing \xrightarrow{\alpha(t)} \mathcal{U} \xrightarrow{\beta} \mathcal{S} \xrightarrow{\gamma} \varnothing \qquad (17)$$

where α(t) is a piecewise constant transcription rate. The instantaneous distribution of this process over copy number states (xu, xs) is well-known [81, 82]; its joint generating function takes the form

$$\ln G(z_u, z_s; t) = \int_0^t \alpha(t-s)\, U_1(s)\, ds \qquad (18)$$

where U1 is the characteristic of the unspliced mRNA solution, whereas μu(t) and μs(t) are the instantaneous averages, which can be written down in closed form for arbitrary piecewise constant α(t). The generating function derivation presupposes that the system starts at steady state. By defining N genes whose RNA abundance evolves on [0, T], we can extend this single-gene distribution to a multivariate occupation measure in 2N dimensions. This occupation measure is the putative source distribution for cells observed in experiment.

Eq 18 recapitulates the system proposed in the original publication [1], and we use it to set up and solve transient biophysical systems throughout the rest of this report. Restricting analysis in this way allows us to speculate about how such models could be fit, while keeping the mathematics in closed form. However, the schema in Eq 17 omits important physiological phenomena. Although these phenomena can be modeled, very few of them afford closed-form solutions, and they can be classified according to their complexity.

PGF-tractable models.

The first class includes models that afford transient single-gene solutions in terms of the generating function, and can be combined with the definition of an occupation measure to produce a generative model for observed data. These models can be solved using quadrature, and extend the dynamics in simple, Markovian ways, although strategies for fitting them are not yet well-developed.

The solution in Eq 18 generalizes to an arbitrary number of species, merely by converting the stoichiometry and rate matrix to the appropriate U1(u; s). The solution for a generic splicing and degradation network, under constitutive transcription, stays Poisson, and makes up one of the few closed-form cases [81, 82].

Transcriptional bursting at a gene locus can be represented by slightly modifying Eq 18:

$$\ln G(z_u, z_s; t) = \int_0^t \alpha(t-s)\left[M\left(U_1(s)\right) - 1\right] ds \qquad (19)$$

where M is the generating function of the burst distribution, which governs the number of unspliced mRNA molecules produced per transcriptional event. This distribution can be time-dependent and reflect the transient modulation of bursting dynamics. As in Eq 18, we omit the explicit representation of initial conditions, described in [81].

In principle, we need not assume α(t) in Eq 17 is piecewise constant. If α(t) is deterministic, the solution can be obtained by quadrature. More sophisticated regulatory dynamics, such as transcriptional rates governed by a continuous- or discrete-valued stochastic process [146, 147], can be treated analogously. If the process is continuous-valued, e.g., representing the concentration of a rapidly-binding regulator, it requires solving a single, potentially nonlinear ODE. If it is Markovian and discrete-valued, e.g., representing transitions between distinct promoter states, it requires solving a set of coupled ODEs.

Gene-gene coexpression, or the synchronization of transcriptional events induced by a common regulator, can be solved by setting M in Eq 19 to a multivariate generating function whose arguments are the characteristics of unspliced mRNA species [81].

The model is broad enough to describe cell type distinctions and cell fate stochasticity simply by defining a discrete mixture model over transcription rate trajectories. For example, if a cell can choose to enter cell fate A with probability wA or cell fate B with probability 1 − wA, the overall generating function takes the following form:

$$H(z) = w_A H_A(z) + \left(1 - w_A\right) H_B(z) \qquad (20)$$

where each branch’s generating function is induced by distinct driving functions, αA(t) ≠ αB(t). This formulation is equivalent to defining a finite mixture of branches, with a categorical distribution of parameters as the mixing law. As previously discussed [81], it can be extended to continuous mixtures, although such systems are not typically tractable.

Certain models of technical noise are modular with respect to the biological processes. Specifically, if we suppose that sequencing is a random process that is independent and identically distributed for every molecule of a particular species $\mathcal{X}_i$, with a distribution on $\mathbb{N}_0$, the PGF of the observed UMIs is G with each argument zi replaced by Gt,i(zi), where Gt,i is the PGF of the sampling distribution for the species indexed by i. We have previously considered this model for Bernoulli and Poisson distributions, which presuppose that the cDNA library construction is sequestering and non-sequestering, respectively [130, 148].
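
For instance, under Bernoulli capture with probability p per molecule, composing a Poisson(μ) biological PGF with Gt(z) = 1 − p + pz recovers the classical thinning result, an observed Poisson(pμ) law. The sketch below (our own minimal check, with arbitrary values) verifies this by PGF inversion.

# A minimal check of the PGF-composition model for technical noise: Bernoulli
# capture with probability p composed with a Poisson(mu) biological law yields
# Poisson(p*mu). Values are illustrative.
import numpy as np
from scipy.stats import poisson

mu, p, K = 10.0, 0.3, 64
z = np.exp(2j * np.pi * np.arange(K) / K)
G = lambda w: np.exp(mu * (w - 1))      # biological PGF (Poisson)
G_t = lambda w: 1 - p + p * w           # Bernoulli sampling PGF
pmf = np.fft.fft(G(G_t(z))).real / K    # invert the composed PGF
print(np.abs(pmf - poisson.pmf(np.arange(K), p * mu)).max())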

Finally, it is useful to consider the meaning behind the choice of f in Eq 12. Thus far, we have assumed f is a uniform law on [0, T]. This choice is underpinned by a particular conceptual model of cell dynamics relative to the sequencing process. Formalizing such a conceptual model requires introducing several fundamental principles from the field of chemical reaction engineering and interpreting them through the lens of living systems. A reactor is a vessel that contains a reacting system; in the current context, it is a living tissue which is isolated for sequencing.

Three idealized extremes of reactor configurations exist: the plug flow reactor (PFR), the continuous stirred tank reactor (CSTR), and the batch reactor (BR). The PFR has no internal mixing: the reaction stream enters in the influx and exits from the efflux after a deterministic amount of time. Therefore, the PFR has memory: if the total residence time is T and reactor length is L, and the fluid element is localized at position Lt/T within the PFR, it will exit after a delay of T − t. Conversely, the CSTR is fully mixed: the reaction stream enters the vessel and combines with its existing contents. Therefore, the CSTR is memoryless or Markovian: if the average residence time is T, a given fluid element within the reactor will exit after a delay distributed according to Exp(T−1), independent of time since its entrance [149]. The CSTR and PFR can operate at steady state, whereas the BR accumulates reaction products: the BR is charged at t = 0, and the reaction proceeds until a predetermined time T.

We can translate these basic configurations to the design of transcriptomics experiments, treating individual cells as fluid elements, and assuming the variation in transcriptional dynamics begins as the cells enter the reactor at t = 0. If the sampling process is independent from a PFR in space or a BR in time, we obtain the uniform sampling measure df = T−1dt. If the process samples from a CSTR, we obtain the exponential sampling measure df = T−1e−t/T dt. If the process samples from the efflux of a PFR or the product of a BR, we obtain the Dirac sampling measure df = δ(T − t)dt. The first two scenarios appear to be more appropriate for describing scRNA-seq experiments, which can sample across entire processes, whereas the final scenario appears to correspond to fluorescence transcriptomics experiments, which have a well-defined start time. Assuming there are no cyclostationary dynamics, all of the reactor and sampling configurations converge to the ergodic limit as T → ∞.
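
The consequences of the choice of sampling measure can be illustrated directly: drawing process times from the uniform, exponential, or Dirac measures and then drawing counts from the instantaneous Poisson law of Eq 11 yields distinctly different count distributions. A minimal sketch with illustrative parameters follows; note that the time-mixed (uniform and exponential) populations are overdispersed relative to the Dirac-sampled one.

# A minimal sketch (illustrative parameters) of the three sampling measures:
# uniform (PFR in space / BR in time), exponential (CSTR), and Dirac (PFR
# efflux / BR product). Counts are drawn from the instantaneous Poisson law.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, y0, T, n = 10.0, 1.0, 2.0, 3.0, 50_000

samplers = {
    "uniform (PFR/BR)      ": rng.uniform(0, T, n),
    "exponential (CSTR)    ": rng.exponential(T, n),
    "Dirac (efflux/product)": np.full(n, T),
}
for name, t in samplers.items():
    y = y0 * np.exp(-beta * t) + (alpha / beta) * (1 - np.exp(-beta * t))
    counts = rng.poisson(y)      # one observed cell per sampled process time
    print(f"{name} mean = {counts.mean():5.2f}, var = {counts.var():5.2f}")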

This conceptual model is highly simplified, but serves to motivate the need for the sampling measure formulation, as well as provide an intuition into the physical assumptions encoded by the choice of df. To summarize, if we wish to describe a transient process observed by sequencing, we need to begin with one of two assumptions. On one hand, the transient gene program may be triggered externally, but fully executed within the cell, allowing us to define “time-desynchronized” sampling from a PFR, CSTR, or BR. On the other hand, it may be controlled by external factors, allowing us to define “space-desynchronized” sampling from a PFR. However, considerable further theoretical and experimental work is necessary to understand when such assumptions are justified.

Simulation-tractable models.

The second class includes models which can be cast into a Markovian form and simulated [150], or partially solved using matrix algorithms [151]. Although we may be able to write down a CME and its formal solution, numerical evaluation typically requires considerable computational expense, and the appropriate inference strategies are unclear.

We abstract away the mechanistic details of regulation. However, the variation in transcriptional parameters should be understood as the outcome of a regulatory process that is tightly controlled in time (to justify step changes in rates) and concentration (to justify deterministic parameter values). It is possible to explicitly represent regulatory networks or feedback, in line with dyngen [14] and standard systems biology studies [152]. Further, we assume all reactions are zeroth- or first-order, although it is possible to model the kinetics of enzyme binding [153]. Unfortunately, such systems are intractable in the current context, due to high dimensionality, mathematical challenges, lack of protein data, and complexity of regulatory networks on a genome-wide scale.

We described a method to model sequencing as a random process, if the fairly restrictive conditions of independence and identical distribution hold. The distributional assumption can be relaxed for a particular gene and molecular species [130, 148]. It is also possible to write down—but challenging to fit—a hierarchical model, such that each cell’s sequencing model parameters are random. However, it does not yet appear feasible to solve and fit models which impose coupling between cells and genes through the compositional nature of the sequencing process [109]. At the same time, generating realizations from such models is trivial, and amounts to sampling without replacement. Analogously, it is straightforward to set up a model with “cell size” variation, such that each cell separately regulates its average transcription level for the entire genome [154]. It is unclear how such models should be fit, although neural [155] and perturbative [154] approaches have seen use.

We have restricted our discussion to regulatory modulation in terminally differentiated cells. In a variety of contexts, cell division, which involves molecule partitioning and transcriptional dosage compensation [93, 156], plays a significant role in differentiation, and it is inappropriate to omit its effects. Considerable literature exists on cell cycle models [93, 157, 158], but they are fairly challenging to solve, and the appropriate way to integrate them with occupation measures is unclear at this time.

For mathematical simplicity, we have assumed mRNA molecules have no internal structure. This model implies that transcription is binary, rather than stepwise: when a transcriptional event occurs, a new molecule springs into existence, fully formed. In actuality, mRNA molecules are polymers, and their production often takes considerable time [159]; on an even finer level of detail, transcriptional elongation is neither infinitely fast nor constant, and exhibits pauses or “pile-ups,” which may themselves have a regulatory role [160]. Adding yet more complexity, splicing may occur co-transcriptionally or post-transcriptionally [161], with deterministic or stochastic intron removal order [80], making the categories of “spliced” and “unspliced” molecules even less tenable. Certain models of elongation can be solved [61, 162] and simulated [163], but rapidly become mathematically intractable. In sum, characterizing the amount of information about granular in vivo processes which can be obtained from potentially incomplete annotations and potentially ambiguous transcript equivalence class data [164] is a considerable undertaking, but will be essential for scRNA-seq analysis over the longer term.

Currently intractable models.

We omit several types of phenomena because they are challenging to formalize using the standard CME framework. Thus far, we have assumed cells are biologically independent under some measure, although weak coupling may take place due to sequencing. However, cells interact with their neighbors in living tissues, leading to complex and tightly controlled patterns of spatial expression. These interactions are particularly important in biomedical research, and are typically simulated through agent-based modeling (ABM) techniques [165]. However, even the simplest models of interaction, such as the Ising model in solid-state physics, are challenging to treat and do not typically afford transient solutions [166].

Finally, CME models typically treat the cell as spatially homogeneous, occasionally with Markovian export from a spatially homogeneous nucleus [106]. This assumption simplifies the construction of Markov chains, and amounts to assuming that the diffusion timescale is considerably faster than the reaction timescale. However, as the intracellular medium is neither a stirred tank nor a dilute gas, diffusion may act as a limiting step. Spatial heterogeneity within the cell has been treated in recent whole-cell simulations, using classical mechanics [167] as well as reaction-diffusion master equations on a lattice [168]. However, such systems are not amenable to analysis and require pre-defining the cell geometry.

Count processing

Selecting and solving a model allows for a quantitative comparison of the predictions of the inference process to a meaningful baseline. We use the model solved in Eq 18 with simple impulse dynamics: α(t) is piecewise constant over a finite time horizon, with its dynamics given by a single positive or negative pulse,

$$\alpha(t) = \begin{cases} \alpha_1, & t \in [0, \tau_1) \\ \alpha_2, & t \in [\tau_1, \tau_2) \\ \alpha_1, & t \in [\tau_2, T] \end{cases} \qquad (21)$$

with α1 ≠ α2. Different genes have different α, but synchronization is enforced: all genes switch at identical times τ1, τ2. Physically, this description corresponds to a perturbation being applied to, then removed from the cells, resulting in the modulation shown in Fig 9a. We simulated this model using the procedure outlined under “Transient constitutive model: Perturbation and reversion.”
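
For readers who wish to experiment, a minimal Gillespie implementation of this system follows (our own sketch with illustrative rates, not the exact simulation procedure referenced above): each cell is assigned a uniform observation time, and the piecewise-constant α is handled exactly by redrawing waiting times at the switch points, which is licensed by memorylessness.

# A minimal Gillespie sketch (not the paper's exact procedure) of the model in
# Eq 17 under the impulse schedule of Eq 21. All rates are illustrative.
import numpy as np

rng = np.random.default_rng(1)
beta, gamma = 1.0, 0.5
alpha1, alpha2, tau1, tau2, T = 2.0, 20.0, 2.0, 6.0, 10.0

def alpha(t):                            # single positive pulse (Eq 21)
    return alpha2 if tau1 <= t < tau2 else alpha1

def ssa_cell(t_obs):
    t, u, s = 0.0, 0, 0
    while True:
        rates = np.array([alpha(t), beta * u, gamma * s])
        dt = rng.exponential(1 / rates.sum())
        nxt = min(b for b in (tau1, tau2, t_obs) if b > t)
        if t + dt >= nxt:
            if nxt == t_obs:
                return u, s              # state at the sampled observation time
            t = nxt                      # alpha switches; memorylessness lets us redraw
            continue
        t += dt
        r = rng.choice(3, p=rates / rates.sum())
        if r == 0:
            u += 1                       # transcription
        elif r == 1:
            u, s = u - 1, s + 1          # splicing
        else:
            s -= 1                       # degradation

cells = np.array([ssa_cell(rng.uniform(0, T)) for _ in range(2000)])
print("sampled mean (u, s):", cells.mean(axis=0))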

Fig 9. The RNA velocity count processing and inference workflow, applied to data generated by stochastic simulation.

a. Schematic of the impulse model of gene modulation. b. Demonstration of the concordance between simulation and analytical solution for the occupation measure. i.: nascent mRNA counts; ii.: mature mRNA counts (gray: simulation; blue: occupation measure). c. Smoothing and imputation introduce distortions into the data. i.: raw data; ii.: data normalized to total counts; iii. imputed data (points: raw or processed observations; lines: ground truth averages μs and μu; red: spliced; yellow: unspliced). d. Local averages obtained by imputation are not interpretable as instantaneous averages. i.: mean unspliced; ii.: mean spliced; iii. variance unspliced; iv.: variance spliced (black points: true moment vs. pooled moment; blue line: identity; blue region: factors of ten around identity). e. Smoothing and imputation improve the inference on extrema. i.: moment-based inference from raw data; ii.: extremal inference from normalized data; iii.: extremal inference from imputed data (black points: true vs. inferred values of γ/β; blue line: identity; blue region: factors of ten around identity). The palette used is derived from dutchmasters by EdwinTh.

https://doi.org/10.1371/journal.pcbi.1010492.g009

We investigated this class of phenomenological models for transcription variation as opposed to fully mechanistic descriptions (e.g., dyngen [14]) for three reasons. First, they provide the best-case baseline scenario for the validation of RNA velocity, as the inference procedure is predicated on this model. For example, Eq 21 is the pulse stimulus model proposed by La Manno et al. [1]. Second, they offer the analytical solutions discussed in “Occupation measures provide a theoretical framework for scRNA-seq,” whereas more complicated schema do not. Finally, mechanistic models of regulation are underdetermined relative to scRNA-seq data, as they rely on signal transfer through proteins. We anticipate that comprehensive future models will introduce further mechanistic details, as in the causal schema proposed in Fig 4 of [21]. However, here we abstract away the specific details of how this perturbation is effected, and focus on its effects on the observable transcriptome.

Predictably, the marginal distributions of the simulated data were bimodal, and matched the analytical solutions for the occupation measure (Fig 9b). The count processing workflow distorted the data in complex, nonlinear ways. Compared to the raw data, imputation did lower dispersion (Fig 9c). However, we did not find support for treating the imputed data as merely μu(t) and μs(t) with a Gaussian perturbation. For any particular gene, the relationship between the two was nontrivial and exhibited biases. The sample gene demonstrated in the figure produced an approximately smooth trace that did not resemble the true process average. At best, the imputed estimate appeared to be unbiased over the entire dataset, and tended to fall within a factor of ten of the true mean (Fig 9di and 9dii). Interpreting the local observed variance as the true process moment σ2 is even less appropriate: the relationship between local and true variance exhibited unintuitive, nonlinear effects (Fig 9diii and 9div).

Normalization for the total number of UMI counts per cell is standard, and La Manno et al. do use it in their validation of velocyto (p. 6 in SN2 of [1]). On the other hand, it could be argued that such scaling is an ad hoc procedure intended to eliminate underlying variation in UMI counts due to differences in which genes are expressed in cells, or technical noise. As such, it may be inappropriate to apply it to simulated datasets that do not have these phenomena. We repeated the analysis without scaling counts by total molecule counts per cell in Fig E in S1 Text. This means that raw counts were pooled, using the k nearest neighbors obtained from PCA, which was itself computed from the log-transformed raw, spliced counts. Qualitatively, the gene-specific imputed trajectories did not exhibit the severe biases of Fig 9. However, they still produced errors in the transient regimes of interest, and scale- and species-dependent errors elsewhere, violating the assumptions of the scVelo “dynamical” inference procedure (as outlined in “Inference” under “Logic and methodology”). Furthermore, the relationship between true moments and pooled moments was nearly identical in Fig 9d and Fig Ed in S1 Text.

We can conclude that even the best-case scenario, with matching model assumptions and no technical noise, does not justify using imputed data in place of the process average: the imputation procedure is circular, and can give rise to biases (as in Section B in S1 Text). Although these biases may cancel on average, this cannot be relied on for any particular gene. Therefore, instead of smoothing, which is fundamentally unstable and challenging to validate, we recommend explicitly constructing and solving error models (as in [130]). Such an approach provides insights into the system biophysics and enables the quantification of uncertainty through standard statistical methods.

Inference from occupation measure data

With the groundwork outlined immediately above, one can begin to tackle the problem of inference from sequencing data. We start by carefully inspecting the models set up in the velocity publications, enumerating their assumptions about the transcriptional processes, and writing down their formal solutions without making any further approximations. The resulting “inverse” problem of identifying biological parameters from data is currently intractable. We are unaware of any studies which treat this problem in full generality, so new inferential strategies need to be developed. However, it is instructive to write down the exact forms, as they clarify the origin of the complexity and may suggest satisfactory approximations.

First, we define the global structure of the transient system, which represents the parameters shared between different genes. This global structure is encoded in the vector θG, which we assume to be finite-dimensional. For example, if the differentiation process is linear and deterministic, with no branching, with K distinct cell types, θG gives the times τk of the cell type transitions (where τ0 ≔ 0 and τK ≔ T, which we set to 1 with no loss of generality). On the other hand, if it is non-deterministic, with diverging cell fates, θG can encode topologies and fates’ probabilities (such as wA from Eq 20). Conversely, a unipotent differentiation trajectory can be encoded by setting wA = 1, i.e., simpler topologies are degenerate cases of more complex topologies. Finally, this vector may also encode non-biological phenomena, such as batch-specific technical noise parameters [130] and the specific form of the generalized occupation measure f.

In addition to θG, the system also involves vectors θj, j ∈ {1, …, N}, where j ranges over the genes. The θj are gene-specific parameter vectors that parameterize physiological processes, such as transcriptional bursting, splicing, and degradation, as well as initial conditions. For simplicity, we make two crucial assumptions, consistent with the previous velocity publications. First, we suppose the transcriptional parameters are piecewise constant throughout the process, and all other parameters are strictly constant. Second, we suppose that different genes’ transcription and processing reactions are statistically independent. We define the full parameter set as Θ ≔ {θ1, …, θN, θG}. For completeness, we note that in previous publications the cell type transition times are gene-specific parameters, i.e., τjk ∈ θj, but since we assume that all genes switch at identical times they are global parameters, i.e., τk ∈ θG, in the model we propose here.

Finally, we formalize the data variables. The full dataset is given by the data matrix X, with cell-specific arrays Xi and gene-specific arrays Xj. Thus, Xij reports the spliced and unspliced counts for gene j in cell i. Under this model and the assumption of gene independence, the following equation gives the likelihood of a particular cell’s observation:

$$P\left(X_i; \Theta\right) = \int_0^T f(t) \prod_{j=1}^{N} P\left(X_{ij}; t, \theta_j, \theta_G\right) dt \qquad (22)$$

The assumption of cell independence suggests the following total likelihood:

$$L(\Theta; X) = \prod_{i=1}^{M} P\left(X_i; \Theta\right) \qquad (23)$$
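
To make Eqs 22–23 concrete, the following sketch assembles the total likelihood for a deliberately simple toy model: a single gene with a Poisson occupation measure and an assumed mean trajectory μ(t; θ) = θ(1 − e−t), under uniform f on [0, 1]. The mean function and all values are our own illustrative assumptions.

# A minimal sketch of Eqs 22-23 for a toy model: one gene, Poisson occupation
# measure with assumed mean mu(t; theta) = theta * (1 - exp(-t)), and uniform
# f on [0, 1]. The mean function and all values are illustrative.
import numpy as np
from scipy.stats import poisson

t_grid = (np.arange(500) + 0.5) / 500     # quadrature nodes for f = dt on [0, 1]

def cell_likelihood(x_i, theta):
    mu = theta * (1 - np.exp(-t_grid))    # hypothetical per-gene mean trajectory
    # Eq 22: integrate the product over genes (here, one gene) against f(t)
    return poisson.pmf(x_i[:, None], mu[None, :]).prod(axis=0).mean()

def total_log_likelihood(X, theta):
    # Eq 23: cells are independent, so their likelihoods multiply
    return sum(np.log(cell_likelihood(x_i, theta)) for x_i in X)

X = np.array([[3], [7], [1], [5]])        # toy data: 4 cells, 1 gene
print(total_log_likelihood(X, theta=10.0))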

To fully characterize this system under the foregoing model assumptions, we need to optimize the likelihood with respect to the parameters. This is generally intractable; even evaluating Eq 23 can be non-trivial. However, we can make a series of simplifying approximations.

A combinatorial optimization approach to inference

Ultimately, we would like to understand the behavior of Eq 23 across the entire domain of parameters, compute the maximum likelihood estimate (MLE) of Θ, and characterize its stability by calculating its confidence region. However, this is not yet feasible. Therefore, we restrict the discussion to writing down a formula for the MLE that can be treated using standard algorithms.

Our strategy is to exploit the “latent” distribution of cell-specific process times, condition on this distribution, and find an approximate MLE. Assuming that cells are observed uniformly across process time, i.e., df = dt, the latent time of each cell i is given by ti ∈ (0, 1). These times are almost surely distinct, and induce a cell ranking in order of increasing process time. This ranking is unknown and has to be inferred from the data.

Assume, for the moment, that the ranking is known and given by σ, a permutation of the M cell indices {1, 2, …, M − 1, M} corresponding to their process time order statistics. Given a cell’s order statistic σi, we can use f to estimate its latent time ti, which is distributed according to a rather complex multivariate Beta distribution [169]: even if each cell’s rank in the total order is known, we have to account for the uncertainty in process time. However, we can exploit the fact that this uncertainty decreases as the number of cells grows. The marginal order statistics are distributed according to Beta(σi, M + 1 − σi), a random variable with mean σi/(M + 1) and the following variance:

$$\operatorname{Var}\left[t_i \mid \sigma_i\right] = \frac{\sigma_i\left(M + 1 - \sigma_i\right)}{(M+1)^2 (M+2)} \qquad (24)$$

which tends to zero with the rate M−1. Therefore, we find that ti concentrates at σi/(M + 1) as M grows.
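
This behavior is easy to confirm empirically; the following snippet (our own check) compares the sample variance of an order statistic of M uniform draws to the Beta variance of Eq 24.

# A quick empirical check of Eq 24: the k-th order statistic of M uniform
# draws has the Beta(k, M + 1 - k) variance, which shrinks at the rate 1/M.
import numpy as np

rng = np.random.default_rng(2)
for M in (100, 1000, 10000):
    k = M // 2                                       # median rank
    t_k = np.sort(rng.uniform(size=(5000, M)), axis=1)[:, k - 1]
    analytic = k * (M + 1 - k) / ((M + 1) ** 2 * (M + 2))
    print(f"M = {M:5d}: empirical {t_k.var():.2e}, analytic {analytic:.2e}")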

Thus, if we have enough cells, and know their process time ordering, we can exploit the fact that the conditional law of ti converges to a Dirac delta functional:

$$f\left(t_i \mid \sigma_i\right) \to \delta\left(t_i - \frac{\sigma_i}{M+1}\right) \quad \text{as } M \to \infty \qquad (25)$$

This amounts to using a plug-in point estimate to compute the likelihood of a cell’s data. The same approach can be applied to each cell in turn:

$$L(\Theta; X \mid \sigma) \approx \prod_{i=1}^{M} \prod_{j=1}^{N} P\left(X_{ij}; \frac{\sigma_i}{M+1}, \theta_j, \theta_G\right) \qquad (26)$$

However, optimizing this quantity, even when conditioning on σ, is impractical because it requires simultaneously optimizing (potentially thousands of) gene-specific parameters along with the (relatively few) global parameters. The interchange of product operations is helpful because it allows us to write down a more tractable loss function for the MLE, which exploits the conditional separability of θj:

$$L(\Theta; X \mid \sigma) \approx \prod_{j=1}^{N} \prod_{i=1}^{M} P\left(X_{ij}; \frac{\sigma_i}{M+1}, \theta_j, \theta_G\right) \qquad (27)$$

so that, for fixed σ and θG, each θj can be optimized independently.

Optimizing this quantity over σ is guaranteed to return the global MLE:

$$\hat{\Theta} = \operatorname*{arg\,max}_{\sigma,\, \theta_G}\; \prod_{j=1}^{N} \max_{\theta_j} \prod_{i=1}^{M} P\left(X_{ij}; \frac{\sigma_i}{M+1}, \theta_j, \theta_G\right) \qquad (28)$$

The parameters θ1, …, θN, θG can be found by standard continuous optimization methods, but estimating σ requires a combinatorial optimization, namely finding an optimal traversal path between cells. In other words, even this approximate approach requires solving the problem of pseudotime inference, which produces a one-dimensional ordering of cells [170]. However, unlike standard pseudotime inference, which describes a set of purely phenomenological relationships informed by proximity between cell expression states, the current theoretical framework endows the solution with a concrete biological interpretation, which is informed by a specific microscopic model of transcription.

The trajectory inference literature treats this class of problems by graph traversal algorithms, generally by constructing a minimum spanning tree or an optimal traversal on the clusters or individual cells [14, 139]. Under fairly severe modeling assumptions, which generally rely on error isotropy, the optimal traversal of cell states reduces to the traveling salesman problem (TSP) via the Hamiltonian path problem with some minor differences [132, 171, 172]. The current approach is considerably more complicated, because the weights between the “nodes”, i.e., the observed cells, cannot generally be written down in closed form, and require optimization for every σ.

Nevertheless, the specific form of the required combinatorial optimization has several useful implications for inference. It is possible to subsample or filter the data to obtain rough estimates of the parameters by sampling a subset of genes. This facilitates the estimation of θG, which can be reused for an entire dataset. If only a fraction of genes are systematically modulated across a trajectory, technical noise parameters within θG can be estimated from the far more easily tractable fits to the stationary genes. Furthermore, sampling a subset of cells enables the construction of approximate σ and estimation of θj. The validity of such approximations can be assessed with relatively simple controls. For example, if a best-fit “trajectory” over cells σ is as good as a random or inverted permutation, the transient model is likely overfit. Finally, existing trajectory inference methods can be exploited to obtain an ordering σ for the purpose of testing whether it can give results consistent with the stochastic model by calculating the optimal parameters in Eq 27, plugging them into the appropriate CME, and comparing the process occupation measure to the true molecule distributions.
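
A toy version of this conditional optimization is sketched below (our own construction, not an implementation from the literature): given a candidate ordering σ, each cell receives the plug-in time σi/(M + 1), each gene’s θj is fitted independently by a one-dimensional optimization, and the resulting loss is compared against a random-permutation control of the kind just described. The linear mean model μj(t) = θjt is a hypothetical stand-in for a solved CME.

# A toy version (our own construction) of the conditional optimization in
# Eq 27: plug in t_i = sigma_i / (M + 1) for a candidate ordering sigma, fit
# each gene independently, and compare against a random-permutation control.
# The linear mean mu_j(t) = theta_j * t is a hypothetical stand-in for a CME.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(3)
M, N = 200, 5
t_true = np.sort(rng.uniform(size=M))
theta_true = rng.uniform(5, 20, size=N)
X = rng.poisson(theta_true[None, :] * t_true[:, None])    # toy counts

def conditional_nll(sigma):
    t = (np.argsort(np.argsort(sigma)) + 1) / (M + 1)     # ranks -> plug-in times
    nll = 0.0
    for j in range(N):                                    # genes separate (Eq 27)
        res = minimize_scalar(
            lambda th: -poisson.logpmf(X[:, j], th * t).sum(),
            bounds=(0.1, 100.0), method="bounded")
        nll += res.fun
    return nll

print("true ordering :", conditional_nll(t_true))              # low loss
print("random control:", conditional_nll(rng.permutation(M)))  # higher loss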

It is plausible that many standard trajectory inference methods can be represented as approximations to the exact solution under particular assumptions about the form of biological model and priors imposed on the trajectory structure. However, a complete discussion of the trajectory inference field is beyond the scope of this paper. Instead, we restrict ourselves to discussing how existing velocity methods, as well as certain clustering algorithms, can be represented as special cases of the exact solution in Eq 23.

Clustering as a special case

First, consider the ergodic case of f on [0, T] with T → ∞, and suppose there are K cell types at equilibrium. These cell types are distinguished by gene-specific parameter vectors θj(k), k = 1, …, K. Assume only a single RNA species per gene exists, and all genes are independently expressed in a single cell type, without yet imposing a specific biological model. This yields the likelihood

$$L = \prod_{i=1}^{M} \sum_{k=1}^{K} P(k) \prod_{j=1}^{N} P\left(x_{ij}; \theta_j^{(k)}\right) \qquad (29)$$

where P(xij; θj(k)) is the probability of observing the data and P(k) ∈ θG is the probability of cell i being in cell type k. The process time t is no longer necessary, because all cell types are stationary: the likelihoods are evaluated at the ergodic limit t = ∞. Eq 29 amounts to saying that the likelihood of a cell’s observation can be represented by using the law of total probability and conditioning on the cell type:

$$P\left(X_i\right) = \sum_{k=1}^{K} P\left(X_i \mid k\right) P(k) \qquad (30)$$

To optimize this likelihood, we need to specify P(xij; θj(k)), which is informed by the biophysics of transcription and mathematical tractability. The log-normal distribution is particularly common: if the counts xij are log-normally distributed, the law of the log-counts ln xij is Gaussian. The lognormal distribution can emerge from several hypotheses: through a common, if ad hoc, approximation to the gamma distribution which emerges from the mesoscopic limit of the CME [147], from the exact solution of a deterministic, macroscopic model with log-normally distributed transcriptional rates [173], and by mere assertion that the negative binomial distribution is similar to the lognormal distribution, without further discussion [172].

The lognormal approximation implies that each “cell type” is essentially a high-dimensional normal distribution in logarithmic state space. This induces a set of gene- and cluster-specific log-means μk, log-standard deviations sk, and a cluster assignment vector σ, such that σi ∈ {1, …, K}. The problem of characterizing cell types, i.e., fitting P(k), μk, and sk, and providing an optimal point estimate of σ, under this model is equivalent to using the expectation-maximization algorithm to fit a Gaussian mixture model to the logarithmic data [174]. However, some caveats deserve mention. The model choice requires careful consideration. The log-normal heuristic is incoherent with the standard velocity model, whose stationary distribution tends to a normal law with equal mean and variance in the continuous limit. Furthermore, although the Gaussian mixture model formulation can be justified as an approximation of a particular class of models, it is unlikely that this approximation holds generally.
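
The correspondence is direct, as the following sketch illustrates (our own toy example; the log1p pseudocount transform is our illustrative choice, needed because counts can be zero).

# A minimal sketch of Eq 29 under the lognormal heuristic: EM fitting of a
# Gaussian mixture to log-transformed counts. The log1p pseudocount is our
# own illustrative choice (counts can be zero); all values are arbitrary.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
K, M, N = 2, 500, 10
means = rng.uniform(1, 50, size=(K, N))          # cluster-specific means
labels = rng.integers(K, size=M)
X = rng.poisson(means[labels])                   # toy counts

gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
sigma = gmm.fit_predict(np.log1p(X))             # EM fit and point estimate of sigma
agreement = max(np.mean(sigma == labels), np.mean(sigma != labels))
print("cluster recovery:", agreement)            # label switching handled crudely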

For completeness, we note that the standard alternative to Gaussian mixture model clustering is community detection based on a graph constructed by defining a neighborhood criterion among cell vectors [37]. However, it has not yet been shown that such an approach can be afforded any well-defined probabilistic meaning. The literature contains numerous assertions that a meaningful Markovian transition probability matrix can be defined on observed cell states [1, 9, 10, 24, 140]. However, the constructed Markov chains have not been demonstrated to possess any particular relationship to an actual biological process (Section C in S1 Text).

The ‘deterministic’ velocyto model as a special case

Strictly speaking, we only need to solve Eq 23 if we want to exploit useful properties of likelihood landscapes and estimators. However, if we are willing to forgo these advantages, we can use a moment-based estimate.

The linearity of the occupation measure can be used to compute summary statistics. For example, we can treat the RNA velocity model defined in Eq 1, with μu(t) and μs(t) giving the instantaneous process averages. The following relations hold at each instant t and over the entire trajectory:

$$\frac{d\mu_u}{dt} = \alpha(t) - \beta\,\mu_u, \qquad \frac{d\mu_s}{dt} = \beta\,\mu_u - \gamma\,\mu_s \qquad (31)$$

Each species’ mean occupation measure μ̄ can be related to the instantaneous mean μ(t):

$$\bar{\mu} = \frac{1}{T}\int_0^T \mu(t)\, dt \qquad (32)$$

This implies the following identity:

$$\frac{\mu_s(T) - \mu_s(0)}{T} = \beta\,\bar{\mu}_u - \gamma\,\bar{\mu}_s \qquad (33)$$

which holds regardless of the transcriptional dynamics encoded in α(t).

The identity formalizes the moment-based approximation to the biological parameters: if the left-hand side (the net velocity of the process) is sufficiently close to zero, setting the right-hand side to zero yields the estimate γ/β ≈ μ̄u/μ̄s. Conversely, if this condition is violated, a naïve fit based on the moments (equivalently, a least-squares fit with zero intercept) will be biased by transient contributions, motivating the use of the extrema fitting procedure (as in Fig 2 in SN2 of [1]).
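
The identity is easy to verify numerically. The sketch below (our own check, with arbitrary rates and an arbitrary α(t)) integrates the reaction rate equations of Eq 31 and compares the two sides of Eq 33.

# A quick numerical check of Eq 33 (arbitrary rates and alpha(t)): integrate
# the reaction rate equations of Eq 31 and compare the two sides.
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma, T = 1.0, 0.4, 8.0
alpha = lambda t: 5.0 if t < 3.0 else 1.0        # any alpha(t) works here

rhs = lambda t, m: [alpha(t) - beta * m[0], beta * m[0] - gamma * m[1]]
t_eval = np.linspace(0, T, 4001)
sol = solve_ivp(rhs, (0, T), [0.0, 0.0], t_eval=t_eval, max_step=0.01)
mu_u, mu_s = sol.y

lhs = (mu_s[-1] - mu_s[0]) / T                   # net velocity of the process
rhs_val = (beta * np.trapz(mu_u, t_eval) - gamma * np.trapz(mu_s, t_eval)) / T
print(lhs, rhs_val)                              # agree to integration error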

We can investigate the behavior of the net velocity in a simple model system. Suppose β = 1, df = T−1dt, and α is piecewise constant. We define αk, with k ∈ {1, …, K}, constant on each interval Ik ≔ [τk−1, τk]; the bounds are τ0 ≔ 0 and τK ≔ T. We define the length of an interval as Δk = τk − τk−1. The following equations hold for every interval:

$$\mu(t) = \alpha_k + \left(\mu(\tau_{k-1}) - \alpha_k\right)e^{-(t - \tau_{k-1})}, \qquad \frac{1}{\Delta_k}\int_{I_k}\left(\mu(t) - \alpha_k\right)dt = \left(\mu(\tau_{k-1}) - \alpha_k\right)\frac{1 - e^{-\Delta_k}}{\Delta_k} \qquad (34)$$

In each interval Ik, the integral approaches zero as Δk grows. This result has a qualitative interpretation: as interval duration grows, the process settles into its ergodic equilibrium attractor, and that attractor provides an effective estimator of γ. On the other hand, if the interval is short-lived relative to mRNA lifetime, the integral is dominated by the initial condition, and the system is largely out of equilibrium.

In practice, this means that the degradation rate is identifiable through moments only if the lifetimes of the mRNA species are short relative to the interval lengths, i.e., the net velocity is low enough. On the other hand, if the lifetimes are too short, the transient regimes are sparse and steady states are approached rapidly, giving no information about dynamics (and reducing the problem to the formulation in Eq 29). The quantile fit procedure is simply a heuristic method to winnow the data for near-equilibrium populations under the informal prior that these populations are present in the data. Unfortunately, this approach is subject to the usual pitfall of moment-based methods: if the prior is wrong, there is no easy way to identify its failure.
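
A simplified reading of this heuristic is sketched below (our own stylized version, not velocyto’s actual code): retain cells in the extreme quantiles of spliced expression, presume they are near equilibrium, and fit a zero-intercept line u ≈ (γ/β)s by least squares.

# A stylized sketch of the extreme-quantile heuristic (not velocyto's actual
# code): keep cells in the top and bottom quantiles of spliced expression,
# presume they are near equilibrium, and fit u ~ (gamma/beta) * s with no
# intercept. Toy data satisfy E[u] = 0.4 * s at equilibrium.
import numpy as np

def quantile_fit(u, s, q=0.05):
    lo, hi = np.quantile(s, [q, 1 - q])
    mask = (s <= lo) | (s >= hi)          # putative near-equilibrium cells
    return (u[mask] @ s[mask]) / (s[mask] @ s[mask])   # zero-intercept least squares

rng = np.random.default_rng(5)
s = rng.poisson(20.0, size=2000).astype(float)
u = rng.poisson(0.4 * s).astype(float)
print(quantile_fit(u, s))                 # ~0.4, the assumed gamma/beta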

We have omitted the discussion of normalization and imputation, as they are not amenable to analysis. However, their impact can be evaluated by generating data from the model and processing it with the velocyto workflow. The intuition outlined above concords with our simulations: in Fig 9e, we fit the simple model introduced in “Count processing” under “Prospects and solutions” using three different methods. In i., we use simple linear regression on the raw data; in ii. and iii. we apply a quantile fit to normalized and imputed data respectively. In spite of the distortions in the overall phase portrait (Fig 9c), the extrema are stable enough under imputation to generate fairly reliable estimates of γ/β, up to roughly an order of magnitude, whereas the ratio of averages is significantly less precise. This is consistent with the performance reported in Fig 5 (as in the k = 0 case, which uses linear regression on the entire dataset).

In light of the formalization, the fitting procedure is contingent rather than stable, necessitating careful study of limitations. Using the average of the full occupation measure is contingent on the net velocity being near zero (Δk sufficiently large for all k). Using the average of the extrema is contingent on those extrema having equilibrated (Δk sufficiently large for k with largest and smallest αk). Furthermore, it provides no information about the relative timescales βj of different genes.

The “stochastic” model, which was introduced in scVelo, is practically identical, but exploits additional information from the second moments of the extremal observations. This approach inherits the same issues, such as the reliance on the existence of extrema, and omission of βj, and introduces new ones, such as the assumption of identical error terms for first and second moments and the inference of error covariance parameters. In principle, further investigation may characterize whether these issues improve or worsen moment-based inference. However, we suggest that these details are marginal compared to the more fundamental limitations, as well as the discrepancies observed in simulation (Fig 9diii and 9div).

The ‘dynamical’ scVelo model as a special case

The modeling approach we have presented can be used to contextualize part of the “dynamical” algorithm proposed in scVelo. First, assume that cell type transition times τjk ∈ θj, i.e., no global parameters shared by multiple genes exist (θG = ∅). This reduces Eq 23 to the following form:

$$L(\Theta; X) = \prod_{i=1}^{M} \int_0^T f(t) \prod_{j=1}^{N} P\left(X_{ij}; t, \theta_j\right) dt \qquad (35)$$

Omitting the uncertainty represented in the integral by assigning a time ti to each cell, we obtain:

$$L(\Theta; X) \approx \prod_{i=1}^{M} \prod_{j=1}^{N} P\left(X_{ij}; t_i, \theta_j\right) \qquad (36)$$

Since time assignments ti are now deterministic, rather than probabilistic, likelihood landscapes become inaccessible. In principle, finding the maximum of Eq 36 can still provide a point estimate of Θ, although this may not be practical: now, f has to be inferred empirically by fitting ti. Applying the logarithm, we get

$$\ell = \sum_{i=1}^{M} \sum_{j=1}^{N} \ln P\left(X_{ij}; t_i, \theta_j\right) \qquad (37)$$

Using a Gaussian kernel centered on μu and μs, with standard deviation sj for both species, as an approximation to the likelihood, we can interpret P as a probability density function:

$$P\left(x_u, x_s; t, \theta_j\right) \approx \frac{1}{2\pi s_j^2} \exp\left(-\frac{\left(x_u - \mu_u(t)\right)^2 + \left(x_s - \mu_s(t)\right)^2}{2 s_j^2}\right) \qquad (38)$$

The question of what this kernel means, i.e., what biophysical phenomena it models, is subtler than it appears. On the one hand, there are certain regimes where stochastic spliced and unspliced counts are approximately distributed about the true averages μu and μs according to normal laws, such as in the high-concentration limit of certain jump processes explored by Van Kampen [144]. However, this limit yields time-dependent and concentration-dependent sj, incompatible with Eq 38 and the “stochastic” model described in the scVelo manuscript. Instead, this form presupposes μu and μs are the true deterministic amounts of mRNA, corrupted by Gaussian isotropic error without an explicitly named source. This model exemplifies the signal processing paradigm, which attempts to identify an underlying “signal” by regressing out incidental Gaussian “noise,” with various levels of biological justification [175–177].

For a particular gene, the log-likelihood of Eq 38 takes the following form:

$$\ln P\left(x_u, x_s; t, \theta_j\right) = -\frac{\left\|\left(x_u, x_s\right) - \left(\mu_u(t), \mu_s(t)\right)\right\|^2}{2 s_j^2} - \ln\left(2\pi s_j^2\right) \qquad (39)$$

i.e., we can use a simple two-dimensional Euclidean norm to estimate the log-likelihood of the observation. For M independent observations of gene j with identical parameters θj but distinct times ti, we find that

$$\ell_j = -\sum_{i=1}^{M} \frac{\left\|X_{ij} - \mu_j(t_i)\right\|^2}{2 s_j^2} - M \ln\left(2\pi s_j^2\right) \qquad (40)$$

The optimum of this function is invariant under scaling, so we can work with the normalized negative log-likelihood:

$$-\frac{\ell_j}{M} = \frac{1}{M}\sum_{i=1}^{M} \frac{\left\|X_{ij} - \mu_j(t_i)\right\|^2}{2 s_j^2} + \ln\left(2\pi s_j^2\right) \qquad (41)$$
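
For concreteness, the loss of Eq 41 for a single gene can be computed as follows (our own sketch; the mean trajectories are arbitrary illustrative functions, and scVelo’s actual implementation differs, as discussed below).

# A minimal sketch of the per-gene loss in Eq 41; the mean trajectories below
# are arbitrary illustrative functions, not scVelo's parameterization.
import numpy as np

def gene_nll(x_u, x_s, mu_u, mu_s, s_j):
    # Eqs 39-41: isotropic bivariate Gaussian residuals about (mu_u(t), mu_s(t))
    sq = (x_u - mu_u) ** 2 + (x_s - mu_s) ** 2
    return np.mean(sq / (2 * s_j ** 2)) + np.log(2 * np.pi * s_j ** 2)

rng = np.random.default_rng(6)
t = rng.uniform(0, 1, size=300)
mu_u, mu_s = 10 * (1 - np.exp(-3 * t)), 8 * t    # assumed mean trajectories
x_u = mu_u + rng.normal(0, 1.5, size=300)
x_s = mu_s + rng.normal(0, 1.5, size=300)
print(gene_nll(x_u, x_s, mu_u, mu_s, s_j=1.5))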

This contradicts Equations 7–8 of [3], which use a univariate, rather than bivariate, Gaussian error term. On the other hand, the actual implementation appears to use separate sj for the two species, computed from the extremal points, and combine them in an ad hoc fashion.

In principle, solving Eq 37 with the likelihood indicated in Eq 39, i.e., iteratively inferring ti and θj, yields a coherent maximum likelihood estimate of the system parameters. However, the method goes one step further, essentially taking Eq 37 and rewriting it in terms of the negative log-likelihood:

$$-\ell = \sum_{j=1}^{N} \min_{\theta_j,\, t_{1j}, \ldots, t_{Mj}} \left[\sum_{i=1}^{M} \frac{\left\|X_{ij} - \mu_j(t_{ij})\right\|^2}{2 s_j^2} + M \ln\left(2\pi s_j^2\right)\right] \qquad (42)$$

so that each gene receives its own, independently optimized set of cell times tij.

This approach posits that likelihood optimization over N genes be split into N independent problems, which can be parallelized. However, it is incorrect, as the times ti are incoherent between different genes j, and the results are uninterpretable. This issue has been tacitly acknowledged, e.g., in a post hoc approach adopted to make times “agree” in scVelo [3] and a similar proposal by Li et al. [24], although without any rigorous justification.

The actual implementation does not use the raw data X. Rather, it uses a normalized and imputed version of the data. Again, the effect of these transformations is challenging to characterize analytically. However, the simulations shown in Fig 9c and 9d and Fig E in S1 Text suggest that there is no compelling reason to believe the imputed data reliably estimate the underlying averages μu and μs for a stochastic system out of equilibrium. Furthermore, the noise-corrupted deterministic model used in the likelihood computation is biologically implausible.

Prospects for inferential procedures

In “Occupation measures provide a theoretical framework for scRNA-seq,” we have presented a framework for the description of transient stochastic systems. This framework is versatile enough to describe a range of problems in single-cell sequencing, from clustering to trajectory inference. The methods presented in previous RNA velocity publications are best understood as approximations to exact solutions under fairly strong, informal priors about the process biophysics. The velocyto algorithm uses a moment approximation, which assumes the system has effectively equilibrated. The scVelo algorithm uses a Gaussian-perturbed ODE model, which assumes the mRNA counts do not have any intrinsic noise, only isotropic measurement noise. The latter yields considerably more information from the data, but imposes considerably stronger assumptions, making the obtained information essentially uninterpretable.

However, our formalization immediately presents options for likelihood-based parameter estimation. We introduced an approximate method that does maintain cell identities and motivated it using the properties of order statistics. This method is challenging because it requires a fairly involved combinatorial optimization. However, it does lend itself to developing further approximations, and provides routes for falsifying hypotheses. We believe that such biophysical models, amenable to approximation and testing, are crucial for the future interpretation of dynamics predictions from sequencing data.

Embedding

To complement the internal controls discussed in “Embedding” under “Logic and methodology,” we performed a set of comparisons with data simulated with no unknown sources of noise. We embedded a simulation of the system introduced in “Count processing” under “Prospects and solutions” and illustrated in Fig 9 into a two-dimensional principal component space. The results are shown in Fig 10. Even the ground truth velocity arrows (Fig 10a) only retained a small amount of information after the transition from 100 dimensions to two. This experiment provides us with the answer to the second question in “RNA velocity biophysics”: even if we have “true” velocity directions, they only contain a limited amount of highly local information. As expected from the fair performance in Fig 9eiii, the inferred linear embedding (Fig 10b) was globally and locally (Fig 10f) faithful: the model precisely matches the assumptions of the parameter inference workflow. However, the estimates were rapidly distorted upon applying the nonlinear embedding procedure (Fig 10c), rotating many cell-specific directions and suggesting transitions from the green reverting population to the light blue perturbed population, whereas the true trajectory is from light blue to green. The results of the Boolean procedure were slightly more faithful to the linear projection (Fig 10f) but otherwise qualitatively similar (Fig 10d). This is the method’s best-case performance.

Fig 10. Performance of cell and velocity embeddings on simulated data, compared to ground truth velocity directions.

a. Linear PCA embedding of ground truth velocities. b. Linear PCA embedding of inferred velocities. c. Nonlinear PCA embedding of inferred velocities. d. Nonlinear, Boolean PCA embedding of inferred velocities. e. Embedding of ground truth principal curve; trajectory directions displayed to guide the eye. f. Distribution of cell-specific angle deviations relative to ground truth velocity directions.

https://doi.org/10.1371/journal.pcbi.1010492.g010

Even in the PCA projection, the performance of the nonlinear velocity embedding leaves much to be desired: the procedure is biophysically uninterpretable, discards the vast majority of information, and risks failure when model assumptions are violated. For example, it can generate false positive velocity fields when the ground truth is completely static (Fig F in S1 Text, as simulated using the procedure under “Steady-state bursty model”); even directly inspecting the phase plots may be insufficient to diagnose this problem (e.g., compare Fig G in S1 Text to Extended Data Figs 6c and 7c of [1] and Figs 2c and 3g of [3]).

The nonlinear, non-deterministic embeddings ubiquitous in the analysis of scRNA-seq data degrade the performance further. In Fig 11, we embedded a system with three potential terminal states, generated by the simulation procedure described in “Transient constitutive model: Multipotent differentiation.” Cell projection into PCA appeared to conflate two of the branches; the nonlinear embeddings effaced causal relationships altogether. As before, the arrows were broadly coherent whether or not they included quantitative information. Finally, as described under “Background,” the embedding procedure has previously demonstrated catastrophic failure to capture known dynamics in biological datasets [5, 9, 10, 21, 29]. Therefore, although embeddings are qualitatively appealing, they are unstable, challenging to validate, and harbor intrinsic global- and local-scale pitfalls that arise even in simple scenarios.

Fig 11. Performance of cell and velocity embeddings on simulated data, compared to ground truth principal curve.

Top: PCA embedding with linear baseline and nonlinear aggregated velocity directions, as well as ground truth principal curve; trajectory directions displayed to guide the eye. Bottom: UMAP and t-SNE embeddings with nonlinear velocity projections.

https://doi.org/10.1371/journal.pcbi.1010492.g011

Nevertheless, some human-interpretable visualization is desirable. In light of the frustrating dearth of theory and interpretability for the commonly used embedding procedures, we note that the stochastic formulation we have presented can be used to speculate about more rigorous and stable embedding methods that would use rather than discard quantitative information. Instead of individual cells, which inevitably exhibit noise, we suggest constructing and emphasizing the underlying graph governing parameter values (as in, e.g., Fig 1F of [140]). Alternatively, since the current low-dimensional embeddings are used to support claims about the presence of a priori human-interpretable features such as equilibrium cell types, limit cycles, and transient differentiation trajectories, it may be better to fit a hierarchical model consisting of those features and report the best-fit model. For example, if the goal is to cluster data, then it makes sense to fit Eq 29. On the other hand, if the goal is an elucidation of tree-like differentiation trajectories, it may be better to incrementally grow a trajectory mixture model until its complexity outweighs its likelihood per a statistical information criterion. Formally, this would correspond to optimizing the likelihood of samples from analogs of Eq 20. If a method has succeeded in inferring the underlying topology and dynamics, a meaningful and well-defined principal curve induced by the underlying mechanism, as shown in Fig 10e and the PCA in Fig 11, could be plotted.
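To make the information-criterion logic concrete, the following minimal sketch grows a mixture model and selects its complexity by the Bayesian information criterion. Gaussian mixtures and the synthetic data are stand-ins chosen for brevity, not the trajectory mixture models of Eq 20; only the likelihood-versus-complexity trade-off is illustrated.

```python
# Minimal sketch: incrementally grow a mixture model and stop when an
# information criterion penalizes further complexity. Gaussian mixtures
# and the synthetic data are illustrative stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic "cell types" standing in for a log-transformed count matrix.
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(4, 1, (300, 2))])

best_k, best_bic = None, np.inf
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gm.bic(X)  # -2 log-likelihood plus a parameter-count penalty
    if bic < best_bic:
        best_k, best_bic = k, bic
print(f"selected {best_k} components (BIC = {best_bic:.1f})")
```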

Results & discussion

Summary

The two main steps in RNA velocity, namely the model estimation and embedding, originate from different approaches to data analysis that can be at odds with each other. The count processing and inference steps, which comprise the model estimation procedure, serve to identify parameters for a transcription model under some fairly strong assumptions, such as constitutive production and approximately Gaussian noise. This procedure can be treated as an informal approximation of a method to solve a system implied by the simulation design in the original publication. However, as we have seen, this system abstracts away many aspects of the technical artifacts present in single-cell RNA sequencing, and the transcriptional dynamics that drive the molecular biology of cells. The embedding processes used, which are entirely ad hoc, discard nearly all of the quantitative information, and can occasionally fail. Particularly problematic is that the failures, when they occur, are difficult to identify. Moreover, failure may result from many problems, including overlaps in the embedding and erroneous clustering. Such problems may be mitigated or exacerbated by tuning hyperparameters. These challenges, contradictions, and the assumptions inherent in the many choices that are made for each of the steps, have not been previously characterized in full detail, and they add up to a mixed picture. On the one hand, at least in some simulated cases, the RNA velocity method does work, and the latent signal is strong enough to capture broad trends. On the other hand, catastrophic failure can lurk at any step of the velocity workflow, and there are no theorems to alert users to failure modes, or to diagnose or delimit the extent of failures. Instability and reliance on user-tuned hyperparameters are not grounds for abandoning the method; the same problems crop up with kernel density estimation, k-means clustering, histogram binning, time series smoothing, and many other analysis tasks. However, the tendency to compensate for lack of theorems with more ad hoc filters and more sophisticated modeling that requires optimization of neighborhood sizes, normalization procedures, and thresholds only exacerbates the problems.

The mathematical foundations of stochastic biophysics have been studied for several decades; they are well-understood, and amenable to generalizations and approximations. The chemical master equation allows for the elucidation of technical noise [130], and the quantitative exploration of transcriptional regulation [147] and splicing [81]. As discussed in the section “Prospects and solutions,” the same modeling framework can be used to describe general classes of differentiation processes. Rather than starting with heuristics and then seeking to unravel their meaning, in this approach one begins by motivating, defining, and solving a general system, and only subsequently deriving approximations and statistical summaries. These can range from simple moment expressions to low-dimensional principal curves as illustrated in Fig 11. Furthermore, with such an approach, one can leverage the machinery of Bayesian inference to directly fit full distributions, with the advantages of interpretability and statistical robustness. This highlights that the primary challenge in RNA velocity is not its extension via additional heuristics, but rather the development of tractable inference procedures.

Proposals

We conclude with a summary of the main steps of an RNA velocity workflow, along with some insights and proposals that emerged from our work:

Pre-processing.

As demonstrated in “Pre-processing” under “Logic and methodology,” the specific choice of processing software does produce discrepancies in results, although the small, and largely arbitrary, variations in assignment rules do not provide compelling reasons to select one method over another, pending the development of more detailed splicing network definitions. There are, however, substantial processing time differences [27, 30] that can affect reproducibility of results, leading us to prefer fast pseudoalignment methods.

Filtering.

This step is necessary for tractability, since full spliced and unspliced matrices can be challenging to analyze. While caution must be applied in selecting thresholding criteria, we find no reason to deviate from the standards typically applied in RNA velocity analyses.

Model definition.

RNA velocity methods have been inspired by stochastic models of transcription. However, there has not been a strong link between the models and the implemented methods, which are based on loose analogies and heuristics. We believe that explicit construction and discussion of biophysical models is imperative when developing RNA velocity methods, so that results can be meaningful and interpretable. In particular, we caution against the class of continuous and constitutive models implemented in velocity packages thus far; as discussed above, bursty models are tractable [81] and substantially more plausible according to live-cell data.

Normalization and imputation.

The normalization and averaging of data to produce continuous curves is intended to remove cell size effects and to denoise the data. We found several problems with this approach. Firstly, this rationale is not motivated by theory, and our theoretical concerns in Section B in S1 Text suggest that model-agnostic “correction” is inappropriate. Secondly, as discussed in “Count processing” under “Prospects and solutions” and illustrated in Fig 9, the imputed data do not accurately recapitulate the supposed ground truth even in the simplest case. Finally, imputation prevents the most natural interpretation of counts as discrete random variables. Based on Kim et al. [110] and [130], we advise against normalization; it is more meaningful and accurate to apply parameterized models of extrinsic noise and gene–gene coupling. The interpretability afforded by discrete models outweighs the potential benefits of ad hoc normalization. Furthermore, we strongly recommend against imputation more generally: studies such as [68, 110, 178] have revealed distortions, and the approach possesses fundamental instabilities (as in Fig 5).

Inference.

From a probabilistic perspective, current inference procedures are problematic. Instead of currently implemented procedures, it is more appropriate to build and solve mechanistic, fully stochastic models that allow for fitting copy numbers. This can be computationally facilitated by a data selection process coherent with the “marker gene” paradigm: if a gene does not need to be fit to a transient model, one should not try to fit one. Thus, we recommend fitting ergodic distributions to genes that are not meaningfully modulated across the dataset, ergodic mixture distributions to (fewer) genes that vary across disjoint cell types, and occupation measures to (even fewer) genes that exhibit transient behaviors. Although joint inference is relatively challenging, we believe that a formulation that can exploit existing combinatorial optimization frameworks may be a productive avenue for exploration.
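As a concrete instance of fitting an ergodic distribution, the sketch below performs maximum-likelihood estimation of a negative binomial, the stationary law of the bursty transcription model, for a single non-modulated gene. The counts and starting values are synthetic placeholders; transient genes would instead require the occupation-measure machinery.

```python
# Minimal sketch: maximum-likelihood fit of a negative binomial (the
# ergodic law of the bursty model) to one gene's counts. Data synthetic.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
counts = rng.negative_binomial(n=2.0, p=0.4, size=1000)  # placeholder gene

def nll(params):
    n, p = params
    if n <= 0 or not 0 < p < 1:  # keep the optimizer in the valid domain
        return np.inf
    return -stats.nbinom.logpmf(counts, n, p).sum()

fit = minimize(nll, x0=[1.0, 0.5], method="Nelder-Mead")
print("MLE (n, p):", fit.x)
```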

Embedding.

RNA velocity embedding procedures inherit problems accrued with the steps discussed above. However, even in an idealized situation where an interpretable and well-fit model is used, current embedding practices are counter-productive for interpreting the data, as discussed with reference to controls in subsections named “Embedding” under “Logic and methodology” and “Prospects and solutions.” Despite the implication of causal relationships between cells encoded by cell–cell transition probabilities, embedding procedures are currently ad hoc. Moreover, it has been shown that current methods distort local neighborhoods and the global topology in an unpredictable manner [122, 124]. Directed graphs over the cells are attractive, but do not have a coherent interpretation relative to the underlying biophysics (Section C in S1 Text). Instead of such methods, we recommend directly working with the latent process governing the transcriptional variation. Nevertheless, two-dimensional visuals may be useful for summarizing the raw data; using simulations, we demonstrate an interpretable method for embedding a true principal curve in deterministic principal component space in “Embedding” under “Prospects and solutions.”

Methods and data

Pre-processing concordance

There is no well-defined “ground truth” for mRNA counts in arbitrary datasets. However, to obtain a qualitative understanding of potential pitfalls, we performed controlled experiments to analyze the discrepancies between outputs produced by popular software implementations.

The concordance analysis was heavily inspired by the benchmarking of Soneson et al. [27]; however, our goals and scope differ. First, we sought to analyze the reproducibility of the findings across several datasets; the original analysis only treated a single dataset generated using the 10x v2 chemistry. To this end, we analyzed ten datasets that used the 10x Genomics v2 and v3 protocols. Second, we sought to focus on the processing workflows most relevant to casual use. The original analysis examined thirteen quantification workflows, whereas we examined three: velocyto, kallisto|bustools, and salmon. These were run with default settings. We have made available all the scripts and loom files generated by the workflows (“Data availability”).

We obtained ten datasets generated with the 10x Genomics scRNA-seq platform (Table 1). Two were released as part of a study by Desai et al. and used v2 chemistry [179]. Eight were released by 10x Genomics and used v3 chemistry. The dataset metadata are outlined under “Data availability.”

Table 1. The datasets used to compare performance of molecule quantification software.

https://doi.org/10.1371/journal.pcbi.1010492.t001

To implement the velocyto workflow, we ran CellRanger on the datasets using human and mouse reference genomes, pre-built by 10x Genomics (GRCh38 and mm10 2020-A). We then processed the aligned outputs using the run10x command provided in velocyto.

To implement the kallisto|bustools workflow, we ran the ref command on the pre-built genomes to build references, using the standard --workflow lamanno option. We then processed the raw data with the count command, passing in the generated reference and using the --workflow lamanno option.

To implement the salmon alevin-fry workflow, we ran the alevin-fry velocity workflow documented at https://combine-lab.github.io/alevin-fry-tutorials/2021/alevin-fry-velocity/, from the initial reference construction to the final anndata output. This output was converted directly to loom files for the comparative analysis. We used the same pre-built 10x Genomics reference genomes (GRCh38 and mm10 2020-A) as above.

Simulation

Transient constitutive model: Perturbation and reversion.

To generate Fig 9, we simulated data from the constitutive transcription model with the “cell type” structure ABA. In this model all cells start out in state A at t = 0, switch to state B at t = τ1, and revert back to state A at t = τ2. We generated 2000 cells and 100 genes. As shown in Fig 9a and formalized in Eq 21, we defined three time periods corresponding to each cell type. The simulation time horizon was set to T = 10, with synchronized transition times τ1 = 3 and τ2 = 7. The gene-specific transcription rates α1 and α2 were generated from a lognormal distribution with log-mean 0 and log-standard deviation 1. The gene-specific splicing rates β were generated from a lognormal distribution with log-mean 1 and log-standard deviation 0.5. The gene-specific degradation rates γ were generated from a lognormal distribution with log-mean 0.5 and log-standard deviation 0.25, to reflect the intuition that splicing is somewhat faster than degradation. Sampling times were generated from a continuous uniform random variable on the interval [0, T].

The solutions in Fig 9b were computed from an approximation to the generating function. We did not account for the initial condition, an omission that is acceptable because the transcription rate on (0, τ1) is low. The true values of μu and μs were computed from the solutions to the governing ordinary differential equation. The true values of σu² and σs² were set to μu and μs, respectively, as the mean of a Poisson distribution is identical to its variance. To generate Fig 10, we simulated data from the same model, with 2000 cells and 100 genes.
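For reference, a minimal sketch of how the deterministic means μu(t) and μs(t) can be obtained by numerical integration under the ABA switching structure; the parameter draws mirror the distributions above for a single gene, and the zero initial condition is an assumption made here for illustration (the analysis above instead neglects the initial condition because early transcription is low).

```python
# Minimal sketch: means mu_u(t), mu_s(t) of the constitutive model under
# the ABA switching structure, integrated numerically for one gene.
# The zero initial condition is an illustrative assumption.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(2)
T, tau1, tau2 = 10.0, 3.0, 7.0
alpha1, alpha2 = rng.lognormal(0, 1, size=2)  # state-specific transcription
beta = rng.lognormal(1, 0.5)                  # splicing rate
gamma = rng.lognormal(0.5, 0.25)              # degradation rate

def alpha(t):  # transcription rate switches A -> B -> A
    return alpha2 if tau1 <= t < tau2 else alpha1

def rhs(t, y):
    u, s = y
    return [alpha(t) - beta * u, beta * u - gamma * s]

sol = solve_ivp(rhs, (0.0, T), y0=[0.0, 0.0],
                t_eval=np.linspace(0.0, T, 200),
                max_step=0.05)  # small steps resolve the rate switches
mu_u, mu_s = sol.y  # Poisson means; the variances equal the means
```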

Transient constitutive model: Multipotent differentiation.

To generate Fig 11, we simulated data from the constitutive transcription model with the “cell type” structure AB(C/D/E). In this model all cells start out in state A at t = 0, switch to state B at t = τ1, and then transition to one of the terminal states C, D, or E at t = τ2. We generated 2000 cells for 100 genes. The gene parameter and observation time distributions, as well as switching times, were identical to those reported in “Transient constitutive model: Perturbation and reversion.” Each cell fate was assigned randomly, with equal probabilities of 1/3. We used an identical procedure for Fig A in S1 Text.

To generate Fig 8, and demonstrate qualitative differences between process time and expression-based pseudotime, we followed the simulation procedure described above for the structure AB(C/D/E), but limited the multipotent differentiation to a bifurcation (AB(C/D)), in which cells start in state A at t = 0, switch to state B at t = τ1, and then transition to one of the terminal states C or D at t = τ2. From the simulated raw counts, we calculated the ground truth (spliced) average μs and generated the plotted expression values for the 2000 cells by adding Gaussian jitter/noise to these deterministic averages.

Steady-state bursty model.

To generate Fig F and G in S1 Text, we generated synthetic data assuming that RNA transcription is at an equilibrium, but has heterogeneity due to different cell types. The bursty transcription model was implemented using the PGF schema outlined in its derivation [106]. We specified parameters for 100 genes, with cell-independent burst sizes b and splicing rates β. Burst sizes b were generated from a lognormal distribution with log-mean 0.3 and log-standard deviation 0.8, clipped to stay in the range [0.05, 25]. The splicing rates β were set to 1 with no loss of generality. We simulated 10 cell types distinguished by average burst frequencies α and degradation rates γ, with 300 cells per cell type.

Average gene-specific log-degradation rates 〈γ〉 were generated from a normal distribution with mean −0.3 and standard deviation 0.3, to reflect the intuition that splicing is somewhat faster than degradation [1]. Gene- and cell type-specific degradation rates γ were generated from a lognormal distribution with log-mean 〈γ〉 and log-standard deviation 0.1, clipped to stay in the range [0.08, 4], to reflect the intuition that extrinsic noise in degradation rates is low relative to that in transcription rates.

Burst frequencies were generated from a lognormal distribution with log-mean −1 and log-standard deviation 0.5, clipped to stay in the range [0.005, 1], to encode the intuition that transcriptional activity is relatively rare. Analytical means μu, μs and standard deviations σu, σs were computed for spliced and unspliced distributions. Histograms were generated up to μ + 5σ in each direction, clipped to be no lower than 10. To keep molecule counts realistically low and the histograms tractable, we rejected and regenerated parameter sets that produced (μu + 5σu) × (μs + 5σs) > 1.5 × 10⁴. To generate observations, we sampled directly from the histograms.
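The parameter draws described above reduce to clipped lognormal sampling; a minimal sketch follows, with one cell type shown and all variable names illustrative.

```python
# Minimal sketch of the clipped-lognormal parameter generation for the
# steady-state bursty simulation (one cell type shown; names illustrative).
import numpy as np

rng = np.random.default_rng(3)
n_genes = 100

# Cell-independent burst sizes, clipped to [0.05, 25].
b = np.clip(rng.lognormal(0.3, 0.8, n_genes), 0.05, 25.0)
beta = np.ones(n_genes)  # splicing rates set to 1 with no loss of generality

# Gene-specific average log-degradation rates <gamma>...
log_gamma_mean = rng.normal(-0.3, 0.3, n_genes)
# ...and cell type-specific rates drawn around them, clipped to [0.08, 4].
gamma = np.clip(rng.lognormal(log_gamma_mean, 0.1), 0.08, 4.0)

# Burst frequencies, clipped to [0.005, 1].
alpha = np.clip(rng.lognormal(-1.0, 0.5, n_genes), 0.005, 1.0)
```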

The phase plots displayed in Fig G in S1 Text were manually selected after sorting the simulated genes by $\sum_i \Delta s_i$ and taking those with the largest and smallest values.

Filtering

In Figs 5–7, we analyzed the forebrain dataset. To pre-process it, we implemented a procedure largely identical to that used to generate Fig 4a of the original publication [1]. We performed several rounds of filtering:

For the forebrain dataset, we used the following sequence of thresholds (a code sketch follows the list):

  1. Discarding cells in the 0.5th percentile of total unspliced counts.
  2. Discarding genes with fewer than 40 spliced counts, or expressed in fewer than 30 cells.
  3. Selecting the top 2000 genes by coefficient of variation vs. mean, as implemented in the velocyto function score_cv_vs_mean, with maximum expression average of 35 (based on spliced counts).
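A minimal sketch of these three filtering rounds, assuming dense count matrices U (unspliced) and S (spliced) of shape (cells × genes); the CV-versus-mean score here is a simple stand-in for velocyto's score_cv_vs_mean rather than a reproduction of it.

```python
# Minimal sketch of the three filtering rounds. U and S are dense
# (cells x genes) count matrices; the CV score is a stand-in for
# velocyto's score_cv_vs_mean.
import numpy as np

def filter_counts(U, S, min_cells=30, min_spliced=40, n_top=2000, max_avg=35.0):
    # 1. Discard cells in the 0.5th percentile of total unspliced counts.
    totals = U.sum(axis=1)
    keep_cells = totals > np.percentile(totals, 0.5)
    U, S = U[keep_cells], S[keep_cells]

    # 2. Discard genes with < 40 spliced counts or expressed in < 30 cells.
    keep_genes = (S.sum(axis=0) >= min_spliced) & ((S > 0).sum(axis=0) >= min_cells)
    U, S = U[:, keep_genes], S[:, keep_genes]

    # 3. Keep the top 2000 genes by CV vs. mean, capping the mean at 35.
    mean = S.mean(axis=0)
    cv = S.std(axis=0) / np.maximum(mean, 1e-12)
    score = np.where(mean <= max_avg, cv, -np.inf)
    top = np.argsort(score)[::-1][:n_top]
    return U[:, top], S[:, top]
```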

In all other figures, we analyzed simulated data and omitted filtering, as all genes a priori had the correct dynamics.

Normalization

After importing forebrain data, we normalized and log-transformed spliced and unspliced counts using the default schema implemented in the velocyto function normalize:

$$\tilde{x}_{ij} = x_{ij}\,\frac{\bar{X}}{X_i}, \qquad X_i = \sum_j x_{ij}, \qquad \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \tag{43}$$

i.e., the “cell sizes” or total counts of spliced and unspliced molecules were separately normalized so each cell’s total was set to the mean over the dataset.
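In code, Eq 43 amounts to the following minimal sketch, applied separately to the spliced and unspliced matrices (X is a dense cells × genes matrix):

```python
# Minimal sketch of Eq 43: rescale each cell's total to the dataset mean,
# applied separately to the spliced and unspliced matrices.
import numpy as np

def size_normalize(X):
    totals = X.sum(axis=1, keepdims=True)  # per-cell totals X_i
    return X * (totals.mean() / totals)    # each cell's total -> mean total

# Log transformation (e.g., np.log1p) is then applied to the result.
```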

For the simulated data, we did not use normalization. This approach was inconsistent with previous simulated benchmarks [1], but we had three reasons for omitting it. First, as discussed in “Count processing” under “Prospects and solutions,” normalization purports to “regress out” systematic technical and biological effects. We did not include these phenomena in the model. Second, the ground truth principal curves we constructed for the analyses in the section “Prospects and solutions” (e.g., in Fig 10e) relied on evaluating the true gene-specific μs(t) on a grid over [0, T], then log-transforming and projecting them to the two-dimensional principal component space. This was straightforward to do when the PC space was computed from raw counts, but more challenging otherwise. Finally, our omission made the velocity embedding procedure coherent: with the PC projection based on raw counts, we used the same underlying space to impute counts and extrapolate velocities.

Embedding construction

For the forebrain dataset, we used size- and log-normalized spliced counts to construct the principal component projection. The UMAP and t-SNE embeddings were calculated from the top 25 principal components (as illustrated in Fig 6). For the simulated data, we used the log-normalized spliced counts to compute the same embeddings.
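A minimal sketch of this construction, assuming the scikit-learn and umap-learn packages; hyperparameters beyond the number of principal components are left at library defaults, which need not coincide with those used for the figures.

```python
# Minimal sketch: PCA on log-normalized spliced counts, then UMAP and
# t-SNE on the top 25 principal components. Library defaults are used
# for all other hyperparameters.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

def build_embeddings(log_S, n_pcs=25):
    pcs = PCA(n_components=n_pcs).fit_transform(log_S)
    emb_umap = umap.UMAP(random_state=0).fit_transform(pcs)
    emb_tsne = TSNE(n_components=2, random_state=0).fit_transform(pcs)
    return pcs, emb_umap, emb_tsne
```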

Imputation

Prior to fitting the models, data were smoothed by pooling across 50 nearest neighbors in the 25-dimensional principal component space defined in “Embedding construction,” as quantified by Euclidean distance. To implement this step, we used the default parameters of the velocyto function knn_imputation. The choice of neighborhood space was arbitrary, and we imposed it for consistency. The original report used an adaptive principal component space based on the fraction of explained variance, whereas scVelo uses a default of 30 principal components. We observed no substantial difference in results between the adaptive and fixed schemas. For the forebrain dataset, we pooled the normalized counts. In the case of the simulations, we pooled the raw counts. In Fig 9 and Fig E in S1 Text, we deviated from this procedure to investigate the impact and suitability of normalizing simulated data. The figures demonstrate the respective effects of pooling the normalized and raw counts.
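A minimal sketch of the pooling step, assuming a dense counts matrix; this mirrors the defaults of velocyto's knn_imputation rather than reproducing its implementation.

```python
# Minimal sketch of kNN pooling: average each cell's counts over its 50
# nearest neighbors (self included) in 25-dimensional PC space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def knn_pool(X, k=50, n_pcs=25):
    pcs = PCA(n_components=n_pcs).fit_transform(np.log1p(X))
    nbrs = NearestNeighbors(n_neighbors=k).fit(pcs)
    _, idx = nbrs.kneighbors(pcs)  # idx[i] holds cell i's k neighbors
    return X[idx].mean(axis=1)     # pooled ("imputed") counts
```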

Inference and extrapolation

By default, the parameter γ/β was fit to the extrema of the imputed dataset: imputed unspliced counts were regressed as a linear function of imputed spliced counts with an offset. The extrema selection procedure used the defaults implemented in the velocyto function fit_gammas.
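A minimal sketch of the extrema regression for one gene, with the percentile cutoffs as illustrative placeholders for velocyto's internal extrema selection:

```python
# Minimal sketch of the extrema regression for one gene: least-squares
# fit of gamma/beta (slope) and offset q on cells in the tails of the
# imputed spliced counts. Percentile cutoffs are illustrative.
import numpy as np

def fit_gamma(u_imp, s_imp, lo=5.0, hi=95.0):
    tails = (s_imp <= np.percentile(s_imp, lo)) | (s_imp >= np.percentile(s_imp, hi))
    A = np.column_stack([s_imp[tails], np.ones(tails.sum())])
    slope, intercept = np.linalg.lstsq(A, u_imp[tails], rcond=None)[0]
    return slope, intercept  # putative gamma/beta and q
```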

In Fig 9e, we deviated from this procedure to investigate the suitability and stability of inference from pooled data. In Fig 9ei, we performed linear regression on the raw counts of the entire dataset, whereas in Fig 9eii, we performed linear regression on the quantiles of the normalized dataset. Finally, in the “Raw” or k = 0 cases illustrated in Fig 5, we performed linear regression on the raw counts of the entire dataset to contrast with regression on the extrema.

The standard inference procedure produced two parameters per gene j: the slope, a putative estimate of γ/β, and the intercept, which we denote as q. To compute the velocity of gene j in cell i for the nonlinear velocity embedding, we used the following formula:

$$v_{ij} = \hat{u}_{ij} - \left(\frac{\gamma}{\beta}\right)_j \hat{s}_{ij} - q_j, \tag{44}$$

where $\hat{u}$ and $\hat{s}$ denote imputed quantities. To extrapolate the velocity and predict the spliced abundance after a time interval Δt, we calculated $\Delta s_{ij} = v_{ij}\Delta t$. This time interval was set to 1, for consistency with the velocyto implementation. The extrapolated value $s_{ij} + \Delta s_{ij}$ does not appear to be used in velocyto, as $\Delta s_{ij}$ contains the directional information used in the nonlinear embedding.
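In code, Eq 44 and the extrapolation step amount to the following sketch, with u_imp and s_imp the imputed matrices and slope, q the per-gene fits from the regression above:

```python
# Minimal sketch of Eq 44 and the extrapolation step. slope and q are
# per-gene vectors broadcast across the (cells x genes) matrices.
import numpy as np

def velocity_and_extrapolation(u_imp, s_imp, slope, q, dt=1.0):
    v = u_imp - slope * s_imp - q  # v_ij, Eq 44
    return v, v * dt               # extrapolated increment delta s_ij
```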

The schema described above is consistent with that of velocyto, but cannot be used to compute linear velocity embeddings (as given in Eq 3). The initial condition $(u_{ij}, s_{ij})$ used for extrapolation must be in the space used to build the PCA representation. Therefore, for linear velocity embeddings, we used Eq 44 in the corresponding space: normalized counts for the forebrain dataset and raw counts for simulated data.

However, if $\Delta s_{ij} < 0$, the naïve extrapolation $s_{ij} + \Delta s_{ij}$ was not guaranteed to give a non-negative value that could be log- and PCA-transformed. In principle, we could have used an arbitrary Δt and clipped any $s_{ij} + \Delta s_{ij} < 0$ to zero. However, this approach, though closer in spirit to velocyto, would have risked extrapolating beyond the physical regime and could have introduced biases.

Instead, we chose an extrapolation time Δt* guaranteed to stay in the physical regime:

$$\Delta t^* = \min_{\{(i,j)\,:\,v_{ij} < -10^{-6}\}} \frac{s_{ij}}{|v_{ij}|}, \tag{45}$$

where we filtered for $v_{ij} < -10^{-6}$. Finally, we set $\Delta s_{ij}$ to zero whenever $s_{ij} < 10^{-6}$ and $v_{ij} < -10^{-6}$, to avoid extrapolation into the negative regime; this fairly rare case occurs due to nonzero $q_j$.
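A minimal sketch of Eq 45 and the zero-clipping rule, assuming at least one entry satisfies $v_{ij} < -10^{-6}$:

```python
# Minimal sketch of Eq 45: a global extrapolation time keeping s + v*dt
# non-negative, plus the zero-clipping rule for near-zero counts.
import numpy as np

def safe_extrapolation(s, v, eps=1e-6):
    shrinking = v < -eps                     # filter v_ij < -1e-6
    dt_star = np.min(s[shrinking] / np.abs(v[shrinking]))
    ds = v * dt_star
    ds[(s < eps) & shrinking] = 0.0          # avoid the negative regime
    return dt_star, ds
```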

Velocity embedding

For the nonlinear velocity embeddings, we used 150 embedding neighbors and the square-root transformation by default. We deviated from this procedure in Fig D in S1 Text to investigate the impact of transformation and neighborhood choices. We used the default hyperparameter σ = 0.05 to calculate the softmax over directions to embedding neighbors (as described on pp. 7–8 in SN1 of [1]). To implement the embeddings, we called the velocyto functions estimate_transition_prob and calculate_embedding_shift, which automatically correct for cell density. We did not use the neighborhood downsampling, randomization, or expression scaling options for the figures in this report; we observed no substantial difference in results between these schemas and our standard procedure. The “high-dimensional space,” used to evaluate the displacements $s_q - s_i$ in Eq 4, was the matrix of imputed spliced counts. The extrapolations $\Delta s_i$ were obtained from the procedure in “Inference and extrapolation.”

For the linear velocity embeddings, we log- and PCA-transformed the matrix $s_{ij} + v_{ij}\Delta t^*$, with the timescale obtained by the procedure in Eq 45.

The “Boolean” schema for velocity embedding is qualitatively similar to the schema proposed in the original publication; we previously proposed it to bypass the unit inconsistency between different genes’ βj (and thus vectors vj) in the context of the protaccel package [4]. Instead of computing a correlation coefficient, we simply calculated the fraction of concordant signs between the velocity and the displacements to neighbors. In the parlance of Eq 4, the correlation is replaced by the concordance

$$c_{iq} = \frac{1}{G}\sum_{j=1}^{G} \delta\!\left(\operatorname{sign}(\Delta s_{ij}),\ \operatorname{sign}(s_{qj} - s_{ij})\right), \tag{46}$$

where the sum runs over the G genes and δ is the Kronecker delta operating on inputs in {−1, 0, 1}.
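A minimal sketch of the concordance in Eq 46 for one cell–neighbor pair, with all vectors running over genes:

```python
# Minimal sketch of Eq 46: fraction of genes whose velocity sign agrees
# with the sign of the displacement toward a neighboring cell.
import numpy as np

def sign_concordance(ds_i, s_i, s_q):
    # ds_i: extrapolation for cell i; s_i, s_q: imputed counts of cell i
    # and neighbor q. Equality of signs implements the Kronecker delta.
    return np.mean(np.sign(ds_i) == np.sign(s_q - s_i))
```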

We plotted cell-specific arrows for the linear baseline and aggregated the nonlinear velocity arrows using a 20 × 20 grid. We used this convention to distinguish the embedding methods, which are conceptually and quantitatively different, in plots that showed several velocity fields at once (e.g., the PCA plot in Fig 7). The grid directions were computed using the velocyto function calculate_grid_arrows, which applies a Gaussian kernel to average over the cell-specific embedded velocities nearest the grid point. We used the default parameters for the kernel, with 100 neighbors and a smoothing parameter of 0.5. We aggregated linear velocity projections in Fig 10a and 10b. As the arrow scale did not appear to have a quantitative interpretation, we set it manually to match the plot proportions.

Supporting information

S1 Text. Supplementary derivations and figures.

https://doi.org/10.1371/journal.pcbi.1010492.s001

(PDF)

Acknowledgments

G.G. thanks Dr. John J. Vastola for fruitful discussions about landscape representations of biophysical systems.

References

  1. La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, et al. RNA velocity of single cells. Nature. 2018;560(7719):494–498. pmid:30089906
  2. Zeisel A, Kostler WJ, Molotski N, Tsai JM, Krauthgamer R, Jacob-Hirsch J, et al. Coupled pre-mRNA and mRNA dynamics unveil operational strategies underlying transcriptional responses to stimuli. Molecular Systems Biology. 2011;7(1):529. pmid:21915116
  3. Bergen V, Lange M, Peidli S, Wolf FA, Theis FJ. Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology. 2020. pmid:32747759
  4. Gorin G, Svensson V, Pachter L. Protein velocity and acceleration from single-cell multiomics experiments. Genome Biology. 2020;21:39. pmid:32070398
  5. Li C, Virgilio M, Collins KL, Welch JD. Single-cell multi-omic velocity infers dynamic and decoupled gene regulation. bioRxiv: 2021.12.13.472472; 2021.
  6. Tedesco M, Giannese F, Lazarević D, Giansanti V, Rosano D, Monzani S, et al. Chromatin Velocity reveals epigenetic dynamics by single-cell profiling of heterochromatin and euchromatin. Nature Biotechnology. 2021. pmid:34635836
  7. Weng G, Kim J, Won KJ. VeTra: a tool for trajectory inference based on RNA velocity. Bioinformatics. 2021; p. btab364. pmid:33974009
  8. Zhang Z, Zhang X. Inference of high-resolution trajectories in single cell RNA-Seq data from RNA velocity. bioRxiv: 2020.09.30.321125; 2020.
  9. Gupta R, Cerletti D, Gut G, Oxenius A, Claassen M. Cytopath: Simulation based inference of differentiation trajectories from RNA velocity fields. bioRxiv: 2020.12.21.423801; 2020.
  10. Lange M, Bergen V, Klein M, Setty M, Reuter B, Bakhti M, et al. CellRank for directed single-cell fate mapping. Nature Methods. 2022. pmid:35027767
  11. Schwabe D, Formichetti S, Junker JP, Falcke M, Rajewsky N. The transcriptome dynamics of single cells during the cell cycle. Molecular Systems Biology. 2020;16(11). pmid:33205894
  12. Harmanci AS, Harmanci AO, Zhou X, Deneen B, Rao G, Klisch T, et al. scRegulocity: Detection of local RNA velocity patterns in embeddings of single cell RNA-Seq data. bioRxiv: 2021.06.01.446674; 2021.
  13. Wang X. Velo-Predictor: an ensemble learning pipeline for RNA velocity prediction. BMC Bioinformatics. 2021; p. 12. pmid:34479487
  14. Cannoodt R, Saelens W, Deconinck L, Saeys Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nature Communications. 2021;12(1):3942. pmid:34168133
  15. Zhang Z, Zhang X. VeloSim: Simulating single cell gene-expression and RNA velocity. bioRxiv: 2021.01.11.426277; 2021.
  16. Atta L, Fan J. VeloViz: RNA-velocity informed 2D embeddings for visualizing cellular trajectories. bioRxiv: 2021.01.28.425293; 2021. pmid:34500455
  17. Hie BL, Yang KK, Kim PS. Evolutionary velocity with protein language models. bioRxiv: 2021.06.07.447389; 2021.
  18. Svensson V, Pachter L. RNA Velocity: Molecular Kinetics from Single-Cell RNA-Seq. Molecular Cell. 2018;72(1):7–9. pmid:30290149
  19. Charrout M, Reinders MJT, Mahfouz A. Untangling biological factors influencing trajectory inference from single cell data. bioRxiv: 2020.02.11.942102; 2020.
  20. Tritschler S, Büttner M, Fischer DS, Lange M, Bergen V, Lickert H, et al. Concepts and limitations for learning developmental trajectories from single cell genomics. Development. 2019;146(12):dev170506. pmid:31249007
  21. Bergen V, Soldatov RA, Kharchenko PV, Theis FJ. RNA velocity—current challenges and future perspectives. Molecular Systems Biology. 2021;17(8). pmid:34435732
  22. Lavenant H, Zhang S, Kim YH, Schiebinger G. Towards a mathematical theory of trajectory inference. arXiv: 2102.09204; 2021.
  23. Zhang S, Afanassiev A, Greenstreet L, Matsumoto T, Schiebinger G. Optimal transport analysis reveals trajectories in steady-state systems. PLOS Computational Biology. 2021;17(12):e1009466. pmid:34860824
  24. Li T. On the Mathematics of RNA Velocity I: Theoretical Analysis. CSIAM Transactions on Applied Mathematics. 2021;2(1):1–55.
  25. Gorin G, Pachter L. Special function methods for bursty models of transcription. Physical Review E. 2020;102(2):022409. pmid:32942485
  26. Haque A, Engel J, Teichmann SA, Lönnberg T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Medicine. 2017;9(1):75. pmid:28821273
  27. Soneson C, Srivastava A, Patro R, Stadler MB. Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLOS Computational Biology. 2021;17(1):e1008585. pmid:33428615
  28. Zheng SC, Stein-O’Brien G, Boukas L, Goff LA, Hansen KD. Pumping the brakes on RNA velocity—understanding and interpreting RNA velocity estimates. bioRxiv: 2022.06.19.494717; 2022.
  29. Qiu X, Zhang Y, Martin-Rufino JD, Weng C, Hosseinzadeh S, Yang D, et al. Mapping transcriptomic vector fields of single cells. Cell. 2022; p. S0092867421015774. pmid:35108499
  30. Melsted P, Booeshaghi AS, Liu L, Gao F, Lu L, Min KH, et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology. 2021;39(7):813–818. pmid:33795888
  31. Moses L, Pachter L. Museum of Spatial Transcriptomics. bioRxiv: 2021.05.11.443152; 2021.
  32. George L, Indig FE, Abdelmohsen K, Gorospe M. Intracellular RNA-tracking methods. Open Biology. 2018;8(10):180104. pmid:30282659
  33. Specht EA, Braselmann E, Palmer AE. A Critical and Comparative Review of Fluorescent Tools for Live-Cell Imaging. Annual Review of Physiology. 2017;79(1):93–117. pmid:27860833
  34. Golding I, Paulsson J, Zawilski SM, Cox EC. Real-Time Kinetics of Gene Activity in Individual Bacteria. Cell. 2005;123(6):1025–1036. pmid:16360033
  35. Munsky B, Fox Z, Neuert G. Integrating single-molecule experiments and discrete stochastic models to understand heterogeneous gene transcription dynamics. Methods. 2015;85:12–21. pmid:26079925
  36. Wold B, Myers RM. Sequence census methods for functional genomics. Nature Methods. 2008;5(1):19–21. pmid:18165803
  37. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology. 2019;15(6):e8746. pmid:31217225
  38. Guo J, Grow EJ, Mlcochova H, Maher GJ, Lindskog C, Nie X, et al. The adult human testis transcriptional cell atlas. Cell Research. 2018;28(12):1141–1157. pmid:30315278
  39. Soldatov R, Kaucka M, Kastriti ME, Petersen J, Chontorotzea T, Englmaier L, et al. Spatiotemporal structure of cell fate decisions in murine neural crest. Science. 2019;364(6444):eaas9536. pmid:31171666
  40. Xiong H, Luo Y, Yue Y, Zhang J, Ai S, Li X, et al. Single-Cell Transcriptomics Reveals Chemotaxis-Mediated Intraorgan Crosstalk During Cardiogenesis. Circulation Research. 2019;125(4):398–410. pmid:31221018
  41. Litviňuková M, Talavera-López C, Maatz H, Reichart D, Worth CL, Lindberg EL, et al. Cells of the adult human heart. Nature. 2020;588(7838):466–472. pmid:32971526
  42. Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, et al. Construction of a human cell landscape at single-cell level. Nature. 2020;581(7808):303–309. pmid:32214235
  43. Yu L, Wei Y, Duan J, Schmitz DA, Sakurai M, Wang L, et al. Blastocyst-like structures generated from human pluripotent stem cells. Nature. 2021;591(7851):620–626. pmid:33731924
  44. Bassez A, Vos H, Van Dyck L, Floris G, Arijs I, Desmedt C, et al. A single-cell map of intratumoral changes during anti-PD1 treatment of patients with breast cancer. Nature Medicine. 2021;27(5):820–832. pmid:33958794
  45. Jansky S, Sharma AK, Körber V, Quintero A, Toprak UH, Wecht EM, et al. Single-cell transcriptomic analyses provide insights into the developmental origins of neuroblastoma. Nature Genetics. 2021;53(5):683–693. pmid:33767450
  46. Couturier CP, Ayyadhury S, Le PU, Nadaf J, Monlong J, Riva G, et al. Single-cell RNA-seq reveals that glioblastoma recapitulates a normal neurodevelopmental hierarchy. Nature Communications. 2020;11(1):3406. pmid:32641768
  47. Shah S, Takei Y, Zhou W, Lubeck E, Yun J, Eng CHL, et al. Dynamics and Spatial Genomics of the Nascent Transcriptome by Intron seqFISH. Cell. 2018;174(2):363–376.e16. pmid:29887381
  48. Shah S, Lubeck E, Zhou W, Cai L. In Situ Transcription Profiling of Single Cells Reveals Spatial Organization of Cells in the Mouse Hippocampus. Neuron. 2016;92(2):342–357. pmid:27764670
  49. Park J, Choi W, Tiesmeyer S, Long B, Borm LE, Garren E, et al. Cell segmentation-free inference of cell types from in situ transcriptomics data. Nature Communications. 2021;12(1):3545. pmid:34112806
  50. Samacoits A, Chouaib R, Safieddine A, Traboulsi AM, Ouyang W, Zimmer C, et al. A computational framework to study sub-cellular RNA localization. Nature Communications. 2018;9(1):4584. pmid:30389932
  51. Kim JK, Kolodziejczyk AA, Ilicic T, Teichmann SA, Marioni JC. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nature Communications. 2015;6(1):8687. pmid:26489834
  52. Kim J, Marioni JC. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biology. 2013;14:R7. pmid:23360624
  53. Jiang R, Sun T, Song D, Li JJ. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biology. 2022;23:31. pmid:35063006
  54. Delmans M, Hemberg M. Discrete distributional differential expression (D3E)—a tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinformatics. 2016;17:110. pmid:26927822
  55. Ham L, Brackston RD, Stumpf MPH. Extrinsic Noise and Heavy-Tailed Laws in Gene Expression. Physical Review Letters. 2020;124(10):108101. pmid:32216388
  56. Svensson V. Droplet scRNA-seq is not zero-inflated. Nature Biotechnology. 2020;38(2):147–150. pmid:31937974
  57. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature Methods. 2018;15(12):1053–1058. pmid:30504886
  58. Amrhein L, Harsha K, Fuchs C. A mechanistic model for the negative binomial distribution of single-cell mRNA counts. bioRxiv: 657619; 2019.
  59. Bonnaffoux A, Herbach U, Richard A, Guillemin A, Gonin-Giraud S, Gros PA, et al. WASABI: a dynamic iterative framework for gene regulatory network inference. BMC Bioinformatics. 2019;20(1):1–19. pmid:31046682
  60. Swain PS, Elowitz MB, Siggia ED. Intrinsic and extrinsic contributions to stochasticity in gene expression. Proceedings of the National Academy of Sciences. 2002;99(20):12795–12800. pmid:12237400
  61. Xu H, Skinner SO, Sokac AM, Golding I. Stochastic Kinetics of Nascent RNA. Physical Review Letters. 2016;117(12):128101. pmid:27667861
  62. Raj A, Peskin CS, Tranchina D, Vargas DY, Tyagi S. Stochastic mRNA Synthesis in Mammalian Cells. PLoS Biology. 2006;4(10):e309. pmid:17048983
  63. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications. 2017;8(1):14049. pmid:28091601
  64. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology. 2016;34(5):525–527. pmid:27043002
  65. Du Y, Huang Q, Arisdakessian C, Garmire LX. Evaluation of STAR and Kallisto on Single Cell RNA-Seq Data Alignment. G3: Genes, Genomes, Genetics. 2020;10(5):1775–1783. pmid:32220951
  66. Petukhov V, Guo J, Baryawno N, Severe N, Scadden DT, Samsonova MG, et al. dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biology. 2018;19(1):78. pmid:29921301
  67. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications. 2018;9(1):997. pmid:29520097
  68. Andrews T, Hemberg M. False signals induced by single-cell imputation. F1000Research. 2019;7:1740.
  69. Lynch M, Conery JS. The Origins of Genome Complexity. Science. 2003;302(5649):1401–1404. pmid:14631042
  70. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456(7221):470–476. pmid:18978772
  71. Conze T, Göransson J, Razzaghian HR, Ericsson O, Öberg D, Akusjärvi G, et al. Single molecule analysis of combinatorial splicing. Nucleic Acids Research. 2010;38(16):e163. pmid:20587504
  72. Tian L, Jabbari JS, Thijssen R, Gouil Q, Amarasinghe SL, Kariyawasam H, et al. Comprehensive characterization of single cell full-length isoforms in human and mouse with long-read sequencing. bioRxiv: 2020.08.10.243543; 2020.
  73. Booeshaghi AS, Yao Z, van Velthoven C, Smith K, Tasic B, Zeng H, et al. Isoform cell-type specificity in the mouse primary motor cortex. Nature. 2021;598(7879):195–199. pmid:34616073
  74. Liu X, Andrews MV, Skinner JP, Johanson TM, Chong MMW. A comparison of alternative mRNA splicing in the CD4 and CD8 T cell lineages. Molecular Immunology. 2021;133:53–62. pmid:33631555
  75. Pimentel H, Parra M, Gee SL, Mohandas N, Pachter L, Conboy JG. A dynamic intron retention program enriched in RNA processing genes regulates gene expression during terminal erythropoiesis. Nucleic Acids Research. 2016;44(2):838–851. pmid:26531823
  76. Dvinge H, Bradley RK. Widespread intron retention diversifies most cancer transcriptomes. Genome Medicine. 2015;7(1):45. pmid:26113877
  77. Wong JJL, Au AYM, Ritchie W, Rasko JEJ. Intron retention in mRNA: No longer nonsense: Known and putative roles of intron retention in normal and disease biology. BioEssays. 2016;38(1):41–49. pmid:26612485
  78. Galante PAF, Sakabe NJ, Kirschbaum-Slager N, De Souza SJ. Detection and evaluation of intron retention events in the human transcriptome. RNA. 2004;10(5):757–765. pmid:15100430
  79. Singh J, Padgett RA. Rates of in situ transcription and splicing in large human genes. Nature Structural & Molecular Biology. 2009;16(11):1128–1133. pmid:19820712
  80. Wan Y, Anastasakis DG, Rodriguez J, Palangat M, Gudla P, Zaki G, et al. Dynamic imaging of nascent RNA reveals general principles of transcription dynamics and stochastic splice site selection. Cell. 2021;184(11):2878–2895.e20. pmid:33979654
  81. Gorin G, Pachter L. Modeling bursty transcription and splicing with the chemical master equation. Biophysical Journal. 2022;121(6):1056–1069. pmid:35143775
  82. Jahnke T, Huisinga W. Solving the chemical master equation for monomolecular reaction systems analytically. Journal of Mathematical Biology. 2006;54:1–26. pmid:16953443
  83. Dattani J. Exact solutions of master equations for the analysis of gene transcription models [PhD Dissertation]. Imperial College London; 2015.
  84. Dattani J, Barahona M. Stochastic models of gene transcription with upstream drives: exact solution and sample path characterization. Journal of The Royal Society Interface. 2017;14(126):20160833. pmid:28053113
  85. Milo R, Phillips R. Cell Biology by the Numbers. Garland Science; 2015.
  86. Ullah M, Wolkenhauer O. Stochastic approaches for systems biology. New York: Springer; 2011.
  87. Battich N, Stoeger T, Pelkmans L. Control of Transcript Variability in Single Mammalian Cells. Cell. 2015;163(7):1596–1610. pmid:26687353
  88. Munsky B, Li G, Fox ZR, Shepherd DP, Neuert G. Distribution shapes govern the discovery of predictive models for gene regulation. Proceedings of the National Academy of Sciences. 2018;115(29):7533–7538. pmid:29959206
  89. Munsky B, Trinh B, Khammash M. Listening to the noise: random fluctuations reveal gene network parameters. Molecular Systems Biology. 2009;5:318. pmid:19888213
  90. Munsky B, Neuert G, van Oudenaarden A. Using Gene Expression Noise to Understand Gene Regulation. Science. 2012;336(6078):183–187. pmid:22499939
  91. Zenklusen D, Larson DR, Singer RH. Single-RNA counting reveals alternative modes of gene expression in yeast. Nature Structural & Molecular Biology. 2008;15(12):1263–1271. pmid:19011635
  92. Hilfinger A, Norman T, Paulsson J. Exploiting Natural Fluctuations to Identify Kinetic Mechanisms in Sparsely Characterized Systems. Cell Systems. 2016;2(4):251–259. pmid:27135537
  93. Skinner SO, Xu H, Nagarkar-Jaiswal S, Freire PR, Zwaka TP, Golding I. Single-cell analysis of transcription kinetics across the cell cycle. eLife. 2016;5:e12175. pmid:26824388
  94. Limi S, Senecal A, Coleman R, Lopez-Jones M, Guo P, Polumbo C, et al. Transcriptional burst fraction and size dynamics during lens fiber cell differentiation and detailed insights into the denucleation process. Journal of Biological Chemistry. 2018;293(34):13176–13190. pmid:29959226
  95. Singer ZS, Yong J, Tischler J, Hackett JA, Altinok A, Surani MA, et al. Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells. Molecular Cell. 2014;55(2):319–331. pmid:25038413
  96. Stumpf PS, Smith RCG, Lenz M, Schuppert A, Müller FJ, Babtie A, et al. Stem Cell Differentiation as a Non-Markov Stochastic Process. Cell Systems. 2017;5(3):268–282.e7. pmid:28957659
  97. Dar RD, Razooky BS, Singh A, Trimeloni TV, McCollum JM, Cox CD, et al. Transcriptional burst frequency and burst size are equally modulated across the human genome. Proceedings of the National Academy of Sciences. 2012;109(43):17454–17459. pmid:23064634
  98. Sanchez A, Golding I. Genetic Determinants and Cellular Constraints in Noisy Gene Expression. Science. 2013;342(6163):1188–1193. pmid:24311680
  99. Bahar Halpern K, Tanami S, Landen S, Chapal M, Szlak L, Hutzler A, et al. Bursty Gene Expression in the Intact Mammalian Liver. Molecular Cell. 2015;58(1):147–156. pmid:25728770
  100. Larsson AJM, Johnsson P, Hagemann-Jensen M, Hartmanis L, Faridani OR, Reinius B, et al. Genomic encoding of transcriptional burst kinetics. Nature. 2019;565(7738):251–254. pmid:30602787
  101. Nicolas D, Phillips NE, Naef F. What shapes eukaryotic transcriptional bursting? Molecular BioSystems. 2017;13(7):1280–1290. pmid:28573295
  102. Rodriguez J, Larson DR. Transcription in Living Cells: Molecular Mechanisms of Bursting. Annual Review of Biochemistry. 2020;89(1):189–212. pmid:32208766
  103. Klindziuk A, Kolomeisky AB. Understanding the molecular mechanisms of transcriptional bursting. Physical Chemistry Chemical Physics. 2021; doi:10.1039/D1CP03665C.
  104. Paulsson J. Models of stochastic gene expression. Physics of Life Reviews. 2005;2(2):157–175.
  105. Peccoud J, Ycart B. Markovian Modeling of Gene Product Synthesis. Theoretical Population Biology. 1995;48(2):222–234.
  106. Singh A, Bokes P. Consequences of mRNA Transport on Stochastic Variability in Protein Levels. Biophysical Journal. 2012;103(5):1087–1096. pmid:23009859
  107. Ahlmann-Eltze C, Huber W. Transformation and Preprocessing of Single-Cell RNA-Seq Data. bioRxiv: 2021.06.24.449781; 2021.
  108. Rostom R, Svensson V, Teichmann SA, Kar G. Computational approaches for interpreting scRNA-seq data. FEBS Letters. 2017;591(15):2213–2225. pmid:28524227
  109. McGee WA, Pimentel H, Pachter L, Wu JY. Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data. bioRxiv: 564955; 2019.
  110. Kim TH, Zhou X, Chen M. Demystifying “drop-outs” in single-cell UMI data. Genome Biology. 2020;21:196. pmid:32762710
  111. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology. 2019;20(1):295. pmid:31870412
  112. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology. 2019;20:296. pmid:31870423
  113. Cole MB, Risso D, Wagner A, DeTomaso D, Ngai J, Purdom E, et al. Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq. Cell Systems. 2019;8(4):315–328.e8. pmid:31022373
  114. Pickands J. Statistical Inference Using Extreme Order Statistics. The Annals of Statistics. 1975;3(1):119–131.
  115. Blanchet J, He F, Murthy KRA. On distributionally robust extreme value analysis. arXiv: 1601.06858; 2020.
  116. Chernozhukov V. Extremal quantile regression. The Annals of Statistics. 2005;33(2).
  117. Chernozhukov V, Fernández-Val I, Kaji T. Extremal Quantile Regression: An Overview. arXiv: 1612.06850; 2017.
  118. Khatri CG. Distributions of order statistics for discrete case. Annals of the Institute of Statistical Mathematics. 1962;14(1):167–171.
  119. Nagaraja HN. Order Statistics from Discrete Distributions. Statistics. 1992;23(3):189–216.
  120. Arnold BC, Balakrishnan N, Nagaraja HN. A First Course in Order Statistics. Classics in Applied Mathematics. Philadelphia: Society for Industrial and Applied Mathematics; 2008.
  121. Briggs KM, Song L, Prellberg T. A note on the distribution of the maximum of a set of Poisson random variables. arXiv: 0903.4373; 2009.
  122. Chari T, Banerjee J, Pachter L. The Specious Art of Single-Cell Genomics. bioRxiv: 2021.08.25.457696; 2021.
  123. Johnson WB, Lindenstrauss J. Extensions of Lipschitz mappings into a Hilbert space. In: Beals R, Beck A, Bellow A, Hajian A, editors. Contemporary Mathematics. vol. 26. Providence, Rhode Island: American Mathematical Society; 1984. p. 189–206.
  124. Cooley SM, Hamilton T, Ray JCJ, Deeds EJ. A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data. bioRxiv: 689851; 2020.
  125. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature Communications. 2019;10(1):5416. pmid:31780648
  126. Kessler O, Jiang Y, Chasin LA. Order of intron removal during splicing of endogenous adenine phosphoribosyltransferase and dihydrofolate reductase pre-mRNA. Molecular and Cellular Biology. 1993;13(10):6211–6222. pmid:8413221
  127. de la Mata M, Lafaille C, Kornblihtt AR. First come, first served revisited: Factors affecting the same alternative splicing event have different effects on the relative rates of intron removal. RNA. 2010;16(5):904–912. pmid:20357345
  128. Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA. Splicing graphs and EST assembly problem. Bioinformatics. 2002;18(suppl_1):S181–S188. pmid:12169546
  129. Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsköld D, Hendriks GJ, Larsson AJM, et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nature Biotechnology. 2020;38(6):708–714. pmid:32518404
  130. Gorin G, Pachter L. Length Biases in Single-Cell RNA Sequencing of pre-mRNA. bioRxiv: 2021.07.30.454514; 2021.
  131. Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, et al. Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming. bioRxiv: 191056; 2017.
  132. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology. 2014;32(4):381–386. pmid:24658644
  133. Wolf FA, Hamey FK, Plass M, Solana J, Dahlin JS, Göttgens B, et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biology. 2019;20(1):59. pmid:30890159
  134. Lönnberg T, Svensson V, James KR, Fernandez-Ruiz D, Sebina I, Montandon R, et al. Single-cell RNA-seq and computational analysis using temporal mixture modeling resolves Th1/Tfh fate bifurcation in malaria. Science Immunology. 2017;2(9):eaal2192. pmid:28345074
  135. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nature Methods. 2017;14(3):309–315. pmid:28114287
  136. Campbell KR, Yau C. A descriptive marker gene approach to single-cell pseudotime inference. Bioinformatics. 2019;35(1):28–35. pmid:29939207
  137. Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics. 2015;31(18):2989–2998. pmid:26002886
  138. Weinreb C, Wolock S, Tusi BK, Socolovsky M, Klein AM. Fundamental limits on dynamic inference from single-cell snapshots. Proceedings of the National Academy of Sciences. 2018;115(10):E2467–E2476. pmid:29463712
  139. Deconinck L, Cannoodt R, Saelens W, Deplancke B, Saeys Y. Recent advances in trajectory inference from single-cell omics data. Current Opinion in Systems Biology. 2021;27:100344.
  140. Zhang J, Nie Q, Zhou T. Revealing Dynamic Mechanisms of Cell Fate Decisions From Single-Cell Transcriptomic Data. Frontiers in Genetics. 2019;10:1280. pmid:31921315
  141. Pitman JW. Occupation Measures for Markov Chains. Advances in Applied Probability. 1977;9(1):69–86.
  142. Yang Y, Nurbekyan L, Negrini E, Martin R, Pasha M. Optimal Transport for Parameter Identification of Chaotic Dynamics via Invariant Measures. arXiv: 2104.15138; 2021.
  143. Kuntz J, Thomas P, Stan GB, Barahona M. The Exit Time Finite State Projection Scheme: Bounding Exit Distributions and Occupation Measures of Continuous-Time Markov Chains. SIAM Journal on Scientific Computing. 2019;41(2):A748–A769.
  144. Van Kampen NG. Stochastic Processes in Physics and Chemistry. 3rd ed. Elsevier; 2007.
  145. Abramowitz M, Stegun I, editors. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. 9th ed. United States National Bureau of Standards; 1970.
  146. Vastola JJ. In search of a coherent theoretical framework for stochastic gene regulation [PhD Dissertation]. Vanderbilt University; 2021.
  147. Gorin G, Vastola JJ, Fang M, Pachter L. Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments. bioRxiv: 2021.09.06.459173; 2021.
  148. Gorin G, Pachter L. Monod: mechanistic analysis of single-cell RNA sequencing count data. bioRxiv: 2022.06.11.495771; 2022.
  149. Davis ME, Davis RJ. Fundamentals of chemical reaction engineering. International ed. McGraw-Hill chemical engineering series. Boston: McGraw-Hill; 2003.
  150. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics. 1976;22(4):403–434.
  151. Munsky B, Khammash M. The finite state projection algorithm for the solution of the chemical master equation. The Journal of Chemical Physics. 2006;124(4):044104. pmid:16460146
  152. Wilkinson DJ. Stochastic modelling for systems biology. 3rd ed. Chapman & Hall/CRC mathematical and computational biology. Boca Raton: CRC Press, Taylor & Francis Group; 2019.
  153. Phillips R, Kondev J, Theriot J, Garcia HG. Physical biology of the cell. 2nd ed. New York, NY: Garland Science; 2013.
  154. Breda J, Zavolan M, van Nimwegen E. Bayesian inference of gene expression states from single-cell RNA-seq data. Nature Biotechnology. 2021;39(8):1008–1016. pmid:33927416
  155. Gayoso A, Lopez R, Xing G, Boyeau P, Valiollah Pour Amiri V, Hong J, et al. A Python library for probabilistic analysis of single-cell omics data. Nature Biotechnology. 2022. pmid:35132262
  156. Jia C, Grima R. Accuracy and limitations of extrinsic noise models to describe gene expression in growing cells. bioRxiv: 2022.06.15.496247; 2022.
  157. Grima R, Schmidt DR, Newman TJ. Steady-state fluctuations of a genetic feedback loop: An exact solution. The Journal of Chemical Physics. 2012;137(3):035104. pmid:22830733
  158. Cao Z, Grima R. Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proceedings of the National Academy of Sciences. 2020;117(9):4682–4692. pmid:32071224
  159. Veloso A, Kirkconnell KS, Magnuson B, Biewen B, Paulsen MT, Wilson TE, et al. Rate of elongation by RNA polymerase II is associated with specific gene features and epigenetic modifications. Genome Research. 2014;24(6):896–905. pmid:24714810
  160. Zhang X, Jin H, Yang Z, Lei J. Effects of elongation delay in transcription dynamics. Mathematical Biosciences and Engineering. 2014;11(6):1431–1448. pmid:25365608
  161. Coté A, Coté C, Bayatpour S, Drexler HL, Alexander KA, Chen F, et al. pre-mRNA spatial distributions suggest that splicing can occur post-transcriptionally. bioRxiv: 2020.04.06.028092; 2021.
  162. Choubey S. Nascent RNA kinetics: Transient and steady state behavior of models of transcription. Physical Review E. 2018;97(2):022402. pmid:29548128
  163. Gorin G, Wang M, Golding I, Xu H. Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics. PLOS ONE. 2020;15(3):e0230736. pmid:32214380
  164. Melsted P, Ntranos V, Pachter L. The barcode, UMI, set format and BUStools. Bioinformatics. 2019; p. btz279. pmid:31073610
  165. Thorne BC, Bailey AM, Peirce SM. Combining experiments with multi-cell agent-based modeling to study biological tissue patterning. Briefings in Bioinformatics. 2007;8(4):245–257. pmid:17584763
  166. Reichl LE. A Modern Course in Statistical Physics. 4th ed. Wiley-VCH Verlag GmbH & Co. KGaA; 2016.
  167. Das B, Mitra P. High-Performance Whole-Cell Simulation Exploiting Modular Cell Biology Principles. Journal of Chemical Information and Modeling. 2021;61(3):1481–1492. pmid:33683902
  168. Thornburg ZR, Bianchi DM, Brier TA, Gilbert BR, Earnest TM, Melo MCR, et al. Fundamental behaviors emerge from simulations of a living minimal cell. Cell. 2022;185(2):345–360.e28. pmid:35063075
  169. Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, Vol. 1. 2nd ed. Wiley series in probability and mathematical statistics. New York: Wiley; 1994.
  170. Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nature Biotechnology. 2019;37(5):547–554. pmid:30936559
  171. Gutin G, Punnen AP, editors. The Traveling Salesman Problem and Its Variations. 1st ed. No. 12 in Combinatorial Optimization. Springer-Verlag; 2007.
  172. Liu Z, Lou H, Xie K, Wang H, Chen N, Aparicio OM, et al. Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nature Communications. 2017;8(1):22. pmid:28630425
  173. Keizer EM, Bastian B, Smith RW, Grima R, Fleck C. Extending the linear-noise approximation to biochemical systems influenced by intrinsic noise and slow lognormally distributed extrinsic noise. Physical Review E. 2019;99(5):052417. pmid:31212540
  174. Lindsay BG. Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics. 1995;5:i–163.
  175. Campbell KR, Yau C. Order Under Uncertainty: Robust Differential Expression Analysis Using Probabilistic Models for Pseudotime Inference. PLOS Computational Biology. 2016;12(11):e1005212. pmid:27870852
  176. Folia MM, Rattray M. Trajectory inference and parameter estimation in stochastic models with temporally aggregated data. Statistics and Computing. 2018;28(5):1053–1072. pmid:30147250
  177. Ahmed S, Rattray M, Boukouvalas A. GrandPrix: scaling up the Bayesian GPLVM for single-cell data. Bioinformatics. 2019;35(1):47–54. pmid:30561544
  178. Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nature Communications. 2020;11(1):1169. pmid:32127540
  179. Desai RV, Chen X, Martin B, Chaturvedi S, Hwang DW, Li W, et al. A DNA repair pathway can regulate transcriptional noise to promote cell fate transitions. Science. 2021;373(6557):eabc6506. pmid:34301855