Towards reliable quantification of cell state velocities

A few years ago, it was proposed to use the simultaneous quantification of unspliced and spliced messenger RNA (mRNA) to add a temporal dimension to high-throughput snapshots of single cell RNA sequencing data. This concept can yield additional insight into the transcriptional dynamics of the biological systems under study. However, current methods for inferring cell state velocities from such data (known as RNA velocities) are afflicted by several theoretical and computational problems, hindering realistic and reliable velocity estimation. We discuss these issues and propose new solutions for addressing some of the current challenges in consistency of data processing, velocity inference and visualisation. We translate our computational conclusion in two velocity analysis tools: one detailed method κ-velo and one heuristic method eco-velo, each of which uses a different set of assumptions about the data.


Author summary
Single cell transcriptomics has been used to study dynamical biological processes such as cell differentiation or disease progression. An ideal study of these systems would track individual cells in time but this is not directly feasible since cells are destroyed as part of the sequencing protocol. Because of asynchronous progression of cells, single cell snapshot datasets often capture cells at different stages of progression. The challenge is to infer both the overall direction of progression (pseudotime) as well as single cell specific variations in the progression. Computational methods development for inference of the overall direction are well advanced but attempts to address the single cell level variations of the dynamics are newer. Simultaneous measurement of abundances of new (unspliced) and older (spliced) mRNA in the same single cell adds a temporal dimension to the data which can be used to infer the time derivative of single cells progression through the dynamical

Introduction
Single cell transcriptomics has facilitated the study of asynchronous cellular processes such as cell differentiation in the high-dimensional gene expression space. Development of computational methods for extracting temporal information from snapshots of the system has attracted much attention in recent years. The output of these methods is typically a pseudo-temporal ordering of cells, representing their progression along the (deterministic) path of directed differentiation. However, this ordering does not reflect the intrinsic stochastic characteristics of the process and leaves several biologically interesting questions unanswered. Can cells go back along de-differentiation paths? If yes, how far and how likely is that? How strong is the stochastic component of the dynamics compared to the deterministic directed part? Answering these questions would allow quantification of cell fate plasticity in different transcriptional regions. RNA velocity, proposed by [1] (and the corresponding package called velocyto), was a breakthrough towards obtaining a more complete description of the dynamics of cell differentiation. Simultaneous measurement of abundances of nascent unspliced and mature spliced mRNA in single cells adds a temporal dimension to the collected data which can be used to infer the temporal motion of cells in transcriptomic space. A later method, scVelo [2], further advanced the concept by solving the transcriptional dynamics of splicing kinetics and velocity inference. Other extensions included additional temporal layers of gene regulation such as protein levels [3] or chromatin accessibility [4] to the unspliced and spliced mRNA levels to extract further information on cell state dynamics. Recently, there have also been advancements in using cell state velocities to study the degree of cell plasticity [5]. For all these methods, it is important to first ensure robust and reliable estimation of single cell velocities. Ideally, the estimated velocities should capture both the overall course in the population as well as the single-cell specific (stochastic) part of the dynamics. However, reliable inference of cell state velocities is still impeded by multiple computational issues. Some weaknesses in current velocity visualisation approaches, as well as challenges in inclusion of genes with multiple dynamics, have been pointed out in [1,2,6,7]. Another issue on scale invariance of gene-wise velocity components was described in more detail in [8]. Current methods either do not address this scale invariance issue or address it incompletely using unrealistic assumptions.
Moreover, there are several inconsistencies in the current methods' processing pipeline and the stochastic part of the dynamics is lost through multiple layers of data imputation and smoothing. In parallel to this study, [9] and [10] point out some of the limitations and problems of current velocity visualisation methods. [9] also suggests that, due to the highly stochastic nature of gene expression process, currently used (deterministic) approaches are insufficient and propose development of probabilistic alternatives. More recently, a variational inference method for RNA velocity estimation has also become available [11].
In this manuscript, we argue that when dealing with highly stochastic processes, deterministic approaches are only useful when talking about average velocities over specific time intervals, instead of talking about spontaneous velocities which are immeasurable in real life. We propose two different approaches for estimation and visualisation of RNA velocities. In κ-velo, we first design a processing workflow specifically adapted to downstream velocity calculations, thereby addressing problems in previously used workflows. We then solve for the gene-wise reaction rate parameters and propose an approach to relate velocity components across genes, hence resolving the scale invariance issue. We also present a new visualisation method that more faithfully represents the stochastic part of the velocities. In addition, we propose ecovelo, a heuristic method that bypasses several cumbersome, computationally costly and stochasticity killing steps used by other available methods.
A table of contents is provided in S1 Table of contents.

Dynamical inference
Building high-dimensional cell state velocities as vector sums of their gene-wise components (as is the current practice) requires careful handling of two major issues: ambiguity of the time scales and the relative scaling between different velocity components. In this section, we discuss current problems in state-of-the-art velocity estimation approaches and introduce our novel κ-velo and eco-velo approaches. The time scale over which average cell state velocities are reported. In the physical world, we can only measure average velocities in a given time interval Δt. As Δt ! 0 measured velocities get closer to instantaneous velocities, which are impossible to measure directly. When adding multiple velocity components one would ideally need to measure all gene-wise displacement componentsDx g in the same interval Δt. Mathematically: whereṼ is the G-dimensional velocity vector and Δt is the same for all genes. However, in the RNA velocity framework (even without scale invariance problem discussed in the next subsection) we use:Ṽ the result of which is different from Eq 1 for non-smooth expression dynamics. Using a different Δt g for each gene g, raises an immediate question: which time interval does the average cell state velocityṼ calculated from Eq (2) correspond to? Obscurity in the physical meaning of velocities calculated as such is more pronounced when including genes with noisy expression dynamics, e.g. bursting genes where velocities will change depending on the time scale (S1 Fig). For such genes, it would be interesting to experimentally measure velocities at multiple time scales. This could help us better understand the extent of cell fate plasticity. One would expect to see more variance in the direction of individual cells velocities reported in small time scales, whereas velocities over sufficiently large time scales would better align with the pseudotemporal direction of differentiation. Scale invariance of gene-wise velocity components. According to the RNA velocity formalism: where u g and s g represents the number of unspliced and spliced counts for gene g. α g , β g , γ g represent transcription rate, splicing rate and mature mRNA degradation rate respectively. v g represent the instantaneous velocity component of gene g. Eq 3 provides a deterministic (smooth) explanation of gene transcription and splicing events, in which the kinetic rate parameters are assumed to be constant (equal to their mean value over a relatively large time interval). In absence of temporal measurements (i.e., when working with snapshot u-s counts data) the actual time scales for which the assumption of constant kinetic rates for each gene would be valid are not known. In essence, one has to work with a time-independent relation between u and s counts, which we know as the u-s phase portrait of the data. This implies that scaling dt (or equivalently t) by κ does not change the u-s phase portrait of a gene. This scaling factor at the left hand side denominators of Eq 3 can be absorbed to the right hand side (RHS) of the equation, suggesting that if (α g , β g , γ g ) is a solution, (κα g , κβ g , κγ g ) is also a solution for any κ. This complicates deduction of the relative scaling of different genes, as was also shown in previous studies [8]. To get a valid highdimensional velocity vectorṼ , one needs to know the real scaling factor κ g for each gene: whereê g represents the unit vector for gene g.
To overcome the scale invariance, velocyto assumes κβ = 1 (i.e. same splicing rate) for all genes. scVelo assumes that the time of the observed kinetics (i.e. turning on, reaching stationary state and turning off) on the u-s phase portrait is equal for all genes (using a default constant of 20 hours). They then fit a latent time between 0 and this constant to the cells on the phase portrait of each gene, and scale the other kinetic parameters accordingly.
Having a global (i.e., gene independent for each cell) latent time would put the time scales of different genes' u-s phase portraits in perspective and resolve the scale invariance issue. However, we note that optimisation of cells' global latent time is not part of the expectation maximisation procedure in scVelo. Rather, after fitting the gene-specific parameters (including the latent times of the cells), scVelo uses a multi-step ad-hoc voting method among the fitted latent times from multiple high-likelihood genes to calculate a global latent time for the cells. This approach does not realistically address the relative scale of different genes and the full cycle time for all genes remain equal by assumption.
Instead, we suggest that a proxy for typical travel times between cell states could be used as global latent time. For example, one could use pseudotime or the cell density scaled version of it called universal time [12] as proxy for actual transition time between cell states. Here, the accuracy of pseudotime recovery for multiple branches of typical differentiation processes would be crucial for estimation of the velocity parameters.
In κ-velo, we circumvent prior recovery of global latent times and use an equivalent pergene approach. Here, we use the number of cells between two cell states as a proxy for the typical travel time between them on the gene specific u-s phase portraits. This approach assumes that the probability of capturing cells in a given expression state is proportional to the time cells spend in that state. In eco-velo we take a different approach, which does not decompose the gene-wise velocity components in the first place but, similarly to velocyto, relies on strongly simplifying assumptions on the kinetic rate parameters.
First approach: κ-velo. Our first approach recovers the full dynamics of splicing kinetics and addresses the scale invariance problem by using a proxy of travel time between cell states. In this subsection, we drop the gene-wise indices g as we address the scaling factor κ for one gene at a time.
Consider one gene with true parameters of reaction rate θ true = (κα, κβ, κγ). In a first step, we recover an arbitrary solution of the reaction rate parameters with β = 1, i.e θ = (α, 1, γ), and in a second step we recover the κ which scales this solution to its actual magnitude relative to the other genes. Below we elaborate on each of the two steps.
The analytical solutions to Eq 3 are given by: [2] uðtÞ ¼ u 0 expðbðt À t 0 ÞÞ þ að1 À expðÀ bðt À t 0 ÞÞÞ where t 2 (t 1 , . . ., t n ) is the gene specific latent time assigned to each cell and u 0 = u(t 0 ), s 0 = s (t 0 ) are the initial conditions. Transcriptional regulation is inscribed in α, which is set to 0 at downregulation. Cells can then either be in up-or downregulation, as encoded in the parameter k i , with k = 1 at upregulation and k = 0 at downregulation. We set (u 0 , s 0 ) = (0, 0) in the upregulation phase (k = 1) and (u 0 , s 0 ) = (u(t switch ), s(t switch )) in the downregulation phase (k = 0). We note that if t was given as a global (gene independent) latent time assigned to each cell, the scale invariance problem would already be resolved. However, in practice we do not have t.
From the solution u(t), we get expðÀ bðt À t 0 ÞÞ ¼ buðtÞÀ a bu 0 À a . Therefore, Substituting this Eq in s(t), we get the time-independent relation between unspliced and spliced counts. For β = 1 specifically, we get: This is the form of a function s(u) which we can directly fit to the data points in a u-s phase portrait.
In practice, there is one more amendment needed for parameter fitting. As current procedures for assignment of the sequence reads to either unspliced or spliced mRNA are biased towards spliced assignments and heavily underestimate the unspliced counts, a function of the form Eq 6 cannot approximate the data unless we upscale the measured u counts by a (gene-specific) factor m g (see Note A in S1 Appendix). Thus instead of Eq 3 we now have: We note that, through necessity from current data qualities, scVelo also scales u, but scales u to have the same variance as s [2]. In fact scaling in this way is equivalent to setting κγ/κβ � 1 (see Note A in S1 Appendix). We also note that upscaling u by m g is different from separate normalisation as here the counts of that gene are multiplied by the same constant for all cells, whereas a separate normalisation will affect cells differently for the same gene. With m g , Eq 6 becomes: which we use for fitting to the u-s data and inference of the parameters (α, γ, u switch , m) (see Note B in S1 Appendix for the details of our expectation maximisation (EM) procedure).
Once the EM is done, we recover the time scale κ for each gene. Let Δt ij be a measure of time that can be used to relate time between two states i, j across genes, with i before j in time. Consider one gene with true parameters of reaction rate θ = (κα, κβ, κγ) and recovered parameters θ = (α, β, γ) and u i , u j the measured unspliced counts for cells i, j. Note that for that gene, i, j need to be in the same state of transcriptional induction or repression because the speed of genes is only measurable during transcriptional change, i.e. outside of steady-state. If the cells spend time in steady state, the change in transcriptional state will not be proportional to the distance in time, which is why we only consider cells in the same state.
Considering the time scale κ in the solution for u(t) in Eq 5, yields for two measurements from cell i and j: Solving for κΔt ij we get: with β = 1, and m and α inferred from EM. As a proxy for the true Δt ij s (which we do not have) we use the number of cells that occur between the cells i and j calling it d(i, j). The rational being that, in snapshot data, the probability of capturing cells in a specific region of the expression space is proportional to the time cells typically spend in that region. This assumption serves as a valid approximation for most single-cell datasets, but is undermined in presence of non-uniform cell proliferation and death rates as well as biased sampling of cell types (e.g. enrichment for specific cell types). Let us call the RHS of Eq 10, f(i, j). For cell pairs that are in the same transcriptional phase (i.e., the upregulation or the downregulation phase), f(i, j) has a linear relation to d(i, j), with the slope given by κ. However if either (or both) cells are in steady-state, f will be smaller than expected from Eq 10. Thus, plotting f(i, j) versus d(i, j) for random pairs of i, j, produces a parallelogram of which the left slope equals κ. To recover κ, we fit a parallelogram to the data points with minimum area, while maximising the number of points in the parallelogram (Note C in S1 Appendix and S2 Fig).
Here, we inferred κ from the unspliced counts data. One could similarly use the spliced counts data and infer κ from the s(t) solution in Eq 5, which yields κ estimations congruent with those inferred from the unspliced data (see Note D in S1 Appendix and S3 Fig). However, as u(t) depends only on α, β while s(t) depends on α, β, γ, i.e. on one more imputed parameter, we consider recovery of κ values from u(t) as more straightforward and less error-prone.
After determination of the gene-wise κ, we are ready to call the high-dimensional, correctly scaled parameters Θ = (A, B, Γ), with A g = κ g α g , B g = κ g β g and Γ g = κ g γ g . We call the highdimensional unspliced counts scaling parameters m g , M. For calculating the high-dimensional velocity for cell i, we thus useṼ i ¼ BMU i À GS i , where U i and S i respectively represent the Gdimensional u-s counts in cell i (G being the number of genes).
Second approach: Eco-velo. Our second approach eco-velo estimates cell state velocities directly in the high-dimensional gene space by calculating the displacement for each cell in a fixed time interval. This approach eliminates the need for cumbersome and error-prone genewise parameter estimations. It also specifies the time interval over which high-dimensional velocities are reported, a feature that the gene-wise parameter estimation approaches (including κ-velo) are missing. Specification of the velocity estimation time interval can be important for data sets that include multiple non-smooth-dynamics genes where short-term cell velocities can deviate significantly from their long-term velocity directions.
Starting from Eq 3, for the change of the spliced counts of gene g over Δt we can write: By fixing Δt g = 1/γ g (this is the time in which existing spliced reads for gene g will be degraded) we get: This means that knowing β g /γ g is sufficient to estimate the cell state displacements over Δt g . If we further assume all genes have a similar β and γ, we can conclude that the unspliced counts u(t) in a cell are proportional (with a constant factor β/γ) to its spliced counts at the later time point (t + 1/γ).
The assumption of similar γ as well as β across genes, allows us to avoid decomposition of high-dimensional velocities into gene-wise components for velocity estimation and recombining the estimated components again. Thus, leading to another level of simplification that turns out very handy as a heuristic velocity estimation from u and s counts, where we can find cell state displacements by mapping U to S. We do so by searching for the nearest neighbors (NNs) of U in S that are also within the first k nearest neighbors of S in U. We call these pairs mutual nearest neighbors (MNNs). Note that not every point needs to have MNNs. The velocity arrow then goes from a cell's position in S space to the the mean of the first k MNNs of that same cell's U space in S. Here, u and s counts can be used directly for estimating cell state velocity directions without any need for smoothing and parameter fitting.
The strong assumptions of eco-velo (similar γ and β across genes) may not hold for every biological processes and every subset of genes. Thus here, one would ideally select a set of genes that are only transcriptionally regulated (via α), but not post-transcriptionally regulated (involving gene-specific β and γ rates). An example of such a cases seems to occur in Fig 1E of the original RNA velocity paper [1], where the authors observed for bulk RNA-seq measurements of cell cycle genes in the mouse liver over a time course of the circadian cycle, that unspliced mRNAs appear predictive of spliced mRNA at the next time point with a similar signal intensity coefficient. Conditioned on its assumptions, eco-velo (in contrast to the methods based on gene-wise parameter estimation) specifies the time interval of the reported velocities and also skips several error-prone parameter estimation and data smoothing steps. How much the different assumptions of each method are satisfied for different experimental settings, data qualities, as well as the purposes of velocity analysis (e.g. estimating the overall velocity directions or obtaining the average cell state velocities over a specific time scale) would determine which method is more appropriate to use.

Visualisation
La Manno et al. [1] suggested using projection of the end of the velocity vectors (s þṽDt) with Δt = 1 on an embedding of the spliced counts. While projection using principal component analysis (PCA) (Note E in S1 Appendix) is the most accurate low-dimensional representation of cell state velocities, it usually does not capture the full complexity of differentiation manifolds with several subpopulations in high-dimensional gene space. Projection of the velocities onto non-parametric nonlinear embeddings (which do not have gene-defined axes) is more challenging. To work around this difficulty, velocyto projects the velocities in a direction relative to the neighbouring cells. This is done by computing a transition probability matrix P containing probabilities of cell-to-cell transitions in accordance with the velocity vector: with σ the kernel width parameter, rðxÞ ¼ sgnðxÞ ffi ffi ffi ffi ffi jxj p a variancestabilising transformation and corr() the Pearson correlation coefficient. The matrix is rownormalised so that ∑ j P ij = 1. Given n observations and Y i the positions of cell i on a K-dimensional embedding, the projected end of velocity vector for cell i is calculated asỸ i þDY i , where:D To project the velocities, scVelo uses a similar approach to velocyto but with a slightly different P matrix that calculates Pearson correlations (also called cosine similarity) directly on theDs ij andṽ i vectors without using the ρ(x) transformation, via P ij ¼ exp A vector summation as proposed in Eq 13 used in velocyto and scVelo is questionable for three reasons. First, this approach is not faithful to the velocity vectors length, e.g., two velocity vectors with the same direction, but different length (in the same neighbourhood) in the high-dimensional space will be visualised with similar lengths. That is because they will be assigned the same P ij as Pearson correlation does not respect the length of the vectors. Second, kY j À Y i k does not in general provide an orthonormal basis as the direction of several neighbouring cells to cell i can be correlated on the low dimensional embedding. As a result, this approach may change the direction of the velocity vectors depending on how much the orthonormality principle is disturbed for a given neighbourhood. For example, if the chosen neighbourhood extends longer along the differentiation path than its width, velocities will be visualised as more smooth vectors along the path. Third, P ij À 1 n � � can be negative even if the velocity directionṽ i is correlated with the direction of a neighbouring cell j, which is not correct.

Nyström projection (velocity visualisation for κ-velo).
To deal with visualisation of complex data manifolds which require nonlinear embeddings, in κ-velo we propose using the Nyström projection which is more faithful to the actual high-dimensional estimated cell state velocities than the current practices. We use a nonlinear visualisation of the (normalised) spliced counts of the single cells as the start of the velocity vectors and project the end points of the velocity vectors onto this existing embedding using the Nyström method. Nyström projection has also been used for other single cell data integration applications e.g. in [13,14]. The nonlinear embedding choice is arbitrary and can be diffusion maps [15], t-distributed stochastic neighbor embedding (t-SNE) [16] and uniform manifold approximation and projection (UMAP) [17].
If a K dimensional embedding Y train has been created for n train data points X train and we want to project a set of n test points X test on the existing map, we first compute a transition probability matrix between the new and old data points, P 0 of size [n test , n train ] calculated as: Note that when the test data is exactly the same as the training set X train = X test , P 0 would (ideally) be the same transition matrix as the one used for generation of the train set embedding (ideally one would use the same parameters σ i as used in construction of the transition matrix for generating the train set embedding. See Note F in S1 Appendix for the spacial case of projection on Diffusion maps). The projection of new points Y test is then given by: In our application, n train equals n test as each velocity vector has a start as well as an end point. For cell i Eq 15 implies: This looks to some extent similar in form to the previously used velocity projection methods in Eq 13. However, one major difference being that we are calculating the end of the velocity arrow on the embedding space rather than the displacements, hence avoiding the collapse of velocity vectors with different lengths onto the same visualised length. Another advantage is that here Y train (j, k) more likely presents an orthonormal basis considering all data points, hence less affected by the shape of neighbourhoods arbitrarily chosen independent from the generation of the reference embedding. For some embedding methods (generally those which perform an analytical embedding optimisation, in contrast to the methods using iterative optimisation techniques such as gradient descent) like diffusion maps, the embedding Y train (j, k) is indeed guaranteed to be orthonormal (i.e., . Lastly, all terms in P 0 are positive, making the projected point a weighted average of the data points in the train set.
Note that the Nyström theorem is only valid for projection of test data points which are close enough to the data points existing in the training set. That is, extrapolation for test data to expression regions which have not been sampled in the training set is not possible. In κ-velo, we ensure closeness of the end point of the velocity vector to the existing data manifold of spliced counts by adequately down scaling all inferred high-dimensional velocities by the same factor.
In light of the above, linear projection, e.g. by PCA (Note E in S1 Appendix) although not capable to capture the complexity of several datasets which consist of multiple branching events and subpopulations, remains the only approach in which the visualised arrows are a true representation of the high-dimensional velocity vectors. None of the non-parametric nonlinear projection approaches can deal with projection of out of distribution data points, implying that near the boundaries of the differentiation paths, where actual velocities may point to directions going out of the existing manifold of the start point of velocity vectors, velocity visualisations will be less reliable. Moreover, embedding methods which may not keep the continuity of the data manifold (e.g. t-SNE and UMAP) are more prone to the artefacts of out of distribution data points projection.
Even though our non-linear projection method does not explicitly depend on the dimension of the train and test data sets, we recommend to use the same gene space for projecting the velocities (i.e., for computing of P 0 ½n test �n train � ) as the gene space that was used for generating the trained embedding, i.e., we use the spliced counts matrix of the filtered gene set S as the training data in κ-velo. This ensures that the embedding only represents a space that can be spanned by following the velocity directions, thus making a closed set of the embedding under addition by velocity vectors. Therefore, we calculate the embedding on the same space used for parameter recovery and velocities' estimation. This also means that if we use imputed counts for parameter recovery, we calculate the low-dimensional embedding on those imputed counts.
Visualisation for eco-velo. For eco-velo, visualisation of velocities is integrated within the inference of the velocities and hence does not require visualisation by projection. We identify the first k mutual nearest neighbours (MNNs) [18] of U and S for every cell, which we use to visualise the velocities on a low-dimensional embedding of the spliced counts. We simply draw an arrow starting from the position of a cell on the embedding to the mean of the coordinates of its first k MNNs on the same embedding. That means that our velocity arrows point from s i to P k j s j =k, where these s j are the first k MNNs of u i for cell i. These arrows corresponding to a relatively large Δt in which all current spliced counts in the cell would be degraded. For ease of visualisation and obtaining an un-cramped map without intersecting cell velocities, we then scale all velocities by the same factor so that the arrows only point in the direction of the point and not all the way to the future state.

Processing
Before calculating the velocities, single-cell RNAseq datasets are preprocessed (aligning the reads and counting numbers of unspliced and spliced reads) and subsequently processed (filtering, normalisation, etc.). Both the κ-velo and the eco-velo workflows start with processing raw U and S count matrices. Since the methods are based on different assumptions, the processing steps differ per method. Below, we will describe the processing protocol for both approaches.
Processing pipeline of κ-velo. To reduce the number of dimensions of the dataset, we select only genes with high variability. Variability is calculated on the spliced counts using analytic Pearson residuals [19]. We then filter genes with extremely low u or s counts because we want to focus only on genes with significant velocity signal. After gene filtering, the counts in each cell are size-normalised. Since the size of a cell is represented by its u and s counts together, the counts in each cell are normalised using the sum of the counts for u and s. To recover the dynamics, the noise in the u and s counts has to be reduced. As such, all counts are imputed by averaging the counts of each cell's nearest neighbours. The nearest neighbours for each cell are found in PCA space calculated on scaled s counts. For a more detailed description of each step see Note G in S1 Appendix and S4 Fig. Processing pipeline of eco-velo. Similar to κ-velo processing, the eco-velo workflow starts by filtering the dataset for genes with high variability and sufficient u and s counts. After this, all non-zero counts are log-transformed and both count matrices are normalised separately. Here, we deviate from the κ-velo protocol, because u and s counts are treated as separate modalities. Following standard MNN protocols [18], the counts are L2 normalised.

Overview of the workflow for κ-velo and eco-velo
Both the κ-velo and the eco-velo workflows consist of three main steps: processing, velocity calculation and visualisation (Fig 1). First the data is processed as described in Section

PLOS COMPUTATIONAL BIOLOGY
Cell state velocities "Processing". In κ-velo, after processing, we recover the scaled parameters ακ, βκ and γκ for all genes in the dataset. For downstream velocity analysis, only genes with a likelihood above a certain threshold are used. All other genes are filtered out to reduce the technical noise caused by poorly recovered or noisy genes. Additionally, the user is provided with an option to remove genes where the order of clusters in the recovered dynamics do not match the known hierarchy of the cell types (e.g. when an assigned upregulation starts at the the most differentiated cells and ends in the progenitor population). Using the scaled parameters, a high-dimensional velocity vector is calculated for each cell. To visualise the cells and velocities, we compute an embedding (e.g. PCA, UMAP) using the processed (i.e. filtered, normalised and imputed) and scaled s counts. Lastly, the velocities are projected onto the embedding.
The eco-velo workflow includes fewer steps. After processing, the u counts are used to find the first five mutual nearest neighbours of each cell in S space. The embedding is calculated using processed (i.e. filtered and normalised) s counts and velocities are projected onto the embedding by averaging the position of the cell's first five mutual nearest neighbours.

Simulation data
For the simulation, we randomly sampled g log-normally distributed parameters of reaction rates, scaled by a scaling factor κ: θ = (κα, κβ, κγ). The true time of the n observations is sampled from a uniform distribution. The time points are such that the final mature steady cell state, for which all genes would reach steady-state, has not been sampled. The u and s counts are simulated following u(t), s(t) with added random normal noise (Note H in S1 Appendix). We simulate the data such that the time of activation of each gene's transcription is inversely proportional to the gene's speed. This means that the fastest genes are only active towards the end of the differentiation trajectory. The resulting differentiation trajectory has high velocity variation at the beginning when most genes are not yet committed to change and more deterministic dynamics with higher speed at the end of the trajectory. The motivation for this simulation scenario is to include regions with both high-and low variance velocities and to have velocities for some cells pointing to future states outside of the space observed in the original set. See Note H in S1 Appendix for a more detailed description.

Real data
We demonstrate the performance of κ-velo and eco-velo on four different datasets and compare them with the state-of-the-art scVelo. The first dataset is a subset of the pancreatic endocrinogenesis dataset [20]. The second is a subset of the murine gastrulation dataset [21]. Both datasets were obtained using the 10x genomics platform. The third dataset consists of mouse Schwann cell precursors (SCPs) differentiating into chromaffin cells, obtained using SMART-seq2 [22]. For these three datasets, our RNA velocity analysis starts from the U and S count matrices, which were originally analysed in [2] (pancreatic endocrinogenesis), [6] (murine gastrulation) and [1] (chromaffin cells). Lastly, we also analyse a dataset of murine hematopoiesis [23], obtained using the 10x genomics platform. We used velocyto's sequence alignment and u-s counting pipeline to get the U and S count matrices, as this dataset has not been analysed for RNA velocity before. For all four datasets, we ran the complete κ-velo and eco-velo workflow as described in Section "Overview of the workflow for κ-velo and eco-velo". See Note I in S1 Appendix for further details of parameter and threshold settings for each dataset.

Results
In this section, we first demonstrate the artefacts of scVelo's velocity projection on simulation data with known cell state velocities (i.e. no velocity inference step involved) and compare scVelo to visualisation with linear and nonlinear projection methods. We then compare our velocities with the velocities returned by scVelo on simulation. Afterwards, we show computational experiments on real data which support the design of the processing steps we propose and use in this manuscript. In the last section we apply κ-velo and eco-velo on real datasets: first a pancreas endocrinogenesis dataset and then a hematopoiesis dataset. To validate the method on different sequencing technologies, we also applied it to a dataset of Schwann cell precursors (SCPs) differentiating into chromaffin cells (κ-velo: S5A and S5B Fig and eco-velo:  S5C Fig).

PCA and Nyström projection faithfully represent the high-dimensional velocity vectors
Ideally, a visualisation of cell state velocities should faithfully represent all aspects of the highdimensional vector. The visualisation should respect the direction of velocity vectors as well as their magnitude (speed of change). This can be particularly difficult if the new states are in gene space not yet observed in the original set, e.g. the velocities point further than existing points. The embedding should also preserve local variations, representing fluctuations of the dynamics and cell plasticities. To assess these points, we compare existing RNA velocity visualisation methods with ours, on simulated data where the true high-dimensional velocities are known and do not need to be inferred. We design a simulation to assess all these aspects of the projection. In that simulation the cells follow a hidden true time with a high variance at the beginning and faster transitions towards the end of the trajectory (Section "Simulation data"). The final stable cell state is not yet reached in our simulation, and the velocities of the latest cells point towards not yet observed future states. Projection of the velocities on a PCA embedding (Fig 2A) reliably represents all these aspects. scVelo's velocity projection on the same PCA embedding (Fig 2B) smooths over the biologically interesting variation and removes the information on speed of change (i.e, disproportionately changes the length of the velocity vectors). The Nyström projection method (Fig 2C) captures the expected cell to cell variation, as well as the direction and length of the simulated velocities on PCA (Fig 2D). The velocities are also well represented when projected on a non-linear embedding such as t-SNE (Fig 2E and S6A  and S6B Fig, UMAP shown in S6C and S6D Fig). t-SNE tends to map regions of higher density, e.g. of slower velocities, in gene space to a larger space in the embedding as highlighted by the cells outlined in blue and red. Consequently, the velocity arrows are also visualised in a scale proportional to the distance of cells in a given region on the embedding, hence looking longer than their true length in gene space. On embeddings that do not distort cell to cell distances in the gene space such as PCA or diffusion maps with a constant kernel width, the length of the velocity arrows are well represented by Nyström projection (S7 Fig diffusion map and Fig 2D  PCA). We note that, unlike PCA projection, neither scVelo's nor Nyström projection are able to project end of velocity arrows that are out of distribution of existing data points.

κ-velo recovers simulated velocities
To ensure that the high-dimensional velocity vector points in the right direction we need to address the scale invariance of gene-wise velocity components (as discussed in Section "Dynamical inference", Fig 3A). We introduce κ-velo, a method that recovers the full transcriptional dynamics from s as a function of u and thus does not need to fit a hidden latent time to the cells. The method then uses the cell densities as a proxy of time spent in a specific region of the expression space ( Fig 3B) to relate velocities across genes and solve the scale invariance issue. To validate our method we simulate reaction kinetics following randomly sampled parameters scaled by a factor κ varied between 1 and 15. The method recovers the scaling factors (Fig 3C). In fact, using cell densities to infer the scaling factors is equivalent to using true time for a given differentiation branch (S8 Fig). Note that the recovery becomes more difficult for higher κ. Very fast genes have few or no cells in transient state so in those cases we would need to sample more cells to reliably recover κ. We note that the scale of recovered κ and true κ is still off by a constant factor related to the chosen Δt, but if all components are scaled by the same factor, the direction of the high dimensional vector is still correct. After scaling, the high dimensional κ-velo velocity vector is much closer to truth (Fig 3D and S9  Fig), than scVelo's velocity vector. In fact, the errors in the scVelo vectors are proportional to the relative scale of the genes (Fig 3D). Because the high-dimensional vector is not directly conceivable to the human mind, low-dimensional representations of the velocities are usually used for interpretation of the result. We also compare the vectors after projection on a PCA embedding (S10 Fig) and find that they are also much closer to truth, both for direction and length (Fig 3E and S11 Fig). Here, for both κ-velo and scVelo, we find the biggest errors in regions of lowest and highest velocities, but scVelo's errors are much higher than κ-velo's.

Careful processing prevents introduction of artefacts
To illustrate the importance of processing, we apply our processing pipeline to a dataset of erythroid development during murine gastrulation. Previously, it has been shown that scVelo falsely predicts de-differentiation at the end of erythroid development. This has been attributed to the contribution of genes with multiple rate kinetics (MURK genes) to the velocity calculation [6]. In our processing pipeline, we not only filter for low variability genes, but also remove genes with insufficient u and s counts. After normalisation, the counts are imputed by averaging spliced or unspliced counts across neighbouring cells, thereby smoothing the data. This usually produces unreliable results for genes with only few u or s counts (S9 Fig). After filtering, in scVelo's processing pipeline, the count matrices are normalised separately. This separate normalisation introduces artefacts in the u-s phase portrait (S13A and S13B Fig), which can be traced back to variation in the ratio between total unspliced and total spliced counts between cell types. We found that some of the patterns identifying MURK genes were artefacts of this normalisation (S13A and S13B Fig). Furthermore, many MURK genes in the original publication were imputed from very low counts and are filtered out in our pipeline.  θ). B. We propose to use cell densities as a proxy of time. For a same time interval, the displacement in u will be proportional to a gene's speed. This allows us to relate velocities across genes and solve the scale invariance problem. C. To validate κ-velo, we simulate splicing kinetics scaled by a scaling factor κ and evaluate how well the factors are recovered. D. We compare the κ-velo and scVelo velocities to the true velocities for two genes with different speeds. The high-dimensional velocity vectors are normalised to have equal variance for ease of comparison. E. The high-dimensional vector is projected on the first two principal components to evaluate differences between true velocities and recovered velocities. We return the change in direction (cosine similarity) and length (difference in vector norm) (Note J in S1 Appendix) for κ-velo and scVelo. To make the length comparable, the vectors are variance-normalised. Note the log-scale for frequency. https://doi.org/10.1371/journal.pcbi.1010031.g003 Comparing the original processing pipeline to our processing steps, we reduce the number of MURK genes from 98 to 18 (S13C Fig), correcting most of the false de-differentation.
After recovery of the parameters, we remove low-likelihood genes where the learned parameters do not fit the u-s phase portrait well. This prevents us from including the (usually noisy) genes for which the recovered parameters could be incorrect (S4 Fig: step 5). The calculated velocities for those genes would therefore not accurately reflect true dynamics. Even after filtering of low-likelihood genes, we still find genes where the recovered dynamics do not match the known order of cell types. For example, early upregulation or late downregulation can often not be easily differentiated based on the u-s phase portrait alone (S4 Fig: step 6). This could ultimately lead to incorrect velocity assignments. To avoid this issue, we can use prior information about the temporal order of cell types to perform one more round of filtering if that information is given (see Note G in S1 Appendix: step 6). We use this information to exclude genes where the fitted state assignments of up-or downregulation do not fit the expected state assignments. After both filtering steps, we calculate the low-dimensional embedding on the reduced gene set, so that the embedding only represents space that can be reached by velocities.

κ-velo explains cell state plasticities and speed of transcriptional change in pancreas endocrinogenesis
To test whether κ-velo's velocity estimations better capture the different time scales of genes, we apply our method to a dataset of developing mouse pancreas cells sampled at embryonic day 15.5 [20]. The endocrine progenitor cells differentiate into four main fates: alpha, beta, delta and epsilon cells. In previous work, scVelo delineated cycling progenitors and the endocrine cell differentiation.
After processing, we recover the reaction rate parameters fitted by scVelo and κ-velo. True splicing rates are difficult to determine and different ranges have been reported [24] but none come close to the more than 10000-fold range reported by scVelo (Fig 4A and S14 Fig). We report a range of splicing rates close to 30-fold (Fig 4A), which is more in line with the reported ranges. After scaling, we can distinguish fast and slow genes based on their κβ. Among the fast genes, we find genes associated with the cell cycle such as Adk, while slow genes are constantly up-or downregulated during the whole differentiation trajectory ( Fig  4B). This is consistent with prior expectation as the cell cycle in developing mouse pancreas takes less than a day [25], while pancreatic endocrine cell differentiation starts at embryonic day 9 and goes until day 15.5 in the analysed sample. We also find fast genes that are upregulated during commitment to a cell fate at the end of the differentiation trajectory, such as Gcg and Nnat. We note that when filtering genes based on prior knowledge of the expected order of cell types, we also filter many cycling genes that tend to have high variance, and thus partially incorrect state assignments.
We display the high dimensional vector field in a UMAP embedding of the data and compare the κ-velo velocities (Fig 4C) to scVelo velocities (Fig 4D), both projected with Nyström projection to compare only the velocity vectors (S15 Fig show projections of the velocities on a PCA embedding, S16 Fig shows smoothed velocities on the UMAP embedding). The κ-velo velocities better capture the differences in speed along the trajectory, as well as the progression within the four terminal states. scVelo's embedding (Fig 4E) smooths over the velocities, returning a view that partially appears more consistent with the expected direction of differentiation but not with the actual noisy velocity vectors. Comparing the projected velocities of the full scVelo pipeline (Fig 4E) to κ-velo pipeline (Fig 4C), we see that the methods most strongly disagree in the high-plasticity ductal population (Fig 4F and 4G and S17 Fig). There is also a strong disagreement in the delta cells, which scVelo predicts to differentiate into the alpha cells, as well as in the alpha cells themselves that are predicted to have very small velocities all along the branch. Looking at single genes u-s phase portrait such as the Gcg gene (Fig 4B), we see that the cells are still differentiating and the full alpha branch has not reached the terminal state yet.

κ-velo recovers multiple differentiation paths in hematopoietic system
RNA velocity analysis of single-cell datasets of differentiation of hematopoietic stem cells into different blood progenitor cells has proved difficult in the past [6,7], and often the predicted velocities display a direction reversal. This reversal was attributed to genes with more complex kinetics leading to u-s phase portraits that do not have the shape expected from the current RNA velocity model. To investigate the potential of κ-velo on more complex datasets, we applied the method to a dataset of murine hematapoietic stem and progenitor cells (HSPCs) [23]. The HSPCs in this dataset were acquired by sorting bone marrow cells using a broad Lin neg c-Kit + (LK) gating strategy. Additionally, the datasets has been enriched for long-term hematopoietic stem cells (HSCs), which are usually less abundant than other populations. HSCs, which have a high multipotent potential (as indicated by the stemness score, Fig 5A) are at the beginning of the differentiation trajectory, and give rise to all mature blood cells [26]. In this dataset, these final states of mature blood cells are not yet reached since only HSPCs were included. Using a curated set of cell type gene markers, we identify the HSCs and progenitor populations, matching the original annotations (S18 Fig) [23].
The κ-velo pipeline correctly recovers the overall differentiation paths from the HSCs to various progenitor populations, such as to the myeloid and megakaryocyte progenitors ( Fig  5B), while still capturing cell specific velocity variations (S19 Fig shows smoothed velocities on the embedding). The velocities show higher plasticity in the regions with higher stemness score and more commitment towards the ends of the differentiation branches. On the same dataset, scVelo recovers velocities in the exact opposite directions with velocities pointing from the more differentiated progenitor cells towards the HSCs (Fig 5C). We also identify fast genes, such as Fcnb and Ermap (Fig 5D and 5E), which are known to be involved in the commitment to the myeloid lineage and erythroid lineage respectively [27,28]. Pum2 is identified as a slow gene because its downregulation takes place over the full span from stem cell to progenitor (Fig 5F). This gene is known to suppress differentiation in HSCs [29].

Eco-velo approximates cell state velocities using minimal data processing and computation
As a heuristic method that does not require cumbersome recovery of the rate parameters, we apply eco-velo on some of the introduced data sets. By simply taking the unspliced counts as a proxy of a cell's future state (Fig 6A), we can skip a few gene set filtering steps, imputation and parameter fitting, all of which are computationally expensive and can kill some of the true signal variability. We validate the model on a simulated dataset (Fig 6B and S20 Fig), where the model recovers the expected flow. We then test eco-velo on the pancreas endocrinogenesis dataset and the hematopoiesis dataset (pancreas endocrinogenesis: Fig 6C and S21 Fig for  smoothed velocities, hematopoiesis dataset: S22 Fig). Since the method is based on the assumption that genes have the same splicing and degradation rates, and we know that cell cycle genes have different rates in the pancreas endocrinogenesis dataset, we exclude them from this analysis. The model delineates the directional flow from progenitor cells to alpha and beta cell fates. eco-velo also captures the high cell plasticities in the ductal population seen in Fig 4C. The final state of epsilon cells is also captured (S21 Fig smoothed) but the dynamics within the delta cells cannot be resolved. For delta and epsilon cells the issues could arise from trying to capture future states within sparse populations that are transcriptionally close to the more abundant population of alpha cells. A quantitative comparison of the projected velocities from eco-velo and κ-velo is shown in S23 Fig, where we see a strong similarity in the Ngn3 low endocrine progenitor, but more variation between the methods in the cycling ductal population as well as in the terminal states. Given the strong theoretical assumptions of the model, eco-velo still captures the complex lineages of endocrinogenesis remarkably well. For the hematopoiesis dataset however, eco-velo is unable to capture the dynamics correctly (S22 Fig). Similarly to scVelo, the velocities are falsely projected back to the most stem-like state, hinting that the more basic assumptions about the splicing dynamics in eco-velo may not hold up for this particular biological process.

Computational efficiency of the methods
We report the runtime on an Intel Core i5 CPU with 2GHz, 4 Cores and 16 GB of RAM. On the pancreatic endocrinogensis dataset with 3696 cells and top 5000 highly variable genes, the κ-velo workflow takes 15 minutes while the eco-velo worflow takes about 40 seconds. Full scVelo pipeline on the same dataset takes about 8 minutes.

Data and software availability
All analysed datasets are publicly available. The pancreatic endocrinogenesis dataset is available from the Gene Expression Omnibus (GEO) under accession GSE132188 [20]. The murine gastrulation dataset is available on the Arrayexpress database (http://www.ebi.ac.uk/ arrayexpress) under accession number E-MTAB-6967 [21]. For both datasets the count matrices can be downloaded directly from the scVelo Python implementation (https://scvelo.org) v0.2.4. The raw data from the chromaffin dataset is available on GEO under accession number GSE99933 [22]. The count matrices are made available by [1] at http://velocyto.org. The count matrices of the HSPC dataset are available on our GitHub Page: https://github.com/Haghver diLab/velocity_notebooks. This GitHub page also contains all notebooks necessary to reproduce the results reported in this paper. A python implementation of the κ-velo and eco-velo pipeline can be found at https://github.com/HaghverdiLab/velocity_package.

Discussion
In this manuscript, we study some of the current challenges in the inference of cell state velocities from scRNA-seq data and suggest novel approaches for tackling these problems. We argue that one of the interests in obtaining single cell velocities is to quantify the variation of dynamics among individual cells. This variance in single cell velocities can inform us about fluctuations of the dynamics, cell state plasticities and heterogeneity. We demonstrate that the processing procedure, several data smoothing steps and the visualisation approach in existing methods kill such biologically meaningful variance. The resulting information is closer to knowledge we could get from pseudotemporal ordering of cells than the true single cells velocity directions; one gets good looking cell velocity maps (i.e. conforming the expected pseudotime directions) that do not reflect the reality of the information contained in the u-s mRNA data.
For applications in which obtaining the average cell state velocities over the specific time scale of mRNAs degradation is desired, we propose the eco-velo approach. It eliminates multiple cumbersome and error-prone steps, such as the gene-wise parameter estimation and visualisation of high-dimensional velocities.
For more detailed velocity analyses, we designed the κ-velo approach. The method recovers the full dynamics of splicing kinetics and addresses the relative scaling of velocity components across genes. We also design a consistent processing pipeline and suggest a new visualisation approach. We demonstrate how our model achieves better estimation of velocities than current methods on simulation. On real data, our method returns more plausible ranges of splicing rates and velocity magnitudes in several differentiation regions. κ-velo's velocity components' scaling is based on the assumption that cell densities can be used as a proxy of typical travel time between two cell states. Heterogeneous cell birth and death rates along the differentiation path could partly disturb this assumption. To further improve this model, one could therefore consider estimating the heterogeneous cell birth and death rates based on the activity of apoptotic and proliferation genes [30]. Our results on simulation data (S8 Fig) demonstrate that the true global time of cells also resolves the scale-invariance issue. This indicates that other proxies of the true global time, e.g. cell density-scaled pseudotime, may also be used for inferring the relative scaling of velocity components among the genes in future work.
As described in Section "Careful processing prevents introduction of artefacts", we find that there can be difficulty in fitting reaction rate parameters for genes that do not display clear kinetic patterns of up-or downregulation on the u-s phase portrait. In the current version of κ-velo, we filter out genes where the fitted state assignments do not match the known pseudotemporal order of cell types. In future work, we could use this prior information as initialisation in the parameter fitting procedure. The recovered high-dimensional velocity vectors now contain the deterministic part, but also capture the stochasticity of the dynamics. This can be used to perform several downstream analyses and answer questions about cell's progress through the dynamical process.
In the past, recovery of cell specific global latent time has been done after velocity analysis [2]. The recovery of a cell's global time was based on a heuristic integration of time assignment from individual genes. However, the gene-wise assignment of latent time are error-prone and additionally do not take into account the time that genes spend in steady-state. Integrating these errors does not necessarily mean that they cancel out. Because of these two reasons, recovery of global latent time should be done more carefully in follow up studies with strategies similar to CellRank [5], where several sequential cell state transitions are chained together to construct long transition paths along the differentiation manifold. Alternatively, estimation of global latent time may be integrated in the expectation maximisation procedure, similar to the approach in a recent preprint [11].
We also raise awareness about the time scales for which average velocities are being estimated. It would be interesting to measure velocities at multiple time scales to get an overview on the "plans" individual cells have in preparation for their short-or long-term developmental journey. One way of studying the changes that cells undergo at different time scales would be by inferring velocities from different sets of genes related to these time scales. For example investigating velocities on the time scale of the cell cycle or of the entire differentiation process. This also supports growing interest for inferring cell state velocities from other pairs of singlecell data modalities, e.g. mRNA coupled with protein levels [3], as they correspond to different time scales of gene regulation. Furthermore, inferring cell state velocities from modalities in which measurements are more accurate (in comparison to the uncertainty in quantification of unspliced-spliced mRNA counts) can enhance our ability to understand the biological variation in cell state velocities rather than variations due to measurement noise.
Estimation of cell state velocities in presence of multiple time point measurements or multiple batches of data collection is another important problem. However, the solution is not trivial as existing batch effect correction methods can distort the proportions between the s and u counts from separate batches. One possible strategy can be to estimate the velocities within each batch separately and visualise and project the estimated velocities on a shared embedding of all batches. Investigation of different approaches and possibilities remain open.
To conclude, we suggest that a comprehensive grasp of what we are actually estimating and visualising as cell state velocities is crucial for obtaining a full description of cell differentiation dynamics. True cell state velocities encompass both stochastic and deterministic parts of the biological dynamics. This information can be complementary to attempts for describing cell differentiation as a full diffusion process [1,5,12,15,[31][32][33] which contains the three terms of deterministic, stochastic and cell birth and death rates. Reliable quantification of cell state velocities in different transcriptional regions can put the relative magnitude (i.e. coefficients) of these terms into perspective in relation with one another.
Supporting information S1 Table of contents. Table of  In the middle, a schematic representation of how the spliced and unspliced matrices change during each step is shown. A size reduction of the coloured area indicates a filtering step where the number of genes are reduced. A change in colour represent a data manipulation, which does not changes the number of cells or genes, but changes the values in the matrix. On the left, some extra information is provided for some of the processing steps. More detailed information can be read in Note G in S1 Appendix. On the right, the u-s phase portraits of several example genes are shown to demonstrate how the different steps change the phase portraits, as well as which kind of genes are selected or removed in the filtering steps. Each of the genes is selected from the pancreas endocrinogenesis dataset that is analysed in main In the original paper, the purple cluster was identified as symphatoblasts and the yellow and red cluster as "bridge" cells [22]. Here, we show how scVelo would recover the dynamics if these genes were not filtered out. (TIFF) S13 Fig. Applying κ-velo processing pipeline on erythroid lineage dataset. The scRNA-seq dataset on the erythroid lineage of mouse gastrulation [21] has been described in the context of RNA velocity by Barile et al. [6]. Here, we show that the subset has a varying ratio of total unspliced to total spliced counts in different cell types (A). This results in artefacts when using the standard scVelo processing pipeline (U and S normalised separately) (B, second row). Those artefacts are mostly resolved by normalising U and S combined (B, third row), which is part of the κ-velo processing workflow (B, last row). Using the κ-velo processing workflow fixes some of the reported de-differentiation (C).