beachmat: a Bioconductor C++ API for accessing single-cell genomics data from a variety of R matrix types

Recent advances in single-cell RNA sequencing have dramatically increased the number of cells that can be profiled in a single experiment. This provides unparalleled resolution to study cellular heterogeneity within biological processes such as differentiation. However, the explosion of data that are generated from such experiments poses a challenge to the existing computational infrastructure for statistical data analysis. In particular, large matrices holding expression values for each gene in each cell require sparse or file-backed representations for manipulation with the popular R programming language. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with simple, sparse and HDF5-backed matrices, amongst others. We perform simulations to examine the performance of beachmat on each matrix representation, and we demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large single-cell data set.

analyses. It is also extensible through the installation of optional packages, often 23 contributed by the research community, which provide implementations of bespoke 24 methods targeted to specific scientific problems. In particular, the Bioconductor 25 project [6] supports a number of packages for biological data analysis, many of which 26 focus on the processing of genomics data [9]. Packages are usually written in R but can 27 also include compiled code (e.g., in C/C++ or Fortran), which is beneficial for 28 computationally intensive tasks where high performance is required. For C++ code, 29 this process is facilitated by the Rcpp package [4], which simplifies the integration of 30 package code with the R application programming interface (API).

31
In its simplest form, a scRNA-seq data set consists of a count matrix where each 32 column is a cell, each row is a gene, and the value of each matrix entry is set to the 33 quantified expression (e.g., number of mapped reads, transcripts-per-million) for that 34 gene in that cell. This can be most directly represented in R as a simple matrix, where 35 each entry is explicitly stored in memory. Alternatively, it can be represented as a 36 sparse matrix using classes from the Matrix package [1], which saves memory by only 37 storing non-zero entries. This exploits the fact that scRNA-seq protocols have low 38 capture efficiencies [7] -RNA molecules are present in cells but are not 39 reverse-transcribed to cDNA for sequencing, resulting in a preponderance of zeroes in 40 the final count matrix. Droplet-based protocols also exhibit very low sequencing depth 41 per cell, further increasing the sparsity of the data. Another option is to use file-backed 42 representations such as those in the bigmemory [10] or HDF5Array packages, where the 43 data set is stored on disk and parts of it are extracted into memory upon request. In 44 each case, methods are provided in R for common operations such as subsetting, 45 transposition and arithmetic, such that code written by users (or other developers) can 46 be agnostic to the exact representation of the matrix. This simplifies the development 47 process and improves interoperability. 48 Unfortunately, for compiled code written in statically typed languages like C++, the 49 details of the matrix representation must be known during compilation. This makes it 50 difficult to write a single, general piece of code that can be applied to many different 51 representations. Writing multiple versions for each representation is difficult and 52 unsustainable when more representations become available. The alternative is to 53 perform all processing in R to exploit the availability of common methods. However, 54 this is an unappealing option for high-performance code. For scRNA-seq data stored in 55 matrices, consider the most common access pattern, i.e., looping across all cells or genes 56 and performing operations on the cell-or gene-specific expression profiles. If this was 57 performed in R, the code within the loop would need to be re-interpreted at each of 58 thousands or millions of iterations. This increases the computational time required to 59 perform analyses, which is inconvenient for small scripts, undesirable for interactive 60 analyses and unacceptable for large simulation studies. It would clearly be preferable to 61 implement critical functions (including loops) in compiled code wherever possible. 62 Here, we describe a C++ API named beachmat (using Bioconductor to handle Each 63 Matrix Type), which enables access to R matrix data in a manner that is agnostic to 64 the exact matrix representation. This allows developers to implement computationally 65 intensive algorithms in C++ that can be immediately applied to a wide range of R 66 matrix classes, including simple matrices, sparse matrices from the Matrix package, and 67 2/15 HDF5-backed matrices from the HDF5Array package. Using simulated and real 68 scRNA-seq data, we assess the performance of beachmat for data access from each 69 matrix representation. We show that each representation has specific strengths and 70 weaknesses, with a clear memory-speed trade-off that motivates the use of different 71 representations in different settings. We also demonstrate how beachmat can be used by 72 other Bioconductor packages to empower the analysis of a very large scRNA-seq data 73 set. By operating synergistically with existing Bioconductor infrastructure, beachmat 74 extends R's capabilities for analyzing scRNA-seq and other large matrix data.

75
Description of the beachmat API

76
Overview of the API 77 The beachmat API uses C++ classes to provide a common interface for data access from 78 R matrix representations. We define a base class that implements common methods for 79 all matrix representations. Each specific representation is associated with a derived C++ 80 class that provides customized implementations of the access methods. The intention is 81 for a user to pass in an R matrix of any type, in the form of an RObject instance from 82 the Rcpp API ( Figure 1). A function is then called to produce its C++ equivalent, 83 returning a pointer to the base class. This pointer is the same regardless of the R 84 representation and can be used in downstream code to achieve run-time polymorphism. 85 While the API is agnostic to the matrix representation, it still needs to know the 86 type of data that are stored in the matrix. We use C++ templating to recycle the code 87 to define specific classes for common data types, i.e., logical, integer, double-precision 88 floating point or character strings. The same methods are available for all classes of 89 each data type, improving their ease of use for developers. Briefly, when access to a 90 specific row or column (or a slice thereof) is requested, the API will fill a Rcpp-style

91
Vector object with corresponding data values from the matrix. A request for a specific 92 entry of the matrix will directly return the corresponding data value.

93
In the following text, we discuss some of the specifics of the beachmat API 94 implementation. This includes the details of each matrix representation, its memory 95 footprint and the computational time required for data access.

96
Performance with simple matrices 97 By default, R stores matrices as one-dimensional arrays of length N r N c , where N r and 98 N c are the number of rows and columns, respectively. This is done in column-major 99 format, i.e., the matrix entry (x, y) corresponds to array element x + N r y (assuming 100 zero-based indexing). We refer to this format as a "simple matrix". The simple matrix 101 is easy to manipulate and the time required for data access is linear with respect to the 102 number of rows/columns (Supplementary Figure 1). However, its memory footprint is 103 directly proportional to its length. For example, a double-precision matrix containing 104 data for 10000 genes in each of one million cells would require 80 GB of RAM to store 105 in memory. This is currently not possible for most workstations, instead requiring 106 dedicated high-performance computing resources. Even smaller matrices will cause 107 problems on systems with limited memory due to R's copy-on-write semantics. Thus, 108 the utility of simple matrices is limited to relatively small scRNA-seq data sets.

109
We compare the access speed of the beachmat API to that of a reference   Figure 1. Schematic of the beachmat workflow. Various matrix representations at the R level are passed as RObject instances to a C++ function. beachmat identifies the specific representation, constructs an instance of the appropriate C++ derived class, and returns a pointer to base class. (In this case, a numeric matrix pointer is returned for input matrices holding double-precision data.) This pointer can then be used in user-level code in a manner that is agnostic to the details of the original representation. data into a Vector by default. The reference implementation avoids the overhead of 115 creating a new copy by simply iterating across the original data. Our use of copying is 116 deliberate as it ensures that the API is consistent across matrix representations -for 117 example, file-based representations must copy the data to a new location in memory.

118
Copying is also required for operations that involve transformations and/or re-ordering 119 of data, as well as for libraries such as LAPACK that accept a pointer to a contiguous 120 block of memory. Nonetheless, for read-only access to column data in a simple matrix, 121 developers can direct beachmat to return an iterator directly to the start of the column. 122 This avoids making a copy and allows for some optimization in certain scenarios.

123
Performance with sparse matrices 124 The dgCMatrix class from the Matrix package stores sparse matrix data in compressed 125 sparse column-orientated (CSC) format (Supplementary Figure 2). Consider that every 126 non-zero entry in this matrix is characterized by a triplet: row x, column y and value v. 127 To convert this into the CSC format, entries are sorted in order of increasing x + N r y. 128 All entries with the same value of y are now grouped together in the ordered sequence. 129 We refer to each column-based group as G y , the entries of which are sorted internally in 130 order of increasing x. The representation is further compressed by discarding y from 131 each triplet. All entries from the same column are at consecutive locations of the 132 ordered sequence, so only the start position of G y on the sequence needs to be stored for 133 each column. (The end position of one column is simply the start position of the next 134 column.) This reduces memory usage to s I N c + (s I + s v )N =0 where s I is the size of an 135 integer, s v is the size of a single data element and N =0 is the number of non-zero 136 elements in the matrix. For double-precision matrices with many rows, sparse matrices 137 will be more memory-efficient than their simpler "dense" counterparts if the density of 138 non-zero elements is less than ≈ 66% (assuming 4-byte integers and 8-byte doubles). 139 The CSC format simplifies data access by imposing structure on the non-zero entries. When accessing a particular column y, all corresponding entries in G y can be quickly 141 extracted by taking the relevant part of the ordered sequence. For low-density sparse 142 matrices, column access via beachmat is even faster than access from simple matrices 143 (Supplementary Figure 3). This is because only a few non-zero entries need to be copied 144 -the rest of the Vector can be rapidly filled with zeroes. As the density of non-zero 145 entries increases, column access becomes slower but is still comparable to that of simple 146 matrices. We note that the RcppArmadillo package [5] also handles sparse matrices via 147 the SpMat class. This provides faster column-level access than the beachmat API as no 148 copying of data is performed -see above for a related discussion with simple matrices. 149 Row-level access is more difficult in the CSC format as entries in the same row do 150 not follow a predictable pattern. If a row is requested, a binary search on x needs to be 151 performed within G y for each column y, which requires an average time proportional to 152 log(N r ). In contrast, obtaining the next element in a row of a dense matrix can be done 153 in constant time by jumping N r elements ahead on the one-dimensional array. To speed 154 up row access for sparse matrices, we realized that the most common access pattern 155 involves requests for consecutive rows. If row r is accessed, beachmat will loop over all 156 columns and cache the index of the first value of x in G y that is not less than r. When 157 r + 1 is accessed, we simply need to check if each of the indices should be incremented 158 by one. This avoids the need to perform a new binary search and reduces the row access 159 time substantially (Figure 2). Even when the row access pattern is random, we mitigate 160 the time penalty by checking if the requested row is greater than or less than the 161 previous row for which indices are stored. If greater, we use the stored indices to set the 162 start of the binary search; if less, we use the indices to set the end of the search. This 163 reduces the search space and the amount of computational work for large N r .

164
Despite these optimisations, row access with sparse matrices in beachmat remains 165 slower than that with simple matrices (Figure 2a). This is not surprising as there is 166 simply less work to do with dense matrices. The exception is with large matrices of low 167 density (Figure 2b) For large, non-sparse matrices that do not fit into memory, the most obvious option is 186 to store them on disk and load submatrices into memory as required. We consider the 187 use of the hierarchical data format (HDF5) [23], which provides flexible and efficient 188 storage of and access to large amounts of data in a filesystem-like format. Each large 189 matrix is stored as a dataset in a HDF5 file on disk, while in memory, it is represented 190 in R by a HDF5Matrix object from the HDF5Array package.  representation is very small -fewer than 3 kilobytes in size -and simply extracts data 192 from disk upon request. Each HDF5Matrix instance provides methods to mimic a real 193 matrix object and allows users to manipulate the matrix in real time without the need 194 to load all of the data into memory. Compression of data in the HDF5 file also ensures 195 that the on-disk footprint remains manageable throughout the course of the analysis. 196 The beachmat API supports row-and column-level access from a HDF5Matrix 197 instance. Specifically, beachmat directly accesses data from the underlying HDF5 file 198 through the official HDF5 C++ API, which has been stored in the Rhdf5lib package for 199 portability. This means that even very large data sets can be accessed in C++, using 200 the same code for simple and sparse matrix representations. However, data access is 201 inevitably slower than that from a simple matrix as the data need to be read from disk 202 at regular intervals. We observed a 20-fold increase in the time required to access each 203 column and a 40-fold increase in the time required to access each row ( Figure 3). This 204 suggests that the HDF5Matrix representation should be used sparingly -if possible, 205 smaller data sets should use alternative in-memory representations for faster access.

206
A key determinant of the performance of the HDF5Matrix representation is the 207 layout of data in the HDF5 file. There are two layout choices for large matrices: 208 contiguous or chunked. In the contiguous layout, raw data are flattened into a 209 one-dimensional array analogous to column-major storage of simple matrices in memory. 210 In the chunked layout, data are arranged into "chunks" of a pre-defined size. For 211 example, in a row-chunked layout, each chunk would correspond to a row of the matrix. 212 Each chunk is always read (or written) in its entirety, even when only a portion of the 213 chunk is requested. Chunking is required for fast access to data, provided that the layout 214 is consistent with the expected access pattern. For example, a row-chunked layout allows 215 fast access to each row, as only one disk read is required to obtain the chunk for each 216 row. However, access to any given column is very slow, as the value of each element in 217 the column must be obtained by performing a disk read for every row chunk in its 218 entirety, i.e., N r reads in total. In practice, both row and column accesses are often  required (e.g., to access gene-and cell-level scRNA-seq data), which means that the file 220 layout must be carefully chosen to allow for these orthogonal access patterns.

221
The choice of file layout is the responsibility of the process that constructs the HDF5 222 file. This can be the original data provider; the developer whose function returns a 223 HDF5Matrix; or a user who coerces their data into a HDF5Matrix. As such, the chunking 224 scheme is generally outside of beachmat's control, preventing the API from automatically 225 choosing the optimal layout for the requested access pattern. Nonetheless, for a given 226 layout, beachmat will dynamically resize the HDF5 chunk cache to speed up access to convert an existing HDF5 file to pure row-or column-based chunks (Additional Note 2), 231 which performs optimally for random row and column access, respectively.

232
Another benefit of chunking is that the data in each chunk can be compressed using 233 filters such as ZLIB and SZIP. This decreases the size of the HDF5 file by at least 4-fold 234 for dense matrices (80 MB to 19 MB for a 10000-by-1000 double-precision matrix), with 235 even greater gains for sparse data (9 MB for the same matrix with 1% non-zero entries). 236 The use of smaller files reduces the risk that disk space will be exceeded during the 237 course of an analysis. This is important when many on-disk matrices need to be 238 constructed, e.g., to store transformed expression values or intermediate results.

239
Other matrix types 240 While the matrix representations described above are the most commonly used for 241 storing scRNA-seq data, beachmat can be easily extended to other representations. For 242 example, the packed symmetric representation from the Matrix package only stores the 243 upper or lower half of a symmetric matrix. This provides an efficient representation of 244 distance matrices, which are often used to cluster cells based on their expression profiles. 245 beachmat supports row and column access of data from packed symmetric matrices 246 7/15 through the same interface that is used for the other representations.

247
beachmat also supports data access from RleMatrix instances from the 248 DelayedArray package. The RleMatrix stores its values as a column-major run-length 249 encoding, where stretches of the same value in the one-dimensional array are stored as a 250 single run. This reduces memory usage in a more general manner than a sparse matrix, 251 especially for matrices with many small but non-zero counts. As with CSC matrices, 252 beachmat caches the row indices to speed up consecutive row access.

253
Another option for storing matrices on disk is to use the bigmemory package [10].

254
This constructs an in-memory big.matrix object that contains external pointers 255 pointing to an on-disk representation. In beachmat, we have deliberately chosen 256 HDF5Matrix rather than big.matrix due to the standardized nature of the HDF5 257 specification and portability of HDF5 files across systems. Nonetheless, we note that it 258 is simple to extend beachmat to accept big.matrix inputs if required. package does not support sparse integer or character matrices, so these are ignored.) 267 The output representation can either be explicitly specified in the code, or it can be 268 automatically chosen to match some input representation. To illustrate, consider a C++ 269 function that accepts a matrix as input and returns a matrix of similar dimensions. If 270 the input is in the simple matrix format, one might assume that there is enough 271 memory to also store the output in the simple format; whereas if the input is a 272 HDF5Matrix, one could presume that the output would be similarly large, thus requiring 273 a HDF5Matrix representation for the results. This means that results of processing in 274 C++ can be readily returned in the most suitable representation for manipulation in R. 275 Note that the layout considerations described for read access to HDF5-backed input 276 are equally applicable to HDF5-backed output. If rows/columns are to be filled 277 consecutively, we suggest using rectangular chunks that are proportional to the 278 dimensions of the matrix (Additional Note 1). For random write access, pure column-279 or row-based chunks are more suitable. Chunk dimensions can be specified directly with 280 the beachmat API; or by using functions from the HDF5Array package to set the global 281 chunking dimensions in R, which will be respected by beachmat in the C++ code.

282
Performance of beachmat on real data sets 283 Access times for a small brain data set 284 We evaluated the performance of beachmat with the different matrix representations on 285 real data, using the count matrix from a scRNA-seq study of the mouse brain [24]. This 286 data set contains integer expression values for 19972 genes in each of 3005 cells, of 287 which 18% are non-zero. We note that this is not a particularly large matrix, especially 288 in the context of droplet-based experiments that routinely generate data for tens of 289 thousands of cells. However, its size ensures that each of the matrix representations -290 including those that are stored in memory -can be easily evaluated and compared. 291 The performances of the different representations on the brain data set largely 292 recapitulate the results with simulated data. Row and column accesses from a simple  matrix are the fastest, followed by accesses from a sparse matrix (Figure 4).

294
HDF5-backed matrices provide the slowest access but also the smallest memory footprint 295 (2 KB, compared to 480 MB for simple matrices and 130 MB for sparse matrices). The 296 on-disk size of the HDF5 file is also relatively small, requiring only 16-20 MB of space 297 for each HDF5Matrix instance. These results demonstrate that the strengths and 298 weaknesses of the different representations are recapitulated with real data.

299
Analysis of the very large 10X data set 300 To demonstrate the utility of beachmat for faciliting analyses of large data sets, we 301 converted several functions in the scater [17] and scran packges [14] to use the beachmat 302 API in their C++ code. We applied these functions to the 1 million neuron data set 303 from 10X Genomics (see Methods). First, we called cell cycle phase with the cyclone 304 method [21]. The vast majority of cells were identified as being in G1 phase (Figure 5a), 305 consistent with the presence of differentiated neurons that are not actively cycling.

306
Next, we applied the deconvolution method [13] to compute size factors to normalize for 307 cell-specific biases. The size factor generally correlated well with the library size for each 308 cell (Figure 5b). Deviations were observed for a small number of cells, consistent with 309 composition effects [20] caused by differential expression between cell subpopulations.

310
We then detected highly variable genes (HVGs) based on the variance of the 311 log-normalized expression values [14]. This was performed while blocking on the (c) (d) Figure 5. Analysis of the 10X million neuron data set. (a) Cell cycle phase assignment, based on the G1 and G2M scores reported by cyclone. The intensity of colour is proportional to the density of cells at each plot location. Dashed lines indicate the score boundaries corresponding to each phase, and the number of assigned cells is also shown for each phase. (b) Size factor for each cell from the deconvolution method, plotted against the library size. Cells were coloured according to the deviation from the median log-ratio of the size factor to the library size for all cells. (c) Variance of the normalized log-expression values for each gene, plotted against the mean log-expression. The red line indicates the mean-dependent trend fitted to all genes. Orange points correspond to highly variable genes detected at a false discovery rate of 5%, with the top 10 genes highlighted. (d) PCA plot generated from the HVG expression profiles of all cells. The variance explained by each of the first two principal components is shown in brackets.
10/15 each cell can be used as a summary of its expression profile. This can be stored in 320 memory as a simple matrix and supplied directly to other R functions for further 321 processing, provided the underlying algorithms are scalable with the number of cells.

322
While a full characterisation of this data set is outside the scope of this paper, it is 323 clear that we can proceed through many parts of the scRNA-seq analysis pipeline using 324 beachmat-driven C++ functions. By taking advantage of the file-backed HDF5Matrix, 325 this analysis can be conducted in reasonable time on a desktop with modest 326 specifications (see Methods). In particular, the incoporation of the beachmat API only 327 required modest modifications to the existing C++ code for scRNA-seq data analysis. 328 Obtaining this level of functionality without beachmat would be much more difficult.

330
The popularity of the R programming language stems, in part, from the ease of its 331 extensibility. Packages can be easily developed by the research community to implement 332 cutting-edge algorithms for new sources of data. The increasing number of packages 333 designed to analyze scRNA-seq data (13 on Bioconductor at time of writing) provides a 334 case in point. Here, we describe the beachmat package, which provides a common API 335 for data access from a variety of R matrix representations in C++. This simplifies 336 package development and improves interoperability within the R/Bioconductor 337 ecosystem, by enabling arbitrary C++ code to accept many different matrix inputs 338 without any further effort on the part of the developer. While we have focused on 339 scRNA-seq in this paper, analyses of other large matrices (e.g., genome-wide contact 340 matrices in Hi-C data [15]) may also benefit from beachmat-driven code.

341
As we have shown, each matrix representation has specific strengths and weaknesses 342 for data access. Matrices that occupy more memory generally provide faster access, as 343 data do not need to be unpacked or retrieved from disk. Obviously, though, this may 344 not be practical for large data sets. Sparse matrix representations are not effective if 345 sparsity-destroying operations (e.g., mean-centering during PCA) are applied. Even 346 high-performance computing resources have their limits, especially in academic 347 environments with many users where high-resource jobs are difficult to schedule. In 348 such cases, it may be preferable to sacrifice speed for reduced memory consumption by 349 using a file-backed representation such as the HDF5Matrix class. By incorporating 350 beachmat into the C++ code, an R package can dynamically accept different matrix 351 types appropriate for the size of the data set and computing environment.

352
An alternative to using beachmat is to write C++ code for one matrix representation 353 (usually simple matrices) and apply it to chunks of a given input matrix. Each chunk is 354 coerced into the chosen representation and the C++ code is applied to the coerced 355 object. After looping across all chunks, the chunk-wise results are combined to obtain 356 the final result for the entire matrix. However, this hybrid approach requires extensive 357 coordination between R and C++ to keep track of the chunk that is being processed, to 358 monitor intermediate variables that persist between chunks, and to combine the results 359 in an appropriate manner. The need to ensure that R and C++ are interacting correctly 360 at multiple points imposes a significant burden on the developer. Computational 361 efficiency is also reduced by the use of R loops, multiple matrix coercions and repeated 362 C++ function calls. The beachmat API provides a natural solution by moving the 363 entire procedure into C++, simplifying development and maintenance.

364
It is straightforward to integrate beachmat into C++ code in existing R packages.

365
Our modifications to scran and scater have enabled the analysis of a very large 366 scRNA-seq data set in low-memory environments using file-backed representations, 367 without compromising speed for smaller data sets that can be held fully in memory. We 368 anticipate that beachmat will be useful to developers of computationally intensive 369 11/15 bioinformatics methods that need to access data from different matrices. Given the 370 infrastructure that is now available for handling large data sets, it is fair to say that the 371 rumours of R's demise (Supplementary Figure 6)

395
Access timings with simulated data 396 Double-precision dense matrices of a specified dimension were filled with values sampled 397 from a standard normal distribution. By default, all dense matrices were constructed in 398 the simple format. Double-precision CSC matrices of a specified dimension and density 399 were generated as dgCMatrix instances, using the rsparsematrix function from the  To time row and column access with beachmat and other APIs, we wrote a C++ 403 function that computes the sum of values in each row or column, respectively.

404
Calculation of the row/column sums ensures that each entry of the row/column is 405 visited in order to use its value. This means that each API has to do some work, 406 avoiding trivially fast approaches where a pointer or iterator is returned to the start of 407 12/15 the column/row. Summation is also simple enough that the access time of the API will 408 still constitute a major part of the overall time spent by the function.

409
Timings were performed in R using the system.time function on a call to each C++ 410 function via .Call. This was repeated 10 times with new matrices, and the average 411 time and standard error were computed for each access method. Row and column access 412 times were evaluated with respect to the numbers of rows and columns, respectively.

413
For sparse matrices, times were also recorded with respect to the density of non-zero 414 entries. In all cases, standard errors were negligible and not plotted for clarity.

415
Access timings with the brain data 416 scRNA-seq data from the mouse brain study [24] were obtained as a count matrix from 417 http://linnarssonlab.org/cortex/. Counts were read into R and converted into a 418 double-precision simple matrix, a dgCMatrix or a HDF5Matrix. For each representation, 419 timings of the calculation of row or column sums were performed as previously 420 described. This was repeated 10 times to obtain an average time. Column-and row-wise 421 chunking were used for timing column and row access, respectively, of a HDF5Matrix.

422
Analysis of the 1 million neuron data set 423 We downloaded the 1 million neuron data set from the 10X Genomics website 424 (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1. 425 3.0/1M_neurons, obtained 15 June 2017). We used the TENxGenomics package 426 (https://github.com/mtmorgan/TENxGenomics) to compute metrics for each cell, 427 including the total number of unique molecular identifiers (UMIs) and the total number 428 of expressed genes (i.e., with at least one UMI). Low-quality cells were identified as 429 those with log-total UMI counts or log-numbers of expressed genes that were more than 430 three median absolute deviations below the median value for all cells in the same 431 sequencing library. After filtering out the low-quality cells, we converted the count data 432 into a HDF5Matrix with column-based chunks using the HDF5Array package.

433
To call cell cycle phase, we applied the cyclone method [21] from the scran package 434 using the pre-defined mouse classifier. This was performed over three cores to reduce 435 computational time. To compute size factors, we used the computeSumFactors method 436 after splitting the data set into chunks of 2000-3000 cells. Cells in each chunk were 437 normalized using the deconvolution method [13], and size factors were calibrated across 438 chunks by normalizing the chunk-specific pseudo-cells. The size factor for each cell was 439 used to compute normalized log-expression values [14], which were represented in a new 440 HDF5Matrix object. Note that, when computing the size factors, we only used genes 441 with an average count above 0.1 (as calculated by the calcAverage function in scater ). 442 This ensures that the pooled expression profiles will not be dominated by zeros.

443
To identify highly variable genes, we used the trendVar and decomposeVar functions 444 from scran. We computed the variance of the normalized log-expression values for each 445 gene while blocking on the sequencing library of origin. We fitted a mean-dependent 446 trend to the variances to model the mean-variance relationship. Assuming that most 447 genes were not highly variable, we tested whether the residual from the trend for each 448 gene was significantly greater than zero. Highly variable genes were identified as those 449 with a significantly non-zero component at a FDR of 5%. To reduce computational 450 time, the severity of the multiple testing correction and to avoid discreteness when 451 fitting the trend, we only considered genes with an average count above 0.01. 452 We used a simple approach to perform PCA on a very large matrix. After subsetting 453 the expression matrix by the detected HVGs, we mean-centred and standardized the 454 expression vector for each gene. We randomly selected 10000 cells and coerced the 455 expression profiles into a simple matrix. This was used as input into the prcomp 456 13/15 function to obtain the loading vectors. We then projected all cells onto the space 457 defined by the first two loading vectors to obtain PC1 and PC2 coordinates for all cells. 458 This approach assumes the selected 10000 cells provide a good representation of the 459 variance structure in the full population, allowing the approximate loading vectors to be 460 obtained by PCA on the smaller matrix. More sophisticated strategies such as 461 randomized PCA [8] could also be used, but a correct and efficient implementation of 462 such algorithms for HDF5Matrix objects is beyond the scope of this paper.  Only the phase assignment step was run on multiple cores.