Correction: Visualizing the structure of RNA-seq expression data using grade of membership models

[This corrects the article DOI: 10.1371/journal.pgen.1006599.].

Introduction Ever since large-scale gene expression measurements have been possible, clustering-of both genes and samples-has played a major role in their analysis [1][2][3]. For example, clustering of genes can identify genes that are working together or are co-regulated, and clustering of samples is useful for quality control as well as identifying biologically-distinct subgroups. A wide range of clustering methods have therefore been employed in this context, including distancebased hierarchical clustering, k-means clustering, and self-organizing maps (SOMs); see for example [4,5] for reviews.
Here we focus on cluster analysis of samples, rather than clustering of genes (although our methods do highlight sets of genes that distinguish each cluster). Traditional clustering methods for this problem attempt to partition samples into distinct groups that show "similar" expression patterns. While partitioning samples in this way has intuitive appeal, it seems likely that the structure of a typical gene expression data set will be too complex to be fully captured by such a partitioning. Motivated by this, here we analyse expression data using grade of membership (GoM) models [6], which generalize clustering models to allow each sample to have partial membership in multiple clusters. That is, they allow that each sample has a proportion, or "grade" of membership in each cluster. Such models are widely used in population genetics to model admixture, where individuals can have ancestry from multiple populations [7], and in document clustering [8,9] where each document can have membership in multiple topics. In these fields GoM models are often known as "admixture models", and "topic models" or "Latent Dirichlet Allocation" [8]. GoM models have also recently been applied to detect mutation signatures in cancer samples [10].
Although we are not the first to apply GoM-like models to gene expression data, previous applications have been primarily motivated by a specific goal, "cell type deconvolution", which involves using cell-type-specific expression profiles of marker genes to estimate the proportions of different cell types in a mixture [11][12][13]. Specifically, the GoM model we use here is analogous to-although different in detail from-blind deconvolution approaches [14][15][16] which estimate cell type proportions and cell type signatures jointly (see also [17,18] for semi-supervised approaches). Our goal here is to demonstrate that GoM models can be useful much more broadly for understanding structure in RNA-seq data-not only to deconvolve mixtures of cell types. For example, in our analysis of human tissue samples from the GTEX project below, the GoM model usefully captures biological heterogeneity among samples even though the inferred grades of membership are unlikely to correspond precisely to proportions of specific cell types. And in our analyses of single-cell expression data the GoM model highlights interesting structure, even though interpreting the grades of membership as "proportions of cell types" is clearly inappropriate because each sample is a single cell! Here we are exploiting the GoM as a flexible extension of traditional cluster models, which can capture "continuous" variation among cells as well as the more "discrete" variation captured by cluster models. Indeed, the extent to which variation among cells can be described in terms of discrete clusters versus more continuous populations is a fundamental question that, when combined with appropriate single-cell RNA-seq data, the GoM models used here may ultimately help address.

Methods overview
We assume that the RNA-seq data on N samples has been summarized by a table of counts C N×G = (c ng ), where c ng is the number of reads from sample n mapped to gene g (or other unit, such as transcript or exon) [19]. The GoM model is a generalization of a cluster model, which allows that each sample has some proportion ("grade") of membership, in each cluster. For RNA-seq data this corresponds to assuming that each sample n has some proportion of its reads, q nk coming from cluster k. In addition, each cluster k is characterized by a probability vector, θ kÁ , whose gth element represents the relative expression of gene g in cluster k. The GoM model is then where p ng ≔ The number of clusters K is set by the analyst, and it can be helpful to explore multiple values of K (see Discussion). To fit this model to RNA-seq data, we exploit the fact that this GoM model is commonly used for document clustering [8]. This is because, just as RNA-seq samples can be summarized by counts of reads mapping to each possible gene in the genome, document data can be summarized by counts of each possible word in a dictionary. Recognizing this allows existing methods and software for document clustering to be applied directly to RNA-seq data. Here we use the R package maptpx [20] to fit the GoM model. Fitting the GoM model results in estimated membership proportions q for each sample, and estimated expression values θ for each cluster. We visualize the membership proportions for each sample using a "Structure plot" [21], which is named for its widespread use in visualizing the results of the Structure software [7] in population genetics. The Structure plot represents the estimated membership proportions of each sample as a stacked barchart, with bars of different colors representing different clusters. Consequently, samples that have similar membership proportions have similar amounts of each color. See To help biologically interpret the clusters inferred by the GoM model we also implemented methods to identify, for each cluster, which genes are most distinctively differentially expressed in that cluster; that is, which genes show the biggest difference in expression compared with the other most similar cluster (see Methods). Functions for fitting the GoM model, plotting the structure plots, and identifying the distinctive ("driving") genes in each cluster, are included in our R package CountClust [22] available through Bioconductor [23].

Results
Bulk RNA-seq data of human tissue samples We begin by illustrating the GoM model on bulk RNA expression measurements from the GTEx project (V6 dbGaP accession phs000424.v6.p1, release date: Oct 19, 2015, http://www. gtexportal.org/home/). These data consist of per-gene read counts from RNA-seq performed on 8,555 samples collected from 450 human donors across 53 tissues, lymphoblastoid cell lines, and transformed fibroblast cell-lines. We analyzed 16,069 genes that satisfied filters (e.g. exceeding certain minimum expression levels) that were used during eQTL analyses by the GTEx project (gene list available in http://stephenslab.github.io/count-clustering/project/ utilities/gene_names_all_gtex.txt).
We fit the GoM model to these data, with number of clusters K = 5, 10, 15, 20. For each K we ran the fitting algorithm three times and kept the result with the highest log-likelihood. As might be expected, increasing K highlights finer structure in the data, and for brevity we focus discussion on results for K = 20 (Fig 1(a)), with results for other K shown in S1 Fig. For comparison we also ran several other commonly-used methods for clustering and visualizing gene expression data: Principal Components Analysis (PCA), Multidimensional Scaling (MDS), t-Distributed Stochastic Neighbor Embedding (t-SNE) [24,25], and hierarchical clustering (Fig 2).
These data present a challenge to visualization and clustering tools, because of both the relatively large number of samples and the complex structure created by the inclusion of many different tissues. Indeed, neither PCA nor MDS provide satisfactory summaries of the structure in these data (Fig 2(a) and 2(b)): samples from quite different tissues are often super-imposed on one another in plots of PC1 vs PC2, and this issue is only partly alleviated by examining more PCs (S2 Fig). The hierarchical clustering provides perhaps better separation of tissues (Fig 2(d)), but producing a clear (static) visualization of the tree is difficult with this many samples. By comparison t-SNE (Fig 2(b)) and the GoM model (Fig 1(a)) both show a much clearer visual separation of samples by tissue, although they achieve this in very different ways. The t-SNE representation produces a two-dimensional plot with 20-25 visually-distinct clusters. In contrast, the GoM highlights similarity among samples by assigning them similar membership proportions, resulting in groups of similarly-colored bars in the structure plot. Some tissues are represented by essentially a single cluster/color (e.g. Pancreas, Liver), whereas other tissues are represented as a mixture of multiple clusters (e.g. Thyroid, Spleen). Furthermore, the GoM results highlight biological similarity among some tissues by assigning similar membership proportions to samples from those tissues. For example, samples from several different parts of the brain often have similar memberships, as do the arteries (aorta, tibial and coronary) and skin samples (sun-exposed and un-exposed).
Although it is not surprising that samples cluster by tissue, other results could have occurred. For example, samples could have clustered according to technical variables, such as sequencing batch [26] or sample collection center. While our results do not exclude the possibility that technical variables could have influenced these data, the t-SNE and GoM results clearly demonstrate that tissue of origin is the primary source of heterogeneity, and provide a useful initial assurance of data quality.
While in these data both the GoM model and t-SNE highlight the primary structure due to tissue of origin, the GoM results have at least two advantages over t-SNE. First, the GoM clusters. (b): Structure plot of estimated membership proportions for K = 4 clusters fit to only the brain tissue samples. This analysis highlights finerscale structure among the brain samples that is missed by the global analysis in (a). doi:10.1371/journal.pgen.1006599.g001 Visualizing the structure of RNA-seq expression data using grade of membership models model provides an explicit, quantitative, estimate of the mean expression of each gene in each cluster, making it straightforward to assess which genes and processes drive differences among clusters; see Table 1 (and also S1 Table). Reassuringly, many results align with known biology. For example, the purple cluster (cluster 18), which distinguishes Pancreas from other tissues, is enriched for genes responsible for digestion and proteolysis, (e.g. PRSS1, CPA1, PNLIP). Similarly the yellow cluster (cluster 12), which primarily distinguishes Cell EBV Lymphocytes from other tissues, is enriched with genes responsible for immune responses (e.g. IGHM, IGHG1) and the pink cluster (cluster 19) which mainly appears in Whole Blood, is enriched with genes related hemoglobin complex and oxygen transport (e.g. HBB, HBA1, HBA2). Further, Keratinrelated genes characterize the skin cluster (cluster 6, light denim), Myosin-related genes characterize the muscle skeletal cluster (cluster 7, orange), etc. These biological annotations are particularly helpful for understanding instances where a cluster appears in multiple tissues. For example, the top genes in the salmon cluster (cluster 4), which is common to the Gastroesophageal Junction, Esophagus Muscularis and Colon Sigmoid, are related to smooth muscle. And the top genes in the red cluster, highlighted above as common to Breast Mammary tissue, Adipose Subcutaneous and Adipose Visceral, are all related to adipocytes and/or fatty acid synthesis.
A second advantage of the GoM model is that, because it allows partial membership in each cluster, it is better able to highlight partial similarities among distinct tissues. For example, in Fig 1(a) the sky blue cluster (cluster 13), appears in testis, pituitary, and thyroid, reflecting shared hormonal-related processes. At the same time, these tissues are distinguished from one another both by their degree of membership in this cluster (testis samples have consistently stronger membership; thyroid samples consistently weaker), and by membership in other clusters. For example, pituitary samples, but not testis or thyroid samples, have membership in the light purple cluster (cluster 2) which is driven by genes related to neurons and synapsis. In the t-SNE results these three tissues simply cluster separately into visually distinct groups, with no indication that their expression profiles have something in common (Fig 2(b)). Thus, although we find the t-SNE results visually attractive, this 2-dimensional projection contains less information than the Structure plot from the GoM (Fig 1(a)), which uses color to represent the samples in a 20-dimensional space.
In addition to these qualitative comparisons with other methods, we also used the GTEx data to quantitatively compare the accuracy of the GoM model with hierarchical clustering. Specifically, for each pair of tissues in the GTEx data we assessed whether or not each method correctly partitioned samples into the two tissue groups; see Methods. (Other methods do not provide an explicit clustering of the samples-only a visual representation-and so are not included in these comparisons.) The GoM model was more accurate in this test, succeeding in 88% of comparisons, compared with 79% for hierarchical clustering (S3(a) vs S3(c) Fig).

Sub-analysis of brain tissues
Although the analysis of all tissues is useful for assessing global structure, it may miss finerscale structure within tissues or among similar tissues. For example, here the GoM model applied to all tissues effectively allocated only three clusters to all brain tissues (clusters 1,2 and 9 in Fig 1(a)), and we suspected that additional substructure might be uncovered by analyzing Visualizing the structure of RNA-seq expression data using grade of membership models Visualizing the structure of RNA-seq expression data using grade of membership models the brain samples separately and using more clusters. Fig 1(b) shows the Structure plot for K = 6 on only the Brain samples. The results highlight much finer-scale structure compared with the global analysis; see Table 2. Brain Cerebellum and Cerebellar hemisphere are essentially assigned to a separate cluster (lime green), which is enriched with genes related to cell periphery and communication (e.g. PKD1, CBLN3) as well as genes expressed largely in neuronal cells and playing a role in neuron differentiation (e.g. CHGB). The spinal cord samples also show consistently strong membership in a single cluster (yellow-orange), the top defining gene for the cluster being MBP which is involved in myelination of nerves in the nervous system [27]. Another driving gene, GFAP, participates in system development by acting as a marker to distinguish astrocytes during development [28]. The remaining samples all show membership in multiple clusters. Samples from the putamen, caudate and nucleus accumbens show similar profiles, and are distinguished by strong membership in a cluster (cluster 4, bright red) whose top driving gene is PPP1R1B, a target for dopamine. And cortex samples are distinguished from others by stronger membership in a cluster (cluster 2, turquoise in Fig 1(b)) whose distinctive genes include ENC1, which interacts with actin and contributes to the organisation of the cytoskeleton during the specification of neural fate [29].
In comparison, applying PCA, MDS, hierarchical clustering and t-SNE to these brain samples reveals less of this finer-scale structure (S4 Fig). Both PCA and MDS effectively cluster the samples into two groups-those related to the cerebellum vs everything else. Hierarchical clustering also separates out the cerebellum-related tissues from the others, but again the format seems ill-suited to static visualization of more than one thousand samples. For reasons that we do not understand t-SNE performs poorly for these data: many samples are allocated to essentially identical locations, and so overplotting obscures them.

Single-cell RNA-seq data
Recently RNA-sequencing has become viable for single cells [30], and this technology has the promise to revolutionize understanding of intra-cellular variation in expression, and regulation more generally [31]. Although it is traditional to describe and categorize cells in terms of Visualizing the structure of RNA-seq expression data using grade of membership models distinct cell-types, the actual architecture of cell heterogeneity may be more complex, and in some cases perhaps better captured by the more "continuous" GoM model. In this section we illustrate the potential for the GoM model to be applied to single cell data.
To be applicable to single-cell RNA-seq data, methods must be able to deal with lower sequencing depth than in bulk RNA experiments: single-cell RNA-seq data typically involve substantially lower effective sequencing depth compared with bulk experiments, due to the relatively small number of molecules available to sequence in a single cell. Therefore, as a first step towards demonstrating its potential for single cell analysis, we checked robustness of the GoM model to sequencing depth. Specifically, we repeated the analyses above after thinning the GTEx data by a factor of 100 and 10,000 to mimic the lower sequencing depth of a typical single cell experiment. For the thinned GTEx data the Structure plot for K = 20 preserves most of the major features of the original analysis on unthinned data (S5 Fig). For the accuracy comparisons with hierarchical clustering, both methods suffer reduced accuracy in thinned data, but the GoM model remains superior (S6 Fig). For example, when thinning by a factor of 10,000, the success rate in separating pairs of tissues is 0.32 for the GoM model vs 0.10 for hierarchical clustering.
Having established its robustness to sequencing depth, we now illustrate the GoM model on two single cell RNA-seq datasets: data on mouse spleen from Jaitin et al [32] and data on mouse preimplantation embryos from Deng et al [33].
Mouse spleen data from Jaitin et al, 2014. Jaitin et al sequenced over 4,000 single cells from mouse spleen. Here we analyze 1,041 of these cells that were categorized as CD11c+ in the sorting markers column of their data (http://compgenomics.weizmann.ac.il/tanay/?page_ id=519), and which had total number of reads mapping to non-ERCC genes greater than 600. Our hope was that applying the GoM model to these data would identify, and perhaps refine, the cluster structure evident in [32] (their Fig 2A and 2B). However, the GoM model yielded rather different results (Fig 3), where most cells were assigned to have membership in several clusters. Further, the cluster membership vectors showed systematic differences among amplification batches (which in these data is also strongly correlated with sequencing batch). For example, cells in batch 1 are characterized by strong membership in the orange cluster (cluster 5) while those in batch 4 are characterized by strong membership in both the blue and yellow clusters (2 and 6). Some adjacent batches show similar patterns-for example batches 28 and 29 have a similar visual "palette", as do batches 32-45. And, more generally, these later batches are collectively more similar to one another than they are to the earlier batches (0-4).
The fact that batch effects are detectable in these data is not particularly surprising: there is a growing recognition of the importance of batch effects in high-throughput data generally [34,35] and in single cell data specifically [26,36]. And indeed, both clustering methods and the GoM model can be viewed as dimension reduction methods, and such methods can be helpful in controlling for batch effects [37,38]. However, why these batch effects are not evident in Fig 2A and 2B of [32] is unclear.
Mouse preimplantation embryo data from Deng et al, 2014. Deng et al collected singlecell expression data of mouse preimplantation embryos from the zygote to blastocyst stage [33], with cells from four different embryos sequenced at each stage. The original analysis [33] focuses on trends of allele-specific expression in early embryonic development. Here we use the GoM model to assess the primary structure in these data without regard to allele-specific effects (i.e. combining counts of the two alleles). Visual inspection of the Principal Components Analysis in [33] suggested perhaps 6-7 clusters, and we focus here on results with K = 6.
The results from the GoM model (Fig 4) clearly highlight changes in expression profiles that occur through early embryonic development stages, and enrichment analysis of the Visualizing the structure of RNA-seq expression data using grade of membership models  (Table 3, S4 Table) indicate that many of these expression changes reflect important biological processes during embryonic preimplantation development.
In more detail: Initially, at the zygote and early 2-cell stages, the embryos are represented by a single cluster (blue in Fig 4) that is enriched with genes responsible for germ cell development (e.g., Bcl2l10 [39], Spin1 [40]). Moving through subsequent stages the grades of membership evolve to a mixture of blue and magenta clusters (mid 2-cell), a mixture of magenta and yellow clusters (late 2-cell) and a mixture of yellow and green (4-cell stage). The green cluster then becomes more prominent in the 8-cell and 16-cell stages, before dropping substantially in the early and mid-blastocyst stages. That is, we see a progression in the importance of different clusters through these stages, from the blue cluster, moving through magenta and yellow to green. Examining the genes distinguishing each cluster reveals that this progression (bluemagenta-yellow-green) reflects the changing relative importance of several fundamental biological processes. The magenta cluster is driven by genes responsible for the beginning of transcription of zygotic genes (e.g., Zscan4c-f show up in the list of top 100 driving genes: see https://stephenslab.github.io/count-clustering/project/src/deng_cluster_annotations.html), which takes place in the late 2-cell stage of early mouse embryonic development [41]. The yellow cluster is enriched for genes responsible for heterochromation Smarcc1 [42] and chromosome stability Cenpe [43] (see S4 Table). And the green cluster is enriched for cytoskeletal genes (e.g., Fbxo15) and cytoplasm genes (e.g., Tceb1, Hsp90ab1), all of which are essential for compaction at the 8-cell stage and morula formation at the 16-cell stage.
Finally, during the blastocyst stages two new clusters (purple and orange in Fig 4) dominate. The orange cluster is enriched with genes involved in the formation of trophectoderm (TE) annotated by the genes that are most distinctively expressed in that cluster, and by the gene ontology categories for which these distinctive genes are most enriched (see Table 1 for more extensive annotation results). See text for discussion of biological processes driving these results. Visualizing the structure of RNA-seq expression data using grade of membership models (e.g., Tspan8, Krt8, Id2 [44]), while the purple cluster is enriched with genes responsible for the formation of inner cell mass (ICM) (e.g., Pdgfra, Pyy [45]). For comparison, results for PCA, MDS, t-SNE and hierarchical clustering are shown in S7 Fig. All these methods show some clustering structure by pre-implantation stage; however only PCA and MDS seem to capture the developmental trajectory from zygote to blastocyst, exhibiting a "horse-shoe" pattern that is expected when similarities among samples approximately reflect an underlying latent ordering [46,47]. And none of them provide any direct indication of the ICM vs TE structure in the blastocyst samples.
Although the GoM model results clearly highlight some of the key biological processes underlying embryonic preimplantation development, there are also some expected patterns that do not appear. Specifically, just prior to implantation the embryo consists of three different cell types, the trophectoderm (TE), the primitive endoderm (PE), and the epiblast (EPI) [48], with the PE and EPI being formed from the ICM. Thus one might expect the late blastocyst cells to show a clear division into three distinct groups, and for some of the earlier blastocyst cells to show partial membership in one of these groups as they begin to differentiate towards these cell types. Indeed, the GoM model seems well suited to capture this process in principle. However, this is not the result we obtained in practice. In particular, although the two clusters identified by the GoM model in the blastocyst stages appear to correspond roughly to the TE and ICM, even the late blastocyst cells tend to show a gradient of memberships in both these clusters, rather than a clear division into distinct groups. Our results contrast with those from the single-cell mouse preimplantation data of [44], measured by qPCR, where the late blastocyst cells showed a clear visual division into three groups using PCA (their Fig 1).
To better understand the differences between our results for RNA-seq data from [33] and the qPCR results from [44] we applied the GoM model with K = 3 to a small subset of the RNA-seq data: the blastocyst cell data at the 48 genes assayed by [44]. These genes were specifically chosen by them to help elucidate cell-fate decisions during early development of the mouse embryo. Still, the GoM model results (S8 Fig) do not support a clear division of these data into three distinct groups (and neither do PCA or t-SNE; S9 Fig). Rather, the GoM model highlights one cluster (Green in figure), whose membership proportions essentially reflect expression at the Actb gene, and two other clusters (Orange and Purple in figure) whose driving genes correspond to genes identified in [44] as being distinctive to TE and EPI cell types respectively. The Actb gene is a "housekeeping gene", used by [44] to normalize their qPCR data, and its prominence in the GoM results likely reflects its very high expression levels relative to other genes. However, excluding Actb from the analysis still does not lead to a clear separation into three groups (S8 Fig). Thus, although there are clear commonalities in the structure of these RNA-seq and qPCR data sets, the structure of the single-cell RNA-seq data from [33] is fundamentally more complex (or, perhaps, muddied), and consequently more difficult to interpret.
In addition to trends across development stages, the GoM results also highlight some embryo-level effects in the early stages (Fig 4). Specifically, cells from the same embryo sometimes show greater similarity than cells from different embryos. For example, while all cells from the 16-cell stage have high memberships in the green cluster, cells from two of the embryos at this stage have memberships in both the purple and yellow clusters, while the other two embryos have memberships only in the yellow cluster.
The GoM results also highlight a few single cells as outliers. For example, a cell from a 16-cell embryo is represented by the blue cluster-a cluster that represents cells at the zygote and early 2-cell stage. Also, a cell from an 8-stage embryo has strong membership in the purple cluster-a cluster that represents cells from the blastocyst stage. This illustrates the potential for the GoM model to help in quality control: it would seem prudent to consider excluding these outlier cells from subsequent analyses of these data.

Discussion
Our goal here is to highlight the potential for GoM models to elucidate structure in RNA-seq data from both single cell sequencing and bulk sequencing of pooled cells. We also provide tools to identify which genes are most distinctively expressed in each cluster, to aid interpretation of results. As our applications illustrate, the results can provide a richer summary of the structure in RNA-seq data than existing widely-used visualization methods such as PCA and hierarchical clustering. While it could be argued that the GoM model results sometimes raise more questions than they answer, this is exactly the point of an exploratory analysis tool: to highlight issues for investigation, identify anomalies, and generate hypotheses for future testing.
Our results from different methods also highlight another important point: different methods have different strengths and weaknesses, and can compliment one another as well as competing. For example, t-SNE seems to provide a much clearer indication of the cluster structure in the full GTEx data than does PCA, but does a poorer job of capturing the ordering of the developmental samples from mouse pre-implantation embryos. While we believe the GoM model often provides a richer summary of the sample structure, we would expect to use it in addition to t-SNE and PCA when performing exploratory analyses. (Indeed the methods can be used in combination: both PCA and t-SNE can be used to visualize the results of the GoM model, as an alternative or complement to the Structure plot.) A key feature of the GoM model is that it allows that each sample has a proportion of membership in each cluster, rather than a discrete cluster structure. Consequently it can provide insights into how well a particular dataset really fits a "discrete cluster" model. For example, consider the results for the data from Jaitin et al [32] and Deng et al [33]: in both cases most samples are assigned to multiple clusters, although the results are closer to "discrete" for the latter than the former. The GoM model is also better able to represent the situation where there is not really a single clustering of the samples, but where samples may cluster differently at different genes. For example, in the GTEx data, the stomach samples share memberships in common with both the pancreas (purple) and the adrenal gland (light green). This pattern can be seen in the Structure plot (S4 Fig) but not from other methods like PCA, t-SNE or hierarchical clustering (Fig 2).
Fitting GoM models can be computationally-intensive for large data sets. For the datasets we considered here the computation time ranged from 12 minutes for the data from [33] (n = 259;K = 6), through 33 minutes for the data from [32] (n = 1,041;K = 7) to 3,370 minutes for the GTEx data (n = 8,555;K = 20). Computation time can be reduced by fitting the model to only the most highly expressed genes, and we often use this strategy to get quick initial results for a dataset. Because these methods are widely used for clustering very large document datasets there is considerable ongoing interest in computational speed-ups for very large datasets, with "on-line" (sequential) approaches capable of dealing with millions of documents [49] that could be useful in the future for very large RNA-seq datasets.
A thorny issue that arises when fitting clustering models is how to select the number of clusters, K. Like many software packages for fitting these models, the maptpx package implements a measure of model fit that provides one useful guide. However, it is worth remembering that in practice there is unlikely to be a "true" value of K, and results from different values of K may complement one another rather than merely competing with one another. For example, seeing how the fitted model evolves as K increases is one way to capture some notion of hierarchy in the clusters identified [21]. More generally it is often fruitful to analyse data in multiple ways using the same tool: for example our GTEx analyses illustrate how analysis of subsets of the data (in this case the brain samples) can complement analyses of the entire data. Finally, as a practical matter, we note that Structure plots can be difficult to read for large K (e.g. K = 30) because of the difficulties of choosing a palette with K distinguishable colors.
The version of the GoM model fitted here is relatively simple, and could certainly be embellished. For example, the model allows the expression of each gene in each cluster to be a free parameter, whereas we might expect expression of most genes to be "similar" across clusters. This is analogous to the idea in population genetics applications that allele frequencies in different populations may be similar to one another [50], or in document clustering applications that most words may not differ appreciably in frequency in different topics. In population genetics applications incorporating this idea into the model, by using a correlated prior distribution on these frequencies, can help improve identification of subtle structure [50] and we would expect the same to happen here for RNA-seq data.
Finally, GoM models can be viewed as one of a larger class of "matrix factorization" approaches to understanding structure in data, which also includes PCA, non-negative matrix factorization (NMF), and sparse factor analysis (SFA); see [51]. This observation raises the question of whether methods like SFA might be useful for the kinds of analyses we performed here. (NMF is so closely related to the GoM model that we do not discuss it further; indeed, the GoM model is a type of NMF, because both grades of membership and expression levels within each cluster are required to be non-negative.) Informally, SFA can be thought of as a generalization of the GoM model that allows samples to have negative memberships in some "clusters" (actually, "factors"). This additional flexibility should allow SFA to capture certain patterns more easily than the GoM model. For example, a small subset of genes that are overexpressed in some samples and under-expressed in other samples could be captured by a single sparse factor, with positive loadings in the over-expressed samples and negative loadings in the other samples. However, this additional flexibility also comes at a cost of additional complexity in visualizing the results. For example, S10, S11 and S12 Figs show results of SFA (the version from [51]) for the GTEx data and the mouse preimplantation data: in our opinion, these do not have the simplicity and immediate visual appeal of the GoM model results. Also, applying SFA to RNA-seq data requires several decisions to be made that can greatly impact the results: what transformation of the data to use; what method to induce sparsity (there are many; e.g. [51][52][53][54]); whether to induce sparsity in loadings, factors, or both; etc. Nonetheless, we certainly view SFA as complementing the GoM model as a promising tool for investigating the structure of RNA-seq data, and as a promising area for further work.

Model fitting
We use the maptpx R package [20] to fit the GoM models (1 and 2), which is also known as "Latent Dirichlet Allocation" (LDA). The maptpx package fits this model using an EM algorithm to perform Maximum a posteriori (MAP) estimation of the parameters q and θ. See [20] for details.

Visualizing results
In addition to the Structure plot, we have also found it useful to visualize results using t-distributed Stochastic Neighbor Embedding (t-SNE), which is a method for visualizing high dimensional datasets by placing them in a two dimensional space, attempting to preserve the relative distance between nearby samples [24,25]. Compared with the Structure plot our t-SNE plots contain less information, but can better emphasize clustering of samples that have similar membership proportions in many clusters. Specifically, t-SNE tends to place samples with similar membership proportions together in the two-dimensional plot, forming visual "clusters" that can be identified by eye (e.g. http://stephenslab.github.io/count-clustering/project/src/ tissues_tSNE_2.html). This may be particularly helpful in settings where no external information is available to aid in making an informative Structure plot.

Cluster annotation
To help biologically interpret the clusters, we developed a method to identify which genes are most distinctively differentially expressed in each cluster. (This is analogous to identifying "ancestry informative markers" in population genetics applications [55].) Specifically, for each cluster k we measure the distinctiveness of gene g with respect to any other cluster l using KL g ½k; l :¼ y kg log y kg y lg þ y lg À y kg ; ð3Þ which is the Kullback-Leibler divergence of the Poisson distribution with parameter θ kg to the Poisson distribution with parameter θ lg . For each cluster k, we then define the distinctiveness of gene g as The higher D g [k], the larger the role of gene g in distinguishing cluster k from all other clusters. Thus, for each cluster k we identify the genes with highest D g [k] as the genes driving the cluster k. We annotate the biological functions of these individual genes using the mygene R Bioconductor package [56]. For each cluster k, we filter out a number of genes (top 100 for the Deng et al data [33] and GTEx V6 data [57]) with highest D g [k] value and perform a gene set over-representation analysis of these genes against all the other genes in the data representing the background. To do this, we used ConsensusPathDB database (http://cpdb.molgen.mpg.de/) [58] [59]. See Tables  1, 2 and 3 for the top significant gene ontologies driving each cluster in the GTEx V6 data and the Deng et al data respectively.

Comparison with hierarchical clustering
We compared the GoM model with a distance-based hierarchical clustering algorithm by applying both methods to samples from pairs of tissues from the GTEx project, and assessed their accuracy in separating samples according to tissue. For each pair of tissues we randomly selected 50 samples from the pool of all samples coming from these tissues. For the hierarchical clustering approach we cut the dendrogram at K = 2, and checked whether or not this cut partitions the samples into the two tissue groups. (We applied hierarchical clustering using Euclidean distance, with both complete and average linkage; results were similar and so we showed results only for complete linkage.) For the GoM model we analysed the data with K = 2, and sorted the samples by their membership in cluster 1. We then partitioned the samples at the point of the steepest fall in this membership, and again we checked whether this cut partitions the samples into the two tissue groups. S3 Fig shows, for each pair of tissues, whether each method successfully partitioned the samples into the two tissue groups.

Thinning
We used "thinning" to simulate lower-coverage data from the original higher-coverage data. Specifically, if c ng is the counts of number of reads mapping to gene g for sample n for the original data, we simulated thinned counts t ng using t ng $ Binðc ng ; p thin Þ ð5Þ where p thin is a specified thinning parameter.

Code availability
Our methods are implemented in an R package CountClust, available as part of the  (d) Hierarchical clustering on counts data with the assumption that, for each gene the sample read count c ng has a variance " c g þ 1 that is constant across samples. And, the the genespecific variance " c g þ 1 was used to scale the distance matrix for clustering. (e) Hierarchical clustering applied to adjusted count data. Each gene has a mean expression value of 0 and variance of 1. Taken together, these results suggest that regardless of the different data transformation scenarios, the GoM model with K = 2 is able to separate samples of different tissue of origin, better than hierarchical cluster methods. (PDF)  Fig 1(a), though there are a few differences from the unthinned version. (PDF) S6 Fig. A comparison of accuracy of hierarchical clustering vs GoM on thinned GTEx data, with thinning parameters of p thin = 0.01 and p thin = 0.001. For each pair of tissue samples from the GTEx V6 data we assessed whether or not each clustering method (with K = 2 clusters) separated the samples according to their tissue of origin, with successful separation indicated by a filled square. Thinning deteriorates accuracy compared with the unthinned data (Fig 2), but even then the model-based method remains more successful than the hierarchical clustering in separating the samples by tissue or origin. (PDF)