Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data

Motivation Gene co-expression analysis is an attractive tool for leveraging enormous amounts of public RNA-seq datasets for the prediction of gene functions and regulatory mechanisms. However, the optimal data processing steps for the accurate prediction of gene co-expression from such large datasets remain unclear. Especially the importance of batch effect correction is understudied. Results We processed RNA-seq data of 68 human and 76 mouse cell types and tissues using 50 different workflows into 7,200 genome-wide gene co-expression networks. We then conducted a systematic analysis of the factors that result in high-quality co-expression predictions, focusing on normalization, batch effect correction, and measure of correlation. We confirmed the key importance of high sample counts for high-quality predictions. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression estimates, equivalent to a >80% and >40% increase in samples. In larger datasets, batch effect removal was equivalent to a more than doubling of the sample size. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets. Conclusion A key point for accurate prediction of gene co-expression is the collection of many samples. However, paying attention to data normalization, batch effects, and the measure of correlation can significantly improve the quality of co-expression estimates.


Introduction
Understanding the functions and regulatory mechanisms of genes is one of the central challenges in biology. Gene co-expression is an important concept in bioinformatics because it serves as a foundation for predicting gene functions and regulatory mechanisms, and for more complex network inference methods [1][2][3][4][5][6]. Several gene co-expression databases have been developed [7][8][9][10].
High numbers of samples are needed to accurately infer correlation of expression [11]. Public databases are attractive sources of expression data, but in practice high numbers of samples can only be obtained by aggregating data from different studies conducted by different laboratories. As a result, the input data for gene co-expression analysis often contains considerable technical variability, called batch effects [9,12].
In a recent study, we showed that correcting batch effects improved the quality of gene coexpression estimates significantly [9]. However, our previous study was limited to microarray data, and considered only one data normalization method and one batch correction method, i.e. ComBat [13]. Moreover, other studies have shown that treating batch effects can also result in unwanted artifacts such as exaggerated differences between covariates in gene expression and DNA methylation data [14][15][16][17].
Here, we present a systematic analysis of the effects of RNA-seq data normalization, batch effect correction, and correlation measure on the quality of gene co-expression estimates. We applied 50 data processing workflows on data for 68 human and 76 mouse cell types and tissues, resulting in 7,200 sets of genome-wide gene-gene co-expression predictions. Through analysis of the quality of these cell type-and tissue-specific co-expression predictions, we confirmed the importance of large numbers of samples [11]. We also found that some normalization methods (especially UQ normalization) resulted on average in better co-expression predictions than others. In addition, treating batch effects resulted in a significant improvement of the co-expression estimates, especially in larger datasets consisting of samples produced by many different studies. It is imperative that future studies pay attention to batch effects in order to make optimal use of large amounts of public data. Finally, the difference between Pearson's correlation and Spearman's correlation was small, with Spearman working better in small datasets and Pearson better in medium-sized datasets. To the best of our knowledge, this is the first comprehensive study evaluating the importance of batch effect correction for the prediction of gene co-expression from large collections of RNA-seq data.

Overview of this study
The goal of this study is to gain insights into which data processing steps are preferable for obtaining high-quality gene co-expression estimates from large collections of RNA-seq data. To address this issue, we collected a dataset of 8,796 human and 12,114 mouse bulk RNA-seq samples, from 401 and 630 studies, covering 68 human and 76 mouse cell types and tissues (see Methods; S1 and S2 Tables) [18]. On these two datasets, we applied combinations of data normalization approaches and batch effect correction approaches (see Fig 1 for a summary of the workflow). As proxies for batches we used the studies that produced each sample (1 study is 1 batch). We also applied the method ComBat-seq on the raw read count data without any prior normalization [19]. The resulting 25 (6 normalizations x 4 batch effect correction approaches, and ComBat-seq without normalization) human and 25 mouse datasets were used to estimate correlation of expression using Pearson's correlation and Spearman's correlation in the data of each cell type or tissue. This resulted in a total of 7,200 (3,400 human and 3,800 mouse) genome-wide sets of cell type or tissue-specific gene co-expression predictions. We will refer to each genome-wide set of cell type or tissue-specific gene co-expression predictions as a "co-expression network". However, the goal of this study is not to analyze network topology. Our focus is to identify the key features that result in accurate co-expression predictions.

Defining the quality of co-expression predictions
Next, we evaluated the quality of the co-expression predictions produced by each workflow. Many studies have used enrichment of shared functional annotations among correlated genes or regulatory DNA motifs in their promoter sequences as quality measures for co-expression predictions [10,[20][21][22]. In high-quality co-expression networks, we expect correlated genes to belong to shared pathways or to be controlled by a common regulatory mechanism (Fig 1B). In contrast, in a randomly generated network, correlated genes are expected to lack common functions or regulatory mechanisms. In this study, in each co-expression network, for every gene X, we extracted the set of 100 genes with the highest correlation, which we refer to as set X . We then defined eight quality measures that are based on how frequently we observed significant enrichment of Gene Ontology (GO) functional annotations and transcription factor binding site (TFBS) motifs among these sets of 100 genes (Table 1, see Methods for more details). In highquality networks this frequency should be high, and in low-quality networks it should be low.
Although the eight quality measures were based on sets of 100 highly correlated genes, using instead the top 50 or top 200 genes resulted in highly consistent quality estimates (S1 Fig).
We collected the eight quality measures of the 7,200 co-expression networks and Principal Component Analysis (PCA) revealed that they are highly consistent and correlated: the first PC explained 81.  Raw RNA-seq data was processed with 50 different combinations of normalization, batch effect correction, and correlation measures into 7,200 genome-wide sets of cell type and tissue-specific co-expression predictions, which we refer to as "co-expression networks". (B) Quality of co-expression networks was estimated based on the enrichment of functional annotations of correlated genes and regulatory motifs in their promoters. In random co-expression networks no common annotations and motifs are expected to be found among correlated genes. In contrast, in ideal networks such enrichments should be encountered frequently. Here nodes represent genes and edges co-expression. (C) Quality measures were processed into 7,200 quality scores, which were used for downstream analysis. We use 75% of cell types and tissues for regression and exploratory analysis, and the remaining 25% for validation.
https://doi.org/10.1371/journal.pone.0263344.g001 accounted for only 12.1% of the variance in the data and was did not show consistent correlation with all measures (S2A, S2B and S2D Fig). To facilitate the downstream analysis, we therefore decided to use this first PC as a general quality score (Quality, see Methods) after rescaling it to the range 0 (worst networks) to 1 (best networks) (S3A Fig). As an illustration, S3B Fig shows the quality measures for the networks with the lowest (Med + ComBat-seq + Spearman applied on human salivary gland data), 50 th percentile (Rlog + ComBat + Pearson applied on human neuron data), and highest (UQ + removeBatchEffect + Spearman applied on mouse liver data) Quality. From the lowest-quality network to the highest-quality network, the measures of quality are progressively increasing.
At this point we randomly split the 68 human and 76 mouse cell types and tissues into 2 groups (S1 and S2 Tables). One group (51 human and 57 mouse cell types and tissues; corresponding to 75% of the datasets) will be used for the analysis of features that contribute to high-quality gene co-expression predictions. We refer to this as our training set, and will focus on it in the following sections. The remaining 25% of cell types and tissues (17 human and 19 mouse cell types and tissues) will be used as an independent validation set later (section "The best workflows result in significantly better co-expression estimates"). Fig 2 shows the 50 workflows that we examined, sorted by the average Quality of the 108 (51 human and 57 mouse cell types and tissues in the training set) networks that they each resulted in. We observed that the top four workflows used UQ normalization, while Quantile normalization resulted in low average quality. Similarly, the top 15 workflows all include a batch correction step, while many of the worst-performing workflows did not treat batch effects. The topranking workflow (UQ + ComBat + Pearson) resulted in an above-average network for 102 (94%) of the 108 training datasets, and for 135 (94%) of all 144 datasets (S4 Fig).

Modeling the quality of co-expression networks
To gain more quantitative insights into what factors contribute to high-quality co-expression estimates, we performed linear regression on the Quality scores using as predictors: 1) the number of RNA-seq samples which the network was based on (log 10 values), 2) the number of batches in the data (log 10 values), 3) the species (human or mouse), 4) the data normalization approach, 5) batch correction approach, and 6) the correlation measure. In the next sections, we will focus on workflows that did not include ComBat-seq. ComBat-seq differs from Com-Bat and RemoveBatchEffect in that it takes integers as input and therefore cannot be used on data that has already been normalized. Networks generated using ComBat-seq will be treated separately in section "ComBat-seq results in lower-quality networks". The resulting linear model is summarized in Table 2. Despite its simplicity, this model explains 55% of the variability in Quality (R 2 = 0.55) in the training datasets. The reliability of estimated coefficients was confirmed by 4-fold cross validation (CV), each time leaving out 25% of the cell types and tissues and repeating the same linear regression (S3 Table). In each of the four models, the signs and relative magnitudes of coefficients was consistent. For example, in each case, the coefficient of the number batches was negative, and the ordering of the coefficients of normalizations methods was the same. Below we discuss the roles of sample counts, data normalization, batch effect correction, and correlation measures in more detail.

The importance of sample counts
The most significant predictor for the quality of co-expression estimates was the number of samples they were based on ( Table 2). Quality follows a roughly linear trend with the logarithm of the sample count ( Fig 3A). This is consistent with a previous study [11]. At the same Evaluating the quality of co-expression networks. All 50 workflows are shown in order of decreasing average quality of the networks they produced. From left to right are shown: Normalization method, batch effect correction method, and measure of correlation used in each workflow. Next, the number of training datasets (108 in total) in which the workflow resulted in an above-average quality network is shown, and the mean Quality of the 108 networks generated using each workflow.
https://doi.org/10.1371/journal.pone.0263344.g002 time, an increase in the number of batches results in a small decrease in Quality (Table 2). A number of samples obtained from a small number of batches is expected to result in better coexpression predictions than an equal number of samples generated by many smaller batches. It should be pointed out that there is a strong correlation between the number of samples and the number of batches for each cell types and tissue (S5 Fig; Pearson correlation 0.84) which can result in instability of estimated coefficients. However, in our 4-fold CV analysis, the coefficients of sample count (log 10 ) and batch count (log 10 ) were consistently positive and negative, respectively (S3 Table).
Quality being roughly linearly related to the logarithm of the sample count implies that an ever-increasing number of additional samples is needed to achieve the same improvement in quality. Collecting hundreds or thousands of additional samples is practically impossible under most circumstances. Therefore, it makes sense to look for data processing steps that can maximize the quality of co-expression predictions even in the absence of an increase in samples.

The importance of data normalization approach
Regression analysis revealed clear differences in the average quality of networks generated using the six different normalization methods (Table 2), confirming the tendencies observed in Fig 2. Med and UQ resulted in an average increase of 0.064 and 0.078 in Quality compared to the baseline (here Quantile normalization, the worse performing method), respectively. These improvements are equivalent to a 66% and 86% increase in sample count. Additional exploratory analysis revealed interactions exist between sample counts and normalization methods: the performance of normalization methods depends on the size of datasets. We divided our datasets according to sample counts into three sets of 36 cell types and tissues each. Fig 3B shows the Quality of networks based on small (20 to 44 samples), medium-sized (45 to 111 samples), and large (113 to 2,644 samples) datasets. While all normalization methods showed progressively higher performance with larger dataset sizes, UQ performed relatively well not only on the large, but also on the small and medium-sized datasets.

Correcting batch effects in general improves co-expression quality
Correction of batch effects by removeBatchEffect or ComBat resulted in better networks, increasing the Quality on average by respectively 0.041 and 0.047 compared to no correction ( Table 2). These improvements are equivalent to a 39% and 45% increase in sample count, respectively. However, here too, the improvement depends on sample counts, and on the number of batches in the dataset. The improvement in quality appears to increase roughly with the sample count, for both removeBatchEffect ( Fig 4A) and ComBat ( Fig 4B). Especially ComBat consistently resulted in higher-quality networks in larger datasets (Fig 4A and 4B). For datasets with >200 samples, ComBat's average improvement in Quality exceeded 0.10, equivalent to a >120% increase in sample count. removeBatchEffect failed to correct batch effects in a few datasets, resulting in somewhat worse overall quality (Fig 4A, indicated datasets). Batch effect correction offered no clear advantage when a dataset contained few batches (e.g. less than 5, Fig 4D and 4E). However, for datasets containing 5 or more batches, using ComBat or removeBatchEffect resulted in general in better networks.

Spearman correlation is preferred for small datasets
Using Spearman's correlation instead of Pearson correlation resulted on average in a 0.011 decrease in Quality (equivalent to a 8.2% decrease in sample count) ( Table 2). However, for small datasets (sample count < 30) Spearman's correlation had on average an advantage ( Fig  5). For medium-sized datasets (roughly 30 to 100 samples) Pearson's correlation lead in general to better co-expression networks, but the difference became smaller with higher sample counts.
These results make intuitive sense. Spearman's correlation, which is based on ranks and not on raw values, is less sensitive to extreme values. In small datasets, extreme values have a strong influence on correlation, adversely affecting Pearson's correlation. However, in medium-sized datasets, the influence of extreme values decreases and Pearson's correlation appears to be better able to capture biological signals.

ComBat-seq results in lower-quality co-expression estimates
On average, ComBat-seq did not result in high-quality networks compared to ComBat and removeBatchEffect (Fig 2). Adding a normalization step following the correction by ComBatseq did not improve qualities but rather reduced them (Fig 2). Zhang and colleagues themselves noted that on some datasets ComBat-seq did not outperform ComBat [19]. In our data, ComBat-seq resulted in lower-quality networks even in the datasets with more samples (Fig  4C) or more batches (Fig 4F). Although the reason for this failure is not clear, we did note that ComBat-seq returned unrealistically high read counts in a substantial subset of samples. For example, in 821 human samples (out of a total of 8,796) the total read count after correction exceeded 10 billion reads. A subset of genes had extremely high reads counts in some samples. For example, Prh1 had corrected read counts > 1e100 in a subset of human salivary gland samples. These observations suggest that ComBat-seq adjusted a subset of the data to negative binomial distributions that are extremely skewed, negatively affecting the quality of co-expression networks.

The best workflows result in significantly better co-expression estimates
Finally, we returned our attention to the eight raw quality measures, and the validation datasets. We compared the networks produced by a default workflow (Rlog + no batch correction + Pearson) with those produced by optimized workflows (UQ + ComBat + Spearman for datasets with < 30 samples, and UQ + ComBat + Pearson otherwise) (Fig 6). For all eight measures, using the optimized approaches resulted in a significant improvement compared to the default workflow (one-sided paired t-tests; all eight p-values < 0.01; improvements seen in 25 to 33 of the 36 validation datasets). The optimized workflows lead to co-expressed genes sharing common functional annotations more frequently (Fig 6A-6C left side), as well as shared annotations fitting with the known annotations of genes more frequently (Fig 6A-6C right side). The same was true for regulatory motifs in promoter sequences (Fig 6D).

Discussion
We presented a systematic analysis of 50 workflows for processing large collections of RNAseq data into gene co-expression predictions. We applied the workflows on data for 68 human and 76 mouse tissues and cell types, and estimated the quality of the resulting 7,200 sets of genome-wide gene co-expression datasets ("co-expression networks"). We used linear regression analysis to gain understanding of the important factors for obtaining high-quality coexpression networks. Our aim was to re-analyze existing large RNA-seq expression datasets, that have already been trimmed, aligned to a reference genome, and counted using a standardized pipeline. We focused on the steps of RNA-seq data normalization, batch effect correction, and measure of correlation of expression. Other studies have compared between read trimming and alignment approaches in related contexts [23].
We found that co-expression network quality is to a large degree determined by the number of samples which it is based on, as has been reported before [11]. It is therefore important to gather as many samples as possible. However, in practice the number of available samples is always limited. Moreover, co-expression network quality appeared to be roughly a linear function of the logarithm of the sample count. This means that for cell types and tissues with abundant data, adding hundreds of additional samples might result in only modest improvements in quality. It therefore makes sense to optimize the data processing workflow to obtain highquality networks even in the absence of large sample counts.
Treating batch effects in general lead to better networks. On average, ComBat performed better than limma's removeBatchEffect function. ComBat-seq however performed considerably worse than ComBat and removeBatchEffect. In our analysis, correcting batch effects using ComBat resulted in improvements to network quality equivalent to a 45% increase in sample count on average. In larger datasets, the advantage was even more pronounced, equivalent to roughly a doubling in sample count. Unfortunately, gene co-expression studies still often ignore batch effects. Clearly, more attention needs to be paid to the issue of batch effects in order to extract the maximum potential out of ever-increasing public gene expression datasets.
We found that some data normalization approaches lead to better co-expression estimates than others. Especially UQ performed well. UQ has also been found to perform relatively well compared with total count normalization (equivalent to CPM in our study) and quantile normalization in the context of predicting differentially expressed genes [24]. Other comparisons of RNA-seq normalization methods (outside of the context of co-expression) have come to different and conflicting conclusions [25][26][27].
The measure of correlation appeared to be less crucial, but Pearson's correlation seems to have a slight advantage, except when there are only small numbers of samples (<30). In the latter case, Spearman's correlation seems better.
Although no workflow dominated all others, UQ + ComBat + Pearson (or Spearman for small datasets) resulted in the best quality overall, and in above-average co-expression networks in >90% of the tissues and cell types we examined (Fig 2).
In addition to the findings described above, the dataset collection and the workflows used in this study are valuable resources. The collection of raw human and mouse samples and their annotation data have been made public, together with the data processed using UQ normalization and ComBat batch effect correction (see section "Data and code availability"). Scripts and an example workflow have been made public in a GitHub repository. We hope that together this data and code can serve as a basis for future studies.

Gene expression data collection and normalization
We used the RNASeq-er REST API of the European Bioinformatics Institute [18] to obtain read count data for 8,796 human and 12,114 mouse RNA-seq samples, produced by 401 and 630 studies, covering 68 human and 76 mouse cell types and tissues, respectively (see S1 Methods and S1 and S2 Tables). On these two large datasets of human and mouse samples, we applied the following six normalization approaches: Trimmed Mean of M-values (TMM). For all genes, log ratios are calculated versus a reference sample [28]. The most highly expressed genes, and genes with high log ratios are filtered out. The mean of the remaining log ratios is used as a scaling factor. This normalization method is the default normalization method of the edgeR function calcNormFactors [28,29]. In this study, we first removed genes that have less than 1 read per million reads in all samples prior to normalizing the remaining genes.
Counts per million (CPM). The number of reads per gene is divided by the total number of mapped reads of the sample and multiplied by 1 million [25,30].
There are several variations on CPM, describe below, including Upper Quartile and Median.
Upper Quartile (UQ). Counts are divided not by total count but by the upper quartile of non-zero values of the sample [24].
Median (Med). Counts are divided not by total count, but by the median of non-zero values of the sample [25].
Regularized Logarithm (RLog or RLE). A regularized-logarithm transformation is applied which is similar to a default logarithmic transformation, but in which lower read counts are shrunken towards the genes' averages across all samples. We applied this normalization using the R package DESeq2 [31].
Quantile. All samples are normalized to have the same quantiles. We applied this normalization using the function normalizeQuantiles of the limma R package [32].
Note that methods that correct for differences in gene length (RPKM and FPKM) are not relevant here, since they don't affect correlation values. In this study, these methods would be equivalent to CPM normalization.
We thus obtained 12 normalized datasets (6 each for the human and the mouse data). Each dataset was transformed to log values after addition of a small pseudo count (defined as the 1% percentile of non-zero values in the normalized dataset).

Batch effect correction using ComBat and limma's removeBatchEffect function
On the 12 log-transformed datasets we applied two batch effect correction methods: ComBat (function ComBat in the sva R package) and the removeBatchEffect function of the limma R package [13,32]. Both ComBat and removeBatchEffect allow users to specify batch covariates to remove from the data ("batch" parameter in both functions). Here, studies were used as substitutes for batches. Users can also give biological covariates to retain ("mod" parameter in ComBat and "design" parameter in removeBatchEffect). In this study, biological covariates are the cell type or tissue from which the samples were obtained. Both ComBat and removeBatch-Effect were used using default parameter settings.
To be able to treat batch effects in a dataset there can be no confounding between technical and biological covariates. In practice, studies often focus on a single cell type, which makes confounding of batches and cell types highly probable. Both the human and mouse datasets could be divided into several subsets with no shared cell types or tissue annotations. Therefore, batch effects were corrected for each of these subsets of samples separately, and finally the treated datasets were merged again into one dataset.
In addition to correcting the data using ComBat and removeBatchEffect we also considered 2 other options. One is to ignore batch effects and use the normalized data directly for estimating gene co-expression. Another is to use ComBat-seq.

Batch effect correction and normalization using ComBat-seq
ComBat-seq differs from ComBat and limma's removeBatchEffect function in that both its input and output are integer counts, making it more suitable for RNA-seq read count data [19]. To correct batch effects using ComBat-seq (function Combat_seq in the sva R package), we therefore gave as input the raw count data (without any normalization step applied to it), as well as the same batch covariates and biological covariates as we used for ComBat and remove-BatchEffect. ComBat-seq was used with default parameter settings. The output read counts were transformed to log values after adding a pseudocount of 1. These log-transformed data were used for estimating correlation of expression (see next section) without an additional normalization step.
However, quality estimates of the co-expression networks generated using ComBat-seq were relatively low compared to ComBat and removeBatchEffect, which use normalized data as input (Figs 2, 4C and 4F). Therefore, to avoid an unfair comparison, we also applied the 6 normalization approaches (TMM, CPM, UQ, Med, Rlog and quantile) on the ComBat-seq output data. Rlog normalization failed because of the extremely high reads counts in a subset of the data (see also section "ComBat-seq results in lower-quality co-expression estimates"). We therefore trimmed all read counts >1e9 to 1e9 before conducting the Rlog normalization.
Together, this resulted in 7 datasets which had been processed using ComBat-seq (i.e. 6 with a normalization step, and 1 without).

Estimating correlation of expression
For each of the 25 data processing combinations (6 normalization methods x 4 batch correction methods, and ComBat-seq without normalization), we calculated correlation of expression between each pair of genes in the log-transformed data for each cell type or tissue. We did this using Pearson's correlation and Spearman's rank correlation. Before calculating correlation coefficients in the expression data of a tissue or cell type, genes with general low levels of expression (less than 10 mapped reads in >90% of samples, or less than 10 reads in >80% of samples and fewer than 50 reads in all samples) or with no variation in expression (standard deviation = 0) were removed. The number of samples and genes used for calculating coexpression in each tissue and cell type are listed in S1 and S2 Tables.

Evaluation of gene co-expression network quality
The processing steps described above resulted in 7,200 (3,400 human and 3,800 mouse) sets of genome-wide cell type or tissue-specific gene co-expression predictions, which we refer to as "co-expression networks". The main goal of this study is to gain understanding into what are the critical features that distinguish good co-expression networks from bad ones. Because there is no gold standard co-expression network available, we first defined eight measures of quality (see also Table 1) based on the enrichment of biologically meaningful features (functional annotations of genes and regulatory DNA motifs in promoter sequences) among co-expressed genes. For each gene X in a co-expression network, we define set X to be the 100 genes with the highest correlation of expression with X (excluding X itself). Our quality measures are based on the enrichment of GO annotations and TFBS motifs in the set X of every gene X in each genome-wide co-expression network. These measures should not be interpreted as strict measures of accuracy of for example functional annotation predictions, but rather as rough indicators of quality of the inferred gene co-expression values.
GO enrichment frequency. Genes involved in the same biological process are expected to be co-expressed more frequently than unrelated sets of genes. In a high-quality co-expression network, we would expect the genes in set X to share a functional annotation (Fig 1B). In contrast, in a low-quality network (e.g. a randomly generated network) we expect genes in set X to have a random set of annotations. We define Enrichment MF , Enrichment BP , and Enrichment CC as the fraction of genes in a network for which set X contained one or more significantly enriched GO terms (after correction for multiple testing) for Molecular Function (MF), Biological Process (BP) and Cellular Component (CC) GO terms. The number of tested GO terms in each dataset is shown in S1 and S2 Tables.
GO enrichment accuracy. Where we found set X to have enriched GO terms, we checked if the enriched terms overlapped with the GO terms of gene X. Accuracy MF , Accuracy BP , and Accuracy CC were defined as the fraction of genes in the network for which this was the case for MF, BP, and CC GO terms.
TFBS enrichment frequency. Genes with similar expression profiles are likely to be under the control of a shared regulatory mechanism, including regulation by a similar set of transcription factors (TFs). In a high-quality co-expression network, we would therefore expect the genes in set X to contain a shared set of transcription factor binding sides (TFBSs). We define Enrichment TFBS as the fraction of genes in a network for which the promoter sequences of set X contained one or more significantly enriched TFBSs.
TFBS enrichment accuracy. Where we found set X to have enriched TFBSs, we checked if the promoter of gene X contained one or more of those TFBSs. Accuracy TFBS was defined as the fraction of genes in the network for which this was the case.
Correlation between the eight quality measures was high (range 0.60 to 0.98). To facilitate comparison between co-expression networks, the eight measures were combined into a single quality score, Quality, which is based on PCA of the eight measures. PCA was conducted using the function prcomp in R, after standardizing the eight quality measures to mean 0 and standard deviation 1. Analysis of the Principal Components (PCs) revealed that 81.4% of the total variation in the quality measures could be explained by the first PC (S2A Fig). We decided to use this first PC as the general quality score, Quality, after rescaling it to values between 0 (worst network) to 1 (best network). The correlation between Quality and each of the quality measures was high (range 0.77 to 0.96; S2C Fig). The p-values of Pearson correlation coefficients as shown in S1 and S2 Figs are based on a Student's t-distribution with n-2 degrees of freedom [33].
To evaluate the sensitivity of our quality measures with regard to the number of genes they are based on, we also calculated the eight quality measures using the top 50 and top 200 genes (instead of the top 100). We did this for 200 randomly selected co-expression networks (out of the total of 7,200). We used scatterplots of the results (S1 Fig) and Pearson correlation to evaluate the consistency of the eight quality measures based on the top 50, 100, and 200 genes.

Linear regression analysis
We randomly split the 68 human and 76 mouse cell types and tissues into four parts, each representing 25% of the human and mouse cell types and tissues. We used three parts to form the training set (representing 75% of cell types and tissues; 51 human and 57 mouse cell types and tissues), and the remaining part was used as a validation set (representing the remaining 25%; 17 human and 19 mouse cell types and tissues).
We conducted least squares regression using the lm function in R using the 3,888 networks of the training set, excluding networks where batch effects were treated using ComBat-seq. As response variable we used Quality, and as predictors we used 1) the number of RNA-seq samples on which the co-expression network was based (log 10 values), 2) the number of batches in the data (log 10 values), 3) the species (human or mouse), 4) the data normalization approach, 5) batch correction approach, and 6) the correlation measure. For the categorical predictors (i.e. species, data normalization approach, batch correction approach and correlation measure) the lm function uses dummy variables with values 0 and 1. The baseline levels of these categorical variable were set in a way that facilitates interpretation of the results. The resulting model as shown in Table 2 is the output of the R function lm, including the estimated coefficients, their standard errors, t values (= estimated coefficient/standard error), and p-value of a twosided t-test. In this case, the degree of freedom in the t-test is 3,876 (= the number of observations in the dataset minus the number of coefficients to estimate = 3,888-12).
To evaluate the stability of estimated coefficients of the above model, we step-by-step left out each of the three parts used to form the original training set, and used the other parts (the two remaining parts of the training set and the original validation set) to fit the same linear model, effectively implementing a 4-fold cross validation (CV) analysis. Coefficients of the resulting four models are shown in S3 Table. Supporting information  Table. Human datasets. The cell type or tissue, the number of RNA-seq samples, the number of genes included in the final co-expression network, and the number of GO terms tested for the estimation of the network quality is shown. The last column indicates which datasets were included in the validation set. (DOCX) S2 Table. Mouse datasets. The cell type or tissue, the number of RNA-seq samples, the number of genes included in the final co-expression network, and the number of GO terms tested for the estimation of the network quality is shown. The last column indicates which datasets were included in the validation set. (DOCX)

S3 Table. Coefficients and p-values of linear models trained by 4-fold cross-validation.
Human and mouse cell types and tissues were randomly divided into 4 folds. Each fold was left out and a linear regression model was trained on the remaining 3 folds. Model 1 is equivalent to the model shown in Table 2