## Figures

## Abstract

RNA-Sequencing (RNA-Seq) provides valuable information for characterizing the molecular nature of the cells, in particular, identification of differentially expressed transcripts on a genome-wide scale. Unfortunately, cost and limited specimen availability often lead to studies with small sample sizes, and hypothesis testing on differential expression between classes with a small number of samples is generally limited. The problem is especially challenging when only one sample per each class exists. In this case, only a few methods among many that have been developed are applicable for identifying differentially expressed transcripts. Thus, the aim of this study was to develop a method able to accurately test differential expression with a limited number of samples, in particular non-replicated samples. We propose a local-pooled-error method for RNA-Seq data (LPEseq) to account for non-replicated samples in the analysis of differential expression. Our LPEseq method extends the existing LPE method, which was proposed for microarray data, to allow examination of non-replicated RNA-Seq experiments. We demonstrated the validity of the LPEseq method using both real and simulated datasets. By comparing the results obtained using the LPEseq method with those obtained from other methods, we found that the LPEseq method outperformed the others for non-replicated datasets, and showed a similar performance with replicated samples; LPEseq consistently showed high true discovery rate while not increasing the rate of false positives regardless of the number of samples. Our proposed LPEseq method can be effectively used to conduct differential expression analysis as a preliminary design step or for investigation of a rare specimen, for which a limited number of samples is available.

**Citation: **Gim J, Won S, Park T (2016) LPEseq: Local-Pooled-Error Test for RNA Sequencing Experiments with a Small Number of Replicates. PLoS ONE 11(8):
e0159182.
https://doi.org/10.1371/journal.pone.0159182

**Editor: **Shu-Dong Zhang, Queen's University Belfast, UNITED KINGDOM

**Received: **March 30, 2015; **Accepted: **June 28, 2016; **Published: ** August 17, 2016

**Copyright: ** © 2016 Gim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **All data (R-package, package tutorial, datasets) are available from https://figshare.com/s/963e895f812d6f06468a.

**Funding: **This work is supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIP) (no. 2012R1A3A2026438) and by the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the NRF. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

High-throughput sequencing of cDNA derived from an RNA sample, known as RNA-Seq, has recently been developed and applied to various studies depending on the scientific interests such as detecting fusion genes, transcribed single nucleotide polymorphisms (SNPs), and differential expression (hereafter, DE) [1, 2]. In particular, profiling gene expression and testing DE between classes have been the primary process of many biological studies.

The main purpose of DE analysis is to identify transcripts that changed significantly in abundance across experimental conditions. This goal has been achieved using different statistical methods for data from array-based technologies. Compared to microarray, however, RNA-Seq has different characteristics such as high dynamic range and low background expression level. In order to address those properties, various methods have been proposed using Poisson and negative binomial distributions [3–5].

The software edgeR assumes that the mean and variance are related with a single proportionality constant that is the same throughout the experiments; thus, only one parameter needs to be estimated for each transcript [4]. Instead, DESeq decomposes the variance into two terms, a shot-noise term and a raw-variance term [5]. By assuming that the raw variance of each transcript is a function of the expectation value of a transcript’s concentration and condition, DESeq extends the model proposed by edgeR. NBPSeq uses an over-parameterized version of the negative binomial called “NBP distribution”, which incorporates a non-constant dispersion parameter directly within a parametric family [6]. All these three methods are based on similar principles, i.e., explicit modeling of the counts using a negative binomial distribution.

Comprehensive reviews on pre-existing methods for DE analysis with both real and simulation datasets has been recently published. By examining 11 methods including edgeR [4], DESeq [5], and NBPSeq [6], it was reported that all methods performed well with large sample sizes, while most of them showed weakness in DE analysis with small sample sizes [7]. A similar result was obtained in Seyednasrollah et al’s paper; unlike the study with large sample size, the choice of the method becomes critical when the number of samples is small [8]. With small sample size (two or three samples) per each class, the review papers reported that the methods based on negative binomial modelling, such as edgeR and DESeq, displayed relatively better performance [7–9].

Despite the decreasing sequencing costs, RNA-Seq experiments remain expensive to allow extensive biological replications. Moreover, limited specimen availability often leads to studies with a small number of (sometimes none of) replicates. Unfortunately, many of the existing methods have been largely unsuccessful in DE analysis with a small number of samples, and few researches have correctly addressed the problem arisen from non-replicated samples per each class. Therefore, there remains the need for an effective method that could be applicable to DE analysis with small samples or with one sample per each class.

Here, we proposed a method, named local-pooled-error with RNA-Sequencing data (LPEseq), which proposes to analyze DE detection of RNA-Seq experiments with a small number of or non-replicated samples in each class. Based on the LPE method [10], we extended the protocol to RNA-Seq data with a non-replicated set by introducing additional processes. Then, we compared the results obtained from the proposed method with those from several other techniques using both real and simulated datasets. Because it was reported that the performance of edgeR, DESeq, and NBPSeq with a small number of replicates is slightly better than that of other methods [7], these methods were chosen to compare the results with those obtained using our proposed method for replicated data analysis. Since DESeq was updated to DESeq2 recently, DESeq2 was also included in comparison [11]. However, for non-replicated data sets, only edgeR and DESeq are applicable and, thus, the comparison was made among edgeR, DESeq, and our proposed method for non-replicated data analysis. We found that LPEseq generally shows similar performance to other methods in the presence of replicates but performs better for non-replicated data sets.

## Materials and Methods

### A brief review of the LPE method

Jain *et al*. developed the original LPE method, which pools the error in each local intensity bin and shrinks each error variance estimate toward the mean of other probes (or genes) with similar intensities, for microarray experiments in which gene expression intensity is continuous [10]. The method first evaluates the baseline error distribution for each of the compared experimental conditions, say class *X* and *Y*, respectively. For duplicated arrays (subscript with 1 or 2) in each class, for instance in class *X*, the mean and difference (*M* = ±|*x*_{1}−*x*_{2}|) values between arrays are evaluated first. Then the variance of *M* on predetermined quantiles of *A* is evaluated and a cubic smoothing spline is fit to the variance estimates on the quantiles. The baseline error distribution for class *Y* is derived using the same way. The test statistic of the LPE method is calculated as follows:
(1)
where Med(.) denotes median and,
(2)
*n*_{X} and *n*_{Y} are the number of replicates in the two array samples being compared; and are the estimates of variance of the genes in class *X* and *Y*, respectively; and *τ* is a scaling factor [10].

The LPE method assumes that genes with similar observed intensities have similar expression variances. Based on this assumption, it estimates the gene-specific variance from a calibration curve derived from pooling the variance estimates of replicated expression differences of genes within similar expression intensities. Since the method is based on calculating the error variance from replicated experiments in the same class, it cannot be directly applied to experiments with no biological replicates in one or both classes. Furthermore, the method is suited for the analysis of continuous probe intensities on microarray platforms, unlike the read counts measured on sequencing technology.

### The proposed method: LPEseq

In this study, we improved the LPE method by focusing on two refinement aspects: applicability both to count measurement in RNA-Seq and to experiments with no replicates in each class. We describe here how we addressed these two aspects into the model. Simply, for the former case, the count variable is treated as a continuous variable by performing normalization and log-transformation. For the latter case, we treat two non-replicated samples as if they were replicates and remove the outliers so to reduce the impact on the LPE estimation. For RNA-Seq experiments with replicates, the proposed method conducts the similar procedure, *i*.*e*., estimating the variance of *M* from replicates in each class *X* and *Y*, as was described in the previous section.

To be more specific, we focused our attention on the problem of inferring DE between two different classes. Let and represent the number of read counts mapped to a specific transcript (or gene) *i* in the *j*_{th} sample (or replicate) from the experimental condition or class *X* and *Y*, respectively. Since and are influenced by the sequencing depth, these values are not directly comparable. Instead, the relative abundance of the transcripts across the samples can be normalized. Each transcript’s read count was divided by the total number of read counts for that experiment and was log-transformed, i.e., and .

Without replicates, only a single measurement in each class is available; thus, the subscript *j* can be dropped without ambiguity. Let *x*_{i} and *y*_{i} denote the normalized log-transformed count values of the transcript *i* under two different classes *X* and *Y*, respectively. The baseline error distribution was obtained by regarding each sample in different classes as replicates. Under this additional assumption, the *M* and *A* values were evaluated using two samples (each from different classes)., i.e., *M* = ±|*y*−*x*| and . Then , the variance of *M* of the genes on the predetermined *k*_{th} bin of *A*, was calculated in the same manner as for replicate analysis. In this case, however, since the variance of *M* is not drawn from the same class, differentially abundant transcripts could act as outliers and adversely affect proper evaluation of the LPE per each class.

Here, we give an example to understand the problem. Suppose we have a dataset consists of *n* transcripts duplicated in two different groups, say X and Y. Three transcripts among *n* transcripts are differentially expressed as shown in Fig 1B and 1C (DE transcripts are colored in red in the figure). With replicate information, the LPEs are drawn from each class separately as described the previous section (Fig 1B), whereas those with non-replicated experiments have to be evaluated using two single observations from two different classes (Fig 1C). When evaluated with non-replicated data, the LPEs of the three DE transcripts (colored in red in the Fig 1B and 1C) show a clear distinction compared to those with replicates and, thus, the fitted variance curve might be biased (red dashed curve in Fig 1C) due to these apparent outliers. By denoting an outlier in each quantile of a given data, we can remove outlying observations and recalibrate an LPE curve less affected by outliers (orange solid curve in Fig 1C).

(A) The flow chart of the proposed algorithm. The proposed method first determines intensity bins (percentile by default) and evaluates the LPE distribution differently depending on the existence of replicates in each class: LPE per each class with replicates and LPE between classes with non-replicated experiments. For non-replicated cases, the addition step smoothens the LPE distribution by removing outliers. Detailed examples are depicted in case of replicated (B) and non-replicated (C) experiments. Blue and green colors represent different classes (i.e., X and Y). The red dotted line and orange line represent the LPE curve with and without outliers, respectively. DE transcripts are colored in red.

For removing outliers, both formal (or tests of discordancy) tests and informal (outlier labeling) methods can be used. Formal tests based on *Z*-score [(, where *x*_{i} ∼ *N*(*μ*,*σ*^{2}), where *sd* denotes the standard deviation of data] usually assume a well-behaving distribution, and test if the target extreme observation is an outlier of that distribution. It is obvious that these tests largely depend on the type of observations and distribution assumed [12]. On the other hand, informal methods usually generate an interval or criterion for outlier detection instead of hypothesis testing. Therefore, any observations beyond this criterion could be considered as an outlier. LPEseq provides various strategies for detecting outliers and the criterion of fold difference between classes was used as the default one.

We summarize the detailed procedure of the LPE analysis without replicates as follows:

- Calculate the mean intensity of transcript
*i*in two conditions [i.e., (*x*_{i}+*y*_{i})/2]. - Calculate quantiles (percentiles by default) with the mean intensities of the whole transcripts evaluated at step I, and define “intensity bins” using adjacent quantile values. Thus the width of the bin depends on these quantile values and each bins have different widths but the same number of transcripts. Note that 100 bins are generated by default and the number of bins can be adjusted manually.
- Place all the transcripts in the bin to which their mean intensities belong.
- Label the transcript as an outlier if the difference between classes is larger than the threshold (Default value is 1.2, please see the S1 File and S1 Fig).
- Remove the outliers labeled in step IV and, then, evaluate the variance of
*M*for each bin, (3) where*i*_{(k)},*n*_{(k)}and denote the genes, the total number of genes, and the variance of*M*on the*k*_{th}bin, respectively. - Generate the LPE curve by fitting a cubic smoothing spline to the variance along with the bins. This makes the LPE as a function of transcript abundance level.
- Use the LPE function drawn in step VI to estimate the transcript-specific variance by plugging in the value of each transcript.

Once the variance in each experimental condition is derived, testing a hypothesis on differential expression is similarly conducted as for the analysis with replicates. Note that it is difficult to find the optimal number of bins which might affect the result of the DE test. However, the number of DEGs does not vary much with varying number of bins (S2 Fig). In addition to equal frequency interval binning (using percentiles), we also tried equal spaced interval binning. Although the results did not differ much, the equal spaced interval seemed to be very sensitive to the size of intervals (data not shown).

### Real RNA-Seq data sets

Four real datasets, which had been preprocessed and distributed by *Recount*, were examined [13]. We summarized the characteristics of each dataset below.

- Sultan’s dataset (two replicates in each condition): Sultan
*et al*. [14] performed RNA-Seq experiments in human embryonic kidney and B cell line. With each sample, two replicates were generated, and we analyzed the DE between the two conditions. The dataset provides a case in which a small number of replicates are available. - Hammer’s dataset (single replicate in each condition): Hammer
*et al*. [15] studied rats with chronic neuropathic pain induced by spinal nerve ligation (SNL) in serial experiments using RNA-Seq. The RNA of rats with SNL for 2 weeks and 2 months was sequenced and compared with that of controls. We performed DE analysis between the SNL and control groups for 2 weeks. The dataset provides a case in which only single replicates are available in each condition. - MAQC dataset (7 replicates in each condition): Bullard
*et al*. [16] considered two types of biological samples: Ambion’s human brain reference RNA and Stratagene’s human universal reference RNA, referred to as Brain and UHR, respectively. Since the dataset has seven replicates for each condition, the dataset provides a situation in which a large number of replicates are available. Also, the dataset was used to measure how much the result with 7 replicates is reproduced by the analysis with different numbers of replicates. - Montgomery’s dataset (60 normal samples): Montgomery
*et al*. [17] generated RNA-seq data with 60 unrelated normal Caucasian individuals. Among these 60, 33 individuals are female and 27 individuals are male. The dataset provides an analysis of a large biological samples with unequal group sizes.

### Simulation study

#### Parameter estimation from a real dataset.

The purpose of the simulation study was to investigate the ability of DE detection under varying effects of four different factors: effect size (counts difference), DE portion (number), dispersion, and number of replicates in each class. To generate the simulation datasets, we adopted the same assumption that has been made in many other studies [4, 5, 7, 18, 19], i.e., the abundance of transcript *i*, denoted by *X*_{i}, follows negative binomial distribution, *NB*(*μ*_{i},*ϕ*_{i}). Here, *μ*_{i} and *ϕ*_{i} are the mean and dispersion parameters, respectively.

We estimated (*μ*_{i},*ϕ*_{i}) from the Montgomery *et al*.’s real data set, one of the largest sample datasets including 60 unrelated normal Caucasian individuals [17]. The total number of transcripts of this data is 52580 and the number of sample, denoted as *N*, is 60. The log-likelihood function for 60 independent observations for each transcript *i* from a negative binomial distribution, given the counts *x*_{i1},⋯,*x*_{iN}, is
, a similar form in the supplementary material of the Soneson et al’s paper [7]. The maximum likelihood estimate (MLE) of *μ*_{i} was first obtained for each transcript across 60 samples; then, *ϕ*_{i} was estimated by numerically maximizing the log-likelihood function using the simulated annealing algorithm [20].

#### Simulated dataset generation.

Based on these estimates, we generated simulation datasets as follows. First, we generated 20,000 transcripts in total. Among them 70% were generated from the zero-inflated Poisson distribution with mean 1.25 and zero probability of 0.9. Then, *μ*_{i} for the remaining 30% were randomly selected from the values estimated from the real data. Then, for each transcript *i*, we generated transcript counts, *x*_{ij} and *y*_{ij} from *NB*(*μ*_{i},*ϕ*_{i} = *ϕ*) in each class. For DE transcripts, we randomly selected *κ*% of the total transcripts and added the effect size *δ* in one of the two classes. In our simulation scenario, we varied the value of *κ*, *δ*, *ϕ*, and the number of replicates per each class as stated below.

- Case I. Different effect size: (500, 1000, and 5000): We varied the effect size ranging from 500 to 5000 in counts, which roughly corresponded to the range of mean and maximum count difference between the two classes in the real dataset with total read counts similar to those of the simulated data (data not shown).
- Case II. Different number of DE transcripts: (1000 and 2000): The number of DE transcripts can vary due to the biological phenomena of interest and can affect the DE detection performance of the tests. We set 5% and 10% of total transcripts (1000 and 2000 transcripts, respectively) as to be DE and observed their effects on the DE detection ability.
- Case III. Different dispersion: (0.01, 0.25, and 0.4): We used values in the interquantile range [0.2, 0.41] of the estimates for the dispersion parameter
*ϕ*. We also included 0.01 (Poisson-like behavior) and 0.1 (around minimum of the estimates) values to observe the performance of the methods with small dispersion. - Case IV. Different number of replicates per each class: (1, 2, 3, and 5 replicates): We repeated the analyses of case I–III with different numbers of replicates per each class, focusing on small sample analysis.

All the cases were repeated 100 times.

### The R package of LPEseq

We implemented the method as a package for the R environment and named it “LPEseq.” It is freely downloadable from our website (http://bibs.snu.ac.kr/software/LPEseq) or Bioconductor. LPEseq uses any kind of count table format as the input file. The package was designed to perform the analysis with only a single command, allowing researchers who are unfamiliar with the R language to use it easily. All the analyses in this work were performed with the package LPEseq. In addition, the whole analysis code used in this work and the LPEseq manual can be found in our website.

### Comparison with other methods

We compared LPEseq with the pre-existing methods edgeR [4, 21], DESeq [5], DESeq2 [11], and NBPSeq [6] with respect to true positive rate (TPR) and FDR. In case of the studies with one sample per each class, DESeq2 and NBPseq were excluded for comparison because they are not applicable. The Benjamini–Hochberg procedure was used to adjust multiple testing problem [22]. Transcripts were reported as DE at an adjusted p-value threshold of 0.05. We compared TPR and FDR defined as the number of true DE transcripts detected divided by the total number of true DE transcripts and the number of false DE transcripts detected divided by the number of DE transcripts detected, respectively. In case of replicated data analysis, we ran the programs using the setting provided in the supplementary material of a review article [7] and the reproducible code provided in the DESeq2 [11]. For the non-replicated data analysis, we used the options recommended by manuals of edgeR and DESeq.

## Results and Discussion

### Analysis of the simulation data

We investigated the performance of LPEseq regarding the true DE detection ahead of false discovery under 72 combinatorial cases of four different parameters, *δ*, *ϕ*, *κ*, and *m* (i.e., effect size, dispersion, DE proportion, number of replicates, respectively).

The primary purpose of developing LPEseq was to apply it to data with non-replicated samples in each class. In such a case, among the methods under consideration, edgeR and DESeq are applicable. Thus, we compared the results obtained from LPEseq, edgeR and DESeq. The overall performances of the three methods with one replicate in each class are reported in Table 1. In the majority of cases, DESeq produced an error message of “parametric dispersion fit error”, suggesting the use of “local” fit option for the dispersion estimation by the package. Using the “local” fit option solved the problem but resulted in a loss of power where none of true DE transcript was detected, thus generating NAs for FDR except a few cases. However, the performance of edgeR and LPEseq was dramatic. In all cases, LPEseq showed less than 5% FDR and more than 92% TPR. Even though edgeR performed superior than LPEseq in aspect of finding true DE transcripts, showing more than 97% TPR, FDR was not well-controlled. The failure of edgeR to correctly find the DE transcripts might lead to suspicious interpretation of the outcomes. By comparing the results in Table 1, it is obvious that a significant improvement was obtained by applying LPEseq to datasets with non-replicated samples.

In this study, we sought to establish a method applicable to a small number of replicates, not just one in each class. Table 2 summarizes the results of five different methods with three replicates in each class. As can be seen, when replicated samples were used, no method is clearly superior under all conditions and all methods gave satisfactory results. It can be observed from the table that the performance of LPEseq, i.e.,<5% FDR and TPR quantitatively similar to that of other methods in all cases, is consistent with that of other methods. However, it should be noted that, in the NBPSeq method, in some cases the FDR was >5%. Note that we have also investigated the situations of different numbers (2, 5, and 10) of replicates and with a larger number (4,000) of DEGs. Similar results as described above were observed with these simulation settings (data not shown).

Here, we considered testing between two classes in RNA-Seq experiments assuming that transcripts (or genes) are independent from each other, which might be unlikely in many real datasets. We also have investigated a situation including correlated transcripts with simulated datasets and the results did not differ much in DE analysis (S3 Fig). Since the methods repeat a univariate statistical test for each genes separately, the inclusion of correlated genes shows little effects on the performance of DE analysis.

### Analysis with real data

We used LPEseq to analyze four different RNA-Seq data sets: non-replicates (Hammer’s dataset), small replicates (Sultan’s dataset), large replicates (MAQC dataset), and large normal sample with unequal group sizes (Montgomery’s dataset). The primary purpose of the analysis was to discover DE transcripts and to compare the results obtained using competing methods.

First, we compared the number of DE transcripts found by each method (Fig 2). For the Hammer’s dataset, which has only one sample in each class, LPEseq, edgeR and DESeq were applicable. Hammer *et al*. reported that about 2,000 transcripts were DE and, among them, validated 755 transcripts using quantitative polymerase chain reaction (qPCR) [15]. Interestingly, LPEseq obtained 2,030 DE transcripts, which was similar to what reported by the Hammer *et al*., while edgeR and DESeq detected 1,305 and 74 transcripts as DE, respectively (Fig 2A).

Two RNA-Seq datasets with one replicate (A) and two replicates (B) in each class. Five different methods, i.e., LPEseq (brown), edgeR (sky blue), DESeq (green), DESeq2 (violet) and NBPSeq (red) were used. A density plot of the mean difference between classes of uniquely found DE transcripts in each method was indicated. X- and Y-axis represent group mean difference and density. The number in parentheses indicates the total number of DE transcripts found. The criterion used to call DE was Benjamini-Hochberg corrected p-value less than 0.05 for all methods. The enriched terms gene set analysis was performed by DAVID web-tool.

As shown in Fig 2A, 357 genes were additionally called as DEGs by LPEseq. To identify their roles, we performed gene set analysis (GSA). We used gene ontology (GO) terms and KEGG pathways as gene sets and found the terms such as “defence and inflammatory responses”, “sensory perception of pain” and neuroactive ligand-receptor interaction” enriched in higher ranks (< 10% FDR). These terms are strongly related to chronic neuropathic pain induced by SNL, which was the interest of Hammer et al’s paper [15]. GSA with a list of genes only found by edgeR, however, failed to provide relevant information (see also S2 File for a whole list of DEGs and GSA results).

The Sultan’s dataset consists of two different cell types, a kidney and a B-cell, and it would be expected that for such a comparison a large number of genes are differentially expressed. As can been seen in Fig 2B, more DEGs compared to Hammer’s dataset were identified by all methods and many DEGs found by each method were overlapped each other (Fig 2B). The number of DE detected by LPEseq was 2,177, which was the minimum number of DE, while over 3,000 DE were detected by other methods except DESeq2. We noted that the almost DE transcripts found by LPEseq and DESeq2 overlapped with those identified by other methods. However, the other methods included non-overlapped DE transcripts, whose enriched gene functions have no explicit relevance to the comparison of the study (Fig 2B).

To observe characteristics of transcripts uniquely found by each method, we plotted the group mean differences in raw counts using DE transcripts found in each dataset (density plots in Fig 2). As shown in density plot in Fig 2, strong peaks were observed at zero in all competing methods, indicating that there were no considerable mean differences between classes. In fact, these DE transcripts included zero raw counts in both classes, demonstrating that many of these non-overlapping transcripts found were artifacts. For LPEseq, however, the mean difference between classes exhibited a bimodal peak distribution, indicating less likely the DE transcripts found by our proposed method being artifacts.

We also investigated MAQC dataset (comparing brain versus universal reference cell with 7 replicates per each condition) and the similar observations were made. Except edgeR, most methods identified uniquely found by each method and those uniquely identified genes with very small mean difference between conditions showed no enriched term by GSA. Only genes identified by LPEseq were enriched in transcriptional regulation function (See S4 Fig), which is highly activated in brain [25]. By studying real data, it is difficult to be conclusive that LPEseq performs better than other competing methods, but it limits the possibility of producing false calls and the result is robust to the number of replicates in each class (See also S5 and S6 Figs).

To make inferences based upon large as well as small samples through the LPEseq, we analyzed Montgomery’s dataset [17], which consists of 33 normal females and 27 normal males. We first compared the LPE curves evaluated with total samples and two randomly selected samples using the Kolmogorov-Smirnov (KS) test. More specifically, we estimated the variance curve conditioned on the expression levels of the genes using the total samples and we randomly selected one sample from each group and estimate the variance using LPEseq. Then we compared the variance curves by performing a KS test. By repeating 10 times, we obtained 10 p-values from a two-sample KS test and none of them rejected the null hypothesis (S6 Fig), showing the variance estimates from LPEseq method are concordant with the ‘true’ variance estimated from the total samples. We also conducted DE analysis with sex as the group indicator and interpreted the functions of the DEGs using GSA. Since the comparison was made between males and females, it might be reasonable to expect that the genes in sex-chromosomes (X and Y) are more likely to be differentially expressed. As can be seen in the Table 3, genes in Y chromosome were enriched in the list of DEGs identified by each methods, except NBPSeq. Also most of the genes in sex chromosomes were top ranked in the list sorted by increasing p-value (S1 Table). Taken all together, the results show that the LPEseq robustly performs regardless of the sample size and the performance with large sample sizes is comparable to other competing methods.

## Conclusions

We proposed a method for DE analysis with a small number of replicates, especially when a single replicate in each class is available. By extending the LPE method, the proposed method is applicable to the RNA-Seq experiments either with or without replicates per each class. Even though the proposed LPEseq followed the idea of the original LPE method, it is different from that in two aspects: (i) estimating local-pooled variance from different classes assuming them as replicates and (ii) removing outliers derived from the replicates assumption between classes. These two differences make DE analyses with non-replicated datasets feasible. By adding an auxiliary step, our proposed method performed more robustly compared to existing methods regardless of the number of replicates in each class, even when only one replicate in each class was available.

In general, the performance of DE test for RNA-Seq data can be influenced by sequencing depth and the number of replications. By increasing the sequencing depth, it is expected that the accuracy of expression counts will increase. With higher accuracy, we expect that LPEseq could better estimate the variances, resulting in improved performance. However, it was reported that the number of sample replicates is a more significant factor in accurate identification of DE genes in comparison to the sequencing depth [9, 26]. A further detailed investigation of the effect of sequencing depth on the performance of DE test is desirable.

We considered the reproducibility score which represents how much the result made with replicated data is reproduced by the analysis with non-replicated or a smaller number of replicated data in our model development. Using this score, the threshold value for detecting outliers was suggested. By comparing the reproducibility of five methods, LPEseq showed the largest number of DEGs and the highest overlaps between the analyses with subset of samples and total samples, indicating the robust reproducibility regardless of the number of samples (S4 and S5 Figs).

It is worth noting that, even though LPEseq takes the reproducibility into account to the statistical testing without replicates, the scope of any conclusions drawn from it may be limited and need to be interpreted with extra cautions. For example, the outliers flagged in LPEseq may not really be outliers. They can be treated as outliers due to the mean difference between two groups, and this might affect the DE result. We have investigated this effect with MAQC dataset as follows. For MAQC data, there are 7 replicates in each condition. We analyzed all possible combinations of choosing 2 replicates from 7 in the same condition. We performed DE tests as if the two replicates had been obtained from different conditions. The average number of DE found by LPEseq was 5.095 (SD was 5.991) from 52580 transcripts. This analysis shows that the LPEseq method controls the false discovery rate when there are no outliers. Through the analysis with non-replicated data, we believe that the proposed method can be useful to obtain results that are comparable with those obtained from data containing replicates.

Note that the log-transformed count data such as negative binomial data would be an approximation to normal distributed data [27]. Although our LPEseq method treats the log-transformed count data as continuous ones, we believe that the LPE method taking into account the discrete random variables from the NB distribution would have advantage. This extension will be our next research topic.

The proposed idea can be easily extended to more than two comparison problems. For example, more complicated models such as analysis of variance (ANOVA), the “local-pooled-error” and replicates assumption between classes can be applied to the ANOVA problems, with the exceptions that the residual mean sum of squares and residual degrees of freedom are evaluated using observations without outliers in a local bin. The method can also be applicable to test differential expression in the context of high-throughput experiments such as measuring protein expression or DNA methylation.

## Supporting Information

### S1 Fig. Suggestive threshold value D for non-replicated data analysis.

Six different datasets were used to suggest optimal threshold value used in LPEseq method. Reproducibility score versus D values is plotted in blue line with 95% confidence interval colored in red. The D value giving the highest reproducibility score is shown in the center of each plot. The key characteristics of the data appear below each plot.

https://doi.org/10.1371/journal.pone.0159182.s001

(TIF)

### S2 Fig. The number of DEGs with varying number of bins.

The number of DEGs is plotted with different number of bins (from 50 to 150 bins).

https://doi.org/10.1371/journal.pone.0159182.s002

(TIF)

### S3 Fig. FDR and TPR of LPEseq with correlated genes.

The effect of correlated genes in DE analysis with LPEseq is shown in boxplot for FDR (left) and TPR (right). The different proportions of correlated genes (blue) and the difference correlation coefficient between correlated genes (pink) were denoted in each plot. The analysis was repeated 100 times.

https://doi.org/10.1371/journal.pone.0159182.s003

(TIF)

### S4 Fig. MAQC data analysis.

Venn diagram of DEGs is shown for MAQC dataset. Five different methods, i.e., LPEseq (brown), edgeR (sky blue), DESeq (green), DESeq2 (violet) and NBPSeq (red) were used. A density plot of the mean difference between classes of uniquely found DE transcripts in each method was indicated. X- and Y-axis represent group mean difference and density. The number in parentheses indicates the total number of DE transcripts found. The criterion used to call DE was Benjamini-Hochberg corrected p-value less than 0.05 for all methods. The enriched terms gene set analysis was performed by DAVID web-tool.

https://doi.org/10.1371/journal.pone.0159182.s004

(TIF)

### S5 Fig. Reproducibility of the different methods with varying number of samples.

The overlapped proportion of DEGs with varying number of technical replicates (left) and biological replicates (right) are shown. The overlap proportion indicates the number of DEGs identified both with subset of samples and with total samples divided by the number of DEGs identified with total samples.

https://doi.org/10.1371/journal.pone.0159182.s005

(TIF)

### S6 Fig. Variance curve comparison.

The plot shows the variance curve estimated with different numbers of samples. The X-axis represents log-transformed intensity and the Y-axis does the variance estimates. The solid blue line indicates the ‘true’ variance curve (estimated using the total samples) and all other dashed lines the variance curve estimates using different numbers of samples. None of p-values by a two-sample KS test using the solid blue line and the dashed grey lines were less than 0.05.

https://doi.org/10.1371/journal.pone.0159182.s006

(TIF)

### S1 Table. The most significant DEGs and their chromosome position (top 8 genes are shown).

https://doi.org/10.1371/journal.pone.0159182.s007

(DOCX)

### S1 File. Supplementary Note.

A file describing how optimal threshold value ‘D’ was defined and evaluated for LPEseq.

https://doi.org/10.1371/journal.pone.0159182.s008

(DOCX)

### S2 File. An excel file containing DEG list and GSEA result of each method for Hammer’s dataset.

https://doi.org/10.1371/journal.pone.0159182.s009

(XLSX)

## Acknowledgments

Thanks to Yoav Benjamini (Tel Aviv University) and Nancy Cox (University of Chicago) for invaluable comments and helpful discussions. This work is supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIP) (No. 2012R1A3A2026438) and by the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the NRF.

## Author Contributions

**Conceived and designed the experiments:**TP JG.**Analyzed the data:**JG.**Contributed reagents/materials/analysis tools:**JG SW.**Wrote the paper:**JG TP.

## References

- 1. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009;458(7234):97–101. pmid:19136943; PubMed Central PMCID: PMCPMC2725402.
- 2. Gregg C, Zhang J, Weissbourd B, Luo S, Schroth GP, Haig D, et al. High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science. 2010;329(5992):643–8. pmid:20616232; PubMed Central PMCID: PMCPMC3005244.
- 3. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17. pmid:18550803; PubMed Central PMCID: PMCPMC2527709.
- 4. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25. pmid:20196867; PubMed Central PMCID: PMCPMC2864565.
- 5. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106. pmid:20979621; PubMed Central PMCID: PMCPMC3218662.
- 6. Di YM, Schafer DW, Cumbie JS, Chang JH. The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq. Stat Appl Genet Mol. 2011;10(1). Artn 2410.2202/1544-6115.1637. pmid:WOS:000291213200005.
- 7. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC bioinformatics. 2013;14:91. pmid:23497356; PubMed Central PMCID: PMC3608160.
- 8. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in bioinformatics. 2015;16(1):59–70. pmid:24300110; PubMed Central PMCID: PMC4293378.
- 9. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14(9):R95. pmid:24020486; PubMed Central PMCID: PMC4054597.
- 10. Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics. 2003;19(15):1945–51. pmid:WOS:000186179000012.
- 11. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. pmid:25516281; PubMed Central PMCID: PMC4302049.
- 12.
Iglewicz B, Hoaglin DC. How to detect and handle outliers. Milwaukee, Wis.: ASQC Quality Press; 1993. ix, 87 p. p.
- 13. Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011;12:449. pmid:22087737; PubMed Central PMCID: PMCPMC3229291.
- 14. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321(5891):956–60. pmid:18599741.
- 15. Hammer P, Banck MS, Amberg R, Wang C, Petznick G, Luo S, et al. mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome Res. 2010;20(6):847–60. pmid:20452967; PubMed Central PMCID: PMCPMC2877581.
- 16. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC bioinformatics. 2010;11:94. pmid:20167110; PubMed Central PMCID: PMC2838869.
- 17. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464(7289):773–7. pmid:20220756.
- 18. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40(10):4288–97. pmid:22287627; PubMed Central PMCID: PMCPMC3378882.
- 19. Li J, Tibshirani R. Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research. 2013;22(5):519–36. pmid:WOS:000325863700005.
- 20. Henningsen A, Toomet O. maxLik: A package for maximum likelihood estimation in R. Computation Stat. 2011;26(3):443–58. pmid:WOS:000292976100007.
- 21. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. pmid:WOS:000273116100025.
- 22. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met. 1995;57(1):289–300. pmid:WOS:A1995QE45300017.
- 23. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57. pmid:19131956.
- 24. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13. pmid:19033363; PubMed Central PMCID: PMC2615629.
- 25. Lein ES, Hawrylycz MJ, Ao N, Ayres M, Bensinger A, Bernard A, et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007;445(7124):168–76. pmid:WOS:000243384300043.
- 26. Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC genomics. 2012;13:484. pmid:22985019; PubMed Central PMCID: PMC3560154.
- 27. Cassie RM. Frequency-Distribution Models in the Ecology of Plankton and Other Organisms. J Anim Ecol. 1962;31(1):65–92. pmid:WOS:A1962WV72800006.