The authors have declared that no competing interests exist.

Methods of estimating the local false discovery rate (LFDR) have been applied to different types of datasets such as high-throughput biological data, diffusion tensor imaging (DTI), and genome-wide association (GWA) studies. We present a model for LFDR estimation that incorporates a covariate into each test. Incorporating the covariates may improve the performance of testing procedures, because they contain additional information based on the biological context of the corresponding test. This method provides different estimates depending on a tuning parameter. We estimate the optimal value of that parameter by choosing the one that minimizes the estimated error resulting from the bias and variance in a bootstrap approach. This estimation method is called an adaptive reference class (ARC) method.

Methods of estimating the local false discovery rate (LFDR) [

In many situations, the considered hypotheses are connected by a scientific context. However, ignorance of this scientific context in a data analysis can be misleading, because it may introduce bias into the LFDR estimates [

We consider coronary artery disease (CAD) data [(z_{i}, x_{i}) associated with each hypothesis for i = 1,…, N, where z_{i} is an observed test statistic and x_{i} represents the minor allele frequency (MAF). For our data, all observed test statistics are used to identify the disease-associated SNPs.

The horizontal line represents the threshold of 0.2. The vertical lines in (A) indicate the symmetric window around x_{i} = 0.0306 with Δ = 0.04.

A set of hypotheses or features used to determine the posterior probability of a null hypothesis is called a reference class. Consider the SNP with MAF x_{i} = 0.0306 and test statistic z_{i} = −4.3971, with an estimated LFDR of 0.1973 that is close to the threshold of 0.2. For this MAF, we define a reference class of SNPs in such a way that the MAFs are within a symmetric window around x_{i} = 0.0306, with width 2Δ. Different window widths yield different reference classes. Again, each subset of SNPs is used to estimate the LFDR.

The hypotheses can be divided into groups based on the characteristics of the problem. For example, the CAD data can be divided into two distinct groups according to MAFs: low-frequency SNPs (1% ≤ x_{i} ≤ 5%) and common SNPs (x_{i} > 5%). Thus, we need to decide which reference class should be used to determine the posterior probability that the SNP occurring at the MAF x_{i} = 0.0306 is not associated with the disease. Should we use the entire set of SNPs, or the low-frequency SNPs [

Many methods have been proposed for incorporating covariates into statistical techniques for testing multiple hypotheses. Bickel [

In this study, we assume the prior probability to be a function of covariates (Section 2.1). Then, we propose an adaptive reference class (ARC) method for estimating the LFDR, using a bootstrap approach to estimate the optimal reference class (Section 2.2). We compare the performance of the proposed ARC method and the CRC method using the mean squared error (MSE) as the performance criterion. We prove that, under certain assumptions, the ARC method has an MSE that is asymptotically no greater than that of the CRC method (see

Suppose _{01},…, _{0N} are considered simultaneously. For example in GWA study, let _{0i} denote the null hypothesis that the ^{th} SNP is not associated with the disease. Under the genetic additive model [^{2} test statistic _{i}. Under the ^{th} null hypothesis, it holds that ^{th} alternative hypothesis, we have ^{th} null hypothesis, assume that _{i} ∼ _{i} represents the z-transform that converts the ^{2} statistic into a standard normal statistic. In addition, for DTI data, let _{0i} denote the null hypothesis that there is no dyslexic-non-dyslexic difference for the ^{th} voxel. Under the ^{th} null hypothesis, assume that _{i} ∼ _{i} represents the z-transform which converts two-sample
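The z-transform mentioned above can be illustrated with a short sketch. It assumes a χ^{2} statistic with one degree of freedom and maps it through its null distribution function and then through the standard normal quantile function, so that the result is standard normal under the null hypothesis. The function name `chi2_to_z`, the clamping constants, and the sign convention (large statistics map to large positive z-values) are assumptions made for this illustration and are not taken from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def chi2_to_z(t):
    """Map a chi-square(1) statistic t >= 0 to a value that is N(0, 1) under the null."""
    # Upper-tail p-value of chi-square(1): P(T > t) = P(|Z| > sqrt(t)) = erfc(sqrt(t / 2)).
    p = math.erfc(math.sqrt(t / 2.0))
    # Clamp away from 0 and 1 so the normal quantile function stays defined.
    p = min(max(p, 1e-300), 1.0 - 1e-12)
    # z = Phi^{-1}(1 - p) = -Phi^{-1}(p); working with p keeps accuracy in the tail.
    return -STD_NORMAL.inv_cdf(p)
```

Under this convention the 95% quantile of the χ^{2} distribution with one degree of freedom maps to the 95% quantile of the standard normal distribution. The opposite sign convention, z = Φ^{-1}(p), would be obtained by dropping the minus sign.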

The observed statistics z = (z_{1},…, z_{N})^{T} are considered realizations of Z = (Z_{1},…, Z_{N})^{T}. Let A_{i} be an indicator variable for the event that the i^{th} alternative hypothesis H_{ai} is true. Assume that the A_{i}'s are independent and identically distributed (i.i.d.) Bernoulli(1 − π_{0}) variables, where π_{0} is the prior probability that the i^{th} null hypothesis is true. Let f_{0}(z_{i}) and f_{1}(z_{i}) be the null and alternative densities, respectively.

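The posterior probability that the i^{th} null hypothesis is true, given Z_{i} = z_{i}, is the LFDR [

ψ(z_{i}) = π_{0} f_{0}(z_{i}) / f(z_{i}; π_{0}),

where f(z_{i}; π_{0}) denotes the mixture density of Z_{i} given by

f(z_{i}; π_{0}) = π_{0} f_{0}(z_{i}) + (1 − π_{0}) f_{1}(z_{i}).

The null density f_{0}(z_{i}) of the statistic Z_{i} is the standard normal, and is called the theoretical null. The mixture density f(z_{i}; π_{0}) is estimated by fitting a high-degree polynomial to the histogram counts, denoted by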
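As a worked illustration of the two-groups model described above, the following sketch evaluates the LFDR as the ratio of the weighted null density to the mixture density. It is not the estimation procedure of the paper: the mixture density is computed here from a known prior and a hypothetical normal alternative density, whereas the text estimates it from histogram counts. The function name `lfdr` and the alternative parameters `mu1` and `sigma1` are assumptions chosen only for the example.

```python
from statistics import NormalDist

def lfdr(z, pi0, mu1=2.5, sigma1=1.0):
    """Two-groups LFDR with a N(0, 1) null and an assumed N(mu1, sigma1^2) alternative."""
    f0 = NormalDist(0.0, 1.0).pdf(z)        # null density
    f1 = NormalDist(mu1, sigma1).pdf(z)     # alternative density (assumption)
    f = pi0 * f0 + (1.0 - pi0) * f1         # mixture density
    return pi0 * f0 / f                     # posterior probability of the null

# LFDR at three statistics for a prior null probability of 0.9.
values = [lfdr(z, pi0=0.9) for z in (0.0, 2.0, 4.0)]
```

With these assumed parameters the LFDR decreases as the statistic moves toward the alternative mean, which is the behavior a threshold such as 0.2 relies on.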

We now extend the model described above to incorporate a covariate. Let X = (X_{1},…, X_{N})^{T} be i.i.d. random variables. Any test statistics are transformed to the standard normal statistic Z_{i}, for i = 1,…, N. The observed vector z = (z_{1},…, z_{N})^{T} is considered a realization of Z = (Z_{1},…, Z_{N})^{T}. Let A_{i} be the indicator of the event that the i^{th} alternative hypothesis H_{ai} is true. Assume that A_{i}|X_{i} = x_{i} ∼ Bernoulli(1 − π_{0}(x_{i})), where π_{0}(x_{i}), the prior probability that the i^{th} null hypothesis is true, is an unknown function of the given covariate X_{i} = x_{i}. We denote the posterior probability that the i^{th} null hypothesis is true, given Z_{i} = z_{i} and X_{i} = x_{i}, by

Ψ(z_{i}; x_{i}) = π_{0}(x_{i}) f_{0}(z_{i}) / f(z_{i}; x_{i}),

where the mixture density of Z_{i} conditional on the covariate X_{i} = x_{i} is given by

f(z_{i}; x_{i}) = π_{0}(x_{i}) f_{0}(z_{i}) + (1 − π_{0}(x_{i})) f_{1}(z_{i}; x_{i}).

Here f_{0}(z_{i}) denotes the null density of Z_{i} and f_{1}(z_{i}; x_{i}) is the alternative density of Z_{i}. In this mixture density, π_{0}(x_{i}) and f_{1}(z_{i}; x_{i}) are unknown. The ARC method is applied to estimate the LFDR Ψ(z_{i}; x_{i}) defined above.

Under the ARC method, certain assumptions only hold locally within a symmetric window for each covariate. Let a symmetric window of width 2Δ be centered at a given covariate X_{i} = x_{i}. Such a symmetric window covers the interval [x_{i} − Δ, x_{i} + Δ]. Let Δ_{0} denote the smallest considered value of the tuning parameter Δ. The reference class consists of the statistics z_{j} whose covariates x_{j} are within a distance Δ of x_{i}. Denoting the expected dimension of the reference class
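The construction of a reference class from a symmetric window can be sketched in a few lines. The function name `reference_class` and the representation of the data as a list of (statistic, covariate) pairs are assumptions made for this illustration.

```python
def reference_class(pairs, x_i, delta):
    """Return the (z, x) pairs whose covariate x lies within distance delta of x_i."""
    return [(z, x) for (z, x) in pairs if abs(x - x_i) <= delta]

# Hypothetical data: (test statistic, covariate) pairs.
pairs = [(-4.4, 0.03), (0.2, 0.01), (1.1, 0.05), (-0.3, 0.20)]
narrow = reference_class(pairs, x_i=0.03, delta=0.005)
wide = reference_class(pairs, x_i=0.03, delta=0.025)
```

A wider window yields a larger reference class, which typically lowers the variance of the LFDR estimate but may increase its bias when the prior probability changes with the covariate.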

The optimal tuning parameter Δ specifies the symmetric window width of a given reference class, and is determined by minimizing the errors resulting from the bias and the variance. In the following, we introduce several notational conventions.

Let the mean and variance of the estimator _{i} = _{i}, the prediction bias for the estimator _{i} = _{i} is defined as

We re-sample _{1}, _{1}),…, (_{N}, _{N})} until _{i}, _{i}), where _{i} ∈ _{i} ∈ ^{th} bootstrap sample ^{th} bootstrap reference class is defined as
_{i}; _{i}) based on the ^{th} bootstrap reference class is denoted by _{Δ}(_{i}) and _{0}(_{i}). We propose using a reference class _{j}s the covariates of which are within a distance Δ_{0} of _{i}. Thus, the estimator _{0}(_{i}). Denoting the bootstrap estimator of the prediction bias by _{j}s, the covariates of which are within a distance _{i}. Then, the optimal reference class

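Input: {(z_{1}, x_{1}),…, (z_{N}, x_{N})}; number of bootstrap samples B; smallest tuning parameter value Δ_{0}; tuning parameter Δ ≥ Δ_{0}

For b = 1,…, B:

Build the b^{th} bootstrap sample containing (z_{i}, x_{i}). For each Δ:

Determine the bootstrap reference class

Estimate the LFDR

Compute the bootstrap estimates of the prediction bias and variance

Minimize the estimated error over Δ ≥ Δ_{0}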
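The bootstrap selection of the window half-width can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the paper. The LFDR estimator `lfdr_hat` is a stand-in: it divides the standard normal null density by a Gaussian kernel density estimate and caps the result at 1, which amounts to taking the prior null probability equal to 1. The error criterion (squared difference from the mean estimate at the smallest window, plus the bootstrap variance), the names `select_delta`, `bandwidth`, and `n_boot`, and the plain resampling with replacement are likewise assumptions.

```python
import random
from statistics import NormalDist, mean, pvariance

STD_NORMAL = NormalDist(0.0, 1.0)

def lfdr_hat(z, zs, bandwidth=0.5):
    """Stand-in LFDR estimate: null density over a kernel density estimate, capped at 1."""
    if not zs:
        return 1.0  # empty reference class: fall back to the conservative value
    f_hat = mean(NormalDist(zj, bandwidth).pdf(z) for zj in zs)
    if f_hat <= 0.0:
        return 1.0
    return min(1.0, STD_NORMAL.pdf(z) / f_hat)

def select_delta(pairs, z_i, x_i, deltas, n_boot=30, seed=0):
    """Pick the half-width in deltas that minimizes bootstrap bias^2 + variance."""
    rng = random.Random(seed)
    delta0 = min(deltas)
    # Resample the (z, x) pairs with replacement, once per bootstrap replicate.
    samples = [rng.choices(pairs, k=len(pairs)) for _ in range(n_boot)]
    # One LFDR estimate per replicate and per candidate half-width.
    estimates = {
        d: [lfdr_hat(z_i, [z for z, x in s if abs(x - x_i) <= d]) for s in samples]
        for d in deltas
    }
    # The smallest window serves as the (approximately unbiased) reference.
    reference = mean(estimates[delta0])
    scores = {
        d: (mean(estimates[d]) - reference) ** 2 + pvariance(estimates[d])
        for d in deltas
    }
    return min(scores, key=scores.get), scores
```

A call such as `select_delta(pairs, z_i=-2.5, x_i=0.5, deltas=[0.05, 0.1, 0.3])` returns the chosen half-width together with the score of every candidate, which makes it easy to inspect how the trade-off between bias and variance shifts as the window widens.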

Let π_{0}(x_{i}) denote the true prior probability that the i^{th} null hypothesis is true. In the GWA study, the null hypothesis means no disease association, and in the DTI study, it means no difference between dyslexic and non-dyslexic children. For a given x_{0}, we suppose that the unknown prior probability π_{0}(x_{i}) is a step function of the covariate x_{i}, given by
_{01} and _{02} are both unknown, and _{01} ≤ _{02}. This function splits the _{0} and Δ_{0}, the observed vector of covariates _{0}(_{i})

_{i})

_{0}(_{i}) _{0}(_{i}).

The aim of the simulation analysis presented here is to compare the finite dataset performances of the CRC and ARC methods when estimating the LFDR in

We assume that the proportion of disease-associated SNPs tends to be very small. We present several simulation studies, each with a different value of x_{0} ∈ [0.05, 0.40]. The datasets are simulated as follows. In each simulation, we randomly generate 1000 datasets, each corresponding to an artificial case-control study. For each dataset, we simultaneously generate both the auxiliary information (the covariate x_{i}) and the observed Wald χ^{2} test statistic for each hypothesis. Each observed covariate x_{i} is generated randomly from the uniform distribution between 0 and 1.

In each simulation, the true prior probability π_{0}(x_{i}) is determined according to the given value of x_{0} as a function of the observed covariate, and A_{i} ∼ Bernoulli(1 − π_{0}(x_{i})) is generated independently. To generate the observed χ^{2} test statistics, if A_{i} = 1, the observed statistics are sampled from a non-central χ^{2} distribution; for each x_{0}, a different value of the non-centrality parameter is used. The χ^{2} test statistics when A_{i} = 0 are sampled from the central χ^{2} distribution. The χ^{2} test statistics are then transformed into standard normal statistics.
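The data-generating steps described above can be sketched as follows. Every numerical setting here is hypothetical: the cut point `x0`, the two prior levels `pi01` and `pi02`, the side of the cut point on which each level applies, and the non-centrality value `noncentrality` are placeholders, not the values used in the paper. The helper `chi2_to_z` assumes one degree of freedom and a particular sign convention.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def pi0_step(x, x0, pi01, pi02):
    """Assumed step prior: pi01 at or below the cut point x0, pi02 above it."""
    return pi01 if x <= x0 else pi02

def chi2_to_z(t):
    """Map a chi-square(1) statistic to a value that is N(0, 1) under the null."""
    p = math.erfc(math.sqrt(t / 2.0))          # upper-tail p-value
    p = min(max(p, 1e-300), 1.0 - 1e-12)       # keep the quantile function defined
    return -STD_NORMAL.inv_cdf(p)

def simulate(n, x0=0.2, pi01=0.80, pi02=0.95, noncentrality=9.0, seed=1):
    """Generate n triples (z, x, a): statistic, covariate, alternative indicator."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()                                   # covariate ~ Uniform(0, 1)
        a = 1 if rng.random() < 1.0 - pi0_step(x, x0, pi01, pi02) else 0
        shift = math.sqrt(noncentrality) if a == 1 else 0.0
        t = (rng.gauss(0.0, 1.0) + shift) ** 2             # (non-)central chi-square(1)
        rows.append((chi2_to_z(t), x, a))
    return rows
```

Squaring a shifted standard normal draw gives a non-central χ^{2} variable with one degree of freedom, so no external library is needed for this sketch.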

Each dataset has _{i}, _{i}). The total number of pairs _{i}, _{i}) is selected randomly from each dataset to estimate Ψ(_{i}; _{i}). For a given covariate _{i}, the estimators of the LFDR are computed using the two methods. Under the ARC method, Δ_{0} has to be specified in advance in order to determine _{0} ∈ (0, _{0}), and set _{0} values and the region of the covariates. When _{0} results in a smaller MSE approximation for the ARC method in the regions _{0}.

_{0}(_{i}) = _{0} for

The true prior probabilities are constant, π_{0}(x_{i}) = π_{0}. The log_{2} value is given for the marginal relative MSE. Under the ARC method, Δ_{0} is 0.01.

π_{0} | 0.60 | 0.80 | 0.90 | 0.95 |
ReMSE_{marg} | 0.9030 | 0.8156 | 1.3729 | 2.0908 |

We apply both the ARC and CRC methods to the CAD and DTI datasets, and compare the disease-associated SNPs and dyslexic-non-dyslexic difference voxels under each method, respectively. The purpose of this comparison is to demonstrate the practical difference between the methods rather than to determine which method performs better.

The CAD dataset originating from the United Kingdom includes 500,568 SNPs genotyped for 2,000 cases and 3,000 combined controls on 22 autosomal chromosomes. The control individuals come from two groups: 1,500 individuals from the 1958 British Birth Cohort (58C), and 1,500 individuals selected from UK blood services (UKBS) controls. Following [^{−7} using trend tests and general genotype tests between each case and the combined controls. We also excluded SNPs with MAFs smaller than 0.01. A total of

The CAD-related data introduced in Section 1.1 are analyzed with Δ_{0} = 0.001. The results show that 160 SNPs are disease-associated based on the ARC method, while the CRC method detects 44 disease-associated SNPs. The value of Δ_{0} has a direct effect on the number of disease-associated SNPs. Under the ARC method, increasing the value of Δ_{0} brings the proportion of disease-associated SNPs closer to the corresponding proportion under the CRC method.

(A) presents the LFDR estimate under the ARC method for Δ_{0} = 0.001 versus that for the CRC method and (B) illustrates the proportion of disease-associated SNPs under the ARC method when the LFDR estimate is less than 0.2 versus Δ_{0} ∈ (0, 0.50).

Schwartzman et al. [^{2}/s). In this study, 12 children were tested and divided equally between two groups (dyslexic and non-dyslexic). Each child received DTI brain scans at N = 15443 locations, each represented by its own voxel's response. The aim is to determine the dyslexic-non-dyslexic difference at the i^{th} voxel (location), in relation to reading development in children aged 7-13 [(z_{i}, x_{i}) associated with each hypothesis for i = 1,…, N, where z_{i} is an observed test statistic that compares the dyslexic children with those who are not (see Section 2), and x_{i} is the location (i.e., the distance from the back of the brain to the front). We apply both the ARC and CRC methods to the DTI data and compare the dyslexic-non-dyslexic difference voxels under each method. All N = 15443 voxel responses are used in the following statistical analysis.

Let _{1}, _{2},…, _{N})^{T} are considered as a realization of _{1}, _{2},…, _{N})^{T}. Under the ^{th} null hypothesis, it holds that _{i} ∼ _{1}, while under the ^{th} alternative hypothesis _{i} ∼ _{1,δ}, where _{0}(_{i}) ∼ _{1} (i.e., null density), and _{i}; _{0}, _{1}(_{i}; _{0}. _{0} = 20. We observe from _{0} has a direct effect on the number of dyslexic-non-dyslexic difference voxels. Under the ARC method, increasing the value of Δ_{0} brings the proportion of dyslexic-non-dyslexic difference voxels closer to the corresponding proportion under the CRC method.

(A) presents the LFDR estimate under the ARC method for Δ_{0} = 20 versus that for the CRC method and (B) illustrates the proportion of dyslexic-non-dyslexic difference voxels under the ARC method when the LFDR estimate is less than 0.2 versus Δ_{0} ∈ (0, 50).

In this study, we employ a novel approach that incorporates a covariate (i.e., a scientific context corresponding to each hypothesis test) to improve the LFDR estimate when identifying alternative hypotheses. Using this approach, both the test statistic distribution under the alternative hypothesis and the prior probability that the null hypothesis is true are modulated by the covariate. In the case where the prior probability π_{0}(x_{i}) is the step function given in _{0}(_{i}) in region _{0}, the proportion of significant null hypotheses decreases, and approaches the proportion based on the CRC method. This suggests that further investigation may be necessary into how the tuning parameter Δ_{0} can be controlled to improve results.
