I have read the journal’s policy and have the following conflicts: EU is a member of the scientific board and a consultant at Nabsys Inc.

Conceived and designed the experiments: FV BJR EU. Performed the experiments: FV AP. Analyzed the data: FV AP. Contributed reagents/materials/analysis tools: FV BJR EU. Wrote the paper: FV BJR EU.

A key challenge in genomics is to identify genetic variants that distinguish patients with different

The identification of genetic variants associated with survival time is crucial in genomic studies. To this end, a number of methods have been proposed for computing a

Next-generation DNA sequencing technologies are now enabling the measurement of exomes, genomes, and mRNA expression in many samples. The next challenge is to interpret these large quantities of DNA and RNA sequence data. In many human and cancer genomics studies, a major goal is to find associations between an observed phenotype and a particular variable (e.g., a single nucleotide polymorphism (SNP), somatic mutation, or gene expression) from genome-wide measurements of many such variables. For example, many cancer sequencing studies aim to find somatic mutations that distinguish patients with fast-growing tumors that require aggressive treatment from patients with better prognosis. Similarly, many human disease studies aim to find genetic alleles that distinguish patients who respond to particular treatments, i.e. live longer. In both of these examples one tests the association between a DNA sequence variant and the

The most widely used approach to determine the statistical significance of an observed difference in survival time between two groups is the log-rank test [

(A) In a typical clinical study, two pre-selected groups of similar size are compared. Because the groups are balanced and each has a suitable number of patients, the asymptotic approximation (normal distribution) used in common implementations of the log-rank test gives an accurate approximation of the exact distribution, resulting in accurate

The design of a genomics study is typically very different from the traditional clinical trials setting. In a genomics study, high-throughput measurement of many genomics features (e.g. whole-genome sequence or gene expression) in a cohort of patients is performed, and the goal is to

While this fact has been noted in the statistics literature [

We propose to compute the ^{2} test [

We introduce an efficient and mathematically sound algorithm, called ExaLT (for ^{−9} is required if one wants to test the association of 1% of the human genome (e.g., the exome) with survival. A standard MC approach (with the Clopper-Pearson confidence interval estimate) would require the evaluation of ≥ 10^{11} samples, which for a population of 200 patients takes > 8 days; in contrast, ExaLT is capable of estimating ^{−13} on 200 patients in < 2 hours. In contrast to heuristic approaches (see

We first assessed the accuracy of the asymptotic approximation for the log-rank test on simulated data from a cohort of 500 patients with a gene

The inaccuracy of the asymptotic log-rank test results in a large number of false discoveries: for example, considering a randomized version of a cancer mutation dataset (

The

As noted above, there are two exact distributions for the log-rank test in the literature: the permutational distribution [

To demonstrate the applicability of ExaLT we compared the ^{−6} is reported. In contrast, ExaLT computes an exact ^{−3}, a reduction of three orders of magnitude in the significance level. Additional comparisons are shown in the

In [^{−3}, while the exact permutational ^{−4}, indicating a

We analyzed somatic mutation and survival data from studies of six different cancer types (

Each data point represents a gene. (A) Comparison of the

For most datasets the asymptotic ^{−8} and an additional 19 genes have ^{−5}, but none of these have a known association with survival.

The top 10 genes reported by ExaLT contain several novel associations that are supported by the literature and are not reported using ^{−4}), ^{−3}) and ^{−3}), among others. As noted above, the association between mutations in ^{−3}) and ^{−3}), and others. Germline and somatic mutations in

Thus, the exact test implemented by ExaLT appears to have higher sensitivity and specificity in detecting mutations associated with survival on the sizes of cohorts analyzed in TCGA. Finally, we note that the exact conditional test obtains results similar to ^{−5} by ExaLT and ^{−10} by the asymptotic permutational test.

In this work we focus on the problem of performing survival analysis in a genomics setting, where the populations being compared are not defined in advance, but rather are determined by a genomic measurement. The two distinguishing features of such studies are that the populations are typically unbalanced and that many survival tests are performed for different measurements, requiring highly accurate

The problem with the log-rank test for unbalanced populations has previously been reported [

We considered the two versions of the log-rank test, the conditional [

The method we propose can be generalized to assess the difference in survival between more than two groups, by considering the exact permutational distribution for the appropriate test statistic. For this reason, our method can be adapted to test the difference in survival between groups of patients that have homozygous or heterozygous mutations, or to test whether the presence of a group of genomic features has a different effect on survival compared to the presence of the single genomic features. For the same reason, our method can incorporate categorical covariates. It is unclear, however, how methods based on the log-rank test, such as ours, can incorporate continuous covariates or how they can be used to assess specific (e.g., additive) models of interactions between genomic features and survival.

While our focus here was the log-rank test, our results are relevant to more general survival statistics. First, in some survival analysis applications, samples are given different weights; our algorithm can be easily adapted to a number of these different weighting schemes. Second, an alternative approach in survival analysis is to use the Cox Proportional-Hazards model [

Extending multivariate regression models to the multiple-hypothesis setting of genome-wide measurements is not straightforward. Direct application of a multivariate Cox regression will often not give reasonable results for two reasons: there are a limited number of samples and a large number of genomic variants, and many variants are rare and not associated with survival. Witten and Tibshirani (2010) [

We focus here on the two-sample log-rank test, which compares the survival distributions of two groups, $P_0$ and $P_1$. Let $t_1 < t_2 < \dots < t_k$ be the times of observed, uncensored events; in case of ties, we assume that they are broken arbitrarily. Let $R_j$ be the number of patients at risk at time $t_j$, i.e. the number of patients that survived (and were not censored) up to this time, and let $R_{j,1}$ be the number of $P_1$ patients at risk at that time. Let $O_j$ be the number of observed uncensored events in the interval $(t_{j-1}, t_j]$, and let $O_{j,1}$ be the number of these events in group $P_1$. If the survival distributions of $P_0$ and $P_1$ are the same, then the expected value of $O_{j,1}$ is $E[O_{j,1}] = O_j R_{j,1} / R_j$. The log-rank statistic measures the sum of the deviations of $O_{j,1}$ from the expectation,

$$V = \sum_{j=1}^{k} \left( O_{j,1} - O_j \frac{R_{j,1}}{R_j} \right).$$

(In some clinical applications one is more interested in either earlier or later events. In that case the statistic is a weighted sum of the deviations. Our results easily translate to the weighted version of the test.) Under the null hypothesis of no difference in the survival distributions of the two groups,
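As an illustration of the statistic described above, the following is a minimal sketch that sums, over the uncensored event times, the difference between the observed and expected number of events in the smaller group, and then applies the usual normal approximation. The function names, the variable names (`r_j`, `o_j`, and so on), the grouping of tied event times, and the textbook hypergeometric variance are assumptions made for the example; this is not the code used in ExaLT or in any particular statistics package.

```python
from math import sqrt, erfc

def logrank_statistic(times, events, groups):
    """Return (V, variance) for the two-sample log-rank test.

    times  : survival or censoring time of each patient
    events : 1 if the event was observed, 0 if it was censored
    groups : 1 if the patient belongs to the group of interest, 0 otherwise
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    v = 0.0
    var = 0.0
    for tj in event_times:
        # Patients still at risk at time tj (event or censoring time >= tj).
        at_risk = [g for t, g in zip(times, groups) if t >= tj]
        r_j = len(at_risk)        # total number at risk
        r_j1 = sum(at_risk)       # number at risk in group 1
        # Uncensored events at tj (ties are grouped here; an assumption).
        o_j = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        o_j1 = sum(1 for t, e, g in zip(times, events, groups)
                   if t == tj and e == 1 and g == 1)
        v += o_j1 - o_j * r_j1 / r_j
        if r_j > 1:
            # Hypergeometric variance of the group-1 event count at tj.
            frac = r_j1 / r_j
            var += o_j * frac * (1 - frac) * (r_j - o_j) / (r_j - 1)
    return v, var

def asymptotic_p_value(v, var):
    """Two-sided p-value from the normal approximation V / sqrt(var) ~ N(0, 1)."""
    z = abs(v) / sqrt(var)
    return erfc(z / sqrt(2))

# Toy cohort: four patients, no censoring, one patient in group 1.
v, var = logrank_statistic([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 0, 0])
p_asymptotic = asymptotic_p_value(v, var)
```

With a single patient in the smaller group, as in the toy cohort, the normal approximation is applied far outside the balanced setting for which it is known to be accurate.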

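In the permutational log-rank test, each patient is assigned to $P_0$ or $P_1$ independently of the survival time. Let $n$ be the total number of patients and $n_1$ the number of patients in group $P_1$. We consider the sample space of all $\binom{n}{n_1}$ possible selections of the $n_1$ patients of group $P_1$. Each such selection is assigned equal probability $\binom{n}{n_1}^{-1}$.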
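For very small cohorts, the sample space described above can be enumerated directly. The sketch below is a brute-force illustration, not the ExaLT algorithm: it assumes patients are already sorted by increasing time with no ties, and it uses a two-sided comparison on the absolute value of the statistic, which is an assumption of this example. The function names are hypothetical.

```python
from itertools import combinations

def logrank_v(events, labels):
    """Log-rank statistic for patients sorted by increasing time (no ties).

    events[i] is 1 if the i-th time is an observed event, 0 if censored;
    labels[i] is 1 if the i-th patient is in the group of interest.
    """
    n = len(events)
    at_risk_1 = sum(labels)
    v = 0.0
    for i in range(n):
        at_risk = n - i
        if events[i] == 1:
            # Observed minus expected number of group-1 events at this time.
            v += labels[i] - at_risk_1 / at_risk
        at_risk_1 -= labels[i]    # patient i leaves the risk set
    return v

def exact_permutational_p_value(events, labels):
    """Fraction of equally likely label selections at least as extreme as observed."""
    n, n1 = len(labels), sum(labels)
    v_obs = abs(logrank_v(events, labels))
    hits = total = 0
    for chosen in combinations(range(n), n1):
        perm = [0] * n
        for i in chosen:
            perm[i] = 1
        total += 1
        # Small tolerance so that floating-point ties count as "as extreme".
        if abs(logrank_v(events, perm)) >= v_obs - 1e-12:
            hits += 1
    return hits / total

p_exact = exact_permutational_p_value([1, 1, 1, 1], [1, 0, 0, 0])
```

The number of selections grows combinatorially with the cohort size, so this enumeration is only practical as a correctness check on toy inputs.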

In the conditional log-rank test, the null distribution is defined by conditioning on $O_j$, $R_j$, and $R_{j,1}$ for $j = 1, \dots, k$: if at time $t_j$ there are a total of $R_j$ patients at risk, including $R_{j,1}$ patients in $P_1$, then under the assumption of no difference in the survival of $P_0$ and $P_1$ the $O_j$ events at time $t_j$ are split between $P_0$ and $P_1$ according to a hypergeometric distribution with parameters $R_j$, $R_{j,1}$, and $O_j$.
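The hypergeometric split at a single event time can be written out explicitly. The sketch below is only an illustration of that one-time-point distribution; the function name and argument names are hypothetical and are chosen to mirror the quantities named in the text (patients at risk, patients at risk in the smaller group, and events at that time).

```python
from math import comb

def hypergeom_pmf(o_j1, r_j, r_j1, o_j):
    """Probability that o_j1 of the o_j events fall in the smaller group.

    r_j  : patients at risk at this event time
    r_j1 : patients at risk that belong to the smaller group
    o_j  : uncensored events at this event time
    """
    lowest = max(0, o_j - (r_j - r_j1))
    highest = min(o_j, r_j1)
    if o_j1 < lowest or o_j1 > highest:
        return 0.0
    return comb(r_j1, o_j1) * comb(r_j - r_j1, o_j - o_j1) / comb(r_j, o_j)

# Ten patients at risk, three in the smaller group, two events at this time.
pmf = [hypergeom_pmf(x, 10, 3, 2) for x in range(3)]
mean = sum(x * p for x, p in enumerate(pmf))
```

The mean of this distribution equals the number of events multiplied by the fraction of at-risk patients in the smaller group, which is the expected count subtracted in the log-rank statistic.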

We considered the two versions of the log-rank test, the conditional [χ^{2} distribution; the two versions of the test are related, and our results hold for the version of the log-rank test based on the χ^{2} distribution as well (

In the case of small and unbalanced populations, the two null distributions yield different

While the exact computation of

We developed an algorithm,

Since the log-rank statistic depends only on the _{j} = ∣_{j}∣, for _{0}+_{1} be the total number of patients. We represent the data by two binary vectors ^{n} and ^{n}, where _{i} = 1 if the _{1} and _{i} = 0 otherwise; _{i} = 0 if the _{i} = 1 otherwise. Note that

Let _{1}, and _{1} events of _{1} are uniformly distributed among the

For any 0 ≤ _{1}, let _{t}(_{1} occur in the first _{t}(_{1} occur in the first

At time 0:

Given the values of _{t+1} = 1 then
_{t+1} = 0 then

The process defined by these equations guarantees that the _{1} events of _{1}. Thus, the _{1},−∣_{1},∣

For fixed _{t+1} = 1, then as we vary

Similar relations hold for _{t+1} = 0, and for computing
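The idea of building the exact distribution one sorted patient at a time can be illustrated with a small dynamic program. This is a sketch under stated assumptions and not the ExaLT implementation: the state representation (number of group labels placed so far, partial value of the statistic), the use of exact rational arithmetic, and the function name are choices made for the example. It assumes patients are sorted by increasing time with no ties.

```python
from collections import defaultdict
from fractions import Fraction

def exact_v_distribution(events, n1):
    """Exact permutational distribution of the log-rank statistic.

    events[t] is 1 if the t-th sorted time is an observed event, 0 if censored.
    n1 is the size of the smaller group. Returns {value: probability}.
    """
    n = len(events)
    # State (r, v): r group-1 labels placed so far, v = partial statistic.
    states = {(0, Fraction(0)): Fraction(1)}
    for t in range(n):
        remaining = n - t                     # patients still at risk
        nxt = defaultdict(Fraction)
        for (r, v), p in states.items():
            left_1 = n1 - r                   # group-1 patients still at risk
            expected = Fraction(left_1, remaining)
            if left_1 > 0:
                # Patient t is in group 1 with probability left_1 / remaining.
                dv = (1 - expected) if events[t] else 0
                nxt[(r + 1, v + dv)] += p * expected
            if remaining - left_1 > 0:
                # Otherwise patient t is in the other group.
                dv = -expected if events[t] else 0
                nxt[(r, v + dv)] += p * (1 - expected)
        states = nxt
    dist = defaultdict(Fraction)
    for (_, v), p in states.items():
        dist[v] += p
    return dict(dist)

dist = exact_v_distribution([1, 1, 1, 1], 1)
v_obs = Fraction(3, 4)
p_two_sided = sum(p for v, p in dist.items() if abs(v) >= abs(v_obs))
```

The number of distinct partial values tracked by such a procedure can grow very quickly with the cohort size, so this exact version is suitable only for small examples.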

We construct a polynomial time algorithm by modifying the above procedure to compute the probabilities of only a polynomial number of values in each iteration. We first observe that since the probability space consists of _{1} such that (1−_{1})^{−n} = 1+_{1} = _{1})^{k}, for _{1})^{k}. We prove that if iteration _{1},_{1},

We implemented the FPTAS in our software ExaLT and evaluated its performance as

We note that, given parameters _{1},^{n} with

We used synthetic data to assess the accuracy of the asymptotic approximations. We generated data as follows: when no censoring was included, we generated the survival times for the patients from an exponential distribution, and the group labels (mutated or not) were assigned to patients independently of their survival time; when censoring in
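The no-censoring case described above can be sketched as follows. The function name, the rate parameter of the exponential distribution, the fixed size `n1` of the mutated group, and the seed are assumptions made for the example; the censoring scheme is not reproduced here.

```python
import random

def simulate_null_cohort(n, n1, rate=1.0, seed=0):
    """Simulate a cohort under the null hypothesis, without censoring.

    Survival times are exponential with the given rate (an assumed value),
    and n1 patients are labeled as mutated uniformly at random, so the
    labels are independent of the survival times.
    """
    rng = random.Random(seed)
    times = [rng.expovariate(rate) for _ in range(n)]
    mutated = set(rng.sample(range(n), n1))
    labels = [1 if i in mutated else 0 for i in range(n)]
    events = [1] * n              # every event is observed (no censoring)
    return times, events, labels

# Example: 100 patients, 5% of them mutated.
times, events, labels = simulate_null_cohort(100, 5)
```

Repeating this simulation many times and computing a p-value for each instance gives an empirical distribution that can be compared against the uniform distribution expected under the null hypothesis.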

We used synthetic data to compare the empirical

We analyzed somatic mutation and clinical data, including survival information, from the public TCGA data portal (

The results published here are in whole or part based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at


(a) Distribution of ^{5} instances with _{1} = 100 samples in the small population, different number ^{5} data points with _{1} = 5%^{5} data points with _{1}, and no censoring. (d) Distribution of ^{5} instances with _{1} = 5%^{5} instances with _{1} = 5%^{5} instances with _{1} = 5%

(PDF)

The R coefficients comparing the −_{10} exact _{10} empirical _{1} = 5%_{1} = 5%

(PDF)

Comparison of the

(PDF)

Comparison of the ^{5} instances with _{1} = 5 samples in the small population and same survival distribution for all patients (no censoring). (b) Comparison of Cox likelihood ratio _{1} = 5%_{1}) = 5%_{10} exact _{10} empirical

(PDF)

(Top) The log-rank test compares the Kaplan-Meier curves of the two groups. (Middle) Survival data is represented by sorting patients by increasing survival. _{i} = 0 if event at time _{i} is censored, _{i} = 1 otherwise). (Bottom) The conditional test is defined by a series of independent contingency tables with marginals corresponding to the number of patients at risk in each group and the number of events in each group, conditioning on the patients at risk at each non-censored time; _{i} denotes the number of events at time _{i}, _{i,j} denotes the number of patients at risk in group _{1} patients with label 1 in the vector

(PDF)

Starting from the approximation

(PDF)

(a) Runtime of FPTAS and of the exhaustive enumeration for different values of _{1} = 10,_{1}. (c) Runtime of the FPTAS for different values of _{1} = 10, no censoring. (d) Comparison of the FPTAS _{1} = 4, no censoring, and

(PDF)
