GINOM: A statistical framework for assessing interval overlap of multiple genomic features

  • Darshan Bryner,

    Affiliation Naval Surface Warfare Center Panama City Division, Panama City, Florida, United States of America

  • Stephen Criscione,

    Affiliation Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island, United States of America

  • Andrew Leith,

    Affiliations Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island, United States of America, Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America

  • Quyen Huynh,

    Affiliation Institute for Brain and Neural Systems, Brown University, Providence, Rhode Island, United States of America

  • Fred Huffer,

    Affiliation Department of Statistics, Florida State University, Tallahassee, Florida, United States of America

  • Nicola Neretti

    nicola_neretti@brown.edu

    Affiliations Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island, United States of America, Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America

This is an uncorrected proof.

Abstract

A common problem in genomics is to test for associations between two or more genomic features, typically represented as intervals interspersed across the genome. Existing methodologies can test for significant pairwise associations between two genomic intervals; however, they cannot test for associations involving multiple sets of intervals. This limits our ability to uncover more complex, yet biologically important associations between multiple sets of genomic features. We introduce GINOM (Genomic INterval Overlap Model), a new method that enables testing of significant associations between multiple genomic features. We demonstrate GINOM’s ability to identify higher-order associations with both simulated and real data. In particular, we used GINOM to explore L1 retrotransposable element insertion bias in lung cancer and found a significant pairwise association between L1 insertions and heterochromatic marks. Unlike other methods, GINOM also detected an association between L1 insertions and gene bodies marked by a facultative heterochromatic mark, which could explain the observed bias for L1 insertions towards cancer-associated genes.

Author summary

The age of genomics has made a large number of datasets available to the wider scientific community. Many of these datasets come in the form of genomic tracks, represented as features associated with collections of genomic intervals along chromosomes. A common task in genomics is to identify putative associations between these features that can lead to new insights about genome organization and function. For example, activity of certain classes of genes might be influenced by the presence of specific combinations of chromatin modifications and binding of transcription factors at their promoters or enhancers. Here, we present a novel methodology, named GINOM (Genomic INterval Overlap Model), to test for the significance of these associations. We apply it to the problem of detecting biases of the locations along chromosomes where mobile genetic elements tend to insert themselves, and identify a potential preference for L1 elements towards gene bodies marked by a facultative heterochromatic mark.

Introduction

A fundamental question in genome biology is whether two or more genomic features, for example gene promoters/bodies and histone modifications, are associated. These associations can shed light on fundamental regulatory mechanisms, such as the epigenetic regulation of gene expression. Genomic features can be represented as intervals, and thus the question becomes whether one set of genomic intervals, the query set, overlaps another set or sets of intervals, the reference set(s), significantly more or less than what would be expected by chance. The results of this test can guide the exploration of the underlying nature of the association; for example, if a histone modification is found to be associated with promoters of transcribed genes, one might hypothesize that the modification is required for active transcription. Thus, there is a widespread need for an accurate and computationally-efficient statistical test of genomic interval overlap.

Several statistical strategies to examine genomic interval overlap have been developed [1–5]. One study tested for associations of transposable element insertions in the genome with various epigenomic features [1]. Using transposon mutagenesis, [1] created a database of transposable element insertions across the mouse genome and subsequently tested for insertion site bias with respect to a randomized control set. Another method, MULTOVL, performs a Monte Carlo shuffling of the intervals uniformly throughout the genome to obtain an empirical null distribution of overlap lengths in order to test for significance [2]. The Binary Interval Search (BITS) algorithm also uses a Monte Carlo simulation by uniformly shuffling the query intervals many times and obtaining an empirical null distribution of the intersection count [3]. The Genome Association Test (GAT) presented in [4] is another statistical test of overlap length based on Monte Carlo simulations as in [2], but the randomization procedure used to form this empirical null distribution can be designed to exclude regions such as gaps and repetitive sequences. Another method called GenometriCorr is an R package that includes four different statistical tests for spatial relationships of genomic intervals [5]. Two of the tests (the absolute distance test and the Jaccard test) rely on Monte Carlo randomization to formulate and test against an empirical null, and the other two tests (the relative distance test and the projection test) use analytical null distributions. The main limitation of these statistical strategies is their pairwise treatment of genomic intervals, where a query interval set is compared to one or multiple reference sets individually. These methods cannot reveal any higher-order associations, i.e. any association between a query interval set and multiple reference sets simultaneously.

We present a robust statistical framework called GINOM (Genomic INterval Overlap Model) that adds more flexibility to the study of associations between genomic intervals. We impose a parameterized probability model, i.e. a density function, on query interval location with respect to any number of reference sets. Given query interval data, the model parameters are estimated through likelihood-based methods. Each parameter value is interpreted as the amount of departure from the null distribution on the genomic loci and is indicative of the query interval overlap with a certain reference set or group of reference sets. To specifically address the inclusion of higher order associations, we design the model in GINOM to consider any possible combination of reference sets rather than restricting to only pairwise comparisons. Since it is possible that some combinations will have no effect on query interval location, we provide an automatic selection procedure to keep only the model terms that best describe the data.

Methods

Here we define a statistical model, i.e. a probability density, of the location of a query interval with respect to multiple sets of reference intervals. We design the model to reflect the tendencies of a query interval to overlap a reference set or a combination of reference sets more or less than would be expected according to a predefined null distribution. In this section we explicitly formulate the model equation and explain how to estimate and interpret the parameter values given a set of query intervals. Furthermore, we discuss hypothesis testing and restricting the number of model parameters through model selection.

Notation and problem setup

Our goal is to define the probability density function of a query interval starting point location conditional on a given query interval length with respect to known sets of reference intervals. This density will be defined over all possible genomic loci and formulated as a mixture of deviations from a given null distribution. These deviations from the null occur only on the genomic index sets that would indicate an overlap of a query interval of the given length with one or more of the reference sets.

Define the discrete set of nucleotides that comprise an organism’s genome as $\mathcal{G} = \{1, 2, \ldots, L\}$, where L is the length of the genome. A query interval is defined from two discrete random variables—the starting point location X and the length (or cardinality) Y—and is given as q(X, Y) = {X, X + 1, …, X + Y − 1}. From here onward, we use the notation of capital X and Y when referring to the random variables and lower case x and y when referring to observations of the respective random variables. Since the distribution of query interval starting points depends on the query interval length, we are concerned with modeling the conditional distribution of X|Y. We denote f(x|y) as the probability density function of X|Y.

Let R1, R2, …, RN be N known reference sets, where each Ri is defined as the union of a set of intervals on the genome $\mathcal{G}$. In practice, each Ri is set to represent a particular feature of the genome that may influence the distribution of X|Y; for example, Ri could consist of all the loci that lie within genic regions. We say that a query interval q(x, y) overlaps Ri if the intersection of q(x, y) and Ri is non-empty. Let ri(x|y) be the overlap indicator function for q(x, y) and Ri, which is given by

$$r_i(x|y) = \begin{cases} 1, & q(x, y) \cap R_i \neq \emptyset, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

In other words, ri(x|y) equals 1 over all x such that a query interval q(x, y) would overlap Ri, and it equals 0 over all x where q(x, y) would not overlap Ri.
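
As an illustration of the overlap indicator, the following Python sketch evaluates ri(x|y) on a toy reference set. The representation of a reference set as a list of inclusive (start, end) pairs, the function name `overlap_indicator`, and the toy data are assumptions made for this example and are not taken from the GINOM implementation.

```python
def overlap_indicator(x, y, reference):
    """Return 1 if q(x, y) = {x, ..., x + y - 1} intersects the reference set, else 0.

    `reference` is a list of inclusive (start, end) intervals (hypothetical format).
    """
    q_start, q_end = x, x + y - 1
    for r_start, r_end in reference:
        # Two inclusive intervals intersect when neither lies entirely before the other.
        if q_start <= r_end and r_start <= q_end:
            return 1
    return 0


# Toy reference set on a genome of length 20 (hypothetical data).
R1 = [(5, 8), (15, 16)]
# Indicator profile over all start points x for query length y = 3.
profile = [overlap_indicator(x, 3, R1) for x in range(1, 19)]
```

Note that the indicator depends on y: a longer query interval reaches a reference interval from start points further to its left.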

Note that it is possible for a query interval to overlap more than one reference set at a time; therefore, we must define the overlap indicator function for multiple reference sets. Consider a non-empty subset of reference set indices given by π ⊆ {1, 2, …, N}. For example, if N = 3, all possible values of π are {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, and {1, 2, 3}. The overlap indicator function indexed by the set π is defined as rπ(x|y) = ∏i∈π ri(x|y). It is the indicator function of the simultaneous overlap of q(x, y) with all the reference sets given in the index set π. For example, if π = {1, 2, 3}, then r{1,2,3}(x|y) equals 1 over all values of x where q(x, y) would overlap reference sets R1, R2, and R3 simultaneously and equals 0 otherwise.
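
The index sets π and the product form of rπ can be sketched in a few lines of Python. The helper names `nonempty_subsets` and `joint_indicator` and the example values of ri are hypothetical and serve only to illustrate the definition.

```python
from itertools import combinations


def nonempty_subsets(n):
    """All non-empty subsets of {1, ..., n}, ordered by size."""
    items = range(1, n + 1)
    return [frozenset(c) for k in range(1, n + 1) for c in combinations(items, k)]


def joint_indicator(pi, single):
    """r_pi = product of the single-set indicators r_i over i in pi."""
    value = 1
    for i in pi:
        value *= single[i]
    return value


subsets = nonempty_subsets(3)
# Suppose q(x, y) overlaps R1 and R3 but not R2 (hypothetical values of r_i).
single = {1: 1, 2: 0, 3: 1}
# Index sets whose joint indicator equals 1 at this x.
active = [sorted(pi) for pi in subsets if joint_indicator(pi, single) == 1]
```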

Suppose further that, conditional on the length Y = y, we have defined a null distribution f0(x|y) of query interval starting point locations. For example, the null distribution could be defined as the discrete uniform distribution over $\mathcal{M}_y$, the set of all x where q(x, y) would lie completely within the mappable regions of the genome $\mathcal{G}$. That is,

$$f_0(x|y) = \begin{cases} 1/|\mathcal{M}_y|, & x \in \mathcal{M}_y, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where | ⋅ | denotes set cardinality. This null distribution suggests that a query interval has an equal chance of lying anywhere inside the mappable regions of the genome and no chance of lying outside. The restriction to the mappable regions is due to the impossibility of observing a query interval in any non-mappable region. Although we develop the theory behind GINOM to incorporate any arbitrary null distribution, in practice we use the null provided above in Eq (2).
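
A minimal sketch of this uniform null on a toy genome follows. The function names, the encoding of non-mappable regions as inclusive (start, end) pairs, and the toy coordinates are assumptions for illustration only.

```python
def mappable_starts(genome_length, y, unmappable):
    """Start positions x for which q(x, y) lies entirely in mappable loci."""
    blocked = set()
    for start, end in unmappable:
        blocked.update(range(start, end + 1))
    return [
        x for x in range(1, genome_length - y + 2)
        if all(pos not in blocked for pos in range(x, x + y))
    ]


def null_density(x, starts):
    """Discrete uniform null: 1 / (number of valid starts) on them, 0 elsewhere."""
    return 1.0 / len(starts) if x in starts else 0.0


# Toy genome of 12 loci with loci 6..7 non-mappable and query length 3.
starts = mappable_starts(genome_length=12, y=3, unmappable=[(6, 7)])
```

Because the set of valid start points shrinks as y grows, the null has to be recomputed for each distinct query length.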

Model formulation and interpretation

In our model we take the density of a query interval to be

$$f(x|y) = \frac{1}{c(\theta, y)} \, f_0(x|y) \exp\Big( \sum_{\pi} \theta_\pi r_\pi(x|y) \Big), \tag{3}$$

where $c(\theta, y) = \sum_{x} f_0(x|y) \exp\big( \sum_{\pi} \theta_\pi r_\pi(x|y) \big)$ is the normalizing constant and the summation over π is over the non-empty subsets of {1, 2, …, N}, i.e. all possible combinations of reference sets. Here, θ is the vector of model parameters. We choose a model density with log-linear form for various reasons. First, since the equation is formulated as an additive mixture of effects (in the logarithmic sense), model interpretation is intuitive and achieved through examining the value of θ. The components of θ describe different types of departures from the null distribution. As θπ increases (decreases), fixing all other components of θ, the probability that a random query interval intersects all of the reference sets in π increases (decreases). As θπ approaches infinity (negative infinity), fixing all other components of θ, the probability approaches 1 (0). A further advantage to the log-linear form is that the model equation belongs to an exponential family of distributions. This distribution family satisfies many regularity conditions that allow for a straightforward application of various techniques for statistical inference and hypothesis testing. Finally, in this formulation each component of θ can take any value in $\mathbb{R}$, and thus an optimization over θ is unconstrained and computationally more efficient than a constrained optimization.
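
The monotone effect of a single parameter can be checked numerically. The sketch below assumes a one-term model on a toy genome with y = 1 and a uniform null, with density proportional to the null times exp(θ r(x)); the function name `overlap_probability` and the data are hypothetical.

```python
import math


def overlap_probability(theta_1, r1, null):
    """P(q overlaps R1) under a one-term model f(x) ∝ f0(x) * exp(theta_1 * r1(x))."""
    weights = [f0 * math.exp(theta_1 * r) for f0, r in zip(null, r1)]
    c = sum(weights)  # normalizing constant
    return sum(w for w, r in zip(weights, r1) if r == 1) / c


# Toy genome of 10 loci, y = 1, R1 covering loci 3..6 (hypothetical data).
r1 = [int(3 <= x <= 6) for x in range(1, 11)]
null = [0.1] * 10
# Overlap probability for increasing values of theta_1.
probs = [overlap_probability(t, r1, null) for t in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```

At θ = 0 the overlap probability equals the fraction of the genome covered by R1 (here 0.4), and it moves towards 0 or 1 as θ becomes strongly negative or positive.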

One can use θ to define a query interval enrichment profile across the genome in the following manner. The ratio $f(x|y)/f(\tilde{x}|y)$ gives the relative likelihood of a random query interval of length y having its left endpoint at x versus another location $\tilde{x}$. Comparing the model in Eq (3) with the null distribution f0, we say that the endpoint x has been enriched (depleted) relative to $\tilde{x}$ under our model if the ratio $f(x|y)/f(\tilde{x}|y)$ is greater than (less than) $f_0(x|y)/f_0(\tilde{x}|y)$. The function

$$h(x|y) = \sum_{\pi} \theta_\pi r_\pi(x|y), \tag{4}$$

which is a summation of certain components of θ, may be regarded as giving an enrichment profile. If h(x|y) is greater than (less than) $h(\tilde{x}|y)$, then it follows that $f(x|y)/f(\tilde{x}|y)$ is greater than (less than) $f_0(x|y)/f_0(\tilde{x}|y)$, and thus x is enriched (depleted) relative to $\tilde{x}$. Moreover, the ratio between the two ratios above, $[f(x|y)/f(\tilde{x}|y)] \, / \, [f_0(x|y)/f_0(\tilde{x}|y)]$, is equal to $\exp\big(h(x|y) - h(\tilde{x}|y)\big)$. This quantity provides a measure of the degree of enrichment (depletion) at x relative to $\tilde{x}$ under our model. Using this expression, it may be seen that increasing (decreasing) the value of θπ increases (decreases) the degree of enrichment for all locations x where q(x, y) intersects all of the reference sets in π relative to all other locations $\tilde{x}$.

Note that the expressions presented in the above paragraph simplify when using the standard null distribution provided in Eq (2). Assuming that locations x and $\tilde{x}$ are both mappable, then $f_0(x|y) = f_0(\tilde{x}|y)$, so that $f(x|y)/f(\tilde{x}|y) = \exp\big(h(x|y) - h(\tilde{x}|y)\big)$. We say that x is enriched (depleted) with respect to $\tilde{x}$ if $f(x|y)/f(\tilde{x}|y)$ is greater than (less than) 1, i.e. if h(x|y) is greater than (less than) $h(\tilde{x}|y)$. Typically, we are interested in studying the enrichment of x with respect to a mappable $\tilde{x}$ such that $q(\tilde{x}, y)$ would not overlap any of the reference sets. In this case of $\tilde{x}$ in the so-called “background” portion of the genome, $h(\tilde{x}|y) = 0$, and, therefore, x is enriched (depleted) simply if h(x|y) is positive (negative). From this point onward, we exclusively use this simplified condition when determining the enrichment of x.

To help with the above model interpretation, we provide a simple example with N = 2 and f0 as in Eq (2). Since N = 2, the model terms are given by the set indices π = {1}, {2}, {1, 2}. Suppose that θ = (θ{1}, θ{2}, θ{1,2})′ = (0.5, 0.7, −0.2)′, where the prime denotes vector transpose. Now, for an x such that q(x, y) overlaps R1 but not R2, the enrichment is given by h(x|y) = θ{1} = 0.5, which is positive. Thus, the set of all such x’s are enriched, and the probability of q(x, y) is $e^{0.5} \approx 1.65$ times that of $q(\tilde{x}, y)$, where $\tilde{x}$ is in the background. Similarly, the set of all x’s such that q(x, y) overlaps R2 but not R1 are enriched, and the probability of q(x, y) is $e^{0.7} \approx 2.01$ times that of $q(\tilde{x}, y)$. For an x such that q(x, y) overlaps both R1 and R2, the enrichment is given by h(x|y) = θ{1} + θ{2} + θ{1,2} = 1.0. In this case, the probability of q(x, y) is $e^{1.0} \approx 2.72$ times that of $q(\tilde{x}, y)$. Notice that even though the individual model parameter θ{1,2} is less than zero, there is still enrichment in these x’s rather than depletion. The model parameter θ{1,2} represents the interaction effect of R1 and R2—the effect of R1 and R2 beyond that of simply adding the two individual effects together. In this case, the overall effect of R1 and R2 is slightly less than their additive effect due to the negative interaction term. For more details on model interpretation, see the Model Interpretation section of S1 Text.
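
The arithmetic of this example can be reproduced with a short Python sketch. The dictionary layout and the names `enrichment`, `patterns`, and `fold` are illustrative choices; only the values of θ come from the example above.

```python
import math

# Parameter values from the N = 2 example above.
theta = {frozenset({1}): 0.5, frozenset({2}): 0.7, frozenset({1, 2}): -0.2}


def enrichment(overlapped):
    """h(x|y): sum of theta_pi over every pi whose reference sets are all overlapped."""
    return sum(value for pi, value in theta.items() if pi <= overlapped)


# Which reference sets q(x, y) overlaps in each of the three cases.
patterns = {"R1 only": {1}, "R2 only": {2}, "R1 and R2": {1, 2}}
# Fold change in probability relative to a background location.
fold = {name: math.exp(enrichment(frozenset(s))) for name, s in patterns.items()}
```

Rounding the entries of `fold` to two decimals gives 1.65, 2.01, and 2.72, matching the values quoted in the example.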

In general, since there are $2^N - 1$ possible non-empty subsets of {1, 2, …, N}, there are that many possible model parameters in Eq (3). In practical implementations, $2^N - 1$ can be quite a large number, and the inclusion of all possible parameters can yield an unnecessarily complex model. Therefore, to avoid overfitting, we typically restrict to a smaller submodel by setting the less important parameters equal to zero and thus effectively dropping those terms from the model equation. For example, for ease of interpretation, we could decide that any term with third-order interactions or higher—that is, if π contains three or more reference set indices—is too complex to include in the model. Thus, we would automatically exclude those terms from consideration in the model equation, and the summation would be over all π such that |π| ≤ 2. From here on, we denote d as the number of model parameters included in the model. If all possible parameters are included in the model, i.e. if $d = 2^N - 1$, then we say that f is the full model.
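
To give a sense of how quickly the number of candidate terms grows, and how much an order restriction helps, the following sketch counts the index sets for a given N. The function name `n_terms` is hypothetical.

```python
from math import comb


def n_terms(n_sets, max_order=None):
    """Number of model terms: subsets pi of {1..N} with 1 <= |pi| <= max_order."""
    top = n_sets if max_order is None else min(max_order, n_sets)
    return sum(comb(n_sets, k) for k in range(1, top + 1))


full = n_terms(9)          # 2**9 - 1 = 511 terms in the full model
pairwise = n_terms(9, 2)   # 9 + 36 = 45 terms when |pi| <= 2
```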

Parameter estimation, hypothesis testing, and model selection

Suppose we are given data in the form of n query intervals {q(xi, yi), i = 1, …, n}, and let us assume that the xi’s are conditionally independent given the yi’s. In order to fit the model to the data, we seek the maximum likelihood estimate (MLE) of θ given the data. To compute the MLE $\hat{\theta}$, we maximize the likelihood function, which is equal to the joint density function

$$\prod_{i=1}^{n} f(x_i|y_i) = \Big[ \prod_{i=1}^{n} \frac{f_0(x_i|y_i)}{c(\theta, y_i)} \Big] \exp(\theta' T), \tag{5}$$

where c(θ, y) is the normalizing constant of Eq (3), the prime denotes vector transpose, and T is a length d vector with the π-th component given by $t_\pi = \sum_{i=1}^{n} r_\pi(x_i|y_i)$. The value tπ is the number of query intervals in the dataset that overlap all of the reference sets given in the index set π simultaneously. After taking the logarithm of the above function and dropping terms that are constant in θ, we are left with the following objective function:

$$\ell(\theta) = \theta' T - \sum_{i=1}^{n} \log c(\theta, y_i). \tag{6}$$

The MLE is thus computed as

$$\hat{\theta} = \arg\max_{\theta \in \mathbb{R}^d} \ell(\theta). \tag{7}$$

See S1 Text for an efficient method to compute the optimization in Eq (7) as well as approximate confidence intervals for each component of θ.
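
As a rough illustration of the estimation step, the sketch below maximizes the log-likelihood by plain gradient ascent on a toy problem with a fixed query length and a uniform null. This is not the efficient method described in S1 Text; the cell-based data layout, the function name `fit_mle`, the step size, and the counts are all assumptions chosen so that the answer can be checked by hand.

```python
import math


def fit_mle(cells, terms, n_iter=2000, step=1.0):
    """Gradient ascent on the per-query log-likelihood (fixed y, uniform null).

    cells: list of (n_loci, overlapped_sets, n_queries), one entry per genome region.
    terms: list of frozensets pi included in the model.
    """
    n = sum(q for _, _, q in cells)
    # t_pi: number of queries overlapping every reference set in pi.
    T = [sum(q for _, s, q in cells if pi <= s) for pi in terms]
    theta = [0.0] * len(terms)
    for _ in range(n_iter):
        # Unnormalized probability mass of each cell under the current theta.
        mass = [m * math.exp(sum(t for t, pi in zip(theta, terms) if pi <= s))
                for m, s, _ in cells]
        c = sum(mass)
        # Gradient component for pi: observed overlap rate minus its expectation.
        grad = [T[k] / n
                - sum(w for w, (_, s, _) in zip(mass, cells) if terms[k] <= s) / c
                for k in range(len(terms))]
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta


# Toy genome: four regions of 5 loci each, with hypothetical query counts.
cells = [
    (5, frozenset({1}), 30),      # overlaps R1 only
    (5, frozenset({1, 2}), 20),   # overlaps R1 and R2
    (5, frozenset({2}), 40),      # overlaps R2 only
    (5, frozenset(), 10),         # background
]
terms = [frozenset({1}), frozenset({2}), frozenset({1, 2})]
theta_hat = fit_mle(cells, terms)
```

For this full model the estimate can be verified directly: the fitted region probabilities must match the observed proportions, which gives θ{1} = log 3, θ{2} = log 4, and θ{1,2} = −log 6.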

Now that we have a means for computing the MLE, and since the density f satisfies all the necessary regularity conditions, we can use the generalized likelihood ratio test (GLRT) for hypothesis testing (see Ch. 8 of [6]). In particular, we use the GLRT to test whether a specific component of θ differs from zero or not, i.e. whether a certain reference set or combination of reference sets affects query interval location. In order to perform this statistical test for the π-th component, one must solve Eq (7) twice—once allowing θπ to be unconstrained in the optimization and once with the constraint that θπ = 0. If the maximized value of the objective function in Eq (6) changes enough with the addition of this constraint, i.e. with the removal of the π-th model term, then we claim that θπ is not equal to zero. For more detailed information on the GLRT, including the computation of a p-value for the test, see S1 Text.
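
The test can be sketched for the simplest case of a model with a single term, a fixed query length, and a uniform null, where both maximized log-likelihoods have closed forms. The p-value below assumes the usual large-sample chi-square approximation with one degree of freedom; the exact procedure used by GINOM is given in S1 Text and may differ. The function name and the numbers are hypothetical.

```python
import math


def glrt_single_term(n, t, p0):
    """GLRT of theta = 0 for a one-term model with a uniform null and fixed y.

    n: number of query intervals; t: how many overlap the reference set;
    p0: probability of overlap under the null (theta = 0).
    """
    if not 0 < t < n:
        raise ValueError("the MLE does not exist when t is 0 or n")
    p_hat = t / n  # fitted overlap probability at the MLE
    # Twice the gap between the unconstrained and constrained log-likelihoods.
    stat = 2.0 * (t * math.log(p_hat / p0)
                  + (n - t) * math.log((1 - p_hat) / (1 - p0)))
    stat = max(stat, 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical data: 55 of 100 queries overlap a set covering 40% of start points.
stat, p_value = glrt_single_term(n=100, t=55, p0=0.4)
```

With more than one term in the model, the two log-likelihoods no longer have closed forms and each requires a numerical solution of Eq (7).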

In order to select the model parameter configuration that best describes the data, we implement the widely used bidirectional stepwise model selection algorithm [7]. This algorithm systematically adds and removes model terms one at a time in an effort to find a configuration that minimizes the Bayesian Information Criterion (BIC) [8], which is given by $\mathrm{BIC} = -2\,\ell(\hat{\theta}) + d \log n$, where $\ell(\hat{\theta})$ is the maximized log-likelihood, d is the number of model parameters, and n is the number of query intervals. The BIC is a log-likelihood function that is penalized by the number of model parameters; therefore, the model configuration with optimal BIC is one that offers a high likelihood value with a small number of model parameters. The result of the stepwise algorithm is a model that contains only the most influential parameters, allowing for an easy yet biologically meaningful model interpretation.
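
The search itself can be sketched independently of the model fitting. In the Python example below, `stepwise_select` toggles one term at a time and keeps the move that lowers the score the most; `bic` states the standard BIC formula. The scoring function `toy_score` is a hypothetical stand-in for the BIC of a fitted model, used only so that the example runs without an optimizer, and the candidate terms are illustrative.

```python
import math


def bic(loglik, d, n):
    """BIC = -2 * maximized log-likelihood + d * log(n)."""
    return -2.0 * loglik + d * math.log(n)


def stepwise_select(candidates, score, start=frozenset()):
    """Bidirectional stepwise search: apply the single add/remove move that
    lowers `score` the most, and stop when no move improves it."""
    current = frozenset(start)
    best = score(current)
    while True:
        # Every model reachable by adding or removing exactly one term.
        moves = [current ^ {term} for term in candidates]
        trial = min(moves, key=score)
        if score(trial) >= best:
            return current, best
        current, best = trial, score(trial)


# Hypothetical "true" configuration and candidate terms.
target = frozenset({frozenset({6}), frozenset({7}), frozenset({6, 7})})
candidates = [frozenset({6}), frozenset({7}), frozenset({8}),
              frozenset({6, 7}), frozenset({6, 8})]


def toy_score(model):
    # Stand-in for BIC: a missing true term costs 10, an extra term costs 2.
    return 10 * len(target - model) + 2 * len(model - target)


selected, value = stepwise_select(candidates, toy_score)
```

In a real run, `score` would fit each candidate model by solving Eq (7) and return its BIC, which is where nearly all of the computation time goes.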

Results/Discussion

In what follows, we first apply GINOM to in silico data to demonstrate its performance and compare its predictive ability to that of currently available software. We then apply it to a real biological dataset and again compare its performance to the same software.

Simulated query intervals

We analyzed the performance of GINOM on query interval sets simulated directly from the model equation Eq (3) using the nine reference sets listed in Table 1. This is the same set of features used in the LINE-1 insertion bias study we discuss later and contains a combination of genomic and epigenomic features associated with genome accessibility and transcriptional regulation. For simplicity in our simulations, we fix the query interval length to be Y = 1 unless stated otherwise and assume all loci in the genome $\mathcal{G}$ are mappable when forming the null distribution in Eq (2). A simulation is done by first selecting N reference sets to consider, then optionally restricting the domain to a subset of the entire genome, followed by setting the true values for each component of θ, and then finally generating a random sample of size n via an acceptance/rejection algorithm. The result of the simulation is a set of n query intervals that are independently and identically distributed according to the specified model.
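
The sampling step can be sketched as follows, assuming Y = 1, a uniform null, and a density proportional to exp(h(x)). The function `sample_queries`, the toy genome, and the parameter value are hypothetical; the source does not specify the details of the acceptance/rejection scheme beyond naming it.

```python
import math
import random


def sample_queries(n, genome_length, h, rng):
    """Draw n start points (y = 1) from f(x) ∝ exp(h(x)) under a uniform null.

    Proposals come from the null; each is accepted with probability
    exp(h(x) - h_max), which lies in (0, 1].
    """
    # Scanning every locus for the maximum is only practical on a toy genome.
    h_max = max(h(x) for x in range(1, genome_length + 1))
    samples = []
    while len(samples) < n:
        x = rng.randint(1, genome_length)          # proposal from the uniform null
        if rng.random() < math.exp(h(x) - h_max):  # accept/reject step
            samples.append(x)
    return samples


# Toy setup (hypothetical): R covers loci 1..50 of a 100-locus genome, theta = 1.0.
def h(x):
    return 1.0 if x <= 50 else 0.0


rng = random.Random(0)
draws = sample_queries(5000, 100, h, rng)
frac_in_R = sum(x <= 50 for x in draws) / len(draws)
```

In this setup the fraction of draws inside R should be close to e / (1 + e) ≈ 0.73, since both regions have equal size and loci in R are weighted by e.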

Computation time.

First, we performed two computation time experiments running GINOM in MATLAB on a laptop with a 2.6 GHz processor. The computation times reported below for GINOM include the time necessary for data preprocessing as well as an unrestricted stepwise model selection. Fig 1 (left) shows how computation time scales with an increase in the length of the genome L, and Fig 1 (right) shows how computation time scales with an increase in the number of reference sets N considered in model selection. We also scaled the query sample size n, fixing everything else; however, we found that n only slightly affects the preprocessing stage and does not factor at all into model selection. Since the parts of GINOM that are influenced by n represent only a small fraction of the overall computation time, we omit the results of any computation time experiment versus n.

Fig 1. GINOM computation time experiments with simulated data.

Left: Computation time versus L with fixed values of N = 5 and n = 2540. Right: Computation time versus N on chromosome 19 ($L = 5.9 \times 10^7$) with n = 2540.

https://doi.org/10.1371/journal.pcbi.1005586.g001

For the experiment in Fig 1 (left), we used N = 5 reference sets—specifically, sets 3, 4, 6, 7, and 8. For each value of L, we randomly generated ten query interval datasets, each of size n = 2540, with model terms θ{6} = 1.0, θ{7} = 0.5, θ{6,7} = −1.0, and all other possible model terms set to zero. The parameter vector of this model is written simply as θ = (θ{6}, θ{7}, θ{6,7})′ = (1.0, 0.5, −1.0)′, dropping all of the zero-valued model terms. We used a sample size of n = 2540 because it is the same size as the real query interval dataset used later in the paper. We increased L systematically by starting with a domain equal to all loci in chromosomes X and Y, and then appending an additional chromosome to the domain until the entire genome was considered. For each value of L, we ran GINOM with an unrestricted model selection for each of the ten replicates and averaged the computation time. From the plot in Fig 1 (left), we conclude that computation time scales on the order of $L^2$. In this configuration, GINOM ran on the entire genome in an average of 268 seconds.

For the experiment in Fig 1 (right), we randomly generated one query interval dataset of size n = 2540 using the same value of θ as before with domain equal to all loci in chromosome 19. We ran GINOM with unrestricted stepwise model selection nine times, each time considering an additional reference set until all nine reference sets in Table 1 were considered. From the plot in Fig 1 (right), we see that computation time increases rapidly with N. Therefore, for higher values of N, it becomes more beneficial to restrict the model selection in some way. For example, one could consider only lower-order terms like those with |π| ≤ 3, or one could restrict to a specific, predetermined list of biologically meaningful combinations. With the maximum value N = 9 in the experiment, GINOM ran on chromosome 19 in 5 hours and 21 minutes.

Performance of parameter estimation and model selection with convergence and identifiability study.

Next, we performed an experiment on chromosome 19 to compute an empirical distribution of $\hat{\theta}$ when fitting the true model to simulated data. This time, we randomly generated 1000 query interval datasets of fixed size n = 2540 using the same parameter vector θ = (θ{6}, θ{7}, θ{6,7})′ = (1.0, 0.5, −1.0)′. For each dataset, we fit the true model, i.e. we solved Eq (7) with all terms besides θ{6}, θ{7}, and θ{6,7} constrained to be zero. Table 2 shows the mean and standard deviation of each component of $\hat{\theta}$, and Fig 2 shows the corresponding histogram for each component.

Fig 2. Histograms of estimated model parameter values.

The components of $\hat{\theta}$ were computed from 1000 simulations of sample size n = 2540.

https://doi.org/10.1371/journal.pcbi.1005586.g002

Using the same simulated data, we additionally ran the stepwise model selection for each of the 1000 datasets. In this experiment, we restricted the model selection to consider all possible combinations of reference sets 3, 4, 6, 7, and 8. Table 3 shows the top ten selected model configurations, percentage-wise. The true model configuration is shown in bold and is correctly selected 89.1% of the time. Each of the three correct model terms, θ{6}, θ{7}, and θ{6,7}, was selected 100% of the time, and no incorrect model term was selected more than 1% of the time. In other words, the false negative rate for each correct model term was 0%, and the false positive rate for each incorrect model term was no more than 1% for this experiment. The rate of including at least one incorrect model term was 10.9%. The results of the simulation experiment as seen in Tables 2 and 3 provide not only a validation that the algorithms associated with GINOM are working properly but also a measure of error associated with estimation and model fitting in a scenario that incorporates the given reference sets.

The estimation in Eq (7) fails to converge when at least one of the components of $\hat{\theta}$ goes to positive or negative infinity. Below, we list some of the conditions that must be met for convergence. The list is not exhaustive, as more complicated models can yield more complicated convergence criteria. The simplest condition for convergence is to have no component of T equal to 0, where T was introduced in Eq (5). That is, for each index set π included in the model, we need at least one query interval in the data set to simultaneously overlap all of the reference sets in π. Convergence also requires that, when there exist two model terms indexed by π′ and π such that π′ ⊂ π, then tπ cannot equal tπ′. As the sample size n increases, the probability of satisfying these criteria increases, and convergence is more certain.
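
Both conditions can be checked directly from the overlap counts before any optimization is attempted. The sketch below reads the second condition as applying to nested index sets (one term strictly contained in another), which is an interpretation rather than something the text states in full; the function name and the counts are hypothetical.

```python
def convergence_warnings(t):
    """Flag model terms whose overlap counts violate the two conditions above.

    t: dict mapping each model term pi (a frozenset) to its count t_pi.
    """
    problems = []
    for pi, count in t.items():
        if count == 0:
            problems.append(("zero count", pi))
    for pi in t:
        for rho in t:
            # Assumed reading: a term nested inside another needs a different count.
            if rho < pi and t[rho] == t[pi]:
                problems.append(("equal nested counts", rho, pi))
    return problems


# Hypothetical counts: every query overlapping R7 also overlaps R6.
t = {frozenset({6}): 50, frozenset({7}): 20, frozenset({6, 7}): 20}
problems = convergence_warnings(t)
```

Since the list of conditions is not exhaustive, an empty result here does not guarantee convergence.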

The third column of Table 3 shows the minimum required sample size n to achieve a 99% convergence rate for that particular model configuration. To compute the convergence rate for a given n and a given model configuration, we first simulated 5000 query interval datasets of size n using the true model parameter values. Then, we computed the proportion of times that all components of $\hat{\theta}$ had an absolute value less than 5, i.e. the estimate did not drift off to positive or negative infinity. Fig B in S1 Text shows a plot of convergence percentage versus n for the first two model configurations listed in Table 3.

In addition to convergence, model identifiability is another technical consideration when analyzing the results of GINOM. When a model is not identifiable, the MLE is not unique. Unlike convergence, identifiability is not affected by the query sample size n but rather by the locations of the reference sets in relation to one another. The condition necessary for identifiability is easiest to describe when considering a full model with N reference sets and y = 1. For example, the model that we used in all of our simulation experiments, where θ = (θ{6}, θ{7}, θ{6,7})′, is a full model. This is because, for N = 2, the model terms consist of exactly all of the possible $2^N - 1$ combinations of reference sets 6 and 7. For a general N, notice that the reference sets partition $\mathcal{M}$, the mappable part of the genome, into $2^N$ disjoint sets. If any of these disjoint sets are empty, i.e. they do not exist, then the full model is not identifiable. Continuing with our example, in order for this full model to be identifiable, the sets R6\R7, R7\R6, R6 ∩ R7, and $\mathcal{M}$\(R6 ∪ R7) must all be non-empty. In other words, R6 cannot be contained entirely within R7 (and vice versa), R6 and R7 cannot be disjoint, and the union of R6 and R7 cannot be all of $\mathcal{M}$. Indeed, this is the case for these reference sets. Furthermore, the full model with all N = 9 reference sets listed in Table 1 is identifiable in this sense when including all chromosomes in the domain, as all $2^9$ disjoint sets are non-empty. Identifiability becomes more complicated to describe when considering submodels and cases when y > 1, but it is easy to verify numerically in the data preprocessing stage of GINOM.
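
For the full model with y = 1, the identifiability condition amounts to checking that every overlap pattern is realized by at least one mappable locus. The following sketch performs that check on toy data; the function name `empty_cells`, the set-based representation, and the coordinates are assumptions and do not reflect how GINOM's preprocessing is implemented.

```python
from itertools import product


def empty_cells(reference_sets, mappable):
    """Return the overlap patterns (one 0/1 flag per reference set) that no
    mappable locus realizes; the full model with y = 1 needs this list empty."""
    seen = {tuple(int(x in r) for r in reference_sets) for x in mappable}
    return [p for p in product((0, 1), repeat=len(reference_sets)) if p not in seen]


# Toy genome of 20 mappable loci (hypothetical data).
mappable = range(1, 21)
R6 = set(range(1, 11))
R7 = set(range(6, 16))
R7_nested = set(range(3, 8))  # lies entirely inside R6

ok = empty_cells([R6, R7], mappable)
not_ok = empty_cells([R6, R7_nested], mappable)
```

Here `ok` is empty, while `not_ok` reports the pattern (0, 1): no locus lies in the nested set without also lying in R6, so the corresponding full model would not be identifiable.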

Comparison with current methods.

We compared the analysis from GINOM with that of four other methods—BITS, MULTOVL, GAT, and GenometriCorr—on one simulated dataset. As before, we simulated a dataset of n = 2540 query intervals from the density in Eq (3) with θ = (θ{6}, θ{7}, θ{6,7})′ = (1.0, 0.5, −1.0)′. Since one of the methods, GAT, does not work for query intervals of length y = 1, we set each yi equal to a randomly selected query interval length from the real dataset used later in the paper, which does not contain any query intervals of length 1. We provide the results in S1 Table.

Each of the four other methods performs its analysis in a pairwise fashion, computing query interval overlap statistics individually for each reference set. In order to compare GINOM directly to the other methods, we fit 9 models, each consisting of only one model term π = {j}, for j = 1, 2, …, 9. The differences in outputs between GINOM when run in a pairwise fashion and each of the four methods are due to different statistical tests as well as differences in data preprocessing techniques. On the other hand, these results were similar for all five methods in the sense that some significant effects were detected outside of the true effects of R6 and R7. For example, R3 was reported as a significant effect for all methods except GAT. This phenomenon is actually due to R3’s association with the enriched set R6 rather than an association with the query set itself. A large amount of R3 (45% of all x ∈ R3) is contained within R6, and thus, query interval enrichment due to the effect of R6 was additionally and falsely attributed to R3 in the pairwise analysis.

GINOM has a significant advantage over the other four methods in that it is not constrained to a pairwise analysis; rather, it can analyze the effects of all combinations of reference sets simultaneously. The stepwise model selection in GINOM is more likely to eliminate the false associations that arise from a pairwise analysis, outputting a simplified model configuration that best describes the data according to the BIC. When running GINOM a single time with an unrestricted stepwise model selection that considered all N = 9 reference sets, it recovered the true model with no false associations. The results of GINOM from running both the pairwise analysis as well as the stepwise model selection are shown in (S1 Table).

A study of L1 insertion bias

We provide an application for GINOM using real data. For this example, we examine somatic insertions of the active Long INterspersed Element-1 (LINE-1 or L1) retrotransposon into the genome and test for insertion biases. LINEs are mobile elements present in many eukaryotes; their number can increase through a copy-and-paste mechanism and they have been implicated in several diseases [9]. Recently, somatic retrotransposition was identified as a frequent event in many human cancers and in the adult human brain [10–18]. A major biological question arising from this work is whether LINE-1, the active human retrotransposon, displays a bias in the locations of somatic retrotransposition. Various and sometimes contradictory biases for L1 retrotransposon insertions have been identified. Prior experimental work on the L1 protein machinery identified a retrotransposition bias for accessible DNA [19] and an L1 endonuclease recognition motif (TTAAAA) [20]. High-throughput studies of somatic retrotransposition uncovered a disproportionate bias towards affecting protein-coding genes in the brain and towards genes commonly mutated in cancer [11, 17]. Contradicting this view, high-throughput studies of germ-line retrotransposition found a depletion in protein-coding regions [21, 22]. One other interesting bias is that somatic L1 retrotranspositions in cancer are enriched towards regions that also display DNA hypomethylation [17]. We aim to rigorously test the biases associated with somatic L1 retrotranspositions. To do so, we compiled a dataset consisting of 2540 somatic L1 retrotranspositions available from a single tissue that were identified in lung cancer from two independent studies [13, 18]. The study by Helman et al. identified 363 somatic retrotransposition events in lung adenocarcinoma (39 events) and lung squamous cell carcinoma (324 events) from The Cancer Genome Atlas (TCGA) [13]. Tubio et al. identified 2177 L1 retrotransposition events from various lung cancer sources including primary tumor samples, cell lines, and TCGA samples [18].

Here, we examine the overlaps of the lung cancer somatic L1 retrotransposition query set (2540 events, hg19 coordinates) with respect to nine curated reference sets that represent features of protein-coding genes, euchromatin, and heterochromatin. We reasoned that the ability of novel LINE-1 copies to insert themselves into a given region might be affected by this region’s accessibility; hence, we selected a set of genomic and epigenomic features that are known to be associated with genome accessibility and active transcription. The gene feature reference set includes RefSeq gene bodies (UCSC version update 10/06/15). Euchromatic tracks include broad peaks identified by the ENCODE project for histone marks H3K36me3, H3K79me2, and H3K4me3 in the lung cancer cell line A549 [23]. Our selected heterochromatic marks include additional ENCODE broad peaks for histone marks H3K9me3 and H3K27me3 in A549 cells [23], which have a role in repressing gene expression. We also included tracks for two additional heterochromatic features, lamina-associated domains (LADs) and late DNA replicating regions, which, although defined in fibroblasts, are partially conserved between cell types [24, 25]. Finally, we included a track for regions hypomethylated in cancer cells [26].

We compared our method with four other methods: BITS, MULTOVL, GAT, and GenometriCorr. Unlike our method, which can examine multiple features simultaneously, these other methods test for significant overlap in a pairwise manner. Somatic L1 retrotranspositions in lung cancer displayed a bias for heterochromatic marks including H3K9me3, H3K27me3, lamina-associated domains (LADs), and late-replicating regions according to BITS, MULTOVL, GAT, and GenometriCorr by the Jaccard measure (S2 Table). In addition, consistent with Lee et al. [17], somatic L1 retrotranspositions also displayed a bias towards regions hypomethylated in cancer according to BITS, MULTOVL, GAT, and GenometriCorr by the Jaccard measure (S2 Table). Conversely, and consistent with studies of germ-line retrotransposition, somatic L1 insertions displayed a bias against gene regions and euchromatin marks including H3K4me3, H3K36me3, and H3K79me2 according to MULTOVL, GAT, and GenometriCorr by the Jaccard measure (S2 Table). Therefore, using the pairwise approach, we did not observe a bias of lung somatic L1 retrotranspositions towards gene regions. We hypothesized that our new method GINOM, which can examine combinations of overlaps, might be able to recover gene regions that account for the previously observed bias of somatic retrotranspositions towards genes commonly mutated in cancer [17].
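To make the pairwise statistic concrete, the following is a minimal sketch of a Jaccard measure for two interval sets on a single chromosome: the number of bases covered by both sets divided by the number of bases covered by either. The function names and the half-open (start, end) coordinate convention are assumptions made for illustration; this is not the implementation used by any of the four tools, which additionally compare the observed value against a null distribution.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals given as (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the last merged interval if the new one overlaps or touches it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Total number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def jaccard(query, reference):
    """Bases in the intersection divided by bases in the union."""
    union = covered(list(query) + list(reference))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    intersection = covered(query) + covered(reference) - union
    return intersection / union


# Hypothetical toy intervals: 10 shared bases out of 30 covered bases.
print(jaccard([(0, 10), (20, 30)], [(5, 25)]))
```

A statistic of this kind summarizes one query set against one reference set at a time, which is why it cannot by itself reveal an effect that appears only where two reference sets coincide.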

We examined the lung L1 somatic retrotransposition query set with respect to all nine reference sets using GINOM with the BIC penalty in model selection (see Model Selection subsection within Materials and methods section). Our null model is that of Eq (2) with being the set of loci x such that q(x, y) would lie entirely within the mappable regions of . GINOM reports significant model parameters and their associated MLE and p-value of the GLRT, and we show the results in Table 4.

Table 4. GINOM model selection on lung cancer L1 somatic retrotransposition data.

https://doi.org/10.1371/journal.pcbi.1005586.t004

For ease of interpretation, we label the significant model terms as either a primary effect or a secondary effect. A primary effect is a significant model term π for which there exists no other significant model term π′ with π′ ⊂ π. For example, model term {2, 7} is a primary effect because no model term with its index set contained in {2, 7} (i.e., term {2} or term {7}) is significant (Table 4). A secondary effect is a model term that has at least one primary effect contained within its index set. For example, model term {6, 7, 8} is a secondary effect because primary effects {6} and {8} are contained within {6, 7, 8} (Table 4).
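The labeling rule can be sketched in a few lines. The function name `classify_terms` is hypothetical, and the example uses only the model terms explicitly named in the text rather than the full contents of Table 4, so it should be read as an illustration of the definition and not as a reproduction of the published output.

```python
def classify_terms(significant_terms):
    """Label each significant term as 'primary' or 'secondary'.

    A term is primary if no other significant term is a proper subset of it.
    Otherwise it is secondary: any minimal significant subset is itself
    primary, so the term contains at least one primary effect.
    """
    terms = [frozenset(t) for t in significant_terms]
    labels = {}
    for term in terms:
        # '<' on frozensets tests for a proper subset.
        has_significant_subset = any(other < term for other in terms)
        labels[term] = "secondary" if has_significant_subset else "primary"
    return labels


# Only the terms named in the text; Table 4 may list more.
example_terms = [{5}, {6}, {8}, {9}, {2, 7}, {6, 7, 8}, {2, 4, 6, 9}]
for term, label in classify_terms(example_terms).items():
    print(sorted(term), label)
```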

The primary effects in the model selected by GINOM under the BIC penalty were hypomethylation, late-replicating regions, LADs, H3K9me3, and the combination of gene regions and H3K27me3 (Fig 3). Here, all primary effects showed enrichment with respect to the background, and, interestingly, the strongest of these enrichments occurred in hypomethylation, model term {5}, which was consistent with Lee et al. [17] (Fig 3). With respect to the background, the enrichment (on the log scale) of a location x such that q(x, y) overlaps R5 only (and not any other reference set) is 0.7654. Therefore, the probability of a query interval q(x, y) at such a location is e^0.7654 = 2.15 times that of a location in the background. The model was also able to recover gene regions when paired with the facultative heterochromatin mark H3K27me3, given by model term {2, 7}. The euchromatin marks were also frequently incorporated into secondary effects. For example, model term {2, 4, 6, 9}, which represents the combination of gene regions, H3K79me2, H3K9me3, and late-replicating regions, was assigned a negative parameter value (Table 4). It therefore acts antagonistically to, and with greater magnitude than, the sum of the effects contained within it (terms {6} and {9}), so that the overall combined effect of R2, R4, R6, and R9 is negative. Specifically, for a location x such that q(x, y) simultaneously overlaps R2, R4, R6, and R9 only, the enrichment on the log scale is the sum of the parameters for terms {6}, {9}, and {2, 4, 6, 9}, which equals −0.7932. Therefore, such an x is depleted, with the probability of q(x, y) being e^(−0.7932) = 0.4524 times that of a location in the background, which is consistent with somatic transpositions being biased against euchromatin.
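The conversion from a log-scale effect to a fold change relative to the background is a plain exponentiation, which can be checked as follows. The two input values are those quoted in the text; the function name `fold_change` is an illustrative assumption, and the individual parameters for terms {6}, {9}, and {2, 4, 6, 9} are not reproduced here, only their stated combined value.

```python
import math


def fold_change(log_enrichment):
    """Convert a log-scale effect into a probability ratio versus background."""
    return math.exp(log_enrichment)


# Term {5} (hypomethylation), overlapping R5 only: value quoted in the text.
print(round(fold_change(0.7654), 2))    # 2.15

# Combined effect for overlapping R2, R4, R6, and R9 only: value quoted in the text.
print(round(fold_change(-0.7932), 4))   # 0.4524
```

A positive combined effect therefore corresponds to a ratio above 1 (enrichment) and a negative one to a ratio below 1 (depletion).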

Fig 3. Primary effects of lung cancer somatic L1 retrotransposition.

The results were output from GINOM model selection under the BIC penalty.

https://doi.org/10.1371/journal.pcbi.1005586.g003

Finally, we sought to examine more closely the primary effect resulting from the intersection of gene regions and H3K27me3. It was recently reported that L1 retrotranspositions display a bias towards cancer genes [17]. To test this in our dataset, we utilized the cancer gene lists from Lee et al., which include the COSMIC cancer gene census and the Memorial Sloan-Kettering Cancer Center cancer gene list (S3 Table). In the L1 lung cancer insertion dataset, we observed that 759 gene bodies overlap a somatic L1 retrotransposition. Of these, a significant proportion (37/759) are cancer genes (p-value = 0.027, hypergeometric test, S1 Text). We then looked more specifically at the L1 insertions in gene bodies marked by H3K27me3, which represented a significant primary effect in the GINOM model. Of the 759 gene bodies that overlap a somatic L1 retrotransposition, 309 also intersect an H3K27me3 broad peak, and these are significantly enriched for cancer genes (22/309, p-value = 0.001, hypergeometric test, S1 Text). Hence, we speculate that L1 insertion bias towards cancer genes could be a consequence of L1 bias towards genes located within the facultative heterochromatin marked by H3K27me3.
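For reference, a one-sided hypergeometric enrichment test of this kind can be sketched as below. This is not the R code from S1 Text. The counts 759 and 37 are taken from the text, whereas the total number of genes and the size of the cancer gene list are not stated in this section, so `N_GENES` and `N_CANCER_GENES` are hypothetical placeholders and the printed value should not be expected to match the reported p-values.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    # Exact integer arithmetic, converted to a float only at the end.
    return tail / total


# Hypothetical placeholders (NOT from the paper): background gene universe
# and number of cancer genes within it.
N_GENES = 20000
N_CANCER_GENES = 700

# Counts from the text: 759 gene bodies hit by an L1 insertion, 37 of them cancer genes.
p_value = hypergeom_upper_tail(N_GENES, N_CANCER_GENES, 759, 37)
print(p_value)
```

Using exact binomial coefficients keeps the sketch dependency-free; a statistics library would give the same tail probability more efficiently for large gene universes.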

Together, our results reconcile some of the discrepancies in observations between somatic L1 retrotranspositions and germ-line retrotransposition events. Overall, somatic L1 retrotranspositions display a bias towards features typically associated with constitutive heterochromatin, similar to germ-line transposition events. However, some genes, e.g. cancer genes and genes marked by H3K27me3, also display a bias for somatic L1 transposition events in lung cancer.

Conclusion

We developed a Genomic INterval Overlap Model that allows for the interrogation of significant associations between many genomic features simultaneously. Unlike prior methods, which test for associations in a pairwise manner, GINOM treats query interval location as a random variable with a log-linear distribution whose model terms are formed from any possible combination, or interaction, of multiple reference sets. In this fashion, GINOM can uncover any higher-order interaction among reference sets that has a significant effect on query interval location. Through an implementation of a stepwise model selection routine, GINOM can handle an arbitrarily large number of reference sets simultaneously as input and subsequently output a reduced model that satisfies an optimal trade-off between number of model terms and likelihood value. The end result is a set of significant model terms with associated parameter values that defines a profile of query interval enrichment at a level of detail beyond that of pairwise comparisons.

To highlight this unique capability of GINOM, we fit the model using a query interval set of lung cancer somatic L1 retrotranspositions and a collection of reference sets representing protein-coding genes, euchromatin, and heterochromatin features. The output of GINOM indicates enrichment towards the individual heterochromatic reference sets, as do other current methods. However, GINOM uncovers more nuanced associations than the other methods by identifying a significant enrichment within genes only when paired with the H3K27me3 heterochromatic mark. Conversely, it also indicates depletion within genes when paired with certain euchromatic marks. The association of L1 somatic retrotranspositions and gene bodies, when marked by H3K27me3, is not recovered by other methods because they only consider pairwise associations.

Our results demonstrate GINOM’s ability to test for significance of interval overlap between multiple genomic features. As more data of this type become available, GINOM will provide an effective method to screen for yet-uncharacterized higher-order associations between genomic features.

Supporting information

S1 Table. Analysis of results of competing methods on simulated data.

We collect results from the following four methods: BITS, MULTOVL, GAT, and GenometriCorr. Additionally, we show results from GINOM when run on individual reference sets as well as when running stepwise model selection over all combinations of reference sets.

https://doi.org/10.1371/journal.pcbi.1005586.s001

(XLSX)

S2 Table. Analysis of results of competing methods on lung cancer somatic L1 retrotransposition data.

We collect results from the following four methods: BITS, MULTOVL, GAT, and GenometriCorr.

https://doi.org/10.1371/journal.pcbi.1005586.s002

(XLSX)

S3 Table. Gene lists used in the cancer genes enrichment analysis.

https://doi.org/10.1371/journal.pcbi.1005586.s003

(XLSX)

S1 Text. Additional details regarding hypothesis testing, model interpretation, maximum likelihood estimation, convergence, and cancer genes enrichment analysis (including R code).

https://doi.org/10.1371/journal.pcbi.1005586.s004

(PDF)

Acknowledgments

The authors would like to thank Feifei Ding for data collection and assistance with the data analysis.

Author Contributions

  1. Conceptualization: DB QH NN.
  2. Data curation: DB SC AL.
  3. Formal analysis: DB FH NN.
  4. Funding acquisition: SC NN.
  5. Investigation: DB SC AL NN.
  6. Methodology: DB FH NN.
  7. Software: DB.
  8. Supervision: NN.
  9. Validation: DB SC AL.
  10. Writing – original draft: DB FH NN.

References

  1. de Jong J, Akhtar W, Badhai J, Rust AG, Rad R, Hilkens J, et al. Chromatin Landscapes of Retroviral and Transposon Integration Profiles. PLoS Genetics. 2014 Apr;10(4). pmid:24721906
  2. Aszodi A. MULTOVL: Fast Multiple Overlaps of Genomic Regions. Bioinformatics. 2012 Dec;28(24):3318–3319. pmid:23071271
  3. Layer R, Skadron K, Robins G, Hall IM, Quinlan AR. Binary Interval Search: a scalable algorithm for counting interval intersections. Bioinformatics. 2013;29(1):1–7. pmid:23129298
  4. Heger A, Webber C, Goodson M, Ponting CP, Lunter G. GAT: a simulation framework for testing the association of genomic intervals. Bioinformatics. 2013;29(16):2046–2048.
  5. Favorov A, Mularoni L, Cope LM, Medvedeva Y, Mironov AA, Makeev VJ, et al. Exploring massive, genome scale datasets with the GenometriCorr package. PLoS Computational Biology. 2012 May;8(5). pmid:22693437
  6. Casella G, Berger RL. Statistical Inference, Second Ed. Brooks/Cole Cengage Learning; 2002.
  7. SAS/STAT® 9.2 User’s Guide. Cary, NC: SAS Institute Inc.; 2008.
  8. Schwarz GE. Estimating the dimension of a model. Annals of Statistics. 1978;6(2):461–464.
  9. Beck CR, Garcia-Perez JL, Badge RM, Moran JV. LINE-1 elements in structural variation and disease. Annu Rev Genomics Hum Genet. 2011;12:187–215. pmid:21801021
  10. Evrony GD, et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell. 2012;151(3):483–496. pmid:23101622
  11. Shukla R, et al. Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma. Cell. 2013;153(1):101–111. pmid:23540693
  12. Upton KR, et al. Ubiquitous L1 mosaicism in hippocampal neurons. Cell. 2015;161(2):228–239. pmid:25860606
  13. Helman E, et al. Somatic retrotransposition in human cancer revealed by whole-genome and exome sequencing. Genome Res. 2014;24(7):1053–1063. pmid:24823667
  14. Solyom S, et al. Extensive somatic L1 retrotransposition in colorectal tumors. Genome Res. 2012;22(12):2328–2338. pmid:22968929
  15. Baillie JK, et al. Somatic retrotransposition alters the genetic landscape of the human brain. Nature. 2011;479(7374):534–537. pmid:22037309
  16. Evrony GD, et al. Cell lineage analysis in human brain using endogenous retroelements. Neuron. 2015;85(1):49–55. pmid:25569347
  17. Lee E, et al. Landscape of somatic retrotransposition in human cancers. Science. 2012;337(6097):967–971. pmid:22745252
  18. Tubio JM, et al. Mobile DNA in cancer. Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes. Science. 2014;345(6196):1251343. pmid:25082706
  19. Cost GJ, et al. Target DNA chromatinization modulates nicking by L1 endonuclease. Nucleic Acids Res. 2001;29(2):573–577. pmid:11139628
  20. Jurka J, et al. Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci USA. 1997;94(5):1872–1877. pmid:9050872
  21. Thung DT, et al. Mobster: accurate detection of mobile element insertions in next generation sequencing data. Genome Biol. 2014;15(10):488. pmid:25348035
  22. Stewart C, et al. A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 2011;7(8):e1002236. pmid:21876680
  23. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.
  24. Guelen L, et al. Domain organization of human chromosomes revealed by mapping of nuclear lamina interactions. Nature. 2008;453(7197):948–951. pmid:18463634
  25. Hansen RS, et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc Natl Acad Sci USA. 2010;107(1):139–144. pmid:19966280
  26. Berman BP, et al. Regions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated domains. Nat Genet. 2012;44(1):40–46.